Working with PANGAEA

Authors

Deborah Khider, Dhiren Oswal

Preamble

This tutorial introduces the PangaeaDataset object and its functionalities in PyleoTUPS, focusing on searching by unique ID. A separate tutorial covers the full range of PANGAEA search capabilities.

Note that PANGAEA and NOAA work differently under the hood, though PyleoTUPS aims to provide a consistent user experience across both repositories. Most differences arise from the search query parameters.

Goals

  • Search for datasets using a specific ID

  • Obtain information such as location, publication, variable metadata and the associated data.

Pre-requisites

  • Familiarity with the PANGAEA data repository

  • Familiarity with the pangaeapy package, upon which PyleoTUPS is built

Reading time

15 min

Let’s import our packages!

import pyleotups as pt
import pandas as pd

Accessing a dataset using the PANGAEA Unique ID

This is the most basic search you can perform, but if you know which dataset you are looking for, it is the easiest way to get all the needed information. First, you need to create a PangaeaDataset object that will store the information.

ds = pt.PangaeaDataset()

In this example, we look at the dataset from Todorovic et al. (2024), published on PANGAEA as a supplement to the paper by Todorović et al. (2024). As this example shows, PANGAEA mints a DOI for each of its datasets. When using the data in a publication, it is good practice to cite both the dataset (from PANGAEA) and the original paper (in this case, from Paleoceanography and Paleoclimatology).

Let’s have a look at the PANGAEA DOI: 10.1594/PANGAEA.965772. The last six digits can be considered a unique ID that you can pass to PyleoTUPS, much like you would pass a noaa_id:

res = ds.search_studies(study_ids = '965772')
[2026-04-24 10:22:38,057][INFO] - Registering Study 965772 via direct lookup.
[2026-04-24 10:22:40,221][INFO] - Retrived 1 studies
Note: Unlike the NOAADataset object, the PangaeaDataset object accepts multiple IDs, passed as a list.
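
If you start from a DOI rather than an ID, the numeric part can be pulled out with a few lines of Python. This is a minimal sketch and not part of PyleoTUPS: the helper name `pangaea_id_from_doi` is hypothetical, and it assumes the DOI follows the `10.1594/PANGAEA.<number>` pattern shown above.

```python
import re


def pangaea_id_from_doi(doi: str) -> str:
    """Return the numeric ID at the end of a PANGAEA DOI (hypothetical helper)."""
    # Works for bare DOIs and for https://doi.org/... URLs
    match = re.search(r"10\.1594/PANGAEA\.(\d+)$", doi.strip(), flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"Not a PANGAEA DOI: {doi!r}")
    return match.group(1)


study_id = pangaea_id_from_doi("10.1594/PANGAEA.965772")
print(study_id)
```

The resulting string can then be passed as `study_ids`, as in the call above.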

The summary method provides basic information about the dataset, mirroring what is available from NOAA (e.g., time bounds, notes, keywords). However, there are two important differences to be aware of:

  • PANGAEA does not store year/age information the same way NOAA does, since PANGAEA serves a much broader community than paleoscience. Time information at the metadata level typically refers to when the archive or data was collected, not the paleoclimate age. PyleoTUPS attempts to retrieve age information directly from the data files via the AGE column.

  • PANGAEA has no concept of Site. For paleoclimate data, the closely related concept of Event serves a similar role; PyleoTUPS maps it to the NOAA Site nomenclature for consistency.

The method returns a pandas.DataFrame.

df_summary = ds.get_summary()
display(df_summary)

Obtaining basic metadata: geography, bibliography, and funding

Basic functionalities such as retrieving geographical, bibliographical, and funding information work the same way on the PangaeaDataset object as they do on the NOAADataset object:

df_geo = ds.get_geo()
display(df_geo)
Note: The SiteID field is available through pangaeapy and included here for completeness. While conceptually equivalent to the NOAA SiteID, it cannot be used programmatically to access data for a specific site.
bib, df_pub = ds.get_publications()
display(df_pub)
df_funding = ds.get_funding()
display(df_funding)

Get Variable Information

Accessing the data and variable information is where PANGAEA differs most from NOAA. First, although an Event is closely related to the concept of a Site, PyleoTUPS cannot retrieve data for each site independently. When a dataset is created on PANGAEA, the data publisher adds site information to the data table itself: the site name under the column header Event, along with the geographical coordinates. We will see an example of this when we look at the data. Second, PANGAEA allows only one table per dataset. If a study produces multiple tables, they are stored as a collection; the next section of this notebook walks through an example. Consequently, the methods get_sites and get_tables are not supported for PangaeaDataset.

Let’s have a look at the data:

df_data = ds.get_data(study_id = '965772')[0]
display(df_data.head())

Note that although a dataset will only have one table associated with it, we return a list for consistency with the output of the same functionality for NOAADataset.

As mentioned previously, a PANGAEA data table contains information about the Event and its associated geographical coordinates, as well as a Date, which is not useful for most paleoclimate studies. Let’s have a look at the unique entries in Event:

df_data['Event'].unique()
<StringArray> ['Rotuma_RO2', 'Tonga_TNI2'] Length: 2, dtype: str

As you can see, data for the two Sites are stored in the same data table.
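
If you need one table per site, you can split the combined table on the Event column with pandas. The sketch below uses a small made-up stand-in for `df_data`: only the Event labels come from the output above, while the column names `Age` and `d18O` and all values are illustrative assumptions.

```python
import pandas as pd

# Made-up stand-in for the table returned by get_data; values are illustrative only
df_demo = pd.DataFrame(
    {
        "Event": ["Rotuma_RO2", "Rotuma_RO2", "Tonga_TNI2", "Tonga_TNI2"],
        "Age": [1990.0, 1991.0, 1990.0, 1991.0],
        "d18O": [-4.9, -5.0, -4.5, -4.6],
    }
)

# Build one DataFrame per site, keyed by the Event label
tables_by_event = {
    event: group.drop(columns="Event").reset_index(drop=True)
    for event, group in df_demo.groupby("Event")
}

for event, table in tables_by_event.items():
    print(event, table.shape)
```

The same dictionary comprehension should work on the real table as long as it contains an Event column.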

To get metadata about the variables (i.e., columns), you can use the same get_variables function:

df_var = ds.get_variables(study_ids = '965772')
display(df_var)
Note: The returned DataFrame looks quite different from its NOAA counterpart because the two repositories provide very different metadata fields. PyleoTUPS harmonizes column names (e.g., VariableName) where possible.

Working with Collections

Let’s look at a study where multiple tables were needed to report results, leading to several datasets on the repository. In PANGAEA, these related datasets are grouped into a collection (also called a dataset publication series), each with its own DOI.

For this tutorial, we use the collection by Land and Reichle (2024).

ds2 = pt.PangaeaDataset()
res2 = ds2.search_studies(study_ids = '971943')
[2026-04-24 10:22:59,639][INFO] - Registering Study 971943 via direct lookup.
[2026-04-24 10:23:00,708][INFO] - Retrived 1 studies
[2026-04-24 10:23:00,709][WARNING] - The search contains dataset(s) [971943] marked as collection. Refer to the 'CollectionMembers' column toidentify respective child datasets.

Note the warning issued as the results are downloaded. It informs us that one of the datasets matching the search is a collection.

Let’s have a look at the summary:

df_summary = ds2.get_summary()
display(df_summary)
[2026-04-24 10:23:02,621][WARNING] - The search contains dataset(s) [971943] marked as collection. Refer to the 'CollectionMembers' column toidentify respective child datasets.

Let’s have a closer look at the last column in our summary DataFrame, called CollectionMembers. The list contains all the datasets available as part of this collection:

print(df_summary['CollectionMembers'].iloc[0])
[972091, 972656, 972657, 972658, 972659, 972660, 972661, 972662, 972663, 972664, 972665, 972666, 972667, 972668, 972669, 972670, 972671, 972672, 972673, 972674, 972675, 972676, 972677, 972678, 972679, 972680, 972681, 972682, 972683, 972684, 972685, 972686, 972687, 972688, 972689, 972690, 972691, 972692, 972693, 972694, 972695, 972696, 972697, 972698, 972699, 972700, 972701, 972702]

What information can you get from the Collection? A few things.

  1. Geographical location

display(ds2.get_geo())
  2. Funding information

Since this collection does not include information about funding, the function returns an empty DataFrame:

display(ds2.get_funding())
  3. Publication information

bib, df_pub = ds2.get_publications()
display(df_pub)

However, since our study is a collection, we cannot retrieve the data automatically:

df_data = ds2.get_data(study_id='971943')
[2026-04-24 10:23:13,090][WARNING] - Study 971943 is a collection dataset. Skipping.

But we can pass the member list from the summary DataFrame:

df_data = ds2.get_data(study_id=df_summary['CollectionMembers'].iloc[0])
[2026-04-24 10:23:15,021][INFO] - Study 972091 found as collection member. Registering child dataset.
[2026-04-24 10:23:16,664][INFO] - Study 972656 found as collection member. Registering child dataset.
[2026-04-24 10:23:19,452][INFO] - Study 972657 found as collection member. Registering child dataset.
[2026-04-24 10:23:21,109][INFO] - Study 972658 found as collection member. Registering child dataset.
[2026-04-24 10:23:22,760][INFO] - Study 972659 found as collection member. Registering child dataset.
[2026-04-24 10:23:24,404][INFO] - Study 972660 found as collection member. Registering child dataset.
[2026-04-24 10:23:26,076][INFO] - Study 972661 found as collection member. Registering child dataset.
[2026-04-24 10:23:27,785][INFO] - Study 972662 found as collection member. Registering child dataset.
[2026-04-24 10:23:29,403][INFO] - Study 972663 found as collection member. Registering child dataset.
[2026-04-24 10:23:31,083][INFO] - Study 972664 found as collection member. Registering child dataset.
[2026-04-24 10:23:32,745][INFO] - Study 972665 found as collection member. Registering child dataset.
[2026-04-24 10:23:34,423][INFO] - Study 972666 found as collection member. Registering child dataset.
[2026-04-24 10:23:36,086][INFO] - Study 972667 found as collection member. Registering child dataset.
[2026-04-24 10:23:37,723][INFO] - Study 972668 found as collection member. Registering child dataset.
[2026-04-24 10:23:40,300][INFO] - Study 972669 found as collection member. Registering child dataset.
[2026-04-24 10:23:41,965][INFO] - Study 972670 found as collection member. Registering child dataset.
[2026-04-24 10:23:43,661][INFO] - Study 972671 found as collection member. Registering child dataset.
[2026-04-24 10:23:45,316][INFO] - Study 972672 found as collection member. Registering child dataset.
[2026-04-24 10:23:46,964][INFO] - Study 972673 found as collection member. Registering child dataset.
[2026-04-24 10:23:48,596][INFO] - Study 972674 found as collection member. Registering child dataset.
[2026-04-24 10:23:50,227][INFO] - Study 972675 found as collection member. Registering child dataset.
[2026-04-24 10:23:51,924][INFO] - Study 972676 found as collection member. Registering child dataset.
[2026-04-24 10:23:53,604][INFO] - Study 972677 found as collection member. Registering child dataset.
[2026-04-24 10:23:56,433][INFO] - Study 972678 found as collection member. Registering child dataset.
[2026-04-24 10:23:58,272][INFO] - Study 972679 found as collection member. Registering child dataset.
[2026-04-24 10:23:59,936][INFO] - Study 972680 found as collection member. Registering child dataset.
[2026-04-24 10:24:01,637][INFO] - Study 972681 found as collection member. Registering child dataset.
[2026-04-24 10:24:03,298][INFO] - Study 972682 found as collection member. Registering child dataset.
[2026-04-24 10:24:05,000][INFO] - Study 972683 found as collection member. Registering child dataset.
[2026-04-24 10:24:06,695][INFO] - Study 972684 found as collection member. Registering child dataset.
[2026-04-24 10:24:10,171][INFO] - Study 972685 found as collection member. Registering child dataset.
[2026-04-24 10:24:11,853][INFO] - Study 972686 found as collection member. Registering child dataset.
[2026-04-24 10:24:13,605][INFO] - Study 972687 found as collection member. Registering child dataset.
[2026-04-24 10:24:15,287][INFO] - Study 972688 found as collection member. Registering child dataset.
[2026-04-24 10:24:16,977][INFO] - Study 972689 found as collection member. Registering child dataset.
[2026-04-24 10:24:18,653][INFO] - Study 972690 found as collection member. Registering child dataset.
[2026-04-24 10:24:20,374][INFO] - Study 972691 found as collection member. Registering child dataset.
[2026-04-24 10:24:22,049][INFO] - Study 972692 found as collection member. Registering child dataset.
[2026-04-24 10:24:23,727][INFO] - Study 972693 found as collection member. Registering child dataset.
[2026-04-24 10:24:25,406][INFO] - Study 972694 found as collection member. Registering child dataset.
[2026-04-24 10:24:27,094][INFO] - Study 972695 found as collection member. Registering child dataset.
[2026-04-24 10:24:28,787][INFO] - Study 972696 found as collection member. Registering child dataset.
[2026-04-24 10:24:30,470][INFO] - Study 972697 found as collection member. Registering child dataset.
[2026-04-24 10:24:32,177][INFO] - Study 972698 found as collection member. Registering child dataset.
[2026-04-24 10:24:33,847][INFO] - Study 972699 found as collection member. Registering child dataset.
[2026-04-24 10:24:35,538][INFO] - Study 972700 found as collection member. Registering child dataset.
[2026-04-24 10:24:37,451][INFO] - Study 972701 found as collection member. Registering child dataset.
[2026-04-24 10:24:39,143][INFO] - Study 972702 found as collection member. Registering child dataset.

Let’s have a look at the list:

len(df_data)
48

There are 48 data tables, corresponding to the 48 datasets in the collection.
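
With this many tables, it can help to key each one by its dataset ID. The sketch below uses stand-in objects rather than the real download, and it assumes the returned list preserves the order of the IDs passed in. The log suggests members are processed in sequence, but this is worth verifying before relying on it.

```python
import pandas as pd

# Stand-ins: three of the member IDs and made-up one-column tables
member_ids = [972091, 972656, 972657]
tables = [pd.DataFrame({"value": [i]}) for i in range(len(member_ids))]

# Guard against a mismatch (e.g., if a member were skipped) before pairing
if len(member_ids) != len(tables):
    raise ValueError("Number of tables does not match number of member IDs")

# Pair each ID with the table at the same position
tables_by_id = dict(zip(member_ids, tables))
print(sorted(tables_by_id))
```

With the real objects, `member_ids` would be `df_summary['CollectionMembers'].iloc[0]` and `tables` would be `df_data`.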

Why is this important? So far, we have passed a dataset ID directly. In practice, though, most users will query the repository, and collections will be returned alongside their individual datasets. For example, consider a geographical query: if a collection contains datasets from across the globe, only some will match the query, but the collection itself will still appear in the results. You will then need to disambiguate based on the type of data you are interested in.
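
One way to begin that disambiguation is to separate collections from standalone datasets in the summary table. The sketch below runs on a made-up table and rests on assumptions: the column name `StudyID` is hypothetical, the IDs 111111 and 222222 are placeholders, and non-collection rows are assumed to hold `None` or an empty list in CollectionMembers. Check how your own summary encodes these cases.

```python
import pandas as pd

# Made-up summary table; "StudyID" is a hypothetical column name
summary_demo = pd.DataFrame(
    {
        "StudyID": [971943, 111111, 222222],
        "CollectionMembers": pd.Series(
            [[972091, 972656], None, []], dtype=object
        ),
    }
)


def has_members(value) -> bool:
    """True when the cell holds a non-empty list of child dataset IDs."""
    return isinstance(value, (list, tuple)) and len(value) > 0


is_collection = summary_demo["CollectionMembers"].apply(has_members)
collections = summary_demo[is_collection]
standalone = summary_demo[~is_collection]

# Flatten the child IDs of every collection into a single list
child_ids = [i for members in collections["CollectionMembers"] for i in members]
print(child_ids)
```

The flattened `child_ids` list can then be filtered further or passed on to retrieve the member tables.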

Summary

In this notebook, you learned how to access data and metadata for paleoclimate datasets stored on PANGAEA, and how this repository differs from NOAA NCEI. A subsequent notebook covers querying the PANGAEA database.

References
  1. Todorovic, S., Dissard, D., Linsley, B. K., Kuhnert, H., & Wu, H. C. (2024). Paired d18O and Sr/Ca, and reconstructed d18Osw records of Porites sp. from Rotuma (RO2) and Tonga (TNI2), Southwest Pacific. PANGAEA. 10.1594/PANGAEA.965772
  2. Todorović, S., Wu, H. C., Linsley, B. K., Kuhnert, H., Menkes, C., Isbjakowa, A., & Dissard, D. (2024). Western Pacific Warm Pool Warming and Salinity Front Expansion Since 1821 Reconstructed From Paired Coral δ18O, Sr/Ca, and Reconstructed δ18Osw. Paleoceanography and Paleoclimatology, 39(12). 10.1029/2024pa004843
  3. Land, A., & Reichle, D. (2024). Tree-ring width measurements of living Douglas firs in Central Europe. PANGAEA. 10.1594/PANGAEA.971943