Querying the PANGAEA Database¶

Authors¶

Deborah Khider , Dhiren Oswal

Preamble¶

In a previous chapter, you learned how to create a PangaeaDataset object and its many functionalities. We won’t be revisiting those here; just know that regardless of how many datasets are returned, the functions will apply.

This tutorial demonstrates how to query the PANGAEA database. A list of search parameters available in PyleoTUPS can be found here. Note that many PANGAEA search terms have been mapped to their NOAA NCEI equivalents for consistency.

study_ids (int, str, or list, optional) – One or more PANGAEA dataset identifiers (numeric ID or DOI string). If provided, performs direct lookup and ignores other filters.
topic (str, optional) – Filter datasets by PANGAEA topic classification. Must be one of the predefined topics: - “all” (default) - “Agriculture” - “Atmosphere” - “Biological Classification” - “Biosphere” - “Chemistry” - “Cryosphere” - “Ecology” - “Fisheries” - “Geophysics” - “Human Dimensions” - “Lakes & Rivers” - “Land Surface” - “Lithosphere” - “Oceans” - “Paleontology” If set to “all” or omitted, no topic filtering is applied.
search_text (str, optional) – Free-text search query applied across dataset metadata. Maps to PANGAEA full-text search parameter ‘q’. Example: ‘stable carbon and oxygen isotopes’.
investigators (str or list[str], optional) – Author names. Mapped internally to PANGAEA query syntax: author:
variable_name (str or list[str], optional) – Name of parameters/variables (columns) present in dataset tables. Internally mapped to PANGAEA query term: parameter:<variable_name>
min_lat (float, optional) – Minimum latitude of the bounding box (–90..90).
max_lat (float, optional) – Maximum latitude of the bounding box (–90..90).
min_lon (float, optional) – Minimum longitude of the bounding box (–180..180).
max_lon (float, optional) – Maximum longitude of the bounding box (–180..180).
limit (int, default 100, maximum 500) – Maximum number of results returned.
skip (int, default 0) – Number of results to skip (pagination). Maps to PANGAEA ‘offset’

Note: Time parameters are not among the available search terms because PANGAEA does not store this information as part of its metadata. `AGE` is treated as a variable and is only accessible through the data tables. For a full description of the problem and the solutions we have considered, see this GitHub issue.

Another major difference between how PANGAEA and NOAA NCEI handle queries is that PANGAEA queries are text-based. When you fill out the parameters above, PyleoTUPS constructs the search string for you. However, text-based searches are less structured and therefore less reproducible than parameter searches, and results may vary with metadata completeness.
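
To make the string construction concrete, here is a minimal sketch of how search parameters could be assembled into a query dictionary like the ones printed in the log output of this tutorial. The function `build_query` is hypothetical and is not PyleoTUPS's actual implementation; the field prefixes (`topic:`, `author:`, `parameter:`) and the bounding-box order are inferred from those logs, while the relative order of terms and the handling of partial bounds are assumptions.

```python
def build_query(topic=None, search_text=None, investigators=None,
                variable_name=None, min_lat=None, max_lat=None,
                min_lon=None, max_lon=None, limit=100, skip=0):
    """Hypothetical helper: assemble a PANGAEA-style query dictionary."""

    def as_list(value):
        # Accept a single string or a list of strings.
        if value is None:
            return []
        return [value] if isinstance(value, str) else list(value)

    terms = []
    if topic and topic != "all":
        terms.append(f"topic:{topic}")
    if search_text:
        terms.append(search_text)
    terms += [f"author:{name}" for name in as_list(investigators)]
    terms += [f"parameter:{name}" for name in as_list(variable_name)]

    # Assumed order: (min_lon, min_lat, max_lon, max_lat).
    bounds = (min_lon, min_lat, max_lon, max_lat)
    if all(b is None for b in bounds):
        bbox = None
    elif any(b is None for b in bounds):
        # Assumption: a partial bounding box is treated as an error.
        raise ValueError("Provide all four bounds or none.")
    else:
        bbox = bounds

    return {"q": " ".join(terms), "bbox": bbox, "limit": limit, "offset": skip}


print(build_query(variable_name="Temperature", topic="Paleontology",
                  search_text="Holocene", limit=50))
```

Keeping the string assembly in one place makes it easy to print the final query before sending it, which helps when you need to record exactly what was searched.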

A list of available search terms can be found here.

Goals¶

Query the PANGAEA database using search parameters.
Extend a query using the search_text parameter and PANGAEA’s fields.
Deal with collections.

Prerequisite¶

Understanding of PANGAEA datasets and associated search API
PyleoTUPS PangaeaDataset object

Reading time¶

15 min

Let’s import our packages!

import pyleotups as pt
import pandas as pd

Study Query¶

The notebook introducing the PangaeaDataset object covered searching by study_ids. If you are familiar with the pangaeapy package, this is one of the easiest ways to get data from the database.

These IDs correspond to the last six digits of the DOI that PANGAEA mints for each dataset. Let’s use the study by van der Bilt et al. (2016), whose data are available here. The DOI is 10.1594/PANGAEA.868935, resulting in the following search:

ds = pt.PangaeaDataset()
res = ds.search_studies(study_ids = '868935')
display(res)

[2026-04-30 09:14:09,644][INFO] - Registering Study 868935 via direct lookup.
[2026-04-30 09:14:11,278][INFO] - Retrived 1 studies
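
If you start from a list of DOIs and want the bare numeric IDs (for instance, to compare them against other results), a small helper can extract them. This is a sketch only: `pangaea_id_from_doi` is a hypothetical name, and the pattern assumes the DOI ends in `PANGAEA.` followed by digits, without hard-coding the number of digits.

```python
import re


def pangaea_id_from_doi(doi):
    """Return the numeric dataset ID at the end of a PANGAEA DOI string."""
    match = re.search(r"PANGAEA\.(\d+)$", doi.strip())
    if match is None:
        raise ValueError(f"Not a PANGAEA dataset DOI: {doi!r}")
    return match.group(1)


print(pangaea_id_from_doi("10.1594/PANGAEA.868935"))
print(pangaea_id_from_doi("https://doi.org/10.1594/PANGAEA.868936"))
```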

You may have noticed that the parameter name is in a plural form. Yes, this means that you can query data from multiple datasets at the same time.

Let’s get the second dataset in the collection corresponding to the van der Bilt et al. (2016) study.

ds = pt.PangaeaDataset()
res = ds.search_studies(study_ids = ['868935','868936'])
display(res)

[2026-04-30 09:18:21,005][INFO] - Registering Study 868935 via direct lookup.
[2026-04-30 09:18:22,625][INFO] - Registering Study 868936 via direct lookup.
[2026-04-30 09:18:24,265][INFO] - Retrived 2 studies

Investigator query¶

Let’s search for datasets with a specific investigator.

ds = pt.PangaeaDataset()
res = ds.search_studies(investigators = "Khider, D.")
display(res.head())

[2026-05-05 10:43:46,394][INFO] - Limit set to 100

{'q': 'author:Khider, D.', 'bbox': None, 'limit': 100, 'offset': 0}

[2026-05-05 10:44:36,825][INFO] - Retrived 31 studies
[2026-05-05 10:44:36,830][WARNING] - The search contains dataset(s) [830589, 897517, 921315, 965845] marked as collection. Refer to the 'CollectionMembers' column toidentify respective child datasets.

The query returned 31 datasets. Note that PyleoTUPS warns us that some of these are collections. As discussed in the PangaeaObject tutorial, collections group datasets that belong to the same study. Collections are necessary in PANGAEA because each data table is its own dataset with a unique DOI. Consider a study in which two sediment cores are analyzed for Mg/Ca and $\delta^{18}O$. The data for both cores might appear in the same table under different Event labels, but if the data differ substantially, they may be reported in separate tables, and therefore separate datasets.

This is one of the key differences between the two repositories. NOAA NCEI groups data by study, allowing a one-to-many relationship between a study and its data tables; a query returns all tables in the study. PANGAEA, by contrast, creates an individual dataset for each table (a one-to-one relationship), with collections serving as the grouping mechanism.

Also notice the string returned by PyleoTUPS under the q key. As mentioned in the preamble, PANGAEA searches are string-based, with certain keys (such as author) that can be passed to the search. The q value is the actual query string that PyleoTUPS constructed from your search parameters.

Let’s look at a search with multiple authors:

ds = pt.PangaeaDataset()
res = ds.search_studies(investigators = ["Khider, D.", "Richey, JN"])
display(res.head())

[2026-05-05 10:45:20,892][INFO] - Limit set to 100

{'q': 'author:Khider, D. author:Richey, JN', 'bbox': None, 'limit': 100, 'offset': 0}

[2026-05-05 10:47:45,586][INFO] - Retrived 100 studies
[2026-05-05 10:47:45,593][WARNING] - The search contains dataset(s) [897517, 760802, 818286, 830589, 760896, 761045, 760942, 729394, 783929, 761115, 760882, 761025, 729401, 729434, 760903, 729668, 761020, 760764, 762793, 729328, 760820, 784082, 760812, 760933, 943589, 784131, 762766, 784120, 760923, 762803, 729268, 935987, 989348, 949077, 761094, 921315, 859372, 773313, 809522, 919287] marked as collection. Refer to the 'CollectionMembers' column toidentify respective child datasets.

We have hit the limit! As we saw with NOAA queries, PyleoTUPS conducts an OR search by default, so it looks for all the studies authored by either investigator.

To do an AND query:

ds = pt.PangaeaDataset()
res = ds.search_studies(investigators = ["Khider, D.", "Richey, JN"], investigators_and_or = 'and')
display(res.head())

[2026-05-05 10:47:54,263][INFO] - Limit set to 100

{'q': 'author:Khider, D. author:Richey, JN', 'bbox': None, 'limit': 100, 'offset': 0}

[2026-05-05 10:50:15,475][INFO] - Retrived 100 studies
[2026-05-05 10:50:15,484][WARNING] - The search contains dataset(s) [897517, 760802, 818286, 830589, 760896, 761045, 760942, 729394, 783929, 761115, 760882, 761025, 729401, 729434, 760903, 729668, 761020, 760764, 762793, 729328, 760820, 784082, 760812, 760933, 943589, 784131, 762766, 784120, 760923, 762803, 729268, 935987, 989348, 949077, 761094, 921315, 859372, 773313, 809522, 919287] marked as collection. Refer to the 'CollectionMembers' column toidentify respective child datasets.
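
The difference between the two modes can be illustrated on a toy example. The dataset keys and author sets below are made up for illustration and are not real PANGAEA records; the sketch only shows the set logic, not how PyleoTUPS or PANGAEA evaluate the query.

```python
# Toy author sets keyed by made-up dataset numbers (not real PANGAEA records).
authors_by_dataset = {
    1: {"Khider, D.", "Richey, JN"},
    2: {"Khider, D."},
    3: {"Richey, JN"},
    4: {"Someone, E."},
}
wanted = {"Khider, D.", "Richey, JN"}

# OR: at least one requested author appears on the dataset.
or_hits = [key for key, authors in authors_by_dataset.items() if authors & wanted]

# AND: every requested author appears on the dataset.
and_hits = [key for key, authors in authors_by_dataset.items() if wanted <= authors]

print("OR :", or_hits)
print("AND:", and_hits)
```

An AND search can only return a subset of the OR results, so comparing the two counts is a quick sanity check on any multi-author query.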

Geographical query¶

Let’s perform the same query as we did previously on NOAA NCEI for all the datasets within 5°S-5°N and 109-125°E, roughly corresponding to the Indo-Pacific Warm Pool. For speed, we will limit the search to the first 50 datasets.

Note: Like NOAA NCEI, PANGAEA offers `limit` and `skip` parameters for larger queries.

res = ds.search_studies(min_lat=-5, max_lat=5, min_lon=109,
                       max_lon=125, limit = 50)

[2026-04-30 10:40:31,815][INFO] - Limit set to 50

{'q': '', 'bbox': (125, -5, 109, 5), 'limit': 50, 'offset': 0}

[2026-04-30 10:40:32,598][INFO] - Retrived 149 studies
[2026-04-30 10:40:32,613][WARNING] - The search contains dataset(s) [897517, 760802, 818286, 830589, 760896, 761045, 729394, 760942, 783929, 761115, 760882, 761025, 729401, 760903, 729434, 729668, 761020, 762793, 784082, 729328, 760764, 760933, 760812, 760820, 943589, 784131, 762766, 729268, 784120, 762803, 760923, 989348, 935987, 921315, 949077, 761094, 773313, 859372, 974323, 884026, 945184, 907164, 893362, 908012, 931789, 832852, 889777, 807060, 948261, 882207, 906264, 893251, 904320, 863918, 783095, 858234, 962319, 842136, 986565, 887797, 875162, 959573, 950248, 882093, 820222, 726830, 871490, 910698, 885760, 974693, 972097, 788547, 890541, 871600, 875106, 871703, 962646, 736003, 858610, 938223, 701361, 772717, 863914, 859532, 875299] marked as collection. Refer to the 'CollectionMembers' column toidentify respective child datasets.

display(res.head())
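
For queries that exceed a single page, `limit` and `skip` can be combined in a loop. The sketch below shows the paging logic only: `fetch_all` and `fetch_page` are hypothetical names, and `fake_fetch` stands in for a wrapper around `ds.search_studies(..., limit=..., skip=...)` that would return one page of results.

```python
def fetch_all(fetch_page, page_size=50, max_results=200):
    """Collect results page by page until a page comes back short."""
    results = []
    skip = 0
    while len(results) < max_results:
        page = fetch_page(limit=page_size, skip=skip)
        results.extend(page)
        if len(page) < page_size:
            break  # a short (or empty) page means there is nothing left
        skip += page_size
    return results[:max_results]


# Stand-in for a real search: 120 fake dataset IDs served in slices.
fake_ids = list(range(120))


def fake_fetch(limit, skip):
    return fake_ids[skip:skip + limit]


print(len(fetch_all(fake_fetch)))
```

The `max_results` cap is a design choice that keeps an overly broad query from paging through the whole database.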

Variable queries¶

Let’s search using variable names. In PANGAEA, the closest concept is the Parameter, which is why we mapped it to our variable_name key. Note, however, that a Parameter also includes units. For a complete list of available parameters in PANGAEA, see this wiki page.

ds = pt.PangaeaDataset()
res = ds.search_studies(variable_name = 'Temperature', limit=50)

[2026-04-30 11:30:15,414][INFO] - Limit set to 50

{'q': 'parameter:Temperature', 'bbox': None, 'limit': 50, 'offset': 0}

[2026-04-30 11:31:59,860][INFO] - Retrived 50 studies

display(res.head())

As you can see in the sample above, many studies on PANGAEA are not paleoclimate-related. When doing a variable search, we strongly advise narrowing the topic if possible:

ds = pt.PangaeaDataset()
res = ds.search_studies(variable_name = 'Temperature', topic = 'Paleontology', limit=50)

[2026-04-30 11:36:16,466][INFO] - Limit set to 50

{'q': 'topic:Paleontology parameter:Temperature', 'bbox': None, 'limit': 50, 'offset': 0}

[2026-04-30 11:37:42,968][INFO] - Retrived 50 studies

display(res.head())
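
Because the topic must be one of the predefined values listed in the preamble, it can help to check it before sending a query. The helper below is a sketch: `check_topic` is a hypothetical name and is not part of PyleoTUPS, and the suggestion step uses the standard-library `difflib` module.

```python
import difflib

# Topics as listed in the preamble of this tutorial.
PANGAEA_TOPICS = [
    "all", "Agriculture", "Atmosphere", "Biological Classification",
    "Biosphere", "Chemistry", "Cryosphere", "Ecology", "Fisheries",
    "Geophysics", "Human Dimensions", "Lakes & Rivers", "Land Surface",
    "Lithosphere", "Oceans", "Paleontology",
]


def check_topic(topic):
    """Return the topic unchanged if valid, otherwise raise with suggestions."""
    if topic in PANGAEA_TOPICS:
        return topic
    close = difflib.get_close_matches(topic, PANGAEA_TOPICS, n=3)
    hint = f" Did you mean one of {close}?" if close else ""
    raise ValueError(f"Unknown topic {topic!r}.{hint}")


print(check_topic("Paleontology"))
```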

Text search¶

It is possible to further refine queries by using the open search_text parameter. For instance, let’s try to look for Holocene records:

ds = pt.PangaeaDataset()
res = ds.search_studies(variable_name = 'Temperature', topic = 'Paleontology', search_text = 'Holocene', limit=50)
display(res.head())

[2026-04-30 11:40:51,677][INFO] - Limit set to 50

{'q': 'topic:Paleontology Holocene parameter:Temperature', 'bbox': None, 'limit': 50, 'offset': 0}

[2026-04-30 11:42:44,432][INFO] - Retrived 50 studies

As long as the word “Holocene” appears anywhere in the metadata (dataset name, study name, or description), the search will return those datasets. However, this does not mean that other datasets covering the Holocene are absent from PANGAEA — a study described as spanning “the past 15,000 years” would not be matched. Because age bounds are not part of PANGAEA’s standard metadata, time-based searches are severely limited.
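
The limitation can be seen with a toy keyword match. The descriptions below are invented for illustration, and the substring test is only a stand-in for PANGAEA's full-text search, whose actual matching rules may differ.

```python
# Made-up dataset descriptions (not real PANGAEA metadata).
descriptions = {
    "A": "Holocene summer temperature record from lake sediments",
    "B": "Alkenone temperatures spanning the past 15,000 years",
    "C": "Modern sea surface temperature from moorings",
}

keyword = "holocene"

# Keep only the entries whose text contains the keyword (case-insensitive).
matches = [key for key, text in descriptions.items() if keyword in text.lower()]

print(matches)
```

Record B covers the Holocene but never uses the word, so a keyword search cannot find it.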

Dealing with Collections¶

Collections link datasets together. They can be very powerful, but they are also difficult to deal with when you are interested only in the data.

Let’s search for the data supplement to van der Bilt et al. (2016). To do so, we will use:

the investigators parameter in search_studies
search_text to narrow down the publication year using the PANGAEA field citation:year:, as shown here:

ds = pt.PangaeaDataset()
res = ds.search_studies(investigators = 'van der Bilt', search_text = 'citation:year:2016', limit=50)

[2026-05-01 08:20:14,049][INFO] - Limit set to 50

{'q': 'citation:year:2016 author:van der Bilt', 'bbox': None, 'limit': 50, 'offset': 0}

[2026-05-01 08:20:20,433][INFO] - Retrived 3 studies
[2026-05-01 08:20:20,434][WARNING] - The search contains dataset(s) [868938] marked as collection. Refer to the 'CollectionMembers' column toidentify respective child datasets.

As you can see, PyleoTUPS created the string query q from both the exposed parameter investigators and the PANGAEA field citation:year:.

Let’s have a look at the returned datasets:

display(res)

As noted earlier, the first dataset is actually a collection. If you scroll to the last column, you’ll see a list of its collection members. One way to work with the underlying data is to collect the IDs listed in CollectionMembers for every row where that column is not empty, then create a new PangaeaDataset object from those IDs:

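ids = []

for idx, row in res.iterrows():
    members = row['CollectionMembers']
    # Skip rows without members (None or NaN); only collections list them
    if members is not None and not isinstance(members, float):
        ids.extend(members)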

ds2 = pt.PangaeaDataset()
res2 = ds2.search_studies(study_ids = ids)

[2026-05-01 09:43:38,225][INFO] - Registering Study 868935 via direct lookup.
[2026-05-01 09:43:39,985][INFO] - Registering Study 868936 via direct lookup.
[2026-05-01 09:43:41,574][INFO] - Retrived 2 studies
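
When a result mixes collections, their members, and standalone datasets, you may want a single de-duplicated list of the datasets that actually hold data. The sketch below works on plain dictionaries, such as those produced by `res.to_dict("records")`. The `StudyID` key is a hypothetical column name, and storing members as a list of integers is an assumption; adjust both to match the columns in your DataFrame.

```python
def data_dataset_ids(rows, id_key="StudyID", members_key="CollectionMembers"):
    """Replace each collection by its members and drop duplicate IDs."""
    ids = []
    for row in rows:
        members = row.get(members_key)
        if isinstance(members, (list, tuple)) and len(members) > 0:
            candidates = members        # collection parent: use its children
        else:
            candidates = [row[id_key]]  # ordinary dataset: keep it as-is
        for candidate in candidates:
            if candidate not in ids:
                ids.append(candidate)
    return ids


# Rows mimicking the search above: one collection and its two members.
rows = [
    {"StudyID": 868938, "CollectionMembers": [868935, 868936]},
    {"StudyID": 868935, "CollectionMembers": None},
    {"StudyID": 868936, "CollectionMembers": None},
]

print(data_dataset_ids(rows))
```

The resulting list can be passed to `search_studies` through `study_ids`, as in the cell above.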

Summary¶

In this tutorial, you learned how to use the search_studies function with the PANGAEA database. A few things to keep in mind:

PANGAEA is more general-purpose than NOAA NCEI for paleoclimate data. As a result, instrumental datasets may also appear in results, and search parameters common in paleoclimate queries (e.g., age) are unavailable on PANGAEA. Additionally, PANGAEA lacks the rich controlled vocabulary of NOAA NCEI, which can make broad searches more difficult.
Some datasets require login credentials to access.
Collections are returned alongside individual datasets, so care is needed when linking them back together.

References¶

van der Bilt, W. G. M., D’Andrea, W. J., Bakke, J., Balascio, N. L., Werner, J. P., Gjerde, M., & Bradley, R. S. (2018). Alkenone-based reconstructions reveal four-phase Holocene temperature evolution for High Arctic Svalbard. Quaternary Science Reviews, 183, 204–213. 10.1016/j.quascirev.2016.10.006
van der Bilt, W. G. M., D’Andrea, W. J., Bakke, J., Balascio, N. L., Werner, J. P., Gjerde, M., & Bradley, R. S. (2016). Holocene-length alkenone-based summer temperature record from sediment core AMP112. PANGAEA. 10.1594/PANGAEA.868935
van der Bilt, W. G. M., D’Andrea, W. J., Bakke, J., Balascio, N. L., Werner, J. P., Gjerde, M., & Bradley, R. S. (2016). Holocene-length alkenone-based summer temperature record from sediment core HAP0212. PANGAEA. 10.1594/PANGAEA.868936