Working with NOAA NCEI for Paleoclimatology¶

Authors¶

Deborah Khider, Dhiren Oswal

Preamble¶

The goal of this tutorial is to familiarize you with the NOAADataset object and its functionalities in PyleoTUPS. In this tutorial, we focus specifically on searching by NOAA study ID; a separate tutorial will cover the full range of NOAA search capabilities.

In the NOAA framework, a study ID represents the publication unit for a dataset, that is, all data associated with a given study. For example, a study may include multiple data tables corresponding to different sites. While these tables share the same study ID, each site is assigned a unique site ID.

Goals¶

Search for datasets using a specific study ID
Obtain information such as location, publication, variable metadata and the associated data.

Pre-requisites¶

Understanding of NOAA datasets

Reading time¶

15 min

Let’s import our packages!

import pyleotups as pt
import pandas as pd

Accessing a dataset using the NOAA study ID¶

This is the most basic search you can perform, but if you know which dataset you are looking for, it is the easiest way to get all the needed information. First, you need to create a NOAADataset object that will store the information.

ds = pt.NOAADataset()

Let’s do a simple search using a known NOAA study ID. For this example, let’s use the dataset from Clemens et al. (2021), which can be accessed through the NOAA Paleo portal. The study contains several tables of measurements made on marine sediments from IODP Site U1446:

Warning: Even for one study, the search may take some time.

res = ds.search_studies(noaa_id=33213)

[2026-05-15 15:03:29,928][INFO] - search_studies: Limit defaulted to 100 (PyleoTUPS).

Request URL: https://www.ncei.noaa.gov/access/paleo-search/study/search.json?dataPublisher=NOAA&NOAAStudyId=33213&limit=100

Parsing NOAA studies: 100%|██████████| 1/1 [00:00<00:00, 2376.38it/s]
[2026-05-15 15:03:30,575][INFO] - Retrieved 1 studies.
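The request URL in the log shows how the search maps onto query parameters of the NOAA study-search endpoint. As a minimal sketch, the same URL can be rebuilt with the standard library. The helper name `build_search_url` is hypothetical, and this is not necessarily how PyleoTUPS assembles the request internally; it only reproduces the URL printed above.

```python
from urllib.parse import urlencode

# Endpoint copied from the "Request URL" line in the log output above.
BASE_URL = "https://www.ncei.noaa.gov/access/paleo-search/study/search.json"


def build_search_url(noaa_id, limit=100, publisher="NOAA"):
    """Hypothetical helper: rebuild the search URL shown in the log."""
    # Parameter names and order follow the logged URL.
    params = {
        "dataPublisher": publisher,
        "NOAAStudyId": noaa_id,
        "limit": limit,
    }
    return f"{BASE_URL}?{urlencode(params)}"


url = build_search_url(33213)
print(url)
```

Pasting the resulting URL into a browser is a quick way to inspect the raw JSON response that PyleoTUPS parses.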

Note: The NOAADataset object will only accept one ID at a time. You cannot pass a list of various IDs to get multiple NOAA studies at a time.
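If you need several studies by ID, one possible workaround is to loop over the IDs and stack the per-study summaries. The sketch below assumes a hypothetical helper, `collect_summaries`, that takes any function returning one summary DataFrame per ID. A stand-in fetcher with placeholder values is used so the example runs offline; the commented lines show how a PyleoTUPS-based fetcher might look using only the calls from this tutorial.

```python
import pandas as pd


def collect_summaries(noaa_ids, fetch_summary):
    """Hypothetical helper: call fetch_summary once per ID and stack the results."""
    frames = [fetch_summary(noaa_id) for noaa_id in noaa_ids]
    return pd.concat(frames, ignore_index=True)


def fake_fetch(noaa_id):
    """Offline stand-in that returns a one-row summary with placeholder values."""
    return pd.DataFrame(
        {"StudyID": [str(noaa_id)], "StudyName": [f"placeholder for {noaa_id}"]}
    )


# With PyleoTUPS, the fetcher could be written as (assumption, not run here):
# def fetch_with_pyleotups(noaa_id):
#     ds = pt.NOAADataset()
#     ds.search_studies(noaa_id=noaa_id)
#     return ds.get_summary()

combined = collect_summaries([33213, 10420], fake_fetch)
print(combined)
```

Creating a fresh object per ID keeps each study's metadata separate until you explicitly combine it.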

The get_summary method provides basic information about the dataset, such as the name of the study, the NOAA DataType, the time coverage and the associated publication. The function returns a pandas.DataFrame.

df_summary = ds.get_summary()
display(df_summary)

Let’s have a look at the information returned in this DataFrame:

StudyID: Matches the value we searched for. While redundant in this case, it becomes useful when querying using other criteria.
XMLID: A different identifier for the same study. Each StudyID is associated with a corresponding XMLID.
Study Name: The name of the study.
DataType: Uses NOAA’s nomenclature for the data type.
EarliestYearBP, MostRecentYearBP, EarliestYearCE, MostRecentYearCE: The temporal bounds of the dataset.
Coverage: The geographic area covered by the study. In this example, it corresponds to a single point. This information can also be retrieved using the get_geo function.
StudyNotes: Includes descriptive notes and keywords associated with the study.
ScienceKeywords: Standardized science keywords.
Investigators: The contributors to the study.
Publications: Information about publications associated with the dataset. This can also be retrieved using the get_publications function.
Sites: The sites associated with the NOAA study. A single study may include multiple sites, and each site may contain multiple data tables. This information can be retrieved using the get_sites function.
Funding: Information about funding sources for the study. This can be retrieved using the get_funding function.

This function provides a high-level overview of the study and corresponds to the information available on the NOAA landing page.

To facilitate downstream analysis, we provide additional functions that extract key components of the study into separate pandas.DataFrame objects, making it easier to work with specific aspects of the data.

Obtaining Information about Funding and Publications¶

To get more details about the funding:

df_funding = ds.get_funding()
display(df_funding)

Note that each table has a StudyID column. You can think of it as a key in a relational database. It may not matter right now, but it will become useful once you use the more advanced search functionalities, which can return several studies.
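To illustrate why a shared StudyID key is useful, here is a minimal sketch that joins a summary table to a funding table with pandas. The two DataFrames are small stand-ins with placeholder values, and the column names other than StudyID are assumptions rather than the exact columns returned by PyleoTUPS.

```python
import pandas as pd

# Stand-in for a summary table covering two studies (placeholder names).
summary = pd.DataFrame(
    {
        "StudyID": ["33213", "10420"],
        "StudyName": ["Study A (placeholder)", "Study B (placeholder)"],
    }
)

# Stand-in for a funding table; one study can have several funding rows.
funding = pd.DataFrame(
    {
        "StudyID": ["33213", "33213", "10420"],
        "FundingAgency": ["Agency X", "Agency Y", "Agency Z"],
    }
)

# StudyID acts as the join key, as in a relational database.
merged = summary.merge(funding, on="StudyID", how="left")
print(merged)
```

A left join keeps every study from the summary table, even one without a matching funding row.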

In addition to funding information, PyleoTUPS allows you to get information about the publication(s) associated with the dataset.

bib, df_pub = ds.get_publications()
display(df_pub)

Note: You can set the `save` parameter to True if you want to save a copy of the BibTeX entries.

Obtaining Information about Geographical Location¶

Similarly, you can return information about the location of the record:

df_geo = ds.get_geo()
display(df_geo)

When getting the geographical coordinates for a study, each row in the returned DataFrame will correspond to a specific site. In this case, there is only one site in the study. We will be looking at a more interesting case later in this tutorial.

Obtaining Variable Information¶

The next step is to actually obtain the data tables present in the dataset. Doing so requires some understanding of how NOAA organizes their database.

At the top level, there is the study, which is represented by a unique studyID and xmlID. Each study can have multiple sites. For instance, a study could be a compilation of benthic $\delta^{18}O$ records, each with their own site. Each site can have multiple tables, which might refer to different measurements (e.g., one table for Mg/Ca and SST and another one for planktic $\delta^{18}O$ ). PyleoTUPS keeps this structure intact to give you as much flexibility in your workflows as possible.
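The study, site, and table hierarchy described above can be pictured with a few nested data classes. This is only an illustrative sketch: the class and field names are hypothetical and do not reflect how PyleoTUPS represents the NOAA response internally. The second table ID is a placeholder.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Table:
    table_id: str
    description: str


@dataclass
class Site:
    site_id: str
    tables: List[Table] = field(default_factory=list)


@dataclass
class Study:
    study_id: str
    sites: List[Site] = field(default_factory=list)

    def all_tables(self):
        """Flatten the hierarchy: every table from every site."""
        return [table for site in self.sites for table in site.tables]


# One study, one site, two tables (the second table is a placeholder).
study = Study(
    study_id="33213",
    sites=[
        Site(
            site_id="58697",
            tables=[
                Table("45859", "TEX86 data"),
                Table("placeholder-id", "another measurement table"),
            ],
        )
    ],
)
print([table.table_id for table in study.all_tables()])
```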

Let’s first have a look at the sites:

df_sites = ds.get_sites()
display(df_sites)

As you may notice, this function returns information similar to that of the get_geo method. This redundancy provides multiple entry points to the database and harmonizes functionality between PANGAEA and NOAA NCEI. Which one should you use? Whichever is most convenient for you and your workflow. For instance, if you are interested in a meta-analysis of records available on both NOAA NCEI and PANGAEA, get_geo might be more convenient because its interface is harmonized across the two databases. But if you are familiar with the NOAA structure and only work with datasets stored there, get_sites might be the more natural function to use.

Whichever you choose, you do not need to first get the sites to access the data tables, as shown below:

df_tables = ds.get_tables()
display(df_tables)

This function operates directly on the NOAADataset object and does not require a SiteID.

Note that each TableID is unique and can be used to get more information about the data. Most of the fields returned here are self-explanatory, but let’s go over the few that require further explanation and design consideration:

FileURL: This is where the actual data lives! So far, all the information we have gathered has come from the JSON response of the NOAA API. Now it is time to actually read the data!
FileDescription: NOAA's internal label for the type of file found at the FileURL. PyleoTUPS can only read text files, that is, files described as NOAA Template File, NOAA Template File - Synthesis Metadata, Raw Measurements - NOAA Template File, or Chronology - NOAA Template File. It will not work on other formats such as NetCDF (we recommend the xarray library for these files), .lpd (LiPD files, for which we have created a library called PyLiPD), HTML, or pickle files.

Note: To make matters more complicated, the same TableID can appear multiple times, once for each file format (for instance, when the data is available both as a NOAA template file and as a LiPD file). In addition, a TableID (which really identifies a file) can contain multiple tables.
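Because only the text-based template files can be parsed, it can help to filter the table listing before requesting data. The sketch below works on a small stand-in DataFrame. The column names DataTableID, FileDescription, and FileURL are assumed to match the get_tables output, and the rows, including the non-template descriptions and the example.org URLs, are placeholders.

```python
import pandas as pd

# Stand-in for a table listing; all rows are placeholders.
tables = pd.DataFrame(
    {
        "DataTableID": ["t1", "t1", "t2", "t3"],
        "FileDescription": [
            "NOAA Template File",
            "LiPD file (placeholder label)",
            "Raw Measurements - NOAA Template File",
            "NetCDF file (placeholder label)",
        ],
        "FileURL": [
            "https://example.org/t1.txt",
            "https://example.org/t1.lpd",
            "https://example.org/t2.txt",
            "https://example.org/t3.nc",
        ],
    }
)

# Every readable description listed above contains "NOAA Template File".
is_template = tables["FileDescription"].str.contains("NOAA Template File", regex=False)
readable = tables[is_template]
print(readable[["DataTableID", "FileURL"]])
```

Filtering on the description rather than on the ID avoids picking up the non-text copy when the same ID is listed in more than one format.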

According to the get_tables output above, our study contains a single site (SiteID = 58697) and eight tables. To get the data, pass the DataTableID or FileURL to the get_data() method. Let’s have a look at the TEX86 data:

dfs = ds.get_data(dataTableIDs="45859")
type(dfs)

list

Notice that we are returning a list. This is because in some cases, the PyleoTUPS parser may identify more than one table in the file. This is most common with older studies.

Let’s have a look at the first (and only) table:

display(dfs[0])

And we have the data! How about more metadata about the variables themselves?

Well, there is a function for this as well:

df_var = ds.get_variables(dataTableIDs="45859")
display(df_var)

For more information about the meaning of the columns and possible values, please have a look at the NOAA PaST Thesaurus.

Note: You can pass multiple table IDs as a list. The get_data function always returns a list of DataFrames, which is why we selected the first one in the code above.

Some relevant metadata for each column is also stored in the DataFrame attributes:

dfs[0].attrs

{'variables': ['Site',
  'Hole',
  'Core',
  'Type',
  'Section',
  'Section_Depth',
  'Sample_Depth',
  'Age',
  'TEX86H',
  'SST'],
 'NOAAStudyId': '33213',
 'StudyName': 'Bay of Bengal, Northeast Indian Margin Stable Isotope, Biomarker and SST Reconstructions since the Mid-Pleistocene'}
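Since `attrs` is a plain dictionary attached to the DataFrame, it can be queried like any other dictionary. The sketch below uses a tiny stand-in DataFrame with made-up values to show one possible check: confirming that every variable listed in the metadata is present as a column. Note that pandas documents `attrs` as experimental, so it is worth verifying that the metadata is still present after transforming the data.

```python
import pandas as pd

# Stand-in DataFrame with made-up values; only the keys mirror the output above.
df = pd.DataFrame({"Age": [0.5, 1.2], "SST": [28.1, 27.6]})
df.attrs = {"variables": ["Age", "SST"], "NOAAStudyId": "33213"}

# Read a single metadata entry.
study_id = df.attrs["NOAAStudyId"]

# Check that each listed variable exists as a column.
missing = [name for name in df.attrs["variables"] if name not in df.columns]

print(study_id, missing)
```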

Instead of the table ID, you can also pass the file URL to get the data:

file_url = df_tables['FileURL'].iloc[0]
file_url

'https://www.ncei.noaa.gov/pub/data/paleo/contributions_by_author/clemens2021/clemens2021-u1446-benth-iso-noaa.txt'

df1 = ds.get_data(file_urls=file_url)[0]
display(df1.head())

You can choose the method that is more convenient for you.

Summary¶

The functionalities in PyleoTUPS allow you to access all the metadata stored at NOAA NCEI for Paleo using functions that mimic the structure offered by this data provider. Note that some information (e.g., geography) can be obtained through multiple functions. This built-in redundancy lets you choose the methods that work best for you and your use case.

Dataset with Multiple Sites¶

As mentioned, some studies have multiple sites. Let’s see how this affects the functionalities. For this example, let’s have a look at the paleoceanographic study of the Makassar Strait (Indo-Pacific Warm Pool) by Linsley et al. (2010), which is NOAA study 10420.

ds2 = pt.NOAADataset()

res = ds2.search_studies(noaa_id=10420)
display(res)

[2026-05-15 15:04:06,053][INFO] - search_studies: Limit defaulted to 100 (PyleoTUPS).

Request URL: https://www.ncei.noaa.gov/access/paleo-search/study/search.json?dataPublisher=NOAA&NOAAStudyId=10420&limit=100

Parsing NOAA studies: 100%|██████████| 1/1 [00:00<00:00, 1766.77it/s]
[2026-05-15 15:04:06,530][INFO] - Retrieved 1 studies.

Let’s first have a look at the geographical information:

display(ds2.get_geo())

There are a few takeaways here:

As expected, there are multiple sites in this study. Each site is represented as a row.
The GeometryType indicates whether the site is a single location (i.e., one physical archive), represented as a point, or multiple physical archives pooled together, represented as a polygon.

Let’s have a look at the sites:

display(ds2.get_sites())

In this table, notice the various sites, with different coordinates and file URLs. This is an example of the one-to-many relationship that exists at NOAA: one study can have multiple sites and, as we will see in the next cell, one site can have multiple tables:

display(ds2.get_tables())
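One way to make the one-to-many relationship explicit is to count the tables attached to each site. The sketch below uses a stand-in DataFrame with placeholder IDs. The column names SiteID and DataTableID are assumed to match the get_tables output.

```python
import pandas as pd

# Stand-in for a table listing with two sites (placeholder IDs).
tables = pd.DataFrame(
    {
        "SiteID": ["s1", "s1", "s2"],
        "DataTableID": ["a", "b", "c"],
    }
)

# Number of distinct tables per site.
tables_per_site = tables.groupby("SiteID")["DataTableID"].nunique()
print(tables_per_site)
```

Counting distinct IDs rather than rows avoids double counting a table that is listed once per file format.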

Summary¶

This notebook walks through the way PyleoTUPS handles the NOAA response internally and the functionalities that support access to the data and metadata, using examples based on a known NOAA study ID. We understand that most users will want to use the NOAA querying capabilities, which are the subject of the third tutorial in this section.

References¶

Clemens, S. C., Yamamoto, M., Thirumalai, K., Giosan, L., Richey, J. N., Nilsson-Kerr, K., Rosenthal, Y., Anand, P., & McGrath, S. M. (2021). Remote and local drivers of Pleistocene South Asian summer monsoon precipitation: A test for future predictions. Science Advances, 7(23). 10.1126/sciadv.abg3848
Linsley, B. K., Rosenthal, Y., & Oppo, D. W. (2010). Holocene evolution of the Indonesian throughflow and the western Pacific warm pool. Nature Geoscience, 3(8), 578–583. 10.1038/ngeo920