Working with NOAA NCEI for Paleoclimatology

Authors

Deborah Khider, Dhiren Oswal

Preamble

The goal of this tutorial is to familiarize you with the NOAADataset object and its functionalities in PyleoTUPS. In this tutorial, we focus specifically on searching by NOAA study ID; a separate tutorial will cover the full range of NOAA search capabilities.

In the NOAA framework, study IDs represent a publication unit for a dataset—that is, all data associated with a given study. For example, a study may include multiple data tables corresponding to different sites. While these tables share the same study ID, each site is assigned a unique site ID.

Goals

  • Search for datasets using a specific study ID

  • Obtain information such as location, publication, variable metadata and the associated data.

Pre-requisites

  • Understanding of NOAA datasets

Reading time

15 min

Let’s import our packages!

import pyleotups as pt
import pandas as pd

Accessing a dataset using the NOAA study ID

This is the most basic search you can perform, but if you know which dataset you are looking for, it is the easiest way to get all the needed information. First, you need to create a NOAADataset object that will store the information.

ds=pt.NOAADataset()

Let’s do a simple search, knowing the NOAA study ID. For this example, let’s use the dataset from Clemens et al. (2021), which can be accessed through the NOAA Paleo portal. The study contains several tables of measurements made on a marine sediment core at site IODP U1446:

Warning: Even for one study, the search may take some time.
res = ds.search_studies(noaa_id=33213)
[2026-04-06 12:46:18,879][INFO] - search_studies: Limit defaulted to 100 (PyleoTUPS).
[2026-04-06 12:46:18,881][INFO] - search_studies: Input Query includes geographical bounds. Inspect the results to ensure they match your intended region as one study can contain sites across various parts of the world.
Request URL: https://www.ncei.noaa.gov/access/paleo-search/study/search.json?dataPublisher=NOAA&NOAAStudyId=33213&limit=100
Parsing NOAA studies: 100%|██████████| 1/1 [00:00<00:00, 1911.72it/s]
[2026-04-06 12:46:19,473][INFO] - Retrieved 1 studies.
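The "Request URL" line in the log above shows which query parameters were sent to the NOAA search endpoint. As a rough sketch (this is not PyleoTUPS's internal implementation), the same URL can be rebuilt with the Python standard library. The helper name `build_study_url` is hypothetical and only introduced here for illustration; the endpoint and parameter names are copied from the log.

```python
from urllib.parse import urlencode

# Endpoint copied from the "Request URL" line in the log above.
BASE_URL = "https://www.ncei.noaa.gov/access/paleo-search/study/search.json"


def build_study_url(noaa_id, limit=100):
    """Hypothetical helper: rebuild the logged query URL for one study ID."""
    params = {"dataPublisher": "NOAA", "NOAAStudyId": noaa_id, "limit": limit}
    return f"{BASE_URL}?{urlencode(params)}"


url = build_study_url(33213)
print(url)
```

Pasting the printed URL into a browser is a quick way to inspect the raw JSON that the rest of this tutorial turns into DataFrames.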

The get_summary method provides basic information about the dataset, such as the name of the study, the NOAA DataType, the time coverage, and the associated publication. The function returns a pandas.DataFrame.

df_summary = ds.get_summary()
display(df_summary)

Let’s have a look at the information returned in this DataFrame:

  • StudyID: Matches the value we searched for. While redundant in this case, it becomes useful when querying using other criteria.

  • XMLID: A different identifier for the same study. Each StudyID is associated with a corresponding XMLID.

  • Study Name: The name of the study.

  • DataType: Uses NOAA’s nomenclature for the data type.

  • EarliestYearBP, MostRecentYearBP, EarliestYearCE, MostRecentYearCE: The temporal bounds of the dataset.

  • Coverage: The geographic area covered by the study. In this example, it corresponds to a single point. This information can also be retrieved using the get_geo function.

  • StudyNotes: Includes descriptive notes and keywords associated with the study.

  • ScienceKeywords: Standardized science keywords.

  • Investigators: The contributors to the study.

  • Publications: Information about publications associated with the dataset. This can also be retrieved using the get_publications function.

  • Sites: The sites associated with the NOAA study. A single study may include multiple sites, and each site may contain multiple data tables. This information can be retrieved using the get_sites function.

  • Funding: Information about funding sources for the study. This can be retrieved using the get_funding function.

This function provides a high-level overview of the study and corresponds to the information available on the NOAA landing page.

To facilitate downstream analysis, we provide additional functions that extract key components of the study into separate pandas.DataFrame objects, making it easier to work with specific aspects of the data.
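The summary reports the temporal bounds in both years BP and years CE. As a minimal sketch of how the two scales relate, assuming the common convention that "present" is 1950 CE and ignoring the absence of a year zero, the conversion is a simple subtraction. The function names below are hypothetical and not part of PyleoTUPS.

```python
def bp_to_ce(year_bp):
    """Convert a year BP to a CE year, assuming 'present' means 1950 CE."""
    return 1950 - year_bp


def ce_to_bp(year_ce):
    """Convert a CE year to a year BP under the same 1950 CE assumption."""
    return 1950 - year_ce


print(bp_to_ce(1000))  # 950
print(bp_to_ce(-69))   # 2019 (negative BP values fall after 1950 CE)
print(ce_to_bp(1950))  # 0
```

Negative values on the BP scale are therefore not an error: they simply denote years after 1950 CE.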

Obtaining Information about Funding and Publications

To get more details about the funding:

df_funding = ds.get_funding()
display(df_funding)

Note that each table has a StudyID key, which plays the same role as a key in a relational database. It may not matter right now, but it becomes useful once you use more advanced search functionalities that return several studies.
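To make the relational-key idea concrete, here is a small sketch that joins two tables on the study identifier with pandas. The rows are made up for illustration, and the column names (`StudyID`, `FundingAgency`, `SiteName`) are assumptions; check the columns of your own DataFrames before merging.

```python
import pandas as pd

# Made-up rows for illustration only; real tables come from the get_* methods.
funding = pd.DataFrame({
    "StudyID": ["1001", "1002"],
    "FundingAgency": ["Agency A", "Agency B"],
})
sites = pd.DataFrame({
    "StudyID": ["1001", "1002", "1002"],
    "SiteName": ["Site X", "Site Y", "Site Z"],
})

# A left join keeps every site and attaches the funding row of its study.
merged = sites.merge(funding, on="StudyID", how="left")
print(merged)
```

With a single study the join is trivial, but the same pattern applies unchanged when a search returns many studies.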

In addition to funding information, PyleoTUPS allows you to get information about the publication associated with the dataset.

bib, df_pub = ds.get_publications()
display(df_pub)
Note: You can set the `save` parameter to True if you want to save a copy of the BibTeX entries.

Obtaining Information about Geographical Location

Similarly, you can return information about the location of the record:

df_geo = ds.get_geo()
display(df_geo)

When getting the geographical coordinates for a study, each row in the returned DataFrame will correspond to a specific site. In this case, there is only one site in the study. We will be looking at a more interesting case later in this tutorial.

Obtaining Variable Information

The next step is to actually obtain the data tables present in the dataset:

df_tables = ds.get_tables()
display(df_tables)

Note that each TableID is unique and can be used to get more information about the data. Most of the fields returned here are self-explanatory, but let’s go over the few that require further explanation and design consideration:

  • fileURL: This is where the actual data lives! So far, all the information we have gathered has been made available through the NOAA API response in a JSON file. Now it is time to actually read the data!

  • FileDescription: This is NOAA's internal label for the type of file found at the fileURL. PyleoTUPS can only read text files, that is, files described as NOAA Template File, NOAA Template File - Synthesis Metadata, Raw Measurements - NOAA Template File, or Chronology - NOAA Template File. PyleoTUPS will not work on other formats such as NetCDF (we recommend the xarray library for these files), .lpd (we have created a library called PyLiPD to deal with LiPD files), HTML, or pickle files.

Note: Let's make matters more complicated! The same TableID can appear multiple times, once for each file format (for instance, when the data are available both in the NOAA template and as a LiPD file). In addition, a TableID (which really identifies a file) can contain multiple tables.
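Before requesting data, it can help to narrow the table listing to files that are text-based and to spot IDs that appear in more than one format. The sketch below does both with plain pandas on a made-up listing. The column names, the IDs, and the "LiPD File" and "NetCDF File" descriptions are assumptions for illustration, not values returned by NOAA.

```python
import pandas as pd

# Made-up table listing; column names, IDs, and descriptions are illustrative.
tables = pd.DataFrame({
    "DataTableID": ["1", "1", "2", "3"],
    "FileDescription": [
        "NOAA Template File",
        "LiPD File",
        "Raw Measurements - NOAA Template File",
        "NetCDF File",
    ],
})

# Keep rows whose description mentions the NOAA text template.
is_template = tables["FileDescription"].str.contains("NOAA Template File", regex=False)
readable = tables[is_template]

# Flag IDs listed more than once (the same table offered in several formats).
is_repeated = tables["DataTableID"].duplicated(keep=False)
repeated_ids = tables.loc[is_repeated, "DataTableID"].unique().tolist()

print(readable)
print(repeated_ids)
```

Matching on the substring "NOAA Template File" is a simple heuristic that covers the four descriptions listed above; adjust it if your listing uses other labels.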

According to the table above, our study contains a single site (SiteID = 58697) and eight tables. To get the data, pass the DataTableID or the file URL to the get_data() method. Let’s have a look at the TEX86 data:

dfs = ds.get_data(dataTableIDs="45859")
type(dfs)
list

Notice that we are returning a list. This is because in some cases, the PyleoTUPS parser may identify more than one table in the file. This is most common with older studies.
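Because the return value is always a list, indexing with `[0]` silently ignores any additional tables. One defensive pattern, sketched here with a hypothetical helper (`single_table` is not part of PyleoTUPS) and a made-up stand-in for the returned list, is to fail loudly when the list does not contain exactly one DataFrame.

```python
import pandas as pd


def single_table(dfs):
    """Hypothetical helper: return the only DataFrame in a list, or raise."""
    if len(dfs) != 1:
        raise ValueError(f"Expected exactly one table, found {len(dfs)}")
    return dfs[0]


# Stand-in for the list returned by get_data(); the values are made up.
dfs_demo = [pd.DataFrame({"Age": [1.0, 2.0], "SST": [27.5, 28.1]})]
table = single_table(dfs_demo)
print(table.shape)
```

When a file does hold several tables, iterate over the list and inspect each DataFrame rather than relying on this helper.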

Let’s have a look at the first (and only) table:

display(dfs[0])

And we have the data! How about more metadata about the variables themselves?

Well, there is a function for this as well:

df_var = ds.get_variables(dataTableIDs="45859")
display(df_var)

For more information about the meaning of the columns and possible values, please have a look at the NOAA PaST Thesaurus.

Note: You can pass multiple table IDs as a list. The function will always return a list of DataFrames (which is why we selected the first one in the code above).

Some relevant metadata for each column is also stored in the DataFrame attributes:

dfs[0].attrs
{'variables': ['Site', 'Hole', 'Core', 'Type', 'Section', 'Section_Depth', 'Sample_Depth', 'Age', 'TEX86H', 'SST'], 'NOAAStudyId': '33213', 'StudyName': 'Bay of Bengal, Northeast Indian Margin Stable Isotope, Biomarker and SST Reconstructions since the Mid-Pleistocene'}
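The `attrs` dictionary is a standard pandas feature, so it can be read like any other dictionary. The sketch below uses a made-up DataFrame with the same kind of keys as the output above and checks that every listed variable is present as a column. Note that pandas documents `attrs` as experimental, so it is safest to copy the metadata you need before transforming the table.

```python
import pandas as pd

# Made-up table mimicking the structure shown above; values are illustrative.
df_demo = pd.DataFrame({"Age": [0.5, 1.5], "SST": [28.0, 27.2]})
df_demo.attrs = {"variables": ["Age", "SST"], "NOAAStudyId": "0000"}

# Copy the metadata we care about into plain Python objects.
study_id = df_demo.attrs["NOAAStudyId"]
listed = df_demo.attrs["variables"]

# Report any variable named in the metadata that is not a column.
missing = [name for name in listed if name not in df_demo.columns]
print(study_id, missing)
```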

Instead of the table ID, you can also pass the file URL to get the data:

df1 = ds.get_data(file_urls="https://www.ncei.noaa.gov/pub/data/paleo/contributions_by_author/clemens2021/clemens2021-u1446-mgca-noaa.txt")[0]
display(df1.head())

You can choose the method that is more convenient for you.

Let’s have a look at some more complicated cases

Dataset with Multiple Sites

As mentioned, some studies may have multiple sites. Let’s see how this affects the functionalities. For this example, let’s have a look at the temperature reconstruction for the Greater Yellowstone Ecoregion by King et al. (2021), which can be found here.

ds2 = pt.NOAADataset()
res = ds2.search_studies(noaa_id = 32833)
display(res)
[2026-04-06 14:11:19,202][INFO] - search_studies: Limit defaulted to 100 (PyleoTUPS).
[2026-04-06 14:11:19,203][INFO] - search_studies: Input Query includes geographical bounds. Inspect the results to ensure they match your intended region as one study can contain sites across various parts of the world.
Request URL: https://www.ncei.noaa.gov/access/paleo-search/study/search.json?dataPublisher=NOAA&NOAAStudyId=32833&limit=100
Parsing NOAA studies: 100%|██████████| 1/1 [00:00<00:00, 7084.97it/s]
[2026-04-06 14:11:19,554][INFO] - Retrieved 1 studies.

The one item to notice on the summary is the geographical coordinates, which indicate an area rather than a single point. Let’s get more information about the sites:

display(ds2.get_sites())
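When a study has several sites, the area reported in the summary can be compared with the individual site coordinates. As a rough sketch, assuming a site table with `Latitude` and `Longitude` columns (the column names and the coordinates below are made up), a simple bounding box is the minimum and maximum of each column.

```python
import pandas as pd

# Made-up site coordinates; column names are assumptions for illustration.
sites_demo = pd.DataFrame({
    "SiteName": ["A", "B", "C"],
    "Latitude": [44.1, 44.9, 43.7],
    "Longitude": [-110.5, -109.8, -110.9],
})

# Naive bounding box; it does not handle regions crossing the antimeridian.
bbox = {
    "south": float(sites_demo["Latitude"].min()),
    "north": float(sites_demo["Latitude"].max()),
    "west": float(sites_demo["Longitude"].min()),
    "east": float(sites_demo["Longitude"].max()),
}
print(bbox)
```
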
References
  1. Clemens, S. C., Yamamoto, M., Thirumalai, K., Giosan, L., Richey, J. N., Nilsson-Kerr, K., Rosenthal, Y., Anand, P., & McGrath, S. M. (2021). Remote and local drivers of Pleistocene South Asian summer monsoon precipitation: A test for future predictions. Science Advances, 7(23). 10.1126/sciadv.abg3848
  2. Heeter, K. J., Rochner, M. L., & Harley, G. L. (2021). Summer Air Temperature for the Greater Yellowstone Ecoregion (770–2019 CE) Over 1,250 Years. Geophysical Research Letters, 48(7). 10.1029/2020gl092269