Converting a NOAA dataset into the LiPD format¶

Authors¶

Deborah Khider

Preamble¶

This tutorial showcases how to create a LiPD dataset, a standardized format supporting reusable and reproducible analysis, from a dataset stored at NOAA NCEI using PyLiPD.

What is LiPD?¶

LiPD is a standardized way of storing paleoclimate datasets so they can be easily shared, understood, and reused.

Instead of distributing data, metadata, and documentation separately, LiPD packages everything together into a single structured file. This typically includes:

Time series data (e.g., proxy measurements)
Metadata (location, archive type, investigators)
Chronological information (age models, uncertainties)
Links to publications and methods

In practice, this means you can load a LiPD file and immediately have access to both the data and the context needed to interpret it.

Why use LiPD?¶

LiPD uses standardized vocabularies (e.g., the PaST Thesaurus), so variables and metadata are consistently described. This makes it much easier to understand datasets created by other researchers.
LiPD is designed to integrate with analysis tools and libraries (like Pyleoclim or PyLiPD), so you can load datasets with a few lines of code, query metadata programmatically, and combine multiple records into a single workflow.
Because all LiPD files follow the same structure, you can analyze many records at once and build reproducible workflows.
By keeping data, metadata, and methods together, LiPD makes it easier to reproduce published results, share complete datasets, and build transparent analysis pipelines.

Why convert to LiPD?¶

While repositories like NOAA and PANGAEA are excellent for data discovery and access, working directly with their native formats can be challenging: metadata may be inconsistent across datasets; file formats vary from one record to another; and additional processing is often needed before analysis. To address this, we convert datasets into the LiPD format.

LiPD standardizes both the data and metadata into a single, structured representation, making it easier to load datasets into Python workflows, compare multiple records consistently, and build reproducible analyses.

Goals¶

Understand how to retrieve the relevant information from the NOAA Data Center.
Use PyLiPD to create a LiPD file from the retrieved information.

Pre-requisites¶

Familiarity with the PyLiPD package.
An understanding of PyLiPD classes and how to create a LiPD file from this tutorial.

Reading time¶

Let’s import our packages!

import pyleotups as pt


import pylipd.classes.dataset as dataset
from pylipd.classes.archivetype import ArchiveTypeConstants, ArchiveType
from pylipd.classes.funding import Funding
from pylipd.classes.interpretation import Interpretation
from pylipd.classes.interpretationvariable import InterpretationVariableConstants, InterpretationVariable
from pylipd.classes.location import Location
from pylipd.classes.paleodata import PaleoData
from pylipd.classes.datatable import DataTable
from pylipd.classes.paleounit import PaleoUnitConstants, PaleoUnit
from pylipd.classes.paleovariable import PaleoVariableConstants, PaleoVariable
from pylipd.classes.person import Person
from pylipd.classes.publication import Publication
from pylipd.classes.resolution import Resolution
from pylipd.classes.variable import Variable
from pylipd.classes.model import Model
from pylipd.classes.chrondata import ChronData
from pylipd.classes.compilation import Compilation

from pylipd import LiPD


import json

import numpy as np
import pandas as pd

import re

Querying the NOAA database¶

The first step is to create a pyleotups.NOAADataset object in which we will store the relevant information:

ds=pt.NOAADataset()

Now that this is done, let’s search using the study ID. Here we are interested in converting the dataset from Dee et al. (2020), a monthly-resolved coral aragonite oxygen isotope ( $\delta^{18}O$ ) record from Palmyra Island.

res = ds.search_studies(noaa_id=27490)

[2026-05-01 11:11:22,115][INFO] - search_studies: Limit defaulted to 100 (PyleoTUPS).
[2026-05-01 11:11:22,116][INFO] - search_studies: Input Query includes geographical bounds. Inspect the results to ensure they match your intended region as one study can contain sites across various parts of the world.

Request URL: https://www.ncei.noaa.gov/access/paleo-search/study/search.json?dataPublisher=NOAA&NOAAStudyId=27490&limit=100

Parsing NOAA studies: 100%|██████████| 1/1 [00:00<00:00, 3715.06it/s]
[2026-05-01 11:11:22,521][INFO] - Retrieved 1 studies.

Let’s have a look at a summary of our dataset:

df_summary = ds.get_summary()
display(df_summary)

General Metadata¶

Let’s first obtain general metadata such as publication, location and funding. For publication:

bib, df_pub = ds.get_publications()
display(df_pub)

Let’s get the geographical information:

df_geo = ds.get_geo()
df_geo

And finally let’s get some information about the funding:

df_fund = ds.get_funding()
df_fund

Data and variable metadata:¶

Let’s get information about the data tables present in the file:

df_tables = ds.get_tables()
df_tables

There is only one table present, so let’s put it into a pandas DataFrame:

dfs = ds.get_data(dataTableIDs="41697")

dfs[0].head()

We need to coerce the values to numeric:

df_data = dfs[0].apply(pd.to_numeric, errors='coerce')
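
As a small illustration of what this coercion does, the sketch below applies the same call to a toy table. The values are made up and are not taken from the NOAA file; the point is only that entries that cannot be parsed as numbers become NaN instead of raising an error.

```python
import pandas as pd

# Toy table with made-up values; "missing" stands in for any non-numeric entry
toy = pd.DataFrame({
    "age": ["1990.04", "1990.13", "1990.21"],
    "d18O": ["-5.12", "missing", "-5.30"],
})

# Same call as in the tutorial: every column is converted, bad entries become NaN
toy_numeric = toy.apply(pd.to_numeric, errors="coerce")

print(toy_numeric.dtypes)
print(toy_numeric)
```

Because the bad entries are replaced silently, it can be worth counting the NaN values per column (for example with `toy_numeric.isna().sum()`) before moving on.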

And let’s get some variable information:

df_var = ds.get_variables(dataTableIDs="41697")
df_var

Now that we have all the necessary information, let’s convert into a LiPD format!

Converting to a LiPD-formatted dataset¶

Root Metadata¶

It’s time to create our LiPD dataset using PyLiPD. Let’s start by creating a pylipd.Dataset object:

dsl = dataset.Dataset()

Let’s add root metadata such as the name of the dataset, type of archive and datasetID:

dsl.setName('Palmyra.Dee.2020')
archiveType = ArchiveType.from_synonym('Coral')
dsl.setArchiveType(archiveType)
dsl.setOriginalDataUrl('https://www.ncei.noaa.gov/pub/data/paleo/coral/east_pacific/palmyra2020d18o.txt')
dsl.setDatasetId('DP2020PC')

Let’s add the investigators of the study, which are found in the df_summary. To do so, we need to create a Person object:

def extract_last_name_and_initials(full_name):
    parts = full_name.split()
    last_name = parts[-1]
    initials = '.'.join([p[0] for p in parts[:-1]]) + '.'
    return last_name, initials


authors = df_summary['Investigators'].to_list()[0]

# Step 1: Split the string by commas
parts = authors.split(',')

# Prepare a list to hold the formatted names
investigators = []

# Step 2: Iterate over the parts and format each name
for i in parts:  # each part is expected to hold one full name
    last_name,initial = extract_last_name_and_initials(i)
    person = Person() # create the Person object
    person.setName(f"{last_name}, {initial}")
    investigators.append(person)

# Step 3: Store the list of Persons into the dsl object
dsl.setInvestigators(investigators)
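
To make the expected input explicit, here is a minimal sketch of the same name formatting on a made-up investigator string. It assumes names arrive as "First M. Last" separated by commas; the real NOAA string may be formatted differently (a "Last, F." layout, for instance, would need a different parser). The helper `format_investigator` is a hypothetical name, and it also guards against empty entries and single-word names.

```python
def format_investigator(full_name):
    """Turn 'First M. Last' into 'Last, F.M.' (hypothetical helper)."""
    parts = full_name.split()          # split() also drops surrounding whitespace
    if not parts:
        return None                    # empty entry, e.g. from a trailing comma
    last_name = parts[-1]
    if len(parts) == 1:
        return last_name               # no given names to abbreviate
    initials = ".".join(p[0] for p in parts[:-1]) + "."
    return f"{last_name}, {initials}"


# Made-up input string, assumed to list 'First Last' names separated by commas
example = "Jane Q. Doe, John Smith, "
names = [format_investigator(n) for n in example.split(",")]
names = [n for n in names if n is not None]
print(names)
```

Each resulting string could then be passed to `Person.setName` exactly as in the loop above.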

Publication Metadata¶

Let’s add a publication object to our dataset from the information in df_pub:

pub = Publication()

# Let's start with the authors

authors = df_pub['Author'].to_list()[0]

# Step 1: Split the string by commas
parts = authors.split(',')

# Prepare a list to hold the formatted names
investigators = []

# Step 2: Iterate over the parts and format each name
for i in parts:  # each part is expected to hold one full name
    last_name,initial = extract_last_name_and_initials(i)
    person = Person() # create the Person object
    person.setName(f"{last_name}, {initial}")
    investigators.append(person)

# Step 3: Store the list of Persons into the pub object
pub.setAuthors(investigators)

Let’s get the rest of the information:

pub.setTitle(df_pub['Title'].iloc[0])
pub.setJournal(df_pub['Journal'].iloc[0])
pub.setYear(int(df_pub['Year'].iloc[0]))
pub.setVolume(str(df_pub['Volume'].iloc[0]))
pub.setPages(str(df_pub['Pages'].iloc[0]))
pub.setDOI(df_pub['DOI'].iloc[0])
pub.setUrls([df_pub['URL'].iloc[0]])
pub.setCiteKey(df_pub['CitationKey'].iloc[0])

Let’s add our publication object to the dataset:

dsl.setPublications([pub])

Funding Metadata¶

Let’s add information about the funding from df_fund. First let’s grab the names of all the grants.

def parse_grants(grant_string):
    result = []
    # Split different grant groups using semicolon
    groups = grant_string.split(';')
    
    for group in groups:
        group = group.strip()
        # Match prefix followed by space or hyphen, then numbers
        match = re.match(r'([A-Z]{3})[\s-]+(.+)', group)
        if match:
            prefix = match.group(1)
            number_part = match.group(2)
            numbers = re.findall(r'\d+', number_part)
            result.extend([f'{prefix}-{num}' for num in numbers])
    
    return result


grants = parse_grants(df_fund['FundingGrant'].iloc[0])
grants

['MGG-0752091', 'MGG-1502832', 'MGG-1836645', 'EAR-1347213']
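
The raw funding string is not shown above, so the sketch below runs the same regular expression on a hypothetical input of the shape the parser expects. It also collects the groups that do not match, since the function above drops them silently. The name `parse_grants_with_leftovers` and the example string are assumptions made for illustration.

```python
import re

def parse_grants_with_leftovers(grant_string):
    """Return (parsed_grants, unparsed_groups) for a ';'-separated grant string."""
    parsed, leftovers = [], []
    for group in grant_string.split(";"):
        group = group.strip()
        if not group:
            continue
        # Three capital letters, then spaces or hyphens, then the grant numbers
        match = re.match(r"([A-Z]{3})[\s-]+(.+)", group)
        if match:
            prefix, number_part = match.groups()
            parsed.extend(f"{prefix}-{num}" for num in re.findall(r"\d+", number_part))
        else:
            leftovers.append(group)  # keep track instead of dropping silently
    return parsed, leftovers


# Hypothetical input, not the actual NOAA string
example = "MGG-0752091, 1502832, 1836645; EAR 1347213; XYZ1234567"
parsed, leftovers = parse_grants_with_leftovers(example)
print(parsed)
print(leftovers)
```

Here the last group has no space or hyphen after the prefix, so it ends up in `leftovers` and can be inspected by hand.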

Let’s create a Funding object and add it to the Dataset:

fund = Funding()
fund.setFundingAgency(df_fund['FundingAgency'].iloc[0])
fund.setGrants(grants)

dsl.setFundings([fund])

Geographic Metadata¶

Let’s add information about the location associated with the dataset from df_geo. Since the site is a single point rather than a region, the minimum latitude, longitude, and elevation fully describe it, so we can use these values in the LiPD dataset. To do so, we need to create a Location object:

loc = Location()

loc.setLatitude(str(df_geo['MinLatitude'].iloc[0]))
loc.setLongitude(str(df_geo['MinLongitude'].iloc[0]))
loc.setElevation(str(df_geo['MinElevation'].iloc[0]))
loc.setSiteName(str(df_geo['SiteName'].iloc[0]))
loc.setLocationName(str(df_geo['LocationName'].iloc[0]))

dsl.setLocation(loc)
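
Using the minimum bounds is only safe when the bounding box collapses to a single point. A minimal check could look like the sketch below. The `MaxLatitude` and `MaxLongitude` keys are an assumption that mirrors the `Min*` column names used above (they do not appear in this tutorial), and the toy dictionaries stand in for a row of df_geo.

```python
import math

def is_point(row, tol=1e-9):
    """True when the bounding box collapses to a single point."""
    return (
        math.isclose(row["MinLatitude"], row["MaxLatitude"], abs_tol=tol)
        and math.isclose(row["MinLongitude"], row["MaxLongitude"], abs_tol=tol)
    )

# Toy rows with made-up coordinates (not taken from the NOAA record)
point_row = {"MinLatitude": 5.0, "MaxLatitude": 5.0,
             "MinLongitude": -160.0, "MaxLongitude": -160.0}
box_row = {"MinLatitude": 5.0, "MaxLatitude": 6.0,
           "MinLongitude": -160.0, "MaxLongitude": -158.0}

print(is_point(point_row))
print(is_point(box_row))
```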

PaleoData Metadata¶

Let’s enter the information relating to the paleoData. The data itself is stored in df_data while the metadata information about the variables is in df_var. Let’s first create our PaleoData object.

paleodata = PaleoData()

Our next step is to create measurement tables in this object. To do so, we can use the DataTable object:

table = DataTable()

Now let’s add some information about the table, such as the file name and the value used for missing values in the data:

table.setFileName("paleo0measurement0.csv")
table.setMissingValue("NaN")
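
The "NaN" marker lines up with how the values are serialized later in this tutorial with json.dumps: Python’s json module writes a float NaN as the bare token NaN. The sketch below shows this on toy values. How other LiPD readers treat the token is not covered here.

```python
import json
import math

# Toy values; float("nan") plays the role of a missing measurement
values = [-5.12, float("nan"), -5.30]

serialized = json.dumps(values)
print(serialized)

# Reading the string back with Python's json module restores the NaN
restored = json.loads(serialized)
print(math.isnan(restored[1]))
```

Note that the bare NaN token is a Python extension rather than strict JSON; calling `json.dumps(values, allow_nan=False)` raises a ValueError instead.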

The next step is to add columns to our table. In other words, we need to create some variables.

In LiPD, each variable is also given a unique ID. The function below generates one:

import uuid

def generate_unique_id(prefix='CPD'):
    # Generate a random UUID and drop the hyphens,
    # e.g. '1e2a2846-2048-480b-9ec6-674daef472bd' -> '1e2a28462048480b9ec6674daef472bd'
    id_str = str(uuid.uuid4()).replace('-', '')

    # Keep the first 15 characters and prepend the prefix
    formatted_id = f"{prefix}-{id_str[:15]}"

    return formatted_id
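
As a quick sanity check on the ID format, the sketch below generates a batch of IDs with a compact equivalent of the function (the name `make_variable_id` is hypothetical) and verifies their shape: the prefix, a hyphen, then 15 hexadecimal characters.

```python
import re
import uuid

def make_variable_id(prefix="CPD"):
    """Prefix, a hyphen, then the first 15 hex characters of a random UUID."""
    return f"{prefix}-{uuid.uuid4().hex[:15]}"

ids = [make_variable_id(prefix="PCU") for _ in range(1000)]

# Every ID has the same shape
pattern = re.compile(r"^PCU-[0-9a-f]{15}$")
print(all(pattern.match(i) for i in ids))

# Duplicates are very unlikely, but checking is cheap
print(len(set(ids)) == len(ids))
```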

Since variable names and units are controlled in LiPD, let’s see if we can get synonyms before we proceed. Let’s start with the standard variable name:

check_names = {}
for index, row in df_var.iterrows():
    check_names[row['VariableName']]= PaleoVariable.from_synonym(row['VariableName']).label

check_names

{'age': 'age', 'd18O': 'd18O'}

This looks good, let’s have a look at the units:

check_units = {}
for index, row in df_var.iterrows():
    # Keep the last level of the controlled-vocabulary unit string
    unit = row['cvUnit'].split('>')[-1]
    try:
        check_units[unit]= PaleoUnit.from_synonym(unit).label
    except Exception:
        # No match found for this unit name
        check_units[unit]= None

check_units

{'year Common Era': 'yr AD', 'per mil VPDB': None}

The “per mil VPDB” unit is not recognized automatically (not in the thesaurus). Let’s have a look at the standard unit names and see if we can manually select one that will match.

The permil entry will match, so let’s use this.
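
One way to keep such manual choices explicit is a small lookup table that is consulted before the automatic synonym search. The sketch below is only an illustration: `MANUAL_UNIT_SYNONYMS` and `unit_lookup_name` are hypothetical names, the example cvUnit strings are made up, and only the 'per mil VPDB' to 'permil' pairing comes from this tutorial. Unlike the code in this tutorial, it also strips whitespace around the unit name.

```python
# Manual overrides for unit names that have no automatic match.
# Only the 'per mil VPDB' -> 'permil' pairing comes from the tutorial.
MANUAL_UNIT_SYNONYMS = {"per mil VPDB": "permil"}

def unit_lookup_name(cv_unit):
    """Return the name to look up for a NOAA controlled-vocabulary unit string."""
    leaf = cv_unit.split(">")[-1].strip()   # keep the last level of the hierarchy
    return MANUAL_UNIT_SYNONYMS.get(leaf, leaf)

# Hypothetical 'cvUnit' strings; only the last segment matters here
print(unit_lookup_name("some category>per mil VPDB"))
print(unit_lookup_name("some category>year Common Era"))
```

The returned name can then be passed to PaleoUnit.from_synonym, as done for the other units in this tutorial.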

variables = []

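# Resolution: spacing between consecutive time values
# The age column is assumed to be the first column of df_data, matching the order of df_var
res = df_data.iloc[:, 0].diff()[1:].to_numpy()
Res = Resolution() # create a Resolution object - it will be the same for all variables since it is based on time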
Res.setMinValue(np.min(res))
Res.setMaxValue(np.max(res))
Res.setMeanValue(np.mean(res))
Res.setMedianValue(np.median(res))
from pylipd.globals.urls import UNITSURL
mynewunit = PaleoUnit(f"{UNITSURL}#year", "year")
Res.setUnits(mynewunit)

# Compilation
comp = Compilation() # create a compilation object
comp.setName('Pages2kTemperature')
comp.setVersions(['2_2_0'])

counter = 0

for index, row in df_var.iterrows():
    var = Variable()
    var.setName(row['VariableName']) # name of the variable
    # Now let's do the standard name
    var.setStandardVariable(PaleoVariable.from_synonym(row['VariableName']))
    var.setColumnNumber(counter+1) # The column in which the data is stored. Note that LiPD column numbers start at 1
    var.setVariableId(generate_unique_id(prefix='PCU')) # create a unique ID for the variable - change prefix if needed
    # Units
    unit = row['cvUnit'].split('>')[-1]
    if unit=='per mil VPDB':
        var.setUnits(PaleoUnit.from_synonym('permil'))
    else:
        var.setUnits(PaleoUnit.from_synonym(unit))
    # Make sure the data is JSON writable (no numpy arrays or Pandas DataFrame)
    var.setValues(json.dumps(df_data.iloc[:,counter].tolist()))
    # Calculate some metadata about the values - this makes it easier to do some queries later on, including looking for data in a particular time slice. 
    var.setMinValue(float(df_data.iloc[:,counter].min()))
    var.setMaxValue(float(df_data.iloc[:,counter].max()))
    var.setMeanValue(float(df_data.iloc[:,counter].mean()))
    var.setMedianValue(float(df_data.iloc[:,counter].median()))
    # Attach the resolution metadata information to the variable
    var.setResolution(Res)
    # if the variable is d18O, add information about the compilation
    if row['VariableName'] == 'd18O':
        var.setPartOfCompilations([comp])
    else:
        print('pass')
    # append in the list
    variables.append(var) 
    # add to the counter
    counter+=1

pass
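
The resolution metadata above summarizes the spacing between consecutive time values. As a sketch of that calculation on its own, the example below uses a made-up, irregularly spaced age axis and only the standard library; the numbers are not taken from the Palmyra record.

```python
import statistics

# Toy age axis in years with made-up values (irregular spacing on purpose)
ages = [2000.0, 2000.5, 2001.0, 2002.0]

# Spacing between consecutive samples; abs() keeps it positive if ages decrease
steps = [abs(b - a) for a, b in zip(ages, ages[1:])]

summary = {
    "min": min(steps),
    "max": max(steps),
    "mean": statistics.mean(steps),
    "median": statistics.median(steps),
}
print(summary)
```

Taking the absolute value is a design choice for records stored from youngest to oldest, where the raw differences would otherwise be negative.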

Let’s now put our variables in the DataTable:

table.setVariables(variables)

Then the table goes into the PaleoData object:

paleodata.setMeasurementTables([table])

And finally, the PaleoData object into the Dataset:

dsl.setPaleoData([paleodata])

Writing a LiPD file¶

The last step in this process is to write to a LiPD file. To do so, you need to pass the Dataset dsl back into a LiPD object:

lipd = LiPD()
lipd.load_datasets([dsl])
lipd.create_lipd(dsl.getName(), "../data/Palmyra.Dee.2020.lpd");

[2026-05-01 15:17:29,662][INFO] - Creating bag for directory /var/folders/xj/p7h9764x7cx0by8547l04rrr0000gn/T/rdf_to_lipd_1qn3osjd/Palmyra.Dee.2020
[2026-05-01 15:17:29,662][INFO] - Creating data directory
[2026-05-01 15:17:29,663][INFO] - Moving paleo0measurement0.csv to /private/var/folders/xj/p7h9764x7cx0by8547l04rrr0000gn/T/rdf_to_lipd_1qn3osjd/Palmyra.Dee.2020/tmppkkqguiu/paleo0measurement0.csv
[2026-05-01 15:17:29,663][INFO] - Moving metadata.jsonld to /private/var/folders/xj/p7h9764x7cx0by8547l04rrr0000gn/T/rdf_to_lipd_1qn3osjd/Palmyra.Dee.2020/tmppkkqguiu/metadata.jsonld
[2026-05-01 15:17:29,663][INFO] - Moving /private/var/folders/xj/p7h9764x7cx0by8547l04rrr0000gn/T/rdf_to_lipd_1qn3osjd/Palmyra.Dee.2020/tmppkkqguiu to data
[2026-05-01 15:17:29,664][INFO] - Using 1 processes to generate manifests: md5
[2026-05-01 15:17:29,664][INFO] - Generating manifest lines for file data/metadata.jsonld
[2026-05-01 15:17:29,665][INFO] - Generating manifest lines for file data/paleo0measurement0.csv
[2026-05-01 15:17:29,665][INFO] - Creating bagit.txt
[2026-05-01 15:17:29,666][INFO] - Creating bag-info.txt
[2026-05-01 15:17:29,666][INFO] - Creating /var/folders/xj/p7h9764x7cx0by8547l04rrr0000gn/T/rdf_to_lipd_1qn3osjd/Palmyra.Dee.2020/tagmanifest-md5.txt
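
To confirm what was written, one option is to list the contents of the resulting file. The sketch below assumes that the .lpd file is a zip archive of the BagIt directory described in the log; it builds a stand-in archive so that it runs without the real file. The helper `list_archive` is a hypothetical name, and the member names are borrowed from the log messages, so the real layout may differ.

```python
import os
import tempfile
import zipfile

def list_archive(path):
    """Return the sorted member names of a zip archive."""
    with zipfile.ZipFile(path) as zf:
        return sorted(zf.namelist())

# Build a stand-in archive so the sketch runs without the real file.
# The member names are borrowed from the log above; the real layout may differ.
tmp_dir = tempfile.mkdtemp()
toy_path = os.path.join(tmp_dir, "toy.lpd")
with zipfile.ZipFile(toy_path, "w") as zf:
    zf.writestr("bagit.txt", "toy")
    zf.writestr("data/metadata.jsonld", "{}")
    zf.writestr("data/paleo0measurement0.csv", "2000.0,-5.1\n")

members = list_archive(toy_path)
print(members)
```

If the assumption holds, calling `list_archive("../data/Palmyra.Dee.2020.lpd")` after the write step should show comparable entries.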

References¶

Ratnakar, V., & Khider, D. (2025). PyLiPD: A python package for the manipulation of paleoclimate datasets. Journal of Open Source Software, 10(115), 8861. https://doi.org/10.21105/joss.08861
Dee, S. G., Cobb, K. M., Emile-Geay, J., Ault, T. R., Edwards, R. L., Cheng, H., & Charles, C. D. (2020). No consistent ENSO response to volcanic forcing over the last millennium. Science, 367(6485), 1477–1481. https://doi.org/10.1126/science.aax2000