Converting a PANGAEA dataset into the LiPD format¶

Authors¶

Deborah Khider

Preamble¶

This tutorial showcases how to create a LiPD dataset, a standardized format supporting reusable and reproducible analysis, from a dataset stored at PANGAEA using PyLiPD.

What is LiPD?¶

LiPD is a standardized way of storing paleoclimate datasets so they can be easily shared, understood, and reused.

Instead of distributing data, metadata, and documentation separately, LiPD packages everything together into a single structured file. This typically includes:

Time series data (e.g., proxy measurements)
Metadata (location, archive type, investigators)
Chronological information (age models, uncertainties)
Links to publications and methods

In practice, this means you can load a LiPD file and immediately have access to both the data and the context needed to interpret it.

Why use LiPD?¶

LiPD uses standardized vocabularies (e.g., the PaST Thesaurus), so variables and metadata are consistently described. This makes it much easier to understand datasets created by other researchers.
LiPD is designed to integrate with analysis tools and libraries (like Pyleoclim or PyLiPD), so you can load datasets with a few lines of code, query metadata programmatically, and combine multiple records into a single workflow.
Because all LiPD files follow the same structure, you can analyze many records at once and build reproducible workflows.
By keeping data, metadata, and methods together, LiPD makes it easier to reproduce published results, share complete datasets, and build transparent analysis pipelines.

Why convert to LiPD?¶

While repositories like NOAA and PANGAEA are excellent for data discovery and access, working directly with their native formats can be challenging: metadata may be inconsistent across datasets, file formats vary from one record to another, and additional processing is often needed before analysis. To address this, we convert datasets into the LiPD format.

LiPD standardizes both the data and metadata into a single, structured representation, making it easier to load datasets into Python workflows, compare multiple records consistently, and build reproducible analyses.

Goals¶

Understand how to retrieve relevant information from PANGAEA.
Learn how PANGAEA collections map to LiPD datasets.
Use PyLiPD to create a LiPD file from the retrieved information.

Pre-requisites¶

Familiarity with the PyLiPD package.
An understanding of PyLiPD classes and how to create a LiPD file from this tutorial.

Reading time¶

Let’s import our packages!

import pyleotups as pt

# LiPD
import pylipd.classes.dataset as dataset
from pylipd.classes.archivetype import ArchiveTypeConstants, ArchiveType
from pylipd.classes.funding import Funding
from pylipd.classes.interpretation import Interpretation
from pylipd.classes.interpretationvariable import InterpretationVariableConstants, InterpretationVariable
from pylipd.classes.location import Location
from pylipd.classes.paleodata import PaleoData
from pylipd.classes.datatable import DataTable
from pylipd.classes.paleounit import PaleoUnitConstants, PaleoUnit
from pylipd.classes.paleovariable import PaleoVariableConstants, PaleoVariable
from pylipd.classes.person import Person
from pylipd.classes.publication import Publication
from pylipd.classes.resolution import Resolution
from pylipd.classes.variable import Variable
from pylipd.classes.model import Model
from pylipd.classes.chrondata import ChronData
from pylipd.classes.compilation import Compilation

from pylipd import LiPD

# General purpose
import json
import re

# Scientific libraries
import numpy as np
import pandas as pd

# Bibliography
from doi2bib import crossref
import bibtexparser

Collection, Dataset, and LiPD¶

PANGAEA uses the concept of collection to group datasets that are related to each other. For paleoclimate studies, in practice, collections can either be:

Datasets that are obtained from different physical samples and/or were used in a compilation study. In this case, we recommend storing each dataset in its own LiPD file, with appropriate metadata. Let’s take the example of the compilation study by Leduc et al. (2020), which is available on PANGAEA here. Each of the 133 records is part of the collection and is accessible through its own DOI. In this case, one would create 133 LiPD datasets and relate them through the part of compilation ontological property, which relates a Variable to a Compilation. Using PyLiPD, one would first create a Compilation object with a name and version, then relate each Variable object to the Compilation using the setPartOfCompilations method.
Datasets that report different measurements performed on the same physical sample. In this case, we recommend archiving all the PANGAEA datasets into one LiPD Dataset. This tutorial addresses this particular use case using the data produced by Khider et al. (2011).

If you are already confused by all the datasets, fear not, you are not alone (and this is why ontologies exist, but that’s another tutorial). For clarity, in this tutorial, we will refer to the PANGAEA dataset as PANGAEADataset and to the LiPD dataset simply as Dataset, following the nomenclature used for the objects in PyleoTUPS and PyLiPD respectively.

Let’s get started!

Retrieving information from PANGAEA¶

Let’s retrieve information about the collection from PANGAEA:

ds_col = pt.PangaeaDataset()
res = ds_col.search_studies(study_ids ='830589')

display(res)

[2026-05-08 08:57:11,410][INFO] - Registering Study 830589 via direct lookup.
[2026-05-08 08:57:12,518][INFO] - Retrived 1 studies
[2026-05-08 08:57:12,520][WARNING] - The search contains dataset(s) [830589] marked as collection. Refer to the 'CollectionMembers' column toidentify respective child datasets.

Now let’s grab the datasets in the collection:

ids = []

for idx,row in res.iterrows():
    if row.at['CollectionMembers'] is not None:
        ids.extend(row['CollectionMembers'])

ds_pangaea = pt.PangaeaDataset()
res2 = ds_pangaea.search_studies(study_ids = ids)

display(res2)

[2026-05-08 08:57:12,541][INFO] - Registering Study 830586 via direct lookup.
[2026-05-08 08:57:14,201][INFO] - Registering Study 830587 via direct lookup.
[2026-05-08 08:57:15,816][INFO] - Registering Study 830588 via direct lookup.
[2026-05-08 08:57:17,439][INFO] - Retrived 3 studies
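
The loop above relies on `CollectionMembers` being either `None` or a list. A slightly more defensive variant is sketched below on plain dictionaries rather than on the PyleoTUPS DataFrame. It also skips missing values stored as NaN and removes duplicates while keeping the order. The helper name `collect_member_ids`, the `StudyID` key, and the toy rows are hypothetical and only serve the illustration.

```python
import math

def collect_member_ids(rows):
    """Flatten the 'CollectionMembers' entries of several rows into one ordered, de-duplicated list."""
    ids = []
    for row in rows:
        members = row.get('CollectionMembers')
        # Skip rows that are not collections: None or a float NaN placeholder
        if members is None or (isinstance(members, float) and math.isnan(members)):
            continue
        for member in members:
            member = str(member)  # normalise IDs to strings
            if member not in ids:
                ids.append(member)
    return ids

# Toy rows mimicking search results (illustrative values only)
rows = [
    {'StudyID': '830589', 'CollectionMembers': ['830586', '830587', '830588']},
    {'StudyID': 'A', 'CollectionMembers': None},
    {'StudyID': 'B', 'CollectionMembers': float('nan')},
]
member_ids = collect_member_ids(rows)
print(member_ids)
```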

Root metadata¶

Let’s get some basic information. First, geographical information. Since there is only one site, we can use the collection to obtain the information.

geo = ds_col.get_geo()

display(geo)

We can follow a similar process for publications:

bib, df_pub = ds_col.get_publications()

display(df_pub)

Remember that PANGAEA returns the citation for the dataset rather than for the study publication. The reference to the study publication is embedded in the CitationKey field, after "Supplement to:":

df_pub['CitationKey'].iloc[0]

'Khider, D; Stott, Lowell D; Emile-Geay, J; Thunell, Robert C; Hammond, Douglas E (2011): Stable isotope record of sediment core MD98-2177 [dataset publication series]. PANGAEA, https://doi.org/10.1594/PANGAEA.830589, Supplement to: Khider, D et al. (2011): Assessing El Niño Southern Oscillation variability during the past millennium. Paleoceanography, 26(3), PA3222, https://doi.org/10.1029/2011PA002139'

We can create a simple function that retrieves the DOI from the journal citation. This DOI can be used to obtain the citation from crossref:

def extract_supplement_doi(text):
    """Extract DOI from the '(In) supplement to' section of a citation string, if present."""
    match = re.search(r'(?i)supplement\s+to:.*?(10\.\d{4,9}/[^\s,]+)', text)
    return match.group(1) if match else None

doi = extract_supplement_doi(df_pub['CitationKey'].iloc[0])
print(doi)

10.1029/2011PA002139
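
To see how the regular expression behaves, here is a small self-contained check on two strings: an abridged version of the citation above, and a hypothetical citation without a "Supplement to:" section. Both test strings are shortened for the illustration.

```python
import re

def extract_supplement_doi(text):
    """Extract DOI from the '(In) supplement to' section of a citation string, if present."""
    match = re.search(r'(?i)supplement\s+to:.*?(10\.\d{4,9}/[^\s,]+)', text)
    return match.group(1) if match else None

# Abridged citation containing both the dataset DOI and the article DOI
with_supplement = (
    "Khider, D et al. (2011): Stable isotope record of sediment core MD98-2177. "
    "PANGAEA, https://doi.org/10.1594/PANGAEA.830589, "
    "Supplement to: Khider, D et al. (2011): Paleoceanography, 26(3), PA3222, "
    "https://doi.org/10.1029/2011PA002139"
)
# Hypothetical citation with no journal article attached
without_supplement = "Khider, D et al. (2011): Stable isotope record. PANGAEA, https://doi.org/10.1594/PANGAEA.830589"

doi_found = extract_supplement_doi(with_supplement)
doi_missing = extract_supplement_doi(without_supplement)
print(doi_found)    # 10.1029/2011PA002139
print(doi_missing)  # None
```

Only the DOI that follows "Supplement to:" is returned, so the dataset DOI earlier in the string is ignored. Note that the character class `[^\s,]+` would also capture a trailing period if the citation ended with one.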

Let’s retrieve the information:

status, citation = crossref.get_bib(doi = doi)

print(citation)

 @article{Khider_2011, title={Assessing El Niño Southern Oscillation variability during the past millennium}, volume={26}, ISSN={1944-9186}, url={http://dx.doi.org/10.1029/2011PA002139}, DOI={10.1029/2011pa002139}, number={3}, journal={Paleoceanography}, publisher={American Geophysical Union (AGU)}, author={Khider, D. and Stott, L. D. and Emile‐Geay, J. and Thunell, R. and Hammond, D. E.}, year={2011}, month=Sept }

We can use the information in the BibTeX entry to create our LiPD Dataset.

Creating a LiPD Dataset¶

Since we have some basic information already from the collection (and it will be easier to work with the tables one at a time), let’s create our basic dataset:

dsl = dataset.Dataset()

Let’s add root metadata such as the name of the dataset, type of archive and datasetID:

dsl.setName('MD982177.Khider.2011')
archiveType = ArchiveType.from_synonym('MarineSediment')
dsl.setArchiveType(archiveType)
dsl.setOriginalDataUrl('https://doi.pangaea.de/10.1594/PANGAEA.830589')
dsl.setDatasetId('MD77DK11')

Let’s add the investigators of the study, which are listed in the Investigators column of the search results (res). To do so, we need to create a Person object for each investigator:

authors = res['Investigators'].to_list()[0]

# Step 1: Split the string by commas.
# Note that in PANGAEA, the authors are separated by commas, but so are the last names and initials.
parts = [p.strip() for p in authors.split(',')]
# Now we need to put the last, initial pairs together
names =  [(parts[i], parts[i + 1]) for i in range(0, len(parts) - 1, 2)]

# Prepare a list to hold the formatted names
investigators = []

# Step 2: Create a Person object for each (last name, initials) pair
for last, initials in names:
    person = Person() # create the Person object
    person.setName(f"{last}, {initials}")
    investigators.append(person)

# Step 3: Store the list of Persons into the Dataset object
dsl.setInvestigators(investigators)
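
The pairing logic can be tried on its own, without PyLiPD. The sketch below assumes the investigator string has the form "Last, Initials, Last, Initials, ..." (the example string is reconstructed from the investigator names of this dataset), and the helper name `split_pangaea_names` is hypothetical. It raises an error when the number of comma-separated parts is odd, a case in which pairing by position would silently drop or misalign a name.

```python
def split_pangaea_names(authors):
    """Turn 'Last, Initials, Last, Initials, ...' into a list of 'Last, Initials' strings."""
    parts = [p.strip() for p in authors.split(',')]
    if len(parts) % 2 != 0:
        raise ValueError(f"Expected an even number of comma-separated parts, got {len(parts)}")
    # Pair each last name (even positions) with the initials that follow it (odd positions)
    return [f"{parts[i]}, {parts[i + 1]}" for i in range(0, len(parts), 2)]

authors_example = "Khider, D, Stott, Lowell D, Emile-Geay, J, Thunell, Robert C, Hammond, Douglas E"
name_list = split_pangaea_names(authors_example)
print(name_list)
```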

Publication metadata¶

Let’s add publication information based on the retrieved citation from crossref:

pub = Publication()

First, let’s parse the BibTeX citation using the bibtexparser library:

parser = bibtexparser.bparser.BibTexParser(common_strings=True)
citation = re.sub(r'month\s*=\s*(\w+)', r'month={\1}', citation) # deals with the fact that the month of september should not be treated as a variable
library = bibtexparser.loads(citation, parser=parser)
d = library.entries[0]

d.keys()

dict_keys(['month', 'year', 'author', 'publisher', 'journal', 'number', 'doi', 'url', 'issn', 'volume', 'title', 'ENTRYTYPE', 'ID'])
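
The `re.sub` call used before parsing can be checked in isolation. In the entry returned by Crossref, the month appears without braces (`month=Sept`), which BibTeX interprets as a reference to a string variable rather than a literal value. The sketch below, run on a shortened version of the entry, shows that the substitution wraps the value in braces and leaves an already braced value untouched.

```python
import re

# Shortened version of the entry returned by Crossref
raw = "@article{Khider_2011, year={2011}, month=Sept }"

pattern = r'month\s*=\s*(\w+)'
fixed = re.sub(pattern, r'month={\1}', raw)
print(fixed)  # @article{Khider_2011, year={2011}, month={Sept} }

# Applying the substitution a second time changes nothing,
# because '{' is not a word character and the pattern no longer matches
fixed_twice = re.sub(pattern, r'month={\1}', fixed)
print(fixed_twice == fixed)  # True
```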

Let’s start with the authors:

# Let's start with the authors

authors = d['author']

# Step 1: Split the author field.
# Note that in BibTeX, authors are separated by "and"; the comma separates the last name from the initials.
names = [name.strip() for name in re.split(r'\s+and\s+', authors)]

# Prepare a list to hold the Person objects
investigators = []

# Step 2: Create a Person object for each author
for name in names:
    person = Person() # create the Person object
    person.setName(name)
    investigators.append(person)

# Step 3: Store the list of Persons into the Publication object
pub.setAuthors(investigators)
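
In a BibTeX `author` field, individual authors are separated by the word "and", while the comma separates the last name from the initials. A minimal, library-free sketch of splitting on that separator is shown here, using the author string from the Crossref entry (with a plain hyphen in "Emile-Geay"). The helper name `split_bibtex_authors` is hypothetical.

```python
import re

def split_bibtex_authors(author_field):
    """Split a BibTeX author field on the 'and' separator, keeping 'Last, Initials' intact."""
    pieces = re.split(r'\s+and\s+', author_field.strip())
    return [piece.strip() for piece in pieces if piece.strip()]

author_field = "Khider, D. and Stott, L. D. and Emile-Geay, J. and Thunell, R. and Hammond, D. E."
bib_names = split_bibtex_authors(author_field)
print(bib_names)
```

Splitting the same string on commas would instead produce fragments such as "D. and Stott", which mix the initials of one author with the last name of the next.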

Next, let’s get the rest of the publication information:

pub.setTitle(d['title'])
pub.setJournal(d['journal'])
pub.setYear(int(d['year']))
pub.setVolume(str(d['volume']))
pub.setDOI(d['doi'])
pub.setUrls([d['url']])

Let’s add our publication to the Dataset object:

dsl.setPublications([pub])

Geographic metadata¶

Since we don’t have information about funding, let’s get to the geographical coordinates:

loc = Location()

loc.setLatitude(str(geo['MinLatitude'].iloc[0]))
loc.setLongitude(str(geo['MinLongitude'].iloc[0]))
loc.setElevation(str(geo['Elevation'].iloc[0]))
loc.setSiteName(str(geo['SiteName'].iloc[0]))

dsl.setLocation(loc)

We are done with the general metadata, now let’s have a look at the three tables comprising our dataset from the information we have retrieved from PANGAEA:

(Table S1) Stable carbon and oxygen isotope ratios of Pulleniatina obliquiloculata of sediment core MD98-2177 - This looks like PaleoData information
(Table 2) Age determination of sediment core MD98-2177 - This looks like age information
(Table 3) Lead 214 and Lead 210 concentration of sediment core MD98-2177 - This looks like age information.

In LiPD, we will be storing the first table in a PaleoData object and the last two tables in a ChronData object.

Let’s start with the PaleoData.

PaleoData¶

Let’s first retrieve the data table and variable information associated with the first table. Remember that PANGAEA stores one table per PANGAEADataset, so we just need to retrieve the data from the study with ID 830586.

df_data = ds_pangaea.get_data(study_id = '830586')[0]
display(df_data.head())

Remember that PANGAEA always adds the Event, location and time information to all the datasets. Let’s drop these columns as this information is not needed here:

df_data.drop(columns=df_data.columns[-5:], inplace=True)
display(df_data.head())

Next let’s get some information about our variables:

df_var = ds_pangaea.get_variables(study_ids = '830586')
display(df_var.head())

Three things to notice:

The metadata associated with each variable is a lot less rich than what is stored at NOAA NCEI
The vocabulary is less aligned to the ontology. This part makes sense since we used the NOAA PaST Thesaurus to create the ontology.
The OntologyTerms refer to the match in the PANGAEA internal ontology, not the LinkedEarth Ontology, which forms the basis of how LiPD datasets are represented in PyLiPD.

Also, this DataFrame contains descriptions of the five columns we removed from the data table. Let’s drop these rows as well:

df_var.drop(index=df_var.index[-5:], inplace=True)

But let’s get started on adding our measurement table to the LiPD file. The first step is to create a PaleoData object:

paleodata = PaleoData()

Next, we need a measurement table to be stored within that object. To do so, we can use the DataTable object:

table = DataTable()

Now let’s add some information about the table, such as the file name and the value used for missing values in the data:

table.setFileName("paleo0measurement0.csv")
table.setMissingValue("NaN")

The next step is to create a Variable object for each of the data columns.

In LiPD, each variable is also given a unique ID. The function below generates one, using a prefix that we will relate to the study when invoking the function:

import uuid

def generate_unique_id(prefix='CPD'):
    # Generate a random UUID and remove the hyphens,
    # e.g. '1e2a2846-2048-480b-9ec6-674daef472bd' becomes '1e2a28462048480b9ec6674daef472bd'
    id_str = str(uuid.uuid4()).replace('-','')

    # Keep the first 15 characters and prepend the prefix
    formatted_id = f"{prefix}-{id_str[:15]}"

    return formatted_id
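
As a quick check of what such identifiers look like, the sketch below uses a compact version of the same idea (`uuid.uuid4().hex` is the UUID without hyphens) and verifies the shape of the result. The prefix `MD77DK` is the one used later in this tutorial. The uniqueness check over 1000 draws is only a sanity test, not a guarantee.

```python
import uuid

def generate_unique_id(prefix='CPD'):
    """Return '<prefix>-' followed by the first 15 hexadecimal characters of a random UUID."""
    return f"{prefix}-{uuid.uuid4().hex[:15]}"

sample_ids = [generate_unique_id(prefix='MD77DK') for _ in range(1000)]
print(sample_ids[0])

# Every ID has the prefix, a hyphen, and 15 characters
print(all(len(i) == len('MD77DK') + 1 + 15 for i in sample_ids))
# No duplicates among the generated IDs
print(len(set(sample_ids)) == len(sample_ids))
```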

Since variable names and units are controlled in LiPD, let’s see if we can get synonyms before we proceed. Let’s start with the standard variable name:

check_names = []
for index, row in df_var.iterrows():
    try:
        # from_synonym returns no usable object when there is no match, so accessing .label fails
        check_names.append(PaleoVariable.from_synonym(row['VariableName']).label)
    except Exception:
        print(f"{row['VariableName']} does not have a synonym")
        check_names.append(None)

DEPTH, sediment/rock does not have a synonym
Depth, top/min does not have a synonym
Depth, bottom/max does not have a synonym
Age does not have a synonym
Age does not have a synonym
AGE does not have a synonym
Pulleniatina obliquiloculata, δ13C does not have a synonym
Pulleniatina obliquiloculata, δ18O does not have a synonym
Mass does not have a synonym

In the output above, no synonyms were found, which is not surprising since the names are quite descriptive and contain additional information (such as the species). PyLiPD enforces a strict synonym match to avoid any problems with translating scientific information.

For instance, the first three columns relate to different concepts of depth, which may not be easily matched.

How about using the OntologyTerms? Let’s have a look at the first two rows:

df_var['OntologyTerms'].iloc[0]

[{'id': 1073131,
  'name': 'depth',
  'semantic_uri': 'urn:obo:pato:term:0001595',
  'ontology': 14},
 {'id': 38263, 'name': 'sediment', 'semantic_uri': None, 'ontology': 18},
 {'id': 41056, 'name': 'rock', 'semantic_uri': None, 'ontology': 18},
 {'id': 43863,
  'name': 'Length',
  'semantic_uri': 'http://qudt.org/1.1/vocab/quantity#Length',
  'ontology': 13}]

df_var['OntologyTerms'].iloc[1]

[{'id': 1073131,
  'name': 'depth',
  'semantic_uri': 'urn:obo:pato:term:0001595',
  'ontology': 14},
 {'id': 43863,
  'name': 'Length',
  'semantic_uri': 'http://qudt.org/1.1/vocab/quantity#Length',
  'ontology': 13}]

Both refer to the same concept of depth, with no distinction between top and bottom. This distinction matters in paleoclimate applications and is therefore represented in the LinkedEarth ontology.
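
The comparison can be made explicit with a few lines of plain Python. The sketch below copies the two lists of terms shown above and compares the names they contain. The helper `term_names` is hypothetical.

```python
# Terms attached to 'DEPTH, sediment/rock' (copied from the output above)
depth_terms = [
    {'id': 1073131, 'name': 'depth', 'semantic_uri': 'urn:obo:pato:term:0001595', 'ontology': 14},
    {'id': 38263, 'name': 'sediment', 'semantic_uri': None, 'ontology': 18},
    {'id': 41056, 'name': 'rock', 'semantic_uri': None, 'ontology': 18},
    {'id': 43863, 'name': 'Length', 'semantic_uri': 'http://qudt.org/1.1/vocab/quantity#Length', 'ontology': 13},
]
# Terms attached to 'Depth, top/min' (copied from the output above)
depth_top_terms = [
    {'id': 1073131, 'name': 'depth', 'semantic_uri': 'urn:obo:pato:term:0001595', 'ontology': 14},
    {'id': 43863, 'name': 'Length', 'semantic_uri': 'http://qudt.org/1.1/vocab/quantity#Length', 'ontology': 13},
]

def term_names(terms):
    """Return the lower-cased names of a list of ontology terms."""
    return {term['name'].lower() for term in terms}

shared = sorted(term_names(depth_terms) & term_names(depth_top_terms))
print(shared)  # ['depth', 'length']

# None of the terms attached to 'Depth, top/min' mentions the top of the interval
mentions_top = any('top' in name or 'min' in name for name in term_names(depth_top_terms))
print(mentions_top)  # False
```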

There is no good way to lift the necessary information programmatically in this case, so we will create the lists by hand from the information provided in the df_var DataFrame.

for ids,row in df_var.iterrows():
    print(f"variable name: {row['VariableName']}; units: {row['Unit']}")

variable name: DEPTH, sediment/rock; units: m
variable name: Depth, top/min; units: m
variable name: Depth, bottom/max; units: m
variable name: Age; units: a AD/CE
variable name: Age; units: a AD/CE
variable name: AGE; units: ka BP
variable name: Pulleniatina obliquiloculata, δ13C; units: ‰ PDB
variable name: Pulleniatina obliquiloculata, δ18O; units: ‰ PDB
variable name: Mass; units: µg

From this information, we can create lists to store the necessary information into our LiPD dataset.
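
Because hand-written lists are matched to the columns by position only, it can help to print the pairs and check the lengths before using them. The following is a minimal sketch in plain Python. The PANGAEA names are copied from the output above, the standard names follow the same column order (δ13C comes before δ18O), and the helper `pair_names` is hypothetical.

```python
pangaea_names = [
    'DEPTH, sediment/rock', 'Depth, top/min', 'Depth, bottom/max',
    'Age', 'Age', 'AGE',
    'Pulleniatina obliquiloculata, δ13C', 'Pulleniatina obliquiloculata, δ18O',
    'Mass',
]
# Hand-written standard names, in the same order as the columns
standard_names = ['depth', 'depthTop', 'depthBottom', 'age', 'age', 'age', 'd13C', 'd18O', 'mass']

def pair_names(original, standard):
    """Pair each original name with its hand-written standard name, by position."""
    if len(original) != len(standard):
        raise ValueError(f"{len(original)} columns but {len(standard)} standard names")
    return list(zip(original, standard))

pairs = pair_names(pangaea_names, standard_names)
for original, standard in pairs:
    print(f"{original!r} -> {standard!r}")
```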

Note: `Mass` is not a concept in the Ontology so we will need to create a new Variable name.

To see which concepts are available in the Ontology:

Furthermore, some of the variable names contain information about the proxy used. In LiPD, this information can be stored via the `

varName = ['depth', 'depthTop', 'depthBottom','age', 'age','age','d13C', 'd18O','mass'] # same order as the columns: δ13C comes before δ18O
units = ['m', 'm', 'm', 'yr AD', 'yr AD', 'yr BP', 'permil','permil','um']

variables = []

# Resolution - infer the resolution from the age
age_steps = np.abs(df_data.iloc[:, 3].diff()[1:].to_numpy()) # named age_steps so that the search results stored in res are not overwritten
Res = Resolution() # create a Resolution object - it will be the same for all variables since it is based on time
Res.setMinValue(np.min(age_steps))
Res.setMaxValue(np.max(age_steps))
Res.setMeanValue(np.mean(age_steps))
Res.setMedianValue(np.median(age_steps))
from pylipd.globals.urls import UNITSURL
mynewunit = PaleoUnit(f"{UNITSURL}#year", "year") # this creates the necessary URI for ingestion in the ontology
Res.setUnits(mynewunit)

counter = 0

for idx, row in df_var.iterrows():
    var = Variable()
    var.setName(row['VariableName']) # name of the variable- this will stay as PANGAEA describes it 
    # Now let's do the standard name - Things are a bit trickier since we have to create a new standardName for mass.
    if varName[idx] == 'mass':
        from pylipd.globals.urls import VARIABLEURL
        mynewvar =  PaleoVariable(f"{VARIABLEURL}#mass", "mass")
        var.setStandardVariable(mynewvar)
    else:
        var.setStandardVariable(PaleoVariable.from_synonym(varName[idx]))
    # Column
    var.setColumnNumber(counter+1) #The column in which the data is stored. Note that LiPD uses index 1
    # Unique ID
    var.setVariableId(generate_unique_id(prefix='MD77DK')) # create a unique ID for the variable - prefix set to core name and author initials.
    # Units
    var.setUnits(PaleoUnit.from_synonym(units[idx]))
    # Make sure the data is JSON writable (no numpy arrays or Pandas DataFrame)
    var.setValues(json.dumps(df_data.iloc[:,counter].tolist()))
    # Calculate some metadata about the values - this makes it easier to do some queries later on, including looking for data in a particular time slice. 
    var.setMinValue(float(df_data.iloc[:,counter].min()))
    var.setMaxValue(float(df_data.iloc[:,counter].max()))
    var.setMeanValue(float(df_data.iloc[:,counter].mean()))
    var.setMedianValue(float(df_data.iloc[:,counter].median()))
    # Attach the resolution metadata information to the variable
    var.setResolution(Res)
    # if the variable is d18O or d13C then add the species in the notes
    if varName[idx] == 'd18O' or varName[idx] == 'd13C':
        var.setNotes('Measurement performed on Pulleniatina obliquiloculata')
    # append in the list
    variables.append(var) 
    # add to the counter
    counter+=1
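
The resolution stored with each variable is a summary of the absolute time steps between consecutive samples. The same calculation can be sketched with the standard library only. The ages below are hypothetical and the helper name `resolution_summary` is not part of PyLiPD.

```python
import statistics

def resolution_summary(ages):
    """Summarise the absolute differences between consecutive ages."""
    steps = [abs(later - earlier) for earlier, later in zip(ages, ages[1:])]
    return {
        'min': min(steps),
        'max': max(steps),
        'mean': statistics.mean(steps),
        'median': statistics.median(steps),
    }

# Hypothetical ages (yr AD), unevenly spaced
ages = [1950.0, 1948.0, 1945.0, 1940.0, 1939.0]
summary = resolution_summary(ages)
print(summary)  # steps are 2, 3, 5 and 1 years
```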

Let’s add our variables to the DataTable:

table.setVariables(variables)

Then put the Table into the PaleoData object:

paleodata.setMeasurementTables([table])

And finally, the PaleoData object into the Dataset:

dsl.setPaleoData([paleodata])

And we are done with the PaleoData information!
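
The values of each variable were stored as a JSON string, and the table declares NaN as its missing value. As a small illustration, independent of PyLiPD, Python's `json` module writes missing floats as the token `NaN` by default. This is not strict JSON, but it matches that convention and can be read back by the same module. The depth values below are made up.

```python
import json
import math

values = [0.005, float('nan'), 0.015]  # hypothetical depths with one missing value

encoded = json.dumps(values)
print(encoded)  # [0.005, NaN, 0.015]

decoded = json.loads(encoded)
print(math.isnan(decoded[1]))  # True

# With allow_nan=False, the encoder refuses values that strict JSON cannot represent
try:
    json.dumps(values, allow_nan=False)
    strict_failed = False
except ValueError:
    strict_failed = True
print(strict_failed)  # True
```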

ChronData¶

The dataset also includes chronological information that we can store in the LiPD dataset. Since, in this case, the tables pertain to the same age model, we will store both of them in the same ChronData object, as two tables.

Let’s start by creating a ChronData object:

chrondata = ChronData()

As we have done for the PaleoData object, we need to create tables:

ctable1 = DataTable()
ctable2 = DataTable()

And let’s add some information about the tables, namely the name used for the csv file in LiPD and the value used to indicate missing values:

ctable1.setFileName("chron0measurement0.csv")
ctable1.setMissingValue("NaN")

ctable2.setFileName("chron0measurement1.csv")
ctable2.setMissingValue("NaN")

Radiocarbon chronology¶

We’re ready to start processing the data. Let’s start with studyID 830587.

df_cdata1 = ds_pangaea.get_data(study_id = '830587')[0]
display(df_cdata1.head())

Let’s drop the Event metadata columns:

df_cdata1.drop(columns=df_cdata1.columns[-5:], inplace=True)

And let’s gather some variable information, dropping the unnecessary rows:

df_cvar1 = ds_pangaea.get_variables(study_ids = '830587')
df_cvar1.drop(index=df_cvar1.index[-5:], inplace=True)
display(df_cvar1)

Let’s get started with our variables. ChronData does not have a standard vocabulary, so we will not set a standard variable name. For the units, we will use PaleoUnit where possible. Also note that the tables contain a number of strings, so we won’t be able to compute statistics such as the mean for these variables. Finally, the resolution is not as useful for chronological information, so we won’t calculate it here.

units = ['m', 'nounits', 'nounits', 'yr 14C BP', 'year','yr AD','year','nounits']

nounits and year do not exist in the ontology so we need to create a URI. year was created earlier for resolution so we will be reusing the Python variable mynewunit. Now let’s create nounits:

nounits = PaleoUnit(f"{UNITSURL}#nounits", "nounits")

variables = []

counter = 0

for idx, row in df_cvar1.iterrows():
    var = Variable()
    var.setName(row['VariableName']) # name of the variable- this will stay as PANGAEA describes it 
    # Column
    var.setColumnNumber(counter+1) #The column in which the data is stored. Note that LiPD uses index 1
    # Unique ID
    var.setVariableId(generate_unique_id(prefix='MD77DK')) # create a unique ID for the variable - prefix set to core name and author initials.
    # Units
    if units[idx] == 'nounits':
        var.setUnits(nounits)
    elif units[idx] == 'year':
        var.setUnits(mynewunit)
    else:
        var.setUnits(PaleoUnit.from_synonym(units[idx]))
    # Make sure the data is JSON writable (no numpy arrays or Pandas DataFrame)
    var.setValues(json.dumps(df_cdata1.iloc[:,counter].tolist()))
    # Calculate some metadata about the values - this makes it easier to do some queries later on, including looking for data in a particular time slice. 
    try:
        var.setMinValue(float(df_cdata1.iloc[:,counter].min()))
        var.setMaxValue(float(df_cdata1.iloc[:,counter].max()))
        var.setMeanValue(float(df_cdata1.iloc[:,counter].mean()))
        var.setMedianValue(float(df_cdata1.iloc[:,counter].median()))
    except Exception:
        # Columns containing strings have no meaningful statistics, so we skip them
        pass
    # append in the list
    variables.append(var) 
    # add to the counter
    counter+=1
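
The `try` block above skips the statistics for columns that contain strings. An alternative is to decide explicitly, column by column, whether statistics make sense. The sketch below does so on plain Python lists: it returns `None` as soon as a non-numeric entry is found, so a column is never summarised partially. The helper `numeric_summary` and the example columns are hypothetical.

```python
import statistics

def numeric_summary(values):
    """Return min/max/mean/median of a column, or None if it holds non-numeric entries."""
    numbers = []
    for value in values:
        if value is None:
            continue  # missing entry
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return None  # strings and other objects: no statistics
        if value != value:
            continue  # NaN is the only float that differs from itself
        numbers.append(float(value))
    if not numbers:
        return None
    return {
        'min': min(numbers),
        'max': max(numbers),
        'mean': statistics.mean(numbers),
        'median': statistics.median(numbers),
    }

depth_column = [0.0, 0.72, float('nan'), 1.3]   # hypothetical depths (m)
label_column = ['lab A', 'lab A', 'lab B', 'lab B']  # hypothetical laboratory labels

depth_stats = numeric_summary(depth_column)
label_stats = numeric_summary(label_column)
print(depth_stats)
print(label_stats)  # None
```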

Let’s add our variables to the first table object:

ctable1.setVariables(variables)

Then put the Table into the ChronData object:

chrondata.setMeasurementTables([ctable1])

Let’s move on to the next Table!

210Pb-based chronology¶

This information is stored in studyID 830588, which corresponds to the 210Pb-based chronology data.

df_cdata2 = ds_pangaea.get_data(study_id = '830588')[0]
display(df_cdata2.head())

Let’s drop the columns corresponding to the Event:

df_cdata2.drop(columns=df_cdata2.columns[-5:], inplace=True)

And let’s gather our variable information, dropping the unnecessary rows:

df_cvar2 = ds_pangaea.get_variables(study_ids = '830588')
df_cvar2.drop(index=df_cvar2.index[-5:], inplace=True)
display(df_cvar2)

Let’s prepare our units information:

units = ['m','m','Bq/kg','Bq/kg','Bq/kg','Bq/kg','Bq/kg','Bq/kg']

bq_kg = PaleoUnit(f"{UNITSURL}#bq_kg", "Bq/kg")

variables = []

counter = 0

for idx, row in df_cvar2.iterrows():
    var = Variable()
    var.setName(row['VariableName']) # name of the variable- this will stay as PANGAEA describes it 
    # Column
    var.setColumnNumber(counter+1) #The column in which the data is stored. Note that LiPD uses index 1
    # Unique ID
    var.setVariableId(generate_unique_id(prefix='MD77DK')) # create a unique ID for the variable - prefix set to core name and author initials.
    # Units
    if units[idx] == 'Bq/kg':
        var.setUnits(bq_kg)
    else:
        var.setUnits(PaleoUnit.from_synonym(units[idx]))
    # Make sure the data is JSON writable (no numpy arrays or Pandas DataFrame)
    var.setValues(json.dumps(df_cdata2.iloc[:,counter].tolist()))
    # Calculate some metadata about the values - this makes it easier to do some queries later on, including looking for data in a particular time slice. 
    try:
        var.setMinValue(float(df_cdata2.iloc[:,counter].min()))
        var.setMaxValue(float(df_cdata2.iloc[:,counter].max()))
        var.setMeanValue(float(df_cdata2.iloc[:,counter].mean()))
        var.setMedianValue(float(df_cdata2.iloc[:,counter].median()))
    except Exception:
        # Columns containing strings have no meaningful statistics, so we skip them
        pass
    # append in the list
    variables.append(var) 
    # add to the counter
    counter+=1

Let’s add our variables to the second table object:

ctable2.setVariables(variables)

Then put that Table into the ChronData object:

chrondata.addMeasurementTable(ctable2)

Note that PyLiPD uses a different function (addMeasurementTable rather than setMeasurementTables) to append to an existing object. Let’s make sure our tables are in there:

chrondata.getMeasurementTables()

[<pylipd.classes.datatable.DataTable at 0x11fcec320>,
 <pylipd.classes.datatable.DataTable at 0x11fceeb10>]

We’re all set to add the ChronData object to the dataset:

dsl.setChronData([chrondata])

All there is left to do is write our dataset to a file:

lipd = LiPD()
lipd.load_datasets([dsl])
lipd.create_lipd(dsl.getName(),"../data/MD982177.Khider.2011.lpd")

[2026-05-08 14:03:25,870][INFO] - Creating bag for directory /var/folders/xj/p7h9764x7cx0by8547l04rrr0000gn/T/rdf_to_lipd_nt40rtiy/MD982177.Khider.2011
[2026-05-08 14:03:25,871][INFO] - Creating data directory
[2026-05-08 14:03:25,872][INFO] - Moving chron0measurement0.csv to /private/var/folders/xj/p7h9764x7cx0by8547l04rrr0000gn/T/rdf_to_lipd_nt40rtiy/MD982177.Khider.2011/tmpgyeented/chron0measurement0.csv
[2026-05-08 14:03:25,872][INFO] - Moving chron0measurement1.csv to /private/var/folders/xj/p7h9764x7cx0by8547l04rrr0000gn/T/rdf_to_lipd_nt40rtiy/MD982177.Khider.2011/tmpgyeented/chron0measurement1.csv
[2026-05-08 14:03:25,872][INFO] - Moving paleo0measurement0.csv to /private/var/folders/xj/p7h9764x7cx0by8547l04rrr0000gn/T/rdf_to_lipd_nt40rtiy/MD982177.Khider.2011/tmpgyeented/paleo0measurement0.csv
[2026-05-08 14:03:25,873][INFO] - Moving metadata.jsonld to /private/var/folders/xj/p7h9764x7cx0by8547l04rrr0000gn/T/rdf_to_lipd_nt40rtiy/MD982177.Khider.2011/tmpgyeented/metadata.jsonld
[2026-05-08 14:03:25,873][INFO] - Moving /private/var/folders/xj/p7h9764x7cx0by8547l04rrr0000gn/T/rdf_to_lipd_nt40rtiy/MD982177.Khider.2011/tmpgyeented to data
[2026-05-08 14:03:25,873][INFO] - Using 1 processes to generate manifests: md5
[2026-05-08 14:03:25,873][INFO] - Generating manifest lines for file data/chron0measurement0.csv
[2026-05-08 14:03:25,876][INFO] - Generating manifest lines for file data/chron0measurement1.csv
[2026-05-08 14:03:25,876][INFO] - Generating manifest lines for file data/metadata.jsonld
[2026-05-08 14:03:25,876][INFO] - Generating manifest lines for file data/paleo0measurement0.csv
[2026-05-08 14:03:25,877][INFO] - Creating bagit.txt
[2026-05-08 14:03:25,877][INFO] - Creating bag-info.txt
[2026-05-08 14:03:25,884][INFO] - Creating /var/folders/xj/p7h9764x7cx0by8547l04rrr0000gn/T/rdf_to_lipd_nt40rtiy/MD982177.Khider.2011/tagmanifest-md5.txt

{'chronData': [{'measurementTable': [{'columns': [{'number': 1,
       'hasMaxValue': 1.3,
       'hasMeanValue': 0.6583333333333333,
       'hasMedianValue': 0.72,
       'variableName': 'DEPTH, sediment/rock',
       'TSid': 'MD77DK-ba11d24f66ab4d5',
       'units': 'm'},
      {'number': 2,
       'variableName': 'Laboratory',
       'TSid': 'MD77DK-ef302757c47f4df',
       'units': 'nounits'},
      {'number': 3,
       'hasMinValue': 100234.0,
       'variableName': 'Laboratory code/label',
       'TSid': 'MD77DK-0fb5442d5a03475',
       'units': 'nounits'},
      {'number': 4,
       'hasMaxValue': 2.26,
       'hasMeanValue': 1.3266666666666667,
       'hasMedianValue': 1.4275000000000002,
       'hasMinValue': 0.395,
       'variableName': 'Age, dated',
       'TSid': 'MD77DK-7f03eae61f9447b',
       'units': 'yr 14C BP'},
      {'number': 5,
       'hasMaxValue': 0.11,
       'hasMeanValue': 0.06583333333333333,
       'hasMedianValue': 0.0525,
       'hasMinValue': 0.045,
       'variableName': 'Age, dated, standard deviation',
       'TSid': 'MD77DK-94857b05fdff496',
       'units': 'year'},
      {'number': 6,
       'hasMaxValue': 1852.0,
       'hasMeanValue': 780.6666666666666,
       'hasMedianValue': 657.0,
       'variableName': 'Age',
       'TSid': 'MD77DK-ec552649d04e44e',
       'units': 'yr AD'},
      {'number': 7,
       'hasMaxValue': 136.0,
       'hasMeanValue': 73.16666666666667,
       'hasMedianValue': 76.0,
       'variableName': 'Age, error',
       'TSid': 'MD77DK-355d43edcd404a0',
       'units': 'year'},
      {'number': 8,
       'variableName': 'Age, comment',
       'TSid': 'MD77DK-74b5b429424a43b',
       'units': 'nounits'}],
     'filename': 'chron0measurement0.csv',
     'missingValue': 'NaN'},
    {'columns': [{'number': 1,
       'hasMaxValue': 0.125,
       'hasMeanValue': 0.08214285714285714,
       'hasMedianValue': 0.115,
       'hasMinValue': 0.025,
       'variableName': 'DEPTH, sediment/rock',
       'TSid': 'MD77DK-ea1471476bdc4f5',
       'units': 'm'},
      {'number': 2,
       'hasMaxValue': 0.01,
       'hasMeanValue': 0.008000000000000002,
       'hasMedianValue': 0.01,
       'hasMinValue': 0.004,
       'variableName': 'Layer thickness',
       'TSid': 'MD77DK-78d7fc82fb114ad',
       'units': 'm'},
      {'number': 3,
       'hasMaxValue': 84.0,
       'hasMeanValue': 47.857142857142854,
       'hasMedianValue': 33.0,
       'hasMinValue': 24.0,
       'variableName': 'Lead-214',
       'TSid': 'MD77DK-8de53910c5944d2',
       'units': 'Bq/kg'},
      {'number': 4,
       'hasMaxValue': 12.0,
       'hasMeanValue': 7.428571428571429,
       'hasMedianValue': 6.0,
       'hasMinValue': 3.0,
       'variableName': 'Lead-214, standard deviation',
       'TSid': 'MD77DK-f36082b53e34476',
       'units': 'Bq/kg'},
      {'number': 5,
       'hasMaxValue': 84.0,
       'hasMeanValue': 40.0,
       'hasMedianValue': 53.0,
       'variableName': 'Lead-210',
       'TSid': 'MD77DK-cb63c3b55ee94b5',
       'units': 'Bq/kg'},
      {'number': 6,
       'hasMaxValue': 172.0,
       'hasMeanValue': 88.14285714285714,
       'hasMedianValue': 68.0,
       'hasMinValue': 28.0,
       'variableName': 'Lead-210, standard deviation',
       'TSid': 'MD77DK-a818091ffd84463',
       'units': 'Bq/kg'},
      {'number': 7,
       'hasMaxValue': 53.0,
       'hasMeanValue': 41.5,
       'hasMedianValue': 42.0,
       'hasMinValue': 29.0,
       'variableName': 'Lead-210 excess',
       'TSid': 'MD77DK-e2ab6fa0cfe74d9',
       'units': 'Bq/kg'},
      {'number': 8,
       'hasMaxValue': 68.0,
       'hasMeanValue': 45.0,
       'hasMedianValue': 42.0,
       'hasMinValue': 28.0,
       'variableName': 'Lead-210 excess, standard deviation',
       'TSid': 'MD77DK-d95b6dff17c141c',
       'units': 'Bq/kg'}],
     'filename': 'chron0measurement1.csv',
     'missingValue': 'NaN'}]}],
 'investigator': [{'name': 'Khider, D'},
  {'name': 'Stott, Lowell D'},
  {'name': 'Emile-Geay, J'},
  {'name': 'Thunell, Robert C'},
  {'name': 'Hammond, Douglas E'}],
 'paleoData': [{'measurementTable': [{'columns': [{'number': 1,
       'hasMaxValue': 0.97,
       'hasMeanValue': 0.4706351931330472,
       'hasMedianValue': 0.445,
       'hasMinValue': 0.005,
       'variableName': 'depth',
       'resolution': {'hasMaxValue': np.float64(165.0),
        'hasMeanValue': np.float64(0.9785223367697594),
        'units': 'year'},
       'TSid': 'MD77DK-7657776bd25a484',
       'units': 'm'},
      {'number': 2,
       'hasMaxValue': 0.96,
       'hasMeanValue': 0.46209442060085837,
       'hasMedianValue': 0.44,
       'variableName': 'depthTop',
       'resolution': {'hasMaxValue': np.float64(165.0),
        'hasMeanValue': np.float64(0.9785223367697594),
        'units': 'year'},
       'TSid': 'MD77DK-07b7a9fc6325495',
       'units': 'm'},
      {'number': 3,
       'hasMaxValue': 0.98,
       'hasMeanValue': 0.47917596566523607,
       'hasMedianValue': 0.45,
       'hasMinValue': 0.01,
       'variableName': 'depthBottom',
       'resolution': {'hasMaxValue': np.float64(165.0),
        'hasMeanValue': np.float64(0.9785223367697594),
        'units': 'year'},
       'TSid': 'MD77DK-2aaa2e2b8659470',
       'units': 'm'},
      {'number': 4,
       'hasMaxValue': 1843.0,
       'hasMeanValue': 1344.5656652360515,
       'hasMedianValue': 1407.0,
       'hasMinValue': 704.0,
       'variableName': 'age',
       'resolution': {'hasMaxValue': np.float64(165.0),
        'hasMeanValue': np.float64(0.9785223367697594),
        'units': 'year'},
       'TSid': 'MD77DK-650a115cde04406',
       'units': 'yr AD'},
      {'number': 5,
       'hasMaxValue': 1851.0,
       'hasMeanValue': 1364.656652360515,
       'hasMedianValue': 1419.0,
       'hasMinValue': 734.0,
       'variableName': 'age',
       'resolution': {'hasMaxValue': np.float64(165.0),
        'hasMeanValue': np.float64(0.9785223367697594),
        'units': 'year'},
       'TSid': 'MD77DK-fea38844415a485',
       'units': 'yr AD'},
      {'number': 6,
       'hasMaxValue': 1.231,
       'hasMeanValue': 0.5953888412017166,
       'hasMedianValue': 0.537,
       'hasMinValue': 0.103,
       'variableName': 'age',
       'resolution': {'hasMaxValue': np.float64(165.0),
        'hasMeanValue': np.float64(0.9785223367697594),
        'units': 'year'},
       'TSid': 'MD77DK-6c7da88c6e99428',
       'units': 'yr BP'},
      {'number': 7,
       'hasMaxValue': 2.111,
       'hasMeanValue': 0.7866178571428571,
       'hasMedianValue': 0.808,
       'hasMinValue': -0.524,
       'variableName': 'd18O',
       'notes': 'Measurement performed on Pulleniatina obliquiloculata',
       'resolution': {'hasMaxValue': np.float64(165.0),
        'hasMeanValue': np.float64(0.9785223367697594),
        'units': 'year'},
       'TSid': 'MD77DK-76dd445083b0463',
       'units': 'permil'},
      {'number': 8,
       'hasMaxValue': -0.642,
       'hasMeanValue': -2.084085836909871,
       'hasMedianValue': -2.1,
       'hasMinValue': -3.78,
       'variableName': 'd13C',
       'notes': 'Measurement performed on Pulleniatina obliquiloculata',
       'resolution': {'hasMaxValue': np.float64(165.0),
        'hasMeanValue': np.float64(0.9785223367697594),
        'units': 'year'},
       'TSid': 'MD77DK-b491b340db94423',
       'units': 'permil'},
      {'number': 9,
       'hasMaxValue': 87.0,
       'hasMeanValue': 23.321888412017167,
       'hasMedianValue': 22.0,
       'hasMinValue': 9.0,
       'variableName': 'mass',
       'resolution': {'hasMaxValue': np.float64(165.0),
        'hasMeanValue': np.float64(0.9785223367697594),
        'units': 'year'},
       'TSid': 'MD77DK-972a4f4b81a54c3',
       'units': 'um'}],
     'filename': 'paleo0measurement0.csv',
     'missingValue': 'NaN'}]}],
 'pub': [{'author': [{'name': 'Khider, D. and Stott'},
    {'name': 'L. D. and Emile‐Geay, J. and Thunell'},
    {'name': 'R. and Hammond, D. E.'}],
   'url': ['http://dx.doi.org/10.1029/2011PA002139'],
   'doi': '10.1029/2011pa002139',
   'journal': 'Paleoceanography',
   'title': 'Assessing El Niño Southern Oscillation variability during the past millennium',
   'volume': '26',
   'year': 2011}],
 'datasetId': 'MD77DK11',
 'geo': {'geometry': {'coordinates': [119.08, 1.4, -968.0]},
  'properties': {'type': 'http://linked.earth/ontology#Location',
   'elevation': '-968.0',
   'latitude': '1.4',
   'longitude': '119.08',
   'siteName': 'MD98-2177'}},
 'dataSetName': 'MD982177.Khider.2011',
 'originalDataURL': 'https://doi.pangaea.de/10.1594/PANGAEA.830589',
 'archiveType': 'Marine sediment'}
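
The nested layout of the dictionary above (sections, then measurement tables, then columns) can be walked with a few lines of plain Python. The following is a minimal sketch, not part of the PyLiPD API: `metadata` is a hand-built excerpt that copies a few entries from the output, `iter_columns` and `duplicated_names` are hypothetical helper names, and the top-level `chronData` key is assumed to follow the LiPD convention.

```python
from collections import Counter

# Hand-built excerpt mirroring the structure of the dictionary printed above.
# Only a few columns are reproduced; names and units are copied from that output.
# Assumption: the chronology tables sit under a top-level "chronData" key.
metadata = {
    "dataSetName": "MD982177.Khider.2011",
    "archiveType": "Marine sediment",
    "chronData": [
        {"measurementTable": [
            {"filename": "chron0measurement1.csv",
             "columns": [
                 {"number": 1, "variableName": "DEPTH, sediment/rock", "units": "m"},
                 {"number": 5, "variableName": "Lead-210", "units": "Bq/kg"},
             ]},
        ]},
    ],
    "paleoData": [
        {"measurementTable": [
            {"filename": "paleo0measurement0.csv",
             "columns": [
                 {"number": 1, "variableName": "depth", "units": "m"},
                 {"number": 4, "variableName": "age", "units": "yr AD"},
                 {"number": 5, "variableName": "age", "units": "yr AD"},
                 {"number": 6, "variableName": "age", "units": "yr BP"},
                 {"number": 7, "variableName": "d18O", "units": "permil"},
             ]},
        ]},
    ],
}


def iter_columns(meta, sections=("chronData", "paleoData")):
    """Yield (section, filename, column) for every column of every measurement table."""
    for section in sections:
        for entry in meta.get(section, []):
            for table in entry.get("measurementTable", []):
                for column in table.get("columns", []):
                    yield section, table.get("filename", ""), column


def duplicated_names(meta):
    """Return {filename: [names]} for variable names used more than once in a table."""
    counts = Counter(
        (filename, column["variableName"])
        for _, filename, column in iter_columns(meta)
    )
    duplicates = {}
    for (filename, name), n in counts.items():
        if n > 1:
            duplicates.setdefault(filename, []).append(name)
    return duplicates


# Print one line per column: section, table file, variable name and units.
for section, filename, column in iter_columns(metadata):
    print(f"{section:9s} {filename:24s} {column['variableName']} [{column['units']}]")

print(duplicated_names(metadata))
```

On this excerpt, `duplicated_names` returns `{'paleo0measurement0.csv': ['age']}`, reflecting the three columns named `age` in the paleo table. A check of this kind may help decide whether such columns should be given more specific names before the file is shared.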

References¶

Ratnakar, V., & Khider, D. (2025). PyLiPD: A Python package for the manipulation of paleoclimate datasets. Journal of Open Source Software, 10(115), 8861. https://doi.org/10.21105/joss.08861
Leduc, G., Schneider, R., Kim, J.-H., & Lohmann, G. (2010). Holocene and Eemian sea surface temperature trends as revealed by alkenone and Mg/Ca paleothermometry. Quaternary Science Reviews, 29(7–8), 989–1004. https://doi.org/10.1016/j.quascirev.2010.01.004
Leduc, G., Schneider, R. R., Kim, J.-H., & Lohmann, G. (2010). Expanded GHOST database. PANGAEA. https://doi.org/10.1594/PANGAEA.737370
Khider, D., Stott, L. D., Emile-Geay, J., Thunell, R. C., & Hammond, D. E. (2011). Stable isotope record of sediment core MD98-2177. PANGAEA. https://doi.org/10.1594/PANGAEA.830589
Khider, D., Stott, L. D., Emile-Geay, J., Thunell, R., & Hammond, D. E. (2011). Assessing El Niño Southern Oscillation variability during the past millennium. Paleoceanography, 26(3). https://doi.org/10.1029/2011PA002139