
Converting a NOAA dataset into the LiPD format#

Authors#

Deborah Khider

Preamble#

This tutorial showcases how to create a LiPD file from a NOAA dataset.

Goals#

  • Understand how to retrieve relevant information from the NOAA Data Center.

  • Use PyLiPD to create a LiPD file from the retrieved information.

Pre-requisites#

  • An understanding of PyLiPD classes and how to create a LiPD file from this tutorial.

Reading time#

Let’s import our packages!

from pyleotups import Dataset


import pylipd.classes.dataset as dataset
from pylipd.classes.archivetype import ArchiveTypeConstants, ArchiveType
from pylipd.classes.funding import Funding
from pylipd.classes.interpretation import Interpretation
from pylipd.classes.interpretationvariable import InterpretationVariableConstants, InterpretationVariable
from pylipd.classes.location import Location
from pylipd.classes.paleodata import PaleoData
from pylipd.classes.datatable import DataTable
from pylipd.classes.paleounit import PaleoUnitConstants, PaleoUnit
from pylipd.classes.paleovariable import PaleoVariableConstants, PaleoVariable
from pylipd.classes.person import Person
from pylipd.classes.publication import Publication
from pylipd.classes.resolution import Resolution
from pylipd.classes.variable import Variable
from pylipd.classes.model import Model
from pylipd.classes.chrondata import ChronData
from pylipd.classes.compilation import Compilation

from pylipd import LiPD


import json

import numpy as np
import pandas as pd

import re

Demonstration#

Querying the NOAA database#

The first step is to create a pyleotups.Dataset object in which we will store the relevant information:

ds = Dataset()

Now that this is done, let’s search using the study ID. Here we are interested in converting the dataset from Dee et al. (2020), a monthly-resolved coral aragonite oxygen isotope (\(\delta^{18}\)O) record from Palmyra Island.

ds.search_studies(noaa_id=27490)
Parsing NOAA studies: 100%|█████████████████████| 1/1 [00:00<00:00, 3194.44it/s]
StudyID XMLID StudyName DataType EarliestYearBP MostRecentYearBP EarliestYearCE MostRecentYearCE StudyNotes ScienceKeywords Investigators Publications Sites Funding
0 27490 67635 Palmyra 860 Year Modern and Fossil Coral Oxyge... CORALS AND SCLEROSPONGES 804 -56 1146 2006 Monthly-resolved coral aragonite oxygen isotop... None Sylvia Dee, Kim Cobb, Julien Emile-Geay, Toby ... [{'Author': 'Sylvia G. Dee, Kim M. Cobb, Julie... [[{'DataTableID': '41697', 'DataTableName': 'P... [{'fundingAgency': 'US National Science Founda...

Let’s get some information about the dataset that we can later use to create the LiPD dataset.

df_summary = ds.get_summary()

df_summary
StudyID XMLID StudyName DataType EarliestYearBP MostRecentYearBP EarliestYearCE MostRecentYearCE StudyNotes ScienceKeywords Investigators Publications Sites Funding
0 27490 67635 Palmyra 860 Year Modern and Fossil Coral Oxyge... CORALS AND SCLEROSPONGES 804 -56 1146 2006 Monthly-resolved coral aragonite oxygen isotop... None Sylvia Dee, Kim Cobb, Julien Emile-Geay, Toby ... [{'Author': 'Sylvia G. Dee, Kim M. Cobb, Julie... [[{'DataTableID': '41697', 'DataTableName': 'P... [{'fundingAgency': 'US National Science Founda...

And some information about the publications:

bib, df_pub = ds.get_publications()
df_pub.head()
Author Title Journal Year Volume Number Pages Type DOI URL CitationKey StudyID StudyName
0 Sylvia G. Dee, Kim M. Cobb, Julien Emile-Geay,... No consistent ENSO response to volcanic forcin... Science 2020 367 6485 1477-1481 publication 10.1126/science.aax2000 http://dx.doi.org/10.1126/science.aax2000 Charles_Consistent_2020_27490 27490 Palmyra 860 Year Modern and Fossil Coral Oxyge...

Let’s get some information about the geographical information:

df_geo = ds.get_geo()
df_geo
StudyID DataType SiteID SiteName LocationName Latitude Longitude MinElevation MaxElevation
0 27490 CORALS AND SCLEROSPONGES 1900 Palmyra Island Ocean>Pacific Ocean>Central Pacific Ocean 5.8664 -162.12 -9 -9

Let’s get some information about the funding:

df_fund = ds.get_funding()
df_fund
StudyID StudyName FundingAgency FundingGrant
0 27490 Palmyra 860 Year Modern and Fossil Coral Oxyge... US National Science Foundation MGG 0752091, 1502832, 1836645; EAR-1347213

Finally, let’s get information about the data tables present in the file:

df_tables = ds.get_tables()
df_tables
DataTableID DataTableName TimeUnit FileURL Variables FileDescription TotalFilesAvailable SiteID SiteName LocationName Latitude Longitude MinElevation MaxElevation StudyID StudyName
0 41697 Palmyra2020d18O CE https://www.ncei.noaa.gov/pub/data/paleo/coral... [age, d18O] NOAA Template File 1 1900 Palmyra Island Ocean>Pacific Ocean>Central Pacific Ocean 5.8664 -162.12 -9 -9 27490 Palmyra 860 Year Modern and Fossil Coral Oxyge...

There is only one table present; let’s put it into a Pandas DataFrame:

dfs = ds.get_data(dataTableIDs="41697")
dfs[0].head()
age d18O
0 1146.375 -4.749
1 1146.4583 -4.672
2 1146.5417 -4.724
3 1146.625 -4.717
4 1146.7083 -4.8947

We need to coerce the values to numeric:

df_data = dfs[0].apply(pd.to_numeric, errors='coerce')
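As a minimal illustration of what errors='coerce' buys us (toy values, not the NOAA data): unparsable entries become NaN instead of raising an exception.

```python
import pandas as pd

# Toy frame with one unparsable entry; errors='coerce' turns it into NaN
# instead of raising a ValueError.
raw = pd.DataFrame({'age': ['1146.375', '1146.4583'], 'd18O': ['-4.749', 'junk']})
clean = raw.apply(pd.to_numeric, errors='coerce')
```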

And let’s get some variable information:

df_var = ds.get_variables(dataTableIDs="41697")
df_var
StudyID SiteID FileURL VariableName cvDataType cvWhat cvMaterial cvError cvUnit cvSeasonality cvDetail cvMethod cvAdditionalInfo cvFormat cvShortName
DataTableID
41697 27490 1900 https://www.ncei.noaa.gov/pub/data/paleo/coral... age CORALS AND SCLEROSPONGES age variable>age None None time unit>age unit>year Common Era None None None None Numeric age
41697 27490 1900 https://www.ncei.noaa.gov/pub/data/paleo/coral... d18O CORALS AND SCLEROSPONGES chemical composition>isotope>isotope ratio>del... biological material>organism>coral>Porites sp.... None concentration unit>parts per notation unit>par... None None laboratory method>spectroscopy>mass spectromet... None Numeric d18O

Now that we have all the necessary information, let’s convert it into the LiPD format!

Converting to a LiPD-formatted dataset#

Root Metadata#

Let’s start by creating a pylipd.Dataset object:

dsl = dataset.Dataset()

Let’s add root metadata such as the name of the dataset, type of archive and datasetID:

dsl.setName('Palmyra.Dee.2020')
archiveType = ArchiveType.from_synonym('Coral')
dsl.setArchiveType(archiveType)
dsl.setOriginalDataUrl('https://www.ncei.noaa.gov/pub/data/paleo/coral/east_pacific/palmyra2020d18o.txt')
dsl.setDatasetId('DP2020PC')

Let’s add the investigators of the study, which are found in df_summary. To do so, we need to create a Person object for each of them:

def extract_last_name_and_initials(full_name):
    parts = full_name.split()
    last_name = parts[-1]
    initials = '.'.join([p[0] for p in parts[:-1]]) + '.'
    return last_name, initials


authors = df_summary['Investigators'].to_list()[0]

# Step 1: Split the string by commas
parts = authors.split(',')

# Prepare a list to hold the formatted names
investigators = []

# Step 2: Iterate over the parts to process each name
for i in parts:
    last_name,initial = extract_last_name_and_initials(i)
    person = Person() # create the Person object
    person.setName(f"{last_name}, {initial}")
    investigators.append(person)

# Step 3: Store the list of Persons in the dsl object
dsl.setInvestigators(investigators)
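The splitting rules can be sanity-checked standalone (the helper is restated here so the snippet runs on its own; note it assumes names in "First [Middle] Last" order):

```python
def extract_last_name_and_initials(full_name):
    # Last whitespace-separated token is the surname; everything before it
    # is reduced to dotted initials.
    parts = full_name.split()
    last_name = parts[-1]
    initials = '.'.join([p[0] for p in parts[:-1]]) + '.'
    return last_name, initials

names = [extract_last_name_and_initials(n)
         for n in 'Sylvia G. Dee, Kim M. Cobb'.split(',')]
```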

Publication Metadata#

Let’s add a publication object to our dataset from the information in df_pub:

pub = Publication()
# Let's start with the authors

authors = df_pub['Author'].to_list()[0]

# Step 1: Split the string by commas
parts = authors.split(',')

# Prepare a list to hold the formatted names
investigators = []

# Step 2: Iterate over the parts to process each name
for i in parts:
    last_name,initial = extract_last_name_and_initials(i)
    person = Person() # create the Person object
    person.setName(f"{last_name}, {initial}")
    investigators.append(person)

# Step 3: Store the list of Persons in the Publication object
pub.setAuthors(investigators)

Let’s get the rest of the information:

pub.setTitle(df_pub['Title'].iloc[0])
pub.setJournal(df_pub['Journal'].iloc[0])
pub.setYear(int(df_pub['Year'].iloc[0]))
pub.setVolume(str(df_pub['Volume'].iloc[0]))
pub.setPages(str(df_pub['Pages'].iloc[0]))
pub.setDOI(df_pub['DOI'].iloc[0])
pub.setUrls([df_pub['URL'].iloc[0]])
pub.setCiteKey(df_pub['CitationKey'].iloc[0])

Let’s add our publication object to the dataset:

dsl.setPublications([pub])

Funding Metadata#

Let’s add information about the funding from df_fund. First, let’s parse out all the grant numbers.

def parse_grants(grant_string):
    result = []
    # Split different grant groups using semicolon
    groups = grant_string.split(';')
    
    for group in groups:
        group = group.strip()
        # Match prefix followed by space or hyphen, then numbers
        match = re.match(r'([A-Z]{3})[\s-]+(.+)', group)
        if match:
            prefix = match.group(1)
            number_part = match.group(2)
            numbers = re.findall(r'\d+', number_part)
            result.extend([f'{prefix}-{num}' for num in numbers])
    
    return result


grants = parse_grants(df_fund['FundingGrant'].iloc[0])
grants
['MGG-0752091', 'MGG-1502832', 'MGG-1836645', 'EAR-1347213']
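One caveat worth noting (standalone restatement of the helper): groups that lack a three-letter uppercase prefix are silently dropped, so it is worth eyeballing the output as we did above.

```python
import re

def parse_grants(grant_string):
    # Semicolons separate prefix groups; within a group, the 3-letter prefix
    # is distributed over every comma-separated grant number. Groups without
    # a matching prefix are skipped.
    result = []
    for group in grant_string.split(';'):
        match = re.match(r'([A-Z]{3})[\s-]+(.+)', group.strip())
        if match:
            prefix, numbers = match.group(1), re.findall(r'\d+', match.group(2))
            result.extend(f'{prefix}-{num}' for num in numbers)
    return result

parsed = parse_grants('NSF-123, 456; 789')  # the bare '789' group is dropped
```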

Let’s create a Funding object and add it to the Dataset:

fund = Funding()
fund.setFundingAgency(df_fund['FundingAgency'].iloc[0])
fund.setGrants(grants)                
dsl.setFundings([fund])

Geographic Metadata#

Let’s add information about the location associated with the dataset from df_geo. To do so, we need to create a Location object:

loc = Location()
loc.setLatitude(df_geo['Latitude'].iloc[0])
loc.setLongitude(df_geo['Longitude'].iloc[0])
loc.setElevation(df_geo['MinElevation'].iloc[0])
loc.setSiteName(df_geo['SiteName'].iloc[0])
loc.setLocationName(df_geo['LocationName'].iloc[0])
dsl.setLocation(loc)

PaleoData Metadata#

Let’s enter the information relating to the paleoData. The data itself is stored in df_data while the metadata information about the variables is in df_var. Let’s first create our PaleoData object.

paleodata = PaleoData()

Our next step is to create measurement tables in this object. To do so, we can use the DataTable object:

table = DataTable()

Now let’s add some information about the table, such as its name and the value used for missing values in the data:

table.setFileName("paleo0measurement0.csv")
table.setMissingValue("NaN")

The next step is to add columns to our table. In other words, we need to create some variables.

In LiPD, each variable is also given a unique ID. The function below generates one:

import uuid

def generate_unique_id(prefix='CPD'):
    # Generate a random UUID
    random_uuid = str(uuid.uuid4()).replace('-','')  # Generates a random UUID.
    
    # Keep the first 15 characters of the 32-character hex string and prepend the prefix
    id_str = str(random_uuid)
    formatted_id = f"{prefix}-{id_str[:15]}"
    
    return formatted_id
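A quick format check (helper restated so it runs standalone, relying on uuid4 hex digits being lowercase): the result is the prefix, a hyphen, and the first 15 hex characters of a random UUID.

```python
import re
import uuid

def generate_unique_id(prefix='CPD'):
    # Prefix + first 15 hex characters of a random UUID4
    return f"{prefix}-{str(uuid.uuid4()).replace('-', '')[:15]}"

vid = generate_unique_id(prefix='PCU')
```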

Since variable names and units are controlled in LiPD, let’s see if we can get synonyms before we proceed. Let’s start with the standard variable name:

check_names = {}
for index, row in df_var.iterrows():
    check_names[row['VariableName']]= PaleoVariable.from_synonym(row['VariableName']).label

check_names
{'age': 'age', 'd18O': 'd18O'}

This looks good, let’s have a look at the units:

check_units = {}
for index, row in df_var.iterrows():
    unit = row['cvUnit'].split('>')[-1]
    try:
        check_units[unit] = PaleoUnit.from_synonym(unit).label
    except AttributeError:  # from_synonym returns None when no synonym matches
        check_units[unit] = None

check_units
{'year Common Era': 'yr AD', 'per mil VPDB': None}

The “per mil VPDB” unit is not recognized automatically (it is not in the thesaurus). Let’s look at the standard unit names and see if we can manually select one that matches.

The permil entry will match, so let’s use this.
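The manual substitution can be folded into a small fallback helper. This is only a sketch: UNIT_FALLBACKS and resolve_unit are hypothetical names, not part of PyLiPD, and the toy lookup stands in for PaleoUnit.from_synonym.

```python
# Hypothetical fallback table for raw NOAA unit strings that the synonym
# thesaurus does not resolve on its own.
UNIT_FALLBACKS = {'per mil VPDB': 'permil'}

def resolve_unit(raw_unit, lookup):
    """Try the thesaurus-style `lookup` first, then the fallback table."""
    resolved = lookup(raw_unit)
    return resolved if resolved is not None else UNIT_FALLBACKS.get(raw_unit)

# Toy lookup standing in for the PyLiPD synonym resolver
toy_thesaurus = {'year Common Era': 'yr AD'}.get
```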

variables = []

# Resolution: time spacing between consecutive samples, computed from the age column
res = df_data.iloc[:, 0].diff()[1:].to_numpy()
Res = Resolution() # create a Resolution object - it will be the same for all variables since it is based on time
Res.setMinValue(np.min(res))
Res.setMaxValue(np.max(res))
Res.setMeanValue(np.mean(res))
Res.setMedianValue(np.median(res))
from pylipd.globals.urls import UNITSURL
mynewunit = PaleoUnit(f"{UNITSURL}#year", "year")
Res.setUnits(mynewunit)

# Compilation
comp = Compilation() # create a compilation object
comp.setName('Pages2kTemperature')
comp.setVersion('2_2_0')

counter = 0

for index, row in df_var.iterrows():
    var = Variable()
    var.setName(row['VariableName']) # name of the variable
    # Now let's do the standard name
    var.setStandardVariable(PaleoVariable.from_synonym(row['VariableName']))
    var.setColumnNumber(counter+1) #The column in which the data is stored. Note that LiPD uses index 1
    var.setVariableId(generate_unique_id(prefix='PCU')) # create a unique ID for the variable - change prefix if needed
    # Units
    unit = row['cvUnit'].split('>')[-1]
    if unit=='per mil VPDB':
        var.setUnits(PaleoUnit.from_synonym('permil'))
    else:
        var.setUnits(PaleoUnit.from_synonym(unit))
    # Make sure the data is JSON writable (no numpy arrays or Pandas DataFrame)
    var.setValues(json.dumps(df_data.iloc[:,counter].tolist()))
    # Calculate some metadata about the values - this makes it easier to do some queries later on, including looking for data in a particular time slice. 
    var.setMinValue(float(df_data.iloc[:,counter].min()))
    var.setMaxValue(float(df_data.iloc[:,counter].max()))
    var.setMeanValue(float(df_data.iloc[:,counter].mean()))
    var.setMedianValue(float(df_data.iloc[:,counter].median()))
    # Attach the resolution metadata information to the variable
    var.setResolution(Res)
    # if the variable is d18O, add information about the compilation
    if row['VariableName'] == 'd18O':
        var.setPartOfCompilation(comp)
    # append to the list
    variables.append(var)
    # advance the column counter
    counter += 1
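The resolution metadata attached above boils down to summary statistics of the time steps. On a toy monthly age axis (illustrative values only):

```python
import numpy as np

# Toy monthly-resolved age axis (~1/12 yr spacing), mirroring the diff-based
# resolution computed for the real age column.
age = np.array([1146.375, 1146.4583, 1146.5417, 1146.625, 1146.7083])
steps = np.diff(age)
summary = {'min': steps.min(), 'max': steps.max(),
           'mean': steps.mean(), 'median': np.median(steps)}
```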

Let’s now put our variables in the DataTable:

table.setVariables(variables)

And the table into the PaleoData object:

paleodata.setMeasurementTables([table])

And finally, the PaleoData object into the Dataset:

dsl.setPaleoData([paleodata])

Writing a LiPD file#

The last step in this process is to write to a LiPD file. To do so, you need to pass the Dataset dsl back into a LiPD object:

lipd = LiPD()
lipd.load_datasets([dsl])
lipd.create_lipd(dsl.getName(), "../data/Palmyra.Dee.2020.lpd");

References#

Dee, S. G., Cobb, K. M., Emile-Geay, J., Ault, T. R., Edwards, R. L., Cheng, H., & Charles, C. D. (2020). No consistent ENSO response to volcanic forcing over the last millennium. Science (New York, N.Y.), 367(6485), 1477–1481. doi:10.1126/science.aax2000.