
Converting a NOAA dataset into the LiPD format#
Preamble#
This tutorial showcases how to create a LiPD file from a NOAA dataset.
Goals#
Understand how to retrieve relevant information from the NOAA Data Center.
Use PyLiPD to create a LiPD file from the retrieved information.
Pre-requisites#
An understanding of PyLiPD classes and how to create a LiPD file, as covered in this tutorial.
Reading time#
Let’s import our packages!
from pyleotups import Dataset
import pylipd.classes.dataset as dataset
from pylipd.classes.archivetype import ArchiveTypeConstants, ArchiveType
from pylipd.classes.funding import Funding
from pylipd.classes.interpretation import Interpretation
from pylipd.classes.interpretationvariable import InterpretationVariableConstants, InterpretationVariable
from pylipd.classes.location import Location
from pylipd.classes.paleodata import PaleoData
from pylipd.classes.datatable import DataTable
from pylipd.classes.paleounit import PaleoUnitConstants, PaleoUnit
from pylipd.classes.paleovariable import PaleoVariableConstants, PaleoVariable
from pylipd.classes.person import Person
from pylipd.classes.publication import Publication
from pylipd.classes.resolution import Resolution
from pylipd.classes.variable import Variable
from pylipd.classes.model import Model
from pylipd.classes.chrondata import ChronData
from pylipd.classes.compilation import Compilation
from pylipd import LiPD
import json
import numpy as np
import pandas as pd
import re
Demonstration#
Querying the NOAA database#
The first step is to create a pyleotups.Dataset object in which we will store the relevant information:
ds = Dataset()
Now that this is done, let’s search using the study ID. Here we are interested in converting the dataset from Dee et al. (2020), a monthly-resolved coral aragonite oxygen isotope (\(\delta^{18}O\)) record from Palmyra Island.
ds.search_studies(noaa_id=27490)
Parsing NOAA studies: 100%|█████████████████████| 1/1 [00:00<00:00, 3194.44it/s]
|   | StudyID | XMLID | StudyName | DataType | EarliestYearBP | MostRecentYearBP | EarliestYearCE | MostRecentYearCE | StudyNotes | ScienceKeywords | Investigators | Publications | Sites | Funding |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 27490 | 67635 | Palmyra 860 Year Modern and Fossil Coral Oxyge... | CORALS AND SCLEROSPONGES | 804 | -56 | 1146 | 2006 | Monthly-resolved coral aragonite oxygen isotop... | None | Sylvia Dee, Kim Cobb, Julien Emile-Geay, Toby ... | [{'Author': 'Sylvia G. Dee, Kim M. Cobb, Julie... | [[{'DataTableID': '41697', 'DataTableName': 'P... | [{'fundingAgency': 'US National Science Founda... |
Let’s get some information about the dataset that we can later use to create the LiPD dataset.
df_summary = ds.get_summary()
df_summary
|   | StudyID | XMLID | StudyName | DataType | EarliestYearBP | MostRecentYearBP | EarliestYearCE | MostRecentYearCE | StudyNotes | ScienceKeywords | Investigators | Publications | Sites | Funding |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 27490 | 67635 | Palmyra 860 Year Modern and Fossil Coral Oxyge... | CORALS AND SCLEROSPONGES | 804 | -56 | 1146 | 2006 | Monthly-resolved coral aragonite oxygen isotop... | None | Sylvia Dee, Kim Cobb, Julien Emile-Geay, Toby ... | [{'Author': 'Sylvia G. Dee, Kim M. Cobb, Julie... | [[{'DataTableID': '41697', 'DataTableName': 'P... | [{'fundingAgency': 'US National Science Founda... |
And some information about the publications:
bib, df_pub = ds.get_publications()
df_pub.head()
|   | Author | Title | Journal | Year | Volume | Number | Pages | Type | DOI | URL | CitationKey | StudyID | StudyName |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Sylvia G. Dee, Kim M. Cobb, Julien Emile-Geay,... | No consistent ENSO response to volcanic forcin... | Science | 2020 | 367 | 6485 | 1477-1481 | publication | 10.1126/science.aax2000 | http://dx.doi.org/10.1126/science.aax2000 | Charles_Consistent_2020_27490 | 27490 | Palmyra 860 Year Modern and Fossil Coral Oxyge... |
Let’s get the geographical information:
df_geo = ds.get_geo()
df_geo
|   | StudyID | DataType | SiteID | SiteName | LocationName | Latitude | Longitude | MinElevation | MaxElevation |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 27490 | CORALS AND SCLEROSPONGES | 1900 | Palmyra Island | Ocean>Pacific Ocean>Central Pacific Ocean | 5.8664 | -162.12 | -9 | -9 |
Let’s get some information about the funding:
df_fund = ds.get_funding()
df_fund
|   | StudyID | StudyName | FundingAgency | FundingGrant |
|---|---|---|---|---|
| 0 | 27490 | Palmyra 860 Year Modern and Fossil Coral Oxyge... | US National Science Foundation | MGG 0752091, 1502832, 1836645; EAR-1347213 |
Finally, let’s get information about the data tables present in the file:
df_tables = ds.get_tables()
df_tables
|   | DataTableID | DataTableName | TimeUnit | FileURL | Variables | FileDescription | TotalFilesAvailable | SiteID | SiteName | LocationName | Latitude | Longitude | MinElevation | MaxElevation | StudyID | StudyName |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41697 | Palmyra2020d18O | CE | https://www.ncei.noaa.gov/pub/data/paleo/coral... | [age, d18O] | NOAA Template File | 1 | 1900 | Palmyra Island | Ocean>Pacific Ocean>Central Pacific Ocean | 5.8664 | -162.12 | -9 | -9 | 27490 | Palmyra 860 Year Modern and Fossil Coral Oxyge... |
There is only one table present, so let’s put it into a Pandas DataFrame:
dfs = ds.get_data(dataTableIDs="41697")
dfs[0].head()
|   | age | d18O |
|---|---|---|
| 0 | 1146.375 | -4.749 |
| 1 | 1146.4583 | -4.672 |
| 2 | 1146.5417 | -4.724 |
| 3 | 1146.625 | -4.717 |
| 4 | 1146.7083 | -4.8947 |
We need to coerce the values to numeric:
df_data = dfs[0].apply(pd.to_numeric, errors='coerce')
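To confirm the coercion worked, we can inspect the column dtypes (a quick sanity check using standard pandas; this cell is not part of the original workflow):
# Both columns should now be numeric (float64)
df_data.dtypes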
And let’s get some variable information:
df_var = ds.get_variables(dataTableIDs="41697")
df_var
| DataTableID | StudyID | SiteID | FileURL | VariableName | cvDataType | cvWhat | cvMaterial | cvError | cvUnit | cvSeasonality | cvDetail | cvMethod | cvAdditionalInfo | cvFormat | cvShortName |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 41697 | 27490 | 1900 | https://www.ncei.noaa.gov/pub/data/paleo/coral... | age | CORALS AND SCLEROSPONGES | age variable>age | None | None | time unit>age unit>year Common Era | None | None | None | None | Numeric | age |
| 41697 | 27490 | 1900 | https://www.ncei.noaa.gov/pub/data/paleo/coral... | d18O | CORALS AND SCLEROSPONGES | chemical composition>isotope>isotope ratio>del... | biological material>organism>coral>Porites sp.... | None | concentration unit>parts per notation unit>par... | None | None | laboratory method>spectroscopy>mass spectromet... | None | Numeric | d18O |
Now that we have all the necessary information, let’s convert it into the LiPD format!
Converting to a LiPD-formatted dataset#
Root Metadata#
Let’s start by creating a pylipd.Dataset object:
dsl = dataset.Dataset()
Let’s add root metadata such as the name of the dataset, type of archive and datasetID:
dsl.setName('Palmyra.Dee.2020')
archiveType = ArchiveType.from_synonym('Coral')
dsl.setArchiveType(archiveType)
dsl.setOriginalDataUrl('https://www.ncei.noaa.gov/pub/data/paleo/coral/east_pacific/palmyra2020d18o.txt')
dsl.setDatasetId('DP2020PC')
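To confirm the root metadata was set, we can read it back. This is a quick sketch assuming the usual PyLiPD getters that mirror each setter (getName, getArchiveType) and the .label attribute used later in this tutorial:
# Quick sanity check on the root metadata
print(dsl.getName())               # Palmyra.Dee.2020
print(dsl.getArchiveType().label)  # Coral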
Let’s add the investigators of the study, which are found in df_summary. To do so, we need to create a Person object:
def extract_last_name_and_initials(full_name):
    parts = full_name.split()
    last_name = parts[-1]
    initials = '.'.join([p[0] for p in parts[:-1]]) + '.'
    return last_name, initials
authors = df_summary['Investigators'].to_list()[0]
# Step 1: Split the string by commas
parts = authors.split(',')
# Prepare a list to hold the formatted names
investigators = []
# Step 2: Iterate over the parts; each part holds one full name
for name in parts:
    last_name, initial = extract_last_name_and_initials(name)
    person = Person()  # create the Person object
    person.setName(f"{last_name}, {initial}")
    investigators.append(person)
# Step 3: Store the list of Persons in the dsl object
dsl.setInvestigators(investigators)
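As a quick check, we can print the parsed names. This sketch assumes Person exposes a getName getter mirroring setName:
# Inspect the parsed investigator names, e.g. ['Dee, S.', 'Cobb, K.', ...]
[p.getName() for p in investigators]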
Publication Metadata#
Let’s add a publication object to our dataset from the information in df_pub:
pub = Publication()
# Let's start with the authors
authors = df_pub['Author'].to_list()[0]
# Step 1: Split the string by commas
parts = authors.split(',')
# Prepare a list to hold the formatted names
investigators = []
# Step 2: Iterate over the parts; each part holds one full name
for name in parts:
    last_name, initial = extract_last_name_and_initials(name)
    person = Person()  # create the Person object
    person.setName(f"{last_name}, {initial}")
    investigators.append(person)
# Step 3: Store the list of Persons in the pub object
pub.setAuthors(investigators)
Let’s get the rest of the information:
pub.setTitle(df_pub['Title'].iloc[0])
pub.setJournal(df_pub['Journal'].iloc[0])
pub.setYear(int(df_pub['Year'].iloc[0]))
pub.setVolume(str(df_pub['Volume'].iloc[0]))
pub.setPages(str(df_pub['Pages'].iloc[0]))
pub.setDOI(df_pub['DOI'].iloc[0])
pub.setUrls([df_pub['URL'].iloc[0]])
pub.setCiteKey(df_pub['CitationKey'].iloc[0])
Let’s add our publication object to the dataset:
dsl.setPublications([pub])
Funding Metadata#
Let’s add information about the funding from df_fund. First, let’s grab the names of all the grants.
def parse_grants(grant_string):
    result = []
    # Split different grant groups using semicolons
    groups = grant_string.split(';')
    for group in groups:
        group = group.strip()
        # Match a prefix followed by a space or hyphen, then numbers
        match = re.match(r'([A-Z]{3})[\s-]+(.+)', group)
        if match:
            prefix = match.group(1)
            number_part = match.group(2)
            numbers = re.findall(r'\d+', number_part)
            result.extend([f'{prefix}-{num}' for num in numbers])
    return result
grants = parse_grants(df_fund['FundingGrant'].iloc[0])
grants
['MGG-0752091', 'MGG-1502832', 'MGG-1836645', 'EAR-1347213']
Let’s create a Funding object and add it to the Dataset:
fund = Funding()
fund.setFundingAgency(df_fund['FundingAgency'].iloc[0])
fund.setGrants(grants)
dsl.setFundings([fund])
Geographic Metadata#
Let’s add information about the location associated with the dataset from df_geo. To do so, we need to create a Location object:
loc = Location()
loc.setLatitude(df_geo['Latitude'].iloc[0])
loc.setLongitude(df_geo['Longitude'].iloc[0])
loc.setElevation(df_geo['MinElevation'].iloc[0])
loc.setSiteName(df_geo['SiteName'].iloc[0])
loc.setLocationName(df_geo['LocationName'].iloc[0])
dsl.setLocation(loc)
PaleoData Metadata#
Let’s enter the information relating to the paleoData. The data itself is stored in df_data, while the metadata about the variables is in df_var. Let’s first create our PaleoData object.
paleodata = PaleoData()
Our next step is to create measurement tables in this object. To do so, we can use the DataTable object:
table = DataTable()
Now let’s add some information about the table, such as its name and the value used for missing values in the data:
table.setFileName("paleo0measurement0.csv")
table.setMissingValue("NaN")
The next step is to add columns to our table. In other words, we need to create some variables.
In LiPD, each variable is also given a unique ID. The function below generates one:
import uuid

def generate_unique_id(prefix='CPD'):
    # Generate a random UUID and strip the hyphens
    # (UUIDs look like '1e2a2846-2048-480b-9ec6-674daef472bd')
    random_uuid = str(uuid.uuid4()).replace('-', '')
    # Keep the first 15 characters and prepend the prefix
    formatted_id = f"{prefix}-{random_uuid[:15]}"
    return formatted_id
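For example, a call with the PCU prefix used below returns the prefix followed by the first 15 hex characters of a random UUID (the exact value changes on every call):
# Example output: 'PCU-1e2a28462048480' (random)
generate_unique_id(prefix='PCU')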
Since variable names and units are controlled in LiPD, let’s see if we can get synonyms before we proceed. Let’s start with the standard variable name:
check_names = {}
for index, row in df_var.iterrows():
    check_names[row['VariableName']] = PaleoVariable.from_synonym(row['VariableName']).label
check_names
{'age': 'age', 'd18O': 'd18O'}
This looks good, let’s have a look at the units:
check_units = {}
for index, row in df_var.iterrows():
    unit = row['cvUnit'].split('>')[-1]
    try:
        check_units[unit] = PaleoUnit.from_synonym(unit).label
    except AttributeError:  # from_synonym returns None when no match is found
        check_units[unit] = None
check_units
{'year Common Era': 'yr AD', 'per mil VPDB': None}
The “per mil VPDB” unit is not recognized automatically (it is not in the thesaurus). Let’s have a look at the standard unit names and see if we can manually select one that matches.
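One way to browse the thesaurus is to list the attributes of the PaleoUnitConstants class imported earlier; this is a quick inspection sketch rather than part of the original workflow:
# List the standard unit names defined on PaleoUnitConstants
[name for name in dir(PaleoUnitConstants) if not name.startswith('_')]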
The permil entry will match, so let’s use this.
variables = []
# Resolution: computed on the time axis (column 0, age), so it is the same for all variables
res = df_data.iloc[:, 0].diff()[1:].to_numpy()
Res = Resolution()  # create a Resolution object
Res.setMinValue(np.min(res))
Res.setMaxValue(np.max(res))
Res.setMeanValue(np.mean(res))
Res.setMedianValue(np.median(res))
from pylipd.globals.urls import UNITSURL
mynewunit = PaleoUnit(f"{UNITSURL}#year", "year")
Res.setUnits(mynewunit)
# Compilation
comp = Compilation() # create a compilation object
comp.setName('Pages2kTemperature')
comp.setVersion('2_2_0')
counter = 0
for index, row in df_var.iterrows():
    var = Variable()
    var.setName(row['VariableName'])  # name of the variable
    # Now let's set the standard name
    var.setStandardVariable(PaleoVariable.from_synonym(row['VariableName']))
    var.setColumnNumber(counter + 1)  # the column in which the data is stored; note that LiPD uses index 1
    var.setVariableId(generate_unique_id(prefix='PCU'))  # create a unique ID for the variable - change prefix if needed
    # Units
    unit = row['cvUnit'].split('>')[-1]
    if unit == 'per mil VPDB':
        var.setUnits(PaleoUnit.from_synonym('permil'))
    else:
        var.setUnits(PaleoUnit.from_synonym(unit))
    # Make sure the data is JSON-serializable (no numpy arrays or Pandas DataFrames)
    var.setValues(json.dumps(df_data.iloc[:, counter].tolist()))
    # Calculate some metadata about the values - this makes it easier to run queries later on,
    # including looking for data in a particular time slice
    var.setMinValue(float(df_data.iloc[:, counter].min()))
    var.setMaxValue(float(df_data.iloc[:, counter].max()))
    var.setMeanValue(float(df_data.iloc[:, counter].mean()))
    var.setMedianValue(float(df_data.iloc[:, counter].median()))
    # Attach the resolution metadata to the variable
    var.setResolution(Res)
    # If the variable is d18O, add information about the compilation
    if row['VariableName'] == 'd18O':
        var.setPartOfCompilation(comp)
    # Append to the list of variables
    variables.append(var)
    # Increment the counter
    counter += 1
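As a sanity check, we can confirm that one Variable was created per column; this sketch assumes Variable exposes a getName getter mirroring setName:
# We expect two variables: age and d18O
[v.getName() for v in variables]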
Let’s now put our variables in the DataTable:
table.setVariables(variables)
Next, the DataTable goes into the PaleoData object:
paleodata.setMeasurementTables([table])
And finally, the PaleoData object goes into the Dataset:
dsl.setPaleoData([paleodata])
Writing a LiPD file#
The last step in this process is to write to a LiPD file. To do so, you need to pass the Dataset dsl back into a LiPD object:
lipd = LiPD()
lipd.load_datasets([dsl])
lipd.create_lipd(dsl.getName(), "../data/Palmyra.Dee.2020.lpd");
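To verify that the file was written correctly, you can load it back into a fresh LiPD object (a round-trip sketch; the path matches the ../data folder used above):
# Round-trip check: reload the file we just wrote
lipd_check = LiPD()
lipd_check.load("../data/Palmyra.Dee.2020.lpd")
print(lipd_check.get_all_dataset_names())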
References#
Dee, S. G., Cobb, K. M., Emile-Geay, J., Ault, T. R., Edwards, R. L., Cheng, H., & Charles, C. D. (2020). No consistent ENSO response to volcanic forcing over the last millennium. Science, 367(6485), 1477–1481. doi:10.1126/science.aax2000.