
Converting a NOAA dataset into the LiPD format#
Preamble#
This tutorial showcases how to create a LiPD file from a NOAA dataset.
Goals#
Understand how to retrieve relevant information from the NOAA Data Center.
Use PyLiPD to create a LiPD file from the retrieved information.
Pre-requisites#
An understanding of PyLiPD classes and how to create a LiPD file, as covered in this tutorial.
Reading time#
Let’s import our packages!
from pyleotups import Dataset
import pylipd.classes.dataset as dataset
from pylipd.classes.archivetype import ArchiveTypeConstants, ArchiveType
from pylipd.classes.funding import Funding
from pylipd.classes.interpretation import Interpretation
from pylipd.classes.interpretationvariable import InterpretationVariableConstants, InterpretationVariable
from pylipd.classes.location import Location
from pylipd.classes.paleodata import PaleoData
from pylipd.classes.datatable import DataTable
from pylipd.classes.paleounit import PaleoUnitConstants, PaleoUnit
from pylipd.classes.paleovariable import PaleoVariableConstants, PaleoVariable
from pylipd.classes.person import Person
from pylipd.classes.publication import Publication
from pylipd.classes.resolution import Resolution
from pylipd.classes.variable import Variable
from pylipd.classes.model import Model
from pylipd.classes.chrondata import ChronData
from pylipd.classes.compilation import Compilation
from pylipd import LiPD
import json
import numpy as np
import pandas as pd
import re
Demonstration#
Querying the NOAA database#
The first step is to create a pyleotups.Dataset object in which we will store the relevant information:
ds = Dataset()
Now that this is done, let’s search using the study ID. Here we are interested in converting the dataset from Dee et al. (2020), a monthly-resolved coral aragonite oxygen isotope (\(\delta^{18}O\)) record from Palmyra Island.
ds.search_studies(noaa_id=27490)
Parsing NOAA studies: 100%|█████████████████████| 1/1 [00:00<00:00, 3194.44it/s]
|   | StudyID | XMLID | StudyName | DataType | EarliestYearBP | MostRecentYearBP | EarliestYearCE | MostRecentYearCE | StudyNotes | ScienceKeywords | Investigators | Publications | Sites | Funding |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 27490 | 67635 | Palmyra 860 Year Modern and Fossil Coral Oxyge... | CORALS AND SCLEROSPONGES | 804 | -56 | 1146 | 2006 | Monthly-resolved coral aragonite oxygen isotop... | None | Sylvia Dee, Kim Cobb, Julien Emile-Geay, Toby ... | [{'Author': 'Sylvia G. Dee, Kim M. Cobb, Julie... | [[{'DataTableID': '41697', 'DataTableName': 'P... | [{'fundingAgency': 'US National Science Founda... |
Let’s get some information about the dataset that we can later use to create the LiPD dataset.
df_summary = ds.get_summary()
df_summary
|   | StudyID | XMLID | StudyName | DataType | EarliestYearBP | MostRecentYearBP | EarliestYearCE | MostRecentYearCE | StudyNotes | ScienceKeywords | Investigators | Publications | Sites | Funding |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 27490 | 67635 | Palmyra 860 Year Modern and Fossil Coral Oxyge... | CORALS AND SCLEROSPONGES | 804 | -56 | 1146 | 2006 | Monthly-resolved coral aragonite oxygen isotop... | None | Sylvia Dee, Kim Cobb, Julien Emile-Geay, Toby ... | [{'Author': 'Sylvia G. Dee, Kim M. Cobb, Julie... | [[{'DataTableID': '41697', 'DataTableName': 'P... | [{'fundingAgency': 'US National Science Founda... |
And some information about the publications:
bib, df_pub = ds.get_publications()
df_pub.head()
|   | Author | Title | Journal | Year | Volume | Number | Pages | Type | DOI | URL | CitationKey | StudyID | StudyName |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Sylvia G. Dee, Kim M. Cobb, Julien Emile-Geay,... | No consistent ENSO response to volcanic forcin... | Science | 2020 | 367 | 6485 | 1477-1481 | publication | 10.1126/science.aax2000 | http://dx.doi.org/10.1126/science.aax2000 | Charles_Consistent_2020_27490 | 27490 | Palmyra 860 Year Modern and Fossil Coral Oxyge... |
Let’s get the geographical information:
df_geo = ds.get_geo()
df_geo
|   | StudyID | DataType | SiteID | SiteName | LocationName | Latitude | Longitude | MinElevation | MaxElevation |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 27490 | CORALS AND SCLEROSPONGES | 1900 | Palmyra Island | Ocean>Pacific Ocean>Central Pacific Ocean | 5.8664 | -162.12 | -9 | -9 |
Let’s get some information about the funding:
df_fund = ds.get_funding()
df_fund
|   | StudyID | StudyName | FundingAgency | FundingGrant |
|---|---|---|---|---|
| 0 | 27490 | Palmyra 860 Year Modern and Fossil Coral Oxyge... | US National Science Foundation | MGG 0752091, 1502832, 1836645; EAR-1347213 |
Finally, let’s get information about the data tables present in the file:
df_tables = ds.get_tables()
df_tables
|   | DataTableID | DataTableName | TimeUnit | FileURL | Variables | FileDescription | TotalFilesAvailable | SiteID | SiteName | LocationName | Latitude | Longitude | MinElevation | MaxElevation | StudyID | StudyName |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41697 | Palmyra2020d18O | CE | https://www.ncei.noaa.gov/pub/data/paleo/coral... | [age, d18O] | NOAA Template File | 1 | 1900 | Palmyra Island | Ocean>Pacific Ocean>Central Pacific Ocean | 5.8664 | -162.12 | -9 | -9 | 27490 | Palmyra 860 Year Modern and Fossil Coral Oxyge... |
There is only one table present, so let’s put it into a Pandas DataFrame:
dfs = ds.get_data(dataTableIDs="41697")
dfs[0].head()
|   | age | d18O |
|---|---|---|
| 0 | 1146.375 | -4.749 |
| 1 | 1146.4583 | -4.672 |
| 2 | 1146.5417 | -4.724 |
| 3 | 1146.625 | -4.717 |
| 4 | 1146.7083 | -4.8947 |
We need to coerce the values to numeric:
df_data = dfs[0].apply(pd.to_numeric, errors='coerce')
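To confirm the coercion worked, we can inspect the column dtypes (a quick sanity check using standard pandas; this cell is not part of the original workflow):
# Both columns should now be numeric (float64)
df_data.dtypes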
And let’s get some variable information:
df_var = ds.get_variables(dataTableIDs="41697")
df_var
| DataTableID | StudyID | SiteID | FileURL | VariableName | cvDataType | cvWhat | cvMaterial | cvError | cvUnit | cvSeasonality | cvDetail | cvMethod | cvAdditionalInfo | cvFormat | cvShortName |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 41697 | 27490 | 1900 | https://www.ncei.noaa.gov/pub/data/paleo/coral... | age | CORALS AND SCLEROSPONGES | age variable>age | None | None | time unit>age unit>year Common Era | None | None | None | None | Numeric | age |
| 41697 | 27490 | 1900 | https://www.ncei.noaa.gov/pub/data/paleo/coral... | d18O | CORALS AND SCLEROSPONGES | chemical composition>isotope>isotope ratio>del... | biological material>organism>coral>Porites sp.... | None | concentration unit>parts per notation unit>par... | None | None | laboratory method>spectroscopy>mass spectromet... | None | Numeric | d18O |
Now that we have all the necessary information, let’s convert it into the LiPD format!
Converting to a LiPD-formatted dataset#
Root Metadata#
Let’s start by creating a pylipd.Dataset object:
dsl = dataset.Dataset()
Let’s add root metadata such as the name of the dataset, type of archive and datasetID:
dsl.setName('Palmyra.Dee.2020')
archiveType = ArchiveType.from_synonym('Coral')
dsl.setArchiveType(archiveType)
dsl.setOriginalDataUrl('https://www.ncei.noaa.gov/pub/data/paleo/coral/east_pacific/palmyra2020d18o.txt')
dsl.setDatasetId('DP2020PC')
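To confirm the root metadata was set, we can read it back. This is a quick sketch assuming the usual PyLiPD getters that mirror each setter (getName, getArchiveType) and the .label attribute used later in this tutorial:
# Quick sanity check on the root metadata
print(dsl.getName())               # Palmyra.Dee.2020
print(dsl.getArchiveType().label)  # Coral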
Let’s add the investigators of the study, which are found in df_summary. To do so, we need to create a Person object:
def extract_last_name_and_initials(full_name):
    parts = full_name.split()
    last_name = parts[-1]
    initials = '.'.join([p[0] for p in parts[:-1]]) + '.'
    return last_name, initials
authors = df_summary['Investigators'].to_list()[0]
# Step 1: Split the string by commas
parts = authors.split(',')
# Prepare a list to hold the formatted names
investigators = []
# Step 2: Iterate over the parts; each part holds one full name
for name in parts:
    last_name, initial = extract_last_name_and_initials(name)
    person = Person()  # create the Person object
    person.setName(f"{last_name}, {initial}")
    investigators.append(person)
# Step 3: Store the list of Persons in the dsl object
dsl.setInvestigators(investigators)
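As a quick check, we can print the parsed names. This sketch assumes Person exposes a getName getter mirroring setName:
# Inspect the parsed investigator names, e.g. ['Dee, S.', 'Cobb, K.', ...]
[p.getName() for p in investigators]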
Publication Metadata#
Let’s add a publication object to our dataset from the information in df_pub:
pub = Publication()
# Let's start with the authors
authors = df_pub['Author'].to_list()[0]
# Step 1: Split the string by commas
parts = authors.split(',')
# Prepare a list to hold the formatted names
investigators = []
# Step 2: Iterate over the parts; each part holds one full name
for name in parts:
    last_name, initial = extract_last_name_and_initials(name)
    person = Person()  # create the Person object
    person.setName(f"{last_name}, {initial}")
    investigators.append(person)
# Step 3: Store the list of Persons in the pub object
pub.setAuthors(investigators)
Let’s get the rest of the information:
pub.setTitle(df_pub['Title'].iloc[0])
pub.setJournal(df_pub['Journal'].iloc[0])
pub.setYear(int(df_pub['Year'].iloc[0]))
pub.setVolume(str(df_pub['Volume'].iloc[0]))
pub.setPages(str(df_pub['Pages'].iloc[0]))
pub.setDOI(df_pub['DOI'].iloc[0])
pub.setUrls([df_pub['URL'].iloc[0]])
pub.setCiteKey(df_pub['CitationKey'].iloc[0])
Let’s add our publication object to the dataset:
dsl.setPublications([pub])
Funding Metadata#
Let’s add information about the funding from df_fund. First, let’s grab the names of all the grants.
def parse_grants(grant_string):
    result = []
    # Split different grant groups using semicolons
    groups = grant_string.split(';')
    for group in groups:
        group = group.strip()
        # Match a prefix followed by a space or hyphen, then numbers
        match = re.match(r'([A-Z]{3})[\s-]+(.+)', group)
        if match:
            prefix = match.group(1)
            number_part = match.group(2)
            numbers = re.findall(r'\d+', number_part)
            result.extend([f'{prefix}-{num}' for num in numbers])
    return result
grants = parse_grants(df_fund['FundingGrant'].iloc[0])
grants
['MGG-0752091', 'MGG-1502832', 'MGG-1836645', 'EAR-1347213']
Let’s create a Funding object and add it to the Dataset:
fund = Funding()
fund.setFundingAgency(df_fund['FundingAgency'].iloc[0])
fund.setGrants(grants)
dsl.setFundings([fund])
Geographic Metadata#
Let’s add information about the location associated with the dataset from df_geo. To do so, we need to create a Location object:
loc = Location()
loc.setLatitude(df_geo['Latitude'].iloc[0])
loc.setLongitude(df_geo['Longitude'].iloc[0])
loc.setElevation(df_geo['MinElevation'].iloc[0])
loc.setSiteName(df_geo['SiteName'].iloc[0])
loc.setLocationName(df_geo['LocationName'].iloc[0])
dsl.setLocation(loc)
PaleoData Metadata#
Let’s enter the information relating to the paleoData. The data itself is stored in df_data, while the metadata about the variables is in df_var. Let’s first create our PaleoData object.
paleodata = PaleoData()
Our next step is to create measurement tables in this object. To do so, we can use the DataTable object:
table = DataTable()
Now let’s add some information about the table, such as its name and the value used for missing values in the data:
table.setFileName("paleo0measurement0.csv")
table.setMissingValue("NaN")
The next step is to add columns to our table. In other words, we need to create some variables.
In LiPD, each variable is also given a unique ID. The function below generates one:
import uuid

def generate_unique_id(prefix='CPD'):
    # Generate a random UUID and strip the hyphens
    # (UUIDs look like '1e2a2846-2048-480b-9ec6-674daef472bd')
    random_uuid = str(uuid.uuid4()).replace('-', '')
    # Keep the first 15 characters and prepend the prefix
    formatted_id = f"{prefix}-{random_uuid[:15]}"
    return formatted_id
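For example, a call with the PCU prefix used below returns the prefix followed by the first 15 hex characters of a random UUID (the exact value changes on every call):
# Example output: 'PCU-1e2a28462048480' (random)
generate_unique_id(prefix='PCU')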
Since variable names and units are controlled in LiPD, let’s see if we can get synonyms before we proceed. Let’s start with the standard variable name:
check_names = {}
for index, row in df_var.iterrows():
    check_names[row['VariableName']] = PaleoVariable.from_synonym(row['VariableName']).label
check_names
{'age': 'age', 'd18O': 'd18O'}
This looks good, let’s have a look at the units:
check_units = {}
for index, row in df_var.iterrows():
    unit = row['cvUnit'].split('>')[-1]
    try:
        check_units[unit] = PaleoUnit.from_synonym(unit).label
    except AttributeError:  # from_synonym returns None when no match is found
        check_units[unit] = None
check_units
{'year Common Era': 'yr AD', 'per mil VPDB': None}
The “per mil VPDB” unit is not recognized automatically (it is not in the thesaurus). Let’s have a look at the standard unit names and see if we can manually select one that matches.
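One way to browse the thesaurus is to list the attributes of the PaleoUnitConstants class imported earlier; this is a quick inspection sketch rather than part of the original workflow:
# List the standard unit names defined on PaleoUnitConstants
[name for name in dir(PaleoUnitConstants) if not name.startswith('_')]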
The permil entry will match, so let’s use this.
variables = []
# Resolution: computed on the time axis (column 0, age), so it is the same for all variables
res = df_data.iloc[:, 0].diff()[1:].to_numpy()
Res = Resolution()  # create a Resolution object
Res.setMinValue(np.min(res))
Res.setMaxValue(np.max(res))
Res.setMeanValue(np.mean(res))
Res.setMedianValue(np.median(res))
from pylipd.globals.urls import UNITSURL
mynewunit = PaleoUnit(f"{UNITSURL}#year", "year")
Res.setUnits(mynewunit)
# Compilation
comp = Compilation() # create a compilation object
comp.setName('Pages2kTemperature')
comp.setVersion('2_2_0')
counter = 0
for index, row in df_var.iterrows():
    var = Variable()
    var.setName(row['VariableName'])  # name of the variable
    # Now let's set the standard name
    var.setStandardVariable(PaleoVariable.from_synonym(row['VariableName']))
    var.setColumnNumber(counter + 1)  # the column in which the data is stored; note that LiPD uses index 1
    var.setVariableId(generate_unique_id(prefix='PCU'))  # create a unique ID for the variable - change prefix if needed
    # Units
    unit = row['cvUnit'].split('>')[-1]
    if unit == 'per mil VPDB':
        var.setUnits(PaleoUnit.from_synonym('permil'))
    else:
        var.setUnits(PaleoUnit.from_synonym(unit))
    # Make sure the data is JSON-serializable (no numpy arrays or Pandas DataFrames)
    var.setValues(json.dumps(df_data.iloc[:, counter].tolist()))
    # Calculate some metadata about the values - this makes it easier to run queries later on,
    # including looking for data in a particular time slice
    var.setMinValue(float(df_data.iloc[:, counter].min()))
    var.setMaxValue(float(df_data.iloc[:, counter].max()))
    var.setMeanValue(float(df_data.iloc[:, counter].mean()))
    var.setMedianValue(float(df_data.iloc[:, counter].median()))
    # Attach the resolution metadata to the variable
    var.setResolution(Res)
    # If the variable is d18O, add information about the compilation
    if row['VariableName'] == 'd18O':
        var.setPartOfCompilation(comp)
    # Append to the list of variables
    variables.append(var)
    # Increment the counter
    counter += 1
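As a sanity check, we can confirm that one Variable was created per column; this sketch assumes Variable exposes a getName getter mirroring setName:
# We expect two variables: age and d18O
[v.getName() for v in variables]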
Let’s now put our variables in the DataTable:
table.setVariables(variables)
Next, the DataTable goes into the PaleoData object:
paleodata.setMeasurementTables([table])
And finally, the PaleoData object goes into the Dataset:
dsl.setPaleoData([paleodata])
Writing a LiPD file#
The last step in this process is to write to a LiPD file. To do so, you need to pass the Dataset dsl back into a LiPD object:
lipd = LiPD()
lipd.load_datasets([dsl])
lipd.create_lipd(dsl.getName(), "../data/Palmyra.Dee.2020.lpd");
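To verify that the file was written correctly, you can load it back into a fresh LiPD object (a round-trip sketch; the path matches the ../data folder used above):
# Round-trip check: reload the file we just wrote
lipd_check = LiPD()
lipd_check.load("../data/Palmyra.Dee.2020.lpd")
print(lipd_check.get_all_dataset_names())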
References#
Dee, S. G., Cobb, K. M., Emile-Geay, J., Ault, T. R., Edwards, R. L., Cheng, H., & Charles, C. D. (2020). No consistent ENSO response to volcanic forcing over the last millennium. Science, 367(6485), 1477–1481. doi:10.1126/science.aax2000.