Editing LiPD Files#
Preamble#
Now that we have learned about the Dataset class and how to extract information from it, let’s edit a LiPD file. We will be considering three main instances of editing: (1) changing existing information, (2) adding new metadata, and (3) adding an ensemble table.
Before we start, have a look at the documentation on the LiPD classes module. If you click on any of the classes, you should notice a pattern in the associated methods:
get + PropertyName allows you to retrieve the values associated with a property.
set + PropertyName allows you to set or change the value of an existing property with another one of type string, float, integer, or boolean. If the property value is a list, set will replace any existing value already present in the metadata (refer to the diagram below for the expected type).
add + PropertyName allows you to add a value to an existing property that takes a list.
In addition, there are two functionalities that allow you to add your own custom properties: set_non_standard_property and add_non_standard_property. For now, these functions can only be used for values that do not require a new class to be created.
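To make the naming pattern concrete, here is a minimal sketch using the Publication, Person, and Dataset classes (we will use all of these for real later in this notebook; ds_demo is a throwaway object created only for illustration):

```python
from pylipd.classes.dataset import Dataset
from pylipd.classes.publication import Publication
from pylipd.classes.person import Person

pub = Publication()                             # instantiate an empty Publication
pub.setTitle('An Example Title')                # set + PropertyName: scalar property
author = Person(); author.setName('Ada')        # set + PropertyName on a Person
pub.setAuthors([author])                        # set + PropertyName: replaces the whole list

ds_demo = Dataset()                             # throwaway Dataset, for illustration only
ds_demo.addPublication(pub)                     # add + PropertyName: appends to a list-valued property
print(ds_demo.getPublications()[0].getTitle())  # get + PropertyName: retrieves the value
```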
Goals#
Edit a LiPD-formatted dataset
Add information in a new object (e.g., publication information)
Add an ensemble table
Save the edited dataset to a new file
Reading Time: 5 minutes
Keywords#
LiPD, LinkedEarth Ontology, Object-Oriented Programming
Pre-requisites#
An understanding of OOP and the LinkedEarth Ontology. Completion of Dataset class example.
Data Description#
We will be working with a hypothetical marine sedimentary record of \(\delta^{18}\)O and Mg/Ca so we can edit the file without worrying about the accuracy of a specific record. The idealized record was converted into the LiPD format using the LiPD playground and is made available on the GitHub repository for these tutorials.
Demonstration#
Let’s import the necessary packages.
from pylipd.classes.dataset import Dataset
from pylipd.lipd import LiPD
import pandas as pd
import numpy as np
import re
The next cell defines a function to generate unique identifiers for variables, called TSid in LiPD:
import uuid
def generate_unique_id(prefix='PYD'):
# Generate a random UUID
random_uuid = uuid.uuid4() # Generates a random UUID.
# Convert UUID format to the specific format we need
# UUID is usually in the form '1e2a2846-2048-480b-9ec6-674daef472bd' so we slice and insert accordingly
id_str = str(random_uuid)
formatted_id = f"{prefix}-{id_str[:5]}-{id_str[9:13]}-{id_str[14:18]}-{id_str[19:23]}-{id_str[24:28]}"
return formatted_id
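A quick call shows the output format (the exact characters differ on every run since the ID is random):

```python
generate_unique_id()  # e.g., 'PYD-1e2a2-2048-480b-9ec6-674d'
```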
Let’s load our idealized dataset:
path = '../data/MyWonderfulRecord.LinkedEarth.2024.lpd'
D = LiPD()
D.load(path)
Loading 1 LiPD files
Loaded..
Now, let’s export to a Dataset object:
ds = D.get_datasets()[0]
Editing an existing property#
For this example, we will assume that there is an error in the geographical coordinates, and we will correct the longitude. First, we need to get to the geo object. When in doubt about how to navigate the file, you can use the LinkedEarth ontology or the handy diagram shown in the preamble.
From this, you can see that the information associated with the Location can be obtained from the Dataset. A quick check of the documentation tells you that you can use the getLocation function to do so:
geo = ds.getLocation()
lon = geo.getLongitude()
lon
170.9
To change the existing longitude to the corrected value (here, we will assume 165.9), you can use the setLongitude function. Notice that the longitude value should be input as a string:
geo.setLongitude('165.9')
geo.getLongitude()
'165.9'
We have successfully changed the longitude to its correct value! You can also use the set + PropertyName functions to add information (not just correct it). For instance, this record doesn’t have a SiteName:
geo.getSiteName()
Let’s change this to WonderfulCore:
geo.setSiteName('WonderfulCore')
geo.getSiteName()
'WonderfulCore'
So far, we have looked at adding or editing the values of existing properties.
Creating new properties#
Many datasets on the Lipdverse and the associated LiPDGraph come from working groups compiling datasets for a particular purpose. In this case, it may be useful to create a temporary property associated with a specific variable (i.e., column) in the dataset to indicate its use in an analysis. For instance, the Pages2k Temperature working group used the property usedInGlobalTemperatureAnalysis as a flag marking the column to be used for temperature reconstructions.
As an example, we will add a property called forTempAnalysis to the variable temperature in our dataset. To do so, you can use the set_non_standard_property function. This function is available for each of the classes present in PyLiPD and takes a (key, value) pair as input, with the key representing the property name and the value being the value associated with that property.
pattern = r'temperatures?' # look for "temperature" or "temperatures" in the varname
for pdata in ds.getPaleoData(): # loop through all possible paleodata object
for table in pdata.getMeasurementTables(): # Loop through the measurement tables
for var in table.getVariables(): # Loop through the variables in the table
if re.search(pattern, var.getName(), re.IGNORECASE):
var.set_non_standard_property('forTempAnalysis',True)
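As an optional sanity check, the new flag should now show up in the variable-level metadata when the table is exported to a DataFrame (we do exactly that below); here is one way to look it up directly:

```python
# Confirm the flag landed in the temperature column's metadata
for pdata in ds.getPaleoData():
    for table in pdata.getMeasurementTables():
        attrs = table.getDataFrame(use_standard_names=True).attrs
        print(attrs.get('temperature', {}).get('forTempAnalysis'))  # expected: True
```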
Adding new information from classes#
Adding Publication information#
You may have several publications associated with a particular dataset. Therefore, publications are stored in a list, in which each item represents a Publication object. To add one in PyLiPD, you will be using functions of the form add + PropertyName.
Let’s add a publication to our dataset:
from pylipd.classes.publication import Publication
pub = Publication() # instantiate the object
pub.setTitle('Publication Title')
Now that we have created the object and entered the title, let’s add authors. Looking at the diagram in the preamble, authors can be set as a list of Person objects. Let’s create two authors and add them to our publication object:
from pylipd.classes.person import Person
person1 = Person(); person1.setName("Deborah Khider")
person2 = Person(); person2.setName("Varun Ratnakar")
pub.setAuthors([person1, person2])
Let’s add a bit more information:
pub.setJournal('Journal Name')
pub.setPages('1-12')
pub.setVolume('1')
pub.setYear('2014')
Let’s add the Publication information to our Dataset:
ds.addPublication(pub)
Ok, let’s have a look at our work:
print(ds.getPublications()[0].getTitle())
Publication Title
Adding an Ensemble Table#
A common task in paleoclimate studies is to create possible realizations of the age model using Bayesian age modeling software such as Bchron (Haslett and Parnell, 2008), BACON (Blaauw and Christen, 2011), or OxCal (Bronk Ramsey, 2008).
For this example, we will create a “dummy” ensemble table as a numpy.array from the existing data.
So first, let’s grab the age values from the measurement table:
data_tables = []
for paleoData in ds.getPaleoData(): # loop over the various PaleoData objects
for table in paleoData.getMeasurementTables(): #get the measurement tables
df = table.getDataFrame(use_standard_names=True) # grab the data and standardize the variable names
data_tables.append(df)
data_tables[0].head()
|   | depth | temperature | Mg/Ca | year | d18O |
|---|-------|-------------|-------|------|------|
| 0 | 0.5 | 28.686774 | 5.023996 | 1981.30 | -4.176004 |
| 1 | 1.0 | 28.853606 | 5.100000 | 1961.30 | -4.100000 |
| 2 | 1.5 | 29.017971 | 5.176004 | 1946.37 | -4.023996 |
| 3 | 2.0 | 29.661152 | 5.484465 | 1952.00 | -3.715535 |
| 4 | 2.5 | 27.982737 | 4.715535 | 1906.37 | -4.484465 |
Let’s have a look at the metadata for each variable (i.e., column), stored in the DataFrame attributes:
data_tables[0].attrs
{'depth': {'@id': 'http://linked.earth/lipd/paleo0measurement0.WEB-1e2a2-4620-480b-9ec6-674da.depth',
'interpretation': [{'@id': 'http://linked.earth/lipd/paleo0measurement0.WEB-1e2a2-4620-480b-9ec6-674da.depth.Interpretation1'}],
'archiveType': 'Marine sediment',
'number': 1,
'hasMaxValue': 23.0,
'hasMeanValue': 11.75,
'hasMedianValue': 10.75,
'hasMinValue': 0.5,
'variableName': 'depth',
'resolution': {'@id': 'http://linked.earth/lipd/paleo0measurement0.WEB-1e2a2-4620-480b-9ec6-674da.depth.Resolution',
'hasMaxValue': 66.52999999999997,
'hasMeanValue': 20.9051111111111,
'hasMedianValue': 3.1700000000000728,
'hasMinValue': 3.1699999999998454},
'hasStandardVariable': 'depth',
'units': 'cm',
'TSid': 'WEB-1e2a2-4620-480b-9ec6-674da',
'variableType': 'measured',
'proxyObservationType': 'depth'},
'temperature': {'@id': 'http://linked.earth/lipd/paleo0measurement0.WEB-5dd32-9b60-4662-9b73-ad1b3.temperature',
'interpretation': [{'@id': 'http://linked.earth/lipd/paleo0measurement0.WEB-5dd32-9b60-4662-9b73-ad1b3.temperature.Interpretation1'}],
'archiveType': 'Marine sediment',
'number': 5,
'hasMaxValue': 30.27867291,
'hasMeanValue': 28.91042688586957,
'hasMedianValue': 28.894521055,
'hasMinValue': 27.21849706,
'variableName': 'temperature',
'resolution': {'@id': 'http://linked.earth/lipd/paleo0measurement0.WEB-5dd32-9b60-4662-9b73-ad1b3.temperature.Resolution',
'hasMaxValue': 66.52999999999997,
'hasMeanValue': 20.9051111111111,
'hasMedianValue': 3.1700000000000728,
'hasMinValue': 3.1699999999998454},
'hasStandardVariable': 'temperature',
'units': 'degC',
'TSid': 'WEB-5dd32-9b60-4662-9b73-ad1b3',
'variableType': 'inferred',
'inferredVariableType': 'temperature',
'forTempAnalysis': True},
'Mg/Ca': {'@id': 'http://linked.earth/lipd/paleo0measurement0.WEB-c745d-99e8-4f77-9042-748f9.Mg_Ca',
'interpretation': [{'@id': 'http://linked.earth/lipd/paleo0measurement0.WEB-c745d-99e8-4f77-9042-748f9.Mg_Ca.Interpretation1',
'direction': 'positive',
'seasonality': 'Annual',
'variable': 'temperature',
'variableDetail': 'Sea surface'}],
'archiveType': 'Marine sediment',
'number': 3,
'description': 'Obtained from G.ruber',
'hasMaxValue': 5.797904362,
'hasMeanValue': 5.132975283934784,
'hasMedianValue': 5.118849202,
'hasMinValue': 4.402095638,
'variableName': 'Mg/Ca',
'proxy': 'Mg/Ca',
'resolution': {'@id': 'http://linked.earth/lipd/paleo0measurement0.WEB-c745d-99e8-4f77-9042-748f9.Mg_Ca.Resolution',
'hasMaxValue': 66.52999999999997,
'hasMeanValue': 20.9051111111111,
'hasMedianValue': 3.1700000000000728,
'hasMinValue': 3.1699999999998454},
'hasStandardVariable': 'Mg/Ca',
'units': 'permil',
'TSid': 'WEB-c745d-99e8-4f77-9042-748f9',
'variableType': 'measured',
'proxyObservationType': 'Mg/Ca'},
'year': {'@id': 'http://linked.earth/lipd/paleo0measurement0.WEB-e6f66-1365-427c-af34-363a2.year',
'interpretation': [{'@id': 'http://linked.earth/lipd/paleo0measurement0.WEB-e6f66-1365-427c-af34-363a2.year.Interpretation1'}],
'archiveType': 'Marine sediment',
'number': 2,
'hasMaxValue': 1981.3,
'hasMeanValue': 1572.5945652173914,
'hasMedianValue': 1568.7150000000001,
'hasMinValue': 1187.49,
'variableName': 'year',
'hasStandardVariable': 'year',
'units': 'yr AD',
'TSid': 'WEB-e6f66-1365-427c-af34-363a2',
'variableType': 'inferred',
'inferredVariableType': 'year'},
'd18O': {'@id': 'http://linked.earth/lipd/paleo0measurement0.WEB-f24e5-dcca-4743-b487-5c6fd.d18O',
'interpretation': [{'@id': 'http://linked.earth/lipd/paleo0measurement0.WEB-f24e5-dcca-4743-b487-5c6fd.d18O.Interpretation1'}],
'archiveType': 'Marine sediment',
'number': 4,
'hasMaxValue': -3.402095638,
'hasMeanValue': -4.067024716065219,
'hasMedianValue': -4.0811507979999995,
'hasMinValue': -4.797904362,
'variableName': 'd18O',
'proxy': 'd18O',
'resolution': {'@id': 'http://linked.earth/lipd/paleo0measurement0.WEB-f24e5-dcca-4743-b487-5c6fd.d18O.Resolution',
'hasMaxValue': 66.52999999999997,
'hasMeanValue': 20.9051111111111,
'hasMedianValue': 3.1700000000000728,
'hasMinValue': 3.1699999999998454},
'hasStandardVariable': 'd18O',
'units': 'permil',
'TSid': 'WEB-f24e5-dcca-4743-b487-5c6fd',
'variableType': 'measured',
'proxyObservationType': 'd18O'}}
time = data_tables[0]['year'].to_numpy()
Create 1000 realizations of the age ensemble by drawing from a normal distribution centered on the time vector, with a standard deviation of 5 years.
std_dev = 5
num_draws = 1000
#Generate ensemble
ens = np.random.normal(loc=time[:, np.newaxis], scale=std_dev, size=(len(time), num_draws))
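The resulting array has one row per depth and one column per ensemble member; a quick (optional) shape check confirms this:

```python
ens.shape  # (len(time), 1000): one row per depth, one column per draw
```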
Next, let’s insert the table into our dataset. First, let’s check whether there are any ChronData objects we should attach the table to:
ds.getChronData()
[]
Since there are none, we will have to create a ChronData object in addition to a Table and a Model object. Let’s start by creating a Table object for the ensemble we have just created:
from pylipd.classes.datatable import DataTable
ensemble_table = DataTable()
To add content to our new DataTable object, the easiest route is to use the setDataFrame method. To do so, we must first generate a DataFrame similar to the one we read in the L3_dataset_class notebook. In summary, the DataFrame contains two columns: one for depth and one for age, where the realizations in each row are stored as a vector:
# Initialize empty DataFrame with the depth column
df_ens = pd.DataFrame({'depth': data_tables[0]['depth'].tolist()})
# Add the year data - each row will contain one vector from array_data
df_ens['year'] = [ens[i,:].tolist() for i in range(len(time))]
df_ens.head()
|   | depth | year |
|---|-------|------|
| 0 | 0.5 | [1979.2648387681054, 1985.768191076515, 1984.7... |
| 1 | 1.0 | [1960.524410715646, 1960.316468547741, 1970.30... |
| 2 | 1.5 | [1949.4173621471534, 1943.741248943736, 1946.8... |
| 3 | 2.0 | [1957.6899080837127, 1946.2838848540448, 1946.... |
| 4 | 2.5 | [1906.6000150368072, 1900.4965541754896, 1906.... |
Add attributes to the pandas DataFrame to store the metadata. Since each row of year holds 1000 ensemble members, the year entry lists column numbers 2 through 1001, while depth is column 1.
num_year_columns = len(ens[0,:])
year_columns = [i+2 for i in range(num_year_columns)]
df_ens.attrs = {
'year': {'number': str(year_columns), 'variableName': 'year', 'units': 'yr AD', 'TSid':generate_unique_id()},
'depth': {'number': 1, 'variableName': 'depth', 'units': 'cm', 'TSid':generate_unique_id()}
}
Incorporate into the LiPD structure.
ensemble_table.setDataFrame(df_ens)
ensemble_table.setFileName("chron0model0ensemble0.csv")
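(Optional) To double-check what we attached, we can read the DataFrame back out of the table; this assumes getDataFrame round-trips what setDataFrame stored, as it did for the measurement tables earlier:

```python
ensemble_table.getDataFrame(use_standard_names=False).head()  # should match df_ens
```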
Now add the table to a model:
from pylipd.classes.model import Model
model = Model()
model.addEnsembleTable(ensemble_table)
And add the Model to a ChronData object:
from pylipd.classes.chrondata import ChronData
cd = ChronData()
cd.addModeledBy(model)
Finally add the ChronData to our Dataset:
ds.addChronData(cd)
Voila! Let’s check that we now have a ChronData object in our dataset:
ds.getChronData()
[<pylipd.classes.chrondata.ChronData at 0x7f813584cdc0>]
Writing to a LiPD file#
The last step in this process is to write our edited dataset back to a LiPD file. To do so, you need to pass the Dataset ds back into a LiPD object:
lipd = LiPD()
lipd.load_datasets([ds])
lipd.create_lipd(ds.getName(), "../data/MyWonderfulCorev2.LinkedEarth.2024.lpd");
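As a final (optional) sanity check, we can reload the file we just wrote and confirm that our edits survived the round trip:

```python
# Reload the newly written file and spot-check the edits
check = LiPD()
check.load('../data/MyWonderfulCorev2.LinkedEarth.2024.lpd')
ds2 = check.get_datasets()[0]
print(ds2.getLocation().getLongitude())     # expected: 165.9
print(ds2.getPublications()[0].getTitle())  # expected: Publication Title
print(len(ds2.getChronData()))              # expected: 1
```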
References#
Blaauw, M., & Christen, J. A. (2011). Flexible paleoclimate age-depth models using an autoregressive gamma process. Bayesian Analysis, 6(3), 457–474.
Bronk Ramsey, C. (2008). Deposition models for chronological records. Quaternary Science Reviews, 27, 42–60.
Haslett, J., & Parnell, A. (2008). A simple monotone process with application to radiocarbon-dated depth chronologies. Journal of the Royal Statistical Society: Series C, 57, 399–418.