Editing LiPD Files#

Authors#

Deborah Khider

Preamble#

Now that we have learned about the Dataset class and how to extract information from it, let’s edit a LiPD file. We will consider three main editing tasks: (1) changing existing information, (2) adding new metadata, and (3) adding an ensemble table.

Before we start, have a look at the documentation on the LiPD classes module. If you click on any of the classes, you should notice a pattern in the associated methods:

  • get + PropertyName allows you to retrieve the value associated with a property

  • set + PropertyName allows you to set or change the value of an existing property with another one of type string, float, integer, or boolean. If the property value is a list, set will replace any existing values already present in the metadata (refer to the diagram below for the expected type).

  • add + PropertyName allows you to set or add a value for an existing property that takes a list.

In addition, two functions allow you to add your own custom properties: set_non_standard_property and add_non_standard_property. For now, these functions can only be used for values that do not require a new class to be created.
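The get/set/add convention can be sketched with a toy class. This is an illustration only; ToyPublication is hypothetical and not part of PyLiPD, which generates these methods from the ontology:

```python
# Hypothetical sketch of the get/set/add method convention used by PyLiPD classes.
# ToyPublication is NOT a PyLiPD class; it only mirrors the naming pattern.

class ToyPublication:
    def __init__(self):
        self.title = None    # scalar property -> get/set
        self.authors = []    # list property   -> get/set/add

    def getTitle(self):
        return self.title

    def setTitle(self, title):
        self.title = title

    def getAuthors(self):
        return self.authors

    def setAuthors(self, authors):
        # set replaces any existing list wholesale
        self.authors = list(authors)

    def addAuthor(self, author):
        # add appends a single item to the existing list
        self.authors.append(author)

pub = ToyPublication()
pub.setTitle('A Title')
pub.setAuthors(['Author A'])   # replaces the (empty) list
pub.addAuthor('Author B')      # appends to the existing list
print(pub.getAuthors())        # ['Author A', 'Author B']
```

The same distinction applies throughout PyLiPD: set on a list-valued property overwrites, add appends.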

[Diagram of the LiPD classes, their properties, and the expected value types]

Goals#

  • Edit a LiPD-formatted dataset

  • Add information in a new object (e.g., publication information)

  • Add an ensemble table

  • Save the edited dataset to a new file

Reading Time: 5 minutes

Keywords#

LiPD, LinkedEarth Ontology, Object-Oriented Programming

Pre-requisites#

An understanding of OOP and the LinkedEarth Ontology. Completion of Dataset class example.

Data Description#

We will be working with a hypothetical marine sedimentary record of \(\delta^{18}\)O and Mg/Ca so that we can edit the file without worrying about the accuracy of a specific record. The idealized record was converted into the LiPD format using the LiPD playground and is made available on the GitHub repository for these tutorials.

Demonstration#

Let’s import the necessary packages.

from pylipd.classes.dataset import Dataset
from pylipd.lipd import LiPD

import pandas as pd
import numpy as np

import re

The next cell defines a helper that generates unique identifiers for variables, called TSid in LiPD:

import uuid

def generate_unique_id(prefix='PYD'):
    # Generate a random UUID
    random_uuid = uuid.uuid4()
    
    # Convert the UUID to the specific format we need.
    # A UUID looks like '1e2a2846-2048-480b-9ec6-674daef472bd',
    # so we slice it and reassemble the pieces accordingly.
    id_str = str(random_uuid)
    formatted_id = f"{prefix}-{id_str[:5]}-{id_str[9:13]}-{id_str[14:18]}-{id_str[19:23]}-{id_str[24:28]}"
    
    return formatted_id
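As a quick sanity check (not part of the original notebook), we can confirm that the identifiers follow the intended PYD-xxxxx-xxxx-xxxx-xxxx-xxxx shape; the validation pattern below is our own assumption about the format:

```python
import re
import uuid

def generate_unique_id(prefix='PYD'):
    # Same helper as above: slice a random UUID into the TSid-like format
    id_str = str(uuid.uuid4())
    return f"{prefix}-{id_str[:5]}-{id_str[9:13]}-{id_str[14:18]}-{id_str[19:23]}-{id_str[24:28]}"

tsid = generate_unique_id()
# Assumed shape: prefix, then 5+4+4+4+4 lowercase hex digits, dash-separated
pattern = r'^PYD-[0-9a-f]{5}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}$'
print(bool(re.match(pattern, tsid)))  # True
```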

Let’s load our idealized dataset:

path = '../data/MyWonderfulRecord.LinkedEarth.2024.lpd'
D = LiPD()
D.load(path)
Loading 1 LiPD files
  0%|          | 0/1 [00:00<?, ?it/s]
100%|██████████| 1/1 [00:00<00:00, 41.17it/s]
Loaded..

Now, let’s export to a Dataset object:

ds = D.get_datasets()[0]

Editing an existing property#

For this example, we will assume that there is an error in the geographical coordinates, and we will correct the longitude. First, we need to get to the geo object. When in doubt about how to navigate the file, you can use the LinkedEarth ontology or the handy diagram shown in the preamble.

From this, you can see that the information associated with the Location can be obtained from the Dataset. A quick check from the documentation tells you that you can use the getLocation function to do so:

geo = ds.getLocation()
lon = geo.getLongitude()

lon
170.9

To change the existing longitude information to the corrected value (here, we will assume 165.9), you can use the setLongitude function. Notice that the longitude value should be input as a string:

geo.setLongitude('165.9')
geo.getLongitude()
'165.9'

We have successfully changed the longitude to its correct value! You can also use the set + PropertyName functions to add information (not just correct it). For instance, this record doesn’t have a SiteName:

geo.getSiteName()

Let’s change this to WonderfulCore:

geo.setSiteName('WonderfulCore')
geo.getSiteName()
'WonderfulCore'

So far, we have looked at adding or editing the values of existing properties.

Creating new properties#

Many datasets on the Lipdverse and associated LiPDGraph come from working groups compiling datasets for a particular purpose. In this case, it may be useful to create a temporary property associated with a specific variable (i.e., column) in the dataset to indicate its use for an analysis. For instance, the Pages2k Temperature working group used the property usedInGlobalTemperatureAnalysis as a flag to represent the column to be used for temperature reconstructions.

As an example, we will add a property called forTempAnalysis to the variable Temperature in our dataset. To do so, you can use the set_non_standard_property function. This function is available for each of the classes present in PyLiPD and takes a key/value pair as input, with the key representing the property name and the value being the value associated with that property.

pattern = r'temperatures?' # match "temperature" or "temperatures" in the variable name

for pdata in ds.getPaleoData(): # Loop through all PaleoData objects
    for table in pdata.getMeasurementTables(): # Loop through the measurement tables
        for var in table.getVariables(): # Loop through the variables in the table
            if re.search(pattern, var.getName(), re.IGNORECASE):
                var.set_non_standard_property('forTempAnalysis', True)
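To see what this loop will flag, the regular expression can be exercised on a few hypothetical variable names (a standalone sketch; only the names containing "temperature" or "temperatures" should match):

```python
import re

pattern = r'temperatures?'  # "temperature" optionally followed by "s"

# Hypothetical variable names to exercise the match
names = ['Temperature', 'sea surface temperatures', 'd18O', 'Mg/Ca', 'year']
matched = [n for n in names if re.search(pattern, n, re.IGNORECASE)]
print(matched)  # ['Temperature', 'sea surface temperatures']
```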

Adding new information from classes#

Adding Publication information#

You may have several publications associated with a particular dataset. Therefore, publications are stored in a list, in which each item represents a Publication object. To add to such a list in PyLiPD, you use functions of the form add + PropertyName.

Let’s add a publication to our dataset:

from pylipd.classes.publication import Publication
pub = Publication() # instantiate the object
pub.setTitle('Publication Title')

Now that we have created the object and entered the title, let’s add authors. Looking at the diagram in the preamble, authors can be set as a list of Person objects. Let’s create two authors and add them to our publication object:

from pylipd.classes.person import Person
person1 = Person(); person1.setName("Deborah Khider")
person2 = Person(); person2.setName("Varun Ratnakar")
pub.setAuthors([person1, person2])

Let’s add a bit more information:

pub.setJournal('Journal Name')
pub.setPages('1-12')
pub.setVolume('1')
pub.setYear('2014')

Let’s add the Publication information to our Dataset:

ds.addPublication(pub)

Ok, let’s have a look at our work:

print(ds.getPublications()[0].getTitle())
Publication Title

Adding an Ensemble Table#

A common task in paleoclimate studies is to create possible realizations of the age model using Bayesian age modeling software such as Bchron (Haslett and Parnell, 2008), BACON (Blaauw and Christen, 2011), or OxCal (Bronk Ramsey, 2008).

For this example, we will create a “dummy” ensemble table as a numpy.array from the existing data.

So first, let’s grab the age values from the measurement table:

data_tables = []

for paleoData in ds.getPaleoData(): # loop over the various PaleoData objects
    for table in paleoData.getMeasurementTables(): #get the measurement tables
        df = table.getDataFrame(use_standard_names=True) # grab the data and standardize the variable names
        data_tables.append(df)

data_tables[0].head()
   depth  temperature     Mg/Ca     year      d18O
0    0.5    28.686774  5.023996  1981.30 -4.176004
1    1.0    28.853606  5.100000  1961.30 -4.100000
2    1.5    29.017971  5.176004  1946.37 -4.023996
3    2.0    29.661152  5.484465  1952.00 -3.715535
4    2.5    27.982737  4.715535  1906.37 -4.484465

Let’s have a look at the metadata for each variable (i.e., column), stored in the DataFrame attributes:

data_tables[0].attrs
{'depth': {'@id': 'http://linked.earth/lipd/paleo0measurement0.WEB-1e2a2-4620-480b-9ec6-674da.depth',
  'interpretation': [{'@id': 'http://linked.earth/lipd/paleo0measurement0.WEB-1e2a2-4620-480b-9ec6-674da.depth.Interpretation1'}],
  'archiveType': 'Marine sediment',
  'number': 1,
  'hasMaxValue': 23.0,
  'hasMeanValue': 11.75,
  'hasMedianValue': 10.75,
  'hasMinValue': 0.5,
  'variableName': 'depth',
  'resolution': {'@id': 'http://linked.earth/lipd/paleo0measurement0.WEB-1e2a2-4620-480b-9ec6-674da.depth.Resolution',
   'hasMaxValue': 66.52999999999997,
   'hasMeanValue': 20.9051111111111,
   'hasMedianValue': 3.1700000000000728,
   'hasMinValue': 3.1699999999998454},
  'hasStandardVariable': 'depth',
  'units': 'cm',
  'TSid': 'WEB-1e2a2-4620-480b-9ec6-674da',
  'variableType': 'measured',
  'proxyObservationType': 'depth'},
 'temperature': {'@id': 'http://linked.earth/lipd/paleo0measurement0.WEB-5dd32-9b60-4662-9b73-ad1b3.temperature',
  'interpretation': [{'@id': 'http://linked.earth/lipd/paleo0measurement0.WEB-5dd32-9b60-4662-9b73-ad1b3.temperature.Interpretation1'}],
  'archiveType': 'Marine sediment',
  'number': 5,
  'hasMaxValue': 30.27867291,
  'hasMeanValue': 28.91042688586957,
  'hasMedianValue': 28.894521055,
  'hasMinValue': 27.21849706,
  'variableName': 'temperature',
  'resolution': {'@id': 'http://linked.earth/lipd/paleo0measurement0.WEB-5dd32-9b60-4662-9b73-ad1b3.temperature.Resolution',
   'hasMaxValue': 66.52999999999997,
   'hasMeanValue': 20.9051111111111,
   'hasMedianValue': 3.1700000000000728,
   'hasMinValue': 3.1699999999998454},
  'hasStandardVariable': 'temperature',
  'units': 'degC',
  'TSid': 'WEB-5dd32-9b60-4662-9b73-ad1b3',
  'variableType': 'inferred',
  'inferredVariableType': 'temperature',
  'forTempAnalysis': True},
 'Mg/Ca': {'@id': 'http://linked.earth/lipd/paleo0measurement0.WEB-c745d-99e8-4f77-9042-748f9.Mg_Ca',
  'interpretation': [{'@id': 'http://linked.earth/lipd/paleo0measurement0.WEB-c745d-99e8-4f77-9042-748f9.Mg_Ca.Interpretation1',
    'direction': 'positive',
    'seasonality': 'Annual',
    'variable': 'temperature',
    'variableDetail': 'Sea surface'}],
  'archiveType': 'Marine sediment',
  'number': 3,
  'description': 'Obtained from G.ruber',
  'hasMaxValue': 5.797904362,
  'hasMeanValue': 5.132975283934784,
  'hasMedianValue': 5.118849202,
  'hasMinValue': 4.402095638,
  'variableName': 'Mg/Ca',
  'proxy': 'Mg/Ca',
  'resolution': {'@id': 'http://linked.earth/lipd/paleo0measurement0.WEB-c745d-99e8-4f77-9042-748f9.Mg_Ca.Resolution',
   'hasMaxValue': 66.52999999999997,
   'hasMeanValue': 20.9051111111111,
   'hasMedianValue': 3.1700000000000728,
   'hasMinValue': 3.1699999999998454},
  'hasStandardVariable': 'Mg/Ca',
  'units': 'permil',
  'TSid': 'WEB-c745d-99e8-4f77-9042-748f9',
  'variableType': 'measured',
  'proxyObservationType': 'Mg/Ca'},
 'year': {'@id': 'http://linked.earth/lipd/paleo0measurement0.WEB-e6f66-1365-427c-af34-363a2.year',
  'interpretation': [{'@id': 'http://linked.earth/lipd/paleo0measurement0.WEB-e6f66-1365-427c-af34-363a2.year.Interpretation1'}],
  'archiveType': 'Marine sediment',
  'number': 2,
  'hasMaxValue': 1981.3,
  'hasMeanValue': 1572.5945652173914,
  'hasMedianValue': 1568.7150000000001,
  'hasMinValue': 1187.49,
  'variableName': 'year',
  'hasStandardVariable': 'year',
  'units': 'yr AD',
  'TSid': 'WEB-e6f66-1365-427c-af34-363a2',
  'variableType': 'inferred',
  'inferredVariableType': 'year'},
 'd18O': {'@id': 'http://linked.earth/lipd/paleo0measurement0.WEB-f24e5-dcca-4743-b487-5c6fd.d18O',
  'interpretation': [{'@id': 'http://linked.earth/lipd/paleo0measurement0.WEB-f24e5-dcca-4743-b487-5c6fd.d18O.Interpretation1'}],
  'archiveType': 'Marine sediment',
  'number': 4,
  'hasMaxValue': -3.402095638,
  'hasMeanValue': -4.067024716065219,
  'hasMedianValue': -4.0811507979999995,
  'hasMinValue': -4.797904362,
  'variableName': 'd18O',
  'proxy': 'd18O',
  'resolution': {'@id': 'http://linked.earth/lipd/paleo0measurement0.WEB-f24e5-dcca-4743-b487-5c6fd.d18O.Resolution',
   'hasMaxValue': 66.52999999999997,
   'hasMeanValue': 20.9051111111111,
   'hasMedianValue': 3.1700000000000728,
   'hasMinValue': 3.1699999999998454},
  'hasStandardVariable': 'd18O',
  'units': 'permil',
  'TSid': 'WEB-f24e5-dcca-4743-b487-5c6fd',
  'variableType': 'measured',
  'proxyObservationType': 'd18O'}}

Let’s extract the year values as a numpy array:

time = data_tables[0]['year'].to_numpy()

Create 1,000 realizations of the age ensemble by drawing from a normal distribution centered on the time vector, with a standard deviation of 5 years.

std_dev = 5
num_draws = 1000

#Generate ensemble
ens = np.random.normal(loc=time[:, np.newaxis], scale=std_dev, size=(len(time), num_draws))

Make sure that the number of realizations of the age ensemble corresponds to the number of columns in your array.
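That check can be written down explicitly (a standalone sketch using a dummy five-point time vector in place of the one read from the measurement table):

```python
import numpy as np

# Dummy time vector standing in for the measurement-table ages
time = np.array([1981.3, 1961.3, 1946.37, 1952.0, 1906.37])
std_dev = 5
num_draws = 1000

# Broadcast: each of the len(time) rows is centered on one age value
ens = np.random.normal(loc=time[:, np.newaxis], scale=std_dev,
                       size=(len(time), num_draws))

# One row per depth/age, one column per realization
assert ens.shape == (len(time), num_draws)
print(ens.shape)  # (5, 1000)
```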

Next, let’s insert the table into our dataset. First, let’s check whether there are any ChronData objects we should attach the table to:

ds.getChronData()
[]

Since there are none, we will have to create a ChronData object, in addition to a DataTable and a Model object. Let’s start by creating a DataTable object for the ensemble we have just created:

from pylipd.classes.datatable import DataTable

ensemble_table = DataTable()

To add content to our new DataTable object, the easiest route is to use the setDataFrame method. To do so, we must first generate a DataFrame similar to the one we read in the L3_dataset_class notebook. In summary, the DataFrame contains two columns: one for depth and one for age, where the realizations are stored as a vector:

# Initialize empty DataFrame with the depth column
df_ens = pd.DataFrame({'depth': data_tables[0]['depth'].tolist()})

# Add the year data - each row will contain one vector from array_data
df_ens['year'] = [ens[i,:].tolist() for i in range(len(time))]

df_ens.head()
   depth                                               year
0    0.5  [1979.2648387681054, 1985.768191076515, 1984.7...
1    1.0  [1960.524410715646, 1960.316468547741, 1970.30...
2    1.5  [1949.4173621471534, 1943.741248943736, 1946.8...
3    2.0  [1957.6899080837127, 1946.2838848540448, 1946....
4    2.5  [1906.6000150368072, 1900.4965541754896, 1906....

Add attributes to the pandas DataFrame to store the metadata.

Warning: Metadata attributes are necessary to save a LiPD file.

num_year_columns = len(ens[0,:])
year_columns = [i+2 for i in range(num_year_columns)]
df_ens.attrs = {
    'year': {'number': str(year_columns), 'variableName': 'year', 'units': 'yr AD', 'TSid':generate_unique_id()},
    'depth': {'number': 1, 'variableName': 'depth', 'units': 'cm', 'TSid':generate_unique_id()}
}
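Because these attributes are required for saving, it is worth verifying that every column carries an entry with the fields used above (number, variableName, units, TSid). The sketch below is standalone: it rebuilds a small dummy ensemble DataFrame and inlines a plain-uuid TSid in place of generate_unique_id:

```python
import uuid
import numpy as np
import pandas as pd

# Small dummy ensemble DataFrame mirroring the structure built above
depth = [0.5, 1.0, 1.5]
ens = np.random.normal(loc=1950, scale=5, size=(len(depth), 10))
df_ens = pd.DataFrame({'depth': depth})
df_ens['year'] = [row.tolist() for row in ens]  # one realization vector per cell

num_year_columns = ens.shape[1]
year_columns = [i + 2 for i in range(num_year_columns)]
df_ens.attrs = {
    'depth': {'number': 1, 'variableName': 'depth', 'units': 'cm',
              'TSid': f'PYD-{uuid.uuid4().hex[:10]}'},
    'year': {'number': str(year_columns), 'variableName': 'year',
             'units': 'yr AD', 'TSid': f'PYD-{uuid.uuid4().hex[:10]}'},
}

# Every column should have a metadata entry with the fields used in this notebook
for col in df_ens.columns:
    meta = df_ens.attrs[col]
    assert {'number', 'variableName', 'units', 'TSid'} <= set(meta)
print(sorted(df_ens.attrs))  # ['depth', 'year']
```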

Incorporate into the LiPD structure.

Warning: Don't forget to set the name of the file for the table.

ensemble_table.setDataFrame(df_ens)
ensemble_table.setFileName("chron0model0ensemble0.csv")

Now add the table to a model:

from pylipd.classes.model import Model
model = Model()
model.addEnsembleTable(ensemble_table)

And add the Model to a ChronData object:

from pylipd.classes.chrondata import ChronData
cd = ChronData()
cd.addModeledBy(model)

Finally add the ChronData to our Dataset:

ds.addChronData(cd)

Voila! Let’s check that we now have a ChronData object in our dataset:

ds.getChronData()
[<pylipd.classes.chrondata.ChronData at 0x7f813584cdc0>]

Writing to a LiPD file#

The last step in this process is to write our edited dataset back to a LiPD file. To do so, you need to pass the Dataset ds back into a LiPD object:

lipd = LiPD()
lipd.load_datasets([ds])
lipd.create_lipd(ds.getName(), "../data/MyWonderfulCorev2.LinkedEarth.2024.lpd");

References#

  • Blaauw, M., & Christen, J. A. (2011). Flexible Paleoclimate Age-Depth Models using an Autoregressive Gamma Process. Bayesian Analysis, 6(3), 457-474.

  • Bronk Ramsey, C. (2008). Deposition models for chronological records. Quaternary Science Reviews, 27, 42–60.

  • Haslett, J., & Parnell, A. (2008). A simple monotone process with application to radiocarbon-dated depth chronologies. Journal of the Royal Statistical Society C, 57, 399-418.