Creating LiPD files from a tabular template#

Authors#

Deborah Khider

Preamble#

If you are planning to create only one LiPD file on your own, we recommend using the LiPD Playground. This tutorial is intended for users who wish to programmatically create multiple files from a template.

In this example, we use this template file. You can repurpose the Excel template as needed; it is only meant as an example.

Goals#

  • Create a LiPD-formatted Dataset from an Excel template

  • Add an ensemble table

  • Save the Dataset to a file

Reading Time: 40 minutes

Keywords#

LiPD, LinkedEarth Ontology, Object-Oriented Programming

Pre-requisites#

An understanding of OOP and the LinkedEarth Ontology. Completion of the Dataset class example. An understanding of how to edit LiPD files can also be useful.

Note: Please read the pre-requisites as they contain valuable information about the LiPD structure, the LinkedEarth Ontology, and their relationship to the classes and methods available in PyLiPD.

For reference, below is a diagram of the classes in PyLiPD, the methods associated with them, and the resulting objects:

[Diagram of the PyLiPD classes, their associated methods, and the resulting objects]

This diagram will help you create the objects and place them in the appropriate nested structure. Each object is associated with methods that allow you to look at the information present or create that information. Have a look at the documentation on the LiPD classes module. If you click on any of the classes, you should notice a pattern in the associated methods:

  • get + PropertyName allows you to retrieve the values associated with a property

  • set + PropertyName allows you to set or change the value of an existing property with another one of type string, float, integer, or boolean. If the property value is a list, set will replace any existing values already present in the metadata (refer to the diagram above for the expected type).

  • add + PropertyName allows you to add a value to an existing property that takes a list.

In addition, there are two functionalities that allow you to add your own custom properties: set_non_standard_property and add_non_standard_property. For now, these properties can only be used for values that do not require a new class to be created.
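
As a quick illustration of these patterns, here is a minimal sketch that only uses methods appearing later in this tutorial; the custom key in the last line is purely hypothetical:

from pylipd.classes.dataset import Dataset
from pylipd.classes.person import Person
from pylipd.classes.publication import Publication

ds = Dataset()
ds.setName("MySite.Author.2024")   # set + PropertyName: assign a single string value
print(ds.getName())                # get + PropertyName: retrieve it back

author = Person()
author.setName("Doe, J.")
ds.setInvestigators([author])      # set on a list-valued property replaces the whole list
ds.addPublication(Publication())   # add appends one item to a list-valued property

# Custom information not covered by the ontology (hypothetical key and value)
ds.set_non_standard_property("labCode", "ABC-123")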

Warning: LiPD uses a standard vocabulary for some information.

In order to support querying, LiPD files have several standardized fields, which correspond in most cases to the terms that the NOAA World Data Center for Paleoclimatology has also standardized.

Consequently, we have represented these names as objects in the Ontology. Therefore, some information that would naturally be entered as a string (e.g., coral as an archiveType) should instead result in the creation of an object. PyLiPD can help you with this, as long as you don’t forget to create the object!

There is one big exception to this general rule: paleoData_variableName. We understand that in many instances, the name contains more than just the variable name. For instance, paleoceanographers making measurements on the shells of foraminifera often report their variable name with the species name and the geochemical measurement (e.g., “G. ruber Mg/Ca”). To preserve this information at the variable level, the setName method takes a string. To store the paleoData_variableName information following the standard vocabulary, use the setStandardVariable method.
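
For example, here is a minimal sketch of that split between the free-form name and the standardized term; it assumes that "Mg/Ca" resolves in the PaleoVariable synonym table (from_synonym returns None when a term is not found):

from pylipd.classes.variable import Variable
from pylipd.classes.paleovariable import PaleoVariable

var = Variable()
var.setName("G. ruber Mg/Ca")  # free-form string, keeps the species information

std = PaleoVariable.from_synonym("Mg/Ca")  # standardized vocabulary term (assumed synonym)
if std is not None:
    var.setStandardVariable(std)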

Data Description#

  • Tian, Y., Fleitmann, D., Zhang, Q., Sha, L., Wassenburg, J. A., Axelsson, J., … Cheng, H. (2023). Holocene climate change in southern Oman deciphered by speleothem records and climate model simulations. Nature Communications, 14(1), 4718. doi:10.1038/s41467-023-40454-z.

Demonstration#

Let’s import the necessary packages.

from pylipd.classes.dataset import Dataset
from pylipd.classes.archivetype import ArchiveTypeConstants, ArchiveType
from pylipd.classes.funding import Funding
from pylipd.classes.interpretation import Interpretation
from pylipd.classes.interpretationvariable import InterpretationVariableConstants, InterpretationVariable
from pylipd.classes.location import Location
from pylipd.classes.paleodata import PaleoData
from pylipd.classes.datatable import DataTable
from pylipd.classes.paleounit import PaleoUnitConstants, PaleoUnit
from pylipd.classes.paleovariable import PaleoVariableConstants, PaleoVariable
from pylipd.classes.person import Person
from pylipd.classes.publication import Publication
from pylipd.classes.resolution import Resolution
from pylipd.classes.variable import Variable
from pylipd.classes.model import Model
from pylipd.classes.chrondata import ChronData

from pylipd import LiPD

import pandas as pd
import json
import numpy as np

import re

Opening our template file#

The Excel file contains the following sheets:

  • About

  • Guidelines

  • Metadata

  • paleo1measurementTable1

  • chron1measurementTable1

  • Lists

The information we are interested in is contained in Metadata, paleo1measurementTable1, and chron1measurementTable1. Notice that the latter two sheets follow the LiPD nomenclature closely, which can be helpful to keep track of the tables and where to insert them. However, you may choose any names that are convenient for you.
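
If you want to confirm the sheet names programmatically, pandas can list them; the path below matches the file used throughout this tutorial:

import pandas as pd

# List the sheets contained in the Excel template
pd.ExcelFile("../data/Oman.Tian.2023.xlsx").sheet_names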

Let’s start with the root metadata portion.

Metadata#

If you have a look at the Metadata sheet in the Excel file, you should notice that the information is organized in four sections:

  • root metadata, which contains general information about the dataset such as its name and the type of archive the measurements were made on (e.g., coral, speleothem).

  • Publication information - Note that if more than one publication is associated with the dataset, this information can be added in separate columns.

  • Location information

  • Funding information

The first step is to create a function to separate these different sections into four dataframes:

def read_metadata(df):
    '''
    Reads the information contained in the Metadata sheet of the Excel template.
    Note that the algorithm uses the blank lines in the template to separate the blocks (e.g., publication).

    The function returns 4 pieces of information: root metadata, publication metadata, location metadata, and funding metadata.
    '''
    # Check for empty rows across all columns
    empty_rows = df.isnull().all(axis=1)
    
    # Initialize the start index of the first table
    start_idx = 0
    tables = []
    
    # Iterate through the indices of the DataFrame
    for idx in empty_rows[empty_rows].index:
        # Slice from the current start index to the row before the empty row
        if not df[start_idx:idx].empty:
            current_table = df[start_idx:idx]
            # Check if the table should use its first row as header
            if start_idx != 0:  # Skip header adjustment for the first table
                current_table.columns = current_table.iloc[0]  # Set first row as header
                current_table = current_table[1:]  # Remove the first row from the data
                current_table.reset_index(drop=True, inplace=True)  # Reset index after dropping row
            tables.append(current_table)
        # Update start_idx to the row after the current empty row
        start_idx = idx + 1
    
    # Handle the last table, if any, after the last empty row to the end of the DataFrame
    if start_idx < len(df):
        current_table = df[start_idx:]
        if start_idx != 0:  # Likely unnecessary check but for consistency
            current_table.columns = current_table.iloc[0]  # Set first row as header
            current_table = current_table[1:]  # Remove the first row from the data
            current_table.reset_index(drop=True, inplace=True)
        tables.append(current_table)

    # place the tables according to their types
    root=tables[0]
    pub=tables[1]
    geo=tables[2]
    fund=tables[3]

    return root, pub, geo, fund

file_path = "../data/Oman.Tian.2023.xlsx"
sheet_name = 'Metadata'

df = pd.read_excel(file_path, sheet_name=sheet_name)
# get the various tables
root, pub, geo, fund = read_metadata(df)

The next step is to create an empty Dataset object so we can start storing the information.

ds = Dataset()

Let’s go over each piece of information that we have stored in the Pandas DataFrames, namely some root information such as the name of the dataset, the geographical location, information about the source of funding, and the publication(s) associated with the data.

Root metadata#

Let’s start with the root information:

root
Unnamed: 0 Unnamed: 1 Unnamed: 2
0 Dataset Name (siteName.firstAuthor.year) Oman.Tian.2023 NaN
1 Archive Type speleothem NaN
2 Original Source_URL (if applicable) https://www.nature.com/articles/s41467-023-404... NaN
3 Investigators (Lastname, first; lastname2, fir... Tian, Y., Fleitmann, D., Zhang, Q., Sha, L. J,... NaN

Let’s add this information to the Dataset object. You can see a list of possible properties under the Dataset class in the documentation. Remember from our previous tutorial that set+PropertyName is meant to create the information, which is what we will be doing here. If you cannot find your property in the list, don’t panic! The LiPD format is flexible, so you can add your own properties using the set_non_standard_property(key,value) function.

Also, remember that archiveType needs to be set as an object. To do so, we can use the .from_synonym method that is applicable to all objects making up the standard vocabulary (see the pre-requisites section of this notebook for more details):

ds.setName(root.iloc[0,1]) # set the name for the dataset
archiveType = ArchiveType.from_synonym(root.iloc[1,1]) # Try to identify a standard archiveType name from the vocabulary. 
if archiveType is not None:
    ds.setArchiveType(archiveType) #set the archive type
else:
    raise ValueError('This archiveType is not part of the standard vocabulary; check your spelling or create a new one')
ds.setOriginalDataUrl(root.iloc[2,1]) # set the original data URL

The ArchiveTypeConstants object also allows you to see what is available in the standard vocabulary and to set it directly:

ds.setArchiveType(ArchiveTypeConstants.Speleothem)

Just make sure you import the ArchiveTypeConstants object first.
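
If you want to browse the available terms from Python, ordinary introspection also works; note that this is generic Python rather than a dedicated PyLiPD API:

# List the constant names defined on ArchiveTypeConstants (plain Python introspection)
[name for name in dir(ArchiveTypeConstants) if not name.startswith('_')]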

Let’s check if the information was stored properly into an ArchiveType object:

ds.getArchiveType().label
'Speleothem'

Everything looks good! But what happens if you trigger the ValueError in the cell above?

Dealing with new standard terms#

The error raises two possibilities:

  1. You misspelled the name. The .from_synonym function only works for exact matches with terms found in the thesaurus. Using a language model to guess automatically could result in errors. This may not be a problem for archive types, but variable names involving isotopes can differ only slightly while carrying very different scientific interpretations, and a language model may treat them as the same item (e.g., d18O and d17O). Out of an abundance of caution, we decided not to allow fuzzy matches.

  2. You are actually creating a new archiveType. In this case, you need to create a new class in the ontology to represent it. In short, you need to create a new ArchiveType object in PyLiPD:

from pylipd.globals.urls import ARCHIVEURL
mynewarchive = ArchiveType(f"{ARCHIVEURL}#MyArchiveType", "MyArchiveType")

You have now created a new ArchiveType object that can be inserted into your dataset. Just make sure that the label and id fields look like the following:

print('label: '+mynewarchive.label)
print('ID: '+mynewarchive.id)
label: MyArchiveType
ID: http://linked.earth/ontology/archive#MyArchiveType

Remember that you will need to do this for every field that has been standardized in the vocabulary.

As with the ArchiveType, each of these has a corresponding Constants object (e.g., PaleoUnitConstants, PaleoVariableConstants) that can be used to see what terms are already available. See the LiPD Controlled Vocabulary module for details.

The next step is to enter the investigators, which takes a list of Person objects (see Figure in the Preamble):

authors = root.iloc[3,1]

# Step 1: Split the string by commas
parts = authors.split(',')

# Prepare a list to hold the formatted names
investigators = []

# Step 2: Iterate over the parts to process each
for i in range(0, len(parts) - 1, 2):  # Step by 2 since each name and initial are next to each other
    last_name = parts[i].strip()  # Remove any leading/trailing whitespace
    initial = parts[i + 1].strip()  # The initial follows the last name
    person = Person() # create the Person object
    person.setName(f"{last_name}, {initial}")
    investigators.append(person)

# Step 3: Store the list of Persons into the ds object
ds.setInvestigators(investigators)

Let’s have a quick look at what we’ve done so far:

ds.getName()
'Oman.Tian.2023'
ds.getInvestigators()[0].getName() #get the name of the first person in the list
'Tian, Y.'

Everything looks good so far.

Publication metadata#

pub
5 Publication Section Ref #1 Ref #2
0 Authors (last, first; last2, first2; separate ... Tian, Y., Fleitmann, D., Zhang, Q., Sha, L. J,... NaN
1 Publication title Holocene climate change in southern Oman decip... NaN
2 Journal Nature Communications NaN
3 Year 2023 NaN
4 Volume NaN NaN
5 Issue NaN NaN
6 Pages NaN NaN
7 Report Number NaN NaN
8 DOI 10.1038/s41467-023-40454-z NaN
9 Abstract Qunf Cave oxygen isotope (δ18Oc) record from s... NaN
10 Alternate citation in paragraph format (For bo... NaN NaN

The first step is to create a Publication object. In this case, we only have one publication; otherwise, you may need to loop over the various columns in the dataframe to create the others (a short sketch of this is shown at the end of this subsection).

pub1 = Publication()

And now let’s add the information. Let’s start with the authors:

authors = pub.iloc[0,1]

# Step 1: Split the string by commas
parts = authors.split(',')

# Prepare a list to hold the formatted names
investigators = []

# Step 2: Iterate over the parts to process each
for i in range(0, len(parts) - 1, 2):  # Step by 2 since each name and initial are next to each other
    last_name = parts[i].strip()  # Remove any leading/trailing whitespace
    initial = parts[i + 1].strip()  # The initial follows the last name
    person = Person() # create the Person object
    person.setName(f"{last_name}, {initial}")
    investigators.append(person)

# Step 3: Store the list of Persons into the ds object
pub1.setAuthors(investigators)

Let’s add other information:

pub1.setTitle(pub.iloc[1,1])
pub1.setJournal(pub.iloc[2,1])
pub1.setYear(pub.iloc[3,1])
pub1.setDOI(pub.iloc[8,1])
pub1.setAbstract(pub.iloc[9,1])

Let’s add our Publication object to the Dataset object:

ds.addPublication(pub1)
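
Had the template contained additional references (Ref #2 and beyond), a small loop over the remaining columns would do. The sketch below only fills in the title, following the same pattern as above; adapt it to copy the other rows as needed:

# Sketch: build one Publication per non-empty reference column beyond Ref #1
for col in range(2, pub.shape[1]):
    if pub.iloc[:, col].isnull().all():
        continue  # skip empty reference columns (Ref #2 is empty in this template)
    extra_pub = Publication()
    extra_pub.setTitle(pub.iloc[1, col])  # row 1 holds the publication title in this template
    ds.addPublication(extra_pub)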

Location metadata#

Let’s add geographical information next. First, we need to create a Location object:

loc = Location()

geo
18 Site Information Use appropriate significant digits for all values NaN
0 Northernmost latitude (decimal degree, South n... 17.17 NaN
1 Southernmost latitude (decimal degree, South n... 17.17 NaN
2 Easternmost longitude (decimal degree, West ne... 54.3 NaN
3 Westernmost longitude (decimal degree, West ne... 54.3 NaN
4 elevation (m), below sea level negative 650 NaN

loc.setLatitude(geo.iloc[0,1])
loc.setLongitude(geo.iloc[2,1])
loc.setElevation(geo.iloc[4,1])

Let’s add the Location object into the Dataset:

ds.setLocation(loc)
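
Following the get + PropertyName pattern described in the preamble, we can read the values back as a quick sanity check (the getter names below are assumed from that pattern):

# Read back what we just stored (getters assumed from the get + PropertyName pattern)
ds.getLocation().getLatitude(), ds.getLocation().getLongitude(), ds.getLocation().getElevation()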

Funding metadata#

Finally, let’s look at funding information:

fund
26 Funding_Agency Any additional Funding agencies and grants should be entered in Columns C,D, etc. NaN
0 Funding_Agency_Name NaN NaN
1 Grant NaN NaN
2 Principal_Investigator NaN NaN
3 country NaN NaN

Since no information is available, let’s move on to the PaleoData Section.

PaleoData#

sheet_name = 'paleo1measurementTable1'

# Read the information into a Pandas DataFrame
pd_df = pd.read_excel(file_path, sheet_name=sheet_name, header=None)

# Drop completely empty rows
pd_df = pd_df.dropna(how="all").reset_index(drop=True)

Let’s create a PaleoData object:

paleodata = PaleoData()

Our next step is to create measurement tables in this object. To do so, we can use the DataTable object:

table = DataTable()

Now let’s add some information about the table, such as the file name and the value used for missing values in the data:

table.setFileName("paleo0measurement0.csv")
table.setMissingValue("NaN")

The next step is to add columns to our table. In other words, we need to create some variables.

Let’s have a look at the sheet in the Excel file. It contains three sections:

  • Notes, which would be attached to the table

  • Variables, which contains metadata about each variable such as its name, units, etc. Each row in the template holds the metadata for one column of the data.

  • Data, which contains the numerical values for the variables.

Let’s first create an algorithm that separates the various sections and returns the information in three dataframes (Notes, Variables, and Data).

def extract_data(df): 
    '''
    This function extracts the relevant sections for measurementTables. 
    '''
    # Find the index positions of the section headers
    notes_start = df[df[0] == "Notes"].index[0]
    variables_start = df[df[0] == "Variables"].index[0]
    data_start = df[df[0] == "Data"].index[0]

    # Extract sections, ensuring blank rows are removed
    df_notes = df.iloc[notes_start + 1:variables_start].dropna(how="all").reset_index(drop=True)

    # Extract the Variables section
    df_variables_raw = df.iloc[variables_start + 1:data_start].dropna(how="all").reset_index(drop=True)

    # Set the first row as the header for the Variables section
    df_variables = df_variables_raw[1:].reset_index(drop=True)  # Data rows
    df_variables.columns = df_variables_raw.iloc[0]  # Set first row as column headers

    # Extract the Data section
    df_data_raw = df.iloc[data_start + 2:].dropna(how="all").reset_index(drop=True)

    # Correctly skip the first row and set the second row as the header
    df_data = df_data_raw.iloc[1:].reset_index(drop=True)  # Skip first row, keep rest
    df_data.columns = df_data_raw.iloc[0]  # Use second row as column headers
    df_data = df_data.dropna(axis=1, how="all") # Drop the columns with NaN.

    return df_notes, df_variables, df_data

df_notes, df_variables, df_data = extract_data(pd_df)

Let’s loop over the variables and gather the relevant metadata. We will also calculate some summary statistics, such as the mean values and the mean resolution.

In LiPD, each variable is also given a unique ID. The function below generates one:

import uuid

def generate_unique_id(prefix='PYD'):
    # Generate a random UUID
    random_uuid = uuid.uuid4()  # Generates a random UUID.
    
    # Convert UUID format to the specific format we need
    # UUID is usually in the form '1e2a2846-2048-480b-9ec6-674daef472bd' so we slice and insert accordingly
    id_str = str(random_uuid)
    formatted_id = f"{prefix}-{id_str[:5]}-{id_str[9:13]}-{id_str[14:18]}-{id_str[19:23]}-{id_str[24:28]}"
    
    return formatted_id
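
Calling the function produces IDs like the following; the characters after the prefix are random and will differ every time:

generate_unique_id()               # e.g. 'PYD-1e2a2-2048-480b-9ec6-674d'
generate_unique_id(prefix='TIAN')  # e.g. 'TIAN-1e2a2-2048-480b-9ec6-674d'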

Also, remember that some of these fields should result in objects (namely, variableName and units). Let’s see if we can get the information from the synonyms before we proceed:

df_variables
variableName variableType Units ProxyObservationType InferredVariableType TakenAtDepth InferredFrom notes Interpretation1_variable Interpretation1_variableDetail ... calibration_reference calibration_uncertainty calibration_uncertaintyType sensorGenus sensorSpecies PhysicalSample_Name PhysicalSample_Identifier PhysicalSample_hasIGSN PhysicalSample_housedAt PhysicalSample_collectionMethod
0 Depth NaN mm NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Age NaN year BP NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 d18O NaN per mil VPDB NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 d13C NaN per mil VPDB NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

4 rows × 28 columns

Let’s start with identifying proper objects for the variableNames:

check_names = {}
for index, row in df_variables.iterrows():
    check_names[row['variableName']]= PaleoVariable.from_synonym(row['variableName']).label

check_names
{'Depth': 'depth', 'Age': 'age', 'd18O': 'd18O', 'd13C': 'd13C'}

This looks good for the variable names. Let’s move on to the units:

check_units = {}
for index, row in df_variables.iterrows():
    try:
        check_units[row['Units']]= PaleoUnit.from_synonym(row['Units']).label
    except:
        check_units[row['Units']]= None

check_units
{'mm': 'mm', 'year BP': 'yr BP', 'per mil VPDB': None}

The “per mil VPDB” unit is not recognized automatically (not in the thesaurus). Let’s have a look at the standard unit names and see if we can manually select one that will match.

The permil entry will match, so let’s use this.
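
We can confirm that before relying on it in the loop below; this assumes that "permil" is registered in the PaleoUnit synonym table, which the next cell also depends on:

# Check that 'permil' resolves to a standard unit before using it below
PaleoUnit.from_synonym('permil').label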

variables = []

res = df_data.iloc[:, 1].diff()[1:].to_numpy()
Res = Resolution() # create a Resolution object - it will be the same for all variables since it is based on time
Res.setMinValue(np.min(res))
Res.setMaxValue(np.max(res))
Res.setMeanValue(np.mean(res))
Res.setMedianValue(np.median(res))

for index, row in df_variables.iterrows():
    var = Variable()
    var.setName(row['variableName']) # name of the variable
    # Now let's do the standard name
    var.setStandardVariable(PaleoVariable.from_synonym(row['variableName']))
    var.setColumnNumber(index+1) #The column in which the data is stored. Note that LiPD uses index 1
    var.setVariableId(generate_unique_id(prefix='TIAN')) # create a unique ID for the variable
    # Units
    if row['Units']=='per mil VPDB':
        var.setUnits(PaleoUnit.from_synonym('permil'))
    else:
        var.setUnits(PaleoUnit.from_synonym(row['Units']))
    # Make sure the data is JSON writable (no numpy arrays or Pandas DataFrame)
    var.setValues(json.dumps(df_data.iloc[:,index].tolist()))
    # Calculate some metadata about the values - this makes it easier to do some queries later on, including looking for data in a particular time slice. 
    var.setMinValue(df_data.iloc[:,index].min())
    var.setMaxValue(df_data.iloc[:,index].max())
    var.setMeanValue(df_data.iloc[:,index].mean())
    var.setMedianValue(df_data.iloc[:,index].median())
    # Attach the resolution metadata information to the variable
    var.setResolution(Res)
    # append in the list
    variables.append(var)                    

Let’s now put our variables in the DataTable:

table.setVariables(variables)

The Table into the PaleoData object:

paleodata.setMeasurementTables([table])

And finally, the PaleoData object into the Dataset:

ds.addPaleoData(paleodata)

ChronData#

The next step is to create a ChronData object to store the information about the chronology. In the last section, we used an OOP approach to add the information about each variable. In this section, we will use an approach involving a Pandas DataFrame.

Let’s open the data. The same function we wrote to read in the PaleoData can be used here since the template is the same for both objects.

sheet_name = 'chron1measurementTable1'

# Read the information into a Pandas DataFrame
cd_df = pd.read_excel(file_path, sheet_name=sheet_name, header=None)

# Drop completely empty rows
cd_df = cd_df.dropna(how="all").reset_index(drop=True)

Let’s create a ChronData object:

chrondata = ChronData()

We need to create a DataTable, a process similar to what we have done for the PaleoData:

chrontable = DataTable()

Let’s add some basic information about the table:

chrontable.setFileName("chron0measurement0.csv")
chrontable.setMissingValue("NaN")
dfc_notes, dfc_variables, dfc_data = extract_data(cd_df)

We will use the setDataFrame function to incorporate the columns into the table. In this approach, the values are held in a DataFrame (dfc_data in our case) and the metadata for each variable are added as attributes to the DataFrame. Since the variables and units associated with chronology are not yet standardized, we can leave them as strings:

metadata_dict = {} # create a dictionary

for index, row in dfc_variables.iterrows():
    temp_dict = {}
    temp_dict['number']=index+1
    temp_dict['variableName']=row['variableName']
    temp_dict['TSid']= generate_unique_id(prefix='TC')
    if pd.notna(row['Units']):
        temp_dict['units']=row['Units']
    else:
        temp_dict['units']='NA'
    try:
        temp_dict['hasMinValue']=dfc_data.iloc[:,index].min()
        temp_dict['hasMaxValue']=dfc_data.iloc[:,index].max()
        temp_dict['hasMeanValue']=dfc_data.iloc[:,index].mean()
        temp_dict['hasMedianValue']=dfc_data.iloc[:,index].median()
    except:
        pass      
    metadata_dict[row['variableName']]=temp_dict

dfc_data.attrs = metadata_dict

Use the DataFrame to construct the Table object:

chrontable.setDataFrame(dfc_data)

Put the Table into the ChronData object:

chrondata.setMeasurementTables([chrontable])

Put the ChronData into the Dataset:

ds.addChronData(chrondata)

Adding an ensemble table#

This particular dataset also has an ensemble table available in Oman.Tian.2023.chrondf.csv. Let’s add the information to an ensembleTable in the LiPD file.

Let’s first open the data:

ens_path = "../data/Oman.Tian.2023.chrondf.csv"

df_ens = pd.read_csv(ens_path)
df_ens.head()
Unnamed: 0 age d18O depth chron_0 chron_1 chron_2 chron_3 chron_4 chron_5 ... chron_990 chron_991 chron_992 chron_993 chron_994 chron_995 chron_996 chron_997 chron_998 chron_999
0 0 401.88 -0.20 0.00 444.0 445.0 417.0 290.0 403.0 349.0 ... 430.0 407.0 410.0 369.0 392.0 347.0 359.0 416.0 414.0 408.0
1 1 408.55 -0.59 0.69 445.0 446.0 445.0 355.0 404.0 367.0 ... 431.0 410.0 411.0 371.0 395.0 358.0 362.0 417.0 416.0 414.0
2 2 424.07 -0.58 1.37 489.0 491.0 482.0 425.0 448.0 422.0 ... 469.0 468.0 459.0 411.0 437.0 397.0 422.0 441.0 463.0 435.0
3 3 438.75 -0.73 2.06 562.0 568.0 525.0 495.0 521.0 504.0 ... 533.0 565.0 541.0 479.0 507.0 455.0 524.0 482.0 541.0 466.0
4 4 450.24 -1.26 2.75 568.0 574.0 536.0 503.0 531.0 514.0 ... 547.0 575.0 548.0 491.0 510.0 461.0 530.0 493.0 549.0 480.0

5 rows × 1004 columns

Warning: LiPD files require ensemble tables to have the following format: the first column contains depth, and the second column contains the ensemble members for each depth as a list.

Our first step is to drop the first 3 columns in the DataFrame:

df_ens = df_ens.iloc[:, 3:]
df_ens.head()
depth chron_0 chron_1 chron_2 chron_3 chron_4 chron_5 chron_6 chron_7 chron_8 ... chron_990 chron_991 chron_992 chron_993 chron_994 chron_995 chron_996 chron_997 chron_998 chron_999
0 0.00 444.0 445.0 417.0 290.0 403.0 349.0 387.0 412.0 341.0 ... 430.0 407.0 410.0 369.0 392.0 347.0 359.0 416.0 414.0 408.0
1 0.69 445.0 446.0 445.0 355.0 404.0 367.0 405.0 413.0 369.0 ... 431.0 410.0 411.0 371.0 395.0 358.0 362.0 417.0 416.0 414.0
2 1.37 489.0 491.0 482.0 425.0 448.0 422.0 447.0 442.0 441.0 ... 469.0 468.0 459.0 411.0 437.0 397.0 422.0 441.0 463.0 435.0
3 2.06 562.0 568.0 525.0 495.0 521.0 504.0 506.0 492.0 543.0 ... 533.0 565.0 541.0 479.0 507.0 455.0 524.0 482.0 541.0 466.0
4 2.75 568.0 574.0 536.0 503.0 531.0 514.0 511.0 501.0 555.0 ... 547.0 575.0 548.0 491.0 510.0 461.0 530.0 493.0 549.0 480.0

5 rows × 1001 columns

Next, let’s create the proper DataFrame format. The first column (depth) will stay the same. The second column will contain the ensemble members for each depth as a list:

#Let's keep the first column (depth) in place
ens_table = pd.DataFrame({'depth': df_ens['depth'].tolist()})

# Add the year data - each row will contain one vector from a data array. 
array = df_ens.iloc[:, 1:].to_numpy()
ens_table['year'] = [array[i,:].tolist() for i in range(array.shape[0])]
ens_table.head()
depth year
0 0.00 [444.0, 445.0, 417.0, 290.0, 403.0, 349.0, 387...
1 0.69 [445.0, 446.0, 445.0, 355.0, 404.0, 367.0, 405...
2 1.37 [489.0, 491.0, 482.0, 425.0, 448.0, 422.0, 447...
3 2.06 [562.0, 568.0, 525.0, 495.0, 521.0, 504.0, 506...
4 2.75 [568.0, 574.0, 536.0, 503.0, 531.0, 514.0, 511...

Add attributes to the Pandas DataFrame to store the metadata.

Warning: Metadata attributes are necessary to save a LiPD file.

num_year_columns = len(array[0,:])
year_columns = [i+2 for i in range(num_year_columns)]
ens_table.attrs = {
    'year': {'number': str(year_columns), 'variableName': 'year', 'units': 'yr AD', 'TSid':generate_unique_id()},
    'depth': {'number': 1, 'variableName': 'depth', 'units': 'cm', 'TSid':generate_unique_id()}
}

Let’s create a DataTable object for our ensemble table:

ensemble_table = DataTable()
ensemble_table.setDataFrame(ens_table)
ensemble_table.setFileName("chron0model0ensemble0.csv")

Now add the table to a model:

model = Model()
model.addEnsembleTable(ensemble_table)

And add the Model to the ChronData object:

chrondata.addModeledBy(model)

Writing a LiPD file#

The last step in this process is to write to a LiPD file. To do so, you need to pass the Dataset ds back into a LiPD object:

lipd = LiPD()
lipd.load_datasets([ds])
lipd.create_lipd(ds.getName(), "../data/Oman.Tian.2023.lpd");

Opening the LiPD file#

Let’s re-open the LiPD file that we have just created and check some of our work.

L = LiPD()
file = "../data/Oman.Tian.2023.lpd"

L.load(file)
Loading 1 LiPD files
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  2.93it/s]
Loaded..

Let’s get the name of the dataset:

L.get_all_dataset_names()
['Oman.Tian.2023']

df_essential = L.get_timeseries_essentials()

df_essential.head()
dataSetName archiveType geo_meanLat geo_meanLon geo_meanElev paleoData_variableName paleoData_values paleoData_units paleoData_proxy paleoData_proxyGeneral time_variableName time_values time_units depth_variableName depth_values depth_units
0 Oman.Tian.2023 Speleothem 17.17 54.3 650.0 d13C [-2.71, -4.12, -3.45, -3.11, -3.61, -4.86, -4.... permil None None age [401.88, 408.55, 424.07, 438.75, 450.24, 461.3... yr BP depth [0.0, 0.69, 1.37, 2.06, 2.75, 3.43, 4.12, 4.81... mm
1 Oman.Tian.2023 Speleothem 17.17 54.3 650.0 d18O [-0.2, -0.59, -0.58, -0.73, -1.26, -1.19, -0.6... permil None None age [401.88, 408.55, 424.07, 438.75, 450.24, 461.3... yr BP depth [0.0, 0.69, 1.37, 2.06, 2.75, 3.43, 4.12, 4.81... mm
2 Oman.Tian.2023 Speleothem 17.17 54.3 650.0 d13C [-2.71, -4.12, -3.45, -3.11, -3.61, -4.86, -4.... permil None None age [401.88, 408.55, 424.07, 438.75, 450.24, 461.3... yr BP depth [0.0, 0.69, 1.37, 2.06, 2.75, 3.43, 4.12, 4.81... mm
3 Oman.Tian.2023 Speleothem 17.17 54.3 650.0 d18O [-0.2, -0.59, -0.58, -0.73, -1.26, -1.19, -0.6... permil None None age [401.88, 408.55, 424.07, 438.75, 450.24, 461.3... yr BP depth [0.0, 0.69, 1.37, 2.06, 2.75, 3.43, 4.12, 4.81... mm
4 Oman.Tian.2023 Speleothem 17.17 54.3 650.0 d13C [-2.71, -4.12, -3.45, -3.11, -3.61, -4.86, -4.... permil None None age [401.88, 408.55, 424.07, 438.75, 450.24, 461.3... yr BP depth [0.0, 0.69, 1.37, 2.06, 2.75, 3.43, 4.12, 4.81... mm

And voila! The LiPD file is ready to be used.