
Creating LiPD files from a tabular template#
Preamble#
If you are planning to create only one LiPD file on your own, we recommend using the LiPD Playground. This tutorial is intended for users who wish to programmatically create multiple files from a template.
In this example, we use this template file. You can repurpose the Excel template as needed; it is only meant as an example.
Goals#
Create a LiPD-formatted Dataset from an Excel template
Add an ensemble table
Save the Dataset to a file
Reading Time: 40 minutes
Keywords#
LiPD, LinkedEarth Ontology, Object-Oriented Programming
Pre-requisites#
An understanding of OOP and the LinkedEarth Ontology. Completion of the Dataset class example. An understanding of how to edit LiPD files can also be useful.
For reference, below is a diagram of the classes in PyLiPD, the methods associated with them, and the resulting objects:
This diagram will help you create the objects and place them in the appropriate nested structure. Each object is associated with methods that allow you to look at the information present or create that information. Have a look at the documentation on the LiPD classes module. If you click on any of the classes, you should notice a pattern in the associated methods:
get + PropertyName allows you to retrieve the values associated with a property.
set + PropertyName allows you to set or change the value of an existing property with another one of type string, float, integer, or boolean. If the property value is a list, set will replace any existing value already present in the metadata (refer to the diagram below for the expected type).
add + PropertyName allows you to add a value to an existing property that takes a list.
In addition, there are two functionalities that allow you to add your own custom properties: set_non_standard_property
and add_non_standard_property
. For now, these can only be used for values that do not require a new class to be created.
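To make the naming pattern concrete, here is a minimal sketch using methods that appear later in this tutorial (the custom key passed to set_non_standard_property is purely illustrative):
from pylipd.classes.dataset import Dataset
from pylipd.classes.publication import Publication

ds = Dataset()
ds.setName("MySite.Me.2024")   # set + PropertyName: create or overwrite a single value
print(ds.getName())            # get + PropertyName: retrieve the stored value

pub = Publication()
pub.setTitle("An example paper")
ds.addPublication(pub)         # add + PropertyName: append to a list-valued property

# Custom key/value pair not covered by the ontology (illustrative key name)
pub.set_non_standard_property("pageCount", 15)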
In order to support querying, LiPD files have several standardized fields, which correspond in most cases to the terms that the NOAA World Data Center for Paleoclimatology has also standardized.
Consequently, we have represented these names as objects in the Ontology. Therefore, some information that would naturally be entered as a string (e.g., coral
as an archiveType
) should result in an object creation. PyLiPD
can help you with this, as long as you don’t forget to create an object!
There is one big exception to this general rule: paleoData_variableName
. We understand that in many instances, the name contains more than just the variable name. For instance, paleoceanographers making measurements on the shells of foraminifera often report their variable name with the species name and the geochemical measurement (e.g., “G. ruber Mg/Ca”). To preserve this information at the variable level, the setName
method takes a string. To store the paleoData_variableName
information following the standard vocabulary, use the setStandardVariable
method.
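For example, here is a sketch of how this plays out for a foraminiferal Mg/Ca record (the variable name string is illustrative, and the example assumes "Mg/Ca" resolves in the synonym table):
from pylipd.classes.variable import Variable
from pylipd.classes.paleovariable import PaleoVariable

var = Variable()
var.setName("G. ruber Mg/Ca")  # free-form string, preserves the full original name
# Standard vocabulary term, stored as a PaleoVariable object
# (from_synonym returns None if the term is not in the thesaurus)
var.setStandardVariable(PaleoVariable.from_synonym("Mg/Ca"))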
Data Description#
Tian, Y., Fleitmann, D., Zhang, Q., Sha, L., Wassenburg, J. A., Axelsson, J., … Cheng, H. (2023). Holocene climate change in southern Oman deciphered by speleothem records and climate model simulations. Nature Communications, 14(1), 4718. doi:10.1038/s41467-023-40454-z.
Demonstration#
Let’s import the necessary packages.
from pylipd.classes.dataset import Dataset
from pylipd.classes.archivetype import ArchiveTypeConstants, ArchiveType
from pylipd.classes.funding import Funding
from pylipd.classes.interpretation import Interpretation
from pylipd.classes.interpretationvariable import InterpretationVariableConstants, InterpretationVariable
from pylipd.classes.location import Location
from pylipd.classes.paleodata import PaleoData
from pylipd.classes.datatable import DataTable
from pylipd.classes.paleounit import PaleoUnitConstants, PaleoUnit
from pylipd.classes.paleovariable import PaleoVariableConstants, PaleoVariable
from pylipd.classes.person import Person
from pylipd.classes.publication import Publication
from pylipd.classes.resolution import Resolution
from pylipd.classes.variable import Variable
from pylipd.classes.model import Model
from pylipd.classes.chrondata import ChronData
from pylipd import LiPD
import pandas as pd
import json
import numpy as np
import re
Opening our template file#
The Excel file contains the following sheets:
About
Guidelines
Metadata
paleo1measurementTable1
chron1measurementTable1
Lists
The information we are interested in is contained in Metadata
, paleo1measurementTable1
and chron1measurementTable1
. Notice that the last two sheets follow the LiPD nomenclature closely, which can be helpful to keep track of the tables and where to insert them. However, you may choose any name that is convenient for you.
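If you want to check the sheet names programmatically before parsing, here is a minimal sketch with pandas (the file path assumes the template sits in ../data, as later in this tutorial):
import pandas as pd

xls = pd.ExcelFile("../data/Oman.Tian.2023.xlsx")
print(xls.sheet_names)  # e.g. ['About', 'Guidelines', 'Metadata', ...]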
Let’s start with the root metadata portion.
Metadata#
If you have a look at the Metadata sheet in the Excel file, you should notice that the information is organized in four sections:
root metadata, which contains general information about the dataset such as its name and the type of archive the measurements were made on (e.g., coral, speleothem).
Publication information - Note that if more than one publication is associated with the dataset, then this information can be added in separate columns.
Location information
Funding information
The first step is to create a function to separate these different sections into four dataframes:
def read_metadata(df):
    '''
    Reads the information contained in the metadata sheet of the Excel template.
    Note that the algorithm uses the blank lines in the template to delimit each block (e.g., publication).
    The code returns 4 pieces of information: root metadata, publication metadata, location metadata, and funding metadata.
    '''
    # Check for empty rows across all columns
    empty_rows = df.isnull().all(axis=1)
    # Initialize the start index of the first table
    start_idx = 0
    tables = []
    # Iterate through the indices of the DataFrame
    for idx in empty_rows[empty_rows].index:
        # Slice from the current start index to the row before the empty row
        if not df[start_idx:idx].empty:
            current_table = df[start_idx:idx]
            # Check if the table should use its first row as header
            if start_idx != 0:  # Skip header adjustment for the first table
                current_table.columns = current_table.iloc[0]  # Set first row as header
                current_table = current_table[1:]  # Remove the first row from the data
                current_table.reset_index(drop=True, inplace=True)  # Reset index after dropping row
            tables.append(current_table)
        # Update start_idx to the row after the current empty row
        start_idx = idx + 1
    # Handle the last table, if any, after the last empty row to the end of the DataFrame
    if start_idx < len(df):
        current_table = df[start_idx:]
        if start_idx != 0:  # Likely unnecessary check but kept for consistency
            current_table.columns = current_table.iloc[0]  # Set first row as header
            current_table = current_table[1:]  # Remove the first row from the data
            current_table.reset_index(drop=True, inplace=True)
        tables.append(current_table)
    # Place the tables according to their types
    root = tables[0]
    pub = tables[1]
    geo = tables[2]
    fund = tables[3]
    return root, pub, geo, fund
file_path = "../data/Oman.Tian.2023.xlsx"
sheet_name = 'Metadata'
df = pd.read_excel(file_path, sheet_name=sheet_name)
# get the various tables
root, pub, geo, fund = read_metadata(df)
The next step is to create an empty Dataset
object so we can start storing the information.
ds = Dataset()
Let’s go over the information that we have stored in the Pandas DataFrames: some root information such as the name of the dataset, the geographical location, information about the source of funding, and the publication(s) associated with the data.
Root metadata#
Let’s start with the root information:
root
Unnamed: 0 | Unnamed: 1 | Unnamed: 2 | |
---|---|---|---|
0 | Dataset Name (siteName.firstAuthor.year) | Oman.Tian.2023 | NaN |
1 | Archive Type | speleothem | NaN |
2 | Original Source_URL (if applicable) | https://www.nature.com/articles/s41467-023-404... | NaN |
3 | Investigators (Lastname, first; lastname2, fir... | Tian, Y., Fleitmann, D., Zhang, Q., Sha, L. J,... | NaN |
Let’s add this information to the Dataset
object. You can see a list of possible properties under the Dataset
class in the documentation. Remember from our previous tutorial that set+PropertyName
is meant to create the information, which is what we will be doing here. If you cannot find your property in the list, don’t panic! The LiPD format is flexible so you can add your own properties using the set_non_standard_property(key,value)
function.
Also, remember that archiveType needs to be set as an object. To do so, we can use the .from_synonym
method that is applicable to all objects making up the standard vocabulary (see the pre-requisites section of this notebook for more details):
ds.setName(root.iloc[0,1]) # set the name for the dataset
archiveType = ArchiveType.from_synonym(root.iloc[1,1]) # Try to identify a standard archiveType name from the vocabulary.
if archiveType is not None:
    ds.setArchiveType(archiveType) # set the archive type
else:
    raise ValueError('This archiveType is not part of the standard vocabulary; check your spelling or create a new one')
ds.setOriginalDataUrl(root.iloc[2,1]) # set the original data URL
The ArchiveTypeConstants
object also allows you to see what is available in the standard vocabulary and set it directly as:
ds.setArchiveType(ArchiveTypeConstants.Speleothem)
Just make sure you import the ArchiveTypeConstants
object first.
Let’s check if the information was stored properly into a ArchiveType
object:
ds.getArchiveType().label
'Speleothem'
Everything looks good! But what happens if you trigger the ValueError
in the cell above?
Dealing with new standard terms#
The error raises two possibilities:
You misspelled the name. The .from_synonym
function is for exact matches with terms found in the thesaurus. Using a language model to guess automatically could result in errors. This may not be a problem for the archive, but you can imagine that variable names involving different isotopes can differ only slightly yet carry very different scientific interpretations, while being close enough that a language model may treat them as the same item (e.g., d18O and d17O). Out of an abundance of caution, we decided not to allow fuzzy matches.
You are actually creating a new archiveType. In this case, you need to create a new class in the ontology to represent it. In short, you need to create a new archiveType object in PyLiPD
:
from pylipd.globals.urls import ARCHIVEURL
mynewarchive = ArchiveType(f"{ARCHIVEURL}#MyArchiveType", "MyArchiveType")
You have now created a new ArchiveType
object that can be inserted in your dataset. Just make sure that the label
and id
fields look like the following:
print('label: '+mynewarchive.label)
print('ID: '+mynewarchive.id)
label: MyArchiveType
ID: http://linked.earth/ontology/archive#MyArchiveType
Remember that you will need to do this for every field that has been standardized in the vocabulary.
As with the ArchiveType
, each of these has a corresponding ObjectConstants
class that can be used to see what terms are already available. See the LiPD Controlled Vocabulary module for details.
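As a quick sketch, the same pattern works for the other standardized fields, for example units and variable names (the specific constant attributes shown below, such as PaleoUnitConstants.mm and PaleoVariableConstants.d18O, are assumptions based on the standard labels; check the module listing if in doubt):
from pylipd.classes.paleounit import PaleoUnitConstants, PaleoUnit
from pylipd.classes.paleovariable import PaleoVariableConstants, PaleoVariable

# Resolve a term from a synonym (returns None if the term is not in the thesaurus)
unit = PaleoUnit.from_synonym("mm")

# Or pick a term directly from the corresponding constants class
# (attribute names assumed to follow the standard labels)
unit = PaleoUnitConstants.mm
var = PaleoVariableConstants.d18O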
The next step is to enter the investigators, which takes a list of Person
objects (see Figure in the Preamble):
authors = root.iloc[3,1]
# Step 1: Split the string by commas
parts = authors.split(',')
# Prepare a list to hold the formatted names
investigators = []
# Step 2: Iterate over the parts to process each
for i in range(0, len(parts) - 1, 2):  # Step by 2 since each name and initial are next to each other
    last_name = parts[i].strip()  # Remove any leading/trailing whitespace
    initial = parts[i + 1].strip()  # The initial follows the last name
    person = Person()  # create the Person object
    person.setName(f"{last_name}, {initial}")
    investigators.append(person)
# Step 3: Store the list of Persons into the ds object
ds.setInvestigators(investigators)
Let’s have a quick look at what we’ve done so far:
ds.getName()
'Oman.Tian.2023'
ds.getInvestigators()[0].getName() #get the name of the first person in the list
'Tian, Y.'
Everything looks good so far.
Publication metadata#
pub
5 | Publication Section | Ref #1 | Ref #2 |
---|---|---|---|
0 | Authors (last, first; last2, first2; separate ... | Tian, Y., Fleitmann, D., Zhang, Q., Sha, L. J,... | NaN |
1 | Publication title | Holocene climate change in southern Oman decip... | NaN |
2 | Journal | Nature Communications | NaN |
3 | Year | 2023 | NaN |
4 | Volume | NaN | NaN |
5 | Issue | NaN | NaN |
6 | Pages | NaN | NaN |
7 | Report Number | NaN | NaN |
8 | DOI | 10.1038/s41467-023-40454-z | NaN |
9 | Abstract | Qunf Cave oxygen isotope (δ18Oc) record from s... | NaN |
10 | Alternate citation in paragraph format (For bo... | NaN | NaN |
The first step is to create a Publication
object. In this case, we only have one publication; otherwise you would need to loop over the various columns in the dataframe to create the others (a sketch of such a loop is shown at the end of this subsection).
pub1 = Publication()
And now let’s add the information. Let’s start with the authors:
authors = pub.iloc[0,1]
# Step 1: Split the string by commas
parts = authors.split(',')
# Prepare a list to hold the formatted names
investigators = []
# Step 2: Iterate over the parts to process each
for i in range(0, len(parts) - 1, 2):  # Step by 2 since each name and initial are next to each other
    last_name = parts[i].strip()  # Remove any leading/trailing whitespace
    initial = parts[i + 1].strip()  # The initial follows the last name
    person = Person()  # create the Person object
    person.setName(f"{last_name}, {initial}")
    investigators.append(person)
# Step 3: Store the list of Persons into the Publication object
pub1.setAuthors(investigators)
Let’s add other information:
pub1.setTitle(pub.iloc[1,1])
pub1.setJournal(pub.iloc[2,1])
pub1.setYear(pub.iloc[3,1])
pub1.setDOI(pub.iloc[8,1])
pub1.setAbstract(pub.iloc[9,1])
Let’s add our Publication
object to the Dataset
object:
ds.addPublication(pub1)
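If the template listed several publications in adjacent columns, a minimal sketch of the loop could look like the following (the column handling is an assumption about how a filled-in "Ref #2" column would be organized; adapt it to your template):
# Sketch: columns 2, 3, ... of the pub dataframe would hold Ref #2, Ref #3, ...
for col in range(2, pub.shape[1]):
    if pub.iloc[:, col].isnull().all():
        continue  # skip empty reference columns
    p = Publication()
    p.setTitle(pub.iloc[1, col])
    p.setJournal(pub.iloc[2, col])
    p.setYear(pub.iloc[3, col])
    p.setDOI(pub.iloc[8, col])
    p.setAbstract(pub.iloc[9, col])
    ds.addPublication(p)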
Location metadata#
Let’s add geographical information next. First, we need to create a Location
object:
loc = Location()
geo
18 | Site Information | Use appropriate significant digits for all values | NaN |
---|---|---|---|
0 | Northernmost latitude (decimal degree, South n... | 17.17 | NaN |
1 | Southernmost latitude (decimal degree, South n... | 17.17 | NaN |
2 | Easternmost longitude (decimal degree, West ne... | 54.3 | NaN |
3 | Westernmost longitude (decimal degree, West ne... | 54.3 | NaN |
4 | elevation (m), below sea level negative | 650 | NaN |
loc.setLatitude(geo.iloc[0,1])
loc.setLongitude(geo.iloc[2,1])
loc.setElevation(geo.iloc[4,1])
Let’s add the Location
object into the Dataset
:
ds.setLocation(loc)
Funding metadata#
Finally, let’s look at funding information:
fund
26 | Funding_Agency | Any additional Funding agencies and grants should be entered in Columns C,D, etc. | NaN |
---|---|---|---|
0 | Funding_Agency_Name | NaN | NaN |
1 | Grant | NaN | NaN |
2 | Principal_Investigator | NaN | NaN |
3 | country | NaN | NaN |
Since no information is available, let’s move on to the PaleoData Section.
PaleoData#
sheet_name = 'paleo1measurementTable1'
# Read the information into a Pandas DataFrame
pd_df = pd.read_excel(file_path, sheet_name=sheet_name, header=None)
# Drop completely empty rows
pd_df = pd_df.dropna(how="all").reset_index(drop=True)
Let’s create a PaleoData
object:
paleodata = PaleoData()
Our next step is to create measurement tables in this object. To do so, we can use the DataTable
object:
table = DataTable()
Now let’s add some information about the table, such as the file name and the value used for missing values in the data:
table.setFileName("paleo0measurement0.csv")
table.setMissingValue("NaN")
The next step is to add columns to our table. In other words, we need to create some variables.
Let’s have a look at the sheet in the Excel file. It contains three sections:
Notes, which would be attached to the table
Variables information such as the name, units… Each row in the template contains the metadata for one column of the data.
Data, which contains the numerical values for the variables.
Let’s first create an algorithm that separates the various sections and returns the information in three dataframes (Notes, Variables, and Data).
def extract_data(df):
    '''
    This function extracts the relevant sections for measurementTables.
    '''
    # Find the index positions of the section headers
    notes_start = df[df[0] == "Notes"].index[0]
    variables_start = df[df[0] == "Variables"].index[0]
    data_start = df[df[0] == "Data"].index[0]
    # Extract sections, ensuring blank rows are removed
    df_notes = df.iloc[notes_start + 1:variables_start].dropna(how="all").reset_index(drop=True)
    # Extract the Variables section
    df_variables_raw = df.iloc[variables_start + 1:data_start].dropna(how="all").reset_index(drop=True)
    # Set the first row as the header for the Variables section
    df_variables = df_variables_raw[1:].reset_index(drop=True)  # Data rows
    df_variables.columns = df_variables_raw.iloc[0]  # Set first row as column headers
    # Extract the Data section
    df_data_raw = df.iloc[data_start + 2:].dropna(how="all").reset_index(drop=True)
    # Skip the first row of the raw block and use it as the header
    df_data = df_data_raw.iloc[1:].reset_index(drop=True)  # Skip first row, keep rest
    df_data.columns = df_data_raw.iloc[0]  # Use first row as column headers
    df_data = df_data.dropna(axis=1, how="all")  # Drop the columns with only NaN
    return df_notes, df_variables, df_data
df_notes, df_variables, df_data = extract_data(pd_df)
Let’s loop over the variables and gather the relevant metadata. We will also compute some summary statistics, such as the average value and the average resolution.
In LiPD, each variable is also given a unique ID. The function below generates one:
import uuid

def generate_unique_id(prefix='PYD'):
    # Generate a random UUID
    random_uuid = uuid.uuid4()  # Generates a random UUID.
    # Convert the UUID to the specific format we need
    # A UUID looks like '1e2a2846-2048-480b-9ec6-674daef472bd', so we slice and insert accordingly
    id_str = str(random_uuid)
    formatted_id = f"{prefix}-{id_str[:5]}-{id_str[9:13]}-{id_str[14:18]}-{id_str[19:23]}-{id_str[24:28]}"
    return formatted_id
Also, remember that some of these fields should result in objects (namely, variableName and units). Let’s see if we can get the information from the synonyms before we proceed:
df_variables
variableName | variableType | Units | ProxyObservationType | InferredVariableType | TakenAtDepth | InferredFrom | notes | Interpretation1_variable | Interpretation1_variableDetail | ... | calibration_reference | calibration_uncertainty | calibration_uncertaintyType | sensorGenus | sensorSpecies | PhysicalSample_Name | PhysicalSample_Identifier | PhysicalSample_hasIGSN | PhysicalSample_housedAt | PhysicalSample_collectionMethod | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Depth | NaN | mm | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | Age | NaN | year BP | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | d18O | NaN | per mil VPDB | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | d13C | NaN | per mil VPDB | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 rows × 28 columns
Let’s start with identifying proper objects for the variableNames:
check_names = {}
for index, row in df_variables.iterrows():
    check_names[row['variableName']] = PaleoVariable.from_synonym(row['variableName']).label
check_names
{'Depth': 'depth', 'Age': 'age', 'd18O': 'd18O', 'd13C': 'd13C'}
This looks good for the variable names. Let’s move on to the units:
check_units = {}
for index, row in df_variables.iterrows():
    try:
        check_units[row['Units']] = PaleoUnit.from_synonym(row['Units']).label
    except:
        check_units[row['Units']] = None
check_units
{'mm': 'mm', 'year BP': 'yr BP', 'per mil VPDB': None}
The “per mil VPDB” unit is not recognized automatically (not in the thesaurus). Let’s have a look at the standard unit names and see if we can manually select one that will match.
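One way to browse them is to inspect the attributes of the PaleoUnitConstants class, as in this minimal sketch (you can also consult the LiPD Controlled Vocabulary module in the documentation):
# List the class attributes of PaleoUnitConstants to see the standard unit terms
print([name for name in dir(PaleoUnitConstants) if not name.startswith('_')])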
The permil entry will match, so let’s use this.
variables = []
res = df_data.iloc[:, 1].diff()[1:].to_numpy()
Res = Resolution()  # create a Resolution object - it will be the same for all variables since it is based on time
Res.setMinValue(np.min(res))
Res.setMaxValue(np.max(res))
Res.setMeanValue(np.mean(res))
Res.setMedianValue(np.median(res))
for index, row in df_variables.iterrows():
    var = Variable()
    var.setName(row['variableName'])  # name of the variable
    # Now let's do the standard name
    var.setStandardVariable(PaleoVariable.from_synonym(row['variableName']))
    var.setColumnNumber(index+1)  # The column in which the data is stored. Note that LiPD uses index 1
    var.setVariableId(generate_unique_id(prefix='TIAN'))  # create a unique ID for the variable
    # Units
    if row['Units'] == 'per mil VPDB':
        var.setUnits(PaleoUnit.from_synonym('permil'))
    else:
        var.setUnits(PaleoUnit.from_synonym(row['Units']))
    # Make sure the data is JSON writable (no numpy arrays or Pandas DataFrame)
    var.setValues(json.dumps(df_data.iloc[:,index].tolist()))
    # Calculate some metadata about the values - this makes it easier to do some queries later on, including looking for data in a particular time slice.
    var.setMinValue(df_data.iloc[:,index].min())
    var.setMaxValue(df_data.iloc[:,index].max())
    var.setMeanValue(df_data.iloc[:,index].mean())
    var.setMedianValue(df_data.iloc[:,index].median())
    # Attach the resolution metadata information to the variable
    var.setResolution(Res)
    # Append to the list
    variables.append(var)
Let’s now put our variables in the DataTable
:
table.setVariables(variables)
Then, let's put the table into the PaleoData
object:
paleodata.setMeasurementTables([table])
And finally, the PaleoData
object into the Dataset
:
ds.addPaleoData(paleodata)
ChronData#
The next step is to create a ChronData object to store the information about chronology. In the last section, we used an OOP approach to add the information about each variable. In this section, we will use an approach involving Pandas DataFrame
.
Let’s open the data. The same function we wrote to read in the PaleoData can be used here since the template is the same for both objects.
sheet_name = 'chron1measurementTable1'
# Read the information into a Pandas DataFrame
cd_df = pd.read_excel(file_path, sheet_name=sheet_name, header=None)
# Drop completely empty rows
cd_df = cd_df.dropna(how="all").reset_index(drop=True)
Let’s create a ChronData
object:
chrondata = ChronData()
We need to create a DataTable
, a process similar to what we have done for the PaleoData:
chrontable = DataTable()
Let’s add some basic information about the table:
chrontable.setFileName("chron0measurement0.csv")
chrontable.setMissingValue("NaN")
dfc_notes, dfc_variables, dfc_data = extract_data(cd_df)
We will use the setDataFrame
function to incorporate the columns into the table. In this approach, the values are held in a DataFrame (dfc_data
here) and the metadata for each variable is added as attributes to the DataFrame. Since the variables and units associated with the chronology are not yet standardized, we can leave them as strings:
metadata_dict = {}  # create a dictionary
for index, row in dfc_variables.iterrows():
    temp_dict = {}
    temp_dict['number'] = index + 1
    temp_dict['variableName'] = row['variableName']
    temp_dict['TSid'] = generate_unique_id(prefix='TC')
    if pd.notna(row['Units']):
        temp_dict['units'] = row['Units']
    else:
        temp_dict['units'] = 'NA'
    try:
        temp_dict['hasMinValue'] = dfc_data.iloc[:,index].min()
        temp_dict['hasMaxValue'] = dfc_data.iloc[:,index].max()
        temp_dict['hasMeanValue'] = dfc_data.iloc[:,index].mean()
        temp_dict['hasMedianValue'] = dfc_data.iloc[:,index].median()
    except:
        pass
    metadata_dict[row['variableName']] = temp_dict
dfc_data.attrs = metadata_dict
Use the DataFrame to construct the Table
object:
chrontable.setDataFrame(dfc_data)
Put the Table
into the ChronData
object:
chrondata.setMeasurementTables([chrontable])
Put the ChronData
into the Dataset
:
ds.addChronData(chrondata)
Adding an ensemble table#
This particular dataset also has an ensemble table available in Oman.Tian.2023.chrondf.csv. Let’s add the information to an ensembleTable in the LiPD file.
Let’s first open the data:
ens_path = "../data/Oman.Tian.2023.chrondf.csv"
df_ens = pd.read_csv(ens_path)
df_ens.head()
Unnamed: 0 | age | d18O | depth | chron_0 | chron_1 | chron_2 | chron_3 | chron_4 | chron_5 | ... | chron_990 | chron_991 | chron_992 | chron_993 | chron_994 | chron_995 | chron_996 | chron_997 | chron_998 | chron_999 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 401.88 | -0.20 | 0.00 | 444.0 | 445.0 | 417.0 | 290.0 | 403.0 | 349.0 | ... | 430.0 | 407.0 | 410.0 | 369.0 | 392.0 | 347.0 | 359.0 | 416.0 | 414.0 | 408.0 |
1 | 1 | 408.55 | -0.59 | 0.69 | 445.0 | 446.0 | 445.0 | 355.0 | 404.0 | 367.0 | ... | 431.0 | 410.0 | 411.0 | 371.0 | 395.0 | 358.0 | 362.0 | 417.0 | 416.0 | 414.0 |
2 | 2 | 424.07 | -0.58 | 1.37 | 489.0 | 491.0 | 482.0 | 425.0 | 448.0 | 422.0 | ... | 469.0 | 468.0 | 459.0 | 411.0 | 437.0 | 397.0 | 422.0 | 441.0 | 463.0 | 435.0 |
3 | 3 | 438.75 | -0.73 | 2.06 | 562.0 | 568.0 | 525.0 | 495.0 | 521.0 | 504.0 | ... | 533.0 | 565.0 | 541.0 | 479.0 | 507.0 | 455.0 | 524.0 | 482.0 | 541.0 | 466.0 |
4 | 4 | 450.24 | -1.26 | 2.75 | 568.0 | 574.0 | 536.0 | 503.0 | 531.0 | 514.0 | ... | 547.0 | 575.0 | 548.0 | 491.0 | 510.0 | 461.0 | 530.0 | 493.0 | 549.0 | 480.0 |
5 rows × 1004 columns
Our first step is to drop the first 3 columns in the DataFrame:
df_ens = df_ens.iloc[:, 3:]
df_ens.head()
depth | chron_0 | chron_1 | chron_2 | chron_3 | chron_4 | chron_5 | chron_6 | chron_7 | chron_8 | ... | chron_990 | chron_991 | chron_992 | chron_993 | chron_994 | chron_995 | chron_996 | chron_997 | chron_998 | chron_999 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.00 | 444.0 | 445.0 | 417.0 | 290.0 | 403.0 | 349.0 | 387.0 | 412.0 | 341.0 | ... | 430.0 | 407.0 | 410.0 | 369.0 | 392.0 | 347.0 | 359.0 | 416.0 | 414.0 | 408.0 |
1 | 0.69 | 445.0 | 446.0 | 445.0 | 355.0 | 404.0 | 367.0 | 405.0 | 413.0 | 369.0 | ... | 431.0 | 410.0 | 411.0 | 371.0 | 395.0 | 358.0 | 362.0 | 417.0 | 416.0 | 414.0 |
2 | 1.37 | 489.0 | 491.0 | 482.0 | 425.0 | 448.0 | 422.0 | 447.0 | 442.0 | 441.0 | ... | 469.0 | 468.0 | 459.0 | 411.0 | 437.0 | 397.0 | 422.0 | 441.0 | 463.0 | 435.0 |
3 | 2.06 | 562.0 | 568.0 | 525.0 | 495.0 | 521.0 | 504.0 | 506.0 | 492.0 | 543.0 | ... | 533.0 | 565.0 | 541.0 | 479.0 | 507.0 | 455.0 | 524.0 | 482.0 | 541.0 | 466.0 |
4 | 2.75 | 568.0 | 574.0 | 536.0 | 503.0 | 531.0 | 514.0 | 511.0 | 501.0 | 555.0 | ... | 547.0 | 575.0 | 548.0 | 491.0 | 510.0 | 461.0 | 530.0 | 493.0 | 549.0 | 480.0 |
5 rows × 1001 columns
Next, let’s create the proper dataframe format. The first column (depth) will stay the same. The second column will contain the ensemble values for each depth as a list:
#Let's keep the first column (depth) in place
ens_table = pd.DataFrame({'depth': df_ens['depth'].tolist()})
# Add the year data - each row will contain one vector from a data array.
array = df_ens.iloc[:, 1:].to_numpy()
ens_table['year'] = [array[i,:].tolist() for i in range(array.shape[0])]
ens_table.head()
depth | year | |
---|---|---|
0 | 0.00 | [444.0, 445.0, 417.0, 290.0, 403.0, 349.0, 387... |
1 | 0.69 | [445.0, 446.0, 445.0, 355.0, 404.0, 367.0, 405... |
2 | 1.37 | [489.0, 491.0, 482.0, 425.0, 448.0, 422.0, 447... |
3 | 2.06 | [562.0, 568.0, 525.0, 495.0, 521.0, 504.0, 506... |
4 | 2.75 | [568.0, 574.0, 536.0, 503.0, 531.0, 514.0, 511... |
Let's add attributes to the Pandas DataFrame to store the metadata:
num_year_columns = len(array[0,:])
year_columns = [i+2 for i in range(num_year_columns)]
ens_table.attrs = {
'year': {'number': str(year_columns), 'variableName': 'year', 'units': 'yr AD', 'TSid':generate_unique_id()},
'depth': {'number': 1, 'variableName': 'depth', 'units': 'cm', 'TSid':generate_unique_id()}
}
Let’s create a DataTable
object for our ensemble table:
ensemble_table = DataTable()
ensemble_table.setDataFrame(ens_table)
ensemble_table.setFileName("chron0model0ensemble0.csv")
Now add the table to a model:
model = Model()
model.addEnsembleTable(ensemble_table)
And add the Model to a ChronData object:
chrondata.addModeledBy(model)
Writing a LiPD file#
The last step in this process is to write to a LiPD file. To do so, you need to pass the Dataset ds
back into a LiPD object:
lipd = LiPD()
lipd.load_datasets([ds])
lipd.create_lipd(ds.getName(), "../data/Oman.Tian.2023.lpd");
Opening the LiPD file#
Let’s re-open the LiPD file that we have just created and check some of our work.
L = LiPD()
file = "../data/Oman.Tian.2023.lpd"
L.load(file)
Loading 1 LiPD files
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00, 2.93it/s]
Loaded..
Let’s get the name of the dataset:
L.get_all_dataset_names()
['Oman.Tian.2023']
df_essential = L.get_timeseries_essentials()
df_essential.head()
dataSetName | archiveType | geo_meanLat | geo_meanLon | geo_meanElev | paleoData_variableName | paleoData_values | paleoData_units | paleoData_proxy | paleoData_proxyGeneral | time_variableName | time_values | time_units | depth_variableName | depth_values | depth_units | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Oman.Tian.2023 | Speleothem | 17.17 | 54.3 | 650.0 | d13C | [-2.71, -4.12, -3.45, -3.11, -3.61, -4.86, -4.... | permil | None | None | age | [401.88, 408.55, 424.07, 438.75, 450.24, 461.3... | yr BP | depth | [0.0, 0.69, 1.37, 2.06, 2.75, 3.43, 4.12, 4.81... | mm |
1 | Oman.Tian.2023 | Speleothem | 17.17 | 54.3 | 650.0 | d18O | [-0.2, -0.59, -0.58, -0.73, -1.26, -1.19, -0.6... | permil | None | None | age | [401.88, 408.55, 424.07, 438.75, 450.24, 461.3... | yr BP | depth | [0.0, 0.69, 1.37, 2.06, 2.75, 3.43, 4.12, 4.81... | mm |
2 | Oman.Tian.2023 | Speleothem | 17.17 | 54.3 | 650.0 | d13C | [-2.71, -4.12, -3.45, -3.11, -3.61, -4.86, -4.... | permil | None | None | age | [401.88, 408.55, 424.07, 438.75, 450.24, 461.3... | yr BP | depth | [0.0, 0.69, 1.37, 2.06, 2.75, 3.43, 4.12, 4.81... | mm |
3 | Oman.Tian.2023 | Speleothem | 17.17 | 54.3 | 650.0 | d18O | [-0.2, -0.59, -0.58, -0.73, -1.26, -1.19, -0.6... | permil | None | None | age | [401.88, 408.55, 424.07, 438.75, 450.24, 461.3... | yr BP | depth | [0.0, 0.69, 1.37, 2.06, 2.75, 3.43, 4.12, 4.81... | mm |
4 | Oman.Tian.2023 | Speleothem | 17.17 | 54.3 | 650.0 | d13C | [-2.71, -4.12, -3.45, -3.11, -3.61, -4.86, -4.... | permil | None | None | age | [401.88, 408.55, 424.07, 438.75, 450.24, 461.3... | yr BP | depth | [0.0, 0.69, 1.37, 2.06, 2.75, 3.43, 4.12, 4.81... | mm |
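As a last sanity check, you can also confirm that the ensemble table made it into the file. A minimal sketch, assuming the get_ensemble_tables method of the LiPD class (check the PyLiPD documentation for the exact signature):
# Retrieve the ensemble table(s) stored in the file and inspect the first rows
ens_df = L.get_ensemble_tables()
ens_df.head()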
And voila! The LiPD file is ready to be used.