Basic manipulation of pylipd.LiPD objects#

Authors#

by Deborah Khider

Preamble#

Goals:#

  • Extract a LiPD time series for analysis

  • Remove/pop LiPD datasets from an existing LiPD object

Reading Time: 5 minutes

Keywords#

LiPD; query

Pre-requisites#

None. This tutorial assumes basic knowledge of Python and Pandas. If you are not familiar with this coding language and the Pandas library, check out this tutorial: http://linked.earth/ec_workshops_py/.

Relevant Packages#

Pandas, pylipd

Data Description#

This notebook uses the following datasets, in LiPD format:

  • Nurhati, I. S., Cobb, K. M., & Di Lorenzo, E. (2011). Decadal-scale SST and salinity variations in the central tropical Pacific: Signatures of natural and anthropogenic climate change. Journal of Climate, 24(13), 3294–3308. doi:10.1175/2011jcli3852.1

  • PAGES2k Consortium (2017): A global multiproxy database for temperature reconstructions of the Common Era. Sci Data 4, 170088. doi:10.1038/sdata.2017.88

from pylipd.lipd import LiPD

Demonstration#

Extract time series data from LiPD formatted datasets#

If you are familiar with the R utilities, one useful function is the ability to expand “timeseries” structures. This capability was also present in the previous iteration of the Python utilities, and PyLiPD retains this compatibility to ease the transition.

If you’re unsure about what a “timeseries” is in the LiPD context, read this page.

Working with one dataset#

First, let’s load a single dataset:

data_path = '../data/Ocn-Palmyra.Nurhati.2011.lpd'
D = LiPD()
D.load(data_path)
Loading 1 LiPD files
  0%|          | 0/1 [00:00<?, ?it/s]
100%|██████████| 1/1 [00:00<00:00, 30.48it/s]
Loaded..

Now let’s get all the timeseries for this dataset. Note that the get_timeseries function requires the dataset names to be passed, which is useful if you only want to expand one dataset from your LiPD object. You can also use the get_all_dataset_names function in the call to expand all datasets:

ts_list = D.get_timeseries(D.get_all_dataset_names())

type(ts_list)
Extracting timeseries from dataset: Ocn-Palmyra.Nurhati.2011 ...
dict

Note that the above function returns a dictionary that organizes the extracted timeseries by dataset name:

ts_list.keys()
dict_keys(['Ocn-Palmyra.Nurhati.2011'])

Each timeseries is then stored in a list of dictionaries that preserve essential metadata for each time/depth and value pair:

type(ts_list['Ocn-Palmyra.Nurhati.2011'])
list

Although the information is present, it is not easy to navigate or query across the various lists. One simple way of doing so is to convert the list into a pandas.DataFrame:

ts_list, df = D.get_timeseries(D.get_all_dataset_names(), to_dataframe=True)

df
Extracting timeseries from dataset: Ocn-Palmyra.Nurhati.2011 ...
mode time_id archiveType geo_meanLon geo_meanLat geo_meanElev geo_type geo_siteName geo_ocean geo_pages2kRegion ... paleoData_notes paleoData_sensorSpecies paleoData_sensorGenus paleoData_proxy paleoData_qCCertification paleoData_iso2kUI paleoData_interpretation paleoData_ocean2kID paleoData_inCompilation paleoData_pages2kID
0 paleoData age Coral -162.13 5.87 -10.0 Feature Palmyra WP Ocean ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 paleoData age Coral -162.13 5.87 -10.0 Feature Palmyra WP Ocean ... d18Osw (residuals calculated from coupled SrCa... lutea Porites d18O NaN NaN NaN NaN NaN NaN
2 paleoData age Coral -162.13 5.87 -10.0 Feature Palmyra WP Ocean ... Duplicate of modern d18O record presented in C... lutea Porites d18O MNE, NJA CO11NUPM01B [{'variableDetail': 'sea_surface', 'scope': 'c... NaN NaN NaN
3 paleoData age Coral -162.13 5.87 -10.0 Feature Palmyra WP Ocean ... ; paleoData_variableName changed - was origina... lutea Porites Sr/Ca MNE, NJA CO11NUPM01BT1 [{'scope': 'climate', 'hasVariable': {'label':... PacificNurhati2011 Ocean2k_v1.0.0 Ocn_129
4 paleoData age Coral -162.13 5.87 -10.0 Feature Palmyra WP Ocean ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 88 columns

You can now use all the pandas functionalities for filtering and querying dataframes. First, let’s have a look at the available properties, which correspond to the column headers:

df.columns
Index(['mode', 'time_id', 'archiveType', 'geo_meanLon', 'geo_meanLat',
       'geo_meanElev', 'geo_type', 'geo_siteName', 'geo_ocean',
       'geo_pages2kRegion', 'dataSetName', 'hasUrl', 'lipdVersion',
       'pub1_author', 'pub1_DOI', 'pub1_year', 'pub1_title', 'pub1_journal',
       'pub1_pages', 'pub1_dataUrl', 'pub1_publisher', 'pub1_citeKey',
       'pub1_volume', 'pub2_author', 'pub2_urldate', 'pub2_institution',
       'pub2_title', 'pub2_url', 'pub2_citeKey', 'pub3_author',
       'pub3_publisher', 'pub3_pages', 'pub3_DOI', 'pub3_year', 'pub3_issue',
       'pub3_volume', 'pub3_dataUrl', 'pub3_journal', 'pub3_title',
       'pub3_citeKey', 'studyName', 'dataContributor', 'createdBy',
       'googleMetadataWorksheet', 'originalDataURL', 'googleDataURL',
       'googleSpreadSheetKey', 'tableType', 'paleoData_filename',
       'paleoData_measurementTableName', 'paleoData_measurementTableMD5',
       'paleoData_googleWorkSheetKey', 'paleoData_paleoDataTableName', 'year',
       'yearUnits', 'paleoData_hasMaxValue', 'paleoData_missingValue',
       'paleoData_hasArchiveType_label', 'paleoData_number',
       'paleoData_variableType', 'paleoData_wDSPaleoUrl',
       'paleoData_hasMeanValue', 'paleoData_hasMinValue', 'paleoData_dataType',
       'paleoData_variableName', 'paleoData_hasMedianValue', 'paleoData_TSid',
       'paleoData_resolution_hasMeanValue', 'paleoData_resolution_hasMaxValue',
       'paleoData_resolution_hasMedianValue',
       'paleoData_resolution_hasMinValue', 'paleoData_resolution_units',
       'paleoData_inferredVariableType', 'paleoData_description',
       'paleoData_units', 'paleoData_values',
       'paleoData_useInGlobalTemperatureAnalysis',
       'paleoData_proxyObservationType', 'paleoData_notes',
       'paleoData_sensorSpecies', 'paleoData_sensorGenus', 'paleoData_proxy',
       'paleoData_qCCertification', 'paleoData_iso2kUI',
       'paleoData_interpretation', 'paleoData_ocean2kID',
       'paleoData_inCompilation', 'paleoData_pages2kID'],
      dtype='object')

Let’s have a look at the paleoData_variableName column to see what’s available:

df['paleoData_variableName']
0     year
1     d18O
2     d18O
3    Sr/Ca
4     year
Name: paleoData_variableName, dtype: object

All columns get extracted, which is why year appears as a paleo variable, with its associated values stored in paleoData_values. Notice that there are also two variables named d18O. Since this is a coral record, it stands to reason that one corresponds to the measured \(\delta^{18}O\) of the coral and the other to the \(\delta^{18}O\) of the seawater. Let’s have a look at the notes field:

df[['paleoData_variableName','paleoData_notes']]
paleoData_variableName paleoData_notes
0 year NaN
1 d18O d18Osw (residuals calculated from coupled SrCa...
2 d18O Duplicate of modern d18O record presented in C...
3 Sr/Ca ; paleoData_variableName changed - was origina...
4 year NaN

In fact, one is for the measurement on the coral and the other for seawater. Querying on this small dataset is not necessary; however, it can become useful when looking at a collection of files as shown in the next example (working with multiple datasets).
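To tell the two d18O records apart programmatically, you can filter on the notes column. A minimal sketch, using a mock frame whose rows mirror (abridged) the output above; in practice `df` comes from `get_timeseries`:

```python
import pandas as pd

# Mock stand-in for the expanded timeseries DataFrame (notes abridged)
df = pd.DataFrame({
    "paleoData_variableName": ["year", "d18O", "d18O", "Sr/Ca", "year"],
    "paleoData_notes": [
        None,
        "d18Osw (residuals calculated from coupled SrCa and d18O)",
        "Duplicate of modern d18O record presented in Cobb et al.",
        "; paleoData_variableName changed - was originally Sr_Ca",
        None,
    ],
})

# Keep only the d18O rows whose notes flag the seawater record (d18Osw);
# na=False treats missing notes as non-matches
d18osw = df[(df["paleoData_variableName"] == "d18O")
            & df["paleoData_notes"].str.contains("d18Osw", na=False)]
print(d18osw.index.tolist())  # only the seawater row remains
```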

To extract a single timeseries by row index (here row 4, which corresponds to a year column):

df_cut = df.iloc[4,:]

df_cut
mode                        paleoData
time_id                           age
archiveType                     Coral
geo_meanLon                   -162.13
geo_meanLat                      5.87
                              ...    
paleoData_iso2kUI                 NaN
paleoData_interpretation          NaN
paleoData_ocean2kID               NaN
paleoData_inCompilation           NaN
paleoData_pages2kID               NaN
Name: 4, Length: 88, dtype: object
df_cut['paleoData_variableName']
'year'

This can be very useful when working with the Pyleoclim software, since a pyleoclim.Series can be initialized from the information contained in df_cut. Working with PyLiPD and Pyleoclim is the subject of several other tutorials.
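As a stepping stone toward that workflow, the paired time and value arrays in such a row can be combined into a plain pandas.Series. The row below is a hypothetical stand-in for df_cut (real rows come from the expanded DataFrame above):

```python
import pandas as pd

# Hypothetical row mimicking the fields of `df_cut`
df_cut = pd.Series({
    "paleoData_variableName": "Sr/Ca",
    "paleoData_units": "mmol/mol",
    "paleoData_values": [8.96, 8.90, 8.91],
    "year": [1998.29, 1998.21, 1998.13],
    "yearUnits": "yr AD",
})

# Pair each time value with its measurement
series = pd.Series(df_cut["paleoData_values"],
                   index=df_cut["year"],
                   name=df_cut["paleoData_variableName"])
print(series)
```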

Working with such a large dataframe can be overwhelming and is not needed in some cases. Therefore, PyLiPD has a nifty function called get_timeseries_essentials that grabs information about the dataset, its geographical location, the time/depth values, and the variable information, including archive and proxy:

df_essential = D.get_timeseries_essentials()

df_essential
dataSetName archiveType geo_meanLat geo_meanLon geo_meanElev paleoData_variableName paleoData_values paleoData_units paleoData_proxy paleoData_proxyGeneral time_variableName time_values time_units depth_variableName depth_values depth_units
0 Ocn-Palmyra.Nurhati.2011 Coral 5.87 -162.13 -10.0 Sr_Ca [8.96, 8.9, 8.91, 8.94, 8.92, 8.89, 8.87, 8.81... mmol/mol Sr/Ca None year [1998.29, 1998.21, 1998.13, 1998.04, 1997.96, ... yr AD None None None
1 Ocn-Palmyra.Nurhati.2011 Coral 5.87 -162.13 -10.0 d18O [-5.41, -5.47, -5.49, -5.43, -5.48, -5.53, -5.... permil d18O None year [1998.29, 1998.21, 1998.13, 1998.04, 1997.96, ... yr AD None None None
2 Ocn-Palmyra.Nurhati.2011 Coral 5.87 -162.13 -10.0 d18O [0.39, 0.35, 0.35, 0.35, 0.36, 0.22, 0.33, 0.3... permil d18O None year [1998.21, 1998.13, 1998.04, 1997.96, 1997.88, ... yr AD None None None

The metadata (i.e., the column names) available through this function will always remain the same and are as follows:

df_essential.columns
Index(['dataSetName', 'archiveType', 'geo_meanLat', 'geo_meanLon',
       'geo_meanElev', 'paleoData_variableName', 'paleoData_values',
       'paleoData_units', 'paleoData_proxy', 'paleoData_proxyGeneral',
       'time_variableName', 'time_values', 'time_units', 'depth_variableName',
       'depth_values', 'depth_units'],
      dtype='object')

Working with multiple datasets#

path = '../data/Pages2k/'

D_dir = LiPD()
D_dir.load_from_dir(path)
Loading 16 LiPD files
  0%|          | 0/16 [00:00<?, ?it/s]
 38%|███▊      | 6/16 [00:00<00:00, 50.40it/s]
 75%|███████▌  | 12/16 [00:00<00:00, 39.97it/s]
100%|██████████| 16/16 [00:00<00:00, 43.03it/s]
Loaded..

Let’s expand into our essential dataframe:

df_dir = D_dir.get_timeseries_essentials()

Let’s have a look at the dataframe:

df_dir.head()
dataSetName archiveType geo_meanLat geo_meanLon geo_meanElev paleoData_variableName paleoData_values paleoData_units paleoData_proxy paleoData_proxyGeneral time_variableName time_values time_units depth_variableName depth_values depth_units
0 Eur-NorthernSpain.Martin-Chivelet.2011 Speleothem 42.90 -3.50 1250.0 d18O [0.94, 0.8, 0.23, 0.17, 0.51, 0.36, 0.24, 0.4,... permil d18O None year [2000, 1987, 1983, 1978, 1975, 1971, 1967, 196... yr AD None None None
1 Eur-NorthernScandinavia.Esper.2012 Wood 68.00 25.00 300.0 MXD [0.46, 1.305, 0.755, -0.1, -0.457, 1.62, 0.765... None maximum latewood density None year [-138, -137, -136, -135, -134, -133, -132, -13... yr AD None None None
2 Eur-Stockholm.Leijonhufvud.2009 Documents 59.32 18.06 10.0 temperature [-1.7212, -1.6382, -0.6422, 0.1048, -0.7252, -... degC historical None year [1502, 1503, 1504, 1505, 1506, 1507, 1508, 150... yr AD None None None
3 Eur-LakeSilvaplana.Trachsel.2010 Lake sediment 46.50 9.80 1791.0 temperature [0.181707222, 0.111082797, 0.001382129, -0.008... degC reflectance None year [1175, 1176, 1177, 1178, 1179, 1180, 1181, 118... yr AD None None None
4 Eur-SpanishPyrenees.Dorado-Linan.2012 Wood 42.50 1.00 1200.0 trsgi [-1.612, -0.703, -0.36, -0.767, -0.601, -0.733... None ring width None year [1260, 1261, 1262, 1263, 1264, 1265, 1266, 126... yr AD None None None

The size of this dataframe is:

df_dir.shape
(25, 16)

So we expanded into 25 timeseries.
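To see how those timeseries distribute across datasets, value_counts on the dataSetName column works well. A sketch on a small mock frame (the real `df_dir` comes from the call above):

```python
import pandas as pd

# Mock stand-in for `df_dir`: each row is one expanded timeseries
df_dir = pd.DataFrame({
    "dataSetName": ["Eur-NorthernSpain.Martin-Chivelet.2011",
                    "Ocn-FeniDrift.Richter.2009",
                    "Ocn-FeniDrift.Richter.2009"],
    "paleoData_variableName": ["d18O", "temperature", "d18O"],
})

# Number of timeseries contributed by each dataset
counts = df_dir["dataSetName"].value_counts()
print(counts)
```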

Let’s have a look at the available variables:

df_dir['paleoData_variableName'].unique()
array(['d18O', 'MXD', 'temperature', 'trsgi', 'Uk37', 'notes', 'Mg_Ca',
       'uncertainty_temperature'], dtype=object)

Let’s assume we are only interested in the temperature data:

df_temp = df_dir[df_dir['paleoData_variableName']=='temperature']
df_temp.head()
dataSetName archiveType geo_meanLat geo_meanLon geo_meanElev paleoData_variableName paleoData_values paleoData_units paleoData_proxy paleoData_proxyGeneral time_variableName time_values time_units depth_variableName depth_values depth_units
2 Eur-Stockholm.Leijonhufvud.2009 Documents 59.3200 18.0600 10.0 temperature [-1.7212, -1.6382, -0.6422, 0.1048, -0.7252, -... degC historical None year [1502, 1503, 1504, 1505, 1506, 1507, 1508, 150... yr AD None None None
3 Eur-LakeSilvaplana.Trachsel.2010 Lake sediment 46.5000 9.8000 1791.0 temperature [0.181707222, 0.111082797, 0.001382129, -0.008... degC reflectance None year [1175, 1176, 1177, 1178, 1179, 1180, 1181, 118... yr AD None None None
6 Arc-Kongressvatnet.D'Andrea.2012 Lake sediment 78.0217 13.9311 94.0 temperature [5.9, 5.1, 6.1, 5.3, 4.3, 4.8, 3.8, 4.8, 4.3, ... degC alkenone None year [2008, 2004, 2000, 1996, 1990, 1987, 1982, 197... yr AD None None None
7 Eur-CoastofPortugal.Abrantes.2011 Marine sediment 41.1000 -8.9000 -80.0 temperature [15.235, 15.329, 15.264, 15.376, 15.4, 15.129,... degC alkenone None year [971.19, 982.672, 991.858, 1001.044, 1010.23, ... yr AD None None None
12 Ocn-FeniDrift.Richter.2009 Marine sediment 55.5000 -13.9000 -2543.0 temperature [12.94, 10.99, 10.53, 10.44, 11.39, 13.38, 10.... degC None None year [1998, 1987, 1975, 1962, 1949, 1936, 1924, 191... yr AD depth_bottom [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, ... cm
df_temp.shape
(11, 16)

which leaves us with 11 timeseries.

Let’s assume that you want everything that is not related to time, depth, or uncertainty. To keep only the rows relevant to our problem, you can use the DataFrame.query function available in Pandas:

df_filt = df_dir.query("paleoData_variableName in ('temperature','MXD','Mg_Ca','d18O','trsgi', 'Uk37')")
df_filt.head()
dataSetName archiveType geo_meanLat geo_meanLon geo_meanElev paleoData_variableName paleoData_values paleoData_units paleoData_proxy paleoData_proxyGeneral time_variableName time_values time_units depth_variableName depth_values depth_units
0 Eur-NorthernSpain.Martin-Chivelet.2011 Speleothem 42.90 -3.50 1250.0 d18O [0.94, 0.8, 0.23, 0.17, 0.51, 0.36, 0.24, 0.4,... permil d18O None year [2000, 1987, 1983, 1978, 1975, 1971, 1967, 196... yr AD None None None
1 Eur-NorthernScandinavia.Esper.2012 Wood 68.00 25.00 300.0 MXD [0.46, 1.305, 0.755, -0.1, -0.457, 1.62, 0.765... None maximum latewood density None year [-138, -137, -136, -135, -134, -133, -132, -13... yr AD None None None
2 Eur-Stockholm.Leijonhufvud.2009 Documents 59.32 18.06 10.0 temperature [-1.7212, -1.6382, -0.6422, 0.1048, -0.7252, -... degC historical None year [1502, 1503, 1504, 1505, 1506, 1507, 1508, 150... yr AD None None None
3 Eur-LakeSilvaplana.Trachsel.2010 Lake sediment 46.50 9.80 1791.0 temperature [0.181707222, 0.111082797, 0.001382129, -0.008... degC reflectance None year [1175, 1176, 1177, 1178, 1179, 1180, 1181, 118... yr AD None None None
4 Eur-SpanishPyrenees.Dorado-Linan.2012 Wood 42.50 1.00 1200.0 trsgi [-1.612, -0.703, -0.36, -0.767, -0.601, -0.733... None ring width None year [1260, 1261, 1262, 1263, 1264, 1265, 1266, 126... yr AD None None None
df_filt.shape
(22, 16)

This leaves us with 22 timeseries.
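The same filter can be expressed as an exclusion with isin, which is handy when the variables to drop are fewer than the ones to keep. A sketch on a mock frame standing in for `df_dir`:

```python
import pandas as pd

# Mock stand-in for `df_dir` with a few of the variable names seen above
df_dir = pd.DataFrame({
    "paleoData_variableName": ["d18O", "notes", "temperature",
                               "uncertainty_temperature", "Uk37"],
})

# Drop the bookkeeping variables instead of listing the ones to keep
drop = ["notes", "uncertainty_temperature"]
df_filt = df_dir[~df_dir["paleoData_variableName"].isin(drop)]
print(df_filt["paleoData_variableName"].tolist())
```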

Removing and popping datasets out of a LiPD object#

You can also remove (i.e., delete the corresponding dataset from the LiPD object) or pop (i.e., delete the corresponding dataset from the LiPD object and return it) datasets from a LiPD object. Note that these functionalities behave similarly to the methods with the same names on Python lists. These functions underpin more advanced filtering and querying capabilities that we will discuss in later tutorials.

First let’s make a copy of D_dir:

D_test = D_dir.copy()
print(D_test.get_all_dataset_names())
['Eur-NorthernSpain.Martin-Chivelet.2011', 'Eur-NorthernScandinavia.Esper.2012', 'Eur-Stockholm.Leijonhufvud.2009', 'Eur-LakeSilvaplana.Trachsel.2010', 'Eur-SpanishPyrenees.Dorado-Linan.2012', 'Arc-Kongressvatnet.D_Andrea.2012', 'Eur-CoastofPortugal.Abrantes.2011', 'Ocn-PedradeLume-CapeVerdeIslands.Moses.2006', 'Ocn-FeniDrift.Richter.2009', 'Ocn-SinaiPeninsula_RedSea.Moustafa.2000', 'Ant-WAIS-Divide.Severinghaus.2012', 'Asi-SourthAndMiddleUrals.Demezhko.2007', 'Ocn-AlboranSea436B.Nieto-Moreno.2013', 'Eur-SpannagelCave.Mangini.2005', 'Ocn-RedSea.Felis.2000', 'Eur-FinnishLakelands.Helama.2014']

And let’s remove Eur-Stockholm.Leijonhufvud.2009, which corresponds to the third entry in the list above:

D_test.remove('Eur-Stockholm.Leijonhufvud.2009')

print(D_test.get_all_dataset_names())
['Eur-NorthernSpain.Martin-Chivelet.2011', 'Eur-NorthernScandinavia.Esper.2012', 'Eur-LakeSilvaplana.Trachsel.2010', 'Eur-SpanishPyrenees.Dorado-Linan.2012', 'Arc-Kongressvatnet.D_Andrea.2012', 'Eur-CoastofPortugal.Abrantes.2011', 'Ocn-PedradeLume-CapeVerdeIslands.Moses.2006', 'Ocn-FeniDrift.Richter.2009', 'Ocn-SinaiPeninsula_RedSea.Moustafa.2000', 'Ant-WAIS-Divide.Severinghaus.2012', 'Asi-SourthAndMiddleUrals.Demezhko.2007', 'Ocn-AlboranSea436B.Nieto-Moreno.2013', 'Eur-SpannagelCave.Mangini.2005', 'Ocn-RedSea.Felis.2000', 'Eur-FinnishLakelands.Helama.2014']

Now let’s pop Eur-NorthernScandinavia.Esper.2012 from D_test:

d_eur = D_test.pop('Eur-NorthernScandinavia.Esper.2012')

Now let’s have a look at d_eur:

print(d_eur.get_all_dataset_names())
['Eur-NorthernScandinavia.Esper.2012']

It contains the dataset we are expecting. Let’s have a look at D_test:

print(D_test.get_all_dataset_names())
['Eur-NorthernSpain.Martin-Chivelet.2011', 'Eur-LakeSilvaplana.Trachsel.2010', 'Eur-SpanishPyrenees.Dorado-Linan.2012', 'Arc-Kongressvatnet.D_Andrea.2012', 'Eur-CoastofPortugal.Abrantes.2011', 'Ocn-PedradeLume-CapeVerdeIslands.Moses.2006', 'Ocn-FeniDrift.Richter.2009', 'Ocn-SinaiPeninsula_RedSea.Moustafa.2000', 'Ant-WAIS-Divide.Severinghaus.2012', 'Asi-SourthAndMiddleUrals.Demezhko.2007', 'Ocn-AlboranSea436B.Nieto-Moreno.2013', 'Eur-SpannagelCave.Mangini.2005', 'Ocn-RedSea.Felis.2000', 'Eur-FinnishLakelands.Helama.2014']

The dataset was removed from `D_test` in the process. Hence, it is always prudent to make a copy of the original object before using the `remove` and `pop` functionalities.

You can also remove/pop more than one dataset at a time:

rem = ['Ocn-RedSea.Felis.2000','Ant-WAIS-Divide.Severinghaus.2012']

D_test.remove(rem)
print(D_test.get_all_dataset_names())
['Eur-NorthernSpain.Martin-Chivelet.2011', 'Eur-LakeSilvaplana.Trachsel.2010', 'Eur-SpanishPyrenees.Dorado-Linan.2012', 'Arc-Kongressvatnet.D_Andrea.2012', 'Eur-CoastofPortugal.Abrantes.2011', 'Ocn-PedradeLume-CapeVerdeIslands.Moses.2006', 'Ocn-FeniDrift.Richter.2009', 'Ocn-SinaiPeninsula_RedSea.Moustafa.2000', 'Asi-SourthAndMiddleUrals.Demezhko.2007', 'Ocn-AlboranSea436B.Nieto-Moreno.2013', 'Eur-SpannagelCave.Mangini.2005', 'Eur-FinnishLakelands.Helama.2014']