Basic manipulation of pylipd.LiPD objects#

Authors#

by Deborah Khider

Preamble#

Goals:#

  • Extract a LiPD time series for analysis

  • Remove/pop LiPD datasets from an existing LiPD object

Reading Time: 5 minutes

Keywords#

LiPD; query

Pre-requisites#

None. This tutorial assumes basic knowledge of Python and Pandas. If you are not familiar with this coding language and the Pandas library, check out this tutorial: http://linked.earth/ec_workshops_py/.

Relevant Packages#

Pandas, pylipd

Data Description#

This notebook uses the following datasets, in LiPD format:

  • Nurhati, I. S., Cobb, K. M., & Di Lorenzo, E. (2011). Decadal-scale SST and salinity variations in the central tropical Pacific: Signatures of natural and anthropogenic climate change. Journal of Climate, 24(13), 3294–3308. doi:10.1175/2011jcli3852.1

  • PAGES2k Consortium (2017): A global multiproxy database for temperature reconstructions of the Common Era. Sci Data 4, 170088. doi:10.1038/sdata.2017.88

from pylipd.lipd import LiPD

Demonstration#

Extract time series data from LiPD formatted datasets#

If you are familiar with the R utilities, one useful function is the ability to expand “timeseries” structures. This capability was also present in the previous iteration of the Python utilities, and PyLiPD retains it to ease the transition.

If you’re unsure about what a “timeseries” is in the LiPD context, read this page.

Working with one dataset#

First, let’s load a single dataset:

data_path = '../data/Ocn-Palmyra.Nurhati.2011.lpd'
D = LiPD()
D.load(data_path)
Loading 1 LiPD files
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00, 36.64it/s]
Loaded..

Now let’s get all the timeseries for this dataset. Note that the get_timeseries function requires the dataset names to be passed explicitly, which is useful if you only want to expand a single dataset from your LiPD object. You can also use the get_all_dataset_names function in the call to expand all datasets:

ts_list = D.get_timeseries(D.get_all_dataset_names())

type(ts_list)
Extracting timeseries from dataset: Ocn-Palmyra.Nurhati.2011 ...
dict

Note that the above function returns a dictionary that organizes the extracted timeseries by dataset name:

ts_list.keys()
dict_keys(['Ocn-Palmyra.Nurhati.2011'])

Each timeseries is then stored in a list of dictionaries, each preserving the essential metadata for a time/depth and value pair:

type(ts_list['Ocn-Palmyra.Nurhati.2011'])
list
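For instance, you can peek at the first dictionary in that list to see which metadata fields were expanded (a quick check; the exact keys vary with the dataset):

first_ts = ts_list['Ocn-Palmyra.Nurhati.2011'][0]  # first expanded timeseries
sorted(first_ts.keys())[:10]  # show only the first ten keys, for brevity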

Although the information is present, it is not easy to navigate or query across the various lists. One simple way of doing so is to convert the list into a pandas.DataFrame:

ts_list, df = D.get_timeseries(D.get_all_dataset_names(), to_dataframe=True)

df
Extracting timeseries from dataset: Ocn-Palmyra.Nurhati.2011 ...
mode time_id archiveType originalDataURL dataContributor dataSetName geo_meanLon geo_meanLat geo_meanElev geo_type ... paleoData_proxyObservationType paleoData_sensorGenus paleoData_notes paleoData_proxy paleoData_iso2kUI paleoData_interpretation paleoData_qCCertification paleoData_ocean2kID paleoData_inCompilation paleoData_pages2kID
0 paleoData age Coral http://hurricane.ncdc.noaa.gov/pls/paleox/f?p=... {'name': 'HLF MNE'} Ocn-Palmyra.Nurhati.2011 -162.13 5.87 -10.0 Feature ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 paleoData age Coral http://hurricane.ncdc.noaa.gov/pls/paleox/f?p=... {'name': 'HLF MNE'} Ocn-Palmyra.Nurhati.2011 -162.13 5.87 -10.0 Feature ... d18O Porites d18Osw (residuals calculated from coupled SrCa... d18O NaN NaN NaN NaN NaN NaN
2 paleoData age Coral http://hurricane.ncdc.noaa.gov/pls/paleox/f?p=... {'name': 'HLF MNE'} Ocn-Palmyra.Nurhati.2011 -162.13 5.87 -10.0 Feature ... Sr/Ca Porites ; paleoData_variableName changed - was origina... Sr/Ca CO11NUPM01BT1 [{'scope': 'climate', 'variableDetail': 'sea@s... MNE, NJA PacificNurhati2011 Ocean2k_v1.0.0 Ocn_129
3 paleoData age Coral http://hurricane.ncdc.noaa.gov/pls/paleox/f?p=... {'name': 'HLF MNE'} Ocn-Palmyra.Nurhati.2011 -162.13 5.87 -10.0 Feature ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 paleoData age Coral http://hurricane.ncdc.noaa.gov/pls/paleox/f?p=... {'name': 'HLF MNE'} Ocn-Palmyra.Nurhati.2011 -162.13 5.87 -10.0 Feature ... d18O Porites Duplicate of modern d18O record presented in C... d18O CO11NUPM01B [{'scope': 'climate', 'variableDetail': 'sea_s... MNE, NJA NaN NaN NaN

5 rows × 88 columns

You can now use all the pandas functionality for filtering and querying dataframes. First, let’s have a look at the available properties, which correspond to the column headers:

df.columns
Index(['mode', 'time_id', 'lipdVersion', 'googleDataURL',
       'googleSpreadSheetKey', 'pub1_author', 'pub1_url', 'pub1_urldate',
       'pub1_institution', 'pub1_citeKey', 'pub1_title', 'pub1_DOI',
       'pub2_author', 'pub2_year', 'pub2_doi', 'pub2_title', 'pub2_hasLink',
       'pub2_publisher', 'pub2_volume', 'pub2_citeKey', 'pub2_pages',
       'pub2_dataUrl', 'pub2_journal', 'pub2_DOI', 'pub3_author', 'pub3_title',
       'pub3_hasLink', 'pub3_dataUrl', 'pub3_journal', 'pub3_issue',
       'pub3_pages', 'pub3_doi', 'pub3_citeKey', 'pub3_volume',
       'pub3_publisher', 'pub3_year', 'pub3_DOI', 'dataContributor',
       'originalDataURL', 'dataSetName', 'geo_meanLon', 'geo_meanLat',
       'geo_meanElev', 'geo_type', 'geo_ocean', 'geo_pages2kRegion',
       'geo_siteName', 'hasUrl', 'studyName', 'createdBy',
       'googleMetadataWorksheet', 'archiveType', 'tableType',
       'paleoData_measurementTableMD5', 'paleoData_paleoDataTableName',
       'paleoData_filename', 'paleoData_googleWorkSheetKey',
       'paleoData_measurementTableName', 'year', 'yearUnits',
       'paleoData_archiveType', 'paleoData_hasMeanValue',
       'paleoData_hasMinValue', 'paleoData_inferredVariableType',
       'paleoData_TSid', 'paleoData_resolution_hasMeanValue',
       'paleoData_resolution_hasMinValue',
       'paleoData_resolution_hasMedianValue',
       'paleoData_resolution_hasMaxValue', 'paleoData_resolution_units',
       'paleoData_dataType', 'paleoData_hasMedianValue',
       'paleoData_variableType', 'paleoData_number', 'paleoData_missingValue',
       'paleoData_variableName', 'paleoData_wDSPaleoUrl',
       'paleoData_description', 'paleoData_hasMaxValue', 'paleoData_units',
       'paleoData_values', 'paleoData_proxyObservationType',
       'paleoData_useInGlobalTemperatureAnalysis', 'paleoData_notes',
       'paleoData_sensorGenus', 'paleoData_sensorSpecies', 'paleoData_proxy',
       'paleoData_compilation_nest', 'paleoData_interpretation',
       'paleoData_inCompilation'],
      dtype='object')

Let’s have a look at the paleoData_variableName column to see what’s available:

df['paleoData_variableName']
0     year
1     d18O
2     d18O
3     year
4    Sr/Ca
Name: paleoData_variableName, dtype: object

All columns get extracted, which is why year appears as a paleo variable, with its associated values stored in paleoData_values. Notice that there are also two variables named d18O. Since this is a coral record, it stands to reason that one corresponds to the measured \(\delta^{18}O\) of the coral and the other to the \(\delta^{18}O\) of the seawater. Let’s have a look at the notes field:

df[['paleoData_variableName','paleoData_notes']]
paleoData_variableName paleoData_notes
0 year NaN
1 d18O d18Osw (residuals calculated from coupled SrCa...
2 d18O Duplicate of modern d18O record presented in C...
3 year NaN
4 Sr/Ca ; paleoData_variableName changed - was origina...

Indeed, one is the measurement on the coral and the other on the seawater. Querying is hardly necessary on such a small dataset; however, it becomes useful when looking at a collection of files, as shown in the next example (working with multiple datasets).
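For instance, one quick way to disambiguate the two d18O series with pandas is to match on the notes field (a sketch; the 'd18Osw' substring is specific to this dataset):

# Keep only the seawater d18O record by matching on the notes field
mask = (df['paleoData_variableName'] == 'd18O') & df['paleoData_notes'].str.contains('d18Osw', na=False)
df.loc[mask, ['paleoData_variableName', 'paleoData_notes']]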

To extract by row index (here extracting the Sr/Ca record):

df_cut = df.iloc[4,:]

df_cut
mode                                                                  paleoData
time_id                                                                     age
lipdVersion                                                                 1.3
googleDataURL                 https://docs.google.com/spreadsheets/d/1AIFOcq...
googleSpreadSheetKey               1AIFOcqDtbZ5O4YCCnH5K-lNdzmI1UNmGnUpCJFNN-YQ
                                                    ...                        
paleoData_sensorSpecies                                                   lutea
paleoData_proxy                                                           Sr/Ca
paleoData_compilation_nest                                             MNE, NJA
paleoData_interpretation      [{'scope': 'climate', 'variableDetail': 'sea@s...
paleoData_inCompilation                                          Ocean2k_v1.0.0
Name: 4, Length: 90, dtype: object
df_cut['paleoData_variableName']
'Sr/Ca'

This can be very useful when working with the Pyleoclim software, since a pyleoclim.Series can be initialized from the information contained in df_cut. Working with PyLiPD and Pyleoclim together is the subject of several other tutorials.
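As a rough sketch of what that hand-off could look like (assuming pyleoclim is installed; the dedicated tutorials cover the recommended workflow in detail):

import pyleoclim as pyleo

# Build a Series from the row extracted above (df_cut holds the Sr/Ca record)
ts = pyleo.Series(
    time=df_cut['year'],                       # time axis from the 'year' column
    value=df_cut['paleoData_values'],          # measured values
    time_name='Time',
    time_unit=df_cut['yearUnits'],
    value_name=df_cut['paleoData_variableName'],
    value_unit=df_cut['paleoData_units'],
    label=df_cut['dataSetName'],
)
ts.plot()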

Working with such a large dataframe can be overwhelming and is often not needed. Therefore, PyLiPD has a nifty function called get_timeseries_essentials that grabs the essential information for each record: the dataset name, its geographical location, the time/depth values and units, and the variable information, including archive and proxy types:

df_essential = D.get_timeseries_essentials()

df_essential
dataSetName archiveType geo_meanLat geo_meanLon geo_meanElev paleoData_variableName paleoData_values paleoData_units paleoData_proxy paleoData_proxyGeneral time_variableName time_values time_units depth_variableName depth_values depth_units
0 Ocn-Palmyra.Nurhati.2011 Coral 5.87 -162.13 -10.0 d18O [-5.41, -5.47, -5.49, -5.43, -5.48, -5.53, -5.... permil d18O None year [1998.29, 1998.21, 1998.13, 1998.04, 1997.96, ... yr AD None None None
1 Ocn-Palmyra.Nurhati.2011 Coral 5.87 -162.13 -10.0 d18O [0.39, 0.35, 0.35, 0.35, 0.36, 0.22, 0.33, 0.3... permil d18O None year [1998.21, 1998.13, 1998.04, 1997.96, 1997.88, ... yr AD None None None
2 Ocn-Palmyra.Nurhati.2011 Coral 5.87 -162.13 -10.0 Sr_Ca [8.96, 8.9, 8.91, 8.94, 8.92, 8.89, 8.87, 8.81... mmol/mol Sr/Ca None year [1998.29, 1998.21, 1998.13, 1998.04, 1997.96, ... yr AD None None None

The metadata (i.e., the column names) available through this function will always remain the same and are as follows:

df_essential.columns
Index(['dataSetName', 'archiveType', 'geo_meanLat', 'geo_meanLon',
       'geo_meanElev', 'paleoData_variableName', 'paleoData_values',
       'paleoData_units', 'paleoData_proxy', 'paleoData_proxyGeneral',
       'time_variableName', 'time_values', 'time_units', 'depth_variableName',
       'depth_values', 'depth_units'],
      dtype='object')
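Because these columns never change, downstream code can rely on them directly. For instance, here is a minimal matplotlib sketch plotting the Sr/Ca record from the third row (styling kept deliberately simple):

import matplotlib.pyplot as plt

row = df_essential.iloc[2]  # the Sr/Ca record
plt.plot(row['time_values'], row['paleoData_values'])
plt.xlabel(f"{row['time_variableName']} ({row['time_units']})")
plt.ylabel(f"{row['paleoData_variableName']} ({row['paleoData_units']})")
plt.title(row['dataSetName'])
plt.show()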

Working with multiple datasets#

path = '../data/Pages2k/'

D_dir = LiPD()
D_dir.load_from_dir(path)
Loading 16 LiPD files
100%|██████████████████████████████████████████| 16/16 [00:00<00:00, 112.46it/s]
Loaded..

Let’s expand into our essential dataframe:

df_dir = D_dir.get_timeseries_essentials()

Let’s have a look at the dataframe:

df_dir.head()
dataSetName archiveType geo_meanLat geo_meanLon geo_meanElev paleoData_variableName paleoData_values paleoData_units paleoData_proxy paleoData_proxyGeneral time_variableName time_values time_units depth_variableName depth_values depth_units
0 Ocn-RedSea.Felis.2000 Coral 27.8500 34.3200 -6.0 d18O [-4.12, -3.82, -3.05, -3.02, -3.62, -3.96, -3.... permil d18O None year [1995.583, 1995.417, 1995.25, 1995.083, 1994.9... yr AD None None None
1 Ant-WAIS-Divide.Severinghaus.2012 Borehole -79.4630 -112.1250 1766.0 uncertainty_temperature [1.327, 1.328, 1.328, 1.329, 1.33, 1.33, 1.331... degC None None year [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,... yr AD None None None
2 Ant-WAIS-Divide.Severinghaus.2012 Borehole -79.4630 -112.1250 1766.0 temperature [-29.607, -29.607, -29.606, -29.606, -29.605, ... degC borehole None year [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,... yr AD None None None
3 Asi-SourthAndMiddleUrals.Demezhko.2007 Borehole 55.0000 59.5000 1900.0 temperature [0.166, 0.264, 0.354, 0.447, 0.538, 0.62, 0.68... degC borehole None year [800, 850, 900, 950, 1000, 1050, 1100, 1150, 1... yr AD None None None
4 Ocn-AlboranSea436B.Nieto-Moreno.2013 Marine sediment 36.2053 -4.3133 -1108.0 temperature [18.79, 19.38, 19.61, 18.88, 18.74, 19.25, 18.... degC alkenone None year [1999.07, 1993.12, 1987.17, 1975.26, 1963.36, ... yr AD None None None

The size of this dataframe is:

df_dir.shape
(25, 16)

So we expanded into 25 timeseries.
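Before filtering, a quick pandas summary shows how those records break down by archive type:

# Count the expanded timeseries per archive type
df_dir['archiveType'].value_counts()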

Let’s have a look at the available variables:

df_dir['paleoData_variableName'].unique()
array(['d18O', 'uncertainty_temperature', 'temperature', 'Mg_Ca', 'notes',
       'Uk37', 'trsgi', 'MXD'], dtype=object)

Let’s assume we are only interested in the temperature data:

df_temp = df_dir[df_dir['paleoData_variableName']=='temperature']
df_temp.head()
dataSetName archiveType geo_meanLat geo_meanLon geo_meanElev paleoData_variableName paleoData_values paleoData_units paleoData_proxy paleoData_proxyGeneral time_variableName time_values time_units depth_variableName depth_values depth_units
2 Ant-WAIS-Divide.Severinghaus.2012 Borehole -79.4630 -112.1250 1766.0 temperature [-29.607, -29.607, -29.606, -29.606, -29.605, ... degC borehole None year [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,... yr AD None None None
3 Asi-SourthAndMiddleUrals.Demezhko.2007 Borehole 55.0000 59.5000 1900.0 temperature [0.166, 0.264, 0.354, 0.447, 0.538, 0.62, 0.68... degC borehole None year [800, 850, 900, 950, 1000, 1050, 1100, 1150, 1... yr AD None None None
4 Ocn-AlboranSea436B.Nieto-Moreno.2013 Marine sediment 36.2053 -4.3133 -1108.0 temperature [18.79, 19.38, 19.61, 18.88, 18.74, 19.25, 18.... degC alkenone None year [1999.07, 1993.12, 1987.17, 1975.26, 1963.36, ... yr AD None None None
6 Ocn-FeniDrift.Richter.2009 Marine sediment 55.5000 -13.9000 -2543.0 temperature [12.94, 10.99, 10.53, 10.44, 11.39, 13.38, 10.... degC None None year [1998, 1987, 1975, 1962, 1949, 1936, 1924, 191... yr AD depth_bottom [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, ... cm
7 Ocn-FeniDrift.Richter.2009 Marine sediment 55.5000 -13.9000 -2543.0 temperature [12.94, 10.99, 10.53, 10.44, 11.39, 13.38, 10.... degC None None year [1998, 1987, 1975, 1962, 1949, 1936, 1924, 191... yr AD depth_top [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, ... cm
df_temp.shape
(11, 16)

which leaves us with 11 timeseries.

Let’s assume that you want everything that is not related to time, depth, and uncertainty. To keep the rows that are relevant to our problem, you can use the DataFrame.query function available in Pandas:

df_filt = df_dir.query("paleoData_variableName in ('temperature','MXD','Mg_Ca','d18O','trsgi', 'Uk37')")
df_filt.head()
dataSetName archiveType geo_meanLat geo_meanLon geo_meanElev paleoData_variableName paleoData_values paleoData_units paleoData_proxy paleoData_proxyGeneral time_variableName time_values time_units depth_variableName depth_values depth_units
0 Ocn-RedSea.Felis.2000 Coral 27.8500 34.3200 -6.0 d18O [-4.12, -3.82, -3.05, -3.02, -3.62, -3.96, -3.... permil d18O None year [1995.583, 1995.417, 1995.25, 1995.083, 1994.9... yr AD None None None
2 Ant-WAIS-Divide.Severinghaus.2012 Borehole -79.4630 -112.1250 1766.0 temperature [-29.607, -29.607, -29.606, -29.606, -29.605, ... degC borehole None year [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,... yr AD None None None
3 Asi-SourthAndMiddleUrals.Demezhko.2007 Borehole 55.0000 59.5000 1900.0 temperature [0.166, 0.264, 0.354, 0.447, 0.538, 0.62, 0.68... degC borehole None year [800, 850, 900, 950, 1000, 1050, 1100, 1150, 1... yr AD None None None
4 Ocn-AlboranSea436B.Nieto-Moreno.2013 Marine sediment 36.2053 -4.3133 -1108.0 temperature [18.79, 19.38, 19.61, 18.88, 18.74, 19.25, 18.... degC alkenone None year [1999.07, 1993.12, 1987.17, 1975.26, 1963.36, ... yr AD None None None
5 Eur-SpannagelCave.Mangini.2005 Speleothem 47.1000 11.6000 2347.0 d18O [-7.49, -7.41, -7.36, -7.15, -7.28, -6.99, -6.... permil d18O None year [1935.0, 1932.0, 1930.0, 1929.0, 1929.0, 1928.... yr AD None None None
df_filt.shape
(22, 16)

which leaves us with 22 timeseries.
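An equivalent way to express this filter is by exclusion, keeping everything except the unwanted variables (a sketch using isin; it should return the same 22 rows):

# Same selection, expressed by excluding the variables we do not want
exclude = ['uncertainty_temperature', 'notes']
df_filt2 = df_dir[~df_dir['paleoData_variableName'].isin(exclude)]
df_filt2.shape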

Removing and popping datasets out of a LiPD object#

You can also remove (i.e., delete the corresponding dataset from the LiPD object) or pop (i.e., delete the corresponding dataset from the LiPD object and return it) datasets from a LiPD object. Note that these functionalities behave similarly to the methods of the same names on Python lists. These functions underpin more advanced filtering and querying capabilities that we will discuss in later tutorials.

First let’s make a copy of D_dir:

D_test = D_dir.copy()
print(D_test.get_all_dataset_names())
['Ocn-RedSea.Felis.2000', 'Ant-WAIS-Divide.Severinghaus.2012', 'Asi-SourthAndMiddleUrals.Demezhko.2007', 'Ocn-AlboranSea436B.Nieto-Moreno.2013', 'Eur-SpannagelCave.Mangini.2005', 'Ocn-FeniDrift.Richter.2009', 'Eur-LakeSilvaplana.Trachsel.2010', 'Ocn-PedradeLume-CapeVerdeIslands.Moses.2006', 'Ocn-SinaiPeninsula_RedSea.Moustafa.2000', 'Eur-NorthernSpain.Martin-Chivelet.2011', 'Arc-Kongressvatnet.D_Andrea.2012', 'Eur-CoastofPortugal.Abrantes.2011', 'Eur-SpanishPyrenees.Dorado-Linan.2012', 'Eur-FinnishLakelands.Helama.2014', 'Eur-NorthernScandinavia.Esper.2012', 'Eur-Stockholm.Leijonhufvud.2009']

And let’s remove Eur-Stockholm.Leijonhufvud.2009, which corresponds to the last entry in the list above:

D_test.remove('Eur-Stockholm.Leijonhufvud.2009')

print(D_test.get_all_dataset_names())
['Ocn-RedSea.Felis.2000', 'Ant-WAIS-Divide.Severinghaus.2012', 'Asi-SourthAndMiddleUrals.Demezhko.2007', 'Ocn-AlboranSea436B.Nieto-Moreno.2013', 'Eur-SpannagelCave.Mangini.2005', 'Ocn-FeniDrift.Richter.2009', 'Eur-LakeSilvaplana.Trachsel.2010', 'Ocn-PedradeLume-CapeVerdeIslands.Moses.2006', 'Ocn-SinaiPeninsula_RedSea.Moustafa.2000', 'Eur-NorthernSpain.Martin-Chivelet.2011', 'Arc-Kongressvatnet.D_Andrea.2012', 'Eur-CoastofPortugal.Abrantes.2011', 'Eur-SpanishPyrenees.Dorado-Linan.2012', 'Eur-FinnishLakelands.Helama.2014', 'Eur-NorthernScandinavia.Esper.2012']

Now let’s pop Eur-NorthernScandinavia.Esper.2012 from D_test:

d_eur = D_test.pop('Eur-NorthernScandinavia.Esper.2012')

Now let’s have a look at d_eur:

print(d_eur.get_all_dataset_names())
['Eur-NorthernScandinavia.Esper.2012']

It contains the dataset we are expecting. Let’s have a look at D_test:

print(D_test.get_all_dataset_names())
['Ocn-RedSea.Felis.2000', 'Ant-WAIS-Divide.Severinghaus.2012', 'Asi-SourthAndMiddleUrals.Demezhko.2007', 'Ocn-AlboranSea436B.Nieto-Moreno.2013', 'Eur-SpannagelCave.Mangini.2005', 'Ocn-FeniDrift.Richter.2009', 'Eur-LakeSilvaplana.Trachsel.2010', 'Ocn-PedradeLume-CapeVerdeIslands.Moses.2006', 'Ocn-SinaiPeninsula_RedSea.Moustafa.2000', 'Eur-NorthernSpain.Martin-Chivelet.2011', 'Arc-Kongressvatnet.D_Andrea.2012', 'Eur-CoastofPortugal.Abrantes.2011', 'Eur-SpanishPyrenees.Dorado-Linan.2012', 'Eur-FinnishLakelands.Helama.2014']

The dataset was removed from D_test in the process. Hence, it is always prudent to make a copy of the original object when using the remove and pop functionalities.

You can also remove/pop more than one dataset at a time:

rem = ['Ocn-RedSea.Felis.2000','Ant-WAIS-Divide.Severinghaus.2012']

D_test.remove(rem)
print(D_test.get_all_dataset_names())
['Asi-SourthAndMiddleUrals.Demezhko.2007', 'Ocn-AlboranSea436B.Nieto-Moreno.2013', 'Eur-SpannagelCave.Mangini.2005', 'Ocn-FeniDrift.Richter.2009', 'Eur-LakeSilvaplana.Trachsel.2010', 'Ocn-PedradeLume-CapeVerdeIslands.Moses.2006', 'Ocn-SinaiPeninsula_RedSea.Moustafa.2000', 'Eur-NorthernSpain.Martin-Chivelet.2011', 'Arc-Kongressvatnet.D_Andrea.2012', 'Eur-CoastofPortugal.Abrantes.2011', 'Eur-SpanishPyrenees.Dorado-Linan.2012', 'Eur-FinnishLakelands.Helama.2014']