Basic manipulation of pylipd.LiPD objects#
Preamble#
Goals:#
Extract a LiPD time series for analysis
Remove/pop LiPD datasets from an existing LiPD object
Reading Time: 5 minutes
Keywords#
LiPD; query
Pre-requisites#
None. This tutorial assumes basic knowledge of Python and Pandas. If you are not familiar with this coding language and the Pandas library, check out this tutorial: http://linked.earth/ec_workshops_py/.
Relevant Packages#
Pandas, pylipd
Data Description#
This notebook uses the following datasets, in LiPD format:
Nurhati, I. S., Cobb, K. M., & Di Lorenzo, E. (2011). Decadal-scale SST and salinity variations in the central tropical Pacific: Signatures of natural and anthropogenic climate change. Journal of Climate, 24(13), 3294–3308. doi:10.1175/2011jcli3852.1
PAGES2k Consortium (2017): A global multiproxy database for temperature reconstructions of the Common Era. Sci Data 4, 170088. doi:10.1038/sdata.2017.88
from pylipd.lipd import LiPD
Demonstration#
Extract time series data from LiPD formatted datasets#
If you are familiar with the R utilities, one useful function is the ability to expand “timeseries” structures. This capability was also present in the previous iteration of the Python utilities, and PyLiPD retains this compatibility to ease the transition.
If you’re unsure about what a “timeseries” is in the LiPD context, read this page.
Working with one dataset#
First, let’s load a single dataset:
data_path = '../data/Ocn-Palmyra.Nurhati.2011.lpd'
D = LiPD()
D.load(data_path)
Loading 1 LiPD files
0%| | 0/1 [00:00<?, ?it/s]
100%|██████████| 1/1 [00:00<00:00, 30.48it/s]
Loaded..
Now let’s get all the timeseries for this dataset. Note that the get_timeseries function requires you to pass the dataset names, which is useful if you want to expand only one dataset from your LiPD object. You can also use the get_all_dataset_names function in the call to expand all datasets:
ts_list = D.get_timeseries(D.get_all_dataset_names())
type(ts_list)
Extracting timeseries from dataset: Ocn-Palmyra.Nurhati.2011 ...
dict
Note that the above function returns a dictionary that organizes the extracted timeseries by dataset name:
ts_list.keys()
dict_keys(['Ocn-Palmyra.Nurhati.2011'])
Each timeseries is then stored in a list of dictionaries that preserve essential metadata for each time/depth and value pair:
type(ts_list['Ocn-Palmyra.Nurhati.2011'])
list
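For instance, you can peek at the metadata fields attached to the first timeseries in the list (an illustrative snippet; the exact keys depend on the dataset):
# each entry is a plain dictionary; its keys are the metadata field names
ts_list['Ocn-Palmyra.Nurhati.2011'][0].keys()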
Although the information is present, it is not easy to navigate or query across the various lists. One simple way of doing so is to convert the list to a pandas.DataFrame:
ts_list, df = D.get_timeseries(D.get_all_dataset_names(), to_dataframe=True)
df
Extracting timeseries from dataset: Ocn-Palmyra.Nurhati.2011 ...
|   | mode | time_id | archiveType | geo_meanLon | geo_meanLat | geo_meanElev | geo_type | geo_siteName | geo_ocean | geo_pages2kRegion | ... | paleoData_notes | paleoData_sensorSpecies | paleoData_sensorGenus | paleoData_proxy | paleoData_qCCertification | paleoData_iso2kUI | paleoData_interpretation | paleoData_ocean2kID | paleoData_inCompilation | paleoData_pages2kID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | paleoData | age | Coral | -162.13 | 5.87 | -10.0 | Feature | Palmyra | WP | Ocean | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | paleoData | age | Coral | -162.13 | 5.87 | -10.0 | Feature | Palmyra | WP | Ocean | ... | d18Osw (residuals calculated from coupled SrCa... | lutea | Porites | d18O | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | paleoData | age | Coral | -162.13 | 5.87 | -10.0 | Feature | Palmyra | WP | Ocean | ... | Duplicate of modern d18O record presented in C... | lutea | Porites | d18O | MNE, NJA | CO11NUPM01B | [{'variableDetail': 'sea_surface', 'scope': 'c... | NaN | NaN | NaN |
| 3 | paleoData | age | Coral | -162.13 | 5.87 | -10.0 | Feature | Palmyra | WP | Ocean | ... | ; paleoData_variableName changed - was origina... | lutea | Porites | Sr/Ca | MNE, NJA | CO11NUPM01BT1 | [{'scope': 'climate', 'hasVariable': {'label':... | PacificNurhati2011 | Ocean2k_v1.0.0 | Ocn_129 |
| 4 | paleoData | age | Coral | -162.13 | 5.87 | -10.0 | Feature | Palmyra | WP | Ocean | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 88 columns
You can now use all the pandas functionalities for filtering and querying dataframes. First, let’s have a look at the available properties, which correspond to the column headers:
df.columns
Index(['mode', 'time_id', 'archiveType', 'geo_meanLon', 'geo_meanLat',
'geo_meanElev', 'geo_type', 'geo_siteName', 'geo_ocean',
'geo_pages2kRegion', 'dataSetName', 'hasUrl', 'lipdVersion',
'pub1_author', 'pub1_DOI', 'pub1_year', 'pub1_title', 'pub1_journal',
'pub1_pages', 'pub1_dataUrl', 'pub1_publisher', 'pub1_citeKey',
'pub1_volume', 'pub2_author', 'pub2_urldate', 'pub2_institution',
'pub2_title', 'pub2_url', 'pub2_citeKey', 'pub3_author',
'pub3_publisher', 'pub3_pages', 'pub3_DOI', 'pub3_year', 'pub3_issue',
'pub3_volume', 'pub3_dataUrl', 'pub3_journal', 'pub3_title',
'pub3_citeKey', 'studyName', 'dataContributor', 'createdBy',
'googleMetadataWorksheet', 'originalDataURL', 'googleDataURL',
'googleSpreadSheetKey', 'tableType', 'paleoData_filename',
'paleoData_measurementTableName', 'paleoData_measurementTableMD5',
'paleoData_googleWorkSheetKey', 'paleoData_paleoDataTableName', 'year',
'yearUnits', 'paleoData_hasMaxValue', 'paleoData_missingValue',
'paleoData_hasArchiveType_label', 'paleoData_number',
'paleoData_variableType', 'paleoData_wDSPaleoUrl',
'paleoData_hasMeanValue', 'paleoData_hasMinValue', 'paleoData_dataType',
'paleoData_variableName', 'paleoData_hasMedianValue', 'paleoData_TSid',
'paleoData_resolution_hasMeanValue', 'paleoData_resolution_hasMaxValue',
'paleoData_resolution_hasMedianValue',
'paleoData_resolution_hasMinValue', 'paleoData_resolution_units',
'paleoData_inferredVariableType', 'paleoData_description',
'paleoData_units', 'paleoData_values',
'paleoData_useInGlobalTemperatureAnalysis',
'paleoData_proxyObservationType', 'paleoData_notes',
'paleoData_sensorSpecies', 'paleoData_sensorGenus', 'paleoData_proxy',
'paleoData_qCCertification', 'paleoData_iso2kUI',
'paleoData_interpretation', 'paleoData_ocean2kID',
'paleoData_inCompilation', 'paleoData_pages2kID'],
dtype='object')
Let’s have a look at the paleoData_variableName column to see what’s available:
df['paleoData_variableName']
0 year
1 d18O
2 d18O
3 Sr/Ca
4 year
Name: paleoData_variableName, dtype: object
All columns get extracted, which is why year appears as a paleo variable, with its associated values stored in paleoData_values. Notice that there are also two variables named d18O. Since this is a coral record, it stands to reason that one corresponds to the measured \(\delta^{18}O\) of the coral and the other to the \(\delta^{18}O\) of the seawater. Let’s have a look at the notes field:
df[['paleoData_variableName','paleoData_notes']]
|   | paleoData_variableName | paleoData_notes |
|---|---|---|
| 0 | year | NaN |
| 1 | d18O | d18Osw (residuals calculated from coupled SrCa... |
| 2 | d18O | Duplicate of modern d18O record presented in C... |
| 3 | Sr/Ca | ; paleoData_variableName changed - was origina... |
| 4 | year | NaN |
In fact, one is the measurement on the coral and the other on seawater. Querying such a small dataset is hardly necessary; however, it becomes useful when looking at a collection of files, as shown in the next example (working with multiple datasets).
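For example, to isolate the seawater record programmatically, a simple string match on the notes field would do (an illustrative snippet; na=False guards against the NaN entries):
# keep only rows whose notes mention the seawater record (d18Osw)
df_sw = df[df['paleoData_notes'].str.contains('d18Osw', na=False)]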
To extract a specific row by its index (here row 4, which holds one of the year variables):
df_cut = df.iloc[4,:]
df_cut
mode paleoData
time_id age
archiveType Coral
geo_meanLon -162.13
geo_meanLat 5.87
...
paleoData_iso2kUI NaN
paleoData_interpretation NaN
paleoData_ocean2kID NaN
paleoData_inCompilation NaN
paleoData_pages2kID NaN
Name: 4, Length: 88, dtype: object
df_cut['paleoData_variableName']
'year'
This can be very useful when working with the Pyleoclim software, since a Pyleoclim.Series can be initialized from the information contained in df_cut. Working with PyLiPD and Pyleoclim is the subject of several tutorials.
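As a minimal sketch (assuming pyleoclim is installed; the keyword names follow the pyleoclim.Series signature), a Series for the Sr/Ca record could be built like so:
import pyleoclim as pyleo  # assumed available

row = df.iloc[3, :]  # the Sr/Ca record from the dataframe above
ts = pyleo.Series(time=row['year'], value=row['paleoData_values'],
                  time_name='year', time_unit=row['yearUnits'],
                  value_name=row['paleoData_variableName'],
                  value_unit=row['paleoData_units'])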
Working with such a large dataframe can be overwhelming and is often unnecessary. Therefore, PyLiPD has a nifty function called get_timeseries_essentials that grabs the essential information about each timeseries: the dataset name, its geographical location, the time/depth values, and the variable information, including archive and proxy:
df_essential = D.get_timeseries_essentials()
df_essential
|   | dataSetName | archiveType | geo_meanLat | geo_meanLon | geo_meanElev | paleoData_variableName | paleoData_values | paleoData_units | paleoData_proxy | paleoData_proxyGeneral | time_variableName | time_values | time_units | depth_variableName | depth_values | depth_units |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ocn-Palmyra.Nurhati.2011 | Coral | 5.87 | -162.13 | -10.0 | Sr_Ca | [8.96, 8.9, 8.91, 8.94, 8.92, 8.89, 8.87, 8.81... | mmol/mol | Sr/Ca | None | year | [1998.29, 1998.21, 1998.13, 1998.04, 1997.96, ... | yr AD | None | None | None |
| 1 | Ocn-Palmyra.Nurhati.2011 | Coral | 5.87 | -162.13 | -10.0 | d18O | [-5.41, -5.47, -5.49, -5.43, -5.48, -5.53, -5.... | permil | d18O | None | year | [1998.29, 1998.21, 1998.13, 1998.04, 1997.96, ... | yr AD | None | None | None |
| 2 | Ocn-Palmyra.Nurhati.2011 | Coral | 5.87 | -162.13 | -10.0 | d18O | [0.39, 0.35, 0.35, 0.35, 0.36, 0.22, 0.33, 0.3... | permil | d18O | None | year | [1998.21, 1998.13, 1998.04, 1997.96, 1997.88, ... | yr AD | None | None | None |
The metadata (i.e., the column names) available through this function will always remain the same and are as follows:
df_essential.columns
Index(['dataSetName', 'archiveType', 'geo_meanLat', 'geo_meanLon',
'geo_meanElev', 'paleoData_variableName', 'paleoData_values',
'paleoData_units', 'paleoData_proxy', 'paleoData_proxyGeneral',
'time_variableName', 'time_values', 'time_units', 'depth_variableName',
'depth_values', 'depth_units'],
dtype='object')
Working with multiple datasets#
path = '../data/Pages2k/'
D_dir = LiPD()
D_dir.load_from_dir(path)
Loading 16 LiPD files
0%| | 0/16 [00:00<?, ?it/s]
38%|███▊ | 6/16 [00:00<00:00, 50.40it/s]
75%|███████▌ | 12/16 [00:00<00:00, 39.97it/s]
100%|██████████| 16/16 [00:00<00:00, 43.03it/s]
Loaded..
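As an aside, if you only need a handful of files rather than a whole directory, load also accepts a list of paths (a sketch; these file names are hypothetical):
# load a subset of files instead of the whole directory (hypothetical file names)
D_sub = LiPD()
D_sub.load([path + 'Ocn-RedSea.Felis.2000.lpd',
            path + 'Eur-SpannagelCave.Mangini.2005.lpd'])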
Let’s expand these datasets into the essentials dataframe:
df_dir = D_dir.get_timeseries_essentials()
Let’s have a look at the dataframe:
df_dir.head()
|   | dataSetName | archiveType | geo_meanLat | geo_meanLon | geo_meanElev | paleoData_variableName | paleoData_values | paleoData_units | paleoData_proxy | paleoData_proxyGeneral | time_variableName | time_values | time_units | depth_variableName | depth_values | depth_units |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Eur-NorthernSpain.Martin-Chivelet.2011 | Speleothem | 42.90 | -3.50 | 1250.0 | d18O | [0.94, 0.8, 0.23, 0.17, 0.51, 0.36, 0.24, 0.4,... | permil | d18O | None | year | [2000, 1987, 1983, 1978, 1975, 1971, 1967, 196... | yr AD | None | None | None |
| 1 | Eur-NorthernScandinavia.Esper.2012 | Wood | 68.00 | 25.00 | 300.0 | MXD | [0.46, 1.305, 0.755, -0.1, -0.457, 1.62, 0.765... | None | maximum latewood density | None | year | [-138, -137, -136, -135, -134, -133, -132, -13... | yr AD | None | None | None |
| 2 | Eur-Stockholm.Leijonhufvud.2009 | Documents | 59.32 | 18.06 | 10.0 | temperature | [-1.7212, -1.6382, -0.6422, 0.1048, -0.7252, -... | degC | historical | None | year | [1502, 1503, 1504, 1505, 1506, 1507, 1508, 150... | yr AD | None | None | None |
| 3 | Eur-LakeSilvaplana.Trachsel.2010 | Lake sediment | 46.50 | 9.80 | 1791.0 | temperature | [0.181707222, 0.111082797, 0.001382129, -0.008... | degC | reflectance | None | year | [1175, 1176, 1177, 1178, 1179, 1180, 1181, 118... | yr AD | None | None | None |
| 4 | Eur-SpanishPyrenees.Dorado-Linan.2012 | Wood | 42.50 | 1.00 | 1200.0 | trsgi | [-1.612, -0.703, -0.36, -0.767, -0.601, -0.733... | None | ring width | None | year | [1260, 1261, 1262, 1263, 1264, 1265, 1266, 126... | yr AD | None | None | None |
The size of this dataframe is:
df_dir.shape
(25, 16)
So we expanded into 25 timeseries.
Let’s have a look at the available variables:
df_dir['paleoData_variableName'].unique()
array(['d18O', 'MXD', 'temperature', 'trsgi', 'Uk37', 'notes', 'Mg_Ca',
'uncertainty_temperature'], dtype=object)
Let’s assume we are only interested in the temperature data:
df_temp = df_dir[df_dir['paleoData_variableName']=='temperature']
df_temp.head()
|   | dataSetName | archiveType | geo_meanLat | geo_meanLon | geo_meanElev | paleoData_variableName | paleoData_values | paleoData_units | paleoData_proxy | paleoData_proxyGeneral | time_variableName | time_values | time_units | depth_variableName | depth_values | depth_units |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | Eur-Stockholm.Leijonhufvud.2009 | Documents | 59.3200 | 18.0600 | 10.0 | temperature | [-1.7212, -1.6382, -0.6422, 0.1048, -0.7252, -... | degC | historical | None | year | [1502, 1503, 1504, 1505, 1506, 1507, 1508, 150... | yr AD | None | None | None |
| 3 | Eur-LakeSilvaplana.Trachsel.2010 | Lake sediment | 46.5000 | 9.8000 | 1791.0 | temperature | [0.181707222, 0.111082797, 0.001382129, -0.008... | degC | reflectance | None | year | [1175, 1176, 1177, 1178, 1179, 1180, 1181, 118... | yr AD | None | None | None |
| 6 | Arc-Kongressvatnet.D'Andrea.2012 | Lake sediment | 78.0217 | 13.9311 | 94.0 | temperature | [5.9, 5.1, 6.1, 5.3, 4.3, 4.8, 3.8, 4.8, 4.3, ... | degC | alkenone | None | year | [2008, 2004, 2000, 1996, 1990, 1987, 1982, 197... | yr AD | None | None | None |
| 7 | Eur-CoastofPortugal.Abrantes.2011 | Marine sediment | 41.1000 | -8.9000 | -80.0 | temperature | [15.235, 15.329, 15.264, 15.376, 15.4, 15.129,... | degC | alkenone | None | year | [971.19, 982.672, 991.858, 1001.044, 1010.23, ... | yr AD | None | None | None |
| 12 | Ocn-FeniDrift.Richter.2009 | Marine sediment | 55.5000 | -13.9000 | -2543.0 | temperature | [12.94, 10.99, 10.53, 10.44, 11.39, 13.38, 10.... | degC | None | None | year | [1998, 1987, 1975, 1962, 1949, 1936, 1924, 191... | yr AD | depth_bottom | [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, ... | cm |
df_temp.shape
(11, 16)
which leaves us with 11 timeseries.
Let’s assume that you want every variable that is not a note or an uncertainty estimate. To keep only the rows that are relevant to our problem, you can use the DataFrame.query function available in Pandas:
df_filt = df_dir.query("paleoData_variableName in ('temperature','MXD','Mg_Ca','d18O','trsgi', 'Uk37')")
df_filt.head()
|   | dataSetName | archiveType | geo_meanLat | geo_meanLon | geo_meanElev | paleoData_variableName | paleoData_values | paleoData_units | paleoData_proxy | paleoData_proxyGeneral | time_variableName | time_values | time_units | depth_variableName | depth_values | depth_units |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Eur-NorthernSpain.Martin-Chivelet.2011 | Speleothem | 42.90 | -3.50 | 1250.0 | d18O | [0.94, 0.8, 0.23, 0.17, 0.51, 0.36, 0.24, 0.4,... | permil | d18O | None | year | [2000, 1987, 1983, 1978, 1975, 1971, 1967, 196... | yr AD | None | None | None |
| 1 | Eur-NorthernScandinavia.Esper.2012 | Wood | 68.00 | 25.00 | 300.0 | MXD | [0.46, 1.305, 0.755, -0.1, -0.457, 1.62, 0.765... | None | maximum latewood density | None | year | [-138, -137, -136, -135, -134, -133, -132, -13... | yr AD | None | None | None |
| 2 | Eur-Stockholm.Leijonhufvud.2009 | Documents | 59.32 | 18.06 | 10.0 | temperature | [-1.7212, -1.6382, -0.6422, 0.1048, -0.7252, -... | degC | historical | None | year | [1502, 1503, 1504, 1505, 1506, 1507, 1508, 150... | yr AD | None | None | None |
| 3 | Eur-LakeSilvaplana.Trachsel.2010 | Lake sediment | 46.50 | 9.80 | 1791.0 | temperature | [0.181707222, 0.111082797, 0.001382129, -0.008... | degC | reflectance | None | year | [1175, 1176, 1177, 1178, 1179, 1180, 1181, 118... | yr AD | None | None | None |
| 4 | Eur-SpanishPyrenees.Dorado-Linan.2012 | Wood | 42.50 | 1.00 | 1200.0 | trsgi | [-1.612, -0.703, -0.36, -0.767, -0.601, -0.733... | None | ring width | None | year | [1260, 1261, 1262, 1263, 1264, 1265, 1266, 126... | yr AD | None | None | None |
df_filt.shape
(22, 16)
which leaves us with 22 timeseries.
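Since query accepts arbitrary boolean expressions, you can also combine conditions, for instance (an illustrative filter on the same dataframe):
# keep only temperature records from lake sediment archives (illustrative)
df_lake = df_dir.query("paleoData_variableName == 'temperature' and archiveType == 'Lake sediment'")
df_lake.shape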
Removing and popping datasets out of a LiPD object#
You can also remove (i.e., delete the corresponding dataset from the LiPD object) or pop (i.e., delete the corresponding dataset from the LiPD object and return it) datasets from a LiPD object. Note that these functionalities behave similarly to the methods with the same names on Python lists. These functions underpin more advanced filtering and querying capabilities that we will discuss in later tutorials.
First, let’s make a copy of D_dir:
D_test = D_dir.copy()
print(D_test.get_all_dataset_names())
['Eur-NorthernSpain.Martin-Chivelet.2011', 'Eur-NorthernScandinavia.Esper.2012', 'Eur-Stockholm.Leijonhufvud.2009', 'Eur-LakeSilvaplana.Trachsel.2010', 'Eur-SpanishPyrenees.Dorado-Linan.2012', 'Arc-Kongressvatnet.D_Andrea.2012', 'Eur-CoastofPortugal.Abrantes.2011', 'Ocn-PedradeLume-CapeVerdeIslands.Moses.2006', 'Ocn-FeniDrift.Richter.2009', 'Ocn-SinaiPeninsula_RedSea.Moustafa.2000', 'Ant-WAIS-Divide.Severinghaus.2012', 'Asi-SourthAndMiddleUrals.Demezhko.2007', 'Ocn-AlboranSea436B.Nieto-Moreno.2013', 'Eur-SpannagelCave.Mangini.2005', 'Ocn-RedSea.Felis.2000', 'Eur-FinnishLakelands.Helama.2014']
And let’s remove Eur-Stockholm.Leijonhufvud.2009, which corresponds to the third entry in the list above:
D_test.remove('Eur-Stockholm.Leijonhufvud.2009')
print(D_test.get_all_dataset_names())
['Eur-NorthernSpain.Martin-Chivelet.2011', 'Eur-NorthernScandinavia.Esper.2012', 'Eur-LakeSilvaplana.Trachsel.2010', 'Eur-SpanishPyrenees.Dorado-Linan.2012', 'Arc-Kongressvatnet.D_Andrea.2012', 'Eur-CoastofPortugal.Abrantes.2011', 'Ocn-PedradeLume-CapeVerdeIslands.Moses.2006', 'Ocn-FeniDrift.Richter.2009', 'Ocn-SinaiPeninsula_RedSea.Moustafa.2000', 'Ant-WAIS-Divide.Severinghaus.2012', 'Asi-SourthAndMiddleUrals.Demezhko.2007', 'Ocn-AlboranSea436B.Nieto-Moreno.2013', 'Eur-SpannagelCave.Mangini.2005', 'Ocn-RedSea.Felis.2000', 'Eur-FinnishLakelands.Helama.2014']
Now let’s pop Eur-NorthernScandinavia.Esper.2012 from D_test:
d_eur = D_test.pop('Eur-NorthernScandinavia.Esper.2012')
Now let’s have a look at d_eur:
print(d_eur.get_all_dataset_names())
['Eur-NorthernScandinavia.Esper.2012']
It contains the dataset we are expecting. Let’s have a look at D_test:
print(D_test.get_all_dataset_names())
['Eur-NorthernSpain.Martin-Chivelet.2011', 'Eur-LakeSilvaplana.Trachsel.2010', 'Eur-SpanishPyrenees.Dorado-Linan.2012', 'Arc-Kongressvatnet.D_Andrea.2012', 'Eur-CoastofPortugal.Abrantes.2011', 'Ocn-PedradeLume-CapeVerdeIslands.Moses.2006', 'Ocn-FeniDrift.Richter.2009', 'Ocn-SinaiPeninsula_RedSea.Moustafa.2000', 'Ant-WAIS-Divide.Severinghaus.2012', 'Asi-SourthAndMiddleUrals.Demezhko.2007', 'Ocn-AlboranSea436B.Nieto-Moreno.2013', 'Eur-SpannagelCave.Mangini.2005', 'Ocn-RedSea.Felis.2000', 'Eur-FinnishLakelands.Helama.2014']
You can also remove/pop more than one dataset at a time:
rem = ['Ocn-RedSea.Felis.2000','Ant-WAIS-Divide.Severinghaus.2012']
D_test.remove(rem)
print(D_test.get_all_dataset_names())
['Eur-NorthernSpain.Martin-Chivelet.2011', 'Eur-LakeSilvaplana.Trachsel.2010', 'Eur-SpanishPyrenees.Dorado-Linan.2012', 'Arc-Kongressvatnet.D_Andrea.2012', 'Eur-CoastofPortugal.Abrantes.2011', 'Ocn-PedradeLume-CapeVerdeIslands.Moses.2006', 'Ocn-FeniDrift.Richter.2009', 'Ocn-SinaiPeninsula_RedSea.Moustafa.2000', 'Asi-SourthAndMiddleUrals.Demezhko.2007', 'Ocn-AlboranSea436B.Nieto-Moreno.2013', 'Eur-SpannagelCave.Mangini.2005', 'Eur-FinnishLakelands.Helama.2014']
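pop behaves the same way; passing a list should return the popped datasets bundled in a new LiPD object (a sketch, assuming pop accepts a list like remove does):
# pop two datasets at once; they come back as a new LiPD object
popped = D_test.pop(['Ocn-FeniDrift.Richter.2009', 'Eur-SpannagelCave.Mangini.2005'])
print(popped.get_all_dataset_names())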