Filtering through queries#

Authors#

Deborah Khider

Preamble#

PyLiPD is a Python package that allows you to read, manipulate, and write LiPD formatted datasets. In this tutorial, we will demonstrate how to use pre-defined filtering capabilities through APIs.

Note: These capabilities are under heavy development. If you are interested in other filtering capabilities to help with your science, please open an issue.

Goals#

  • Use existing APIs to filter datasets according to archive types, location, and variable names.

  • Understand the concept of a LiPDSeries object and how it differs from a LiPD object.

Reading Time: 5 minutes

Keywords#

LiPD

Pre-requisites#

Relevant Packages#

pylipd

Data Description#

This notebook uses the following datasets, in LiPD format:

  • Euro2k database: PAGES2k Consortium (2017), A global multiproxy database for temperature reconstructions of the Common Era. Sci Data 4, 170088. doi:10.1038/sdata.2017.88

Demonstration#

Filtering on the LiPD object#

Filtering by the type of archive#

Let’s start by importing our favorite package and loading our datasets.

from pylipd.lipd import LiPD

path = '../data/Pages2k/'

D = LiPD()
D.load_from_dir(path)
Loading 16 LiPD files
100%|██████████| 16/16 [00:00<00:00, 42.86it/s]
Loaded..

And let’s get a list of available datasets so we can compare the results as we filter using various criteria:

D.get_all_dataset_names()
['Eur-NorthernSpain.Martin-Chivelet.2011',
 'Eur-NorthernScandinavia.Esper.2012',
 'Eur-Stockholm.Leijonhufvud.2009',
 'Eur-LakeSilvaplana.Trachsel.2010',
 'Eur-SpanishPyrenees.Dorado-Linan.2012',
 'Arc-Kongressvatnet.D_Andrea.2012',
 'Eur-CoastofPortugal.Abrantes.2011',
 'Ocn-PedradeLume-CapeVerdeIslands.Moses.2006',
 'Ocn-FeniDrift.Richter.2009',
 'Ocn-SinaiPeninsula_RedSea.Moustafa.2000',
 'Ant-WAIS-Divide.Severinghaus.2012',
 'Asi-SourthAndMiddleUrals.Demezhko.2007',
 'Ocn-AlboranSea436B.Nieto-Moreno.2013',
 'Eur-SpannagelCave.Mangini.2005',
 'Ocn-RedSea.Felis.2000',
 'Eur-FinnishLakelands.Helama.2014']

Now let’s look at the available types of archives for which we can filter:

D.get_all_archiveTypes()
['Speleothem',
 'Wood',
 'Documents',
 'Lake sediment',
 'Marine sediment',
 'Coral',
 'Borehole']

Let’s first filter for speleothem records:

D_speleothem = D.filter_by_archive_type('speleothem')

D_speleothem.get_all_dataset_names()
['Eur-NorthernSpain.Martin-Chivelet.2011', 'Eur-SpannagelCave.Mangini.2005']

The function uses regular expressions to do partial matches. Therefore, you can look for all the datasets based on marine sediments using:

D_marine = D.filter_by_archive_type('marine')

D_marine.get_all_dataset_names()
['Ocn-FeniDrift.Richter.2009',
 'Ocn-AlboranSea436B.Nieto-Moreno.2013',
 'Eur-CoastofPortugal.Abrantes.2011']

However, looking for sediment will return records based on both marine and lake sediments:

D_sediment = D.filter_by_archive_type('sediment')

D_sediment.get_all_dataset_names()
['Arc-Kongressvatnet.D_Andrea.2012',
 'Ocn-FeniDrift.Richter.2009',
 'Ocn-AlboranSea436B.Nieto-Moreno.2013',
 'Eur-LakeSilvaplana.Trachsel.2010',
 'Eur-CoastofPortugal.Abrantes.2011']
To match several archive types at once, you can use regular expression alternation:

D_marinespeleo = D.filter_by_archive_type('(marine|speleothem)')

D_marinespeleo.get_all_dataset_names()
['Ocn-FeniDrift.Richter.2009',
 'Ocn-AlboranSea436B.Nieto-Moreno.2013',
 'Eur-NorthernSpain.Martin-Chivelet.2011',
 'Eur-CoastofPortugal.Abrantes.2011',
 'Eur-SpannagelCave.Mangini.2005']

If you are not comfortable with regular expressions and constructing patterns as we have done above, you can always merge two LiPD objects as demonstrated previously. The filtering above can therefore also be achieved by combining our earlier objects:

D_merged = D_marine.merge(D_speleothem)

D_merged.get_all_dataset_names()
['Ocn-FeniDrift.Richter.2009',
 'Ocn-AlboranSea436B.Nieto-Moreno.2013',
 'Eur-CoastofPortugal.Abrantes.2011',
 'Eur-NorthernSpain.Martin-Chivelet.2011',
 'Eur-SpannagelCave.Mangini.2005']

Filtering by location#

Let’s filter using a bounding box between 40-70N and 0-90E:

D_geo = D.filter_by_geo_bbox(lonMin=0, latMin=40, lonMax=90, latMax=70)

D_geo.get_all_dataset_names()
['Asi-SourthAndMiddleUrals.Demezhko.2007',
 'Eur-NorthernScandinavia.Esper.2012',
 'Eur-SpanishPyrenees.Dorado-Linan.2012',
 'Eur-LakeSilvaplana.Trachsel.2010',
 'Eur-Stockholm.Leijonhufvud.2009',
 'Eur-SpannagelCave.Mangini.2005',
 'Eur-FinnishLakelands.Helama.2014']

Because PyLiPD uses object-oriented programming, we can take advantage of method cascading to create a series of filters. For instance, let’s assume that we want all the datasets between 40-70N and 0-90E that come from tree archives. One can write an additional query on top of the location query that we just performed:

D_geotree = D_geo.filter_by_archive_type('wood')

D_geotree.get_all_dataset_names()
['Eur-NorthernScandinavia.Esper.2012',
 'Eur-SpanishPyrenees.Dorado-Linan.2012',
 'Eur-FinnishLakelands.Helama.2014']

Using method cascading, you can write this in a single query:

D_filt = D.filter_by_geo_bbox(lonMin=0, latMin=40, lonMax=90, latMax=70).filter_by_archive_type('wood')

D_filt.get_all_dataset_names()
['Eur-NorthernScandinavia.Esper.2012',
 'Eur-SpanishPyrenees.Dorado-Linan.2012',
 'Eur-FinnishLakelands.Helama.2014']

Filtering by time interval#

The .filter_by_time method allows you to filter for a specific time interval and, optionally, a minimum record length within that interval. The function takes three arguments:

  • timeBound (mandatory): A list of the minimum and maximum time to consider

  • timeBoundType (optional): How the record must overlap the interval, following the definitions used by the World Data Service for Paleoclimatology (e.g., overlapping any part of the interval versus being entirely contained within it). Default is any.

  • recordLength (optional): The minimum length of the record within the time interval.

Warning

As of August 2024, the function does not reason over time representation (i.e., BP vs CE) nor units (e.g., yr vs kyr). Make sure that all the LiPD datasets in your object have the same time representation and units before using this function.
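One way to check this is to inspect the units reported for the time axis across your datasets. Here is a minimal sketch, assuming the essentials DataFrame exposes a time_units column (column names may vary slightly between PyLiPD versions):

df_check = D.get_timeseries_essentials()

print(df_check['time_units'].unique())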

Let’s filter over the period [500,1800], using default settings:

D_filt_time1 = D.filter_by_time([500,1800])

num_records = len(D_filt_time1.get_all_dataset_names())
print(f"There are {num_records} records matching this query.")
There are 13 records matching this query.

Let’s compare this to the number of records we would get if we set timeBoundType to entirely:

D_filt_time2 = D.filter_by_time([500,1800], timeBoundType = 'entirely')

num_records = len(D_filt_time2.get_all_dataset_names())
print(f"There are {num_records} records matching this query.")
There are 5 records matching this query.

Far fewer records match this search criterion, which is not surprising since only records fully covering the 500-1800 period are considered. Note that when entirely is selected, the recordLength parameter has no effect.
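Although we have not used it here, the recordLength parameter can be combined with the options above. As a minimal sketch, the following asks for records with at least 1,000 years of data within the [500,1800] interval (output not shown, as the count depends on the datasets loaded):

D_filt_time3 = D.filter_by_time([500,1800], recordLength = 1000)

print(D_filt_time3.get_all_dataset_names())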

Filtering on the LiPDSeries object#

Filtering by variable names#

Most of the time, we are interested in working with specific variables in a LiPD file. Using the filtering capabilities at this level requires expanding the LiPD object into a LiPDSeries object, which contains the individual variables. You can do so with the following function, applied to the filtered database of tree records from high northern latitudes:

ts = D_filt.to_lipd_series()
Creating LiPD Series...
- Extracting dataset subgraphs
100%|██████████| 3/3 [00:00<00:00, 379.93it/s]

- Extracting variable subgraphs
100%|██████████| 3/3 [00:00<00:00, 21.15it/s]
Done..

Let’s have a look at what we have by placing essential metadata into a Pandas DataFrame. This function is the equivalent of LiPD.get_timeseries_essentials, with slight differences:

df = ts.get_timeseries_essentials()

df.head()
dataSetName archiveType name TSID values units proxy
0 Eur-NorthernScandinavia.Esper.2012 Wood year PYTECO66XAD [-138, -137, -136, -135, -134, -133, -132, -13... yr AD None
1 Eur-NorthernScandinavia.Esper.2012 Wood MXD Eur_014 [0.46, 1.305, 0.755, -0.1, -0.457, 1.62, 0.765... None None
2 Eur-FinnishLakelands.Helama.2014 Wood year PYTUSB62S0A [2000, 1999, 1998, 1997, 1996, 1995, 1994, 199... yr AD None
3 Eur-FinnishLakelands.Helama.2014 Wood temperature Eur_005 [14.603, 14.643, 12.074, 13.898, 13.671, 13.41... degC None
4 Eur-SpanishPyrenees.Dorado-Linan.2012 Wood year PYT2K8MIA3N [1260, 1261, 1262, 1263, 1264, 1265, 1266, 126... yr AD None

As you can see, the DataFrame contains much of the same information, except for location. This is because that information is stored at the Dataset level in LiPD and is, therefore, attached to the LiPD object. But you can merge the data using Pandas’ capabilities. First, let’s get the location information:

df_loc = D_filt.get_all_locations()

df_loc.head()
dataSetName geo_meanLat geo_meanLon geo_meanElev
0 Eur-NorthernScandinavia.Esper.2012 68.0 25.000 300.0
1 Eur-SpanishPyrenees.Dorado-Linan.2012 42.5 1.000 1200.0
2 Eur-FinnishLakelands.Helama.2014 62.0 28.325 130.0

We can use the .merge functionality in Pandas to achieve our goal. This functionality is very similar to a join in SQL. The main arguments are how, which defines how the join is performed, and on, which specifies the column to join on:

how: {‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’}, default ‘inner’

  • left: use only keys from left frame, similar to a SQL left outer join; preserve key order.

  • right: use only keys from right frame, similar to a SQL right outer join; preserve key order.

  • outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.

  • inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.

  • cross: creates the cartesian product from both frames, preserves the order of the left keys.

import pandas as pd

df_merged = df.merge(df_loc, how='inner', on='dataSetName')

df_merged.head()
dataSetName archiveType name TSID values units proxy geo_meanLat geo_meanLon geo_meanElev
0 Eur-NorthernScandinavia.Esper.2012 Wood year PYTECO66XAD [-138, -137, -136, -135, -134, -133, -132, -13... yr AD None 68.0 25.000 300.0
1 Eur-NorthernScandinavia.Esper.2012 Wood MXD Eur_014 [0.46, 1.305, 0.755, -0.1, -0.457, 1.62, 0.765... None None 68.0 25.000 300.0
2 Eur-FinnishLakelands.Helama.2014 Wood year PYTUSB62S0A [2000, 1999, 1998, 1997, 1996, 1995, 1994, 199... yr AD None 62.0 28.325 130.0
3 Eur-FinnishLakelands.Helama.2014 Wood temperature Eur_005 [14.603, 14.643, 12.074, 13.898, 13.671, 13.41... degC None 62.0 28.325 130.0
4 Eur-SpanishPyrenees.Dorado-Linan.2012 Wood year PYT2K8MIA3N [1260, 1261, 1262, 1263, 1264, 1265, 1266, 126... yr AD None 42.5 1.000 1200.0

The second difference is that the variables representing time are only present as rows, which means that time will need to be associated with the variables. To do so, you can use the filtering capabilities. First, let’s filter for temperature and trsgi (tree-ring standardized growth index):

ts_filter = ts.filter_by_name('(temperature|trsgi)')

df_paleo = ts_filter.get_timeseries_essentials()
df_paleo.head()
dataSetName archiveType name TSID values units proxy
0 Eur-SpanishPyrenees.Dorado-Linan.2012 Wood trsgi Eur_020 [-1.612, -0.703, -0.36, -0.767, -0.601, -0.733... None None
1 Eur-FinnishLakelands.Helama.2014 Wood temperature Eur_005 [14.603, 14.643, 12.074, 13.898, 13.671, 13.41... degC None

Our next task is to get all the information regarding time. Let’s use method chaining to get the DataFrame directly:

df_time = ts.filter_by_name('year').get_timeseries_essentials()

df_time.head()
dataSetName archiveType name TSID values units proxy
0 Eur-FinnishLakelands.Helama.2014 Wood year PYTUSB62S0A [2000, 1999, 1998, 1997, 1996, 1995, 1994, 199... yr AD None
1 Eur-SpanishPyrenees.Dorado-Linan.2012 Wood year PYT2K8MIA3N [1260, 1261, 1262, 1263, 1264, 1265, 1266, 126... yr AD None
2 Eur-NorthernScandinavia.Esper.2012 Wood year PYTECO66XAD [-138, -137, -136, -135, -134, -133, -132, -13... yr AD None

Now, we can use Pandas to merge the two DataFrames. In this case, we would expect the DataFrame containing the time information to cover more datasets (since not all datasets have a temperature or trsgi variable). Therefore, we want to perform a left join, dropping the unnecessary age information:

df_filt_merge=df_paleo.merge(df_time,how='left',on='dataSetName')

df_filt_merge
dataSetName archiveType_x name_x TSID_x values_x units_x proxy_x archiveType_y name_y TSID_y values_y units_y proxy_y
0 Eur-SpanishPyrenees.Dorado-Linan.2012 Wood trsgi Eur_020 [-1.612, -0.703, -0.36, -0.767, -0.601, -0.733... None None Wood year PYT2K8MIA3N [1260, 1261, 1262, 1263, 1264, 1265, 1266, 126... yr AD None
1 Eur-FinnishLakelands.Helama.2014 Wood temperature Eur_005 [14.603, 14.643, 12.074, 13.898, 13.671, 13.41... degC None Wood year PYTUSB62S0A [2000, 1999, 1998, 1997, 1996, 1995, 1994, 199... yr AD None

Notice that Pandas automatically renamed the columns (appending the _x and _y suffixes) to accommodate the fact that both our DataFrames had the same column names. Now let’s add the location information with another left join:

df_filt_merge_loc=df_filt_merge.merge(df_loc,how='left',on='dataSetName')

df_filt_merge_loc.head()
dataSetName archiveType_x name_x TSID_x values_x units_x proxy_x archiveType_y name_y TSID_y values_y units_y proxy_y geo_meanLat geo_meanLon geo_meanElev
0 Eur-SpanishPyrenees.Dorado-Linan.2012 Wood trsgi Eur_020 [-1.612, -0.703, -0.36, -0.767, -0.601, -0.733... None None Wood year PYT2K8MIA3N [1260, 1261, 1262, 1263, 1264, 1265, 1266, 126... yr AD None 42.5 1.000 1200.0
1 Eur-FinnishLakelands.Helama.2014 Wood temperature Eur_005 [14.603, 14.643, 12.074, 13.898, 13.671, 13.41... degC None Wood year PYTUSB62S0A [2000, 1999, 1998, 1997, 1996, 1995, 1994, 199... yr AD None 62.0 28.325 130.0
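With time and values now on the same row, each series can be used directly. As a minimal sketch (relying on the _x/_y suffixes created by the merge above), here is how you could pair the time and proxy value arrays for the first record:

row = df_filt_merge_loc.iloc[0]

time = row['values_y']    # the year values
values = row['values_x']  # the proxy values (here, trsgi)

print(len(time), len(values))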

Filtering by proxy#

LiPD files have a proxy field that records the proxy used for the measurements or inference. On the LiPDVerse, this field is standardized.

To get a list of all proxies available in the LiPDSeries object, use the .get_all_proxy method:

ts_all = D.to_lipd_series()
ts_all.get_all_proxy()
Creating LiPD Series...
- Extracting dataset subgraphs
100%|██████████| 16/16 [00:00<00:00, 308.82it/s]

- Extracting variable subgraphs
100%|██████████| 16/16 [00:00<00:00, 16.96it/s]
Done..

['borehole',
 'alkenone',
 'Mg/Ca',
 'maximum latewood density',
 'd18O',
 'ring width',
 'reflectance',
 'historical']

Let’s filter for the ring width proxy:

ts_filt_proxy = ts_all.filter_by_proxy('ring width')

ts_filt_proxy.get_all_proxy()
['ring width']

Filtering by resolution#

When a LiPD file is created, the resolution is automatically calculated for each variable. This can be used to filter the records by the mean, median, minimum, or maximum resolution using the .filter_by_resolution method; the stats parameter selects which statistic is used.

print(f"There are {len(ts_all.get_timeseries_essentials())} variables in this LiPDSeries object.")
There are 41 variables in this LiPDSeries object.

Let’s filter for a mean resolution of at most 10 years (i.e., keep variables whose mean resolution is 10 years or finer):

ts_filt_meanres = ts_all.filter_by_resolution(10)

print(f"There are {len(ts_filt_meanres.get_timeseries_essentials())} variables in this LiPDSeries object.")
There are 25 variables in this LiPDSeries object.

Let’s redo this query with a median resolution of 10 years:

ts_filt_medres = ts_all.filter_by_resolution(10, stats = 'Median')

print(f"There are {len(ts_filt_medres.get_timeseries_essentials())} variables in this LiPDSeries object.")
There are 25 variables in this LiPDSeries object.

In this case, the choice of mean or median made no difference.
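For irregularly sampled records, however, a few large gaps can pull the mean well away from the median, so the two statistics can select different sets of variables. As a minimal sketch, you can compare the two results directly using the TSID column of the essentials DataFrame (here the result should be an empty set):

tsids_mean = set(ts_filt_meanres.get_timeseries_essentials()['TSID'])
tsids_median = set(ts_filt_medres.get_timeseries_essentials()['TSID'])

print(tsids_mean.symmetric_difference(tsids_median))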