Custom queries on LiPD objects#

Authors#

Deborah Khider

Preamble#

PyLiPD is a Python package that allows you to read, manipulate, and write LiPD formatted datasets. While PyLiPD comes pre-loaded with certain functions to filter datasets and extract information (see other tutorials), in some cases one might want to design more specific queries. In this tutorial, we will demonstrate how to generate such custom queries using SPARQL.

Goals#

  • Learn how to build SPARQL queries

Reading Time: 15 minutes

Keywords#

LiPD, SPARQL

Pre-requisites#

Relevant Packages#

pylipd

Data Description#

This notebook uses the following datasets, in LiPD format:

  • McCabe-Glynn, S., Johnson, K., Strong, C. et al. Variable North Pacific influence on drought in southwestern North America since AD 854. Nature Geosci 6, 617–621 (2013). https://doi.org/10.1038/ngeo1862

  • Lawrence, K. T., Liu, Z. H., & Herbert, T. D. (2006). Evolution of the eastern tropical Pacific through Plio-Pleistocene glaciation. Science, 312(5770), 79-83.

  • PAGES2k Consortium., Emile-Geay, J., McKay, N. et al. A global multiproxy database for temperature reconstructions of the Common Era. Sci Data 4, 170088 (2017). doi:10.1038/sdata.2017.88

Demonstration#

Understanding the LiPD object#

Let’s start by importing our favorite package and loading our datasets.

from pylipd.lipd import LiPD
path = '../data/Pages2k/'

D = LiPD()
D.load_from_dir(path)
Loading 16 LiPD files
100%|██████████| 16/16 [00:00<00:00, 42.46it/s]
Loaded..

By now, you should be familiar with our load functionalities. But to truly appreciate this tutorial, and why everything behaves the way it does, you need to understand a little bit about what’s happening under the hood.

  • The first step is to expand the dataset stored in LiPD format. Remember that LiPD essentially consists of data tables stored in csv format and a JSON-LD file that contains the metadata.

  • The second step is to map the JSON file to RDF using our ontology as the schema.

  • The third (and unique) step is to load the data contained in the csv files directly into the graph. This is quite unique to LinkedEarth: if you work with other knowledge bases, you’ll find that most of the time the actual data are stored in another format that you will need to learn to handle in conjunction with the metadata.

Also, each dataset gets its own graph! So you can think of the LiPD object as a collection of graphs, each of which represents a particular dataset.

Let’s have a close look at our object, shall we?

D.copy().__dict__
{'graph': <Graph identifier=N7e673dc520984009819be5b1e4bed605 (<class 'rdflib.graph.ConjunctiveGraph'>)>}

In particular, our graph object is a ConjunctiveGraph from the rdflib library. Why should you care?

  1. If you isolate the graph, then you can use the rdflib package directly. Since this is a well-maintained community package with a larger user base than PyLiPD, chances are that Stack Overflow or the rdflib documentation will provide you with answers on how to work with your graph.

  2. Why a ConjunctiveGraph? Although each dataset is stored in its own graph, we often want to query across datasets (i.e., across graphs), and this particular rdflib object allows us to do so, as the sketch below shows.
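Because each dataset sits in its own named graph, nothing stops you from grabbing the graph and working with rdflib directly. Below is a minimal sketch, assuming the graph attribute shown in the __dict__ output further down (the loop variables are just illustrative names):

g = D.copy().graph  # the underlying rdflib ConjunctiveGraph

# Each dataset lives in its own named graph; list their identifiers
for ctx in g.contexts():
    print(ctx.identifier)

# Query with rdflib directly, bypassing PyLiPD's wrapper
res = g.query("""
PREFIX le:<http://linked.earth/ontology#>
SELECT ?dsname WHERE { ?ds a le:Dataset . ?ds le:hasName ?dsname . }
""")
for row in res:
    print(row.dsname)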

Let’s try to create our own queries from this.

Constructing a query#

Let’s return to the query described in this primer, in which we try to gather all the dataset names. Let’s modify it to also return the variable ?ds so you can see the difference between the object and its name:

query ="""

PREFIX le:<http://linked.earth/ontology#>

SELECT ?ds ?dsname WHERE{

?ds a le:Dataset .
?ds le:hasName ?dsname .}

"""

Notice that this is just a string containing the query in the SPARQL language. Now, let’s run it:

res_sparql, res_df = D.query(query)

Which returns two objects:

  • res_sparql, which is a SPARQLResult object from rdflib

  • res_df, which converts the SPARQLResult into a pandas DataFrame, which is often what you will be most interested in. Let’s have a look at it:

res_df.head()
ds dsname
0 http://linked.earth/lipd/Eur-NorthernSpain.Mar... Eur-NorthernSpain.Martin-Chivelet.2011
1 http://linked.earth/lipd/Eur-NorthernScandinav... Eur-NorthernScandinavia.Esper.2012
2 http://linked.earth/lipd/Eur-Stockholm.Leijonh... Eur-Stockholm.Leijonhufvud.2009
3 http://linked.earth/lipd/Eur-LakeSilvaplana.Tr... Eur-LakeSilvaplana.Trachsel.2010
4 http://linked.earth/lipd/Eur-SpanishPyrenees.D... Eur-SpanishPyrenees.Dorado-Linan.2012

First, notice that the columns of the dataframe correspond to the names of the variables that you specified in the SPARQL query. You can choose any name you want, as long as you use a question mark at the beginning to indicate to SPARQL that it is a variable whose value the database should fill in. Second, notice the difference between ds and dsname. ds is the dataset object, and the query returns the URI of that object. dsname is a string containing the name of the dataset. In general, we form our URIs from the names of the objects, so don’t be surprised if you sometimes end up with the URI instead of the name (we all sometimes forget to ask for the name).
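Since the URIs are formed from the dataset names, you can always recover a readable label from ?ds with a little pandas string handling. A convenience sketch (ds_label is just an illustrative column name):

# Strip the URI down to its last path segment to get a readable label
res_df['ds_label'] = res_df['ds'].astype(str).str.split('/').str[-1]
res_df[['ds_label', 'dsname']].head()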

One thing to consider is the speed of the query. If you’re doing exploratory work with SPARQL, you may wish to return only a few results to check that you’re on the right track. In this case, the number of datasets is quite small:

len(res_df.index)
16

But if you have thousands of rows in that dataframe, you might not want to wait for the query to complete, as this may take a few minutes. If you only want to return the first 10 rows, just add a LIMIT clause to the query:

query ="""

PREFIX le:<http://linked.earth/ontology#>

SELECT ?ds ?dsname WHERE{

?ds a le:Dataset .
?ds le:hasName ?dsname .}

LIMIT 10

"""

res, res_df = D.query(query)

len(res_df.index)
10
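SPARQL also supports OFFSET alongside LIMIT, which lets you page through a large result set in chunks. A sketch (strictly speaking, pagination is only well-defined if you also add an ORDER BY):

query ="""

PREFIX le:<http://linked.earth/ontology#>

SELECT ?ds ?dsname WHERE{

?ds a le:Dataset .
?ds le:hasName ?dsname .}

LIMIT 10
OFFSET 10

"""

res, res_df = D.query(query)  # skips the first 10 rows; only 6 of our 16 remain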

Please note that the number of rows doesn’t necessarily correspond to the number of datasets. For instance, if a dataset contains several variables (as most do) and I ask for each variable in a dataset, then each variable gets its own row.
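When in doubt, count distinct dataset URIs rather than rows. A one-line pandas check on the res_df from above:

# Rows are not datasets: count distinct dataset URIs instead
res_df['ds'].nunique()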

Optional variables#

Let’s pop one of the PAGES2k datasets into a new LiPD object, D2, and add the ODP846 dataset for the purpose of this demonstration. Why ODP846? The file was created before LiPD had datasetIDs, so I know this dataset doesn’t have one.

D.get_all_dataset_names()
['Eur-NorthernSpain.Martin-Chivelet.2011',
 'Eur-NorthernScandinavia.Esper.2012',
 'Eur-Stockholm.Leijonhufvud.2009',
 'Eur-LakeSilvaplana.Trachsel.2010',
 'Eur-SpanishPyrenees.Dorado-Linan.2012',
 'Arc-Kongressvatnet.D_Andrea.2012',
 'Eur-CoastofPortugal.Abrantes.2011',
 'Ocn-PedradeLume-CapeVerdeIslands.Moses.2006',
 'Ocn-FeniDrift.Richter.2009',
 'Ocn-SinaiPeninsula_RedSea.Moustafa.2000',
 'Ant-WAIS-Divide.Severinghaus.2012',
 'Asi-SourthAndMiddleUrals.Demezhko.2007',
 'Ocn-AlboranSea436B.Nieto-Moreno.2013',
 'Eur-SpannagelCave.Mangini.2005',
 'Ocn-RedSea.Felis.2000',
 'Eur-FinnishLakelands.Helama.2014']
D2 = D.pop('Ocn-RedSea.Felis.2000')
D2.load('../data/ODP846.Lawrence.2006.lpd')
D2.get_all_dataset_names()
Loading 1 LiPD files
100%|██████████| 1/1 [00:00<00:00,  1.12it/s]
Loaded..

['Ocn-RedSea.Felis.2000', 'ODP846.Lawrence.2006']

Let’s perform a similar query, but asking for the datasetID in addition to returning the dataset name:

query ="""

PREFIX le:<http://linked.earth/ontology#>

SELECT ?ds ?dsname ?datasetID WHERE{

?ds a le:Dataset .
?ds le:hasName ?dsname .
?ds le:hasDatasetId ?datasetID}

"""

res, res_df = D2.query(query)

res_df
ds dsname datasetID
0 http://linked.earth/lipd/Ocn-RedSea.Felis.2000 Ocn-RedSea.Felis.2000 4fZQAHmeuJn8ipLfurWv

As you can see, only the PAGES2k dataset was returned by this query. The ODP846 dataset was omitted because it doesn’t have a datasetID. If you want to make this pattern optional, all you need to do is wrap it in an OPTIONAL clause:

query ="""

PREFIX le:<http://linked.earth/ontology#>

SELECT ?ds ?dsname ?datasetID WHERE{

?ds a le:Dataset .
?ds le:hasName ?dsname .
OPTIONAL{?ds le:hasDatasetId ?datasetID.}}

"""

res, res_df = D2.query(query)

res_df
ds dsname datasetID
0 http://linked.earth/lipd/Ocn-RedSea.Felis.2000 Ocn-RedSea.Felis.2000 4fZQAHmeuJn8ipLfurWv
1 http://linked.earth/lipd/ODP846.Lawrence.2006 ODP846.Lawrence.2006 None

In this case, you at least get the dataset name. When to use OPTIONAL is really up to you. If the value of a specific property is critical to your analysis, leaving the pattern required filters out the datasets that lack it; if the piece of information is not critical, use OPTIONAL so those datasets are still returned.
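If you went the OPTIONAL route and want to separate the two cases after the fact, pandas makes that easy. A sketch (with_id and without_id are just illustrative names):

# Split the results on whether the OPTIONAL value was found
with_id = res_df.dropna(subset=['datasetID'])
without_id = res_df[res_df['datasetID'].isna()]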

Querying for predicates#

Constructing these queries begs the question: do I need to know the ontology? To some extent, yes. You would need to know the schema of any database to be able to query it. However, you do not need to know it in excruciating detail.

First, as you may have surmised, most of the PyLiPD functionalities designed to return information about the datasets are essentially wrappers around these queries. So if you’re interested, you can actually have a look at queries.py within the PyLiPD package to get some inspiration.

Second, you can query all the predicates attached to an object directly. Let’s do so, shall we?

query ="""

PREFIX le:<http://linked.earth/ontology#>

SELECT DISTINCT ?p WHERE {

?ds a le:Dataset .
?ds ?p ?o .
}
"""

Let’s break that query down. The SELECT line contains the DISTINCT directive, which returns only unique predicates. The last triple pattern, ?ds ?p ?o, makes the predicate itself a variable that we are querying.

_,df = D2.query(query)

df
p
0 http://linked.earth/ontology#hasPaleoData
1 http://linked.earth/ontology#inCompilation3_
2 http://linked.earth/ontology#minYear
3 http://linked.earth/ontology#createdBy
4 http://linked.earth/ontology#hasPublication
5 http://linked.earth/ontology#lipdVersion
6 http://linked.earth/ontology#googleMetadataWor...
7 http://linked.earth/ontology#hasSpreadsheetLink
8 http://linked.earth/ontology#hasName
9 http://linked.earth/ontology#studyName
10 http://linked.earth/ontology#googleDataURL
11 http://linked.earth/ontology#hasContributor
12 http://linked.earth/ontology#hasArchiveType
13 http://linked.earth/ontology#hasLocation
14 http://linked.earth/ontology#inCompilation1_
15 http://linked.earth/ontology#hasDatasetId
16 http://linked.earth/ontology#hasOriginalDataUrl
17 http://linked.earth/ontology#hasChangeLog
18 http://linked.earth/ontology#hasUrl
19 http://www.w3.org/1999/02/22-rdf-syntax-ns#type
20 http://linked.earth/ontology#maxYear
21 http://linked.earth/ontology#inCompilation2_
22 http://linked.earth/ontology#hasInvestigator
23 http://linked.earth/ontology#lipdComplete
24 http://linked.earth/ontology#hasChronData
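You can push this discovery trick one hop further down the graph. For instance, here is a sketch that lists the predicates attached to publication objects, reusing le:hasPublication from the table above (the next section applies the same pattern to le:hasLocation):

query ="""

PREFIX le:<http://linked.earth/ontology#>

SELECT DISTINCT ?p WHERE {

?ds a le:Dataset .
?ds le:hasPublication ?pub .
?pub ?p ?o .
}
"""

_,df = D2.query(query)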

Geoquery#

Let’s try a geospatial query on our original database. From the list of properties above, hasLocation links to the coordinates. So let’s have a look at this object:

query ="""

PREFIX le:<http://linked.earth/ontology#>

SELECT DISTINCT ?p WHERE {

?ds a le:Dataset .
?ds le:hasLocation ?geo .
?geo ?p ?o .
}
"""
_,df = D.query(query)

df
p
0 http://www.w3.org/2003/01/geo/wgs84_pos#lat
1 http://linked.earth/ontology#region
2 http://linked.earth/ontology#pages2kRegion
3 http://linked.earth/ontology#coordinatesFor
4 http://www.w3.org/1999/02/22-rdf-syntax-ns#type
5 http://www.w3.org/2003/01/geo/wgs84_pos#long
6 http://linked.earth/ontology#coordinates
7 http://www.w3.org/2003/01/geo/wgs84_pos#alt
8 http://linked.earth/ontology#hasSiteName
9 http://linked.earth/ontology#hasOcean

From these properties, lat, long, and alt look promising. Notice that the URI is different: these are not predicates from the ontology created by LinkedEarth, but from another ontology with a different base URI, so we should declare it with its own PREFIX.

query ="""

PREFIX le:<http://linked.earth/ontology#>
PREFIX wgs84: <http://www.w3.org/2003/01/geo/wgs84_pos#>

SELECT ?dsname ?lat ?lon  WHERE {

?ds a le:Dataset .
?ds le:hasName ?dsname .

?ds le:hasLocation ?geo .
OPTIONAL{?geo wgs84:lat ?lat .
        ?geo wgs84:long ?lon .}

}
"""

Let’s have a look at our new query:

  • Notice the new prefix since we are reusing the wgs84 ontology.

  • The SELECT line defines the variables that we wish to return in our query, namely the dataset name, latitude, and longitude.

  • The last block concerns our geographical query. Note that these lines are made OPTIONAL in case a file doesn’t contain geographical coordinates. This is unlikely to be the case, but it is a quick way to surface errors in our database or in your LiPD files.

_,df = D.query(query)

df
dsname lat lon
0 Eur-NorthernSpain.Martin-Chivelet.2011 42.9000 -3.5000
1 Eur-NorthernScandinavia.Esper.2012 68.0000 25.0000
2 Eur-Stockholm.Leijonhufvud.2009 59.3200 18.0600
3 Eur-LakeSilvaplana.Trachsel.2010 46.5000 9.8000
4 Eur-SpanishPyrenees.Dorado-Linan.2012 42.5000 1.0000
5 Arc-Kongressvatnet.D'Andrea.2012 78.0217 13.9311
6 Eur-CoastofPortugal.Abrantes.2011 41.1000 -8.9000
7 Ocn-PedradeLume-CapeVerdeIslands.Moses.2006 16.7600 -22.8883
8 Ocn-FeniDrift.Richter.2009 55.5000 -13.9000
9 Ocn-SinaiPeninsula,RedSea.Moustafa.2000 27.8483 34.3100
10 Ant-WAIS-Divide.Severinghaus.2012 -79.4630 -112.1250
11 Asi-SourthAndMiddleUrals.Demezhko.2007 55.0000 59.5000
12 Ocn-AlboranSea436B.Nieto-Moreno.2013 36.2053 -4.3133
13 Eur-SpannagelCave.Mangini.2005 47.1000 11.6000
14 Eur-FinnishLakelands.Helama.2014 62.0000 28.3250

Let’s take this query a little bit further: let’s say I only want to keep the data between 40°N and 70°N and between 0°E and 30°E. Welcome to the FILTER functionality.

Your query is going to look much the same with the added FILTER:

query ="""

PREFIX le:<http://linked.earth/ontology#>
PREFIX wgs84: <http://www.w3.org/2003/01/geo/wgs84_pos#>

SELECT ?dsname ?lat ?lon  WHERE {

?ds a le:Dataset .
?ds le:hasName ?dsname .

?ds le:hasLocation ?geo .
?geo wgs84:lat ?lat .
?geo wgs84:long ?lon .

FILTER ( ?lat >= 40 && ?lat < 70 && ?lon >= 0 && ?lon < 30 ) .

}
"""

I also removed the OPTIONAL clause since we are doing a geospatial query: if a dataset doesn’t have lat/lon, then it shouldn’t be included in the results anyway.

_,df = D.query(query)

df
dsname lat lon
0 Eur-NorthernScandinavia.Esper.2012 68.00 25.000
1 Eur-Stockholm.Leijonhufvud.2009 59.32 18.060
2 Eur-LakeSilvaplana.Trachsel.2010 46.50 9.800
3 Eur-SpanishPyrenees.Dorado-Linan.2012 42.50 1.000
4 Eur-SpannagelCave.Mangini.2005 47.10 11.600
5 Eur-FinnishLakelands.Helama.2014 62.00 28.325

Congratulations! You’ve performed your first filter! In case you are wondering, this exact query is used in the function filter_by_geo_box.
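For comparison, you could have obtained the same subset by running the earlier unfiltered lat/lon query and filtering the DataFrame in pandas; pushing the FILTER into SPARQL simply avoids pulling rows you would immediately discard. A hypothetical equivalent, where df_all stands in for the unfiltered result:

# Pandas equivalent of the SPARQL FILTER; cast first in case the
# coordinates come back as strings rather than floats
df_all = df_all.astype({'lat': float, 'lon': float})
box = df_all[(df_all.lat >= 40) & (df_all.lat < 70)
             & (df_all.lon >= 0) & (df_all.lon < 30)]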

Filtering by name#

You must now be dying to know how you can look for specific variable names and filter by them. In the interest of time, I’ll create the query that returns all unique variable names first (this is what’s behind get_all_variable_names), but I encourage you to spend some time following it line by line, and even to use the predicate search shown above to see how I got the names of the properties.

query ="""

PREFIX le:<http://linked.earth/ontology#>

SELECT DISTINCT ?varname WHERE{

?ds a le:Dataset .

?ds le:hasPaleoData ?data .
?data le:hasMeasurementTable ?table .
?table le:hasVariable ?var .
?var le:hasName ?varname .
}
"""
_,df = D.query(query)

df
varname
0 year
1 d18O
2 MXD
3 temperature
4 trsgi
5 Uk37
6 depth_bottom
7 Mg_Ca
8 depth_top
9 notes
10 uncertainty_temperature

This returned all the unique variable names present across all the PaleoData measurement tables in the LiPD object. This is the power of SPARQL: you do not need to know how many tables/columns/data are present in the dataset to query across them.
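As a sanity check, this hand-written query should agree with the built-in wrapper it mirrors (a sketch using the get_all_variable_names method named above):

# The custom query and the wrapper should return the same set of names
set(df['varname']) == set(D.get_all_variable_names())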

Let’s filter by temperature. In this case, I really do mean just temperature, ignoring the uncertainty.

query ="""

PREFIX le:<http://linked.earth/ontology#>

SELECT ?dsname ?varname WHERE{

?ds a le:Dataset .
?ds le:hasName ?dsname .

?ds le:hasPaleoData ?data .
?data le:hasMeasurementTable ?table .
?table le:hasVariable ?var .
?var le:hasName ?varname .

FILTER (?varname = 'temperature')

}
"""
_,df = D.query(query)

df
dsname varname
0 Eur-Stockholm.Leijonhufvud.2009 temperature
1 Eur-LakeSilvaplana.Trachsel.2010 temperature
2 Arc-Kongressvatnet.D'Andrea.2012 temperature
3 Eur-CoastofPortugal.Abrantes.2011 temperature
4 Ocn-FeniDrift.Richter.2009 temperature
5 Ocn-FeniDrift.Richter.2009 temperature
6 Ant-WAIS-Divide.Severinghaus.2012 temperature
7 Asi-SourthAndMiddleUrals.Demezhko.2007 temperature
8 Ocn-AlboranSea436B.Nieto-Moreno.2013 temperature
9 Eur-FinnishLakelands.Helama.2014 temperature

If you want to relax this assumption, you have two choices:

  1. Explicitly go look for uncertainty_temperature

  2. Use regular expressions. Although as you may discover, you now have two problems: SPARQL and regex.

Let’s have a look at the first option:

query ="""

PREFIX le:<http://linked.earth/ontology#>

SELECT ?dsname ?varname WHERE{

?ds a le:Dataset .
?ds le:hasName ?dsname .

?ds le:hasPaleoData ?data .
?data le:hasMeasurementTable ?table .
?table le:hasVariable ?var .
?var le:hasName ?varname .

FILTER (?varname = 'temperature' || ?varname = 'uncertainty_temperature')

}
"""

Much like other languages, SPARQL uses || to indicate a logical “or” and && for a logical “and”.

_,df = D.query(query)

df
dsname varname
0 Eur-Stockholm.Leijonhufvud.2009 temperature
1 Eur-LakeSilvaplana.Trachsel.2010 temperature
2 Arc-Kongressvatnet.D'Andrea.2012 temperature
3 Eur-CoastofPortugal.Abrantes.2011 temperature
4 Ocn-FeniDrift.Richter.2009 temperature
5 Ocn-FeniDrift.Richter.2009 temperature
6 Ant-WAIS-Divide.Severinghaus.2012 temperature
7 Ant-WAIS-Divide.Severinghaus.2012 uncertainty_temperature
8 Asi-SourthAndMiddleUrals.Demezhko.2007 temperature
9 Ocn-AlboranSea436B.Nieto-Moreno.2013 temperature
10 Eur-FinnishLakelands.Helama.2014 temperature

Using regular expressions:

query ="""

PREFIX le:<http://linked.earth/ontology#>

SELECT ?dsname ?varname WHERE{

?ds a le:Dataset .
?ds le:hasName ?dsname .

?ds le:hasPaleoData ?data .
?data le:hasMeasurementTable ?table .
?table le:hasVariable ?var .
?var le:hasName ?varname .

FILTER regex (?varname, "temperature.*", "i" ) .

}
"""

_,df = D.query(query)

df
dsname varname
0 Eur-Stockholm.Leijonhufvud.2009 temperature
1 Eur-LakeSilvaplana.Trachsel.2010 temperature
2 Arc-Kongressvatnet.D'Andrea.2012 temperature
3 Eur-CoastofPortugal.Abrantes.2011 temperature
4 Ocn-FeniDrift.Richter.2009 temperature
5 Ocn-FeniDrift.Richter.2009 temperature
6 Ant-WAIS-Divide.Severinghaus.2012 temperature
7 Ant-WAIS-Divide.Severinghaus.2012 uncertainty_temperature
8 Asi-SourthAndMiddleUrals.Demezhko.2007 temperature
9 Ocn-AlboranSea436B.Nieto-Moreno.2013 temperature
10 Eur-FinnishLakelands.Helama.2014 temperature
As you have seen in previous tutorials, PyLiPD natively uses regex to query for variables.
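Note that the unanchored pattern "temperature.*" matches anywhere in the string, which is why uncertainty_temperature came back too. If you want an exact (but still case-insensitive) match with regex, anchor the pattern with ^ and $; only the FILTER line changes:

query ="""

PREFIX le:<http://linked.earth/ontology#>

SELECT ?dsname ?varname WHERE{

?ds a le:Dataset .
?ds le:hasName ?dsname .

?ds le:hasPaleoData ?data .
?data le:hasMeasurementTable ?table .
?table le:hasVariable ?var .
?var le:hasName ?varname .

FILTER regex (?varname, "^temperature$", "i") .

}
"""

_,df = D.query(query)  # now returns only exact 'temperature' matches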

Concluding remarks#

  • We are expanding PyLiPD to include more and more of the common queries in paleoclimatology. If you’d like to see one added, please open an issue. Make sure you consult the open issues before opening your own, and upvote an issue if it already exists. If an issue is upvoted, it is more likely to be slated for quick resolution.

Below are some FAQs, with their answers:

  • Can I create my own properties in a LiPD file and search with PyLiPD? The short answer depends on whether you are creating a data property (a property whose value is a literal such as a string, float, int, or boolean) or an object property. Data properties will immediately be recognized by PyLiPD and can be used. For object properties, please contact us, as this will change the underlying classes in the ontology.

  • Can you create APIs for my properties? The short answer is no, unless this is a property that will be re-used extensively by others. As you navigate datasets, you will notice that some properties are essentially used as flags for analysis. Such a property may only serve one person and will therefore not be integrated at the API level. However, if you cannot create your own query using this tutorial, please start a discussion on Discourse.