Custom queries on LiPD objects#
Preamble#
PyLiPD is a Python package that allows you to read, manipulate, and write LiPD-formatted datasets. While PyLiPD comes pre-loaded with functions to filter datasets and extract information (see the other tutorials), in some cases one might want to design more specific queries.
In this tutorial, we will demonstrate how to generate such custom queries using SPARQL.
Goals#
Learn how to build SPARQL queries
Reading Time: 15 minutes
Keywords#
LiPD, SPARQL
Pre-requisites#
Relevant Packages#
pylipd
Data Description#
This notebook uses the following datasets, in LiPD format:
McCabe-Glynn, S., Johnson, K., Strong, C. et al. Variable North Pacific influence on drought in southwestern North America since AD 854. Nature Geosci 6, 617–621 (2013). https://doi.org/10.1038/ngeo1862
Lawrence, K. T., Liu, Z. H., & Herbert, T. D. (2006). Evolution of the eastern tropical Pacific through Plio-Pleistocene glaciation. Science, 312(5770), 79-83.
PAGES2k Consortium., Emile-Geay, J., McKay, N. et al. A global multiproxy database for temperature reconstructions of the Common Era. Sci Data 4, 170088 (2017). doi:10.1038/sdata.2017.88
Demonstration#
Understanding the LiPD object#
Let’s start by importing our favorite package and loading our datasets.
from pylipd.lipd import LiPD
path = '../data/Pages2k/'
D = LiPD()
D.load_from_dir(path)
Loading 16 LiPD files
100%|██████████| 16/16 [00:00<00:00, 42.46it/s]
Loaded..
By now, you should be familiar with our `load` functionalities. But to truly appreciate this tutorial, and why everything works the way it does, you need to understand a little bit about what’s happening under the hood.
The first step is to expand the dataset stored in LiPD format. Remember that LiPD essentially consists of data tables stored in csv format and a JSON-LD file that contains the metadata.
The second step is to map the JSON-LD file to RDF, using our ontology as the schema.
The third (and most distinctive) step is to load the data contained in the csv files directly into the graph. This is quite unique to LinkedEarth: if you work with other knowledge bases, you’ll find that most of the time the actual data are stored in another format, which you will need to learn to handle in conjunction with the metadata.
Also, each dataset gets its own graph! So you can think of the LiPD object as a collection of graphs, each of which represents a particular dataset.
Let’s have a close look at our object, shall we?
D.copy().__dict__
{'graph': <Graph identifier=N7e673dc520984009819be5b1e4bed605 (<class 'rdflib.graph.ConjunctiveGraph'>)>}
In particular, our graph object is a `ConjunctiveGraph` from the `rdflib` library. Why should you care?
If you isolate the graph, then you can use the `rdflib` package directly. Since this is a well-maintained community package with a larger user base than `PyLiPD`, chances are that Stack Overflow or its documentation will provide you with answers on how to work with your graph.
Why a `ConjunctiveGraph`? We use this particular type because, although each dataset is stored in its own graph, we often want to query across datasets (i.e., across graphs), and this particular object in `rdflib` allows us to do so.
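For instance, here is a minimal sketch (not part of the original tutorial code) of what working with the isolated graph through `rdflib` looks like; `len` and `serialize` are standard `rdflib` graph operations:
# Sketch: pull the underlying ConjunctiveGraph out of a copy of the LiPD object
g = D.copy().graph
print(len(g))  # total number of triples across all dataset graphs
print(g.serialize(format='nquads')[:300])  # peek at the raw quads, one named graph per dataset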
Let’s try to create our own queries from this.
Constructing a query#
Let’s return to the query described in this primer, in which we try to gather all the dataset names. Let’s modify it to also return the variable `ds`, so you can see the difference between the object and its name:
query ="""
PREFIX le:<http://linked.earth/ontology#>
SELECT ?ds ?dsname WHERE{
?ds a le:Dataset .
?ds le:hasName ?dsname .}
"""
Notice that this is just a string containing the query in SPARQL language. Now, let’s run it:
res_sparql, res_df = D.query(query)
Which returns two objects:
- `res_sparql`, which is a `SPARQLResult` object from `rdflib`
- `res_df`, which processes the `SPARQLResult` into a DataFrame, which is often what you will want to look at

Let’s have a look at it:
res_df.head()
| | ds | dsname |
|---|---|---|
| 0 | http://linked.earth/lipd/Eur-NorthernSpain.Mar... | Eur-NorthernSpain.Martin-Chivelet.2011 |
| 1 | http://linked.earth/lipd/Eur-NorthernScandinav... | Eur-NorthernScandinavia.Esper.2012 |
| 2 | http://linked.earth/lipd/Eur-Stockholm.Leijonh... | Eur-Stockholm.Leijonhufvud.2009 |
| 3 | http://linked.earth/lipd/Eur-LakeSilvaplana.Tr... | Eur-LakeSilvaplana.Trachsel.2010 |
| 4 | http://linked.earth/lipd/Eur-SpanishPyrenees.D... | Eur-SpanishPyrenees.Dorado-Linan.2012 |
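If you prefer to skip the DataFrame, you can also iterate over the `SPARQLResult` directly; a quick sketch (each row behaves like a named tuple keyed on the SELECT variables):
# Sketch: walking the raw rdflib result object returned above
for row in res_sparql:
    print(row.ds, row.dsname)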
First, notice that the columns of the DataFrame correspond to the names of the variables you specified in the SPARQL query. You can choose any names you want, as long as you prefix them with a question mark to tell SPARQL that these are the unknowns you are asking the database about. Second, notice the difference between `ds` and `dsname`: `ds` is the dataset object, and the query returns the URI of that object; `dsname` is a string containing the name of the dataset. In general, we form our URIs from the names of the objects, so don’t be surprised if you sometimes end up with the URI instead of the name (we all sometimes forget to ask for the name).
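If you do end up with a URI and want something name-like, a quick pandas trick (a sketch that relies on the URI convention just described) is to keep the fragment after the last slash:
# Sketch: recover a name-like string from the URI column
res_df['ds_short'] = res_df['ds'].astype(str).str.rsplit('/', n=1).str[-1]
res_df[['ds_short', 'dsname']].head()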
One thing to consider is query speed. If you’re doing exploratory work with SPARQL, you may wish to return only a few results to check that you’re on the right track. In this case, the number of datasets is quite small:
len(res_df.index)
16
But if you have thousands of rows in that DataFrame, you might not want to wait for the full query to complete, as this may take a few minutes. If you only want the first 10 rows, just add a LIMIT clause to the query:
query ="""
PREFIX le:<http://linked.earth/ontology#>
SELECT ?ds ?dsname WHERE{
?ds a le:Dataset .
?ds le:hasName ?dsname .}
LIMIT 10
"""
res, res_df = D.query(query)
len(res_df.index)
10
Please note that the number of rows doesn’t necessarily correspond to the number of datasets. For instance, if a dataset contains several variables (as most do) and I ask for each variable in a dataset, then each variable gets its own row.
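Relatedly, if all you want is a count, you can let SPARQL do the aggregation instead of measuring the DataFrame; here is a sketch using a standard SPARQL 1.1 aggregate:
query ="""
PREFIX le:<http://linked.earth/ontology#>
SELECT (COUNT(DISTINCT ?ds) AS ?n) WHERE{
?ds a le:Dataset .}
"""
_, df = D.query(query)
df  # a single row holding the number of distinct datasets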
Optional variables#
Let’s pop one of the PAGES2k datasets into a new LiPD object, `D2`, and add the ODP846 dataset for the purpose of this demonstration. Why ODP846? The file was created before LiPD had dataset IDs, so I know this dataset doesn’t have one.
D.get_all_dataset_names()
['Eur-NorthernSpain.Martin-Chivelet.2011',
'Eur-NorthernScandinavia.Esper.2012',
'Eur-Stockholm.Leijonhufvud.2009',
'Eur-LakeSilvaplana.Trachsel.2010',
'Eur-SpanishPyrenees.Dorado-Linan.2012',
'Arc-Kongressvatnet.D_Andrea.2012',
'Eur-CoastofPortugal.Abrantes.2011',
'Ocn-PedradeLume-CapeVerdeIslands.Moses.2006',
'Ocn-FeniDrift.Richter.2009',
'Ocn-SinaiPeninsula_RedSea.Moustafa.2000',
'Ant-WAIS-Divide.Severinghaus.2012',
'Asi-SourthAndMiddleUrals.Demezhko.2007',
'Ocn-AlboranSea436B.Nieto-Moreno.2013',
'Eur-SpannagelCave.Mangini.2005',
'Ocn-RedSea.Felis.2000',
'Eur-FinnishLakelands.Helama.2014']
D2 = D.pop('Ocn-RedSea.Felis.2000')
D2.load('../data/ODP846.Lawrence.2006.lpd')
D2.get_all_dataset_names()
Loading 1 LiPD files
100%|██████████| 1/1 [00:00<00:00, 1.12it/s]
Loaded..
['Ocn-RedSea.Felis.2000', 'ODP846.Lawrence.2006']
Let’s perform a similar query, but this time asking for the dataset ID in addition to the dataset name:
query ="""
PREFIX le:<http://linked.earth/ontology#>
SELECT ?ds ?dsname ?datasetID WHERE{
?ds a le:Dataset .
?ds le:hasName ?dsname .
?ds le:hasDatasetId ?datasetID}
"""
res, res_df = D2.query(query)
res_df
| | ds | dsname | datasetID |
|---|---|---|---|
| 0 | http://linked.earth/lipd/Ocn-RedSea.Felis.2000 | Ocn-RedSea.Felis.2000 | 4fZQAHmeuJn8ipLfurWv |
As you can see, only the PAGES2k dataset was returned by this query. The ODP846 dataset was omitted because it doesn’t have a dataset ID. If you want to make this part of the query optional, all you need to do is wrap the corresponding triple pattern in an OPTIONAL clause:
query ="""
PREFIX le:<http://linked.earth/ontology#>
SELECT ?ds ?dsname ?datasetID WHERE{
?ds a le:Dataset .
?ds le:hasName ?dsname .
OPTIONAL{?ds le:hasDatasetId ?datasetID.}}
"""
res, res_df = D2.query(query)
res_df
| | ds | dsname | datasetID |
|---|---|---|---|
| 0 | http://linked.earth/lipd/Ocn-RedSea.Felis.2000 | Ocn-RedSea.Felis.2000 | 4fZQAHmeuJn8ipLfurWv |
| 1 | http://linked.earth/lipd/ODP846.Lawrence.2006 | ODP846.Lawrence.2006 | None |
In this case, you at least get the dataset name. When to use OPTIONAL is really up to you. If the value of a specific property is critical to your analysis, leaving the pattern mandatory filters out the datasets that lack it directly. However, if the piece of information is not critical, you may want to use OPTIONAL so that those datasets are still returned.
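You can also turn OPTIONAL around: combined with the standard BOUND function, it lets you ask which datasets are missing a property. A sketch that, in this example, should flag only ODP846:
query ="""
PREFIX le:<http://linked.earth/ontology#>
SELECT ?dsname WHERE{
?ds a le:Dataset .
?ds le:hasName ?dsname .
OPTIONAL{?ds le:hasDatasetId ?datasetID.}
FILTER (!BOUND(?datasetID))}
"""
_, df = D2.query(query)
df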
Querying for predicates#
Constructing these queries begs the question: do I need to know the ontology? To some extent, yes. You would need to know the schema of any database to be able to query it. However, you do not need to know it in excruciating detail.
First, as you may have surmised, most of the `PyLiPD` functionalities designed to return information about the datasets are essentially wrappers around these queries. So if you’re interested, you can have a look at `queries.py` within the `PyLiPD` package to get some inspiration.
Second, you can also directly query all the predicates attached to an object. Let’s do so, shall we?
query ="""
PREFIX le:<http://linked.earth/ontology#>
SELECT DISTINCT ?p WHERE {
?ds a le:Dataset .
?ds ?p ?o .
}
"""
Let’s break that query down. The SELECT line contains the directive `DISTINCT`, which ensures that only unique predicates are returned. The last line of the pattern, `?ds ?p ?o .`, makes the predicate itself a variable that we are querying for.
_,df = D2.query(query)
df
| | p |
|---|---|
| 0 | http://linked.earth/ontology#hasPaleoData |
| 1 | http://linked.earth/ontology#inCompilation3_ |
| 2 | http://linked.earth/ontology#minYear |
| 3 | http://linked.earth/ontology#createdBy |
| 4 | http://linked.earth/ontology#hasPublication |
| 5 | http://linked.earth/ontology#lipdVersion |
| 6 | http://linked.earth/ontology#googleMetadataWor... |
| 7 | http://linked.earth/ontology#hasSpreadsheetLink |
| 8 | http://linked.earth/ontology#hasName |
| 9 | http://linked.earth/ontology#studyName |
| 10 | http://linked.earth/ontology#googleDataURL |
| 11 | http://linked.earth/ontology#hasContributor |
| 12 | http://linked.earth/ontology#hasArchiveType |
| 13 | http://linked.earth/ontology#hasLocation |
| 14 | http://linked.earth/ontology#inCompilation1_ |
| 15 | http://linked.earth/ontology#hasDatasetId |
| 16 | http://linked.earth/ontology#hasOriginalDataUrl |
| 17 | http://linked.earth/ontology#hasChangeLog |
| 18 | http://linked.earth/ontology#hasUrl |
| 19 | http://www.w3.org/1999/02/22-rdf-syntax-ns#type |
| 20 | http://linked.earth/ontology#maxYear |
| 21 | http://linked.earth/ontology#inCompilation2_ |
| 22 | http://linked.earth/ontology#hasInvestigator |
| 23 | http://linked.earth/ontology#lipdComplete |
| 24 | http://linked.earth/ontology#hasChronData |
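One more solution modifier worth knowing, alongside DISTINCT and LIMIT, is ORDER BY, which sorts the results; a sketch that returns the same predicates in alphabetical order:
query ="""
PREFIX le:<http://linked.earth/ontology#>
SELECT DISTINCT ?p WHERE {
?ds a le:Dataset .
?ds ?p ?o .
}
ORDER BY ?p
"""
_, df = D2.query(query)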
Geoquery#
Let’s try a geospatial query on our original database. From the list of properties above, `hasLocation` links to coordinates, so let’s have a look at this object:
query ="""
PREFIX le:<http://linked.earth/ontology#>
SELECT DISTINCT ?p WHERE {
?ds a le:Dataset .
?ds le:hasLocation ?geo .
?geo ?p ?o .
}
"""
_,df = D.query(query)
df
| | p |
|---|---|
| 0 | http://www.w3.org/2003/01/geo/wgs84_pos#lat |
| 1 | http://linked.earth/ontology#region |
| 2 | http://linked.earth/ontology#pages2kRegion |
| 3 | http://linked.earth/ontology#coordinatesFor |
| 4 | http://www.w3.org/1999/02/22-rdf-syntax-ns#type |
| 5 | http://www.w3.org/2003/01/geo/wgs84_pos#long |
| 6 | http://linked.earth/ontology#coordinates |
| 7 | http://www.w3.org/2003/01/geo/wgs84_pos#alt |
| 8 | http://linked.earth/ontology#hasSiteName |
| 9 | http://linked.earth/ontology#hasOcean |
Of these properties, `lat`, `long`, and `alt` look promising. Notice that their URI is different: these predicates do not come from the ontology created by LinkedEarth but from another ontology with a different base URI, so we should declare it with its own prefix.
query ="""
PREFIX le:<http://linked.earth/ontology#>
PREFIX wgs84: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT ?dsname ?lat ?lon WHERE {
?ds a le:Dataset .
?ds le:hasName ?dsname .
?ds le:hasLocation ?geo .
OPTIONAL{?geo wgs84:lat ?lat .
?geo wgs84:long ?lon .}
}
"""
Let’s have a look at our new query:
- Notice the new prefix, since we are reusing the wgs84 ontology.
- The third line defines the variables we wish to return in our query, namely the dataset name, latitude, and longitude.
- The last block concerns our geographical query. Note that these lines are wrapped in OPTIONAL in case a file doesn’t contain geographical coordinates. This is unlikely to be the case, but it is a quick way to surface errors in our database or in your LiPD files.
_,df = D.query(query)
df
| | dsname | lat | lon |
|---|---|---|---|
| 0 | Eur-NorthernSpain.Martin-Chivelet.2011 | 42.9000 | -3.5000 |
| 1 | Eur-NorthernScandinavia.Esper.2012 | 68.0000 | 25.0000 |
| 2 | Eur-Stockholm.Leijonhufvud.2009 | 59.3200 | 18.0600 |
| 3 | Eur-LakeSilvaplana.Trachsel.2010 | 46.5000 | 9.8000 |
| 4 | Eur-SpanishPyrenees.Dorado-Linan.2012 | 42.5000 | 1.0000 |
| 5 | Arc-Kongressvatnet.D'Andrea.2012 | 78.0217 | 13.9311 |
| 6 | Eur-CoastofPortugal.Abrantes.2011 | 41.1000 | -8.9000 |
| 7 | Ocn-PedradeLume-CapeVerdeIslands.Moses.2006 | 16.7600 | -22.8883 |
| 8 | Ocn-FeniDrift.Richter.2009 | 55.5000 | -13.9000 |
| 9 | Ocn-SinaiPeninsula,RedSea.Moustafa.2000 | 27.8483 | 34.3100 |
| 10 | Ant-WAIS-Divide.Severinghaus.2012 | -79.4630 | -112.1250 |
| 11 | Asi-SourthAndMiddleUrals.Demezhko.2007 | 55.0000 | 59.5000 |
| 12 | Ocn-AlboranSea436B.Nieto-Moreno.2013 | 36.2053 | -4.3133 |
| 13 | Eur-SpannagelCave.Mangini.2005 | 47.1000 | 11.6000 |
| 14 | Eur-FinnishLakelands.Helama.2014 | 62.0000 | 28.3250 |
Let’s take this query a little bit further: let’s say I only want to keep the data between 40°N and 70°N, and from 0°E to 30°E. Welcome to the `FILTER` functionality.
Your query is going to look much the same, with the added `FILTER`:
query ="""
PREFIX le:<http://linked.earth/ontology#>
PREFIX wgs84: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT ?dsname ?lat ?lon WHERE {
?ds a le:Dataset .
?ds le:hasName ?dsname .
?ds le:hasLocation ?geo .
?geo wgs84:lat ?lat .
?geo wgs84:long ?lon .
FILTER ( ?lat >= 40 && ?lat < 70 && ?lon >= 0 && ?lon < 30 ) .
}
"""
I also removed the `OPTIONAL` clause, since we are doing a geospatial query: if a dataset doesn’t have lat/lon, then it shouldn’t be included in the results anyway.
_,df = D.query(query)
df
| | dsname | lat | lon |
|---|---|---|---|
| 0 | Eur-NorthernScandinavia.Esper.2012 | 68.00 | 25.000 |
| 1 | Eur-Stockholm.Leijonhufvud.2009 | 59.32 | 18.060 |
| 2 | Eur-LakeSilvaplana.Trachsel.2010 | 46.50 | 9.800 |
| 3 | Eur-SpanishPyrenees.Dorado-Linan.2012 | 42.50 | 1.000 |
| 4 | Eur-SpannagelCave.Mangini.2005 | 47.10 | 11.600 |
| 5 | Eur-FinnishLakelands.Helama.2014 | 62.00 | 28.325 |
Congratulations! You’ve performed your first filter! In case you are wondering, this exact query is used in the function `filter_by_geo_box`.
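The high-level equivalent would look something like the sketch below. Note that the argument order shown here is my assumption, not something this tutorial establishes; check the PyLiPD documentation before relying on it.
# Sketch: the PyLiPD wrapper around the FILTER query above.
# Assumed argument order (lon_min, lon_max, lat_min, lat_max) -- verify in the docs!
D_box = D.filter_by_geo_box(0, 30, 40, 70)
D_box.get_all_dataset_names()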
Filtering by name#
You must now be dying to know how you can look for specific variable names and filter by them. In the interest of time, I’ll start with the query that returns all unique variable names (this is what’s behind `get_all_variable_names`), but I encourage you to spend some time following it line by line, and even to use the predicate search shown above to see how I got the names of the properties.
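As a concrete example of that discovery step, here is a sketch listing the predicates attached to a `PaleoData` object, which is how you find `le:hasMeasurementTable`:
query ="""
PREFIX le:<http://linked.earth/ontology#>
SELECT DISTINCT ?p WHERE{
?ds a le:Dataset .
?ds le:hasPaleoData ?data .
?data ?p ?o .}
"""
_, df = D.query(query)
Repeating the same trick on the table and variable objects surfaces `le:hasVariable` and `le:hasName`. With those property names in hand, the full query reads: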
query ="""
PREFIX le:<http://linked.earth/ontology#>
SELECT DISTINCT ?varname WHERE{
?ds a le:Dataset .
?ds le:hasPaleoData ?data .
?data le:hasMeasurementTable ?table .
?table le:hasVariable ?var .
?var le:hasName ?varname .
}
"""
_,df = D.query(query)
df
| | varname |
|---|---|
| 0 | year |
| 1 | d18O |
| 2 | MXD |
| 3 | temperature |
| 4 | trsgi |
| 5 | Uk37 |
| 6 | depth_bottom |
| 7 | Mg_Ca |
| 8 | depth_top |
| 9 | notes |
| 10 | uncertainty_temperature |
This returned all the unique variable names present across all the `PaleoData` tables in the `LiPD` object. This is the power of SPARQL: you do not need to know how many tables/columns/data are present in the dataset to query across them.
Let’s filter by temperature. In this case, I mean exactly `temperature`, ignoring `uncertainty_temperature`.
query ="""
PREFIX le:<http://linked.earth/ontology#>
SELECT ?dsname ?varname WHERE{
?ds a le:Dataset .
?ds le:hasName ?dsname .
?ds le:hasPaleoData ?data .
?data le:hasMeasurementTable ?table .
?table le:hasVariable ?var .
?var le:hasName ?varname .
FILTER (?varname = 'temperature')
}
"""
_,df = D.query(query)
df
| | dsname | varname |
|---|---|---|
| 0 | Eur-Stockholm.Leijonhufvud.2009 | temperature |
| 1 | Eur-LakeSilvaplana.Trachsel.2010 | temperature |
| 2 | Arc-Kongressvatnet.D'Andrea.2012 | temperature |
| 3 | Eur-CoastofPortugal.Abrantes.2011 | temperature |
| 4 | Ocn-FeniDrift.Richter.2009 | temperature |
| 5 | Ocn-FeniDrift.Richter.2009 | temperature |
| 6 | Ant-WAIS-Divide.Severinghaus.2012 | temperature |
| 7 | Asi-SourthAndMiddleUrals.Demezhko.2007 | temperature |
| 8 | Ocn-AlboranSea436B.Nieto-Moreno.2013 | temperature |
| 9 | Eur-FinnishLakelands.Helama.2014 | temperature |
If you want to relax this assumption, you have two choices:
- Explicitly also ask for `uncertainty_temperature`.
- Use regular expressions. Although, as you may discover, you then have two problems: SPARQL and regex.

Let’s have a look at the first option:
query ="""
PREFIX le:<http://linked.earth/ontology#>
SELECT ?dsname ?varname WHERE{
?ds a le:Dataset .
?ds le:hasName ?dsname .
?ds le:hasPaleoData ?data .
?data le:hasMeasurementTable ?table .
?table le:hasVariable ?var .
?var le:hasName ?varname .
FILTER (?varname = 'temperature' || ?varname = 'uncertainty_temperature')
}
"""
Much like other languages, SPARQL uses `||` to indicate a logical “or” and `&&` for a logical “and”.
_,df = D.query(query)
df
| | dsname | varname |
|---|---|---|
| 0 | Eur-Stockholm.Leijonhufvud.2009 | temperature |
| 1 | Eur-LakeSilvaplana.Trachsel.2010 | temperature |
| 2 | Arc-Kongressvatnet.D'Andrea.2012 | temperature |
| 3 | Eur-CoastofPortugal.Abrantes.2011 | temperature |
| 4 | Ocn-FeniDrift.Richter.2009 | temperature |
| 5 | Ocn-FeniDrift.Richter.2009 | temperature |
| 6 | Ant-WAIS-Divide.Severinghaus.2012 | temperature |
| 7 | Ant-WAIS-Divide.Severinghaus.2012 | uncertainty_temperature |
| 8 | Asi-SourthAndMiddleUrals.Demezhko.2007 | temperature |
| 9 | Ocn-AlboranSea436B.Nieto-Moreno.2013 | temperature |
| 10 | Eur-FinnishLakelands.Helama.2014 | temperature |
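As an aside, SPARQL 1.1 also offers the IN operator, which reads a little better once the list of acceptable names grows; a sketch equivalent to the query above:
query ="""
PREFIX le:<http://linked.earth/ontology#>
SELECT ?dsname ?varname WHERE{
?ds a le:Dataset .
?ds le:hasName ?dsname .
?ds le:hasPaleoData ?data .
?data le:hasMeasurementTable ?table .
?table le:hasVariable ?var .
?var le:hasName ?varname .
FILTER (?varname IN ('temperature', 'uncertainty_temperature'))
}
"""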
Using regular expressions:
query ="""
PREFIX le:<http://linked.earth/ontology#>
SELECT ?dsname ?varname WHERE{
?ds a le:Dataset .
?ds le:hasName ?dsname .
?ds le:hasPaleoData ?data .
?data le:hasMeasurementTable ?table .
?table le:hasVariable ?var .
?var le:hasName ?varname .
FILTER regex (?varname, "temperature.*", "i" ) .
}
"""
_,df = D.query(query)
df
| | dsname | varname |
|---|---|---|
| 0 | Eur-Stockholm.Leijonhufvud.2009 | temperature |
| 1 | Eur-LakeSilvaplana.Trachsel.2010 | temperature |
| 2 | Arc-Kongressvatnet.D'Andrea.2012 | temperature |
| 3 | Eur-CoastofPortugal.Abrantes.2011 | temperature |
| 4 | Ocn-FeniDrift.Richter.2009 | temperature |
| 5 | Ocn-FeniDrift.Richter.2009 | temperature |
| 6 | Ant-WAIS-Divide.Severinghaus.2012 | temperature |
| 7 | Ant-WAIS-Divide.Severinghaus.2012 | uncertainty_temperature |
| 8 | Asi-SourthAndMiddleUrals.Demezhko.2007 | temperature |
| 9 | Ocn-AlboranSea436B.Nieto-Moreno.2013 | temperature |
| 10 | Eur-FinnishLakelands.Helama.2014 | temperature |
Concluding remarks#
We are expanding `PyLiPD` to include more and more of the common queries in paleoclimatology. If you’d like to see one added, please open an issue. Make sure you consult the open issues before opening your own, and upvote an existing issue if it covers your request; the more upvotes an issue has, the more likely it is to be addressed quickly.
Below are some FAQs, with their answers:
- Can I create my own properties in a LiPD file and search them with `PyLiPD`? The short answer depends on whether you are creating a data property (a property linking to a known literal type such as `string`, `float`, `int`, or `boolean`) or an object property. Data properties will immediately be recognized by `PyLiPD` and can be used. For object properties, please contact us, as this will change the underlying classes in the ontology.
- Can you create APIs for my properties? The short answer is no, unless this is a property that will be re-used extensively by others. As you navigate datasets, you will notice that some properties are essentially used as flags for analysis. Such a property may only serve one person and will therefore not be integrated at the API level. However, if you cannot create your own query using this tutorial, please start a discussion on Discourse.
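To make the first answer concrete: suppose your file carried a custom data property. The name `le:hasMyCustomFlag` below is purely hypothetical, but a query against such a property would look just like the ones above:
# Sketch only: le:hasMyCustomFlag is a made-up data property used for illustration
query ="""
PREFIX le:<http://linked.earth/ontology#>
SELECT ?dsname ?flag WHERE{
?ds a le:Dataset .
?ds le:hasName ?dsname .
?ds le:hasMyCustomFlag ?flag .}
"""
_, df = D.query(query)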