Directly querying the LiPDGraph#
Preamble#
This tutorial demonstrates a query on the LiPDGraph
using the direct SPARQL endpoint. You might ask yourself: “What is the LiPDGraph
and how does it differ from the LiPDverse?” The short answer is that the LiPDGraph
contains all the datasets that are available through the LiPDverse, stored in a graph format. Remember that PyLiPD essentially takes a LiPD file and transforms it into a graph. We have already done this transformation for the LiPDverse, so those datasets can be queried directly.
Goals#
Learn how to query the Graph database directly
Reading Time: 5 minutes
Keywords#
SPARQL
Pre-requisites#
Relevant Packages#
pandas
Data Description#
This notebook makes use of the LiPDGraph
database, a graph database that contains datasets from various working groups. For a list of these compilations, see this page.
In particular, we will be using PAGES2k and iso2k for this demonstration.
PAGES2k: PAGES2k Consortium., Emile-Geay, J., McKay, N. et al. A global multiproxy database for temperature reconstructions of the Common Era. Sci Data 4, 170088 (2017). doi:10.1038/sdata.2017.88
iso2k: Konecky, B. L., McKay, N. P., Churakova (Sidorova), O. V., Comas-Bru, L., Dassié, E. P., DeLong, K. L., Falster, G. M., Fischer, M. J., Jones, M. D., Jonkers, L., Kaufman, D. S., Leduc, G., Managave, S. R., Martrat, B., Opel, T., Orsi, A. J., Partin, J. W., Sayani, H. R., Thomas, E. K., Thompson, D. M., Tyler, J. J., Abram, N. J., Atwood, A. R., Cartapanis, O., Conroy, J. L., Curran, M. A., Dee, S. G., Deininger, M., Divine, D. V., Kern, Z., Porter, T. J., Stevenson, S. L., von Gunten, L., and Iso2k Project Members: The Iso2k database: a global compilation of paleo-δ18O and δ2H records to aid understanding of Common Era climate, Earth Syst. Sci. Data, 12, 2261–2288, https://doi.org/10.5194/essd-12-2261-2020, 2020.
Demonstration#
Let’s import a few web packages along with pandas.
import json
import requests
import pandas as pd
import io
import ast
Simple query#
Let’s start with one of the simple queries we have been playing with. The LiPDGraph
knowledge base is organized very similarly to a LiPD
object, with each dataset stored in its own graph. So the queries you have been learning in the previous tutorials also apply.
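For instance, since each dataset lives in its own named graph, you can use the standard SPARQL GRAPH keyword to see which graph a given dataset is stored in. Here is a minimal sketch (the variable names are arbitrary):
PREFIX le:<http://linked.earth/ontology#>
SELECT ?g ?dsname WHERE{
  GRAPH ?g {
    ?ds a le:Dataset .
    ?ds le:hasName ?dsname .
  }
}
LIMIT 5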
Let’s retrieve 10 random dataset names. But first we need to point towards the endpoint for the query:
url = 'https://linkedearth.graphdb.mint.isi.edu/repositories/LiPDVerse-dynamic'
We periodically version the LiPDGraph for reproducibility purposes. LiPDverse-dynamic
refers to the most up-to-date version of the LiPDGraph. Once your study is done, we encourage you to point towards the latest frozen version.
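In practice, switching to a frozen version only means changing the repository name at the end of the endpoint URL, for example (the repository name below is purely hypothetical; check the LiPDGraph documentation for the actual release names):
# hypothetical frozen release -- replace with an actual versioned repository name
url = 'https://linkedearth.graphdb.mint.isi.edu/repositories/LiPDVerse-frozen-example'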
Next, let’s write our queries:
query ="""
PREFIX le:<http://linked.earth/ontology#>
SELECT ?ds ?dsname WHERE{
?ds a le:Dataset .
?ds le:hasName ?dsname .
}
ORDER BY RAND() LIMIT 10
"""
This should be very familiar, as this is the first query we performed in the custom queries tutorial. Now, let’s send it to our SPARQL endpoint and retrieve the results, which we will store in a pandas DataFrame.
Note: The next cell might take a little bit of time to run since the query has to be performed on the remote server before sending the data back.
response = requests.post(url, data = {'query': query})
data = io.StringIO(response.text)
df = pd.read_csv(data, sep=",")
df
 | ds | dsname |
---|---|---|
0 | http://linked.earth/lipd/KR20SAR01 | KR20SAR01 |
1 | http://linked.earth/lipd/ClatteringshawsLoch.B... | ClatteringshawsLoch.Birks.1975 |
2 | http://linked.earth/lipd/WildseemoorKaltenbron... | WildseemoorKaltenbronn.Rosch.2009 |
3 | http://linked.earth/lipd/CO04LIFI | CO04LIFI |
4 | http://linked.earth/lipd/AchitNur.Tarasov.1994 | AchitNur.Tarasov.1994 |
5 | http://linked.earth/lipd/CO13CAHN | CO13CAHN |
6 | http://linked.earth/lipd/ThorpeBulmer.EPD | ThorpeBulmer.EPD |
7 | http://linked.earth/lipd/Craterlake.Arcusa.2020 | Craterlake.Arcusa.2020 |
8 | http://linked.earth/lipd/AustreNevlingen.Kjell... | AustreNevlingen.Kjellman.2020 |
9 | http://linked.earth/lipd/DevilsLake.Shuman.2016 | DevilsLake.Shuman.2016 |
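Since the endpoint is remote, a request can occasionally fail (e.g., a timeout or server maintenance). A simple guard, sketched below with standard requests calls, is to raise an error on a bad HTTP status before parsing the response:
response = requests.post(url, data = {'query': query}, timeout=60)
response.raise_for_status()  # raises an exception if the server returned an error status
df = pd.read_csv(io.StringIO(response.text), sep=",")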
Complex queries#
OK, let’s work on something more complex. Let’s say that you want to retrieve all the data in the PAGES2k and iso2k compilations that were tagged for use in the temperature analyses of these compilations, along with metadata regarding the location and the type of archive and proxy used.
Don’t forget that each column is stored separately so you also need to go grab the time axis!
query = """PREFIX le: <http://linked.earth/ontology#>
PREFIX le_var: <http://linked.earth/ontology/variables#>
PREFIX wgs84: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?dataSetName ?archiveType ?geo_meanLat ?geo_meanLon ?geo_meanElev
?paleoData_variableName ?paleoData_values ?paleoData_units
?paleoData_proxy ?paleoData_proxyGeneral ?time_variableName ?time_values
?time_units ?compilationName ?TSID where{
?ds a le:Dataset .
?ds le:hasName ?dataSetName .
OPTIONAL{
?ds le:hasArchiveType ?archiveTypeObj .
?archiveTypeObj rdfs:label ?archiveType .
}
?ds le:hasLocation ?loc .
OPTIONAL{?loc wgs84:lat ?geo_meanLat .}
OPTIONAL{?loc wgs84:long ?geo_meanLon .}
OPTIONAL {?loc wgs84:alt ?geo_meanElev .}
?ds le:hasPaleoData ?data .
?data le:hasMeasurementTable ?table .
?table le:hasVariable ?var .
?var le:hasName ?paleoData_variableName .
?var le:hasValues ?paleoData_values .
OPTIONAL{
?var le:hasUnits ?paleoData_unitsObj .
?paleoData_unitsObj rdfs:label ?paleoData_units .
}
OPTIONAL{
?var le:hasProxy ?paleoData_proxyObj .
?paleoData_proxyObj rdfs:label ?paleoData_proxy .
}
OPTIONAL{
?var le:hasProxyGeneral ?paleoData_proxyGeneralObj .
?paleoData_proxyGeneralObj rdfs:label ?paleoData_proxyGeneral .
}
?var le:partOfCompilation ?compilation .
?compilation le:hasName ?compilationName .
VALUES ?compilationName {"iso2k" "Pages2kTemperature"} .
?var le:useInGlobalTemperatureAnalysis True .
OPTIONAL{
?var le:hasVariableId ?TSID
} .
?table le:hasVariable ?timevar .
?timevar le:hasName ?time_variableName .
?timevar le:hasStandardVariable le_var:year .
?timevar le:hasValues ?time_values .
OPTIONAL{
?timevar le:hasUnits ?time_unitsObj .
?time_unitsObj rdfs:label ?time_units .
}
}"""
Pretty long! But let’s take it paragraph by paragraph.
The first paragraph corresponds to our prefixes. You should be familiar with them by now.
The second paragraph lists all the variables that we wish to return:
the name of the dataset
the type of archive the measurements were made on
the geographical coordinates (lat/lon/elevation)
the name of the paleovariable, with its associated units and values
the type of proxy used to obtain the data
information about the time axis (name, units, and values)
the name of the compilation the data is part of
a unique identifier for each variable
The third paragraph:
?ds a le:Dataset .
?ds le:hasName ?dataSetName .
OPTIONAL{
?ds le:hasArchiveType ?archiveTypeObj .
?archiveTypeObj rdfs:label ?archiveType .
}
performs a query to return the information stored at the dataset level (namely the name of the dataset and the type of archive).
The fourth paragraph:
?ds le:hasLocation ?loc .
OPTIONAL{?loc wgs84:lat ?geo_meanLat .}
OPTIONAL{?loc wgs84:long ?geo_meanLon .}
OPTIONAL {?loc wgs84:alt ?geo_meanElev .}
returns geographical coordinates.
The fifth paragraph digs down to the variable level:
?ds le:hasPaleoData ?data .
?data le:hasMeasurementTable ?table .
?table le:hasVariable ?var .
?var le:hasName ?paleoData_variableName .
?var le:hasValues ?paleoData_values .
OPTIONAL{
?var le:hasUnits ?paleoData_unitsObj .
?paleoData_unitsObj rdfs:label ?paleoData_units .
}
OPTIONAL{
?var le:hasProxy ?paleoData_proxyObj .
?paleoData_proxyObj rdfs:label ?paleoData_proxy .
}
OPTIONAL{
?var le:hasProxyGeneral ?paleoData_proxyGeneralObj .
?paleoData_proxyGeneralObj rdfs:label ?paleoData_proxyGeneral .
}
?var le:partOfCompilation ?compilation .
?compilation le:hasName ?compilationName .
VALUES ?compilationName {"iso2k" "Pages2kTemperature"} .
?var le:useInGlobalTemperatureAnalysis True .
OPTIONAL{
?var le:hasVariableId ?TSID
} .
and asks for the name of the variable, its values, units, the proxy information, and unique identifier. Notice the two filters: one for the compilation the datasets belong to and the other to retain only the variables that these working groups have identified as being used in a global temperature analysis.
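If you only needed one of the two compilations, you could narrow the VALUES clause accordingly, for example:
VALUES ?compilationName {"Pages2kTemperature"} .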
The last paragraph:
?table le:hasVariable ?timevar .
?timevar le:hasName ?time_variableName .
?timevar le:hasStandardVariable le_var:year .
?timevar le:hasValues ?time_values .
OPTIONAL{
?timevar le:hasUnits ?time_unitsObj .
?time_unitsObj rdfs:label ?time_units .
}
looks for the time information associated with the data. Note the le:hasStandardVariable le_var:year constraint, which retains only the variable standardized as year, i.e., the time axis.
Let’s send our query over to the server and get the results into a dataframe:
response = requests.post(url, data = {'query': query})
data = io.StringIO(response.text)
df = pd.read_csv(data, sep=",")
df.head()
 | dataSetName | archiveType | geo_meanLat | geo_meanLon | geo_meanElev | paleoData_variableName | paleoData_values | paleoData_units | paleoData_proxy | paleoData_proxyGeneral | time_variableName | time_values | time_units | compilationName | TSID |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | LS16STCL | Lake sediment | 50.8300 | -116.3900 | 1126.0 | d18O | [-7.81, -5.91, -9.03, -5.35, -5.61, -5.98, -5.... | permil | d18O | NaN | year | [2009.0, 2008.3, 2007.8, 2007.4, 2007.0, 2006.... | yr AD | iso2k | LPD7dc5b9ba |
1 | CO00URMA | Coral | 0.9330 | 173.0000 | 6.0 | d18O | [-4.8011, -4.725, -4.6994, -4.86, -5.0886, -5.... | permil | d18O | NaN | year | [1994.5, 1994.33, 1994.17, 1994.0, 1993.83, 19... | yr AD | iso2k | Ocean2kHR_177 |
2 | CO05KUBE | Coral | 32.4670 | -64.7000 | -12.0 | d18O | [-4.15, -3.66, -3.69, -4.07, -3.95, -4.12, -3.... | permil | d18O | NaN | year | [1983.21, 1983.13, 1983.04, 1982.96, 1982.88, ... | yr AD | iso2k | Ocean2kHR_105 |
3 | IC13THQU | Glacier ice | -13.9333 | -70.8333 | 5670.0 | d18O | [-18.5905, -16.3244, -16.2324, -17.0112, -18.6... | permil | d18O | NaN | year | [2009, 2008, 2007, 2006, 2005, 2004, 2003, 200... | yr AD | iso2k | SAm_035 |
4 | CO01TUNG | Coral | -5.2170 | 145.8170 | -3.0 | d18O | [-4.875, -4.981, -5.166, -5.06, -4.942, -4.919... | permil | d18O | NaN | year | [1993.042, 1992.792, 1992.542, 1992.292, 1992.... | yr AD | iso2k | Ocean2kHR_141 |
Much like with PyLiPD, the information is contained in columns, with each row representing a time series. Let’s have a look at the type of the entries in the paleoData_values column:
type(df['paleoData_values'].iloc[0])
str
What looks like a list is actually a string. Why? Keep in mind that most knowledge bases do not contain the actual data, only a link to a file containing it, so storing the values in the graph is already quite unique and an improvement. Could the data have been stored as lists of floats? The trade-off comes down to memory and query efficiency, so strings were preferred.
So what does that mean for your workflow? First, you will need to convert each row to a list (or NumPy array). Here is a way to do so:
df['paleoData_values']=df['paleoData_values'].apply(lambda row : json.loads(row) if isinstance(row, str) else row)
df['time_values']=df['time_values'].apply(lambda row : json.loads(row) if isinstance(row, str) else row)
type(df['paleoData_values'].iloc[0])
list
Second, it means that you cannot perform queries on the data themselves. We store some information about the data (e.g., maximum/minimum value, resolution) as metadata that is captured when a LiPD file is created, so you can query these properties to filter datasets. But you cannot query anything about the data directly (e.g., how many data points are in the time series, or what lies between two time points in the series) until you have loaded them into your workspace.
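Once the values are loaded as lists, these kinds of questions become easy to answer with pandas and NumPy. A minimal sketch using the columns defined above:
import numpy as np

# number of data points in each time series
df['n_points'] = df['paleoData_values'].apply(len)

# subset a single record between two time points (here, 1500-1800 CE)
time = np.array(df['time_values'].iloc[0])
values = np.array(df['paleoData_values'].iloc[0])
mask = (time >= 1500) & (time <= 1800)
values_subset = values[mask]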