Directly querying the LiPDGraph#
Preamble#
This tutorial demonstrates a query on the LiPDGraph
using the direct SPARQL endpoint. You might ask yourself: “What is the LiPDGraph
and how does it differ from the LiPDverse?” The short answer is that the LiPDGraph
contains all the datasets that are available through the LiPDverse, stored in a graph format. Remember that PyLiPD essentially takes a LiPD file and transforms it into a graph. We have already done this for the LiPDverse, so the datasets can already be queried there.
Goals#
Learn how to query the Graph database directly
Reading Time: 5 minutes
Keywords#
SPARQL
Pre-requisites#
Relevant Packages#
pandas
Data Description#
This notebook makes use of the LiPDGraph
database, a graph database that contains datasets from various working groups. For a list of these compilations, see this page.
In particular, we will be using PAGES2k and iso2k for this demonstration.
PAGES2k: PAGES2k Consortium., Emile-Geay, J., McKay, N. et al. A global multiproxy database for temperature reconstructions of the Common Era. Sci Data 4, 170088 (2017). doi:10.1038/sdata.2017.88
iso2k: Konecky, B. L., McKay, N. P., Churakova (Sidorova), O. V., Comas-Bru, L., Dassié, E. P., DeLong, K. L., Falster, G. M., Fischer, M. J., Jones, M. D., Jonkers, L., Kaufman, D. S., Leduc, G., Managave, S. R., Martrat, B., Opel, T., Orsi, A. J., Partin, J. W., Sayani, H. R., Thomas, E. K., Thompson, D. M., Tyler, J. J., Abram, N. J., Atwood, A. R., Cartapanis, O., Conroy, J. L., Curran, M. A., Dee, S. G., Deininger, M., Divine, D. V., Kern, Z., Porter, T. J., Stevenson, S. L., von Gunten, L., and Iso2k Project Members: The Iso2k database: a global compilation of paleo-δ18O and δ2H records to aid understanding of Common Era climate, Earth Syst. Sci. Data, 12, 2261–2288, https://doi.org/10.5194/essd-12-2261-2020, 2020.
Demonstration#
Let’s import a few web packages along with pandas.
import json
import requests
import pandas as pd
import io
import ast
Simple query#
Let’s start with one of the simple queries we have been playing with. The LiPDGraph
knowledge base is organized very similarly to a LiPD
object, with each dataset stored in its own graph. So the queries you have been learning in the previous tutorials also apply.
Let’s retrieve 10 random dataset names. But first we need to point towards the endpoint for the query:
url = 'https://linkedearth.graphdb.mint.isi.edu/repositories/LiPDVerse-dynamic'
We periodically version the LiPDGraph for reproducibility purposes. LiPDverse-dynamic
refers to the most up-to-date version of the LiPDGraph. Once your study is done, we encourage you to point towards the latest frozen version.
Next, let’s write our queries:
query ="""
PREFIX le:<http://linked.earth/ontology#>
SELECT ?ds ?dsname WHERE{
?ds a le:Dataset .
?ds le:hasName ?dsname .
}
ORDER BY RAND() LIMIT 10
"""
This should be very familiar, as this is the first query we performed in the custom queries tutorial. Now, let’s send it to our SPARQL endpoint and retrieve the results, which we will store in a pandas DataFrame.
Note: The next cell might take a little bit of time to run since the query has to be performed on the remote server before sending the data back.
response = requests.post(url, data = {'query': query})
data = io.StringIO(response.text)
df = pd.read_csv(data, sep=",")
df
| | ds | dsname |
|---|---|---|
| 0 | http://linked.earth/lipd/Cleveland.Davis.1986 | Cleveland.Davis.1986 |
| 1 | http://linked.earth/lipd/TegelerSee.Brande.1996 | TegelerSee.Brande.1996 |
| 2 | http://linked.earth/lipd/JakesValleyPluvialLak... | JakesValleyPluvialLakeJake.Garcia.2006 |
| 3 | http://linked.earth/lipd/HayesLake.McAndrews.1982 | HayesLake.McAndrews.1982 |
| 4 | http://linked.earth/lipd/Lisletjonn.Birks.2007 | Lisletjonn.Birks.2007 |
| 5 | http://linked.earth/lipd/GeoB6211-2.Chiessi.2009 | GeoB6211-2.Chiessi.2009 |
| 6 | http://linked.earth/lipd/Bereziuk.Richard.1979 | Bereziuk.Richard.1979 |
| 7 | http://linked.earth/lipd/Selikhovo.Novenko.2016 | Selikhovo.Novenko.2016 |
| 8 | http://linked.earth/lipd/RosleNowe.Krajewski.1984 | RosleNowe.Krajewski.1984 |
| 9 | http://linked.earth/lipd/AxeLake.author.1111 | AxeLake.author.1111 |
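Since this post-and-parse pattern will come back for the more complex query below, you may find it convenient to wrap it in a small helper. This is only a sketch: run_sparql_query is a hypothetical name, not part of PyLiPD or the GraphDB API, and it assumes the endpoint keeps returning CSV-formatted results as above.
import io
import requests
import pandas as pd

def run_sparql_query(endpoint, query):
    # Send a SPARQL query to the endpoint and return the results as a pandas DataFrame
    response = requests.post(endpoint, data={'query': query})
    response.raise_for_status()  # fail early if the server returns an error code
    return pd.read_csv(io.StringIO(response.text), sep=",")

# Usage, equivalent to the cells above:
# df = run_sparql_query(url, query)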
Complex queries#
OK, let’s work on something more complex. Let’s say that you want to retrieve all the data in the PAGES2k and iso2k compilations that were tagged for use in the temperature analysis for these compilations, along with metadata regarding the location, the type of archive, and the proxy used.
Don’t forget that each column is stored separately so you also need to go grab the time axis!
query = """PREFIX le: <http://linked.earth/ontology#>
PREFIX le_var: <http://linked.earth/ontology/variables#>
PREFIX wgs84: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?dataSetName ?archiveType ?geo_meanLat ?geo_meanLon ?geo_meanElev
?paleoData_variableName ?paleoData_values ?paleoData_units
?paleoData_proxy ?paleoData_proxyGeneral ?time_variableName ?time_values
?time_units ?compilationName ?TSID where{
?ds a le:Dataset .
?ds le:hasName ?dataSetName .
OPTIONAL{
?ds le:hasArchiveType ?archiveTypeObj .
?archiveTypeObj rdfs:label ?archiveType .
}
?ds le:hasLocation ?loc .
OPTIONAL{?loc wgs84:lat ?geo_meanLat .}
OPTIONAL{?loc wgs84:long ?geo_meanLon .}
OPTIONAL {?loc wgs84:alt ?geo_meanElev .}
?ds le:hasPaleoData ?data .
?data le:hasMeasurementTable ?table .
?table le:hasVariable ?var .
?var le:hasName ?paleoData_variableName .
?var le:hasValues ?paleoData_values .
OPTIONAL{
?var le:hasUnits ?paleoData_unitsObj .
?paleoData_unitsObj rdfs:label ?paleoData_units .
}
OPTIONAL{
?var le:hasProxy ?paleoData_proxyObj .
?paleoData_proxyObj rdfs:label ?paleoData_proxy .
}
OPTIONAL{
?var le:hasProxyGeneral ?paleoData_proxyGeneralObj .
?paleoData_proxyGeneralObj rdfs:label ?paleoData_proxyGeneral .
}
?var le:partOfCompilation ?compilation .
?compilation le:hasName ?compilationName .
VALUES ?compilationName {"iso2k" "Pages2kTemperature"} .
?var le:useInGlobalTemperatureAnalysis True .
OPTIONAL{
?var le:hasVariableId ?TSID
} .
?table le:hasVariable ?timevar .
?timevar le:hasName ?time_variableName .
?timevar le:hasStandardVariable le_var:year .
?timevar le:hasValues ?time_values .
OPTIONAL{
?timevar le:hasUnits ?time_unitsObj .
?time_unitsObj rdfs:label ?time_units .
}
}"""
Pretty long! But let’s take it paragraph by paragraph.
The first paragraph corresponds to our prefixes. You should be familiar with them by now.
The second paragraph lists all the variables that we wish to return:
the name of the dataset
the type of archive the measurements were made on
the geographical coordinates (lat/lon/elevation)
the name of the paleovariable, with associated units and values
the type of proxy used to obtain the data
information about the time axis (name, units, and values)
the name of the compilation the data is part of
a unique identifier for the variables
The third paragraph:
?ds a le:Dataset .
?ds le:hasName ?dataSetName .
OPTIONAL{
?ds le:hasArchiveType ?archiveTypeObj .
?archiveTypeObj rdfs:label ?archiveType .
}
performs a query to return the information stored at the dataset level (namely, the name of the dataset and the type of archive).
The fourth paragraph:
?ds le:hasLocation ?loc .
OPTIONAL{?loc wgs84:lat ?geo_meanLat .}
OPTIONAL{?loc wgs84:long ?geo_meanLon .}
OPTIONAL {?loc wgs84:alt ?geo_meanElev .}
returns geographical coordinates.
The fifth paragraph digs down to the variable level:
?ds le:hasPaleoData ?data .
?data le:hasMeasurementTable ?table .
?table le:hasVariable ?var .
?var le:hasName ?paleoData_variableName .
?var le:hasValues ?paleoData_values .
OPTIONAL{
?var le:hasUnits ?paleoData_unitsObj .
?paleoData_unitsObj rdfs:label ?paleoData_units .
}
OPTIONAL{
?var le:hasProxy ?paleoData_proxyObj .
?paleoData_proxyObj rdfs:label ?paleoData_proxy .
}
OPTIONAL{
?var le:hasProxyGeneral ?paleoData_proxyGeneralObj .
?paleoData_proxyGeneralObj rdfs:label ?paleoData_proxyGeneral .
}
?var le:partOfCompilation ?compilation .
?compilation le:hasName ?compilationName .
VALUES ?compilationName {"iso2k" "Pages2kTemperature"} .
?var le:useInGlobalTemperatureAnalysis True .
OPTIONAL{
?var le:hasVariableId ?TSID
} .
and asks for the name of the variable, its values and units, the proxy information, and its unique identifier. Notice the two filters: one on the compilation the datasets belong to, and one that retains only the variables these working groups have flagged for use in a global temperature analysis.
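If you want only one of the compilations, or a different one altogether, the VALUES clause is the place to edit. A hypothetical variation, assuming the VALUES line appears verbatim in the query string defined above:
# Restrict the results to the iso2k compilation only
iso2k_only_query = query.replace(
    'VALUES ?compilationName {"iso2k" "Pages2kTemperature"} .',
    'VALUES ?compilationName {"iso2k"} .'
)
# iso2k_only_query can then be sent to the endpoint exactly like query below.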
The last paragraph:
?table le:hasVariable ?timevar .
?timevar le:hasName ?time_variableName .
?timevar le:hasStandardVariable le_var:year .
?timevar le:hasValues ?time_values .
OPTIONAL{
?timevar le:hasUnits ?time_unitsObj .
?time_unitsObj rdfs:label ?time_units .
}
looks for the time information associated with the data, identifying the time axis through the le_var:year standard variable and retrieving its name, values, and units.
Let’s send our query over to the server and get the results into a dataframe:
response = requests.post(url, data = {'query': query})
data = io.StringIO(response.text)
df = pd.read_csv(data, sep=",")
df.head()
| | dataSetName | archiveType | geo_meanLat | geo_meanLon | geo_meanElev | paleoData_variableName | paleoData_values | paleoData_units | paleoData_proxy | paleoData_proxyGeneral | time_variableName | time_values | time_units | compilationName | TSID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LS16STCL | Lake sediment | 50.8300 | -116.3900 | 1126.0 | d18O | [-7.81, -5.91, -9.03, -5.35, -5.61, -5.98, -5.... | permil | d18O | NaN | year | [2009.0, 2008.3, 2007.8, 2007.4, 2007.0, 2006.... | yr AD | iso2k | LPD7dc5b9ba |
| 1 | CO00URMA | Coral | 0.9330 | 173.0000 | 6.0 | d18O | [-4.8011, -4.725, -4.6994, -4.86, -5.0886, -5.... | permil | d18O | NaN | year | [1994.5, 1994.33, 1994.17, 1994.0, 1993.83, 19... | yr AD | iso2k | Ocean2kHR_177 |
| 2 | CO05KUBE | Coral | 32.4670 | -64.7000 | -12.0 | d18O | [-4.15, -3.66, -3.69, -4.07, -3.95, -4.12, -3.... | permil | d18O | NaN | year | [1983.21, 1983.13, 1983.04, 1982.96, 1982.88, ... | yr AD | iso2k | Ocean2kHR_105 |
| 3 | IC13THQU | Glacier ice | -13.9333 | -70.8333 | 5670.0 | d18O | [-18.5905, -16.3244, -16.2324, -17.0112, -18.6... | permil | d18O | NaN | year | [2009, 2008, 2007, 2006, 2005, 2004, 2003, 200... | yr AD | iso2k | SAm_035 |
| 4 | CO01TUNG | Coral | -5.2170 | 145.8170 | -3.0 | d18O | [-4.875, -4.981, -5.166, -5.06, -4.942, -4.919... | permil | d18O | NaN | year | [1993.042, 1992.792, 1992.542, 1992.292, 1992.... | yr AD | iso2k | Ocean2kHR_141 |
Much like in PyLiPD, the information is organized in columns, with each row representing a time series. Let’s have a look at the type of an entry in the paleoData_values column:
type(df['paleoData_values'].iloc[0])
str
This “list” is actually a string. Why? Keep in mind that most knowledge bases don’t contain the actual data at all, but rather a link to a file containing them, so storing the values directly is already unusual and an improvement. Could we have stored the data as lists of floats? The problem comes down to memory and query efficiency, so strings were preferred.
So what does that mean for your workflow? First, you will need to convert each entry to a list or numpy array. Here is one way to do so:
df['paleoData_values']=df['paleoData_values'].apply(lambda row : json.loads(row) if isinstance(row, str) else row)
df['time_values']=df['time_values'].apply(lambda row : json.loads(row) if isinstance(row, str) else row)
type(df['paleoData_values'].iloc[0])
list
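As an aside, the ast module imported at the top offers an alternative parser: ast.literal_eval can evaluate the same bracketed strings into Python lists, although, unlike json.loads, it fails on non-literal entries such as NaN. A possible combination, sketched here with a hypothetical parse_values helper:
import ast
import json

def parse_values(cell):
    # Parse a stringified list of numbers; fall back to json.loads for NaN/Infinity tokens
    if not isinstance(cell, str):
        return cell
    try:
        return ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return json.loads(cell)

# df['paleoData_values'] = df['paleoData_values'].apply(parse_values)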
Second, it means that you cannot perform queries on the data themselves. We store some information about the data (e.g., maximum/minimum value, resolution) as metadata that is computed when a LiPD file is created, so you can query these properties to filter datasets. But you cannot query anything about the data directly (e.g., how many data points are in the time series, or how many fall between two time points) until you have loaded them into your workspace.
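Once the values are loaded into your workspace as lists, such data-level questions become ordinary pandas and numpy operations. Here is a minimal sketch (assuming the conversion cell above has been run; the n_points column and the covers_interval function are illustrative additions, not part of the LiPD schema):
import numpy as np

# Number of data points in each time series
df['n_points'] = df['paleoData_values'].apply(len)

# Keep only the records with at least one sample between 1000 and 2000 CE
def covers_interval(times, start=1000, end=2000):
    t = np.asarray(times, dtype=float)
    return bool(((t >= start) & (t <= end)).any())

subset = df[df['time_values'].apply(covers_interval)]
print(f"{len(df)} records total, {len(subset)} with samples between 1000 and 2000 CE")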