Directly querying the LiPDGraph#

Authors#

Deborah Khider

Preamble#

This tutorial demonstrates a query on the LiPDGraph using the direct SPARQL endpoint. You might ask yourself: “What is the LiPDGraph and how does it differ from the Lipdverse?” The short answer is that the LiPDGraph contains all the datasets that are available through the Lipdverse, stored in a graph format. Remember that PyLiPD essentially takes a LiPD file and transforms it into a graph. We have already done this process for the LiPDverse, so those datasets can already be queried there.

Goals#

  • Learn how to query the Graph database directly

Reading Time: 5 minutes

Keywords#

SPARQL

Pre-requisites#

Relevant Packages#

pandas

Data Description#

This notebook makes use of the LiPDGraph database, a graph database that contains datasets from various working groups. For a list of these compilations, see this page.

In particular, we will be using PAGES2k and iso2k for this demonstration.

  • PAGES2k: PAGES2k Consortium., Emile-Geay, J., McKay, N. et al. A global multiproxy database for temperature reconstructions of the Common Era. Sci Data 4, 170088 (2017). doi:10.1038/sdata.2017.88

  • iso2k: Konecky, B. L., McKay, N. P., Churakova (Sidorova), O. V., Comas-Bru, L., Dassié, E. P., DeLong, K. L., Falster, G. M., Fischer, M. J., Jones, M. D., Jonkers, L., Kaufman, D. S., Leduc, G., Managave, S. R., Martrat, B., Opel, T., Orsi, A. J., Partin, J. W., Sayani, H. R., Thomas, E. K., Thompson, D. M., Tyler, J. J., Abram, N. J., Atwood, A. R., Cartapanis, O., Conroy, J. L., Curran, M. A., Dee, S. G., Deininger, M., Divine, D. V., Kern, Z., Porter, T. J., Stevenson, S. L., von Gunten, L., and Iso2k Project Members: The Iso2k database: a global compilation of paleo-δ18O and δ2H records to aid understanding of Common Era climate, Earth Syst. Sci. Data, 12, 2261–2288, https://doi.org/10.5194/essd-12-2261-2020, 2020.

Demonstration#

Let’s import a few web packages along with pandas.

import json
import requests
import pandas as pd
import io
import ast

Simple query#

Let’s start with one of the simple queries we have been playing with. The LiPDGraph knowledge base is organized very similarly to a LiPD object, with each dataset stored in its own graph. So the queries you have been learning in the previous tutorials also apply.

Let’s retrieve 10 random dataset names. But first we need to point towards the endpoint for the query:

url = 'https://linkedearth.graphdb.mint.isi.edu/repositories/LiPDVerse-dynamic'

We periodically version the LiPDGraph for reproducibility purposes. LiPDverse-dynamic refers to the most up-to-date version of the LiPDGraph. Once your study is done, we encourage you to point towards the latest frozen version so that your results remain reproducible.

Next, let’s write our queries:

query ="""

PREFIX le:<http://linked.earth/ontology#>

SELECT ?ds ?dsname WHERE{

?ds a le:Dataset .
?ds le:hasName ?dsname .
} 

ORDER BY RAND() LIMIT 10

"""

This should be very familiar as this is the first query we performed in the custom queries tutorial. Now, let’s send it to our SPARQL endpoint and retrieve the results, which we will store in a pandas DataFrame.

Note: The next cell might take a little bit of time to run since the query has to be performed on the remote server before sending the data back.

response = requests.post(url, data = {'query': query})

data = io.StringIO(response.text)
df = pd.read_csv(data, sep=",")

df
ds dsname
0 http://linked.earth/lipd/KR20SAR01 KR20SAR01
1 http://linked.earth/lipd/ClatteringshawsLoch.B... ClatteringshawsLoch.Birks.1975
2 http://linked.earth/lipd/WildseemoorKaltenbron... WildseemoorKaltenbronn.Rosch.2009
3 http://linked.earth/lipd/CO04LIFI CO04LIFI
4 http://linked.earth/lipd/AchitNur.Tarasov.1994 AchitNur.Tarasov.1994
5 http://linked.earth/lipd/CO13CAHN CO13CAHN
6 http://linked.earth/lipd/ThorpeBulmer.EPD ThorpeBulmer.EPD
7 http://linked.earth/lipd/Craterlake.Arcusa.2020 Craterlake.Arcusa.2020
8 http://linked.earth/lipd/AustreNevlingen.Kjell... AustreNevlingen.Kjellman.2020
9 http://linked.earth/lipd/DevilsLake.Shuman.2016 DevilsLake.Shuman.2016
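
If you want to be a bit more defensive, you can ask the endpoint for CSV explicitly and check the HTTP status before parsing. This is an optional variation on the cell above, not something the query requires; the Accept header and raise_for_status() call are standard features of the SPARQL HTTP protocol and the requests library:

# Optional: explicitly request CSV results and raise an error if the request fails
response = requests.post(url, data={'query': query}, headers={'Accept': 'text/csv'})
response.raise_for_status()

df = pd.read_csv(io.StringIO(response.text))
df.head()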

Complex queries#

Ok, let’s work on something more complex. Let’s say that you want to retrieve all the data in the PAGES2k and iso2k compilations that were tagged for use in the temperature analyses of these compilations, along with metadata regarding the location, the type of archive, and the proxy used.

Don’t forget that each column is stored separately so you also need to go grab the time axis!

query = """PREFIX le: <http://linked.earth/ontology#>
PREFIX le_var: <http://linked.earth/ontology/variables#>
PREFIX wgs84: <http://www.w3.org/2003/01/geo/wgs84_pos#>


PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?dataSetName ?archiveType ?geo_meanLat ?geo_meanLon ?geo_meanElev 
    ?paleoData_variableName ?paleoData_values ?paleoData_units 
    ?paleoData_proxy ?paleoData_proxyGeneral ?time_variableName ?time_values 
	?time_units ?compilationName ?TSID where{
    
    ?ds a le:Dataset .
    ?ds le:hasName ?dataSetName .
    OPTIONAL{
            ?ds le:hasArchiveType ?archiveTypeObj .
            ?archiveTypeObj rdfs:label ?archiveType .
        }
        
    ?ds le:hasLocation ?loc .
        OPTIONAL{?loc wgs84:lat ?geo_meanLat .}
        OPTIONAL{?loc wgs84:long ?geo_meanLon .}      
        OPTIONAL {?loc wgs84:alt ?geo_meanElev .}

    ?ds le:hasPaleoData ?data .
    ?data le:hasMeasurementTable ?table .
    ?table le:hasVariable ?var .
    ?var le:hasName ?paleoData_variableName .
    ?var le:hasValues ?paleoData_values .
    OPTIONAL{
        ?var le:hasUnits ?paleoData_unitsObj .
        ?paleoData_unitsObj rdfs:label ?paleoData_units .
    }
    OPTIONAL{
        ?var le:hasProxy ?paleoData_proxyObj .
        ?paleoData_proxyObj rdfs:label ?paleoData_proxy .
    }
    OPTIONAL{
        ?var le:hasProxyGeneral ?paleoData_proxyGeneralObj .
        ?paleoData_proxyGeneralObj rdfs:label ?paleoData_proxyGeneral .
    }
    ?var le:partOfCompilation ?compilation . 
    ?compilation le:hasName ?compilationName .
    VALUES ?compilationName {"iso2k" "Pages2kTemperature"} .
    ?var le:useInGlobalTemperatureAnalysis True .
    OPTIONAL{
        ?var le:hasVariableId ?TSID
    } .
        
    ?table le:hasVariable ?timevar .
    ?timevar le:hasName ?time_variableName .
    ?timevar le:hasStandardVariable le_var:year .
    ?timevar le:hasValues ?time_values .
    OPTIONAL{
        ?timevar le:hasUnits ?time_unitsObj .
        ?time_unitsObj rdfs:label ?time_units .
    }
}"""

Pretty long! But let’s take it paragraph by paragraph.

  • The first paragraph corresponds to our prefixes. You should be familiar with them by now.

  • The second paragraph lists all the variables that we wish to return:

    • the name of the dataset

    • the type of archive the measurements were made on

    • geographical coordinates (lat/lon/elevation)

    • the name of the paleovariable, associated units and values.

    • The type of proxy used to obtain the data

    • Information about the time axis (name, units and values)

    • the name of the compilation the data is part of

    • a unique identifier for the variables

  • The third paragraph:

?ds a le:Dataset .
?ds le:hasName ?dataSetName .
OPTIONAL{
    ?ds le:hasArchiveType ?archiveTypeObj .
    ?archiveTypeObj rdfs:label ?archiveType .
        }

performs a query that returns the information stored at the dataset level (namely the name of the dataset and the type of archive).

  • The fourth paragraph:

?ds le:hasLocation ?loc .
OPTIONAL{?loc wgs84:lat ?geo_meanLat .}
OPTIONAL{?loc wgs84:long ?geo_meanLon .}      
OPTIONAL {?loc wgs84:alt ?geo_meanElev .}

returns geographical coordinates.

  • The fifth paragraph digs down to the variable level:

?ds le:hasPaleoData ?data .
    ?data le:hasMeasurementTable ?table .
    
    ?table le:hasVariable ?var .
    ?var le:hasName ?paleoData_variableName .
    ?var le:hasValues ?paleoData_values .
    
    OPTIONAL{
        ?var le:hasUnits ?paleoData_unitsObj .
        ?paleoData_unitsObj rdfs:label ?paleoData_units .
    }
    OPTIONAL{
        ?var le:hasProxy ?paleoData_proxyObj .
        ?paleoData_proxyObj rdfs:label ?paleoData_proxy .
    }
    OPTIONAL{
        ?var le:hasProxyGeneral ?paleoData_proxyGeneralObj .
        ?paleoData_proxyGeneralObj rdfs:label ?paleoData_proxyGeneral .
    }
    ?var le:partOfCompilation ?compilation . 
    ?compilation le:hasName ?compilationName .
    VALUES ?compilationName {"iso2k" "Pages2kTemperature"} .
    ?var le:useInGlobalTemperatureAnalysis True .

    OPTIONAL{
        ?var le:hasVariableId ?TSID
    } .

and asks for the name of the variable, its values, units, proxy information, and unique identifier. Notice the two filters: one for the compilations the datasets belong to, and the other to retain only the variables that these working groups have identified as being used in a global temperature analysis (a sketch showing how to adjust the compilation filter follows this walkthrough).

  • The last paragraph:

?table le:hasVariable ?timevar .
?timevar le:hasName ?time_variableName .
?timevar le:hasStandardVariable le_var:year .
?timevar le:hasValues ?time_values .
    OPTIONAL{
        ?timevar le:hasUnits ?time_unitsObj .
        ?time_unitsObj rdfs:label ?time_units .
    }

looks for the time information associated with the data: it picks up the variable on the same table that maps to the standard variable year, along with its name, values, and units.
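
For example, if you only wanted the iso2k records, you could narrow the VALUES clause before sending the query. The snippet below is a minimal sketch that simply edits the query string defined above; query_iso2k and df_iso2k are hypothetical names:

# Hypothetical variation: restrict the compilation filter to iso2k only
query_iso2k = query.replace(
    'VALUES ?compilationName {"iso2k" "Pages2kTemperature"} .',
    'VALUES ?compilationName {"iso2k"} .',
)
response = requests.post(url, data={'query': query_iso2k})
df_iso2k = pd.read_csv(io.StringIO(response.text))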

Let’s send our query over to the server and get the results into a dataframe:

response = requests.post(url, data = {'query': query})

data = io.StringIO(response.text)
df = pd.read_csv(data, sep=",")
df.head()
dataSetName archiveType geo_meanLat geo_meanLon geo_meanElev paleoData_variableName paleoData_values paleoData_units paleoData_proxy paleoData_proxyGeneral time_variableName time_values time_units compilationName TSID
0 LS16STCL Lake sediment 50.8300 -116.3900 1126.0 d18O [-7.81, -5.91, -9.03, -5.35, -5.61, -5.98, -5.... permil d18O NaN year [2009.0, 2008.3, 2007.8, 2007.4, 2007.0, 2006.... yr AD iso2k LPD7dc5b9ba
1 CO00URMA Coral 0.9330 173.0000 6.0 d18O [-4.8011, -4.725, -4.6994, -4.86, -5.0886, -5.... permil d18O NaN year [1994.5, 1994.33, 1994.17, 1994.0, 1993.83, 19... yr AD iso2k Ocean2kHR_177
2 CO05KUBE Coral 32.4670 -64.7000 -12.0 d18O [-4.15, -3.66, -3.69, -4.07, -3.95, -4.12, -3.... permil d18O NaN year [1983.21, 1983.13, 1983.04, 1982.96, 1982.88, ... yr AD iso2k Ocean2kHR_105
3 IC13THQU Glacier ice -13.9333 -70.8333 5670.0 d18O [-18.5905, -16.3244, -16.2324, -17.0112, -18.6... permil d18O NaN year [2009, 2008, 2007, 2006, 2005, 2004, 2003, 200... yr AD iso2k SAm_035
4 CO01TUNG Coral -5.2170 145.8170 -3.0 d18O [-4.875, -4.981, -5.166, -5.06, -4.942, -4.919... permil d18O NaN year [1993.042, 1992.792, 1992.542, 1992.292, 1992.... yr AD iso2k Ocean2kHR_141

Much like in PyLiPD, the information is organized in columns, with each row representing a time series. Let’s have a look at the type of an entry in the values column:

type(df['paleoData_values'].iloc[0])
str

What looks like a list is actually stored as a string. Why? Keep in mind that most knowledge bases won’t contain the actual data, but rather a link to a file containing the data, so having the values available in the graph at all is quite unique and an improvement. Could we have stored the data as lists of floats? The problem comes down to memory and efficiency, so strings were preferred.

So what does that mean for your workflow? First, you will need to convert each row to a NumPy array or a list. Here is a way to do so:

# Parse the JSON-encoded strings back into Python lists
df['paleoData_values'] = df['paleoData_values'].apply(lambda row: json.loads(row) if isinstance(row, str) else row)
df['time_values'] = df['time_values'].apply(lambda row: json.loads(row) if isinstance(row, str) else row)
type(df['paleoData_values'].iloc[0])
list
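
If you prefer NumPy arrays for numerical work downstream, you can convert the parsed lists in a second pass. This is a minimal sketch; numpy is an extra import that the cells above do not use:

import numpy as np

# Optional: convert the parsed lists into NumPy arrays
df['paleoData_values'] = df['paleoData_values'].apply(np.array)
df['time_values'] = df['time_values'].apply(np.array)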

Second, it means that you cannot perform queries on the data values themselves. We store some information about the data (e.g., maximum/minimum value, resolution) as metadata that is captured when a LiPD file is created, so you can perform queries on these fields to filter datasets. But you cannot query anything about the data directly (e.g., how many data points are in the time series, or how many fall between two time points in the series) until you have loaded them into your workspace.
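
Once the values are loaded, however, those kinds of questions become ordinary pandas operations. Here is a minimal sketch, assuming the conversion cell above has been run; the 200-point threshold and the n_points column name are arbitrary choices for illustration:

# Count the number of observations in each time series (requires parsed lists/arrays)
df['n_points'] = df['paleoData_values'].apply(len)

# Keep only the records with at least 200 observations (arbitrary example threshold)
long_records = df[df['n_points'] >= 200]
long_records[['dataSetName', 'compilationName', 'n_points']].head()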