Assembling the proxy database from the LiPDGraph#
Goal#
The goal of this notebook is to showcase data querying on the LiPDGraph to create the proxy database used by cfr for data assimilation.
Pre-requisite#
An understanding of the workflow presented in assembling the database from the LiPDverse notebook.
An understanding of the Linked Paleo Data format (LiPD).
A basic understanding of Dynamic Time Warping.
Reading time#
10 min
Data Description#
This notebook makes use of the LiPDGraph
database, a graph database that contains datasets from various working groups. For a list of these compilations, see this page.
In particular, we will be using the PAGES2k Temperature, iso2k, and CoralHydro2k compilations for this demonstration.
PAGES2k: PAGES2k Consortium, Emile-Geay, J., McKay, N., et al. A global multiproxy database for temperature reconstructions of the Common Era. Sci Data 4, 170088 (2017). doi:10.1038/sdata.2017.88
iso2k: Konecky, B. L., McKay, N. P., Churakova (Sidorova), O. V., Comas-Bru, L., Dassié, E. P., DeLong, K. L., Falster, G. M., Fischer, M. J., Jones, M. D., Jonkers, L., Kaufman, D. S., Leduc, G., Managave, S. R., Martrat, B., Opel, T., Orsi, A. J., Partin, J. W., Sayani, H. R., Thomas, E. K., Thompson, D. M., Tyler, J. J., Abram, N. J., Atwood, A. R., Cartapanis, O., Conroy, J. L., Curran, M. A., Dee, S. G., Deininger, M., Divine, D. V., Kern, Z., Porter, T. J., Stevenson, S. L., von Gunten, L., and Iso2k Project Members: The Iso2k database: a global compilation of paleo-δ18O and δ2H records to aid understanding of Common Era climate, Earth Syst. Sci. Data, 12, 2261–2288, https://doi.org/10.5194/essd-12-2261-2020, 2020.
CoralHydro2k: Walter, R. M., H. R. Sayani, T. Felis, K. M. Cobb, N. J. Abram, A. K. Arzey, A. R. Atwood, L. D. Brenner, E. P. Dassié, K. L. DeLong, B. Ellis, J. Emile-Geay, M. J. Fischer, N. F. Goodkin, J. A. Hargreaves, K. H. Kilbourne, H. Krawczyk, N. P. McKay, A. L. Moore, S. A. Murty, M. R. Ong, R. D. Ramos, E. V. Reed, D. Samanta, S. C. Sanchez, J. Zinke, and the PAGES CoralHydro2k Project Members (2023), The CoralHydro2k database: a global, actively curated compilation of coral δ18O and Sr / Ca proxy records of tropical ocean hydrology and temperature for the Common Era, Earth System Science Data, 15(5), 2081–2116, doi:10.5194/essd-15-2081-2023.
Let’s import the needed packages:
import os

# General data manipulation libraries
import numpy as np
import pandas as pd

# To query the graph and parse the response
import json
import requests
import io
Data Retrieval#
Following this tutorial, we perform a SPARQL query on the LiPDGraph to assemble the most current dataset.
We point to the dynamic version of the graph to get the most up-to-date files. For reproducibility, you can change the URL to one of the versioned graphs.
url = 'https://linkedearth.graphdb.mint.isi.edu/repositories/LiPDVerse-dynamic'
Querying the PAGES2k datasets#
In the first query, we retrieve all the data in the PAGES2k compilation that were tagged for use in the temperature analysis for this compilation, along with relevant metadata.
We will return the following information if present:
dataSetName: The name of the dataset
compilationName: The name of the compilation. This will be used for filtering later on.
archiveType: The type of archive on which the measurements were made
Latitude, Longitude, and Elevation: The geographical coordinates for the site
paleoData_variableName: The name of the paleo variable as originally reported by the investigator
paleoData_standardName: The standard name for the paleo variable
paleoData_values: The values for the paleo variable
paleoData_units: The units associated with the paleo variable
paleoData_proxy: The proxy used to infer paleoenvironmental conditions
paleoData_proxyGeneral: A generalized grouping for the proxy
paleoData_interpName: The interpretation for the proxy record
paleoData_interpRank: The rank for the interpretation
paleoData_seasonality: The seasonality for the proxy
TSID: The unique ID for the timeseries
time_variableName: The name of the time variable. Since we are dealing with the Common Era, these should all be "year".
time_standardName: The standard name for the time variable. Since we are dealing with the Common Era, these should all be "year", so let's use this as a filter.
time_values: The values for the time axis
time_units: The units for the time axis
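Before issuing the full query, it can help to confirm that the endpoint is reachable. Below is a minimal sketch using a small count query; `count_datasets` is a helper name introduced here for illustration, not part of any library:

```python
import io

import pandas as pd
import requests


def count_datasets(url):
    """Return the number of distinct datasets in the graph (quick sanity check)."""
    count_query = """PREFIX le: <http://linked.earth/ontology#>
    SELECT (COUNT(DISTINCT ?ds) AS ?n) WHERE { ?ds a le:Dataset . }"""
    response = requests.post(url, data={'query': count_query})
    response.raise_for_status()  # surface HTTP errors early
    # The endpoint returns CSV: a header row ('n') and one value
    return int(pd.read_csv(io.StringIO(response.text))['n'].iloc[0])
```

A non-zero count confirms the repository answered; an exception usually points to a connectivity or URL problem.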
query1 = """ PREFIX le: <http://linked.earth/ontology#>
PREFIX wgs84: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?dataSetName ?compilationName ?archiveType ?geo_meanLat ?geo_meanLon ?geo_meanElev
?paleoData_variableName ?paleoData_standardName ?paleoData_values ?paleoData_units
?paleoData_proxy ?paleoData_proxyGeneral ?paleoData_seasonality ?paleoData_interpName ?paleoData_interpRank
?TSID ?time_variableName ?time_standardName ?time_values
?time_units where{
?ds a le:Dataset .
?ds le:hasName ?dataSetName .
# Get the archive
OPTIONAL{
?ds le:hasArchiveType ?archiveTypeObj .
?archiveTypeObj rdfs:label ?archiveType .
}
# Geographical information
?ds le:hasLocation ?loc .
OPTIONAL{?loc wgs84:lat ?geo_meanLat .}
OPTIONAL{?loc wgs84:long ?geo_meanLon .}
OPTIONAL{?loc wgs84:alt ?geo_meanElev .}
# PaleoData
?ds le:hasPaleoData ?data .
?data le:hasMeasurementTable ?table .
?table le:hasVariable ?var .
# Name
?var le:hasName ?paleoData_variableName .
?var le:hasStandardVariable ?variable_obj .
?variable_obj rdfs:label ?paleoData_standardName .
#Values
?var le:hasValues ?paleoData_values .
#Seasonality
OPTIONAL{?var le:seasonality ?paleoData_seasonality} .
#Units
OPTIONAL{
?var le:hasUnits ?paleoData_unitsObj .
?paleoData_unitsObj rdfs:label ?paleoData_units .
}
#Proxy information
OPTIONAL{
?var le:hasProxy ?paleoData_proxyObj .
?paleoData_proxyObj rdfs:label ?paleoData_proxy .
}
OPTIONAL{
?var le:hasProxyGeneral ?paleoData_proxyGeneralObj .
?paleoData_proxyGeneralObj rdfs:label ?paleoData_proxyGeneral .
}
# Compilation
?var le:partOfCompilation ?compilation .
?compilation le:hasName ?compilationName .
FILTER (?compilationName = "Pages2kTemperature").
?var le:useInGlobalTemperatureAnalysis True .
# TSiD (should all have them)
OPTIONAL{
?var le:hasVariableId ?TSID
} .
# Interpretation (might be an optional field)
OPTIONAL{
?var le:hasInterpretation ?interp .
?interp le:hasVariable ?interpvar .
BIND(REPLACE(STR(?interpvar), "http://linked.earth/ontology/interpretation#", "") AS ?interpretedVariable_Fallback)
OPTIONAL { ?interpvar rdfs:label ?interpretedVariable_Label } # Use a temporary variable for the label
BIND(COALESCE(?interpretedVariable_Label, ?interpretedVariable_Fallback) AS ?paleoData_interpName) # COALESCE into the final variable
}
OPTIONAL{
?var le:hasInterpretation ?interp .
?interp le:hasRank ?paleoData_interpRank .}
OPTIONAL{
?var le:hasInterpretation ?interp .
?interp le:hasSeasonality ?seasonalityURI .
BIND(REPLACE(STR(?seasonalityURI), "http://linked.earth/ontology/interpretation#", "") AS ?seasonality_Fallback)
OPTIONAL { ?seasonalityURI rdfs:label ?seasonalityLabel }
BIND(COALESCE(?seasonalityLabel, ?seasonality_Fallback) AS ?paleoData_seasonality)
}
#Time information
?table le:hasVariable ?timevar .
?timevar le:hasName ?time_variableName .
?timevar le:hasStandardVariable ?timevar_obj .
?timevar_obj rdfs:label ?time_standardName .
VALUES ?time_standardName {"year"} .
?timevar le:hasValues ?time_values .
OPTIONAL{
?timevar le:hasUnits ?time_unitsObj .
?time_unitsObj rdfs:label ?time_units .
}
}"""
This query should look familiar, and for good reason: it is the same query we performed on the downloaded LiPD files from the LiPDverse in the previous notebook.
That is because PyLiPD creates an internal database on your machine that is similar in structure (i.e., they use the same mapping, the LinkedEarth ontology) to describe datasets.
Let’s run our query:
response = requests.post(url, data={'query': query1})
data = io.StringIO(response.text)
df1 = pd.read_csv(data, sep=",")
# Drop duplicates before decoding, while the value columns are still strings
df1 = df1.drop_duplicates()
# The value columns are returned as JSON-encoded strings; decode them to lists
df1['paleoData_values'] = df1['paleoData_values'].apply(lambda row: json.loads(row) if isinstance(row, str) else row)
df1['time_values'] = df1['time_values'].apply(lambda row: json.loads(row) if isinstance(row, str) else row)
df1.head()
dataSetName | compilationName | archiveType | geo_meanLat | geo_meanLon | geo_meanElev | paleoData_variableName | paleoData_standardName | paleoData_values | paleoData_units | paleoData_proxy | paleoData_proxyGeneral | paleoData_seasonality | paleoData_interpName | paleoData_interpRank | TSID | time_variableName | time_standardName | time_values | time_units | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Ant-WDC05A.Steig.2013 | Pages2kTemperature | Glacier ice | -79.46 | -112.09 | 1806.0 | d18O | d18O | [-33.32873325, -35.6732, -33.1574, -34.2854, -... | permil | d18O | NaN | Annual | temperature | NaN | Ant_030 | year | year | [2005, 2004, 2003, 2002, 2001, 2000, 1999, 199... | yr AD |
16 | Arc-Arjeplog.Bjorklund.2014 | Pages2kTemperature | Wood | 66.30 | 18.20 | 800.0 | density | density | [-0.829089212152348, -0.733882889924006, -0.89... | NaN | maximum latewood density | dendrophysical | Jun-Aug | temperature | NaN | Arc_060 | year | year | [1200, 1201, 1202, 1203, 1204, 1205, 1206, 120... | yr AD |
20 | NAm-SmithersSkiArea.Schweingruber.1996 | Pages2kTemperature | Wood | 54.90 | -127.30 | 1200.0 | MXD | MXD | [1.041, 1.062, 1.051, 0.997, 0.993, 0.959, 0.9... | NaN | maximum latewood density | dendrophysical | Summer | temperature | NaN | NAm_1120 | year | year | [1680, 1681, 1682, 1683, 1684, 1685, 1686, 168... | yr AD |
28 | Ocn-Mayotte.Zinke.2008 | Pages2kTemperature | Coral | -12.65 | 45.10 | -3.0 | d18O | d18O | [-4.72, -4.95, -5.45, -5.157, -5.05, -5.065, -... | permil | d18O | NaN | subannual | temperature | NaN | Ocean2kHR_002 | year | year | [1993.62, 1993.46, 1993.29, 1993.12, 1992.96, ... | yr AD |
36 | Ocn-LosRoques.Hetzinger.2008 | Pages2kTemperature | Coral | 11.77 | -66.75 | -2.0 | d18O | d18O | [-3.8123, -3.73082, -3.74912, -3.6656, -3.6995... | permil | d18O | NaN | subannual | temperature | NaN | Ocean2kHR_096 | year | year | [1917.79, 1917.87, 1917.96, 1918.04, 1918.12, ... | yr AD |
df1.info()
<class 'pandas.core.frame.DataFrame'>
Index: 690 entries, 0 to 15328
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 dataSetName 690 non-null object
1 compilationName 690 non-null object
2 archiveType 690 non-null object
3 geo_meanLat 690 non-null float64
4 geo_meanLon 690 non-null float64
5 geo_meanElev 683 non-null float64
6 paleoData_variableName 690 non-null object
7 paleoData_standardName 690 non-null object
8 paleoData_values 690 non-null object
9 paleoData_units 282 non-null object
10 paleoData_proxy 690 non-null object
11 paleoData_proxyGeneral 464 non-null object
12 paleoData_seasonality 690 non-null object
13 paleoData_interpName 690 non-null object
14 paleoData_interpRank 1 non-null float64
15 TSID 690 non-null object
16 time_variableName 690 non-null object
17 time_standardName 690 non-null object
18 time_values 690 non-null object
19 time_units 690 non-null object
dtypes: float64(4), object(16)
memory usage: 113.2+ KB
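The same request-and-parse steps recur for each compilation below, so they can be factored into a small helper. This is a sketch under the assumptions of the cells above (the endpoint returns CSV); `query_to_df` is a name introduced here, not part of any library:

```python
import io
import json

import pandas as pd
import requests


def query_to_df(url, query, list_cols=('paleoData_values', 'time_values')):
    """POST a SPARQL query to the endpoint and return a cleaned DataFrame."""
    response = requests.post(url, data={'query': query})
    response.raise_for_status()
    df = pd.read_csv(io.StringIO(response.text))
    # Duplicates must be dropped before decoding, while the value columns
    # are still hashable strings
    df = df.drop_duplicates()
    for col in list_cols:
        if col in df.columns:
            df[col] = df[col].apply(
                lambda v: json.loads(v) if isinstance(v, str) else v
            )
    return df
```

With such a helper, each compilation reduces to a single call, e.g. `df1 = query_to_df(url, query1)`.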
Querying the iso2k dataset#
Let’s repeat our query for the datasets in the iso2k compilation:
query2 = """ PREFIX le: <http://linked.earth/ontology#>
PREFIX wgs84: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?dataSetName ?compilationName ?archiveType ?geo_meanLat ?geo_meanLon ?geo_meanElev
?paleoData_variableName ?paleoData_standardName ?paleoData_values ?paleoData_units
?paleoData_proxy ?paleoData_proxyGeneral ?paleoData_seasonality ?paleoData_interpName ?paleoData_interpRank
?TSID ?time_variableName ?time_standardName ?time_values
?time_units where{
?ds a le:Dataset .
?ds le:hasName ?dataSetName .
# Get the archive
OPTIONAL{
?ds le:hasArchiveType ?archiveTypeObj .
?archiveTypeObj rdfs:label ?archiveType .
}
# Geographical information
?ds le:hasLocation ?loc .
OPTIONAL{?loc wgs84:lat ?geo_meanLat .}
OPTIONAL{?loc wgs84:long ?geo_meanLon .}
OPTIONAL{?loc wgs84:alt ?geo_meanElev .}
# PaleoData
?ds le:hasPaleoData ?data .
?data le:hasMeasurementTable ?table .
?table le:hasVariable ?var .
# Name
?var le:hasName ?paleoData_variableName .
?var le:hasStandardVariable ?variable_obj .
?variable_obj rdfs:label ?paleoData_standardName .
#Values
?var le:hasValues ?paleoData_values .
#Seasonality
OPTIONAL{?var le:seasonality ?paleoData_seasonality} .
#Units
OPTIONAL{
?var le:hasUnits ?paleoData_unitsObj .
?paleoData_unitsObj rdfs:label ?paleoData_units .
}
#Proxy information
OPTIONAL{
?var le:hasProxy ?paleoData_proxyObj .
?paleoData_proxyObj rdfs:label ?paleoData_proxy .
}
OPTIONAL{
?var le:hasProxyGeneral ?paleoData_proxyGeneralObj .
?paleoData_proxyGeneralObj rdfs:label ?paleoData_proxyGeneral .
}
# Compilation
?var le:partOfCompilation ?compilation .
?compilation le:hasName ?compilationName .
FILTER (?compilationName = "iso2k").
?var le:useInGlobalTemperatureAnalysis True .
# TSiD (should all have them)
OPTIONAL{
?var le:hasVariableId ?TSID
} .
# Interpretation (might be an optional field)
OPTIONAL{
?var le:hasInterpretation ?interp .
?interp le:hasVariable ?interpvar .
BIND(REPLACE(STR(?interpvar), "http://linked.earth/ontology/interpretation#", "") AS ?interpretedVariable_Fallback)
OPTIONAL { ?interpvar rdfs:label ?interpretedVariable_Label } # Use a temporary variable for the label
BIND(COALESCE(?interpretedVariable_Label, ?interpretedVariable_Fallback) AS ?paleoData_interpName) # COALESCE into the final variable
}
OPTIONAL{
?var le:hasInterpretation ?interp .
?interp le:hasRank ?paleoData_interpRank .}
OPTIONAL{
?var le:hasInterpretation ?interp .
?interp le:hasSeasonality ?seasonalityURI .
BIND(REPLACE(STR(?seasonalityURI), "http://linked.earth/ontology/interpretation#", "") AS ?seasonality_Fallback)
OPTIONAL { ?seasonalityURI rdfs:label ?seasonalityLabel }
BIND(COALESCE(?seasonalityLabel, ?seasonality_Fallback) AS ?paleoData_seasonality)
}
#Time information
?table le:hasVariable ?timevar .
?timevar le:hasName ?time_variableName .
?timevar le:hasStandardVariable ?timevar_obj .
?timevar_obj rdfs:label ?time_standardName .
VALUES ?time_standardName {"year"} .
?timevar le:hasValues ?time_values .
OPTIONAL{
?timevar le:hasUnits ?time_unitsObj .
?time_unitsObj rdfs:label ?time_units .
}
}"""
response = requests.post(url, data={'query': query2})
data = io.StringIO(response.text)
df2 = pd.read_csv(data, sep=",")
df2 = df2.drop_duplicates()
df2['paleoData_values'] = df2['paleoData_values'].apply(lambda row: json.loads(row) if isinstance(row, str) else row)
df2['time_values'] = df2['time_values'].apply(lambda row: json.loads(row) if isinstance(row, str) else row)
df2.head()
dataSetName | compilationName | archiveType | geo_meanLat | geo_meanLon | geo_meanElev | paleoData_variableName | paleoData_standardName | paleoData_values | paleoData_units | paleoData_proxy | paleoData_proxyGeneral | paleoData_seasonality | paleoData_interpName | paleoData_interpRank | TSID | time_variableName | time_standardName | time_values | time_units | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | LS16STCL | iso2k | Lake sediment | 50.83 | -116.39 | 1126.0 | d18O | d18O | [-7.81, -5.91, -9.03, -5.35, -5.61, -5.98, -5.... | permil | d18O | NaN | Winter | NaN | 5.0 | LPD7dc5b9ba-dup-dup | year | year | [2009.0, 2008.3, 2007.8, 2007.4, 2007.0, 2006.... | yr AD |
4 | LS16STCL | iso2k | Lake sediment | 50.83 | -116.39 | 1126.0 | d18O | d18O | [-7.81, -5.91, -9.03, -5.35, -5.61, -5.98, -5.... | permil | d18O | NaN | NaN | NaN | 3.0 | LPD7dc5b9ba-dup-dup | year | year | [2009.0, 2008.3, 2007.8, 2007.4, 2007.0, 2006.... | yr AD |
6 | LS16STCL | iso2k | Lake sediment | 50.83 | -116.39 | 1126.0 | d18O | d18O | [-7.81, -5.91, -9.03, -5.35, -5.61, -5.98, -5.... | permil | d18O | NaN | NaN | NaN | 4.0 | LPD7dc5b9ba-dup-dup | year | year | [2009.0, 2008.3, 2007.8, 2007.4, 2007.0, 2006.... | yr AD |
8 | LS16STCL | iso2k | Lake sediment | 50.83 | -116.39 | 1126.0 | d18O | d18O | [-7.81, -5.91, -9.03, -5.35, -5.61, -5.98, -5.... | permil | d18O | NaN | NaN | NaN | 2.0 | LPD7dc5b9ba-dup-dup | year | year | [2009.0, 2008.3, 2007.8, 2007.4, 2007.0, 2006.... | yr AD |
10 | LS16STCL | iso2k | Lake sediment | 50.83 | -116.39 | 1126.0 | d18O | d18O | [-7.81, -5.91, -9.03, -5.35, -5.61, -5.98, -5.... | permil | d18O | NaN | NaN | NaN | 1.0 | LPD7dc5b9ba-dup-dup | year | year | [2009.0, 2008.3, 2007.8, 2007.4, 2007.0, 2006.... | yr AD |
Let’s have a look at the interpretation field:
df2['paleoData_interpName'].unique()
array([nan, 'temperature', 'seawaterIsotope', 'precipitationIsotope',
'hydrologicBalance', 'salinity'], dtype=object)
And let's keep only the records whose interpretation is temperature:
df2_filt = df2[df2['paleoData_interpName']=='temperature'].drop_duplicates(subset='TSID', keep='first')
df2_filt.head()
dataSetName | compilationName | archiveType | geo_meanLat | geo_meanLon | geo_meanElev | paleoData_variableName | paleoData_standardName | paleoData_values | paleoData_units | paleoData_proxy | paleoData_proxyGeneral | paleoData_seasonality | paleoData_interpName | paleoData_interpRank | TSID | time_variableName | time_standardName | time_values | time_units | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
48 | CO00URMA | iso2k | Coral | 0.9330 | 173.0000 | 6.0 | d18O | d18O | [-4.8011, -4.725, -4.6994, -4.86, -5.0886, -5.... | permil | d18O | NaN | subannual | temperature | 1.0 | Ocean2kHR_177_iso2k | year | year | [1994.5, 1994.33, 1994.17, 1994.0, 1993.83, 19... | yr AD |
108 | CO05KUBE | iso2k | Coral | 32.4670 | -64.7000 | -12.0 | d18O | d18O | [-4.15, -3.66, -3.69, -4.07, -3.95, -4.12, -3.... | permil | d18O | NaN | subannual | temperature | 1.0 | Ocean2kHR_105_iso2k | year | year | [1983.21, 1983.13, 1983.04, 1982.96, 1982.88, ... | yr AD |
124 | IC13THQU | iso2k | Glacier ice | -13.9333 | -70.8333 | 5670.0 | d18O | d18O | [-18.5905, -16.3244, -16.2324, -17.0112, -18.6... | permil | d18O | NaN | Winter | temperature | 4.0 | SAm_035_iso2k | year | year | [2009, 2008, 2007, 2006, 2005, 2004, 2003, 200... | yr AD |
144 | CO01TUNG | iso2k | Coral | -5.2170 | 145.8170 | -3.0 | d18O | d18O | [-4.875, -4.981, -5.166, -5.06, -4.942, -4.919... | permil | d18O | NaN | subannual | temperature | 3.0 | Ocean2kHR_141_iso2k | year | year | [1993.042, 1992.792, 1992.542, 1992.292, 1992.... | yr AD |
166 | CO01TUNG | iso2k | Coral | -5.2170 | 145.8170 | -3.0 | d18O | d18O | [-4.827, -4.786, -4.693, -4.852, -4.991, -4.90... | permil | d18O | NaN | subannual | temperature | 2.0 | Ocean2kHR_140_iso2k | year | year | [1993.042, 1992.792, 1992.542, 1992.292, 1992.... | yr AD |
df2_filt.info()
<class 'pandas.core.frame.DataFrame'>
Index: 49 entries, 48 to 1202
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 dataSetName 49 non-null object
1 compilationName 49 non-null object
2 archiveType 49 non-null object
3 geo_meanLat 49 non-null float64
4 geo_meanLon 49 non-null float64
5 geo_meanElev 49 non-null float64
6 paleoData_variableName 49 non-null object
7 paleoData_standardName 49 non-null object
8 paleoData_values 49 non-null object
9 paleoData_units 49 non-null object
10 paleoData_proxy 49 non-null object
11 paleoData_proxyGeneral 0 non-null float64
12 paleoData_seasonality 37 non-null object
13 paleoData_interpName 49 non-null object
14 paleoData_interpRank 47 non-null float64
15 TSID 49 non-null object
16 time_variableName 49 non-null object
17 time_standardName 49 non-null object
18 time_values 49 non-null object
19 time_units 49 non-null object
dtypes: float64(5), object(15)
memory usage: 8.0+ KB
Querying the CoralHydro2k database#
Let’s do the same query with CoralHydro2k:
query3 = """ PREFIX le: <http://linked.earth/ontology#>
PREFIX wgs84: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?dataSetName ?compilationName ?archiveType ?geo_meanLat ?geo_meanLon ?geo_meanElev
?paleoData_variableName ?paleoData_standardName ?paleoData_values ?paleoData_units
?paleoData_proxy ?paleoData_proxyGeneral ?paleoData_seasonality ?paleoData_interpName ?paleoData_interpRank
?TSID ?time_variableName ?time_standardName ?time_values
?time_units where{
?ds a le:Dataset .
?ds le:hasName ?dataSetName .
# Get the archive
OPTIONAL{
?ds le:hasArchiveType ?archiveTypeObj .
?archiveTypeObj rdfs:label ?archiveType .
}
# Geographical information
?ds le:hasLocation ?loc .
OPTIONAL{?loc wgs84:lat ?geo_meanLat .}
OPTIONAL{?loc wgs84:long ?geo_meanLon .}
OPTIONAL{?loc wgs84:alt ?geo_meanElev .}
# PaleoData
?ds le:hasPaleoData ?data .
?data le:hasMeasurementTable ?table .
?table le:hasVariable ?var .
# Name
?var le:hasName ?paleoData_variableName .
?var le:hasStandardVariable ?variable_obj .
?variable_obj rdfs:label ?paleoData_standardName .
#Values
?var le:hasValues ?paleoData_values .
#Seasonality
OPTIONAL{?var le:seasonality ?paleoData_seasonality} .
#Units
OPTIONAL{
?var le:hasUnits ?paleoData_unitsObj .
?paleoData_unitsObj rdfs:label ?paleoData_units .
}
#Proxy information
OPTIONAL{
?var le:hasProxy ?paleoData_proxyObj .
?paleoData_proxyObj rdfs:label ?paleoData_proxy .
}
OPTIONAL{
?var le:hasProxyGeneral ?paleoData_proxyGeneralObj .
?paleoData_proxyGeneralObj rdfs:label ?paleoData_proxyGeneral .
}
# Compilation
?var le:partOfCompilation ?compilation .
?compilation le:hasName ?compilationName .
FILTER (?compilationName = "CoralHydro2k").
# TSiD (should all have them)
OPTIONAL{
?var le:hasVariableId ?TSID
} .
# Interpretation (might be an optional field)
OPTIONAL{
?var le:hasInterpretation ?interp .
?interp le:hasVariable ?interpvar .
BIND(REPLACE(STR(?interpvar), "http://linked.earth/ontology/interpretation#", "") AS ?interpretedVariable_Fallback)
OPTIONAL { ?interpvar rdfs:label ?interpretedVariable_Label } # Use a temporary variable for the label
BIND(COALESCE(?interpretedVariable_Label, ?interpretedVariable_Fallback) AS ?paleoData_interpName) # COALESCE into the final variable
}
OPTIONAL{
?var le:hasInterpretation ?interp .
?interp le:hasRank ?paleoData_interpRank .}
OPTIONAL{
?var le:hasInterpretation ?interp .
?interp le:hasSeasonality ?seasonalityURI .
BIND(REPLACE(STR(?seasonalityURI), "http://linked.earth/ontology/interpretation#", "") AS ?seasonality_Fallback)
OPTIONAL { ?seasonalityURI rdfs:label ?seasonalityLabel }
BIND(COALESCE(?seasonalityLabel, ?seasonality_Fallback) AS ?paleoData_seasonality)
}
#Time information
?table le:hasVariable ?timevar .
?timevar le:hasName ?time_variableName .
?timevar le:hasStandardVariable ?timevar_obj .
?timevar_obj rdfs:label ?time_standardName .
VALUES ?time_standardName {"year"} .
?timevar le:hasValues ?time_values .
OPTIONAL{
?timevar le:hasUnits ?time_unitsObj .
?time_unitsObj rdfs:label ?time_units .
}
}"""
response = requests.post(url, data={'query': query3})
data = io.StringIO(response.text)
df3 = pd.read_csv(data, sep=",")
df3 = df3.drop_duplicates()
df3['paleoData_values'] = df3['paleoData_values'].apply(lambda row: json.loads(row) if isinstance(row, str) else row)
df3['time_values'] = df3['time_values'].apply(lambda row: json.loads(row) if isinstance(row, str) else row)
df3.head()
dataSetName | compilationName | archiveType | geo_meanLat | geo_meanLon | geo_meanElev | paleoData_variableName | paleoData_standardName | paleoData_values | paleoData_units | paleoData_proxy | paleoData_proxyGeneral | paleoData_seasonality | paleoData_interpName | paleoData_interpRank | TSID | time_variableName | time_standardName | time_values | time_units | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | KA17RYU01 | CoralHydro2k | Coral | 28.300 | 130.000 | -3.5 | SrCa | Sr/Ca | [8.802, 9.472, 8.825, 9.355, 8.952, 9.297, 8.8... | mmol/mol | NaN | NaN | NaN | NaN | NaN | KA17RYU01_SrCa | year | year | [1578.58, 1579.08, 1579.58, 1580.08, 1580.58, ... | yr AD |
2 | CH18YOA02 | CoralHydro2k | Coral | 16.448 | 111.605 | NaN | SrCa | Sr/Ca | [8.58, 8.683, 8.609, 8.37, 8.38, 8.417, 8.584,... | mmol/mol | NaN | NaN | NaN | NaN | NaN | CH18YOA02_SrCa | year | year | [1987.92, 1988.085, 1988.25, 1988.42, 1988.585... | yr AD |
4 | FL17DTO02 | CoralHydro2k | Coral | 24.699 | -82.799 | -3.0 | Sr/Ca | Sr/Ca | [9.159, 9.257, 9.245, 9.166, 9.045, 9.013, 8.9... | mmol/mol | NaN | NaN | NaN | NaN | NaN | FL17DTO02_SrCa | year | year | [1837.04, 1837.13, 1837.21, 1837.29, 1837.38, ... | yr AD |
6 | BO14HTI01 | CoralHydro2k | Coral | 12.210 | 109.310 | -3.6 | Sr/Ca | Sr/Ca | [9.2, 9.17, 9.11, 9.02, 8.95, 8.99, 9.06, 9.1,... | mmol/mol | NaN | NaN | NaN | NaN | NaN | BO14HTI01_SrCa | year | year | [1600.04, 1600.12, 1600.2, 1600.28, 1600.37, 1... | yr AD |
8 | BO14HTI01 | CoralHydro2k | Coral | 12.210 | 109.310 | -3.6 | d18O | d18O | [-5.4206, -5.3477, -5.1354, -5.7119, -5.9058, ... | permil | NaN | NaN | NaN | NaN | NaN | BO14HTI01_d18O | year | year | [1977.37, 1977.45, 1977.54, 1977.62, 1977.7, 1... | yr AD |
Let’s filter for Sr/Ca and d18O records:
df3_filt = df3[df3['paleoData_variableName'].isin(['SrCa', 'd18O'])].drop_duplicates(subset='TSID', keep='first')
df3_filt.head()
dataSetName | compilationName | archiveType | geo_meanLat | geo_meanLon | geo_meanElev | paleoData_variableName | paleoData_standardName | paleoData_values | paleoData_units | paleoData_proxy | paleoData_proxyGeneral | paleoData_seasonality | paleoData_interpName | paleoData_interpRank | TSID | time_variableName | time_standardName | time_values | time_units | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | KA17RYU01 | CoralHydro2k | Coral | 28.300 | 130.000 | -3.5 | SrCa | Sr/Ca | [8.802, 9.472, 8.825, 9.355, 8.952, 9.297, 8.8... | mmol/mol | NaN | NaN | NaN | NaN | NaN | KA17RYU01_SrCa | year | year | [1578.58, 1579.08, 1579.58, 1580.08, 1580.58, ... | yr AD |
2 | CH18YOA02 | CoralHydro2k | Coral | 16.448 | 111.605 | NaN | SrCa | Sr/Ca | [8.58, 8.683, 8.609, 8.37, 8.38, 8.417, 8.584,... | mmol/mol | NaN | NaN | NaN | NaN | NaN | CH18YOA02_SrCa | year | year | [1987.92, 1988.085, 1988.25, 1988.42, 1988.585... | yr AD |
8 | BO14HTI01 | CoralHydro2k | Coral | 12.210 | 109.310 | -3.6 | d18O | d18O | [-5.4206, -5.3477, -5.1354, -5.7119, -5.9058, ... | permil | NaN | NaN | NaN | NaN | NaN | BO14HTI01_d18O | year | year | [1977.37, 1977.45, 1977.54, 1977.62, 1977.7, 1... | yr AD |
12 | AB20MEN01 | CoralHydro2k | Coral | -3.180 | 100.517 | -0.7 | d18O | d18O | [-5.7529527, -5.7139629, -5.6917258, -5.546536... | permil | NaN | NaN | NaN | NaN | NaN | AB20MEN01_d18O | year | year | [1936.375, 1936.45833, 1936.54167, 1936.625, 1... | yr AD |
16 | DR99ABR01 | CoralHydro2k | Coral | -22.100 | 153.000 | -10.0 | d18O | d18O | [-4.008, -3.977, -3.853, -4.113, -3.913, -4.01... | permil | NaN | NaN | NaN | NaN | NaN | DR99ABR01_d18O | year | year | [1638.3, 1639.3, 1640.3, 1641.3, 1642.3, 1643.... | yr AD |
df3_filt.info()
<class 'pandas.core.frame.DataFrame'>
Index: 203 entries, 0 to 780
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 dataSetName 203 non-null object
1 compilationName 203 non-null object
2 archiveType 203 non-null object
3 geo_meanLat 203 non-null float64
4 geo_meanLon 203 non-null float64
5 geo_meanElev 166 non-null float64
6 paleoData_variableName 203 non-null object
7 paleoData_standardName 203 non-null object
8 paleoData_values 203 non-null object
9 paleoData_units 203 non-null object
10 paleoData_proxy 0 non-null float64
11 paleoData_proxyGeneral 0 non-null float64
12 paleoData_seasonality 0 non-null float64
13 paleoData_interpName 0 non-null float64
14 paleoData_interpRank 0 non-null float64
15 TSID 203 non-null object
16 time_variableName 203 non-null object
17 time_standardName 203 non-null object
18 time_values 203 non-null object
19 time_units 203 non-null object
dtypes: float64(8), object(12)
memory usage: 33.3+ KB
The rest of the workflow is the same as the one presented in the previous notebook.
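As a pointer, the three tables assembled here share the same columns, so they can be stacked into a single frame before being converted into a proxy database. A minimal sketch; `combine_proxy_tables` is a hypothetical helper name, not part of cfr:

```python
import pandas as pd


def combine_proxy_tables(frames):
    """Stack per-compilation query results into one proxy table.

    Assumes all frames share the same columns (true for the queries above)
    and that TSID uniquely identifies a timeseries.
    """
    proxy_df = pd.concat(frames, ignore_index=True)
    # Keep one row per timeseries in case a record appears in two compilations
    return proxy_df.drop_duplicates(subset='TSID', keep='first')
```

For example, `proxy_df = combine_proxy_tables([df1, df2_filt, df3_filt])` yields a single table of roughly 900 records to carry into the next steps.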