Assembling the proxy database from the LiPDGraph#

Authors#

Deborah Khider & Julien Emile-Geay

Goal#

The goal of this notebook is to showcase data querying on the LiPDGraph to create the proxy database used by cfr for the data assimilation.

Pre-requisite#

Reading time#

10 min

Data Description#

This notebook makes use of the LiPDGraph database, a graph database that contains datasets from various working groups. For a list of these compilations, see this page.

This notebook makes use of the LiPDVerse database, which contains datasets from various working groups. For a list of these compilations, see this page.

In particular, we will be using PAGES2kTemperature, iso-2k, and CoralHydro2k compilations for this demonstration.

  • PAGES2k: PAGES2k Consortium., Emile-Geay, J., McKay, N. et al. A global multiproxy database for temperature reconstructions of the Common Era. Sci Data 4, 170088 (2017). doi:10.1038/sdata.2017.88

  • iso2k: Konecky, B. L., McKay, N. P., Churakova (Sidorova), O. V., Comas-Bru, L., Dassié, E. P., DeLong, K. L., Falster, G. M., Fischer, M. J., Jones, M. D., Jonkers, L., Kaufman, D. S., Leduc, G., Managave, S. R., Martrat, B., Opel, T., Orsi, A. J., Partin, J. W., Sayani, H. R., Thomas, E. K., Thompson, D. M., Tyler, J. J., Abram, N. J., Atwood, A. R., Cartapanis, O., Conroy, J. L., Curran, M. A., Dee, S. G., Deininger, M., Divine, D. V., Kern, Z., Porter, T. J., Stevenson, S. L., von Gunten, L., and Iso2k Project Members: The Iso2k database: a global compilation of paleo-δ18O and δ2H records to aid understanding of Common Era climate, Earth Syst. Sci. Data, 12, 2261–2288, https://doi.org/10.5194/essd-12-2261-2020, 2020.

  • CoralHydro2k: Walter, R. M., H. R. Sayani, T. Felis, K. M. Cobb, N. J. Abram, A. K. Arzey, A. R. Atwood, L. D. Brenner, E. P. Dassi ́e, K. L. DeLong, B. Ellis, J. Emile-Geay, M. J. Fischer, N. F. Goodkin, J. A. Hargreaves, K. H. Kilbourne, H. Krawczyk, N. P. McKay, A. L. Moore, S. A. Murty, M. R. Ong, R. D. Ramos, E. V. Reed, D. Samanta, S. C. Sanchez, J. Zinke, and the PAGES CoralHydro2k Project Members (2023), The CoralHydro2k database: a global, actively curated compilation of coral δ18O and Sr / Ca proxy records of tropical ocean hydrology and temperature for the Common Era, Earth System Science Data, 15(5), 2081–2116, doi:10.5194/essd-15-2081-2023.

Note: Once the database is queried, the workflow is similar to the one presented in the assembling from the LiPDverse notebook. Therefore, this notebook only contains the querying workflow to the assembly of the database (and before the addition of the Palmyra record).

Let’s import the needed packages:

import os

# General data manipulation libraries
import numpy as np
import pandas as pd

# To query the graph
import json
import requests
import pandas as pd
import io

Data Retrieval#

Following this tutorial, we perform a SPARQL query on the LiPDGraph to assemble the most current dataset.

We point towards the dynamic version to get the most up-to-date files. For reproduction purposes, you can change the URL to one of the versioned graphs.

url = 'https://linkedearth.graphdb.mint.isi.edu/repositories/LiPDVerse-dynamic'

Querying the PAGES2k datasets#

In the first query, we retrieve all the data in the PAGES2k compilation that were tagged to be used in the temperature analysis for this compilations along with relevant metadata.

We will return the following information if present:

  • datSetName: The name of the dataset

  • compilationName: The name of the compilation. This will be used for filtering later on

  • archiveType: The type of archive on which the measurements were made

  • Latitude, Longitude, and Elevation: the geographical coordinates for the site

  • paleoData_variableName: The name of the paleo variable as originally reported by the investigator

  • paleoData_standardName: The standard name for the paleo variable

  • paleoData_values: The values for the paleo variable

  • paleoData_units: The units associated with the paleo variable

  • paleoData_proxy: The proxy used to infer paleoenvironmental conditions

  • paleoData_proxyGeneral: Generalized grouping for proxy

  • paleoData_interpName: The interpretation for the proxy record

  • paleoData_interpRank: The rank for the interpretation

  • paleoData_Seasonality: The seasonality for the proxy

  • TSiD: The unique ID for the timeseries

  • time_variableName: The name for the time variable. Since we are dealing with the common era, should be all “year”.

  • timevar_standardName: The standard name for the time variable. Since we are dealing with the common era, should be all “year”. So let’s use this as a filter.

  • time_values: The values for the time axis

  • time_units: the units for the time axis.

query1 = """ PREFIX le: <http://linked.earth/ontology#>
PREFIX wgs84: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?dataSetName ?compilationName ?archiveType ?geo_meanLat ?geo_meanLon ?geo_meanElev 
    ?paleoData_variableName ?paleoData_standardName ?paleoData_values ?paleoData_units 
    ?paleoData_proxy ?paleoData_proxyGeneral ?paleoData_seasonality ?paleoData_interpName ?paleoData_interpRank 
    ?TSID ?time_variableName ?time_standardName ?time_values
	?time_units where{
    
    ?ds a le:Dataset .
    ?ds le:hasName ?dataSetName .

    # Get the archive
    OPTIONAL{
            ?ds le:hasArchiveType ?archiveTypeObj .
            ?archiveTypeObj rdfs:label ?archiveType .
        }
    
    # Geographical information    
    
    ?ds le:hasLocation ?loc .
        OPTIONAL{?loc wgs84:lat ?geo_meanLat .}
        OPTIONAL{?loc wgs84:long ?geo_meanLon .}      
        OPTIONAL{?loc wgs84:alt ?geo_meanElev .}
    
    # PaleoData

    ?ds le:hasPaleoData ?data .
    ?data le:hasMeasurementTable ?table .
    ?table le:hasVariable ?var .

    # Name
    ?var le:hasName ?paleoData_variableName .
    ?var le:hasStandardVariable ?variable_obj .
    ?variable_obj rdfs:label ?paleoData_standardName .
    
    #Values
    ?var le:hasValues ?paleoData_values .

    #Seasonality
    OPTIONAL{?var le:seasonality ?paleoData_seasonality} .

    #Units
    OPTIONAL{
        ?var le:hasUnits ?paleoData_unitsObj .
        ?paleoData_unitsObj rdfs:label ?paleoData_units .
    }

    #Proxy information
    OPTIONAL{
        ?var le:hasProxy ?paleoData_proxyObj .
        ?paleoData_proxyObj rdfs:label ?paleoData_proxy .
    }
    OPTIONAL{
        ?var le:hasProxyGeneral ?paleoData_proxyGeneralObj .
        ?paleoData_proxyGeneralObj rdfs:label ?paleoData_proxyGeneral .
    }

    # Compilation
    ?var le:partOfCompilation ?compilation . 
    ?compilation le:hasName ?compilationName .
    FILTER (?compilationName = "Pages2kTemperature").
    ?var le:useInGlobalTemperatureAnalysis True .

    # TSiD (should all have them)
    OPTIONAL{
        ?var le:hasVariableId ?TSID
    } .

    # Interpretation (might be an optional field)
    OPTIONAL{
        ?var le:hasInterpretation ?interp .

        ?interp le:hasVariable ?interpvar .
        BIND(REPLACE(STR(?interpvar), "http://linked.earth/ontology/interpretation#", "") AS ?interpretedVariable_Fallback)
        OPTIONAL { ?interpvar rdfs:label ?interpretedVariable_Label } # Use a temporary variable for the label
        BIND(COALESCE(?interpretedVariable_Label, ?interpretedVariable_Fallback) AS ?paleoData_interpName) # COALESCE into the final variable
        }
    
    OPTIONAL{
        ?var le:hasInterpretation ?interp .
        ?interp le:hasRank ?paleoData_interpRank .}
    
    
    OPTIONAL{
        ?var le:hasInterpretation ?interp .
    
        ?interp le:hasSeasonality ?seasonalityURI .
        BIND(REPLACE(STR(?seasonalityURI), "http://linked.earth/ontology/interpretation#", "") AS ?seasonality_Fallback)
        OPTIONAL { ?seasonalityURI rdfs:label ?seasonalityLabel }
        BIND(COALESCE(?seasonalityLabel, ?seasonality_Fallback) AS ?paleoData_seasonality)
    }

    #Time information
    ?table le:hasVariable ?timevar .
    ?timevar le:hasName ?time_variableName .
    ?timevar le:hasStandardVariable ?timevar_obj .
    ?timevar_obj rdfs:label ?time_standardName .
    VALUES ?time_standardName {"year"} .
    ?timevar le:hasValues ?time_values .
    OPTIONAL{
        ?timevar le:hasUnits ?time_unitsObj .
        ?time_unitsObj rdfs:label ?time_units .
    }  
}"""

This query should look familiar and there is a good reason for it: this is the same query as the one we performed on the downloaded LiPD files from the LiPDverse in the previous notebook.

That is because PyLiPD creates an internal database on your machine that is similar in structure (i.e., the use the same mapping, the LinkedEarth Ontology) to describe datasets.

Let’s run our query:

response = requests.post(url, data = {'query': query1})

data = io.StringIO(response.text)
df1 = pd.read_csv(data, sep=",")

df1 = df1.drop_duplicates()

df1['paleoData_values']=df1['paleoData_values'].apply(lambda row : json.loads(row) if isinstance(row, str) else row)
df1['time_values']=df1['time_values'].apply(lambda row : json.loads(row) if isinstance(row, str) else row)

df1.head()
dataSetName compilationName archiveType geo_meanLat geo_meanLon geo_meanElev paleoData_variableName paleoData_standardName paleoData_values paleoData_units paleoData_proxy paleoData_proxyGeneral paleoData_seasonality paleoData_interpName paleoData_interpRank TSID time_variableName time_standardName time_values time_units
0 Ant-WDC05A.Steig.2013 Pages2kTemperature Glacier ice -79.46 -112.09 1806.0 d18O d18O [-33.32873325, -35.6732, -33.1574, -34.2854, -... permil d18O NaN Annual temperature NaN Ant_030 year year [2005, 2004, 2003, 2002, 2001, 2000, 1999, 199... yr AD
16 Arc-Arjeplog.Bjorklund.2014 Pages2kTemperature Wood 66.30 18.20 800.0 density density [-0.829089212152348, -0.733882889924006, -0.89... NaN maximum latewood density dendrophysical Jun-Aug temperature NaN Arc_060 year year [1200, 1201, 1202, 1203, 1204, 1205, 1206, 120... yr AD
20 NAm-SmithersSkiArea.Schweingruber.1996 Pages2kTemperature Wood 54.90 -127.30 1200.0 MXD MXD [1.041, 1.062, 1.051, 0.997, 0.993, 0.959, 0.9... NaN maximum latewood density dendrophysical Summer temperature NaN NAm_1120 year year [1680, 1681, 1682, 1683, 1684, 1685, 1686, 168... yr AD
28 Ocn-Mayotte.Zinke.2008 Pages2kTemperature Coral -12.65 45.10 -3.0 d18O d18O [-4.72, -4.95, -5.45, -5.157, -5.05, -5.065, -... permil d18O NaN subannual temperature NaN Ocean2kHR_002 year year [1993.62, 1993.46, 1993.29, 1993.12, 1992.96, ... yr AD
36 Ocn-LosRoques.Hetzinger.2008 Pages2kTemperature Coral 11.77 -66.75 -2.0 d18O d18O [-3.8123, -3.73082, -3.74912, -3.6656, -3.6995... permil d18O NaN subannual temperature NaN Ocean2kHR_096 year year [1917.79, 1917.87, 1917.96, 1918.04, 1918.12, ... yr AD
df1.info()
<class 'pandas.core.frame.DataFrame'>
Index: 690 entries, 0 to 15328
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   dataSetName             690 non-null    object 
 1   compilationName         690 non-null    object 
 2   archiveType             690 non-null    object 
 3   geo_meanLat             690 non-null    float64
 4   geo_meanLon             690 non-null    float64
 5   geo_meanElev            683 non-null    float64
 6   paleoData_variableName  690 non-null    object 
 7   paleoData_standardName  690 non-null    object 
 8   paleoData_values        690 non-null    object 
 9   paleoData_units         282 non-null    object 
 10  paleoData_proxy         690 non-null    object 
 11  paleoData_proxyGeneral  464 non-null    object 
 12  paleoData_seasonality   690 non-null    object 
 13  paleoData_interpName    690 non-null    object 
 14  paleoData_interpRank    1 non-null      float64
 15  TSID                    690 non-null    object 
 16  time_variableName       690 non-null    object 
 17  time_standardName       690 non-null    object 
 18  time_values             690 non-null    object 
 19  time_units              690 non-null    object 
dtypes: float64(4), object(16)
memory usage: 113.2+ KB

Querying the iso2k dataset#

Let’s repeat our query for the datasets in the iso2k compilation:

query2 = """ PREFIX le: <http://linked.earth/ontology#>
PREFIX wgs84: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?dataSetName ?compilationName ?archiveType ?geo_meanLat ?geo_meanLon ?geo_meanElev 
    ?paleoData_variableName ?paleoData_standardName ?paleoData_values ?paleoData_units 
    ?paleoData_proxy ?paleoData_proxyGeneral ?paleoData_seasonality ?paleoData_interpName ?paleoData_interpRank 
    ?TSID ?time_variableName ?time_standardName ?time_values
	?time_units where{
    
    ?ds a le:Dataset .
    ?ds le:hasName ?dataSetName .

    # Get the archive
    OPTIONAL{
            ?ds le:hasArchiveType ?archiveTypeObj .
            ?archiveTypeObj rdfs:label ?archiveType .
        }
    
    # Geographical information    
    
    ?ds le:hasLocation ?loc .
        OPTIONAL{?loc wgs84:lat ?geo_meanLat .}
        OPTIONAL{?loc wgs84:long ?geo_meanLon .}      
        OPTIONAL{?loc wgs84:alt ?geo_meanElev .}
    
    # PaleoData

    ?ds le:hasPaleoData ?data .
    ?data le:hasMeasurementTable ?table .
    ?table le:hasVariable ?var .

    # Name
    ?var le:hasName ?paleoData_variableName .
    ?var le:hasStandardVariable ?variable_obj .
    ?variable_obj rdfs:label ?paleoData_standardName .
    
    #Values
    ?var le:hasValues ?paleoData_values .

    #Seasonality
    OPTIONAL{?var le:seasonality ?paleoData_seasonality} .

    #Units
    OPTIONAL{
        ?var le:hasUnits ?paleoData_unitsObj .
        ?paleoData_unitsObj rdfs:label ?paleoData_units .
    }

    #Proxy information
    OPTIONAL{
        ?var le:hasProxy ?paleoData_proxyObj .
        ?paleoData_proxyObj rdfs:label ?paleoData_proxy .
    }
    OPTIONAL{
        ?var le:hasProxyGeneral ?paleoData_proxyGeneralObj .
        ?paleoData_proxyGeneralObj rdfs:label ?paleoData_proxyGeneral .
    }

    # Compilation
    ?var le:partOfCompilation ?compilation . 
    ?compilation le:hasName ?compilationName .
    FILTER (?compilationName = "iso2k").
    ?var le:useInGlobalTemperatureAnalysis True .

    # TSiD (should all have them)
    OPTIONAL{
        ?var le:hasVariableId ?TSID
    } .

    # Interpretation (might be an optional field)
    OPTIONAL{
        ?var le:hasInterpretation ?interp .

        ?interp le:hasVariable ?interpvar .
        BIND(REPLACE(STR(?interpvar), "http://linked.earth/ontology/interpretation#", "") AS ?interpretedVariable_Fallback)
        OPTIONAL { ?interpvar rdfs:label ?interpretedVariable_Label } # Use a temporary variable for the label
        BIND(COALESCE(?interpretedVariable_Label, ?interpretedVariable_Fallback) AS ?paleoData_interpName) # COALESCE into the final variable
        }
    
    OPTIONAL{
        ?var le:hasInterpretation ?interp .
        ?interp le:hasRank ?paleoData_interpRank .}
    
    
    OPTIONAL{
        ?var le:hasInterpretation ?interp .
    
        ?interp le:hasSeasonality ?seasonalityURI .
        BIND(REPLACE(STR(?seasonalityURI), "http://linked.earth/ontology/interpretation#", "") AS ?seasonality_Fallback)
        OPTIONAL { ?seasonalityURI rdfs:label ?seasonalityLabel }
        BIND(COALESCE(?seasonalityLabel, ?seasonality_Fallback) AS ?paleoData_seasonality)
    }

    #Time information
    ?table le:hasVariable ?timevar .
    ?timevar le:hasName ?time_variableName .
    ?timevar le:hasStandardVariable ?timevar_obj .
    ?timevar_obj rdfs:label ?time_standardName .
    VALUES ?time_standardName {"year"} .
    ?timevar le:hasValues ?time_values .
    OPTIONAL{
        ?timevar le:hasUnits ?time_unitsObj .
        ?time_unitsObj rdfs:label ?time_units .
    }  
}"""
response = requests.post(url, data = {'query': query2})

data = io.StringIO(response.text)
df2 = pd.read_csv(data, sep=",")

df2 = df2.drop_duplicates()

df2['paleoData_values']=df2['paleoData_values'].apply(lambda row : json.loads(row) if isinstance(row, str) else row)
df2['time_values']=df2['time_values'].apply(lambda row : json.loads(row) if isinstance(row, str) else row)

df2.head()
dataSetName compilationName archiveType geo_meanLat geo_meanLon geo_meanElev paleoData_variableName paleoData_standardName paleoData_values paleoData_units paleoData_proxy paleoData_proxyGeneral paleoData_seasonality paleoData_interpName paleoData_interpRank TSID time_variableName time_standardName time_values time_units
0 LS16STCL iso2k Lake sediment 50.83 -116.39 1126.0 d18O d18O [-7.81, -5.91, -9.03, -5.35, -5.61, -5.98, -5.... permil d18O NaN Winter NaN 5.0 LPD7dc5b9ba-dup-dup year year [2009.0, 2008.3, 2007.8, 2007.4, 2007.0, 2006.... yr AD
4 LS16STCL iso2k Lake sediment 50.83 -116.39 1126.0 d18O d18O [-7.81, -5.91, -9.03, -5.35, -5.61, -5.98, -5.... permil d18O NaN NaN NaN 3.0 LPD7dc5b9ba-dup-dup year year [2009.0, 2008.3, 2007.8, 2007.4, 2007.0, 2006.... yr AD
6 LS16STCL iso2k Lake sediment 50.83 -116.39 1126.0 d18O d18O [-7.81, -5.91, -9.03, -5.35, -5.61, -5.98, -5.... permil d18O NaN NaN NaN 4.0 LPD7dc5b9ba-dup-dup year year [2009.0, 2008.3, 2007.8, 2007.4, 2007.0, 2006.... yr AD
8 LS16STCL iso2k Lake sediment 50.83 -116.39 1126.0 d18O d18O [-7.81, -5.91, -9.03, -5.35, -5.61, -5.98, -5.... permil d18O NaN NaN NaN 2.0 LPD7dc5b9ba-dup-dup year year [2009.0, 2008.3, 2007.8, 2007.4, 2007.0, 2006.... yr AD
10 LS16STCL iso2k Lake sediment 50.83 -116.39 1126.0 d18O d18O [-7.81, -5.91, -9.03, -5.35, -5.61, -5.98, -5.... permil d18O NaN NaN NaN 1.0 LPD7dc5b9ba-dup-dup year year [2009.0, 2008.3, 2007.8, 2007.4, 2007.0, 2006.... yr AD

Let’s have a look at the interpretation field:

df2['paleoData_interpName'].unique()
array([nan, 'temperature', 'seawaterIsotope', 'precipitationIsotope',
       'hydrologicBalance', 'salinity'], dtype=object)

And let’s only keep interpretation as temperature:

df2_filt = df2[df2['paleoData_interpName']=='temperature'].drop_duplicates(subset='TSID', keep='first')
df2_filt.head()
dataSetName compilationName archiveType geo_meanLat geo_meanLon geo_meanElev paleoData_variableName paleoData_standardName paleoData_values paleoData_units paleoData_proxy paleoData_proxyGeneral paleoData_seasonality paleoData_interpName paleoData_interpRank TSID time_variableName time_standardName time_values time_units
48 CO00URMA iso2k Coral 0.9330 173.0000 6.0 d18O d18O [-4.8011, -4.725, -4.6994, -4.86, -5.0886, -5.... permil d18O NaN subannual temperature 1.0 Ocean2kHR_177_iso2k year year [1994.5, 1994.33, 1994.17, 1994.0, 1993.83, 19... yr AD
108 CO05KUBE iso2k Coral 32.4670 -64.7000 -12.0 d18O d18O [-4.15, -3.66, -3.69, -4.07, -3.95, -4.12, -3.... permil d18O NaN subannual temperature 1.0 Ocean2kHR_105_iso2k year year [1983.21, 1983.13, 1983.04, 1982.96, 1982.88, ... yr AD
124 IC13THQU iso2k Glacier ice -13.9333 -70.8333 5670.0 d18O d18O [-18.5905, -16.3244, -16.2324, -17.0112, -18.6... permil d18O NaN Winter temperature 4.0 SAm_035_iso2k year year [2009, 2008, 2007, 2006, 2005, 2004, 2003, 200... yr AD
144 CO01TUNG iso2k Coral -5.2170 145.8170 -3.0 d18O d18O [-4.875, -4.981, -5.166, -5.06, -4.942, -4.919... permil d18O NaN subannual temperature 3.0 Ocean2kHR_141_iso2k year year [1993.042, 1992.792, 1992.542, 1992.292, 1992.... yr AD
166 CO01TUNG iso2k Coral -5.2170 145.8170 -3.0 d18O d18O [-4.827, -4.786, -4.693, -4.852, -4.991, -4.90... permil d18O NaN subannual temperature 2.0 Ocean2kHR_140_iso2k year year [1993.042, 1992.792, 1992.542, 1992.292, 1992.... yr AD
df2_filt.info()
<class 'pandas.core.frame.DataFrame'>
Index: 49 entries, 48 to 1202
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   dataSetName             49 non-null     object 
 1   compilationName         49 non-null     object 
 2   archiveType             49 non-null     object 
 3   geo_meanLat             49 non-null     float64
 4   geo_meanLon             49 non-null     float64
 5   geo_meanElev            49 non-null     float64
 6   paleoData_variableName  49 non-null     object 
 7   paleoData_standardName  49 non-null     object 
 8   paleoData_values        49 non-null     object 
 9   paleoData_units         49 non-null     object 
 10  paleoData_proxy         49 non-null     object 
 11  paleoData_proxyGeneral  0 non-null      float64
 12  paleoData_seasonality   37 non-null     object 
 13  paleoData_interpName    49 non-null     object 
 14  paleoData_interpRank    47 non-null     float64
 15  TSID                    49 non-null     object 
 16  time_variableName       49 non-null     object 
 17  time_standardName       49 non-null     object 
 18  time_values             49 non-null     object 
 19  time_units              49 non-null     object 
dtypes: float64(5), object(15)
memory usage: 8.0+ KB

Querying the CoralHydro2k database#

Let’s do the same query with CoralHydro2k:

query3 = """ PREFIX le: <http://linked.earth/ontology#>
PREFIX wgs84: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?dataSetName ?compilationName ?archiveType ?geo_meanLat ?geo_meanLon ?geo_meanElev 
    ?paleoData_variableName ?paleoData_standardName ?paleoData_values ?paleoData_units 
    ?paleoData_proxy ?paleoData_proxyGeneral ?paleoData_seasonality ?paleoData_interpName ?paleoData_interpRank 
    ?TSID ?time_variableName ?time_standardName ?time_values
	?time_units where{
    
    ?ds a le:Dataset .
    ?ds le:hasName ?dataSetName .

    # Get the archive
    OPTIONAL{
            ?ds le:hasArchiveType ?archiveTypeObj .
            ?archiveTypeObj rdfs:label ?archiveType .
        }
    
    # Geographical information    
    
    ?ds le:hasLocation ?loc .
        OPTIONAL{?loc wgs84:lat ?geo_meanLat .}
        OPTIONAL{?loc wgs84:long ?geo_meanLon .}      
        OPTIONAL{?loc wgs84:alt ?geo_meanElev .}
    
    # PaleoData

    ?ds le:hasPaleoData ?data .
    ?data le:hasMeasurementTable ?table .
    ?table le:hasVariable ?var .

    # Name
    ?var le:hasName ?paleoData_variableName .
    ?var le:hasStandardVariable ?variable_obj .
    ?variable_obj rdfs:label ?paleoData_standardName .
    
    #Values
    ?var le:hasValues ?paleoData_values .

    #Seasonality
    OPTIONAL{?var le:seasonality ?paleoData_seasonality} .

    #Units
    OPTIONAL{
        ?var le:hasUnits ?paleoData_unitsObj .
        ?paleoData_unitsObj rdfs:label ?paleoData_units .
    }

    #Proxy information
    OPTIONAL{
        ?var le:hasProxy ?paleoData_proxyObj .
        ?paleoData_proxyObj rdfs:label ?paleoData_proxy .
    }
    OPTIONAL{
        ?var le:hasProxyGeneral ?paleoData_proxyGeneralObj .
        ?paleoData_proxyGeneralObj rdfs:label ?paleoData_proxyGeneral .
    }

    # Compilation
    ?var le:partOfCompilation ?compilation . 
    ?compilation le:hasName ?compilationName .
    FILTER (?compilationName = "CoralHydro2k").


    # TSiD (should all have them)
    OPTIONAL{
        ?var le:hasVariableId ?TSID
    } .

    # Interpretation (might be an optional field)
    OPTIONAL{
        ?var le:hasInterpretation ?interp .

        ?interp le:hasVariable ?interpvar .
        BIND(REPLACE(STR(?interpvar), "http://linked.earth/ontology/interpretation#", "") AS ?interpretedVariable_Fallback)
        OPTIONAL { ?interpvar rdfs:label ?interpretedVariable_Label } # Use a temporary variable for the label
        BIND(COALESCE(?interpretedVariable_Label, ?interpretedVariable_Fallback) AS ?paleoData_interpName) # COALESCE into the final variable
        }
    
    OPTIONAL{
        ?var le:hasInterpretation ?interp .
        ?interp le:hasRank ?paleoData_interpRank .}
    
    
    OPTIONAL{
        ?var le:hasInterpretation ?interp .
    
        ?interp le:hasSeasonality ?seasonalityURI .
        BIND(REPLACE(STR(?seasonalityURI), "http://linked.earth/ontology/interpretation#", "") AS ?seasonality_Fallback)
        OPTIONAL { ?seasonalityURI rdfs:label ?seasonalityLabel }
        BIND(COALESCE(?seasonalityLabel, ?seasonality_Fallback) AS ?paleoData_seasonality)
    }

    #Time information
    ?table le:hasVariable ?timevar .
    ?timevar le:hasName ?time_variableName .
    ?timevar le:hasStandardVariable ?timevar_obj .
    ?timevar_obj rdfs:label ?time_standardName .
    VALUES ?time_standardName {"year"} .
    ?timevar le:hasValues ?time_values .
    OPTIONAL{
        ?timevar le:hasUnits ?time_unitsObj .
        ?time_unitsObj rdfs:label ?time_units .
    }  
}"""
response = requests.post(url, data = {'query': query3})

data = io.StringIO(response.text)
df3 = pd.read_csv(data, sep=",")

df3 = df3.drop_duplicates()

df3['paleoData_values']=df3['paleoData_values'].apply(lambda row : json.loads(row) if isinstance(row, str) else row)
df3['time_values']=df3['time_values'].apply(lambda row : json.loads(row) if isinstance(row, str) else row)

df3.head()
dataSetName compilationName archiveType geo_meanLat geo_meanLon geo_meanElev paleoData_variableName paleoData_standardName paleoData_values paleoData_units paleoData_proxy paleoData_proxyGeneral paleoData_seasonality paleoData_interpName paleoData_interpRank TSID time_variableName time_standardName time_values time_units
0 KA17RYU01 CoralHydro2k Coral 28.300 130.000 -3.5 SrCa Sr/Ca [8.802, 9.472, 8.825, 9.355, 8.952, 9.297, 8.8... mmol/mol NaN NaN NaN NaN NaN KA17RYU01_SrCa year year [1578.58, 1579.08, 1579.58, 1580.08, 1580.58, ... yr AD
2 CH18YOA02 CoralHydro2k Coral 16.448 111.605 NaN SrCa Sr/Ca [8.58, 8.683, 8.609, 8.37, 8.38, 8.417, 8.584,... mmol/mol NaN NaN NaN NaN NaN CH18YOA02_SrCa year year [1987.92, 1988.085, 1988.25, 1988.42, 1988.585... yr AD
4 FL17DTO02 CoralHydro2k Coral 24.699 -82.799 -3.0 Sr/Ca Sr/Ca [9.159, 9.257, 9.245, 9.166, 9.045, 9.013, 8.9... mmol/mol NaN NaN NaN NaN NaN FL17DTO02_SrCa year year [1837.04, 1837.13, 1837.21, 1837.29, 1837.38, ... yr AD
6 BO14HTI01 CoralHydro2k Coral 12.210 109.310 -3.6 Sr/Ca Sr/Ca [9.2, 9.17, 9.11, 9.02, 8.95, 8.99, 9.06, 9.1,... mmol/mol NaN NaN NaN NaN NaN BO14HTI01_SrCa year year [1600.04, 1600.12, 1600.2, 1600.28, 1600.37, 1... yr AD
8 BO14HTI01 CoralHydro2k Coral 12.210 109.310 -3.6 d18O d18O [-5.4206, -5.3477, -5.1354, -5.7119, -5.9058, ... permil NaN NaN NaN NaN NaN BO14HTI01_d18O year year [1977.37, 1977.45, 1977.54, 1977.62, 1977.7, 1... yr AD

Let’s filter for Sr/Ca and d18O records:

df3_filt = df3[df3['paleoData_variableName'].isin(['SrCa', 'd18O'])].drop_duplicates(subset='TSID', keep='first')
df3_filt.head()
dataSetName compilationName archiveType geo_meanLat geo_meanLon geo_meanElev paleoData_variableName paleoData_standardName paleoData_values paleoData_units paleoData_proxy paleoData_proxyGeneral paleoData_seasonality paleoData_interpName paleoData_interpRank TSID time_variableName time_standardName time_values time_units
0 KA17RYU01 CoralHydro2k Coral 28.300 130.000 -3.5 SrCa Sr/Ca [8.802, 9.472, 8.825, 9.355, 8.952, 9.297, 8.8... mmol/mol NaN NaN NaN NaN NaN KA17RYU01_SrCa year year [1578.58, 1579.08, 1579.58, 1580.08, 1580.58, ... yr AD
2 CH18YOA02 CoralHydro2k Coral 16.448 111.605 NaN SrCa Sr/Ca [8.58, 8.683, 8.609, 8.37, 8.38, 8.417, 8.584,... mmol/mol NaN NaN NaN NaN NaN CH18YOA02_SrCa year year [1987.92, 1988.085, 1988.25, 1988.42, 1988.585... yr AD
8 BO14HTI01 CoralHydro2k Coral 12.210 109.310 -3.6 d18O d18O [-5.4206, -5.3477, -5.1354, -5.7119, -5.9058, ... permil NaN NaN NaN NaN NaN BO14HTI01_d18O year year [1977.37, 1977.45, 1977.54, 1977.62, 1977.7, 1... yr AD
12 AB20MEN01 CoralHydro2k Coral -3.180 100.517 -0.7 d18O d18O [-5.7529527, -5.7139629, -5.6917258, -5.546536... permil NaN NaN NaN NaN NaN AB20MEN01_d18O year year [1936.375, 1936.45833, 1936.54167, 1936.625, 1... yr AD
16 DR99ABR01 CoralHydro2k Coral -22.100 153.000 -10.0 d18O d18O [-4.008, -3.977, -3.853, -4.113, -3.913, -4.01... permil NaN NaN NaN NaN NaN DR99ABR01_d18O year year [1638.3, 1639.3, 1640.3, 1641.3, 1642.3, 1643.... yr AD
df3_filt.info()
<class 'pandas.core.frame.DataFrame'>
Index: 203 entries, 0 to 780
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   dataSetName             203 non-null    object 
 1   compilationName         203 non-null    object 
 2   archiveType             203 non-null    object 
 3   geo_meanLat             203 non-null    float64
 4   geo_meanLon             203 non-null    float64
 5   geo_meanElev            166 non-null    float64
 6   paleoData_variableName  203 non-null    object 
 7   paleoData_standardName  203 non-null    object 
 8   paleoData_values        203 non-null    object 
 9   paleoData_units         203 non-null    object 
 10  paleoData_proxy         0 non-null      float64
 11  paleoData_proxyGeneral  0 non-null      float64
 12  paleoData_seasonality   0 non-null      float64
 13  paleoData_interpName    0 non-null      float64
 14  paleoData_interpRank    0 non-null      float64
 15  TSID                    203 non-null    object 
 16  time_variableName       203 non-null    object 
 17  time_standardName       203 non-null    object 
 18  time_values             203 non-null    object 
 19  time_units              203 non-null    object 
dtypes: float64(8), object(12)
memory usage: 33.3+ KB

The rest of the workflow is the same as the one presented in the previous notebook.