Assembling the proxy database from the LiPDGraph#
Goal#
The goal of this notebook is to showcase data querying on the LiPDGraph to create the proxy database used by cfr for data assimilation.
Pre-requisite#
An understanding of the workflow presented in assembling the database from the LiPDverse notebook.
An understanding of the Linked Paleo Data format (LiPD).
A basic understanding of Dynamic Time Warping.
Reading time#
10 min
Data Description#
This notebook makes use of the LiPDGraph
database, a graph database that contains datasets from various working groups. For a list of these compilations, see this page.
In particular, we will be using the PAGES2k Temperature, iso2k, and CoralHydro2k compilations for this demonstration.
PAGES2k: PAGES2k Consortium, Emile-Geay, J., McKay, N., et al. A global multiproxy database for temperature reconstructions of the Common Era. Sci Data 4, 170088 (2017). doi:10.1038/sdata.2017.88
iso2k: Konecky, B. L., McKay, N. P., Churakova (Sidorova), O. V., Comas-Bru, L., Dassié, E. P., DeLong, K. L., Falster, G. M., Fischer, M. J., Jones, M. D., Jonkers, L., Kaufman, D. S., Leduc, G., Managave, S. R., Martrat, B., Opel, T., Orsi, A. J., Partin, J. W., Sayani, H. R., Thomas, E. K., Thompson, D. M., Tyler, J. J., Abram, N. J., Atwood, A. R., Cartapanis, O., Conroy, J. L., Curran, M. A., Dee, S. G., Deininger, M., Divine, D. V., Kern, Z., Porter, T. J., Stevenson, S. L., von Gunten, L., and Iso2k Project Members: The Iso2k database: a global compilation of paleo-δ18O and δ2H records to aid understanding of Common Era climate, Earth Syst. Sci. Data, 12, 2261–2288, https://doi.org/10.5194/essd-12-2261-2020, 2020.
CoralHydro2k: Walter, R. M., H. R. Sayani, T. Felis, K. M. Cobb, N. J. Abram, A. K. Arzey, A. R. Atwood, L. D. Brenner, E. P. Dassié, K. L. DeLong, B. Ellis, J. Emile-Geay, M. J. Fischer, N. F. Goodkin, J. A. Hargreaves, K. H. Kilbourne, H. Krawczyk, N. P. McKay, A. L. Moore, S. A. Murty, M. R. Ong, R. D. Ramos, E. V. Reed, D. Samanta, S. C. Sanchez, J. Zinke, and the PAGES CoralHydro2k Project Members (2023), The CoralHydro2k database: a global, actively curated compilation of coral δ18O and Sr / Ca proxy records of tropical ocean hydrology and temperature for the Common Era, Earth System Science Data, 15(5), 2081–2116, doi:10.5194/essd-15-2081-2023.
Let’s import the needed packages:
import os

# General data manipulation libraries
import numpy as np
import pandas as pd

# To query the graph and parse the response
import json
import requests
import io
Data Retrieval#
Following this tutorial, we perform a SPARQL query on the LiPDGraph to assemble the most current dataset.
We point to the dynamic version of the graph to get the most up-to-date files. For reproducibility, you can change the URL to one of the versioned graphs.
url = 'https://linkedearth.graphdb.mint.isi.edu/repositories/LiPDVerse-dynamic'
Querying the PAGES2k datasets#
In the first query, we retrieve all the data in the PAGES2k compilation that were tagged for use in the temperature analysis for this compilation, along with relevant metadata.
We will return the following information if present:
dataSetName: The name of the dataset
compilationName: The name of the compilation. This will be used for filtering later on.
archiveType: The type of archive on which the measurements were made
Latitude, Longitude, and Elevation: The geographical coordinates for the site
paleoData_variableName: The name of the paleo variable as originally reported by the investigator
paleoData_standardName: The standard name for the paleo variable
paleoData_values: The values for the paleo variable
paleoData_units: The units associated with the paleo variable
paleoData_proxy: The proxy used to infer paleoenvironmental conditions
paleoData_proxyGeneral: A generalized grouping for the proxy
paleoData_interpName: The interpretation for the proxy record
paleoData_interpRank: The rank for the interpretation
paleoData_seasonality: The seasonality for the proxy
TSID: The unique ID for the timeseries
time_variableName: The name of the time variable. Since we are dealing with the Common Era, these should all be "year".
time_standardName: The standard name for the time variable. Since we are dealing with the Common Era, these should all be "year", so let's use this as a filter.
time_values: The values for the time axis
time_units: The units for the time axis
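Before issuing the full query, it can help to confirm that the endpoint is reachable. Below is a minimal sketch using a small count query; `count_datasets` is a helper name introduced here for illustration, not part of any library:

```python
import io

import pandas as pd
import requests


def count_datasets(url):
    """Return the number of distinct datasets in the graph (quick sanity check)."""
    count_query = """PREFIX le: <http://linked.earth/ontology#>
    SELECT (COUNT(DISTINCT ?ds) AS ?n) WHERE { ?ds a le:Dataset . }"""
    response = requests.post(url, data={'query': count_query})
    response.raise_for_status()  # surface HTTP errors early
    # The endpoint returns CSV: a header row ('n') and one value
    return int(pd.read_csv(io.StringIO(response.text))['n'].iloc[0])
```

A non-zero count confirms the repository answered; an exception usually points to a connectivity or URL problem.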
query1 = """ PREFIX le: <http://linked.earth/ontology#>
PREFIX wgs84: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?dataSetName ?compilationName ?archiveType ?geo_meanLat ?geo_meanLon ?geo_meanElev
?paleoData_variableName ?paleoData_standardName ?paleoData_values ?paleoData_units
?paleoData_proxy ?paleoData_proxyGeneral ?paleoData_seasonality ?paleoData_interpName ?paleoData_interpRank
?TSID ?time_variableName ?time_standardName ?time_values
?time_units where{
?ds a le:Dataset .
?ds le:hasName ?dataSetName .
# Get the archive
OPTIONAL{
?ds le:hasArchiveType ?archiveTypeObj .
?archiveTypeObj rdfs:label ?archiveType .
}
# Geographical information
?ds le:hasLocation ?loc .
OPTIONAL{?loc wgs84:lat ?geo_meanLat .}
OPTIONAL{?loc wgs84:long ?geo_meanLon .}
OPTIONAL{?loc wgs84:alt ?geo_meanElev .}
# PaleoData
?ds le:hasPaleoData ?data .
?data le:hasMeasurementTable ?table .
?table le:hasVariable ?var .
# Name
?var le:hasName ?paleoData_variableName .
?var le:hasStandardVariable ?variable_obj .
?variable_obj rdfs:label ?paleoData_standardName .
#Values
?var le:hasValues ?paleoData_values .
#Seasonality
OPTIONAL{?var le:seasonality ?paleoData_seasonality} .
#Units
OPTIONAL{
?var le:hasUnits ?paleoData_unitsObj .
?paleoData_unitsObj rdfs:label ?paleoData_units .
}
#Proxy information
OPTIONAL{
?var le:hasProxy ?paleoData_proxyObj .
?paleoData_proxyObj rdfs:label ?paleoData_proxy .
}
OPTIONAL{
?var le:hasProxyGeneral ?paleoData_proxyGeneralObj .
?paleoData_proxyGeneralObj rdfs:label ?paleoData_proxyGeneral .
}
# Compilation
?var le:partOfCompilation ?compilation .
?compilation le:hasName ?compilationName .
FILTER (?compilationName = "Pages2kTemperature").
?var le:useInGlobalTemperatureAnalysis True .
# TSiD (should all have them)
OPTIONAL{
?var le:hasVariableId ?TSID
} .
# Interpretation (might be an optional field)
OPTIONAL{
?var le:hasInterpretation ?interp .
?interp le:hasVariable ?interpvar .
BIND(REPLACE(STR(?interpvar), "http://linked.earth/ontology/interpretation#", "") AS ?interpretedVariable_Fallback)
OPTIONAL { ?interpvar rdfs:label ?interpretedVariable_Label } # Use a temporary variable for the label
BIND(COALESCE(?interpretedVariable_Label, ?interpretedVariable_Fallback) AS ?paleoData_interpName) # COALESCE into the final variable
}
OPTIONAL{
?var le:hasInterpretation ?interp .
?interp le:hasRank ?paleoData_interpRank .}
OPTIONAL{
?var le:hasInterpretation ?interp .
?interp le:hasSeasonality ?seasonalityURI .
BIND(REPLACE(STR(?seasonalityURI), "http://linked.earth/ontology/interpretation#", "") AS ?seasonality_Fallback)
OPTIONAL { ?seasonalityURI rdfs:label ?seasonalityLabel }
BIND(COALESCE(?seasonalityLabel, ?seasonality_Fallback) AS ?paleoData_seasonality)
}
#Time information
?table le:hasVariable ?timevar .
?timevar le:hasName ?time_variableName .
?timevar le:hasStandardVariable ?timevar_obj .
?timevar_obj rdfs:label ?time_standardName .
VALUES ?time_standardName {"year"} .
?timevar le:hasValues ?time_values .
OPTIONAL{
?timevar le:hasUnits ?time_unitsObj .
?time_unitsObj rdfs:label ?time_units .
}
}"""
This query should look familiar, and for good reason: it is the same query we performed on the downloaded LiPD files from the LiPDverse in the previous notebook.
That is because PyLiPD creates an internal database on your machine that is similar in structure (i.e., they use the same mapping, the LinkedEarth ontology) to describe datasets.
Let’s run our query:
response = requests.post(url, data={'query': query1})
data = io.StringIO(response.text)
df1 = pd.read_csv(data, sep=",")
# Drop duplicates before decoding, while the value columns are still strings
df1 = df1.drop_duplicates()
# The value columns are returned as JSON-encoded strings; decode them to lists
df1['paleoData_values'] = df1['paleoData_values'].apply(lambda row: json.loads(row) if isinstance(row, str) else row)
df1['time_values'] = df1['time_values'].apply(lambda row: json.loads(row) if isinstance(row, str) else row)
df1.head()
dataSetName | compilationName | archiveType | geo_meanLat | geo_meanLon | geo_meanElev | paleoData_variableName | paleoData_standardName | paleoData_values | paleoData_units | paleoData_proxy | paleoData_proxyGeneral | paleoData_seasonality | paleoData_interpName | paleoData_interpRank | TSID | time_variableName | time_standardName | time_values | time_units | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Ant-WDC05A.Steig.2013 | Pages2kTemperature | Glacier ice | -79.46 | -112.09 | 1806.0 | d18O | d18O | [-33.32873325, -35.6732, -33.1574, -34.2854, -... | permil | d18O | NaN | Annual | temperature | NaN | Ant_030 | year | year | [2005, 2004, 2003, 2002, 2001, 2000, 1999, 199... | yr AD |
16 | Arc-Arjeplog.Bjorklund.2014 | Pages2kTemperature | Wood | 66.30 | 18.20 | 800.0 | density | density | [-0.829089212152348, -0.733882889924006, -0.89... | NaN | maximum latewood density | dendrophysical | Jun-Aug | temperature | NaN | Arc_060 | year | year | [1200, 1201, 1202, 1203, 1204, 1205, 1206, 120... | yr AD |
20 | NAm-SmithersSkiArea.Schweingruber.1996 | Pages2kTemperature | Wood | 54.90 | -127.30 | 1200.0 | MXD | MXD | [1.041, 1.062, 1.051, 0.997, 0.993, 0.959, 0.9... | NaN | maximum latewood density | dendrophysical | Summer | temperature | NaN | NAm_1120 | year | year | [1680, 1681, 1682, 1683, 1684, 1685, 1686, 168... | yr AD |
28 | Ocn-Mayotte.Zinke.2008 | Pages2kTemperature | Coral | -12.65 | 45.10 | -3.0 | d18O | d18O | [-4.72, -4.95, -5.45, -5.157, -5.05, -5.065, -... | permil | d18O | NaN | subannual | temperature | NaN | Ocean2kHR_002 | year | year | [1993.62, 1993.46, 1993.29, 1993.12, 1992.96, ... | yr AD |
36 | Ocn-LosRoques.Hetzinger.2008 | Pages2kTemperature | Coral | 11.77 | -66.75 | -2.0 | d18O | d18O | [-3.8123, -3.73082, -3.74912, -3.6656, -3.6995... | permil | d18O | NaN | subannual | temperature | NaN | Ocean2kHR_096 | year | year | [1917.79, 1917.87, 1917.96, 1918.04, 1918.12, ... | yr AD |
df1.info()
<class 'pandas.core.frame.DataFrame'>
Index: 690 entries, 0 to 15328
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 dataSetName 690 non-null object
1 compilationName 690 non-null object
2 archiveType 690 non-null object
3 geo_meanLat 690 non-null float64
4 geo_meanLon 690 non-null float64
5 geo_meanElev 683 non-null float64
6 paleoData_variableName 690 non-null object
7 paleoData_standardName 690 non-null object
8 paleoData_values 690 non-null object
9 paleoData_units 282 non-null object
10 paleoData_proxy 690 non-null object
11 paleoData_proxyGeneral 464 non-null object
12 paleoData_seasonality 690 non-null object
13 paleoData_interpName 690 non-null object
14 paleoData_interpRank 1 non-null float64
15 TSID 690 non-null object
16 time_variableName 690 non-null object
17 time_standardName 690 non-null object
18 time_values 690 non-null object
19 time_units 690 non-null object
dtypes: float64(4), object(16)
memory usage: 113.2+ KB
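The same request-and-parse steps recur for each compilation below, so they can be factored into a small helper. This is a sketch under the assumptions of the cells above (the endpoint returns CSV); `query_to_df` is a name introduced here, not part of any library:

```python
import io
import json

import pandas as pd
import requests


def query_to_df(url, query, list_cols=('paleoData_values', 'time_values')):
    """POST a SPARQL query to the endpoint and return a cleaned DataFrame."""
    response = requests.post(url, data={'query': query})
    response.raise_for_status()
    df = pd.read_csv(io.StringIO(response.text))
    # Duplicates must be dropped before decoding, while the value columns
    # are still hashable strings
    df = df.drop_duplicates()
    for col in list_cols:
        if col in df.columns:
            df[col] = df[col].apply(
                lambda v: json.loads(v) if isinstance(v, str) else v
            )
    return df
```

With such a helper, each compilation reduces to a single call, e.g. `df1 = query_to_df(url, query1)`.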
Querying the iso2k dataset#
Let’s repeat our query for the datasets in the iso2k compilation:
query2 = """ PREFIX le: <http://linked.earth/ontology#>
PREFIX wgs84: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?dataSetName ?compilationName ?archiveType ?geo_meanLat ?geo_meanLon ?geo_meanElev
?paleoData_variableName ?paleoData_standardName ?paleoData_values ?paleoData_units
?paleoData_proxy ?paleoData_proxyGeneral ?paleoData_seasonality ?paleoData_interpName ?paleoData_interpRank
?TSID ?time_variableName ?time_standardName ?time_values
?time_units where{
?ds a le:Dataset .
?ds le:hasName ?dataSetName .
# Get the archive
OPTIONAL{
?ds le:hasArchiveType ?archiveTypeObj .
?archiveTypeObj rdfs:label ?archiveType .
}
# Geographical information
?ds le:hasLocation ?loc .
OPTIONAL{?loc wgs84:lat ?geo_meanLat .}
OPTIONAL{?loc wgs84:long ?geo_meanLon .}
OPTIONAL{?loc wgs84:alt ?geo_meanElev .}
# PaleoData
?ds le:hasPaleoData ?data .
?data le:hasMeasurementTable ?table .
?table le:hasVariable ?var .
# Name
?var le:hasName ?paleoData_variableName .
?var le:hasStandardVariable ?variable_obj .
?variable_obj rdfs:label ?paleoData_standardName .
#Values
?var le:hasValues ?paleoData_values .
#Seasonality
OPTIONAL{?var le:seasonality ?paleoData_seasonality} .
#Units
OPTIONAL{
?var le:hasUnits ?paleoData_unitsObj .
?paleoData_unitsObj rdfs:label ?paleoData_units .
}
#Proxy information
OPTIONAL{
?var le:hasProxy ?paleoData_proxyObj .
?paleoData_proxyObj rdfs:label ?paleoData_proxy .
}
OPTIONAL{
?var le:hasProxyGeneral ?paleoData_proxyGeneralObj .
?paleoData_proxyGeneralObj rdfs:label ?paleoData_proxyGeneral .
}
# Compilation
?var le:partOfCompilation ?compilation .
?compilation le:hasName ?compilationName .
FILTER (?compilationName = "iso2k").
?var le:useInGlobalTemperatureAnalysis True .
# TSiD (should all have them)
OPTIONAL{
?var le:hasVariableId ?TSID
} .
# Interpretation (might be an optional field)
OPTIONAL{
?var le:hasInterpretation ?interp .
?interp le:hasVariable ?interpvar .
BIND(REPLACE(STR(?interpvar), "http://linked.earth/ontology/interpretation#", "") AS ?interpretedVariable_Fallback)
OPTIONAL { ?interpvar rdfs:label ?interpretedVariable_Label } # Use a temporary variable for the label
BIND(COALESCE(?interpretedVariable_Label, ?interpretedVariable_Fallback) AS ?paleoData_interpName) # COALESCE into the final variable
}
OPTIONAL{
?var le:hasInterpretation ?interp .
?interp le:hasRank ?paleoData_interpRank .}
OPTIONAL{
?var le:hasInterpretation ?interp .
?interp le:hasSeasonality ?seasonalityURI .
BIND(REPLACE(STR(?seasonalityURI), "http://linked.earth/ontology/interpretation#", "") AS ?seasonality_Fallback)
OPTIONAL { ?seasonalityURI rdfs:label ?seasonalityLabel }
BIND(COALESCE(?seasonalityLabel, ?seasonality_Fallback) AS ?paleoData_seasonality)
}
#Time information
?table le:hasVariable ?timevar .
?timevar le:hasName ?time_variableName .
?timevar le:hasStandardVariable ?timevar_obj .
?timevar_obj rdfs:label ?time_standardName .
VALUES ?time_standardName {"year"} .
?timevar le:hasValues ?time_values .
OPTIONAL{
?timevar le:hasUnits ?time_unitsObj .
?time_unitsObj rdfs:label ?time_units .
}
}"""
response = requests.post(url, data={'query': query2})
data = io.StringIO(response.text)
df2 = pd.read_csv(data, sep=",")
df2 = df2.drop_duplicates()
df2['paleoData_values'] = df2['paleoData_values'].apply(lambda row: json.loads(row) if isinstance(row, str) else row)
df2['time_values'] = df2['time_values'].apply(lambda row: json.loads(row) if isinstance(row, str) else row)
df2.head()
dataSetName | compilationName | archiveType | geo_meanLat | geo_meanLon | geo_meanElev | paleoData_variableName | paleoData_standardName | paleoData_values | paleoData_units | paleoData_proxy | paleoData_proxyGeneral | paleoData_seasonality | paleoData_interpName | paleoData_interpRank | TSID | time_variableName | time_standardName | time_values | time_units | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | LS16STCL | iso2k | Lake sediment | 50.83 | -116.39 | 1126.0 | d18O | d18O | [-7.81, -5.91, -9.03, -5.35, -5.61, -5.98, -5.... | permil | d18O | NaN | Winter | NaN | 5.0 | LPD7dc5b9ba-dup-dup | year | year | [2009.0, 2008.3, 2007.8, 2007.4, 2007.0, 2006.... | yr AD |
4 | LS16STCL | iso2k | Lake sediment | 50.83 | -116.39 | 1126.0 | d18O | d18O | [-7.81, -5.91, -9.03, -5.35, -5.61, -5.98, -5.... | permil | d18O | NaN | NaN | NaN | 3.0 | LPD7dc5b9ba-dup-dup | year | year | [2009.0, 2008.3, 2007.8, 2007.4, 2007.0, 2006.... | yr AD |
6 | LS16STCL | iso2k | Lake sediment | 50.83 | -116.39 | 1126.0 | d18O | d18O | [-7.81, -5.91, -9.03, -5.35, -5.61, -5.98, -5.... | permil | d18O | NaN | NaN | NaN | 4.0 | LPD7dc5b9ba-dup-dup | year | year | [2009.0, 2008.3, 2007.8, 2007.4, 2007.0, 2006.... | yr AD |
8 | LS16STCL | iso2k | Lake sediment | 50.83 | -116.39 | 1126.0 | d18O | d18O | [-7.81, -5.91, -9.03, -5.35, -5.61, -5.98, -5.... | permil | d18O | NaN | NaN | NaN | 2.0 | LPD7dc5b9ba-dup-dup | year | year | [2009.0, 2008.3, 2007.8, 2007.4, 2007.0, 2006.... | yr AD |
10 | LS16STCL | iso2k | Lake sediment | 50.83 | -116.39 | 1126.0 | d18O | d18O | [-7.81, -5.91, -9.03, -5.35, -5.61, -5.98, -5.... | permil | d18O | NaN | NaN | NaN | 1.0 | LPD7dc5b9ba-dup-dup | year | year | [2009.0, 2008.3, 2007.8, 2007.4, 2007.0, 2006.... | yr AD |
Let’s have a look at the interpretation field:
df2['paleoData_interpName'].unique()
array([nan, 'temperature', 'seawaterIsotope', 'precipitationIsotope',
'hydrologicBalance', 'salinity'], dtype=object)
And let's keep only the records whose interpretation is temperature:
df2_filt = df2[df2['paleoData_interpName']=='temperature'].drop_duplicates(subset='TSID', keep='first')
df2_filt.head()
dataSetName | compilationName | archiveType | geo_meanLat | geo_meanLon | geo_meanElev | paleoData_variableName | paleoData_standardName | paleoData_values | paleoData_units | paleoData_proxy | paleoData_proxyGeneral | paleoData_seasonality | paleoData_interpName | paleoData_interpRank | TSID | time_variableName | time_standardName | time_values | time_units | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
48 | CO00URMA | iso2k | Coral | 0.9330 | 173.0000 | 6.0 | d18O | d18O | [-4.8011, -4.725, -4.6994, -4.86, -5.0886, -5.... | permil | d18O | NaN | subannual | temperature | 1.0 | Ocean2kHR_177_iso2k | year | year | [1994.5, 1994.33, 1994.17, 1994.0, 1993.83, 19... | yr AD |
108 | CO05KUBE | iso2k | Coral | 32.4670 | -64.7000 | -12.0 | d18O | d18O | [-4.15, -3.66, -3.69, -4.07, -3.95, -4.12, -3.... | permil | d18O | NaN | subannual | temperature | 1.0 | Ocean2kHR_105_iso2k | year | year | [1983.21, 1983.13, 1983.04, 1982.96, 1982.88, ... | yr AD |
124 | IC13THQU | iso2k | Glacier ice | -13.9333 | -70.8333 | 5670.0 | d18O | d18O | [-18.5905, -16.3244, -16.2324, -17.0112, -18.6... | permil | d18O | NaN | Winter | temperature | 4.0 | SAm_035_iso2k | year | year | [2009, 2008, 2007, 2006, 2005, 2004, 2003, 200... | yr AD |
144 | CO01TUNG | iso2k | Coral | -5.2170 | 145.8170 | -3.0 | d18O | d18O | [-4.875, -4.981, -5.166, -5.06, -4.942, -4.919... | permil | d18O | NaN | subannual | temperature | 3.0 | Ocean2kHR_141_iso2k | year | year | [1993.042, 1992.792, 1992.542, 1992.292, 1992.... | yr AD |
166 | CO01TUNG | iso2k | Coral | -5.2170 | 145.8170 | -3.0 | d18O | d18O | [-4.827, -4.786, -4.693, -4.852, -4.991, -4.90... | permil | d18O | NaN | subannual | temperature | 2.0 | Ocean2kHR_140_iso2k | year | year | [1993.042, 1992.792, 1992.542, 1992.292, 1992.... | yr AD |
df2_filt.info()
<class 'pandas.core.frame.DataFrame'>
Index: 49 entries, 48 to 1202
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 dataSetName 49 non-null object
1 compilationName 49 non-null object
2 archiveType 49 non-null object
3 geo_meanLat 49 non-null float64
4 geo_meanLon 49 non-null float64
5 geo_meanElev 49 non-null float64
6 paleoData_variableName 49 non-null object
7 paleoData_standardName 49 non-null object
8 paleoData_values 49 non-null object
9 paleoData_units 49 non-null object
10 paleoData_proxy 49 non-null object
11 paleoData_proxyGeneral 0 non-null float64
12 paleoData_seasonality 37 non-null object
13 paleoData_interpName 49 non-null object
14 paleoData_interpRank 47 non-null float64
15 TSID 49 non-null object
16 time_variableName 49 non-null object
17 time_standardName 49 non-null object
18 time_values 49 non-null object
19 time_units 49 non-null object
dtypes: float64(5), object(15)
memory usage: 8.0+ KB
Querying the CoralHydro2k database#
Let’s do the same query with CoralHydro2k:
query3 = """ PREFIX le: <http://linked.earth/ontology#>
PREFIX wgs84: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?dataSetName ?compilationName ?archiveType ?geo_meanLat ?geo_meanLon ?geo_meanElev
?paleoData_variableName ?paleoData_standardName ?paleoData_values ?paleoData_units
?paleoData_proxy ?paleoData_proxyGeneral ?paleoData_seasonality ?paleoData_interpName ?paleoData_interpRank
?TSID ?time_variableName ?time_standardName ?time_values
?time_units where{
?ds a le:Dataset .
?ds le:hasName ?dataSetName .
# Get the archive
OPTIONAL{
?ds le:hasArchiveType ?archiveTypeObj .
?archiveTypeObj rdfs:label ?archiveType .
}
# Geographical information
?ds le:hasLocation ?loc .
OPTIONAL{?loc wgs84:lat ?geo_meanLat .}
OPTIONAL{?loc wgs84:long ?geo_meanLon .}
OPTIONAL{?loc wgs84:alt ?geo_meanElev .}
# PaleoData
?ds le:hasPaleoData ?data .
?data le:hasMeasurementTable ?table .
?table le:hasVariable ?var .
# Name
?var le:hasName ?paleoData_variableName .
?var le:hasStandardVariable ?variable_obj .
?variable_obj rdfs:label ?paleoData_standardName .
#Values
?var le:hasValues ?paleoData_values .
#Seasonality
OPTIONAL{?var le:seasonality ?paleoData_seasonality} .
#Units
OPTIONAL{
?var le:hasUnits ?paleoData_unitsObj .
?paleoData_unitsObj rdfs:label ?paleoData_units .
}
#Proxy information
OPTIONAL{
?var le:hasProxy ?paleoData_proxyObj .
?paleoData_proxyObj rdfs:label ?paleoData_proxy .
}
OPTIONAL{
?var le:hasProxyGeneral ?paleoData_proxyGeneralObj .
?paleoData_proxyGeneralObj rdfs:label ?paleoData_proxyGeneral .
}
# Compilation
?var le:partOfCompilation ?compilation .
?compilation le:hasName ?compilationName .
FILTER (?compilationName = "CoralHydro2k").
# TSiD (should all have them)
OPTIONAL{
?var le:hasVariableId ?TSID
} .
# Interpretation (might be an optional field)
OPTIONAL{
?var le:hasInterpretation ?interp .
?interp le:hasVariable ?interpvar .
BIND(REPLACE(STR(?interpvar), "http://linked.earth/ontology/interpretation#", "") AS ?interpretedVariable_Fallback)
OPTIONAL { ?interpvar rdfs:label ?interpretedVariable_Label } # Use a temporary variable for the label
BIND(COALESCE(?interpretedVariable_Label, ?interpretedVariable_Fallback) AS ?paleoData_interpName) # COALESCE into the final variable
}
OPTIONAL{
?var le:hasInterpretation ?interp .
?interp le:hasRank ?paleoData_interpRank .}
OPTIONAL{
?var le:hasInterpretation ?interp .
?interp le:hasSeasonality ?seasonalityURI .
BIND(REPLACE(STR(?seasonalityURI), "http://linked.earth/ontology/interpretation#", "") AS ?seasonality_Fallback)
OPTIONAL { ?seasonalityURI rdfs:label ?seasonalityLabel }
BIND(COALESCE(?seasonalityLabel, ?seasonality_Fallback) AS ?paleoData_seasonality)
}
#Time information
?table le:hasVariable ?timevar .
?timevar le:hasName ?time_variableName .
?timevar le:hasStandardVariable ?timevar_obj .
?timevar_obj rdfs:label ?time_standardName .
VALUES ?time_standardName {"year"} .
?timevar le:hasValues ?time_values .
OPTIONAL{
?timevar le:hasUnits ?time_unitsObj .
?time_unitsObj rdfs:label ?time_units .
}
}"""
response = requests.post(url, data={'query': query3})
data = io.StringIO(response.text)
df3 = pd.read_csv(data, sep=",")
df3 = df3.drop_duplicates()
df3['paleoData_values'] = df3['paleoData_values'].apply(lambda row: json.loads(row) if isinstance(row, str) else row)
df3['time_values'] = df3['time_values'].apply(lambda row: json.loads(row) if isinstance(row, str) else row)
df3.head()
dataSetName | compilationName | archiveType | geo_meanLat | geo_meanLon | geo_meanElev | paleoData_variableName | paleoData_standardName | paleoData_values | paleoData_units | paleoData_proxy | paleoData_proxyGeneral | paleoData_seasonality | paleoData_interpName | paleoData_interpRank | TSID | time_variableName | time_standardName | time_values | time_units | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | KA17RYU01 | CoralHydro2k | Coral | 28.300 | 130.000 | -3.5 | SrCa | Sr/Ca | [8.802, 9.472, 8.825, 9.355, 8.952, 9.297, 8.8... | mmol/mol | NaN | NaN | NaN | NaN | NaN | KA17RYU01_SrCa | year | year | [1578.58, 1579.08, 1579.58, 1580.08, 1580.58, ... | yr AD |
2 | CH18YOA02 | CoralHydro2k | Coral | 16.448 | 111.605 | NaN | SrCa | Sr/Ca | [8.58, 8.683, 8.609, 8.37, 8.38, 8.417, 8.584,... | mmol/mol | NaN | NaN | NaN | NaN | NaN | CH18YOA02_SrCa | year | year | [1987.92, 1988.085, 1988.25, 1988.42, 1988.585... | yr AD |
4 | FL17DTO02 | CoralHydro2k | Coral | 24.699 | -82.799 | -3.0 | Sr/Ca | Sr/Ca | [9.159, 9.257, 9.245, 9.166, 9.045, 9.013, 8.9... | mmol/mol | NaN | NaN | NaN | NaN | NaN | FL17DTO02_SrCa | year | year | [1837.04, 1837.13, 1837.21, 1837.29, 1837.38, ... | yr AD |
6 | BO14HTI01 | CoralHydro2k | Coral | 12.210 | 109.310 | -3.6 | Sr/Ca | Sr/Ca | [9.2, 9.17, 9.11, 9.02, 8.95, 8.99, 9.06, 9.1,... | mmol/mol | NaN | NaN | NaN | NaN | NaN | BO14HTI01_SrCa | year | year | [1600.04, 1600.12, 1600.2, 1600.28, 1600.37, 1... | yr AD |
8 | BO14HTI01 | CoralHydro2k | Coral | 12.210 | 109.310 | -3.6 | d18O | d18O | [-5.4206, -5.3477, -5.1354, -5.7119, -5.9058, ... | permil | NaN | NaN | NaN | NaN | NaN | BO14HTI01_d18O | year | year | [1977.37, 1977.45, 1977.54, 1977.62, 1977.7, 1... | yr AD |
Let’s filter for Sr/Ca and d18O records:
df3_filt = df3[df3['paleoData_variableName'].isin(['SrCa', 'd18O'])].drop_duplicates(subset='TSID', keep='first')
df3_filt.head()
dataSetName | compilationName | archiveType | geo_meanLat | geo_meanLon | geo_meanElev | paleoData_variableName | paleoData_standardName | paleoData_values | paleoData_units | paleoData_proxy | paleoData_proxyGeneral | paleoData_seasonality | paleoData_interpName | paleoData_interpRank | TSID | time_variableName | time_standardName | time_values | time_units | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | KA17RYU01 | CoralHydro2k | Coral | 28.300 | 130.000 | -3.5 | SrCa | Sr/Ca | [8.802, 9.472, 8.825, 9.355, 8.952, 9.297, 8.8... | mmol/mol | NaN | NaN | NaN | NaN | NaN | KA17RYU01_SrCa | year | year | [1578.58, 1579.08, 1579.58, 1580.08, 1580.58, ... | yr AD |
2 | CH18YOA02 | CoralHydro2k | Coral | 16.448 | 111.605 | NaN | SrCa | Sr/Ca | [8.58, 8.683, 8.609, 8.37, 8.38, 8.417, 8.584,... | mmol/mol | NaN | NaN | NaN | NaN | NaN | CH18YOA02_SrCa | year | year | [1987.92, 1988.085, 1988.25, 1988.42, 1988.585... | yr AD |
8 | BO14HTI01 | CoralHydro2k | Coral | 12.210 | 109.310 | -3.6 | d18O | d18O | [-5.4206, -5.3477, -5.1354, -5.7119, -5.9058, ... | permil | NaN | NaN | NaN | NaN | NaN | BO14HTI01_d18O | year | year | [1977.37, 1977.45, 1977.54, 1977.62, 1977.7, 1... | yr AD |
12 | AB20MEN01 | CoralHydro2k | Coral | -3.180 | 100.517 | -0.7 | d18O | d18O | [-5.7529527, -5.7139629, -5.6917258, -5.546536... | permil | NaN | NaN | NaN | NaN | NaN | AB20MEN01_d18O | year | year | [1936.375, 1936.45833, 1936.54167, 1936.625, 1... | yr AD |
16 | DR99ABR01 | CoralHydro2k | Coral | -22.100 | 153.000 | -10.0 | d18O | d18O | [-4.008, -3.977, -3.853, -4.113, -3.913, -4.01... | permil | NaN | NaN | NaN | NaN | NaN | DR99ABR01_d18O | year | year | [1638.3, 1639.3, 1640.3, 1641.3, 1642.3, 1643.... | yr AD |
df3_filt.info()
<class 'pandas.core.frame.DataFrame'>
Index: 203 entries, 0 to 780
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 dataSetName 203 non-null object
1 compilationName 203 non-null object
2 archiveType 203 non-null object
3 geo_meanLat 203 non-null float64
4 geo_meanLon 203 non-null float64
5 geo_meanElev 166 non-null float64
6 paleoData_variableName 203 non-null object
7 paleoData_standardName 203 non-null object
8 paleoData_values 203 non-null object
9 paleoData_units 203 non-null object
10 paleoData_proxy 0 non-null float64
11 paleoData_proxyGeneral 0 non-null float64
12 paleoData_seasonality 0 non-null float64
13 paleoData_interpName 0 non-null float64
14 paleoData_interpRank 0 non-null float64
15 TSID 203 non-null object
16 time_variableName 203 non-null object
17 time_standardName 203 non-null object
18 time_values 203 non-null object
19 time_units 203 non-null object
dtypes: float64(8), object(12)
memory usage: 33.3+ KB
The rest of the workflow is the same as the one presented in the previous notebook.
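As a pointer, the three tables assembled here share the same columns, so they can be stacked into a single frame before being converted into a proxy database. A minimal sketch; `combine_proxy_tables` is a hypothetical helper name, not part of cfr:

```python
import pandas as pd


def combine_proxy_tables(frames):
    """Stack per-compilation query results into one proxy table.

    Assumes all frames share the same columns (true for the queries above)
    and that TSID uniquely identifies a timeseries.
    """
    proxy_df = pd.concat(frames, ignore_index=True)
    # Keep one row per timeseries in case a record appears in two compilations
    return proxy_df.drop_duplicates(subset='TSID', keep='first')
```

For example, `proxy_df = combine_proxy_tables([df1, df2_filt, df3_filt])` yields a single table of roughly 900 records to carry into the next steps.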