Assembling the proxy database from the LiPDVerse#

⚠️ Warning

This notebook contains a cell that starts a widget to help with identifying duplicates in the database. We strongly advise that you do not run the widget yourself unless you intend to go through the cleaning process. A list of removed datasets (TSiDs) and the reasons they were removed or kept is stored in CSV files in this repository. The resulting cleaned database has also been saved to a pickle file.

❗ DO NOT RUN ALL!!!!!!

Authors#

Deborah Khider & Julien Emile-Geay

Goal#

The goal of this notebook is to showcase data querying on the LiPDverse to create the proxy database.

Data Description#

This notebook makes use of the LiPDVerse database, which contains datasets from various working groups. For a list of these compilations, see this page.

In particular, we will be using the PAGES2kTemperature, iso2k, and CoralHydro2k compilations for this demonstration.

  • PAGES2k: PAGES2k Consortium., Emile-Geay, J., McKay, N. et al. A global multiproxy database for temperature reconstructions of the Common Era. Sci Data 4, 170088 (2017). doi:10.1038/sdata.2017.88

  • iso2k: Konecky, B. L., McKay, N. P., Churakova (Sidorova), O. V., Comas-Bru, L., Dassié, E. P., DeLong, K. L., Falster, G. M., Fischer, M. J., Jones, M. D., Jonkers, L., Kaufman, D. S., Leduc, G., Managave, S. R., Martrat, B., Opel, T., Orsi, A. J., Partin, J. W., Sayani, H. R., Thomas, E. K., Thompson, D. M., Tyler, J. J., Abram, N. J., Atwood, A. R., Cartapanis, O., Conroy, J. L., Curran, M. A., Dee, S. G., Deininger, M., Divine, D. V., Kern, Z., Porter, T. J., Stevenson, S. L., von Gunten, L., and Iso2k Project Members: The Iso2k database: a global compilation of paleo-δ18O and δ2H records to aid understanding of Common Era climate, Earth Syst. Sci. Data, 12, 2261–2288, https://doi.org/10.5194/essd-12-2261-2020, 2020.

  • CoralHydro2k: Walter, R. M., H. R. Sayani, T. Felis, K. M. Cobb, N. J. Abram, A. K. Arzey, A. R. Atwood, L. D. Brenner, E. P. Dassié, K. L. DeLong, B. Ellis, J. Emile-Geay, M. J. Fischer, N. F. Goodkin, J. A. Hargreaves, K. H. Kilbourne, H. Krawczyk, N. P. McKay, A. L. Moore, S. A. Murty, M. R. Ong, R. D. Ramos, E. V. Reed, D. Samanta, S. C. Sanchez, J. Zinke, and the PAGES CoralHydro2k Project Members (2023), The CoralHydro2k database: a global, actively curated compilation of coral δ18O and Sr/Ca proxy records of tropical ocean hydrology and temperature for the Common Era, Earth System Science Data, 15(5), 2081–2116, doi:10.5194/essd-15-2081-2023.

Let’s start by downloading the datasets.

⚠️ Warning: You only need to run the cell below once. It will download the data locally. The rest of the notebook assumes that the data is organized according to the output of the cell below.
# URL Links to download the data

pages2k_url = "https://lipdverse.org/Pages2kTemperature/current_version/Pages2kTemperature2_2_0.zip"
iso2k_url = "https://lipdverse.org/iso2k/current_version/iso2k1_1_2.zip"
coral2k_url ="https://lipdverse.org/CoralHydro2k/current_version/CoralHydro2k1_0_1.zip"

urls = [pages2k_url,iso2k_url,coral2k_url]
folders = ['Pages2k', 'iso2k', 'coral2k']

# import the necessary packages
import os
import requests
import zipfile
from pathlib import Path

## Loop over the three compilations
# Define the folder path and URL for each
for idx,url in enumerate(urls):
    folder_path = Path(f"./data/{folders[idx]}")
    zip_path = folder_path.with_suffix('.zip')

    # Step 1: Check if folder exists, create if not
    folder_path.mkdir(parents=True, exist_ok=True)

    # Step 2: Check if folder is empty
    if not any(folder_path.iterdir()):
        print("Folder is empty. Downloading data...")

        # Step 3: Download zip file
        response = requests.get(url)
        with open(zip_path, 'wb') as f:
            f.write(response.content)
        print(f"Downloaded ZIP to {zip_path}")

        # Step 4: Extract zip contents into folder
        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            zip_ref.extractall(folder_path)
        print(f"Extracted contents to {folder_path}")
Folder is empty. Downloading data...
Downloaded ZIP to data/Pages2k.zip
Extracted contents to data/Pages2k
Folder is empty. Downloading data...
Downloaded ZIP to data/iso2k.zip
Extracted contents to data/iso2k
Folder is empty. Downloading data...
Downloaded ZIP to data/coral2k.zip
Extracted contents to data/coral2k

Exploring the records in each of the compilations#

Let’s start with importing the needed packages:

import os

# General data manipulation libraries
import numpy as np
import pandas as pd

# To manipulate LiPD files
from pylipd.lipd import LiPD
import json

# Scientific toolboxes
import pyleoclim as pyleo
import seaborn as sns
import cfr
from scipy import stats

# To be able to compare timeseries
from dtaidistance import dtw
from scipy.stats import pearsonr, zscore
import matplotlib.pyplot as plt
import itertools

#For saving
import pickle

PAGES2k#

Let’s start with the PAGES2k dataset. We can load the entire folder into a pylipd.LiPD object:

D_pages2k = LiPD()
D_pages2k.load_from_dir('./data/Pages2k/')
Loading 647 LiPD files
100%|██████████| 647/647 [00:11<00:00, 54.47it/s]
Loaded..

Let’s create a query to gather all the necessary information. Since PyLiPD is built upon the rdflib library and LiPD is defined as a child of the rdflib Graph class, it is possible to query a LiPD object directly with a SPARQL query, as in this tutorial.
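As a minimal illustration of this direct-query capability (a toy example, not part of the original workflow; the variable name numDatasets is arbitrary), we could count the datasets in the loaded graph:

count_query = """
PREFIX le: <http://linked.earth/ontology#>

SELECT (COUNT(DISTINCT ?ds) AS ?numDatasets) WHERE {
    ?ds a le:Dataset .
}"""

# LiPD.query returns the raw SPARQL result and a pandas DataFrame
_, df_count = D_pages2k.query(count_query)
print(df_count)

The full metadata query used in this notebook is below: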

query = """ PREFIX le: <http://linked.earth/ontology#>
PREFIX wgs84: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?dataSetName ?compilationName ?archiveType ?geo_meanLat ?geo_meanLon ?geo_meanElev 
    ?paleoData_variableName ?paleoData_standardName ?paleoData_values ?paleoData_units 
    ?paleoData_proxy ?paleoData_proxyGeneral ?paleoData_seasonality ?paleoData_interpName ?paleoData_interpRank 
    ?TSID ?time_variableName ?time_standardName ?time_values
	?time_units where{
    
    ?ds a le:Dataset .
    ?ds le:hasName ?dataSetName .

    # Get the archive
    OPTIONAL{
            ?ds le:hasArchiveType ?archiveTypeObj .
            ?archiveTypeObj rdfs:label ?archiveType .
        }
    
    # Geographical information    
    
    ?ds le:hasLocation ?loc .
        OPTIONAL{?loc wgs84:lat ?geo_meanLat .}
        OPTIONAL{?loc wgs84:long ?geo_meanLon .}      
        OPTIONAL{?loc wgs84:alt ?geo_meanElev .}
    
    # PaleoData

    ?ds le:hasPaleoData ?data .
    ?data le:hasMeasurementTable ?table .
    ?table le:hasVariable ?var .

    # Name
    ?var le:hasName ?paleoData_variableName .
    ?var le:hasStandardVariable ?variable_obj .
    ?variable_obj rdfs:label ?paleoData_standardName .
    
    #Values
    ?var le:hasValues ?paleoData_values .

    #Seasonality
    OPTIONAL{?var le:seasonality ?paleoData_seasonality} .

    #Units
    OPTIONAL{
        ?var le:hasUnits ?paleoData_unitsObj .
        ?paleoData_unitsObj rdfs:label ?paleoData_units .
    }

    #Proxy information
    OPTIONAL{
        ?var le:hasProxy ?paleoData_proxyObj .
        ?paleoData_proxyObj rdfs:label ?paleoData_proxy .
    }
    OPTIONAL{
        ?var le:hasProxyGeneral ?paleoData_proxyGeneralObj .
        ?paleoData_proxyGeneralObj rdfs:label ?paleoData_proxyGeneral .
    }

    # Compilation
    ?var le:partOfCompilation ?compilation . 
    ?compilation le:hasName ?compilationName .
    FILTER (?compilationName = "Pages2kTemperature").
    ?var le:useInGlobalTemperatureAnalysis True .

    # TSiD (should all have them)
    OPTIONAL{
        ?var le:hasVariableId ?TSID
    } .

    # Interpretation (might be an optional field)
    OPTIONAL{
        ?var le:hasInterpretation ?interp .

        ?interp le:hasVariable ?interpvar .
        BIND(REPLACE(STR(?interpvar), "http://linked.earth/ontology/interpretation#", "") AS ?interpretedVariable_Fallback)
        OPTIONAL { ?interpvar rdfs:label ?interpretedVariable_Label } # Use a temporary variable for the label
        BIND(COALESCE(?interpretedVariable_Label, ?interpretedVariable_Fallback) AS ?paleoData_interpName) # COALESCE into the final variable
        }
    
    OPTIONAL{
        ?var le:hasInterpretation ?interp .
        ?interp le:hasRank ?paleoData_interpRank .}
    
    
    OPTIONAL{
        ?var le:hasInterpretation ?interp .
    
        ?interp le:hasSeasonality ?seasonalityURI .
        BIND(REPLACE(STR(?seasonalityURI), "http://linked.earth/ontology/interpretation#", "") AS ?seasonality_Fallback)
        OPTIONAL { ?seasonalityURI rdfs:label ?seasonalityLabel }
        BIND(COALESCE(?seasonalityLabel, ?seasonality_Fallback) AS ?paleoData_seasonality)
    }

    #Time information
    ?table le:hasVariable ?timevar .
    ?timevar le:hasName ?time_variableName .
    ?timevar le:hasStandardVariable ?timevar_obj .
    ?timevar_obj rdfs:label ?time_standardName .
    VALUES ?time_standardName {"year"} .
    ?timevar le:hasValues ?time_values .
    OPTIONAL{
        ?timevar le:hasUnits ?time_unitsObj .
        ?time_unitsObj rdfs:label ?time_units .
    }  
}"""
qres, df_pages2k = D_pages2k.query(query)
df_pages2k.head()
dataSetName compilationName archiveType geo_meanLat geo_meanLon geo_meanElev paleoData_variableName paleoData_standardName paleoData_values paleoData_units paleoData_proxy paleoData_proxyGeneral paleoData_seasonality paleoData_interpName paleoData_interpRank TSID time_variableName time_standardName time_values time_units
0 Ant-WDC05A.Steig.2013 Pages2kTemperature Glacier ice -79.46 -112.09 1806.0 d18O d18O [-33.32873325, -35.6732, -33.1574, -34.2854, -... permil d18O None Annual temperature NaN Ant_030 year year [2005, 2004, 2003, 2002, 2001, 2000, 1999, 199... yr AD
1 NAm-MtLemon.Briffa.2002 Pages2kTemperature Wood 32.50 -110.80 2700.0 MXD MXD [0.968, 0.962, 1.013, 0.95, 1.008, 0.952, 1.02... None maximum latewood density dendrophysical Summer temperature NaN NAm_568 year year [1568, 1569, 1570, 1571, 1572, 1573, 1574, 157... yr AD
2 Arc-Arjeplog.Bjorklund.2014 Pages2kTemperature Wood 66.30 18.20 800.0 density density [-0.829089212152348, -0.733882889924006, -0.89... None maximum latewood density dendrophysical Jun-Aug temperature NaN Arc_060 year year [1200, 1201, 1202, 1203, 1204, 1205, 1206, 120... yr AD
3 Asi-CHIN019.Li.2010 Pages2kTemperature Wood 29.15 99.93 2150.0 ringWidth ringWidth [1.465, 1.327, 1.202, 0.757, 1.094, 1.006, 1.2... None ring width dendrophysical Annual temperature NaN Asia_041 year year [1509, 1510, 1511, 1512, 1513, 1514, 1515, 151... yr AD
4 NAm-Landslide.Luckman.2006 Pages2kTemperature Wood 60.20 -138.50 800.0 ringWidth ringWidth [1.123, 0.841, 0.863, 1.209, 1.139, 1.056, 0.8... None ring width dendrophysical Summer temperature NaN NAm_1876 year year [913, 914, 915, 916, 917, 918, 919, 920, 921, ... yr AD

Let’s have a look at the resulting data.

df_pages2k.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 691 entries, 0 to 690
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   dataSetName             691 non-null    object 
 1   compilationName         691 non-null    object 
 2   archiveType             691 non-null    object 
 3   geo_meanLat             691 non-null    float64
 4   geo_meanLon             691 non-null    float64
 5   geo_meanElev            684 non-null    float64
 6   paleoData_variableName  691 non-null    object 
 7   paleoData_standardName  691 non-null    object 
 8   paleoData_values        691 non-null    object 
 9   paleoData_units         283 non-null    object 
 10  paleoData_proxy         691 non-null    object 
 11  paleoData_proxyGeneral  500 non-null    object 
 12  paleoData_seasonality   691 non-null    object 
 13  paleoData_interpName    691 non-null    object 
 14  paleoData_interpRank    1 non-null      float64
 15  TSID                    691 non-null    object 
 16  time_variableName       691 non-null    object 
 17  time_standardName       691 non-null    object 
 18  time_values             691 non-null    object 
 19  time_units              691 non-null    object 
dtypes: float64(4), object(16)
memory usage: 108.1+ KB
df_pages2k['paleoData_interpName'].unique()
array(['temperature'], dtype=object)

We have 691 timeseries that represent temperature, consistent with the information from the LiPDverse.
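As a quick sanity check (not part of the original notebook), we can tally these series by archive type:

# Quick sanity check: number of series and their breakdown by archive type
print(len(df_pages2k))
print(df_pages2k['archiveType'].value_counts())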

iso2k#

Let’s do something similar for the iso2k database:

D_iso2k = LiPD()
D_iso2k.load_from_dir('./data/iso2k')
Loading 509 LiPD files
100%|██████████| 509/509 [00:14<00:00, 34.94it/s]
Loaded..

query_iso = """ PREFIX le: <http://linked.earth/ontology#>
PREFIX wgs84: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?dataSetName ?compilationName ?archiveType ?geo_meanLat ?geo_meanLon ?geo_meanElev 
    ?paleoData_variableName ?paleoData_standardName ?paleoData_values ?paleoData_units 
    ?paleoData_proxy ?paleoData_proxyGeneral ?paleoData_seasonality ?paleoData_interpName ?paleoData_interpRank 
    ?TSID ?time_variableName ?time_standardName ?time_values
	?time_units where{
    
    ?ds a le:Dataset .
    ?ds le:hasName ?dataSetName .

    # Get the archive
    OPTIONAL{
            ?ds le:hasArchiveType ?archiveTypeObj .
            ?archiveTypeObj rdfs:label ?archiveType .
        }
    
    # Geographical information    
    
    ?ds le:hasLocation ?loc .
        OPTIONAL{?loc wgs84:lat ?geo_meanLat .}
        OPTIONAL{?loc wgs84:long ?geo_meanLon .}      
        OPTIONAL{?loc wgs84:alt ?geo_meanElev .}
    
    # PaleoData

    ?ds le:hasPaleoData ?data .
    ?data le:hasMeasurementTable ?table .
    ?table le:hasVariable ?var .

    # Name
    ?var le:hasName ?paleoData_variableName .
    ?var le:hasStandardVariable ?variable_obj .
    ?variable_obj rdfs:label ?paleoData_standardName .
    
    #Values
    ?var le:hasValues ?paleoData_values .

    #Seasonality
    OPTIONAL{?var le:seasonality ?paleoData_seasonality} .

    #Units
    OPTIONAL{
        ?var le:hasUnits ?paleoData_unitsObj .
        ?paleoData_unitsObj rdfs:label ?paleoData_units .
    }

    #Proxy information
    OPTIONAL{
        ?var le:hasProxy ?paleoData_proxyObj .
        ?paleoData_proxyObj rdfs:label ?paleoData_proxy .
    }
    OPTIONAL{
        ?var le:hasProxyGeneral ?paleoData_proxyGeneralObj .
        ?paleoData_proxyGeneralObj rdfs:label ?paleoData_proxyGeneral .
    }

    # Compilation
    ?var le:partOfCompilation ?compilation . 
    ?compilation le:hasName ?compilationName .
    FILTER (?compilationName = "iso2k").
    ?var le:useInGlobalTemperatureAnalysis True .

    # TSiD (should all have them)
    OPTIONAL{
        ?var le:hasVariableId ?TSID
    } .

    # Interpretation (might be an optional field)
    OPTIONAL{
        ?var le:hasInterpretation ?interp .

        ?interp le:hasVariable ?interpvar .
        BIND(REPLACE(STR(?interpvar), "http://linked.earth/ontology/interpretation#", "") AS ?interpretedVariable_Fallback)
        OPTIONAL { ?interpvar rdfs:label ?interpretedVariable_Label } # Use a temporary variable for the label
        BIND(COALESCE(?interpretedVariable_Label, ?interpretedVariable_Fallback) AS ?paleoData_interpName) # COALESCE into the final variable
        }
    
    OPTIONAL{
        ?var le:hasInterpretation ?interp .
        ?interp le:hasRank ?paleoData_interpRank .}
    
    
    OPTIONAL{
        ?var le:hasInterpretation ?interp .
    
        ?interp le:hasSeasonality ?seasonalityURI .
        BIND(REPLACE(STR(?seasonalityURI), "http://linked.earth/ontology/interpretation#", "") AS ?seasonality_Fallback)
        OPTIONAL { ?seasonalityURI rdfs:label ?seasonalityLabel }
        BIND(COALESCE(?seasonalityLabel, ?seasonality_Fallback) AS ?paleoData_seasonality)
    }

    #Time information
    ?table le:hasVariable ?timevar .
    ?timevar le:hasName ?time_variableName .
    ?timevar le:hasStandardVariable ?timevar_obj .
    ?timevar_obj rdfs:label ?time_standardName .
    VALUES ?time_standardName {"year"} .
    ?timevar le:hasValues ?time_values .
    OPTIONAL{
        ?timevar le:hasUnits ?time_unitsObj .
        ?time_unitsObj rdfs:label ?time_units .
    }  
}"""
qres, df_iso2k = D_iso2k.query(query_iso)
df_iso2k.head()
dataSetName compilationName archiveType geo_meanLat geo_meanLon geo_meanElev paleoData_variableName paleoData_standardName paleoData_values paleoData_units paleoData_proxy paleoData_proxyGeneral paleoData_seasonality paleoData_interpName paleoData_interpRank TSID time_variableName time_standardName time_values time_units
0 LS16STCL iso2k Lake sediment 50.830 -116.39 1126.0 d18O d18O [-7.81, -5.91, -9.03, -5.35, -5.61, -5.98, -5.... permil d18O None Winter effectivePrecipitation NaN LPD7dc5b9ba-dup-dup year year [2009.0, 2008.3, 2007.8, 2007.4, 2007.0, 2006.... yr AD
1 LS16STCL iso2k Lake sediment 50.830 -116.39 1126.0 d18O d18O [-7.81, -5.91, -9.03, -5.35, -5.61, -5.98, -5.... permil d18O None None effectivePrecipitation 3.0 LPD7dc5b9ba-dup-dup year year [2009.0, 2008.3, 2007.8, 2007.4, 2007.0, 2006.... yr AD
2 CO00URMA iso2k Coral 0.933 173.00 6.0 d18O d18O [-4.8011, -4.725, -4.6994, -4.86, -5.0886, -5.... permil d18O None subannual temperature 1.0 Ocean2kHR_177_iso2k year year [1994.5, 1994.33, 1994.17, 1994.0, 1993.83, 19... yr AD
3 CO00URMA iso2k Coral 0.933 173.00 6.0 d18O d18O [-4.8011, -4.725, -4.6994, -4.86, -5.0886, -5.... permil d18O None subannual temperature 1.0 Ocean2kHR_177_iso2k year year [1994.5, 1994.33, 1994.17, 1994.0, 1993.83, 19... yr AD
4 CO00URMA iso2k Coral 0.933 173.00 6.0 d18O d18O [-4.8011, -4.725, -4.6994, -4.86, -5.0886, -5.... permil d18O None subannual temperature 1.0 Ocean2kHR_177_iso2k year year [1994.5, 1994.33, 1994.17, 1994.0, 1993.83, 19... yr AD
df_iso2k['paleoData_interpName'].unique()
array(['effectivePrecipitation', 'temperature', 'seawaterIsotope',
       'precipitationIsotope', 'hydrologicBalance', 'salinity'],
      dtype=object)

Despite using the useInGlobalTemperatureAnalysis flag, there seems to be more than one interpretation for several of the variables. Let’s first filter for the interpretation associated with temperature and remove duplicate entries based on TSiD:

df_iso2k_filt = df_iso2k[df_iso2k['paleoData_interpName']=='temperature'].drop_duplicates(subset='TSID', keep='first')
df_iso2k_filt.head()
dataSetName compilationName archiveType geo_meanLat geo_meanLon geo_meanElev paleoData_variableName paleoData_standardName paleoData_values paleoData_units paleoData_proxy paleoData_proxyGeneral paleoData_seasonality paleoData_interpName paleoData_interpRank TSID time_variableName time_standardName time_values time_units
2 CO00URMA iso2k Coral 0.9330 173.0000 6.0 d18O d18O [-4.8011, -4.725, -4.6994, -4.86, -5.0886, -5.... permil d18O None subannual temperature 1.0 Ocean2kHR_177_iso2k year year [1994.5, 1994.33, 1994.17, 1994.0, 1993.83, 19... yr AD
11 CO05KUBE iso2k Coral 32.4670 -64.7000 -12.0 d18O d18O [-4.15, -3.66, -3.69, -4.07, -3.95, -4.12, -3.... permil d18O None subannual temperature 1.0 Ocean2kHR_105_iso2k year year [1983.21, 1983.13, 1983.04, 1982.96, 1982.88, ... yr AD
13 IC13THQU iso2k Glacier ice -13.9333 -70.8333 5670.0 d18O d18O [-18.5905, -16.3244, -16.2324, -17.0112, -18.6... permil d18O None Winter temperature NaN SAm_035_iso2k year year [2009, 2008, 2007, 2006, 2005, 2004, 2003, 200... yr AD
15 CO01TUNG iso2k Coral -5.2170 145.8170 -3.0 d18O d18O [-4.875, -4.981, -5.166, -5.06, -4.942, -4.919... permil d18O None subannual temperature NaN Ocean2kHR_141_iso2k year year [1993.042, 1992.792, 1992.542, 1992.292, 1992.... yr AD
19 CO01TUNG iso2k Coral -5.2170 145.8170 -3.0 d18O d18O [-4.827, -4.786, -4.693, -4.852, -4.991, -4.90... permil d18O None subannual temperature 1.0 Ocean2kHR_140_iso2k year year [1993.042, 1992.792, 1992.542, 1992.292, 1992.... yr AD
df_iso2k_filt.info()
<class 'pandas.core.frame.DataFrame'>
Index: 49 entries, 2 to 155
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   dataSetName             49 non-null     object 
 1   compilationName         49 non-null     object 
 2   archiveType             49 non-null     object 
 3   geo_meanLat             49 non-null     float64
 4   geo_meanLon             49 non-null     float64
 5   geo_meanElev            49 non-null     float64
 6   paleoData_variableName  49 non-null     object 
 7   paleoData_standardName  49 non-null     object 
 8   paleoData_values        49 non-null     object 
 9   paleoData_units         49 non-null     object 
 10  paleoData_proxy         49 non-null     object 
 11  paleoData_proxyGeneral  0 non-null      object 
 12  paleoData_seasonality   39 non-null     object 
 13  paleoData_interpName    49 non-null     object 
 14  paleoData_interpRank    24 non-null     float64
 15  TSID                    49 non-null     object 
 16  time_variableName       49 non-null     object 
 17  time_standardName       49 non-null     object 
 18  time_values             49 non-null     object 
 19  time_units              49 non-null     object 
dtypes: float64(4), object(16)
memory usage: 8.0+ KB

We have 49 entries left in the iso2k database.

CoralHydro2k#

Let’s repeat our filtering. The CoralHydro2k database does not have a property indicating whether a record should be used in temperature analysis.

D_coral2k = LiPD()
D_coral2k.load_from_dir('./data/coral2k')
Loading 179 LiPD files
100%|██████████| 179/179 [00:02<00:00, 88.65it/s]
Loaded..

query_coral = """ PREFIX le: <http://linked.earth/ontology#>
PREFIX wgs84: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?dataSetName ?compilationName ?archiveType ?geo_meanLat ?geo_meanLon ?geo_meanElev 
    ?paleoData_variableName ?paleoData_standardName ?paleoData_values ?paleoData_units 
    ?paleoData_proxy ?paleoData_proxyGeneral ?paleoData_seasonality ?paleoData_interpName ?paleoData_interpRank 
    ?TSID ?time_variableName ?time_standardName ?time_values
	?time_units where{
    
    ?ds a le:Dataset .
    ?ds le:hasName ?dataSetName .

    # Get the archive
    OPTIONAL{
            ?ds le:hasArchiveType ?archiveTypeObj .
            ?archiveTypeObj rdfs:label ?archiveType .
        }
    
    # Geographical information    
    
    ?ds le:hasLocation ?loc .
        OPTIONAL{?loc wgs84:lat ?geo_meanLat .}
        OPTIONAL{?loc wgs84:long ?geo_meanLon .}      
        OPTIONAL{?loc wgs84:alt ?geo_meanElev .}
    
    # PaleoData

    ?ds le:hasPaleoData ?data .
    ?data le:hasMeasurementTable ?table .
    ?table le:hasVariable ?var .

    # Name
    ?var le:hasName ?paleoData_variableName .
    ?var le:hasStandardVariable ?variable_obj .
    ?variable_obj rdfs:label ?paleoData_standardName .
    
    #Values
    ?var le:hasValues ?paleoData_values .

    #Seasonality
    OPTIONAL{?var le:seasonality ?paleoData_seasonality} .

    #Units
    OPTIONAL{
        ?var le:hasUnits ?paleoData_unitsObj .
        ?paleoData_unitsObj rdfs:label ?paleoData_units .
    }

    #Proxy information
    OPTIONAL{
        ?var le:hasProxy ?paleoData_proxyObj .
        ?paleoData_proxyObj rdfs:label ?paleoData_proxy .
    }
    OPTIONAL{
        ?var le:hasProxyGeneral ?paleoData_proxyGeneralObj .
        ?paleoData_proxyGeneralObj rdfs:label ?paleoData_proxyGeneral .
    }

    # Compilation
    ?var le:partOfCompilation ?compilation . 
    ?compilation le:hasName ?compilationName .
    FILTER (?compilationName = "CoralHydro2k").


    # TSiD (should all have them)
    OPTIONAL{
        ?var le:hasVariableId ?TSID
    } .

    # Interpretation (might be an optional field)
    OPTIONAL{
        ?var le:hasInterpretation ?interp .

        ?interp le:hasVariable ?interpvar .
        BIND(REPLACE(STR(?interpvar), "http://linked.earth/ontology/interpretation#", "") AS ?interpretedVariable_Fallback)
        OPTIONAL { ?interpvar rdfs:label ?interpretedVariable_Label } # Use a temporary variable for the label
        BIND(COALESCE(?interpretedVariable_Label, ?interpretedVariable_Fallback) AS ?paleoData_interpName) # COALESCE into the final variable
        }
    
    OPTIONAL{
        ?var le:hasInterpretation ?interp .
        ?interp le:hasRank ?paleoData_interpRank .}
    
    
    OPTIONAL{
        ?var le:hasInterpretation ?interp .
    
        ?interp le:hasSeasonality ?seasonalityURI .
        BIND(REPLACE(STR(?seasonalityURI), "http://linked.earth/ontology/interpretation#", "") AS ?seasonality_Fallback)
        OPTIONAL { ?seasonalityURI rdfs:label ?seasonalityLabel }
        BIND(COALESCE(?seasonalityLabel, ?seasonality_Fallback) AS ?paleoData_seasonality)
    }

    #Time information
    ?table le:hasVariable ?timevar .
    ?timevar le:hasName ?time_variableName .
    ?timevar le:hasStandardVariable ?timevar_obj .
    ?timevar_obj rdfs:label ?time_standardName .
    VALUES ?time_standardName {"year"} .
    ?timevar le:hasValues ?time_values .
    OPTIONAL{
        ?timevar le:hasUnits ?time_unitsObj .
        ?time_unitsObj rdfs:label ?time_units .
    }  
}"""
qres, df_coral2k = D_coral2k.query(query_coral)
df_coral2k.head()
dataSetName compilationName archiveType geo_meanLat geo_meanLon geo_meanElev paleoData_variableName paleoData_standardName paleoData_values paleoData_units paleoData_proxy paleoData_proxyGeneral paleoData_seasonality paleoData_interpName paleoData_interpRank TSID time_variableName time_standardName time_values time_units
0 KA17RYU01 CoralHydro2k Coral 28.300 130.000 -3.5 SrCa Sr/Ca [8.802, 9.472, 8.825, 9.355, 8.952, 9.297, 8.8... mmol/mol None None None None None KA17RYU01_SrCa year year [1578.58, 1579.08, 1579.58, 1580.08, 1580.58, ... yr AD
1 CH18YOA02 CoralHydro2k Coral 16.448 111.605 NA SrCa Sr/Ca [8.58, 8.683, 8.609, 8.37, 8.38, 8.417, 8.584,... mmol/mol None None None None None CH18YOA02_SrCa year year [1987.92, 1988.085, 1988.25, 1988.42, 1988.585... yr AD
2 FL17DTO02 CoralHydro2k Coral 24.699 -82.799 -3.0 SrCa Sr/Ca [9.159, 9.257, 9.245, 9.166, 9.045, 9.013, 8.9... mmol/mol None None None None None FL17DTO02_SrCa year year [1837.04, 1837.13, 1837.21, 1837.29, 1837.38, ... yr AD
3 BO14HTI01 CoralHydro2k Coral 12.210 109.310 -3.6 d18O d18O [-5.4206, -5.3477, -5.1354, -5.7119, -5.9058, ... permil None None None None None BO14HTI01_d18O year year [1977.37, 1977.45, 1977.54, 1977.62, 1977.7, 1... yr AD
4 BO14HTI01 CoralHydro2k Coral 12.210 109.310 -3.6 SrCa Sr/Ca [9.2, 9.17, 9.11, 9.02, 8.95, 8.99, 9.06, 9.1,... mmol/mol None None None None None BO14HTI01_SrCa year year [1600.04, 1600.12, 1600.2, 1600.28, 1600.37, 1... yr AD

Since there are no interpretation objects to go by, let’s have a look at the variable names:

df_coral2k['paleoData_variableName'].unique()
array(['SrCa', 'd18O', 'd18O_sw'], dtype=object)

Sr/Ca is usually interpreted as a temperature proxy. d18O contains information about both temperature and d18Osw. Let’s keep these two and drop any duplicates as identified by TSiDs:

df_coral2k_filt = df_coral2k[df_coral2k['paleoData_variableName'].isin(['SrCa', 'd18O'])].drop_duplicates(subset='TSID', keep='first')
df_coral2k_filt.head()
dataSetName compilationName archiveType geo_meanLat geo_meanLon geo_meanElev paleoData_variableName paleoData_standardName paleoData_values paleoData_units paleoData_proxy paleoData_proxyGeneral paleoData_seasonality paleoData_interpName paleoData_interpRank TSID time_variableName time_standardName time_values time_units
0 KA17RYU01 CoralHydro2k Coral 28.300 130.000 -3.5 SrCa Sr/Ca [8.802, 9.472, 8.825, 9.355, 8.952, 9.297, 8.8... mmol/mol None None None None None KA17RYU01_SrCa year year [1578.58, 1579.08, 1579.58, 1580.08, 1580.58, ... yr AD
1 CH18YOA02 CoralHydro2k Coral 16.448 111.605 NA SrCa Sr/Ca [8.58, 8.683, 8.609, 8.37, 8.38, 8.417, 8.584,... mmol/mol None None None None None CH18YOA02_SrCa year year [1987.92, 1988.085, 1988.25, 1988.42, 1988.585... yr AD
2 FL17DTO02 CoralHydro2k Coral 24.699 -82.799 -3.0 SrCa Sr/Ca [9.159, 9.257, 9.245, 9.166, 9.045, 9.013, 8.9... mmol/mol None None None None None FL17DTO02_SrCa year year [1837.04, 1837.13, 1837.21, 1837.29, 1837.38, ... yr AD
3 BO14HTI01 CoralHydro2k Coral 12.210 109.310 -3.6 d18O d18O [-5.4206, -5.3477, -5.1354, -5.7119, -5.9058, ... permil None None None None None BO14HTI01_d18O year year [1977.37, 1977.45, 1977.54, 1977.62, 1977.7, 1... yr AD
4 BO14HTI01 CoralHydro2k Coral 12.210 109.310 -3.6 SrCa Sr/Ca [9.2, 9.17, 9.11, 9.02, 8.95, 8.99, 9.06, 9.1,... mmol/mol None None None None None BO14HTI01_SrCa year year [1600.04, 1600.12, 1600.2, 1600.28, 1600.37, 1... yr AD
df_coral2k_filt.info()
<class 'pandas.core.frame.DataFrame'>
Index: 233 entries, 0 to 251
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   dataSetName             233 non-null    object 
 1   compilationName         233 non-null    object 
 2   archiveType             233 non-null    object 
 3   geo_meanLat             233 non-null    float64
 4   geo_meanLon             233 non-null    float64
 5   geo_meanElev            232 non-null    object 
 6   paleoData_variableName  233 non-null    object 
 7   paleoData_standardName  233 non-null    object 
 8   paleoData_values        233 non-null    object 
 9   paleoData_units         233 non-null    object 
 10  paleoData_proxy         0 non-null      object 
 11  paleoData_proxyGeneral  0 non-null      object 
 12  paleoData_seasonality   0 non-null      object 
 13  paleoData_interpName    0 non-null      object 
 14  paleoData_interpRank    0 non-null      object 
 15  TSID                    233 non-null    object 
 16  time_variableName       233 non-null    object 
 17  time_standardName       233 non-null    object 
 18  time_values             233 non-null    object 
 19  time_units              233 non-null    object 
dtypes: float64(2), object(18)
memory usage: 38.2+ KB

This database adds 233 new records.

DataBase: Assemble!#

Let’s concatenate the three DataFrames into a clean database, dropping timeseries with the same TSiD as we go. Let’s also make sure the value and time columns are parsed from JSON strings into Python lists:

df_raw = pd.concat([df_pages2k, df_iso2k_filt, df_coral2k_filt], ignore_index=True).drop_duplicates(subset='TSID', keep='first')
df_raw['paleoData_values']=df_raw['paleoData_values'].apply(lambda row : json.loads(row) if isinstance(row, str) else row)
df_raw['time_values']=df_raw['time_values'].apply(lambda row : json.loads(row) if isinstance(row, str) else row)
df_raw.info()
<class 'pandas.core.frame.DataFrame'>
Index: 972 entries, 0 to 972
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   dataSetName             972 non-null    object 
 1   compilationName         972 non-null    object 
 2   archiveType             972 non-null    object 
 3   geo_meanLat             972 non-null    float64
 4   geo_meanLon             972 non-null    float64
 5   geo_meanElev            964 non-null    object 
 6   paleoData_variableName  972 non-null    object 
 7   paleoData_standardName  972 non-null    object 
 8   paleoData_values        972 non-null    object 
 9   paleoData_units         564 non-null    object 
 10  paleoData_proxy         739 non-null    object 
 11  paleoData_proxyGeneral  500 non-null    object 
 12  paleoData_seasonality   729 non-null    object 
 13  paleoData_interpName    739 non-null    object 
 14  paleoData_interpRank    24 non-null     float64
 15  TSID                    972 non-null    object 
 16  time_variableName       972 non-null    object 
 17  time_standardName       972 non-null    object 
 18  time_values             972 non-null    object 
 19  time_units              972 non-null    object 
dtypes: float64(3), object(17)
memory usage: 159.5+ KB

We have 972 timeseries before any further pre-processing.
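As a quick check on the composition of the assembled table (illustrative, not in the original workflow), we can count the series contributed by each compilation and archive type:

# How many series does each compilation contribute, and from which archives?
print(df_raw['compilationName'].value_counts())
print(df_raw.groupby('compilationName')['archiveType'].value_counts())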

Special handling: The Palmyra record#

The Palmyra Coral record in these databases may not have been updated with the latest changes published by Dee et al. (2020). Let’s open this record and extract the same information:

palm = LiPD()
palm.load('./data/Palmyra.Dee.2020.lpd')
Loading 1 LiPD files
100%|██████████| 1/1 [00:00<00:00,  9.18it/s]
Loaded..

query_palm = """
PREFIX le: <http://linked.earth/ontology#>
PREFIX wgs84: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?dataSetName ?compilationName ?archiveType ?geo_meanLat ?geo_meanLon ?geo_meanElev 
    ?paleoData_variableName ?paleoData_standardName ?paleoData_values ?paleoData_units 
    ?paleoData_proxy ?paleoData_proxyGeneral ?paleoData_seasonality ?paleoData_interpName ?paleoData_interpRank 
    ?TSID ?time_variableName ?time_standardName ?time_values
	?time_units where{
    
    ?ds a le:Dataset .
    ?ds le:hasName ?dataSetName .

    # Get the archive
    OPTIONAL{
            ?ds le:hasArchiveType ?archiveTypeObj .
            ?archiveTypeObj rdfs:label ?archiveType .
        }
    
    # Geographical information    
    
    ?ds le:hasLocation ?loc .
        OPTIONAL{?loc wgs84:lat ?geo_meanLat .}
        OPTIONAL{?loc wgs84:long ?geo_meanLon .}      
        OPTIONAL{?loc wgs84:alt ?geo_meanElev .}
    
    # PaleoData

    ?ds le:hasPaleoData ?data .
    ?data le:hasMeasurementTable ?table .
    ?table le:hasVariable ?var .

    # Name
    ?var le:hasName ?paleoData_variableName .
    ?var le:hasStandardVariable ?variable_obj .
    ?variable_obj rdfs:label ?paleoData_standardName .
    
    #Values
    ?var le:hasValues ?paleoData_values .

    #Units
    OPTIONAL{
        ?var le:hasUnits ?paleoData_unitsObj .
        ?paleoData_unitsObj rdfs:label ?paleoData_units .
    }

    #Proxy information
    OPTIONAL{
        ?var le:hasProxy ?paleoData_proxyObj .
        ?paleoData_proxyObj rdfs:label ?paleoData_proxy .
    }
    OPTIONAL{
        ?var le:hasProxyGeneral ?paleoData_proxyGeneralObj .
        ?paleoData_proxyGeneralObj rdfs:label ?paleoData_proxyGeneral .
    }

    # Compilation
    OPTIONAL{
    ?var le:partOfCompilation ?compilation . 
    ?compilation le:hasName ?compilationName .}


    # TSiD (should all have them)
    OPTIONAL{
        ?var le:hasVariableId ?TSID
    } .

    # Interpretation (might be an optional field)
    OPTIONAL{
        ?var le:hasInterpretation ?interp .

        ?interp le:hasVariable ?interpvar .
        BIND(REPLACE(STR(?interpvar), "http://linked.earth/ontology/interpretation#", "") AS ?interpretedVariable_Fallback)
        OPTIONAL { ?interpvar rdfs:label ?interpretedVariable_Label } # Use a temporary variable for the label
        BIND(COALESCE(?interpretedVariable_Label, ?interpretedVariable_Fallback) AS ?paleoData_interpName) # COALESCE into the final variable
        }
    
    OPTIONAL{
        ?var le:hasInterpretation ?interp .
        ?interp le:hasRank ?paleoData_interpRank .}
    
    
    OPTIONAL{
        ?var le:hasInterpretation ?interp .
    
        ?interp le:hasSeasonality ?seasonalityURI .
        BIND(REPLACE(STR(?seasonalityURI), "http://linked.earth/ontology/interpretation#", "") AS ?seasonality_Fallback)
        OPTIONAL { ?seasonalityURI rdfs:label ?seasonalityLabel }
        BIND(COALESCE(?seasonalityLabel, ?seasonality_Fallback) AS ?paleoData_seasonality)
    }

    #Time information
    ?table le:hasVariable ?timevar .
    ?timevar le:hasName ?time_variableName .
    ?timevar le:hasStandardVariable ?timevar_obj .
    ?timevar_obj rdfs:label ?time_standardName .
    VALUES ?time_standardName {"year"} .
    ?timevar le:hasValues ?time_values .
    OPTIONAL{
        ?timevar le:hasUnits ?time_unitsObj .
        ?time_unitsObj rdfs:label ?time_units .
    }  
}"""
qres, df_palm = palm.query(query_palm)
df_palm.head()
dataSetName compilationName archiveType geo_meanLat geo_meanLon geo_meanElev paleoData_variableName paleoData_standardName paleoData_values paleoData_units paleoData_proxy paleoData_proxyGeneral paleoData_seasonality paleoData_interpName paleoData_interpRank TSID time_variableName time_standardName time_values time_units
0 Palmyra.Dee.2020 None Coral 5.8664 -162.12 -9.0 d18O d18O [-4.749, -4.672, -4.724, -4.717, -4.8947, -4.8... permil d18O isotopic subannual seawaterIsotope 1.0 PCU-1cce3af5639a4ec year year [1146.375, 1146.4583, 1146.5417, 1146.625, 114... yr AD
1 Palmyra.Dee.2020 None Coral 5.8664 -162.12 -9.0 d18O d18O [-4.749, -4.672, -4.724, -4.717, -4.8947, -4.8... permil d18O isotopic subannual temperature NaN PCU-1cce3af5639a4ec year year [1146.375, 1146.4583, 1146.5417, 1146.625, 114... yr AD
2 Palmyra.Dee.2020 None Coral 5.8664 -162.12 -9.0 year year [1146.375, 1146.4583, 1146.5417, 1146.625, 114... yr AD None None None None NaN PCU-45b2ff49e3944f1 year year [1146.375, 1146.4583, 1146.5417, 1146.625, 114... yr AD

Let’s keep the variable associated with the temperature interpretation:

df_palm_filt = df_palm[df_palm['paleoData_interpName']=='temperature']
df_palm_filt
dataSetName compilationName archiveType geo_meanLat geo_meanLon geo_meanElev paleoData_variableName paleoData_standardName paleoData_values paleoData_units paleoData_proxy paleoData_proxyGeneral paleoData_seasonality paleoData_interpName paleoData_interpRank TSID time_variableName time_standardName time_values time_units
1 Palmyra.Dee.2020 None Coral 5.8664 -162.12 -9.0 d18O d18O [-4.749, -4.672, -4.724, -4.717, -4.8947, -4.8... permil d18O isotopic subannual temperature NaN PCU-1cce3af5639a4ec year year [1146.375, 1146.4583, 1146.5417, 1146.625, 114... yr AD

Let’s make sure that all the values are converted from strings:

df_palm_filt['paleoData_values']=df_palm_filt['paleoData_values'].apply(lambda row : json.loads(row) if isinstance(row, str) else row)
df_palm_filt['time_values']=df_palm_filt['time_values'].apply(lambda row : json.loads(row) if isinstance(row, str) else row)
Handling duplication: Sr/Ca#

Now, let’s find the entries in the assembled database that correspond to the Palmyra record. To do so, let’s find all the records within 50 km, using the haversine great-circle distance.
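For two points \((\phi_1, \lambda_1)\) and \((\phi_2, \lambda_2)\) (in radians) on a sphere of radius \(R\) (6371 km for the Earth), the haversine distance computed in the cell below is:

\[ d = 2R \arcsin\sqrt{\sin^2\left(\frac{\phi_2 - \phi_1}{2}\right) + \cos\phi_1 \cos\phi_2 \sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)} \]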

# Earth's radius in kilometers
R = 6371.0

# Threshold
threshold = 50 #Look for all records within threshold km

# Target location
target_lat = df_palm_filt['geo_meanLat'].iloc[0]
target_lon = df_palm_filt['geo_meanLon'].iloc[0]

# Convert degrees to radians
lat1 = np.radians(target_lat)
lon1 = np.radians(target_lon)
lat2 = np.radians(df_raw['geo_meanLat'])
lon2 = np.radians(df_raw['geo_meanLon'])

# Haversine formula
dlat = lat2 - lat1
dlon = lon2 - lon1
a = np.sin(dlat / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2)**2
c = 2 * np.arcsin(np.sqrt(a))
distance_km = R * c

# Add distance column
df_raw['distance_km'] = distance_km

# Filter to only rows within 50 km
df_distance = df_raw[df_raw['distance_km'] <= threshold].reset_index(drop=True)
df_distance
dataSetName compilationName archiveType geo_meanLat geo_meanLon geo_meanElev paleoData_variableName paleoData_standardName paleoData_values paleoData_units ... paleoData_proxyGeneral paleoData_seasonality paleoData_interpName paleoData_interpRank TSID time_variableName time_standardName time_values time_units distance_km
0 Ocn-Palmyra.Nurhati.2011 Pages2kTemperature Coral 5.870 -162.130 -10.0 Sr/Ca Sr/Ca [8.96, 8.9, 8.91, 8.94, 8.92, 8.89, 8.87, 8.81... mmol/mol ... None subannual temperature NaN Ocean2kHR_161 year year [1998.29, 1998.21, 1998.13, 1998.04, 1997.96, ... yr AD 1.176328
1 Ocn-Palmyra.Cobb.2013 Pages2kTemperature Coral 5.870 -162.130 -7.0 d18O d18O [-4.651, -4.631, -4.629, -4.562, -4.58, -4.442... permil ... None subannual temperature NaN Ocean2kHR_139 year year [928.125, 928.209, 928.292, 928.375, 928.459, ... yr AD 1.176328
2 CO03COPM iso2k Coral 5.870 -162.130 -7.0 d18O d18O [-5.27, -5.42, -5.49, -5.5, -5.44, -5.48, -5.5... permil ... None subannual temperature NaN Ocean2kHR_139_iso2k year year [1998.37, 1998.29, 1998.2, 1998.13, 1998.04, 1... yr AD 1.176328
3 NU11PAL01 CoralHydro2k Coral 5.867 -162.133 -9.0 d18O d18O [-4.79, -4.89, -4.81, -4.84, -4.85, -4.82, -4.... permil ... None None None NaN NU11PAL01_d18O year year [1886.13, 1886.21, 1886.29, 1886.38, 1886.46, ... yr AD 1.439510
4 NU11PAL01 CoralHydro2k Coral 5.867 -162.133 -9.0 SrCa Sr/Ca [9.01, 9.02, 8.98, 8.99, 9.0, 8.99, 9.0, 9.11,... mmol/mol ... None None None NaN NU11PAL01_SrCa year year [1886.13, 1886.21, 1886.29, 1886.38, 1886.46, ... yr AD 1.439510
5 SA19PAL01 CoralHydro2k Coral 5.878 -162.142 -10.0 d18O d18O [-5.21, -5.23, -5.17, -5.13, -5.11, -5.05, -5.... permil ... None None None NaN SA19PAL01_d18O year year [1983.58, 1983.67, 1983.75, 1983.83, 1983.92, ... yr AD 2.754166
6 SA19PAL01 CoralHydro2k Coral 5.878 -162.142 -10.0 SrCa Sr/Ca [8.94, 9.08, 9.17, 9.16, 9.16, 9.17, 9.17, 9.1... mmol/mol ... None None None NaN SA19PAL01_SrCa year year [1980.75, 1980.83, 1980.92, 1981.0, 1981.08, 1... yr AD 2.754166
7 SA19PAL02 CoralHydro2k Coral 5.878 -162.142 -10.0 d18O d18O [-5.05, -5.1, -5.21, -5.14, -5.23, -5.32, -5.3... permil ... None None None NaN SA19PAL02_d18O year year [1985.92, 1986.0, 1986.08, 1986.17, 1986.25, 1... yr AD 2.754166
8 SA19PAL02 CoralHydro2k Coral 5.878 -162.142 -10.0 SrCa Sr/Ca [8.86, 8.82, 8.81, 8.85, 8.89, 8.87, 8.82, 8.8... mmol/mol ... None None None NaN SA19PAL02_SrCa year year [1985.33, 1985.42, 1985.5, 1985.58, 1985.67, 1... yr AD 2.754166
9 CO03PAL08 CoralHydro2k Coral 5.870 -162.130 NA d18O d18O [-5.14, -5.13, -5.13, -5.16, -5.28, -4.99, -4.... permil ... None None None NaN CO03PAL08_d18O year year [1653.63, 1653.65, 1653.686083, 1653.722167, 1... yr AD 1.176328
10 CO03PAL09 CoralHydro2k Coral 5.870 -162.130 NA d18O d18O [-4.98, -4.8, -4.68, -4.74, -4.7, -4.74, -4.84... permil ... None None None NaN CO03PAL09_d18O year year [1664.96, 1665.0215, 1665.083, 1665.164, 1665.... yr AD 1.176328
11 CO03PAL04 CoralHydro2k Coral 5.870 -162.130 NA d18O d18O [-4.86, -4.9, -5.0, -4.92, -4.91, -4.85, -4.74... permil ... None None None NaN CO03PAL04_d18O year year [1398.68, 1398.747167, 1398.814333, 1398.8815,... yr AD 1.176328
12 CO03PAL10 CoralHydro2k Coral 5.870 -162.130 NA d18O d18O [-4.865, -4.765, -4.805, -4.665, -4.595, -4.73... permil ... None None None NaN CO03PAL10_d18O year year [1326.65, 1326.75825, 1326.8665, 1326.97475, 1... yr AD 1.176328
13 CO03PAL05 CoralHydro2k Coral 5.870 -162.130 NA d18O d18O [-4.61, -4.64, -4.7, -4.77, -4.86, -4.85, -4.8... permil ... None None None NaN CO03PAL05_d18O year year [1405.1569, 1405.280175, 1405.40345, 1405.5267... yr AD 1.176328
14 CO03PAL07 CoralHydro2k Coral 5.870 -162.130 NA d18O d18O [-4.63, -4.55, -4.65, -4.59, -4.74, -4.79, -4.... permil ... None None None NaN CO03PAL07_d18O year year [1635.02, 1635.083, 1635.146, 1635.209, 1635.2... yr AD 1.176328
15 CO03PAL06 CoralHydro2k Coral 5.870 -162.130 NA d18O d18O [-4.64, -4.64, -4.62, -4.62, -4.68, -4.77, -4.... permil ... None None None NaN CO03PAL06_d18O year year [1411.9262, 1412.0046, 1412.083, 1412.1775, 14... yr AD 1.176328
16 CO03PAL02 CoralHydro2k Coral 5.870 -162.130 NA d18O d18O [-4.631, -4.724, -4.709, -4.707, -4.912, -4.83... permil ... None None None NaN CO03PAL02_d18O year year [1149.08, 1149.2225, 1149.365, 1149.5075, 1149... yr AD 1.176328
17 CO03PAL03 CoralHydro2k Coral 5.870 -162.130 NA d18O d18O [-4.79, -4.73, -4.66, -4.66, -4.78, -4.69, -4.... permil ... None None None NaN CO03PAL03_d18O year year [1317.17, 1317.29, 1317.41, 1317.53, 1317.65, ... yr AD 1.176328
18 CO03PAL01 CoralHydro2k Coral 5.870 -162.130 NA d18O d18O [-4.68, -4.62, -4.65, -4.56, -4.58, -4.42, -4.... permil ... None None None NaN CO03PAL01_d18O year year [928.08, 928.175, 928.27, 928.365, 928.46, 928... yr AD 1.176328

19 rows × 21 columns

The first 18 rows seem to correspond to various coral heads from the Palmyra site. Let’s have a look at them, starting with the Sr/Ca records:

df_distance_SrCa = df_distance[df_distance['paleoData_variableName'].isin(['SrCa', 'Sr/Ca'])]
df_distance_SrCa
dataSetName compilationName archiveType geo_meanLat geo_meanLon geo_meanElev paleoData_variableName paleoData_standardName paleoData_values paleoData_units ... paleoData_proxyGeneral paleoData_seasonality paleoData_interpName paleoData_interpRank TSID time_variableName time_standardName time_values time_units distance_km
0 Ocn-Palmyra.Nurhati.2011 Pages2kTemperature Coral 5.870 -162.130 -10.0 Sr/Ca Sr/Ca [8.96, 8.9, 8.91, 8.94, 8.92, 8.89, 8.87, 8.81... mmol/mol ... None subannual temperature NaN Ocean2kHR_161 year year [1998.29, 1998.21, 1998.13, 1998.04, 1997.96, ... yr AD 1.176328
4 NU11PAL01 CoralHydro2k Coral 5.867 -162.133 -9.0 SrCa Sr/Ca [9.01, 9.02, 8.98, 8.99, 9.0, 8.99, 9.0, 9.11,... mmol/mol ... None None None NaN NU11PAL01_SrCa year year [1886.13, 1886.21, 1886.29, 1886.38, 1886.46, ... yr AD 1.439510
6 SA19PAL01 CoralHydro2k Coral 5.878 -162.142 -10.0 SrCa Sr/Ca [8.94, 9.08, 9.17, 9.16, 9.16, 9.17, 9.17, 9.1... mmol/mol ... None None None NaN SA19PAL01_SrCa year year [1980.75, 1980.83, 1980.92, 1981.0, 1981.08, 1... yr AD 2.754166
8 SA19PAL02 CoralHydro2k Coral 5.878 -162.142 -10.0 SrCa Sr/Ca [8.86, 8.82, 8.81, 8.85, 8.89, 8.87, 8.82, 8.8... mmol/mol ... None None None NaN SA19PAL02_SrCa year year [1985.33, 1985.42, 1985.5, 1985.58, 1985.67, 1... yr AD 2.754166

4 rows × 21 columns

Let’s plot them:

ts_list = []
for _, row in df_distance_SrCa.iterrows():
    proxy = row['paleoData_proxy'] if isinstance(row['paleoData_proxy'],str) else row['paleoData_variableName']
    ts_list.append(pyleo.GeoSeries(time=row['time_values'],value=row['paleoData_values'],
                            time_name='Time',value_name=row['paleoData_variableName'],
                            time_unit=row['time_units'], value_unit=row['paleoData_units'],
                            lat = row['geo_meanLat'], lon = row['geo_meanLon'], 
                            elevation = row['geo_meanElev'], importedFrom='lipdverse.org',
                            archiveType = row['archiveType'],
                            observationType=row['paleoData_proxy'],       
                            label=row['dataSetName']+'_'+proxy, verbose = False)) 

mgs_srca = pyleo.MultipleGeoSeries(ts_list) 
fig, ax = mgs_srca.stackplot()
fig.show()
../../_images/c388ca6fc8d7e3de48d07e50ff1e017f7a7ef03b585e3137a3c695b217936e67.png

These records look very similar, and Ocn-Palmyra.Nurhati.2011_Sr/Ca and NU11PAL01_SrCa appear nearly identical. Since this situation may occur elsewhere in the database, we want an algorithm that (1) finds records that may be the same, following a strategy similar to the one used here (are they within 50 km of each other, and do they have the same or similar variable names?), and (2) assesses whether these candidate records are, in fact, the same.

We can do this in several ways:

  • Correlation: this may work well for true duplicates but may fail for composites.

  • Dynamic Time Warping (DTW): this can be used to assess shape similarity even when time points aren’t exactly aligned. DTW returns a distance metric: lower values mean more similar patterns. It can be combined with a z-normalization to remove mean/variance effects.

  • Principal Component Analysis (PCA): on groups of nearby records, run PCA to detect whether certain records contribute nearly identical structure to the same components. High loadings on the same component can suggest redundancy.

Let’s have a look at these three options.
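Before diving into these options for step (2), here is a minimal sketch of step (1), wrapping the distance screen used above for Palmyra into a reusable helper (the function name, the 50 km default, and the optional variable-name filter are assumptions for illustration, not part of the original notebook):

# Sketch: return all rows of `df` within `threshold` km of a target point,
# optionally restricted to a set of variable names (e.g. {'SrCa', 'Sr/Ca'}).
def find_nearby_records(df, target_lat, target_lon, threshold=50, variable_names=None):
    R = 6371.0  # Earth's radius in km
    lat1, lon1 = np.radians(target_lat), np.radians(target_lon)
    lat2, lon2 = np.radians(df['geo_meanLat']), np.radians(df['geo_meanLon'])
    a = np.sin((lat2 - lat1) / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2)**2
    dist_km = 2 * R * np.arcsin(np.sqrt(a))
    out = df[dist_km <= threshold]
    if variable_names is not None:
        out = out[out['paleoData_variableName'].isin(variable_names)]
    return out

# Example: reproduce the Palmyra Sr/Ca neighborhood screened above
# find_nearby_records(df_raw, target_lat, target_lon, variable_names={'SrCa', 'Sr/Ca'})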

Option 1: Correlation#

Let’s use the Pyleoclim correlation function. The following code is inspired by this tutorial.

n = len(mgs_srca.series_list)
corr_mat = np.empty((n,n))
corr_mat[:] = np.nan # let's only work on half of the matrix to speed things up

for i in range(0, n-1): # loop over records 
    ts = mgs_srca.series_list[i]    
    mgsl = pyleo.MultipleGeoSeries(mgs_srca.series_list[i+1:]) #gather potential pairs
    corr = mgsl.correlation(ts,method='built-in') # compute correlations
    corr_mat[i,i+1:] = corr.r # save result in matrix
Looping over 3 Series in collection
100%|██████████| 3/3 [00:00<00:00, 193.66it/s]
Looping over 2 Series in collection
100%|██████████| 2/2 [00:00<00:00, 288.76it/s]
Looping over 1 Series in collection
100%|██████████| 1/1 [00:00<00:00, 336.57it/s]
labels = []
for item in mgs_srca.series_list:
    labels.append(item.label)
core_matrix_df = pd.DataFrame(corr_mat, index=labels, columns=labels)
plt.figure(figsize=(10, 8))
sns.heatmap(core_matrix_df, annot=True, fmt=".2f", cmap="coolwarm", center=0)
plt.title("Correlation Heatmap")
plt.tight_layout()
plt.show()
../../_images/73989ced1bc723539b77f1632034d0c38580dc036bb6d4761675b7ddc86375f3.png

Unsurprisingly, Ocn-Palmyra.Nurhati.2011_Sr/Ca and NU11PAL01_SrCa have a correlation of 1, as they are essentially the same record. The other correlations are below 0.8, which makes it difficult to assess whether those records are, in fact, the same. Let’s have a look at DTW as a potential discriminant.

Option 2: Dynamic Time Warping#

Dynamic Time Warping (DTW) is a distance metric designed to compare time series that may be misaligned in time or differ in length. Unlike traditional correlation-based methods, DTW allows for flexible, non-linear alignment of sequences by warping the time axis to minimize the total distance between corresponding points. In our context, DTW provides a more robust measure of similarity between records than correlation, especially when records are composites or partially overlapping. By restricting comparisons to shared time intervals and z-score normalizing the data to emphasize shape similarity, DTW helps identify potentially redundant or duplicated records while accounting for the complexities inherent in paleoenvironmental data.

The following cell performs three steps:

  • Clip the timeseries to their overlapping time range for comparison

  • Compute the z-score

  • Compute the DTW distance

# --- Helper function: clip to overlapping time range and normalize ---

def clip_to_overlap(time1, series1, time2, series2, min_overlap=5):
    start = max(time1[0], time2[0])
    end = min(time1[-1], time2[-1])
    
    mask1 = (time1 >= start) & (time1 <= end)
    mask2 = (time2 >= start) & (time2 <= end)

    if np.sum(mask1) < min_overlap or np.sum(mask2) < min_overlap:
        return None, None  # Not enough overlapping points
    
    return zscore(series1[mask1]), zscore(series2[mask2])

# --- Prepare the data ----
records = {}
times = {}

for series in mgs_srca.series_list:
    records[series.label] = series.value
    times[series.label] = series.time


# --- Main loop ---
labels = list(records.keys())
n = len(labels)
dtw_matrix = np.full((n, n), np.nan)

for i, j in itertools.combinations(range(n), 2):
    key_i, key_j = labels[i], labels[j]

    s1, s2 = clip_to_overlap(times[key_i], records[key_i],
                             times[key_j], records[key_j])

    if s1 is not None:
        path = dtw.warping_path(s1, s2)
        dist = dtw.distance(s1, s2)
        norm_dist = dist / len(path)  # Per-step normalization

        dtw_matrix[i, j] = dtw_matrix[j, i] = norm_dist

# Convert to DataFrame for visualization
dtw_df = pd.DataFrame(dtw_matrix, index=labels, columns=labels)

Let’s visualize the results:

plt.figure(figsize=(10, 8))
sns.heatmap(dtw_df, annot=True, fmt=".2f", cmap="viridis", mask=np.isnan(dtw_df))
plt.title("DTW Distance Matrix (Clipped to Overlap)")
plt.tight_layout()
plt.show()
../../_images/f471fab3704dae64d775bd32ec3c269ae1ea433dbfd41ed2f79f8723d1030a92.png

The next problem lies in the interpretation of these results. The matrix holds pairwise DTW distances between time series. Each entry DTW[i, j] is a non-negative real number representing how similar two z-normalized time series are after optimal alignment.

  • Lower values = more similar shape (after warping).

  • Higher values = less similar (more misaligned or different structure).

Since we z-score normalized the data and accounted for the number of steps, the DTW distances can be interpreted as follows:

| Normalized DTW | Meaning |
|----------------|---------|
| 0.0 – 0.1 | Very similar shape — likely same core or duplicated segment |
| 0.1 – 0.2 | High similarity — possibly same site, maybe overlapping or composited |
| 0.2 – 0.4 | Moderate similarity — same region, but different signal characteristics |
| > 0.4 | Different shape — likely independent records |
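To make this rubric actionable, a small sketch (assuming the dtw_df computed above; the 0.1 cutoff is illustrative, not a value used later in the notebook) can list the pairs that fall in the “likely duplicate” bin:

# Sketch: list pairs whose normalized DTW distance falls below an illustrative threshold
dup_threshold = 0.1
for i, j in itertools.combinations(range(len(dtw_df)), 2):
    d = dtw_df.iloc[i, j]
    if np.isfinite(d) and d < dup_threshold:
        print(f"{dtw_df.index[i]} <-> {dtw_df.columns[j]}: normalized DTW = {d:.3f}")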

Option 3: Principal Component Analysis#

Let’s leverage the PCA functionality in Pyleoclim. A tutorial for this functionality is available here.

The first step is to bring the records onto a common time axis and standardize them:

mgs_srca_sc = mgs_srca.common_time().standardize()
fig, ax = mgs_srca_sc.stackplot()
../../_images/37ac7c677e4abfef7af801b675839f1d8ba824d0b7f9c7eab8ab75c8ebe9ed5d.png

Let’s run PCA:

pca = mgs_srca_sc.pca()

And let’s have a look at the first mode:

fig, ax = pca.modeplot()
../../_images/711e097baa3f42b9f139df455a9aed2b15dbd981b1371133f66518ed6f9c1568.png

Not surprisingly, the first PC explains over 75% of the variance. Let’s have a look at the loadings:

pca.eigvecs[1]
array([-0.54194915,  0.4525358 ,  0.03876175, -0.70710678])

The loadings are not very similar, which is interesting considering that two of these records are definitely the same. PCA may not work well for finding duplicates.

Based on this small proof of concept, correlation and DTW seem to be reasonable approaches to identify duplicates in the database.
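A possible way to combine the two criteria into a single screen is sketched below (the 0.9 correlation and 0.1 DTW cutoffs are illustrative assumptions, not values used in this notebook):

# Sketch: flag a pair as a duplicate candidate if either its correlation is very
# high or its normalized DTW distance is very low.
def is_duplicate_candidate(r, dtw_dist, r_cut=0.9, dtw_cut=0.1):
    high_corr = np.isfinite(r) and abs(r) >= r_cut
    low_dtw = np.isfinite(dtw_dist) and dtw_dist <= dtw_cut
    return high_corr or low_dtw

# Example usage with the Sr/Ca matrices computed above, for a pair of records (i, j):
# is_duplicate_candidate(corr_mat[i, j], dtw_matrix[i, j])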

Handling duplication part 2: \(\delta^{18}\mathrm{O}\)#

Let’s now have a look at the various \(\delta^{18}\mathrm{O}\) records.

df_distance_d18O = df_distance[df_distance['paleoData_variableName']=='d18O']
df_distance_d18O.head()
dataSetName compilationName archiveType geo_meanLat geo_meanLon geo_meanElev paleoData_variableName paleoData_standardName paleoData_values paleoData_units ... paleoData_proxyGeneral paleoData_seasonality paleoData_interpName paleoData_interpRank TSID time_variableName time_standardName time_values time_units distance_km
1 Ocn-Palmyra.Cobb.2013 Pages2kTemperature Coral 5.870 -162.130 -7.0 d18O d18O [-4.651, -4.631, -4.629, -4.562, -4.58, -4.442... permil ... None subannual temperature NaN Ocean2kHR_139 year year [928.125, 928.209, 928.292, 928.375, 928.459, ... yr AD 1.176328
2 CO03COPM iso2k Coral 5.870 -162.130 -7.0 d18O d18O [-5.27, -5.42, -5.49, -5.5, -5.44, -5.48, -5.5... permil ... None subannual temperature NaN Ocean2kHR_139_iso2k year year [1998.37, 1998.29, 1998.2, 1998.13, 1998.04, 1... yr AD 1.176328
3 NU11PAL01 CoralHydro2k Coral 5.867 -162.133 -9.0 d18O d18O [-4.79, -4.89, -4.81, -4.84, -4.85, -4.82, -4.... permil ... None None None NaN NU11PAL01_d18O year year [1886.13, 1886.21, 1886.29, 1886.38, 1886.46, ... yr AD 1.439510
5 SA19PAL01 CoralHydro2k Coral 5.878 -162.142 -10.0 d18O d18O [-5.21, -5.23, -5.17, -5.13, -5.11, -5.05, -5.... permil ... None None None NaN SA19PAL01_d18O year year [1983.58, 1983.67, 1983.75, 1983.83, 1983.92, ... yr AD 2.754166
7 SA19PAL02 CoralHydro2k Coral 5.878 -162.142 -10.0 d18O d18O [-5.05, -5.1, -5.21, -5.14, -5.23, -5.32, -5.3... permil ... None None None NaN SA19PAL02_d18O year year [1985.92, 1986.0, 1986.08, 1986.17, 1986.25, 1... yr AD 2.754166

5 rows × 21 columns

ts_list = []
for _, row in df_distance_d18O.iterrows():
    proxy = row['paleoData_proxy'] if isinstance(row['paleoData_proxy'],str) else row['paleoData_variableName']
    ts_list.append(pyleo.GeoSeries(time=row['time_values'],value=row['paleoData_values'],
                            time_name='Time',value_name=row['paleoData_variableName'],
                            time_unit=row['time_units'], value_unit=row['paleoData_units'],
                            lat = row['geo_meanLat'], lon = row['geo_meanLon'], 
                            elevation = row['geo_meanElev'], importedFrom='lipdverse.org',
                            archiveType = row['archiveType'],
                            observationType=row['paleoData_proxy'],       
                            label=row['dataSetName']+'_'+proxy, verbose = False)) 

# Add the Dee et al. record

ts_list.append(pyleo.GeoSeries(time=df_palm_filt['time_values'].iloc[0],
                    value=df_palm_filt['paleoData_values'].iloc[0],
                    time_name='Time',
                    value_name=df_palm_filt['paleoData_variableName'].iloc[0],
                    time_unit=df_palm_filt['time_units'].iloc[0], 
                    value_unit=df_palm_filt['paleoData_units'].iloc[0],
                    lat = df_palm_filt['geo_meanLat'].iloc[0],
                    lon = df_palm_filt['geo_meanLon'].iloc[0], 
                    elevation = df_palm_filt['geo_meanElev'].iloc[0],
                    importedFrom='lipdverse.org',
                    archiveType = df_palm_filt['archiveType'].iloc[0],
                    observationType=df_palm_filt['paleoData_proxy'].iloc[0],       
                    label=df_palm_filt['dataSetName'].iloc[0]+'_'+proxy, verbose = False))

mgs = pyleo.MultipleGeoSeries(ts_list) 
fig, ax = mgs.stackplot()
[Figure: stack plot of the d18O records near Palmyra]

As we have seen with the Sr/Ca records, there appear to be duplicates among these series, although some of the differences look like changes in the age models. Let’s explore correlation and DTW to weed out the duplicates.

Option 1: Correlation#
n = len(mgs.series_list)
corr_mat = np.empty((n,n))
corr_mat[:] = np.nan # let's only work on half of the matrix to speed things up

for i in range(0, n-1): # loop over records 
    ts = mgs.series_list[i]    
    mgsl = pyleo.MultipleGeoSeries(mgs.series_list[i+1:]) #gather potential pairs
    corr = mgsl.correlation(ts,method='built-in') # compute correlations
    corr_mat[i,i+1:] = corr.r # save result in matrix
labels = []
for item in mgs.series_list:
    labels.append(item.label)
corr_matrix_df = pd.DataFrame(corr_mat, index=labels, columns=labels)
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix_df, annot=True, fmt=".2f", cmap="coolwarm", center=0)
plt.title("Correlation Heatmap")
plt.tight_layout()
plt.show()
[Figure: correlation heatmap of the d18O records]

Some of the correlations are high, suggesting that the shorter segments were incorporated into the composited records. Note that some correlations were not computed: pairs with non-overlapping segments were skipped.

Option 2: DTW#

Let’s look at the DTW method to identify duplicates.

# --- Prepare the data ----
records = {}
times = {}

for series in mgs.series_list:
    records[series.label] = series.value
    times[series.label] = series.time


# --- Main loop ---
labels = list(records.keys())
n = len(labels)
dtw_matrix = np.full((n, n), np.nan)

for i, j in itertools.combinations(range(n), 2):
    key_i, key_j = labels[i], labels[j]

    s1, s2 = clip_to_overlap(times[key_i], records[key_i],
                             times[key_j], records[key_j])

    if s1 is not None:
        path = dtw.warping_path(s1, s2)
        dist = dtw.distance(s1, s2)
        norm_dist = dist / len(path)  # Per-step normalization

        dtw_matrix[i, j] = dtw_matrix[j, i] = norm_dist

# Convert to DataFrame for visualization
dtw_df = pd.DataFrame(dtw_matrix, index=labels, columns=labels)
plt.figure(figsize=(10, 8))
sns.heatmap(dtw_df, annot=True, fmt=".2f", cmap="viridis", mask=np.isnan(dtw_df))
plt.title("DTW Distance Matrix (Clipped to Overlap)")
plt.tight_layout()
plt.show()
[Figure: DTW distance matrix (clipped to overlap) for the d18O records]

The DTW method largely confirms what we have seen from the correlations in terms of duplicates. Note, however, that Ocn-Palmyra.Cobb.2013 and CO03COPM have a correlation of 0.79 but a DTW distance of 0: the DTW distance essentially indicates that they are the same record, even though the correlation falls below 0.8. This is due to the interpolation needed to compute the correlation across the gap in the 14th century, and it highlights the need to run DTW for records with larger gaps. We will use the insights gained here to clean the database in the next section of this notebook. For now, let’s add the Dee et al. (2020) record to the database:

df_raw = pd.concat([df_raw, df_palm_filt], ignore_index=True)
df_raw.tail()
dataSetName compilationName archiveType geo_meanLat geo_meanLon geo_meanElev paleoData_variableName paleoData_standardName paleoData_values paleoData_units ... paleoData_proxyGeneral paleoData_seasonality paleoData_interpName paleoData_interpRank TSID time_variableName time_standardName time_values time_units distance_km
968 ZI08MAY01 CoralHydro2k Coral -12.6500 45.10 -2.0 SrCa Sr/Ca [8.947578, 8.797017, 8.784511, 8.751525, 8.778... mmol/mol ... None None None NaN ZI08MAY01_SrCa year year [1881.6247, 1881.791367, 1881.958033, 1882.124... yr AD 16936.585327
969 LI06FIJ01 CoralHydro2k Coral -16.8200 179.23 -10.0 d18O d18O [-4.6922, -4.6266, -4.6018, -4.5486, -4.6102, ... permil ... None None None NaN LI06FIJ01_d18O year year [1617.5, 1618.5, 1619.5, 1620.5, 1621.5, 1622.... yr AD 3250.716202
970 SM06LKF02 CoralHydro2k Coral 24.5600 -81.41 -4.0 d18O d18O [-3.85, -3.98, -4.21, -4.06, -3.97, -4.04, -3.... permil ... None None None NaN SM06LKF02_d18O year year [1960.97, 1961.03, 1961.09, 1961.15, 1961.21, ... yr AD 8799.120973
971 SM06LKF02 CoralHydro2k Coral 24.5600 -81.41 -4.0 SrCa Sr/Ca [9.225, nan, 9.195, 9.221, 9.198, 9.281, 9.319... mmol/mol ... None None None NaN SM06LKF02_SrCa year year [1960.97, 1961.03, 1961.09, 1961.15, 1961.21, ... yr AD 8799.120973
972 Palmyra.Dee.2020 None Coral 5.8664 -162.12 -9.0 d18O d18O [-4.749, -4.672, -4.724, -4.717, -4.8947, -4.8... permil ... isotopic subannual temperature NaN PCU-1cce3af5639a4ec year year [1146.375, 1146.4583, 1146.5417, 1146.625, 114... yr AD NaN

5 rows × 21 columns

Let’s drop the distance column we added earlier:

df_raw = df_raw.drop(columns=['distance_km'])

Data Cleaning and Pre-processing#

This section deals with cleaning and pre-processing for use with the cfr codebase. We will do so in three steps:

  • Step 1: Remove records shorter than 150 years, i.e., records that do not extend back to at least 1850.

  • Step 2: Remove duplicates, using an algorithm that generalizes the exploratory analysis we did above for the specific case of the Palmyra records.

  • Step 3: Filter for records with annual to subannual resolution.

Step 1: Remove records shorter than 150 years#

For reconstruction purposes, we need series that go back at least to 1850; many in this database do not, so we remove them to avoid disappointments later on. Let’s first look at the start date of these records.

df_raw['min_time'] = df_raw['time_values'].apply(lambda x: min(x) if isinstance(x, list) and x else float('inf'))
filtered_df = df_raw[df_raw['min_time'] < 1850]
filtered_df = filtered_df.drop(columns=['min_time'])
print(filtered_df.shape)
(744, 20)

That leaves us with 744 timeseries.

Step 2: Screen for duplicates#

Based on the preliminary work we have done with the Palmyra records, we will use the following algorithm to identify duplicates:

  1. Look for records within 10km of each other

  2. Look at whether the variable name is the same

  3. Calculate correlations among the records

  4. Calculate the DTW distance among the records.

  5. For correlations >0.8 OR DTW distance < 0.03, use the following (see the sketch after this list):

    a. Use the latest publication date (based on dataset name)

    b. Use the longest record if the publication date is the same
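
In this notebook, rules 5a and 5b are applied manually through the review widget further down. For reference, here is a minimal, hypothetical sketch of how this tie-break could be automated; the helper pick_record is not part of the actual pipeline and assumes that, when present, the publication year appears as a four-digit number in the dataset name (e.g., Ocn-Palmyra.Cobb.2013):

import re

def pick_record(candidates):
    """Hypothetical tie-breaker (rules 5a/5b): keep the row whose dataSetName
    embeds the latest publication year; break ties by record length."""
    def pub_year(name):
        m = re.search(r'(1[89]\d{2}|20\d{2})', str(name))
        return int(m.group(1)) if m else -1  # no year found in the name
    ranked = candidates.assign(
        _year=candidates['dataSetName'].map(pub_year),
        _npts=candidates['paleoData_values'].map(len),
    ).sort_values(['_year', '_npts'], ascending=False)
    return ranked.iloc[0].drop(['_year', '_npts'])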

Let’s have a look at the variable names present in the database to see if there are synonyms we have to deal with. Let’s make sure that we use a criterion that doesn’t have any missing values:

filtered_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 744 entries, 0 to 972
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   dataSetName             744 non-null    object 
 1   compilationName         743 non-null    object 
 2   archiveType             744 non-null    object 
 3   geo_meanLat             744 non-null    float64
 4   geo_meanLon             744 non-null    float64
 5   geo_meanElev            737 non-null    object 
 6   paleoData_variableName  744 non-null    object 
 7   paleoData_standardName  744 non-null    object 
 8   paleoData_values        744 non-null    object 
 9   paleoData_units         336 non-null    object 
 10  paleoData_proxy         665 non-null    object 
 11  paleoData_proxyGeneral  501 non-null    object 
 12  paleoData_seasonality   659 non-null    object 
 13  paleoData_interpName    665 non-null    object 
 14  paleoData_interpRank    12 non-null     float64
 15  TSID                    744 non-null    object 
 16  time_variableName       744 non-null    object 
 17  time_standardName       744 non-null    object 
 18  time_values             744 non-null    object 
 19  time_units              744 non-null    object 
dtypes: float64(3), object(17)
memory usage: 122.1+ KB

From this, it seems that paleoData_standardName could work for our purposes since there are no missing values in this field and most of the names should have been standardized:

filtered_df['paleoData_standardName'].unique()
array(['d18O', 'MXD', 'density', 'ringWidth', 'temperature',
       'reflectance', 'Sr/Ca', 'iceMelt', 'd2H', 'thickness',
       'RABD660670', 'calcificationRate', 'varveThickness', 'MAR'],
      dtype=object)

The names look standardized, with no obvious synonyms to merge. Let’s proceed with the main loop. But first, let’s define some constants and parameters for the algorithm described above:

# --- Define some parameters ---
threshold  = 10 # Distance in km
corr_threshold = 0.8 # Corr threshold over which two records are considered the same. 
dtw_threshold = 0.03 # DTW threshold under which two records are considered the same. 

Let’s also write some functions to calculate the various criteria described above:

  1. Distance on a sphere using the haversine formula

  2. Correlation among multiple timeseries using Pyleoclim

  3. DTW among multiple timeseries

# --- Haversine distance function in km ---
def haversine(lat1, lon1, lats2, lons2, R=6371):
    """Calculate the haversine distance between a target location and a vector of locations. 

    Args:
        lat1 (float): The target latitude
        lon1 (float): the target longitude
        lats2 (np.array or df.Series): The latitude of the locations to check against
        lons2 (np.array or df.Series): The longitude of the locations to check against
        R (int, optional): The radius of the planet in km. Defaults to 6371 for Earth.

    Returns:
        np.array: The distance between the target location and the vector of locations in km.
    """
    # Convert degrees to radians
    lat1 = np.radians(lat1)
    lon1 = np.radians(lon1)
    lats2 = np.radians(lats2)
    lons2 = np.radians(lons2)

    # Haversine formula
    dlat = lats2 - lat1
    dlon = lons2 - lon1
    
    a = np.sin(dlat / 2)**2 + np.cos(lat1) * np.cos(lats2) * np.sin(dlon / 2)**2
    c = 2 * np.arcsin(np.sqrt(a))

    return R*c


# ----- Correlation function ------

def correlation(data):
    """Calculate correlation among multiple timeseries

    Parameters
    ----------
    data: pd.DataFrame
        A pandas DataFrame containing the necessary information to create Pyleoclim.GeoSeries() objects
    
    Returns
    -------
    A DataFrame containing pairwise correlations
    """
    ds_name1 = [] # list of dataset names used as correlation basis
    ds_name2 = [] # list of dataset names used as the series to correlate against
    # Same for tsids since they are unique and we can find them again easily in the database
    tsid1 = []
    tsid2 = []
    # list to keep the correlations
    corr = []

    # Create a MultipleGeoSeries object

    ts_list = []

    for idx, row in data.iterrows():

        ts = pyleo.GeoSeries(time=row['time_values'],value=row['paleoData_values'],
                            time_name='Time',value_name=row['paleoData_standardName'],
                            time_unit=row['time_units'], value_unit=row['paleoData_units'],
                            lat = row['geo_meanLat'], lon = row['geo_meanLon'], 
                            elevation = row['geo_meanElev'], archiveType = row['archiveType'],    
                            label=row['TSID']+'+'+row['dataSetName'], verbose = False)
        ts_list.append(ts)

    for i,j in itertools.combinations(range(len(ts_list)), 2):
        ts1 = ts_list[i] #grab the first timeseries
        ts2 = ts_list[j]

        # append the names and tsids
        ds_name1.append(ts1.label.split('+')[1])
        tsid1.append(ts1.label.split('+')[0])
                
        ds_name2.append(ts2.label.split('+')[1])
        tsid2.append(ts2.label.split('+')[0])

        # calculate the correlation
        corr_res = ts1.correlation(ts2, method = 'built-in')

        # Store the information
        corr.append(corr_res.r)
    
    # Step 3: Summarize results

    return pd.DataFrame({'dataSetName1': ds_name1,
                         'dataSetName2': ds_name2,
                         'TSID1': tsid1,
                         'TSID2': tsid2,
                        'correlation': corr}) 


# ------ DTW function


def dtw_pair(data):
    """Calculates the DTW betwen the records in the dataframe

    Args:
        data (pd.DataFrame): A pandas DataFrame containing the necessary information to create Pyleoclim.GeoSeries() objects

    Returns:
        pd.DataFrame: A DataFrame containing pairwise DTW
    """

    ### Helper function
    
    def clip_to_overlap(time1, series1, time2, series2, min_overlap=5):
        """Helper function to clip the series to common overlap. Note we do  not need to interpolate for DTW, just make sure that the time is overlapping

        Args:
            time1 (np.array): Time vector for Series 1
            series1 (np.array): Value vector for Series 1
            time2 (np.array): Time vector for Series 2
            series2 (np.array): Value vector for Series 2
            min_overlap (int, optional): Minimum number of points to even attempt the computation. Defaults to 5.

        Returns:
            np.array, np.array: The zscore of the values for the time of overlap.  
        """
        # Make sure things are sorted and have no NaNs
        valid1 = (~np.isnan(time1)) & (~np.isnan(series1))
        time1 = time1[valid1]
        series1 = series1[valid1]

        valid2 = (~np.isnan(time2)) & (~np.isnan(series2))
        time2 = time2[valid2]
        series2 = series2[valid2]

        # Return early if either series is now empty
        if len(time1) == 0 or len(time2) == 0:
            return None, None


        sort_idx1 = np.argsort(time1)
        time1_sorted = time1[sort_idx1]
        series1_sorted = series1[sort_idx1]

        sort_idx2 = np.argsort(time2)
        time2_sorted = time2[sort_idx2]
        series2_sorted = series2[sort_idx2]

        # Determine overlapping interval
        start = max(time1_sorted[0], time2_sorted[0])
        end = min(time1_sorted[-1], time2_sorted[-1])

        # Apply mask
        mask1 = (time1_sorted >= start) & (time1_sorted <= end)
        mask2 = (time2_sorted >= start) & (time2_sorted <= end)

        if np.sum(mask1) < min_overlap or np.sum(mask2) < min_overlap:
            return None, None # Not enough overlapping points

        clipped1 = series1_sorted[mask1]
        clipped2 = series2_sorted[mask2]

        # Avoid division by zero if std = 0
        if np.std(clipped1) == 0 or np.std(clipped2) == 0:
            return None, None

        return zscore(clipped1), zscore(clipped2)
   
    ### Main function 

    ds_name1 = [] # list of dataset names used as correlation basis
    ds_name2 = [] # list of dataset names used as the series to correlate against
    # Same for tsids since they are unique and we can find them again easily in the database
    tsid1 = []
    tsid2 = []
    # list to keep the DTW
    dtw_list = []

    for idx1, idx2 in itertools.combinations(data.index, 2):
        row1 = data.loc[idx1] #grab the first record
        row2 = data.loc[idx2] #grab the second record

        ds_name1.append(row1['dataSetName'])
        ds_name2.append(row2['dataSetName'])
        tsid1.append(row1['TSID'])
        tsid2.append(row2['TSID'])

        s1, s2 = clip_to_overlap(np.array(row1['time_values']), np.array(row1['paleoData_values']),
                                 np.array(row2['time_values']), np.array(row2['paleoData_values']))
        

        if s1 is not None:
            path = dtw.warping_path(s1, s2)
            dist = dtw.distance(s1, s2)
            dtw_list.append(dist / len(path))  # Per-step normalization
        else:
            dtw_list.append(np.nan)

    df= pd.DataFrame({'dataSetName1': ds_name1,
                       'dataSetName2': ds_name2,
                       'TSID1': tsid1,
                       'TSID2': tsid2,
                       'dtw': dtw_list})
        
    return df
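
Before running the main loop, here is a quick, illustrative check of the haversine helper (not part of the original pipeline), using the Palmyra coordinates listed in the tables above: the Dee et al. (2020) site should sit roughly 1.2 km from Ocn-Palmyra.Cobb.2013, consistent with the distance_km values shown earlier.

# Sanity check: distance between the Dee et al. (2020) Palmyra site and Ocn-Palmyra.Cobb.2013
d = haversine(5.8664, -162.12, np.array([5.870]), np.array([-162.130]))
print(f"{d[0]:.2f} km")  # ~1.18 km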
    
df_list = [] # Container for the DataFrames of potential duplicates (high correlation or low DTW)
tsids = [] #to keep track of the tsids that have already been checked against the distance criteria

#make a copy of the dataframe as we need to modify it for this loop
df_copy = filtered_df.copy()

for idx, row in df_copy.iterrows():

    # Check that we haven't already looked at the record in the context of another distance search

    if row['TSID'] in tsids:
        continue #go to the next row

    # ---- Criteria 1: Distance -----
    # Target location
    lat1, lon1 = row['geo_meanLat'], row['geo_meanLon']
    distance_km = haversine(lat1,lon1,df_copy['geo_meanLat'],df_copy['geo_meanLon'])

    # Add distance column
    df_copy['distance_km'] = distance_km

    # Filter to only rows within a threshold
    df_distance = df_copy[df_copy['distance_km'] <= threshold].reset_index(drop=True)

    # Store the TSIDs so we don't run this multiple times
    tsids.extend(df_distance['TSID'].to_list())

    # If there is only one entry (the original one), go to the next row
    if len(df_distance)<=1:
        continue

    # ---- Criteria 2: Variable Name ------

    var_unique = df_distance['paleoData_standardName'].unique()

    for var in var_unique:
        df_var_filt = df_distance[df_distance['paleoData_standardName'] == var].reset_index(drop=True)
        
        if len(df_var_filt) >1:

    # ----- Criteria 3: Correlation -------
            corr_df = correlation(df_var_filt)
    
    # ------ Criteria 4: DTW --------
            dtw_df = dtw_pair(df_var_filt)
        
            #put the results together
            common_cols = ['dataSetName1', 'dataSetName2', 'TSID1','TSID2']

            df_combined = pd.merge(corr_df, dtw_df, on=common_cols)
            df_res = df_combined[(df_combined['correlation'] > corr_threshold) | (df_combined['dtw'] < dtw_threshold)]

            if not df_res.empty:
                df_list.append(df_res)

Let’s see how many groups of similar records we have:

len(df_list)
89

Let’s have a look at each group of potential duplicates and create a log (human in the loop) of why records were removed from or kept in the database.

⚠️ Warning

The following cell produces a widget that allows you to go through each of the similar records and to keep/remove records. If you do so, you will change the nature of the database.

❗ DO NOT RUN THIS CELL UNLESS YOU WANT TO CLEAN THE DATABASE YOURSELF. NOT RECOMMENDED!!!!!! Skip to the last section to load the cleaned database directly.
from IPython.display import display, clear_output, Markdown
import ipywidgets as widgets

# Storage
flagged_records = []
group_notes = []

df_copy = filtered_df.copy()
index_tracker = {'i': 0}

def run_iteration(i):
    clear_output(wait=True)

    # Final screen
    if i >= len(df_list):
        print("✅ Review complete!")

        if flagged_records:
            print("\n🚩 Flagged TSIDs:")
            display(pd.DataFrame(flagged_records))
        else:
            print("\n🚩 No TSIDs were flagged.")

        if group_notes and any(n['Note'].strip() for n in group_notes):
            print("\n📝 Notes from each group:")
            markdown_notes = "\n\n".join(
                [
                    f"### Iteration {note['Group'] + 1}\n"
                    f"```\n{note['Note']}\n```"
                    for note in group_notes if note['Note'].strip()
                ]
            )
            display(Markdown(markdown_notes))
        else:
            print("\n📝 No group notes were entered.")

        # Final Done button
        done_button = widgets.Button(description="Done", button_style='success')

        def on_done_clicked(b):
            clear_output(wait=True)
            print("🎉 Done! You may now export your results.")

        done_button.on_click(on_done_clicked)
        display(done_button)
        return

    # Begin reviewing group i
    df = df_list[i]
    print(f"--- Reviewing DataFrame {i+1}/{len(df_list)} ---")

    valid_ids = pd.concat([df['TSID1'], df['TSID2']]).unique()
    df_lookup = df_copy[df_copy['TSID'].isin(valid_ids)]
    display(df_lookup)

    # Plot time series
    ts_list = []
    for _, row in df_lookup.iterrows():
        ts = pyleo.GeoSeries(
            time=row['time_values'],
            value=row['paleoData_values'],
            time_name='Time',
            value_name=row['paleoData_standardName'],
            time_unit=row['time_units'],
            value_unit=row['paleoData_units'],
            lat=row['geo_meanLat'],
            lon=row['geo_meanLon'],
            elevation=row['geo_meanElev'],
            archiveType=row['archiveType'],
            label=row['TSID'] + '+' + row['dataSetName'],
            verbose=False
        )
        ts_list.append(ts)

    if ts_list:
        mgs = pyleo.MultipleGeoSeries(ts_list)
        fig, ax = mgs.stackplot()
        fig.show()
    else:
        print("No valid time series to plot.")

    # Create widgets for TSIDs
    tsid_widgets = []
    for tsid in valid_ids:
        toggle = widgets.RadioButtons(
            options=['Keep', 'Remove'],
            value='Keep',
            description=f'{tsid}:',
            layout=widgets.Layout(width='40%')
        )
        tsid_widgets.append((tsid, toggle))

    toggle_box = widgets.VBox([w[1] for w in tsid_widgets])

    # Shared note box (pre-filled)
    note_template = "Records from different samples. Not true duplicates."
    note_box = widgets.Textarea(
        value=note_template,
        layout=widgets.Layout(width='100%', height='60px')
    )

    # Buttons
    next_button = widgets.Button(description="Next", button_style='success')
    back_button = widgets.Button(description="Back", button_style='warning')
    quit_button = widgets.Button(description="Quit", button_style='danger')
    button_box = widgets.HBox([next_button, back_button, quit_button])

    # --- Button callbacks ---
    def on_next_clicked(b):
        note = note_box.value.strip()

        # Remove prior flags and note for this group
        group_notes[:] = [n for n in group_notes if n['Group'] != i]
        flagged_records[:] = [r for r in flagged_records if r['TSID'] not in valid_ids]

        # Save new notes and flags
        group_notes.append({'Group': i, 'Note': note})
        for tsid, toggle in tsid_widgets:
            if toggle.value == 'Remove':
                flagged_records.append({'TSID': tsid, 'Note': note})

        index_tracker['i'] += 1
        run_iteration(index_tracker['i'])

    def on_back_clicked(b):
        if index_tracker['i'] > 0:
            index_tracker['i'] -= 1
            run_iteration(index_tracker['i'])
        else:
            print("You're already at the first group.")

    def on_quit_clicked(b):
        clear_output(wait=True)
        print("⚠️ Quit early.")

        if flagged_records:
            print("\n🚩 Flagged TSIDs:")
            display(pd.DataFrame(flagged_records))

        if group_notes and any(n['Note'].strip() for n in group_notes):
            print("\n📝 Notes from each group:")
            markdown_notes = "\n\n".join(
                [
                    f"### Iteration {note['Group'] + 1}\n"
                    f"```\n{note['Note']}\n```"
                    for note in group_notes if note['Note'].strip()
                ]
            )
            display(Markdown(markdown_notes))
        else:
            print("\n📝 No group notes were entered.")

    # Hook up buttons
    next_button.on_click(on_next_clicked)
    back_button.on_click(on_back_clicked)
    quit_button.on_click(on_quit_clicked)

    # Display UI
    display(button_box, toggle_box, note_box)

# Start the interactive loop
run_iteration(index_tracker['i'])
TSID Note
0 Ocean2kHR_142_iso2k Ocean2kHR_142_iso2k is a duplicate of Ocean2kH...
1 Ocean2kHR_156_iso2k Ocean2kHR_142_iso2k is a duplicate of Ocean2kH...
2 Ocean2kHR_157_iso2k Ocean2kHR_142_iso2k is a duplicate of Ocean2kH...
3 LI00RAR01_d18O Same records. Keeping 147.
4 LI00RAR01_SrCa Same record. Keeping 149
5 Ocean2kHR_006_iso2k Same records, keeping Ocean2kHR_006 from PAGES2k
6 CH03LOM01_d18O Same records, keeping Ocean2kHR_006 from PAGES2k
7 Ocean2kHR_019_iso2k Same record. Keeping Ocean2kHR_019
8 FE18RUS01_d18O Same record. Keeping Ocean2kHR_019
9 CA07FLI01_SrCa Records from different samples. Not true dupli...
10 Ocean2kHR_193_iso2k Same record. Keeping Ocean2kHR_193
11 CA07FLI01_d18O Same record. Keeping Ocean2kHR_193
12 Ocean2kHR_026_iso2k Two records but duplicates. Keeping Ocean2kHR_...
13 KU99HOU01_d18O Two records but duplicates. Keeping Ocean2kHR_...
14 ZI14HOU01_d18O Two records but duplicates. Keeping Ocean2kHR_...
15 ZI14HOU01_SrCa Two records. Keeping Ocean2kHR 027 and 025
16 Ocean2kHR_133_iso2k Same record across all compilations. Keeping O...
17 QU96ESV01_d18O Same record across all compilations. Keeping O...
18 Ocean2kHR_183_iso2k Same record across all three compilations. Kee...
19 OS14UCP01_d18O Same record across all three compilations. Kee...
20 Ocean2kHR_166_iso2k Same record across all three compilations. Kee...
21 AS05GUA01_d18O Same record across all three compilations. Kee...
22 KI08PAR01_d18O Same record from two different compilations. K...
23 Ocean2kHR_173_iso2k Same record from three different compilations....
24 DU94URV01_d18O Same record from three different compilations....
25 Ocean2kHR_001_iso2k Same record from three different compilations....
26 CO00MAL01_d18O Same record from three different compilations....
27 DR99ABR01_d18O Same record from two different compilations. K...
28 Ocean2kHR_094_iso2k Same record from two different compilations. K...
29 Ocean2kHR_007_iso2k Same record from three compilations (note that...
30 DA06MAF02_d18O Same record from three compilations (note that...
31 Ocean2kHR_177_iso2k Same record from three different compilations....
32 UR00MAI01_d18O Same record from three different compilations....
33 SAm_035_iso2k Same record from two different compilations. K...
34 Ocean2kHR_139 Palmyra record!!!! Keeping PCU-1cce3af5639a4ec...
35 Ocean2kHR_139_iso2k Palmyra record!!!! Keeping PCU-1cce3af5639a4ec...
36 CO03PAL08_d18O Palmyra record!!!! Keeping PCU-1cce3af5639a4ec...
37 CO03PAL09_d18O Palmyra record!!!! Keeping PCU-1cce3af5639a4ec...
38 CO03PAL04_d18O Palmyra record!!!! Keeping PCU-1cce3af5639a4ec...
39 CO03PAL10_d18O Palmyra record!!!! Keeping PCU-1cce3af5639a4ec...
40 CO03PAL05_d18O Palmyra record!!!! Keeping PCU-1cce3af5639a4ec...
41 CO03PAL07_d18O Palmyra record!!!! Keeping PCU-1cce3af5639a4ec...
42 CO03PAL06_d18O Palmyra record!!!! Keeping PCU-1cce3af5639a4ec...
43 CO03PAL02_d18O Palmyra record!!!! Keeping PCU-1cce3af5639a4ec...
44 CO03PAL03_d18O Palmyra record!!!! Keeping PCU-1cce3af5639a4ec...
45 CO03PAL01_d18O Palmyra record!!!! Keeping PCU-1cce3af5639a4ec...
46 Ocean2kHR_020_iso2k Same record from different compilations. Keepi...
47 ZI04IFR01_d18O Same record from different compilations. Keepi...
48 DE12ANC01_SrCa Same record from two different compilations. K...
49 WE09ARR01_d18O Same record from two different compilations. K...
50 WE09ARR01_SrCa Same record from two different compilations. K...
51 Ocean2kHR_013_iso2k Same record from two different compilations. K...
52 Ocean2kHR_126_iso2k Same record from different compilations. Keepi...
53 Ocean2kHR_105_iso2k Same reocords. Keeping Ocean2kHR 102 and 105 a...
54 Ocean2kHR_102_iso2k Same reocords. Keeping Ocean2kHR 102 and 105 a...
55 Ocean2kHR_154_iso2k Same record from different compilation. Keepin...
56 GO12SBV01_d18O Same record from different compilation. Keepin...
57 WU13TON01_SrCa Same record from different compilations. Keepi...
58 Ocean2kHR_138_iso2k Same record from different compilations. Keepi...
59 LI94SEC01_d18O Same record from different compilations. Keepi...

Iteration 1

Records from different samples. Not true duplicates.

Iteration 2

Records from different samples. Not true duplicates.

Iteration 3

Records from different samples. Not true duplicates.

Iteration 4

Records from different samples. Not true duplicates.

Iteration 5

Records from different samples. Not true duplicates.

Iteration 6

Records from different samples. Not true duplicates.

Iteration 7

Records from different samples. Not true duplicates.

Iteration 8

Records from different samples. Not true duplicates.

Iteration 9

Ocean2kHR_142_iso2k is a duplicate of Ocean2kHR_142, most likely from PAGES2k. Removing iso2k record. Same with 156 and 157

Iteration 10

From publications info in GraphDB. Different records.

Iteration 11

Same records. Keeping 147.

Iteration 12

Same record. Keeping 149

Iteration 13

Same records, keeping Ocean2kHR_006 from PAGES2k

Iteration 14

Same record. Keeping Ocean2kHR_019

Iteration 15

Records from different samples. Not true duplicates.

Iteration 16

Records from different samples. Not true duplicates.

Iteration 17

Same record. Keeping Ocean2kHR_193

Iteration 18

Two records but duplicates. Keeping Ocean2kHR_023 and 026.

Iteration 19

Two records. Keeping Ocean2kHR 027 and 025

Iteration 20

Same record across all compilations. Keeping Ocean2kHR_133

Iteration 21

Records from different samples. Not true duplicates.

Iteration 22

Same record across all three compilations. Keeping Ocean2kHR_183

Iteration 23

Same record across all three compilations. Keeping Ocean2kHR_166

Iteration 24

Records from different samples. Not true duplicates.

Iteration 25

Records from different samples. Not true duplicates.

Iteration 26

Records from different samples. Not true duplicates.

Iteration 27

Records from different samples. Not true duplicates.

Iteration 28

Records from different samples. Not true duplicates.

Iteration 29

Records from different samples. Not true duplicates.

Iteration 30

Same record from two different compilations. Keeping Ocean2kHR_099

Iteration 31

Same record from two different compilations. Keeping Ocean2kHR_100

Iteration 32

Records from different samples. Not true duplicates.

Iteration 33

Records from different samples. Not true duplicates.

Iteration 34

Same record from three different compilations. Keeping Ocean2kHR_173

Iteration 35

Records from different samples. Not true duplicates.

Iteration 36

Same record from three different compilations. Keeping Ocean2kHR_001

Iteration 37

Records from different samples. Not true duplicates.

Iteration 38

Records from different samples. Not true duplicates.

Iteration 39

Records from different samples. Not true duplicates.

Iteration 40

Same record from two different compilations. Keeping Ocean2kHR_167

Iteration 41

Same record from two different compilations. Keeping Ocean2kHR_094

Iteration 42

Same record from three compilations (note that it's most likely from two different coral head which CoralHydro2k archived in different dataset). Keeping Ocean2kHR_007

Iteration 43

Records from different samples. Not true duplicates.

Iteration 44

Records from different samples. Not true duplicates.

Iteration 45

Records from different samples. Not true duplicates.

Iteration 46

Records from different samples. Not true duplicates.

Iteration 47

Same record from three different compilations. Keeping Ocean2kHR_177

Iteration 48

Records from different samples. Not true duplicates.

Iteration 49

Same record from two different compilations. Keeping SAm_035

Iteration 50

Records from different samples. Not true duplicates.

Iteration 51

Records from different samples. Not true duplicates.

Iteration 52

Records from different samples. Not true duplicates.

Iteration 53

Palmyra record!!!! Keeping PCU-1cce3af5639a4ec from Dee et al. (2020)

Iteration 54

Records from different samples. Not true duplicates.

Iteration 55

Records from different samples. Not true duplicates.

Iteration 56

Records from different samples. Not true duplicates.

Iteration 57

Records from different samples. Not true duplicates.

Iteration 58

Records from different samples. Not true duplicates.

Iteration 59

Records from different samples. Not true duplicates.

Iteration 60

Records from different samples. Not true duplicates.

Iteration 61

Records from different samples. Not true duplicates.

Iteration 62

Same record from different compilations. Keeping Ocean2kHR_020

Iteration 63

Records from different samples. Not true duplicates.

Iteration 64

Records from different samples. Not true duplicates.

Iteration 65

Records from different samples. Not true duplicates.

Iteration 66

Same record from two different compilations. Keeping Ocean2kHR_134

Iteration 67

Records from different samples. Not true duplicates.

Iteration 68

Records from different samples. Not true duplicates.

Iteration 69

Records from different samples. Not true duplicates.

Iteration 70

Records from different samples. Not true duplicates.

Iteration 71

Same record from two different compilations. Keeping Ocean2kHR_190

Iteration 72

Same record from two different compilations. Keeping Ocean2kHR_191

Iteration 73

Records from different samples. Not true duplicates.

Iteration 74

Records from different samples. Not true duplicates.

Iteration 75

Same record from two different compilations. Keeping Ocean2kHR_013

Iteration 76

Records from different samples. Not true duplicates.

Iteration 77

Records from different samples. Not true duplicates.

Iteration 78

Records from different samples. Not true duplicates.

Iteration 79

Records from different samples. Not true duplicates.

Iteration 80

Same record from different compilations. Keeping Ocean2kHR_126

Iteration 81

Same records. Keeping Ocean2kHR 102 and 105 as well as DR00NBB01, which seems to be unique

Iteration 82

Records from different samples. Not true duplicates.

Iteration 83

Same record from different compilation. Keeping Ocean2kHR_154

Iteration 84

Same record from different compilations. Keeping Ocean2kHR_159

Iteration 85

Records from different samples. Not true duplicates.

Iteration 86

Same record from different compilations. Keeping Ocean2kHR_138

Iteration 87

Records from different samples. Not true duplicates.

Iteration 88

Records from different samples. Not true duplicates.

Iteration 89

Records from different samples. Not true duplicates.
# Save flagged TSIDs and notes
df_flags = pd.DataFrame(flagged_records)
df_flags.to_csv("./data/flagged_tsids_with_notes.csv", index=False)

df_notes = pd.DataFrame(group_notes)
df_notes.to_csv("./data/group_notes.csv", index=False)

# Cleaned dataset
df_cleaned = df_copy[~df_copy['TSID'].isin(df_flags['TSID'])]
# save the file

df_cleaned.to_pickle("./data/cleaned_timeseries.pkl")

Building the cfr.ProxyDatabase#

Reopen the file:

df_cleaned = pd.read_pickle("./data/cleaned_timeseries.pkl")
df_cleaned.head()
dataSetName compilationName archiveType geo_meanLat geo_meanLon geo_meanElev paleoData_variableName paleoData_standardName paleoData_values paleoData_units paleoData_proxy paleoData_proxyGeneral paleoData_seasonality paleoData_interpName paleoData_interpRank TSID time_variableName time_standardName time_values time_units
0 Ant-WDC05A.Steig.2013 Pages2kTemperature Glacier ice -79.46 -112.09 1806.0 d18O d18O [-33.32873325, -35.6732, -33.1574, -34.2854, -... permil d18O None Annual temperature NaN Ant_030 year year [2005, 2004, 2003, 2002, 2001, 2000, 1999, 199... yr AD
1 NAm-MtLemon.Briffa.2002 Pages2kTemperature Wood 32.50 -110.80 2700.0 MXD MXD [0.968, 0.962, 1.013, 0.95, 1.008, 0.952, 1.02... None maximum latewood density dendrophysical Summer temperature NaN NAm_568 year year [1568, 1569, 1570, 1571, 1572, 1573, 1574, 157... yr AD
2 Arc-Arjeplog.Bjorklund.2014 Pages2kTemperature Wood 66.30 18.20 800.0 density density [-0.829089212152348, -0.733882889924006, -0.89... None maximum latewood density dendrophysical Jun-Aug temperature NaN Arc_060 year year [1200, 1201, 1202, 1203, 1204, 1205, 1206, 120... yr AD
3 Asi-CHIN019.Li.2010 Pages2kTemperature Wood 29.15 99.93 2150.0 ringWidth ringWidth [1.465, 1.327, 1.202, 0.757, 1.094, 1.006, 1.2... None ring width dendrophysical Annual temperature NaN Asia_041 year year [1509, 1510, 1511, 1512, 1513, 1514, 1515, 151... yr AD
4 NAm-Landslide.Luckman.2006 Pages2kTemperature Wood 60.20 -138.50 800.0 ringWidth ringWidth [1.123, 0.841, 0.863, 1.209, 1.139, 1.056, 0.8... None ring width dendrophysical Summer temperature NaN NAm_1876 year year [913, 914, 915, 916, 917, 918, 919, 920, 921, ... yr AD

Adding ptype and seasonality info#

LiPD files contain seasonality information (as months or phrases) and the information needed to create a cfr ‘ptype’. The ‘ptype’ is simply the proxy archive and variable name; for example, d18O of coral becomes coral.d18O. The seasonality, on the other hand, requires converting month names into the numerical list format that cfr accepts. For example, for a growing season of June, July, August, the cfr seasonality will be [6,7,8]. The following two functions, convert_seasonality and create_ptype, convert the original DataFrame columns into the format we need for assimilation.

def convert_seasonality(seasonality_str, latitude = None):
    """
    Convert LiPD seasonality strings to CFR month lists
    CFR expects lists like [6,7,8] for JJA, [12,1,2] for DJF, etc.
    
    Args:
        seasonality_str: String representation of seasonality from LiPD
        latitude: Site latitude in degrees, used to resolve hemisphere-dependent
            terms such as 'Summer', 'Winter', and 'Growing Season'

    Returns:
        list: List of month integers, or list(range(1,13)) for annual/None
    """
    # Default to annual for None/NaN
    if pd.isna(seasonality_str) or seasonality_str == '' or seasonality_str is None:
        return list(range(1, 13))
        
    seasonality_str = str(seasonality_str).strip()

    # Hemisphere-aware mapping for Summer/Winter and Growing season
    if seasonality_str in ['Summer', 'Winter'] and latitude is not None:
        if latitude > 0:  # NH
            if seasonality_str == 'Summer':
                return [6, 7, 8]  # JJA
            elif seasonality_str == 'Winter':
                return [12, 1, 2]  # DJF
        elif latitude < 0:  # SH
            if seasonality_str == 'Summer':
                return [12, 1, 2]  # DJF in NH = Summer in SH
            elif seasonality_str == 'Winter':
                return [6, 7, 8]   # JJA in NH = Winter in SH
            
    if seasonality_str in ['Growing Season'] and latitude is not None:
        if latitude > 0:  # NH
            return [4, 5, 6, 7, 8, 9]  # Apr-Sept
        else: # SH
            return [10, 11, 12, 1, 2, 3]  # Oct-Mar   
    
    # Exact mapping for the seasonality strings present in this database
    seasonality_map = {
        'Annual': list(range(1, 13)),
        'Jun-Aug': [6, 7, 8],
        'Nov-Feb': [11, 12, 1, 2],
        'May-Apr': [5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4],  # May to next April
        'Jun-Jul': [6, 7],
        'Mar-May': [3, 4, 5],
        'subannual': list(range(1, 13)),  # Default to annual??
        'Sep-Apr': [9, 10, 11, 12, 1, 2, 3, 4],
        'Jul': [7],
        'Dec-Feb': [12, 1, 2],
        'Apr': [4],
        'Jul-Jun': [7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6],  # Jul to next June
        'Sep-Nov': [9, 10, 11],
        'Nov-Oct': [11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],  # Nov to next Oct
        'Apr-Jul': [4, 5, 6, 7],
        'Jul-Sep': [7, 8, 9],
        'Oct-Apr': [10, 11, 12, 1, 2, 3, 4],
        'Nov-Apr': [11, 12, 1, 2, 3, 4],
        'Apr-Sep': [4, 5, 6, 7, 8, 9],
        'Dec-Mar': [12, 1, 2, 3],
        'Aug-Jul': [8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7],  # Aug to next July
        'Jun-Sep': [6, 7, 8, 9],
        'Sep-Aug': [9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8],  # Sep to next Aug
        'Mar-Aug': [3, 4, 5, 6, 7, 8],
        'Jun': [6],
        'May-Sep': [5, 6, 7, 8, 9],
        'Feb-Aug': [2, 3, 4, 5, 6, 7, 8],
        'Jul-Dec': [7, 8, 9, 10, 11, 12],
        'Mar-Oct': [3, 4, 5, 6, 7, 8, 9, 10],
        'May-Jul': [5, 6, 7],
        'Jan-Mar': [1, 2, 3],
        'Apr-Jun': [4, 5, 6],
    }
    
    # Direct lookup
    if seasonality_str in seasonality_map:
        return seasonality_map[seasonality_str]
    
    # If no pattern matches, default to annual
    print(f"Warning: Could not parse seasonality '{seasonality_str}', defaulting to annual")
    return list(range(1, 13))
def create_ptype(archive_type, proxy_standard_name):
    """
    Determine CFR proxy type from LiPD archive and proxy information
    
    Args:
        archive_type: LiPD archiveType
        proxy_standard_name: LiPD paleoData_standardName
        
    Returns:
        str: CFR-compatible proxy type string
    """
    archive = str(archive_type).lower() if pd.notna(archive_type) else 'unknown'
    proxy = str(proxy_standard_name).lower() if pd.notna(proxy_standard_name) else 'unknown'
    
    if 'wood' in archive or 'tree' in archive:
        if 'ring width' in proxy or 'trw' in proxy:
            return 'tree.TRW'
        elif 'density' in proxy or 'mxd' in proxy:
            return 'tree.MXD'
        else:
            return 'tree.TRW'
    elif 'coral' in archive:
        if 'd18o' in proxy:
            return 'coral.d18O'
        elif 'sr/ca' in proxy:
            return 'coral.SrCa'
        elif 'calcification' in proxy:
            return 'coral.calc'
        else:
            return 'coral.d18O'
    elif 'ice' in archive:
        if 'dd' in proxy or 'd2h' in proxy:  # proxy has been lower-cased above
            return 'ice.dD'
        elif 'melt' in proxy:
            return 'ice.melt'
        else:
            return 'ice.d18O'
    elif 'lake' in archive:
        if 'varvethickness' in proxy or 'varve thickness' in proxy:
            return 'lake.varve_thickness'
        elif 'varve' in proxy:
            return 'lake.varve_property'
        elif 'chironomid' in proxy:
            return 'lake.chironomid'
        elif 'midge' in proxy:
            return 'lake.midge'
        elif 'reflectance' in proxy:
            return 'lake.reflectance'
        elif 'bsi' in proxy:
            return 'lake.BSi'
        elif 'accumulation' in proxy:
            return 'lake.accumulation'
        else:
            return 'lake.other'
    elif 'marine' in archive:
        if 'alkenone' in proxy:
            return 'marine.alkenone'
        elif 'mg/ca' in proxy:
            return 'marine.MgCa'
        else:
            return 'marine.other'
    elif 'documents' in archive:
        return 'documents'
    elif 'borehole' in archive:
        return 'borehole'
    elif 'speleothem' in archive:
        return 'speleothem.d18O'
    elif 'sclerosponge' in archive:
        return 'sclerosponge.d18O'
    elif 'bivalve' in archive:
        return 'bivalve.d18O'
    else:
        return 'hybrid'
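
As a quick sanity check of the two helpers above, here is how they behave on a few representative inputs (illustrative only; the expected values follow directly from the mappings defined in the functions):

print(convert_seasonality('Jun-Aug'))                 # [6, 7, 8]
print(convert_seasonality('Summer', latitude=-17.5))  # [12, 1, 2] (austral summer)
print(create_ptype('Coral', 'Sr/Ca'))                 # coral.SrCa
print(create_ptype('Wood', 'MXD'))                    # tree.MXD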

Paleodata can sometimes be published in descending time order rather than ascending. This creates a problem because cfr.ProxyRecord.dt, the median time difference between data points, will be negative when the time values are in descending order. To prevent this issue before creating any proxy databases, we will use flip_df to reverse the offending records so that time is consistently increasing.

def flip_df(df):
    """
    Fix reverse time series in DataFrame 
    """
    
    df_fixed = df.copy()
    fixed_count = 0
    
    for idx, row in df_fixed.iterrows():
        time_values = np.array(row['time_values'])
        paleo_values = np.array(row['paleoData_values'])
        
        if len(time_values) > 1:
            # Check if time is descending (reverse)
            if time_values[0] > time_values[-1]:
                # Flip both arrays
                df_fixed.at[idx, 'time_values'] = time_values[::-1]
                df_fixed.at[idx, 'paleoData_values'] = paleo_values[::-1]
                fixed_count += 1
    
    print(f"Fixed {fixed_count} reverse time series in DataFrame")
    return df_fixed
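
# Quick illustration of flip_df on a hypothetical one-row DataFrame (not part of
# the pipeline): a record stored newest-first gets reversed so that time increases.
toy = pd.DataFrame({'time_values': [[2000, 1999, 1998]],
                    'paleoData_values': [[0.1, 0.2, 0.3]]})
print(flip_df(toy).at[0, 'time_values'])  # -> [1998 1999 2000]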
# apply the seasonality into new column 

df_cleaned['cfr_seasonality'] = df_cleaned.apply(lambda row: convert_seasonality(row['paleoData_seasonality'], row['geo_meanLat']),
                                                axis=1
)
# apply ptype conversion into new column

df_cleaned['cfr_ptype'] = df_cleaned.apply(
        lambda row: create_ptype(row['archiveType'], row['paleoData_standardName']), 
        axis=1
    )
# Create new dataframe with correctly reversed time/data values

df_cleaned_flip = flip_df(df_cleaned)
Fixed 173 reverse time series in DataFrame

A cfr.ProxyDatabase object takes in very specific parameters such as latitude, longitude, elevation, ptype, seasonality, and of course the time and data values. All of this is provided in the DataFrame, so create_pdb will seamlessly convert the necessary columns into a cfr.ProxyDatabase with everything needed for a data assimilation workflow.

def create_pdb(df):
    """
    Create CFR ProxyDatabase from DataFrame that already has cfr_seasonality and cfr_ptype columns
    
    Args:
        df: DataFrame with cfr_seasonality and cfr_ptype columns
        
    Returns:
        cfr.ProxyDatabase: Ready-to-use CFR ProxyDatabase with seasonality
    """
    print("Creating CFR ProxyDatabase from enhanced DataFrame...")
    
    records_list = []
    skipped_records = 0
    
    for idx, row in df.iterrows():
        if idx % 100 == 0:
            print(f"  Processing {idx}/{len(df)}...")
        
        try:
            # Get the complete time and value arrays directly
            time_values = np.array(row['time_values'])
            paleo_values = np.array(row['paleoData_values'])
            
            # Basic validation to make sure lengths match
            if len(time_values) != len(paleo_values) or len(time_values) == 0:
                skipped_records += 1
                continue
            
            # Remove any NaN values
            valid_mask = ~(np.isnan(time_values) | np.isnan(paleo_values))
            time_values = time_values[valid_mask]
            paleo_values = paleo_values[valid_mask]
            
            if len(time_values) == 0:
                skipped_records += 1
                continue
            
            # Basic info (LiPD's tsid = cfr's pid)
            dataset_name = str(row['dataSetName']) if pd.notna(row['dataSetName']) else f"Dataset_{idx}"
            tsid = str(row['TSID']) if pd.notna(row['TSID']) else f"TS_{idx}"
            
            # Location
            lat = float(row['geo_meanLat']) if pd.notna(row['geo_meanLat']) else 0.0
            lon = float(row['geo_meanLon']) if pd.notna(row['geo_meanLon']) else 0.0
            
            # Handle elevation
            elev = 0.0
            if pd.notna(row['geo_meanElev']):
                try:
                    elev_val = row['geo_meanElev']
                    if str(elev_val).lower() not in ['na', 'nan', 'none', '']:
                        elev = float(elev_val)
                except (ValueError, TypeError):
                    elev = 0.0
            
            # Use the pre-computed CFR columns
            seasonality = row['cfr_seasonality']
            ptype = row['cfr_ptype']
            
            # Optional metadata
            value_name = str(row['paleoData_variableName']) if pd.notna(row['paleoData_variableName']) else None
            value_unit = str(row['paleoData_units']) if pd.notna(row['paleoData_units']) else None
            
            # Create ProxyRecord directly
            record = cfr.ProxyRecord(
                pid=tsid,
                time=time_values,
                value=paleo_values,
                lat=lat,
                lon=lon,
                elev=elev,
                ptype=ptype,
                seasonality=seasonality,
                value_name=value_name,
                value_unit=value_unit
            )
            
            records_list.append(record)
            
        except Exception as e:
            print(f"    Error processing record {idx}: {e}")
            skipped_records += 1
            continue
    
    # Create empty ProxyDatabase
    proxy_db = cfr.ProxyDatabase()
    for record in records_list:
        proxy_db += record
    
    print(f"Created CFR ProxyDatabase with {proxy_db.nrec} records")
    if skipped_records > 0:
        print(f"   Skipped {skipped_records} records due to data issues")
    
    return proxy_db

Once our functions are created to go from a dataFrame into a ProxyDatabase, we will branch off into two options for the data assimilation step.

  • Option 1: Filter without interpolation

    • Create cfr.ProxyDatabase -> filter by dt <= 1 (annual + subannual only) -> save as ‘pdb_w_seasonality_annual.pkl’

  • Option 2: Filter with interpolation

    • For records with dt <= 5, interpolate to annual then refilter.

    • Create cfr.ProxyDatabase -> pyleo.interp -> filter by dt <=1 -> save as ‘pdb_wseasonality_interp.pkl’

Quick debug before saving ProxyDatabases#

The main thing we want to make sure of here is that seasonality is preserved when we convert from a pandas DataFrame to a cfr.ProxyDatabase and then to a pickle file. Here we run a quick check to make sure the important information is saved for all the records.

print(f"\n" + "="*50)
print(" >> Converting to CFR ProxyDatabase...")

try:
    unfiltered_pdb = create_pdb(df_cleaned_flip)
    print(f"SUCCESS: ProxyDatabase created with {unfiltered_pdb.nrec} records")
    print("Proxy types:", unfiltered_pdb.type_dict)
    
    # Verify seasonality is preserved
    print(f"\nSeasonality verification:")
    seasonality_preserved = 0
    for pid, record in unfiltered_pdb.records.items():
        if record.seasonality is not None:
            seasonality_preserved += 1
    
    print(f"Records with seasonality: {seasonality_preserved}/{unfiltered_pdb.nrec}")
    
    # Show a sample record
    sample_pid = list(unfiltered_pdb.records.keys())[0]
    sample_record = unfiltered_pdb.records[sample_pid]
    print(f"\nSample CFR ProxyRecord:")
    print(f"  PID: {sample_record.pid}")
    print(f"  Type: {sample_record.ptype}")
    print(f"  Seasonality: {sample_record.seasonality}")
    print(f"  Time points: {len(sample_record.time)}")
    print(f"  Location: ({sample_record.lat:.2f}, {sample_record.lon:.2f})")
    
except Exception as e:
    print(f"ERROR creating CFR ProxyDatabase: {e}")
    import traceback
    traceback.print_exc()
==================================================
 >> Converting to CFR ProxyDatabase...
Creating CFR ProxyDatabase from enhanced DataFrame...
  Processing 0/684...
  Processing 100/684...
  Processing 200/684...
  Processing 300/684...
  Processing 400/684...
  Processing 500/684...
  Processing 600/684...
Created CFR ProxyDatabase with 684 records
SUCCESS: ProxyDatabase created with 684 records
Proxy types: {'ice.d18O': 47, 'tree.MXD': 59, 'tree.TRW': 355, 'lake.other': 42, 'marine.other': 57, 'documents': 15, 'lake.reflectance': 1, 'coral.SrCa': 33, 'coral.d18O': 58, 'ice.melt': 2, 'borehole': 3, 'speleothem.d18O': 4, 'coral.calc': 2, 'sclerosponge.d18O': 5, 'hybrid': 1}

Seasonality verification:
Records with seasonality: 684/684

Sample CFR ProxyRecord:
  PID: Ant_030
  Type: ice.d18O
  Seasonality: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
  Time points: 1220
  Location: (-79.46, 247.91)

Option 1: Filter by dt <= 1#

This option is more direct, as we will use cfr’s capabilities alone to restrict the proxy database to annual and subannual records only. We will then save it as a pickle file so as to preserve the seasonality, and also as a netCDF (for future purposes).

# Option 1 with filtering by resolution (annual and subannual only)

filtered_pdb_1 = unfiltered_pdb.filter(by='dt', keys=(0.0,1.2))
len(filtered_pdb_1.records)
559
# Save as NetCDF (without seasonality preserved)

filtered_pdb_1.to_nc('./prev_data/pdb_w_seasonality_annual.nc')
>>> Warning: this is an experimental feature.
>>> Warning: this is an experimental feature.

100%|██████████| 559/559 [00:20<00:00, 27.57it/s]
>>> Data before 1 CE is dropped for records: ['Arc_032', 'SAm_025', 'Arc_025', 'Arc_002', 'Aus_041', 'Arc_031', 'Eur_017', 'Arc_021', 'NAm_880', 'Ant_031dD', 'Ant_031', 'NAm_664', 'Ant_022', 'Arc_020', 'Eur_014', 'Arc_004'].
ProxyDatabase saved to: ./prev_data/pdb_w_seasonality_annual.nc

pickle_path = './prev_data/pdb_w_seasonality_annual.pkl'
with open(pickle_path, 'wb') as f:
    pickle.dump(filtered_pdb_1, f)
# Use pickle to save proxyDatabase

print("Saving with seasonality preservation using pickle...")

print(f"✓ Saved to pickle: {pickle_path}")

# Test loading from pickle
with open(pickle_path, 'rb') as f:
    test_db_pickle = pickle.load(f)

print(f"✓ Loaded from pickle: {test_db_pickle.nrec} records")

# Verify seasonality is preserved in pickle
seasonality_pickle = {}
for pid, record in test_db_pickle.records.items():
    if record.seasonality is None:
        key = "None"
    elif record.seasonality == list(range(1, 13)):
        key = "Annual [1-12]"
    elif record.seasonality == [6, 7, 8]:
        key = "Summer [6, 7, 8]"
    elif record.seasonality == [12, 1, 2]:
        key = "Winter [12, 1, 2]"
    else:
        key = str(record.seasonality)
    
    seasonality_pickle[key] = seasonality_pickle.get(key, 0) + 1

print(f"\nSeasonality distribution in pickle-loaded database:")
for season, count in sorted(seasonality_pickle.items(), key=lambda x: x[1], reverse=True)[:10]:
    print(f"  {season}: {count}")

# Show examples
print(f"\nExample records from pickle with their seasonality:")
count = 0
for pid, record in test_db_pickle.records.items():
    if count >= 10:
        break
    if record.seasonality != list(range(1, 13)) and record.seasonality is not None:
        print(f"  {record.pid}: {record.ptype} -> {record.seasonality}")
        count += 1

# Verify counts
pickle_custom_seasonality = sum(1 for r in test_db_pickle.records.values() 
                               if r.seasonality is not None and r.seasonality != list(range(1, 13)))
pickle_none_count = sum(1 for r in test_db_pickle.records.values() if r.seasonality is None)

# Debugging checks
print(f"\nPickle verification:")
print(f"  Records with custom seasonality: {pickle_custom_seasonality}")
print(f"  Records with None seasonality: {pickle_none_count}")
print(f"  Records with annual seasonality: {test_db_pickle.nrec - pickle_custom_seasonality - pickle_none_count}")

if pickle_custom_seasonality > 0:
    print(f"SUCCESS: Seasonality preserved in pickle format!")
else:
    print(f"\n⚠️  Still having issues with seasonality preservation")

print(f"\n" + "="*60)
print(f"File: {pickle_path}")
Saving with seasonality preservation using pickle...
✓ Saved to pickle: ./prev_data/pdb_w_seasonality_annual.pkl
✓ Loaded from pickle: 559 records

Seasonality distribution in pickle-loaded database:
  Summer [6, 7, 8]: 307
  Annual [1-12]: 213
  [6, 7]: 6
  [5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4]: 4
  [9, 10, 11, 12, 1, 2, 3, 4]: 4
  Winter [12, 1, 2]: 4
  [7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6]: 2
  [7, 8, 9]: 2
  [6]: 2
  [7]: 2

Example records from pickle with their seasonality:
  NAm_568: tree.MXD -> [6, 7, 8]
  Arc_060: tree.MXD -> [6, 7, 8]
  NAm_1876: tree.TRW -> [6, 7, 8]
  NAm_1108: tree.TRW -> [6, 7, 8]
  NAm_1120: tree.MXD -> [6, 7, 8]
  Asia_102: tree.TRW -> [6, 7, 8]
  SAm_025: lake.other -> [11, 12, 1, 2]
  NAm_1396: tree.TRW -> [6, 7, 8]
  NAm_220: tree.TRW -> [6, 7, 8]
  SAm_020: tree.TRW -> [5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4]

Pickle verification:
  Records with custom seasonality: 346
  Records with None seasonality: 0
  Records with annual seasonality: 213

✅ SUCCESS: Seasonality preserved in pickle format!
   Use pickle format for now until CFR NetCDF seasonality support is fixed.

============================================================
File: ./prev_data/pdb_w_seasonality_annual.pkl

Option 2: Interpolated then filtered by annual/subannual#

This step is more involved than the last. First, we select all records whose median time step (dt) falls between roughly 1 and 5 years. We then use Pyleoclim’s pyleo.utils.tsutils.interp to interpolate those records onto an annual time step. To preserve the original time and data values, we duplicate only the rows (matched by ‘TSID’) of the records with dt < 5, so that each ends up with an annual version alongside the original. Once that is done, we re-apply the annual/subannual filter. For each interpolated record, the code prints the new number of data points to confirm that the interpolation behaves as intended.

to_interp = unfiltered_pdb.filter(by='dt', keys=(1.1, 5.1))
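
To sanity-check what filter(by='dt', keys=(1.1, 5.1)) selected, the short sketch below recomputes each record’s median time step directly from its time axis. This is an optional check, not part of the original workflow; it only assumes the records dictionary and the time attribute used elsewhere in this notebook.

# Optional check: recompute the median time step (dt) of each selected record
# and confirm it lies in the 1.1-5.1 year band targeted by the filter above.
dt_check = {pid: np.median(np.diff(np.sort(np.asarray(rec.time))))
            for pid, rec in to_interp.records.items()}

print(f"{len(dt_check)} records selected for interpolation")
for pid, dt in list(dt_check.items())[:5]:
    print(f"  {pid}: median dt = {dt:.2f} yr")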

We keep track of the TSIDs of the records being interpolated so that we can duplicate them and preserve their original, unaltered information later.
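
A minimal way to do this bookkeeping, assuming only the records dictionary of the filtered database, is to keep the keys of to_interp as a list:

# Record the TSIDs of the records selected for interpolation
interp_tsids = list(to_interp.records.keys())
print(f"{len(interp_tsids)} records flagged for interpolation, e.g. {interp_tsids[:3]}")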

# Create a list to store new interpolated rows
new_rows = []
failed_reasons = []

for pobj in to_interp.records:
    if pobj in df_cleaned_flip['TSID'].values:
        # Find the matching row
        row_idx = df_cleaned_flip[df_cleaned_flip['TSID'] == pobj].index[0]
        original_row = df_cleaned_flip.loc[row_idx].copy()
        
        # Get the data
        x = np.array(original_row['time_values'])
        y = np.array(original_row['paleoData_values'])
        
        print(f"\nTrying {pobj}:")
        print(f"  Original data: {len(x)} points, range {x.min():.1f} to {x.max():.1f}")
        
        # Clean data (remove NaNs)
        valid_mask = ~(np.isnan(x) | np.isnan(y))
        x_clean = x[valid_mask]
        y_clean = y[valid_mask]
        
        print(f"  After cleaning: {len(x_clean)} points")
        
        if len(x_clean) < 3:
            failed_reasons.append(f"{pobj}: Not enough points ({len(x_clean)})")
            continue
            
        # Check if data is sorted
        if not np.all(np.diff(x_clean) > 0):
            # Sort the data
            sort_idx = np.argsort(x_clean)
            x_clean = x_clean[sort_idx]
            y_clean = y_clean[sort_idx]
            print(f"  Data was unsorted, now sorted")
        
        try:
            # Calculate step manually to avoid pyleoclim's step calculation
            time_diffs = np.diff(x_clean)
            median_step = np.median(time_diffs)
            target_step = 1.0  # Annual
            
            print(f"  Current median step: {median_step:.3f}, target: {target_step}")
            
            # Use explicit start/stop/step
            start_time = np.ceil(x_clean.min())
            stop_time = np.floor(x_clean.max()) 
            
            print(f"  Interpolation range: {start_time} to {stop_time}")
            
            if stop_time <= start_time:
                failed_reasons.append(f"{pobj}: Invalid time range")
                continue
            
            # Try pyleoclim with explicit parameters
            x_interp, y_interp = pyleo.utils.tsutils.interp(
                x_clean, y_clean,
                interp_type='linear',
                start=start_time,
                stop=stop_time,
                step=target_step,
                bounds_error=False,
                fill_value='extrapolate'
            )
            
            print(f"  SUCCESS: {len(x_interp)} interpolated points")
            
            # Create duplicate with interpolated data
            interp_row = original_row.copy()
            interp_row['time_values'] = x_interp
            interp_row['paleoData_values'] = y_interp
            interp_row['TSID'] = str(original_row['TSID']) + '_interp'
            
            new_rows.append(interp_row)
            
        except Exception as e:
            failed_reasons.append(f"{pobj}: {str(e)}")
            print(f"  FAILED: {e}")
            continue

print(f"\n=== RESULTS ===")
print(f"Successfully interpolated: {len(new_rows)} records")
print(f"Failed: {len(failed_reasons)} records")

for reason in failed_reasons[:5]:  # Show first 5 failures
    print(f"  {reason}")
Trying O2kLR_045:
  Original data: 516 points, range 1.5 to 2001.0
  After cleaning: 516 points
  Current median step: 4.000, target: 1.0
  Interpolation range: 2.0 to 2001.0
  SUCCESS: 2000 interpolated points

Trying Ocean2kHR_194:
  Original data: 57 points, range 1708.0 to 1988.0
  After cleaning: 57 points
  Current median step: 5.000, target: 1.0
  Interpolation range: 1708.0 to 1988.0
  SUCCESS: 281 interpolated points

Trying Ocean2kHR_193:
  Original data: 57 points, range 1708.0 to 1988.0
  After cleaning: 57 points
  Current median step: 5.000, target: 1.0
  Interpolation range: 1708.0 to 1988.0
  SUCCESS: 281 interpolated points

Trying Eur_001:
  Original data: 717 points, range -90.0 to 1935.0
  After cleaning: 717 points
  Data was unsorted, now sorted
  Current median step: 2.500, target: 1.0
  Interpolation range: -90.0 to 1935.0
  SUCCESS: 2026 interpolated points

Trying Ant_009:
  Original data: 231 points, range 423.8 to 1467.2
  After cleaning: 231 points
  Current median step: 4.440, target: 1.0
  Interpolation range: 424.0 to 1467.0
  SUCCESS: 1044 interpolated points

Trying NAm_3870:
  Original data: 2678 points, range -6015.0 to 1716.0
  After cleaning: 2678 points
  Data was unsorted, now sorted
  Current median step: 2.000, target: 1.0
  Interpolation range: -6015.0 to 1716.0
  SUCCESS: 7732 interpolated points

Trying LPD335d863f:
  Original data: 404 points, range 1140.4 to 2001.3
  After cleaning: 404 points
  Current median step: 1.871, target: 1.0
  Interpolation range: 1141.0 to 2001.0
  SUCCESS: 861 interpolated points

Trying O2kLR_150:
  Original data: 82 points, range 1568.0 to 1983.0
  After cleaning: 82 points
  Current median step: 5.000, target: 1.0
  Interpolation range: 1568.0 to 1983.0
  SUCCESS: 416 interpolated points

Trying Ocean2kHR_125:
  Original data: 25 points, range 1833.0 to 1904.0
  After cleaning: 25 points
  Current median step: 3.000, target: 1.0
  Interpolation range: 1833.0 to 1904.0
  SUCCESS: 72 interpolated points

Trying LPDb7ba9166:
  Original data: 165 points, range 74.0 to 2001.0
  After cleaning: 165 points
  Data was unsorted, now sorted
  Current median step: 3.000, target: 1.0
  Interpolation range: 74.0 to 2001.0
  SUCCESS: 1928 interpolated points

Trying Arc_056:
  Original data: 397 points, range 3.0 to 1983.0
  After cleaning: 397 points
  Current median step: 5.000, target: 1.0
  Interpolation range: 3.0 to 1983.0
  SUCCESS: 1981 interpolated points

Trying Ocean2kHR_129:
  Original data: 134 points, range 1384.0 to 1992.0
  After cleaning: 113 points
  Current median step: 5.000, target: 1.0
  Interpolation range: 1384.0 to 1992.0
  SUCCESS: 609 interpolated points

Trying Ocean2kHR_128:
  Original data: 133 points, range 1389.0 to 1992.0
  After cleaning: 118 points
  Current median step: 5.000, target: 1.0
  Interpolation range: 1389.0 to 1992.0
  SUCCESS: 604 interpolated points

Trying Eur_008:
  Original data: 875 points, range -1949.0 to 2000.0
  After cleaning: 875 points
  Data was unsorted, now sorted
  Current median step: 3.000, target: 1.0
  Interpolation range: -1949.0 to 2000.0
  SUCCESS: 3950 interpolated points

Trying NAm_3867:
  Original data: 197 points, range 1120.0 to 1900.0
  After cleaning: 197 points
  Current median step: 4.000, target: 1.0
  Interpolation range: 1120.0 to 1900.0
  SUCCESS: 781 interpolated points

Trying NAm_3868:
  Original data: 197 points, range 1120.0 to 1900.0
  After cleaning: 197 points
  Current median step: 4.000, target: 1.0
  Interpolation range: 1120.0 to 1900.0
  SUCCESS: 781 interpolated points

Trying Ocean2kHR_126:
  Original data: 187 points, range 1344.0 to 1991.0
  After cleaning: 175 points
  Current median step: 3.000, target: 1.0
  Interpolation range: 1344.0 to 1991.0
  SUCCESS: 648 interpolated points

Trying Ocean2kHR_127:
  Original data: 184 points, range 1356.0 to 1991.0
  After cleaning: 170 points
  Current median step: 3.000, target: 1.0
  Interpolation range: 1356.0 to 1991.0
  SUCCESS: 636 interpolated points

Trying Arc_043:
  Original data: 395 points, range 1.0 to 1971.0
  After cleaning: 395 points
  Current median step: 5.000, target: 1.0
  Interpolation range: 1.0 to 1971.0
  SUCCESS: 1971 interpolated points

Trying LPDe51d17e8:
  Original data: 566 points, range 1221.0 to 1990.0
  After cleaning: 566 points
  Current median step: 1.480, target: 1.0
  Interpolation range: 1222.0 to 1990.0
  SUCCESS: 769 interpolated points

Trying Ant_035:
  Original data: 1160 points, range -993.4 to 1937.1
  After cleaning: 1055 points
  Current median step: 2.500, target: 1.0
  Interpolation range: -993.0 to 1937.0
  SUCCESS: 2931 interpolated points

Trying Ocean2kHR_125_iso2k:
  Original data: 25 points, range 1833.0 to 1904.0
  After cleaning: 25 points
  Current median step: 3.000, target: 1.0
  Interpolation range: 1833.0 to 1904.0
  SUCCESS: 72 interpolated points

Trying Ocean2kHR_128_iso2k:
  Original data: 133 points, range 1389.0 to 1992.0
  After cleaning: 118 points
  Current median step: 5.000, target: 1.0
  Interpolation range: 1389.0 to 1992.0
  SUCCESS: 604 interpolated points

Trying DR00NBB01_d18O:
  Original data: 25 points, range 1833.0 to 1904.0
  After cleaning: 25 points
  Current median step: 3.000, target: 1.0
  Interpolation range: 1833.0 to 1904.0
  SUCCESS: 72 interpolated points

Trying RO19PAR01_SrCa:
  Original data: 97 points, range 1445.5 to 1669.5
  After cleaning: 97 points
  Current median step: 1.500, target: 1.0
  Interpolation range: 1446.0 to 1669.0
  SUCCESS: 224 interpolated points

Trying HE02GBR01_d18O:
  Original data: 84 points, range 1568.0 to 1983.0
  After cleaning: 82 points
  Current median step: 5.000, target: 1.0
  Interpolation range: 1568.0 to 1983.0
  SUCCESS: 416 interpolated points

Trying HE02GBR01_SrCa:
  Original data: 84 points, range 1568.0 to 1983.0
  After cleaning: 82 points
  Current median step: 5.000, target: 1.0
  Interpolation range: 1568.0 to 1983.0
  SUCCESS: 416 interpolated points

=== RESULTS ===
Successfully interpolated: 27 records
Failed: 0 records

We then combine the DataFrames so that the result includes both the interpolated records and the originals. The total number of rows should equal the original total plus the number of interpolated records.

new_rows_dfs = [pd.DataFrame([row]) for row in new_rows]
df_with_interp = pd.concat([df_cleaned_flip] + new_rows_dfs, ignore_index=True)

print(f"✅ Corrected dataframe shape: {df_with_interp.shape}")
print(f"  Original records: {len(df_cleaned_flip)}")
print(f"  Interpolated records: {len(new_rows)}")
print(f"  Total: {len(df_with_interp)} (should be {len(df_cleaned_flip) + len(new_rows)})")

# Verify we have the right number of interpolated records
interp_count = sum(df_with_interp['TSID'].str.contains('_interp', na=False))
print(f"  Interpolated records in dataframe: {interp_count}")
✅ Corrected dataframe shape: (711, 22)
  Original records: 684
  Interpolated records: 27
  Total: 711 (should be 711)
  Interpolated records in dataframe: 27
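
As a quick illustration that each interpolated record coexists with its untouched original, the optional sketch below pulls one original/_interp pair out of df_with_interp and compares their lengths; it relies only on the TSID and time_values columns already present.

# Optional check: compare one original record with its '_interp' duplicate
example_tsid = new_rows[0]['TSID'].replace('_interp', '')
orig_row = df_with_interp[df_with_interp['TSID'] == example_tsid].iloc[0]
interp_row = df_with_interp[df_with_interp['TSID'] == example_tsid + '_interp'].iloc[0]

print(f"{example_tsid}: {len(orig_row['time_values'])} original points "
      f"vs {len(interp_row['time_values'])} interpolated points")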

Now we run the same process as in Option 1 to convert our new combined DataFrame into a ProxyDatabase. We will once again save the resulting ProxyDatabase as both a pickle and a netCDF file.

print(f"\n" + "="*50)
print("Converting to CFR ProxyDatabase...")

try:
    interp_pdb = create_pdb(df_with_interp)
    print(f"SUCCESS: ProxyDatabase created with {interp_pdb.nrec} records")
    print("Proxy types:", interp_pdb.type_dict)
    
    # Verify seasonality is preserved
    print(f"\nSeasonality verification:")
    seasonality_preserved = 0
    for pid, record in interp_pdb.records.items():
        if record.seasonality is not None:
            seasonality_preserved += 1
    
    print(f"Records with seasonality: {seasonality_preserved}/{interp_pdb.nrec}")
    
    # Show a sample record
    sample_pid = list(interp_pdb.records.keys())[0]
    sample_record = interp_pdb.records[sample_pid]
    print(f"\nSample CFR ProxyRecord:")
    print(f"  PID: {sample_record.pid}")
    print(f"  Type: {sample_record.ptype}")
    print(f"  Seasonality: {sample_record.seasonality}")
    print(f"  Time points: {len(sample_record.time)}")
    print(f"  Location: ({sample_record.lat:.2f}, {sample_record.lon:.2f})")
    
except Exception as e:
    print(f"ERROR creating CFR ProxyDatabase: {e}")
    import traceback
    traceback.print_exc()
==================================================
Converting to CFR ProxyDatabase...
Creating CFR ProxyDatabase from enhanced DataFrame...
  Processing 0/711...
  Processing 100/711...
  Processing 200/711...
  Processing 300/711...
  Processing 400/711...
  Processing 500/711...
  Processing 600/711...
  Processing 700/711...
Created CFR ProxyDatabase with 711 records
SUCCESS: ProxyDatabase created with 711 records
Proxy types: {'ice.d18O': 51, 'tree.MXD': 59, 'tree.TRW': 355, 'lake.other': 45, 'marine.other': 60, 'documents': 15, 'lake.reflectance': 1, 'coral.SrCa': 36, 'coral.d18O': 64, 'ice.melt': 2, 'borehole': 3, 'speleothem.d18O': 7, 'coral.calc': 2, 'sclerosponge.d18O': 10, 'hybrid': 1}

Seasonality verification:
Records with seasonality: 711/711

Sample CFR ProxyRecord:
  PID: Ant_030
  Type: ice.d18O
  Seasonality: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
  Time points: 1220
  Location: (-79.46, 247.91)
filtered_pdb_2 = interp_pdb.filter(by='dt', keys=(0,1.1))
len(filtered_pdb_2.records)
586
filtered_pdb_2.to_nc('./prev_data/pdb_w_seasonality_annual_interp.nc')
>>> Warning: this is an experimental feature.
>>> Warning: this is an experimental feature.

100%|██████████| 586/586 [00:21<00:00, 26.66it/s]
>>> Data before 1 CE is dropped for records: ['Arc_032', 'SAm_025', 'Arc_025', 'Arc_002', 'Aus_041', 'Arc_031', 'Eur_017', 'Arc_021', 'NAm_880', 'Ant_031dD', 'Ant_031', 'NAm_664', 'Ant_022', 'Arc_020', 'Eur_014', 'Arc_004', 'Eur_001_interp', 'NAm_3870_interp', 'Eur_008_interp', 'Ant_035_interp'].
ProxyDatabase saved to: ./prev_data/pdb_w_seasonality_annual_interp.nc

# Use pickle to save the ProxyDatabase
print("Saving with seasonality preservation using pickle...")

pickle_path = './prev_data/pdb_w_seasonality_annual_interp.pkl'
with open(pickle_path, 'wb') as f:
    pickle.dump(filtered_pdb_2, f)

print(f"✓ Saved to pickle: {pickle_path}")

# Test loading from pickle
with open(pickle_path, 'rb') as f:
    test_db_pickle = pickle.load(f)

print(f"✓ Loaded from pickle: {test_db_pickle.nrec} records")

# Verify seasonality is preserved in pickle
seasonality_pickle = {}
for pid, record in test_db_pickle.records.items():
    if record.seasonality is None:
        key = "None"
    elif record.seasonality == list(range(1, 13)):
        key = "Annual [1-12]"
    elif record.seasonality == [6, 7, 8]:
        key = "Summer [6, 7, 8]"
    elif record.seasonality == [12, 1, 2]:
        key = "Winter [12, 1, 2]"
    else:
        key = str(record.seasonality)
    
    seasonality_pickle[key] = seasonality_pickle.get(key, 0) + 1

print(f"\nSeasonality distribution in pickle-loaded database:")
for season, count in sorted(seasonality_pickle.items(), key=lambda x: x[1], reverse=True)[:10]:
    print(f"  {season}: {count}")

# Show examples
print(f"\nExample records from pickle with their seasonality:")
count = 0
for pid, record in test_db_pickle.records.items():
    if count >= 10:
        break
    if record.seasonality != list(range(1, 13)) and record.seasonality is not None:
        print(f"  {record.pid}: {record.ptype} -> {record.seasonality}")
        count += 1

# Verify counts
pickle_custom_seasonality = sum(1 for r in test_db_pickle.records.values() 
                               if r.seasonality is not None and r.seasonality != list(range(1, 13)))
pickle_none_count = sum(1 for r in test_db_pickle.records.values() if r.seasonality is None)

# Debugging checks
print(f"\nPickle verification:")
print(f"  Records with custom seasonality: {pickle_custom_seasonality}")
print(f"  Records with None seasonality: {pickle_none_count}")
print(f"  Records with annual seasonality: {test_db_pickle.nrec - pickle_custom_seasonality - pickle_none_count}")

if pickle_custom_seasonality > 0:
    print(f"\n✅ SUCCESS: Seasonality preserved in pickle format!")
    print(f"   Use pickle format for now until CFR NetCDF seasonality support is fixed.")
else:
    print(f"\n⚠️  Still having issues with seasonality preservation")

print(f"\n" + "="*60)
print(f"File: {pickle_path}")
Saving with seasonality preservation using pickle...
✓ Saved to pickle: ./prev_data/pdb_w_seasonality_annual_interp.pkl
✓ Loaded from pickle: 586 records

Seasonality distribution in pickle-loaded database:
  Summer [6, 7, 8]: 309
  Annual [1-12]: 226
  [6, 7]: 6
  Winter [12, 1, 2]: 6
  [8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7]: 5
  [5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4]: 4
  [9, 10, 11, 12, 1, 2, 3, 4]: 4
  [7, 8, 9]: 3
  [3, 4, 5]: 2
  [7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6]: 2

Example records from pickle with their seasonality:
  NAm_568: tree.MXD -> [6, 7, 8]
  Arc_060: tree.MXD -> [6, 7, 8]
  NAm_1876: tree.TRW -> [6, 7, 8]
  NAm_1108: tree.TRW -> [6, 7, 8]
  NAm_1120: tree.MXD -> [6, 7, 8]
  Asia_102: tree.TRW -> [6, 7, 8]
  SAm_025: lake.other -> [11, 12, 1, 2]
  NAm_1396: tree.TRW -> [6, 7, 8]
  NAm_220: tree.TRW -> [6, 7, 8]
  SAm_020: tree.TRW -> [5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4]

Pickle verification:
  Records with custom seasonality: 360
  Records with None seasonality: 0
  Records with annual seasonality: 226

✅ SUCCESS: Seasonality preserved in pickle format!
   Use pickle format for now until CFR NetCDF seasonality support is fixed.

============================================================
File: ./prev_data/pdb_w_seasonality_annual_interp.pkl
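
For downstream use, for example in a reconstruction notebook, either pickled database can simply be reloaded. A minimal sketch, assuming the file paths created above:

import pickle

# Reload the interpolated, annually filtered database saved above
with open('./prev_data/pdb_w_seasonality_annual_interp.pkl', 'rb') as f:
    pdb = pickle.load(f)

print(f"Loaded {pdb.nrec} records")
print("Proxy types:", pdb.type_dict)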