Validating PReSto2k

Validating PReSto2k#

Author#

For this notebook, we will be following a similar formula as the Validating a Reconstruction Notebook. The goal is to evaluate the skill of the “PReSto2k” reconstructions, produced by adding records from Iso2k and CoralHydro2k to the PAGES 2k 2.0.0 compilation. Our baseline for the various comparisons are to LMRv2.1, which used only PAGES 2k records, specifically those from wood, coral and glacier ice archives with annual or better resolution. Specifically, we seek to relate improvements/changes made as part of the PReSto project to this benchmark.

%load_ext autoreload
%autoreload 2

import cfr
import numpy as np
print(cfr.__version__)
import pens
import xarray as xr
import pandas as pd
import pickle

# Plotting helpers  
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import cartopy.crs as ccrs
import cartopy.feature as cfeature
import numpy as np
from matplotlib.cm import ScalarMappable
from matplotlib.colors import Normalize
from typing import List, Dict

2025.7.28

Load the Reconstructions#

Within PReSto2k, we have three separate reconstructions:

PReSto2k with class-based seasonality
PReSto2k with metadata-based seasonality
PReSto2k with metadata-based seasonality and interpolation

We will compare them against the base reconstruction of LMRv2.1 Tardif et al. 2019.

# Load class-based

res = cfr.ReconRes('./recons/lmr_reproduce_lipd/')
res.load(['tas', 'tas_gm'])

# Load meta-based

res1 = cfr.ReconRes('./recons/presto2k_season/')
res1.load(['tas', 'tas_gm'])

# Load meta-based with interpolation

res2 = cfr.ReconRes('./recons/presto2k_season_interp/')
res2.load(['tas', 'tas_gm'])

res_ts = res.recons['tas_gm']
res1_ts = res1.recons['tas_gm']
res2_ts = res2.recons['tas_gm']

# Load LMRv2.1 (offline DA) 

lmr_off = xr.open_dataset('./prev_data/gmt_MCruns_ensemble_full_LMRv2.1.nc')
ens_values = lmr_off.gmt.stack(ensemble=['MCrun', 'members']).values  # Shape: (time, MCrun*members)
time_values = lmr_off.time.values
years = np.array([t.year for t in time_values])
lmr_ens = cfr.EnsTS(time=years, value=ens_values, value_name='GMSTa (°C)')

GMST plots#

We first plot the GMST ensembles for all four reconstructions. This is mainly to see if we can identify any striking differences and check if the variability appears similar or different within each of the reconstructions. Another thing we can check just with our eyes is to see if there is any variation in how each of the reconstructions captured extreme climate events.

fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, figsize=(16, 12))

lmr_ens.plot_qs(ax=ax1, ylim=[-1.2, 0.8])
ax1.set_title('LMRv2.1 (Offline w/ class based seasonality)')

res_ts.plot_qs(ax=ax2, ylim=[-1.2, 0.8])
ax2.set_ylabel('GMSTa (°C)')
ax2.set_title('PReSto2k w/ class-based seasonality')

res1_ts.plot_qs(ax=ax3, ylim=[-1.2, 0.8])
ax3.set_ylabel('GMSTa (°C)')
ax3.set_title('PReSto2k w/ meta-seasonality')

res2_ts.plot_qs(ax=ax4, ylim=[-1.2, 0.8]) 
ax4.set_ylabel('GMSTa (°C)')
ax4.set_title('PReSto2k w/ meta-seasonality + interpolation')

plt.tight_layout()
plt.show()

../../_images/84393cbf10338f7b5f65b495b3c41ce92beed7242a8a8a06a503fed7cb79ee1e.png

Upon first glance, all four reconstructions look very similar. Though you can notice that in LMRv2.1 and PReSto2k with class-based seasonality, there are more significant dips in temperature around 550 CE. In contrast, in the metadata-based reconstructions, there appears to be greater cooling around 1800 CE, which is not captured by the first two. We can examine this segment further to see if it coincides with the Tambora Eruption (1815).

fig, ax = plt.subplots(figsize=(10,4))
res2_ts.plot_qs(ax=ax, xlim=(1750, 1850), title='PReSto2k meta season w/ interp (1750–1850)')
ax.grid(True, alpha=0.3)
ax.set_ylabel('GMSTa (°C)')

Text(0, 0.5, 'Temperature Anomaly (°C)')

../../_images/93380139dfd0c6dadc992b8346733e4e6721666437ea35fa71c1b9af5dfc70f3.png

median_ts = res2_ts.median        
imin = np.nanargmin(median_ts)
t_min_median = res2_ts.time[imin]
v_min_median = median_ts[imin]
print(f"Minimum: {v_min_median:.3f} at year {t_min_median:.1f}")

Minimum: -0.725 at year 1815.0

# Looking at the spatial patterns of this minimum year

res1.recons['tas']['1815'].plot(
        levels=np.linspace(-5, 5, 21),      
        cbar_labels=np.linspace(-5, 5, 11),  
)

(<Figure size 1000x800 with 2 Axes>, <GeoAxes: title={'center': 'tas, 1815'}>)

../../_images/6e096603655dbf80153e4271dc66cbdde087b85a09a00b2d61eef6f93e2ed7dd.png

Difference Plots#

Since we cannot tell (accurately) how similar or different reconstructions are at a glace, a difference plot of their ensemble mean can be a good way to identify specific instances where one reconstruction captures greater warming/cooling over another. Within this, instead of comparing each reconstruction to another, I thought it would be important to reduce it down to three main comparisons.

PReSto2k class-based vs. LMRv2.1
PReSto2k meta-based vs. LMRv2.1
PReSto2k meta-based vs. class-based

For comparison purposes, both metadata based seasonality approaches are treated as mostly the same. In a subsequent section (instrumental validation), we will see that the added interpolation tends to improve the reconstruction ever so slightly, but that can be added to these validation methods later on if we see fit.

# Compute the ensemble means

lmr_ens_mean = np.mean(lmr_ens.value, axis=1)
res_ts_mean = np.mean(res_ts.value, axis=1)
res1_ts_mean = np.mean(res1_ts.value, axis=1)

# Use the minimum length for everything
min_len = min(len(lmr_ens_mean), len(res_ts_mean))
difference = res_ts_mean[:min_len] - lmr_ens_mean[:min_len] 
difference2 = res1_ts_mean[:min_len] - lmr_ens_mean[:min_len]
difference3 = res1_ts_mean[:min_len] - res_ts_mean[:min_len] 

# Find common y-axis limits
y_min = min(np.min(difference), np.min(difference2)) * 1.1
y_max = max(np.max(difference), np.max(difference2)) * 1.1

fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(14, 8), sharex=True, sharey=True)

# Top plot: Class-based difference
ax1.plot(lmr_ens.time[:min_len], difference, color='darkred', linewidth=1, alpha=0.8)
ax1.axhline(y=0, color='k', linestyle='--', alpha=0.5)
ax1.set_ylabel('Difference (°C)')
ax1.set_title('LMRv2.1 - PReSto2k (Class-based)')
ax1.grid(True, alpha=0.3)
ax1.set_ylim(y_min, y_max)

# Bottom plot: Meta-based difference
ax2.plot(lmr_ens.time[:min_len], difference2, color='blue', linewidth=1, alpha=0.8)
ax2.axhline(y=0, color='k', linestyle='--', alpha=0.5)
ax2.set_ylabel('Difference (°C)')
ax2.set_xlabel('Time')
ax2.set_title('LMRv2.1 - PReSto2k (Meta-based)')
ax2.grid(True, alpha=0.3)
ax2.set_ylim(y_min, y_max)

# Bottom plot: Meta-based difference
ax3.plot(lmr_ens.time[:min_len], difference3, color='green', linewidth=1, alpha=0.8)
ax3.axhline(y=0, color='k', linestyle='--', alpha=0.5)
ax3.set_ylabel('Difference (°C)')
ax3.set_xlabel('Time')
ax3.set_title('PReSto2k (class-based) - PReSto2k (Meta-based)')
ax3.grid(True, alpha=0.3)
ax3.set_ylim(y_min, y_max)

plt.suptitle('Difference Between Ensemble Means', fontsize=20, y=0.95, fontweight='bold')
plt.tight_layout()
plt.show()

../../_images/93782f5f47c85cec306c833aa9a85751c59864d90186774510e31c11423915b8.png

Using these difference plots, it is immediately more noticeable that there are some spots (notably around 850 CE) where LMRv2.1 does not capture as great a cooling as either PReSto 2k reconstruction. It is also apparent that the differences between class-based and expert seasonality appear to be much smaller than compared to LMRv2.1. It is also clear that class-based seasonality predicts cooler overall temperature anomalies than compilation metadata (expert-based) before 1000 CE.

Pens#

As done in Validating a Reconstruction, we will use the plume framework introduced by Emile-Geay et al. (2025) using the pens package. To summarize again, this framework compares collections of climate trajectories (plumes) by evaluating:

Intra-ensemble distances, which describe the spread of trajectories within an ensemble.
Inter-ensemble distances, which measure differences between two ensembles.
The plume distance, a summary metric (°C) that captures the degree of similarity (or dissimilarity) between two plumes while accounting for internal variability.

We compute these distances using the GMST ensembles from the original LMRv2.1 and our PReSto2k reconstructions. All are converted into pens.EnsembleTS objects for ease of comparison.

Note: The closer the plume distance is to zero, the more similar the ensembles are.

import pens
import seaborn as sns

plt.style.use('default')
pens.set_style()

# Function for ease of plotting multiple comparisons

def plot_pens(ts1, ts1_name, ts2, ts2_name):
    # Convert cfr EnsTS to pens EnsTS
    ts1_pens = pens.EnsembleTS(time=ts1.time, value=ts1.value)
    ts1_pens.label = 'Original LMR'
    ts1_pens.time_unit = 'years'
    ts1_pens.value_name = 'GMST'
    ts1_pens.value_unit = '\N{DEGREE SIGN}C'

    ts2_pens = pens.EnsembleTS(time=ts2.time, value=ts2.value)
    ts2_pens.label = 'PReSto2k-offlineDA'
    ts2_pens.time_unit = 'years'
    ts2_pens.value_name = 'GMST'
    ts2_pens.value_unit = '\N{DEGREE SIGN}C'

    # Align time dimension for both EnsTS

    ts1_time = ts1.time
    ts2_time = ts2.time.values

    common_start = max(ts1_time.min(), ts2_time.min())
    common_end = min(ts1_time.max(), ts2_time.max())

    # Create time range array
    timespan = np.array([common_start, common_end])

    # Slice to common period
    ts1_pens_aligned = ts1_pens.slice(timespan)
    ts2_pens_aligned = ts2_pens.slice(timespan)

    # Calculate stats 
    ts1_intra = ts1_pens_aligned.distance()
    ts2_intra = ts2_pens_aligned.distance()

    # Calculate inter-ensemble distance 
    inter_dist = ts1_pens_aligned.distance(ts2_pens_aligned.value)
        
    # Calculate plume distance 
    plume_dist = ts1_pens_aligned.plume_distance(ts2_pens_aligned.value, max_dist=1.0)

    print("\nDistances between ensembles:")
    print(f"Original intra-ensemble distance : {ts1_intra},\ len={len(ts1_intra)}", )
    print(f"Reproduced intra-ensemble distance: {ts2_intra},\ len={len(ts2_intra)}")
    print(f"Inter-ensemble distance: {inter_dist}")
    print(f"Plume distance: {plume_dist}")


    # Create figure and plot
    fig, ax = plt.subplots(figsize=(10, 6))

    # Plot KDE for individual ensembles with explicit labels
    sns.kdeplot(data=ts1_intra, fill=False, ax=ax, common_norm=False, label=ts1_name)
    sns.kdeplot(data=ts2_intra, fill=False, ax=ax, common_norm=False, label=ts2_name)

    # Add inter-ensemble distribution
    sns.kdeplot(data=inter_dist, fill=True, ax=ax, common_norm=False, color='silver', 
                label='inter-ensemble')

    # Add plume distance line
    ax.axvline(x=plume_dist, color="black", linestyle="--", label='plume distance')

    # Axis + ticks
    ax.set_xlim(0, 1)
    ax.set_xticks(np.arange(0, 1.05, 0.05))
    ax.minorticks_on()
    ax.grid(True, which='both', alpha=0.25, linestyle='--', linewidth=0.6)


    # Add labels
    ax.set_xlabel('Distance')
    ax.set_ylabel('Density')
    ax.set_title('Distance Distributions')
    ax.legend()

    plt.tight_layout()
    plt.show()

plot_pens(lmr_ens, 'LMRv2.1 Offline', res1_ts, 'PReSto2k meta seasonality (no interp)')

Computing intra-ensemble distance among possible pairs: 100%|██████████| 1999000/1999000 [00:42<00:00, 47494.24it/s]
Computing intra-ensemble distance among possible pairs: 100%|██████████| 12497500/12497500 [03:58<00:00, 52481.32it/s]
Computing inter-ensemble distance: 100%|██████████| 2000/2000 [03:30<00:00,  9.51it/s]
Computing inter-ensemble distance: 100%|██████████| 2000/2000 [03:29<00:00,  9.53it/s]

Distances between ensembles:
Original intra-ensemble distance : [0.40367679 0.21343417 0.04743079 ... 0.26221506 0.05118074 0.29547632],\ len=1999000
Reproduced intra-ensemble distance: [1.0092546  0.17004128 0.33781567 ... 0.19607017 0.12532675 0.07074343],\ len=12497500
Inter-ensemble distance: [0.31925129 0.6902534  0.15575017 ... 0.21786565 0.41357816 0.34283473]
Plume distance: 0.15510274440047797

../../_images/1e04c12351684e5f237bef68023bf01b00d131f6e03952169c2c67ea9035ee86.png

plot_pens(lmr_ens, 'LMRv2.1 Offline', res_ts, 'PReSto2k class-based')

Computing intra-ensemble distance among possible pairs: 100%|██████████| 1999000/1999000 [00:42<00:00, 46933.43it/s]
Computing intra-ensemble distance among possible pairs: 100%|██████████| 12497500/12497500 [04:01<00:00, 51812.75it/s]
Computing inter-ensemble distance: 100%|██████████| 2000/2000 [03:32<00:00,  9.40it/s]
Computing inter-ensemble distance: 100%|██████████| 2000/2000 [03:31<00:00,  9.44it/s]

Distances between ensembles:
Original intra-ensemble distance : [0.40367679 0.21343417 0.04743079 ... 0.26221506 0.05118074 0.29547632],\ len=1999000
Reproduced intra-ensemble distance: [0.91594323 0.13586752 0.2780244  ... 0.2092932  0.15233736 0.05700134],\ len=12497500
Inter-ensemble distance: [0.26008597 0.65616499 0.13131738 ... 0.2464135  0.45569422 0.39873838]
Plume distance: 0.15399506737029178

../../_images/fc454298ef20345699fd8d1742f96f6deb017f35f7bfce0eb2b8b8a7f71000e5.png

plot_pens(res_ts, 'PReSto2k class based', res2_ts, 'PReSto2k meta with interp')

Computing intra-ensemble distance among possible pairs: 100%|██████████| 12497500/12497500 [04:00<00:00, 51859.78it/s]
Computing intra-ensemble distance among possible pairs: 100%|██████████| 12497500/12497500 [04:00<00:00, 51873.87it/s]
Computing inter-ensemble distance: 100%|██████████| 5000/5000 [07:57<00:00, 10.47it/s]
Computing inter-ensemble distance: 100%|██████████| 5000/5000 [07:54<00:00, 10.55it/s]

Distances between ensembles:
Original intra-ensemble distance : [0.91594323 0.13586752 0.2780244  ... 0.2092932  0.15233736 0.05700134],\ len=12497500
Reproduced intra-ensemble distance: [1.00691248 0.16223658 0.33689253 ... 0.19224536 0.12171932 0.07052603],\ len=12497500
Inter-ensemble distance: [0.08145034 0.95276477 0.11608722 ... 0.15290283 0.07001891 0.0692549 ]
Plume distance: 0.14609708156212153

../../_images/02df08b1d2e7f0dbc218d490ca055bd49a1b731b6019150a2110f2c081e0ce86.png

As we see here, all three of the comparisons have plume distances ~0.15 °C, which is not negligible but still within range of the inter-ensemble spread. The smallest distance is between the PReSto2k metadata and class-based versions, which we also noted earlier in the difference plots as being most similar. Importantly, all plume distances fall within the spread of intra-ensemble variability and closely track the original distribution, indicating that the reconstructions are statistically compatible and not significantly different from each other.

Instrumental Validation#

While comparing against other reconstructions is important, we can only establish skill by comparising against instrumental (and independent) data. We will validate our PReSto2k reconstructions’ GMST against the original LMRv2.1 (offline) and some instrumental datasets (HadCRUT5, BEST, GISTEMP). These datasets were used to validate the original LMRv2.1 back in 2019; here we will be using the most updated versions of the datasets.

cfr’s EnsTS.compare() will take the ensemble time series and the plot the median and the spread of our original dataset and plot it against the target, or validation, dataset.

The following function uses built-in cfr functionality to compare two GMST time series and returns the comparison data array as well as the validation statistics (\(r\), \(CE\)).

To save space and for the reason of not making this step convoluted, we have opted out of making comparison plots for all the validations (3 instrumental data sets x 4 reconstructions… a lot). Instead, the validation statistics from this function will be output in a table at the end of the section.

def plot_ensts_compare(target_ts, target_name, ts, ts_name, timespan = (1880,2000)):
    compared = ts.compare(
    target_ts, 
    ref_name=target_name , 
    timespan= timespan
    )

    corr = compared.valid_stats['corr']
    CE = compared.valid_stats['CE']

    #fig, ax = compared.plot_qs(figsize=[12, 4], xlim=[1880,2000], title= f'{ts_name} v {target_name}')

    return compared, corr, CE

GMST comparisons#

HADCRUT5#

Same process as before, but this time we use HadCRUT5. LMRv2.1 used HADCRUT4 for their validations, but we are using the most updated dataset, since it exists. This time we load the dataset from the cloud, then convert the dataframe to a cfr.EnsTS object.

url = 'https://www.metoffice.gov.uk/hadobs/hadcrut5/data/HadCRUT.5.0.2.0/analysis/diagnostics/HadCRUT.5.0.2.0.analysis.summary_series.global.annual.csv'
df = pd.read_csv(url)
df.head()

	Time	Anomaly (deg C)	Lower confidence limit (2.5%)	Upper confidence limit (97.5%)
0	1850	-0.417711	-0.589256	-0.246166
1	1851	-0.233350	-0.411868	-0.054832
2	1852	-0.229399	-0.409382	-0.049416
3	1853	-0.270354	-0.430009	-0.110700
4	1854	-0.291521	-0.432712	-0.150330

had_ts_annual = cfr.EnsTS().from_df(df=df,time_column='Time', value_columns='Anomaly (deg C)')

had_res, had_rescor, had_resce = plot_ensts_compare(had_ts_annual, 'HADCRUT5', res_ts, 'PReSto2k class-based')
had_res1, had_res1cor, had_res1ce = plot_ensts_compare(had_ts_annual, 'HADCRUT5', res1_ts, 'PReSto2k meta season')
had_res2, had_res2cor, had_res2ce = plot_ensts_compare(had_ts_annual, 'HADCRUT5', res2_ts, 'PReSto2k meta season w/ interp')
had_lmr, had_lmrcor, had_lmrce = plot_ensts_compare(had_ts_annual, 'HADCRUT5', lmr_ens, 'LMRv2.1')

Consensus#

In addition to the individual validation analyses, we can now plot them all together along with a ‘consensus’ time series, which is the mean of all the instrumental datasets.

# combine all datasets to calculate mean (class based)
all_refs = np.array([
   had_res.ref_value[:121],
   gis_res.ref_value[:121],
   best_res.ref_value[:121]
])
mean_ref = np.mean(all_refs, axis=0)

mean_ts_annual = cfr.EnsTS(
    time=best_annual.year.values[30:-14], #manually slice dates
    value=mean_ref,
    value_name='Temperature Anomaly'
)

mean_ts_compared = res_ts.compare(
    mean_ts_annual,
    ref_name='Consensus',
    timespan=(1880, 2000)
)

# combine all datasets to calculate mean (no interp)
all_refs1 = np.array([
   had_res1.ref_value[:121],
   gis_res1.ref_value[:121],
   best_res1.ref_value[:121]
])
mean_ref1 = np.mean(all_refs1, axis=0)

mean_ts1_annual = cfr.EnsTS(
    time=best_annual.year.values[30:-14], #manually slice dates
    value=mean_ref1,
    value_name='Temperature Anomaly'
)

mean_ts1_compared = res1_ts.compare(
    mean_ts1_annual,
    ref_name='Consensus',
    timespan=(1880, 2000)
)

# combine all datasets to calculate mean (with interp)
all_refs2 = np.array([
   had_res2.ref_value[:121],
   gis_res2.ref_value[:121],
   best_res2.ref_value[:121]
])
mean_ref2 = np.mean(all_refs2, axis=0)

mean_ts2_annual = cfr.EnsTS(
    time=best_annual.year.values[30:-14], #manually slice dates
    value=mean_ref2,
    value_name='Temperature Anomaly'
)

mean_ts2_compared = res2_ts.compare(
    mean_ts2_annual,
    ref_name='Consensus',
    timespan=(1880, 2000)
)

# combine all datasets to calculate mean (LMR)
all_refsl = np.array([
   had_lmr.ref_value[:121],
   gis_lmr.ref_value[:121],
   best_lmr.ref_value[:121]
])
mean_refl = np.mean(all_refsl, axis=0)

mean_tsl_annual = cfr.EnsTS(
    time=best_annual.year.values[30:-14], #manually slice dates
    value=mean_refl,
    value_name='Temperature Anomaly'
)

mean_tsl_compared = lmr_ens.compare(
    mean_tsl_annual,
    ref_name='Consensus',
    timespan=(1880, 2000)
)

results_table_data = {
    'HadCRUT5': [had_rescor, had_res1cor, had_res2cor, had_lmrcor,
                 had_resce, had_res1ce, had_res2ce, had_lmrce],
    'GISTEMP': [gis_rescor, gis_res1cor, gis_res2cor, gis_lmrcor,
                gis_resce, gis_res1ce, gis_res2ce, gis_lmrce],
    'Berkeley Earth': [best_rescor, best_res1cor, best_res2cor, best_lmrcor,
                       best_resce, best_res1ce, best_res2ce, best_lmrce],
    'Consensus': [mean_ts_compared.valid_stats['corr'], mean_ts1_compared.valid_stats['corr'], mean_ts2_compared.valid_stats['corr'], mean_tsl_compared.valid_stats['corr'],
                  mean_ts_compared.valid_stats['CE'], mean_ts1_compared.valid_stats['CE'], mean_ts2_compared.valid_stats['CE'], mean_tsl_compared.valid_stats['CE']]
}

# Create multi-level index
index_tuples = [
    ('PReSto2k class-based', 'R'),
    ('PReSto2k meta season', 'R'), 
    ('PReSto2k meta season w/ interp', 'R'),
    ('LMRv2.1', 'R'),
    ('PReSto2k class-based', 'CE'),
    ('PReSto2k meta season', 'CE'),
    ('PReSto2k meta season w/ interp', 'CE'),
    ('LMRv2.1', 'CE')
]

multi_index = pd.MultiIndex.from_tuples(index_tuples, names=['Reconstruction', 'Metric'])

results_df = pd.DataFrame(results_table_data, index=multi_index)

print("COMPLETE RESULTS TABLE:")
print("=" * 60)
print(results_df.round(3))

COMPLETE RESULTS TABLE:
============================================================
                                       HadCRUT5  GISTEMP  Berkeley Earth  \
Reconstruction                 Metric                                      
PReSto2k class-based           R          0.928    0.919           0.917   
PReSto2k meta season           R          0.874    0.874           0.857   
PReSto2k meta season w/ interp R          0.891    0.889           0.877   
LMRv2.1                        R          0.933    0.932           0.921   
PReSto2k class-based           CE         0.608    0.827           0.744   
PReSto2k meta season           CE         0.612    0.765           0.698   
PReSto2k meta season w/ interp CE         0.659    0.787           0.739   
LMRv2.1                        CE         0.565    0.834           0.719   

                                       Consensus  
Reconstruction                 Metric             
PReSto2k class-based           R           0.927  
PReSto2k meta season           R           0.874  
PReSto2k meta season w/ interp R           0.891  
LMRv2.1                        R           0.934  
PReSto2k class-based           CE          0.757  
PReSto2k meta season           CE          0.724  
PReSto2k meta season w/ interp CE          0.762  
LMRv2.1                        CE          0.735  

Takeaways from the Results Table#

Correlations (R)

LMRv2.1 is still the strongest overall, clearly better than all PReSto2k variants.
PReSto2k class-based comes very close to LMR (~ within 0.01–0.02 in most cases).
PReSto2k meta-seasonality lags somewhat, while meta w/ interpolation is marginally better.

Coefficient of Efficiency (CE)

PReSto2k class-based outperform LMRv2.1 in all cases.
Meta seasonality is weakest, but interpolation boosts it, and ends up being the best in terms of the consensus.
LMRv2.1 is middle of the road, but usually a bit lower than PReSto2k class-based or interpolated meta-based.

Overall

LMRv2.1 remains the standard in terms of correlation with observational data.
PReSto2k class-based nearly matches LMR’s correlations while showing better CE scores, which could imply improved predictive skill.
Meta seasonality with interpolation can match or exceed class-based in CE, but both meta-based reconstructions are still worse in correlation.

So, PReSto2k reconstructions can safely rival (and sometimes surpass) LMRv2.1 in skill, depending on the metric, especially when class-based or meta-seasonality with interpolation are used.

NOTE: These instrumental validations are only done from 1880-2000, which means its still missing a very large portion of the picture. Namely 1-1880, which misses the medieval climate anomaly, Little Ice Age, and many large volcanic events. There are also probably biases in validation due to the more recent proxy/instrumental overlaps, where data quality and density is higher.

Validation on Climate Fields#

Load in both BEST climate field and the original LMR air (2m air temperature) climate field as cfr.ClimateField objects and modify the format so that we can more easily compare it with our validation datasets using cfr. In this step, we will plot both the correlation (corr) and the coefficient of efficiency (CE) of 2m air temperature against the Berkeley Earth Surface Temperature (BEST) climate field, since that was what Tardif et al. (2019) used to validate their field reconstructions.

Load ProxyDatabase for Overlay#

class_og = cfr.ProxyDatabase().load_nc('./prev_data/lipd_pdb_pids.nc')
class_og.refresh()

with open('./prev_data/pdb_w_seasonality_annual.pkl', 'rb') as f:
    meta_og = pickle.load(f)

BEST#

ds = xr.open_dataset('./analyses/BerkeleyEarth/Land_and_Ocean_LatLong1.nc')

# Convert decimal years to datetime
years = ds.time.values
dates = pd.to_datetime([f"{int(year)}-{int((year % 1) * 12 + 1):02d}-15" for year in years])

# Create a new DataArray with proper coordinates
da = xr.DataArray(
    ds.temperature.values,
    coords={
        'time': dates,
        'lat': ds.latitude.values,
        'lon': ds.longitude.values
    },
    dims=['time', 'lat', 'lon'],
    name='temperature'
)

# 3. create ClimateField object
target_be = cfr.ClimateField(da).get_anom(ref_period=[1951, 1980])
target_be = target_be.annualize(months=list(range(1,13)))
target_be.da

# Convert target data to 0-360 format first
target_wrapped = target_be.wrap_lon(mode='360')

lmr_tas = cfr.ClimateField().load_nc(
    path='./prev_data/air_MCruns_ensemble_mean_LMRv2.1.nc',
    vn='air', 
    time_name='time',  
    lat_name='lat', 
    lon_name='lon'
)

target_lmr = lmr_tas.get_anom(ref_period=[1951, 1980])
filtered_da = target_lmr.da.sel(time=target_lmr.da.time.dt.year != 0)
target_lmr = cfr.ClimateField(da=filtered_da)
target_lmr = target_lmr.annualize(months=list(range(1,13)))

target_lmr_mean = cfr.ClimateField(da=target_lmr.da.mean(dim='MCrun'))

# Function to get the comparison statistics from a climate field compare 

def get_cf_stat(target_cf, target_name, cf, cf_name, stat, ax=None):
    valid_fdb = cf.compare(
        target_cf,
        stat=stat,
        timespan=(1880, 2000),
        interp_target = 'ref',
    )

    return valid_fdb

lmrvbest = get_cf_stat(target_wrapped, 'BEST', target_lmr_mean, 'LMRv2.1', stat='corr')
p2kclassvbest = get_cf_stat(target_wrapped, 'BEST', res.recons['tas'], 'PReSto2k class based', stat='corr')

diff = p2kclassvbest - lmrvbest

# compute symmetric color scale
vmax = np.nanmax(np.abs(diff.da.values))
clim = (-vmax, vmax)

diff.plot(
    title=f'PReSto2k class - LMRv2.1 (v. BEST) corr(recon, obs), mean={diff.geo_mean().value[0,0]:.2f}',
    plot_cbar=True,
    cmap='RdBu_r',
    plot_proxydb=True,
    proxydb=class_og,
    clim=clim   # enforce centered colormap
)

plt.tight_layout()
plt.show()

../../_images/f1b227fd0baf4e4941ae6a260d65365e2389ed5bfe64fda8a4ca6b8b022394ee.png

KEY: Red – PReSto2k class better than LMRv2.1, Blue – LMRv2.1 better than PReSto2k class

Here we see that the correlation between LMRv2.1 v Best and PReSto2k v BEST averages out at about 0.0. The stark difference here is that it appears PReSto2k class-based performs much better over Scandinavia and slightly worse in Antarctica and a large swath of the South Atlantic.

p2kclassvbest = get_cf_stat(target_wrapped, 'BEST', res.recons['tas'], 'PReSto2k class based', stat='corr')
p2kmetaintvbest = get_cf_stat(target_wrapped, 'BEST', res2.recons['tas'], 'PReSto2k meta-season w/ interp', stat='corr')

diff2 = p2kmetaintvbest - p2kclassvbest 

# compute symmetric color scale
vmax = np.nanmax(np.abs(diff2.da.values))
clim = (-vmax, vmax)

diff2.plot(
    title=f'PReSto2k meta - class (v. BEST) corr(recon, obs), mean={diff.geo_mean().value[0,0]:.2f}',
    plot_cbar=True,
    cmap='RdBu_r',
    plot_proxydb=True,
    proxydb=meta_og,
    clim=clim   # enforce centered colormap
)

plt.tight_layout()
plt.show()

../../_images/a6bd380f3e6d8bb9cf93cacc54ae1848f475a6098d2804d637a7603c10b3d155.png

KEY: Red – meta better than class, Blue – class better than meta

While global average skill is similar, class-based seasonality outperforms metadata-based seasonality across much of the Indian, Pacific and Southern Hemisphere oceans, whereas expert-assessed seasonality shows slight advantages in regions like the North Atlantic, the Pacific Cold Tongue, and Antarctica. Overall this analysis suggests that little is gained by using expert seasonality, as far as PReSto2k reconstructions go.

lmrvbest = get_cf_stat(target_wrapped, 'BEST', target_lmr_mean, 'LMRv2.1', stat='CE')
p2kclassvbest = get_cf_stat(target_wrapped, 'BEST', res.recons['tas'], 'PReSto2k class based', stat='CE')

diff = p2kclassvbest - lmrvbest 

# compute symmetric color scale
vmax = np.nanmax(np.abs(diff.da.values))
clim = (-vmax, vmax)

diff.plot(
    title=f'PReSto2k class - LMRv2.1 (v. BEST) CE(recon, obs), mean={diff.geo_mean().value[0,0]:.2f}',
    plot_cbar=True,
    cmap='RdBu_r',
    plot_proxydb=True,
    proxydb=class_og,
    clim=clim   # enforce centered colormap
)

plt.tight_layout()
plt.show()

../../_images/13d97f249a5da2b44c002ee47f59766cd1daaec5a91201ed9cbae82ccb468a3e.png

KEY: Red – class better than LMRv2.1, Blue – LMRv2.1 better than class

Aside from masked-out gray regions (an upcoming fix), PReSto2k class-based and LMRv2.1 have nearly identical CE skill overall, with minor regional differences (class-based doing slightly better over Scandinavia, while LMR has very small advantages in scattered locations along the North Atlantic and Austrialia).

p2kclassvbest = get_cf_stat(target_wrapped, 'BEST', res.recons['tas'], 'PReSto2k class based', stat='CE')
p2kmetaintvbest = get_cf_stat(target_wrapped, 'BEST', res2.recons['tas'], 'PReSto2k meta-season w/ interp', stat='CE')

diff2 =  p2kmetaintvbest - p2kclassvbest

vmax = np.nanpercentile(np.abs(diff2.da.values), 98)
clim = (-vmax, vmax)

# clip just for plotting
vals = diff2.da.values.copy()
vals = np.clip(vals, -vmax, vmax)

diff2_clipped = diff2.copy()
diff2_clipped.da.values = vals

diff2_clipped.plot(
    title=f'PReSto2k meta - class (v. BEST) CE(recon, obs), mean={diff2.geo_mean().value[0,0]:.2f}',
    plot_cbar=True,
    cmap='RdBu_r',
    plot_proxydb=True,
    proxydb=meta_og,
    clim=clim
)

plt.tight_layout()
plt.show()

../../_images/dbbf3feebfc12b1731606ac66d74c9aeb47e2578673297b44a386bab0f3c3b5b.png

KEY: Red – meta better than class, Blue – class better than meta

It is apparent for this one that class-based seasonality performs significantly better over southeast asia and parts of the North Atlantic (Caribbean region). Everywhere else is pretty even in terms of skill.

PSM Calibration Statistics#

One way to understand these results is to examine the calibration scores of the proxy system models (PSMs). Here we compare two seasonality approaches used by our PReSto2k reconstructions:

Class-based: seasonalities are broadly defined for each proxy class.
Metadata-based: each proxy record uses an expert-designated seasonality defined in its metadata.

These choices directly affect how the PSMs are calibrated within the data assimilation workflow (see: Run Data Assimilation with Individual Seasonality).

A well-calibrated proxy record is characterized by its PSM successfully explaining a significant portion of observed climate variability during the calibration period, typically reflected in a strong fit metric like R^2. This aligns with how PSMs are used in paleoclimate DA, where calibration makes sure that the proxy accurately reflects the climate signal and drives confidence in reconstruction results.

In this section, we compile calibration statistics and examine the spatial distribution of records with high R^2 values. This provides insight into the regional skill of the climate field reconstruction, beyond what we can infer from just GMST and instrumental validation.

Load Proxy Databases#

These proxy databases should have the post-calibration statsistics saved in their metadata for easier validation.

with open('./prev_data/calibrated/calibrated_lipd_pdb.pkl', 'rb') as f:
    pdb_class_new = pickle.load(f)

job_class_new = cfr.ReconJob()
job_class_new.proxydb = pdb_class_new

with open('./prev_data/calibrated/calibrated_pdb_w_seasonality_annual.pkl', 'rb') as f:
    pdb_meta = pickle.load(f)

job_meta = cfr.ReconJob()
job_meta.proxydb = pdb_meta

with open('./prev_data/calibrated/calibrated_pdb_w_seasonality_annual_interp.pkl', 'rb') as f:
    pdb_meta_interp = pickle.load(f)

job_meta_interp = cfr.ReconJob()
job_meta_interp.proxydb = pdb_meta_interp

# Extract calibration statistics from all three databases
def extract_calib_stats(job, approach_name):
    """Extract PSM calibration statistics from a ReconJob"""
    calib_stats = []
    
    print(f"Collecting PSM calibration statistics for {approach_name}...")
    for pid, record in job.proxydb.records.items():
        if (hasattr(record, 'psm') and record.psm is not None and 
            hasattr(record.psm, 'calib_details') and record.psm.calib_details is not None):
            
            details = record.psm.calib_details
            calib_stats.append({
                'pid': pid,
                'name': record.tags.get('name', 'Unknown') if isinstance(record.tags, dict) else pid,
                'r2_adj': details['fitR2adj'],
                'nobs': details['nobs'],
                'mse': details['PSMmse'],
                'snr': details['SNR'],
                'seasonality': details['seasonality'],
                'approach': approach_name
            })
    
    print(f"Found {len(calib_stats)} calibrated PSMs for {approach_name}")
    return calib_stats

# Extract stats from all three approaches
stats_class = extract_calib_stats(job_class_new, "Class-based seasonality")
stats_meta = extract_calib_stats(job_meta, "Metadata seasonality") 
stats_meta_interp = extract_calib_stats(job_meta_interp, "Metadata + interpolation")

# Combine all stats
all_stats =  stats_class + stats_meta + stats_meta_interp

Collecting PSM calibration statistics for Class-based seasonality...
Found 555 calibrated PSMs for Class-based seasonality
Collecting PSM calibration statistics for Metadata seasonality...
Found 501 calibrated PSMs for Metadata seasonality
Collecting PSM calibration statistics for Metadata + interpolation...
Found 522 calibrated PSMs for Metadata + interpolation

Make Plots#

We will have a histogram showing the distribution of the R^2 values, a summary table with statistics, and a spatial distribution with locations of well-calibrated records (threshold can be adjusted).

# Create comparison plot
fig, axes = plt.subplots(1, 3, figsize=(15, 6))
fig.suptitle('PSM Calibration R² Distribution Comparison', fontsize=20, y=1.02, fontweight="bold")

approaches = ["Class-based seasonality", "Metadata seasonality", "Metadata + interpolation"]
colors = ['steelblue', 'forestgreen', 'darkorange']

for i, (approach, color) in enumerate(zip(approaches, colors)):
    # Get R² values for this approach
    approach_stats = [stat for stat in all_stats if stat['approach'] == approach]
    r2_values = [stat['r2_adj'] for stat in approach_stats]
    
    # Plot histogram
    axes[i].hist(r2_values, bins=20, alpha=0.7, edgecolor='black', color=color)
    axes[i].set_xlabel('Adjusted R²')
    axes[i].set_ylabel('Number of Records')
    axes[i].set_title(f'{approach}\n(n={len(r2_values)})')
    
    # Add median line
    median_r2 = np.median(r2_values)
    axes[i].axvline(median_r2, color='red', linestyle='--', linewidth=2, 
                   label=f'Median: {median_r2:.3f}')
    axes[i].legend()
    
    # Add summary stats as text
    mean_r2 = np.mean(r2_values)
    std_r2 = np.std(r2_values)
    axes[i].text(0.05, 0.95, f'Mean: {mean_r2:.3f}\nStd: {std_r2:.3f}', 
                transform=axes[i].transAxes, verticalalignment='top',
                bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))

plt.tight_layout()
plt.show()

../../_images/d66681e0468a998a51f05b7a228c005702d9e234d7aa20c6e027288f548a0e35.png

# Print summary statistics
print("\n" + "="*60)
print("SUMMARY STATISTICS")
print("="*60)

for approach in approaches:
    approach_stats = [stat for stat in all_stats if stat['approach'] == approach]
    r2_values = [stat['r2_adj'] for stat in approach_stats]
    
    print(f"\n{approach}:")
    print(f"  Records: {len(r2_values)}")
    print(f"  Mean R²: {np.mean(r2_values):.4f}")
    print(f"  Median R²: {np.median(r2_values):.4f}")
    print(f"  Std R²: {np.std(r2_values):.4f}")
    print(f"  Min R²: {np.min(r2_values):.4f}")
    print(f"  Max R²: {np.max(r2_values):.4f}")
    print(f"  Records with R² > 0.5: {len([r for r in r2_values if r > 0.5])}")
    print(f"  Records with R² > 0.7: {len([r for r in r2_values if r > 0.7])}")

============================================================
SUMMARY STATISTICS
============================================================

Class-based seasonality:
  Records: 555
  Mean R²: 0.1242
  Median R²: 0.0949
  Std R²: 0.1205
  Min R²: -0.0277
  Max R²: 0.6473
  Records with R² > 0.5: 13
  Records with R² > 0.7: 0

Metadata seasonality:
  Records: 501
  Mean R²: 0.0815
  Median R²: 0.0345
  Std R²: 0.1279
  Min R²: -0.0433
  Max R²: 0.9098
  Records with R² > 0.5: 11
  Records with R² > 0.7: 2

Metadata + interpolation:
  Records: 522
  Mean R²: 0.0826
  Median R²: 0.0357
  Std R²: 0.1269
  Min R²: -0.0433
  Max R²: 0.9098
  Records with R² > 0.5: 11
  Records with R² > 0.7: 2

PlateCRS = ccrs.PlateCarree()

def _add_base_map(ax):
    ax.add_feature(cfeature.LAND, facecolor="lightgray", alpha=0.5, zorder=0)
    ax.add_feature(cfeature.OCEAN, facecolor="lightblue", alpha=0.5, zorder=0)
    ax.add_feature(cfeature.COASTLINE, linewidth=0.5)
    ax.add_feature(cfeature.BORDERS, linewidth=0.3)
    ax.set_global()


def _prep_xy_r2(stats, r2_key, thr):
    lons, lats, r2s = [], [], []
    for s in stats:
        r2 = s.get(r2_key)
        lon, lat = s.get("lon"), s.get("lat")
        if (r2 is None) or (lon is None) or (lat is None):
            continue
        if r2 > thr:
            lons.append(lon); lats.append(lat); r2s.append(r2)
    return np.asarray(lons), np.asarray(lats), np.asarray(r2s)

def plot_high_r2_spatial_comparison(
    stats_class, stats_meta, stats_meta_interp,
    *, r2_threshold=0.7, figsize=(20, 12), r2_key="r2_adj", cmap="Reds",
    show=False  # <- default False to avoid double display in notebooks
):
    approaches = ("Class-based", "Metadata", "Metadata + interp")
    colors     = ("steelblue", "forestgreen", "darkorange")
    stats_sets = (stats_class, stats_meta, stats_meta_interp)

    xy_r2 = [ _prep_xy_r2(s, r2_key, r2_threshold) for s in stats_sets ]
    counts = [
        int((np.array([d.get(r2_key) for d in s if d.get(r2_key) is not None]) > r2_threshold).sum())
        for s in stats_sets
    ]

    # Figure with a thin bottom row for the colorbar
    fig = plt.figure(figsize=figsize, constrained_layout=False)
    gs = gridspec.GridSpec(3, 2, height_ratios=[1, 1, 0.06], hspace=0.15, wspace=0.08, figure=fig)

    axes = [
        fig.add_subplot(gs[0, 0], projection=PlateCRS),
        fig.add_subplot(gs[0, 1], projection=PlateCRS),
        fig.add_subplot(gs[1, 0], projection=PlateCRS),
        fig.add_subplot(gs[1, 1], projection=PlateCRS),
    ]
    cax = fig.add_subplot(gs[2, :])  # dedicated colorbar axis

    # Panels 1–3 (colored by R²)
    for ax, name, (lons, lats, r2s), count in zip(axes[:3], approaches, xy_r2, counts):
        _add_base_map(ax)#, with_labels=True)
        if r2s.size:
            ax.scatter(
                lons, lats, c=r2s, cmap=cmap, s=60, alpha=0.8, transform=PlateCRS,
                vmin=r2_threshold, vmax=1.0, edgecolor="black", linewidth=0.5
            )
        ax.set_title(f"{name}\n(n={count} with R² > {r2_threshold:g})", fontsize=11)

    # Panel 4 (combined, colored by approach)
    ax_c = axes[3]
    _add_base_map(ax_c)#, with_labels=True)
    for name, color, (lons, lats, _), count in zip(approaches, colors, xy_r2, counts):
        if lons.size:
            ax_c.scatter(
                lons, lats, c=color, s=40, alpha=0.8, transform=PlateCRS,
                edgecolor="black", linewidth=0.3, label=f"{name} (n={lons.size})"
            )
    ax_c.set_title(f"Combined: R² > {r2_threshold:g}", fontsize=11)
    ax_c.legend(loc="lower left", bbox_to_anchor=(0, 0))

    # Colorbar in its own axis
    norm = Normalize(vmin=r2_threshold, vmax=1.0)
    sm = ScalarMappable(norm=norm, cmap=cmap); sm.set_array([])
    cbar = fig.colorbar(sm, cax=cax, orientation="horizontal")
    cbar.set_label("Adjusted R²")

    fig.suptitle(f"Spatial Distribution of Well-Calibrated PSMs (R² > {r2_threshold:g})", fontsize=16)
    fig.tight_layout(rect=[0, 0.04, 1, 0.98])  # leave a bit for suptitle

    if show:
        plt.show()   # show once if you explicitly ask
        return None  # avoid echoing a Figure
    return fig       # otherwise return the Figure without calling show()


def extract_calib_stats_with_coords(job, approach_name: str, *, r2_key: str = "fitR2adj", verbose: bool = True
                                   ):
    """
    Extract PSM calibration stats incl. coordinates from a ReconJob-like object.

    Expects:
    - job.proxydb.records: mapping of pid -> record
    - record.psm.calib_details: dict containing r², nobs, mse, SNR, seasonality
    - record.lat / record.lon / record.ptype
    - record.tags (dict-like) with optional 'name'
    """
    calib_stats: List[Dict] = []
    if verbose:
        print(f"Collecting PSM calibration statistics for {approach_name}...")

    for pid, record in getattr(job, "proxydb", {}).records.items():
        psm = getattr(record, "psm", None)
        details = getattr(psm, "calib_details", None) if psm is not None else None
        if not details:
            continue

        entry = {
            "pid": pid,
            "name": (record.tags.get("name") if isinstance(record.tags, dict) else None) or pid,
            "r2_adj": details.get(r2_key),           # normalize to 'r2_adj' for plotting
            "nobs": details.get("nobs"),
            "mse": details.get("PSMmse"),
            "snr": details.get("SNR"),
            "seasonality": details.get("seasonality"),
            "approach": approach_name,
            "lat": getattr(record, "lat", None),
            "lon": getattr(record, "lon", None),
            "ptype": getattr(record, "ptype", None),
        }
        calib_stats.append(entry)

    if verbose:
        print(f"Found {len(calib_stats)} calibrated PSMs for {approach_name}")
    return calib_stats

Plotting functions and helpers for spatial distribution.

# --- CRS (axes + data are both geographic lon/lat) ---
MAP_CRS  = ccrs.PlateCarree()   # for subplot projection
DATA_CRS = ccrs.PlateCarree()   # for the lon/lat data transform

def _add_base_map(ax):
    ax.add_feature(cfeature.LAND, facecolor="lightgray", alpha=0.5, zorder=0)
    ax.add_feature(cfeature.OCEAN, facecolor="lightblue", alpha=0.5, zorder=0)
    ax.add_feature(cfeature.COASTLINE, linewidth=0.5)
    ax.add_feature(cfeature.BORDERS, linewidth=0.3)
    ax.set_global()

def _prep_xy_r2(stats, r2_key, thr):
    """Return lon/lat/r2 arrays filtered by threshold and finite values."""
    lons, lats, r2s = [], [], []
    for s in stats:
        r2 = s.get(r2_key)
        lon, lat = s.get("lon"), s.get("lat")
        if r2 is None or lon is None or lat is None:
            continue
        if r2 > thr:
            lons.append(lon); lats.append(lat); r2s.append(r2)
    if not lons:
        return np.array([]), np.array([]), np.array([])
    lons, lats, r2s = map(np.asarray, (lons, lats, r2s))
    m = np.isfinite(lons) & np.isfinite(lats) & np.isfinite(r2s)
    return lons[m], lats[m], r2s[m]

def plot_high_r2_spatial_comparison(
    stats_class: List[Dict],
    stats_meta: List[Dict],
    stats_meta_interp: List[Dict],
    *,
    r2_threshold: float = 0.2,           # <- lower default so you see points
    figsize=(20, 12),
    r2_key: str = "r2_adj",
    cmap: str = "Reds",
    show: bool = True
):
    approaches = ("Class-based", "Metadata", "Metadata + interp")
    colors     = ("steelblue", "forestgreen", "darkorange")
    stats_sets = (stats_class, stats_meta, stats_meta_interp)

    xy_r2 = [ _prep_xy_r2(s, r2_key, r2_threshold) for s in stats_sets ]
    counts = [
        int((np.array([d.get(r2_key) for d in s if d.get(r2_key) is not None]) > r2_threshold).sum())
        for s in stats_sets
    ]
    # quick sanity print so you know data exist
    print("counts (>= threshold):", [len(x[0]) for x in xy_r2], " | threshold:", r2_threshold)

    # figure with a dedicated colorbar row
    fig = plt.figure(figsize=figsize, constrained_layout=False)
    gs = gridspec.GridSpec(3, 2, height_ratios=[1, 1, 0.06], hspace=0.15, wspace=0.08, figure=fig)

    axes = [
        fig.add_subplot(gs[0, 0], projection=MAP_CRS),
        fig.add_subplot(gs[0, 1], projection=MAP_CRS),
        fig.add_subplot(gs[1, 0], projection=MAP_CRS),
        fig.add_subplot(gs[1, 1], projection=MAP_CRS),
    ]
    cax = fig.add_subplot(gs[2, :])  # shared colorbar axis

    # panels 1–3 (colored by R²)
    for ax, name, (lons, lats, r2s), count in zip(axes[:3], approaches, xy_r2, counts):
        _add_base_map(ax)
        if r2s.size:
            ax.scatter(
                lons, lats, c=r2s, cmap=cmap, s=60, alpha=0.9,
                transform=DATA_CRS, vmin=r2_threshold, vmax=1.0,
                edgecolor="black", linewidth=0.4
            )
        ax.set_title(f"{name}\n(n={count} with R² > {r2_threshold:g})", fontsize=11)

    # combined panel (colored by approach)
    ax_c = axes[3]
    _add_base_map(ax_c)
    for name, color, (lons, lats, _), _count in zip(approaches, colors, xy_r2, counts):
        if lons.size:
            ax_c.scatter(
                lons, lats, c=color, s=42, alpha=0.9, transform=DATA_CRS,
                edgecolor="black", linewidth=0.3, label=f"{name} (n={lons.size})"
            )
    ax_c.set_title(f"Combined: R² > {r2_threshold:g}", fontsize=11)
    ax_c.legend(loc="lower left", bbox_to_anchor=(0, 0))

    # shared colorbar
    norm = Normalize(vmin=r2_threshold, vmax=1.0)
    sm = ScalarMappable(norm=norm, cmap=cmap); sm.set_array([])
    cbar = fig.colorbar(sm, cax=cax, orientation="horizontal")
    cbar.set_label("Adjusted R²")

    fig.suptitle(f"Spatial Distribution of Well-Calibrated PSMs (R² > {r2_threshold:g})", fontsize=16)
    fig.tight_layout(rect=[0, 0.05, 1, 0.98])

    if show:
        plt.show()
        return None
    return fig

# ---- stats extractor (with coords) ----
def extract_calib_stats_with_coords(job, approach_name: str, *, r2_key: str = "fitR2adj", verbose: bool = True):
    """Extract PSM calibration stats incl. lat/lon."""
    calib_stats: List[Dict] = []
    if verbose:
        print(f"Collecting PSM calibration statistics for {approach_name}...")
    for pid, record in getattr(job, "proxydb", {}).records.items():
        psm = getattr(record, "psm", None)
        details = getattr(psm, "calib_details", None) if psm is not None else None
        if not details:
            continue
        calib_stats.append({
            "pid": pid,
            "name": (record.tags.get("name") if isinstance(record.tags, dict) else None) or pid,
            "r2_adj": details.get(r2_key),
            "nobs": details.get("nobs"),
            "mse": details.get("PSMmse"),
            "snr": details.get("SNR"),
            "seasonality": details.get("seasonality"),
            "approach": approach_name,
            "lat": getattr(record, "lat", None),
            "lon": getattr(record, "lon", None),
            "ptype": getattr(record, "ptype", None),
        })
    if verbose:
        print(f"Found {len(calib_stats)} calibrated PSMs for {approach_name}")
    return calib_stats

stats_class = extract_calib_stats_with_coords(job_class_new, "Class-based")
stats_meta  = extract_calib_stats_with_coords(job_meta, "Metadata")
stats_int   = extract_calib_stats_with_coords(job_meta_interp, "Metadata + interp")

plot_high_r2_spatial_comparison(stats_class, stats_meta, stats_int, r2_threshold=0.2)

Collecting PSM calibration statistics for Class-based...
Found 555 calibrated PSMs for Class-based
Collecting PSM calibration statistics for Metadata...
Found 501 calibrated PSMs for Metadata
Collecting PSM calibration statistics for Metadata + interp...
Found 522 calibrated PSMs for Metadata + interp
counts (>= threshold): [107, 63, 65]  | threshold: 0.2

../../_images/0f1548e004a8ee2d07da7d68f1935f184c8a5022bca4a6ee871149d293c23c54.png

Key Takeaways#

Calibration distributions and table:

Class-based seasonality achieves the highest mean and median R^2 (~0.12 / 0.095), with more consistently skillful records.
Metadata approaches have lower average skill (mean ~0.08 / median ~0.035), though they occasionally produce very high outliers
Interpolation increases record count but does not improve overall calibration quality, which could still contribute to the overall skill of the reconstruction.

“Well-calibrated” records (R^2 > 0.2):

Class-based has the largest number (n=107) compared to metadata (n=63) and metadata + interpolation (n=65).
High-skill records are located mainly in Europe, North America, and parts of Asia (for class-based), with lower coverage in the tropics, oceans in general, and Southern Hemisphere.

Overall:

Class-based calibration provides broader and more robust spatial coverage.
Metadata-based adds a tiny number of high correlated sites (that require deeper investigation), but overall performance is worse and patchier
The combined view of all three shows that class-based is the strongest by default, with the metadata approach potentially contributing to some skill in very specific sites.
It is important to note here that the network loses about 10% of its records once we whittle it down from class-based to metadata-based despite both starting from the same initial proxy database. Further investigations will show how much of an impact this has and if there are other combinations of databases (class-based with interp, etc.) that we can use to improve upon the baseline set by LMRv2.1.