Chapter 3: Measures of association#

by Julien Emile-Geay & Jordan Landers, University of Southern California

In this notebook we explore the relationship between two well-known climate indices: the Southern Oscillation Index (SOI) and the North Atlantic Oscillation (NAO) index. The NAO and SOI have been alleged to show some relationship, albeit a subtle one, in some seasons/epochs. Specifically, we will explore:

  • effects of trends

  • effects of autocorrelation

  • various measures of association

  • various methods of establishing the significance of a relationship

Data Wrangling#

  • The NAO data are from NCEP

  • The SOI data ship with Pyleoclim, which houses a few datasets that make it easy to experiment with real-world geoscientific timeseries. To see what is available, do:

import pyleoclim as pyleo
pyleo.utils.available_dataset_names()
['SOI',
 'NINO3',
 'HadCRUT5',
 'AIR',
 'LR04',
 'AACO2',
 'EDC-dD',
 'GISP2',
 'cenogrid_d18O',
 'cenogrid_d13C']

To load the SOI, simply do:

SOI = pyleo.utils.load_dataset('SOI')
SOI
{'archiveType': 'Instrumental', 'label': 'Southern Oscillation Index'}
None
Time [year C.E.]
1951.000000    1.5
1951.083333    0.9
1951.166667   -0.1
1951.250000   -0.3
1951.333333   -0.7
              ... 
2019.583333   -0.1
2019.666667   -1.2
2019.750000   -0.4
2019.833333   -0.8
2019.916667   -0.6
Name: SOI [mb], Length: 828, dtype: float64
SOI.plot()
(<Figure size 1000x400 with 1 Axes>,
 <Axes: xlabel='Time [year C.E.]', ylabel='SOI [mb]'>)

That was the easy part. For the NAO index, we have to work a little harder.

import pandas as pd
url = 'https://www.cpc.ncep.noaa.gov/products/precip/CWlink/pna/norm.nao.monthly.b5001.current.ascii.table'
df_nao = pd.read_csv(url, sep=r'\s+', header=0, index_col=0)
df_nao
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1950 0.92 0.40 -0.36 0.73 -0.59 -0.06 -1.26 -0.05 0.25 0.85 -1.26 -1.02
1951 0.08 0.70 -1.02 -0.22 -0.59 -1.64 1.37 -0.22 -1.36 1.87 -0.39 1.32
1952 0.93 -0.83 -1.49 1.01 -1.12 -0.40 -0.09 -0.28 -0.54 -0.73 -1.13 -0.43
1953 0.33 -0.49 -0.04 -1.67 -0.66 1.09 0.40 -0.71 -0.35 1.32 1.04 -0.47
1954 0.37 0.74 -0.83 1.34 -0.09 -0.25 -0.60 -1.90 -0.44 0.60 0.40 0.69
... ... ... ... ... ... ... ... ... ... ... ... ...
2020 1.34 1.26 1.01 -1.02 -0.41 -0.15 -1.23 0.12 0.98 -0.65 2.54 -0.30
2021 -1.11 0.14 0.73 -1.43 -1.24 0.77 0.03 -0.28 -0.21 -2.29 -0.18 0.29
2022 1.08 1.68 0.77 -0.36 0.71 -0.12 -0.09 1.47 -1.61 -0.72 0.69 -0.15
2023 1.25 0.92 -1.11 -0.63 0.39 -0.58 -2.17 -1.16 -0.44 -2.03 -0.32 1.94
2024 0.21 1.09 -0.21 -0.78 NaN NaN NaN NaN NaN NaN NaN NaN

75 rows × 12 columns

Here we have a challenge: the data present themselves as a 2-dimensional table, and need to be reshaped into a 1-D series by combining months and years in a sequential fashion. Fear not! pandas can do this!

First, we turn the index into a column called Year:

df_nao = df_nao.reset_index().rename(columns={'index':'Year'})
df_nao
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
0 1950 0.92 0.40 -0.36 0.73 -0.59 -0.06 -1.26 -0.05 0.25 0.85 -1.26 -1.02
1 1951 0.08 0.70 -1.02 -0.22 -0.59 -1.64 1.37 -0.22 -1.36 1.87 -0.39 1.32
2 1952 0.93 -0.83 -1.49 1.01 -1.12 -0.40 -0.09 -0.28 -0.54 -0.73 -1.13 -0.43
3 1953 0.33 -0.49 -0.04 -1.67 -0.66 1.09 0.40 -0.71 -0.35 1.32 1.04 -0.47
4 1954 0.37 0.74 -0.83 1.34 -0.09 -0.25 -0.60 -1.90 -0.44 0.60 0.40 0.69
... ... ... ... ... ... ... ... ... ... ... ... ... ...
70 2020 1.34 1.26 1.01 -1.02 -0.41 -0.15 -1.23 0.12 0.98 -0.65 2.54 -0.30
71 2021 -1.11 0.14 0.73 -1.43 -1.24 0.77 0.03 -0.28 -0.21 -2.29 -0.18 0.29
72 2022 1.08 1.68 0.77 -0.36 0.71 -0.12 -0.09 1.47 -1.61 -0.72 0.69 -0.15
73 2023 1.25 0.92 -1.11 -0.63 0.39 -0.58 -2.17 -1.16 -0.44 -2.03 -0.32 1.94
74 2024 0.21 1.09 -0.21 -0.78 NaN NaN NaN NaN NaN NaN NaN NaN

75 rows × 13 columns

Next, we invoke pandas’ powerful melt function to unpivot the table:

df_nao = df_nao.melt(id_vars='Year', var_name='Month', value_name='NAO')
df_nao
Year Month NAO
0 1950 Jan 0.92
1 1951 Jan 0.08
2 1952 Jan 0.93
3 1953 Jan 0.33
4 1954 Jan 0.37
... ... ... ...
895 2020 Dec -0.30
896 2021 Dec 0.29
897 2022 Dec -0.15
898 2023 Dec 1.94
899 2024 Dec NaN

900 rows × 3 columns

Next we combine year and month into a date, then ditch the superfluous year and month columns:

df_nao['date']=(df_nao
          .assign(Date=lambda x: pd.to_datetime(x['Year'].astype(str) + '-' + x['Month'], format='%Y-%b')))['Date']             
df_nao = df_nao.drop(columns=['Year', 'Month']).set_index('date').sort_index()
df_nao
NAO
date
1950-01-01 0.92
1950-02-01 0.40
1950-03-01 -0.36
1950-04-01 0.73
1950-05-01 -0.59
... ...
2024-08-01 NaN
2024-09-01 NaN
2024-10-01 NaN
2024-11-01 NaN
2024-12-01 NaN

900 rows × 1 columns

Note the use of a lambda function to rework month and year on the fly. We now have a proper dataframe whose time axis is a genuine datetime index, which Pyleoclim will know what to do with:

df_nao.index
DatetimeIndex(['1950-01-01', '1950-02-01', '1950-03-01', '1950-04-01',
               '1950-05-01', '1950-06-01', '1950-07-01', '1950-08-01',
               '1950-09-01', '1950-10-01',
               ...
               '2024-03-01', '2024-04-01', '2024-05-01', '2024-06-01',
               '2024-07-01', '2024-08-01', '2024-09-01', '2024-10-01',
               '2024-11-01', '2024-12-01'],
              dtype='datetime64[ns]', name='date', length=900, freq=None)
df_nao['NAO'].plot()
<Axes: xlabel='date'>

Now all we need to do to turn this into a Pyleoclim Series object is to sprinkle a little metadata.

metadata = {
        'time_unit': 'years CE',
        'time_name': 'Time',
        'value_name': 'NAO',
        'label': 'North Atlantic Oscillation',
        'archiveType': 'Instrumental',
        'importedFrom': 'NOAA/NCEI',
    }

NAO = pyleo.Series.from_pandas(df_nao['NAO'],metadata)
NAO.plot()
(<Figure size 1000x400 with 1 Axes>,
 <Axes: xlabel='Time [years CE]', ylabel='NAO'>)

It turns out that this version has gaps, which we need to fill by interpolation so that other methods work properly:

NAOi = NAO.interp(step=1/12)
fig, ax = NAO.plot()
NAOi.plot(ax=ax, label='interpolated')
<Axes: xlabel='Time [years CE]', ylabel='NAO'>

Let us now look at SOI and NAO side by side:

ms = NAOi & SOI
ms.stackplot()
(<Figure size 640x480 with 3 Axes>,
 {0: <Axes: ylabel='NAO'>,
  1: <Axes: ylabel='SOI [mb]'>,
  'x_axis': <Axes: xlabel='Time [year C.E.]'>})

The series have comparable lengths, and SOI seems, by eye, to display stronger interannual variability. We could confirm this with spectral analysis, but that is not the focus of this tutorial. Let’s dive into exploring their relationship, if any.

Measuring Association#

Here we will use Pyleoclim’s correlation() function to measure the association between SOI and NAO.

corr_base = SOI.correlation(NAOi, method='built-in')
print(corr_base)
  correlation    p-value  signif. (α: 0.05)
-------------  ---------  -------------------
    0.0011608       0.97  False

We see a very small correlation, with (under this particular test) a large p-value that does not allow us to claim that this correlation is significant (i.e. remarkable) in any way. At this point you should be asking yourself what the underlying null hypothesis is. This link might be helpful in recalling that, as well as the assumptions of this standard test.
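To make that null hypothesis concrete, here is a minimal sketch of what the built-in method essentially does: apply the classical Pearson test (scipy.stats.pearsonr) to the two series once they are aligned on a common time axis. The variable names below are illustrative, not part of Pyleoclim's API, and the p-value is derived under the assumption that the samples are IID draws from a bivariate normal distribution.

import numpy as np
from scipy import stats

# align the two series on a common time axis, as correlation() does internally
nao_c, soi_c = (NAOi & SOI).common_time().series_list
r, p = stats.pearsonr(nao_c.value, soi_c.value)
# under the null of zero correlation, t = r*sqrt((n-2)/(1-r**2)) follows a
# Student's t distribution with n-2 degrees of freedom, whence the p-value
print(f'r = {r:.4f}, p = {p:.3f}')   # should be close to the table above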

Note that the function allows one to use a number of different statistics to measure association:

for stat in ['pearsonr','spearmanr','pointbiserialr','kendalltau','weightedtau']:
     corr = SOI.correlation(NAOi, statistic=stat, number=200)
     print(f'{stat}(NAO,SOI)={corr.r:.3f}, p={corr.p:.3f}')   
pearsonr(NAO,SOI)=0.001, p=0.980
spearmanr(NAO,SOI)=-0.002, p=0.965
pointbiserialr(NAO,SOI)=0.001, p=0.975
kendalltau(NAO,SOI)=-0.001, p=0.980
weightedtau(NAO,SOI)=-0.035, p=0.710

We see that no matter which way we slice it, we get very small numbers, which appear insignificant with respect to the default test.

Spurious Correlations#

In the geosciences, there are two processes that can artificially inflate correlations between otherwise unrelated variables:

  1. smoothing

  2. common trends

Smoothing-enhanced correlations#

To see this effect, let’s smooth both series with a bandpass filter that retains scales between 2 and 50 years:

ms_md = ms.filter(method='butterworth',cutoff_scale=[2, 50])
ms_md.plot()
(<Figure size 1000x400 with 1 Axes>,
 <Axes: xlabel='Time [year C.E.]', ylabel='value'>)
corr_md=ms_md.series_list[0].correlation(ms_md.series_list[1],method='built-in')
print(corr_md)
  correlation  p-value    signif. (α: 0.05)
-------------  ---------  -------------------
     0.306343  < 1e-18    True

We see that with sufficient smoothing, we have increased the correlation to the point where it is now highly significant (note the infinitesimal p-value). But BEWARE! The built-in test for Pearson’s correlation, as recalled above, assumes that the data are IID: independent and identically distributed. After smoothing so heavily, neighboring values resemble each other a lot, so independence is lost (see course notes for more details). We therefore need more sophisticated methods to gauge the p-value. First, let’s use a common analytical approximation that adjusts for the loss of degrees of freedom (see course notes); the gist of the adjustment is sketched below, and it is coded up in Pyleoclim as ttest.
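The adjustment replaces the nominal sample size with an effective sample size derived from the lag-1 autocorrelation of each series. The function below is an illustrative sketch of that idea, not Pyleoclim’s exact implementation (which estimates the autocorrelation more carefully).

import numpy as np
from scipy import stats

def corr_ttest_sketch(y1, y2, alpha=0.05):
    '''Pearson correlation with a t-test based on an effective sample size (sketch).'''
    r = np.corrcoef(y1, y2)[0, 1]
    N = len(y1)
    g1 = np.corrcoef(y1[:-1], y1[1:])[0, 1]   # lag-1 autocorrelation of y1
    g2 = np.corrcoef(y2[:-1], y2[1:])[0, 1]   # lag-1 autocorrelation of y2
    Ne1 = N * (1 - g1) / (1 + g1)             # effective sample size for y1
    Ne2 = N * (1 - g2) / (1 + g2)             # effective sample size for y2
    Ne = np.sqrt(Ne1 * Ne2)                   # combine the two estimates
    dof = Ne - 2                              # effective degrees of freedom
    t = np.abs(r) * np.sqrt(dof / (1 - r**2))
    p = 2 * stats.t.sf(t, df=dof)             # two-sided p-value
    return r, p, p < alpha

Let us now call the ttest method on the smoothed series: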

corr_md_dof=ms_md.series_list[0].correlation(ms_md.series_list[1],method='ttest')
print(corr_md_dof)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[17], line 1
----> 1 corr_md_dof=ms_md.series_list[0].correlation(ms_md.series_list[1],method='ttest')
      2 print(corr_md_dof)

File ~/Documents/GitHub/Pyleoclim_util/pyleoclim/core/series.py:3556, in Series.correlation(self, target_series, alpha, statistic, method, number, timespan, settings, seed, common_time_kwargs, mute_pbar)
   3554 if method == 'ttest':
   3555     if statistic == 'pearsonr':
-> 3556         stat, signf, pval = corrutils.corr_ttest(ts0.value, ts1.value, alpha=alpha)
   3557         signif = bool(signf)
   3558     else:

File ~/Documents/GitHub/Pyleoclim_util/pyleoclim/utils/correlation.py:234, in corr_ttest(y1, y2, alpha)
    231 Ney2 = N * (1-g2) / (1+g2)
    233 Ne = stats.mstats.gmean([Ney1+Ney2])
--> 234 assert Ne >= 10, 'Too few effective d.o.f. to apply this method!'
    236 df = Ne - 2
    237 t = np.abs(r) * np.sqrt(df/(1-r**2))

AssertionError: Too few effective d.o.f. to apply this method!

That didn’t work! As the error message indicates, the smoothing has lowered the effective degrees of freedom to such an extent that this method cannot be applied. Let’s instead use a parametric Monte Carlo test: AR(1) benchmarks.
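The idea behind this test is to fit an AR(1) model to each series, generate many synthetic (surrogate) pairs with the same persistence but no built-in relationship, and compare the observed correlation to the distribution of correlations obtained on those surrogate pairs. The sketch below is illustrative only; Pyleoclim’s ar1sim method fits the AR(1) parameters more carefully.

import numpy as np

def ar1_surrogate(y, rng):
    '''One AR(1) surrogate with roughly the same lag-1 autocorrelation and variance as y.'''
    g = np.corrcoef(y[:-1], y[1:])[0, 1]      # lag-1 autocorrelation
    sigma = np.std(y) * np.sqrt(1 - g**2)     # innovation standard deviation
    z = np.empty(len(y))
    z[0] = rng.normal(0, np.std(y))
    for i in range(1, len(y)):
        z[i] = g * z[i-1] + rng.normal(0, sigma)
    return z

def ar1_corr_pvalue(y1, y2, number=1000, seed=222):
    '''Two-sided p-value of corr(y1, y2) against AR(1) surrogate pairs (sketch).'''
    rng = np.random.default_rng(seed)
    r_obs = np.corrcoef(y1, y2)[0, 1]
    r_null = np.array([np.corrcoef(ar1_surrogate(y1, rng), ar1_surrogate(y2, rng))[0, 1]
                       for _ in range(number)])
    return r_obs, np.mean(np.abs(r_null) >= np.abs(r_obs))

Pyleoclim does all of this for us: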

corr_md_ar1=ms_md.series_list[0].correlation(ms_md.series_list[1],method='ar1sim', number=5000, seed = 222)
print(corr_md_ar1)
Evaluating association on surrogate pairs: 100%|██████████| 5000/5000 [00:03<00:00, 1487.20it/s]
  correlation    p-value  signif. (α: 0.05)
-------------  ---------  -------------------
     0.306343       0.35  False

The correlation has not changed, but see what happened to the p-value! We can actually quantify this:

corr_md_ar1.p/corr_md.p
1.8258770939061737e+18

This is how much larger the p-value is under this test! So we went from highly significant to insignificant. Would we have found the same with the non-parametric Monte Carlo test (phase randomization)?
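Phase randomization builds surrogates that preserve each series’ power spectrum (and hence its autocorrelation structure) while scrambling the Fourier phases, which destroys any common timing between the two series. Here is a rough sketch of how one such surrogate could be built; it is illustrative only, and Pyleoclim’s phaseran method takes care of the details.

import numpy as np

def phase_randomize(y, rng):
    '''Surrogate with the same power spectrum as y but randomized Fourier phases (sketch).'''
    ybar = np.mean(y)
    F = np.fft.rfft(y - ybar)                 # Fourier transform of the centered series
    phases = rng.uniform(0, 2*np.pi, size=F.size)
    phases[0] = 0.0                           # keep the zero-frequency term real
    if len(y) % 2 == 0:
        phases[-1] = 0.0                      # keep the Nyquist term real as well
    return np.fft.irfft(np.abs(F) * np.exp(1j*phases), n=len(y)) + ybar

Applying the phase-randomization test: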

corr_md_pr=ms_md.series_list[0].correlation(ms_md.series_list[1],method='phaseran', number=5000, seed = 222)
print(corr_md_pr)
Evaluating association on surrogate pairs: 100%|██████████| 5000/5000 [00:03<00:00, 1434.84it/s]
  correlation    p-value  signif. (α: 0.05)
-------------  ---------  -------------------
     0.306343       0.02  True

In this case, the two methods disagree, and phase-randomization appears more lenient. Disagreements between methods are frequent when there are relatively few degrees of freedom, and are cause for caution.

Take-home message: beware the effect of smoothing on degrees of freedom, and therefore on assessments of significance.

Trend-enhanced correlations#

Now we look at a slightly different issue: how the presence of common trends may make otherwise unrelated timeseries appear artificially related. First, let’s apply the common_time method to obtain a common time vector for both series, from which we’ll construct the trend:

msc = ms.common_time()
tc = msc.series_list[0].time # would get the same answer with msc.series_list[1].time

The baseline trend will look something like this, which we’ll superimpose on the NAO and SOI indices, respectively.

trend = (tc-tc.mean())/tc.ptp() # create trend as a NumPy array
import matplotlib.pyplot as plt
plt.plot(tc, trend) # plot it
[<matplotlib.lines.Line2D at 0x1406ca890>]
scales = [0.5, 1, 2]
for method in ['built-in','ttest','ar1sim','phaseran']:
    for s in scales:
        scaled_trend = s*trend # scale the trend
        nao_t = msc.series_list[0].copy() # create copy of the NAO series
        nao_t.value += scaled_trend # add trend
        soi_t = msc.series_list[1].copy() # create copy of the SOI series
        soi_t.value += scaled_trend # add trend
        corr_t = nao_t.correlation(soi_t,method=method)
        print(f"Scale = {s}, using {method} method\n")
        print(corr_t)
Scale = 0.5, using built-in method

  correlation    p-value  signif. (α: 0.05)
-------------  ---------  -------------------
    0.0279778       0.42  False

Scale = 1, using built-in method

  correlation  p-value    signif. (α: 0.05)
-------------  ---------  -------------------
    0.0924265  < 1e-2     True

Scale = 2, using built-in method

  correlation  p-value    signif. (α: 0.05)
-------------  ---------  -------------------
     0.277871  < 1e-15    True

Scale = 0.5, using ttest method

  correlation    p-value  signif. (α: 0.05)
-------------  ---------  -------------------
    0.0279778       0.46  False

Scale = 1, using ttest method

  correlation    p-value  signif. (α: 0.05)
-------------  ---------  -------------------
    0.0924265       0.02  True

Scale = 2, using ttest method

  correlation  p-value    signif. (α: 0.05)
-------------  ---------  -------------------
     0.277871  < 1e-12    True
Evaluating association on surrogate pairs: 100%|██████████| 1000/1000 [00:00<00:00, 1523.83it/s]
Scale = 0.5, using ar1sim method

  correlation    p-value  signif. (α: 0.05)
-------------  ---------  -------------------
    0.0279778       0.44  False
Evaluating association on surrogate pairs: 100%|██████████| 1000/1000 [00:00<00:00, 1531.31it/s]
Scale = 1, using ar1sim method

  correlation    p-value  signif. (α: 0.05)
-------------  ---------  -------------------
    0.0924265       0.03  True
Evaluating association on surrogate pairs: 100%|██████████| 1000/1000 [00:00<00:00, 1537.74it/s]
Scale = 2, using ar1sim method

  correlation  p-value    signif. (α: 0.05)
-------------  ---------  -------------------
     0.277871  < 1e-6     True
Evaluating association on surrogate pairs: 100%|██████████| 1000/1000 [00:00<00:00, 1452.27it/s]
Scale = 0.5, using phaseran method

  correlation    p-value  signif. (α: 0.05)
-------------  ---------  -------------------
    0.0279778       0.55  False
Evaluating association on surrogate pairs: 100%|██████████| 1000/1000 [00:00<00:00, 1513.72it/s]
Scale = 1, using phaseran method

  correlation    p-value  signif. (α: 0.05)
-------------  ---------  -------------------
    0.0924265       0.15  False
Evaluating association on surrogate pairs: 100%|██████████| 1000/1000 [00:00<00:00, 1389.75it/s]
Scale = 2, using phaseran method

  correlation  p-value    signif. (α: 0.05)
-------------  ---------  -------------------
     0.277871  < 1e-2     True

As we dial up the amplitude of the trend (0.5, 1, and 2 times the baseline), the correlation rises from 0.028 to 0.09 to 0.28. At the smallest amplitude, all four methods agree that the correlation is insignificant; at the largest, all of them declare it significant. The intermediate case (scale = 1) is where the methods diverge: the built-in test, the t-test with adjusted degrees of freedom (DOFs), and the AR(1) benchmarks all call a correlation of 0.09 significant, while phase randomization does not. In this case, phase randomization is the most conservative. This is why it is the default method in Pyleoclim.

Take-home message: common trends can easily create the appearance of correlations (see Tyler Vigen’s excellent website) and really complicate assessments of significance. If the trend is not relevant to your question, we recommend removing it prior to computing correlations, e.g. using Pyleoclim's detrend().
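For instance, using the trend-contaminated series from the previous cell, the workflow could look like the following sketch (assuming a linear detrending is appropriate here, since the added trend is linear; detrend() supports other methods as well):

nao_d = nao_t.detrend(method='linear')   # remove a linear trend from the NAO copy
soi_d = soi_t.detrend(method='linear')   # remove a linear trend from the SOI copy
corr_d = nao_d.correlation(soi_d, method='phaseran', number=1000)
print(corr_d)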

Takeaways#

Not only is correlation not indicative of causation, but spurious correlations abound, often driven by smoothing, trends, or short sample sizes. Some null hypotheses are more stringent than others; the built-in one assumes IID data, an assumption that is hardly ever satisfied in the geosciences. Make sure you carefully match your null hypothesis to your data/problem.

If it is causality you are after, more advanced tools are available, such as Empirical Dynamical Modeling. It is well worth a dive into the underlying theory, but it is undoubtedly more complex.