Principal Component Analysis with Pyleoclim#

by Julien Emile-Geay, Deborah Khider, Alexander James

Preamble#

A chief goal of multivariate data analysis is data reduction, which attempts to condense the information contained in a potentially high-dimensional dataset into a few interpretable patterns. Chief among data reduction techniques is Principal Component Analysis (PCA), which organizes the data into orthogonal patterns that account for a decreasing share of variance: the first pattern accounts for the lion’s share, followed by the second, third, and so on. In geophysical timeseries, it is often the case that the first pattern (“mode”) tends to be associated with the longest time scale (e.g. a secular trend). PCA is therefore very useful for exploratory data analysis, and sometimes helps testing theories of climate change (e.g. comparing simulated to observed patterns of change).
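To make this concrete, here is a minimal, Pyleoclim-free sketch of PCA via the singular value decomposition, on toy data in which five hypothetical "sites" share one dominant pattern (the weights and noise level are made up for illustration):

```python
import numpy as np

# Toy data: 5 "sites" sharing a common linear trend, plus noise.
rng = np.random.default_rng(42)
t = np.linspace(0, 1, 200)
w = np.array([1.0, 0.8, -0.5, 0.3, 0.9])        # hypothetical site weights
X = np.outer(t, w) + 0.1 * rng.normal(size=(200, 5))

Xc = X - X.mean(axis=0)                          # center each column
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pctvar = 100 * s**2 / np.sum(s**2)               # % variance per mode
pcs = U * s                                      # temporal loadings (principal components)
eofs = Vt                                        # spatial patterns (eigenvectors)
print(pctvar.round(1))                           # the shared trend dominates mode 1
```

The percentages decrease by construction, and here the first mode captures the lion's share because a single pattern underlies all five columns.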

Goals#

In this notebook you’ll see how to apply PCA within Pyleoclim. For more details, see Emile-Geay (2017), chap 12. In addition, it will walk you through Monte Carlo PCA, the version of PCA relevant to the analysis of records that present themselves as age ensembles.

Reading Time: 15 min

Keywords#

Principal Component Analysis, Singular Value Decomposition, Data Reduction

Pre-requisites#

None

Relevant Packages#

statsmodels, matplotlib, pylipd

Data Description#

  • for PCA we use the Euro2k database, i.e. the European working group’s contribution to the PAGES 2k paleotemperature compilation, which gathers proxies from multiple archives, mainly wood, coral, lake sediment and documentary archives.

  • for MC-PCA we reuse the Euro2k records, equipped with synthetic age ensembles generated below for illustration purposes.

Demonstration#

Let’s load packages first:

%load_ext watermark

import pyleoclim as pyleo
import numpy as np
import seaborn as sns

Data Wrangling and Processing#

To load this dataset, we make use of pylipd. We first import everything into a pandas dataframe:

from pylipd.utils.dataset import load_dir
lipd = load_dir(name='Pages2k')
df = lipd.get_timeseries_essentials()
df.head()
Loading 16 LiPD files
100%|██████████| 16/16 [00:00<00:00, 32.59it/s]
Loaded..
dataSetName archiveType geo_meanLat geo_meanLon geo_meanElev paleoData_variableName paleoData_values paleoData_units paleoData_proxy paleoData_proxyGeneral time_variableName time_values time_units depth_variableName depth_values depth_units
0 Ocn-RedSea.Felis.2000 Coral 27.8500 34.3200 -6.0 d18O [-4.12, -3.82, -3.05, -3.02, -3.62, -3.96, -3.... permil d18O None year [1995.583, 1995.417, 1995.25, 1995.083, 1994.9... yr AD None None None
1 Ant-WAIS-Divide.Severinghaus.2012 Borehole -79.4630 -112.1250 1766.0 temperature [-29.607, -29.607, -29.606, -29.606, -29.605, ... degC borehole None year [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,... yr AD None None None
2 Ant-WAIS-Divide.Severinghaus.2012 Borehole -79.4630 -112.1250 1766.0 uncertainty_temperature [1.327, 1.328, 1.328, 1.329, 1.33, 1.33, 1.331... degC None None year [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,... yr AD None None None
3 Asi-SourthAndMiddleUrals.Demezhko.2007 Borehole 55.0000 59.5000 1900.0 temperature [0.166, 0.264, 0.354, 0.447, 0.538, 0.62, 0.68... degC borehole None year [800, 850, 900, 950, 1000, 1050, 1100, 1150, 1... yr AD None None None
4 Ocn-AlboranSea436B.Nieto-Moreno.2013 Marine sediment 36.2053 -4.3133 -1108.0 temperature [18.79, 19.38, 19.61, 18.88, 18.74, 19.25, 18.... degC alkenone None year [1999.07, 1993.12, 1987.17, 1975.26, 1963.36, ... yr AD None None None

Let’s see what variable names are present in those series. In the LiPD terminology, one has to look at paleoData_variableName, and pandas makes it easy to find unique entries for each.

df['paleoData_variableName'].unique()
array(['d18O', 'temperature', 'uncertainty_temperature', 'Mg_Ca', 'notes',
       'Uk37', 'trsgi', 'MXD'], dtype=object)

Next, let’s select all the Northern Hemisphere data that can be interpreted as temperature:

dfs = df.query("paleoData_variableName in ('temperature','d18O', 'MXD', 'trsgi') & geo_meanLat > 0") 
len(dfs)
17

We should have 17 series. Now, to run PCA we need the various series to overlap in a significant way. The easiest way to check is to look at the minimum value of the time variable for each series, which is in years CE:

tmin = dfs['time_values'].apply(lambda x : x.min())
tmin
0     1751.083
3      800.000
4      564.360
5      -90.000
6     -368.000
7     -368.000
8     -368.000
14    1175.000
15    1928.960
16    1897.120
17   -1949.000
18     232.000
20     971.190
21    1260.000
22     760.000
23    -138.000
24    1502.000
Name: time_values, dtype: float64

We see that indices 0, 15 and 16 of the original dataframe have series that start after 1600 CE, which wouldn’t leave enough time to capture centennial-scale patterns. Let’s filter them out:

idx = tmin[tmin <= 1600].index
dfs_pre1600 = dfs.loc[idx]  # equivalently: dfs[tmin <= 1600], since tmin shares dfs's index
dfs_pre1600.info()
<class 'pandas.core.frame.DataFrame'>
Index: 14 entries, 3 to 24
Data columns (total 16 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   dataSetName             14 non-null     object 
 1   archiveType             14 non-null     object 
 2   geo_meanLat             14 non-null     float64
 3   geo_meanLon             14 non-null     float64
 4   geo_meanElev            14 non-null     float64
 5   paleoData_variableName  14 non-null     object 
 6   paleoData_values        14 non-null     object 
 7   paleoData_units         12 non-null     object 
 8   paleoData_proxy         12 non-null     object 
 9   paleoData_proxyGeneral  0 non-null      object 
 10  time_variableName       14 non-null     object 
 11  time_values             14 non-null     object 
 12  time_units              14 non-null     object 
 13  depth_variableName      2 non-null      object 
 14  depth_values            2 non-null      object 
 15  depth_units             2 non-null      object 
dtypes: float64(3), object(13)
memory usage: 1.9+ KB

This narrows it down to 14 records. Next, let’s look at resolution:

res_med = dfs_pre1600['time_values'].apply(lambda x: np.median(np.abs(np.diff(x)))) # looking at median spacing between time points for each record
sns.histplot(res_med)
<Axes: xlabel='time_values', ylabel='Count'>
../_images/3465372500585c2e0c4628e8583eb0869c935a2e21a5a8f0e675fa1d09127039.png

Because PCA requires a common time grid, we have to be a little careful about the resolution of the records. Let’s select those with a resolution finer than 20 years:

dfs_res = dfs_pre1600[res_med <= 20]  # boolean mask; res_med shares dfs_pre1600's index
dfs_res.info()
<class 'pandas.core.frame.DataFrame'>
Index: 9 entries, 5 to 24
Data columns (total 16 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   dataSetName             9 non-null      object 
 1   archiveType             9 non-null      object 
 2   geo_meanLat             9 non-null      float64
 3   geo_meanLon             9 non-null      float64
 4   geo_meanElev            9 non-null      float64
 5   paleoData_variableName  9 non-null      object 
 6   paleoData_values        9 non-null      object 
 7   paleoData_units         7 non-null      object 
 8   paleoData_proxy         9 non-null      object 
 9   paleoData_proxyGeneral  0 non-null      object 
 10  time_variableName       9 non-null      object 
 11  time_values             9 non-null      object 
 12  time_units              9 non-null      object 
 13  depth_variableName      0 non-null      object 
 14  depth_values            0 non-null      object 
 15  depth_units             0 non-null      object 
dtypes: float64(3), object(13)
memory usage: 1.2+ KB

Now we are down to 9 proxy records. Next, we iterate over the rows of this dataframe, create a GeoSeries object for each proxy record, and bundle them all into a MultipleGeoSeries object (it will become obvious why in a second):

ts_list = []
for _, row in dfs_res.iterrows():
    ts_list.append(pyleo.GeoSeries(time=row['time_values'],value=row['paleoData_values'],
                                   time_name=row['time_variableName'],value_name=row['paleoData_variableName'],
                                   time_unit=row['time_units'], value_unit=row['paleoData_units'],
                                   lat = row['geo_meanLat'], lon = row['geo_meanLon'],
                                   elevation = row['geo_meanElev'], observationType = row['paleoData_proxy'],
                                   archiveType = row['archiveType'], verbose = False, 
                                   label=row['dataSetName']+'_'+row['paleoData_variableName'])) 

Euro2k = pyleo.MultipleGeoSeries(ts_list, label='Euro2k', time_unit='year AD')  

Once in this form, it’s trivial to map these records all at once:

fig, ax = Euro2k.map()
../_images/e10d155f8727aac07472c88e660f67706f893c2f621368ccf06afbff2aad95e6.png

For a relatively localized dataset like this, the default projection is “Orthographic” because it minimally distorts the represented surface, and reminds us that the Earth is (nearly) round (unlike some other, unnamed projections). For a full list of allowable projections, see the cartopy documentation. For more on this, see the mapping notebook.

Because it may be relevant to interpreting the results, let’s map elevation:

fig, ax = Euro2k.map(hue='elevation')
../_images/e4814bede522b541a127bea28136ce072246cd7db9c0519f3bc6dbe59f1d9112.png

Most records are near sea level, though a few in the Pyrenees and Alps sit at altitudes of several thousand meters. It is critical to look at the records themselves prior to feeding them to PCA:

fig, ax = Euro2k.stackplot(v_shift_factor=1.2)
../_images/d1e265bf0f73a1e8d58103c89dab5122863c4d64063bacc862f60cb4d97204bc.png

No red flags; we are ready to proceed with PCA. Notice here that Euro2k, being an instance of MultipleGeoSeries, has access to all the methods of its parent class MultipleSeries, including stackplot().

Principal Component Analysis#

Now let’s perform PCA. To do this, all the Series objects within the collection must be on a common time axis, and it’s also prudent to standardize them so differences in units don’t mess up the scaling of the data matrix on which PCA operates. We can perform both operations at once using method-chaining:

Euro2kc = Euro2k.common_time().standardize()
fig, ax = Euro2kc.stackplot(v_shift_factor=1.2)
../_images/a0ad096dc186ccc46aa0da9f530aa4f05907e05357c00cd7cbf84a5287a19e21.png

The PCA step itself is extremely anticlimactic:

pca = Euro2kc.pca()
type(pca)
pyleoclim.core.multivardecomp.MultivariateDecomp

We placed the result of the operation into an object called “pca” (we felt very inspired when we wrote this), which is of type MultivariateDecomp and contains the following information:

pca.__dict__.keys()
dict_keys(['name', 'eigvals', 'eigvecs', 'pctvar', 'pcs', 'neff', 'orig'])
  • name is a string used to label plots

  • eigvals is an array containing the eigenvalues of the covariance matrix

  • pctvar is the % of the total variance accounted for by each mode (derived from the eigenvalues)

  • eigvecs is an array containing the eigenvectors of the covariance matrix

  • pcs is an array containing the temporal loadings corresponding to these eigenvectors

  • neff is an estimate of the effective sample size (“degrees of freedom”) associated with each principal component. Because of autocorrelation, this number is usually much smaller than the length of the timeseries itself.

  • orig is the original MultipleGeoSeries (or MultipleSeries) object on which the multivariate decomposition was applied.

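The relationship between eigvals and pctvar is just a normalization; it can be sketched with hypothetical eigenvalues:

```python
import numpy as np

eigvals = np.array([4.2, 2.9, 2.7, 2.4, 2.1])  # hypothetical eigenvalues
pctvar = 100 * eigvals / eigvals.sum()         # % of total variance per mode
print(pctvar.round(2))                         # decreasing values that sum to 100
```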
The first thing one should look at in such an object is the eigenvalue spectrum, often called a scree plot:

pca.screeplot()
The provided eigenvalue array has only one dimension. UQ defaults to NB82
(<Figure size 600x400 with 1 Axes>,
 <Axes: title={'center': 'Euro2k PCA eigenvalues'}, xlabel='Mode index $i$', ylabel='$\\lambda_i$'>)
../_images/a498eed3754b77262434f036798c7f9f5a658065c6f9af8d5582ae5566c9e7e3.png

As is nearly always the case with geophysical timeseries, the first handful of eigenvalues truly overwhelm the rest. Eigenvalues are, by themselves, not all that informative; their meaning comes from the fact that their normalized value quantifies the fraction of variance accounted for by each mode. That is readily accessed via:

pca.pctvar[:5] # display only the first 5 values, for brevity
array([20.56227428, 14.00221086, 13.41751132, 12.01685311, 10.30408637])

The first mode accounts for about 20% of the total variance; let’s look at it via modeplot():

pca.modeplot()
(<Figure size 800x800 with 4 Axes>,
 {'pc': <Axes: xlabel='Time [year AD]', ylabel='$PC_1$'>,
  'psd': <Axes: xlabel='Period [year]', ylabel='PSD'>,
  'map': {'cb': <Axes: ylabel='EOF'>,
   'map': <GeoAxes: xlabel='lon', ylabel='lat'>}})
../_images/e9fd44faf0bc0d5a823bc09a205783d63c8fa20416c6e1b9a97c0ee6ac9245ab.png

Three features jump out at us:

  • first, the principal component displays a sharp transition around the year 1675.

  • second, the figure includes the PC’s spectrum (here, using the MTM method), which is dominated by variations at periods near 200 years, corresponding to the shift in the middle of the series.

  • third, the default map projection is again orthographic, but it can be adjusted. For instance, let’s set a stereographic projection with bounds appropriate for Europe:

europe = [-10, 30,35,80]
pca.modeplot(map_kwargs={'projection':'Stereographic','extent':europe})
(<Figure size 800x800 with 4 Axes>,
 {'pc': <Axes: xlabel='Time [year AD]', ylabel='$PC_1$'>,
  'psd': <Axes: xlabel='Period [year]', ylabel='PSD'>,
  'map': {'cb': <Axes: ylabel='EOF'>,
   'map': <GeoAxes: xlabel='lon', ylabel='lat'>}})
../_images/c097d989b9bcf4f14e553bb039a95bd5c292c8d970769c33181cebae5ba8fa0d.png

Now let’s look at the second mode:

pca.modeplot(index=1, map_kwargs={'projection':'Stereographic','extent':europe})
(<Figure size 800x800 with 4 Axes>,
 {'pc': <Axes: xlabel='Time [year AD]', ylabel='$PC_2$'>,
  'psd': <Axes: xlabel='Period [year]', ylabel='PSD'>,
  'map': {'cb': <Axes: ylabel='EOF'>,
   'map': <GeoAxes: xlabel='lon', ylabel='lat'>}})
../_images/9530070387001dc40638285fc669f01b26b67d9dab0c1cb95e2b576b59997d78.png

This one has a more uniform spatial structure (loadings of one sign, with one exception). This PC’s spectrum shows energetic variability at multidecadal timescales. On to the third mode:

pca.modeplot(index=2, map_kwargs={'projection':'Stereographic','extent':europe})
(<Figure size 800x800 with 4 Axes>,
 {'pc': <Axes: xlabel='Time [year AD]', ylabel='$PC_3$'>,
  'psd': <Axes: xlabel='Period [year]', ylabel='PSD'>,
  'map': {'cb': <Axes: ylabel='EOF'>,
   'map': <GeoAxes: xlabel='lon', ylabel='lat'>}})
../_images/35d61ba21c423190167fe781ddb58069de08b4ee9a1ea579ea5746cee869c211.png

This PC shows evidence of scaling behavior, as well as decadal variability. To look at this in more detail, you could place the principal component into a Series object and apply the tools of spectral analysis:

pc2 = pyleo.Series(value=pca.pcs[:,2], verbose=False, label='Euro2k PC3',
                   time = Euro2kc.series_list[0].time) # remember Python's zero-based indexing
pc2spec = pc2.spectral(method='mtm')                
pc2spec.beta_est().plot()
(<Figure size 1000x400 with 1 Axes>,
 <Axes: xlabel='Period [years]', ylabel='PSD'>)
../_images/ad2d8dcbbf229bbd6561793046a17763da6bd6eca847af6c98383e84ef2df3fa.png

Indeed, the spectrum seems compatible with a shape \(S(f) \propto f^{-\beta}\), where \(\beta \approx 0.68\).
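For intuition, a scaling exponent of this kind is typically obtained by linear regression in log-log space. Here is a sketch on synthetic spectrum values (not Pyleoclim’s actual beta_est() implementation):

```python
import numpy as np

# Synthetic spectrum following S(f) = f^(-0.68), with multiplicative noise
rng = np.random.default_rng(0)
f = np.logspace(-3, -1, 100)                       # frequency axis
S = f**(-0.68) * np.exp(0.1 * rng.normal(size=f.size))

# beta is minus the slope of log S vs log f
slope, intercept = np.polyfit(np.log(f), np.log(S), 1)
beta = -slope
print(round(beta, 2))                              # close to 0.68 by construction
```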

As usual with PCA, we could go down the list, but the interpretability becomes increasingly difficult. You have enough information to proceed with your own analysis now.

Monte Carlo PCA#

When age ensembles are available, we can use Monte Carlo PCA (MC-PCA) to robustly account for age uncertainty in such analyses.

PAGES 2k does not come with age ensembles, so for this example we generate our own using the Pyleoclim function random_time_axis(). To emulate real proxies, we assign no age uncertainty to documentary records, minimal age uncertainties to tree-ring records, and larger age uncertainties to sedimentary records. Note that this is a very crude approximation of real processes, used here merely for illustration purposes. A research-grade assessment of age uncertainties should proceed much more carefully. All we need here is to generate reasonable ensembles so we can show how to analyze and visualize them.

mul_ens_list = []
num = 1000

# Age-perturbation step sizes (in years) by archive type. Speleothems and wood
# are typically well dated, so their perturbations are small; sediments get
# larger age errors; documents (and any archive not listed) are left unperturbed.
steps = {'Speleothem':      [-3, -1, 0, 1, 3],
         'Wood':            [-.5, -.2, 0, .2, .5],
         'Lake sediment':   [-2, -.5, 0, .5, 2],
         'Marine sediment': [-4, -2, 0, 2, 4]}
probs = [0.02, 0.04, 0.88, 0.04, 0.02]  # probability of each step size

for series in Euro2k.series_list:
    ens_list = []
    for _ in range(num):
        series_copy = series.copy()
        if series.archiveType in steps:
            pert = pyleo.utils.tsmodel.random_time_axis(len(series_copy.time),
                                                        delta_t_dist='random_choice',
                                                        param=[steps[series.archiveType], probs])
            series_copy.time += pert
        ens_list.append(series_copy)
    ens = pyleo.EnsembleGeoSeries(ens_list, lat=series.lat, lon=series.lon,
                                  elevation=series.elevation, archiveType=series.archiveType)
    mul_ens_list.append(ens)

megs = pyleo.MulEnsGeoSeries(mul_ens_list)

We can take a look at our creation using the stackplot() method:

megs.stackplot(figsize=(10,10),plot_style='envelope',ylims='spacious',
               v_shift_factor=1.2, yticks_minor =True)
(<Figure size 1000x1000 with 10 Axes>,
 {0: <Axes: ylabel='d18O [permil]'>,
  1: <Axes: ylabel='temperature [degC]'>,
  2: <Axes: ylabel='d18O [permil]'>,
  3: <Axes: ylabel='temperature [degC]'>,
  4: <Axes: ylabel='temperature [degC]'>,
  5: <Axes: ylabel='trsgi'>,
  6: <Axes: ylabel='temperature [degC]'>,
  7: <Axes: ylabel='MXD'>,
  8: <Axes: ylabel='temperature [degC]'>,
  'x_axis': <Axes: xlabel='Time [year AD]'>})
../_images/8369d5fd04445ffd5c007c23283c4bf51438b224cae53ee5644178df18c89781.png

Once we’ve created these ensembles, applying MC-PCA is a breeze (and again fairly anticlimactic):

mcpca = megs.mcpca()
type(mcpca)
Iterating over simulations: 100%|██████████| 1000/1000 [00:34<00:00, 29.12it/s]
pyleoclim.core.ensmultivardecomp.EnsMultivarDecomp

The contents of the EnsMultivarDecomp object are just a list of many PCA objects, with an optional label tag:

mcpca.__dict__.keys()
dict_keys(['pca_list', 'label'])

We can again produce a screeplot, which takes the form of a violin plot since each eigenvalue \(\lambda_i\) contains 1,000 entries from the Monte Carlo simulations:

mcpca.screeplot()
(<Figure size 800x800 with 1 Axes>,
 <Axes: title={'center': 'Screeplot'}, xlabel='Mode index $i$', ylabel='$\\lambda_i$'>)
../_images/534c12ac6308213d295d784b5f994606e21fa1e3e0533ca9a036474474e03174.png

modeplot() works similarly to the non-ensemble case, but now the PC trace and the spectra both showcase various essential quantiles:

# Modeplot with some extra quality of life kwargs

mcpca.modeplot(
    index=0,
    scatter_kwargs={'sizes':(20,400),'linewidth':.1},
    map_kwargs={'projection':'Mollweide','extent':[-20,40,30,90]}
)
Performing spectral analysis on individual series: 100%|██████████| 1000/1000 [00:01<00:00, 563.19it/s]
(<Figure size 800x800 with 4 Axes>,
 {'pc': <Axes: xlabel='Time [year AD]', ylabel='$PC_1$'>,
  'psd': <Axes: xlabel='Period [year AD]', ylabel='PSD'>,
  'map': {'map': <GeoAxes: xlabel='lon', ylabel='lat'>, 'leg': <Axes: >}})
../_images/d5ace29e3750e8b3e1ac9a481ce05db182538397c558697746771a0e3547016f.png

modeplot() produces a three-panel figure visualizing Monte Carlo Principal Component Analysis (MC-PCA) results for the selected PC (indicated via the index argument):

  1. Top left: Selected PC time series with uncertainty shading.

  2. Top right: Power spectral density of the selected PC.

  3. Bottom: Global map of EOF loadings for the selected PC.

The map represents sites as colored dots, where colors indicate quantiles of the EOF loading distribution across Monte Carlo iterations. For example, the 25th quantile is the EOF value below which 25% of the iterations fall. Lighter shades denote lower quantiles, darker shades higher ones. The hue encodes the sign of the loading: blue signifies negative loadings, red positive. Dot size corresponds to loading magnitude. This representation of the spatial loadings follows the convention of Tierney et al. (2013). Concentric circles of markedly different diameters are characteristic of large uncertainties; conversely, circles of similar diameter indicate consistent loadings across the age ensemble.
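The quantile dots can be emulated with a toy computation: take one site’s EOF loading across all Monte Carlo iterations (hypothetical values here) and compute distribution quantiles:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical loadings of one site on PC1 across 1000 age-model iterations
loadings = 0.6 + 0.1 * rng.normal(size=1000)

q = np.quantile(loadings, [0.25, 0.50, 0.75])
print(q.round(2))  # nested circles would be drawn at these three values
```

When the quantiles are close together, the circles nearly coincide, signaling a loading that is robust to age uncertainty; widely separated quantiles produce the concentric-circle pattern described above.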

Qualitatively the result is quite similar to the non-ensemble PCA case, although the transition around 1700 C.E. is smoother. This is perhaps in line with expectations, as the inclusion of age uncertainty will spread the timing of abrupt transitions slightly.

Most records show uniformly strong loadings on this first mode, as indicated by the large, solid circles. The easternmost record shows a loading opposite to the rest, indicating a potential spatial anticorrelation. Two records, namely the wood archives from southern France and Finland, show heterogeneity in their loading directions (indicated by the superposition of warm- and cool-toned circles), suggesting a result that is not robust across their respective age models.

However, we note that the age ensembles we created for this example are entirely artificial, so this result won’t bear a great deal of scrutiny.

To look at higher modes:

mcpca.modeplot(
    index=1,
    scatter_kwargs={'sizes':(20,400),'linewidth':.1},
    map_kwargs={'projection':'Mollweide','extent':[-20,40,30,90]}
)
Performing spectral analysis on individual series: 100%|██████████| 1000/1000 [00:01<00:00, 558.66it/s]
(<Figure size 800x800 with 4 Axes>,
 {'pc': <Axes: xlabel='Time [year AD]', ylabel='$PC_2$'>,
  'psd': <Axes: xlabel='Period [year AD]', ylabel='PSD'>,
  'map': {'map': <GeoAxes: xlabel='lon', ylabel='lat'>, 'leg': <Axes: >}})
../_images/3e928d2946777691678e21d47e339dda684a65233b02d2cc6c3ca98e2108e48f.png

This looks a lot like noise, as almost all of the records show heterogeneous loadings, and probably shouldn’t be interpreted too deeply.

One issue that you may run into with MC-PCA is a misalignment of PCs. Because the sign of a PC is arbitrary, there is no guarantee that different Monte Carlo simulations will have consistently aligned PCs (the sign can flip on a dime). Some work is done behind the scenes to minimize this issue (note the alignment_method argument of mcpca()); however, for the lower-variance modes, such techniques can only go so far in properly aligning the PCs. That is to say, great caution should be exercised when interpreting MC-PCA modes that account for low percentages of the total variance.
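The idea behind sign alignment can be illustrated simply: flipping a PC’s sign (together with its EOF’s) leaves the decomposition unchanged, so one common remedy is to correlate each simulation’s PC with a reference and flip it when the correlation is negative. A minimal sketch on synthetic PCs (not Pyleoclim’s actual alignment_method):

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.linspace(0, 1, 100)
reference = np.sin(2 * np.pi * t)   # PC from a reference decomposition

# Simulated PCs: same underlying signal, random sign, small noise
pcs = [s * (reference + 0.1 * rng.normal(size=t.size))
       for s in rng.choice([-1, 1], size=20)]

# Flip any PC that anticorrelates with the reference
aligned = [pc if np.corrcoef(pc, reference)[0, 1] > 0 else -pc for pc in pcs]
```

This works well when the signal is strong; for low-variance modes the correlations hover near zero and flips become unreliable, which is exactly the caution raised above.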

Conclusion#

  • PCA can be applied to both MultipleGeoSeries and MulEnsGeoSeries objects

  • The latter are appropriate when age ensembles are available (“MC-PCA”).

  • In Pyleoclim, both methods come with customizable plotting functions, allowing results to be visualized in an intuitive way.

If you have suggestions for additional features or domains of application, feel free to discuss them on our discourse forum or to open an issue on GitHub.

%watermark -n -u -v -iv -w
Last updated: Tue Jul 16 2024

Python implementation: CPython
Python version       : 3.11.9
IPython version      : 8.25.0

seaborn  : 0.13.2
pyleoclim: 1.0.0b0
numpy    : 1.26.4

Watermark: 2.4.3