Tracing Transformations#

by Julien Emile-Geay

It can sometimes be difficult to remember all the transformations that you apply to your data in the course of your work. Quite often, this results in reported protocols that do not match the actual steps taken during an analysis, so published results may not be readily reproduced. We illustrate this using the example from the spectral analysis tutorial.

Goals#

Learn how to use the keep_log parameter to keep track of all transformations applied to the timeseries

Reading Time: 10 minutes

Keywords#

Provenance

Pre-Requisites#

This notebook re-uses the analysis performed in the spectral analysis tutorial. A grasp of that tutorial is necessary.

Relevant Packages#

Pyleoclim

Data Description#

  • Lisiecki, L. E., and Raymo, M. E. (2005), A Pliocene-Pleistocene stack of 57 globally distributed benthic δ18O records, Paleoceanography, 20, PA1003, doi:10.1029/2004PA001071.

Demonstration#

%load_ext watermark

import pyleoclim as pyleo
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Let’s first import the LR04 benthic stack into a Series object:

lr04 = pyleo.utils.load_dataset('LR04')

In L2_spectral_analysis.ipynb, we plotted this iconic compilation with the y axis inverted:

fig, ax = lr04.plot(legend=False, invert_yaxis=True) 
../_images/1d3b1dd0e7ce1e0d1a55d5f0adf712c1265fcf8e05379a4b3775d24d39e6e352.png

Let’s work with the first 3 million years (3000 kyr) of the record for illustrative purposes:

lr04_s = lr04.slice([0,3000])
fig,ax = lr04_s.plot(legend=False, invert_yaxis=True) 
../_images/44f3bce4dbec05e04c0b8bdc0f96295cfa315bdb6a49b3d753473cb91d8850be.png

In L2_spectral_analysis.ipynb, we applied 3 transformations to the data prior to conducting spectral analysis: detrending, interpolation (each with default values), and standardization. If we ask to keep a log at each step by passing keep_log=True, the resulting Series object will contain a field called log, which can be mined for information:

lr04_seq = lr04_s.detrend(keep_log=True).interp(keep_log=True).standardize(keep_log=True)
lr04_seq.log
({0: 'detrend',
  'method': 'emd',
  'args': {},
  'previous_trend': array([4.11436493, 4.11437622, 4.11438901, ..., 3.62013935, 3.62016692,
         3.62019564])},
 {1: 'interp', 'method': 'linear', 'args': {}},
 {2: 'standardize',
  'args': 1,
  'previous_mean': -0.03990010536672826,
  'previous_std': 0.3316320197708689})

This log is a tuple of dictionaries. Dictionaries are very flexible, so many fields can be traced for each step; tuples are immutable, so the order of the steps cannot be changed. This log reveals several things:

  • the series was cleaned upon creation, as is the default; note that this step is not recorded in the log shown above, whose first entry is the detrending.

  • the series was then detrended using empirical mode decomposition (see here for details). The original trend is kept as an array in case you need it later (e.g. to add it back).

  • the Series was then interpolated using linear interpolation with default parameters.

  • then it was standardized, and the log kept the mean and standard deviation in case they are needed later.
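The entries described above can also be read programmatically. The following is a minimal sketch in plain Python, assuming each entry holds exactly one integer key that maps the step's position to its name, as in the output shown earlier. The log is copied by hand from that output (with the trend array replaced by a placeholder string), and `summarize_log` is a hypothetical helper, not a Pyleoclim function.

```python
# Hand-copied from the log shown above. The trend array is replaced by a
# placeholder string because only its first and last values are displayed.
log = (
    {0: 'detrend', 'method': 'emd', 'args': {}, 'previous_trend': '<array>'},
    {1: 'interp', 'method': 'linear', 'args': {}},
    {2: 'standardize', 'args': 1,
     'previous_mean': -0.03990010536672826,
     'previous_std': 0.3316320197708689},
)


def summarize_log(log):
    """Return (position, step name, method) for each entry of a log tuple.

    Hypothetical helper: assumes each entry has exactly one integer key,
    mapping the step's position to the step's name.
    """
    steps = []
    for entry in log:
        # The integer key is the step's position; its value is the step name.
        position = next(k for k in entry if isinstance(k, int))
        steps.append((position, entry[position], entry.get('method')))
    return steps


for position, name, method in summarize_log(log):
    print(position, name, method)
```

Steps that do not record a method (here, standardize) are reported with `None` in the last column.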

If you wanted to access the original trend, you would go for the first element of this tuple (index 0), and use the dictionary key ‘previous_trend’:

lr04_seq.log[0]['previous_trend']
array([4.11436493, 4.11437622, 4.11438901, ..., 3.62013935, 3.62016692,
       3.62019564])

This can be plotted alongside the original series:

fig, ax = lr04_s.plot(title='LR04 trend extraction', invert_yaxis=True)
ax.plot(lr04_s.time,lr04_seq.log[0]['previous_trend'],label='EMD trend')
ax.legend(); 
../_images/54db91f280cdf9771bcfac57e0e67cf7cce0fd7e1d08aa1fab2e3343c2fdda2b.png
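Beyond plotting, the logged quantities are what make it possible to undo a transformation. The sketch below illustrates the idea with NumPy on synthetic data rather than on LR04. It assumes a simplified chain of detrending followed by standardization, with a log built by hand to mimic the structure shown above. In the chain used in this tutorial, an interpolation step sits between the two, so the logged trend would first have to be brought onto the interpolated time axis; that step is left out here.

```python
import numpy as np

# Synthetic stand-in for a series: a linear trend plus an oscillation.
time = np.linspace(0, 100, 201)
trend = 3.0 + 0.01 * time                       # plays the role of 'previous_trend'
original = trend + 0.3 * np.sin(2 * np.pi * time / 41)

# Forward steps, recording what each one removes (mimicking keep_log=True).
detrended = original - trend
mean, std = detrended.mean(), detrended.std()
standardized = (detrended - mean) / std
log = (
    {0: 'detrend', 'previous_trend': trend},
    {1: 'standardize', 'previous_mean': mean, 'previous_std': std},
)

# Reverse steps, applied in the opposite order, using only logged quantities.
restored = standardized * log[1]['previous_std'] + log[1]['previous_mean']
restored = restored + log[0]['previous_trend']

# Check that the round trip recovers the original values.
print(np.allclose(restored, original))
```

The order matters: the last transformation applied is the first one to be undone.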

So far we have worked with the original values, and only flipped the y axis of the plots. We may instead wish to flip the values themselves:

lr04_f = lr04.flip(axis='value', keep_log=True)
lr04_f.plot()
(<Figure size 1000x400 with 1 Axes>,
 <Axes: xlabel='Age [ky BP]', ylabel='$\\delta^{18} \\mathrm{O}$ x (-1) [‰]'>)
../_images/c5ab65b8f8cafa36805f813c53d4d45fdfaa6c5e21e3baa07645e285ec587cf5.png

Note that the label has been updated to reflect the flip. If we now apply the same sequence of transformations as before, this will be reflected in the log:

lr04_fseq = lr04_f.detrend(keep_log=True).interp(keep_log=True).standardize(keep_log=True)
lr04_fseq.log
({0: 'flip', 'applied': True, 'axis': 'value'},
 {1: 'detrend',
  'method': 'emd',
  'args': {},
  'previous_trend': array([-4.11361122, -4.11359379, -4.11357686, ..., -2.92825926,
         -2.92825817, -2.92826008])},
 {2: 'interp', 'method': 'linear', 'args': {}},
 {3: 'standardize',
  'args': 1,
  'previous_mean': 0.000358900643691679,
  'previous_std': 0.25330907793231616})
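To make the accumulation of entries concrete, here is a toy sketch of how a log stored as a tuple of dictionaries can grow along a chain of method calls. The class `LoggedValues` and its methods are hypothetical and written only for illustration; they are not Pyleoclim's Series class and make no claim about how Pyleoclim implements keep_log internally.

```python
class LoggedValues:
    """Hypothetical toy container; not Pyleoclim's Series class."""

    def __init__(self, values, log=()):
        self.values = list(values)
        self.log = tuple(log)

    def _extended_log(self, name, **info):
        # A tuple cannot be modified in place, so build a new, longer one.
        entry = {len(self.log): name, **info}
        return self.log + (entry,)

    def flip(self, keep_log=False):
        new_log = self._extended_log('flip', applied=True) if keep_log else self.log
        return LoggedValues([-v for v in self.values], new_log)

    def center(self, keep_log=False):
        mean = sum(self.values) / len(self.values)
        new_log = self._extended_log('center', previous_mean=mean) if keep_log else self.log
        return LoggedValues([v - mean for v in self.values], new_log)


raw = LoggedValues([1.0, 2.0, 6.0])
out = raw.flip(keep_log=True).center(keep_log=True)

print(out.log)  # two entries, in the order the steps were applied
print(raw.log)  # the input object's log is left unchanged
```

Because each step returns a new object carrying a new tuple, earlier objects in the chain keep the log they had when they were created.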

The Series can be flipped back, but a warning (not an error) will be issued to alert users that flipping was previously applied:

lr04_ff = lr04_fseq.flip(keep_log=True)
/Users/julieneg/Documents/GitHub/Pyleoclim_util/pyleoclim/core/series.py:764: UserWarning: this Series' log indicates that it has previously been flipped
  warnings.warn("this Series' log indicates that it has previously been flipped")
lr04_ff.plot()
(<Figure size 1000x400 with 1 Axes>,
 <Axes: xlabel='Age [ky BP]', ylabel='$\\delta^{18} \\mathrm{O}$ x (-1) x (-1) [‰]'>)
../_images/2465bc51c1df5636cc293a0e9422c7130ad530800c5e7713309fd83e282cc7fc.png

In this case, the label now carries two "x (-1)" factors, so one might want to clean up the value_name property by copying the original one:

lr04_ff.value_name = lr04.value_name

Finally, the log records this double flip as well:

lr04_ff.log
({0: 'flip', 'applied': True, 'axis': 'value'},
 {1: 'detrend',
  'method': 'emd',
  'args': {},
  'previous_trend': array([-4.11361122, -4.11359379, -4.11357686, ..., -2.92825926,
         -2.92825817, -2.92826008])},
 {2: 'interp', 'method': 'linear', 'args': {}},
 {3: 'standardize',
  'args': 1,
  'previous_mean': 0.000358900643691679,
  'previous_std': 0.25330907793231616},
 {4: 'flip', 'applied': True, 'axis': 'value'})
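A log like this one can be checked for the net orientation of the values. The sketch below, in plain Python, counts the flip entries in a hand-copied and abbreviated version of the log above (the trend, mean and standard deviation fields are omitted). `count_flips` is a hypothetical helper, not part of Pyleoclim.

```python
# Abbreviated, hand-copied version of the log shown above.
log = (
    {0: 'flip', 'applied': True, 'axis': 'value'},
    {1: 'detrend', 'method': 'emd', 'args': {}},
    {2: 'interp', 'method': 'linear', 'args': {}},
    {3: 'standardize', 'args': 1},
    {4: 'flip', 'applied': True, 'axis': 'value'},
)


def count_flips(log, axis='value'):
    """Count log entries that record a flip along the given axis.

    Hypothetical helper: assumes the step name is stored under an
    integer key, as in the log above.
    """
    n = 0
    for entry in log:
        is_flip = any(isinstance(k, int) and v == 'flip' for k, v in entry.items())
        if is_flip and entry.get('axis') == axis:
            n += 1
    return n


n_flips = count_flips(log)
net_sign = (-1) ** n_flips  # +1: original orientation, -1: flipped
print(n_flips, net_sign)
```

An even number of flips only means the orientation matches the original; the values themselves still differ from the raw data because of the detrending, interpolation and standardization recorded in between.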
%watermark -n -u -v -iv -w
Last updated: Mon Mar 04 2024

Python implementation: CPython
Python version       : 3.11.7
IPython version      : 8.20.0

pyleoclim : 0.13.1b0
numpy     : 1.26.3
matplotlib: 3.8.2
pandas    : 2.1.4

Watermark: 2.4.3