Working with Pandas#

by Julien Emile-Geay

Preamble#

Pandas is part of nearly every data scientist’s toolkit, with robust support for spreadsheet-like data structures. Until 2023, however, this workhorse of time series analysis was unusable in the paleogeosciences, because Pandas long ago hardcoded nanoseconds as the base unit of time. This limited the timescales it can represent on a 64-bit machine to a relatively narrow timespan of 585 years, thus excluding many paleogeoscience applications (for more explanations, see this blog post). In this notebook we explore the synergies between pandas and Pyleoclim.

Goals:#

  • Learn to export Pyleclim Series to Pandas Series.

  • Unit conversions

  • Select portions of a time series based on time or value criteria

  • Export/Import to csv file

  • Resampling in time

  • Handling MultipleSeries with Pandas

Reading time#

10 minutes

Keywords#

Pandas; Time Units; Resampling in time; Selection

Pre-Requisites#

Knoweldge of Pyleoclim Series and MultipleSeries objects. Familiarity with Pandas.

Relevant Packages#

Pandas

Data Description#

  • Sea-surface temperature from Kaplan (1998) averaged over the NINO3 (5N-5S, 150W-190E)) region.

  • Bereiter, B., Eggleston, S., Schmitt, J., Nehrbass-Ahles, C., Stocker, T. F., Fischer, H., … Chappellaz, J. (2015). Revision of the EPICA Dome C CO2 record from 800 to 600 kyr before present. Geophysical Research Letters, 42(2), 542–549. doi:10.1002/2014gl061957

Demonstration#

%load_ext watermark
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import pyleoclim as pyleo
pd.__version__
'2.1.4'

Let’s load the SOI timeseries:

ts = pyleo.utils.load_dataset('SOI')
ts.plot(invert_yaxis=True) # invert y axis so El Niño events plot upward
(<Figure size 1000x400 with 1 Axes>,
 <Axes: xlabel='Time [year C.E.]', ylabel='SOI [mb]'>)
../_images/39035d0d74625148c4fbd52515e09e1c7b16cef2a9f4fcbce902e3acdc319171.png

There are two properties attached to pyleo.Series objects:

  1. a Pandas datetime_index:

ts.datetime_index
DatetimeIndex(['1950-12-31 12:42:56', '1951-01-30 23:11:49',
               '1951-03-02 09:41:14', '1951-04-01 20:10:07',
               '1951-05-02 06:39:01', '1951-06-01 17:08:26',
               '1951-07-02 03:37:19', '1951-08-01 14:06:12',
               '1951-09-01 00:35:37', '1951-10-01 11:04:30',
               ...
               '2019-03-01 20:57:20', '2019-04-01 07:26:14',
               '2019-05-01 17:55:07', '2019-06-01 04:24:32',
               '2019-07-01 14:53:25', '2019-08-01 01:22:19',
               '2019-08-31 11:51:43', '2019-09-30 22:20:37',
               '2019-10-31 08:49:30', '2019-11-30 19:18:55'],
              dtype='datetime64[s]', name='datetime', length=828, freq=None)
  1. a dictionary bundling all the metadata:

ts.metadata
{'time_unit': 'year C.E.',
 'time_name': 'Time',
 'value_unit': 'mb',
 'value_name': 'SOI',
 'label': 'Southern Oscillation Index',
 'archiveType': 'Instrumental',
 'importedFrom': None,
 'log': None}

Invoking the object itself returns some essential metadata and a compressed view of the data.

ts
{'archiveType': 'Instrumental', 'label': 'Southern Oscillation Index'}
None
Time [year C.E.]
1951.000000    1.5
1951.083333    0.9
1951.166667   -0.1
1951.250000   -0.3
1951.333333   -0.7
              ... 
2019.583333   -0.1
2019.666667   -1.2
2019.750000   -0.4
2019.833333   -0.8
2019.916667   -0.6
Name: SOI [mb], Length: 828, dtype: float64

For a prettier display (in Jupyter notebook only):

ts.view()
SOI [mb]
Time [year C.E.]
1951.000000 1.5
1951.083333 0.9
1951.166667 -0.1
1951.250000 -0.3
1951.333333 -0.7
... ...
2019.583333 -0.1
2019.666667 -1.2
2019.750000 -0.4
2019.833333 -0.8
2019.916667 -0.6

828 rows × 1 columns

.to_pandas()#

It is easy to export a Pyleoclim Series to a Pandas Series:

pdts = ts.to_pandas() #  returns just the Series ; metadata are available at ts.metadata
type(pdts) 
pandas.core.series.Series

It is now a bona fide Pandas Series, and we can do with it everything we might do with Pandas. For example:

pdts.head()
datetime
1950-12-31 12:42:56    1.5
1951-01-30 23:11:49    0.9
1951-03-02 09:41:14   -0.1
1951-04-01 20:10:07   -0.3
1951-05-02 06:39:01   -0.7
Name: SOI, dtype: float64

Or this:

pdts.describe()
count    828.000000
mean       0.119928
std        0.938586
min       -3.600000
25%       -0.500000
50%        0.100000
75%        0.800000
max        2.900000
Name: SOI, dtype: float64

Because Pyleoclim now has Pandas under the hood, one can now apply any Pandas method to a Pyleoclim Series, via a lambda function. For instance, applying an exponential transform to the data:

ts.pandas_method(lambda x: x.transform(np.exp))
{'archiveType': 'Instrumental', 'label': 'Southern Oscillation Index'}
None
Time [year C.E.]
1951.000000    4.481689
1951.083333    2.459603
1951.166667    0.904837
1951.250000    0.740818
1951.333333    0.496585
                 ...   
2019.583333    0.904837
2019.666667    0.301194
2019.750000    0.670320
2019.833333    0.449329
2019.916667    0.548812
Name: SOI [mb], Length: 828, dtype: float64

For more examples, see this page

Unit conversions#

Pyleoclim now comprehends datestring semantics, which enable enhanced conversions betwen time representations. For instance, converting the SOI series to years before 1950 (“BP”):

Selection#

Let’s load a true paleoclimate example, the Dome C \(CO_2\) composite, and use Pandas semantics to select particular values.

tsBP = ts.convert_time_unit('yrs BP')  
fig, ax = tsBP.plot(invert_yaxis=True) # by default, plots represent values in increasing order, so we reverse the x-axis
../_images/1eff2c3c92ed46e64d65934b2e7ef0ebd694c777879c303e84ba1cec6d5ebdb9.png

The Series plots from recent to old, because the Matplotlib plot() function always works with increasing values. However, it is easy to check that it was indeed flipped:

tsBP.view()
SOI [mb]
Age [yrs BP]
-69.914717 1.5
-69.831383 0.9
-69.748050 -0.1
-69.664717 -0.3
-69.581383 -0.7
... ...
-1.331383 -0.1
-1.248050 -1.2
-1.164717 -0.4
-1.081383 -0.8
-0.998050 -0.6

828 rows × 1 columns

The index has been flipped as well:

tsBP.datetime_index
DatetimeIndex(['2019-11-30 19:18:55', '2019-10-31 08:49:30',
               '2019-09-30 22:20:37', '2019-08-31 11:51:43',
               '2019-08-01 01:22:19', '2019-07-01 14:53:25',
               '2019-06-01 04:24:32', '2019-05-01 17:55:07',
               '2019-04-01 07:26:14', '2019-03-01 20:57:20',
               ...
               '1951-10-01 11:04:30', '1951-09-01 00:35:37',
               '1951-08-01 14:06:12', '1951-07-02 03:37:19',
               '1951-06-01 17:08:26', '1951-05-02 06:39:01',
               '1951-04-01 20:10:07', '1951-03-02 09:41:14',
               '1951-01-30 23:11:49', '1950-12-31 12:42:56'],
              dtype='datetime64[s]', name='datetime', length=828, freq=None)

If we wanted to preserve the original time flow (old to recent) in a plot, all you’d have to do is use the invert_xaxis parameter:

tsBP.plot(invert_yaxis=True, invert_xaxis=True, label = 'SOI, years BP', color='C1') 
(<Figure size 1000x400 with 1 Axes>,
 <Axes: xlabel='Age [yrs BP]', ylabel='SOI [mb]'>)
../_images/3e6db6b73bf86e46c3c43aa87717bab9690fb2bb3d06c168a1f106e6d111a42b.png

(this is an admittedly contrived example, as no one in their right mind would cast instrumental series in years BP).

co2ts = pyleo.utils.load_dataset('AACO2')
co2ts.plot(color='gray')
(<Figure size 1000x400 with 1 Axes>,
 <Axes: xlabel='Age [ky BP]', ylabel='$CO_2$ [ppm]'>)
../_images/7c1d476e4f9d5f20e94d155a0c9a5bec1ade3af56a7dcb8f24cee9bdb1805ab9.png
co2ts.metadata
{'time_unit': 'ky BP',
 'time_name': 'Age',
 'value_unit': 'ppm',
 'value_name': '$CO_2$',
 'label': 'EPICA Dome C CO2',
 'archiveType': 'GlacierIce',
 'importedFrom': 'https://www.ncei.noaa.gov/pub/data/paleo/icecore/antarctica/antarctica2015co2composite.txt',
 'log': None}
co2ts.view()
$CO_2$ [ppm]
Age [ky BP]
-0.05103 368.02
-0.04800 361.78
-0.04628 359.65
-0.04441 357.11
-0.04308 353.95
... ...
803.92528 202.92
804.00987 207.50
804.52267 204.86
805.13244 202.23
805.66887 207.29

1901 rows × 1 columns

To select a particular interval, (say the Eemian, 115 to 130 ky BP), you can use sel:

co2_ee = co2ts.sel(time=slice(115,130))
co2_ee.plot(marker='o', color='gray')
(<Figure size 1000x400 with 1 Axes>,
 <Axes: xlabel='Age [ky BP]', ylabel='$CO_2$ [ppm]'>)
../_images/56e52e68278005df4b24353e8819942eefb4ec26a4ba0e0b4a81b2ce58160210.png

If you wanted to extract the time of the maximum value, an easy approach is to use sel again:

co2_ee.sel(value=co2_ee.value.max()).view()
$CO_2$ [ppm]
Age [ky BP]
128.46673 285.76

We zero-ed in on the value ~128.5 ky BP as the highest \(CO_2\) concentration in the Eemian. Note that the result is still a Series object (in this case, a very short one!). Similarly, to identify the timing of glacial maxima by values below 200 ppm, one could select with an open interval:

co2_G = co2ts.sel(value=slice(None,200)) # returns a new Series object
fig, ax = co2ts.plot(color='gray')
co2_G.plot(marker='o',linewidth=0,alpha= 0.5, ax=ax,label=r"$CO_2\leq200$") 
<Axes: xlabel='Age [ky BP]', ylabel='$CO_2$ [ppm]'>
../_images/e18b90de1f6c372d6ecd5f930315ab1fb1669af399dbd41b71a412a61c052a62.png

For interglacials, select half-open intervals the other way:

co2_IG = co2ts.sel(value=slice(270,None)) # returns a new Series object
fig, ax = co2ts.plot(color='gray')
co2_G.plot(marker='o',linewidth=0,alpha= 0.5, ax=ax,label=r"$CO_2\leq200$") 
co2_IG.plot(marker='o',linewidth=0, color = 'C3', alpha= 0.5, ax=ax,label=r"$CO_2\geq270$") 
<Axes: xlabel='Age [ky BP]', ylabel='$CO_2$ [ppm]'>
../_images/b06542e81cd6dcea42bd751ce64a9afd3c4e932377c412ec5f213998888760d4.png

Again, both of these are Series objects, albeit highly discontinuous ones.

Science Remark: admittedly, an absolute threshold for defining interglacials makes little sense. A smarter way would be to use clustering, as implemented in the outliers() method and a dedicated tutorial.

CSV Import/Export#

Pandas integration allows a very easy round trip with CSV files:

Exporting to CSV#

filename = '../data/EPICA_Dome_C_CO2.csv'
co2ts.to_csv(path = filename)
Series exported to ../data/EPICA_Dome_C_CO2.csv

The path name is optional; if no file name is provided (as is the case here), the file is named after the Series label. This is another reason to choose meaningful and relatively concise labels!

Importing from CSV#

Read the file back in and gives the same Series:

co2ts2 = pyleo.Series.from_csv(path='../data/EPICA_Dome_C_CO2.csv')
co2ts2.equals(co2ts) 
Time axis values sorted in ascending order
(True, True)

Note the use of the equals() method, inspired the eponymous method from Pandas. The Pyleoclim implementation returns two boolean objects: the first says whether the two Pandas Series are the same; the second whether the metadata are the same. Fortunately, we landed back on our feet.

Resampling#

One of Pandas’ most powerful capabilities is resampling() series to attain various temporal resolutions. Pyleoclim implements a function of the same name, but using normal “paleo-speak” to define resolution. For instance, let’s coarse-grain on 5,000 year intervals:

co2_5k = co2ts.resample('5ka')
type(co2_5k)
pyleoclim.core.series.SeriesResampler

The output of this function is a variant on a Pandas resampler; the values then need to be aggregated or transformed. For our purpose, let’s average them over those 5ky bins:

co2_5kavg = co2_5k.mean() # the aggregator here is simply the mean
fig, ax = co2ts.plot(color='gray')
co2_5kavg.plot(ax=ax,color='C1')         
<Axes: xlabel='Age [ky BP]', ylabel='$CO_2$ [ppm]'>
../_images/384ca9bd4c15cde393f4d345b8f9017d5604ddc0e0b136a78b34c92ad3016f65.png

One notable distinction to standard Pandas resampling is that Pyleoclim aligns the resampled index to the midpoint of each interval, to minimize age offsets. In contrast, Pandas aligns to the beginning of an interval.

In terms of nomenclature, resample() can understand several abbreviations for “kiloyear”, such as ‘ka’, ‘ky’, or ‘kyrs’. In fact, these would work:

pyleo.utils.tsbase.MATCH_KA
frozenset({'ka', 'kiloyear', 'kiloyr', 'kiloyrs', 'ky', 'kyr', 'kyrs'})

For millions of years, use:

pyleo.utils.tsbase.MATCH_MA
frozenset({'ma', 'my', 'myr', 'myrs'})

and so on for other multiples, like years (pyleo.utils.tsbase.MATCH_A) or billion years (pyleo.utils.tsbase.MATCH_GA)

Further, the Resampler class allows one to choose statistics other than the mean, e.g., a running standard deviation:

co2_5k.std().view()
$CO_2$ [ppm]
Age [ky BP]
-1.831006 21.187646
3.169000 4.424875
8.169003 2.471905
13.169008 12.374541
18.169011 13.381836
... ...
783.169643 3.247548
788.169646 15.221735
793.169651 8.006723
798.169654 2.977565
803.169660 3.386011

162 rows × 1 columns

Pandas operations with MultipleSeries objects#

Let’s load the Deuterium EPICA DOME C record:

edc = pyleo.utils.load_dataset('EDC-dD')

We then create a MultipleSeries object using the & shorthand:

ms =  edc.convert_time_unit("ky BP") & co2ts
type(ms)
pyleoclim.core.multipleseries.MultipleSeries
ms.stackplot()
(<Figure size 640x480 with 3 Axes>,
 {0: <Axes: ylabel='$\\delta \\mathrm{D}$ [‰]'>,
  1: <Axes: ylabel='$CO_2$ [ppm]'>,
  'x_axis': <Axes: xlabel='Age [ka]'>})
../_images/48f480b6c3c751d92ed314680fec7e230111546131f0ebab0cb9ea6890d6d0fb.png

Exporting to pandas#

The MultipleSeries class also has a to_pandas() method; this one exports to a dataframe:

df = ms.to_pandas()
df.head()
EPICA Dome C dD EPICA Dome C CO2
datetime
-803719-10-17 10:22:57 NaN 207.29
-803182-03-23 07:19:38 NaN 202.23
-802573-12-30 00:58:49 NaN 204.86
-802060-10-17 05:46:09 NaN 207.50
-801975-05-20 01:52:17 NaN 202.92

By default, to_pandas() exports the Series as they are, and one can see that the values are not aligned. Some plotting functions still work out of the proverbial box:

df.hist()
array([[<Axes: title={'center': 'EPICA Dome C dD'}>,
        <Axes: title={'center': 'EPICA Dome C CO2'}>]], dtype=object)
../_images/e500192305c7b4760b626c6e934699316d32c28680487522f4217ba2b581c56b.png

Alas, df.plot() wouldn’t work, as Matplotlib assumes dates must be between year 0001 and 9999.

As with Series, ms.to_csv() will export the object to a CSV file. We skip this here, as the output is not very graphical.

Aligning to a common axis#

In many instances it is useful for the Series to share a common time axis. This can be done simply this way:

df = ms.to_pandas(use_common_time=True)
df.head()
EPICA Dome C dD EPICA Dome C CO2
datetime
1911-08-18 06:32:30 -390.900000 299.817617
1630-04-19 08:12:23 -400.382865 274.220786
1348-12-20 09:52:15 -401.167982 278.367365
1067-08-23 11:32:07 -398.662864 282.612398
786-04-25 13:11:59 -396.520050 279.625252

Seaborn is a plotting library designed to visualize DataFrames. With just one line, it is easy to get an overview of the data with a pairplot:

sns.set(font_scale=0.8)
with plt.style.context('bmh'):
    sns.pairplot(df)  
../_images/2be550e25611a25a0aa6db1f7940a79deae19605bae494b1c86e51598b411599.png

or a jointplot:

with plt.style.context('bmh'):
    sns.jointplot(data=df,kind='kde', x='EPICA Dome C dD', y='EPICA Dome C CO2')  
../_images/27fc6346b9991cc3649320d558cede81a5c3c09fc9e8e0fe1f8280b8f87a929c.png

Summary#

The legendary pandas library is now fully integrated within Pyleoclim. This opens the door to many powerful, user-friendly functionatlities for data processing and visualization. For other examples of how pandas is making life easier in Pyleoclim, see this notebook; however, the real frontier is what you will decide to do with it. Possibilities abound!

%watermark -n -u -v -iv -w
Last updated: Mon Jun 24 2024

Python implementation: CPython
Python version       : 3.11.9
IPython version      : 8.25.0

pandas    : 2.1.4
matplotlib: 3.8.4
numpy     : 1.26.4
seaborn   : 0.13.2
pyleoclim : 1.0.0b0

Watermark: 2.4.3