Loading data into Pyleoclim#

by Jordan Landers, Deborah Khider, Julien Emile-Geay.

Preamble#

Contrary to common beliefs, Pyleoclim is not specific to the LiPD file ecosystem. The object at the heart of the package is the Series object, which describes the fundamentals of a time series. To create a Pyleoclim Series, we first need to load the data set, and then specify values for its various properties:

  • time: Time values for the time series

  • value: Paleo values for the time series

  • time_name (optional): Name of the time vector, (e.g., ‘Time’, ‘Age’). This is used to label the x-axis on plots

  • time_unit (optional): The units of the time axis (e.g., ‘years’)

  • value_name (optional): The name of the paleo variable (e.g., ‘Temperature’)

  • value_unit (optional): The units of the paleo variable (e.g., ‘deg C’)

  • label (optional): Name of the time series (e.g., ‘Nino 3.4’)

  • log (dict) – Dictionary of tuples documentating the various transformations applied to the object

  • keep_log (bool) – Whether to keep a log of applied transformations. False by default

  • importedFrom (string) – source of the dataset. If it came from a LiPD file, this could be the datasetID property

  • archiveType (string) – climate archive

  • dropna (bool) – Whether to drop NaNs from the series to prevent downstream functions from choking on them defaults to True

  • sort_ts (str) – Direction of sorting over the time coordinate; ‘ascending’ or ‘descending’ Defaults to ‘ascending’

  • verbose (bool) – If True, will print warning messages if there is any

  • clean_ts (optional): set to True to remove the NaNs and make time axis strictly prograde with duplicated timestamps reduced by averaging the values Default is None.

Data may be stored in different file types, which can be ingested in different ways.

Goals:#

  • Create Pyleoclim Series objects from datasets stored as CSV, NetCDF, LiPD, and NOAA txt files.

Reading Time: 5 minutes

Keywords#

CSV; NetCDF; LiPD, Series; EnsembleSeries; MultipleSeries

Pre-requisites#

None. This tutorial assumes basic knowledge of Python. If you are not familiar with this coding language, check out this tutorial: http://linked.earth/ec_workshops_py/.

Relevant Packages#

Pandas; Xarray; pylipd; pangaeapy

Data Description#

  • McCabe-Glynn, S., Johnson, K., Strong, C. et al. Variable North Pacific influence on drought in southwestern North America since AD 854. Nature Geosci 6, 617–621 (2013). doi:10.1038/ngeo1862

  • Euro2k database: PAGES2k Consortium., Emile-Geay, J., McKay, N. et al. A global multiproxy database for temperature reconstructions of the Common Era. Sci Data 4, 170088 (2017). doi:10.1038/sdata.2017.88

  • Lisiecki, L. E., and Raymo, M. E. (2005), A Pliocene-Pleistocene stack of 57 globally distributed benthic δ18O records, Paleoceanography, 20, PA1003, doi:10.1029/2004PA001071.

Demonstration#

First we import our favorite packages:

%load_ext watermark

import pyleoclim as pyleo
from pylipd.lipd import LiPD
import xarray as xr
import pandas as pd
from pangaeapy.pandataset import PanDataSet

Pre-loaded datasets#

First, note that pyleoclim ships with a few pre-defined datasets:

pyleo.utils.available_dataset_names()
['SOI',
 'NINO3',
 'HadCRUT5',
 'AIR',
 'LR04',
 'AACO2',
 'EDC-dD',
 'GISP2',
 'cenogrid_d18O',
 'cenogrid_d13C']

To load one such dataset, simply type its name within pyleo.utils.load_dataset(), like so:

ts = pyleo.utils.load_dataset('cenogrid_d18O')
ts.plot(invert_yaxis=True)
(<Figure size 1000x400 with 1 Axes>,
 <Axes: xlabel='Age [My BP]', ylabel='$\\delta^{18} \\mathrm{O}$ [‰ VPDB]'>)
../_images/71881af75d8513251cf445c47b4a0d3ef1c4004ee1c4631ac6a290d8711d693b.png

Of course, there is a whole world out there full of wondrous data. How do you pull them into your Python workspace and get them to play nice with Pyleoclim? A few options below.

LiPD#

Linked Paleo Data format (LiPD) files contain time series information in addition to supporting metadata (e.g., root metadata, location).

Data stored in this format can be loaded into Python through a package called pylipd. Tutorials for this toolbox are available here.

Loading a single LiPD file#

data_path = '../data/Crystal.McCabe-Glynn.2013.lpd'
D = LiPD()
D.load(data_path)
Loading 1 LiPD files
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  5.64it/s]
Loaded..

You can extract each available variable and load them into a Pandas DataFrame. Each row will represent one variable:

d, df = D.get_timeseries(D.get_all_dataset_names(), to_dataframe=True)

df.head()
Extracting timeseries from dataset: Crystal.McCabe-Glynn.2013 ...
mode time_id investigator createdBy geo_meanLon geo_meanLat geo_meanElev geo_type funding1_grant funding1_agency ... paleoData_hasMinValue paleoData_number paleoData_variableName paleoData_resolution_hasMaxValue paleoData_resolution_hasMeanValue paleoData_resolution_hasMedianValue paleoData_resolution_hasMinValue paleoData_units paleoData_values paleoData_inferredVariableType
0 paleoData age [{'name': 'R.L. Edwards'}, {'name': 'S. McCabe... excel -118.82 36.59 1386.0 Feature [0823554, 1103360] US National Science Foundation ... 10.43 3 d18O -0.7 -1.097626 -1.1 2.5 permil [-8.01, -8.23, -8.61, -8.54, -8.6, -9.08, -8.9... NaN
1 paleoData age [{'name': 'R.L. Edwards'}, {'name': 'S. McCabe... excel -118.82 36.59 1386.0 Feature [0823554, 1103360] US National Science Foundation ... 0.05 1 depth -0.7 -1.097626 -1.1 2.5 mm [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0... NaN
2 paleoData age [{'name': 'R.L. Edwards'}, {'name': 'S. McCabe... excel -118.82 36.59 1386.0 Feature [0823554, 1103360] US National Science Foundation ... 851.90 2 age NaN NaN NaN NaN yr AD [2007.7, 2007.0, 2006.3, 2005.6, 2004.9, 2004.... Age

3 rows × 53 columns

The available properties are stored in the columns of the DataFrame:

df.columns
Index(['mode', 'time_id', 'investigator', 'createdBy', 'geo_meanLon',
       'geo_meanLat', 'geo_meanElev', 'geo_type', 'funding1_grant',
       'funding1_agency', 'funding2_grant', 'funding2_agency',
       'funding3_agency', 'funding3_grant', 'pub1_author', 'pub1_pages',
       'pub1_journal', 'pub1_year', 'pub1_title', 'pub1_abstract',
       'pub1_volume', 'pub1_dataUrl', 'pub1_doi', 'pub1_DOI', 'dataSetName',
       'hasUrl', 'lipdVersion', 'archiveType', 'tableType',
       'paleoData_filename', 'paleoData_missingValue', 'paleoData_tableName',
       'depth', 'depthUnits', 'age', 'ageUnits',
       'paleoData_proxyObservationType', 'paleoData_variableType',
       'paleoData_hasMeanValue', 'paleoData_hasMedianValue',
       'paleoData_hasMaxValue', 'paleoData_takenAtDepth', 'paleoData_TSid',
       'paleoData_hasMinValue', 'paleoData_number', 'paleoData_variableName',
       'paleoData_resolution_hasMaxValue', 'paleoData_resolution_hasMeanValue',
       'paleoData_resolution_hasMedianValue',
       'paleoData_resolution_hasMinValue', 'paleoData_units',
       'paleoData_values', 'paleoData_inferredVariableType'],
      dtype='object')

Let’s see which variables are available:

df['paleoData_variableName'].unique()
array(['d18O', 'depth', 'age'], dtype=object)
df_filter=df[df['paleoData_variableName']=='d18O']
df_filter
mode time_id investigator createdBy geo_meanLon geo_meanLat geo_meanElev geo_type funding1_grant funding1_agency ... paleoData_hasMinValue paleoData_number paleoData_variableName paleoData_resolution_hasMaxValue paleoData_resolution_hasMeanValue paleoData_resolution_hasMedianValue paleoData_resolution_hasMinValue paleoData_units paleoData_values paleoData_inferredVariableType
0 paleoData age [{'name': 'R.L. Edwards'}, {'name': 'S. McCabe... excel -118.82 36.59 1386.0 Feature [0823554, 1103360] US National Science Foundation ... 10.43 3 d18O -0.7 -1.097626 -1.1 2.5 permil [-8.01, -8.23, -8.61, -8.54, -8.6, -9.08, -8.9... NaN

1 rows × 53 columns

Let’s create a Series object from the information in the file:

ts = pyleo.Series(time=df_filter['age'].iloc[0], value=df_filter['paleoData_values'].iloc[0],
                  time_name = 'Age', time_unit = df_filter['ageUnits'].iloc[0],
                  value_name = df_filter['paleoData_variableName'].iloc[0], 
                  value_unit = df_filter['paleoData_units'].iloc[0],
                  label = df_filter['dataSetName'].iloc[0],
                  archiveType = df_filter['archiveType'].iloc[0], 
                  auto_time_params=False, verbose=False)

And plot the timeseries:

ts.plot()
(<Figure size 1000x400 with 1 Axes>,
 <Axes: xlabel='Age [yr AD]', ylabel='d18O [permil]'>)
../_images/345e7d43bc0098a4bab27a362d358c63cbf5abd67c39a1dae76b70e8f5f700c5.png

By default, the plot will use the name/units passed to Series to make the plots. You can change these labels directly on the graph or update the information in the Series object at any time:

ts.time_unit = 'CE'
ts.value_name = '$\delta^{18}$O'
ts.value_unit = "'‰ VPDB"
ts.plot()
(<Figure size 1000x400 with 1 Axes>,
 <Axes: xlabel='Age [CE]', ylabel="$\\delta^{18}$O ['‰ VPDB]">)
../_images/4f3abe11acac2657bcc61d4ff3b37fa9cb8988e433a29cfdfcc3850af70c9c90.png

Loading multiple LiPD files#

data_path = '../data/Euro2k/'
D_euro = LiPD()
D_euro.load_from_dir(data_path)
Loading 31 LiPD files
100%|██████████████████████████████████████████| 31/31 [00:00<00:00, 112.56it/s]
Loaded..

Let’s use the same .get_timeseries function to load all the variables in a DataFrame:

d_euro, df_euro = D_euro.get_timeseries(D_euro.get_all_dataset_names(),to_dataframe=True)
Extracting timeseries from dataset: Ocn-RedSea.Felis.2000 ...
Extracting timeseries from dataset: Arc-Forfjorddalen.McCarroll.2013 ...
Extracting timeseries from dataset: Eur-Tallinn.Tarand.2001 ...
Extracting timeseries from dataset: Eur-CentralEurope.Dobrovoln_.2009 ...
Extracting timeseries from dataset: Eur-EuropeanAlps.B_ntgen.2011 ...
Extracting timeseries from dataset: Eur-CentralandEasternPyrenees.Pla.2004 ...
Extracting timeseries from dataset: Arc-Tjeggelvas.Bjorklund.2012 ...
Extracting timeseries from dataset: Arc-Indigirka.Hughes.1999 ...
Extracting timeseries from dataset: Eur-SpannagelCave.Mangini.2005 ...
Extracting timeseries from dataset: Ocn-AqabaJordanAQ19.Heiss.1999 ...
Extracting timeseries from dataset: Arc-Jamtland.Wilson.2016 ...
Extracting timeseries from dataset: Eur-RAPiD-17-5P.Moffa-Sanchez.2014 ...
Extracting timeseries from dataset: Eur-LakeSilvaplana.Trachsel.2010 ...
Extracting timeseries from dataset: Eur-NorthernSpain.Mart_n-Chivelet.2011 ...
Extracting timeseries from dataset: Eur-MaritimeFrenchAlps.B_ntgen.2012 ...
Extracting timeseries from dataset: Ocn-AqabaJordanAQ18.Heiss.1999 ...
Extracting timeseries from dataset: Arc-Tornetrask.Melvin.2012 ...
Extracting timeseries from dataset: Eur-EasternCarpathianMountains.Popa.2008 ...
Extracting timeseries from dataset: Arc-PolarUrals.Wilson.2015 ...
Extracting timeseries from dataset: Eur-LakeSilvaplana.Larocque-Tobler.2010 ...
Extracting timeseries from dataset: Eur-CoastofPortugal.Abrantes.2011 ...
Extracting timeseries from dataset: Eur-TatraMountains.B_ntgen.2013 ...
Extracting timeseries from dataset: Eur-SpanishPyrenees.Dorado-Linan.2012 ...
Extracting timeseries from dataset: Eur-FinnishLakelands.Helama.2014 ...
Extracting timeseries from dataset: Eur-Seebergsee.Larocque-Tobler.2012 ...
Extracting timeseries from dataset: Eur-NorthernScandinavia.Esper.2012 ...
Extracting timeseries from dataset: Arc-GulfofAlaska.Wilson.2014 ...
Extracting timeseries from dataset: Arc-Kittelfjall.Bjorklund.2012 ...
Extracting timeseries from dataset: Eur-L_tschental.B_ntgen.2006 ...
Extracting timeseries from dataset: Eur-Stockholm.Leijonhufvud.2009 ...
Extracting timeseries from dataset: Arc-AkademiiNaukIceCap.Opel.2013 ...

Let’s have a look at the available variable names:

df_euro['paleoData_variableName'].unique()
array(['year', 'd18O', 'MXD', 'temperature', 'JulianDay', 'ringWidth',
       'sampleID', 'uncertainty', 'age', 'density', 'd13C', 'count',
       'thickness', 'sodium'], dtype=object)

Let’s filter the temperature records:

df_euro_filt = df_euro[df_euro['paleoData_variableName']=='temperature']

df_euro_filt.shape
(12, 117)

Which leaves us with 12 timeseries. Let’s iterate over the DataFrame and create a Series for each of these:

ts_list = []

for _,row in df_euro_filt.iterrows():
    ts_list.append(pyleo.GeoSeries(time=row['year'],value=row['paleoData_values'],
                            time_name='Year',value_name=row['paleoData_variableName'],
                            time_unit=row['yearUnits'], value_unit=row['paleoData_units'],
                            lat = row['geo_meanLat'], lon = row['geo_meanLon'],
                            archiveType = row['archiveType'],
                            label=row['dataSetName']+'_'+row['paleoData_proxy'],
                            auto_time_params=False, verbose=False))  

Here I chose to keep these series into a list, which makes it easy to create a MultipleSeries object. However, you could choose to keep them in a dictionary, using the datasetname/proxy as a key.

ms = pyleo.MultipleSeries(ts_list)

And plot them:

ms.stackplot(figsize = [10,20])
(<Figure size 1000x2000 with 13 Axes>,
 {0: <Axes: ylabel='temperature [degC]'>,
  1: <Axes: ylabel='temperature [degC]'>,
  2: <Axes: ylabel='temperature [degC]'>,
  3: <Axes: ylabel='temperature [degC]'>,
  4: <Axes: ylabel='temperature [degC]'>,
  5: <Axes: ylabel='temperature [degC]'>,
  6: <Axes: ylabel='temperature [degC]'>,
  7: <Axes: ylabel='temperature [degC]'>,
  8: <Axes: ylabel='temperature [degC]'>,
  9: <Axes: ylabel='temperature [degC]'>,
  10: <Axes: ylabel='temperature [degC]'>,
  11: <Axes: ylabel='temperature [degC]'>,
  'x_axis': <Axes: xlabel='Time [yr AD]'>})
../_images/f24aa8f54b758d89abaf6c4b1540ce96bd7ef4ab228c2c10c8b2a34f46ddea60.png

NetCDF#

In order to load data from a NetCDF file, we will use Xarray.

file_path = '../data/p2k_ngeo19_recons.nc'
p2k_nc = xr.open_dataset(file_path)

The coordinates of this data set are year and ens, and the temperature anamoly is contained in the variable LMRv2.1. Below we extract the timeseries for the ensemble runs:

variable_name = 'LMRv2.1'
ens_runs = p2k_nc.groupby('ens')

To create the pyleo.Series, we pass the time coordinate of the dataset, p2k_nc.year, as time, and one of the ensemble runs as value. It is optional to specify time_name and time_unit, and value_name and value_unit, but doing so ensures that plot axes are properly labeled.

ens_run1 = ens_runs[1].data_vars[variable_name]
p2k_ps = pyleo.Series(time=p2k_nc.year, value=ens_run1,
                      time_name='Time', time_unit='year', label = 'LMRv2.1 member #1',
                      value_name='GMST', value_unit='$^{\circ}$C', verbose=False)
fig, ax = p2k_ps.plot()

However, given this is an ensemble, capturing this data in a pyleo.EnsembleSeries will open up opportunities for specific analysis and visualization. In the cell below, we generate a list of pyleo.Series (one for each trace) for the full set of ensemble runs in much the same way as above.

%%time
ts_list = []

for im in range(len(p2k_nc.ens)):
    ens_run = ens_runs[im+1].data_vars[variable_name]
    ts_list.append(pyleo.Series(time=p2k_nc.year, value=ens_run,
                      time_name='Time', time_unit='year', verbose=False,
                      value_name='GMST', value_unit='$^{\circ}$C'))

Then we simply pass ts_list to pyleo.EnsembleSeries

ts_ens = pyleo.EnsembleSeries(ts_list)

For more detail on visualizing pyleo.EnsembleSeries, check out the tutorial on Basic operations with MultipleSeries and EnsembleSeries, but we can use plot_traces() to quickly check to make sure the data seems properly organized (by default, plot_traces() plots 10 randomly selected traces).

fig, ax = ts_ens.plot_traces()

CSV#

CSV files have a table structure, so we will use Pandas and the read the data into a pandas DataFrame.

LR04 = pd.read_csv('../data/LR04.csv', header=4)
LR04.head()
LR04.columns

To create a pyleo.Series, we pass the Time (ka) column as time and the Benthic d18O (per mil) column as value. Note the extra space at the end of that column name! We have to work with what the interwebs give us.

LR04_ps = pyleo.Series(time=LR04['Time (ka)'], value=LR04['Benthic d18O (per mil)  '],
                      time_name='Age', time_unit='ka', verbose=False,
                      value_name='$\delta^{18}$O', value_unit=u'‰')
fig, ax = LR04_ps.plot(invert_yaxis=True)

Note that you would get the exact same result with LR04_ps = pyleo.utils.load_dataset('LR04'). Generally speaking, it is very easy to go between Pyleoclim and CSV files, but the file needs to have a certain structure to be properly read with the function from_csv(). For an example, see this tutorial. If it doesn’t follow that structure, proceed as outlined above.

NOAA txt files#

As you may now, the World Data Service for paleoclimatology, operated by NCEI/NOAA of the US Department of Commerce, hosts thousands of data files in various formats. A common one is a templated text file, containing rich data and metadata. One can treat the file as a raw text file, ignoring the header and loading the data directly into a Pandas DataFrame.

path = 'ftp://ftp.ncdc.noaa.gov/pub/data/paleo/icecore/antarctica/antarctica2015co2composite.txt'
co2df = pd.read_csv(path, skiprows=137, sep='\t')
co2df.head()
age_gas_calBP co2_ppm co2_1s_ppm
0 -51.03 368.02 0.06
1 -48.00 361.78 0.37
2 -46.28 359.65 0.10
3 -44.41 357.11 0.16
4 -43.08 353.95 0.04

Notice how pandas retrieved the data over the network, without needing to download a local copy of the file. However, it would work just as well if you did have such a copy on your hardrive, and you would simply replace path with the local file path (everything else would stay the same). We did cheat a bit, however: we had to peak at the file to know how many header lines to skip (137). The separator (sep) argument is set to '\t', which means “tab”. It works well in this case, but we cannot guarantee that it will work on all NOAA text files.

Finally, we pull the relevant columns of this datframe into a Series object, convert the years to kyr for ease of use, and put in the relevant metadata so that we can get a well-labeled, publication-quality plot right off the bat:

co2ts = pyleo.Series(time=co2df['age_gas_calBP']/1000,value= co2df['co2_ppm'],
                     time_name='Age',time_unit='kyr BP',value_name = r'$CO_2$',
                     value_unit='ppm',label='EPICA Dome C CO2', verbose=False)
co2ts.plot(color='C1')

Loading from PANGAEA#

Another major repository for paleoclimate data is PANGAEA. Here we load the dataset of Skinner et al. (2007), using the very helpful pangaeapy package:

ds = PanDataSet('10.1594/PANGAEA.619066')
print(ds.title)
print(ds.data.head())

Once again, this method eschewed a download, retrieving the data directly from PANGAEA’s web server. Notice that all you needed was the dataset DOI, easily gleaned from the page’s URL or from the contents themselves. The PanDataSet is a custom data structure built on pandas, much in the way Pyleoclim Series will be in a few releases.

One can extract the dataset column names via:

ds.data.columns
type(ds.data['Temp']) 

As a pandas Series object, this variable has an associated plot method:

ds.data['Temp'].plot()

However, we see that the x-axis knows nothing about time; indeed, PANGAEA services a dizzying array of dataset types, and their API has to be very general. This is one reason to have a Pyleoclim Series class that knows explicitly about time. Let us create an instance here, with proper metadata that are not available in the original, barebones PANGAEA format:

S07_temp = pyleo.Series(time=ds.data['Age'], value=ds.data['Temp'],
                        time_name='Age', time_unit='ka BP', verbose=False,
                        value_name='Temperature', value_unit=r'$^{\circ}$ C',
                        label = 'Skinner et al. (2007)')
S07_temp.plot()

As of now, you can create this object manually, but we will soon automate this process to a large extent. Let us also load their reconstruction of seawater \(\delta^{18}\)O:

S07_d18Osw = pyleo.Series(time=ds.data['Age'], value=ds.data['δ18O H2O'],
                      time_name='Age', time_unit='ka BP', verbose=False,
                      value_name='seawater $\delta^{18}$O', value_unit='$\perthousand$')
S07_d18Osw.plot()

We can put both series into a MultipleSeries object, which unlocks fun features, like stackplot():

ms = S07_temp & S07_d18Osw
ms.stackplot(plot_kwargs={'marker':'o'})

For more information on MultipleSeries, read the documentation and/or follow this tutorial.

Takeway#

There are multiple paths into Series, unlocking the full power of Pyleoclim to work with data originally stored in all manner of formats.

%watermark -n -u -v -iv -w
Last updated: Mon Mar 04 2024

Python implementation: CPython
Python version       : 3.11.7
IPython version      : 8.20.0

pandas   : 2.1.4
xarray   : 2024.2.0
pyleoclim: 0.13.1b0

Watermark: 2.4.3