Loading Data with PyLiPD and Pyleoclim

This notebook lays out the details of how we load data from LiPD files using PyLiPD into the Pyleoclim objects used for the remainder of our analysis. We also do some light pre-processing to prepare the records for plotting later in this book.

The notebook is structured as follows:

  1. Load records with PyLiPD

  2. Process data into a MultipleGeoSeries object

  3. Use Pyleoclim to create versions of each record that plot more cleanly than the raw data in later figures

# Importing necessary packages

import pickle
from tqdm import tqdm

import numpy as np
import pyleoclim as pyleo
from pylipd.lipd import LiPD

First we load our records into a LiPD object to identify the record names:

# Loading files into pylipd object

lipd_path = "../../data/speleothems/"

all_files = LiPD()

# The __main__ guard matters here because parallel loading spawns worker
# processes, which re-import this module on some platforms
if __name__ == "__main__":
    all_files.load_from_dir(lipd_path, parallel=True)

record_names = all_files.get_all_dataset_names()
Loading 15 LiPD files
Loaded..
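
As a quick check, we can list the dataset names that were found (the names depend on the files present in lipd_path):

# Listing the dataset names discovered in the directory
print(f"Found {len(record_names)} records:")
for name in record_names:
    print(name)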

Next, we loop over the record names, loading each record into its own LiPD object. These we convert into pandas DataFrames, which are straightforward to work with. Once we’ve extracted the relevant information from each DataFrame, we create a GeoSeries object, which we add to a list for later use:

# Extracting the oxygen isotope series from each record and saving them into a list

series_list = []
value_name = "d18O"

for record in record_names:
    d = LiPD()
    d.load(f"{lipd_path}/{record}.lpd")
    df = d.get_timeseries_essentials()
    row = df[df["paleoData_variableName"] == value_name]
    lat = row["geo_meanLat"].to_numpy()[0]
    lon = row["geo_meanLon"].to_numpy()[0]
    elevation = row["geo_meanElev"].to_numpy()[0]
    value = row["paleoData_values"].to_numpy()[0]
    value_unit = row["paleoData_units"].to_numpy()[0]
    time = row["time_values"].to_numpy()[0]
    time_unit = row["time_units"].to_numpy()[0]
    time_name = row["time_variableName"].to_numpy()[0]
    label = row["dataSetName"].to_numpy()[0]
    geo_series = pyleo.GeoSeries(
        time=time,
        value=value,
        lat=lat,
        lon=lon,
        elevation=elevation,
        time_unit=time_unit,
        time_name=time_name,
        value_name=value_name,
        value_unit=value_unit,
        label=label,
        archiveType="speleothem",
    )
    series_list.append(geo_series)
Loading 1 LiPD files
Loaded..
NaNs have been detected and dropped.
Time axis values sorted in ascending order
(similar output repeated for each of the 15 records)
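
Before moving on, it can be helpful to confirm what the loop produced; a minimal sanity check using the attributes set in the loop above:

# Inspecting the metadata of the first loaded series
first = series_list[0]
print(first.label, first.lat, first.lon, first.value_unit)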

One record requires special attention, as it is a composite of two different records. We handle that here, and create a dictionary with record names as the keys and GeoSeries objects as the values for later use:

# Handling Dongge separately due to the presence of two records

series_dicts = []
geo_ms_composite_list = []

for series in series_list:
    if "dongge" in series.label.lower():
        time = series.time
        value = series.value
        tv_dict = {t: value[idx] for idx, t in enumerate(time)}
        series_dicts.append(tv_dict)
        lat = series.lat
        lon = series.lon
        elevation = series.elevation
        time_unit = series.time_unit
        time_name = series.time_name
        value_name = series.value_name
        value_unit = series.value_unit
    else:
        geo_ms_composite_list.append(series)

series1_dict = series_dicts[0]
series2_dict = series_dicts[1]

combined_timestamps = set(series1_dict.keys()).union(set(series2_dict.keys()))

# Merge the two series, preferring the first record's value where timestamps overlap
combined_series = []
for timestamp in sorted(combined_timestamps):
    value1 = series1_dict.get(timestamp)
    value2 = series2_dict.get(timestamp)

    if value1 is not None:
        combined_series.append((timestamp, value1))
    elif value2 is not None:
        combined_series.append((timestamp, value2))

dongge_composite = pyleo.GeoSeries(
    time=[tv[0] for tv in combined_series],
    value=[tv[1] for tv in combined_series],
    lat=lat,
    lon=lon,
    elevation=elevation,
    time_unit=time_unit,
    time_name=time_name,
    value_name=value_name,
    value_unit=value_unit,
    archiveType="speleothem",
    label="Dongge.China.2004",
)

geo_ms_composite_list.append(dongge_composite)
geo_ms_composites = pyleo.MultipleGeoSeries(geo_ms_composite_list)
geo_ms_composite_dict = {series.label: series for series in geo_ms_composite_list}
Time axis values sorted in ascending order
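
Entries in the dictionary can now be retrieved by label; for example, the composite we just built (a quick check, using the label assigned above):

# Retrieving the Dongge composite by its label
dongge = geo_ms_composite_dict["Dongge.China.2004"]
print(dongge.label, dongge.time_unit, len(dongge.time))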

Next, we create versions of each GeoSeries that are useful for plotting. These objects won’t be used for any analysis (the raw records serve that purpose); rather, they are structured so that we don’t draw lines across hiatuses. We’ll create an unsmoothed and a smoothed version of each:

# Creating series that can be plotted nicely.

plotting_series_dict = {}
smooth_plotting_series_dict = {}

for series in tqdm(geo_ms_composite_dict.values()):
    series.value_name = r"$\delta^{18}O$"
    series.value_unit = "‰"
    series = series.convert_time_unit("kyr BP")
    # Split the record at hiatuses (gaps larger than `factor` times the typical time step)
    segments = series.segment(factor=15)
    if isinstance(segments, pyleo.MultipleGeoSeries):
        plotting_series_value = []
        plotting_series_time = []
        smooth_plotting_series_value = []
        smooth_plotting_series_time = []
        for segment in segments.series_list:
            # Keep only segments spanning more than 12 kyr
            if max(segment.time) - min(segment.time) > 12:
                plotting_series_value.extend(segment.value)
                # A NaN value breaks the plotted line at the hiatus; the tiny time
                # offset keeps the time axis strictly increasing
                plotting_series_value.append(np.nan)
                plotting_series_time.extend(segment.time)
                plotting_series_time.append(plotting_series_time[-1] + 0.0000001)

                # Interpolate to an even time step, then low-pass filter with a 6 kyr cutoff
                smooth_segment = segment.interp().filter(cutoff_scale=6)
                smooth_plotting_series_value.extend(smooth_segment.value)
                smooth_plotting_series_value.append(np.nan)
                smooth_plotting_series_time.extend(smooth_segment.time)
                smooth_plotting_series_time.append(
                    smooth_plotting_series_time[-1] + 0.0000001
                )

        plotting_series = series.copy()
        plotting_series.value = np.array(plotting_series_value)
        plotting_series.time = np.array(plotting_series_time)

        smooth_plotting_series = series.copy()
        smooth_plotting_series.value = np.array(smooth_plotting_series_value)
        smooth_plotting_series.time = np.array(smooth_plotting_series_time)

    else:
        plotting_series = series
        smooth_plotting_series = series.interp().filter(cutoff_scale=6)

    plotting_series_dict[series.label] = plotting_series
    smooth_plotting_series_dict[series.label] = smooth_plotting_series
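The NaN padding works because matplotlib, which Pyleoclim uses for plotting, does not draw line segments adjacent to NaN values. A minimal standalone illustration of the mechanism (not part of the notebook’s pipeline):

import matplotlib.pyplot as plt
import numpy as np

# The NaN in `v` splits the curve into two disconnected segments,
# leaving a visible gap in the plotted line
t = np.array([0, 1, 2, 3, 5, 6, 7])
v = np.array([1.0, 2.0, 1.5, np.nan, 3.0, 2.5, 3.0])
plt.plot(t, v)
plt.show()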

We save these as pickle files for easy loading later on:

# Saving the output as a pickle file

with open("../../data/geo_ms_composite_dict.pkl", "wb") as f:
    pickle.dump(geo_ms_composite_dict, f)

with open("../../data/plotting_series_dict.pkl", "wb") as handle:
    pickle.dump(plotting_series_dict, handle)

with open("../../data/smooth_plotting_series_dict.pkl", "wb") as handle:
    pickle.dump(smooth_plotting_series_dict, handle)
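
In later notebooks, these dictionaries can be restored directly (a sketch assuming the same directory layout as above):

import pickle

# Restoring the saved dictionaries
with open("../../data/geo_ms_composite_dict.pkl", "rb") as f:
    geo_ms_composite_dict = pickle.load(f)

with open("../../data/plotting_series_dict.pkl", "rb") as f:
    plotting_series_dict = pickle.load(f)

with open("../../data/smooth_plotting_series_dict.pkl", "rb") as f:
    smooth_plotting_series_dict = pickle.load(f)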