Loading Data with PyLiPD and Pyleoclim
This notebook details how we load data from LiPD files with PyLiPD and convert it into the Pyleoclim objects used for the remainder of our analysis. We also do some light pre-processing for the plots that appear later in this book.
The notebook is structured as follows:

1. Load records with PyLiPD
2. Process the data into a MultipleGeoSeries object
3. Use Pyleoclim to create new objects that plot more nicely than the raw records in later figures
# Importing necessary packages
import pickle
from tqdm import tqdm
import numpy as np
import pyleoclim as pyleo
from pylipd.lipd import LiPD
First, we load our records into a single LiPD object to identify the dataset names:
# Loading files into a pylipd LiPD object
lipd_path = "../../data/speleothems/"

all_files = LiPD()
if __name__ == "__main__":
    # Guard recommended here because parallel loading spawns worker processes
    all_files.load_from_dir(lipd_path, parallel=True)

record_names = all_files.get_all_dataset_names()
Loading 15 LiPD files
Loaded..
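As a quick sanity check (a minimal sketch, assuming the load above succeeded), we can count and list the dataset names that were found:

# Sketch: confirm that all 15 datasets were discovered
print(f"{len(record_names)} records loaded:")
for name in sorted(record_names):
    print(f"  {name}")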
Next, we iterate over the records, loading each into its own LiPD object and converting it to a pandas DataFrame, which is straightforward to work with. Once we’ve extracted the relevant information from each DataFrame, we create a GeoSeries object and add it to a list for later use:
# Parsing the records for oxygen isotopes and collecting them into a list
series_list = []
value_name = "d18O"

for record in record_names:
    d = LiPD()
    d.load(f"{lipd_path}/{record}.lpd")
    df = d.get_timeseries_essentials()
    row = df[df["paleoData_variableName"] == value_name]
    lat = row["geo_meanLat"].to_numpy()[0]
    lon = row["geo_meanLon"].to_numpy()[0]
    elevation = row["geo_meanElev"].to_numpy()[0]
    value = row["paleoData_values"].to_numpy()[0]
    value_unit = row["paleoData_units"].to_numpy()[0]
    time = row["time_values"].to_numpy()[0]
    time_unit = row["time_units"].to_numpy()[0]
    time_name = row["time_variableName"].to_numpy()[0]
    label = row["dataSetName"].to_numpy()[0]

    geo_series = pyleo.GeoSeries(
        time=time,
        value=value,
        lat=lat,
        lon=lon,
        elevation=elevation,
        time_unit=time_unit,
        time_name=time_name,
        value_name=value_name,
        value_unit=value_unit,
        label=label,
        archiveType="speleothem",
    )
    series_list.append(geo_series)
Loading 1 LiPD files
Loaded..
Time axis values sorted in ascending order
(output repeated for each of the 15 records; a few also report "NaNs have been detected and dropped.")
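Before moving on, it can help to see what the essentials table actually contains. The following sketch (assuming the files above are still available) prints the columns for a single record; the column names used in the loop above, such as paleoData_variableName and geo_meanLat, come from this table:

# Sketch: inspect the essentials DataFrame for one record
d = LiPD()
d.load(f"{lipd_path}/{record_names[0]}.lpd")
df = d.get_timeseries_essentials()
print(df.columns.tolist())
print(df["paleoData_variableName"].tolist())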
One record requires special attention because it is a composite of two different records. We handle it here and create a dictionary with record names as keys and GeoSeries objects as values for later use:
# Handling Dongge separately due to the presence of two records
series_dicts = []
geo_ms_composite_list = []

for series in series_list:
    if "dongge" in series.label.lower():
        # Store each Dongge record as a {time: value} dictionary for merging
        time = series.time
        value = series.value
        tv_dict = {t: value[idx] for idx, t in enumerate(time)}
        series_dicts.append(tv_dict)
        lat = series.lat
        lon = series.lon
        elevation = series.elevation
        time_unit = series.time_unit
        time_name = series.time_name
        value_name = series.value_name
        value_unit = series.value_unit
    else:
        geo_ms_composite_list.append(series)

series1_dict = series_dicts[0]
series2_dict = series_dicts[1]
combined_timestamps = set(series1_dict.keys()).union(set(series2_dict.keys()))

# Merge the two records, preferring the first record's value where timestamps overlap
combined_series = []
for timestamp in sorted(combined_timestamps):
    value1 = series1_dict.get(timestamp)
    value2 = series2_dict.get(timestamp)
    if value1 is not None:
        combined_series.append((timestamp, value1))
    elif value2 is not None:
        combined_series.append((timestamp, value2))

dongge_composite = pyleo.GeoSeries(
    time=[tv[0] for tv in combined_series],
    value=[tv[1] for tv in combined_series],
    lat=lat,
    lon=lon,
    elevation=elevation,
    time_unit=time_unit,
    time_name=time_name,
    value_name=value_name,
    value_unit=value_unit,
    archiveType="speleothem",
    label="Dongge.China.2004",
)
geo_ms_composite_list.append(dongge_composite)

geo_ms_composites = pyleo.MultipleGeoSeries(geo_ms_composite_list)
geo_ms_composite_dict = {series.label: series for series in geo_ms_composite_list}
Time axis values sorted in ascending order
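Because combined_timestamps is a set, the merged time axis is strictly increasing by construction. A quick check (a minimal sketch, building on the objects defined above) confirms this and summarizes the composite's span:

# Sketch: sanity-check the merged Dongge composite
assert np.all(np.diff(dongge_composite.time) > 0), "time axis should be strictly increasing"
print(f"{len(dongge_composite.time)} samples spanning "
      f"{dongge_composite.time.min():g}-{dongge_composite.time.max():g} {dongge_composite.time_unit}")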
Next, we create versions of each GeoSeries that are useful for plotting. These objects won’t be used for any analysis (the raw records serve that purpose); rather, they are structured so that we don’t draw lines across hiatuses when plotting. We create an unsmoothed and a smoothed version of each:
# Creating series that can be plotted nicely
plotting_series_dict = {}
smooth_plotting_series_dict = {}

for series in tqdm(geo_ms_composite_dict.values()):
    series.value_name = r"$\delta^{18}O$"
    series.value_unit = "‰"
    series = series.convert_time_unit("kyr BP")
    # segment() splits the series at large time gaps (hiatuses); it returns a
    # MultipleGeoSeries if gaps are found, otherwise the series itself
    segments = series.segment(factor=15)
    if isinstance(segments, pyleo.core.multiplegeoseries.MultipleGeoSeries):
        plotting_series_value = []
        plotting_series_time = []
        smooth_plotting_series_value = []
        smooth_plotting_series_time = []
        for segment in segments.series_list:
            # Keep only segments spanning more than 12 kyr
            if max(segment.time) - min(segment.time) > 12:
                # Append a NaN after each segment so plots don't draw lines across hiatuses
                plotting_series_value.extend(segment.value)
                plotting_series_value.append(np.nan)
                plotting_series_time.extend(segment.time)
                plotting_series_time.append(plotting_series_time[-1] + 0.0000001)
                smooth_segment = segment.interp().filter(cutoff_scale=6)
                smooth_plotting_series_value.extend(smooth_segment.value)
                smooth_plotting_series_value.append(np.nan)
                smooth_plotting_series_time.extend(smooth_segment.time)
                smooth_plotting_series_time.append(
                    smooth_plotting_series_time[-1] + 0.0000001
                )
        plotting_series = series.copy()
        plotting_series.value = np.array(plotting_series_value)
        plotting_series.time = np.array(plotting_series_time)
        smooth_plotting_series = series.copy()
        smooth_plotting_series.value = np.array(smooth_plotting_series_value)
        smooth_plotting_series.time = np.array(smooth_plotting_series_time)
    else:
        plotting_series = series
        smooth_plotting_series = series.interp().filter(cutoff_scale=6)

    plotting_series_dict[series.label] = plotting_series
    smooth_plotting_series_dict[series.label] = smooth_plotting_series
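To see the effect of the NaN separators, we can plot one record (a minimal sketch; matplotlib skips NaN values, so lines won't span hiatuses):

# Sketch: compare a raw gapped series with its smoothed counterpart
label = list(plotting_series_dict.keys())[0]  # any record label
fig, ax = plotting_series_dict[label].plot(label="raw")
smooth_plotting_series_dict[label].plot(ax=ax, label="smoothed")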
We save these as pickle files for easy loading later on:
# Saving the outputs as pickle files
with open("../../data/geo_ms_composite_dict.pkl", "wb") as f:
    pickle.dump(geo_ms_composite_dict, f)

with open("../../data/plotting_series_dict.pkl", "wb") as handle:
    pickle.dump(plotting_series_dict, handle)

with open("../../data/smooth_plotting_series_dict.pkl", "wb") as handle:
    pickle.dump(smooth_plotting_series_dict, handle)
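In later notebooks these can be read back with pickle.load, for example:

# Sketch: reloading a saved dictionary in a later notebook
with open("../../data/geo_ms_composite_dict.pkl", "rb") as f:
    geo_ms_composite_dict = pickle.load(f)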