The Dataset class#
Preamble#
The next set of tutorials goes through editing and creating LiPD files from Python. Before we delve into the details of how to do so, it is good to remind ourselves of two important facts:
PyLiPD uses object-oriented programming (OOP). In OOP, an object contains the data, associated parameters (e.g., metadata), and code representing procedures applicable to that object. So far, we have seen two objects: the LiPD object and the LiPDSeries object. Both of these objects contain a graph that follows an ontology. The LinkedEarth Ontology describes paleoclimate datasets and was created from the LiPD format. Ontologies list the types of objects, called classes (e.g., Dataset, Publication, Variable), the relationships that connect them (e.g., Dataset publishedIn Publication), and constraints on the ways that classes and relationships can be combined. Here is a snippet of the LinkedEarth Ontology:
As you can see, the top class is the Dataset class.
Why is this information relevant now?#
At first glance, OOP and ontologies serve two different purposes. However, they function in a very similar fashion: a class (or object) can be manipulated through its properties (or methods). We used this resemblance to help us create editing functions for PyLiPD. In short, each class in the ontology was made into an object in PyLiPD, and each property was given functionality to read/write/edit the property value.
This is how the Dataset object was created; it is your entry point to editing LiPD files.
How is Dataset different from the LiPD class?#
At first glance, the two classes are very similar, as they both contain the data and metadata for paleoclimate datasets as a graph. However, the functions attached to the Dataset class are meant for editing, while those associated with the LiPD class are meant for querying and manipulation. Separating the two also ensures that files are not overwritten by mistake.
However, if you prefer to use the Python APIs for each property to loop over various files, the Dataset class may be more useful to you. This option requires knowledge of the ontology and the LiPD structure.
Goals#
Create a Dataset object from an existing file
Retrieve information from the file
Reading Time: 5 minutes
Keywords#
LiPD, LinkedEarth Ontology, Object-Oriented Programming
Pre-requisites#
An understanding of OOP and the LinkedEarth Ontology:
The Linked Earth Core Ontology provides the main concepts and relationships to describe a paleoclimate dataset and its values.
The Archive Type Ontology describes a taxonomy of the most common types of archives (e.g., Coral, Glacier Ice).
The Paleo Variables Ontology describes a taxonomy of the most common types of paleo variables.
The Paleo Proxy Ontology describes a taxonomy of the most common types of paleo proxies.
The Paleo Units Ontology describes a taxonomy of the most common types of paleo units.
The Interpretation Ontology describes a taxonomy of the most common interpretations.
The Instrument Ontology describes a taxonomy of the most common instrument for taking measurements.
The Chron Variables Ontology describes a taxonomy of the most common types of chron variables. Under Construction.
The Chron Proxy Ontology describes a taxonomy of the most common types of chron proxies. Under Construction.
The Chron Units Ontology describes a taxonomy of the most common types of chron units. Under Construction.
Relevant Packages#
pylipd
Data Description#
This notebook uses the following datasets, in LiPD format:
Nurhati, I. S., Cobb, K. M., & Di Lorenzo, E. (2011). Decadal-scale SST and salinity variations in the central tropical Pacific: Signatures of natural and anthropogenic climate change. Journal of Climate, 24(13), 3294–3308. doi:10.1175/2011jcli3852.1
Lawrence, K. T., Liu, Z. H., & Herbert, T. D. (2006). Evolution of the eastern tropical Pacific through Plio-Pleistocene glaciation. Science, 312(5770), 79-83.
Demonstration#
Let’s import the LiPD and Dataset classes:
from pylipd.classes.dataset import Dataset
from pylipd.lipd import LiPD
# Pandas for data
import pandas as pd
import pyleoclim as pyleo
For the purpose of this demonstration, let’s open the dataset from Nurhati et al. (2011).
D = LiPD()
data_path = '../data/Ocn-Palmyra.Nurhati.2011.lpd'
D.load(data_path)
Loading 1 LiPD files
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00, 35.86it/s]
Loaded..
Creating a Dataset object#
Convert to a Dataset object using the get_datasets method. By default, this method returns a list. Since we have only one dataset, we select the first item:
ds = D.get_datasets()[0]
Obtaining information about the Dataset#
Get Dataset level information#
You can now access information about the dataset using Python APIs. The methods to do so are named get + the name of the class/property. For instance, to get the name of the dataset, you should use the getName function:
name = ds.getName()
name
'Ocn-Palmyra.Nurhati.2011'
If in doubt, consult the documentation for the various objects. The documentation is light on these methods since they were generated directly from the ontology.
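Because the getters follow a uniform get-prefix naming convention, you can also discover them with standard Python introspection. A minimal sketch, using a hypothetical stand-in class (so it runs without pylipd or a LiPD file); on the real ds object, the same comprehension over dir(ds) lists all available getters:

```python
# Stand-in mimicking PyLiPD's auto-generated accessor naming
# (the real Dataset object exposes getName, getLocation, etc.).
class DatasetStandIn:
    def getName(self):
        return "Ocn-Palmyra.Nurhati.2011"

    def getLocation(self):
        return None

# dir() returns attribute names alphabetically; keep only the getters.
getters = [m for m in dir(DatasetStandIn) if m.startswith("get")]
print(getters)  # ['getLocation', 'getName']
```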
Location#
Remember that data properties such as Name have a range of string, float, or integer and represent the leaves in the graph. However, you may need to dig into the graph to obtain the answer you need. For instance, let’s have a look at the geographical coordinates for the site:
geo = ds.getLocation()
type(geo)
pylipd.classes.location.Location
As you can see, the function returns another object called Location with its own functions that correspond to the properties attached to the Location class in the LinkedEarth Ontology. Let’s get the latitude and longitude information:
lat = geo.getLatitude()
lon = geo.getLongitude()
coord = [lon, lat]
print(coord)
[-162.13, 5.87]
Data#
Let’s access the data contained in PaleoData and load it into a Pandas DataFrame:
data_tables = []
for paleoData in ds.getPaleoData(): # loop over the various PaleoData objects
for table in paleoData.getMeasurementTables(): #get the measurement tables
df = table.getDataFrame(use_standard_names=True) # grab the data and standardize the variable names
data_tables.append(df)
print("There are", len(data_tables), " tables in the dataset")
There are 2 tables in the dataset
data1 = data_tables[0]
data1.head()
|   | d18O | year |
|---|------|---------|
| 0 | 0.39 | 1998.21 |
| 1 | 0.35 | 1998.13 |
| 2 | 0.35 | 1998.04 |
| 3 | 0.35 | 1997.96 |
| 4 | 0.36 | 1997.88 |
Note that the basic information about the variables is stored in the attributes of the DataFrame. The dictionary key for each variable corresponds to the data header.
data_tables[0].attrs
{'d18O': {'@id': 'http://linked.earth/lipd/Ocn-Palmyra.Nurhati.2011.paleo2.measurementTable1.Ocean2kHR_162.d18O',
'archiveType': 'Coral',
'number': 1,
'hasMaxValue': 1.26,
'hasMeanValue': 0.7670059435,
'hasMedianValue': 0.78,
'hasMinValue': 0.07,
'missingValue': 'NaN',
'variableName': 'd18O',
'notes': 'd18Osw (residuals calculated from coupled SrCa and d18O measurements)',
'proxy': 'd18O',
'resolution': {'@id': 'http://linked.earth/lipd/Ocn-Palmyra.Nurhati.2011.paleo2.measurementTable1.Ocean2kHR_162.d18O.Resolution',
'hasMaxValue': 0.09,
'hasMeanValue': 0.08333085502,
'hasMedianValue': 0.08,
'hasMinValue': 0.08,
'units': 'yr AD'},
'hasStandardVariable': 'd18O',
'units': 'permil',
'TSid': 'Ocean2kHR_162',
'variableType': 'measured',
'proxyObservationType': 'd18O',
'measurementTableMD5': '3d028342178e079acb4366bfedf54a77',
'sensorSpecies': 'lutea',
'useInGlobalTemperatureAnalysis': False,
'wDSPaleoUrl': 'https://www1.ncdc.noaa.gov/pub/data/paleo/pages2k/pages2k-temperature-v2-2017/data-version-2.0.0/Ocn-Palmyra.Nurhati.2011-2.txt',
'sensorGenus': 'Porites'},
'year': {'@id': 'http://linked.earth/lipd/Ocn-Palmyra.Nurhati.2011.paleo2.measurementTable1.PYTEBCDC4GO.year',
'archiveType': 'Coral',
'number': 2,
'description': 'Year AD',
'hasMaxValue': 1998.21,
'hasMeanValue': 1942.168336,
'hasMedianValue': 1942.17,
'hasMinValue': 1886.13,
'missingValue': 'NaN',
'variableName': 'year',
'resolution': {'@id': 'http://linked.earth/lipd/Ocn-Palmyra.Nurhati.2011.paleo2.measurementTable1.PYTEBCDC4GO.year.Resolution',
'hasMaxValue': 0.09,
'hasMeanValue': 0.08333085502,
'hasMedianValue': 0.08,
'hasMinValue': 0.08,
'units': 'yr AD'},
'hasStandardVariable': 'year',
'units': 'yr AD',
'TSid': 'PYTEBCDC4GO',
'variableType': 'inferred',
'wDSPaleoUrl': 'https://www1.ncdc.noaa.gov/pub/data/paleo/pages2k/pages2k-temperature-v2-2017/data-version-2.0.0/Ocn-Palmyra.Nurhati.2011-2.txt',
'inferredVariableType': 'Year',
'measurementTableMD5': '3d028342178e079acb4366bfedf54a77',
'dataType': 'float'}}
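Since attrs is a plain dictionary, you can pull out specific metadata programmatically. A small sketch using a trimmed-down copy of the attributes shown above (on the real table you would start from data_tables[0].attrs):

```python
# Trimmed-down stand-in for data_tables[0].attrs (full version shown above).
attrs = {
    "d18O": {"units": "permil", "proxy": "d18O", "archiveType": "Coral"},
    "year": {"units": "yr AD", "archiveType": "Coral"},
}

# Map each variable name to its units; .get() tolerates missing keys.
units = {var: meta.get("units") for var, meta in attrs.items()}
print(units)  # {'d18O': 'permil', 'year': 'yr AD'}
```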
You can use Pyleoclim to plot the data and conduct further analyses:
ts = pyleo.Series(time = data1['year'], value = data1['d18O'],
time_name = 'year', time_unit = 'CE',
value_name = 'd18O', value_unit = 'per mil')
ts.plot()
Time axis values sorted in ascending order
(<Figure size 1000x400 with 1 Axes>,
<Axes: xlabel='Time [years CE]', ylabel='d18O [per mil]'>)
Understanding the relationships among classes#
As mentioned, the classes and methods present in the pylipd.classes module are derived from the LinkedEarth Ontology. You can always refer to it when in doubt. We also provide a handy diagram here illustrating the relationships:
Working with age ensembles#
For the purpose of this demonstration, let’s load the record from Lawrence et al. (2006), which contains an ensemble table, and return it as a Dataset object:
lipd = LiPD()
lipd.load('../data/ODP846.Lawrence.2006.lpd')
ds_ens = lipd.get_datasets()[0]
Loading 1 LiPD files
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00, 1.36it/s]
Loaded..
Now let’s grab the ensemble data information:
df_ens = [] # create an empty list to store all ensemble tables across models.
for cd in ds_ens.getChronData():
for model in cd.getModeledBy():
for etable in model.getEnsembleTables():
df_ens.append(etable.getDataFrame())
Let’s have a look at the resulting DataFrame:
df_ens[0].head()
|   | depth | age |
|---|-------|-----|
| 0 | 0.12 | [4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, ... |
| 1 | 0.23 | [9.03, 8.64, 8.64, 10.58, 7.09, 10.58, 6.71, 9... |
| 2 | 0.33 | [11.74, 11.35, 10.96, 12.12, 10.58, 11.74, 11.... |
| 3 | 0.43 | [13.28, 13.28, 12.51, 13.67, 12.9, 13.28, 13.2... |
| 4 | 0.53 | [14.83, 14.83, 15.22, 15.6, 14.83, 15.6, 14.83... |
As is the case with all EnsembleTables, the resulting DataFrame contains two columns: (1) depth and (2) age. The possible age values for each depth are stored as a NumPy vector.
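This layout makes it easy to reshape the ensemble into a 2D array for further statistics (e.g., a median age model). A sketch with small synthetic stand-in values mimicking the depth/age layout above; on the real table you would start from df_ens[0]:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: each row holds a depth and its ensemble of ages.
df = pd.DataFrame({
    "depth": [0.12, 0.23],
    "age": [[4.0, 4.0, 4.1], [9.0, 8.6, 10.6]],
})

# Stack the per-depth age lists into a (n_depths, n_ensemble_members) array,
# then summarize across ensemble members (axis=1).
age_matrix = np.vstack(df["age"].to_numpy())
median_age = np.median(age_matrix, axis=1)
print(median_age)  # [4. 9.]
```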
The DataFrame also contains relevant metadata information stored as attributes:
df_ens[0].attrs
{'depth': {'@id': 'http://linked.earth/lipd/chron0model0ensemble0.PYTGOFY4KZD.depth',
'number': 1,
'variableName': 'depth',
'hasStandardVariable': 'depth',
'units': 'm',
'TSid': 'PYTGOFY4KZD'},
'age': {'@id': 'http://linked.earth/lipd/chron0model0ensemble0.PYTUHE3XLGQ.age',
'number': '[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419, 
420, 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440, 441, 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480, 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 494, 495, 496, 497, 498, 499, 500, 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519, 520, 521, 522, 523, 524, 525, 526, 527, 528, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539, 540, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558, 559, 560, 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 579, 580, 581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 596, 597, 598, 599, 600, 601, 602, 603, 604, 605, 606, 607, 608, 609, 610, 611, 612, 613, 614, 615, 616, 617, 618, 619, 620, 621, 622, 623, 624, 625, 626, 627, 628, 629, 630, 631, 632, 633, 634, 635, 636, 637, 638, 639, 640, 641, 642, 643, 644, 645, 646, 647, 648, 649, 650, 651, 652, 653, 654, 655, 656, 657, 658, 659, 660, 661, 662, 663, 664, 665, 666, 667, 668, 669, 670, 671, 672, 673, 674, 675, 676, 677, 678, 679, 680, 681, 682, 683, 684, 685, 686, 687, 688, 689, 690, 691, 692, 693, 694, 695, 696, 697, 698, 699, 700, 701, 702, 703, 704, 705, 706, 707, 708, 709, 710, 711, 712, 713, 714, 715, 716, 717, 718, 719, 720, 721, 722, 723, 724, 725, 726, 727, 728, 729, 730, 731, 732, 733, 734, 735, 736, 737, 738, 739, 740, 741, 742, 743, 744, 745, 746, 747, 748, 749, 750, 751, 752, 753, 754, 755, 756, 757, 758, 759, 760, 761, 762, 763, 764, 765, 766, 767, 768, 769, 770, 771, 772, 773, 774, 775, 776, 777, 778, 779, 780, 781, 782, 783, 784, 785, 786, 787, 788, 789, 790, 791, 792, 793, 794, 795, 796, 797, 798, 799, 800, 801, 802, 803, 804, 805, 806, 807, 808, 809, 810, 811, 812, 813, 814, 815, 816, 817, 818, 819, 
820, 821, 822, 823, 824, 825, 826, 827, 828, 829, 830, 831, 832, 833, 834, 835, 836, 837, 838, 839, 840, 841, 842, 843, 844, 845, 846, 847, 848, 849, 850, 851, 852, 853, 854, 855, 856, 857, 858, 859, 860, 861, 862, 863, 864, 865, 866, 867, 868, 869, 870, 871, 872, 873, 874, 875, 876, 877, 878, 879, 880, 881, 882, 883, 884, 885, 886, 887, 888, 889, 890, 891, 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 902, 903, 904, 905, 906, 907, 908, 909, 910, 911, 912, 913, 914, 915, 916, 917, 918, 919, 920, 921, 922, 923, 924, 925, 926, 927, 928, 929, 930, 931, 932, 933, 934, 935, 936, 937, 938, 939, 940, 941, 942, 943, 944, 945, 946, 947, 948, 949, 950, 951, 952, 953, 954, 955, 956, 957, 958, 959, 960, 961, 962, 963, 964, 965, 966, 967, 968, 969, 970, 971, 972, 973, 974, 975, 976, 977, 978, 979, 980, 981, 982, 983, 984, 985, 986, 987, 988, 989, 990, 991, 992, 993, 994, 995, 996, 997, 998, 999, 1000, 1001]',
'variableName': 'age',
'hasStandardVariable': 'age',
'TSid': 'PYTUHE3XLGQ'}}