Ontology

A major goal of the LinkedEarth project is the LinkedEarth ontology. In short, the ontology allows us not only to define terms commonly used to describe a paleoclimate dataset (e.g., variable, uncertainty, calibration) but also to specify the relationships among these terms (e.g., a variable has an uncertainty). As such, it allows us to make inferences, support complex queries, and perform quality control on the data.

Remember that no formal knowledge about ontologies is required to use and contribute to the LinkedEarth wiki.

The triple consists of a subject (the dataset), a property (hasName), and an object (the name of the dataset, WesternPacific_Khider_2014).

When representing the knowledge of a domain like paleoclimatology, we can usually distinguish the things that we want to describe (i.e., concepts like a dataset, a variable, etc.) from the relationships used to describe those concepts (e.g., the name of the dataset, the value of the variable, etc.). As shown in the figure above, we can use a graph-based representation to encode the information in a set of triples.

Each triple has a subject (i.e., what we want to describe), a property (i.e., the element describing the subject), and an object (i.e., the value used to describe the subject).
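As a minimal sketch, a triple can be written as a three-element tuple in Python. The property `hasName` and the dataset name `WesternPacific_Khider_2014` come from the example above; the subject identifier `dataset1`, the field names, and the helper `objects_of` are hypothetical names chosen for illustration and are not part of the LinkedEarth ontology.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    subject: str  # what we want to describe
    prop: str     # the element describing the subject
    obj: str      # the value used to describe the subject


# "dataset1" is a hypothetical identifier for the dataset being described.
triples = [Triple("dataset1", "hasName", "WesternPacific_Khider_2014")]


def objects_of(graph: List[Triple], subject: str, prop: str) -> List[str]:
    """Return every object attached to `subject` through `prop`."""
    return [t.obj for t in graph if t.subject == subject and t.prop == prop]


print(objects_of(triples, "dataset1", "hasName"))
```

Because every statement has the same three-part shape, a single lookup function is enough to answer "what is the name of this dataset?" or any other question of the same form.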

Different concepts may be linked to each other using properties. For example, a dataset contains a data table, which contains several variables. The properties and concepts for a domain are often defined as ontologies. An ontology is defined as a "formal specification of a shared conceptualization". Ontologies represent consensual knowledge that helps a community describe the concepts of the domain using a common representation. A feature of ontologies is that they are machine-readable, i.e., they allow machines to understand the domain in the way the creators of the ontology have defined it. Thanks to the ontology, machines can navigate through data and discover data that would otherwise be hidden to them.

A dataset contains a data table with several variables.

This enables batch processing of data that would otherwise require a large number of (wo)man-hours.
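To illustrate how a machine might navigate such linked data, here is a small sketch that follows the links from a dataset to its data table and on to its variables. The property name `contains` and the identifiers `dataset1`, `dataTable1`, `variableA`, and `variableB` are assumptions made for this example; the actual LinkedEarth property names may differ.

```python
from collections import deque

# Hypothetical triples: a dataset contains a data table with several variables.
triples = [
    ("dataset1", "contains", "dataTable1"),
    ("dataTable1", "contains", "variableA"),
    ("dataTable1", "contains", "variableB"),
    ("dataset1", "hasName", "WesternPacific_Khider_2014"),
]


def contained_in(graph, start):
    """Breadth-first walk along 'contains' links, starting from `start`."""
    found = []
    seen = {start}  # guards against cycles in the graph
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for subject, prop, obj in graph:
            if subject == node and prop == "contains" and obj not in seen:
                seen.add(obj)
                found.append(obj)
                queue.append(obj)
    return found


print(contained_in(triples, "dataset1"))
```

The same function can be run over many datasets in a loop, which is the kind of batch processing described above.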

The LinkedEarth Ontology

The ontology is organic, meaning that it is designed to “grow” as more and more records are added to the LinkedEarth wiki and researchers need to define new terms or redefine existing ones.

The LinkedEarth ontology is divided into several components:

  • The LiPD Ontology
  • The Proxy Archive Ontology
  • The Proxy Observation Ontology
  • The Proxy Sensor Ontology
  • The Instrument Ontology
  • The Inferred Variable Ontology

The first version of the LinkedEarth ontology has officially been accepted at the workshop on paleoclimate data standards. 

The LiPD Ontology

As its name indicates, this part of the LinkedEarth ontology was developed from the LiPD format championed by Nick McKay and Julien Emile-Geay. However, unlike LiPD, the LinkedEarth ontology defines relationships among the terms, enabling the reasoning necessary to make a true knowledge base.

The LiPD ontology is designed to be flexible and provides a backbone for the development of the LinkedEarth wiki. As such, many of the terms are critical for the functioning of the wiki and the various codes built around it. These terms are considered "core" and are identified on the wiki with a copyright sign. Changes to these core categories must therefore be approved by the Editorial Board. To suggest changes to a term in the core ontology, start a discussion on the category or property page. Once a community consensus has been reached, use this form to contact the Editorial Board about the change.

This "core" ontology also serves as the backbone for the crowd-sourced part of the ontology (the other modules). For instance, we have defined a class “ProxyObservation”, following the definition of Evans et al. (2013), that groups all the properties common to paleoclimate observations. However, instances of this class (e.g., stable isotopes, trace elements, radioisotopes) are not explicitly defined in the core ontology and are instead part of the crowd-sourced ontology.
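The split between core and crowd-sourced terms can be pictured with a short sketch. The class name `ProxyObservation` and the three example terms come from the text above; the registry `crowd_terms` and the functions `add_crowd_term` and `is_kind_of` are hypothetical and do not reflect how the wiki is actually implemented.

```python
# Core classes are fixed; changing them requires Editorial Board approval.
CORE_CLASSES = {"ProxyObservation"}

# Crowd-sourced terms, each attached to the core class it falls under.
crowd_terms = {}


def add_crowd_term(term, parent):
    """Register a crowd-sourced term under an existing core class."""
    if parent not in CORE_CLASSES:
        raise ValueError(f"{parent!r} is not a core class")
    crowd_terms[term] = parent


def is_kind_of(term, core_class):
    """Return True if `term` was registered under `core_class`."""
    return crowd_terms.get(term) == core_class


for term in ("stable isotopes", "trace elements", "radioisotopes"):
    add_crowd_term(term, "ProxyObservation")

print(is_kind_of("trace elements", "ProxyObservation"))
```

In this picture the community can add terms freely, while the set of core classes they attach to stays stable.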

The Proxy Archive Ontology

The Proxy Archive Ontology defines the different categories of archive types used in paleoclimate studies (such as marine sediment, coral, etc.) following the definition by Evans et al. (2013). This ontology is the product of an evolving community effort.

The Proxy Observation Ontology

The Proxy Observation Ontology defines the various proxy observations made on the proxy archives following the definition by Evans et al. (2013). This ontology is also a product of an evolving community effort.

The Proxy Sensor Ontology

The Proxy Sensor Ontology defines the various types of proxy sensors following the definition of Evans et al. (2013). This ontology is also a product of an evolving community effort.

The Instrument Ontology

The Instrument Ontology aims to define the various types of instruments used to produce the proxy observations. This ontology is also crowd-sourced.

The Inferred Variable Ontology

The Inferred Variable Ontology aims to provide a taxonomy of the various inferred variables. This ontology is also crowd-sourced.

Background on Ontologies

Before we can talk about ontologies, we need to define other terms and concepts. Some of these were introduced in the section above, but we describe them in more detail here. First, the LinkedEarth ontology and the LiPD format both represent a way to organize the data and, most importantly, the metadata associated with a paleoclimate dataset in a way that is machine-readable.

What is semantic data?

To answer this question, we first need to make the distinction between data and metadata. Metadata is “data that provides information about other data.” In other words, metadata is information that provides the context needed to understand what the data represents. For instance, providing a name for each of the variables in a paleo dataset is key to using the data. Semantic metadata is metadata that is structured and explicit.

Metadata types: structured and explicit metadata vs. unstructured and not explicit metadata.

Metadata facilitate reuse by others, support queries on data repositories, explain a data analysis by providing context for the data, and enable automated data integration.

Metadata fall into three categories: descriptive metadata (e.g., location, collection procedure), data characteristics (e.g., size of the dataset, statistical properties), and provenance metadata (e.g., instruments, method or software used to generate the data).
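As a rough sketch of what "structured and explicit" can mean in practice, the three categories can be stored under named keys so that a program can check whether a record is complete. The key names, the placeholder values, and the function `missing_categories` are all hypothetical; this is not the LiPD layout.

```python
# The three metadata categories described above (key names are assumptions).
METADATA_CATEGORIES = ("descriptive", "characteristics", "provenance")

# A hypothetical structured record; every value is a placeholder.
record = {
    "descriptive": {
        "location": "example site",
        "collection_procedure": "example procedure",
    },
    "characteristics": {"size": 100},
    "provenance": {
        "instrument": "example instrument",
        "software": "example software",
    },
}


def missing_categories(metadata):
    """List the categories that are absent or empty in `metadata`."""
    return [c for c in METADATA_CATEGORIES if not metadata.get(c)]


print(missing_categories(record))
print(missing_categories({"descriptive": {"location": "example site"}}))
```

The same information written as a free-text sentence would need to be parsed by a human, whereas the structured form supports automated queries and completeness checks.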

A metadata vocabulary is the set of terms used to describe metadata. For instance, the Dublin Core is a list of terms used to describe web resources (video, images, web pages, etc.), as well as physical resources such as books or CDs, and objects like artworks. A vocabulary is designed based on its broad applicability and how well it supports uses of the metadata. If the vocabulary is agreed upon by a community and is adopted for structured metadata, then it becomes a metadata standard. For instance, LiPD is a new metadata vocabulary that describes a paleoclimate dataset. The next task is to represent our knowledge about the metadata.

What is an ontology?

An ontology is a form of knowledge representation. It contains more information than just a vocabulary (e.g., LiPD). Ontologies are a formal way to name and define classes (concepts), properties, and the interrelationships between them. The different components of an ontology are:

Ontology example: a class and a property from the LinkedEarth Ontology. The domain of the property "measuredBy" is "MeasuredVariable" while its range is "Instrument".
  • Classes (types) of objects. A class corresponds to a concept. For instance, paleoclimatic information is derived from a ClimateProxy. Following the definition of Evans et al. (2013), a ClimateProxy comprises three components: the Sensor, the Archive, and the Observations. “ClimateProxy”, “Sensor”, “Archive”, and “Observations” are classes used to describe knowledge about the paleoclimate dataset.
  • Instances of those classes are occurrences of a specific class. For instance, "marine sediment" is an instance of the class “Archive”, "planktonic foraminifera" is an instance of the class “Sensor”, and "Mg/Ca" is an instance of the class “Observations”. Instances can belong to more than one class.
  • Properties are characteristics of the knowledge being represented. They represent the relations between entities. For instance, the Mg/Ca “Observations” have properties “name”, “values”, “units”.
    • Property domain and range. The domain corresponds to the class that the property applies to. The range corresponds to the class of the property’s value. For instance, the property ‘measuredBy’ has domain ‘MeasuredVariable’ and range ‘Instrument’.
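The components listed above can be sketched in a few lines of Python. The class names, the three instances from the text, and the property `measuredBy` with its domain and range come from this page. Treating "Mg/Ca" as also belonging to `MeasuredVariable`, the instance "mass spectrometer", and the function `check_statement` are assumptions added for illustration. Note also that OWL reasoners typically use domain and range to infer class membership rather than to reject statements, so the check below is a simplified illustration of quality control, not how OWL itself behaves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Property:
    name: str
    domain_class: str  # class the property applies to
    range_class: str   # class of the property's value


measured_by = Property("measuredBy", "MeasuredVariable", "Instrument")

# Instances mapped to their classes; an instance may have several classes.
instance_classes = {
    "marine sediment": {"Archive"},
    "planktonic foraminifera": {"Sensor"},
    "Mg/Ca": {"Observations", "MeasuredVariable"},  # second class is assumed
    "mass spectrometer": {"Instrument"},            # hypothetical instance
}


def check_statement(subject, prop, obj):
    """Return a list of domain/range problems for one statement."""
    problems = []
    if prop.domain_class not in instance_classes.get(subject, set()):
        problems.append(f"{subject!r} is not a {prop.domain_class}")
    if prop.range_class not in instance_classes.get(obj, set()):
        problems.append(f"{obj!r} is not a {prop.range_class}")
    return problems


print(check_statement("Mg/Ca", measured_by, "mass spectrometer"))
print(check_statement("marine sediment", measured_by, "Mg/Ca"))
```

The first statement passes both checks, while the second is flagged twice because its subject is not a `MeasuredVariable` and its object is not an `Instrument`.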

The LinkedEarth ontology was written using the Web Ontology Language (OWL), which is a standard knowledge representation language for the web.