When LinkedEarth debuted last September, we knew that a strange beast called "ontology" would be involved, but it seemed very abstract. Nearly a year into it, we've already been through several versions of the LinkedEarth ontology, and all the paleoclimatologists on the team (Deborah, Nick and I) have had to untwist their brains a few times. Since an ontology takes so much work to develop, and since we are asking our fellow paleoclimatologists to contribute to ours, one may reasonably ask: what is an ontology, anyway? And what have they ever done for science?
Let's start with the first question. An ontology is a formal way to organize the knowledge that we usually take for granted as experts in a technical field: it's the stuff you most critically need to teach to another human (or to a machine) in order for them to start making sense of what you do. Basically, it encompasses a definition of all the concepts, how they apply to things, and how they relate to other concepts. Still too abstract? Let's take a simple example: climate proxy.
If you ask any paleoclimatologist, they'll tell you that a climate proxy is something they measure that tells them about past climates. Like a legal proxy (someone who acts on someone else's behalf), a proxy is a stand-in for the real thing: we can't measure temperature 100,000 years ago, but we can measure things that were closely related to temperature, and infer temperature from that (with some uncertainty, of course). So if you are going to start an ontology for paleoclimatology, the first order of business is to unambiguously define the term proxy.
As it turns out, this is easier said than done. When you start asking paleoclimatologists what proxy they work with, you will get a variety of responses: "Foraminifera", says one. "Corals", says another. "Alkenones", says a third. In the parlance of Evans et al. (2013), these three terms refer to three separate notions, the three components of a proxy system:
- Sensor: physical, structural, and sometimes biological response of the medium to climate forcing.
- Archive: medium in which the proxy’s sensor reaction is emplaced or deposited.
- Observation: measurement made on the archive, accounting for effects related to sampling resolution in time and/or across replicates, choice of observation type, and age model.
These work together as shown in the figure below.
Let's return to our proxy poll for a minute: foraminifera are marine microorganisms commonly found in carbonate-rich sediments. Their biology and distribution are sensitive to the temperature and chemistry of their environment - they act as sensors of their environment, and are archived in marine sediments. You can measure (observe) various quantities on their beautiful shells, like the ratio of trace metals like magnesium and calcium (Mg/Ca) or the ratio of oxygen isotopes. Both vary with temperature, and you can use them jointly.
"Corals" refer to reef-building coral polyp colonies that we usually associate with Australia's Great Barrier Reef and most tropical marine biodiversity. These organisms also sense their physico-chemical environment and record its variations in the chemistry of their skeleton. The polyp is the sensor, the archive is the skeleton (made of aragonite, a carbonate mineral), and the observations are commonly of oxygen isotopes or trace metals like Strontium/Calcium.
Alkenones are long-chain organic molecules (more precisely, unsaturated ketones) produced by a few phytoplankton species. These species respond to changes in their environment, including changes in water temperature, by altering the relative proportions of the different alkenones they produce. This means that the relative degree of unsaturation of alkenones found in marine sediments can be used to estimate the temperature of the water in which these organisms grew. Alkenone unsaturation is a type of observation you can make on a sedimentary archive. The sensor is the phytoplankton that experienced the temperature change.
Strictly speaking (and when you're building an ontology, you cannot be anything but strict) the three responses we got were referring to different parts of a proxy system. In many cases, you can infer the other two, but not always. "Corals", for instance, allows you to define the sensor and archive, but not the observation. "Alkenone" is neither a sensor, nor an archive, nor an observation, but somehow conveys all three to people in the know; the trouble is, the vast majority of humankind is not in the know, and neither are machines. Therefore, it behooves paleoclimatologists to be very precise about how they use the term proxy, because so much rides on it.
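To make the sensor/archive/observation split concrete, here is a minimal Python sketch of how a proxy system might be written down for a machine. The class name, field names, and string labels are illustrative assumptions, not the actual LinkedEarth ontology; the point is only that each component gets its own slot, so a missing one becomes visible.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ProxySystem:
    """Hypothetical record type: one slot per proxy system component."""
    sensor: Optional[str]       # what responds to the climate forcing
    archive: Optional[str]      # medium in which the response is recorded
    observation: Optional[str]  # what is measured on the archive

    def missing(self) -> List[str]:
        """Names of the components this description leaves unspecified."""
        return [name for name in ("sensor", "archive", "observation")
                if getattr(self, name) is None]


# "Corals" pins down the sensor and the archive, but not the observation.
corals = ProxySystem(sensor="coral polyp",
                     archive="aragonite skeleton",
                     observation=None)

# A fully specified foraminiferal record.
forams = ProxySystem(sensor="foraminifera",
                     archive="marine sediment",
                     observation="Mg/Ca")

print(corals.missing())  # ['observation']
print(forams.missing())  # []
```

Answering "Corals" to the proxy poll then shows up as a record with a blank observation, which is exactly the kind of ambiguity a strict definition is meant to expose.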
I've found the proxy system framework extremely useful in my research (see here, here, here, here, here and here). However, I can't say that I was initially very enthusiastic about this way of thinking. It felt weird. I mentioned that to Michael N. Evans, who, in his great wisdom, told me: "Try to break it". I tried, but found that I couldn't. Despite the dizzying variety of climate proxies, all of them could be subsumed under this framework. Sometimes that required a bit of mental gymnastics, but I couldn't find exceptions [1]. Which is exactly what you want when building an ontology. In the words of my esteemed colleague Nick McKay, ontology-building is about making everyone equally uncomfortable with the terminology.
This played out recently at the Paleoclimate Data Standards workshop. We organized a group activity asking people to define a proxy. They wrestled with it on their own, then we compared their answers to our solution. When they heard it, several participants balked, sometimes vocally; they agreed with the structure, but they found the terminology misleading. After wrestling with it a few more minutes, one colleague declared "I take back what I said. I didn't like the terms, but I was hard pressed to find any better ones".
Bingo. We're onto something here.
Now, hopefully, the reader should start getting an appreciation for what ontology-building entails: Nick, Deborah and I start discussing concepts we think rather obvious, and many hours later we emerge with schematics and definitions that make us equally uncomfortable. Then we take those to our artificial intelligence colleagues (Yolanda, Varun and Daniel), who question them, ask how they relate to similar terms defined in other ontologies, and we iterate until we can't find a way to improve. That is, until a problem arises that requires us to rethink the whole structure.
We are currently in the midst of yet another major change to the ontology. The one currently in place has served us well, being the backbone of the LinkedEarth wiki. But some developments make it necessary to refine it, make it more internally consistent, and better able to link to existing ontologies (e.g. the Semantic Sensor Network, SESAR). So the hair-splitting will continue for some time, but hopefully we'll move on to other hairs very soon.
Now on to the second question: what have ontologies ever done for science, and more to the point, what will they do for paleoclimatology? Ontologies, as I said, are a way to teach machines how to think about a field of knowledge. There are countless examples of how they have vastly accelerated certain fields of science, from genomics to drug discovery.
With a proper paleoclimate ontology, a machine will know that if a certain type of archive is terrestrial (e.g. a tree) and you assign it coordinates that correspond to the ocean, there is a potential problem to be flagged. It can also start reasoning about things: if you tell the system that Siberia is a polygon encompassing a certain number of geolocations, and if you ask for all the tree ring series in Siberia, it will be able to find those even though the only geolocation information you gave about those trees is their latitudes and longitudes, not that they were in Siberia. It can figure this out by itself.
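As a rough illustration of the Siberia query, the sketch below uses a standard ray-casting point-in-polygon test in plain Python. The rectangular "Siberia" outline, the site names, and the record layout are all made up for the example, and treating longitude and latitude as flat coordinates is a simplification; this is not how the LinkedEarth platform itself does it.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (longitude, latitude) in degrees


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray casting: count polygon edges crossed by a ray heading east."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the point's latitude?
        if (y1 > y) != (y2 > y):
            # Longitude at which the edge crosses that latitude.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical, very crude outline of Siberia (a lon/lat rectangle).
SIBERIA = [(60.0, 50.0), (180.0, 50.0), (180.0, 75.0), (60.0, 75.0)]

# Hypothetical records: a name, an archive type, and coordinates only.
records = [
    {"name": "site_A", "archive": "tree", "lon": 100.0, "lat": 62.0},
    {"name": "site_B", "archive": "tree", "lon": -120.0, "lat": 45.0},
    {"name": "site_C", "archive": "marine sediment", "lon": 130.0, "lat": 72.0},
]

siberian_trees = [r["name"] for r in records
                  if r["archive"] == "tree"
                  and point_in_polygon((r["lon"], r["lat"]), SIBERIA)]
print(siberian_trees)  # ['site_A']
```

None of the records says "Siberia"; the region is recovered from coordinates alone. The same test against an ocean polygon could be used to flag a tree whose coordinates fall at sea.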
We're exploring other possibilities, such as bots making educated guesses about missing information in the database and flagging them for experts to check. For instance, if the "Proxy" listed is 'alkenones', a bot would infer that the sensor is a phytoplankton species of the class Prymnesiophyceae, the archive is a marine or lacustrine sediment, and the observation is likely to be UK'37. Obviously, the inference is not perfect, but it's a lot better than having blank categories, which is currently the case for a lot of online paleoclimate records. We believe that bots doing the groundwork that humans can later check and refine will be the best and fastest way to move forward on this, freeing scientists from low-level data management so they can spend more time on the actual science.
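A bot of this kind could start out as little more than a lookup table. The sketch below is a hypothetical illustration in Python: the table, the field names, and the `needs_review` flag are assumptions made for the example, and only the alkenone entry is taken from the paragraph above.

```python
# Hypothetical defaults table; only the alkenone entry comes from the text.
PROXY_DEFAULTS = {
    "alkenones": {
        "sensor": "phytoplankton (class Prymnesiophyceae)",
        "archive": "marine or lacustrine sediment",
        "observation": "UK'37",
    },
}

FIELDS = ("sensor", "archive", "observation")


def fill_blanks(record: dict) -> dict:
    """Return a copy of the record with blank fields guessed and flagged."""
    filled = dict(record)
    proxy = (record.get("proxy") or "").lower()
    guesses = PROXY_DEFAULTS.get(proxy, {})
    needs_review = []
    for field in FIELDS:
        # Only fill fields that are empty; never overwrite what a human entered.
        if not filled.get(field) and field in guesses:
            filled[field] = guesses[field]
            needs_review.append(field)
    filled["needs_review"] = needs_review
    return filled


record = {"proxy": "Alkenones", "archive": "marine sediment",
          "sensor": "", "observation": None}
result = fill_blanks(record)
print(result["archive"])       # marine sediment
print(result["needs_review"])  # ['sensor', 'observation']
```

The archive entered by a human is left untouched, while the guessed sensor and observation are listed in `needs_review` so that an expert can confirm or correct them.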
We'll be in touch when the bots are starting to do something useful. In the meantime, the LinkedEarth team might have to argue a few more times about what a proxy truly is.
Notes
1. An interesting case is that of documentary records of past climates, like the times of harvest or the number of weeks per year Iceland's main port was ice-bound. These can be highly predictive of climate state, so they should definitely be taken into account. Fitting them into this framework is a bit uncomfortable, but not altogether impossible. In the time of harvest example, that's the observation. The archive is whatever written document this was recorded in (often, a parish's registry), and the sensor is... weird: society? Something like that.