LiPD vs LinkedEarth

AGU brought its own share of (welcome!) community feedback. One recurring theme is the lingering confusion about the difference between LiPD and LinkedEarth, and I hope this blog post will clarify it.

Unzipped LiPD file showing two csv files (containing the data) and one JSON-LD file (containing the metadata).

LiPD (pronounced: lipid) stands for Linked Paleo Data, a self-describing, machine-independent data format created by Nick McKay and Julien Emile-Geay (you can read all the pesky technical details here). At its core, a LiPD file is nothing more than a zipped folder (yes, you can open it with an unzipper on your computer; on a Mac the unzip command will do the trick!) that contains the data in csv format and the metadata in JSON-LD (as well as some tidbits to ensure accurate data accounting).

What is JSON-LD? JSON-LD (or JavaScript Object Notation for Linked Data) is a method of encoding linked data using JSON, itself a lightweight data-exchange format that is widely popular because it is easy for humans and computers to read and parse. But you don’t need to know anything about JSON to read, write, and manipulate a LiPD file. For that, Chris and Nick have developed utilities in Python, R, and Matlab. So whatever your favorite coding language is, the LiPD file will follow.
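To make the "zipped folder" idea concrete, here is a minimal sketch that uses only the Python standard library, not the LiPD utilities. It builds a toy archive in memory and then inspects it the way you might inspect a real LiPD file. The file names and metadata keys are made up for illustration; a real LiPD file follows the naming conventions of the format.

```python
import io
import json
import zipfile


def build_toy_lipd() -> bytes:
    """Build a tiny in-memory zip that mimics the layout described above.

    File names and metadata keys are hypothetical, chosen for illustration.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "metadata.jsonld",
            json.dumps({"dataSetName": "ToyRecord", "archiveType": "coral"}),
        )
        archive.writestr("paleo_measurements.csv", "1950,0.12\n1951,0.15\n")
        archive.writestr("chron_measurements.csv", "10,1950\n20,1940\n")
    return buffer.getvalue()


def summarize_lipd(raw: bytes) -> dict:
    """List the csv files and load the JSON-LD metadata from a zipped folder."""
    with zipfile.ZipFile(io.BytesIO(raw)) as archive:
        names = archive.namelist()
        csv_files = sorted(n for n in names if n.endswith(".csv"))
        metadata_files = [n for n in names if n.endswith(".jsonld")]
        # JSON-LD is still JSON, so the standard json module can parse it.
        metadata = json.loads(archive.read(metadata_files[0]))
    return {"csv_files": csv_files, "metadata": metadata}


summary = summarize_lipd(build_toy_lipd())
print(summary["csv_files"])
print(summary["metadata"]["archiveType"])
```

In practice you would let the LiPD utilities do this for you; the point is only that nothing exotic is hiding inside the file.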

Think of LiPD as “netCDF for paleo”. In the 1990s most climate modelers agreed that it was positively ridiculous to have to deal with climate model output in myriad formats, and UNIDATA helped birth netCDF (network Common Data Form), without which model intercomparison projects like CMIP would likely not exist. If netCDF is not part of your reality, think of LiPD as a replacement for the .mat files, endless text/csv files, or Excel spreadsheets that you use during the life cycle of your project. I already hear some of you asking: “I’m perfectly happy with my obsolete formats, why should I invest time reformatting my data?”

First and foremost, LiPD files are standardized, making it easy to write code around them. Second, that same standardization allows for rapid exchange between researchers. The recipient will most likely have some piece of code waiting for the data in the LiPD format. If LiPD were universally adopted, no one would ever have to reshape incoming data to fit their own homebrewed way of organizing it, allowing researchers to spend more time on what they really want to do: paleoclimate science. Third, LiPD allows the metadata to travel with the data, so there is minimal ambiguity about its provenance or interpretation. (I say minimal, because no data format can encapsulate all the nuances of a paleoclimatologist’s brain.)

Map of all records contained in a LiPD library by archive type

How difficult is it to create LiPD files? No more difficult than it is to format your data for NOAA or PANGAEA (see this page). What properties need to be filled out from this template or on the online lipidifier? Well, that depends on what you need to do with the files and the codes you’ve written around them. For instance, a map of all the records in your LiPD library by archive type such as the one shown on the left obviously requires latitude/longitude and archive type information. But you can’t do too much more without more metadata.

View of the contents of a LiPD file from the function viewLipd() in the Python utilities. Each level corresponds to a level into the JSON-LD file and a Python dictionary. Essentially, in LiPD all the metadata is stored in a series of nested dictionaries.

Of course, you can create a skinny file to start with, and update it as you go. LiPD files can also be updated through Matlab, R and Python with the help of the LiPD utilities. For instance, in Python, the LiPD file can be viewed quickly through the command viewLipd(), as seen on the right. Each level in the hierarchy is represented by a Python dictionary. Adding metadata is as simple as updating the key/value pair in the corresponding dictionary and saving the file using the command writeLipd(). The utilities also contain code to help you incorporate ensemble, summary, and probability tables that can be generated through the use of Bchron or Bacon, or other age modeling software within GeoChronR.
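The "nested dictionaries" idea can be sketched in plain Python. This is not the LiPD utilities API: the helper set_nested and the keys used here (geo, siteName, pub) are assumptions made for illustration, and with the real utilities you would save the result with the writeLipd() command mentioned above.

```python
import copy
import json

# A toy metadata record; the keys are hypothetical stand-ins for LiPD properties.
record = {
    "dataSetName": "ToyRecord",
    "archiveType": "coral",
    "geo": {"latitude": -17.5, "longitude": 149.8},
}


def set_nested(metadata, path, value):
    """Return a copy of metadata with value set at the nested key path.

    Intermediate dictionaries are created when they do not exist yet.
    """
    updated = copy.deepcopy(metadata)
    level = updated
    for key in path[:-1]:
        level = level.setdefault(key, {})
    level[path[-1]] = value
    return updated


# Add a property to an existing level, then create a new level entirely.
updated = set_nested(record, ["geo", "siteName"], "Hypothetical Reef")
updated = set_nested(updated, ["pub", "author"], "A. Researcher")

print(json.dumps(updated, indent=2))
```

Working on a copy keeps the original record untouched, which is convenient when you want to compare the skinny file with its enriched version before saving.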

Speaking of which, the LinkedEarth team has also been developing software packages for the analysis of paleoclimate data in R and Python. These packages have their own requirements regarding the properties that need to be present in LiPD files. For instance, GeoChronR expects the location information (latitude and longitude) to be present in the file.

This brings us to the intersection between LiPD and LinkedEarth. First and foremost, LinkedEarth is a technology incubator. We’ve developed a format (LiPD), a platform (the LinkedEarth wiki), some code (GeoChronR, Pyleoclim, LiPD utilities) and fostered a community-wide discussion on standards, soon to result in an actual standard. The LinkedEarth platform (“wiki”) is a portal to a database that you can think of as the cloud version of LiPD. When the data are in a LiPD format, they are incredibly re-usable, but they are still in a silo. When they are on LinkedEarth, they join billions of other RDF triples in the web of data, and are visible to search engines and other querying mechanisms.

Because of this, LinkedEarth has its own set of rules regardless of the completeness of a LiPD file. Think of it as the difference between your working Excel file and the Excel template that you can submit to NOAA paleo. Both are in Excel, but the information contained in each file is different. Your own file may not have much in terms of metadata but NOAA has some requirements (e.g. latitude and longitude). Similarly, LinkedEarth uses LiPD as the core exchange format but expects some information to be filled out. The specific properties are highlighted in red in the following template. The online version has a nifty toggle to check if the LiPD file is ready to be placed on the wiki.

Screenshot of the online lipidifier. The toggle to check for wiki-required properties is circled in red, with a validation warning pop-up explaining the details.
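A readiness check of this kind can be sketched in a few lines of Python. This is not the lipidifier's actual logic: the list of required properties below is a hypothetical stand-in (only latitude and longitude are named in this post as typical requirements), and the function missing_properties is made up for illustration.

```python
# Hypothetical list of required properties, written as paths into the
# nested metadata dictionaries. The real wiki requirements may differ.
REQUIRED_PATHS = [
    ("dataSetName",),
    ("archiveType",),
    ("geo", "latitude"),
    ("geo", "longitude"),
]


def missing_properties(metadata, required_paths=REQUIRED_PATHS):
    """Return the dotted names of required properties absent from metadata."""
    missing = []
    for path in required_paths:
        level = metadata
        for key in path:
            if not isinstance(level, dict) or key not in level:
                missing.append(".".join(path))
                break
            level = level[key]
    return missing


# A "skinny" record that is a valid dictionary but not yet wiki-ready.
skinny = {"dataSetName": "ToyRecord", "geo": {"latitude": -17.5}}
problems = missing_properties(skinny)
print(problems)
```

An empty list would mean the record carries everything on the hypothetical required list; anything else tells you exactly which properties to fill in.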

Who decided on the required properties? So far, the LinkedEarth team has identified the properties required for the optimal functioning of the wiki, but the goal is to enable the community to decide the necessary properties for the records to be archived on the LinkedEarth platform. In 2016, a workshop on paleoclimate data standards served as the focal point to initiate this decision process. Workshop participants distinguished a set of essential, recommended, and desired properties for each dataset. By default, all metadata properties are desired. This decision stems from the belief that if the information is available, it should be kept with the dataset. A subset of these properties is recommended, meaning they will ensure sensible reuse of the dataset. Yet another subset is deemed essential. Without these precious pieces of information, the dataset is useless.

A consensus emerged that these levels should be archive-specific, as what is needed to intelligently re-use annually resolved marine records could be quite different from what is needed to intelligently re-use ice core records, for instance. It was therefore decided that archive-centric working groups (WGs) would be best positioned to elaborate and discuss the components of a data standard for their specific sub-field of paleoclimatology. These WGs carried out their discussions on the LinkedEarth online platform, providing the foundation for a preliminary standard that could be voted on by the rest of the community. Votes were open on the LinkedEarth wiki, through our Twitter account, and through a community survey that was distributed in the fall of 2017.

How far are we in the process? I’m currently writing a paper describing these standards, which will be distributed to the community for additional input before submission to a peer-reviewed journal (to be decided). Once the paper is accepted, the essential properties identified by the community will be implemented in the check of the online lipidifier. So you can think of LiPD as the bones, and the standards as the labels we attach to them.

While the current round of EarthCube funding for LinkedEarth has ended, LiPD continues to gain traction, and will continue to evolve to meet the needs of the community. For instance, the NOAA/World Data Service for Paleoclimatology (“NOAA paleo”) now accepts LiPD as a submission format. We are working with PANGAEA to do the same. Pyleoclim and GeoChronR “speak” LiPD natively. All the PAGES2k 2017 code is based on consistently formatted LiPD files. Iso2k is also relying on consistently formatted LiPD files, and a growing number of PAGES working groups are adopting LiPD as their format of choice to preserve paleoclimate information. So be reassured: your investments in LiPD will not be wasted. LiPD is the way of the future, because it enables the meta-studies that people want to do, in a way that cannot be done efficiently with any other format. So while LinkedEarth may or may not live on as a project, we are confident that its legacy will live on, as more and more people create, curate, and work with LiPD files.

I hope this clarified the difference between LiPD and LinkedEarth!

Deborah, on behalf of the LinkedEarth team
