Every innovation starts with a profound dissatisfaction with the status quo.
I am a climate scientist interested in understanding how and why climate changes on time scales of years to millennia. Early in my PhD, this led me to analyze paleoclimate data that I got from colleagues or the Web, and working with those files generated intense frustration. Here's what used to grind my gears the most:
- there was no universal format. Some data came as raw text files, others as Excel spreadsheets, and others as txt files from NOAA (now NCEI) Paleo, formatted more or less consistently, which meant that a variable-size header had to be removed to access the data tables themselves. No consistency whatsoever.
- there was no consistency about missing data representation. Some missing values were marked as empty (blank) entries. Others were marked as NaNs (my preference). Others were marked as 0, or -99 or -99999. One day I even saw a dataset with values ranging from -50 to -200 with missing values marked as -99. No consistency whatsoever, and some absurdity in places.
- there was no common vocabulary. A common type of measurement made on proxy archives is the stable oxygen isotope composition of some material (e.g. ice, calcite, aragonite, cellulose), written δ¹⁸O. One day I counted no fewer than 9 different ways to report it: d18O, delta18O, dO18, delO18, you name it. The units? permil, per mille, 0/00, ‰, etc. What about the isotopic standard? Some datasets were explicit about that (VPDB, VSMOW), others left you wondering. No consistency whatsoever.
- there was no minimal metadata. Some datasets came with exhaustive metadata, others came with very terse metadata, which made it impossible to properly digest the data's content without having to email the people who collected the data. Even then, as in all forms of human communication, there was a potential for misunderstanding (e.g. my closest collaborator and I use the words "parameter" and "variable" in slightly different ways). No consistency whatsoever.
The result is that I had to spend a lot of time writing code that could handle all these exceptions. If a dataset came along that made choices I had not encountered before, it would break my code and I would have to include more exceptions to accommodate it. Given all the problems in today's world, you might say these are first-world problems. Here's why it's bigger than me: I once read a statistic that geoscientists spend up to 40% of their time formatting data into a form they can parse, or exporting it to another form so somebody else can parse it. For parts of my postdoc that was closer to 80%. Think about that for a minute: PhD-level scientists spending a significant chunk of their precious time doing menial tasks at the expense of doing actual research, teaching, or outreach (i.e. what academic institutions, mission agencies, or other employers are paying us to do).
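To make the list above concrete, here is a minimal sketch of the kind of defensive parsing code that this lack of conventions forces on a data consumer. It is an illustration only, not the code I actually used: the function names, the sentinel list, and the canonical labels `d18O` and `permil` are all assumptions chosen for the example.

```python
import math
import re

# Hypothetical sentinel values, taken from the examples in the list above.
# 0 is deliberately left out because it is often a legitimate measurement.
MISSING_SENTINELS = {-99.0, -99999.0}

# Spellings mentioned above, lowercased, mapped to one assumed canonical label.
D18O_ALIASES = {"d18o", "delta18o", "do18", "delo18", "δ18o"}
PERMIL_ALIASES = {"permil", "per mille", "0/00", "‰"}


def parse_value(raw, sentinels=MISSING_SENTINELS):
    """Turn one raw table entry into a float, using NaN for missing data."""
    text = str(raw).strip()
    if text == "" or text.lower() == "nan":
        return math.nan
    value = float(text)
    # A value equal to a known sentinel is treated as missing.
    return math.nan if value in sentinels else value


def canonical_variable(name):
    """Map the known spellings of the oxygen isotope ratio to 'd18O'."""
    key = re.sub(r"[\s_\-]", "", name).lower()
    return "d18O" if key in D18O_ALIASES else name


def canonical_unit(unit):
    """Map the known spellings of the per mille unit to 'permil'."""
    return "permil" if unit.strip().lower() in PERMIL_ALIASES else unit
```

Note that the sentinel set has to be overridable per dataset: for a record whose real values range from -50 to -200, treating -99 as missing would silently discard valid data, so the caller would pass `sentinels=set()` (or whatever that file actually uses). That is exactly the sort of exception that kept piling up.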
"Don't complain", I was told. "It used to be worse". It used to be that no data was available online. You had to send a letter to an investigator and hope they might consent to share their data with you (never mind that the data had been gathered using public monies, and therefore belonged to the public). They might never share it with you for fear that you might find something different from what they had published (which, arguably, is precisely why they should share it). Sure, it was tremendous progress that an increasing number of investigators believed in open science enough that they wanted to share their data with the world at large. There's a whole lot to love and celebrate about that, and if you read this post you probably are one of these champions of open science, so kudos to you.
But we live in an age when you can ask your phone to find you a vegan taco within a specified radius (and get the answer within seconds), so it seemed like spending 40% of my day getting my hands on a paleoclimate record in a useful form was not the best use of my (or anyone's) time. In 2013 it seemed possible to improve on that without having to be Apple or Google.
Consider a world without standards. Actually, none of us can: everything about modern life, from cellphones and GPS to credit cards, the metric system, and the Web, requires some form of standard. A world without standards would seem medieval.
Paleoclimate data is still in a medieval age: no matter how refined the methods, how clever the investigators, how sophisticated the instrumentation, if all you do with it is dump it into a one-of-a-kind Excel spreadsheet, you're missing out on modernity in a very major way. And if, like me, you are a consumer of paleoclimate data, it takes so much time to find, download, format and analyze data that it feels like a huge waste: you've exhausted a lot of your precious time on menial tasks that really don't teach you anything, and detract from a lot of other worthy pursuits. It's a very bad use of taxpayer money, and a frustrating way to spend a life.
This situation isn't the result of malevolence on anybody's part. No one purposefully sat down and decided "let's make this field a complete smorgasbord of formats, a carnival of conventions". The problem, in fact, is precisely this lack of intentionality: no one ever sat down and tried to come up with a system to organize data from ice cores, corals, sediment cores, trees, speleothems, boreholes, and figure out a way to store them so they could be immediately intelligible to machines and to humans. Climate modelers had that for instrumental data and model output: the netCDF standard. Why couldn't we?
I fumed silently for years, making my own pet format in Matlab. It enabled me to do some pretty cool science, but ultimately it was a dead end: Matlab is proprietary, and my data was not findable by search engines. I then spoke to an aikido friend of mine, Jason Eshleman, who turned out to work with biomedical databases. He said: "Sounds like you need the Semantic Web!". That's when I began to learn about ontologies, RDF, and other web concepts. Jason and I published an article about how to make this happen for paleoclimatology, and I am pleased to say that this pipe dream is now becoming a reality.
During the 2011 Fall AGU, Nick and I met with a prominent paleoceanographer to participate in the Ocean2k project (a subset of the PAGES2k project). We both thought the project was great, but that it would be unnecessarily complicated without two things: a common way to store paleoclimate data (i.e. a data standard), and a platform through which project members could all collectively edit and curate a database of consistently formatted paleoclimate datasets. We got to work on the first part, which is now published here. Nick's group took the ideas of the Emile-Geay & Eshleman 2013 paper to a whole new level, culminating in the Linked Paleo Data format, which we're very proud of. It's supported the PAGES2k project in a fundamental way (more on this soon). The second part, however, was far beyond our capabilities. That's where Yolanda comes in.
In October 2012, an NSF project manager told me that EarthCube might be a good place to take my obsession with paleo tech to the next level. Later that month I ambled into an EarthCube workshop in DC, not knowing what to expect. I met a lot of interesting people, and resonated particularly with Yolanda. She was the first computer scientist I met who didn't think my problem was either too trivial or too complicated, who was genuinely interested in helping me solve it, and who had the technical capability to do so. She happened to work at the same university (USC), though without this EarthCube meeting we might never have met. We stayed in touch and started to learn about what each other did.
Three years and one turned-down proposal later, the three of us have willed the LinkedEarth project into existence. We are excited to have planted the seeds, and have brought wonderful people on board to take it one step further. We are very grateful that NSF funded us to lead this effort on behalf of our community.
The next chapters will depend on you. If you are as frustrated as I am with the medieval state of paleoclimate data, and if you are as excited as we are about a future where you can search, compare and analyze paleoclimate data in the Cloud using state-of-the-art techniques, this is the place for you. Together, we can build something grand. Together, we can build a Web of paleoclimate data that will revolutionize the field and how it connects to adjacent domains: climate dynamics, hydrology, archeology, geochemistry, glaciology. We're not after the small stuff here. We're building something great to enable everyone to do better science. Will you join us?