Digital Data Cry Out — Save Me!

Digital. It’s the darling of science and nearly everything else. Publishing online means going paperless. Good for the environment, right? Presenting findings in digital form means data can be shared effortlessly and nearly instantaneously between collaborators over long distances. Digital data can be collected and stored at very little cost — good for tight research budgets, right?

What none of these issues confronts is the fragility of digital files. How and where will you store them? How long will the stores survive? Who will manage them, for how long, at what cost, and for which potential users?

Although these are far from trivial questions, they’re on the radar screens of few scientists and engineers today. But they should be front and center, argues Sarah Higgins of the University of Edinburgh, a member of the Digital Curation Centre in Britain — a virtual consortium of researchers at four separate institutions.

This morning, Higgins led a workshop, here in Pittsburgh, on the opening day of the Joint Conference on Digital Libraries. She walked a small group of librarians, for want of a better term, through issues of who should determine how to save data, where to save it, and for what purpose.

I’ll cut to the chase. She had few “answers.” They’re all contingent on factors that are idiosyncratic to the data and the potential users. Moreover, she observed, little certainty yet exists about how to ensure that data collected today will still be available for reading five years from now, much less 50.

An even bigger issue: how data are “catalogued.” All too many essentially aren’t, she says.

This means ginormous quantities of data may be assembled at huge cost and essentially thrown willy-nilly into the equivalent of thousands of containers — none of them labeled. These may then get warehoused in the analog of a big garage. Many sets of data essentially belong together — such as brain scans of patients, details of their medical conditions, the names and histories of the patients, and more. All too often, however, the links between these data go unrecorded, so their value to future researchers will be diminished if not lost entirely.

In Britain, Higgins observes, some funding agencies will no longer dispense funds to scientists who haven’t planned how their data will be collected, what formats they will be stored in, who will archive them, and for how long, at a minimum, they will be kept. Among other big issues for those who will be charged with storing the railroad cars’ worth of mined data:

— how will information gatherers verify the provenance of data kept on digital media?

— how will they ensure the data haven’t degraded over time, introducing errors?

— how will they ensure that no one has tinkered with the files, either introducing errors or merely changing the data (say, by updating or deleting some) without those changes being reflected in the descriptions of what’s stored? (One common safeguard, sketched after this list, is to store a cryptographic checksum, a kind of digital fingerprint, with each file.)

— is it important merely to keep data in some form, or must archivists also maintain their initial look and feel? For instance, if someone photographed cows, would it be enough to retain a low-resolution digital thumbnail of each cow, or must high-resolution versions be kept so that any image could later be enlarged to reveal tiny details of an animal’s coat?
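Higgins didn’t get into specific tooling at the workshop, but to make the degradation and tampering questions above concrete, here is a minimal sketch of the kind of fixity check many digital archives run: compute a cryptographic checksum when a file is deposited, keep it with the file’s record, and recompute it during later audits. The file name and record format below are hypothetical.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in 1 MB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical deposit record: the checksum is stored alongside the file when
# it enters the archive. The file name is invented for illustration.
scan = Path("brain_scan_0042.dcm")
record = {"file": scan.name, "sha256": sha256_of(scan)}

# A later audit recomputes the checksum; any mismatch flags silent degradation
# or undocumented changes to the file.
if sha256_of(scan) != record["sha256"]:
    print(f"Fixity check failed for {record['file']}")
```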

At the workshop, Higgins unveiled a new “lifecycle model” for managing digital data. Her DCC group in Britain is not the first to have attempted this, but she told me afterward that hers is the first to focus on curation — not preservation.

Turns out there’s a big difference. Preservation tends to focus on how to store data: on microfilm, which has a shelf life of 100 years; on a DVD or CD, with a lifetime of perhaps only 20 years; or on a low-cost hard drive that may allow data to degrade in as few as five years. Curation, by contrast, focuses on determining what kinds of “metadata” to link with the digital files.

Metadata? I know, it’s hardly self-explanatory. So I posed the obvious question to Higgins: Huh?

In digital-library lingo, metadata essentially refers to cataloging details: who collected the data, when, where, and on what type of equipment; on what hardware and with what specific software those data were initially stored; which other data sets correlate with the one at hand, and in what ways; and what legal restrictions are associated with the data — such as prohibitions against revealing the source of medical data to anyone other than the patient they were collected from, or against releasing certain census data for the first 50 years after they were collected.
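Higgins presented no code, but as a rough illustration of the cataloging details she describes, here is what a bare-bones metadata record for a single brain scan might look like. The field names and values are invented for this sketch and are not drawn from the DCC’s actual model.

```python
# A hypothetical metadata record for one archived brain scan. Field names are
# invented for illustration; real archives rely on formal metadata schemas.
metadata = {
    "collected_by": "Example Imaging Lab",            # who collected the data
    "collected_on": "2008-06-16",                     # when
    "location": "Pittsburgh, PA",                     # where
    "instrument": "3-tesla MRI scanner",              # on what type of equipment
    "original_storage": "DICOM files, vendor workstation software",  # hardware and software used
    "related_datasets": ["patient_histories_v1"],     # data sets that belong with this one
    "access_restrictions": "patient identities must remain blinded",  # legal limits
}
```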

These are problems that could make or break the utility of collected data, Higgins says. And there are dizzying quantities being accumulated, in large part thanks to big science. By 2006, an estimated 161 billion gigabytes (each gigabyte itself equalling one billion bytes) of digital information had been created, captured, or replicated, she reported. That’s “roughly three million times the information in all books ever written,” she notes.

By 2010, the digital-data universe will expand to an estimated 988 billion gigabytes, she says. Obviously, we can’t keep it all. So what data do we really want, how much is it worth paying to save, and shouldn’t we begin budgeting for its management now?
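To put those figures in more familiar units (my arithmetic, not Higgins’s), a quick back-of-the-envelope check:

```python
# Convert the figures above using the decimal definition of a gigabyte given
# in the text (one billion bytes).
GIGABYTE = 10**9                               # bytes
created_2006 = 161 * 10**9 * GIGABYTE          # 161 billion gigabytes
projected_2010 = 988 * 10**9 * GIGABYTE        # 988 billion gigabytes

print(created_2006)    # 161000000000000000000 bytes, i.e. about 161 exabytes
print(projected_2010)  # 988000000000000000000 bytes, i.e. nearly a zettabyte
```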

Over the next few days, here at JCDL, I suspect I’ll hear elaborations on just such issues.

Janet Raloff is the Editor, Digital, of Science News Explores, a daily online magazine for middle school students. She started at Science News in 1977 as the environment and policy writer, specializing in toxicology. To her never-ending surprise, her daughter became a toxicologist.
