Why Big Data is bad for science

November 26, 2013 at 2:00 pm

First of two parts

If Star Trek: The Next Generation were to return to TV in the 21st century, Lt. Commander Data’s nickname would be “Big.”

“Big Data,” after all, is the biggest buzzword of the new millennium. It’s everywhere, from genomics, biomics and a bunch of other –omics to the NSA’s database on writers who mention NSA in their blogs. Social networks, financial networks, ecological networks all contain vast amounts of data that no longer overwhelm computer hard drive storage capabilities. Scientists are now swimming in a superocean of endless information, fulfilling their wildest dreams of data nirvana.

What a nightmare.

You see, scientists usually celebrate the availability of a lot of data. Most of them have been extolling all the research opportunities that massive databases offer. But perhaps that’s because not everybody is seeing the big data picture. Here and there you can find warnings from some experts that Big Data has its downsides.

“Scientific advances are becoming more and more data-driven,” write statistician Jianqing Fan of Princeton University and colleagues. “The massive amounts of … data bring both opportunities and new challenges to data analysis.”

For one thing, huge datasets are seductive. They invite aggressive analyses with the hope of extracting prizewinning scientific findings. But sometimes Big Data In means Bad Data Out. Wringing intelligent insights from Big Data poses formidable challenges for computer science, statistical inference methods and even the scientific method itself.

Computer scientists, of course, have made the accumulation of all this big data possible by developing exceptional computing power and information storage technologies. But collecting data and storing information are not the same as understanding them. Figuring out what Big Data means isn’t the same as interpreting little data, just as understanding flocking behavior in birds doesn’t explain the squawks of a lone seagull.

Standard statistical tests and computing procedures for drawing scientific inferences were designed to analyze small samples taken from large populations. But Big Data provides extremely large samples that sometimes include all or most of a population. The magnitude of the task can pose problems for implementing computing processes to do the tests.

“Many statistical procedures either have unknown runtimes or runtimes that render the procedure unusable on large-scale data,” writes Michael Jordan of the University of California, Berkeley. “Faced with this situation, gatherers of large-scale data are often forced to turn to ad hoc procedures that … may have poor or even disastrous statistical properties.”

Sounds bad. But it gets worse. Not only do Big Data samples take more time to analyze, they also typically contain lots of different information about every individual that gets sampled, which means, in statistics-speak, they are “high dimensional.” More dimensions raise the risk of finding spurious correlations: apparently important links that are actually just flukes. A medical study might link success with a drug to a patient’s height, for instance. But that might just be because the Big Data contained information on everything from height and weight to eye color, shoe size and favorite baseball team. With so many dimensions to consider, some will seem to be important simply by chance.
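To see how easily flukes pile up, here is a minimal simulation sketch in Python. Everything in it is an assumption made for illustration, not anything taken from Fan's paper: the function names `pearson` and `count_spurious`, the 50 imaginary patients, and the cutoff of 0.279 (roughly the correlation strength that a conventional 5 percent significance test would flag in a sample of 50). The outcome and every measured trait are pure random noise, so any link the code finds is spurious by construction.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_spurious(n_patients, n_features, threshold, seed=0):
    """Count random traits that look correlated with a random outcome.

    Hypothetical setup: the outcome and every trait are independent
    Gaussian noise, so no real relationship exists anywhere in the data.
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_patients)]
    hits = 0
    for _ in range(n_features):
        trait = [rng.gauss(0, 1) for _ in range(n_patients)]
        if abs(pearson(trait, outcome)) > threshold:
            hits += 1
    return hits


if __name__ == "__main__":
    # Same 50 patients each time; only the number of measured traits grows.
    for n_features in (10, 100, 1000):
        hits = count_spurious(50, n_features, threshold=0.279)
        print(n_features, "traits measured ->", hits, "apparent links")
```

Under these assumptions, about one trait in 20 should clear the cutoff by luck alone, so the count of apparent links tends to grow in step with the number of traits measured, even though none of them means anything.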

“High dimensionality,” write Fan and collaborators, “may lead to wrong statistical inference and false scientific conclusions.”

Besides that, Big Data often is acquired by combining information from many sources, at different times, using a variety of technologies or methodologies, Fan and colleagues point out. “This creates issues of heterogeneity, experimental variations, and statistical biases, and requires us to develop more adaptive and robust procedures,” they write. “To handle the challenges of Big Data, we need new statistical thinking and computational methods.”

Many computer scientists and statisticians are aware of these issues, and a lot of work is under way to address them. But there’s more to it than just mashing up some more sophisticated statistical methodologies. Scientists also need to confront some biases, rooted in the days of sparse data, about what science is and how it should work.

In fact, the arrival of Big Data should compel scientists to cope with the fact that nature itself is the ultimate Big Data database. Old style science coped with nature’s complexities by seeking the underlying simplicities in the sparse data acquired by experiments. But Big Data forces scientists to confront the entire repertoire of nature’s nuances and all their complexities.

In medical research, for instance, uncountably many factors can influence whether, say, a drug cures a disease. Traditional medical experimentation has been able to study only a few of those factors at a time. The empirical scientific approach — observation, description and inference — can’t reliably handle more than a few such factors. Now that Big Data makes it possible to actually collect vast amounts of the relevant information, the traditional empirical approach is no longer up to the job.

But it’s even worse than that. No matter how big Big Data gets, it can’t get big enough to really encompass all of the relevant information, as Yaneer Bar-Yam of the New England Complex Systems Institute points out in a recent paper.

“For any system that has more than a few possible conditions, the amount of information needed to describe it is not communicable in any reasonable time or writable on any reasonable medium,” he writes.
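A rough back-of-the-envelope sketch shows why the numbers get out of hand so quickly. It rests on simplifying assumptions that are mine, not Bar-Yam's: every factor is a plain yes-or-no condition, a complete description records just one bit of response for each combination of factors, and the universe holds about 10^80 atoms (a commonly cited ballpark figure). The names `table_size_bits` and `factors_needed_to_exceed` are hypothetical.

```python
# Commonly cited rough estimate of atoms in the observable universe (assumption).
ATOMS_IN_UNIVERSE = 10 ** 80


def table_size_bits(n_factors):
    """Bits needed to record one yes/no response for every combination
    of n_factors yes/no conditions (a simplifying assumption)."""
    return 2 ** n_factors


def factors_needed_to_exceed(limit):
    """Smallest number of yes/no factors whose full response table
    needs more bits than the given limit."""
    n = 0
    while table_size_bits(n) <= limit:
        n += 1
    return n


if __name__ == "__main__":
    print(table_size_bits(10))  # 10 factors already need 1,024 entries
    print(factors_needed_to_exceed(ATOMS_IN_UNIVERSE))
```

On this simplified accounting, a few hundred yes-or-no factors are enough to demand more bits than there are atoms to write them on, which is one way to read the claim that no reasonable medium could hold the full description.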

Consequently, science cannot rely on the strictly empirical approach to answer questions about complex systems. There are too many possible factors influencing the system and too many possible responses that the system might make in any given set of circumstances. To use Big Data effectively, science might just have to learn to subordinate experiment to theory.

Follow me on Twitter: @tom_siegfried