Rise of Big Data underscores need for theory

Second of two parts

Francis Bacon was a philosopher of science when there wasn’t much real science going on to philosophize about.

Four centuries ago, decades before Newton established the laws of motion and gravity, Bacon set out to establish the laws of doing science to begin with. Bacon articulated the need for observation and experiment, the empirical approach to acquiring knowledge. He excoriated the followers of Aristotle for their blind acquiescence in ancient authority.

Bacon supposedly proclaimed that science could discover truths about nature only by empirical testing of all the possible explanations for all the observed phenomena. If you wanted to discern the true explanation for heat’s properties, for instance, you needed to perform experiments to eliminate incorrect explanations — after you had recorded all the observed facts about all manner of heat-related phenomena. In other words, Bacon was a fan of Big Data.

With today’s computerized collections of massive amounts of data on everything from Facebook friends to molecular medicine, Bacon’s dreams have been realized. Except that the empirical method actually doesn’t work very well on Big Data. And that poses a problem for the entrenched obsession with experiment that Bacon instilled in the consciousness of Western culture.

True, theory has had its advocates, especially in realms like physics, where the underlying simplicity of nature lends itself to concise mathematical expression. From that math, scientists build “models” of reality that can be compared with observations to test the theory. But the prevailing view is that the complexities of life and society do not easily yield to theoretical models, as Yaneer Bar-Yam of the New England Complex Systems Institute points out in a recent paper.

Theory supposedly can’t handle the complexities of things like human behavior or the effects of drugs on disease. If that’s so, then you can find out how the brain works, or what cures what, only by doing experiments to see what happens.

But in fact, Bar-Yam asserts, it’s the empirical approach that can’t handle all the facts that complex systems present. Theory is essential.

“Based upon an analysis of the information that can be obtained from experimental observations,” he writes, “theory is even more essential in the understanding of complex systems.”

And why is that? Because Big Data isn’t as big as it thinks it is.

Bar-Yam reached that realization by using information theory to analyze the Baconian process of observing and experimenting to understand a system’s behavior. First you have to specify the system being observed, all the conditions influencing it and all of its possible responses to those conditions. Information theory can then tell you how much data you would need to determine what the system will do under every possible set of circumstances.

For any system with considerable complexity, the answer is that there is never enough data.
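To see why, it helps to run the numbers. If a system is influenced by even a few dozen yes-or-no conditions, the number of distinct circumstances it can face, and therefore the number of independent observations needed to map its behavior, grows exponentially. Here is a minimal sketch of that bookkeeping in Python; the counts are illustrative assumptions, not figures from Bar-Yam’s paper:

```python
# Back-of-the-envelope information bookkeeping for an empirical survey
# of a complex system. All counts are illustrative assumptions.
import math

n_condition_vars = 40  # yes/no factors influencing the system (assumed)
n_responses = 4        # distinct outcomes the system can produce (assumed)

# Distinct circumstances the system could face.
n_conditions = 2 ** n_condition_vars

# Bits needed to specify the system's response to every circumstance:
# one choice among n_responses (log2 of n_responses bits) per condition.
bits_to_describe_system = n_conditions * math.log2(n_responses)

# Each observation reveals at most log2(n_responses) bits, so a purely
# empirical mapping needs at least one observation per circumstance.
print(f"distinct circumstances: {n_conditions:,}")
print(f"bits to pin down the system: {bits_to_describe_system:,.0f}")
print(f"observations required: {n_conditions:,}")
```

Forty binary factors already demand about a trillion independent observations; real biological and social systems involve vastly more conditions than that.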

In clinical trials, for instance, researchers use the empirical approach to see whether a drug cures a disease. At the simplest level, the systems (patients) are exposed to a single condition (drug or no drug) and respond with one of two possible outcomes (cured, not cured). But in real life, there are actually many more conditions and many more possible outcomes. Diseases can be more or less severe or co-occur with other diseases. Patients differ in age and sex as well as with respect to a vast array of genetic variations. Responses aren’t limited to cure or no cure, either. Some symptoms may improve while others don’t. Side effects not related to the disease may be common in some patients but not others.

Adding patients to the trial (the “bigger is better” strategy) compounds the problem by further increasing the range of possible conditions and possible outcomes. As a result, faithfulness to the empirical approach would require an independent population for each condition to be tested, with each population big enough to include all the possible outcomes. Not happening.
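A rough, hypothetical tally shows how fast that arithmetic gets out of hand. Suppose patients are stratified by just a few of the factors mentioned above; every level count below is invented for illustration:

```python
# Hypothetical stratification of a clinical trial population.
# Every factor level below is an invented, illustrative number.
severity_levels = 3          # mild / moderate / severe (assumed)
comorbidity_combos = 2 ** 5  # presence/absence of 5 other diseases (assumed)
age_bands = 6                # assumed
sexes = 2
genetic_profiles = 2 ** 20   # a tiny slice of human genetic variation (assumed)

strata = (severity_levels * comorbidity_combos * age_bands
          * sexes * genetic_profiles)
patients_per_stratum = 30    # a conventional minimum for a stable estimate

print(f"distinct patient strata: {strata:,}")          # about 1.2 billion
print(f"patients needed to cover them: {strata * patients_per_stratum:,}")

# Even an enormous trial samples this space of conditions very sparsely:
enrolled = 10_000
print(f"fraction of strata a {enrolled:,}-patient trial can touch: "
      f"{enrolled / strata:.1e}")                      # about 8e-06
```

That sparseness, a huge trial touching only a few millionths of the possible patient conditions, is exactly what Bar-Yam has in mind.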

It’s not that it’s bad to have data, of course. Getting a grip on complex systems is impossible without acquiring a lot of data. So Big Data has its benefits. But it is bad to assume that Big Data is always big enough to reach sound conclusions with current empirical methods. 

“The advent of ‘big data’ is critical for addressing complex systems,” Bar-Yam notes, “but without recognizing the sparseness of that data in an exponentially rich possible data set we are limited in the progress we can make.”

Because even Big Data isn’t enough data, the strictly empirical approach to complex systems is inadequate. Theoretical models are essential. “It is precisely for highly complex biological and social systems that theoretical modeling is essential to the scientific process,” Bar-Yam avers.

Models also have their limits. And not just any model will do. Some models, Bar-Yam points out, merely summarize observations that have already been performed. A useful model, on the other hand, is “a kind of data compression of the information about the system,” one that provides “the ability to identify the results of observations that have not yet been performed.” After all, science’s primary value is its ability to tell us what will happen, not what has already happened. Good science provides accurate predictions about how systems will behave.
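A toy example, with fabricated numbers, makes the distinction concrete. A lookup table of past results can only repeat what was measured; a fitted model compresses the same observations into a couple of parameters and extrapolates to measurements not yet made:

```python
# Toy contrast between summarizing data and modeling it.
# The measurements below are fabricated for illustration.
observations = {0.0: 0.1, 1.0: 2.1, 2.0: 3.9, 3.0: 6.1}  # input -> output

# A lookup table "summarizes observations already performed" but is
# silent about anything it hasn't seen:
print(observations.get(4.0))                 # None: no prediction

# A two-parameter linear fit compresses the same information and can
# predict the result of an observation not yet performed:
xs, ys = zip(*observations.items())
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

print(f"model: y = {slope:.2f}x + {intercept:.2f}")   # two numbers replace the table
print(f"prediction at x = 4.0: {slope * 4.0 + intercept:.2f}")  # about 8.0
```

The fit here is ordinary least squares, chosen only because it is the simplest model that generalizes; the point is the contrast, not the particular technique.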

But with current methods, Big Data alone is simultaneously too much data to analyze accurately and not enough data to guarantee accuracy. Science’s devotion to empiricism must therefore yield to an appreciation of the role of theoretical modeling in arriving at reliable knowledge about reality.

“No empirical observation is ever useful as a direct measure of a future observation,” Bar-Yam asserts. “It is only through generalization motivated by some form of model/theory that we can use past information to address future circumstances.”

Good science does not magically emerge from massive databases; it requires extracting the valuable information from the worthless. Big Data alone doesn’t discriminate between the two very well. That’s what theoretical models can do.

“Ultimately, the possibility of knowledge itself must rely upon the ability to differentiate between different pieces of information based on their importance,” Bar-Yam declares. “Observations must focus on those pieces of information and not on the rest.… Thus it is evident that an essential role of theory must be to identify which pieces of information are important.”

Francis Bacon would have appreciated all this. He was a fan of Big Data — he believed the “foundation of a true philosophy” was a storehouse of facts, a “natural and experimental history” of all the relevant information. But he also believed that theories should seek the physical causes underlying observations so as to predict the results of observations not included in the data that generated the theory, as the philosopher Peter Urbach pointed out.

Bacon believed that theory and experiment were partners in the quest to generate sound science for the benefit of society. Bar-Yam emphasizes similar points. If Big Data is to benefit humankind, its dangers and the limits of empiricism must both be recognized.

“Practical approaches to medicine, management and policy require a better framing of how we can effectively understand biological and social systems,” Bar-Yam writes. “Recognizing that empirical approaches do not extend well to complex systems, and that theory and experiment must work hand in hand, is an important step in the right direction.”


Tom Siegfried is a contributing correspondent. He was editor in chief of Science News from 2007 to 2012 and managing editor from 2014 to 2017.
