In science, the success of an experiment is often determined by a measure called “statistical significance.” A result is considered to be “significant” if the difference observed in the experiment between groups (of people, plants, animals and so on) would be very unlikely if no difference actually exists. The common cutoff for “very unlikely” is that you’d see a difference as big or bigger only 5 percent of the time if it wasn’t really there — a cutoff that might seem, at first blush, very strict.
It sounds esoteric, but statistical significance has been used to draw a bright line between experimental success and failure. Achieving an experimental result with statistical significance often determines if a scientist’s paper gets published or if further research gets funded. That makes the measure far too important in deciding research priorities, statisticians say, and so it’s time to throw it in the trash.
More than 800 statisticians and scientists are calling for an end to judging studies by statistical significance in a March 20 comment published in Nature. An accompanying March 20 special issue of the American Statistician makes the manifesto crystal clear in its introduction: “‘statistically significant’ — don’t say it and don’t use it.”
There is good reason to want to scrap statistical significance. But with so much research now built around the concept, it’s unclear how — or with what other measures — the scientific community could replace it. The American Statistician offers a full 43 articles exploring what scientific life might look like without this measure in the mix.
This isn’t the first call for an end to statistical significance, and it probably won’t be the last. “This is not easy,” says Nicole Lazar, a statistician at the University of Georgia in Athens and a guest editor of the American Statistician special issue. “If it were easy, we’d be there already.”
What does statistical significance offer?
Many scientific studies today are designed around a framework of “null hypothesis significance testing.” In this type of test, a scientist compares results of an experiment asking, say, if a drug reduces depression in a treated versus control group. The scientist compares the results against the hypothesis that no difference really exists between the groups. The goal is not to prove that the drug fights depression. Instead, the idea is to gather enough data (eventually) to reject the hypothesis that it doesn’t.
The scientist will compare the groups using a statistical analysis that results in a P value, a number between 0 and 1, with the “P” standing for probability. The value signifies the likelihood that repeating the experiment would yield a difference at least as big as the one the scientist got if the drug doesn’t actually reduce depression. Smaller P values mean that the scientist is less likely to see a difference that large if no difference really exists. In scientific parlance, the result is “statistically significant” if P is less than or equal to 0.05.
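One way to make the P value concrete is a permutation test, sketched below in Python. This is only an illustration, not the analysis any particular study used: the depression scores are made up, and the function name `permutation_p_value` is hypothetical. The sketch shuffles the group labels many times to mimic a world where the drug does nothing, then counts how often a difference at least as big as the observed one turns up by chance.

```python
import random
from statistics import mean

def permutation_p_value(treated, control, n_shuffles=10_000, seed=0):
    """One-sided permutation P value for the difference in group means."""
    rng = random.Random(seed)
    # Positive when the treated group scores lower (less depressed).
    observed = mean(control) - mean(treated)
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    at_least_as_big = 0
    for _ in range(n_shuffles):
        # Shuffling breaks any link between group label and score,
        # which is what "no difference really exists" would look like.
        rng.shuffle(pooled)
        diff = mean(pooled[n_treated:]) - mean(pooled[:n_treated])
        if diff >= observed:
            at_least_as_big += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_big + 1) / (n_shuffles + 1)

# Hypothetical depression scores (lower is better); not real data.
treated = [12, 9, 14, 8, 11, 10, 7, 13]
control = [15, 13, 16, 12, 17, 11, 14, 18]

p = permutation_p_value(treated, control)
print(f"P = {p:.4f}")
```

With these invented numbers the treated group averages 4 points lower, and few shuffles produce a gap that large, so P comes out small. That says the data would be surprising if the drug did nothing. It does not say how big or how meaningful the effect is.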
When scientists interpret P values correctly, they can be useful for finding out how compatible experimental results are with the scientists’ expectations, Lazar says. Because a P value is a probability, it “has variability attached to it,” she explains. “If I repeated my procedure over and over, I’d get a whole range of P values. Some would be significant, some wouldn’t.”
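Lazar's point about variability can be seen in a small simulation. The sketch below is an assumption-laden illustration rather than anything from the special issue: it invents a modest true effect (0.5 standard deviations), draws 30 people per group, and uses a simple z-test that treats the standard deviation as known. The names `z_test_p_value` and `repeat_experiment` are hypothetical.

```python
import random
from statistics import NormalDist, fmean

def z_test_p_value(treated, control, sigma):
    """Two-sided P value for a difference in means, assuming a known common sigma."""
    se = sigma * (1 / len(treated) + 1 / len(control)) ** 0.5
    z = (fmean(treated) - fmean(control)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def repeat_experiment(n_repeats=1000, n_per_group=30, true_effect=0.5,
                      sigma=1.0, seed=1):
    """Run the same (simulated) experiment many times and collect the P values."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_repeats):
        control = [rng.gauss(0.0, sigma) for _ in range(n_per_group)]
        treated = [rng.gauss(true_effect, sigma) for _ in range(n_per_group)]
        p_values.append(z_test_p_value(treated, control, sigma))
    return p_values

p_values = repeat_experiment()
share_significant = sum(p <= 0.05 for p in p_values) / len(p_values)
print(f"smallest P = {min(p_values):.5f}, largest P = {max(p_values):.3f}")
print(f"share with P <= 0.05: {share_significant:.2f}")
```

Under these assumptions the effect is real in every run, yet only about half of the repeated experiments clear the 0.05 bar. The same procedure yields P values spread across a wide range.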
Because of this variability, P equal to 0.05 was never meant to be an end result. Instead, it was more of a beginning, “something that would cause you to raise your eyebrows and investigate further,” Lazar says.
Where did the idea for statistical significance come from?
Many scientists now interpret P equal to 0.05 as a cutoff between an experiment that “worked” and one that didn’t. That cutoff can be attributed to one man: famed 20th century statistician Ronald Fisher. In a 1925 monograph, Fisher offered a simple test that research scientists could use to produce a P value. And he offered the cutoff of P equals 0.05, saying “it is convenient to take this point as a limit in judging whether a deviation [a difference between groups] is to be considered significant or not.”
That “convenient” suggestion has reverberated far beyond what Fisher probably intended. In 2015, more than 96 percent of papers in the PubMed database of biomedical and life science papers boasted results with P less than or equal to 0.05.
What’s the problem with statistical significance?
But science and statistics have never been so simple as to cater to convenient cutoffs. A P value, no matter how small, is just a probability. It doesn’t mean an experiment worked. And it doesn’t tell you if the difference in results between experimental groups is big or small. In fact, it doesn’t even say whether the difference is meaningful.
The 0.05 cutoff has become shorthand for scientific quality, says Blake McShane, one of the authors on the Nature commentary and a statistician at Northwestern University in Evanston, Ill. “First you show me your P less than 0.05, and then I will go and think about the data quality and study design,” he says. “But you better have that [P less than 0.05] first.”
That shorthand also draws a bright line between scientific findings that are “good” and those that are “bad,” when in fact no such line exists. “On one side of the threshold, you label it one thing, and if it falls on the other side, it’s something else,” McShane says. But nothing in statistics, or reality, actually works that way. Strictly speaking, he says, “there’s no difference between a P value of 0.049 and a P value of 0.051.”
What would it take to get rid of statistical significance?
Because statistical significance is entrenched in science culture, being used widely in decisions on whether to fund, promote or publish scientific research, a switch to anything else would take huge effort, says Steven Goodman, a Stanford University medical research methodologist who contributed one of the 43 articles of the special issue of the American Statistician. “The currency in that economy is the P value,” he says.
Computer programs that calculate a P value automatically from experimental data have helped to make the measure even more of a “crutch,” Goodman notes. Using it as the default means that scientists “haven’t developed the scientific muscles to understand what it means to reason under true uncertainty.” True uncertainty doesn’t mean scientists throw up their hands and say the data don’t reveal anything. In statistics, “uncertainty” refers to how much data is expected to vary from one experiment to another. Learning to interpret that uncertainty in scientific results, he notes, would require a lot more statistical training than many scientists usually get.
Shifting to one or many new kinds of statistics that better capture uncertainty would also mean that scientists would have to put more effort into making judgment calls. Journal editors and peer reviewers would have to learn to rely on other criteria to determine if a study was worth publishing. Scientific journals might have to change their standards. “It’s very, very hard to dislodge,” Goodman says. “The world of science is not ruled or directed by statisticians.”
Partly because of the potential challenges of change, some scientists don’t want to throw out statistical significance cutoffs just yet. Some want to start by raising the bar. Instead of P less than or equal to 0.05 as a cutoff, Valen Johnson, a statistician at Texas A&M University in College Station, prefers P less than or equal to 0.005. That corresponds to a 0.5 percent chance that someone would observe a difference as big as or bigger than the one observed if the null hypothesis were true. “It’s not quite an absolute threshold, but we’d have fewer false positives.”
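The effect of a stricter cutoff on false positives can be sketched with a simulation in which the null hypothesis is true by construction. This is an illustrative sketch, not Johnson's own analysis: the function name `null_p_values`, the group size and the known-sigma z-test are all assumptions made for the example.

```python
import random
from statistics import NormalDist, fmean

def null_p_values(n_experiments=20_000, n_per_group=20, seed=2):
    """Simulate experiments where no real difference exists between groups."""
    rng = random.Random(seed)
    standard_normal = NormalDist()
    se = (2 / n_per_group) ** 0.5  # standard error with known sigma = 1
    p_values = []
    for _ in range(n_experiments):
        # Both groups come from the same distribution, so any gap is chance.
        a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        z = (fmean(a) - fmean(b)) / se
        p_values.append(2 * (1 - standard_normal.cdf(abs(z))))
    return p_values

p_values = null_p_values()
for cutoff in (0.05, 0.005):
    rate = sum(p <= cutoff for p in p_values) / len(p_values)
    print(f"cutoff {cutoff}: {rate:.4f} of null experiments flagged as significant")
```

In this setup, roughly 5 percent of the no-effect experiments pass the 0.05 cutoff and roughly 0.5 percent pass 0.005. The trade-off, not shown here, is that a stricter cutoff also makes real effects harder to detect at the same sample size.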
Is there a better way to judge if a study is solid?
Unfortunately, there is no single alternative that everyone agrees would be better for all experiments. “Everyone knows what they’re against,” Goodman says. “Very few people know what they’re for.”
New computer programs offer people who aren’t statisticians the freedom to move beyond the P value measure, notes Julia Haaf, a psychological methodologist at the University of Amsterdam in the Netherlands. “The reason why P values got so popular was because it was the only thing people could do” throughout much of the 20th century, she says. “Now you have options.”
Scientists could add confidence intervals to their results. These are estimated ranges of values (based on your experiment) that are likely to include the true difference between treatments or conditions. Scientists could also embrace Bayes factors, as Haaf has done, comparing how much the data in an experiment support one hypothesis over another hypothesis. And depending on how an experiment is designed, sometimes a test that spits out a P value can still be the right choice.
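As a rough sketch of the first option, the Python below computes an approximate 95 percent confidence interval for a difference between two group means. The scores are invented, the function name `diff_confidence_interval` is hypothetical, and the calculation uses a normal approximation; with groups this small, a t-based interval would be somewhat wider.

```python
from statistics import NormalDist, mean, variance

def diff_confidence_interval(treated, control, level=0.95):
    """Approximate confidence interval for mean(control) - mean(treated)."""
    diff = mean(control) - mean(treated)
    # Standard error of the difference, from each group's sample variance.
    se = (variance(treated) / len(treated)
          + variance(control) / len(control)) ** 0.5
    # Normal-approximation multiplier, about 1.96 for a 95 percent interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff - z * se, diff + z * se

# Hypothetical depression scores (lower is better); not real data.
treated = [12, 9, 14, 8, 11, 10, 7, 13]
control = [15, 13, 16, 12, 17, 11, 14, 18]

low, high = diff_confidence_interval(treated, control)
print(f"estimated reduction: 4.0 points, 95% interval {low:.1f} to {high:.1f}")
```

For these made-up numbers the interval runs from roughly 1.6 to 6.4 points. Unlike a bare P value, it shows both the estimated size of the difference and how much that estimate could plausibly vary.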
But no matter what statistical test is chosen, a scientist should not set a cutoff to serve as a shortcut in separating scientific wheat from chaff, critics of statistical significance say. These cutoffs will always be too black and white, and scientists need to embrace the idea of statistical gray.
In any case, scientists shouldn’t judge an experiment’s quality by a single statistical test, whatever that test may be, McShane says. Other factors may be of equal concern. “What’s the quality of your data? What’s your study design like? Do you have an understanding of the underlying mechanism?” he says. “These other factors are just as important, and often more important, than measures like P values.”
What does a future without statistical significance look like?
The P value itself is only a statistical measure, and no one is trying to get rid of it. Instead, the signers of the Nature manifesto are against the idea of statistical significance, where P is less than or equal to 0.05. That limit gives a false sense of certainty about results, McShane says. “Statistics is often wrongly perceived to be a way to get rid of uncertainty,” he says. But it’s really “about quantifying the degree of uncertainty.”
Embracing that uncertainty would change how science is communicated to the public. People expect clear yes-or-no answers from science, or want to know that an experiment “found” something, though that’s never truly the case, Haaf says. There is always uncertainty in scientific results. But right now scientists and nonscientists alike have bought into the false certainty of statistical significance.
Those teaching or communicating science — and those learning and listening — would need to understand and embrace uncertainty right along with the scientific community. “I’m not sure how we do that,” says Haaf. “What people want from science is answers, and sometimes the way we report data should show [that] we don’t have a clear answer; it’s messier than you think.”