# Statisticians want to abandon science’s standard measure of ‘significance’

## Here’s why “statistically significant” shouldn’t be a stamp of scientific approval

In science, the success of an experiment is often determined by a measure called “statistical significance.” A result is considered to be “significant” if the difference observed in the experiment between groups (of people, plants, animals and so on) would be very unlikely if no difference actually exists. The common cutoff for “very unlikely” is that you’d see a difference as big or bigger only 5 percent of the time if it wasn’t really there, a cutoff that might seem, at first blush, very strict.

It sounds esoteric, but statistical significance has been used to draw a bright line between experimental success and failure. Achieving an experimental result with statistical significance often determines if a scientist’s paper gets published or if further research gets funded. That makes the measure far too important in deciding research priorities, statisticians say, and so it’s time to throw it in the trash.

More than 800 statisticians and scientists are calling for an end to judging studies by statistical significance in a March 20 comment published in *Nature*. An accompanying March 20 special issue of the *American Statistician* makes the manifesto crystal clear in its introduction: “‘statistically significant’: don’t say it and don’t use it.”

There is good reason to want to scrap statistical significance. But with so much research now built around the concept, it’s unclear how, or with what other measures, the scientific community could replace it. The *American Statistician* offers a full 43 articles exploring what scientific life might look like without this measure in the mix.

This isn’t the first call for an end to statistical significance, and it probably won’t be the last. “This is not easy,” says Nicole Lazar, a statistician at the University of Georgia in Athens and a guest editor of the *American Statistician* special issue. “If it were easy, we’d be there already.”

**What does statistical significance offer?**

Many scientific studies today are designed around a framework of “null hypothesis significance testing.” In this type of test, a scientist compares results of an experiment asking, say, if a drug reduces depression in a treated versus control group. The scientist compares the results against the hypothesis that no difference really exists between the groups. The goal is not to prove that the drug fights depression. Instead, the idea is to gather enough data (eventually) to reject the hypothesis that it doesn’t.

The scientist will compare the groups using a statistical analysis that results in a P value, a number between 0 and 1, with the “P” standing for probability. The value signifies the likelihood that repeating the experiment would yield a difference as big as (or bigger than) the one the scientist got if the drug doesn’t actually reduce depression. Smaller P values mean that the scientist is less likely to see a difference that large if no difference really exists. In scientific parlance, the value is “statistically significant” if P is less than or equal to 0.05.
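To make that definition concrete, here is a minimal sketch in Python of one simple way to get a P value: a permutation test. All of the depression-score numbers below are made up purely for illustration. The logic follows the definition directly: if the drug did nothing, the group labels would be arbitrary, so shuffling the labels many times shows how often a difference as big as the observed one would appear by chance.

```python
import random

random.seed(42)

# Hypothetical drops in depression scores (invented numbers for illustration)
treated = [8, 6, 7, 9, 5, 7, 8, 6]
control = [5, 4, 6, 5, 3, 6, 4, 5]

observed = sum(treated) / len(treated) - sum(control) / len(control)

# Permutation test: under the null hypothesis the labels are arbitrary,
# so shuffle them and count how often the difference is as big or bigger.
pooled = treated + control
n_treated = len(treated)
trials = 10_000
extreme = 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = (sum(pooled[:n_treated]) / n_treated
            - sum(pooled[n_treated:]) / (len(pooled) - n_treated))
    if diff >= observed:
        extreme += 1

p_value = extreme / trials
print(f"observed difference: {observed:.2f}, p = {p_value:.4f}")
```

With these invented numbers the groups are well separated, so only a small fraction of random relabelings reproduce the observed gap, which is exactly what a small P value means.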

When scientists interpret P values correctly, they can be useful for finding out how compatible experimental results are with the scientists’ expectations, Lazar says. Because a P value is a probability, it “has variability attached to it,” she explains. “If I repeated my procedure over and over, I’d get a whole range of P values. Some would be significant, some wouldn’t.”

Because of this variability, P equal to 0.05 was never meant to be an end result. Instead, it was more of a beginning, “something that would cause you to raise your eyebrows and investigate further,” Lazar says.
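Lazar’s point about variability can be shown with a short simulation. The sketch below (effect size, group size and number of repeats are all assumed values chosen for illustration) reruns the “same” experiment 1,000 times with a real but modest effect, computing a simple one-sided P value from a normal approximation each time. The resulting P values scatter widely: some land below 0.05, many do not.

```python
import math
import random

random.seed(1)

def one_sided_p(a, b):
    """One-sided p-value for mean(a) > mean(b), normal approximation."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    z = (ma - mb) / math.sqrt(va / na + vb / nb)
    return 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Repeat the "same" experiment: a true effect of 0.5 standard deviations,
# 20 subjects per group (assumed numbers for illustration).
p_values = []
for _ in range(1000):
    treated = [random.gauss(0.5, 1) for _ in range(20)]
    control = [random.gauss(0.0, 1) for _ in range(20)]
    p_values.append(one_sided_p(treated, control))

significant = sum(p <= 0.05 for p in p_values)
print(f"{significant} of 1000 identical experiments gave p <= 0.05")
```

Even though the effect is real in every single run, the verdict flips between “significant” and “not significant” from repeat to repeat, which is why a lone P value near 0.05 should raise eyebrows rather than settle the question.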

**Where did the idea for statistical significance come from?**

Many scientists now interpret P equal to 0.05 as a cutoff between an experiment that “worked” and one that didn’t. That cutoff can be attributed to one man: famed 20th century statistician Ronald Fisher. In a 1925 monograph, Fisher offered a simple test that research scientists could use to produce a P value. And he offered the cutoff of P equals 0.05, saying “it is convenient to take this point as a limit in judging whether a deviation [a difference between groups] is to be considered significant or not.”

That “convenient” suggestion has reverberated far beyond what Fisher probably intended. In 2015, more than 96 percent of papers in the PubMed database of biomedical and life science papers boasted results with P less than or equal to 0.05.

**Whatâs the problem with statistical significance?**

But science and statistics have never been so simple as to cater to convenient cutoffs. A P value, no matter how small, is just a probability. It doesn’t mean an experiment worked. And it doesn’t tell you if the difference in results between experimental groups is big or small. In fact, it doesn’t even say whether the difference is meaningful.
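A quick calculation makes the disconnect between P values and meaningful differences vivid. The sketch below (a normal approximation with invented effect sizes and sample sizes) shows that a trivially small difference becomes “significant” once the sample is big enough, while a large difference in a tiny study does not clear the bar.

```python
import math

def two_sided_p(diff, sd, n):
    """Two-sided p-value for a difference in means between two groups of
    size n with known standard deviation sd (normal approximation)."""
    z = diff / (sd * math.sqrt(2 / n))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# A negligible 0.01-point difference, but 200,000 subjects per group:
tiny_effect_big_n = two_sided_p(diff=0.01, sd=1.0, n=200_000)
# A large 1.0-point difference, but only 3 subjects per group:
big_effect_small_n = two_sided_p(diff=1.0, sd=1.0, n=3)

print(f"tiny effect, huge study:  p = {tiny_effect_big_n:.4f}")
print(f"large effect, tiny study: p = {big_effect_small_n:.4f}")
```

The P value mixes effect size and sample size together, so on its own it cannot say whether a “significant” difference is big enough to matter.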

The 0.05 cutoff has become shorthand for scientific quality, says Blake McShane, one of the authors on the *Nature* commentary and a statistician at Northwestern University in Evanston, Ill. “First you show me your P less than 0.05, and then I will go and think about the data quality and study design,” he says. “But you better have that [P less than 0.05] first.”

That shorthand also draws a bright line between scientific findings that are “good” and those that are “bad,” when in fact no such line exists. “On one side of the threshold, you label it one thing, and if it falls on the other side, it’s something else,” McShane says. But nothing in statistics, or reality, actually works that way. Strictly speaking, he says, “there’s no difference between a P value of 0.049 and a P value of 0.051.”

**What would it take to get rid of statistical significance?**

Because statistical significance is entrenched in science culture, being used widely in decisions on whether to fund, promote or publish scientific research, a switch to anything else would take huge effort, says Steven Goodman, a Stanford University medical research methodologist who contributed one of the 43 articles of the special issue of the *American Statistician*. “The currency in that economy is the P value,” he says.

Computer programs that calculate a P value automatically from experimental data have helped to make the measure even more of a “crutch,” Goodman notes. Using it as the default means that scientists “haven’t developed the scientific muscles to understand what it means to reason under true uncertainty.” True uncertainty doesn’t mean scientists throw up their hands and say the data don’t reveal anything. In statistics, “uncertainty” refers to how much data are expected to vary from one experiment to another. Learning to interpret that uncertainty in scientific results, he notes, would require a lot more statistical training than many scientists usually get.

Shifting to one or many new kinds of statistics that better capture uncertainty would also mean that scientists would have to put more effort into making judgment calls. Journal editors and peer reviewers would have to learn to rely on other criteria to determine if a study was worth publishing. Scientific journals might have to change their standards. “It’s very, very hard to dislodge,” Goodman says. “The world of science is not ruled or directed by statisticians.”

Partially because of the potential challenges of change, some scientists don’t want to throw out statistical significance cutoffs just yet. Some want to start by raising the bar. Instead of P less than or equal to 0.05 as a cutoff, Valen Johnson, a statistician at Texas A&M University in College Station, prefers P less than or equal to 0.005, a 0.5 percent chance that someone would observe a difference as big as or bigger than the one observed if the null hypothesis were true. “It’s not quite an absolute threshold, but we’d have fewer false positives.”
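Johnson’s false-positive claim is easy to check with a simulation. In the sketch below (group sizes and the number of repeated experiments are arbitrary choices for illustration), the drug truly does nothing, so every “significant” result is a false positive. Roughly 5 percent of experiments clear the 0.05 bar by chance; far fewer clear 0.005.

```python
import math
import random

random.seed(7)

def two_sided_p(a, b):
    """Two-sided p-value for a difference in group means (normal approximation)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    z = abs(ma - mb) / math.sqrt(va / na + vb / nb)
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

# 5,000 experiments in which the null hypothesis is actually true:
# both groups are drawn from the identical distribution.
trials = 5_000
fp_05 = fp_005 = 0
for _ in range(trials):
    a = [random.gauss(0, 1) for _ in range(30)]
    b = [random.gauss(0, 1) for _ in range(30)]
    p = two_sided_p(a, b)
    fp_05 += p <= 0.05
    fp_005 += p <= 0.005

print(f"false-positive rate at 0.05:  {fp_05 / trials:.3f}")
print(f"false-positive rate at 0.005: {fp_005 / trials:.3f}")
```

The stricter cutoff cuts the false-positive rate roughly tenfold, which is the trade-off Johnson is pointing at, though it does nothing to fix the deeper problem of treating any cutoff as a bright line.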

**Is there a better way to judge if a study is solid?**

Unfortunately, there is no single alternative that everyone agrees would be better for all experiments. “Everyone knows what they’re against,” Goodman says. “Very few people know what they’re for.”

New computer programs offer people who aren’t statisticians the freedom to move beyond the P value measure, notes Julia Haaf, a psychological methodologist at the University of Amsterdam in the Netherlands. “The reason why P values got so popular was because it was the only thing people could do” throughout much of the 20th century, she says. “Now you have options.”

Scientists could add confidence intervals to their results. These are estimated ranges of values (based on your experiment) that are likely to include the true difference between treatments or conditions. Scientists could also embrace Bayes factors, as Haaf has done, comparing how much the data in an experiment support one hypothesis over another hypothesis. And depending on how an experiment is designed, sometimes a test that spits out a P value can still be the right choice.
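As a sketch of the confidence-interval option, the code below computes a 95 percent interval for the difference between two group means using a normal approximation (the depression-score numbers are the same invented values used above, purely for illustration). Unlike a bare P value, the interval reports both the size of the estimated difference and the range of plausible values around it.

```python
import math

# Hypothetical drops in depression scores (invented numbers for illustration)
treated = [8, 6, 7, 9, 5, 7, 8, 6]
control = [5, 4, 6, 5, 3, 6, 4, 5]

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

diff = mean(treated) - mean(control)
se = math.sqrt(var(treated) / len(treated) + var(control) / len(control))

# 95% confidence interval, normal approximation (z = 1.96)
lo, hi = diff - 1.96 * se, diff + 1.96 * se
print(f"estimated difference: {diff:.2f}, 95% CI: ({lo:.2f}, {hi:.2f})")
```

An interval that excludes zero plays a role similar to P less than 0.05, but its width also conveys the uncertainty directly: a wide interval warns the reader not to take the point estimate too literally.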

But no matter what statistical test is chosen, a scientist should not set a cutoff to serve as a shortcut in separating scientific wheat from chaff, critics of statistical significance say. These cutoffs will always be too black and white, and scientists need to embrace the idea of statistical gray.

In any case, scientists shouldn’t be judging an experiment’s quality by a single statistical test anyway, whatever that test may be, McShane says. Other factors may be of equal concern. “What’s the quality of your data? What’s your study design like? Do you have an understanding of the underlying mechanism?” he says. “These other factors are just as important, and often more important, than measures like P values.”

**What does a future without statistical significance look like?**

The P value itself is just the output of a statistical test, and no one is trying to get rid of it. Instead, the signers of the *Nature* manifesto are against the idea of statistical significance, where P is less than or equal to 0.05. That limit gives a false sense of certainty about results, McShane says. “Statistics is often wrongly perceived to be a way to get rid of uncertainty,” he says. But it’s really “about quantifying the degree of uncertainty.”

Embracing that uncertainty would change how science is communicated to the public. People expect clear yes-or-no answers from science, or want to know that an experiment “found” something, though that’s never truly the case, Haaf says. There is always uncertainty in scientific results. But right now scientists and nonscientists alike have bought into the false certainty of statistical significance.

Those teaching or communicating science, and those learning and listening, would need to understand and embrace uncertainty right along with the scientific community. “I’m not sure how we do that,” says Haaf. “What people want from science is answers, and sometimes the way we report data should show [that] we don’t have a clear answer; it’s messier than you think.”