Psychology’s replication crisis sparks new debate

Analyses of major reproducibility review reach conflicting conclusions

DO OVER Recent evidence that findings from many published psychology studies don’t stand up to further scrutiny may have greatly underestimated the reproducibility of those studies, researchers say. But the extent to which psychology reports can be replicated remains controversial.

Thierry Berrod, Mona Lisa Production/Science Source

By Bruce Bower

March 3, 2016 at 3:41 pm

Psychology got rocked last year by a report that many of the field’s published results vanish in repeat experiments. But that disturbing study sounded a false alarm, a controversial analysis finds.

The original investigation of 100 studies contained key errors, contend Harvard University psychologist Daniel Gilbert and his colleagues. After correcting for those errors, the effects reported in 85 of those studies appeared in replications conducted by different researchers. So an initial conclusion that only 35 studies generated repeatable findings was a gross underestimate, Gilbert’s team reports in the March 4 Science.

“There’s no evidence for a replication crisis in psychology,” Gilbert says.

Psychologist Brian Nosek of the University of Virginia in Charlottesville and other members of the group who conducted the original replication study (SN: 10/3/15, p. 8) reject Gilbert’s analysis. The 2015 report provides “initial, not definitive evidence” that psychology has a reproducibility problem, they write in a response published in the same issue of Science.

Strikingly, “the very best scientists cannot really agree on what the results of the most important paper in the recent history of psychology mean,” says Stanford University epidemiologist John Ioannidis. Researchers’ assumptions and expectations can influence their take on any results, “no matter how clear and strong they are.”

Many repeat studies in the 2015 paper differed dramatically from initial studies, stacking the deck against achieving successful replications, Gilbert says. Replications often sampled different populations, such as substituting native Italians for Americans in a study of attitudes toward black Americans. Many altered procedures. One replication effort gave older children the relatively easy task of locating items on a small computer screen, whereas the original study gave younger children a harder task of locating items on a large computer screen.

Repeat studies also generally included too few volunteers to make a statistically compelling case that a replication had succeeded or failed, Gilbert says. Another problem was that each original study was replicated only once. Multiple repeats of a study balance out differences in study procedures and increase the number of successful replications, the scientists argue.

In a replication study that often amounted to a comparison of apples and oranges, at least 34 replication studies should have failed by chance, assuming all 100 original studies described true effects, Gilbert and his colleagues estimate. That makes the new estimate of 85 successful replications even more impressive, they say.

Nosek’s group calculates that only about 22 replication attempts in the 2015 study should have failed by chance. Tellingly, Nosek says, even successful replications found weaker statistical effects than the original studies had. Published studies make statistically significant findings look unduly strong, he says. Journals usually don’t publish replication failures and many researchers simply file them away.
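The pattern Nosek describes, in which replications find weaker effects than the published originals, can be illustrated with a small simulation. This is only a sketch under invented assumptions: the true effect size (0.3), the group size (20 volunteers per group), the number of simulated studies and the rule that only significant results get published are all hypothetical choices, not figures from either team's analysis.

```python
import random
import statistics

def simulate_study(true_effect, n, rng):
    """Return the observed mean difference for one two-group study (SD = 1)."""
    control = [rng.gauss(0.0, 1.0) for _ in range(n)]
    treated = [rng.gauss(true_effect, 1.0) for _ in range(n)]
    return statistics.mean(treated) - statistics.mean(control)

def is_significant(observed, n):
    """z-test at the 0.05 level, assuming a known standard deviation of 1."""
    standard_error = (2.0 / n) ** 0.5
    return abs(observed / standard_error) > 1.96

def published_vs_replication(true_effect=0.3, n=20, studies=2000, seed=1):
    """Compare effects in 'published' originals with effects in their replications.

    Hypothetical publication rule: an original study is published only if it
    is statistically significant in the predicted (positive) direction.
    Every published study is then replicated once with the same sample size.
    """
    rng = random.Random(seed)
    published, replications = [], []
    for _ in range(studies):
        original = simulate_study(true_effect, n, rng)
        if is_significant(original, n) and original > 0:
            published.append(original)
            replications.append(simulate_study(true_effect, n, rng))
    return statistics.mean(published), statistics.mean(replications)

pub_mean, rep_mean = published_vs_replication()
print(f"mean published effect:   {pub_mean:.2f}")
print(f"mean replication effect: {rep_mean:.2f}")
```

Under these assumptions, the published originals overstate the true effect because only the luckiest samples clear the significance bar, while the replications, which face no such filter, land near the true value.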

Another new analysis of the work by Nosek’s group suggests that replication study samples need to be beefed up before any conclusions can be made about the durability of psychology results. Failures to replicate in the 2015 investigation largely occurred because many original studies contained only enough participants to generate weak but statistically significant effects, two psychologists assert February 26 in PLOS ONE. Journals’ bias for publishing only positive results also contributed to replication failures, add Alexander Etz, at the University of Amsterdam at the time of the study, and Joachim Vandekerckhove of the University of California, Irvine.

The pair statistically analyzed 72 papers and replication attempts from Nosek’s project. Only 19 original studies contained enough volunteers to yield a strong, statistically significant effect. Nosek’s team needed many more studies with comparably large sample sizes to generalize about the state of replication in psychology, the researchers say.

Researchers in psychology and other fields need to worry less about reproducing statistically significant results and more about developing theories that can be tested with a variety of statistical approaches, argues psychologist Gerd Gigerenzer of the Max Planck Institute for Human Development in Berlin. Statistical significance expresses the probability of observing a relationship between two variables at least as strong as the one found (say, a link between a change in the wording of a charitable appeal and an increase in donations), assuming from the start that no such relationship actually exists. But researchers rarely test any proposed explanations for statistically significant results.
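One way to make that definition of statistical significance concrete is a permutation test, sketched below for the charitable-appeal example. The donation amounts are invented for illustration, and the permutation test is just one of several ways to compute such a probability; it is not a method attributed to Gigerenzer or to any study mentioned here.

```python
import random
import statistics

def permutation_p_value(group_a, group_b, shuffles=5000, seed=0):
    """Estimate how often chance alone produces a gap at least as large as observed.

    Assumes no real relationship exists, so group labels are interchangeable:
    the pooled values are reshuffled repeatedly and the gap between the two
    reshuffled group means is compared with the gap actually observed.
    """
    rng = random.Random(seed)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(shuffles):
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add 1 to both counts so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (shuffles + 1)

# Hypothetical donations (in dollars) under two wordings of an appeal.
original_wording = [10, 12, 9, 11, 13, 10, 12, 11]
new_wording = [15, 17, 14, 16, 18, 15, 17, 16]
p_value = permutation_p_value(original_wording, new_wording)
print(f"p = {p_value:.4f}")
```

A small value says only that the observed gap would be unusual if wording made no difference. It does not say why donations rose, which is the gap in explanation that Gigerenzer points to.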

Pressures to publish encourage researchers to tweak what they’re studying and how they measure it to ensure statistically significant results, Gigerenzer adds. Journals need to review study proposals before any experiments are run, in order to discourage such “borderline cheating,” he recommends.