‘Replication crisis’ spurs reforms in how science studies are done

But some researchers say the focus on reproducibility ignores a larger problem

PONDERING SCIENCE A new investigation indicates that reproducibility of social science studies (including one investigating how contemplation of Rodin’s The Thinker statue affects religious beliefs), while not great, is improving. But intensifying replication efforts may address only surface problems.

Joe deSousa/Flickr (CC0 1.0)

By Bruce Bower

August 27, 2018 at 11:00 am

What started out a few years ago as a crisis of confidence in scientific results has evolved into an opportunity for improvement. Researchers and journal editors are exposing how studies get done and encouraging independent redos of published reports. And there’s nothing like a string of failed replications to spur improved scientific practice.

That’s the conclusion of a research team, led by Caltech economist Colin Camerer, that examined 21 social science papers published in two major scientific journals, Nature and Science, from 2010 to 2015. Five replication teams directed by coauthors of the new study successfully reproduced effects reported for 13 of those investigations, the researchers report online August 27 in Nature Human Behaviour. Results reported in eight papers could not be replicated.

That replication rate is an improvement over the one found in a previous attempt to replicate psychology findings (SN: 4/2/16, p. 8). But the latest results underscore the need to view any single study with caution, a lesson that many researchers and journal gatekeepers have taken to heart over the past few years, Camerer’s team says. An opportunity now exists to create a scientific culture of replication that provides a check on what ends up getting published and publicized, the researchers contend.

Still, the new study reveals a troubling aspect of successful experimental redos. Camerer’s team found that for repeat studies that panned out, which included four to five times as many participants as originally studied, the effects detected were weaker than those reported for the initial investigations. In other words, the best replications, which exceeded initial studies in their ability to detect actual effects, were only partially successful.

One reason for that trend is that scientific journals have tended not to publish studies that disconfirm previous findings, leaving initial findings unchallenged until now, says study coauthor and psychologist Brian Nosek of the University of Virginia in Charlottesville. Even the most prestigious journals have often published results that garner lots of scientific and media attention but that could easily have occurred randomly, he says.
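A rough simulation can illustrate why selective publication would make later replications look weaker. The sketch below is not the analysis Camerer’s team ran. It assumes a modest true effect, small hypothetical studies, a known standard deviation of 1, and a one-sided cutoff of 1.96 standing in for “statistically significant.” Every name and number in it is made up for illustration.

```python
import random
from statistics import mean


def simulate_published_effects(true_effect=0.2, n_per_group=20,
                               n_studies=2000, seed=1):
    """Run many small hypothetical studies and 'publish' only significant ones."""
    rng = random.Random(seed)
    # Standard error of a difference between two group means when each
    # group has unit variance (an assumption made to keep the sketch short).
    se = (2 / n_per_group) ** 0.5
    all_estimates, published = [], []
    for _ in range(n_studies):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
        estimate = mean(treated) - mean(control)
        all_estimates.append(estimate)
        # Only estimates that clear the cutoff count as "published".
        if estimate / se > 1.96:
            published.append(estimate)
    return all_estimates, published


all_estimates, published = simulate_published_effects()
print(f"studies run: {len(all_estimates)}, published: {len(published)}")
print(f"average effect across all studies: {mean(all_estimates):.2f}")
print(f"average effect across published studies: {mean(published):.2f}")
```

Under these assumptions, the published studies overstate the true effect because only the largest estimates clear the cutoff. A careful redo of any one of them would therefore tend to find something smaller.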

On the plus side, the new report appears as such practices are changing. “The social and behavioral sciences are in the midst of a reformation in scientific practices,” Nosek says.

In the last five years, for example, 19 of 33 journals in social and personality psychology have established policies requiring investigators to submit their research designs for peer review before submitting research papers for review. In this way, peer reviewers can check whether experimenters altered their procedures to tease out positive effects. The same journals also collect experimental data from researchers so that replications can be conducted.

Intriguingly, when Camerer’s group asked nearly 400 researchers, mostly psychologists and economists, to examine data from the 21 experiments and predict whether each could be reproduced, the scientists’ forecasts were usually correct. Peer predictions may be one way to bolster peer reviews and help weed out weak studies, Nosek says.

Another positive sign is that scientists whose papers were rechecked in the new study generally cooperated with the effort, even if their findings failed to replicate. For instance, one new replication study that Camerer and colleagues examined did not support a 2012 Science report that viewing pictures of Auguste Rodin’s famous statue The Thinker reduces volunteers’ self-reported religious belief. This finding was part of a project examining how mental reflection affects religious belief.

Psychologist Will Gervais of the University of Kentucky in Lexington, a coauthor of the 2012 paper, welcomes the new evidence. “In hindsight, we oversold a study with an effect that was barely statistically significant,” Gervais says. “I like to think that our study wouldn’t get published today.”

Current replication efforts represent “an opportunity to sharpen our scientific practices,” he adds.

Even if he’s right, the problem with studies in the social sciences, as well as disciplines including neuroscience and medical research (SN: 2/18/17, p. 10), goes deeper than reproducibility, says psychologist Gerd Gigerenzer of the Max Planck Institute for Human Development in Berlin.

Researchers in these fields largely rely on a statistical technique that assesses whether a result, say an apparent decline in religious belief after viewing The Thinker, would likely occur if there were no true difference. Scientists call that assumption of no difference a null hypothesis. An arbitrary cutoff is used to determine whether a reported difference from a null hypothesis is “statistically significant.” But clearing that cutoff doesn’t establish the existence of a true effect, although it’s usually assumed to do just that. And typically, researchers make no attempt to test alternative predictions that attempt to explain how, for example, contemplation might affect spiritual convictions, Gigerenzer says.
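For readers who want to see the mechanics, here is a minimal sketch of one kind of null-hypothesis test, a permutation test. It is not the method used in any study mentioned here. The belief ratings are hypothetical, and the 0.05 cutoff is simply the conventional choice.

```python
import random
from statistics import mean


def permutation_p_value(control, treated, n_shuffles=2000, seed=0):
    """Estimate how often a gap at least as large as the observed one
    appears when group labels are assigned purely at random."""
    rng = random.Random(seed)
    observed = abs(mean(treated) - mean(control))
    pooled = list(control) + list(treated)
    n_control = len(control)
    hits = 0
    for _ in range(n_shuffles):
        # Shuffling mimics the null hypothesis: labels carry no information.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[n_control:]) - mean(pooled[:n_control]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_shuffles + 1)


# Hypothetical belief ratings on a 0-100 scale, not data from any study.
control = [62, 55, 71, 48, 66, 59, 73, 51]
treated = [50, 47, 65, 41, 58, 52, 60, 44]

ALPHA = 0.05  # conventional cutoff; nothing in the data justifies this value
p = permutation_p_value(control, treated)
print(f"p = {p:.3f}, significant at {ALPHA}: {p < ALPHA}")
```

The p-value only says how surprising the data would be if there were no true difference. It does not measure how likely the effect is to be real, and it does not test any explanation of why viewing a statue might change belief.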

What’s worse, many researchers wrongly assume that achieving statistical significance in a study makes replication of its results unnecessary.

That’s what Gigerenzer found in a review of studies that quizzed 839 academic psychologists and 991 undergraduate and graduate psychology students. Of those, 20 percent of faculty teaching statistics, 39 percent of professors and lecturers, and 66 percent of students believed that statistical significance means a result requires no replication, Gigerenzer reports in the June Advances in Methods and Practices in Psychological Science.

Such results fit a scenario in which researchers usually resort to null-hypothesis testing as a ritual for getting studies published, without having to develop actual theories, Gigerenzer contends. Psychology departments need to teach students about various statistical methods that can be used to test experimental predictions, he suggests. Journal editors should no longer accept studies based solely on statistical significance, he adds.

“We need to blow up the current system and promote real statistical thinking and judgment,” Gigerenzer argues.

Although not a proponent of radical change, Nosek sees bigger revisions on the horizon. “I’m going to be optimistic and predict that in five years we’ll see the social and behavioral sciences move beyond studying null hypotheses,” he says.