Philosophical critique exposes flaws in medical evidence hierarchies

Rankings of research reliability are logically untenable, an in-depth analysis concludes


Evidence hierarchies, one version shown, classify types of studies according to the strength of evidence they provide. But a recent paper challenges the assumptions behind these hierarchies.


Immanuel Kant was famous for writing critiques.

He earned his status as the premier philosopher of modern times with such works as Critique of Pure Reason, Critique of Practical Reason and Critique of Judgment. It might have been helpful for medical science if he had also written a critique of evidence.

Scientific research supposedly provides reliable evidence for physicians to apply in treating patients. In recent years “evidence-based medicine” has been the guiding buzzword for clinical practice. But not all “evidence” is created equal. Many experts therefore advocate the use of an evidence hierarchy: a ladder or pyramid that classifies different types of studies in order of their evidentiary strength. Anecdotes, for instance, might occupy the lowest level of the evidence pyramid. At the apex you’d typically find randomized controlled clinical trials, or perhaps meta-analyses, which combine multiple studies in a single analysis.

Kant died in 1804, so it’s hard to say what he would have thought about evidence hierarchy pyramids. But at least one modern-day philosopher thinks they’re bunk.

In a Ph.D. thesis submitted in September 2015 to the London School of Economics, philosopher of medicine Christopher Blunt analyzes evidence-based medicine’s evidence hierarchies in considerable depth (requiring 79,599 words). He notes that such hierarchies have been formally adopted by many prominent medicine-related organizations, such as the World Health Organization and the U.S. Preventive Services Task Force. But philosophical assessment of such hierarchies has generally focused on randomized clinical trials. It “has largely neglected the questions of what hierarchies are, what assumptions they require, and how they affect clinical practice,” Blunt asserts.

Throughout his thesis, Blunt examines the facts and logic underlying the development, use and interpretation of medical evidence hierarchies. He finds that “hierarchies in general embed untenable philosophical assumptions….” And he reaches a sobering conclusion: “Hierarchies are a poor basis for the application of evidence in clinical practice. The Evidence-Based Medicine movement should move beyond them and explore alternative tools for appraising the overall evidence for therapeutic claims.”

Each chapter of Blunt’s thesis confronts some aspect of evidence hierarchies that suggests the need for skepticism. For one thing, dozens of such hierarchies have been proposed (Blunt counts more than 80). There is no obvious way to judge which is the best one. Furthermore, developers of different hierarchies suggest different ways of interpreting them. Not to mention that various hierarchy versions don’t always agree on what “evidence” even means or what “counts” as evidence.

It’s not even clear that evidence hierarchies really rank evidence. They actually rank methodologies; clinical trials, for instance, are supposedly in some sense a better methodology than observational studies. But a “better” method does not always produce superior evidence. A poorly conducted clinical trial may yield evidence inferior to that of a high-quality observational study. And sometimes two clinical trials disagree; they can’t both offer the “best” evidence.

Ultimately, the idea behind evidence-based medicine is to provide doctors with sound recommendations for treating patients. But evidence hierarchies, with their emphasis on clinical trials, are frequently unhelpful in this regard. A trial with high “internal validity” — properly conducted and analyzed — might have low “external validity” (its worth for treating real-world patients who differ in important respects from patients in the original trial). In fact, as Blunt points out, efforts to ensure high internal validity (carefully limiting who gets admitted to the trial, for example) may actually reduce the likelihood of external validity. “The emphasis upon ensuring internal validity may come at the expense of generalizability,” he comments.

High-quality medical evidence must be relevant to the population to be treated, along with offering the expectation of a clinically significant effect. “If RCT [randomized controlled trial] evidence normally lacks these … two properties, then RCT evidence is not normally high-quality evidence, contrary to the claims implied by most interpretations of most hierarchies,” Blunt declares.

He especially emphasizes the problem of individual differences from patient to patient. Clinical trials determine average effects of a treatment on a population of people. Those averages may not be applicable to treating a given individual.

“RCTs are primarily used to provide evidence for claims about the average treatment effect, and their primary results provide no evidence about individual treatment effects,” Blunt writes. “But information about average treatment effects is an insufficient basis to make recommendations.”
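The gap between average and individual treatment effects is easy to make concrete with a bit of arithmetic. The following is a toy illustration (the numbers are invented for this sketch, not taken from Blunt’s thesis): a drug whose trial would report a positive average effect even though half the patients are harmed by it.

```python
# Toy population (illustrative numbers only, not from Blunt's thesis):
# individual treatment effects, measured as the change in some outcome score.
responders = [+2.0] * 50   # 50 patients improve by 2 points
harmed     = [-1.0] * 50   # 50 patients worsen by 1 point

effects = responders + harmed

# An RCT's primary result estimates this average:
average_effect = sum(effects) / len(effects)   # (50*2 - 50*1) / 100 = 0.5

# But the average hides the distribution of individual effects:
fraction_harmed = sum(e < 0 for e in effects) / len(effects)   # 0.5

print(average_effect)    # positive "on average"
print(fraction_harmed)   # yet half the patients are worse off
```

The average effect (+0.5) would look like a modest benefit in a trial report, while saying nothing about the half of patients for whom the drug is harmful.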

Some proponents of RCTs claim that random assignment of patients to treatment and comparison groups should ease concerns about individual differences. Some personal variables that might influence the results will be evenly distributed (more or less) between groups by randomization, but there is no guarantee that all influencing variables will be. “If there are many confounding factors (and for complex medical interventions, it seems reasonable to expect that there will be a great many potential confounders), the chance that all confounding factors are near-evenly distributed in a given random allocation is low,” Blunt writes.
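Blunt’s probabilistic claim can be checked with a short simulation. This is my own sketch, not code from the thesis; the patient counts, tolerance for “near-even,” and independent coin-flip confounders are all assumptions made for illustration.

```python
import random

def balanced_fraction(n_patients=200, n_confounders=20, tol=0.1,
                      n_sims=2000, seed=1):
    """Estimate the probability that one random allocation leaves EVERY
    confounder 'near-evenly' split between the two arms.

    Each confounder is an independent 50/50 trait per patient. An allocation
    counts as balanced only if, for every confounder, its prevalence in the
    treatment arm differs from the control arm by less than `tol`.
    """
    rng = random.Random(seed)
    balanced = 0
    for _ in range(n_sims):
        # one binary confounder profile per patient
        patients = [[rng.random() < 0.5 for _ in range(n_confounders)]
                    for _ in range(n_patients)]
        # shuffle stands in for random allocation (statistically redundant
        # here since profiles are i.i.d., but shown for clarity)
        rng.shuffle(patients)
        treat = patients[: n_patients // 2]
        control = patients[n_patients // 2:]
        ok = True
        for j in range(n_confounders):
            p_t = sum(p[j] for p in treat) / len(treat)
            p_c = sum(p[j] for p in control) / len(control)
            if abs(p_t - p_c) >= tol:
                ok = False
                break
        if ok:
            balanced += 1
    return balanced / n_sims

for k in (1, 5, 20, 50):
    print(k, balanced_fraction(n_confounders=k))
```

With a single confounder, most allocations come out near-even; as the number of independent confounders grows, the fraction of allocations in which all of them are near-evenly split falls off rapidly, which is the pattern behind Blunt’s remark.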

After all, doctors know well that many drugs work for some patients but not others. Evidence about how treatment effects vary is essential for best medical practices, and that’s the sort of information that averages reported by clinical trials do not provide.

Blunt examines many other issues that cast doubt on the validity of evidence hierarchies. Some methods may be good at providing positive evidence that a drug works, for example, but other types of studies may be better at finding evidence that it doesn’t work, or that it causes unacceptable side effects. Many hierarchies rate clinical experience, expert opinion and “mechanistic” evidence — actual biological science — as evidence of the lowest form. Yet biological plausibility is an important element in evaluating evidence for a drug’s efficacy. And biological implausibility is a powerful clue that a study’s result should be viewed skeptically.

“Hierarchies of evidence are a poor basis for evidence appraisal,” Blunt concludes. “There is no convincing evidence for the claim that hierarchical appraisal improves practice…. At the present time, neither a theoretical nor an empirical justification for hierarchies of evidence can be successfully provided.”

Medical practitioners and hierarchy advocates need to realize, Blunt suggests, that evidence from individual studies in isolation, regardless of methodology, is typically not as strong as an “evidence base” comprising results from numerous different kinds of studies using different methods.  

“Numerous philosophical accounts of evidence for causal claims have argued that the strongest causal inferences result from multiple mutually-reinforcing evidence sources,” Blunt points out.

On that point he echoes another philosopher, William Whewell, one of 19th-century England’s most prominent men of science (which is what scientists were typically called before Whewell invented the term “scientist”). In his Novum Organon Renovatum, Whewell articulated what he called the “Consilience of Inductions,” the idea that the best evidence comes from a convergence of implications from unrelated investigations.

“The evidence in favor of our induction is of a much higher and more forcible character when it enables us to explain and determine cases of a kind different from those which were contemplated in the formation of our hypothesis,” Whewell wrote. “That rules springing from remote and unconnected quarters should thus leap to the same point, can only arise from that being the point where truth resides.”

No hierarchy of methods can capture this convergence of research results toward truth. It’s a process that requires intelligence and judgment by knowledgeable people, not slavish adherence to artificial categorizations. Sometimes philosophers see that truth more clearly than scientists do.


Tom Siegfried is a contributing correspondent. He was editor in chief of Science News from 2007 to 2012 and managing editor from 2014 to 2017.
