Odds Are, It's Wrong
Science fails to face the shortcomings of statistics
[Figure: A P value is the probability of an observed (or more extreme) result arising only from chance. Credit: S. Goodman, adapted by A. Nandy]

For better or for worse, science has long been married to mathematics. Generally it has been for the better. Especially since the days of Galileo and Newton, math has nurtured science. Rigorous mathematical methods have secured science’s fidelity to fact and conferred a timeless reliability to its findings.

During the past century, though, a mutant form of math has deflected science’s heart from the modes of calculation that had long served so faithfully. Science was seduced by statistics, the math rooted in the same principles that guarantee profits for Las Vegas casinos. Supposedly, the proper use of statistics makes relying on scientific results a safe bet. But in practice, widespread misuse of statistical methods makes science more like a crapshoot.

It’s science’s dirtiest secret: The “scientific method” of testing hypotheses by statistical analysis stands on a flimsy foundation. Statistical tests are supposed to guide scientists in judging whether an experimental result reflects some real effect or is merely a random fluke, but the standard methods mix mutually inconsistent philosophies and offer no meaningful basis for making such decisions. Even when performed correctly, statistical tests are widely misunderstood and frequently misinterpreted. As a result, countless conclusions in the scientific literature are erroneous, and tests of medical dangers or treatments are often contradictory and confusing.

Replicating a result helps establish its validity more securely, but the common tactic of combining numerous studies into one analysis, while sound in principle, is seldom conducted properly in practice.

Experts in the math of probability and statistics are well aware of these problems and have for decades expressed concern about them in major journals. Over the years, hundreds of published papers have warned that science’s love affair with statistics has spawned countless illegitimate findings. In fact, if you believe what you read in the scientific literature, you shouldn’t believe what you read in the scientific literature.

“There is increasing concern,” declared epidemiologist John Ioannidis in a highly cited 2005 paper in PLoS Medicine, “that in modern research, false findings may be the majority or even the vast majority of published research claims.”

Ioannidis claimed to prove that more than half of published findings are false, but his analysis came under fire for statistical shortcomings of its own. “It may be true, but he didn’t prove it,” says biostatistician Steven Goodman of the Johns Hopkins University School of Public Health. On the other hand, says Goodman, the basic message stands. “There are more false claims made in the medical literature than anybody appreciates,” he says. “There’s no question about that.”

Nobody contends that all of science is wrong, or that it hasn’t compiled an impressive array of truths about the natural world. Still, any single scientific study alone is quite likely to be incorrect, thanks largely to the fact that the standard statistical system for drawing conclusions is, in essence, illogical. “A lot of scientists don’t understand statistics,” says Goodman. “And they don’t understand statistics because the statistics don’t make sense.”

Statistical insignificance

Nowhere are the problems with statistics more blatant than in studies of genetic influences on disease. In 2007, for instance, researchers combing the medical literature found numerous studies linking a total of 85 genetic variants in 70 different genes to acute coronary syndrome, a cluster of heart problems. When the researchers compared genetic tests of 811 patients who had the syndrome with a group of 650 people (matched for sex and age) who didn't, only one of the suspect gene variants turned up substantially more often in those with the syndrome — a number to be expected by chance.

“Our null results provide no support for the hypothesis that any of the 85 genetic variants tested is a susceptibility factor” for the syndrome, the researchers reported in the Journal of the American Medical Association.

How could so many studies be wrong? Because their conclusions relied on “statistical significance,” a concept at the heart of the mathematical analysis of modern scientific experiments.

Statistical significance is a phrase that every science graduate student learns, but few comprehend. While its origins stretch back at least to the 19th century, the modern notion was pioneered by the mathematician Ronald A. Fisher in the 1920s. His original interest was agriculture. He sought a test of whether variation in crop yields was due to some specific intervention (say, fertilizer) or merely reflected random factors beyond experimental control.

Fisher first assumed that fertilizer caused no difference — the “no effect” or “null” hypothesis. He then calculated a number called the P value, the probability that an observed yield difference (or a larger one) would occur in a fertilized field if fertilizer had no real effect. If P is less than .05 — meaning the chance of such a fluke is less than 5 percent — the result should be deemed “statistically significant,” Fisher arbitrarily decreed, and the no effect hypothesis should be rejected, supposedly confirming that fertilizer works.
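Fisher's recipe can be sketched numerically. Here is a minimal Monte Carlo illustration; the yield numbers are invented for the sketch and do not come from Fisher's actual data.

```python
import random

random.seed(42)

# Hypothetical numbers (invented for illustration): unfertilized plots
# historically yield 100 bushels on average with a spread of 10,
# and a fertilized plot yields 118.
NULL_MEAN, NULL_SD, OBSERVED = 100.0, 10.0, 118.0

# Monte Carlo estimate of the one-sided P value: how often would a
# yield at least this extreme arise if fertilizer had no effect?
trials = 100_000
extreme = sum(random.gauss(NULL_MEAN, NULL_SD) >= OBSERVED for _ in range(trials))
p_value = extreme / trials

print(f"estimated P value: {p_value:.3f}")
# By Fisher's convention, p < .05 would be declared "statistically
# significant" -- but a small P value alone cannot say whether the
# effect is real or the result is an improbable fluke.
```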

Fisher’s P value eventually became the ultimate arbiter of credibility for science results of all sorts — whether testing the health effects of pollutants, the curative powers of new drugs or the effect of genes on behavior. In various forms, testing for statistical significance pervades most of scientific and medical research to this day.

But in fact, there’s no logical basis for using a P value from a single study to draw any conclusion. If the chance of a fluke is less than 5 percent, two possible conclusions remain: There is a real effect, or the result is an improbable fluke. Fisher’s method offers no way to know which is which. On the other hand, if a study finds no statistically significant effect, that doesn’t prove anything, either. Perhaps the effect doesn’t exist, or maybe the statistical test wasn’t powerful enough to detect a small but real effect.

“That test itself is neither necessary nor sufficient for proving a scientific result,” asserts Stephen Ziliak, an economic historian at Roosevelt University in Chicago.

Soon after Fisher established his system of statistical significance, it was attacked by other mathematicians, notably Egon Pearson and Jerzy Neyman. Rather than testing a null hypothesis, they argued, it made more sense to test competing hypotheses against one another. That approach also produces a P value, which is used to gauge the likelihood of a “false positive” — concluding an effect is real when it actually isn’t. What eventually emerged was a hybrid mix of the mutually inconsistent Fisher and Neyman-Pearson approaches, which has rendered interpretations of standard statistics muddled at best and simply erroneous at worst. As a result, most scientists are confused about the meaning of a P value or how to interpret it. “It’s almost never, ever, ever stated correctly, what it means,” says Goodman.

Correctly phrased, experimental data yielding a P value of .05 means that there is only a 5 percent chance of obtaining the observed (or more extreme) result if no real effect exists (that is, if the no-difference hypothesis is correct). But many explanations mangle the subtleties in that definition. A recent popular book on issues involving science, for example, states a commonly held misperception about the meaning of statistical significance at the .05 level: “This means that it is 95 percent certain that the observed difference between groups, or sets of samples, is real and could not have arisen by chance.”

That interpretation commits an egregious logical error (technical term: “transposed conditional”): confusing the odds of getting a result (if a hypothesis is true) with the odds favoring the hypothesis if you observe that result. A well-fed dog may seldom bark, but observing the rare bark does not imply that the dog is hungry. A dog may bark 5 percent of the time even if it is well-fed all of the time. (See Box 2)

Another common error equates statistical significance to “significance” in the ordinary use of the word. Because of the way statistical formulas work, a study with a very large sample can detect “statistical significance” for a small effect that is meaningless in practical terms. A new drug may be statistically better than an old drug, but for every thousand people you treat you might get just one or two additional cures — not clinically significant. Similarly, when studies claim that a chemical causes a “significantly increased risk of cancer,” they often mean that it is just statistically significant, possibly posing only a tiny absolute increase in risk.

Statisticians perpetually caution against mistaking statistical significance for practical importance, but scientific papers commit that error often. Ziliak studied journals from various fields — psychology, medicine and economics among others — and reported frequent disregard for the distinction.

“I found that eight or nine of every 10 articles published in the leading journals make the fatal substitution” of equating statistical significance to importance, he said in an interview. Ziliak’s data are documented in the 2008 book The Cult of Statistical Significance, coauthored with Deirdre McCloskey of the University of Illinois at Chicago.

Multiplicity of mistakes

Even when “significance” is properly defined and P values are carefully calculated, statistical inference is plagued by many other problems. Chief among them is the “multiplicity” issue — the testing of many hypotheses simultaneously. When several drugs are tested at once, or a single drug is tested on several groups, chances of getting a statistically significant but false result rise rapidly. Experiments on altered gene activity in diseases may test 20,000 genes at once, for instance. Using a P value of .05, such studies could find 1,000 genes that appear to differ even if none are actually involved in the disease. Setting a higher threshold of statistical significance will eliminate some of those flukes, but only at the cost of eliminating truly changed genes from the list.

In metabolic diseases such as diabetes, for example, many genes truly differ in activity, but the changes are so small that statistical tests will dismiss most as mere fluctuations. Of hundreds of genes that misbehave, standard stats might identify only one or two. Altering the threshold to nab 80 percent of the true culprits might produce a list of 13,000 genes — of which over 12,000 are actually innocent.
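The arithmetic behind that 1,000-gene figure is easy to check with a small simulation, using the numbers quoted in the text (20,000 genes, none truly involved, each tested at the .05 level):

```python
import random

random.seed(0)

N_GENES, ALPHA = 20_000, 0.05

# If no gene is truly involved in the disease, each test still comes
# up "significant" with probability ALPHA, purely by chance.
false_hits = sum(random.random() < ALPHA for _ in range(N_GENES))

print(f"expected flukes: {N_GENES * ALPHA:.0f}, simulated: {false_hits}")
```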

Recognizing these problems, some researchers now calculate a “false discovery rate” to warn of flukes disguised as real effects. And genetics researchers have begun using “genome-wide association studies” that attempt to ameliorate the multiplicity issue (SN: 6/21/08, p. 20).
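For readers curious what a false-discovery-rate calculation looks like, here is a minimal sketch of the Benjamini-Hochberg procedure, one common way of controlling the false discovery rate. The toy P values are invented for illustration; real genomics pipelines use library implementations.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return indices of hypotheses rejected at false discovery rate q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k/m) * q; reject the k
    # smallest P values.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff = rank
    return sorted(order[:cutoff])

# Toy P values: three genuinely small, the rest unremarkable.
pvals = [0.001, 0.008, 0.012, 0.20, 0.35, 0.41, 0.60, 0.74, 0.88, 0.95]
print(benjamini_hochberg(pvals))  # -> [0, 1, 2]
```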

Many researchers now also commonly report results with confidence intervals, similar to the margins of error reported in opinion polls. Such intervals, usually given as a range that should include the actual value with 95 percent confidence, do convey a better sense of how precise a finding is. But the 95 percent confidence calculation is based on the same math as the .05 P value and so still shares some of its problems.
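The poll analogy can be made concrete. A sketch with invented poll numbers, using the standard normal-approximation formula (the multiplier 1.96 encodes the same normal-curve math as the .05 cutoff):

```python
import math

# Hypothetical poll (numbers invented): 540 of 1,000 respondents
# favor a proposal.
successes, n = 540, 1000
p_hat = successes / n
margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
low, high = p_hat - margin, p_hat + margin

print(f"95% CI: {low:.3f} to {high:.3f}")
# The interval excludes 0.50, which is equivalent to the two-sided
# P value against a 50-50 split falling below .05 -- the confidence
# interval inherits the P value's logic along with its limitations.
```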

Clinical trials and errors

Statistical problems also afflict the “gold standard” for medical research, the randomized, controlled clinical trials that test drugs for their ability to cure or their power to harm. Such trials assign patients at random to receive either the substance being tested or a placebo, typically a sugar pill; random selection supposedly guarantees that patients’ personal characteristics won’t bias the choice of who gets the actual treatment. But in practice, selection biases may still occur, Vance Berger and Sherri Weinstein noted in 2004 in Controlled Clinical Trials. “Some of the benefits ascribed to randomization, for example that it eliminates all selection bias, can better be described as fantasy than reality,” they wrote.

Randomization also should ensure that unknown differences among individuals are mixed in roughly the same proportions in the groups being tested. But statistics do not guarantee an equal distribution any more than they prohibit 10 heads in a row when flipping a penny. With thousands of clinical trials in progress, some will not be well randomized. And DNA differs at more than a million spots in the human genetic catalog, so even in a single trial differences may not be evenly mixed. In a sufficiently large trial, unrandomized factors may balance out, if some have positive effects and some are negative. (See Box 3) Still, trial results are reported as averages that may obscure individual differences, masking beneficial or harmful effects and possibly leading to approval of drugs that are deadly for some and denial of effective treatment to others.
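How often randomization fails to mix an unmeasured trait evenly is easy to simulate. A rough sketch, with trial sizes invented for illustration:

```python
import random

random.seed(1)

# In each of 1,000 hypothetical trials, 40 patients who share some
# unmeasured trait are split between two arms by coin flip. How often
# does one arm end up with a lopsided share (65 percent or more)?
def lopsided_trial(carriers=40, threshold=26):
    in_arm_a = sum(random.random() < 0.5 for _ in range(carriers))
    return in_arm_a >= threshold or (carriers - in_arm_a) >= threshold

n_trials = 1000
lopsided = sum(lopsided_trial() for _ in range(n_trials))
print(f"{lopsided} of {n_trials} trials badly imbalanced")
# Randomization balances groups only on average; with many trials in
# progress, some will be poorly mixed just by chance.
```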

“Determining the best treatment for a particular patient is fundamentally different from determining which treatment is best on average,” physicians David Kent and Rodney Hayward wrote in American Scientist in 2007. “Reporting a single number gives the misleading impression that the treatment-effect is a property of the drug rather than of the interaction between the drug and the complex risk-benefit profile of a particular group of patients.”

Another concern is the common strategy of combining results from many trials into a single “meta-analysis,” a study of studies. In a single trial with relatively few participants, statistical tests may not detect small but real and possibly important effects. In principle, combining smaller studies to create a larger sample would allow the tests to detect such small effects. But statistical techniques for doing so are valid only if certain criteria are met. For one thing, all the studies conducted on the drug must be included — published and unpublished. And all the studies should have been performed in a similar way, using the same protocols, definitions, types of patients and doses. When combining studies with differences, it is necessary first to show that those differences would not affect the analysis, Goodman notes, but that seldom happens. “That’s not a formal part of most meta-analyses,” he says.

Meta-analyses have produced many controversial conclusions. Common claims that antidepressants work no better than placebos, for example, are based on meta-analyses that do not conform to the criteria that would confer validity. Similar problems afflicted a 2007 meta-analysis, published in the New England Journal of Medicine, that attributed increased heart attack risk to the diabetes drug Avandia. Raw data from the combined trials showed that only 55 people in 10,000 had heart attacks when using Avandia, compared with 59 people per 10,000 in comparison groups. But after a series of statistical manipulations, Avandia appeared to confer an increased risk.
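The raw Avandia numbers illustrate the point. Treating the quoted rates as counts in two groups of 10,000 each (an assumption made purely for this sketch; the actual meta-analysis pooled many separate trials), a naive two-proportion z-test finds nothing close to conventional significance in the raw difference:

```python
import math

# Rates quoted in the article: 55 vs. 59 heart attacks per 10,000.
# The two-groups-of-10,000 framing is an assumption for illustration.
events_a, events_b, n = 55, 59, 10_000
p1, p2 = events_a / n, events_b / n
pooled = (events_a + events_b) / (2 * n)
z = (p2 - p1) / math.sqrt(pooled * (1 - pooled) * (2 / n))

print(f"z = {z:.2f}")  # well below 1.96, the z-score matching p = .05
```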

In principle, a proper statistical analysis can suggest an actual risk even though the raw numbers show a benefit. But in this case the criteria justifying such statistical manipulations were not met. In some of the trials, Avandia was given along with other drugs. Sometimes the non-Avandia group got placebo pills, while in other trials that group received another drug. And there were no common definitions.

“Across the trials, there was no standard method for identifying or validating outcomes; events ... may have been missed or misclassified,” Bruce Psaty and Curt Furberg wrote in an editorial accompanying the New England Journal report. “A few events either way might have changed the findings.”

More recently, epidemiologist Charles Hennekens and biostatistician David DeMets have pointed out that combining small studies in a meta-analysis is not a good substitute for a single trial sufficiently large to test a given question. “Meta-analyses can reduce the role of chance in the interpretation but may introduce bias and confounding,” Hennekens and DeMets write in the Dec. 2 Journal of the American Medical Association. “Such results should be considered more as hypothesis formulating than as hypothesis testing.”

These concerns do not make clinical trials worthless, nor do they render science impotent. Some studies show dramatic effects that don’t require sophisticated statistics to interpret. If the P value is 0.0001 — a hundredth of a percent chance of a fluke — that is strong evidence, Goodman points out. Besides, most well-accepted science is based not on any single study, but on studies that have been confirmed by repetition. Any one result may be likely to be wrong, but confidence rises quickly if that result is independently replicated.

“Replication is vital,” says statistician Juliet Shaffer, a lecturer emeritus at the University of California, Berkeley. And in medicine, she says, the need for replication is widely recognized. “But in the social sciences and behavioral sciences, replication is not common,” she noted in San Diego in February at the annual meeting of the American Association for the Advancement of Science. “This is a sad situation.”

Bayes watch

Such sad statistical situations suggest that the marriage of science and math may be desperately in need of counseling. Perhaps it could be provided by the Rev. Thomas Bayes.

Most critics of standard statistics advocate the Bayesian approach to statistical reasoning, a methodology that derives from a theorem credited to Bayes, an 18th century English clergyman. His approach uses similar math, but requires the added twist of a “prior probability” — in essence, an informed guess about the expected probability of something in advance of the study. Often this prior probability is more than a mere guess — it could be based, for instance, on previous studies.

Bayesian math seems baffling at first, even to many scientists, but it basically just reflects the need to include previous knowledge when drawing conclusions from new observations. To infer the odds that a barking dog is hungry, for instance, it is not enough to know how often the dog barks when well-fed. You also need to know how often it eats — in order to calculate the prior probability of being hungry. Bayesian math combines a prior probability with observed data to produce an estimate of the likelihood of the hunger hypothesis. “A scientific hypothesis cannot be properly assessed solely by reference to the observational data,” but only by viewing the data in light of prior belief in the hypothesis, wrote George Diamond and Sanjay Kaul of UCLA’s School of Medicine in 2004 in the Journal of the American College of Cardiology. “Bayes’ theorem is ... a logically consistent, mathematically valid, and intuitive way to draw inferences about the hypothesis.” (See Box 4)

With the increasing availability of computer power to perform its complex calculations, the Bayesian approach has become more widely applied in medicine and other fields in recent years. In many real-life contexts, Bayesian methods do produce the best answers to important questions. In medical diagnoses, for instance, the likelihood that a test for a disease is correct depends on the prevalence of the disease in the population, a factor that Bayesian math would take into account.

But Bayesian methods introduce a confusion into the actual meaning of the mathematical concept of “probability” in the real world. Standard or “frequentist” statistics treat probabilities as objective realities; Bayesians treat probabilities as “degrees of belief” based in part on a personal assessment or subjective decision about what to include in the calculation. That’s a tough placebo to swallow for scientists wedded to the “objective” ideal of standard statistics. “Subjective prior beliefs are anathema to the frequentist, who relies instead on a series of ad hoc algorithms that maintain the facade of scientific objectivity,” Diamond and Kaul wrote.

Conflict between frequentists and Bayesians has been ongoing for two centuries. So science’s marriage to mathematics seems to entail some irreconcilable differences. Whether the future holds a fruitful reconciliation or an ugly separation may depend on forging a shared understanding of probability.

“What does probability mean in real life?” the statistician David Salsburg asked in his 2001 book The Lady Tasting Tea. “This problem is still unsolved, and ... if it remains unsolved, the whole of the statistical approach to science may come crashing down from the weight of its own inconsistencies.”

_______________________________________________________________________

BOX 1: Statistics Can Confuse

Statistical significance is not always statistically significant.

It is common practice to test the effectiveness (or dangers) of a drug by comparing it to a placebo or sham treatment that should have no effect at all. Using statistical methods to compare the results, researchers try to judge whether the real treatment’s effect was greater than the fake treatments by an amount unlikely to occur by chance.

By convention, a result expected to occur less than 5 percent of the time is considered “statistically significant.” So if Drug X outperformed a placebo by an amount that would be expected by chance only 4 percent of the time, most researchers would conclude that Drug X really works (or at least, that there is evidence favoring the conclusion that it works).

Now suppose Drug Y also outperformed the placebo, but by an amount that would be expected by chance 6 percent of the time. In that case, conventional analysis would say that such an effect lacked statistical significance and that there was insufficient evidence to conclude that Drug Y worked.

If both drugs were tested on the same disease, though, a conundrum arises. For even though Drug X appeared to work at a statistically significant level and Drug Y did not, the difference between the performance of Drug X and Drug Y might very well NOT be statistically significant. Had they been tested against each other, rather than separately against placebos, there may have been no statistical evidence to suggest that one was better than the other (even if their cure rates had been precisely the same as in the separate tests).

“Comparisons of the sort, ‘X is statistically significant but Y is not,’ can be misleading,” statisticians Andrew Gelman of Columbia University and Hal Stern of the University of California, Irvine, noted in an article discussing this issue in 2006 in the American Statistician. “Students and practitioners [should] be made more aware that the difference between ‘significant’ and ‘not significant’ is not itself statistically significant.”
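Gelman and Stern's point can be checked directly. A sketch with hypothetical effect estimates expressed as z-scores (each drug's estimate has standard error 1, so the difference between them has standard error √2):

```python
import math

def p_two_sided(z):
    # Two-sided P value from a z-score, via the normal CDF (erf).
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Hypothetical effect estimates as z-scores: p ~ .04 and p ~ .06.
drug_x, drug_y = 2.05, 1.88

print(f"Drug X: p = {p_two_sided(drug_x):.3f}")  # below .05: "significant"
print(f"Drug Y: p = {p_two_sided(drug_y):.3f}")  # above .05: "not significant"

# But the DIFFERENCE between the drugs has standard error sqrt(2):
z_diff = (drug_x - drug_y) / math.sqrt(2)
print(f"X vs. Y: p = {p_two_sided(z_diff):.3f}")  # nowhere near .05
```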

A similar real-life example arises in studies suggesting that children and adolescents taking antidepressants face an increased risk of suicidal thoughts or behavior. Most such studies show no statistically significant increase in such risk, but some show a small (possibly due to chance) excess of suicidal behavior in groups receiving the drug rather than a placebo. One set of such studies, for instance, found that with the antidepressant Paxil, trials recorded more than twice the rate of suicidal incidents for participants given the drug compared with those given the placebo. For another antidepressant, Prozac, trials found fewer suicidal incidents with the drug than with the placebo. So it appeared that Paxil might be more dangerous than Prozac.

But actually, the rate of suicidal incidents was higher with Prozac than with Paxil. The apparent safety advantage of Prozac was due not to the behavior of kids on the drug, but to kids on placebo — in the Paxil trials, fewer kids on placebo reported incidents than those on placebo in the Prozac trials. So the original evidence for showing a possible danger signal from Paxil but not from Prozac was based on data from people in two placebo groups, none of whom received either drug. Consequently it can be misleading to use statistical significance results alone when comparing the benefits (or dangers) of two drugs.

_______________________________________________________________________

BOX 2: The Hunger Hypothesis

A common misinterpretation of the statistician’s P value is that it measures how likely it is that a null (or “no effect”) hypothesis is correct. Actually, the P value gives the probability of observing the result (or a more extreme one) if the null hypothesis is true, that is, if there is no real effect of a treatment or difference between the groups being tested. A P value of .05, for instance, means that there is only a 5 percent chance of getting the observed (or a more extreme) result if the null hypothesis is correct.

It is incorrect, however, to transpose that finding into a 95 percent probability that the null hypothesis is false. “The P value is calculated under the assumption that the null hypothesis is true,” writes biostatistician Steven Goodman. “It therefore cannot simultaneously be a probability that the null hypothesis is false.”

Consider this simplified example. Suppose a certain dog is known to bark constantly when hungry. But when well-fed, the dog barks less than 5 percent of the time. So if you assume for the null hypothesis that the dog is not hungry, the probability of observing the dog barking (given that hypothesis) is less than 5 percent. If you then actually do observe the dog barking, what is the likelihood that the null hypothesis is incorrect and the dog is in fact hungry?

Answer: That probability cannot be computed with the information given. The dog barks 100 percent of the time when hungry, and less than 5 percent of the time when not hungry. To compute the likelihood of hunger, you need to know how often the dog is fed, information not provided by the mere observation of barking.
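To make the box's point concrete, here is the calculation once the missing piece is supplied; the 10 percent hunger rate below is an invented prior, not given in the text:

```python
p_hungry = 0.10          # assumed prior -- not given in the article
p_bark_if_hungry = 1.00  # the dog barks constantly when hungry
p_bark_if_fed = 0.05     # barks less than 5 percent of the time when fed

# Bayes' theorem: P(hungry | bark)
p_bark = p_hungry * p_bark_if_hungry + (1 - p_hungry) * p_bark_if_fed
posterior = p_hungry * p_bark_if_hungry / p_bark

print(f"P(hungry | bark) = {posterior:.2f}")  # 0.69 under these assumptions
```

Change the assumed feeding schedule and the answer changes with it, which is exactly why the bare observation of barking cannot settle the question.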

_______________________________________________________________________

BOX 3: Randomness and Clinical Trials

Assigning patients at random to treatment and control groups is an essential feature of controlled clinical trials, but statistically that approach cannot guarantee that individual differences among patients will always be distributed equally. Experts in clinical trial analyses are aware that such incomplete randomization will leave some important differences unbalanced between experimental groups, at least some of the time.

“This is an important concern,” says biostatistician Don Berry of M.D. Anderson Cancer Center in Houston.

In an e-mail message, Berry points out that two patients who appear to be alike may respond differently to identical treatments. So statisticians attempt to incorporate patient variability into their mathematical models.

“There may be a googol of patient characteristics and it’s guaranteed that not all of them will be balanced by randomization,” Berry notes. “But some characteristics will be biased in favor of treatment A and others in favor of treatment B. They tend to even out. What is not evened out is regarded by statisticians to be ‘random error,’ and this we model explicitly.”

Understanding the individual differences affecting response to treatment is a major goal of scientists pursuing “personalized medicine,” in which therapies are tailored to each person’s particular biology. But the limits of statistical methods in drawing conclusions about subgroups of patients pose a challenge to achieving that goal.

“False-positive observations abound,” Berry acknowledges. “There are patients whose tumors melt away when given some of our newer treatments.… But just which one of the googol of characteristics of this particular tumor enabled such a thing? It’s like looking for a needle in a haystack ... or rather, looking for one special needle in a stack of other needles.”

_______________________________________________________________________

BOX 4: Bayesian Reasoning

Bayesian methods of statistical analysis stem from a paper published posthumously in 1763 by the English clergyman Thomas Bayes. In a Bayesian analysis, probability calculations require a prior value for the likelihood of an association, which is then modified after data are collected. When the prior probability isn’t known, it must be estimated, leading to criticisms that subjective guesses must often be incorporated into what ought to be an objective scientific analysis. But without such an estimate, statistics can produce grossly inaccurate conclusions.

For a simplified example, consider the use of drug tests to detect cheaters in sports. Suppose the test for steroid use among baseball players is 95 percent accurate — that is, it correctly identifies actual steroid users 95 percent of the time, and misidentifies non-users as users 5 percent of the time.

Suppose an anonymous player tests positive. What is the probability that he really is using steroids? Since the test really is accurate 95 percent of the time, the naïve answer would be that probability of guilt is 95 percent. But a Bayesian knows that such a conclusion cannot be drawn from the test alone. You would need to know some additional facts not included in this evidence. In this case, you need to know how many baseball players use steroids to begin with — that would be what a Bayesian would call the prior probability.

Suppose, based on previous testing, that experts have established that about 5 percent of professional baseball players use steroids. If you then test 400 players, how many would test positive?

• Out of the 400 players, 20 are users (5 percent) and 380 are not users.
• Of the 20 users, 19 (95 percent) would be identified correctly as users.
• Of the 380 nonusers, 19 (5 percent) would incorrectly be indicated as users.

So if you tested 400 players, 38 would test positive. Of those, 19 would be guilty users and 19 would be innocent nonusers. So if any single player’s test is positive, the chances that he really is a user are 50 percent, since an equal number of users and nonusers test positive.
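The same arithmetic, written as Bayes' theorem (the 5 percent prior and 95 percent accuracy figures are the article's hypothetical numbers):

```python
prior = 0.05           # assumed fraction of players using steroids
sensitivity = 0.95     # P(test positive | user)
false_positive = 0.05  # P(test positive | non-user)

p_positive = prior * sensitivity + (1 - prior) * false_positive
p_user_given_positive = prior * sensitivity / p_positive

print(f"P(user | positive test) = {p_user_given_positive:.2f}")  # 0.50
```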

References

Altman, D.G. 1994. The scandal of poor medical research. British Medical Journal 308:283-284.

Berger, V.W., and S. Weinstein. 2004. Ensuring the comparability of comparison groups: Is randomization enough? Controlled Clinical Trials 25.

Berry, D.A. 2006. Bayesian clinical trials. Nature Reviews Drug Discovery 5(January):27-36.

Berry, D.A. 2007. The difficult and ubiquitous problems of multiplicities. Pharmaceutical Statistics 6:155-160.

Diamond, G.A., and S. Kaul. 2004. Prior convictions: Bayesian approaches to the analysis and interpretation of clinical megatrials. Journal of the American College of Cardiology 43:1929-1939.

Gelman, A., and D. Weakliem. 2009. Of beauty, sex and power. American Scientist 97:310-316.

Gelman, A., and H. Stern. 2006. The difference between ‘significant’ and ‘not significant’ is not itself statistically significant. American Statistician 60(November):328-331.

Goodman, S.N. 1999. Toward evidence-based medical statistics. 1: The P value fallacy. Annals of Internal Medicine 130(June 15):995-1004.

Goodman, S.N. 2008. A dirty dozen: Twelve P-value misconceptions. Seminars in Hematology 45:135-140. doi:10.1053/j.seminhematol.2008.04.003.

Hennekens, C.H., and D. DeMets. 2009. The need for large-scale randomized evidence without undue emphasis on small trials, meta-analyses, or subgroup analyses. Journal of the American Medical Association 302(Dec. 2):2361-2362.

Hubbard, R., and J. Scott Armstrong. 2006. Why we don’t really know what ‘statistical significance’ means: A major educational failure. Journal of Marketing Education 28(August):114-120.

Hubbard, R., and R. Murray Lindsay. 2008. Why P values are not a useful measure of evidence in statistical significance testing. Theory & Psychology 18:69-88.

Ioannidis, J.P.A. 2005. Why most published research findings are false. PLoS Medicine 2(August):0101-0106.

Kent, D., and R. Hayward. 2007. When averages hide individual differences in clinical trials. American Scientist 95(January-February):60. doi:10.1511/2007.63.1016.

Morgan, T.H., et al. 2007. Nonvalidation of reported genetic risk factors for acute coronary syndrome in a large-scale replication study. Journal of the American Medical Association 297(April 11):1551-1561.

Nuzzo, R. 2008. Nabbing suspicious SNPs. Science News 173(June 21):20-24.

Psaty, B.M., and C.D. Furberg. 2007. Rosiglitazone and cardiovascular risk. New England Journal of Medicine 356:2522-2524.

Stephens, P.A., S.W. Buskirk, and C. Martínez del Rio. 2007. Inference in ecology and evolution. Trends in Ecology and Evolution 22(April 1):192-197.

Stroup, T.S., et al. 2006. Clinical trials for antipsychotic drugs: Design conventions, dilemmas and innovations. Nature Reviews Drug Discovery 5(February):133-146.

Sullivan, P.F. 2006. Spurious genetic associations. Biological Psychiatry 61:1121-1126. doi:10.1016/j.biopsych.2006.11.010.

Wacholder, S., et al. 2004. Assessing the probability that a positive report is false: An approach for molecular epidemiology studies. Journal of the National Cancer Institute 96(March 17):434-442.

Howson, C., and P. Urbach. 2006. Scientific Reasoning: The Bayesian Approach. Third Edition. Chicago: Open Court.

Salsburg, D. 2001. The Lady Tasting Tea. New York: W.H. Freeman.

Ziliak, S.T., and D. McCloskey. 2008. The Cult of Statistical Significance. Ann Arbor: University of Michigan Press.

Bower, B. 1997. Null science: Psychology’s statistical status quo draws fire. Science News 151:356-357.


Comment

• Great article! Non-bashing and a fairly stated scientific editorial.
Conceivably, "Scientific calculations" may be applied to most any genre - from bean counting to space travel. Curious. The scientific human mind cannot create a computer to generate mathematical equations beyond "Infinity." Reminds me of an educated, yet insular speaker finishing a long-winded technical lecture with "etcetera etcetera".

So, must science conceed there may be an intelligent designer?

TMUCKY
Mar. 13, 2010 at 4:55am
• (concede(sp) see - that's my point;)
TMUCKY
Mar. 13, 2010 at 5:02am
• Excellent discussion on statistics.

Box 4 is particularly useful. I think the discussion in box 4 clarifies that the Bayesian approach produces more sensible answers and is not *really* subjective, at least in that example. It just depends on an additional measurement of the prevalence of the association you are to measure. (I wonder if it is OK to use the same data set to measure the association as is used to do the Bayesian analysis.)
Mark Lindeman
Mar. 14, 2010 at 6:29am
• I'm afraid I almost didn't read this article with the start being rather inflammatory, calling statistics a 'mutant form of math' and implying that the transition from pure mathematical (classical) approaches which cannot deal with reality to the use of statistics has been to the detriment of science. It has in fact been the opposite in spite of the widespread failings indicated.

Much of the failing of statistical analysis in science is the result of lack of understanding of the philosophical aspects of statistical techniques and analysis, with researchers simply applying rote mathematical procedures they saw in their classes (often not taught by an expert in statistics) or in a prior researcher's journal article (possibly even worse). Suggesting that Bayesian statistics is a panacea for what ails the use of statistics in science is dangerous. Bayesian approaches are generally even more sophisticated philosophically and mathematically than classical ones, and prone to even worse misuses. For example, they generally start with the use of a prior, which the naive researcher who does not understand the process can easily, mistakenly adjust to make their results work.

That being said, I liked the overall presentation and agree that scientists and we statisticians as well need to give more time and consideration to Bayesian techniques. The availability of compute power to effectively use these techniques makes it imperative that practitioners get the requisite training to be sure that their results are not just the result of rote application of formula (computer algorithm) and/or computer tinkering until the result is publishable.
ZeitlerD
Mar. 14, 2010 at 1:12pm
This is why the U.S., World Health Org., and Bill Gates should not be pushing male circumcision on the world to prevent HIV infection based on statistics from three African studies. Compounding these flawed studies, it's been reported that 1 in 5 HIV infections in Africa are caused by the medical profession. Write to the AAP to not support these studies with their new circumcision guidelines.
Frank McGinness
Mar. 14, 2010 at 1:12pm
• Thank You. I always knew something was wrong.
Paul Etzler
Mar. 14, 2010 at 3:50pm
• Two points:
1) I think the condom issue may be a reflection of "moral hazard", more dangerous overall behavior based on the illusion of perfect protection. Similar to the tendency of drivers to push cars with traction control etc. to the edge of instability, removing the benefit of the technology.
2) Positive results bias afflicts the statistical pool, too; "all studies, published and unpublished" includes even those that the researchers didn't bother to write up and submit because nothing much was found. Many "null results" die unmarked deaths, which means that the likelihood of a positive result actually being that 1 out of 20 or 100 arising by chance is actually much higher. I'd guesstimate an anti-fudge factor of at least 5 would be needed to compensate for this.
Brian Hall
Mar. 15, 2010 at 3:34am
• I enjoyed this article. I work in health psychology and I deal with statistical issues on a weekly basis. One point that I would like to mention is the necessity of test-retest reliability. Under few circumstances does a single publication represent a large scope of scientific work. However statistically significant any social construct may be, this significance must be repeated in order to gain scientific clout.

I agree that scientists rely too heavily on statistics in the same way they rely too heavily on single publications. Using several subject pools to test the same phenomena and getting the same result demonstrates an argument for significance. The Bayes method hits on this but does give me the subjective creeps. You can achieve the same result with traditional statistical methods as long as at some point test-retest reliability is considered.

Thanks again for the article. The site is bookmarked.
Micajah Spoden
Mar. 15, 2010 at 12:25pm
• I agree with ZeitlerD.

The peer-review system breaks down when a non-negligible proportion of peers misunderstand - or neglect proper attention to - what's being refereed.

Those engaging in scientific enquiry (and editorial boards) would be better served by consulting professional statisticians earlier and more often in the process.

Liked the article.
G. Jay Kerns
Mar. 15, 2010 at 3:17pm

I'd go with that.
slre
Mar. 16, 2010 at 5:55am
• Wow! A story from the future. How did that happen? (Home / March 27th, 2010; Vol.177 #7 / Feature [today is March 17, 2010]) Maybe this is why statistics works in the 1st place. The answer is from the future where the result has already occured.
John Compton
Mar. 17, 2010 at 12:40pm
• This article has the flavor of science and math bashing. Box 4, for example, which a poster above lauded, is very standard material. We teach it to freshmen. It is just an illustration of false positives. Anyone worthy of the name "scientist" knows such stuff thoroughly.

Mistakes in reasoning like that certainly do come up. There are several famous examples involving lawsuits and criminal trials where the jury was bamboozled into the wrong verdict by a clever lawyer. Sometimes neither the lawyers nor the judge realize that that kind of false statistical "reasoning" is being used. I seem to recall a recent famous case in Denmark.

Saying that science is flawed because it uses statistics is nonsense.

Robert H. Lewis
Mathematics Department
Fordham University
Robert Lewis
Mar. 17, 2010 at 1:16pm
• As a statistician my only disagreement with the article is the rather credulous suggestion in the first paragraph that we had some better system in the past. In reality we had no way of dealing with systems with any appreciable degree of inherent variability. Movements of the stars, fine, but effectiveness of medicines, not a hope! Looked at from this perspective what we have now is progress. But I guess this would not be great journalism.
Jim Slattery
Mar. 17, 2010 at 2:02pm
• I really enjoyed your explanation of Bayesian statistics. However, I quibble over your example in Box 4. You assume the false negative rate is also 5% when it may be higher or lower.
R Halsey
Mar. 17, 2010 at 3:04pm
• TERRIBLE ARTICLE
PART OF THE PROBLEM, not the solution
the problem is that it is too complicated
so instead of long complex articles about the problem, simple, simple solutions of what to do
Really, I shouldn't have to think about it - there should be an interactive form, I enter the type of experiment, and it tells me what test to use and what they mean.
You don't expect the avg scientist to understand how to build a stepper motor every time he or she wants to have a moving part in the lab, do you?
ezra abrams
Mar. 18, 2010 at 8:57am
• This article is awful. I can't see why the editors would let this garbage out on to the website. Statistics is not the issue, education is. If this is the kind of thing Science News is going to start publishing, I think my paper subscription is unlikely to be renewed.
Richard Minerich
Mar. 18, 2010 at 9:43am
• I agree with the other commentators who lament the identification of statistics as the culprit.

Yes arguments can be made that published research is a biased view of the sum total evidence collected by science, and yes, null hypothesis testing is oft misunderstood and applied when more subtle approaches are more appropriate. But to call statistics itself the culprit is frankly irresponsible science reporting.

Statistics is the study of uncertainty and the quantification of such. Period. In a world rife with unaccounted-for variance, science cannot operate without analysis of uncertainty.

The problems noted in the article are problems in the organization of scientific practice and statistical education. The "published results are likely false" issue comes from primarily the medical literature, where there is an enhanced publication bias driven by financial interests. That's not to say that other domains don't suffer from "publish what's statistically significant and sexy" syndrome, contributing to bias, but the solution to this doesn't lie in statistics, but in the culture of science and scientific publishing.

Fixes to the problems noted in the article include:
- Experiments need to be pre-registered if researchers want to publish them later (as is now required in much of medicine thanks to the fallout from recent exposures of bias).
- Published data needs to be made freely and immediately available in public repositories.
- Conclusions should be made through null hypothesis testing only when one-time decisions are necessary, which is arguably never, but more reasonably any time the cost of replication or ethical considerations are prohibitive.
- Replication should become more widespread in science (I have the idea that all undergraduate theses should be replications only, contributing to a public database of replicated results).
- Scientists generally should adopt the model spreading through physics whereby research is conducted by teams of specialists, including at least one trained statistician (not a non-statistician who took a couple stats courses during their graduate training).
Mike Lawrence
Mar. 18, 2010 at 11:45am
• The example in box 4 illustrates yet another problem. It concludes with "So if any single player’s test is positive, the chances that he really is a user are 50 percent, since an equal number of users and nonusers test positive." A single player is not a random variable. In the statistics class that I teach, I flip a coin onto the floor and ask my students "What is the probability that I got heads?" They usually answer "50%" So I look down at the coin. If it is heads, I answer back "No, it is 100%" Probabilities should be applied to experimental procedures, not individual outcomes.
mark sullivan
Mar. 18, 2010 at 12:44pm
• "For better or for worse, science has long been married to mathematics." -- Mathematics *is* science. In case they taught you differently at school, they were wrong. Math is built into every other science; that is why it is the "mother" of all sciences. Wikipedia says: "Science is, in its broadest sense, any systematic knowledge-base or prescriptive practice that is capable of resulting in a correct prediction, or reliably-predictable type of outcome." There is no science without outcome prediction and there is no prediction capability without abstract modeling, a.k.a. mathematics.
G M
Mar. 18, 2010 at 1:37pm
• I really enjoyed the article. One thing that I would like to add: there is a reason why there are separate math and statistics departments in most major universities- statistics is NOT a branch of mathematics! For sure, statistics uses math, but after many attempts to axiomatize the major results in statistics, it was finally proved that it can’t be done. In contrast, probability theory is a branch of mathematics (see the Kolmogorov axioms). Personally, I prefer to think of stats as quantitative philosophy of science; like any other branch of philosophy, it is subjective and constantly evolving.
PS
Mar. 18, 2010 at 3:23pm
• @Peter Schotland: I'd never heard it has been proven statistics won't ever be able to be axiomatized ... would love a reference to more details please. Also, your comment is a bit sweeping: you must be talking about *frequentist* statistics here, since Bayesian statistics (incl. Bayesian inference) most certainly does have an axiomatic foundation (see JM Bernardo ... Google for bernardo bayesian, first link; my URLs are being removed).
L A
Mar. 18, 2010 at 5:53pm
• Siegfried's title is misleading: it is the "shortcomings" of scientists and statisticians that have been the problem, not any of the statistical tools properly used. Three recent articles documenting that claim (and critiquing some of the very people Siegfried quotes) are:

Lombardi, C.M. and S.H. Hurlbert, 2009. Misprescription and misuse of one-tailed tests. Austral Ecology 34:447-468.

Hurlbert, S.H. and C.M. Lombardi. 2009. Final collapse of the Neyman-Pearson decision theoretic framework and rise of the neoFisherian. Annales Zoologici Fennici 46:311-349.

Hurlbert, S.H. 2009. The ancient black art and transdisciplinary extent of pseudoreplication. Journal of Comparative Psychology 123:434-443.

Pdfs of all can be found by googling Stuart H. Hurlbert and going to the publication list on my website.
Stuart Hurlbert
Mar. 18, 2010 at 7:05pm
• IMO The field of statistics DOES deserve much of the blame here.

I realize that every professional who teaches classical statistics is very scrupulous in giving the warnings. To hit just some of the highlights:

- "statistical significance" does not imply significance as we use the word in normal discourse, even in formal scientific discourse. That "statistical" qualifier - never lose or ignore that; it totally alters the meaning!
- "p-value", well it's a faintly interesting value about weight of evidence that carries a lot of challenges in interpretation. But no matter what, remember that "p" is just a letter in the alphabet, this is nothing to do with the "p"robability of the hypothesis being wrong. Unfortunate clash of letters, that, how sad.
- "maximum-likelihood estimate" is not the "most likely" value for the estimated quantity; whatever would make you think that? Please pay closer attention to the grammar here.
- A "95% confidence interval" does not contain the true value with probability 95%. You can be 95% "confident" in it in some other very technical sense of confident, but it's not a probability. Indeed, you may sometimes see 95% confidence intervals that obviously cannot possibly (i.e. with probability 0) contain the true value, but this is A-OK, and if this situation confuses you, you just need to go back and re-read the definitions.
- "Rejecting the null hypothesis" doesn't mean you've disproved it or (heaven-forbid!) should act as if you know it's false. "Rejecting", well, it's a technical term here.
- Regarding an "unbiased estimate": well, bias is used very technically here, don't read too much into it; in particular, you shouldn't necessarily assume a biased estimate is a bad one, and indeed in many cases a suitably-chosen biased estimate will turn out superior, for any conceivable practical purpose, to the best unbiased estimate on offer.

Etc. ... And thus the conscience remains clear. And yet they know, from incessant observation, that in practice the rest of the world WILL and DOES lose these subtle warnings. "What a shame, it's their fault for not paying closer attention in class when I warned them against this very confusion. Still, if they confront me _directly_ with such a misunderstanding you can be sure I will recite the necessary formal correction; professionalism requires no less!" How about this, statisticians?: DON'T USE CALCULATEDLY MISLEADING WORDS IN THE FIRST PLACE, CONVENIENTLY CHOSEN TO INFLATE THE REAL-WORLD USEFULNESS OF THE TOOLS YOU OFFER!
a j
Mar. 18, 2010 at 8:40pm
• Blaming statistics for misused statistics is like blaming medical science because of incompetent doctors. And suggesting Bayesian methods will make things better is like suggesting homeopathy should replace medicine.
Larry Wasserman
Mar. 20, 2010 at 9:03am
• Nice article. I am a bit surprised by Dr. Wasserman's comment. If this is indeed the Larry Wasserman from statistics, he knows full well that p-values are suboptimal in terms of evidentiary value; he also would agree that they do not correspond to what researchers want to know.

When I teach a course on statistics (to about 300 second-year psychology students, all of whom have had several courses in introductory stats already) I start by waving a 20 euro bill around, telling them "if you can tell me exactly what a p-value is you get 20 euros". Result: I always get to keep my money, because they have no clue what a p-value is.

It is easy to blame the students for their ignorance, but I think the problem runs deeper. The p-value is simply not something researchers are interested in when they do experiments. Researchers do not want to know about the probability of obtaining a test statistic at least as extreme as the one that they have, given that the null hypothesis is true. Instead, they want to attach probabilities to hypotheses. This is very natural but frequentists statistics does not allow one to do this.

Comparing Bayesian methods to homeopathy is insulting, and frankly worthy of an apology to the growing community of researchers that do use Bayesian methods.
Eric-Jan Wagenmakers
Mar. 20, 2010 at 11:21am
• I don't want to defend p-values, which are indeed over-used as well as abused. But often an easy fix is to use a confidence interval instead. I was taking issue with the tone of the article, which seemed to criticize the whole field of statistics. My comment about homeopathy was meant to be a joke. I didn't mean to offend anyone.
Larry Wasserman
Mar. 20, 2010 at 3:02pm
• good job!..
murat ulug
Mar. 20, 2010 at 7:42pm
• I must be making some sort of elementary logic error below, so please point it out if you see it because it's bothering me that I can't find it.

For a given result (call it "R"), a p value of 0.05 means that there is a 5% chance that, IF the null hypothesis is true (we'll call that statement "N"), then you will get R. so put succinctly, there is a 5% chance that N-->R.

That's equivalent to saying there is a 95% chance that N-->~R (~ means "not"). Now, to invoke the contrapositive, N-->~R is logically equivalent to saying R-->~N. So there is a 95% chance that R-->~N.

Put back in English, a p value of 0.05 means there is a 95% chance that if R, then the null hypothesis is not true. So if you get R, you can state that there is a 95% chance that the null hypothesis is not true, which is the same as saying that there is a 95% chance that the effect is real. But as I understand it (and according to the article), that's exactly what the p value does NOT mean. So where did I screw up?
Yosemite G
Mar. 20, 2010 at 9:59pm
• In a nutshell, the author reminds everyone to be careful when interpreting p-values, that effect size is just as important to consider as the statistical significance of the effect, and that replication should be done more often. All good points. Now, if the author can get over his need to bash all of statistics, he might be more credible and clear when making these good points.
Sal Danori
Mar. 20, 2010 at 10:05pm
• Yosemite, well, there's many things going on here. First, you start off with a probability of the data given a hypothesis, but you end up with the reverse, a probability for a hypothesis given the data. The only way this can happen is by using the prior probability of the hypothesis via Bayes rule.

Second, it seems weird to a Bayesian to want to test the null hypothesis in isolation -- how likely the null is depends crucially on the plausibility of the alternative hypothesis in the context of a scientific process. When the null says: "you were just guessing", and the alternative says "you had some knowledge", it matters whether we are considering (a) your recent performance on a test of logic or (b) your ability to repeatedly predict whether a fair coin will land heads or tails.
Eric-Jan Wagenmakers
Mar. 21, 2010 at 6:15am
• I suggest going over to the blog run by Lubos Motl.
Larry Wasserman
Mar. 21, 2010 at 8:35am
• IMHO this article is sensationalist. Responses by better informed commentators are found in blog posts "It's wrong is wrong" by Kaiser Fung and "Defending statistical methods" by Luboš Motl.
Rafael Irizarry
Mar. 21, 2010 at 8:57am
• Andrew Gelman also blogs about this. He says "I agree with most of what Siegfried wrote." What an interesting mix of opinions!
Eric-Jan Wagenmakers
Mar. 21, 2010 at 10:29am
• I have always wondered about the "size" of the placebo effect, which is a well documented phenomenon. It would seem to require giving one (randomly assigned) control group nothing without them knowing they are in a control group, which might be ethically problematic? Has anyone studied this? How was it done?
John Stewart
Mar. 21, 2010 at 5:46pm
• Here is a doctor who has had some fun skewing us all while making plenty of money. Researcher Dr. Scott Reuben admits to faking dozens of research studies for Pfizer, Merck
February 19, 2010.
Paul Blake, N.D.
Mar. 21, 2010 at 10:52pm
• "Still, any single scientific study alone is quite likely to be incorrect, thanks largely to the fact that the standard statistical system for drawing conclusions is, in essence, illogical." That's an extraordinary claim.
Atticus Finch
Mar. 22, 2010 at 12:56am

After the intro, the content is fine. With the possible exception of the Bayesian material, this is all standard information offered in the average undergraduate statistics course.

But the title and introduction to this piece are inflammatory and irresponsible, offering an altogether misleading summary of the contents of the article itself and, most importantly, feeding into the anti-science agenda in the United States. Already in this comment section you can see people who have strong desires to engage in wholesale rejection of scientific findings taking comfort in the title, if not the substance, of this article.

Scientific reasoning is under siege in this country's political culture. The last thing we need is for science journalism to exacerbate the problem by sensationalizing the well understood limitations of statistical significance testing.

Christopher Anderson, PhD
Politics and Government Department
University of Hartford
Chris Anderson
Mar. 22, 2010 at 9:35am
• The author needs to read and understand his own article. In the second section ('Statistical insignificance') he notes that most apparent genetic associations are false positives but by the time he has got to section four ('Clinical Trials and Errors') he has completely forgotten to be sceptical about gene by treatment interaction and quotes, apparently with approval, "Reporting a single number gives the misleading impression that the treatment-effect is a property of the drug rather than of the interaction between the drug and the complex risk-benefit profile of a particular group of patients.”

Well which is it? Make your mind up. But you won't get very far in analysing the problem unless you understand something about statistics.

In fact, the evidence that treatment by gene interaction is important is very thin (with all those false positives out there, how does anybody know?) and the widespread delusion to the contrary cannot be laid at the door of statisticians and statistics. If the important, but sadly neglected statistical topic of components of variation were better understood, medical researchers would not go around making these claims. It takes carefully analysed trials involving adequate replication to establish the presence of an interaction and very few of these are run.

Stephen Senn, PhD
University of Glasgow
Stephen Senn
Mar. 24, 2010 at 4:44pm
• If a scientist is looking to find a functional relationship between suspected independent variables and a dependent variable, statistical inference is a blind alley. The only way to determine if there is a cause-effect relationship is to carefully manipulate an independent variable to see what effect it has, if any, on the dependent variable. Statistical inference will only lead to confirming explanatory fictions and just so stories, not anything that can be predicted, especially in the behavioral and social sciences. The more empirically based functional relations that are experimentally determined the more solid your scientific foundations will be.
Raymond Weitzman
Mar. 26, 2010 at 6:41pm
• Yosemite: To fill in the technical details, what exactly do you mean by the notation N-->R? It can mean either the probability of R given N, which is written as Pr( R | N ), or the probability of the proposition N implies R, written as Pr(N -> R). They have different meanings: Pr( R | N ) = Pr(N and R) / Pr(N) by the definition of conditional probability, while Pr(N -> R) = Pr((not N) or R) (only propositional calculus is used in this step, not probability).

If the first interpretation is used, then it is true that Pr( not R | N ) = 1 - Pr( R | N ), but then you can't do the contrapositive step. On the other hand, for the second interpretation, the negation of N -> R is, by De Morgan's law and some Boolean algebra, N and not R, which is not the same as N -> not R. In particular, Pr(N -> not R) = Pr((not N) or (N and not R)) = Pr(not N) + Pr(N and not R) (since they are mutually exclusive). I think the key here (in my opinion) is to distinguish between a given, say an observed result, which excludes cases in which the observation is falsified; and a logical proposition, which considers cases when the antecedent is true AND when the antecedent is false (in which case the conditional is considered vacuously true).
Tom Lam
Mar. 27, 2010 at 9:45am
• While this essay does draw attention to important issues of misinterpretation of hypothesis tests, I think that the author did not draw sufficient attention to the superiority of confidence intervals; the fact that he describes confidence intervals as based on the same math as hypothesis tests betrays a lack of understanding on his part, which seems particularly relevant in an opinion piece like this.

More important and disturbing is the claim, without any empirical support, that "[M]ost critics ... advocate the Bayesian approach...." Certainly *some* of the critics, in some fields, are Bayesians, but I have never seen any empirical work on the prevalence of Bayesian approaches among critics of contemporary statistical interpretation and practice, and my own observations of such critics lend no support to the "most" aspect of that claim. If empirical work exists, I'd be most interested to hear about it.
Michael Lacy
Mar. 27, 2010 at 10:10am
• Even though the article is flawed in some ways, it does draw attention to the need to put the science in front of the statistics. Statistics is a tool for better understanding uncertainty. Period.

The focus needs to be on the quality of studies and the weight of scientific evidence, and not statistical significance from single studies. Having said that, does the publication of this article by the Editor in Chief of Science News mean that this particular publication will stop (or be more careful when) writing doom-and-gloom stories about environmental contaminants based on single-study results? I doubt it.
Robert Peterson
Mar. 31, 2010 at 11:00am
• As a humanities guy, I was in over my head with the article, but I found it fascinating and wish to thank Tom Siegfried for publishing it.
Spider
Mar. 31, 2010 at 10:31pm
• Great Article. A statistics class should be required of all college graduates and perhaps high school.

It figures that Dr. Christopher Anderson, Politics and Government Department, University of Hartford would have a problem with this article. The social studies group is the one with the worst cases of misleading statistics due to their inability to handle simple math. Hence the reason they go into Politics and Government so that they can dictate economic policy on the rest of us, e.g. ObamaCare where the numbers do not add up.
John Mills
Apr. 2, 2010 at 11:55am
• Greetings Tom,

Our paper addresses some of the issues concerning the accuracy of p-values.
How accurate are the extremely small P-values used in genomic research: An evaluation of numerical libraries
Computational Statistics & Data Analysis
Volume 53, Issue 7, 15 May 2009, Pages 2446-2452
sai santosh bangalore
Apr. 7, 2010 at 11:39am
• I was reading this article with great interest, and with no idea that I had been quoted in it, and I am not above admitting that it was a thrill to see my name there. Overall, I thought that it was a great article. My only criticism is that it seems to suggest that the criticisms of standard practice are themselves in agreement. Almost like pulling together liberals and conservatives and saying that they agree (that change is needed). The change they are calling for is obviously not the same. Likewise here. I firmly believe that there is nothing wrong with hypothesis testing, even though many others who were quoted are against them. This may not be the appropriate forum for defending the humble p-value, but I will at least note that if p-values are widely misunderstood, then this is a criticism of education, and not of the p-value per se. And if 0.049 and 0.051 are not really all that different, then you are again arguing not against p-values per se, but rather against binary decisions. When decisions must be made, based on some quantity, then some cut-off is necessary. Finally, if you argue that p-values take on different meanings across different studies with different sample sizes, then I will remind you that test scores take on different meanings too, as we consider AP exams, the SAT, the GRE, the usual format of 0 to 100, and so on. So do we no longer grade tests in school? No, we simply recognize that they provide a ranking within the test, rather than across tests. So what is the problem?
Vance Berger
Apr. 9, 2010 at 12:49pm
• The p-value measures correlation, not necessarily cause and effect.
James Rice
Apr. 10, 2010 at 9:41pm
• I use mathematical probability in the way that I drive. I stay in the right-hand lane on highways, at no more than 60 mph, on the assumption that I will likely have fewer traffic tickets, fewer accidents, and better mileage. In four years, all three of these assumptions have continually proved correct. Also, my personal stress level seems much reduced. I recommend this behavior to all who read this. James R. Stewart Jr.
James Stewart
May. 20, 2010 at 1:08pm
• All science is based upon statistics. A scientific theory consists of a consistent mathematical framework along with empirical verification. The verification can be done only with statistics. To the extent that statistics is faulty, knowledge is faulty. It is critical that we properly understand statistics.
Sanford Aranoff
Jun. 15, 2010 at 1:30pm
• The models used by the IPCC to forecast changes in the world temperature are the most blatant example of the misuse of statistics in science. Those models have not been calibrated and cannot be calibrated. The results are completely meaningless. Nobody can forecast the average temperature of the world in even 10 years' time, and that concept is also meaningless. The world's environmental problems are indeed very serious, but the study of so-called global warming and measures designed to reduce it are fatuous.
peter senker
Mar. 2, 2011 at 3:16am
• The author seems to make an error in Box 4 (in relation to our discussion the other day about sensitivity and specificity):
"For a simplified example, consider the use of drug tests to detect cheaters in sports. Suppose the test for steroid use among baseball players is 95 percent accurate — that is, it correctly identifies actual steroid users 95 percent of the time, and misidentifies non-users as users 5 percent of the time."
The phrase "it correctly identifies actual steroid users 95 percent of the time" implies that the test is 95% sensitive, and the part "misidentifies non-users as users 5 percent of the time" implies that the Positive Predictive Value is 95% (100 - 5) in that particular sample. So the two parts say two different things. According to the latter phrase, if a person tests positive, the probability of that person being a drug user IS in fact 95%. However, his numbers are correct in the more elaborate example. He faults scientists for confusing things, but the author himself is confused in this instance.
Tharaka Dassanayake
Mar. 8, 2011 at 10:22pm
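[Editor's note: the distinction the comment above is drawing, between a test's sensitivity and its positive predictive value, can be checked with a short calculation. The Python sketch below applies Bayes' rule to a test with 95 percent sensitivity and a 5 percent false-positive rate; the 5 percent prevalence of steroid use is an assumed base rate chosen for illustration, not a figure from the article.]

```python
def positive_predictive_value(sensitivity, false_positive_rate, prevalence):
    """P(actual user | positive test), via Bayes' rule."""
    true_positives = sensitivity * prevalence            # users flagged
    false_positives = false_positive_rate * (1 - prevalence)  # non-users flagged
    return true_positives / (true_positives + false_positives)

# Assumed: 95% sensitivity, 5% false-positive rate, 5% of players use steroids.
ppv = positive_predictive_value(0.95, 0.05, 0.05)
print(f"PPV = {ppv:.3f}")  # -> PPV = 0.500
```

With those assumed numbers the positive predictive value comes out to 50 percent, not 95: the few real users flagged are matched one-for-one by false positives drawn from the much larger pool of non-users.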
• Odds Are, It's Wrong
Science fails to face the shortcomings of statistics: Comment.
This useful overview, among other things, gives statisticians a clear warning: scientists are increasingly dissatisfied with classical significance testing and p-values as tools for making decisions and evaluating evidence. The self-indulgent reaction of blaming the scientists in return, for their purported lack of training or presumed misunderstanding of statistical concepts, is not constructive.
The fact is that significance testing was designed to perform very specific comparisons in well-designed studies, in which, beforehand, the Type I error is controlled at a specified level alpha and the Type II error is then minimized. But the vast majority of studies do not conform to this standard, and even when individual studies do, merged studies no longer do. The fact is that fixing the Type I error regardless of the amount of evidence simply does not make sense, and by extension neither do p-values, since the Type II error is then completely out of control, with the possibility that the Type I error may be enormous compared with the Type II error.
There is a need for an alternative paradigm to: i) fixing the Type I error at alpha and minimizing the Type II error, or ii) calculating the p-value and interpreting it as the minimum alpha at which one would reject the null hypothesis.
I also agree with the author's final point that a solution is bound to lie in the direction of Bayesian procedures with good frequentist properties, that is, an approach that may reconcile the schools of statistics in a procedure better suited to the needs of science.
Luis Pericchi, Department of Mathematics and Center for Biostatistics and Bioinformatics, University of Puerto Rico, Rio Piedras Campus.
Luis Pericchi
Jan. 13, 2012 at 1:11pm
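[Editor's note: the comment's point that fixing alpha leaves the Type II error floating free can be illustrated with a short sketch. The one-sided z-test and the effect size of 0.2 standard deviations below are assumptions chosen for illustration; they do not come from the comment.]

```python
# Holding alpha fixed at 0.05 for a one-sided z-test, the Type II error
# (beta) swings from near 1 to near 0 as the sample size grows, even
# though the nominal Type I error never changes.
from statistics import NormalDist

def type_ii_error(n, effect=0.2, alpha=0.05):
    """beta for a one-sided z-test of a mean, known sd = 1."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)              # critical value for fixed alpha
    power = z.cdf(effect * n ** 0.5 - z_alpha)  # P(reject | effect is real)
    return 1 - power                            # Type II error

for n in (10, 100, 1000):
    print(f"n={n:5d}  beta={type_ii_error(n):.3f}")
```

At n = 10 the Type II error dwarfs the fixed 5 percent Type I error, while at n = 1000 it is essentially zero and the 5 percent Type I error is, in Pericchi's terms, enormous by comparison: the same alpha carries very different evidential weight at different sample sizes.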