Article Summary: Most statisticians caution against using statistical significance as a method of making policy or business decisions [link]. One reason is that p-values are notoriously difficult to interpret, even for PhD-level researchers. This article outlines some of the common misinterpretations of p-values.

Quick guide


Gerd Gigerenzer called this procedure the “null ritual” [1], describing it in three steps:

  1. Set up a statistical null hypothesis of “no mean difference” or “zero correlation.” Don’t specify the predictions of your research hypothesis or of any alternative substantive hypotheses.

  2. Use 5% as a convention for rejecting the null. If significant, accept your research hypothesis. Report the result as p < 0.05, p < 0.01, or p < 0.001 (whichever comes next to the obtained p-value).

  3. Always perform this procedure.

The null ritual does not exist in statistics proper. This point is not always understood; even its critics sometimes confuse it with Fisher’s theory of null hypothesis testing and call it “null-hypothesis significance testing.” In fact, the ritual is an incoherent mishmash of ideas from Fisher on the one hand and Neyman and Pearson on the other, spiked with a characteristically novel contribution: the elimination of researchers’ judgment.
— Gerd Gigerenzer

Fallacy 1: The p-value is the probability of the null hypothesis

Many people mistakenly believe that the p-value is the probability of the null hypothesis being true. Under this misreading, a p-value of 0.05 (traditionally considered the cutoff for statistical significance) would mean there is only a 5% probability that the null hypothesis is true, a 1 in 20 chance. This is not a correct interpretation of the p-value.

In fact, the p-value assumes the null hypothesis is true and indicates the degree to which the data conform to the null hypothesis and all its underlying assumptions (like the statistical model being used) [1]. Here the term “data” means a test statistic, like a t-statistic. This fallacy has things exactly backward: the p-value is the probability of the data given the null hypothesis, not the probability of the null hypothesis given the data.
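One way to make the direction of that conditioning concrete is a small permutation-test simulation. The sketch below uses made-up numbers (the group sizes, means, and metric are assumptions for illustration, not from the article): shuffling the group labels enforces a null world of no difference, and the p-value is simply the fraction of those null-world shuffles that produce a statistic at least as extreme as the one observed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up A/B test data (illustrative only): some metric for control and treatment.
control = rng.normal(loc=10.0, scale=2.0, size=500)
treatment = rng.normal(loc=10.3, scale=2.0, size=500)
observed_diff = treatment.mean() - control.mean()

# Shuffling the labels enforces the null hypothesis of "no difference",
# so everything below is computed assuming the null is true.
pooled = np.concatenate([control, treatment])
n_control = len(control)
null_diffs = []
for _ in range(20_000):
    rng.shuffle(pooled)
    null_diffs.append(pooled[n_control:].mean() - pooled[:n_control].mean())
null_diffs = np.array(null_diffs)

# Two-sided p-value: the share of null-world outcomes at least as extreme as the observed one.
p_value = np.mean(np.abs(null_diffs) >= abs(observed_diff))
print(f"observed difference: {observed_diff:.3f}, p-value: {p_value:.3f}")
# This is P(data this extreme | null), not P(null | data).
```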

The probability that the null hypothesis is true given a statistically significant result is called the false positive risk, and it is typically much larger than the p-value [2]. Even a statistically significant p-value can carry a false positive risk of more than 25% [3]. Note, however, that any method for determining the probability of a hypothesis requires some form of Bayesian statistics, in which prior probabilities for the hypotheses must be assumed before the calculation.
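To see how the false positive risk can dwarf the p-value, here is a rough sketch of the standard calculation. All inputs (a 10% prior probability that a real effect exists, 80% power, a 0.05 significance threshold) are assumptions chosen for illustration, not figures from the article.

```python
# Illustrative false-positive-risk calculation; all inputs are assumed for illustration.
prior_real_effect = 0.10  # assumed prior probability that a tested effect is real
power = 0.80              # assumed probability that a real effect yields p < alpha
alpha = 0.05              # significance threshold

# Among all experiments that come out "significant":
significant_true_nulls = alpha * (1 - prior_real_effect)  # true nulls flagged by chance
significant_real_effects = power * prior_real_effect      # real effects correctly flagged

false_positive_risk = significant_true_nulls / (significant_true_nulls + significant_real_effects)
print(f"false positive risk ≈ {false_positive_risk:.0%}")  # ≈ 36%, far above the 5% many expect
```

With a more optimistic assumed prior (say a 50% chance of a real effect), the same arithmetic gives a risk of about 6%, which is why the prior has to be stated explicitly rather than left implicit.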

A related fallacy is that the p-value is the probability that chance alone produced the observed data. This is also incorrect and again has things backward: the p-value is computed assuming that chance alone was operating (that is, assuming the null hypothesis and the rest of the model are true), so it cannot also be the probability that this assumption holds [4].

Fallacy 2: A non-significant p-value indicates no effect between the treatment and control

This fallacy is one basis for the current interpretation of statistical significance, a concept most statisticians do not support [5]. A p-value greater than 0.05 simply indicates that the data are not unusual under the assumptions of the model (using a p-value less than 0.05 as the definition of “unusual,” itself an arbitrary cutoff). Those assumptions include that the null hypothesis is true and that only random error is operating on the data.
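A quick simulation makes the point concrete. The sketch below runs many underpowered experiments in which a real effect exists; the effect size and sample size are assumptions chosen for illustration, not values from the article.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n_experiments = 5_000
n_per_group = 30     # assumed (deliberately small) sample size per group
true_effect = 0.3    # assumed real, non-zero difference between groups

non_significant = 0
for _ in range(n_experiments):
    control = rng.normal(0.0, 1.0, n_per_group)
    treatment = rng.normal(true_effect, 1.0, n_per_group)
    result = stats.ttest_ind(treatment, control)
    if result.pvalue > 0.05:
        non_significant += 1

# Most runs are "non-significant" even though the effect is real,
# so p > 0.05 cannot be read as evidence of "no effect".
print(f"{non_significant / n_experiments:.0%} of simulated experiments had p > 0.05")
```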

It’s important to remember that:

  1. Only a p-value of 1 indicates the null hypothesis is the hypothesis most consistent with the data. However, even when the p-value is 1, many other hypotheses are just as consistent with the data. A determination of no difference between groups cannot be made from a p-value, regardless of its size [6].

  2. The observed point estimate for a given experiment is always the effect size most compatible with the data, regardless of statistical significance [7]. This again means that unless the point estimate is exactly zero (indicating no difference between groups), the null hypothesis is not the hypothesis most compatible with the data. (The null hypothesis does not have to be defined as no difference between groups, but in A/B testing and many other applications a null of no difference is the most common choice.)

  3. The p-value represents a tail probability, so it includes outcomes more extreme than the one observed. For example, a p-value of 0.3 means that, assuming the null hypothesis is true and only random error is operating on the data, an outcome at least as extreme as the one observed would be seen 30% of the time [8, 9] (see the sketch after this list). This should help explain why a bright-line 0.05 p-value threshold is not as useful as one might assume.

  4. The p-value assumes that the null hypothesis, and every other assumption of the statistical model, is true. It therefore cannot by itself provide evidence that the null hypothesis of no effect is true.
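As a small numerical check on point 3 above: under a normal approximation for the test statistic (an assumption of this sketch, not something stated in the article), a two-sided p-value of roughly 0.3 corresponds to an observed z-statistic of about 1.04, meaning 30% of the null distribution lies at least that far from zero.

```python
from scipy import stats

observed_z = 1.04  # assumed observed test statistic, chosen so the p-value lands near 0.3

# Two-sided tail area under the null distribution: outcomes at least as extreme as observed.
p_two_sided = 2 * stats.norm.sf(abs(observed_z))
print(f"two-sided p-value ≈ {p_two_sided:.2f}")  # ≈ 0.30

# Reading it forward: assuming the null hypothesis and only random error,
# a statistic at least this extreme would appear about 30% of the time.
```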

Fallacy 2 is extremely common, even among PhD researchers [10].


Even researchers believe these fallacies

| Survey | Time period | Number of articles examined | Percentage of articles with errors | Definition of "error" |
| --- | --- | --- | --- | --- |
| Schatz P, Jay KA, McComb J, McLaughlin JR (2005). Misuse of statistical tests in Archives of Clinical Neuropsychology publications. Archives of Clinical Neuropsychology 20:1053-1059 | 2001-2004 | 170 | 48% (81 articles) | “using statistical tests to confirm the null, that there is no difference between groups.” (Page 1054) |
| Fidler F, Burgman MA, Cumming G, Buttrose R, Thomason N (2006). Impact of criticism of null hypothesis significance testing on statistical reporting practices in conservation biology. Conservation Biology 20:1539-1544 | 2005 | 100 | 42% (42 articles) | “statistically nonsignificant results were interpreted as evidence of ‘no effect’ or ‘no relationship’” (Page 1542) |
| Hoekstra R, Finch S, Kiers HAL, Johnson A (2006). Probability as certainty: dichotomous thinking and the misuse of p values. Psychonomic Bulletin & Review 13:1033-1037 | 2002-2004 | 259 | 56% (145 articles) | “Phrases such as ‘there is no effect,’ ‘there was no evidence for’ (combined with an effect in the expected direction), ‘the nonexistence of the effect,’ ‘no effect was found,’ ‘are equally affected,’ ‘there was no main effect,’ ‘A and B did not differ,’ or ‘the significance test reveals that there is no difference’” (Pages 1034-1035) |
| Bernardi F, Chakhaia L, Leopold L (2017). ‘Sing me a song with social significance’: the (mis)use of statistical significance testing in European sociological research. European Sociological Review 33:1-15 | 2010-2014 | 262 | 51% (134 articles) | “authors mechanically equate a statistically insignificant effect with a zero effect.” (Page 2) |
Reproduced from Valentin Amrhein et al., "Supplementary information to: Retire statistical significance", Nature, 2019

References and theoretical basis

  1. Greenland et al., “Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations”, European Journal of Epidemiology, 2016 [link]

  2. “False positive risk”, The Research, 2019 [link]

  3. “False positive risk”, The Research, 2019 [link]

  4. Greenland et al., “Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations”, European Journal of Epidemiology, 2016 [link]

  5. “Statisticians hate statistical significance”, The Research, 2019 [link]

  6. Greenland et al., “Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations” (Misinterpretations #4, #5, #6), European Journal of Epidemiology, 2016 [link]

  7. Steven Goodman, “A Dirty Dozen: Twelve P-Value Misconceptions” (Misconception #2), Seminars in Hematology, 2008

  8. Steven Goodman, “A Dirty Dozen: Twelve P-Value Misconceptions” (Introduction), Seminars in Hematology, 2008

  9. Greenland et al., “Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations” (Misinterpretations #4, #9), European Journal of Epidemiology, 2016 [link]

Fallacy 3: Significant and nonsignificant results are contradictory

Reproduced from Gigerenzer, "Mindless statistics", Journal of Socio-Economics, 2004