Article summary: Contrary to popular belief, null hypothesis significance testing (NHST) is not generally recognized as a coherent and robust method of weighing evidence or making decisions. In fact, statisticians have been criticizing the use of statistical significance for decades. Their critiques stem largely from the misuse and misunderstanding of p-values (statistical significance is declared when a p-value falls below a pre-defined threshold, traditionally 0.05). While some statisticians have faith that p-values themselves can be rehabilitated by better educating non-statisticians on their proper use, others believe p-values must be abandoned altogether. Although statisticians don’t agree on the proper place for p-values within science or on the best approach to replace NHST, they almost universally agree that NHST has no place in the scientific enterprise. Over time, researchers outside of statistics have joined in the criticism. We’ve collected dozens of excerpts from academic journal articles to demonstrate this surprising point.
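To make the ritual being criticized concrete, here is a minimal sketch of it in Python. The data are simulated and the two-sample t-test, group sizes, and effect size are illustrative assumptions, not anything drawn from the articles quoted below; the point is only to show the mechanical "compare p to 0.05" step.

```python
# Minimal illustration of the NHST ritual: compute a p-value for a
# two-sample comparison and check it against the conventional 0.05
# threshold. Simulated data; all numbers here are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=30)  # "control" sample
group_b = rng.normal(loc=0.5, scale=1.0, size=30)  # "treatment" sample

t_stat, p_value = stats.ttest_ind(group_a, group_b)
significant = p_value < 0.05  # the dichotomous label the excerpts criticize
print(f"p = {p_value:.3f}, labeled 'significant': {significant}")
```

The excerpts that follow argue that this final yes/no step, not the arithmetic itself, is where the trouble lies.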

How we got to this point is another matter. For one theory see the first section of “Statistical Rituals: The Replication Delusion and How We Got There” by Gerd Gigerenzer [link].


No p-value can reveal the plausibility, presence, truth, or importance of an association or effect. Therefore, a label of statistical significance does not mean or imply that an association or effect is highly probable, real, true, or important. Nor does a label of statistical nonsignificance lead to the association or effect being improbable, absent, false, or unimportant.
— Ronald Wasserstein, Allen Schirm, & Nicole Lazar

Reference: “Moving to a World Beyond ‘p < 0.05’” [link], The American Statistician. Ronald Wasserstein, Executive Director of The American Statistical Association [link]; Allen Schirm, Vice President and Director of Human Services Research at Mathematica Policy Research (retired) [link]; & Nicole Lazar, Professor of Statistics at the University of Georgia and President-Elect of the Caucus for Women in Statistics [link]. This article was part of The American Statistician’s March 2019 special edition, “Statistical Inference in the 21st Century: A World Beyond p < 0.05” [link].


We...call for the entire concept of statistical significance to be abandoned.
— Valentin Amrhein, Sander Greenland, & Blake McShane on behalf of more than 800 signatories

Reference: “Scientists rise up against statistical significance” [link], Nature. Valentin Amrhein, Professor of Zoology at the University of Basel [link]; Sander Greenland, Professor Emeritus at the UCLA Fielding School of Public Health [link]; & Blake McShane, Associate Professor of Marketing at Northwestern’s Kellogg School of Management [link]. The full list of 854 scientists from 52 countries signing on to “Retire statistical significance” can be found here: [link].


Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold...by itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
— Ronald Wasserstein & Nicole Lazar on behalf of the American Statistical Association

Reference: “The American Statistical Association’s Statement on p-values: Context, Process, and Purpose” [link], The American Statistician, by Ronald Wasserstein, Executive Director of The American Statistical Association [link] & Nicole Lazar, Professor of Statistics at the University of Georgia and President-Elect of the Caucus for Women in Statistics [link].


The process of turning data into insight is central to the scientific enterprise. It is therefore remarkable that the most widely used approach—null hypothesis significance testing (NHST)—has been subjected to devastating criticism for so long to so little effect.
— Robert Matthews

Reference: “Moving Towards the Post p < 0.05 Era via the Analysis of Credibility” [link], The American Statistician. Robert Matthews, Professor of Mathematics at Aston University [link]. This article was part of The American Statistician’s March 2019 special edition, “Statistical Inference in the 21st Century: A World Beyond p < 0.05” [link].


The most important task before us in developing statistical science is to demolish the P-value culture, which has taken root to a frightening extent in many areas of both pure and applied science, and technology.
— J.A. Nelder

Reference: “Statistics to statistical science” (1999), Journal of the Royal Statistical Society [link]. John Nelder (deceased), Visiting Professor at Imperial College London and Fellow of the Royal Society [link].


My personal view is that p-values should be relegated to the scrap heap and not considered by those who wish to think and act coherently.
— Dennis Lindley

Reference: Bayesian Statistics 6: Proceedings of the Sixth Valencia International Meeting [link]. From a section by Dennis Lindley (deceased), Professor at University College London and Fellow of the American Statistical Association [link].


Even in situations where the hypothesis testing paradigm is correct, the common practice of basing inferences solely on p-values has been under intense criticism for over 50 years.
— Bayarri et al.

Reference: “Rejection odds and rejection ratios: A proposal for statistical practice in testing hypotheses” (2016), Journal of Mathematical Psychology [link], by Bayarri et al.


Several methodologists have pointed out that the high rate of nonreplication (lack of confirmation) of research discoveries is a consequence of the convenient, yet ill-founded strategy of claiming conclusive research findings solely on the basis of a single study assessed by formal statistical significance, typically for a p-value less than 0.05. Research is not most appropriately represented and summarized by p-values, but, unfortunately, there is a widespread notion that medical research articles should be interpreted based only on p-values.
— John Ioannidis

Reference: “Why Most Published Research Findings Are False” [link], PLoS Medicine. John P.A. Ioannidis of Stanford University, the C.F. Rehnborg Chair in Disease Prevention; Professor of Medicine, of Health Research and Policy, of Biomedical Data Science, and of Statistics; co-Director, Meta-Research Innovation Center at Stanford; Director of the PhD program in Epidemiology and Clinical Research [link].


Some have argued against change due to optimism, arguing that if we simply taught and used the NHST approach correctly all would be fine. We do not believe that the cognitive biases which p-values exacerbate can be trained away. Moreover, those with the highest levels of statistical training still regularly interpret p-values in invalid ways. Vulcans would probably use p-values perfectly; mere humans should seek safer alternatives.
— Robert Calin-Jageman & Geoff Cumming

Reference: “The New Statistics for Better Science: Ask How Much, How Uncertain, and What Else Is Known” [link], The American Statistician. Robert Calin-Jageman, Associate Professor of Psychology and Discipline Director of Neuroscience [link] & Geoff Cumming, Emeritus Professor of Psychology at La Trobe University [link]. This article was part of The American Statistician’s March 2019 special edition, “Statistical Inference in the 21st Century: A World Beyond p < 0.05” [link].


We recommend dropping the NHST paradigm—and the p-value thresholds intrinsic to it—as the default statistical paradigm for research, publication, and discovery in the biomedical and social sciences.
— Jennifer Tackett, Christian Robert, Andrew Gelman, David Gal, & Blakeley McShane

Reference: “Abandon Statistical Significance” [link], The American Statistician. Jennifer Tackett, Professor of Psychology and Director of Clinical Psychology at Weinberg College [link]; Christian Robert, Professor of Statistics at University of Warwick [link]; Andrew Gelman, Professor of Statistics and Director of the Applied Statistics Center at Columbia University [link]; David Gal, Professor of Marketing at the University of Illinois [link]; Blake McShane, Associate Professor of Marketing at Northwestern’s Kellogg School of Management [link]. This article was part of The American Statistician’s March 2019 special edition, “Statistical Inference in the 21st Century: A World Beyond p < 0.05” [link].


I believe that the almost universal reliance on merely refuting the null hypothesis as the standard method for corroborating substantive theories in the soft areas is a terrible mistake, is basically unsound, poor scientific strategy, and one of the worst things that ever happened in the history of psychology.
— Paul Meehl

Reference: “Theoretical Risks and Tabular Asterisks: Sir Karl, Sir Ronald, and the Slow Progress of Soft Psychology” [link], Journal of Consulting and Clinical Psychology, 1978. Paul Meehl (deceased), Professor of Psychology at the University of Minnesota and past president of the American Psychological Association [link].


The reliability and reproducibility of science are under scrutiny. However, a major cause of this lack of repeatability is not being considered: the wide sample-to-sample variability in the P value. We explain why P is fickle to discourage the ill-informed practice of interpreting analyses based predominantly on this statistic.
— Lewis G Halsey, Douglas Curran-Everett, Sarah L Vowler & Gordon B Drummond

Reference: “The fickle P value generates irreproducible results” [link], Nature Methods [link], 2015. Lewis G Halsey, Professor of Life Sciences at the University of Roehampton, London, and head of the Roehampton University Behaviour and Energetics Lab (RUBEL) [link]; Douglas Curran-Everett, Division of Bioinformatics at the National Jewish Health Hospital and the Department of Biostatistics and Informatics at the University of Colorado Denver’s School of Public Health [link, link]; Sarah L Vowler, Cancer Research UK Cambridge Institute at the University of Cambridge [link]; Gordon B Drummond, Honorary Clinical Senior Lecturer of Anaesthesia at The University of Edinburgh [link].
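The "fickle P" claim is easy to verify by simulation: repeating an identical experiment produces wildly different p-values from run to run. The sketch below assumes a particular effect size, sample size, and replication count for illustration; these numbers are not taken from the paper.

```python
# Repeat the same two-sample experiment many times under identical
# conditions and watch the p-value swing from run to run.
# Effect size (0.5 SD), n = 30 per group, and 1000 replications are
# illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
p_values = []
for _ in range(1000):  # 1000 identical replications
    control = rng.normal(0.0, 1.0, 30)
    treatment = rng.normal(0.5, 1.0, 30)  # a genuine effect is present
    _, p = stats.ttest_ind(control, treatment)
    p_values.append(p)

p_values = np.array(p_values)
print(f"smallest p: {p_values.min():.5f}, largest p: {p_values.max():.2f}")
print(f"fraction 'significant' at 0.05: {(p_values < 0.05).mean():.2f}")
```

Even though every replication samples from the same populations with the same real effect, some runs land far below 0.05 and others far above it, which is exactly the sample-to-sample variability the authors describe.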


Associating statistically significant findings with P < 0.05 results in a high rate of false positives even in the absence of other experimental, procedural and reporting problems.
— Benjamin et al.

Reference: “Redefine statistical significance” [link], Nature Human Behaviour, 2017. Daniel Benjamin plus 71 coauthors signed on to a proposal to lower the statistical significance threshold to 0.005. Their proposal is not without controversy [link], but there has been little disagreement about the problem they are trying to solve.
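The false-positive problem Benjamin et al. describe can be sketched in a few lines: when most tested hypotheses are null, a substantial share of the results crossing p < 0.05 are false discoveries even with flawless procedure. The 10% base rate of true effects, the effect size, and the sample sizes below are illustrative assumptions, not figures from the paper.

```python
# Sketch of the base-rate argument: run many experiments, only 10% of
# which test a real effect, and tally what p < 0.05 actually catches.
# All parameters (base rate 0.10, effect 0.5 SD, n = 30) are assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_experiments = 2000
is_real = rng.random(n_experiments) < 0.10  # only 10% of effects are real

false_pos, true_pos = 0, 0
for real in is_real:
    effect = 0.5 if real else 0.0
    a = rng.normal(0.0, 1.0, 30)
    b = rng.normal(effect, 1.0, 30)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        if real:
            true_pos += 1
        else:
            false_pos += 1

fdr = false_pos / (false_pos + true_pos)
print(f"share of 'significant' findings that are false: {fdr:.2f}")
```

Under these assumptions a large fraction of the "significant" results are false positives, with no p-hacking or procedural error required; that is the gap between the 5% error rate people imagine and the false discovery rate they actually get.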


In formal statistical testing, the crude dichotomy of ‘pass/fail’ or ‘significant or not’ will scarcely do. We must determine the magnitudes (and directions) of any statistical discrepancies warranted, and the limits to any substantive claims you may be entitled to infer from the statistical ones.
— Deborah Mayo

Reference: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars [link], 2018. Deborah Mayo, Professor Emerita of Philosophy of Science at Virginia Tech [link].


We have saddled ourselves with perversions of logic—p-values—and so we deserve our collective fate. I forgive nonstatisticians who cannot provide a correct interpretation of p < 0.05. p-Values are fundamentally un-understandable. I cannot forgive statisticians who give understandable—and therefore wrong—definitions of p-values to their nonstatistician colleagues.
— Donald Berry

Reference: “A p-Value to Die For” [link], Journal of the American Statistical Association, 2017. Donald Berry, Professor of Biostatistics at The University of Texas MD Anderson Cancer Center [link]


There are no good uses for [p-values]; indeed, every use either violates frequentist theory, is fallacious, or is based on a misunderstanding.
— William Briggs

Reference: “The Substitute for p-Values” [link], Journal of the American Statistical Association, 2017. William Briggs, Assistant Professor of Statistics, Weill Medical College of Cornell University [link].


There is a long line of work documenting how applied researchers misuse and misinterpret p-values in practice.
— Blakeley McShane and David Gal

Reference: “Statistical Significance and the Dichotomization of Evidence” [link], Journal of the American Statistical Association, 2017. David Gal, Professor of Marketing at the University of Illinois [link]; Blake McShane, Associate Professor of Marketing at Northwestern’s Kellogg School of Management [link].


Contrary to common dogma, tests of statistical null hypotheses have relatively little utility in science and are not a fundamental aspect of the scientific method.
— David Anderson, Kenneth Burnham, & William Thompson

Reference: “Null Hypothesis Testing: Problems, Prevalence, and an Alternative,” Journal of Wildlife Management, 2000. David Anderson (employment unknown), former scientist at the Cooperative Fish and Wildlife Research Unit [link]; Kenneth Burnham (retired), former Senior Scientist with the United States Geological Survey [link]; William Thompson, Adjunct Professor in Natural Resources at the University of Rhode Island, National Park Service Research Coordinator for the North Atlantic Coast Cooperative Ecosystem Studies Unit [link].