Article Summary: Contrary to popular belief, null hypothesis significance testing (NHST) is not a coherent and robust method of weighing evidence or making decisions. In fact, statisticians have been criticizing the use of statistical significance for decades. This article outlines six reasons you should move to a post p < 0.05 world. Links are provided to our more detailed articles on each topic.

Quick guide


Problem statement

Statistical significance is a widely used decision criterion in business.

Figure: Growth of significance test usage in management and marketing articles. Synthesized from Corrupt Research by Raymond Hubbard.

It is important to note that none of the criticisms below impugn the p-value per se. As American statistician and epidemiologist Sander Greenland has pointed out, p-values behave exactly as they should [1]. They are a well-defined quantity with a strict mathematical calculation. However, the properties of p-values surprise many users, p-values are routinely misinterpreted, and their usage within the null hypothesis significance testing framework leads to incoherent decision making.

This is not to say that the p-value should never be used, simply that it should not stand alone or above other statistical measures for the purpose of conducting statistical inference. If one has conducted a careful experiment and is aware of the p-value's properties and interpretation, there is no problem using the p-value as one among many factors in making a decision based on the experimental outcome. Unfortunately, that is not how the p-value is used today.

“Some have argued against change due to optimism, arguing that if we simply taught and used the NHST approach correctly all would be fine. We do not believe that the cognitive biases which p-values exacerbate can be trained away. Moreover, those with the highest levels of statistical training still regularly interpret p-values in invalid ways. Vulcans would probably use p-values perfectly; mere humans should seek safer alternatives.”

— ROBERT CALIN-JAGEMAN & GEOFF CUMMING

See also Donald Berry's commentary arguing for the end of statistical significance: https://www.tandfonline.com/doi/full/10.1080/01621459.2017.1316279

Principle 4 in the ASA statement on p-values is that “Proper inference requires full reporting and transparency,” which includes accounting for all analyses and multiplicities. This essentially never happens.

We created a monster. And we keep feeding it, hoping that it will stop doing bad things. It is a forlorn hope. No cage can confine this monster. The only reasonable route forward is to kill it.

Later in this article we state that “the p-value can tell us nothing about either the probability of the null or alternative hypothesis.” It is important to note that, in fact, no mathematical procedure can describe the probability of the null or alternative hypothesis without first making assumptions about what those probabilities might be. This is a fundamental uncertainty inherent in data analysis.

Statisticians have slightly different gripes about why we should move away from a p-value threshold. Some focus on the rate of null hypothesis false positives, others on the lack of replicability of p-values, and still others on contextual information like the relative tradeoff between false positives and false negatives and the risks and costs associated with acting on an experimental result.

Reason 1: The statistical significance framework is incoherent

The null hypothesis significance testing procedure (NHST) itself did not spring from a coherent statistical framework, but instead was a combination of null hypothesis testing developed by Ronald Fisher and decision theory developed by Jerzy Neyman and Egon Pearson. Gerd Gigerenzer, who refers to NHST as “the null ritual”, describes things this way in his article “Mindless statistics” [2]: “The null ritual does not exist in statistics proper. This point is not always understood; even its critics sometimes confuse it with Fisher’s theory of null hypothesis testing and call it ‘null-hypothesis significance testing.’ In fact, the ritual is an incoherent mishmash of ideas from Fisher on the one hand and Neyman and Pearson on the other, spiked with a characteristically novel contribution: the elimination of researchers’ judgment.”

Reason 2: The p-value offers much less evidence than typically understood

Much of the supposed power of the null hypothesis significance testing procedure is due to misunderstandings about the meaning of statistical significance and the associated p-value from which significance flows. In reality the p-value can tell us nothing about either the probability of the null or alternative hypothesis. In fact NHST does not even specify an alternative statistical hypothesis [3]. Instead, the p-value describes the probability of a calculated test statistic (and more extreme values) assuming the null hypothesis is true. While these two concepts may seem similar there is a substantial difference between the probability of data given a hypothesis and the probability of a hypothesis given data. In fact the probability of the null hypothesis being true when the p-value is 0.05 can be more than 25% under reasonable assumptions [4].
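To get a feel for that last point, here is a minimal simulation sketch. The 50/50 split between true nulls and real effects, the sample size, and the roughly 80% power are illustrative assumptions of ours, not values from the article or from reference [4]. The simulation tallies how often the null hypothesis is actually true among experiments whose p-value lands near 0.05.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

n_experiments = 200_000
n_per_group = 100                         # assumed sample size per arm

# Assume half of all tested hypotheses are true nulls; the rest have a real
# effect sized so the test has roughly 80% power at alpha = 0.05.
true_effect = 2.8 * np.sqrt(2 / n_per_group)
is_null = rng.random(n_experiments) < 0.5
effects = np.where(is_null, 0.0, true_effect)

# Simulate the two group means of a two-sample z-test on unit-variance outcomes.
control_mean = rng.normal(0.0, 1 / np.sqrt(n_per_group), n_experiments)
treatment_mean = rng.normal(effects, 1 / np.sqrt(n_per_group), n_experiments)
z = (treatment_mean - control_mean) / np.sqrt(2 / n_per_group)
p_values = 2 * norm.sf(np.abs(z))

# Among experiments whose p-value landed near 0.05, how often was the null true?
near_005 = (p_values > 0.045) & (p_values < 0.055)
print(f"share of true nulls among results with p near 0.05: {is_null[near_005].mean():.0%}")
```

Under these illustrative assumptions the share comes out near 30%, and it climbs higher if real effects are rarer than one in two.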

Another way to see the limited evidence of a statistically significant result is to calculate a p-value’s Shannon information, sometimes called the surprisal or s-value [5]. The s-value is a mathematical transformation of the p-value, useful because it has an easy-to-interpret analogy with coin flipping. For instance, a p-value of 0.25 is analogous to testing a coin for fairness by flipping it twice and seeing that heads came up both times. If you didn’t know whether a coin was fair, seeing two heads in a row would hardly convince you of anything. What about a p-value of 0.05, traditionally considered the cutoff for statistical significance? That would be like flipping a coin four times and getting four heads. More convincing than the p-value of 0.25, but would you be willing to make a bet that the coin is rigged from that evidence alone? In business that single bet could mean thousands or even millions of dollars in increased cost or forgone revenue.
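The arithmetic behind the coin-flipping analogy is short: the s-value is the negative base-2 logarithm of the p-value, so each bit corresponds to one more head in a row from a coin being tested for fairness. A quick sketch:

```python
import math

def s_value(p: float) -> float:
    """Shannon information (surprisal) of a p-value, measured in bits."""
    return -math.log2(p)

for p in (0.25, 0.05, 0.005):
    bits = s_value(p)
    # Each bit corresponds to one more head in a row from a coin you are
    # testing for fairness.
    print(f"p = {p:<6} -> s-value = {bits:.1f} bits (~{round(bits)} heads in a row)")
```

A p-value of 0.25 carries 2 bits of information, 0.05 about 4.3 bits, and even 0.005 only about 7.6 bits.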

It is often stated that the Bayes factor is another way to assess the evidence inherent in the p-value (the Bayes factor is a likelihood ratio comparing the probability of the data assuming the alternative hypothesis is true to the probability of the data assuming the null hypothesis is true). For example, the following passage appears in “Redefine statistical significance” by Benjamin et al. [6]:

A two-sided P-value of 0.05 corresponds to Bayes factors in favor of the alternative hypothesis that range from about 2.5 to 3.4 under reasonable assumptions about the alternative. This is weak evidence from at least three perspectives. First, conventional Bayes factor categorizations characterize this range as “weak” or “very weak.” Second, we suspect many scientists would guess that a p-value of 0.05 implies stronger support for the alternative hypothesis than a Bayes factor of 2.5 to 3.4. Third, using equation (1) and prior odds of 1:10, a p-value of 0.05 corresponds to at least 3:1 odds in favor of the null hypothesis!
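For readers who want to check numbers like these, one well-known calibration, the Sellke, Bayarri, and Berger bound, gives an upper limit on the Bayes factor in favor of the alternative for a given p-value. This is our addition for illustration and is not necessarily the exact calculation used by Benjamin et al.

```python
import math

def max_bayes_factor(p: float) -> float:
    """Upper bound -1 / (e * p * ln p) on the Bayes factor in favor of the
    alternative hypothesis, valid for p < 1/e (Sellke, Bayarri & Berger, 2001)."""
    return -1.0 / (math.e * p * math.log(p))

for p in (0.05, 0.01, 0.005):
    print(f"p = {p:<6} -> Bayes factor at most ~{max_bayes_factor(p):.1f}")
```

At p = 0.05 the bound is roughly 2.5, consistent with the low end of the range quoted above.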

However, this formulation of the problem and the associated solution has itself been criticized [7] as has using Bayes factors as part of the null hypothesis significance testing framework [8, 9, 10].

Research in which random pairs of variables are tested against each other reinforces how weak a bare significance threshold is as evidence.

Two analyses from Niels G. Waller's "The fallacy of the null hypothesis in soft psychology" (Applied & Preventive Psychology, 2004) illustrate the point:

  • 46% of random associations were statistically significant. Responses of 81,485 individuals who took the Minnesota Multiphasic Personality Inventory-Revised were examined (the inventory includes 567 questions across a wide range of domains such as general health concerns; personal habits and interests; attitudes towards sex, marriage, and family; affective functioning; normal range personality; and extreme manifestations of psychopathology). Waller programmed a computer to randomly select 511 questions and, for each, compare the proportions of males and females endorsing it. The analysis showed that "46% of the directional hypotheses were supported at significance levels that far exceeded traditional p-value cutoffs."

  • 47% of random associations were statistically significant. Responses of 39,994 females who took the same inventory were examined. Waller programmed a computer to compare the responses of women on a single question to their responses on the 566 remaining questions (320,922 comparisons were made in total). Of these, 47% were found to be statistically significant.

Percentage of random directional hypotheses that were statistically significant in Waller's analyses

Reason 3: The p-value has surprising replication properties

Believing that a small p-value means a replication of an experiment has a high probability of obtaining another small p-value is just a special case of p-value misinterpretation, what Gerd Gigerenzer calls the “replication delusion” [11]. In fact, if there is a true difference between treatment and control it is the power of the experiment, not the p-value, that tells us the probability of obtaining statistically significant results upon replication. However, the true statistical power of an experiment is almost always unknown because the true effect size between the treatment and control is unknown (otherwise there would be no reason to conduct an experiment in the first place). This can lead to very low p-value replication rates even when an experiment has the appearance of being properly powered, as we showed in a set of simulations building on the work of Geoff Cumming [12].

The p-value ranges of replicated experiments can also be very large, spanning almost the entire interval from 0 to 1. As prominent statistician Andrew Gelman put it in a September 2019 blog post: “To say it again: it is completely consistent with the null hypothesis to see p-values of 0.2 and 0.005 from two replications of the same damn experiment” [13].
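A simple simulation in the spirit of the replications mentioned above [12] makes the spread concrete. The sample size and effect size below are hypothetical, chosen so the experiment has roughly 80% power, a level most practitioners would consider well designed:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Replicate the same two-arm experiment many times: a true lift of 0.5 standard
# deviations with 64 subjects per arm (~80% power at alpha = 0.05).
n_per_group, true_lift, n_replications = 64, 0.5, 10_000
p_values = np.empty(n_replications)

for i in range(n_replications):
    control = rng.normal(0.0, 1.0, n_per_group)
    treatment = rng.normal(true_lift, 1.0, n_per_group)
    p_values[i] = ttest_ind(treatment, control).pvalue

print(f"replications with p < 0.05: {np.mean(p_values < 0.05):.0%}")
print("p-value percentiles (2.5th, 50th, 97.5th):",
      np.round(np.percentile(p_values, [2.5, 50, 97.5]), 4))
```

Even in this favorable setup roughly one replication in five fails to reach p < 0.05, and the p-values themselves span several orders of magnitude.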

Large samples are also essentially guaranteed to produce small p-values: with enough data, even a trivial difference yields what might be called a “too big to fail” p-value.


Reason 4: The p-value is prone to misinterpretation

Interpreting the p-value as a measure of the replicability of the experiment is just one of many common misinterpretations of the p-value. Other p-value fallacies include the following:

  • Fallacy 1: The p-value is the probability of the null hypothesis

  • Fallacy 2: A non-significant p-value indicates no effect between the treatment and control

  • Fallacy 3: Significant and insignificant results are contradictory

These fallacies are detailed in our article on common p-value misinterpretations [14].
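To make Fallacy 3 concrete, here is a minimal sketch with made-up numbers: two experiments estimate exactly the same lift on the same baseline conversion rate, yet only the larger one crosses the 0.05 threshold. The results reinforce rather than contradict each other; only the precision differs.

```python
import numpy as np
from scipy.stats import norm

# Two hypothetical experiments observe the *same* absolute lift of 1 point on a
# 10% baseline conversion rate, but with different sample sizes per arm.
baseline, lift = 0.10, 0.01

for n_per_arm in (2_000, 20_000):
    pooled = baseline + lift / 2                        # pooled conversion rate
    se = np.sqrt(2 * pooled * (1 - pooled) / n_per_arm)  # standard error of the lift
    z = lift / se
    p = 2 * norm.sf(z)                                   # two-sided p-value
    print(f"n per arm = {n_per_arm:>6,}  observed lift = {lift:.1%}  p = {p:.3f}")
```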

Reason 5: Multiple comparisons are trickier than you think

One multiple comparison trap goes by the name “researcher degrees of freedom,” or what statistician Andrew Gelman calls “The Garden of the Forking Paths.” To understand the problem let’s consider a thought experiment. Suppose you run an A/B test with a new, more prominent “add to cart” button on some product pages. You initially plan on testing the click-through-rate (CTR) on the “add to cart” product pages you have today and comparing it to the version with the improved button. After the test is complete a colleague tells you that there was a small uptick in total sales during the period the test was active. Could the increase be due to the new button? Everyone agrees that based on the data you should abandon the original success metric of CTR and instead conduct the test of statistical significance on the comparison of average order value between the two groups exposed to the differing “add to cart” buttons.

Here’s a question: you ran a single test of statistical significance, do multiple comparisons matter? The surprising answer is yes. Why? Because while you conducted a single test of significance, the choice to run that test was based on first examining the data. Had the data come out differently -- perhaps sales saw no uptick -- you would’ve stuck with the original plan of evaluating success based on CTR. “[I]f you accept the concept of the p-value, you have to respect the legitimacy of modeling what would have been done under alternative data,” Andrew Gelman and Eric Loken wrote in their 2013 paper on the subject [15].
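A deliberately simplified simulation shows why the choice matters even when only one test is ultimately run. Suppose, as an assumption for illustration rather than a description of the scenario above, that an analyst always ends up testing whichever of two success metrics looks more promising in the observed data. Even when neither metric truly differs between the groups, the reported result is "significant" far more often than the nominal 5%:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)

# A/A simulation: neither click-through rate nor order value truly differs
# between the two groups, but the analyst reports a significance test on
# whichever metric happens to look stronger.
n_sims, n_per_group = 20_000, 500
false_positives = 0

for _ in range(n_sims):
    ctr_a = rng.binomial(1, 0.10, n_per_group)       # clicks, control
    ctr_b = rng.binomial(1, 0.10, n_per_group)       # clicks, treatment
    aov_a = rng.gamma(2.0, 25.0, n_per_group)        # order values, control
    aov_b = rng.gamma(2.0, 25.0, n_per_group)        # order values, treatment

    p_ctr = ttest_ind(ctr_b, ctr_a).pvalue
    p_aov = ttest_ind(aov_b, aov_a).pvalue

    # The forking path: whichever metric looks stronger gets reported.
    false_positives += min(p_ctr, p_aov) < 0.05

print(f"false positive rate: {false_positives / n_sims:.1%} (nominal rate is 5%)")
```

In this setup the rate comes out near 10%, roughly double the nominal threshold, even though only one p-value is ever reported per simulated test.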

Reason 6: The p-value doesn't say anything about the effect size

There are many ways in which the p-value fails to tell us anything about the effect size of an experimental result:

  1. A small p-value doesn’t mean the effect size of an experiment represents a practically important result.

  2. In fact, it is a mathematical certainty that the p-value will shrink toward zero as the sample size increases whenever there is any nonzero difference between the treatment and control, no matter how small (see the sketch after this list) [16].

  3. Two different experimental results with completely different effect sizes can have the same p-value [17].

  4. There are even scenarios in which the p-value and the confidence interval appear to disagree about the result [18].
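Here is the sketch referenced in point 2. The lift of 0.005 standard deviations is an arbitrary stand-in for an effect too small to matter commercially; the p-value for that same fixed effect still marches toward zero as the sample grows (the calculation evaluates the two-sided p-value at the expected z-statistic for each sample size):

```python
import numpy as np
from scipy.stats import norm

# A fixed, commercially meaningless lift of 0.005 standard deviations.
tiny_lift = 0.005

for n_per_group in (1_000, 10_000, 100_000, 1_000_000, 10_000_000):
    standard_error = np.sqrt(2 / n_per_group)
    z = tiny_lift / standard_error          # expected z-statistic at this sample size
    p = 2 * norm.sf(z)                      # two-sided p-value
    print(f"n per group = {n_per_group:>10,}   expected p ~ {p:.2e}")
```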

In fact, it is common practice for researchers to associate p-values with effect sizes, mistakenly believing that smaller p-values mean more substantial effects [19] or categorizing different phenomena by their p-values [20].


References & further reading

References:

  1. Sander Greenland, “Valid P-Values Behave Exactly as They Should: Some Misleading Criticisms of P-Values and Their Resolution With S-Values”, The American Statistician, 2019 [link]

  2. Gerd Gigerenzer, “Mindless statistics”, The Journal of Socio-Economics, 2004 [link]

  3. Gerd Gigerenzer, “Mindless statistics”, The Journal of Socio-Economics, 2004 [link]

  4. The Research, “False Positive Risk”, 2019 [link]

  5. Sander Greenland, “Valid P-Values Behave Exactly as They Should: Some Misleading Criticisms of P-Values and Their Resolution With S-Values”, The American Statistician, 2019 [link]

  6. Benjamin et al. (71 coauthors), “Redefine Statistical Significance”, Nature Human Behaviour, 2017 [link]

  7. Blakeley McShane, David Gal, Andrew Gelman, Christian Robert, Jennifer Tackett, “Abandon Statistical Significance”, The American Statistician, 2019 [link]

  8. Uri Simonsohn, “If you think p-values are problematic, wait until you understand Bayes Factors”, Data Colada (blog), 2019 [link]

  9. Andrew Gelman, “Why I don’t like so-called Bayesian hypothesis testing”, Statistical Modeling, Causal Inference, and Social Science (blog), 2009 [link]

  10. Andrew Gelman, “‘Bayes factor’: where the term came from, and some references to why I generally hate it”, Statistical Modeling, Causal Inference, and Social Science (blog), 2017 [link]

  11. Gerd Gigerenzer, “Statistical Rituals: The Replication Delusion and How We Got There”, Advances in Methods and Practices in Psychological Science, 2018 [link]

  12. The Research, “P-value replication”, 2019 [link]

  13. Andrew Gelman, “It’s not just p=0.048 vs. p=0.052”, Statistical Modeling, Causal Inference, and Social Science (blog), 2019 [link]

  14. The Research, “Common p-value misinterpretations” [link]

  15. Andrew Gelman and Eric Loken, “The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time”, 2013 [link]

  16. s

  17. s

  18. s

  19. s

  20. s