Article summary: Many analysts assume that if an experiment or analysis results in a statistically significant p-value then repeating the same experiment will likely again result in a statistically significant p-value. However, this is not always true, even with large sample sizes or high statistical power. This is one of many reasons statisticians exercise caution when using p-values and suggest moving away from binary significant/non-significant criteria for decision making.

Quick guide


Replication is fundamental to science, so statistical analysis should give information about replication. Because p values dominate statistical analysis...it is important to ask what p says about replication. The answer to this question is ‘Surprisingly little.’
— Geoff Cumming

Problem summary

Many analysts assume that if an experiment or analysis results in a statistically significant p-value then repeating the same experiment will likely again result in a statistically significant p-value. However, this is not always true, even with large sample sizes or high statistical power.

To see this, consider a simple A/B test meant to assess click-through rate (CTR) that compares our current call to action (CTA) against the treatment, a CTA everyone agrees is more enticing. Suppose the current CTR is 3% and a lift of 10% (from 3% to 3.3%) would be practically meaningful. One could use an online sample size calculator or statistical software package to determine the sample size needed to detect this lift with 95% confidence at 80% power (the standard for A/B testing). Using the power.prop.test function in R shows that we need a sample size of 53,210 customers in each of the control and treatment groups, for a total of 106,420 customers.
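The sample size quoted above can be reproduced with the power.prop.test call the text refers to (two-sided test, 5% significance level, 80% power):

# Sample size to detect a lift from a 3.0% to a 3.3% CTR
# at 95% confidence (two-sided) and 80% power
power.prop.test(p1 = 0.03, p2 = 0.033, sig.level = 0.05, power = 0.80)
# n is approximately 53,210 per group, or about 106,420 customers in total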

Suppose we then run our experiment comparing the two CTAs and the result is a lift of 10%, our desired goal, with a statistically significant p-value. We might then reasonably assume that our experiment was a success, not in the sense that the treatment “won,” but rather in the sense that the result was methodologically sound. We selected our sample size so that our experiment was well powered (80%) and we had high confidence (95%); we then observed precisely the lift we expected. What’s the problem?

The problem is that if we repeated our experiment, not only are we not guaranteed to get a statistically significant p-value the second time, but the probability of getting a statistically significant p-value is unknown, because it depends on the true effect size, which we do not know. Surely, we do not want a decision rule based on a metric that can vary from trial to trial with unknown probability even when the experimental setup is unchanged. However, that’s exactly what the p-value provides.

Statistical details

The reason for the p-value replication problem is that p-values are computed from samples, and sampling introduces randomness. There is a difference between the “true” effect of the CTA for our entire customer base and the effect of the CTA for the customers who happen to visit our site while the test is live (tests are usually live two to four weeks). What we get during our experiment is a random sample of customers, not the entire population. Of course, the set of customers that visit our site at any one time is not completely random. Businesses run sales, release new products, and engage in advertising, all of which drive customers to our site. However, while it’s true that all businesses have some ability to influence their customer base, there is a strong random component to consumer behavior (if companies had complete influence there would be no need for an experimentation program).

Some careful readers may object to the statement that “there is an unknown probability of getting a statistically significant p-value” when an experiment is repeated. After all, the statistical power is the figure that describes the long-run frequency of detecting a true effect if there is one (that is, if the alternative hypothesis is true). And by definition, if the null hypothesis is true, the probability that another experiment produces a p-value as extreme or more extreme is equal to the p-value we obtained in the first experiment. However, the power tells us nothing about any particular trial of an experiment. Furthermore, A/B test calculators assume a particular effect size (e.g., lift) and then produce a sample size. There is no guarantee the assumed lift corresponds in any way to the actual efficacy of the CTA in the population under study. Pre-study power calculations do not measure the compatibility of these alternatives with the data actually observed [1, 2].
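To make the null-hypothesis part of that statement concrete, here is a minimal sketch, assuming the same 51,486-per-group design used in the code examples at the end of this article (the seed is arbitrary): when there is no true lift, the p-value is approximately uniform between 0 and 1, so a result at least as extreme as p = 0.05 recurs only about 5% of the time.

# Sketch: distribution of p-values when the null is true (no lift, both CTRs 3%)
set.seed(42)                      # arbitrary seed for reproducibility
n <- 51486                        # per-group sample size from the design above
p_null <- replicate(2000, {
  clicks_control   <- rbinom(1, n, 0.03)
  clicks_treatment <- rbinom(1, n, 0.03)   # no true lift
  prop.test(c(clicks_treatment, clicks_control), c(n, n), correct = FALSE)$p.value
})
mean(p_null <= 0.05)   # roughly 0.05
hist(p_null)           # roughly flat between 0 and 1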

In our example above if the true effect of the treatment CTA over the control CTA for the entire population of customers were 10% lift, then based on our experimental design, in the long run we would expect 80% of experiments to result in a statistically significant effect (precisely because we calculated our sample size for 80% power at a 10% lift).

However, if the true effect of the treatment CTA over the control CTA for the entire population of customers were only, say, 5% lift, we would still get a statistically significant result some of the time. And in fact the observed lift of these statistically significant trials would be close to 10% because statistically significant results from underpowered tests tend to exaggerate the effect size.

Gerd Gigerenzer gives this example to help better understand the principles behind replication: “A die, which could be fair or loaded, is thrown twice and shows a “six” both times, which results in a p value of .03 (1/36) under the null hypothesis of a fair die. Yet this does not imply that one can expect two sixes in 97% of all further throws” [3].
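The arithmetic behind the example is simply:

# Gigerenzer's die example: p-value for two sixes under the null of a fair die
(1/6)^2        # 1/36, about 0.03
# ...but 1 - p is not the probability that further throws replicate the result
1 - (1/6)^2    # about 0.97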

Simulation

The key insight is that if we run an experiment once and get a statistically significant result, we don’t know whether it’s because the true population response to the CTA is a 10% lift, or because the lift is actually lower and we happened to get a statistically significant result in that particular trial of the experiment.

This surprising fact can be demonstrated using simulation. The simulation below was generated using the same experimental parameters specified above: the sample size (51,486 per group for the treatment and control, as given by the online A/B test calculator referenced in the code examples at the end of this article) is based on a 10% lift at 80% power and 95% confidence. Each trial of the experiment uses the same sample size, but the true CTA lift for our population of customers varies from 0% (no difference in CTA) to 20% (the treatment produces a CTR of 3.6% compared to the control CTR of 3%).

For each experimental trial a random sample of customers was drawn from the population, so that even if the true lift of the CTA for our population were 10% as we expected, the observed lift might not be exactly 10% since only a random set of customers are observed while the test is live.

Each dot on the graph below represents 100,000 experimental trials (equivalent to 100,000 A/B tests) for a total of 2.1 million A/B test simulations. The simulation allows us to examine the results of our experimental design as the true population lift varies.

The x-axis shows the true population lift for each trial, while the y-axis shows the percentage of the 100,000 trials that are statistically significant. As expected when the true lift of the CTA is 10%, about 80% of the experimental trials are statistically significant (because our test is properly powered). However, if the true lift of the CTA were only 5%, we’d still get a statistically significant test a quarter of the time!

Percent of Trials that are Statistically Significant by Percent Lift

What’s more, when an underpowered experiment achieves statistical significance, the estimated lift is overstated. For the case where the actual effect of the CTA on the population is only a 5% lift, not only will the experiment produce a statistically significant result about a quarter of the time, but in those significant trials the average reported lift would be around 9.2%, very close to our desired lift of 10%. This underscores the argument that a statistically significant result from a single trial of an experiment may appear to confirm a treatment’s superiority over the control while simultaneously conforming to the desired test parameters. In reality, however, the result may come from an effect size our experiment is not properly powered to detect reliably. The result is a p-value with poor replication properties.
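This exaggeration can be checked with the simulate() helper defined in the Code examples section at the end of this article. The sketch below uses 2,000 trials rather than 100,000 to keep the run time short, so the exact numbers will wander from run to run, but they should land near the figures quoted above.

# Sketch: true population lift of 5%, sample size still based on a 10% lift
# (requires library(tidyverse) and the ctr_sim/simulate functions defined below)
sim_5 <- simulate(n_sims = 2000, n = 51486, ctr1 = 0.03, ctr2 = 0.03 * 1.05)

mean(sim_5$significant)    # roughly a quarter of trials are significant

sim_5 %>%
  filter(significant == 1) %>%
  summarize(avg_estimated_lift = mean(point_estimate) / 0.03 * 100)  # near 9%, not 5%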

Estimated Lift of Statistically Significant Trials by True Population Lift

P-value range

In addition to the p-values themselves alternating between statistically significant and statistically nonsignificant with varying probabilities based on the factors discussed above, the range of the p-values is also surprising. It’s not simply that sometimes the p-values are significant while other times they are not; the p-values themselves vary widely across the entire range from 0 to 1. The table below shows the p-value range for just 10 simulated A/B tests per row (instead of the 100,000 simulated tests in the example above). The methodological setup is the same as above: the sample size was set to detect a true population lift of 10% (from 3.0% CTR to 3.3% CTR), and the true effect of the CTA for the population varied from 0% to 20%.

In cases where the true lift of the CTA for the population was much larger than the lift the sample size was based on, the p-value varies less (because we have more than adequate power to detect the effect).

Lift | P-value max | P-value min
0%   | 0.95        | 0.05
5%   | 0.91        | 0.0032
10%  | 0.40        | 0.0000053
15%  | 0.071       | 0.0000017
20%  | 0.00067     | 0.0
P-value range based on 10 simulations.

However, even for the case where the effect of the CTA for the population was exactly the lift chosen to calculate our sample size (10% lift), the p-value ranges from nearly 0 (0.0000053) to 0.4. The p-value varies much more for lifts that are smaller than those we are adequately powered to detect consistently.
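The table can be reproduced (up to simulation noise) with the simulate() helper from the Code examples section below; with only 10 trials per setting, the observed minimum and maximum will differ from run to run.

# Sketch: p-value range over 10 simulated A/B tests with a true 10% lift
# (requires library(tidyverse) and the ctr_sim/simulate functions defined below)
sim_10 <- simulate(n_sims = 10, n = 51486, ctr1 = 0.03, ctr2 = 0.03 * 1.10)
range(sim_10$p_value)   # smallest and largest p-value across the 10 trials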

This comes as a surprise to many. If a single experiment produced a statistically significant result at the 0.0032 level, one typically wouldn’t expect a replication of the experiment to produce a p-value of 0.91.

As the prominent statistician Andrew Gelman put it in a September 2019 blog post: “To say it again: it is completely consistent with the null hypothesis to see p-values of 0.2 and 0.005 from two replications of the same damn experiment” [4].


Don't worry, researchers get it wrong too

Misunderstandings around p-value replication are not limited to analysts and falling prey to these misunderstandings doesn’t make one statistically inept. In fact, surveys have shown that many professional academic researchers misunderstand the meaning of the p-value as well, with many believing that 1-p represents the probability of replication. Under this replication delusion, as Gigerenzer terms it, a p-value of 0.05 means that 95% of future experiments using the same experimental design would again result in a statistically significant result (at the 0.05 level).

In his 2018 paper “Statistical Rituals: The Replication Delusion and How We Got There,” Gigerenzer reviewed a number of previous surveys that asked psychologists to assess the truth of statistical statements related to the p-value [3]. The replication delusion was widely believed across various psychological fields. The results of this literature review are reproduced in the table below. For more “statistical delusions” see our article on common p-value misunderstandings.

Study | Description of group | Country | N | Statistic tested | Respondents exhibiting the replication delusion
Oakes (1986) | Academic psychologists | United Kingdom | 70 | p = 0.01 | 60%
Haller & Krauss (2002) | Statistics teachers in psychology | Germany | 30 | p = 0.01 | 37%
Haller & Krauss (2002) | Professors of psychology | Germany | 39 | p = 0.01 | 49%
Badenes-Ribera, Frias-Navarro, Monterde-i-Bort, & Pascual-Soler (2015) | Academic psychologists: personality, evaluation, psychological treatments | Spain | 98 | p = 0.001 | 35%
Badenes-Ribera et al. (2015) | Academic psychologists: methodology | Spain | 47 | p = 0.001 | 16%
Badenes-Ribera et al. (2015) | Academic psychologists: basic psychology | Spain | 56 | p = 0.001 | 36%
Badenes-Ribera et al. (2015) | Academic psychologists: social psychology | Spain | 74 | p = 0.001 | 39%
Badenes-Ribera et al. (2015) | Academic psychologists: psychobiology | Spain | 29 | p = 0.001 | 28%
Badenes-Ribera et al. (2015) | Academic psychologists: developmental and educational psychology | Spain | 94 | p = 0.001 | 46%
Badenes-Ribera, Frias-Navarro, Iotti, Bonilla-Campos, & Longobardi (2016) | Academic psychologists: methodology | Italy, Chile | 18 | p = 0.001 | 6%
Badenes-Ribera et al. (2016) | Academic psychologists: other areas | Italy, Chile | 146 | p = 0.001 | 13%
Hoekstra, Morey, Rouder, & Wagenmakers (2014) | Researchers in psychology (Ph.D. students and faculty) | Netherlands | 118 | 95% CI | 58%
Reproduced from Gigerenzer, "Statistical Rituals: The Replication Delusion and How We Got There", Advances in Methods and Practices in Psychological Science, 2018

References & theoretical basis

References:

  1. Greenland et al., “Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations”, European Journal of Epidemiology, 2016 [link]

  2. Steven Goodman, “A Comment on Replication, P-Values, and Evidence”, Statistics in Medicine, 1992 [link]

  3. Gerd Gigerenzer, “Statistical Rituals: The Replication Delusion and How We Got There”, Advances in Methods and Practices in Psychological Science, 2018 [link]

  4. Andrew Gelman, “It’s not just p=0.048 vs. p=0.052”, Statistical Modeling, Causal Inference, and Social Science (blog), 2019 [link]

Theoretical basis:

  • Geoff Cumming, “Replication and p Intervals: p Values Predict the Future Only Vaguely, but Confidence Intervals Do Much Better”, Perspectives on Psychological Science, 2008 [link]

  • Lewis G Halsey, Douglas Curran-Everett, Sarah L Vowler & Gordon B Drummond, “The fickle P value generates irreproducible results”, Nature, 2015 [link]


Code examples

# Load libraries
library(tidyverse)

# Define functions
# Simulate a single A/B test: draw n Bernoulli click outcomes for each arm
# and compare the two observed CTRs with a two-sample proportion test.
ctr_sim = function(n, ctr1, ctr2) {
  x = rbinom(n, 1, ctr1)   # control clicks
  y = rbinom(n, 1, ctr2)   # treatment clicks
  prop.test(x=c(sum(y), sum(x)), n=c(n, n), correct=FALSE)
}

# Run n_sims simulated A/B tests and record the p-value, confidence interval,
# point estimates, significance flag, and CI coverage for each trial.
simulate = function(n_sims, n, ctr1, ctr2) {
  result = tibble()
  
  for(i in 1:n_sims) {
    sim = ctr_sim(n, ctr1, ctr2)
    p_value = sim[["p.value"]]
    CI_lower = sim[["conf.int"]][1]
    CI_upper = sim[["conf.int"]][2]
    estimate1 = sim[["estimate"]][[1]]
    estimate2 = sim[["estimate"]][[2]]
    result = bind_rows(result, tibble(p_value, CI_lower, CI_upper, estimate1, estimate2))
  }
  actual_diff_in_prop = ctr2 - ctr1
  
  result %>%
    mutate(significant = if_else(p_value <= 0.05, 1, 0, missing = NULL)) %>%
    mutate(point_estimate = abs(estimate1 - estimate2)) %>%
    mutate(actual_diff_in_prop = actual_diff_in_prop) %>%
    mutate(point_estimate_diff = actual_diff_in_prop - point_estimate) %>%
    mutate(CI_coverage = if_else(CI_lower <= actual_diff_in_prop & actual_diff_in_prop <= CI_upper, 1, 0))
}

# Summarize a set of simulated trials: share reaching significance, the p-value
# range, CI coverage, and the average effect among the significant trials.
print_sim_stats = function(sim) {
  print(paste("Percent of simulation trials that are significant:", 100 * sum(sim[["significant"]]) / nrow(sim)))
  print(paste("Largest p-value from simulation:", max(sim[["p_value"]])))
  print(paste("Smallest p-value from simulation:", min(sim[["p_value"]])))
  print(paste("Percent of simulation CIs that cover true difference in proportion:", 100 * sum(sim[["CI_coverage"]]) / nrow(sim)))
  
  sig_trials = sim %>% filter(significant == 1)
  average_point_estimate_of_sig_trials = mean(sig_trials[["point_estimate"]])
  print(paste("Average point estimate of statistically significant trials:", average_point_estimate_of_sig_trials))
  print(paste("Percent of simulation CIs that cover true difference in proportion for statistically significant trials:", 100 * sum(sig_trials[["CI_coverage"]]) / nrow(sig_trials)))
}

# Sample size based on this calculator: https://www.abtasty.com/sample-size-calculator/

###
# Sample size calculation
# Conversion rate: 3%
# 95% confidence
# 80% power
# 10% lift
###
n = 51486
n_sims = 100000

# Actual lift
# 10%
ctr1 = .03
ctr2 = .03 * 1.10
sim = simulate(n_sims, n, ctr1, ctr2)
print("Sample size based on 10% lift at 80% power and actual population lift is 10%.")
print_sim_stats(sim)


# Actual lift
# 6.5%
ctr1 = .03
ctr2 = .03 * 1.065
sim = simulate(n_sims, n, ctr1, ctr2)
print("Sample size based on 10% lift at 80% power and actual population lift is 6.5%.")
print_sim_stats(sim)


# Actual lift
# 5%
ctr1 = .03
ctr2 = .03 * 1.05
sim = simulate(n_sims, n, ctr1, ctr2)
print("Sample size based on 10% lift at 80% power and actual population lift is 5%.")
print_sim_stats(sim)


# Actual lift
# 20%
ctr1 = .03
ctr2 = .03 * 1.2
sim = simulate(n_sims, n, ctr1, ctr2)
print("Sample size based on 10% lift at 80% power and actual population lift is 20%.")
print_sim_stats(sim)

# Examine all lifts from 0% to 20%
n = 51486
n_sims = 100000
base_ctr = .03
lifts = seq(1, 1.2, by=.01)
sims = tibble()
for(lift in lifts) {
  ctr2 = base_ctr * lift
  sim = simulate(n_sims, n, base_ctr, ctr2)
  
  # Calculate stats
  percent_sig = sum(sim[["significant"]]) * 100 / nrow(sim) # percent of trials that are significant
  population_lift = (lift-1)*100 # population_lift defined in lifts array
  average_point_estimate_of_sig_trials = sim %>% # average absolute CTR difference between treatment and control among significant trials
    filter(significant==1) %>%
    summarize(mean(point_estimate)) %>%
    unlist
  estimated_lift = (average_point_estimate_of_sig_trials/base_ctr)*100 # lift implied by the average CTR difference
  
  sims = bind_rows(sims, bind_cols("population lift"=population_lift, "percent significant"=percent_sig, "ctr_point_estimate"=average_point_estimate_of_sig_trials, "estimated lift"=estimated_lift))
}

View(sims)