P-value functions 

Article Summary

P-value functions are a useful way to enlarge on the concept of the p-value. A p-value function is constructed from the same underlying dataset, but the hypothesis is allowed to range over all reasonable values instead of being fixed at the standard value of 0 (an assumption of no difference between group means). P-value functions provide several benefits over the dichotomous significant/nonsignificant paradigm, shifting focus instead to parameter estimation: they report an estimated effect size and the precision of that estimate.

Contact information

Please direct all inquiries to info@theresear.ch

Quick guide

Supplementary Material

For technical details including code samples, please click here.

Article Status

This article is complete and is being actively maintained.

Paid reviewers

The reviewers below were paid by The Research to ensure this article is accurate and fair. That work includes a review of the article content as well as the code that produced the results. This does not mean the reviewer would have written the article in the same way, that the reviewer officially endorses the article, or that the article is perfect; no article is. It simply means the reviewer has done their best to take what The Research produced, improve it where needed, provide editorial guidance, and generally considers the content to be correct. Thanks to all the reviewers for their time, energy, and guidance in helping to improve this article.

  1. Kenneth Rothman. Professor of Preventive Medicine & Epidemiology at Boston University School of Medicine. Professor Rothman is the founding editor of Epidemiology, a leading public-health journal, and has coauthored over 450 peer-reviewed articles including several articles on statistical misinterpretations. He has written about p-value functions in Chapter 8 of his book Epidemiology: An Introduction. Click here to see Kenneth Rothman’s profile at Boston University.

  2. Other reviewers will be added as funding allows.


Introduction

P-value functions are a useful way to enlarge on the concept of the p-value. A p-value function is constructed from the same underlying dataset, but the hypothesis is allowed to range over all reasonable values instead of being fixed at the standard value of 0 (an assumption of no difference between group means). P-value functions provide several benefits over the dichotomous significant/nonsignificant paradigm, shifting focus instead to parameter estimation: they report an estimated effect size and the precision of that estimate.

While this introductory article focuses on cohort analysis, p-value functions can be used in a variety of settings; for example, A/B testing [1] or classifier comparison in machine learning [2]. Future articles will dive deeper into these specific use cases. This article tackles the introduction of the p-value function, how to interpret it in a simple applied business context, and common p-value function misapplications.

Analysis setup

To create a p-value function we’ll use data from IBM Watson’s open source insurance dataset. The example will use cohort analysis, comparing the customer lifetime value (CLV) of two different insurance segments: customers with personal auto insurance (6,788 customers) and customers with corporate auto insurance (1,968), meaning the customer’s vehicle is used for business purposes. There are a total of 8,756 customers between the two groups.

To begin, the two customer segments were compared using the standard significance testing framework, with the goal of understanding the difference in CLV. Such differences may be useful for the insurance company to understand. For example, if corporate auto insurance users have higher CLV the company may want to educate personal auto insurance customers that if they sometimes use their car for work then corporate auto insurance may provide additional benefits. If personal auto insurance customers have a higher CLV this may mean the company needs to work harder to retain corporate customers or double down on expansion of personal auto insurance. Using the standard hypothesis of 0 — an assumption that there is no difference in CLV between personal and corporate auto insurance customers — results in a nonsignificant p-value of 0.2 for this dataset. The estimated effect size is a CLV difference of $213 in favor of personal auto insurance customers.

Under the dichotomous usage of statistical significance the p-value of 0.2 would indicate a nonsignificant difference in customer CLV. This would likely lead analysts at the company to then conclude that there is no meaningful difference between the two customer segments. However, the concept of statistical significance has been heavily criticized [3]. The p-value function, on the other hand, can paint a fuller picture, pulling focus back to the important aspects of the problem: what exactly is the estimated difference in CLV between the two customer segments and how precise is that estimate given our data?

The p-value function

The null hypothesis is the most common choice in significance testing. In many business settings the null is 0, which corresponds to a hypothesis of no difference between two customer groups. In A/B testing, for instance, the null might correspond to no difference between the treatment and control groups' click-through rate (CTR) on a "Buy now" button, or no difference in the average order value (AOV) of completed purchases. In other settings the null might correspond to other values, for example 1 if the measure is a ratio of two rates (if the two rates are the same the ratio is 1). However, it is statistically valid to choose any hypothesis; there is no reason to confine hypotheses to these null values.

The p-value function simply extends this concept by calculating a p-value for every hypothesis within some reasonable range. The result is a nice visual graph with useful statistical properties. After one understands the p-value function it can be added to the toolbox used to make analytical business decisions.

To begin exploring, the p-value function for the difference in CLV by insurance policy type is shown below. To generate the p-value function, a p-value is first calculated for the standard null hypothesis of no difference in CLV between the two groups. The null hypothesis and resulting p-value are then plotted on an x-y axis. Next, a p-value is calculated assuming a hypothesis that personal auto insurance customers have an average CLV that is $1 more than corporate auto insurance customers, then $2 more, and so on; the hypothesized difference is increased in $1 increments up to $1,000, with a p-value calculated each time. Then values lower than zero are considered. A p-value is calculated assuming a hypothesis that personal auto insurance customers have an average CLV that is $1 less than corporate auto insurance customers, and this is repeated in $1 increments down to a CLV difference of -$500. The result is 1,501 data points, each one a combination of a hypothesis and a p-value.

The $1,000 upper bound for the hypothesized difference as well as the -$500 lower bound are dependent on the problem at hand. Data must be analyzed to determine which values make sense. The same holds true for what hypothesis increments should be used. Since CLV is reported in dollars, $1 increments are a natural choice.
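The construction described above can be sketched in a few lines of code. The snippet below uses simulated data standing in for the IBM dataset (the group means, spread, and random seed are assumptions) and a large-sample z-test with a Welch-style standard error for simplicity:

```python
import math
import random
import statistics

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def p_value_function(group_a, group_b, deltas):
    """Two-sided p-value for each hypothesized mean difference in deltas,
    using a large-sample z-test with a Welch-style standard error."""
    diff = statistics.fmean(group_a) - statistics.fmean(group_b)
    se = math.sqrt(statistics.variance(group_a) / len(group_a)
                   + statistics.variance(group_b) / len(group_b))
    return [(d, 2.0 * (1.0 - normal_cdf(abs(diff - d) / se))) for d in deltas]

# Simulated CLV data standing in for the two insurance segments
# (illustrative values, not the actual IBM dataset).
random.seed(0)
personal = [random.gauss(5000, 2000) for _ in range(6788)]
corporate = [random.gauss(4800, 2000) for _ in range(1968)]

grid = range(-500, 1001)  # hypothesized CLV differences, $1 increments
curve = p_value_function(personal, corporate, grid)

# The function peaks (p near 1) at the grid point closest to the
# observed difference in group means.
peak_delta, peak_p = max(curve, key=lambda point: point[1])
```

Plotting `curve` with hypothesized differences on the x-axis and p-values on the y-axis reproduces the tent shape discussed below.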

Note that the p-value functions in this article are created using a two-sided test, as is common in many applications. Imagine there are two groups: Group A and Group B. These could be customer segments or customers exposed to either the treatment or control of an A/B test, for instance. A two-sided test allows for Group A's average to be either larger or smaller than Group B's.

The p-value function can be seen above with the characteristic tent shape typical in many treatments of the function (for example, see references 4-8). While the functions produced in this article all follow this tent-shaped pattern and are all symmetric, this is not always the case.

There are a few things to note about the p-value function. First, it is centered at a CLV difference of $213, equal to the estimated effect size. It is worth asking why the p-value function is centered at the estimated effect size. Recall that the p-value is a measure of consistency between the data and a given hypothesis. Large p-values indicate that the collected data are more compatible with the hypothesis, while smaller p-values indicate that the data are less compatible. The estimated effect size is always the effect most compatible with the data and therefore when the hypothesis turns out to equal the estimated effect size calculated from the data, the p-value reaches its maximum value (close to 1.0).

A statistically nonsignificant p-value is often misinterpreted to mean that there is no difference between groups. A correct interpretation is that “no difference” is simply one of many values reasonably compatible with the observed data, where “reasonably compatible” typically means the value lies within the 95% confidence interval. Even when a result is statistically nonsignificant, as is the case with our insurance data, the effect size is always the value most compatible with the observed data. In our case the p-value is nonsignificant, but $213 is the best estimate for the CLV difference between personal and corporate auto insurance customers.

Another way to look at the p-value function is that it contains all possible confidence intervals. This is why the p-value function is also sometimes called a confidence interval function. Any horizontal line represents a confidence interval at the two points where it intersects the function. The confidence level of the associated interval is determined by the p-value on the y-axis and can be calculated as 1-p. For example, the horizontal line drawn at a p-value of 0.2 represents the 80% confidence interval at the points where that line intersects the function (1 - 0.2 = 0.8).
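This horizontal-slice reading can be sketched numerically. The $213 point estimate comes from the analysis above; the standard error of roughly $166.3 is an assumption, back-solved so that the null hypothesis of $0 yields the reported p-value of 0.2:

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

observed_diff = 213.0  # estimated CLV difference from the analysis
se = 166.3             # assumed standard error, back-solved from p = 0.2 at $0

# p-value function on a $1 grid of hypothesized CLV differences.
curve = {d: 2.0 * (1.0 - normal_cdf(abs(observed_diff - d) / se))
         for d in range(-500, 1001)}

def interval_from_curve(level):
    """Slice the function horizontally at p = 1 - level; the hypotheses
    on or above the line form the (level) confidence interval."""
    inside = [d for d, p in curve.items() if p >= 1.0 - level]
    return min(inside), max(inside)

lo80, hi80 = interval_from_curve(0.80)  # horizontal line at p = 0.20
lo95, hi95 = interval_from_curve(0.95)  # horizontal line at p = 0.05
```

With these assumed inputs the 80% slice lands near ($0, $426) and the 95% slice near (-$113, $539), up to $1 grid rounding, matching the intervals discussed in this article.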

A note of caution is in order, however. The calculation of 1-p is sometimes mistakenly used in business analysis to denote confidence in a result. For example, Adobe Target, a web A/B testing tool, translates a p-value into a metric called “confidence.” Therefore, a p-value of 0.01 would produce a “confidence” of 99% that the result is “real.” This is not a correct usage of the 1-p calculation. (Recall that the p-value cannot tell us anything about the plausibility of a hypothesis because it is calculated under the assumption that the hypothesis is true [9].) The p-value function allows the 1-p calculation because the result is a confidence interval, a well-defined statistical concept.

Another key aspect of the p-value function is that every hypothesis has a “sister” hypothesis that produces the same p-value (this has also been called an “entangled” hypothesis [10]). This is shown on the annotated p-value function below. For instance, the standard null hypothesis of no difference in CLV produces a p-value of 0.2. But so does a hypothesis of $426. This is one more way of seeing that a nonsignificant p-value does not imply the null hypothesis is true. Comparing p-values alone, there is no reason to prefer the interpretation that the CLV difference is $0 over the interpretation that it is $426; both have the same p-value.
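For a symmetric, normal-approximation p-value function like the ones in this article, the sister of any hypothesis h is simply the mirror point 2 × estimate − h, since both sit the same distance from the peak. The standard error below is the same back-solved assumption used earlier (the article does not report it directly):

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

THETA_HAT = 213.0  # point estimate from the analysis
SE = 166.3         # assumed standard error (back-solved from p = 0.2 at $0)

def p_value(h):
    """Two-sided p-value for hypothesized CLV difference h."""
    return 2.0 * (1.0 - normal_cdf(abs(THETA_HAT - h) / SE))

def sister(h):
    """Mirror hypothesis: equidistant from the peak, so same p-value
    (for a symmetric function)."""
    return 2.0 * THETA_HAT - h
```

Here `sister(0.0)` is 426.0, and `p_value(0.0)` equals `p_value(426.0)`: the null and its sister are equally compatible with the data.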

(Of course, as we have seen any CLV difference that results in a p-value greater than 0.2 is even more compatible with the data; these CLV differences fall in the region between $0 and $426).

Misuses of the p-value function

A standard null hypothesis significance testing view of p-values is annotated in the function below to contrast this misuse with the proper usage of the function. The chart below includes a horizontal line where the p-value equals 0.05. The two points where the 0.05 p-value line crosses the p-value function are equivalent to the standard 95% confidence interval. The lower CI bound crosses the p-value function at -$113 and the upper CI bound crosses the p-value function at $539. Those values are the same ones that are produced when calculating a 95% confidence interval for this data.

For those who are used to the traditional dichotomized terminology, the p-value of the shaded region is nonsignificant: these CLV differences cannot be ruled out by the standard 0.05 decision rule and so these hypotheses are not rejected. The unshaded region represents statistically significant CLV differences. Under the normal NHST conception the p-value is simply noted as significant or nonsignificant. This equates to whether the vertical black line at the null hypothesis of no difference is included in the shaded region (inside the 95% confidence interval).

However, note that the purpose of the p-value function is to expand beyond the dichotomized view and so the significant/nonsignificant interpretation is not recommended. There is no need to draw horizontal lines in particular places and note what intersects where. Nor is there any reason to split the function up into different regions. Instead, the whole function is meant to be used as a single integrated view of the estimated effect size and its precision.

Another misuse of the p-value function would be to select a single confidence level and use this in the traditional way. For example, suppose the insurance company had identified a CLV difference that met some predetermined cost/benefit threshold and thought an 80% confidence interval was sufficient for decision making. They could use the p-value function to directly read off whether the cost/benefit threshold was met at the specified confidence. Again, however, this is an incorrect usage of the function. Using a single confidence interval not only encourages dichotomized thinking, but it throws away the vast amount of information contained in the p-value function.

Another mistake is to confuse the p-value function with a probability distribution. This might be tempting because the function is built upon p-values, a probability-based measure. However, the probability distribution interpretation is incorrect. It is more useful to think of the function as measuring compatibility. The notion of p-value functions as compatibility measures follows a broader trend to think of confidence intervals themselves as “compatibility intervals” [10].

Proper usage of the p-value function

To understand proper usage, remember some of the key aspects of the p-value function:

  • It measures compatibility, not probability.

  • The peak of the “tent” always represents the estimated effect size most compatible with the observed data.

  • Values closer to the peak are more compatible with the observed data than values further from the peak.

  • The width of the “tent” represents the precision of the estimated effect size.

  • The entire function should be thought of as a single tool, not split apart into constituent pieces.

Let’s use the insurance policy type p-value function in an example. First, we can immediately see that the p-value function peaks at an estimated effect size of $213 in CLV difference, as we have noted before. This represents the center of the function. Second, the precision of that estimate can be immediately determined via visual inspection by looking at the width of the function. Whether the current precision is “good enough” depends on the context. As a rough precision baseline it can be noted that estimated CLV differences of $500 and -$100 are both reasonably compatible with the data. However, remember that all hypotheses under the curve are at least somewhat compatible, and this includes values as low as -$250 and as high as $750. That may be “too wide” or “just fine” depending on how this function will be used. The function can be narrowed by collecting more data, but of course that involves tradeoffs in the form of additional time, effort, and cost.

We can combine the two notions above — where the function is centered and its width — to consider the function’s overall position. The position of the function can be used to suggest a general interpretation most compatible with the data. In our opinion the general interpretation most compatible with the data is: “personal auto insurance customers have a larger CLV than corporate auto insurance customers by an amount that is in the low hundreds of dollars.” Where “low hundreds of dollars” might be in the neighborhood of (but not exactly) $100 to $400.

There are other interpretations one could take. For example, “personal auto insurance customers have a larger CLV than corporate auto insurance customers by an amount that is in the mid-to-high tens of dollars.” Looking at the p-value function this interpretation is somewhat compatible, but it is less compatible than the “low hundreds of dollars” interpretation because those values are further from the function’s peak. Another interpretation would be that “personal auto insurance customers have a larger CLV than corporate auto insurance customers by an amount that is in the high hundreds of dollars.” While not impossible, the p-value function shows us that interpretation is not very compatible with the observed data.

Additionally, the p-value function shows, for example, that the “low hundreds of dollars” interpretation is more compatible than a “no substantial CLV difference” interpretation (assuming “no substantial CLV difference” would correspond to a small range around $0). If a business leader expressed concern that the p-value was statistically nonsignificant and therefore argued that there was likely no CLV difference, the p-value function could be used to show them the “low hundreds of dollars” interpretation is more compatible with the data.

Following this style of conversation, business analysts and decision makers can use the p-value function to support arguments and have evidence-based conversations, despite no notion of statistical significance being brought into the picture.

P-value functions might be criticized as lacking objectivity or resulting in an unclear decision making framework unlike standard significance testing. It’s true that the p-value function is merely a tool that summarizes the results, putting emphasis on the effect size and its precision. It is incumbent upon the business analyst or decision maker to construct an argument for or against an action or strategy. The analyst’s informed judgement as well as other quantitative and qualitative aspects of the problem at hand are also crucial elements. An analyst’s interpretation of a p-value function may not always be convincing to everyone. That’s OK! Others may choose to interpret the same function differently. That’s also OK! Some business leaders might believe the “low hundreds of dollars” interpretation — along with other relevant analysis — is enough evidence to launch a pilot program to investigate the impact on revenue of exploiting the difference in policy type CLV, perhaps via growth of the personal auto insurance business unit. Other business leaders might not agree with that judgement, wanting to collect more data and increase the precision of the CLV difference estimate (decrease the width of the p-value function). Both points of view can be reasonable.

By contrast, how would an analyst make their case within the significance testing paradigm? The argument would stand or fall with a single p-value — which either reached statistical significance or didn’t — and a corresponding point estimate. No further intellectual engagement is encouraged and arguments are scant because results are presorted into dichotomous go/no-go decisions (significant/nonsignificant categorizations) [12].

Significance testing aims to end conversations; p-value functions aim to start them. Ending conversations might be considered a virtue for some businesses because more conversations mean more meetings. More time wasted. Higher costs. Less alignment and confidence in decisions. But the reduced energy put into decision making under significance testing comes at the hidden cost of being overconfident in suboptimal decisions. Sure, you will make some correct choices under the significance testing paradigm. But those don’t occur because statistical significance guided you toward objective truth. They occur either by accident or because judgement based on additional analysis allowed the business to overcome the flaws of the dichotomous significant/nonsignificant framework.

The width of p-value functions

Like the width of a confidence interval, the width of a p-value function depends on the sample size. Using the same insurance policy dataset a p-value function was calculated for the difference in CLV for customers with small and large vehicles (a total of 2,710 policy holders). This led to a difference in CLV of $540 between the two groups, with small vehicle owners having a higher CLV than large vehicle owners. The function was also wider than that produced from comparing personal and corporate auto insurance customers (this is shown in the chart in the next section).

The function was then modified using simulation. Random draws from a normal distribution were used to simulate policy holder CLV (each draw was one customer). The mean and standard deviation of the normal distribution were set to equal those from the vehicle size dataset. Three different sample sizes were used, varying from one thousand to one million policy holders. (The functions were also centered at $700 so they can be easily compared; random fluctuations from the simulation cause the distributions to naturally be offset.)

The width of the three p-value functions varies dramatically based on the sample size. The width of the curved part of the p-value function (the part that looks like a tent) is much narrower for the blue curve (with a larger sample size) than for the red curve (with a smaller sample size). Recall that the curved part of the function corresponds to hypotheses that are reasonably compatible with the data. There is more information (less variance) when the sample size is larger. The result is that with large sample sizes relatively few hypotheses are reasonably compatible with the data; with smaller sample sizes a relatively large number of hypotheses are compatible.
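The narrowing with sample size follows the familiar 1/√n pattern, which can be sketched directly. The standard deviation below is an assumed placeholder (the article does not report the CLV standard deviation), and equal group sizes are assumed for simplicity:

```python
import math

def compatible_width(sd, n, z=1.959964):
    """Width of the 95%-compatible region (hypotheses with p >= 0.05) for a
    difference in means between two groups of size n with common
    standard deviation sd, under a normal approximation."""
    se = sd * math.sqrt(2.0 / n)  # standard error of the mean difference
    return 2.0 * z * se

SD = 2000.0  # assumed CLV standard deviation (placeholder value)

# Widths of the tent for the three simulated sample sizes.
widths = {n: compatible_width(SD, n) for n in (1_000, 10_000, 1_000_000)}
```

Because the standard error scales as 1/√n, each 100x increase in sample size narrows the tent by a factor of 10: relatively few hypotheses remain compatible with the data at the largest sample size.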

In the example above suppose someone from accounting had determined that there is “naturally” a larger CLV for small vehicle drivers of between, say, $650 and $700 due to smaller vehicles lasting longer than larger ones. The question then becomes: is there an additional component of the CLV difference above and beyond this natural rate, perhaps because small vehicle owners are marketed to differently? Using the blue p-value function (with the largest sample size) shows that CLV differences around $650 are not very compatible with the data whereas CLV differences around $700 are quite compatible. From this information alone it is inconclusive whether there is a CLV difference above the natural rate, since the range outlined by accounting spans both the flat and curved parts of the p-value function. However, if accounting could narrow the natural CLV difference to a more precise interval, say, $650 to $655, the p-value function would show this range is not very compatible with the data. Therefore, the difference in CLV is possibly due to some factor unaccounted for in the natural difference calculated by the accounting department.

Similar examples could be used with the red and yellow functions. The red p-value function has a large range of hypotheses compatible with the data due to its relatively low sample size. Because the p-value function makes the precision of the estimate readily apparent, the insurance company could understand whether more data needs to be collected to increase the precision of the estimated CLV differences. For prospective analysis, simulation can be used to get a sense of the p-value function’s shape and width before any data is collected.

Likewise, the p-value function can also be used in the opposite way. By visual inspection analysts could determine a range of CLV differences they thought were reasonably compatible with the data. They might then proactively take that range to the accounting department to see if it corresponded with a “natural” CLV difference the accountants had calculated. Or the range might simply be evaluated using business judgement and past experience and act as only one tool in a broader framework of decision making.

Comparing p-value functions

P-value functions can also be compared [13, 14]. Suppose the insurance company were planning to target one of two customer segments: either the segmentation of customers by policy type (discussed in the first sections) or the segmentation of customers by vehicle size (discussed in the previous section). The insurance company believes it could exploit the higher CLV of small vehicle owners, perhaps urging its customers to switch to smaller cars because doing so can save them money on fuel expenditures. A separate strategy would be used to target the policy type segmentation: shifting resources from corporate customers and instead growing personal auto insurance (since personal auto insurance customers have a higher CLV). One part of this strategic decision is to compare the potential increase in revenue of the two opportunities. Revenue in turn is highly influenced by the CLV of different customer segments [15]. A potential starting point to investigate the CLV impacts would be to plot the two p-value functions side by side as shown below. P-value functions can be compared if the estimates are in the same units (in this case CLV differences).

Overall, segmenting customers by vehicle size produces a larger difference in CLV than segmenting customers by policy type. However, notice that the policy type p-value function has a more precise estimate than the vehicle size function due to the different sample sizes. Even though there is separation between the centers of the two functions, the width of the vehicle size function causes a substantial amount of overlap. Still, based on the two functions alone, we would moderately favor the strategy focused on vehicle size segmentation. Again, others might disagree (and that’s OK).


Code samples

Code examples and technical documentation can be found here.

References and theoretical basis

References:

  1. A/B web testing is just an application of randomized controlled trials (RCTs) applied to the web. Several of the sources below cover p-value functions in RCT-style medical studies. See, for example, Kenneth Rothman in chapter 8 of his book Epidemiology: An Introduction or Daniel Mark, Kerry Lee, and Frank Harrell in “Understanding the Role of P Values and Hypothesis Tests in Clinical Research.”

  2. Daniel Berrar, “Confidence curves: an alternative to null hypothesis significance testing for the comparison of classifiers”, Machine Learning, 2016 [link]

  3. For example, see this list of quotations critical of standard significance testing or the many articles cited on our references page.

  4. Charles Poole, “Beyond the Confidence Interval”, American Journal of Public Health, 1987 [link]

  5. Kevin Sullivan and David Foster, “Use of Confidence Interval Function”, Epidemiology, 1990 [link]

  6. Kenneth Rothman, Epidemiology: An Introduction, Oxford University Press, 2012 [link]

  7. Daniel Berrar, “Confidence curves: an alternative to null hypothesis significance testing for the comparison of classifiers”, Machine Learning, 2016 [link]

  8. Daniel Mark, Kerry Lee, and Frank Harrell, “Understanding the Role of P Values and Hypothesis Tests in Clinical Research”, JAMA Cardiology, 2016 [link]

  9. Greenland et al., “Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations”, European Journal of Epidemiology, 2016 [link]

  10. Daniel Berrar, “Confidence curves: an alternative to null hypothesis significance testing for the comparison of classifiers”, Machine Learning, 2016 [link]

  11. Amrhein et al., “Scientists rise up against statistical significance”, Nature, 2019 [link]

  12. This paragraph and the one above it are adapted from: Daniel Berrar, “Confidence curves: an alternative to null hypothesis significance testing for the comparison of classifiers”, Machine Learning, 2016 [link]

  13. Kevin Sullivan and David Foster, “Use of Confidence Interval Function”, Epidemiology, 1990 [link]

  14. Daniel Berrar, “Confidence curves: an alternative to null hypothesis significance testing for the comparison of classifiers”, Machine Learning, 2016 [link]

  15. Estrella-Ramón et al., “A marketing view of customer value: Customer lifetime value and customer equity”, South African Journal of Business Management, 2013 [link]