p-value

In null-hypothesis significance testing, the $p$-value^{[note 1]} is the probability of obtaining test results at least as extreme as the result actually observed, under the assumption that the null hypothesis is correct.^[2]^[3] A very small p-value means that such an extreme observed outcome would be very unlikely under the null hypothesis. Even though reporting p-values of statistical tests is common practice in academic publications in many quantitative fields, misinterpretation and misuse of p-values are widespread and have been a major topic in mathematics and metascience.^[4]^[5] In 2016, the American Statistical Association (ASA) made a formal statement that "p-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone" and that "a p-value, or statistical significance, does not measure the size of an effect or the importance of a result" or "evidence regarding a model or hypothesis".^[6] That said, a 2019 ASA task force issued a statement on statistical significance and replicability, concluding with: "p-values and significance tests, when properly applied and interpreted, increase the rigor of the conclusions drawn from data".^[7]

Not to be confused with the P-factor.

Basic concepts

In statistics, every conjecture concerning the unknown probability distribution of a collection of random variables representing the observed data $X$ in some study is called a statistical hypothesis. If we state one hypothesis only and the aim of the statistical test is to see whether this hypothesis is tenable, but not to investigate other specific hypotheses, then such a test is called a null hypothesis test.

As our statistical hypothesis will, by definition, state some property of the distribution, the null hypothesis is the default hypothesis under which that property does not exist. The null hypothesis is typically that some parameter (such as a correlation or a difference between means) in the populations of interest is zero. Our hypothesis might specify the probability distribution of $X$ precisely, or it might only specify that it belongs to some class of distributions. Often, we reduce the data to a single numerical statistic, e.g., $T$, whose marginal probability distribution is closely connected to a main question of interest in the study.

The p-value is used in the context of null hypothesis testing in order to quantify the statistical significance of a result, the result being the observed value of the chosen statistic $T$.^{[note 2]} The lower the p-value is, the lower the probability of getting that result if the null hypothesis were true. A result is said to be statistically significant if it allows us to reject the null hypothesis. All other things being equal, smaller p-values are taken as stronger evidence against the null hypothesis.

Loosely speaking, rejection of the null hypothesis implies that there is sufficient evidence against it.

As a particular example, if a null hypothesis states that a certain summary statistic $T$ follows the standard normal distribution ${\mathcal {N}}(0,1),$ then the rejection of this null hypothesis could mean that (i) the mean of $T$ is not 0, or (ii) the variance of $T$ is not 1, or (iii) $T$ is not normally distributed. Different tests of the same null hypothesis would be more or less sensitive to different alternatives. However, even if we do manage to reject the null hypothesis for all 3 alternatives, and even if we know that the distribution is normal and variance is 1, the null hypothesis test does not tell us which non-zero values of the mean are now most plausible. The more independent observations from the same probability distribution one has, the more accurate the test will be, and the higher the precision with which one will be able to determine the mean value and show that it is not equal to zero; but this will also increase the importance of evaluating the real-world or scientific relevance of this deviation.
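The interaction between sample size and significance can be illustrated with a minimal sketch. It assumes a two-sided z-test with a known standard deviation of 1 and a hypothetical observed sample mean of 0.02; the function name and all numbers are illustrative assumptions, not values from any study.

```python
from math import erfc, sqrt

def two_sided_z_p_value(sample_mean, n, sigma=1.0, mu0=0.0):
    """Two-sided p-value for H0: mean == mu0, assuming a known standard deviation."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # For a standard normal Z, Pr(|Z| >= |z|) equals erfc(|z| / sqrt(2)).
    return erfc(abs(z) / sqrt(2.0))

# The same small observed deviation (hypothetical) at increasing sample sizes.
for n in (100, 10_000, 100_000):
    print(n, two_sided_z_p_value(0.02, n))
```

Under these assumptions the same small deviation is far from significant at n = 100 but falls below 0.05 at n = 10,000, which is why the practical relevance of the deviation has to be judged separately from the p-value.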

Definition and interpretation

Definition

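The p-value is the probability under the null hypothesis of obtaining a real-valued test statistic at least as extreme as the one obtained. Consider an observed test statistic $t$ from an unknown distribution $T$. Then the p-value $p$ is the prior probability of observing a test-statistic value at least as "extreme" as $t$ if the null hypothesis $H_{0}$ were true. That is:

$p=\Pr(T\geq t\mid H_{0})$ for a one-sided right-tail test,

$p=\Pr(T\leq t\mid H_{0})$ for a one-sided left-tail test,

$p=2\min\{\Pr(T\geq t\mid H_{0}),\Pr(T\leq t\mid H_{0})\}$ for a two-sided test. If the distribution of $T$ is symmetric about zero, then $p=\Pr(|T|\geq |t|\mid H_{0}).$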
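As a minimal sketch of this definition, the one-sided and two-sided tail probabilities can be computed from the cumulative distribution function of the statistic under the null hypothesis. The helper name `p_values`, the choice of a standard normal null distribution, and the observed value 1.96 are assumptions made only for illustration.

```python
from statistics import NormalDist

def p_values(t, null_cdf):
    """Return (right-tail, left-tail, two-sided) p-values for an observed statistic t.

    null_cdf(x) must return Pr(T <= x | H0); T is assumed to be continuous,
    so Pr(T >= t | H0) can be taken as 1 - null_cdf(t).
    """
    left = null_cdf(t)
    right = 1.0 - left
    two_sided = min(1.0, 2.0 * min(left, right))
    return right, left, two_sided

# Hypothetical observed statistic under an assumed standard normal null distribution.
right, left, two_sided = p_values(1.96, NormalDist().cdf)
print(right, left, two_sided)
```

For a discrete statistic the two one-sided probabilities both include the observed value, so they would have to be summed from the probability mass function rather than derived from a single CDF call.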

Calculation

Usually, $T$ is a test statistic. A test statistic is the output of a scalar function of all the observations. This statistic provides a single number, such as a t-statistic or an F-statistic. As such, the test statistic follows a distribution determined by the function used to define that test statistic and the distribution of the input observational data.
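To make the idea of a scalar function of all the observations concrete, the following sketch computes a one-sample t-statistic. The function name, the hypothesized mean, and the data are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t_statistic(observations, mu0=0.0):
    """Reduce a sample to a single number: (sample mean - mu0) / standard error."""
    n = len(observations)
    standard_error = stdev(observations) / sqrt(n)  # stdev uses the n - 1 denominator
    return (mean(observations) - mu0) / standard_error

data = [5.1, 4.9, 5.6, 5.8, 5.2, 5.4]  # hypothetical measurements
t = one_sample_t_statistic(data, mu0=5.0)
print(t)
```

Turning this statistic into a p-value additionally requires the CDF of Student's t-distribution with n − 1 degrees of freedom, which is not part of the Python standard library and is therefore left out of the sketch.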

For the important case in which the data are hypothesized to be a random sample from a normal distribution, depending on the nature of the test statistic and the hypotheses of interest about its distribution, different null hypothesis tests have been developed. Some such tests are the z-test for hypotheses concerning the mean of a normal distribution with known variance, the t-test based on Student's t-distribution of a suitable statistic for hypotheses concerning the mean of a normal distribution when the variance is unknown, the F-test based on the F-distribution of yet another statistic for hypotheses concerning the variance. For data of other nature, for instance, categorical (discrete) data, test statistics might be constructed whose null hypothesis distribution is based on normal approximations to appropriate statistics obtained by invoking the central limit theorem for large samples, as in the case of Pearson's chi-squared test.
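The large-sample approach for categorical data can be sketched for the simplest case of two categories, where Pearson's statistic has one degree of freedom and its tail probability reduces to a normal tail. The function name and the counts are hypothetical, and the restriction to two categories is an assumption made to keep the example within the standard library.

```python
from math import erfc, sqrt

def pearson_chi2_two_categories(observed, expected):
    """Pearson's statistic and its approximate p-value for exactly two categories."""
    if len(observed) != 2 or len(expected) != 2:
        raise ValueError("this sketch only handles two categories (1 degree of freedom)")
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With 1 degree of freedom, Pr(chi2 >= x) = Pr(|Z| >= sqrt(x)) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(statistic / 2.0))
    return statistic, p_value

# Hypothetical counts: 14 and 6 observed where 10 and 10 are expected under H0.
statistic, p_value = pearson_chi2_two_categories([14, 6], [10, 10])
print(statistic, p_value)
```

Because the null distribution here is an approximation, the resulting p-value generally differs from the one given by an exact test on the same counts, especially for small samples.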

Thus computing a p-value requires a null hypothesis, a test statistic (together with deciding whether the researcher is performing a one-tailed test or a two-tailed test), and data. Even though computing the test statistic on given data may be easy, computing the sampling distribution under the null hypothesis, and then computing its cumulative distribution function (CDF) is often a difficult problem. Today, this computation is done using statistical software, often via numeric methods (rather than exact formulae), but, in the early and mid 20th century, this was instead done via tables of values, and one interpolated or extrapolated p-values from these discrete values. Rather than using a table of p-values, Fisher instead inverted the CDF, publishing a list of values of the test statistic for given fixed p-values; this corresponds to computing the quantile function (inverse CDF).
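The tabulation of test-statistic values for fixed p-values corresponds to evaluating the quantile function, as the following sketch shows. It assumes a standard normal null distribution and a two-sided test; the function name and the chosen p-values are illustrative.

```python
from statistics import NormalDist

def two_sided_critical_value(p, null=NormalDist()):
    """Value c such that Pr(|T| >= c | H0) = p for a symmetric null distribution."""
    # Invert the CDF at the upper-tail point that leaves p / 2 in each tail.
    return null.inv_cdf(1.0 - p / 2.0)

# A small table in the spirit of printed critical-value tables.
for p in (0.10, 0.05, 0.01):
    print(p, round(two_sided_critical_value(p), 3))
```

A reader of such a table compares the observed statistic with the listed critical values and reports only that the p-value lies below or above the corresponding threshold.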

As a worked example, consider testing whether a coin is fair.

Null hypothesis (H₀): The coin is fair, with Pr(heads) = 0.5.

Test statistic: Number of heads.

Alpha level (designated threshold of significance): 0.05.

Observation O: 14 heads out of 20 flips.

Two-tailed p-value of observation O given H₀ = 2 × min(Pr(no. of heads ≥ 14), Pr(no. of heads ≤ 14)) = 2 × min(0.0577, 0.9793) = 2 × 0.0577 = 0.115.

Because 0.115 exceeds the alpha level of 0.05, the null hypothesis that the coin is fair is not rejected.
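The tail probabilities in the coin example can be checked with an exact binomial calculation. The sketch below uses only the standard library; the function name is an assumption made for illustration.

```python
from math import comb

def binomial_tail_probs(k, n, prob=0.5):
    """Return (Pr(X >= k), Pr(X <= k)) for X ~ Binomial(n, prob)."""
    pmf = [comb(n, i) * prob**i * (1 - prob) ** (n - i) for i in range(n + 1)]
    upper = sum(pmf[k:])        # includes the observed count k
    lower = sum(pmf[: k + 1])   # also includes the observed count k
    return upper, lower

upper, lower = binomial_tail_probs(14, 20)
p_two_sided = min(1.0, 2 * min(upper, lower))
print(round(upper, 4), round(lower, 4), round(p_two_sided, 4))
# prints: 0.0577 0.9793 0.1153
```

Both tails include the observed count of 14, so the two one-sided probabilities sum to more than 1; this is expected for a discrete test statistic.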

Related indices

The E-value can refer to two concepts, both of which are related to the p-value and both of which play a role in multiple testing. First, it corresponds to a generic, more robust alternative to the p-value that can deal with optional continuation of experiments. Second, it is also used to abbreviate "expect value", which is the expected number of times that one expects to obtain a test statistic at least as extreme as the one that was actually observed if one assumes that the null hypothesis is true.^[46] This expect-value is the product of the number of tests and the p-value.
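In its second sense, the expect value is a simple product, as the following sketch shows. The function name and the numbers are hypothetical.

```python
def expect_value(p_value, number_of_tests):
    """Expected number of results at least this extreme across all tests, under H0."""
    return number_of_tests * p_value

# Hypothetical search: 5,000 tests, with the best result having p = 0.0001.
print(expect_value(1e-4, 5000))
```

Unlike a p-value, this quantity is not bounded by 1; values well below 1 indicate that a result this extreme would rarely arise by chance anywhere among the tests performed.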

The q-value is the analog of the p-value with respect to the positive false discovery rate.^[47] It is used in multiple hypothesis testing to maintain statistical power while minimizing the false positive rate.^[48]
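One common way to estimate q-values from a list of p-values is sketched below. It assumes, conservatively, that the proportion of true null hypotheses (`pi0`) is 1, in which case the estimates coincide with Benjamini–Hochberg adjusted p-values. The function name, the `pi0` parameter, and the input values are illustrative assumptions rather than a definitive implementation.

```python
def q_values(p_values, pi0=1.0):
    """Estimate a q-value for each p-value, assuming a proportion pi0 of true nulls."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p-value
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so that each q-value is the smallest
    # estimated false discovery rate over all thresholds at or above that p-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * p_values[i] / rank)
        q[i] = running_min
    return q

print(q_values([0.01, 0.04, 0.03, 0.20]))  # hypothetical p-values
```

Rejecting every hypothesis whose q-value is at most a chosen level then targets a false discovery rate of approximately that level, under the assumptions stated above.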

The Probability of Direction (pd) is the Bayesian numerical equivalent of the p-value.^[49] It corresponds to the proportion of the posterior distribution that is of the median's sign, typically varying between 50% and 100%, and representing the certainty with which an effect is positive or negative.
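Given draws from a posterior distribution, the probability of direction can be approximated as the larger of the proportions of strictly positive and strictly negative draws, which matches the share of draws that have the median's sign whenever the median is non-zero. The function name and the sample values in this sketch are hypothetical.

```python
def probability_of_direction(posterior_samples):
    """Share of posterior draws lying on the dominant side of zero."""
    n = len(posterior_samples)
    positive = sum(1 for x in posterior_samples if x > 0) / n
    negative = sum(1 for x in posterior_samples if x < 0) / n
    return max(positive, negative)

# Hypothetical posterior draws for an effect; four of the five are positive.
draws = [-0.5, 0.2, 0.4, 0.9, 1.3]
print(probability_of_direction(draws))  # prints: 0.8
```

In practice the draws would come from a sampler such as Markov chain Monte Carlo and would number in the thousands, which is an assumption about typical use rather than a requirement of the index.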

Second-generation p-values extend the concept of p-values by not considering extremely small, practically irrelevant effect sizes as significant.^[50]

See also

Student's t-test

Bonferroni correction

Counternull

Fisher's method of combining p-values

Generalized p-value

Harmonic mean p-value

Holm–Bonferroni method

Multiple comparisons problem

p-rep

p-value fallacy
