Statistical hypothesis test

A statistical hypothesis test is a method of statistical inference used to decide whether the data sufficiently support a particular hypothesis. A statistical hypothesis test typically involves a calculation of a test statistic. Then a decision is made, either by comparing the test statistic to a critical value or equivalently by evaluating a p-value computed from the test statistic. Roughly 100 specialized statistical tests have been defined.^[1]^[2]
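The equivalence between the two decision routes can be illustrated with a short sketch. It assumes a two-sided one-sample z-test with a known population standard deviation; the function name `z_test` and all numeric inputs are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-sided one-sample z-test with known population standard deviation."""
    std_normal = NormalDist()
    # Test statistic: standardized distance of the sample mean from the null mean.
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # Route 1: compare the statistic with a critical value.
    critical = std_normal.inv_cdf(1 - alpha / 2)
    # Route 2: compute a p-value from the statistic and compare it with alpha.
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    return {
        "z": z,
        "critical": critical,
        "p_value": p_value,
        "reject_by_critical": abs(z) > critical,
        "reject_by_p_value": p_value < alpha,
    }


# Hypothetical data: sample mean 103 from n = 100, null mean 100, sigma 15.
result = z_test(sample_mean=103.0, mu0=100.0, sigma=15.0, n=100)
print(result)
```

With these inputs the statistic is z = 2.0, and both routes lead to the same decision at the 5% level.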

Philosophy

Hypothesis testing and philosophy intersect. Inferential statistics, which includes hypothesis testing, is applied probability. Both probability and its application are intertwined with philosophy. Philosopher David Hume wrote, "All knowledge degenerates into probability." Competing practical definitions of probability reflect philosophical differences. The most common application of hypothesis testing is in the scientific interpretation of experimental data, which is naturally studied by the philosophy of science.

Fisher and Neyman opposed the subjectivity of probability. Their views contributed to the objective definitions. The core of their historical disagreement was philosophical.

Many of the philosophical criticisms of hypothesis testing are discussed by statisticians in other contexts, particularly correlation does not imply causation and the design of experiments. Hypothesis testing is of continuing interest to philosophers.^[15]^[20]

"The Geiger-counter reading is 10. The limit is 9. Check the suitcase."

"The Geiger-counter reading is high; 97% of safe suitcases have lower readings. The limit is 95%. Check the suitcase."
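The two reports above express the same decision in different ways: the first compares a raw count with a count limit, the second compares a percentile with a percentage limit. The following sketch shows how the two could be linked. It assumes that counts from a safe suitcase follow a Poisson distribution with a mean of 5; both the distribution and the mean are hypothetical, chosen so that the sketch reproduces the figures quoted in the reports.

```python
from math import exp, factorial

BACKGROUND_MEAN = 5.0  # hypothetical mean count for a safe suitcase
LEVEL = 0.95           # the "limit is 95%" from the second report


def poisson_cdf(k, mean):
    """P(X <= k) for a Poisson count X with the given mean."""
    return sum(exp(-mean) * mean**i / factorial(i) for i in range(k + 1))


def count_limit(mean, level):
    """Smallest count c such that at least `level` of safe suitcases read c or less."""
    c = 0
    while poisson_cdf(c, mean) < level:
        c += 1
    return c


reading = 10
limit = count_limit(BACKGROUND_MEAN, LEVEL)
# Fraction of safe suitcases with a strictly lower reading: P(X < reading).
fraction_lower = poisson_cdf(reading - 1, BACKGROUND_MEAN)

print(limit)                     # count limit used by the first report
print(round(fraction_lower, 2))  # percentile used by the second report
print(reading > limit, fraction_lower > LEVEL)
```

Under these assumptions both rules flag the suitcase: the reading exceeds the count limit, and the fraction of safe suitcases with lower readings exceeds the 95% level.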

The following definitions are mainly based on the exposition in the book by Lehmann and Romano:^[35]

Statistical hypothesis: A statement about the parameters describing a population (not a sample).

Test statistic: A value calculated from a sample without any unknown parameters, often to summarize the sample for comparison purposes.

Simple hypothesis: Any hypothesis which specifies the population distribution completely.

Composite hypothesis: Any hypothesis which does not specify the population distribution completely.

Null hypothesis (H₀)

Positive data: Data that enable the investigator to reject a null hypothesis.

Alternative hypothesis (H₁)

A statistical hypothesis test compares a test statistic (z or t, for example) to a threshold. The test statistic (the formula found in the table below) is based on optimality. For a fixed Type I error rate, use of these statistics minimizes the Type II error rate (equivalent to maximizing power). The following terms describe tests in terms of such optimality:

Null hypothesis: $H_{0}: p = \tfrac{1}{4}$ (just guessing)
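A null hypothesis of this form can be tested with an exact binomial calculation. The sketch below is illustrative only: it assumes a hypothetical experiment of 25 independent guesses, each correct with probability p, and a one-sided alternative p > 1/4. The number of trials and the counts of correct guesses are not taken from the text.

```python
from math import comb


def binomial_upper_tail(n, c, p):
    """P(X >= c) for X ~ Binomial(n, p): the one-sided p-value for c successes."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c, n + 1))


N_TRIALS = 25   # hypothetical number of guesses
P_NULL = 0.25   # H0: p = 1/4 (just guessing)

# Hypothetical outcomes: 10 or 13 correct guesses out of 25.
for correct in (10, 13):
    p_value = binomial_upper_tail(N_TRIALS, correct, P_NULL)
    print(correct, p_value)
```

A small tail probability indicates that the observed number of correct guesses would be unusual if the subject were only guessing.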

Variations and sub-classes

Statistical hypothesis testing is a key technique of both frequentist inference and Bayesian inference, although the two types of inference have notable differences. Statistical hypothesis tests define a procedure that controls (fixes) the probability of incorrectly deciding that a default position (null hypothesis) is incorrect. The procedure is based on how likely it would be for a set of observations to occur if the null hypothesis were true. This probability of making an incorrect decision is not the probability that the null hypothesis is true, nor the probability that any specific alternative hypothesis is true. This contrasts with other possible techniques of decision theory in which the null and alternative hypotheses are treated on a more equal basis.

One naïve Bayesian approach to hypothesis testing is to base decisions on the posterior probability,^[57]^[58] but this fails when comparing point and continuous hypotheses. Other approaches to decision making, such as Bayesian decision theory, attempt to balance the consequences of incorrect decisions across all possibilities, rather than concentrating on a single null hypothesis. A number of other approaches to reaching a decision based on data are available via decision theory and optimal decisions, some of which have desirable properties. Hypothesis testing, though, is a dominant approach to data analysis in many fields of science. Extensions to the theory of hypothesis testing include the study of the power of tests, i.e. the probability of correctly rejecting the null hypothesis given that it is false. Such considerations can be used for the purpose of sample size determination prior to the collection of data.
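The link between power and sample size determination can be sketched for one simple case. The example assumes a one-sided z-test with known standard deviation, for which the required sample size is approximately $n = \left((z_{1-\alpha} + z_{1-\beta})\,\sigma/\delta\right)^2$; the function names, the effect size `delta` and the other inputs are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_one_sided_z(delta, sigma, n, alpha=0.05):
    """Power of a one-sided z-test when the true mean exceeds the null mean by delta."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    # Probability that the test statistic exceeds the critical value under the alternative.
    return STD_NORMAL.cdf(delta * sqrt(n) / sigma - z_alpha)


def required_sample_size(delta, sigma, alpha=0.05, target_power=0.80):
    """Smallest n whose power reaches target_power for the same test."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    z_beta = STD_NORMAL.inv_cdf(target_power)
    return ceil(((z_alpha + z_beta) * sigma / delta) ** 2)


# Hypothetical planning inputs: detect a shift of half a standard deviation.
n = required_sample_size(delta=0.5, sigma=1.0)
print(n, power_one_sided_z(0.5, 1.0, n))
```

Smaller effects or higher target power increase the required sample size, which is why such calculations are carried out before data are collected.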

Neyman–Pearson hypothesis testing

An example of Neyman–Pearson hypothesis testing (or null hypothesis statistical significance testing) can be made by a change to the radioactive suitcase example. If the "suitcase" is actually a shielded container for the transportation of radioactive material, then a test might be used to select among three hypotheses: no radioactive source present, one present, two (all) present. The test could be required for safety, with actions required in each case. The Neyman–Pearson lemma of hypothesis testing says that a good criterion for the selection of hypotheses is the ratio of their probabilities (a likelihood ratio). A simple method of solution is to select the hypothesis with the highest probability for the Geiger counts observed. The typical result matches intuition: few counts imply no source, many counts imply two sources and intermediate counts imply one source. Note also that proving a negative is usually problematic. Null hypotheses should be at least falsifiable.
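The simple method of solution described above can be sketched as follows. The example assumes that Geiger counts are Poisson distributed and assigns hypothetical mean counts to each hypothesis; the means and the function names are illustrative and not taken from the text.

```python
from math import lgamma, log

# Hypothetical mean Geiger counts when 0, 1 or 2 sources are present.
MEAN_COUNTS = {0: 5.0, 1: 20.0, 2: 35.0}


def poisson_log_likelihood(count, mean):
    """Log of the Poisson probability of observing `count` given `mean`."""
    return count * log(mean) - mean - lgamma(count + 1)


def most_likely_sources(count):
    """Select the hypothesis under which the observed count is most probable."""
    return max(
        MEAN_COUNTS,
        key=lambda h: poisson_log_likelihood(count, MEAN_COUNTS[h]),
    )


# Few, intermediate and many counts.
for count in (3, 20, 40):
    print(count, most_likely_sources(count))
```

Under these assumed means, the selection follows the intuitive pattern: the low count selects no source, the intermediate count selects one source and the high count selects two.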

Neyman–Pearson theory can accommodate both prior probabilities and the costs of actions resulting from decisions.^[59] The former allows each test to consider the results of earlier tests (unlike Fisher's significance tests). The latter allows the consideration of economic issues (for example) as well as probabilities. A likelihood ratio remains a good criterion for selecting among hypotheses.
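One way prior probabilities and costs could enter such a decision is sketched below. This is an illustrative decision-theoretic calculation, not a construction taken from the cited source: the Poisson model, the mean counts, the prior probabilities and the costs are all hypothetical.

```python
from math import exp, lgamma, log

MEAN_COUNTS = [5.0, 20.0, 35.0]   # hypothetical mean counts for 0, 1, 2 sources
PRIORS = [0.98, 0.015, 0.005]     # hypothetical prior probabilities
MISS_COST = 1000.0                # hypothetical cost of deciding too few sources
FALSE_ALARM_COST = 1.0            # hypothetical cost of deciding too many sources


def log_likelihood(count, mean):
    """Log of the Poisson probability of observing `count` given `mean`."""
    return count * log(mean) - mean - lgamma(count + 1)


def posteriors(count):
    """Posterior probability of each hypothesis given the observed count."""
    weights = [p * exp(log_likelihood(count, m)) for p, m in zip(PRIORS, MEAN_COUNTS)]
    total = sum(weights)
    return [w / total for w in weights]


def cost(decided, actual):
    """Cost of deciding `decided` sources when `actual` sources are present."""
    if decided == actual:
        return 0.0
    return MISS_COST if decided < actual else FALSE_ALARM_COST


def most_probable(count):
    """Decision using priors only: the hypothesis with the highest posterior."""
    post = posteriors(count)
    return max(range(len(post)), key=lambda h: post[h])


def least_costly(count):
    """Decision using priors and costs: the lowest expected cost."""
    post = posteriors(count)
    return min(
        range(len(post)),
        key=lambda d: sum(post[a] * cost(d, a) for a in range(len(post))),
    )


print(most_probable(11), least_costly(11))
```

With these assumed inputs, a count of 11 is most probably background once the priors are considered, yet the high cost of missing a source shifts the least costly decision to "one source present".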

The two forms of hypothesis testing are based on different problem formulations. The original test is analogous to a true/false question; the Neyman–Pearson test is more like multiple choice. In the view of Tukey^[60] the former produces a conclusion on the basis of only strong evidence while the latter produces a decision on the basis of available evidence. While the two tests seem quite different both mathematically and philosophically, later developments lead to the opposite claim. Consider many tiny radioactive sources. The hypotheses become 0,1,2,3... grains of radioactive sand. There is little distinction between none or some radiation (Fisher) and 0 grains of radioactive sand versus all of the alternatives (Neyman–Pearson). The major Neyman–Pearson paper of 1933^[11] also considered composite hypotheses (ones whose distribution includes an unknown parameter). An example proved the optimality of the (Student's) t-test, "there can be no better test for the hypothesis under consideration" (p 321). Neyman–Pearson theory was proving the optimality of Fisherian methods from its inception.

Fisher's significance testing has proven a popular and flexible statistical tool in application, though with little potential for mathematical growth. Neyman–Pearson hypothesis testing is claimed as a pillar of mathematical statistics,^[61] creating a new paradigm for the field. It also stimulated new applications in statistical process control, detection theory, decision theory and game theory. Both formulations have been successful, but the successes have been of a different character.

The dispute over formulations is unresolved. Science primarily uses Fisher's (slightly modified) formulation as taught in introductory statistics. Statisticians study Neyman–Pearson theory in graduate school. Mathematicians are proud of uniting the formulations. Philosophers consider them separately. Learned opinions deem the formulations variously competitive (Fisher vs Neyman), incompatible^[9] or complementary.^[13] The dispute has become more complex since Bayesian inference has achieved respectability.

The terminology is inconsistent. Hypothesis testing can mean any mixture of two formulations that both changed with time. Any discussion of significance testing vs hypothesis testing is doubly vulnerable to confusion.

Fisher thought that hypothesis testing was a useful strategy for performing industrial quality control; however, he strongly disagreed that hypothesis testing could be useful for scientists.^[10] Hypothesis testing provides a means of finding test statistics used in significance testing.^[13] The concept of power is useful in explaining the consequences of adjusting the significance level and is heavily used in sample size determination. The two methods remain philosophically distinct.^[15] They usually (but not always) produce the same mathematical answer. The preferred answer is context dependent.^[13] While the existing merger of Fisher and Neyman–Pearson theories has been heavily criticized, modifying the merger to achieve Bayesian goals has been considered.^[62]

Criticism of statistical hypothesis testing fills volumes.^[63]^[64]^[65]^[66]^[67]^[68] Much of the criticism can be summarized by the following issues:

The interpretation of a p-value is dependent upon stopping rule and definition of multiple comparison. The former often changes during the course of a study and the latter is unavoidably ambiguous (i.e. "p values depend on both the (data) observed and on the other possible (data) that might have been observed but weren't").^[69]

Confusion resulting (in part) from combining the methods of Fisher and Neyman–Pearson, which are conceptually distinct.^[60]

Emphasis on statistical significance to the exclusion of estimation and confirmation by repeated experiments.^[70]

Rigidly requiring statistical significance as a criterion for publication, resulting in publication bias.^[71] Most of the criticism is indirect. Rather than being wrong, statistical hypothesis testing is misunderstood, overused and misused.

When used to detect whether a difference exists between groups, a paradox arises. As improvements are made to experimental design (e.g. increased precision of measurement and sample size), the test becomes more lenient. Unless one accepts the absurd assumption that all sources of noise in the data cancel out completely, the chance of finding statistical significance in either direction approaches 100%.^[72] However, this absurd assumption that the mean difference between two groups cannot be zero implies that the data cannot be independent and identically distributed (i.i.d.) because the expected difference between any two subgroups of i.i.d. random variates is zero; therefore, the i.i.d. assumption is also absurd.

Layers of philosophical concerns. The probability of statistical significance is a function of decisions made by experimenters/analysts.^[73] If the decisions are based on convention they are termed arbitrary or mindless^[74] while those not so based may be termed subjective. To minimize type II errors, large samples are recommended. In psychology practically all null hypotheses are claimed to be false for sufficiently large samples so "...it is usually nonsensical to perform an experiment with the sole aim of rejecting the null hypothesis."^[75] "Statistically significant findings are often misleading" in psychology.^[76] Statistical significance does not imply practical significance, and correlation does not imply causation. Casting doubt on the null hypothesis is thus far from directly supporting the research hypothesis.

"[I]t does not tell us what we want to know".^[77] Lists of dozens of complaints are available.^[67]^[78]^[79]
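The concern that almost any null hypothesis is rejected once the sample is large enough can be illustrated numerically. The sketch below assumes a two-sample z-test with equal group sizes and a known common standard deviation, and it treats the observed difference as exactly equal to a hypothetical, practically negligible true difference.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p_value(mean_difference, sigma, n):
    """Two-sided p-value for a difference in means between two groups of size n."""
    # Standard error of the difference between two independent sample means.
    standard_error = sigma * sqrt(2.0 / n)
    z = mean_difference / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


TINY_DIFFERENCE = 0.01  # hypothetical difference: 1% of one standard deviation

for n in (100, 10_000, 1_000_000):
    print(n, two_sided_p_value(TINY_DIFFERENCE, 1.0, n))
```

The difference itself never changes, yet the p-value shrinks steadily as n grows. Under these assumptions, statistical significance at large n reflects the sample size rather than the practical importance of the effect.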

Critics and supporters are largely in factual agreement regarding the characteristics of null hypothesis significance testing (NHST): while it can provide critical information, it is inadequate as the sole tool for statistical analysis. Successfully rejecting the null hypothesis may offer no support for the research hypothesis, although adequate research design can minimize this issue. The continuing controversy concerns the selection of the best statistical practices for the near-term future given the existing practices. Critics would prefer to ban NHST completely, forcing a complete departure from those practices,^[80] while supporters suggest a less absolute change.

Controversy over significance testing, and its effects on publication bias in particular, has produced several results. The American Psychological Association has strengthened its statistical reporting requirements after review,^[81] medical journal publishers have recognized the obligation to publish some results that are not statistically significant to combat publication bias,^[82] and a journal (Journal of Articles in Support of the Null Hypothesis) has been created to publish such results exclusively.^[83] Textbooks have added some cautions,^[84] and increased coverage of the tools necessary to estimate the size of the sample required to produce significant results. Few major organizations have abandoned use of significance tests although some have discussed doing so.^[81] For instance, in 2023, the editors of the Journal of Physiology "strongly recommend the use of estimation methods for those publishing in The Journal" (meaning the magnitude of the effect size (to allow readers to judge whether a finding has practical, physiological, or clinical relevance) and confidence intervals to convey the precision of that estimate), saying "Ultimately, it is the physiological importance of the data that those publishing in The Journal of Physiology should be most concerned with, rather than the statistical significance."^[85]

Lehmann E.L. (1992) "Introduction to Neyman and Pearson (1933) On the Problem of the Most Efficient Tests of Statistical Hypotheses". In: Breakthroughs in Statistics, Volume 1, (Eds Kotz, S., Johnson, N.L.), Springer-Verlag. ISBN 0-387-94037-5 (followed by reprinting of the paper)

Neyman, J.; Pearson, E.S. (1933). "On the Problem of the Most Efficient Tests of Statistical Hypotheses". Philosophical Transactions of the Royal Society A. 231 (694–706): 289–337. Bibcode:1933RSPTA.231..289N. doi:10.1098/rsta.1933.0009.
