intermediate 55 min read · May 8, 2026

Hypothesis Testing Framework

Null, alternative, power, p-values, and the z/t/χ² test family — the decision-theoretic core of classical inference.

17.1 From Estimation to Decisions

Track 4 built the classical estimation toolkit — point estimation, maximum likelihood, method of moments, sufficient statistics, Rao-Blackwell, UMVUE. Track 5 opens a different question. Estimation asks: given the data, what is our best guess for $\theta$? Testing asks: is the evidence strong enough to act against the status-quo claim $H_0$? The distinction is not cosmetic. Estimation optimizes one quantity (MSE, say); testing trades off two — the risk of rejecting $H_0$ when it is true, and the risk of failing to reject when it is false. This topic builds the framework for that trade-off and applies it to the three canonical parametric families (z, t, $\chi^2$), the binomial exact test, and the asymptotic trio (Wald, score, LRT). Topic 18 extends it with optimality theory (Neyman-Pearson, UMP, Wilks); Topic 19 with confidence intervals via the duality previewed in §17.10; Topic 20 with the multiple-testing correction needed once we admit that a single test rarely stands alone.

Two-panel opener. Left: conceptual diagram contrasting Fisher 1922 significance testing (single hypothesis, continuous evidence) with Neyman-Pearson 1933 decision theory (two hypotheses, two error types). Right: A/B testing cartoon — variants A (p_A = 0.10) and B (p_B = 0.11) with n = 1000 each; Monte Carlo histogram of p̂_B − p̂_A under H₀ overlaid with the shaded rejection region at α = 0.05, showing how "A/B testing is hypothesis testing".

Remark 1 Fisher 1922 vs Neyman-Pearson 1933 — two traditions in one framework

Fisher’s significance testing (1922) framed the p-value as a continuous measure of evidence against a single null hypothesis: small p-values are surprising, large p-values are not, and the analyst uses judgment. Neyman and Pearson’s decision-theoretic formulation (1933) recast testing as a choice between two hypotheses with two error types (Type I and Type II) and two controlled probabilities ($\alpha$ and $1 - \beta$). The modern textbook framework merges the two: a test statistic produces a p-value, and the p-value is compared to a pre-specified $\alpha$. Different communities weight the Fisherian and Neyman-Pearsonian views differently, and much of the replication-crisis discourse (Remark 6) stems from confusing the two. Topic 17 develops both threads: §17.5 treats the p-value on Fisher’s terms, while §17.3 and §17.4 set up size and power in the Neyman-Pearson frame.

Remark 2 A/B testing is hypothesis testing — the ML-native canonical example

The running example throughout Topic 17 is the two-variant A/B test. Variant A has conversion rate $p_A$; variant B has conversion rate $p_B$. The question is: does deploying variant B produce a higher conversion rate than the incumbent A? The null $H_0: p_B \le p_A$ is the skeptical baseline (no improvement); the alternative $H_1: p_B > p_A$ is the claim we might act on. Every experimentation platform — Optimizely, Statsig, internal platforms at Google / Meta / Netflix — runs some flavor of this test billions of times per year. Sample-size calculations (§17.4 Example 6), two-sample proportion z-tests (§17.6 Example 10), and the multiple-testing correction for many simultaneous experiments (§17.11 Remark 16) are the theoretical foundation of that infrastructure. This connects Topic 17 to formalML: A/B testing platforms.

Remark 3 Why testing is harder than estimation — two risks, not one

In estimation, we minimize a single risk (mean squared error, say). In testing, we manage two competing risks simultaneously: Type I error (rejecting a true $H_0$) and Type II error (failing to reject a false $H_0$). These pull in opposite directions — shrinking the rejection region reduces Type I error but increases Type II, and vice versa. No test can drive both to zero at fixed $n$; the entire Neyman-Pearson framework is built around this trade-off. The pragmatic convention is to fix $\alpha$ (the Type I rate) at a conventional level (0.05, 0.01, or 0.001) and then maximize power (1 − Type II rate) subject to that constraint. Topic 18 formalizes “maximize power” via the uniformly-most-powerful (UMP) property; here we set up the framework.

17.2 The Hypothesis-Testing Setup

Every hypothesis test is specified by four objects: the two hypotheses, a test statistic, and a rejection region. We formalize the hypotheses first.

Definition 1 Null and alternative hypotheses

Suppose $X_1, \ldots, X_n$ is a sample from a family $\{P_\theta : \theta \in \Theta\}$. A partition $\Theta = \Theta_0 \sqcup \Theta_1$ with $\Theta_0 \ne \emptyset$ and $\Theta_1 \ne \emptyset$ defines two hypotheses:

  • The null hypothesis $H_0: \theta \in \Theta_0$ — the claim we wish to disprove.
  • The alternative hypothesis $H_1: \theta \in \Theta_1$ — the claim we wish to act on.

A test is a decision rule that accepts one of $H_0$ or $H_1$ on the basis of the observed data. Note the asymmetry: $H_0$ is the default, retained unless the data provide strong evidence against it. “Failing to reject $H_0$” is not the same as “accepting $H_0$” — we remain agnostic, having merely failed to accumulate enough contrary evidence.

Definition 2 Simple vs composite hypotheses

A hypothesis $H_i: \theta \in \Theta_i$ is simple if $\Theta_i$ is a singleton (one parameter value) and composite otherwise. A test with simple $H_0$ and simple $H_1$ is the cleanest case — Topic 18’s Neyman-Pearson lemma gives a most powerful test for it. Most real-world tests are composite on at least one side: $H_0: \mu = 0$ (simple) vs $H_1: \mu \ne 0$ (composite two-sided) is typical.

Definition 3 One-sided vs two-sided alternatives

For a scalar $\theta \in \mathbb{R}$, the standard alternatives are:

  • One-sided right: $H_1: \theta > \theta_0$ (used when the analyst has a directional hypothesis — e.g., “the new drug helps”).
  • One-sided left: $H_1: \theta < \theta_0$ (the mirror image).
  • Two-sided: $H_1: \theta \ne \theta_0$ (the agnostic alternative — “something is different”).

One-sided tests are more powerful against alternatives in the specified direction, but they commit the analyst to that direction before seeing the data. Choosing the side after looking at the data is a form of p-hacking (Remark 6).

Three-panel taxonomy. Left: simple vs composite — a point in parameter space vs an interval. Center: one-sided vs two-sided — half-lines vs full-line-minus-a-point. Right: decision tree mapping data to reject / fail-to-reject, with the "fail to reject ≠ accept" caveat annotated.

Example 1 A/B test as a one-sided composite null

Variants A and B have conversion rates $p_A$ and $p_B$. The product team is considering deploying B only if it strictly improves on A. The hypotheses are

$$H_0: p_B \le p_A \qquad \text{vs} \qquad H_1: p_B > p_A.$$

Both are composite (each covers an interval in $p_B$ at fixed $p_A$). The relevant test statistic (§17.6 Example 10) is the pooled-variance two-sample proportion z-statistic. The one-sided framing captures the business constraint that the team is only interested in improvements, not in detecting arbitrary changes — and, as a bonus, gives more power at fixed $\alpha$ than a two-sided test against the same effect size.

Example 2 A/B test as a two-sided simple null

If the product team cares whether A and B differ at all — perhaps because a decrease would also prompt action (roll back B; investigate B’s unintended effects) — the two-sided framing is appropriate:

$$H_0: p_B = p_A \qquad \text{vs} \qquad H_1: p_B \ne p_A.$$

Here $H_0$ is simple (one parameter value: $p_A$, with $p_B$ forced to equal it) and $H_1$ is composite. The rejection region is the union of two tails, and at the same nominal $\alpha$ it has less power than Example 1 against a given directional effect — the price paid for detecting the opposite direction as well.

Example 3 Clinical trial — three alternative framings

A clinical trial compares a new treatment (mean response $\mu_T$) to standard care ($\mu_S$). Three framings:

  • Simple: $H_0: \mu_T = \mu_S$, $H_1: \mu_T = \mu_S + \delta$ for a pre-specified clinically meaningful effect $\delta > 0$. Requires knowing $\delta$ up front; rare in practice but crucial for power calculations.
  • One-sided composite: $H_0: \mu_T \le \mu_S$, $H_1: \mu_T > \mu_S$. Matches the regulatory framing: we only approve the treatment if it improves on standard care.
  • Two-sided composite: $H_0: \mu_T = \mu_S$, $H_1: \mu_T \ne \mu_S$. Appropriate when an inferior treatment should also prompt action (withdraw, study further).

Choice among the three is a substantive decision driven by the clinical, regulatory, and ethical context — not a statistical one.

17.3 Test Statistics, Rejection Regions, and Size

With the hypotheses fixed, we need a summary of the data that concentrates the evidence for or against $H_0$, and a rule that uses that summary to make a decision.

Definition 4 Test statistic

A test statistic is a measurable function $T : \mathcal{X}^n \to \mathbb{R}$ that maps the data $X = (X_1, \ldots, X_n)$ to a single real number. Large $|T|$ (or large $T$ for one-sided tests) is interpreted as evidence against $H_0$; the reduction is the one we met in Topic 16 under the sufficiency banner, and indeed most canonical test statistics are functions of a sufficient statistic (e.g., the z-statistic is a function of $\bar X$, which is sufficient for $\mu$ in the Normal family with $\sigma^2$ known).

Definition 5 Rejection region; critical value

The rejection region $R \subseteq \mathcal{X}^n$ is the set of data values for which the test rejects $H_0$. When $R$ is specified via the test statistic as $R = \{X : T(X) > c\}$ (right-tailed), $R = \{X : T(X) < c\}$ (left-tailed), or $R = \{X : |T(X)| > c\}$ (two-tailed), the threshold $c$ is called the critical value and denoted $c_\alpha$ when calibrated to level $\alpha$.

Definition 6 Size and level of a test

The size of a test with rejection region $R$ is

$$\alpha = \sup_{\theta \in \Theta_0} P_\theta(X \in R).$$

That is, the size is the worst-case Type I error rate over the null set. A test has level $\alpha$ if its size is at most $\alpha$ — every test of size $\alpha$ has level $\alpha$, but a test of level $\alpha$ may have strictly smaller size. The distinction matters when the null distribution is discrete (Thm 1, Remark 4, Example 4).

Theorem 1 Size-level distinction for discrete vs continuous test statistics

For continuous null distributions, any $\alpha \in (0, 1)$ is achievable exactly: the critical value $c_\alpha$ can be chosen so that $P_{H_0}(T > c_\alpha) = \alpha$ on the nose. For discrete null distributions (e.g., Binomial, Poisson), only a countable set of $\alpha$ values is achievable; other levels force a conservative test, with actual size strictly less than the designed level.

Remark 4 Conservative vs exact size in practice

When the null distribution is discrete, the exact achievable sizes are the tail probabilities of that distribution — a discrete set. For $\mathrm{Binomial}(20, 0.5)$, the right-tail probabilities $P(X \ge x)$ at integer $x$ give exact sizes $\{1, 0.999, 0.994, \ldots, 0.021, 0.006, 0.0013, 0.0002, 2 \times 10^{-5}, 10^{-6}\}$ — the size jumps at each integer. A test designed to have level $\alpha = 0.05$ must pick a boundary whose exact size is $\le 0.05$; for $H_0: p = 0.5$ at $n = 20$, the best achievable right-tailed size is $0.021$ (at boundary $x = 15$), strictly less than $0.05$. This is not a flaw; it is the price of exactness. Conservative tests preserve the Type I guarantee at the cost of some power, and they are preferred over the alternatives (randomized tests, mid-p values) in most practical settings.

Example 4 Bernoulli exact size at n = 20, p₀ = 0.5

Test $H_0: p = 0.5$ vs $H_1: p > 0.5$ using $T = \sum_{i=1}^{20} X_i$ with rejection region $\{T \ge 15\}$. Under $H_0$, $T \sim \mathrm{Binomial}(20, 0.5)$, so

$$P_{H_0}(T \ge 15) = \sum_{k = 15}^{20} \binom{20}{k} (0.5)^{20} = \frac{15504 + 4845 + 1140 + 190 + 20 + 1}{2^{20}} = \frac{21700}{1048576} \approx 0.0207.$$

The exact size is $0.0207 < 0.05$, so the test is conservative at the nominal $\alpha = 0.05$. Moving the boundary to $T \ge 14$ would give size $\approx 0.0577 > 0.05$ — unacceptable. The next smaller achievable size is $0.0207$; the one after that is $0.0059$. No boundary achieves $\alpha = 0.05$ exactly, and the analyst must choose between conservative ($0.0207$) and anti-conservative ($0.0577$) rounding. This is the discrete-test pedagogical example that §17.6 Example 11 revisits in the binomial-exact treatment.
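The tail probabilities above are easy to check numerically. A minimal stdlib-only sketch (the helper name `binom_right_tail` is ours, not from the text):

```python
from math import comb

def binom_right_tail(n: int, p: float, x: int) -> float:
    """P(X >= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x, n + 1))

# Exact sizes of the candidate rejection regions under H0: p = 0.5, n = 20.
size_15 = binom_right_tail(20, 0.5, 15)  # ≈ 0.0207 — conservative at alpha = 0.05
size_14 = binom_right_tail(20, 0.5, 14)  # ≈ 0.0577 — anti-conservative
print(f"{{T >= 15}}: {size_15:.4f}   {{T >= 14}}: {size_14:.4f}")
```

Sweeping `x` over the integers reproduces the full discrete set of achievable sizes from Remark 4.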

Two panels. Left: continuous null distribution with shaded two-sided rejection region, critical values ±c_α marked, Type I area shaded. Right: discrete null — Binomial(20, 0.5) PMF bars with rejection region X ≥ 15 colored, annotated "exact size ≈ 0.021 < α = 0.05".

Interactive: Null vs Alternative (§17.3). The canonical picture: two Normal sampling distributions of the z-statistic, separated by the standardized effect size $\delta = \sqrt n(\mu - \mu_0)/\sigma$. The rejection region is a tail under $H_0$; power is the complementary tail under $H_A$. Moving $H_A$ further from $H_0$ increases power; increasing $n$ shrinks both distributions, narrowing their overlap. Shrinking $\alpha$ pushes the critical value out, trading Type I error for Type II. (Sample readout at $\alpha = 0.05$, $\delta = 2.74$: power $\beta = 0.863$, Type II rate $1 - \beta = 0.137$.)

17.4 Type I, Type II, and Power

The two errors of a test have names.

Definition 7 Type I error; Type II error
  • Type I error: rejecting $H_0$ when $H_0$ is true. Its probability is the size $\alpha$ (§17.3 Def 6).
  • Type II error: failing to reject $H_0$ when $H_1$ is true. Its probability at a specific $\theta \in \Theta_1$ is conventionally written $1 - \beta(\theta)$ in the Neyman-Pearson convention where $\beta$ denotes power; some texts reverse this convention and write $\beta(\theta)$ for the Type II rate. Topic 17 follows Lehmann-Romano: $\beta(\theta)$ is power (the probability of rejection at $\theta$), so the Type II rate is $1 - \beta(\theta)$.
Definition 8 Power function

The power function of a test with rejection region $R$ is

$$\beta(\theta) = P_\theta(X \in R), \qquad \theta \in \Theta.$$

Evaluated at $\theta \in \Theta_0$, $\beta(\theta)$ is the Type I error rate at that null parameter (and $\sup_{\theta \in \Theta_0} \beta(\theta) = \alpha$, the size). Evaluated at $\theta \in \Theta_1$, $\beta(\theta)$ is the actual probability of correctly rejecting $H_0$ at the specific alternative $\theta$. A good test has $\beta$ close to $\alpha$ on $\Theta_0$ and close to 1 on $\Theta_1$; how close depends on the sample size and the distance between $\theta$ and $\Theta_0$.

Theorem 2 Monotonicity of power in effect size (exp-family one-sided tests)

For a one-parameter exponential family with monotone likelihood ratio (MLR) in the sufficient statistic $T$, the one-sided right-tail test $\{T > c\}$ has power $\beta(\theta)$ that is non-decreasing in $\theta$. In particular, for any $\theta_1 > \theta_0$, $\beta(\theta_1) \ge \beta(\theta_0) = \alpha$.

The proof uses the MLR property to show that the likelihood ratio $L(\theta_1)/L(\theta_0)$ is monotone in $T$, so the event $\{T > c\}$ has higher probability at $\theta_1$. The full argument uses Topic 18’s optimality machinery; we cite the result here and use it in the power calculations that follow.

Example 5 Power of the one-sample z-test against a Normal mean

For iid $\mathcal{N}(\mu, \sigma^2)$ data with $\sigma^2$ known, consider the right-tailed z-test of $H_0: \mu = \mu_0$ vs $H_1: \mu > \mu_0$ at level $\alpha$. The test rejects when $Z = \sqrt n(\bar X - \mu_0)/\sigma > z_\alpha$, where $z_\alpha = \Phi^{-1}(1 - \alpha)$ is the $(1 - \alpha)$-quantile of the standard Normal. Under $\mu$, $Z \sim \mathcal{N}(\sqrt n(\mu - \mu_0)/\sigma,\, 1)$, so

$$\beta(\mu) = P_\mu(Z > z_\alpha) = 1 - \Phi\!\left(z_\alpha - \frac{\sqrt n\,(\mu - \mu_0)}{\sigma}\right).$$

This closed-form expression is the analytic engine of the sample-size calculator in Example 6 and the PowerCurveExplorer component. Key observations: $\beta$ increases monotonically in $\mu$ (Thm 2); $\beta(\mu_0) = \alpha$; and at any fixed $\mu > \mu_0$, the argument of $\Phi$ decreases like $\sqrt n(\mu - \mu_0)/\sigma$, so $\beta(\mu) \to 1$ as $n \to \infty$ (the test is consistent).
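The closed form is one line of code. A minimal sketch (the function name `power` and its defaults are ours), using `math.erf` for the standard-Normal CDF:

```python
from math import sqrt, erf

def Phi(x: float) -> float:
    """Standard Normal CDF via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def power(mu: float, mu0: float = 0.0, sigma: float = 1.0,
          n: int = 30, z_alpha: float = 1.645) -> float:
    """beta(mu) = 1 - Phi(z_alpha - sqrt(n)(mu - mu0)/sigma), right-tailed z-test."""
    return 1 - Phi(z_alpha - sqrt(n) * (mu - mu0) / sigma)

print(round(power(0.0), 3))  # beta(mu0) = alpha ≈ 0.05
print(round(power(0.5), 3))  # ≈ 0.863 at n = 30
```

Evaluating on a grid of `mu` values traces out the sigmoid power curve from the figure.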

Example 6 Sample-size calculation: how large does n have to be?

Suppose we want power $0.8$ (Type II rate $0.2$) against the alternative $\mu = \mu_0 + 0.5\sigma$ (a “half-$\sigma$” effect) at level $\alpha = 0.05$, one-sided. Setting $\beta(\mu) = 0.8$ in Example 5 and solving:

$$0.8 = 1 - \Phi\!\left(z_{0.05} - \frac{\sqrt n\,(0.5\sigma)}{\sigma}\right) = 1 - \Phi(1.645 - 0.5\sqrt n).$$

Inverting, $\Phi(1.645 - 0.5\sqrt n) = 0.2$, so $1.645 - 0.5\sqrt n = \Phi^{-1}(0.2) = -0.8416$, giving $\sqrt n = (1.645 + 0.8416)/0.5 \approx 4.97$, hence $n \ge 24.7$. Rounding up, $n = 25$. More generally, for a raw effect size $\delta = \mu - \mu_0$,

$$n = \frac{(z_\alpha + z_\beta)^2 \sigma^2}{\delta^2},$$

the textbook sample-size formula. The PowerCurveExplorer (§17.4 component) implements this closed form for the Normal scenarios and a numerical inversion for the binomial exact test.
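The calculation can be scripted end to end. A stdlib-only sketch (the helper names `Phi_inv` and `required_n` are ours; a library quantile function would replace the bisection):

```python
from math import ceil, sqrt, erf

def Phi(x: float) -> float:
    return 0.5 * (1 + erf(x / sqrt(2)))

def Phi_inv(p: float) -> float:
    """Standard Normal quantile by bisection (stdlib only)."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if Phi(mid) < p else (lo, mid)
    return (lo + hi) / 2

def required_n(alpha: float, target_power: float, effect_in_sigma: float) -> int:
    """n = (z_alpha + z_beta)^2 / (delta/sigma)^2 for the one-sided z-test."""
    z_a, z_b = Phi_inv(1 - alpha), Phi_inv(target_power)
    return ceil((z_a + z_b) ** 2 / effect_in_sigma ** 2)

print(required_n(0.05, 0.8, 0.5))  # 25 — the half-sigma example above
```

Halving the effect size quadruples the required sample, as the $\delta^{-2}$ dependence dictates.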

Remark 5 Power and the CRLB — Fisher information bounds achievable power

In a regular parametric family, the Cramér-Rao lower bound (Thm 13.9) says any unbiased estimator $\hat\theta$ satisfies $\mathrm{Var}(\hat\theta) \ge 1/[n I(\theta)]$, where $I(\theta)$ is the Fisher information at $\theta$. This translates directly into an upper bound on achievable power for any asymptotically-Normal-based test: the smaller the estimator variance, the more concentrated its sampling distribution, and the larger the separation between the $H_0$ and $H_1$ sampling distributions — hence higher power. Topic 18 formalizes this as the CRLB giving a “power envelope” that UMP tests achieve. At intermediate level, the intuition is: power is to the estimator’s variance what accuracy is to its bias, and both are bounded by Fisher information.

Three panels. Left: H₀ (green) and H_A (red) sampling distributions of X̄ under μ = 0 vs μ = 0.5 at n = 30, σ = 1; Type I area shaded under H₀ right of c, Type II area shaded under H_A left of c. Center: power curve β(μ) as a sigmoid rising from α to 1. Right: sample-size envelope — power vs n at fixed effect size, with horizontal line at 1 − β = 0.8 crossing at n ≈ 25.

Interactive: Power Curves (§17.4). Sample-size calculator: target power $1 - \beta = 0.8$ at level $\alpha = 0.05$ gives required $n = 25$, via the closed forms $\beta(\mu) = 1 - \Phi(z_\alpha - \sqrt n(\mu - \mu_0)/\sigma)$ and $n = (z_\alpha + z_\beta)^2 \sigma^2 / \delta^2$.

17.5 P-values

A rejection region with critical value cαc_\alpha makes the test a binary decision. A more informative summary — and the one most practitioners actually report — is the p-value.

Definition 9 P-value

For a test with test statistic $T(X)$ and a right-tailed rejection rule $\{T > c\}$, the p-value of an observed data point $x$ is

$$p(x) = P_{H_0}(T(X) \ge T(x)).$$

That is, the p-value is the tail probability, under $H_0$, of the test statistic being at least as extreme as the observed value. For left-tailed tests replace $\ge$ with $\le$; for two-sided tests use $2 \min\big(P_{H_0}(T \ge T(x)),\, P_{H_0}(T \le T(x))\big)$ for symmetric null distributions (a convention; a few alternatives exist for non-symmetric nulls, discussed in §17.6 Example 11 for the binomial case).

The rejection rule “reject when pαp \le \alpha” is equivalent to “reject when T>cαT > c_\alpha” — the two formulations give identical decisions. The advantage of the p-value is that it makes the level at which rejection occurs transparent, rather than forcing an all-or-nothing decision at a fixed α\alpha.

Theorem 3 Uniformity of the p-value under continuous H₀

Let $T$ be a test statistic with continuous CDF $F$ under $H_0$. For a right-tailed test, the p-value is $p(X) = 1 - F(T(X))$. Under $H_0$, $p(X) \sim \mathrm{Uniform}(0, 1)$.

Proof.

Step 1 — identify the distribution of $F(T)$. Under $H_0$, $T$ has CDF $F$. The probability integral transform (Topic 6 §6.3) says that for any continuous CDF $F$ and any random variable $T$ with CDF $F$, the variable $U = F(T)$ is $\mathrm{Uniform}(0, 1)$. We verify this directly: for any $u \in (0, 1)$,

$$P(U \le u) = P(F(T) \le u) = P(T \le F^{-1}(u)) = F(F^{-1}(u)) = u,$$

where the second equality uses strict monotonicity of $F$ on its support (which holds because $F$ is continuous and $T$ has full support on that region).

Step 2 — conclude for the p-value. The p-value is $p(X) = 1 - F(T(X))$. By Step 1, $F(T)$ is $\mathrm{Uniform}(0, 1)$ under $H_0$, hence $1 - F(T)$ is also $\mathrm{Uniform}(0, 1)$: the reflection $u \mapsto 1 - u$ is a measure-preserving transformation of $[0, 1]$, so it maps the uniform distribution to itself.

Step 3 — size calibration. It follows that for every $\alpha \in (0, 1)$,

$$P_{H_0}(p(X) \le \alpha) = \alpha.$$

Rejecting when $p \le \alpha$ therefore gives a test of exactly size $\alpha$ — an appealing formal property of the p-value rule.

∎ — by the probability integral transform (Topic 6) and measure-preservation of $u \mapsto 1 - u$ on $[0, 1]$.

Example 7 Monte Carlo demonstration of p-value uniformity

Simulate 10,000 samples of size $n = 30$ from $\mathcal{N}(0, 1)$ (the null distribution). For each, compute the z-test statistic $Z = \sqrt n\, \bar X$ and its right-tailed p-value $p = 1 - \Phi(Z)$. The histogram of the 10,000 p-values is visibly flat on $[0, 1]$ — the empirical verification of Thm 3. Under an alternative (say $\mu = 0.5$), the same MC run gives a histogram concentrated near zero: power visualized as a distribution on the p-value scale. The PValueDemonstrator component (§17.5) makes this interactive.
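The Monte Carlo run takes a dozen lines. A stdlib-only sketch (seed and helper name `pvalues` are ours):

```python
import random
from math import sqrt, erf

random.seed(0)
Phi = lambda x: 0.5 * (1 + erf(x / sqrt(2)))
n, reps = 30, 10_000

def pvalues(mu):
    """Right-tailed z-test p-values for H0: mu = 0 over repeated N(mu, 1) samples."""
    out = []
    for _ in range(reps):
        xbar = sum(random.gauss(mu, 1.0) for _ in range(n)) / n
        out.append(1 - Phi(sqrt(n) * xbar))
    return out

p_null, p_alt = pvalues(0.0), pvalues(0.5)
frac_null = sum(p <= 0.05 for p in p_null) / reps  # ≈ alpha = 0.05 (uniformity)
frac_alt = sum(p <= 0.05 for p in p_alt) / reps    # ≈ power ≈ 0.86 at this n, mu
print(frac_null, frac_alt)
```

Binning `p_null` into, say, 20 equal-width bins would show the flat histogram directly; `p_alt` piles up near zero.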

Three panels. Left: one-sided test with observed T = t_obs and shaded p-value tail under the null density. Center: 10,000-sample MC of p-values under Normal H₀, histogram overlaid with flat Uniform(0, 1) density. Right: MC p-value histogram under H_A (Normal with shift μ = 0.5), strongly right-skewed toward zero.

Interactive: P-values (§17.5). Draw a test statistic and read off its p-value (e.g., an observed $T = 1.126$ gives $p = 0.260$, so fail to reject at $\alpha = 0.05$). Under continuous $H_0$, the p-value is $\mathrm{Uniform}(0, 1)$ — Thm 3; the MC-uniformity mode verifies this empirically.

Remark 6 The ASA statement, replication, and what p-values don't tell you

A string of papers from the 2005–2016 period argued that naive p-value interpretation is a major driver of non-replicable research. Ioannidis (2005), Why Most Published Research Findings Are False, used a prior-probability argument to show that a statistically significant finding ($p < 0.05$) is often more likely to be a false positive than a true positive, depending on the prior probability of the null — p-values are emphatically not $P(H_0 \mid \text{data})$. Gelman & Loken (2013), The Garden of Forking Paths, argued that even well-intentioned researchers inflate Type I error through researcher degrees of freedom (p-hacking, HARKing) without formal multiple testing. The ASA Statement on p-Values (Wasserstein & Lazar, 2016) synthesized the critique into six principles, emphasizing that a p-value quantifies inconsistency with a null model under specific assumptions — not the probability of a hypothesis, and not the importance of an effect.

These are genuine warnings; Topic 20 addresses the multiple-testing piece formally (Bonferroni, Holm, Benjamini-Hochberg FDR, Šidák — see especially §20.7 for the featured BH proof and §20.8 for the Ioannidis / Gelman-Loken framing). For Topic 17, the takeaway is: a p-value is a single number reporting a tail probability under a stated null; it is not a posterior probability, not an effect size, and not a replacement for thinking about the experimental design, power, and prior plausibility.

17.6 The z-test

The z-test is the simplest test in the parametric toolkit: standardize the sample mean, compare to a standard Normal quantile. It is the direct consumer of the Central Limit Theorem (Topic 11 Thm 11.1) and the prototype for every asymptotic test in §17.9.

Definition 10 z-test (one-sample and two-sample)

One-sample z-test. For iid data $X_1, \ldots, X_n$ from a distribution with mean $\mu$ and known variance $\sigma^2$, the z-statistic for testing $H_0: \mu = \mu_0$ is

$$Z = \frac{\sqrt n\,(\bar X - \mu_0)}{\sigma}.$$

Under $H_0$, and under Normality of the data, $Z \sim \mathcal{N}(0, 1)$ exactly. For non-Normal data, $Z \xrightarrow{d} \mathcal{N}(0, 1)$ under $H_0$ by the CLT — an asymptotic result with finite-sample accuracy governed by Berry-Esseen (Topic 11 §11.4).

Two-sample z-test. For two independent samples $X_1, \ldots, X_{n_1}$ and $Y_1, \ldots, Y_{n_2}$ with known variances $\sigma_1^2$ and $\sigma_2^2$, the two-sample z-statistic for testing $H_0: \mu_1 = \mu_2$ is

$$Z = \frac{\bar X - \bar Y}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}.$$

Under $H_0$ (and Normality, or asymptotically), $Z \sim \mathcal{N}(0, 1)$.

Theorem 4 Size of the z-test is exactly α under Normal null

Let $X_1, \ldots, X_n$ be iid $\mathcal{N}(\mu_0, \sigma^2)$ with $\sigma^2$ known. For the two-sided test that rejects $H_0: \mu = \mu_0$ when $|Z| > z_{\alpha/2}$ (where $Z = \sqrt n(\bar X - \mu_0)/\sigma$ and $z_{\alpha/2} = \Phi^{-1}(1 - \alpha/2)$), the size is exactly $\alpha$.

Proof.

Step 1 — null distribution of $Z$. Under $H_0$, the sample mean $\bar X \sim \mathcal{N}(\mu_0, \sigma^2/n)$ exactly — Normal-sum closure (Topic 6 §6.5) says a sum of iid Normals is Normal with the expected mean and variance. Standardizing,

$$Z = \frac{\sqrt n\,(\bar X - \mu_0)}{\sigma} \sim \mathcal{N}(0, 1) \quad \text{under } H_0,$$

exactly — no asymptotics needed.

Step 2 — compute the size. The size is $P_{H_0}(|Z| > z_{\alpha/2})$. Split into two tails:

$$P_{H_0}(|Z| > z_{\alpha/2}) = P(Z > z_{\alpha/2}) + P(Z < -z_{\alpha/2}).$$

By symmetry of the standard Normal density around zero, $P(Z > z_{\alpha/2}) = P(Z < -z_{\alpha/2})$. And by definition of $z_{\alpha/2}$,

$$P(Z > z_{\alpha/2}) = 1 - \Phi(z_{\alpha/2}) = \alpha/2.$$

Adding the two tails,

$$P_{H_0}(|Z| > z_{\alpha/2}) = \alpha/2 + \alpha/2 = \alpha.$$

Step 3 — asymptotic extension when $\sigma^2$ is estimated. If $\sigma^2$ is unknown and replaced by a consistent estimator $\hat\sigma^2$ (e.g., the sample variance), the ratio $Z_n = \sqrt n(\bar X - \mu_0)/\hat\sigma$ has null distribution $\mathcal{N}(0, 1)$ only asymptotically — by the CLT applied to $\sqrt n(\bar X - \mu_0)/\sigma$ and Slutsky’s theorem (Topic 9) applied to the ratio $\sigma/\hat\sigma \xrightarrow{P} 1$. Exact size is $\alpha$ only in the $\sigma$-known case; for the $\sigma$-unknown case, the t-test (Thm 5) gives exact finite-sample size.

∎ — using Normality of sums (Topic 6), symmetry of the standard Normal, and Slutsky’s theorem (Topic 9) for the asymptotic case.
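The exact-size claim is easy to spot-check by simulation under the $\sigma$-known Normal null. A stdlib-only sketch (seed, constants, and loop structure are ours):

```python
import random
from math import sqrt

random.seed(1)
n, reps, z_crit = 25, 20_000, 1.959964  # two-sided critical value at alpha = 0.05
rejections = 0
for _ in range(reps):
    # Sample under H0: iid N(mu0, 1) with mu0 = 0 and sigma = 1 known.
    xbar = sum(random.gauss(0.0, 1.0) for _ in range(n)) / n
    if abs(sqrt(n) * xbar) > z_crit:
        rejections += 1
size_hat = rejections / reps
print(size_hat)  # ≈ 0.05, up to Monte Carlo error
```

Swapping the known `sigma = 1` for a per-sample estimate would reproduce the Step 3 situation: the empirical size drifts above $\alpha$ at small $n$, which is exactly what the t-test corrects.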

Example 8 One-sample z-test — IQ testing

A random sample of $n = 50$ adults has mean IQ $\bar x = 103.2$, and we wish to test whether the population mean differs from the standardized value $\mu_0 = 100$, assuming $\sigma = 15$ (the calibration target of the instrument). The z-statistic is

$$z = \frac{\sqrt{50}\,(103.2 - 100)}{15} = \frac{7.071 \cdot 3.2}{15} \approx 1.508.$$

Two-sided p-value: $p = 2\,(1 - \Phi(1.508)) \approx 0.131$. At $\alpha = 0.05$ we fail to reject. The point estimate is higher than $\mu_0$, but the evidence is not strong — a half-$\sigma$ effect would need $n \gtrsim 32$ to reach power 0.8 (Example 6), and we observed about $0.21\sigma$.
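The arithmetic can be wrapped in a small helper. A stdlib-only sketch (`z_test_one_sample` is our name, not a library function):

```python
from math import sqrt, erf

def z_test_one_sample(xbar: float, mu0: float, sigma: float, n: int,
                      two_sided: bool = True):
    """z-statistic and p-value for H0: mu = mu0 with sigma known."""
    Phi = lambda x: 0.5 * (1 + erf(x / sqrt(2)))
    z = sqrt(n) * (xbar - mu0) / sigma
    p = 2 * (1 - Phi(abs(z))) if two_sided else 1 - Phi(z)
    return z, p

z, p = z_test_one_sample(103.2, 100.0, 15.0, 50)
print(f"z = {z:.3f}, p = {p:.3f}")  # z ≈ 1.508, p ≈ 0.131
```

The one-sided variant (`two_sided=False`) halves the p-value here because the observed effect is in the hypothesized direction.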

Example 9 Two-sample z-test — drug trial

A trial enrolls $n_1 = 100$ patients in the treatment arm (blood-pressure reduction $\bar X = 12.4$ mmHg, assumed $\sigma_1 = 8$) and $n_2 = 100$ in the placebo arm ($\bar Y = 8.1$, $\sigma_2 = 8$). Testing $H_0: \mu_T = \mu_P$:

$$Z = \frac{12.4 - 8.1}{\sqrt{64/100 + 64/100}} = \frac{4.3}{\sqrt{1.28}} \approx 3.80.$$

Two-sided p-value $\approx 0.00015$ — strongly significant. The observed treatment effect of $4.3$ mmHg at these standard errors lands deep inside the rejection region at any conventional $\alpha$.

Example 10 Two-sample proportion z-test — A/B testing workhorse

An experiment ran $n_A = n_B = 10000$ impressions per variant. Variant A produced $k_A = 1200$ conversions ($\hat p_A = 0.12$); variant B produced $k_B = 1260$ ($\hat p_B = 0.126$). Testing $H_0: p_A = p_B$ with the pooled-variance z-statistic:

$$\hat p = \frac{k_A + k_B}{n_A + n_B} = \frac{2460}{20000} = 0.123, \qquad \mathrm{SE} = \sqrt{\hat p (1 - \hat p)\left(\frac{1}{n_A} + \frac{1}{n_B}\right)} = \sqrt{0.123 \cdot 0.877 \cdot 0.0002} \approx 0.00465.$$

$$Z = \frac{0.126 - 0.12}{0.00465} \approx 1.29.$$

Two-sided p-value $\approx 0.197$; we fail to reject at $\alpha = 0.05$. The detected lift (0.6 percentage points, 5% relative) is plausible — but not strong enough evidence at this sample size. A typical platform would either run the experiment longer or accept the null (“no detectable effect”) given the sample-size budget. This is the canonical output of every A/B-testing platform on the planet; the interactive version of this calculation is in the NullAlternativeSimulator and PowerCurveExplorer components.
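The pooled test is a few lines of code. A stdlib-only sketch (`two_prop_ztest` is our name; production platforms add continuity corrections and sequential adjustments on top of this core):

```python
from math import sqrt, erf

def two_prop_ztest(k_a: int, n_a: int, k_b: int, n_b: int):
    """Pooled-variance z-test of H0: p_A = p_B; returns (z, two-sided p-value)."""
    Phi = lambda x: 0.5 * (1 + erf(x / sqrt(2)))
    p_pool = (k_a + k_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (k_b / n_b - k_a / n_a) / se
    return z, 2 * (1 - Phi(abs(z)))

z, p = two_prop_ztest(1200, 10_000, 1260, 10_000)
print(f"z = {z:.2f}, p = {p:.3f}")  # z ≈ 1.29, p ≈ 0.196
```

For the one-sided framing of Example 1, report `1 - Phi(z)` instead of the two-sided tail.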

Example 11 Binomial exact test — the fourth worked test family

Return to $H_0: p = 0.5$ vs $H_1: p > 0.5$ at $n = 20$ with observed $x_{\mathrm{obs}} = 15$. Two approaches:

Exact test. The null distribution of $X = \sum X_i$ is $\mathrm{Binomial}(20, 0.5)$. The right-tailed p-value is the exact tail probability:

$$p_{\mathrm{exact}} = P_{H_0}(X \ge 15) = \sum_{k=15}^{20} \binom{20}{k} (0.5)^{20} \approx 0.0207.$$

At $\alpha = 0.05$, we reject. The rejection region $\{X \ge 15\}$ has exact size $0.0207$ — conservative (Remark 4).

Normal approximation. Under $H_0$, $X \approx \mathcal{N}(10, 5)$; the standardized observed value is $z = (15 - 10)/\sqrt 5 \approx 2.236$, giving p-value $\approx 0.0127$. With a continuity correction (using $14.5$ instead of $15$), $z = (14.5 - 10)/\sqrt 5 \approx 2.012$ and the p-value becomes $\approx 0.0221$ — much closer to the exact $0.0207$. Without the continuity correction, the Normal approximation overstates significance.

Why this matters. At $n = 20$, the Normal-approximation p-value can be 30–40% off the exact p-value at moderate thresholds. At $n = 200$ it is typically within a few percent. The notebook figure (right panel) shows the size comparison across $n \in [10, 200]$: the exact size stays below $\alpha$ by a discrete jump; the Normal-approximation size oscillates above and below $\alpha$, converging as $n \to \infty$. The binomial exact test has the additional property — covered in Topic 18 — of being UMP one-sided for the Bernoulli family.
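The exact and approximate p-values in this example can be compared side by side. A stdlib-only sketch (variable names are ours):

```python
from math import comb, sqrt, erf

Phi = lambda x: 0.5 * (1 + erf(x / sqrt(2)))
n, p0, x_obs = 20, 0.5, 15

# Exact right-tailed p-value under Binomial(20, 0.5).
p_exact = sum(comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(x_obs, n + 1))

# Normal approximation, without and with the continuity correction.
mu, sd = n * p0, sqrt(n * p0 * (1 - p0))
p_normal = 1 - Phi((x_obs - mu) / sd)
p_cc = 1 - Phi((x_obs - 0.5 - mu) / sd)

print(f"exact {p_exact:.4f}  normal {p_normal:.4f}  corrected {p_cc:.4f}")
```

Rerunning with larger `n` (and `x_obs` scaled proportionally) shows the three values converging, which is the right-panel story in the figure below.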

Remark 7 Why exact tests matter at small n

For small-nn discrete tests, the Normal approximation can miss the actual size by several percentage points — not an academic concern but a practical one. An experimenter who believes their nominal α\alpha is 0.050.05 but whose actual Type I rate is 0.080.08 is reporting misleading guarantees. For A/B tests at nn in the hundreds or more, the Normal approximation is typically accurate to within a percentage point; for medical or behavioral studies at n50n \le 50, exact tests are the responsible choice. The PowerCurveExplorer component includes a Binomial exact scenario that inverts the exact size via root-find (not the Normal approximation), letting the reader see this effect directly.

Three panels. Left: one-sample z-test null N(0, 1) with two-sided rejection region at |z| > 1.96; observed Z = 1.51 marked, p-value ≈ 0.131 annotated. Center: Binomial(20, 0.5) PMF bars with rejection X ≥ 15 highlighted (exact size 0.021), superimposed Normal approximation density. Right: size comparison as n varies from 10 to 200 — exact discrete steps below α; Normal approximation oscillates above and below α, converging to α as n grows.

17.7 The t-test via Basu

The one-sample t-test is the small-sample analog of the z-test: use SS (the sample standard deviation) in place of σ\sigma when σ\sigma is unknown. The substitution is natural, but it has a non-obvious consequence: the test statistic T=n(Xˉμ0)/ST = \sqrt n(\bar X - \mu_0)/S no longer has a Normal null distribution. Under iid Normal data, it has a Student’s tn1t_{n-1} distribution, and the derivation of that fact is the content of this section. The hinge is Basu’s theorem (Topic 16 §16.9) — the single most important cross-reference from Track 4 into Track 5.

Definition 11 Student's t-test

One-sample t-test. For iid Normal data X1,,XnN(μ,σ2)X_1, \ldots, X_n \sim \mathcal{N}(\mu, \sigma^2) with both μ\mu and σ2\sigma^2 unknown, the t-statistic for testing H0:μ=μ0H_0: \mu = \mu_0 is

T=n(Xˉμ0)S,S2=1n1i=1n(XiXˉ)2.T = \frac{\sqrt n\,(\bar X - \mu_0)}{S}, \qquad S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2.

Under H0H_0, Ttn1T \sim t_{n-1} (Student’s t distribution with n1n - 1 degrees of freedom) exactly — Thm 5 below, via Basu.

Two-sample pooled t-test. For two independent iid Normal samples with a common unknown variance, the pooled t-statistic for testing H0:μ1=μ2H_0: \mu_1 = \mu_2 is

T=XˉYˉSp1/n1+1/n2,Sp2=(n11)S12+(n21)S22n1+n22.T = \frac{\bar X - \bar Y}{S_p\sqrt{1/n_1 + 1/n_2}}, \qquad S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}.

Under H0H_0 and the common-variance assumption, Ttn1+n22T \sim t_{n_1 + n_2 - 2} exactly. When variances are unequal, Welch’s modification (welchTStatistic in testing.ts) gives an approximate tνt_{\nu} distribution with Satterthwaite degrees of freedom — an approximation that becomes exact only asymptotically.

Theorem 5 Null distribution of the one-sample t-statistic is tₙ₋₁

Let X1,,XnX_1, \ldots, X_n be iid N(μ0,σ2)\mathcal{N}(\mu_0, \sigma^2) with both parameters unknown. Under H0:μ=μ0H_0: \mu = \mu_0, the one-sample t-statistic

Tn=n(Xˉμ0)ST_n = \frac{\sqrt n\,(\bar X - \mu_0)}{S}

has distribution tn1t_{n-1} exactly.

Proof [show]

Step 1 — build Student’s tt from its defining ratio. The tn1t_{n-1} distribution is defined (Topic 6 §6.6) as the law of

T=ZV/(n1),T = \frac{Z}{\sqrt{V/(n-1)}},

where ZN(0,1)Z \sim \mathcal{N}(0, 1), Vχn12V \sim \chi^2_{n-1}, and Z ⁣ ⁣ ⁣VZ \perp\!\!\!\perp V. We will rewrite TnT_n in this form and verify the three conditions.

Step 2 — express TnT_n as a ratio of standardized quantities. Under H0H_0,

Tn=nXˉμ0S=n(Xˉμ0)/σS/σ.T_n = \sqrt n\,\frac{\bar X - \mu_0}{S} = \frac{\sqrt n\,(\bar X - \mu_0)/\sigma}{S/\sigma}.

The numerator n(Xˉμ0)/σ\sqrt n\,(\bar X - \mu_0)/\sigma is N(0,1)\mathcal{N}(0, 1) under H0H_0 (Topic 6 Normal-sum closure; same as in the z-test’s Proof 2 Step 1). Call it ZZ.

Step 3 — identify the chi-squared denominator. The scaled sample variance is

(n1)S2σ2=1σ2i=1n(XiXˉ)2.\frac{(n-1)\,S^2}{\sigma^2} = \frac{1}{\sigma^2} \sum_{i=1}^n (X_i - \bar X)^2.

It is a classical result (Topic 6 §6.6; proved via the orthogonal transformation argument in Casella-Berger Thm 5.3.1) that under iid Normal data,

(n1)S2σ2χn12.\frac{(n-1)\,S^2}{\sigma^2} \sim \chi^2_{n-1}.

Call this random variable VV. Then S/σ=V/(n1)S/\sigma = \sqrt{V/(n-1)}, and

Tn=ZV/(n1).T_n = \frac{Z}{\sqrt{V/(n-1)}}.

This matches the defining ratio for Student’s tn1t_{n-1}, provided Z ⁣ ⁣ ⁣VZ \perp\!\!\!\perp V.

Step 4 — independence via Basu. This is the hinge. For iid N(μ,σ2)\mathcal{N}(\mu, \sigma^2) data at fixed σ2\sigma^2, Xˉ\bar X is complete sufficient for μ\mu — a standard application of the exponential-family completeness result (Topic 16 §16.6 Lemma 1 specialized to the one-parameter Normal with σ2\sigma^2 known). The sample variance S2=(n1)1(XiXˉ)2S^2 = (n-1)^{-1}\sum(X_i - \bar X)^2 is ancillary for μ\mu — its distribution depends on σ2\sigma^2 but not on μ\mu, because S2S^2 is a function of the centered data XiXˉX_i - \bar X which is location-invariant. By Basu’s theorem (Topic 16 Thm 7),

Xˉ   ⁣ ⁣ ⁣  S2.\bar X \;\perp\!\!\!\perp\; S^2.

Continuous functions of independent random variables are independent, so

n(Xˉμ0)/σ   ⁣ ⁣ ⁣  (n1)S2/σ2,\sqrt n\,(\bar X - \mu_0)/\sigma \;\perp\!\!\!\perp\; (n-1)\,S^2/\sigma^2,

i.e., Z ⁣ ⁣ ⁣VZ \perp\!\!\!\perp V.

Step 5 — conclude. The ratio Tn=Z/V/(n1)T_n = Z/\sqrt{V/(n-1)} has ZN(0,1)Z \sim \mathcal{N}(0, 1), Vχn12V \sim \chi^2_{n-1}, and Z ⁣ ⁣ ⁣VZ \perp\!\!\!\perp V — exactly the defining conditions for tn1t_{n-1}. Hence Tntn1T_n \sim t_{n-1} under H0H_0.

∎ — using Normality of Xˉ\bar X (Topic 6), the (n1)S2/σ2χn12(n-1) S^2/\sigma^2 \sim \chi^2_{n-1} result (Topic 6 §6.6 / Casella-Berger 5.3.1), and Basu’s theorem (Topic 16 Thm 7 §16.9).
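The independence in Step 4 is easy to see by simulation. The sketch below (helper names are ours) draws Normal samples and checks that the sample correlation of (X̄, S²) is near zero, a necessary consequence of Basu's independence (though of course not a proof of it):

```python
import random
from math import sqrt
from statistics import fmean

def mc_xbar_s2(n=10, reps=5000, mu=0.0, sigma=1.0, seed=0):
    """Draw (X-bar, S^2) pairs from iid Normal samples of size n."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(reps):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = fmean(xs)
        s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
        pairs.append((xbar, s2))
    return pairs

def sample_corr(pairs):
    """Pearson correlation of the two coordinates."""
    xs, ys = zip(*pairs)
    mx, my = fmean(xs), fmean(ys)
    num = sum((x - mx) * (y - my) for x, y in pairs)
    den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den

rho = sample_corr(mc_xbar_s2())   # near 0, consistent with Basu's X-bar ⊥⊥ S²
```

Repeating the experiment with a skewed (non-Normal) population is instructive: the correlation is no longer near zero, and the exactness of the t-test is lost with it.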

Example 12 One-sample t-test at n = 10 — where z and t disagree

A small clinical study measures fasting glucose in n=10n = 10 patients on a new regimen: xˉ=105\bar x = 105 mg/dL, s=12s = 12 mg/dL. Testing H0:μ=100H_0: \mu = 100 (the standard reference value) two-sided:

T=10(105100)12=510121.318.T = \frac{\sqrt{10}\,(105 - 100)}{12} = \frac{5\sqrt{10}}{12} \approx 1.318.

Using t: Two-sided p-value from the t9t_9 distribution: pt=2(1Ft9(1.318))0.220p_t = 2\,(1 - F_{t_9}(1.318)) \approx 0.220.

Using z (wrong!): Two-sided z-approximation p-value: pz=2(1Φ(1.318))0.187p_z = 2\,(1 - \Phi(1.318)) \approx 0.187.

The t p-value is 18%\sim 18\% larger than the z approximation — the t9t_9 density has heavier tails than N(0,1)\mathcal{N}(0, 1), so extreme values are less surprising under t9t_9 and the resulting p-value is larger. For small nn, ignoring this difference inflates the Type I rate. At n=30n = 30 the discrepancy shrinks to 3%\sim 3\%; at n=100n = 100 it is negligible.
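Both p-values can be reproduced without a stats package: the sketch below integrates the t₉ density numerically (Simpson's rule) and uses erfc for the Normal tail; t_two_sided_p is an illustrative helper, not library code:

```python
from math import erfc, gamma, pi, sqrt

def t_two_sided_p(t_obs, df, steps=10_000):
    """Two-sided Student-t p-value via Simpson integration of the density."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    f = lambda x: c * (1 + x * x / df) ** (-(df + 1) / 2)
    b = abs(t_obs)
    h = b / steps
    s = f(0.0) + f(b)
    s += 4 * sum(f((2 * i - 1) * h) for i in range(1, steps // 2 + 1))
    s += 2 * sum(f(2 * i * h) for i in range(1, steps // 2))
    body = s * h / 3                    # integral of the density over [0, |t|]
    return 2 * (0.5 - body)             # two symmetric tails

t_obs = sqrt(10) * (105 - 100) / 12     # ≈ 1.318
p_t = t_two_sided_p(t_obs, df=9)        # ≈ 0.220
p_z = erfc(t_obs / sqrt(2))             # z approximation, ≈ 0.187
```

Swapping df=9 for df=29 or df=99 shows the t and z answers converging, matching the ~3% and negligible discrepancies quoted above.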

Example 13 Two-sample pooled t-test

A lab tests two catalysts for a reaction yield: n1=8n_1 = 8 trials with catalyst A (Xˉ=72\bar X = 72, S1=4.5S_1 = 4.5), n2=8n_2 = 8 trials with catalyst B (Yˉ=68\bar Y = 68, S2=5.1S_2 = 5.1). Testing H0:μA=μBH_0: \mu_A = \mu_B with equal-variance pooling:

Sp2=74.52+75.1214=141.75+182.071423.13,Sp4.81.S_p^2 = \frac{7 \cdot 4.5^2 + 7 \cdot 5.1^2}{14} = \frac{141.75 + 182.07}{14} \approx 23.13, \qquad S_p \approx 4.81.

T=72684.811/8+1/8=44.810.5=42.4051.663.T = \frac{72 - 68}{4.81\,\sqrt{1/8 + 1/8}} = \frac{4}{4.81 \cdot 0.5} = \frac{4}{2.405} \approx 1.663.

Two-sided p-value from t14t_{14}: 0.119\approx 0.119. We fail to reject at α=0.05\alpha = 0.05; the observed difference in yields is consistent with sampling noise at this nn. For a detectable effect at α=0.05\alpha = 0.05 with 1β=0.81 - \beta = 0.8, a similar magnitude would need roughly n23n \ge 23 per arm — a typical power-calculation output.
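A minimal sketch of the pooled computation (illustrative helper, standard library only):

```python
from math import sqrt

def pooled_t(xbar, ybar, s1, s2, n1, n2):
    """Two-sample pooled t-statistic under the common-variance assumption."""
    sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    t = (xbar - ybar) / (sqrt(sp2) * sqrt(1 / n1 + 1 / n2))
    return t, sp2

# Catalyst data from the example: A (n=8, mean 72, sd 4.5) vs B (n=8, mean 68, sd 5.1)
t_stat, sp2 = pooled_t(72, 68, 4.5, 5.1, 8, 8)   # t ≈ 1.663 on 14 df, Sp² ≈ 23.13
```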

Remark 8 Basu is the hinge — the Track-4-to-Track-5 bridge

Proof 3 Step 4 is the load-bearing step. Without Basu’s Xˉ ⁣ ⁣ ⁣S2\bar X \perp\!\!\!\perp S^2, the ratio Tn=Z/V/(n1)T_n = Z/\sqrt{V/(n-1)} has a distribution that depends in a complicated way on the joint distribution of (Xˉ,S2)(\bar X, S^2) — and tn1t_{n-1} is no longer the answer. Every finite-sample inference procedure for the one-sample Normal mean with unknown variance — the t-test, the t-confidence-interval, and their two-sample siblings — rests on this independence. Topic 16 gave Basu with a full proof and flagged the forward payoff; Topic 17 §17.7 is that payoff. For the featured visualization of this, see the TTestBasuFoundation component below, which shows the decorrelated scatter of (Xˉ,S2)(\bar X, S^2) across MC replications, the resulting t-statistic histogram matching tn1t_{n-1}, and a direct link back to Topic 16 §16.9.

Featured figure. Three panels. Left: under H₀, MC histogram of T_n = √n(X̄ − μ₀)/S for n = 10 with μ = μ₀, overlaid with exact t_9 density. Center: Basu independence — scatter of (X̄, S²) across 5000 MC samples, decorrelated cloud, annotation "Basu: X̄ ⊥⊥ S² — what makes the t-ratio distribution-free of (μ, σ²)". Right: under H_A (μ = μ₀ + 0.8σ), MC t-histogram shifted right, shaded rejection region, power annotated.

★ Featured · t-test via Basu · §17.7 — interactive companion (TTestBasuFoundation). The (X̄, S²) scatter is decorrelated (sample ρ = 0.0003; theoretical 0), and T = √n(X̄ − μ₀)/S matches t_9.

Why this works: Basu's theorem (Topic 16 §16.9) gives X̄ ⊥⊥ S². This independence is what makes the t-ratio have a clean distribution:

  • Numerator √n(X̄ − μ₀)/σ ~ N(0, 1)
  • Denominator S/σ = √(χ²_9/(n − 1))
  • And they are independent.

That's exactly the defining construction of Student's t_9. Without Basu's independence, the ratio has a messy joint distribution that depends on (μ, σ²). This is the foundational case: Basu gives X̄ ⊥⊥ S², the t-ratio has the t₉ distribution exactly, and the heavier-than-Normal tails are visible at small n.

17.8 The χ²-test

Variance inference and goodness-of-fit are the two classical uses of the chi-squared test. The first tests H0:σ2=σ02H_0: \sigma^2 = \sigma_0^2 directly; the second tests whether observed category counts match expected counts under a hypothesized multinomial model (Pearson’s original formulation, 1900).

Definition 12 Chi-squared variance test; Pearson goodness-of-fit

Variance test. For iid Normal data X1,,XnN(μ,σ2)X_1, \ldots, X_n \sim \mathcal{N}(\mu, \sigma^2) with unknown μ\mu, the test statistic for H0:σ2=σ02H_0: \sigma^2 = \sigma_0^2 is

W=(n1)S2σ02.W = \frac{(n-1)\,S^2}{\sigma_0^2}.

Under H0H_0, Wχn12W \sim \chi^2_{n-1} exactly (Thm 6 below).

Pearson goodness-of-fit. For a categorical distribution with kk cells, observed counts O1,,OkO_1, \ldots, O_k (summing to nn), and expected counts E1,,EkE_1, \ldots, E_k under H0H_0, the Pearson statistic is

X2=j=1k(OjEj)2Ej.X^2 = \sum_{j=1}^{k} \frac{(O_j - E_j)^2}{E_j}.

Under H0H_0, and assuming no expected count is too small (rule of thumb: Ej5E_j \ge 5), X2dχk1r2X^2 \xrightarrow{d} \chi^2_{k - 1 - r} where rr is the number of parameters estimated from the data (zero if the null fixes all cell probabilities). The derivation is asymptotic, based on the multinomial CLT and continuous mapping.

Theorem 6 Null distribution of the variance statistic

For iid N(μ,σ02)\mathcal{N}(\mu, \sigma_0^2) data under H0:σ2=σ02H_0: \sigma^2 = \sigma_0^2, the test statistic W=(n1)S2/σ02W = (n-1)\,S^2/\sigma_0^2 has distribution χn12\chi^2_{n-1} exactly.

Proof [show]

Step 1 — reduce to the standard-Normal form. By definition,

W=(n1)S2σ02=1σ02i=1n(XiXˉ)2.W = \frac{(n-1)\,S^2}{\sigma_0^2} = \frac{1}{\sigma_0^2}\sum_{i=1}^n (X_i - \bar X)^2.

Let Yi=(Xiμ)/σ0Y_i = (X_i - \mu)/\sigma_0; under H0H_0, YiN(0,1)Y_i \sim \mathcal{N}(0, 1) iid. Then XiXˉ=σ0(YiYˉ)X_i - \bar X = \sigma_0(Y_i - \bar Y), so

W=i=1n(YiYˉ)2.W = \sum_{i=1}^n (Y_i - \bar Y)^2.

The problem is reduced to: what is the distribution of (YiYˉ)2\sum(Y_i - \bar Y)^2 when the YiY_i are iid standard Normal?

Step 2 — orthogonal decomposition of the sum of squares. Expand and collect:

i=1nYi2=i=1n[(YiYˉ)+Yˉ]2=i=1n(YiYˉ)2+2Yˉi=1n(YiYˉ)+nYˉ2.\sum_{i=1}^n Y_i^2 = \sum_{i=1}^n \bigl[(Y_i - \bar Y) + \bar Y\bigr]^2 = \sum_{i=1}^n (Y_i - \bar Y)^2 + 2\bar Y\sum_{i=1}^n (Y_i - \bar Y) + n\bar Y^2.

The middle sum is zero because (YiYˉ)=0\sum(Y_i - \bar Y) = 0 identically. So

i=1nYi2=W+nYˉ2.\sum_{i=1}^n Y_i^2 = W + n\bar Y^2.

Step 3 — distribute the chi-squared degrees of freedom. The left side: Yi2χn2\sum Y_i^2 \sim \chi^2_n because YiY_i iid N(0,1)\mathcal{N}(0, 1) (Topic 6 §6.6, sum of squares of iid standard Normals). The last term on the right: nYˉ2=(nYˉ)2n\bar Y^2 = (\sqrt n\,\bar Y)^2, and nYˉN(0,1)\sqrt n\,\bar Y \sim \mathcal{N}(0, 1) under iid standard-Normal data, so nYˉ2χ12n\bar Y^2 \sim \chi^2_1.

Step 4 — independence + MGF additivity give the split. By Basu’s theorem (as in Proof 3 Step 4), Yˉ ⁣ ⁣ ⁣(YiYˉ)2\bar Y \perp\!\!\!\perp \sum(Y_i - \bar Y)^2. Hence the decomposition

i=1nYi2=W+nYˉ2\sum_{i=1}^n Y_i^2 = W + n\bar Y^2

writes a χn2\chi^2_n variable as the sum of two independent non-negative random variables, one of which (nYˉ2n\bar Y^2) is χ12\chi^2_1. By the additivity of independent chi-squared variables (Topic 6 §6.6 via the MGF argument: the χa2\chi^2_a MGF is (12t)a/2(1 - 2t)^{-a/2}, so products of MGFs sum the degrees of freedom), the other term must be χn12\chi^2_{n-1}:

W=i=1n(YiYˉ)2χn12.W = \sum_{i=1}^n (Y_i - \bar Y)^2 \sim \chi^2_{n-1}.

∎ — using the χn2\chi^2_n representation of Zi2\sum Z_i^2 (Topic 6), Basu’s independence (Topic 16 Thm 7), and MGF additivity for independent chi-squareds.

Example 14 One-sample variance test — process control

A manufacturing process specifies σ02=4\sigma_0^2 = 4 (mm²) for a critical dimension. A sample of n=15n = 15 units has s2=6.8s^2 = 6.8. Testing H0:σ2=4H_0: \sigma^2 = 4 vs H1:σ2>4H_1: \sigma^2 > 4 (right-tailed, since excess variance is the quality concern):

W=146.84=23.8.W = \frac{14 \cdot 6.8}{4} = 23.8.

Right-tailed p-value from χ142\chi^2_{14}: p=1Fχ142(23.8)0.048p = 1 - F_{\chi^2_{14}}(23.8) \approx 0.048. We reject at α=0.05\alpha = 0.05 (narrowly) — evidence that the process variance exceeds the specification. Contrast with the two-sided version: for a two-sided variance test at α=0.05\alpha = 0.05, we use the equal-tailed p-value 2min(F,1F)=20.048=0.0962\,\min(F, 1 - F) = 2 \cdot 0.048 = 0.096 — we would not reject. The asymmetry of the χ2\chi^2 density makes the two-sided variance test less powerful than the one-sided version against one-sided alternatives.
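The p-value is easy to verify by hand because the χ² right tail has a closed form at even degrees of freedom (the Poisson-tail identity); chi2_sf_even below is an illustrative helper:

```python
from math import exp

def chi2_sf_even(x, df):
    """Right tail P(chi²_df > x) for even df via the Poisson-tail identity:
    P(chi²_{2m} > x) = exp(-x/2) * sum_{j=0}^{m-1} (x/2)^j / j!"""
    assert df % 2 == 0
    lam, term, total = x / 2, 1.0, 1.0
    for j in range(1, df // 2):
        term *= lam / j
        total += term
    return exp(-lam) * total

n, s2, sigma0_sq = 15, 6.8, 4.0
W = (n - 1) * s2 / sigma0_sq       # 23.8
p = chi2_sf_even(W, n - 1)         # ≈ 0.048, right-tailed
```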

Example 15 Pearson χ² goodness-of-fit preview

A die is rolled n=120n = 120 times; the six face counts are O=(18,22,17,25,19,19)O = (18, 22, 17, 25, 19, 19). Under H0H_0: the die is fair (Ek=20E_k = 20 for all kk):

X2=k=16(Ok20)220=4+4+9+25+1+120=4420=2.2.X^2 = \sum_{k=1}^{6} \frac{(O_k - 20)^2}{20} = \frac{4 + 4 + 9 + 25 + 1 + 1}{20} = \frac{44}{20} = 2.2.

Right-tailed p-value from χ52\chi^2_5 (df =k10=5= k - 1 - 0 = 5, no parameters estimated): p0.82p \approx 0.82. The observed counts are easily consistent with a fair die. Note that the χ2\chi^2 asymptotic here rests on the multinomial CLT, and degrades when some EkE_k are small (<5< 5 is the classical rule of thumb); Lehmann-Romano Ch. 14 covers small-EE corrections.
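A small sketch checks the die example end to end; chi2_sf is an illustrative helper implementing the standard χ² tail recurrence S_{k+2}(x) = S_k(x) + (x/2)^{k/2} e^{-x/2} / Γ(k/2 + 1):

```python
from math import erfc, exp, gamma, sqrt

def chi2_sf(x, df):
    """Right tail P(chi²_df > x), built up two degrees of freedom at a time
    from S_1(x) = erfc(sqrt(x/2)) or S_2(x) = exp(-x/2)."""
    if df % 2 == 0:
        s, k = exp(-x / 2), 2
    else:
        s, k = erfc(sqrt(x / 2)), 1
    while k < df:
        s += (x / 2) ** (k / 2) * exp(-x / 2) / gamma(k / 2 + 1)
        k += 2
    return s

observed = [18, 22, 17, 25, 19, 19]
expected = [20] * 6
x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))   # 2.2
p = chi2_sf(x2, df=5)                                            # ≈ 0.82
```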

Remark 9 The F-test is the variance-ratio cousin — full treatment in Linear Regression

For two independent iid Normal samples with possibly different variances, the ratio F=S12/S22F = S_1^2/S_2^2 has an Fn11,n21F_{n_1-1, n_2-1} null distribution under H0:σ12=σ22H_0: \sigma_1^2 = \sigma_2^2. This is the F-test for equality of variances, and it is the direct generalization of the one-sample variance test to two samples. The F-test also underlies the analysis of variance (ANOVA) and the regression FF-test for joint significance of coefficients. The full treatment belongs to Linear Regression (Track 6), where the Normal linear model gives the F-test its natural home — see §21.8 Thm 9 (F-test-as-Wilks) and Example 9 (one-way ANOVA).

Two panels. Left: null distribution of W = (n−1)S²/σ₀² — MC histogram for n = 15, σ = σ₀ = 1, overlaid with exact χ²_14 density, two-sided rejection region shaded at 2.5% / 97.5% quantiles (note the asymmetry — χ² is not symmetric). Right: goodness-of-fit Pearson χ² preview — 4-category observed-vs-expected bar chart, computed statistic, χ²_3 null density.

17.9 Asymptotic Tests: Wald, Score, LRT

The z-test, t-test, and χ2\chi^2-test all rely on special distributional assumptions (Normality and known families) to give exact null distributions. For general parametric models, three asymptotic tests built directly from the likelihood are the workhorses. They agree to first order under H0H_0 and diverge in finite-sample power — a divergence that Topic 18 §18.8 treats in full.

Definition 13 Wald test

For a scalar parameter θ\theta with MLE θ^n\hat\theta_n and Fisher information I(θ)I(\theta) (Topic 13 Def 13.6), the Wald test statistic for H0:θ=θ0H_0: \theta = \theta_0 is

Wn=n(θ^nθ0)2I(θ^n).W_n = n\,(\hat\theta_n - \theta_0)^2\,I(\hat\theta_n).

Intuition: WnW_n is the squared standardized distance of the MLE from the null, scaled by the observed information. Under H0H_0 and standard regularity conditions, Wndχ12W_n \xrightarrow{d} \chi^2_1.

For vector-valued θRk\theta \in \mathbb{R}^k, the Wald statistic generalizes to Wn=n(θ^nθ0)I(θ^n)(θ^nθ0)W_n = n\,(\hat\theta_n - \theta_0)^\top I(\hat\theta_n)\,(\hat\theta_n - \theta_0), asymptotically χk2\chi^2_k.

Definition 14 Score (Rao) test

For the same setup, the score test (also called the Rao or Lagrange-multiplier test) uses the score function U(θ)=θ(θ)U(\theta) = \partial_\theta \ell(\theta) evaluated at the null:

Sn=U(θ0)2nI(θ0).S_n = \frac{U(\theta_0)^2}{n\,I(\theta_0)}.

Under H0H_0 and regularity, Sndχ12S_n \xrightarrow{d} \chi^2_1. The score test has the desirable property that it only requires fitting the null model (no unrestricted MLE is needed) — a practical advantage in GLM and logistic regression applications where the null fit is much easier than the full fit.

Definition 15 Likelihood-ratio test

The likelihood-ratio statistic is

2logΛn=2[(θ0)(θ^n)]=2[(θ^n)(θ0)].-2\log\Lambda_n = -2\bigl[\ell(\theta_0) - \ell(\hat\theta_n)\bigr] = 2\bigl[\ell(\hat\theta_n) - \ell(\theta_0)\bigr].

It is the log-likelihood difference between the restricted fit (at θ0\theta_0) and the unrestricted fit (at the MLE θ^n\hat\theta_n). Under H0H_0 and Wilks’ regularity conditions, 2logΛndχk2-2\log\Lambda_n \xrightarrow{d} \chi^2_k where kk is the number of restricted parameters (Wilks 1938; full proof in Topic 18).

Theorem 7 Asymptotic null distribution of Wald, score, LRT

Under standard regularity conditions (the MLE is consistent and asymptotically Normal per Topic 14 Thm 14.3; the Fisher information is continuous and positive at θ0\theta_0; the log-likelihood is sufficiently smooth), the three statistics all have asymptotic null distribution χ12\chi^2_1:

Wn,  Sn,  2logΛn  d  χ12under H0.W_n, \; S_n, \; -2\log\Lambda_n \;\xrightarrow{d}\; \chi^2_1 \quad \text{under } H_0.

The Wald case follows inline from Topic 14 Thm 14.3 (MLE asymptotic normality) by Slutsky and continuous mapping — derived below. The score and LRT cases are stated; Wilks’ full proof of the LRT case is Topic 18’s territory.

Derivation of the Wald case (inline). By Topic 14 Thm 14.3, under H0H_0,

n(θ^nθ0)  d  N(0,I(θ0)1).\sqrt n\,(\hat\theta_n - \theta_0) \;\xrightarrow{d}\; \mathcal{N}(0,\,I(\theta_0)^{-1}).

Therefore

nI(θ0)(θ^nθ0)  d  N(0,1).\sqrt{n\,I(\theta_0)}\,(\hat\theta_n - \theta_0) \;\xrightarrow{d}\; \mathcal{N}(0, 1).

By consistency of θ^n\hat\theta_n and continuity of I()I(\cdot), I(θ^n)PI(θ0)I(\hat\theta_n) \xrightarrow{P} I(\theta_0), so I(θ^n)/I(θ0)P1\sqrt{I(\hat\theta_n)/I(\theta_0)} \xrightarrow{P} 1. By Slutsky,

nI(θ^n)(θ^nθ0)  d  N(0,1).\sqrt{n\,I(\hat\theta_n)}\,(\hat\theta_n - \theta_0) \;\xrightarrow{d}\; \mathcal{N}(0, 1).

Squaring and applying the continuous mapping theorem (Topic 9):

Wn=n(θ^nθ0)2I(θ^n)  d  χ12under H0.W_n = n\,(\hat\theta_n - \theta_0)^2\,I(\hat\theta_n) \;\xrightarrow{d}\; \chi^2_1 \quad \text{under } H_0.

This two-line argument from Topic 14 is all that’s needed for the Wald case. The score case uses a similar argument applied to the score function (one Taylor expansion at θ0\theta_0); the LRT case (Wilks) requires a more delicate quadratic-approximation argument that Topic 18 develops in full.

Example 16 Wald, score, and LRT for the Bernoulli H₀: p = p₀

Let X1,,XnX_1, \ldots, X_n iid Bernoulli(p)(p), p^=Xˉ\hat p = \bar X. For H0:p=p0H_0: p = p_0:

Wald:

Wn=n(p^p0)2p^(1p^).W_n = \frac{n\,(\hat p - p_0)^2}{\hat p\,(1 - \hat p)}.

Variance estimate at the MLE: p^(1p^)\hat p(1 - \hat p).

Score:

Sn=n(p^p0)2p0(1p0).S_n = \frac{n\,(\hat p - p_0)^2}{p_0\,(1 - p_0)}.

Variance estimate at the null: p0(1p0)p_0(1 - p_0).

LRT:

2logΛn=2n ⁣[p^log ⁣p^p0+(1p^)log ⁣1p^1p0].-2\log\Lambda_n = 2n\!\left[\hat p \log\!\frac{\hat p}{p_0} + (1 - \hat p)\log\!\frac{1 - \hat p}{1 - p_0}\right].

Under H0H_0, all three are asymptotically χ12\chi^2_1 (Wilks, Wald, Rao). The three statistics differ by the choice of variance estimate (Wald uses p^\hat p, score uses p0p_0, LRT uses an information-theoretic hybrid via the KL divergence). At p0=0.5p_0 = 0.5 and n=100n = 100, MC simulation under H0H_0 (5000 replications, testing.ts test #17) confirms all three have empirical mean 1\approx 1 and empirical quantiles matching χ12\chi^2_1.
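The three Bernoulli statistics side by side, on hypothetical data (60 successes in 100 trials against p₀ = 0.5, numbers chosen for illustration); bernoulli_trio is an illustrative helper:

```python
from math import log

def bernoulli_trio(x, n, p0):
    """Wald, score, and LRT statistics for H0: p = p0 from x successes in n trials."""
    p_hat = x / n
    wald = n * (p_hat - p0) ** 2 / (p_hat * (1 - p_hat))   # variance at the MLE
    score = n * (p_hat - p0) ** 2 / (p0 * (1 - p0))        # variance at the null
    lrt = 2 * n * (p_hat * log(p_hat / p0)
                   + (1 - p_hat) * log((1 - p_hat) / (1 - p0)))
    return wald, score, lrt

w, s, l = bernoulli_trio(60, 100, 0.5)   # ≈ 4.17, 4.00, 4.03
```

The three values are close but not equal; that gap is the finite-sample divergence Topic 18 quantifies, and it shrinks as n grows.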

Remark 10 The three tests agree to first order under H₀

A short Taylor expansion of the log-likelihood around θ^n\hat\theta_n reveals that all three statistics are asymptotically equivalent under H0H_0:

Wn(2logΛn)=OP(n1/2),Sn(2logΛn)=OP(n1/2).W_n - (-2\log\Lambda_n) = O_P(n^{-1/2}), \qquad S_n - (-2\log\Lambda_n) = O_P(n^{-1/2}).

At finite nn they differ, and the differences matter for power at alternatives close to the null; they can also matter for finite-sample Type I error in small samples. Topic 18 treats the finite-sample divergence quantitatively.

Remark 11 LRT is parameterization-invariant; Wald is not

One practical advantage of the LRT is that it is invariant under reparameterization. If we replace θ\theta with η=g(θ)\eta = g(\theta) (any smooth one-to-one transformation), the log-likelihood ratio is unchanged — the maximized likelihoods are the same, just at different parameter values. Hence the LRT statistic and its p-value are identical under reparameterization.

The Wald statistic, by contrast, is not invariant: Wnθ=n(θ^θ0)2I(θ^)W_n^\theta = n\,(\hat\theta - \theta_0)^2 I(\hat\theta) and Wnη=n(η^η0)2Iη(η^)W_n^\eta = n\,(\hat\eta - \eta_0)^2 I^\eta(\hat\eta) can differ substantially when gg is nonlinear — and their asymptotic χ12\chi^2_1 distribution is only approximately right at either end. This is a common source of confusion in practice; for logistic regression (natural parameter θ=logit(p)\theta = \operatorname{logit}(p) vs the original parameter pp), Wald statistics on pp and on logit(p)\operatorname{logit}(p) can give meaningfully different p-values. Practitioners often prefer the LRT for this reason.

The score statistic is also invariant under smooth reparameterization — the Jacobian factors in the score and in the Fisher information cancel — so in practice it behaves similarly to the LRT.

The concrete logit-vs-raw Bernoulli example — where Wald p-values differ between parameterizations while the LRT’s p-value is identical — is worked out in Topic 18 §18.8.

Three panels. Left: overlaid null-distribution histograms of W_n (Wald, amber), S_n (Score, purple), −2log Λ_n (LRT, green) for Bernoulli H₀: p = 0.5 at n = 30, 10000 MC runs, common χ²_1 density overlaid. Center: algebraic formulas with Bernoulli specializations. Right: under H_A (true p = 0.65), overlay of the three alternative distributions — similar means, slightly different spreads, the first glimpse of finite-sample power differences.

Wald / Score / LRT · §17.9 — interactive companion. Formulas for Bernoulli H₀: p = 0.5:

  Wald: W_n = n(p̂ − p₀)² / [p̂(1 − p̂)]
  Score: S_n = n(p̂ − p₀)² / [p₀(1 − p₀)]
  LRT: −2 log Λ = 2n [p̂ log(p̂/p₀) + (1 − p̂) log((1 − p̂)/(1 − p₀))]

In the simulation, the rejection rates at α = 0.05 are ≈ 0.062 for Wald, Score, and LRT alike (max pairwise diff ≈ 0.000). Under H₀ all three should approximate α = 0.05, and differences shrink as n grows — Wilks, Rao, Wald all agree asymptotically (Topic 18 proves the LRT case in full). This is the cleanest algebra: the three statistics differ only by the variance-estimate plug-in (p̂ vs p₀). Under H₀, all three converge to χ²₁; divergence under H_A shows finite-sample power differences.

17.10 Duality with Confidence Intervals

Every hypothesis test gives rise to a confidence interval, and every confidence interval gives rise to a family of hypothesis tests. The duality is exact. Topic 19 will develop the full theory; here we preview the construction.

Remark 12 Test inversion: the non-rejection set is a confidence interval

Fix a level α\alpha. For each candidate parameter value θ0\theta_0, construct the level-α\alpha test of H0:θ=θ0H_0: \theta = \theta_0. The non-rejection set — the collection of θ0\theta_0 values for which the observed data does not lead to rejection — is a (1α)(1 - \alpha) confidence interval for θ\theta:

C(X)={θ0:the level-α test of H0:θ=θ0 does not reject}.C(X) = \{\theta_0 : \text{the level-}\alpha\text{ test of } H_0: \theta = \theta_0 \text{ does not reject}\}.

The coverage Pθ(θC(X))1αP_\theta(\theta \in C(X)) \ge 1 - \alpha for every θ\theta: when θ\theta is the true value, the event θC(X)\theta \notin C(X) is precisely a Type I error of the level-α\alpha test of H0:θ0=θH_0: \theta_0 = \theta, which has probability at most α\alpha. Conversely, any (1α)(1 - \alpha) confidence interval C(X)C(X) gives a family of level-α\alpha tests: reject H0:θ=θ0H_0: \theta = \theta_0 iff θ0C(X)\theta_0 \notin C(X).

The duality is more than a formal correspondence — it is a concrete construction. Inverting a z-test gives the z-interval; inverting a t-test gives the t-interval; inverting an LRT gives the profile-likelihood interval; inverting a score test gives the score interval. Topic 19 develops all four constructions.

Example 17 Z-interval and t-interval as test inversions

Z-interval. For iid Normal data with known σ2\sigma^2, the level-α\alpha two-sided z-test fails to reject H0:μ=μ0H_0: \mu = \mu_0 iff n(Xˉμ0)/σzα/2|\sqrt n(\bar X - \mu_0)/\sigma| \le z_{\alpha/2}, i.e., iff

Xˉzα/2σn    μ0    Xˉ+zα/2σn.\bar X - z_{\alpha/2}\,\frac{\sigma}{\sqrt n} \;\le\; \mu_0 \;\le\; \bar X + z_{\alpha/2}\,\frac{\sigma}{\sqrt n}.

The non-rejection set is Xˉ±zα/2σ/n\bar X \pm z_{\alpha/2}\,\sigma/\sqrt n — the textbook (1α)(1 - \alpha) z-confidence interval.
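The construction is concrete enough to execute: the interval endpoints are exactly the boundary of the non-rejection set. A sketch with hypothetical data (x̄ = 103, σ = 10, n = 25); both helpers are illustrative:

```python
from math import sqrt

def z_interval(xbar, sigma, n, z=1.96):
    """(1 − α) z-interval: the non-rejection set of the two-sided z-test."""
    half = z * sigma / sqrt(n)
    return xbar - half, xbar + half

def z_test_rejects(mu0, xbar, sigma, n, z=1.96):
    """Level-α two-sided z-test of H0: mu = mu0."""
    return abs(sqrt(n) * (xbar - mu0) / sigma) > z

lo, hi = z_interval(103, 10, 25)                  # 103 ± 3.92 → (99.08, 106.92)
inside = not z_test_rejects(103.9, 103, 10, 25)   # μ₀ inside the interval: not rejected
outside = z_test_rejects(107.5, 103, 10, 25)      # μ₀ outside the interval: rejected
```

Scanning z_test_rejects over a grid of μ₀ values recovers the interval endpoints to grid precision, which is the duality made literal.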

T-interval. For iid Normal data with unknown σ2\sigma^2, the analogous inversion of the t-test gives

Xˉ±tn1,α/2Sn,\bar X \pm t_{n-1,\,\alpha/2}\,\frac{S}{\sqrt n},

using the (1α/2)(1 - \alpha/2)-quantile of the tn1t_{n-1} distribution. For n=30n = 30 and α=0.05\alpha = 0.05, t29,0.0252.045t_{29, 0.025} \approx 2.045 — slightly larger than the z0.025=1.96z_{0.025} = 1.96 of the z-interval, reflecting the extra variability introduced by estimating σ\sigma.

The inversion also works for the Wald, score, and LRT tests in §17.9, giving three different (1α)(1 - \alpha) intervals for a general parameter. Topic 19 develops the full set; for now, the preview is that the same machinery we just built handles both “is θ=θ0\theta = \theta_0?” and “what values of θ\theta are plausible?”.

Remark 13 Pivotal quantities and Wald / score / LRT intervals

The t-interval Xˉ±tn1,α/2S/n\bar X \pm t_{n-1,\alpha/2}\,S/\sqrt n works because the quantity T=n(Xˉμ)/ST = \sqrt n(\bar X - \mu)/S is a pivot — its distribution (tn1t_{n-1}) does not depend on any unknown parameter. The confidence interval is formed by inverting a distributional statement about the pivot. Pivots are available for the location-scale families (Normal, exponential with known rate, etc.) and give exact small-sample confidence intervals.

When pivots are not available (most general parametric models), we invert asymptotic tests instead: the Wald interval uses θ^±zα/2/nI(θ^)\hat\theta \pm z_{\alpha/2} / \sqrt{n\,I(\hat\theta)}; the score interval uses the set {θ:Sn(θ)zα/2}\{\theta : |S_n(\theta)| \le z_{\alpha/2}\}; the LRT interval uses the set {θ:2logΛn(θ)χ1,α2}\{\theta : -2\log\Lambda_n(\theta) \le \chi^2_{1,\alpha}\}. All three have coverage 1α1 - \alpha asymptotically; their finite-sample coverage can differ substantially — the Wald interval is notoriously under-covering in small samples near the boundary (e.g., for binomial with pp near 0 or 1; the Wilson interval, based on the score test, is preferred). Full theory, with coverage diagnostics and comparisons, is Confidence Intervals & Duality (Topic 19).
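The boundary problem is visible in a few lines. A sketch comparing the two intervals at a small-sample boundary case (2 successes in 20 trials, illustrative numbers):

```python
from math import sqrt

def wald_interval(x, n, z=1.96):
    """Wald interval: p̂ ± z·sqrt(p̂(1 − p̂)/n)."""
    p = x / n
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half

def wilson_interval(x, n, z=1.96):
    """Score-test inversion for a binomial proportion (the Wilson interval)."""
    p = x / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

wald_lo, wald_hi = wald_interval(2, 20)     # lower endpoint < 0: impossible proportion
wil_lo, wil_hi = wilson_interval(2, 20)     # stays inside [0, 1]
```

The Wald interval spills below zero while the Wilson interval stays in [0, 1], a concrete instance of the under-coverage near the boundary described above.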

Two panels. Left: test inversion diagram — horizontal axis μ₀; for each μ₀, run the level-α test; shade the non-rejection set of μ₀ where |√n(x̄ − μ₀)/S| ≤ t at level α/2 — this shaded interval IS x̄ ± t at level α/2 times S/√n. Right: banner previewing Topic 19 — "Every hypothesis test gives a CI; every CI gives a family of hypothesis tests. The duality is exact."

17.11 Where the Framework Falls Short

Classical hypothesis testing is a tightly specified decision procedure, and it fails to answer several questions that practitioners often want it to answer. An honest account lists the failure modes.

Remark 14 Hypothesis testing is not model selection

A p-value tests whether the data are consistent with a specific null model. It does not rank competing models, and it does not quantify which model predicts better out of sample. An analyst who runs two separate hypothesis tests (one on each model) and picks the one with smaller p-value has not performed model comparison — they have performed two inconsistent Type I error controls and then post-selected. For genuine model comparison, use cross-validation, AIC, BIC, or held-out likelihood. Cross-validation is the ML-native standard; AIC and BIC are the frequentist and Bayesian large-sample approximations to out-of-sample predictive performance. The contrast is developed in Topic 24’s CV/IC framework and in the formalML “Model comparison” notes.

Remark 15 Bayes factors as the Bayesian alternative

The Bayesian analog of a hypothesis test is the Bayes factor

BF10=P(dataH1)P(dataH0),\mathrm{BF}_{10} = \frac{P(\text{data} \mid H_1)}{P(\text{data} \mid H_0)},

the ratio of marginal likelihoods under the two hypotheses. A Bayes factor of 10 is commonly interpreted as “strong evidence for H1H_1”; combined with a prior odds P(H1)/P(H0)P(H_1)/P(H_0), it gives the posterior odds. Bayes factors are more consistent with Fisher’s “evidence” framing and less vulnerable to the Ioannidis critique (Remark 6), but they require specifying priors for the competing hypotheses, which introduces its own modelling choices. Bayesian Foundations (Topic 25) introduces the marginal likelihood m(y)m(\mathbf{y}) and names Bayes factors; the full Bayes-factor framework and BMA are Topic 27’s territory.

Remark 16 Multiple testing inflates Type I error

Running mm independent tests, each at level α\alpha, produces an overall false-positive rate 1(1α)m1 - (1 - \alpha)^m — which exceeds α\alpha rapidly: at m=20m = 20 and α=0.05\alpha = 0.05, the overall rate is 0.64\approx 0.64. Modern A/B testing platforms run thousands of simultaneous tests; naive per-test level control would give mostly false positives. Bonferroni correction (test each at level α/m\alpha/m to control family-wise error at α\alpha) is the classical fix; Benjamini-Hochberg’s false discovery rate (FDR) controls the proportion of rejections that are false positives, with less power loss than Bonferroni at large mm. The full treatment is Multiple Testing & False Discovery (Topic 20), which addresses the “garden of forking paths” concern flagged in Remark 6.
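The inflation arithmetic, as a short sketch:

```python
def fwer(alpha, m):
    """Family-wise error rate for m independent level-α tests."""
    return 1 - (1 - alpha) ** m

rate = fwer(0.05, 20)              # ≈ 0.64: a false positive is more likely than not
bonferroni = fwer(0.05 / 20, 20)   # Bonferroni-corrected per-test level: back below 0.05
```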

Remark 17 Pre-registration and adversarial collaboration

Some replication-crisis concerns are not technical at all. A researcher who tests multiple hypotheses but reports only the significant one has inflated the effective Type I rate, and no multiple-testing correction can recover the right level if the correction isn’t applied. The remedy is pre-registration: announcing the hypotheses, the analysis plan, and the stopping rule before seeing the data. Adversarial collaboration extends this by recruiting a skeptic as a co-author, tasked with anticipating and pre-committing to the interpretation of each possible outcome. Both are non-technical but important complements to the multiple-testing machinery of Topic 20.

17.12 Summary & Forward Look

Remark 18 Topic 17 cheat sheet

A compact map of the test families we have met.

| Test | Null $H_0$ | Assumes | Statistic | Null distribution | Exact or asymptotic? |
|---|---|---|---|---|---|
| z-test (one-sample) | $\mu = \mu_0$ | Normal, $\sigma$ known | $\sqrt{n}(\bar X - \mu_0)/\sigma$ | $\mathcal{N}(0, 1)$ | Exact (Normal); asymptotic (general) via CLT |
| z-test (two-sample) | $\mu_1 = \mu_2$ | Both Normal, $\sigma_1, \sigma_2$ known | $(\bar X - \bar Y)/\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}$ | $\mathcal{N}(0, 1)$ | Exact under Normal; asymptotic via CLT |
| Two-proportion z | $p_A = p_B$ | Large $n_A$, $n_B$ | $(\hat p_A - \hat p_B)/\mathrm{SE}_{\text{pool}}$ | $\mathcal{N}(0, 1)$ | Asymptotic |
| Binomial exact | $p = p_0$ | Bernoulli | $X = \sum X_i$ | $\mathrm{Binomial}(n, p_0)$ | Exact (discrete) |
| t-test (one-sample) | $\mu = \mu_0$ | Normal, $\sigma$ unknown | $\sqrt{n}(\bar X - \mu_0)/S$ | $t_{n-1}$ | Exact via Basu |
| t-test (two-sample pooled) | $\mu_1 = \mu_2$ | Both Normal, $\sigma_1 = \sigma_2$ unknown | Pooled $t$ | $t_{n_1 + n_2 - 2}$ | Exact under equal variance |
| Welch t | $\mu_1 = \mu_2$ | Both Normal, unequal variances | Welch $t$ | $t_{\hat\nu}$ (Satterthwaite) | Asymptotic |
| $\chi^2$ variance | $\sigma^2 = \sigma_0^2$ | Normal | $(n-1)S^2/\sigma_0^2$ | $\chi^2_{n-1}$ | Exact |
| Pearson $\chi^2$ (GoF) | Fixed cell probs | Multinomial, $E_k \ge 5$ | $\sum_k (O_k - E_k)^2 / E_k$ | $\chi^2_{k-1-r}$ | Asymptotic |
| Wald | $\theta = \theta_0$ | Regular MLE | $n(\hat\theta - \theta_0)^2 I(\hat\theta)$ | $\chi^2_1$ | Asymptotic |
| Score (Rao) | $\theta = \theta_0$ | Regular MLE | $U(\theta_0)^2/[n I(\theta_0)]$ | $\chi^2_1$ | Asymptotic |
| LRT | $\theta = \theta_0$ | Wilks’ regularity | $-2[\ell(\theta_0) - \ell(\hat\theta)]$ | $\chi^2_k$ | Asymptotic |

The “exact” entries (the z-tests with $\sigma$ known, the binomial test, and the t-tests and $\chi^2$ variance test under Normality) owe their exactness to the special structure of the Normal family; the t-tests in particular rely on Basu’s theorem (Topic 16 §16.9) giving $\bar X \perp\!\!\!\perp S^2$.

Testing cheat sheet — tidy one-page table summarizing each test family: null, assumptions, statistic, null distribution, and whether exact or asymptotic. Matches the Remark 18 table above in rendered form.
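The bottom three rows of the cheat sheet can be made concrete in a few lines. Here is a sketch (the scenario and helper name are illustrative, not from the text) computing all three statistics for a binomial proportion, where the per-observation Fisher information is $I(p) = 1/(p(1-p))$:

```python
from math import log

def binomial_tests(x, n, p0):
    """Wald, score, and LRT statistics for H0: p = p0 (each ~ chi^2_1).

    Bernoulli model: MLE p_hat = x/n, Fisher information I(p) = 1/(p(1-p)).
    """
    p_hat = x / n
    # Wald: curvature evaluated at the MLE.
    wald = n * (p_hat - p0) ** 2 / (p_hat * (1 - p_hat))
    # Score (Rao): everything evaluated at the null, no MLE fit needed.
    score = n * (p_hat - p0) ** 2 / (p0 * (1 - p0))
    # LRT: twice the log-likelihood gap between p_hat and p0.
    lrt = 2 * (x * log(p_hat / p0) + (n - x) * log((1 - p_hat) / (1 - p0)))
    return wald, score, lrt

# Illustrative data: 60 successes in 100 trials, H0: p = 0.5.
w, s, l = binomial_tests(60, 100, 0.5)
print(f"Wald = {w:.3f}, score = {s:.3f}, LRT = {l:.3f}")
# All land near 4, each past the chi^2_1 5% cutoff of 3.841: reject H0.
```

In this example the ordering is Wald > LRT > score; the three agree asymptotically under $H_0$, and their finite-sample power comparison is Topic 18’s business.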

Remark 19 Where Track 5 goes next

Topic 17 set up the framework. The next three topics complete Track 5:

  • Likelihood-Ratio Tests & Neyman-Pearson (Topic 18) — the optimality theory. Neyman-Pearson’s lemma: the LRT is UMP for simple-vs-simple. Uniformly most powerful tests, monotone likelihood ratio families, Wilks’ theorem proved in full, the three asymptotic tests’ finite-sample power compared.
  • Confidence Intervals & Duality (Topic 19) — the duality previewed in §17.10 becomes the construction. Pivotal quantities, Wald / score / LRT intervals, coverage diagnostics, the Wilson interval for binomial proportions (which fixes the Wald boundary problem from Remark 13).
  • Multiple Testing & False Discovery (Topic 20) — family-wise error (Bonferroni, Holm, Šidák, Hochberg), false discovery rate (Benjamini-Hochberg with full proof, Benjamini-Yekutieli for arbitrary dependence, Storey adaptive q-values), simultaneous CIs dualizing the FWER procedures, and the replication-crisis framing in quantitative terms. Track 5 closes here.

Beyond Track 5, the framework reappears in Linear Regression (F-tests §21.8 Thm 9, partial F-tests Example 8, ANOVA Example 9), Generalized Linear Models (Wald / score / LRT tests for GLM coefficients; the score test is the standard specification test), Bayesian Foundations (Topic 25) (posterior over $\theta$, conjugate priors, credible intervals, Bernstein–von Mises; Bayes factors named in §25.10 with full development deferred to Topic 27), and Nonparametric Inference (permutation tests, rank tests, Kolmogorov-Smirnov). Finally, forward to formalML: A/B testing platforms, where every deployed experimentation system runs variants of Topic 17’s machinery at scale, increasingly augmented with sequential-testing, variance-reduction (CUPED), and always-valid-inference extensions.

The shift from estimation to testing — from “best guess” to “decide” — is the conceptual move. Every topic in Track 5 and beyond builds on the scaffolding we put up here.


References

  1. Erich L. Lehmann & Joseph P. Romano. (2005). Testing Statistical Hypotheses (3rd ed.). Springer.
  2. George Casella & Roger L. Berger. (2002). Statistical Inference (2nd ed.). Duxbury.
  3. Ronald A. Fisher. (1922). On the Mathematical Foundations of Theoretical Statistics. Philosophical Transactions of the Royal Society A, 222, 309–368.
  4. Jerzy Neyman & Egon S. Pearson. (1933). On the Problem of the Most Efficient Tests of Statistical Hypotheses. Philosophical Transactions of the Royal Society A, 231, 289–337.
  5. ‘Student’ (William Sealy Gosset). (1908). The Probable Error of a Mean. Biometrika, 6(1), 1–25.
  6. Karl Pearson. (1900). On the Criterion that a Given System of Deviations from the Probable is Such that It Can Be Reasonably Supposed to Have Arisen from Random Sampling. Philosophical Magazine (5th series), 50, 157–175.
  7. Abraham Wald. (1943). Tests of Statistical Hypotheses Concerning Several Parameters When the Number of Observations is Large. Transactions of the American Mathematical Society, 54(3), 426–482.
  8. C. Radhakrishna Rao. (1948). Large Sample Tests of Statistical Hypotheses Concerning Several Parameters with Applications to Problems of Estimation. Mathematical Proceedings of the Cambridge Philosophical Society, 44(1), 50–57.
  9. Samuel S. Wilks. (1938). The Large-Sample Distribution of the Likelihood Ratio for Testing Composite Hypotheses. Annals of Mathematical Statistics, 9(1), 60–62.
  10. Ronald L. Wasserstein & Nicole A. Lazar. (2016). The ASA’s Statement on p-Values: Context, Process, and Purpose. The American Statistician, 70(2), 129–133.
  11. John P. A. Ioannidis. (2005). Why Most Published Research Findings Are False. PLoS Medicine, 2(8), e124.
  12. Andrew Gelman & Eric Loken. (2013). The Garden of Forking Paths: Why Multiple Comparisons Can Be a Problem, Even When There is No ‘Fishing Expedition’ or ‘p-Hacking’. Department of Statistics, Columbia University.