intermediate 60 min read · April 17, 2026

Likelihood-Ratio Tests & Neyman-Pearson

Optimality theory — the Neyman-Pearson lemma, the Karlin-Rubin UMP construction, Wilks' χ² theorem, and the first-order equivalence of the Wald, Score, and LRT tests.

18.1 From Framework to Optimality

Two tests have the same size. Both reject the null 5% of the time when the null is true. One rejects a true effect of size $\delta = 0.3$ with probability $0.4$; the other rejects with probability $0.6$. Which should we use? Topic 17 built a framework for valid testing but left this question open. This topic answers it: we characterize when a uniformly most powerful test exists (Neyman-Pearson, Karlin-Rubin), construct the likelihood-ratio test as a general-purpose procedure when it doesn't, and prove that the three asymptotic tests — Wald, Score, LRT — agree to first order under the null.

Two level-α tests, same size but different power curves; the dominant test's power is strictly higher in the right tail. Inset: at δ = 0.3, one test has power 0.4, the other 0.6 — a concrete instance of the optimality question.

Remark 1 Validity versus optimality — two different questions

A level-$\alpha$ test is valid if its Type I error is at most $\alpha$. That is not a hard standard to meet. The trivial "reject with probability $\alpha$ regardless of data" test is valid — and has power exactly $\alpha$ at every alternative. Every test we wrote down in Topic 17 (z-test, t-test, $\chi^2$ variance test, binomial exact) is valid for its parametric family. But validity does not single out which level-$\alpha$ test to use.

Optimality asks the harder question: among all valid level-$\alpha$ tests, which maximizes power? The answer depends on whether the hypotheses are simple or composite and on the structural geometry of the likelihood family. Where the answer exists — NP lemma, Karlin-Rubin — it is a uniqueness result, not merely an existence one. Where it does not — two-sided composite alternatives in most families — the LRT is the best general-purpose substitute, and we pay a price for that generality in the form of asymptotic rather than finite-sample optimality.

Remark 2 Four pillars, organized

Topic 18 is built around four results. The first two hold at every sample size; the last two are asymptotic.

  1. Neyman-Pearson lemma (§18.2). Simple-vs-simple optimality via a three-step indicator-function argument. Provable from integration alone — no convergence theory required.

  2. Karlin-Rubin theorem (§18.3). One-sided composite optimality via monotone likelihood ratio (MLR), using the NP lemma as an internal engine. The MLR families are exactly where “one test is most powerful at every alternative simultaneously” — i.e., UMP — becomes possible.

  3. Wilks' theorem (§18.6). The composite LRT's null distribution converges: $-2\log\Lambda_n \xrightarrow{d} \chi^2_k$, where $k$ is the number of parameter restrictions. Proved in 8 steps via Taylor expansion around $\hat\theta_n$, remainder control, observed-information-to-Fisher-information convergence, and MLE asymptotic normality (Topic 14 Thm 14.3).

  4. Three-tests equivalence (§18.7). Wald, Score, and LRT differ by $O_P(n^{-1/2})$ under $H_0$. All three collapse to the same quadratic form $nI(\theta_0)(\hat\theta_n - \theta_0)^2$ at leading order. §18.8 quantifies the finite-sample divergence the $o_P(1)$ hides.

The remaining sections then extend the theory in specific directions. §18.4 catalogues UMP tests in action; §18.5 formalizes the composite LRT; §18.8 proves reparameterization invariance; §18.9 characterizes local power via the non-central $\chi^2$; §18.10 points to the scope boundaries.

Remark 3 Scope boundary — scalar θ throughout, with pointers for the rest

Every proof in Topic 18 is written for scalar $\theta \in \mathbb{R}$. Vector-$\theta$ extensions — $k$-df Wilks, UMP unbiased, Hunt-Stein invariance, Chernoff 1954 boundary cases — are stated with pointers to Lehmann-Romano (LEH2005) and nothing more. This is not pedagogical convenience; it is a deliberate scope decision. The scalar case captures every main idea (the pointwise LR comparison of Proof 1, the MLR-yields-UMP argument of Proof 2, the quadratic approximation of Proof 3). The vector generalizations are mostly bookkeeping — matrix inverses in place of scalar divisions, quadratic forms $\mathbf{q}^\top \mathbf{I}(\theta_0)\mathbf{q}$ in place of $q^2 I(\theta_0)$, and a $\chi^2_k$ limit in place of $\chi^2_1$. Readers who need the vector results in production should consult LEH2005 Ch. 3 (vector UMP), Ch. 6 (invariance), and Ch. 12 (large-sample theory). Topic 19 and Topic 20 extend Track 5 with CI duality and multiple testing; vector-$\theta$ extensions to Wilks reappear organically in GLM inference on formalml.com/generalized-linear-models.

18.2 The Neyman-Pearson Lemma

Start with the cleanest case: both $H_0$ and $H_1$ are single points in parameter space. Is there a level-$\alpha$ test whose power at $\theta_1$ exceeds the power of every other level-$\alpha$ test? Yes, and the answer is the likelihood-ratio test.

Definition 1 Most-powerful test

A test $\varphi: \mathcal{X} \to [0, 1]$ (where $\varphi(x)$ is the probability of rejecting $H_0$ at observation $x$) is most powerful (MP) at level $\alpha$ against $H_1: \theta = \theta_1$ if

$$\varphi^* \in \arg\max_{\varphi:\, E_{\theta_0}[\varphi] \le \alpha} E_{\theta_1}[\varphi(X)].$$

That is, $\varphi^*$ maximizes the power at $\theta_1$ among all tests of size at most $\alpha$. The maximizer is essentially unique: any MP level-$\alpha$ test must agree with the Neyman-Pearson test of Theorem 1 except possibly on the boundary set where the two densities stand in the critical ratio.

Definition 2 Simple-vs-simple likelihood ratio Λ(x)

For simple $H_0: \theta = \theta_0$ vs simple $H_1: \theta = \theta_1$, the likelihood ratio at observation $x$ is

$$\Lambda(x) = \frac{f(x; \theta_1)}{f(x; \theta_0)}.$$

For iid data $X = (X_1, \ldots, X_n)$ with density $\prod_i f(x_i; \theta)$, the LR factorizes: $\Lambda(x) = \prod_i f(x_i; \theta_1)/f(x_i; \theta_0)$, and we typically work with $\log\Lambda(x) = \sum_i [\log f(x_i; \theta_1) - \log f(x_i; \theta_0)]$ for numerical stability.
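The log-sum form is what one implements in practice. A minimal sketch, assuming a generic logDensity callback (this is not the testing.ts API):

```ts
// Log likelihood ratio for iid data: sum of log-density differences.
// Summing logs avoids the under/overflow of multiplying n densities directly.
function logLikelihoodRatio(
  xs: number[],
  logDensity: (x: number, theta: number) => number,
  theta0: number,
  theta1: number,
): number {
  return xs.reduce((acc, x) => acc + logDensity(x, theta1) - logDensity(x, theta0), 0);
}

// Example: Normal(mu, 1) log-density.
const normalLogDensity = (x: number, mu: number) =>
  -0.5 * Math.log(2 * Math.PI) - 0.5 * (x - mu) ** 2;

// logLikelihoodRatio([0.1, 0.5, -0.2], normalLogDensity, 0, 1) returns log Λ(x).
```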

Remark 4 Notation — Λ(x) here vs Λₙ in Topic 17

The $\Lambda(x)$ above is distinct from the composite generalized likelihood ratio $\Lambda_n$ defined in Topic 17 §17.9 Def 15. The simple-vs-simple LR compares two specific parameter values; the composite $\Lambda_n = \sup_{\theta \in \Theta_0} L(\theta)/\sup_{\theta \in \Theta} L(\theta)$ compares the best null fit to the best full-model fit. They coincide when both hypotheses are singletons. §18.5 reintroduces the composite $\Lambda_n$, and §18.6 proves its $\chi^2$ asymptotic limit.

Theorem 1 Neyman-Pearson lemma

Let $X$ have density $f(x; \theta)$ with respect to a common $\sigma$-finite measure $\mu$. For testing simple $H_0: \theta = \theta_0$ vs simple $H_1: \theta = \theta_1$, fix $\alpha \in (0, 1)$ and let $k \ge 0$ be a threshold. The Neyman-Pearson test

$$\varphi^*(x) = \begin{cases} 1 & f(x; \theta_1) > k\, f(x; \theta_0), \\ 0 & f(x; \theta_1) < k\, f(x; \theta_0), \end{cases}$$

with randomization on the boundary $\{f_1 = k f_0\}$ chosen to make $E_{\theta_0}[\varphi^*] = \alpha$, is most powerful level-$\alpha$ among all tests $\varphi$ with $E_{\theta_0}[\varphi] \le \alpha$.

Proof.

Let $\varphi$ be any test with $E_{\theta_0}[\varphi(X)] \le \alpha$. We prove $E_{\theta_1}[\varphi^*(X)] \ge E_{\theta_1}[\varphi(X)]$.

Step 1 — pointwise inequality. For every $x$,

$$\bigl(\varphi^*(x) - \varphi(x)\bigr)\bigl(f(x; \theta_1) - k\, f(x; \theta_0)\bigr) \ge 0.$$

At any $x$ with $f(x; \theta_1) > k f(x; \theta_0)$, we have $\varphi^*(x) = 1 \ge \varphi(x)$, so both factors are $\ge 0$. At any $x$ with $f(x; \theta_1) < k f(x; \theta_0)$, we have $\varphi^*(x) = 0 \le \varphi(x)$, so both factors are $\le 0$. On the boundary $\{f_1 = k f_0\}$ the second factor is zero and the inequality holds trivially. Either way, the product is non-negative pointwise.

Step 2 — integrate. Integrating the pointwise inequality against $\mu$:

$$\int \bigl(\varphi^*(x) - \varphi(x)\bigr)\bigl(f(x; \theta_1) - k\, f(x; \theta_0)\bigr)\, d\mu(x) \ge 0.$$

Expanding the product and converting integrals to expectations,

$$E_{\theta_1}[\varphi^*(X)] - E_{\theta_1}[\varphi(X)] \ge k\bigl(E_{\theta_0}[\varphi^*(X)] - E_{\theta_0}[\varphi(X)]\bigr).$$

Step 3 — apply the size constraint. By construction, $E_{\theta_0}[\varphi^*(X)] = \alpha$. By assumption, $E_{\theta_0}[\varphi(X)] \le \alpha$. So

$$E_{\theta_0}[\varphi^*(X)] - E_{\theta_0}[\varphi(X)] \ge 0.$$

Multiplying by $k \ge 0$ keeps the right-hand side of Step 2 non-negative, so

$$E_{\theta_1}[\varphi^*(X)] - E_{\theta_1}[\varphi(X)] \ge 0,$$

i.e., $\varphi^*$ has power at least as large as any other level-$\alpha$ test.

∎ — using NEY1933

Two-panel figure. Left: overlaid densities f(x; θ₀) and f(x; θ₁) with the NP rejection region — the set of x where f₁ exceeds k times f₀ — shaded. Right: the same region shown as size (Type I area under f₀) on one axis and power (area under f₁) on the other, with the NP power-envelope curve as k varies.

[Interactive demo · §18.2 — NP test at level α for the Normal mean, T = x̄: threshold c = 0.329, exact size 0.0500, power β(θ₁) ≈ 0.9996. By the NP lemma (Theorem 1), no other level-α test has higher power at θ₁ than the shown region.]
Example 1 NP for Bernoulli

Let $X_1, \ldots, X_n$ be iid Bernoulli$(p)$. Test $H_0: p = p_0$ vs $H_1: p = p_1$ with $p_1 > p_0$. The likelihood ratio at $x$ is

$$\Lambda(x) = \left(\frac{p_1}{p_0}\right)^{\sum x_i} \left(\frac{1-p_1}{1-p_0}\right)^{n - \sum x_i},$$

which is monotone increasing in $T(x) = \sum x_i$ (since $p_1 > p_0$ implies $p_1/p_0 > 1$ and $(1-p_1)/(1-p_0) < 1$). The NP test rejects iff $T(X) > c$ for some threshold $c$; randomization at $T = c$ achieves exact size $\alpha$.

Concretely: for $n = 20$, $p_0 = 0.5$, $p_1 = 0.7$, $\alpha = 0.05$, the NP test rejects when $T \ge 15$ (threshold $c = 14$), with exact size $\approx 0.021$ (the conservative size of §18.4). Power at $p_1 = 0.7$: $P_{0.7}(T \ge 15) \approx 0.416$ (verified in binomialExactPower of Topic 17's testing.ts).
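Both numbers are quick to reproduce. A sketch with a hypothetical helper (not the testing.ts binomialExactPower itself):

```ts
// Exact Binomial(n, p) upper tail P(T >= t), built from the pmf recurrence
// P(T = k) = P(T = k-1) · ((n-k+1)/k) · (p/(1-p)).
function binomUpperTail(n: number, p: number, t: number): number {
  let pmf = Math.pow(1 - p, n); // P(T = 0)
  let tail = t <= 0 ? pmf : 0;
  for (let k = 1; k <= n; k++) {
    pmf *= ((n - k + 1) / k) * (p / (1 - p));
    if (k >= t) tail += pmf;
  }
  return tail;
}

binomUpperTail(20, 0.5, 15); // exact size  ≈ 0.0207
binomUpperTail(20, 0.7, 15); // exact power ≈ 0.416
```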

Example 2 NP for Normal mean (σ known)

Let $X_1, \ldots, X_n$ be iid $\mathcal{N}(\mu, \sigma^2)$ with $\sigma$ known. Test $H_0: \mu = \mu_0$ vs $H_1: \mu = \mu_1$ with $\mu_1 > \mu_0$. The log-likelihood ratio is

$$\log\Lambda(x) = \frac{n(\mu_1 - \mu_0)\bigl(2\bar x - \mu_0 - \mu_1\bigr)}{2\sigma^2},$$

monotone increasing in $\bar x$. The NP test rejects iff $\bar x > c$ for $c = \mu_0 + z_{1-\alpha}\,\sigma/\sqrt n$, where $z_{1-\alpha}$ is the standard-Normal $(1-\alpha)$-quantile. The threshold depends on $\alpha$, $n$, and $\sigma$ — but not on $\mu_1$. The same test is NP against every $\mu_1 > \mu_0$. That is the geometric seed of the Karlin-Rubin theorem in §18.3.

Numerical check: $n = 25$, $\mu_0 = 0$, $\mu_1 = 1$, $\sigma = 1$, $\alpha = 0.05$. Then $c = 0 + 1.645 \cdot 1/\sqrt{25} \approx 0.329$, matching npCriticalValue('normal-mean-known-sigma', 0, 1, 25, 0.05, 1) in testing.ts.
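The computation itself is one line. A sketch of what such a helper presumably does for this family (the real npCriticalValue lives in testing.ts; the function below is an assumption for illustration):

```ts
const Z_95 = 1.6449; // standard-Normal 0.95 quantile

// c = mu0 + z_{1-alpha} * sigma / sqrt(n); note that mu1 never enters.
function normalMeanThreshold(mu0: number, sigma: number, n: number, z = Z_95): number {
  return mu0 + (z * sigma) / Math.sqrt(n);
}

normalMeanThreshold(0, 1, 25); // ≈ 0.329
```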

Example 3 NP for Exponential rate

Let $X_1, \ldots, X_n$ be iid Exponential$(\lambda)$ with density $f(x; \lambda) = \lambda e^{-\lambda x}$ on $x \ge 0$. Test $H_0: \lambda = \lambda_0$ vs $H_1: \lambda = \lambda_1$. The log-likelihood ratio is

$$\log\Lambda(x) = n\log(\lambda_1/\lambda_0) - (\lambda_1 - \lambda_0)\sum_i x_i.$$

If $\lambda_1 > \lambda_0$ (faster rate, shorter expected waits), the LR is decreasing in $T(x) = \sum_i x_i$: short totals favor $H_1$. The NP test rejects iff $T < c_{\text{lo}}$. Under $H_0$, $T \sim \Gamma(n, \lambda_0)$; using $2\lambda_0 T \sim \chi^2_{2n}$ we get the exact size at $c_{\text{lo}} = \chi^2_{2n, \alpha}/(2\lambda_0)$.

The direction reversal — shorter totals trigger rejection of “the slower rate is true” — is a common pitfall in exponential-reliability testing. npCriticalValue('exponential', ...) in Topic 17’s testing.ts handles the direction via an explicit Tform label noting the side.

Remark 5 Randomization is only needed for discrete test statistics

The NP test's "randomize on the boundary" clause looks strange but is needed only when $P_{\theta_0}(\Lambda = k) > 0$ — i.e., when the test statistic is discrete and $k$ falls on an atom. For continuous tests (Normal, Exponential), the boundary has measure zero and no randomization is needed; the size is exactly $\alpha$ by choice of $k$. For discrete tests (Bernoulli, Poisson), the exact size without randomization is usually strictly less than $\alpha$ — the conservative size of §17.3 Thm 1 and §17.6 Ex 11. We treat randomization as a theoretical device: in practice we rarely implement it, accepting the conservative size in exchange for interpretability.

Remark 6 Template versus usable test

Theorem 1 is a template — given the two parameter values and a chosen $\alpha$, it tells you what the MP test looks like. It does not by itself produce a usable test of a practical hypothesis like "is this drug better than placebo?" because practical hypotheses are rarely simple-vs-simple. The real $H_1$ is usually composite ("some positive effect of unspecified size"), and the MP test depends on which specific alternative you face. §18.3 asks: when is the NP rejection region the same for every alternative in a composite $H_1$? That is the MLR / Karlin-Rubin story.

18.3 Monotone Likelihood Ratio & Karlin-Rubin

The NP test of Example 2 had a remarkable property: the rejection region $\{\bar x > c\}$ did not depend on $\mu_1$ — only on $\mu_0$, $\alpha$, $n$, $\sigma$. The same test was NP against every $\mu_1 > \mu_0$. That property has a name: uniformly most powerful (UMP). Which families give it?

Definition 3 Monotone likelihood ratio (MLR)

A family $\{f(\cdot\,; \theta) : \theta \in \Theta \subseteq \mathbb{R}\}$ has monotone likelihood ratio in $T(x)$ if there exists a statistic $T(x)$ such that for every pair $\theta_1 < \theta_2$ in $\Theta$, the ratio

$$\frac{f(x; \theta_2)}{f(x; \theta_1)} \text{ is a non-decreasing function of } T(x).$$

Equivalently, $\log[f(x; \theta_2)/f(x; \theta_1)]$ is a non-decreasing function of $T(x)$.

Definition 4 Uniformly most powerful (UMP) test

A level-$\alpha$ test $\varphi^*$ is uniformly most powerful (UMP) against a composite alternative $H_1: \theta \in \Theta_1$ if it is MP level-$\alpha$ at every $\theta_1 \in \Theta_1$ simultaneously. In symbols: for every $\theta_1 \in \Theta_1$ and every level-$\alpha$ test $\varphi$,

$$E_{\theta_1}[\varphi^*(X)] \ge E_{\theta_1}[\varphi(X)].$$

UMP existence is a strong statement — the same rejection region beats every competitor at every alternative.

Theorem 2 Karlin-Rubin: MLR gives one-sided UMP

Let $\{f(\cdot\,; \theta) : \theta \in \Theta \subseteq \mathbb{R}\}$ have MLR in $T(x)$. For testing

$$H_0: \theta \le \theta_0 \quad \text{versus} \quad H_1: \theta > \theta_0,$$

there exists a threshold $c$ (possibly with randomization at $T = c$ for discrete $T$) such that the test

$$\varphi^*(x) = \mathbf{1}\{T(x) > c\}$$

has size exactly $\alpha$ and is UMP level-$\alpha$.

Proof.

We prove two things: (i) $\varphi^*$ has size $\le \alpha$ over the composite null $\{\theta \le \theta_0\}$, achieving $\alpha$ at $\theta = \theta_0$; (ii) $\varphi^*$ is MP at every $\theta_1 > \theta_0$.

Step 1 — size over the composite null. By MLR, for any $\theta < \theta_0$,

$$\frac{f(x; \theta_0)}{f(x; \theta)} \text{ is non-decreasing in } T(x).$$

Consider testing the simple pair "$\theta$ vs $\theta_0$". By MLR, the NP rejection region $\{f(x; \theta_0) > \kappa f(x; \theta)\}$ coincides with $\{T(x) > c_\kappa\}$ for an appropriate $c_\kappa$ depending on $\kappa$, so $\{T > c\}$ is itself an NP test of $\theta$ vs $\theta_0$ at its own size $\alpha' = P_\theta(T > c)$. By Theorem 1, an NP test is at least as powerful as the trivial level-$\alpha'$ test that rejects with constant probability $\alpha'$, so its power at $\theta_0$ is at least its size at $\theta$:

$$P_\theta(T > c) \le P_{\theta_0}(T > c) = \alpha \quad \text{for every } \theta < \theta_0.$$

In other words, the power function $\theta \mapsto P_\theta(T > c)$ is non-decreasing: size is maximized at $\theta = \theta_0$ and equals $\alpha$ on the composite null boundary.

Step 2 — MP at each $\theta_1 > \theta_0$. Fix $\theta_1 > \theta_0$. By MLR,

$$\frac{f(x; \theta_1)}{f(x; \theta_0)} \text{ is non-decreasing in } T(x).$$

There exists a threshold $\kappa$ such that $\{f(x; \theta_1) > \kappa f(x; \theta_0)\} = \{T(x) > c\}$ (same $c$ as Step 1, since both are calibrated to size $\alpha$ at $\theta_0$). By Theorem 1, the test $\mathbf{1}\{T > c\}$ is MP level-$\alpha$ for "$\theta_0$ vs $\theta_1$" — and the rejection region does not depend on $\theta_1$. The same test is MP at every $\theta_1 > \theta_0$ simultaneously.

Combining Steps 1 and 2: $\varphi^* = \mathbf{1}\{T > c\}$ has size $\alpha$ on the composite null (achieved at $\theta = \theta_0$) and is MP at every alternative. This is the definition of UMP.

∎ — using KAR1956 and Theorem 1

Four-panel grid showing log-likelihood ratios as monotone-increasing functions of the sufficient statistic T: top-left Bernoulli (T = ΣXᵢ), top-right Normal mean (T = x̄), bottom-left Poisson rate (T = ΣXᵢ), bottom-right Exponential rate (T = ΣXᵢ). Each line is visibly non-decreasing, confirming MLR for four exponential-family examples.

[Interactive demo · §18.3 — Karlin-Rubin UMP rejection region T > c; e.g., boundary c = 32 with T = ΣXᵢ and exact size 0.0325. The boundary does not depend on θ₁ — this is what makes the test UMP.]
Example 4 Every exponential family has MLR in its natural sufficient statistic

Let $f(x; \eta) = h(x)\exp\bigl(\eta T(x) - A(\eta)\bigr)$ be a one-parameter exponential family with natural parameter $\eta$ and sufficient statistic $T(x)$. For any $\eta_1 < \eta_2$,

$$\frac{f(x; \eta_2)}{f(x; \eta_1)} = \exp\bigl((\eta_2 - \eta_1)T(x) - [A(\eta_2) - A(\eta_1)]\bigr).$$

This is a monotone increasing function of $T(x)$ because $\eta_2 - \eta_1 > 0$. So every one-parameter exponential family has MLR in its natural sufficient statistic. Karlin-Rubin then hands us a UMP test for every one-sided composite hypothesis on $\eta$ — Bernoulli, Normal mean (with $\sigma$ known), Poisson, Exponential, Gamma (with shape known), Geometric, and so on.

This is the pedagogical payoff of Topic 16’s factorization + completeness machinery: once you know the sufficient statistic, the UMP test writes itself.

Example 5 Uniform(0, θ) — MLR without exponential-family structure

Let $X_1, \ldots, X_n$ be iid Uniform$(0, \theta)$, $\theta > 0$. The density is $f(x; \theta) = \theta^{-1}\mathbf{1}\{0 \le x \le \theta\}$, and the joint is $\theta^{-n}\mathbf{1}\{\max_i x_i \le \theta\}$. For $\theta_1 < \theta_2$,

$$\frac{f(x; \theta_2)}{f(x; \theta_1)} = \begin{cases} (\theta_1/\theta_2)^n & \max_i x_i \le \theta_1, \\ +\infty & \theta_1 < \max_i x_i \le \theta_2, \end{cases}$$

where the second case is a positive numerator over a zero denominator (for $\max_i x_i > \theta_2$ both densities vanish and the ratio is irrelevant). The ratio is non-decreasing in $\max_i x_i$ (it jumps from a constant to $+\infty$ at $\max_i x_i = \theta_1$). Uniform$(0, \theta)$ has MLR in $T = \max_i X_i$, even though it is not an exponential family. By Karlin-Rubin, the one-sided UMP test for $H_0: \theta \le \theta_0$ vs $H_1: \theta > \theta_0$ rejects iff $\max_i X_i > c$ with $c = \theta_0(1-\alpha)^{1/n}$: under $\theta_0$, $\max_i X_i/\theta_0$ has CDF $u^n$ on $[0, 1]$, so $P_{\theta_0}(\max_i X_i > c) = 1 - (1-\alpha) = \alpha$ exactly.

Remark 7 Two-sided UMP usually doesn't exist

Karlin-Rubin gives UMP for one-sided composite alternatives. For two-sided alternatives $H_1: \theta \ne \theta_0$, UMP tests typically do not exist, because the best rejection region for $\theta_1 > \theta_0$ points right while the best for $\theta_1 < \theta_0$ points left, and no single region handles both. The standard workaround is to restrict to unbiased tests (power $\ge \alpha$ at every alternative) and prove optimality within that class — the UMP unbiased (UMPU) construction of Lehmann-Romano (LEH2005 Ch. 4). That theory is beyond Topic 18's scope; we note only that for the two-sided Normal variance test the UMPU critical values differ from the equal-tail ones (the familiar equal-tail $\chi^2$ test is slightly biased), while for the two-sided Normal mean the z-test is UMPU but not UMP in the full class of level-$\alpha$ tests. §18.4 Remark 9 returns to this.

Remark 8 Historical note — Karlin 1957 coined 'monotone likelihood ratio'

The structural property was isolated by Karlin and Rubin (KAR1956) in their 1956 paper on MLR decision procedures, but the term "monotone likelihood ratio" itself was coined the next year in Karlin's solo paper on Pólya-type distributions (KAR1957). The MLR families are exactly the totally positive of order 2 ($\mathrm{TP}_2$) families in Karlin's terminology — a classification that generalizes to kernels and sequences in a way that powers modern results in stochastic orders (Shaked & Shanthikumar 2007) and in asymptotic efficiency bounds for monotone regression (Groeneboom & Jongbloed 2014).

18.4 UMP in Action — Three Worked Families

Example 6 Binomial exact test is UMP one-sided

Let $X_1, \ldots, X_n$ be iid Bernoulli$(p)$. Test $H_0: p \le p_0$ vs $H_1: p > p_0$. The Bernoulli family is exponential with natural parameter $\eta = \log[p/(1-p)]$ (log-odds) and sufficient statistic $T = \sum X_i \sim \text{Binomial}(n, p)$. By Example 4, MLR holds in $T$. By Karlin-Rubin, the test

$$\varphi^*(x) = \mathbf{1}\{\textstyle\sum x_i > c\}, \qquad c = \text{smallest integer with } P_{p_0}(\textstyle\sum X_i > c) \le \alpha$$

is UMP level-$\alpha$. This is exactly the binomial exact test of Topic 17 §17.6 Example 11 — now equipped with an optimality certificate. At $n = 20$, $p_0 = 0.5$, $\alpha = 0.05$, the boundary is $c = 14$ — reject when $\sum X_i \ge 15$ — with exact size $0.0207$ (binomialExactRejectionBoundary in testing.ts).

The payoff: the binomial exact test is not just valid — it is the best level-$\alpha$ test for every alternative $p_1 > 0.5$ simultaneously, with no competitor having higher power at any alternative.

Two-panel figure. Left: UMP power curve β(μ) for the Normal one-sided z-test at n = 25, σ = 1, α = 0.05 — β(0) = 0.05 at the boundary, monotone increasing in μ. Right: Normal sampling density of x̄ at μ = 0 with the UMP rejection region x̄ > 0.329 shaded.

Example 7 Normal z-test (σ known) is UMP one-sided

Let $X_1, \ldots, X_n$ be iid $\mathcal{N}(\mu, \sigma^2)$ with $\sigma$ known. For $H_0: \mu \le \mu_0$ vs $H_1: \mu > \mu_0$, MLR in $T = \bar X$ follows from Example 4 (Normal with $\sigma$ known is an exponential family with $\eta = \mu/\sigma^2$, $T = \sum x_i$). Karlin-Rubin gives the UMP test

$$\varphi^*(x) = \mathbf{1}\{\bar x > \mu_0 + z_{1-\alpha}\sigma/\sqrt n\}$$

with exact size $\alpha$. No level-$\alpha$ test of this one-sided hypothesis has higher power at any $\mu > \mu_0$ than the one-sided z-test. This result justifies the A/B-testing convention of reporting one-sided p-values when the practical question is directional.

Two-panel figure. Left: exact UMP rejection boundary (integer threshold on ΣXᵢ, step function) vs Normal-approximation boundary (smooth) as n grows from 20 to 100. Right: exact size vs approximate size — exact is conservative below 0.05, Normal approximation oscillates.

Example 8 Normal t-test is UMP-invariant (not UMP)

Let $X_1, \ldots, X_n$ be iid $\mathcal{N}(\mu, \sigma^2)$ with $\sigma^2$ unknown. For $H_0: \mu \le \mu_0$ vs $H_1: \mu > \mu_0$, MLR fails: the likelihood depends on both $(\mu, \sigma^2)$ and no scalar $T$ captures the likelihood-ratio monotonicity uniformly. UMP does not exist in the full class of level-$\alpha$ tests. The one-sample t-test is UMP within the restricted class of tests that are invariant under scale transformations $X_i \mapsto cX_i$ — a Hunt-Stein invariance argument (LEH2005 Ch. 6). The t-test retains an optimality certificate, but only conditionally.

The $\sigma$-unknown case is the historical reason Wilks pursued a different asymptotic framework (§18.6): when Karlin-Rubin's MLR argument fails, the composite LRT provides a general-purpose procedure that does not require MLR but gives up finite-sample optimality in exchange.

Remark 9 Why two-sided z-tests aren't UMP — and why it barely matters in practice

The two-sided Normal mean test $H_0: \mu = \mu_0$ vs $H_1: \mu \ne \mu_0$ has no UMP test (Remark 7). The equal-tail z-test is UMPU (LEH2005 Thm 4.4.1): MP within the class of unbiased tests, i.e., those with power $\ge \alpha$ at every alternative. The practical consequence is minor — in concrete applications the UMPU test and the only serious competitors differ by at most a few percent in power. But conceptually it is important: UMP is the exception, UMPU / invariance / LRT are the rule. The composite LRT of §18.5 is the price we pay for general-purpose applicability, and Wilks' theorem is the justification for believing that price is low.

18.5 The Likelihood-Ratio Principle for Composite H₀

When Karlin-Rubin fails — two-sided alternatives, nuisance parameters, irregular families — we need a general-purpose construction. The classical answer is Wilks’ generalized likelihood-ratio test.

Definition 5 Generalized likelihood ratio Λₙ

Let $\Theta$ be the full parameter space and $\Theta_0 \subset \Theta$ the null subspace. The generalized likelihood ratio for a sample $X_1, \ldots, X_n$ is

$$\Lambda_n = \frac{\sup_{\theta \in \Theta_0} L(\theta \mid X)}{\sup_{\theta \in \Theta} L(\theta \mid X)} = \frac{L(\tilde\theta_n)}{L(\hat\theta_n)},$$

where $\tilde\theta_n$ is the restricted MLE (maximizer under $H_0$) and $\hat\theta_n$ is the unrestricted MLE. Both are functions of the data. Since $\Theta_0 \subseteq \Theta$, always $0 \le \Lambda_n \le 1$.

Definition 6 LRT rejection rule

The likelihood-ratio test (LRT) at level $\alpha$ rejects $H_0$ when

$$-2\log\Lambda_n > c_\alpha,$$

where $c_\alpha$ is chosen to achieve size $\alpha$ under $H_0$. The conventional asymptotic choice $c_\alpha = \chi^2_{k, 1-\alpha}$, where $k$ is the number of parameter restrictions, is justified by Wilks' theorem (§18.6 Thm 4). At finite $n$, $c_\alpha$ may need to be calibrated by Monte Carlo for accurate size.
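Given a null simulator, the Monte Carlo calibration is short. A hedged sketch (simulateNull and neg2LogLambda are assumed callbacks, not testing.ts exports):

```ts
// Empirical (1 - alpha)-quantile of the null distribution of -2 log Λ_n.
function mcCriticalValue(
  neg2LogLambda: (sample: number[]) => number, // -2 log Λ_n for one sample
  simulateNull: () => number[],                // one sample drawn under H0
  alpha: number,
  M = 10_000,
): number {
  const stats = Array.from({ length: M }, () => neg2LogLambda(simulateNull()));
  stats.sort((a, b) => a - b);
  return stats[Math.min(M - 1, Math.floor((1 - alpha) * M))];
}
```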

Theorem 3 Reparameterization invariance of the LRT (stated; proved in §18.8)

Let $\eta = g(\theta)$ be a smooth one-to-one reparameterization with $H_0^\eta = g(H_0^\theta)$. Then the composite LRT $\Lambda_n$ is invariant:

$$\Lambda_n(X; \theta) = \Lambda_n(X; \eta),$$

i.e., computing the LRT in either parameterization yields the same test statistic and the same rejection decision. The Wald test lacks this invariance, and so do Score variants that plug in estimated information (the Score test with expected information at the null is invariant — the Jacobians cancel); see §18.8 Theorem 6 and the concrete logit-vs-raw Bernoulli example in Example 13.

Example 9 One-sample t-test is the LRT for Normal mean

Let $X_1, \ldots, X_n$ be iid $\mathcal{N}(\mu, \sigma^2)$ with both parameters unknown. Test $H_0: \mu = \mu_0$ vs $H_1: \mu \ne \mu_0$.

Under $H_0$: $\sigma^2$ is the only free parameter; its MLE is $\tilde\sigma_0^2 = n^{-1}\sum(X_i - \mu_0)^2$. Under $H_0 \cup H_1$: the MLEs are $\hat\mu = \bar X$, $\hat\sigma^2 = n^{-1}\sum(X_i - \bar X)^2$. A short calculation (expand the likelihoods) gives

$$\Lambda_n = \left(\frac{\hat\sigma^2}{\tilde\sigma_0^2}\right)^{n/2} = \left(1 + \frac{T_n^2}{n-1}\right)^{-n/2},$$

where $T_n = \sqrt n(\bar X - \mu_0)/S$ is the t-statistic with $S^2 = (n-1)^{-1}\sum(X_i - \bar X)^2$. So $-2\log\Lambda_n = n\log\bigl(1 + T_n^2/(n-1)\bigr)$ is a strictly increasing function of $|T_n|$ — the LRT rejects iff $|T_n|$ exceeds a threshold, which is exactly the two-sided t-test rejection rule. The t-test is the LRT.

At large $n$, $-2\log\Lambda_n \approx T_n^2$ and the asymptotic $\chi^2_1$ reference of §18.6 Thm 4 reduces to the squared-t approximation. At finite $n$, the exact t-test reference (Topic 17 Thm 5 via Basu) is preferred over the asymptotic $\chi^2$.
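The identity is easy to verify numerically. A sketch with a hypothetical helper (not part of testing.ts):

```ts
// Example 9 check: -2 log Λ_n computed from the two variance MLEs equals
// n · log(1 + T² / (n-1)) computed from the t-statistic.
function tTestLrtIdentity(xs: number[], mu0: number): { direct: number; viaT: number } {
  const n = xs.length;
  const mean = xs.reduce((a, b) => a + b, 0) / n;
  const sig2Hat = xs.reduce((a, x) => a + (x - mean) ** 2, 0) / n;  // unrestricted MLE
  const sig2Tilde = xs.reduce((a, x) => a + (x - mu0) ** 2, 0) / n; // restricted MLE
  const S2 = (n / (n - 1)) * sig2Hat;
  const T = (Math.sqrt(n) * (mean - mu0)) / Math.sqrt(S2);
  return {
    direct: n * Math.log(sig2Tilde / sig2Hat),  // -2 log Λ_n
    viaT: n * Math.log(1 + (T * T) / (n - 1)),  // the same number, via T_n
  };
}
```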

Example 10 χ² variance test is the LRT for Normal variance

Let $X_1, \ldots, X_n$ be iid $\mathcal{N}(\mu, \sigma^2)$ with $\mu$ unknown. Test $H_0: \sigma^2 = \sigma_0^2$ vs $H_1: \sigma^2 \ne \sigma_0^2$.

Under $H_0$: $\hat\mu = \bar X$ is free; plugging in gives likelihood $L(\bar X, \sigma_0^2)$. Under $H_1$: $\hat\sigma^2 = n^{-1}\sum(X_i - \bar X)^2$ is the unrestricted MLE. The LRT statistic simplifies to

$$-2\log\Lambda_n = n\log\left(\frac{\sigma_0^2}{\hat\sigma^2}\right) + \frac{n\hat\sigma^2}{\sigma_0^2} - n,$$

a function of $W = n\hat\sigma^2/\sigma_0^2$. The equal-tail $\chi^2$ variance test of Topic 17 §17.8 rejects for extreme $W$; the LRT agrees asymptotically via Wilks' Thm 4, and the exact $(n-1)S^2/\sigma_0^2 \sim \chi^2_{n-1}$ distribution gives the exact size.

Remark 10 Profile likelihood — preview, full treatment in Topic 19

When there are nuisance parameters, the restricted MLE $\tilde\theta_n$ in $\Lambda_n$ is obtained by maximizing over the nuisance parameters at fixed $\theta_0$. This operation — "eliminate the nuisance by profiling" — yields the profile likelihood

$$L_P(\theta) = \sup_\psi L(\theta, \psi),$$

a likelihood-like function of $\theta$ alone. Profile likelihood is a CI-construction tool first (Topic 19) and a testing artefact second; §18.6 Example 11 treats the Bernoulli case (no nuisance) directly, and §18.9's local-power argument likewise handles the scalar-$\theta$ case. For a full treatment of profile, integrated, and conditional likelihoods, see Topic 19 and the Reid-Fraser higher-order likelihood literature.

Remark 11 LRT as the general-purpose optimality substitute

Wilks' generalized LRT plays the role of the Swiss-Army-knife testing procedure: applicable to every composite-vs-composite setting (subject to regularity), with a known asymptotic null distribution. It gives up finite-sample optimality (no Karlin-Rubin-style UMP guarantee) but retains asymptotic efficiency in a precise sense (§18.9). The pragmatic algorithm: if your family has MLR (exponential families, the Uniform scale family), prefer the Karlin-Rubin UMP test. If not, default to the LRT. If the LRT is analytically intractable, approximate with Wald or Score — noting that the three differ in finite samples (§18.7–§18.8) and that the LRT is usually the safest choice when the three disagree. FER1967 develops this decision-theoretic calculus in full; LEH2005 Ch. 12 gives the modern view.

18.6 Wilks’ Theorem

This is the technical high point of Topic 18. The LRT statistic $-2\log\Lambda_n$ does not have a clean finite-sample distribution in general, but under regularity its null distribution converges to $\chi^2_1$ (scalar $\theta$) — the same reference distribution as the squared z-test. The proof unfolds in 8 steps from Taylor's theorem and MLE asymptotic normality.

Theorem 4 Wilks' theorem (scalar θ)

Let $X_1, \ldots, X_n$ be iid from $\{f(\cdot\,; \theta) : \theta \in \Theta\}$ with $\Theta \subseteq \mathbb{R}$ open. Under $H_0: \theta = \theta_0$ with $\theta_0$ in the interior of $\Theta$, and under Wilks' regularity — the MLE $\hat\theta_n$ is consistent and asymptotically normal (Topic 14 Thm 14.3); the Fisher information $I(\theta)$ is continuous and positive at $\theta_0$; the third log-density derivative $\partial^3\log f/\partial\theta^3$ is uniformly bounded in a neighborhood of $\theta_0$ by a function $M(x)$ with $E_{\theta_0}[M(X)] < \infty$ — the log-likelihood-ratio statistic converges in distribution:

$$-2\log\Lambda_n \xrightarrow{d} \chi^2_1 \quad \text{under } H_0,$$

where $\Lambda_n = L(\theta_0)/L(\hat\theta_n)$.

Proof.

The argument proceeds in 8 steps. Write $\ell(\theta) = \log L(\theta)$ for the log-likelihood.

Step 1 — Rewrite in log-likelihood form. By definition,

$$-2\log\Lambda_n = -2\bigl[\ell(\theta_0) - \ell(\hat\theta_n)\bigr] = 2\bigl[\ell(\hat\theta_n) - \ell(\theta_0)\bigr].$$

Step 2 — Taylor expand $\ell(\theta_0)$ around $\hat\theta_n$. By Taylor's theorem with remainder, there exists $\xi_n$ between $\theta_0$ and $\hat\theta_n$ such that

$$\ell(\theta_0) = \ell(\hat\theta_n) + \ell'(\hat\theta_n)(\theta_0 - \hat\theta_n) + \tfrac{1}{2}\ell''(\hat\theta_n)(\theta_0 - \hat\theta_n)^2 + R_n,$$

where $R_n = \tfrac{1}{6}\ell'''(\xi_n)(\theta_0 - \hat\theta_n)^3$.

Step 3 — First-order term vanishes. Because $\hat\theta_n$ is the MLE in the interior of $\Theta$, the first-order condition $\ell'(\hat\theta_n) = 0$ holds exactly (Topic 14 §14.6). Substituting into Step 2,

$$\ell(\theta_0) - \ell(\hat\theta_n) = \tfrac{1}{2}\ell''(\hat\theta_n)(\theta_0 - \hat\theta_n)^2 + R_n.$$

Therefore

$$-2\log\Lambda_n = -\ell''(\hat\theta_n)(\hat\theta_n - \theta_0)^2 - 2R_n.$$

Step 4 — Remainder control. We show $2R_n = o_P(1)$. By the third-derivative hypothesis, there exists a neighborhood $U$ of $\theta_0$ and a function $M(x)$ with $E_{\theta_0}[M(X)] < \infty$ such that

$$\bigl|\ell'''(\theta)\bigr| \le \sum_{i=1}^n M(X_i) \quad \text{for every } \theta \in U.$$

Since $\hat\theta_n \xrightarrow{P} \theta_0$ (Topic 14 Thm 14.2), we have $\xi_n \in U$ with probability tending to 1, and on that event

$$|R_n| \le \tfrac{1}{6}\left|\sum_{i=1}^n M(X_i)\right| \cdot |\hat\theta_n - \theta_0|^3.$$

By the weak law of large numbers applied to the $M(X_i)$,

$$\tfrac{1}{n}\sum_{i=1}^n M(X_i) \xrightarrow{P} E_{\theta_0}[M(X)] < \infty,$$

so $\bigl|\sum_i M(X_i)\bigr| = O_P(n)$. By Topic 14 Thm 14.3, $\sqrt n(\hat\theta_n - \theta_0) = O_P(1)$, hence $|\hat\theta_n - \theta_0|^3 = O_P(n^{-3/2})$. Combining,

$$|R_n| = O_P(n) \cdot O_P(n^{-3/2}) = O_P(n^{-1/2}) = o_P(1).$$

Thus $2R_n = o_P(1)$.

Log-likelihood ℓ(θ) for Bernoulli p = 0.5 at n = 100 overlaid with the quadratic Taylor approximation tangent at θ̂ₙ. The vertical drop from the peak down to ℓ(θ₀) is exactly half of −2 log Λₙ; the quadratic approximation shows that this drop is n·I(θ₀)·(θ̂ₙ−θ₀)²/2 modulo o_P(1).

Step 5 — Rescale observed curvature to Fisher information. Write $-\ell''(\hat\theta_n) = n \cdot [-(1/n)\ell''(\hat\theta_n)]$. The bracketed quantity converges in probability to $I(\theta_0)$:

$$-\frac{1}{n}\ell''(\hat\theta_n) \xrightarrow{P} I(\theta_0).$$

This identity — "observed information at the MLE converges to Fisher information at $\theta_0$" — is the same lemma used in the proof of Topic 14 Thm 14.3. It follows from the SLLN applied to the iid summands $-\partial^2_\theta\log f(X_i; \theta)$, continuity of $I(\cdot)$ at $\theta_0$, and consistency of $\hat\theta_n$. We cite Topic 14 for the full argument and do not reprove it here.

Step 6 — Invoke MLE asymptotic normality. By Topic 14 Thm 14.3,

$$\sqrt n(\hat\theta_n - \theta_0) \xrightarrow{d} \mathcal{N}\bigl(0, I(\theta_0)^{-1}\bigr) \quad \text{under } H_0.$$

Equivalently, $\sqrt{nI(\theta_0)}\,(\hat\theta_n - \theta_0) \xrightarrow{d} \mathcal{N}(0, 1)$.

Step 7 — Continuous mapping: square. Let $Z_n = \sqrt{nI(\theta_0)}\,(\hat\theta_n - \theta_0)$. Then $Z_n \xrightarrow{d} Z \sim \mathcal{N}(0, 1)$. By the continuous mapping theorem (Topic 9), $Z_n^2 \xrightarrow{d} Z^2 \sim \chi^2_1$. Note $Z_n^2 = nI(\theta_0)(\hat\theta_n - \theta_0)^2$.

Step 8 — Combine via Slutsky. From Step 3 (using Step 5),

$$-2\log\Lambda_n = -\ell''(\hat\theta_n)(\hat\theta_n - \theta_0)^2 - 2R_n = n \cdot \left[-\tfrac{1}{n}\ell''(\hat\theta_n)\right](\hat\theta_n - \theta_0)^2 - 2R_n.$$

Rewrite the leading term as $Z_n^2 \cdot \bigl[-(1/n)\ell''(\hat\theta_n)/I(\theta_0)\bigr]$. The ratio inside the brackets converges in probability to 1 by Step 5, and $Z_n^2 \xrightarrow{d} \chi^2_1$ by Step 7. By Slutsky's theorem, the product converges in distribution to $\chi^2_1$. The remainder $2R_n = o_P(1)$ does not affect the distributional limit, and we conclude

$$-2\log\Lambda_n \xrightarrow{d} \chi^2_1 \quad \text{under } H_0.$$

∎ — using Topic 14 Thm 14.3, Topic 9 (continuous mapping, Slutsky), WIL1938

MC histograms of −2 log Λₙ under H₀ for Bernoulli(p₀ = 0.5) at sample sizes n = 10, 50, 200, 1000 with M = 5000 replications. The χ²₁ density is overlaid in each panel. Visible convergence: at n = 10 the histogram is lumpy and overdispersed, at n = 200 it hugs the density, at n = 1000 the match is within Monte Carlo error.

[Interactive demo · §18.6 — Wilks convergence for Bernoulli(p₀ = 0.5): empirical mean 1.000 (χ²₁ target 1), variance 2.027 (target 2), 95th percentile 4.03 (target 3.84); convergence panels at n = 10, 50, 200, 1000, each with the same θ₀ and seed as the main chart.]
Example 11 Bernoulli: explicit formulas and −2 log Λₙ vs Wald vs Score

Let $X_1, \ldots, X_n$ be iid Bernoulli$(p)$, test $H_0: p = p_0$ vs $H_1: p \ne p_0$. Write $\hat p = \bar X$ for the MLE and recall the three statistics from Topic 17 §17.9:

$$W_n = \frac{n(\hat p - p_0)^2}{\hat p(1 - \hat p)}, \qquad S_n = \frac{n(\hat p - p_0)^2}{p_0(1 - p_0)}, \qquad -2\log\Lambda_n = 2n\Bigl[\hat p\log\tfrac{\hat p}{p_0} + (1 - \hat p)\log\tfrac{1 - \hat p}{1 - p_0}\Bigr].$$

All three converge to $\chi^2_1$ under $H_0$ (Wilks for the LRT; Wald and Score from Topic 17 Thm 7). They differ only in the variance plug-in: Wald uses $\hat p$, Score uses $p_0$, and the LRT interpolates via the logarithm.

Numerical check: at $p_0 = 0.5$, $n = 100$, $M = 2000$ MC replications with seed 42 (wilksSimulate('bernoulli', 0.5, 100, 2000, undefined, 42) in testing.ts): empirical mean $\approx 1.000$, 95th percentile $\approx 4.03$. The $\chi^2_1$ theoretical targets are mean 1 and 95% quantile 3.84. The 95th percentile's slight overshoot at $n = 100$ is the finite-sample overdispersion Wilks cures asymptotically.
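A standalone version of that simulation, as a hedged sketch (wilksSimulate in testing.ts is the real seeded harness; the RNG parameter below is a stand-in):

```ts
// -2 log Λ_n for Bernoulli, with the boundary cases handled explicitly.
function bernoulliNeg2LogLambda(successes: number, n: number, p0: number): number {
  const pHat = successes / n;
  if (pHat === 0) return -2 * n * Math.log(1 - p0);
  if (pHat === 1) return -2 * n * Math.log(p0);
  return 2 * n * (pHat * Math.log(pHat / p0) + (1 - pHat) * Math.log((1 - pHat) / (1 - p0)));
}

// Empirical mean of -2 log Λ_n under H0; should approach the χ²₁ mean of 1.
function wilksCheck(p0: number, n: number, M: number, rand: () => number = Math.random): number {
  let mean = 0;
  for (let m = 0; m < M; m++) {
    let s = 0;
    for (let i = 0; i < n; i++) s += rand() < p0 ? 1 : 0;
    mean += bernoulliNeg2LogLambda(s, n, p0) / M;
  }
  return mean;
}
```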

Remark 12 Regularity conditions and pointers to the rigorous LAN treatment

The classical regularity used in Proof 3 — existence and continuity of the first three log-density derivatives, uniform boundedness of the third, consistency and asymptotic normality of the MLE — traces back to Wilks’ 1938 paper and Cramér’s 1946 textbook. A rigorous modern treatment via Local Asymptotic Normality (LAN) and contiguity removes the third-derivative hypothesis and replaces it with differentiability in quadratic mean; see [VAN1998] van der Vaart §16 for the empirical-process formulation. For the practical user, the classical conditions suffice; the LAN treatment matters when the underlying family is non-smooth or non-identifiable at specific points (e.g., mixture models, non-regular tails).

Remark 13 Vector-θ Wilks — the k-df extension

The vector-$\theta$ extension replaces the scalar quadratic $nI(\theta_0)(\hat\theta_n - \theta_0)^2$ with the quadratic form $n(\hat\theta_n - \theta_0)^\top\mathbf{I}(\theta_0)(\hat\theta_n - \theta_0)$, where $\mathbf{I}(\theta_0)$ is the Fisher information matrix. By the multivariate CLT and continuous mapping, this quadratic form converges to $\chi^2_k$ where $k = \dim(\Theta) - \dim(\Theta_0)$. The 8-step scalar proof above extends line-for-line; no new ideas are needed, only more bookkeeping. See LEH2005 Thm 12.4.2 for the rigorous statement and proof.

Remark 14 Non-regular failures — when Wilks breaks

Wilks' $\chi^2$ limit fails in three kinds of non-regular settings. (1) Boundary nulls, where $\theta_0$ lies on the boundary of $\Theta$ (e.g., testing a variance component in a mixed model with $H_0: \sigma_b^2 = 0$); Chernoff (1954) shows $-2\log\Lambda_n$ converges to a $\tfrac{1}{2}\chi^2_0 + \tfrac{1}{2}\chi^2_1$ mixture. (2) Non-identifiable parameters under $H_0$ (e.g., testing "there is one mixture component" vs "there are two" in a Gaussian mixture), where the limiting distribution is a supremum of a Gaussian process and requires empirical-process techniques (Liu & Shao 2003; Drton 2009). (3) Non-smooth families (e.g., $U[0, \theta]$ shift models), where the MLE has a non-Normal limit and the Taylor expansion of Step 2 does not apply. In all three cases, Monte Carlo calibration of critical values is the safe practical default.

18.7 Wald, Score, LRT: First-Order Equivalence

Wilks' theorem hands us the χ²₁ limit for the LRT. Topic 17 Theorem 7 hands us the same limit for Wald and Score. Do the three tests agree at finite $n$? Asymptotically yes, but with a precise rate — and their pairwise differences matter in practice.

Theorem 5 Three-tests first-order equivalence

Under the regularity of Theorem 4, define the asymptotic test statistics

$$W_n = n(\hat\theta_n - \theta_0)^2 I(\hat\theta_n), \qquad S_n = \frac{U(\theta_0)^2}{nI(\theta_0)}, \qquad -2\log\Lambda_n = 2[\ell(\hat\theta_n) - \ell(\theta_0)],$$

where $U(\theta) = \ell'(\theta)$ is the score function. Then under $H_0$:

$$W_n - (-2\log\Lambda_n) = O_P(n^{-1/2}), \qquad S_n - (-2\log\Lambda_n) = O_P(n^{-1/2}).$$

In particular, all three statistics converge to the same $\chi^2_1$ distribution under $H_0$.

Proof.

We derive a common quadratic expansion for all three statistics and compare leading terms.

Step 1 — LRT. From Proof 3 Step 8,

$$-2\log\Lambda_n = nI(\theta_0)(\hat\theta_n - \theta_0)^2 + o_P(1).$$

Step 2 — Wald. By consistency of $\hat\theta_n$ and continuity of $I(\cdot)$ at $\theta_0$, $I(\hat\theta_n) = I(\theta_0) + o_P(1)$. Therefore

$$W_n = n(\hat\theta_n - \theta_0)^2\bigl[I(\theta_0) + o_P(1)\bigr] = nI(\theta_0)(\hat\theta_n - \theta_0)^2 + o_P(1),$$

where we used $n(\hat\theta_n - \theta_0)^2 = O_P(1)$ by Topic 14 Thm 14.3.

Step 3 — Score. Taylor expand $U(\theta_0)$ around $\hat\theta_n$. Since $U(\hat\theta_n) = \ell'(\hat\theta_n) = 0$ by the MLE first-order condition,

$$U(\theta_0) = \ell''(\tilde\theta_n)(\theta_0 - \hat\theta_n)$$

for some $\tilde\theta_n$ between $\theta_0$ and $\hat\theta_n$. By the lemma in Proof 3 Step 5, $-(1/n)\ell''(\tilde\theta_n) \xrightarrow{P} I(\theta_0)$, so $-\ell''(\tilde\theta_n) = nI(\theta_0) + o_P(n)$. Therefore

$$U(\theta_0) = -\ell''(\tilde\theta_n)(\hat\theta_n - \theta_0) = \bigl[nI(\theta_0) + o_P(n)\bigr](\hat\theta_n - \theta_0).$$

Squaring and dividing by $nI(\theta_0)$:

$$S_n = \frac{U(\theta_0)^2}{nI(\theta_0)} = nI(\theta_0)(\hat\theta_n - \theta_0)^2 + o_P(1).$$

Step 4 — Compare. From Steps 1–3, all three statistics equal $nI(\theta_0)(\hat\theta_n - \theta_0)^2 + o_P(1)$. Pairwise differences are $o_P(1)$. A finer analysis that tracks the $n^{-1/2}$-order terms shows the differences are $O_P(n^{-1/2})$ — essentially, the $o_P(1)$ residuals are Taylor-series corrections of the form $c\,n^{-1/2}Z + o_P(n^{-1/2})$ for a constant $c$ depending on the third log-density derivative. For most applications, the leading-order equivalence suffices: all three statistics share the $\chi^2_1$ asymptotic null distribution.

∎ — using Topic 14 Thm 14.3, Proof 3, RAO1948

Three-panel figure showing Wald (amber), Score (purple), LRT (green) histograms under H₀ for Bernoulli p₀ = 0.5 at n = 100, M = 5000. The χ²₁ density is overlaid. Right panel: empirical rejection rate at α = 0.05 — all three ≈ 0.05 within MC error, confirming the asymptotic agreement.

Example 12 Bernoulli MC: three tests under H₀ and H_A

At $p_0 = 0.5$, $n = 100$, $M = 5000$ MC replications (Topic 17 testing.test.ts test 17): the three tests have empirical means $\hat E[W_n] = 1.038$, $\hat E[S_n] = 1.006$, $\hat E[-2\log\Lambda_n] = 1.011$. The Wald statistic is slightly overdispersed (mean 3.8% above the target 1) — the plug-in $\hat p(1-\hat p)$ in the denominator introduces a small bias. Under $H_A$ at $p = 0.7$, $n = 30$: all three reject at empirical rate $\approx 0.585$ (test 18), with pairwise differences $\le 0.04$. The three-tests equivalence is visible.

A deeper lesson: the Exponential rate family gives the cleanest three-tests agreement. For Exponential$(\lambda)$ (Topic 18's testing.ts waldStatistic, scoreStatistic, lrtStatistic for family = 'exponential'): Wald and Score are exactly equal at every sample, $W_n = S_n = n(1 - \lambda_0\bar X)^2$. The LRT differs by a log-correction, $-2\log\Lambda_n = 2n[(u - 1) - \log u]$ with $u = \lambda_0\bar X$; near $u = 1$ this matches Wald and Score up to a term of order $(u - 1)^3$. Algebraic identity in place of asymptotic equivalence.
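The three exponential statistics in code, as a sketch of the algebra (testing.ts exposes waldStatistic, scoreStatistic, and lrtStatistic; the standalone function below is an assumption):

```ts
// For Exponential(λ): MLE λ̂ = 1/x̄ and I(λ) = 1/λ², so Wald and Score coincide.
function exponentialTriple(xs: number[], lambda0: number) {
  const n = xs.length;
  const xbar = xs.reduce((a, b) => a + b, 0) / n;
  const u = lambda0 * xbar;
  return {
    wald: n * (1 - u) ** 2,             // n(λ̂ - λ0)² I(λ̂) = n(1 - λ0·x̄)²
    score: n * (1 - u) ** 2,            // U(λ0)²/(n I(λ0)) — the same expression
    lrt: 2 * n * (u - 1 - Math.log(u)), // -2 log Λ_n = 2n[(u - 1) - log u]
  };
}
```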

Remark 15 Wald vs Score vs LRT — when to use which

BUS1982 is the canonical pedagogical reference for choosing among the three; the summary is short. Wald is the easiest to compute — it uses only the unrestricted MLE and its observed information. It is the natural choice when the MLE is easy and the constrained optimization under $H_0$ is hard. Score uses only the null-restricted MLE and is the natural choice when the unconstrained MLE is hard (e.g., stepwise forward selection in regression: score tests evaluate adding a variable without refitting). LRT is parameterization-invariant (§18.8 Thm 6) and agrees with Wald and Score asymptotically but often dominates in finite samples, especially at boundary-rare events. The practical default for hypothesis testing in regular parametric families is LRT; switch to Wald or Score only when computational cost or the specific form of the null forces the decision. This practical ranking is the econometrics-textbook consensus (Greene 2018 Ch. 14; Wooldridge 2010 Ch. 12).

18.8 Finite-Sample Divergence & Reparameterization Invariance

The three tests agree asymptotically but diverge at finite $n$. Two questions: how much, and does it matter? For regular families at large $n$, the divergence is cosmetic. For small $n$ or boundary-rare events, it can flip the rejection decision. The reparameterization-invariance property of the LRT is the structural reason to prefer it in ambiguous cases.

Theorem 6 LRT is parameterization-invariant; Wald is not

Let $\eta = g(\theta)$ be a smooth one-to-one reparameterization with inverse $\theta = g^{-1}(\eta)$, and let $H_0^\theta: \theta = \theta_0$ correspond to $H_0^\eta: \eta = \eta_0$ with $\eta_0 = g(\theta_0)$. Then:

(i) The LRT statistic $-2\log\Lambda_n$ is invariant:

$$-2\log\Lambda_n^\theta = -2\log\Lambda_n^\eta.$$

(ii) The Wald statistic is not invariant in general:

$$W_n^\theta = n(\hat\theta_n - \theta_0)^2 I^\theta(\hat\theta_n) \ne n(\hat\eta_n - \eta_0)^2 I^\eta(\hat\eta_n) = W_n^\eta,$$

except when $g$ is affine, so that $g'$ is constant and the Jacobians cancel exactly. The Score statistic with expected information evaluated at the null is invariant — the Jacobian in $U^\eta(\eta_0) = U^\theta(\theta_0)/g'(\theta_0)$ cancels against the one in $I^\eta(\eta_0) = I^\theta(\theta_0)/g'(\theta_0)^2$ — but plug-in variants that use observed or estimated information share Wald's non-invariance.

Proof sketch (full version in LEH2005 §12.4). (i) The likelihood function $L(\theta) = L(g^{-1}(\eta))$ is the same function up to reparameterization; the sup under $H_0$ and the sup under $\Theta$ are unchanged. So $\Lambda_n$ and $-2\log\Lambda_n$ are invariant. (ii) For Wald, the MLE transforms as $\hat\eta_n = g(\hat\theta_n)$ (MLE equivariance, Topic 14 §14.4), but the difference $\hat\eta_n - \eta_0 = g(\hat\theta_n) - g(\theta_0)$ is not linearly related to $\hat\theta_n - \theta_0$ unless $g$ is affine. The Jacobian rescales the Fisher information, $I^\eta(\eta) = I^\theta(g^{-1}(\eta))/g'(g^{-1}(\eta))^2$, but because the information is evaluated at the MLE rather than at the null, this rescaling only partially compensates for the nonlinearity. The net result: $W_n$ takes different numerical values in the two parameterizations, and the corresponding p-values differ. ∎

Two-panel figure. Left: Wald p-value for Bernoulli H₀: p = 0.5 at observed p̂ = 0.3, n = 50 — different numerical values in the p-parameterization vs the logit-η parameterization. Right: LRT p-value computed both ways — identical numerical values, confirming Theorem 6(i).

Example 13 Bernoulli p vs logit η — Wald disagrees, LRT agrees

Bernoulli $p$ with logit link $\eta = \log[p/(1-p)]$. Test $H_0: p = 0.5$ (equivalently $\eta_0 = 0$) with observed $\hat p = 0.3$, $n = 50$.

Wald in p-space: $W_n^p = n(\hat p - 0.5)^2/[\hat p(1-\hat p)] = 50 \cdot 0.04/0.21 \approx 9.52$. P-value $\approx 0.002$.

Wald in logit-space: $\hat\eta = \log(0.3/0.7) \approx -0.847$, $\eta_0 = 0$. Information at $\hat\eta$: $I^\eta(\hat\eta) = \hat p(1-\hat p) = 0.21$. $W_n^\eta = 50 \cdot 0.717 \cdot 0.21 \approx 7.53$. P-value $\approx 0.006$.

LRT: $-2\log\Lambda_n = 2n\,[0.3\log(0.3/0.5) + 0.7\log(0.7/0.5)] \approx 8.23$. P-value $\approx 0.004$. The same value in both parameterizations (Theorem 6(i)).

The three p-values (0.002, 0.006, 0.004) all clear $\alpha = 0.05$. But at borderline significance — say, $\hat p = 0.42$ — the Wald p-values in the two parameterizations can straddle $\alpha$, flipping the rejection decision. The LRT does not suffer this instability.
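The arithmetic fits in a few lines. A hedged reproduction of Example 13's numbers (standalone, not the testing.ts API):

```ts
const n = 50, p0 = 0.5, pHat = 0.3;

const waldP = (n * (pHat - p0) ** 2) / (pHat * (1 - pHat)); // ≈ 9.52 (p-space)
const etaHat = Math.log(pHat / (1 - pHat));                 // ≈ -0.847
const waldEta = n * etaHat ** 2 * (pHat * (1 - pHat));      // ≈ 7.53 (logit-space) — differs
const lrt = 2 * n * (pHat * Math.log(pHat / p0)
          + (1 - pHat) * Math.log((1 - pHat) / (1 - p0)));  // ≈ 8.23 — same in both spaces
```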

Example 14 MC at n = 20 — Wald liberal, Score conservative, LRT accurate

At $p_0 = 0.3$, $n = 20$, $M = 10{,}000$ MC replications under $H_0$: empirical Type I error rates at nominal $\alpha = 0.05$ (using the $\chi^2_1$ reference critical value 3.84):

| Test | Empirical size | Deviation from 0.05 |
| --- | --- | --- |
| Wald $W_n$ | 0.080 | +0.030 (liberal) |
| Score $S_n$ | 0.021 | −0.029 (conservative) |
| LRT $-2\log\Lambda_n$ | 0.049 | −0.001 (≈ nominal) |

At this sample size, Wald rejects about 60% more often than the nominal rate, and Score rejects about 60% less often. The LRT hits nominal essentially exactly. This is the quantitative version of Topic 17 Remark 10’s divergence claim: the three tests are not interchangeable in small samples, and the LRT’s accuracy is the empirical argument for preferring it when computational cost permits.

Note: results replicate with wilksSimulate('bernoulli', 0.3, 20, 10000, undefined, 42) for the LRT column; Wald and Score columns use analogous seeded MC over waldStatistic and scoreStatistic. Reproducible via the test harness in testing.test.ts.
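A minimal standalone version of that size study, sketched under the same assumptions (the real run uses wilksSimulate, waldStatistic, and scoreStatistic from testing.ts):

```ts
// Empirical Type I error of Wald / Score / LRT for Bernoulli(p0) at the χ²₁ cutoff.
function empiricalSizes(p0: number, n: number, M: number, rand: () => number = Math.random) {
  const CRIT = 3.841; // χ²₁ 0.95 quantile
  const rej = { wald: 0, score: 0, lrt: 0 };
  for (let m = 0; m < M; m++) {
    let s = 0;
    for (let i = 0; i < n; i++) s += rand() < p0 ? 1 : 0;
    const pHat = s / n;
    // Wald blows up at the boundary (Remark 16); treat p̂ ∈ {0, 1} as a rejection.
    const wald = pHat === 0 || pHat === 1 ? Infinity
      : (n * (pHat - p0) ** 2) / (pHat * (1 - pHat));
    const score = (n * (pHat - p0) ** 2) / (p0 * (1 - p0));
    const lrt = 2 * n * ((pHat > 0 ? pHat * Math.log(pHat / p0) : 0)
      + (pHat < 1 ? (1 - pHat) * Math.log((1 - pHat) / (1 - p0)) : 0));
    if (wald > CRIT) rej.wald++;
    if (score > CRIT) rej.score++;
    if (lrt > CRIT) rej.lrt++;
  }
  return { wald: rej.wald / M, score: rej.score / M, lrt: rej.lrt / M };
}

// empiricalSizes(0.3, 20, 10_000) → roughly { wald: 0.08, score: 0.02, lrt: 0.05 }
```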

Remark 16 Wald's boundary pathology

The Wald statistic $W_n = n(\hat\theta_n - \theta_0)^2 I(\hat\theta_n)$ breaks down when $\hat\theta_n$ lies near the parameter boundary: $I(\hat\theta_n) \to \infty$ (for Bernoulli at $\hat p = 0$ or $1$, $I = 1/[\hat p(1-\hat p)] \to \infty$), producing infinite or otherwise degenerate values of $W_n$. The Score statistic evaluates $I(\theta_0)$ at a boundary-safe null value and remains finite; the LRT uses the logarithm, which is well-behaved on $(0, 1)$ except at the endpoints themselves. For A/B tests with rare-event outcomes (conversion rates of 0.1% or smaller), the Wald test's boundary fragility motivates the Wilson interval in Topic 19 and the log-odds-ratio approach that modern experimentation platforms increasingly use.

Remark 17 GLM link choice is a reparameterization invariance question

In generalized linear models, the question "should I fit with a logit link, a probit link, or a log link?" is, for inference purposes, a parameterization question. The coefficient $\beta_k$ under a logit link carries log-odds-ratio meaning; under a probit link, it carries z-score-of-normal-latent meaning. The LRT for $H_0: \beta_k = 0$ is invariant across link choices (Theorem 6(i)): the same $-2\log\Lambda_n$ and the same p-value regardless of link. The Wald test is not — Wald-logit, Wald-probit, and Wald-log-binomial yield three different numerical p-values for the same underlying hypothesis. This is why GLM software (R's glm, Python's statsmodels.GLM) typically reports LRT-based deviance-difference tests as the canonical nested-model comparison, reserving Wald for single-coefficient z-statistics. See §22.8 for the formal treatment and §22.7 Thm 6 for the deviance-LRT derivation.

18.9 Power Envelope and Local Power

Tests aren't judged by size alone. Size tells us Type I error is controlled; power tells us whether we'll catch the effect when it's there. For asymptotic tests, the sharpest characterization of power is in the local-alternative regime $\theta_n = \theta_0 + h/\sqrt n$ — effects that shrink with $n$ at the scaling rate where all three tests (Wald, Score, LRT) have non-trivial power strictly between $\alpha$ and 1.

Definition 7 Non-central chi-squared distribution

Let $Z_1, \ldots, Z_k$ be independent $\mathcal{N}(\mu_i, 1)$. Then $V = \sum_{i=1}^k Z_i^2$ has the non-central chi-squared distribution $\chi^2_k(\lambda)$ with $k$ degrees of freedom and non-centrality parameter $\lambda = \sum_i \mu_i^2$. Its density and CDF admit the Poisson-mixture series

$$f_{\chi^2_k(\lambda)}(x) = \sum_{j=0}^\infty \frac{e^{-\lambda/2}(\lambda/2)^j}{j!}\, f_{\chi^2_{k+2j}}(x),$$

with the analogous series for the CDF. At $\lambda = 0$ this reduces to the central $\chi^2_k$. Mean $k + \lambda$, variance $2(k + 2\lambda)$ — larger non-centrality shifts the distribution to the right.

Theorem 7 Local power limit

Under the regularity of Theorem 4, at the level-$\alpha$ rejection rule $\{-2\log\Lambda_n > \chi^2_{1, 1-\alpha}\}$ (equivalently for Wald or Score, by Theorem 5), the power under the local alternative $\theta_n = \theta_0 + h/\sqrt n$ converges:

$$\beta_n(\theta_n) \to 1 - F_{\chi^2_1(h^2 I(\theta_0))}\bigl(\chi^2_{1, 1-\alpha}\bigr),$$

where $F_{\chi^2_1(\lambda)}$ is the non-central $\chi^2_1(\lambda)$ CDF. The non-centrality parameter is $\lambda = h^2 I(\theta_0)$ — Fisher information scaled by the squared local effect size.

Proof sketch. Under the local alternative, $\sqrt n(\hat\theta_n - \theta_0) = \sqrt n(\hat\theta_n - \theta_n) + h$. The first term is $\mathcal{N}(0, I(\theta_n)^{-1}) + o_P(1)$ by a contiguous-alternative argument (LAN; LEH2005 §13.2), and $I(\theta_n) \to I(\theta_0)$ by continuity. So $Z_n := \sqrt{nI(\theta_0)}\,(\hat\theta_n - \theta_0) \xrightarrow{d} \mathcal{N}(h\sqrt{I(\theta_0)}, 1)$, and $Z_n^2 \xrightarrow{d} \chi^2_1(h^2 I(\theta_0))$ by continuous mapping. Power at the $\chi^2_{1, 1-\alpha}$ critical value is the tail probability of the limit distribution. ∎ (LEH2005 Thm 13.5.1 for the full argument.)

Two-panel figure. Left: local power curves β(h) for Wald, Score, and LRT under local alternatives θₙ = θ₀ + h/√n at θ₀ = 0.5 (Bernoulli), h ∈ [0, 4]. The non-central χ²₁(h² I(θ₀)) envelope is overlaid; all three tests hug the envelope. Right: same for Normal mean, σ = 1, θ₀ = 0.
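Theorem 7 is easy to check by simulation. A hedged TypeScript sketch with illustrative names: draw Bernoulli samples under the local alternative, compute the LRT statistic, and compare the empirical rejection rate to the envelope.

```ts
// Monte Carlo check of Theorem 7: the rejection rate of the Bernoulli LRT
// under p_n = p0 + h/sqrt(n) should approach 1 - F_{chi2_1(h^2 I(p0))}(3.841).

function lrtStatBernoulli(k: number, n: number, p0: number): number {
  // -2 log Lambda = 2n [pHat log(pHat/p0) + (1-pHat) log((1-pHat)/(1-p0))]
  const pHat = k / n;
  const part = (x: number, p: number) => (x === 0 ? 0 : x * Math.log(x / p));
  return 2 * n * (part(pHat, p0) + part(1 - pHat, 1 - p0));
}

function mcLocalPower(p0: number, h: number, n: number, reps = 50_000): number {
  const pn = p0 + h / Math.sqrt(n); // local alternative
  let rejections = 0;
  for (let r = 0; r < reps; r++) {
    let k = 0;
    for (let i = 0; i < n; i++) if (Math.random() < pn) k++;
    if (lrtStatBernoulli(k, n, p0) > 3.841) rejections++;
  }
  return rejections / reps;
}

console.log(mcLocalPower(0.5, 2, 400)); // ~0.98, near the envelope value 0.979
```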

Example 15 Bernoulli local power at h ∈ {0, 1, 2, 3}

For Bernoulli($p$) at $p_0 = 0.5$: $I(0.5) = 1/[0.5 \cdot 0.5] = 4$. Local alternatives $p_n = 0.5 + h/\sqrt n$. Non-centrality parameter $\lambda = h^2 \cdot 4$.

| $h$ | $\lambda = 4h^2$ | Local power at $\alpha = 0.05$ |
| --- | --- | --- |
| 0 | 0 | 0.050 (= size) |
| 1 | 4 | 0.516 |
| 2 | 16 | 0.979 |
| 3 | 36 | 0.99997 |

At $h = 2$ (an effect of $h\sqrt{I(\theta_0)} = 4$ standard errors into the alternative), power is already 97.9%. At $h = 3$, essentially certain rejection. Values match localPower('bernoulli', 0.5, h, 0.05) in testing.ts.

The practical reading: for Bernoulli at $p_0 = 0.5$, a local effect of $h = 1.96/\sqrt{I(\theta_0)} = 0.98$ (i.e., $h\sqrt{I(\theta_0)} = 1.96$ standard errors of $\hat\theta_n$) gives 50% power at $\alpha = 0.05$ — the classical “z-test power 50% at the rejection boundary” calibration.
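For concreteness, here is a sketch of what a localPower-style routine could look like, reusing noncentralChi2Cdf from the Definition 7 sketch; the actual testing.ts implementation may differ, and the critical value is hard-coded for $\alpha = 0.05$.

```ts
const CHI2_1_95 = 3.841458820694124; // chi^2_{1, 0.95} critical value

function bernoulliInfo(p: number): number {
  return 1 / (p * (1 - p)); // Fisher information for one Bernoulli trial
}

function localPowerBernoulli(p0: number, h: number): number {
  const lambda = h * h * bernoulliInfo(p0); // non-centrality h^2 I(p0)
  return 1 - noncentralChi2Cdf(CHI2_1_95, 1, lambda);
}

for (const h of [0, 1, 2, 3]) {
  console.log(h, localPowerBernoulli(0.5, h).toFixed(4));
}
// prints ~0.0500, 0.5160, 0.9793, 1.0000: the Example 15 values up to rounding
```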

Remark 18 CRLB as the power envelope — Fisher information plays both roles

The non-centrality $\lambda = h^2 I(\theta_0)$ has a striking interpretation. On the estimation side, Topic 13 Thm 13.9 (Cramér-Rao) says that $\text{Var}(\hat\theta_n) \ge [n I(\theta_0)]^{-1}$ — Fisher information lower-bounds estimator variance. On the testing side, Theorem 7 says that local power is $1 - F_{\chi^2_1(h^2 I(\theta_0))}(\chi^2_{1, 1-\alpha})$ — Fisher information sets the power envelope through the non-centrality. The same $I(\theta_0)$ appears on both sides of the optimality question: it caps what you can learn (variance of the best estimator) and what you can detect (power of the best test). The Cramér-Rao bound and the asymptotic power envelope are two faces of the same efficiency constraint.

This is the payoff for Topic 17 §17.4 Remark 5’s forward-pointer: Fisher information is the currency of asymptotic efficiency, and every efficient procedure achieves the envelope.

Remark 19 Asymptotic efficiency — what it means operationally

A test is asymptotically efficient at a local alternative $\theta_n = \theta_0 + h/\sqrt n$ if its local power approaches the envelope $1 - F_{\chi^2_1(h^2 I(\theta_0))}(\chi^2_{1, 1-\alpha})$. By Theorem 5, Wald, Score, and LRT are all asymptotically efficient. So are tests derived from efficient estimators other than the MLE (e.g., the two-stage estimator in some exponential-family models). A test tailored to one specific local alternative can be most powerful there — but since the target alternative is usually unknown, asymptotic efficiency (attaining the envelope at every local alternative) is the operationally meaningful notion.

Remark 20 Asymptotic Relative Efficiency — Pitman's coinage

The ratio of local powers of two asymptotically efficient tests tends to $1$ at every $h$, but the ratio of sample sizes needed to achieve a target power at the same alternative is a meaningful constant. Pitman’s asymptotic relative efficiency (ARE) formalizes this: the ARE of $T_2$ relative to $T_1$ is $\lim_{n \to \infty} n_1 / n_2$, where $n_i$ is the sample size test $T_i$ needs to achieve power $\beta$ at a local alternative $h$. For $T_1$ = LRT, $T_2$ = Wald: ARE $= 1$ — the three tests are first-order equivalent. For $T_1$ = LRT, $T_2$ = sign test (on Normal data): ARE $= 2/\pi \approx 0.637$ — the sign test needs $\pi/2 \approx 1.57$ times as many samples, about 57% more, to match the LRT’s power. See Hettmansperger & McKean (2011) for a systematic treatment; ARE is the intermediate-difficulty entry point to modern robust-statistics theory.
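The $2/\pi$ constant comes from the ratio of test efficacies: $2f(0)$ for the sign test versus $1/\sigma$ for the t/LRT on Normal data (standard Pitman-framework formulas). A one-line check with illustrative names:

```ts
// Arithmetic check of Remark 20: ARE of the sign test relative to the t/LRT
// on Normal data, and the implied sample-size inflation.

const sigma = 1;
const signEfficacy = (2 / Math.sqrt(2 * Math.PI)) / sigma; // 2 f(0) at the Normal density
const tEfficacy = 1 / sigma;                               // efficacy of the t-test / LRT

const are = (signEfficacy / tEfficacy) ** 2; // (2 phi(0))^2 = 2/pi ~ 0.637
const inflation = 1 / are - 1;               // pi/2 - 1 ~ 0.571: ~57% more samples

console.log(are.toFixed(3), `${(100 * inflation).toFixed(0)}% more samples`);
```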

18.10 Limitations and Forward Look

Topic 18 covered the scalar-$\theta$ optimality theory of classical hypothesis testing. Four directions for deeper study, three forward pointers, and a cheat-sheet summary.

Remark 21 Scope boundary — what Topic 18 did not cover

Four major topics in classical testing optimality were stated but not proved in Topic 18.

  1. UMP unbiased tests. For two-sided composite alternatives where no UMP test exists, UMPU restricts attention to unbiased tests (those with $\beta(\theta) \ge \alpha$ at every alternative) and establishes a UMP test within that class. LEH2005 Ch. 4 gives the full theory; the two-sample Normal mean test is the canonical example.

  2. Invariance and Hunt-Stein. Problems with natural symmetries (scale invariance for variance tests, translation invariance for location tests) admit UMP-invariant tests even when UMP fails. LEH2005 Ch. 6 and the Hunt-Stein theorem (1946) give the formalism.

  3. Vector-θ Wilks, k-df. The scalar proof extends line-for-line; the $\chi^2_k$ limit replaces $\chi^2_1$, with $k = \dim(\Theta) - \dim(\Theta_0)$. LEH2005 Thm 12.4.2.

  4. Non-regular Wilks — Chernoff 1954 boundary theorem, empirical-process Wilks. The $\tfrac{1}{2} \chi^2_0 + \tfrac{1}{2} \chi^2_1$ mixture for boundary nulls, the LAN / Le Cam framework for non-smooth families, and the empirical-process Wilks for semiparametric models (van der Vaart 1998 §25) are the three modern extensions.

Remark 22 CI duality preview — Topic 19

Every family of level-$\alpha$ tests indexed by the null value $\theta_0$ defines a $(1-\alpha)$-confidence set: the set of $\theta_0$ values the test does not reject. Conversely, every confidence procedure defines a family of hypothesis tests. This test-CI duality is the organizing principle of Topic 19. The LRT gives likelihood-ratio confidence intervals — invariant under reparameterization, exact in the Normal case, and the practical default for GLM coefficients. The Wald-inversion CI is the simplest but suffers the boundary pathology of §18.8 Remark 16. Topic 19 treats all three constructions in parallel, with explicit coverage calibration.
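As a preview of the mechanics, a hedged sketch with illustrative names: inverting the level-0.05 score test for a Bernoulli proportion by grid search reproduces the Wilson interval.

```ts
// Test inversion: collect the p0 values the level-0.05 score test fails to
// reject; a grid approximation to the ~95% confidence set.

function scoreInversionCI(pHat: number, n: number): [number, number] {
  const crit = 3.841; // chi^2_{1, 0.95}
  const kept: number[] = [];
  for (let i = 1; i < 2000; i++) {
    const p0 = i / 2000;
    const stat = (n * (pHat - p0) ** 2) / (p0 * (1 - p0)); // score statistic
    if (stat <= crit) kept.push(p0);
  }
  return [kept[0], kept[kept.length - 1]];
}

console.log(scoreInversionCI(0.12, 200)); // ~[0.082, 0.172]: the Wilson interval for 24/200
```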

Remark 23 Multiple testing — Topic 20

Every hypothesis test of Topics 17–18 controls the per-test Type I error at level $\alpha$. But when 10, or 100, or 10,000 tests are run simultaneously (every gene in a genomics screen; every variable in a high-dimensional regression; every hypothesis in an A/B/n testing platform), the family-wise Type I error explodes. Bonferroni, Holm, and Šidák control the family-wise error rate (FWER) at level $\alpha$ by adjusting per-test thresholds. Benjamini-Hochberg controls the false discovery rate (FDR) — a weaker but more powerful notion — and is the contemporary default for exploratory studies. Topic 20 develops all four procedures with full proofs for Bonferroni (§20.4), Holm (§20.5), and the featured BH result (§20.7), against the replication-crisis literature (Ioannidis 2005; Gelman & Loken 2013).

Remark 24 Cheat sheet — decision flow for test selection

The decisions in Topic 18 compress into a short practical guide.

| Situation | Choice | Rationale |
| --- | --- | --- |
| Simple-vs-simple, any family | NP test (Theorem 1) | MP by NP lemma |
| One-sided composite, MLR family (exp family, Uniform) | Karlin-Rubin UMP (Theorem 2) | UMP at every alternative |
| Two-sided composite, exp family | UMPU test (LEH2005 Ch. 4) | UMP within unbiased class |
| Scale-invariant problem | UMP-invariant test (LEH2005 Ch. 6) | UMP within invariant class |
| General composite, regular family | Generalized LRT (Definition 5) | Wilks: asymptotic $\chi^2_k$ |
| Computational cost matters | Wald or Score | Asymptotic equivalence (Thm 5) |
| Reparameterization sensitive, boundary rare | LRT | Invariance (Thm 6), no boundary issues |
| Non-regular null (boundary, mixture, non-smooth) | MC calibration | Chernoff mixture / empirical process |

Default: LRT with MC calibration of critical values at small $n$; switch to Wald or Score when computation demands it, and to the Karlin-Rubin UMP test when MLR structure applies.
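The decision flow also compresses into code. An illustrative encoding (the Situation labels and return strings are this sketch's own, not from testing.ts):

```ts
type Situation =
  | "simple-vs-simple"
  | "one-sided-composite-MLR"
  | "two-sided-composite-exp-family"
  | "scale-invariant"
  | "general-composite-regular"
  | "computation-bound"
  | "reparameterization-sensitive"
  | "non-regular-null";

function selectTest(s: Situation): string {
  switch (s) {
    case "simple-vs-simple":               return "NP test (Theorem 1)";
    case "one-sided-composite-MLR":        return "Karlin-Rubin UMP (Theorem 2)";
    case "two-sided-composite-exp-family": return "UMPU test (LEH2005 Ch. 4)";
    case "scale-invariant":                return "UMP-invariant test (LEH2005 Ch. 6)";
    case "general-composite-regular":      return "Generalized LRT, Wilks chi^2_k calibration";
    case "computation-bound":              return "Wald or Score (Thm 5 equivalence)";
    case "reparameterization-sensitive":   return "LRT (Thm 6 invariance)";
    case "non-regular-null":               return "LRT with MC calibration (Chernoff mixture)";
  }
}

console.log(selectTest("general-composite-regular"));
```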

Remark 25 Track 5 and beyond

Topic 17 built the framework; Topic 18 delivered the optimality layer. Tracks 6, 7, and 8 extend the classical testing machinery in three orthogonal directions.

  • Track 6 (Regression & Linear Models). The Wald/Score/LRT trio of §18.7 reappears as the three standard GLM inference procedures — deviance tests are LRT, z-tests on coefficients are Wald, forward-selection score tests are Rao. Linear regression’s F-test is exactly Wilks’ theorem specialized to nested linear models — §21.8 Thm 9 delivers the exact $F_{k, n-p-1}$ finite-sample distribution as the sharpening of Topic 18 Thm 4.

  • Track 7 (Bayesian Statistics). Bayes factors (the posterior-odds analog of the LRT) are the Bayesian counterpart to §18.5–§18.7. Frequentist/Bayesian testing duality is non-trivial — Lindley’s paradox (Topic 27 §27.5) is the first sharp disagreement.

  • Track 8 (High-Dimensional & Nonparametric). Kolmogorov-Smirnov, Mann-Whitney, and permutation tests are nonparametric alternatives to the z/t/$\chi^2$ trio. The Pitman ARE framework (Remark 20) is the bridge — nonparametric tests pay a constant efficiency penalty in exchange for distributional robustness. See Topic 29 §29.8 for Kolmogorov-Smirnov.

At formalml.com, Wilks’ theorem becomes the backbone of nested-model comparison in GLMs and deep learning; the NP lemma reappears as the Bayes classifier under 0-1 loss; and the three-tests equivalence powers the standard error machinery of every modern ML inference library. The classical testing theory is not a historical artifact — it is the statistical grammar of modern machine learning.


References

  1. Neyman, Jerzy, and Egon S. Pearson. (1933). On the Problem of the Most Efficient Tests of Statistical Hypotheses. Philosophical Transactions of the Royal Society A, 231, 289–337.

  2. Karlin, Samuel, and Herman Rubin. (1956). The Theory of Decision Procedures for Distributions with Monotone Likelihood Ratio. Annals of Mathematical Statistics, 27(2), 272–299.

  3. Karlin, Samuel. (1957). Pólya Type Distributions, II. Annals of Mathematical Statistics, 28(2), 281–308.

  4. Wilks, Samuel S. (1938). The Large-Sample Distribution of the Likelihood Ratio for Testing Composite Hypotheses. Annals of Mathematical Statistics, 9(1), 60–62.

  5. Rao, C. Radhakrishna. (1948). Large Sample Tests of Statistical Hypotheses Concerning Several Parameters with Applications to Problems of Estimation. Mathematical Proceedings of the Cambridge Philosophical Society, 44(1), 50–57.

  6. Ferguson, Thomas S. (1967). Mathematical Statistics: A Decision Theoretic Approach. New York: Academic Press.

  7. van der Vaart, Aad W. (1998). Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics 3. Cambridge: Cambridge University Press.

  8. Buse, A. (1982). The Likelihood Ratio, Wald, and Lagrange Multiplier Tests: An Expository Note. The American Statistician, 36(3a), 153–157.

  9. Lehmann, Erich L., and Joseph P. Romano. (2005). Testing Statistical Hypotheses (3rd ed.). Springer Texts in Statistics. New York: Springer.

  10. Casella, George, and Roger L. Berger. (2002). Statistical Inference (2nd ed.). Pacific Grove, CA: Duxbury.