intermediate 55 min read · April 24, 2026

Sufficient Statistics & Rao-Blackwell

Data reduction, UMVUE, and the Pitman-Koopman-Darmois characterization — the theorems that close classical estimation.

16.1 From §7.6 to a Theory of Data Reduction

Topic 7 §7.6 stated the Fisher-Neyman factorization theorem only for exponential families: for a family of the form $f(x;\theta) = h(x)\exp(\eta(\theta) T(x) - A(\theta))$, the statistic $T$ is sufficient by construction — the factorization is already in the form the theorem demands. Topic 16 generalizes this to arbitrary dominated families and builds the full machinery of data reduction: minimal sufficiency, completeness, Rao-Blackwell optimality, the Lehmann-Scheffé UMVUE theorem, Basu's independence theorem, and — closing the circle — the Pitman-Koopman-Darmois converse.

The thesis of the topic is one sentence: a sufficient statistic is exactly the data summary that loses nothing about the parameter, and exponential families are the only families where the summary stays small. Everything else — Rao-Blackwell, Lehmann-Scheffé, Basu — sharpens this into operational tools for building optimal estimators.

Two-panel motivation. Left: 30 raw Bernoulli observations are collapsed into a single count T = ΣXᵢ via a horizontal arrow labeled "data reduction, lossless for p". Right: histogram of X̄ across M = 5000 Monte-Carlo replications of Bernoulli(p = 0.3) at n = 30, with the theoretical N(p, p(1-p)/n) overlay — demonstrating that all inference about p is contained in the sufficient statistic.

Remark 1 Fisher (1922) and Halmos–Savage (1949)

The concept of sufficiency originates with R. A. Fisher's 1922 paper, On the Mathematical Foundations of Theoretical Statistics (§7 of that paper introduces "sufficient statistics" as one of the cornerstones of his program). Fisher's argument was operational: estimators that ignore the sufficient statistic discard usable information, and one should prefer those that don't. The measure-theoretic formalization — the modern statement and proof of the factorization theorem in full generality — came nearly thirty years later in Halmos & Savage (1949), Application of the Radon-Nikodym Theorem to the Theory of Sufficient Statistics. The Halmos–Savage paper handles the dominated-family case rigorously via the Radon–Nikodym derivative, which is what Topic 16 §16.3 quietly invokes when it speaks of densities $f(x; \theta) = dP_\theta/d\mu$ for a common dominating measure $\mu$.

Remark 2 What "data reduction" actually means

Sufficiency is a strong claim: $T$ is a lossless compression of the data with respect to $\theta$. Once you know $T$, the rest of the data is parameter-free noise — the conditional distribution $X \mid T$ does not depend on $\theta$ at all. This is information-theoretic in spirit but does not require Shannon information; it is a structural statement about the family $\{P_\theta\}$. The compression is sometimes dramatic — for $n$ Bernoulli trials, the sufficient statistic is a single integer in $\{0, 1, \ldots, n\}$; for $n$ Normal observations with both parameters unknown, it's a 2-vector $(\sum X_i, \sum X_i^2)$. Topics 7, 13, 14, and 15 used this fact implicitly. §16.10 will reveal why the compression stays small only in exponential families.

16.2 Sufficient Statistics: The Conditional-Distribution Definition

The cleanest definition of sufficiency uses conditional distributions directly — no factorization theorem yet, no exponential-family scaffolding. The conditional distribution of the data given a sufficient statistic is parameter-free; everything else is downstream.

Definition 1 Sufficient statistic

A statistic $T(X)$ is sufficient for $\theta$ if the conditional distribution of $X$ given $T(X) = t$ does not depend on $\theta$ — that is, $P_\theta(X \in \cdot \mid T = t)$ is the same for every $\theta \in \Theta$ and every $t$ in the range of $T$.

Operationally: knowing $T$ leaves no further parameter information in the data. The residual randomness of $X \mid T$ is "noise" in the strict sense that its distribution is fixed regardless of the truth.

Definition 2 Jointly sufficient statistics

A vector statistic $T(X) = (T_1(X), \ldots, T_k(X))$ is jointly sufficient for a parameter $\theta = (\theta_1, \ldots, \theta_p) \in \Theta$ if the conditional distribution of $X$ given $T(X) = t$ does not depend on $\theta$. The dimension $k$ of the sufficient statistic need not equal the parameter dimension $p$ — though when $T$ is minimal (§16.4) and the family is regular (§16.10), the two dimensions match.

Theorem 1 Order statistic is always sufficient for iid samples

For iid data $X_1, \ldots, X_n$ from any common distribution $P_\theta$, the order statistic $T(X) = (X_{(1)}, \ldots, X_{(n)})$ is sufficient for $\theta$. Topic 29 makes the order statistic the central object of nonparametric statistics — Track 8 begins there precisely because this theorem tells us no nonparametric procedure ever discards information in the sorted sample.

The reason is structural: the iid joint density is symmetric in its arguments, so $f(x; \theta) = f(x_{\sigma(1)}, \ldots, x_{\sigma(n)}; \theta)$ for every permutation $\sigma$. Conditioning on the order statistic is equivalent to conditioning on the unordered sample — and given the unordered sample, the distribution over labelings is uniform over all $n!$ permutations regardless of $\theta$. The order statistic is sufficient by construction. It is rarely minimal — for most families, a much smaller statistic also suffices.

Example 1 Normal(μ, σ² known): T = ΣXᵢ via direct conditional check

Let $X_1, \ldots, X_n$ be iid $\mathcal{N}(\mu, \sigma^2)$ with $\sigma^2$ known. Define $T = \sum_{i=1}^n X_i$. We check sufficiency directly: the conditional density of $X$ given $T = t$ is

$$f(x \mid T = t; \mu) \;=\; \frac{f(x; \mu)}{f_T(t; \mu)} \cdot \mathbf{1}\!\left\{\sum x_i = t\right\}.$$

The numerator is the iid Normal density $(2\pi\sigma^2)^{-n/2}\exp\bigl(-\sum(x_i - \mu)^2 / 2\sigma^2\bigr)$. Expanding the squared-deviation sum,

$$\sum (x_i - \mu)^2 \;=\; \sum x_i^2 - 2\mu\sum x_i + n\mu^2 \;=\; \sum x_i^2 - 2\mu t + n\mu^2.$$

The denominator $f_T(t;\mu)$ is the density of $T \sim \mathcal{N}(n\mu, n\sigma^2)$; expanding its exponent produces the same $\mu t/\sigma^2$ and $-n\mu^2/2\sigma^2$ terms as the numerator. The $\mu$-dependent terms cancel exactly in the ratio, leaving the conditional density independent of $\mu$. Hence $T$ is sufficient.

Example 2 Bernoulli(p): T = ΣXᵢ via the binomial conditional

Let $X_1, \ldots, X_n$ be iid $\mathrm{Bernoulli}(p)$, and define $T = \sum X_i$. The joint pmf is $p^t (1-p)^{n-t}$ where $t = T(x)$. Conditional on $T = t$, every binary vector $x \in \{0,1\}^n$ with $\sum x_i = t$ has probability

$$P_p(X = x \mid T = t) \;=\; \frac{p^t(1-p)^{n-t}}{\binom{n}{t}p^t(1-p)^{n-t}} \;=\; \binom{n}{t}^{-1},$$

uniform over the $\binom{n}{t}$ vectors in the level set. The conditional is parameter-free — $T$ is sufficient.
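
A quick numerical check, as a minimal sketch in Python (the values $n = 5$, $t = 2$ and the two probed values of $p$ are illustrative assumptions): enumerating every binary vector in the level set and computing the conditional probability under two different values of $p$ returns the same answer, $1/\binom{n}{t}$, every time.

```python
# Enumerate the level set {x : sum(x) = t} and verify that P_p(X = x | T = t)
# does not depend on p -- the conditional is uniform, 1 / C(n, t).
from itertools import product
from math import comb

n, t = 5, 2

def conditional_prob(x, p):
    """P_p(X = x | sum(X) = t) for a binary vector x with sum(x) = t."""
    joint = p ** sum(x) * (1 - p) ** (n - sum(x))         # P_p(X = x)
    marginal = comb(n, t) * p ** t * (1 - p) ** (n - t)    # P_p(T = t)
    return joint / marginal

for x in product([0, 1], repeat=n):
    if sum(x) == t:
        probs = [conditional_prob(x, p) for p in (0.2, 0.7)]
        # Both agree with 1/C(n, t) regardless of p -- sufficiency in action.
        assert all(abs(q - 1 / comb(n, t)) < 1e-12 for q in probs), (x, probs)

print("conditional is uniform, 1/C(n,t) =", 1 / comb(n, t))
```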

Example 3 The identity statistic is trivially sufficient; a constant statistic is not

The identity statistic $T(X) = X$ is trivially sufficient: conditioning on the full data leaves nothing left to vary, so the conditional distribution cannot depend on $\theta$. It is also useless, performing no reduction at all. At the opposite extreme, a constant statistic (e.g., $T(X) \equiv n$, which ignores the data values entirely) compresses maximally but is not sufficient in general: the "conditional distribution of $X$ given $T$" is just the marginal joint distribution of $X$, which still depends on $\theta$. Both extremes — too coarse to remain sufficient, too fine to compress — motivate minimal sufficiency (§16.4): the coarsest sufficient statistic that still captures everything.

Sufficient-statistic diagram. Left: raw sample space partitioned by level sets of T. Center: the statistic map T : Xⁿ → T. Right: the conditional distribution X | T = t with annotation "parameter-free by sufficiency". Bernoulli (T = ΣXᵢ, partition by count) and Normal (partition by (X̄, S²)) examples are overlaid.

Remark 3 Conditional-distribution vs factorization definitions

The conditional-distribution definition (Def 1) is conceptually clean but operationally heavy — checking it requires writing out the joint density and dividing through, as in Examples 1 and 2. The Fisher–Neyman factorization theorem (§16.3) gives an equivalent and easier operational test: $T$ is sufficient iff the joint density factors as $g(T(x); \theta) \cdot h(x)$. The two definitions agree for dominated families (the standard setting), with the equivalence proved in §16.3. From here forward, factorization is the tool of choice; the conditional definition is the conceptual anchor.

16.3 The Fisher-Neyman Factorization Theorem

The factorization theorem is the engine of everything downstream. It converts the conditional-distribution definition (which requires checking that conditional densities are constant in $\theta$) into a one-line algebraic check on the joint density itself.

Theorem 2 Fisher–Neyman factorization (general)

Let $\{P_\theta : \theta \in \Theta\}$ be a family of distributions on a measurable space $(\mathcal{X}, \mathcal{A})$, dominated by a common $\sigma$-finite measure $\mu$, with densities $f(x; \theta) = dP_\theta/d\mu$. A statistic $T : \mathcal{X} \to \mathcal{T}$ is sufficient for $\theta$ if and only if there exist non-negative measurable functions $g(\cdot\,; \theta)$ on $\mathcal{T} \times \Theta$ and $h(\cdot)$ on $\mathcal{X}$ such that

$$f(x; \theta) \;=\; g(T(x); \theta) \cdot h(x) \qquad \mu\text{-a.s.}$$

The factorization splits the joint density into a $\theta$-dependent factor that touches the data only through $T$ and a $\theta$-independent factor $h$ that absorbs everything else. The proof, which we give in full below, is direct in the discrete case and extends to the dominated case via Halmos–Savage.

Proof.

Discrete case (sufficiency $\Leftrightarrow$ factorization).

($\Leftarrow$: factorization $\Rightarrow$ sufficient.) Suppose $f(x;\theta) = g(T(x); \theta) \cdot h(x)$. For any $x$ in the support and $t = T(x)$,

$$P_\theta(X = x \mid T = t) \;=\; \frac{f(x;\theta)}{P_\theta(T = t)}\,\mathbf{1}\{T(x) = t\}.$$

The denominator is $P_\theta(T = t) = \sum_{x' : T(x') = t} f(x'; \theta) = g(t; \theta) \sum_{x' : T(x') = t} h(x')$, since $g(T(x'); \theta) = g(t; \theta)$ is constant on the level set. Substituting,

$$P_\theta(X = x \mid T = t) \;=\; \frac{g(t;\theta)\, h(x)}{g(t;\theta) \sum_{x'} h(x')} \;=\; \frac{h(x)}{\sum_{x' : T(x') = t} h(x')}.$$

The conditional probability is a function of $x$ alone — no $\theta$ dependence — so $T$ is sufficient.

($\Rightarrow$: sufficient $\Rightarrow$ factorization.) Suppose $T$ is sufficient. Then $P_\theta(X = x \mid T = T(x))$ does not depend on $\theta$. Define

$$h(x) \;=\; P_{\theta_0}(X = x \mid T = T(x)), \qquad g(t; \theta) \;=\; P_\theta(T = t),$$

for any fixed reference $\theta_0$. By sufficiency, $h(x)$ does not depend on the choice of $\theta_0$. Then

$$f(x;\theta) \;=\; P_\theta(X = x) \;=\; P_\theta(T = T(x)) \cdot P_\theta(X = x \mid T = T(x)) \;=\; g(T(x); \theta) \cdot h(x),$$

which is the desired factorization.

Continuous / dominated-family case. The discrete argument extends to the dominated-family setting via the Radon–Nikodym theorem. Let $\mu$ be a $\sigma$-finite measure dominating $\{P_\theta\}$, so $f(x;\theta) = dP_\theta/d\mu$. The conditional density $f(x \mid T = t; \theta)$ is well-defined as a Radon–Nikodym derivative on the level set $\{T(x) = t\}$, and sufficiency means this derivative does not depend on $\theta$. The same algebra carries through with sums replaced by integrals: $g(t;\theta)$ becomes the marginal density of $T$, and $h(x)$ becomes the conditional density of $X$ given $T(x)$ — both well-defined on a $\mu$-conull set.

∎ — using the conditional-probability definition (§16.2) and the dominated-family assumption (Halmos & Savage, 1949).

Factorization theorem in four families. Top-left: Normal(μ, σ²) — density with g and h factors color-coded. Top-right: Bernoulli — probability factors. Bottom-left: Poisson — exp(-nλ) λ^(Σx) / Πxᵢ! decomposition. Bottom-right: Uniform(0, θ) — non-exponential family case with f(x;θ) = θ^(-n) · indicator(max ≤ θ) · indicator(min ≥ 0), showing T = max via the support indicator.

Factorization Explorer (§16.3) — interactive panel. For the $\mathcal{N}(\mu, \sigma^2\text{ known})$ family it displays the factorization $f(x;\theta) = g(T(x);\theta)\cdot h(x)$ with $T(x) = \sum_{i=1}^n x_i$, $g(T;\mu,\sigma^2) = \exp\!\left(\tfrac{\mu T}{\sigma^2} - \tfrac{n\mu^2}{2\sigma^2}\right)$, and $h(x) = (2\pi\sigma^2)^{-n/2}\exp\!\left(-\tfrac{1}{2\sigma^2}\sum x_i^2\right)$ — the canonical one-parameter exponential family, with a one-dimensional $T$ matching the parameter dimension and $g$ depending on the data only through $\sum x_i$.
Example 4 Normal(μ, σ²): two-line factorization

For $X_1, \ldots, X_n$ iid $\mathcal{N}(\mu, \sigma^2)$ with both parameters unknown, the joint density expands to

$$f(x; \mu, \sigma^2) \;=\; (2\pi\sigma^2)^{-n/2}\exp\!\left(-\tfrac{1}{2\sigma^2}\sum x_i^2 \;+\; \tfrac{\mu}{\sigma^2}\sum x_i \;-\; \tfrac{n\mu^2}{2\sigma^2}\right).$$

Setting $T_1 = \sum x_i$ and $T_2 = \sum x_i^2$, the parameter-dependent factor is $g(T_1, T_2; \mu, \sigma^2) = (2\pi\sigma^2)^{-n/2}\exp\bigl(\mu T_1/\sigma^2 - T_2/(2\sigma^2) - n\mu^2/(2\sigma^2)\bigr)$ and the data-only factor is $h(x) = 1$. Hence $T = (\sum X_i, \sum X_i^2)$ is jointly sufficient — a 2-vector matching the parameter dimension.

Example 5 Uniform(0, θ): the non-exponential-family case

For $X_1, \ldots, X_n$ iid $\mathrm{Uniform}(0, \theta)$, the joint density is

$$f(x; \theta) \;=\; \theta^{-n} \cdot \mathbf{1}\!\left\{0 \le x_{(1)} \text{ and } x_{(n)} \le \theta\right\} \;=\; \underbrace{\theta^{-n}\, \mathbf{1}\!\{x_{(n)} \le \theta\}}_{g(T;\theta)} \;\cdot\; \underbrace{\mathbf{1}\!\{x_{(1)} \ge 0\}}_{h(x)},$$

so $T(X) = X_{(n)} = \max_i X_i$ is sufficient by factorization. The factorization is clean — but the family is not an exponential family: the support $[0, \theta]$ depends on $\theta$, the support indicator cannot be absorbed into the exp-family form, and Topic 7's construction does not apply. This example will return in §16.10 as the canonical counterexample to the Pitman–Koopman–Darmois theorem: a non-exp-family with a fixed-dimensional sufficient statistic, escaping the converse only by violating the regularity condition that the support not depend on $\theta$.

Example 6 Poisson(λ): T = ΣXᵢ via factorization

For $X_1, \ldots, X_n$ iid $\mathrm{Poisson}(\lambda)$, the joint pmf is

$$f(x;\lambda) \;=\; \prod_{i=1}^n \frac{e^{-\lambda}\lambda^{x_i}}{x_i!} \;=\; \underbrace{e^{-n\lambda}\lambda^{\sum x_i}}_{g(T;\lambda)} \;\cdot\; \underbrace{\left(\prod_i x_i!\right)^{-1}}_{h(x)},$$

so $T = \sum X_i$ is sufficient. As an exponential family in canonical form $f(x;\lambda) = (x!)^{-1}\exp(x\log\lambda - \lambda)$, the natural sufficient statistic is exactly this sum.

Remark 4 Factorization is operationally easier than the conditional check

Compare Example 1 (the direct conditional check for Normal $\mu$, which required expanding the squared-deviation sum and showing the $\mu$-dependent terms cancel) with Example 4 (the factorization, which is two lines). For most families that arise in practice, factorization gives the sufficient statistic immediately by inspection: write down the joint density, group terms by $\theta$-dependence, and read off $T$. The conditional-distribution definition remains the conceptual anchor — it tells us why sufficiency means data reduction — but the factorization theorem is what we compute with.

16.4 Minimal Sufficiency

Sufficiency is closed under refinement, invertible or not: if $T$ is sufficient, so is $(T, X_1)$, the order statistic, or any other statistic from which $T$ can be recovered. We want the coarsest sufficient statistic — the one that summarizes maximally without losing parameter information.

Definition 3 Minimal sufficient statistic

A sufficient statistic $T$ is minimal sufficient for $\theta$ if, for every other sufficient statistic $T'$, there exists a measurable function $\phi$ with $T = \phi(T')$ almost surely. Equivalently, the partition of the sample space induced by $T$ is the coarsest partition under which sufficiency holds.

Theorem 3 Minimal sufficiency via likelihood-ratio constancy (Lehmann–Scheffé)

Suppose the family $\{P_\theta\}$ has densities $f(x;\theta)$ with respect to a common dominating measure. A statistic $T(X)$ is minimal sufficient for $\theta$ if and only if, for every pair of sample points $x, y$,

$$\frac{f(x; \theta)}{f(y; \theta)} \text{ does not depend on } \theta \;\Longleftrightarrow\; T(x) = T(y).$$

One direction is quick: if $T$ is sufficient and $T(x) = T(y)$, the factorization theorem gives $f(x;\theta)/f(y;\theta) = h(x)/h(y)$, which is free of $\theta$. The substantive content is the converse: a statistic whose level sets are exactly the likelihood-ratio-constancy classes is sufficient (the classes support a factorization), and it is minimal, because for any other sufficient $T'$ the factorization for $T'$ shows that $T'(x) = T'(y)$ forces $f(x;\theta)/f(y;\theta) = h'(x)/h'(y)$, constant in $\theta$, hence $T(x) = T(y)$; so $T$ is a function of $T'$. The remaining measure-theoretic details are short but technical; we refer to Casella & Berger (2002, §6.2.2).

Minimal sufficiency partition. Left: 200 iid Normal(0, 1) samples of size n=10, each represented by its minimal sufficient (X̄, S²) — a 2-dim partition matching the parameter dimension. Right: the same samples plotted via the non-minimal (X̄, X₁) — a strictly finer partition that wastes information on the identity of X₁.

Example 7 Normal(μ, σ²): (X̄, S²) is minimal sufficient

For iid Normal data with both parameters unknown, the likelihood ratio is

$$\frac{f(x;\mu,\sigma^2)}{f(y;\mu,\sigma^2)} \;=\; \exp\!\left(\tfrac{\mu}{\sigma^2}\Bigl[\textstyle\sum x_i - \sum y_i\Bigr] - \tfrac{1}{2\sigma^2}\Bigl[\sum x_i^2 - \sum y_i^2\Bigr]\right).$$

This is constant in $(\mu, \sigma^2)$ if and only if $\sum x_i = \sum y_i$ AND $\sum x_i^2 = \sum y_i^2$ — equivalently, $\bar x = \bar y$ AND $S^2(x) = S^2(y)$. By Theorem 3, $T(X) = (\bar X, S^2)$ is minimal sufficient. The dimension matches the parameter dimension — a feature shared by every regular exponential family (and forced, under regularity, by Pitman–Koopman–Darmois).
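
A small numerical illustration of the likelihood-ratio criterion (a sketch; the sample values and the grid of $(\mu, \sigma)$ pairs are arbitrary choices): reflecting a sample about its mean produces a different sample with the same $(\sum x_i, \sum x_i^2)$, and the log-likelihood ratio between the two samples is then zero at every parameter value.

```python
# Two distinct samples x and y with the same (sum, sum of squares):
# the Normal log-likelihood ratio is constant (here exactly 0) in (mu, sigma).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=8)
y = 2 * x.mean() - x          # reflection about the mean: same sum and same sum of squares

def loglik(sample, mu, sigma):
    return norm.logpdf(sample, loc=mu, scale=sigma).sum()

ratios = [loglik(x, mu, s) - loglik(y, mu, s)
          for mu in (-1.0, 0.0, 2.5) for s in (0.5, 1.0, 3.0)]
print(np.ptp(ratios))   # spread of the log-ratios over the grid: ~0
```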

Example 8 (X̄, X₁) is sufficient but NOT minimal for Normal μ

For Normal data with $\sigma^2$ known, $T = \bar X$ is minimal sufficient. The expanded statistic $T' = (\bar X, X_1)$ is also sufficient (any statistic from which $T$ can be recovered inherits sufficiency), but it is not minimal: for any pair of samples $x, y$ with $\bar x = \bar y$ but $x_1 \ne y_1$, the likelihood ratio is constant in $\mu$ (since the joint density depends on $x$ only through $\bar x$), but $T'(x) \ne T'(y)$. Theorem 3's biconditional fails, so $T'$ is not minimal. Carrying $X_1$ along is information-wasting: it refines the partition without buying any inferential content.

Remark 5 Complete sufficient ⇒ minimal sufficient (converse false)

A theorem due to Bahadur (1957) states that a complete sufficient statistic is minimal sufficient. The converse is false: the standard counterexample is $\mathrm{Uniform}(\theta, \theta+1)$, where the minimal sufficient statistic $T = (X_{(1)}, X_{(n)})$ is not complete (Example 13, §16.6). Completeness is a strictly stronger property than minimality. The Lehmann–Scheffé theorem (§16.7) requires completeness, not just minimality, because it needs to rule out the existence of multiple unbiased functions of $T$ — and minimality alone does not.

16.5 The Rao-Blackwell Theorem

Sufficiency is structural — it identifies which summaries preserve parameter information. Rao–Blackwell makes it operational: any unbiased estimator can be improved (in the MSE sense) by conditioning on a sufficient statistic. The improvement is constructive and tight: it strictly reduces variance unless the estimator was already a function of $T$.

Definition 4 Rao-Blackwellized estimator

Given an estimator $\hat\theta(X)$ and a sufficient statistic $T(X)$, the Rao-Blackwellized estimator is

$$\tilde\theta(T) \;=\; \mathbb{E}[\hat\theta(X) \mid T].$$

By sufficiency, the conditional distribution of $X \mid T$ does not depend on $\theta$, so the conditional expectation is a function of $T$ alone (not of $\theta$) — hence a statistic. The construction is explicit: integrate $\hat\theta(X)$ against the parameter-free conditional distribution $X \mid T$.

Theorem 4 Rao–Blackwell: conditioning on a sufficient T does not increase MSE

Let $T$ be sufficient for $\theta$ and let $\hat\theta$ be an unbiased estimator of $\theta$ with finite variance. Define $\tilde\theta = \mathbb{E}[\hat\theta \mid T]$. Then:

  1. $\tilde\theta$ is a statistic (a function of $T$ alone).
  2. $\tilde\theta$ is unbiased: $\mathbb{E}_\theta[\tilde\theta] = \theta$.
  3. $\mathrm{Var}_\theta(\tilde\theta) \le \mathrm{Var}_\theta(\hat\theta)$, with equality iff $\hat\theta = \tilde\theta$ almost surely (i.e., $\hat\theta$ was already a function of $T$).

Since both estimators are unbiased, MSE = variance, so $\mathrm{MSE}(\tilde\theta) \le \mathrm{MSE}(\hat\theta)$.

Proof.

Step 1 — $\tilde\theta$ is a statistic. Because $T$ is sufficient, the conditional distribution of $X$ given $T$ — and hence of any function of $X$, including $\hat\theta(X)$ — does not depend on $\theta$. So $\tilde\theta = \mathbb{E}[\hat\theta \mid T]$ is computable without knowing $\theta$: it is a function of $T$ alone.

Step 2 — $\tilde\theta$ is unbiased. By the law of iterated expectation (Topic 4),

$$\mathbb{E}_\theta[\tilde\theta] \;=\; \mathbb{E}_\theta[\mathbb{E}[\hat\theta \mid T]] \;=\; \mathbb{E}_\theta[\hat\theta] \;=\; \theta.$$

Step 3 — $\mathrm{Var}(\tilde\theta) \le \mathrm{Var}(\hat\theta)$ via Eve's law. The law of total variance (Eve's law, Topic 4) decomposes the variance of $\hat\theta$ as

$$\mathrm{Var}_\theta(\hat\theta) \;=\; \mathbb{E}_\theta[\mathrm{Var}(\hat\theta \mid T)] \;+\; \mathrm{Var}_\theta(\mathbb{E}[\hat\theta \mid T]).$$

Substituting $\tilde\theta = \mathbb{E}[\hat\theta \mid T]$,

$$\mathrm{Var}_\theta(\hat\theta) \;=\; \underbrace{\mathbb{E}_\theta[\mathrm{Var}(\hat\theta \mid T)]}_{\ge 0} \;+\; \mathrm{Var}_\theta(\tilde\theta).$$

The first term is non-negative, with equality if and only if $\mathrm{Var}(\hat\theta \mid T) = 0$ almost surely — i.e., $\hat\theta$ is constant given $T$, which means $\hat\theta$ is itself a function of $T$. Otherwise, $\mathrm{Var}(\tilde\theta) < \mathrm{Var}(\hat\theta)$ strictly.

Since both estimators are unbiased, MSE equals variance, so $\mathrm{MSE}(\tilde\theta) \le \mathrm{MSE}(\hat\theta)$ — strictly so whenever $\hat\theta$ was not already a function of $T$.

∎ — using sufficiency of $T$ (Def 1), iterated expectation, and Eve's law (Topic 4).

Rao-Blackwell MSE drop. Left: Bernoulli p at n=20, histogram of crude p̂₀ = X₁ (0/1-valued, very wide) vs Rao-Blackwellized p̃ = X̄ (concentrated around 0.3) across M = 10000 MC samples, with variance ratio annotation ~20×. Center: Poisson λ=2 at n=10, crude 𝟙(X₁=0) vs RB'd (1−1/n)ˆ(ΣXᵢ) — discrete 0/1 vs smooth RB'd. Right: MSE bars comparing crude vs RB'd across four scenarios, demonstrating "RB never hurts".

Rao-Blackwell Improver (§16.5) — interactive panel. The most visceral configuration is the Bernoulli one: crude $\hat{p}_0 = X_1 \in \{0,1\}$ (huge variance) versus the Rao-Blackwellized $\tilde{p} = \bar X$, which concentrates around the true $p = 0.3$; the empirical MSEs differ by a factor of roughly $n$ (about $20\times$ at $n = 20$), matching $\mathrm{Var}(X_1)/\mathrm{Var}(\bar X) = n$.
Example 9 Bernoulli p: X₁ Rao-Blackwellized by ΣXᵢ gives X̄

For iid $\mathrm{Bernoulli}(p)$ with $T = \sum X_i$ sufficient, the crude estimator $\hat p_0 = X_1$ is unbiased ($\mathbb{E}[X_1] = p$) but takes values in $\{0, 1\}$ — its variance is $p(1-p)$, the variance of a single observation, and it does not shrink with $n$. Conditioning on $T$:

$$\tilde p \;=\; \mathbb{E}[X_1 \mid T = t] \;=\; \frac{t}{n} \;=\; \bar X,$$

since by exchangeability all $X_i$ have the same conditional distribution given $T$, and their conditional expectations sum to $t$. The Rao-Blackwellized estimator is the sample mean, with $\mathrm{Var}(\bar X) = p(1-p)/n$ — a factor $n$ smaller than $\mathrm{Var}(X_1)$. The RaoBlackwellImprover above visualizes this dramatic shrinkage at $n = 20$.
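
A Monte-Carlo sanity check of this variance ratio, as a sketch (the choices $p = 0.3$, $n = 20$, and the replication count are assumptions for illustration):

```python
# Compare Var(X_1) and Var(X_bar) for Bernoulli(p) data: both estimators are
# unbiased for p, and the Rao-Blackwellized one is ~n times tighter.
import numpy as np

rng = np.random.default_rng(1)
p, n, M = 0.3, 20, 100_000
X = rng.binomial(1, p, size=(M, n))

crude = X[:, 0]          # p_hat_0 = X_1, unbiased but 0/1-valued
rb = X.mean(axis=1)      # E[X_1 | sum X_i] = X_bar

print("means:", crude.mean(), rb.mean())                  # both ~p
print("variance ratio:", crude.var() / rb.var())          # ~n = 20
```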

Example 10 Poisson: UMVUE of P(X = 0) = exp(−λ) via Rao-Blackwellization

Suppose we want to estimate $g(\lambda) = P_\lambda(X = 0) = e^{-\lambda}$ from iid $\mathrm{Poisson}(\lambda)$ data. The crude estimator $\hat g_0 = \mathbf{1}\{X_1 = 0\}$ is unbiased ($\mathbb{E}[\mathbf{1}\{X_1 = 0\}] = e^{-\lambda}$), but it is a 0/1 indicator. Conditioning on $T = \sum X_i$ (which is $\mathrm{Poisson}(n\lambda)$):

$$\tilde g \;=\; \mathbb{E}[\mathbf{1}\{X_1 = 0\} \mid T = t] \;=\; P(X_1 = 0 \mid \textstyle\sum X_i = t).$$

The conditional distribution of $(X_1, \ldots, X_n)$ given $\sum X_i = t$ is $\mathrm{Multinomial}(t; 1/n, \ldots, 1/n)$. So $X_1 \mid T = t \sim \mathrm{Binomial}(t, 1/n)$, and $P(X_1 = 0 \mid T = t) = (1 - 1/n)^t$. The Rao-Blackwellized estimator is

$$\tilde g \;=\; \left(1 - \tfrac{1}{n}\right)^{\sum X_i},$$

which is the UMVUE of $e^{-\lambda}$ (since $\sum X_i$ is complete sufficient — see §16.6 Lemma 1 — and Lehmann–Scheffé applies). A 0/1 estimator promoted to a smooth, sample-size-aware estimator with provably minimum variance among unbiased estimators.
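
A quick simulation of the improvement, offered as a sketch (the values $\lambda = 2$, $n = 10$, and the replication count are arbitrary):

```python
# Crude indicator 1{X_1 = 0} vs its Rao-Blackwellization (1 - 1/n)^{sum X_i}:
# both are unbiased for exp(-lambda); the second has far smaller MSE.
import numpy as np

rng = np.random.default_rng(2)
lam, n, M = 2.0, 10, 200_000
X = rng.poisson(lam, size=(M, n))

crude = (X[:, 0] == 0).astype(float)
rb = (1 - 1 / n) ** X.sum(axis=1)
target = np.exp(-lam)

for name, est in [("crude", crude), ("RB'd ", rb)]:
    print(name, "bias:", est.mean() - target, " MSE:", ((est - target) ** 2).mean())
```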

Example 11 Normal variance Rao-Blackwellization: preview of §16.7 Example 14

Suppose we want to estimate $\sigma^2$ for iid $\mathcal{N}(\mu, \sigma^2)$ with $\mu$ known. The crude estimator $\hat\sigma^2_0 = (X_1 - \mu)^2$ is unbiased ($\mathbb{E}[(X_1 - \mu)^2] = \sigma^2$) but uses only one observation. Conditioning on the sufficient statistic $T = \sum (X_i - \mu)^2 \sim \sigma^2 \chi^2_n$:

$$\tilde\sigma^2 \;=\; \mathbb{E}[(X_1 - \mu)^2 \mid T = t] \;=\; \frac{t}{n} \;=\; \frac{1}{n}\sum (X_i - \mu)^2,$$

by exchangeability of the $(X_i - \mu)^2$ given $T$. The variance ratio is $n$ — a factor-$n$ reduction. We finish this calculation in §16.7 Example 14, where Lehmann–Scheffé promotes it to the UMVUE.

Remark 6 Rao-Blackwell is constructive, not just existence

Rao–Blackwell does more than prove the existence of a better estimator — it tells you exactly what it is: the conditional expectation $\mathbb{E}[\hat\theta \mid T]$, computable by integrating against the parameter-free conditional distribution $X \mid T$. For the canonical families (Bernoulli, Poisson, Normal, Exponential, Gamma) the conditional distribution has a closed form, and the Rao-Blackwellized estimator is a direct calculation. This constructive aspect is what makes the theorem operational: Rao (1945) and Blackwell (1947) independently realized that every unbiased estimator can be mechanically upgraded by this recipe, without needing to know whether a UMVUE even exists.

16.6 Completeness

Rao–Blackwell guarantees that conditioning on a sufficient statistic does not hurt. But it does not guarantee uniqueness: two different unbiased estimators might Rao-Blackwellize to two different unbiased functions of $T$. Completeness is the structural property that closes this gap — it says no two distinct unbiased functions of $T$ can have the same expectation at every $\theta$, so the Rao-Blackwellization map is injective on unbiased functions.

Definition 5 Complete family; bounded completeness

A family $\{P_\theta : \theta \in \Theta\}$ of distributions of a statistic $T$ is complete if, for every measurable function $g$ with $\mathbb{E}_\theta[|g(T)|] < \infty$ for all $\theta$,

$$\mathbb{E}_\theta[g(T)] \;=\; 0 \text{ for all } \theta \in \Theta \;\Longrightarrow\; g(T) \;=\; 0 \text{ almost surely under every } P_\theta.$$

A statistic $T$ is complete sufficient for $\theta$ if it is sufficient and the family of its distributions $\{P_\theta^T : \theta \in \Theta\}$ is complete. The statistic is boundedly complete if the implication above holds for every bounded measurable $g$. Bounded completeness is strictly weaker than full completeness; §16.7 Remark 7 discusses how far it carries the Lehmann–Scheffé argument.

Lemma 1 Completeness of exponential families in the natural parameter

Let $T$ be the canonical sufficient statistic of an exponential family $f(x; \eta) = h(x)\exp(\eta^\top T(x) - A(\eta))$ with natural parameter $\eta$ ranging over an open subset $H \subseteq \mathbb{R}^k$. Then $T$ is complete for $\eta$.

The argument is a one-liner: $\mathbb{E}_\eta[g(T)] = \int g(t) \exp(\eta^\top t - A(\eta)) \, d\nu(t) = 0$ for every $\eta$ in the open set $H$ implies, by the uniqueness of the Laplace transform of the signed measure $g(t) \, d\nu(t)$ on $\mathbb{R}^k$, that $g(t) = 0$ almost everywhere under the dominating measure $\nu$. The technical details (open natural parameter space, analyticity of $A(\eta)$, appropriate moment conditions) are handled rigorously in Brown (1986, §2.2). For our purposes, the operational consequence is clean: every full-rank exponential family with an open natural parameter space has a complete sufficient statistic, and Lehmann–Scheffé will deliver the UMVUE in every such family.
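
In the simplest discrete case the lemma can be made concrete as finite linear algebra, as in this hedged sketch (the choices $n = 6$ and the grid of $p$ values are assumptions): for $\mathrm{Binomial}(n, p)$, the map $g \mapsto (\mathbb{E}_p[g(T)])_p$ over a grid of $p$ values is a matrix whose columns (the pmf values as polynomials in $p$) are linearly independent, so its null space is trivial and only $g \equiv 0$ has identically zero expectation.

```python
# Completeness of Binomial(n, p) rendered as a rank computation:
# A[i, t] = P_{p_i}(T = t), so E_p[g(T)] = A @ g; full column rank => only g = 0.
import numpy as np
from scipy.stats import binom

n = 6
p_grid = np.linspace(0.05, 0.95, 25)
A = np.array([[binom.pmf(t, n, p) for t in range(n + 1)] for p in p_grid])
print("rank:", np.linalg.matrix_rank(A), "of", n + 1)   # full rank n + 1
```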

Completeness demonstration. Left: Bernoulli (complete) — four candidate test functions g₁, g₂, g₃, g₄ of T = ΣXᵢ plotted as 𝔼_p[g(T)] vs p; only the constant-zero function gives the flat zero curve. Right: Uniform(θ, θ+1) (incomplete) — the range-based witness g(T) = (X_(n) − X_(1)) − (n−1)/(n+1) plotted as 𝔼_θ[g(T)] vs θ — the curve is identically zero for all θ (since the range is ancillary for the location parameter θ), with a non-witness contrast g'(T) = X_(1) − 1/(n+1) plotted as a linear-in-θ reference line.

Completeness Probe (§16.6) — interactive panel. For a complete family, only the constant-zero function $g_0 \equiv 0$ produces a flat zero curve in $\mathbb{E}_\theta[g(T)]$ vs. $\theta$; any other test function traces a non-trivial $\theta$-dependence. For the finite-support Bernoulli family, completeness reduces to a finite linear-algebra argument.
Example 12 Bernoulli, Binomial, Poisson, Normal, Gamma — all complete

By Lemma 1, every exponential family in its natural parameter has a complete sufficient statistic. Concretely:

  • Bernoulli($p$): $T = \sum X_i \sim \mathrm{Binomial}(n, p)$ is complete for $p$.
  • Binomial($n, p$) with $n$ known: $T = X$ itself is complete for $p$.
  • Poisson($\lambda$): $T = \sum X_i \sim \mathrm{Poisson}(n\lambda)$ is complete for $\lambda$.
  • Normal($\mu, \sigma^2$) with $\sigma^2$ known: $T = \bar X$ is complete for $\mu$.
  • Normal($\mu, \sigma^2$) jointly: $T = (\bar X, S^2)$ is complete for $(\mu, \sigma^2)$.
  • Gamma($\alpha, \beta$) with $\alpha$ known: $T = \sum X_i$ is complete for $\beta$.
  • Exponential($\lambda$): $T = \sum X_i$ is complete for $\lambda$.

Each delivers a UMVUE via Lehmann–Scheffé (§16.7).

Example 13 Uniform(θ, θ+1) is NOT complete — the centered-range witness

Consider $X_1, \ldots, X_n$ iid $\mathrm{Uniform}(\theta, \theta + 1)$ — a unit-width interval shifted by $\theta$. The minimal sufficient statistic is the pair $T = (X_{(1)}, X_{(n)})$ — two-dimensional for a one-dimensional parameter, an immediate structural warning sign.

Why incompleteness? Because $\theta$ is a pure location parameter, the range $R = X_{(n)} - X_{(1)}$ is ancillary for $\theta$: its distribution depends only on $n$, not on $\theta$. (Specifically, $R$ has mean $(n-1)/(n+1)$ and a $\mathrm{Beta}(n-1, 2)$ density, both free of $\theta$.) Therefore the centered range

$$g(T) \;=\; R \;-\; \frac{n-1}{n+1} \;=\; (X_{(n)} - X_{(1)}) \;-\; \frac{n-1}{n+1}$$

satisfies $\mathbb{E}_\theta[g(T)] = 0$ for every $\theta \in \mathbb{R}$ — yet $g(T)$ is not identically zero. This is the definition of incompleteness: a non-trivial function of $T$ with zero expectation under every $P_\theta$.

The CompletenessProbe above lets you visualize this directly: choose Uniform($\theta, \theta+1$), evaluate the centered range across the $\theta$-grid, and see the empirical $\mathbb{E}_\theta[g(T)] \approx 0$ curve. By contrast, on the same grid a non-witness like $g'(T) = X_{(1)} - 1/(n+1)$ traces a non-zero linear function of $\theta$ — illustrating the difference between an ancillary-derived witness (which makes incompleteness visible) and a generic test function (which does not).
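
The same check can be run outside the interactive panel; here is a minimal simulation sketch (the grid of $\theta$ values, $n = 8$, and the replication count are assumptions):

```python
# Incompleteness witness for Uniform(theta, theta+1): the centered range has
# expectation ~0 at every theta, yet the statistic itself is not identically zero.
import numpy as np

rng = np.random.default_rng(3)
n, M = 8, 100_000
for theta in (-2.0, 0.0, 1.5, 10.0):
    X = rng.uniform(theta, theta + 1, size=(M, n))
    g = (X.max(axis=1) - X.min(axis=1)) - (n - 1) / (n + 1)
    print(f"theta={theta:5.1f}  E[g(T)] ~ {g.mean():+.4f}   (sd of g: {g.std():.3f}, so g is not 0)")
```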

Remark 7 Bounded completeness suffices for Lehmann–Scheffé

The Lehmann–Scheffé theorem (§16.7) is usually stated and proved with full completeness. A closer look at the proof shows that the only test functions the argument ever feeds to completeness are differences $h_1(T) - h_2(T)$ between unbiased estimators of the same quantity, so when those differences are bounded (for instance, when estimating a bounded parametric function with bounded estimators), bounded completeness already delivers the uniqueness step. Bounded completeness is strictly weaker than full completeness, and a small number of important families (notably some non-regular location-scale families) are boundedly complete without being complete; it is also exactly the hypothesis that the proof of Basu's theorem (§16.9) uses, since its test functions are differences of probabilities. For the canonical exponential families in this topic, the distinction does not matter — all are fully complete by Lemma 1.

16.7 The Lehmann-Scheffé Theorem

Sufficiency reduces the data without loss; completeness reduces unbiased functions of the data without redundancy. Lehmann–Scheffé combines them into a uniqueness-and-optimality statement: the unique unbiased function of a complete sufficient statistic is the uniformly minimum-variance unbiased estimator.

Definition 6 UMVUE

An estimator $\tilde\theta$ of a parameter (or parametric function) $g(\theta)$ is the uniformly minimum-variance unbiased estimator (UMVUE) if:

  1. $\tilde\theta$ is unbiased: $\mathbb{E}_\theta[\tilde\theta] = g(\theta)$ for every $\theta \in \Theta$.
  2. For every other unbiased estimator $\hat\theta$ of $g(\theta)$, $\mathrm{Var}_\theta(\tilde\theta) \le \mathrm{Var}_\theta(\hat\theta)$ for every $\theta \in \Theta$.

The "uniformly" applies to the parameter — the variance dominance must hold simultaneously at every $\theta$, not just on average or at a particular value. UMVUE existence is non-trivial: there are families with no UMVUE at all (Remark 8). When a UMVUE exists, it is essentially unique (Theorem 6).

Theorem 5 Lehmann–Scheffé: unbiased function of complete sufficient ⇒ UMVUE

Let $T$ be a complete sufficient statistic for $\theta$, and let $\tilde\theta = h(T)$ be an unbiased estimator of $g(\theta)$ that is a function of $T$. Then $\tilde\theta$ is the unique UMVUE of $g(\theta)$.

Proof.

Existence (any unbiased estimator can be improved to a function of $T$). Let $\hat\theta^*$ be any unbiased estimator of $g(\theta)$. Define $\hat\theta^{**} = \mathbb{E}[\hat\theta^* \mid T]$. By the Rao–Blackwell theorem (Theorem 4), $\hat\theta^{**}$ is a function of $T$, is unbiased for $g(\theta)$, and has variance $\mathrm{Var}(\hat\theta^{**}) \le \mathrm{Var}(\hat\theta^*)$.

Uniqueness (any two unbiased functions of $T$ agree almost surely). Suppose $\tilde\theta_1 = h_1(T)$ and $\tilde\theta_2 = h_2(T)$ are both unbiased for $g(\theta)$. Then for every $\theta \in \Theta$,

$$\mathbb{E}_\theta[h_1(T) - h_2(T)] \;=\; g(\theta) - g(\theta) \;=\; 0.$$

By completeness of $T$ (Definition 5), this forces $h_1(T) - h_2(T) = 0$ almost surely under every $P_\theta$ — that is, $h_1(T) = h_2(T)$ a.s.

Optimality (Rao-Blackwellization always lands on $\tilde\theta$). For any unbiased $\hat\theta^*$, the Rao-Blackwellization $\hat\theta^{**} = \mathbb{E}[\hat\theta^* \mid T]$ is an unbiased function of $T$. By uniqueness, $\hat\theta^{**} = \tilde\theta$ almost surely. Therefore

$$\mathrm{Var}(\tilde\theta) \;=\; \mathrm{Var}(\hat\theta^{**}) \;\le\; \mathrm{Var}(\hat\theta^*),$$

with the inequality from Rao-Blackwell. This holds for every unbiased $\hat\theta^*$, so $\tilde\theta$ is the UMVUE — uniformly across $\theta$ and uniquely up to a.s. equality.

∎ — by Rao-Blackwell (Theorem 4) and completeness (Definition 5).

Theorem 6 UMVUE uniqueness almost surely

If a UMVUE of $g(\theta)$ exists, then it is unique up to almost-sure equality under every $P_\theta$ — that is, any two UMVUE candidates $\tilde\theta_1$ and $\tilde\theta_2$ agree on a set of $P_\theta$-measure 1 for every $\theta$.

When a complete sufficient $T$ exists, this is a corollary of Theorem 5: any UMVUE must be a function of $T$ (otherwise Rao-Blackwellization strictly improves it), and any two unbiased functions of $T$ agree a.s. by completeness. The uniqueness in fact holds even without completeness: if $\tilde\theta_1$ and $\tilde\theta_2$ are both UMVUE, their average is unbiased with variance no larger than the common minimum, and the equality case of the Cauchy–Schwarz inequality then forces $\tilde\theta_1 = \tilde\theta_2$ almost surely.

Lehmann-Scheffé construction. Left: the Lehmann-Scheffé "diamond" — θ̂* (any unbiased) → θ̃ = 𝔼[θ̂* | T] (Rao-Blackwellized, function of T) → unique UMVUE (by completeness). Right: for Normal σ² known μ, histograms of crude σ̂² = (X₁ − μ)², RB'd σ̃² = Σ(Xᵢ − μ)²/n, with the variance-ratio = n annotation.

Example 14 Normal σ² with known μ: UMVUE = Σ(Xᵢ − μ)²/n (worked in full)

Let $X_1, \ldots, X_n$ be iid $\mathcal{N}(\mu, \sigma^2)$ with $\mu$ known and $\sigma^2$ unknown. The complete sufficient statistic is $T = \sum (X_i - \mu)^2$, which has distribution $\sigma^2 \chi^2_n$. The unbiased function of $T$ for $\sigma^2$ is

$$\hat\sigma^2_{\text{UMVUE}} \;=\; \frac{T}{n} \;=\; \frac{1}{n}\sum_{i=1}^n (X_i - \mu)^2,$$

since $\mathbb{E}[T] = n\sigma^2$. By Lehmann–Scheffé (Theorem 5), this is the UMVUE of $\sigma^2$ in this model. Variance: $\mathrm{Var}(\hat\sigma^2_{\text{UMVUE}}) = \mathrm{Var}(T)/n^2 = (2n\sigma^4)/n^2 = 2\sigma^4/n$ — exactly the CRLB (Topic 13 §13.7). The UMVUE is efficient in this case.

Note the crucial assumption that $\mu$ is known. With $\mu$ unknown the model is exactly the same family but with two parameters, and the complete sufficient statistic becomes $(\bar X, \sum(X_i - \bar X)^2)$ — leading to the famous Bessel correction; see Example 19 below.
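
A short simulation check of the unbiasedness and the CRLB-attaining variance, as a sketch (the values $\mu = 1$, $\sigma^2 = 4$, $n = 15$, and the replication count are assumptions):

```python
# UMVUE of sigma^2 with mu known: sum((X_i - mu)^2)/n is unbiased and its
# Monte-Carlo variance matches the CRLB 2*sigma^4/n.
import numpy as np

rng = np.random.default_rng(4)
mu, sigma2, n, M = 1.0, 4.0, 15, 200_000
X = rng.normal(mu, np.sqrt(sigma2), size=(M, n))
umvue = ((X - mu) ** 2).mean(axis=1)

print("mean:", umvue.mean(), " (target", sigma2, ")")
print("variance:", umvue.var(), " CRLB:", 2 * sigma2 ** 2 / n)
```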

Example 15 Gamma scale β with known α: UMVUE = (nα − 1)/ΣXᵢ

Let $X_1, \ldots, X_n$ be iid $\mathrm{Gamma}(\alpha, \beta)$ with shape $\alpha$ known and rate $\beta$ unknown — an exponential family in $\beta$. The complete sufficient statistic is $T = \sum X_i \sim \mathrm{Gamma}(n\alpha, \beta)$, with $\mathbb{E}[T^{-1}] = \beta / (n\alpha - 1)$ for $n\alpha > 1$ (a standard Gamma reciprocal-moment identity). Hence the unbiased function of $T$ for $\beta$ is

$$\hat\beta_{\text{UMVUE}} \;=\; \frac{n\alpha - 1}{\sum X_i}.$$

By Lehmann–Scheffé this is the UMVUE. Compare with the MLE / MoM: $\hat\beta_{\text{MLE}} = \hat\beta_{\text{MoM}} = \alpha / \bar X = n\alpha / \sum X_i$. The two estimators differ by exactly the bias-correction factor $(n\alpha - 1)/(n\alpha)$ — a small but non-trivial gap that the §16.11 UMVUEComparator visualizes. This is the example that fulfills the promise made in method-of-moments.mdx:1062.
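
The bias gap is easy to see numerically; here is a hedged sketch (the values $\alpha = 2$, $\beta = 1.5$, $n = 10$, and the replication count are assumptions, and note that numpy parameterizes the Gamma by scale $= 1/\beta$):

```python
# UMVUE (n*alpha - 1)/sum(X) vs MLE n*alpha/sum(X) for the Gamma rate beta:
# the UMVUE is exactly unbiased, the MLE carries bias ~ beta/(n*alpha - 1).
import numpy as np

rng = np.random.default_rng(5)
alpha, beta, n, M = 2.0, 1.5, 10, 200_000
X = rng.gamma(shape=alpha, scale=1 / beta, size=(M, n))   # scale = 1/rate
S = X.sum(axis=1)

umvue = (n * alpha - 1) / S
mle = n * alpha / S
print("UMVUE bias:", umvue.mean() - beta)   # ~0
print("MLE   bias:", mle.mean() - beta)     # ~ beta/(n*alpha - 1) = 0.079
```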

Remark 8 UMVUE may not exist; when it exists, need not attain the CRLB

Two cautions worth flagging. First, a UMVUE need not exist. In families without a complete sufficient statistic, the set of unbiased estimators may have no member that uniformly dominates the others — the Lehmann–Scheffé existence proof relies critically on completeness. Second, even when the UMVUE exists, it does not always attain the Cramér–Rao lower bound. The CRLB is a bound on variance, derived from the information inequality; the UMVUE is the best unbiased estimator. The UMVUE attains the CRLB precisely in full-rank exponential families with the parameter being the natural one — exactly the same condition under which MLE = UMVUE = MoM coincide (§16.11 Theorem 9). Outside that boundary, the UMVUE has variance strictly above the CRLB, and the gap measures the unattainability of efficient estimation in that family.

16.8 UMVUE Worked Examples

The Lehmann–Scheffé theorem turns sufficiency + completeness into an algorithm: identify the complete sufficient statistic, find an unbiased function of it, and you have the UMVUE. The five canonical examples below demonstrate the algorithm and lay the ground for §16.11’s triple-estimator comparison.

| Family | Parameter | UMVUE | MLE | MoM | Relationship |
| --- | --- | --- | --- | --- | --- |
| Bernoulli($p$) | $p$ | $\bar X$ | $\bar X$ | $\bar X$ | triple coincidence |
| Poisson($\lambda$) | $\lambda$ | $\bar X$ | $\bar X$ | $\bar X$ | triple coincidence |
| Normal($\mu, \sigma^2$) | $\mu$ ($\sigma^2$ known) | $\bar X$ | $\bar X$ | $\bar X$ | triple coincidence |
| Normal($\mu, \sigma^2$) | $\sigma^2$ ($\mu$ unknown) | $S^2_{n-1}$ | $S^2_n$ | $S^2_n$ | UMVUE $\ne$ MLE = MoM |
| Exponential($\lambda$) | $\lambda$ | $(n-1)/\sum X$ | $n/\sum X$ | $n/\sum X$ | UMVUE $\ne$ MLE = MoM |
| Gamma($\alpha, \beta$), $\alpha$ known | $\beta$ | $(n\alpha-1)/\sum X$ | $\alpha/\bar X$ | $\alpha/\bar X$ | UMVUE $\ne$ MLE = MoM |
Example 16 Bernoulli p: UMVUE = MLE = X̄

$T = \sum X_i$ is complete sufficient for $p$, and $\mathbb{E}[T/n] = p$, so $\hat p_{\text{UMVUE}} = \bar X$. The MLE is also $\bar X$ (Topic 14, Example 14.1), and so is the MoM (Topic 15, Example 15.3). All three coincide — the Bernoulli is an exponential family in its natural parameter, so Theorem 9 (§16.11) applies. Variance: $\mathrm{Var}(\bar X) = p(1-p)/n$, which equals the CRLB $1/I(p) = p(1-p)/n$ — efficient.

Example 17 Poisson λ: UMVUE = MLE = X̄

$T = \sum X_i \sim \mathrm{Poisson}(n\lambda)$ is complete sufficient for $\lambda$, and $\mathbb{E}[T/n] = \lambda$, so $\hat\lambda_{\text{UMVUE}} = \bar X$. The MLE (Topic 14, Example 14.4) and MoM (Topic 15, §15.2) both also equal $\bar X$. Triple coincidence again, and again CRLB-attaining: $\mathrm{Var}(\bar X) = \lambda/n = 1/I(\lambda)$.

Example 18 Normal μ (σ² known): UMVUE = MLE = MoM = X̄

With $\sigma^2$ known, $T = \bar X \sim \mathcal{N}(\mu, \sigma^2/n)$ is complete sufficient and unbiased for $\mu$, hence the UMVUE. The MLE is also $\bar X$ (Topic 14, Example 14.2 with $\sigma^2$ fixed), and the MoM is too (Topic 15, Example 15.1). All three coincide; the variance $\sigma^2/n$ equals the CRLB.

Example 19 Normal σ² with μ unknown: UMVUE = S²ₙ₋₁, MLE = MoM = S²ₙ — the featured case

With $\mu$ also unknown, the complete sufficient statistic is the pair $T = (\bar X, \sum (X_i - \bar X)^2)$. The unbiased function of $T$ for $\sigma^2$ is the Bessel-corrected sample variance,

$$\hat\sigma^2_{\text{UMVUE}} \;=\; S^2_{n-1} \;=\; \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2,$$

since $\sum (X_i - \bar X)^2 \sim \sigma^2 \chi^2_{n-1}$ and $\mathbb{E}[\chi^2_{n-1}] = n - 1$. The MLE and MoM, by contrast, both land at the un-corrected

$$\hat\sigma^2_{\text{MLE}} \;=\; \hat\sigma^2_{\text{MoM}} \;=\; S^2_n \;=\; \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2,$$

with bias $\mathbb{E}[S^2_n] - \sigma^2 = -\sigma^2 / n$. The bias is small but does not disappear: at $n = 30$ and $\sigma^2 = 4$, the MLE/MoM bias is $-0.133$. The featured §16.11 comparison runs all three estimators in parallel and shows the UMVUE on top by a small but persistent margin in MSE for moderate $n$.
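
The bias and MSE comparison can be reproduced with a few lines of simulation, as a sketch (the values $\sigma^2 = 4$, $n = 30$, and the replication count are assumptions):

```python
# Bessel-corrected S^2_{n-1} (UMVUE) vs S^2_n (MLE = MoM) at n = 30, sigma^2 = 4:
# the MLE/MoM bias sits near -sigma^2/n = -0.133; the UMVUE bias is ~0.
import numpy as np

rng = np.random.default_rng(6)
mu, sigma2, n, M = 0.0, 4.0, 30, 200_000
X = rng.normal(mu, np.sqrt(sigma2), size=(M, n))

s2_umvue = X.var(axis=1, ddof=1)   # divide by n - 1
s2_mle = X.var(axis=1, ddof=0)     # divide by n

for name, est in [("UMVUE (n-1)", s2_umvue), ("MLE/MoM (n)", s2_mle)]:
    print(name, "bias:", est.mean() - sigma2, " MSE:", ((est - sigma2) ** 2).mean())
```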

Example 20 UMVUE of P(X > c) for Normal data via Rao-Blackwellization of an indicator

Suppose we want to estimate $g(\mu, \sigma^2) = P_{\mu,\sigma^2}(X > c)$ for some fixed threshold $c$. The crude unbiased estimator is the indicator $\hat g_0 = \mathbf{1}\{X_1 > c\}$. Conditioning on the complete sufficient $(\bar X, S^2)$, the UMVUE turns out to be a regularized incomplete-beta tail:

$$\hat g_{\text{UMVUE}} \;=\; I_{y}\!\left(\frac{n-2}{2}, \frac{n-2}{2}\right), \quad \text{where } y = \frac{1}{2}\!\left(1 - \frac{c - \bar X}{S}\sqrt{\frac{n}{n - 2 + (c-\bar X)^2/S^2}}\right),$$

with $I_y$ the regularized incomplete beta function. This is the formula used by umvueNormalTailProb in estimation.ts. The full derivation (in Lehmann & Casella, 1998, §2.4) integrates the indicator against the conditional distribution of $X_1$ given $(\bar X, S^2)$, which has a scaled-Beta form. The point: even for non-trivial parametric functions, Rao-Blackwellization combined with completeness gives the UMVUE in closed form when the sufficient statistic has tractable conditional distributions.

UMVUE Normal variance — featured figure for Example 19. Left: sampling distributions for n=10, σ²=4: UMVUE S²_(n-1), MLE S²_n, MoM S²_n plotted as histograms over M=10000 replications. Center: bias bars — UMVUE ≈ 0, MLE = MoM ≈ -σ²/n = -0.4. Right: MSE bars — UMVUE has slightly higher variance but zero bias; MLE/MoM slightly lower variance but biased; UMVUE wins for moderate n.

Remark 9 Bias-efficiency pattern across UMVUE / MLE / MoM

Reading the table at the top of §16.8: in every exponential family in its natural parameter, UMVUE = MLE = MoM = $\bar X$ (or its relevant linear transform). Outside the natural parameterization, UMVUE and MLE diverge in a structured way: UMVUE prioritizes unbiasedness ($S^2_{n-1}$ for Normal $\sigma^2$); MLE and MoM instead land on a different summary ($S^2_n$, the maximum-likelihood value) and pay a small bias cost. For finite $n$, the comparison is family-specific: sometimes UMVUE wins on MSE (Normal $\sigma^2$ at moderate $n$), sometimes MLE wins (e.g., Gamma scale in some regimes, where a slight bias traded for lower variance can yield smaller MSE at very small $n$). Asymptotically, all three are consistent and asymptotically Normal at the same $n^{-1/2}$ rate; the bias gap shrinks as $1/n$, so it disappears in the asymptotic limit.

16.9 Ancillary Statistics and Basu’s Theorem

Sufficient statistics carry all the parameter information; ancillary statistics carry none of it. Basu’s theorem says that when the first is complete sufficient, the two are independent — a startlingly clean structural result with broad consequences in Track 5.

Definition 7 Ancillary statistic; first-order ancillary

A statistic $A(X)$ is ancillary for $\theta$ if its distribution under $P_\theta$ does not depend on $\theta$ — that is, $P_\theta(A \in B)$ is the same function of $B$ for every $\theta \in \Theta$.

A statistic is first-order ancillary for $\theta$ if its mean does not depend on $\theta$, even if higher moments might. First-order ancillarity is strictly weaker than full ancillarity. For most of our purposes — and for Basu's theorem — we use full ancillarity.

Examples of ancillary statistics: the sample range $R = X_{(n)} - X_{(1)}$ for a pure-location family; the studentized residual $(X_1 - \bar X)/S$ in a Normal family, whose distribution is free of both $\mu$ and $\sigma$; any function of standardized residuals after removing location and scale.

Theorem 7 Basu: complete sufficient ⊥⊥ ancillary

If $T$ is a complete sufficient statistic for $\theta$ and $A$ is an ancillary statistic for $\theta$, then $T$ and $A$ are independent under every $P_\theta$.

Proof.

Fix any measurable event $B$ in the $\sigma$-algebra generated by $A$. Define

$$g(T) \;=\; P_\theta(A \in B \mid T) \;-\; P_\theta(A \in B).$$

The conditional probability does not depend on $\theta$. Because $T$ is sufficient, the conditional distribution of $X$ (and hence of any function of $X$, including $A$) given $T$ does not depend on $\theta$. So $P_\theta(A \in B \mid T)$ is the same function of $T$ for every $\theta$ — it is a statistic.

The marginal probability does not depend on $\theta$. Because $A$ is ancillary for $\theta$, its distribution under $P_\theta$ does not depend on $\theta$. So $P_\theta(A \in B)$ is a constant in $\theta$.

Hence $g(T)$ is a well-defined statistic (a function of $T$ minus a constant). Taking expectation under $P_\theta$:

$$\mathbb{E}_\theta[g(T)] \;=\; \mathbb{E}_\theta[P_\theta(A \in B \mid T)] \;-\; P_\theta(A \in B) \;=\; P_\theta(A \in B) \;-\; P_\theta(A \in B) \;=\; 0,$$

where the first equality uses iterated expectation. This holds for every $\theta \in \Theta$. By completeness of $T$, the implication $\mathbb{E}_\theta[g(T)] = 0 \,\forall \theta \Rightarrow g(T) = 0$ a.s. forces

$$P_\theta(A \in B \mid T) \;=\; P_\theta(A \in B) \quad \text{a.s.}$$

This is the definition of independence of $A$ and $T$ under $P_\theta$, for every measurable $B$ — hence $A \perp\!\!\!\perp T$ under $P_\theta$, for every $\theta$.

∎ — by sufficiency of $T$, ancillarity of $A$, and completeness (Definition 5).

Basu independence demo. Left: scatter of (X̄, S²) across M=10000 Normal(μ=0, σ²=1) samples at n=30 — visually uncorrelated cloud, sample correlation ≈ 0, annotation "Basu: complete sufficient X̄ for μ at fixed σ² is independent of ancillary S²". Center: scatter of (X_(1), X̄) for Exp-shift model X = Exp(1) + μ — visible positive correlation ≈ 0.18, annotation "Basu does NOT apply — X̄ is not ancillary for μ". Right: t-statistic histogram t = √n(X̄ − μ)/S with t_(n-1) density overlay — independence from Basu making the t ratio work.

Basu Independence (§16.9) — interactive panel. If $T$ is complete sufficient for $\theta$ and $A$ is ancillary for $\theta$, then $T \perp\!\!\!\perp A$ under every $P_\theta$; visualized as a decorrelated Monte-Carlo scatter (sample $\rho \approx 0$). In the Normal configuration, $\bar X$ is complete sufficient for $\mu$ ($\sigma^2$ fixed) and $S^2$ is ancillary for $\mu$, so they are independent — the foundation of Student's t: $t = \sqrt n(\bar X - \mu)/S$ has a well-defined $t_{n-1}$ distribution because its numerator and denominator are independent.
Example 21 Normal: X̄ ⊥⊥ S² — the t-distribution foundation

For $X_1, \ldots, X_n$ iid $\mathcal{N}(\mu, \sigma^2)$ with $\sigma^2$ fixed and known, $\bar X$ is complete sufficient for $\mu$. The sample variance $S^2 = (n-1)^{-1}\sum (X_i - \bar X)^2$ has distribution $\sigma^2 \chi^2_{n-1}/(n-1)$, which depends on $\sigma^2$ but not on $\mu$ — so $S^2$ is ancillary for $\mu$ at fixed $\sigma^2$. By Basu's theorem,

$$\bar X \;\perp\!\!\!\perp\; S^2.$$

This is the independence that makes the t-statistic $t = \sqrt{n}(\bar X - \mu)/S$ have a well-defined distribution: it is the ratio of an independent $\mathcal{N}(0, 1)$ (from $\sqrt n(\bar X - \mu)/\sigma$) and $\sqrt{\chi^2_{n-1}/(n-1)}$ (from $S/\sigma$) — exactly the construction of Student's $t_{n-1}$. The t-test, the t-confidence-interval, and the entire Track 5 chapter on hypothesis testing for Normal-mean problems all rest on this Basu corollary. The BasuIndependence component above visualizes this independence as a decorrelated MC scatter cloud.
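
The decorrelation is easy to confirm outside the interactive panel, as a minimal sketch (standard Normal data, $n = 30$, and the replication count are assumptions):

```python
# Basu in action: the Monte-Carlo correlation between X_bar and S^2 is ~0
# for Normal data, consistent with exact independence.
import numpy as np

rng = np.random.default_rng(7)
mu, sigma, n, M = 0.0, 1.0, 30, 100_000
X = rng.normal(mu, sigma, size=(M, n))
xbar = X.mean(axis=1)
s2 = X.var(axis=1, ddof=1)
print("corr(X_bar, S^2) =", np.corrcoef(xbar, s2)[0, 1])   # ~0
```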

Example 22 Uniform(0, θ): X₍ₙ₎ complete sufficient, X₍ₙ₋₁₎/X₍ₙ₎ ancillary ⇒ independent

For $X_1, \ldots, X_n$ iid $\mathrm{Uniform}(0, \theta)$, $T = X_{(n)}$ is complete sufficient for $\theta$ (a standard exercise — completeness follows from the fact that any zero-expectation function $g(T)$ must satisfy $\int_0^\theta g(t) \cdot n t^{n-1}/\theta^n \, dt = 0$ for all $\theta$, which by differentiation in $\theta$ forces $g \equiv 0$ on $(0, \infty)$). The ratio $A = X_{(n-1)}/X_{(n)}$ has distribution $\mathrm{Beta}(n-1, 1)$ regardless of $\theta$ — pure scale-equivariance — so $A$ is ancillary. Basu's theorem then gives $X_{(n)} \perp\!\!\!\perp X_{(n-1)}/X_{(n)}$, a non-obvious consequence of the structure of order statistics. (The uniform-order-statistics structure this exploits is developed in Topic 29 §29.3.) This kind of independence underlies many likelihood-ratio test statistics and pivotal quantities for scale parameters.
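
The same Monte-Carlo check works here, as a sketch (the choices $\theta = 3$, $n = 10$, and the replication count are assumptions):

```python
# For Uniform(0, theta), X_(n) and the ratio X_(n-1)/X_(n) are independent by Basu:
# their simulated correlation is ~0 at any theta.
import numpy as np

rng = np.random.default_rng(8)
theta, n, M = 3.0, 10, 100_000
X = np.sort(rng.uniform(0, theta, size=(M, n)), axis=1)
t_max = X[:, -1]
ratio = X[:, -2] / X[:, -1]
print("corr =", np.corrcoef(t_max, ratio)[0, 1])   # ~0
```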

Remark 10 Basu is load-bearing for Track 5 — t-distribution independence

The independence $\bar X \perp\!\!\!\perp S^2$ in Example 21 is the technical engine of one-sample Normal inference. Without it, the ratio $t = \sqrt n (\bar X - \mu)/S$ would have a distribution depending on the joint distribution of $(\bar X, S)$ in a complicated way; with it, the ratio becomes the textbook $t_{n-1}$. Track 5 (hypothesis testing) and Track 5.5 (confidence intervals) build the t-test, the F-test, and the analysis of variance on this foundation. Basu provides the abstract reason: complete sufficiency for one parameter ($\mu$, with $\sigma^2$ fixed) plus ancillarity for that parameter ($S^2$, whose distribution doesn't depend on $\mu$) gives independence, which is exactly what pivotal-quantity inference needs. We will return to this when we develop the t-test in Track 5.

16.10 The Pitman-Koopman-Darmois Theorem

The classical results so far — sufficiency, completeness, Rao-Blackwell, Lehmann-Scheffé, Basu — work for any family that has a complete sufficient statistic. The Pitman–Koopman–Darmois theorem is the converse direction: it says that, under regularity, the only families admitting fixed-dimensional sufficient statistics for iid samples of every size are the exponential families.

In other words: Topic 7’s exponential families are not just one class with nice data-reduction properties — they are the only class. This converts Topic 7’s “exponential families are convenient” into “exponential families are essentially forced” — a structural result with serious philosophical weight.

Theorem 8 Pitman–Koopman–Darmois (scalar θ, regular family)

Let $\{f(x; \theta) : \theta \in \Theta\}$ be a family of densities on $\mathcal{X} \subseteq \mathbb{R}$ with respect to Lebesgue measure, and assume:

  • (A) The support $S = \{x : f(x;\theta) > 0\}$ is independent of $\theta$.
  • (B) $f(x;\theta) > 0$ on $S \times \Theta$.
  • (C) $f$ is twice continuously differentiable in $(x, \theta)$ jointly on $S \times \Theta$.

Suppose, in addition, that for every sample size $n \ge 1$, there exists a scalar-valued sufficient statistic $T_n(x_1, \ldots, x_n)$ for $\theta$ (i.e., a real-valued function, not a vector). Then $f$ is an exponential family in canonical form:

$$f(x; \theta) \;=\; h(x)\,\exp\!\bigl(\eta(\theta)\,T(x) - A(\theta)\bigr)$$

for some twice continuously differentiable functions $\eta : \Theta \to \mathbb{R}$, $T : S \to \mathbb{R}$, $A : \Theta \to \mathbb{R}$, and $h : S \to (0, \infty)$.

Proof [show]

Step 1 — Apply Fisher–Neyman to the iid joint density. By Theorem 2 (Fisher–Neyman factorization), sufficiency of TnT_n means

i=1nf(xi;θ)  =  gn ⁣(Tn(x1,,xn);θ)hn(x1,,xn).\prod_{i=1}^n f(x_i; \theta) \;=\; g_n\!\bigl(T_n(x_1, \ldots, x_n);\,\theta\bigr) \cdot h_n(x_1, \ldots, x_n).

Taking logs:

i=1nlogf(xi;θ)  =  loggn(Tn,θ)+loghn(x1,,xn).\sum_{i=1}^n \log f(x_i; \theta) \;=\; \log g_n(T_n, \theta) + \log h_n(x_1, \ldots, x_n).

Step 2 — Differentiate in θ\theta. Differentiating both sides with respect to θ\theta (which is allowed because of (B) and (C)),

i=1nθlogf(xi;θ)  =  θloggn(Tn,θ).\sum_{i=1}^n \partial_\theta \log f(x_i; \theta) \;=\; \partial_\theta \log g_n(T_n, \theta).

The left-hand side is a sum of one-variable functions θlogf(xi;θ)\partial_\theta \log f(x_i; \theta). The right-hand side depends on (x1,,xn)(x_1, \ldots, x_n) only through the scalar TnT_n. This functional-form constraint will force the structure.

Step 3 — Cross-differentiate in xjx_j. Fix any j{1,,n}j \in \{1, \ldots, n\} and differentiate both sides with respect to xjx_j:

xjθlogf(xj;θ)  =  Tθloggn(Tn,θ)xjTn(x1,,xn).\partial_{x_j}\partial_\theta \log f(x_j; \theta) \;=\; \partial_T\partial_\theta \log g_n(T_n, \theta) \cdot \partial_{x_j} T_n(x_1, \ldots, x_n).

The left-hand side depends only on the pair (xj,θ)(x_j, \theta); the right-hand side depends on (x1,,xn,θ)(x_1, \ldots, x_n, \theta) only through TnT_n and the partial xjTn\partial_{x_j} T_n. For this equation to hold for every choice of (x1,,xn)(x_1, \ldots, x_n) and every jj, the structural form must separate: there exist differentiable functions η:ΘR\eta : \Theta \to \mathbb{R} and T:SRT : S \to \mathbb{R} such that

xθlogf(x;θ)  =  η(θ)T(x).\partial_x \partial_\theta \log f(x; \theta) \;=\; \eta'(\theta) \cdot T'(x).

(This step is the content of the original Darmois (1935) argument; the careful version handles the case xjTn=0\partial_{x_j} T_n = 0 on a measure-zero set.)

Step 4 — Integrate in xx. Integrating the previous equation in xx,

θlogf(x;θ)  =  η(θ)T(x)  +  B(θ),\partial_\theta \log f(x; \theta) \;=\; \eta'(\theta) \cdot T(x) \;+\; B'(\theta),

where B(θ)B'(\theta) is the “constant of integration in xx” — a function of θ\theta alone.

Step 5 — Integrate in θ\theta. Integrating in θ\theta,

logf(x;θ)  =  η(θ)T(x)  +  B(θ)  +  D(x),\log f(x; \theta) \;=\; \eta(\theta) \cdot T(x) \;+\; B(\theta) \;+\; D(x),

where D(x)D(x) is the “constant of integration in θ\theta.” Exponentiating:

f(x;θ)  =  eD(x)exp ⁣(η(θ)T(x)+B(θ))  =  h(x)exp ⁣(η(θ)T(x)A(θ)),f(x; \theta) \;=\; e^{D(x)} \cdot \exp\!\bigl(\eta(\theta) T(x) + B(\theta)\bigr) \;=\; h(x) \cdot \exp\!\bigl(\eta(\theta) T(x) - A(\theta)\bigr),

where h(x)=eD(x)h(x) = e^{D(x)} and A(θ)=B(θ)A(\theta) = -B(\theta). This is the canonical exponential-family form (Topic 7 §7.3) with natural parameter η(θ)\eta(\theta), sufficient statistic T(x)T(x), log-partition A(θ)A(\theta), and base measure h(x)h(x).

∎ — via Fisher–Neyman (Theorem 2) and the separable functional equation derived in Steps 3–5.
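To make Step 3 concrete, here is a small symbolic check (a sketch, not part of the proof; SymPy assumed) for the Rayleigh(σ) family f(x; σ) = (x/σ²)exp(−x²/(2σ²)): the mixed partial ∂ₓ∂_σ log f does factor as η′(σ)·T′(x), with η(σ) = −1/(2σ²) and T(x) = x², which is exactly the separable form the theorem extracts.

```python
# Symbolic sanity check of Step 3 (illustrative): for the Rayleigh(σ) family,
# ∂_x ∂_σ log f factors as η'(σ)·T'(x) with η(σ) = -1/(2σ²) and T(x) = x².
import sympy as sp

x, sigma = sp.symbols("x sigma", positive=True)
logf = sp.log(x) - 2 * sp.log(sigma) - x**2 / (2 * sigma**2)   # log f(x; σ)

mixed = sp.simplify(sp.diff(logf, x, sigma))        # ∂_x ∂_σ log f
eta = -1 / (2 * sigma**2)                           # natural parameter η(σ)
T = x**2                                            # canonical sufficient statistic T(x)
factored = sp.diff(eta, sigma) * sp.diff(T, x)      # η'(σ) · T'(x)

print(mixed)                                 # 2*x/sigma**3
print(sp.simplify(mixed - factored) == 0)    # True: the separable form of Step 3
```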

PKD theorem schematic. Left: schematic of the PKD theorem's logic — "regular family with fixed-dim sufficient statistic for every n" ⟹ "exponential family"; inverse arrow "Topic 7: exp families HAVE fixed-dim sufficient statistics". Right: the Uniform(0, θ) counterexample — support [0, θ] shown with θ-dependent right endpoint (shaded region changes with θ), annotation "support depends on θ — violates PKD regularity (A); Uniform is NOT an exponential family despite having 1-dim sufficient statistic X_(n)".

Example 23 Uniform(0, θ): 1-dim sufficient statistic but NOT exp family — regularity (A) fails

Consider $X_1, \ldots, X_n$ iid $\mathrm{Uniform}(0, \theta)$. By Example 5, $T = X_{(n)}$ is sufficient (scalar, fixed-dimensional, for every $n$). Yet this family is not an exponential family, and the PKD theorem's regularity (A) tells us why: the support $S = [0, \theta]$ depends on $\theta$, so (A) fails. The cross-differentiation step in the proof relies on $f > 0$ on a $\theta$-independent set, which breaks down at the boundary $x = \theta$. Without (A) the theorem simply does not apply, and Uniform$(0, \theta)$ keeps its fixed-dimensional sufficient statistic without being an exponential family.

The lesson: the regularity conditions in PKD are not technical decorations — they are the price of the structural conclusion. Drop the support condition and the converse fails. The non-exp families with fixed-dim sufficient statistics that arise in practice (Uniform endpoints, truncated families, exponential shifts) are exactly those whose support varies with the parameter.

Remark 11 k-parameter PKD: Jacobian rank-k condition

The scalar Theorem 8 generalizes to θΘRk\theta \in \Theta \subseteq \mathbb{R}^k with a kk-dimensional sufficient statistic. The regularity (A)–(C) conditions extend in the obvious way (support independent of θ\theta, twice differentiability), and the new condition is a Jacobian-rank requirement: the k×kk \times k matrix [ηj(θ)/θ][\partial \eta_j(\theta)/\partial \theta_\ell] must have full rank on Θ\Theta. Under these conditions, the family is a kk-parameter exponential family. The proof — which requires careful measure-theoretic regularity to handle the multi-dimensional integration — is the content of Brown (1986, §2.3). For our purposes, the scalar version is enough; the multiparameter generalization is a Remark, not a Theorem, in this topic.

Remark 12 PKD justifies "exponential families are special"

Topic 7 introduced exponential families as a “convenient class with closed-form sufficient statistics, MLEs, and conjugate priors.” PKD reverses the direction: under regularity, exponential families are the only such class. This is what makes the exp-family chapter foundational rather than ornamental — every regular family with bounded-dimensional sufficient statistics is, structurally, an exponential family. The catalog in Topic 7 (Bernoulli, Binomial, Poisson, Normal, Gamma, Beta, Multinomial, Dirichlet, etc.) is not a curated list of nice examples but an enumeration of all the families satisfying PKD’s regularity. This is why exponential families appear everywhere in machine learning — variational inference, generalized linear models, conjugate Bayesian computation, energy-based models — they are the structural attractor for “tractable + parametric + regular.”

16.11 UMVUE vs MLE vs MoM: The Estimator Landscape Closes

Track 4 began with the evaluation framework (Topic 13: bias, variance, MSE, consistency, asymptotic normality, efficiency, CRLB) and applied it to two estimation methods: maximum likelihood (Topic 14) and method of moments (Topic 15). Topic 16 has now added the third — UMVUE via Lehmann–Scheffé. §16.11 brings all three into a single comparison: where they agree, where they diverge, and what their differences tell us about bias, efficiency, and the geometry of the likelihood.

Theorem 9 Exp-family triple coincidence: UMVUE = MLE = MoM in the natural parameter

Let {f(x;η)=h(x)exp(ηT(x)A(η))}\{f(x;\eta) = h(x)\exp(\eta T(x) - A(\eta))\} be a one-parameter exponential family with η\eta the natural parameter and TT the canonical sufficient statistic. Suppose TT is complete sufficient (Lemma 1) and the MoM equation selects the natural parameter directly (i.e., the MoM estimating equation is Tˉ=A(η)\bar T = A'(\eta), the moment-matching identity from Topic 7 §7.8). Then

η^UMVUE  =  η^MLE  =  η^MoM.\hat\eta_{\text{UMVUE}} \;=\; \hat\eta_{\text{MLE}} \;=\; \hat\eta_{\text{MoM}}.

In particular, all three estimators coincide, with common value $\bar X$, for Bernoulli($p$), Poisson($\lambda$), and Normal($\mu$ with $\sigma^2$ known): in each case the target is the mean parameter $A'(\eta) = \mathbb{E}_\eta[T]$, and each of the three principles returns $\bar T$. The natural parameter $\eta$ is related to this target by the one-to-one map $A'$.

The proof is short. Topic 7 §7.8 showed $\hat\eta_{\text{MLE}}$ solves the moment-matching identity $\bar T = A'(\hat\eta)$, which is exactly the equation $\hat\eta_{\text{MoM}}$ solves, so the MLE and MoM always agree here. For the UMVUE, note that $\bar T$ is unbiased for the mean parameter $A'(\eta) = \mathbb{E}_\eta[T]$ and is a function of the complete sufficient statistic, so Lehmann–Scheffé (§16.7) makes $\bar T$ the unique UMVUE of $A'(\eta)$. Since $A'$ is strictly increasing, inverting the moment identity is a one-to-one change of parameter, and all three principles are built on the same statistic $\bar T$; this is the sense in which they coincide.

This theorem fulfills the method-of-moments.mdx:1062 promise: in exponential families, when the target is the mean parameter $A'(\eta)$ (or the natural parameter reached from it by the one-to-one map $A'$), the three estimation principles agree exactly. The interesting cases, and the focus of the UMVUEComparator below, are the families just outside this regime: Normal $\sigma^2$ with $\mu$ unknown (where the MLE and MoM land on a different estimator than the UMVUE), and Gamma rate with known shape (where the UMVUE applies a $(n\alpha - 1)/(n\alpha)$ correction the MLE does not).
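Before the preset comparisons, a tiny numeric illustration of the coincidence (an assumed setup, not from the text; NumPy only, with λ and n chosen arbitrarily): for Poisson data all three principles return literally the same number, $\bar X$.

```python
# Tiny illustration of Theorem 9 (sketch; λ and n chosen arbitrarily):
# for Poisson(λ) the UMVUE, MLE, and MoM of λ are all X̄, the same number on
# any sample, because X̄ is unbiased, is a function of the complete sufficient
# statistic ΣXᵢ, solves the score equation, and matches E[X] = λ.
import numpy as np

rng = np.random.default_rng(2)
x = rng.poisson(lam=3.0, size=30)

lam_umvue = x.mean()   # unbiased function of the complete sufficient statistic
lam_mle = x.mean()     # solves the score equation ΣXᵢ/λ - n = 0
lam_mom = x.mean()     # matches the first moment E[X] = λ

print(lam_umvue, lam_mle, lam_mom)   # identical by construction
```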

Estimator landscape: the closing figure for Track 4. Top-left: Exponential λ = 2, n = 30: MC sampling distributions of UMVUE, MLE, and MoM overlaid; UMVUE = (n−1)/ΣX, MLE = MoM = n/ΣX, small bias difference. Top-right: Gamma rate (α = 3 known), β = 2, n = 50: UMVUE = (nα−1)/ΣX vs MLE = MoM = α/X̄; visible gap. Bottom-left: Normal σ² with μ unknown, σ² = 4, n = 30: the featured "all three differ" case; UMVUE S²_29, MLE = MoM S²_30. Bottom-right: Poisson λ = 3, n = 30: triple coincidence at X̄, the exp-family coincidence of Theorem 9.

UMVUEComparator · UMVUE vs MLE vs MoM · §16.11 · Track 4 closer. Normal σ² (μ unknown) preset, $n = 30$, $\sigma^2 = 4$. Estimator formulas: $\hat\sigma^2_{\text{UMVUE}} = \tfrac{1}{n-1}\sum_i(X_i - \bar X)^2 = S^2_{n-1}$; $\hat\sigma^2_{\text{MLE}} = \tfrac{1}{n}\sum_i(X_i - \bar X)^2 = S^2_n$; $\hat\sigma^2_{\text{MoM}} = S^2_n$ (same as MLE). Monte-Carlo readout: UMVUE bias 0.0340, MLE = MoM bias −0.1004; UMVUE MSE 1.1424, MLE = MoM MSE 1.0765. The featured "all three differ" case: UMVUE is unbiased; MLE = MoM share bias $-\sigma^2/n$, which is $\approx -0.13$ for $n = 30$ and $\sigma^2 = 4$.

Example 24 Exponential rate λ: all three close, UMVUE strictly unbiased

For iid Exponential(λ)\mathrm{Exponential}(\lambda), T=XiT = \sum X_i is complete sufficient and TGamma(n,λ)T \sim \mathrm{Gamma}(n, \lambda), with E[T1]=λ/(n1)\mathbb{E}[T^{-1}] = \lambda/(n - 1) for n2n \ge 2. So the UMVUE is

λ^UMVUE  =  n1Xi.\hat\lambda_{\text{UMVUE}} \;=\; \frac{n - 1}{\sum X_i}.

The MLE and MoM both equal λ^MLE/MoM=n/Xi=1/Xˉ\hat\lambda_{\text{MLE/MoM}} = n/\sum X_i = 1/\bar X (Topic 14 Example 14.7; Topic 15 Example 15.2). The two differ by exactly the factor (n1)/n(n-1)/n — a small but persistent bias gap. The UMVUE is strictly unbiased; the MLE/MoM has a small positive bias going to zero as 1/n1/n. UMVUEComparator’s Exponential preset visualizes this: the three sampling distributions overlap heavily, but UMVUE sits exactly on λ\lambda while MLE/MoM are slightly displaced.
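A Monte-Carlo sketch of this gap (illustrative, not from the text; NumPy assumed, with λ = 2 and n = 30 matching the Exponential preset and an arbitrary replication count):

```python
# Monte-Carlo sketch of Example 24: UMVUE (n-1)/ΣX vs MLE/MoM n/ΣX for Exponential(λ).
import numpy as np

rng = np.random.default_rng(3)
lam, n, M = 2.0, 30, 200_000

X = rng.exponential(scale=1 / lam, size=(M, n))
S = X.sum(axis=1)

umvue = (n - 1) / S
mle = n / S                      # = MoM = 1/X̄

for name, est in [("UMVUE", umvue), ("MLE/MoM", mle)]:
    bias = est.mean() - lam
    mse = ((est - lam) ** 2).mean()
    print(f"{name:8s} bias = {bias:+.4f}, MSE = {mse:.4f}")
# UMVUE bias ≈ 0; MLE/MoM bias ≈ λ/(n-1) ≈ +0.069
```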

Example 25 Gamma rate β (with α known): UMVUE ≠ MLE = MoM (fulfills the method-of-moments.mdx:1062 promise)

This is the example that fulfills the method-of-moments.mdx:1062 promise. For iid Gamma(α,β)\mathrm{Gamma}(\alpha, \beta) with α\alpha known, the family is one-parameter exponential in β\beta (rate). T=XiGamma(nα,β)T = \sum X_i \sim \mathrm{Gamma}(n\alpha, \beta) is complete sufficient. The UMVUE is

β^UMVUE  =  nα1Xi,\hat\beta_{\text{UMVUE}} \;=\; \frac{n\alpha - 1}{\sum X_i},

while the MLE and MoM both equal $\hat\beta_{\text{MLE/MoM}} = \alpha/\bar X = n\alpha/\sum X_i$. The bias-correction factor is exactly $(n\alpha - 1)/(n\alpha)$, a clean instance of the structural pattern: when the target is not the mean parameter itself but a nonlinear function of it (here $\beta = \alpha/\mathbb{E}[X]$), MLE/MoM and UMVUE differ in finite samples but agree asymptotically. The UMVUEComparator's Gamma preset at $\alpha = 3$, $\beta = 2$, $n = 50$ shows the gap clearly: UMVUE bias $\approx 0$, MLE/MoM bias $> 0$, MSE close but UMVUE marginally better.
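Because $T = \sum X_i$ has a known Gamma distribution, the bias gap here can be checked with exact expectations rather than simulation. The sketch below (illustrative; values α = 3, β = 2, n = 50 taken from the preset) evaluates $\mathbb{E}[1/T] = \beta/(n\alpha - 1)$ in closed form:

```python
# Exact-bias check for Example 25: with T = ΣX ~ Gamma(nα, rate β),
# E[1/T] = β/(nα - 1), so the MLE nα/T has expectation β·nα/(nα - 1)
# while the UMVUE (nα - 1)/T is exactly unbiased.
alpha, beta, n = 3.0, 2.0, 50

E_mle = beta * (n * alpha) / (n * alpha - 1)       # E[ nα / T ]
E_umvue = beta                                     # E[ (nα - 1) / T ]
correction = (n * alpha - 1) / (n * alpha)         # UMVUE = correction × MLE

print(f"E[MLE]   = {E_mle:.4f}  (bias {E_mle - beta:+.4f})")
print(f"E[UMVUE] = {E_umvue:.4f}  (unbiased)")
print(f"bias-correction factor (nα-1)/(nα) = {correction:.4f}")
```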

Example 26 Normal σ² (with μ unknown) revisited — featured triple comparison

Returning to Example 19 with the full triple-comparison lens. The UMVUE is $S^2_{n-1}$ (the Bessel-corrected variance, unbiased); MLE = MoM is $S^2_n$ (with bias $-\sigma^2/n$). At $n = 30$ and $\sigma^2 = 4$, the bias gap is $-0.133$, the two estimators differ by the constant factor $S^2_n = \tfrac{n-1}{n}\,S^2_{n-1} \approx 0.967\,S^2_{n-1}$, and the MSE comparison is dictated by

MSE(Sn12)=Var(Sn12)=2σ4n1,MSE(Sn2)=Var(Sn2)+Bias2=(n1)2σ4n2+σ4n2=(2n1)σ4n2.\mathrm{MSE}(S^2_{n-1}) = \mathrm{Var}(S^2_{n-1}) = \frac{2\sigma^4}{n-1}, \qquad \mathrm{MSE}(S^2_n) = \mathrm{Var}(S^2_n) + \mathrm{Bias}^2 = \frac{(n-1)\cdot 2\sigma^4}{n^2} + \frac{\sigma^4}{n^2} = \frac{(2n - 1)\sigma^4}{n^2}.

For σ2=4\sigma^2 = 4 and n=30n = 30: MSE(UMVUE) 1.103\approx 1.103, MSE(MLE/MoM) 1.049\approx 1.049 — slightly lower for the biased estimator. This is the small-sample regime where the bias-variance trade-off can favor the biased MLE; for larger nn, both MSEs collapse and the UMVUE’s strict unbiasedness wins. The UMVUEComparator’s Normal preset visualizes both regimes by varying nn.
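A short check of the two closed-form MSEs against simulation (a sketch, not from the text; NumPy assumed, σ² = 4 and n = 30 as in the preset, with μ set to 0 since both estimators center the data at X̄ anyway):

```python
# Check of Example 26's MSE formulas: 2σ⁴/(n-1) for S²_{n-1} and (2n-1)σ⁴/n² for S²_n,
# compared against a Monte-Carlo run.
import numpy as np

rng = np.random.default_rng(4)
sigma2, n, M = 4.0, 30, 200_000

X = rng.normal(0.0, np.sqrt(sigma2), size=(M, n))
s2_umvue = X.var(axis=1, ddof=1)       # S²_{n-1}
s2_mle = X.var(axis=1, ddof=0)         # S²_n = MoM

mse_umvue_theory = 2 * sigma2**2 / (n - 1)           # ≈ 1.103
mse_mle_theory = (2 * n - 1) * sigma2**2 / n**2      # ≈ 1.049

print(f"UMVUE   MSE: MC {((s2_umvue - sigma2)**2).mean():.3f} vs theory {mse_umvue_theory:.3f}")
print(f"MLE/MoM MSE: MC {((s2_mle - sigma2)**2).mean():.3f} vs theory {mse_mle_theory:.3f}")
```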

Remark 13 This closes Track 4 — the complete classical estimation toolkit

Track 4 began with the evaluation framework (Topic 13) and asked: which estimators win on bias, variance, and MSE, and how do we know? Topics 14–16 answered this for the three classical methods.

  • MLE (Topic 14): asymptotically efficient under regularity; consistency and asymptotic normality from the LLN/CLT machinery.
  • MoM (Topic 15): simpler; exactly equivalent to MLE in exp families, often less efficient outside them (ARE $< 1$).
  • UMVUE (Topic 16): the unique unbiased estimator with minimum variance, when complete sufficiency is available.

The three principles agree in exp families in their natural parameter (Theorem 9). They diverge in structurally interpretable ways outside: UMVUE prioritizes strict unbiasedness; MLE prioritizes asymptotic efficiency (with bias going to zero); MoM trades both for closed-form simplicity. The Normal σ2\sigma^2 unknown case is the canonical “all three differ” example, and §16.11 provides the synthesis.

Track 4 is now closed. Track 5 picks up the inferential toolkit — hypothesis testing, confidence intervals, Bayesian inference — built on top of the estimators we now have.

16.12 Where Sufficiency Falls Short

Sufficiency is foundational but not the whole story of inference. Three short remarks flag where its reach ends.

Remark 14 Ancillarity recovery — Fisher's conditionality principle

After sufficient-statistic-based reduction, an ancillary statistic can still provide a “more relevant reference set” for inference. Fisher’s conditionality principle says that inference for θ\theta should be conducted conditional on the observed value of any ancillary statistic — even though the marginal distribution of the ancillary is parameter-free. This creates a tension: sufficiency reduces the data to TT for the model; ancillarity refines the inference to a relevant subset for interpretation. In a one-sample Normal location problem with known variance, the ancillary is trivial and the conditionality principle adds nothing. In more complex problems (e.g. linear regression with covariates that are themselves random), the ancillary is the design matrix and conditional inference becomes the standard frequentist approach. We take this up in Track 5 (confidence intervals via pivotal quantities) and again in the Bayesian track (the likelihood principle, where conditioning on ancillary statistics is equivalent to using the likelihood for inference).

Remark 15 Non-regular families — Uniform endpoints and the Pitman estimator

The Lehmann–Scheffé machinery requires completeness, and the blanket guarantee of completeness (Lemma 1) is an exponential-family result. Non-regular families, with Uniform$(0, \theta)$ the canonical example, escape PKD by violating (A), so completeness must be checked case by case rather than read off the family's form. Sometimes it holds anyway: for Uniform$(0, \theta)$, $X_{(n)}$ is complete sufficient (the Uniform example in the Basu section above), and Lehmann–Scheffé gives the UMVUE $\hat\theta = \tfrac{n+1}{n} X_{(n)}$, a bias-correction of the MLE $X_{(n)}$. The general theory of optimal estimation in non-regular families, with or without completeness, is the Pitman estimator framework (named for the same Pitman as in PKD), which builds optimal equivariant estimators from the group structure of the parameter (location, scale, location-scale). Track 6 (linear regression and beyond) returns to this when discussing equivariant procedures for shifts and scales.
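A quick simulation sketch of the bias correction (illustrative; NumPy assumed, θ and n arbitrary): the MLE $X_{(n)}$ sits below θ by the factor $n/(n+1)$, and the corrected estimator is unbiased.

```python
# Monte-Carlo sketch for Remark 15: ((n+1)/n)·X_(n) is unbiased for θ in Uniform(0, θ),
# unlike the MLE X_(n), whose expectation is nθ/(n+1).
import numpy as np

rng = np.random.default_rng(5)
theta, n, M = 5.0, 20, 200_000

X = rng.uniform(0, theta, size=(M, n))
mle = X.max(axis=1)                      # X_(n), biased low
mvue = (n + 1) / n * mle                 # bias-corrected, exactly unbiased

print(f"E[MLE]  ≈ {mle.mean():.4f}  (theory {n * theta / (n + 1):.4f})")
print(f"E[MVUE] ≈ {mvue.mean():.4f}  (theory {theta:.4f})")
```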

Remark 16 Asymptotic sufficiency — Le Cam

For non-exp families (no fixed-dim sufficient statistic), data reduction in the strict sense fails — the minimal sufficient statistic grows with nn. But asymptotic sufficiency rescues a weaker version: as nn \to \infty, the MLE becomes asymptotically sufficient in the sense that all relevant inferential information concentrates on a finite-dimensional sufficient statistic tangent to the true parameter. This is the content of Le Cam’s Local Asymptotic Normality program: under regularity, the log-likelihood ratio sequence behaves locally like a Normal location family, and the MLE is the asymptotically efficient estimator. The framework unifies MLE asymptotic efficiency, Bayesian contraction rates, and the CLT into a single picture — and it generalizes to non-iid, non-parametric, and high-dimensional settings. Le Cam’s book (1986) and van der Vaart’s Asymptotic Statistics (1998) are the canonical references. The information-bottleneck and representation-learning topics on formalml.com pick up the same thread for learned (rather than computed) sufficient statistics.

16.13 Summary & Forward Look

Remark 17 The conceptual lattice of sufficiency, completeness, and ancillarity
| Concept | Definition | Inferential role |
| --- | --- | --- |
| Sufficient $T$ | $X \mid T$ parameter-free | preserves all parameter info |
| Minimal sufficient | coarsest sufficient statistic | no data redundancy |
| Complete $T$ | $\mathbb{E}_\theta[g(T)] = 0\ \forall\,\theta \Rightarrow g(T) = 0$ a.s. | no functional redundancy in $T$ |
| Ancillary $A$ | distribution free of $\theta$ | carries no parameter info |
| UMVUE | unbiased, minimum variance uniformly | Lehmann–Scheffé construction |

The lattice of implications: complete sufficient \Rightarrow minimal sufficient (Bahadur 1957). Complete sufficient  ⁣ ⁣ ⁣\perp\!\!\!\perp ancillary (Basu, Theorem 7). Lehmann–Scheffé combines completeness + sufficiency \Rightarrow unique UMVUE (Theorem 5). Pitman–Koopman–Darmois closes the loop: under regularity, exp families are the only families with fixed-dim sufficient statistics for every nn (Theorem 8). Topic 7’s exp-family construction supplies the sufficient statistic; Topic 16 supplies the entire optimality theory built on top.

Closing Track 4. The classical estimation toolkit is now complete:

  • Topic 13: the evaluation framework — bias, MSE, consistency, asymptotic normality, efficiency, Fisher information, CRLB.
  • Topic 14: the maximum likelihood estimator — score equation, asymptotic efficiency, Newton-Raphson and Fisher scoring.
  • Topic 15: the method of moments and M-estimation — closed-form moment matching, ARE comparison with MLE, the sandwich variance.
  • Topic 16: sufficiency, completeness, Rao–Blackwell, Lehmann–Scheffé, Basu, and Pitman–Koopman–Darmois — the structural theory of optimal unbiased estimation, the converse to Topic 7’s exp families.

The three operative principles (UMVUE, MLE, MoM) coincide in exp families in the natural parameter (§16.11 Theorem 9) and diverge in structurally informative ways outside. Track 4 is closed.

Remark 18 Where this leads

Topic 16’s machinery is foundational for almost every downstream topic on this site and on formalml.com.

  • Hypothesis Testing (Topic 17) — score tests, Wald tests, and likelihood-ratio tests are all functions of the sufficient statistic. The one-sample t-test derivation (§17.7 Thm 5) uses Basu’s Xˉ ⁣ ⁣ ⁣S2\bar X \perp\!\!\!\perp S^2 independence directly, fulfilling the forward promise in Example 21 of this section. The Neyman–Pearson lemma is stated in Topic 17 §17.9 and proved in Topic 18.
  • Confidence Intervals — pivotal quantities are constructed from ancillary statistics (Basu’s theorem gives the required independence); Wilks-type intervals use asymptotic sufficiency of the MLE.
  • Bayesian Foundations (Topic 25) — the posterior depends on the data only through the sufficient statistic. Conjugate priors (Topic 7 §7.7) become especially natural when TT is complete sufficient — the posterior collapses to a finite-dimensional family.
  • Linear Regression (Track 6) — OLS estimators are functions of $(\mathbf{X}^\top\mathbf{X}, \mathbf{X}^\top\mathbf{y})$, the sufficient statistics for the Normal linear model. Gauss–Markov makes OLS the BLUE (best linear unbiased estimator); under Normal errors it is the full UMVUE.
  • Generalized Linear Models — GLMs leverage exp-family sufficient statistics directly; iteratively reweighted least squares is a score-based algorithm that descends along the sufficient statistic’s geometry.
  • formalml.com — Information bottleneck — the IB method finds a representation TT of XX that preserves information about a target YY. In the special case where θ=Y\theta = Y, the IB recovers exactly the Topic 16 notion of sufficiency. Learned sufficient statistics are the deep-learning analog of the parametric ones we’ve built here.
  • formalml.com — Representation learning — self-supervised contrastive and masked-modeling objectives implicitly discover sufficient statistics of XX for downstream tasks. The Rao–Blackwellization intuition (optimal estimators are functions of TT) reappears as the rationale for pretrained-feature reuse.

The thread connecting all of these: sufficient statistics are the abstract object that data reduction discovers; everything else is downstream.


References

  1. Erich L. Lehmann & George Casella. (1998). Theory of Point Estimation (2nd ed.). Springer.
  2. George Casella & Roger L. Berger. (2002). Statistical Inference (2nd ed.). Duxbury.
  3. Lawrence D. Brown. (1986). Fundamentals of Statistical Exponential Families. IMS Lecture Notes Monograph Series, Vol. 9.
  4. Ronald A. Fisher. (1922). On the Mathematical Foundations of Theoretical Statistics. Philosophical Transactions of the Royal Society A, 222, 309–368.
  5. Paul R. Halmos & Leonard J. Savage. (1949). Application of the Radon-Nikodym Theorem to the Theory of Sufficient Statistics. Annals of Mathematical Statistics, 20(2), 225–241.
  6. C. Radhakrishna Rao. (1945). Information and the Accuracy Attainable in the Estimation of Statistical Parameters. Bulletin of the Calcutta Mathematical Society, 37, 81–91.
  7. David Blackwell. (1947). Conditional Expectation and Unbiased Sequential Estimation. Annals of Mathematical Statistics, 18(1), 105–110.
  8. Erich L. Lehmann & Henry Scheffé. (1950). Completeness, Similar Regions, and Unbiased Estimation — Part I. Sankhyā, 10(4), 305–340.
  9. Debabrata Basu. (1955). On Statistics Independent of a Complete Sufficient Statistic. Sankhyā, 15(4), 377–380.
  10. Georges Darmois. (1935). Sur les lois de probabilité à estimation exhaustive. Comptes rendus de l’Académie des Sciences, 200, 1265–1266.
  11. Bernard O. Koopman. (1936). On Distributions Admitting a Sufficient Statistic. Transactions of the American Mathematical Society, 39(3), 399–409.
  12. E. J. G. Pitman. (1936). Sufficient Statistics and Intrinsic Accuracy. Mathematical Proceedings of the Cambridge Philosophical Society, 32(4), 567–579.