intermediate 45 min read · April 14, 2026

Point Estimation & Bias-Variance

Estimators as random variables, bias, variance, MSE decomposition — the framework for evaluating any estimator.

13.1 Estimators as Random Variables

Every inferential procedure you have ever run — a sample mean, a regression coefficient, a p-value, a posterior mode — was computed from data. Before the data arrived, that computation was a function waiting for input. After the data arrives, it returns a number. That number feels concrete, but the procedure that produced it is the object of interest, and the procedure is a random variable: its value depends on which sample $X_1, \ldots, X_n$ happened to land in your hands. Different samples, different value. An estimator is not a single number; it is a distribution, summarized in the long run by a single number plus some scatter.

This shift in viewpoint is the entire foundation of estimation theory. We stop asking “what number did I compute?” and start asking “what distribution does my computation have, and what does that distribution say about the parameter I care about?” The distribution of an estimator across repeated sampling is called its sampling distribution. Bias, variance, MSE, consistency, efficiency — every concept in this topic is a property of that sampling distribution.

Three-panel view of an estimator's sampling distribution: (left) one sample of size n=25 from N(5, 4) with the sample mean and ±σ/√n band; (center) the histogram of X̄₂₅ over 5000 replications converging to a N(μ, σ²/n) density; (right) three estimators — mean, median, trimmed mean — superimposed, showing distinct sampling distributions
Definition 1 Statistic

A statistic is any measurable function $T = g(X_1, \ldots, X_n)$ of the observed sample. Equivalently, $T$ is a random variable whose value is determined by the data, and whose distribution is induced by the joint distribution of the sample.

“Measurable” is a technical requirement — it ensures probabilities of events like $\{T \leq t\}$ are well-defined — and is satisfied by every computation a human could plausibly write down. The key restriction is that $T$ must depend only on the data, not on unknown parameters. So $\bar{X}_n$ is a statistic; so is $\max_i X_i$ or $\sum_i (X_i - \bar{X}_n)^2$. But $\bar{X}_n - \mu$ is not a statistic when $\mu$ is the unknown parameter we are estimating: it involves the quantity we are trying to learn.

Definition 2 Point Estimator

A point estimator of a parameter $\theta$ is a statistic $\hat{\theta}_n = g(X_1, \ldots, X_n)$ whose value is used as a guess for $\theta$. The estimator is a random variable; its value $\hat{\theta}_n(\omega)$ for a particular sample realization is a point estimate.

The distinction between estimator and estimate is the same as that between the function and its value. “The sample mean” is an estimator — a recipe applicable to any sample. “The sample mean of our data, which equals $4.72$” is an estimate — the recipe evaluated on a specific sample. Almost every confusion in introductory statistics traces back to conflating these. When we ask “is this estimator unbiased?” we are asking about the recipe across all possible samples, not about any particular number.

Example 1 Three estimators of the population mean

Suppose $X_1, \ldots, X_n$ are iid with unknown mean $\mu$ and finite variance. Three candidate estimators of $\mu$:

  1. Sample mean: $\hat{\mu}_1 = \bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$.
  2. Sample median: $\hat{\mu}_2 = \operatorname{median}(X_1, \ldots, X_n)$, the middle order statistic — whose asymptotics are developed in Topic 29 §29.6.
  3. Trimmed mean: $\hat{\mu}_3$ = mean of the sample after discarding the top $10\%$ and bottom $10\%$ of observations.

All three are statistics (they depend only on the data), all three target the same $\mu$, and all three are reasonable candidates. They produce different values on the same sample, and their sampling distributions are different shapes. Which one is “best” depends on the underlying distribution and on what we mean by “best” — the rest of this topic builds the vocabulary to answer that question precisely.
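As a concrete sketch (the parameter values, seed, and variable names here are our own, not from the text), the three recipes applied to one simulated sample of size $n = 25$ from $\mathcal{N}(5, 4)$; the trimmed mean is computed by hand rather than via a library helper:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=25)   # one sample from N(5, 4)

mean_hat = x.mean()                           # estimator 1: sample mean
median_hat = np.median(x)                     # estimator 2: sample median

# estimator 3: 10% trimmed mean — drop the lowest and highest 10% of points
k = int(0.10 * len(x))                        # number trimmed from each tail
trimmed_hat = np.sort(x)[k:len(x) - k].mean()

print(mean_hat, median_hat, trimmed_hat)      # three different values, same target mu
```

Rerunning with a different seed changes all three numbers — that variability across reruns is exactly the sampling distribution the section is about.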

Remark 1 The conceptual shift: estimators as random variables

In an introductory statistics class, you compute the sample mean of a dataset and report it as “the” estimate. In estimation theory, we take one step back and ask: if we could repeatedly draw fresh samples of size $n$ from the population and compute the sample mean each time, what distribution of values would we see? That distribution — not any single computed value — is the object we analyze. Its center is where our estimates cluster on average (bias tells us whether that center is $\mu$). Its spread tells us how precise any single estimate is (variance). Its shape tells us how the estimator behaves in the tails, and for large $n$, what confidence intervals look like (asymptotic normality). Every theorem in this topic is a statement about that distribution.

This is also why we write $\hat{\theta}_n$ with a hat: the hat signals “random variable, function of the data,” distinguishing it from the unknown-but-fixed true parameter $\theta$. Reading a formula, always ask: which symbols are random? Which are fixed? The answer changes everything.

Interactive: Estimator Sampling Explorer
Draw samples of size n, compute the chosen estimator on each, and watch the sampling distribution build up. Histogram in blue, theoretical Normal overlay in red (when analytic variance is available).

13.2 Bias

The first criterion for a “good” estimator is simple: on average, across repeated sampling, does it hit the target? An estimator whose expected value equals the true parameter is called unbiased. Unbiasedness sounds like a bedrock virtue, but as we will see in §13.4, it is surprisingly easy to do better by being deliberately biased. First, the definitions.

Definition 3 Bias of an Estimator

Let $\hat{\theta}_n$ be an estimator of $\theta$. Its bias is the difference between its expected value and the true parameter:

$$\operatorname{Bias}(\hat{\theta}_n) = \mathbb{E}[\hat{\theta}_n] - \theta.$$

The expectation is taken over the sampling distribution of $\hat{\theta}_n$, treating $\theta$ as fixed.

Bias measures a systematic error: if the bias is positive, the estimator overestimates on average; if negative, it underestimates. Random error — the deviation of any particular estimate from $\mathbb{E}[\hat{\theta}_n]$ — is captured by variance, which §13.3 develops.

Definition 4 Unbiased Estimator

The estimator $\hat{\theta}_n$ is unbiased for $\theta$ if $\operatorname{Bias}(\hat{\theta}_n) = 0$, equivalently $\mathbb{E}[\hat{\theta}_n] = \theta$, for every value of $\theta$ in the parameter space.

Three-panel bias visualization: (left) a dart-board analogy contrasting high-bias/low-variance, low-bias/high-variance, and low-bias/low-variance clusters; (center) Bessel's correction — E[S²] with 1/n (biased) vs 1/(n−1) (unbiased) as n grows; (right) MSE comparison of the two sample-variance estimators

The universal quantifier — for every value of $\theta$ — matters. A procedure that happens to return the right answer for $\theta = 0$ but systematically misses for other values is not unbiased in this technical sense. Unbiasedness is a property of the estimator’s behavior uniformly across the parameter space, not at any single point.

Theorem 1 The sample mean is unbiased for the population mean

If $X_1, \ldots, X_n$ are iid with $\mathbb{E}[X_i] = \mu$ (assumed finite), then the sample mean $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$ is unbiased for $\mu$:

$$\mathbb{E}[\bar{X}_n] = \mu.$$
Proof.

Start from the definition of the sample mean:

$$\mathbb{E}[\bar{X}_n] = \mathbb{E}\!\left[\frac{1}{n}\sum_{i=1}^n X_i\right].$$

By linearity of expectation — which does not require independence — we can pull the constant $1/n$ outside and distribute the expectation across the sum:

$$\mathbb{E}[\bar{X}_n] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}[X_i].$$

Because the $X_i$ are iid, each $\mathbb{E}[X_i] = \mu$, and there are $n$ copies:

$$\mathbb{E}[\bar{X}_n] = \frac{1}{n} \cdot n\mu = \mu. \qquad\blacksquare$$

The proof is two lines, but the two ingredients — linearity of expectation and identical means — are the template for essentially every unbiasedness calculation you will ever do.

Example 2 Unbiasedness under non-identical distributions

The iid assumption in Theorem 1 is stronger than needed: we only used that $\mathbb{E}[X_i] = \mu$ for every $i$. So the sample mean is unbiased whenever the summands share a common mean, even if their variances differ or their distributions are different. For example, if $X_1 \sim \operatorname{Exponential}(1)$ and $X_2 \sim \operatorname{Gamma}(2, 2)$ both have mean $1$, then $\bar{X}_2 = (X_1 + X_2)/2$ is still unbiased for $1$. Unbiasedness of linear statistics is astonishingly robust.

Contrast this with the sample median, which is not generally unbiased for $\mu$ unless the distribution is symmetric around $\mu$. For $\operatorname{Exponential}(1)$, the population mean is $1$ but the population median is $\log 2 \approx 0.693$; the sample median targets the population median, not the mean.
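A quick Monte Carlo sketch of this point (sample size, seed, and replication count are arbitrary choices of ours): the sample median of Exponential(1) data centers near the population median $\log 2 \approx 0.693$ — here slightly above it, since the sample median itself carries a small finite-sample bias — and nowhere near the mean $1$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 25, 20_000

# sample median of Exponential(1) data, across many fresh samples
medians = np.median(rng.exponential(scale=1.0, size=(reps, n)), axis=1)

# average is well below the population mean 1.0, near the median log 2
print(medians.mean())
```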

Example 3 Bessel's correction: why we divide by n − 1

The natural candidate for estimating the population variance $\sigma^2$ is the average squared deviation from the sample mean:

$$S_n^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}_n)^2.$$

This is the maximum-likelihood estimate of $\sigma^2$ when the underlying distribution is Normal, and it is the “natural” variance of the empirical distribution. But it is biased. A direct calculation — expanding the squared deviation, using $\operatorname{Var}(\bar{X}_n) = \sigma^2/n$ (Theorem 2 below), and collecting terms — shows

$$\mathbb{E}[S_n^2] = \frac{n-1}{n}\,\sigma^2.$$

So $S_n^2$ underestimates $\sigma^2$ by a factor of $(n-1)/n$. The fix is to divide by $n-1$ instead of $n$:

$$S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X}_n)^2.$$

This is the unbiased sample variance, and the $(n-1)$ denominator is called Bessel’s correction. The correction is intuitive: in computing $\bar{X}_n$ from the data we have “used up” one degree of freedom, so only $n - 1$ independent residuals remain. The unbiasedness of $S^2$ follows immediately: $\mathbb{E}[S^2] = \frac{n}{n-1}\cdot \mathbb{E}[S_n^2] = \sigma^2$.

Which divisor should you use in practice? That turns out to be a more subtle question than it first appears — see Example 5 in §13.4.
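One way to see Bessel's correction empirically (the parameter values are illustrative choices of ours): NumPy's `var` exposes both divisors through its `ddof` argument, so we can average each estimator over many replications and compare against $\sigma^2 = 4$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps, sigma2 = 10, 100_000, 4.0
samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))

biased = samples.var(axis=1, ddof=0)     # divide by n       (S_n^2)
unbiased = samples.var(axis=1, ddof=1)   # divide by n - 1   (S^2, Bessel)

print(biased.mean())    # close to (n-1)/n * sigma^2 = 3.6
print(unbiased.mean())  # close to sigma^2 = 4.0
```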

Remark 2 Unbiasedness alone is a weak criterion

Unbiasedness feels like a bedrock virtue, but consider the estimator $\tilde{\mu} = X_1$ — the first observation, ignoring everything else. It is unbiased for $\mu$: $\mathbb{E}[X_1] = \mu$. It is also a terrible estimator: its variance is $\sigma^2$, compared with $\sigma^2/n$ for the sample mean, so after any real amount of data it is $n$ times worse. Unbiasedness is satisfied; precision is not.

More dramatically, some parameters admit no unbiased estimator at all (e.g., the reciprocal of a Poisson mean), while others admit only pathological ones. And, as we will see, biased estimators can systematically outperform unbiased ones in mean squared error — shrinkage estimators, ridge regression, and James–Stein all exploit this. Unbiasedness is one input into evaluation, not the whole story. The right criterion combines bias and variance, which is what MSE does.

13.3 Variance and Standard Error

The second criterion for a “good” estimator is precision: how spread out is the sampling distribution? If two estimators are both unbiased but one has half the variance, it gives estimates that are closer to the truth on average — by a lot. The standard measure of spread for an estimator is its standard deviation, which in this context has a special name.

Definition 5 Standard Error

The standard error of an estimator $\hat{\theta}_n$ is the standard deviation of its sampling distribution:

$$\operatorname{SE}(\hat{\theta}_n) = \sqrt{\operatorname{Var}(\hat{\theta}_n)}.$$

When the estimator’s variance depends on unknown population parameters, the estimated standard error $\widehat{\operatorname{SE}}(\hat{\theta}_n)$ plugs in sample-based substitutes.

The term “standard error” exists because we needed a name distinct from “standard deviation of the population.” Standard deviation describes the spread of the underlying random variable $X$. Standard error describes the spread of an estimator computed from many $X_i$’s — which is almost always much smaller, thanks to averaging.

Theorem 2 Variance of the sample mean

If $X_1, \ldots, X_n$ are iid with $\operatorname{Var}(X_i) = \sigma^2$ (assumed finite), then the sample mean has variance

$$\operatorname{Var}(\bar{X}_n) = \frac{\sigma^2}{n},$$

and therefore standard error $\operatorname{SE}(\bar{X}_n) = \sigma/\sqrt{n}$.

Proof.

Expand the sample mean as a scaled sum:

$$\operatorname{Var}(\bar{X}_n) = \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^n X_i\right) = \frac{1}{n^2}\operatorname{Var}\!\left(\sum_{i=1}^n X_i\right).$$

By independence, variance distributes over a sum:

$$\operatorname{Var}\!\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n \operatorname{Var}(X_i) = n\sigma^2.$$

Substituting:

$$\operatorname{Var}(\bar{X}_n) = \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n}. \qquad\blacksquare$$

The factor of $1/n$ is the entire content of the sample mean’s appeal. It says that doubling the sample size halves the variance — precision improves linearly in $n$. And the factor of $\sqrt{n}$ in the standard error is the rate at which confidence intervals narrow, the rate that appears in the CLT, and the rate that organizes almost all of classical statistics.

Example 4 Standard error of the sample mean for a Normal population

If $X_1, \ldots, X_n$ are iid $\mathcal{N}(\mu, \sigma^2)$, the sample mean satisfies $\bar{X}_n \sim \mathcal{N}(\mu, \sigma^2/n)$ exactly — no approximation, no CLT needed. Its standard error is $\sigma/\sqrt{n}$. For $\sigma = 2$ and $n = 25$, that is $2/5 = 0.4$: a single sample mean will typically land within $\pm 0.4$ of $\mu$ (about one SE), and almost always within $\pm 0.8$ (two SEs).

When $\sigma$ is unknown — the usual case — we plug in the sample standard deviation $S = \sqrt{S^2}$ (from Example 3) to get the estimated standard error $\widehat{\operatorname{SE}}(\bar{X}_n) = S/\sqrt{n}$. The distinction matters: confidence intervals built from $S$ use the Student’s $t$ distribution, not the standard Normal, because $S$ introduces additional uncertainty. That $t$-correction is Topic 11’s Example 7, and the machinery of confidence intervals will develop it systematically.

Remark 3 Standard error vs. standard deviation: precision about precision

It is easy to conflate “standard deviation” and “standard error.” The difference is crucial:

  • The standard deviation $\sigma$ describes the spread of the population. It does not shrink as you collect more data.
  • The standard error $\sigma/\sqrt{n}$ describes the spread of the estimator. It shrinks like $1/\sqrt{n}$.

When you report a sample mean as “$\bar{X}_n = 4.72 \pm 0.40$,” the $\pm 0.40$ is the standard error — how uncertain you are about the mean. When you report that your data has standard deviation $2.0$, that is the standard deviation — how spread out the individual observations are. SE says how precisely you know the average; SD says how dispersed the raw data are. SE is precision about precision — a derived uncertainty.
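A small simulation of the contrast (population and sample sizes are illustrative choices of ours): as $n$ grows, the average sample SD stays near $\sigma = 2$ while the SE of the mean shrinks like $1/\sqrt{n}$:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 2.0

for n in (25, 100, 400):
    samples = rng.normal(5.0, sigma, size=(5_000, n))
    sd = samples.std(axis=1, ddof=1).mean()  # average sample SD: stays near sigma
    se = samples.mean(axis=1).std()          # spread of the sample means: shrinks
    print(n, round(sd, 2), round(se, 2))     # se is roughly sigma / sqrt(n)
```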

13.4 Mean Squared Error and the Bias-Variance Decomposition

Bias and variance are both measures of estimator quality, and they are in tension: biased estimators can have lower variance, and vice versa. We need a single scalar that combines both. The natural choice — and the one whose properties organize the entire remainder of this topic — is the mean squared error.

Definition 6 Mean Squared Error (MSE)

The mean squared error of an estimator $\hat{\theta}_n$ of $\theta$ is the expected squared deviation from the truth:

$$\operatorname{MSE}(\hat{\theta}_n) = \mathbb{E}\!\left[(\hat{\theta}_n - \theta)^2\right].$$

The expectation is taken over the sampling distribution of $\hat{\theta}_n$, treating $\theta$ as fixed.

MSE is the squared analog of bias — instead of asking “how far off is the estimator on average?” it asks “how far off squared is the estimator on average?” Squaring turns the signed error into a positive quantity and penalizes large deviations more severely, which makes MSE a natural loss function. Crucially, MSE decomposes into a systematic part (bias squared) and a random part (variance), making the tradeoff between the two explicit.

Three-panel MSE decomposition: (left) stacked bar chart of Bias² + Var = MSE for four estimators (mean, median, midrange, shrunk mean); (center) MSE, Bias², and Var curves as a function of shrinkage factor c in θ̂ = cX̄, with optimal c* annotated; (right) stacked area decomposition of the same sweep
Theorem 3 MSE decomposition

For any estimator $\hat{\theta}_n$ of $\theta$ with finite second moment,

$$\operatorname{MSE}(\hat{\theta}_n) = \operatorname{Bias}(\hat{\theta}_n)^2 + \operatorname{Var}(\hat{\theta}_n).$$

Equivalently, $\mathbb{E}[(\hat{\theta}_n - \theta)^2]$ splits cleanly into a mean-offset term and a dispersion term.

Proof.

The trick is to add and subtract the mean $\mathbb{E}[\hat{\theta}_n]$ inside $(\hat{\theta}_n - \theta)$:

$$\hat{\theta}_n - \theta = \big(\hat{\theta}_n - \mathbb{E}[\hat{\theta}_n]\big) + \big(\mathbb{E}[\hat{\theta}_n] - \theta\big).$$

The first parenthesis is the random deviation of $\hat{\theta}_n$ from its mean — its variance comes from here. The second parenthesis is the systematic deviation of the mean from the truth — the bias. Square both sides:

$$(\hat{\theta}_n - \theta)^2 = \big(\hat{\theta}_n - \mathbb{E}[\hat{\theta}_n]\big)^2 + 2\,\big(\hat{\theta}_n - \mathbb{E}[\hat{\theta}_n]\big)\big(\mathbb{E}[\hat{\theta}_n] - \theta\big) + \big(\mathbb{E}[\hat{\theta}_n] - \theta\big)^2.$$

Now take expectations. The first term becomes $\operatorname{Var}(\hat{\theta}_n)$ by definition:

$$\mathbb{E}\!\left[\big(\hat{\theta}_n - \mathbb{E}[\hat{\theta}_n]\big)^2\right] = \operatorname{Var}(\hat{\theta}_n).$$

The middle term vanishes. To see this, pull out $\big(\mathbb{E}[\hat{\theta}_n] - \theta\big)$, which is a constant with respect to the expectation, leaving $\mathbb{E}\big[\hat{\theta}_n - \mathbb{E}[\hat{\theta}_n]\big] = 0$:

$$\mathbb{E}\!\left[2\,\big(\hat{\theta}_n - \mathbb{E}[\hat{\theta}_n]\big)\big(\mathbb{E}[\hat{\theta}_n] - \theta\big)\right] = 2\,\big(\mathbb{E}[\hat{\theta}_n] - \theta\big) \cdot \mathbb{E}\!\left[\hat{\theta}_n - \mathbb{E}[\hat{\theta}_n]\right] = 0.$$

The third term is already deterministic — it is the square of the bias:

$$\mathbb{E}\!\left[\big(\mathbb{E}[\hat{\theta}_n] - \theta\big)^2\right] = \operatorname{Bias}(\hat{\theta}_n)^2.$$

Collecting:

$$\operatorname{MSE}(\hat{\theta}_n) = \operatorname{Var}(\hat{\theta}_n) + \operatorname{Bias}(\hat{\theta}_n)^2. \qquad\blacksquare$$

The proof is the same “add and subtract the mean” maneuver that powered the variance decomposition in Topic 4 (Eve’s law), and it gives the central identity of estimation theory. An unbiased estimator pays all of its MSE as variance. A highly precise but biased estimator pays most of its MSE as bias squared. Minimizing MSE means navigating the tradeoff between the two — which is what the rest of this topic is about.
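Because the decomposition is an algebraic identity, it can be checked to machine precision on simulated data: the empirical MSE of any batch of estimates equals the squared empirical bias plus the empirical (population-style) variance. A sketch using the biased sample variance $S_n^2$ as the estimator (parameter values are our choice):

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps, sigma2 = 10, 200_000, 4.0
theta = sigma2                                                 # true parameter

est = rng.normal(0.0, 2.0, size=(reps, n)).var(axis=1, ddof=0)  # biased S_n^2

mse = ((est - theta) ** 2).mean()
bias = est.mean() - theta
var = est.var()                # ddof=0, matching the identity exactly

print(mse, bias**2 + var)      # identical up to floating-point rounding
```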

Three-panel bias-variance tradeoff for polynomial regression: degree 1 (underfitting, high bias), degree 5 (good fit), degree 15 (overfitting, high variance). Each panel shows 50 training-set fits overlaid, the average fit E[f̂(x)] in bold, and the Bias²/Var/MSE values
Example 5 MSE comparison: biased vs. unbiased sample variance

From Example 3, the two candidates for estimating $\sigma^2$ are:

  • $S^2 = \frac{1}{n-1}\sum(X_i - \bar{X}_n)^2$, unbiased.
  • $S_n^2 = \frac{1}{n}\sum(X_i - \bar{X}_n)^2 = \frac{n-1}{n}S^2$, biased with $\mathbb{E}[S_n^2] = \frac{n-1}{n}\sigma^2$.

For iid Normal data, it can be shown that $\operatorname{Var}(S^2) = \frac{2\sigma^4}{n-1}$ and $\operatorname{Var}(S_n^2) = \frac{2(n-1)\sigma^4}{n^2}$. Plugging into the MSE decomposition:

$$\operatorname{MSE}(S^2) = 0 + \frac{2\sigma^4}{n-1} = \frac{2\sigma^4}{n-1},$$
$$\operatorname{MSE}(S_n^2) = \left(\frac{\sigma^2}{n}\right)^2 + \frac{2(n-1)\sigma^4}{n^2} = \frac{(2n-1)\sigma^4}{n^2}.$$

For $n = 10$: $\operatorname{MSE}(S^2) = 2\sigma^4/9 \approx 0.222\sigma^4$ vs. $\operatorname{MSE}(S_n^2) = 19\sigma^4/100 = 0.190\sigma^4$. The biased estimator has lower MSE. This is not a trick: Bessel’s correction removes bias at the cost of increased variance, and the tradeoff does not always favor unbiasedness. It turns out that $\hat{\sigma}^2_{\text{opt}} = \frac{1}{n+1}\sum(X_i - \bar{X}_n)^2$ — shrinking even more aggressively than the MLE — has smaller MSE still for Normal data. Minimum-MSE and unbiasedness are genuinely different objectives.

Example 6 Shrinkage preview — a biased estimator with smaller MSE

Consider estimating $\mu \in \mathbb{R}$ from iid $\mathcal{N}(\mu, \sigma^2)$ data with the family of shrinkage estimators

$$\hat{\mu}_c = c \cdot \bar{X}_n, \qquad c \in [0, 1].$$

At $c = 1$ we recover the sample mean (unbiased, MSE $= \sigma^2/n$). At $c = 0$ we get the trivial estimator that always returns zero (biased by $-\mu$, variance zero, MSE $= \mu^2$). In between, bias and variance trade off:

$$\operatorname{Bias}(\hat{\mu}_c) = (c - 1)\mu, \qquad \operatorname{Var}(\hat{\mu}_c) = c^2 \frac{\sigma^2}{n},$$
$$\operatorname{MSE}(\hat{\mu}_c) = (c-1)^2 \mu^2 + c^2 \frac{\sigma^2}{n}.$$

Differentiating with respect to $c$ and setting to zero gives the MSE-optimal shrinkage

$$c^* = \frac{\mu^2}{\mu^2 + \sigma^2/n}.$$

Note that $c^* < 1$ strictly whenever $\sigma^2/n > 0$ — which it always is. So the sample mean is never MSE-optimal in this family; there always exists a shrunk version that strictly dominates it. The catch: $c^*$ depends on $\mu$ itself, so we cannot use it as a standalone estimator. But this example is the seed of every shrinkage, regularization, and empirical-Bayes method: allowing bias opens a strictly better option in terms of MSE.
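A numeric check of the closed form (the parameter values $\mu = 5$, $\sigma^2 = 4$, $n = 25$ are illustrative): minimize $\operatorname{MSE}(\hat{\mu}_c)$ over a fine grid of $c$ and compare the grid minimizer with $c^*$:

```python
import numpy as np

mu, sigma2, n = 5.0, 4.0, 25

def mse(c):
    # MSE of the shrinkage estimator c * Xbar_n, from the decomposition
    return (c - 1) ** 2 * mu**2 + c**2 * sigma2 / n

c_grid = np.linspace(0, 1, 100_001)
c_best = c_grid[np.argmin(mse(c_grid))]   # numeric minimizer over the grid
c_star = mu**2 / (mu**2 + sigma2 / n)     # closed form from the text

print(c_best, c_star)           # both near 0.9936
print(mse(1.0), mse(c_star))    # MSE at c* is strictly below sigma^2/n = 0.16
```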

Remark 4 The bias-variance tradeoff as the fundamental tension in estimation

Theorem 3 makes the bias-variance tradeoff precise: MSE has two components, and reducing one often increases the other. This is not a metaphor — it is an algebraic identity. Every strategy for minimizing MSE must navigate this tension:

  • More data ($n \uparrow$): variance shrinks like $\sigma^2/n$, bias typically unaffected. So MSE is dominated by bias at large $n$ — unbiased consistent estimators eventually win.
  • Simpler models / more shrinkage: bias increases, variance decreases. Good for small $n$ when variance dominates.
  • More flexible models / less shrinkage: bias decreases, variance increases. Good for large $n$ when bias dominates.

In an ML context, this is the polynomial-fitting story: a degree-1 line is highly biased but low-variance; a degree-15 polynomial is nearly unbiased but enormously variable. Regularization, cross-validation, and model selection all exist to find the sweet spot. Section 13.9 develops the ML connection in detail.

Interactive: Bias-Variance-MSE Explorer
Watch Bias² + Var = MSE trace out its U-shape as you slide across shrinkage, ridge penalty, or polynomial degree. Gold star marks the MSE-optimal setting.

13.5 Consistency

The properties so far — bias, variance, MSE — are finite-sample statements: they hold for every $n$. The next two sections zoom out and ask: what happens as $n \to \infty$? Does the estimator settle down to the truth? Does its sampling distribution take a recognizable shape? Consistency answers the first question, and we already have all the machinery we need — the law of large numbers does almost all of the work.

Definition 7 Consistency (in probability)

An estimator $\hat{\theta}_n$ is consistent for $\theta$ if, as $n \to \infty$, it converges to $\theta$ in probability: for every $\varepsilon > 0$,

$$\lim_{n \to \infty} P\!\left(\left|\hat{\theta}_n - \theta\right| > \varepsilon\right) = 0.$$

Equivalently, $\hat{\theta}_n \overset{P}{\to} \theta$.

Consistency is the minimal requirement for an estimator to be useful at large sample sizes. If $\hat{\theta}_n$ is inconsistent, then even infinite data will not push it to the right answer; no amount of precision can compensate for a procedure that is aimed at the wrong target asymptotically. Unbiasedness and consistency are logically independent (Remark 5), so we need both words.

Definition 8 Strong Consistency (almost sure)

An estimator $\hat{\theta}_n$ is strongly consistent for $\theta$ if $\hat{\theta}_n \overset{a.s.}{\to} \theta$: the set of sample sequences on which the running estimates fail to converge to $\theta$ has probability zero.

Strong consistency implies consistency (almost-sure convergence implies convergence in probability, per Topic 9), so whenever the strong law of large numbers applies we get strong consistency “for free.” The distinction matters most when we track a single infinite sequence of data, as a streaming algorithm would — strong consistency guarantees the stream’s running estimate eventually stays close to the truth, not just that any given sample-size-$n$ snapshot is likely to be close.

Theorem 4 MSE → 0 implies consistency

If $\operatorname{MSE}(\hat{\theta}_n) \to 0$ as $n \to \infty$, then $\hat{\theta}_n$ is consistent for $\theta$.

Proof.

By Chebyshev’s inequality applied to the random variable $\hat{\theta}_n - \theta$, for any $\varepsilon > 0$:

$$P\!\left(\left|\hat{\theta}_n - \theta\right| > \varepsilon\right) \leq \frac{\mathbb{E}[(\hat{\theta}_n - \theta)^2]}{\varepsilon^2} = \frac{\operatorname{MSE}(\hat{\theta}_n)}{\varepsilon^2}.$$

By hypothesis $\operatorname{MSE}(\hat{\theta}_n) \to 0$, so the right-hand side tends to $0$ for every fixed $\varepsilon > 0$. Therefore

$$\lim_{n \to \infty} P\!\left(\left|\hat{\theta}_n - \theta\right| > \varepsilon\right) = 0,$$

which is the definition of consistency. $\blacksquare$

Because MSE decomposes as bias² + variance, Theorem 4 gives a two-part recipe: if $\operatorname{Bias}(\hat{\theta}_n) \to 0$ and $\operatorname{Var}(\hat{\theta}_n) \to 0$, then $\hat{\theta}_n$ is consistent. This is the most common way to prove consistency in practice — check that bias vanishes and variance shrinks. For the sample mean, bias is zero for all $n$ and variance is $\sigma^2/n \to 0$, so consistency is immediate; the LLN route in Theorem 5 below gives the even stronger almost-sure version.

Theorem 5 Consistency of the sample mean (via LLN)

If $X_1, X_2, \ldots$ are iid with $\mathbb{E}[|X_1|] < \infty$ and $\mu = \mathbb{E}[X_1]$, then the sample mean is strongly consistent for $\mu$:

$$\bar{X}_n \overset{a.s.}{\to} \mu.$$

If only $\mathbb{E}[X_1^2] < \infty$ is known, the weaker statement $\bar{X}_n \overset{P}{\to} \mu$ follows from Chebyshev (Theorem 4 above) or from the weak law of large numbers.

This is the strong law of large numbers from Topic 10, restated in estimator language. The sample mean is the prototypical consistent estimator, and the LLN is the engine: consistency of the sample mean is literally the LLN under a different name. Every estimator that can be written as (or derived from) a sample mean inherits consistency from the LLN — this is the template that Maximum Likelihood Estimation uses to prove the MLE is consistent, and Topic 15 uses to handle method-of-moments estimators.
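A one-stream sketch of strong consistency (the distribution, seed, and checkpoints are our choices): the running mean of a single simulated sequence settles onto $\mu$:

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma = 5.0, 2.0
x = rng.normal(mu, sigma, size=100_000)

# running sample mean along one data stream: Xbar_1, Xbar_2, ...
running = np.cumsum(x) / np.arange(1, len(x) + 1)

for n in (10, 1_000, 100_000):
    print(n, running[n - 1])   # the error shrinks roughly like sigma / sqrt(n)
```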

Three-panel consistency trajectories at log-scale n: (left) 15 paths of X̄ₙ converging to μ (consistent); (center) 15 paths of S²ₙ converging to σ² (consistent); (right) 15 flat paths of θ̂ = X₁, which ignores all but the first observation and does not converge
Example 7 The sample variance is consistent

Let $X_1, \ldots, X_n$ be iid with $\mathbb{E}[X_1^4] < \infty$ (a fourth-moment condition, stronger than what Theorem 5 required). The unbiased sample variance $S^2 = \frac{1}{n-1}\sum(X_i - \bar{X}_n)^2$ is consistent for $\sigma^2 = \operatorname{Var}(X_1)$.

Sketch. Expand

$$S^2 = \frac{n}{n-1}\!\left(\frac{1}{n}\sum_{i=1}^n X_i^2 - \bar{X}_n^2\right).$$

By the LLN applied to $\{X_i^2\}$, the first average converges almost surely to $\mathbb{E}[X_1^2]$. By the LLN applied to $\{X_i\}$ and the continuous mapping theorem, $\bar{X}_n^2 \overset{a.s.}{\to} \mu^2$. The factor $n/(n-1) \to 1$. Combining:

$$S^2 \overset{a.s.}{\to} \mathbb{E}[X_1^2] - \mu^2 = \sigma^2.$$

Consistency of the biased version $S_n^2 = \frac{n-1}{n}S^2$ follows immediately by the same argument (multiplying by $(n-1)/n \to 1$). Both divisors yield consistent estimators of $\sigma^2$; they differ only in finite-sample MSE.

Example 8 An inconsistent estimator: keep only the first observation

Consider the estimator $\tilde{\mu} = X_1$ of the population mean $\mu$. It is unbiased ($\mathbb{E}[X_1] = \mu$), but it ignores all but the first observation — no matter how many data points you collect, the estimate never updates after $n = 1$.

Formally: $\tilde{\mu}$ does not depend on $n$ at all, so $P(|\tilde{\mu} - \mu| > \varepsilon) = P(|X_1 - \mu| > \varepsilon)$ — a fixed, positive number for any $\varepsilon < \sigma$ and any non-degenerate distribution. The probability does not shrink with $n$. Therefore $\tilde{\mu}$ is inconsistent, despite being unbiased.

This is the cleanest demonstration that unbiasedness and consistency are independent properties. $\tilde{\mu}$ has $\operatorname{MSE} = \operatorname{Var}(X_1) = \sigma^2$ for every $n$, which does not shrink, so Theorem 4 does not kick in.
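Simulating the contrast directly (parameters are our own choices): the miss probability $P(|\hat{\theta} - \mu| > \varepsilon)$ vanishes for $\bar{X}_n$ as $n$ grows but is flat for $\tilde{\mu} = X_1$:

```python
import numpy as np

rng = np.random.default_rng(6)
mu, sigma, eps, reps = 5.0, 2.0, 0.5, 20_000

for n in (1, 25, 400):
    xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
    first = rng.normal(mu, sigma, size=reps)     # the keep-only-X1 estimator

    p_xbar = (np.abs(xbar - mu) > eps).mean()    # -> 0 as n grows (consistent)
    p_first = (np.abs(first - mu) > eps).mean()  # stays near 0.80 at every n
    print(n, p_xbar, p_first)
```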

Remark 5 Consistency vs. unbiasedness: independent criteria

Examples 7 and 8 make the point: consistency and unbiasedness are logically independent. Exhaustive possibilities:

  • Both: sample mean, unbiased sample variance. The ideal case.
  • Unbiased, inconsistent: μ~=X1\tilde{\mu} = X_1. Unbiased but fails to improve with more data.
  • Biased, consistent: biased sample variance Sn2S_n^2. Its bias shrinks as nn \to \infty (n1nσ2σ2\frac{n-1}{n}\sigma^2 \to \sigma^2), so it is asymptotically unbiased; with vanishing variance, it is consistent.
  • Neither: the constant estimator θ~=7\tilde{\theta} = 7 (or any fixed value unrelated to θ\theta). Ignores the data, biased except by coincidence, not consistent.

For practical purposes, consistency is the more important of the two: we typically want an estimator that eventually gets the right answer, and we tolerate small-sample bias as long as it washes out. Unbiasedness is sometimes useful as a finite-sample property (e.g. for building exact confidence intervals under Normal assumptions), but asymptotic unbiasednessBias(θ^n)0\operatorname{Bias}(\hat{\theta}_n) \to 0 — is what modern statistics actually requires most of the time.

Interactive: Consistency Explorer
Simulate 12 independent sample paths and watch the running estimator trajectories. Consistent estimators collapse to θ as n grows; inconsistent estimators do not.

13.6 Asymptotic Normality

Consistency tells us θ^n\hat{\theta}_n eventually lands near θ\theta. Asymptotic normality tells us what its shape is near θ\theta: for large nn, the sampling distribution is approximately Gaussian, with a known variance. This is what powers confidence intervals, z-tests, and t-tests. The central limit theorem is the engine here, and — as with consistency and the LLN — most of the work is already done.

Definition 9 Asymptotically Normal Estimator

The estimator θ^n\hat{\theta}_n is asymptotically normal with asymptotic variance v(θ)v(\theta) if

n(θ^nθ)dN ⁣(0,v(θ)),\sqrt{n}\,(\hat{\theta}_n - \theta) \overset{d}{\to} \mathcal{N}\!\left(0, v(\theta)\right),

where d\overset{d}{\to} denotes convergence in distribution. Informally, for large nn, θ^nN(θ,v(θ)/n)\hat{\theta}_n \approx \mathcal{N}(\theta, v(\theta)/n).

The n\sqrt{n} factor is essential. Without it, θ^nθ0\hat{\theta}_n - \theta \to 0 (by consistency) and the limit distribution would be a point mass at zero — degenerate and useless. The n\sqrt{n} scales the fluctuations up at the right rate so the limit is non-degenerate. The asymptotic variance v(θ)v(\theta) is the “per-sample” variance at the 1/n1/\sqrt{n} scale — what the histogram of n(θ^nθ)\sqrt{n}(\hat{\theta}_n - \theta) converges to.

Theorem 6 Asymptotic normality of the sample mean

If X1,X2,X_1, X_2, \ldots are iid with Var(X1)=σ2(0,)\operatorname{Var}(X_1) = \sigma^2 \in (0, \infty), then the sample mean is asymptotically normal with asymptotic variance σ2\sigma^2:

n(Xˉnμ)dN(0,σ2).\sqrt{n}\,(\bar{X}_n - \mu) \overset{d}{\to} \mathcal{N}(0, \sigma^2).

This is the classical CLT of Topic 11, restated as an estimator result. The only change is language: instead of “sums and averages converge to a Gaussian,” we say “the sample mean is an asymptotically normal estimator of μ\mu with asymptotic variance σ2\sigma^2.” The quantities are identical; the framing is new.

Three-panel asymptotic normality: histograms of √n(X̄ₙ − μ)/σ for Exp(1) at n=10, 50, 500, with N(0,1) overlay and Kolmogorov–Smirnov statistic annotations showing convergence to standard Normal
Theorem 7 Delta method for smooth transformations

Let θ^n\hat{\theta}_n be asymptotically normal: n(θ^nθ)dN(0,v(θ))\sqrt{n}(\hat{\theta}_n - \theta) \overset{d}{\to} \mathcal{N}(0, v(\theta)). Let g:RRg : \mathbb{R} \to \mathbb{R} be differentiable at θ\theta with g(θ)0g'(\theta) \neq 0. Then g(θ^n)g(\hat{\theta}_n) is asymptotically normal for g(θ)g(\theta):

n(g(θ^n)g(θ))dN ⁣(0,  g(θ)2v(θ)).\sqrt{n}\,\big(g(\hat{\theta}_n) - g(\theta)\big) \overset{d}{\to} \mathcal{N}\!\left(0, \; g'(\theta)^2 \cdot v(\theta)\right).

The delta method is the workhorse for deriving asymptotic distributions of non-linear functions of estimators. It says that smooth transformations inherit asymptotic normality, with variance rescaled by the squared derivative of the transformation — a first-order Taylor expansion at the asymptotic regime. The proof is in Topic 11 (Theorem 11.8); we will apply it repeatedly in Topics 14 and 15 to handle things like 1/Xˉn1/\bar{X}_n (when estimating a rate) and Xˉn2\bar{X}_n^2 (when estimating a squared quantity).
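The $1/\bar{X}_n$ case mentioned above can be checked directly. A Monte Carlo sketch under illustrative assumptions (Exp(1) data, so the rate is $\lambda = 1$; with $g(m) = 1/m$ and $g'(1/\lambda) = -\lambda^2$, the delta method predicts asymptotic variance $\lambda^4 \cdot (1/\lambda^2) = \lambda^2 = 1$):

```python
import numpy as np

# Delta-method check: X_i ~ Exp(rate 1), X_bar estimates the mean 1/lambda,
# and g(m) = 1/m recovers the rate.  Predicted asymptotic variance of
# sqrt(n)(1/X_bar - lambda) is lambda^2 = 1.  n and reps are illustrative.
rng = np.random.default_rng(2)
n, reps, lam = 2000, 4000, 1.0

xbar = rng.exponential(1.0 / lam, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (1.0 / xbar - lam)
print(z.mean(), z.var())  # mean near 0, variance near lambda^2 = 1
```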

Example 9 Asymptotic distribution of the sample variance via the delta method

We want the limit distribution of Sn2=1n(XiXˉn)2S_n^2 = \frac{1}{n}\sum(X_i - \bar{X}_n)^2 for iid data with finite fourth moment. Write the sample variance as a function of two sample means: Sn2=X2nXˉn2S_n^2 = \overline{X^2}_n - \bar{X}_n^2 where X2n=1nXi2\overline{X^2}_n = \frac{1}{n}\sum X_i^2.

By the multivariate CLT applied to the random vector (Xi,Xi2)(X_i, X_i^2), the vector (Xˉn,X2n)\big(\bar{X}_n, \overline{X^2}_n\big) is jointly asymptotically Normal around (μ,E[X12])(\mu, \mathbb{E}[X_1^2]) with a covariance matrix determined by the first four moments of X1X_1. Applying the delta method to g(a,b)=ba2g(a, b) = b - a^2:

n(Sn2σ2)dN ⁣(0,  μ4σ4),\sqrt{n}\,(S_n^2 - \sigma^2) \overset{d}{\to} \mathcal{N}\!\left(0, \; \mu_4 - \sigma^4\right),

where μ4=E[(X1μ)4]\mu_4 = \mathbb{E}[(X_1 - \mu)^4] is the central fourth moment. For Normal data, μ4=3σ4\mu_4 = 3\sigma^4, so the asymptotic variance simplifies to 2σ42\sigma^4 — which matches the exact (small-sample) variance formula from Example 5 after the 1/n1/n rescaling. For non-Normal data, the formula picks up the excess kurtosis: heavier tails mean the sample variance has larger asymptotic variance, because extreme observations dominate the squared residuals.
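The $\mu_4 - \sigma^4$ formula can be verified for a skewed population. A sketch assuming Exp(1) data (for which $\mu = 1$, $\sigma^2 = 1$, and the central fourth moment is $\mu_4 = 9$, so the predicted asymptotic variance is $9 - 1 = 8$); the sample sizes are illustrative:

```python
import numpy as np

# Check the asymptotic variance of the sample variance for Exp(1):
# n * Var(S_n^2) should approach mu_4 - sigma^4 = 9 - 1 = 8.
rng = np.random.default_rng(3)
n, reps = 2000, 3000

x = rng.exponential(1.0, size=(reps, n))
s2 = x.var(axis=1)                      # biased version S_n^2 (divisor n)
z = np.sqrt(n) * (s2 - 1.0)
print(z.var())  # close to mu_4 - sigma^4 = 8
```

Note the answer is 8, not the Normal-theory value 2σ⁴ = 2 — the excess kurtosis of the exponential inflates it, exactly as the text describes.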

Remark 6 Asymptotic efficiency and relative efficiency (ARE)

Asymptotic normality gives us an apples-to-apples way to compare estimators: look at their asymptotic variances. For two asymptotically normal estimators θ^1,n\hat{\theta}_{1,n} and θ^2,n\hat{\theta}_{2,n} of the same parameter, the asymptotic relative efficiency is

ARE(θ^1;θ^2)  =  v2(θ)v1(θ),\operatorname{ARE}(\hat{\theta}_1 ; \hat{\theta}_2) \;=\; \frac{v_2(\theta)}{v_1(\theta)},

the ratio of their asymptotic variances (in the order that makes values above 11 mean ”θ^1\hat{\theta}_1 wins”). For Normal data, the sample median has ARE 2/π0.6372/\pi \approx 0.637 relative to the sample mean — meaning 1,000 observations summarized by the mean give about the same precision as π/2×1,0001,570\pi/2 \times 1{,}000 \approx 1{,}570 summarized by the median. For Cauchy data (where the sample mean has no well-defined asymptotic variance: Xˉn\bar{X}_n has the same distribution as a single observation), the comparison flips: the median is the far better choice and the mean is catastrophic. Asymptotic efficiency is distribution-dependent; efficient-for-Normal and efficient-for-Cauchy are different estimators.
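The π/2 figure for Normal data can be reproduced by simulation. A sketch with illustrative choices of n and replication count:

```python
import numpy as np

# Monte Carlo ARE check for Normal data: the variance of the sample median
# should be about pi/2 times that of the sample mean.
rng = np.random.default_rng(4)
n, reps = 1001, 6000            # odd n so the median is a single order statistic

x = rng.normal(0.0, 1.0, size=(reps, n))
v_mean = x.mean(axis=1).var()
v_med  = np.median(x, axis=1).var()
ratio = v_med / v_mean
print(ratio)  # close to pi/2 ≈ 1.571
```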

The next section, on the Cramér–Rao bound, quantifies the ceiling on achievable asymptotic variance — and therefore the ceiling on asymptotic efficiency.

13.7 Efficiency and the Cramér-Rao Lower Bound

We have seen that bias and variance trade off, and that asymptotic variance gives us a clean way to compare estimators. The natural next question: how small can the variance of an unbiased estimator possibly be? Is there a fundamental floor, or can clever procedures improve without limit? The answer is the Cramér–Rao lower bound — the most important non-trivial inequality in estimation theory. It says that the variance of any unbiased estimator is bounded below by a quantity depending only on the model (the distribution family) and the parameter — the Fisher information.

Definition 10 Score Function

Let {f(x;θ):θΘ}\{f(x; \theta) : \theta \in \Theta\} be a parametric family of densities (or PMFs) with ΘR\Theta \subseteq \mathbb{R}. The score function is the partial derivative of the log-likelihood with respect to the parameter:

s(θ;x)  =  θlogf(x;θ).s(\theta; x) \;=\; \frac{\partial}{\partial \theta} \log f(x; \theta).

For an iid sample, the total score is the sum of per-observation scores: S(θ)=i=1ns(θ;Xi)S(\theta) = \sum_{i=1}^n s(\theta; X_i).

The score is the “sensitivity of the log-likelihood to the parameter, at a given data point.” It is a random variable (depends on XX); for a well-specified model, its expected value at the true parameter is zero. The maximum-likelihood estimator solves the first-order condition S(θ)=0S(\theta) = 0: at an interior maximum of the likelihood, the total score vanishes. Score-based methods (score tests, score matching, efficient-score GLMs) pervade modern statistics.

Definition 11 Fisher Information

The Fisher information at θ\theta is the variance of the score:

I(θ)  =  Var ⁣(s(θ;X))  =  E ⁣[(θlogf(X;θ))2].I(\theta) \;=\; \operatorname{Var}\!\big(s(\theta; X)\big) \;=\; \mathbb{E}\!\left[\left(\frac{\partial}{\partial \theta}\log f(X; \theta)\right)^2\right].

Under regularity conditions, an equivalent form is the negative expected second derivative of the log-likelihood:

I(θ)  =  E ⁣[2θ2logf(X;θ)].I(\theta) \;=\; -\mathbb{E}\!\left[\frac{\partial^2}{\partial \theta^2}\log f(X; \theta)\right].

Fisher information measures how much information one observation carries about θ\theta — equivalently, how curved the log-likelihood is near the true parameter. Sharp curvature (high I(θ)I(\theta)) means the log-likelihood is a steep parabola near its peak: small changes in θ\theta produce large changes in log-likelihood, so the data is highly informative about θ\theta. Flat curvature (low I(θ)I(\theta)) means the log-likelihood is nearly level: data poorly discriminates among candidate θ\theta‘s. Fisher information is additive for iid samples: nn observations carry nI(θ)nI(\theta) total information.

Three-panel Fisher information: (left) score function curves s(μ; x) = (x − μ)/σ² for N(μ, 1) at several x values; (center) horizontal bar chart of I(μ) = 1/σ² for different σ, showing constant information in μ; (right) log-likelihood curvature for Bernoulli with quadratic Fisher-info approximation overlay
Theorem 8 The score has mean zero and variance equal to Fisher information

Under regularity conditions that allow interchange of differentiation and integration,

E[s(θ;X)]  =  0,Var(s(θ;X))  =  I(θ).\mathbb{E}[s(\theta; X)] \;=\; 0, \qquad \operatorname{Var}(s(\theta; X)) \;=\; I(\theta).
Proof [show]

Start from the fact that f(;θ)f(\cdot; \theta) is a density for every θ\theta, so it integrates to 11:

f(x;θ)dx  =  1.\int f(x; \theta)\, dx \;=\; 1.

Differentiate both sides in θ\theta. The regularity conditions are exactly what justify swapping the derivative and the integral:

θ ⁣f(x;θ)dx  =  θf(x;θ)dx  =  0.\frac{\partial}{\partial \theta}\!\int f(x; \theta)\, dx \;=\; \int \frac{\partial}{\partial \theta} f(x; \theta)\, dx \;=\; 0.

Rewrite f/θ=flogf/θ=fs(θ;x)\partial f/\partial \theta = f \cdot \partial \log f / \partial \theta = f \cdot s(\theta; x):

s(θ;x)f(x;θ)dx  =  E[s(θ;X)]  =  0,\int s(\theta; x)\, f(x; \theta)\, dx \;=\; \mathbb{E}[s(\theta; X)] \;=\; 0,

which is the first claim. For the second, differentiate again:

(sθf  +  sfθ)dx  =  0.\int \left(\frac{\partial s}{\partial \theta} \cdot f \;+\; s \cdot \frac{\partial f}{\partial \theta}\right) dx \;=\; 0.

Substitute f/θ=sf\partial f / \partial \theta = s \cdot f into the second term:

sθfdx  +  s2fdx  =  0,\int \frac{\partial s}{\partial \theta}\, f \, dx \;+\; \int s^2 f \, dx \;=\; 0,

which says

E ⁣[sθ]  +  E[s2]  =  0.\mathbb{E}\!\left[\frac{\partial s}{\partial \theta}\right] \;+\; \mathbb{E}[s^2] \;=\; 0.

The second term is E[s2]=Var(s)+02=Var(s)\mathbb{E}[s^2] = \operatorname{Var}(s) + 0^2 = \operatorname{Var}(s) by the first claim. So

Var(s)  =  E ⁣[sθ]  =  E ⁣[2θ2logf(X;θ)]  =  I(θ).\operatorname{Var}(s) \;=\; -\mathbb{E}\!\left[\frac{\partial s}{\partial \theta}\right] \;=\; -\mathbb{E}\!\left[\frac{\partial^2}{\partial \theta^2}\log f(X; \theta)\right] \;=\; I(\theta). \quad\blacksquare

Theorem 8 gives the two definitions of I(θ)I(\theta) — as the variance of the score, or as the negative expected curvature — and shows they are equal. This is what makes Fisher information computable in practice: for many distributions, the second derivative of the log-likelihood is easier to evaluate than the squared first derivative.

Definition 12 Efficient Estimator

An unbiased estimator θ^n\hat{\theta}_n whose variance attains the Cramér–Rao lower bound (Theorem 9) for every θ\theta is called efficient. An asymptotically normal estimator whose asymptotic variance attains the asymptotic Cramér–Rao bound 1/I(θ)1/I(\theta) is called asymptotically efficient.

Theorem 9 Cramér–Rao lower bound

Let θ^n\hat{\theta}_n be an unbiased estimator of θ\theta based on an iid sample of size nn from a parametric family satisfying the regularity conditions. Then

Var(θ^n)    1nI(θ).\operatorname{Var}(\hat{\theta}_n) \;\geq\; \frac{1}{n\,I(\theta)}.
Proof [show]

The proof applies the Cauchy–Schwarz inequality to the covariance of θ^n\hat{\theta}_n with the total score S(θ)=i=1ns(θ;Xi)S(\theta) = \sum_{i=1}^n s(\theta; X_i).

Step 1. Unbiasedness says E[θ^n]=θ\mathbb{E}[\hat{\theta}_n] = \theta for every θ\theta, i.e. θ^n(x)f(x;θ)dx=θ\int \hat{\theta}_n(x)\, f(x; \theta)\, dx = \theta, where x=(x1,,xn)x = (x_1, \ldots, x_n) and f(x;θ)=if(xi;θ)f(x; \theta) = \prod_i f(x_i; \theta). Differentiate in θ\theta, swapping differentiation and integration by the regularity conditions:

1  =  θ^n(x)θlogf(x;θ)f(x;θ)dx  =  E ⁣[θ^nS(θ)].1 \;=\; \int \hat{\theta}_n(x) \cdot \frac{\partial}{\partial \theta}\log f(x;\theta) \cdot f(x;\theta)\, dx \;=\; \mathbb{E}\!\left[\hat{\theta}_n \cdot S(\theta)\right].

Step 2. Because E[S(θ)]=0\mathbb{E}[S(\theta)] = 0 (Theorem 8 applied to each term and summed), the product E[θ^nS(θ)]\mathbb{E}[\hat{\theta}_n S(\theta)] equals the covariance:

Cov(θ^n,S(θ))  =  E[θ^nS(θ)]E[θ^n]E[S(θ)]  =  1θ0  =  1.\operatorname{Cov}(\hat{\theta}_n, S(\theta)) \;=\; \mathbb{E}[\hat{\theta}_n S(\theta)] - \mathbb{E}[\hat{\theta}_n]\,\mathbb{E}[S(\theta)] \;=\; 1 - \theta \cdot 0 \;=\; 1.

Step 3. Apply the Cauchy–Schwarz inequality to the covariance:

Cov(θ^n,S(θ))2    Var(θ^n)Var(S(θ)).\operatorname{Cov}(\hat{\theta}_n, S(\theta))^2 \;\leq\; \operatorname{Var}(\hat{\theta}_n)\cdot \operatorname{Var}(S(\theta)).

By independence of the XiX_i, Var(S(θ))=iVar(s(θ;Xi))=nI(θ)\operatorname{Var}(S(\theta)) = \sum_i \operatorname{Var}(s(\theta; X_i)) = n\,I(\theta). Substituting from Step 2:

1    Var(θ^n)nI(θ).1 \;\leq\; \operatorname{Var}(\hat{\theta}_n) \cdot n\,I(\theta).

Step 4. Rearranging,

Var(θ^n)    1nI(θ).\operatorname{Var}(\hat{\theta}_n) \;\geq\; \frac{1}{n\,I(\theta)}. \quad\blacksquare

The Cramér–Rao proof is a masterclass in statistical reasoning: an identity (Cov(θ^n,S)=1\operatorname{Cov}(\hat{\theta}_n, S) = 1 for unbiased estimators) plus an inner-product inequality (Cauchy–Schwarz) plus independence-additivity of variance yield the floor on estimator precision. The bound is achieved — the inequality becomes equality — when θ^nθ\hat{\theta}_n - \theta is proportional to S(θ)S(\theta), which happens for sample means of distributions in exponential families. This is the link between the Cramér–Rao bound, efficient estimation, and exponential families that Topic 16 will develop in full.

Three-panel CRLB visualization: (left) CRLB curve σ²/n with mean, median, and trimmed-mean variances as scatter points — only the mean hits the bound; (center) asymptotic relative efficiency bar chart for Normal/Cauchy/Exponential/Uniform; (right) total Fisher information nI(θ) accumulating linearly for three distribution families
Example 10 Fisher information for Normal, Bernoulli, and Exponential

Three canonical computations using I(θ)=E[2logf/θ2]I(\theta) = -\mathbb{E}[\partial^2 \log f / \partial \theta^2]:

Normal N(μ,σ2)\mathcal{N}(\mu, \sigma^2) with respect to μ\mu (known σ2\sigma^2): logf(x;μ)=12log(2πσ2)(xμ)2/(2σ2)\log f(x; \mu) = -\tfrac{1}{2}\log(2\pi\sigma^2) - (x-\mu)^2/(2\sigma^2). Two derivatives: 2logf/μ2=1/σ2\partial^2 \log f / \partial \mu^2 = -1/\sigma^2. Taking the negative expectation (which is constant):

I(μ)  =  1σ2.I(\mu) \;=\; \frac{1}{\sigma^2}.

The Fisher information does not depend on μ\mu, only on σ2\sigma^2: narrow distributions carry more information per sample about their location.

Bernoulli(p)(p): logf(x;p)=xlogp+(1x)log(1p)\log f(x; p) = x \log p + (1-x) \log(1-p). Second derivative 2logf/p2=x/p2(1x)/(1p)2\partial^2 \log f / \partial p^2 = -x/p^2 - (1-x)/(1-p)^2. Expectation: p/p2(1p)/(1p)2=1/p1/(1p)=1/(p(1p))-p/p^2 - (1-p)/(1-p)^2 = -1/p - 1/(1-p) = -1/(p(1-p)). Negating:

I(p)  =  1p(1p).I(p) \;=\; \frac{1}{p(1-p)}.

Fisher information is maximized at p=1/2p = 1/2 (where variance is highest and xx is least predictable) and diverges as p0p \to 0 or p1p \to 1 — near the boundary, each observation is enormously informative about whether pp is exactly 00 or exactly 11.

Exponential(λ)(\lambda): logf(x;λ)=logλλx\log f(x; \lambda) = \log \lambda - \lambda x for x>0x > 0. Second derivative 1/λ2-1/\lambda^2. Negating:

I(λ)  =  1λ2.I(\lambda) \;=\; \frac{1}{\lambda^2}.

Larger rate λ\lambda means faster decay, which means observations cluster near zero and carry more information about λ\lambda.
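The Bernoulli computation doubles as a check of Theorem 8: the simulated score should have mean zero and variance equal to $I(p)$. A sketch at the illustrative value p = 0.3, where $1/(p(1-p)) \approx 4.76$:

```python
import numpy as np

# Monte Carlo check of the Bernoulli score s(p; x) = x/p - (1-x)/(1-p):
# mean 0 (Theorem 8) and variance I(p) = 1/(p(1-p)) ≈ 4.76 at p = 0.3.
rng = np.random.default_rng(5)
p, reps = 0.3, 200_000

x = rng.binomial(1, p, size=reps)
score = x / p - (1 - x) / (1 - p)
print(score.mean(), score.var())
```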

Example 11 The sample mean achieves the CRLB for the Normal mean

For iid N(μ,σ2)\mathcal{N}(\mu, \sigma^2) with known σ2\sigma^2, the sample mean is unbiased and has variance σ2/n\sigma^2/n (Theorem 2). From Example 10, I(μ)=1/σ2I(\mu) = 1/\sigma^2. The Cramér–Rao lower bound is

1nI(μ)  =  σ2n  =  Var(Xˉn).\frac{1}{n\,I(\mu)} \;=\; \frac{\sigma^2}{n} \;=\; \operatorname{Var}(\bar{X}_n).

Equality. The sample mean is the efficient estimator of the Normal mean — no unbiased estimator based on nn iid Normal observations can do better. This is the canonical result that motivates the CRLB in the first place, and the reason the sample mean is the point estimator you reach for whenever Normality is plausible. For non-Normal data the sample mean is still unbiased and consistent, but it may no longer be efficient — the median beats it for Cauchy, and trimmed means dominate for heavy-tailed distributions.

Remark 7 Regularity conditions and when the CRLB fails

The Cramér–Rao proof requires regularity conditions that allow interchange of differentiation and integration. Informally: the support of f(x;θ)f(x; \theta) must not depend on θ\theta, and logf/θ\partial \log f / \partial \theta must be well-behaved enough to differentiate under the integral. These conditions hold for essentially every standard family — Normal, Poisson, Bernoulli, Exponential, Gamma, Beta — but fail in a few important cases:

  • Uniform(0,θ)(0, \theta): the support is [0,θ][0, \theta], which depends on θ\theta. The MLE θ^=maxiXi\hat{\theta} = \max_i X_i has variance θ2/n2\sim \theta^2/n^2, faster than the 1/n1/n rate the CRLB would predict. The CRLB does not apply.
  • Shifted exponentials: f(x;θ)=e(xθ)f(x; \theta) = e^{-(x-\theta)} for xθx \geq \theta. Same issue — boundary of the support moves with θ\theta.
  • Families with infinite Fisher information at some parameter values: the bound is 1/=01/\infty = 0, which is trivially satisfied but uninformative.

When regularity holds, the CRLB is the ceiling: no unbiased estimator can beat 1/(nI(θ))1/(nI(\theta)). When regularity fails, different tools — the L1L_1 theory, order-statistics asymptotics (Topic 29 §29.6), the Hájek-Le Cam framework — take over. For standard statistical modeling, regularity holds and the CRLB is the right benchmark.
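The super-efficient $\theta^2/n^2$ rate for the Uniform MLE is easy to see numerically: $n^2 \cdot \operatorname{Var}(\max_i X_i)$ stays bounded instead of growing linearly in $n$, as it would for a regular $1/n$-rate estimator. A sketch with illustrative θ, replication count, and n grid:

```python
import numpy as np

# Rate check for Uniform(0, theta): Var(max X_i) scales like theta^2 / n^2,
# so n^2 * Var stays near theta^2 = 4 across the n grid.
rng = np.random.default_rng(6)
theta, reps = 2.0, 10_000

scaled = {}
for n in (25, 100, 400):
    mx = rng.uniform(0.0, theta, size=(reps, n)).max(axis=1)
    scaled[n] = n**2 * mx.var()   # bounded, approaching theta^2 = 4

print(scaled)
```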

Interactive: Fisher Information Explorer
Pick a family, adjust the parameter, and see the score function, log-likelihood curvature, and Fisher information side by side.

13.8 Comparing Estimators: Admissibility and Minimax

So far we have compared estimators by MSE at a single value of θ\theta. But θ\theta is unknown, and a procedure that is great at one value and terrible at others is not a good procedure. Decision theory — the framework that organizes this comparison — asks: is θ^1\hat{\theta}_1 uniformly at least as good as θ^2\hat{\theta}_2? When no estimator is uniformly best, how do we choose among incomparable ones? This section introduces the key concepts (admissibility, minimax) and then presents the famous result — Stein’s paradox — that the “natural” estimator is sometimes provably dominated by something stranger.

Definition 13 Admissibility

An estimator θ^1\hat{\theta}_1 is dominated by θ^2\hat{\theta}_2 if MSE(θ^2;θ)MSE(θ^1;θ)\operatorname{MSE}(\hat{\theta}_2; \theta) \leq \operatorname{MSE}(\hat{\theta}_1; \theta) for all θΘ\theta \in \Theta, with strict inequality for at least one θ\theta. An estimator that is not dominated by any other is admissible.

Admissibility is a weak criterion — it only says that no estimator is uniformly better. An admissible estimator can still be terrible (the constant estimator θ^7\hat{\theta} \equiv 7 is admissible in every problem: no other estimator can dominate it, because θ=7\theta = 7 is a point where “always guess 7” has zero MSE). Conversely, inadmissibility is a serious charge: if there exists a uniform improvement, any rational statistician should use it.

Definition 14 Minimax Estimator

An estimator θ^\hat{\theta} is minimax if it minimizes the maximum risk:

θ^min ⁣max  =  argminθ^supθΘMSE(θ^;θ).\hat{\theta}_{\min\!\max} \;=\; \arg\min_{\hat{\theta}} \sup_{\theta \in \Theta}\, \operatorname{MSE}(\hat{\theta}; \theta).

Minimax estimators are conservative: they trade off finite-sample performance at typical parameter values for worst-case robustness.

Minimax is a pessimistic criterion — minimize the damage in the worst case — and it frequently produces estimators that are not the best on average. Bayesian estimators are often preferable when some prior on θ\theta is available. In most applied work, MSE at a point estimate of θ\theta is the right loss and minimax is overcautious.

Three-panel estimator comparison: (left) risk functions R(θ, X̄) vs R(θ, JS) for d=3 with the improvement region shaded; (center) shrinkage arrows from X̄ to the James–Stein estimate across 8 components; (right) total MSE of X̄ vs JS as a function of dimension d, with d=3 threshold marked
Theorem 10 Stein (1956): the sample mean is inadmissible in dimension d ≥ 3

Let XN(θ,σ2Id)X \sim \mathcal{N}(\theta, \sigma^2 I_d) with unknown mean vector θRd\theta \in \mathbb{R}^d and known scale σ2\sigma^2. For d3d \geq 3, the James–Stein estimator

θ^JS  =  (1(d2)σ2X2)+X\hat{\theta}_{JS} \;=\; \left(1 - \frac{(d-2)\sigma^2}{\|X\|^2}\right)_{+} X

has strictly smaller total MSE than the sample mean XX for every value of θ\theta:

E ⁣[θ^JSθ2]  <  E ⁣[Xθ2]  =  dσ2,for every θRd.\mathbb{E}\!\left[\|\hat{\theta}_{JS} - \theta\|^2\right] \;<\; \mathbb{E}\!\left[\|X - \theta\|^2\right] \;=\; d\sigma^2, \qquad \text{for every } \theta \in \mathbb{R}^d.

Therefore the sample mean is inadmissible in dimension d3d \geq 3.

Proof [show]

Proof sketch. The full proof uses Stein’s lemma — integration by parts for Normal expectations — to compute E[θ^JSθ2]\mathbb{E}[\|\hat{\theta}_{JS} - \theta\|^2] in closed form. Here we give the statement and the geometric intuition.

Statement. An explicit (slightly idealized) risk formula for the non-positive-part James–Stein estimator is

R(θ^JS,θ)  =  dσ2    (d2)2σ4θ2+(d2)σ2,R(\hat{\theta}_{JS}, \theta) \;=\; d\sigma^2 \;-\; \frac{(d-2)^2 \sigma^4}{\|\theta\|^2 + (d-2)\sigma^2},

which is strictly less than dσ2d\sigma^2 (the risk of the MLE) for every θ\theta, provided d3d \geq 3. The positive-part version is uniformly at least as good.

Geometric intuition. The sample mean XX estimates each component θi\theta_i independently. In d3d \geq 3 dimensions, the “noise ball” around θ\theta concentrates in a thin annulus at radius σd\approx \sigma\sqrt{d} — by the CLT applied to Xθ2\|X - \theta\|^2. Shrinking XX toward the origin (or any fixed point) reduces total squared error on average, because the shrinkage moves most of the noise away from the part of the annulus farthest from the origin, more than it moves signal away from the target. The (d2)(d-2) factor is the smallest shrinkage strength that wins against noise in all directions simultaneously — it is exactly the threshold where the bound integrates to give a net improvement.

Pointer. The full proof uses Stein’s identity: for ZN(0,1)Z \sim \mathcal{N}(0, 1) and smooth gg, E[Zg(Z)]=E[g(Z)]\mathbb{E}[Z\, g(Z)] = \mathbb{E}[g'(Z)]. Applied coordinate-wise to the squared error of θ^JS\hat{\theta}_{JS}, it reduces the MSE calculation to an integral involving (d2)/X2(d-2)/\|X\|^2. See Lehmann & Casella (1998), Chapter 5, for the detailed argument; Topic 28 §28.5 now also gives the full Stein-identity proof in the hierarchical-Bayes context. \quad\blacksquare

Stein’s result is one of the most surprising theorems in statistics. The sample mean — the estimator taught in every introductory course, guaranteed by the CLT to be efficient for a single parameter — is not the best estimator when you are simultaneously estimating three or more unrelated quantities. Shrinking all the estimates toward zero (or toward any fixed point) uniformly reduces total error. The intuition is that in high dimensions, noise concentrates, and pulling estimates inward exploits the geometry. This is the mathematical seed of regularization, hierarchical Bayes, and every modern shrinkage method.
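The domination is easy to witness by simulation. A sketch at d = 10, σ² = 1, with θ = (1, …, 1) as an arbitrary illustrative truth (the improvement holds at every θ, per Theorem 10):

```python
import numpy as np

# Monte Carlo comparison of X vs the positive-part James-Stein estimator.
rng = np.random.default_rng(7)
d, reps = 10, 50_000
theta = np.ones(d)

x = rng.normal(theta, 1.0, size=(reps, d))
shrink = np.maximum(0.0, 1.0 - (d - 2) / (x**2).sum(axis=1))  # sigma^2 = 1
js = shrink[:, None] * x

mse_x  = ((x - theta)**2).sum(axis=1).mean()    # ~ d * sigma^2 = 10
mse_js = ((js - theta)**2).sum(axis=1).mean()   # strictly smaller
print(mse_x, mse_js)
```

Repeating the experiment with other values of θ (or shrinking toward a different fixed point) shows the same gap, largest when ‖θ‖ is small.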

Example 12 Ridge regression as a James–Stein-type shrinkage

Consider linear regression with design matrix XRn×dX \in \mathbb{R}^{n \times d}, target yRny \in \mathbb{R}^n, and noise εN(0,σ2In)\varepsilon \sim \mathcal{N}(0, \sigma^2 I_n). The OLS estimator is β^OLS=(XX)1Xy\hat{\beta}_{OLS} = (X^\top X)^{-1} X^\top y, unbiased with covariance σ2(XX)1\sigma^2 (X^\top X)^{-1}. The ridge estimator

β^ridge(λ)  =  (XX+λId)1Xy\hat{\beta}_{\text{ridge}}(\lambda) \;=\; (X^\top X + \lambda I_d)^{-1} X^\top y

is biased for λ>0\lambda > 0, but adding λI\lambda I shrinks the coefficients toward zero. For every λ>0\lambda > 0 there exist settings of β\beta^* (roughly, when β\|\beta^*\| is not too large relative to noise) where MSE(β^ridge(λ))<MSE(β^OLS)\operatorname{MSE}(\hat{\beta}_{\text{ridge}}(\lambda)) < \operatorname{MSE}(\hat{\beta}_{OLS}). In the orthonormal-design case XX=IdX^\top X = I_d, ridge reduces to componentwise shrinkage β^ridge,i=11+λβ^OLS,i\hat{\beta}_{\text{ridge},i} = \frac{1}{1+\lambda}\hat{\beta}_{OLS,i} — James–Stein-type shrinkage with a fixed rather than data-chosen factor. A fixed λ\lambda beats OLS only for some β\beta^*; it is the data-driven shrinkage strength, the (d2)σ2/X2(d-2)\sigma^2/\|X\|^2 term in Theorem 10, that gives James–Stein its uniform domination for d3d \geq 3.

Ridge, lasso, elastic net, early stopping, dropout — every ML regularizer is a cousin of James–Stein, trading bias for variance under the assumption that the parameter vector is not arbitrarily large.
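In the orthonormal-design case the whole comparison reduces to a one-line simulation, since OLS coefficients are just $\beta^* + \mathcal{N}(0, \sigma^2 I)$ and ridge rescales them by $1/(1+\lambda)$. A sketch with illustrative choices d = 20, σ = 1, β* = 0.5·**1**, λ = 1 (a regime where the true signal is small enough for ridge to win):

```python
import numpy as np

# Orthonormal-design ridge vs OLS, sigma = 1.
# Predicted per-coordinate ridge MSE: (lam*b/(1+lam))^2 + 1/(1+lam)^2.
rng = np.random.default_rng(8)
d, reps, lam = 20, 40_000, 1.0
beta = np.full(d, 0.5)

b_ols   = beta + rng.normal(0.0, 1.0, size=(reps, d))
b_ridge = b_ols / (1.0 + lam)

mse_ols   = ((b_ols   - beta)**2).sum(axis=1).mean()   # = d * sigma^2 = 20
mse_ridge = ((b_ridge - beta)**2).sum(axis=1).mean()   # bias^2 + shrunk variance
print(mse_ols, mse_ridge)
```

With a much larger ‖β*‖ (try `beta = np.full(d, 5.0)`) the fixed-λ ridge loses — which is why uniform domination requires the data-driven shrinkage of James–Stein.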

Remark 8 Stein's paradox and what it means for ML

Stein’s paradox is usually dismissed as a curiosity about Normal means in high dimensions. It is not. It is a deep fact about multi-parameter estimation: when the number of parameters is large and the data per parameter is limited, pooling information across parameters — even when they are a priori independent — improves estimation. The James–Stein estimator pools information between the components of θ\theta implicitly via the X2\|X\|^2 denominator; an empirical Bayes estimator does the same thing explicitly via a common prior; a hierarchical Bayesian model does it via a shared hyperparameter.

Every modern ML method that works well in high dimensions is exploiting some version of this phenomenon:

  • Regularization (L1, L2, entropy penalty, etc.) is explicit shrinkage toward a chosen point (zero, a sparse solution, a high-entropy distribution).
  • Transfer learning / pretraining shrinks the target-task estimator toward a source-task estimator, which plays the role of the “fixed shrinkage point” in James–Stein.
  • Hierarchical models borrow strength across groups via a shared prior — the Bayesian analog of James–Stein.
  • Ensemble averaging (bagging, model soups) averages roughly independent estimators, reducing variance by the 1/m1/m averaging effect (Theorem 2 again) while preserving bias — a complementary mechanism to James–Stein: it combines several high-variance estimators of the same quantity, rather than shrinking many per-quantity estimates toward a common point.

The broader lesson: admissibility is fragile, unbiasedness is not sacred, and high-dimensional estimation rewards pooled information over independent per-parameter estimation. This is the bridge from the 1950s-era CRLB and Stein results to twenty-first-century deep learning.

Interactive: Cramér–Rao Explorer
Compare estimator variances to the CRLB 1/(n·I(θ)). Efficient estimators (bar = 1.0) sit on the green envelope; inefficient ones rise above it.

13.9 Connections to ML

We have developed the bias-variance decomposition for a scalar parameter θ\theta. The ML reader’s natural question is: how does this extend to prediction problems, where the goal is to learn a function f:XYf : \mathcal{X} \to \mathcal{Y} rather than a single number? The answer is a slight generalization — MSE decomposes the same way, but the bias and variance now live in function space, and they depend on the joint distribution over training sets and test points. This section develops three canonical ML applications of the decomposition: polynomial regression (the fitting-complexity tradeoff), regularization (deliberate bias to reduce variance), and stochastic gradient descent (the mini-batch variance–compute tradeoff).

Three-panel ML bias-variance: (left) Bias²/Var/MSE vs polynomial degree with an irreducible-noise floor; (center) ridge regression Bias²/Var/MSE vs λ with optimal λ* marked; (right) SGD gradient variance σ²/B vs batch size B with compute-cost dual axis
Example 13 Bias-variance for polynomial regression

Fix a covariate point x0x_0 and consider the prediction problem where the true response satisfies y=f(x)+εy = f(x) + \varepsilon with εN(0,σ2)\varepsilon \sim \mathcal{N}(0, \sigma^2) independent of xx, and the training set D={(xi,yi)}i=1n\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n is an iid sample. Let f^D\hat{f}_{\mathcal{D}} denote the regressor fit on D\mathcal{D}. The prediction MSE at x0x_0, averaged over training sets, is

ED,ε ⁣[(yf^D(x0))2]  =  (f(x0)ED[f^D(x0)])2Bias2  +  ED ⁣[(f^D(x0)ED[f^D(x0)])2]Variance  +  σ2Irreducible noise.\mathbb{E}_{\mathcal{D}, \varepsilon}\!\left[(y - \hat{f}_{\mathcal{D}}(x_0))^2\right] \;=\; \underbrace{\big(f(x_0) - \mathbb{E}_{\mathcal{D}}[\hat{f}_{\mathcal{D}}(x_0)]\big)^2}_{\text{Bias}^2} \;+\; \underbrace{\mathbb{E}_{\mathcal{D}}\!\left[(\hat{f}_{\mathcal{D}}(x_0) - \mathbb{E}_{\mathcal{D}}[\hat{f}_{\mathcal{D}}(x_0)])^2\right]}_{\text{Variance}} \;+\; \underbrace{\sigma^2}_{\text{Irreducible noise}}.

The derivation is the same add-and-subtract trick from Theorem 3, applied to f^D(x0)\hat{f}_{\mathcal{D}}(x_0) instead of θ^n\hat{\theta}_n, with one extra term for the noise ε\varepsilon at the new test point. For a polynomial of degree dd fit by least squares:

  • Small dd: f^\hat{f} is too inflexible to track ff; bias is large; variance is small.
  • Large dd: f^\hat{f} overfits to the training sample; bias is small; variance is large.
  • Optimal dd: minimizes the sum Bias² + Variance. The irreducible noise σ2\sigma^2 is a floor you cannot beat without changing the problem.

This is the bias-variance picture that appears in every ML textbook — Hastie, Tibshirani & Friedman (Elements of Statistical Learning, Chapter 2.9) is the canonical reference — and it is an exact application of Theorem 3. The only difference from the scalar case is that f^D(x0)\hat{f}_{\mathcal{D}}(x_0) is the “estimator” and f(x0)f(x_0) is the “parameter” — everything else is the same decomposition.
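The degree tradeoff can be measured directly by refitting on fresh training sets and tracking the prediction at a fixed test point. A sketch in which the true function sin(πx), the noise level, the design, the test point, and the degrees compared are all illustrative assumptions:

```python
import numpy as np

# Bias-variance at a single test point x0 for polynomial least squares:
# refit on fresh training sets, then decompose the spread of predictions.
rng = np.random.default_rng(9)
f = lambda x: np.sin(np.pi * x)
xs = np.linspace(-1.0, 1.0, 40)
x0, sigma, reps = 0.5, 0.3, 1000

stats = {}
for deg in (1, 7):
    preds = np.empty(reps)
    for r in range(reps):
        y = f(xs) + rng.normal(0.0, sigma, xs.size)
        preds[r] = np.polyval(np.polyfit(xs, y, deg), x0)
    stats[deg] = ((preds.mean() - f(x0))**2, preds.var())  # (bias^2, variance)

print(stats)  # degree 1: large bias^2, small variance; degree 7: the reverse
```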

Example 14 Regularization as deliberate bias introduction

Ridge regression (Example 12) adds λβ2\lambda \|\beta\|^2 to the least-squares objective, producing an estimator that is biased toward zero but has smaller variance. The MSE of the ridge predictor at a test point x0x_0 decomposes as before, with the λ\lambda-dependent tradeoff:

MSE(f^ridge(x0);λ)  =  Bias2(λ)  +  Var(λ)  +  σ2.\operatorname{MSE}(\hat{f}_{\text{ridge}}(x_0); \lambda) \;=\; \operatorname{Bias}^2(\lambda) \;+\; \operatorname{Var}(\lambda) \;+\; \sigma^2.

Bias² is monotonically increasing in λ\lambda (more shrinkage, more distortion of the target). Variance is monotonically decreasing in λ\lambda (more shrinkage, less sensitivity to the training set). The sum has a minimum at some λ>0\lambda^* > 0 that depends on β\|\beta^*\| and σ2\sigma^2; cross-validation estimates λ\lambda^* without knowing either.

The reason this works is Stein’s theorem in disguise. Shrinking OLS toward zero is a James–Stein-type operation — each coefficient gets pulled a little closer to the origin — and when the true signal is concentrated or sparse, the bias introduced is small compared with the variance reduction. In high dimensions (d3d \geq 3 but effectively d3d \gg 3 for any real ML problem), the effect is pronounced: deep neural nets with L2 weight decay are the modern industrial version of this phenomenon.
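For a fixed design the λ\lambda tradeoff needs no simulation: with y = Xβ* + ε, the ridge estimator (XᵀX + λI)⁻¹Xᵀy has mean (XᵀX + λI)⁻¹XᵀXβ* and covariance σ²(XᵀX + λI)⁻¹XᵀX(XᵀX + λI)⁻¹, so bias and variance of the prediction at a test point follow in closed form. A minimal sketch, with all sizes and parameters assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, sigma2 = 50, 10, 1.0                 # assumed sizes and noise level
X = rng.normal(size=(n, d))                # fixed design matrix
beta_star = rng.normal(size=d) / np.sqrt(d)  # assumed true coefficients
x0 = rng.normal(size=d)                    # fixed test point

XtX = X.T @ X
rows = []
for lam in (0.0, 1.0, 10.0, 100.0):
    A = np.linalg.inv(XtX + lam * np.eye(d))       # (X'X + lam I)^{-1}
    bias = x0 @ (A @ XtX @ beta_star - beta_star)  # x0'(E[beta_hat] - beta*)
    var = sigma2 * x0 @ A @ XtX @ A @ x0           # x0' Cov(beta_hat) x0
    rows.append((lam, bias**2, var))
    print(f"lam={lam:6.1f}  bias^2={bias**2:.4f}  var={var:.4f}  "
          f"sum={bias**2 + var:.4f}")
```

At λ=0\lambda = 0 the estimator is OLS and the bias vanishes; as λ\lambda grows, bias² rises from zero while the variance term shrinks monotonically, and the sum typically dips at an intermediate λ\lambda.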

Example 15 SGD mini-batch variance and the batch-size tradeoff

Stochastic gradient descent replaces the full-batch gradient θL(θ)=1Ni=1Nθi(θ)\nabla_\theta \mathcal{L}(\theta) = \frac{1}{N}\sum_{i=1}^N \nabla_\theta \ell_i(\theta) with a mini-batch estimate

g^B(θ)  =  1BiBθi(θ),\hat{g}_B(\theta) \;=\; \frac{1}{B}\sum_{i \in \mathcal{B}} \nabla_\theta \ell_i(\theta),

where B\mathcal{B} is a random subset of size BB. The mini-batch gradient is unbiased (E[g^B]=θL\mathbb{E}[\hat{g}_B] = \nabla_\theta \mathcal{L} when the batch is a uniformly random sample) but has variance that scales as σg2/B\sigma_g^2 / B, where σg2\sigma_g^2 is the per-example gradient variance. This is Theorem 2 restated for the gradient estimator.

The bias-variance tradeoff in SGD is between per-step noise (small BB, high variance, small per-step compute) and per-step efficiency (large BB, low variance, large per-step compute, diminishing returns per step). Typical ML lore:

  • For a fixed compute budget, the optimal batch size balances the two — larger is not always better.
  • The learning rate η\eta is often scaled roughly linearly with BB (the linear scaling rule): gradient variance falls as 1/B1/B, so doubling BB halves the per-step noise and supports a proportionally larger step at the same per-step signal-to-noise ratio.
  • Very small batches (B=1B = 1, pure SGD) have high variance but frequent updates — stochasticity can even help escape saddle points, a benefit absent from full-batch gradient descent.

All of this is rigorous point estimation. The “estimator” is g^B\hat{g}_B; the “parameter” is θL\nabla_\theta \mathcal{L}; the MSE decomposes as 0+σg2/B0 + \sigma_g^2 / B; and the compute cost scales linearly with BB. The modern deep-learning zoo of optimizers — Adam, AdamW, LAMB, Lion — are all attempts to control this MSE more cleverly, by rescaling, momentum, or variance-adaptive step sizes.
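An empirical check of these two facts — unbiasedness and the σg2/B\sigma_g^2/B variance scaling — using synthetic per-example "gradients" as stand-ins for the θi(θ)\nabla_\theta \ell_i(\theta) at a fixed θ\theta (an assumption purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 10_000
g = rng.normal(1.0, 2.0, size=N)        # synthetic per-example gradients
full_grad = g.mean()                    # the estimand: full-batch gradient
sigma_g2 = g.var()                      # per-example gradient variance

emp = {}
for B in (1, 10, 100):
    ests = np.array([
        g[rng.choice(N, size=B, replace=False)].mean()  # one g_hat_B draw
        for _ in range(5_000)
    ])
    emp[B] = (ests.mean(), ests.var())
    print(f"B={B:4d}  mean={ests.mean():+.3f} (target {full_grad:+.3f})  "
          f"var={ests.var():.4f}  sigma_g2/B={sigma_g2 / B:.4f}")
```

Each batch size reproduces the full-batch gradient on average, and the empirical variance tracks σg2/B\sigma_g^2/B. (Sampling without replacement adds a finite-population factor (NB)/(N1)(N-B)/(N-1), negligible at these sizes.)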

13.10 Summary

An estimator is a random variable. Its sampling distribution is the object to analyze. We have built the complete evaluation framework:

  • Bias is the mean offset E[θ^n]θ\mathbb{E}[\hat{\theta}_n] - \theta. Zero bias is called unbiased.
  • Variance (equivalently, standard error squared) is the spread of the sampling distribution.
  • MSE combines them: MSE=Bias2+Var\operatorname{MSE} = \operatorname{Bias}^2 + \operatorname{Var} (Theorem 3, the central identity).
  • Consistency is the guarantee that θ^nθ\hat{\theta}_n \to \theta as nn \to \infty — a consequence of the LLN.
  • Asymptotic normality is the guarantee that n(θ^nθ)N(0,v(θ))\sqrt{n}(\hat{\theta}_n - \theta) \to \mathcal{N}(0, v(\theta)) — a consequence of the CLT.
  • Fisher information I(θ)I(\theta) is the expected curvature of the log-likelihood — how much information one observation carries about θ\theta.
  • Cramér–Rao bound: Var(θ^n)1/(nI(θ))\operatorname{Var}(\hat{\theta}_n) \geq 1/(n I(\theta)) for every unbiased estimator. Estimators achieving the bound are efficient.
  • Admissibility and minimax give global criteria for estimator comparison. The James–Stein theorem shows the sample mean is inadmissible in d3d \geq 3 dimensions — shrinkage strictly improves total MSE.
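The central identity MSE = Bias² + Var can be verified numerically on a concrete biased estimator — the divide-by-nn sample variance, which underestimates σ2\sigma^2 by σ2/n\sigma^2/n. Parameter values below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma2, n, n_reps = 4.0, 10, 200_000     # true variance, sample size, reps

# Replicate the biased (divide-by-n) variance estimator many times
samples = rng.normal(0.0, np.sqrt(sigma2), size=(n_reps, n))
est = samples.var(axis=1)                # ddof=0: divides by n, biased low

bias = est.mean() - sigma2               # theory: -sigma2/n = -0.4
var = est.var()
mse = np.mean((est - sigma2) ** 2)

print(f"bias         = {bias:+.4f}   (theory {-sigma2 / n:+.4f})")
print(f"bias^2 + var = {bias**2 + var:.4f}")
print(f"MSE          = {mse:.4f}")       # the decomposition is exact
```

The empirical bias lands near the theoretical σ2/n=0.4-\sigma^2/n = -0.4, and bias² + var reproduces the MSE to floating-point precision — the decomposition is an algebraic identity, not an approximation.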

The cheat sheet for every downstream topic:

| Property | Formula | Analog for ML |
| --- | --- | --- |
| Bias | E[θ^n]θ\mathbb{E}[\hat{\theta}_n] - \theta | Model mis-specification / underfitting |
| Variance | Var(θ^n)\operatorname{Var}(\hat{\theta}_n) | Sensitivity to training set / overfitting |
| MSE | Bias² + Variance | Prediction error at a fixed point |
| Consistency | θ^nPθ\hat{\theta}_n \overset{P}{\to} \theta | Convergence of training procedure |
| Asymptotic variance | limnVar(θ^n)\lim n\,\operatorname{Var}(\hat{\theta}_n) | Rate of convergence, confidence interval width |
| Fisher information | E[2logf/θ2]-\mathbb{E}[\partial^2 \log f / \partial \theta^2] | Curvature of the loss — optimization landscape |
| CRLB | 1/(nI(θ))1/(nI(\theta)) | Information-theoretic lower bound on MSE |
| Efficiency | Var / CRLB | Fraction of information the estimator extracts |

The arc of Track 4 is to apply this framework to specific estimation methods. Maximum Likelihood Estimation defines θ^MLE\hat{\theta}_{\text{MLE}} as the root of the score equation S(θ)=0S(\theta) = 0; consistency, asymptotic normality, and asymptotic efficiency of the MLE are consequences of the LLN, CLT, and CRLB framework developed here. Method of Moments applies bias, MSE, and consistency to moment-matching estimators, comparing their efficiency to the MLE benchmark. Sufficient Statistics uses Fisher information and the CRLB to prove Rao–Blackwell and Lehmann–Scheffé theorems, thereby establishing the theory of UMVUE.

Beyond Track 4, every downstream topic in formalStatistics uses this vocabulary. Hypothesis Testing builds test statistics that are estimators; their null and alternative distributions follow from the consistency and asymptotic-normality results we have just proved. The z-test applies the CLT to a standardized Xˉ\bar X; the t-test uses Basu’s theorem to give n(Xˉμ0)/S\sqrt n(\bar X - \mu_0)/S a clean tn1t_{n-1} distribution; the Wald, score, and likelihood-ratio tests are all χ2\chi^2-asymptotic consequences of Theorem 14.3’s asymptotic-normality result. Confidence Intervals invert asymptotic normality: a 1α1 - \alpha CI is an interval of the form θ^n±zα/2SE^\hat{\theta}_n \pm z_{\alpha/2}\cdot \widehat{\operatorname{SE}}. Linear Regression applies Gauss–Markov efficiency to OLS, which attains the CRLB for Normal-error linear models; regularization and shrinkage apply Stein’s phenomenon. The Bootstrap gives empirical analogs of the sampling distribution when closed-form asymptotics are unavailable. Across formalML, the bias-variance tradeoff organizes regularization, ensembling, transfer learning, and hyperparameter tuning — the daily tools of every machine-learning practitioner.

A reader who has made it this far has the complete point-estimation toolkit: what an estimator is (§13.1), how to measure its error (§13.2–13.4), what happens as data accumulates (§13.5–13.6), how good an estimator can possibly be (§13.7), and when the “natural” choice can be uniformly beaten (§13.8). This is the foundation on which all of classical and modern statistical inference is built.


References

  1. Casella, G., & Berger, R. L. (2002). Statistical Inference (2nd ed.). Duxbury.
  2. Billingsley, P. (2012). Probability and Measure (Anniversary ed.). Wiley.
  3. Wasserman, L. (2004). All of Statistics. Springer.
  4. Lehmann, E. L., & Casella, G. (1998). Theory of Point Estimation (2nd ed.). Springer.
  5. Bickel, P. J., & Doksum, K. A. (2015). Mathematical Statistics: Basic Ideas and Selected Topics (2nd ed.). CRC Press.
  6. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer.
  7. Stein, C. (1956). Inadmissibility of the Usual Estimator for the Mean of a Multivariate Normal Distribution. Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, 1, 197–206.
  8. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.