
Empirical Processes & Uniform Convergence

The empirical process, indexed by a class of functions rather than a single point, is the central object of modern nonparametric statistics. Two organising theorems — a functional Glivenko–Cantelli for VC classes (uniform LLN) and Donsker's theorem for the distribution-function class (uniform CLT) — discharge the Donsker IOU that Topic 29 left open and close Track 8. Applications include the Cramér–von Mises statistic, the two-sample Kolmogorov–Smirnov test, the bootstrap empirical process, Talagrand's concentration inequality, and the functional delta method, which provides the on-ramp to formalML's semiparametric and causal-inference tracks. Track 8, topic 4 of 4 — the curriculum closer.

Track 8 closes here

Topics 29, 30, and 31 built the empirical distribution, smoothed it into densities, and resampled from it. Topic 32 closes the track by embedding all three into a single functional framework: the empirical process $\mathbb{G}_n = \sqrt n\,(P_n - P)$, indexed by a class of functions $\mathcal{F}$ rather than a single point $t$. Two theorems organise the topic — a functional Glivenko–Cantelli for VC classes (§32.4) and a functional CLT, Donsker's theorem, for the distribution-function class (§32.5) — and both discharge forward-promises stretching back through the entire curriculum. Topic 32 is also the curriculum closer: every forward-pointer in §32.10 leaves the site for formalML.

Three theorems get full proofs in this chapter: Glivenko–Cantelli for VC classes (§32.4, the "fundamental theorem of statistical learning"), Donsker's theorem for $\mathcal{F}_{\mathrm{cdf}}$ (§32.5, the featured result and the discharge of Topic 29 §29.8's credited-but-unproven Kolmogorov limit), and the functional delta method (§32.7, the on-ramp to semiparametric efficiency). Six more are stated: Sauer–Shelah–Vapnik, the general bracketing-integral Donsker criterion, the extended continuous-mapping theorem, the Cramér–von Mises limit, the bootstrap empirical-process CLT, and Talagrand's concentration inequality. Three interactive components give the visual vocabulary for each: VCShatteringDemo for §32.4's shatter coefficient, EmpiricalProcessExplorer for §32.5's Donsker convergence, and FunctionalDeltaExplorer for §32.7's influence-function limits. After §32.10 there is no Topic 33; every "what's next" points to formalML.

32.1 From pointwise to uniform: motivation

Every machine-learning paper that trains a model by empirical risk minimisation (ERM) — which is to say, most of them — is quietly asking an empirical-process question. Fix a loss $\ell$, a parameter space $\Theta$, and iid data $X_1, \dots, X_n$; the population risk $R(\theta) = \mathbb{E}[\ell(X; \theta)]$ is what we want to minimise, and the empirical risk $\hat R_n(\theta) = n^{-1} \sum_i \ell(X_i; \theta)$ is what we can compute. ERM picks the minimiser of $\hat R_n$ and hopes that choice tracks the minimiser of $R$. Whether it does depends on whether $\hat R_n \to R$ just pointwise (for each fixed $\theta$, the ordinary LLN suffices) or uniformly over $\Theta$ (the sup-gap shrinks). The distinction is not cosmetic: pointwise convergence alone can coexist with $\hat R_n$'s minimiser drifting arbitrarily far from $R$'s, and Topic 32 exists to work out when the uniform version holds — and what happens when we take the difference $\hat R_n - R$ seriously as a random function, not just a random number.


Figure 1. Pointwise vs uniform convergence. Left: a single $\theta$; the sample-risk trajectory settles to the true risk by the LLN. Right: five $\theta$ values simultaneously; at every $n$ the sup-gap over the five is larger than any individual trajectory's deviation, and the sup-gap decays more slowly. ERM selects whichever $\theta$ minimises $\hat R_n$, and without uniform convergence there is no guarantee that minimiser tracks the true-risk minimiser.

Example 1 ERM and the need for uniform convergence

Take $X \sim \mathrm{Uniform}(0, 1)$ and squared loss $\ell(x; \theta) = (x - \theta)^2$, so $R(\theta) = \mathrm{Var}(X) + (\mathbb{E} X - \theta)^2$ is minimised at $\theta^\star = 1/2$. The LLN gives $\hat R_n(\theta) \to R(\theta)$ a.s. for every fixed $\theta$. But the minimiser $\hat\theta_n$ of $\hat R_n$ is just the sample mean $\bar X_n$, which by the CLT sits within $O(n^{-1/2})$ of $\theta^\star$ — the same $\sqrt n$-rate that controls $\sup_\theta |\hat R_n(\theta) - R(\theta)|$. This isn't a coincidence: the rate at which $\hat\theta_n \to \theta^\star$ is controlled by the uniform gap, not the pointwise one. Pointwise LLN on its own gives consistency-in-parameter; it does not give a rate on that consistency. §32.4 supplies the precise condition — finite VC dimension of the loss class — that makes the uniform-convergence rate quantitative.
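A minimal simulation of this example, as a sketch (NumPy; the grid standing in for $\Theta$ and the seed are illustrative choices): it prints the sup-gap over the grid, its $\sqrt n$ rescaling, and the ERM minimiser.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.linspace(0.0, 1.0, 201)            # finite grid standing in for Theta
R = 1 / 12 + (0.5 - theta) ** 2               # population risk Var(X) + (EX - theta)^2

for n in [100, 1_000, 10_000]:
    x = rng.uniform(0, 1, size=n)
    # empirical risk at every grid point: R_hat(theta) = mean((x - theta)^2)
    R_hat = ((x[:, None] - theta[None, :]) ** 2).mean(axis=0)
    sup_gap = np.abs(R_hat - R).max()
    theta_hat = theta[R_hat.argmin()]          # ERM minimiser, ~ the sample mean
    print(f"n={n:6d}  sup-gap={sup_gap:.4f}  "
          f"sqrt(n)*sup-gap={np.sqrt(n) * sup_gap:.2f}  theta_hat={theta_hat:.3f}")
```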

Example 2 Goodness-of-fit as a sup-statistic

Topic 29's Kolmogorov–Smirnov statistic $D_n = \sup_t |F_n(t) - F(t)|$ is the empirical-process object in disguise. Under the null $X_i \sim F$, Glivenko–Cantelli (Topic 10 §10.7) gives $D_n \to 0$ a.s. — the uniform LLN for the distribution-function class $\mathcal{F}_{\mathrm{cdf}} = \{\mathbf{1}_{(-\infty, t]} : t \in \mathbb{R}\}$. Donsker's theorem (§32.5) supplies the corresponding uniform CLT: $\sqrt n\,D_n$ converges in distribution to the sup of a Brownian bridge, the Kolmogorov distribution Topic 29 §29.8 used to build KS's distribution-free critical values. The step from "fix a $t$, apply the LLN or CLT" to "take a sup over all $t$, get a uniform analogue" is the leap Topic 32 formalises in generality, for any function class $\mathcal{F}$, not just $\mathcal{F}_{\mathrm{cdf}}$.
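In code, the sup over $t$ reduces to a maximum over the $n$ jump points of $F_n$. A sketch under the uniform null (SciPy's kstwobign is the Kolmogorov distribution of the sup of a Brownian bridge; the sample size and seed are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 500
x = np.sort(rng.uniform(0, 1, size=n))        # null: X_i ~ F = Uniform(0, 1)
i = np.arange(1, n + 1)
# sup_t |F_n(t) - t| is attained just before or at a jump of F_n
d_n = max((i / n - x).max(), (x - (i - 1) / n).max())
print("sqrt(n) * D_n      :", np.sqrt(n) * d_n)
print("Kolmogorov 95% crit:", stats.kstwobign.ppf(0.95))   # ~1.358, distribution-free
```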

Remark 1 ML anchor — uniform convergence underwrites generalization

Generalisation in machine learning is the same question ERM asks, wearing a different hat: does the training loss on an iid training set track the test loss uniformly over the hypothesis class we're searching? When the class has finite VC dimension $V$, §32.4 Thm 2 delivers a positive answer with an explicit $O(\sqrt{V \log n / n})$ rate; when it doesn't — think of neural networks with more parameters than data points — Rademacher-complexity refinements and PAC-Bayes bounds take over, and both reuse §32.4's symmetrisation technique nearly verbatim. Those refinements live at formalML's generalization-bounds and pac-bayes-bounds; §32.4 is the version of the result this site proves in full.

Remark 2 Track 8 bridge

Track 8 has been building toward this topic. Topic 29 introduced the ECDF $F_n$ and studied its single-point and 1D-sup-norm behaviour. Topic 30 smoothed $F_n$ into a density estimate $\hat f_h$ and worked out its bias–variance trade-off. Topic 31 resampled from $F_n$ to bootstrap the sampling distribution of any statistic we care about. Each of the three treated $F_n$ as a fixed-point or scalar-statistic object. Topic 32 now treats $F_n$ as a path in the function space $\ell^\infty(\mathbb{R})$, the empirical process $\sqrt n(F_n - F)$ as a random function, and everything the earlier topics did as special cases of this functional viewpoint. The most load-bearing payoff: §32.5 Donsker supplies the limit distribution §29.8 Proof 3 credited without deriving — the single biggest forward-promise of the curriculum, discharged here.

32.2 The empirical process $\mathbb{G}_n$ and function classes

The right way to generalise "the ECDF converges to the CDF" is to stop thinking about a single index point $t \in \mathbb{R}$ and start thinking about an entire class of functions $\mathcal{F}$. Topic 29's $F_n(t) = n^{-1}\sum_i \mathbf{1}\{X_i \le t\}$ is the empirical average of a single indicator function evaluated at one $t$; tilt your head and that's the same as $P_n f$ where $P_n$ is the empirical measure and $f = \mathbf{1}_{(-\infty, t]}$ is one function. Let $f$ range over a class $\mathcal{F}$ and $P_n f$ becomes a whole random function indexed by $\mathcal{F}$, not a random number. Its centred-and-rescaled version is the empirical process $\mathbb{G}_n(f) := \sqrt n(P_n f - P f)$, and the central question of Topic 32 is: what can we say about the stochastic-process-on-$\mathcal{F}$ behaviour of $\mathbb{G}_n$ as $n \to \infty$?

Definition 1 Empirical measure $P_n$

Given iid $X_1, \dots, X_n \sim P$ on a measurable space $(\mathcal{X}, \mathcal{A})$, the empirical measure is $P_n := n^{-1}\sum_{i=1}^n \delta_{X_i}$, the random probability measure that places mass $1/n$ at each sample point. For any measurable $f : \mathcal{X} \to \mathbb{R}$, define

$$P_n f := \int f\,dP_n = \frac{1}{n} \sum_{i=1}^n f(X_i), \qquad P f := \int f\,dP = \mathbb{E}[f(X)].$$

Topic 29's ECDF $F_n(t) = P_n(\mathbf{1}_{(-\infty, t]})$ is $P_n$ applied to a single half-line indicator; the sample mean $\bar X_n = P_n(\mathrm{id})$ is $P_n$ applied to the identity; Topic 31's bootstrap draws sample points from $P_n$ rather than from $P$. The empirical measure is the glue holding Track 8 together.

Definition 2 Empirical process $\mathbb{G}_n$ indexed by $\mathcal{F}$

Given a class $\mathcal{F}$ of measurable functions $\mathcal{X} \to \mathbb{R}$, the empirical process indexed by $\mathcal{F}$ is the random function

$$\mathbb{G}_n : \mathcal{F} \to \mathbb{R}, \qquad \mathbb{G}_n f := \sqrt n\,(P_n f - P f) = \frac{1}{\sqrt n}\sum_{i=1}^n \bigl(f(X_i) - P f\bigr).$$

Two specialisations recover familiar objects. If $\mathcal{F} = \{f\}$ is a singleton, $\mathbb{G}_n f$ is just $\sqrt n$ times the sample-mean deviation of $f(X)$ and the ordinary CLT applies. If $\mathcal{F} = \mathcal{F}_{\mathrm{cdf}} = \{\mathbf{1}_{(-\infty, t]} : t \in \mathbb{R}\}$, then $\mathbb{G}_n(\mathbf{1}_{(-\infty, t]}) = \sqrt n(F_n(t) - F(t))$ — the $\sqrt n$-rescaled CDF-deviation Topic 29 §29.8 worked with. The empirical process is the uniform-index generalisation of both.

Theorem 1 Pointwise CLT for $\mathbb{G}_n(f)$ (stated)

For any fixed $f \in \mathcal{F}$ with $P f^2 < \infty$,

$$\mathbb{G}_n f \;\xrightarrow{d}\; \mathcal{N}\!\bigl(0,\,\mathrm{Var}_P f\bigr) \qquad \text{as } n \to \infty,$$

by the ordinary CLT (Topic 11 §11.5 Thm 3). This tells us $\mathbb{G}_n$ evaluated at any one $f$ is asymptotically Gaussian; it tells us nothing about joint or uniform behaviour across $\mathcal{F}$. Topic 32's two organising theorems — uniform LLN (§32.4) and uniform CLT (§32.5) — fill that gap.
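A quick Monte Carlo sanity check of the singleton case (a sketch; the choice $f(x) = x^2$ under $\mathrm{Uniform}(0, 1)$, with $P f = 1/3$ and $\mathrm{Var}_P f = 1/5 - 1/9 = 4/45$, is ours):

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 1_000, 5_000
x = rng.uniform(0, 1, size=(reps, n))
g_n = np.sqrt(n) * ((x ** 2).mean(axis=1) - 1 / 3)   # G_n f for f(x) = x^2, P f = 1/3
print("sample variance of G_n f:", g_n.var())          # should approach Var_P f
print("Var_P f = 4/45          :", 4 / 45)
```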


Figure 2. One-dimensional Glivenko–Cantelli warmup. Four panels at increasing sample sizes; $F_n$ (blue step) tightens to $F$ (grey smooth) uniformly. The sup-norm $\|F_n - F\|_\infty$ visibly shrinks like $1/\sqrt n$, previewing Donsker's rate.

Example 3 The distribution-function class $\mathcal{F}_{\mathrm{cdf}}$

The distribution-function class $\mathcal{F}_{\mathrm{cdf}} := \{\mathbf{1}_{(-\infty, t]} : t \in \mathbb{R}\}$ is parametrised by the real line, one function per threshold $t$. The empirical process indexed by this class is $\mathbb{G}_n(\mathbf{1}_{(-\infty, t]}) = \sqrt n(F_n(t) - F(t))$, which we'll just write $\mathbb{G}_n(t)$ when no confusion arises. This is the class Donsker's theorem (§32.5 Thm 3) handles in full, discharging the §29.8 IOU. $\mathcal{F}_{\mathrm{cdf}}$ has VC dimension $1$ (see §32.4 Def 6), so it's also the easiest case of §32.4's GC-for-VC theorem.

Example 4 Parametric-family indicators

Fix a parametric family $\{F_\alpha : \alpha \in A\}$ with $A \subseteq \mathbb{R}$ and an observable $q \in (0, 1)$. The class $\mathcal{F} = \{\mathbf{1}\{x \le F_\alpha^{-1}(q)\} : \alpha \in A\}$ contains one indicator per candidate $\alpha$, and $P_n f_\alpha$ is the ECDF evaluated at the $q$-th quantile under $\alpha$. $\mathcal{F}$ has VC dimension 1 when $\alpha \mapsto F_\alpha^{-1}(q)$ is monotone (which is the normal case — bigger $\alpha$ shifts the quantile one way). §32.4 Thm 2 applies: $\sup_{\alpha \in A} |P_n f_\alpha - P f_\alpha| \to 0$ a.s., a uniform LLN over the whole family. This is the workhorse result that makes MLE consistent over a parametric family — the statistician's specialisation of the ML learning-theory bound.
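A concrete instance (our choice of family, not from the text): the $\mathrm{Exp}(\alpha)$ family with $q = 1/2$, so $F_\alpha^{-1}(1/2) = \ln 2 / \alpha$ is monotone in $\alpha$. A sketch of the uniform LLN over an $\alpha$-grid, with the truth at $\alpha = 1$:

```python
import numpy as np

rng = np.random.default_rng(3)
alphas = np.linspace(0.2, 5.0, 300)           # candidate rates alpha in A
t_alpha = np.log(2) / alphas                   # F_alpha^{-1}(1/2), monotone in alpha

for n in [100, 1_000, 10_000]:
    x = rng.exponential(1.0, size=n)           # data drawn from Exp(1)
    p_n = (x[:, None] <= t_alpha[None, :]).mean(axis=0)   # P_n f_alpha over the grid
    p = 1 - np.exp(-t_alpha)                   # P f_alpha under the true Exp(1)
    print(f"n={n:6d}  sup_alpha |P_n f - P f| = {np.abs(p_n - p).max():.4f}")
```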

Remark 3 From LLN / CLT to uniform versions

Pointwise LLN (for each $f$, $P_n f \to P f$ a.s.) does NOT imply uniform LLN ($\sup_f |P_n f - P f| \to 0$ a.s.) — the classical Vapnik–Chervonenkis counterexample takes $\mathcal{X} = [0, 1]$, $P$ = Lebesgue, and $\mathcal{F} = \{\mathbf{1}_A : A \text{ finite}\}$; every $f \in \mathcal{F}$ satisfies $P f = 0$, but $\sup_f |P_n f| = 1$ almost surely (pick $A = \{X_1, \dots, X_n\}$). The class is too rich: you can always find some $f$ that indicator-fits the sample perfectly. §32.4 makes precise the complexity condition — finite VC dimension — that excludes these pathological classes and rescues the uniform LLN.

Remark 4 Why $\ell^\infty(\mathcal{F})$ is the ambient space

For each $\omega$, the path $f \mapsto \mathbb{G}_n(\omega)(f)$ is a bounded function on $\mathcal{F}$ as long as $\mathcal{F}$ has a finite envelope. It lives naturally in the Banach space $\ell^\infty(\mathcal{F})$ of bounded functions $\mathcal{F} \to \mathbb{R}$ with the sup-norm $\|z\|_\mathcal{F} := \sup_{f \in \mathcal{F}} |z(f)|$. The right limit-theorem statement for $\mathbb{G}_n$ is "weak convergence in $\ell^\infty(\mathcal{F})$" — a more delicate notion than ordinary weak convergence because the sup itself need not be measurable. §32.3 handles that machinery.

32.3 Weak convergence in $\ell^\infty(\mathcal{F})$

Topic 9 defined weak convergence for real-valued random variables: $X_n \xrightarrow{d} X$ means $\mathbb{E} h(X_n) \to \mathbb{E} h(X)$ for every bounded continuous $h$. Upgrading that to random elements of a function space — which is what $\mathbb{G}_n \in \ell^\infty(\mathcal{F})$ requires — runs into a technical snag: $\sup_{f \in \mathcal{F}} \mathbb{G}_n(f)$ is not, in general, a measurable function on the sample space. Hoffmann–Jørgensen's solution (1984, formalised in vdV–Wellner §1.5) is to work with outer expectation $\mathbb{E}^*$ and to demand that the limit process $\mathbb{G}$ be tight. The payoff is a weak-convergence notion strong enough to deliver continuous-mapping theorems and delta-method lifts, which is all §32.4–§32.8 need. The notation table at the end of this section is the chart worth bookmarking before reading on.

Definition 3 The space $\ell^\infty(\mathcal{F})$

For a set $\mathcal{F}$, the space $\ell^\infty(\mathcal{F})$ is the set of bounded functions $z : \mathcal{F} \to \mathbb{R}$, equipped with the sup-norm

$$\|z\|_\mathcal{F} := \sup_{f \in \mathcal{F}} |z(f)|.$$

$(\ell^\infty(\mathcal{F}), \|\cdot\|_\mathcal{F})$ is a Banach space. The empirical process $\mathbb{G}_n$ is an $\ell^\infty(\mathcal{F})$-valued random element whenever $\mathcal{F}$ has a square-integrable envelope: $\sup_{f \in \mathcal{F}} |f| \le F$ with $P F^2 < \infty$ (see §32.6 Def 11 for the formal envelope condition).

Definition 4 Outer probability $\mathbb{P}^*$

For any subset $A \subseteq \Omega$ of the sample space (measurable or not), define

$$\mathbb{P}^*(A) := \inf\{\mathbb{P}(B) : B \supseteq A,\ B \text{ measurable}\}.$$

For a function $T : \Omega \to \mathbb{R}$, the outer expectation is $\mathbb{E}^* T := \inf\{\mathbb{E} S : S \ge T,\ S \text{ measurable}\}$. When $T$ is measurable, $\mathbb{P}^*$ and $\mathbb{E}^*$ collapse to the ordinary $\mathbb{P}$ and $\mathbb{E}$. The construction is due to Hoffmann–Jørgensen (1984) and handles exactly the case of $\sup_f \mathbb{G}_n(f)$, where the supremum over an uncountable class need not be measurable.

Definition 5 Weak convergence in $\ell^\infty(\mathcal{F})$

A sequence of $\ell^\infty(\mathcal{F})$-valued random elements $\mathbb{G}_n$ converges weakly to a tight random element $\mathbb{G}$, written $\mathbb{G}_n \rightsquigarrow \mathbb{G}$, if

$$\mathbb{E}^* h(\mathbb{G}_n) \;\to\; \mathbb{E}\, h(\mathbb{G}) \qquad \text{for every bounded Lipschitz } h : \ell^\infty(\mathcal{F}) \to \mathbb{R}.$$

"Tight" means $\mathbb{G}$ concentrates its mass on compact subsets: for every $\varepsilon > 0$ there is a compact $K_\varepsilon \subseteq \ell^\infty(\mathcal{F})$ with $\Pr(\mathbb{G} \notin K_\varepsilon) < \varepsilon$. Tightness is essential because $\ell^\infty(\mathcal{F})$ is not separable for uncountable $\mathcal{F}$ — without it the limit law is ill-defined.

Example 5 Non-measurability of the sup

Take $\mathcal{F} = \mathcal{F}_{\mathrm{cdf}}$, sample $X_1 \sim \mathrm{Uniform}(0, 1)$, and consider the map $\omega \mapsto \sup_{t \in \mathbb{R}} \mathbb{G}_1(\omega)(t) = \sup_t \sqrt 1\,(\mathbf{1}\{X_1 \le t\} - t)$. Since $\mathcal{F}_{\mathrm{cdf}}$ is indexed by uncountably many $t$, the supremum is over an uncountable family of measurable functions and need not itself be measurable in a general measure-theoretic sense (a specific construction using the axiom of choice can produce a non-measurable sup). The fix: work with the outer probability $\mathbb{P}^*\bigl(\sup_t \mathbb{G}_n(t) > x\bigr)$ instead. For the classes we actually encounter — VC classes in §32.4, bracketing-finite classes in §32.6 — the supremum is measurable, because a.s. continuity of the sample paths allows reduction to a countable dense index set; outer probability collapses back to probability, and no harm is done.

Remark 5 Hoffmann–Jørgensen outer-probability fix

The outer-probability machinery (VDW1996 §1.3, KOS2008 §7) is the standard fix for non-measurable suprema. It's the analogue of using outer Lebesgue measure in real analysis: you give up ordinary $\sigma$-additivity on all subsets, but you gain the ability to make probabilistic statements about quantities that aren't technically measurable. Every theorem from §32.4 onwards is stated using $\mathbb{E}^*$ implicitly; we'll suppress the star except where it matters.

Remark 6 Why the measurability housekeeping doesn't bite in practice

For every Donsker class $\mathcal{F}$ we meet in §§32.4–32.8, the limit process $\mathbb{G}_P$ has a.s. uniformly continuous sample paths with respect to the $L^2(P)$ pseudo-metric on $\mathcal{F}$. Continuity lets you recover $\sup_{f \in \mathcal{F}} \mathbb{G}_P(f)$ from the sup over any countable dense subset, and that sup is measurable. So all our $\mathbb{P}^*$ statements coincide with ordinary $\mathbb{P}$ statements on the event "the limit process is continuous" — which is a full-probability event. The measurability housekeeping is real, but its contribution to the bottom line is zero.

Notation reference (used throughout the chapter).

| Symbol | Meaning | Introduced | Related prior convention |
| --- | --- | --- | --- |
| $P$, $P_n$ | Probability measure (population); empirical measure $P_n := n^{-1}\sum_i \delta_{X_i}$ | §32.2 Def 1 | Topic 29 §29.5's $F_n$ is $P_n$ applied to half-line indicators |
| $P f$, $P_n f$ | Integrals $\int f\,dP = \mathbb{E}[f(X)]$ and $\int f\,dP_n = n^{-1}\sum_i f(X_i)$ | §32.2 Def 1 | Topic 10's $\bar X_n = P_n(\mathrm{id})$ |
| $\mathbb{G}_n$ | Empirical process $\mathbb{G}_n := \sqrt n\,(P_n - P)$ | §32.2 Def 2 | Lift of Topic 29 §29.8's $\sqrt n (F_n - F)$ to general $\mathcal{F}$ |
| $\ell^\infty(\mathcal{F})$ | Banach space of bounded $\mathcal{F} \to \mathbb{R}$ functions with sup-norm | §32.3 Def 3 | Lift of Topic 10's $\mathbb{R}$-valued r.v. space to function space |
| $\mathrm{VC}(\mathcal{F})$, $s_{\mathcal{F}}(n)$ | VC dimension of a class; shatter coefficient on $n$ points | §32.4 Def 6 | New — Topic 10 Rem 6 name-checked without defining |
| $N(\varepsilon, \mathcal{F}, \Vert\cdot\Vert)$ | Covering number: minimum balls of radius $\varepsilon$ to cover $\mathcal{F}$ in $\Vert\cdot\Vert$ | §32.6 Def 9 | New |
| $N_{[\,]}(\varepsilon, \mathcal{F}, \Vert\cdot\Vert)$ | Bracketing number: minimum $\varepsilon$-brackets $[\ell, u]$ with $\Vert u - \ell \Vert \le \varepsilon$ to cover $\mathcal{F}$ | §32.6 Def 10 | New |
| $\mathbb{B}^\circ$, $\mathbb{B}_F = \mathbb{B}^\circ \circ F$ | Standard Brownian bridge on $[0, 1]$; $F$-Brownian bridge | §32.5 Def 8 | Topic 29 §29.8 used $\mathbb{B}^\circ$ without defining it; §32.5 Def 8 discharges |
| $\rightsquigarrow$ | Weak convergence in $\ell^\infty(\mathcal{F})$ (Hoffmann–Jørgensen sense) | §32.3 Def 5 | Distinct from $\xrightarrow{d}$ (Topic 9 §9.3), which is finite-dim weak convergence |
| $\phi'_\theta$ | Hadamard derivative of a functional $\phi$ at $\theta$, tangential to $\mathcal{D}_0$ | §32.7 Def 12 | Analog of the ordinary derivative on a Banach space |
| $\mathrm{IC}(x)$ | Influence function $\mathrm{IC}(x) = \phi'_\theta(\delta_x - P)$ of a functional at $P$ | §32.7 Def 13 | Robust-statistics terminology (Hampel 1974); central to semiparametric efficiency |
| $J_{[\,]}(\delta, \mathcal{F}, L^2(P))$ | Bracketing integral $\int_0^\delta \sqrt{\log N_{[\,]}(\varepsilon, \mathcal{F}, L^2(P))}\,d\varepsilon$ | §32.6 Def 11 | Dudley 1978 entropy integral; measures Donsker-ness of a class |
| $\mathbb{P}^*$, $\mathbb{E}^*$ | Outer probability / expectation (for non-measurable suprema) | §32.3 Def 4 | Hoffmann–Jørgensen 1991; technical device |
| $P_n^\xi f = n^{-1}\sum_i \xi_i f(X_i)$ | Rademacher process: $\xi_i$ iid $\pm 1$ symmetric | §32.4 Step 2 | Pollard 1984 Ch. II; heart of the symmetrisation argument |
| $\perp\!\!\!\perp$ | Probabilistic independence | Topic 16 §16.9 lock | Used when comparing $\mathbb{G}_n$ and its independent ghost copy in §32.4 |
| $\mathcal{F}_{\mathrm{cdf}}$ | Distribution-function class $\{\mathbf{1}_{(-\infty, t]} : t \in \mathbb{R}\}$ | §32.2 Ex 3 | Topic 29 §29.8's object of study, lifted to the $\mathcal{F}$-indexed framework |
| $\mathcal{F}$-GC, $\mathcal{F}$-Donsker | Qualifiers: class is Glivenko–Cantelli / Donsker for $P$ | §32.4 Def 7; §32.6 Def 11 | vdV–Wellner terminology; indicates the analytic condition on $\mathcal{F}$ |
| $\mathbb{D}$, $\mathbb{E}$ | Banach spaces; domain / codomain of the functional $\phi$ in §32.7 | §32.7 Thm 5 statement | Standard vdV2000 §20.6 notation |
| $T_n \in \mathbb{D}_\phi \subseteq \mathbb{D}$ | Random element; $\mathbb{D}_\phi$ = domain of $\phi$ inside $\mathbb{D}$ | §32.7 Thm 5 | Keeps the Hadamard-diff theorem statement clean |
| $r_n$ | Scaling rate (usually $\sqrt n$) in the functional delta-method statement | §32.7 Thm 5 | Generalises the $\sqrt n$ in the scalar delta method (Topic 13 §13.5) |

32.4 Glivenko–Cantelli for VC classes

The question §32.2 Rem 3 left hanging: which function classes admit a uniform LLN? Answer — classes with finite VC dimension, a combinatorial complexity measure due to Vapnik and Chervonenkis (1971) that captures how many $\{0, 1\}^n$-labellings of an $n$-point set the class can realise. Rich classes realise all $2^n$ labellings at every $n$ and fail the uniform LLN; VC-finite classes realise only polynomially many, and the exponential tail of Hoeffding's inequality (Topic 12) wins against that polynomial with room to spare. The resulting theorem — Glivenko–Cantelli for VC classes — goes by a less formal name in the ML literature: the fundamental theorem of statistical learning. See Rem 9.

Definition 6 VC dimension

Let $\mathcal{F}$ be a class of $\{0, 1\}$-valued functions on $\mathcal{X}$. For any finite set $\{x_1, \dots, x_n\} \subseteq \mathcal{X}$, the restriction of $\mathcal{F}$ to this set is the finite set of labellings $\{(f(x_1), \dots, f(x_n)) : f \in \mathcal{F}\} \subseteq \{0, 1\}^n$. The shatter coefficient is

$$s_{\mathcal{F}}(n) \;:=\; \sup_{x_1, \dots, x_n \in \mathcal{X}} \bigl|\{(f(x_1), \dots, f(x_n)) : f \in \mathcal{F}\}\bigr|.$$

A set $\{x_1, \dots, x_n\}$ is shattered by $\mathcal{F}$ if its restriction has cardinality $2^n$ — every $\{0,1\}^n$-labelling is realised. The VC dimension of $\mathcal{F}$ is the largest $n$ such that some $n$-point set is shattered:

$$\mathrm{VC}(\mathcal{F}) \;:=\; \sup\{n : s_{\mathcal{F}}(n) = 2^n\}.$$

If $s_{\mathcal{F}}(n) = 2^n$ for every $n$, the class has infinite VC dimension (and is too rich to be Glivenko–Cantelli).

Definition 7 Glivenko–Cantelli class

A class $\mathcal{F}$ is a Glivenko–Cantelli class for $P$ (or $P$-GC) if

$$\|P_n - P\|_\mathcal{F} \;=\; \sup_{f \in \mathcal{F}} |P_n f - P f| \;\xrightarrow{\text{a.s.}}\; 0 \qquad \text{as } n \to \infty.$$

Equivalently: the uniform LLN holds on $\mathcal{F}$ under $P$. Thm 2 below says that any VC-finite class is $P$-GC for every $P$ — uniformity is both over $f \in \mathcal{F}$ and over the distribution. That distribution-free property is what makes the result useful in ML, where we never know $P$.

Theorem 1 Sauer–Shelah–Vapnik (stated)

Let $\mathcal{F}$ have VC dimension $V < \infty$. Then for every $n \ge 1$,

$$s_{\mathcal{F}}(n) \;\le\; \sum_{k=0}^{V} \binom{n}{k} \;\le\; (n + 1)^V.$$

The shatter coefficient — which a priori could grow as fast as $2^n$ — is forced into polynomial growth in $n$ the moment the VC dimension is finite. This is the pigeonhole miracle that makes §32.4 Thm 2 work: the finite-class bound in Step 4 of the proof below only costs a factor of $s_{\mathcal{F}}(n) \le (n+1)^V$, not $2^n$.

Proof (sketch). The classical argument is by induction on the size of the sample set, with the inductive step carefully tracking which subsets remain shattered under a one-point extension. Full proof in SAU1972 (3 pages, self-contained); independently in SHE1972 and VAP1971. We omit it here — the proof is elegant but the combinatorics don’t carry over to Donsker or the delta method, so it doesn’t earn its real estate in a topic whose mathematical through-line is fidi+tightness.
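To see the polynomial-vs-exponential gap concretely, a quick numeric comparison of the Sauer bound, its polynomial majorant, and the unrestricted $2^n$ count (a sketch; $V = 3$ matches the halfspaces-in-$\mathbb{R}^2$ example of Figure 3):

```python
from math import comb

def sauer_bound(n: int, V: int) -> int:
    """Sauer-Shelah-Vapnik bound sum_{k=0}^{V} C(n, k) on the shatter coefficient."""
    return sum(comb(n, k) for k in range(V + 1))

V = 3
for n in [5, 10, 20, 40]:
    print(f"n={n:3d}  Sauer={sauer_bound(n, V):7d}  "
          f"(n+1)^V={(n + 1) ** V:7d}  2^n={2 ** n}")
```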


Figure 3. Three classes, three VC dimensions. (a) Halflines: $\mathrm{VC} = 1$. (b) Halfspaces in $\mathbb{R}^2$: $\mathrm{VC} = 3$; Radon's theorem (every 4 points in $\mathbb{R}^2$ split into two disjoint subsets whose convex hulls meet) produces the 4-point obstruction. (c) Axis-aligned rectangles: $\mathrm{VC} = 4$; the diamond witness is drawn. Interactive exploration in VCShatteringDemo below.

Theorem 2 Glivenko–Cantelli for VC classes (Vapnik–Chervonenkis 1971)

Let $\mathcal{F}$ be a class of $\{0, 1\}$-valued measurable functions on $(\mathcal{X}, \mathcal{A})$ with finite VC dimension $V := \mathrm{VC}(\mathcal{F})$. Then, for any probability measure $P$ on $(\mathcal{X}, \mathcal{A})$,

$$\sup_{f \in \mathcal{F}} |P_n f - P f| \;\xrightarrow{\text{a.s.}}\; 0 \qquad \text{as } n \to \infty.$$

Proof 1

The argument is a five-step chain: symmetrize (replace $P$ by an independent ghost sample), randomize (multiply by Rademacher signs), reduce to a finite class (Sauer–Shelah–Vapnik), bound the finite class (Hoeffding + union bound), and integrate the exponential tail (Borel–Cantelli).

Fix $\varepsilon > 0$ and let $\|\cdot\|_{\mathcal{F}}$ denote the sup-norm $\sup_{f \in \mathcal{F}} |\cdot|$ on $\mathcal{F}$.

Step 1 — symmetrization. Introduce an independent ghost sample $X_1', \dots, X_n'$ with the same distribution $P$, and let $P_n' := n^{-1} \sum_{i=1}^n \delta_{X_i'}$. For any $\varepsilon > 0$ with $n \varepsilon^2 \geq 2$, the symmetrization lemma (vdV–Wellner Lem 2.3.7) gives

$$\Pr\bigl(\|P_n - P\|_{\mathcal{F}} > \varepsilon\bigr) \;\leq\; 2\,\Pr\bigl(\|P_n - P_n'\|_{\mathcal{F}} > \varepsilon / 2\bigr).$$

Step 2 — Rademacher randomization. Let $\xi_1, \dots, \xi_n$ be iid Rademacher ($\pm 1$ each with probability $1/2$), independent of all $X_i, X_i'$. By exchangeability of the $(X_i, X_i')$ pairs under the sign flip:

$$\Pr\bigl(\|P_n - P_n'\|_{\mathcal{F}} > \varepsilon / 2\bigr) \;\leq\; 2\,\Pr\Bigl(\Bigl\|\tfrac{1}{n}\sum_{i=1}^n \xi_i f(X_i)\Bigr\|_{\mathcal{F}} > \varepsilon / 4\Bigr).$$

Call the random function $f \mapsto n^{-1} \sum_i \xi_i f(X_i)$ the Rademacher process $P_n^\xi$.

Step 3 — reduce to a finite class. Condition on $X_1, \dots, X_n$. The restriction of $\mathcal{F}$ to $\{X_1, \dots, X_n\}$ — the set of distinct $\{0, 1\}^n$-valued vectors $(f(X_1), \dots, f(X_n))$ as $f$ ranges over $\mathcal{F}$ — has cardinality at most the shatter coefficient $s_{\mathcal{F}}(n)$. By Sauer–Shelah–Vapnik (Thm 1),

$$s_{\mathcal{F}}(n) \;\leq\; \sum_{k=0}^{V} \binom{n}{k} \;\leq\; (n + 1)^V \qquad \text{for all } n \geq V.$$

Step 4 — bound the finite class. For each fixed vector $(f(X_1), \dots, f(X_n)) \in \{0, 1\}^n$, $P_n^\xi f = n^{-1} \sum_i \xi_i f(X_i)$ is a bounded average of iid mean-zero $\pm 1$-valued random variables — specifically, each summand $\xi_i f(X_i) \in [-1, 1]$. Hoeffding's inequality (Topic 12 §12.3) gives, conditionally on $X_1, \dots, X_n$,

$$\Pr_\xi\bigl(|P_n^\xi f| > \varepsilon / 4 \,\big|\, X_{1:n}\bigr) \;\leq\; 2\,\exp\!\bigl(-n\,\varepsilon^2 / 32\bigr).$$

Union-bounding over the at-most-$s_{\mathcal{F}}(n)$ distinct $(f(X_i))_i$ vectors,

$$\Pr_\xi\bigl(\|P_n^\xi\|_{\mathcal{F}} > \varepsilon / 4 \,\big|\, X_{1:n}\bigr) \;\leq\; 2\,s_{\mathcal{F}}(n)\,\exp\!\bigl(-n\,\varepsilon^2 / 32\bigr) \;\leq\; 2(n + 1)^V\,\exp\!\bigl(-n\,\varepsilon^2 / 32\bigr).$$

Taking expectation over $X_{1:n}$ preserves the bound.

Step 5 — combine and apply Borel–Cantelli. Chaining the three inequalities (Step 1 × Step 2 × Step 4):

$$\Pr\bigl(\|P_n - P\|_{\mathcal{F}} > \varepsilon\bigr) \;\leq\; 8\,(n + 1)^V\,\exp\!\bigl(-n\,\varepsilon^2 / 32\bigr).$$

The right-hand side is summable in $n$ for every $V \geq 0$ and every $\varepsilon > 0$ (polynomial times exponential, with the exponential winning). Borel–Cantelli delivers $\|P_n - P\|_{\mathcal{F}} \leq \varepsilon$ eventually a.s.; since $\varepsilon > 0$ was arbitrary, $\|P_n - P\|_{\mathcal{F}} \to 0$ a.s.

$\blacksquare$ — using VAP1971, SAU1972, vdV–Wellner 1996 Lem 2.3.7 (symmetrization), and the Hoeffding bound from Topic 12 §12.3.
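Evaluating the Step 5 tail bound directly shows the race numerically (a sketch; the values of $V$, $\varepsilon$, and $n$ are illustrative). The bound is vacuous at moderate $n$ and collapses once $n\varepsilon^2$ dominates $V \log n$:

```python
import numpy as np

def vc_tail_bound(n: int, V: int, eps: float) -> float:
    """Step 5 bound: Pr(sup_f |P_n f - P f| > eps) <= 8 (n+1)^V exp(-n eps^2 / 32)."""
    return 8 * (n + 1) ** V * np.exp(-n * eps ** 2 / 32)

for n in [10**4, 10**5, 10**6, 10**7]:
    print(f"n=1e{int(np.log10(n))}  bound={vc_tail_bound(n, V=3, eps=0.05):.3e}")
```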

Example 6 Halfspaces in $\mathbb{R}^d$ have VC dimension $d + 1$

Consider $\mathcal{F} = \{\mathbf{1}\{w \cdot x \le b\} : w \in \mathbb{R}^d,\, b \in \mathbb{R}\}$, the class of halfspace indicators in $\mathbb{R}^d$. Take any $d + 1$ points in general position (affinely independent, so they do not lie on a common affine hyperplane) — these shatter: affine independence means that for any target labelling one can solve for $(w, b)$ making $\mathrm{sign}(w \cdot x_i - b)$ match any prescribed $\pm 1$ pattern. So $\mathrm{VC}(\mathcal{F}) \ge d + 1$.

The matching upper bound is Radon's theorem: any $d + 2$ points in $\mathbb{R}^d$ admit a partition into two subsets whose convex hulls intersect. Pick such a partition and target the labelling $+1$ for one subset, $-1$ for the other; a separating halfspace would have to put the two convex hulls on opposite sides, which is impossible because they share a point. No halfspace realises this labelling, so $d + 2$ points never shatter, and $\mathrm{VC}(\mathcal{F}) = d + 1$.

Example 7 Axis-aligned rectangles in $\mathbb{R}^2$ have VC dimension 4

The class $\mathcal{F} = \{\mathbf{1}\{a_1 \le x_1 \le b_1,\ a_2 \le x_2 \le b_2\}\}$ of axis-aligned rectangles in $\mathbb{R}^2$ shatters the four-point diamond $\{(0, 1), (1, 0), (0, -1), (-1, 0)\}$: for any target labelling, put the tightest axis-aligned rectangle around the "+1"-labelled points — the four-point geometry guarantees no "−1" point lies inside. So $\mathrm{VC}(\mathcal{F}) \ge 4$.

Now take any $5$ points. Form the tightest axis-aligned bounding box around them. Each of the four sides of this box is "supported" by at least one of the five points (the side coincides with an extremal $x$- or $y$-coordinate); by pigeonhole at least one point is not extremal on any side. Label the non-extremal point $-1$ and the four extremal points $+1$. Any rectangle containing the four $+1$-points must contain their axis-aligned bounding box, which contains the non-extremal point — so it labels the $-1$-point as $+1$. No rectangle realises this labelling, and $\mathrm{VC}(\mathcal{F}) = 4$.
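The diamond-shattering half of the example is small enough to verify exhaustively: 16 labellings, each tested with the tightest-box criterion used in the argument above (a sketch in NumPy; the helper name is ours):

```python
from itertools import product

import numpy as np

pts = np.array([(0, 1), (1, 0), (0, -1), (-1, 0)])   # the diamond witness

def realisable(labels: tuple) -> bool:
    """Does some axis-aligned rectangle contain exactly the +1-labelled points?"""
    mask = np.array(labels) == 1
    if not mask.any():
        return True                                   # a rectangle away from all points
    # the tightest box around the +1 points is the only candidate worth checking:
    # any rectangle containing them contains this box
    lo, hi = pts[mask].min(axis=0), pts[mask].max(axis=0)
    inside = np.all((pts >= lo) & (pts <= hi), axis=1)
    return np.array_equal(inside, mask)

print(all(realisable(lab) for lab in product([0, 1], repeat=4)))   # True: shattered
```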

Example 8 ERM uniform-convergence bound (discharges LLN §10.7 Ex 8)

Topic 10 §10.7 Ex 8 asked: is Glivenko–Cantelli what makes ERM work? and left the answer as a forward-pointer to "empirical-process theory (coming in the Statistical Inference track)". Thm 2 discharges that IOU. Concretely: fix a loss $\ell$ and a hypothesis class $\Theta$ such that the induced loss class $\{\ell(\cdot; \theta) : \theta \in \Theta\}$ has VC dimension $V$. Let $\hat\theta_n = \arg\min_\theta \hat R_n(\theta)$ and $\theta^\star = \arg\min_\theta R(\theta)$. Feeding Thm 2's high-probability bound through the ERM triangle inequality gives

$$R(\hat\theta_n) - R(\theta^\star) \;\le\; 2 \sup_{\theta \in \Theta} |\hat R_n(\theta) - R(\theta)| \;\le\; O\!\Bigl(\sqrt{V \log n / n}\,\Bigr)$$

with probability $\ge 1 - n^{-c}$ for any $c > 0$. The $\sqrt{V/n}$ scaling is the cleanest statement of ERM consistency the topic produces, and the Topic 10 IOU is paid in full.

Remark 7 Proof-strategy overview

The five-step proof above is a template, not a one-off. Symmetrisation → Rademacher randomisation → Sauer reduction → Hoeffding + union bound → Borel–Cantelli. Every VC-flavoured, PAC-flavoured, and Rademacher-complexity result in statistical learning theory reuses at least three of these five moves, usually in the same order. Once you've digested the Thm 2 proof, the PAC-Bayes bound is a symmetrisation + change-of-measure; Rademacher-complexity bounds are a symmetrisation + contraction; uniform convergence of risk is this proof with a generic loss class instead of $\{0,1\}$-indicators. The Hoeffding step is the only one that genuinely requires "$f$ is bounded"; everything else works for general $\mathcal{F}$, with Hoeffding replaced by Bernstein or Talagrand (§32.9).

Remark 8 VC extensions — out of scope but named

VC dimension is the binary-classification complexity measure; for multi-class problems the Natarajan dimension takes over, for real-valued regression classes it’s the pseudo-dimension or fat-shattering dimension, and for kernel methods Rademacher complexity gives sharper data-dependent rates that VC cannot. Each refinement costs more machinery but delivers the uniform-convergence payoff for a broader hypothesis class. All three are deferred to formalML’s generalization-bounds topic, which carries the full menu. §32.4 stays at classical binary VC because it’s the cleanest case and the one that carries the rest of the topic — the symmetrisation + Sauer combo is the reusable pattern.

Remark 9 The fundamental theorem of statistical learning

Thm 2 combined with Ex 8 is what Shalev-Shwartz and Ben-David (2014, Ch. 6) call the fundamental theorem of statistical learning: a binary hypothesis class $\Theta$ is PAC-learnable if and only if it has finite VC dimension, and the sample complexity of learning it is $\Theta(V / \varepsilon^2)$ up to log factors. The iff is striking — finite VC is both necessary and sufficient for uniform generalisation — and it's the cleanest structural result in statistical learning theory. Topic 32's most visible ML payoff is this theorem; everything else in the topic either builds toward it (§§32.1–32.3) or lifts it to richer settings (§§32.5–32.9). formalML picks up from here and develops neural-net-scale generalisation bounds where classical VC fails and Rademacher complexity takes over.

32.5 Donsker's theorem for $\mathcal{F}_{\mathrm{cdf}}$

Donsker's theorem is the featured result of Topic 32 and the curriculum's single biggest forward-promise payoff. It says that $\sqrt n (F_n - F)$, viewed as a random function on $\mathbb{R}$, converges weakly in $\ell^\infty(\mathbb{R})$ to an $F$-indexed Brownian bridge $\mathbb{B}_F$. Topic 29 §29.8 Proof 3 took this as given to derive the Kolmogorov limit; here we pay the proof in full. Read the statement, then take a moment with Figure 4 — watching ten $\mathbb{G}_n$ paths at $n = 2000$ overlay the Brownian-bridge reference paths is the best intuition for what "weak convergence in a function space" looks like before the rest of the proof lands. Then we prove it in four steps: reduce to the uniform case via the PIT, establish finite-dimensional convergence via the multivariate CLT, upgrade to tightness via a bracketing-number bound, and combine.

Definition 8 Standard Brownian bridge $\mathbb{B}^\circ$ and $F$-Brownian bridge $\mathbb{B}_F$

A standard Brownian bridge $\mathbb{B}^\circ$ is a centred Gaussian process on $[0, 1]$ with covariance function

$$\mathrm{Cov}\bigl(\mathbb{B}^\circ(s), \mathbb{B}^\circ(t)\bigr) \;=\; s \wedge t - s t, \qquad s, t \in [0, 1],$$

and boundary conditions $\mathbb{B}^\circ(0) = \mathbb{B}^\circ(1) = 0$ almost surely. (The name "bridge" captures the latter: it's a Brownian motion pinned at both endpoints.) Given a continuous CDF $F$ on $\mathbb{R}$, the $F$-Brownian bridge is the process $\mathbb{B}_F : \mathbb{R} \to \mathbb{R}$ defined by

$$\mathbb{B}_F(t) \;:=\; \mathbb{B}^\circ(F(t)),$$

a centred Gaussian process with covariance $\mathrm{Cov}(\mathbb{B}_F(s), \mathbb{B}_F(t)) = F(s \wedge t) - F(s)F(t)$. Both processes have continuous sample paths a.s. (Lem 3 below).

Lemma 3 Brownian-bridge construction (stated; BIL1999 §14)

Let $W$ be a standard Brownian motion on $[0, 1]$. The process $\mathbb{B}^\circ(t) := W(t) - t\,W(1)$ is a standard Brownian bridge: centred Gaussian, covariance $s \wedge t - st$, and $\mathbb{B}^\circ(0) = \mathbb{B}^\circ(1) = 0$. Continuous sample paths on $[0, 1]$ a.s. are inherited from $W$'s continuity. Full construction and verification in BIL1999 §14. The brownianBridgePath helper in nonparametric.ts §17 implements the discretised version used by EmpiricalProcessExplorer's reference overlay.
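A Python rendering of the same discretisation (a sketch of what a helper like brownianBridgePath computes, not the site's actual TypeScript), with a Monte Carlo check of the covariance $s \wedge t - st$:

```python
import numpy as np

def brownian_bridge_path(m: int, rng: np.random.Generator) -> np.ndarray:
    """Discretised standard Brownian bridge on m+1 grid points via B(t) = W(t) - t W(1)."""
    t = np.linspace(0.0, 1.0, m + 1)
    # Brownian motion at the grid points: cumulative sum of N(0, 1/m) increments
    w = np.concatenate([[0.0], np.cumsum(rng.normal(scale=np.sqrt(1 / m), size=m))])
    return w - t * w[-1]                               # pins both endpoints at zero

rng = np.random.default_rng(4)
paths = np.array([brownian_bridge_path(1_000, rng) for _ in range(5_000)])
# Cov(B(0.3), B(0.7)) should be close to 0.3 - 0.3 * 0.7 = 0.09
print(np.cov(paths[:, 300], paths[:, 700])[0, 1])
```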

Theorem 3 Donsker's theorem (Donsker 1952; vdV2000 §19.1 Thm 19.5)

Let $X_1, X_2, \dots$ be iid with continuous CDF $F$, and let $F_n$ denote the empirical CDF. Then

$$\mathbb{G}_n \;:=\; \sqrt n\,(F_n - F) \;\rightsquigarrow\; \mathbb{B}_F \qquad \text{in } \ell^\infty(\mathbb{R}),$$

where $\mathbb{B}_F := \mathbb{B}^\circ \circ F$ and $\mathbb{B}^\circ$ is a standard Brownian bridge on $[0, 1]$.

Proof 2

The proof has four ingredients: reduce to the uniform case via the probability-integral transform, establish finite-dimensional convergence via the multivariate CLT, upgrade to weak convergence in $\ell^\infty$ via tightness — here, stochastic equicontinuity controlled by a bracketing-entropy bound on $\mathcal{F}_{\mathrm{cdf}}$ — and finally combine.

Step 1 — reduce to uniform. Define $U_i := F(X_i)$ for $i = 1, 2, \dots$. Since $F$ is continuous, $U_i \sim \mathrm{Uniform}(0, 1)$, iid. Let $G_n$ denote the empirical CDF of $\{U_i\}_{i=1}^n$ on $[0, 1]$. For any $x \in \mathbb{R}$, $G_n(F(x)) = F_n(x)$, so the map $x \mapsto F(x)$ transports $(F_n - F)(x)$ to $(G_n - \mathrm{Id})(F(x))$. If we establish $\sqrt n\,(G_n - \mathrm{Id}) \rightsquigarrow \mathbb{B}^\circ$ in $\ell^\infty([0, 1])$, then composition with the continuous map $F$ gives $\mathbb{G}_n \rightsquigarrow \mathbb{B}^\circ \circ F$ in $\ell^\infty(\mathbb{R})$ by the continuous mapping theorem (Topic 9 §9.5). Henceforth assume $F = \mathrm{Uniform}(0, 1)$ and write $\mathbb{G}_n(t) := \sqrt n (G_n(t) - t)$ for $t \in [0, 1]$.

Step 2 — finite-dimensional convergence. Fix $k \geq 1$ and $0 \leq t_1 < t_2 < \dots < t_k \leq 1$. For each $n$, the vector $(\mathbb{G}_n(t_1), \dots, \mathbb{G}_n(t_k))$ equals $n^{-1/2} \sum_{i=1}^n Y_i$, where $Y_i = (\mathbf{1}\{U_i \leq t_j\} - t_j)_{j=1}^k$. The $Y_i$ are iid, mean zero, with covariance $\Sigma_{jj'} = \mathrm{Cov}(\mathbf{1}\{U \leq t_j\}, \mathbf{1}\{U \leq t_{j'}\}) = t_{j \wedge j'} - t_j t_{j'}$. By the multivariate CLT (Topic 11 §11.5 Cor 2),

$$\bigl(\mathbb{G}_n(t_1), \dots, \mathbb{G}_n(t_k)\bigr) \;\rightsquigarrow\; \mathcal{N}_k(0, \Sigma) \qquad \text{as } n \to \infty.$$

The covariance matrix $\Sigma$ is the covariance of $(\mathbb{B}^\circ(t_1), \dots, \mathbb{B}^\circ(t_k))$. Hence the finite-dimensional distributions of $\mathbb{G}_n$ converge to those of $\mathbb{B}^\circ$.

Step 3 — tightness via bracketing-integral control. For weak convergence in $\ell^\infty([0, 1])$, finite-dim convergence combined with asymptotic tightness of $\{\mathbb{G}_n\}$ is necessary and sufficient (vdV–Wellner Thm 1.5.4). Asymptotic tightness for a random map in $\ell^\infty([0, 1])$ is equivalent, on a separable index set, to stochastic equicontinuity: for every $\eta, \varepsilon > 0$ there is $\delta > 0$ with

$$\limsup_{n \to \infty} \Pr\!\Bigl(\sup_{|s - t| < \delta} |\mathbb{G}_n(s) - \mathbb{G}_n(t)| > \eta\Bigr) \;<\; \varepsilon.$$

We establish stochastic equicontinuity by bounding the $L^2$-bracketing number of $\mathcal{F}_{\mathrm{cdf}}$ and invoking the bracketing-integral equicontinuity inequality (vdV–Wellner Thm 2.5.6).

For the $\mathcal{F}_{\mathrm{cdf}}$ class $\{\mathbf{1}_{(-\infty, t]} : t \in [0, 1]\}$ under $P = \mathrm{Uniform}(0, 1)$: given $\varepsilon > 0$, choose a grid $0 = s_0 < s_1 < \dots < s_m = 1$ with $s_j - s_{j-1} \leq \varepsilon^2$ (so $m \leq \lceil \varepsilon^{-2} \rceil$). Define brackets $[\mathbf{1}_{(-\infty, s_{j-1}]}, \mathbf{1}_{(-\infty, s_j]}]$; every $\mathbf{1}_{(-\infty, t]}$ with $t \in [s_{j-1}, s_j]$ lies inside the $j$-th bracket. Each bracket has $L^2(P)$-width $\sqrt{s_j - s_{j-1}} \leq \varepsilon$. Hence

$$N_{[\,]}\bigl(\varepsilon, \mathcal{F}_{\mathrm{cdf}}, L^2(P)\bigr) \;\leq\; \lceil \varepsilon^{-2} \rceil.$$

The bracketing integral is therefore

$$J_{[\,]}(1, \mathcal{F}_{\mathrm{cdf}}, L^2(P)) \;=\; \int_0^1 \sqrt{\log N_{[\,]}(\varepsilon, \mathcal{F}_{\mathrm{cdf}}, L^2(P))}\, d\varepsilon \;\leq\; \int_0^1 \sqrt{\log(\varepsilon^{-2} + 1)}\, d\varepsilon \;<\; \infty.$$

Dudley's bracketing-integral equicontinuity inequality (vdV–Wellner Cor 2.5.2) yields, for every $\eta > 0$ and every $\delta > 0$,

$$\Pr\!\Bigl(\sup_{|s - t| < \delta} |\mathbb{G}_n(s) - \mathbb{G}_n(t)| > \eta\Bigr) \;\leq\; \frac{C}{\eta}\,J_{[\,]}\!\bigl(\delta, \mathcal{F}_{\mathrm{cdf}}, L^2(P)\bigr)$$

for a universal constant $C$ and all $n$ large enough. Since $J_{[\,]}(\delta, \mathcal{F}_{\mathrm{cdf}}, L^2(P)) \to 0$ as $\delta \to 0$, the equicontinuity condition holds. Hence $\{\mathbb{G}_n\}$ is asymptotically tight.

Step 4 — combine. Finite-dim convergence (Step 2) + asymptotic tightness (Step 3) $\Rightarrow$ $\mathbb{G}_n \rightsquigarrow \mathbb{B}^\circ$ in $\ell^\infty([0, 1])$ (vdV–Wellner Thm 1.5.4). The limit $\mathbb{B}^\circ$ has continuous sample paths a.s. (Lem 3; BIL1999 §14), so convergence in $\ell^\infty([0, 1])$ coincides with convergence in $C([0, 1])$. Composition with the continuous $F$ (Step 1) transports to $\mathbb{G}_n \rightsquigarrow \mathbb{B}^\circ \circ F = \mathbb{B}_F$ in $\ell^\infty(\mathbb{R})$.

$\blacksquare$ — using DON1952, vdV–Wellner 1996 Thm 1.5.4 (finite-dim + tight ⇒ weak), Thm 2.5.6 (bracketing-integral equicontinuity), BIL1999 §14 (Brownian-bridge continuity), and Topic 11 §11.5 Cor 2 (multivariate CLT).

Corollary (Kolmogorov limit — discharges Topic 29 §29.8 IOU). Applying the continuous sup-norm functional to both sides of Donsker’s theorem yields

$$\sqrt n \,\|F_n - F\|_\infty \;\xrightarrow{d}\; \sup_{t \in [0, 1]} |\mathbb{B}^\circ(t)| \;\sim\; K,$$

where $K$ is the classical Kolmogorov distribution, $\Pr(K \le x) = 1 - 2 \sum_{k \ge 1} (-1)^{k-1} e^{-2 k^2 x^2}$. Topic 29 §29.8 stated this limit with the Donsker step deferred; §32.5 Thm 3 now supplies that step, and the corollary is a one-line continuous-mapping application.
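A Monte Carlo confirmation of the corollary (a sketch; the sample size and replication count are arbitrary), comparing simulated quantiles of $\sqrt n\,\|F_n - F\|_\infty$ against the Kolmogorov distribution via SciPy's kstwobign:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, reps = 2_000, 4_000
i = np.arange(1, n + 1)
sups = np.empty(reps)
for r in range(reps):
    u = np.sort(rng.uniform(0, 1, size=n))
    sups[r] = np.sqrt(n) * max((i / n - u).max(), (u - (i - 1) / n).max())

for q in [0.50, 0.90, 0.95, 0.99]:
    print(f"q={q:.2f}  Monte Carlo={np.quantile(sups, q):.3f}  "
          f"Kolmogorov={stats.kstwobign.ppf(q):.3f}")
```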


Figure 4 — ★ featured. Donsker's theorem in three views. (a) Sample paths of $\sqrt n(F_n - F)$ at $n = 2000$ visually tighten onto Brownian-bridge reference paths. (b) The sup-norm of $\mathbb{G}_n$ is asymptotically Kolmogorov-distributed; the histogram aligns with the Kolmogorov PDF. (c) Quantile-alignment confirmation. Interactive exploration in EmpiricalProcessExplorer.

Example 9 KS limit via CMT — recapitulates §29.8

Apply the continuous mapping theorem (Topic 9 §9.5) with the sup-norm functional $\phi : \ell^\infty(\mathbb{R}) \to \mathbb{R}$, $\phi(z) = \|z\|_\infty = \sup_{t \in \mathbb{R}} |z(t)|$, which is continuous in the $\ell^\infty$-topology. Thm 3 gives $\mathbb{G}_n \rightsquigarrow \mathbb{B}_F$, so

$$\sqrt n \,\|F_n - F\|_\infty \;=\; \phi(\mathbb{G}_n) \;\xrightarrow{d}\; \phi(\mathbb{B}_F) \;=\; \sup_{t \in \mathbb{R}} |\mathbb{B}_F(t)|.$$

Since $\mathbb{B}_F(t) = \mathbb{B}^\circ(F(t))$ and $F$ is a continuous CDF onto $[0, 1]$, $\sup_t |\mathbb{B}_F(t)| = \sup_{u \in [0, 1]} |\mathbb{B}^\circ(u)|$, which by BIL1999 §12 has the Kolmogorov distribution $K$ with $\Pr(K \le x) = 1 - 2 \sum_{k \ge 1} (-1)^{k-1} e^{-2 k^2 x^2}$. This is the Kolmogorov limit Topic 29 §29.8 deferred — now paid via a one-line CMT application to the just-proven Donsker.

Example 10 Cramér–von Mises as a continuous functional of $\mathbb{G}_n$

Apply CMT with $\phi(z) = \int z(t)^2\, dF(t)$, a continuous $\ell^\infty(\mathbb{R}) \to \mathbb{R}$ functional. Donsker gives

$$n \int (F_n - F)^2\, dF \;=\; \phi(\mathbb{G}_n) \;\xrightarrow{d}\; \phi(\mathbb{B}_F) \;=\; \int_\mathbb{R} \mathbb{B}_F(t)^2\, dF(t) \;\stackrel{d}{=}\; \int_0^1 \mathbb{B}^\circ(u)^2\, du$$

(the second equality by the substitution $u = F(t)$). The right-hand side is the integrated squared Brownian bridge, which by a Karhunen–Loève eigen-expansion equals $\sum_{k \ge 1} \chi^2_{1, k} / (k \pi)^2$ with iid $\chi^2_1$ components. A weighted-$\chi^2$ limit — not Gaussian — despite every finite evaluation of $\mathbb{G}_n$ being asymptotically Gaussian. §32.8 builds the Cramér–von Mises test on exactly this limit.
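Both sides of the limit are easy to simulate (a sketch; the computational form $\frac{1}{12n} + \sum_i (u_{(i)} - \frac{2i-1}{2n})^2$ of the left side for uniform samples is the standard Cramér–von Mises identity, and the Karhunen–Loève series is truncated at $K$ terms):

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps, K = 1_000, 4_000, 200
i = np.arange(1, n + 1)

# left side: the Cramer-von Mises statistic n * int (F_n - F)^2 dF, Uniform(0,1) samples
cvm = np.array([1 / (12 * n)
                + ((np.sort(rng.uniform(0, 1, n)) - (2 * i - 1) / (2 * n)) ** 2).sum()
                for _ in range(reps)])

# right side: truncated Karhunen-Loeve series sum_k chi^2_1 / (k pi)^2
kl = (rng.chisquare(1, size=(reps, K)) / (np.arange(1, K + 1) * np.pi) ** 2).sum(axis=1)

print("means     :", cvm.mean().round(4), kl.mean().round(4))   # both near 1/6
print("90th pctl :", np.quantile(cvm, 0.9).round(3), np.quantile(kl, 0.9).round(3))
```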

Remark 10 Fidi + tightness: the two-ingredient structure

vdV–Wellner Thm 1.5.4 turns weak convergence in $\ell^\infty(\mathcal{F})$ into a pair of necessary-and-sufficient conditions: (i) finite-dimensional convergence of $(\mathbb{G}_n(f_1), \dots, \mathbb{G}_n(f_k))$ for every finite $\{f_1, \dots, f_k\} \subset \mathcal{F}$, and (ii) asymptotic tightness of the sequence in $\ell^\infty(\mathcal{F})$. Every empirical-process weak-convergence proof — ours, the bootstrap empirical process of §32.8 Thm 8, the local Donsker theorems of formalML's semiparametric-inference — factors the argument this way. The fidi piece is always "multivariate CLT"; the tightness piece is always "some entropy bound + chaining". You'll see the pattern repeat.

Remark 11 The Bahadur-linearization lift (explicit connection to §31.3)

Topic 31 §31.3's bootstrap-consistency proof (Bickel–Freedman) had a three-step structure: scalar CLT ⇒ pointwise CDF convergence ⇒ Polya's theorem upgrades pointwise to uniform (Kolmogorov distance). That is exactly the finite-dimensional shadow of the §32.5 Donsker proof you just read. Line up the three steps: §32.5 Step 2 (multivariate fidi CLT) is §31.3's scalar CLT lifted to $k$ dimensions; §32.5 Step 3 (bracketing-integral equicontinuity) is §31.3's Polya upgrade, lifted one level from "uniform in $x$" to "uniform over function-space neighbourhoods". The bootstrap is the Donsker-level resampling operation, restricted to a scalar functional. Naming this lift explicitly is the curriculum's most load-bearing structural retrospective — and §32.8 Thm 8 closes the loop by proving the bootstrap empirical process converges to the same Brownian-bridge limit as the original.

Remark 12 Why continuous $F$

Continuity of $F$ is load-bearing in Step 1: the probability-integral transform $U_i = F(X_i)$ is $\mathrm{Uniform}(0, 1)$ only when $F$ has no atoms. When $F$ jumps at some point $t_0$ (e.g., a mixed discrete-continuous distribution), $F_n(t_0^-) \ne F_n(t_0)$ a.s. and the path $t \mapsto \mathbb{G}_n(t)$ has a non-vanishing jump at $t_0$ — it can't converge uniformly to a continuous-path limit. The fix is to replace $\ell^\infty(\mathbb{R})$ with the Skorokhod space $D[0, 1]$ of càdlàg functions and use its $J_1$-topology, which allows "small time perturbations" that absorb the jump. All of Topic 32 stays inside the continuous-$F$ case; the Skorokhod-topology generalisation is standard (BIL1999 §12–13) and deferred to formalML's stochastic-processes topics.

32.6 Metric entropy and bracketing numbers

§32.5 proved Donsker for one specific function class, $\mathcal{F}_{\mathrm{cdf}}$, by counting "brackets" — pairs of functions that sandwich every element of the class to within $\varepsilon$ in $L^2(P)$. That trick works for far more classes than $\mathcal{F}_{\mathrm{cdf}}$ alone. Dudley's 1978 theorem says: if the bracketing numbers $N_{[\,]}(\varepsilon, \mathcal{F}, L^2(P))$ grow slowly enough that $\int_0^1 \sqrt{\log N_{[\,]}(\varepsilon, \mathcal{F}, L^2(P))}\,d\varepsilon < \infty$, then $\mathcal{F}$ is Donsker. The integral is the bracketing integral $J_{[\,]}$, and it's the general Donsker criterion — §32.5 Step 3 was this condition, specialised to $\mathcal{F}_{\mathrm{cdf}}$, evaluated in closed form. This section names the abstract version and works two more examples.

Definition 9 Covering number $N(\varepsilon, \mathcal{F}, \|\cdot\|)$

Given $\mathcal{F}$ and a norm $\|\cdot\|$ on functions $\mathcal{X} \to \mathbb{R}$, the $\varepsilon$-covering number $N(\varepsilon, \mathcal{F}, \|\cdot\|)$ is the smallest number of $\|\cdot\|$-balls of radius $\varepsilon$ needed to cover $\mathcal{F}$. Equivalently, it's the smallest $N$ such that $\mathcal{F}$ admits an $\varepsilon$-net $\{g_1, \dots, g_N\} \subseteq \mathcal{F}$ with $\min_k \|f - g_k\| \le \varepsilon$ for every $f \in \mathcal{F}$. The metric entropy is $\log N(\varepsilon, \mathcal{F}, \|\cdot\|)$. For VC classes the $L^2(P)$-covering number satisfies $\log N(\varepsilon, \mathcal{F}, L^2(P)) = O(V \log(1/\varepsilon))$ (vdV–Wellner Thm 2.6.7).

Definition 10 Bracketing number $N_{[\,]}(\varepsilon, \mathcal{F}, \|\cdot\|)$

A bracket $[\ell, u]$ is a pair of functions $\ell, u : \mathcal{X} \to \mathbb{R}$ with $\ell(x) \le u(x)$ pointwise for every $x$. It contains $f \in \mathcal{F}$ if $\ell(x) \le f(x) \le u(x)$ pointwise. The bracket's size under $\|\cdot\|$ is $\|u - \ell\|$. The $\varepsilon$-bracketing number is

$$N_{[\,]}(\varepsilon, \mathcal{F}, \|\cdot\|) \;:=\; \min\bigl\{N : \exists \text{ brackets } [\ell_1, u_1], \dots, [\ell_N, u_N] \text{ with size} \le \varepsilon \text{ covering } \mathcal{F}\bigr\}.$$

Bracketing is strictly stronger than covering: a cover uses any $\varepsilon$-close functions, but a bracket enforces the pointwise order $\ell \le f \le u$. The result is that $N \le N_{[\,]}$ always, and $N_{[\,]}$ can be infinite for classes whose $N$ is finite — the extra rigidity matters. The bracketing number is what the Donsker criterion (Thm 4 below) actually needs.

Definition 11 Bracketing integral and $P$-Donsker class

The bracketing integral of $\mathcal{F}$ at scale $\delta$ under $\|\cdot\|$ is

$$J_{[\,]}(\delta, \mathcal{F}, \|\cdot\|) \;:=\; \int_0^\delta \sqrt{\log N_{[\,]}(\varepsilon, \mathcal{F}, \|\cdot\|)}\,d\varepsilon.$$

A class $\mathcal{F}$ with a square-integrable envelope is $P$-Donsker if $\mathbb{G}_n \rightsquigarrow \mathbb{G}_P$ in $\ell^\infty(\mathcal{F})$, where $\mathbb{G}_P$ is a tight Gaussian process on $\mathcal{F}$ with covariance $\mathrm{Cov}(\mathbb{G}_P(f), \mathbb{G}_P(g)) = P(fg) - (Pf)(Pg)$ — the $P$-Brownian bridge indexed by $\mathcal{F}$, generalising §32.5's $\mathbb{B}_F$ from $\mathcal{F}_{\mathrm{cdf}}$ to arbitrary $\mathcal{F}$. Thm 4 gives the sufficient condition for Donsker-ness in terms of $J_{[\,]}$.

Theorem 4 Dudley 1978 — bracketing-integral sufficient condition (stated)

Suppose $\mathcal{F}$ has a square-integrable envelope under $P$ and

$$J_{[\,]}\bigl(\infty, \mathcal{F}, L^2(P)\bigr) \;=\; \int_0^\infty \sqrt{\log N_{[\,]}(\varepsilon, \mathcal{F}, L^2(P))}\,d\varepsilon \;<\; \infty.$$

Then $\mathcal{F}$ is $P$-Donsker: $\mathbb{G}_n \rightsquigarrow \mathbb{G}_P$ in $\ell^\infty(\mathcal{F})$.

Proof (deferred). The argument is the general version of §32.5 Step 3 — symmetrisation + chaining against the bracketing-integral bound. Full proof in vdV–Wellner Thm 2.5.6 (~10 pages of careful measurability housekeeping plus three lemmas). We omit it because the §32.5 specialisation for Fcdf\mathcal{F}_{\mathrm{cdf}} conveyed the essential structure; the general proof adds machinery but no new ideas.


Figure 5. Bracketing geometry for $\mathcal{F}_{\mathrm{cdf}}$. Left: brackets at three resolutions; the number grows like $\varepsilon^{-2}$ (each refinement halves $\varepsilon$ and quadruples the count). Right: log-log confirmation of the slope-2 scaling. The bracketing integral $\int_0^1 \sqrt{\log(\varepsilon^{-2})}\,d\varepsilon$ is finite, so $\mathcal{F}_{\mathrm{cdf}}$ is Donsker — the calculation §32.5 Step 3 used inline.

Example 11 Bracketing number for Lipschitz functions on $[0, 1]$

Let F\mathcal{F} be the class of LL-Lipschitz functions on [0,1][0, 1] bounded by MM. Cover [0,1][0, 1] with an ε/L\varepsilon / L-grid and bound each function on each grid-cell by a piecewise-constant upper/lower pair: the resulting brackets have L2(Leb)L^2(\mathrm{Leb}) size ε\le \varepsilon, and there are at most (M/ε)L/ε(M / \varepsilon)^{L/\varepsilon} of them, giving

logN[](ε,F,L2(Leb))    (L/ε)log(M/ε).\log N_{[\,]}\bigl(\varepsilon, \mathcal{F}, L^2(\mathrm{Leb})\bigr) \;\lesssim\; (L / \varepsilon) \log(M / \varepsilon).

Then logN[](ε)ε1/2log(1/ε)\sqrt{\log N_{[\,]}(\varepsilon)} \lesssim \varepsilon^{-1/2} \sqrt{\log(1/\varepsilon)}, and 01ε1/2log(1/ε)dε<\int_0^1 \varepsilon^{-1/2} \sqrt{\log(1/\varepsilon)}\,d\varepsilon < \infty (gamma-function calculation). Thm 4 applies: Lipschitz-[0,1][0, 1] classes are Donsker under Lebesgue. This is the nonparametric-regression analogue of §32.5 — replace “ECDF in R\mathbb{R}” with “smoothed estimator of a Lipschitz regression function”, same empirical-process machinery.
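Before moving on, the entropy integral is cheap to check numerically. A minimal sketch (scipy assumed; the normalisation $L = M = 1$ is our illustrative choice, and the closed form $\sqrt{2\pi}$ follows from the substitution $\varepsilon = e^{-t}$):

```python
# Numerical check of Example 11's entropy integral, with illustrative L = M = 1:
# integrand sqrt((L/eps) * log(M/eps)); closed form sqrt(2*pi) via eps = e^{-t}.
from math import log, sqrt, pi
from scipy.integrate import quad

value, _ = quad(lambda eps: sqrt(log(1.0 / eps) / eps), 0.0, 1.0)
print(f"integral ≈ {value:.4f}   vs   sqrt(2*pi) ≈ {sqrt(2 * pi):.4f}")
```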

Example 12 Dudley's entropy integral worked for $\mathcal{F}_{\mathrm{cdf}}$

§32.5 Step 3 computed N[](ε,Fcdf,L2(P))ε2N_{[\,]}(\varepsilon, \mathcal{F}_{\mathrm{cdf}}, L^2(P)) \le \lceil \varepsilon^{-2} \rceil for P=Uniform(0,1)P = \mathrm{Uniform}(0, 1). Plugging into the bracketing integral:

J[](1,Fcdf,L2(P))  =  01logε2dε    012log(1/ε)dε  =  2π2  <  ,J_{[\,]}(1, \mathcal{F}_{\mathrm{cdf}}, L^2(P)) \;=\; \int_0^1 \sqrt{\log \lceil \varepsilon^{-2} \rceil}\,d\varepsilon \;\approx\; \int_0^1 \sqrt{2 \log(1/\varepsilon)}\,d\varepsilon \;=\; \frac{\sqrt{2\pi}}{2} \;<\; \infty,

using the gamma-function identity $\int_0^1 \sqrt{\log(1/\varepsilon)}\,d\varepsilon = \Gamma(3/2) = \sqrt{\pi}/2$ (substitute $\varepsilon = e^{-t}$). Finiteness confirmed — Thm 4 delivers §32.5 Thm 3 as a specialisation, which is exactly what we proved in full. The $\mathcal{F}_{\mathrm{cdf}}$ calculation is the cleanest worked example of Dudley’s criterion.
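The same one-liner confirms Example 12's arithmetic, using the smooth stand-in $\log N_{[\,]}(\varepsilon) \approx 2\log(1/\varepsilon)$ for the ceiling bound (the exact ceiling version is slightly larger but still finite; scipy assumed):

```python
# Example 12's bracketing integral with the smooth stand-in log N ≈ 2 log(1/eps).
from math import log, sqrt, pi
from scipy.integrate import quad

value, _ = quad(lambda eps: sqrt(2.0 * log(1.0 / eps)), 0.0, 1.0)
print(f"J ≈ {value:.4f}   vs   sqrt(2*pi)/2 ≈ {sqrt(2 * pi) / 2:.4f}")
```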

Remark 13 Covering vs bracketing — when they differ

Covering and bracketing numbers agree up to polynomial factors for VC classes (vdV–Wellner Thm 2.6.7 + Thm 2.7.11), but for general classes they can diverge dramatically — the bracketing number enforces a pointwise sandwich that covering does not. Why does the Donsker argument here need the stronger version? Because the chaining in Thm 4’s proof bounds the modulus of continuity of $\mathbb{G}_n$ using sups of differences, and sup-differences respect the bracketing structure (within a bracket, $|f - g| \le u - \ell$) but not the covering one (two functions close in $L^2$ can differ a lot pointwise). DUD1978 settled the bracketing route; a uniform-entropy (covering-number) Donsker criterion also exists (vdV–Wellner Thm 2.5.2), but bracketing is the natural complexity measure for the chaining argument sketched here, while plain covering suffices for Glivenko–Cantelli.

Remark 14 Bickel–Rosenblatt 1973 (discharges Topic 30 §30.5 Rem 15)

Topic 30 §30.5 Rem 15 forward-promised simultaneous confidence bands for the kernel density estimator f^h\hat f_h as “Bickel–Rosenblatt-type functional-CLT machinery”. The promised result is this: under regularity, an appropriately rescaled f^hf\hat f_h - f satisfies a functional CLT in \ell^\infty-topology on a compact interval, with a Gaussian limit process whose covariance depends on the kernel and the bandwidth. The proof uses Thm 4 applied to the kernel-smoothed function class {Kh(x):x[a,b]}\{K_h(\cdot - x) : x \in [a, b]\}, whose bracketing integral is finite for Lipschitz kernels. Full statement and proof in formalML’s density-estimation topic; naming it here discharges Topic 30’s IOU and connects §32.6’s abstract framework to a concrete ML-adjacent application.

Remark 15 Entropy-rate sharpness deferred

Thm 4 delivers Donsker but not the rate at which GnF\|\mathbb{G}_n\|_\mathcal{F} concentrates around its expectation — and the rate depends on the bracketing-entropy growth past the J[]<J_{[\,]} < \infty threshold. Talagrand’s sharp upper-lower-bound theory (TAL1996, GIN2016 Ch. 3) characterises exact rates in terms of metric-entropy integrals and generic-chaining functionals. For ML applications, the sharpness often matters: the gap between V/n\sqrt{V/n} and Vlogn/n\sqrt{V \log n / n} in generalisation bounds, for instance, is exactly this question. We defer the sharp-rate theory to formalML’s empirical-process-rates topic; Topic 32 stays at the Donsker-or-not level, which is the cleanest structural story.

32.7 The functional delta method (★ full proof)

So far we have GnBF\mathbb{G}_n \rightsquigarrow \mathbb{B}_F (§32.5) and, more generally, GnGP\mathbb{G}_n \rightsquigarrow \mathbb{G}_P for Donsker classes (§32.6). The natural follow-up question: if ϕ\phi is a functional of the CDF — the median, the variance, a quantile, a Cramér–von-Mises distance — does n(ϕ(Fn)ϕ(F))\sqrt n (\phi(F_n) - \phi(F)) have a limit distribution, and is there a clean way to read it off? The answer, for “smooth” ϕ\phi, is the functional delta method: n(ϕ(Fn)ϕ(F))ϕF(BF)\sqrt n (\phi(F_n) - \phi(F)) \rightsquigarrow \phi'_F(\mathbb{B}_F), where ϕF\phi'_F is the Hadamard derivative of ϕ\phi at FF. This is the function-space generalisation of Topic 13’s scalar delta method (n(g(Xˉn)g(μ))g(μ)N(0,σ2)\sqrt n(g(\bar X_n) - g(\mu)) \to g'(\mu)\,\mathcal{N}(0, \sigma^2)), with Hadamard-differentiability standing in for ordinary derivative — a looser notion that applies to far more statistical functionals than Fréchet differentiability.

Definition 12 Hadamard differentiability tangential to $\mathbb{D}_0$

Let D,E\mathbb{D}, \mathbb{E} be Banach spaces and D0D\mathbb{D}_0 \subseteq \mathbb{D} a subset (the “tangent set”). A map ϕ:DϕDE\phi : \mathbb{D}_\phi \subseteq \mathbb{D} \to \mathbb{E} is Hadamard-differentiable at θDϕ\theta \in \mathbb{D}_\phi tangentially to D0\mathbb{D}_0 if there exists a continuous linear map ϕθ:D0E\phi'_\theta : \mathbb{D}_0 \to \mathbb{E} (the Hadamard derivative) such that

ϕ(θ+tnhn)ϕ(θ)tn    ϕθ(h)in E\frac{\phi(\theta + t_n h_n) - \phi(\theta)}{t_n} \;\to\; \phi'_\theta(h) \qquad \text{in } \mathbb{E}

for every sequence tn0t_n \to 0 in R\mathbb{R} and every hnhh_n \to h in D0\mathbb{D}_0 with θ+tnhnDϕ\theta + t_n h_n \in \mathbb{D}_\phi eventually. The “tangential” qualifier is the key relaxation: we only need the limit along sequences that stay in D0\mathbb{D}_0, which in practice is the subspace of continuous sample paths where BF\mathbb{B}_F lives. Fréchet differentiability would require the limit along all sequences, which is too strong for most statistical functionals.

Definition 13 Influence function

For a statistical functional ϕ:DϕR\phi : \mathbb{D}_\phi \to \mathbb{R} with PDϕP \in \mathbb{D}_\phi, the influence function at PP is

ICP(x)  :=  ϕP(δxP),xX,\mathrm{IC}_P(x) \;:=\; \phi'_P(\delta_x - P), \qquad x \in \mathcal{X},

i.e., the Hadamard derivative evaluated at the “point-mass perturbation” $\delta_x - P$. The influence function measures how $\phi$ reacts to a tiny bit of probability mass added at $x$ (and a tiny bit subtracted, uniformly, to preserve normalisation). By linearity of $\phi'_P$, $\sqrt n\,(\phi(P_n) - \phi(P))$ is approximately $n^{-1/2}\sum_i \mathrm{IC}_P(X_i)$; since $\int \mathrm{IC}_P\,dP = \phi'_P\bigl(\int (\delta_x - P)\,dP(x)\bigr) = \phi'_P(0) = 0$, the influence function is mean-zero, so the asymptotic variance is $P(\mathrm{IC}_P^2) = \mathrm{Var}_P(\mathrm{IC}_P)$. The influence function is the primitive notion behind efficient-influence-function estimators in semiparametric theory — see Rem 17.
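The definition can be probed numerically with a finite-difference Gateaux derivative along the contamination line $(1-t)P + t\,\delta_x$. A minimal sketch (numpy assumed; the variance functional $\phi(P) = \operatorname{Var}_P(X)$ is our illustrative choice, with exact influence function $(x - \mu)^2 - \sigma^2$, and $P$ is stood in for by a large discrete sample):

```python
import numpy as np

rng = np.random.default_rng(0)
atoms = rng.normal(size=200_000)            # discrete stand-in for P = N(0, 1)
m1, m2 = atoms.mean(), (atoms**2).mean()

def phi(t, x):
    """Variance of the contaminated measure (1 - t) P + t delta_x."""
    e1 = (1 - t) * m1 + t * x
    e2 = (1 - t) * m2 + t * x**2
    return e2 - e1**2

t = 1e-6
for x in (-2.0, 0.0, 1.5):
    ic_fd = (phi(t, x) - phi(0.0, x)) / t   # Gateaux derivative along delta_x - P
    ic_exact = (x - m1)**2 - phi(0.0, x)    # (x - mu)^2 - sigma^2
    print(f"x = {x:+.1f}:  finite-diff IC ≈ {ic_fd:+.5f}   exact ≈ {ic_exact:+.5f}")
```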

Theorem 5 Functional delta method (vdV–Wellner §3.9; vdV2000 Thm 20.8)

Let D,E\mathbb{D}, \mathbb{E} be Banach spaces and ϕ:DϕDE\phi : \mathbb{D}_\phi \subseteq \mathbb{D} \to \mathbb{E} a map. Suppose ϕ\phi is Hadamard-differentiable at θDϕ\theta \in \mathbb{D}_\phi tangentially to a subset D0D\mathbb{D}_0 \subseteq \mathbb{D}, with derivative ϕθ:D0E\phi'_\theta : \mathbb{D}_0 \to \mathbb{E}. Let rnr_n \to \infty and Tn:ΩDϕT_n : \Omega \to \mathbb{D}_\phi be random elements with rn(Tnθ)Tr_n(T_n - \theta) \rightsquigarrow T in D\mathbb{D}, where TT is supported on D0\mathbb{D}_0. Then

rn(ϕ(Tn)ϕ(θ))    ϕθ(T)in E.r_n\bigl(\phi(T_n) - \phi(\theta)\bigr) \;\rightsquigarrow\; \phi'_\theta(T) \qquad \text{in } \mathbb{E}.
Proof 3

The argument is Skorokhod representation + Hadamard differentiability in one line, wrapped in measurability housekeeping.

Step 1 — transfer to almost-sure convergence. By the Skorokhod representation theorem (vdV2000 Thm 18.9; applicable because TT is tight in D\mathbb{D}), there exists a probability space (Ω~,F~,Pr~)(\tilde \Omega, \tilde {\mathcal{F}}, \tilde \Pr) and random elements T~n,T~\tilde T_n, \tilde T on it with T~n=drn(Tnθ)\tilde T_n \stackrel{d}{=} r_n(T_n - \theta) and T~=dT\tilde T \stackrel{d}{=} T, such that T~nT~\tilde T_n \to \tilde T a.s. in D\mathbb{D}. Since T~\tilde T is supported on D0\mathbb{D}_0, outside a Pr~\tilde \Pr-null set we have T~nT~\tilde T_n \to \tilde T with T~D0\tilde T \in \mathbb{D}_0.

Step 2 — apply Hadamard differentiability pointwise. Define T~n:=θ+rn1T~n\tilde T_n' := \theta + r_n^{-1} \tilde T_n; by construction T~n=dTn\tilde T_n' \stackrel{d}{=} T_n and T~nDϕ\tilde T_n' \in \mathbb{D}_\phi eventually a.s. Applying the Hadamard-differentiability definition with tn=rn10t_n = r_n^{-1} \to 0 and hn=T~nT~D0h_n = \tilde T_n \to \tilde T \in \mathbb{D}_0:

rn(ϕ(T~n)ϕ(θ))  =  ϕ(θ+rn1T~n)ϕ(θ)rn1  a.s.  ϕθ(T~).r_n\bigl(\phi(\tilde T_n') - \phi(\theta)\bigr) \;=\; \frac{\phi(\theta + r_n^{-1} \tilde T_n) - \phi(\theta)}{r_n^{-1}} \;\xrightarrow{\text{a.s.}}\; \phi'_\theta(\tilde T).

Step 3 — translate back. Since T~n=dTn\tilde T_n' \stackrel{d}{=} T_n and T~=dT\tilde T \stackrel{d}{=} T, and since a.s. convergence implies weak convergence (vdV2000 §2.1), we have

rn(ϕ(Tn)ϕ(θ))    ϕθ(T)in E.r_n\bigl(\phi(T_n) - \phi(\theta)\bigr) \;\rightsquigarrow\; \phi'_\theta(T) \qquad \text{in } \mathbb{E}.

\blacksquare — using vdV–Wellner 1996 Thm 3.9.4 and the Skorokhod representation theorem (vdV2000 Thm 18.9).

Theorem 6 Extended continuous mapping theorem (stated)

Let gn,g:DEg_n, g : \mathbb{D} \to \mathbb{E} be maps with gngg_n \to g uniformly on every compact subset of D\mathbb{D}. Suppose XnXX_n \rightsquigarrow X in D\mathbb{D} with XCgX \in C_g a.s., where CgC_g is the set of continuity points of gg. Then gn(Xn)g(X)g_n(X_n) \rightsquigarrow g(X) in E\mathbb{E}.

This strengthens Topic 9 §9.5’s CMT in two ways: gg is replaced by a sequence gng_n (useful when the functional depends on nn, e.g., studentisation), and the continuity requirement is only at the support of the limit. §32.8 Ex 15’s two-sample-KS argument uses Thm 6 in its full strength. Proof in vdV–Wellner Thm 1.11.1; we take it as given.

Two-panel functional-delta demonstration. Left: histogram of \sqrt n(\hat\xi_{0.5} - 0) across 5000 n=500 Normal replicates, overlaid with theoretical \mathcal{N}(0, \pi/2) density. Right: ten \mathbb{G}_n paths with the dashed influence function \mathrm{IC}(x) = -\mathrm{sign}(x)/(2\phi(0)) overlaid as a visual cue.

Figure 6. Functional delta method for the sample median. Left: the empirical sampling distribution of n\sqrt n times the sample-median deviation lines up with the theoretical Gaussian — asymptotic variance π/21.571\pi / 2 \approx 1.571 from the influence-function formula 1/(4f(0)2)1/(4 f(0)^2). Right: the influence function plotted alongside Gn\mathbb{G}_n paths shows visually how the delta method extracts the limit — it integrates against IC.

Interactive: FunctionalDeltaExplorer — the median functional $\phi(F) = F^{-1}(\tfrac12)$ on an observed Normal(0, 1) sample ($n = 100$), with 2000 Monte-Carlo replicates of $\sqrt n(\phi(F_n) - \phi(F))$ overlaid on the influence-function Gaussian. $\mathrm{IC}(x) = -\mathrm{sign}(x - m)/(2 f(m))$; asymptotic variance $1/(4 f(m)^2) = \pi/2 \approx 1.5708$ (Monte-Carlo: 1.5241).
Example 13 Quantile functional $\phi(F) = F^{-1}(p)$

Take ϕ(F)=F1(p)\phi(F) = F^{-1}(p) for fixed p(0,1)p \in (0, 1). At a continuous FF with positive density f(ξp)>0f(\xi_p) > 0 at its pp-quantile ξp=F1(p)\xi_p = F^{-1}(p), ϕ\phi is Hadamard-differentiable tangentially to C(R)C(\mathbb{R}) with derivative ϕF(h)=h(ξp)/f(ξp)\phi'_F(h) = -h(\xi_p) / f(\xi_p) (vdV–Wellner Lem 3.9.23). The influence function is

ICF(x)  =  ϕF(δxF)  =  1{xξp}pf(ξp)  =  p1{xξp}f(ξp).\mathrm{IC}_F(x) \;=\; \phi'_F(\delta_x - F) \;=\; -\frac{\mathbf{1}\{x \le \xi_p\} - p}{f(\xi_p)} \;=\; \frac{p - \mathbf{1}\{x \le \xi_p\}}{f(\xi_p)}.

Thm 5 gives $\sqrt n\,(\hat\xi_p - \xi_p) \rightsquigarrow \mathcal{N}\bigl(0,\; p(1-p)/f(\xi_p)^2\bigr)$ — the asymptotic-normality content of Bahadur’s representation from Topic 29 §29.6, derived here in three lines by the functional delta method (Bahadur’s full result adds an almost-sure remainder rate that the delta method does not provide). The delta method is the right tool for the limit distribution; Topic 29’s Bahadur-representation proof was the direct route for the median-specific case.
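A Monte-Carlo spot check of the variance formula at $p = \tfrac12$ under $\mathcal{N}(0,1)$ (our choice of setup, mirroring Figure 6), where $p(1-p)/f(\xi_p)^2 = \tfrac14 / \varphi(0)^2 = \pi/2 \approx 1.5708$; numpy/scipy assumed:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, reps = 500, 5000
medians = np.median(rng.normal(size=(reps, n)), axis=1)
print("MC variance of sqrt(n) * sample median:", n * medians.var())
print("theory p(1-p)/f(xi_p)^2 = pi/2        :", 0.25 / norm.pdf(0.0) ** 2)
```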

Example 14 KS-distance functional $\phi(F) = \|F - F_0\|_\infty$

Take $\phi(F) = \sup_t |F(t) - F_0(t)|$ for a fixed reference $F_0$. At $F \ne F_0$ with the sup attained at a unique point $t^\ast$, $\phi$ is Hadamard-differentiable with derivative $h \mapsto \mathrm{sign}\bigl(F(t^\ast) - F_0(t^\ast)\bigr)\,h(t^\ast)$, and Thm 5 delivers a Gaussian limit with an explicit influence function. At $F = F_0$, $\phi(F_0) = 0$ is the global minimum, and $\phi$ fails Hadamard differentiability — the sup-of-absolute-value kink at zero is an obstruction. Applying Thm 5 naively gives the wrong answer. The right answer comes from CMT directly: $\sqrt n\,(\phi(F_n) - \phi(F_0)) = \|\mathbb{G}_n\|_\infty \rightsquigarrow \|\mathbb{B}_{F_0}\|_\infty$, the Kolmogorov distribution. This is §32.5’s corollary wearing a §32.7 hat — and it marks the boundary of the delta method: at differentiability points you get Gaussian limits and efficient-influence-function machinery; at non-differentiability points, CMT on the underlying Donsker limit takes over and the limits are non-Gaussian.

Remark 16 Fréchet vs Hadamard — tangential is the right notion

Fréchet differentiability — uniform convergence of (ϕ(θ+h)ϕ(θ)ϕθ(h))/h(\phi(\theta + h) - \phi(\theta) - \phi'_\theta(h)) / \|h\| to 0 as h0\|h\| \to 0 over the whole ball — is too strong for most statistical functionals. The quantile, median, and KS-distance functionals are Hadamard-differentiable but NOT Fréchet-differentiable: the non-uniform dependence on the direction h/hh / \|h\| breaks Fréchet’s uniformity. Hadamard’s tangential version relaxes “all directions” to “tangent directions in D0\mathbb{D}_0”, which for us is the subspace of uniformly continuous perturbations — exactly where BF\mathbb{B}_F places its mass. This is why vdV–Wellner builds the functional delta method on Hadamard differentiability: it’s the weakest smoothness notion that lets Thm 5 go through, and almost every statistical functional satisfies it.

Remark 17 Influence function → semiparametric efficiency (formalML)

The influence function of a functional $\phi$ is not just a technical object — it’s the statistical primitive that modern efficient estimators are built from. An estimator $\hat\phi_n$ is regular if its rescaled limit distribution is stable along every one-parameter submodel through $P$ (no asymptotic bias picked up from local perturbations), and the semiparametric analogue of the Cramér–Rao bound on its asymptotic variance is $\inf_{\mathrm{IC}} P(\mathrm{IC}^2)$ over the influence functions of $\phi$ in the semiparametric model. Efficient-influence-function estimators — targeted maximum likelihood (TMLE), augmented inverse-propensity weighting (AIPW) — explicitly estimate $\mathrm{IC}_P$ and plug it into a one-step correction to achieve the bound. formalML’s semiparametric-inference and causal-inference-methods topics develop both in detail, grounded in the Hadamard-differentiability framework established here.

Remark 18 Topic 19 Wald-CI reconnection

Topic 19 §19.2 constructed Wald confidence intervals $\hat\theta \pm z_{\alpha/2}\,\widehat{\mathrm{SE}}(\hat\theta)$ for parametric $\theta$, relying on the scalar delta method (Topic 13 §13.5) to pass a smooth transformation through the asymptotic normality of $\hat\theta$. Thm 5 is the function-space generalisation: sup-norm confidence bands for a nonparametric functional — uniform bands built from $\|F_n - F\|_\infty$, or $\phi(F_n)$ plus-or-minus a margin derived from the quantiles of $\phi'_F(\mathbb{B}_F)$. Uniform-band constructions for functionals — the frequentist counterpart of sup-norm credible bands — use exactly this machinery, and §32.8 Ex 15 shows one worked example. The through-line from Topic 13 → Topic 19 → Topic 32 is: scalar delta ⇒ scalar Wald ⇒ functional delta ⇒ function-space confidence band.

32.8 Applications: Cramér–von-Mises, two-sample KS, bootstrap empirical process

§§32.5–32.7 built the machinery. §32.8 turns it on three concrete statistical objects: the Cramér–von-Mises goodness-of-fit statistic (a continuous functional of $\mathbb{G}_n$ with a non-Gaussian limit, via CMT), the two-sample Kolmogorov–Smirnov test (where two empirical processes collide and their difference inherits a Brownian-bridge limit after rescaling), and the bootstrap empirical process (Giné–Zinn 1990’s functional upgrade of Topic 31’s bootstrap consistency). Each is a direct consequence of Thms 3 and 5–8 applied to specific functionals or classes. If §§32.5–32.7 were the grammar, §32.8 is the vocabulary test.

Definition 14 Two-sample empirical process

Given independent samples X1,,XnFX_1, \dots, X_n \sim F and Y1,,YmGY_1, \dots, Y_m \sim G, with empirical CDFs FnF_n and GmG_m, the two-sample empirical process is

Gn,m(t)  :=  nmn+m(Fn(t)Gm(t)).\mathbb{G}_{n, m}(t) \;:=\; \sqrt{\frac{nm}{n + m}}\,(F_n(t) - G_m(t)).

Under the null H0:F=G=HH_0 : F = G = H, as n,mn, m \to \infty with n/(n+m)λ(0,1)n / (n + m) \to \lambda \in (0, 1), Gn,mBH\mathbb{G}_{n, m} \rightsquigarrow \mathbb{B}_H in (R)\ell^\infty(\mathbb{R}). The two-sample rescaling factor nm/(n+m)\sqrt{nm/(n+m)} is the effective-sample-size version of the ordinary n\sqrt n: for balanced samples (n=mn = m) it simplifies to n/2\sqrt{n/2}, and as nn \to \infty with mm fixed, it asymptotes to m\sqrt m — the bottleneck sample. The limit is the same Brownian bridge as in §32.5, so two-sample KS inherits Kolmogorov critical values verbatim.
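A null simulation makes the rescaling concrete (a sketch, numpy/scipy assumed; the unbalanced sizes $n = 300$, $m = 700$ are our illustrative choice). Both ECDFs are step functions jumping only at pooled data points, so evaluating there captures the sup exactly; the simulated 95% point should sit near the Kolmogorov value $\approx 1.358$.

```python
import numpy as np
from scipy.stats import kstwobign

# Null simulation of Definition 14: F = G = Uniform(0,1), unbalanced samples.
rng = np.random.default_rng(2)
n, m, reps = 300, 700, 4000
stats = np.empty(reps)
for r in range(reps):
    x, y = np.sort(rng.random(n)), np.sort(rng.random(m))
    pooled = np.sort(np.concatenate([x, y]))
    Fn = np.searchsorted(x, pooled, side="right") / n   # F_n at the pooled points
    Gm = np.searchsorted(y, pooled, side="right") / m
    stats[r] = np.sqrt(n * m / (n + m)) * np.abs(Fn - Gm).max()
print("simulated 95% point :", np.quantile(stats, 0.95))
print("Kolmogorov 95% point:", kstwobign.ppf(0.95))     # ≈ 1.358
```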

Theorem 7 Cramér–von-Mises limit (stated)

Under iid sampling from continuous FF, the Cramér–von-Mises statistic Wn2:=n(FnF)2dFW^2_n := n \int (F_n - F)^2\,dF converges in distribution:

Wn2  d  W2  :=  01B(u)2du  =d  k1χ1,k2(kπ)2,W^2_n \;\xrightarrow{d}\; W^2 \;:=\; \int_0^1 \mathbb{B}^\circ(u)^2\,du \;\stackrel{d}{=}\; \sum_{k \ge 1} \frac{\chi^2_{1, k}}{(k \pi)^2},

a weighted-$\chi^2$ distribution where $\chi^2_{1,k}$ are iid chi-squared-1 random variables. Proof: apply CMT to the continuous functional $\phi(z) = \int z^2\,dF$ on $\ell^\infty(\mathbb{R})$; Donsker gives $\mathbb{G}_n \rightsquigarrow \mathbb{B}_F$, so $\phi(\mathbb{G}_n) = W^2_n \rightsquigarrow \phi(\mathbb{B}_F) = \int \mathbb{B}_F(t)^2\,dF(t) = \int_0^1 \mathbb{B}^\circ(u)^2\,du$ by substitution. The $\chi^2$ decomposition comes from a Karhunen–Loève expansion of $\mathbb{B}^\circ$. Critical values are tabulated; CvM is more sensitive than KS to discrepancies in the bulk of the distribution (the integrated-squared statistic emphasises the body over the extremes) and correspondingly weaker in the tails, which is exactly the gap Ex 15’s Anderson–Darling weighting repairs.
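Two analytic moments of the limit give a cheap Monte-Carlo target: $\mathbb{E}\,W^2 = \sum_k (k\pi)^{-2} = \tfrac16$ and $\operatorname{Var} W^2 = 2\sum_k (k\pi)^{-4} = \tfrac1{45}$. A sketch (numpy assumed) using the standard computational formula for $W^2_n$ on Uniform(0,1) data, for which $\mathbb{E}\,W^2_n = \tfrac16$ holds exactly at every $n$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 500, 5000
u = np.sort(rng.random((reps, n)), axis=1)              # sorted uniform samples
i = np.arange(1, n + 1)
# Standard computational formula: W2_n = 1/(12n) + sum_i (U_(i) - (2i-1)/(2n))^2.
W2 = 1.0 / (12 * n) + ((u - (2 * i - 1) / (2 * n)) ** 2).sum(axis=1)
print("MC mean:", W2.mean(), "  theory 1/6 :", 1 / 6)
print("MC var :", W2.var(),  "  theory 1/45:", 1 / 45)
```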

Theorem 8 Bootstrap empirical process CLT (Giné–Zinn 1990, stated)

Let X1,,XnX^\ast_1, \dots, X^\ast_n be a bootstrap resample drawn iid from FnF_n (Topic 31 §31.2), and let FnF^\ast_n be its empirical CDF. Conditionally on the data X1,,XnX_1, \dots, X_n, the bootstrap empirical process is Gn:=n(FnFn)\mathbb{G}^\ast_n := \sqrt n(F^\ast_n - F_n). Giné and Zinn (1990) prove:

Gn    BFin (R), conditionally on the data, in probability,\mathbb{G}^\ast_n \;\rightsquigarrow\; \mathbb{B}_F \qquad \text{in } \ell^\infty(\mathbb{R}), \text{ conditionally on the data, in probability},

meaning that for every bounded Lipschitz h:(R)Rh : \ell^\infty(\mathbb{R}) \to \mathbb{R}, E[h(Gn)X1:n]Eh(BF)\mathbb{E}^*[h(\mathbb{G}^\ast_n) \mid X_{1:n}] \to \mathbb{E} h(\mathbb{B}_F) in probability over the data. This is the functional analog of Topic 31 §31.3’s scalar Bickel–Freedman consistency: whatever limit distribution the data-world Gn\mathbb{G}_n converges to, the bootstrap-world Gn\mathbb{G}^\ast_n converges to the same thing. Full proof in GIN2016 Ch. 3; we state it without deriving since the argument parallels §32.5 Thm 3 with an extra layer of conditional tightness.
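A sketch of the Giné–Zinn statement in action (numpy/scipy assumed; one fixed Normal(0, 1) sample of size $n = 1000$ and $B = 4000$ resamples are our choices): the bootstrap sup-norms’ quantiles should track the Kolmogorov distribution, as in Figure 8’s right panel. Both $F^\ast_n$ and $F_n$ jump only at points of the original sample, so evaluating right- and left-limits there captures the sup exactly.

```python
import numpy as np
from scipy.stats import kstwobign

rng = np.random.default_rng(4)
n, B = 1000, 4000
x = np.sort(rng.normal(size=n))          # the observed sample, held fixed
Fn_r = np.arange(1, n + 1) / n           # F_n at the order statistics
Fn_l = np.arange(0, n) / n               # F_n just below them
sup = np.empty(B)
for b in range(B):
    xb = np.sort(rng.choice(x, size=n, replace=True))
    Fs_r = np.searchsorted(xb, x, side="right") / n
    Fs_l = np.searchsorted(xb, x, side="left") / n
    sup[b] = np.sqrt(n) * max(np.abs(Fs_r - Fn_r).max(), np.abs(Fs_l - Fn_l).max())
print("bootstrap 95% quantile :", np.quantile(sup, 0.95))
print("Kolmogorov 95% quantile:", kstwobign.ppf(0.95))
```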

Two-panel. Left: histogram of W^2_n across 10,000 Monte-Carlo Normal replicates, overlaid with theoretical CvM CDF (1000-term truncation). Right: sup-norm histogram comparing one-sample vs two-sample KS, each at n = 500.

Figure 7. Cramér–von-Mises and two-sample KS in action. Left: the $W^2_n$ statistic’s null distribution matches the weighted-$\chi^2$ limit. Right: one-sample and two-sample KS sup-norms share the same Kolmogorov limit after the $\sqrt{nm/(n+m)}$ scaling, illustrating the two-sample convergence of Def 14.

Two-panel Topic-31 link-back. Left: observed \mathbb{G}_n path from one dataset (black), overlaid with 100 bootstrap \mathbb{G}^\ast_n paths (grey). Right: bootstrap sup-norm distribution against the Kolmogorov distribution — Giné–Zinn consistency visual.

Figure 8. Bootstrap empirical process. Left: 100 bootstrap Gn\mathbb{G}^\ast_n paths around a single observed Gn\mathbb{G}_n path; the bootstrap-world paths mimic the shape of the data-world path. Right: the bootstrap sup-norm distribution overlays the Kolmogorov distribution — numerical confirmation of Thm 8 (Giné–Zinn 1990).

Example 15 Two-sample KS power via Anderson–Darling weighting

The ordinary two-sample KS statistic Dn,m=suptFn(t)Gm(t)D_{n, m} = \sup_t |F_n(t) - G_m(t)| has balanced power across tt, but alternatives that differ only in the tails — e.g., same mean and variance but different kurtosis — are hard to detect because the tail probability F(t)(1F(t))F(t)(1 - F(t)) is small precisely where the discrepancy lives. Anderson and Darling (1954) proposed the rescaled statistic

D^{w}_{n,m} \;:=\; \sqrt{\frac{nm}{n + m}}\,\sup_t \frac{|F_n(t) - G_m(t)|}{\sqrt{\bar F(t)\bigl(1 - \bar F(t)\bigr)}}

($\bar F$ = pooled ECDF), which overweights the tails where the denominator shrinks. One care is needed: the sup must be restricted to an inner range $\bar F(t) \in [a, 1-a]$, because the unrestricted weighted sup degenerates — $|\mathbb{B}^\circ(u)|/\sqrt{u(1-u)}$ is a.s. unbounded as $u \to 0$ or $1$. With the trimming, the limit — via CMT on Donsker plus a rescaling — is $\sup_{u \in [a, 1-a]} |\mathbb{B}^\circ(u)| / \sqrt{u(1-u)}$, whose critical values are tabulated (or simulated). For tail-discrepancy alternatives, the weighted statistic has substantially higher power than $D_{n,m}$. The integrated version — the Anderson–Darling statistic $A^2 = \int (F_n - \bar F)^2 / \bigl(\bar F(1 - \bar F)\bigr)\,d\bar F$ — applies the same weighting inside the CvM integral; there the weight is integrable and no trimming is needed.
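A toy power comparison, entirely our own construction (numpy assumed): $\mathcal{N}(0,1)$ against a unit-variance Student-$t_5$ (same mean and variance, heavier tails), with Monte-Carlo critical values for both statistics and the weighted sup trimmed to $\bar F \in [0.01, 0.99]$ as discussed above.

```python
import numpy as np

rng = np.random.default_rng(5)
n = m = 500

def statistics(x, y):
    """Plain KS and trimmed tail-weighted sup, both evaluated at pooled points."""
    pooled = np.sort(np.concatenate([x, y]))
    Fbar = np.arange(1, n + m + 1) / (n + m)            # pooled ECDF at its jumps
    Fn = np.searchsorted(np.sort(x), pooled, side="right") / n
    Gm = np.searchsorted(np.sort(y), pooled, side="right") / m
    d = np.abs(Fn - Gm)
    scale = np.sqrt(n * m / (n + m))
    keep = (Fbar >= 0.01) & (Fbar <= 0.99)              # trim the extreme tails
    return scale * d.max(), scale * (d[keep] / np.sqrt(Fbar[keep] * (1 - Fbar[keep]))).max()

def draw(alt):
    x = rng.normal(size=n)
    y = rng.standard_t(5, size=m) / np.sqrt(5 / 3) if alt else rng.normal(size=m)
    return x, y

null = np.array([statistics(*draw(False)) for _ in range(2000)])
crit = np.quantile(null, 0.95, axis=0)                  # MC critical values
alt = np.array([statistics(*draw(True)) for _ in range(2000)])
print("size-0.05 power — plain KS:", (alt[:, 0] > crit[0]).mean(),
      "  weighted:", (alt[:, 1] > crit[1]).mean())
```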

Remark 19 Bootstrap as finite-dim shadow of Donsker

Topic 31 §31.3’s bootstrap-consistency proof (CLT + Pólya) was the scalar shadow of §32.5’s Donsker (multivariate CLT + bracketing-based tightness). Thm 8 now makes the upgrade official: the bootstrap is not just a resampling heuristic that happens to work for the sample mean; it’s a functional resampling operation on $F_n$ whose limit distribution is the same Brownian-bridge process Donsker delivers for the data. Every scalar functional $\phi$ you apply to $\mathbb{G}_n$ — mean, median, quantile, KS, CvM — has a bootstrap version you apply to $\mathbb{G}^\ast_n$, and the two asymptotically agree. This is the retrospective statement: the bootstrap, viewed from the empirical-process altitude, is just Donsker with the data replaced by $F_n$. The §31.3 → §32.8 bridge is the curriculum’s most important structural connection, and Thm 8 is its formal statement.

Remark 20 U-processes and V-statistics — pointer to formalML

A $U$-statistic of order $m$ averages a symmetric kernel $h$ over all $\binom{n}{m}$ subsets of the data; Topic 13 §13.3 introduced the scalar version. A $U$-process lets the kernel range over a class $\mathcal{H}$ of symmetric kernels and studies the resulting stochastic process $U_n : \mathcal{H} \to \mathbb{R}$. The limit theory — multi-level Donsker — involves Gaussian chaoses rather than a single Brownian bridge, and the technical machinery (Hoeffding decomposition + functional Donsker at each order) is nontrivial. $V$-statistics (the plug-in counterparts of the unbiased $U$-statistics) follow the same theory with a different normalisation. Both are standard tools in kernel-method statistics and in a number of classical nonparametric tests; deferred to formalML’s u-processes topic.

32.9 Concentration: Talagrand’s inequality and symmetrisation

Donsker’s theorem gives the limiting distribution of Gn\mathbb{G}_n; in finite samples, we often need tail bounds on GnF\|\mathbb{G}_n\|_\mathcal{F} that are explicit at every nn, not just asymptotic. Classical Bernstein and Bennett inequalities handle bounded sums of iid random variables, but the supremum over F\mathcal{F} introduces a dependence structure Bernstein doesn’t capture. Talagrand’s 1996 concentration inequality fills the gap: it’s the correct Bernstein-analog for empirical-process suprema, providing sub-Gaussian concentration around EGnF\mathbb{E}\|\mathbb{G}_n\|_\mathcal{F} with explicit constants. It’s the workhorse behind sharp generalisation bounds in ML and the sharp-rate theory §32.6 Rem 15 deferred.

Theorem 9 Talagrand 1996 concentration inequality (stated)

Let F\mathcal{F} be a countable class of measurable functions on (X,A)(\mathcal{X}, \mathcal{A}) with envelope F:XR0F : \mathcal{X} \to \mathbb{R}_{\ge 0} satisfying FM\|F\|_\infty \le M and σ2:=supfFPf2<\sigma^2 := \sup_{f \in \mathcal{F}} P f^2 < \infty. Then for every t0t \ge 0,

Pr ⁣(GnF    EGnF+t)    Kexp ⁣(t2Cσ2+DtM/n)\Pr\!\Bigl(\|\mathbb{G}_n\|_\mathcal{F} \;\ge\; \mathbb{E}\|\mathbb{G}_n\|_\mathcal{F} + t\Bigr) \;\le\; K \exp\!\Bigl(-\frac{t^2}{C \sigma^2 + D\,t\,M / \sqrt n}\Bigr)

for universal constants K,C,D>0K, C, D > 0 independent of nn and F\mathcal{F}. The inequality has two regimes: for small tt (the “sub-Gaussian” regime) the σ2\sigma^2 term dominates and the bound is exp(t2/(Cσ2))\exp(-t^2 / (C\sigma^2)); for large tt (the “sub-exponential” regime) the tM/ntM/\sqrt n term takes over. Bousquet (2002) refined Talagrand’s original constants to near-optimal values.
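To see the two regimes numerically, here is a sketch with illustrative constants $K = C = 2$, $D = 1$ (not the true universal constants; the shape is the point, not the sharpness). The crossover between regimes sits at $t^\ast = C\sigma^2\sqrt n/(DM)$, which is $5$ for the toy scales below.

```python
import numpy as np

# Theorem 9's tail bound with illustrative (not sharp) constants K = C = 2, D = 1.
K, C, D = 2.0, 2.0, 1.0
sigma2, M, n = 0.25, 1.0, 100    # toy scales; crossover t* = C*sigma2*sqrt(n)/(D*M) = 5
for t in (0.5, 1.0, 2.0, 5.0, 10.0):
    bound = K * np.exp(-t**2 / (C * sigma2 + D * t * M / np.sqrt(n)))
    sub_g = K * np.exp(-t**2 / (C * sigma2))        # small-t (sub-Gaussian) regime
    sub_e = K * np.exp(-t * np.sqrt(n) / (D * M))   # large-t (sub-exponential) regime
    print(f"t = {t:5.1f}   bound = {bound:.3e}   sub-Gaussian = {sub_g:.3e}   sub-exponential = {sub_e:.3e}")
```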

Remark 21 The Talagrand proof is book-length

Honest warning: the Talagrand 1996 proof uses isoperimetry in product spaces — the most technically demanding machinery in modern probability theory — and runs to several chapters of Ledoux–Talagrand 1991 (Ch. 6–7 specifically; the preparatory material on entropy and Gaussian processes adds two more chapters). Reproducing the argument at Topic 32’s pedagogical level would expand the chapter by 50% without adding conceptual payoff that the statement itself doesn’t carry. We state the inequality, cite the sources, and use it; the full derivation is the kind of reference-book material formalStatistics deliberately doesn’t reproduce.

Remark 22 Local empirical-process theory — formalML pointer

M-estimators (MLE, robust estimators, quantile regression, Z-estimators) have asymptotic-normality theory that specialises Donsker to neighbourhoods of the parameter value — “local empirical-process theory”. The main result (vdV2000 §25 Thm 25.57): under local asymptotic equicontinuity of the score process and a non-singular information matrix, n(θ^nθ0)\sqrt n (\hat\theta_n - \theta_0) \rightsquigarrow Gaussian with the semiparametric-efficient variance. The proof uses Talagrand-type concentration on local empirical processes to control the remainder in the Taylor-expansion argument — exactly what Thm 9 gives. Semiparametric efficiency theory, targeted maximum likelihood, and all modern causal-inference estimators live on this foundation; the full development is deferred to formalML’s semiparametric-inference topic.

32.10 Track 8 retrospective, curriculum close, and the road to formalML

Topic 32 closes Track 8 and the curriculum. Track 8 is a four-pillar nonparametric toolkit — order statistics + kernel density estimation + bootstrap + empirical processes — anchored in the empirical distribution and the functional CLT that governs its fluctuations. The curriculum is the 32-topic formal-probability-and-statistics pipeline formalStatistics was built to ship. Both are complete with this merge. Figure 9 shows the final iteration of the Track 8 spine; the six dashed arrows on the right are the six formalML target topics where Topic 32’s forward-pointers land. Every forward-promise the 32 MDX files accumulated over ~50,000 words of exposition has now been either discharged inside formalStatistics or handed off to a named formalML slug. The remaining two remarks of §32.10 name the last two deferred topics — the Banach-space generalisation of Donsker, and the formalML handoff itself — before the curriculum-retrospective epilogue.

Track 8 spine — the final iteration. Four filled (published) topic nodes: 29, 30, 31, 32. Arrows show the chain 29 → 30 → 31 → 32 plus the long-range 29 → 32 edge via Donsker for \mathcal{F}_{\mathrm{cdf}}. On the right, six dashed-outline formalML target nodes: generalization-bounds, pac-bayes-bounds, conformal-prediction, semiparametric-inference, causal-inference-methods, extreme-value-theory — each connected to Topic 32 via a dashed forward-pointer.

Figure 9. Track 8 spine, final iteration — 4 of 4 published, track complete. The six dashed arrows on the right mark the formalStatistics-to-formalML handoff surface: every forward pointer from Track 8 leaves the site for formalML’s topic slugs named in the figure.

Remark 23 Beyond $\ell^\infty$: Banach-space CLT

(F)\ell^\infty(\mathcal{F}) is the simplest Banach space in which Donsker’s theorem makes sense, because sup-norm convergence plays well with the bounded-function indexing and because chaining arguments reduce neatly to bracketing-integral bounds. General Banach-space CLTs — Donsker-like statements in Hilbert-valued or Banach-valued settings with abstract covariance operators — require Gaussianity assumptions on the limit and type-2 geometric conditions on the space (Ledoux–Talagrand Ch. 10; Giné–Nickl Ch. 3). These generalise Topic 32’s machinery to the high-dimensional-statistics setting where observations themselves live in Hilbert space (e.g., functional data analysis, kernel mean embeddings). All of this lives in formalML’s high-dimensional-statistics and related topics; Topic 32 stays at (F)\ell^\infty(\mathcal{F}) because it’s the cleanest case and the one that carries the functional-delta machinery of §32.7 without additional geometric scaffolding.

Remark 24 The curriculum-closer / formalML handoff

The reader who has made it here — through Carathéodory and conditional probability, through LLN and CLT, through MLE and sufficient statistics, through hypothesis tests and confidence intervals, through OLS and GLMs and regularisation, through priors and MCMC and hierarchical models, and now through order statistics and KDE and bootstrap and empirical processes — has the rigorous probability-and-statistics vocabulary formalML’s ML-math topics assume without apology. Every forward-promise the formalStatistics curriculum accumulated has been discharged either inside the site (Donsker here, Lehmann–Scheffé at Topic 16, posterior consistency at Topic 25) or to a named formalML slug. The next step is formalML itself, where generalization bounds build directly on §32.4, where semiparametric inference and causal inference build on §32.7’s functional delta method, where conformal prediction builds on Topic 29 §29.5’s DKW inequality. The curriculum retrospective that follows names the handoff, the capstone theorems, and the track-by-track throughline — an epilogue, not a §32.11.


Curriculum retrospective

A short retrospective on what shipping Topic 32 means.

What does shipping Topic 32 mean? The site is now feature-complete at 32 of 32 topics. Every curriculum-graph node has status: "published". Every forward-promise harvested from the 32-topic MDX chain has been discharged, either within formalStatistics or to a named formalML target. The formalStatistics-to-formalML handoff has six named targets: generalization-bounds, pac-bayes-bounds, conformal-prediction, semiparametric-inference, causal-inference-methods, extreme-value-theory.

The five load-bearing theorems of formalStatistics.

  1. Carathéodory’s extension theorem (Track 1, Topic 1 §1.6). The measure-theoretic bedrock: every σ-finite countably-additive pre-measure on a semialgebra extends uniquely to a measure on the generated σ-algebra. Everything downstream is a consequence.
  2. The Central Limit Theorem (Track 3, Topic 11 §11.5, via Lindeberg–Feller). n(Xˉnμ)N(0,σ2)\sqrt n (\bar X_n - \mu) \rightsquigarrow \mathcal{N}(0, \sigma^2). The finite-dimensional template Donsker lifts one level.
  3. Rao–Blackwell + Lehmann–Scheffé (Track 4, Topic 16 §16.7 + §16.10). Completeness + sufficiency + unbiasedness = UMVUE. The cleanest frequentist-optimality result in the curriculum.
  4. Posterior consistency (Track 7, Topic 25 §25.8, via Schwartz’s theorem). Bayes with a prior giving positive mass to every KL-neighbourhood of the truth (plus a testability condition) recovers the truth in the limit. The Bayesian analog of consistency.
  5. Donsker’s theorem (Track 8, Topic 32 §32.5). n(FnF)BF\sqrt n (F_n - F) \rightsquigarrow \mathbb{B}_F in (R)\ell^\infty(\mathbb{R}). The curriculum closer: lifts LLN + CLT from pointwise/finite-dimensional to uniform/functional statements.

Track-by-track recap.

  • Track 1 (Foundations of Probability, Topics 1–4): Kolmogorov axioms, conditional probability, random variables, expectation. The measure-theoretic primitives.
  • Track 2 (Core Distributions & Families, Topics 5–8): discrete + continuous families, exponential families, multivariate distributions. The vocabulary for building models.
  • Track 3 (Convergence & Limit Theorems, Topics 9–12): modes of convergence, LLN, CLT, large deviations. The asymptotic toolkit.
  • Track 4 (Statistical Estimation, Topics 13–16): point estimation, MLE, MoM, sufficient statistics. The estimation framework.
  • Track 5 (Hypothesis Testing & Confidence, Topics 17–20): NP paradigm, LR tests, CIs, multiple testing. The inference framework.
  • Track 6 (Regression & Linear Models, Topics 21–24): OLS, GLMs, regularization, model selection. The regression framework.
  • Track 7 (Bayesian Statistics, Topics 25–28): priors, MCMC, model comparison / BMA, hierarchical models. The Bayesian framework.
  • Track 8 (High-Dimensional & Nonparametric, Topics 29–32): order statistics, KDE, bootstrap, empirical processes. The nonparametric framework and the track that closes the curriculum.

The formalStatistics-to-formalML handoff, named explicitly. The reader who has reached the end of Topic 32 is now equipped for:

  • Generalization bounds — uniform convergence of training risk to test risk via function-class complexity, direct lift of §32.4.
  • PAC-Bayes bounds — posterior-over-hypotheses generalization bounds, uses the symmetrisation technique of §32.4.
  • Conformal prediction — distribution-free predictive intervals via exchangeability, uses the DKW inequality of Topic 29 §29.5 and the empirical-process framework of §32.
  • Semiparametric inference — efficient estimation in models with a finite-dimensional parameter of interest and an infinite-dimensional nuisance; built on local empirical-process theory (local Donsker, local bracketing) deferred from §32.9 Rem 22.
  • Causal inference methods — TMLE, AIPW, efficient-influence-function methods; built on the functional delta method of §32.7.
  • Extreme value theory — Fisher–Tippett–Gnedenko asymptotics for X(n)X_{(n)} and the generalized Pareto; forward-pointed from Topic 29 §29.10 Rem 18, orthogonal to §32 but listed here for completeness.

The site is complete. Thank you, reader — and on to formalML.


References

  1. van der Vaart, Aad W., and Jon A. Wellner. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics.
  2. Kosorok, Michael R. (2008). Introduction to Empirical Processes and Semiparametric Inference. Springer Series in Statistics.
  3. Pollard, David. (1984). Convergence of Stochastic Processes. Springer Series in Statistics.
  4. Pollard, David. (1990). Empirical Processes: Theory and Applications. NSF-CBMS Regional Conference Series in Probability and Statistics, Vol. 2.
  5. Dudley, R. M. (1978). Central Limit Theorems for Empirical Measures. Annals of Probability, 6(6), 899–929.
  6. Dudley, R. M. (1999). Uniform Central Limit Theorems. Cambridge University Press.
  7. Giné, Evarist, and Richard Nickl. (2016). Mathematical Foundations of Infinite-Dimensional Statistical Models. Cambridge Series in Statistical and Probabilistic Mathematics.
  8. Talagrand, Michel. (1996). New Concentration Inequalities in Product Spaces. Inventiones Mathematicae, 126(3), 505–563.
  9. Ledoux, Michel, and Michel Talagrand. (1991). Probability in Banach Spaces: Isoperimetry and Processes. Springer Ergebnisse der Mathematik.
  10. Vapnik, V. N., and A. Ya. Chervonenkis. (1971). On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities. Theory of Probability and Its Applications, 16(2), 264–280.
  11. Sauer, Norbert. (1972). On the Density of Families of Sets. Journal of Combinatorial Theory, Series A, 13(1), 145–147.
  12. Donsker, Monroe D. (1952). Justification and Extension of Doob’s Heuristic Approach to the Kolmogorov–Smirnov Theorems. Annals of Mathematical Statistics, 23(2), 277–281.
  13. Billingsley, Patrick. (1999). Convergence of Probability Measures (2nd ed.). Wiley Series in Probability and Statistics.
  14. van der Vaart, Aad W. (2000). Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics.
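  15. Giné, Evarist, and Joel Zinn. (1990). Bootstrapping General Empirical Measures. Annals of Probability, 18(2), 851–869.
  16. Anderson, T. W., and D. A. Darling. (1954). A Test of Goodness of Fit. Journal of the American Statistical Association, 49(268), 765–769.
  17. Bousquet, Olivier. (2002). A Bennett Concentration Inequality and Its Application to Suprema of Empirical Processes. Comptes Rendus Mathématique, 334(6), 495–500.
  18. Bickel, Peter J., and Murray Rosenblatt. (1973). On Some Global Measures of the Deviations of Density Function Estimates. Annals of Statistics, 1(6), 1071–1095.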