intermediate 45 min read · April 14, 2026

Modes of Convergence

Almost sure, in probability, in distribution, in $L^p$: the hierarchy and when each matters.

9.1 Why Convergence Needs Four Definitions

Topics 1 through 8 built the vocabulary of probability: sample spaces gave us the stage, random variables gave us the actors, distributions told us how the actors behave, and moments — expectation, variance, the MGF — gave us numerical summaries of that behavior. But vocabulary alone doesn’t make a language. We need grammar — rules for combining simple objects into complex ones and understanding what happens when we take limits.

This topic starts that grammar. The central question is deceptively simple: if $X_1, X_2, X_3, \ldots$ is a sequence of random variables, what does it mean to say that $X_n$ converges to some random variable $X$?

For deterministic sequences, there is exactly one answer. A sequence $a_n$ converges to $a$ if for every $\varepsilon > 0$ there exists $N$ such that $|a_n - a| < \varepsilon$ for all $n \geq N$. The $\varepsilon$-$N$ definition from calculus handles everything.

For random variables, the situation is fundamentally different. A random variable $X_n$ is a function from the sample space $\Omega$ to the real numbers, so asking whether $X_n$ converges to $X$ is asking whether a sequence of functions converges. And functions can converge in many different ways.

Consider the most important sequence in statistics: the sample mean $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$, where the $X_i$ are iid with mean $\mu$. We want to say $\bar{X}_n \to \mu$. But what exactly does this mean?

  • Does every single realization of the sample mean eventually settle near $\mu$? (That's a strong demand: it requires every sample path to converge.)
  • Or do we only need that the probability of $\bar{X}_n$ being far from $\mu$ goes to zero? (That's weaker: some paths might misbehave, as long as the misbehaving paths become rare.)
  • Or do we only need that the distribution of $\bar{X}_n$ concentrates around $\mu$? (That's weaker still: we're just comparing CDFs.)

These three questions lead to genuinely different notions of convergence, and a fourth emerges when we bring moment conditions into the picture. Here are the four modes, in preview:

| Mode | Notation | Intuition |
| --- | --- | --- |
| Almost sure | $X_n \xrightarrow{\text{a.s.}} X$ | Every sample path converges |
| In $L^p$ | $X_n \xrightarrow{L^p} X$ | The $p$-th moment of the gap vanishes |
| In probability | $X_n \xrightarrow{P} X$ | Deviations become rare |
| In distribution | $X_n \xrightarrow{d} X$ | CDFs converge at continuity points |

These four modes form a strict hierarchy, a diamond rather than a line, with almost sure and $L^p$ at the top, convergence in probability in the middle, and convergence in distribution at the bottom. The arrows point downward: stronger modes imply weaker ones, but not the reverse. Section 9.6 will make this precise, and Section 9.7 will show, with explicit counterexamples, exactly why the arrows don't reverse.

Three-panel overview: sample paths (a.s.), probability decay (in prob), CDF convergence (in dist)

The tools that launch convergence theory are all from Topic 4 (Expectation, Variance & Moments): Markov's inequality (Theorem 11), Chebyshev's inequality (Theorem 12), Jensen's inequality (Theorem 13), and MGF uniqueness (Theorem 17). Chebyshev gives us convergence in probability (Section 9.2). MGF convergence gives us convergence in distribution (Section 9.5). Jensen and Markov bridge between $L^p$ and probability modes (Section 9.6). If any of these results feel unfamiliar, revisit Topic 4, Sections 5 and 7, before proceeding.

We start with convergence in probability — not because it’s the strongest mode, but because it follows directly from a tool the reader already knows: Chebyshev’s inequality.


9.2 Convergence in Probability

Chebyshev's inequality (Topic 4, Theorem 12) tells us that $P(|X - \mu| > \varepsilon) \leq \text{Var}(X)/\varepsilon^2$. If the variance of a sequence shrinks to zero, then the probability of large deviations also goes to zero. This is the intuition behind convergence in probability.

Definition 9.1 (Convergence in Probability)

A sequence of random variables $\{X_n\}$ converges in probability to a random variable $X$ if for every $\varepsilon > 0$,

$$P(|X_n - X| > \varepsilon) \to 0 \quad \text{as } n \to \infty$$

We write $X_n \xrightarrow{P} X$.

Equivalently, for every $\varepsilon > 0$ and every $\delta > 0$, there exists $N$ such that for all $n \geq N$:

$$P(|X_n - X| > \varepsilon) < \delta$$

When the limit $X$ is a constant $c$, the definition reduces to: for every $\varepsilon > 0$, $P(|X_n - c| > \varepsilon) \to 0$.

The key feature of convergence in probability is that it makes a statement about each fixed $n$: at time $n$, the probability of being far from the target is small. It does not require that the sequence eventually stays close for all future times simultaneously; that is the stronger demand of almost sure convergence (Section 9.3).

In statistics, convergence in probability has a dedicated name.

Definition 9.2 (Consistency)

An estimator sequence $\hat{\theta}_n$ is consistent for the parameter $\theta$ if

$$\hat{\theta}_n \xrightarrow{P} \theta$$

That is, as the sample size $n$ grows, the estimator concentrates around the true parameter value. Consistency is the bare minimum we demand of any reasonable estimator: with enough data, we should at least get the right answer in probability.

The simplest proof of consistency is a direct application of Chebyshev.

Theorem 9.1 (Sample Mean Convergence in Probability — Weak LLN Preview)

Let $X_1, X_2, \ldots$ be iid random variables with $E[X_i] = \mu$ and $\text{Var}(X_i) = \sigma^2 < \infty$. Then the sample mean $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$ converges in probability to $\mu$:

$$\bar{X}_n \xrightarrow{P} \mu$$
Proof

We need to show that for every $\varepsilon > 0$, $P(|\bar{X}_n - \mu| > \varepsilon) \to 0$.

First, we compute the mean and variance of $\bar{X}_n$. By linearity of expectation:

$$E[\bar{X}_n] = E\!\left[\frac{1}{n}\sum_{i=1}^n X_i\right] = \frac{1}{n}\sum_{i=1}^n E[X_i] = \frac{1}{n} \cdot n\mu = \mu$$

Since the $X_i$ are independent, the variance of the sum equals the sum of the variances (Topic 4, Theorem 7, Property 3):

$$\text{Var}(\bar{X}_n) = \text{Var}\!\left(\frac{1}{n}\sum_{i=1}^n X_i\right) = \frac{1}{n^2}\sum_{i=1}^n \text{Var}(X_i) = \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n}$$

Now apply Chebyshev's inequality (Topic 4, Theorem 12) to $\bar{X}_n$:

$$P(|\bar{X}_n - \mu| > \varepsilon) \leq \frac{\text{Var}(\bar{X}_n)}{\varepsilon^2} = \frac{\sigma^2}{n\varepsilon^2}$$

As $n \to \infty$, the right-hand side $\sigma^2/(n\varepsilon^2) \to 0$. Since probabilities are non-negative, the squeeze gives:

$$0 \leq P(|\bar{X}_n - \mu| > \varepsilon) \leq \frac{\sigma^2}{n\varepsilon^2} \to 0$$

Therefore $P(|\bar{X}_n - \mu| > \varepsilon) \to 0$, which is $\bar{X}_n \xrightarrow{P} \mu$. $\square$

This three-line argument — compute the variance, apply Chebyshev, watch it go to zero — is the prototype for most consistency proofs in statistics. The full Weak Law of Large Numbers (WLLN), proved in Law of Large Numbers, generalizes this to relaxed independence conditions — including uncorrelated sequences and distributions with finite mean but infinite variance.
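The Chebyshev argument is easy to check numerically. The following sketch is illustrative only: the Normal population, $\varepsilon = 0.5$, and the trial count are our choices, not part of the text. It estimates $P(|\bar{X}_n - \mu| > \varepsilon)$ by simulation and compares it to the bound $\sigma^2/(n\varepsilon^2)$:

```python
import random

random.seed(0)

def deviation_prob(n, eps=0.5, mu=0.0, sigma=1.0, trials=2000):
    """Monte Carlo estimate of P(|Xbar_n - mu| > eps) for iid N(mu, sigma^2) draws."""
    hits = 0
    for _ in range(trials):
        xbar = sum(random.gauss(mu, sigma) for _ in range(n)) / n
        if abs(xbar - mu) > eps:
            hits += 1
    return hits / trials

results = {}
for n in [4, 16, 64]:
    results[n] = deviation_prob(n)
    bound = min(1.0, 1.0 / (n * 0.5**2))   # sigma^2 / (n * eps^2), capped at 1
    print(f"n={n:3d}  estimate={results[n]:.3f}  Chebyshev bound={bound:.3f}")
```

The estimates sit well below the Chebyshev bound (Chebyshev is loose), and both shrink as $n$ grows, which is all that convergence in probability requires.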

Example 1 Scaled Normal: the simplest convergence-in-probability example

Let $Z \sim N(0, 1)$ and define $X_n = Z/n$. We claim $X_n \xrightarrow{P} 0$.

For any $\varepsilon > 0$:

$$P(|X_n| > \varepsilon) = P\!\left(\frac{|Z|}{n} > \varepsilon\right) = P(|Z| > n\varepsilon)$$

Since $Z$ is standard Normal, we can bound this using the tail inequality $P(|Z| > t) \leq 2e^{-t^2/2}$ for $t > 0$. So:

$$P(|X_n| > \varepsilon) \leq 2e^{-n^2\varepsilon^2/2}$$

This decays exponentially fast to zero. For concrete values with $\varepsilon = 0.1$:

| $n$ | $n\varepsilon$ | $P(\|X_n\| > 0.1) \leq$ |
| --- | --- | --- |
| 1 | 0.1 | 0.920 |
| 5 | 0.5 | 0.617 |
| 10 | 1.0 | 0.317 |
| 50 | 5.0 | $5.7 \times 10^{-7}$ |
| 100 | 10.0 | $< 10^{-22}$ |

Notice that $X_n = Z/n$ not only converges in probability: every single realization converges deterministically, since for each fixed $\omega$, $X_n(\omega) = Z(\omega)/n \to 0$. So this sequence actually converges almost surely, which is stronger. The next section makes that distinction precise.
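The table can be reproduced exactly: for $Z$ standard Normal, $P(|Z| > t) = \operatorname{erfc}(t/\sqrt{2})$, which Python's `math` module computes directly. This check is illustrative, not part of the text above:

```python
import math

def exact_tail(n, eps=0.1):
    """P(|X_n| > eps) = P(|Z| > n*eps) for Z ~ N(0,1), via P(|Z| > t) = erfc(t/sqrt(2))."""
    return math.erfc(n * eps / math.sqrt(2))

def exp_bound(n, eps=0.1):
    """The tail bound 2*exp(-(n*eps)^2 / 2) used in the example."""
    return 2.0 * math.exp(-((n * eps) ** 2) / 2)

for n in [1, 5, 10, 50]:
    print(f"n={n:3d}  exact={exact_tail(n):.3e}  bound={exp_bound(n):.3e}")
```

The exact values match the table (0.920, 0.617, 0.317, $5.7 \times 10^{-7}$), and the exponential bound always sits above them.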

Probability tubes shrinking around the limit: P(|Xn - X| > epsilon) decay curves for several epsilon values

9.3 Almost Sure Convergence

Convergence in probability says: at any fixed time $n$, the probability of being far from the target is small. Almost sure convergence says something much stronger: with probability 1, the entire sample path eventually settles near the target and stays there. The distinction matters. A sequence can converge in probability (deviations become rare) while still producing large excursions infinitely often along almost every sample path; the typewriter sequence of Section 9.7 does exactly this.

Definition 9.3 (Almost Sure Convergence)

A sequence $\{X_n\}$ converges almost surely (a.s.) to $X$ if

$$P\!\left(\lim_{n \to \infty} X_n = X\right) = 1$$

Equivalently, $P(\omega \in \Omega : X_n(\omega) \to X(\omega)) = 1$, meaning the set of sample points $\omega$ where the sequence of real numbers $X_1(\omega), X_2(\omega), \ldots$ converges to $X(\omega)$ has probability 1.

We write $X_n \xrightarrow{\text{a.s.}} X$.

The critical difference from convergence in probability: a.s. convergence requires that for almost every outcome $\omega$, there exists $N(\omega)$ (depending on the outcome) such that $|X_n(\omega) - X(\omega)| < \varepsilon$ for all $n \geq N(\omega)$. The quantifier structure is: $\forall \varepsilon > 0$, $P(\exists N : \forall n \geq N, |X_n - X| < \varepsilon) = 1$.

The definition involves $\lim_{n \to \infty} X_n$, which in turn involves $\limsup$ and $\liminf$. An equivalent characterization avoids the limit and works directly with tail events.

Theorem 9.2 (Borel-Cantelli Characterization of A.S. Convergence)

$X_n \xrightarrow{\text{a.s.}} X$ if and only if for every $\varepsilon > 0$:

$$P\!\left(\sup_{m \geq n} |X_m - X| > \varepsilon\right) \to 0 \quad \text{as } n \to \infty$$

Equivalently, $P(\limsup_{n} \{|X_n - X| > \varepsilon\}) = 0$ for every $\varepsilon > 0$, where

$$\limsup_{n} \{|X_n - X| > \varepsilon\} = \bigcap_{n=1}^{\infty} \bigcup_{m=n}^{\infty} \{|X_m - X| > \varepsilon\}$$

is the event that $|X_n - X| > \varepsilon$ infinitely often.

Proof

We show both directions of the equivalence.

Forward direction. Suppose $X_n \xrightarrow{\text{a.s.}} X$, so $P(\lim X_n = X) = 1$. The event $\{X_n \not\to X\}$, the set of $\omega$ where $X_n(\omega)$ does not converge to $X(\omega)$, can be written as:

$$\{X_n \not\to X\} = \bigcup_{k=1}^{\infty} \limsup_{n} \left\{|X_n - X| > \frac{1}{k}\right\}$$

This is because $X_n(\omega) \not\to X(\omega)$ means there exists some $\varepsilon > 0$ (which we can take to be $1/k$ for some integer $k$) such that $|X_n(\omega) - X(\omega)| > 1/k$ for infinitely many $n$.

Since $P(\{X_n \not\to X\}) = 0$ and each $\limsup$ event is contained in this union:

$$0 = P\!\left(\bigcup_{k=1}^{\infty} \limsup_{n} \left\{|X_n - X| > \frac{1}{k}\right\}\right) \geq P\!\left(\limsup_{n} \left\{|X_n - X| > \frac{1}{k}\right\}\right)$$

for each $k$. Since probabilities are non-negative, $P(\limsup_n \{|X_n - X| > 1/k\}) = 0$ for every $k$. For arbitrary $\varepsilon > 0$, choose $k$ with $1/k < \varepsilon$; then $\{|X_n - X| > \varepsilon\} \subseteq \{|X_n - X| > 1/k\}$, so $\limsup_n \{|X_n - X| > \varepsilon\} \subseteq \limsup_n \{|X_n - X| > 1/k\}$, and the result follows.

Now, the event $\limsup_n \{|X_n - X| > \varepsilon\} = \bigcap_{n=1}^{\infty} \bigcup_{m=n}^{\infty} \{|X_m - X| > \varepsilon\}$. The sets $B_n = \bigcup_{m=n}^{\infty} \{|X_m - X| > \varepsilon\}$ are decreasing ($B_1 \supseteq B_2 \supseteq \cdots$), so by continuity of probability from above:

$$P(\limsup_n \{|X_n - X| > \varepsilon\}) = \lim_{n \to \infty} P(B_n) = \lim_{n \to \infty} P\!\left(\sup_{m \geq n} |X_m - X| > \varepsilon\right)$$

Since this equals 0, we have $P(\sup_{m \geq n} |X_m - X| > \varepsilon) \to 0$.

Reverse direction. Suppose $P(\limsup_n \{|X_n - X| > \varepsilon\}) = 0$ for every $\varepsilon > 0$. Then:

$$P(\{X_n \not\to X\}) = P\!\left(\bigcup_{k=1}^{\infty} \limsup_{n}\left\{|X_n - X| > \frac{1}{k}\right\}\right) \leq \sum_{k=1}^{\infty} P\!\left(\limsup_{n}\left\{|X_n - X| > \frac{1}{k}\right\}\right) = \sum_{k=1}^{\infty} 0 = 0$$

So $P(\lim X_n = X) = 1$, which is $X_n \xrightarrow{\text{a.s.}} X$. $\square$

The Borel-Cantelli characterization reveals why a.s. convergence is harder to establish than convergence in probability. For convergence in probability, we only need each individual $P(|X_n - X| > \varepsilon) \to 0$. For a.s. convergence, we need the supremum $P(\sup_{m \geq n} |X_m - X| > \varepsilon) \to 0$, which controls all future times simultaneously.
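A standard counterexample makes the gap concrete (this example, independent $X_m \sim \text{Bernoulli}(1/m)$, is our illustration, distinct from the typewriter sequence the text mentions). The marginal probabilities $P(X_m = 1) = 1/m$ vanish, yet the chance of a hit somewhere in the window $[n, 2n]$ never shrinks, because the no-hit probability telescopes:

```python
def p_hit_in_window(n):
    """P(X_m = 1 for some n <= m <= 2n) when the X_m ~ Bernoulli(1/m) are independent.
    P(no hit) = prod_{m=n}^{2n} (1 - 1/m) = prod (m-1)/m, which telescopes to (n-1)/(2n)."""
    p_none = 1.0
    for m in range(n, 2 * n + 1):
        p_none *= (m - 1) / m
    return 1.0 - p_none

for n in [10, 100, 10_000]:
    print(f"n={n:6d}  P(X_n = 1) = {1/n:.4f}   P(some hit in [n, 2n]) = {p_hit_in_window(n):.4f}")
```

So $X_n \xrightarrow{P} 0$, but $P(\sup_{m \geq n} X_m > 1/2)$ stays near $1/2$ for every $n$; by the second Borel-Cantelli lemma ($\sum 1/m = \infty$), $X_m = 1$ infinitely often with probability 1, and the convergence is not almost sure.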

Example 2 Strong Law of Large Numbers (Preview)

Let $X_1, X_2, \ldots$ be iid with $E[X_i] = \mu$ and $\text{Var}(X_i) = \sigma^2 < \infty$. Then

$$\bar{X}_n \xrightarrow{\text{a.s.}} \mu$$

This is the Strong Law of Large Numbers (SLLN). It says more than Theorem 9.1: not only does $P(|\bar{X}_n - \mu| > \varepsilon) \to 0$ (which is convergence in probability), but with probability 1, the realized sequence $\bar{X}_1(\omega), \bar{X}_2(\omega), \bar{X}_3(\omega), \ldots$ converges to $\mu$ as a sequence of real numbers.

The SLLN is the justification for Monte Carlo simulation: when we estimate $E[g(X)]$ by averaging $g(X_1), \ldots, g(X_n)$, the a.s. convergence guarantees that our single run of the simulation will produce an average that converges to the truth. Convergence in probability alone would only guarantee this "with high probability"; we would need to worry about rare bad runs. The SLLN says those bad runs have probability zero.

The full proof requires Kolmogorov's maximal inequality and a truncation argument; see Law of Large Numbers for the complete Etemadi proof. The key difficulty is controlling the supremum $\sup_{m \geq n}|\bar{X}_m - \mu|$, which the Chebyshev argument of Theorem 9.1 cannot do.
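A single seeded run shows what the SLLN promises: one realized path of the running mean settling onto the true mean. The Exponential population with mean 2 is an arbitrary choice for this illustration:

```python
import random

random.seed(42)

# One realized sample path of the running mean for iid Exponential draws with mean 2.
# The SLLN says a single path like this one converges to 2 with probability 1.
mu = 2.0
total = 0.0
path = []
for n in range(1, 100_001):
    total += random.expovariate(1.0 / mu)   # Exponential with mean mu
    path.append(total / n)

for n in (100, 10_000, 100_000):
    print(f"n={n:6d}  running mean = {path[n - 1]:.4f}")
```

This is exactly the Monte Carlo setting described above with $g(x) = x$: we trust the one path we actually simulated.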

Remark 1 Kolmogorov's 0-1 Law and Tail Events

The event $\{\bar{X}_n \to \mu\}$ is a tail event: it depends only on the "tail" of the sequence $(X_1, X_2, \ldots)$ and is unaffected by changing any finite number of terms. Kolmogorov's 0-1 law states that every tail event has probability either 0 or 1. So before we even prove the SLLN, we know that $P(\bar{X}_n \to \mu) \in \{0, 1\}$. The SLLN's job is to establish that the answer is 1, not 0.

This 0-1 structure is why almost sure convergence results tend to be all-or-nothing: either the sample mean converges a.s. to the right thing, or it doesn't converge a.s. at all. There's no middle ground like "converges a.s. with probability 0.7."

Multiple sample paths of the sample mean converging to mu, with the sup envelope shrinking

The interactive explorer below generates sample paths and lets you watch them converge (or not). Try switching between sequences that converge almost surely and sequences that converge only in probability — the difference is visible in the paths.

Interactive: Sample Path Explorer. Sample paths of $\bar{X}_n$ for iid $N(0,1)$ draws (a sequence converging a.s. and in $L^2$ to 0), shown alongside the proportion of paths outside an $\varepsilon$-band: under almost sure convergence, all paths eventually enter the band and stay.

9.4 Convergence in $L^p$

Almost sure convergence and convergence in probability are both qualitative: they measure whether $X_n$ is close to $X$, but not how far away it is on average. $L^p$ convergence brings quantitative control: it requires that the $p$-th moment of the gap vanishes.

Definition 9.4 ($L^p$ Convergence)

For $p \geq 1$, a sequence $\{X_n\}$ converges in $L^p$ to $X$ if

$$E[|X_n - X|^p] \to 0 \quad \text{as } n \to \infty$$

We write $X_n \xrightarrow{L^p} X$. This requires that both $X_n$ and $X$ have finite $p$-th moments.

The $L^p$ norm of a random variable is $\|Y\|_p = (E[|Y|^p])^{1/p}$, so $L^p$ convergence is convergence in the $L^p$ norm: $\|X_n - X\|_p \to 0$.

The most important special case is $p = 2$.

Definition 9.5 (Mean Square Convergence)

A sequence $\{X_n\}$ converges in mean square (or in $L^2$) to $X$ if

$$E[(X_n - X)^2] \to 0 \quad \text{as } n \to \infty$$

We write $X_n \xrightarrow{L^2} X$. Expanding the square:

$$E[(X_n - X)^2] = \text{Var}(X_n - X) + (E[X_n - X])^2 = \text{Var}(X_n - X) + (\text{Bias}_n)^2$$

So $L^2$ convergence requires both the variance of the gap and the squared bias to vanish. This is exactly the bias-variance decomposition applied to the "estimator" $X_n$ targeting the "parameter" $X$.
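The decomposition is easy to verify numerically. The sequence below, $X_n \sim N(1/n, 1/n)$ targeting $X = 0$, is a made-up example for this sketch: its mean squared error should equal variance plus squared bias, $1/n + 1/n^2$:

```python
import random

random.seed(1)

def mse_mc(n, trials=200_000):
    """Monte Carlo estimate of E[(X_n - 0)^2] for X_n ~ N(1/n, 1/n) (an invented example)."""
    bias, sd = 1.0 / n, (1.0 / n) ** 0.5
    total = 0.0
    for _ in range(trials):
        x = bias + sd * random.gauss(0.0, 1.0)
        total += x * x
    return total / trials

mse = {}
for n in [1, 10, 100]:
    mse[n] = mse_mc(n)
    exact = 1.0 / n + 1.0 / n**2   # variance + bias^2
    print(f"n={n:3d}  MC estimate={mse[n]:.5f}  variance + bias^2 = {exact:.5f}")
```

Both pieces vanish as $n$ grows, so this sequence converges in $L^2$ (and hence in mean square) to 0.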

$L^p$ convergence controls not just whether $X_n$ is close to $X$, but whether the moments of $X_n$ approach those of $X$.

Theorem 9.3 ($L^p$ Convergence Implies Convergence of Moments)

If $X_n \xrightarrow{L^p} X$, then $E[|X_n|^p] \to E[|X|^p]$.

More generally, if $X_n \xrightarrow{L^p} X$ and $1 \leq q \leq p$, then $X_n \xrightarrow{L^q} X$.

Proof

Convergence of moments. We use Minkowski's inequality (the triangle inequality for the $L^p$ norm):

$$\|X_n\|_p \leq \|X_n - X\|_p + \|X\|_p$$

and similarly $\|X\|_p \leq \|X - X_n\|_p + \|X_n\|_p$, so:

$$\bigl|\|X_n\|_p - \|X\|_p\bigr| \leq \|X_n - X\|_p$$

Since $X_n \xrightarrow{L^p} X$ means $\|X_n - X\|_p \to 0$, we get $\|X_n\|_p \to \|X\|_p$, which is $(E[|X_n|^p])^{1/p} \to (E[|X|^p])^{1/p}$. Taking $p$-th powers (the function $t \mapsto t^p$ is continuous on $[0, \infty)$) gives $E[|X_n|^p] \to E[|X|^p]$.

Higher $L^p$ implies lower $L^q$. For $1 \leq q \leq p$, Jensen's inequality (Topic 4, Theorem 13) applied to the convex function $t \mapsto t^{p/q}$ gives:

$$(E[|X_n - X|^q])^{p/q} \leq E[|X_n - X|^p]$$

Taking $p$-th roots of both sides:

$$\|X_n - X\|_q = (E[|X_n - X|^q])^{1/q} \leq (E[|X_n - X|^p])^{1/p} = \|X_n - X\|_p$$

So $\|X_n - X\|_p \to 0$ implies $\|X_n - X\|_q \to 0$. In particular, $L^2$ convergence implies $L^1$ convergence. $\square$

This theorem highlights an important contrast. Convergence in probability does not guarantee convergence of moments (Example 7 in Section 9.7 will show this dramatically). $L^p$ convergence does; it's what you need when moment control matters, which is most of statistics.

Example 3 $L^2$ convergence with explicit rate

Let $U \sim \text{Uniform}(0,1)$ and define $X_n = \sqrt{n} \cdot U \cdot \mathbf{1}(U < 1/n)$. We claim $X_n \xrightarrow{L^2} 0$.

We compute $E[X_n^2]$ directly:

$$E[X_n^2] = E\!\left[n \cdot U^2 \cdot \mathbf{1}(U < 1/n)\right] = n \int_0^{1/n} u^2 \, du = n \cdot \frac{u^3}{3}\bigg|_0^{1/n} = n \cdot \frac{1}{3n^3} = \frac{1}{3n^2}$$

So $E[(X_n - 0)^2] = 1/(3n^2) \to 0$, confirming $L^2$ convergence. The rate of convergence is $O(1/n^2)$, quite fast.

For comparison, $E[X_n] = \sqrt{n} \int_0^{1/n} u \, du = \sqrt{n}/(2n^2) = 1/(2n^{3/2})$, which also goes to zero. Both the bias and the variance vanish, as required by the bias-variance decomposition of $E[(X_n - 0)^2]$.
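The integral can be sanity-checked numerically (an illustrative sketch; the midpoint rule and step count are arbitrary choices):

```python
def second_moment(n, steps=100_000):
    """Midpoint-rule approximation of E[X_n^2] = n * integral_0^{1/n} u^2 du."""
    h = (1.0 / n) / steps
    integral = sum(((i + 0.5) * h) ** 2 for i in range(steps)) * h
    return n * integral

for n in [1, 5, 50]:
    print(f"n={n:2d}  numeric={second_moment(n):.3e}  exact 1/(3n^2)={1 / (3 * n**2):.3e}")
```

The numeric values track $1/(3n^2)$ to many digits, confirming the $O(1/n^2)$ rate.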

Lp norms E[|Xn - X|^p] decaying to zero for p = 1, 2, and 4

9.5 Convergence in Distribution

The three modes above all require the random variables $X_n$ and $X$ to live on the same probability space: they compare the random variables pointwise (a.s.), moment-wise ($L^p$), or event-wise (in probability). Convergence in distribution breaks free of this requirement. It compares only the distributions of $X_n$ and $X$, which can live on entirely different probability spaces.

This is the weakest mode, but it’s also the most important for asymptotic statistics. The Central Limit Theorem — the single most-used result in statistical inference — is a statement about convergence in distribution.

Definition 9.6 (Convergence in Distribution)

A sequence $\{X_n\}$ converges in distribution to $X$ if

$$F_{X_n}(x) \to F_X(x) \quad \text{as } n \to \infty$$

at every point $x$ where $F_X$ is continuous. We write $X_n \xrightarrow{d} X$.

The restriction to continuity points of $F_X$ is essential. At jump points of $F_X$, the CDFs $F_{X_n}$ may converge to a value different from $F_X(x)$ due to the discontinuity. For instance, if $X_n = 1/n$ (a deterministic sequence converging to 0) and $X = 0$, then $F_{X_n}(0) = 0$ for all $n$ but $F_X(0) = 1$. At $x = 0$, a discontinuity of $F_X$, the CDFs do not converge. But at every other $x$, they do.
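The $1/n$ example fits in a few lines (an illustrative sketch; the step-function CDFs are written out by hand):

```python
def F_Xn(x, n):
    """CDF of the point mass at 1/n: a step from 0 to 1 at x = 1/n."""
    return 1.0 if x >= 1.0 / n else 0.0

def F_X(x):
    """CDF of the point mass at 0."""
    return 1.0 if x >= 0.0 else 0.0

# At the discontinuity x = 0, F_Xn(0) stays 0 for every n while F_X(0) = 1.
print([F_Xn(0.0, n) for n in (1, 10, 100, 1000)])
# At a continuity point such as x = 0.05, F_Xn(x) eventually equals F_X(x) = 1.
print([F_Xn(0.05, n) for n in (1, 10, 100, 1000)])
```

Excluding the single discontinuity point is what lets us say, correctly, that the point mass at $1/n$ converges in distribution to the point mass at 0.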

Convergence in distribution is sometimes called weak convergence or convergence in law. The notations $X_n \Rightarrow X$ and $X_n \xrightarrow{\mathcal{L}} X$ are also common.

The most powerful characterization of convergence in distribution comes from the Portmanteau lemma, which gives five equivalent conditions.

Theorem 9.4 (Portmanteau Lemma)

The following are equivalent:

(i) $F_{X_n}(x) \to F_X(x)$ at every continuity point $x$ of $F_X$.

(ii) $E[g(X_n)] \to E[g(X)]$ for every bounded continuous function $g : \mathbb{R} \to \mathbb{R}$.

(iii) $\liminf_{n \to \infty} P(X_n \in G) \geq P(X \in G)$ for every open set $G \subseteq \mathbb{R}$.

(iv) $\limsup_{n \to \infty} P(X_n \in F) \leq P(X \in F)$ for every closed set $F \subseteq \mathbb{R}$.

(v) $P(X_n \in A) \to P(X \in A)$ for every Borel set $A$ with $P(X \in \partial A) = 0$ (where $\partial A$ is the boundary of $A$).

Proof

We prove the implication (i) $\Rightarrow$ (ii), which is the direction most often used in practice.

Assume $F_{X_n}(x) \to F_X(x)$ at all continuity points of $F_X$, and let $g$ be a bounded continuous function with $|g| \leq M$.

Fix $\varepsilon > 0$. Since $F_X$ is a CDF, it has at most countably many discontinuities. Choose $a < b$ such that both $a$ and $b$ are continuity points of $F_X$ and $F_X(a) < \varepsilon/(4M)$ and $1 - F_X(b) < \varepsilon/(4M)$. This is possible because $F_X(x) \to 0$ as $x \to -\infty$ and $F_X(x) \to 1$ as $x \to +\infty$, and the discontinuity set is countable.

On the compact interval $[a, b]$, the continuous function $g$ is uniformly continuous. So there exists $\delta > 0$ such that $|g(x) - g(y)| < \varepsilon/2$ whenever $|x - y| < \delta$.

Partition $[a, b]$ into subintervals $a = x_0 < x_1 < \cdots < x_k = b$ with mesh $\max_j(x_j - x_{j-1}) < \delta$, choosing each $x_j$ to be a continuity point of $F_X$ (again possible since discontinuities are countable).

For each subinterval $[x_{j-1}, x_j)$, the value of $g$ varies by at most $\varepsilon/2$ from $g(x_{j-1})$. So the step function $g_{\text{step}}(x) = \sum_{j=1}^k g(x_{j-1}) \mathbf{1}_{[x_{j-1}, x_j)}(x)$ satisfies $|g(x) - g_{\text{step}}(x)| < \varepsilon/2$ for $x \in [a, b)$.

Now we bound $|E[g(X_n)] - E[g(X)]|$. Split the expectation into three parts, the tails $(-\infty, a)$ and $[b, \infty)$ and the middle $[a, b)$:

$$|E[g(X_n)] - E[g(X)]| \leq |E[g(X_n)\mathbf{1}_{(-\infty, a)}] - E[g(X)\mathbf{1}_{(-\infty, a)}]| + |E[g(X_n)\mathbf{1}_{[a,b)}] - E[g(X)\mathbf{1}_{[a,b)}]| + |E[g(X_n)\mathbf{1}_{[b,\infty)}] - E[g(X)\mathbf{1}_{[b,\infty)}]|$$

For the tail terms, since $|g| \leq M$:

$$|E[g(X_n)\mathbf{1}_{(-\infty, a)}]| \leq M \cdot P(X_n < a) = M \cdot F_{X_n}(a^-)$$

Since $a$ is a continuity point of $F_X$, $F_{X_n}(a) \to F_X(a) < \varepsilon/(4M)$, and $F_{X_n}(a^-) \leq F_{X_n}(a)$, so for large $n$, $M \cdot F_{X_n}(a^-) < \varepsilon/4$. Similarly for the right tail. Combined, the tail contributions are at most $\varepsilon/2 + \varepsilon/2 = \varepsilon$ for both $X_n$ and $X$.

For the middle term, replace $g$ with $g_{\text{step}}$ at cost $\varepsilon/2$. The expectation of $g_{\text{step}}$ depends only on $F_{X_n}(x_j) - F_{X_n}(x_{j-1})$ for each $j$, and since all $x_j$ are continuity points of $F_X$, these converge to $F_X(x_j) - F_X(x_{j-1})$. So for large enough $n$, the step-function expectations are within $\varepsilon/2$.

Combining: for all sufficiently large $n$, $|E[g(X_n)] - E[g(X)]| < 3\varepsilon$. Since $\varepsilon$ was arbitrary, $E[g(X_n)] \to E[g(X)]$. $\square$

Characterization (ii) — convergence of expectations of bounded continuous functions — is the one most frequently used in proofs. It’s the definition of weak convergence in functional analysis, which is why convergence in distribution is also called “weak convergence.”

Example 4 Student's $t(\nu)$ converges in distribution to $N(0,1)$

As previewed in Topic 6 — Continuous Distributions (Theorem 18), the Student's $t$ distribution with $\nu$ degrees of freedom converges in distribution to the standard Normal:

$$T_\nu \xrightarrow{d} N(0, 1) \quad \text{as } \nu \to \infty$$

Recall that $T_\nu = Z / \sqrt{V/\nu}$ where $Z \sim N(0,1)$ and $V \sim \chi^2(\nu)$ are independent. By the law of large numbers applied to the chi-squared (which is a sum of $\nu$ independent squared Normals), $V/\nu \xrightarrow{P} 1$. So $\sqrt{V/\nu} \xrightarrow{P} 1$ by the continuous mapping theorem (Theorem 9.9, Section 9.8).

The punchline is Slutsky's theorem (Theorem 9.10, Section 9.8): if $Z \xrightarrow{d} Z$ (trivially) and $\sqrt{V/\nu} \xrightarrow{P} 1$, then $Z / \sqrt{V/\nu} \xrightarrow{d} Z/1 = Z \sim N(0,1)$.

This is why, for large $\nu$, the $t$-distribution and the Normal give nearly identical critical values. At $\nu = 30$, the 97.5th percentile of $t(\nu)$ is 2.042 versus 1.960 for the Normal, a difference of about 4%.
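The first step, $V/\nu \xrightarrow{P} 1$, is easy to watch numerically. This is a seeded illustration; the 0.2 tolerance and trial count are arbitrary choices:

```python
import random

random.seed(7)

def tail_fraction(nu, eps=0.2, trials=500):
    """Fraction of simulated V ~ chi^2(nu) with |V/nu - 1| > eps,
    where V is built directly as a sum of nu squared standard Normals."""
    count = 0
    for _ in range(trials):
        v = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(nu))
        if abs(v / nu - 1.0) > eps:
            count += 1
    return count / trials

frac = {nu: tail_fraction(nu) for nu in (5, 30, 200)}
for nu, f in frac.items():
    print(f"nu={nu:3d}  P(|V/nu - 1| > 0.2) approx {f:.3f}")
```

Since $\text{Var}(V/\nu) = 2/\nu$, the deviation probability drains away as $\nu$ grows, which is the ingredient Slutsky's theorem then converts into $T_\nu \xrightarrow{d} N(0,1)$.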

Remark 2 Lévy's Continuity Theorem (MGF version)

If the moment-generating functions $M_{X_n}(t)$ converge to $M_X(t)$ for all $t$ in an open interval containing 0, and $M_X$ is the MGF of some distribution, then $X_n \xrightarrow{d} X$.

This is the MGF version of Lévy's continuity theorem (the full version uses characteristic functions, which always exist). It converts a convergence-in-distribution problem into a pointwise limit of MGFs, a calculus problem rather than a probability problem. This is exactly the technique we'll use to prove the Poisson limit theorem (Section 9.9) and the Central Limit Theorem.

The MGF version requires that the limit MGF $M_X(t)$ exists in a neighborhood of 0. When it does, MGF uniqueness (Topic 4, Theorem 17) identifies the limiting distribution. The characteristic function version avoids this existence requirement, which is why the full CLT proof uses characteristic functions for the most general version.
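The remark in action: the MGF of $\text{Bin}(n, \lambda/n)$ converges pointwise to the Poisson MGF, previewing the Poisson limit theorem of Section 9.9. This is a small numeric illustration; $\lambda = 3$ and $t = 0.5$ are arbitrary choices:

```python
import math

def binom_mgf(t, n, lam):
    """MGF of Binomial(n, p) with p = lam/n: (1 - p + p*e^t)^n."""
    p = lam / n
    return (1.0 - p + p * math.exp(t)) ** n

def poisson_mgf(t, lam):
    """MGF of Poisson(lam): exp(lam * (e^t - 1))."""
    return math.exp(lam * (math.exp(t) - 1.0))

lam, t = 3.0, 0.5
target = poisson_mgf(t, lam)
errors = {n: abs(binom_mgf(t, n, lam) - target) for n in (10, 100, 10_000)}
for n, err in errors.items():
    print(f"n={n:6d}  |M_Bin(t) - M_Poisson(t)| = {err:.5f}")
```

Since $(1 + \lambda(e^t - 1)/n)^n \to e^{\lambda(e^t - 1)}$ for each fixed $t$, the gap shrinks as $n$ grows, and Lévy's continuity theorem (MGF version) upgrades this pointwise limit to convergence in distribution.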

CDFs of t(1), t(5), t(30), t(100) approaching the standard Normal CDF
Interactive: Convergence in Distribution Explorer. $\text{Bin}(n, \lambda/n) \to \text{Poisson}(\lambda)$ as $n \to \infty$ (the Poisson limit theorem), comparing the CDFs and PMFs and tracking the Kolmogorov-Smirnov distance as $n$ grows.

9.6 The Convergence Diamond

We now have all four modes. The natural question: how do they relate? The answer is a diamond-shaped implication diagram with almost sure and $L^p$ at the top, convergence in probability in the middle, and convergence in distribution at the bottom. The arrows point downward: each mode implies the ones below it, but the reverse implications fail.

Convergence Hierarchy

    almost sure (a.s.)      L^p (mean)
             \                 /
              v               v
            in probability (in prob)
                      |
                      v
            in distribution (in dist)

Downward arrows read "implies". None of the reverse implications hold in general; the one partial converse is that convergence in probability yields almost sure convergence along a subsequence (Theorem 9.8).

We prove each arrow in the diamond.

Theorem 9.5 (Almost Sure Implies In Probability)

If $X_n \xrightarrow{\text{a.s.}} X$, then $X_n \xrightarrow{P} X$.

Proof

Fix $\varepsilon > 0$. Define the indicator random variable $Y_n = \mathbf{1}(|X_n - X| > \varepsilon)$. We need to show $P(|X_n - X| > \varepsilon) = E[Y_n] \to 0$.

Since $X_n \xrightarrow{\text{a.s.}} X$, for almost every $\omega$, $X_n(\omega) \to X(\omega)$. This means $|X_n(\omega) - X(\omega)| \to 0$ for almost every $\omega$, so $Y_n(\omega) = \mathbf{1}(|X_n(\omega) - X(\omega)| > \varepsilon) \to 0$ for almost every $\omega$. That is, $Y_n \to 0$ almost surely.

Now, $|Y_n| \leq 1$ for all $n$ (since $Y_n$ is an indicator), so $Y_n$ is bounded by the constant function 1, which is integrable. By the bounded convergence theorem (a special case of the dominated convergence theorem):

$$E[Y_n] \to E[0] = 0$$

Therefore $P(|X_n - X| > \varepsilon) = E[Y_n] \to 0$, which is $X_n \xrightarrow{P} X$. $\square$

Theorem 9.6 ($L^p$ Implies In Probability)

If $X_n \xrightarrow{L^p} X$ for some $p \geq 1$, then $X_n \xrightarrow{P} X$.

Proof

This is a direct application of Markov's inequality (Topic 4, Theorem 11). Fix $\varepsilon > 0$. Apply Markov's inequality to the non-negative random variable $|X_n - X|^p$ with threshold $\varepsilon^p$:

$$P(|X_n - X| > \varepsilon) = P(|X_n - X|^p > \varepsilon^p) \leq \frac{E[|X_n - X|^p]}{\varepsilon^p}$$

Since $X_n \xrightarrow{L^p} X$, the numerator $E[|X_n - X|^p] \to 0$. The denominator $\varepsilon^p$ is a fixed positive constant. So:

$$0 \leq P(|X_n - X| > \varepsilon) \leq \frac{E[|X_n - X|^p]}{\varepsilon^p} \to 0$$

By the squeeze theorem, $P(|X_n - X| > \varepsilon) \to 0$. $\square$

Theorem 9.7 (In Probability Implies In Distribution)

If $X_n \xrightarrow{P} X$, then $X_n \xrightarrow{d} X$.

Proof

Fix a continuity point $x$ of $F_X$. We need to show $F_{X_n}(x) \to F_X(x)$.

Upper bound. For any $\delta > 0$:

$$F_{X_n}(x) = P(X_n \leq x) = P(X_n \leq x, |X_n - X| \leq \delta) + P(X_n \leq x, |X_n - X| > \delta)$$

The first term: if $X_n \leq x$ and $|X_n - X| \leq \delta$, then $X \leq X_n + \delta \leq x + \delta$, so:

$$P(X_n \leq x, |X_n - X| \leq \delta) \leq P(X \leq x + \delta) = F_X(x + \delta)$$

The second term is at most $P(|X_n - X| > \delta)$. Combining:

$$F_{X_n}(x) \leq F_X(x + \delta) + P(|X_n - X| > \delta)$$

Lower bound. Similarly:

$$F_X(x - \delta) = P(X \leq x - \delta) = P(X \leq x - \delta, |X_n - X| \leq \delta) + P(X \leq x - \delta, |X_n - X| > \delta)$$

If $X \leq x - \delta$ and $|X_n - X| \leq \delta$, then $X_n \leq X + \delta \leq x$, so:

$$P(X \leq x - \delta, |X_n - X| \leq \delta) \leq P(X_n \leq x) = F_{X_n}(x)$$

Therefore:

$$F_X(x - \delta) \leq F_{X_n}(x) + P(|X_n - X| > \delta)$$

Rearranging: $F_X(x - \delta) - P(|X_n - X| > \delta) \leq F_{X_n}(x)$.

Combining the bounds:

$$F_X(x - \delta) - P(|X_n - X| > \delta) \leq F_{X_n}(x) \leq F_X(x + \delta) + P(|X_n - X| > \delta)$$

Now take $n \to \infty$. Since $X_n \xrightarrow{P} X$, $P(|X_n - X| > \delta) \to 0$. So:

$$F_X(x - \delta) \leq \liminf_{n} F_{X_n}(x) \leq \limsup_{n} F_{X_n}(x) \leq F_X(x + \delta)$$

This holds for every $\delta > 0$. Now take $\delta \to 0$. Since $x$ is a continuity point of $F_X$, both $F_X(x - \delta) \to F_X(x)$ and $F_X(x + \delta) \to F_X(x)$. By the squeeze:

$$\liminf_{n} F_{X_n}(x) = \limsup_{n} F_{X_n}(x) = F_X(x)$$

So $\lim_n F_{X_n}(x) = F_X(x)$, which is $X_n \xrightarrow{d} X$. $\square$

There is one partial converse worth knowing: convergence in probability implies a.s. convergence along a subsequence.

Theorem 9.8 (In Probability Implies A.S. Along a Subsequence)

If XnPXX_n \xrightarrow{P} X, then there exists a subsequence {nk}\{n_k\} such that Xnka.s.XX_{n_k} \xrightarrow{\text{a.s.}} X.

Proof.

Since XnPXX_n \xrightarrow{P} X, for each integer k1k \geq 1, P(XnX>1/k)0P(|X_n - X| > 1/k) \to 0 as nn \to \infty. So we can choose nkn_k (with n1<n2<n3<n_1 < n_2 < n_3 < \cdots) such that:

P(XnkX>1/k)<2kP(|X_{n_k} - X| > 1/k) < 2^{-k}

Define the events Ak={XnkX>1/k}A_k = \{|X_{n_k} - X| > 1/k\}. Then:

k=1P(Ak)<k=12k=1<\sum_{k=1}^{\infty} P(A_k) < \sum_{k=1}^{\infty} 2^{-k} = 1 < \infty

By the first Borel-Cantelli lemma, P(lim supkAk)=0P(\limsup_k A_k) = 0. That is, with probability 1, only finitely many of the events AkA_k occur. So for almost every ω\omega, there exists K(ω)K(\omega) such that for all kK(ω)k \geq K(\omega):

Xnk(ω)X(ω)1k|X_{n_k}(\omega) - X(\omega)| \leq \frac{1}{k}

Since 1/k01/k \to 0, this gives Xnk(ω)X(ω)X_{n_k}(\omega) \to X(\omega) for almost every ω\omega, which is Xnka.s.XX_{n_k} \xrightarrow{\text{a.s.}} X. \square

This subsequence result is a key technical tool. It’s how most proofs upgrade convergence in probability to almost sure convergence: extract a subsequence that converges a.s., use the a.s. convergence to establish some property, then extend back to the full sequence.
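The subsequence trick can be watched numerically. Below is a minimal sketch (my own illustration, not a construction from this topic): independent XnBernoulli(1/n)X_n \sim \text{Bernoulli}(1/n) converge to 0 in probability, and since 1/n=\sum 1/n = \infty the full sequence keeps producing 1s on almost every path (second Borel-Cantelli lemma); along nk=2kn_k = 2^k the probabilities are summable, so the subsequence tail is almost surely all zeros, exactly as in the proof.

```python
import random

random.seed(0)

def bernoulli_path(n_max):
    """One sample path of independent X_n ~ Bernoulli(1/n)."""
    return [1 if random.random() < 1.0 / n else 0 for n in range(1, n_max + 1)]

trials, n_max = 200, 1 << 14
tail_hits_full = 0.0   # hits with n > 1000 on the full sequence
tail_hits_sub = 0.0    # hits along the subsequence n_k = 2^k, k >= 10
for _ in range(trials):
    path = bernoulli_path(n_max)
    tail_hits_full += sum(path[1000:])
    tail_hits_sub += sum(path[(1 << k) - 1] for k in range(10, 15))

# Full sequence: sum of 1/n diverges, so 1s keep appearing (no a.s. convergence).
print(tail_hits_full / trials)
# Subsequence: sum of 2^{-k} converges, so the tail is almost surely all zeros.
print(tail_hits_sub / trials)
```

The average path still collects several tail hits on the full sequence, while the geometrically thinned subsequence collects essentially none.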

Remark 3 Why the hierarchy matters in practice

The mode of convergence you need depends on what you’re trying to do:

  • Almost sure convergence is what you need for SGD guarantees. When training a neural network, you run one optimization trajectory. The SLLN (a.s. convergence of the sample mean) and Robbins-Monro theory (a.s. convergence of SGD iterates) guarantee that your specific run converges — not just that runs converge “on average” or “with high probability.”

  • Convergence in probability is what PAC learning provides. A PAC bound says P(R^(h)R(h)>ε)δP(|\hat{R}(h) - R(h)| > \varepsilon) \leq \delta — the empirical risk is close to the true risk with probability at least 1δ1 - \delta. This is convergence in probability, not a.s. You can’t guarantee that every dataset will give a good estimate, but the bad datasets become vanishingly rare.

  • Convergence in distribution is what the CLT gives. The statement n(Xˉnμ)/σdN(0,1)\sqrt{n}(\bar{X}_n - \mu)/\sigma \xrightarrow{d} N(0,1) tells you the shape of the sampling distribution is approximately Normal. It says nothing about individual sample paths converging — only about the aggregate distributional behavior. This is enough for confidence intervals and hypothesis tests.

Choosing the wrong mode is like using a wrench as a hammer — it might work, but you lose the guarantee. If you need every training run to converge, convergence in distribution is insufficient. If you need the distribution of a test statistic, a.s. convergence is overkill.

Diamond diagram: a.s. and Lp at top, in prob in middle, in dist at bottom, with arrows and crossed arrows

9.7 Counterexamples: Why the Arrows Don’t Reverse

The hierarchy in Section 9.6 has three missing arrows: in probability does not imply a.s., in probability does not imply LpL^p, and convergence in distribution does not imply convergence of expectations. We now construct explicit counterexamples for each.

Interactive: Convergence Counterexamples
Example 5 The Typewriter Sequence: In probability but NOT almost surely

Let UUniform(0,1)U \sim \text{Uniform}(0,1). We construct a sequence {Xn}\{X_n\} that converges to 0 in probability but not almost surely. The idea: let intervals of decreasing length “slide” across [0,1][0, 1] repeatedly, so every point gets covered infinitely often.

Construction. Index the sequence so that for the mm-th pass, we divide [0,1][0,1] into mm equal intervals, each of length 1/m1/m. The first pass (m=1m = 1) has one interval [0,1][0, 1]. The second pass (m=2m = 2) has two intervals [0,1/2)[0, 1/2) and [1/2,1][1/2, 1]. The third pass (m=3m = 3) has three intervals [0,1/3)[0, 1/3), [1/3,2/3)[1/3, 2/3), [2/3,1][2/3, 1]. And so on.

Formally, we define a mapping from nn to a pair (m,j)(m, j) where mm is the pass number and j{0,1,,m1}j \in \{0, 1, \ldots, m-1\} is the interval within that pass. The relationship is: n=m(m1)/2+j+1n = m(m-1)/2 + j + 1 (since the first m1m - 1 passes use 1+2++(m1)=m(m1)/21 + 2 + \cdots + (m-1) = m(m-1)/2 indices). Then:

Xn=1 ⁣(U[jm,j+1m))X_n = \mathbf{1}\!\left(U \in \left[\frac{j}{m}, \frac{j+1}{m}\right)\right)

Convergence in probability. For each nn in pass mm, P(Xn0)=P(Xn=1)=1/mP(X_n \neq 0) = P(X_n = 1) = 1/m. Since mm \to \infty as nn \to \infty (specifically, m2nm \approx \sqrt{2n}), we have P(Xn>ε)=1/m0P(|X_n| > \varepsilon) = 1/m \to 0 for any ε>0\varepsilon > 0. So XnP0X_n \xrightarrow{P} 0.

Failure of a.s. convergence. Fix any u[0,1)u \in [0, 1). In pass mm, the point uu falls in the interval [mu/m,(mu+1)/m)[\lfloor mu \rfloor / m, (\lfloor mu \rfloor + 1)/m). So for the index nn corresponding to (m,mu)(m, \lfloor mu \rfloor), we have Xn(u)=1X_n(u) = 1. This happens for every mm, so Xn(u)=1X_n(u) = 1 for infinitely many nn. Since Xn(u)=0X_n(u) = 0 for other nn in the same pass, the sequence X1(u),X2(u),X_1(u), X_2(u), \ldots oscillates between 0 and 1 forever and does not converge.

This holds for every u[0,1)u \in [0,1), so P(limXn exists)=0P(\lim X_n \text{ exists}) = 0. Not only does the sequence fail to converge a.s. to 0 — it fails to converge a.s. to anything.
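A short script makes the construction concrete. This is a sketch (the helpers pass_and_slot and X are my names for the indexing above): for any fixed uu, every pass contributes exactly one index with Xn(u)=1X_n(u) = 1, while P(Xn0)=1/mP(X_n \neq 0) = 1/m shrinks.

```python
def pass_and_slot(n):
    """Invert n = m(m-1)/2 + j + 1 into (pass m, slot j)."""
    m = 1
    while m * (m + 1) // 2 < n:
        m += 1
    return m, n - m * (m - 1) // 2 - 1

def X(n, u):
    """Typewriter indicator X_n(u) = 1(u in [j/m, (j+1)/m))."""
    m, j = pass_and_slot(n)
    return 1 if j / m <= u < (j + 1) / m else 0

u = 0.7441                 # any fixed point of [0, 1) works; this choice is arbitrary
N = 100 * 101 // 2         # indices covering passes m = 1..100 exactly
hits = [n for n in range(1, N + 1) if X(n, u) == 1]

print(len(hits))           # one hit per pass: X_n(u) = 1 occurs 100 times among n <= N
m_last, _ = pass_and_slot(N)
print(1.0 / m_last)        # P(X_N != 0) = 1/100, heading to 0
```

The hit count grows without bound as more passes are added, which is exactly the "infinitely often" behavior that kills almost sure convergence.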

Example 6 Escape to Infinity: In probability but NOT in L1

Let UUniform(0,1)U \sim \text{Uniform}(0, 1) and define:

Xn=n1(U<1/n)X_n = n \cdot \mathbf{1}(U < 1/n)

Convergence in probability. P(Xn0)=P(U<1/n)=1/n0P(X_n \neq 0) = P(U < 1/n) = 1/n \to 0. So for any ε>0\varepsilon > 0, P(Xn>ε)P(Xn0)=1/n0P(|X_n| > \varepsilon) \leq P(X_n \neq 0) = 1/n \to 0, giving XnP0X_n \xrightarrow{P} 0.

Failure of L1L^1 convergence. The expected value is:

E[Xn]=nP(U<1/n)=n1n=1E[X_n] = n \cdot P(U < 1/n) = n \cdot \frac{1}{n} = 1

for every nn. So E[Xn0]=E[Xn]=1↛0E[|X_n - 0|] = E[X_n] = 1 \not\to 0. We do not have XnL10X_n \xrightarrow{L^1} 0.

The intuition: as nn grows, XnX_n is nonzero with vanishing probability (convergence in probability), but when it is nonzero, its value is nn — growing without bound. The spike becomes less likely but taller, maintaining a constant expectation. The probability mass that “escapes to infinity” prevents L1L^1 convergence.

More dramatically, E[Xn2]=n2(1/n)=nE[X_n^2] = n^2 \cdot (1/n) = n \to \infty, so we don’t have L2L^2 convergence either. The sequence converges in probability but in no LpL^p space.
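The escape-to-infinity behavior is easy to see numerically. A minimal simulation sketch (sample sizes and seed are arbitrary choices):

```python
import random

random.seed(1)
trials = 100_000
results = {}
for n in (10, 100, 1000):
    draws = [n if random.random() < 1.0 / n else 0 for _ in range(trials)]
    p_nonzero = sum(1 for x in draws if x != 0) / trials
    sample_mean = sum(draws) / trials
    results[n] = (p_nonzero, sample_mean)
    print(n, results[n])

# P(X_n != 0) = 1/n shrinks toward 0 (convergence in probability),
# while the sample mean stays pinned near E[X_n] = 1 (no L^1 convergence).
```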

Example 7 Convergence in distribution without convergence of expectations

Define XnX_n by:

P(Xn=0)=11n,P(Xn=n2)=1nP(X_n = 0) = 1 - \frac{1}{n}, \qquad P(X_n = n^2) = \frac{1}{n}

Convergence in distribution. The CDF of XnX_n is: FXn(x)=0F_{X_n}(x) = 0 for x<0x < 0, FXn(x)=11/nF_{X_n}(x) = 1 - 1/n for 0x<n20 \leq x < n^2, and FXn(x)=1F_{X_n}(x) = 1 for xn2x \geq n^2.

For any fixed x0x \geq 0, once nn is large enough that n2>xn^2 > x, we have FXn(x)=11/n1=F0(x)F_{X_n}(x) = 1 - 1/n \to 1 = F_0(x) (where F0F_0 is the CDF of the constant 0: F0(x)=1(x0)F_0(x) = \mathbf{1}(x \geq 0)). For x<0x < 0, FXn(x)=0=F0(x)F_{X_n}(x) = 0 = F_0(x) trivially. So Xnd0X_n \xrightarrow{d} 0.

Divergence of expectations. But:

E[Xn]=0(11n)+n21n=nE[X_n] = 0 \cdot \left(1 - \frac{1}{n}\right) + n^2 \cdot \frac{1}{n} = n \to \infty

The distributions converge to a point mass at 0, yet the expectations diverge to infinity. The tiny probability mass at n2n^2 is invisible to the CDF at any fixed xx, but it dominates the expectation.

This is why convergence in distribution — the mode the CLT provides — does not automatically give convergence of moments. For the CLT, the moments of n(Xˉnμ)/σ\sqrt{n}(\bar{X}_n - \mu)/\sigma do converge to those of N(0,1)N(0,1), but this requires additional uniform integrability arguments beyond the basic convergence-in-distribution statement.
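Here no simulation is needed; both quantities can be computed exactly. A small sketch (the evaluation point x = 7 is an arbitrary fixed choice):

```python
def cdf(n, x):
    """CDF of X_n with P(X_n = 0) = 1 - 1/n and P(X_n = n^2) = 1/n."""
    if x < 0:
        return 0.0
    return 1.0 - 1.0 / n if x < n * n else 1.0

x = 7.0
for n in (10, 100, 1000):
    mean = n * n * (1.0 / n)  # E[X_n] = n
    print(n, cdf(n, x), mean)
# F_{X_n}(7) -> 1 = F_0(7), yet E[X_n] = n -> infinity.
```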

Three counterexamples: typewriter sliding intervals, escape-to-infinity spike, and diverging expectations

9.8 Continuous Mapping, Slutsky, and the Delta Method

The four modes of convergence tell us how sequences of random variables approach their limits. But in practice, we rarely work with the raw sequence — we transform it, combine it with other sequences, and apply functions. The continuous mapping theorem, Slutsky’s theorem, and the delta method are the three tools that let us do this while preserving convergence guarantees.

Theorem 9.9 (Continuous Mapping Theorem)

Let g:RRg : \mathbb{R} \to \mathbb{R} be a continuous function. Then:

(a) If XndXX_n \xrightarrow{d} X, then g(Xn)dg(X)g(X_n) \xrightarrow{d} g(X).

(b) If XnPXX_n \xrightarrow{P} X, then g(Xn)Pg(X)g(X_n) \xrightarrow{P} g(X).

(c) If Xna.s.XX_n \xrightarrow{\text{a.s.}} X, then g(Xn)a.s.g(X)g(X_n) \xrightarrow{\text{a.s.}} g(X).

In short, continuous functions preserve all three modes of convergence.

Proof.

We prove part (b) in full, which illustrates the key technique.

Fix ε>0\varepsilon > 0. We need to show P(g(Xn)g(X)>ε)0P(|g(X_n) - g(X)| > \varepsilon) \to 0.

The difficulty is that gg is continuous on all of R\mathbb{R}, but continuity only gives us local control: for each xx, if yy is close to xx, then g(y)g(y) is close to g(x)g(x). On a compact set, this local control becomes uniform. On all of R\mathbb{R}, we need to handle the tails separately.

Step 1: Truncation. Choose M>0M > 0 large enough that P(X>M)<εP(|X| > M) < \varepsilon. This is possible because P(X>M)0P(|X| > M) \to 0 as MM \to \infty.

Step 2: Uniform continuity on compact sets. On the compact set [M1,M+1][-M - 1, M + 1], the continuous function gg is uniformly continuous. So there exists δ>0\delta > 0 (with δ<1\delta < 1) such that g(x)g(y)<ε|g(x) - g(y)| < \varepsilon whenever xy<δ|x - y| < \delta and both x,y[M1,M+1]x, y \in [-M - 1, M + 1].

Step 3: Event decomposition. We split:

{g(Xn)g(X)>ε}{XnXδ}{X>M}{Xn>M+1}\{|g(X_n) - g(X)| > \varepsilon\} \subseteq \{|X_n - X| \geq \delta\} \cup \{|X| > M\} \cup \{|X_n| > M + 1\}

To see why: if XnX<δ|X_n - X| < \delta, XM|X| \leq M, and XnM+1|X_n| \leq M + 1, then both XX and XnX_n lie in [M1,M+1][-M-1, M+1] and XnX<δ|X_n - X| < \delta, so g(Xn)g(X)<ε|g(X_n) - g(X)| < \varepsilon by uniform continuity. The contrapositive gives the inclusion above.

Step 4: Bounding each piece.

P(g(Xn)g(X)>ε)P(XnXδ)+P(X>M)+P(Xn>M+1)P(|g(X_n) - g(X)| > \varepsilon) \leq P(|X_n - X| \geq \delta) + P(|X| > M) + P(|X_n| > M + 1)

The first term P(XnXδ)0P(|X_n - X| \geq \delta) \to 0 because XnPXX_n \xrightarrow{P} X.

The second term P(X>M)<εP(|X| > M) < \varepsilon by our choice of MM.

The third term: {Xn>M+1}{XnX>1}{X>M}\{|X_n| > M + 1\} \subseteq \{|X_n - X| > 1\} \cup \{|X| > M\} (triangle inequality), so P(Xn>M+1)P(XnX>1)+P(X>M)P(|X_n| > M + 1) \leq P(|X_n - X| > 1) + P(|X| > M). The first part goes to 0 (convergence in probability), and the second is at most ε\varepsilon.

Combining, for all sufficiently large nn:

P(g(Xn)g(X)>ε)<ε+ε+2ε=4εP(|g(X_n) - g(X)| > \varepsilon) < \varepsilon + \varepsilon + 2\varepsilon = 4\varepsilon

Since ε>0\varepsilon > 0 was arbitrary, P(g(Xn)g(X)>ε)0P(|g(X_n) - g(X)| > \varepsilon) \to 0, giving g(Xn)Pg(X)g(X_n) \xrightarrow{P} g(X). \square

The continuous mapping theorem handles transformations of a single converging sequence. Slutsky’s theorem handles combinations of a converging-in-distribution sequence with a converging-in-probability sequence.

Theorem 9.10 (Slutsky's Theorem)

If XndXX_n \xrightarrow{d} X and YnPcY_n \xrightarrow{P} c (where cc is a constant), then:

(a) Xn+YndX+cX_n + Y_n \xrightarrow{d} X + c

(b) YnXndcXY_n X_n \xrightarrow{d} cX

(c) Xn/YndX/cX_n / Y_n \xrightarrow{d} X / c provided c0c \neq 0

Proof.

We prove part (a). Parts (b) and (c) follow by similar arguments.

We use the Portmanteau characterization (Theorem 9.4, part (ii)): we need to show E[g(Xn+Yn)]E[g(X+c)]E[g(X_n + Y_n)] \to E[g(X + c)] for every bounded continuous gg.

Let gg be bounded by MM and continuous. Fix ε>0\varepsilon > 0. Since gg is continuous and bounded, it is uniformly continuous on any compact set. Choose R>0R > 0 large enough that P(X>R)<ε/(4M)P(|X| > R) < \varepsilon/(4M) and P(Xn>R)<ε/(4M)P(|X_n| > R) < \varepsilon/(4M) for all large nn (the latter follows from convergence in distribution via tightness).

On [R1,R+1][-R - 1, R + 1], gg is uniformly continuous: there exists δ>0\delta > 0 such that g(s)g(t)<ε|g(s) - g(t)| < \varepsilon when st<δ|s - t| < \delta and both s,t[R1,R+1]s, t \in [-R - 1, R + 1].

Now decompose:

E[g(Xn+Yn)]E[g(Xn+c)]E[g(Xn+Yn)g(Xn+c)]|E[g(X_n + Y_n)] - E[g(X_n + c)]| \leq E[|g(X_n + Y_n) - g(X_n + c)|]

Split this expectation based on whether Ync<δ|Y_n - c| < \delta and XnR|X_n| \leq R:

When Ync<δ|Y_n - c| < \delta and XnR|X_n| \leq R: both Xn+YnX_n + Y_n and Xn+cX_n + c lie in [R1,R+1][-R-1, R+1] (for δ<1\delta < 1), and (Xn+Yn)(Xn+c)=Ync<δ|(X_n + Y_n) - (X_n + c)| = |Y_n - c| < \delta, so g(Xn+Yn)g(Xn+c)<ε|g(X_n + Y_n) - g(X_n + c)| < \varepsilon.

On the complementary event:

E[g(Xn+Yn)g(Xn+c)]ε+2MP(Yncδ)+2MP(Xn>R)E[|g(X_n + Y_n) - g(X_n + c)|] \leq \varepsilon + 2M \cdot P(|Y_n - c| \geq \delta) + 2M \cdot P(|X_n| > R)

Since YnPcY_n \xrightarrow{P} c, P(Yncδ)0P(|Y_n - c| \geq \delta) \to 0. By tightness, P(Xn>R)<ε/(4M)P(|X_n| > R) < \varepsilon/(4M) for large nn. So for large nn:

E[g(Xn+Yn)]E[g(Xn+c)]<2ε|E[g(X_n + Y_n)] - E[g(X_n + c)]| < 2\varepsilon

Now, the continuous mapping theorem for convergence in distribution (applied to h(x)=g(x+c)h(x) = g(x + c), which is bounded and continuous) gives E[g(Xn+c)]=E[h(Xn)]E[h(X)]=E[g(X+c)]E[g(X_n + c)] = E[h(X_n)] \to E[h(X)] = E[g(X + c)].

Combining: E[g(Xn+Yn)]E[g(X+c)]<3ε|E[g(X_n + Y_n)] - E[g(X + c)]| < 3\varepsilon for large nn. Since ε\varepsilon was arbitrary and gg was an arbitrary bounded continuous function, Xn+YndX+cX_n + Y_n \xrightarrow{d} X + c by the Portmanteau lemma. \square

Slutsky’s theorem is the workhorse of asymptotic statistics. It’s how we derived t(ν)dN(0,1)t(\nu) \xrightarrow{d} N(0,1) in Example 4: the numerator ZZ converges in distribution to N(0,1)N(0,1) (trivially — it is N(0,1)N(0,1)), the denominator V/ν\sqrt{V/\nu} converges in probability to 1, so the ratio converges in distribution to Z/1=ZZ/1 = Z.

A crucial restriction: Slutsky’s theorem requires one of the two sequences to converge to a constant. If both XnX_n and YnY_n converge in distribution to non-degenerate limits, the joint behavior of Xn+YnX_n + Y_n is not determined by the marginal limits — we would need information about the joint distribution.
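The t-statistic application can be checked by simulation. The sketch below (degrees of freedom, trial count, and seed are arbitrary choices) draws t(ν)t(\nu) as Z/V/νZ/\sqrt{V/\nu} and watches the two-sided tail probability at 1.96 approach the Normal value 0.05 as the denominator concentrates at 1:

```python
import random, math

random.seed(2)

def t_draw(nu):
    """t(nu) = Z / sqrt(V / nu) with Z ~ N(0,1) and V ~ chi^2_nu independent."""
    z = random.gauss(0.0, 1.0)
    v = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(nu))
    return z / math.sqrt(v / nu)

trials = 10_000
tails = {}
for nu in (3, 30, 100):
    draws = [t_draw(nu) for _ in range(trials)]
    tails[nu] = sum(1 for t in draws if abs(t) > 1.96) / trials
    print(nu, tails[nu])
# Heavy tails for small nu; approaches the N(0,1) tail probability 0.05 as nu grows.
```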

The delta method converts a convergence-in-distribution result about XnX_n into one about g(Xn)g(X_n), with explicit variance formulas.

Theorem 9.11 (Delta Method — First Order)

Suppose n(Xnμ)dN(0,σ2)\sqrt{n}(X_n - \mu) \xrightarrow{d} N(0, \sigma^2), and let gg be a function that is differentiable at μ\mu with g(μ)0g'(\mu) \neq 0. Then:

n(g(Xn)g(μ))dN ⁣(0,[g(μ)]2σ2)\sqrt{n}\bigl(g(X_n) - g(\mu)\bigr) \xrightarrow{d} N\!\left(0, \, [g'(\mu)]^2 \sigma^2\right)

In words: the asymptotic variance of g(Xn)g(X_n) is (g(μ))2(g'(\mu))^2 times the asymptotic variance of XnX_n.

The proof uses a Taylor expansion and Slutsky’s theorem. The key insight is that for XnX_n close to μ\mu, g(Xn)g(μ)+g(μ)(Xnμ)g(X_n) \approx g(\mu) + g'(\mu)(X_n - \mu), so the distribution of g(Xn)g(μ)g(X_n) - g(\mu) is approximately g(μ)g'(\mu) times the distribution of XnμX_n - \mu.

When g(μ)=0g'(\mu) = 0, the first-order Taylor term vanishes and we need the second-order term.

Theorem 9.12 (Delta Method — Second Order)

Suppose n(Xnμ)dN(0,σ2)\sqrt{n}(X_n - \mu) \xrightarrow{d} N(0, \sigma^2), and let gg be twice differentiable at μ\mu with g(μ)=0g'(\mu) = 0 and g(μ)0g''(\mu) \neq 0. Then:

n(g(Xn)g(μ))dg(μ)2σ2χ12n\bigl(g(X_n) - g(\mu)\bigr) \xrightarrow{d} \frac{g''(\mu)}{2} \sigma^2 \chi^2_1

Note the scaling changes from n\sqrt{n} to nn: the convergence rate is slower when the first derivative vanishes.

Example 8 Delta method for the log transform

Let X1,,XnX_1, \ldots, X_n be iid with mean μ>0\mu > 0 and variance σ2\sigma^2. By the CLT (proved in full in Topic 11):

n(Xˉnμ)dN(0,σ2)\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \sigma^2)

Apply the delta method with g(x)=log(x)g(x) = \log(x). We have g(x)=1/xg'(x) = 1/x, so g(μ)=1/μ0g'(\mu) = 1/\mu \neq 0 (since μ>0\mu > 0). Therefore:

n(logXˉnlogμ)dN ⁣(0,σ2μ2)\sqrt{n}(\log \bar{X}_n - \log \mu) \xrightarrow{d} N\!\left(0, \frac{\sigma^2}{\mu^2}\right)

The asymptotic variance of logXˉn\log \bar{X}_n is σ2/(nμ2)\sigma^2 / (n\mu^2).

Concrete computation for Exponential(1). If XiExponential(1)X_i \sim \text{Exponential}(1), then μ=1\mu = 1 and σ2=1\sigma^2 = 1. The delta method gives:

n(logXˉnlog1)=nlogXˉndN(0,1)\sqrt{n}(\log \bar{X}_n - \log 1) = \sqrt{n} \log \bar{X}_n \xrightarrow{d} N(0, 1)

So logXˉn\log \bar{X}_n has asymptotic variance 1/n1/n — exactly the same as Xˉn\bar{X}_n itself in this special case. This is because g(1)=1g'(1) = 1, so the log transform preserves the asymptotic variance at μ=1\mu = 1.

For μ1\mu \neq 1, the delta method variance σ2/μ2\sigma^2/\mu^2 is the squared coefficient of variation of the underlying distribution times 1/n1/n. The coefficient of variation CV=σ/μ\text{CV} = \sigma/\mu is dimensionless, which is why the log transform is natural for data with a constant CV (multiplicative noise models).
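The Exponential(1) computation is easy to verify numerically. A minimal sketch (n, replication count, and seed are arbitrary choices):

```python
import random, math

random.seed(3)
n, reps = 200, 5000
logs = []
for _ in range(reps):
    xbar = sum(random.expovariate(1.0) for _ in range(n)) / n
    logs.append(math.log(xbar))

mean = sum(logs) / reps
var = sum((v - mean) ** 2 for v in logs) / reps
print(round(mean, 3))      # near log(mu) = 0
print(round(n * var, 3))   # n * Var(log Xbar_n) near 1, matching the delta method
```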

Delta method: CLT Normal for Xbar, then transformed Normal for g(Xbar) with g' scaling the variance
Interactive: Delta Method Explorer
Histogram of g(X̄n) over 2000 replications; g(x) = ln(x) with tangent at x = 5.0.

| Quantity | Delta Method | Simulation |
| --- | --- | --- |
| Mean of g(X̄) | 1.6094 | 1.6063 |
| Variance of g(X̄) | 0.0016 | 0.0016 |
| μ | 5.0000 | |
| σ² | 4.0000 | |
| g(μ) | 1.6094 | |
| g′(μ) | 0.2000 | |
| [g′(μ)]² σ² / n | 0.0016 | |

9.9 Making the Poisson Limit Theorem Rigorous

In Topic 5 — Discrete Distributions, we stated that Bin(n,λ/n)Poisson(λ)\text{Bin}(n, \lambda/n) \to \text{Poisson}(\lambda) as nn \to \infty. At the time, we verified this by showing the PMFs converge pointwise. Now we can make this fully rigorous using convergence in distribution, proved via MGF convergence (Lévy’s continuity theorem, Remark 2).

Theorem 9.13 (Poisson Limit Theorem)

Let XnBinomial(n,λ/n)X_n \sim \text{Binomial}(n, \lambda/n) for a fixed λ>0\lambda > 0. Then:

XndPoisson(λ)X_n \xrightarrow{d} \text{Poisson}(\lambda)
Proof.

We prove this by showing that the MGF of XnX_n converges to the MGF of Poisson(λ)\text{Poisson}(\lambda) at every tRt \in \mathbb{R}, then invoking Lévy’s continuity theorem (Remark 2).

Step 1: MGF of Bin(n,λ/n)\text{Bin}(n, \lambda/n). The MGF of a Binomial(n,p)\text{Binomial}(n, p) random variable is M(t)=(1p+pet)nM(t) = (1 - p + pe^t)^n (derived by summing etke^{tk} against the binomial PMF and applying the binomial theorem). With p=λ/np = \lambda/n:

MXn(t)=(1λn+λnet)nM_{X_n}(t) = \left(1 - \frac{\lambda}{n} + \frac{\lambda}{n} e^t\right)^n

Factor out the 1 and rewrite:

MXn(t)=(1+λ(et1)n)nM_{X_n}(t) = \left(1 + \frac{\lambda(e^t - 1)}{n}\right)^n

Step 2: Take the limit. This is a sequence of the form (1+a/n)n(1 + a/n)^n with a=λ(et1)a = \lambda(e^t - 1), a constant depending only on tt and λ\lambda, not on nn. So we can use the fundamental limit limn(1+a/n)n=ea\lim_{n \to \infty}(1 + a/n)^n = e^a directly:

limnMXn(t)=limn(1+λ(et1)n)n=exp ⁣(λ(et1))\lim_{n \to \infty} M_{X_n}(t) = \lim_{n \to \infty} \left(1 + \frac{\lambda(e^t - 1)}{n}\right)^n = \exp\!\left(\lambda(e^t - 1)\right)

To be more careful, take the logarithm:

logMXn(t)=nlog ⁣(1+λ(et1)n)\log M_{X_n}(t) = n \log\!\left(1 + \frac{\lambda(e^t - 1)}{n}\right)

Using the Taylor expansion log(1+x)=xx2/2+O(x3)\log(1 + x) = x - x^2/2 + O(x^3) with x=λ(et1)/nx = \lambda(e^t - 1)/n:

logMXn(t)=n[λ(et1)nλ2(et1)22n2+O(n3)]\log M_{X_n}(t) = n \left[\frac{\lambda(e^t - 1)}{n} - \frac{\lambda^2(e^t - 1)^2}{2n^2} + O(n^{-3})\right]=λ(et1)λ2(et1)22n+O(n2)= \lambda(e^t - 1) - \frac{\lambda^2(e^t - 1)^2}{2n} + O(n^{-2})

As nn \to \infty:

logMXn(t)λ(et1)\log M_{X_n}(t) \to \lambda(e^t - 1)

Exponentiating: MXn(t)exp(λ(et1))M_{X_n}(t) \to \exp(\lambda(e^t - 1)).

Step 3: Identify the limit. The MGF of YPoisson(λ)Y \sim \text{Poisson}(\lambda) is:

MY(t)=E[etY]=k=0etkλkeλk!=eλk=0(λet)kk!=eλeλet=exp ⁣(λ(et1))M_Y(t) = E[e^{tY}] = \sum_{k=0}^{\infty} e^{tk} \frac{\lambda^k e^{-\lambda}}{k!} = e^{-\lambda} \sum_{k=0}^{\infty} \frac{(\lambda e^t)^k}{k!} = e^{-\lambda} \cdot e^{\lambda e^t} = \exp\!\left(\lambda(e^t - 1)\right)

So MXn(t)MY(t)M_{X_n}(t) \to M_Y(t) for all tRt \in \mathbb{R}. The limit MY(t)=exp(λ(et1))M_Y(t) = \exp(\lambda(e^t - 1)) exists for all tt and is the MGF of Poisson(λ)\text{Poisson}(\lambda).

Step 4: Apply Lévy’s continuity theorem. Since MXn(t)M_{X_n}(t) converges to MY(t)M_Y(t) pointwise on all of R\mathbb{R} (which contains a neighborhood of 0), and MYM_Y is the MGF of Poisson(λ)\text{Poisson}(\lambda), the theorem gives XndYPoisson(λ)X_n \xrightarrow{d} Y \sim \text{Poisson}(\lambda).

By MGF uniqueness (Topic 4, Theorem 17), the limiting distribution is uniquely identified as Poisson(λ)\text{Poisson}(\lambda). \square
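The convergence can also be quantified directly by computing the total variation distance between the two PMFs. A small exact sketch (λ = 5; the PMF tables are built by their standard recursions, and the Le Cam bound in the comment is a classical result, not proved in this topic):

```python
import math

def binom_pmf_table(n, p, kmax):
    """PMF of Bin(n, p) for k = 0..kmax, via b(k+1) = b(k)(n-k)p / ((k+1)(1-p))."""
    out = [(1.0 - p) ** n]
    for k in range(kmax):
        out.append(out[k] * (n - k) / (k + 1) * p / (1.0 - p) if k < n else 0.0)
    return out

def poisson_pmf_table(lam, kmax):
    """PMF of Poisson(lam) for k = 0..kmax, via pi(k+1) = pi(k) * lam / (k+1)."""
    out = [math.exp(-lam)]
    for k in range(kmax):
        out.append(out[k] * lam / (k + 1))
    return out

lam, kmax = 5.0, 200
tv = {}
for n in (10, 50, 500):
    b = binom_pmf_table(n, lam / n, kmax)
    q = poisson_pmf_table(lam, kmax)
    tv[n] = 0.5 * sum(abs(bk - qk) for bk, qk in zip(b, q))
    print(n, round(tv[n], 4))
# TV distance shrinks on the order of lambda^2 / n (Le Cam's inequality).
```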

PMFs of Bin(10, 0.5), Bin(50, 0.1), Bin(500, 0.01) with lambda=5, approaching Poisson(5)

This proof is the prototype for the Central Limit Theorem proof. The technique is identical: compute the MGF, show pointwise convergence to the target MGF, and invoke Lévy’s continuity theorem. The CLT proof is more involved because the MGF of n(Xˉnμ)/σ\sqrt{n}(\bar{X}_n - \mu)/\sigma requires a careful Taylor expansion of logMX1(t/(σn))\log M_{X_1}(t/(\sigma\sqrt{n})), but the structure established here carries over directly.


9.10 Connections to Machine Learning

The four modes of convergence are not abstract curiosities — they appear throughout machine learning, often implicitly. Recognizing which mode is at play helps you understand what a convergence guarantee actually promises and what it leaves open.

Example 9 Three ML convergence scenarios

Scenario 1: SGD with Robbins-Monro step sizes (almost sure convergence).

Consider stochastic gradient descent on a loss function L(θ)L(\theta) with updates θt+1=θtηtL(θt;xt)\theta_{t+1} = \theta_t - \eta_t \nabla L(\theta_t; x_t), where xtx_t is a random minibatch and ηt\eta_t is the step size. The Robbins-Monro conditions require:

t=1ηt=andt=1ηt2<\sum_{t=1}^{\infty} \eta_t = \infty \quad \text{and} \quad \sum_{t=1}^{\infty} \eta_t^2 < \infty

The first condition ensures the step sizes are large enough to reach any minimum. The second ensures they decay fast enough that the noise averages out. Under these conditions (plus regularity assumptions on LL), the iterates θta.s.θ\theta_t \xrightarrow{\text{a.s.}} \theta^*, a local minimum.

Why almost sure convergence matters: you run SGD once. A single training run produces a single trajectory θ1,θ2,\theta_1, \theta_2, \ldots Almost sure convergence guarantees that this specific trajectory converges — not just that “most” trajectories converge.

With a constant step size ηt=η\eta_t = \eta (common in practice for speed), the Robbins-Monro conditions fail (ηt2=\sum \eta_t^2 = \infty). The iterates then converge only in probability to a neighborhood of θ\theta^*, with the neighborhood size proportional to η\eta. This is the speed-accuracy tradeoff of step-size choice made explicit through convergence modes: a larger η\eta reaches the neighborhood faster but settles farther from θ\theta^*.
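The contrast between the two step-size regimes shows up in even the simplest setting. The sketch below is my own toy setup (one-dimensional quadratic loss L(θ) = θ²/2 with standard Gaussian gradient noise), not a result from the text:

```python
import random

random.seed(4)

def run_sgd(step, iters=10_000, theta0=5.0):
    """SGD on L(theta) = theta^2 / 2 with N(0,1) gradient noise."""
    theta = theta0
    for t in range(1, iters + 1):
        grad = theta + random.gauss(0.0, 1.0)   # noisy gradient
        theta -= step(t) * grad
    return theta

runs = 50
err_rm = sum(abs(run_sgd(lambda t: 1.0 / t)) for _ in range(runs)) / runs
err_const = sum(abs(run_sgd(lambda t: 0.1)) for _ in range(runs)) / runs

print(err_rm)     # decaying steps: final iterate very close to the minimum at 0
print(err_const)  # constant step: stuck at a noise floor on the order of sqrt(eta)
```

With the Robbins-Monro schedule the final error keeps shrinking as the iteration budget grows; with the constant step it stalls at a level set by η, no matter how long you run.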

Scenario 2: PAC learning (convergence in probability).

In Probably Approximately Correct (PAC) learning, the fundamental guarantee has the form:

P ⁣(suphHRemp(h)R(h)>ε)δP\!\left(\sup_{h \in \mathcal{H}} |R_{\text{emp}}(h) - R(h)| > \varepsilon\right) \leq \delta

where R(h)R(h) is the true risk of hypothesis hh, Remp(h)R_{\text{emp}}(h) is the empirical risk on nn training samples, and δ\delta decreases with nn. This is convergence in probability: with high probability, the empirical risk is close to the true risk.

The “sup over H\mathcal{H}” makes this a uniform convergence statement — the empirical risk concentrates simultaneously for all hypotheses in the class. The rate at which δ0\delta \to 0 depends on the complexity of H\mathcal{H} (VC dimension, Rademacher complexity), and this rate is what distinguishes learnable from unlearnable hypothesis classes.

Scenario 3: Bernstein-von Mises theorem (convergence in distribution).

In Bayesian statistics, the posterior distribution for a parameter θ\theta given nn iid observations satisfies:

n(θθ^MLE)X1,,XndN ⁣(0,I(θ0)1)\sqrt{n}(\theta - \hat{\theta}_{\text{MLE}}) \mid X_1, \ldots, X_n \xrightarrow{d} N\!\left(0, I(\theta_0)^{-1}\right)

where I(θ0)I(\theta_0) is the Fisher information at the true parameter θ0\theta_0. This is convergence in distribution: the posterior shape approaches a Normal, regardless of the prior (under regularity conditions).

This is the frequentist justification for Bayesian inference: with enough data, the posterior is approximately Normal centered at the MLE, and the prior washes out. The convergence is in distribution — we’re comparing the posterior distribution to the Normal, not individual samples or paths.

Remark 4 Convergence rates and what comes next

Knowing that a sequence converges is useful, but knowing how fast it converges is essential for practice. The modes of convergence defined in this topic are qualitative — they say whether the limit is reached, not how quickly.

Convergence rates quantify the speed:

  • Hoeffding’s inequality gives exponential decay for convergence in probability: P(Xˉnμ>ε)2exp(2nε2/(ba)2)P(|\bar{X}_n - \mu| > \varepsilon) \leq 2\exp(-2n\varepsilon^2/(b-a)^2) for bounded random variables in [a,b][a, b]. This is much sharper than the 1/n1/n rate from Chebyshev.

  • The Berry–Esseen theorem gives the rate for convergence in distribution in the CLT: supxFZˉn(x)Φ(x)CE[X13]/(σ3n)\sup_x |F_{\bar{Z}_n}(x) - \Phi(x)| \leq C \cdot E[|X_1|^3] / (\sigma^3 \sqrt{n}), where C0.4748C \leq 0.4748. The CLT convergence rate is O(1/n)O(1/\sqrt{n}).

  • Large deviations theory characterizes the exponential rate at which P(XˉnA)P(\bar{X}_n \in A) decays for sets AA that are far from the mean. The rate function I(x)=supt(txlogMX(t))I(x) = \sup_t (tx - \log M_X(t)) is the Fenchel-Legendre transform of the log-MGF.

These rate results are the subject of Large Deviations & Tail Bounds. They bridge the gap between the existence of convergence (this topic) and the speed of convergence (which determines sample size requirements in practice) — Hoeffding, Bernstein, and the Cramér rate function quantify exactly how fast convergence occurs.
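The gap between the polynomial and exponential rates is stark even for modest n. A quick exact computation for variables bounded in [0, 1] with σ² = 1/4 (the Bernoulli(1/2) worst case; ε = 0.1 is an arbitrary accuracy choice):

```python
import math

eps = 0.1
bounds = {}
for n in (100, 1000, 10_000):
    chebyshev = 0.25 / (n * eps**2)            # Var(Xbar_n) / eps^2 with sigma^2 = 1/4
    hoeffding = 2.0 * math.exp(-2.0 * n * eps**2)
    bounds[n] = (chebyshev, hoeffding)
    print(n, chebyshev, hoeffding)
# Chebyshev decays like 1/n; Hoeffding decays exponentially in n.
```

At n = 100 the two bounds are comparable, but by n = 1000 Hoeffding is already smaller by seven orders of magnitude.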

Three ML scenarios mapped to convergence modes: SGD paths (a.s.), PAC bounds (in prob), posterior concentration (in dist)

9.11 Summary

This topic defined the four modes of convergence for sequences of random variables, established the implication hierarchy between them, and built the tools — continuous mapping, Slutsky, delta method — for working with convergent sequences in practice.

The Four Modes

| Mode | Definition | Intuition | Guarantees | Does NOT Guarantee | ML Application |
| --- | --- | --- | --- | --- | --- |
| Almost sure | P(\lim X_n = X) = 1 | Every path settles | Path-by-path convergence | Moment convergence | SGD convergence |
| L^p | E[\|X_n - X\|^p] \to 0 | Average p-th gap vanishes | Moment convergence | Path-by-path convergence | MSE-consistent estimation |
| In probability | P(\|X_n - X\| > \varepsilon) \to 0 | Deviations become rare | Subsequential a.s. convergence | Full a.s. convergence, moment convergence | PAC learning bounds |
| In distribution | F_{X_n}(x) \to F_X(x) | CDFs approach | Shape convergence | Moment convergence, path convergence | CLT, asymptotic tests |

The Convergence Diamond

\text{a.s.} \quad \Longrightarrow \quad \xrightarrow{P} \quad \Longrightarrow \quad \xrightarrow{d}

L^p \quad \Longrightarrow \quad \xrightarrow{P} \quad \Longrightarrow \quad \xrightarrow{d}

Almost sure and LpL^p are incomparable — neither implies the other in general.

The Tools

  • Continuous mapping (Theorem 9.9): continuous functions preserve all modes.
  • Slutsky (Theorem 9.10): convergence in distribution + convergence in probability to a constant = convergence in distribution.
  • Delta method (Theorems 9.11-9.12): transforms CLT-type results through differentiable functions.
  • MGF convergence (Remark 2): pointwise convergence of MGFs implies convergence in distribution.

Formal Element Index

| Element | Name | Section |
| --- | --- | --- |
| Definition 9.1 | Convergence in probability | 9.2 |
| Definition 9.2 | Consistency | 9.2 |
| Definition 9.3 | Almost sure convergence | 9.3 |
| Definition 9.4 | L^p convergence | 9.4 |
| Definition 9.5 | Mean square convergence | 9.4 |
| Definition 9.6 | Convergence in distribution | 9.5 |
| Theorem 9.1 | Sample mean convergence (WLLN preview) | 9.2 |
| Theorem 9.2 | Borel-Cantelli characterization | 9.3 |
| Theorem 9.3 | L^p implies moment convergence | 9.4 |
| Theorem 9.4 | Portmanteau lemma | 9.5 |
| Theorem 9.5 | A.s. implies in probability | 9.6 |
| Theorem 9.6 | L^p implies in probability | 9.6 |
| Theorem 9.7 | In probability implies in distribution | 9.6 |
| Theorem 9.8 | In probability implies a.s. along subsequence | 9.6 |
| Theorem 9.9 | Continuous mapping theorem | 9.8 |
| Theorem 9.10 | Slutsky's theorem | 9.8 |
| Theorem 9.11 | Delta method (first order) | 9.8 |
| Theorem 9.12 | Delta method (second order) | 9.8 |
| Theorem 9.13 | Poisson limit theorem | 9.9 |

What Comes Next

This topic established the modes of convergence — the different senses in which sequences of random variables can approach a limit. The next three topics apply these modes to the three pillars of asymptotic probability:

  • Law of Large Numbers: The WLLN proves XˉnPμ\bar{X}_n \xrightarrow{P} \mu under finite variance (Theorem 10.1) and under finite mean alone (Khintchine, Theorem 10.3). The SLLN proves Xˉna.s.μ\bar{X}_n \xrightarrow{\text{a.s.}} \mu under finite mean alone (Etemadi, Theorem 10.5). The Glivenko–Cantelli theorem proves uniform a.s. convergence of the empirical CDF.

  • Central Limit Theorem: The CLT proves n(Xˉnμ)/σdN(0,1)\sqrt{n}(\bar{X}_n - \mu)/\sigma \xrightarrow{d} N(0,1) (convergence in distribution). The proof via MGF convergence follows the same technique we used for the Poisson limit theorem in Section 9.9. The Lindeberg CLT extends the result to non-identical summands, and the Berry–Esseen theorem quantifies the convergence rate.

  • Concentration Inequalities: While the CLT gives the shape of the limiting distribution, concentration inequalities give rates — how fast the convergence happens. Hoeffding, Bernstein, and sub-Gaussian tail bounds provide the exponential-rate guarantees that power finite-sample statistical learning theory — PAC bounds, Rademacher complexity, and sample complexity calculations.

