intermediate 40 min read · April 14, 2026

Law of Large Numbers

From sample means to population truth — the weak law, the strong law, and why Monte Carlo works.

10.1 Why “In Probability” Isn’t Enough

Topic 9 ended with a promise. We proved that the sample mean Xˉn=1ni=1nXi\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i converges to the population mean μ\mu in probability — meaning that for any tolerance ε>0\varepsilon > 0, the probability of Xˉn\bar{X}_n deviating from μ\mu by more than ε\varepsilon goes to zero as nn grows. That was Theorem 9.1, and its proof was three lines: compute the variance of Xˉn\bar{X}_n, apply Chebyshev’s inequality, watch the bound vanish.

But convergence in probability is a statement about individual time slices. It says: “at time nn, the probability of a bad deviation is small.” It does not say that the sample path — the actual sequence of values Xˉ1(ω),Xˉ2(ω),Xˉ3(ω),\bar{X}_1(\omega), \bar{X}_2(\omega), \bar{X}_3(\omega), \ldots that you observe when you run your simulation or collect your data — eventually settles down and stays near μ\mu. A sequence can converge in probability without any single realization converging at all (recall the typewriter sequence from Topic 9, Section 9.7).

This distinction matters enormously in practice. When you run a Monte Carlo simulation to estimate E[g(X)]E[g(X)], you run one simulation. You don’t generate infinitely many parallel universes and check that the fraction of bad outcomes is shrinking — you generate a single sequence of samples and compute a single running average. You need to know that this particular running average converges to the truth. That’s almost sure convergence: P(limnXˉn=μ)=1P(\lim_{n \to \infty} \bar{X}_n = \mu) = 1.
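
A minimal sketch of this single-trajectory view, using NumPy (the choice of integrand g(x) = x² with X ~ N(0,1), the sample size, and the seed are illustrative assumptions, not anything prescribed above): one run produces one sequence of running averages, and the SLLN is the claim that this one sequence settles at E[g(X)] = 1 and stays there.

```python
import numpy as np

# One Monte Carlo run: a single trajectory of running averages of g(X_i).
# Illustrative (assumed) choices: g(x) = x**2 with X ~ N(0, 1), so E[g(X)] = 1.
rng = np.random.default_rng(0)
n = 100_000
g_samples = rng.standard_normal(n) ** 2

running_mean = np.cumsum(g_samples) / np.arange(1, n + 1)

# The SLLN is a statement about this one path: it settles near 1 and stays there.
for k in (100, 1_000, 10_000, 100_000):
    print(f"n = {k:>7d}   running average = {running_mean[k - 1]:.4f}")
```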

The same goes for stochastic gradient descent. You run one training trajectory. Polyak averaging — taking the running mean of the iterates — needs almost sure convergence of the running mean to guarantee that your single training run finds the optimum. And when you evaluate a model’s performance on a test set, the empirical risk R^n(θ)=1ni=1n(Yi,fθ(Xi))\hat{R}_n(\theta) = \frac{1}{n}\sum_{i=1}^n \ell(Y_i, f_\theta(X_i)) is a sample mean. The Strong Law says this single empirical risk converges to the true risk for each fixed θ\theta. The Glivenko—Cantelli theorem (Section 10.7) says something even stronger: the convergence is uniform over all thresholds — or, with the right generalization, over all θ\theta simultaneously.

This topic traces the full arc from the weak law to the strong law. We start where Topic 9 left off — the Chebyshev-based WLLN for finite-variance iid sequences — and progressively generalize in two directions:

  1. Relaxing assumptions (Sections 10.2—10.4): We drop the “identically distributed” requirement (uncorrelated WLLN), then drop the finite-variance requirement entirely (Khintchine’s WLLN under finite mean alone).

  2. Strengthening the conclusion (Sections 10.5—10.7): We upgrade from convergence in probability to almost sure convergence (SLLN via Etemadi’s proof), then from pointwise to uniform convergence (Glivenko—Cantelli).

The bridge between these two directions is Kolmogorov’s maximal inequality (Section 10.5) — a result that controls the running maximum of partial sums with the same bound as Chebyshev uses for the terminal value. It is the technical engine that powers the SLLN proof.

To make the distinction between the weak and strong laws concrete, consider the following analogy. Imagine you are tracking a sequence of fair coin flips, computing the running proportion of heads p^n=(number of heads in n flips)/n\hat{p}_n = (\text{number of heads in } n \text{ flips})/n.

  • WLLN: For any fixed tolerance ε>0\varepsilon > 0, P(p^n1/2>ε)0P(|\hat{p}_n - 1/2| > \varepsilon) \to 0 as nn \to \infty. If you could peek at the proportion at a single, very large time step, you’d very likely see something close to 1/21/2. But this says nothing about what happens if you watch the proportion evolve over time.

  • SLLN: With probability 1, the entire trajectory p^1,p^2,p^3,\hat{p}_1, \hat{p}_2, \hat{p}_3, \ldots converges to 1/21/2. The running proportion settles down and stays near 1/21/2 forever — not just at one particular time, but eventually at all times simultaneously.

The WLLN is like saying “most photos of the sequence look good.” The SLLN is like saying “the whole movie has a happy ending.” For a single run of an experiment, a simulation, or a training procedure, you are watching the movie, not snapshots — so the SLLN is the result you need.

This distinction is especially important in settings where you cannot re-run the experiment: a clinical trial with a fixed patient cohort, a one-time A/B test in production, a training run of a large language model that costs millions of dollars. You get one trajectory. The SLLN says that one trajectory is enough.
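
To see the two statements side by side numerically, here is a hedged sketch (NumPy; the tolerance ε, the horizon, the number of replications, and the seed are arbitrary illustrative choices). The first loop estimates the WLLN quantity P(|p̂ₙ − 1/2| > ε) at a few fixed times by averaging over many replications; the last lines follow one single path and report when it last left the ε-band within the simulated horizon, which is the SLLN's point of view.

```python
import numpy as np

rng = np.random.default_rng(1)
eps, n, reps = 0.02, 5_000, 1_000

flips = rng.integers(0, 2, size=(reps, n))               # independent coin-flip sequences
p_hat = np.cumsum(flips, axis=1) / np.arange(1, n + 1)   # running proportion of heads, per sequence

# WLLN view: at a fixed time k, what fraction of replications deviate by more than eps?
for k in (100, 1_000, 5_000):
    frac = np.mean(np.abs(p_hat[:, k - 1] - 0.5) > eps)
    print(f"P(|p_hat_{k} - 1/2| > {eps}) ≈ {frac:.3f}")

# SLLN view: follow one single path and find its last exit from the eps-band
# (within this finite horizon; the SLLN says a last exit exists with probability 1).
path = p_hat[0]
exits = np.flatnonzero(np.abs(path - 0.5) > eps)
print("single path: last exit from the band at n =", int(exits[-1]) + 1 if exits.size else "never")
```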

LLN overview: sample paths converging, probability decay, Cauchy vs Exponential

Here is the roadmap:

| Section | Result | What it says |
|---|---|---|
| 10.2 | WLLN (Chebyshev) | iid, \sigma^2 < \infty implies \bar{X}_n \xrightarrow{P} \mu |
| 10.3 | WLLN (uncorrelated) | Uncorrelated, \sum \sigma_i^2/n^2 \to 0 implies \bar{X}_n - \bar{\mu}_n \xrightarrow{P} 0 |
| 10.4 | WLLN (Khintchine) | iid, E[\lvert X \rvert] < \infty implies \bar{X}_n \xrightarrow{P} \mu |
| 10.5 | Kolmogorov's maximal inequality | P(\max_{k \leq n} \lvert S_k \rvert \geq \varepsilon) \leq \text{Var}(S_n)/\varepsilon^2 |
| 10.6 | SLLN (Etemadi) | Pairwise independent, identically distributed, E[\lvert X \rvert] < \infty implies \bar{X}_n \xrightarrow{\text{a.s.}} \mu |
| 10.7 | Glivenko–Cantelli | \sup_x \lvert F_n(x) - F(x) \rvert \xrightarrow{\text{a.s.}} 0 |
| 10.8 | Convergence rates | Chebyshev (1/(n\varepsilon^2)), Hoeffding (e^{-2n\varepsilon^2}), LIL (\sqrt{2\ln\ln n / n}) |
| 10.9 | ML connections | Monte Carlo, SGD/Polyak, ERM, and Glivenko–Cantelli in practice |

The first seven results are fully proved in this topic. Theorems 10.7 and 10.8 are stated without proof — the former is proved in Topic 12 (Large Deviations & Tail Bounds) as Theorem 12.3, the latter is a deep result requiring tools beyond our current scope.


10.2 WLLN for iid with Finite Variance

We begin by restating the result from Topic 9 that is our starting point. This is Theorem 9.1 transplanted into its permanent home.

Theorem 10.1 (WLLN — Chebyshev Version; Recap from Topic 9)

Let X1,X2,X_1, X_2, \ldots be iid with E[Xi]=μE[X_i] = \mu and Var(Xi)=σ2<\text{Var}(X_i) = \sigma^2 < \infty. Then

XˉnPμ\bar{X}_n \xrightarrow{P} \mu

That is, for every ε>0\varepsilon > 0, P(Xˉnμ>ε)0P(|\bar{X}_n - \mu| > \varepsilon) \to 0 as nn \to \infty.

The proof is recorded in Topic 9, Section 9.2: compute Var(Xˉn)=σ2/n\text{Var}(\bar{X}_n) = \sigma^2/n, apply Chebyshev’s inequality to get P(Xˉnμ>ε)σ2/(nε2)P(|\bar{X}_n - \mu| > \varepsilon) \leq \sigma^2/(n\varepsilon^2), and observe that the right-hand side vanishes. We do not reproduce it here — the reader should revisit the three-line argument in Topic 9 if needed.

What matters now is where this result falls short and what it leaves on the table. It requires two assumptions:

  1. Identically distributed. The XiX_i must all follow the same distribution — same mean, same variance, same everything. This rules out many practical settings where the noise level changes over time (heteroscedastic data) or the observations come from different subpopulations.

  2. Finite variance. The variance σ2\sigma^2 must be finite. This excludes heavy-tailed distributions like the Student’s tt with ν2\nu \leq 2 degrees of freedom, the Pareto with shape α2\alpha \leq 2, and certain financial return models where extreme events dominate.

Both assumptions are stronger than necessary. The next two sections relax them, one at a time: Section 10.3 drops “identically distributed” (replacing it with the weaker condition “uncorrelated”), and Section 10.4 drops “finite variance” entirely (requiring only a finite mean). Sections 10.5 and 10.6 then attack the second front: upgrading the conclusion from convergence in probability to almost sure convergence.

Remark 1 Connection to Consistency of Estimators

The sample mean Xˉn\bar{X}_n is the simplest estimator of the population mean μ\mu, and the WLLN says it is consistent: XˉnPμ\bar{X}_n \xrightarrow{P} \mu. In statistics, consistency is the bare minimum we demand of any estimator — with enough data, the estimate should converge to the truth in probability. An inconsistent estimator is fundamentally unreliable: no amount of data will make it accurate.

Every consistent estimator is, at its core, a law-of-large-numbers statement. The sample variance Sn2=1n1(XiXˉn)2S_n^2 = \frac{1}{n-1}\sum (X_i - \bar{X}_n)^2 converges to the population variance σ2\sigma^2 — this is the LLN applied to the squared deviations. The empirical CDF converges to the true CDF (Section 10.7). The maximum likelihood estimator converges to the true parameter under regularity conditions — a fact whose proof uses the LLN applied to the log-likelihood.

Conversely, an estimator’s inconsistency is typically diagnosed by showing the LLN fails to apply — either because the estimator is not a sample mean (or a sufficiently well-behaved function of sample means), or because the underlying variables lack the required moment conditions.

Point Estimation & Bias-Variance formalizes this connection: an estimator θ^n\hat{\theta}_n is consistent if and only if it satisfies a suitable LLN — and the strength of the LLN (weak or strong) determines the strength of the consistency (convergence in probability or almost sure).

Before proceeding, it is worth noting the quantitative content of Theorem 10.1. The Chebyshev bound P(Xˉnμ>ε)σ2/(nε2)P(|\bar{X}_n - \mu| > \varepsilon) \leq \sigma^2/(n\varepsilon^2) is not just an existence statement — it gives an explicit bound on how close Xˉn\bar{X}_n is to μ\mu for each nn. For example, if XiN(5,4)X_i \sim N(5, 4) (so μ=5\mu = 5, σ2=4\sigma^2 = 4) and we want P(Xˉn5>0.5)0.01P(|\bar{X}_n - 5| > 0.5) \leq 0.01, we need 4/(n0.25)0.014/(n \cdot 0.25) \leq 0.01, giving n1600n \geq 1600. This is a conservative bound — we will see in Section 10.8 that Hoeffding’s inequality gives much tighter sample-size requirements for bounded variables — but it has the virtue of requiring only the mean and variance, not the full distributional form.
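
As a quick check of this arithmetic (a sketch with NumPy; the number of replications and the seed are illustrative assumptions), the snippet below recomputes the required n from the Chebyshev bound and then estimates the actual deviation probability at that n for Normal data. The empirical probability comes out far below the 1% target, which is exactly the conservativeness discussed next.

```python
import numpy as np

mu, sigma2, eps, alpha = 5.0, 4.0, 0.5, 0.01

# Chebyshev: sigma2 / (n * eps**2) <= alpha   =>   n >= sigma2 / (alpha * eps**2)
n_required = int(np.ceil(sigma2 / (alpha * eps**2)))
print("required n from the Chebyshev bound:", n_required)   # 1600, matching the text

# Empirical deviation probability at that n for N(5, 4) data (far below 1%: the bound is loose)
rng = np.random.default_rng(2)
xbar = rng.normal(mu, np.sqrt(sigma2), size=(2_000, n_required)).mean(axis=1)
print("empirical P(|Xbar_n - mu| > eps):", np.mean(np.abs(xbar - mu) > eps))
```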

It is also worth noting that the bound σ2/(nε2)\sigma^2/(n\varepsilon^2) is distribution-free in the following sense: it holds for any distribution with mean μ\mu and variance σ2\sigma^2, no matter what the higher moments look like. Whether the XiX_i are Normal, Uniform, Exponential, or any other distribution with finite variance, the same bound applies. This universality is both a strength (the result is extremely general — it works for any distribution with finite variance, from the Uniform to the Exponential to the chi-squared) and a weakness (the bound cannot exploit distributional structure to give tighter guarantees). Section 10.8 presents Hoeffding’s inequality, which gives exponentially better bounds by exploiting the additional assumption that the variables are bounded.

WLLN via Chebyshev: probability bound σ²/(nε²) decaying with n for various ε

10.3 WLLN for Uncorrelated Sequences

The Chebyshev-based proof of Theorem 10.1 used two properties of the XiX_i: they are identically distributed (so all means and variances are equal) and independent (so the variance of the sum is the sum of the variances). Let us examine exactly where each property entered.

The “identically distributed” assumption was used to conclude E[Xˉn]=μE[\bar{X}_n] = \mu and Var(Xˉn)=σ2/n\text{Var}(\bar{X}_n) = \sigma^2/n. Without it, the mean of Xˉn\bar{X}_n is 1nμi\frac{1}{n}\sum \mu_i (which may change with nn) and the variance is 1n2σi2\frac{1}{n^2}\sum \sigma_i^2 (which depends on the individual variances).

The “independent” assumption was used only to show that Var(Sn)=i=1nVar(Xi)\text{Var}(S_n) = \sum_{i=1}^n \text{Var}(X_i) — that is, to kill the cross-terms in Var(Xi)=Var(Xi)+2i<jCov(Xi,Xj)\text{Var}(\sum X_i) = \sum \text{Var}(X_i) + 2\sum_{i < j} \text{Cov}(X_i, X_j). For this, independence is overkill — we only need the cross-terms Cov(Xi,Xj)\text{Cov}(X_i, X_j) to vanish. That is, we only need the XiX_i to be uncorrelated.

This is a meaningful generalization. In many applied settings, variables are uncorrelated but not independent:

  • In a designed experiment with orthogonal contrasts, the contrast variables are uncorrelated by construction but may not be independent (because they are all functions of the same underlying data).
  • In time series analysis, the error terms of a correctly specified linear model are uncorrelated (this is the Gauss—Markov assumption) but not necessarily independent — the joint distribution may have higher-order dependencies.
  • In random matrix theory, the entries of a Wigner matrix are uncorrelated (by the symmetric distribution assumption) but the eigenvalues are not independent.

Furthermore, if the variables are not identically distributed, the sample mean Xˉn\bar{X}_n may not converge to a single value μ\mu — instead, it tracks the running average of the individual means μˉn=1ni=1nμi\bar{\mu}_n = \frac{1}{n}\sum_{i=1}^n \mu_i. The theorem says the deviation Xˉnμˉn\bar{X}_n - \bar{\mu}_n converges to zero — the sample mean stays close to the running population mean.

Theorem 10.2 (WLLN for Uncorrelated Sequences)

Let X1,X2,X_1, X_2, \ldots be uncorrelated random variables (not necessarily identically distributed) with E[Xi]=μiE[X_i] = \mu_i and Var(Xi)=σi2<\text{Var}(X_i) = \sigma_i^2 < \infty. Define the average mean μˉn=1ni=1nμi\bar{\mu}_n = \frac{1}{n}\sum_{i=1}^n \mu_i. If

1n2i=1nσi20as n\frac{1}{n^2}\sum_{i=1}^n \sigma_i^2 \to 0 \quad \text{as } n \to \infty

then XˉnμˉnP0\bar{X}_n - \bar{\mu}_n \xrightarrow{P} 0.

Proof.

We apply Chebyshev’s inequality to Xˉnμˉn\bar{X}_n - \bar{\mu}_n. First, note that E[Xˉnμˉn]=0E[\bar{X}_n - \bar{\mu}_n] = 0.

For the variance, since the XiX_i are uncorrelated:

Var(Xˉn)=Var ⁣(1ni=1nXi)=1n2i=1nVar(Xi)=1n2i=1nσi2\text{Var}(\bar{X}_n) = \text{Var}\!\left(\frac{1}{n}\sum_{i=1}^n X_i\right) = \frac{1}{n^2}\sum_{i=1}^n \text{Var}(X_i) = \frac{1}{n^2}\sum_{i=1}^n \sigma_i^2

The cross-terms vanish because Cov(Xi,Xj)=0\text{Cov}(X_i, X_j) = 0 for iji \neq j — this is exactly where the uncorrelated assumption enters. Independence would also give this, but uncorrelatedness is strictly weaker: two random variables can be uncorrelated without being independent (e.g., XX and X2X^2 when XX is symmetric about zero).

By Chebyshev’s inequality (Expectation & Moments, Theorem 12):

P(Xˉnμˉn>ε)Var(Xˉn)ε2=1ε21n2i=1nσi2P(|\bar{X}_n - \bar{\mu}_n| > \varepsilon) \leq \frac{\text{Var}(\bar{X}_n)}{\varepsilon^2} = \frac{1}{\varepsilon^2} \cdot \frac{1}{n^2}\sum_{i=1}^n \sigma_i^2

By hypothesis, 1n2i=1nσi20\frac{1}{n^2}\sum_{i=1}^n \sigma_i^2 \to 0, so the right-hand side goes to zero for every ε>0\varepsilon > 0. Therefore XˉnμˉnP0\bar{X}_n - \bar{\mu}_n \xrightarrow{P} 0. \square

The structure of the proof is identical to Theorem 10.1 — compute the variance, apply Chebyshev, watch it vanish — but the generalization is significant. We no longer require the XiX_i to have the same distribution, or even to be independent. The only structural requirement is that they are uncorrelated: Cov(Xi,Xj)=0\text{Cov}(X_i, X_j) = 0 for iji \neq j. This is strictly weaker than independence (independent variables are uncorrelated, but not conversely — recall the examples from Topic 8 — Multivariate Distributions).

The condition 1n2i=1nσi20\frac{1}{n^2}\sum_{i=1}^n \sigma_i^2 \to 0 is a Cesaro-type condition on the variances: it says the average variance doesn’t grow too fast. The most common sufficient condition is that the variances are uniformly bounded.

Corollary 10.1 (Uniformly Bounded Variances)

If the variances are uniformly bounded — σi2C\sigma_i^2 \leq C for all ii and some constant CC — then

1n2i=1nσi2nCn2=Cn0\frac{1}{n^2}\sum_{i=1}^n \sigma_i^2 \leq \frac{nC}{n^2} = \frac{C}{n} \to 0

and the WLLN (Theorem 10.2) applies. In particular, if the XiX_i are additionally identically distributed with common mean μ\mu, then μˉn=μ\bar{\mu}_n = \mu for all nn and we recover XˉnPμ\bar{X}_n \xrightarrow{P} \mu.

Example 1 Online Learning: Uncorrelated Observations with Time-Varying Variance

An online learning platform collects user engagement scores XiX_i that measure how long (in minutes) a user spends on a task. The scores have a common population mean E[Xi]=μ=12E[X_i] = \mu = 12 minutes (the true average engagement), but the variance changes over time as the platform adjusts its recommendation algorithm: Var(Xi)=σi2=1+0.5sin(i/20)\text{Var}(X_i) = \sigma_i^2 = 1 + 0.5\sin(i/20).

The variances oscillate between 0.5 and 1.5 — some cohorts of users are more homogeneous than others — so they are uniformly bounded by C=1.5C = 1.5. By Corollary 10.1, the running mean of the engagement scores converges in probability to μ=12\mu = 12 minutes, despite the non-identical distributions. The platform can trust its running average even though the noise level fluctuates.

This is a common situation in practice: data collected over time rarely has constant variance (the phenomenon is called heteroscedasticity — literally, “different scatter”). Theorem 10.2 says the WLLN still holds as long as the variances don’t grow too fast. The Cesaro condition 1n2σi20\frac{1}{n^2}\sum \sigma_i^2 \to 0 allows the variances to oscillate, to have isolated spikes, or even to grow slowly — as long as their running sum grows slower than n2n^2.

The uncorrelated condition is also natural in many applications. If users are sampled independently, their scores are certainly uncorrelated (since independence implies uncorrelatedness), even if the measurement process changes over time.

Now consider the opposite scenario: the variance grows with time — say σi2=i\sigma_i^2 = i, perhaps because measurement noise increases as the platform scales or because the user population becomes more diverse. Then:

1n2i=1nσi2=1n2n(n+1)2=n+12n12\frac{1}{n^2}\sum_{i=1}^n \sigma_i^2 = \frac{1}{n^2} \cdot \frac{n(n+1)}{2} = \frac{n+1}{2n} \to \frac{1}{2}

This does not go to zero, so the Chebyshev-based WLLN does not apply. The running mean may not converge — the growing variance can overwhelm the averaging effect. Note that the WLLN might still hold via a different proof technique (Theorem 10.2 gives a sufficient condition, not a necessary one), but the Chebyshev approach gives no guarantee.

This distinction matters when designing experiments or data collection systems: if you anticipate that measurement noise grows over time, you cannot blithely trust the running average — you may need to downweight older, noisier observations, or cap the noise via trimming.
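
A short simulation of both regimes in this example (NumPy; the Gaussian noise, the horizon, and the seed are illustrative assumptions of this sketch, since the theorem itself assumes nothing beyond the variances): with the oscillating, bounded variances the running mean settles near μ = 12 and the Cesàro quantity (1/n²)Σσᵢ² is tiny, while with σᵢ² = i that quantity stalls near 1/2 and the running mean keeps fluctuating.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000
i = np.arange(1, n + 1)
mu = 12.0

variances = {
    "bounded variance": 1 + 0.5 * np.sin(i / 20),   # oscillates in [0.5, 1.5]
    "growing variance": i.astype(float),            # sigma_i^2 = i
}

for label, var in variances.items():
    x = mu + np.sqrt(var) * rng.standard_normal(n)  # independent, hence uncorrelated
    running_mean = np.cumsum(x) / i
    cheb = np.sum(var) / n**2                       # the Cesaro-type quantity (1/n^2) * sum sigma_i^2
    print(f"{label}: Xbar_n = {running_mean[-1]:.3f} at n = {n},  (1/n^2)*sum sigma_i^2 = {cheb:.5f}")
```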

WLLN for uncorrelated sequences: bounded vs growing variance, sample mean trajectories
Interactive: Weak Law of Large Numbers Explorer, showing sample paths of X̄ₙ around the band μ ± ε alongside P(|X̄ₙ − μ| > ε) as a function of n (empirical probability against the Chebyshev bound).

10.4 Khintchine’s WLLN

Theorems 10.1 and 10.2 both require finite variance. Is this necessary? The answer, remarkably, is no. For iid sequences, finite mean alone suffices.

Why should we care about distributions with finite mean but infinite variance? Because they arise naturally in practice. The distribution of insurance claim sizes, financial returns, city populations, and many other real-world quantities have heavy tails — their probability of extreme values decays slowly enough that the variance is infinite, even though the mean exists. The Pareto distribution with shape parameter α(1,2]\alpha \in (1, 2] is the canonical example: E[X]<E[X] < \infty but Var(X)=\text{Var}(X) = \infty. If the WLLN required finite variance, we could not guarantee that sample means converge for these distributions.

Khintchine’s theorem — arguably the “right” version of the WLLN — identifies the minimal condition: E[X1]<E[|X_1|] < \infty.

The proof cannot use Chebyshev’s inequality (which requires a finite second moment), so we need a different technique: truncation. The idea is to clip the random variables to make their variance finite, apply the finite-variance WLLN to the clipped version, and then show that the clipping error vanishes. This is our first encounter with the truncation technique, which will reappear in the SLLN proof (Section 10.6).

The strategy has three phases, which we will see again and again in probability:

  1. Approximate: Replace the original variables XiX_i with well-behaved versions YiY_i.
  2. Solve the easy problem: Prove the theorem for YiY_i using standard tools (Chebyshev).
  3. Control the error: Show that the difference between XiX_i and YiY_i is negligible.
Theorem 10.3 (Khintchine's WLLN)

Let X1,X2,X_1, X_2, \ldots be iid with E[X1]<E[|X_1|] < \infty (finite mean), but possibly Var(X1)=\text{Var}(X_1) = \infty. Then

XˉnPμ=E[X1]\bar{X}_n \xrightarrow{P} \mu = E[X_1]
Proof.

The key idea: even when the variance is infinite, we can truncate the random variables to make the variance finite, apply the finite-variance WLLN to the truncated version, and then show the truncation error vanishes. The truncation threshold grows with the sample size — a clever coupling that makes everything work.

Step 1: Truncate. Define Yi=Xi1(Xin)Y_i = X_i \cdot \mathbf{1}(|X_i| \leq n) — the version of XiX_i where values exceeding nn in absolute value are zeroed out. Note that the truncation threshold is nn (the sample size), not a fixed constant. This is essential: as nn grows, the truncation becomes less aggressive, eventually including every finite value.

Step 2: Show the truncation error is negligible. We need P(XˉnYˉn)0P(\bar{X}_n \neq \bar{Y}_n) \to 0. The events {XiYi}={Xi>n}\{X_i \neq Y_i\} = \{|X_i| > n\} are the events where truncation changes the ii-th variable. By the union bound:

P(XˉnYˉn)P ⁣(i=1n{XiYi})i=1nP(Xi>n)=nP(X1>n)P(\bar{X}_n \neq \bar{Y}_n) \leq P\!\left(\bigcup_{i=1}^n \{X_i \neq Y_i\}\right) \leq \sum_{i=1}^n P(|X_i| > n) = n \cdot P(|X_1| > n)

where the last equality uses the identical distribution of the XiX_i. We claim nP(X1>n)0n \cdot P(|X_1| > n) \to 0.

To see this, note that nP(X1>n)=nE[1(X1>n)]n \cdot P(|X_1| > n) = n \cdot E[\mathbf{1}(|X_1| > n)]. We need a sharper bound than Markov’s inequality (which gives nP(X1>n)E[X1]n \cdot P(|X_1| > n) \leq E[|X_1|], a bound that does not vanish). Instead, observe that n1(X1>n)X11(X1>n)n \cdot \mathbf{1}(|X_1| > n) \leq |X_1| \cdot \mathbf{1}(|X_1| > n) (since n<X1n < |X_1| on the event {X1>n}\{|X_1| > n\}). Therefore:

nP(X1>n)E[X11(X1>n)]n \cdot P(|X_1| > n) \leq E[|X_1| \cdot \mathbf{1}(|X_1| > n)]

Since X11(X1>n)0|X_1| \cdot \mathbf{1}(|X_1| > n) \to 0 pointwise as nn \to \infty (for every outcome where X1(ω)<|X_1(\omega)| < \infty, which is almost every outcome since E[X1]<E[|X_1|] < \infty), and X11(X1>n)X1|X_1| \cdot \mathbf{1}(|X_1| > n) \leq |X_1| with E[X1]<E[|X_1|] < \infty, the dominated convergence theorem ( formalCalculus: Integration ) gives:

E[X11(X1>n)]0E[|X_1| \cdot \mathbf{1}(|X_1| > n)] \to 0

Therefore P(XˉnYˉn)0P(\bar{X}_n \neq \bar{Y}_n) \to 0, which means XˉnYˉnP0\bar{X}_n - \bar{Y}_n \xrightarrow{P} 0.

Step 3: Center the truncated variables. Let μn=E[Y1]=E[X11(X1n)]\mu_n = E[Y_1] = E[X_1 \cdot \mathbf{1}(|X_1| \leq n)]. By DCT (since X11(X1n)X1X_1 \cdot \mathbf{1}(|X_1| \leq n) \to X_1 pointwise and X11(X1n)X1|X_1 \cdot \mathbf{1}(|X_1| \leq n)| \leq |X_1|):

μnμ=E[X1]as n\mu_n \to \mu = E[X_1] \quad \text{as } n \to \infty

Step 4: Apply the finite-variance WLLN to the truncated variables. The YiY_i are iid (as functions of iid XiX_i) with Yin|Y_i| \leq n, so Var(Yi)E[Yi2]n2\text{Var}(Y_i) \leq E[Y_i^2] \leq n^2. But we need Var(Yˉn)=Var(Y1)/n0\text{Var}(\bar{Y}_n) = \text{Var}(Y_1)/n \to 0, which requires Var(Y1)/n0\text{Var}(Y_1)/n \to 0. The crude bound Var(Y1)n2\text{Var}(Y_1) \leq n^2 gives Var(Y1)/nn\text{Var}(Y_1)/n \leq n, which does not vanish. We need a sharper estimate.

Since Y1n|Y_1| \leq n implies Y12nY1Y_1^2 \leq n \cdot |Y_1|, we have:

E[Y12]n=E[X121(X1n)]nE[nX11(X1n)]n=E[X11(X1n)]\frac{E[Y_1^2]}{n} = \frac{E[X_1^2 \cdot \mathbf{1}(|X_1| \leq n)]}{n} \leq \frac{E[n \cdot |X_1| \cdot \mathbf{1}(|X_1| \leq n)]}{n} = E[|X_1| \cdot \mathbf{1}(|X_1| \leq n)]

The right side converges to E[X1]<E[|X_1|] < \infty by DCT. But we need it to show Var(Y1)/n0\text{Var}(Y_1)/n \to 0, not just that it’s bounded. Let us be more careful. We have:

E[X121(X1n)]n=E ⁣[X12n1(X1n)]\frac{E[X_1^2 \cdot \mathbf{1}(|X_1| \leq n)]}{n} = E\!\left[\frac{X_1^2}{n} \cdot \mathbf{1}(|X_1| \leq n)\right]

For each fixed ω\omega, X1(ω)2n1(X1(ω)n)0\frac{X_1(\omega)^2}{n} \cdot \mathbf{1}(|X_1(\omega)| \leq n) \to 0 as nn \to \infty (the numerator X1(ω)2X_1(\omega)^2 is fixed while the denominator nn grows). The integrand is bounded by X11(X1n)X1|X_1| \cdot \mathbf{1}(|X_1| \leq n) \leq |X_1| (using X1n|X_1| \leq n on the indicator). Since E[X1]<E[|X_1|] < \infty, DCT gives:

E[X121(X1n)]n0\frac{E[X_1^2 \cdot \mathbf{1}(|X_1| \leq n)]}{n} \to 0

Therefore Var(Yˉn)=Var(Y1)/nE[Y12]/n0\text{Var}(\bar{Y}_n) = \text{Var}(Y_1)/n \leq E[Y_1^2]/n \to 0, and by Chebyshev’s inequality:

YˉnμnP0\bar{Y}_n - \mu_n \xrightarrow{P} 0

Step 5: Combine. We have three convergence statements:

  • XˉnYˉnP0\bar{X}_n - \bar{Y}_n \xrightarrow{P} 0 (Step 2)
  • YˉnμnP0\bar{Y}_n - \mu_n \xrightarrow{P} 0 (Step 4)
  • μnμ\mu_n \to \mu (Step 3, deterministic convergence)

Therefore:

Xˉnμ=(XˉnYˉn)+(Yˉnμn)+(μnμ)P0+0+0=0\bar{X}_n - \mu = (\bar{X}_n - \bar{Y}_n) + (\bar{Y}_n - \mu_n) + (\mu_n - \mu) \xrightarrow{P} 0 + 0 + 0 = 0

which gives XˉnPμ\bar{X}_n \xrightarrow{P} \mu. \square

The truncation technique is a workhorse of probability theory. The pattern — replace a badly-behaved random variable with a well-behaved approximation, handle the approximation with standard tools, then bound the approximation error — appears throughout convergence theory. We will see it again in the SLLN proof (Section 10.6), where the truncation threshold is ii rather than nn. The CLT proof uses the same idea with characteristic functions. In empirical process theory, truncation is the first step in proving maximal inequalities for unbounded function classes.

The key insight is that the truncation threshold grows with nn. A fixed threshold would introduce a permanent bias (the truncated mean E[X1(Xc)]E[X \cdot \mathbf{1}(|X| \leq c)] is generally not equal to μ\mu). By letting the threshold grow, the bias vanishes — the truncated mean converges to the true mean — while the variance is still controlled because the growth rate of the threshold is calibrated to the sample size.

Let us also highlight where the dominated convergence theorem entered, because its role is subtle. The DCT was used in three places:

  1. Step 2: To show E[X11(X1>n)]0E[|X_1| \cdot \mathbf{1}(|X_1| > n)] \to 0 — the tail expectation vanishes. The dominating function is X1|X_1| itself.
  2. Step 3: To show E[X11(X1n)]E[X1]E[X_1 \cdot \mathbf{1}(|X_1| \leq n)] \to E[X_1] — the truncated mean converges to the true mean.
  3. Step 4: To show E[X121(X1n)]/n0E[X_1^2 \cdot \mathbf{1}(|X_1| \leq n)]/n \to 0 — the key variance bound. Here the dominating function is X1|X_1|, and the pointwise convergence X121(X1n)/n0X_1^2 \cdot \mathbf{1}(|X_1| \leq n)/n \to 0 uses the fact that the numerator is a fixed number X1(ω)2X_1(\omega)^2 while the denominator grows.

Each application follows the same pattern: the integrand converges pointwise and is dominated by an integrable function, so the integral converges. The DCT is the workhorse of truncation arguments throughout probability theory. See formalCalculus: Integration for the full statement and proof.

Example 2 Student's t(1.5): Finite Mean, Infinite Variance

The Student’s tt distribution with ν=1.5\nu = 1.5 degrees of freedom has E[X]<E[|X|] < \infty (the mean exists when ν>1\nu > 1) but Var(X)=\text{Var}(X) = \infty (the variance is infinite when ν2\nu \leq 2). The Chebyshev-based WLLN (Theorem 10.1) does not apply — we cannot even compute the bound σ2/(nε2)\sigma^2/(n\varepsilon^2) because σ2=\sigma^2 = \infty.

Khintchine’s theorem (Theorem 10.3) guarantees XˉnP0\bar{X}_n \xrightarrow{P} 0 regardless. But the convergence is painfully slow — much slower than for distributions with finite variance.

Why? The heavy tails of the t(1.5)t(1.5) produce occasional extreme values — observations 10, 100, or even 1000 times larger than typical — that jerk the running mean away from zero. These outliers become rarer as nn grows (the truncation argument ensures this), but their impact on the average is large enough to cause visible fluctuations even at sample sizes where a Normal or Exponential running mean would have long since settled down. The Chebyshev rate σ2/(nε2)\sigma^2/(n\varepsilon^2) is not available (because σ2=\sigma^2 = \infty), and the effective convergence rate is slower than 1/n1/\sqrt{n}.

How slow is “painfully slow”? For N(0,1)N(0,1) (which has σ2=1\sigma^2 = 1), the Chebyshev bound gives P(Xˉn>0.1)100/nP(|\bar{X}_n| > 0.1) \leq 100/n. At n=10, ⁣000n = 10,\!000, this is at most 1%. For t(1.5)t(1.5), we cannot even write down a Chebyshev bound because the variance is infinite. Empirically, the sample mean of n=10, ⁣000n = 10,\!000 observations from t(1.5)t(1.5) still shows fluctuations of order 0.1 or more with noticeable probability — a tenfold degradation compared to the Normal.

The interactive WLLN Explorer (above) lets you compare the convergence speed of the sample mean for different distributions. Try setting the distribution to t(1.5)t(1.5) and comparing with N(0,1)N(0,1) — both have mean zero, but the convergence to zero is visibly rougher for the t(1.5)t(1.5).
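
For readers working offline, here is a minimal NumPy version of that comparison (the sample size and seed are arbitrary choices): both running means target 0, but the t(1.5) path is visibly jerked around by occasional huge observations.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
samples = {
    "t(1.5)": rng.standard_t(df=1.5, size=n),   # finite mean 0, infinite variance
    "N(0,1)": rng.standard_normal(n),
}

for name, x in samples.items():
    running_mean = np.cumsum(x) / np.arange(1, n + 1)
    checkpoints = [round(abs(running_mean[m - 1]), 4) for m in (1_000, 10_000, 100_000)]
    print(f"{name}: |running mean| at n = 1e3, 1e4, 1e5 -> {checkpoints}")
```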

Remark 2 Characteristic Function Proof (Alternative)

Khintchine’s original proof used characteristic functions rather than truncation. The characteristic function φX(t)=E[eitX]\varphi_X(t) = E[e^{itX}] always exists (even when moments are infinite), and for iid XiX_i with mean μ\mu:

φXˉn(t)=[φX(t/n)]n\varphi_{\bar{X}_n}(t) = \left[\varphi_X(t/n)\right]^n

Since φX(t)=1+iμt+o(t)\varphi_X(t) = 1 + i\mu t + o(t) near t=0t = 0 (this expansion requires only E[X]<E[|X|] < \infty), we get [φX(t/n)]neiμt[\varphi_X(t/n)]^n \to e^{i\mu t}, which is the characteristic function of the constant μ\mu. By Levy’s continuity theorem, Xˉndμ\bar{X}_n \xrightarrow{d} \mu, and convergence in distribution to a constant is equivalent to convergence in probability.

This approach is more elegant but requires Fourier analysis (specifically, the expansion φX(t)=1+iμt+o(t)\varphi_X(t) = 1 + i\mu t + o(t) for t|t| small, which holds whenever E[X]<E[|X|] < \infty). The truncation proof above uses only tools we have already developed: Chebyshev’s inequality and the dominated convergence theorem. The characteristic function proof has the advantage of immediately giving convergence in distribution, which implies convergence in probability to a constant — but the truncation proof is more transparent about where the finite-mean assumption is used.
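
The characteristic-function route can also be checked numerically for a distribution whose characteristic function has a closed form. A sketch using the Exponential(1), where φ(t) = 1/(1 − it) and μ = 1 (our choice for illustration, with an arbitrary evaluation point t): the powers [φ(t/n)]ⁿ march toward e^{iμt}.

```python
import numpy as np

# Exponential(1): characteristic function phi(t) = 1 / (1 - i t), mean mu = 1.
def phi(t):
    return 1.0 / (1.0 - 1j * t)

t = 2.0
target = np.exp(1j * t)   # characteristic function of the constant mu = 1, evaluated at t
for n in (10, 100, 1_000, 10_000):
    approx = phi(t / n) ** n
    print(f"n = {n:>6d}:  [phi(t/n)]^n = {approx:.6f}   |error| = {abs(approx - target):.2e}")
```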

Khintchine's WLLN: sample mean convergence for t(1.5) (finite mean, infinite variance) vs N(0,1)

10.5 Kolmogorov’s Maximal Inequality

The three WLLN results (Theorems 10.1—10.3) all establish convergence in probability: XˉnPμ\bar{X}_n \xrightarrow{P} \mu. To upgrade to almost sure convergence, we need to control the entire path of partial sums, not just the terminal value SnS_n. This is a fundamentally harder problem. Convergence in probability is a statement about each fixed time nn; almost sure convergence is a statement about the entire infinite trajectory simultaneously.

The key question is: how large can the partial sums S1,S2,,SnS_1, S_2, \ldots, S_n get along the way? Chebyshev’s inequality controls P(Snε)P(|S_n| \geq \varepsilon) — the probability that the final partial sum is large. But the partial sums might wander far from zero at intermediate times and then return. A random walk, for instance, typically reaches a maximum that is much larger than its terminal value. For the SLLN, we need to control P(max1knSkε)P(\max_{1 \leq k \leq n} |S_k| \geq \varepsilon) — the probability that any partial sum along the way is large.

You might guess that controlling the maximum would cost a factor of nn (by the union bound: P(maxSkε)k=1nP(Skε)P(\max |S_k| \geq \varepsilon) \leq \sum_{k=1}^n P(|S_k| \geq \varepsilon), and each term is bounded by Chebyshev). Kolmogorov’s maximal inequality does dramatically better: the bound is the same as Chebyshev’s for the terminal value alone. Controlling the running maximum costs nothing extra. This “free upgrade” is the mathematical reason why the SLLN proof works — it converts Chebyshev-type bounds (which give convergence in probability) into bounds on the running maximum (which, via Borel—Cantelli, give almost sure convergence).

Theorem 10.4 (Kolmogorov's Maximal Inequality)

Let X1,,XnX_1, \ldots, X_n be independent random variables with E[Xi]=0E[X_i] = 0 and Var(Xi)=σi2<\text{Var}(X_i) = \sigma_i^2 < \infty. Define partial sums Sk=i=1kXiS_k = \sum_{i=1}^k X_i. Then for any ε>0\varepsilon > 0:

P ⁣(max1knSkε)Var(Sn)ε2=i=1nσi2ε2P\!\left(\max_{1 \leq k \leq n} |S_k| \geq \varepsilon\right) \leq \frac{\text{Var}(S_n)}{\varepsilon^2} = \frac{\sum_{i=1}^n \sigma_i^2}{\varepsilon^2}
Proof.

The right-hand side is identical to the Chebyshev bound for P(Snε)P(|S_n| \geq \varepsilon). The fact that controlling the maximum over all kk gives the same bound is what makes this inequality powerful. The proof uses a clever decomposition based on the first time the partial sum exceeds the threshold.

The stopping time idea. Define the event Ak={Skε and Sj<ε for all j<k}A_k = \{|S_k| \geq \varepsilon \text{ and } |S_j| < \varepsilon \text{ for all } j < k\} — the event that the partial sum first exceeds ε\varepsilon in absolute value at time kk. These events are disjoint (the first exceedance can only happen once), and their union is the event that the maximum exceeds ε\varepsilon:

{max1knSkε}=k=1nAk\left\{\max_{1 \leq k \leq n} |S_k| \geq \varepsilon\right\} = \bigcup_{k=1}^n A_k

Decompose E[Sn2]E[S_n^2]. We split E[Sn2]E[S_n^2] according to whether the maximum exceeds ε\varepsilon:

E[Sn2]=E ⁣[Sn21 ⁣(maxknSkε)]+E ⁣[Sn21 ⁣(maxknSk<ε)]E[S_n^2] = E\!\left[S_n^2 \cdot \mathbf{1}\!\left(\max_{k \leq n} |S_k| \geq \varepsilon\right)\right] + E\!\left[S_n^2 \cdot \mathbf{1}\!\left(\max_{k \leq n} |S_k| < \varepsilon\right)\right]

The second term is non-negative (it is the expectation of a non-negative random variable), so:

E[Sn2]E ⁣[Sn21 ⁣(maxknSkε)]=k=1nE[Sn21(Ak)]E[S_n^2] \geq E\!\left[S_n^2 \cdot \mathbf{1}\!\left(\max_{k \leq n} |S_k| \geq \varepsilon\right)\right] = \sum_{k=1}^n E[S_n^2 \cdot \mathbf{1}(A_k)]

Key step: Expand Sn=Sk+(SnSk)S_n = S_k + (S_n - S_k) on each AkA_k. On the event AkA_k, we decompose the final partial sum into the value at the first exceedance time plus the remaining increments:

Sn2=(Sk+(SnSk))2=Sk2+2Sk(SnSk)+(SnSk)2S_n^2 = (S_k + (S_n - S_k))^2 = S_k^2 + 2S_k(S_n - S_k) + (S_n - S_k)^2

Since (SnSk)20(S_n - S_k)^2 \geq 0, we have Sn2Sk2+2Sk(SnSk)S_n^2 \geq S_k^2 + 2S_k(S_n - S_k). Taking expectations restricted to AkA_k:

E[Sn21(Ak)]E[Sk21(Ak)]+2E[Sk(SnSk)1(Ak)]E[S_n^2 \cdot \mathbf{1}(A_k)] \geq E[S_k^2 \cdot \mathbf{1}(A_k)] + 2E[S_k(S_n - S_k) \cdot \mathbf{1}(A_k)]

Now comes the crucial observation. The random variable Sk1(Ak)S_k \cdot \mathbf{1}(A_k) is a function of X1,,XkX_1, \ldots, X_k alone (because SkS_k and the event Ak={S1<ε,,Sk1<ε,Skε}A_k = \{|S_1| < \varepsilon, \ldots, |S_{k-1}| < \varepsilon, |S_k| \geq \varepsilon\} depend only on the first kk variables). The random variable SnSk=Xk+1++XnS_n - S_k = X_{k+1} + \cdots + X_n depends only on Xk+1,,XnX_{k+1}, \ldots, X_n. By independence of the XiX_i, these two quantities are independent. Therefore:

E[Sk(SnSk)1(Ak)]=E[Sk1(Ak)]E[SnSk]E[S_k(S_n - S_k) \cdot \mathbf{1}(A_k)] = E[S_k \cdot \mathbf{1}(A_k)] \cdot E[S_n - S_k]

Since E[Xi]=0E[X_i] = 0 for all ii:

E[SnSk]=i=k+1nE[Xi]=0E[S_n - S_k] = \sum_{i=k+1}^n E[X_i] = 0

So the cross-term vanishes: E[Sk(SnSk)1(Ak)]=0E[S_k(S_n - S_k) \cdot \mathbf{1}(A_k)] = 0. This gives:

E[Sn21(Ak)]E[Sk21(Ak)]E[S_n^2 \cdot \mathbf{1}(A_k)] \geq E[S_k^2 \cdot \mathbf{1}(A_k)]

On the event AkA_k, we have Skε|S_k| \geq \varepsilon, so Sk2ε2S_k^2 \geq \varepsilon^2. Therefore:

E[Sk21(Ak)]ε2P(Ak)E[S_k^2 \cdot \mathbf{1}(A_k)] \geq \varepsilon^2 \cdot P(A_k)

Summing over kk. Combining the chain of inequalities:

Var(Sn)=E[Sn2]k=1nE[Sn21(Ak)]k=1nε2P(Ak)=ε2P ⁣(maxknSkε)\text{Var}(S_n) = E[S_n^2] \geq \sum_{k=1}^n E[S_n^2 \cdot \mathbf{1}(A_k)] \geq \sum_{k=1}^n \varepsilon^2 \cdot P(A_k) = \varepsilon^2 \cdot P\!\left(\max_{k \leq n} |S_k| \geq \varepsilon\right)

Rearranging:

P ⁣(max1knSkε)Var(Sn)ε2=i=1nσi2ε2P\!\left(\max_{1 \leq k \leq n} |S_k| \geq \varepsilon\right) \leq \frac{\text{Var}(S_n)}{\varepsilon^2} = \frac{\sum_{i=1}^n \sigma_i^2}{\varepsilon^2}

which is the claimed bound. \square

The proof is entirely elementary — no advanced machinery, just a decomposition based on the first exceedance time and the observation that independence kills the cross-term. This is the same “orthogonality” trick that makes the variance of a sum equal the sum of the variances, applied more carefully.

Let us pause to appreciate why this result is surprising. A naive approach to controlling P(maxknSkε)P(\max_{k \leq n} |S_k| \geq \varepsilon) would use the union bound:

P ⁣(maxknSkε)k=1nP(Skε)k=1nVar(Sk)ε2=1ε2k=1ni=1kσi2P\!\left(\max_{k \leq n} |S_k| \geq \varepsilon\right) \leq \sum_{k=1}^n P(|S_k| \geq \varepsilon) \leq \sum_{k=1}^n \frac{\text{Var}(S_k)}{\varepsilon^2} = \frac{1}{\varepsilon^2}\sum_{k=1}^n \sum_{i=1}^k \sigma_i^2

This gives a bound that grows like nVar(Sn)n \cdot \text{Var}(S_n) — a factor of nn worse than Kolmogorov’s inequality. The stopping-time decomposition avoids this factor-of-nn loss by exploiting the independence structure: the future increments SnSkS_n - S_k are independent of the past, so the cross-term E[Sk(SnSk)1(Ak)]E[S_k(S_n - S_k) \cdot \mathbf{1}(A_k)] vanishes. Without independence, the best we could do is the union bound.

This factor-of-nn savings is exactly what allows the Borel—Cantelli argument in the SLLN proof to succeed: the summable probability bounds from Kolmogorov’s inequality, summed over the geometric subsequence, converge — whereas the union-bound version would diverge.

Example 3 Random Walk: the Gap Between max|S_k| and |S_n|

Let X1,,XnX_1, \ldots, X_n be iid N(0,1)N(0, 1), so Sk=X1++XkS_k = X_1 + \cdots + X_k is a random walk. The terminal value Sn|S_n| has the same distribution as nZ\sqrt{n} \cdot |Z| for ZN(0,1)Z \sim N(0,1), with typical magnitude n\sqrt{n}. But the running maximum maxknSk\max_{k \leq n} |S_k| is typically larger — the path might wander far from zero at some intermediate time and then return.

The reflection principle gives E[\max_{k \leq n} |S_k|] \approx \sqrt{\pi n/2}, larger than E[|S_n|] = \sqrt{2n/\pi} by a factor of \pi/2. Yet Kolmogorov’s inequality tells us that the tail probability P(\max_{k \leq n} |S_k| \geq \varepsilon) satisfies the same bound as P(|S_n| \geq \varepsilon): both are at most n/\varepsilon^2. Controlling the maximum path is no harder than controlling the endpoint, at least in terms of the Chebyshev-type bound.

For n = 1000 steps, a typical random walk has |S_{1000}| \approx 32 (since \sqrt{1000} \approx 32), while \max_{k \leq 1000} |S_k| is typically around 40. The Kolmogorov bound says P(\max_{k \leq 1000} |S_k| \geq 100) \leq 1000/10000 = 0.1 — the same bound we would get from Chebyshev for P(|S_{1000}| \geq 100).

The interactive explorer below shows sample paths of the random walk with both Sn|S_n| (the endpoint) and maxknSk\max_{k \leq n} |S_k| (the running maximum) highlighted. Notice how the running maximum can be substantially larger than the endpoint, yet the Kolmogorov bound still holds.
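
A small simulation along the lines of that explorer (NumPy; the number of replications and the seed are illustrative choices): both tail probabilities sit far below the Kolmogorov/Chebyshev bound n/ε² = 0.1, and the running maximum exceeds the threshold only somewhat more often than the endpoint does.

```python
import numpy as np

rng = np.random.default_rng(5)
n, eps, reps = 1_000, 100.0, 5_000

paths = np.cumsum(rng.standard_normal((reps, n)), axis=1)   # S_1, ..., S_n for each replication

print("P(|S_n| >= eps)       ≈", np.mean(np.abs(paths[:, -1]) >= eps))
print("P(max_k |S_k| >= eps) ≈", np.mean(np.abs(paths).max(axis=1) >= eps))
print("Kolmogorov / Chebyshev bound Var(S_n)/eps^2 =", n / eps**2)
```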

Remark 3 Kolmogorov's Maximal Inequality and Doob's Martingale Inequality

Kolmogorov’s maximal inequality is a special case of Doob’s maximal inequality for submartingales. When E[Xi]=0E[X_i] = 0, the partial sums Sk=X1++XkS_k = X_1 + \cdots + X_k form a martingale (the conditional expectation of Sk+1S_{k+1} given S1,,SkS_1, \ldots, S_k equals SkS_k), and Sk2S_k^2 is a submartingale. Doob’s inequality gives the same bound in greater generality: for any non-negative submartingale (Mk)(M_k),

P ⁣(maxknMkλ)E[Mn]λP\!\left(\max_{k \leq n} M_k \geq \lambda\right) \leq \frac{E[M_n]}{\lambda}

Applied to Mk=Sk2M_k = S_k^2 and λ=ε2\lambda = \varepsilon^2, this recovers Kolmogorov’s inequality. We do not need the full martingale theory here — the stopping-time decomposition above is elementary and self-contained — but the reader should be aware that the maximal inequality is part of a much richer theory.

Kolmogorov's maximal inequality: random walk paths, running maximum vs endpoint, Chebyshev bound overlay
Interactive: Kolmogorov's Maximal Inequality Explorer, showing a single partial-sum path S₁, ..., Sₙ, a histogram of max|Sₖ| and |Sₙ| over repeated replications, and the probabilities P(max|Sₖ| > ε) and P(|Sₙ| > ε) plotted against the bound n/ε².
Kolmogorov's maximal inequality bounds P(maxₖ≤ₙ |Sₖ| > ε) ≤ Var(Sₙ)/ε² — the same Chebyshev bound that controls |Sₙ| alone also controls the running maximum.

10.6 The Strong Law of Large Numbers

We are now ready for the main event. The Strong Law of Large Numbers says that the sample mean converges to the population mean almost surely — not just in probability. Recall from Topic 9 that almost sure convergence is strictly stronger than convergence in probability: it says that the probability of the event “the sample path converges to μ\mu” is 1, not merely that the probability of being far from μ\mu at any particular time goes to zero.

This distinction is not academic. When you run a Monte Carlo simulation to estimate an integral, you run one simulation and observe one trajectory of running averages. You need that trajectory to converge — not just that most trajectories converge, but that yours does. The SLLN says: with probability 1, yours does. When you train a neural network with SGD, you run one training trajectory. The SLLN (and its martingale generalizations) says that trajectory converges. When you evaluate a model on a test set, you compute one empirical risk. The SLLN says it converges to the true risk. Almost sure convergence is what makes single-run inference valid.

We follow the proof of Etemadi (1981), which is remarkable in two respects. First, it requires only pairwise independence — not the full mutual independence that most proofs assume. Second, it is genuinely elementary: the only tools are the Kolmogorov maximal inequality (Theorem 10.4), the first Borel—Cantelli lemma (from Modes of Convergence), and the dominated convergence theorem ( formalCalculus: Integration ). No characteristic functions, no martingale theory, no measure-theoretic heavy machinery. Etemadi’s original paper (1981) is three pages long — one of the shortest proofs of a fundamental theorem in all of probability. Before Etemadi, the standard proof (due to Kolmogorov) required full mutual independence and used the Kolmogorov three-series theorem, which itself requires significant machinery. Etemadi’s contribution was to show that pairwise independence suffices, using only the tools we have developed here.

Theorem 10.5 (Strong Law of Large Numbers — Kolmogorov/Etemadi)

Let X1,X2,X_1, X_2, \ldots be pairwise independent, identically distributed random variables with E[X1]<E[|X_1|] < \infty. Then

Xˉna.s.μ=E[X1]\bar{X}_n \xrightarrow{\text{a.s.}} \mu = E[X_1]

That is, P ⁣(limnXˉn=μ)=1P\!\left(\lim_{n \to \infty} \bar{X}_n = \mu\right) = 1.

Proof.

We follow Etemadi’s 1981 proof, which is remarkable for requiring only pairwise independence (not full independence). The argument has three stages: reduce to non-negative random variables, prove convergence along a geometric subsequence using truncation and Borel—Cantelli, then fill in the gaps between subsequence terms using monotonicity.

Stage 1: Reduce to non-negative random variables.

Write Xi=Xi+XiX_i = X_i^+ - X_i^- where Xi+=max(Xi,0)X_i^+ = \max(X_i, 0) and Xi=max(Xi,0)X_i^- = \max(-X_i, 0). Since E[X1]<E[|X_1|] < \infty, both E[X1+]=E[max(X1,0)]E[X_1^+] = E[\max(X_1, 0)] and E[X1]=E[max(X1,0)]E[X_1^-] = E[\max(-X_1, 0)] are finite (because X1+X1X_1^+ \leq |X_1| and X1X1X_1^- \leq |X_1|). The sequences (Xi+)(X_i^+) and (Xi)(X_i^-) are each pairwise independent and identically distributed with finite mean.

If we prove the SLLN for non-negative random variables, then:

1ni=1nXi+a.s.E[X1+]and1ni=1nXia.s.E[X1]\frac{1}{n}\sum_{i=1}^n X_i^+ \xrightarrow{\text{a.s.}} E[X_1^+] \quad \text{and} \quad \frac{1}{n}\sum_{i=1}^n X_i^- \xrightarrow{\text{a.s.}} E[X_1^-]

Subtracting (the difference of two a.s. convergent sequences converges a.s., since a.s. convergence is preserved under finite linear combinations) gives:

Xˉn=1ni=1nXi+1ni=1nXia.s.E[X1+]E[X1]=E[X1]=μ\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i^+ - \frac{1}{n}\sum_{i=1}^n X_i^- \xrightarrow{\text{a.s.}} E[X_1^+] - E[X_1^-] = E[X_1] = \mu

So without loss of generality, assume Xi0X_i \geq 0 for the remainder of the proof. This assumption buys us something crucial: when Xi0X_i \geq 0, the partial sums Sm=X1++XmS_m = X_1 + \cdots + X_m are non-decreasing in mm. This monotonicity is what makes Stage 3 work.

Stage 2: Truncate and prove convergence along a geometric subsequence.

Fix an integer α2\alpha \geq 2 (the standard choice is α=2\alpha = 2, but any integer 2\geq 2 works). Define the geometric subsequence nk=αkn_k = \alpha^k for k=0,1,2,k = 0, 1, 2, \ldots The key insight is to first establish convergence along this sparse subsequence — which is easier because the gaps between terms grow geometrically, making certain series converge — and then fill in the gaps using monotonicity.

The plan: prove Snk/nkμS_{n_k}/n_k \to \mu almost surely (the hard part, Steps 2a—2e), then show Sn/nμS_n/n \to \mu for all nn between consecutive subsequence terms (Stage 3, using monotonicity from Stage 1).

Step 2a: Truncation. Define truncated variables Yi=Xi1(Xii)Y_i = X_i \cdot \mathbf{1}(X_i \leq i). Since Xi0X_i \geq 0, the truncation simply caps values exceeding ii at zero: if XiiX_i \leq i, then Yi=XiY_i = X_i; if Xi>iX_i > i, then Yi=0Y_i = 0. Note that the truncation threshold for the ii-th variable is ii itself (not nn, as in the Khintchine proof). This is a different calibration — each variable gets its own threshold, growing with its index. Let Zi=YiE[Yi]Z_i = Y_i - E[Y_i] denote the centered truncated variables.

Step 2b: The truncation error vanishes almost surely. We need to show that Xi=YiX_i = Y_i for all but finitely many ii, with probability 1. Consider the “bad” events {XiYi}={Xi>i}\{X_i \neq Y_i\} = \{X_i > i\} — the events where the ii-th observation exceeds the ii-th truncation threshold. Since the XiX_i are identically distributed:

i=1P(Xi>i)=i=1P(X1>i)E[X1]<\sum_{i=1}^{\infty} P(X_i > i) = \sum_{i=1}^{\infty} P(X_1 > i) \leq E[X_1] < \infty

The inequality i=1P(X1>i)E[X1]\sum_{i=1}^{\infty} P(X_1 > i) \leq E[X_1] follows from the tail-sum formula for expectation of non-negative integer-valued random variables, or more precisely from comparing the sum to the integral:

i=1P(X1>i)0P(X1>t)dt=E[X1]<\sum_{i=1}^{\infty} P(X_1 > i) \leq \int_0^{\infty} P(X_1 > t)\, dt = E[X_1] < \infty

By the first Borel—Cantelli lemma ( formalCalculus: Series Convergence ): since P(Xi>i)<\sum P(X_i > i) < \infty, the probability that Xi>iX_i > i occurs for infinitely many ii is zero. In other words, with probability 1, XiiX_i \leq i for all but finitely many ii. For those finitely many exceptions, XiYiX_i \neq Y_i — but a finite number of mismatches does not affect the asymptotic behavior of a sample mean. Therefore, the sample means of XiX_i and YiY_i have the same a.s. limit (if either has one).

Step 2c: The variance series converges. For Borel—Cantelli to work in Step 2d, we need a summable series of probability bounds. This in turn requires that the variances of the truncated variables, when summed with appropriate weights, converge. Specifically, we need i=1Var(Yi)/i2<\sum_{i=1}^{\infty} \text{Var}(Y_i)/i^2 < \infty. Since Var(Yi)E[Yi2]\text{Var}(Y_i) \leq E[Y_i^2] (the variance is at most the second moment), it suffices to show E[Yi2]/i2<\sum E[Y_i^2]/i^2 < \infty:

i=1E[Yi2]i2=i=1E[X121(X1i)]i2\sum_{i=1}^{\infty} \frac{E[Y_i^2]}{i^2} = \sum_{i=1}^{\infty} \frac{E[X_1^2 \cdot \mathbf{1}(X_1 \leq i)]}{i^2}

We interchange the sum and expectation using Tonelli’s theorem (both are non-negative):

=E ⁣[X12iX11i2]= E\!\left[X_1^2 \sum_{i \geq X_1} \frac{1}{i^2}\right]

For X1=mX_1 = m (or more precisely, m1<X1mm - 1 < X_1 \leq m), the inner sum im1/i2\sum_{i \geq m} 1/i^2 satisfies:

i=m1i21m12mfor m2\sum_{i=m}^{\infty} \frac{1}{i^2} \leq \frac{1}{m-1} \leq \frac{2}{m} \quad \text{for } m \geq 2

(by comparison with the integral m1t2dt\int_{m-1}^{\infty} t^{-2}\, dt). Therefore X12iX11/i22X1X_1^2 \cdot \sum_{i \geq X_1} 1/i^2 \leq 2X_1, giving:

i=1E[Yi2]i22E[X1]<\sum_{i=1}^{\infty} \frac{E[Y_i^2]}{i^2} \leq 2\, E[X_1] < \infty

This is where E[X1]<E[|X_1|] < \infty (equivalently E[X1]<E[X_1] < \infty since X10X_1 \geq 0) is used — and it is the only place in the proof where the finite-mean assumption enters in an essential way. If E[X1]=E[X_1] = \infty, the series would diverge, and the entire argument would collapse.

Step 2d: Apply Kolmogorov’s maximal inequality along the subsequence. This is the step where Kolmogorov’s inequality (Theorem 10.4) enters. For each kk, consider the centered truncated partial sums i=1nkZi\sum_{i=1}^{n_k} Z_i. The ZiZ_i are pairwise independent (since each Zi=YiE[Yi]Z_i = Y_i - E[Y_i] is a function of XiX_i alone, and the XiX_i are pairwise independent) with E[Zi]=0E[Z_i] = 0. By Kolmogorov’s maximal inequality, for any δ>0\delta > 0:

P ⁣(max1jnki=1jZi>δnk)i=1nkVar(Yi)δ2nk2P\!\left(\max_{1 \leq j \leq n_k} \left|\sum_{i=1}^{j} Z_i\right| > \delta n_k\right) \leq \frac{\sum_{i=1}^{n_k} \text{Var}(Y_i)}{\delta^2 n_k^2}

We need k=1\sum_{k=1}^{\infty} of the right-hand side to converge (so Borel—Cantelli applies). Since nk=αkn_k = \alpha^k:

k=1i=1nkVar(Yi)δ2nk21δ2k=11nk2i=1nkE[Yi2]\sum_{k=1}^{\infty} \frac{\sum_{i=1}^{n_k} \text{Var}(Y_i)}{\delta^2 n_k^2} \leq \frac{1}{\delta^2} \sum_{k=1}^{\infty} \frac{1}{n_k^2} \sum_{i=1}^{n_k} E[Y_i^2]

Exchanging the order of summation (each ii appears in the inner sum for all kk with nkin_k \geq i, i.e., klogαik \geq \log_\alpha i):

=1δ2i=1E[Yi2]k:nki1nk21δ2i=1E[Yi2]i2Cα= \frac{1}{\delta^2} \sum_{i=1}^{\infty} E[Y_i^2] \sum_{k : n_k \geq i} \frac{1}{n_k^2} \leq \frac{1}{\delta^2} \sum_{i=1}^{\infty} \frac{E[Y_i^2]}{i^2} \cdot C_\alpha

where Cα=j=0α2j=α2/(α21)<C_\alpha = \sum_{j=0}^{\infty} \alpha^{-2j} = \alpha^2/(\alpha^2 - 1) < \infty is a geometric series constant (for α=2\alpha = 2, this is 4/34/3). Since E[Yi2]/i2<\sum E[Y_i^2]/i^2 < \infty (Step 2c), the entire sum converges. This is where the geometric spacing of the subsequence is essential: with equally spaced terms (nk=kn_k = k), the sum would not converge.

By the first Borel—Cantelli lemma ( formalCalculus: Series Convergence ): since kP()<\sum_k P(\ldots) < \infty, the probability that the event occurs for infinitely many kk is zero. That is:

P ⁣(maxjnk1nki=1jZi>δ infinitely often in k)=0P\!\left(\max_{j \leq n_k} \left|\frac{1}{n_k}\sum_{i=1}^{j} Z_i\right| > \delta \text{ infinitely often in } k\right) = 0

In particular, taking j=nkj = n_k:

1nki=1nkZi0almost surely\frac{1}{n_k}\sum_{i=1}^{n_k} Z_i \to 0 \quad \text{almost surely}

Step 2e: The means of the truncated variables converge. Since E[Yi]=E[X11(X1i)]E[X1]=μE[Y_i] = E[X_1 \cdot \mathbf{1}(X_1 \leq i)] \to E[X_1] = \mu by DCT ( formalCalculus: Integration ):

1nki=1nkE[Yi]μ\frac{1}{n_k}\sum_{i=1}^{n_k} E[Y_i] \to \mu

(This is a Cesaro mean of a convergent sequence. By the Cesaro lemma, if aiaa_i \to a, then 1ni=1naia\frac{1}{n}\sum_{i=1}^n a_i \to a. Here ai=E[Yi]μa_i = E[Y_i] \to \mu, so the Cesaro mean also converges to μ\mu.)

Combining Steps 2d and 2e:

1nki=1nkYi=1nki=1nkZi0 a.s.+1nki=1nkE[Yi]μa.s.μ\frac{1}{n_k}\sum_{i=1}^{n_k} Y_i = \underbrace{\frac{1}{n_k}\sum_{i=1}^{n_k} Z_i}_{\to\, 0 \text{ a.s.}} + \underbrace{\frac{1}{n_k}\sum_{i=1}^{n_k} E[Y_i]}_{\to\, \mu} \xrightarrow{\text{a.s.}} \mu

By Step 2b, Xi=YiX_i = Y_i for all sufficiently large ii (almost surely). This means the partial sums i=1nkXi\sum_{i=1}^{n_k} X_i and i=1nkYi\sum_{i=1}^{n_k} Y_i differ by at most a finite constant (the sum of the finitely many terms where XiYiX_i \neq Y_i). Dividing by nkn_k \to \infty, this finite difference becomes negligible. Therefore 1nki=1nkXiμ\frac{1}{n_k}\sum_{i=1}^{n_k} X_i \to \mu almost surely.

Stage 3: Fill in the gaps between subsequence terms.

For nkn<nk+1=αnkn_k \leq n < n_{k+1} = \alpha \cdot n_k, we use the non-negativity of the XiX_i to squeeze the sample mean. Let Sm=i=1mXiS_m = \sum_{i=1}^m X_i. Since each Xi0X_i \geq 0, the partial sums are non-decreasing: SnkSnSnk+1S_{n_k} \leq S_n \leq S_{n_{k+1}}. Therefore:

Snknk+1SnnSnk+1nk\frac{S_{n_k}}{n_{k+1}} \leq \frac{S_n}{n} \leq \frac{S_{n_{k+1}}}{n_k}

The left inequality holds because SnkSnS_{n_k} \leq S_n (adding more non-negative terms can only increase the sum) and nnk+1n \leq n_{k+1} (dividing by a larger denominator makes the fraction smaller), so Snk/nk+1Sn/nk+1Sn/nS_{n_k}/n_{k+1} \leq S_n/n_{k+1} \leq S_n/n. The right inequality holds because SnSnk+1S_n \leq S_{n_{k+1}} (the sum of more terms is larger) and nnkn \geq n_k (dividing by a smaller denominator makes the fraction larger), so Sn/nSnk+1/nSnk+1/nkS_n/n \leq S_{n_{k+1}}/n \leq S_{n_{k+1}}/n_k.

Now we evaluate the bounds as kk \to \infty. The lower bound:

Snknk+1=Snknknknk+1=Snknk1αa.s.μα\frac{S_{n_k}}{n_{k+1}} = \frac{S_{n_k}}{n_k} \cdot \frac{n_k}{n_{k+1}} = \frac{S_{n_k}}{n_k} \cdot \frac{1}{\alpha} \xrightarrow{\text{a.s.}} \frac{\mu}{\alpha}

The upper bound:

Snk+1nk=Snk+1nk+1nk+1nk=Snk+1nk+1αa.s.αμ\frac{S_{n_{k+1}}}{n_k} = \frac{S_{n_{k+1}}}{n_{k+1}} \cdot \frac{n_{k+1}}{n_k} = \frac{S_{n_{k+1}}}{n_{k+1}} \cdot \alpha \xrightarrow{\text{a.s.}} \alpha \mu

So for every nn between nkn_k and nk+1n_{k+1}, the sample mean Sn/nS_n/n is squeezed between values converging to μ/α\mu/\alpha and αμ\alpha\mu.

With \alpha fixed, this traps the limit points of S_n/n only between \mu/\alpha and \alpha\mu. More precisely, the squeeze gives, almost surely,

\frac{\mu}{\alpha} \leq \liminf_{n \to \infty} \frac{S_n}{n} \leq \limsup_{n \to \infty} \frac{S_n}{n} \leq \alpha \mu

For \alpha = 2 this means \mu/2 and 2\mu: useful, but not the S_n/n \to \mu that we want. The squeeze is loose because the ratio n_{k+1}/n_k = \alpha between consecutive subsequence terms stays bounded away from 1.

The remedy, and the final step of Etemadi's argument, is to let that ratio shrink toward 1. Run Stage 2 with n_k = \lfloor \alpha^k \rfloor for an arbitrary real \alpha > 1 instead of an integer \alpha \geq 2. The only change is in Step 2d: since \lfloor \alpha^k \rfloor \geq \alpha^k/2 whenever \alpha^k \geq 1, the geometric-series constant C_\alpha at most quadruples, and the bound still sums to a finite constant times \sum_i E[Y_i^2]/i^2. (For \alpha close to 1 the first few n_k may coincide, which affects nothing asymptotically.) So S_{n_k}/n_k \to \mu almost surely for every real \alpha > 1, and the squeeze above goes through unchanged, with n_k/n_{k+1} \to 1/\alpha in place of the exact ratio 1/\alpha.

Now take \alpha_m = 1 + 1/m. For each m, the event

\left\{\frac{\mu}{\alpha_m} \leq \liminf_{n \to \infty} \frac{S_n}{n} \leq \limsup_{n \to \infty} \frac{S_n}{n} \leq \alpha_m \mu\right\}

has probability 1, so the intersection over all m \geq 1 (a countable intersection of probability-one events) also has probability 1. On that intersection, letting m \to \infty gives

\liminf_{n \to \infty} \frac{S_n}{n} = \limsup_{n \to \infty} \frac{S_n}{n} = \mu \quad \text{a.s.}

Therefore Sn/nμS_n/n \to \mu almost surely. \square

Let us step back and appreciate the structure of Etemadi’s argument, because it is a masterclass in mathematical proof technique:

  • Stage 1 reduces to non-negative variables — a clean trick that unlocks monotonicity in Stage 3. The decomposition X=X+XX = X^+ - X^- is standard, but its use here is strategic: monotonicity of partial sums is what lets us fill in the gaps between subsequence terms.

  • Stage 2 is the hard work, and it interleaves three ideas. Truncation (Yi=Xi1(Xii)Y_i = X_i \cdot \mathbf{1}(X_i \leq i)) makes the variance summable — this is where E[X1]<E[|X_1|] < \infty is used, via the interchange-of-sum trick E[Yi2]/i22E[X1]\sum E[Y_i^2]/i^2 \leq 2E[X_1]. Kolmogorov’s maximal inequality controls the running maximum of the centered partial sums, not just the terminal value. Borel—Cantelli converts the summable probability bounds into almost sure convergence, but only along the geometric subsequence nk=αkn_k = \alpha^k.

  • Stage 3 fills in the gaps using the non-negativity from Stage 1. The squeeze Snk/nk+1Sn/nSnk+1/nkS_{n_k}/n_{k+1} \leq S_n/n \leq S_{n_{k+1}}/n_k traps every Sn/nS_n/n between two sequences that both converge to μ\mu.

The proof requires only pairwise independence — a remarkable weakening of the usual assumption. Full mutual independence is not needed for two reasons. First, the Kolmogorov maximal inequality proof (Theorem 10.4) uses independence only to show that the future increment SnSkS_n - S_k is independent of Sk1(Ak)S_k \cdot \mathbf{1}(A_k); for the variance calculation Var(Sn)=Var(Xi)\text{Var}(S_n) = \sum \text{Var}(X_i), pairwise independence suffices. Second, the first Borel—Cantelli lemma (P(An)<\sum P(A_n) < \infty implies P(An i.o.)=0P(A_n \text{ i.o.}) = 0) makes no independence assumption at all — it is a purely measure-theoretic fact.

This pairwise-independence generalization is not just a mathematical curiosity. In several practical settings, observations are pairwise independent but not mutually independent:

  • Parity (XOR) constructions: if B1 and B2 are independent fair bits, then B1, B2, and their XOR B1 ⊕ B2 are pairwise independent (any single one tells you nothing about any other), yet the three are not mutually independent, since any two determine the third. Scaled up, the same idea turns k truly random bits into 2^k − 1 pairwise independent bits, which underlies derandomized sampling schemes (a code sketch follows below).
  • Hash-based sampling: Random hash functions with pairwise independence (2-universal hash families) are widely used in streaming algorithms. The SLLN via Etemadi applies to averages computed from such samples.

Etemadi’s theorem guarantees that the SLLN still holds in all these settings.
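To see the pairwise-independence phenomenon concretely, here is a minimal NumPy sketch of the subset-parity construction from the first bullet (the constants, seed, and variable names are ours, chosen only for illustration): k truly random bits generate 2^k − 1 bits that are pairwise independent but heavily dependent as a family, and their average still lands close to the true mean 1/2, exactly as the variance calculation behind the pairwise-independent LLN predicts.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 16                                    # number of truly random seed bits
seed_bits = rng.integers(0, 2, size=k)

# For every nonempty subset S of {1, ..., k}, define X_S = parity of the seed bits in S.
# The 2^k - 1 variables X_S are pairwise independent Bernoulli(1/2), but they are far
# from mutually independent: all of them are functions of just k random bits.
parities = []
for mask in range(1, 2 ** k):
    subset = [seed_bits[i] for i in range(k) if (mask >> i) & 1]
    parities.append(sum(subset) % 2)
parities = np.array(parities)

print(f"{len(parities)} pairwise independent bits built from {k} random bits")
print(f"sample mean = {parities.mean():.4f}   (population mean 0.5)")
```

Because the variance of a sum of pairwise independent variables still splits into the sum of the individual variances, the average above has standard deviation about 0.5/√(2^k − 1) ≈ 0.002, so a value very close to 0.5 is exactly what the theory predicts.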

The following proposition confirms that finite mean is not just sufficient but necessary for the SLLN.

Proposition 10.1 (Finite Mean is Necessary)

If X1,X2,X_1, X_2, \ldots are iid and Xˉn\bar{X}_n converges almost surely (or even in probability) to a finite limit, then E[X1]<E[|X_1|] < \infty.

The proof uses the converse direction of the Borel—Cantelli lemma. If E[X1]=E[|X_1|] = \infty, then n=1P(Xn>n)=n=1P(X1>n)=\sum_{n=1}^{\infty} P(|X_n| > n) = \sum_{n=1}^{\infty} P(|X_1| > n) = \infty (since n=1P(X1>n)E[X1]1\sum_{n=1}^{\infty} P(|X_1| > n) \geq E[|X_1|] - 1, which is infinite). By the second Borel—Cantelli lemma — which applies here because the events {Xn>n}\{|X_n| > n\} are independent (they depend on different XnX_n‘s) — we get P(Xn>n infinitely often)=1P(|X_n| > n \text{ infinitely often}) = 1.

This means that with probability 1, the sequence X1,X2,X_1, X_2, \ldots contains infinitely many terms with Xn>n|X_n| > n. Each such term contributes at least 1/n1/n to the sample mean (since Xn/n>1|X_n/n| > 1), and these large contributions arrive infinitely often. A careful argument shows this prevents Xˉn\bar{X}_n from converging to any finite limit. The details are in Durrett (2019, Theorem 2.4.5).

The takeaway: E[X1]<E[|X_1|] < \infty is the exact boundary for the LLN. When the mean is infinite (as for the Cauchy), sample means do not converge to any finite limit; when the mean is finite, they converge almost surely to E[X1]E[X_1]. The condition is sharp and cannot be weakened.

Example 4 Cauchy Sample Mean: No Convergence at All

The Cauchy distribution (equivalently, Student’s tt with ν=1\nu = 1 degree of freedom) has density f(x)=1/(π(1+x2))f(x) = 1/(\pi(1 + x^2)). It has no finite mean: E[X]=x/(π(1+x2))dx=E[|X|] = \int_{-\infty}^{\infty} |x|/(\pi(1+x^2))\, dx = \infty (the integral diverges logarithmically).

Proposition 10.1 says the SLLN must fail. But something much more dramatic is true: the sample mean of iid Cauchy random variables does not converge at all. In fact, Xˉn\bar{X}_n is itself Cauchy-distributed for every nn — with the exact same distribution as a single observation.

This follows from the Cauchy being a stable distribution: if X1,,XnX_1, \ldots, X_n are iid standard Cauchy, then X1++XnX_1 + \cdots + X_n has the same distribution as nX1n \cdot X_1 (Cauchy with scale parameter nn). Dividing by nn gives XˉnCauchy(0,1)\bar{X}_n \sim \text{Cauchy}(0, 1) — the averaging accomplishes nothing. No matter how large nn is, the sample mean is just as dispersed as a single observation.

This is not just slow convergence — it is no convergence. The sample mean carries zero information about the “center” of the Cauchy distribution, no matter how many observations you collect. The median, by contrast, does converge: the sample median of nn iid Cauchy observations converges to the location parameter at rate 1/n1/\sqrt{n}. This is a vivid illustration of why the choice of estimator matters — the sample mean is the wrong tool for heavy-tailed data.
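A quick simulation makes the contrast vivid. The following sketch (ours; the seed and sample sizes are arbitrary) tracks the running mean and running median of a single stream of standard Cauchy draws: the mean keeps jumping because it is Cauchy(0, 1) at every nn, while the median homes in on the location parameter 0.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_cauchy(100_000)          # iid standard Cauchy: location 0, scale 1

for n in [100, 1_000, 10_000, 100_000]:
    sample = x[:n]
    print(f"n = {n:>6}   running mean = {sample.mean():9.3f}   running median = {np.median(sample):7.3f}")

# Typical behaviour: the running mean wanders (it is Cauchy(0, 1) for every n, so a few
# enormous observations keep yanking it around), while the running median tightens
# around 0 at roughly the 1/sqrt(n) rate.
```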

Try setting the degrees of freedom to ν=1\nu = 1 in the SLLN Explorer below. The sample paths wander without settling down — a striking contrast to the well-behaved convergence seen with distributions that have a finite mean. Then try ν=1.5\nu = 1.5 (finite mean, infinite variance) to see slow but genuine convergence, and ν=3\nu = 3 (finite variance) to see rapid convergence.

Remark 4 Kolmogorov's 0-1 Law

The event {Xˉnμ}\{\bar{X}_n \to \mu\} is a tail event: it depends only on the “tail” of the sequence. More precisely, for any finite NN, changing X1,,XNX_1, \ldots, X_N does not affect whether Xˉn\bar{X}_n converges as nn \to \infty — the convergence or non-convergence is determined by the behavior of XN+1,XN+2,X_{N+1}, X_{N+2}, \ldots The tail σ\sigma-algebra T=N=1σ(XN+1,XN+2,)\mathcal{T} = \bigcap_{N=1}^{\infty} \sigma(X_{N+1}, X_{N+2}, \ldots) captures exactly these “asymptotic” events.

Kolmogorov’s 0-1 law states: if X1,X2,X_1, X_2, \ldots are independent, then every event in T\mathcal{T} has probability 0 or 1. There is no middle ground — you cannot have P(Xˉnμ)=0.7P(\bar{X}_n \to \mu) = 0.7, for instance.

This means that even before proving the SLLN, we know the answer must be all-or-nothing: either P(Xˉnμ)=0P(\bar{X}_n \to \mu) = 0 or P(Xˉnμ)=1P(\bar{X}_n \to \mu) = 1. The SLLN’s job is to determine which one. Etemadi’s proof shows that when E[X1]<E[|X_1|] < \infty, the answer is 1. Proposition 10.1 shows that when E[X1]=E[|X_1|] = \infty, the answer is 0. The 0-1 law ensures there are no other possibilities.

Other tail events include {Xn converges}\{\sum X_n \text{ converges}\}, {lim supXˉn=+}\{\limsup \bar{X}_n = +\infty\}, and {Xˉn oscillates between a and b}\{\bar{X}_n \text{ oscillates between } a \text{ and } b\}. All of these have probability 0 or 1. The 0-1 law is one of the most elegant results in probability: it says that for independent sequences, every question about the “eventual behavior” has a deterministic answer — even though the individual random variables are random, the tail behavior is not.

The 0-1 law requires independence (or at least that the tail σ\sigma-algebra is trivial). For dependent sequences (such as Markov chains), the tail behavior can genuinely have intermediate probabilities. This is why the SLLN for Markov chains (the ergodic theorem) requires additional conditions beyond what the iid SLLN needs — specifically, irreducibility, aperiodicity, and positive recurrence (Topic 26 §26.7 Thm 5).

SLLN: sample paths converging almost surely to μ for Normal and Exponential data; failing for Cauchy.
Interactive: SLLN Explorer — Every Path Converges. Panels: sample paths of X̄ₙ with the band μ ± ε, and a Convergence Tracker showing the fraction of paths currently within the band (WLLN view) versus the fraction locked in the band for good (SLLN view). The SLLN curve always sits below the WLLN curve but catches up as n grows.

10.7 The Glivenko-Cantelli Theorem

The SLLN says that the sample mean Xˉn=1ni=1nXi\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i converges to μ=E[X1]\mu = E[X_1] almost surely. But sample means are just one kind of statistic. What about the entire distribution of the data? Can we estimate the CDF of XX from a sample?

The natural estimator is the empirical CDF: the proportion of observations at or below each threshold xx. The Glivenko—Cantelli theorem says this estimator converges to the true CDF not just pointwise (which would follow immediately from the SLLN applied to indicator functions) but uniformly over all xx — and almost surely.

Definition 10.1 (Empirical CDF)

Given iid observations X1,,XnX_1, \ldots, X_n with common CDF FF, the empirical CDF (ECDF) is:

Fn(x)=1ni=1n1(Xix)F_n(x) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}(X_i \leq x)

For each fixed xx, Fn(x)F_n(x) is the sample mean of the Bernoulli random variables 1(Xix)\mathbf{1}(X_i \leq x), each with mean F(x)=P(Xix)F(x) = P(X_i \leq x).

The ECDF FnF_n is a step function that jumps by 1/n1/n at each observation. It is itself a valid CDF (of the discrete uniform distribution that places mass 1/n1/n on each observed value X1,,XnX_1, \ldots, X_n). As a function of xx, it is non-decreasing and right-continuous, just like any CDF.

The ECDF is the most fundamental object in nonparametric statistics. Unlike parametric estimators (which assume the data comes from a specific family), the ECDF estimates the entire distribution without any assumptions. Every summary statistic you compute from data — the sample mean, the sample variance, the sample median, the sample quantiles — can be expressed as a functional of the ECDF. The sample mean is xdFn(x)=1nXi\int x \, dF_n(x) = \frac{1}{n}\sum X_i. The sample variance is (xXˉn)2dFn(x)\int (x - \bar{X}_n)^2 \, dF_n(x). The sample qq-quantile is Fn1(q)=inf{x:Fn(x)q}F_n^{-1}(q) = \inf\{x : F_n(x) \geq q\}. The interquartile range is Fn1(0.75)Fn1(0.25)F_n^{-1}(0.75) - F_n^{-1}(0.25). In this sense, the ECDF is the “mother” of all nonparametric statistics — if the ECDF converges to the true CDF, then every continuous functional of the CDF is consistently estimated by the corresponding functional of the ECDF.
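A few lines of NumPy make the "functional of the ECDF" viewpoint concrete. This is an illustrative sketch (the ecdf helper and the Exponential example are ours): the ECDF is just a sorted-array lookup, and integrating against it reproduces the ordinary sample statistics.

```python
import numpy as np

def ecdf(data):
    """Return F_n as a callable: F_n(x) = (1/n) * #{i : X_i <= x}."""
    xs = np.sort(np.asarray(data))
    n = len(xs)
    return lambda x: np.searchsorted(xs, x, side="right") / n

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=5_000)        # true CDF: F(x) = 1 - exp(-x/2)
F_n = ecdf(data)
print(f"F_n(2.0) = {F_n(2.0):.4f}   F(2.0) = {1 - np.exp(-1):.4f}")

# Functionals of F_n are the familiar sample statistics.
xs, n = np.sort(data), len(data)
mean_from_ecdf = np.sum(xs * (1.0 / n))               # integral of x dF_n(x)
median_from_ecdf = xs[int(np.ceil(0.5 * n)) - 1]      # F_n^{-1}(0.5) = inf{x : F_n(x) >= 0.5}
print(f"mean via ECDF   = {mean_from_ecdf:.4f}   (same value as data.mean())")
print(f"median via ECDF = {median_from_ecdf:.4f}")
```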

For each fixed xx, the SLLN immediately gives Fn(x)a.s.F(x)F_n(x) \xrightarrow{\text{a.s.}} F(x), since Fn(x)F_n(x) is the sample mean of iid Bernoulli variables with mean F(x)F(x), and E[1(X1x)]=F(x)1<E[|\mathbf{1}(X_1 \leq x)|] = F(x) \leq 1 < \infty. This gives pointwise a.s. convergence — the ECDF converges to the true CDF at each individual threshold.

But pointwise convergence is not enough. If we want to use the ECDF to estimate quantiles, test distributional hypotheses, or construct confidence bands, we need convergence that holds at all xx simultaneously. The problem is that the pointwise SLLN gives a probability-1 set Ωx\Omega_x for each xx on which Fn(x)F(x)F_n(x) \to F(x), but different xx‘s may have different exceptional sets. Since xx ranges over the uncountable set R\mathbb{R}, we cannot simply intersect the probability-1 sets (an uncountable intersection of probability-1 sets need not have probability 1).

The Glivenko—Cantelli theorem overcomes this obstacle by exploiting the monotonicity of CDFs. The key idea is that an uncountable supremum over xx can be bounded by a finite maximum over grid points plus a discretization error — and this works because CDFs cannot oscillate freely between grid points. The argument is an ε\varepsilon-net argument: approximate the uncountable index set R\mathbb{R} by a finite grid, handle the grid points with the SLLN, and control the approximation error using monotonicity.

This is worth highlighting because the same ε\varepsilon-net technique — with different notions of “grid” and “approximation error” — is the template for all uniform convergence results in probability and statistics, from the Glivenko—Cantelli theorem to uniform laws of large numbers over function classes.

Theorem 10.6 (Glivenko-Cantelli Theorem)

Let X1,X2,X_1, X_2, \ldots be iid with CDF FF. Then:

supxRFn(x)F(x)a.s.0\sup_{x \in \mathbb{R}} |F_n(x) - F(x)| \xrightarrow{\text{a.s.}} 0

The empirical CDF converges to the true CDF uniformly and almost surely.

Proof

The challenge is upgrading from pointwise to uniform convergence. The SLLN gives convergence at each fixed xx, but taking the supremum over uncountably many xx requires an additional argument. The key insight is that CDFs are monotone, so the supremum over all xx can be bounded by the maximum over finitely many grid points plus a discretization error controlled by the CDF’s own oscillation.

Step 1: Pointwise convergence via SLLN. Fix any xRx \in \mathbb{R}. The random variables 1(Xix)\mathbf{1}(X_i \leq x) are iid Bernoulli(F(x))(F(x)) — they equal 1 with probability F(x)F(x) and 0 otherwise. Their mean is E[1(X1x)]=F(x)E[\mathbf{1}(X_1 \leq x)] = F(x) and they certainly have E[1(X1x)]=F(x)1<E[|\mathbf{1}(X_1 \leq x)|] = F(x) \leq 1 < \infty. By the SLLN (Theorem 10.5):

Fn(x)=1ni=1n1(Xix)a.s.F(x)F_n(x) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}(X_i \leq x) \xrightarrow{\text{a.s.}} F(x)

This gives pointwise a.s. convergence at every xx, but we need uniform convergence.

Step 2: Discretize using partition points. Fix ε>0\varepsilon > 0. Since FF is non-decreasing with limxF(x)=0\lim_{x \to -\infty} F(x) = 0 and limx+F(x)=1\lim_{x \to +\infty} F(x) = 1, we can choose points =x0<x1<x2<<xN<xN+1=+-\infty = x_0 < x_1 < x_2 < \cdots < x_N < x_{N+1} = +\infty such that:

F(xj)F(xj1)<εfor all j=1,,N+1F(x_j) - F(x_{j-1}) < \varepsilon \quad \text{for all } j = 1, \ldots, N+1

(We define F(x0)=F()=0F(x_0) = F(-\infty) = 0 and F(xN+1)=F(+)=1F(x_{N+1}) = F(+\infty) = 1.) When FF is continuous, such a partition always exists: by the intermediate value theorem we can pick xjx_j with F(xj)=j/(N+1)F(x_j) = j/(N+1), taking NN large enough that 1/(N+1)<ε1/(N+1) < \varepsilon. If FF has jumps of size ε\varepsilon or larger, no partition can make every increment small; the standard remedy is to place grid points at the large jumps and track the left limits F(xj)F(x_j^-) as well, applying the SLLN also to Fn(xj)=1ni=1n1(Xi<xj)F_n(x_j^-) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}(X_i < x_j). We present the continuous case below; the general case is identical apart from this extra bookkeeping (see Billingsley 2012 or Durrett 2019).

Step 3: Bound the supremum using the grid. For any xRx \in \mathbb{R}, there exists j{1,,N+1}j \in \{1, \ldots, N+1\} such that xj1x<xjx_{j-1} \leq x < x_j. Since both FnF_n and FF are non-decreasing:

Fn(xj1)Fn(x)Fn(xj)andF(xj1)F(x)F(xj)F_n(x_{j-1}) \leq F_n(x) \leq F_n(x_j) \quad \text{and} \quad F(x_{j-1}) \leq F(x) \leq F(x_j)

For the upper bound on Fn(x)F(x)F_n(x) - F(x):

F_n(x) - F(x) \leq F_n(x_j) - F(x_{j-1}) = [F_n(x_j) - F(x_j)] + [F(x_j) - F(x_{j-1})] \leq [F_n(x_j) - F(x_j)] + \varepsilon

For the upper bound on F(x)Fn(x)F(x) - F_n(x):

F(x) - F_n(x) \leq F(x_j) - F_n(x_{j-1}) = [F(x_j) - F(x_{j-1})] + [F(x_{j-1}) - F_n(x_{j-1})] \leq \varepsilon + [F(x_{j-1}) - F_n(x_{j-1})]

Combining both bounds:

Fn(x)F(x)max0jN+1Fn(xj)F(xj)+ε|F_n(x) - F(x)| \leq \max_{0 \leq j \leq N+1} |F_n(x_j) - F(x_j)| + \varepsilon

Taking the supremum over all xx:

supxRFn(x)F(x)max0jN+1Fn(xj)F(xj)+ε\sup_{x \in \mathbb{R}} |F_n(x) - F(x)| \leq \max_{0 \leq j \leq N+1} |F_n(x_j) - F(x_j)| + \varepsilon

Step 4: Apply the SLLN at finitely many points. By Step 1, Fn(xj)a.s.F(xj)F_n(x_j) \xrightarrow{\text{a.s.}} F(x_j) for each j=0,1,,N+1j = 0, 1, \ldots, N+1. Since the intersection of finitely many probability-1 events has probability 1, there exists a single event Ωε\Omega_\varepsilon with P(Ωε)=1P(\Omega_\varepsilon) = 1 on which Fn(xj)F(xj)F_n(x_j) \to F(x_j) for all jj simultaneously.

On Ωε\Omega_\varepsilon, for all sufficiently large nn:

max0jN+1Fn(xj)F(xj)<ε\max_{0 \leq j \leq N+1} |F_n(x_j) - F(x_j)| < \varepsilon

Combining with the bound from Step 3: for all nn larger than some N(ω,ε)N(\omega, \varepsilon),

supxRFn(x)F(x)<ε+ε=2ε\sup_{x \in \mathbb{R}} |F_n(x) - F(x)| < \varepsilon + \varepsilon = 2\varepsilon

The crucial point is that NN depends on ω\omega (since the SLLN convergence is sample-path dependent) but not on xx — the same NN works for all xx simultaneously, because we reduced the uncountable supremum to a finite maximum in Step 3.

Step 5: Remove the dependence on ε\varepsilon. Steps 2—4 showed that for each fixed ε>0\varepsilon > 0, there exists a probability-1 set Ωε\Omega_\varepsilon on which supxFn(x)F(x)<2ε\sup_x |F_n(x) - F(x)| < 2\varepsilon eventually. But Ωε\Omega_\varepsilon depends on ε\varepsilon (because the grid points xjx_j depend on ε\varepsilon). To get a single probability-1 set that works for all ε\varepsilon, we use a standard countable-intersection trick.

Let Ω=m=1Ω1/m\Omega^* = \bigcap_{m=1}^{\infty} \Omega_{1/m}. This is a countable intersection of probability-1 events, so P(Ω)=1P(\Omega^*) = 1 (by countable subadditivity of probability). On Ω\Omega^*, for every integer m1m \geq 1, we have supxFn(x)F(x)<2/m\sup_x |F_n(x) - F(x)| < 2/m for all sufficiently large nn. Since mm is arbitrary:

supxRFn(x)F(x)0on Ω\sup_{x \in \mathbb{R}} |F_n(x) - F(x)| \to 0 \quad \text{on } \Omega^*

Therefore supxFn(x)F(x)a.s.0\sup_x |F_n(x) - F(x)| \xrightarrow{\text{a.s.}} 0. \square

The proof’s strategy — reduce an uncountable supremum to a finite maximum plus a discretization error — is a template that recurs throughout empirical process theory. The monotonicity of CDFs is what makes the discretization work: between two grid points, neither FnF_n nor FF can oscillate freely, so the error from approximating the continuous supremum with a finite maximum is controlled by the mesh of the partition.

Notice what the proof does not require: any assumption on the distribution FF. The Glivenko—Cantelli theorem holds for every CDF — continuous, discrete, mixed, any distribution whatsoever. This is because the proof uses only three properties of CDFs: they are non-decreasing, they go from 0 to 1, and they have at most countably many discontinuities. The SLLN provides the pointwise convergence; monotonicity and the ε\varepsilon-net argument upgrade it to uniform convergence.

The quantity Dn=supxFn(x)F(x)D_n = \sup_x |F_n(x) - F(x)| is known as the Kolmogorov—Smirnov (KS) statistic. It measures the worst-case discrepancy between the empirical and true CDFs — the maximum vertical gap between the step function FnF_n and the smooth curve FF. The Glivenko—Cantelli theorem says Dn0D_n \to 0 a.s.; the Dvoretzky—Kiefer—Wolfowitz (DKW) inequality (stated in Example 5 below) gives exponential concentration around zero; and the Kolmogorov distribution gives the exact limiting distribution of nDn\sqrt{n} D_n, which is the foundation of the KS test for goodness of fit. Topic 29 §29.5 and §29.8 develop DKW with Massart’s tight constant and the asymptotic Kolmogorov distribution in full.

Example 5 ECDF of Normal(0,1): KS Statistic Decay

Generate nn observations from N(0,1)N(0, 1). The empirical CDF Fn(x)F_n(x) is a step function that approaches the smooth CDF Φ(x)=P(Zx)\Phi(x) = P(Z \leq x). The Kolmogorov—Smirnov (KS) statistic is:

Dn=supxRFn(x)Φ(x)D_n = \sup_{x \in \mathbb{R}} |F_n(x) - \Phi(x)|

The Glivenko—Cantelli theorem says Dn0D_n \to 0 almost surely. But how fast? The Dvoretzky—Kiefer—Wolfowitz (DKW) inequality gives an exponential rate:

P(Dn>ε)2exp(2nε2)P(D_n > \varepsilon) \leq 2\exp(-2n\varepsilon^2)

At n=100n = 100, Dn0.1D_n \approx 0.1 with high probability. At n=1000n = 1000, Dn0.03D_n \approx 0.03. At n=10000n = 10000, Dn0.01D_n \approx 0.01. The decay is at rate 1/n1/\sqrt{n} — specifically, nDn\sqrt{n}\, D_n converges in distribution to the Kolmogorov distribution (the supremum of a Brownian bridge), which is the basis of the Kolmogorov—Smirnov test for goodness of fit.
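The decay is easy to reproduce numerically. The sketch below (our own helper functions, with a 95% DKW band chosen for comparison) computes DnD_n exactly by checking the gap just before and just after each jump of FnF_n, then prints it alongside the DKW bound √(ln(2/0.05)/(2n)).

```python
import numpy as np
from math import erf, sqrt

def standard_normal_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def ks_statistic(sample, cdf):
    """D_n = sup_x |F_n(x) - F(x)|, attained at (or just before) the jumps of F_n."""
    xs = np.sort(sample)
    n = len(xs)
    F = np.array([cdf(v) for v in xs])
    upper = np.max(np.arange(1, n + 1) / n - F)    # F_n - F just after each jump
    lower = np.max(F - np.arange(0, n) / n)         # F - F_n just before each jump
    return max(upper, lower)

rng = np.random.default_rng(1)
for n in [100, 1_000, 10_000]:
    d_n = ks_statistic(rng.standard_normal(n), standard_normal_cdf)
    dkw_95 = sqrt(np.log(2 / 0.05) / (2 * n))       # DKW: P(D_n > this) <= 0.05
    print(f"n = {n:>6}   D_n = {d_n:.4f}   DKW 95% bound = {dkw_95:.4f}")
```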

The KS test is one of the most widely used nonparametric tests: to test whether a sample comes from a particular distribution FF, compute DnD_n and reject if it is too large (compared to the quantiles of the Kolmogorov distribution). The Glivenko—Cantelli theorem is what makes this test valid — it guarantees that under the null hypothesis H0:XiFH_0: X_i \sim F, the KS statistic converges to zero, so large values of DnD_n are evidence against H0H_0.

The two-sample KS test compares two empirical CDFs FnF_n and GmG_m from potentially different distributions. The test statistic supxFn(x)Gm(x)\sup_x |F_n(x) - G_m(x)| converges to supxF(x)G(x)\sup_x |F(x) - G(x)| by Glivenko—Cantelli applied to each sample, so the test can detect any difference between the two distributions — not just differences in mean or variance, but differences in shape, skewness, or tail behavior. This omnibus property is what makes the KS test valuable as a diagnostic tool in model validation.

In machine learning, the KS test has several applications:

  • Dataset shift detection: Compare the ECDF of a feature in the training data to the ECDF in the production data. If the KS statistic is large, the distribution has shifted, and the model may need retraining.
  • Model validation: After training a generative model, compare the ECDF of generated samples to the ECDF of real data. The KS statistic measures how well the model captures the true distribution.
  • Feature engineering: The KS statistic between the ECDFs of a feature conditioned on different classes measures the feature’s discriminative power.

The Glivenko—Cantelli theorem guarantees that all these comparisons are meaningful: both ECDFs converge to their respective true CDFs, so the KS statistic consistently estimates the true discrepancy.
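As a sketch of the drift-detection use case (synthetic data, thresholds, and shift sizes are ours, not a prescribed recipe), the two-sample KS statistic can be computed directly from the two ECDFs evaluated on the pooled sample:

```python
import numpy as np

def two_sample_ks(a, b):
    """sup_x |F_n(x) - G_m(x)|, evaluated on the pooled sample (where the sup is attained)."""
    a, b = np.sort(a), np.sort(b)
    pooled = np.concatenate([a, b])
    F = np.searchsorted(a, pooled, side="right") / len(a)
    G = np.searchsorted(b, pooled, side="right") / len(b)
    return np.max(np.abs(F - G))

rng = np.random.default_rng(7)
train_feature = rng.normal(0.0, 1.0, size=5_000)
prod_same     = rng.normal(0.0, 1.0, size=5_000)    # production data, no shift
prod_shifted  = rng.normal(0.3, 1.2, size=5_000)    # production data after mild drift

print("KS(train, unshifted production):", two_sample_ks(train_feature, prod_same))
print("KS(train, shifted production)  :", two_sample_ks(train_feature, prod_shifted))
# Both statistics estimate sup_x |F(x) - G(x)|: near 0 when the distributions agree,
# bounded away from 0 under drift. (scipy.stats.ks_2samp returns the same statistic
# together with a p-value.)
```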

Glivenko-Cantelli: empirical CDF approaching the true CDF, and decay of the KS statistic with n.
Interactive: Glivenko-Cantelli Theorem Explorer. Panels: a CDF comparison of F(x) versus Fₙ(x) with the gap sup|Fₙ − F| highlighted, and the KS statistic Dₙ = sup|Fₙ(x) − F(x)| plotted against n together with the DKW bound and a 1/√n reference curve.

10.8 Convergence Rates

The SLLN says Xˉnμ\bar{X}_n \to \mu almost surely — but how fast? This is not just a theoretical curiosity. In practice, the convergence rate determines how many samples you need to achieve a desired accuracy. A Monte Carlo estimate with n=100n = 100 samples might be adequate if the convergence is fast, or catastrophically unreliable if it is slow. The difference between polynomial and exponential convergence rates can mean the difference between n=10, ⁣000n = 10,\!000 and n=100n = 100 samples for the same accuracy guarantee.

Three results give increasingly precise answers, forming a hierarchy of convergence rate statements:

  1. Chebyshev’s inequality — polynomial rate, weakest but most general
  2. Hoeffding’s inequality — exponential rate for bounded variables
  3. Law of the iterated logarithm — the exact a.s. oscillation envelope

The Chebyshev rate. From Theorem 10.1, P(Xˉnμ>ε)σ2/(nε2)P(|\bar{X}_n - \mu| > \varepsilon) \leq \sigma^2/(n\varepsilon^2). This is a polynomial decay: the probability of an ε\varepsilon-deviation decreases like 1/n1/n. For a fixed confidence level δ\delta, we need nσ2/(ε2δ)n \geq \sigma^2/(\varepsilon^2 \delta) samples to guarantee P(Xˉnμ>ε)δP(|\bar{X}_n - \mu| > \varepsilon) \leq \delta. This is often pessimistic — Chebyshev uses only the mean and variance, ignoring all other information about the distribution. For bounded or sub-Gaussian variables, the true rate is much faster.

Theorem 10.7 (Hoeffding's Inequality — Preview)

Let X1,,XnX_1, \ldots, X_n be independent with Xi[ai,bi]X_i \in [a_i, b_i] almost surely. Then for any t>0t > 0:

P(XˉnE[Xˉn]t)2exp ⁣(2n2t2i=1n(biai)2)P(|\bar{X}_n - E[\bar{X}_n]| \geq t) \leq 2\exp\!\left(\frac{-2n^2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right)

For iid variables with Xi[a,b]X_i \in [a, b]:

P(Xˉnμε)2exp ⁣(2nε2(ba)2)P(|\bar{X}_n - \mu| \geq \varepsilon) \leq 2\exp\!\left(\frac{-2n\varepsilon^2}{(b-a)^2}\right)

The improvement over Chebyshev is dramatic. Chebyshev gives polynomial decay: P(Xˉnμε)σ2/(nε2)P(|\bar{X}_n - \mu| \geq \varepsilon) \leq \sigma^2/(n\varepsilon^2), which is O(1/n)O(1/n). Hoeffding gives exponential decay: 2exp(2nε2/(ba)2)2\exp(-2n\varepsilon^2/(b-a)^2), which goes to zero exponentially fast in nn. The practical difference is enormous — it determines how many samples you need for a given confidence level. For ε=0.1\varepsilon = 0.1 and [a,b]=[0,1][a,b] = [0,1] (so σ21/4\sigma^2 \leq 1/4):

n | Chebyshev bound σ²/(nε²) | Hoeffding bound 2exp(−2nε²)
100 | 0.250 | 0.271
500 | 0.050 | 9.1 × 10⁻⁵
1000 | 0.025 | 4.1 × 10⁻⁹
5000 | 0.005 | 7.4 × 10⁻⁴⁴

At n=100n = 100 the two bounds are comparable (Chebyshev is even marginally smaller), but the gap explodes as nn grows. At n=1000n = 1000, Chebyshev says the probability of a 0.1 deviation is at most 2.5%, while Hoeffding says it is at most about four in a billion. At n=5000n = 5000, Chebyshev gives 0.5% while Hoeffding gives roughly 104310^{-43} (effectively zero).

The practical implication: if you need a confidence guarantee of P(Xˉnμ0.01)0.05P(|\bar{X}_n - \mu| \geq 0.01) \leq 0.05 for bounded [0,1][0,1] variables, Chebyshev requires nσ2/(0.0120.05)=200, ⁣000σ2n \geq \sigma^2/(0.01^2 \cdot 0.05) = 200,\!000\,\sigma^2 samples (50, ⁣00050,\!000 at the worst-case variance σ2=1/4\sigma^2 = 1/4). Hoeffding requires nln(2/0.05)/(20.012)18, ⁣400n \geq \ln(2/0.05)/(2 \cdot 0.01^2) \approx 18,\!400 samples, roughly 2.7 times fewer. And the gap widens rapidly as the confidence level δ\delta shrinks, because Chebyshev's required sample size grows like 1/δ1/\delta while Hoeffding's grows only like ln(1/δ)\ln(1/\delta). This is why concentration inequalities, not Chebyshev, power practical sample-size calculations.
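The numbers above are easy to recompute. Here is a minimal sketch (constants chosen to match the worked example: ε = 0.1 for the table, then ε = 0.01 and δ = 0.05 for the sample sizes, a [0, 1] range, and the worst-case variance 1/4):

```python
import numpy as np

eps, sigma2, width = 0.1, 0.25, 1.0    # tolerance, worst-case variance, and b - a for [0, 1] data

for n in [100, 500, 1_000, 5_000]:
    chebyshev = sigma2 / (n * eps ** 2)
    hoeffding = 2 * np.exp(-2 * n * eps ** 2 / width ** 2)
    print(f"n = {n:>5}   Chebyshev <= {chebyshev:.3f}   Hoeffding <= {hoeffding:.2e}")

# Sample sizes needed for P(|X̄_n - mu| >= 0.01) <= 0.05:
eps, delta = 0.01, 0.05
n_chebyshev = sigma2 / (eps ** 2 * delta)                       # sigma^2 / (eps^2 * delta)
n_hoeffding = width ** 2 * np.log(2 / delta) / (2 * eps ** 2)   # (b-a)^2 ln(2/delta) / (2 eps^2)
print(f"Chebyshev needs n >= {n_chebyshev:,.0f}, Hoeffding needs n >= {n_hoeffding:,.0f}")
```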

The full proof of Hoeffding’s inequality uses the Chernoff bounding technique: apply Markov’s inequality to et(Xμ)e^{t(X - \mu)} instead of Xμ|X - \mu|, then optimize over tt. The exponential transform converts additive structure (the sum of independent variables) into multiplicative structure (a product of MGFs), which is much easier to control. The boundedness assumption Xi[ai,bi]X_i \in [a_i, b_i] enters through the Hoeffding lemma, which bounds the MGF of a bounded random variable. The full development is the subject of Large Deviations & Tail Bounds — the Chernoff technique, Hoeffding’s lemma, and the complete proof are in §§12.2–12.3.

The third result tells us the exact a.s. rate at which Xˉn\bar{X}_n fluctuates around μ\mu.

Theorem 10.8 (Law of the Iterated Logarithm — Statement)

Let X1,X2,X_1, X_2, \ldots be iid with E[X1]=μE[X_1] = \mu and Var(X1)=σ2<\text{Var}(X_1) = \sigma^2 < \infty. Then almost surely:

lim supnXˉnμσ2lnlnn/n=1andlim infnXˉnμσ2lnlnn/n=1\limsup_{n \to \infty} \frac{\bar{X}_n - \mu}{\sigma\sqrt{2\ln\ln n / n}} = 1 \quad \text{and} \quad \liminf_{n \to \infty} \frac{\bar{X}_n - \mu}{\sigma\sqrt{2\ln\ln n / n}} = -1

The sample mean oscillates within the envelope ±σ2lnlnn/n\pm\sigma\sqrt{2\ln\ln n / n} and touches both boundaries infinitely often.

The LIL gives an astonishingly precise description of the a.s. behavior. The SLLN says Xˉnμ0\bar{X}_n - \mu \to 0. The LIL says exactly how it approaches zero: the fluctuations are of order lnlnn/n\sqrt{\ln\ln n / n}, which is slightly larger than the CLT scale of 1/n1/\sqrt{n} (by a factor of 2lnlnn\sqrt{2\ln\ln n}). The sample path oscillates within this shrinking envelope, touching the upper and lower boundaries infinitely many times — it never stays on one side permanently.

The double logarithm lnlnn\ln\ln n grows extraordinarily slowly: lnln(10)0.83\ln\ln(10) \approx 0.83, lnln(100)1.53\ln\ln(100) \approx 1.53, lnln(106)2.63\ln\ln(10^6) \approx 2.63, lnln(10100)5.44\ln\ln(10^{100}) \approx 5.44. So the LIL envelope σ2lnlnn/n\sigma\sqrt{2\ln\ln n / n} is only slightly wider than the CLT scale σ/n\sigma/\sqrt{n}. In practice, the LIL is more of a theoretical landmark than a computational tool — it tells us that the SLLN convergence rate cannot be faster than lnlnn/n\sqrt{\ln\ln n / n}, establishing a fundamental limit on how quickly sample means can settle down.
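For a feel of how slowly the envelope widens relative to the CLT scale, the following few lines (constants are illustrative) print both for a range of nn; their ratio is just √(2 ln ln n):

```python
import numpy as np

sigma = 1.0
for n in [10 ** 2, 10 ** 4, 10 ** 6, 10 ** 8]:
    clt_scale = sigma / np.sqrt(n)                               # typical CLT-scale deviation
    lil_envelope = sigma * np.sqrt(2 * np.log(np.log(n)) / n)    # a.s. oscillation envelope
    print(f"n = 1e{int(np.log10(n))}   CLT scale = {clt_scale:.2e}   "
          f"LIL envelope = {lil_envelope:.2e}   ratio = {lil_envelope / clt_scale:.2f}")
# The ratio is sqrt(2 ln ln n): about 1.7 at n = 100 and still only about 2.4 at n = 1e8.
```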

The proof of the LIL is deep (it uses the second Borel—Cantelli lemma and exponential tail bounds) and beyond our scope here. We state it to complete the picture of how the LLN, the CLT, and the LIL form a hierarchy of increasingly precise statements about Xˉn\bar{X}_n.

Remark 5 The Hierarchy: SLLN, LIL, CLT

Three results describe Xˉnμ\bar{X}_n - \mu at different levels of precision:

SLLN (Theorem 10.5): Xˉnμ0\bar{X}_n - \mu \to 0 almost surely. The fluctuations vanish. This is the qualitative statement — convergence happens. It answers the question: does Xˉn\bar{X}_n converge?

Law of the Iterated Logarithm (Theorem 10.8): The fluctuations are exactly of order σ2lnlnn/n\sigma\sqrt{2\ln\ln n / n}. This is the quantitative envelope — how tightly the sample path concentrates around μ\mu. It answers: how closely does Xˉn\bar{X}_n track μ\mu?

Central Limit Theorem: n(Xˉnμ)/σdN(0,1)\sqrt{n}(\bar{X}_n - \mu)/\sigma \xrightarrow{d} N(0, 1). The fluctuations at scale 1/n1/\sqrt{n} are approximately Normal. This is the distributional statement — the shape of the fluctuations. It answers: what do the deviations look like at any given nn?

Each tells you more. The SLLN says “it converges.” The LIL says “the deviations decay like lnlnn/n\sqrt{\ln\ln n / n}.” The CLT says “at the scale 1/n1/\sqrt{n}, the deviations look Gaussian.”

These are not competing statements — they complement each other. The SLLN is a pathwise result (it says something about almost every ωΩ\omega \in \Omega). The LIL is also pathwise (it gives the exact a.s. oscillation). The CLT is distributional (it says what the distribution of n(Xˉnμ)\sqrt{n}(\bar{X}_n - \mu) looks like for large nn). Concentration inequalities like Hoeffding (Theorem 10.7) are tail probability bounds (they say how fast P(Xˉnμ>ε)P(|\bar{X}_n - \mu| > \varepsilon) decays). Large Deviations & Tail Bounds develops these exponential-rate guarantees — the most useful results for practical sample-size calculations.

Convergence rates: Chebyshev (1/n) versus Hoeffding (exponential) bounds, with the LIL envelope around sample paths.
Interactive: Convergence Rate Explorer. Panels: sample paths of X̄ₙ − μ with the ±σ/√n scale, the LIL envelope ±σ√(2 ln ln n / n), and the ±ε band; and the probability bounds P(|X̄ₙ − μ| > ε) for fixed ε = 0.10, comparing the Chebyshev bound σ²/(nε²), the Hoeffding bound 2exp(−2nε²/(b−a)²), and the empirical frequency.

10.9 Connections to Machine Learning

The law of large numbers is not an abstract mathematical curiosity. It is the foundation of empirical machine learning. Every time you compute a statistic from data — a training loss, a validation accuracy, a feature mean for normalization, a gradient estimate — you are implicitly invoking the LLN. The WLLN says your estimate is probably close to the truth. The SLLN says your single run converges to the truth. The Glivenko—Cantelli theorem says the entire empirical distribution converges, not just individual statistics.

This section makes the connections explicit, tracing the LLN through three central pillars of modern ML: Monte Carlo estimation (where the SLLN justifies single-run simulation), stochastic optimization (where the SLLN justifies Polyak averaging), and empirical risk minimization (where the Glivenko—Cantelli theorem justifies training on finite data).

The theme is always the same: we observe a finite sample, compute a statistic (a sample mean, a gradient average, an empirical risk), and trust that it approximates the corresponding population quantity. The LLN is the mathematical warrant for that trust.

It is worth pausing to appreciate the remarkable generality of this foundation. The LLN does not care what the random variables represent — pixel intensities, word embeddings, gradient components, loss values, feature activations. It does not care about the dimension of the underlying space. It does not care about the specific distribution, as long as the mean exists. This universality is what makes the LLN the single most important theorem in applied statistics.

Example 6 Monte Carlo Estimation

To compute E[g(X)]E[g(X)] for some function gg and random variable XX, draw X1,,XnX_1, \ldots, X_n iid from the distribution of XX and report:

θ^n=1ni=1ng(Xi)\hat{\theta}_n = \frac{1}{n}\sum_{i=1}^n g(X_i)

The SLLN guarantees θ^nE[g(X)]\hat{\theta}_n \to E[g(X)] almost surely — your single simulation run converges to the truth. This is the entire theoretical justification for Monte Carlo estimation.

Concrete example: Suppose XN(2,1)X \sim N(2, 1) and you want to estimate θ=E[X2]\theta = E[X^2]. The true value is Var(X)+(E[X])2=1+4=5\text{Var}(X) + (E[X])^2 = 1 + 4 = 5. Drawing X1,,XnX_1, \ldots, X_n iid from N(2,1)N(2,1) and computing θ^n=1ni=1nXi2\hat{\theta}_n = \frac{1}{n}\sum_{i=1}^n X_i^2 gives an estimate that converges to 5 almost surely by the SLLN (since E[X2]=5<E[X^2] = 5 < \infty). The CLT will tell us that the error is approximately N(0,Var(X2)/n)N(0, \text{Var}(X^2)/n), giving confidence intervals. And concentration inequalities give non-asymptotic bounds like P(θ^n5>ε)2exp(cnε2)P(|\hat{\theta}_n - 5| > \varepsilon) \leq 2\exp(-cn\varepsilon^2) for bounded transformations — see Hoeffding’s inequality (Theorem 12.3).
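Here is a minimal single-run sketch of that estimate (seed and sample sizes are arbitrary): one stream of N(2, 1) draws, one running average of the squares, converging to 5.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 5.0                          # E[X^2] = Var(X) + E[X]^2 = 1 + 4 for X ~ N(2, 1)

x = rng.normal(2.0, 1.0, size=100_000)
running_mean = np.cumsum(x ** 2) / np.arange(1, len(x) + 1)   # theta_hat_n for every n at once

for n in [100, 1_000, 10_000, 100_000]:
    est = running_mean[n - 1]
    print(f"n = {n:>6}   theta_hat_n = {est:.4f}   error = {est - true_value:+.4f}")
# One run, one sample path: the SLLN says this particular trajectory converges to 5
# with probability 1.
```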

The Monte Carlo method extends to high-dimensional integrals where deterministic quadrature fails: if XRdX \in \mathbb{R}^d and dd is large, grid-based methods require O(εd)O(\varepsilon^{-d}) evaluations for accuracy ε\varepsilon (the curse of dimensionality), while Monte Carlo requires only O(ε2)O(\varepsilon^{-2}) evaluations regardless of dd — courtesy of the SLLN and CLT.

Notice the role of the SLLN versus the WLLN here. The WLLN would tell you that for each fixed nn, the probability of your estimate being far from E[X2]E[X^2] is small. But you are looking at a single run — a single sequence of estimates θ^1,θ^2,θ^3,\hat{\theta}_1, \hat{\theta}_2, \hat{\theta}_3, \ldots — and you want this sequence to converge. That is exactly what the SLLN guarantees: with probability 1, your sequence converges to the truth.

The SLLN also extends to MCMC: if the Markov chain is ergodic (irreducible, aperiodic, positive recurrent), then time averages converge to space averages almost surely. This is the ergodic theorem, a generalization of Theorem 10.5 (Kolmogorov/Etemadi SLLN) to dependent sequences — the dependence between successive states of the Markov chain is handled by the ergodic property rather than independence. Topic 26 §26.7 Theorem 5 develops this in the MCMC context. See also formalML: Monte Carlo Methods for applied extensions.

Example 7 Stochastic Gradient Descent and Polyak Averaging

In SGD, we approximate the full gradient R(θ)=E[(Y,fθ(X))]\nabla R(\theta) = E[\nabla \ell(Y, f_\theta(X))] with a single-sample gradient (Yt,fθ(Xt))\nabla \ell(Y_t, f_\theta(X_t)). The key convergence result says: with step sizes satisfying the Robbins—Monro conditions (t=1αt=\sum_{t=1}^{\infty} \alpha_t = \infty and t=1αt2<\sum_{t=1}^{\infty} \alpha_t^2 < \infty), the iterates θt\theta_t converge to a local minimum almost surely.

Polyak averaging replaces the final iterate θT\theta_T with the running mean:

θˉT=1Tt=1Tθt\bar{\theta}_T = \frac{1}{T}\sum_{t=1}^T \theta_t

This exploits the SLLN: the running mean of the iterates (which fluctuate around the optimum) converges to the optimum almost surely, and the Polyak-averaged estimator achieves the optimal asymptotic variance — it is as efficient as the best possible estimator. The SLLN is what makes this averaging trick valid.

The connection between SGD and the LLN runs deeper than the averaging trick. At each step, the stochastic gradient (Yt,fθ(Xt))\nabla \ell(Y_t, f_\theta(X_t)) is a noisy estimate of the true gradient R(θ)\nabla R(\theta). The noise averages out over time — by the LLN applied to the gradient estimates — and the iterates converge. The Robbins—Monro conditions encode two competing requirements:

  • t=1αt=\sum_{t=1}^{\infty} \alpha_t = \infty: the step sizes are large enough in total to reach the optimum from any starting point (the algorithm doesn’t stop too early).
  • t=1αt2<\sum_{t=1}^{\infty} \alpha_t^2 < \infty: the step sizes shrink fast enough that the gradient noise averages out (the algorithm doesn’t oscillate forever).

The common choice αt=1/t\alpha_t = 1/t satisfies both: 1/t=\sum 1/t = \infty (harmonic series diverges) and 1/t2=π2/6<\sum 1/t^2 = \pi^2/6 < \infty (Basel problem).
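The averaging trick is easy to see on a toy problem. The sketch below (our own toy risk, step schedule, and constants) runs SGD on R(θ) = E[(θ − X)²] with X of mean 3 and standard deviation 2, so the minimizer is θ* = 3, using the Robbins-Monro-compliant schedule αₜ = 0.75·t^(−0.6) (the sum diverges, the sum of squares converges); the last iterate typically carries visible step-size noise while the Polyak average of the whole trajectory is much steadier.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, T = 3.0, 2.0, 20_000   # population risk R(theta) = E[(theta - X)^2], minimized at theta* = mu

theta = 0.0
iterates = np.empty(T)
for t in range(1, T + 1):
    x_t = rng.normal(mu, sigma)                 # one fresh sample per step
    grad = 2.0 * (theta - x_t)                   # unbiased estimate of R'(theta) = 2(theta - mu)
    alpha_t = 0.75 * t ** (-0.6)                 # Robbins-Monro: sum = inf, sum of squares < inf
    theta -= alpha_t * grad
    iterates[t - 1] = theta

print(f"last iterate   theta_T   = {iterates[-1]:.4f}")
print(f"Polyak average theta_bar = {iterates.mean():.4f}   (optimum theta* = {mu})")
# The last iterate still fluctuates at the scale of the current step size; averaging the
# trajectory washes that noise out, which is the SLLN at work on the iterates.
```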

See formalML: Stochastic Gradient Descent for the full convergence analysis.

Example 8 Empirical Risk Minimization and Uniform Convergence

In supervised learning, the empirical risk of a model fθf_\theta on data (X1,Y1),,(Xn,Yn)(X_1, Y_1), \ldots, (X_n, Y_n) is:

R^n(θ)=1ni=1n(Yi,fθ(Xi))\hat{R}_n(\theta) = \frac{1}{n}\sum_{i=1}^n \ell(Y_i, f_\theta(X_i))

For each fixed θ\theta, R^n(θ)\hat{R}_n(\theta) is a sample mean, so the SLLN gives R^n(θ)a.s.R(θ)=E[(Y,fθ(X))]\hat{R}_n(\theta) \xrightarrow{\text{a.s.}} R(\theta) = E[\ell(Y, f_\theta(X))]. But ERM selects θ^n=argminθR^n(θ)\hat{\theta}_n = \arg\min_\theta \hat{R}_n(\theta), and for this to be consistent we need convergence that is uniform over θ\theta:

supθR^n(θ)R(θ)a.s.0\sup_\theta |\hat{R}_n(\theta) - R(\theta)| \xrightarrow{\text{a.s.}} 0

This is exactly the type of result that the Glivenko—Cantelli theorem provides for indicator functions 1(Xx)\mathbf{1}(X \leq x), and its generalization to richer function classes (via Vapnik—Chervonenkis theory — proved at Topic 32 §32.4) is what makes ERM work. The function class {(x,y)(y,fθ(x)):θΘ}\{(x, y) \mapsto \ell(y, f_\theta(x)) : \theta \in \Theta\} must be “small enough” — specifically, its VC dimension must be finite — for uniform convergence to hold.

The logical chain from the results in this topic to statistical learning theory is:

  1. Glivenko—Cantelli (Theorem 10.6): uniform LLN for indicators 1(Xx)\mathbf{1}(X \leq x).
  2. VC theory (generalization): uniform LLN for function classes with finite VC dimension.
  3. ERM consistency: the empirical risk minimizer converges to the population risk minimizer.
  4. PAC learning bounds: P(R^n(θ^n)R(θ^n)>ε)δP(|\hat{R}_n(\hat{\theta}_n) - R(\hat{\theta}_n)| > \varepsilon) \leq \delta with n=O(d/ε2ln(1/δ))n = O(d/\varepsilon^2 \cdot \ln(1/\delta)) samples, where dd is the VC dimension.

Each step in the chain uses the previous step as a building block. The Glivenko—Cantelli theorem is step 1 — the simplest case — and everything else follows by generalization.
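A toy version of the uniform convergence in that chain can be simulated directly with a finite class of threshold classifiers. In the sketch below (the data model, noise rate, and grid are ours), the labels are 1(X > 0) flipped with probability 0.1; under this model the population risk of f_c(x) = 1(x > c) has the closed form R(c) = 0.1 + 0.8·|Φ(c) − 0.5|, and the printed quantity is the worst-case gap between empirical and population risk over the whole class.

```python
import numpy as np
from math import erf, sqrt

def Phi(c):
    return 0.5 * (1.0 + erf(c / sqrt(2.0)))

rng = np.random.default_rng(3)
thresholds = np.linspace(-2, 2, 81)                  # finite class of classifiers f_c(x) = 1(x > c)

# Data model: X ~ N(0, 1), Y = 1(X > 0) with labels flipped with probability 0.1.
# Population risk under this model: R(c) = 0.1 + 0.8 * |Phi(c) - 0.5|.
R_true = 0.1 + 0.8 * np.abs(np.array([Phi(c) for c in thresholds]) - 0.5)

def empirical_risks(n):
    x = rng.standard_normal(n)
    y = (x > 0) ^ (rng.random(n) < 0.1)              # noisy labels
    preds = x[None, :] > thresholds[:, None]          # predictions of every f_c at once
    return (preds != y[None, :]).mean(axis=1)         # empirical 0-1 risk of each f_c

for n in [100, 1_000, 10_000, 100_000]:
    gap = np.max(np.abs(empirical_risks(n) - R_true))
    print(f"n = {n:>7}   sup_c |R_hat_n(c) - R(c)| = {gap:.4f}")
# The worst-case gap over the entire class shrinks (roughly like 1/sqrt(n)):
# exactly the uniform convergence that makes empirical risk minimization consistent.
```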

See formalML: Empirical Risk Minimization and formalML: Generalization Bounds for the full development.

Remark 6 Glivenko-Cantelli as the Starting Point for Empirical Process Theory

The Glivenko—Cantelli theorem is the simplest case of a uniform law of large numbers. It says that the empirical measure PnP_n (which assigns mass 1/n1/n to each observation) converges to the population measure PP uniformly over the class of half-lines {(,x]:xR}\{(-\infty, x] : x \in \mathbb{R}\}.

Empirical process theory (Topic 32) generalizes from half-lines to arbitrary function classes F\mathcal{F}, asking: when does supfFPnfPf0\sup_{f \in \mathcal{F}} |P_n f - Pf| \to 0? The answer involves the complexity of F\mathcal{F} — measured by the VC dimension, covering numbers, or metric entropy. The Glivenko—Cantelli class of half-lines has VC dimension 1, which is why the theorem holds without any distributional assumptions.

This is the bridge between the LLN (convergence for a single statistic) and statistical learning theory (uniform convergence over a class of statistics). The function class must be “learnable” (finite VC dimension or, more generally, finite metric entropy) for the uniform LLN to hold, and learnability is equivalent to the uniform LLN. This equivalence — the fundamental theorem of statistical learning — is one of the deepest results in the field.

The Glivenko—Cantelli theorem is the F={1(,x]:xR}\mathcal{F} = \{\mathbf{1}_{(-\infty, x]} : x \in \mathbb{R}\} case. This class has VC dimension 1 (any single point can be shattered, but no two points can be). With VC dimension 1, the uniform convergence holds for any distribution — exactly what Theorem 10.6 states.

The extension to higher dimensions is more subtle: in Rd\mathbb{R}^d, the class of half-spaces has VC dimension d+1d + 1, and the multivariate Glivenko—Cantelli theorem holds with convergence rate O(n1/2)O(n^{-1/2}) — the same rate as in one dimension, independent of dd. This is remarkable: the empirical CDF converges just as fast in high dimensions as in one dimension, at least for half-spaces. For richer function classes, the rate can degrade with dimension — this is the curse of dimensionality in statistical learning theory.

ML connections: Monte Carlo convergence, SGD with Polyak averaging, ERM uniform convergence

10.10 Summary

This topic traced the arc from the simplest weak law to the strongest convergence results available for sample means and empirical distributions. Here is the complete progression:

Result | Assumptions | Conclusion | Proof Tool
Theorem 10.1 (WLLN) | iid, \sigma^2 < \infty | \bar{X}_n \xrightarrow{P} \mu | Chebyshev
Theorem 10.2 (WLLN) | uncorrelated, \frac{1}{n^2}\sum \sigma_i^2 \to 0 | \bar{X}_n - \bar{\mu}_n \xrightarrow{P} 0 | Variance of sums
Theorem 10.3 (Khintchine) | iid, E[\lvert X \rvert] < \infty | \bar{X}_n \xrightarrow{P} \mu | Truncation + DCT
Theorem 10.4 (Kolmogorov) | independent, E[X_i] = 0 | P(\max\lvert S_k \rvert \geq \varepsilon) \leq \text{Var}(S_n)/\varepsilon^2 | Stopping time decomposition
Theorem 10.5 (SLLN) | pairwise independent, identically distributed, E[\lvert X \rvert] < \infty | \bar{X}_n \xrightarrow{\text{a.s.}} \mu | Etemadi (truncation + Kolmogorov + B-C)
Theorem 10.6 (Glivenko—Cantelli) | iid | \sup_x \lvert F_n(x) - F(x) \rvert \xrightarrow{\text{a.s.}} 0 | SLLN for indicators + monotonicity
Theorem 10.7 (Hoeffding) | independent, bounded | P(\lvert\bar{X}_n - \mu\rvert \geq \varepsilon) \leq 2e^{-2n\varepsilon^2/(b-a)^2} | (Proof: Topic 12, Theorem 12.3)
Theorem 10.8 (LIL) | iid, \sigma^2 < \infty | \limsup (\bar{X}_n - \mu)/\sigma\sqrt{2\ln\ln n/n} = 1 a.s. | (Statement only)

The logical dependencies form a clear ladder:

  • Chebyshev WLLN (Theorem 10.1) is the base case — three lines.
  • Uncorrelated WLLN (Theorem 10.2) generalizes by dropping “identically distributed.”
  • Khintchine’s WLLN (Theorem 10.3) generalizes by dropping “finite variance,” using truncation.
  • Kolmogorov’s maximal inequality (Theorem 10.4) provides the new tool needed to control paths, not just endpoints.
  • SLLN (Theorem 10.5) uses Kolmogorov + truncation + Borel—Cantelli to upgrade from “in probability” to “almost surely.”
  • Glivenko—Cantelli (Theorem 10.6) uses the SLLN + monotonicity to upgrade from “pointwise” to “uniform.”

At every stage, we relaxed an assumption or strengthened a conclusion. This is how convergence theory works: you start with the simplest result that captures the essential phenomenon, then systematically push the boundaries — either by weakening the hypotheses (making the theorem more broadly applicable) or by strengthening the conclusion (making it more useful). The progression from Theorem 10.1 to Theorem 10.6 exemplifies both directions simultaneously.

The minimal assumption for the LLN is E[X1]<E[|X_1|] < \infty — finite mean. This is both necessary and sufficient (Proposition 10.1). No weaker condition will do. The Cauchy distribution (E[X1]=E[|X_1|] = \infty) is the canonical counterexample: its sample mean does not converge, in any mode, to any finite limit. This clean dichotomy — finite mean gives everything, infinite mean gives nothing — is one of the most satisfying results in probability.

For the Glivenko—Cantelli theorem, even the finite-mean assumption is unnecessary: the empirical CDF converges uniformly to the true CDF for every distribution, no matter how heavy the tails. This universality is what makes nonparametric statistics possible — the ECDF is a valid, consistent estimator of any CDF, without distributional assumptions. It is the most assumption-free estimator in all of statistics.

The practical takeaway is reassuring: no matter what your data looks like — Normal, heavy-tailed, multimodal, discrete, continuous, or a bizarre mixture — the empirical CDF faithfully approximates the true CDF, and all statistics derived from the ECDF (quantiles, moments, probabilities of intervals) converge to their population counterparts.

Looking Back: Prerequisites in Action

This topic drew on three specific results from earlier in the curriculum:

  • Chebyshev’s inequality (Expectation & Moments, Theorem 12): the proof engine for all three WLLN variants and the starting point for the maximal inequality.
  • Borel—Cantelli lemma (Modes of Convergence): used in the SLLN proof (truncation error vanishes a.s.) and the Glivenko—Cantelli proof (finite intersection of probability-1 sets). The first Borel—Cantelli requires only P(An)<\sum P(A_n) < \infty; the second (used in Proposition 10.1) requires independence.
  • Dominated convergence theorem ( formalCalculus: Integration ): the key tool in every truncation argument — it justifies E[X1(Xn)]E[X]E[X \cdot \mathbf{1}(|X| \leq n)] \to E[X] and E[X21(Xn)]/n0E[X^2 \cdot \mathbf{1}(|X| \leq n)]/n \to 0. The DCT appeared in three steps of the Khintchine proof and two steps of the Etemadi proof.

The four modes of convergence from Topic 9 provided the conceptual framework: the WLLN is a convergence-in-probability statement, the SLLN is an almost-sure convergence statement, and the hierarchy theorem (a.s. implies in probability, Topic 9 Theorem 9.5) means the SLLN is strictly stronger. The Glivenko—Cantelli theorem adds a new dimension: not just convergence of a single statistic, but uniform convergence of an entire function.

What Comes Next

The law of large numbers says Xˉnμ\bar{X}_n \to \mu. Two natural follow-up questions remain:

  • What is the shape of the fluctuations? The Central Limit Theorem answers this: n(Xˉnμ)/σdN(0,1)\sqrt{n}(\bar{X}_n - \mu)/\sigma \xrightarrow{d} N(0, 1). The deviations of Xˉn\bar{X}_n from μ\mu, when rescaled by n\sqrt{n}, are approximately Normal — regardless of the underlying distribution of XX, as long as the variance is finite. This universality is what makes confidence intervals, hypothesis tests, and p-values work. The CLT says the LLN’s fluctuations are Gaussian in shape — and from that shape, all of classical statistics flows.

    The CLT is the distributional complement to the LLN’s qualitative statement. The LLN says Xˉnμ\bar{X}_n \to \mu; the CLT says how Xˉn\bar{X}_n deviates from μ\mu at scale 1/n1/\sqrt{n}.

  • How fast does the convergence happen? Large Deviations & Tail Bounds answers this with exponential-rate guarantees. Hoeffding’s inequality (previewed in Theorem 10.7, proved in Topic 12 as Theorem 12.3) is the simplest such bound; Bernstein’s inequality gives tighter results when the variance is small; sub-Gaussian and sub-exponential tail bounds handle the most common distributional families. These are the finite-sample tools that power statistical learning theory: PAC bounds, uniform convergence rates, and sample complexity calculations — all developed in Large Deviations & Tail Bounds.

    These concentration inequalities are the quantitative complement to the LLN’s qualitative statement. The LLN says Xˉnμ\bar{X}_n \to \mu; the concentration inequalities say exactly how many samples you need for Xˉnμε|\bar{X}_n - \mu| \leq \varepsilon with probability at least 1δ1 - \delta.

Together, the LLN (convergence), the CLT (the shape), and concentration inequalities (the rate) form the three pillars of asymptotic statistics. Everything in statistical inference — point estimation, confidence intervals, hypothesis testing, Bayesian asymptotics — rests on one or more of these pillars. The LLN says “your estimator works.” The CLT says “here is how to quantify your uncertainty.” Concentration inequalities say “here is exactly how many samples you need.”

With the LLN now proved, we have the first of these three pillars firmly in place. The sample mean converges to the truth — weakly under finite variance, strongly under finite mean, uniformly for the empirical CDF. The next two topics will complete the trinity.

The reader who has followed from Topic 1 (Sample Spaces) through Topic 10 has now traversed the full journey from “what is a random event?” to “sample means converge almost surely to population means.” This is the first genuinely asymptotic result in the curriculum — a statement about what happens as nn \to \infty — and it opens the door to all of statistical inference. Point estimation relies on the SLLN for consistency. Confidence intervals rely on the CLT (which refines the LLN). Hypothesis testing relies on the asymptotic distribution of test statistics (which follows from the CLT). Bayesian consistency (the posterior concentrating around the true parameter) relies on the SLLN for the log-likelihood. Every branch of inference grows from this root.


References

  1. Durrett, R. (2019). Probability: Theory and Examples (5th ed.). Cambridge University Press.
  2. Billingsley, P. (2012). Probability and Measure (Anniversary ed.). Wiley.
  3. Casella, G. & Berger, R. L. (2002). Statistical Inference (2nd ed.). Duxbury.
  4. Etemadi, N. (1981). An elementary proof of the strong law of large numbers. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 55, 119–122.
  5. Wasserman, L. (2004). All of Statistics. Springer.
  6. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.