intermediate 58 min read · April 22, 2026

Order Statistics & Quantiles

Topic 29 opens Track 8 — the nonparametric track. Three full proofs: Rényi's exponential-spacings representation, the Bahadur representation of the sample quantile (featured), and the partial proof of the Kolmogorov limit via Donsker. Along the way: the ECDF, the DKW inequality with Massart's tight constant, distribution-free quantile CIs via order-statistic pairs, and the one-sample Kolmogorov–Smirnov test. The track's spine — KDE, bootstrap, empirical processes — is previewed in §29.10.

29.1 Motivation: the empirical distribution

Every topic so far has started from a parametric assumption — Normal, Gamma, Bernoulli, or a Beta-binomial compound. We picked a family P_\theta, wrote down likelihoods, and argued about estimators for \theta. That program has a clean asymptotic theory (MLE efficiency, Bernstein–von Mises, GLMs), but it rests on an assumption the data can violate silently: the family is right.

Track 8 begins by dropping that assumption. The data come from some distribution F, continuous but otherwise unrestricted. The only summary we insist exists is the sample itself, and the only derived statistic we rely on without apology is the sorted sample:

X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}.

These are the order statistics. They are the raw material of nonparametric statistics — the place where ranks, ECDFs, quantile estimators, distribution-free CIs, and goodness-of-fit tests all live. Topic 16 already proved (§16 Thm 1) that the tuple (X_{(1)}, \dots, X_{(n)}) is sufficient for any iid model; Topic 29 takes that as license to build a full inferential toolkit on order statistics alone.

Topic 29 opens Track 8 — the nonparametric-and-high-dimensional track. The four topics of this track (29, 30 kernel density estimation, 31 bootstrap, 32 empirical processes) all reduce, at one level or another, to claims about how the empirical CDF

F_n(x) := \frac{1}{n} \sum_{i=1}^n \mathbb{1}\{X_i \le x\}

behaves as n grows. Topic 10 §10.7 already proved one such claim — the Glivenko–Cantelli theorem says \sup_x |F_n(x) - F(x)| \to 0 almost surely. Topic 29 refines it in two directions: the non-asymptotic Dvoretzky–Kiefer–Wolfowitz inequality (§29.5) gives an exponential rate, and the Kolmogorov distribution (§29.8) gives the exact limiting distribution of \sqrt n \sup_x |F_n(x) - F(x)|. Together, DKW and the Kolmogorov limit promote F_n from “close to F” to “close to F with quantifiable confidence.”

Example 1 The ECDF as a visual object

Figure 1 plots realizations of F_n for iid standard-Normal samples at n = 20, 200, and 1000. At n = 20, F_n is a jagged step-function whose staircase shape is very different from the smooth \Phi. At n = 200, the steps compress into what visually reads as \Phi with small wobbles. At n = 1000, the wobbles are invisible at this resolution — the ECDF is a plausible stand-in for \Phi. The annotated supremum distance D_n = \sup_x |F_n(x) - \Phi(x)| shrinks from 0.22 at n = 20 to 0.06 at n = 200 to 0.03 at n = 1000, following the 1/\sqrt n rate that §29.5’s DKW inequality proves non-asymptotically.

Three-panel illustration of the empirical CDF for iid standard-Normal samples at n=20, n=200, and n=1000, with the true CDF Phi overlaid in black. As n grows, the step-function Fn visually approaches the smooth Phi; the supremum distance Dn is annotated in each panel and shrinks from 0.22 at n=20 to 0.03 at n=1000.

Figure 1. The empirical CDF of a standard-Normal sample at three sample sizes. At small nn the ECDF is visibly jagged; at large nn it is indistinguishable from the true CDF within plotting resolution.
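For readers who want to reproduce Figure 1’s numbers, the following minimal sketch (assuming NumPy and SciPy are available; the seed and sample sizes are illustrative) computes the ECDF’s supremum distance to \Phi at the three sample sizes:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def sup_distance_to_cdf(sample, cdf):
    """sup_x |F_n(x) - F(x)|, evaluated at the ECDF's jump points where the supremum is attained."""
    x = np.sort(sample)
    n = len(x)
    grid = cdf(x)
    return max(np.max(np.arange(1, n + 1) / n - grid), np.max(grid - np.arange(n) / n))

for n in (20, 200, 1000):
    print(n, round(sup_distance_to_cdf(rng.standard_normal(n), norm.cdf), 3))
```

The printed values follow the 1/\sqrt n shrinkage described in Example 1; the exact numbers depend on the seed.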

Remark 1 Why 'distribution-free' is load-bearing

Nonparametric statistics has two distinct meanings in the ML literature. The one Topic 29 cares about is distribution-free inference: the validity of a procedure does not depend on the form of FF. A bootstrap CI, a KS test, a Wilcoxon test, and a quantile CI built from order-statistic pairs are all distribution-free. The other meaning — “flexible modeling with infinite-dimensional parameters” — covers Bayesian nonparametrics (Dirichlet processes, Gaussian processes) and kernel methods; those are in formalml.

Distribution-freeness is what makes KS useful for real ML workflows: detecting distribution shift between training and deployment samples, validating that a model’s calibration stays the same across data slices, or checking that a simulator reproduces the empirical distribution of a sensor. None of those applications specify FF up front, so a t-test is the wrong tool.

Remark 2 What Track 8 is about, in one sentence

Classical statistics assumes a parametric model; nonparametric statistics lets the data specify the model via the empirical distribution, and the order statistics are the lens through which we study that empirical distribution. §29.10’s forward-map spells out how Topics 30 (KDE), 31 (Bootstrap), and 32 (Empirical Processes) each sharpen one aspect of this sentence.

29.2 Order statistics: joint density

The order-statistic joint density has a clean closed form for continuous FF — one of the few places in statistics where a multivariate density has an elementary derivation via the change-of-variables formula.

Definition 1 Order statistics

For iid X1,,XnFX_1, \dots, X_n \sim F with FF continuous, the order statistics are the sorted sample, written with parenthesized indices:

X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}.

The parentheses distinguish X(i)X_{(i)} (the ii-th smallest value in the sample) from XiX_i (the ii-th draw, in the order collected). When FF is continuous, ties occur with probability zero, so the inequalities are strict almost surely.

Theorem 1 Joint density of the order statistics (stated)

For iid X1,,XnX_1, \dots, X_n with continuous CDF FF and density ff, the joint density of (X(1),,X(n))(X_{(1)}, \dots, X_{(n)}) on the ordered simplex {x1x2xn}\{x_1 \le x_2 \le \cdots \le x_n\} is

f_{X_{(1)}, \dots, X_{(n)}}(x_1, \dots, x_n) \;=\; n! \, \prod_{i=1}^n f(x_i), \qquad x_1 \le x_2 \le \cdots \le x_n,

and zero outside that simplex. The marginal density of the ii-th order statistic is

f_{X_{(i)}}(x) \;=\; \frac{n!}{(i - 1)!\,(n - i)!}\, F(x)^{i - 1}\,\bigl(1 - F(x)\bigr)^{n - i}\, f(x).

(DAV2003 §2.1–§2.2.)

The joint density has a clean combinatorial reading: the n!n! factor counts the number of ways to assign nn unordered draws to nn ordered slots, and f(xi)\prod f(x_i) is the iid-joint density of the unordered sample. The marginal formula comes from integrating over the other n1n - 1 coordinates and recognizing the binomial coefficient (n1i1)\binom{n - 1}{i - 1} in the result — i1i - 1 draws must fall below xx, nin - i must fall above, and one sits at xx.

Example 2 The minimum and the maximum

For i=1i = 1, the marginal density reduces to n(1F(x))n1f(x)n\,(1 - F(x))^{n - 1}\,f(x) — the minimum’s density is the density at xx times the probability that the other n1n - 1 draws are all above. For i=ni = n, the marginal is nF(x)n1f(x)n\,F(x)^{n - 1}\,f(x) — the maximum’s density is f(x)f(x) times the probability the rest are all below. In both cases, as nn grows, the density concentrates near the lower or upper support endpoint respectively. The rate is governed by F(x)n1F(x)^{n-1} (upper tail) or (1F(x))n1(1 - F(x))^{n-1} (lower tail), which decays exponentially in nn away from the boundary — a topic fully developed at formalml extreme-value-theory.

Example 3 Joint density of two order statistics

The joint density of (X_{(i)}, X_{(j)}) for i < j comes from marginalizing Thm 1:

f_{X_{(i)}, X_{(j)}}(x_i, x_j) = \frac{n!}{(i-1)!\,(j - i - 1)!\,(n - j)!}\, F(x_i)^{i-1}\, \bigl(F(x_j) - F(x_i)\bigr)^{j - i - 1}\, \bigl(1 - F(x_j)\bigr)^{n - j}\, f(x_i)\, f(x_j),

valid for x_i < x_j. The combinatorial factor counts placements: i - 1 below x_i, one at x_i, j - i - 1 between, one at x_j, n - j above. §29.7 uses such a pair (i, j) = (r, s) to build distribution-free quantile CIs — pick r, s so that \Pr\bigl(X_{(r)} \le \xi_p \le X_{(s)}\bigr) \ge 1 - \alpha.

Remark 3 Non-iid extensions are a separate story

For non-iid data (record values, concomitants, ranked-set sampling, progressive censoring) the joint density takes a different combinatorial form — the n!n! symmetry is broken. DAV2003 Chapters 5–6 develop the non-iid theory in full; Topic 29 restricts to iid throughout. The one use case outside iid order statistics that recurs in ML is censoring — survival analysis estimates FF from partially observed data — which we flag as an extension but do not develop.

Remark 4 Sufficiency: the order statistic is the maximally-informative summary

Topic 16 §16 Thm 1 established that for any iid model, the order-statistic tuple T(X)=(X(1),,X(n))T(X) = (X_{(1)}, \dots, X_{(n)}) is sufficient for θ\theta: the conditional distribution of XX given TT does not depend on θ\theta. The same result holds nonparametrically — for any FF, the order statistic captures every feature of the sample that FF influences. This is the structural justification for Track 8: the objects §29 develops are the raw features of every nonparametric procedure, because no statistic that respects the iid assumption can ever discard information in the order statistic. Independence notation  ⁣ ⁣ ⁣\perp\!\!\!\perp (Topic 16 §16.9) carries forward: X(1),,X(n)X_{(1)}, \dots, X_{(n)} are not independent (ordering is the whole point), but for iid XiX_i, the vector of ranks {Ri}\{R_i\} satisfies (R1,,Rn) ⁣ ⁣ ⁣(X(1),,X(n))(R_1, \dots, R_n) \perp\!\!\!\perp (X_{(1)}, \dots, X_{(n)}) — a fact exploited by rank tests (forthcoming on formalml).

29.3 The uniform case and Rényi’s representation

The Uniform(0, 1) distribution is the reference case for order-statistic theory. Three facts make it special: the density is constant, so the marginal formula in Thm 1 collapses; the CDF is the identity, so the probability-integral transform connects it directly to the general-FF case; and the spacings (gaps between consecutive order statistics) have a surprisingly clean joint structure that Rényi exploited in 1953.

Theorem 2 Uniform order statistics are Beta-distributed (stated)

For iid U1,,UnUniform(0,1)U_1, \dots, U_n \sim \mathrm{Uniform}(0, 1), the ii-th order statistic U(i)U_{(i)} has density

f_{U_{(i)}}(u) \;=\; \frac{n!}{(i - 1)!\,(n - i)!}\, u^{i - 1}\,(1 - u)^{n - i} \;=\; \frac{1}{B(i, n - i + 1)}\, u^{i - 1}\,(1 - u)^{n - i}, \quad u \in (0, 1),

i.e. U(i)Beta(i,ni+1)U_{(i)} \sim \mathrm{Beta}(i,\, n - i + 1). This is Thm 1 specialized to F(u)=u,f(u)=1F(u) = u,\, f(u) = 1 on (0,1)(0,1); the Beta-density normalization follows from B(i,ni+1)=(i1)!(ni)!/n!B(i, n-i+1) = (i-1)!(n-i)!/n!.

Corollary 1 (general FF). If X(i)X_{(i)} is the ii-th order statistic of an iid FF-sample and FF is continuous, then F(X(i))Beta(i,ni+1)F(X_{(i)}) \sim \mathrm{Beta}(i, n - i + 1), since F(Xi)Uniform(0,1)F(X_i) \sim \mathrm{Uniform}(0, 1) under the probability-integral transform, so sorting the FF-values gives the Uniform order statistics.

(DAV2003 §2.2.)

Corollary 1 is a universal statement: the Beta distribution governs F(X(i))F(X_{(i)}) regardless of FF. The OrderStatisticDensityBrowser below (Panel C) makes this visible — switching between Uniform, Exponential, and Normal changes the density of X(i)X_{(i)} (Panels A and B), but the transformed statistic F(X(i))F(X_{(i)}) always sits on the same Beta curve.

Topic 7 §7.13 forward-promised this result as “coming soon”; Thm 2 + Cor 1 discharge it. The OrderStatisticDensityBrowser interactive lets the reader sweep ii from 1 to nn and watch the density shift from left to right across the support — a visual confirmation of the in+1\frac{i}{n+1} location for E[U(i)]\mathbb{E}[U_{(i)}].

[Interactive figure: Order-statistic density browser (§29.3, Uniform → Beta). Panel A: analytic density of X_{(5)}; Panel B: Monte-Carlo histogram of X_{(5)}; Panel C: U_{(5)} = F(X_{(5)}) against Beta(5, 6). Panel C always agrees with Beta(5, 6) regardless of the F preset — that is Theorem 2 and Corollary 1 made visible. Switching between the Uniform, Exp, and Normal presets changes the shape of Panels A and B but not of Panel C.]
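A minimal Monte-Carlo sketch of the same check (NumPy and SciPy assumed; the presets, n = 10, i = 5, and the replicate count are illustrative): the transformed order statistic F(X_{(i)}) should pass a KS test against \mathrm{Beta}(i, n - i + 1) under every preset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, i, reps = 10, 5, 20_000

for name, dist in [("uniform", stats.uniform), ("exponential", stats.expon), ("normal", stats.norm)]:
    samples = dist.rvs(size=(reps, n), random_state=rng)
    x_i = np.sort(samples, axis=1)[:, i - 1]          # i-th order statistic in each replicate
    u_i = dist.cdf(x_i)                               # probability-integral transform F(X_(i))
    ks = stats.kstest(u_i, stats.beta(i, n - i + 1).cdf)
    print(f"{name:12s}  KS p-value vs Beta({i}, {n - i + 1}): {ks.pvalue:.3f}")
```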

Two-panel illustration of the Beta-form density of Uniform order statistics. Left panel: marginal density of U_(i) at n=5 for i=1,2,3,4,5 — five Beta curves sharing one family motif, color-graded from light to dark. Right panel: same at n=10 for i=1,3,5,8,10, showing the center-of-mass sweeping from left to right as i grows. Each curve is labeled Beta(i, n-i+1).

Figure 2. Uniform order-statistic densities at n=5n = 5 (left) and n=10n = 10 (right). Each curve is a Beta density with shape parameters (i,ni+1)(i, n - i + 1) — the reader can verify that the mean in+1\frac{i}{n + 1} moves smoothly from the left endpoint toward the right endpoint as ii grows.

The uniform case has one further structural property that motivates a named theorem: the spacings of a Uniform(0, 1) sample — the gaps U(i)U(i1)U_{(i)} - U_{(i-1)} — have a clean distribution that Rényi (1953) discovered for the exponential case. Because Exp and Uniform are connected by U=1eYU = 1 - e^{-Y} (the inverse of the exponential CDF), the two stories are the same story. We state it for the exponential family where the algebra is prettiest.

Schematic of Rényi's decomposition of exponential order statistics. Three horizontal rows: top row shows n=6 iid Exp(1) draws as tick marks on the real line; middle row shows the sorted sample with spacings W_k = Y_(k) - Y_(k-1) annotated as segments; bottom row shows rescaled spacings E_k = (n-k+1) W_k as iid Exp(1) bars of varying length.

Figure 3. Rényi’s decomposition. Top: n=6n = 6 iid Exp(1)\mathrm{Exp}(1) draws. Middle: sorted, with spacings WkW_k between consecutive order statistics. Bottom: rescaled spacings Ek=(nk+1)WkE_k = (n - k + 1) W_k, which Theorem 3 proves are iid Exp(1)\mathrm{Exp}(1).

Theorem 3 Rényi's representation of exponential order statistics

Let Y_1, \dots, Y_n be iid \mathrm{Exp}(1). Define the spacings W_k = Y_{(k)} - Y_{(k - 1)} for k = 1, \dots, n (with Y_{(0)} = 0). Then the rescaled spacings E_k := (n - k + 1)\, W_k are iid \mathrm{Exp}(1), and

Y_{(k)} \;\stackrel{d}{=}\; \sum_{i = 1}^{k} \frac{E_i}{n - i + 1},

where E1,,EnE_1, \dots, E_n on the right-hand side are iid Exp(1)\mathrm{Exp}(1).

Proof 1 Proof of Thm 3 (Rényi's representation)

The joint density of iid Exp(1)\mathrm{Exp}(1) samples on the ordered simplex {0y1yn}\{0 \le y_1 \le \cdots \le y_n\} is

f_{Y_{(1)}, \dots, Y_{(n)}}(y_1, \dots, y_n) \;=\; n! \, \exp\!\Big(- \sum_{i = 1}^n y_i\Big), \qquad 0 \le y_1 \le y_2 \le \cdots \le y_n,

by Thm 1 specialized to the Exp(1)\mathrm{Exp}(1) density f(y)=eyf(y) = e^{-y} on y0y \ge 0.

We change variables from (y1,,yn)(y_1, \dots, y_n) to the spacings (w1,,wn)(w_1, \dots, w_n) via wk=ykyk1w_k = y_k - y_{k - 1} (with y0=0y_0 = 0), so that yk=w1+w2++wky_k = w_1 + w_2 + \cdots + w_k. The Jacobian of this map is 11 — the transformation matrix is lower-triangular with unit diagonal. The inverse image of the ordered simplex is the positive orthant {w1,,wn0}\{w_1, \dots, w_n \ge 0\}.

The sum in the exponent rewrites as

\sum_{i = 1}^n y_i \;=\; \sum_{i = 1}^n \sum_{j = 1}^{i} w_j \;=\; \sum_{j = 1}^n (n - j + 1)\, w_j,

where the swap of summation order counts how many times each wjw_j appears in the sum of partial sums — namely nj+1n - j + 1 times.

Substituting, the joint density of (W1,,Wn)(W_1, \dots, W_n) on the positive orthant is

f_{W_1, \dots, W_n}(w_1, \dots, w_n) \;=\; n! \, \exp\!\Big(- \sum_{j = 1}^n (n - j + 1)\, w_j\Big).

This density factorizes across coordinates: W1,,WnW_1, \dots, W_n are independent with WjExp(nj+1)W_j \sim \mathrm{Exp}(n - j + 1). Note that n!=j=1n(nj+1)n! = \prod_{j = 1}^n (n - j + 1), so the joint density is exactly the product of nn exponential densities with rates n,n1,,1n, n - 1, \dots, 1, confirming the normalization.

Defining Ej:=(nj+1)WjE_j := (n - j + 1)\, W_j rescales each coordinate by its rate, converting the Exp(nj+1)\mathrm{Exp}(n - j + 1) coordinate into Exp(1)\mathrm{Exp}(1). So E1,,EnE_1, \dots, E_n are iid Exp(1)\mathrm{Exp}(1). Finally,

Y_{(k)} \;=\; \sum_{j = 1}^{k} W_j \;=\; \sum_{j = 1}^{k} \frac{E_j}{n - j + 1}.

\blacksquare — using the general order-statistic joint density (Thm 1), the Jacobian change-of-variables formula, and the factorization of the spacings density (REN1953; DAV2003 §2.5).

Example 4 Rényi gives $\mathbb{E}[Y_{(k)}]$ in closed form

From Theorem 3,

\mathbb{E}[Y_{(k)}] \;=\; \sum_{i = 1}^{k} \frac{1}{n - i + 1} \;=\; H_n - H_{n - k},

where Hm=1+12++1mH_m = 1 + \tfrac{1}{2} + \cdots + \tfrac{1}{m} is the harmonic number. The minimum’s expectation is HnHn1=1/nH_n - H_{n - 1} = 1/n — the smallest of nn iid Exp(1) variables is itself Exp(nn), mean 1/n1/n. The maximum’s expectation is HnH0=Hnlogn+γH_n - H_0 = H_n \approx \log n + \gamma — the largest of nn iid Exp(1) grows logarithmically. Both are lessons that recur in EVT (forthcoming on formalml): the maximum of iid Exp grows slowly, which is why the Gumbel distribution is the right asymptotic limit for Gumbel-domain tails.
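A short simulation (NumPy assumed; n = 8 and the chosen k values are illustrative) checks the harmonic-number formula for \mathbb{E}[Y_{(k)}]:

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 8, 200_000
H = np.concatenate(([0.0], np.cumsum(1.0 / np.arange(1, n + 1))))   # H[0] = 0, H[m] = 1 + ... + 1/m

Y = np.sort(rng.exponential(size=(reps, n)), axis=1)                 # order statistics of Exp(1) samples
for k in (1, 4, 8):
    print(k, Y[:, k - 1].mean().round(4), (H[n] - H[n - k]).round(4))  # MC mean vs H_n - H_{n-k}
```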

Remark 5 Why the rate-rescaling works

The factor n - k + 1 in Theorem 3 counts the “remaining” iid samples at step k: among n iid Exp(1) draws, the minimum is Exp(n); conditional on the minimum, the second smallest is the minimum of the remaining n - 1 — which by memorylessness is Exp(n - 1); and so on. The spacing structure is memorylessness in disguise. This exponential-distribution property is what gives §29.6’s Bahadur residual its clean form — the tail of the ECDF at quantile p decomposes into iid contributions via the probability-integral transform onto Uniform, and then onto Exp via U \mapsto -\log U.

Remark 6 Connection to the Poisson process

Rényi’s representation is the order-statistic reading of a Poisson-process fact: when the inter-arrival times are iid \mathrm{Exp}(\lambda), the arrival times form a Poisson process of rate \lambda on [0, \infty), and conditional on the (n+1)-th arrival occurring at time T, the first n arrival times are distributed as uniform order statistics on [0, T]. Theorem 3 is a direct restatement of the inter-arrival independence in the forward direction. Point-process machinery (formalml point-processes, forthcoming) makes this connection structural.

Remark 7 Basu's theorem and the uniform-order-statistics ratio

Topic 16 §16.14 Ex 13 proved an unexpected consequence of Basu’s theorem: for iid Uniform(0,θ)\mathrm{Uniform}(0, \theta), the ratio X(n1)/X(n)X_{(n - 1)} / X_{(n)} is independent of the maximum X(n)X_{(n)}. The uniform-order-statistics structure Theorem 2 and Rényi’s representation exploit is the reason: once you change variables to the Uniform-order-statistics simplex, the joint density factorizes in ways that let Basu’s theorem detect independence cleanly. Basu in Topic 16 was an early hint of the uniform-order-statistics structure that §29.3 develops in full.

29.4 The sample quantile: Type 7

With order statistics in hand, we can define the sample quantile at level pp. The definition is less trivial than it looks: for p=k/np = k/n with integer kk, the “obvious” answer is X(k)X_{(k)}, but for pp in between — say p=0.5p = 0.5 with n=4n = 4 — there are many defensible choices. Hyndman and Fan (1996) catalogued nine conventions in use across statistical packages. Topic 29 adopts Type 7, the default in R, NumPy, and SciPy — matching a reader’s numpy environment is a pedagogical priority.

Definition 2 Population quantile

The population quantile at level p(0,1)p \in (0, 1) is

\xi_p \;:=\; F^{-1}(p) \;=\; \inf\{x : F(x) \ge p\}.

For continuous strictly-increasing FF, ξp=F1(p)\xi_p = F^{-1}(p) in the usual inverse-function sense. The left-continuous-inverse definition handles atoms and flat regions of FF cleanly.

Definition 3 Sample quantile (Hyndman–Fan Type 7)

Let X(1)X(n)X_{(1)} \le \cdots \le X_{(n)} be the order statistics of an iid sample. The Type-7 sample quantile at level p[0,1]p \in [0, 1] is

\hat\xi_p \;:=\; X_{(\lfloor h \rfloor + 1)} + (h - \lfloor h \rfloor)\, \bigl(X_{(\lfloor h \rfloor + 2)} - X_{(\lfloor h \rfloor + 1)}\bigr), \qquad h := (n - 1)\, p.

Equivalently: ξ^p\hat\xi_p linearly interpolates between the two order statistics bracketing the position (n1)p+1(n - 1)\, p + 1 in the sorted sample. For p=0p = 0, ξ^p=X(1)\hat\xi_p = X_{(1)}; for p=1p = 1, ξ^p=X(n)\hat\xi_p = X_{(n)}; for p=0.5p = 0.5 with odd nn, ξ^p=X((n+1)/2)\hat\xi_p = X_{((n + 1)/2)}, the sample median.

(HYN1996; the default convention in R quantile() and NumPy numpy.quantile; SciPy's scipy.stats.mstats.mquantiles reproduces it with alphap = betap = 1.)

Example 5 Sample quartiles of a small sample

For the sample {1,2,3,4,5}\{1, 2, 3, 4, 5\} with n=5n = 5: at p=0.5p = 0.5, h=(51)(0.5)=2h = (5 - 1)(0.5) = 2, so ξ^0.5=X(3)=3\hat\xi_{0.5} = X_{(3)} = 3 exactly. At p=0.25p = 0.25, h=1h = 1, so ξ^0.25=X(2)=2\hat\xi_{0.25} = X_{(2)} = 2. At p=0.9p = 0.9, h=3.6h = 3.6, so ξ^0.9=X(4)+0.6(X(5)X(4))=4+0.6=4.6\hat\xi_{0.9} = X_{(4)} + 0.6 (X_{(5)} - X_{(4)}) = 4 + 0.6 = 4.6. These match the numpy.quantile([1,2,3,4,5], q) defaults exactly — a sanity check that the Type-7 convention is what the reader’s tooling uses.
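The same check in code — a hand-rolled Type-7 interpolation (a minimal sketch, NumPy assumed) against numpy.quantile’s default:

```python
import numpy as np

def type7(sample, p):
    """Hyndman-Fan Type-7 sample quantile: linear interpolation at position h = (n - 1) p."""
    x = np.sort(np.asarray(sample, dtype=float))
    h = (len(x) - 1) * p
    lo = int(np.floor(h))
    hi = min(lo + 1, len(x) - 1)            # guard the p = 1 boundary
    return x[lo] + (h - lo) * (x[hi] - x[lo])

data = [1, 2, 3, 4, 5]
for p in (0.25, 0.5, 0.9, 1.0):
    print(p, type7(data, p), np.quantile(data, p))   # expect 2, 3, 4.6, 5 in both columns
```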

Remark 8 Why Type 7 and not something else

The nine Hyndman–Fan conventions differ in how they interpolate when pp is not at an integer order-statistic position. Type 1 (inverse-CDF) picks X(np)X_{(\lceil n p \rceil)} with no interpolation — it’s discontinuous in pp. Type 5 and Type 6 are older conventions still used in some engineering packages. Type 7 is the unique member of the family that (i) is a continuous piecewise-linear function of pp and (ii) hits the boundary exactly (ξ^0=X(1)\hat\xi_0 = X_{(1)} and ξ^1=X(n)\hat\xi_1 = X_{(n)}). Because R, NumPy, and SciPy all adopt Type 7 by default, a reader writing Python code to reproduce §29’s figures gets Type-7 numbers without setting a flag. Any other convention would introduce a silent discrepancy.

Remark 9 Streaming-quantile approximations live outside this topic

Exact sample quantiles require sorting the sample — O(nlogn)O(n \log n) and O(n)O(n) memory. In streaming or high-cardinality settings, approximate quantile sketches (Jain–Chlamtac 1985’s P² algorithm; Greenwald–Khanna 2001; Dunning’s t-digest 2021) trade small guaranteed error for bounded memory. These are widely used in ML production systems for real-time latency monitoring. Topic 29 does not develop them — streaming quantiles are an engineering sub-topic of its own. The relevant entry point is Dunning 2021 (t-digest) for practical use.

29.5 ECDF, Glivenko–Cantelli, DKW

Topic 10 §10.7 introduced the empirical CDF FnF_n and proved it converges uniformly to FF almost surely — the Glivenko–Cantelli theorem. §29.5 sharpens that a.s. statement into a non-asymptotic bound via the Dvoretzky–Kiefer–Wolfowitz inequality, then converts DKW into a finite-sample confidence envelope that can be read off any realization of FnF_n without knowing FF.

Theorem 4 Glivenko–Cantelli theorem (restated from §10.7)

For iid X1,,XnFX_1, \dots, X_n \sim F with FF any CDF,

\sup_x \bigl|F_n(x) - F(x)\bigr| \;\xrightarrow{\text{a.s.}}\; 0 \qquad \text{as } n \to \infty.

The supremum distance Dn:=supxFn(x)F(x)D_n := \sup_x |F_n(x) - F(x)| is the one-sample Kolmogorov–Smirnov statistic. Glivenko–Cantelli says Dn0D_n \to 0 almost surely. (Topic 10 §10.7 Thm 5; see there for the ε-net argument.)

Glivenko–Cantelli is qualitative — it says “eventually” DnD_n is small — but does not tell us how small for any given nn. The Dvoretzky–Kiefer–Wolfowitz inequality provides a non-asymptotic exponential bound.

Theorem 5 Dvoretzky–Kiefer–Wolfowitz inequality (DKW)

For iid X1,,XnFX_1, \dots, X_n \sim F and every ε>0\varepsilon > 0,

\Pr\!\Bigl(\sup_x |F_n(x) - F(x)| > \varepsilon\Bigr) \;\le\; 2\, e^{-2 n \varepsilon^2}.

The constant 22 is optimal (Massart 1990) when εln2/(2n)\varepsilon \ge \sqrt{\ln 2 / (2n)}; DVO1956 originally proved the bound with a larger, unspecified constant. Inverting the bound at confidence level 1α1 - \alpha gives a distribution-free finite-sample confidence envelope: the band [Fn(x)εn,Fn(x)+εn][F_n(x) - \varepsilon_n, F_n(x) + \varepsilon_n] with

\varepsilon_n \;=\; \sqrt{\frac{\log(2 / \alpha)}{2 n}}

contains FF uniformly in xx with probability 1α\ge 1 - \alpha.

Corollary 2 (the DKW envelope). With εn\varepsilon_n as above, Pr(Fn(x)εnF(x)Fn(x)+εn for all x)1α\Pr\bigl(F_n(x) - \varepsilon_n \le F(x) \le F_n(x) + \varepsilon_n \text{ for all } x\bigr) \ge 1 - \alpha regardless of FF. (DVO1956; MAS1990.)

DKW is a finite-sample strengthening of Glivenko–Cantelli that (i) holds for every n1n \ge 1, (ii) is distribution-free — FF does not appear in the bound — and (iii) has the optimal constant C=2C = 2. It is the workhorse inequality behind every distribution-free uniform-deviation statement the rest of Track 8 uses. The envelope [Fn±εn][F_n \pm \varepsilon_n] is a simultaneous confidence band for FF at every xx, not just a pointwise CI at a single xx.

Three-panel figure showing the DKW envelope against the empirical CDF of Normal(0,1) samples at n=20, 100, 1000. The true CDF Phi is overlaid in black. At alpha=0.05, the envelope width is epsilon_n ≈ 0.301 at n=20, 0.135 at n=100, and 0.043 at n=1000 — roughly the 1/sqrt(n) rate DKW predicts. The envelope visibly contains Phi in each panel.

Figure 4. The DKW confidence envelope at α=0.05\alpha = 0.05. At n=20n = 20 the envelope is wide (±0.301) but already informative; at n=1000n = 1000 it narrows to ±0.043. The true CDF Φ\Phi is contained by the envelope at every xx in each realization — this is guaranteed with probability 0.95\ge 0.95 by DKW, and the ECDFDKWBandExplorer below lets the reader empirically verify the coverage rate.

[Interactive figure: ECDF and the DKW envelope (§29.5, Dvoretzky–Kiefer–Wolfowitz). For a seeded n = 100 Normal sample the explorer overlays the DKW envelope, the ECDF F_n, and the true CDF, and reports D_n = 0.0709, ε_n = 0.1358, D_n/ε_n = 0.52, and whether the envelope contains F (here: yes).]
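A minimal coverage check of Corollary 2 (NumPy and SciPy assumed; n, α, and the replicate count are illustrative): generate Normal samples, compute D_n, and count how often it stays within ε_n.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n, alpha, reps = 100, 0.05, 2_000
eps_n = np.sqrt(np.log(2 / alpha) / (2 * n))          # DKW envelope half-width, ~0.136 at n = 100

covered = 0
for _ in range(reps):
    x = np.sort(rng.standard_normal(n))
    # sup_x |F_n - F|, evaluated at the jump points (same computation as the ECDF snippet in §29.1)
    d_n = max(np.max(np.arange(1, n + 1) / n - norm.cdf(x)),
              np.max(norm.cdf(x) - np.arange(n) / n))
    covered += d_n <= eps_n
print(f"eps_n = {eps_n:.3f}, empirical coverage = {covered / reps:.3f} (guaranteed >= 0.95)")
```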
Example 6 DKW as a two-sample diagnostic

A common ML workflow: you suspect the distribution of a production feature has shifted since training. Collect n1n_1 training samples and n2n_2 production samples. Compute the ECDFs Fn1trainF_{n_1}^{\text{train}} and Fn2prodF_{n_2}^{\text{prod}} and their sup-norm distance D=supxFn1train(x)Fn2prod(x)D = \sup_x |F_{n_1}^{\text{train}}(x) - F_{n_2}^{\text{prod}}(x)|. The one-sample DKW envelopes give simultaneous confidence bands for the true train and production CDFs; if those bands overlap at every xx, no detectable shift. For the formal two-sample test — including the asymptotic null distribution — see §29.8 (with the two-sample variant a one-line extension to ksTwoSample in nonparametric.ts). Production systems run this test nightly to flag shift on each logged feature.

Remark 10 DKW vs CLT-for-one-point

CLT says n(Fn(x)F(x))N(0,F(x)(1F(x)))\sqrt n\, (F_n(x) - F(x)) \Rightarrow \mathcal{N}(0, F(x)(1 - F(x))) for each fixed xx — a pointwise asymptotic statement. DKW says supxFn(x)F(x)\sup_x |F_n(x) - F(x)| is bounded non-asymptotically with probability 1α\ge 1 - \alpha — a uniform finite-sample statement. Both are useful: CLT gives the pointwise limiting distribution that §29.8 Thm 7 upgrades to uniform (via Donsker); DKW gives the uniform bound that does not require any asymptotic machinery. The Bahadur proof (§29.6) exploits both: CLT at the point ξp\xi_p plus a uniform-over-neighborhood fluctuation bound that DKW is the finite-sample analog of.

Remark 11 DKW is sharp; CLT is not (at $x = F^{-1}(1/2)$)

For x at the median of F, CLT gives asymptotic variance F(x)(1 - F(x)) = 1/4. DKW's exponential bound 2 e^{-2 n \varepsilon^2} evaluated at \varepsilon = 1/\sqrt{2n} equals 2/e \approx 0.74 — a nontrivial cap on the exceedance probability already at a threshold of order 1/\sqrt n, matching the scale of the CLT fluctuations at the median to within constants. For x deep in the tails where F(x) \approx 0 or 1, CLT gives much smaller pointwise variance than 1/4, so the DKW uniform bound is loose at tails — a property exploited by the tail-aware inequalities of Talagrand (1994), developed at formalml concentration-inequalities (forthcoming).

29.6 Bahadur representation and asymptotic normality

§29.6 is the centerpiece of Topic 29. Bahadur’s 1966 representation linearizes the sample quantile ξ^p\hat\xi_p into an empirical-CDF term plus a small remainder — the same “score-function-like” decomposition that gives every classical estimator its asymptotic-normality corollary. The featured theorem + its corollary resolve the question “what is the limiting distribution of the sample median?” for essentially every distribution with positive density at its median, including heavy-tailed ones like the Cauchy where the usual variance-based CLT does not apply.

Theorem 6 Bahadur representation of the sample quantile (featured)

Let X1,X2,,XnX_1, X_2, \dots, X_n be iid with CDF FF. Fix p(0,1)p \in (0, 1) and assume FF is differentiable at ξp=F1(p)\xi_p = F^{-1}(p) with F(ξp)=f(ξp)>0F'(\xi_p) = f(\xi_p) > 0 and that FF is twice-differentiable in a neighborhood of ξp\xi_p with bounded second derivative. Let ξ^p\hat\xi_p be the Type-7 sample quantile (Def 3). Then

\hat\xi_p - \xi_p \;=\; -\frac{F_n(\xi_p) - p}{f(\xi_p)} \;+\; R_n,

where the remainder satisfies Rn=Op ⁣(n3/4logn)R_n = O_p\!\bigl(n^{-3/4}\,\sqrt{\log n}\bigr) (Kiefer 1967 rate).

Corollary 1 (asymptotic normality). Under the same hypotheses,

\sqrt{n}\,\bigl(\hat\xi_p - \xi_p\bigr) \;\Rightarrow\; \mathcal{N}\!\left(0,\; \frac{p\,(1 - p)}{f(\xi_p)^2}\right).

(BAH1966; KIE1967; SER1980 §2.5; vdV2000 §21.)

The representation’s structure is the important part. The leading term -(F_n(\xi_p) - p)/f(\xi_p) is a linear functional of the ECDF evaluated at the true quantile: F_n(\xi_p) - p is a centered average of Bernoulli(p) indicators, each with variance p(1 - p), so \sqrt n times the leading term converges to a Normal by the CLT-for-indicators of §11 Thm 2. The remainder R_n is smaller than n^{-1/2} — at the Kiefer rate n^{-3/4}\sqrt{\log n} — and is asymptotically negligible.

Corollary 1 falls out immediately: the leading term is Normal by CLT, the remainder is op(n1/2)o_p(n^{-1/2}), and Slutsky’s lemma assembles them.

Proof 2 Proof of Thm 6 (Bahadur representation — featured)

Throughout, write F~n(t):=Fn(ξp+t/n)\tilde F_n(t) := F_n(\xi_p + t/\sqrt n) for the empirical CDF at a 1/n1/\sqrt n-scale neighborhood of ξp\xi_p. Define the centered empirical process Un(t):=n(F~n(t)F(ξp+t/n))U_n(t) := \sqrt{n}\bigl(\tilde F_n(t) - F(\xi_p + t/\sqrt n)\bigr). By the CLT applied to indicator variables (Topic 11 Thm 2), for each fixed tt,

U_n(t) \;\Rightarrow\; \mathcal{N}\bigl(0, \; F(\xi_p)(1 - F(\xi_p))\bigr) \;=\; \mathcal{N}\bigl(0, \; p(1 - p)\bigr).

Step 1: Smoothness of FF near ξp\xi_p. By the second-derivative hypothesis and Taylor’s theorem,

F(\xi_p + t/\sqrt n) \;=\; p + \frac{t\, f(\xi_p)}{\sqrt n} + \frac{t^2\, F''(\tilde\xi)}{2 n}

for some ξ~\tilde\xi between ξp\xi_p and ξp+t/n\xi_p + t/\sqrt n. Since FF'' is bounded on a neighborhood of ξp\xi_p, the remainder is O(t2/n)O(t^2/n) uniformly for tK|t| \le K with any fixed KK.

Step 2: The sample quantile equation. By the Type-7 definition, ξ^p\hat\xi_p satisfies Fn(ξ^p)pFn(ξ^p)F_n(\hat\xi_p^{-}) \le p \le F_n(\hat\xi_p), which gives Fn(ξ^p)=p+O(1/n)F_n(\hat\xi_p) = p + O(1/n). Let Tn:=n(ξ^pξp)T_n := \sqrt n\, (\hat\xi_p - \xi_p) denote the scaled quantile deviation. Substituting t=Tnt = T_n into F~n(t)=F(ξp+t/n)+Un(t)/n\tilde F_n(t) = F(\xi_p + t/\sqrt n) + U_n(t)/\sqrt n,

F(\xi_p + T_n/\sqrt n) + \frac{U_n(T_n)}{\sqrt n} \;=\; p + O(1/n).

Using the Step-1 Taylor expansion,

p + \frac{T_n\, f(\xi_p)}{\sqrt n} + \frac{T_n^2\, F''(\tilde\xi)}{2 n} + \frac{U_n(T_n)}{\sqrt n} \;=\; p + O(1/n).

Rearranging and multiplying by n\sqrt n,

T_n\, f(\xi_p) \;=\; -U_n(T_n) - \frac{T_n^2\, F''(\tilde\xi)}{2 \sqrt n} + O(1/\sqrt n).

Step 3: Approximate Un(Tn)U_n(T_n) by Un(0)U_n(0). The empirical-process oscillation bound (Kiefer 1967 Thm 1; SER1980 §2.5 Thm A) states that for any fixed K>0K > 0,

\sup_{|t| \le K} \bigl|\, U_n(t) - U_n(0) \,\bigr| \;=\; O_p\!\bigl(n^{-1/4}\,\sqrt{\log n}\bigr).

Since Tn=Op(1)T_n = O_p(1) (the sample quantile is root-nn consistent — which follows from the Step-2 equation plus the fact that Un(Tn)U_n(T_n) is bounded in probability), we may restrict to TnK|T_n| \le K on an event of probability tending to one, yielding

U_n(T_n) \;=\; U_n(0) + O_p\!\bigl(n^{-1/4}\,\sqrt{\log n}\bigr).

Step 4: Assemble the representation. Substituting Un(Tn)=Un(0)+Op(n1/4logn)U_n(T_n) = U_n(0) + O_p(n^{-1/4}\sqrt{\log n}) into the Step-2 equation and recalling Un(0)=n(Fn(ξp)p)U_n(0) = \sqrt n\,(F_n(\xi_p) - p),

T_n\, f(\xi_p) \;=\; -\sqrt n\, \bigl(F_n(\xi_p) - p\bigr) + O_p\!\bigl(n^{-1/4}\,\sqrt{\log n}\bigr) + O_p\!\bigl(1/\sqrt n\bigr).

Figure 5 shows the Monte-Carlo decay of the median absolute residual |R_n| against the Kiefer envelope c \cdot n^{-3/4} \sqrt{\log n}, verifying empirically the remainder rate that Steps 4–5 establish analytically.

Log-log plot of the Bahadur residual decay for F = Normal(0,1). Two panels: left, p=0.5; right, p=0.9. Each panel shows Monte-Carlo estimates of median |R_n| at n ∈ 50, 100, 200, 500, 1000, 2000, 5000 (filled circles) plotted against the Kiefer envelope c · n^-3/4 sqrt(log n) (dashed line, constant c fitted to pass through n=500). The log-log slope approaches -0.75 as n grows.

Figure 5. Empirical verification of the Kiefer rate. Left: p=0.5p = 0.5; right: p=0.9p = 0.9. The slow-decaying logn\sqrt{\log n} factor means the empirical log-log slope is less negative than 0.75-0.75 at moderate nn — the curves approach but don’t match the Kiefer envelope until n5000n \gtrsim 5000. The QuantileAsymptoticsExplorer lets the reader verify this across distributions.

Step 5: Finalize. Dividing by nf(ξp)\sqrt n\, f(\xi_p) and reading off ξ^pξp=Tn/n\hat\xi_p - \xi_p = T_n/\sqrt n,

\hat\xi_p - \xi_p \;=\; -\frac{F_n(\xi_p) - p}{f(\xi_p)} + O_p\!\bigl(n^{-3/4}\,\sqrt{\log n}\bigr).

The remainder rate is sharp: Kiefer (1967) proved both the upper bound above and the matching lower bound Rnop(n3/4logn)R_n \ne o_p(n^{-3/4}\sqrt{\log n}).

\blacksquare — using Taylor’s theorem, the empirical-process oscillation bound (KIE1967 Thm 1), and CLT-for-indicators (Topic 11 Thm 2). Full details: BAH1966 for the original op(n1/2)o_p(n^{-1/2}) statement; KIE1967 for the sharp rate; SER1980 §2.5 and vdV2000 §21.2 for modern treatments.

The corollary’s asymptotic variance p(1 - p)/f(\xi_p)^2 has a striking feature: it does not involve any moments of F — only the value of the density at the population quantile. This is what makes the Bahadur representation the right tool for heavy-tailed data where moments may not exist. The QuantileAsymptoticsExplorer below makes this vivid via the Cauchy preset: Cauchy has no mean or variance, but its density at the median is 1/\pi, so the sample median’s limiting variance is 0.25/(1/\pi)^2 = \pi^2/4 \approx 2.47 — finite and small.

[Interactive figure: Quantile asymptotics explorer (§29.6, Bahadur representation). Top panel: histogram of √n(ξ̂_p − ξ_p) with the Bahadur-limit Normal overlay — for the default preset, ξ_p = 0, f(ξ_p) ≈ 0.399, empirical SD ≈ 1.232 against the theory SD ≈ 1.253 (ratio ≈ 0.98), empirical mean ≈ 0. Bottom panel: median |R_n| against the Kiefer envelope on a log-log scale over seven values of n — fitted slope ≈ −0.717 versus the Kiefer target −0.75; the finite-n slope is less negative than −0.75 because the √(log n) factor adds a slowly decaying correction, so the curves meet the Kiefer envelope only asymptotically. 500 MC replicates per histogram (adaptive in n), 250 per residual point.]

Three-panel histogram showing asymptotic normality of sqrt(n)(sample median - population median) at n=100 over 5000 MC replicates, for F in Normal(0,1), Exp(1), Cauchy(0,1). Each panel overlays the Bahadur-limit Normal density N(0, p(1-p)/f(xi_p)^2). Normal limit variance is 1.571; Exp(1) limit variance is 1.0 (density 1/2 at its median ln 2); Cauchy limit variance is 2.467. All three histograms visibly match the overlaid limits.

Figure 6. The Bahadur limit overlay for three distributions with very different moment behavior. Cauchy has no mean, no variance — but the sample median is still root-nn Normal, with limit variance finite because f(ξ0.5)=1/π>0f(\xi_{0.5}) = 1/\pi > 0. The Bahadur representation is distribution-free in this sense: regularity lives at ξp\xi_p, not globally.

Example 7 The sample median for Normal data

For XiN(0,1)X_i \sim \mathcal{N}(0, 1) at p=0.5p = 0.5: ξ0.5=0\xi_{0.5} = 0, f(0)=1/2πf(0) = 1/\sqrt{2\pi}. Bahadur Cor 1 gives

\sqrt n\,(\hat\xi_{0.5} - 0) \;\Rightarrow\; \mathcal{N}\!\left(0, \frac{0.25}{1/(2\pi)}\right) \;=\; \mathcal{N}\!\left(0, \frac{\pi}{2}\right).

The limiting standard error of the sample median is π/21.253\sqrt{\pi/2} \approx 1.253, compared to 1.01.0 for the sample mean. The median is less efficient than the mean under exact Normal data by the factor 2/π0.6372/\pi \approx 0.637 — the asymptotic relative efficiency ARE(median, mean) = 2/π2/\pi. Under heavier tails this ratio flips; see Ex 8.

Example 8 The sample median for Cauchy data

For XiCauchy(0,1)X_i \sim \mathrm{Cauchy}(0, 1) at p=0.5p = 0.5: ξ0.5=0\xi_{0.5} = 0, f(0)=1/πf(0) = 1/\pi. Bahadur Cor 1 gives

\sqrt n\,(\hat\xi_{0.5} - 0) \;\Rightarrow\; \mathcal{N}\!\left(0, \frac{0.25}{1/\pi^2}\right) \;=\; \mathcal{N}\!\left(0, \frac{\pi^2}{4}\right).

The limiting standard error is \pi/2 \approx 1.571 — slightly larger than the Normal case, but finite. The sample mean of Cauchy data does not even converge: \bar X is itself \mathrm{Cauchy}(0, 1) for every n, so neither the LLN nor the CLT applies, and the sample median is infinitely more efficient than the sample mean for Cauchy — the ARE is +\infty. This is the money-shot example that motivates Bahadur: regularity is about the density at \xi_p, not about moments.
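A two-line Monte-Carlo check of Example 8’s claim (NumPy assumed; n and the replicate count are illustrative; np.median coincides with the Type-7 median):

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 1_000, 5_000
medians = np.median(rng.standard_cauchy(size=(reps, n)), axis=1)
print("empirical SD of sqrt(n)*median:", round(np.sqrt(n) * medians.std(), 3),
      "  Bahadur theory:", round(np.pi / 2, 3))
```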

Remark 12 The density-at-quantile problem

The limit variance p(1p)/f(ξp)2p(1-p)/f(\xi_p)^2 depends on f(ξp)f(\xi_p) — a nuisance parameter the reader typically does not know. Plug-in estimation requires estimating ff at ξp\xi_p. Topic 30 (Kernel Density Estimation) develops the general KDE machinery that provides consistent estimators f^n(ξp)\hat f_n(\xi_p) with controlled bandwidth. The density-at-quantile problem is discharged at Topic 30 §30.7 Ex 9, which computes a Silverman-bandwidth KDE at the empirical quantile and plugs the result into Bahadur’s variance formula. Alternatively, the bootstrap (Topic 31: The Bootstrap) sidesteps this entirely by resampling the full empirical process — the plug-in variance is never computed directly.

Remark 13 Bahadur is the 'score function' of nonparametric inference

The Bahadur representation is structurally identical to the Taylor-expansion-of-the-score-function argument that underpins MLE asymptotic normality (Topic 14 §14.6):

\hat\theta_{\text{MLE}} - \theta \;=\; \frac{\sum_i \nabla \log f(X_i; \theta) / n}{I(\theta)} + o_p(n^{-1/2}),

where the leading term is a mean-zero empirical average divided by a fixed constant, and the remainder is asymptotically negligible. Bahadur’s remainder rate n3/4lognn^{-3/4}\sqrt{\log n} is worse than MLE’s op(n1/2)o_p(n^{-1/2}), but the functional form — “linearize into a known empirical average, bound the remainder, apply CLT” — is identical. Every Track 8 result reduces to a Bahadur-style linearization: KDE bandwidth asymptotics (Topic 30), bootstrap-CI validity (Topic 31: The Bootstrap), empirical-process functionals (Topic 32), and quantile-regression (forthcoming on formalml) all use this template.

Remark 14 What $p = 0$ and $p = 1$ break

Bahadur requires f(ξp)>0f(\xi_p) > 0. At extreme pp — near 0 or 1 — the density may vanish (f(ξ0)f(\xi_0) or f(ξ1)f(\xi_1) may be zero or undefined if the support is unbounded). In those regimes the sample quantile’s asymptotic distribution is not Normal at rate n\sqrt n; it is governed by the Fisher–Tippett–Gnedenko trichotomy (Gumbel, Fréchet, Weibull) at rate determined by the tail behavior of FF. Extreme-value theory is the subject of formalml extreme-value-theory (forthcoming); Topic 29 restricts to interior pp where Bahadur applies.

29.7 Distribution-free quantile confidence intervals

Bahadur Cor 1 gives an asymptotic CI for ξp\xi_p — but implementing it requires estimating f(ξp)f(\xi_p). In some applications you want a CI whose validity does not require estimating any density. Order-statistic pairs provide exactly that: an exact (not asymptotic) finite-sample CI for ξp\xi_p whose coverage is a closed-form binomial probability, distribution-free.

Theorem 7 Distribution-free quantile CI (stated)

For iid X_1, \dots, X_n with continuous CDF F, and p \in (0, 1), let r < s be integers with 1 \le r < s \le n. Then

\Pr\!\bigl(X_{(r)} \le \xi_p \le X_{(s)}\bigr) \;=\; \Pr\bigl(r \le B \le s - 1\bigr), \qquad B \sim \mathrm{Bin}(n, p),

regardless of FF. Choosing r,sr, s so that this binomial probability is 1α\ge 1 - \alpha yields a distribution-free exact (1α)(1 - \alpha)-CI for ξp\xi_p. The canonical choice picks the largest rr with Pr(B<r)α/2\Pr(B < r) \le \alpha / 2 and the smallest ss with Pr(Bs)α/2\Pr(B \ge s) \le \alpha / 2; this gives a two-sided CI with exact coverage 1Pr(B<r)Pr(Bs)1α1 - \Pr(B < r) - \Pr(B \ge s) \ge 1 - \alpha. (SER1980 §2.6.1.)

The key fact behind Theorem 7: the number BB of sample points XiξpX_i \le \xi_p is Bin(n,p)\mathrm{Bin}(n, p), since Pr(Xiξp)=F(ξp)=p\Pr(X_i \le \xi_p) = F(\xi_p) = p. So X(r)ξpX_{(r)} \le \xi_p iff at least rr points fall below ξp\xi_p iff BrB \ge r; and X(s)ξpX_{(s)} \ge \xi_p iff at most s1s - 1 points fall strictly below iff Bs1B \le s - 1 (using continuity to ignore ties). Combining gives the claim. Coverage is exactly the binomial tail probability — no asymptotics, no plug-in.

Because rr and ss are integers, coverage is a step function of nn: for the 95th-percentile CI at p=0.95p = 0.95, the bounds shift discretely as nn crosses thresholds where the binomial tail probabilities change which integers are selected. The consequence: distribution-free quantile CIs over-cover at intermediate nn — the actual coverage is typically strictly greater than 1α1 - \alpha. Figure 7 makes this visible.

Coverage plot of the 95 percent distribution-free CI for the 75th percentile of Exp(1) across n in 20, 30, 50, 100, 200, 500. Empirical coverage (filled circles) lies slightly above the 0.95 target line, drifting toward 0.95 as n grows. Step structure is visible: coverage jumps discretely at n-thresholds where a new (r,s) pair becomes the chosen bound.

Figure 7. Empirical coverage of the 95% distribution-free CI for the 75th percentile of \mathrm{Exp}(1) over 2,000 MC replicates. Coverage is a step function of n: exact CIs over-cover at intermediate n, tightening only when the binomial tail probabilities support a larger or smaller (r, s) window. The 0.75 quantile level was chosen over 0.95 because at p = 0.95 with n = 20 even s = n leaves upper-tail probability \Pr(B = 20) = 0.95^{20} \approx 0.36 > \alpha/2, so the two-sided CI degenerates to a one-sided interval and the step structure is not visible.

Example 9 CI for the median at $n = 20$

For n=20n = 20, p=0.5p = 0.5, α=0.05\alpha = 0.05: by symmetry of Bin(20,0.5)\mathrm{Bin}(20, 0.5), the largest rr with Pr(Bin(20,0.5)<r)0.025\Pr(\mathrm{Bin}(20, 0.5) < r) \le 0.025 is r=6r = 6 (since Pr(Bin5)0.021\Pr(\mathrm{Bin} \le 5) \approx 0.021), and symmetrically s=15s = 15. The CI is [X(6),X(15)][X_{(6)}, X_{(15)}], with exact coverage 120.0210.9591 - 2 \cdot 0.021 \approx 0.959 — slightly over-covering the nominal 0.95. This is the quantileCIOrderStatisticBounds(20, 0.5, 0.05) result pinned in the T29.15 test — r=6r = 6, s=15s = 15, actual level 0.9586\approx 0.9586.
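A minimal sketch of the same computation in Python (SciPy assumed; quantileCIOrderStatisticBounds is the document’s own TypeScript helper — the function below simply mirrors its logic):

```python
from scipy.stats import binom

def quantile_ci_bounds(n, p, alpha):
    """Order-statistic indices (r, s) and exact coverage of the distribution-free CI for xi_p."""
    b = binom(n, p)
    # largest r with P(B < r) = P(B <= r - 1) <= alpha/2
    r = max(k for k in range(1, n + 1) if b.cdf(k - 1) <= alpha / 2)
    # smallest s with P(B >= s) = 1 - P(B <= s - 1) <= alpha/2
    s = min(k for k in range(1, n + 1) if 1 - b.cdf(k - 1) <= alpha / 2)
    coverage = b.cdf(s - 1) - b.cdf(r - 1)            # P(r <= B <= s - 1)
    return r, s, coverage

print(quantile_ci_bounds(20, 0.5, 0.05))              # expect (6, 15, ~0.9586)
```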

Example 10 CI for a latency percentile

Production ML systems often set SLOs on latency percentiles — e.g., the 95th-percentile response time should be below 200ms. The distribution-free CI is a natural monitoring tool: collect n=500n = 500 request latencies in the past hour, compute [X(r),X(s)][X_{(r)}, X_{(s)}] at p=0.95p = 0.95 and α=0.05\alpha = 0.05. The resulting CI has guaranteed coverage 0.95\ge 0.95 without any assumption about the latency distribution — which matters because latency distributions are usually heavy-tailed (lognormal or Pareto-like) and asymptotic-Normal CIs based on Xˉ\bar{X} are misleading in the tails. Conformal-prediction wrappers (forthcoming on formalml) generalize this construction to arbitrary prediction sets.

Remark 15 Bootstrap as the general distribution-free CI machinery

§29.7’s exact CI works specifically for quantiles — one of the few statistics where the coverage probability is binomial and closed-form. For a general statistic T(X)T(X) (mean, variance, regression coefficient), the same distribution-free spirit is captured by the bootstrap: resample from FnF_n and use the empirical distribution of T(X)T(X^*) as the reference. Topic 31: The Bootstrap develops bootstrap-CIs in full (percentile, BCa, studentized), with Hall’s second-order accuracy result as the anchor. §19.10 Rem 20 punted bootstrap to Track 8; Topic 31 is where that promise lands. The order-statistic CI of §29.7 is the distribution-free CI “done in closed form” for quantiles specifically; the bootstrap is the general nonparametric CI machinery for every other statistic.

29.8 The Kolmogorov–Smirnov test

§29.5 Thm 4 restated Glivenko–Cantelli’s Dn0D_n \to 0 almost surely, and Thm 5 gave the non-asymptotic DKW bound. §29.8 completes the picture: the asymptotic distribution of nDn\sqrt n\, D_n is the Kolmogorov distribution, and this yields the one-sample Kolmogorov–Smirnov goodness-of-fit test — testing whether an iid sample was drawn from a specified continuous CDF F0F_0.

Theorem 8 Asymptotic Kolmogorov distribution of $\sqrt n\, D_n$ (partial proof; Donsker stated)

Let X1,,XnX_1, \dots, X_n be iid from a continuous CDF FF and let Dn=supxFn(x)F(x)D_n = \sup_x |F_n(x) - F(x)|. Then

\sqrt n\, D_n \;\Rightarrow\; K \qquad \text{as } n \to \infty,

where KK is the Kolmogorov distribution with CDF

\Pr(K \le x) \;=\; 1 - 2 \sum_{k = 1}^{\infty} (-1)^{k - 1}\, e^{-2 k^2 x^2}, \qquad x > 0.

The asymptotic null distribution KK is distribution-free: it does not depend on FF. The partial proof below assumes Donsker’s theorem (proved at Topic 32 §32.5) and derives the Kolmogorov limit as a consequence. (KOL1933; SMI1948; BIL1999 §14.)

Proof 3 Proof of Thm 8 (Kolmogorov-distribution limit, partial)

The proof has three ingredients: (i) the probability-integral transform reduces the general-FF case to F=Uniform(0,1)F = \mathrm{Uniform}(0, 1); (ii) Donsker’s theorem — taken as given here, proved at Topic 32 — gives a distributional limit for the normalized empirical process; (iii) the continuous mapping theorem extracts the sup-norm distribution from that limit.

Ingredient (i): reduce to uniform. Define Ui:=F(Xi)U_i := F(X_i). Since FF is continuous, UiUniform(0,1)U_i \sim \mathrm{Uniform}(0, 1). The empirical CDF of {Ui}\{U_i\} at level tt equals the empirical CDF of {Xi}\{X_i\} at F1(t)F^{-1}(t):

G_n(t) := \frac{1}{n} \sum_{i = 1}^n \mathbb{1}\{U_i \le t\} = F_n(F^{-1}(t)).

Substituting t=F(x)t = F(x) gives Gn(F(x))=Fn(x)G_n(F(x)) = F_n(x), so

D_n = \sup_x |F_n(x) - F(x)| = \sup_t |G_n(t) - t|,

which is the KS statistic for the uniform case. Henceforth assume F=Uniform(0,1)F = \mathrm{Uniform}(0, 1).

Ingredient (ii): Donsker’s theorem (proved at Topic 32 §32.5). Define the uniform empirical process Gn(t):=n(Gn(t)t)\mathbb{G}_n(t) := \sqrt n\, (G_n(t) - t) for t[0,1]t \in [0, 1]. Donsker’s theorem (BIL1999 §16) states that Gn\mathbb{G}_n converges in distribution in the space D[0,1]D[0, 1] (with the Skorokhod topology) to a standard Brownian bridge B\mathbb{B}^\circ — a centered Gaussian process on [0,1][0, 1] with covariance Cov(B(s),B(t))=stst\mathrm{Cov}(\mathbb{B}^\circ(s), \mathbb{B}^\circ(t)) = s \wedge t - s t and boundary conditions B(0)=B(1)=0\mathbb{B}^\circ(0) = \mathbb{B}^\circ(1) = 0. Topic 32 develops the full empirical-process machinery — including the Skorokhod topology, tightness arguments, and finite-dimensional convergence — and proves Donsker as its centerpiece. Here we take it as given.

Ingredient (iii): continuous mapping. The sup-norm functional ϕ:D[0,1]R\phi: D[0, 1] \to \mathbb{R} defined by ϕ(f)=supt[0,1]f(t)\phi(f) = \sup_{t \in [0, 1]} |f(t)| is continuous — indeed, 1-Lipschitz with respect to the uniform metric. By the continuous mapping theorem (Topic 9 §9.5),

\sqrt n\, D_n = \phi(\mathbb{G}_n) \;\Rightarrow\; \phi(\mathbb{B}^\circ) = \sup_{t \in [0, 1]} |\mathbb{B}^\circ(t)|.

The Kolmogorov formula. The distribution of supt[0,1]B(t)\sup_{t \in [0, 1]} |\mathbb{B}^\circ(t)| is classical (BIL1999 §14 Thm 14.4):

\Pr\!\left(\sup_{t \in [0, 1]} |\mathbb{B}^\circ(t)| \le x\right) \;=\; 1 - 2 \sum_{k = 1}^{\infty} (-1)^{k - 1}\, e^{-2 k^2 x^2}, \qquad x > 0.

This is derived from the reflection principle applied to Brownian motion and conditioning on B(1)=0B(1) = 0; we quote the formula rather than re-deriving it, since the full derivation uses the Brownian-motion technology developed at formalml stochastic-processes. Combining with Ingredient (iii) gives the claimed limit.

\blacksquare (assuming Donsker — proved at Topic 32 — and the sup-norm-of-Brownian-bridge distribution — quoted from BIL1999 §14.4.)

The KS statistic with the asymptotic Kolmogorov null distribution gives the one-sample KS test. Critical values are standard: d_{0.10} \approx 1.224, d_{0.05} \approx 1.358, d_{0.01} \approx 1.628, all divided by \sqrt n for the un-scaled statistic.
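In practice the one-sample test is a single call to scipy.stats.kstest (SciPy assumed; the Normal sample below is illustrative, and the exact-vs-asymptotic p-value choice is handled internally); comparing \sqrt n\, D_n to d_{0.05} \approx 1.358 gives the same decision as the reported p-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.standard_normal(500)                    # H0 true: data drawn from N(0, 1)
stat, pval = stats.kstest(x, "norm")
print(f"D_n = {stat:.4f}, sqrt(n)*D_n = {np.sqrt(len(x)) * stat:.3f}, p = {pval:.3f}")
# Reject at alpha = 0.05 iff sqrt(n)*D_n exceeds the Kolmogorov critical value ~1.358.
```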

Two-panel figure for the KS null distribution. Left panel: Monte-Carlo histogram of sqrt(n) D_n at n=100 under H0 (F = Normal(0,1)) over 10000 replicates, with the asymptotic Kolmogorov density overlaid in red. Right panel: Q-Q plot of the MC sample against the asymptotic Kolmogorov quantile function, with the diagonal reference line — the points lie visibly on the diagonal, confirming the asymptotic approximation is excellent at n=100.

Figure 8. Empirical validation of the Kolmogorov asymptotic at n=100n = 100. Left: MC histogram of nDn\sqrt n\, D_n under H0H_0 with the Kolmogorov density overlay — nearly exact match. Right: Q-Q plot against the Kolmogorov quantile function — points lie on the diagonal. The asymptotic is reliable for n100n \ge 100; for smaller nn, the Smirnov finite-sample formula (SMI1948) is more accurate.

[Interactive figure: Kolmogorov–Smirnov goodness-of-fit explorer (§29.8, Kolmogorov distribution). Panel A: √n D_n under H₀ with the α = 0.05 critical value d_α ≈ 1.36 marked — empirical size ≈ 0.041. Panel B: √n D_n under a location-shift alternative (δ = 0.50) — empirical power ≈ 0.988. Panel C: power as a function of the shift δ, with a marker at the current δ and a dashed line at α = 0.05.]
Example 11 KS as a distribution-shift detector

Train a classifier on last month’s data, deploy this month. Pull 500 prediction confidences from last month’s validation set and 500 from this month’s live traffic. Run KS on each feature (or on the confidence scores themselves) against the training ECDF. Large DnD_n relative to dα/nd_\alpha / \sqrt n triggers retraining. This is the most common KS application in production ML — and it is distribution-free in exactly the sense §29.1 Rem 1 emphasizes: no assumption about the shape of the confidence distribution.
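A minimal sketch of that workflow (SciPy assumed; train_scores and prod_scores are placeholders for logged confidences, generated here from Beta distributions purely for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
train_scores = rng.beta(2, 5, size=500)           # last month's validation confidences (stand-in)
prod_scores = rng.beta(2.4, 5, size=500)          # this month's live traffic, slightly shifted

res = stats.ks_2samp(train_scores, prod_scores)   # two-sample KS test
print(f"D = {res.statistic:.3f}, p = {res.pvalue:.4f}")
if res.pvalue < 0.05:
    print("flag: distribution shift detected — consider retraining")
```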

Example 12 KS against the uniform distribution as a PIT check

A common calibration check for probabilistic classifiers: the probability-integral transform (PIT) values F(yixi)F(y_i | x_i) should be Uniform(0, 1) if the predictive distribution F(x)F(\cdot | x) is well-calibrated. Running KS on the PIT values against Uniform is a standard calibration test. KS detects systematic miscalibration (e.g., the classifier is overconfident and PIT is concentrated near 0 and 1) without assuming any specific miscalibration structure. Gneiting–Raftery 2007 develops the full calibration theory; the KS test is the entry-level distribution-free version.

Remark 16 Finite-sample vs asymptotic p-values

For n<40n < 40, the asymptotic Kolmogorov CDF is not a great approximation to the finite-sample null distribution of nDn\sqrt n\, D_n. Smirnov (SMI1948) gives a finite-sample formula based on the combinatorial reflection principle; modern implementations (scipy.stats.kstest, R ks.test) use Smirnov’s formula for small nn and the Kolmogorov asymptotic for n40n \ge 40 (or thereabouts — the exact crossover depends on the implementation). For ML production workflows where nn is typically 500\ge 500, the asymptotic is always sufficient. The nonparametric.ts module ships the asymptotic-only ksPValue for component usage; wrapping SciPy’s exact formula is standard for notebook-side work.

29.9 ML bridges: ranks, conformal prediction, quantile regression

Three themes recur in the ML literature that use the nonparametric machinery of §§29.2–29.8 as substrate. Topic 29 states their structural connection and forward-points to dedicated development.

Example 13 Ranks and Wilcoxon-style tests

The rank transform replaces each XiX_i by its rank RiR_i in the sorted sample — a discrete analog of the probability-integral transform. For iid continuous XiX_i, the rank vector (R1,,Rn)(R_1, \dots, R_n) is uniform on the symmetric group, independent of the order statistic, and governs the null distribution of Wilcoxon’s signed-rank test, Mann–Whitney’s U test, and Kruskal–Wallis’s multi-sample test. These rank tests are permutation-based nonparametric alternatives to t-tests / F-tests / ANOVA, valid without any distributional assumption. They are the standard tool when the data are ordinal, or when the distributional shape is unknown and the user does not want to risk KS’s loss of power to narrow-band alternatives. Full development: formalml rank-tests (forthcoming).

Example 14 Conformal prediction as a distribution-free CI wrapper

Conformal prediction (Vovk–Gammerman–Shafer 2005; Shafer–Vovk 2008) is a modern machine-learning tool that builds distribution-free prediction intervals around any black-box predictor f^\hat f. The construction: given a calibration set {(Xi,Yi)}i=1n\{(X_i, Y_i)\}_{i=1}^n and a non-conformity score s(x,y)s(x, y) (e.g., yf^(x)|y - \hat f(x)|), the (1α)(1 - \alpha)-conformal prediction interval for a new XnewX^{\text{new}} is

C_\alpha(X^{\text{new}}) = \{y : s(X^{\text{new}}, y) \le \hat q_{1 - \alpha}\},

where q^1α\hat q_{1 - \alpha} is the (1α)(1 - \alpha) sample quantile of the calibration non-conformity scores. The coverage guarantee Pr(YnewCα(Xnew))1α\Pr(Y^{\text{new}} \in C_\alpha(X^{\text{new}})) \ge 1 - \alpha is distribution-free — it follows from exchangeability alone, with no parametric assumption on YXY | X. §29.7’s order-statistic CI is the distribution-free CI for the population quantile; conformal prediction applies the same exchangeability argument to the quantile of a learned score function. Full development: formalml conformal-prediction (forthcoming).
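A minimal split-conformal sketch under those assumptions (NumPy assumed; `model` stands for any fitted predictor with a .predict method — the MeanPredictor class below is a trivial stand-in — and the quantile level ⌈(n+1)(1−α)⌉/n is the usual finite-sample correction of the plain (1−α) level quoted above):

```python
import numpy as np

def conformal_interval(model, X_cal, y_cal, X_new, alpha=0.1):
    scores = np.abs(np.asarray(y_cal) - model.predict(X_cal))     # non-conformity scores |y - f_hat(x)|
    n = len(scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)          # finite-sample-corrected quantile level
    q_hat = np.quantile(scores, level)                            # Type-7 sample quantile of the scores
    preds = model.predict(X_new)
    return preds - q_hat, preds + q_hat                           # lower and upper interval endpoints

class MeanPredictor:                                              # trivial stand-in for a black-box model
    def fit(self, X, y): self.c = float(np.mean(y)); return self
    def predict(self, X): return np.full(len(X), self.c)

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 1)); y = X[:, 0] + rng.normal(scale=0.5, size=300)
model = MeanPredictor().fit(X[:150], y[:150])                     # fit on one half, calibrate on the other
lo, hi = conformal_interval(model, X[150:], y[150:], X[:5], alpha=0.1)
print(np.round(lo, 2), np.round(hi, 2))
```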

Remark 17 Quantile regression as the covariate-conditional extension

Topic 29’s sample quantile ξ^p\hat\xi_p estimates the scalar intercept ξp\xi_p of a distribution with no covariates. Quantile regression (Koenker–Bassett 1978; Koenker 2005) extends this to the covariate-conditional case: estimate ξp(x):=FYX=x1(p)\xi_p(x) := F^{-1}_{Y|X = x}(p), the pp-quantile of the conditional distribution. The estimator minimizes the check loss ρτ(u)=u(τ1{u<0})\rho_\tau(u) = u\,(\tau - \mathbb{1}\{u < 0\}) averaged over the sample (here τ=p\tau = p), a convex loss tied directly to the population quantile. Topic 15 §15.9 already named check-loss regression as the natural forward-pointer target; Topic 29 discharges the scalar case via §29.4/§29.6, and formalml quantile-regression (forthcoming) handles the conditional extension. L-statistics and trimmed means (Serfling 1980, §8), linear functionals of the order-statistics vector, are a sibling topic, covered in Topic 31 (bootstrap asymptotics) and formalml.
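A tiny numeric check that minimizing the average check loss over a constant recovers the sample quantile of §29.4; the Gamma data, the bounded scalar optimizer, and the p = 0.9 level are illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
x = rng.gamma(shape=2.0, size=1000)
tau = 0.9  # tau plays the role of p

def avg_check_loss(c):
    # rho_tau(u) = u * (tau - 1{u < 0}), averaged over the sample, as a function of c
    u = x - c
    return np.mean(u * (tau - (u < 0.0)))

res = minimize_scalar(avg_check_loss, bounds=(x.min(), x.max()), method="bounded")
print(res.x, np.quantile(x, tau))  # the two agree up to optimizer tolerance
```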

29.10 Track 8 forward-map and spine

Track 8 has three topics after this one:

  • Topic 30 — Kernel Density Estimation (KDE). The density ff in Bahadur’s denominator is estimated non-parametrically via kernel smoothing of the ECDF. KDE is “what happens when you differentiate FnF_n using a smooth weighting function.” The Silverman rule-of-thumb bandwidth, cross-validation bandwidth selection, and higher-order kernels all sit there.
  • Topic 31 — The Bootstrap (published). The bootstrap replaces the unknown FF with FnF_n (the ECDF) and resamples with replacement. Every bootstrap statement is a pushforward of Topic 29’s ECDF-convergence results through a functional. Hall’s second-order accuracy for BCa is the anchor result.
  • Topic 32 — Empirical Processes & Uniform Convergence (track closer, now published). The functional view: Gn(f)=n(fdFnfdF)\mathbb{G}_n(f) = \sqrt n\,(\int f\, dF_n - \int f\, dF) for function classes F\mathcal{F}. Bahadur’s representation and the Kolmogorov limit both follow from Donsker’s theorem (the empirical process converges to an F\mathcal{F}-indexed Brownian bridge, a Gaussian process, in (F)\ell^\infty(\mathcal{F}) for Donsker classes); DKW is the non-asymptotic companion, proved directly in §29.5. Topic 32 proves Donsker; Topic 29 used it once (§29.8 Thm 7).

Forward-map diagram showing Topic 29 at center with arrows to Topic 30 (Kernel Density Estimation, labeled 'density in the Bahadur denominator'), Topic 31 (Bootstrap, labeled 'ECDF replacement'), and Topic 32 (Empirical Processes, labeled 'Donsker'). Up-arrows connect to five forthcoming formalml boxes: extreme-value-theory, quantile-regression, rank-tests, conformal-prediction, statistical-depth. Back-arrows in grey connect Topic 29 to its predecessors (Topics 7, 10, 16, 19). A small color legend in the corner identifies ECDF blue, DKW teal, Bahadur violet, KS orange.

Figure 9. The Track 8 spine, as seen from its opener. Topic 29 → 30, 31, 32 build the full nonparametric-and-empirical-process toolkit; the five formalml satellites develop domain-specific applications (extremes, quantile regression, rank tests, conformal, depth).

Remark 18 Forward-pointer: Extreme Value Theory

Topic 29’s Bahadur asymptotics cover ξ^p\hat\xi_p for interior pp, where f(ξp)>0f(\xi_p) > 0. When p1p \to 1 (or 0), the density vanishes and Bahadur fails. The correct asymptotic theory — the Fisher–Tippett–Gnedenko trichotomy (Gumbel, Fréchet, Weibull) — lives at formalml extreme-value-theory (forthcoming). The maximum X(n)X_{(n)}, properly centered and rescaled, converges in distribution to one of three limit types depending on the tail behavior of FF. EVT is essential for risk management (financial tail losses), reliability engineering (component failure times), and LLM-latency worst-case bounds.

Remark 19 Forward-pointer: Quantile Regression

Koenker–Bassett 1978 quantile regression extends §29.4/§29.6’s scalar-intercept case to the conditional case ξp(x)=FYX=x1(p)\xi_p(x) = F^{-1}_{Y|X = x}(p). Check-loss minimization yields estimators with Bahadur-style asymptotic normality — the linearization survives, but the influence function now involves the conditional density fYX(ξp(x)x)f_{Y|X}(\xi_p(x) | x) at each covariate value. Full development: formalml quantile-regression (forthcoming).

Remark 20 Forward-pointer: Rank Tests

Wilcoxon signed-rank, Mann–Whitney U, and Kruskal–Wallis are the three classical rank-based nonparametric alternatives to the t-test, two-sample t-test, and one-way ANOVA respectively. The permutation-distribution machinery they depend on is orthogonal to Topic 29’s Bahadur asymptotics — the null distribution comes from rank uniformity, not from empirical-process limits. A proper treatment requires its own chapter. Full development: formalml rank-tests (forthcoming). §18.10 Rem 25 previewed this as Track 8 territory.

Remark 21 Forward-pointer: Conformal Prediction

Conformal prediction (Vovk–Gammerman–Shafer 2005; Shafer–Vovk 2008) wraps any black-box predictor with distribution-free prediction intervals using exchangeability and quantile machinery. §29.7’s order-statistic CI is the distribution-free CI for the population quantile; conformal prediction is the distribution-free CI for a learned score function’s quantile. The two are the same idea at different levels of generality. Full development: formalml conformal-prediction (forthcoming).

Remark 22 Forward-pointer: Statistical Depth (multivariate generalization)

Topic 29 is scalar throughout — one-dimensional XiX_i. For multivariate XiRdX_i \in \mathbb{R}^d, there is no canonical ordering, so the univariate order-statistic / quantile machinery does not apply directly. Statistical depth functions — Tukey (halfspace) depth, Mahalanobis depth, zonoid depth, and others — generalize the quantile function to multivariate settings. Full development: formalml statistical-depth (forthcoming). The univariate case embeds naturally: Tukey depth in 1D reduces to min(F(x),1F(x))\min(F(x), 1 - F(x)), and its level sets are the quantile intervals [ξα,ξ1α][\xi_\alpha, \xi_{1-\alpha}], recovering §29.4’s ξp\xi_p as their endpoints.
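A two-function numeric check of the 1D reduction, plugging the ECDF in for F; purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.sort(rng.normal(size=1001))

def ecdf(t):
    # Fraction of observations <= t (the ECDF F_n at t)
    return np.searchsorted(x, t, side="right") / len(x)

def depth_1d(t):
    # Halfspace (Tukey) depth in 1D with the ECDF plugged in for F
    return min(ecdf(t), 1.0 - ecdf(t))

print(depth_1d(np.median(x)))  # close to 1/2: the sample median is the deepest point
print(depth_1d(x[-1]))         # near 0 at the extremes
```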

Track 8’s organizing thesis, stated at the top of §29.1 and worth repeating here: classical statistics assumes a parametric model; nonparametric statistics lets the data specify the model via the empirical distribution, and the order statistics are the lens through which we study that empirical distribution. Everything that follows — KDE’s smoothing of FnF_n into a density, bootstrap’s pushforward of FnF_n through a functional, empirical processes’ view of FnF_n as a random function — reads that thesis at a different zoom level.


References

  1. Bahadur, R. R. (1966). A Note on Quantiles in Large Samples. Annals of Mathematical Statistics, 37(3), 577–580.
  2. Kiefer, J. (1967). On Bahadur’s Representation of Sample Quantiles. Annals of Mathematical Statistics, 38(5), 1323–1342.
  3. Rényi, A. (1953). On the Theory of Order Statistics. Acta Mathematica Academiae Scientiarum Hungaricae, 4(3–4), 191–231.
  4. Dvoretzky, A., Kiefer, J., & Wolfowitz, J. (1956). Asymptotic Minimax Character of the Sample Distribution Function and of the Classical Multinomial Estimator. Annals of Mathematical Statistics, 27(3), 642–669.
  5. Massart, P. (1990). The Tight Constant in the Dvoretzky–Kiefer–Wolfowitz Inequality. Annals of Probability, 18(3), 1269–1283.
  6. Smirnov, N. V. (1948). Table for Estimating the Goodness of Fit of Empirical Distributions. Annals of Mathematical Statistics, 19(2), 279–281.
  7. Kolmogorov, A. N. (1933). Sulla determinazione empirica di una legge di distribuzione. Giornale dell’Istituto Italiano degli Attuari, 4, 83–91. (Historical Italian journal; no stable URL.)
  8. David, H. A., & Nagaraja, H. N. (2003). Order Statistics (3rd ed.). Wiley.
  9. Hyndman, R. J., & Fan, Y. (1996). Sample Quantiles in Statistical Packages. The American Statistician, 50(4), 361–365.
  10. Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. Wiley.
  11. Lehmann, E. L., & Casella, G. (1998). Theory of Point Estimation (2nd ed.). Springer.
  12. van der Vaart, A. W. (2000). Asymptotic Statistics. Cambridge University Press.
  13. Billingsley, P. (1999). Convergence of Probability Measures (2nd ed.). Wiley.