Intermediate · 56 min read · April 23, 2026

Kernel Density Estimation

The kernel density estimator, the AMISE bias–variance decomposition, the optimal bandwidth h* of order n^(−1/5), Epanechnikov's optimal-kernel theorem, Parzen's pointwise asymptotic normality, and data-driven bandwidth selectors (Silverman, Scott, UCV, Sheather–Jones). Track 8, topic 2 of 4.

30.1 Motivation: from ECDF to density

Topic 29 built its entire nonparametric toolkit on the empirical CDF $F_n(x) = n^{-1} \sum_{i=1}^n \mathbf{1}\{X_i \le x\}$ — a step function that converges uniformly to the true CDF $F$. But $F_n$ is not differentiable, so the density $f = F'$ is unreachable by naïve differentiation. Kernel density estimation is the answer to the obvious question: what happens when we differentiate $F_n$ using a smooth weighting function instead of a Dirac delta?

The idea is simple enough to state in a sentence. Replace each sample point $X_i$ with a small bump of area $1/n$ centered at $X_i$; add the bumps up. The sum is a bona fide density (nonnegative, integrates to 1) that looks like $f$ as $n$ grows, with a smoothness controlled by the bump width. Histograms are the roughest version — rectangular bumps of constant width, anchored to arbitrary bin edges. KDE replaces those rectangles with a smooth, symmetric, nonnegative kernel and drops the bin-edge arbitrariness entirely.

Example 1 Histograms and KDEs on the same N(0,1) sample

Figure 1 takes a single Normal$(0, 1)$ sample of size $n = 200$ and plots three histograms (bin widths $b \in \{0.1, 0.4, 1.0\}$) alongside three Gaussian-kernel KDEs (bandwidths $h \in \{0.08, 0.3, 0.8\}$). At small $b$ or small $h$, both estimators are spiky and variance-dominated; at large $b$ or large $h$, both over-smooth and bias-dominate. The KDEs remove one obvious source of arbitrariness — bin edges — but the bandwidth problem remains, just in a smoother form.

[Figure: six panels — histograms of the same Normal(0, 1) sample at bin widths 0.1, 0.4, 1.0 (top) and Gaussian KDEs at bandwidths 0.08, 0.3, 0.8 (bottom), true density overlaid.]

Figure 1. Histograms (top) and KDEs (bottom) on the same $n = 200$ Normal$(0, 1)$ sample. The bin-width / bandwidth trade-off between spikiness and over-smoothing is visible in both rows; KDE removes the bin-edge arbitrariness but keeps the smoothing-parameter choice.

Remark 1 What KDE gives up vs. histograms

Histograms have two genuine advantages KDEs lack: they are exactly piecewise-constant (so integrals over bin intervals are trivial) and they are visually honest about the underlying empirical counts. KDEs are smoother and easier to differentiate further (for mode-finding or density-gradient methods), but the smoothness is imposed, not inherent. A reader who wants to know “how many data points fell in $[a, b]$?” should look at a histogram; a reader who wants a differentiable density estimate should look at a KDE.

Remark 2 Distribution-free density estimation

Topic 7 built density families (Normal, Gamma, Beta) and fit a parametric $\theta$ from data. That program is efficient when the family is right and biased when it’s wrong. KDE drops the family assumption and estimates $f$ directly from the sample — a distribution-free procedure in the sense of Topic 29 §29.1 Rem 1. The price is a slower convergence rate: we’ll see in §30.6 that AMISE decays at $n^{-4/5}$, slower than the parametric $n^{-1}$, and that this is the price of distribution-freeness.

30.2 The kernel density estimator

With the motivation out of the way, we can write down the estimator formally and check that it is a density. Three definitions (kernel, scaled kernel, KDE) and one stated theorem (KDE is a valid density) — enough to set up everything that follows.

Definition 1 Kernel

A kernel is a function $K : \mathbb{R} \to \mathbb{R}_{\ge 0}$ satisfying

$$\int_{-\infty}^\infty K(u)\,du = 1, \qquad K(-u) = K(u) \text{ for all } u.$$

Nonnegativity and unit mass make $K$ a density on $\mathbb{R}$; symmetry makes it a density centered at $0$. We additionally require finite variance $\mu_2(K) := \int u^2 K(u)\,du < \infty$ and finite roughness $R(K) := \int K(u)^2\,du < \infty$, both used throughout §30.4–§30.6.

Definition 2 Scaled kernel

For bandwidth $h > 0$, the scaled kernel is

$$K_h(u) := \frac{1}{h}\, K\!\left(\frac{u}{h}\right).$$

The scaling preserves the density property — $\int K_h(u)\,du = 1$ for any $h > 0$ — and turns $h$ into a width parameter: small $h$ makes $K_h$ tall and narrow (close to a Dirac delta); large $h$ makes it short and wide.

Definition 3 Kernel density estimator (KDE)

Given an iid sample $X_1, \dots, X_n$ from an unknown density $f$, the kernel density estimator with kernel $K$ and bandwidth $h$ is

$$\hat f_h(x) \;=\; \frac{1}{n}\sum_{i=1}^n K_h(x - X_i) \;=\; \frac{1}{n\,h}\sum_{i=1}^n K\!\left(\frac{x - X_i}{h}\right).$$

Both forms are in active use. The first makes it visually obvious that $\hat f_h$ is an average of $n$ kernel-shaped bumps centered at the data. The second groups the bandwidth normalization into a single $nh$ factor, which will be convenient when we compute bias and variance in §30.4.

Theorem 1 The KDE is a valid density

Under the definitions above, $\hat f_h$ is a bona-fide probability density:

$$\hat f_h(x) \ge 0 \text{ for all } x, \qquad \int_{-\infty}^\infty \hat f_h(x)\,dx = 1.$$

Stated; the proof is elementary. Nonnegativity is immediate from $K \ge 0$. Integration reduces to $\frac{1}{n}\sum_i \int K_h(x - X_i)\,dx = \frac{1}{n}\sum_i 1 = 1$ by the scaled-kernel normalization.

Example 2 A hand-computed KDE at two points

Take $n = 2$ observations $X_1 = 0$, $X_2 = 1$ and Gaussian kernel $K(u) = \varphi(u) = (2\pi)^{-1/2} e^{-u^2/2}$. At bandwidth $h = 0.5$:

$$\hat f_{0.5}(0.5) = \frac{1}{2 \cdot 0.5}\bigl[\varphi(1) + \varphi(-1)\bigr] = \varphi(1) + \varphi(-1) \approx 0.4839,$$

where we used $\varphi(1) = \varphi(-1) \approx 0.2420$. The KDE at the midpoint $x = 0.5$ receives equal contributions from both observations and, since $nh = 1$ here, equals $2\varphi(1) \approx 0.4839$. This is the T30.8 test pin — a two-point sanity check on the kdeEvaluate implementation.
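
A minimal TypeScript sketch of that sanity check — the real kdeEvaluate in nonparametric.ts presumably has a richer signature (kernel choice, evaluation grids); this version just transcribes Definition 3 with a Gaussian kernel:

```ts
// Standard normal kernel phi(u).
const phi = (u: number): number => Math.exp(-0.5 * u * u) / Math.sqrt(2 * Math.PI);

// f_hat_h(x) = (1 / (n h)) * sum_i K((x - X_i) / h)  (Definition 3).
// Hypothetical signature; the actual nonparametric.ts export may differ.
function kdeEvaluate(sample: number[], h: number, x: number): number {
  const sum = sample.reduce((acc, xi) => acc + phi((x - xi) / h), 0);
  return sum / (sample.length * h);
}

// Two-point pin from Example 2: n = 2, h = 0.5, x = 0.5  ->  2 * phi(1).
console.log(kdeEvaluate([0, 1], 0.5, 0.5)); // ≈ 0.4839
```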

Remark 3 Equivalent formulation via convolution

Writing $\mathbb{P}_n = n^{-1} \sum_i \delta_{X_i}$ for the empirical distribution (a probability measure that places mass $1/n$ at each sample point), we have $\hat f_h = K_h \ast \mathbb{P}_n$. KDE is the convolution of a smooth kernel with the empirical distribution — equivalently, the empirical distribution smoothed by the kernel. This formulation is convenient in formalml, where the convolution structure generalizes to multivariate bandwidths and non-iid samples.

Remark 4 Why the symmetry and nonnegativity constraints

Symmetry $K(-u) = K(u)$ makes the first-moment condition $\int u K(u)\,du = 0$ automatic, which simplifies the bias calculation in §30.4 (the $h f'(x)$ Taylor term drops out). Nonnegativity guarantees $\hat f_h \ge 0$, so the estimator is always a valid density. A fourth-order or higher-order kernel with $\mu_2(K) = 0$ can achieve faster bias rates (down to $O(h^4)$) but requires $K$ to take negative values, and then $\hat f_h$ may be negative in places — a practical complication we’ll mention in §30.6 Rem 13 and defer to Serfling 1980 §1.11.

30.3 Kernels: families and properties

Five kernels are standard — Gaussian, Epanechnikov, biweight (quartic), triangular, and uniform — and every modern implementation supports all five. The constants that govern AMISE ($\mu_2(K)$ and $R(K)$) are closed-form for each, and their efficiency relative to Epanechnikov sits between 0.93 and 1.00 across the whole family. Kernel choice, in practice, matters much less than bandwidth choice.

Theorem 2 Standard kernels and their constants

The five standard symmetric kernels, with their second moments and roughness:

Kernel | $K(u)$ | Support | $\mu_2(K)$ | $R(K)$ | Efficiency
Gaussian | $\varphi(u) = (2\pi)^{-1/2} e^{-u^2/2}$ | $\mathbb{R}$ | $1$ | $\frac{1}{2\sqrt\pi} \approx 0.2821$ | $0.951$
Epanechnikov | $\frac{3}{4}(1 - u^2)\,\mathbf{1}\{|u| \le 1\}$ | $[-1, 1]$ | $\frac{1}{5}$ | $\frac{3}{5}$ | $1.000$
Biweight | $\frac{15}{16}(1 - u^2)^2\,\mathbf{1}\{|u| \le 1\}$ | $[-1, 1]$ | $\frac{1}{7}$ | $\frac{5}{7}$ | $0.994$
Triangular | $(1 - |u|)\,\mathbf{1}\{|u| \le 1\}$ | $[-1, 1]$ | $\frac{1}{6}$ | $\frac{2}{3}$ | $0.986$
Uniform | $\frac{1}{2}\,\mathbf{1}\{|u| \le 1\}$ | $[-1, 1]$ | $\frac{1}{3}$ | $\frac{1}{2}$ | $0.930$

Stated. All values are elementary integrals; Example 3 verifies Epanechnikov explicitly, and the kernelProperties export in nonparametric.ts §8 encodes all five (T30.5–T30.6).

Example 3 Epanechnikov moments by direct integration

For $K_E(u) = \frac{3}{4}(1 - u^2)\,\mathbf{1}\{|u| \le 1\}$:

$$\int_{-1}^1 K_E(u)\,du = \frac{3}{4}\int_{-1}^1 (1 - u^2)\,du = \frac{3}{4} \cdot \frac{4}{3} = 1 \quad\checkmark,$$

$$\mu_2(K_E) = \int_{-1}^1 u^2 K_E(u)\,du = \frac{3}{4}\int_{-1}^1 (u^2 - u^4)\,du = \frac{3}{4}\left(\frac{2}{3} - \frac{2}{5}\right) = \frac{1}{5},$$

$$R(K_E) = \int_{-1}^1 K_E(u)^2\,du = \frac{9}{16}\int_{-1}^1 (1 - u^2)^2\,du = \frac{9}{16} \cdot \frac{16}{15} = \frac{3}{5}.$$

The values $\mu_2(K_E) = 1/5$ and $R(K_E) = 3/5$ are load-bearing for Theorem 5’s Epanechnikov optimality proof in §30.6.
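
A quick numeric spot-check of these three integrals (a sketch using midpoint-rule quadrature, not the kernelProperties export itself):

```ts
// Epanechnikov kernel on [-1, 1].
const kE = (u: number): number => (Math.abs(u) <= 1 ? 0.75 * (1 - u * u) : 0);

// Midpoint-rule quadrature on [a, b] with m subintervals.
function integrate(f: (u: number) => number, a: number, b: number, m = 100000): number {
  const du = (b - a) / m;
  let s = 0;
  for (let i = 0; i < m; i++) s += f(a + (i + 0.5) * du);
  return s * du;
}

console.log(integrate(u => kE(u), -1, 1));         // ≈ 1    (unit mass)
console.log(integrate(u => u * u * kE(u), -1, 1)); // ≈ 0.2  (mu_2 = 1/5)
console.log(integrate(u => kE(u) ** 2, -1, 1));    // ≈ 0.6  (R = 3/5)
```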

[Figure: the five standard kernels on a common u-axis, each annotated with its $\mu_2$, $R(K)$, and efficiency relative to Epanechnikov.]

Figure 2. The five standard kernels on a common axis. Gaussian extends into the tails; the other four have compact support on $[-1, 1]$. Efficiency — the figure of merit from §30.6 Theorem 5 — ranges from $0.930$ (uniform) to $1.000$ (Epanechnikov).

[Interactive demo: Kernel choice explorer — five kernels against the true density; watch how close the curves stay, and the efficiency column shows why. Setting: Normal(0, 1), $n = 200$, equal bandwidth $h = 0.345$. The demo’s summary table:]

Kernel | $\mu_2(K)$ | $R(K)$ | eff | h used | AMISE(h)
Gaussian | 1.000 | 0.282 | 0.951 | 0.345 | 0.0048
Epanechnikov | 0.200 | 0.600 | 1.000 | 0.345 | 0.0087
Biweight | 0.143 | 0.714 | 0.994 | 0.345 | 0.0104
Triangular | 0.167 | 0.667 | 0.986 | 0.345 | 0.0097
Uniform | 0.333 | 0.500 | 0.930 | 0.345 | 0.0073

The five kernels track the true density so closely that in any given sample you rarely see a meaningful difference. Epanechnikov is AMISE-optimal (efficiency = 1), but the next three are within a percent; Uniform trails by ~7%. In practice, the bandwidth h carries all the action.

Remark 5 Why the kernel choice rarely matters

The efficiencies in Theorem 2 all lie in $[0.93, 1.00]$. Substituting into the AMISE-optimal formula from §30.6, switching from Epanechnikov to Gaussian changes the optimal AMISE value by less than 5%, and the optimal bandwidths differ only by a multiplicative constant. The practical consequence: once we’ve committed to using a kernel with finite variance and roughness, the bandwidth choice does almost all the work. §30.8 is the real bandwidth chapter; §30.3 is a tour of an almost-equivalence class.

30.4 Leading-order bias and variance

The kernel and bandwidth are in place. Next we compute the mean and variance of $\hat f_h(x)$ at a fixed $x$ and extract their leading-order behavior as $n \to \infty$, $h \to 0$. This is Rosenblatt 1956’s original calculation; it sets up everything that follows.

Theorem 3 Leading-order bias and variance of the KDE

Let $X_1, \dots, X_n$ be iid with density $f$ having bounded, continuous second derivative $f''$ in a neighborhood of $x$ with $f(x) > 0$. Let $K$ be a symmetric kernel satisfying the moment conditions $\int K = 1$, $\int u K = 0$, $\mu_2(K) < \infty$, $R(K) < \infty$. As $n \to \infty$, $h \to 0$, and $nh \to \infty$:

$$\mathrm{Bias}\bigl(\hat f_h(x)\bigr) = \frac{h^2}{2}\,\mu_2(K)\, f''(x) + o(h^2),$$

$$\mathrm{Var}\bigl(\hat f_h(x)\bigr) = \frac{f(x)\, R(K)}{n\,h} + o\!\left(\frac{1}{nh}\right).$$

Stated here; the derivation is Step (i) and Step (ii) of the §30.6 Theorem 4 proof. Both calculations are a second-order Taylor expansion of ff around xx combined with the kernel moment conditions.

Example 4 Bias and variance at the peak of N(0,1)

For $f = \varphi$ (standard Normal), $x = 0$, Gaussian kernel ($\mu_2 = 1$, $R = (2\sqrt\pi)^{-1}$), $h = 0.5$, $n = 200$:

$$f''(0) = -\varphi(0) = -\frac{1}{\sqrt{2\pi}} \approx -0.399, \qquad f(0) = \varphi(0) \approx 0.399,$$

$$\mathrm{Bias} \approx \frac{0.25}{2} \cdot 1 \cdot (-0.399) = -0.0499, \qquad \mathrm{Var} \approx \frac{0.399 \cdot 0.2821}{200 \cdot 0.5} \approx 0.00113.$$

The bias is roughly 13% of $f(0)$; the standard deviation is roughly 8%. The leading-order formulas predict this at large $n$; at $n = 200$ there is finite-$n$ slippage, but the orders of magnitude are right.
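
The two leading-order formulas are one-liners in code; a sketch reproducing Example 4’s numbers, with the Gaussian-kernel constants hard-coded:

```ts
// Gaussian-kernel constants: mu_2(K) = 1, R(K) = 1 / (2 sqrt(pi)).
const MU2 = 1;
const RK = 1 / (2 * Math.sqrt(Math.PI)); // ≈ 0.2821

// Theorem 3, leading order only (the o(h^2) and o(1/(nh)) terms dropped).
const bias = (h: number, fppX: number): number => (h * h / 2) * MU2 * fppX;
const variance = (h: number, n: number, fX: number): number => (fX * RK) / (n * h);

const phi0 = 1 / Math.sqrt(2 * Math.PI); // f(0) = phi(0); also f''(0) = -phi(0)
console.log(bias(0.5, -phi0));         // ≈ -0.0499
console.log(variance(0.5, 200, phi0)); // ≈ 0.00113
```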

Remark 6 Why the first-order bias term vanishes

The Taylor expansion of $f(x - hu)$ around $x$ has a linear $-hu\, f'(x)$ term. When integrated against the symmetric kernel $K$, the linear term is $-h f'(x) \int u K(u)\,du = 0$. Symmetry of $K$ is doing work here — without it we’d pick up an $O(h)$ bias term, and the AMISE-optimal bandwidth in §30.6 would have a very different rate. This is also why asymmetric kernels (one-sided boundary kernels in §30.5) need separate treatment.

Remark 7 Rosenblatt's no-free-lunch result

Theorem 3 says bias is $O(h^2)$ and variance is $O((nh)^{-1})$ under standard regularity — the two rates are intrinsic to the kernel-smoothing program and cannot both be improved without changing the estimator class. A fourth-order kernel with $\mu_2(K) = 0$ can achieve $O(h^4)$ bias but potentially negative $\hat f_h$ (§30.2 Rem 4); local polynomial regression (Fan–Gijbels 1996; §30.5 Rem 9) trades one estimator structure for another. In the plain-KDE world with a symmetric nonnegative kernel, Rosenblatt 1956 is the ceiling.

30.5 MISE, AMISE, and boundary behavior

Bias and variance at a single point $x$ are the building blocks. The relevant global loss — the one the AMISE-optimal bandwidth minimizes — integrates pointwise mean-squared error over the full domain. We derive this integrated form and then flag the one regime where Theorem 3 does not hold: the boundary of a bounded-support density.

The pointwise mean-squared error is $\mathrm{MSE}(x) = \mathrm{Bias}^2(x) + \mathrm{Var}(x)$. Integrating over $x$,

$$\mathrm{MISE}(h) \;=\; \mathbb{E}\!\int\bigl(\hat f_h(x) - f(x)\bigr)^2\,dx \;=\; \int \mathrm{Bias}^2(x)\,dx + \int \mathrm{Var}(x)\,dx$$

is the mean integrated squared error, the canonical global loss functional. The asymptotic MISE (AMISE) keeps only the leading-order terms from Theorem 3:

$$\mathrm{AMISE}(h) \;=\; \frac{h^4\,\mu_2(K)^2\, R(f'')}{4} \;+\; \frac{R(K)}{n\,h},$$

where $R(f'') := \int f''(x)^2\,dx$ is the curvature integral of $f$. The two terms are the integrated squared bias (the first) and the integrated variance (the second) — a log-log plot shows them as a line with slope $+4$ and a line with slope $-1$, whose sum has a unique minimum we’ll compute in §30.6.
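
AMISE is directly computable once the two constants are known; a sketch, using the closed-form $R(\varphi'') = 3/(8\sqrt\pi)$ for the standard Normal (quoted again in Example 7), reproduces Example 5’s numbers:

```ts
// AMISE(h) = h^4 mu2(K)^2 R(f'') / 4 + R(K) / (n h).
function amise(h: number, n: number, mu2K: number, rK: number, rFpp: number): number {
  return (h ** 4) * (mu2K ** 2) * rFpp / 4 + rK / (n * h);
}

// f = N(0,1) with a Gaussian kernel: mu2 = 1, R(K) = 1/(2 sqrt(pi)),
// R(f'') = 3/(8 sqrt(pi)) ≈ 0.2116.
const rK = 1 / (2 * Math.sqrt(Math.PI));
const rFpp = 3 / (8 * Math.sqrt(Math.PI));
console.log(amise(0.3671, 200, 1, rK, rFpp)); // ≈ 0.00480, Example 5's AMISE at h*
```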

Example 5 MISE vs AMISE at finite n

For $f = \varphi$, Gaussian kernel, $n = 200$: the closed-form AMISE-optimal bandwidth (from §30.6 Thm 4) is $h^\ast \approx 0.3671$. At that $h$, AMISE $\approx 0.00480$. The exact MISE computed by numerical integration at the same bandwidth is a few percent higher. The gap is the $o(h^2)$ and $o(1/(nh))$ terms we dropped; at $n = 200$ they are nonzero but small, and they vanish as $n \to \infty$.

[Figure: three log-log panels — integrated squared bias $h^4 \mu_2(K)^2 R(f'')/4$ vs $h$ (slope +4), integrated variance $R(K)/(nh)$ vs $h$ (slope −1), and their sum AMISE with the $h^\ast$ minimum marked; $f = N(0,1)$, Gaussian kernel, $n = 200$.]

Figure 3. The integrated squared bias (slope +4), integrated variance (slope −1), and their sum AMISE on log-log axes for $f =$ Normal$(0, 1)$, Gaussian kernel, $n = 200$. The AMISE minimum at $h^\ast \approx 0.37$ marks the AMISE-optimal bandwidth Theorem 4 characterizes.

Example 6 Boundary bias on Beta(2, 5)

Beta$(2, 5)$ has density $f(x) = 30 x (1 - x)^4$ on $[0, 1]$ and $f(x) = 0$ outside. A naïve Gaussian KDE on a Beta$(2, 5)$ sample places nonzero density at negative $x$ (where the true density is zero) because the Gaussian kernel has infinite support. The leakage is not a small correction: at $n = 500$ with Silverman bandwidth $h \approx 0.049$, the mass below $x = 0$ is approximately $0.014$, or 1.4% of the total probability mass. Figure 4 shows the naïve and reflection-corrected KDEs side by side; the reflection trick (augment the sample with $\{-X_i\}$ and renormalize) recovers nearly all the leaked mass.

[Figure: naive Gaussian KDE on a Beta(2, 5) sample with mass leaked below zero (left) vs reflection-corrected KDE with the leaked mass restored (right); true density overlaid on both panels.]

Figure 4. Naïve Gaussian KDE (left) vs. Jones 1993 reflection-corrected KDE (right) on a Beta$(2, 5)$ sample at $n = 500$. The naïve estimator over-smooths across the lower boundary; reflection restores the mass and matches the true density much more closely near $x = 0$.

[Interactive demo: Boundary-bias demo — densities on bounded support leak KDE mass across the edge; Jones 1993 reflection patches it. At the Beta(2, 5) preset with Silverman $h = 0.049$: the naive Gaussian KDE leaks mass 0.0136 outside the support (0.0135 below, 0.0001 above); the reflection-corrected KDE reflects about the support boundaries and restores the mass to $[0, 1]$.]

Reflection is the simplest of the boundary fixes — Fan–Gijbels 1996’s local-linear estimator is the modern treatment, delivering the correction and the regression extension together. See the §30.10 pointer to formalml local-regression.

Remark 8 Why boundary bias is not a small correction

For a density with bounded support $[a, b]$ and a kernel with support $[-r, r]$ (or all of $\mathbb{R}$ with rapid decay), the KDE near $x = a$ averages over $X_i$ values in a window that extends below $a$. But no $X_i$ values are below $a$ — the sample is entirely on $[a, b]$. The KDE therefore under-estimates $f(a)$ and smooths probability mass into the forbidden region $(-\infty, a)$. At $x$ exactly at the boundary, the bias is $O(h)$ — worse than the interior $O(h^2)$ — and the AMISE analysis of §30.6 does not apply. Any production density estimator for bounded-support data needs a boundary correction.

Remark 9 Reflection, local-linear, and the path to formalml

Three standard boundary corrections: Jones 1993 reflection (simplest; used by BoundaryBiasDemo); Müller 1991 boundary kernels (asymmetric kernels within one bandwidth of the boundary); Fan–Gijbels 1996 local polynomial regression, which automatically adapts at the boundary as a byproduct of fitting a local polynomial rather than a local constant. Local polynomial is the modern standard and generalizes cleanly to the regression setting $E[Y \mid X]$; the full treatment is deferred to formalml local-regression. Topic 30 keeps things simple with reflection.

30.6 Optimal bandwidth and Epanechnikov’s theorem

Two theorems anchor the featured section. Theorem 4 derives the AMISE-optimal bandwidth from the decomposition in §30.5 — the full four-step proof is the load-bearing calculation of the topic. Theorem 5 picks the kernel that minimizes the resulting AMISE among symmetric nonnegative kernels with a fixed second moment — a short calculus-of-variations argument whose answer is Epanechnikov’s $K_E$.

Theorem 4 AMISE decomposition and optimal bandwidth (featured)

Let $X_1, \dots, X_n$ be iid with density $f$ having bounded, continuous second derivative $f''$. Let $K$ be a symmetric kernel with (K1) $\int K = 1$, (K2) $\int u K = 0$, (K3) $\mu_2(K) \in (0, \infty)$, (K4) $R(K) < \infty$. As $n \to \infty$, $h \to 0$, $nh \to \infty$,

$$\mathrm{AMISE}(h) \;=\; \frac{R(K)}{nh} + \frac{h^4\,\mu_2(K)^2\, R(f'')}{4},$$

where $R(f'') := \int f''(x)^2\,dx$. The AMISE-optimal bandwidth is

$$h^\ast \;=\; \left(\frac{R(K)}{n\,\mu_2(K)^2\, R(f'')}\right)^{1/5} \;=\; O(n^{-1/5}),$$

with matching minimum AMISE

$$\mathrm{AMISE}^\ast \;=\; \frac{5}{4}\,\bigl\{R(K)^4\, \mu_2(K)^2\, R(f'')\bigr\}^{1/5}\, n^{-4/5}.$$
Proof.

The proof has four steps: (i) compute the bias via Taylor expansion of ff; (ii) compute the variance via the iid-sum identity; (iii) integrate over xx to obtain AMISE; (iv) minimize in hh.

Step (i): Bias. Substitute $u = (x - y)/h$ in the definition of the expected value:

$$\mathbb{E}\bigl[\hat f_h(x)\bigr] = \int \frac{1}{h} K\!\left(\frac{x-y}{h}\right) f(y)\,dy = \int K(u)\, f(x - hu)\,du.$$

Taylor-expand $f(x - hu)$ around $x$ to second order, using the regularity of $f''$:

$$f(x - hu) = f(x) - hu\, f'(x) + \frac{(hu)^2}{2}\, f''(x) + o(h^2),$$

where the remainder is $o(h^2)$ uniformly in $u$ on the support of $K$. Substituting into the expectation integral,

$$\mathbb{E}\bigl[\hat f_h(x)\bigr] = f(x)\int K(u)\,du - h\, f'(x)\int u\, K(u)\,du + \frac{h^2}{2}\, f''(x)\int u^2\, K(u)\,du + o(h^2).$$

Applying the kernel moment conditions (K1)–(K3):

$$\mathbb{E}\bigl[\hat f_h(x)\bigr] = f(x) \cdot 1 - h\, f'(x) \cdot 0 + \frac{h^2}{2}\, f''(x)\, \mu_2(K) + o(h^2).$$

Subtracting $f(x)$ yields the leading-order bias $\mathrm{Bias}(\hat f_h(x)) = \frac{h^2}{2}\,\mu_2(K)\, f''(x) + o(h^2)$.

Step (ii): Variance. Since $\hat f_h(x) = \frac{1}{n}\sum_i Z_i$ where $Z_i := h^{-1} K((x-X_i)/h)$ are iid,

$$\mathrm{Var}\bigl(\hat f_h(x)\bigr) = \frac{1}{n}\, \mathrm{Var}(Z_1) = \frac{1}{n}\!\left\{\mathbb{E}[Z_1^2] - \mathbb{E}[Z_1]^2\right\}.$$

Compute the second moment of $Z_1$ by the same substitution:

$$\mathbb{E}[Z_1^2] = \int \frac{1}{h^2}\, K\!\left(\frac{x-y}{h}\right)^{\!2} f(y)\,dy = \frac{1}{h}\int K(u)^2\, f(x - hu)\,du.$$

Using $f(x - hu) = f(x) + O(h)$ (first-order Taylor),

$$\mathbb{E}[Z_1^2] = \frac{1}{h}\, f(x) \int K(u)^2\,du + O(1) = \frac{f(x)\, R(K)}{h} + O(1).$$

The first-moment-squared term is $\mathbb{E}[Z_1]^2 = (f(x) + O(h^2))^2 = f(x)^2 + O(h^2)$, which is $O(1)$ and therefore negligible compared to the $O(1/h)$ leading term. Combining:

$$\mathrm{Var}\bigl(\hat f_h(x)\bigr) = \frac{1}{n}\!\left\{\frac{f(x)\, R(K)}{h} + O(1)\right\} = \frac{f(x)\, R(K)}{nh} + O\!\left(\frac{1}{n}\right).$$

Under the regime $nh \to \infty$, the $O(1/n)$ term is $o(1/(nh))$.

Step (iii): Integrate to AMISE. The pointwise mean-squared error is $\mathrm{MSE}(x) = \mathrm{Bias}^2(x) + \mathrm{Var}(x)$. Integrating over $x$ and keeping only the leading terms,

$$\mathrm{AMISE}(h) = \int \mathrm{Bias}^2(x)\,dx + \int \mathrm{Var}(x)\,dx.$$

The integrated squared bias is

$$\int \mathrm{Bias}^2(x)\,dx = \int \left(\frac{h^2}{2}\,\mu_2(K)\,f''(x)\right)^{\!2} dx = \frac{h^4\,\mu_2(K)^2}{4}\,R(f''),$$

where $R(f'') := \int f''(x)^2\,dx$. The integrated variance is

$$\int \mathrm{Var}(x)\,dx = \int \frac{f(x)\,R(K)}{nh}\,dx = \frac{R(K)}{nh}\int f(x)\,dx = \frac{R(K)}{nh},$$

using $\int f = 1$. Summing,

$$\mathrm{AMISE}(h) = \frac{R(K)}{nh} + \frac{h^4\,\mu_2(K)^2\,R(f'')}{4}.$$

Step (iv): Minimize in $h$. Differentiate AMISE with respect to $h$:

$$\frac{d\,\mathrm{AMISE}}{dh} = -\frac{R(K)}{n\,h^2} + h^3\,\mu_2(K)^2\,R(f'').$$

Setting to zero and solving for $h$:

$$h^5 = \frac{R(K)}{n\,\mu_2(K)^2\,R(f'')}, \qquad h^\ast = \left(\frac{R(K)}{n\,\mu_2(K)^2\,R(f'')}\right)^{\!1/5}.$$

The second derivative

$$\frac{d^2\,\mathrm{AMISE}}{dh^2} = \frac{2R(K)}{n\,h^3} + 3 h^2\,\mu_2(K)^2\,R(f'') > 0,$$

confirms $h^\ast$ is the unique minimum. Substituting $h^\ast$ into AMISE: the first term evaluates to

$$\frac{R(K)}{n\,h^\ast} = \frac{R(K)}{n}\left(\frac{n\,\mu_2(K)^2\,R(f'')}{R(K)}\right)^{\!1/5} = R(K)^{4/5}\, \mu_2(K)^{2/5}\, R(f'')^{1/5}\, n^{-4/5},$$

and the second term evaluates to

$$\frac{(h^\ast)^4\, \mu_2(K)^2\, R(f'')}{4} = \frac{1}{4}\, R(K)^{4/5}\, \mu_2(K)^{2/5}\, R(f'')^{1/5}\, n^{-4/5}.$$

Their sum is $\mathrm{AMISE}^\ast = \frac{5}{4}\, R(K)^{4/5}\, \mu_2(K)^{2/5}\, R(f'')^{1/5}\, n^{-4/5} = \frac{5}{4}\bigl\{R(K)^4\, \mu_2(K)^2\, R(f'')\bigr\}^{1/5}\, n^{-4/5}$.

$\blacksquare$ — using kernel moment conditions (K1)–(K4), second-order Taylor expansion of $f$, iid-variance accounting, and elementary calculus (ROS1956; PAR1962; WAN1995 §2.5; SIL1986 §3.3).
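
A brute-force check of Step (iv) (a sketch with the $f = N(0,1)$ / Gaussian-kernel constants): the closed-form $h^\ast$ should match a grid minimizer of AMISE to grid resolution.

```ts
const rK = 1 / (2 * Math.sqrt(Math.PI));   // R(K), Gaussian kernel
const rFpp = 3 / (8 * Math.sqrt(Math.PI)); // R(f'') for f = N(0,1)
const n = 200;
const amise = (h: number): number => rK / (n * h) + (h ** 4) * rFpp / 4; // mu2(K) = 1

const hClosedForm = Math.pow(rK / (n * rFpp), 1 / 5);

// Grid search over h; both answers should agree to ~1e-4.
let bestH = NaN, bestV = Infinity;
for (let h = 0.05; h <= 1.5; h += 0.0001) {
  const v = amise(h);
  if (v < bestV) { bestV = v; bestH = h; }
}
console.log(hClosedForm.toFixed(4), bestH.toFixed(4)); // both ≈ 0.3671
```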

Theorem 5 Epanechnikov's optimality theorem

Consider the class of symmetric kernels $K : \mathbb{R} \to \mathbb{R}_{\ge 0}$ satisfying $\int K = 1$ and $\mu_2(K) = \sigma_0^2$ for a fixed positive constant $\sigma_0^2$. The kernel minimizing $R(K) = \int K^2$ subject to these constraints is the Epanechnikov kernel

$$K_E(u) = \frac{3}{4}\,(1 - u^2)\,\mathbf{1}\{|u| \le 1\}, \qquad \mu_2(K_E) = \frac{1}{5}, \quad R(K_E) = \frac{3}{5}.$$

(By rescaling $h$, the conclusion holds for any fixed $\sigma_0^2 > 0$; the statement with $\sigma_0^2 = 1/5$ fixes the conventional support $[-1, 1]$.)

Proof.

Minimizing the quadratic functional $R(K) = \int K(u)^2\,du$ subject to the two linear constraints $\int K = 1$ and $\int u^2 K = \frac{1}{5}$ plus the symmetry and nonnegativity constraints, we form the Lagrangian

$$\mathcal{L}[K] = \int K(u)^2\,du + \lambda_0\!\left(\int K - 1\right) + \lambda_2\!\left(\int u^2 K - \frac{1}{5}\right).$$

Symmetry fixes the first-moment multiplier at zero. The pointwise Euler–Lagrange first-order condition is $2 K(u) + \lambda_0 + \lambda_2 u^2 = 0$, giving

$$K(u) = -\frac{\lambda_0}{2} - \frac{\lambda_2}{2}\, u^2 = a - b\, u^2$$

on the region where this is positive, with $K = 0$ outside. Write $K(u) = a(1 - u^2/r^2)\,\mathbf{1}\{|u| \le r\}$, where $r := \sqrt{a/b}$ is the support radius (forced by $K(\pm r) = 0$). The two constraints pin $a$ and $r$:

$$\int K = a \int_{-r}^{r}(1 - u^2/r^2)\,du = a\,\frac{4r}{3} = 1 \implies a = \frac{3}{4r},$$

$$\int u^2 K = a \int_{-r}^{r}(u^2 - u^4/r^2)\,du = a\left(\frac{2r^3}{3} - \frac{2r^3}{5}\right) = \frac{4ar^3}{15} = \frac{r^2}{5} \quad\text{(using } a = 3/(4r)\text{).}$$

Setting $\mu_2(K) = 1/5$ gives $r^2 = 1$, so $r = 1$ and $a = 3/4$, yielding

$$K_E(u) = \frac{3}{4}\,(1 - u^2)\, \mathbf{1}\{|u| \le 1\}.$$

The second-derivative test is automatic: the objective $\int K^2$ is strictly convex in $K$, so the stationary point is a global minimum. Computing the optimum:

$$R(K_E) = \int_{-1}^{1} \frac{9}{16}(1 - u^2)^2\,du = \frac{9}{16} \cdot \frac{16}{15} = \frac{3}{5}.$$

$\blacksquare$ — using Lagrangian variational principles with linear constraints and nonnegativity, the Euler–Lagrange condition for a quadratic objective, and elementary calculus of variations (EPA1969; WAN1995 §2.7).

[Interactive demo: Bandwidth explorer — three synchronized panels. Panel A: KDE vs the true Normal(0, 1) density at $n = 500$, Gaussian kernel, $h = 0.306$. Panel B: bias² (amber), variance (blue), and AMISE (rose) vs $h$ on log-log axes; dashed = oracle $h^\ast$, solid = current $h$. Panel C: AMISE$^\ast(n)$ vs $n$ on log-log axes, fitted slope −0.8000 (theoretical −0.8). Readouts at $h = h^\ast = 0.3056$: bias² ≈ 0.00046, variance ≈ 0.00185, AMISE ≈ 0.00231. Slide $h$ off $h^\ast$ and watch which term explodes.]

Slide $h$ to the left: variance explodes and AMISE follows. Slide it to the right: bias² takes over. The oracle $h^\ast$ is the unique minimum of the rose curve — it sits slightly left of where the two decomposition curves cross, since at $h^\ast$ the integrated variance is still four times the integrated squared bias (the 1:4 split behind the $5/4$ constant in AMISE$^\ast$). Switch between Gaussian / Epanechnikov / Triangular kernels: the minimum barely moves — kernel choice is negligible relative to bandwidth choice.

Example 7 Oracle h* for f = N(0,1), Gaussian kernel, n = 100

With $f = \varphi$ and Gaussian kernel, the constants are $R(K) = 1/(2\sqrt\pi) \approx 0.2821$, $\mu_2(K) = 1$, and $R(\varphi'') = 3/(8\sqrt\pi) \approx 0.2116$. Substituting into Theorem 4:

$$h^\ast = \left(\frac{0.2821}{100 \cdot 1 \cdot 0.2116}\right)^{\!1/5} = (0.01333)^{1/5} \approx 0.4217.$$

This matches the T30.12 test pin — the oracle-bandwidth computation at $n = 100$ for $f = N(0, 1)$. At $n = 1000$ the bandwidth shrinks by the factor $10^{-1/5} \approx 0.631$ (T30.13), and the optimal AMISE shrinks by $10^{-4/5} \approx 0.158$ (T30.14).
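
The three pinned numbers are reproducible in a few lines (a sketch; the actual test harness is not shown in this topic):

```ts
const rK = 1 / (2 * Math.sqrt(Math.PI));
const rFpp = 3 / (8 * Math.sqrt(Math.PI));
const hStar = (n: number): number => Math.pow(rK / (n * rFpp), 1 / 5);

console.log(hStar(100));               // ≈ 0.4217             (T30.12)
console.log(hStar(1000) / hStar(100)); // ≈ 0.631 = 10^(-1/5)  (T30.13)
console.log(Math.pow(10, -4 / 5));     // ≈ 0.158, AMISE* factor (T30.14)
```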

[Figure: AMISE curves for Gaussian, Epanechnikov, biweight, and triangular kernels at $f = N(0,1)$, $n = 200$ (left), with a zoom on the near-aligned minima between 0.37 and 0.40 (right).]

Figure 5. AMISE for four kernels on $f =$ Normal$(0, 1)$, $n = 200$. Left: full range. Right: zoom on $[0.2, 0.4]$ showing the near-identical minima, confirming that kernel choice barely matters relative to bandwidth choice.

[Figure: log-log plot of AMISE* vs $n$ for $n \in \{50, 100, 200, 500, 1000, 2000, 5000, 10000\}$ at $f = N(0,1)$, Gaussian kernel; empirical slope matches −0.8, with a dotted slope −1 parametric reference line.]

Figure 6. Oracle AMISE$^\ast$ vs $n$ on log-log axes. The slope is the theoretical $-4/5 = -0.8$; the faint dotted line at slope $-1$ is the parametric comparison. The gap between the two slopes is the price of distribution-freeness.

Remark 10 Why the n^(-1/5) rate is slower than parametric

A parametric MLE of a Normal mean has MSE $O(n^{-1})$ — the square of the standard $\sqrt n$-rate error. The KDE’s AMISE rate $O(n^{-4/5})$ is slower because we’re estimating an entire function, not a finite-dimensional parameter. The rate is a direct consequence of Theorem 4: the bias–variance trade-off forces $h \to 0$ at rate $n^{-1/5}$, not the $n^{-1/2}$ that parametric estimators enjoy. In high dimensions the rate gets worse still — Scott 2015 Ch. 6 derives the multivariate rate $O(n^{-4/(4+d)})$, which we’ll pointer-forward in §30.10.

Remark 11 Structural parallel to Bahadur linearization

The AMISE-optimal $h^\ast$ comes from the first-order condition on the AMISE objective. The structural parallel to the Bahadur representation of the sample quantile (Topic 29 §29.6) is that both reduce nonparametric inference to a balance between a variance term that shrinks with more data and a bias term controlled by an asymptotic parameter: bandwidth $h$ in the KDE case, remainder-term order in the Bahadur case. Every Track 8 result is a version of this trade-off — bootstrap, empirical processes, and density-ratio estimation each have their own “h”: a resample count, a bracketing radius, a regularization strength.

Remark 12 Kernel choice negligibility

Theorem 5 minimizes $R(K)$ subject to $\mu_2(K) = \sigma_0^2$. Substituting into $\mathrm{AMISE}^\ast$, the kernel enters through the factor $\{R(K)^4 \mu_2(K)^2\}^{1/5}$. For the five standard kernels this factor varies by only about 6% from best (Epanechnikov) to worst (uniform), and by about 4% from Epanechnikov to Gaussian. Switching kernels therefore moves AMISE$^\ast$ by a few percent; bandwidth choice, by contrast, can move AMISE by an order of magnitude. The practical advice: pick any standard kernel (Gaussian for simplicity, Epanechnikov for compact support), and spend your energy on bandwidth selection (§30.8).

Remark 13 Higher-order kernels and faster rates

A fourth-order kernel is a kernel with $\mu_2(K) = 0$; the bias in Theorem 3 then vanishes to leading order, and the next term is $O(h^4)$ involving $f^{(4)}$. The AMISE rate improves from $n^{-4/5}$ to $n^{-8/9}$ — a meaningful speedup at large $n$. The catch: a fourth-order kernel must take negative values (to achieve $\int u^2 K = 0$ with $\int K = 1$), so $\hat f_h$ can be negative. In practice this is usually patched by truncation, but Topic 30 follows the standard monograph convention of restricting to second-order kernels. Serfling 1980 §1.11 and Wand–Jones 1995 §2.8 cover higher-order kernels at length; formalml may return to them in the context of density-ratio estimation.

30.7 Pointwise asymptotic normality and confidence intervals

Theorem 4 gives us the MISE rate; Theorem 6 gives us pointwise asymptotic normality, from which Gaussian-calibrated confidence intervals for $f(x)$ follow immediately. The proof technique — Lindeberg–Feller for triangular arrays — is the same machinery that powered Topic 11 §11.6’s general CLT; we state it and sketch the verification.

Theorem 6 Parzen's pointwise asymptotic normality (stated with sketch)

Under the hypotheses of Theorem 4 (symmetric kernel satisfying $\int K = 1$, $\int u K = 0$, $\mu_2(K) < \infty$, $R(K) < \infty$; $f$ continuous at $x$ with $f(x) > 0$), and under the bandwidth regime $h \to 0$, $nh \to \infty$,

$$\sqrt{nh}\,\Bigl(\hat f_h(x) - \mathbb{E}[\hat f_h(x)]\Bigr) \;\xrightarrow{d}\; \mathcal{N}\!\left(0,\; f(x)\, R(K)\right).$$
Proof.

Write $\hat f_h(x) = \frac{1}{n}\sum_{i=1}^n Z_{n,i}$, where $Z_{n,i} := h^{-1} K((x - X_i)/h)$ form a triangular array of iid random variables — iid within each row indexed by $n$, but with a distribution that depends on $n$ through the bandwidth $h = h(n)$. The variance per term is

$$\mathrm{Var}(Z_{n,i}) = \frac{f(x)\,R(K)}{h}\,(1 + o(1)),$$

from Theorem 3 (Step (ii) of Theorem 4’s proof, multiplied through by the $h^{-1}$ normalization). Multiplying the standardizing factor $\sqrt{nh}$ onto the centered sum gives

$$\sqrt{nh}\,\bigl(\hat f_h(x) - \mathbb{E}[\hat f_h(x)]\bigr) = \frac{1}{\sqrt{n/h}}\sum_{i=1}^n (Z_{n,i} - \mathbb{E}[Z_{n,i}]),$$

whose variance is $\frac{nh}{n} \cdot \mathrm{Var}(Z_{n,i}) = h \cdot f(x) R(K)/h \cdot (1 + o(1)) = f(x)\,R(K) + o(1)$. Apply the Lindeberg–Feller central limit theorem for triangular arrays (Topic 11 §11.6 Thm 2): check the Lindeberg condition

$$\lim_{n \to \infty}\; \frac{n}{f(x)\,R(K)}\,\mathbb{E}\!\left[\bigl(Z_{n,1}/\sqrt{n/h} - \mathbb{E}[Z_{n,1}]/\sqrt{n/h}\bigr)^2\, \mathbf{1}\{\cdots > \varepsilon\}\right] = 0 \quad \text{for all } \varepsilon > 0.$$

Boundedness of $K$ (compact support in the Epanechnikov case, or $e^{-u^2/2}$ decay in the Gaussian case) ensures $|Z_{n,i}| \le \|K\|_\infty / h$, and the event inside the indicator becomes empty for $n$ large enough because the standardized summand is $O(h^{-1/2}/\sqrt{n}) = O(1/\sqrt{nh}) \to 0$. The Lindeberg condition holds.

$\blacksquare$ — using the Lindeberg–Feller CLT for triangular arrays (Topic 11 §11.6 Thm 2), the variance calculation from Thm 3, and boundedness of $K$ (PAR1962 §3–§5; vdV2000 §24.1).

Theorem 6 centers the CI at $\mathbb{E}[\hat f_h(x)]$, not at $f(x)$. At finite $n$ there is a bias shift of magnitude $\frac{h^2}{2}\mu_2(K) f''(x)$, and the resulting pointwise CI has coverage slightly below the nominal $1 - \alpha$. For the purposes of this topic, we ignore the bias correction and use the plug-in variance

$$\hat V(x) \;=\; \frac{\hat f_h(x)\, R(K)}{n\,h}, \qquad \mathrm{CI}_{1-\alpha}(x) \;=\; \hat f_h(x) \;\pm\; z_{1 - \alpha/2}\,\sqrt{\hat V(x)},$$

noting the coverage will be biased-low at modest $n$ unless $h$ is undersmoothed.

Example 8 Pointwise CI at x = 0 for N(0,1), n = 500

For a Normal$(0, 1)$ sample of size $n = 500$ (seed-42 NumPy draw, $\hat\sigma \approx 0.98$), Silverman’s bandwidth $\hat h_{\mathrm{ROT}} \approx 0.300$. At $x = 0$:

$$\hat f_h(0) \approx 0.388, \qquad \hat V(0) = \frac{0.388 \cdot 0.2821}{500 \cdot 0.300} \approx 7.30 \times 10^{-4}, \qquad \sqrt{\hat V(0)} \approx 0.027.$$

The 95% pointwise CI is $0.388 \pm 1.96 \cdot 0.027 = [0.335, 0.441]$. The true density value $f(0) = \varphi(0) \approx 0.399$ lies comfortably inside. This is the T30.11 test pin.
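
The CI is a two-line computation; a sketch reproducing Example 8’s arithmetic (the sample-specific value $\hat f_h(0) \approx 0.388$ is taken as given):

```ts
// Plug-in pointwise CI: f_hat(x) ± z * sqrt(f_hat(x) R(K) / (n h)).
function pointwiseCI(fHatX: number, h: number, n: number, rK: number, z = 1.96): [number, number] {
  const se = Math.sqrt((fHatX * rK) / (n * h));
  return [fHatX - z * se, fHatX + z * se];
}

const rK = 1 / (2 * Math.sqrt(Math.PI)); // Gaussian kernel
console.log(pointwiseCI(0.388, 0.3, 500, rK)); // ≈ [0.335, 0.441]
```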

Example 9 Density-at-quantile: Bahadur's plug-in (discharge of Topic 29 §29.6 Rem 12)

Topic 29 §29.6 Rem 12 named KDE as the tool for estimating $f(\xi_p)$ — the density in Bahadur’s limit variance $\sigma^2 = p(1 - p)/f(\xi_p)^2$. Here we discharge that promise.

Take a Normal$(0, 1)$ sample of size $n = 500$ and focus on the $p = 0.9$ quantile (the upper decile). The Type-7 sample quantile is $\hat\xi_{0.9} \approx 1.30$. The Silverman-bandwidth KDE evaluated at that quantile is

$$\hat f_{\hat h}(\hat\xi_{0.9}) \approx 0.175,$$

giving the Bahadur-plug-in asymptotic standard error

$$\hat{\mathrm{SE}}(\hat\xi_{0.9}) \;=\; \sqrt{\frac{0.9 \cdot 0.1}{500 \cdot 0.175^2}} \;\approx\; 0.0767.$$

The resulting 95% asymptotic CI is $1.30 \pm 1.96 \cdot 0.0767 \approx [1.15, 1.45]$. For comparison, the distribution-free binomial-exact CI from Topic 29 §29.7 is $[X_{(r)}, X_{(s)}]$ with $r$ and $s$ chosen so the coverage is at least $1 - \alpha$; at $n = 500$, $p = 0.9$, $\alpha = 0.05$ it comes out at a very similar width. The plug-in CI’s advantage is its closed-form standard error — asymptotic efficiency at the cost of nominal-coverage validity at finite $n$.
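
The plug-in standard error in code (a sketch of Example 9’s arithmetic; the sample-specific quantile 1.30 and KDE value 0.175 are taken as given):

```ts
// Bahadur plug-in SE: sqrt(p (1 - p) / (n * fHat(xiHat)^2)).
const bahadurSE = (p: number, n: number, fAtQ: number): number =>
  Math.sqrt((p * (1 - p)) / (n * fAtQ * fAtQ));

const se = bahadurSE(0.9, 500, 0.175);           // ≈ 0.0767
console.log(1.30 - 1.96 * se, 1.30 + 1.96 * se); // ≈ [1.15, 1.45]
```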

[Figure: empirical coverage of the pointwise 95% CI at $x = 0$ under $f = N(0,1)$ vs sample size, over 2000 MC replicates at each $n \in \{100, 200, 500, 1000, 2000, 5000\}$.]

Figure 7. Empirical coverage of the 95% pointwise CI for $f(0)$ under $f =$ Normal$(0, 1)$ across sample sizes. At small $n$ the bias shift in Theorem 6 depresses coverage below $95\%$; by $n = 5000$ the coverage is within Monte-Carlo noise of nominal.

Remark 14 Why the CI has biased coverage at small n

The CI is centered at $\hat f_h(x)$, which is a consistent estimator of $\mathbb{E}[\hat f_h(x)] = f(x) + \frac{h^2}{2}\mu_2(K) f''(x) + o(h^2)$ — not $f(x)$ itself. At the oracle bandwidth $h^\ast$ the bias is $O(n^{-2/5})$, and the interval width is $O(n^{-2/5})$ as well, so the bias is of the same order as the width. Coverage improves as $n$ grows (the $o(h^2)$ terms shrink faster than the width) but remains biased-low at the finite $n$ ML practitioners see in practice. Bias-corrected CIs (plug-in estimates of $f''$, or undersmoothing) exist; vdV2000 §24.1 is a standard reference.

Remark 15 Simultaneous CIs and the bootstrap

Theorem 6 is pointwise — it gives a CI for $f(x_0)$ at a single preselected $x_0$. For a joint CI covering $f$ over a range $[a, b]$ at level $1 - \alpha$, we need Bickel–Rosenblatt-type machinery, which turns out to be a functional CLT in the same Donsker-theory framework Topic 29 §29.8 used for the Kolmogorov limit. The cleanest route in practice is the smooth bootstrap (Topic 31 §31.8; preview at §30.10 Ex 13), which constructs simultaneous envelopes by resampling from $\hat f_h$ itself. The full treatment is forthcoming on formalml.

30.8 Data-driven bandwidth selection

The oracle $h^\ast$ from Theorem 4 requires the curvature integral $R(f'')$, which the practitioner does not know. Data-driven bandwidth selectors estimate $R(f'')$ — or skip it entirely — and return a data-dependent $\hat h$. The four standard selectors are:

  • Silverman’s rule-of-thumb (the default in R’s bw.nrd0): a Normal-scale plug-in that assumes $f \approx N(\mu, \sigma^2)$.
  • Scott’s rule: Silverman without the IQR-robustness adjustment.
  • Unbiased cross-validation (UCV): minimizes an unbiased estimator of $\mathrm{MISE}(h) - \int f^2$ (the dropped term does not depend on $h$).
  • Sheather–Jones plug-in (SHJ): a one-step direct plug-in that estimates $R(f'')$ from a pilot KDE.

Silverman and Scott are cheap closed forms; UCV is a grid search costing $O(n^2)$ per candidate bandwidth ($O(n^2 G)$ over a grid of size $G$); SHJ is an $O(n^2)$ double-sum followed by a closed-form plug-in. All four are implemented in nonparametric.ts §10 or inlined in PluginBandwidthComparator (§5.3).
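
A sketch of the two closed-form rules (the actual nonparametric.ts §10 implementations may differ, e.g. in the quantile convention behind the IQR):

```ts
// Silverman's rule of thumb: 1.06 * min(SD, IQR / 1.34) * n^(-1/5).
function silvermanBandwidth(sample: number[]): number {
  const n = sample.length;
  const mean = sample.reduce((a, b) => a + b, 0) / n;
  const sd = Math.sqrt(sample.reduce((a, b) => a + (b - mean) ** 2, 0) / (n - 1));
  const sorted = [...sample].sort((a, b) => a - b);
  const q = (p: number) => sorted[Math.min(n - 1, Math.floor(p * (n - 1)))]; // crude quantile
  const scale = Math.min(sd, (q(0.75) - q(0.25)) / 1.34); // robust scale estimate
  return 1.06 * scale * Math.pow(n, -1 / 5);
}

// Scott's rule, per this topic's convention: the same Normal-reference
// constant without the IQR-robustness adjustment.
function scottBandwidth(sample: number[]): number {
  const n = sample.length;
  const mean = sample.reduce((a, b) => a + b, 0) / n;
  const sd = Math.sqrt(sample.reduce((a, b) => a + (b - mean) ** 2, 0) / (n - 1));
  return 1.06 * sd * Math.pow(n, -1 / 5);
}
```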

Definition 4 Mean integrated squared error (MISE)

For an estimator $\hat f_h$ of a density $f$, the mean integrated squared error is

$$\mathrm{MISE}(h) \;:=\; \mathbb{E}\!\int\bigl(\hat f_h(x) - f(x)\bigr)^2\,dx.$$

The AMISE from §30.5 is the leading-order approximation to MISE; bandwidth selectors typically minimize one or the other. Note $\mathrm{MISE}(h) = \int \mathrm{Var}(\hat f_h(x))\,dx + \int \mathrm{Bias}^2(x)\,dx$ — the integrated decomposition that drives everything in Theorem 4.

Theorem 7 Silverman's rule of thumb

Under the Normal-scale reference rule — assume $f$ is Normal$(\mu, \sigma^2)$ — the AMISE-optimal bandwidth (Theorem 4) with Gaussian kernel simplifies to

$$h^\ast_{\text{Silverman}} \;=\; \left(\frac{4}{3 n}\right)^{\!1/5} \hat\sigma \;\approx\; 1.06\, \hat\sigma\, n^{-1/5},$$

where $\hat\sigma$ is a scale estimate. Silverman’s robust variant uses $\hat\sigma = \min(\mathrm{SD}, \mathrm{IQR}/1.34)$ to guard against heavy tails or mild bimodality. Stated; the derivation substitutes $R(\varphi'') = 3/(8\sqrt\pi)$ into Theorem 4 and rearranges. (SIL1986 §3.4.2)

Example 10 Silverman over-smooths bimodal mixtures

Take a 50/50 mixture of Normal$(-1.5, 1)$ and Normal$(1.5, 1)$, $n = 200$. This is the bimodalNormalPreset from nonparametric-data.ts — the overall standard deviation is $\hat\sigma \approx \sqrt{1 + 1.5^2} \approx 1.80$, so Silverman gives

$$\hat h_{\mathrm{Silverman}} \;=\; 1.06 \cdot 1.80 \cdot 200^{-1/5} \;\approx\; 0.66.$$

But the oracle bandwidth $h^\ast$ (Theorem 4, using the closed-form $R(f'') \approx 0.092$ for this mixture) is $\approx 0.43$. Silverman is roughly $1.5\times$ too large; the resulting KDE merges the two modes into a single plateau. UCV and Sheather–Jones both estimate $R(f'')$ from the data rather than assuming a Normal reference, and both recover the bimodality cleanly. Figure 8 shows the three side by side.

[Figure: three selector KDEs on the bimodal $0.5\,N(-1.5,1) + 0.5\,N(1.5,1)$ sample at $n = 200$ — Silverman (indigo, over-smoothed into a single mode), UCV (amber, under-smoothed with visible noise), Sheather–Jones (violet, both modes recovered); true density in black.]

Figure 8. Three bandwidth selectors on a bimodal mixture at $n = 200$. Silverman over-smooths; UCV is noisy; Sheather–Jones cleanly recovers the bimodality. The pedagogical payoff: rule-of-thumb selectors are conservative by design, and their conservatism bites at densities whose curvature is not well-approximated by a Normal reference.

[Interactive demo: Data-driven bandwidth comparator — Silverman’s rule is the textbook default, but on the bimodal preset $0.5\,N(-1.5, 1) + 0.5\,N(1.5, 1)$ at $n = 200$ the two modes merge; Sheather–Jones and UCV work harder but recover the structure. The demo’s summary table:]

Selector | h | ISE | ISE vs Silverman
Silverman | 0.666 | 0.0064 | 1.00×
Scott | 0.666 | 0.0064 | 1.00×
UCV | 0.187 | 0.0159 | 2.48× (under-smoothed)
Sheather–Jones | 0.435 | 0.0069 | 1.07×

At the bimodal preset, Silverman's bandwidth is roughly $1.5\times$ the oracle $h^\ast$, which is why it over-smooths across the modes. SHJ's plug-in gets much closer by estimating the curvature $R(f'')$. UCV has the largest variance across resamples — in the demo, running 200 resamples at a smaller $n$ widens its IQR column visibly.

Remark 16 Why Silverman over-smooths bimodals

Silverman’s rule substitutes a Normal reference for $R(f'')$: for $f = N(\mu, \sigma^2)$, $R(f'') = 3/(8\sqrt\pi\, \sigma^5)$. For a bimodal density with separation $|\mu_1 - \mu_2| \approx 3\sigma$, the true $R(f'')$ can be 5–$10\times$ larger than the Normal-reference value — the two modes create sharp peaks and sharp valleys that contribute heavily to $\int f''^2$. Silverman therefore underestimates $R(f'')$ by the same factor, which puts $\hat h$ roughly $(5\text{–}10)^{1/5} \approx 1.4$–$1.6\times$ too large. The fix is to estimate $R(f'')$ from the data — UCV does this implicitly (by cross-validating against the sample), SHJ does it explicitly (via a pilot KDE’s second derivative). The PluginBandwidthComparator lets you see the effect directly: switch presets and watch the Silverman bandwidth diverge from the oracle on the bimodal case, match it on the unimodal case.

30.9 ML bridges: anomaly detection and density-based classification

KDE is a direct ingredient in two classic ML workflows — unsupervised anomaly detection and the plug-in Bayes classifier — and an indirect one in modern neural density estimators. We describe both direct cases concretely; the connection to Nadaraya–Watson regression and mean-shift clustering is a forward pointer.

Example 11 Anomaly detection via low-density flagging

Given a training sample $X_1, \dots, X_n$ representing “normal” observations, fit a Gaussian KDE $\hat f_h$ with Silverman bandwidth. At test time, flag a new point $x$ as anomalous if $\hat f_h(x) < t$ for some threshold $t$. A natural choice is $t =$ the empirical 5%-quantile of $\{\hat f_h(X_i) : i = 1, \dots, n\}$ — the KDE density values at the training points themselves. Under this choice, approximately 5% of training points are (ex post) flagged, and new points that fall into regions of similar sparsity are flagged at test time.

This is the kernel-density-estimate version of one-class classification; it requires no labeled anomalies, only a clean sample of normals. In production, practitioners usually replace KDE with a more scalable estimator (Gaussian mixture models, isolation forests, autoencoders) as $n$ grows into the millions, but the KDE version remains the pedagogical entry point and a useful sanity check at low $n$.
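
A sketch of the low-density flagging recipe (the training data, bandwidth, and 5% level below are placeholders):

```ts
const phi = (u: number): number => Math.exp(-0.5 * u * u) / Math.sqrt(2 * Math.PI);
const kde = (sample: number[], h: number) => (x: number): number =>
  sample.reduce((s, xi) => s + phi((x - xi) / h), 0) / (sample.length * h);

// Fit: KDE on the "normal" training sample; threshold at the empirical
// alpha-quantile of the training points' own density values.
function fitAnomalyDetector(train: number[], h: number, alpha = 0.05): (x: number) => boolean {
  const f = kde(train, h);
  const scores = train.map(f).sort((a, b) => a - b);
  const t = scores[Math.floor(alpha * scores.length)];
  return (x: number) => f(x) < t; // true = flagged as anomalous
}

// Usage: points far from the training mass land below the threshold.
const detect = fitAnomalyDetector([0.1, -0.3, 0.7, 1.2, -0.9], 0.5); // toy data
console.log(detect(5.0)); // true — density at 5.0 is far below the cutoff
```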

Example 12 Density-based Bayes classifier and naïve Bayes

The plug-in Bayes classifier (Fix–Hodges 1951) estimates class-conditional densities $\hat f_{h, c}(x)$ via a separate KDE on each class’s training sample, then assigns test point $x$ to

$$\hat c(x) \;=\; \arg\max_c\; \hat\pi_c \cdot \hat f_{h, c}(x),$$

where $\hat\pi_c = n_c/n$ is the estimated class prior. In $d$ dimensions, full multivariate KDE suffers from the curse of dimensionality (§30.10 Rem 21), so naïve Bayes with KDE factors the class density as $\hat f_{h, c}(x) = \prod_{j=1}^d \hat f_{h, c, j}(x_j)$ — a product of $d$ univariate KDEs, one per feature. Naïve-Bayes-with-KDE is a standard ML baseline for mixed continuous-discrete features; scikit-learn’s KernelDensity is the usual building block, combined with a small ClassifierMixin wrapper.

Compared to parametric discriminant analysis (LDA, QDA from Topic 15), the KDE version trades parametric efficiency for distribution-freeness — it works on class densities that are skewed, bimodal, or otherwise non-Normal, without any parametric assumption.

Remark 17 Kernel regression via Nadaraya–Watson

The covariate-conditional extension of KDE is kernel regression (Nadaraya 1964; Watson 1964):

$$\hat m_h(x) \;=\; \frac{\sum_i Y_i\, K_h(x - X_i)}{\sum_i K_h(x - X_i)}.$$

This is a weighted average of the $Y_i$ values, with weights decaying as $|x - X_i|$ grows. It is the kernel-smoothing companion to the KDE: KDE estimates the marginal density $f(x)$, NW estimates the conditional expectation $E[Y \mid X = x]$. The two share the Theorem 3 bias–variance structure and the Theorem 6 asymptotic-normality story, so everything we built for $\hat f_h$ carries over. The full development — including local polynomial refinements — is the forthcoming formalml kernel-regression chapter.
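
A sketch of the estimator (the common $1/h$ normalization cancels between numerator and denominator, so raw kernel weights suffice):

```ts
const phi = (u: number): number => Math.exp(-0.5 * u * u) / Math.sqrt(2 * Math.PI);

// Nadaraya–Watson: kernel-weighted average of the responses Y_i.
function nadarayaWatson(xs: number[], ys: number[], h: number): (x: number) => number {
  return (x: number): number => {
    let num = 0, den = 0;
    for (let i = 0; i < xs.length; i++) {
      const w = phi((x - xs[i]) / h);
      num += w * ys[i];
      den += w;
    }
    return num / den; // degenerates (0/0) far from all data; guard in practice
  };
}
```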

Remark 18 Mean-shift clustering

Mean-shift (Fukunaga–Hostetler 1975; Comaniciu–Meer 2002) is an unsupervised clustering algorithm that performs gradient ascent on $\hat f_h$. Starting from each data point, iterate $x \leftarrow x + h^2 \cdot \nabla \log \hat f_h(x)$ until convergence; for a Gaussian kernel this step is exactly the move to the kernel-weighted mean of the sample. The fixed points are the local modes of the KDE, and points that converge to the same mode are assigned to the same cluster. The advantage over k-means is that mean-shift does not require specifying $k$ in advance — the number of modes is the number of clusters. The disadvantage is computational: gradient ascent + KDE evaluation is $O(n^2)$ per iteration, so it doesn’t scale past modest $n$. Modern variants (spectral mean-shift, blurring mean-shift) are forthcoming on formalml.
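
A one-dimensional sketch of that fixed-point iteration (Gaussian kernel; the tolerance and iteration cap are arbitrary choices):

```ts
// Unnormalized Gaussian weight; constants cancel in the weighted mean.
const w = (u: number): number => Math.exp(-0.5 * u * u);

// Ascend f_hat from x0 to the nearest KDE mode; points converging to the
// same mode share a cluster.
function meanShiftMode(sample: number[], h: number, x0: number, tol = 1e-8): number {
  let x = x0;
  for (let iter = 0; iter < 1000; iter++) {
    let num = 0, den = 0;
    for (const xi of sample) {
      const wi = w((x - xi) / h);
      num += wi * xi;
      den += wi;
    }
    const next = num / den; // kernel-weighted mean = x + h^2 * grad log f_hat(x)
    if (Math.abs(next - x) < tol) return next;
    x = next;
  }
  return x;
}
```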

30.10 Track 8 forward-map and the KDE → bootstrap bridge

Topic 30 closes with a forward-pointer figure (Fig 9) and a set of remarks that place KDE inside the larger Track 8 spine. Topic 29 (order statistics) opened the track; Topic 30 (KDE) is the mid-track hinge; Topics 31 (bootstrap) and 32 (empirical processes) close it. Beyond Track 8 itself, KDE is a load-bearing ingredient in several formalml chapters — kernel regression, local regression, normalizing flows, clustering, density-ratio estimation — each of which extends the KDE framework to a different ML-relevant context.

Example 13 Smooth bootstrap: resampling from the KDE

The standard nonparametric bootstrap resamples from the empirical distribution $F_n$ — a step function that places mass $1/n$ at each $X_i$. Silverman–Young 1987’s smooth bootstrap resamples from $\hat f_h$ instead: draw $X_i^\ast \sim \hat f_h$ by first picking an observation $X_{J_i}$ uniformly from the sample (where $J_i$ is uniform on $\{1, \dots, n\}$) and then perturbing it by an independent draw from the kernel $K_h$:

$$X_i^\ast \;=\; X_{J_i} \;+\; h \cdot U_i, \qquad U_i \sim K, \text{ iid}.$$

The resulting bootstrap distribution is smoothly supported rather than concentrated on $\{X_1, \dots, X_n\}$, which matters when the statistic of interest depends on derivatives or local continuity — e.g., quantile regression estimators or mode estimators. Topic 31: The Bootstrap develops the full bootstrap framework; smooth-bootstrap is one of several refinements it covers. The KDE → bootstrap bridge is exactly the Topic 30 → 31 edge in the Track 8 forward-map.
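
A sketch of the two-step draw for a Gaussian kernel (so $U_i$ is standard normal; Box–Muller is used for the kernel draw):

```ts
// One smooth-bootstrap resample of size m from the KDE f_hat_h.
function smoothBootstrap(sample: number[], h: number, m: number): number[] {
  const out: number[] = [];
  for (let i = 0; i < m; i++) {
    const xJ = sample[Math.floor(Math.random() * sample.length)]; // pick X_{J_i}
    const u = Math.sqrt(-2 * Math.log(1 - Math.random())) *
              Math.cos(2 * Math.PI * Math.random());              // U_i ~ N(0, 1)
    out.push(xJ + h * u);                                         // X* = X_J + h U
  }
  return out;
}
```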

[Figure: Track 8 forward-map — Topics 29 (Order Statistics) and 30 (KDE) as filled boxes, Topics 31 (Bootstrap) and 32 (Empirical Processes) as dashed boxes; forward arrows from Topic 30 to five formalml satellites (kernel-regression, local-regression, normalizing-flows, clustering, density-ratio-estimation); grey back-arrows to Topics 7, 10, 11, 13, 19, 29.]

Figure 9. Track 8 spine with Topic 30 at the center. Topics 29 and 30 are published (filled); Topics 31 and 32 are forthcoming (dashed). Forward arrows fan out to the formalml satellites where KDE machinery is re-used; back-arrows acknowledge the six prerequisite topics.

Remark 19 Topic 31: Bootstrap

The bootstrap (Efron 1979) is the Track 8 closer for parametric-inference questions. It resamples from the empirical distribution (or, in the smooth variant, from $\hat f_h$) to simulate the sampling distribution of a statistic without invoking asymptotic normality. Topic 31 (now published) develops nonparametric and parametric bootstraps, bootstrap CIs (percentile, BCa, studentized), and the Edgeworth-expansion-based analysis of bootstrap accuracy. Expect Topic 30’s bandwidth selectors to reappear as pilot estimators for smooth bootstraps of sample-quantile statistics.

Remark 20 Normalizing flows: learned density estimators

Modern neural density estimators — normalizing flows (Rezende–Mohamed 2015; Papamakarios et al. 2021) — replace the kernel-average-of-bumps $\hat f_h$ with a learned invertible transformation $g_\theta$ from a simple base density to the target: $\hat f(x) = p_{\text{base}}(g_\theta^{-1}(x))\,\bigl|\det \partial g_\theta^{-1}/\partial x\bigr|$. The AMISE framework of §30.6 doesn’t directly apply — flows are parametric within a specified flow family — but the bias-vs-capacity intuition carries over. The forthcoming formalml normalizing-flows chapter will contrast the two estimator classes head-to-head; they have complementary strengths (KDE at low $n$ and low $d$; flows at high $n$ and moderate $d$).

Remark 21 Multivariate KDE and the curse of dimensionality

In $d$ dimensions, the KDE generalizes via a product kernel or a full bandwidth matrix $\mathbf{H}$: $\hat f_\mathbf{H}(\mathbf{x}) = n^{-1} |\mathbf{H}|^{-1} \sum_i K(\mathbf{H}^{-1}(\mathbf{x} - \mathbf{X}_i))$. The AMISE rate deteriorates to $O(n^{-4/(4+d)})$ — the curse of dimensionality for density estimation. At $d = 10$ the rate is $n^{-2/7}$, so $n = 1000$ delivers accuracy comparable to a univariate sample of roughly a dozen points — which is why high-dimensional density estimation is hard. Duong 2007’s ks R package is the standard multivariate implementation; Scott 2015 Ch. 6 is the standard reference. The forthcoming formalml density-estimation-multivariate chapter treats the full story.

Remark 22 Density-ratio estimation and covariate shift

In many ML problems we care about the ratio $r(x) = p(x)/q(x)$ rather than the densities themselves — e.g., importance weighting for covariate shift, off-policy evaluation, or generative-adversarial-network discriminator outputs. The naïve plug-in $\hat r(x) = \hat f^{(p)}_h(x)/\hat f^{(q)}_h(x)$ has the same $n^{-4/5}$ rate but amplifies bias in the tails. Direct density-ratio estimation (Sugiyama–Suzuki–Kanamori 2012) sidesteps the denominator entirely by fitting $r$ directly via loss minimization. The full treatment is forthcoming on formalml as density-ratio-estimation; Topic 30’s KDE machinery is the starting point.


References

  1. Rosenblatt, Murray. (1956). Remarks on Some Nonparametric Estimates of a Density Function. Annals of Mathematical Statistics, 27(3), 832–837.
  2. Parzen, Emanuel. (1962). On Estimation of a Probability Density Function and Mode. Annals of Mathematical Statistics, 33(3), 1065–1076.
  3. Epanechnikov, Vassiliy A. (1969). Non-Parametric Estimation of a Multivariate Probability Density. Theory of Probability & Its Applications, 14(1), 153–158.
  4. Silverman, Bernard W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall.
  5. Sheather, Simon J., and M. Chris Jones. (1991). A Reliable Data-Based Bandwidth Selection Method for Kernel Density Estimation. Journal of the Royal Statistical Society: Series B (Methodological), 53(3), 683–690.
  6. Wand, M. P., and M. C. Jones. (1995). Kernel Smoothing. Chapman and Hall/CRC.
  7. Fan, Jianqing, and Irène Gijbels. (1996). Local Polynomial Modelling and Its Applications. Chapman and Hall/CRC.
  8. Duong, Tarn. (2007). ks: Kernel Density Estimation and Kernel Discriminant Analysis for Multivariate Data in R. Journal of Statistical Software, 21(7), 1–16.
  9. Scott, David W. (2015). Multivariate Density Estimation: Theory, Practice, and Visualization (2nd ed.). Wiley.
  10. van der Vaart, Aad W. (2000). Asymptotic Statistics. Cambridge University Press.
  11. Serfling, Robert J. (1980). Approximation Theorems of Mathematical Statistics. Wiley.
  12. Lehmann, Erich Leo, and George Casella. (1998). Theory of Point Estimation (2nd ed.). Springer.