Intermediate · 56 min read · April 23, 2026

Kernel Density Estimation

The kernel density estimator, the AMISE bias–variance decomposition, the optimal bandwidth h* of order n^(−1/5), Epanechnikov's optimal-kernel theorem, Parzen's pointwise asymptotic normality, and data-driven bandwidth selectors (Silverman, Scott, UCV, Sheather–Jones). Track 8, topic 2 of 4.

30.1 Motivation: from ECDF to density

Topic 29 built its entire nonparametric toolkit on the empirical CDF $F_n(x) = n^{-1} \sum_{i=1}^n \mathbf{1}\{X_i \le x\}$ — a step function that converges uniformly to the true CDF $F$. But $F_n$ is not differentiable, so the density $f = F'$ is unreachable by naïve differentiation. Kernel density estimation is the answer to the obvious question: what happens when we differentiate $F_n$ using a smooth weighting function instead of a Dirac delta?

The idea is simple enough to state in a sentence. Replace each sample point $X_i$ with a small bump of area $1/n$ centered at $X_i$; add the bumps up. The sum is a bona fide density (nonnegative, integrates to 1) that looks like $f$ as $n$ grows, with a smoothness controlled by the bump width. Histograms are the roughest version — rectangular bumps of constant width, anchored to arbitrary bin edges. KDE replaces those rectangles with a smooth, symmetric, nonnegative kernel and drops the bin-edge arbitrariness entirely.

Example 1 Histograms and KDEs on the same N(0,1) sample

Figure 1 takes a single Normal$(0, 1)$ sample of size $n = 200$ and plots three histograms (bin widths $b \in \{0.1, 0.4, 1.0\}$) alongside three Gaussian-kernel KDEs (bandwidths $h \in \{0.08, 0.3, 0.8\}$). At small $b$ or small $h$, both estimators are spiky and variance-dominated; at large $b$ or large $h$, both over-smooth and bias-dominate. The KDEs remove one obvious source of arbitrariness — bin edges — but the bandwidth problem remains, just in a smoother form.

[Figure: six panels — histograms of the same Normal(0, 1) sample at bin widths 0.1, 0.4, 1.0 (top) and Gaussian KDEs at bandwidths 0.08, 0.3, 0.8 (bottom), true density overlaid.]

Figure 1. Histograms (top) and KDEs (bottom) on the same $n = 200$ Normal$(0, 1)$ sample. The bin-width / bandwidth trade-off between spikiness and over-smoothing is visible in both rows; KDE removes the bin-edge arbitrariness but keeps the smoothing-parameter choice.

Remark 1 What KDE gives up vs. histograms

Histograms have two genuine advantages KDEs lack: they are exactly piecewise-constant (so integrals over bin intervals are trivial) and they are visually honest about the underlying empirical counts. KDEs are smoother and easier to differentiate further (for mode-finding or density-gradient methods), but the smoothness is imposed, not inherent. A reader who wants to know “how many data points fell in $[a, b]$?” should look at a histogram; a reader who wants a differentiable density estimate should look at a KDE.

Remark 2 Distribution-free density estimation

Topic 7 built density families (Normal, Gamma, Beta) and fit a parametric $\theta$ from data. That program is efficient when the family is right and biased when it’s wrong. KDE drops the family assumption and estimates $f$ directly from the sample — a distribution-free procedure in the sense of Topic 29 §29.1 Rem 1. The price is a slower convergence rate: we’ll see in §30.6 that AMISE decays at $n^{-4/5}$, slower than the parametric $n^{-1}$, and that this is the price of distribution-freeness.

30.2 The kernel density estimator

With the motivation out of the way, we can write down the estimator formally and check that it is a density. Three definitions (kernel, scaled kernel, KDE) and one stated theorem (KDE is a valid density) — enough to set up everything that follows.

Definition 1 Kernel

A kernel is a function $K : \mathbb{R} \to \mathbb{R}_{\ge 0}$ satisfying

$$\int_{-\infty}^\infty K(u)\,du = 1, \qquad K(-u) = K(u) \text{ for all } u.$$

Nonnegativity and unit mass make $K$ a density on $\mathbb{R}$; symmetry makes it a density centered at $0$. We additionally require finite variance $\mu_2(K) := \int u^2 K(u)\,du < \infty$ and finite roughness $R(K) := \int K(u)^2\,du < \infty$, both used throughout §30.4–§30.6.

Definition 2 Scaled kernel

For bandwidth $h > 0$, the scaled kernel is

$$K_h(u) := \frac{1}{h}\, K\!\left(\frac{u}{h}\right).$$

The scaling preserves the density property — $\int K_h(u)\,du = 1$ for any $h > 0$ — and turns $h$ into a width parameter: small $h$ makes $K_h$ tall and narrow (close to a Dirac delta); large $h$ makes it short and wide.

Definition 3 Kernel density estimator (KDE)

Given an iid sample $X_1, \dots, X_n$ from an unknown density $f$, the kernel density estimator with kernel $K$ and bandwidth $h$ is

$$\hat f_h(x) \;=\; \frac{1}{n}\sum_{i=1}^n K_h(x - X_i) \;=\; \frac{1}{n\,h}\sum_{i=1}^n K\!\left(\frac{x - X_i}{h}\right).$$

Both forms are in active use. The first makes it visually obvious that $\hat f_h$ is an average of $n$ kernel-shaped bumps centered at the data. The second groups the bandwidth normalization into a single $nh$ factor, which will be convenient when we compute bias and variance in §30.4.

Theorem 1 The KDE is a valid density

Under the definitions above, $\hat f_h$ is a bona-fide probability density:

$$\hat f_h(x) \ge 0 \text{ for all } x, \qquad \int_{-\infty}^\infty \hat f_h(x)\,dx = 1.$$

Stated; the proof is elementary. Nonnegativity is immediate from $K \ge 0$. Integration reduces to $\frac{1}{n}\sum_i \int K_h(x - X_i)\,dx = \frac{1}{n}\sum_i 1 = 1$ by the scaled-kernel normalization.

Example 2 A hand-computed KDE at two points

Take $n = 2$ observations $X_1 = 0$, $X_2 = 1$ and Gaussian kernel $K(u) = \varphi(u) = (2\pi)^{-1/2} e^{-u^2/2}$. At bandwidth $h = 0.5$:

$$\hat f_{0.5}(0.5) = \frac{1}{2 \cdot 0.5}\bigl[\varphi(1) + \varphi(-1)\bigr] = \varphi(1) + \varphi(-1) \approx 0.4839,$$

where we used $\varphi(1) = \varphi(-1) \approx 0.2420$. The KDE at the midpoint $x = 0.5$ receives equal contributions from both observations and, since $nh = 1$ here, equals $2\varphi(1) \approx 0.4839$. This is the T30.8 test pin — a two-point sanity check on the kdeEvaluate implementation.
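
A minimal TypeScript sketch of that sanity check — the real kdeEvaluate in nonparametric.ts presumably has a richer signature (kernel choice, evaluation grids); this version just transcribes Definition 3 with a Gaussian kernel:

```ts
// Standard normal kernel phi(u).
const phi = (u: number): number => Math.exp(-0.5 * u * u) / Math.sqrt(2 * Math.PI);

// f_hat_h(x) = (1 / (n h)) * sum_i K((x - X_i) / h)  (Definition 3).
// Hypothetical signature; the actual nonparametric.ts export may differ.
function kdeEvaluate(sample: number[], h: number, x: number): number {
  const sum = sample.reduce((acc, xi) => acc + phi((x - xi) / h), 0);
  return sum / (sample.length * h);
}

// Two-point pin from Example 2: n = 2, h = 0.5, x = 0.5  ->  2 * phi(1).
console.log(kdeEvaluate([0, 1], 0.5, 0.5)); // ≈ 0.4839
```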

Remark 3 Equivalent formulation via convolution

Writing $\mathbb{P}_n = n^{-1} \sum_i \delta_{X_i}$ for the empirical distribution (a probability measure that places mass $1/n$ at each sample point), we have $\hat f_h = K_h \ast \mathbb{P}_n$. KDE is the convolution of a smooth kernel with the empirical distribution — equivalently, the empirical distribution smoothed by the kernel. This formulation is convenient in formalml, where the convolution structure generalizes to multivariate bandwidths and non-iid samples.

Remark 4 Why the symmetry and nonnegativity constraints

Symmetry $K(-u) = K(u)$ makes the first-moment condition $\int u K(u)\,du = 0$ automatic, which simplifies the bias calculation in §30.4 (the $h f'(x)$ Taylor term drops out). Nonnegativity guarantees $\hat f_h \ge 0$, so the estimator is always a valid density. A fourth-order or higher-order kernel with $\mu_2(K) = 0$ can achieve faster bias rates (down to $O(h^4)$) but requires $K$ to take negative values, and then $\hat f_h$ may be negative in places — a practical complication we’ll mention in §30.6 Rem 13 and defer to Serfling 1980 §1.11.

30.3 Kernels: families and properties

Five kernels are standard — Gaussian, Epanechnikov, biweight (quartic), triangular, and uniform — and every modern implementation supports all five. The constants that govern AMISE ($\mu_2(K)$ and $R(K)$) are closed-form for each, and their efficiency relative to Epanechnikov sits between 0.93 and 1.00 across the whole family. Kernel choice, in practice, matters much less than bandwidth choice.

Theorem 2 Standard kernels and their constants

The five standard symmetric kernels, with their second moments and roughness:

Kernel | $K(u)$ | Support | $\mu_2(K)$ | $R(K)$ | Efficiency
Gaussian | $\varphi(u) = (2\pi)^{-1/2} e^{-u^2/2}$ | $\mathbb{R}$ | $1$ | $\frac{1}{2\sqrt\pi} \approx 0.2821$ | $0.951$
Epanechnikov | $\frac{3}{4}(1 - u^2)\,\mathbf{1}\{|u| \le 1\}$ | $[-1, 1]$ | $\frac{1}{5}$ | $\frac{3}{5}$ | $1.000$
Biweight | $\frac{15}{16}(1 - u^2)^2\,\mathbf{1}\{|u| \le 1\}$ | $[-1, 1]$ | $\frac{1}{7}$ | $\frac{5}{7}$ | $0.994$
Triangular | $(1 - |u|)\,\mathbf{1}\{|u| \le 1\}$ | $[-1, 1]$ | $\frac{1}{6}$ | $\frac{2}{3}$ | $0.986$
Uniform | $\frac{1}{2}\,\mathbf{1}\{|u| \le 1\}$ | $[-1, 1]$ | $\frac{1}{3}$ | $\frac{1}{2}$ | $0.930$

Stated. All values are elementary integrals; Example 3 verifies Epanechnikov explicitly, and the kernelProperties export in nonparametric.ts §8 encodes all five (T30.5–T30.6).

Example 3 Epanechnikov moments by direct integration

For $K_E(u) = \frac{3}{4}(1 - u^2)\,\mathbf{1}\{|u| \le 1\}$:

$$\int_{-1}^1 K_E(u)\,du = \frac{3}{4}\int_{-1}^1 (1 - u^2)\,du = \frac{3}{4} \cdot \frac{4}{3} = 1 \quad\checkmark,$$

$$\mu_2(K_E) = \int_{-1}^1 u^2 K_E(u)\,du = \frac{3}{4}\int_{-1}^1 (u^2 - u^4)\,du = \frac{3}{4}\left(\frac{2}{3} - \frac{2}{5}\right) = \frac{1}{5},$$

$$R(K_E) = \int_{-1}^1 K_E(u)^2\,du = \frac{9}{16}\int_{-1}^1 (1 - u^2)^2\,du = \frac{9}{16} \cdot \frac{16}{15} = \frac{3}{5}.$$

The values $\mu_2(K_E) = 1/5$ and $R(K_E) = 3/5$ are load-bearing for Theorem 5’s Epanechnikov optimality proof in §30.6.
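
A quick numeric spot-check of these three integrals (a sketch using midpoint-rule quadrature, not the kernelProperties export itself):

```ts
// Epanechnikov kernel on [-1, 1].
const kE = (u: number): number => (Math.abs(u) <= 1 ? 0.75 * (1 - u * u) : 0);

// Midpoint-rule quadrature on [a, b] with m subintervals.
function integrate(f: (u: number) => number, a: number, b: number, m = 100000): number {
  const du = (b - a) / m;
  let s = 0;
  for (let i = 0; i < m; i++) s += f(a + (i + 0.5) * du);
  return s * du;
}

console.log(integrate(u => kE(u), -1, 1));         // ≈ 1    (unit mass)
console.log(integrate(u => u * u * kE(u), -1, 1)); // ≈ 0.2  (mu_2 = 1/5)
console.log(integrate(u => kE(u) ** 2, -1, 1));    // ≈ 0.6  (R = 3/5)
```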

[Figure: the five standard kernels on a common u-axis, each annotated with its $\mu_2$, $R(K)$, and efficiency relative to Epanechnikov.]

Figure 2. The five standard kernels on a common axis. Gaussian extends into the tails; the other four have compact support on $[-1, 1]$. Efficiency — the figure of merit from §30.6 Theorem 5 — ranges from $0.930$ (uniform) to $1.000$ (Epanechnikov).

[Interactive demo: Kernel choice explorer — five kernels against the true density; watch how close the curves stay, and the efficiency column shows why. Setting: Normal(0, 1), $n = 200$, equal bandwidth $h = 0.345$. The demo’s summary table:]

Kernel | $\mu_2(K)$ | $R(K)$ | eff | h used | AMISE(h)
Gaussian | 1.000 | 0.282 | 0.951 | 0.345 | 0.0048
Epanechnikov | 0.200 | 0.600 | 1.000 | 0.345 | 0.0087
Biweight | 0.143 | 0.714 | 0.994 | 0.345 | 0.0104
Triangular | 0.167 | 0.667 | 0.986 | 0.345 | 0.0097
Uniform | 0.333 | 0.500 | 0.930 | 0.345 | 0.0073

The five kernels track the true density so closely that in any given sample you rarely see a meaningful difference. Epanechnikov is AMISE-optimal (efficiency = 1), but the next three are within a percent; Uniform trails by ~7%. In practice, the bandwidth h carries all the action.

Remark 5 Why the kernel choice rarely matters

The efficiencies in Theorem 2 all lie in $[0.93, 1.00]$. Substituting into the AMISE-optimal formula from §30.6, switching from Epanechnikov to Gaussian changes the optimal AMISE value by less than 5%, and the optimal bandwidths differ only by a multiplicative constant. The practical consequence: once we’ve committed to using a kernel with finite variance and roughness, the bandwidth choice does almost all the work. §30.8 is the real bandwidth chapter; §30.3 is a tour of an almost-equivalence class.

30.4 Leading-order bias and variance

The kernel and bandwidth are in place. Next we compute the mean and variance of $\hat f_h(x)$ at a fixed $x$ and extract their leading-order behavior as $n \to \infty$, $h \to 0$. This is Rosenblatt 1956’s original calculation; it sets up everything that follows.

Theorem 3 Leading-order bias and variance of the KDE

Let $X_1, \dots, X_n$ be iid with density $f$ having bounded, continuous second derivative $f''$ in a neighborhood of $x$ with $f(x) > 0$. Let $K$ be a symmetric kernel satisfying the moment conditions $\int K = 1$, $\int u K = 0$, $\mu_2(K) < \infty$, $R(K) < \infty$. As $n \to \infty$, $h \to 0$, and $nh \to \infty$:

$$\mathrm{Bias}\bigl(\hat f_h(x)\bigr) = \frac{h^2}{2}\,\mu_2(K)\, f''(x) + o(h^2),$$

$$\mathrm{Var}\bigl(\hat f_h(x)\bigr) = \frac{f(x)\, R(K)}{n\,h} + o\!\left(\frac{1}{nh}\right).$$

Stated here; the derivation is Step (i) and Step (ii) of the §30.6 Theorem 4 proof. Both calculations are a second-order Taylor expansion of ff around xx combined with the kernel moment conditions.

Example 4 Bias and variance at the peak of N(0,1)

For $f = \varphi$ (standard Normal), $x = 0$, Gaussian kernel ($\mu_2 = 1$, $R = (2\sqrt\pi)^{-1}$), $h = 0.5$, $n = 200$:

$$f''(0) = -\varphi(0) = -\frac{1}{\sqrt{2\pi}} \approx -0.399, \qquad f(0) = \varphi(0) \approx 0.399,$$

$$\mathrm{Bias} \approx \frac{0.25}{2} \cdot 1 \cdot (-0.399) = -0.0499, \qquad \mathrm{Var} \approx \frac{0.399 \cdot 0.2821}{200 \cdot 0.5} \approx 0.00113.$$

The bias is roughly 13% of $f(0)$; the standard deviation is roughly 8%. The leading-order formulas predict this at large $n$; at $n = 200$ there is finite-$n$ slippage, but the orders of magnitude are right.
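
The two leading-order formulas are one-liners in code; a sketch reproducing Example 4’s numbers, with the Gaussian-kernel constants hard-coded:

```ts
// Gaussian-kernel constants: mu_2(K) = 1, R(K) = 1 / (2 sqrt(pi)).
const MU2 = 1;
const RK = 1 / (2 * Math.sqrt(Math.PI)); // ≈ 0.2821

// Theorem 3, leading order only (the o(h^2) and o(1/(nh)) terms dropped).
const bias = (h: number, fppX: number): number => (h * h / 2) * MU2 * fppX;
const variance = (h: number, n: number, fX: number): number => (fX * RK) / (n * h);

const phi0 = 1 / Math.sqrt(2 * Math.PI); // f(0) = phi(0); also f''(0) = -phi(0)
console.log(bias(0.5, -phi0));         // ≈ -0.0499
console.log(variance(0.5, 200, phi0)); // ≈ 0.00113
```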

Remark 6 Why the first-order bias term vanishes

The Taylor expansion of $f(x - hu)$ around $x$ has a linear $-hu\, f'(x)$ term. When integrated against the symmetric kernel $K$, the linear term is $-h f'(x) \int u K(u)\,du = 0$. Symmetry of $K$ is doing work here — without it we’d pick up an $O(h)$ bias term, and the AMISE-optimal bandwidth in §30.6 would have a very different rate. This is also why asymmetric kernels (one-sided boundary kernels in §30.5) need separate treatment.

Remark 7 Rosenblatt's no-free-lunch result

Theorem 3 says bias is $O(h^2)$ and variance is $O((nh)^{-1})$ under standard regularity — the two rates are intrinsic to the kernel-smoothing program and cannot both be improved without changing the estimator class. A fourth-order kernel with $\mu_2(K) = 0$ can achieve $O(h^4)$ bias but potentially negative $\hat f_h$ (§30.2 Rem 4); local polynomial regression (Fan–Gijbels 1996; §30.5 Rem 9) trades one estimator structure for another. In the plain-KDE world with a symmetric nonnegative kernel, Rosenblatt 1956 is the ceiling.

30.5 MISE, AMISE, and boundary behavior

Bias and variance at a single point $x$ are the building blocks. The relevant global loss — the one the AMISE-optimal bandwidth minimizes — integrates pointwise mean-squared error over the full domain. We derive this integrated form and then flag the one regime where Theorem 3 does not hold: the boundary of a bounded-support density.

The pointwise mean-squared error is $\mathrm{MSE}(x) = \mathrm{Bias}^2(x) + \mathrm{Var}(x)$. Integrating over $x$,

$$\mathrm{MISE}(h) \;=\; \mathbb{E}\!\int\bigl(\hat f_h(x) - f(x)\bigr)^2\,dx \;=\; \int \mathrm{Bias}^2(x)\,dx + \int \mathrm{Var}(x)\,dx$$

is the mean integrated squared error, the canonical global loss functional. The asymptotic MISE (AMISE) keeps only the leading-order terms from Theorem 3:

$$\mathrm{AMISE}(h) \;=\; \frac{h^4\,\mu_2(K)^2\, R(f'')}{4} \;+\; \frac{R(K)}{n\,h},$$

where $R(f'') := \int f''(x)^2\,dx$ is the curvature integral of $f$. The two terms are the integrated squared bias (the first) and the integrated variance (the second) — a log-log plot shows them as a line with slope $+4$ and a line with slope $-1$, whose sum has a unique minimum we’ll compute in §30.6.
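
AMISE is directly computable once the two constants are known; a sketch, using the closed-form $R(\varphi'') = 3/(8\sqrt\pi)$ for the standard Normal (quoted again in Example 7), reproduces Example 5’s numbers:

```ts
// AMISE(h) = h^4 mu2(K)^2 R(f'') / 4 + R(K) / (n h).
function amise(h: number, n: number, mu2K: number, rK: number, rFpp: number): number {
  return (h ** 4) * (mu2K ** 2) * rFpp / 4 + rK / (n * h);
}

// f = N(0,1) with a Gaussian kernel: mu2 = 1, R(K) = 1/(2 sqrt(pi)),
// R(f'') = 3/(8 sqrt(pi)) ≈ 0.2116.
const rK = 1 / (2 * Math.sqrt(Math.PI));
const rFpp = 3 / (8 * Math.sqrt(Math.PI));
console.log(amise(0.3671, 200, 1, rK, rFpp)); // ≈ 0.00480, Example 5's AMISE at h*
```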

Example 5 MISE vs AMISE at finite n

For $f = \varphi$, Gaussian kernel, $n = 200$: the closed-form AMISE-optimal bandwidth (from §30.6 Thm 4) is $h^\ast \approx 0.3671$. At that $h$, AMISE $\approx 0.00480$. The exact MISE computed by numerical integration at the same bandwidth is a few percent higher. The gap is the $o(h^2)$ and $o(1/(nh))$ terms we dropped; at $n = 200$ they are nonzero but small, and they vanish as $n \to \infty$.

[Figure: three log-log panels — integrated squared bias $h^4 \mu_2(K)^2 R(f'')/4$ vs $h$ (slope +4), integrated variance $R(K)/(nh)$ vs $h$ (slope −1), and their sum AMISE with the $h^\ast$ minimum marked; $f = N(0,1)$, Gaussian kernel, $n = 200$.]

Figure 3. The integrated squared bias (slope +4), integrated variance (slope −1), and their sum AMISE on log-log axes for $f =$ Normal$(0, 1)$, Gaussian kernel, $n = 200$. The AMISE minimum at $h^\ast \approx 0.37$ marks the AMISE-optimal bandwidth Theorem 4 characterizes.

Example 6 Boundary bias on Beta(2, 5)

Beta$(2, 5)$ has density $f(x) = 30 x (1 - x)^4$ on $[0, 1]$ and $f(x) = 0$ outside. A naïve Gaussian KDE on a Beta$(2, 5)$ sample places nonzero density at negative $x$ (where the true density is zero) because the Gaussian kernel has infinite support. The leakage is not a small correction: at $n = 500$ with Silverman bandwidth $h \approx 0.049$, the mass below $x = 0$ is approximately $0.014$, or 1.4% of the total probability mass. Figure 4 shows the naïve and reflection-corrected KDEs side by side; the reflection trick (augment the sample with $\{-X_i\}$ and renormalize) recovers nearly all the leaked mass.

[Figure: naive Gaussian KDE on a Beta(2, 5) sample with mass leaked below zero (left) vs reflection-corrected KDE with the leaked mass restored (right); true density overlaid on both panels.]

Figure 4. Naïve Gaussian KDE (left) vs. Jones 1993 reflection-corrected KDE (right) on a Beta$(2, 5)$ sample at $n = 500$. The naïve estimator over-smooths across the lower boundary; reflection restores the mass and matches the true density much more closely near $x = 0$.

[Interactive demo: Boundary-bias demo — densities on bounded support leak KDE mass across the edge; Jones 1993 reflection patches it. At the Beta(2, 5) preset with Silverman $h = 0.049$: the naive Gaussian KDE leaks mass 0.0136 outside the support (0.0135 below, 0.0001 above); the reflection-corrected KDE reflects about the support boundaries and restores the mass to $[0, 1]$.]

Reflection is the simplest of the boundary fixes — Fan–Gijbels 1996’s local-linear estimator is the modern treatment, delivering the correction and the regression extension together. See the §30.10 pointer to formalml local-regression.

Remark 8 Why boundary bias is not a small correction

For a density with bounded support $[a, b]$ and a kernel with support $[-r, r]$ (or all of $\mathbb{R}$ with rapid decay), the KDE near $x = a$ averages over $X_i$ values in a window that extends below $a$. But no $X_i$ values are below $a$ — the sample is entirely on $[a, b]$. The KDE therefore under-estimates $f(a)$ and smooths probability mass into the forbidden region $(-\infty, a)$. At $x$ exactly at the boundary, the bias is $O(h)$ — worse than the interior $O(h^2)$ — and the AMISE analysis of §30.6 does not apply. Any production density estimator for bounded-support data needs a boundary correction.

Remark 9 Reflection, local-linear, and the path to formalml

Three standard boundary corrections: Jones 1993 reflection (simplest; used by BoundaryBiasDemo); Müller 1991 boundary kernels (asymmetric kernels within one bandwidth of the boundary); Fan–Gijbels 1996 local polynomial regression, which automatically adapts at the boundary as a byproduct of fitting a local polynomial rather than a local constant. Local polynomial is the modern standard and generalizes cleanly to the regression setting $E[Y \mid X]$; the full treatment is deferred to formalml local-regression. Topic 30 keeps things simple with reflection.

30.6 Optimal bandwidth and Epanechnikov’s theorem

Two theorems anchor the featured section. Theorem 4 derives the AMISE-optimal bandwidth from the decomposition in §30.5 — the full four-step proof is the load-bearing calculation of the topic. Theorem 5 picks the kernel that minimizes the resulting AMISE among symmetric nonnegative kernels with a fixed second moment — a short calculus-of-variations argument whose answer is Epanechnikov’s $K_E$.

Theorem 4 AMISE decomposition and optimal bandwidth (featured)

Let $X_1, \dots, X_n$ be iid with density $f$ having bounded, continuous second derivative $f''$. Let $K$ be a symmetric kernel with (K1) $\int K = 1$, (K2) $\int u K = 0$, (K3) $\mu_2(K) \in (0, \infty)$, (K4) $R(K) < \infty$. As $n \to \infty$, $h \to 0$, $nh \to \infty$,

$$\mathrm{AMISE}(h) \;=\; \frac{R(K)}{nh} + \frac{h^4\,\mu_2(K)^2\, R(f'')}{4},$$

where $R(f'') := \int f''(x)^2\,dx$. The AMISE-optimal bandwidth is

$$h^\ast \;=\; \left(\frac{R(K)}{n\,\mu_2(K)^2\, R(f'')}\right)^{1/5} \;=\; O(n^{-1/5}),$$

with matching minimum AMISE

$$\mathrm{AMISE}^\ast \;=\; \frac{5}{4}\,\bigl\{R(K)^4\, \mu_2(K)^2\, R(f'')\bigr\}^{1/5}\, n^{-4/5}.$$
Proof.

The proof has four steps: (i) compute the bias via Taylor expansion of ff; (ii) compute the variance via the iid-sum identity; (iii) integrate over xx to obtain AMISE; (iv) minimize in hh.

Step (i): Bias. Substitute $u = (x - y)/h$ in the definition of the expected value:

$$\mathbb{E}\bigl[\hat f_h(x)\bigr] = \int \frac{1}{h} K\!\left(\frac{x-y}{h}\right) f(y)\,dy = \int K(u)\, f(x - hu)\,du.$$

Taylor-expand $f(x - hu)$ around $x$ to second order, using the regularity of $f''$:

$$f(x - hu) = f(x) - hu\, f'(x) + \frac{(hu)^2}{2}\, f''(x) + o(h^2),$$

where the remainder is $o(h^2)$ uniformly in $u$ on the support of $K$. Substituting into the expectation integral,

$$\mathbb{E}\bigl[\hat f_h(x)\bigr] = f(x)\int K(u)\,du - h\, f'(x)\int u\, K(u)\,du + \frac{h^2}{2}\, f''(x)\int u^2\, K(u)\,du + o(h^2).$$

Applying the kernel moment conditions (K1)–(K3):

$$\mathbb{E}\bigl[\hat f_h(x)\bigr] = f(x) \cdot 1 - h\, f'(x) \cdot 0 + \frac{h^2}{2}\, f''(x)\, \mu_2(K) + o(h^2).$$

Subtracting $f(x)$ yields the leading-order bias $\mathrm{Bias}(\hat f_h(x)) = \frac{h^2}{2}\,\mu_2(K)\, f''(x) + o(h^2)$.

Step (ii): Variance. Since $\hat f_h(x) = \frac{1}{n}\sum_i Z_i$ where $Z_i := h^{-1} K((x-X_i)/h)$ are iid,

$$\mathrm{Var}\bigl(\hat f_h(x)\bigr) = \frac{1}{n}\, \mathrm{Var}(Z_1) = \frac{1}{n}\!\left\{\mathbb{E}[Z_1^2] - \mathbb{E}[Z_1]^2\right\}.$$

Compute the second moment of $Z_1$ by the same substitution:

$$\mathbb{E}[Z_1^2] = \int \frac{1}{h^2}\, K\!\left(\frac{x-y}{h}\right)^{\!2} f(y)\,dy = \frac{1}{h}\int K(u)^2\, f(x - hu)\,du.$$

Using $f(x - hu) = f(x) + O(h)$ (first-order Taylor),

$$\mathbb{E}[Z_1^2] = \frac{1}{h}\, f(x) \int K(u)^2\,du + O(1) = \frac{f(x)\, R(K)}{h} + O(1).$$

The first-moment-squared term is $\mathbb{E}[Z_1]^2 = (f(x) + O(h^2))^2 = f(x)^2 + O(h^2)$, which is $O(1)$ and therefore negligible compared to the $O(1/h)$ leading term. Combining:

$$\mathrm{Var}\bigl(\hat f_h(x)\bigr) = \frac{1}{n}\!\left\{\frac{f(x)\, R(K)}{h} + O(1)\right\} = \frac{f(x)\, R(K)}{nh} + O\!\left(\frac{1}{n}\right).$$

Under the regime $nh \to \infty$, the $O(1/n)$ term is $o(1/(nh))$.

Step (iii): Integrate to AMISE. The pointwise mean-squared error is $\mathrm{MSE}(x) = \mathrm{Bias}^2(x) + \mathrm{Var}(x)$. Integrating over $x$ and keeping only the leading terms,

$$\mathrm{AMISE}(h) = \int \mathrm{Bias}^2(x)\,dx + \int \mathrm{Var}(x)\,dx.$$

The integrated squared bias is

$$\int \mathrm{Bias}^2(x)\,dx = \int \left(\frac{h^2}{2}\,\mu_2(K)\,f''(x)\right)^{\!2} dx = \frac{h^4\,\mu_2(K)^2}{4}\,R(f''),$$

where $R(f'') := \int f''(x)^2\,dx$. The integrated variance is

$$\int \mathrm{Var}(x)\,dx = \int \frac{f(x)\,R(K)}{nh}\,dx = \frac{R(K)}{nh}\int f(x)\,dx = \frac{R(K)}{nh},$$

using $\int f = 1$. Summing,

$$\mathrm{AMISE}(h) = \frac{R(K)}{nh} + \frac{h^4\,\mu_2(K)^2\,R(f'')}{4}.$$

Step (iv): Minimize in $h$. Differentiate AMISE with respect to $h$:

$$\frac{d\,\mathrm{AMISE}}{dh} = -\frac{R(K)}{n\,h^2} + h^3\,\mu_2(K)^2\,R(f'').$$

Setting to zero and solving for $h$:

$$h^5 = \frac{R(K)}{n\,\mu_2(K)^2\,R(f'')}, \qquad h^\ast = \left(\frac{R(K)}{n\,\mu_2(K)^2\,R(f'')}\right)^{\!1/5}.$$

The second derivative

$$\frac{d^2\,\mathrm{AMISE}}{dh^2} = \frac{2R(K)}{n\,h^3} + 3 h^2\,\mu_2(K)^2\,R(f'') > 0,$$

confirms $h^\ast$ is the unique minimum. Substituting $h^\ast$ into AMISE: the first term evaluates to

$$\frac{R(K)}{n\,h^\ast} = \frac{R(K)}{n}\left(\frac{n\,\mu_2(K)^2\,R(f'')}{R(K)}\right)^{\!1/5} = R(K)^{4/5}\, \mu_2(K)^{2/5}\, R(f'')^{1/5}\, n^{-4/5},$$

and the second term evaluates to

$$\frac{(h^\ast)^4\, \mu_2(K)^2\, R(f'')}{4} = \frac{1}{4}\, R(K)^{4/5}\, \mu_2(K)^{2/5}\, R(f'')^{1/5}\, n^{-4/5}.$$

Their sum is $\mathrm{AMISE}^\ast = \frac{5}{4}\, R(K)^{4/5}\, \mu_2(K)^{2/5}\, R(f'')^{1/5}\, n^{-4/5} = \frac{5}{4}\bigl\{R(K)^4\, \mu_2(K)^2\, R(f'')\bigr\}^{1/5}\, n^{-4/5}$.

$\blacksquare$ — using kernel moment conditions (K1)–(K4), second-order Taylor expansion of $f$, iid-variance accounting, and elementary calculus (ROS1956; PAR1962; WAN1995 §2.5; SIL1986 §3.3).
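
A brute-force check of Step (iv) (a sketch with the $f = N(0,1)$ / Gaussian-kernel constants): the closed-form $h^\ast$ should match a grid minimizer of AMISE to grid resolution.

```ts
const rK = 1 / (2 * Math.sqrt(Math.PI));   // R(K), Gaussian kernel
const rFpp = 3 / (8 * Math.sqrt(Math.PI)); // R(f'') for f = N(0,1)
const n = 200;
const amise = (h: number): number => rK / (n * h) + (h ** 4) * rFpp / 4; // mu2(K) = 1

const hClosedForm = Math.pow(rK / (n * rFpp), 1 / 5);

// Grid search over h; both answers should agree to ~1e-4.
let bestH = NaN, bestV = Infinity;
for (let h = 0.05; h <= 1.5; h += 0.0001) {
  const v = amise(h);
  if (v < bestV) { bestV = v; bestH = h; }
}
console.log(hClosedForm.toFixed(4), bestH.toFixed(4)); // both ≈ 0.3671
```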

Theorem 5 Epanechnikov's optimality theorem

Consider the class of symmetric kernels $K : \mathbb{R} \to \mathbb{R}_{\ge 0}$ satisfying $\int K = 1$ and $\mu_2(K) = \sigma_0^2$ for a fixed positive constant $\sigma_0^2$. The kernel minimizing $R(K) = \int K^2$ subject to these constraints is the Epanechnikov kernel

$$K_E(u) = \frac{3}{4}\,(1 - u^2)\,\mathbf{1}\{|u| \le 1\}, \qquad \mu_2(K_E) = \frac{1}{5}, \quad R(K_E) = \frac{3}{5}.$$

(By rescaling $h$, the conclusion holds for any fixed $\sigma_0^2 > 0$; the statement with $\sigma_0^2 = 1/5$ fixes the conventional support $[-1, 1]$.)

Proof.

Minimizing the quadratic functional $R(K) = \int K(u)^2\,du$ subject to the two linear constraints $\int K = 1$ and $\int u^2 K = \frac{1}{5}$ plus the symmetry and nonnegativity constraints, we form the Lagrangian

$$\mathcal{L}[K] = \int K(u)^2\,du + \lambda_0\!\left(\int K - 1\right) + \lambda_2\!\left(\int u^2 K - \frac{1}{5}\right).$$

Symmetry fixes the first-moment multiplier at zero. The pointwise Euler–Lagrange first-order condition is $2 K(u) + \lambda_0 + \lambda_2 u^2 = 0$, giving

$$K(u) = -\frac{\lambda_0}{2} - \frac{\lambda_2}{2}\, u^2 = a - b\, u^2$$

on the region where this is positive, with $K = 0$ outside. Write $K(u) = a(1 - u^2/r^2)\,\mathbf{1}\{|u| \le r\}$, where $r := \sqrt{a/b}$ is the support radius (forced by $K(\pm r) = 0$). The two constraints pin $a$ and $r$:

$$\int K = a \int_{-r}^{r}(1 - u^2/r^2)\,du = a\,\frac{4r}{3} = 1 \implies a = \frac{3}{4r},$$

$$\int u^2 K = a \int_{-r}^{r}(u^2 - u^4/r^2)\,du = a\left(\frac{2r^3}{3} - \frac{2r^3}{5}\right) = \frac{4ar^3}{15} = \frac{r^2}{5} \quad\text{(using } a = 3/(4r)\text{).}$$

Setting $\mu_2(K) = 1/5$ gives $r^2 = 1$, so $r = 1$ and $a = 3/4$, yielding

$$K_E(u) = \frac{3}{4}\,(1 - u^2)\, \mathbf{1}\{|u| \le 1\}.$$

The second-derivative test is automatic: the objective $\int K^2$ is strictly convex in $K$, so the stationary point is a global minimum. Computing the optimum:

$$R(K_E) = \int_{-1}^{1} \frac{9}{16}(1 - u^2)^2\,du = \frac{9}{16} \cdot \frac{16}{15} = \frac{3}{5}.$$

$\blacksquare$ — using Lagrangian variational principles with linear constraints and nonnegativity, the Euler–Lagrange condition for a quadratic objective, and elementary calculus of variations (EPA1969; WAN1995 §2.7).

[Interactive demo: Bandwidth explorer — three synchronized panels. Panel A: KDE vs the true Normal(0, 1) density at $n = 500$, Gaussian kernel, $h = 0.306$. Panel B: bias² (amber), variance (blue), and AMISE (rose) vs $h$ on log-log axes; dashed = oracle $h^\ast$, solid = current $h$. Panel C: AMISE$^\ast(n)$ vs $n$ on log-log axes, fitted slope −0.8000 (theoretical −0.8). Readouts at $h = h^\ast = 0.3056$: bias² ≈ 0.00046, variance ≈ 0.00185, AMISE ≈ 0.00231. Slide $h$ off $h^\ast$ and watch which term explodes.]

Slide $h$ to the left: variance explodes and AMISE follows. Slide it to the right: bias² takes over. The oracle $h^\ast$ is the unique minimum of the rose curve — it sits slightly left of where the two decomposition curves cross, since at $h^\ast$ the integrated variance is still four times the integrated squared bias (the 1:4 split behind the $5/4$ constant in AMISE$^\ast$). Switch between Gaussian / Epanechnikov / Triangular kernels: the minimum barely moves — kernel choice is negligible relative to bandwidth choice.

Example 7 Oracle h* for f = N(0,1), Gaussian kernel, n = 100

With $f = \varphi$ and Gaussian kernel, the constants are $R(K) = 1/(2\sqrt\pi) \approx 0.2821$, $\mu_2(K) = 1$, and $R(\varphi'') = 3/(8\sqrt\pi) \approx 0.2116$. Substituting into Theorem 4:

$$h^\ast = \left(\frac{0.2821}{100 \cdot 1 \cdot 0.2116}\right)^{\!1/5} = (0.01333)^{1/5} \approx 0.4217.$$

This matches the T30.12 test pin — the oracle-bandwidth computation at $n = 100$ for $f = N(0, 1)$. At $n = 1000$ the bandwidth shrinks by the factor $10^{-1/5} \approx 0.631$ (T30.13), and the optimal AMISE shrinks by $10^{-4/5} \approx 0.158$ (T30.14).
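
The three pinned numbers are reproducible in a few lines (a sketch; the actual test harness is not shown in this topic):

```ts
const rK = 1 / (2 * Math.sqrt(Math.PI));
const rFpp = 3 / (8 * Math.sqrt(Math.PI));
const hStar = (n: number): number => Math.pow(rK / (n * rFpp), 1 / 5);

console.log(hStar(100));               // ≈ 0.4217             (T30.12)
console.log(hStar(1000) / hStar(100)); // ≈ 0.631 = 10^(-1/5)  (T30.13)
console.log(Math.pow(10, -4 / 5));     // ≈ 0.158, AMISE* factor (T30.14)
```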

[Figure: AMISE curves for Gaussian, Epanechnikov, biweight, and triangular kernels at $f = N(0,1)$, $n = 200$ (left), with a zoom on the near-aligned minima between 0.37 and 0.40 (right).]

Figure 5. AMISE for four kernels on $f =$ Normal$(0, 1)$, $n = 200$. Left: full range. Right: zoom on $[0.2, 0.4]$ showing the near-identical minima, confirming that kernel choice barely matters relative to bandwidth choice.

[Figure: log-log plot of AMISE* vs $n$ for $n \in \{50, 100, 200, 500, 1000, 2000, 5000, 10000\}$ at $f = N(0,1)$, Gaussian kernel; empirical slope matches −0.8, with a dotted slope −1 parametric reference line.]

Figure 6. Oracle AMISE$^\ast$ vs $n$ on log-log axes. The slope is the theoretical $-4/5 = -0.8$; the faint dotted line at slope $-1$ is the parametric comparison. The gap between the two slopes is the price of distribution-freeness.

Remark 10 Why the n^(-1/5) rate is slower than parametric

A parametric MLE of a Normal mean has MSE $O(n^{-1})$ — the square of the standard $\sqrt n$-rate error. The KDE’s AMISE rate $O(n^{-4/5})$ is slower because we’re estimating an entire function, not a finite-dimensional parameter. The rate is a direct consequence of Theorem 4: the bias–variance trade-off forces $h \to 0$ at rate $n^{-1/5}$, not the $n^{-1/2}$ that parametric estimators enjoy. In high dimensions the rate gets worse still — Scott 2015 Ch. 6 derives the multivariate rate $O(n^{-4/(4+d)})$, which we’ll pointer-forward in §30.10.

Remark 11 Structural parallel to Bahadur linearization

The AMISE-optimal $h^\ast$ comes from the first-order condition on the AMISE objective. The structural parallel to the Bahadur representation of the sample quantile (Topic 29 §29.6) is that both reduce nonparametric inference to a balance between a variance term that shrinks with more data and a bias term controlled by an asymptotic parameter: bandwidth $h$ in the KDE case, remainder-term order in the Bahadur case. Every Track 8 result is a version of this trade-off — bootstrap, empirical processes, and density-ratio estimation each have their own “h”: a resample count, a bracketing radius, a regularization strength.

Remark 12 Kernel choice negligibility

Theorem 5 minimizes $R(K)$ subject to $\mu_2(K) = \sigma_0^2$. Substituting into $\mathrm{AMISE}^\ast$, the kernel enters through the factor $\{R(K)^4 \mu_2(K)^2\}^{1/5}$. For the five standard kernels this factor varies by only about 6% from best (Epanechnikov) to worst (uniform), and by about 4% from Epanechnikov to Gaussian. Switching kernels therefore moves AMISE$^\ast$ by a few percent; bandwidth choice, by contrast, can move AMISE by an order of magnitude. The practical advice: pick any standard kernel (Gaussian for simplicity, Epanechnikov for compact support), and spend your energy on bandwidth selection (§30.8).

Remark 13 Higher-order kernels and faster rates

A fourth-order kernel is a kernel with $\mu_2(K) = 0$; the bias in Theorem 3 then vanishes to leading order, and the next term is $O(h^4)$ involving $f^{(4)}$. The AMISE rate improves from $n^{-4/5}$ to $n^{-8/9}$ — a meaningful speedup at large $n$. The catch: a fourth-order kernel must take negative values (to achieve $\int u^2 K = 0$ with $\int K = 1$), so $\hat f_h$ can be negative. In practice this is usually patched by truncation, but Topic 30 follows the standard monograph convention of restricting to second-order kernels. Serfling 1980 §1.11 and Wand–Jones 1995 §2.8 cover higher-order kernels at length; formalml may return to them in the context of density-ratio estimation.

30.7 Pointwise asymptotic normality and confidence intervals

Theorem 4 gives us the MISE rate; Theorem 6 gives us pointwise asymptotic normality, from which Gaussian-calibrated confidence intervals for $f(x)$ follow immediately. The proof technique — Lindeberg–Feller for triangular arrays — is the same machinery that powered Topic 11 §11.6’s general CLT; we state it and sketch the verification.

Theorem 6 Parzen's pointwise asymptotic normality (stated with sketch)

Under the hypotheses of Theorem 4 (symmetric kernel satisfying $\int K = 1$, $\int u K = 0$, $\mu_2(K) < \infty$, $R(K) < \infty$; $f$ continuous at $x$ with $f(x) > 0$), and under the bandwidth regime $h \to 0$, $nh \to \infty$,

$$\sqrt{nh}\,\Bigl(\hat f_h(x) - \mathbb{E}[\hat f_h(x)]\Bigr) \;\xrightarrow{d}\; \mathcal{N}\!\left(0,\; f(x)\, R(K)\right).$$
Proof.

Write $\hat f_h(x) = \frac{1}{n}\sum_{i=1}^n Z_{n,i}$, where $Z_{n,i} := h^{-1} K((x - X_i)/h)$ form a triangular array of iid random variables — iid within each row indexed by $n$, but with a distribution that depends on $n$ through the bandwidth $h = h(n)$. The variance per term is

$$\mathrm{Var}(Z_{n,i}) = \frac{f(x)\,R(K)}{h}\,(1 + o(1)),$$

from Theorem 3 (Step (ii) of Theorem 4’s proof, multiplied through by the $h^{-1}$ normalization). Multiplying the standardizing factor $\sqrt{nh}$ onto the centered sum gives

$$\sqrt{nh}\,\bigl(\hat f_h(x) - \mathbb{E}[\hat f_h(x)]\bigr) = \frac{1}{\sqrt{n/h}}\sum_{i=1}^n (Z_{n,i} - \mathbb{E}[Z_{n,i}]),$$

whose variance is $\frac{nh}{n} \cdot \mathrm{Var}(Z_{n,i}) = h \cdot f(x) R(K)/h \cdot (1 + o(1)) = f(x)\,R(K) + o(1)$. Apply the Lindeberg–Feller central limit theorem for triangular arrays (Topic 11 §11.6 Thm 2): check the Lindeberg condition

$$\lim_{n \to \infty}\; \frac{n}{f(x)\,R(K)}\,\mathbb{E}\!\left[\bigl(Z_{n,1}/\sqrt{n/h} - \mathbb{E}[Z_{n,1}]/\sqrt{n/h}\bigr)^2\, \mathbf{1}\{\cdots > \varepsilon\}\right] = 0 \quad \text{for all } \varepsilon > 0.$$

Boundedness of $K$ (compact support in the Epanechnikov case, or $e^{-u^2/2}$ decay in the Gaussian case) ensures $|Z_{n,i}| \le \|K\|_\infty / h$, and the event inside the indicator becomes empty for $n$ large enough because the standardized summand is $O(h^{-1/2}/\sqrt{n}) = O(1/\sqrt{nh}) \to 0$. The Lindeberg condition holds.

$\blacksquare$ — using the Lindeberg–Feller CLT for triangular arrays (Topic 11 §11.6 Thm 2), the variance calculation from Thm 3, and boundedness of $K$ (PAR1962 §3–§5; vdV2000 §24.1).

Theorem 6 centers the CI at $\mathbb{E}[\hat f_h(x)]$, not at $f(x)$. At finite $n$ there is a bias shift of magnitude $\frac{h^2}{2}\mu_2(K) f''(x)$, and the resulting pointwise CI has coverage slightly below the nominal $1 - \alpha$. For the purposes of this topic, we ignore the bias correction and use the plug-in variance

$$\hat V(x) \;=\; \frac{\hat f_h(x)\, R(K)}{n\,h}, \qquad \mathrm{CI}_{1-\alpha}(x) \;=\; \hat f_h(x) \;\pm\; z_{1 - \alpha/2}\,\sqrt{\hat V(x)},$$

noting the coverage will be biased-low at modest $n$ unless $h$ is undersmoothed.

Example 8 Pointwise CI at x = 0 for N(0,1), n = 500

For a Normal$(0, 1)$ sample of size $n = 500$ (seed-42 NumPy draw, $\hat\sigma \approx 0.98$), Silverman’s bandwidth $\hat h_{\mathrm{ROT}} \approx 0.300$. At $x = 0$:

$$\hat f_h(0) \approx 0.388, \qquad \hat V(0) = \frac{0.388 \cdot 0.2821}{500 \cdot 0.300} \approx 7.30 \times 10^{-4}, \qquad \sqrt{\hat V(0)} \approx 0.027.$$

The 95% pointwise CI is $0.388 \pm 1.96 \cdot 0.027 = [0.335, 0.441]$. The true density value $f(0) = \varphi(0) \approx 0.399$ lies comfortably inside. This is the T30.11 test pin.
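
The CI is a two-line computation; a sketch reproducing Example 8’s arithmetic (the sample-specific value $\hat f_h(0) \approx 0.388$ is taken as given):

```ts
// Plug-in pointwise CI: f_hat(x) ± z * sqrt(f_hat(x) R(K) / (n h)).
function pointwiseCI(fHatX: number, h: number, n: number, rK: number, z = 1.96): [number, number] {
  const se = Math.sqrt((fHatX * rK) / (n * h));
  return [fHatX - z * se, fHatX + z * se];
}

const rK = 1 / (2 * Math.sqrt(Math.PI)); // Gaussian kernel
console.log(pointwiseCI(0.388, 0.3, 500, rK)); // ≈ [0.335, 0.441]
```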

Example 9 Density-at-quantile: Bahadur's plug-in (discharge of Topic 29 §29.6 Rem 12)

Topic 29 §29.6 Rem 12 named KDE as the tool for estimating $f(\xi_p)$ — the density in Bahadur’s limit variance $\sigma^2 = p(1 - p)/f(\xi_p)^2$. Here we discharge that promise.

Take a Normal$(0, 1)$ sample of size $n = 500$ and focus on the $p = 0.9$ quantile (the upper decile). The Type-7 sample quantile is $\hat\xi_{0.9} \approx 1.30$. The Silverman-bandwidth KDE evaluated at that quantile is

$$\hat f_{\hat h}(\hat\xi_{0.9}) \approx 0.175,$$

giving the Bahadur-plug-in asymptotic standard error

$$\hat{\mathrm{SE}}(\hat\xi_{0.9}) \;=\; \sqrt{\frac{0.9 \cdot 0.1}{500 \cdot 0.175^2}} \;\approx\; 0.0767.$$

The resulting 95% asymptotic CI is $1.30 \pm 1.96 \cdot 0.0767 \approx [1.15, 1.45]$. For comparison, the distribution-free binomial-exact CI from Topic 29 §29.7 is $[X_{(r)}, X_{(s)}]$ with $r$ and $s$ chosen so the coverage is at least $1 - \alpha$; at $n = 500$, $p = 0.9$, $\alpha = 0.05$ it comes out at a very similar width. The plug-in CI’s advantage is its closed-form standard error — asymptotic efficiency at the cost of nominal-coverage validity at finite $n$.
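
The plug-in standard error in code (a sketch of Example 9’s arithmetic; the sample-specific quantile 1.30 and KDE value 0.175 are taken as given):

```ts
// Bahadur plug-in SE: sqrt(p (1 - p) / (n * fHat(xiHat)^2)).
const bahadurSE = (p: number, n: number, fAtQ: number): number =>
  Math.sqrt((p * (1 - p)) / (n * fAtQ * fAtQ));

const se = bahadurSE(0.9, 500, 0.175);           // ≈ 0.0767
console.log(1.30 - 1.96 * se, 1.30 + 1.96 * se); // ≈ [1.15, 1.45]
```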

[Figure: empirical coverage of the pointwise 95% CI at $x = 0$ under $f = N(0,1)$ vs sample size, over 2000 MC replicates at each $n \in \{100, 200, 500, 1000, 2000, 5000\}$.]

Figure 7. Empirical coverage of the 95% pointwise CI for $f(0)$ under $f =$ Normal$(0, 1)$ across sample sizes. At small $n$ the bias shift in Theorem 6 depresses coverage below $95\%$; by $n = 5000$ the coverage is within Monte-Carlo noise of nominal.

Remark 14 Why the CI has biased coverage at small n

The CI is centered at $\hat f_h(x)$, which is a consistent estimator of $\mathbb{E}[\hat f_h(x)] = f(x) + \frac{h^2}{2}\mu_2(K) f''(x) + o(h^2)$ — not $f(x)$ itself. At the oracle bandwidth $h^\ast$ the bias is $O(n^{-2/5})$, and the interval width is $O(n^{-2/5})$ as well, so the bias is of the same order as the width. Coverage improves as $n$ grows (the $o(h^2)$ terms shrink faster than the width) but remains biased-low at the finite $n$ ML practitioners see in practice. Bias-corrected CIs (plug-in estimates of $f''$, or undersmoothing) exist; vdV2000 §24.1 is a standard reference.

Remark 15 Simultaneous CIs and the bootstrap

Theorem 6 is pointwise — it gives a CI for $f(x_0)$ at a single preselected $x_0$. For a joint CI covering $f$ over a range $[a, b]$ at level $1 - \alpha$, we need Bickel–Rosenblatt-type machinery, which turns out to be a functional CLT in the same Donsker-theory framework Topic 29 §29.8 used for the Kolmogorov limit. The cleanest route in practice is the smooth bootstrap (Topic 31 §31.8; preview at §30.10 Ex 13), which constructs simultaneous envelopes by resampling from $\hat f_h$ itself. The full treatment is forthcoming on formalml.

30.8 Data-driven bandwidth selection

The oracle $h^\ast$ from Theorem 4 requires the curvature integral $R(f'')$, which the practitioner does not know. Data-driven bandwidth selectors estimate $R(f'')$ — or skip it entirely — and return a data-dependent $\hat h$. The four standard selectors are:

  • Silverman’s rule-of-thumb (the default in R’s bw.nrd0): a Normal-scale plug-in that assumes $f \approx N(\mu, \sigma^2)$.
  • Scott’s rule: Silverman without the IQR-robustness adjustment.
  • Unbiased cross-validation (UCV): minimizes an unbiased estimator of $\mathrm{MISE}(h) - \int f^2$ (the dropped term does not depend on $h$).
  • Sheather–Jones plug-in (SHJ): a one-step direct plug-in that estimates $R(f'')$ from a pilot KDE.

Silverman and Scott are cheap closed forms; UCV is a grid search costing $O(n^2)$ per candidate bandwidth ($O(n^2 G)$ over a grid of size $G$); SHJ is an $O(n^2)$ double-sum followed by a closed-form plug-in. All four are implemented in nonparametric.ts §10 or inlined in PluginBandwidthComparator (§5.3).
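
A sketch of the two closed-form rules (the actual nonparametric.ts §10 implementations may differ, e.g. in the quantile convention behind the IQR):

```ts
// Silverman's rule of thumb: 1.06 * min(SD, IQR / 1.34) * n^(-1/5).
function silvermanBandwidth(sample: number[]): number {
  const n = sample.length;
  const mean = sample.reduce((a, b) => a + b, 0) / n;
  const sd = Math.sqrt(sample.reduce((a, b) => a + (b - mean) ** 2, 0) / (n - 1));
  const sorted = [...sample].sort((a, b) => a - b);
  const q = (p: number) => sorted[Math.min(n - 1, Math.floor(p * (n - 1)))]; // crude quantile
  const scale = Math.min(sd, (q(0.75) - q(0.25)) / 1.34); // robust scale estimate
  return 1.06 * scale * Math.pow(n, -1 / 5);
}

// Scott's rule, per this topic's convention: the same Normal-reference
// constant without the IQR-robustness adjustment.
function scottBandwidth(sample: number[]): number {
  const n = sample.length;
  const mean = sample.reduce((a, b) => a + b, 0) / n;
  const sd = Math.sqrt(sample.reduce((a, b) => a + (b - mean) ** 2, 0) / (n - 1));
  return 1.06 * sd * Math.pow(n, -1 / 5);
}
```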

Definition 4 Mean integrated squared error (MISE)

For an estimator $\hat f_h$ of a density $f$, the mean integrated squared error is

$$\mathrm{MISE}(h) \;:=\; \mathbb{E}\!\int\bigl(\hat f_h(x) - f(x)\bigr)^2\,dx.$$

The AMISE from §30.5 is the leading-order approximation to MISE; bandwidth selectors typically minimize one or the other. Note $\mathrm{MISE}(h) = \int \mathrm{Var}(\hat f_h(x))\,dx + \int \mathrm{Bias}^2(x)\,dx$ — the integrated decomposition that drives everything in Theorem 4.

Theorem 7 Silverman's rule of thumb

Under the Normal-scale reference rule — assume $f$ is Normal$(\mu, \sigma^2)$ — the AMISE-optimal bandwidth (Theorem 4) with Gaussian kernel simplifies to

$$h^\ast_{\text{Silverman}} \;=\; \left(\frac{4}{3 n}\right)^{\!1/5} \hat\sigma \;\approx\; 1.06\, \hat\sigma\, n^{-1/5},$$

where $\hat\sigma$ is a scale estimate. Silverman’s robust variant uses $\hat\sigma = \min(\mathrm{SD}, \mathrm{IQR}/1.34)$ to guard against heavy tails or mild bimodality. Stated; the derivation substitutes $R(\varphi'') = 3/(8\sqrt\pi)$ into Theorem 4 and rearranges. (SIL1986 §3.4.2)

Example 10 Silverman over-smooths bimodal mixtures

Take a 50/50 mixture of Normal$(-1.5, 1)$ and Normal$(1.5, 1)$, $n = 200$. This is the bimodalNormalPreset from nonparametric-data.ts — the overall standard deviation is $\hat\sigma \approx \sqrt{1 + 1.5^2} \approx 1.80$, so Silverman gives

$$\hat h_{\mathrm{Silverman}} \;=\; 1.06 \cdot 1.80 \cdot 200^{-1/5} \;\approx\; 0.66.$$

But the oracle bandwidth $h^\ast$ (Theorem 4, using the closed-form $R(f'') \approx 0.092$ for this mixture) is $\approx 0.43$. Silverman is roughly $1.5\times$ too large; the resulting KDE merges the two modes into a single plateau. UCV and Sheather–Jones both estimate $R(f'')$ from the data rather than assuming a Normal reference, and both recover the bimodality cleanly. Figure 8 shows the three side by side.

[Figure: three selector KDEs on the bimodal $0.5\,N(-1.5,1) + 0.5\,N(1.5,1)$ sample at $n = 200$ — Silverman (indigo, over-smoothed into a single mode), UCV (amber, under-smoothed with visible noise), Sheather–Jones (violet, both modes recovered); true density in black.]

Figure 8. Three bandwidth selectors on a bimodal mixture at $n = 200$. Silverman over-smooths; UCV is noisy; Sheather–Jones cleanly recovers the bimodality. The pedagogical payoff: rule-of-thumb selectors are conservative by design, and their conservatism bites at densities whose curvature is not well-approximated by a Normal reference.

[Interactive demo: Data-driven bandwidth comparator — Silverman’s rule is the textbook default, but on the bimodal preset $0.5\,N(-1.5, 1) + 0.5\,N(1.5, 1)$ at $n = 200$ the two modes merge; Sheather–Jones and UCV work harder but recover the structure. The demo’s summary table:]

Selector | h | ISE | ISE vs Silverman
Silverman | 0.666 | 0.0064 | 1.00×
Scott | 0.666 | 0.0064 | 1.00×
UCV | 0.187 | 0.0159 | 2.48× (under-smoothed)
Sheather–Jones | 0.435 | 0.0069 | 1.07×

At the bimodal preset, Silverman's bandwidth is roughly $1.5\times$ the oracle $h^\ast$, which is why it over-smooths across the modes. SHJ's plug-in gets much closer by estimating the curvature $R(f'')$. UCV has the largest variance across resamples — in the demo, running 200 resamples at a smaller $n$ widens its IQR column visibly.

Remark 16 Why Silverman over-smooths bimodals

Silverman’s rule substitutes a Normal reference for $R(f'')$: for $f = N(\mu, \sigma^2)$, $R(f'') = 3/(8\sqrt\pi\, \sigma^5)$. For a bimodal density with separation $|\mu_1 - \mu_2| \approx 3\sigma$, the true $R(f'')$ can be 5–$10\times$ larger than the Normal-reference value — the two modes create sharp peaks and sharp valleys that contribute heavily to $\int f''^2$. Silverman therefore underestimates $R(f'')$ by the same factor, which puts $\hat h$ roughly $(5\text{–}10)^{1/5} \approx 1.4$–$1.6\times$ too large. The fix is to estimate $R(f'')$ from the data — UCV does this implicitly (by cross-validating against the sample), SHJ does it explicitly (via a pilot KDE’s second derivative). The PluginBandwidthComparator lets you see the effect directly: switch presets and watch the Silverman bandwidth diverge from the oracle on the bimodal case, match it on the unimodal case.

30.9 ML bridges: anomaly detection and density-based classification

KDE is a direct ingredient in two classic ML workflows — unsupervised anomaly detection and the plug-in Bayes classifier — and an indirect one in modern neural density estimators. We describe both direct cases concretely; the connection to Nadaraya–Watson regression and mean-shift clustering is a forward pointer.

Example 11 Anomaly detection via low-density flagging

Given a training sample $X_1, \dots, X_n$ representing “normal” observations, fit a Gaussian KDE $\hat f_h$ with Silverman bandwidth. At test time, flag a new point $x$ as anomalous if $\hat f_h(x) < t$ for some threshold $t$. A natural choice is $t =$ the empirical 5%-quantile of $\{\hat f_h(X_i) : i = 1, \dots, n\}$ — the KDE density values at the training points themselves. Under this choice, approximately 5% of training points are (ex post) flagged, and new points that fall into regions of similar sparsity are flagged at test time.

This is the kernel-density-estimate version of one-class classification; it requires no labeled anomalies, only a clean sample of normals. In production, practitioners usually replace KDE with a more scalable estimator (Gaussian mixture models, isolation forests, autoencoders) as $n$ grows into the millions, but the KDE version remains the pedagogical entry point and a useful sanity check at low $n$.
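
A sketch of the low-density flagging recipe (the training data, bandwidth, and 5% level below are placeholders):

```ts
const phi = (u: number): number => Math.exp(-0.5 * u * u) / Math.sqrt(2 * Math.PI);
const kde = (sample: number[], h: number) => (x: number): number =>
  sample.reduce((s, xi) => s + phi((x - xi) / h), 0) / (sample.length * h);

// Fit: KDE on the "normal" training sample; threshold at the empirical
// alpha-quantile of the training points' own density values.
function fitAnomalyDetector(train: number[], h: number, alpha = 0.05): (x: number) => boolean {
  const f = kde(train, h);
  const scores = train.map(f).sort((a, b) => a - b);
  const t = scores[Math.floor(alpha * scores.length)];
  return (x: number) => f(x) < t; // true = flagged as anomalous
}

// Usage: points far from the training mass land below the threshold.
const detect = fitAnomalyDetector([0.1, -0.3, 0.7, 1.2, -0.9], 0.5); // toy data
console.log(detect(5.0)); // true — density at 5.0 is far below the cutoff
```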

Example 12 Density-based Bayes classifier and naïve Bayes

The plug-in Bayes classifier (Fix–Hodges 1951) estimates class-conditional densities $\hat f_{h, c}(x)$ via a separate KDE on each class’s training sample, then assigns test point $x$ to

$$\hat c(x) \;=\; \arg\max_c\; \hat\pi_c \cdot \hat f_{h, c}(x),$$

where $\hat\pi_c = n_c/n$ is the estimated class prior. In $d$ dimensions, full multivariate KDE suffers from the curse of dimensionality (§30.10 Rem 21), so naïve Bayes with KDE factors the class density as $\hat f_{h, c}(x) = \prod_{j=1}^d \hat f_{h, c, j}(x_j)$ — a product of $d$ univariate KDEs, one per feature. Naïve-Bayes-with-KDE is a standard ML baseline for mixed continuous-discrete features; scikit-learn’s KernelDensity is the usual building block, combined with a small ClassifierMixin wrapper.

Compared to parametric discriminant analysis (LDA, QDA from Topic 15), the KDE version trades parametric efficiency for distribution-freeness — it works on class densities that are skewed, bimodal, or otherwise non-Normal, without any parametric assumption.

Remark 17 Kernel regression via Nadaraya–Watson

The covariate-conditional extension of KDE is kernel regression (Nadaraya 1964; Watson 1964):

$$\hat m_h(x) \;=\; \frac{\sum_i Y_i\, K_h(x - X_i)}{\sum_i K_h(x - X_i)}.$$

This is a weighted average of the $Y_i$ values, with weights decaying as $|x - X_i|$ grows. It is the kernel-smoothing companion to the KDE: KDE estimates the marginal density $f(x)$, NW estimates the conditional expectation $E[Y \mid X = x]$. The two share the Theorem 3 bias–variance structure and the Theorem 6 asymptotic-normality story, so everything we built for $\hat f_h$ carries over. The full development — including local polynomial refinements — is the forthcoming formalml kernel-regression chapter.
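
A sketch of the estimator (the common $1/h$ normalization cancels between numerator and denominator, so raw kernel weights suffice):

```ts
const phi = (u: number): number => Math.exp(-0.5 * u * u) / Math.sqrt(2 * Math.PI);

// Nadaraya–Watson: kernel-weighted average of the responses Y_i.
function nadarayaWatson(xs: number[], ys: number[], h: number): (x: number) => number {
  return (x: number): number => {
    let num = 0, den = 0;
    for (let i = 0; i < xs.length; i++) {
      const w = phi((x - xs[i]) / h);
      num += w * ys[i];
      den += w;
    }
    return num / den; // degenerates (0/0) far from all data; guard in practice
  };
}
```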

Remark 18 Mean-shift clustering

Mean-shift (Fukunaga–Hostetler 1975; Comaniciu–Meer 2002) is an unsupervised clustering algorithm that performs gradient ascent on $\hat f_h$. Starting from each data point, iterate $x \leftarrow x + h^2 \cdot \nabla \log \hat f_h(x)$ until convergence; for a Gaussian kernel this step is exactly the move to the kernel-weighted mean of the sample. The fixed points are the local modes of the KDE, and points that converge to the same mode are assigned to the same cluster. The advantage over k-means is that mean-shift does not require specifying $k$ in advance — the number of modes is the number of clusters. The disadvantage is computational: gradient ascent + KDE evaluation is $O(n^2)$ per iteration, so it doesn’t scale past modest $n$. Modern variants (spectral mean-shift, blurring mean-shift) are forthcoming on formalml.
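
A one-dimensional sketch of that fixed-point iteration (Gaussian kernel; the tolerance and iteration cap are arbitrary choices):

```ts
// Unnormalized Gaussian weight; constants cancel in the weighted mean.
const w = (u: number): number => Math.exp(-0.5 * u * u);

// Ascend f_hat from x0 to the nearest KDE mode; points converging to the
// same mode share a cluster.
function meanShiftMode(sample: number[], h: number, x0: number, tol = 1e-8): number {
  let x = x0;
  for (let iter = 0; iter < 1000; iter++) {
    let num = 0, den = 0;
    for (const xi of sample) {
      const wi = w((x - xi) / h);
      num += wi * xi;
      den += wi;
    }
    const next = num / den; // kernel-weighted mean = x + h^2 * grad log f_hat(x)
    if (Math.abs(next - x) < tol) return next;
    x = next;
  }
  return x;
}
```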

30.10 Track 8 forward-map and the KDE → bootstrap bridge

Topic 30 closes with a forward-pointer figure (Fig 9) and a set of remarks that place KDE inside the larger Track 8 spine. Topic 29 (order statistics) opened the track; Topic 30 (KDE) is the mid-track hinge; Topics 31 (bootstrap) and 32 (empirical processes) close it. Beyond Track 8 itself, KDE is a load-bearing ingredient in several formalml chapters — kernel regression, local regression, normalizing flows, clustering, density-ratio estimation — each of which extends the KDE framework to a different ML-relevant context.

Example 13 Smooth bootstrap: resampling from the KDE

The standard nonparametric bootstrap resamples from the empirical distribution $F_n$ — a step function that places mass $1/n$ at each $X_i$. Silverman–Young 1987’s smooth bootstrap resamples from $\hat f_h$ instead: draw $X_i^\ast \sim \hat f_h$ by first picking an observation $X_{J_i}$ uniformly from the sample (where $J_i$ is uniform on $\{1, \dots, n\}$) and then perturbing it by an independent draw from the kernel $K_h$:

$$X_i^\ast \;=\; X_{J_i} \;+\; h \cdot U_i, \qquad U_i \sim K, \text{ iid}.$$

The resulting bootstrap distribution is smoothly supported rather than concentrated on $\{X_1, \dots, X_n\}$, which matters when the statistic of interest depends on derivatives or local continuity — e.g., quantile regression estimators or mode estimators. Topic 31: The Bootstrap develops the full bootstrap framework; smooth-bootstrap is one of several refinements it covers. The KDE → bootstrap bridge is exactly the Topic 30 → 31 edge in the Track 8 forward-map.
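
A sketch of the two-step draw for a Gaussian kernel (so $U_i$ is standard normal; Box–Muller is used for the kernel draw):

```ts
// One smooth-bootstrap resample of size m from the KDE f_hat_h.
function smoothBootstrap(sample: number[], h: number, m: number): number[] {
  const out: number[] = [];
  for (let i = 0; i < m; i++) {
    const xJ = sample[Math.floor(Math.random() * sample.length)]; // pick X_{J_i}
    const u = Math.sqrt(-2 * Math.log(1 - Math.random())) *
              Math.cos(2 * Math.PI * Math.random());              // U_i ~ N(0, 1)
    out.push(xJ + h * u);                                         // X* = X_J + h U
  }
  return out;
}
```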

[Figure: Track 8 forward-map — Topics 29 (Order Statistics) and 30 (KDE) as filled boxes, Topics 31 (Bootstrap) and 32 (Empirical Processes) as dashed boxes; forward arrows from Topic 30 to five formalml satellites (kernel-regression, local-regression, normalizing-flows, clustering, density-ratio-estimation); grey back-arrows to Topics 7, 10, 11, 13, 19, 29.]

Figure 9. Track 8 spine with Topic 30 at the center. Topics 29 and 30 are published (filled); Topics 31 and 32 are forthcoming (dashed). Forward arrows fan out to the formalml satellites where KDE machinery is re-used; back-arrows acknowledge the six prerequisite topics.

Remark 19 Topic 31: Bootstrap

The bootstrap (Efron 1979) is the Track 8 closer for parametric-inference questions. It resamples from the empirical distribution (or, in the smooth variant, from $\hat f_h$) to simulate the sampling distribution of a statistic without invoking asymptotic normality. Topic 31 (now published) develops nonparametric and parametric bootstraps, bootstrap CIs (percentile, BCa, studentized), and the Edgeworth-expansion-based analysis of bootstrap accuracy. Expect Topic 30’s bandwidth selectors to reappear as pilot estimators for smooth bootstraps of sample-quantile statistics.

Remark 20 Normalizing flows: learned density estimators

Modern neural density estimators — normalizing flows (Rezende–Mohamed 2015; Papamakarios et al. 2021) — replace the kernel-average-of-bumps $\hat f_h$ with a learned invertible transformation $g_\theta$ from a simple base density to the target: $\hat f(x) = p_{\text{base}}(g_\theta^{-1}(x))\,\bigl|\det \partial g_\theta^{-1}/\partial x\bigr|$. The AMISE framework of §30.6 doesn’t directly apply — flows are parametric within a specified flow family — but the bias-vs-capacity intuition carries over. The forthcoming formalml normalizing-flows chapter will contrast the two estimator classes head-to-head; they have complementary strengths (KDE at low $n$ and low $d$; flows at high $n$ and moderate $d$).

Remark 21 Multivariate KDE and the curse of dimensionality

In $d$ dimensions, the KDE generalizes via a product kernel or a full bandwidth matrix $\mathbf{H}$: $\hat f_\mathbf{H}(\mathbf{x}) = n^{-1} |\mathbf{H}|^{-1} \sum_i K(\mathbf{H}^{-1}(\mathbf{x} - \mathbf{X}_i))$. The AMISE rate deteriorates to $O(n^{-4/(4+d)})$ — the curse of dimensionality for density estimation. At $d = 10$ the rate is $n^{-2/7}$, so $n = 1000$ delivers accuracy comparable to a univariate sample of roughly a dozen points — which is why high-dimensional density estimation is hard. Duong 2007’s ks R package is the standard multivariate implementation; Scott 2015 Ch. 6 is the standard reference. The forthcoming formalml density-estimation-multivariate chapter treats the full story.

Remark 22 Density-ratio estimation and covariate shift

In many ML problems we care about the ratio $r(x) = p(x)/q(x)$ rather than the densities themselves — e.g., importance weighting for covariate shift, off-policy evaluation, or generative-adversarial-network discriminator outputs. The naïve plug-in $\hat r(x) = \hat f^{(p)}_h(x)/\hat f^{(q)}_h(x)$ has the same $n^{-4/5}$ rate but amplifies bias in the tails. Direct density-ratio estimation (Sugiyama–Suzuki–Kanamori 2012) sidesteps the denominator entirely by fitting $r$ directly via loss minimization. The full treatment is forthcoming on formalml as density-ratio-estimation; Topic 30’s KDE machinery is the starting point.


References

  1. Rosenblatt, Murray. (1956). Remarks on Some Nonparametric Estimates of a Density Function. Annals of Mathematical Statistics, 27(3), 832–837.
  2. Parzen, Emanuel. (1962). On Estimation of a Probability Density Function and Mode. Annals of Mathematical Statistics, 33(3), 1065–1076.
  3. Epanechnikov, Vassiliy A. (1969). Non-Parametric Estimation of a Multivariate Probability Density. Theory of Probability & Its Applications, 14(1), 153–158.
  4. Silverman, Bernard W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall.
  5. Sheather, Simon J., and M. Chris Jones. (1991). A Reliable Data-Based Bandwidth Selection Method for Kernel Density Estimation. Journal of the Royal Statistical Society: Series B (Methodological), 53(3), 683–690.
  6. Wand, M. P., and M. C. Jones. (1995). Kernel Smoothing. Chapman and Hall/CRC.
  7. Fan, Jianqing, and Irène Gijbels. (1996). Local Polynomial Modelling and Its Applications. Chapman and Hall/CRC.
  8. Duong, Tarn. (2007). ks: Kernel Density Estimation and Kernel Discriminant Analysis for Multivariate Data in R. Journal of Statistical Software, 21(7), 1–16.
  9. Scott, David W. (2015). Multivariate Density Estimation: Theory, Practice, and Visualization (2nd ed.). Wiley.
  10. van der Vaart, Aad W. (2000). Asymptotic Statistics. Cambridge University Press.
  11. Serfling, Robert J. (1980). Approximation Theorems of Mathematical Statistics. Wiley.
  12. Lehmann, Erich Leo, and George Casella. (1998). Theory of Point Estimation (2nd ed.). Springer.