intermediate 60 min read · April 22, 2026

Hierarchical & Empirical Bayes

Topic 28 closes Track 7 by bending the tools of Topics 25–27 onto structures where group-level parameters share a common distribution and the prior itself has a prior. Eight schools as the spine example. Three full proofs — Stein's paradox via Stein's identity (featured), the partial-pooling shrinkage formula in the Normal–Normal hierarchy, and volume preservation + funnel decoupling for the non-centered reparameterization. Empirical Bayes as the boundary case. Linear mixed-effects. HMC diagnostics. The forward-map to formalml closes the track.

28.1 Three ways to pool, one dataset

Eight schools each ran a short-term coaching program for the SAT-verbal section. Each school estimates its own treatment effect y_k with a standard error \sigma_k that reflects its sample size. The data, from Rubin’s 1981 analysis (RUB1981, reproduced in GEL2013 §5.5):

| School | A | B | C | D | E | F | G | H |
|---|---|---|---|---|---|---|---|---|
| y_k (effect estimate, SAT-V points) | 28 | 8 | −3 | 7 | −1 | 1 | 18 | 12 |
| \sigma_k (standard error) | 15 | 10 | 16 | 11 | 9 | 11 | 10 | 18 |

The effects are plausibly between −20 and +30 and the standard errors are large — 28 and −3 may just be noise around a common truth, or they may be real differences between schools. The question is how to estimate each school’s effect given all the data.

Three answers sit at the extremes and the middle.

  • No pooling. Treat each school as its own problem: \hat\theta_k = y_k. Honest about between-school differences, but hostage to each \sigma_k. School A’s 28-point estimate carries a ±15-point standard error — we shouldn’t read too much into it.
  • Complete pooling. Treat all schools as one problem: \hat\theta_k = \hat\mu for every k, where \hat\mu = \sum_k (y_k / \sigma_k^2) / \sum_k (1/\sigma_k^2) is the precision-weighted grand mean (\approx 7.69 here). Huge precision gain, but tosses out any real differences.
  • Partial pooling. Assume the \theta_k are drawn from a common distribution \mathcal{N}(\mu, \tau^2). Each school’s posterior mean becomes a precision-weighted compromise: (1 - B_k) y_k + B_k \hat\mu with B_k = \sigma_k^2 / (\sigma_k^2 + \tau^2). Schools with small \sigma_k stay close to y_k; schools with large \sigma_k drift toward \hat\mu. Both the between-school variance \tau^2 and the group means \theta_k are learned from the same data. A numerical sketch of all three regimes follows this list.
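A minimal numpy sketch of the three regimes on the Rubin data. It conditions on a fixed \tau = 10 and on the plug-in grand mean, a simplification of the full treatment developed in §28.4–§28.6:

```python
import numpy as np

# Rubin (1981) 8-schools data: effect estimates and standard errors.
y = np.array([28., 8., -3., 7., -1., 1., 18., 12.])
sigma = np.array([15., 10., 16., 11., 9., 11., 10., 18.])

theta_no_pool = y                          # no pooling: keep each raw estimate

w = 1.0 / sigma**2
mu_hat = np.sum(w * y) / np.sum(w)         # complete pooling: grand mean, ~7.69

tau = 10.0                                 # fixed for illustration
B = sigma**2 / (sigma**2 + tau**2)         # shrinkage factor per school
theta_partial = (1 - B) * y + B * mu_hat   # partial pooling

for k, s in enumerate("ABCDEFGH"):
    print(f"school {s}: y = {y[k]:6.1f}  B = {B[k]:.2f}  partial-pool = {theta_partial[k]:6.1f}")
```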

Three ways to pool the Rubin 1981 8-schools data. Left: no pooling — each school sits at its y_k with its own sigma_k whisker, with no borrowing. Middle: complete pooling — all eight schools collapse to the precision-weighted grand mean (dashed) with a single tight interval. Right: partial pooling at tau = 10 — posterior means interpolate between the raw y_k and the grand mean, with shrinkage proportional to each school's noisiness.

Figure 1. The three pooling regimes on 8-schools. Partial pooling is the middle panel — each school’s posterior mean is a weighted average of its raw estimate and the grand mean, with weights determined by the relative size of \sigma_k^2 (noise in y_k) and \tau^2 (spread across schools).

Topic 28 is the story of partial pooling. Six intertwined threads run through the next ten sections:

  1. The hierarchical model as a two-level generative process (§28.2).
  2. Full-Bayes inference via Gibbs + HMC, with §26.8 Rem 23’s 8-schools funnel (§28.4).
  3. Stein’s paradox (§28.5): even when the group means are unrelated in truth, shrinkage toward any common target strictly dominates the MLE under squared-error loss in d \geq 3 dimensions. The pedagogical summit of Track 7.
  4. Closed-form partial pooling in the Normal–Normal hierarchy (§28.6), with the shrinkage factor B_k proved.
  5. Empirical Bayes (§28.7): \mu and \tau^2 learned from the data via Type-II MLE; the 8-schools boundary case \hat\tau^2 \approx 0 motivates full-Bayes over EB.
  6. Non-centered reparameterization (§28.9 Thm 7): the geometric trick that lets HMC actually mix on a hierarchy, proved via a block-triangular Jacobian.

Three sections pivot between these threads: notation (§28.3), linear mixed-effects (§28.8), and the track-closer forward-map (§28.10).

Example 1 The original Rubin 8-schools question

Rubin’s 1981 motivation was not primarily statistical — a research group wanted to know whether coaching helped, and a meta-analytic question followed: do we believe school A’s 28-point effect, school C’s −3, or some compromise between each school’s point estimate and the grand mean?

Complete pooling answers “roughly +8 everywhere,” losing school A’s apparent success and school C’s apparent null. No pooling answers “trust each school’s sample mean,” but then school C’s −3 with \sigma_C = 16 is barely different from zero, so the no-pool answer is mostly noise. Partial pooling provides the compromise. The partial-pool answer — at \tau = 10, consistent with GEL2013’s posterior analysis — moves school A’s estimate down toward +14 and school C’s up toward +5 (the numbers worked out in §28.6 Ex 7), scaled by how informative each y_k is individually.

Remark 1 Why partial pooling almost always beats either extreme

Under squared-error loss, partial pooling is guaranteed to dominate at least one extreme in finite samples: it beats complete pooling when the between-school variance is real (so \tau^2 > 0) and beats no pooling when the within-school errors are large (so \sigma_k^2 is non-trivial relative to \tau^2). Stein’s paradox (§28.5) sharpens this: in d \geq 3, shrinkage dominates the MLE uniformly in \theta — not just for “similar” group means, but for every configuration. Topic 28’s hierarchical prior is the Bayesian reading of this dominance.

Remark 2 The arc of the track

Topic 25 established conjugate priors and the posterior-as-update formalism; Topic 26 added MCMC; Topic 27 added Bayes factors and BMA. Topic 28 combines all three on the natural setting where “the prior itself has a prior” — and shows that this simple step reframes decades of frequentist results (James–Stein shrinkage, ridge regression’s bias-variance tradeoff, the Hoerl–Kennard inequality) as Bayesian inference with hyperpriors. The track closer is partial pooling.

28.2 The hierarchical model

The objects of Topic 28 come in two levels. Group-level parameters \boldsymbol\theta = (\theta_1, \dots, \theta_K) govern the distribution of the data; hyperparameters \phi govern the distribution of the \theta_k’s. The prior on \boldsymbol\theta becomes conditional on \phi, and \phi itself gets a prior.

Definition 1 Hierarchical model

A hierarchical model is a joint distribution over data \mathbf{y} = (y_1, \dots, y_K), group-level parameters \boldsymbol\theta = (\theta_1, \dots, \theta_K), and hyperparameters \phi that factors as

p(\mathbf{y}, \boldsymbol\theta, \phi) \;=\; \underbrace{p(\phi)}_{\text{hyperprior}} \cdot \underbrace{\prod_{k=1}^K \pi(\theta_k \mid \phi)}_{\text{group-level prior}} \cdot \underbrace{\prod_{k=1}^K f(y_k \mid \theta_k)}_{\text{likelihood}}.

Inference targets the posterior p(\boldsymbol\theta, \phi \mid \mathbf{y}) and its marginals p(\theta_k \mid \mathbf{y}), p(\phi \mid \mathbf{y}).

The conditional-independence structure is load-bearing: given \phi, the \theta_k’s are independent draws from a common distribution; given \boldsymbol\theta, the y_k’s are independent across groups. The posterior couples the \theta_k’s through \phi — that’s where partial pooling comes from.

Definition 2 Normal–Normal hierarchical model (Rubin 1981)

The workhorse special case. Observe sample means y_k \mid \theta_k \sim \mathcal{N}(\theta_k, \sigma_k^2), k = 1, \dots, K, with \sigma_k^2 known (the within-group sampling variance). Group-level prior: \theta_k \mid \mu, \tau^2 \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu, \tau^2). Hyperprior on (\mu, \tau^2). Default choices: \mu \sim \mathcal{U}(\mathbb{R}) (improper flat) and \tau \sim \text{Half-Cauchy}(0, 5) (GEL2006 default; §28.7 Rem 17 motivates).

This is the canonical 8-schools model. \mu is the grand-mean-of-means; \tau controls between-school spread.

Example 2 Beta–Binomial hierarchical model

Let y_k be the count of successes in n_k trials for group k: y_k \mid p_k \sim \text{Binomial}(n_k, p_k). A natural hierarchical prior uses the Beta: p_k \mid \alpha, \beta \stackrel{\text{iid}}{\sim} \text{Beta}(\alpha, \beta). Hyperparameters \phi = (\alpha, \beta) might themselves carry vague priors (\text{Exp}(1) each, or a hyperprior on (\mu = \alpha/(\alpha+\beta), \kappa = \alpha + \beta) per Gelman’s reparameterization). Integrating out p_k gives the Beta-Binomial marginal for y_k — a compound distribution that was already in our posterior predictive toolkit (Topic 25 §25.5 Ex 2). The hierarchical model reuses it as a group-level sampling distribution.
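As a quick check that the compound distribution really is the marginal, the closed-form Beta-Binomial pmf can be compared against a Monte Carlo integral over p_k. The values of \alpha, \beta, n_k, y_k below are arbitrary illustrations:

```python
import numpy as np
from scipy.stats import betabinom, binom

a, b, n_k, y_k = 2.0, 5.0, 20, 7          # illustrative hyperparameters and data

closed_form = betabinom.pmf(y_k, n_k, a, b)

rng = np.random.default_rng(0)
p_draws = rng.beta(a, b, size=200_000)    # p_k ~ Beta(a, b)
monte_carlo = binom.pmf(y_k, n_k, p_draws).mean()

print(closed_form, monte_carlo)           # agree to ~3 decimals
```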

Example 3 Rat-tumor data (Gelman BDA3 §5.1)

Seventy rodent groups tested a tumor-promoting chemical; group k reports y_k tumors in n_k rats. Complete pooling gives a single success rate (throwing out real between-group differences in baseline tumor prevalence); no pooling gives 70 separate Bernoulli estimates (ignoring that these are the same species, same chemical, similar conditions). Partial pooling with a Beta-Binomial hierarchy: individual p_k are regularized toward an estimated \alpha/(\alpha+\beta) with spread governed by \alpha + \beta. This is the template — 8-schools in the Gaussian family, rat-tumor in the Bernoulli family — that every hierarchical-model textbook uses to introduce partial pooling.

Definition 3 Exchangeability

Group-level parameters (\theta_1, \dots, \theta_K) are exchangeable if their joint distribution is invariant under any permutation of the indices: p(\theta_1, \dots, \theta_K) = p(\theta_{\pi(1)}, \dots, \theta_{\pi(K)}) for every permutation \pi. Exchangeability is the Bayesian reading of “groups share a common structure but are otherwise equivalent” — de Finetti’s theorem then guarantees that an exchangeable sequence is a conditional mixture over i.i.d. sequences, so the hierarchical factorization \theta_k \mid \phi \sim \pi(\cdot \mid \phi) is essentially forced by the assumption.

Remark 3 When exchangeability fails

Exchangeability fails when some groups are known a priori to behave differently. A clinical trial with 20 standard-dose arms and 5 high-dose arms is not exchangeable across all 25 arms — dose is an informative label. The fix is to model conditional exchangeability: arms are exchangeable within a dose group, with dose-specific hyperparameters. This is the gateway to multilevel hierarchies (three, four, or more layers) — developed in §28.8 and fully at the formalml mixed-effects topic.

Remark 4 Hierarchical is not the same as Bayesian

Frequentist mixed-effects models (§28.8, Laird–Ware 1982) use the same two-level structure but estimate (\mu, \tau^2) by maximum likelihood or REML and treat the \theta_k as random-effects predictions (BLUPs). The computational machinery differs — no hyperprior, no MCMC — but the conceptual shape is identical. Topic 28 covers the Bayesian reading; the frequentist reading sits at formalml.

Remark 5 Connection to Topic 27's BMA

A hierarchical prior on within-model parameters composes cleanly with BMA across a discrete set of models. §28.10 Ex 16 treats hierarchical BMA as the natural combination: each candidate model \mathcal{M}_j specifies its own \theta_k \mid \phi_j \sim \pi_j(\cdot \mid \phi_j), and BMA weights each model’s hierarchical posterior by its marginal likelihood. This closes Topic 27 §27.10 Rem 26’s forward-promise.

28.3 Notation

Topic 28 inherits Topic 25 §25.3 verbatim (prior \pi, posterior p, marginal m, likelihood L, parameter \theta, independence \perp\!\!\!\perp). Extensions:

| Symbol | Meaning | First use |
|---|---|---|
| K | Number of groups | §28.1 |
| n_k | Sample size within group k | §28.1 |
| \theta_k | Group-k parameter (often a mean) | §28.2 Def 2 |
| \boldsymbol\theta = (\theta_1, \dots, \theta_K) | Full group-parameter vector | §28.2 Def 2 |
| \phi | Hyperparameters shared across groups | §28.2 Def 3 |
| \mu, \tau^2 | Normal–Normal hyperparameters (grand mean, between-group variance) | §28.2 Def 2 |
| \sigma_k^2 | Within-group observation variance (usually known) | §28.2 Def 2 |
| y_k | Group-k summary statistic (often \bar{x}_k) | §28.2 Def 2 |
| B_k = \sigma_k^2 / (\sigma_k^2 + \tau^2) | Shrinkage factor for group k | §28.6 Thm 4 |
| \hat\theta_{\text{MLE}}, \hat\theta_{\text{JS}}, \hat\theta_{\text{PP}} | MLE, James–Stein, partial-pool estimators | §28.5–§28.6 |
| \tilde\theta_k | Non-centered auxiliary: \theta_k = \mu + \tau \tilde\theta_k | §28.9 Thm 7 |
| R(\hat\theta, \theta) = \mathbb{E}_\theta \|\hat\theta - \theta\|^2 | Frequentist risk under squared loss | §28.5 Thm 2 |
| m(\mathbf{y} \mid \phi) = \int \prod_k f(y_k \mid \theta_k) \pi(\theta_k \mid \phi)\, d\theta_k | Type-II marginal likelihood | §28.7 Def 4 |

Subscripts: k \in \{1, \dots, K\} indexes groups; j \in \{1, \dots, n_k\} indexes observations within a group (used in §28.8 when n_k > 1); i indexes coordinates in \mathbb{R}^d in §28.5. In Stein-paradox context (§28.5), the literature-standard symbol d replaces K. The semantic mapping: K groups of scalar Normal means \leftrightarrow a d-dimensional Normal mean vector with d = K. §28.5 Rem 10 flags this bridge explicitly before the proof.

28.4 Full-Bayes inference on (\boldsymbol\theta, \phi)

The joint posterior p(\boldsymbol\theta, \phi \mid \mathbf{y}) almost never has closed form — even for Normal–Normal, integrating (\mu, \tau^2) against a Half-Cauchy hyperprior is intractable. Topic 26’s MCMC machinery handles this directly: Gibbs when full conditionals are conjugate (Normal–Normal is), HMC/NUTS when they aren’t (Beta-Binomial hyperpriors on (\alpha, \beta) typically aren’t).

Theorem 1 Gibbs on Normal–Normal hierarchical model

For the model in Def 2 with flat prior \pi(\mu) \propto 1 and \tau^2 \sim \text{Inv-}\chi^2_{\nu_0}(s_0^2), the full conditionals are all standard distributions:

\mu \mid \boldsymbol\theta, \tau^2 \;\sim\; \mathcal{N}\left(\bar\theta,\; \tau^2 / K\right),

\tau^2 \mid \boldsymbol\theta, \mu \;\sim\; \text{Inv-}\chi^2\left(\nu_0 + K,\; \frac{\nu_0 s_0^2 + \sum_k (\theta_k - \mu)^2}{\nu_0 + K}\right),

\theta_k \mid \mu, \tau^2, y_k \;\sim\; \mathcal{N}\left((1-B_k)y_k + B_k \mu,\; (1-B_k)\sigma_k^2\right).

Gibbs sweeps through the three full-conditional updates in any order; the invariant distribution is the joint posterior p(\boldsymbol\theta, \mu, \tau^2 \mid \mathbf{y}).

The third conditional is already the partial-pool posterior — §28.6 will prove this closed form from first principles.

Left: the Gibbs alternation diagram for the Normal-Normal hierarchical model — block-updating (theta given (mu, tau squared)) and ((mu, tau squared) given theta). Right: a 200-iteration mu trace from an 8-schools Gibbs run, with burn-in shaded in the first 40 iterations and the post-burn-in mean annotated.

Figure 2. Block Gibbs for Normal–Normal. Left: the alternation diagram — sample \boldsymbol\theta given the hyperparameters, then sample the hyperparameters given \boldsymbol\theta. Right: 200 iterations of \mu from an 8-schools Gibbs run. The chain mixes quickly because all full conditionals are Gaussian or inverse-\chi^2 — no Metropolis rejection to slow it down.

Example 4 8-schools Gibbs, one sweep at a time

From a warm start (\theta_k = y_k,\ \mu = \bar y,\ \tau = 10):

  1. Sample \mu \mid \boldsymbol\theta, \tau^2: a Gaussian centered at \bar\theta = \frac{1}{8}\sum_k \theta_k with standard error \tau / \sqrt{8}. If \tau = 10, that’s \mathcal{N}(\bar\theta, 12.5).
  2. Sample \tau^2 \mid \boldsymbol\theta, \mu: an inverse-\chi^2 whose scale tracks the spread \sum_k(\theta_k - \mu)^2. When the \theta_k’s are close, \tau^2 gets pulled small — reinforcing further shrinkage at the next step.
  3. Sample each \theta_k \mid \mu, \tau^2, y_k: Gaussian with mean (1-B_k)y_k + B_k\mu. Schools with large \sigma_k see their \theta_k’s pulled toward \mu; schools with small \sigma_k stay near y_k.

Iterate 4000 times, discard 1000 as burn-in. The posterior of \theta_k is represented by the 3000 post-burn-in draws; a runnable sketch follows.
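A self-contained Gibbs sketch of this sweep. The hyperprior values \nu_0 = 1, s_0^2 = 100 are assumed here for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
y = np.array([28., 8., -3., 7., -1., 1., 18., 12.])
sigma2 = np.array([15., 10., 16., 11., 9., 11., 10., 18.])**2
K = len(y)
nu0, s02 = 1.0, 10.0**2                   # assumed Inv-chi^2 hyperprior values

n_iter, burn = 4000, 1000
theta, mu, tau2 = y.copy(), y.mean(), 100.0
draws = np.empty((n_iter, K))

for t in range(n_iter):
    # 1. mu | theta, tau2 ~ N(theta_bar, tau2 / K)
    mu = rng.normal(theta.mean(), np.sqrt(tau2 / K))
    # 2. tau2 | theta, mu ~ scaled Inv-chi^2: draw as nu * scale / chi^2_nu
    scale = (nu0 * s02 + np.sum((theta - mu)**2)) / (nu0 + K)
    tau2 = (nu0 + K) * scale / rng.chisquare(nu0 + K)
    # 3. theta_k | mu, tau2, y_k ~ N((1 - B_k) y_k + B_k mu, (1 - B_k) sigma_k^2)
    B = sigma2 / (sigma2 + tau2)
    theta = rng.normal((1 - B) * y + B * mu, np.sqrt((1 - B) * sigma2))
    draws[t] = theta

print("posterior means:", draws[burn:].mean(axis=0).round(1))
```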

Example 5 When conjugacy breaks: HMC instead of Gibbs

Replace the inverse-\chi^2 prior on \tau^2 with the half-Cauchy on \tau (GEL2006 default): the full conditional for \tau^2 is no longer standard, so Gibbs on that component fails. Option 1: Metropolis-within-Gibbs with a proposal on \log\tau. Option 2 (preferred): HMC or NUTS on the joint (\boldsymbol\theta, \mu, \tau) — differentiate the log-posterior and let leapfrog integration do the mixing. §28.9 returns to this workflow with the 8-schools funnel.

Remark 6 Why block Gibbs works here but not always

The conjugacy of the Normal–Normal hierarchy is fragile: it relies on Normal observations, a Normal group-level prior, and specific forms of the hyperpriors on (\mu, \tau^2). Swap any of these for something non-conjugate (Student-t errors, a group-specific prior that isn’t Normal, a hyperprior that breaks conjugacy) and Gibbs falls apart. §28.8’s linear mixed model extends the Normal case to include covariates; §28.9’s HMC treatment handles everything else.

Remark 7 Posterior couples the $\theta_k$'s

In the prior, the \theta_k’s are conditionally independent given \phi. In the posterior, they are not: conditioning on \mathbf{y} induces correlations because \phi is itself learned from the data. Partial pooling is literally this posterior correlation — every \theta_k’s posterior mean depends on every other y_k through the shared (\mu, \tau^2).

Remark 8 Conditional vs marginal estimation

“Conditional” estimators of \theta_k fix \phi at some value (the MLE, the Type-II MLE, the posterior mode) and compute p(\theta_k \mid \mathbf{y}, \phi). “Marginal” estimators integrate \phi out: \mathbb{E}[\theta_k \mid \mathbf{y}] = \mathbb{E}_{\phi \mid \mathbf{y}}[\mathbb{E}[\theta_k \mid \mathbf{y}, \phi]]. The two can differ substantially — especially when the \phi-posterior has mass near the boundary (\tau^2 small). §28.7 dissects this contrast: EB is the conditional-at-MLE estimator; full Bayes is the marginal one.

28.5 Stein’s paradox

In 1956, Charles Stein proved a theorem that upended fifty years of mathematical statistics: for a multivariate Normal mean, the sample mean — the MVUE, the MLE, the “obvious” estimator — is inadmissible in d \geq 3 dimensions. There exist estimators (the James–Stein family) with strictly smaller risk for every true mean \theta, not just on average.

The apparent paradox: a clinical-trial arm in Boston, a wheat yield in Iowa, and a baseball batting average — three unrelated quantities — should still be jointly shrunk toward some common point to reduce total squared-error risk. This is the theorem. The resolution is that squared-error loss is a sum across coordinates, so trading a little bias on each coordinate for a lot of variance reduction overall is a win. Topic 28 reframes this as empirical Bayes: “three unrelated quantities” are related — as observations drawn from a shared hyperprior the data has to learn.

Theorem 2 Stein's inadmissibility theorem (STE1956, JAM1961)

Let X \sim \mathcal{N}_d(\theta, I_d) with d \geq 3. Under squared-error loss L(\hat\theta, \theta) = \|\hat\theta - \theta\|^2, the MLE \hat\theta_{\text{MLE}} = X has risk R(\hat\theta_{\text{MLE}}, \theta) = d for every \theta. The James–Stein estimator

\hat\theta_{\text{JS}} = \left(1 - \frac{d-2}{\|X\|^2}\right) X

has strictly smaller risk:

R(\hat\theta_{\text{JS}}, \theta) = d - (d-2)^2 \, \mathbb{E}\!\left[\frac{1}{\|X\|^2}\right] < d \quad \text{for every } \theta \in \mathbb{R}^d.

The expectation is under X \sim \mathcal{N}_d(\theta, I_d), so \|X\|^2 is non-central chi-squared with d degrees of freedom and non-centrality \|\theta\|^2.

Remark 10 Notation bridge: $d$ in §28.5 replaces $K$ elsewhere

The Stein-paradox literature uses d (or p) for the dimensionality of the Normal mean vector. Topic 28 uses K throughout §§28.1–28.4 and §§28.6–28.10 for the number of hierarchical groups. The two are the same object — a d-vector of Normal means is a collection of d = K scalar group means, stacked. §28.5 switches to d to match Stein/James/Efron–Morris notation; starting at §28.6 we revert to K. The bridge: \theta \in \mathbb{R}^d \leftrightarrow \boldsymbol\theta = (\theta_1, \dots, \theta_K).

Proof 1 Proof of Thm 2 (Stein's paradox, via Stein's identity)

Setting. X \sim \mathcal{N}_d(\theta, I_d), d \geq 3. Write \hat\theta_{\text{JS}} - \theta = (X - \theta) - (d-2) \, X/\|X\|^2. Under squared-error loss, the risk decomposes as

R(\hat\theta_{\text{JS}}, \theta) = \mathbb{E}\|\hat\theta_{\text{JS}} - \theta\|^2.

Lemma (Stein’s identity). Let Z \sim \mathcal{N}(\mu, 1) and g : \mathbb{R} \to \mathbb{R} differentiable with \mathbb{E}|g'(Z)| < \infty. Then

\mathbb{E}[(Z - \mu)\, g(Z)] = \mathbb{E}[g'(Z)].

Proof of lemma. Integration by parts against the Gaussian density \varphi(z - \mu), using \varphi'(z - \mu) = -(z - \mu)\varphi(z - \mu):

\int (z - \mu)\, g(z)\, \varphi(z - \mu)\, dz = -\int g(z) \cdot \varphi'(z - \mu)\, dz = \int g'(z)\, \varphi(z - \mu)\, dz,

with boundary terms vanishing under the integrability assumption. \square

Main calculation. Expand the squared norm of \hat\theta_{\text{JS}} - \theta:

\|\hat\theta_{\text{JS}} - \theta\|^2 = \|X - \theta\|^2 \;-\; 2(d-2) \cdot \frac{(X - \theta)^\top X}{\|X\|^2} \;+\; (d-2)^2 \cdot \frac{1}{\|X\|^2}.

Take expectations term by term. The first term is \mathbb{E}\|X - \theta\|^2 = d because X - \theta \sim \mathcal{N}_d(0, I_d). The third term is (d-2)^2 \cdot \mathbb{E}[1/\|X\|^2], which is finite and positive for d \geq 3. The middle term is the work.

Write (X - \theta)^\top X / \|X\|^2 = \sum_{i=1}^d (X_i - \theta_i)\, g_i(X) with g_i(X) = X_i / \|X\|^2. Conditioning on X_{-i} = (X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_d), the conditional distribution of X_i is \mathcal{N}(\theta_i, 1); apply Stein’s identity coordinatewise:

\mathbb{E}[(X_i - \theta_i)\, g_i(X)] = \mathbb{E}\!\left[\frac{\partial g_i}{\partial X_i}\right] = \mathbb{E}\!\left[\frac{\|X\|^2 - 2 X_i^2}{\|X\|^4}\right] = \mathbb{E}\!\left[\frac{1}{\|X\|^2} - \frac{2 X_i^2}{\|X\|^4}\right].

Sum over i = 1, \dots, d:

\mathbb{E}\!\left[\frac{(X - \theta)^\top X}{\|X\|^2}\right] = \mathbb{E}\!\left[\frac{d}{\|X\|^2} - \frac{2\|X\|^2}{\|X\|^4}\right] = (d - 2)\, \mathbb{E}\!\left[\frac{1}{\|X\|^2}\right].

Substitute into the risk expansion:

R(\hat\theta_{\text{JS}}, \theta) = d \;-\; 2(d-2)^2 \cdot \mathbb{E}\!\left[\frac{1}{\|X\|^2}\right] \;+\; (d-2)^2 \cdot \mathbb{E}\!\left[\frac{1}{\|X\|^2}\right] = d - (d-2)^2 \cdot \mathbb{E}\!\left[\frac{1}{\|X\|^2}\right].

Since \|X\|^2 is non-central \chi^2_d(\|\theta\|^2) with d \geq 3, the expectation \mathbb{E}[1/\|X\|^2] is finite and strictly positive. Hence R(\hat\theta_{\text{JS}}, \theta) < d = R(\hat\theta_{\text{MLE}}, \theta) for every \theta \in \mathbb{R}^d. \blacksquare — using Stein’s identity (STE1956) and the James–Stein 1961 representation (JAM1961).

The shrinkage factor 1 - (d-2)/\|X\|^2 is positive when \|X\|^2 > d - 2 and negative otherwise — plain JS can flip the sign of X, which is never the right answer. The positive-part JS estimator \hat\theta_{\text{JS}^+} = \max(0,\, 1 - (d-2)/\|X\|^2) \cdot X (EFR1973) dominates plain JS uniformly in \theta and is the estimator used in practice.

Theorem 3 James–Stein as empirical-Bayes posterior mean (stated)

Under the model \theta \mid \mu, \tau^2 \sim \mathcal{N}_d(\mu \mathbf{1}, \tau^2 I) and X \mid \theta \sim \mathcal{N}_d(\theta, I_d) with \mu = 0 known and \tau^2 estimated by its Type-II MLE, the posterior mean of \theta coincides with the James–Stein estimator up to a finite-sample correction. The plug-in \hat\tau^2 is \max(0, \|X\|^2/d - 1), yielding shrinkage factor B = 1/(1 + \hat\tau^2) = d/\|X\|^2 \approx (d-2)/\|X\|^2 for large \|X\|^2 — exactly Stein’s shrinkage coefficient, with the (d - 2) adjustment arising from the unbiased estimation of 1/\|\theta\|^2 via Stein’s identity (EFR1973).

This is the Efron–Morris rereading: Stein’s paradox is not a paradox at all but a theorem about empirical-Bayes shrinkage. The “unrelated quantities” are related by the empirical prior the data itself estimates.

The James-Stein shrinkage picture. Left panel: a 2D scatter of five sample points X_i with their ground-truth theta_i drawn alongside; arrows from each X_i to its James-Stein estimate show all arrows bending toward the origin. Right panel: risk curves R(JS, theta) versus the squared norm of theta for d = 3, 5, 10 — each curve starts at d - (d-2) at theta = 0 and approaches d asymptotically.

Figure 3. The Stein shock, visualized. Left: five 2D observations X_i \sim \mathcal{N}_2(\theta_i, I) with JS-shrunk estimates — every arrow bends toward the origin. Right: risk curves for d = 3, 5, 10. The MLE sits on the flat R = d line; JS curves dip substantially at \theta = 0 and rise back toward d as \|\theta\|^2 \to \infty (but never reach it).

Example 6 JS risk savings at $\theta = 0$ vs $\theta = 3 \cdot \mathbf{1}$

For d = 5:

  • At \theta = 0: \|X\|^2 \sim \chi^2_5 (central), so \mathbb{E}[1/\|X\|^2] = 1/(d - 2) = 1/3. Risk reduction: (d - 2)^2 \cdot (1/3) = 9/3 = 3. JS risk: 5 - 3 = 2 — a 60% improvement over the MLE.
  • At \theta = 3 \cdot \mathbf{1} (so \|\theta\|^2 = 45): \|X\|^2 is non-central \chi^2_5(45); \mathbb{E}[1/\|X\|^2] \approx 1/(d + \|\theta\|^2 - 2) = 1/48 \approx 0.021. Risk reduction: 9 \cdot 0.021 \approx 0.19. JS risk: 5 - 0.19 \approx 4.81. Small but positive.

Savings are largest when \theta is near the shrinkage target (here, the origin) and shrink toward zero as \|\theta\| grows; the Monte Carlo sketch below reproduces both numbers.
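A short Monte Carlo check of both risk numbers. The estimator and the risk are exactly those of Thm 2; only the replication count is an arbitrary choice:

```python
import numpy as np

def js_risk(theta, n_rep=200_000, seed=2):
    """Monte Carlo risk of the James-Stein estimator at a fixed true theta."""
    rng = np.random.default_rng(seed)
    d = theta.size
    X = rng.normal(theta, 1.0, size=(n_rep, d))
    shrink = 1.0 - (d - 2) / np.sum(X**2, axis=1, keepdims=True)
    return np.mean(np.sum((shrink * X - theta)**2, axis=1))

d = 5
print(js_risk(np.zeros(d)))      # ~2.0, vs MLE risk d = 5
print(js_risk(3 * np.ones(d)))   # ~4.8
```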

[Interactive widget: drag the true \theta across the plane and compare the squared errors of the MLE (= X), the James–Stein estimator (shrinking toward the origin), and the partial-pool (EB) estimator.]
Remark 11 Positive-part JS and the Baranchik family

The plain JS estimator can flip signs when \|X\|^2 < d - 2. The positive-part correction (EFR1973) clamps the shrinkage coefficient at zero and dominates plain JS uniformly. Baranchik 1970 generalizes further: any estimator of the form (1 - a(\|X\|^2)/\|X\|^2) X with a bounded between 0 and 2(d - 2) dominates the MLE in d \geq 3. The full Baranchik family and its Bayesian-structural interpretation (a nonparametric empirical-Bayes prior) lives at formalml.

Remark 12 Why $d \geq 3$

Stein’s theorem fails for d = 1, 2: in those dimensions the MLE is admissible under squared-error loss (Hodges–Lehmann 1951 for d = 1; Stein 1956 for d = 2). The proof above breaks because \mathbb{E}[1/\|X\|^2] diverges for d = 1, 2 — and non-centrality doesn’t save it: the expectation diverges for every \theta, not only at \theta = 0, because the \chi^2_{1,2} density does not vanish fast enough near the origin. Stein’s construction only works when \|X\|^2 has a finite reciprocal moment, which requires d \geq 3.

28.6 Partial pooling in the Normal–Normal hierarchical model

The closed-form partial-pool posterior is the single most useful calculation in hierarchical Bayes. It makes the shrinkage factor B_k explicit, shows exactly how within-group noise \sigma_k^2 trades against between-group spread \tau^2, and gives us the analytical baseline every MCMC result should match in conjugate settings.

Theorem 4 Partial-pooling shrinkage formula (Lindley–Smith 1972)

In the Normal–Normal hierarchical model (Def 2), conditional on (\mu, \tau^2):

\theta_k \mid y_k, \mu, \tau^2 \;\sim\; \mathcal{N}\!\left((1 - B_k)\, y_k + B_k\, \mu,\;\; (1 - B_k)\, \sigma_k^2\right),

where

B_k \;=\; \frac{\sigma_k^2}{\sigma_k^2 + \tau^2} \;\in\; [0, 1]

is the shrinkage factor for group k.

Interpretation: B_k is the weight on the grand mean \mu in the posterior mean; 1 - B_k is the weight on the raw observation y_k. Edge cases: \tau^2 \to 0 forces B_k \to 1 (complete pooling, every group at the grand mean); \tau^2 \to \infty drives B_k \to 0 (no pooling, every group at y_k). The posterior variance is (1 - B_k)\sigma_k^2 — less than the no-pool \sigma_k^2 by exactly the shrinkage factor, because borrowing strength from \mu narrows the interval.

Proof 2 Proof of Thm 4 (partial-pooling formula)

Setting. y_k \mid \theta_k \sim \mathcal{N}(\theta_k, \sigma_k^2) with \sigma_k^2 known; hierarchical prior \theta_k \mid \mu, \tau^2 \sim \mathcal{N}(\mu, \tau^2) iid across k; condition on (\mu, \tau^2) fixed. We derive the posterior of \theta_k given y_k.

Step 1: write densities as exponentials. The posterior is proportional to prior times likelihood:

p(\theta_k \mid y_k, \mu, \tau^2) \;\propto\; \exp\!\left\{-\frac{(\theta_k - \mu)^2}{2\tau^2}\right\} \cdot \exp\!\left\{-\frac{(y_k - \theta_k)^2}{2\sigma_k^2}\right\}.

Step 2: combine exponents and complete the square in \theta_k. Expand both quadratic terms in \theta_k:

-\frac{(\theta_k - \mu)^2}{2\tau^2} - \frac{(y_k - \theta_k)^2}{2\sigma_k^2} = -\frac{1}{2}\!\left(\frac{1}{\tau^2} + \frac{1}{\sigma_k^2}\right)\!\left(\theta_k - \theta_k^*\right)^2 + \text{const},

where the precision-weighted mean is

\theta_k^* \;=\; \left(\frac{1}{\tau^2} + \frac{1}{\sigma_k^2}\right)^{-1}\!\left(\frac{\mu}{\tau^2} + \frac{y_k}{\sigma_k^2}\right) \;=\; \frac{\tau^2 y_k + \sigma_k^2 \mu}{\sigma_k^2 + \tau^2}.

Step 3: rewrite in terms of B_k. Factor the numerator and denominator:

\theta_k^* \;=\; \frac{\tau^2}{\sigma_k^2 + \tau^2}\, y_k + \frac{\sigma_k^2}{\sigma_k^2 + \tau^2}\, \mu \;=\; (1 - B_k)\, y_k + B_k\, \mu,

with B_k = \sigma_k^2 / (\sigma_k^2 + \tau^2). The posterior variance is the reciprocal of the precision sum:

\left(\frac{1}{\tau^2} + \frac{1}{\sigma_k^2}\right)^{-1} \;=\; \frac{\sigma_k^2 \tau^2}{\sigma_k^2 + \tau^2} \;=\; (1 - B_k)\, \sigma_k^2. \qquad \blacksquare

— using Gaussian conjugacy (Topic 25 §25.5 Ex 2) and precision-weighted averaging.

Corollary (full-Bayes partial pool, stated). When (\mu, \tau^2) carries its own prior, integrating out these hyperparameters gives

\mathbb{E}[\theta_k \mid \mathbf{y}] = \mathbb{E}_{\mu, \tau^2 \mid \mathbf{y}}\!\left[(1 - B_k)\, y_k + B_k\, \mu\right],

where the expectation is under the posterior of (\mu, \tau^2). The shrinkage factor becomes a posterior-averaged version of B_k, typically larger than the EB plug-in because the posterior of \tau^2 assigns nontrivial mass to small values.

Partial-pooling shrinkage on 8-schools at three values of tau. Left: tau = 0.5, heavy shrinkage — every school's posterior mean collapses near the grand mean. Middle: tau = 5, moderate shrinkage — schools move partway toward each other but retain some individual effect. Right: tau = 50, essentially no shrinkage — posterior means are very close to the raw y_k values.

Figure 4. Partial-pool posterior means on 8-schools as \tau varies. At small \tau (left) the shrinkage factor B_k \approx 1 for every school and the posterior collapses to the grand mean; at large \tau (right) B_k \approx 0 and the posterior retains the raw estimate. The reader can trace this continuously in the EightSchoolsPartialPooling component below.

Example 7 8-schools at $\tau = 10$

From Rubin’s data: \sigma_A^2 = 225, y_A = 28. Precision-weighted grand mean at \tau = 10: \hat\mu(\tau{=}10) \approx 8.2. Shrinkage factor: B_A = 225/(225 + 100) = 0.692. Posterior mean for school A: (1 - 0.692) \cdot 28 + 0.692 \cdot 8.2 = 0.308 \cdot 28 + 0.692 \cdot 8.2 \approx 14.3. Posterior SD: \sqrt{(1 - 0.692) \cdot 225} \approx 8.3. Compare to school B: \sigma_B^2 = 100, y_B = 8, B_B = 100/200 = 0.5. Posterior mean: 0.5 \cdot 8 + 0.5 \cdot 8.2 \approx 8.1. Posterior SD: \sqrt{0.5 \cdot 100} \approx 7.1.

School A’s posterior mean has moved from 28 halfway toward \mu; school B’s barely moved because y_B was already near \mu. The precision-weighted grand mean itself changes with \tau: at \tau = 10 it’s \approx 8.2; at \tau = 0 (complete pool) it’s \approx 7.7. The snippet below reproduces these numbers.
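The same numbers in a few lines of numpy; the \tau-dependent grand mean uses weights 1/(\sigma_k^2 + \tau^2), as in the example:

```python
import numpy as np

y = np.array([28., 8., -3., 7., -1., 1., 18., 12.])
sigma2 = np.array([15., 10., 16., 11., 9., 11., 10., 18.])**2
tau2 = 100.0                                  # tau = 10

w = 1.0 / (sigma2 + tau2)
mu_hat = np.sum(w * y) / np.sum(w)            # ~8.1-8.2 at tau = 10

B = sigma2 / (sigma2 + tau2)
post_mean = (1 - B) * y + B * mu_hat          # school A -> ~14.3
post_sd = np.sqrt((1 - B) * sigma2)           # school A -> ~8.3
print(round(mu_hat, 1), post_mean.round(1), post_sd.round(1))
```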

[Interactive widget: a \tau slider over the 8-schools estimates, annotated with \hat\mu(\tau) and per-school shrinkage factors B_k. Orange: raw sample mean y_k \pm \sigma_k (no pooling). Violet: posterior mean \pm posterior SD at the current \tau (§28.6 Thm 4). Dashed indigo: precision-weighted grand mean \hat\mu(\tau), the target of shrinkage.]
Example 8 Shrinkage and Bayes factors

A partial-pool Bayes factor (Topic 27 §27.4) can now compare hierarchical specifications: \mathcal{M}_0 is complete pooling (fix \tau^2 = 0 a priori) and \mathcal{M}_1 has a vague hyperprior on \tau. Under the Lindley-paradox machinery of §27.5, the Bayes factor can favor \mathcal{M}_0 even when a z-test rejects homogeneity of means — because the full-Bayes marginal likelihood penalizes wasted parameter space. The 8-schools data lies in exactly this intermediate regime, which is why frequentist heterogeneity tests are famously unreliable at small K. §28.10 Ex 16 returns to this as hierarchical BMA.

Remark 13 Lindley–Smith 1972 in one sentence

The formula in Thm 4 is the original Bayesian partial-pooling result. Lindley and Smith’s 1972 paper framed multilevel regression as a hierarchical Bayesian calculation; the shrinkage weights B_k are the natural multivariate generalization of what we just derived. Most modern mixed-effects software (the Stan-based R packages brms and rstanarm) is a thin wrapper over the Lindley–Smith hierarchical Gaussian calculation, with non-conjugate hyperpriors handled by HMC.

Remark 14 Shrinkage is not bias

Under the hierarchical generative model, the posterior mean (1 - B_k)y_k + B_k\mu is unbiased for \theta_k in expectation over the hierarchical prior — because \theta_k itself is drawn from \mathcal{N}(\mu, \tau^2), and the posterior respects this prior. The “bias” impression comes from a misaligned frequentist question: conditional on a specific \theta_k, the shrinkage is biased (and that’s the price we pay for variance reduction). This is the bias–variance exchange at the heart of Stein’s paradox.

28.7 Empirical Bayes via Type-II MLE

“Full Bayes” requires a hyperprior on \phi and integration over it. “Empirical Bayes” short-circuits this: estimate \phi from the data by maximizing the marginal likelihood — the density of the observed \mathbf{y} after integrating out \boldsymbol\theta but not \phi. Then plug \hat\phi into the group-level posteriors.

Definition 4 Type-II marginal likelihood

For a hierarchical model with group-level parameters \boldsymbol\theta and hyperparameters \phi:

m(\mathbf{y} \mid \phi) \;=\; \int \prod_{k=1}^K f(y_k \mid \theta_k) \cdot \pi(\theta_k \mid \phi)\; d\theta_k.

The Type-II MLE is \hat\phi = \arg\max_\phi \log m(\mathbf{y} \mid \phi). The empirical-Bayes estimator of \theta_k is then the conditional posterior mean \mathbb{E}[\theta_k \mid y_k, \hat\phi] — in the Normal–Normal case, (1 - \hat B_k) y_k + \hat B_k \hat\mu with \hat B_k = \sigma_k^2 / (\sigma_k^2 + \hat\tau^2).

In Normal–Normal with observation variances \sigma_k^2 known, the marginal is tractable because both integration steps are Gaussian: y_k \mid \mu, \tau^2 \sim \mathcal{N}(\mu, \sigma_k^2 + \tau^2) independent across k, so

\log m(\mathbf{y} \mid \mu, \tau^2) = \sum_{k=1}^K \log \mathcal{N}(y_k \mid \mu, \sigma_k^2 + \tau^2).
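Maximizing this sum numerically takes a few lines; a sketch with scipy, where the log-parameterization of \tau^2 is just a convenience to keep it nonnegative:

```python
import numpy as np
from scipy.optimize import minimize

y = np.array([28., 8., -3., 7., -1., 1., 18., 12.])
sigma2 = np.array([15., 10., 16., 11., 9., 11., 10., 18.])**2

def neg_log_marginal(params):
    mu, log_tau2 = params
    v = sigma2 + np.exp(log_tau2)          # marginal variance of y_k
    return 0.5 * np.sum(np.log(2 * np.pi * v) + (y - mu)**2 / v)

res = minimize(neg_log_marginal, x0=[0.0, np.log(25.0)], method="Nelder-Mead")
mu_hat, tau2_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, tau2_hat)   # mu -> ~7.69; tau^2 drifts to the boundary at 0 (Ex 9)
```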
Theorem 5 Consistency of Type-II MLE (stated, KLE2012)

Under regularity conditions on the hierarchical likelihood and prior families, and assuming the true \phi_0 is identifiable and lies in the interior of the parameter space, the Type-II MLE \hat\phi \to \phi_0 in probability as K \to \infty. Moreover, the empirical-Bayes estimator of \theta_k is asymptotically equivalent (in K) to the full-Bayes posterior mean under any hyperprior with positive density at \phi_0. Kleijn and van der Vaart 2012 extend this to the misspecified case via a generalized Bernstein–von Mises theorem; the result is that EB and full Bayes agree at first order when the number of groups grows but can disagree substantially at small K — especially when \phi_0 sits on the boundary of the parameter space.

The boundary case is exactly what happens on 8-schools: the Type-II MLE of \tau^2 collapses to \hat\tau^2 \approx 0.

Contour plot of the log Type-II marginal likelihood of the 8-schools data as a function of mu (horizontal) and tau squared (vertical, log scale). A white star marks the Type-II MLE near (mu = 7.7, tau squared = 0). An indigo circle marks the full-Bayes posterior mean under a half-Cauchy prior on tau at roughly (mu = 7.9, tau squared = 43). The two locations are strikingly different — the MLE sits on the tau squared boundary while the Bayesian estimate splits the difference.

Figure 5. The 8-schools Type-II log-marginal surface. The MLE (white star) sits at \hat\tau^2 \approx 0; the full-Bayes posterior mean under a half-Cauchy prior on \tau (indigo circle) sits well above the boundary. The gap between these two points is the crux of §28.7 and the reason GEL2006 advocates full Bayes with a weakly-informative prior on \tau.

Example 9 8-schools Type-II MLE: the boundary case

Running Type-II MLE on the 8-schools data — iteratively update \mu as the precision-weighted mean of \mathbf{y} with weights 1/(\sigma_k^2 + \tau^2), then \tau^2 as a method-of-moments estimator floored at zero — the iteration lands at \hat\mu \approx 7.69, \hat\tau^2 \approx 0 in a handful of steps. The EB partial-pool estimator then coincides with the complete-pool estimator: every school’s posterior mean is \hat\mu, with posterior SD zero.

This is not a bug. It’s the correct Type-II MLE: with only K = 8 groups, the marginal likelihood genuinely peaks at the boundary. But it’s a poor point estimate to plug in — the full-Bayes posterior under a half-Cauchy(0, 5) prior on \tau has mean \hat\tau^2 \approx 43, reflecting the substantial uncertainty in the between-group variance.

Example 10 Empirical-Bayes ridge regression (discharges Topic 23 Rem 24)

In ridge regression (Topic 23), the shrinkage strength \lambda is usually chosen by cross-validation. The hierarchical Bayesian reading: prior \beta \mid \tau^2 \sim \mathcal{N}_p(0, \tau^2 I) with \tau^2 unknown, likelihood y \mid \beta \sim \mathcal{N}_n(X\beta, \sigma^2 I). Integrating out \beta gives a marginal likelihood in (\sigma^2, \tau^2):

\log m(y \mid \sigma^2, \tau^2) = -\tfrac{1}{2}\log\det(\sigma^2 I + \tau^2 XX^\top) - \tfrac{1}{2}\, y^\top(\sigma^2 I + \tau^2 XX^\top)^{-1} y + \text{const}.

The Type-II MLE of (\sigma^2, \tau^2) yields an estimated shrinkage parameter \hat\lambda = \hat\sigma^2 / \hat\tau^2, and the empirical-Bayes posterior mean of \beta is exactly the ridge estimator at \lambda = \hat\lambda. Topic 23 Rem 24 forward-promised this; it discharges here. The key point: EB ridge’s \hat\lambda has an asymptotic interpretation as the MLE of the optimal shrinkage strength — a more principled alternative to CV when the hierarchical model is a reasonable generative story for the data.
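A small simulated sketch of EB ridge. The data-generating values are arbitrary assumptions; the objective is the marginal above with (\sigma^2, \tau^2) log-parameterized:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n, p = 80, 10
X = rng.normal(size=(n, p))
yv = X @ rng.normal(scale=0.5, size=p) + rng.normal(size=n)

def neg_log_marginal(params):
    s2, t2 = np.exp(params)
    C = s2 * np.eye(n) + t2 * (X @ X.T)
    _, logdet = np.linalg.slogdet(C)
    return 0.5 * (logdet + yv @ np.linalg.solve(C, yv))

res = minimize(neg_log_marginal, x0=[0.0, 0.0], method="Nelder-Mead")
s2_hat, t2_hat = np.exp(res.x)
lam = s2_hat / t2_hat                                            # lambda-hat
beta_eb = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ yv)   # ridge at lambda-hat
print(f"lambda_hat = {lam:.2f}")
```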

[Interactive widget: the 8-schools Type-II log-marginal surface explored over hyperprior mean \mu and between-group variance \tau^2 (log scale). EB MLE (★): \hat\mu = 7.69, \hat\tau^2 = 0.00, converged in one iteration. Full Bayes (○): \bar\mu = 7.92, \bar\tau^2 = 43.30, NUTS posterior mean under a half-Cauchy(0, 5) prior on \tau.]
Remark 15 EB as the Bayesian face of the EM algorithm

The iterative Type-II MLE update — “given \phi, compute posterior means of \theta_k; given those, update \phi” — is EM with \boldsymbol\theta as missing data. This is why EM pops up all over hierarchical modeling: HMM parameter estimation, mixture models, factor analysis. When the E-step can be done in closed form (Normal–Normal) EB is cheap; when it requires MCMC it’s a stochastic EM, equivalent to full-Bayes point estimation.

Remark 16 EB confidence intervals are too narrow

The EB point estimate \hat\phi comes with uncertainty — the Type-II MLE is subject to sampling variation. Plug-in EB ignores this uncertainty in the downstream \theta_k posteriors, producing credible intervals that are anti-conservative (too narrow). Full-Bayes integration corrects for this automatically. A “corrected EB” workflow (Laird and Louis 1987) inflates the EB intervals using the Fisher information of the marginal likelihood, partially restoring coverage. Modern practice: just do full Bayes with a weakly-informative hyperprior (Remark 17).

Remark 17 Gelman 2006: the half-Cauchy default on $\tau$

When K is small (5–20 groups), the EB boundary collapse is a real problem and full Bayes with a proper hyperprior on \tau is the standard fix. GEL2006 argues for Half-Cauchy(0, A) with A chosen to cover the plausible range of between-group spread — weakly informative enough to avoid boundary degeneracy, diffuse enough not to dominate the likelihood. For 8-schools, A = 5 gives posterior \bar\tau^2 \approx 43 (vs the EB MLE at 0). When K is large (say K \geq 50), the boundary issue evaporates — EB and full Bayes effectively agree. The small-K regime is where hierarchical modeling pays its biggest dividends and also where prior specification matters most.

28.8 Linear mixed models and random effects

The Normal–Normal hierarchical model is the simplest case of a much broader family: linear mixed-effects models (LMMs), which allow covariates at both the group and observation level.

Definition 5 Bayesian linear mixed-effects model (Laird–Ware 1982)

Observations y_{kj} for j = 1, \dots, n_k in group k = 1, \dots, K, with group-level covariates \mathbf{z}_k and observation-level covariates \mathbf{x}_{kj}:

y_{kj} = \mathbf{x}_{kj}^\top \boldsymbol\beta + \mathbf{z}_k^\top \mathbf{b}_k + \varepsilon_{kj}, \qquad \varepsilon_{kj} \sim \mathcal{N}(0, \sigma^2),

where \boldsymbol\beta are fixed effects (shared across groups) and \mathbf{b}_k are random effects with hierarchical prior

\mathbf{b}_k \mid \boldsymbol\Sigma_b \sim \mathcal{N}_q(\mathbf{0}, \boldsymbol\Sigma_b) \text{ iid across } k.

Hyperprior on (\boldsymbol\beta, \sigma^2, \boldsymbol\Sigma_b). The Normal–Normal hierarchy is the special case with no covariates, n_k = 1, \mathbf{b}_k = \theta_k - \mu, and \boldsymbol\Sigma_b = \tau^2.

In R/Stan syntax (brms, rstanarm), this is y ~ x + (1 + z | group) — a fixed-effects slope for x plus a group-varying intercept and slope in z with a hierarchical Gaussian prior.
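A generative sketch of Def 5 in numpy makes the two-level structure concrete; all parameter values below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
K, n_k = 12, 8                        # groups and observations per group
beta = np.array([1.0, -0.5])          # fixed effects
Sigma_b = np.array([[1.0, 0.3],
                    [0.3, 0.5]])      # random-effects covariance (q = 2)
sigma = 0.8                           # observation noise SD

rows = []
for k in range(K):
    b_k = rng.multivariate_normal(np.zeros(2), Sigma_b)   # group-level draw
    for _ in range(n_k):
        x = rng.normal(size=2)                            # covariates
        z = np.array([1.0, x[1]])                         # varying intercept + slope
        y_kj = x @ beta + z @ b_k + rng.normal(scale=sigma)
        rows.append((k, *x, y_kj))

data = np.asarray(rows)               # columns: group, x1, x2, y
print(data.shape)                     # (96, 4)
```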

Theorem 6 Posterior factorization for Gaussian LMM (stated)

Under Def 5 with Gaussian likelihood and conjugate priors on (\boldsymbol\beta, \sigma^2) plus a Wishart (or inverse-Wishart) hyperprior on \boldsymbol\Sigma_b, the joint posterior factorizes as a product of Gaussian full conditionals on (\boldsymbol\beta, \mathbf{b}_k) and an inverse-Wishart full conditional on \boldsymbol\Sigma_b (up to its hyperparameters). Gibbs sampling with these conjugate full conditionals converges geometrically; the resulting posterior predictive for a new observation in group k is a multivariate t-distribution analogous to Topic 25 §25.5 Ex 2.

Random-effects regression on simulated clustered data. Left panel: twelve groups, each fit separately with OLS — no-pool slopes and intercepts show large between-group spread including obviously-noisy slopes in small groups. Right panel: partial-pool random-effects fit — group-specific lines are pulled toward the fixed-effect population line, with small-n groups shrunk most heavily.

Figure 6. Random-effects shrinkage on slopes and intercepts. Twelve simulated groups, each with n_k = 8 observations, fitted with (left) per-group OLS and (right) a Bayesian LMM with hierarchical Gaussian priors on the group-level coefficients. The LMM regularizes the per-group lines toward the population fixed effect — most aggressively for small-n groups whose individual OLS fits would be dominated by noise.

Example 11 Schools revisited with covariates

Extend 8-schools: now suppose each school has a pre-test score \bar{x}_k, and the coaching effect varies with baseline performance. LMM form:

y_k = \alpha + \beta \bar{x}_k + b_k + \varepsilon_k, \qquad b_k \sim \mathcal{N}(0, \tau^2), \quad \varepsilon_k \sim \mathcal{N}(0, \sigma_k^2).

The fixed effects (\alpha, \beta) describe the population-level relationship between baseline and coaching effect; the random effect b_k absorbs school-specific deviations with hierarchical shrinkage. If \beta is credibly nonzero, the coaching effect depends on baseline; if not, the model collapses back to Def 2. The LMM structure lets the data answer this — whether the between-school variation is explained by observable covariates or by residual randomness.

Remark 18 Covariance structures on $\boldsymbol\Sigma_b$

Def 5 is agnostic about the covariance structure of the random effects. Common choices: unstructured \boldsymbol\Sigma_b with a Wishart hyperprior (most flexible, expensive); diagonal (independent intercept/slope variances, cheap but restrictive); LKJ-based (GEL2013 §17.4, the brms default: factor \boldsymbol\Sigma_b = \text{diag}(\boldsymbol\tau)\, \mathbf{R}\, \text{diag}(\boldsymbol\tau) with an LKJ prior on the correlation matrix \mathbf{R} and half-Cauchy priors on the scales \boldsymbol\tau). The LKJ decomposition is now the de facto modern default — it separates scale and correlation hyperpriors cleanly.

Remark 19 Random intercepts, random slopes, crossed random effects

“Random intercept” models let each group’s mean shift (the 8-schools case); “random slope” models let each group’s coefficient on a covariate vary (“how does dose respond differently in each site?”). Crossed random effects handle non-nested grouping — e.g., items \times subjects in psycholinguistic experiments. The LMM machinery handles all three with appropriate design matrices and covariance structures; the applied workflow in Stan is a one-line change in the model specification.

Remark 20 REML and GLMMs (formalml)

Restricted maximum likelihood (Patterson–Thompson 1971, HAR1977) is the frequentist standard for estimating variance components in LMMs. REML conditions on the fixed-effects estimates to remove their bias influence on the variance estimators — crucial in small samples. lme4’s lmer in R uses REML by default. Non-Gaussian extensions (GLMMs, Breslow–Clayton 1993 with PQL; Pinheiro–Bates 2000’s numerical integration approaches) handle binomial, Poisson, and other exponential-family responses. All three — REML, PQL, Gauss–Hermite — live at formalML: mixed-effects. Topic 28’s scope is the Bayesian Gaussian case.

28.9 Hierarchical diagnostics: funnels, divergences, and the non-centered cure

Gibbs on Normal–Normal is easy; HMC on hierarchical models with non-conjugate hyperpriors is hard. The reason is Neal’s funnel (NEA2011 §4): the joint posterior of (\theta_k, \tau) has a pinched geometry where small \tau forces \theta_k \approx \mu regardless of the data, and HMC’s leapfrog integrator either overshoots into divergent trajectories or gets stuck crawling through the narrow neck.

Centered parametrization HMC on 8-schools with fixed step size 0.1. The scatter plot shows (log tau, theta_1) samples — most mass is in the bulk of the funnel at moderate log tau values, but the pinched neck at small log tau is where divergent trajectories appear as red X markers, concentrated where the density is steepest.

Figure 7. Neal’s funnel on 8-schools. HMC with a fixed step size samples the bulk of the funnel well, but divergent trajectories (red ×) cluster at small \log\tau — exactly where the density is steepest and the leapfrog discretization error grows fastest.

The fix is to reparameterize. Instead of sampling the hierarchical parameters \theta_k directly, sample standardized versions \tilde\theta_k \sim \mathcal{N}(0, 1) and transform deterministically: \theta_k = \mu + \tau \tilde\theta_k. Under the non-centered parameterization, the prior on \tilde\theta_k doesn’t depend on \tau, decoupling the funnel.

Theorem 7 Non-centered reparameterization: volume preservation + funnel decoupling

Let the centered hierarchical model be \theta_k \mid \mu, \tau \sim \mathcal{N}(\mu, \tau^2) iid for k = 1, \dots, K with any priors on (\mu, \tau), \tau > 0. Define the non-centered auxiliary variables \tilde\theta_k \sim \mathcal{N}(0, 1) iid via the deterministic map \theta_k = \mu + \tau \tilde\theta_k.

(a) The Jacobian determinant of the change of variables (\tilde{\boldsymbol\theta}, \mu, \tau) \mapsto (\boldsymbol\theta, \mu, \tau) equals \tau^K:

p_{\text{nc}}(\tilde{\boldsymbol\theta}, \mu, \tau \mid \mathbf{y}) = p_{\text{c}}(\boldsymbol\theta, \mu, \tau \mid \mathbf{y})\big|_{\theta_k = \mu + \tau \tilde\theta_k} \cdot \tau^K,

so the two parameterizations assign identical posterior probability to every event.

(b) Under the non-centered parameterization, the prior on \tilde{\boldsymbol\theta} is \mathcal{N}_K(\mathbf{0}, I_K), independent of \tau. The funnel pathology — where the conditional variance of \theta_k given (\mu, \tau) shrinks with \tau — is replaced by a unit-Gaussian prior geometry that HMC traverses without pathology. Divergences are essentially eliminated in typical parameter regimes.

Proof 3 Proof of Thm 7 (volume preservation + funnel decoupling)

(a) Volume preservation. Consider the smooth map \Phi : (\tilde{\boldsymbol\theta}, \mu, \tau) \mapsto (\boldsymbol\theta, \mu, \tau) with \theta_k = \mu + \tau \tilde\theta_k and (\mu, \tau) passed through unchanged. Its Jacobian is the (K+2) \times (K+2) matrix

J \;=\; \frac{\partial(\boldsymbol\theta, \mu, \tau)}{\partial(\tilde{\boldsymbol\theta}, \mu, \tau)} \;=\; \begin{pmatrix} \tau\, I_K & \mathbf{1}_K & \tilde{\boldsymbol\theta} \\ \mathbf{0}^\top & 1 & 0 \\ \mathbf{0}^\top & 0 & 1 \end{pmatrix}.

The top-left K \times K block is \tau\, I_K (determinant \tau^K); the bottom-right 2 \times 2 block is the identity (determinant 1). Block-triangular determinants multiply, so |\det J| = \tau^K.

By the multivariate change-of-variables formula for densities,

p_{\text{nc}}(\tilde{\boldsymbol\theta}, \mu, \tau \mid \mathbf{y}) \;=\; p_{\text{c}}(\Phi(\tilde{\boldsymbol\theta}, \mu, \tau) \mid \mathbf{y}) \cdot |\det J| \;=\; p_{\text{c}}(\boldsymbol\theta, \mu, \tau \mid \mathbf{y})\big|_{\theta_k = \mu + \tau \tilde\theta_k} \cdot \tau^K.

This confirms the posterior measure is preserved — the two parameterizations assign identical probability to every event. The density differs by the \tau^K factor, but the density’s conditional-dependence geometry (which parameters couple to which) differs more consequentially, which is what part (b) addresses.

(b) Funnel decoupling. In the centered parameterization, the prior factor for \theta_k given (\mu, \tau) is \mathcal{N}(\mu, \tau^2). As \tau \to 0, the conditional density of \theta_k sharpens into a delta at \mu — the funnel’s neck. HMC trajectories with fixed step size cannot adapt to this spatially-varying scale: at large \tau the step is too small (slow mixing); at small \tau the step is too large (divergent trajectories).

In the non-centered parameterization, the prior on \tilde\theta_k is \mathcal{N}(0, 1), independent of \tau. The only coupling between \tilde\theta_k and \tau is through the likelihood f(y_k \mid \mu + \tau \tilde\theta_k), which has bounded curvature whenever the observation variance \sigma_k^2 is bounded away from zero (always the case in practice). The posterior density near small \tau approaches the unit-Gaussian prior’s spherical geometry — exactly the regime HMC handles without pathology.

This decoupling is not a small effect. Empirically, an 8-schools HMC run with fixed step size 0.1 produces roughly 20% divergent trajectories in the centered parameterization and essentially zero in the non-centered one. \blacksquare — using multivariate change-of-variables (formalcalculus multivariable-integration) and the funnel-geometry analysis of BET2015 §3.

Two-by-two panel: centered versus non-centered parametrization traces for an 8-schools stylized posterior. Top row shows centered mode — the theta_1 trace has visible sticking at low log tau values and the log tau trace gets stuck near zero. Bottom row shows non-centered mode — both theta tilde 1 and log tau traces mix freely, with effective sample sizes an order of magnitude higher.

Figure 8. HMC traces for centered (top) and non-centered (bottom) parameterizations on 8-schools. Centered: \theta_1 freezes at low \tau and \tau itself sticks near zero. Non-centered: both \tilde\theta_1 and \log\tau mix freely, with effective sample sizes an order of magnitude higher.

Example 12 8-schools NUTS with centered vs non-centered

Running NUTS (Hoffman–Gelman 2014, Topic 26 §26.8) with default settings on the centered 8-schools model: roughly 3% of trajectories diverge, \hat R for \tau hovers near 1.05, and the effective sample size for \tau is \sim 400 out of 4000 post-burn-in draws. Reparameterize to non-centered: divergences drop to \sim 0.1\%, \hat R \approx 1.00, and ESS for \tau jumps to \sim 2500. The posterior is identical (by Thm 7a) but the sampler works an order of magnitude better.

Production Bayesian-hierarchical tooling treats this trick as a default or near-default (brms non-centers grouping terms automatically; Stan and PyMC hierarchical models are routinely written in non-centered form). It’s become invisible infrastructure.
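A minimal non-centered 8-schools sketch, assuming a recent PyMC; exact divergence counts vary run to run:

```python
import numpy as np
import pymc as pm

y = np.array([28., 8., -3., 7., -1., 1., 18., 12.])
sigma = np.array([15., 10., 16., 11., 9., 11., 10., 18.])

with pm.Model() as non_centered:
    mu = pm.Normal("mu", 0.0, 10.0)
    tau = pm.HalfCauchy("tau", 5.0)
    theta_tilde = pm.Normal("theta_tilde", 0.0, 1.0, shape=8)   # ~ N(0, 1), tau-free
    theta = pm.Deterministic("theta", mu + tau * theta_tilde)   # Thm 7's map
    pm.Normal("y_obs", theta, sigma, observed=y)
    idata = pm.sample(2000, tune=1000)

# Centered variant (expect divergences on this dataset): replace the two
# theta lines with  theta = pm.Normal("theta", mu, tau, shape=8).
```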

[Interactive widget: the (\theta, \log\tau) funnel under centered vs non-centered parameterization. Centered: the pinched funnel at small \tau forces \theta \to 0, producing divergent trajectories under large step sizes; raising the step size \varepsilon triggers more divergences.]
Example 13 Posterior-predictive checks on hierarchical fits

After running HMC on an 8-schools posterior, the Bayesian-workflow step (Topic 27 §27.9) is the posterior-predictive check: draw replicate datasets from the fitted posterior and compare to y\mathbf{y}. Relevant checks for hierarchical models:

  1. Are the replicated group-level spreads similar to the observed spread? Compare the replicated range max(yrep)min(yrep)\text{max}(y_{\text{rep}}) - \text{min}(y_{\text{rep}}) to the observed range max(y)min(y)\text{max}(y) - \text{min}(y) via a posterior-predictive tail probability (see the sketch below).
  2. Is the replicated between-group variance within the posterior spread of τ\tau? This tests whether the hierarchical prior is well calibrated.
  3. Are group-level ranks consistent? For each group, compute P(yk,rep>yky)P(y_{k,\text{rep}} > y_k \mid \mathbf{y}) and look at the empirical distribution of these p-values; extreme values flag groups the model fits poorly.

For 8-schools with a half-Cauchy hyperprior, the posterior predictive passes all three checks. For the empirical-Bayes plug-in at the boundary MLE (τ^=0\hat\tau = 0), check 2 fails badly: the model underestimates between-group variability.
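A numpy sketch of check 1, assuming posterior draws of the school effects are available (the `theta_draws` placeholder below stands in for draws from a real fit, e.g. the PyMC sketch above):

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([28., 8., -3., 7., -1., 1., 18., 12.])
sigma = np.array([15., 10., 16., 11., 9., 11., 10., 18.])

# Posterior draws of the eight school effects, shape (S, 8). Placeholder here;
# in practice take them from the fit, e.g.
#   trace_nc.posterior["theta"].values.reshape(-1, 8)
theta_draws = rng.normal(8.0, 5.0, size=(4000, 8))

# Check 1: replicated range vs observed range.
y_rep = rng.normal(theta_draws, sigma)           # one replicate dataset per draw
t_rep = y_rep.max(axis=1) - y_rep.min(axis=1)    # replicated max - min
t_obs = y.max() - y.min()                        # observed range: 28 - (-3) = 31
p_val = np.mean(t_rep >= t_obs)                  # posterior-predictive tail probability
print(f"posterior-predictive p-value for the range statistic: {p_val:.2f}")
# Values near 0 or 1 indicate the model cannot reproduce the observed spread.
```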

Remark 21 When the non-centered trick backfires

Non-centering is the better default when τ\tau is small relative to the σk\sigma_k. When τ\tau is large (lots of between-group variability), the centered parameterization is actually better: the data then strongly constrain each θk\theta_k individually, the posterior for θk\theta_k looks like N(yk,σk2)\mathcal{N}(y_k, \sigma_k^2), and the likelihood dominates the prior. In this regime the non-centered parameterization introduces unnecessary coupling through the product τθ~k\tau \tilde\theta_k. Some tools try to choose a parameterization automatically from the relative strength of prior and likelihood; in practice, experienced users try both and keep the one with better diagnostics.

Remark 22 Beyond non-centered: Riemannian HMC

For pathological posteriors where even non-centering doesn't mix well (heavy-tailed hierarchies, highly non-Gaussian joint structures), Riemannian HMC (Girolami–Calderhead 2011) adapts the mass matrix dynamically to the local curvature of the log-posterior. It is more expensive per step but mixes far better on difficult geometries. The implementation is delicate (it requires Hessian computations) and mostly lives in specialist codebases. Stan's NUTS is the practical default for the vast majority of problems; Riemannian HMC is the escape hatch.

28.10 Track 7 closer: the forward-map to formalml

Track 7 has developed Bayesian foundations, MCMC, model comparison, and hierarchical inference — a complete computational and conceptual toolkit for practical Bayesian analysis. Topic 28 closes the track, but the story continues at formalml. This final section maps the forward-pointers.

Forward-map diagram. The Topic 28 hub in the center, with seven arrows pointing right to formalml topics: horseshoe and continuous shrinkage, variational inference, Bayesian neural networks, Bayesian nonparametrics, meta-learning, mixed effects, deep ensembles. Back-arrows on the left connect to Topics 23 (ridge and Stein), 25 (Bayes foundations), 26 (MCMC), 27 (BMA).

Figure 9. The Track 7 forward-map. Topic 28 is the closer; seven threads extend to formalml, and four back-arrows connect to prerequisites within formalstatistics.

Example 14 Meta-learning as hierarchical Bayes

The modern ML framing of “learn to learn” is hierarchical Bayesian inference over tasks. Each task kk has its own parameters θk\theta_k; tasks are drawn from a task distribution π(θkϕ)\pi(\theta_k \mid \phi) with task-distribution hyperparameters ϕ\phi. Training on many tasks learns ϕ\phi; adapting to a new task kk^* starts from the learned ϕ\phi and updates θk\theta_{k^*} from a few observations. MAML (Finn et al. 2017), Neural-Process families (Garnelo et al. 2018), and Bayesian neural networks with hierarchical priors are all recognizable as Topic 28’s formalism scaled to neural-network parameter spaces. Full treatment at formalML: meta-learning.
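To see the correspondence concretely, here is a toy numpy sketch (all numbers synthetic) in which “meta-training” estimates ϕ = (μ, τ) by moment matching across tasks and “adaptation” to a new task is exactly the conjugate Normal–Normal partial-pooling update:

```python
import numpy as np

rng = np.random.default_rng(1)

# "Meta-training": many tasks, each with its own theta_k ~ N(mu, tau^2).
mu_true, tau_true, obs_sd = 2.0, 1.5, 1.0
K, n_per_task = 200, 20
theta = rng.normal(mu_true, tau_true, K)
task_means = rng.normal(theta, obs_sd / np.sqrt(n_per_task))

# Learn phi = (mu, tau^2) by moment matching (a crude empirical-Bayes stand-in).
mu_hat = task_means.mean()
tau2_hat = max(task_means.var(ddof=1) - obs_sd**2 / n_per_task, 1e-6)

# "Meta-test": adapt to a new task from n_star = 3 observations via the
# conjugate Normal-Normal update -- the partial-pooling formula of SS28.1.
n_star = 3
y_star = rng.normal(rng.normal(mu_true, tau_true), obs_sd, n_star)
prec_data, prec_prior = n_star / obs_sd**2, 1.0 / tau2_hat
theta_star = (y_star.mean() * prec_data + mu_hat * prec_prior) / (prec_data + prec_prior)
print(f"adapted estimate {theta_star:.2f} vs raw few-shot mean {y_star.mean():.2f}")
```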

Example 15 Deep ensembles and Bayesian neural networks

Training MM neural networks independently from random initializations and averaging their predictions (the “deep ensemble” of Lakshminarayanan et al. 2017) behaves like an approximate BNN posterior when the base model is flexible enough. The implicit hierarchical prior is on network initializations: θmπ(θϕ)\theta_m \sim \pi(\theta \mid \phi), with ϕ\phi standing for the architecture and training setup. BMA over the ensemble gives calibrated predictive uncertainty at a fraction of the cost of full HMC over neural weights. Full development at formalML: bayesian-neural-networks.
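A minimal sketch of the mechanism, using scikit-learn's MLPRegressor as a stand-in for a deep network (the architecture and synthetic data are arbitrary illustrations):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

# Deep ensemble: M networks, identical architecture, different random inits.
M = 5
ensemble = [
    MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=3000,
                 random_state=m).fit(X, y)
    for m in range(M)
]

# Predictive mean = ensemble average; spread = epistemic-uncertainty proxy.
X_test = np.linspace(-4, 4, 9).reshape(-1, 1)
preds = np.stack([net.predict(X_test) for net in ensemble])
print("mean:", preds.mean(axis=0).round(2))
print("sd:  ", preds.std(axis=0).round(2))  # typically grows outside the training range
```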

Example 16 Hierarchical BMA on 8-schools (discharges Topic 27 Rem 26)

Apply Topic 27’s BMA to a hierarchical model space. Candidate models: M0\mathcal{M}_0 (complete pool, τ2=0\tau^2 = 0), M1\mathcal{M}_1 (partial pool, half-Cauchy hyperprior), M2\mathcal{M}_2 (partial pool with school-level covariates, as in Ex 11). Compute marginal likelihoods via bridge sampling (Topic 27 §27.6) and posterior model probabilities via Bayes’ rule. On 8-schools, M1\mathcal{M}_1 typically wins with posterior probability 0.8\sim 0.8 — supporting partial pooling but not the more complex M2\mathcal{M}_2. Predictions for a new school are weighted averages of each model’s posterior predictive. This closes Topic 27 §27.10 Rem 26’s forward-promise of “proper BMA treatment” for hierarchical models.
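The model-averaging arithmetic fits in a few lines. In this sketch the log marginal likelihoods are placeholder numbers standing in for bridge-sampling output, not results from a real run:

```python
import numpy as np

# Illustrative log marginal likelihoods for M0 (complete pool), M1 (partial
# pool), M2 (covariates). Placeholder values; in practice they come from the
# bridge-sampling estimator of Topic 27 SS27.6.
log_ml = np.array([-32.5, -30.9, -31.8])
log_prior = np.log(np.ones(3) / 3)        # uniform prior over models

# Posterior model probabilities via Bayes' rule, stabilized by log-sum-exp.
log_post = log_ml + log_prior
log_post -= log_post.max()
post = np.exp(log_post)
post /= post.sum()
print(dict(zip(["M0", "M1", "M2"], post.round(3))))

# BMA predictive for a new school = mixture of per-model predictives:
#   p(y_new | y) = sum_m post[m] * p(y_new | y, M_m)
```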

Track 7 summary diagram. Four nodes horizontally arranged — Topic 25 Foundations, Topic 26 Computation, Topic 27 Comparison, Topic 28 Hierarchy — connected by forward arrows. Each node has a thematic badge underneath: conjugate priors and Bayes rule, MCMC and HMC, marginal likelihoods and BMA, partial pooling and empirical Bayes. The 8-schools dataset appears as a recurring motif across Topics 26, 27, and 28.

Figure 10. Track 7 in one picture: four topics that progressively extend the Bayesian toolkit from conjugate families (25) through MCMC (26) and model comparison (27) to hierarchical structures (28). The 8-schools dataset serves as the recurring motif across the second half of the track.

Remark 23 Horseshoe and continuous shrinkage (formalml)

The hierarchical-Normal prior of §28.2 Def 2 is the “one-scale” shrinkage prior: every group borrows strength at the same global rate τ\tau. Continuous shrinkage priors generalize this by giving each coordinate its own local scale. The horseshoe (Carvalho–Polson–Scott 2010) uses a half-Cauchy global scale τ\tau times half-Cauchy local scales λk\lambda_k: θkλk,τN(0,λk2τ2)\theta_k \mid \lambda_k, \tau \sim \mathcal{N}(0, \lambda_k^2 \tau^2), λkHalf-Cauchy(0,1)\lambda_k \sim \text{Half-Cauchy}(0, 1), τHalf-Cauchy(0,τ0)\tau \sim \text{Half-Cauchy}(0, \tau_0). The heavy-tailed local scales let irrelevant coordinates shrink to zero hard (like lasso) while relevant ones are barely shrunk (unlike ridge) — adaptive sparsity at the prior level. Variants: the regularized horseshoe (Piironen–Vehtari 2017) tames the tails; R2-D2 (Zhang et al. 2020) uses a Dirichlet decomposition of the global variance. Full development, including the connection to spike-and-slab priors and the sampling strategies that make the horseshoe HMC-friendly, at formalML: sparse-bayesian-priors. (Topic 25 Rem 23, Topic 26 Rem 25, and Topic 27’s forward-map all named Topic 28 as the horseshoe venue; this remark discharges those pointers.)
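A quick generative sanity check makes the adaptive-sparsity claim visible. This numpy sketch draws once from the horseshoe prior (with the illustrative setting τ0 = 1) and inspects the tails:

```python
import numpy as np

rng = np.random.default_rng(0)
K, tau0 = 1000, 1.0

# Horseshoe generative draw: half-Cauchy global and local scales,
# then theta_k ~ N(0, (lam_k * tau)^2).
tau = np.abs(tau0 * rng.standard_cauchy())
lam = np.abs(rng.standard_cauchy(K))
theta = rng.normal(0.0, lam * tau)

# The hallmark shape: most coordinates near zero, a few very large.
q = np.quantile(np.abs(theta), [0.5, 0.9, 0.99])
print("|theta| quantiles: median %.2f, 90%% %.2f, 99%% %.2f" % tuple(q))
```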

Remark 24 Variational inference and Bayesian neural networks (formalml)

MCMC scales poorly as the parameter space grows into the millions (neural-network weights). Variational inference trades integration for optimization: find a tractable approximate posterior q(θ)q(\theta) that minimizes KL(qp)\text{KL}(q \| p). Mean-field VI factors qq coordinatewise; structured VI allows richer factorizations (tree, chain, low-rank). Stochastic VI (Hoffman et al. 2013) uses noisy gradients for large datasets; black-box VI (Ranganath et al. 2014) removes the need for model-specific derivations. For neural-network parameters, VI underlies modern BNNs alongside MC-dropout, deep ensembles, and SG-MCMC. Details at formalML: variational-inference and formalML: bayesian-neural-networks.
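The optimization view fits in a toy example. This numpy sketch (synthetic one-observation model, manual gradients) fits a Gaussian q by stochastic gradient ascent on the ELBO with the reparameterization trick, and checks against the exact conjugate posterior:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model with a known answer: theta ~ N(0, 1), y | theta ~ N(theta, sigma2).
y, sigma2 = 2.0, 0.5
post_var = 1.0 / (1.0 / sigma2 + 1.0)     # exact posterior variance
post_mean = post_var * y / sigma2         # exact posterior mean

# Mean-field VI: fit q(theta) = N(m, s^2) by stochastic gradient ascent on
# the ELBO, using the reparameterization theta = m + s * eps, eps ~ N(0, 1).
m, log_s, lr = 0.0, 0.0, 0.02
for _ in range(4000):
    s = np.exp(log_s)
    eps = rng.normal(size=64)             # small batch tames gradient noise
    theta = m + s * eps
    dlogp = (y - theta) / sigma2 - theta  # d/dtheta log p(y, theta)
    m += lr * dlogp.mean()                # pathwise gradient wrt m
    log_s += lr * ((dlogp * eps).mean() * s + 1.0)  # +1.0 = entropy gradient

print(f"VI:    mean {m:.3f}, var {np.exp(2 * log_s):.3f}")
print(f"exact: mean {post_mean:.3f}, var {post_var:.3f}")
```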

Remark 25 Bayesian nonparametrics (formalml)

When the number of groups KK is itself unknown and should be inferred from the data (clustering with an unknown number of clusters, mixture models of unknown order), Bayesian nonparametrics provides the right framework. The Dirichlet process (Ferguson 1973) places a prior on probability distributions whose draws are almost surely discrete with countably many atoms. The Chinese restaurant process is the induced marginal over cluster assignments. Hierarchical DPs (Teh et al. 2006) extend this to nested clustering: groups share clusters, tasks within groups share parameters, and the whole structure is learned. Full treatment at formalML: bayesian-nonparametrics.
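For a feel of the Chinese restaurant process's “rich get richer” clustering, here is a self-contained numpy sketch (the concentration α = 2 is an arbitrary illustrative choice):

```python
import numpy as np

def crp(n, alpha, rng):
    """Sample cluster assignments for n customers from a CRP(alpha)."""
    counts = []        # number of customers at each existing table
    assignments = []
    for i in range(n):
        # Join existing table j with prob counts[j] / (i + alpha),
        # or open a new table with prob alpha / (i + alpha).
        probs = np.array(counts + [alpha], dtype=float) / (i + alpha)
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):
            counts.append(1)
        else:
            counts[table] += 1
        assignments.append(table)
    return np.array(assignments), counts

rng = np.random.default_rng(0)
z, counts = crp(200, alpha=2.0, rng=rng)
print("number of clusters:", len(counts))      # grows like alpha * log n
print("largest clusters:", sorted(counts, reverse=True)[:5])
```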

Track 7 is complete. Topic 25 gave us the Bayesian update mechanism; Topic 26 let us sample any posterior via MCMC; Topic 27 taught us to compare and average across models; Topic 28 extended all three to hierarchical structures — the setting where the prior itself has a prior, where partial pooling emerges, and where Stein’s paradox is reinterpreted as empirical Bayes. The forward-pointers above lead to the full modern Bayesian ML toolkit: sparse priors, variational inference, Bayesian neural networks, nonparametrics, meta-learning, and hierarchical deep ensembles.

The 8-schools example that ran through §§28.1/28.4/28.6/28.9 was chosen deliberately: 45 years after Rubin’s paper, it remains the canonical test case for every hierarchical Bayesian tool — a reminder that the computational and conceptual problems of small-KK hierarchies were never really about 8-schools, but about how to share information across groups when information is scarce.


References

  1. Charles Stein. (1956). Inadmissibility of the Usual Estimator for the Mean of a Multivariate Normal Distribution. Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, 1, 197–206.
  2. Willard James & Charles Stein. (1961). Estimation with Quadratic Loss. Proceedings of the Fourth Berkeley Symposium, 1, 361–379.
  3. Bradley Efron & Carl Morris. (1973). Stein’s Estimation Rule and Its Competitors — an Empirical Bayes Approach. Journal of the American Statistical Association, 68(341), 117–130.
  4. Dennis V. Lindley & Adrian F. M. Smith. (1972). Bayes Estimates for the Linear Model. Journal of the Royal Statistical Society: Series B, 34(1), 1–41.
  5. Nan M. Laird & James H. Ware. (1982). Random-Effects Models for Longitudinal Data. Biometrics, 38(4), 963–974.
  6. Andrew Gelman. (2006). Prior Distributions for Variance Parameters in Hierarchical Models. Bayesian Analysis, 1(3), 515–534.
  7. David A. Harville. (1977). Maximum Likelihood Approaches to Variance Component Estimation and to Related Problems. Journal of the American Statistical Association, 72(358), 320–338.
  8. William J. Browne & David Draper. (2006). A Comparison of Bayesian and Likelihood-Based Methods for Fitting Multilevel Models. Bayesian Analysis, 1(3), 473–514.
  9. Donald B. Rubin. (1981). Estimation in Parallel Randomized Experiments. Journal of Educational Statistics, 6(4), 377–401.
  10. Michael Betancourt & Mark Girolami. (2015). Hamiltonian Monte Carlo for Hierarchical Models. In Current Trends in Bayesian Methodology with Applications (Chapman & Hall/CRC), 79–101.
  11. Radford M. Neal. (2011). MCMC Using Hamiltonian Dynamics. In Handbook of Markov Chain Monte Carlo (Chapman & Hall/CRC), 113–162.
  12. Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari & Donald B. Rubin. (2013). Bayesian Data Analysis (3rd ed.). CRC Press.
  13. Erich L. Lehmann & George Casella. (1998). Theory of Point Estimation (2nd ed.). Springer.
  14. Bas J. K. Kleijn & Aad W. van der Vaart. (2012). The Bernstein–Von Mises Theorem under Misspecification. Electronic Journal of Statistics, 6, 354–381.