
Bayesian Foundations & Prior Selection

Track 7 opens with the parallel Bayesian formalism: treat θ as a random variable with a prior π(θ), update to a posterior p(θ | y) via Bayes' theorem, and produce every downstream quantity — point estimates, credible intervals, predictions — by integrating against the posterior. The featured result, Bernstein–von Mises, proves that under regularity the posterior concentrates on 𝒩(θ̂_MLE, 𝓘⁻¹/n) in total variation: Bayesian and frequentist inference agree asymptotically.

25.1 Why Bayesian?

Topics 13–24 treated the unknown parameter $\theta$ as a fixed constant: the MLE is a point; confidence intervals are random sets that cover $\theta$ with a specified long-run frequency; hypothesis tests reject or retain fixed null hypotheses. That framework has been extraordinarily productive — every result from point estimation through model selection fits within it — but it answers questions about procedures, not about $\theta$. A frequentist 95% confidence interval is a guarantee about the interval-constructing procedure, not a probability statement about the parameter.

The Bayesian framework flips the setup. Treat $\theta$ itself as a random variable, specify a prior distribution $\pi(\theta)$ encoding what we know before observing data, then apply Bayes' theorem to obtain a posterior distribution $p(\theta \mid \mathbf{y})$ that is a probability statement about $\theta$ given the data. Every downstream quantity — point estimates, interval estimates, predictions, model comparisons — follows by integration against the posterior. The price is that we must specify a prior; the return is that every question about $\theta$ admits a direct probability answer.

Four-panel figure showing a Beta(2, 2) prior (blue) and posterior densities (purple) after observing 1, 5, 20, and 100 Bernoulli trials with true θ = 0.3. The prior is broad; the posterior sharpens and shifts toward θ = 0.3 as n grows, with the prior's influence visibly diminishing.
Remark 1 Distribution over θ vs point estimator

A frequentist answer to “what is $\theta$?” is a single number (the MLE) plus an interval around it justified by long-run coverage. A Bayesian answer is a density: the full posterior $p(\theta \mid \mathbf{y})$, which assigns probabilities to every neighborhood of parameter space. Point estimates (posterior mean, posterior median, MAP) and interval estimates (credible intervals, HPD intervals) are summaries of the posterior, not substitutes for it.

Remark 2 Subjective vs objective Bayes — scope note

The philosophical debate between subjective Bayes (LIN2014 — priors as genuine personal beliefs) and objective Bayes (JEF1961 — priors as reference or non-informative distributions that minimize subjective input) is out of scope for Topic 25. We treat priors as substantive modeling choices whose sensitivity we examine empirically (§25.7), and we note where the two schools diverge as we go.

Remark 3 What's in vs what's out

Topic 25 covers: Bayes' theorem for $\theta$ (§25.2), Track 7 notation (§25.3), exponential-family conjugacy with full proof (§25.4), five canonical conjugate pairs with Beta-Binomial in full (§25.5), credible intervals and posterior predictive (§25.6), prior selection including Jeffreys (§25.7), Bernstein–von Mises with a sketch proof (§25.8), multivariate Normal-Inverse-Wishart as a pointer (§25.9), and the forward map to Track 7 (§25.10). It does not cover: MCMC algorithms (Topic 26), Bayes factors or BMA beyond naming (Topic 27), hierarchical and empirical Bayes (Topic 28), variational inference or Bayesian nonparametrics (formalml). Each deferral gets a §25.10 pointer.

25.2 Bayes’ theorem for a parameter

The computational engine is Topic 4 Thm 4 (Bayes' theorem for events), lifted from a two-event statement $P(A \mid B) = P(B \mid A)\,P(A)/P(B)$ to a statement about densities on a continuous parameter space.

Definition 1 Prior, likelihood, posterior, marginal likelihood, posterior predictive

Let $\theta \in \Theta$ be an unknown parameter and $\mathbf{y} = (y_1, \ldots, y_n)$ an observed sample. The Bayesian framework attaches five densities to the inference problem:

  • Prior $\pi(\theta)$: distribution on $\Theta$ encoding information about $\theta$ before observing data.
  • Likelihood $L(\theta; \mathbf{y}) = p(\mathbf{y} \mid \theta)$: the sampling density, viewed as a function of $\theta$ with $\mathbf{y}$ fixed.
  • Posterior $p(\theta \mid \mathbf{y})$: distribution on $\Theta$ conditional on the observed data.
  • Marginal likelihood $m(\mathbf{y}) = \int_\Theta L(\theta; \mathbf{y})\,\pi(\theta)\,d\theta$: the prior-predictive density of $\mathbf{y}$, also called the evidence.
  • Posterior predictive $p(\tilde y \mid \mathbf{y}) = \int_\Theta p(\tilde y \mid \theta)\,p(\theta \mid \mathbf{y})\,d\theta$: distribution of a future observation $\tilde y$ integrating out posterior uncertainty.
Theorem 1 Bayes' theorem for a parameter (stated)

Under the measure-theoretic conditions of Topic 4 Thm 4 (absolute continuity of the joint distribution with respect to a product measure on $\Theta \times \mathcal{Y}^n$),

$$p(\theta \mid \mathbf{y}) \;=\; \frac{L(\theta; \mathbf{y})\,\pi(\theta)}{m(\mathbf{y})},$$

where $m(\mathbf{y}) = \int_\Theta L(\theta; \mathbf{y})\,\pi(\theta)\,d\theta$ is the marginal likelihood. Equivalently, in unnormalized form,

$$p(\theta \mid \mathbf{y}) \;\propto\; L(\theta; \mathbf{y})\,\pi(\theta).$$

The proof is a direct application of Topic 4 Thm 4 to the joint density $p(\theta, \mathbf{y}) = \pi(\theta)\,L(\theta; \mathbf{y})$; the computation is exactly the event version with $P$ replaced by density integrals.

Example 1 Coin toss — Beta(1, 1) prior plus 3 heads in 4 tosses

Let $Y \sim \mathrm{Binomial}(n = 4, \theta)$ and place a uniform prior $\theta \sim \mathrm{Beta}(1, 1)$ (flat on $(0, 1)$). Observe $y = 3$. Then:

$$L(\theta; y = 3) \;=\; \binom{4}{3} \theta^3 (1-\theta)^1 \;\propto\; \theta^3 (1-\theta),$$

and the unnormalized posterior is

$$p(\theta \mid y = 3) \;\propto\; \theta^3 (1-\theta) \cdot 1 \;=\; \theta^3 (1-\theta),$$

which is the kernel of $\mathrm{Beta}(4, 2)$. Normalizing, $p(\theta \mid y = 3) = \mathrm{Beta}(\theta; 4, 2)$, with posterior mean $4/6 = 2/3$. The data shifted the posterior mean from $1/2$ (prior) to $2/3$ — and shrank the posterior SD from $\sqrt{1/12} \approx 0.29$ to $\sqrt{4 \cdot 2/(6^2 \cdot 7)} \approx 0.178$.
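
A minimal numerical check of this update, assuming Python with scipy available (not part of the original text):

```python
# Example 1 in code: Beta(1, 1) prior + 3 heads in 4 tosses -> Beta(4, 2) posterior.
from scipy import stats

prior = stats.beta(1, 1)                    # flat prior on (0, 1)
posterior = stats.beta(1 + 3, 1 + 4 - 3)    # conjugate update: Beta(4, 2)

print(prior.mean(), prior.std())            # 0.5, ~0.2887 = sqrt(1/12)
print(posterior.mean(), posterior.std())    # ~0.6667 = 2/3, ~0.1782
```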

Three-panel figure showing the Bayes update for Ex 1. Left panel: Beta(1, 1) prior, flat on (0, 1). Middle panel: Binomial likelihood θ³(1−θ) peaked at θ = 0.75. Right panel: Beta(4, 2) posterior peaked at θ ≈ 0.75 with narrower spread. Annotation 'prior × likelihood ∝ posterior' between the panels.
Remark 4 Proportionality shorthand

In practice one rarely computes the marginal likelihood $m(\mathbf{y})$ directly — the proportionality $p(\theta \mid \mathbf{y}) \propto L(\theta)\,\pi(\theta)$ together with the constraint $\int p(\theta \mid \mathbf{y})\,d\theta = 1$ determines the posterior uniquely. For conjugate pairs (§25.5) the identification of the posterior kernel with a known density family sidesteps the integral entirely. For non-conjugate cases, MCMC (Topic 26) sidesteps it via ratio-based acceptance.

[Interactive demo: Bayes update for a Beta-Binomial model. Example state: $\mathrm{Beta}(2, 2)$ prior (pseudo-sample-size 4, prior mean 0.5) with 7 successes in 20 trials gives posterior $\mathrm{Beta}(9, 15)$ — posterior mean 0.375, MAP 0.3636, 95% equal-tailed CrI $[0.197, 0.573]$, 95% HPD $[0.192, 0.565]$. The HPD interval is the shortest 95% credible interval; it coincides with the equal-tailed interval only for symmetric posteriors.]

25.3 Track 7 notation conventions

The notation locked in this section propagates to Topics 26, 27, and 28. Every symbol is introduced explicitly; the rationale for the Bernardo–Smith / Robert house style over the Gelman overloading is given below the table.

Object | Symbol | Interpretation
Prior density | $\pi(\theta)$ | Distribution over $\theta$ before observing data. Positive, integrable (or improper — §25.7 Def 7).
Posterior density | $p(\theta \mid \mathbf{y})$ | Distribution over $\theta$ after observing $\mathbf{y} = (y_1, \ldots, y_n)$.
Sampling density | $p(\mathbf{y} \mid \theta)$ or $L(\theta; \mathbf{y})$ | Likelihood — joint density of $\mathbf{y}$ given $\theta$, viewed as a function of $\theta$.
Marginal likelihood | $m(\mathbf{y}) = \int p(\mathbf{y} \mid \theta)\,\pi(\theta)\,d\theta$ | Normalizing constant; also called evidence or prior predictive.
Posterior predictive | $p(\tilde y \mid \mathbf{y}) = \int p(\tilde y \mid \theta)\,p(\theta \mid \mathbf{y})\,d\theta$ | Distribution of a future observation $\tilde y$ integrating out posterior uncertainty.
KL divergence | $D_{\text{KL}}(\pi \Vert \pi')$ | Relative entropy of $\pi$ w.r.t. $\pi'$.
Bayes factor | $\text{BF}_{10} = m(\mathbf{y} \mid H_1)/m(\mathbf{y} \mid H_0)$ | Posterior-odds update factor. Topic 25 mentions only; Topic 27 develops.
MAP estimate | $\hat\theta_{\text{MAP}} = \arg\max p(\theta \mid \mathbf{y})$ | Mode of the posterior.
Posterior mean | $\hat\theta_{\text{PM}} = \mathbb{E}[\theta \mid \mathbf{y}]$ | Bayes estimator under squared-error loss (§25.6 Rem 9).
Posterior median | $\hat\theta_{\text{med}}$ | Bayes estimator under absolute-error loss.
$(1-\alpha)$ credible set | $C$ with $\int_C p(\theta \mid \mathbf{y})\,d\theta = 1 - \alpha$ | Analog of the frequentist CI in the Bayesian framework.
HPD interval | $C^{\text{HPD}}$, the set of $\theta$ with $p(\theta \mid \mathbf{y}) \ge c_\alpha$ | Highest-posterior-density interval — the shortest $(1-\alpha)$ credible set.

We use $\pi$ for the prior (Bernardo–Smith / Robert house style) because it is visually distinct from the sampling $p$ and the posterior $p(\cdot \mid \mathbf{y})$ — the alternative Gelman-style overloading $p(\theta)$ forces context-based disambiguation in every formula. We write $p(\tilde y \mid \mathbf{y})$ for the posterior predictive, using tilde-for-new-observation rather than $y_{\text{new}}$ subscripts, because the tilde renders cleanly in KaTeX. We write $m(\mathbf{y})$ for the marginal likelihood because the natural alternative $p(\mathbf{y})$ collides with the sampling density viewed as a function of $\mathbf{y}$. Bayesian and frequentist estimators are distinguished by subscripts throughout: $\hat\theta_{\text{MLE}}$ (Topic 14), $\hat\theta_{\text{MAP}}$, $\hat\theta_{\text{PM}}$, $\hat\theta_{\text{med}}$.

25.4 Exponential-family conjugacy

The five conjugate pairs of §25.5 (Beta-Binomial, Normal-Normal, Normal-Normal-IG, Gamma-Poisson, Dirichlet-Multinomial) all emerge from a single construction on exponential families. The theorem below generalizes Topic 7 Thm 3: what was algebra there — close the exponential family under multiplication by a particular kernel — becomes Bayesian inference here, with the updated kernel as the posterior.

Definition 2 Conjugate family

A family of distributions $\{\pi(\theta \mid \chi, \nu) : \chi \in \mathbb{R}, \nu > 0\}$ is conjugate to a likelihood $\{f(y; \theta) : \theta \in \Theta\}$ if, for every prior $\pi(\theta \mid \chi_0, \nu_0)$ in the family and every sample $\mathbf{y}$, the posterior $p(\theta \mid \mathbf{y})$ is also in the family — only the hyperparameters update.

Theorem 2 Exponential-family conjugacy

Let $\{f(y; \eta) : \eta \in H\}$ be a one-parameter exponential family in canonical form,

$$f(y; \eta) \;=\; h(y) \exp\!\big(\eta T(y) - A(\eta)\big),$$

with natural parameter $\eta \in H \subseteq \mathbb{R}$, sufficient statistic $T(y)$, base measure $h(y)$, and log-partition $A(\eta)$. Define the conjugate-prior family indexed by $(\chi_0, \nu_0) \in \mathbb{R} \times (0, \infty)$ by

$$\pi(\eta \mid \chi_0, \nu_0) \;=\; K(\chi_0, \nu_0) \exp\!\big(\chi_0 \eta - \nu_0 A(\eta)\big),$$

where $K(\chi_0, \nu_0)$ is the normalizing constant (finite whenever the integral on $H$ converges). For $n$ iid observations $\mathbf{y} = (y_1, \ldots, y_n)$ with sufficient-statistic total $S = \sum_{i=1}^n T(y_i)$, the posterior is

$$p(\eta \mid \mathbf{y}) \;=\; \pi(\eta \mid \chi_0 + S, \; \nu_0 + n).$$

The hyperparameter $\nu_0$ is the pseudo-sample-size: the prior is worth $\nu_0$ equivalent observations before any data arrive. $\chi_0$ is the pseudo-sufficient-statistic total.

Proof 1 Proof of Thm 2 (exponential-family conjugacy)

Step 1 — setup. Let $f(y; \eta) = h(y) \exp(\eta T(y) - A(\eta))$ as stated, with natural parameter $\eta \in H$ and sufficient statistic $T(y)$.

Step 2 — proposed conjugate prior. Define

$$\pi(\eta \mid \chi_0, \nu_0) \;=\; K(\chi_0, \nu_0) \exp(\chi_0 \eta - \nu_0 A(\eta)),$$

with hyperparameters $\chi_0 \in \mathbb{R}$, $\nu_0 > 0$, and normalizing constant $K(\chi_0, \nu_0)$ chosen so that $\int_H \pi(\eta)\,d\eta = 1$. Whenever this integral is finite, $\pi$ is a proper prior; §25.7 Def 7 handles the improper case.

Step 3 — apply Bayes' theorem. Given $n$ iid observations $y_1, \ldots, y_n$, the joint likelihood is

$$L(\eta) \;=\; \prod_{i=1}^n f(y_i; \eta) \;=\; \left[\prod_i h(y_i)\right] \exp\!\left(\eta \sum_i T(y_i) - n A(\eta)\right).$$

Write $S = \sum_{i=1}^n T(y_i)$ for the sufficient-statistic sum. Then, dropping the constant $\prod_i h(y_i)$ that does not depend on $\eta$,

$$p(\eta \mid \mathbf{y}) \;\propto\; L(\eta)\,\pi(\eta) \;\propto\; \exp(\eta S - n A(\eta)) \exp(\chi_0 \eta - \nu_0 A(\eta)).$$

Step 4 — collect exponents. Combining the two exponentials,

$$p(\eta \mid \mathbf{y}) \;\propto\; \exp\!\big((\chi_0 + S)\eta - (\nu_0 + n) A(\eta)\big).$$

Step 5 — identify the posterior kernel. This is the kernel of the same conjugate family with updated hyperparameters $\chi_0' = \chi_0 + S$, $\nu_0' = \nu_0 + n$. Since the normalizing constant of the prior family depends only on the kernel shape, we conclude

$$p(\eta \mid \mathbf{y}) \;=\; \pi(\eta \mid \chi_0 + S, \; \nu_0 + n).$$

Step 6 — interpretation of the update rule. The hyperparameter $\nu_0$ acts as a pseudo-sample-size: the prior is worth $\nu_0$ equivalent observations before any data arrive. $\chi_0$ is the pseudo-sufficient-statistic total. After observing $n$ real data points with sufficient-statistic sum $S$, the effective sample size grows to $\nu_0 + n$ and the effective sufficient-statistic total to $\chi_0 + S$.

∎ — using Topic 4 Thm 4 (Bayes' theorem) and Topic 7 §7.7 Thm 3 (exp-family form).

Three-panel figure showing the same conjugate-family update rule χ₀' = χ₀ + S, ν₀' = ν₀ + n applied to Beta-Binomial, Gamma-Poisson, and Normal-Normal. Each panel shows the prior (blue) and posterior (purple) densities with the updated hyperparameters annotated. The shared kernel-shift structure is visually apparent.
Remark 5 Topic 7 Thm 3 as the special case

Topic 7 §7.7 Thm 3 constructed the conjugate-prior family for an exponential likelihood — the algebra of Proof 1 Step 4 but stripped of inferential framing. Thm 2 adds the inferential layer: the updated hyperparameters are now the posterior, and every downstream Bayesian quantity (credible interval, posterior mean, posterior predictive) follows from this one identification. Where Topic 7 asked “what family of priors is closed under multiplication by this likelihood?” Topic 25 answers “what is the posterior given this prior and these data?” — but they are literally the same calculation.

Remark 6 Pseudo-observations interpretation

The hyperparameter pair $(\chi_0, \nu_0)$ is interpretable as a pseudo-dataset of $\nu_0$ observations whose sufficient-statistic total is $\chi_0$. Under this reading, the prior encodes prior information by committing to the equivalent of $\nu_0$ imaginary observations, and Bayes' theorem literally concatenates them with the real data. Informative priors have large $\nu_0$; weakly informative priors have $\nu_0 \sim 1$; non-informative improper priors correspond to the formal limit $\nu_0 \to 0$ (§25.7). This pseudo-observation framing is the clearest way to calibrate how much prior information to commit.
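
A sketch of Thm 2's bookkeeping in Python (function and variable names are illustrative, not from the text): the update is literally two additions. The Bernoulli instance below uses the logit parameterization, in which a $\mathrm{Beta}(\alpha_0, \beta_0)$ prior corresponds to $(\chi_0, \nu_0) = (\alpha_0, \alpha_0 + \beta_0)$, so $\nu_0$ is exactly the pseudo-sample-size $\alpha_0 + \beta_0$ quoted throughout §25.5.

```python
# Hypothetical helper: conjugate updating is addition on (chi, nu).
def conjugate_update(chi0, nu0, suff_stats):
    """Posterior hyperparameters (chi0 + S, nu0 + n) after n iid observations."""
    return chi0 + sum(suff_stats), nu0 + len(suff_stats)

# Bernoulli instance: T(y) = y, and Beta(a0, b0) <-> (chi, nu) = (a0, a0 + b0).
a0, b0 = 2, 2
data = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]          # k = 3 successes in n = 10 trials
chi_n, nu_n = conjugate_update(a0, a0 + b0, data)
a_n, b_n = chi_n, nu_n - chi_n                  # back to Beta coordinates
print(a_n, b_n)                                 # Beta(5, 9) = Beta(a0 + k, b0 + n - k)
```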

25.5 The canonical conjugate pairs

Thm 2 is abstract. Making it concrete means specializing to the five canonical conjugate pairs used throughout applied Bayesian statistics. Beta-Binomial is the first and simplest, with a full end-to-end proof. The other four pairs are worked as examples with citation back to the algebra Topic 7 already established.

Theorem 3 Beta-Binomial posterior

Let $Y \sim \mathrm{Binomial}(n, \theta)$ and place a $\mathrm{Beta}(\alpha_0, \beta_0)$ prior on $\theta \in (0, 1)$. After observing $k$ successes in $n$ trials,

$$\theta \mid Y = k \;\sim\; \mathrm{Beta}(\alpha_0 + k, \; \beta_0 + n - k).$$

The posterior mean is

$$\mathbb{E}[\theta \mid k] \;=\; \frac{\alpha_0 + k}{\alpha_0 + \beta_0 + n} \;=\; w \cdot \frac{\alpha_0}{\alpha_0 + \beta_0} \;+\; (1-w) \cdot \frac{k}{n},$$

a convex combination of the prior mean and the sample proportion, with weight $w = (\alpha_0 + \beta_0)/(\alpha_0 + \beta_0 + n)$ proportional to the pseudo-sample-size.

Proof 2 Proof of Thm 3 (Beta-Binomial posterior)

Step 1 — setup. Observe $k$ successes in $n$ independent Bernoulli($\theta$) trials; equivalently, observe $S_n = k$ from $S_n \sim \mathrm{Binomial}(n, \theta)$. Place a $\mathrm{Beta}(\alpha_0, \beta_0)$ prior on $\theta \in (0, 1)$.

Step 2 — likelihood and prior. The Binomial likelihood, viewed as a function of $\theta$, is

$$L(\theta) \;=\; \binom{n}{k} \theta^k (1-\theta)^{n-k} \;\propto\; \theta^k (1-\theta)^{n-k}.$$

The Beta prior density is

$$\pi(\theta) \;=\; \frac{1}{B(\alpha_0, \beta_0)} \theta^{\alpha_0 - 1} (1-\theta)^{\beta_0 - 1} \;\propto\; \theta^{\alpha_0 - 1} (1-\theta)^{\beta_0 - 1}.$$

Step 3 — apply Bayes. The posterior is proportional to likelihood × prior:

$$p(\theta \mid k) \;\propto\; \theta^k (1-\theta)^{n-k} \cdot \theta^{\alpha_0 - 1} (1-\theta)^{\beta_0 - 1} \;=\; \theta^{(\alpha_0 + k) - 1} (1-\theta)^{(\beta_0 + n - k) - 1}.$$

Step 4 — identify. This is the kernel of $\mathrm{Beta}(\alpha_0 + k, \beta_0 + n - k)$. Since the posterior must integrate to 1, the normalizing constant is $1/B(\alpha_0 + k, \beta_0 + n - k)$, and

$$p(\theta \mid k) \;=\; \mathrm{Beta}(\alpha_0 + k, \; \beta_0 + n - k).$$

Step 5 — posterior moments. The posterior mean is

$$\mathbb{E}[\theta \mid k] \;=\; \frac{\alpha_0 + k}{\alpha_0 + \beta_0 + n} \;=\; w \cdot \frac{\alpha_0}{\alpha_0 + \beta_0} \;+\; (1 - w) \cdot \frac{k}{n},$$

with $w = (\alpha_0 + \beta_0)/(\alpha_0 + \beta_0 + n)$ — a convex combination of the prior mean $\alpha_0/(\alpha_0 + \beta_0)$ and the sample proportion $k/n$, with weights proportional to the pseudo-sample-size $\alpha_0 + \beta_0$ and the real sample size $n$.

∎ — using Topic 4 Thm 4 (Bayes' theorem) and Topic 6 §6.6 Thm 12 (Beta moments).

Example 2 Normal-Normal (known σ²)

Let $Y_i \mid \mu \overset{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)$ with $\sigma^2$ known, and prior $\mu \sim \mathcal{N}(\mu_0, \sigma_0^2)$. The posterior on $\mu$ given the sample mean $\bar y$ and sample size $n$ is $\mathcal{N}(\mu_n, \sigma_n^2)$ with

$$\frac{1}{\sigma_n^2} \;=\; \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}, \qquad \mu_n \;=\; \sigma_n^2 \left(\frac{\mu_0}{\sigma_0^2} + \frac{n \bar y}{\sigma^2}\right).$$

The posterior precision is the sum of prior precision and data precision — the canonical precision-weighted averaging formula. The posterior mean is the precision-weighted average of the prior mean and the sample mean. The derivation mirrors Proof 2 but with Gaussian kernels; see Topic 7 §7.7 for the exp-family algebra in the location-family case.
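
A minimal sketch of this precision-weighted update (numpy assumed; the helper name is hypothetical):

```python
import numpy as np

def normal_known_var_posterior(mu0, var0, ybar, n, var):
    """Posterior N(mu_n, var_n) for a Normal mean with known sampling variance."""
    prec_n = 1.0 / var0 + n / var                   # precisions add
    var_n = 1.0 / prec_n
    mu_n = var_n * (mu0 / var0 + n * ybar / var)    # precision-weighted average
    return mu_n, var_n

rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=1.0, size=20)
mu_n, var_n = normal_known_var_posterior(0.0, 1.0, y.mean(), len(y), 1.0)
print(mu_n, var_n)    # ybar shrunk toward the prior mean 0; var_n = 1/21
```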

Example 3 Normal-Normal-Inverse-Gamma (σ² unknown)

When $\sigma^2$ is unknown, place a joint Normal-Inverse-Gamma prior

$$\mu \mid \sigma^2 \sim \mathcal{N}(\mu_0, \sigma^2/\kappa_0), \qquad \sigma^2 \sim \mathrm{InvGamma}(\alpha_0, \beta_0).$$

After observing $n$ iid data with sample mean $\bar y$ and sample variance $s^2$, the posterior is Normal-Inverse-Gamma with updated hyperparameters (GEL2013 §3.3 Eqs 3.3–3.5):

$$\kappa_n = \kappa_0 + n, \quad \mu_n = \frac{\kappa_0 \mu_0 + n \bar y}{\kappa_n}, \quad \alpha_n = \alpha_0 + \tfrac{n}{2},$$

$$\beta_n = \beta_0 + \tfrac{n s^2}{2} + \tfrac{\kappa_0 n (\bar y - \mu_0)^2}{2 \kappa_n}.$$

The marginal posterior on $\mu$ is a non-standardized Student-t with $2\alpha_n$ degrees of freedom, location $\mu_n$, and scale $\sqrt{\beta_n/(\alpha_n \kappa_n)}$ — heavier tails than Normal, because the uncertainty in $\sigma^2$ widens the marginal on $\mu$.
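
A hedged sketch of this update and the Student-t marginal (numpy/scipy assumed; here $s^2$ uses the $1/n$ convention so that $n s^2$ is the centered sum of squares):

```python
import numpy as np
from scipy import stats

def nig_update(mu0, kappa0, alpha0, beta0, y):
    """Posterior Normal-Inverse-Gamma hyperparameters after iid Normal data y."""
    n, ybar = len(y), np.mean(y)
    s2 = np.var(y)                       # 1/n convention: n * s2 = sum (y - ybar)^2
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * ybar) / kappa_n
    alpha_n = alpha0 + n / 2
    beta_n = beta0 + n * s2 / 2 + kappa0 * n * (ybar - mu0) ** 2 / (2 * kappa_n)
    return mu_n, kappa_n, alpha_n, beta_n

rng = np.random.default_rng(1)
y = rng.normal(2.0, 1.5, size=30)
mu_n, kappa_n, alpha_n, beta_n = nig_update(0.0, 1.0, 2.0, 2.0, y)

# Marginal on mu: Student-t(2 alpha_n), location mu_n, scale sqrt(beta_n/(alpha_n kappa_n)).
marg = stats.t(df=2 * alpha_n, loc=mu_n, scale=np.sqrt(beta_n / (alpha_n * kappa_n)))
print(marg.interval(0.95))               # 95% equal-tailed CrI for mu
```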

Example 4 Gamma-Poisson

Let $Y_i \mid \lambda \overset{\text{iid}}{\sim} \mathrm{Poisson}(\lambda)$ and prior $\lambda \sim \mathrm{Gamma}(\alpha_0, \beta_0)$ (shape-rate). Observing $S = \sum_{i=1}^n y_i$, the posterior is

$$\lambda \mid \mathbf{y} \;\sim\; \mathrm{Gamma}(\alpha_0 + S, \; \beta_0 + n),$$

with posterior mean $(\alpha_0 + S)/(\beta_0 + n)$. The algebra reduces to the exp-family update rule of Thm 2 with $T(y) = y$, $A(\eta) = e^\eta$ (Poisson canonical form). See Topic 7 §7.7 Ex 3 for the matching derivation.
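
In code (scipy assumed; the data scenario here, prior $\mathrm{Gamma}(2, 1)$ with $S = 15$ over $n = 5$ counts, is an assumed one chosen to reproduce the $\mathrm{Gamma}(17, 6)$ posterior quoted in Example 6 below):

```python
from scipy import stats

alpha0, beta0 = 2.0, 1.0
counts = [3, 1, 4, 2, 5]                        # S = 15, n = 5
alpha_n, beta_n = alpha0 + sum(counts), beta0 + len(counts)

posterior = stats.gamma(a=alpha_n, scale=1.0 / beta_n)   # shape-rate -> scipy's scale
print(alpha_n / beta_n)                         # posterior mean 17/6 ~ 2.833
print(posterior.interval(0.95))                 # ~ (1.65, 4.33)
```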

Example 5 Dirichlet-Multinomial

Let $(Y_1, \ldots, Y_K) \sim \mathrm{Multinomial}(n, \mathbf{p})$ with $\sum_k p_k = 1$, and prior $\mathbf{p} \sim \mathrm{Dirichlet}(\boldsymbol\alpha_0)$ for $\boldsymbol\alpha_0 = (\alpha_{0,1}, \ldots, \alpha_{0,K})$. Observing counts $\mathbf{x} = (x_1, \ldots, x_K)$ with $\sum_k x_k = n$, the posterior is

$$\mathbf{p} \mid \mathbf{x} \;\sim\; \mathrm{Dirichlet}(\boldsymbol\alpha_0 + \mathbf{x}),$$

the natural generalization of Beta-Binomial to $K > 2$ categories. Topic 8 §8 Thm 6 works this as pure algebra; Topic 25 adds the inferential layer (credible sets on the simplex, marginal posteriors on individual $p_k$).

Five-panel figure showing prior-and-posterior density pairs for each of the five canonical conjugate pairs: Beta-Bernoulli, Normal-Normal known σ², Normal-Normal-Inverse-Gamma joint posterior, Gamma-Poisson, and Dirichlet(1,1,1) → Dirichlet(13,11,9) on the 2-simplex. Each panel is annotated with the prior and posterior hyperparameters.
[Static demo panel: prior $\pi(\theta) = \mathrm{Beta}(\alpha_0 = 2, \beta_0 = 2)$; likelihood $L(\theta) \propto \theta^k (1-\theta)^{n-k}$ with $k = 10$, $n = 50$; posterior $p(\theta \mid \mathbf{y}) = \mathrm{Beta}(12, 42)$ with mean $(\alpha_0 + k)/(\alpha_0 + \beta_0 + n) = 0.2222$ and 95% CrI $[0.123, 0.341]$; pseudo-sample-size grows from $\alpha_0 + \beta_0 = 4$ to $54$ after the data.]
Remark 7 All five pairs as instances of Thm 2

Every pair in §25.5 is a specialization of the exponential-family conjugacy theorem. Beta-Binomial: $T(y) = y$, $A(\eta) = \log(1 + e^\eta)$ (logit-parameterized Bernoulli). Normal-Normal known $\sigma^2$: $T(y) = y$, $A(\eta) = \eta^2/2$ (location-only Gaussian). Gamma-Poisson: $T(y) = y$, $A(\eta) = e^\eta$. Dirichlet-Multinomial: the multi-parameter extension where $T(\mathbf{y}) = \mathbf{y}$ is a count vector. Normal-Normal-IG is the two-parameter case with $(\mu, \sigma^2)$ jointly — the algebra is more involved but the conjugacy logic is identical.

Remark 8 When conjugacy breaks

Conjugate priors exist only for exponential-family likelihoods. Non-exponential-family models — mixtures, hierarchical models beyond Topic 28, neural networks — have no closed-form conjugate priors. In these cases, the posterior is known only up to the intractable normalizing constant $m(\mathbf{y})$, and one must resort to approximation: MCMC (Topic 26) for exact asymptotic sampling, variational inference (formalml) for fast approximate posteriors, or the Laplace approximation (§25.8 Rem 16) for local Gaussian approximations at the MAP.


25.6 Credible intervals and posterior predictive

The posterior $p(\theta \mid \mathbf{y})$ is the full Bayesian answer, but practical work demands summaries. Credible intervals are the Bayesian analog of confidence intervals — interval-valued summaries with posterior-probability content. Point estimators (posterior mean, median, MAP) are scalar summaries, each minimizing a different Bayes risk. The posterior predictive closes the loop: given the posterior on $\theta$, what is the distribution of a future $\tilde y$?

Definition 5 Credible interval — equal-tailed and HPD

For a posterior $p(\theta \mid \mathbf{y})$ and level $1 - \alpha$, a set $C \subseteq \Theta$ is a $(1 - \alpha)$ credible set if $\int_C p(\theta \mid \mathbf{y})\,d\theta = 1 - \alpha$. Two canonical choices:

  • Equal-tailed credible interval: $C = [q_{\alpha/2}, q_{1 - \alpha/2}]$ where $q_p$ is the posterior $p$-quantile.
  • Highest-posterior-density (HPD) interval: $C^{\text{HPD}} = \{\theta : p(\theta \mid \mathbf{y}) \ge c_\alpha\}$ for the largest $c_\alpha$ such that $C^{\text{HPD}}$ has posterior mass $1 - \alpha$. The HPD is the shortest $(1 - \alpha)$ credible set.

For symmetric unimodal posteriors the two coincide. For skewed posteriors the HPD is narrower and shifted toward the mode.

Definition 6 Posterior point estimators

The three standard Bayesian point estimators of $\theta$ are:

  • Posterior mean: $\hat\theta_{\text{PM}} = \mathbb{E}[\theta \mid \mathbf{y}] = \int \theta\,p(\theta \mid \mathbf{y})\,d\theta$.
  • Posterior median: $\hat\theta_{\text{med}}$, the $0.5$-quantile of $p(\theta \mid \mathbf{y})$.
  • MAP estimate: $\hat\theta_{\text{MAP}} = \arg\max_\theta p(\theta \mid \mathbf{y})$.

Each is Bayes-optimal under a different loss: the posterior mean under squared-error loss, the posterior median under absolute-error loss, the MAP under 0-1 (indicator) loss — as $\epsilon \to 0$ in $L(\hat\theta, \theta) = \mathbf{1}\{|\hat\theta - \theta| > \epsilon\}$.
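
A minimal grid-based sketch computing both interval types (Def 5) and all three point summaries (Def 6) for a Beta posterior, assuming scipy; the HPD construction here is the density-threshold definition applied on a grid:

```python
import numpy as np
from scipy import stats

post = stats.beta(12, 42)                    # running posterior from §25.5

eq_tail = post.ppf([0.025, 0.975])           # equal-tailed 95% CrI via quantiles

# HPD on a grid: keep highest-density points until 95% of the mass is covered.
theta = np.linspace(1e-6, 1 - 1e-6, 200_001)
dens = post.pdf(theta)
order = np.argsort(dens)[::-1]               # indices sorted by density, descending
mass = np.cumsum(dens[order]) * (theta[1] - theta[0])
kept = order[: np.searchsorted(mass, 0.95) + 1]
hpd = (theta[kept].min(), theta[kept].max())

mean, median = post.mean(), post.median()
map_est = theta[np.argmax(dens)]             # mode = (a-1)/(a+b-2) for a, b > 1
print(eq_tail, hpd, mean, median, map_est)
```

For this right-skewed posterior the HPD lands slightly left of the equal-tailed interval, toward the mode, as Def 5 predicts.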

Theorem 4 Beta-Binomial posterior predictive (Thm 3b)

Under the Beta-Binomial setup of Thm 3 with posterior $\theta \mid \mathbf{y} \sim \mathrm{Beta}(\alpha^\star, \beta^\star)$, where $\alpha^\star = \alpha_0 + k$ and $\beta^\star = \beta_0 + n - k$, the posterior-predictive distribution of a future count $\tilde y \in \{0, 1, \ldots, m\}$ from $m$ new independent Bernoulli($\theta$) trials is the Beta-Binomial compound:

$$p(\tilde y \mid \mathbf{y}) \;=\; \binom{m}{\tilde y} \frac{B(\tilde y + \alpha^\star, \; m - \tilde y + \beta^\star)}{B(\alpha^\star, \beta^\star)}.$$

The variance of $\tilde y$ under the posterior predictive strictly exceeds the variance under the plug-in $\mathrm{Binomial}(m, \hat\theta)$ — the Bayesian predictive distribution inflates uncertainty to reflect posterior uncertainty in $\theta$.

Proof 3 Proof of Thm 4 (Beta-Binomial posterior predictive)

Step 1 — setup. Having observed $k$ successes in $n$ trials with posterior $\theta \mid \mathbf{y} \sim \mathrm{Beta}(\alpha_0 + k, \beta_0 + n - k)$, the posterior predictive is the distribution of a future count $\tilde y \in \{0, 1, \ldots, m\}$ from $m$ new independent Bernoulli($\theta$) trials.

Step 2 — definition. The posterior-predictive PMF is the marginal of $\tilde y$ under the joint distribution $(\theta, \tilde y) \sim p(\theta \mid \mathbf{y}) \cdot p(\tilde y \mid \theta)$, obtained by integrating out $\theta$:

$$p(\tilde y \mid \mathbf{y}) \;=\; \int_0^1 p(\tilde y \mid \theta)\,p(\theta \mid \mathbf{y})\,d\theta.$$

Step 3 — substitute the likelihood and posterior. With $p(\tilde y \mid \theta) = \binom{m}{\tilde y} \theta^{\tilde y} (1 - \theta)^{m - \tilde y}$ and $p(\theta \mid \mathbf{y}) = \theta^{\alpha^\star - 1} (1 - \theta)^{\beta^\star - 1}/B(\alpha^\star, \beta^\star)$ where $\alpha^\star = \alpha_0 + k$, $\beta^\star = \beta_0 + n - k$,

$$p(\tilde y \mid \mathbf{y}) \;=\; \binom{m}{\tilde y} \frac{1}{B(\alpha^\star, \beta^\star)} \int_0^1 \theta^{\tilde y + \alpha^\star - 1} (1 - \theta)^{m - \tilde y + \beta^\star - 1}\,d\theta.$$

Step 4 — identify the integral as a Beta function. The integrand is the kernel of $\mathrm{Beta}(\tilde y + \alpha^\star, m - \tilde y + \beta^\star)$. Its integral over $(0, 1)$ is $B(\tilde y + \alpha^\star, m - \tilde y + \beta^\star)$, yielding

$$p(\tilde y \mid \mathbf{y}) \;=\; \binom{m}{\tilde y} \frac{B(\tilde y + \alpha^\star, \; m - \tilde y + \beta^\star)}{B(\alpha^\star, \beta^\star)}.$$

Step 5 — wider than plug-in. This is the Beta-Binomial compound with parameters $(m, \alpha^\star, \beta^\star)$. The posterior predictive is wider than the $\mathrm{Binomial}(m, \hat\theta)$ plug-in approximation, reflecting the posterior uncertainty in $\theta$ — a structural feature of Bayesian prediction that plug-in point-estimate predictions miss. Specifically, by the law of total variance,

$$\mathrm{Var}[\tilde y \mid \mathbf{y}] \;=\; \mathbb{E}[\mathrm{Var}(\tilde y \mid \theta) \mid \mathbf{y}] + \mathrm{Var}[\mathbb{E}(\tilde y \mid \theta) \mid \mathbf{y}],$$

where the first term is the average plug-in variance and the second is strictly positive whenever the posterior on $\theta$ has nonzero spread.

∎ — using Topic 4 §4.3 law of total probability and Topic 6 §6.6 Def 3 (Beta function).
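
A numerical check of the variance inflation, assuming scipy (whose `betabinom` implements exactly this compound):

```python
from scipy import stats

a, b, m = 12, 42, 20                   # posterior Beta(a, b); m future trials

pred = stats.betabinom(m, a, b)        # Beta-Binomial posterior predictive
plug = stats.binom(m, a / (a + b))     # plug-in Binomial at the posterior mean

print(pred.mean(), plug.mean())        # identical means: m * a / (a + b)
print(pred.var(), plug.var())          # predictive variance strictly larger
```

The two means agree; only the variances differ, and the gap is the second (posterior-spread) term in the law of total variance above.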

Example 6 95% credible intervals for all five conjugate pairs

Under the standard data scenario used in §25.5 (the posterior hyperparameters from each conjugate-pair example), equal-tailed 95% credible intervals are:

  • Beta-Binomial with $\alpha^\star = 12$, $\beta^\star = 42$: $[0.123, 0.341]$ (via the Beta quantile function).
  • Normal-Normal with $\mu_n = 0.988$, $\sigma_n^2 = 0.049$: $\mu_n \pm 1.96\,\sigma_n = [0.554, 1.422]$.
  • Gamma-Poisson with $\alpha^\star = 17$, $\beta^\star = 6$: approximately $[1.65, 4.33]$.
  • Normal-Normal-IG with $2\alpha_n$ df Student-t marginal on $\mu$: $\mu_n \pm t_{0.975,\,2\alpha_n}\sqrt{\beta_n/(\alpha_n \kappa_n)}$, wider than the known-variance Normal interval when the df is small.
  • Dirichlet-Multinomial on individual $p_k$: the marginal is Beta with hyperparameters $(\alpha_k, \sum_{j \ne k} \alpha_j)$.

The HPD interval is slightly narrower and shifted toward the posterior mode in every skewed case (Beta-Binomial for $\alpha^\star \ne \beta^\star$, Gamma-Poisson, asymmetric Dirichlet marginals).

Two-panel figure. Left: Beta(5, 20) posterior density with a shaded 95% equal-tailed credible interval and a hatched HPD overlay — the HPD is narrower and shifted toward the mode. Right: flat-prior Normal-Normal posterior on μ overlaid with a z-confidence interval; the two 95% shaded regions coincide exactly, discharging Topic 19 §19.1 Rem 3.

Two-panel figure. Left: Beta-Binomial posterior-predictive PMF for m = 20 new trials with α* = 5, β* = 20 shown as purple bars; Binomial(m = 20, p̂ = 5/25) plug-in PMF shown as grey bars. The Beta-Binomial bars have visibly heavier tails. Right: variance comparison bar chart showing Beta-Binomial variance > Binomial plug-in variance.
Remark 9 Posterior mean as Bayes estimator

Under squared-error loss $L(\hat\theta, \theta) = (\hat\theta - \theta)^2$, the Bayes estimator — the one minimizing posterior expected loss $\mathbb{E}[L \mid \mathbf{y}]$ — is exactly the posterior mean. This is the Bayesian reading of the bias-variance decomposition: the posterior-mean point estimator is optimal in the $L^2$ sense against the posterior, the same way the sample mean is $L^2$-optimal against the empirical distribution. See ROB2007 §2 for the decision-theoretic development.

Remark 10 HPD vs equal-tailed for skewed posteriors

For a symmetric unimodal posterior the HPD and equal-tailed intervals coincide. For skewed posteriors (e.g., $\mathrm{Beta}(2, 20)$ with pronounced right skew), the equal-tailed interval allocates equal tail mass $\alpha/2$ to each side, including a long right tail of low-density values; the HPD instead follows the density and is strictly narrower. The tradeoff: equal-tailed is invariant under monotone reparameterization; HPD is not.

Remark 11 Posterior predictive for Normal-Normal is Student-t

For Normal-Normal with known $\sigma^2$, the posterior predictive for a future $\tilde y$ is $\mathcal{N}(\mu_n, \sigma_n^2 + \sigma^2)$ — the posterior variance plus the likelihood variance, again wider than the plug-in $\mathcal{N}(\hat\mu, \sigma^2)$. For Normal-Normal-Inverse-Gamma (unknown $\sigma^2$), the predictive is a non-standardized Student-t with $2\alpha_n$ df — heavier tails still, because the uncertainty in $\sigma^2$ further inflates predictive variance. See Topic 6 §6.7 for the Student-t distribution.

25.7 Prior selection

The prior is a modeling choice, not a fact. Every Bayesian analysis commits to some $\pi(\theta)$, and different priors yield different posteriors. This section treats prior selection systematically: three classes (informative, weakly informative, non-informative), Jeffreys priors as a reparameterization-invariant reference construction, improper priors and the integrable-posterior condition, and prior sensitivity as an empirical diagnostic.

Remark 12 Three classes of priors

Priors divide roughly into three informational tiers:

  • Informative: encode substantive prior knowledge. Example: a pharmacologist's prior on a drug's dose-response parameter based on ten years of related trials. Hyperparameters reflect that knowledge (small posterior variance when knowledge is strong).
  • Weakly informative: encode loose, regularization-style constraints. The canonical example is $\mathrm{Cauchy}(0, 2.5)$ on logistic-regression coefficients (Gelman et al. 2008) — excludes implausible values like $|\beta| > 100$ without committing to any specific scale.
  • Non-informative: minimize prior influence. Jeffreys (§25.7 Thm 5), reference priors (Bernardo 1979 — deferred), flat priors (often improper).

The pragmatic recommendation (GEL2013 Ch. 2): use weakly informative priors by default; use informative priors when genuine prior knowledge exists; use non-informative priors for benchmarking.

Theorem 5 Jeffreys prior and reparameterization invariance

For a one-parameter regular family with Fisher information $\mathcal{I}(\theta)$, the Jeffreys prior is

$$\pi_J(\theta) \;\propto\; \sqrt{\mathcal{I}(\theta)}.$$

The Jeffreys prior is reparameterization-invariant: under a smooth monotone reparameterization $\phi = g(\theta)$ with Jacobian $|d\theta/d\phi|$, $\pi_J(\phi) = \pi_J(\theta)\,|d\theta/d\phi|$ — the transformed Jeffreys prior equals the Jeffreys prior computed directly in the $\phi$-parameterization. This invariance is the principled argument for the Jeffreys construction as a reference prior: any other “non-informative” choice (e.g., $\pi(\theta) = \text{const}$) depends on the parameterization.

Derivation (Bernoulli). For $Y \sim \mathrm{Bernoulli}(\theta)$, $\mathcal{I}(\theta) = 1/(\theta(1-\theta))$, so $\pi_J(\theta) \propto \theta^{-1/2}(1-\theta)^{-1/2}$, which is the $\mathrm{Beta}(1/2, 1/2)$ density up to normalization.

Derivation (Normal scale). For $Y \sim \mathcal{N}(\mu_0, \sigma^2)$ with $\mu_0$ known and $\sigma > 0$ unknown, $\mathcal{I}(\sigma) = 2/\sigma^2$, so $\pi_J(\sigma) \propto 1/\sigma$ — the classical log-flat scale prior. (Improper; see Def 7.)

The invariance proof uses the Jacobian identity for Fisher information: $\mathcal{I}(\phi) = \mathcal{I}(\theta)\,(d\theta/d\phi)^2$, so $\sqrt{\mathcal{I}(\phi)} = \sqrt{\mathcal{I}(\theta)}\,|d\theta/d\phi|$. See JEF1961 Ch. III for the full invariance argument.
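
A numerical check of the invariance claim for the Bernoulli model, assuming numpy (a sketch, not part of the original text): push $\pi_J(\theta)$ into logit coordinates via the Jacobian and compare with $\sqrt{\mathcal{I}(\phi)}$ computed directly.

```python
import numpy as np

phi = np.linspace(-4, 4, 9)                 # logit-scale grid
theta = 1 / (1 + np.exp(-phi))              # inverse logit

# Route 1: transform pi_J(theta) with the Jacobian |dtheta/dphi| = theta(1-theta).
pi_theta = 1 / np.sqrt(theta * (1 - theta))
route1 = pi_theta * theta * (1 - theta)

# Route 2: Fisher information directly in phi-coordinates. For the Bernoulli
# natural parameter, I(phi) = theta(1-theta), so pi_J(phi) = sqrt(theta(1-theta)).
route2 = np.sqrt(theta * (1 - theta))

print(np.allclose(route1, route2))          # True: the two routes coincide
```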

Two-panel figure. Left: Beta(½, ½) density on (0, 1), the Bernoulli Jeffreys prior π(θ) ∝ 1/√(θ(1−θ)) — U-shaped with integrable singularities at 0 and 1. Right: reparameterization check in logit coordinates φ = logit(θ), showing that π_J(φ) computed directly equals π_J(θ)|dθ/dφ| computed from the original Jeffreys prior — the two curves coincide exactly.
Example 7 Prior sensitivity: three priors, one dataset

Observe 10 successes in 50 Bernoulli trials. Compare three posteriors:

  • Informative $\mathrm{Beta}(2, 8)$ peaked at $0.2$: posterior $\mathrm{Beta}(12, 48)$, mean $0.2$.
  • Weakly informative $\mathrm{Beta}(2, 2)$ centered at $0.5$: posterior $\mathrm{Beta}(12, 42)$, mean $\approx 0.222$.
  • Non-informative Jeffreys $\mathrm{Beta}(1/2, 1/2)$: posterior $\mathrm{Beta}(10.5, 40.5)$, mean $\approx 0.206$.

At $n = 50$ all three posterior means lie in $[0.20, 0.23]$ — the data dominates the informative prior's pull toward $0.2$. At $n = 10$, the three posterior means are far apart ($\sim 0.21$, $\sim 0.23$, $\sim 0.25$); at $n = 1000$ they are numerically identical. The interpolation is exactly what Bernstein–von Mises (§25.8) predicts.
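
The same sweep in code (a sketch; a fixed 20% success rate is assumed at every $n$):

```python
priors = {"informative": (2, 8), "weak": (2, 2), "jeffreys": (0.5, 0.5)}

for n in [10, 50, 1000]:
    k = n // 5                                   # 20% successes at each sample size
    means = {name: (a + k) / (a + b + n) for name, (a, b) in priors.items()}
    print(n, {name: round(m, 4) for name, m in means.items()})
# n = 10: the means visibly spread; n = 1000: they essentially coincide near 0.2.
```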

Three-panel figure showing three posterior densities for the same data (10 heads in 50 tosses): informative Beta(2, 8) prior yields Beta(12, 48) posterior (amber), weakly informative Beta(2, 2) prior yields Beta(12, 42) posterior (violet), Jeffreys Beta(½, ½) prior yields Beta(10.5, 40.5) posterior (blue). All three posteriors are concentrated in roughly the same region around 0.2 with small pairwise differences; a vertical reference line marks the MLE at θ̂ = 0.2.
[Interactive demo: three priors, one dataset (10 successes in 50 trials). Informative (mean 0.2): posterior mean 0.2000, CrI $[0.110, 0.309]$. Weakly informative: posterior mean 0.2222, CrI $[0.123, 0.341]$. Jeffreys $\mathrm{Beta}(1/2, 1/2)$ (fixed reference): posterior mean 0.2059, CrI $[0.108, 0.326]$. Sensitivity magnitude $\max_\theta |p_i(\theta \mid \mathbf{y}) - p_j(\theta \mid \mathbf{y})|$ across the three posterior pairs = 2.1054; sliding $n$ to 1000 drives this toward zero — the posteriors align as Bernstein–von Mises kicks in and the prior's contribution becomes negligible.]

Definition 7 Improper prior and integrable-posterior condition

A prior $\pi(\theta)$ is improper if $\int_\Theta \pi(\theta)\,d\theta = \infty$ — i.e., it is not a valid probability density. Examples: $\pi(\mu) = \text{const}$ on $\mathbb{R}$; $\pi(\sigma) \propto 1/\sigma$ on $(0, \infty)$ (the Normal-scale Jeffreys prior).

Bayesian inference with an improper prior is valid if and only if the posterior is proper:

$$\int_\Theta L(\theta)\,\pi(\theta)\,d\theta \;<\; \infty.$$

When this integrable-posterior condition holds, the posterior $p(\theta \mid \mathbf{y}) \propto L(\theta)\,\pi(\theta)$ is a well-defined probability density and all downstream quantities (credible intervals, posterior moments, posterior predictive) are well-defined. When it fails, no Bayesian inference is possible.

Remark 13 Stone–Dawid–Zidek paradox

Improper priors can produce paradoxes when naively combined. Stone (1970) and later Dawid, Stone, and Zidek (1973) showed that improper priors which individually yield proper posteriors can still produce marginalization paradoxes on nested parameter spaces: the marginal posterior for a shared parameter derived one way disagrees with the marginal posterior derived another way. The resolution requires careful measure-theoretic handling of the improper-prior limit. Topic 25 names the paradox once; full treatment is deferred to Topic 27 or formalml.

Remark 14 Reference priors (Bernardo)

Bernardo (1979) proposed reference priors as a generalization of Jeffreys that maximizes asymptotic expected Kullback–Leibler divergence between prior and posterior — i.e., the prior is chosen to be as “non-informative” as possible in an information-theoretic sense. For one-parameter regular families, reference priors coincide with Jeffreys priors; for multi-parameter problems they differ. Full development is deferred to Topic 27 or formalml.

Remark 15 Prior elicitation as substantive modeling

In applied work, priors are rarely chosen from theoretical principles alone. Prior elicitation — the systematic conversion of expert knowledge into prior hyperparameters — is a substantive modeling activity (ROB2007 §3; GEL2013 §2.9). Practical approaches: ask experts for quantile estimates and fit a prior that matches; use historical data from related problems; elicit pseudo-sample-size directly. The sensitivity analysis of Ex 7 is the diagnostic: if the inference is sensitive to the prior, the prior deserves more attention; if not, the data has spoken.

25.8 Bernstein–von Mises: the bridge to frequentism

This is the featured section. Under regularity conditions identical to those required for MLE asymptotic normality (Topic 14 Thm 4), the posterior itself is asymptotically Normal — centered at the MLE, with covariance equal to the inverse Fisher information scaled by $1/n$. Concretely: Bayesian and frequentist inference converge. A 95% credible interval and a 95% Wald confidence interval become numerically identical in the large-sample limit. The prior's role diminishes at rate $1/\sqrt{n}$; priors stop mattering once there is enough data.

Theorem 6 Bernstein–von Mises

Let $Y_1, \ldots, Y_n \overset{\text{iid}}{\sim} f(\cdot; \theta_0)$ with $\theta_0$ in the interior of $\Theta \subseteq \mathbb{R}$, a regular parametric family with Fisher information $\mathcal{I}(\theta_0) \in (0, \infty)$. Let $\hat\theta_n$ be the MLE. For any proper prior $\pi$ continuous and positive at $\theta_0$,

$$\sup_{B \in \mathcal{B}(\mathbb{R})} \left| \Pr_{\theta \sim p(\cdot \mid \mathbf{y})}\!\left(\sqrt{n}(\theta - \hat\theta_n) \in B\right) - \Pr_{Z \sim \mathcal{N}(0, \mathcal{I}(\theta_0)^{-1})}(Z \in B) \right| \;\xrightarrow{P}\; 0.$$

That is, the rescaled posterior converges in total variation to $\mathcal{N}(0, \mathcal{I}(\theta_0)^{-1})$ in probability under the true distribution $\Pr_{\theta_0}$.

Four-panel figure for n = 5, 25, 100, 500. Beta(2, 2) prior, true θ₀ = 0.3. Each panel shows the posterior density (purple) overlaid with the Normal-at-MLE approximation 𝒩(θ̂_MLE, 𝓘⁻¹/n) (black dashed), with vertical reference lines at θ₀ (green dotted) and θ̂_MLE (orange solid). At n=5 the two curves disagree noticeably (posterior narrower near boundaries). By n=500 they are visually indistinguishable — BvM's total-variation convergence.
Proof 4 Proof of Thm 6 (Bernstein–von Mises, sketch)

Step 1 — statement. As above. We show pointwise convergence of densities, then upgrade to total variation via VDV1998 §10.2 Thm 10.1.

Step 2 — Taylor-expand the log-posterior at $\hat\theta_n$. Write $\ell_n(\theta) = \sum_i \log f(Y_i; \theta)$. The log-posterior is

$$\log p(\theta \mid \mathbf{y}) \;=\; \ell_n(\theta) + \log\pi(\theta) + \text{const}.$$

By Taylor's theorem, for $\theta$ near $\hat\theta_n$,

$$\ell_n(\theta) \;=\; \ell_n(\hat\theta_n) + \ell_n'(\hat\theta_n)(\theta - \hat\theta_n) + \tfrac{1}{2}\ell_n''(\tilde\theta)(\theta - \hat\theta_n)^2$$

for some intermediate $\tilde\theta$. The first-derivative term vanishes because $\hat\theta_n$ is the MLE ($\ell_n'(\hat\theta_n) = 0$). The second derivative satisfies $-\ell_n''(\tilde\theta)/n \to \mathcal{I}(\theta_0)$ in probability by Topic 14 Thm 4's argument.

Step 3 — reparameterize. Let $u = \sqrt{n}(\theta - \hat\theta_n)$. Then $\theta = \hat\theta_n + u/\sqrt{n}$, and

$$\ell_n(\theta) - \ell_n(\hat\theta_n) \;=\; -\tfrac{1}{2}\mathcal{I}(\theta_0) u^2 + o_P(1)$$

uniformly on compact $u$-sets. This is the local asymptotic normality (LAN) condition; the full proof is VDV1998 §7.2 Lemma 7.6.

Step 4 — prior smoothness. Since $\pi$ is continuous and positive at $\theta_0$, $\log\pi(\theta)$ is bounded on a neighborhood of $\theta_0$, and

$$\log\pi(\hat\theta_n + u/\sqrt n) \;=\; \log\pi(\theta_0) + o_P(1)$$

uniformly on compact $u$-sets (by consistency of $\hat\theta_n$, which holds by Topic 14 Thm 3).

Step 5 — assemble the posterior density. In the $u$-parameterization,

$$p(u \mid \mathbf{y}) \;\propto\; \exp\!\left(\ell_n(\hat\theta_n + u/\sqrt n) - \ell_n(\hat\theta_n)\right) \pi(\hat\theta_n + u/\sqrt n) \;\propto\; \exp\!\left(-\tfrac{1}{2}\mathcal{I}(\theta_0) u^2\right)(1 + o_P(1)).$$

This is the kernel of $\mathcal{N}(0, \mathcal{I}(\theta_0)^{-1})$.

Step 6 — extend to total-variation convergence. Step 5 establishes pointwise (at each $u$) convergence of densities up to normalization. The total-variation upgrade requires showing (a) the posterior puts asymptotically negligible mass outside a shrinking neighborhood of $\hat\theta_n$ (posterior concentration), and (b) uniform integrability of the posterior density. Both follow from the prior's positivity at $\theta_0$ plus the regular-family tail bounds. Full argument: VDV1998 §10.2 Thm 10.1.

Step 7 — interpretation. The posterior forgets the prior at rate $1/\sqrt{n}$: the prior's contribution to the posterior density becomes vanishingly small relative to the likelihood's contribution. Bayesian and frequentist inference converge — credible intervals $\to$ Wald intervals in TV distance.

∎ (sketch) — using Topic 14 §14.5 Thm 4 (MLE asymptotic normality), Topic 11 §11.3 Thm 1 (CLT), and van der Vaart 1998 §10.2 Thm 10.1 (full TV upgrade).
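
A grid-based check of the convergence for the Bernoulli model, assuming numpy/scipy (the $\mathrm{Beta}(2, 2)$ prior and $\hat\theta \approx 0.36$ mirror the demo below; the rest of the scenario is assumed):

```python
import numpy as np
from scipy import stats

def tv_distance(n, k, a0=2.0, b0=2.0):
    """TV distance between the Beta posterior and N(theta_hat, I^{-1}/n)."""
    theta = np.linspace(1e-4, 1 - 1e-4, 100_000)
    dtheta = theta[1] - theta[0]
    post = stats.beta(a0 + k, b0 + n - k).pdf(theta)
    mle = k / n
    approx = stats.norm(mle, np.sqrt(mle * (1 - mle) / n)).pdf(theta)
    return 0.5 * np.sum(np.abs(post - approx)) * dtheta

for n in [5, 25, 100, 500]:
    print(n, round(tv_distance(n, round(0.36 * n)), 4))   # shrinks roughly like 1/sqrt(n)
```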

[Interactive demo: Beta-Binomial posterior (purple) vs the Normal-at-MLE approximation $\mathcal{N}(\hat\theta, \hat{\mathcal{I}}^{-1}/n)$ (black dashed), with reference lines at $\theta_0$ and $\hat\theta$. Example state: $n = 25$, $k = 9$; $\hat\theta_{\text{MLE}} = 0.36$; observed $\mathcal{I}(\hat\theta) = 4.3403$; posterior mean 0.3654, SD 0.0927, 95% CrI $[0.195, 0.555]$; Wald 95% CI $[0.172, 0.548]$; TV distance 0.0230. BvM's prediction: as $n \to \infty$ the posterior collapses onto the Normal curve — at $n = 500$ the two are visually indistinguishable and the TV distance drops below 0.01.]

Example 8 MAP = penalized MLE — discharging Topic 14 Rem 9 and Topic 23 Thm 5

The MAP estimate maximizes the log-posterior:

$$\hat\theta_{\text{MAP}} \;=\; \arg\max_\theta \big[\ell(\theta) + \log\pi(\theta)\big].$$

When $\pi(\theta) = \mathcal{N}(0, \tau^2)$ is Gaussian, $\log\pi(\theta) = -\theta^2/(2\tau^2) + \text{const}$, so $\hat\theta_{\text{MAP}}$ maximizes $\ell(\theta) - \theta^2/(2\tau^2)$ — exactly ridge regression with $\lambda = \sigma^2/\tau^2$. When $\pi(\theta) = \mathrm{Laplace}(0, b)$, $\log\pi(\theta) = -|\theta|/b + \text{const}$, so $\hat\theta_{\text{MAP}}$ maximizes $\ell(\theta) - |\theta|/b$ — exactly the lasso with $\lambda = \sigma^2/b$. This discharges Topic 14 §14.11 Rem 9 (the MAP/regularized-MLE correspondence preview) and Topic 23 §23.7 Thm 5 (the MAP-penalization formal correspondence). The arc closes here.

Under BvM, $\hat\theta_{\text{MAP}}$ and $\hat\theta_{\text{MLE}}$ converge at rate $1/\sqrt{n}$ — the regularization vanishes asymptotically, confirming that ridge and lasso are prior-informed procedures whose regularization is a commitment that data eventually overrides.
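
A numeric check of the Gaussian-prior case, assuming scipy (values illustrative): the MAP found by optimizing $\ell(\theta) + \log\pi(\theta)$ matches the closed-form ridge estimate with $\lambda = \sigma^2/\tau^2$.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
sigma2, tau2 = 1.0, 0.25
y = rng.normal(1.0, np.sqrt(sigma2), size=4)

# MAP: minimize the negative log-posterior for a Normal mean with N(0, tau2) prior.
neg_log_post = lambda t: np.sum((y - t) ** 2) / (2 * sigma2) + t**2 / (2 * tau2)
map_est = minimize_scalar(neg_log_post).x

# Ridge: closed form for argmin of sum (y_i - t)^2 + lam * t^2, lam = sigma2/tau2.
lam = sigma2 / tau2
ridge = y.sum() / (len(y) + lam)

print(map_est, ridge)    # agree up to optimizer tolerance
```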

Two-panel figure. Left: Gaussian prior 𝒩(0, 0.25) (thin blue) + Normal likelihood centered at x̄=1 (amber) for n=4, σ²=1 → posterior (purple) with MAP vertical line coinciding exactly with the ridge-estimate vertical line. Right: Laplace(0, 0.5) prior + same likelihood → MAP vertical line coincides with lasso estimate (at a sparser solution).
Remark 16 Laplace approximation to the posterior and to m(y)

The same machinery that produces BvM yields a Gaussian approximation to the posterior at any finite $n$ — the Laplace approximation. Taylor-expanding $\log p(\theta, \mathbf{y})$ to second order at $\hat\theta_{\text{MAP}}$ and exponentiating gives $p(\theta \mid \mathbf{y}) \approx \mathcal{N}(\hat\theta_{\text{MAP}}, H^{-1})$ where $H = -\nabla^2 \log p(\theta, \mathbf{y})\big|_{\theta = \hat\theta_{\text{MAP}}}$. The same expansion approximates the marginal likelihood:

$$\log m(\mathbf{y}) \;\approx\; \log p(\mathbf{y} \mid \hat\theta_{\text{MAP}}) + \log\pi(\hat\theta_{\text{MAP}}) + \tfrac{k}{2}\log(2\pi) - \tfrac{1}{2}\log|H|.$$

This generalizes Topic 24 §24.4 Proof 2: BIC drops the prior term and approximates $\log|H| \approx k \log n$, yielding $\text{BIC} = -2\log p(\mathbf{y} \mid \hat\theta_{\text{MAP}}) + k \log n + O(1)$. The Laplace approximation is the full version — Topic 27 returns to it for Bayes-factor computation.
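
For the Beta-Binomial model the marginal likelihood is available in closed form, so the Laplace approximation can be checked directly (a sketch, scipy assumed):

```python
import numpy as np
from scipy.special import betaln, comb

a0, b0, n, k = 2.0, 2.0, 50, 10

# Exact: m(y) = C(n, k) * B(a0 + k, b0 + n - k) / B(a0, b0).
exact = np.log(comb(n, k)) + betaln(a0 + k, b0 + n - k) - betaln(a0, b0)

# Laplace at the posterior mode theta_hat = (a - 1)/(a + b - 2) of Beta(a, b).
a, b = a0 + k, b0 + n - k
th = (a - 1) / (a + b - 2)
loglik = np.log(comb(n, k)) + k * np.log(th) + (n - k) * np.log(1 - th)
logprior = (a0 - 1) * np.log(th) + (b0 - 1) * np.log(1 - th) - betaln(a0, b0)
H = (a - 1) / th**2 + (b - 1) / (1 - th) ** 2    # -d^2/dtheta^2 log p(theta, y)
laplace = loglik + logprior + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(H)

print(exact, laplace)    # close at n = 50; the gap shrinks as n grows
```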

Remark 17 Flat-prior coincidence — discharging Topic 19 §19.1 Rem 3

For the Normal-mean problem with known $\sigma^2$ and improper flat prior $\pi(\mu) \propto 1$, the posterior is exactly $\mathcal{N}(\bar y, \sigma^2/n)$ — and the 95% credible interval $\bar y \pm 1.96\,\sigma/\sqrt{n}$ is identical at every finite $n$ to the frequentist z-CI. This is not a BvM asymptotic result but a finite-sample coincidence: the flat prior makes the posterior coincide with the sampling distribution of the MLE. This discharges Topic 19 §19.1 Rem 3 and §19.10 Rem 21: for the Normal mean under a flat prior, Bayesian and frequentist intervals coincide, but the interpretations differ — the Bayesian reads it as “posterior probability $0.95$ that $\mu$ lies in the interval” while the frequentist reads it as “95% of such constructed intervals would cover the true $\mu$.”

Remark 18 When BvM fails

BvM's regularity conditions are those of MLE asymptotic normality plus positivity and continuity of the prior at $\theta_0$. When these fail, BvM can fail:

  • Heavy-tailed priors (e.g., Cauchy) violate the $\sqrt{n}$-concentration rate near the prior's tails.
  • Non-identifiable models (mixture label-switching, over-parameterized networks) have multimodal posteriors that never collapse to a single Gaussian.
  • High-dimensional regimes $d \sim n$ break the fixed-dimension Taylor expansion of Step 2.
  • Nonparametric models (function-valued $\theta$) require infinite-dimensional generalizations that sometimes hold (Castillo–Rousseau 2015) and sometimes don't (DIA1986).

See VDV1998 §10.3 for failure modes in regular models and DIA1986 for the nonparametric counterexamples.

25.9 The multivariate case: Normal-Inverse-Wishart

Topic 25's scalar framework extends to multivariate parameters. The most-cited multivariate conjugate pair is Normal-Inverse-Wishart — the generalization of Normal-Inverse-Gamma (Ex 3) to joint inference on a mean vector $\boldsymbol\mu \in \mathbb{R}^d$ and covariance matrix $\boldsymbol\Sigma \in \mathbb{R}^{d \times d}$.

Remark 19 Joint posterior on (μ, Σ)

Place a prior $(\boldsymbol\mu, \boldsymbol\Sigma) \sim \mathrm{NIW}(\boldsymbol\mu_0, \kappa_0, \boldsymbol\Lambda_0, \nu_0)$ — a Normal-Inverse-Wishart distribution. After observing $n$ iid Gaussian samples with sample mean $\bar{\mathbf{y}}$ and sample scatter matrix $\mathbf{S}$, the posterior is NIW with updated hyperparameters. Detailed parameter-update rule: GEL2013 §3.6 Eqs 3.18–3.21. The scalar mechanics of Topic 25 §25.5 Ex 3 (Normal-Normal-IG) are the one-dimensional case.

Remark 20 Regression extension — scalar case covered, regression deferred

The NIW framework extends to Bayesian linear regression: place a conjugate Normal-Inverse-Wishart prior on $(\boldsymbol\beta, \sigma^2)$ in $\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\epsilon$, and the posterior is NIW with updated parameters that include a ridge-regression posterior mean. Topic 25 §25.5 Ex 3 covered the scalar analog; the full regression generalization, plus non-conjugate priors (horseshoe, spike-and-slab) handled via MCMC, is Topic 27 or Topic 28 territory. This note discharges Topic 21 §21.10 Rem 25.

25.10 Forward map

Topic 25 is the Track 7 opener. The remainder of Track 7 — plus one cross-track pointer to formalml — develops the machinery Topic 25 names but defers.

Remark 21 Topic 26 — Bayesian Computation and MCMC

When the conjugate-prior framework fails — non-exponential-family likelihoods, hierarchical models beyond Topic 28, models with many parameters — the posterior is known only up to the intractable normalizing constant $m(\mathbf{y})$. Markov chain Monte Carlo (MCMC) provides exact asymptotic sampling: Metropolis–Hastings (1970), Gibbs sampling (Geman–Geman 1984), Hamiltonian Monte Carlo (Duane et al. 1987; Neal 2011), and the No-U-Turn Sampler (Hoffman–Gelman 2014) are the four principal algorithms. Gibbs samplers use Topic 25 conjugate full-conditionals as building blocks — the connection is direct. Topic 26 develops all four algorithms end-to-end, with the Markov-chain ergodic theorem (§26.7) as the MCMC analog of Topic 10's SLLN (Thm 10.5).

Remark 22 Topic 27 — Bayesian Model Comparison and BMA

Bayes factors $\mathrm{BF}_{10} = m(\mathbf{y} \mid H_1)/m(\mathbf{y} \mid H_0)$ compare two models via their marginal likelihoods. Bayesian model averaging (BMA) combines posterior predictions across candidate models, weighted by posterior model probabilities. Efficient marginal-likelihood computation uses nested sampling (Skilling 2006), bridge sampling (Meng–Wong 1996), thermodynamic integration, and DIC / WAIC / PSIS-LOO (Watanabe 2010; Vehtari–Gelman–Gabry 2017). Topic 25 §25.8 Rem 16 introduced $m(\mathbf{y})$ via the Laplace approximation; the full computational machinery is Topic 27. Topic 27 also houses local-FDR and Bayesian multiplicity (Efron 2010), discharging Topic 20 §20.10.

Remark 23 Topic 28 — Hierarchical Bayes and Partial Pooling

Topic 25 treats the prior $\pi(\theta)$ as fixed. Hierarchical priors $\theta_i \mid \mu \sim \pi(\cdot \mid \mu)$, $\mu \sim \pi(\mu)$ treat the prior itself as having hyperparameters with their own prior. This is the Bayesian reading of partial pooling (Lindley–Smith 1972): individual-level estimates borrow strength from group-level structure. Empirical Bayes uses the data to estimate $\mu$ directly, collapsing the hierarchy. Continuous shrinkage priors (Carvalho–Polson–Scott 2010 horseshoe) provide adaptive sparsity — the “lasso of priors.” Topic 28 develops.

Remark 24 Variational inference (formalml)

When conjugate priors break down and MCMC is too slow (hierarchical models with millions of parameters, deep neural networks), variational inference approximates the posterior with a tractable family $q(\theta)$ minimizing $\mathrm{KL}(q \,\|\, p(\cdot \mid \mathbf{y}))$. The ELBO (evidence lower bound) machinery, mean-field VI, normalizing flows, and stochastic VI are developed at formalML: Variational Inference.

Remark 25 Bayesian decision theory

Every Bayesian point estimator is a Bayes-optimal decision under some loss (§25.6 Rem 9). Bayesian decision theory formalizes this: given a loss $L(\hat\theta, \theta)$, the Bayes decision minimizes posterior expected loss $\mathbb{E}[L \mid \mathbf{y}]$. ROB2007 Ch. 2 is the standard reference; full treatment is deferred to a later philosophical essay or formalml.

Remark 26 Reference priors and Berger–Pericchi machinery

Bernardo’s 1979 reference-priors framework extends Jeffreys to multi-parameter problems by maximizing asymptotic expected KL divergence between prior and posterior. Intrinsic priors (Berger–Pericchi 1996) generalize improper Jeffreys priors for Bayes-factor computation. These are specialized tools not covered in Topic 25’s introductory opener.

Remark 27 Predictive checks and model criticism

Posterior predictive checks (PPCs) simulate replicated datasets from the posterior predictive $p(\mathbf{y}^{\text{rep}} \mid \mathbf{y})$ and compare them to observed data — a direct Bayesian alternative to frequentist goodness-of-fit tests. GEL2013 Ch. 6 is the standard reference; full development in Topic 27.

Remark 28 Bayesian nonparametrics (formalml)

When the parameter space is infinite-dimensional — distributions over distributions, functions over functions — Bayesian nonparametrics provides the toolkit. Dirichlet processes (Ferguson 1973), Pólya trees, and Gaussian processes as priors are the three principal constructions. GPs were already flagged in Topic 8; full treatment at formalML: Gaussian Processes.

Remark 29 Closing: Track 7 as a parallel formalism

Track 7 is a parallel formalism to Tracks 4–6, not a replacement. Every frequentist method of Topics 13–24 has a Bayesian counterpart: MLE ↔ posterior mean or MAP; Wald CI ↔ credible interval (coincident asymptotically by BvM); LRT ↔ Bayes factor; AIC/BIC ↔ marginal likelihood $m(\mathbf{y})$; regularization ↔ informative prior. The practical choice between frameworks depends on whether you can defensibly specify priors and whether you want answers that are probability statements about parameters (Bayesian) or long-run frequency guarantees about procedures (frequentist). In many modern applications (hierarchical models, small-sample inference, integration of prior information) the Bayesian framework is better suited; in others (large-n prediction, procedural guarantees with minimal assumptions) the frequentist framework is cleaner. Track 7's remaining three topics — MCMC, BMA and Bayes factors, hierarchical and empirical Bayes — develop the computational and structural machinery needed to use the Bayesian framework at modern applied scale.

Forward-map diagram for Topic 25. Central purple hub 'Bayesian Foundations (Topic 25)' with arrows outward to Topic 26 (MCMC: Metropolis–Hastings, Gibbs, HMC, NUTS), Topic 27 (BMA, Bayes factors, DIC/WAIC/PSIS-LOO, marginal likelihood), Topic 28 (hierarchical, partial pooling, empirical Bayes, horseshoe), and formalml (VI, BNN, GP, PP, Bayesian nonparametrics). Back-arrows to Topic 4 (Bayes thm), Topic 7 (conjugacy), Topic 14 (MLE/BvM), Topic 23 (MAP=penalty), Topic 24 (BIC=Laplace).

References

  1. Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari & Donald B. Rubin. (2013). Bayesian Data Analysis (3rd ed.). CRC Press.
  2. José M. Bernardo & Adrian F. M. Smith. (1994). Bayesian Theory. Wiley.
  3. Christian P. Robert. (2007). The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation (2nd ed.). Springer.
  4. Harold Jeffreys. (1961). Theory of Probability (3rd ed.). Oxford University Press.
  5. Dennis V. Lindley. (2014). Understanding Uncertainty (Rev. ed.). Wiley.
  6. A. W. van der Vaart. (1998). Asymptotic Statistics. Cambridge University Press.
  7. Persi Diaconis & David Freedman. (1986). On the consistency of Bayes estimates. The Annals of Statistics, 14(1), 1–26.
  8. George Casella & Roger L. Berger. (2002). Statistical Inference (2nd ed.). Duxbury.
  9. E. L. Lehmann & George Casella. (1998). Theory of Point Estimation (2nd ed.). Springer.
  10. Trevor Hastie, Robert Tibshirani & Jerome Friedman. (2009). The Elements of Statistical Learning (2nd ed.). Springer.