intermediate 40 min read · April 12, 2026

Exponential Families

Nine distributions, one canonical form — the unifying framework that makes sufficient statistics, MLE, conjugate priors, and GLMs tractable.

7.1 The Exponential Family (Why Unify?)

Topic 5 — Discrete Distributions cataloged seven discrete PMFs and flagged five as exponential family members. Topic 6 — Continuous Distributions cataloged eight continuous PDFs and flagged four more. In both cases, the exponential family membership was noted in passing — a remark here, a flag there — without explaining what that membership means or why it matters.

This topic answers that question. Nine of the fifteen distributions from Topics 5-6 — Bernoulli, Binomial, Geometric, Negative Binomial, Poisson, Normal, Exponential, Gamma, and Beta — share a single algebraic form:

$$f(x \mid \theta) = h(x) \exp\!\bigl(\eta(\theta) \cdot T(x) - A(\theta)\bigr)$$

This is not a coincidence, and it is not merely a notational convenience. The factorization above — in which the data $x$ and the parameter $\theta$ interact only through the inner product $\eta(\theta) \cdot T(x)$ — is the structural reason behind four of the most important results in parametric statistics:

  1. Sufficient statistics are finite-dimensional. The function $T(x)$ captures everything the data have to say about $\theta$. No matter how large the sample, a fixed-dimensional summary suffices.
  2. MLE reduces to moment matching. The maximum likelihood estimate satisfies $A'(\hat{\eta}) = \bar{T}$, where $\bar{T}$ is the sample mean of the sufficient statistics. One equation, universal across all nine distributions.
  3. Conjugate priors exist. For every exponential family likelihood, there is a natural conjugate prior that yields a posterior in the same family — with hyperparameters updated by a clean additive rule.
  4. Generalized linear models are tractable. The GLM framework — logistic regression, Poisson regression, Gamma regression — requires an exponential family response distribution. The canonical link function comes directly from $A'$.

The distributions that are not exponential family members — Hypergeometric, Discrete Uniform, Continuous Uniform, Student’s $t$, and $F$ — fail for specific structural reasons that we will make precise in Section 7.4. The Chi-squared distribution, by contrast, is a special case of the Gamma and therefore belongs to the exponential family; we do not count it separately among the nine because its degrees-of-freedom parameter $k$ is typically treated as a fixed integer, not an estimated parameter.

Grid of 9 exponential family members: 5 discrete and 4 continuous PMFs/PDFs with natural parameter annotations

The interactive explorer below lets you select any of the nine members, adjust parameters, and see the canonical form decomposition live — with the four components $h(x)$, $\eta(\theta)$, $T(x)$, and $A(\theta)$ color-coded in the formula.

Interactive: Exponential Family Explorer

7.2 The Canonical Form

Before we can unify nine distributions, we need a precise definition of the form they share. We start with the one-parameter case to build intuition, then generalize.

Definition 7.1 (One-Parameter Exponential Family)

A family of distributions indexed by a scalar parameter $\theta \in \Theta$ is a one-parameter exponential family if the PMF or PDF can be written as

$$f(x \mid \theta) = h(x) \exp\!\bigl(\eta(\theta) \cdot T(x) - A(\theta)\bigr)$$

where:

  • $h(x) \geq 0$ is a base measure that depends only on the data, not on $\theta$
  • $\eta(\theta)$ is the natural parameter (also called the canonical parameter), a function of $\theta$ alone
  • $T(x)$ is the sufficient statistic, a function of the data alone
  • $A(\theta) = \log \int h(x) \exp\!\bigl(\eta(\theta) \cdot T(x)\bigr) \, d\mu(x)$ is the log-partition function (also called the cumulant function), which ensures the density integrates to 1

The key structural feature is that $\theta$ and $x$ interact only through the product $\eta(\theta) \cdot T(x)$. Everything else separates cleanly: $h(x)$ absorbs the data-only terms, and $A(\theta)$ absorbs the parameter-only normalization.

When $\eta(\theta) = \theta$ — that is, when the natural parameter is the parameter — we say the family is in natural or canonical parameterization. This is achievable whenever $\eta$ is one-to-one, which holds for all nine members here: simply reparameterize from $\theta$ to $\eta$.

The generalization to multiple parameters is immediate.

Definition 7.2 (k-Parameter Exponential Family)

A family of distributions indexed by a parameter vector $\boldsymbol{\theta} \in \Theta \subseteq \mathbb{R}^k$ is a $k$-parameter exponential family if the PMF or PDF can be written as

$$f(x \mid \boldsymbol{\theta}) = h(x) \exp\!\bigl(\boldsymbol{\eta}(\boldsymbol{\theta}) \cdot \mathbf{T}(x) - A(\boldsymbol{\theta})\bigr)$$

where $\boldsymbol{\eta}(\boldsymbol{\theta}) = (\eta_1(\boldsymbol{\theta}), \ldots, \eta_k(\boldsymbol{\theta}))$ is a vector of natural parameters, $\mathbf{T}(x) = (T_1(x), \ldots, T_k(x))$ is a vector of sufficient statistics, and the dot product $\boldsymbol{\eta} \cdot \mathbf{T} = \sum_{j=1}^k \eta_j T_j$ replaces the scalar product. The log-partition function $A(\boldsymbol{\theta})$ normalizes the distribution.

The dimension of the exponential family is $k$ — the number of natural parameters. The Bernoulli, Geometric, Poisson, and Exponential are one-dimensional ($k = 1$). The Normal, Gamma, and Beta are two-dimensional ($k = 2$). The Binomial and Negative Binomial are one-dimensional when $n$ (resp. $r$) is fixed.

Remark 7.1 (The Support Must Not Depend on the Parameter)

A critical requirement — often stated implicitly — is that the support of $f(x \mid \theta)$ must be independent of $\theta$. The set of $x$ values where $f(x \mid \theta) > 0$ must be the same for all $\theta \in \Theta$.

Why? Because $h(x)$ is the only factor that can restrict the support (by being zero), and $h(x)$ does not depend on $\theta$. If the support depends on $\theta$, we would need an indicator function $\mathbf{1}_{\{x \in S(\theta)\}}$ that cannot be absorbed into the $\eta \cdot T$ structure.

This is exactly why the Uniform$(a, b)$ distribution fails to be an exponential family member, as we will see in Section 7.4.

Anatomy of the canonical form: h(x), eta, T(x), A(eta) components labeled with Bernoulli and Normal worked examples

7.3 Converting Distributions to Canonical Form

We now convert all nine exponential family members to canonical form. The method is the same in every case: start with the known PMF or PDF from Discrete Distributions or Continuous Distributions, take the logarithm, separate the terms that involve only $x$, only $\theta$, and both $x$ and $\theta$, then read off $h(x)$, $\eta(\theta)$, $T(x)$, and $A(\theta)$.

These conversions are compressed — the full PMF/PDF derivations live in Topics 5-6, so we start from the final expressions and focus on the algebraic rearrangement.

7.3.1 Bernoulli

Remark 2 Bernoulli Canonical Form

The Bernoulli$(p)$ PMF is $f(x \mid p) = p^x (1-p)^{1-x}$ for $x \in \{0, 1\}$. Taking the logarithm:

$$\log f(x \mid p) = x \log p + (1 - x) \log(1 - p) = x \log\frac{p}{1-p} + \log(1-p)$$

Reading off the components:

  • $h(x) = 1$
  • $\eta(p) = \log\frac{p}{1-p}$ (the log-odds, also called the logit)
  • $T(x) = x$
  • $A(\eta) = \log(1 + e^\eta)$ (the softplus function)

The natural parameter $\eta$ ranges over all of $\mathbb{R}$: as $p \to 0$, $\eta \to -\infty$; as $p \to 1$, $\eta \to +\infty$. The inverse map is $p = e^\eta / (1 + e^\eta) = \sigma(\eta)$, the logistic function — and this is exactly why logistic regression uses the logit link.
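
As a quick sanity check, the sketch below (Python with NumPy; the function name is ours, purely illustrative) reassembles the Bernoulli PMF from the four components and compares it against $p^x(1-p)^{1-x}$:

```python
import numpy as np

def bernoulli_canonical(x, p):
    """Bernoulli PMF rebuilt as h(x) * exp(eta * T(x) - A(eta))."""
    eta = np.log(p / (1 - p))          # natural parameter: the logit
    A = np.log1p(np.exp(eta))          # log-partition: softplus(eta)
    return 1.0 * np.exp(eta * x - A)   # h(x) = 1, T(x) = x

for p in (0.2, 0.5, 0.9):
    for x in (0, 1):
        assert np.isclose(bernoulli_canonical(x, p), p**x * (1 - p)**(1 - x))
```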

7.3.2 Binomial (fixed $n$)

Remark 3 Binomial Canonical Form

The Binomial$(n, p)$ PMF is $f(k \mid p) = \binom{n}{k} p^k (1-p)^{n-k}$ for $k \in \{0, 1, \ldots, n\}$, with $n$ fixed. Taking the logarithm:

$$\log f(k \mid p) = \log\binom{n}{k} + k \log\frac{p}{1-p} + n \log(1 - p)$$

Reading off:

  • $h(k) = \binom{n}{k}$
  • $\eta(p) = \log\frac{p}{1-p}$ (same logit as Bernoulli)
  • $T(k) = k$
  • $A(\eta) = n \log(1 + e^\eta)$

The Binomial with fixed $n$ is a one-parameter exponential family in $p$. When both $n$ and $p$ are unknown, the family is no longer exponential in the standard sense because $n$ enters through $\binom{n}{k}$, which couples data and parameters outside the $\eta \cdot T$ structure.

7.3.3 Geometric

Remark 4 Geometric Canonical Form

The Geometric$(p)$ PMF is $f(k \mid p) = p(1-p)^{k-1}$ for $k \in \{1, 2, 3, \ldots\}$ (counting the trial on which the first success occurs). Taking the logarithm:

$$\log f(k \mid p) = \log p + (k - 1) \log(1 - p) = k \log(1 - p) + \log\frac{p}{1-p}$$

Reading off:

  • $h(k) = 1$
  • $\eta(p) = \log(1 - p)$
  • $T(k) = k$
  • $A(\eta) = \eta - \log(1 - e^\eta)$

Since $p \in (0, 1)$, we have $1 - p \in (0, 1)$, so $\eta = \log(1-p) < 0$. The natural parameter space is $\eta \in (-\infty, 0)$.

7.3.4 Negative Binomial (fixed $r$)

Remark 5 Negative Binomial Canonical Form

The Negative Binomial$(r, p)$ PMF is $f(k \mid p) = \binom{k-1}{r-1} p^r (1-p)^{k-r}$ for $k \in \{r, r+1, r+2, \ldots\}$, with $r$ fixed. Taking the logarithm:

$$\log f(k \mid p) = \log\binom{k-1}{r-1} + r \log p + (k - r) \log(1 - p) = \log\binom{k-1}{r-1} + k \log(1 - p) + r \log\frac{p}{1-p}$$

Reading off:

  • $h(k) = \binom{k-1}{r-1}$
  • $\eta(p) = \log(1 - p)$
  • $T(k) = k$
  • $A(\eta) = r\eta - r\log(1 - e^\eta)$

The natural parameter space is $\eta < 0$, identical to the Geometric. This makes sense: the Geometric is the Negative Binomial with $r = 1$, and setting $r = 1$ in $A(\eta) = r\eta - r\log(1 - e^\eta)$ recovers the Geometric log-partition function exactly (both families share $T(k) = k$).

7.3.5 Poisson

Remark 6 Poisson Canonical Form

The Poisson$(\lambda)$ PMF is $f(k \mid \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}$ for $k \in \{0, 1, 2, \ldots\}$. Taking the logarithm:

$$\log f(k \mid \lambda) = k \log \lambda - \lambda - \log(k!)$$

Reading off:

  • $h(k) = \frac{1}{k!}$
  • $\eta(\lambda) = \log \lambda$
  • $T(k) = k$
  • $A(\eta) = e^\eta = \lambda$

The natural parameter $\eta = \log \lambda$ ranges over all of $\mathbb{R}$ (since $\lambda > 0$). The log-partition function $A(\eta) = e^\eta$ is the exponential function itself — making the Poisson the cleanest example for studying $A$ and its derivatives.
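
Since every Poisson ingredient is elementary, the same check takes only a few lines. This sketch assumes NumPy and SciPy are available, and uses scipy.stats.poisson purely as an independent reference:

```python
import numpy as np
from math import factorial
from scipy.stats import poisson

def poisson_canonical(k, lam):
    """Poisson PMF rebuilt as h(k) * exp(eta * T(k) - A(eta))."""
    eta = np.log(lam)                                          # natural parameter
    return (1 / factorial(k)) * np.exp(eta * k - np.exp(eta))  # A(eta) = e^eta

lam = 3.7
ks = np.arange(20)
assert np.allclose([poisson_canonical(k, lam) for k in ks], poisson.pmf(ks, lam))
```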

7.3.6 Normal

Remark 7 Normal Canonical Form

The Normal$(\mu, \sigma^2)$ PDF is $f(x \mid \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$ for $x \in \mathbb{R}$. This is a two-parameter family. Expanding the exponent:

$$-\frac{(x - \mu)^2}{2\sigma^2} = -\frac{x^2}{2\sigma^2} + \frac{\mu x}{\sigma^2} - \frac{\mu^2}{2\sigma^2}$$

So the full log-density is:

$$\log f(x \mid \mu, \sigma^2) = \frac{\mu}{\sigma^2} \cdot x + \left(-\frac{1}{2\sigma^2}\right) \cdot x^2 - \frac{\mu^2}{2\sigma^2} - \log(\sigma\sqrt{2\pi})$$

Reading off:

  • $h(x) = \frac{1}{\sqrt{2\pi}}$
  • $\boldsymbol{\eta} = \left(\frac{\mu}{\sigma^2}, \; -\frac{1}{2\sigma^2}\right)$, so $\eta_1 = \mu/\sigma^2$ and $\eta_2 = -1/(2\sigma^2)$
  • $\mathbf{T}(x) = (x, \; x^2)$
  • $A(\boldsymbol{\eta}) = \frac{\mu^2}{2\sigma^2} + \log \sigma = -\frac{\eta_1^2}{4\eta_2} - \frac{1}{2}\log(-2\eta_2)$

The natural parameter space is $\eta_1 \in \mathbb{R}$, $\eta_2 < 0$. When $\sigma^2$ is known, the Normal reduces to a one-parameter exponential family with natural parameter $\eta = \mu/\sigma^2$, sufficient statistic $T(x) = x$, and log-partition function $A(\eta) = \sigma^2 \eta^2 / 2$.
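
The two-parameter bookkeeping is easy to get wrong, so here is a minimal numeric check (Python; scipy.stats.norm is used only as the reference implementation) that the $(\eta_1, \eta_2)$ decomposition reproduces the Normal density:

```python
import numpy as np
from scipy.stats import norm

def normal_canonical(x, mu, sigma2):
    """Normal PDF rebuilt from the 2-parameter canonical form."""
    eta1, eta2 = mu / sigma2, -1 / (2 * sigma2)           # natural parameters
    A = -eta1**2 / (4 * eta2) - 0.5 * np.log(-2 * eta2)   # log-partition
    h = 1 / np.sqrt(2 * np.pi)                            # base measure
    return h * np.exp(eta1 * x + eta2 * x**2 - A)         # T(x) = (x, x^2)

xs = np.linspace(-5, 5, 101)
assert np.allclose(normal_canonical(xs, 1.5, 0.8),
                   norm.pdf(xs, loc=1.5, scale=np.sqrt(0.8)))
```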

7.3.7 Exponential

Remark 8 Exponential Canonical Form

The Exponential$(\lambda)$ PDF is $f(x \mid \lambda) = \lambda e^{-\lambda x}$ for $x \geq 0$. Taking the logarithm:

$$\log f(x \mid \lambda) = \log \lambda - \lambda x = (-\lambda) \cdot x - (-\log \lambda)$$

Reading off:

  • $h(x) = 1$ for $x \geq 0$
  • $\eta(\lambda) = -\lambda$
  • $T(x) = x$
  • $A(\eta) = -\log(-\eta)$

The natural parameter $\eta = -\lambda < 0$ (since $\lambda > 0$). Note the sign: the natural parameter is the negative of the rate. This is a one-parameter exponential family with natural parameter space $\eta \in (-\infty, 0)$.

7.3.8 Gamma

Remark 9 Gamma Canonical Form

The Gamma$(\alpha, \beta)$ PDF is $f(x \mid \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}$ for $x > 0$. Taking the logarithm:

$$\log f(x \mid \alpha, \beta) = (\alpha - 1) \log x + (-\beta) \cdot x - \log \Gamma(\alpha) + \alpha \log \beta$$

This is a two-parameter family. Reading off:

  • $h(x) = 1$ for $x > 0$
  • $\boldsymbol{\eta} = (\alpha - 1, \; -\beta)$, so $\eta_1 = \alpha - 1$ and $\eta_2 = -\beta$
  • $\mathbf{T}(x) = (\log x, \; x)$
  • $A(\boldsymbol{\eta}) = \log \Gamma(\eta_1 + 1) - (\eta_1 + 1) \log(-\eta_2)$

The natural parameter space is $\eta_1 > -1$, $\eta_2 < 0$. Setting $\alpha = 1$ (i.e., $\eta_1 = 0$) and keeping only the second component recovers the Exponential canonical form: $\eta = -\beta = -\lambda$, $T(x) = x$, $A(\eta) = -\log(-\eta)$. The Gamma genuinely subsumes the Exponential as a special case, and the canonical forms are consistent.

7.3.9 Beta

Remark 10 Beta Canonical Form

The Beta$(\alpha, \beta)$ PDF is $f(x \mid \alpha, \beta) = \frac{1}{B(\alpha, \beta)} x^{\alpha - 1} (1 - x)^{\beta - 1}$ for $x \in (0, 1)$. Taking the logarithm:

$$\log f(x \mid \alpha, \beta) = (\alpha - 1) \log x + (\beta - 1) \log(1 - x) - \log B(\alpha, \beta)$$

Reading off:

  • $h(x) = 1$ for $x \in (0, 1)$
  • $\boldsymbol{\eta} = (\alpha - 1, \; \beta - 1)$, so $\eta_1 = \alpha - 1$ and $\eta_2 = \beta - 1$
  • $\mathbf{T}(x) = (\log x, \; \log(1 - x))$
  • $A(\boldsymbol{\eta}) = \log B(\eta_1 + 1, \eta_2 + 1) = \log \Gamma(\eta_1 + 1) + \log \Gamma(\eta_2 + 1) - \log \Gamma(\eta_1 + \eta_2 + 2)$

The natural parameter space is $\eta_1 > -1$, $\eta_2 > -1$. The sufficient statistics $T_1(x) = \log x$ and $T_2(x) = \log(1 - x)$ are the log-data and the log-complement — both of which appear naturally in the Beta-Bernoulli conjugate updating that we will derive in Section 7.7.

Summary table of all 9 exponential family members with their canonical form components h, eta, T, A

The converter below lets you step through the algebraic transformation for each distribution, with color-coded components matching the canonical form.

Interactive: Canonical Form Converter

7.4 Why Some Distributions Are NOT Exponential Families

Not every distribution fits the canonical form. Of the fifteen distributions cataloged in Topics 5-6, six are left out of the nine: five — Hypergeometric, Discrete Uniform, Continuous Uniform, Student’s $t$, and $F$ — fail structurally, and the Chi-squared is subsumed by the Gamma. Understanding why the failures occur sharpens our understanding of what the exponential family structure actually requires.

Remark 11 Structural Failures: Student's t and F

The Student’s $t(\nu)$ PDF is proportional to $(1 + t^2/\nu)^{-(\nu+1)/2}$, and the $F(d_1, d_2)$ PDF involves a similar power-of-a-polynomial structure. In both cases, the parameter ($\nu$, or $d_1, d_2$) appears as an exponent on a function of $x$. There is no way to factor the log-density into a sum $\eta(\theta) \cdot T(x)$ plus separate functions of $x$ and $\theta$ alone.

Concretely, for the $t(\nu)$ distribution:

$$\log f(t \mid \nu) = -\frac{\nu + 1}{2} \log\!\left(1 + \frac{t^2}{\nu}\right) + C(\nu)$$

The term $-\frac{\nu + 1}{2} \log(1 + t^2/\nu)$ cannot be written as $\eta(\nu) \cdot T(t)$ because $\nu$ appears both as a multiplicative factor ($(\nu+1)/2$) and inside the argument of the logarithm ($t^2/\nu$). This double entanglement of data and parameter prevents the factorization.

The same issue affects the $F$ distribution. Both are derived from the Normal and Chi-squared via ratios and transformations that destroy the exponential family structure.

Example 1 Why Uniform(a, b) Fails: Parameter-Dependent Support

The Continuous Uniform$(a, b)$ has PDF

$$f(x \mid a, b) = \frac{1}{b - a} \cdot \mathbf{1}_{[a, b]}(x)$$

where $\mathbf{1}_{[a, b]}(x)$ is the indicator function that equals 1 when $a \leq x \leq b$ and 0 otherwise. The support of $f$ is the interval $[a, b]$, which depends on the parameters $a$ and $b$.

We can try to force this into exponential family form by taking the log:

$$\log f(x \mid a, b) = -\log(b - a) + \log \mathbf{1}_{[a, b]}(x)$$

The indicator $\log \mathbf{1}_{[a, b]}(x)$ equals 0 when $x \in [a, b]$ and $-\infty$ otherwise. This term depends on both $x$ and $\theta = (a, b)$, but it cannot be written as $\eta(\theta) \cdot T(x)$ — it is a hard constraint, not a smooth inner product.

The Discrete Uniform fails for exactly the same reason: its support $\{a, a+1, \ldots, b\}$ depends on the parameters. The Hypergeometric fails similarly: its support $\{\max(0, n - N + K), \ldots, \min(n, K)\}$ depends on $(N, K, n)$.

This is the content of Remark 7.1: the support of an exponential family density must be independent of the parameter. Parameter-dependent support is the most common reason a distribution fails the exponential family test.

Three-panel comparison: Uniform support varies with a,b; Exponential support is fixed at [0,infinity); decision tree for exponential family membership

7.5 The Log-Partition Function

The log-partition function $A(\eta)$ is the most important single object in exponential family theory. Its name comes from statistical mechanics, where $\exp(A(\eta))$ is the partition function — the normalizing constant that ensures probabilities sum (or integrate) to 1. But $A(\eta)$ does far more than normalize. Its derivatives generate the moments of the sufficient statistic.

Theorem 7.1 (Log-Partition and Moments)

Let $f(x \mid \eta) = h(x) \exp(\eta \cdot T(x) - A(\eta))$ be a one-parameter exponential family in natural parameterization. Then:

  1. $A'(\eta) = E_\eta[T(X)]$
  2. $A''(\eta) = \text{Var}_\eta(T(X))$

In words: the first derivative of the log-partition function gives the expected value of the sufficient statistic, and the second derivative gives its variance.

Proof.

Part 1. The density integrates to 1 for all $\eta$ in the natural parameter space:

$$\int h(x) \exp\!\bigl(\eta \cdot T(x) - A(\eta)\bigr) \, d\mu(x) = 1$$

Multiply both sides by $\exp(A(\eta))$:

$$\int h(x) \exp\!\bigl(\eta \cdot T(x)\bigr) \, d\mu(x) = \exp\!\bigl(A(\eta)\bigr)$$

Differentiate both sides with respect to $\eta$. On the left, we differentiate under the integral sign (justified by the smoothness of exponential families on the interior of the natural parameter space — see Barndorff-Nielsen, Ch. 8):

$$\int h(x) \, T(x) \exp\!\bigl(\eta \cdot T(x)\bigr) \, d\mu(x) = A'(\eta) \exp\!\bigl(A(\eta)\bigr)$$

Divide both sides by $\exp(A(\eta))$:

$$\int T(x) \, h(x) \exp\!\bigl(\eta \cdot T(x) - A(\eta)\bigr) \, d\mu(x) = A'(\eta)$$

The left side is $\int T(x) \, f(x \mid \eta) \, d\mu(x) = E_\eta[T(X)]$. Therefore $A'(\eta) = E_\eta[T(X)]$.

Part 2. Differentiate the normalization identity a second time. Starting from

$$\int h(x) \, T(x) \exp\!\bigl(\eta \cdot T(x)\bigr) \, d\mu(x) = A'(\eta) \exp\!\bigl(A(\eta)\bigr)$$

we differentiate both sides with respect to $\eta$ again. On the left:

$$\int h(x) \, T(x)^2 \exp\!\bigl(\eta \cdot T(x)\bigr) \, d\mu(x)$$

On the right, by the product rule:

$$A''(\eta) \exp\!\bigl(A(\eta)\bigr) + A'(\eta)^2 \exp\!\bigl(A(\eta)\bigr)$$

Divide both sides by $\exp(A(\eta))$:

$$\int T(x)^2 \, f(x \mid \eta) \, d\mu(x) = A''(\eta) + A'(\eta)^2$$

The left side is $E_\eta[T(X)^2]$. Since $A'(\eta) = E_\eta[T(X)]$, we have:

$$E_\eta[T(X)^2] = A''(\eta) + \bigl(E_\eta[T(X)]\bigr)^2$$

Rearranging: $A''(\eta) = E_\eta[T(X)^2] - \bigl(E_\eta[T(X)]\bigr)^2 = \text{Var}_\eta(T(X))$.

□

Let us verify this on the cleanest example. For the Poisson, $A(\eta) = e^\eta$, so $A'(\eta) = e^\eta = \lambda$ and $A''(\eta) = e^\eta = \lambda$. We know $E[X] = \lambda$ and $\text{Var}(X) = \lambda$ for the Poisson — both confirmed by the log-partition derivatives. For the Bernoulli, $A(\eta) = \log(1 + e^\eta)$, so $A'(\eta) = e^\eta/(1 + e^\eta) = p$ and $A''(\eta) = e^\eta/(1 + e^\eta)^2 = p(1-p)$, matching $E[X] = p$ and $\text{Var}(X) = p(1-p)$.
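
The theorem is also easy to confirm numerically. The sketch below (Python; the helper is illustrative, and the Poisson support is truncated at $k = 50$, which is harmless for $\lambda \approx 1.3$) compares finite-difference derivatives of $A$ against the exact mean and variance of $T(X)$:

```python
import numpy as np
from math import factorial

def check_moments(A, ts, probs, eta=0.3, h=1e-4):
    """Compare finite-difference A'(eta), A''(eta) with E[T] and Var(T)."""
    A1 = (A(eta + h) - A(eta - h)) / (2 * h)
    A2 = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2
    p = probs(eta, ts)
    mean = np.sum(ts * p)
    var = np.sum(ts**2 * p) - mean**2
    print(f"A'={A1:.5f}  E[T]={mean:.5f}   A''={A2:.5f}  Var(T)={var:.5f}")

# Bernoulli: A(eta) = log(1 + e^eta), support {0, 1}
check_moments(lambda e: np.log1p(np.exp(e)), np.array([0.0, 1.0]),
              lambda e, t: np.exp(e * t) / (1 + np.exp(e)))

# Poisson: A(eta) = e^eta, support truncated at k = 50
ks = np.arange(51)
hk = np.array([1 / factorial(int(k)) for k in ks])
check_moments(np.exp, ks, lambda e, t: hk * np.exp(e * t - np.exp(e)))
```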

Remark 12 Convexity of A and Its Consequences

Since $A''(\eta) = \text{Var}_\eta(T(X)) \geq 0$ (variance is nonnegative), the log-partition function $A(\eta)$ is convex. Moreover, $A''(\eta) > 0$ whenever $T(X)$ is not degenerate (i.e., $T(X)$ takes more than one value with positive probability), so $A(\eta)$ is strictly convex for all standard exponential families.

This has three major consequences:

  1. The log-likelihood is concave. For a sample $x_1, \ldots, x_n$, the log-likelihood in natural parameterization is $\ell(\eta) = \eta \cdot \sum_i T(x_i) - n A(\eta) + \text{const}$. Since $A(\eta)$ is convex, $-nA(\eta)$ is concave, and $\ell(\eta)$ is a linear function plus a concave function — hence concave. This guarantees that any local maximum of the log-likelihood is the global maximum.

  2. The MLE is unique (when it exists). A strictly concave function has at most one maximum. Combined with existence conditions (the observed sufficient statistic lies in the interior of the convex hull of possible sufficient statistics), this gives both existence and uniqueness of the MLE.

  3. The mean-to-natural parameter map is invertible. The function $\mu(\eta) = A'(\eta)$ is strictly increasing (since $A'' > 0$), so it has an inverse $\eta(\mu) = (A')^{-1}(\mu)$. This invertibility is what makes the dual parameterization — sometimes called the mean parameterization — well-defined.

Three-panel: Poisson A(eta) = e^eta, Bernoulli softplus, convexity across members

The explorer below lets you drag a point along the $\eta$ axis and watch $A(\eta)$, $A'(\eta) = E[T(X)]$, and $A''(\eta) = \text{Var}(T(X))$ update in real time. The tangent line at each point has slope $E[T(X)]$.

Interactive: Log-Partition Function Explorer

7.6 Sufficient Statistics from Exponential Families

The canonical form immediately reveals the sufficient statistic $T(x)$ — but why is $T(x)$ sufficient? The answer comes from the Fisher-Neyman factorization theorem, which states that $T$ is sufficient for $\theta$ if and only if the density factors as a function of $T$ and $\theta$ times a function of $x$ alone.

Theorem 7.2 (Fisher-Neyman Factorization for Exponential Families)

In any exponential family $f(x \mid \theta) = h(x) \exp(\eta(\theta) \cdot T(x) - A(\theta))$, the statistic $T(X)$ is sufficient for $\theta$.

For an i.i.d. sample $X_1, \ldots, X_n$ from this family, the sum $\sum_{i=1}^n T(X_i)$ is sufficient for $\theta$.

Proof.

The Fisher-Neyman factorization theorem (which Sufficient Statistics develops in full generality) states that $T(X)$ is sufficient for $\theta$ if and only if the joint density factors as

$$f(x \mid \theta) = g(T(x), \theta) \cdot h(x)$$

where $g$ depends on $x$ only through $T(x)$, and $h$ depends only on $x$.

For an exponential family, the density is already in this form:

$$f(x \mid \theta) = \underbrace{\exp\!\bigl(\eta(\theta) \cdot T(x) - A(\theta)\bigr)}_{g(T(x), \theta)} \cdot \underbrace{h(x)}_{h(x)}$$

The first factor depends on $x$ only through $T(x)$ (since the data enters only via the inner product $\eta(\theta) \cdot T(x)$). The second factor $h(x)$ depends only on $x$, not on $\theta$. The factorization is immediate.

For an i.i.d. sample, the joint density is $\prod_{i=1}^n f(x_i \mid \theta) = \left(\prod_{i=1}^n h(x_i)\right) \exp\!\left(\eta(\theta) \cdot \sum_{i=1}^n T(x_i) - nA(\theta)\right)$. The same factorization holds, with $\sum_{i=1}^n T(x_i)$ playing the role of the sufficient statistic and $\prod_{i=1}^n h(x_i)$ playing the role of the data-only factor.

□

The power of this result is dimensional reduction. No matter how large the sample, a fixed-dimensional summary suffices:

  • Bernoulli: $T(x) = x$, so $\sum T(x_i) = \sum x_i$ — the number of successes. One number summarizes $n$ binary observations.
  • Normal (known $\sigma^2$): $T(x) = x$, so $\sum T(x_i) = \sum x_i$, or equivalently $\bar{x}$. One number summarizes $n$ continuous observations.
  • Normal (unknown $\mu$ and $\sigma^2$): $\mathbf{T}(x) = (x, x^2)$, so the sufficient statistics are $\sum x_i$ and $\sum x_i^2$. Two numbers summarize the entire sample.
  • Poisson: $T(k) = k$, so $\sum T(k_i) = \sum k_i$ — the total count. One number.

This is not a feature of specific distributions — it is a structural consequence of the exponential family form. Sufficient Statistics develops the full theory, including the Rao-Blackwell theorem (sufficient statistics yield optimal estimators), completeness (no redundancy in the sufficient statistic), and the Pitman-Koopman-Darmois converse: under regularity conditions, exponential families are the only families admitting fixed-dimensional sufficient statistics as $n$ grows.
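
The dimensional reduction can be seen directly in code. Below is a small demonstration (Python; the second sample is engineered, by solving a quadratic, to share $\sum x_i = 6$ and $\sum x_i^2 = 14$ with the first) that two visibly different Normal samples with matching sufficient statistics produce identical log-likelihoods at every $(\mu, \sigma^2)$:

```python
import numpy as np

def normal_loglik(mu, sigma2, x):
    """Normal log-likelihood; touches x only through sum(x) and sum(x^2)."""
    n = len(x)
    return (mu / sigma2) * x.sum() - (x @ x) / (2 * sigma2) \
           - n * (mu**2 / (2 * sigma2) + 0.5 * np.log(2 * np.pi * sigma2))

x1 = np.array([1.0, 2.0, 3.0])                 # sum = 6, sum of squares = 14
a = 1.5                                        # pick one value, solve for the rest
s, q = 6 - a, ((6 - a)**2 - (14 - a**2)) / 2   # b + c and b * c
d = np.sqrt(s**2 - 4 * q)
x2 = np.array([a, (s + d) / 2, (s - d) / 2])   # different sample, same statistics

for mu in (-1.0, 0.0, 2.5):
    for s2 in (0.5, 1.0, 4.0):
        assert np.isclose(normal_loglik(mu, s2, x1), normal_loglik(mu, s2, x2))
```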

Bernoulli n observations to 1 sufficient statistic, Normal n observations to 2, general dimension reduction diagram

7.7 Conjugate Priors from Exponential Families

One of the most elegant consequences of the exponential family structure is the existence of conjugate priors. A conjugate prior for a likelihood is a prior distribution such that the posterior belongs to the same family as the prior — just with updated parameters. This makes Bayesian updating algebraically tractable: multiply the prior by the likelihood, and read off the new parameters without computing a normalizing integral.

The remarkable fact is that every exponential family has a natural conjugate prior. The general form is a direct consequence of the canonical representation.

Theorem 7.3 (Conjugate Prior for Exponential Families)

Let $f(x \mid \eta) = h(x) \exp(\eta \cdot T(x) - A(\eta))$ be a one-parameter exponential family in natural parameterization. The conjugate prior for $\eta$ has the form

$$\pi(\eta \mid \chi_0, \nu_0) \propto \exp\!\bigl(\chi_0 \, \eta - \nu_0 \, A(\eta)\bigr)$$

where $\chi_0$ and $\nu_0 > 0$ are hyperparameters.

Given an i.i.d. sample $x_1, \ldots, x_n$, the posterior is in the same family with updated hyperparameters:

$$\chi_0' = \chi_0 + n\bar{T}, \qquad \nu_0' = \nu_0 + n$$

where $\bar{T} = \frac{1}{n} \sum_{i=1}^n T(x_i)$ is the sample mean of the sufficient statistics.

Proof.

The likelihood for an i.i.d. sample $x_1, \ldots, x_n$ from an exponential family is:

$$L(\eta) = \prod_{i=1}^n f(x_i \mid \eta) = \prod_{i=1}^n h(x_i) \cdot \exp\!\left(\eta \sum_{i=1}^n T(x_i) - nA(\eta)\right)$$

The product $\prod h(x_i)$ does not depend on $\eta$ and will be absorbed into the proportionality constant. Writing $S = \sum_{i=1}^n T(x_i) = n\bar{T}$:

$$L(\eta) \propto \exp\!\bigl(\eta S - n A(\eta)\bigr)$$

Multiply by the conjugate prior $\pi(\eta \mid \chi_0, \nu_0) \propto \exp(\chi_0 \eta - \nu_0 A(\eta))$:

$$\pi(\eta \mid x_1, \ldots, x_n) \propto L(\eta) \cdot \pi(\eta \mid \chi_0, \nu_0) \propto \exp\!\bigl((\chi_0 + S)\,\eta - (\nu_0 + n)\,A(\eta)\bigr) = \exp\!\bigl((\chi_0 + n\bar{T})\,\eta - (\nu_0 + n)\,A(\eta)\bigr)$$

This is the same functional form as the prior, with $\chi_0$ replaced by $\chi_0' = \chi_0 + n\bar{T}$ and $\nu_0$ replaced by $\nu_0' = \nu_0 + n$. The posterior is therefore in the conjugate family.

The hyperparameter $\nu_0$ acts as a pseudo-sample size: it measures how much weight the prior carries relative to the data. After observing $n$ real data points, the effective sample size increases from $\nu_0$ to $\nu_0 + n$. The hyperparameter $\chi_0$ acts as a pseudo-sufficient-statistic sum: it represents the “data” implied by the prior, and it is updated additively by the observed sufficient statistic sum $n\bar{T}$.

□

The theorem above is the general result. Let us see it in action for three specific conjugate pairs.

Example 2 Beta-Bernoulli Conjugate Updating

For the Bernoulli with natural parameter $\eta = \text{logit}(p)$ and log-partition $A(\eta) = \log(1 + e^\eta)$, the conjugate prior is

$$\pi(p \mid \alpha_0, \beta_0) = \frac{1}{B(\alpha_0, \beta_0)} p^{\alpha_0 - 1} (1 - p)^{\beta_0 - 1} = \text{Beta}(\alpha_0, \beta_0)$$

Given $n$ Bernoulli observations with $k$ successes (so $\bar{T} = k/n$), the posterior is:

$$\pi(p \mid \text{data}) = \text{Beta}(\alpha_0 + k, \; \beta_0 + n - k)$$

In the general framework: $\chi_0 = \alpha_0 - 1$ (pseudo-successes minus 1), $\nu_0 = \alpha_0 + \beta_0 - 2$ (pseudo-sample-size minus 2), and the update rule gives $\chi_0' = \chi_0 + k$, $\nu_0' = \nu_0 + n$. The Beta parameterization absorbs the offsets.

The posterior mean is $\frac{\alpha_0 + k}{\alpha_0 + \beta_0 + n}$ — a weighted average of the prior mean $\frac{\alpha_0}{\alpha_0 + \beta_0}$ and the sample proportion $\frac{k}{n}$, with weights proportional to the pseudo-sample size and the real sample size.
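
In code, the entire Bayesian update is two additions. A minimal sketch (Python with NumPy; the names and the true $p = 0.7$ are illustrative):

```python
import numpy as np

def beta_bernoulli_update(alpha0, beta0, data):
    """Beta(a0, b0) prior + Bernoulli data -> Beta(a0 + k, b0 + n - k) posterior."""
    k, n = int(np.sum(data)), len(data)
    return alpha0 + k, beta0 + (n - k)

rng = np.random.default_rng(0)
data = rng.random(50) < 0.7                  # 50 Bernoulli(0.7) draws
a, b = beta_bernoulli_update(2.0, 2.0, data)
print(f"posterior Beta({a}, {b}), mean {a / (a + b):.3f}")  # mean settles near 0.7
```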

Example 3 Gamma-Poisson Conjugate Updating

For the Poisson with natural parameter $\eta = \log \lambda$ and log-partition $A(\eta) = e^\eta = \lambda$, the conjugate prior for $\lambda$ is

$$\pi(\lambda \mid \alpha_0, \beta_0) = \frac{\beta_0^{\alpha_0}}{\Gamma(\alpha_0)} \lambda^{\alpha_0 - 1} e^{-\beta_0 \lambda} = \text{Gamma}(\alpha_0, \beta_0)$$

Given $n$ Poisson observations with total count $S = \sum_{i=1}^n k_i$, the posterior is:

$$\pi(\lambda \mid \text{data}) = \text{Gamma}(\alpha_0 + S, \; \beta_0 + n)$$

The update rule is clean: $\alpha_0$ accumulates the total count, $\beta_0$ accumulates the number of observations. The posterior mean is $\frac{\alpha_0 + S}{\beta_0 + n}$ — a weighted average of the prior mean $\alpha_0/\beta_0$ and the sample mean $S/n$.

Example 4 Normal-Normal Conjugate Updating (Precision-Weighted Average)

For the Normal with known variance $\sigma^2$, the natural parameter is $\eta = \mu/\sigma^2$ and the log-partition is $A(\eta) = \sigma^2 \eta^2/2$. The conjugate prior for $\mu$ is

$$\pi(\mu \mid \mu_0, \sigma_0^2) = \text{Normal}(\mu_0, \sigma_0^2)$$

Given $n$ Normal observations with sample mean $\bar{x}$, the posterior is:

$$\pi(\mu \mid \text{data}) = \text{Normal}(\mu_n, \sigma_n^2)$$

where the posterior variance and mean are given by the precision-weighted average:

$$\frac{1}{\sigma_n^2} = \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}, \qquad \mu_n = \sigma_n^2 \left(\frac{\mu_0}{\sigma_0^2} + \frac{n\bar{x}}{\sigma^2}\right)$$

The posterior precision (inverse variance) is the sum of the prior precision and the data precision. The posterior mean is the precision-weighted average of the prior mean and the sample mean. As $n \to \infty$, the data precision dominates and $\mu_n \to \bar{x}$ — the posterior concentrates on the MLE, regardless of the prior. This is a concrete instance of posterior consistency.
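
The precision-weighted update is equally short in code. A sketch under the same known-variance assumption (Python; the simulated truth $\mu = 3$, $\sigma^2 = 4$ is illustrative):

```python
import numpy as np

def normal_posterior(mu0, s20, xbar, n, sigma2):
    """Posterior mean and variance for a Normal mean with known sigma2."""
    prec = 1 / s20 + n / sigma2                     # precisions add
    mu_n = (mu0 / s20 + n * xbar / sigma2) / prec   # precision-weighted average
    return mu_n, 1 / prec

rng = np.random.default_rng(1)
x = rng.normal(3.0, 2.0, size=200)                  # true mu = 3, sigma^2 = 4
mu_n, s2_n = normal_posterior(0.0, 1.0, x.mean(), len(x), sigma2=4.0)
print(f"posterior mean {mu_n:.3f}, sd {np.sqrt(s2_n):.3f}")  # near 3, sd ~ 0.14
```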

Three-panel: Beta-Bernoulli, Gamma-Poisson, and Normal-Normal prior-to-posterior updating

The explorer below lets you experiment with all three conjugate pairs. Adjust the prior hyperparameters, add data, and watch the posterior concentrate.

Interactive: Conjugate Prior Explorer

7.8 Maximum Likelihood in Exponential Families

The log-partition function’s properties give us a universal recipe for maximum likelihood estimation that applies to all nine exponential family members simultaneously.

Theorem 7.4 (Universal MLE Recipe)

Let $X_1, \ldots, X_n$ be an i.i.d. sample from a one-parameter exponential family $f(x \mid \eta) = h(x) \exp(\eta \cdot T(x) - A(\eta))$. The maximum likelihood estimator $\hat{\eta}$ satisfies

$$A'(\hat{\eta}) = \bar{T}$$

where $\bar{T} = \frac{1}{n} \sum_{i=1}^n T(X_i)$ is the sample mean of the sufficient statistics. When $\bar{T}$ lies in the interior of the range of $A'$, this equation has a unique solution.

Proof.

The log-likelihood for an i.i.d. sample is:

$$\ell(\eta) = \sum_{i=1}^n \log f(x_i \mid \eta) = \sum_{i=1}^n \log h(x_i) + \eta \sum_{i=1}^n T(x_i) - n A(\eta)$$

The first sum does not depend on $\eta$. Writing $S = \sum_{i=1}^n T(x_i) = n\bar{T}$:

$$\ell(\eta) = \eta \, n\bar{T} - nA(\eta) + \text{const}$$

Differentiate with respect to $\eta$ and set to zero:

$$\ell'(\eta) = n\bar{T} - nA'(\eta) = 0 \quad\Longrightarrow\quad A'(\hat{\eta}) = \bar{T}$$

This is the score equation. It says: find the parameter value $\hat{\eta}$ such that the model’s expected sufficient statistic $E_{\hat{\eta}}[T(X)] = A'(\hat{\eta})$ equals the observed sufficient statistic $\bar{T}$. This is moment matching — the MLE equates the theoretical and empirical moments of the sufficient statistic.

Existence. By Theorem 7.1, $A'(\eta) = E_\eta[T(X)]$. The range of $A'$ is an open interval (by the strict convexity of $A$ and the intermediate value theorem). If $\bar{T}$ lies in this interval, the equation $A'(\eta) = \bar{T}$ has a solution.

Uniqueness. Since $A''(\eta) = \text{Var}_\eta(T(X)) > 0$, the function $A'$ is strictly increasing. A strictly increasing function can hit each value at most once, so the solution $\hat{\eta}$ is unique.

Global maximum. The second derivative of the log-likelihood is $\ell''(\eta) = -nA''(\eta) < 0$, so the log-likelihood is strictly concave. Any critical point is the unique global maximum.

□

The universal recipe $A'(\hat{\eta}) = \bar{T}$ specializes to every distribution’s MLE formula. Let us verify the four most important cases.

Example 5 Poisson MLE

For the Poisson: $\eta = \log \lambda$, $T(k) = k$, $A(\eta) = e^\eta$, so $A'(\eta) = e^\eta = \lambda$. The MLE equation $A'(\hat{\eta}) = \bar{T}$ becomes

$$e^{\hat{\eta}} = \bar{k} \implies \hat{\lambda} = \bar{k}$$

The Poisson MLE is the sample mean — the most natural estimator one could imagine, and now we see it is a consequence of the exponential family structure. The existence condition is $\bar{k} > 0$, which holds whenever at least one observation is nonzero.

Example 6 Bernoulli MLE

For the Bernoulli: $\eta = \text{logit}(p)$, $T(x) = x$, $A(\eta) = \log(1 + e^\eta)$, so $A'(\eta) = e^\eta/(1 + e^\eta) = p$. The MLE equation becomes

$$\hat{p} = \bar{x} = \frac{k}{n}$$

where $k = \sum x_i$ is the number of successes. The sample proportion is the MLE for the Bernoulli probability — again, a universally known result that falls out automatically from $A'(\hat{\eta}) = \bar{T}$.

Example 7 Exponential MLE

For the Exponential: $\eta = -\lambda$, $T(x) = x$, $A(\eta) = -\log(-\eta)$, so $A'(\eta) = -1/\eta = 1/\lambda$. The MLE equation becomes

$$\frac{1}{\hat{\lambda}} = \bar{x} \implies \hat{\lambda} = \frac{1}{\bar{x}}$$

The MLE for the Exponential rate is the reciprocal of the sample mean. The existence condition is $\bar{x} > 0$, which holds for any sample from a continuous positive distribution.

Example 8 Normal MLE (Known Variance)

For the Normal with known $\sigma^2$: $\eta = \mu/\sigma^2$, $T(x) = x$, $A(\eta) = \sigma^2 \eta^2/2$, so $A'(\eta) = \sigma^2 \eta = \mu$. The MLE equation becomes

$$\hat{\mu} = \bar{x}$$

The sample mean is the MLE for the Normal mean — the single most foundational result in point estimation, and it requires exactly one line once we have the exponential family machinery. Maximum Likelihood Estimation develops the general theory, including asymptotic properties and the multiparameter case.
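
The recipe is also directly computable: solve $A'(\eta) = \bar{T}$ by Newton’s method, using $A''$ (which Theorem 7.1 hands us for free) as the exact derivative. A minimal sketch in Python, with the Poisson and Bernoulli cases checked against their closed forms:

```python
import numpy as np

def mle_natural(A1, A2, tbar, eta0=0.0, tol=1e-12):
    """Solve the score equation A'(eta) = tbar by Newton's method."""
    eta = eta0
    for _ in range(100):
        step = (A1(eta) - tbar) / A2(eta)   # A2 = A'' is the derivative of A'
        eta -= step
        if abs(step) < tol:
            break
    return eta

# Poisson: A' = A'' = e^eta, so lambda-hat = e^eta-hat should equal kbar
eta_hat = mle_natural(np.exp, np.exp, tbar=4.2)
print(np.exp(eta_hat))                 # 4.2, the sample mean

# Bernoulli: A' = sigmoid, A'' = sigmoid * (1 - sigmoid)
sig = lambda e: 1 / (1 + np.exp(-e))
eta_hat = mle_natural(sig, lambda e: sig(e) * (1 - sig(e)), tbar=0.35)
print(sig(eta_hat))                    # 0.35, the sample proportion
```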

Three-panel: Poisson log-likelihood, Normal log-likelihood, universal MLE recipe flowchart

7.9 Connection to Generalized Linear Models

The exponential family is not just a theoretical convenience — it is the structural foundation of generalized linear models (GLMs), arguably the most widely used class of statistical models in applied science and industry. Every GLM has three components: a random component (the response distribution), a systematic component (the linear predictor $\boldsymbol{x}^T \boldsymbol{\beta}$), and a link function connecting them. The exponential family determines the first and third components.

Remark 13 GLM Variance Functions

For any exponential family member, the variance function $V(\mu)$ expresses the variance of the response as a function of its mean:

$$V(\mu) = A''\!\bigl((A')^{-1}(\mu)\bigr)$$

The variance function characterizes the mean-variance relationship and determines the behavior of the GLM. For the Normal, $V(\mu) = \sigma^2$ (constant variance). For the Poisson, $V(\mu) = \mu$ (variance equals the mean). For the Bernoulli, $V(\mu) = \mu(1 - \mu)$ (variance is maximized at $p = 1/2$). For the Gamma, $V(\mu) = \mu^2$ (variance grows quadratically with the mean).

These variance functions dictate when each GLM is appropriate: Poisson regression for count data where variance scales with the mean, Gamma regression for positive continuous data where variance scales with the square of the mean, logistic regression for binary data with the bell-shaped variance curve.

Example 9 GLM Canonical Links

The canonical link function for a GLM is $g(\mu) = (A')^{-1}(\mu) = \eta$, which maps the mean response $\mu$ directly to the natural parameter $\eta$. When we use the canonical link, the linear predictor equals the natural parameter: $\boldsymbol{x}^T \boldsymbol{\beta} = \eta$.

| Distribution | Canonical link $g(\mu)$ | Link name | GLM |
|---|---|---|---|
| Normal | $g(\mu) = \mu$ | Identity | Linear regression |
| Bernoulli | $g(\mu) = \log\frac{\mu}{1-\mu}$ | Logit | Logistic regression |
| Poisson | $g(\mu) = \log \mu$ | Log | Poisson regression |
| Gamma | $g(\mu) = 1/\mu$ | Inverse | Gamma regression |

Linear regression is the Normal GLM with identity link: $E[Y \mid \boldsymbol{x}] = \boldsymbol{x}^T \boldsymbol{\beta}$. The mean is the linear predictor.

Logistic regression is the Bernoulli GLM with logit link: $\log\frac{P(Y=1 \mid \boldsymbol{x})}{P(Y=0 \mid \boldsymbol{x})} = \boldsymbol{x}^T \boldsymbol{\beta}$. The log-odds is the linear predictor.

Poisson regression is the Poisson GLM with log link: $\log E[Y \mid \boldsymbol{x}] = \boldsymbol{x}^T \boldsymbol{\beta}$. The log-mean is the linear predictor.

In each case, the canonical link emerges directly from the exponential family structure — it is the function $\eta(\mu) = (A')^{-1}(\mu)$ that maps from the mean parameterization to the natural parameterization. Generalized Linear Models develops the full theory, including non-canonical links, iteratively reweighted least squares, and deviance analysis.
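
Since $A'$ is strictly increasing, its inverse can be found numerically even when no closed form is handy. The sketch below (Python; the bisection bounds are arbitrary but generous) recovers the logit and log links from nothing but the mean maps $A'$:

```python
import numpy as np

def canonical_link(A1, mu, lo=-30.0, hi=30.0):
    """Numerically invert the strictly increasing mean map A' by bisection."""
    for _ in range(100):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if A1(mid) < mu else (lo, mid)
    return (lo + hi) / 2

mu = 0.3
print(canonical_link(lambda e: 1 / (1 + np.exp(-e)), mu), np.log(mu / (1 - mu)))  # logit
print(canonical_link(np.exp, mu), np.log(mu))                                     # log
```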

Example 10 Variance Functions and Their Implications

The variance function $V(\mu)$ determines the weight structure of the GLM. In weighted least squares terms, the weight for observation $i$ is inversely proportional to $V(\mu_i)$:

| Distribution | $V(\mu)$ | Behavior | Implication |
|---|---|---|---|
| Normal | $\sigma^2$ (constant) | Constant variance | OLS is optimal |
| Poisson | $\mu$ | Variance = mean | Overdispersion is common |
| Bernoulli | $\mu(1-\mu)$ | Variance peaks at $\mu = 0.5$ | Observations near $p = 0$ or $p = 1$ are more informative |
| Gamma | $\mu^2$ | Variance $\propto$ mean$^2$ | Coefficient of variation is constant |

When the variance function does not match the data’s actual mean-variance relationship, we get misspecification. Poisson regression with $V(\mu) = \mu$ applied to overdispersed count data (where the true variance is $c\mu$ for some $c > 1$) produces correct point estimates but anticonservative standard errors. The Negative Binomial, with $V(\mu) = \mu + \mu^2/r$, is the standard remedy.

Three-panel: linear regression (Normal), logistic regression (Bernoulli), Poisson regression
Interactive: GLM Canonical Link Explorer

7.10 Connections to ML

The exponential family is not an abstract mathematical curiosity — it is the structural backbone of modern machine learning. Cross-entropy loss, Fisher information, natural gradients, and variational inference all trace back to the canonical form.

Example 11 Cross-Entropy Loss Is Bernoulli Negative Log-Likelihood

The binary cross-entropy loss used in every classification neural network is

$$\mathcal{L}(y, \hat{p}) = -\bigl[y \log \hat{p} + (1 - y) \log(1 - \hat{p})\bigr]$$

This is exactly the negative log-likelihood of a single Bernoulli observation $y$ with parameter $\hat{p}$:

$$-\log f(y \mid \hat{p}) = -\bigl[y \log \hat{p} + (1 - y) \log(1 - \hat{p})\bigr]$$

In natural parameterization, with $\eta = \text{logit}(\hat{p})$, the loss becomes $-\eta y + \log(1 + e^\eta) = A(\eta) - \eta T(y)$, which is the exponential family negative log-likelihood up to a constant.

When a neural network’s final layer applies a sigmoid activation $\sigma(\boldsymbol{x}^T\boldsymbol{w}) = \hat{p}$ and is trained with binary cross-entropy, it is performing maximum likelihood estimation for a Bernoulli exponential family model with natural parameter $\eta = \boldsymbol{x}^T\boldsymbol{w}$. This is logistic regression. The entire deep learning classification pipeline — softmax output, cross-entropy loss, gradient descent — is exponential family MLE in disguise.
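
The identity is worth verifying once by hand, or in a few lines of Python (NumPy assumed; the function names are ours):

```python
import numpy as np

def bce(y, p):
    """Binary cross-entropy for one observation."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def bernoulli_nll(y, eta):
    """The same loss in natural form: A(eta) - eta * T(y)."""
    return np.log1p(np.exp(eta)) - eta * y

for p in (0.1, 0.5, 0.9):
    eta = np.log(p / (1 - p))                      # logit link
    for y in (0.0, 1.0):
        assert np.isclose(bce(y, p), bernoulli_nll(y, eta))
```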

Example 12 Fisher Information and Information Geometry

The Fisher information for a one-parameter exponential family in natural parameterization is

$$I(\eta) = E_\eta\!\left[\left(\frac{\partial}{\partial \eta} \log f(X \mid \eta)\right)^2\right] = A''(\eta) = \text{Var}_\eta(T(X))$$

This is a consequence of Theorem 7.1: the Fisher information equals the second derivative of the log-partition function, which equals the variance of the sufficient statistic. For an i.i.d. sample of size $n$, the Fisher information is $nI(\eta) = nA''(\eta)$.

The Fisher information defines a Riemannian metric on the parameter space. The “distance” between two parameter values $\eta_1$ and $\eta_2$ is not the Euclidean distance $|\eta_1 - \eta_2|$ but the geodesic distance measured by $I(\eta)$. This is the starting point of information geometry, which studies the differential geometry of statistical models.

Natural gradient descent replaces the standard gradient update $\eta_{t+1} = \eta_t - \alpha \nabla \ell$ with the natural gradient $\eta_{t+1} = \eta_t - \alpha I(\eta_t)^{-1} \nabla \ell$. This rescales the gradient by the inverse Fisher information, making the update invariant to reparameterization. For exponential families, the natural gradient has a clean form because $I(\eta) = A''(\eta)$ is readily computable. Variational inference exploits this: when the variational distribution is an exponential family, the natural gradient of the ELBO has a closed-form update.
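
A tiny experiment makes the point. For a Poisson MLE in $\eta$, rescaling the score by $1/A''(\eta)$ turns plain gradient ascent into a Newton-like method (this sketch is illustrative; the learning rate 0.05 and the ten iterations are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.poisson(5.0, size=500)
tbar = data.mean()
grad = lambda e: tbar - np.exp(e)        # score of eta * tbar - A(eta)

eta_plain, eta_nat = 0.0, 0.0
for _ in range(10):
    eta_plain += 0.05 * grad(eta_plain)         # vanilla gradient ascent
    eta_nat += grad(eta_nat) / np.exp(eta_nat)  # rescale by 1/I(eta) = 1/A''(eta)

print(np.exp(eta_plain), np.exp(eta_nat), tbar) # natural gradient is already at tbar
```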

Three-panel: cross-entropy loss, variance functions, unification web with exponential family at center

Summary

The exponential family is the single most important structural concept in parametric statistics. Nine distributions, one canonical form, and four major consequences: sufficient statistics, conjugate priors, a universal MLE recipe, and the GLM framework.

| Distribution | $h(x)$ | $\eta(\theta)$ | $T(x)$ | $A(\eta)$ | Dim |
|---|---|---|---|---|---|
| Bernoulli$(p)$ | $1$ | $\log\frac{p}{1-p}$ | $x$ | $\log(1+e^\eta)$ | 1 |
| Binomial$(n,p)$ | $\binom{n}{k}$ | $\log\frac{p}{1-p}$ | $k$ | $n\log(1+e^\eta)$ | 1 |
| Geometric$(p)$ | $1$ | $\log(1-p)$ | $k$ | $\eta - \log(1-e^\eta)$ | 1 |
| NegBin$(r,p)$ | $\binom{k-1}{r-1}$ | $\log(1-p)$ | $k$ | $r\eta - r\log(1-e^\eta)$ | 1 |
| Poisson$(\lambda)$ | $1/k!$ | $\log\lambda$ | $k$ | $e^\eta$ | 1 |
| Normal$(\mu,\sigma^2)$ | $\frac{1}{\sqrt{2\pi}}$ | $(\frac{\mu}{\sigma^2}, -\frac{1}{2\sigma^2})$ | $(x, x^2)$ | $-\frac{\eta_1^2}{4\eta_2} - \frac{1}{2}\log(-2\eta_2)$ | 2 |
| Exponential$(\lambda)$ | $1$ | $-\lambda$ | $x$ | $-\log(-\eta)$ | 1 |
| Gamma$(\alpha,\beta)$ | $1$ | $(\alpha-1, -\beta)$ | $(\log x, x)$ | $\log\Gamma(\eta_1+1) - (\eta_1+1)\log(-\eta_2)$ | 2 |
| Beta$(\alpha,\beta)$ | $1$ | $(\alpha-1, \beta-1)$ | $(\log x, \log(1-x))$ | $\log B(\eta_1+1, \eta_2+1)$ | 2 |

The log-partition function $A(\eta)$ is the central object. Its first derivative gives $E[T(X)]$, its second derivative gives $\text{Var}(T(X))$ and the Fisher information $I(\eta)$, and its convexity guarantees the concavity of the log-likelihood.

The canonical form determines the MLE. The universal recipe $A'(\hat{\eta}) = \bar{T}$ — match the model’s expected sufficient statistic to the sample’s observed sufficient statistic — yields every distribution’s MLE formula in one line.

The canonical form determines conjugate priors. Every exponential family has a natural conjugate prior with the update rule $\chi_0' = \chi_0 + n\bar{T}$, $\nu_0' = \nu_0 + n$, specializing to Beta-Bernoulli, Gamma-Poisson, and Normal-Normal.

The canonical form determines GLMs. The canonical link function $(A')^{-1}$ connects the linear predictor to the natural parameter. Logistic regression, Poisson regression, and Gamma regression are all instances.

What comes next. This topic established the unifying framework. The downstream topics build on it:

  • Sufficient Statistics develops the full theory of data reduction, including the Rao-Blackwell theorem, completeness, the Lehmann-Scheffé UMVUE, Basu’s independence theorem, and the Pitman-Koopman-Darmois converse, building directly on the sufficient statistic $T(x)$ identified here
  • Maximum Likelihood Estimation develops the asymptotic theory of MLE — consistency, asymptotic normality, efficiency — using the exponential family as the primary example class
  • Bayesian Foundations (Topic 25) develops the Bayesian framework in full. The §7.7 conjugate prior theorem reappears there as Theorem 2, lifted into the inferential setting with credible intervals, the posterior predictive distribution, and Bernstein–von Mises asymptotics
  • Generalized Linear Models develops the complete GLM framework — estimation via IRLS, deviance, model diagnostics — requiring the exponential family response distribution established here

For the connections to machine learning at scale, see the formalML topics on information geometry (Fisher information as Riemannian metric, natural gradient descent), generalized linear models (logistic, Poisson, and Gamma regression), and variational inference (exponential family variational distributions and natural gradient VI).


References

  1. Casella, G. & Berger, R. L. (2002). Statistical Inference (2nd ed.). Duxbury.
  2. Bickel, P. J. & Doksum, K. A. (2015). Mathematical Statistics: Basic Ideas and Selected Topics (2nd ed.). Chapman and Hall/CRC.
  3. McCullagh, P. & Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). Chapman & Hall/CRC.
  4. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
  5. Barndorff-Nielsen, O. E. (2014). Information and Exponential Families in Statistical Theory (2nd ed.). Wiley.
  6. Efron, B. (2022). Exponential Families in Theory and Practice. Cambridge University Press.