foundational 55 min read · April 12, 2026

Continuous Distributions

Uniform, Normal, Exponential, Gamma, Beta, Chi-squared, Student's t, and F — the named PDFs that underpin statistical inference and machine learning, each derived from a distinct probabilistic mechanism.

6.1 The Continuous Distribution Catalog

Topic 5 — Discrete Distributions cataloged seven discrete PMFs. We now turn to the continuous side: eight distributions, each arising from a distinct probabilistic mechanism, each receiving the same systematic treatment — PDF, moments, MGF, key property, ML connection.

The difference is calculus, not philosophy. Where the discrete catalog summed PMFs, this one integrates PDFs. The tools we built in Expectation, Variance & Moments — E[X] = \int x \, f(x) \, dx, \text{Var}(X) = E[X^2] - (E[X])^2, MGFs — carry over directly. What changes is the palette: these distributions live on continuous supports, their densities are smooth curves, and their moments require standard integration techniques from calculus (integration by parts, improper integrals, substitution).

Remark 1 A Parallel Catalog

This topic mirrors Discrete Distributions by design. Five core distributions (Uniform, Normal, Exponential, Gamma, Beta) are independently motivated with full derivations. Three derived distributions (Chi-squared, Student’s t, F) are defined by construction from the Normal and Chi-squared. The core five receive the same template as Topic 5: definition, then E[X] and \text{Var}(X) proofs, then MGF, then key properties, then an ML connection. The derived three share a single section — they are building blocks for hypothesis testing rather than independent modeling choices.

Gallery of all eight continuous distribution PDFs with E[X] balance-point triangles

The interactive explorer below lets you switch between all eight distributions, adjust their parameters, and see how the PDF, CDF, and moments respond.

Interactive: Continuous Distribution Catalog Explorer (PDF f(x), CDF F(x) = P(X ≤ x), and readouts for E[X], Var(X), σ, and exponential-family membership of the selected distribution).

6.2 The Uniform Distribution

The Uniform distribution is the continuous version of maximum ignorance: every point in an interval is equally likely. It is the simplest continuous distribution and the starting point for simulation — we can transform Uniform samples into samples from any other distribution via the inverse CDF method.

Definition 1 Continuous Uniform Distribution

A random variable X has the Uniform distribution on [a, b], written X \sim \text{Uniform}(a, b), if its PDF is

f(x) = \begin{cases} \frac{1}{b - a} & \text{if } a \le x \le b, \\ 0 & \text{otherwise.} \end{cases}

The CDF is F(x) = \frac{x - a}{b - a} for x \in [a, b], with F(x) = 0 for x < a and F(x) = 1 for x > b.

The PDF is flat — constant density over the interval, zero outside. The normalization condition \int_a^b \frac{1}{b-a} \, dx = 1 is immediate.

Theorem 1 Uniform Moments and MGF

If X \sim \text{Uniform}(a, b), then:

  1. E[X] = \frac{a + b}{2}
  2. \text{Var}(X) = \frac{(b - a)^2}{12}
  3. M_X(t) = \frac{e^{tb} - e^{ta}}{t(b - a)} for t \ne 0, and M_X(0) = 1.
Proof [show]

Part 1 (Expectation). By direct integration:

E[X] = \int_a^b x \cdot \frac{1}{b - a} \, dx = \frac{1}{b - a} \cdot \frac{x^2}{2} \bigg|_a^b = \frac{1}{b - a} \cdot \frac{b^2 - a^2}{2} = \frac{(b + a)(b - a)}{2(b - a)} = \frac{a + b}{2}

Part 2 (Variance). We first compute E[X^2]:

E[X^2] = \int_a^b x^2 \cdot \frac{1}{b - a} \, dx = \frac{1}{b - a} \cdot \frac{x^3}{3} \bigg|_a^b = \frac{b^3 - a^3}{3(b - a)} = \frac{a^2 + ab + b^2}{3}

Then \text{Var}(X) = E[X^2] - (E[X])^2:

\text{Var}(X) = \frac{a^2 + ab + b^2}{3} - \frac{(a + b)^2}{4} = \frac{4(a^2 + ab + b^2) - 3(a^2 + 2ab + b^2)}{12} = \frac{a^2 - 2ab + b^2}{12} = \frac{(b - a)^2}{12}

Part 3 (MGF). For t \ne 0:

M_X(t) = E[e^{tX}] = \int_a^b e^{tx} \cdot \frac{1}{b - a} \, dx = \frac{1}{b - a} \cdot \frac{e^{tx}}{t} \bigg|_a^b = \frac{e^{tb} - e^{ta}}{t(b - a)}

At t = 0, L'Hôpital's rule gives M_X(0) = 1.

\square

Remark 2 The Uniform Is NOT an Exponential Family Member

Unlike the other four core distributions in this topic, the Uniform is not a member of the exponential family. The reason: its support [a, b] depends on the parameters. The exponential family requires that the support of the density be independent of the parameters — a condition the Uniform violates. Exponential Families makes this distinction precise.

Example 1 Inverse CDF Transform: Generating Exponential Samples

If U \sim \text{Uniform}(0, 1) and we define X = F^{-1}(U) = -\frac{1}{\lambda} \ln(1 - U), then X \sim \text{Exponential}(\lambda). This is the inverse CDF transform — a fundamental simulation technique.

Proof. P(X \le x) = P\!\left(-\frac{1}{\lambda} \ln(1 - U) \le x\right) = P(U \le 1 - e^{-\lambda x}) = 1 - e^{-\lambda x}, which is the Exponential CDF. Since 1 - U has the same distribution as U, we can simplify to X = -\frac{1}{\lambda} \ln U.

ML connection. Every random number generator starts with Uniform samples. The inverse CDF transform converts them to any target distribution — as long as F^{-1} can be computed. For the Normal, \Phi^{-1} has no closed form, so specialized algorithms (Box-Muller, Ziggurat) are used instead.
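
A minimal sketch of the transform in practice (Python with NumPy is an assumption here, not something the article prescribes): push Uniform(0,1) draws through F^{-1} and compare the sample moments with the Exponential(λ) values 1/λ and 1/λ².

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0                               # hypothetical rate parameter
u = rng.uniform(size=100_000)           # U ~ Uniform(0, 1)
x = -np.log1p(-u) / lam                 # X = F^{-1}(U) = -(1/lam) * ln(1 - U)

# Sample moments should be close to the Exponential(lam) theory.
print(x.mean(), 1 / lam)                # both ≈ 0.5
print(x.var(), 1 / lam**2)              # both ≈ 0.25
```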

Uniform distribution: PDF for different intervals, CDF with quantile mark, inverse CDF transform simulation

6.3 The Normal Distribution

The Normal distribution is the most important distribution in all of statistics. Its centrality comes from the Central Limit Theorem: sums of many independent random variables converge to Normal form, regardless of the original distribution. This makes the Normal the universal approximation for aggregate effects — measurement errors, test scores, stock returns over short intervals, and noise in machine learning models.

Definition 2 Normal Distribution

A random variable X has the Normal distribution with mean \mu and variance \sigma^2, written X \sim N(\mu, \sigma^2), if its PDF is

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right), \quad x \in \mathbb{R}

The special case Z \sim N(0, 1) is the standard Normal, with PDF \varphi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2} and CDF \Phi(z) = \int_{-\infty}^z \varphi(t) \, dt.

The PDF is the familiar bell curve, symmetric about \mu. That \int_{-\infty}^{\infty} e^{-z^2/2} \, dz = \sqrt{2\pi} — the Gaussian integral — is a classical result from calculus.

Normal distribution: 68-95-99.7 rule, effect of μ, effect of σ
Theorem 2 Normal Moments

If X \sim N(\mu, \sigma^2), then E[X] = \mu and \text{Var}(X) = \sigma^2.

Proof [show]

Expectation. Standardize: let Z = (X - \mu)/\sigma, so X = \mu + \sigma Z where Z \sim N(0, 1).

E[X] = E[\mu + \sigma Z] = \mu + \sigma E[Z]

Now E[Z] = \int_{-\infty}^{\infty} z \cdot \varphi(z) \, dz = 0 by the symmetry of \varphi about zero (the integrand z \, e^{-z^2/2} is an odd function). So E[X] = \mu.

Variance. \text{Var}(X) = \sigma^2 \text{Var}(Z), so it suffices to show \text{Var}(Z) = E[Z^2] = 1. We compute:

E[Z^2] = \int_{-\infty}^{\infty} z^2 \cdot \frac{1}{\sqrt{2\pi}} e^{-z^2/2} \, dz

Integrate by parts with u = z and dv = z \, e^{-z^2/2} \, dz, giving v = -e^{-z^2/2}:

E[Z^2] = \frac{1}{\sqrt{2\pi}} \left[-z \, e^{-z^2/2}\right]_{-\infty}^{\infty} + \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-z^2/2} \, dz

The boundary term vanishes (the exponential kills the polynomial). The remaining integral is \sqrt{2\pi}, so E[Z^2] = \frac{\sqrt{2\pi}}{\sqrt{2\pi}} = 1.

Therefore \text{Var}(X) = \sigma^2 \cdot 1 = \sigma^2.

\square

Theorem 3 Normal MGF

If X \sim N(\mu, \sigma^2), then M_X(t) = \exp\!\left(\mu t + \frac{\sigma^2 t^2}{2}\right) for all t \in \mathbb{R}.

Proof [show]

We compute directly, using the completing-the-square technique:

M_X(t) = E[e^{tX}] = \int_{-\infty}^{\infty} e^{tx} \cdot \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) dx

Combine the exponents: tx - \frac{(x-\mu)^2}{2\sigma^2}. Complete the square in x:

tx - \frac{(x - \mu)^2}{2\sigma^2} = -\frac{1}{2\sigma^2}\left[(x - \mu)^2 - 2\sigma^2 tx\right] = -\frac{1}{2\sigma^2}\left[x^2 - 2(\mu + \sigma^2 t)x + \mu^2\right] = -\frac{(x - (\mu + \sigma^2 t))^2}{2\sigma^2} + \mu t + \frac{\sigma^2 t^2}{2}

The term \mu t + \sigma^2 t^2/2 is constant in x and factors out. The remaining integral is a Normal PDF with mean \mu + \sigma^2 t and variance \sigma^2, which integrates to 1:

M_X(t) = \exp\!\left(\mu t + \frac{\sigma^2 t^2}{2}\right) \cdot \underbrace{\int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x - (\mu + \sigma^2 t))^2}{2\sigma^2}\right) dx}_{=\,1} = \exp\!\left(\mu t + \frac{\sigma^2 t^2}{2}\right)

\square

Theorem 4 Normal Reproductive Property

If X_1 \sim N(\mu_1, \sigma_1^2) and X_2 \sim N(\mu_2, \sigma_2^2) are independent, then

X_1 + X_2 \sim N(\mu_1 + \mu_2, \, \sigma_1^2 + \sigma_2^2)
Proof [show]

By MGF uniqueness (from Expectation, Variance & Moments):

M_{X_1 + X_2}(t) = M_{X_1}(t) \cdot M_{X_2}(t) = \exp\!\left(\mu_1 t + \frac{\sigma_1^2 t^2}{2}\right) \cdot \exp\!\left(\mu_2 t + \frac{\sigma_2^2 t^2}{2}\right) = \exp\!\left((\mu_1 + \mu_2) t + \frac{(\sigma_1^2 + \sigma_2^2) t^2}{2}\right)

This is the MGF of N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2). Since MGFs uniquely determine distributions, X_1 + X_2 \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2).

\square

The Normal is closed under addition of independent variables. This is the reproductive property — and it is the algebraic reason the Normal appears everywhere. Any sum of independent Normal variables is still Normal.
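
As a sanity check of the reproductive property, the short simulation below (NumPy, an assumed tool) compares the sample mean and variance of X_1 + X_2 with \mu_1 + \mu_2 and \sigma_1^2 + \sigma_2^2.

```python
import numpy as np

rng = np.random.default_rng(1)
mu1, sig1, mu2, sig2 = 1.0, 2.0, -3.0, 0.5    # hypothetical parameters
x1 = rng.normal(mu1, sig1, size=200_000)
x2 = rng.normal(mu2, sig2, size=200_000)
s = x1 + x2                                    # should behave like N(mu1+mu2, sig1^2+sig2^2)

print(s.mean(), mu1 + mu2)                     # both ≈ -2.0
print(s.var(), sig1**2 + sig2**2)              # both ≈ 4.25
```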

Theorem 5 Normal Linear Transformation

If X \sim N(\mu, \sigma^2) and a, b are constants with a \ne 0, then aX + b \sim N(a\mu + b, \, a^2\sigma^2).

Proof [show]

By the MGF of Y = aX + b:

M_Y(t) = E[e^{t(aX+b)}] = e^{tb} \cdot E[e^{(ta)X}] = e^{tb} \cdot M_X(at) = e^{tb} \cdot \exp\!\left(\mu(at) + \frac{\sigma^2(at)^2}{2}\right) = \exp\!\left((a\mu + b)t + \frac{a^2\sigma^2 t^2}{2}\right)

This is the MGF of N(a\mu + b, a^2\sigma^2).

\square

In particular, the standardization Z = (X - \mu)/\sigma \sim N(0, 1) is a special case with a = 1/\sigma and b = -\mu/\sigma.

Remark 3 Normal Exponential Family Form

The Normal belongs to the exponential family with natural parameters \eta_1 = \mu/\sigma^2 and \eta_2 = -1/(2\sigma^2). The sufficient statistics are (X, X^2). When \sigma^2 is known, the single natural parameter \eta = \mu/\sigma^2 makes MLE particularly clean: the sufficient statistic is \bar{X}, and the MLE is \hat{\mu} = \bar{X}. Exponential Families develops this systematically.

Standard Normal CDF with key quantiles, area-as-probability shading, CDF steepness versus σ
Interactive: Normal Properties Explorer (coverage within μ ± 1σ: 68.27%, μ ± 2σ: 95.45%, μ ± 3σ: 99.73%).
Example 2 Normal MLE: Why Minimizing Squared Error Is MLE Under Gaussian Noise

Suppose we observe data y_1, \ldots, y_n and model y_i = f(x_i) + \varepsilon_i where \varepsilon_i \sim N(0, \sigma^2) independently. The log-likelihood is:

\ell(\theta) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - f(x_i))^2

Maximizing \ell(\theta) with respect to \theta (the parameters of f) is equivalent to minimizing \sum_{i=1}^n (y_i - f(x_i))^2. This is why least squares = MLE under Gaussian noise. The assumption of Normal errors is baked into every ordinary least squares regression — and when that assumption fails, the MLE changes (e.g., to Laplace errors for \ell_1 loss). See Topic 22 (Generalized Linear Models) for the general framework.
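
A small numerical illustration of the equivalence, using simulated data and a one-parameter model f(x) = \theta x (both hypothetical choices): the \theta that minimizes the squared error is exactly the \theta that maximizes the Gaussian log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 50)
sigma = 0.2
y = 3.0 * x + rng.normal(0, sigma, size=x.size)    # true slope 3.0, Gaussian noise

grid = np.linspace(2.0, 4.0, 401)                  # candidate slopes theta
sse = np.array([np.sum((y - t * x) ** 2) for t in grid])
loglik = -0.5 * x.size * np.log(2 * np.pi * sigma**2) - sse / (2 * sigma**2)

# The minimizer of the SSE and the maximizer of the log-likelihood coincide.
print(grid[np.argmin(sse)], grid[np.argmax(loglik)])
```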


6.4 The Exponential Distribution

The Exponential distribution models waiting times in a Poisson process: if events arrive at a constant rate \lambda, the time between consecutive events is \text{Exponential}(\lambda). Its defining property is memorylessness — the time already waited provides no information about the remaining wait.

Definition 3 Exponential Distribution

A random variable X has the Exponential distribution with rate \lambda > 0, written X \sim \text{Exponential}(\lambda), if its PDF is

f(x) = \lambda e^{-\lambda x}, \quad x \ge 0

The CDF is F(x) = 1 - e^{-\lambda x} for x \ge 0. The mean is 1/\lambda and the median is \ln 2 / \lambda.

Exponential distribution: PDF family varying λ, CDF family, memoryless property visualization
Theorem 6 Exponential Moments

If X \sim \text{Exponential}(\lambda), then E[X] = \frac{1}{\lambda} and \text{Var}(X) = \frac{1}{\lambda^2}.

Proof [show]

Expectation. Integrating by parts with u = x and dv = \lambda e^{-\lambda x} dx:

E[X] = \int_0^{\infty} x \lambda e^{-\lambda x} \, dx = \left[-x e^{-\lambda x}\right]_0^{\infty} + \int_0^{\infty} e^{-\lambda x} \, dx

The boundary term vanishes. The remaining integral is 1/\lambda:

E[X] = 0 + \frac{1}{\lambda} = \frac{1}{\lambda}

Variance. We need E[X^2]. Integrate by parts with u = x^2, then reuse E[X]:

E[X^2] = \int_0^{\infty} x^2 \lambda e^{-\lambda x} \, dx = \left[-x^2 e^{-\lambda x}\right]_0^{\infty} + \int_0^{\infty} 2x \, e^{-\lambda x} \, dx = 0 + \frac{2}{\lambda} E[X] = \frac{2}{\lambda^2}

Therefore \text{Var}(X) = E[X^2] - (E[X])^2 = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}.

\square

Theorem 7 Exponential MGF

If X \sim \text{Exponential}(\lambda), then M_X(t) = \frac{\lambda}{\lambda - t} for t < \lambda.

Proof [show]

Direct computation:

M_X(t) = \int_0^{\infty} e^{tx} \lambda e^{-\lambda x} \, dx = \lambda \int_0^{\infty} e^{-(\lambda - t)x} \, dx

For t < \lambda, the integral converges:

M_X(t) = \lambda \cdot \frac{1}{\lambda - t} = \frac{\lambda}{\lambda - t}

For t \ge \lambda, the integral diverges, so the MGF is defined only for t < \lambda.

\square

Theorem 8 Exponential Memoryless Property

A continuous random variable X with support [0, \infty) satisfies the memoryless property P(X > s + t \mid X > s) = P(X > t) for all s, t \ge 0 if and only if X \sim \text{Exponential}(\lambda) for some \lambda > 0.

Proof [show]

Forward direction. Suppose X \sim \text{Exponential}(\lambda). Then P(X > x) = e^{-\lambda x} for x \ge 0. By the definition of conditional probability:

P(X > s + t \mid X > s) = \frac{P(X > s + t)}{P(X > s)} = \frac{e^{-\lambda(s+t)}}{e^{-\lambda s}} = e^{-\lambda t} = P(X > t)

Reverse direction. Suppose P(X > s + t \mid X > s) = P(X > t) for all s, t \ge 0. Let g(x) = P(X > x). Then g(s + t) = g(s) \cdot g(t) for all s, t \ge 0. This is Cauchy's functional equation on [0, \infty). Under the mild regularity condition that g is monotone (which follows from g being a survival function), the only solution is g(x) = e^{-\lambda x} for some \lambda > 0. This gives F(x) = 1 - e^{-\lambda x}, which is the Exponential CDF.

\square

The memoryless property has a vivid interpretation: if you’ve been waiting 10 minutes for a bus with Exponential interarrival times, the conditional distribution of additional wait time is the same as if you’d just arrived. The past provides no information about the future. This is the continuous analog of the Geometric memoryless property from Discrete Distributions.
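
A two-line numerical check of the forward direction (Python, an assumption): for several values of t, P(X > s+t | X > s) computed from the survival function equals P(X > t).

```python
import numpy as np

lam, s = 1.0, 2.0                           # hypothetical rate and elapsed wait
surv = lambda x: np.exp(-lam * x)           # P(X > x) for Exponential(lam)

for t in (1.0, 2.0, 3.0):
    cond = surv(s + t) / surv(s)            # P(X > s + t | X > s)
    print(f"t={t}: {cond:.4f} vs {surv(t):.4f}")   # identical by memorylessness
```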

Remark 4 Exponential Exponential Family Form

The Exponential belongs to the exponential family with natural parameter \eta = -\lambda and sufficient statistic T(x) = x. The log-partition function is A(\eta) = -\ln(-\eta), and E[X] = A'(\eta) = 1/\lambda. Exponential Families shows how this structure enables conjugate Bayesian inference: the Gamma distribution is the conjugate prior for the Exponential rate parameter.

Interactive: Memoryless Property Explorer. Conditioning the Exp(λ) PDF on X > s and re-indexing reproduces the original PDF, and the verification P(X > s+t | X > s) = P(X > t) holds for every t. The waiting time "resets": the past provides no information about the remaining time.
Example 3 Exponential Survival Analysis: Constant Hazard

In survival analysis, the hazard function h(t) = f(t) / (1 - F(t)) measures the instantaneous failure rate at time t, given survival to t. For the Exponential:

h(t) = \frac{\lambda e^{-\lambda t}}{e^{-\lambda t}} = \lambda

The hazard is constant — the system does not age. This makes the Exponential a baseline model: real systems wear out (h(t) increasing, Weibull distribution) or burn in (h(t) decreasing). Departures from constant hazard motivate the Gamma and Weibull alternatives. In ML, Exponential survival models appear in customer churn prediction, equipment failure forecasting, and time-to-event modeling.


6.5 The Gamma Distribution

The Gamma distribution generalizes the Exponential in a natural way: if the Exponential models the time until the first event in a Poisson process, the Gamma models the time until the \alpha-th event. It subsumes the Exponential (\alpha = 1) and the Chi-squared (\alpha = k/2, \beta = 1/2) as special cases.

Before defining the Gamma distribution, we need the Gamma function — the normalization constant that makes the PDF integrate to 1.

Remark 5 The Gamma Function

The Gamma function is defined for \alpha > 0 as

\Gamma(\alpha) = \int_0^{\infty} t^{\alpha - 1} e^{-t} \, dt

This improper integral converges for all \alpha > 0.

Key properties:

  • Recursion: \Gamma(\alpha + 1) = \alpha \, \Gamma(\alpha). This follows from integration by parts: with u = t^{\alpha} and dv = e^{-t} dt, the boundary terms vanish and we get \Gamma(\alpha + 1) = \alpha \int_0^{\infty} t^{\alpha-1} e^{-t} \, dt = \alpha \, \Gamma(\alpha).

  • Factorial connection: Since \Gamma(1) = \int_0^{\infty} e^{-t} \, dt = 1, the recursion gives \Gamma(n) = (n-1)! for every positive integer n. The Gamma function extends the factorial to non-integer arguments.

  • Half-integer value: \Gamma(1/2) = \sqrt{\pi}. This follows from the substitution t = u^2/2, which converts \Gamma(1/2) into the Gaussian integral \int_{-\infty}^{\infty} e^{-u^2/2} \, du = \sqrt{2\pi}.
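
The three facts above are easy to check numerically; the sketch below assumes SciPy's gamma function is available.

```python
import math
from scipy.special import gamma

print(gamma(4.5), 3.5 * gamma(3.5))       # recursion: Gamma(a+1) = a * Gamma(a)
print(gamma(6), math.factorial(5))        # factorial: Gamma(n) = (n-1)!  -> 120.0, 120
print(gamma(0.5), math.sqrt(math.pi))     # half-integer: Gamma(1/2) = sqrt(pi)
```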

Definition 4 Gamma Distribution

A random variable X has the Gamma distribution with shape \alpha > 0 and rate \beta > 0, written X \sim \text{Gamma}(\alpha, \beta), if its PDF is

f(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}, \quad x > 0

The normalization follows from the Gamma function: the substitution u = \beta x transforms \int_0^{\infty} \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x} dx into \frac{1}{\Gamma(\alpha)} \int_0^{\infty} u^{\alpha-1} e^{-u} du = 1.

Gamma family: effect of shape α, special cases (Exponential, Chi-squared), sum of Exponentials → Gamma
Theorem 9 Gamma Moments

If X \sim \text{Gamma}(\alpha, \beta), then E[X] = \frac{\alpha}{\beta} and \text{Var}(X) = \frac{\alpha}{\beta^2}.

Proof [show]

Expectation. We compute directly, using the Gamma function recursion:

E[X] = \int_0^{\infty} x \cdot \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x} \, dx = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \int_0^{\infty} x^{\alpha} e^{-\beta x} \, dx

Substituting u = \beta x:

E[X] = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \cdot \frac{1}{\beta^{\alpha + 1}} \int_0^{\infty} u^{\alpha} e^{-u} \, du = \frac{1}{\beta \, \Gamma(\alpha)} \cdot \Gamma(\alpha + 1) = \frac{\alpha \, \Gamma(\alpha)}{\beta \, \Gamma(\alpha)} = \frac{\alpha}{\beta}

Variance. Similarly, E[X^2] = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \cdot \frac{\Gamma(\alpha + 2)}{\beta^{\alpha+2}} = \frac{\alpha(\alpha+1)}{\beta^2}.

\text{Var}(X) = E[X^2] - (E[X])^2 = \frac{\alpha(\alpha + 1)}{\beta^2} - \frac{\alpha^2}{\beta^2} = \frac{\alpha}{\beta^2}

\square

Theorem 10 Gamma MGF

If X \sim \text{Gamma}(\alpha, \beta), then M_X(t) = \left(\frac{\beta}{\beta - t}\right)^{\alpha} for t < \beta.

Proof [show]
M_X(t) = \int_0^{\infty} e^{tx} \cdot \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x} \, dx = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \int_0^{\infty} x^{\alpha - 1} e^{-(\beta - t)x} \, dx

For t < \beta, substitute u = (\beta - t)x:

M_X(t) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \cdot \frac{\Gamma(\alpha)}{(\beta - t)^{\alpha}} = \left(\frac{\beta}{\beta - t}\right)^{\alpha}

\square

Theorem 11 Gamma Reproductive Property

If X_1 \sim \text{Gamma}(\alpha_1, \beta) and X_2 \sim \text{Gamma}(\alpha_2, \beta) are independent (with the same rate \beta), then

X_1 + X_2 \sim \text{Gamma}(\alpha_1 + \alpha_2, \, \beta)
Proof [show]

By MGFs:

M_{X_1 + X_2}(t) = \left(\frac{\beta}{\beta - t}\right)^{\alpha_1} \cdot \left(\frac{\beta}{\beta - t}\right)^{\alpha_2} = \left(\frac{\beta}{\beta - t}\right)^{\alpha_1 + \alpha_2}

This is the MGF of \text{Gamma}(\alpha_1 + \alpha_2, \beta).

\square

In particular, if X_1, \ldots, X_n \sim \text{Exponential}(\beta) are independent, then X_1 + \cdots + X_n \sim \text{Gamma}(n, \beta). This confirms the Poisson process interpretation: the sum of n independent Exponential waiting times has a Gamma distribution.
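
A quick simulation of this fact (NumPy, an assumption; note that NumPy's exponential sampler is parameterized by the scale 1/β rather than the rate): sums of n Exponential(β) draws should have mean n/β and variance n/β².

```python
import numpy as np

rng = np.random.default_rng(3)
n, beta = 5, 2.0                                           # hypothetical shape and rate
sums = rng.exponential(scale=1 / beta, size=(100_000, n)).sum(axis=1)

# Compare with the Gamma(n, beta) moments.
print(sums.mean(), n / beta)       # both ≈ 2.5
print(sums.var(), n / beta**2)     # both ≈ 1.25
```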

Remark 6 Gamma Exponential Family Form

The Gamma belongs to the exponential family with natural parameters \eta_1 = \alpha - 1 and \eta_2 = -\beta, and sufficient statistics (X, \ln X). When \alpha is known, the single natural parameter is \eta = -\beta with sufficient statistic T(x) = x, and the conjugate prior for \beta is itself a Gamma. Exponential Families unifies this with the Exponential's exponential family form.

Interactive: Gamma Family Explorer (readouts: E[X] = α/β, Var(X) = α/β², mode = (α−1)/β).
Example 4 Gamma GLM: Insurance Claim Amounts

Insurance claim amounts are positive and right-skewed — large claims are rare but impactful. The Gamma distribution is a natural model: the shape parameter \alpha controls the skewness, and the rate \beta controls the scale. In a Gamma GLM, we model claim amounts Y_i \sim \text{Gamma}(\alpha, \beta_i) with \ln E[Y_i] = \beta_0 + \beta_1 x_{i1} + \cdots (log link), allowing covariates (driver age, vehicle type, region) to affect the expected claim size while maintaining the Gamma's positive support and right skew. See Topic 22 §22.6 (Gamma regression) for the full GLM framework and the worked insurance-amounts example.


6.6 The Beta Distribution

The Beta distribution lives on [0, 1] — precisely the range of a probability parameter. This makes it the natural distribution for modeling uncertainty about unknown probabilities, success rates, and proportions. Its two shape parameters give it remarkable flexibility: it can be uniform, bell-shaped, U-shaped, J-shaped, or heavily skewed.

Definition 5 Beta Distribution

A random variable X has the Beta distribution with parameters \alpha > 0 and \beta > 0, written X \sim \text{Beta}(\alpha, \beta), if its PDF is

f(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)} \, x^{\alpha - 1}(1 - x)^{\beta - 1}, \quad 0 < x < 1

The normalizing constant B(\alpha, \beta) = \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha + \beta)} is the Beta function.

Beta distribution: shape regimes (U, J, bell, uniform), mean as α/(α+β), conjugate prior updating
Theorem 12 Beta Moments

If X \sim \text{Beta}(\alpha, \beta), then:

  1. E[X] = \frac{\alpha}{\alpha + \beta}
  2. \text{Var}(X) = \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)}
Proof [show]

Expectation. Let s = \alpha + \beta.

E[X] = \frac{1}{B(\alpha, \beta)} \int_0^1 x \cdot x^{\alpha-1}(1-x)^{\beta-1} \, dx = \frac{1}{B(\alpha, \beta)} \int_0^1 x^{\alpha}(1-x)^{\beta-1} \, dx

The integral is B(\alpha + 1, \beta) = \frac{\Gamma(\alpha+1)\,\Gamma(\beta)}{\Gamma(\alpha+\beta+1)}. Using \Gamma(\alpha + 1) = \alpha \, \Gamma(\alpha) and \Gamma(\alpha + \beta + 1) = (\alpha + \beta) \, \Gamma(\alpha + \beta):

E[X] = \frac{\Gamma(s)}{\Gamma(\alpha)\,\Gamma(\beta)} \cdot \frac{\alpha \, \Gamma(\alpha) \, \Gamma(\beta)}{s \, \Gamma(s)} = \frac{\alpha}{s} = \frac{\alpha}{\alpha + \beta}

Variance. Similarly, E[X^2] = \frac{B(\alpha+2, \beta)}{B(\alpha, \beta)} = \frac{\alpha(\alpha+1)}{s(s+1)}.

\text{Var}(X) = \frac{\alpha(\alpha+1)}{s(s+1)} - \frac{\alpha^2}{s^2} = \frac{\alpha}{s} \cdot \left(\frac{\alpha + 1}{s + 1} - \frac{\alpha}{s}\right) = \frac{\alpha\beta}{s^2(s + 1)}

\square

The mean E[X] = \alpha/(\alpha + \beta) has a beautiful interpretation: \alpha is the "number of successes" and \beta is the "number of failures" in the prior's pseudo-data. As we collect real data, \alpha and \beta grow, and the distribution concentrates around the true probability.

Theorem 13 Beta-Bernoulli Conjugacy

If the prior on \theta is \text{Beta}(\alpha, \beta) and we observe k successes in n independent Bernoulli(\theta) trials, then the posterior is

\theta \mid \text{data} \sim \text{Beta}(\alpha + k, \, \beta + n - k)
Proof [show]

By Bayes’ theorem, the posterior is proportional to the prior times the likelihood:

p(\theta \mid k) \propto p(k \mid \theta) \cdot p(\theta)

The Binomial likelihood is p(k \mid \theta) \propto \theta^k (1-\theta)^{n-k}. The Beta prior is p(\theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}. Multiplying:

p(\theta \mid k) \propto \theta^k (1-\theta)^{n-k} \cdot \theta^{\alpha-1}(1-\theta)^{\beta-1} = \theta^{(\alpha+k)-1}(1-\theta)^{(\beta+n-k)-1}

This is the kernel of a \text{Beta}(\alpha + k, \beta + n - k) density. Since the posterior must integrate to 1, the normalizing constant is B(\alpha + k, \beta + n - k).

\square

This is conjugacy: the prior and posterior belong to the same family. The update rule is additive: add k to \alpha (successes) and n - k to \beta (failures). The posterior mean is:

E[\theta \mid k] = \frac{\alpha + k}{\alpha + \beta + n}

This is a weighted average of the prior mean \alpha/(\alpha+\beta) and the sample proportion k/n, with weights proportional to the "sample sizes" \alpha + \beta (prior) and n (data). As n \to \infty, the posterior concentrates around the true \theta regardless of the prior — the data overwhelms prior beliefs.

Remark 7 Beta Exponential Family Form

The Beta belongs to the exponential family with natural parameters \eta_1 = \alpha - 1 and \eta_2 = \beta - 1, and sufficient statistics (\ln X, \ln(1-X)). The log-partition function is A(\eta_1, \eta_2) = \ln \Gamma(\eta_1 + 1) + \ln \Gamma(\eta_2 + 1) - \ln \Gamma(\eta_1 + \eta_2 + 2). Exponential Families connects this to the Beta-Bernoulli conjugacy via the general theory of conjugate priors.

Interactive: Beta-Bernoulli Conjugate Prior. Starting from a Beta(1, 1) prior (mean 0.5, effective sample size 2), the posterior Beta(8, 14) has mean E[θ|data] ≈ 0.364, a tighter 95% credible interval ([0.181, 0.570] versus [0.025, 0.975]), and effective sample size 22; as n → ∞, the posterior concentrates around the true θ regardless of the prior.
Example 5 Beta-Bernoulli A/B Testing

A company runs an A/B test comparing two button designs. Design A has a \text{Beta}(1, 1) prior (uniform — no prior information about the click rate). After 200 users see Design A, 34 click. The posterior is \text{Beta}(1 + 34, 1 + 166) = \text{Beta}(35, 167).

The posterior mean click rate is 35/202 \approx 0.173, with a 95% credible interval of approximately [0.124, 0.231].

Design B, with prior \text{Beta}(1, 1) and 42 clicks from 200 users, has posterior \text{Beta}(43, 159), mean 43/202 \approx 0.213.

The probability that Design B is better, P(\theta_B > \theta_A \mid \text{data}), can be computed by Monte Carlo: draw from each posterior, count how often \theta_B > \theta_A. This is Bayesian A/B testing — and it starts with the Beta-Bernoulli conjugate pair. Bayesian Foundations (Topic 25) develops the general framework, including the posterior predictive for the Beta-Binomial compound.
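
A minimal Monte Carlo sketch of that comparison, assuming NumPy's Beta sampler: draw from each posterior and average the indicator \theta_B > \theta_A. The printed value is an estimate and will vary slightly with the seed.

```python
import numpy as np

rng = np.random.default_rng(4)
theta_a = rng.beta(35, 167, size=200_000)     # posterior draws for Design A
theta_b = rng.beta(43, 159, size=200_000)     # posterior draws for Design B

print((theta_b > theta_a).mean())             # P(theta_B > theta_A | data), roughly 0.85
```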


6.7 Derived Distributions: Chi-squared, Student’s t, and F

The next three distributions are not independently motivated by a random mechanism. Instead, they are derived from the Normal distribution through specific constructions. They form the test statistic distributions for classical hypothesis testing.

Definition 6 Chi-squared Distribution

If Z_1, \ldots, Z_k are independent N(0, 1) random variables, then

X = Z_1^2 + Z_2^2 + \cdots + Z_k^2 \sim \chi^2(k)

has the Chi-squared distribution with k degrees of freedom. Equivalently, \chi^2(k) = \text{Gamma}(k/2, 1/2).

The equivalence with \text{Gamma}(k/2, 1/2) follows because Z^2 \sim \text{Gamma}(1/2, 1/2) (proved via the change-of-variables formula for densities), and the Gamma reproductive property gives Z_1^2 + \cdots + Z_k^2 \sim \text{Gamma}(k/2, 1/2).

Chi-squared family, Student's t convergence to Normal, F distribution family
Theorem 14 Chi-squared Moments

If X \sim \chi^2(k), then E[X] = k and \text{Var}(X) = 2k.

Proof [show]

Since \chi^2(k) = \text{Gamma}(k/2, 1/2), we apply the Gamma moment formulas:

E[X] = \frac{\alpha}{\beta} = \frac{k/2}{1/2} = k \qquad \text{Var}(X) = \frac{\alpha}{\beta^2} = \frac{k/2}{1/4} = 2k

\square

Theorem 15 Chi-squared MGF

If X \sim \chi^2(k), then M_X(t) = (1 - 2t)^{-k/2} for t < 1/2.

Proof [show]

From the Gamma MGF with \alpha = k/2 and \beta = 1/2:

M_X(t) = \left(\frac{1/2}{1/2 - t}\right)^{k/2} = \left(\frac{1}{1 - 2t}\right)^{k/2} = (1 - 2t)^{-k/2}

\square

Theorem 16 Chi-squared Reproductive Property

If X_1 \sim \chi^2(k_1) and X_2 \sim \chi^2(k_2) are independent, then X_1 + X_2 \sim \chi^2(k_1 + k_2).

Proof [show]

This is the Gamma reproductive property with \beta = 1/2:

\text{Gamma}(k_1/2, 1/2) + \text{Gamma}(k_2/2, 1/2) = \text{Gamma}((k_1 + k_2)/2, 1/2) = \chi^2(k_1 + k_2)

Alternatively, by MGFs: (1-2t)^{-k_1/2} \cdot (1-2t)^{-k_2/2} = (1-2t)^{-(k_1+k_2)/2}.

\square

Definition 7 Student's t Distribution

If Z \sim N(0,1) and V \sim \chi^2(\nu) are independent, then

T = \frac{Z}{\sqrt{V/\nu}} \sim t(\nu)

has Student's t distribution with \nu degrees of freedom. Its PDF is:

f(t) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\;\Gamma\!\left(\frac{\nu}{2}\right)} \left(1 + \frac{t^2}{\nu}\right)^{-(\nu+1)/2}, \quad t \in \mathbb{R}

The t distribution looks like a Normal but has heavier tails — extreme values are more likely. As \nu increases, the tails thin and the t converges to the standard Normal.

Theorem 17 Student's t Moments

If T \sim t(\nu), then:

  1. E[T] = 0 for \nu > 1 (undefined for \nu \le 1)
  2. \text{Var}(T) = \frac{\nu}{\nu - 2} for \nu > 2 (infinite for 1 < \nu \le 2, undefined for \nu \le 1)
Proof [show]

Part 1. The PDF is symmetric about 0: f(-t) = f(t). For \nu > 1, E[|T|] < \infty (the tails decay as |t|^{-(\nu+1)}, which is integrable when \nu + 1 > 2), so E[T] = 0 by symmetry.

For \nu = 1, T has the Cauchy distribution and E[|T|] = \infty, so the mean is undefined.

Part 2. For the variance, we use the construction T = Z/\sqrt{V/\nu}:

E[T^2] = E\!\left[\frac{Z^2}{V/\nu}\right] = \nu \cdot E[Z^2] \cdot E\!\left[\frac{1}{V}\right]

since Z and V are independent, and E[Z^2] = 1. For V \sim \chi^2(\nu) = \text{Gamma}(\nu/2, 1/2):

E[1/V] = \frac{1}{\nu - 2} \quad \text{for } \nu > 2

(This can be verified by direct integration or by using the inverse moments of the Gamma distribution.) Therefore:

E[T^2] = \nu \cdot 1 \cdot \frac{1}{\nu - 2} = \frac{\nu}{\nu - 2}

Since E[T] = 0, \text{Var}(T) = E[T^2] = \nu/(\nu - 2).

\square

Theorem 18 Student's t Converges to Standard Normal

As \nu \to \infty, the t(\nu) distribution converges to N(0, 1). Specifically, \text{Var}(T) = \nu/(\nu - 2) \to 1, and the PDF converges pointwise to \varphi(t).

Proof [show]

From the construction T = Z/\sqrt{V/\nu}: by the law of large numbers, V/\nu \to 1 in probability as \nu \to \infty (since E[V/\nu] = 1 and \text{Var}(V/\nu) = 2/\nu \to 0). By Slutsky's theorem, T = Z/\sqrt{V/\nu} \to Z/\sqrt{1} = Z \sim N(0,1) in distribution.

\square

This convergence justifies using the Normal instead of the t when the sample size is large — the t correction matters primarily when \nu is small (say, \nu < 30).
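
The convergence is easy to see by simulating the construction T = Z/\sqrt{V/\nu} directly (NumPy, an assumption): as \nu grows, the sample variance of T approaches \nu/(\nu - 2), which approaches 1.

```python
import numpy as np

rng = np.random.default_rng(5)
for nu in (3, 10, 30, 100):
    z = rng.normal(size=200_000)
    v = rng.chisquare(nu, size=200_000)
    t = z / np.sqrt(v / nu)                                  # T = Z / sqrt(V / nu) ~ t(nu)
    print(nu, round(t.var(), 3), round(nu / (nu - 2), 3))    # sample variance vs nu/(nu-2)
```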

Definition 8 F Distribution

If U \sim \chi^2(d_1) and V \sim \chi^2(d_2) are independent, then

F = \frac{U/d_1}{V/d_2} \sim F(d_1, d_2)

has the F distribution with d_1 and d_2 degrees of freedom. It takes values on (0, \infty).

Theorem 19 F Moments and t-F Connection

If F \sim F(d_1, d_2), then:

  1. E[F] = \frac{d_2}{d_2 - 2} for d_2 > 2
  2. If T \sim t(\nu), then T^2 \sim F(1, \nu).
Proof [show]

Part 1. Using the construction F = (U/d_1)/(V/d_2):

E[F] = \frac{d_2}{d_1} \cdot E[U] \cdot E[1/V] = \frac{d_2}{d_1} \cdot d_1 \cdot \frac{1}{d_2 - 2} = \frac{d_2}{d_2 - 2}

where E[U] = d_1 and E[1/V] = 1/(d_2 - 2) for d_2 > 2 (as computed in Theorem 17).

Part 2. If T = Z/\sqrt{V/\nu} with Z \sim N(0,1), V \sim \chi^2(\nu), then:

T^2 = \frac{Z^2}{V/\nu} = \frac{Z^2/1}{V/\nu}

Since Z^2 \sim \chi^2(1) and V \sim \chi^2(\nu) are independent, this is (U/1)/(V/\nu) with U \sim \chi^2(1), which is F(1, \nu) by definition.

\square

The t-F connection means that a two-sided t-test (rejecting when |T| > c) is equivalent to an F-test (rejecting when T^2 > c^2). Hypothesis Testing builds extensively on all three derived distributions — the z-test on the Normal, the t-test on Student's t_{n-1} (with null distribution proved via Basu's theorem), and the variance test on \chi^2_{n-1}.


6.8 Relationships Between Distributions

The eight distributions form a rich web of connections. Rather than a single tangled graph, the relationships organize around two hubs.

Remark 8 Two-Hub Relationship Structure

The Gamma Hub. The Gamma distribution subsumes:

  • \text{Exponential}(\lambda) = \text{Gamma}(1, \lambda) — shape \alpha = 1
  • \chi^2(k) = \text{Gamma}(k/2, 1/2) — shape \alpha = k/2, rate \beta = 1/2
  • Sum of iid \text{Exponential}(\beta): X_1 + \cdots + X_n \sim \text{Gamma}(n, \beta)
  • If Y_1 \sim \text{Gamma}(\alpha, 1) and Y_2 \sim \text{Gamma}(\beta, 1) are independent, then Y_1/(Y_1 + Y_2) \sim \text{Beta}(\alpha, \beta)

The Normal Hub. The Normal distribution generates:

  • Z^2 \sim \chi^2(1) — connects the Normal to the Gamma hub
  • Z/\sqrt{V/\nu} \sim t(\nu) — the Student's t construction
  • (U/d_1)/(V/d_2) \sim F(d_1, d_2) — the F construction from Chi-squareds

The two hubs are connected via the Chi-squared, which belongs to both (it is a Gamma special case and it is a sum of squared Normals).

Limit relationships:

  • t(\nu) \to N(0, 1) as \nu \to \infty
  • \chi^2(k)/k \to 1 as k \to \infty (by the law of large numbers)
  • \text{Gamma}(\alpha, \beta) with \alpha \to \infty and \beta = \alpha/\mu converges to N(\mu, \mu^2/\alpha)
Relationship web showing Gamma hub and Normal hub with special-case, sum, and limit connections

6.9 Connections to ML

Example 6 KDE with Gaussian Kernel

Kernel Density Estimation places a small Normal bump K_h(x - x_i) = \frac{1}{h}\varphi\!\left(\frac{x - x_i}{h}\right) at each data point x_i and averages:

\hat{f}(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h} \varphi\!\left(\frac{x - x_i}{h}\right)

The bandwidth h controls the bias-variance tradeoff: small h gives a spiky estimate (low bias, high variance), large h gives a smooth estimate (high bias, low variance). The Normal PDF's smoothness and infinite support make it the default kernel. Kernel Density Estimation (Topic 30) develops the full theory, including the AMISE bias-variance decomposition, the AMISE-optimal bandwidth h^\ast = O(n^{-1/5}), Epanechnikov's optimal-kernel theorem, and data-driven bandwidth selectors (Silverman, Scott, UCV, Sheather-Jones). Topic 30 §30.6 is the featured section.
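
A from-scratch sketch of the estimator above (NumPy; the sample, the grid, and the Silverman-style default bandwidth are all illustrative assumptions):

```python
import numpy as np

def gaussian_kde(data, grid, h=None):
    """Average a Normal bump of width h placed at each data point."""
    data = np.asarray(data, dtype=float)
    if h is None:                                    # Silverman's rule of thumb
        h = 1.06 * data.std(ddof=1) * data.size ** (-1 / 5)
    z = (grid[:, None] - data[None, :]) / h          # all (grid point, data point) pairs
    phi = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)   # standard Normal kernel values
    return phi.mean(axis=1) / h                      # (1/(n h)) * sum of bumps

rng = np.random.default_rng(6)
sample = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])
xs = np.linspace(-6, 6, 400)
density = gaussian_kde(sample, xs)
print(np.trapz(density, xs))                         # ≈ 1, as a density should
```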

Example 7 The t-Test: Comparing Means with Unknown σ

Given n observations from N(\mu, \sigma^2) with \sigma^2 unknown, we form:

T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim t(n - 1) \quad \text{under } H_0\!: \mu = \mu_0

where S is the sample standard deviation. The t distribution accounts for the uncertainty in estimating \sigma — its heavier tails (compared to the Normal) make the test less likely to reject when n is small. As n grows, t(n-1) \approx N(0,1) and the distinction vanishes. Hypothesis Testing develops this rigorously, including the one- and two-sample t-tests; the two-sample F-test for equality of variances is covered in Linear Regression.
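
A minimal worked example (NumPy and SciPy are assumptions, as is the simulated sample): the hand-computed statistic matches scipy.stats.ttest_1samp.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(loc=0.3, scale=1.0, size=12)     # small simulated sample, sigma unknown
mu0 = 0.0                                       # null hypothesis value

t_stat = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(x.size))
p_val = 2 * stats.t.sf(abs(t_stat), df=x.size - 1)   # two-sided p-value from t(n-1)
print(t_stat, p_val)
print(stats.ttest_1samp(x, mu0))                # same statistic and p-value
```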

Example 8 Distribution Choice Guide

Choosing the right distribution for a modeling problem is a core ML skill:

| Data type | Common choice | Why |
| --- | --- | --- |
| Continuous, unbounded, symmetric | Normal | CLT, squared-error loss |
| Continuous, positive, right-skewed | Gamma or Log-Normal | Positive support, flexible skew |
| Waiting times, constant rate | Exponential | Memoryless property |
| Proportions, probabilities | Beta | Support on [0, 1], conjugacy |
| Variance ratios, model comparison | F | Ratio of Chi-squareds |
| Small-sample means, unknown \sigma | Student's t | Heavier tails than Normal |

The key question is not “which distribution fits best?” but “which generative story matches my data?” The PDF shape is a consequence of the mechanism, not the other way around.

ML connections: KDE with Gaussian kernel, Beta-Bernoulli A/B testing, t-test rejection region

Summary

Eight continuous distributions, two structural hubs, one unifying theme: each distribution arises from a specific probabilistic mechanism, and the tools from Expectation, Variance & Moments — E[X], \text{Var}(X), MGF — reveal their properties.

| Distribution | PDF kernel | E[X] | \text{Var}(X) | MGF | Exp. Family? |
| --- | --- | --- | --- | --- | --- |
| Uniform(a, b) | \frac{1}{b-a} | \frac{a+b}{2} | \frac{(b-a)^2}{12} | \frac{e^{tb}-e^{ta}}{t(b-a)} | No |
| Normal(\mu, \sigma^2) | e^{-(x-\mu)^2/(2\sigma^2)} | \mu | \sigma^2 | e^{\mu t + \sigma^2 t^2/2} | Yes |
| Exponential(\lambda) | \lambda e^{-\lambda x} | \frac{1}{\lambda} | \frac{1}{\lambda^2} | \frac{\lambda}{\lambda-t} | Yes |
| Gamma(\alpha, \beta) | x^{\alpha-1}e^{-\beta x} | \frac{\alpha}{\beta} | \frac{\alpha}{\beta^2} | \left(\frac{\beta}{\beta-t}\right)^\alpha | Yes |
| Beta(\alpha, \beta) | x^{\alpha-1}(1-x)^{\beta-1} | \frac{\alpha}{\alpha+\beta} | \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)} | — | Yes |
| \chi^2(k) | x^{k/2-1}e^{-x/2} | k | 2k | (1-2t)^{-k/2} | Yes |
| t(\nu) | (1+t^2/\nu)^{-(\nu+1)/2} | 0 | \frac{\nu}{\nu-2} | — | No |
| F(d_1, d_2) | — | \frac{d_2}{d_2-2} | complex | — | No |

What comes next. This topic cataloged the continuous distributions. The parallel treatment continues:

  • Exponential Families unifies the four exponential family members here with the five from Discrete Distributions, identifying natural parameters, sufficient statistics, and log-partition functions
  • Multivariate Distributions extends the Normal to the multivariate Normal — the star of that topic — and develops joint, marginal, and conditional densities in p dimensions
  • Maximum Likelihood Estimation uses the Normal, Exponential, and Gamma as canonical MLE examples
  • Bayesian Foundations (Topic 25) develops the Beta-Bernoulli, Gamma-Poisson, and Normal-Normal conjugate pairs in full generality, adds Normal-Normal-Inverse-Gamma (unknown σ²) and Dirichlet-Multinomial, and frames them all as instances of the exponential-family conjugacy theorem
  • Order Statistics & Quantiles shows that order statistics of Uniform(0, 1) are Beta-distributed — §29.3 Theorem 2 + Corollary 1 give the full result via the probability-integral transform
  • Hypothesis Testing uses the Chi-squared, Student’s t, and F as test statistic distributions
