foundational 50 min read · April 11, 2026

Random Variables & Distribution Functions

The bridge from events to numbers — PMFs, PDFs, CDFs, and the distribution machinery that makes statistical computation possible

1. Random Variables: The Bridge from Events to Numbers

In Topic 1 and Topic 2, we worked entirely with events — subsets of the sample space $\Omega$. We asked questions like "$P(\text{even}) = ?$" and "$P(A \mid B) = ?$". But most of statistics and machine learning works with numbers, not sets. We want to talk about the "value" of a die roll, the "height" of a person, the "loss" of a model.

A random variable is the bridge. It’s a function that assigns a number to each outcome.

Motivating example. Roll two dice. The sample space is $\Omega = \{(i,j) : i,j \in \{1,\ldots,6\}\}$ with 36 outcomes. Define $S : \Omega \to \mathbb{R}$ by $S((i,j)) = i + j$. Now instead of asking "what is the probability of outcomes in the set $\{(1,6),(2,5),(3,4),(4,3),(5,2),(6,1)\}$?" we simply ask "$P(S = 7) = ?$". The random variable $S$ compresses the 36-element sample space into the numbers $\{2, 3, \ldots, 12\}$.

This is not just notational convenience — it's a change of mathematical universe. Once we have a function $X : \Omega \to \mathbb{R}$, we can add random variables ($X + Y$), multiply them, take expectations ($E[X]$), compute variances ($\text{Var}(X)$), and apply calculus. The entire toolkit of analysis opens up.

Definition 1 Random Variable

Let $(\Omega, \mathcal{F}, P)$ be a probability space. A random variable is a function $X : \Omega \to \mathbb{R}$ that is measurable with respect to $\mathcal{F}$ — meaning that for every Borel set $B \in \mathcal{B}(\mathbb{R})$,

$$X^{-1}(B) = \{\omega \in \Omega : X(\omega) \in B\} \in \mathcal{F}.$$

In practice, the condition we check most often is the CDF condition: for every $x \in \mathbb{R}$,

$$\{\omega \in \Omega : X(\omega) \leq x\} \in \mathcal{F}.$$

This is sufficient because the half-lines $(-\infty, x]$ generate the Borel $\sigma$-algebra (see Topic 1, Example 4).

Remark Why measurability?

The measurability condition ensures that we can assign a probability to every statement of the form "$X \leq x$" or "$X \in B$". Without it, $P(X \leq x)$ is undefined — the set $\{X \leq x\}$ might not be in $\mathcal{F}$, and $P$ only speaks to events in $\mathcal{F}$. For finite or countable sample spaces with the discrete $\sigma$-algebra (the power set), every function is measurable — so the condition is automatic. It only becomes restrictive for uncountable $\Omega$, and even then, every function you'll encounter in practice is measurable.

Why this matters for ML: When you write "$X \sim \mathcal{N}(\mu, \sigma^2)$," you are implicitly working with the Borel $\sigma$-algebra on $\mathbb{R}$. The measurability requirement — that a random variable must pull Borel sets back to events in your $\sigma$-algebra — is what makes statements like "$P(X \leq t)$" well-defined. We formalized $\sigma$-algebras in Topic 1.

Definition 2 Random Vector

A random vector $\mathbf{X} = (X_1, \ldots, X_d) : \Omega \to \mathbb{R}^d$ is a function where each component $X_i$ is a random variable.

Three-panel diagram showing random variable mappings: identity mapping from die outcomes to numbers, indicator mapping, and sum-of-two-dice mapping with arrows from Ω to ℝ
Example 1 Die roll as a random variable

Roll one die. $\Omega = \{1,2,3,4,5,6\}$, $\mathcal{F} = \mathcal{P}(\Omega)$. The identity function $X(\omega) = \omega$ is a random variable. So is $Y(\omega) = \omega^2$, or $Z(\omega) = \mathbf{1}\{\omega \text{ is even}\}$ (the indicator of "even"). Each creates a different mapping from outcomes to numbers.

Example 2 Sum of two dice

$\Omega = \{(i,j) : i,j \in \{1,\ldots,6\}\}$, $S((i,j)) = i + j$. The pre-images are:

  • $S^{-1}(\{2\}) = \{(1,1)\}$, so $P(S = 2) = 1/36$
  • $S^{-1}(\{7\}) = \{(1,6),(2,5),(3,4),(4,3),(5,2),(6,1)\}$, so $P(S = 7) = 6/36 = 1/6$
  • $S^{-1}(\{12\}) = \{(6,6)\}$, so $P(S = 12) = 1/36$

The random variable $S$ induces a probability distribution on $\{2,3,\ldots,12\}$ via the pre-image and $P$.
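The pre-image computation can be checked by brute-force enumeration. A minimal sketch in Python, using exact arithmetic via the standard library's `fractions` module:

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 ordered pairs (i, j)
omega = list(product(range(1, 7), repeat=2))

# PMF of S = i + j via pre-images: p(s) = |S^{-1}({s})| / 36
pmf = {}
for i, j in omega:
    s = i + j
    pmf[s] = pmf.get(s, Fraction(0)) + Fraction(1, 36)
```

The resulting dictionary matches the pre-image counts above: `pmf[7]` is 1/6 and the values sum to exactly 1.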


2. Discrete Random Variables and PMFs

Definition 3 Discrete Random Variable

A random variable $X$ is discrete if it takes values in a finite or countably infinite set $\{x_1, x_2, x_3, \ldots\}$.

The distribution of a discrete random variable is completely described by its probability mass function.

Definition 4 Probability Mass Function (PMF)

The probability mass function (PMF) of a discrete random variable $X$ is the function $p_X : \mathbb{R} \to [0,1]$ defined by

$$p_X(x) = P(X = x) = P(\{\omega \in \Omega : X(\omega) = x\}).$$

The PMF satisfies two properties:

  1. $p_X(x) \geq 0$ for all $x$ (from Axiom 1 of the Kolmogorov axioms, Topic 1)
  2. $\sum_{x} p_X(x) = 1$ (from Axiom 2, summing over all values in the support)

Definition 5 Support

The support of a discrete random variable $X$ is the set of values where the PMF is positive:

$$\text{supp}(X) = \{x \in \mathbb{R} : p_X(x) > 0\}.$$

Three-panel bar chart showing PMFs: Bernoulli(0.3), Binomial(10, 0.3), and Geometric(0.3)
Example 3 Bernoulli random variable

A Bernoulli random variable models a single trial with two outcomes. $X \in \{0, 1\}$ with $P(X = 1) = p$ and $P(X = 0) = 1 - p$ for some $p \in [0,1]$. Written compactly:

$$p_X(x) = p^x (1-p)^{1-x}, \quad x \in \{0, 1\}.$$

We write $X \sim \text{Bernoulli}(p)$. This is the building block of classification: "$Y \sim \text{Bernoulli}(\sigma(w^\top x))$" is logistic regression.

Example 4 Binomial random variable

The sum of $n$ independent $\text{Bernoulli}(p)$ trials. $X \in \{0, 1, \ldots, n\}$ with

$$p_X(k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, \ldots, n.$$

We write $X \sim \text{Binomial}(n, p)$. The binomial coefficient $\binom{n}{k}$ counts the number of ways to arrange $k$ successes among $n$ trials — this is the combinatorics from Topic 1 (§6) in action.

For a discrete random variable, $p_X(x) = P(X = x)$ is an actual probability. This is NOT true for continuous random variables (as we'll see in §3). The probability of any event involving $X$ is computed by summing the PMF:

$$P(X \in A) = \sum_{x \in A} p_X(x).$$
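As a concrete check, the $\text{Binomial}(10, 0.3)$ PMF from Example 4 can be tabulated directly from the formula; a sketch using the standard library's `math.comb`:

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Tabulate the full support k = 0..10 for n = 10, p = 0.3
probs = [binomial_pmf(k, 10, 0.3) for k in range(11)]
```

Summing `probs` recovers Axiom 2: the PMF over the whole support totals 1.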


3. Continuous Random Variables and PDFs

When the sample space is uncountable — say $\Omega = [0,1]$ — a random variable $X$ can take any real value in a continuum. For any specific value $x$, we have $P(X = x) = 0$. This is not a pathology; it's a fundamental feature of the uncountable. At most countably many points can carry positive probability (otherwise the total would exceed 1), so a distribution spread smoothly over a continuum must assign probability zero to each individual point.

So we can’t describe continuous random variables with a PMF. Instead, we describe them with a density — a function whose integral gives probabilities.

Definition 6 Continuous Random Variable and PDF

A random variable $X$ is continuous if there exists a non-negative function $f_X : \mathbb{R} \to [0, \infty)$ such that for every interval $[a, b]$,

$$P(a \leq X \leq b) = \int_a^b f_X(x) \, dx.$$

The function $f_X$ is called the probability density function (PDF). It satisfies:

  1. $f_X(x) \geq 0$ for all $x$
  2. $\int_{-\infty}^{\infty} f_X(x) \, dx = 1$

These mirror the PMF conditions — replace sums with integrals.

Remark The density is NOT a probability

This is perhaps the most common confusion in probability. For a continuous random variable:

  • $f_X(x)$ is a density, not a probability. It can be greater than 1.
  • $f_X(x) \, dx$ is (informally) the probability that $X$ falls in a tiny interval around $x$.
  • Only integrals of the density give probabilities.
  • $P(X = x) = 0$ for every individual $x$. Probabilities of intervals are what matter.
Three-panel PDF plot: Uniform(0,1) with shaded probability region, Standard Normal with shaded area, and Uniform(0, 0.5) showing density exceeding 1
Example 5 Uniform distribution on [0, 1]

$X \sim \text{Uniform}(0,1)$ has density $f_X(x) = 1$ for $x \in [0,1]$ and $f_X(x) = 0$ otherwise. Then $P(0.2 \leq X \leq 0.5) = \int_{0.2}^{0.5} 1 \, dx = 0.3$. The probability equals the length of the interval — probability is area under the density curve.

Example 6 Standard normal distribution

$Z \sim \mathcal{N}(0, 1)$ has density

$$f_Z(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}.$$

This density is positive everywhere on $\mathbb{R}$ and integrates to 1 (a non-trivial fact requiring the Gaussian integral $\int_{-\infty}^{\infty} e^{-z^2/2} \, dz = \sqrt{2\pi}$).

Note that $f_Z(0) = 1/\sqrt{2\pi} \approx 0.399$ — less than 1 in this case, but for a $\text{Uniform}(0, 0.5)$ distribution, $f_X(x) = 2$ on $[0, 0.5]$, which exceeds 1. The density is not bounded by 1.
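Both claims are easy to confirm numerically; a sketch using a Riemann sum over a wide grid (NumPy assumed):

```python
import numpy as np

# Standard normal density on a fine grid wide enough to capture all the mass
z = np.linspace(-10.0, 10.0, 200_001)
f = np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)

total = float(f.sum() * (z[1] - z[0]))  # Riemann sum ≈ ∫ f(z) dz = 1
peak = float(f.max())                   # f(0) = 1/sqrt(2π) ≈ 0.399 < 1

# A Uniform(0, 0.5) density is 2 on its support — a valid density above 1
uniform_density = 1 / 0.5
```

The tail mass beyond $\pm 10$ is astronomically small, so the truncated sum still lands within numerical tolerance of 1.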



4. The Cumulative Distribution Function

The CDF works for every random variable — discrete, continuous, or mixed. It is the universal descriptor.

Definition 7 Cumulative Distribution Function (CDF)

The cumulative distribution function (CDF) of a random variable $X$ is

$$F_X(x) = P(X \leq x), \quad x \in \mathbb{R}.$$

Theorem 1 Properties of the CDF

For any random variable $X$, the CDF $F_X$ satisfies:

  1. Non-decreasing: If $a \leq b$, then $F_X(a) \leq F_X(b)$.
  2. Right-continuous: $\lim_{h \to 0^+} F_X(x + h) = F_X(x)$ for all $x$.
  3. Limits at infinity: $\lim_{x \to -\infty} F_X(x) = 0$ and $\lim_{x \to \infty} F_X(x) = 1$.
Proof Properties of the CDF (Theorem 1) [show]

(1) If $a \leq b$, then $\{X \leq a\} \subseteq \{X \leq b\}$. By monotonicity of probability (Topic 1, Theorem 2): $F_X(a) = P(X \leq a) \leq P(X \leq b) = F_X(b)$.

(2) Let $h_n \downarrow 0$. Then the events $A_n = \{X \leq x + h_n\}$ decrease to $\{X \leq x\}$. By continuity of probability from above (Topic 1, Theorem 6b):

$$\lim_{n \to \infty} P(X \leq x + h_n) = P(X \leq x),$$

i.e., $F_X(x + h_n) \to F_X(x)$.

(3) For $x_n \to -\infty$, $\{X \leq x_n\} \downarrow \emptyset$, so $F_X(x_n) \to P(\emptyset) = 0$ by continuity from above. For $x_n \to \infty$, $\{X \leq x_n\} \uparrow \Omega$, so $F_X(x_n) \to P(\Omega) = 1$ by continuity from below.

The limit properties rely on the convergence framework from Calculus: Sequences & Limits. $\square$

Remark Using the CDF to compute probabilities

The CDF gives us a direct route to probabilities of intervals:

$$P(a < X \leq b) = F_X(b) - F_X(a)$$

This follows from $\{X \leq b\} = \{X \leq a\} \cup \{a < X \leq b\}$ (a disjoint union).

For continuous random variables, $P(X = x) = 0$, so strict/non-strict inequalities don't matter: $P(a \leq X \leq b) = P(a < X < b) = F_X(b) - F_X(a)$.

For discrete random variables, the CDF is a step function — it jumps at each value in the support. The jump at $x$ has height $P(X = x) = p_X(x)$.

Three-panel CDF comparison: discrete step CDF for Binomial, smooth CDF for Normal, and CDF-to-probability interval method

For a continuous random variable with PDF $f_X$:

$$F_X(x) = \int_{-\infty}^{x} f_X(t) \, dt$$

And by the fundamental theorem of calculus, wherever $f_X$ is continuous:

$$f_X(x) = F_X'(x)$$

The PDF is the derivative of the CDF. The CDF is the antiderivative of the PDF. This connection requires the integration and differentiation machinery from Calculus: Change of Variables.
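The derivative relationship can be checked numerically for the standard normal, whose CDF has a closed form via the error function. A sketch using only the standard library:

```python
from math import erf, exp, pi, sqrt

def Phi(x: float) -> float:
    """Standard normal CDF: F(x) = (1 + erf(x / sqrt(2))) / 2."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def phi(x: float) -> float:
    """Standard normal PDF."""
    return exp(-x * x / 2) / sqrt(2 * pi)

# Central difference F'(x) ≈ (F(x+h) - F(x-h)) / (2h) should match f(x)
h = 1e-6
deriv = (Phi(1.0 + h) - Phi(1.0 - h)) / (2 * h)

# Interval probability from the CDF: P(-1 < Z <= 1) = F(1) - F(-1)
p_interval = Phi(1.0) - Phi(-1.0)   # ≈ 0.6827, the familiar "68% rule"
```

The finite difference of `Phi` agrees with `phi` to well within the step size, illustrating $f_X = F_X'$.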


5. Joint Distributions

In ML, we almost never work with a single random variable. A feature vector $\mathbf{X} = (X_1, \ldots, X_d)$ is a collection of $d$ random variables. Understanding the joint behavior — how they relate to each other — is essential.

Definition 8 Joint PMF

For discrete random variables $X$ and $Y$, the joint PMF is

$$p_{X,Y}(x, y) = P(X = x \text{ and } Y = y) = P(X = x, Y = y).$$

It satisfies $p_{X,Y}(x,y) \geq 0$ and $\sum_x \sum_y p_{X,Y}(x,y) = 1$.

Definition 9 Joint PDF

For continuous random variables $X$ and $Y$, the joint PDF is a function $f_{X,Y} : \mathbb{R}^2 \to [0, \infty)$ such that

$$P((X,Y) \in A) = \iint_A f_{X,Y}(x, y) \, dx \, dy$$

for every (measurable) set $A \subseteq \mathbb{R}^2$. It satisfies $f_{X,Y}(x,y) \geq 0$ and $\iint f_{X,Y}(x,y)\,dx\,dy = 1$.

Definition 10 Joint CDF

The joint CDF is

$$F_{X,Y}(x, y) = P(X \leq x, Y \leq y).$$

This works for any pair of random variables.

Three-panel joint distribution visualization: two-dice heatmap, bivariate normal contours with ρ=0, and bivariate normal contours with ρ=0.7
Example 7 Joint distribution of two dice

Roll two fair dice. Let $X$ = first die, $Y$ = second die. The joint PMF is $p_{X,Y}(i, j) = 1/36$ for $i, j \in \{1,\ldots,6\}$. The marginal of $X$ is $p_X(i) = \sum_{j=1}^6 1/36 = 6/36 = 1/6$ — as expected.

Example 8 Standard bivariate normal

$(X, Y) \sim \mathcal{N}(\mathbf{0}, \Sigma)$ where $\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$. The joint PDF is

$$f_{X,Y}(x,y) = \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\left(-\frac{x^2 - 2\rho xy + y^2}{2(1-\rho^2)}\right).$$

The parameter $\rho \in (-1, 1)$ controls the correlation — the linear dependence between $X$ and $Y$. When $\rho = 0$, the joint factors as $f_X(x) \cdot f_Y(y)$, and $X$ and $Y$ are independent.


6. Marginal Distributions

Definition 11 Marginal PMF

For discrete $X, Y$, the marginal PMF of $X$ is

$$p_X(x) = \sum_{y} p_{X,Y}(x, y).$$

Definition 12 Marginal PDF

For continuous $X, Y$ with joint PDF $f_{X,Y}$, the marginal PDF of $X$ is

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dy.$$

In both cases, we "integrate out" (or "sum out") the variable we don't want. This connects directly to the law of total probability from Topic 2: the marginal is a weighted average over all possible values of the other variable.
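Once the joint is tabulated, marginalization is a one-line array operation; a sketch for the two-dice joint (NumPy assumed):

```python
import numpy as np

# Joint PMF of two fair dice: a uniform 6x6 table
joint = np.full((6, 6), 1 / 36)

# Marginals: sum out the axis you don't want
p_x = joint.sum(axis=1)   # marginal of the first die: six entries of 1/6
p_y = joint.sum(axis=0)   # marginal of the second die
```

Each marginal is the uniform $1/6$ vector, matching the hand computation in Example 7.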

Theorem 2 Marginals from the Joint

The marginal distributions are uniquely determined by the joint distribution — sum over (discrete) or integrate out (continuous) the other variables.

Proof Marginals from the Joint (Theorem 2) [show]

For the discrete case:

$$p_X(x) = P(X = x) = P\left(\bigcup_y \{X = x, Y = y\}\right).$$

Since the events $\{X = x, Y = y\}$ are pairwise disjoint for different $y$, countable additivity (Topic 1, Axiom 3) gives

$$p_X(x) = \sum_y P(X = x, Y = y) = \sum_y p_{X,Y}(x,y). \quad \square$$

Joint PDF contour plot of bivariate normal with marginal density curves displayed on the top and right axes
Remark Joint determines marginals, not vice versa

Knowing $f_X$ and $f_Y$ individually is not enough to reconstruct $f_{X,Y}$. The joint encodes the dependence structure between $X$ and $Y$ — information lost when we marginalize. This is why "correlation does not imply causation" has a precise mathematical meaning: the marginals constrain the joint, but don't determine it. Specifying the dependence structure is the job of copulas — see Multivariate Distributions.


7. Conditional Distributions

Conditional distributions extend conditional probability (Topic 2) from events to random variables. Instead of "$P(A \mid B)$" for events, we ask "$P(X \leq x \mid Y = y)$" — the distribution of $X$ given that $Y$ takes a specific value.

Definition 13 Conditional PMF

For discrete $X, Y$ with $p_Y(y) > 0$,

$$p_{X|Y}(x \mid y) = \frac{p_{X,Y}(x, y)}{p_Y(y)}.$$

This is exactly the definition of conditional probability (Topic 2, Definition 1) applied to the events $\{X = x\}$ and $\{Y = y\}$: $P(X = x \mid Y = y) = P(X = x, Y = y) / P(Y = y)$.
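For a concrete case, condition the first die $X$ on the sum $S = 7$: dividing a column of the joint table by the marginal gives back a uniform conditional. A sketch (NumPy assumed):

```python
import numpy as np

# Joint PMF of (X, S): X = first die, S = sum of two dice
joint = np.zeros((6, 13))            # rows: x = 1..6, columns: s = 0..12
for i in range(1, 7):
    for j in range(1, 7):
        joint[i - 1, i + j] += 1 / 36

p_s = joint.sum(axis=0)              # marginal of S
cond = joint[:, 7] / p_s[7]          # p_{X|S}(x | 7): uniform over 1..6
```

Given $S = 7$, every value of the first die is equally likely, and the conditional PMF sums to 1 as any PMF must.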

Definition 14 Conditional PDF

For continuous $X, Y$ with $f_Y(y) > 0$,

$$f_{X|Y}(x \mid y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}.$$

This requires more care than the discrete case — since $P(Y = y) = 0$ for continuous $Y$, we can't directly condition on $\{Y = y\}$. Instead, the conditional CDF is defined as the limit $F_{X|Y}(x \mid y) = \lim_{\epsilon \to 0} P(X \leq x \mid y - \epsilon < Y \leq y + \epsilon)$, and the conditional PDF is its derivative in $x$. The formula above is the result of that limit.

Three-panel conditional distribution visualization: joint density with slice line, normalized conditional PDF, and multiple conditional PDFs for different values of y
Theorem 3 Chain Rule for Densities

The joint density factors as

$$f_{X,Y}(x, y) = f_{X|Y}(x \mid y) \cdot f_Y(y) = f_{Y|X}(y \mid x) \cdot f_X(x).$$

This is the density version of the multiplication rule (Topic 2, Theorem 1). Rearranging gives the density version of Bayes' theorem:

$$f_{X|Y}(x \mid y) = \frac{f_{Y|X}(y \mid x) \cdot f_X(x)}{f_Y(y)}.$$

Proof Chain Rule for Densities (Theorem 3) [show]

Rearrange the definition $f_{X|Y}(x \mid y) = f_{X,Y}(x,y) / f_Y(y)$ to get

$$f_{X,Y}(x,y) = f_{X|Y}(x \mid y) \cdot f_Y(y).$$

The symmetric factorization follows by exchanging the roles of $X$ and $Y$. $\square$

Example 9 Conditional distribution of the bivariate normal

If $(X, Y)$ are jointly normal with correlation $\rho$, then $X \mid Y = y$ is normal:

$$X \mid Y = y \sim \mathcal{N}\left(\mu_X + \rho \frac{\sigma_X}{\sigma_Y}(y - \mu_Y), \; \sigma_X^2(1 - \rho^2)\right).$$

Two remarkable facts:

  1. The conditional mean is linear in $y$: this is why linear regression works for jointly normal data.
  2. The conditional variance $\sigma_X^2(1 - \rho^2)$ does not depend on $y$ — this is the homoscedasticity assumption.
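Both facts can be spot-checked by simulation: sample a standard bivariate normal, keep the points whose $y$ falls in a thin slice around $y_0$, and compare the slice's mean and variance to the formulas. With $\mu_X = \mu_Y = 0$ and $\sigma_X = \sigma_Y = 1$, the theory predicts mean $\rho y_0$ and variance $1 - \rho^2$. A sketch (NumPy assumed; $\rho = 0.7$, $y_0 = 1$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
rho, n = 0.7, 1_000_000

# Standard bivariate normal with correlation rho
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)

# Condition on a thin slice around y0 = 1.0
y0 = 1.0
slice_x = x[np.abs(y - y0) < 0.02]

cond_mean = slice_x.mean()   # theory: rho * y0 = 0.7
cond_var = slice_x.var()     # theory: 1 - rho^2 = 0.51
```

The slice statistics land close to the theoretical values; narrowing the slice (at the cost of fewer samples) tightens the approximation to exact conditioning.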

8. Independence of Random Variables

Independence of random variables extends the concept of independence from events (Topic 2, Definitions 3–4) to distribution functions.

Definition 15 Independence of Random Variables

Random variables $X$ and $Y$ are independent (written $X \perp\!\!\!\perp Y$) if for every pair of Borel sets $A, B \subseteq \mathbb{R}$,

$$P(X \in A, Y \in B) = P(X \in A) \cdot P(Y \in B).$$

Equivalently, any of the following:

  • CDF factorization: $F_{X,Y}(x, y) = F_X(x) \cdot F_Y(y)$ for all $x, y$.
  • PMF factorization (discrete): $p_{X,Y}(x, y) = p_X(x) \cdot p_Y(y)$ for all $x, y$.
  • PDF factorization (continuous): $f_{X,Y}(x, y) = f_X(x) \cdot f_Y(y)$ for all $x, y$.
Three-panel comparison of independent vs dependent bivariate normals: circular contours for ρ=0, elliptical contours for ρ≠0, and scatter samples from both
Theorem 4 Independence ⟺ Joint Factors

$X \perp\!\!\!\perp Y$ if and only if $f_{X,Y}(x,y) = f_X(x) f_Y(y)$ (continuous case) or $p_{X,Y}(x,y) = p_X(x) p_Y(y)$ (discrete case) for all $x, y$.

Proof Independence ⟺ Joint Factors (Theorem 4) [show]

($\Rightarrow$) Assume $X \perp\!\!\!\perp Y$. Then for all $x, y$:

$$F_{X,Y}(x,y) = P(X \leq x, Y \leq y) = P(X \leq x) \cdot P(Y \leq y) = F_X(x) \cdot F_Y(y).$$

Taking the mixed partial derivative with respect to $x$ and $y$:

$$f_{X,Y}(x,y) = \frac{\partial^2}{\partial x \, \partial y} F_{X,Y}(x,y) = F_X'(x) \cdot F_Y'(y) = f_X(x) \cdot f_Y(y).$$

($\Leftarrow$) Assume $f_{X,Y}(x,y) = f_X(x) f_Y(y)$. Then for any Borel sets $A, B$:

$$P(X \in A, Y \in B) = \int_A \int_B f_{X,Y}(x,y) \, dy \, dx = \int_A f_X(x) \, dx \int_B f_Y(y) \, dy = P(X \in A) \cdot P(Y \in B).$$

The integral factors by Fubini's theorem. $\square$

Theorem 5 Independence ⟹ Conditional = Marginal

If $X \perp\!\!\!\perp Y$, then $f_{X|Y}(x \mid y) = f_X(x)$ — knowing $Y$ tells you nothing new about $X$.

Proof Independence ⟹ Conditional = Marginal (Theorem 5) [show]

$$f_{X|Y}(x \mid y) = \frac{f_{X,Y}(x,y)}{f_Y(y)} = \frac{f_X(x) f_Y(y)}{f_Y(y)} = f_X(x). \quad \square$$

This is the random variable version of the event independence criterion $P(A \mid B) = P(A)$ from Topic 2.

Example 10 Independent vs dependent normals

Let $(X, Y)$ be bivariate normal. They are independent if and only if $\rho = 0$. When $\rho \neq 0$, knowing $Y$ shifts the conditional mean of $X$ (by Example 9), so $X$ and $Y$ are dependent. The contour plots make this visible: $\rho = 0$ gives circular contours (factored density), $\rho \neq 0$ gives elliptical contours (non-factored).
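In the discrete world, the factorization test of Theorem 4 is a direct array comparison: the two-dice joint factors into the outer product of its marginals, while the joint of the first die and the sum does not. A sketch (NumPy assumed):

```python
import numpy as np

# Two independent fair dice: the joint equals the outer product of marginals
joint = np.full((6, 6), 1 / 36)
factored = np.outer(joint.sum(axis=1), joint.sum(axis=0))
dice_independent = bool(np.allclose(joint, factored))        # factorizes

# First die X vs sum S = X + Y: the joint does NOT factor
joint_xs = np.zeros((6, 13))
for i in range(1, 7):
    for j in range(1, 7):
        joint_xs[i - 1, i + j] += 1 / 36
factored_xs = np.outer(joint_xs.sum(axis=1), joint_xs.sum(axis=0))
xs_independent = bool(np.allclose(joint_xs, factored_xs))    # does not
```

The second check fails because, e.g., $P(X = 1, S = 12) = 0$ while $P(X = 1) P(S = 12) = (1/6)(1/36) > 0$.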


9. Transformations of Random Variables

If $X$ has a known distribution and $Y = g(X)$ for some function $g$, what is the distribution of $Y$? This arises constantly:

  • If $X$ is a measurement, $Y = aX + b$ is a rescaled measurement.
  • If $X \sim \text{Uniform}(0,1)$, then $Y = -\log(X) \sim \text{Exponential}(1)$ — the inverse CDF transform.
  • If $X$ is a model's raw output, $Y = \sigma(X) = 1/(1 + e^{-X})$ is the sigmoid-transformed probability.
  • If $X$ is a random variable, $Y = X^2$ appears in the chi-squared distribution.
Theorem 6 CDF Method

Let $X$ be a random variable with known CDF $F_X$ and let $Y = g(X)$. Then

$$F_Y(y) = P(Y \leq y) = P(g(X) \leq y).$$

To find $F_Y$, solve the inequality $g(X) \leq y$ for $X$, then express the result in terms of $F_X$.

Proof CDF Method (Theorem 6) [show]

This is the definition of the CDF applied to $Y$. The content is in the method, not the proof. $\square$

Example 11 Linear transformation

Let $Y = aX + b$ with $a > 0$. Then

$$F_Y(y) = P(aX + b \leq y) = P(X \leq (y-b)/a) = F_X\left(\frac{y-b}{a}\right).$$

Differentiating: $f_Y(y) = \frac{1}{a} f_X\left(\frac{y-b}{a}\right)$. If $a < 0$, the inequality flips and $f_Y(y) = \frac{1}{|a|} f_X\left(\frac{y-b}{a}\right)$.

Application: If $X \sim \mathcal{N}(\mu, \sigma^2)$, then $Z = (X - \mu)/\sigma \sim \mathcal{N}(0,1)$. Standardization is a linear transformation.
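Standardization is easy to verify by simulation: draw from $\mathcal{N}(\mu, \sigma^2)$, apply the linear map, and check that the result has mean 0 and standard deviation 1. A sketch (NumPy assumed; the specific $\mu = 2$, $\sigma = 3$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 2.0, 3.0

# X ~ N(mu, sigma^2), itself built as a linear transform of standard normals
x = mu + sigma * rng.standard_normal(500_000)

# Standardize: Z = (X - mu) / sigma should be ~ N(0, 1)
z = (x - mu) / sigma
```

The sample mean and standard deviation of `z` land near 0 and 1 up to Monte Carlo error.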

Three-panel transformation visualization: PDF of X ~ N(0,1), CDF method for Y = X², and resulting chi-squared(1) PDF

When $g$ is monotonic and differentiable, we get a direct formula for the PDF.

Theorem 7 Change of Variables for PDFs

Let $X$ be a continuous random variable with PDF $f_X$. Let $Y = g(X)$ where $g$ is strictly monotonic and differentiable with inverse $g^{-1}$. Then

$$f_Y(y) = f_X(g^{-1}(y)) \cdot |(g^{-1})'(y)|.$$

The factor $|(g^{-1})'(y)|$ is the Jacobian — it accounts for how $g$ stretches or compresses intervals.

Proof Change of Variables for PDFs (Theorem 7) [show]

Assume $g$ is strictly increasing (the decreasing case is similar with a sign flip).

$$F_Y(y) = P(g(X) \leq y) = P(X \leq g^{-1}(y)) = F_X(g^{-1}(y)).$$

Differentiate using the chain rule:

$$f_Y(y) = f_X(g^{-1}(y)) \cdot (g^{-1})'(y).$$

Since $g$ is increasing, $g^{-1}$ is also increasing, so $(g^{-1})'(y) = |(g^{-1})'(y)| > 0$.

For decreasing $g$, the inequality flips: $P(g(X) \leq y) = P(X \geq g^{-1}(y)) = 1 - F_X(g^{-1}(y))$, and differentiating introduces a minus sign that the absolute value absorbs. $\square$

This proof uses the chain rule and substitution rule from Calculus: Change of Variables.

Example 12 Log transformation (lognormal)

Let $X \sim \mathcal{N}(\mu, \sigma^2)$ and $Y = e^X$ (i.e., $X = \log Y$). Then $g^{-1}(y) = \log y$ and $(g^{-1})'(y) = 1/y$. So

$$f_Y(y) = f_X(\log y) \cdot \frac{1}{y} = \frac{1}{y\sigma\sqrt{2\pi}} \exp\left(-\frac{(\log y - \mu)^2}{2\sigma^2}\right), \quad y > 0.$$

This is the lognormal distribution. It arises when the logarithm of a quantity is normally distributed — common for financial returns, biological measurements, and the distribution of weights in neural networks.
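As a sanity check, the density produced by the change-of-variables formula should still integrate to 1. A numeric sketch with $\mu = 0$, $\sigma = 0.5$ (arbitrary choices), NumPy assumed:

```python
import numpy as np

mu, sigma = 0.0, 0.5

def f_Y(y):
    """Lognormal density from the change-of-variables formula."""
    return np.exp(-((np.log(y) - mu) ** 2) / (2 * sigma**2)) / (
        y * sigma * np.sqrt(2 * np.pi)
    )

# Riemann sum over a grid covering essentially all of the mass
y = np.linspace(1e-6, 50.0, 2_000_001)
total = float(f_Y(y).sum() * (y[1] - y[0]))   # ≈ 1
```

The Jacobian factor $1/y$ is exactly what keeps the total mass at 1 after the nonlinear stretch of $e^X$.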

Remark The Jacobian in higher dimensions

For a vector transformation $\mathbf{Y} = g(\mathbf{X})$ where $g : \mathbb{R}^d \to \mathbb{R}^d$ is a diffeomorphism, the change of variables formula becomes

$$f_{\mathbf{Y}}(\mathbf{y}) = f_{\mathbf{X}}(g^{-1}(\mathbf{y})) \cdot |\det J_{g^{-1}}(\mathbf{y})|$$

where $J_{g^{-1}}$ is the Jacobian matrix. This is the formula behind Normalizing Flows — a class of generative models that learn invertible transformations of simple distributions. Each flow layer applies this change of variables formula.


10. Expectation (Preview)

We close the mathematical content with a brief preview of the next topic. Once we have random variables and their distributions, the next natural question is: what is the “average” value?

Definition 16 Expectation (preview)

The expectation (or expected value, or mean) of a random variable $X$ is

$$E[X] = \sum_x x \cdot p_X(x) \quad \text{(discrete)}$$

$$E[X] = \int_{-\infty}^{\infty} x \cdot f_X(x) \, dx \quad \text{(continuous)}$$

provided the sum or integral converges absolutely.

The expectation is the "center of mass" of the distribution — if you placed the PMF bars (or PDF curve) on a number line and balanced it on a fulcrum, the balance point is $E[X]$.
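For a fair die the discrete formula gives the familiar 3.5; a two-line sketch with exact arithmetic:

```python
from fractions import Fraction

# E[X] = sum of x * p(x) for a fair six-sided die
pmf = {x: Fraction(1, 6) for x in range(1, 7)}
EX = sum(x * p for x, p in pmf.items())   # 7/2
```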

These concepts — along with variance, covariance, higher moments, and moment-generating functions — are the subject of Expectation, Variance & Moments. There we prove the linearity of expectation, derive the variance decomposition, establish the law of total expectation $E[X] = E[E[X \mid Y]]$, and connect moments to the shapes of distributions.


11. Connections to ML

Feature Vectors as Random Vectors

Every dataset in ML is implicitly modeled as a sample from a random vector. A training example $(\mathbf{x}_i, y_i)$ is a realization of the random vector $(\mathbf{X}, Y)$ where $\mathbf{X} = (X_1, \ldots, X_d)$ is the feature vector and $Y$ is the target.

How ML concepts map to random-variable formulations:

  • Feature vector: random vector $\mathbf{X} = (X_1, \ldots, X_d)$
  • Training set: $n$ i.i.d. draws from $P_{\mathbf{X},Y}$
  • Class probabilities: PMF $p_Y(k)$ or conditional PMF $p_{Y\mid\mathbf{X}}(k \mid \mathbf{x})$
  • Regression target: continuous $Y$ with conditional density $f_{Y\mid\mathbf{X}}(y \mid \mathbf{x})$

Generative vs. Discriminative: Joint vs. Conditional

The generative/discriminative distinction maps directly to joint vs. conditional distributions:

  • Generative models learn the joint distribution $P(\mathbf{X}, Y)$ — or equivalently, the class-conditional $P(\mathbf{X} \mid Y)$ and prior $P(Y)$. Examples: Gaussian discriminant analysis, naive Bayes, VAEs, diffusion models.
  • Discriminative models learn the conditional distribution $P(Y \mid \mathbf{X})$ directly. Examples: logistic regression, neural networks, random forests.

From the joint, you can always get the conditional (via Bayes' theorem from Topic 2). But generative models must model the full $\mathbf{X}$ distribution — a harder problem in high dimensions.

Softmax Outputs as PMFs

A neural network classifier with softmax output computes $p_Y(k \mid \mathbf{x}) = \text{softmax}(z)_k$ for classes $k = 1, \ldots, K$. This output is a PMF: $p_Y(k \mid \mathbf{x}) \geq 0$ for all $k$, and $\sum_{k=1}^K p_Y(k \mid \mathbf{x}) = 1$ by construction of softmax. The softmax function guarantees that the output satisfies the PMF axioms — making the network's predictions interpretable as conditional probabilities.
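The PMF axioms hold for any logit vector; a minimal sketch of a numerically stable softmax (NumPy assumed):

```python
import numpy as np

def softmax(z):
    """Map raw logits to a PMF; subtracting the max avoids overflow."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

p = softmax(np.array([2.0, -1.0, 0.5]))   # non-negative, sums to 1
```

Subtracting the maximum logit leaves the output unchanged (the factor cancels in the ratio) but keeps `np.exp` from overflowing on large logits.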

Loss Functions as Transformations

The loss $L = \ell(Y, \hat{Y})$ is a transformation of random variables. If $Y$ and $\hat{Y}$ have known distributions, the distribution of $L$ determines the risk — and the change of variables formula (Theorem 7) is how we analyze it.

The Inverse CDF Transform

Theorem 8 Universality of the Uniform (Inverse CDF Transform)

Let UUniform(0,1)U \sim \text{Uniform}(0,1) and let FF be any CDF with quantile function F1(u)=inf{x:F(x)u}F^{-1}(u) = \inf\{x : F(x) \geq u\}. Then X=F1(U)X = F^{-1}(U) has CDF FF.

Proof Universality of the Uniform (Theorem 8) [show]

P(Xx)=P(F1(U)x)=P(UF(x))=F(x)P(X \leq x) = P(F^{-1}(U) \leq x) = P(U \leq F(x)) = F(x)

since UUniform(0,1)U \sim \text{Uniform}(0,1) gives P(Ut)=tP(U \leq t) = t for t[0,1]t \in [0,1]. \square

Three-panel inverse CDF transform: Uniform(0,1) samples, inverse CDF mapping via quantile function, and resulting Normal samples

This is how you sample from any distribution using only uniform random numbers. It’s the foundation of Monte Carlo simulation and the basis of formalML: Normalizing Flows in generative modeling. The Fisher information metric that gives the space of distributions its geometry is developed in formalML: Information Geometry , and the KL divergence used in formalML: Variational Inference relies on the conditional PDF framework established here.


12. Summary

ConceptKey Idea
Random variable X:ΩRX : \Omega \to \mathbb{R}A measurable function mapping outcomes to numbers
PMF pX(x)=P(X=x)p_X(x) = P(X = x)Each value IS a probability (discrete)
PDF fX(x)f_X(x)NOT a probability; integrate for probability (continuous)
CDF FX(x)=P(Xx)F_X(x) = P(X \leq x)The universal descriptor — works for all random variables
CDF propertiesNon-decreasing, right-continuous, F()=0F(-\infty) = 0, F()=1F(\infty) = 1
Joint distributionFull description of two random variables together
MarginalIntegrate out / sum out the other variable
Conditional distributionfXY(xy)=fX,Y(x,y)/fY(y)f_{X\|Y}(x \mid y) = f_{X,Y}(x,y) / f_Y(y)
IndependenceX ⁣ ⁣ ⁣YX \perp\!\!\!\perp Y iff fX,Y=fXfYf_{X,Y} = f_X \cdot f_Y
TransformationY=g(X)Y = g(X): CDF method or change of variables with Jacobian
Inverse CDF transformX=F1(U)X = F^{-1}(U) where UUniform(0,1)U \sim \text{Uniform}(0,1) generates XFX \sim F

What’s Next

Expectation, Variance & Moments answers: once we have distributions, what summaries do we compute? It defines expectation (E[X]E[X]), proves linearity, derives variance (Var(X)=E[X2](E[X])2\text{Var}(X) = E[X^2] - (E[X])^2), develops the law of total expectation (E[X]=E[E[XY]]E[X] = E[E[X \mid Y]]), and introduces moment-generating functions — the analytic tool that powers the proofs of the law of large numbers and the central limit theorem.


References

  1. Billingsley, P. (2012). Probability and Measure (Anniversary ed.). Wiley.
  2. Durrett, R. (2019). Probability: Theory and Examples (5th ed.). Cambridge University Press.
  3. Grimmett, G. & Stirzaker, D. (2020). Probability and Random Processes (4th ed.). Oxford University Press.
  4. Wasserman, L. (2004). All of Statistics. Springer.
  5. Casella, G. & Berger, R. L. (2002). Statistical Inference (2nd ed.). Duxbury.
  6. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
  7. Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep Learning. MIT Press.
  8. Kobyzev, I., Prince, S. J. D. & Brubaker, M. A. (2021). Normalizing Flows: An Introduction and Review of Current Methods. IEEE TPAMI, 43(11), 3964–3979.