Foundational · 40 min read · April 11, 2026

Sample Spaces, Events & Axioms

The Kolmogorov axioms (non-negativity, normalization, and countable additivity) define the contract that every valid probability measure must satisfy.

1. Experiments, Outcomes & Sample Spaces

Probability is the mathematics of uncertainty. Before we can assign probabilities to anything, we need a precise language for describing what could happen. That language starts with three terms:

  • An experiment (or random experiment) is any process whose outcome is not known in advance.
  • An outcome is one specific result of the experiment.
  • The sample space $\Omega$ is the set of all possible outcomes.

These are not deep concepts — they’re agreements about notation. But getting them right is essential, because every probability statement we’ll ever make is a statement about subsets of $\Omega$.

Definition 1 Sample Space

The sample space of a random experiment is the set $\Omega$ whose elements are the possible outcomes of the experiment. An individual outcome $\omega \in \Omega$ is called a sample point.

Examples of Sample Spaces

Sample spaces range from the trivially finite to the uncountably infinite:

  • Flip a fair coin: $\Omega = \{H, T\}$ (finite, $|\Omega| = 2$)
  • Roll a six-sided die: $\Omega = \{1, 2, 3, 4, 5, 6\}$ (finite, $|\Omega| = 6$)
  • Roll two dice (ordered): $\Omega = \{(i,j) : i, j \in \{1, \ldots, 6\}\}$ (finite, $|\Omega| = 36$)
  • Flip a coin until heads: $\Omega = \{H, TH, TTH, TTTH, \ldots\}$ (countably infinite)
  • Pick a real number in $[0,1]$: $\Omega = [0,1]$ (uncountable)
  • Lifetime of a component: $\Omega = [0, \infty)$ (uncountable)

The distinction between finite, countably infinite, and uncountable matters enormously — it determines which mathematical tools we need. For finite and countable spaces, we can sum. For uncountable spaces, we must integrate.
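The finite entries above can be built explicitly and counted. A minimal Python sketch (the variable names are ours, not from the text):

```python
from itertools import product

# Finite sample spaces from the table, constructed explicitly.
coin = {"H", "T"}
die = set(range(1, 7))
two_dice = set(product(range(1, 7), repeat=2))  # ordered pairs (i, j)

sizes = (len(coin), len(die), len(two_dice))
print(sizes)  # (2, 6, 36)
```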

[Figure: Gallery of four sample spaces: coin flip, die roll, flip-until-heads, uniform [0,1]]

Use the explorer below to build sample spaces for different experiments and see how events and probabilities emerge.

2. Events as Sets

An event is any collection of outcomes, i.e., a subset of $\Omega$. We say event $A$ occurs if the actual outcome $\omega$ of the experiment satisfies $\omega \in A$.

Definition 2 Event

An event is a subset $A \subseteq \Omega$. The event $A$ occurs if the outcome $\omega$ of the experiment belongs to $A$.

Special events:

  • The certain event $\Omega$ always occurs.
  • The impossible event $\emptyset$ never occurs.
  • A simple event $\{\omega\}$ consists of a single outcome.

Example 1 Die roll events and set operations

Let $\Omega = \{1, 2, 3, 4, 5, 6\}$.

  • $A = \{2, 4, 6\}$ is the event “roll an even number.”
  • $B = \{5, 6\}$ is the event “roll at least 5.”
  • $A \cap B = \{6\}$ is the event “roll an even number that is at least 5.”
  • $A \cup B = \{2, 4, 5, 6\}$ is the event “roll an even number or at least 5.”
  • $A^c = \{1, 3, 5\}$ is the event “do not roll an even number” (i.e., roll odd).
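Example 1 maps directly onto Python’s set operations. A quick sketch:

```python
# Events on Ω = {1,...,6} are subsets; set operators are the logical connectives.
omega = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}  # "roll an even number"
B = {5, 6}     # "roll at least 5"

assert A & B == {6}            # A ∩ B: even AND at least 5
assert A | B == {2, 4, 5, 6}   # A ∪ B: even OR at least 5
assert omega - A == {1, 3, 5}  # A^c: not even
```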

Set Operations as Logical Connectives

The correspondence between set operations and logical statements about events is exact:

  • Union, $A \cup B$: $A$ or $B$ (or both) occurs
  • Intersection, $A \cap B$: both $A$ and $B$ occur
  • Complement, $A^c$: $A$ does not occur
  • Set difference, $A \setminus B = A \cap B^c$: $A$ occurs but $B$ does not
  • Subset, $A \subseteq B$: if $A$ occurs, then $B$ occurs
  • Disjoint, $A \cap B = \emptyset$: $A$ and $B$ cannot both occur

De Morgan’s laws connect them:

$$A^c \cup B^c = (A \cap B)^c \qquad \text{and} \qquad A^c \cap B^c = (A \cup B)^c$$

In words: “not both” equals “not one or not the other,” and “neither” equals “not either.” De Morgan’s laws generalize to arbitrary collections of events — including countably infinite unions and intersections, which is where sigma-algebras enter the picture.
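For a finite $\Omega$, De Morgan’s laws can be checked exhaustively. A brute-force sketch over every pair of subsets of a small $\Omega$:

```python
from itertools import combinations

omega = frozenset({1, 2, 3, 4})
subsets = [frozenset(c) for r in range(len(omega) + 1)
           for c in combinations(sorted(omega), r)]

def comp(s):
    """Complement relative to Ω."""
    return omega - s

# (A ∩ B)^c = A^c ∪ B^c  and  (A ∪ B)^c = A^c ∩ B^c, for every A, B ⊆ Ω.
for A in subsets:
    for B in subsets:
        assert comp(A & B) == comp(A) | comp(B)
        assert comp(A | B) == comp(A) & comp(B)
print(f"verified on all {len(subsets)}**2 pairs")
```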

[Figure: Venn diagrams showing union, intersection, and complement]

Explore set operations interactively: toggle different operations to highlight regions and see De Morgan’s laws in action.

3. Sigma-Algebras

The Problem with “All Subsets”

For finite sample spaces, life is simple: the collection of events is just the power set $\mathcal{P}(\Omega) = \{A : A \subseteq \Omega\}$. With a six-sided die, $\mathcal{P}(\Omega)$ has $2^6 = 64$ subsets, and we can assign a probability to each one without trouble.

For uncountable sample spaces like $\Omega = [0,1]$, things break. There is no way to assign a probability to every subset of $[0,1]$ while satisfying the Kolmogorov axioms; this is the content of the Vitali construction (1905), which produces a non-measurable set. The details require measure theory we’ll encounter in the Measure-Theoretic Probability topic on formalML.

The fix is elegant: instead of trying to assign probabilities to all subsets, we work only with a well-behaved collection called a sigma-algebra.

Definition 3 σ-algebra

A $\sigma$-algebra (or $\sigma$-field) on a set $\Omega$ is a collection $\mathcal{F} \subseteq \mathcal{P}(\Omega)$ satisfying:

  1. $\Omega \in \mathcal{F}$ (the sample space is an event)
  2. If $A \in \mathcal{F}$, then $A^c \in \mathcal{F}$ (closed under complements)
  3. If $A_1, A_2, A_3, \ldots \in \mathcal{F}$, then $\bigcup_{n=1}^{\infty} A_n \in \mathcal{F}$ (closed under countable unions)

By De Morgan’s law, closure under complements and countable unions gives closure under countable intersections for free. And since $\emptyset = \Omega^c$, the empty set is always in $\mathcal{F}$.

Remark 1 Algebra vs σ-algebra — why countable matters

The word “countable” in axiom (3) is doing heavy lifting. A collection closed under finite unions but not countable unions is called an algebra (or field) of sets. The upgrade from algebra to $\sigma$-algebra is exactly what we need for limit theorems: taking limits of sequences of events requires countable set operations.

Example 2 The trivial and discrete σ-algebras

  • The trivial $\sigma$-algebra $\mathcal{F} = \{\emptyset, \Omega\}$ is the smallest possible: we can only say “something happened” or “nothing happened.”
  • The discrete $\sigma$-algebra $\mathcal{F} = \mathcal{P}(\Omega)$ is the largest: every subset is an event. This works for countable $\Omega$.

Example 3 A partition σ-algebra on {1,2,3,4}

Let $\Omega = \{1, 2, 3, 4\}$. Consider $\mathcal{F} = \{\emptyset, \{1,2\}, \{3,4\}, \Omega\}$.

Check: (1) $\Omega \in \mathcal{F}$ ✓. (2) $\{1,2\}^c = \{3,4\} \in \mathcal{F}$ ✓. (3) All unions of members yield members ✓.

This $\sigma$-algebra “sees” the partition $\{\{1,2\}, \{3,4\}\}$ but cannot distinguish 1 from 2, or 3 from 4. It encodes partial information, a theme that becomes central in conditional probability and filtrations.
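The closure checks in Examples 2 and 3 can be automated. A small validator for finite $\Omega$ (the function name is ours; on a finite set, closure under finite unions is all axiom 3 can demand):

```python
def is_sigma_algebra(omega, F):
    """Check the three σ-algebra axioms on a finite Ω."""
    F = {frozenset(s) for s in F}
    if frozenset(omega) not in F:
        return False                              # axiom 1: Ω ∈ F
    if any(frozenset(omega) - A not in F for A in F):
        return False                              # axiom 2: complements
    return all(A | B in F for A in F for B in F)  # axiom 3: pairwise unions

omega = {1, 2, 3, 4}
partition_F = [set(), {1, 2}, {3, 4}, omega]  # Example 3: valid
broken_F = [set(), {1, 2}, omega]             # missing {1,2}^c = {3,4}

print(is_sigma_algebra(omega, partition_F), is_sigma_algebra(omega, broken_F))  # True False
```

Pairwise unions suffice here because closure under finite unions follows by induction, and a finite $\Omega$ admits no genuinely infinite unions.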

Example 4 The Borel σ-algebra on ℝ

On $\Omega = \mathbb{R}$, the Borel $\sigma$-algebra $\mathcal{B}(\mathbb{R})$ is the smallest $\sigma$-algebra containing all open intervals $(a, b)$. It contains all open sets, all closed sets, all countable intersections of open sets ($G_\delta$ sets), all countable unions of closed sets ($F_\sigma$ sets), and more. This is the standard $\sigma$-algebra for probability on $\mathbb{R}$, the one you use 99% of the time.

Why this matters for ML: When you write $X \sim \mathcal{N}(\mu, \sigma^2)$, you are implicitly working with the Borel $\sigma$-algebra on $\mathbb{R}$. The measurability requirement, that a random variable must pull Borel sets back to events in your $\sigma$-algebra, is what makes statements like $P(X \leq t)$ well-defined. We’ll formalize this in Random Variables & Distributions (coming soon).

[Figure: Three sigma-algebra examples: trivial, partition, and power set]

Build your own sigma-algebras below. Select subsets and watch the validator check the closure properties in real time.

4. The Kolmogorov Axioms

We now have the stage ($\Omega$), the actors ($\mathcal{F}$), and we need the script: a function that assigns a number to each event. Kolmogorov’s axioms (1933) tell us what properties this function must have.

Definition 4 Probability Measure (Kolmogorov Axioms)

A probability measure on a measurable space $(\Omega, \mathcal{F})$ is a function $P : \mathcal{F} \to [0,1]$ satisfying:

Axiom 1 (Non-negativity). For every $A \in \mathcal{F}$, $P(A) \geq 0$.

Axiom 2 (Normalization). $P(\Omega) = 1$.

Axiom 3 (Countable Additivity). If $A_1, A_2, A_3, \ldots \in \mathcal{F}$ are pairwise disjoint (i.e., $A_i \cap A_j = \emptyset$ for all $i \neq j$), then

$$P\left(\bigcup_{n=1}^{\infty} A_n\right) = \sum_{n=1}^{\infty} P(A_n).$$

The triple $(\Omega, \mathcal{F}, P)$ is called a probability space.

That’s it. Three axioms. Every theorem in probability theory — the law of large numbers, the central limit theorem, Bayes’ rule, the convergence of your neural network’s training loss — follows from these three axioms plus the machinery of analysis.

Remark 2 Why countable additivity?

Axiom 3 demands countable additivity, not mere finite additivity. This is the axiom that makes limit theorems possible. If we only required $P(A_1 \cup \cdots \cup A_n) = P(A_1) + \cdots + P(A_n)$ for finite collections, we could not take limits of sequences of events; and without limits, no convergence theorems, no law of large numbers, no central limit theorem. The jump from “finitely many” to “countably many” is small in notation but enormous in mathematical power.
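A concrete instance: in the flip-until-heads space, the events $A_n$ = “first head on flip $n$” are pairwise disjoint with $P(A_n) = 2^{-n}$ for a fair coin, and countable additivity forces their union, “heads eventually,” to have probability $\sum_{n=1}^{\infty} 2^{-n} = 1$. A sketch with exact arithmetic:

```python
from fractions import Fraction

def partial_sum(n):
    """P(A_1) + ... + P(A_n) with P(A_k) = (1/2)^k."""
    return sum(Fraction(1, 2 ** k) for k in range(1, n + 1))

# The leftover mass after n terms is exactly (1/2)^n -> 0, so the countable
# sum is 1: no probability escapes to "never seeing heads".
gaps = {n: 1 - partial_sum(n) for n in (1, 2, 10, 50)}
print(gaps[10])  # 1/1024
```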

Example 5 The fair die probability space

Roll a fair die. The probability space is:

  • $\Omega = \{1, 2, 3, 4, 5, 6\}$
  • $\mathcal{F} = \mathcal{P}(\Omega)$ (all 64 subsets)
  • $P(\{\omega\}) = 1/6$ for each $\omega$, extended by additivity: $P(A) = |A|/6$

Check the axioms: (1) $P(A) = |A|/6 \geq 0$ ✓. (2) $P(\Omega) = 6/6 = 1$ ✓. (3) For disjoint $A, B$: $P(A \cup B) = |A \cup B|/6 = (|A| + |B|)/6 = P(A) + P(B)$ ✓.

[Figure: Visual representation of the three Kolmogorov axioms]

5. Consequences of the Axioms

Everything in this section is derived from the three axioms — no additional assumptions needed. This is the engine of probability theory: a small axiomatic base generating a rich body of results.

The Complement Rule

Theorem 1 Complement Rule

For any event $A \in \mathcal{F}$,

$$P(A^c) = 1 - P(A).$$

Proof.

Since $A$ and $A^c$ are disjoint and $A \cup A^c = \Omega$, Axiom 3 gives $P(A) + P(A^c) = P(\Omega) = 1$ (by Axiom 2). Rearranging: $P(A^c) = 1 - P(A)$.

Corollary 1 P(∅) = 0

$P(\emptyset) = 0$.

Proof.

Take $A = \Omega$ in Theorem 1: $P(\emptyset) = P(\Omega^c) = 1 - P(\Omega) = 1 - 1 = 0$.

Monotonicity

Theorem 2 Monotonicity

If $A \subseteq B$, then $P(A) \leq P(B)$.

Proof.

Write $B = A \cup (B \setminus A)$, where $A$ and $B \setminus A$ are disjoint. By Axiom 3:

$$P(B) = P(A) + P(B \setminus A) \geq P(A)$$

since $P(B \setminus A) \geq 0$ by Axiom 1.

The Addition Rule (Inclusion-Exclusion for Two Events)

Theorem 3 Addition Rule (Inclusion-Exclusion, n=2)

For any events $A, B \in \mathcal{F}$,

$$P(A \cup B) = P(A) + P(B) - P(A \cap B).$$

Proof.

Decompose $A \cup B$ into three pairwise disjoint pieces:

$$A \cup B = (A \setminus B) \cup (A \cap B) \cup (B \setminus A).$$

By Axiom 3:

$$P(A \cup B) = P(A \setminus B) + P(A \cap B) + P(B \setminus A).$$

Now, $A = (A \setminus B) \cup (A \cap B)$ (disjoint), so $P(A) = P(A \setminus B) + P(A \cap B)$, giving $P(A \setminus B) = P(A) - P(A \cap B)$. Similarly $P(B \setminus A) = P(B) - P(A \cap B)$. Substituting:

$$P(A \cup B) = [P(A) - P(A \cap B)] + P(A \cap B) + [P(B) - P(A \cap B)] = P(A) + P(B) - P(A \cap B).$$
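The addition rule can be verified exactly on the two-dice space from Section 1 (the particular events $A$ and $B$ here are our choices):

```python
from fractions import Fraction
from itertools import product

omega = set(product(range(1, 7), repeat=2))  # 36 equally likely outcomes

def P(event):
    return Fraction(len(event), len(omega))

A = {w for w in omega if w[0] % 2 == 0}  # first die even
B = {w for w in omega if sum(w) >= 10}   # sum at least 10

assert P(A | B) == P(A) + P(B) - P(A & B)
print(P(A | B))  # 5/9
```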

General Inclusion-Exclusion

Theorem 4 General Inclusion-Exclusion

For events $A_1, \ldots, A_n \in \mathcal{F}$,

$$P\left(\bigcup_{i=1}^n A_i\right) = \sum_{i} P(A_i) - \sum_{i < j} P(A_i \cap A_j) + \sum_{i < j < k} P(A_i \cap A_j \cap A_k) - \cdots + (-1)^{n+1} P(A_1 \cap \cdots \cap A_n).$$

The proof proceeds by induction on $n$, with the base case $n = 2$ being Theorem 3. The inductive step applies Theorem 3 to $\left(\bigcup_{i=1}^{n-1} A_i\right) \cup A_n$ and uses the distributive law.
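The general formula can be checked mechanically: the helper below (our construction, not from the text) computes the alternating sum over all intersections and compares it with the directly computed union, on an equally-likely $\Omega$ of our choosing.

```python
from fractions import Fraction
from itertools import combinations

def union_prob_inclusion_exclusion(omega, events):
    """Alternating sum over all non-empty subcollections of events."""
    total = Fraction(0)
    for k in range(1, len(events) + 1):
        for combo in combinations(events, k):
            inter = set(omega).intersection(*combo)
            total += (-1) ** (k + 1) * Fraction(len(inter), len(omega))
    return total

omega = set(range(1, 13))  # equally likely outcomes 1..12
events = [{i for i in omega if i % d == 0} for d in (2, 3, 4)]  # divisibility events

direct = Fraction(len(set().union(*events)), len(omega))
assert union_prob_inclusion_exclusion(omega, events) == direct
print(direct)  # 2/3
```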

Step through the inclusion-exclusion formula term by term and watch the running total converge to the exact probability.

Union Bound (Boole’s Inequality)

Theorem 5 Union Bound (Boole's Inequality)

For any countable collection of events $A_1, A_2, \ldots \in \mathcal{F}$,

$$P\left(\bigcup_{n=1}^{\infty} A_n\right) \leq \sum_{n=1}^{\infty} P(A_n).$$

Proof.

Define $B_1 = A_1$ and $B_n = A_n \setminus \bigcup_{k=1}^{n-1} A_k$ for $n \geq 2$. The $B_n$ are pairwise disjoint, $\bigcup B_n = \bigcup A_n$, and $B_n \subseteq A_n$. By countable additivity and monotonicity:

$$P\left(\bigcup_{n=1}^{\infty} A_n\right) = P\left(\bigcup_{n=1}^{\infty} B_n\right) = \sum_{n=1}^{\infty} P(B_n) \leq \sum_{n=1}^{\infty} P(A_n).$$

Continuity of Probability

Theorem 6 Continuity of Probability

Let $A_1, A_2, \ldots \in \mathcal{F}$.

(a) If $A_n \uparrow A$ (i.e., $A_1 \subseteq A_2 \subseteq \cdots$ and $A = \bigcup_{n=1}^{\infty} A_n$), then $P(A) = \lim_{n \to \infty} P(A_n)$.

(b) If $A_n \downarrow A$ (i.e., $A_1 \supseteq A_2 \supseteq \cdots$ and $A = \bigcap_{n=1}^{\infty} A_n$), then $P(A) = \lim_{n \to \infty} P(A_n)$.

Proof.

Proof of (a). Define $B_1 = A_1$ and $B_n = A_n \setminus A_{n-1}$ for $n \geq 2$. The $B_n$ are pairwise disjoint and $\bigcup_{k=1}^n B_k = A_n$, so $\bigcup_{n=1}^{\infty} B_n = A$. By countable additivity:

$$P(A) = \sum_{n=1}^{\infty} P(B_n) = \lim_{N \to \infty} \sum_{n=1}^{N} P(B_n) = \lim_{N \to \infty} P(A_N).$$

Proof of (b). Apply part (a) to $A_n^c \uparrow A^c$, then use the complement rule.

This is where the convergence of sequences from formalCalculus (sequences & limits) becomes essential: the telescoping sum and limit exchange are the same tools you developed there.

Remark 3 Continuity ⟺ countable additivity

Given finite additivity, countable additivity is equivalent to continuity from below (Theorem 6a). This means Axiom 3 is really saying “probability is a continuous set function,” connecting our probability axioms directly to the continuity of limits.

[Figure: Continuity of probability: increasing and decreasing sequences of events]

6. Combinatorial Probability

The Equally Likely Model

When $\Omega$ is finite and every outcome is equally probable, i.e., $P(\{\omega\}) = 1/|\Omega|$ for all $\omega$, probability reduces to counting:

$$P(A) = \frac{|A|}{|\Omega|} = \frac{\text{number of favorable outcomes}}{\text{total number of outcomes}}.$$

This is the classical or Laplacian definition of probability. It’s a special case of the Kolmogorov axioms, not the foundation; many important probability spaces (continuous distributions, unfair coins) don’t have equally likely outcomes.

Counting Tools

To use the equally likely model, we need to count. The three essential counting principles:

Multiplication Principle. If experiment 1 has $n_1$ outcomes and experiment 2 has $n_2$ outcomes, the combined experiment has $n_1 \cdot n_2$ outcomes.

Permutations. The number of ways to arrange $k$ items chosen from $n$ (order matters) is $P(n, k) = n!/(n-k)!$.

Combinations. The number of ways to choose $k$ items from $n$ (order doesn’t matter) is $\binom{n}{k} = n! / (k!(n-k)!)$.
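All three counting principles are available directly in Python’s standard library; a quick sketch:

```python
from math import comb, perm, factorial

# Multiplication principle: two dice -> 6 * 6 ordered outcomes.
assert 6 * 6 == 36

# Permutations: arrange 3 of 5 items, order matters -> 5!/(5-3)! = 60.
assert perm(5, 3) == factorial(5) // factorial(2) == 60

# Combinations: choose 3 of 5, order irrelevant -> 5!/(3! 2!) = 10.
assert comb(5, 3) == factorial(5) // (factorial(3) * factorial(2)) == 10
```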

The Birthday Problem

Example 6 Birthday Problem

In a group of $n$ people, what is the probability that at least two share a birthday? Assume 365 equally likely birthdays.

It’s easier to compute the complement: $P(\text{all different})$.

  • Person 1 has $365/365$ choices.
  • Person 2 must avoid person 1’s birthday: $364/365$.
  • Person $k$ must avoid the previous $k-1$ birthdays: $(365-k+1)/365$.

$$P(\text{all different}) = \prod_{k=0}^{n-1} \frac{365 - k}{365} = \frac{365!}{(365-n)! \cdot 365^n}$$

$$P(\text{at least one match}) = 1 - P(\text{all different})$$

The famous result: with just $n = 23$ people, the probability exceeds 50%.
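The product formula is easy to evaluate exactly; a sketch locating the 50% threshold:

```python
from fractions import Fraction
from math import prod

def p_match(n, days=365):
    """1 - P(all n birthdays distinct), computed exactly, returned as float."""
    p_all_diff = prod(Fraction(days - k, days) for k in range(n))
    return float(1 - p_all_diff)

# n = 23 is the smallest group where a shared birthday is more likely than not.
assert p_match(22) < 0.5 < p_match(23)
print(round(p_match(23), 4))  # 0.5073
```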

[Figure: Birthday problem: probability of a match vs group size]

Explore the birthday problem below: adjust the group size and run Monte Carlo simulations to see the exact curve confirmed empirically.

The Matching Problem (Derangements)

Example 7 Hat-Check Problem (Derangements)

$n$ people check their hats; the hats are returned at random. What is the probability that nobody gets their own hat back?

A permutation with no fixed points is called a derangement. Let $A_i$ be the event “person $i$ gets their own hat.” We want $P\left(\left(\bigcup_{i=1}^n A_i\right)^c\right)$.

By inclusion-exclusion:

$$P\left(\bigcup_{i=1}^n A_i\right) = \sum_{k=1}^n (-1)^{k+1} \binom{n}{k} \cdot \frac{(n-k)!}{n!} = \sum_{k=1}^n \frac{(-1)^{k+1}}{k!}$$

So the probability of a derangement is:

$$P(\text{derangement}) = \sum_{k=0}^n \frac{(-1)^k}{k!} \xrightarrow{n \to \infty} \frac{1}{e} \approx 0.3679$$

This is remarkable: for large $n$, the probability converges to $1/e$, independent of $n$. The convergence is extremely fast; since the series alternates, the error at stage $n$ is already below $1/(n+1)!$.
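Both the series and the convergence to $1/e$ can be confirmed by brute force over all $n!$ permutations (feasible for small $n$):

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

def p_derangement_series(n):
    """Inclusion-exclusion series: sum_{k=0}^{n} (-1)^k / k!."""
    return sum(Fraction((-1) ** k, factorial(k)) for k in range(n + 1))

def p_derangement_count(n):
    """Direct count of fixed-point-free permutations over all n! of them."""
    hats = range(n)
    derangements = sum(all(p[i] != i for i in hats) for p in permutations(hats))
    return Fraction(derangements, factorial(n))

for n in range(1, 8):
    assert p_derangement_series(n) == p_derangement_count(n)
print(p_derangement_count(6), float(p_derangement_count(6)))  # 53/144 ≈ 0.3681
```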

[Figure: Derangement probability converging to 1/e]

7. Conditional Probability & Independence (Preview)

We close the mathematical content with a brief preview of the next topic. Conditional probability answers: “given that event $B$ has occurred, what is the probability of event $A$?”

Definition 5 Conditional Probability (preview)

For events $A, B \in \mathcal{F}$ with $P(B) > 0$,

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$

Definition 6 Independence (preview)

Events $A$ and $B$ are independent if

$$P(A \cap B) = P(A) \cdot P(B).$$

Equivalently (when $P(B) > 0$): $P(A \mid B) = P(A)$; knowing $B$ occurred doesn’t change the probability of $A$.
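On the two-dice space, both definitions are easy to test exactly (the particular events are our choices):

```python
from fractions import Fraction
from itertools import product

omega = set(product(range(1, 7), repeat=2))

def P(event):
    return Fraction(len(event), len(omega))

A = {w for w in omega if w[0] == 6}     # first die shows 6
B = {w for w in omega if w[1] == 6}     # second die shows 6
C = {w for w in omega if sum(w) >= 10}  # sum at least 10

assert P(A & B) == P(A) * P(B)  # independent: 1/36 = (1/6)(1/6)
assert P(A & C) != P(A) * P(C)  # dependent: a first-die 6 makes big sums likelier
print(P(C), P(A & C) / P(A))    # unconditional 1/6 vs conditional-on-A 1/2
```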

These concepts are the subject of Conditional Probability & Independence (coming soon). We’ll prove Bayes’ theorem, develop the law of total probability, and explore the surprisingly subtle concept of conditional independence — the assumption behind naive Bayes classifiers, graphical models, and essentially all tractable probabilistic inference.


8. Connections to ML

Probability Spaces in Disguise

Every ML model implicitly defines a probability space:

  • Bernoulli classifier: $\Omega = \{0, 1\}$, $\mathcal{F} = \mathcal{P}(\{0,1\})$, $P(Y{=}1) = \sigma(w^\top x)$
  • Multinomial (softmax): $\Omega = \{1, \ldots, K\}$, $\mathcal{F} = \mathcal{P}(\{1,\ldots,K\})$, $P$ given by the softmax output
  • Gaussian model: $\Omega = \mathbb{R}$, $\mathcal{F} = \mathcal{B}(\mathbb{R})$, $P = \mathcal{N}(\mu, \sigma^2)$
  • Language model: $\Omega$ = token sequences, $\mathcal{F}$ = cylinder $\sigma$-algebra, $P$ from the autoregressive factorization
  • Diffusion model: $\Omega = \mathbb{R}^d$, $\mathcal{F} = \mathcal{B}(\mathbb{R}^d)$, $P$ learned via score matching

The Union Bound in Learning Theory

Boole’s inequality (Theorem 5) is the entry point to PAC learning. The core argument:

  1. You have $|\mathcal{H}|$ hypotheses (models).
  2. For each $h \in \mathcal{H}$, define $A_h$ = “hypothesis $h$ overfits” (training error is small but true error is large).
  3. You want $P(\text{some hypothesis overfits}) = P\left(\bigcup_h A_h\right) \leq \delta$.
  4. By the union bound: $P\left(\bigcup_h A_h\right) \leq \sum_h P(A_h)$.
  5. So it suffices to make $P(A_h) \leq \delta / |\mathcal{H}|$ for each $h$.
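A minimal numeric sketch of the budget split in steps 4 and 5. The Hoeffding-style tail bound in the last lines is an assumption beyond this section, included only to suggest why the split is affordable:

```python
from math import ceil, log

delta, n_hyp = 0.05, 10_000  # total failure budget δ and hypothesis count |H|
per_h = delta / n_hyp        # step 5: per-hypothesis budget δ/|H|

# Step 4 (union bound): if every P(A_h) ≤ δ/|H|, the total is ≤ δ.
assert abs(n_hyp * per_h - delta) < 1e-12

# Assumed tail bound P(A_h) ≤ 2·exp(-2·m·ε²) at accuracy ε: solving for the
# sample size m shows the cost of the split grows only logarithmically in |H|.
eps = 0.1
m = ceil(log(2 / per_h) / (2 * eps ** 2))
print(m)  # 645
```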

This is the simplest version of the uniform convergence argument. It tells you that generalization requires either few hypotheses or very tight per-hypothesis bounds, the fundamental tension of statistical learning theory. The full treatment lives in the Measure-Theoretic Probability topic on formalML.

[Figure: Union bound in PAC learning: per-hypothesis error budget]

Sigma-Algebras and Information

In Bayesian ML, the $\sigma$-algebra encodes what you can observe. A filtration $\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \cdots$ models information arriving over time:

  • In online learning, $\mathcal{F}_t$ is the information available after seeing $t$ data points.
  • In reinforcement learning, $\mathcal{F}_t$ is the history up to time $t$ (all states, actions, rewards).
  • In time series, $\mathcal{F}_t$ is the natural filtration of the observed process.

This seems abstract now, but when we get to martingales and stochastic processes later in formalStatistics, filtrations will be the language we use to formalize “what the model knows at each step.”


9. Summary

  • Sample space $\Omega$: the set of all possible outcomes
  • Event $A \subseteq \Omega$: a subset of outcomes; “something we can ask about”
  • $\sigma$-algebra $\mathcal{F}$: the collection of “askable” events, closed under complements and countable unions
  • Probability measure $P$: a function $P : \mathcal{F} \to [0,1]$ satisfying non-negativity, normalization, and countable additivity
  • Probability space $(\Omega, \mathcal{F}, P)$: the complete specification of a probabilistic model
  • Complement rule: $P(A^c) = 1 - P(A)$
  • Addition rule: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
  • Union bound: $P(\bigcup_n A_n) \leq \sum_n P(A_n)$, the workhorse of PAC learning
  • Continuity: $A_n \uparrow A \Rightarrow P(A_n) \to P(A)$, the payoff of countable additivity
  • Combinatorial probability: $P(A) = |A| / |\Omega|$ when outcomes are equally likely

What’s Next

The next topic — Conditional Probability & Independence (coming soon) — builds on this foundation to answer: “how does new information change probabilities?” We’ll derive Bayes’ theorem, formalize independence, and see why conditional independence is the structural assumption that makes probabilistic ML tractable.
