foundational 45 min read · April 11, 2026

Conditional Probability & Independence

How new information reshapes probability — from Bayes' theorem to the conditional independence assumption that makes probabilistic ML tractable

1. Conditional Probability

In Topic 1 we built the machinery of probability spaces: sample spaces, sigma-algebras, and the Kolmogorov axioms. Everything there describes unconditional probability — the probability of events in the absence of any additional information.

But in practice, we almost always have partial information. A doctor knows the patient tested positive before computing the probability of disease. A spam filter knows the email contains the word “lottery” before computing the probability it’s spam. A stock trader knows yesterday’s return before estimating today’s. The question becomes: how does knowing that B occurred change the probability of A?

Definition 1 Conditional Probability

Let (\Omega, \mathcal{F}, P) be a probability space and let B \in \mathcal{F} with P(B) > 0. The conditional probability of A given B is

P(A \mid B) = \frac{P(A \cap B)}{P(B)}.

The intuition is clean: conditioning on B means restricting our universe to outcomes in B. Among those outcomes, we ask what fraction of the probability also falls in A. The denominator P(B) re-normalizes so that P(B \mid B) = 1 — the new “sample space” has total probability 1.

Three-panel Venn diagram showing conditional probability: full sample space, B highlighted, and P(A|B) as A∩B within B
Example 1 Conditional probability on a fair die

Roll a fair die, \Omega = \{1,2,3,4,5,6\} with P(\{\omega\}) = 1/6. Let A = \{2,4,6\} (even) and B = \{4,5,6\} (at least 4).

P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(\{4,6\})}{P(\{4,5,6\})} = \frac{2/6}{3/6} = \frac{2}{3}.

Knowing the roll was at least 4, the probability of even jumps from 1/2 to 2/3. Information changed the probability.
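
As a quick sanity check, the same computation can be done by enumerating the sample space directly. A minimal Python sketch, with event definitions mirroring Example 1:

```python
from fractions import Fraction

omega = range(1, 7)                        # fair die: outcomes 1..6, each 1/6
A = {w for w in omega if w % 2 == 0}       # "even"
B = {w for w in omega if w >= 4}           # "at least 4"

def prob(event):
    """Uniform probability of an event on the die."""
    return Fraction(len(event), 6)

p_A_given_B = prob(A & B) / prob(B)        # P(A | B) = P(A ∩ B) / P(B)
print(prob(A), p_A_given_B)                # 1/2 unconditionally, 2/3 given B
```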

Remark Conditional probability is a probability measure

For fixed B with P(B) > 0, the function A \mapsto P(A \mid B) satisfies the Kolmogorov axioms on (\Omega, \mathcal{F}):

  1. P(A \mid B) = P(A \cap B)/P(B) \geq 0 (non-negativity).
  2. P(\Omega \mid B) = P(B)/P(B) = 1 (normalization).
  3. Countable additivity follows from that of P.

So conditioning on B gives us a new probability space — the same \Omega and \mathcal{F}, but a different measure. Every theorem we proved in Topic 1 (complement rule, inclusion-exclusion, union bound, continuity) holds for conditional probabilities too.

Definition 2 Conditional Probability as a Probability Measure

For B \in \mathcal{F} with P(B) > 0, define P_B : \mathcal{F} \to [0,1] by P_B(A) = P(A \mid B). Then (\Omega, \mathcal{F}, P_B) is a probability space.

Use the explorer below to see how conditioning on B restricts the sample space and reshapes probabilities. Toggle “Condition on B” to watch \Omega fade and B become the new universe:

Default readout: P(A) = 0.50, P(B) = 0.40, P(A \cap B) = 0.15, so P(A \mid B) = 0.15 / 0.40 = 0.375.

2. The Multiplication Rule and Chain Rule

The Multiplication Rule

Rearranging the definition of conditional probability gives us the multiplication rule (also called the product rule):

Theorem 1 Multiplication Rule (Product Rule)

For events A, B \in \mathcal{F} with P(B) > 0,

P(A \cap B) = P(A \mid B) \cdot P(B).

By symmetry (when P(A) > 0): P(A \cap B) = P(B \mid A) \cdot P(A).

Proof Multiplication Rule [show]

Multiply both sides of P(A \mid B) = P(A \cap B)/P(B) by P(B). ∎

This is trivial as a proof, but powerful as a computational tool. It converts a joint probability into a conditional probability times a marginal — and often the conditional is the easier quantity to reason about.

The Chain Rule

Applying the multiplication rule repeatedly gives the chain rule of probability:

Theorem 2 Chain Rule of Probability

For events A_1, \ldots, A_n \in \mathcal{F} with P(A_1 \cap \cdots \cap A_{n-1}) > 0,

P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1) \cdot P(A_2 \mid A_1) \cdot P(A_3 \mid A_1 \cap A_2) \cdots P(A_n \mid A_1 \cap \cdots \cap A_{n-1}).

The proof is a straightforward induction using the multiplication rule at each step: the base case is Theorem 1, and the inductive step applies the multiplication rule to (A_1 \cap \cdots \cap A_{n-1}) and A_n.

Example 2 Drawing cards without replacement

Draw 3 cards from a standard 52-card deck without replacement. What is the probability all three are hearts?

P(H_1 \cap H_2 \cap H_3) = P(H_1) \cdot P(H_2 \mid H_1) \cdot P(H_3 \mid H_1 \cap H_2)

= \frac{13}{52} \cdot \frac{12}{51} \cdot \frac{11}{50} = \frac{1716}{132600} \approx 0.01294.

Each factor reflects the updated state of the deck: after drawing one heart, 12 of 51 remaining cards are hearts. The chain rule captures this sequential conditioning perfectly.
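
The chain-rule product can be checked both exactly and by simulation. A small sketch, assuming the deck is encoded simply as 13 hearts and 39 non-hearts:

```python
from fractions import Fraction
import random

# Exact value via the chain rule: P(H1) · P(H2 | H1) · P(H3 | H1 ∩ H2)
exact = Fraction(13, 52) * Fraction(12, 51) * Fraction(11, 50)

# Monte Carlo check: deal 3 cards without replacement, count all-hearts hands
random.seed(0)
deck = ["H"] * 13 + ["X"] * 39
trials = 200_000
hits = sum(all(card == "H" for card in random.sample(deck, 3)) for _ in range(trials))

print(float(exact), hits / trials)         # both ≈ 0.0129
```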

Probability tree diagram showing the chain rule with branch weights for sequential conditioning

Why this matters for ML: The chain rule is the foundation of autoregressive models. A language model factorizes P(w_1, w_2, \ldots, w_T) = \prod_{t=1}^T P(w_t \mid w_1, \ldots, w_{t-1}) — this is exactly the chain rule applied to a sequence of token events.
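
To make the autoregressive connection concrete, here is a toy sketch in which a hand-specified (hypothetical) conditional table stands in for a learned model; the sequence probability is just the chain-rule product, accumulated in log space:

```python
import math

# Hypothetical next-token conditionals P(w_t | w_1..w_{t-1}) for a tiny toy "model"
cond = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"cat": 0.5, "dog": 0.5},
    ("the", "cat"): {"sat": 0.9, "ran": 0.1},
}

def sequence_log_prob(tokens):
    """log P(w_1..w_T) = Σ_t log P(w_t | w_1..w_{t-1}) — the chain rule in log space."""
    return sum(math.log(cond[tuple(tokens[:t])][tok]) for t, tok in enumerate(tokens))

print(math.exp(sequence_log_prob(["the", "cat", "sat"])))   # 0.6 · 0.5 · 0.9 = 0.27
```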


3. The Law of Total Probability

The law of total probability is one of the most useful results in all of probability. It lets us compute P(A) by “dividing and conquering” — breaking the computation into cases.

Theorem 3 Law of Total Probability

Let B_1, B_2, \ldots, B_k be a partition of \Omega: the B_i are pairwise disjoint and \bigcup_{i=1}^k B_i = \Omega. If P(B_i) > 0 for all i, then for any event A \in \mathcal{F},

P(A) = \sum_{i=1}^k P(A \mid B_i) \cdot P(B_i).

Proof Law of Total Probability [show]

Since the B_i partition \Omega, we have A = A \cap \Omega = A \cap \left(\bigcup_i B_i\right) = \bigcup_i (A \cap B_i). The sets A \cap B_i are pairwise disjoint (because the B_i are). By countable additivity:

P(A) = \sum_{i=1}^k P(A \cap B_i)

= \sum_{i=1}^k P(A \mid B_i) \cdot P(B_i)

where the last step uses the multiplication rule. ∎

The most common application uses a two-element partition \{B, B^c\}:

P(A) = P(A \mid B) \cdot P(B) + P(A \mid B^c) \cdot P(B^c).

Example 3 Prevalence-weighted testing

A disease has 2% prevalence. A test has sensitivity P(+ \mid D) = 0.95 and specificity P(- \mid D^c) = 0.90, so P(+ \mid D^c) = 0.10. What is P(+), the probability of testing positive?

Partition: \{D, D^c\}. By total probability:

P(+) = P(+ \mid D) \cdot P(D) + P(+ \mid D^c) \cdot P(D^c)

= 0.95 \times 0.02 + 0.10 \times 0.98 = 0.019 + 0.098 = 0.117.

About 11.7% of people test positive — and the vast majority are false positives, because the disease is rare. This is where Bayes’ theorem comes in.
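
In code, the total-probability decomposition is a one-liner over the partition. A minimal sketch using the numbers from Example 3:

```python
prevalence = 0.02       # P(D)
sensitivity = 0.95      # P(+ | D)
specificity = 0.90      # P(- | D^c), so P(+ | D^c) = 1 - specificity = 0.10

# Law of total probability over the partition {D, D^c}
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
print(p_positive)       # ≈ 0.117
```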

Sample space partitioned into regions B₁, B₂, B₃ with event A overlapping each; stacked bar chart of weighted terms

Use the explorer below to partition \Omega into 2, 3, or 4 regions and see how the law of total probability decomposes P(A) into weighted contributions:

Default readout (three-part partition): P(A \mid B_1) \cdot P(B_1) = 0.80 \cdot 0.30 = 0.24, P(A \mid B_2) \cdot P(B_2) = 0.40 \cdot 0.50 = 0.20, P(A \mid B_3) \cdot P(B_3) = 0.60 \cdot 0.20 = 0.12, so P(A) = \sum_i P(A \mid B_i) \cdot P(B_i) = 0.56.

4. Bayes’ Theorem

Bayes’ theorem is the multiplication rule used twice, combined with total probability. It answers: given that we observed B, what is the probability that it was caused by (or associated with) A?

Theorem 4 Bayes' Theorem

Let A, B \in \mathcal{F} with P(B) > 0 and P(A) > 0. Then

P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}.

More generally, if A_1, \ldots, A_k partition \Omega with P(A_i) > 0 for all i, then

P(A_i \mid B) = \frac{P(B \mid A_i) \cdot P(A_i)}{\sum_{j=1}^k P(B \mid A_j) \cdot P(A_j)}.

Proof Bayes' Theorem [show]

By the multiplication rule, P(A \cap B) = P(B \mid A) \cdot P(A) = P(A \mid B) \cdot P(B). Divide by P(B):

P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}.

The general form follows by expanding P(B) via the law of total probability. ∎

The Bayesian Vocabulary

Bayes’ theorem has a canonical interpretation:

Term | Symbol | Role
Prior | P(A) | Probability of A before observing B
Likelihood | P(B \mid A) | Probability of the evidence B given A
Evidence | P(B) | Total probability of B (normalization constant)
Posterior | P(A \mid B) | Probability of A after observing B

In shorthand: posterior \propto likelihood \times prior. The evidence P(B) is just the normalizing constant that makes the posterior sum to 1.

Three-panel flow: prior bars, likelihood scaling, posterior bars (normalized)
Remark The base rate fallacy

Humans are notoriously bad at Bayesian reasoning. We tend to overweight the likelihood P(B \mid A) and ignore the prior P(A) — the base rate. In medical testing, this means patients (and sometimes doctors) confuse “the test is 99% accurate” with “I’m 99% likely to have the disease.” As Example 4 shows, these can be wildly different when the disease is rare.

The Medical Testing Example

Example 4 Medical testing: sensitivity, specificity, and PPV

A disease has prevalence P(D) = 0.01 (1% of the population). A diagnostic test has:

  • Sensitivity: P(+ \mid D) = 0.99 (catches 99% of true cases)
  • Specificity: P(- \mid D^c) = 0.95 (correctly clears 95% of healthy people), so P(+ \mid D^c) = 0.05.

Question: If you test positive, what is the probability you actually have the disease?

By Bayes:

P(D \mid +) = \frac{P(+ \mid D) \cdot P(D)}{P(+)} = \frac{0.99 \times 0.01}{0.99 \times 0.01 + 0.05 \times 0.99}

= \frac{0.0099}{0.0099 + 0.0495} = \frac{0.0099}{0.0594} \approx 0.167.

Despite a “99% accurate” test, a positive result means only about a 17% chance of disease. The base rate (1% prevalence) dominates.

Natural frequency framing. Consider 10,000 people:

  • 100 have the disease → 99 test positive (true positives), 1 tests negative (false negative)
  • 9,900 are healthy → 495 test positive (false positives), 9,405 test negative (true negatives)
  • Total positives: 99 + 495 = 594. Of those, 99 actually have the disease: 99/594 \approx 16.7\%.
Natural frequency tree for 10,000 people showing TP/FP/TN/FN; PPV vs prevalence curve
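
The same Bayes computation, wrapped as a function so the prevalence can be varied — a small sketch illustrating how PPV collapses at low base rates even when sensitivity and specificity stay fixed:

```python
def ppv(prevalence, sensitivity, specificity):
    """Positive predictive value P(D | +) via Bayes' theorem."""
    p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    return sensitivity * prevalence / p_positive

print(round(ppv(0.01, 0.99, 0.95), 3))            # 0.167, as in Example 4
for prev in (0.001, 0.01, 0.10, 0.50):
    print(prev, round(ppv(prev, 0.99, 0.95), 3))  # PPV rises steeply with prevalence
```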

Explore Bayes’ theorem interactively below. Try the medical testing presets to see how PPV changes dramatically with prevalence — even when sensitivity and specificity stay fixed:

Default readout (step 1 of 4 starts from the prior, which must satisfy P(A) + P(A^c) = 1): prior P(A) = 0.30 (so P(A^c) = 0.70), likelihood P(B \mid A) = 0.80; evidence P(B) = P(B \mid A) \cdot P(A) + P(B \mid A^c) \cdot P(A^c) = 0.38; posterior P(A \mid B) = P(B \mid A) \cdot P(A) / P(B) \approx 0.6316.

5. Independence

Two events are independent when knowing one occurred gives no information about the other. The formal definition says this in the most computationally useful way:

Definition 3 Independence (Two Events)

Events A and B are independent if

P(A \cap B) = P(A) \cdot P(B).

Equivalently (when P(B) > 0): P(A \mid B) = P(A) — conditioning on B doesn’t change the probability of A.

Example 5 Independent coin flips

Flip two fair coins. Let A = “first coin is heads” and B = “second coin is heads.” Then \Omega = \{HH, HT, TH, TT\} with P(\{\omega\}) = 1/4.

P(A) = 1/2, \quad P(B) = 1/2, \quad P(A \cap B) = P(\{HH\}) = 1/4 = P(A) \cdot P(B).

Independent. This matches our intuition: the second coin doesn’t “know” what the first coin did.
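
Independence checks on a finite sample space reduce to counting. A minimal sketch verifying Example 5 by enumeration:

```python
from fractions import Fraction
from itertools import product

omega = list(product("HT", repeat=2))       # {HH, HT, TH, TT}, each with probability 1/4
A = {w for w in omega if w[0] == "H"}       # first coin heads
B = {w for w in omega if w[1] == "H"}       # second coin heads

def prob(event):
    return Fraction(len(event), len(omega))

print(prob(A & B) == prob(A) * prob(B))     # True: A and B are independent
```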

Definition 4 Independence (Finite Collection)

Events A_1, \ldots, A_n are (mutually) independent if for every subset \{i_1, \ldots, i_k\} \subseteq \{1, \ldots, n\} with k \geq 2,

P(A_{i_1} \cap \cdots \cap A_{i_k}) = P(A_{i_1}) \cdots P(A_{i_k}).

This requires 2^n - n - 1 conditions — not just the \binom{n}{2} pairwise ones. For three events, we need four conditions: the three pairwise ones plus P(A \cap B \cap C) = P(A) \cdot P(B) \cdot P(C).

Independence of Complements

Theorem 5 Independence of Complements

If A and B are independent, then A and B^c are also independent (and A^c and B^c, etc.).

Proof Independence of Complements [show]

We need to show P(A \cap B^c) = P(A) \cdot P(B^c).

P(A \cap B^c) = P(A) - P(A \cap B)

= P(A) - P(A) \cdot P(B)

= P(A)(1 - P(B))

= P(A) \cdot P(B^c). \quad \square

The first equality uses A = (A \cap B) \cup (A \cap B^c) with the two sets disjoint. The second uses the independence assumption P(A \cap B) = P(A) \cdot P(B).

This is reassuring: if “rain” and “traffic jam” are independent, then “no rain” and “traffic jam” should be too.

Two-panel Venn diagram: independent events with P(A∩B)=P(A)·P(B) verified, dependent events with inequality shown

Use the tester below to build events on a die and check whether they’re independent. Try the presets, then define your own:


6. Pairwise vs. Mutual Independence

Pairwise independence — checking P(A_i \cap A_j) = P(A_i) \cdot P(A_j) for all pairs — does not guarantee mutual independence. You also need the higher-order conditions.

Definition 5 Pairwise Independence

Events A_1, \ldots, A_n are pairwise independent if P(A_i \cap A_j) = P(A_i) \cdot P(A_j) for all i \neq j.

Example 6 Pairwise but not mutually independent events

Flip two fair coins. Define:

  • A = “first coin is heads”: P(A) = 1/2
  • B = “second coin is heads”: P(B) = 1/2
  • C = “the two coins show different faces” = \{HT, TH\}: P(C) = 1/2

Check pairwise independence:

  • P(A \cap B) = P(\{HH\}) = 1/4 = P(A) \cdot P(B)
  • P(A \cap C) = P(\{HT\}) = 1/4 = P(A) \cdot P(C)
  • P(B \cap C) = P(\{TH\}) = 1/4 = P(B) \cdot P(C)

But A \cap B \cap C = \emptyset (both heads means the same face, so C fails), so:

P(A \cap B \cap C) = 0 \neq 1/8 = P(A) \cdot P(B) \cdot P(C).

The three events are pairwise independent but not mutually independent.
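
The same enumeration style verifies Example 6: every pairwise product check passes, but the triple check fails. A short sketch:

```python
from fractions import Fraction
from itertools import combinations, product

omega = list(product("HT", repeat=2))
events = {
    "A": {w for w in omega if w[0] == "H"},        # first coin heads
    "B": {w for w in omega if w[1] == "H"},        # second coin heads
    "C": {w for w in omega if w[0] != w[1]},       # coins show different faces
}

def prob(event):
    return Fraction(len(event), len(omega))

for (x, X), (y, Y) in combinations(events.items(), 2):
    print(x, y, prob(X & Y) == prob(X) * prob(Y))  # all three pairs: True

A, B, C = events["A"], events["B"], events["C"]
print("ABC", prob(A & B & C) == prob(A) * prob(B) * prob(C))  # False: not mutual
```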

Two-panel: classic 2-coin counterexample with Venn diagram and table showing pairwise but not mutual independence
Remark Pairwise ⇏ mutual: the extra conditions matter

This is not a pathological edge case. In machine learning, pairwise decorrelation (e.g., PCA) does not guarantee full independence — a fact that higher-order methods like ICA (independent component analysis) exploit. The difference between pairwise and mutual independence is the difference between matching second-order statistics and matching the full joint distribution.


7. Conditional Independence

Conditional independence is arguably the most important concept in probabilistic ML. It says: “once we know C, A and B become independent.”

Definition 6 Conditional Independence

Events A and B are conditionally independent given C (written A \perp\!\!\!\perp B \mid C) if

P(A \cap B \mid C) = P(A \mid C) \cdot P(B \mid C).

Equivalently (when P(B \cap C) > 0): P(A \mid B \cap C) = P(A \mid C) — once you know C, learning B tells you nothing new about A.

Independence Does NOT Imply Conditional Independence

This is the subtle part. All four combinations are possible:

Marginally independent? | Conditionally independent given C? | Name
Yes | Yes | Fully independent
Yes | No | Explaining away (Berkson’s paradox)
No | Yes | Confounding
No | No | Generally dependent
Theorem 6 Marginal Independence Does Not Imply Conditional Independence (and Vice Versa)

There exist events A, B, C such that A \perp\!\!\!\perp B (marginally independent) but A and B are not conditionally independent given C. And vice versa.

Proof Marginal Independence Does Not Imply Conditional Independence (and Vice Versa) [show]

By construction. The “explaining away” and “confounding” presets in the explorer below provide concrete probability distributions witnessing each direction. ∎

Example 7 Conditional independence: explaining away

Two independent causes A and B can produce an effect C. Marginally, A \perp\!\!\!\perp B. But conditioning on C (the effect having occurred) makes A and B dependent: if we know the effect happened and A didn’t cause it, B becomes more likely. This is called explaining away.

Concrete example: A fire alarm (C) can be triggered by a fire (A) or by cooking smoke (B). Fire and cooking smoke are independent events. But given that the alarm is ringing, learning there’s no fire makes cooking smoke more probable.

Graphical model A←C→B and table showing conditional independence does not imply marginal independence
Remark Independence ⇎ conditional independence

Conditional independence is the language of graphical models:

  • In a Bayesian network (directed graph), A \perp\!\!\!\perp B \mid C is encoded by the graph structure (d-separation).
  • The naive Bayes classifier assumes all features are conditionally independent given the class label: P(X_1, \ldots, X_d \mid Y) = \prod_{j=1}^d P(X_j \mid Y).
  • In hidden Markov models, observations are conditionally independent given the hidden state.

These assumptions are rarely exactly true, but they make inference tractable. The art of probabilistic modeling is choosing which conditional independencies to assume.

Explore the four independence configurations below. Use the presets or adjust the joint distribution manually to see when marginal and conditional independence agree — and when they don’t:

Joint probability table (default preset):

A | B | C | P(A, B, C)
0 | 0 | 0 | 0.360
0 | 0 | 1 | 0.040
0 | 1 | 0 | 0.040
0 | 1 | 1 | 0.160
1 | 0 | 0 | 0.040
1 | 0 | 1 | 0.160
1 | 1 | 0 | 0.160
1 | 1 | 1 | 0.040

Readouts for this table:

  • Marginal: P(A=1) = 0.40, P(B=1) = 0.40, and P(A=1 \cap B=1) = 0.20 \neq 0.16 = P(A=1) \cdot P(B=1) — marginally dependent.
  • Given C=1: P(A=1 \mid C=1) = P(B=1 \mid C=1) = 0.50, but P(A=1 \cap B=1 \mid C=1) = 0.10 \neq 0.25 — conditionally dependent given C=1.
  • Given C=0: P(A=1 \mid C=0) = P(B=1 \mid C=0) = 0.3333, but P(A=1 \cap B=1 \mid C=0) = 0.2667 \neq 0.1111 — conditionally dependent given C=0.
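
The readouts above can be reproduced from the joint table with a few lines of marginalization. A sketch, using the default-preset numbers:

```python
P = {  # joint P(A, B, C) from the table above, keyed by (a, b, c)
    (0, 0, 0): 0.360, (0, 0, 1): 0.040, (0, 1, 0): 0.040, (0, 1, 1): 0.160,
    (1, 0, 0): 0.040, (1, 0, 1): 0.160, (1, 1, 0): 0.160, (1, 1, 1): 0.040,
}

def marg(**fix):
    """Sum P over all outcomes consistent with the fixed variable values."""
    idx = {"a": 0, "b": 1, "c": 2}
    return sum(p for k, p in P.items() if all(k[idx[n]] == v for n, v in fix.items()))

# Marginal check: P(A=1, B=1) vs P(A=1) · P(B=1)
print(marg(a=1, b=1), marg(a=1) * marg(b=1))            # 0.20 vs 0.16 -> dependent

# Conditional check given C=1: P(A=1, B=1 | C=1) vs P(A=1 | C=1) · P(B=1 | C=1)
pc1 = marg(c=1)
print(marg(a=1, b=1, c=1) / pc1,
      (marg(a=1, c=1) / pc1) * (marg(b=1, c=1) / pc1))  # 0.10 vs 0.25 -> dependent
```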

8. Connections to ML

The Naive Bayes Classifier

The naive Bayes classifier is Bayes’ theorem + conditional independence. Given features X_1, \ldots, X_d and class label Y:

P(Y \mid X_1, \ldots, X_d) \propto P(Y) \prod_{j=1}^d P(X_j \mid Y)

The “naive” assumption is that features are conditionally independent given the class: X_i \perp\!\!\!\perp X_j \mid Y for all i \neq j. This reduces O(2^d) parameters to O(d) — from an intractable joint to a product of marginals.
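
To see the factorization at work, here is a minimal from-scratch Bernoulli naive Bayes on a tiny hypothetical binary dataset (the data and parameter names are illustrative, not from the article):

```python
import numpy as np

# Tiny hypothetical dataset: 6 samples, 3 binary features, binary labels
X = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1], [1, 1, 1], [0, 0, 0]])
y = np.array([1, 1, 0, 0, 1, 0])

def fit(X, y, alpha=1.0):
    """Estimate P(Y = c) and P(X_j = 1 | Y = c) with Laplace smoothing."""
    classes = np.unique(y)
    prior = np.array([(y == c).mean() for c in classes])
    theta = np.array([(X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
                      for c in classes])
    return classes, prior, theta

def predict_proba(x, prior, theta):
    """Posterior over classes: P(y | x) ∝ P(y) · Π_j P(x_j | y)."""
    likelihood = (theta ** x * (1 - theta) ** (1 - x)).prod(axis=1)
    posterior = prior * likelihood
    return posterior / posterior.sum()       # evidence = normalizing constant

classes, prior, theta = fit(X, y)
print(classes, predict_proba(np.array([1, 1, 0]), prior, theta))
```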

Naive Bayes plate diagram with factorization formula highlighting conditional independence

Bayesian Inference

Every Bayesian model applies Bayes’ theorem to parameter spaces:

P(\theta \mid \text{data}) = \frac{P(\text{data} \mid \theta) \cdot P(\theta)}{P(\text{data})}

Term | ML Role
P(\theta) (prior) | Regularization, inductive bias
P(\text{data} \mid \theta) (likelihood) | Model fit
P(\theta \mid \text{data}) (posterior) | Updated beliefs after seeing data
P(\text{data}) (evidence) | Model comparison (marginal likelihood)

The full development of Bayesian inference — conjugate priors, MCMC, variational methods — lives in Bayesian Foundations (Topic 25) within formalStatistics, and in Bayesian Inference on formalML.
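
As a preview of the simplest conjugate setting — a Beta prior on a coin's bias updated by Binomial data — Bayes' theorem reduces to adding counts. A minimal sketch; the prior pseudo-counts are assumed for illustration:

```python
# Beta-Binomial conjugacy: prior Beta(a, b), data k successes in n trials,
# posterior Beta(a + k, b + n - k) — Bayes' theorem in closed form.
a, b = 2.0, 2.0                  # assumed prior pseudo-counts
k, n = 7, 10                     # observed data

a_post, b_post = a + k, b + (n - k)
posterior_mean = a_post / (a_post + b_post)
print(a_post, b_post, round(posterior_mean, 3))   # 9.0 5.0 0.643
```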

The Monty Hall Problem

Example 8 The Monty Hall Problem

A game show has three doors. Behind one is a car; behind the others, goats. You pick a door (say Door 1). The host, who knows what’s behind the doors, opens another door (say Door 3) to reveal a goat. Should you switch to Door 2?

Let C_i = “car is behind door i.” Prior: P(C_1) = P(C_2) = P(C_3) = 1/3.

Let H_3 = “host opens door 3.” The host must open a goat door \neq your choice:

  • P(H_3 \mid C_1) = 1/2 (car is behind your door, host picks randomly from 2 and 3)
  • P(H_3 \mid C_2) = 1 (car is behind door 2, host must open door 3)
  • P(H_3 \mid C_3) = 0 (host can’t open the car door)

By Bayes:

P(C_2 \mid H_3) = \frac{P(H_3 \mid C_2) \cdot P(C_2)}{P(H_3)}

= \frac{1 \times 1/3}{1/2 \times 1/3 + 1 \times 1/3 + 0 \times 1/3} = \frac{1/3}{1/2} = \frac{2}{3}.

Switching wins with probability 2/3. Staying wins with probability 1/3. The host’s action provides information that shifts the posterior — a direct application of Bayes’ theorem.
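
The 2/3 vs. 1/3 split is easy to confirm by simulation. A short sketch of the game with a host who always opens a goat door different from the contestant's pick:

```python
import random

def monty_trial(switch, rng):
    doors = [0, 1, 2]
    car, pick = rng.choice(doors), rng.choice(doors)
    # Host opens a goat door that is neither the pick nor the car
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        pick = next(d for d in doors if d not in (pick, opened))
    return pick == car

rng = random.Random(0)
n = 100_000
print(sum(monty_trial(True, rng) for _ in range(n)) / n)    # ≈ 0.667 (switch)
print(sum(monty_trial(False, rng) for _ in range(n)) / n)   # ≈ 0.333 (stay)
```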

Conditional Entropy and Mutual Information

Conditional probability powers the information-theoretic quantities central to ML:

  • Conditional entropy: H(X \mid Y) = -\sum_{x,y} P(x,y) \log P(x \mid y) — remaining uncertainty in X after observing Y
  • Mutual information: I(X; Y) = H(X) - H(X \mid Y) — information Y provides about X
  • Chain rule for entropy: H(X, Y) = H(X) + H(Y \mid X) — mirrors the chain rule for probability
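
A quick numeric check of these identities on a small hypothetical joint distribution:

```python
import numpy as np

# Hypothetical joint P(X, Y) over two binary variables (rows: X, columns: Y)
Pxy = np.array([[0.3, 0.1],
                [0.2, 0.4]])
Px, Py = Pxy.sum(axis=1), Pxy.sum(axis=0)

def H(p):
    """Shannon entropy in bits of a probability vector (all entries > 0 here)."""
    return -np.sum(p * np.log2(p))

H_joint = H(Pxy.ravel())                 # H(X, Y)
H_x_given_y = H_joint - H(Py)            # chain rule: H(X, Y) = H(Y) + H(X | Y)
mutual_info = H(Px) - H_x_given_y        # I(X; Y) = H(X) - H(X | Y)
print(round(H_x_given_y, 3), round(mutual_info, 3))   # ≈ 0.846 and ≈ 0.125
```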

These are developed fully in Shannon Entropy on formalML.


9. Summary

Concept | Key Idea
Conditional probability P(A \mid B) | Probability of A given that B occurred — restricting the sample space to B
Multiplication rule | P(A \cap B) = P(A \mid B) \cdot P(B) — joint from conditional \times marginal
Chain rule | P(A_1 \cap \cdots \cap A_n) = P(A_1) \cdot P(A_2 \mid A_1) \cdots — sequential conditioning
Law of total probability | P(A) = \sum_i P(A \mid B_i) \cdot P(B_i) — divide and conquer via a partition
Bayes’ theorem | P(A \mid B) = P(B \mid A) \cdot P(A) / P(B) — posterior \propto likelihood \times prior
Base rate fallacy | Ignoring the prior P(A) when interpreting evidence — PPV depends on prevalence
Independence | P(A \cap B) = P(A) \cdot P(B) — information about one tells you nothing about the other
Pairwise vs. mutual | Pairwise ⇏ mutual — need all subset product conditions
Conditional independence | P(A \cap B \mid C) = P(A \mid C) \cdot P(B \mid C) — independence after conditioning on C
Naive Bayes | P(Y \mid X) \propto P(Y) \prod_j P(X_j \mid Y) — conditional independence reduces parameters from O(2^d) to O(d)

What’s Next

Random Variables & Distribution Functions extends these ideas from events to numbers. A random variable X : \Omega \to \mathbb{R} is a measurable function that translates the abstract probability space into numerical statements. Conditional distributions P(X \leq x \mid Y = y), conditional expectation E[X \mid Y], and the law of total expectation E[X] = E[E[X \mid Y]] all build directly on the conditional probability framework developed here. See Expectation, Variance & Moments for the full treatment of conditional expectation and the tower property.
