intermediate 55 min read · April 18, 2026

Multiple Testing & False Discovery

When one test becomes many: controlling FWER via Bonferroni, Holm, and the closed-testing principle; controlling FDR via Benjamini–Hochberg; adaptive q-values via Storey; and the dual Bonferroni / Šidák simultaneous confidence intervals that close Track 5.

20.1 The FWER Explosion Problem

A single hypothesis test at level $\alpha = 0.05$ makes the following contract with a scientist: if the null is true, the probability of a false rejection is at most $5\%$. Twenty replications of the same experiment, each analyzed with the same level-$\alpha$ test, would make twenty independent such contracts — and on average, one of them would falsely reject the null. That is already uncomfortable. In modern practice, $m$ is not twenty; it is twenty thousand (GWAS), or two hundred thousand (high-throughput imaging), or — most commonly — the number of coefficients in a regression multiplied by the number of pre-registered contrasts multiplied by the number of subgroup analyses. The per-test guarantee degrades catastrophically with $m$, and without an adjustment, "$p < 0.05$" stops meaning what the untrained reader assumes.

The simplest quantification of the degradation assumes independence between the $m$ tests. If each test has Type I error exactly $\alpha$ under the null, and all $m$ nulls are true, then

$$\Pr\left(\exists i \in \{1, \ldots, m\} : \text{test } i \text{ rejects}\right) \;=\; 1 - (1 - \alpha)^m.$$

This quantity is the family-wise error rate (FWER) under the global null. For small $\alpha$ the Taylor expansion gives $\text{FWER} \approx m\alpha$, which is the same answer the union bound returns unconditionally (no independence required). Both quantities explode with $m$: at $\alpha = 0.05$ the table below is unforgiving.

FWER under independence as a function of $m$, at $\alpha = 0.05$. By $m = 20$ the probability of at least one false rejection is 64%; by $m = 100$ it is 99.4%. The union-bound upper envelope $m\alpha$ tracks the exact curve closely for $m\alpha \lesssim 1$ and overshoots 1 thereafter.
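The explosion is easy to reproduce. A minimal Python sketch (the function names are ours, not from the course code):

```python
def fwer_exact(m, alpha):
    """P(at least one false rejection) under independence, all m nulls true."""
    return 1 - (1 - alpha) ** m

def fwer_union_bound(m, alpha):
    """Boole's inequality: valid under any dependence structure."""
    return min(1.0, m * alpha)

for m in (1, 5, 10, 20, 50, 100):
    print(f"m={m:4d}  exact={fwer_exact(m, 0.05):.3f}  "
          f"union bound={fwer_union_bound(m, 0.05):.3f}")
```

At $m = 20$ the exact value is $\approx 0.642$ and at $m = 100$ it is $\approx 0.994$, matching the curve above.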

Example 1 Twenty A/B tests, one genuine effect

A product team runs twenty simultaneous A/B tests on unrelated features, each at level $\alpha = 0.05$. None of the twenty has a true effect — the global null holds. The probability that at least one test falsely flags its feature as significant is $1 - 0.95^{20} \approx 0.64$. A team that deploys whichever features cross the $p < 0.05$ threshold will, in expectation, ship at least one null feature in this batch. Doing this every quarter across a company compounds the effect: after four quarters, the probability of never having shipped a null feature is $0.36^4 \approx 0.017$.

Example 2 Genome-wide significance and the $5 \times 10^{-8}$ threshold

A GWAS tests $m \approx 10^6$ common variants against a trait. The union-bound control at $\alpha = 0.05$ gives a per-test threshold of $\alpha/m = 5 \times 10^{-8}$ — the number that has become the de facto field standard for “genome-wide significance.” It is exactly the Bonferroni correction of §20.4 applied to the GWAS scale. The FDR-controlled threshold for the same data is typically two or three orders of magnitude less stringent (§20.7 Ex 8), which is why BH-based analyses routinely identify hits that genome-wide-significance analyses miss.

Definition 1 Multiple-testing problem (setup)

A multiple-testing problem consists of $m$ hypotheses $H_1, \ldots, H_m$ with null distributions $P_1^{(0)}, \ldots, P_m^{(0)}$ and test statistics $T_1, \ldots, T_m$ producing p-values $p_1, \ldots, p_m$. Let $H_0 \subseteq \{1, \ldots, m\}$ denote the subset of indices where $H_i$ is true (the true nulls), with $m_0 = |H_0|$ and $m_1 = m - m_0 = |H_0^c|$ the count of true alternatives. Write $\pi_0 = m_0 / m$ for the true-null fraction.

A multiple-testing procedure is a function $(p_1, \ldots, p_m) \mapsto \mathcal{R} \subseteq \{1, \ldots, m\}$ returning the set of rejected hypotheses. Counts of interest: $R = |\mathcal{R}|$ (total rejections), $V = |\mathcal{R} \cap H_0|$ (false rejections), $S = |\mathcal{R} \cap H_0^c|$ (true rejections); by construction $R = V + S$. Throughout the topic, $\mathcal{R}$ denotes the rejection set and $R$ its cardinality.

Remark 1 Why not just use $\alpha/m$ everywhere and move on?

Bonferroni’s correction — apply each test at level $\alpha/m$ — is the canonical fix and will be formalized in §20.4. It works, but it is uniformly conservative: the FWER it actually achieves is at most $m_0 \alpha / m \le \alpha$, often much less. The rest of this topic is about more powerful procedures that retain the FWER guarantee (Holm, Hochberg, closed testing) or relax it to the weaker-but-often-preferred FDR guarantee (BH, BY, Storey). The pedagogical arc: $\alpha/m$ is the floor, not the ceiling.

Remark 2 Independence vs dependence in the FWER formula

The exact formula $1 - (1 - \alpha)^m$ uses independence; the union bound $m\alpha$ does not. Under positive dependence — the typical regime for correlated tests in an A/B platform or for linkage-disequilibrium-correlated SNPs in a GWAS — the true FWER is smaller than the independent-case formula (the false rejections tend to cluster). Under negative dependence it can be larger, but negative dependence is rare in practice. The upshot: $m\alpha$ is a safe universal upper bound, $1 - (1 - \alpha)^m$ is exact under independence, and positive dependence pulls the true FWER below both.

20.2 The Closed-Testing Principle

Before stating Bonferroni, we describe the organizing scaffold that derives Holm (and many other step-down procedures) from a single principle. Closed testing (Marcus, Peritz, and Gabriel 1976) is a two-step recipe: test every non-empty intersection hypothesis $\bigcap_{i \in I} H_i$ at level $\alpha$; reject $H_i$ in the final procedure iff every intersection containing $i$ rejects. This simultaneously achieves strong FWER control — regardless of the individual tests’ dependence structure — and, when the base tests are chosen appropriately, produces procedures strictly more powerful than single-step corrections.

The closure tree for $m = 3$. Top: $H_{123} = H_1 \cap H_2 \cap H_3$. Middle: three pairwise intersections $H_{12}, H_{13}, H_{23}$. Bottom: three elementary hypotheses $H_1, H_2, H_3$. Closed testing rejects $H_i$ iff every intersection hypothesis on the path from $H_{123}$ down to $H_i$ is rejected by its level-$\alpha$ test.

Theorem 0 Closed testing controls FWER in the strong sense (Marcus–Peritz–Gabriel 1976)

Let $\phi_I$ be a level-$\alpha$ test of $H_I = \bigcap_{i \in I} H_i$ for every non-empty $I \subseteq \{1, \ldots, m\}$. Define the closed procedure to reject $H_i$ iff $\phi_J$ rejects for every $J \ni i$. Then for any true-null configuration $H_0 \subseteq \{1, \ldots, m\}$:

$$\Pr_{H_0}\bigl(\text{at least one } H_i \text{ with } i \in H_0 \text{ is rejected}\bigr) \;\le\; \alpha.$$

No dependence assumption on the individual test statistics is required.

The proof is a one-line observation: a false rejection of any $H_i$ with $i \in H_0$ requires $\phi_{H_0}$ to reject, and $\phi_{H_0}$ is a level-$\alpha$ test of the true null $H_{H_0}$, so this has probability $\le \alpha$. Full exposition and the induction argument that specializes this to Holm are in formalML: §9.1. We will not reproduce it; for Topic 20 the role of closed testing is organizational — it is the lens through which Holm (§20.5) and Hochberg (§20.5) become instances of a single construction rather than ad-hoc step-down rules.
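To make the recipe concrete, here is a brute-force Python sketch of closed testing with Bonferroni intersection tests for small $m$, alongside Holm's step-down shortcut; the two rejection sets coincide, which is the specialization §20.5 describes (function names are ours, for illustration):

```python
from itertools import combinations

def closed_bonferroni(pvals, alpha=0.05):
    """Closed testing with Bonferroni intersection tests: reject H_I iff
    min_{i in I} p_i <= alpha/|I|; reject elementary H_i iff every
    intersection containing i is rejected. Exponential in m: demo only."""
    m = len(pvals)
    def intersection_rejects(I):
        return min(pvals[i] for i in I) <= alpha / len(I)
    rejected = set()
    for i in range(m):
        if all(intersection_rejects(I)
               for r in range(1, m + 1)
               for I in combinations(range(m), r) if i in I):
            rejected.add(i)
    return rejected

def holm(pvals, alpha=0.05):
    """Holm step-down: threshold alpha/(m-k+1) at rank k, stop at first failure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    rejected = set()
    for k, i in enumerate(order, start=1):
        if pvals[i] > alpha / (m - k + 1):
            break
        rejected.add(i)
    return rejected

p = [0.001, 0.015, 0.04, 0.3]
# closed_bonferroni(p) and holm(p) return the same rejection set: {0, 1}
```

The brute-force version tests all $2^m - 1$ intersections; Holm reaches the same decisions in $O(m \log m)$, which is the practical content of the closed-testing derivation.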

Remark 3 Closed testing is a meta-recipe, not a specific procedure

The theorem above says “pick level-$\alpha$ tests of every intersection and combine them correctly.” It does not tell you which tests to pick. Bonferroni-based intersection tests (reject $H_I$ iff $\min_{i \in I} p_i \le \alpha / |I|$) give the Holm procedure. Fisher-combination intersection tests (reject $H_I$ iff $-2 \sum_{i \in I} \log p_i$ exceeds a $\chi^2_{2|I|}$ critical value) give a different, often more powerful closed procedure. Simes-combination tests give Hochberg. The space of reasonable closed procedures is large; practitioners gravitate to Holm because it is parameter-free and always valid.

Remark 4 Why only state, not prove, the principle here

The full proof machinery — proving strong control and showing Holm is the closed procedure with Bonferroni intersection tests — takes roughly 15 additional lines of careful indexing. It adds no conceptual insight beyond what the one-line sketch above already conveys, and it distracts from the BH proof in §20.7, which does deserve a full derivation. Closed testing is the scaffold; the specific procedures we prove correct (Bonferroni, Holm, BH) are the load-bearing walls.

20.3 Family-Wise Error Rate: The FWER Paradigm

Definition 2 Family-wise error rate (FWER)

For a multiple-testing procedure producing rejection set $\mathcal{R}$ on data with true-null configuration $H_0$, the family-wise error rate at $H_0$ is

$$\text{FWER}(H_0) \;=\; \Pr_{H_0}(V \ge 1) \;=\; \Pr_{H_0}(\mathcal{R} \cap H_0 \ne \emptyset).$$

A procedure controls FWER at level $\alpha$ if $\text{FWER}(H_0) \le \alpha$ for every configuration $H_0$.

Definition 3 Strong vs weak FWER control

A procedure controls FWER in the weak sense if $\text{FWER}(H_0) \le \alpha$ is guaranteed only when $H_0 = \{1, \ldots, m\}$ (the global null — every $H_i$ true). It controls FWER in the strong sense if $\text{FWER}(H_0) \le \alpha$ for every $H_0 \subseteq \{1, \ldots, m\}$, including the partial-null configurations where some $H_i$ are true and some are false.

Strong control is what practitioners want: the guarantee should hold whether or not the alternatives are real. Weak control is an academic fallback. Every procedure in this topic controls FWER or FDR in the strong sense unless explicitly noted.

Remark 5 FWER relative to Topic 17's Type I error

At $m = 1$ FWER collapses to the per-test Type I error $\alpha$ of formalML: . FWER is the multi-hypothesis generalization: it replaces “probability this particular test falsely rejects” with “probability any of the $m$ tests falsely rejects.” The two notions agree at $m = 1$ and diverge sharply for large $m$, as §20.1 Fig 1 shows.

Remark 6 The FWER–power trade-off

Holding the family-wise level fixed, controlling FWER at level $\alpha$ forces each individual test to run at a smaller effective level $\alpha_i \le \alpha$. Smaller per-test level means smaller per-test power. The trade-off is quantitative: under Bonferroni at global $\alpha = 0.05$, a test of a single alternative with effect size $\delta$ that had power $0.80$ at $\alpha = 0.05$ with $m = 1$ retains only power $\approx 0.60$ at $m = 20$ and $\approx 0.25$ at $m = 200$ — a dramatic loss. The FDR framework of §20.6 sacrifices some of FWER’s protection to claw back this power, and is the right tool whenever false discoveries (rather than any false rejection whatsoever) are the cost metric.

20.4 Bonferroni and Šidák

Bonferroni is the starting point: reject $H_i$ iff $p_i \le \alpha/m$. It controls FWER without any dependence assumption and its proof is immediate.

Theorem 1 Bonferroni FWER control

The Bonferroni procedure rejects $H_i$ iff $p_i \le \alpha/m$. It controls FWER at level $\alpha$ in the strong sense under any joint dependence structure among the test statistics.

Proof Proof 1 — Bonferroni FWER control [show]

Let $V$ be the number of false rejections. The event $\{V \ge 1\}$ equals the union $\bigcup_{i \in H_0} \{p_i \le \alpha/m\}$. Applying the union bound (see formalML: ):

$$\Pr_{H_0}(V \ge 1) \;=\; \Pr_{H_0}\!\left(\bigcup_{i \in H_0} \{p_i \le \alpha/m\}\right) \;\le\; \sum_{i \in H_0} \Pr_{H_0}(p_i \le \alpha/m) \;\le\; \sum_{i \in H_0} \tfrac{\alpha}{m} \;=\; \tfrac{m_0 \alpha}{m} \;\le\; \alpha.$$

The first inequality is the union bound; the second uses that each $p_i$ for $i \in H_0$ is stochastically at least Uniform(0, 1), so $\Pr(p_i \le \alpha/m) \le \alpha/m$. No independence is needed. ∎ — using Boole’s inequality.

Proof 1 is three lines because Bonferroni is morally the union bound. The trade-off is that it uses the union bound even when better control is possible — independence-based procedures (Šidák) or step-down procedures (Holm) achieve larger per-test thresholds while retaining FWER control.

Theorem 2 Šidák FWER control (Šidák 1967)

Suppose the test statistics $T_1, \ldots, T_m$ are independent under $H_0$. Then the Šidák procedure — reject $H_i$ iff $p_i \le 1 - (1 - \alpha)^{1/m}$ — controls FWER in the strong sense: exactly $\alpha$ when $m_0 = m$, at most $\alpha$ otherwise.

Under independence, Šidák’s threshold is slightly larger than Bonferroni’s ($1 - (1 - \alpha)^{1/m} > \alpha/m$ for every $m \ge 2$, with equality at $m = 1$), but the gap is tiny: at $\alpha = 0.05$ and $m = 100$ the thresholds are $5.13 \times 10^{-4}$ (Šidák) vs $5.00 \times 10^{-4}$ (Bonferroni), a $2.6\%$ difference. Šidák is strictly more powerful when independence holds, but practitioners typically prefer Bonferroni because its universal validity (no dependence assumption) outweighs the marginal power gain when independence is questionable. Stated without proof; see formalML: .
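The threshold comparison is one line of arithmetic; a quick Python check of the numbers quoted above:

```python
alpha, m = 0.05, 100
bonferroni_thr = alpha / m                 # 5.00e-4
sidak_thr = 1 - (1 - alpha) ** (1 / m)     # ~5.13e-4, exact under independence
gap = (sidak_thr / bonferroni_thr - 1) * 100
print(f"Bonferroni {bonferroni_thr:.3e}  Sidak {sidak_thr:.3e}  gap {gap:.1f}%")
```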

Mixture of p-values under a typical multiple-testing scenario: $m = 1000$, $\pi_0 = 0.8$, moderate signal. The observed mixture (left panel) looks like uniform plus a concentration near zero; the ground-truth decomposition (right panel) shows the Uniform(0, 1) distribution of the 800 null p-values plus the near-zero concentration of the 200 alternative p-values.

Example 3 Bonferroni on $m = 20$ tests at $\alpha = 0.05$

Per-test threshold $\alpha/m = 0.0025$. A test with a $p$-value of $0.04$ — significant at the $\alpha = 0.05$ level for a single test — is not significant under Bonferroni. The union-bound-guaranteed FWER is at most $20 \times 0.0025 = 0.05$; under independence the exact FWER is $1 - (1 - 0.0025)^{20} \approx 0.0488$, slightly below the bound.

Remark 7 Bonferroni vs Šidák in practice

For essentially every routine application, the two give numerically indistinguishable thresholds. The deciding factor is the dependence story: Bonferroni is correct under any dependence; Šidák is exact under independence, conservative under positive dependence, and potentially anti-conservative under negative dependence. Textbook tables default to Bonferroni, and software exposes both (e.g., R’s p.adjust(method = 'bonferroni') alongside Šidák implementations). In practice the two almost always produce the same hypothesis-level decisions.

Remark 8 Bonferroni as a level-$\alpha$ test of the global null

The Bonferroni intersection test — “reject $H_I$ iff $\min_{i \in I} p_i \le \alpha / |I|$” — is itself a valid level-$\alpha$ test of the intersection hypothesis $\bigcap_{i \in I} H_i$. Applying closed testing with Bonferroni intersection tests produces exactly the Holm step-down procedure of §20.5 (LEH2005 §9.1). This is the cleanest way to see why Holm strictly dominates Bonferroni: the closed procedure exploits the fact that once a hypothesis is rejected, the remaining intersections shrink, and subsequent Bonferroni thresholds become less stringent.

20.5 Holm’s Step-Down and Hochberg’s Step-Up

Holm’s step-down procedure is uniformly more powerful than Bonferroni: every rejection Bonferroni makes, Holm also makes, and Holm can additionally reject hypotheses with $p$-values as large as $\alpha$ (the least stringent threshold, reached at the end of the step-down walk). It preserves Bonferroni’s strong FWER control without requiring independence.

Theorem 3 Holm step-down FWER control (Holm 1979)

Order the $p$-values ascending: $p_{(1)} \le p_{(2)} \le \cdots \le p_{(m)}$. Let $k^\dagger$ be the smallest rank $k$ such that $p_{(k)} > \alpha / (m - k + 1)$ (set $k^\dagger = m + 1$ if no such $k$ exists). The Holm procedure rejects the hypotheses corresponding to $p_{(1)}, \ldots, p_{(k^\dagger - 1)}$. This controls FWER at level $\alpha$ in the strong sense under any joint dependence structure.

Proof Proof 3 — Holm FWER control (induction on rank order) [show]

Let $H_0$ be the true-null configuration with $|H_0| = m_0$, and let $r^*$ be the smallest rank (in the ordering $p_{(1)} \le \cdots \le p_{(m)}$) at which a true null appears. In the worst case for Type I error — every null $p$-value larger than every alternative $p$-value — the first $m_1$ positions are alternatives and $r^* = m_1 + 1$; otherwise $r^*$ is smaller, which only tightens the threshold a false rejection must meet.

We show that $\Pr_{H_0}(V \ge 1)$ — the probability that Holm falsely rejects at least one true null — is at most $\alpha$. A false rejection requires Holm to walk past step $r^*$, which requires $p_{(r^*)} \le \alpha / (m - r^* + 1)$. Since $r^* \le m_1 + 1$, we have $m - r^* + 1 \ge m_0$, so this threshold is at most $\alpha/m_0$. The smallest of the $m_0$ null $p$-values — call it $p^*_{(1)}$, the first order statistic of the null $p$-values — is at most $p_{(r^*)}$, because the $p$-value at rank $r^*$ is itself a true null. So

$$\Pr_{H_0}(V \ge 1) \;\le\; \Pr_{H_0}\!\left(p^*_{(1)} \le \tfrac{\alpha}{m_0}\right).$$

The first order statistic of $m_0$ null $p$-values, each stochastically at least Uniform(0, 1), is at most $\alpha/m_0$ with probability at most $\alpha$ by the union bound:

$$\Pr_{H_0}\!\left(p^*_{(1)} \le \tfrac{\alpha}{m_0}\right) \;=\; \Pr_{H_0}\!\left(\min_{i \in H_0} p_i \le \tfrac{\alpha}{m_0}\right) \;\le\; \sum_{i \in H_0} \Pr_{H_0}\!\left(p_i \le \tfrac{\alpha}{m_0}\right) \;\le\; m_0 \cdot \tfrac{\alpha}{m_0} \;=\; \alpha.$$

Combining the two displays: $\Pr_{H_0}(V \ge 1) \le \alpha$. No dependence assumption is required because every inequality used is the union bound applied to events on a fixed set of null-indexed $p$-values. ∎ — using induction-on-rank + union bound; HOL1979.

The proof shows why Holm dominates Bonferroni: Bonferroni uses the union bound once at threshold $\alpha/m$; Holm uses it at threshold $\alpha/m_0$, which is larger when $m_0 < m$ (i.e., whenever there are real alternatives). The step-down structure is the device that lets the threshold depend on the number of surviving hypotheses.

Rank-dependent thresholds for Bonferroni (flat at $\alpha/m$) and Holm (staircase $\alpha/(m-k+1)$, ranging from $\alpha/m$ at the smallest p-value to $\alpha$ at the largest). Holm is never stricter than Bonferroni at any rank and ends far less strict; the “power gain” is the shaded region between the two.

Theorem 4 Hochberg step-up FWER control (Hochberg 1988)

Order the $p$-values ascending. Let $k^*$ be the largest rank $k$ such that $p_{(k)} \le \alpha / (m - k + 1)$ (set $k^* = 0$ if no such $k$ exists). The Hochberg procedure rejects $p_{(1)}, \ldots, p_{(k^*)}$. This controls FWER at level $\alpha$ under independence or positive regression dependence (PRDS).

Hochberg uses the same threshold function as Holm — $\alpha/(m-k+1)$ at rank $k$ — but walks the ranks in the opposite direction: step-up from the largest $p$-value down to the smallest, stopping at the first that passes, rather than step-down, stopping at the first that fails. It is uniformly more powerful than Holm when independence holds, at the cost of requiring that assumption. Stated without proof; see formalML: .

Example 4 Holm vs Bonferroni on $m = 20$ at $\alpha = 0.05$

Suppose 3 of the 20 $p$-values are below $0.0025$: specifically $p_{(1)} = 0.0003$, $p_{(2)} = 0.0012$, $p_{(3)} = 0.0021$, and $p_{(4)} = 0.004$. Bonferroni rejects the first three. Holm: $p_{(1)} = 0.0003 \le 0.05/20 = 0.0025$ (pass), $p_{(2)} = 0.0012 \le 0.05/19 \approx 0.00263$ (pass), $p_{(3)} = 0.0021 \le 0.05/18 \approx 0.00278$ (pass), $p_{(4)} = 0.004 \le 0.05/17 \approx 0.00294$? No — $0.004 > 0.00294$, so Holm stops and rejects $p_{(1)}, p_{(2)}, p_{(3)}$. Same rejections as Bonferroni in this case, but Holm’s threshold at rank 4 is less stringent than Bonferroni’s constant $0.0025$, so if $p_{(4)}$ had been, say, $0.0027$, Holm would additionally have rejected it while Bonferroni would not.

Example 5 Holm vs Hochberg on a step-up-favoring input

Take $m = 5$, $\alpha = 0.05$, ordered $p$-values $0.006, 0.012, 0.018, 0.030, 0.045$. Holm: $p_{(1)} = 0.006 \le 0.01$ (pass), $p_{(2)} = 0.012 \le 0.0125$ (pass), $p_{(3)} = 0.018 \le 0.0167$? No — $0.018 > 0.0167$, stop. Rejects 2. Hochberg: start at $p_{(5)} = 0.045 \le 0.05$? Yes — reject all 5. The same inputs give very different answers: Hochberg’s step-up detects the global pattern (“all five look moderately small”) while Holm’s step-down commits early. Hochberg is valid under independence; Holm makes no such assumption.
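Both walks are a few lines each; a Python sketch (helper names are ours) that reproduces this example and Example 4:

```python
def holm_count(pvals, alpha=0.05):
    """Step-down: stop at the first rank k with p_(k) > alpha/(m-k+1)."""
    p = sorted(pvals)
    m = len(p)
    for k, pk in enumerate(p, start=1):
        if pk > alpha / (m - k + 1):
            return k - 1
    return m

def hochberg_count(pvals, alpha=0.05):
    """Step-up: largest rank k with p_(k) <= alpha/(m-k+1)."""
    p = sorted(pvals)
    m = len(p)
    for k in range(m, 0, -1):
        if p[k - 1] <= alpha / (m - k + 1):
            return k
    return 0

p = [0.006, 0.012, 0.018, 0.030, 0.045]
print(holm_count(p), hochberg_count(p))   # Holm rejects 2, Hochberg rejects all 5
```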

Example 6 The explorer — changing $\pi_0$ at fixed $\alpha$ and $m$

Open the explorer below; set $m = 200$, $\alpha = 0.05$; sweep $\pi_0$ from $0.5$ (dense signal) to $0.99$ (sparse). At $\pi_0 = 0.5$ Holm and BH both recover most of the 100 alternatives; Bonferroni recovers perhaps $40$. At $\pi_0 = 0.99$ Holm and Bonferroni converge (both at $\approx 2$ rejections) while BH retains nearly all of the 2 expected true alternatives. This is the core FDR-vs-FWER story and it is best seen interactively.

Explorer plot: p-values on a log scale vs rank k (ascending); m₀ = 80 nulls, m₁ = 20 alternatives.
Procedure            R    V    S    FDP     Power
Bonferroni           6    0    6    0.0%    30.0%
Holm                 6    0    6    0.0%    30.0%
Benjamini–Hochberg   15   0    15   0.0%    75.0%

R = total rejections, V = false rejections (from nulls), S = true rejections (from alternatives). FDP = V/R; power = S/m₁.

Default exploration configuration; realistic for behavioral-science multi-test scenarios.

Example p-value (first shown): 0.0212. Click Resample to draw a fresh p-value vector at the same (m, π₀, δ, α); the relative ranking of procedures stays stable, but individual rejects shift.

Remark 9 Holm is 'always-on' Bonferroni

The step-down structure guarantees that Holm’s rejection set contains Bonferroni’s: if $p_{(k)} \le \alpha/m$, then a fortiori $p_{(k)} \le \alpha/(m-k+1)$. So every Bonferroni rejection is a Holm rejection, and Holm potentially adds more as the rank grows. There is no regime where using Bonferroni instead of Holm is better; use Holm by default.

Remark 10 Hochberg vs Holm: the independence price

Hochberg’s step-up is strictly more powerful than Holm’s step-down under independence, but not under arbitrary dependence. In GWAS (LD-correlated tests) and most A/B test multiplicity, positive dependence holds and Hochberg is safe and preferred. In high-dimensional regression where the design matrix is ill-conditioned, dependence structure can be complex and Holm is the conservative default.

Remark 11 Closed-testing derivation of Holm

Applying the closed-testing scaffold (§20.2 Thm 0) with Bonferroni intersection tests — “reject $H_I$ iff $\min_{i \in I} p_i \le \alpha / |I|$” — produces exactly the Holm step-down procedure. This is formalML: §9.1 Thm 9.1.2. The derivation explains why Holm’s rank-dependent threshold is $\alpha/(m - k + 1)$: at step $k$, the remaining candidate intersection has size $m - k + 1$, and its Bonferroni threshold is $\alpha/(m - k + 1)$.

Remark 12 Uniformly most powerful step-down?

No. Holm is admissible among step-down procedures that control FWER under arbitrary dependence (you can’t uniformly improve it), but it is not UMP. The closed-testing framework allows other intersection-test choices (Fisher combination, Simes, …) that can dominate Holm on specific alternative configurations while still controlling FWER in the strong sense. Procedure choice depends on the expected dependence and signal structure. See LEH2005 §9.1 for the admissibility theorems.

20.6 False Discovery Rate: A New Error Metric

Benjamini and Hochberg (1995) observed that FWER control is often overkill. In a genome-wide association study with $10{,}000$ tested SNPs, a practitioner flags $50$ as significant. They are content if $5$ of those $50$ are false positives (10% FDP) — that is the rate at which follow-up experiments will fail. They do not specifically want $\Pr(\text{any of the } 10{,}000 \text{ is a false positive}) \le 0.05$; that is an unreasonably stringent target that would shrink the flagged set to perhaps $5$ SNPs, most of which the practitioner already knew. The false discovery rate is the error metric that aligns with this reasoning: control the expected fraction of false rejections among all rejections, not the probability of any false rejection.

Definition 4 False discovery proportion (FDP) and false discovery rate (FDR)

For a multiple-testing procedure with rejection set $\mathcal{R}$ of cardinality $R = |\mathcal{R}|$ and false-rejection count $V$, the false discovery proportion is

$$\mathrm{FDP} \;=\; \begin{cases} V/R & \text{if } R \ge 1, \\ 0 & \text{if } R = 0. \end{cases}$$

The false discovery rate is its expectation:

$$\mathrm{FDR} \;=\; \mathbb{E}_{H_0}[\mathrm{FDP}].$$

A procedure controls FDR at level $\alpha$ if $\mathrm{FDR} \le \alpha$ for every true-null configuration $H_0$.

Definition 5 FDP vs FDR — the random-vs-expected distinction

FDP is a random variable (it depends on which tests happen to reject in this specific sample); FDR is its expectation over the data distribution. A procedure can have FDR $= 0.05$ while producing individual samples with FDP $= 0$ (zero false discoveries) or FDP $= 0.3$ (many false discoveries) — on average the false-discovery proportion is $5\%$. Bounding FDR is a weaker guarantee than bounding $\Pr(\mathrm{FDP} > \alpha)$, which is the “FDP-exceedance” control studied in more advanced literature (Lehmann–Romano 2005, Genovese–Wasserman 2006).
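The random-vs-expected distinction is easy to see by simulation. A small Python sketch (parameters and helper names are ours): it applies BH to independent p-values with 40 true nulls out of $m = 50$ and records the realized FDP in each trial — individual FDPs scatter widely while their average sits near $\pi_0 \alpha$:

```python
import random

def bh_threshold(pvals, alpha):
    """BH step-up: return p_(k*) for the largest k with p_(k) <= k*alpha/m (0 if none)."""
    ps = sorted(pvals)
    m = len(ps)
    for k in range(m, 0, -1):
        if ps[k - 1] <= k * alpha / m:
            return ps[k - 1]
    return 0.0

random.seed(0)
m, m0, alpha, trials = 50, 40, 0.05, 2000
fdps = []
for _ in range(trials):
    nulls = [random.random() for _ in range(m0)]              # true nulls: Uniform(0, 1)
    alts = [0.002 * random.random() for _ in range(m - m0)]   # strong signal near zero
    t = bh_threshold(nulls + alts, alpha)
    r = sum(p <= t for p in nulls + alts)                     # total rejections R
    v = sum(p <= t for p in nulls)                            # false rejections V
    fdps.append(v / r if r else 0.0)

mean_fdp = sum(fdps) / trials
print(f"mean FDP (estimates FDR) = {mean_fdp:.3f}; "
      f"target pi0*alpha = {(m0 / m) * alpha:.3f}; "
      f"max single-trial FDP = {max(fdps):.2f}")
```

Many trials have FDP exactly 0 while occasional trials run well above $\alpha$; only the average is controlled.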

Remark 13 FDR $\le$ FWER — the always-true inequality

For any procedure, $\mathrm{FDR} \le \mathrm{FWER}$. Proof: $\mathrm{FDP} \le \mathbf{1}\{V \ge 1\}$ almost surely (whenever $V \ge 1$, $\mathrm{FDP} \le 1$; whenever $V = 0$, $\mathrm{FDP} = 0$). Taking expectations: $\mathrm{FDR} = \mathbb{E}[\mathrm{FDP}] \le \Pr(V \ge 1) = \mathrm{FWER}$. So any FWER-controlling procedure automatically controls FDR — but FDR-controlling procedures (like BH) can achieve much more power by not controlling FWER, letting $\Pr(V \ge 1)$ float up while keeping the expected $V/R$ in check.

Remark 14 When FDR is the right metric, and when it isn't

FDR is appropriate when the cost of a false discovery is roughly linear in the number of false discoveries — e.g., each false hit triggers a follow-up experiment that wastes resources, and the total cost is proportional to the count. FWER is appropriate when any false discovery is catastrophic — e.g., a drug-approval decision, where one false positive leads to a product recall regardless of how many true positives also emerged. Regulatory biostatistics leans FWER; genomics and high-dimensional discovery lean FDR. Most A/B-test platforms split the difference: FWER for final launch decisions, FDR for exploratory dashboards.

Remark 15 Comparator: FWER and FDR side by side

The comparator below runs multiTestingMonteCarlo for a selected procedure across a grid of signal strengths $\delta \in [0, 4]$ at $m = 100$, $\pi_0 = 0.8$. For BH, the FDR curve stays pinned near $\alpha \cdot \pi_0 = 0.04$ (below the $\alpha = 0.05$ line) at all $\delta$, while the FWER curve climbs toward 1 as $\delta$ increases — BH does not control FWER. For Bonferroni / Holm, FWER stays near $\alpha$ and FDR slopes down. The power curve in the right panel shows BH’s advantage: at moderate $\delta$, BH’s power is more than double Bonferroni’s.

Click Run to evaluate the procedure across 9 values of δ ∈ [0, 4] with m = 100, π₀ = 0.80, 300 trials each.

Procedure: Benjamini–Hochberg, m = 100, π₀ = 0.80, α = 0.050, 300 MC trials at each of 9 δ grid points. FDR target under independence: α·π₀ = 0.040.

20.7 The Benjamini–Hochberg Procedure

We now prove the featured theorem of Topic 20.

Theorem 5 Benjamini–Hochberg FDR control (BH 1995)

Order the $p$-values ascending: $p_{(1)} \le \cdots \le p_{(m)}$. Let $k^*$ be the largest rank $k$ such that $p_{(k)} \le k\alpha/m$ (set $k^* = 0$ if no such $k$ exists). The Benjamini–Hochberg procedure rejects $H_{(1)}, \ldots, H_{(k^*)}$ — i.e., the $k^*$ hypotheses with the smallest $p$-values.

If the $p$-values are independent (or positively regression-dependent on the subset, PRDS), then

$$\mathrm{FDR} \;\le\; \pi_0 \cdot \alpha \;\le\; \alpha.$$

Proof Proof 5 — BH FDR control (featured) [show]

The argument decomposes the expected FDP into per-null contributions, each of which can be bounded by $\alpha/m$ using an independence-based lemma. We follow BH 1995 for the outline and cite BEN2001 Lemma 2.1 for the key step.

Step 1 — Write the FDP as a sum. For a specific realization of $(p_1, \ldots, p_m)$, let $R(k)$ denote the total number of rejections when BH is applied at level $k\alpha/m$ (so $R(k^*)$ is the actual number of rejections at level $\alpha$). Then the false-discovery count at the realized $k^*$ is $V = \sum_{i \in H_0} \mathbf{1}\{p_i \le k^* \alpha / m\}$, where the indicator captures whether null $i$ is rejected. Dividing by $\max(R, 1) = R(k^*) \vee 1$:

$$\mathrm{FDP} \;=\; \frac{V}{R \vee 1} \;=\; \sum_{i \in H_0} \frac{\mathbf{1}\{p_i \le k^* \alpha / m\}}{R \vee 1}.$$

Step 2 — Take expectation; condition on the other $p$-values. Taking $\mathbb{E}_{H_0}[\cdot]$ on both sides and pulling the sum outside the expectation (linearity):

$$\mathrm{FDR} \;=\; \sum_{i \in H_0} \mathbb{E}_{H_0}\!\left[\frac{\mathbf{1}\{p_i \le k^*(p_{-i}, p_i) \cdot \alpha/m\}}{R(p_{-i}, p_i) \vee 1}\right].$$

Here $p_{-i}$ denotes the vector of $p$-values excluding $p_i$, and both $k^*$ and $R$ are written as functions of the full $p$-value vector to emphasize that they depend on $p_i$ through the step-up search.

Step 3 — Apply the BEN2001 Lemma 2.1 independence identity. Under independence of the null $p$-values from the rest, Lemma 2.1 of BEN2001 gives, for each $i \in H_0$,

$$\mathbb{E}_{H_0}\!\left[\frac{\mathbf{1}\{p_i \le R\alpha/m\}}{R \vee 1}\right] \;\le\; \frac{\alpha}{m},$$

with equality in the continuous independent case. The content of the lemma is that the event $\{p_i \le R\alpha/m\}$ and the random denominator $R \vee 1$ are coupled in exactly the way BH’s linear-in-$k$ threshold $k\alpha/m$ is designed to exploit — each unit increment in $R$ brings in $\alpha/m$ more null-rejection mass in the numerator while adding $1/(R(R+1))$ worth of dilution in the denominator, and the two balance. See formalML: §§3–4 for the full conditional-expectation expansion; we use the stated bound as a black box.

Step 4 — Sum over the m0m_0 true nulls. Summing the per-null bound from Step 3 over all iH0i \in H_0:

$$\mathrm{FDR} \;=\; \sum_{i \in H_0} \mathbb{E}_{H_0}\!\left[\frac{\mathbf{1}\{p_i \le R \alpha / m\}}{R \vee 1}\right] \;\le\; \sum_{i \in H_0} \frac{\alpha}{m} \;=\; \frac{m_0 \cdot \alpha}{m} \;=\; \pi_0 \cdot \alpha \;\le\; \alpha. \quad\blacksquare$$

Proof complete, using linearity of expectation (Steps 1–2) and the BEN2001 Lemma 2.1 independence bound (Step 3); BEN1995 gives the original argument, BEN2001 the modern formalization.

The proof’s most striking feature is that BH’s FDR control is exactly $\pi_0 \alpha$, not just $\le \alpha$. The procedure leaves power on the table when $\pi_0 < 1$ — hence Storey’s adaptive correction in §20.8, which estimates $\pi_0$ from the data and applies BH at level $\alpha / \hat{\pi}_0$ to recover that slack.

BH step-up procedure on a random sample (m = 20, α = 0.10). Dots are sorted p-values; the green line is the BH diagonal kα/m; the dashed red line is the Bonferroni reference α/m. The step-up search walks from k = m downward, stopping at the first k where the p-value is below the diagonal. k* = 10: ranks 1 through 10 are rejected. Bonferroni at the same α rejects only the two p-values below the dashed line.

Example 7 BH on the canonical 10-vector (§6.2 T1)

$p = (0.0005, 0.003, 0.011, 0.021, 0.043, 0.051, 0.087, 0.21, 0.33, 0.55)$, $\alpha = 0.05$, $m = 10$. The BH threshold at rank $k$ is $k \cdot 0.005$. Walk from $k = 10$ down: $p_{(10)} = 0.55 > 0.05$, $p_{(9)} = 0.33 > 0.045$, …, $p_{(4)} = 0.021 > 0.020$ (fail), $p_{(3)} = 0.011 \le 0.015$ (pass — stop). $k^* = 3$; reject ranks 1, 2, 3. The critical point: $p_{(4)} = 0.021$ fails its threshold by a hair ($0.021$ vs $0.020$); had it been below, BH would have rejected 4 or more. The regression test T1 in testing.test.ts pins this behavior.
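The walk in Example 7 is a few lines of code. A minimal TypeScript sketch (the function name bhStepUp is ours, not part of the testing.test.ts harness):

```typescript
// Benjamini–Hochberg step-up on already-sorted p-values.
// Returns k*, the largest rank k with p_(k) <= k * alpha / m (0 if none).
function bhStepUp(sortedP: number[], alpha: number): number {
  const m = sortedP.length;
  for (let k = m; k >= 1; k--) {
    // walk down from k = m; stop at the first rank that passes
    if (sortedP[k - 1] <= (k * alpha) / m) return k;
  }
  return 0;
}

// Canonical 10-vector from Example 7 (§6.2 T1), alpha = 0.05.
const p = [0.0005, 0.003, 0.011, 0.021, 0.043, 0.051, 0.087, 0.21, 0.33, 0.55];
const kStar = bhStepUp(p, 0.05);
console.log(kStar); // 3 — p_(4)=0.021 fails 0.020, p_(3)=0.011 passes 0.015
```

Ranks 1 through kStar are then rejected; Bonferroni at the same level rejects only the two p-values below 0.005.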

Example 8 BH under independence — empirical FDR over 3000 MC trials

Using multiTestingMonteCarlo('bh', m = 200, π₀ = 0.8, δ = 3.0, α = 0.05, 3000, seed = 123), empirical FDR lands in $[0.030, 0.050]$. The theoretical bound is $\pi_0 \cdot \alpha = 0.040$. Empirical values below $0.040$ are normal — BH is conservative when the signal is strong, because strong alternatives drive $R$ up while $V$ stays fixed. The test harness T3 in testing.test.ts locks this range. The same simulation shows empirical FWER well above $0.50$: BH deliberately does not control FWER, and large samples with moderate signal produce many false rejections on their way to producing even more true rejections.
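A scaled-down Monte Carlo in the same spirit can be run directly. Everything below (the mulberry32 PRNG, Box–Muller normals, one-sided z-test p-values) is our own sketch, not the repo's multiTestingMonteCarlo implementation:

```typescript
// Monte Carlo check of BH's FDR guarantee: m = 200, pi0 = 0.8, delta = 3, alpha = 0.05.
const M = 200, PI0 = 0.8, DELTA = 3.0, ALPHA = 0.05, TRIALS = 300;

let state = 123 >>> 0;
const rand = (): number => { // mulberry32 PRNG, seeded for reproducibility
  state = (state + 0x6d2b79f5) >>> 0;
  let t = state;
  t = Math.imul(t ^ (t >>> 15), t | 1);
  t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
  return (((t ^ (t >>> 14)) >>> 0) + 0.5) / 4294967296; // uniform in (0, 1)
};
const randNormal = (): number =>
  Math.sqrt(-2 * Math.log(rand())) * Math.cos(2 * Math.PI * rand()); // Box–Muller

// Standard normal CDF via the Abramowitz–Stegun 7.1.26 erf approximation (|err| < 1.5e-7).
const phi = (x: number): number => {
  const y = Math.abs(x) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * y);
  const poly = t * (0.254829592 + t * (-0.284496736 +
    t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
  const e = 1 - poly * Math.exp(-y * y);
  return x >= 0 ? 0.5 * (1 + e) : 0.5 * (1 - e);
};

let fdpSum = 0;
const m0 = Math.round(M * PI0); // 160 true nulls, 40 alternatives
for (let trial = 0; trial < TRIALS; trial++) {
  // index i < m0: null (z ~ N(0,1)); else alternative (z ~ N(delta,1)); one-sided p-value
  const p = Array.from({ length: M }, (_, i) =>
    1 - phi(randNormal() + (i < m0 ? 0 : DELTA)));
  const order = [...p.keys()].sort((a, b) => p[a] - p[b]);
  let kStar = 0;
  for (let k = M; k >= 1; k--) {
    if (p[order[k - 1]] <= (k * ALPHA) / M) { kStar = k; break; }
  }
  const v = order.slice(0, kStar).filter((i) => i < m0).length; // false rejections
  fdpSum += v / Math.max(kStar, 1);
}
console.log((fdpSum / TRIALS).toFixed(3)); // lands near pi0 * alpha = 0.040
```

With only 300 trials the estimate is noisier than the 3000-trial harness, but it still sits in the neighborhood of the $\pi_0 \alpha$ bound.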

Example 9 Stepping through the algorithm

The visualizer below runs the step-up walk one rank at a time. Load the canonical T1 preset, press Step. The active rank descends from $k = 10$, lighting up red (fail) at ranks 10, 9, …, 4, then green (pass) at $k = 3$. The rejection region shades in, containing ranks 1 through 3.

[Interactive figure: sorted p-values against rank k (1 = smallest p-value), with the BH diagonal k·α/m and the Bonferroni reference line α/m.]
Step log

| k | p_(k) | k·α/m | decision |
|---|-------|-------|----------|
| 10 | 0.5500 | 0.0500 | fail |
| 9 | 0.3300 | 0.0450 | fail |
| 8 | 0.2100 | 0.0400 | fail |
| 7 | 0.0870 | 0.0350 | fail |
| 6 | 0.0510 | 0.0300 | fail |
| 5 | 0.0430 | 0.0250 | fail |
| 4 | 0.0210 | 0.0200 | fail |
| 3 | 0.0110 | 0.0150 | pass — stop |
| 2 | 0.0030 | 0.0100 | rejected (rank ≤ k*) |
| 1 | 0.0005 | 0.0050 | rejected (rank ≤ k*) |

Regression-test vector from §6.2 T1. BH rejects ranks 1–3 (p_(3) = 0.011 passes threshold 0.015; p_(4) = 0.021 fails 0.020), so k* = 3. The step-up search walks down from k = m, returning the first rank where p_(k) ≤ k·α/m.

Remark 16 Why the threshold is linear in $k$ (not Bonferroni's constant)

The BH threshold function $k \mapsto k\alpha/m$ has two endpoints: $\alpha/m$ at $k = 1$ (exactly Bonferroni) and $\alpha$ at $k = m$ (no adjustment). The linear interpolation between them is what makes the independence cancellation in Proof 5 Step 3 work: each additional rejection raises the threshold by $\alpha/m$, in exact proportion to the growth of the denominator $R \vee 1$, so the ratio stays bounded by $\alpha/m$. Any nonlinear threshold function breaks this. Storey’s adaptive variant (§20.8) rescales to $k\alpha/(m \cdot \hat{\pi}_0)$ — still linear, but with $\hat{\pi}_0$ shrinking the effective $m$.

Remark 17 BH under positive regression dependence

The BEN2001 Lemma 2.1 independence argument extends to the PRDS (positive regression dependence on a subset) regime — where, for each true null $i$, the conditional distribution of the full $p$-value vector given $p_i = u$ is stochastically non-decreasing in $u$. This covers most “positively correlated” scenarios in practice: LD-correlated SNPs in a GWAS, correlated coefficients in a regression with positively correlated predictors. Under PRDS, BH still controls FDR at $\pi_0 \alpha \le \alpha$. Under negative dependence, BH can fail; §20.8 Thm 6 (BY) handles this case at a $c_m = \sum_{k=1}^m 1/k$ power cost.

Remark 18 BH is not a closed-testing procedure

Unlike Holm, BH is not obtained from closed testing with Bonferroni intersection tests; those produce Holm. Nor does closed testing with Simes intersection tests produce BH — that construction, valid under independence, yields Hommel’s FWER procedure. The cleaner modern view: BH is its own step-up procedure, derived directly from the FDR-control calculation of Proof 5, independent of the closed-testing scaffold. Closed testing explains why FWER procedures are nested (Holm dominates Bonferroni); FDR procedures live in a separate algebraic universe.

20.8 Dependence and Adaptivity: BY and Storey

BH under arbitrary dependence (Benjamini–Yekutieli 2001) and BH adaptive to the true null fraction (Storey 2002) are the two major variants practitioners reach for when BH’s assumptions don’t hold or when additional power is available.

Theorem 6 Benjamini–Yekutieli FDR control under arbitrary dependence (BEN2001)

Let $c_m = \sum_{k=1}^m 1/k$ be the harmonic number. Apply BH with thresholds $k \alpha / (m \cdot c_m)$ at rank $k$. Under any joint dependence structure of the $p$-values (positive, negative, or arbitrary), this BY procedure controls FDR at $\pi_0 \alpha \le \alpha$.

The price of universal validity is the $c_m$ factor: $c_m \sim \log m + 0.577$, so for $m = 100$ the BY thresholds are $c_{100} \approx 5.19$ times smaller than BH’s — a roughly $5\times$ power loss. When independence or PRDS can reasonably be assumed, BH dominates. When dependence is genuinely unknown — rare in practice, but common in theoretical analyses of novel statistics — BY is the safe default. Stated without proof; see formalML: .
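The harmonic penalty is easy to quantify. A short sketch (helper names are ours):

```typescript
// Harmonic number c_m = 1 + 1/2 + ... + 1/m, the BY correction factor.
const harmonic = (m: number): number => {
  let s = 0;
  for (let k = 1; k <= m; k++) s += 1 / k;
  return s;
};

// BY threshold at rank k: BH's k * alpha / m, shrunk by c_m.
const byThreshold = (k: number, m: number, alpha: number): number =>
  (k * alpha) / (m * harmonic(m));

console.log(harmonic(100).toFixed(2)); // 5.19 — the ~5x power price at m = 100
```

The same helper shows the asymptotic growth: harmonic(10000) is about 9.79, so the BY penalty grows only logarithmically in $m$.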

Theorem 7 Storey adaptive q-values (STO2002)

Let $\hat{\pi}_0(\lambda) = |\{i : p_i > \lambda\}| / (m (1 - \lambda))$ at a tuning parameter $\lambda \in (0, 1)$ (canonically $\lambda = 0.5$). The Storey q-value at rank $k$ is

$$\hat{q}_{(k)} \;=\; \min_{j \ge k} \frac{\hat{\pi}_0 \cdot m \cdot p_{(j)}}{j}.$$

The Storey procedure rejects hypotheses with $\hat{q}_{(k)} \le \alpha$. Asymptotically (as $m \to \infty$), Storey controls FDR at exactly $\alpha$, recovering the $\pi_0$-slack that BH leaves on the table.

Storey’s $\hat{\pi}_0$ is the fraction of $p$-values above $\lambda$, scaled by $1/(1 - \lambda)$ to correct for the uniform density. Null $p$-values are uniform, so the expected number of them above $\lambda$ is $m_0(1 - \lambda)$; alternative $p$-values concentrate near zero and contribute little to the above-$\lambda$ count. So $\hat{\pi}_0 \approx m_0/m = \pi_0 + o_p(1)$. Applying BH at level $\alpha/\hat{\pi}_0$ — which the q-value construction does — produces tighter thresholds when $\pi_0 < 1$. Stated without proof; see formalML: . The qvalue Bioconductor package is the de facto bioinformatics implementation.
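The estimator and the q-value recursion fit in a few lines. A sketch assuming the canonical λ = 0.5 default (function names are ours, not the qvalue package API):

```typescript
// Storey's pi0 estimate: fraction of p-values above lambda, scaled by 1/(1 - lambda).
function storeyPi0(p: number[], lambda = 0.5): number {
  const above = p.filter((x) => x > lambda).length;
  return Math.min(1, above / (p.length * (1 - lambda)));
}

// q-values via the min-over-j >= k recursion, walking from the largest rank down.
function qValues(p: number[], lambda = 0.5): number[] {
  const m = p.length;
  const pi0 = storeyPi0(p, lambda);
  const order = [...p.keys()].sort((a, b) => p[a] - p[b]); // ranks by ascending p
  const q = new Array<number>(m);
  let running = Infinity;
  for (let k = m; k >= 1; k--) {
    const i = order[k - 1];
    running = Math.min(running, (pi0 * m * p[i]) / k); // enforces q_(k) = min_{j>=k} ...
    q[i] = running;
  }
  return q;
}

// On the canonical 10-vector, only 0.55 exceeds lambda = 0.5, so pi0-hat = 1/(10*0.5) = 0.2.
const pCanon = [0.0005, 0.003, 0.011, 0.021, 0.043, 0.051, 0.087, 0.21, 0.33, 0.55];
console.log(storeyPi0(pCanon)); // 0.2
```

With $\hat{\pi}_0 = 0.2$ the q-values are five times smaller than the raw BH-adjusted values — exactly the $\alpha/\hat{\pi}_0$ slack recovery described above (and a reminder that $\hat{\pi}_0$ is noisy at $m = 10$).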

FDR / FWER / power curves for Bonferroni, Holm, BH, and BY at three π₀ levels (0.5, 0.8, 0.95). The BH FDR curves stay pinned near απ₀ across all signal strengths; the FWER curves climb with δ. Bonferroni / Holm FWERs stay near α; their FDRs slope down. Power panels show BH's systematic advantage at moderate δ.

Example 10 BH vs Bonferroni at GWAS scale (m = 20,000, π₀ = 0.99)

Simulated $m = 20{,}000$ hypotheses, 200 true alternatives at effect size $\delta = 3.5$, $\alpha = 0.05$. Bonferroni’s threshold is $0.05 / 20{,}000 = 2.5 \times 10^{-6}$, so alternatives need $z$-scores above $\approx 4.7$ — the tails where only the strongest signals live. BH rejects all $p$-values below $k^* \cdot 0.05 / 20{,}000$ at the realized $k^*$, which for moderate signal lands in the hundreds. The genomics figure below (which will render once the supplementary cell in 20_multiple_testing.ipynb is executed) shows BH recovering roughly $3\times$ more true positives than Bonferroni at the same (empirical) FWER cost.

Realistic-scale demonstration at m = 20,000, π₀ = 0.99. Bonferroni recovers only the strongest signals; BH recovers approximately 3× more true alternatives at empirical FDR ≤ απ₀. Figure produced by 20_multiple_testing.ipynb Cell 9 (skipped in the default run — see the repo README for how to regenerate).
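The threshold arithmetic behind Example 10 can be checked directly (a sketch; the k* = 300 value below is purely illustrative, not a simulation result):

```typescript
// Bonferroni vs BH thresholds at GWAS scale, using the numbers from Example 10.
const m = 20000, alpha = 0.05;
const bonf = alpha / m;                      // 2.5e-6: every test held to this
const bhAt = (k: number) => (k * alpha) / m; // BH's cutoff grows with the realized k*
console.log(bonf, bhAt(300)); // 2.5e-6 vs 7.5e-4: a 300x looser cutoff if k* = 300
```

The ratio bhAt(k) / bonf is exactly k, which is why BH's advantage scales with the number of signals it finds.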

Example 11 The Ioannidis PPV inversion (IOA2005)

John Ioannidis’s 2005 paper asked: given a statistically significant finding ($p < 0.05$), what is the probability the corresponding alternative is true? The answer depends on the prior probability of the alternative. For a screening study with $\pi_0 = 0.99$ and moderate power $0.5$: expected true positives $= 0.01 \cdot m \cdot 0.5 = 0.005m$; expected false positives $= 0.99 \cdot m \cdot 0.05 = 0.0495m$. The positive predictive value is $\mathrm{PPV} = 0.005 / (0.005 + 0.0495) \approx 0.092$. A "$p < 0.05$" finding in this regime is a false positive $\approx 91\%$ of the time. Controlling FDR at $\alpha$ aligns the reported significance claim with $\mathrm{PPV} \ge 1 - \alpha$ for the rejected set, which is the practical guarantee that FWER alone does not deliver.
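The PPV arithmetic generalizes to any (π₀, power, α) triple. A sketch (the helper ppv is ours):

```typescript
// Ioannidis-style PPV for a screening regime: prior null fraction pi0, power, level alpha.
function ppv(pi0: number, power: number, alpha: number): number {
  const tp = (1 - pi0) * power; // expected true positives, per hypothesis
  const fp = pi0 * alpha;       // expected false positives, per hypothesis
  return tp / (tp + fp);
}

console.log(ppv(0.99, 0.5, 0.05).toFixed(3)); // 0.092 — ~91% of "significant" findings false
console.log(ppv(0.50, 0.8, 0.05).toFixed(3)); // a well-powered 50/50 prior flips the picture
```

Varying pi0 makes the point of Example 11 quantitative: PPV collapses as the prior null fraction approaches 1, no matter how clean the individual test is.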

Example 12 Gelman–Loken forking paths (GEL2014)

Even researchers who never consciously run multiple tests inflate $\alpha$ via unstated analytical flexibility: outcome choice, covariate selection, data-cleaning rules, subgroup definitions, alternative models. Each branching analytic decision, if it depends on the data, effectively increases $m$. GEL2014 estimate that a “single” analysis often corresponds to $m$ between $10$ and $100$ latent decisions, so the naive $p < 0.05$ threshold corresponds to an unadjusted FWER of roughly $0.4$ to $0.99$ under the global null. Pre-registration is the technical remedy; multiple-testing correction is the retroactive one.

Remark 19 Which dependence regime are you in?

A/B testing with fixed pre-specified metrics on a common user population: positive regression dependence (PRDS) — BH is appropriate. GWAS within a single chromosome: positive dependence via LD — BH is appropriate. Testing a single hypothesis across many unrelated datasets (pooled meta-analysis with unknown dependence): BY is the safer choice. Post-hoc subgroup analysis with unstated inclusion rules: no procedure saves you — this is the Gelman–Loken regime, and the fix is pre-registration, not a different threshold.

Remark 20 Storey's λ tuning and bootstrap alternatives

The canonical $\lambda = 0.5$ is a sensible default but not optimal for every distribution. STO2002 proposes a bootstrap-based tuning: compute $\hat{\pi}_0(\lambda)$ over a grid and pick the $\lambda^*$ minimizing estimated MSE. For large $m$ the choice barely matters; for $m < 500$, the bootstrap can stabilize estimates. The Bioconductor qvalue package offers the bootstrap tuning as an option; its default instead smooths $\hat{\pi}_0(\lambda)$ over the grid with a spline.

Remark 21 When to use Storey vs plain BH

If $\pi_0$ is genuinely close to 1 (GWAS: $\pi_0 \ge 0.99$), Storey and BH are numerically indistinguishable. The Storey adjustment matters most for moderately dense signals ($0.5 \le \pi_0 \le 0.9$) common in transcriptomics, proteomics, and moderate-dimensional regression coefficient testing. Bioinformatics pipelines default to Storey (via qvalue); statisticians default to plain BH unless they have explicit evidence that $\pi_0 \ll 1$.

Remark 22 The replication-crisis framing

The 2005–2016 replication-crisis literature (Ioannidis, Gelman–Loken, Open Science Collaboration 2015) pointed out that naive $p < 0.05$ in the face of unacknowledged multiplicity produces irreproducible findings at scale. The technical remedy is BH or Storey with appropriately calibrated $\alpha$; the sociological remedy is pre-registration, adversarial collaboration, and transparency about analytical forking. §17.12 Rem 16 and the corresponding bullet of §17.12 Rem 19 deferred to this discussion; both promises are now fulfilled.

20.9 Simultaneous Confidence Intervals

The test–CI duality of formalML: applies at every level: inverting a level-$\alpha/m$ test gives a $(1 - \alpha/m)$ CI. Applying the union bound (Bonferroni) or exact formula (Šidák) over $m$ parameters produces simultaneous confidence intervals — an $m$-tuple $(C_1, \ldots, C_m)$ such that

$$\Pr\bigl(\theta_i \in C_i \text{ for every } i \in \{1, \ldots, m\}\bigr) \;\ge\; 1 - \alpha.$$

This is the CI-side analog of FWER-controlled testing.

Theorem 8 Bonferroni / Šidák simultaneous CIs as test duals

Let $C_i^{(\alpha')}$ be a $(1 - \alpha')$ CI for $\theta_i$ obtained by inverting a level-$\alpha'$ test (Topic 19 Thm 1).

(a) Bonferroni simultaneous CIs. Set $\alpha' = \alpha/m$. Then

$$\Pr\bigl(\theta_i \in C_i^{(\alpha/m)} \text{ for every } i\bigr) \;\ge\; 1 - \alpha.$$

(b) Šidák simultaneous CIs. Assume $(C_1, \ldots, C_m)$ are based on independent test statistics. Set $\alpha' = 1 - (1 - \alpha)^{1/m}$. Then

$$\Pr\bigl(\theta_i \in C_i^{(\alpha')} \text{ for every } i\bigr) \;\ge\; 1 - \alpha,$$

with equality when every $\theta_i$ equals the value against which the test was constructed.

Both follow immediately from Topic 19 Thm 1 combined with the union bound (Bonferroni) or the independence product (Šidák). Stated without separate proof.

Three simultaneous CI constructions for m = 10 Normal means at n = 25, α = 0.05: unadjusted (marginal 95% coverage but joint coverage ≈ 0.60), Bonferroni-adjusted (joint coverage ≥ 0.95, each CI ≈ 1.43× wider than unadjusted), Šidák-adjusted (slightly narrower than Bonferroni, joint coverage ≈ 0.95 exactly under independence).

Example 13 Bonferroni simultaneous CIs for m = 10 means

Ten independent sample means $\bar{X}_i$ with SE $= 1/\sqrt{n}$, $n = 25$, $\alpha = 0.05$. The per-interval critical value is $z_{1 - \alpha/(2m)} = z_{0.9975} = 2.807$, so each Bonferroni CI is $\bar{X}_i \pm 2.807 / 5 = \bar{X}_i \pm 0.561$. The unadjusted critical value is $z_{0.975} = 1.96$, giving width $\pm 0.392$. The Bonferroni width is $2.807/1.96 \approx 1.43\times$ the unadjusted. Over 500 simulations with all true means $= 0$: empirical joint coverage for unadjusted is $\approx 0.60$; for Bonferroni $\approx 0.95$.

Example 14 Šidák slightly tighter under independence

Same setup as Ex 13. Šidák’s per-interval $\alpha' = 1 - 0.95^{1/10} = 0.00512$, so the critical value is $z_{1 - 0.00512/2} = z_{0.99744} = 2.800$ — only $0.007$ smaller than Bonferroni’s. CI width is $\approx \pm 0.560$ (vs Bonferroni’s $\pm 0.561$), a gain of $0.2\%$. The Šidák construction achieves exact joint coverage $0.95$ under independence, not just $\ge 0.95$, so the marginal gain is worth knowing about, though the practical difference is negligible.
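Both critical values can be reproduced without statistical tables, for instance by inverting a normal-CDF approximation with bisection (a sketch; the Abramowitz–Stegun erf approximation is accurate to about 1.5e-7):

```typescript
// Standard normal CDF via the Abramowitz–Stegun 7.1.26 erf approximation.
function phi(x: number): number {
  const y = Math.abs(x) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * y);
  const poly = t * (0.254829592 + t * (-0.284496736 +
    t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
  const erf = 1 - poly * Math.exp(-y * y);
  return x >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

// Quantile by bisection: phi is monotone, so 100 halvings pin z to machine-level accuracy.
function zQuantile(q: number): number {
  let lo = -10, hi = 10;
  for (let i = 0; i < 100; i++) {
    const mid = (lo + hi) / 2;
    if (phi(mid) < q) lo = mid; else hi = mid;
  }
  return (lo + hi) / 2;
}

const m = 10, alpha = 0.05, se = 1 / Math.sqrt(25);
const zBonf = zQuantile(1 - alpha / (2 * m));                       // ≈ 2.807
const zSidak = zQuantile(1 - (1 - Math.pow(1 - alpha, 1 / m)) / 2); // ≈ 2.800
console.log((zBonf * se).toFixed(3), (zSidak * se).toFixed(3));     // half-widths 0.561 0.560
```

Swapping in different m and alpha values reproduces the adjustment for any family size; the Šidák quantile is always the (barely) smaller of the two.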

[Figure: simultaneous CIs for parameters i = 1…10; empirical joint coverage over 200 trials: unadjusted 62.5%, Bonferroni 94.5%, Šidák 94.5%; target 1 − α = 0.95.]

Matches the Figure 8 parameter configuration exactly. The Bonferroni CIs use z = 2.807 (vs the unadjusted z = 1.960).

Remark 23 Scope of simultaneous inference

We have covered the general case where no additional structure relates the $m$ parameters. Several prominent special cases give tighter bounds:

  • Scheffé’s method for linear contrasts in ANOVA: if the parameters of interest are all linear combinations $\sum c_i \mu_i$ of group means in a one-way ANOVA, Scheffé’s simultaneous CIs use the $F$-distribution critical value, which is tighter than Bonferroni for large $m$. See formalML: or CAS2002 §11.2.4.
  • Tukey’s HSD for all pairwise mean differences: exact simultaneous CIs via the studentized range distribution, tighter than Bonferroni when all $\binom{m}{2}$ pairwise contrasts are of interest.
  • Working–Hotelling for regression mean-response bands: tighter simultaneous bands via the $F$-distribution over a contiguous parameter space.

These are Topic 20’s immediate neighbors and are covered in depth in Track 6’s regression material — §21.8 Rem 16 delivers both Bonferroni-adjusted CIs and Working–Hotelling bands with full F-distribution-based derivations.

Remark 24 Simultaneous CIs vs the Bonferroni-of-FDR temptation

An FDR procedure like BH controls the expected false-discovery proportion but does not produce a simultaneous CI — there is no “$(1 - \alpha)$ joint coverage” interpretation of BH rejections. When the practitioner reports “10 features rejected at BH $\alpha = 0.05$,” the claim is that $\le 5\%$ of those 10 are expected to be false rejections, not that the 10 parameters are jointly confined to specific intervals. For joint confidence statements on multiple parameters, Bonferroni / Šidák CIs are the correct tool; for discovery-rate-controlled flag lists, BH is correct.

Remark 25 Fulfils Topic 19 §19.10 Rem 19 #3

Topic 19 explicitly deferred simultaneous CIs to Topic 20. The construction above — Bonferroni-adjusted inversion of FWER-controlled tests — is the formal treatment Topic 19 promised. Every claim about test–CI duality in Topic 19 carries over unchanged; the adjustment is to the level (not the construction).

20.10 Track 5 Closes; Multiplicity Lives On

Track 5 — Hypothesis Testing & Confidence — ends here, 4 topics deep. The next tracks inherit this topic’s scaffolding rather than re-derive it:

  • Track 6 (Regression) uses Bonferroni-adjusted coefficient CIs (§21.8 Rem 16) as the default simultaneous-inference tool and BH for variable selection. F-tests for nested-model comparisons (§21.8 Thm 9, Example 8) are the closed-testing ancestor of Holm (§20.5 Rem 11).
  • Track 7 (Bayesian) contrasts FDR with Bayesian local-FDR (Efron 2010) — the posterior probability that an individual observation is from the null. Bayesian multiplicity reduces to posterior updating; no closed-testing scaffold is required.
  • Track 8 (High-Dimensional) builds knockoffs (Barber–Candès 2015) and online FDR (Javanmard–Montanari 2018) directly on top of BH. Every modern multiplicity method traces to §20.7 Thm 5.

The figure below maps Track 5’s arc into these downstream topics.

Track 5 spine (Topics 17, 18, 19, 20) with forward arrows to Track 6 (regression simultaneous inference), Track 7 (Bayesian multiplicity), Track 8 (knockoffs, online FDR), and formalml.com (feature selection, high-dimensional testing, always-valid inference). Back-arrow to Topic 19 emphasizes that the simultaneous CI construction of §20.9 is the direct dual of Topic 19 Thm 1.

Remark 26 Always-valid inference and e-processes

The setup of §20.3 assumes the $m$ hypotheses and stopping rule are fixed in advance. Always-valid confidence sequences (Howard, Ramdas, McAuliffe, Sekhon 2021) and e-processes (Vovk 2019; Grünwald, de Heide, Koolen 2024) dispense with this: the inferential guarantee holds at every stopping time, including data-dependent ones. The mathematical machinery is martingale-based and lives in Track 8 / formalml.com. Every A/B-testing platform that allows “peeking” without Type I inflation uses these tools.

Remark 27 Online FDR

What if hypotheses arrive sequentially and we must decide to reject immediately, without seeing future $p$-values? Offline BH no longer applies. The Online-BH procedure (Javanmard–Montanari 2018) and the SAFFRON / LORD / Adaptive-LORD family (Ramdas et al. 2018) control FDR in this streaming regime by spending an $\alpha$-wealth budget across time. formalML: lives in Track 8.

Remark 28 Knockoffs as BH generalization

Barber–Candès (2015) construct synthetic “knockoff” features $\tilde{X}_1, \ldots, \tilde{X}_m$ exchangeable with the real features $X_1, \ldots, X_m$ under the null, then apply a BH-like step-up on feature-statistic signs. The FDR control at $\alpha$ is the §20.7 Thm 5 result extended to a $p$-value-free setting — knockoffs work without the per-feature $p$-values that BH demands. This is the workhorse of modern high-dimensional variable selection.

Remark 29 Permutation multiplicity (Westfall–Young)

Westfall–Young (1993) permutation-based multiplicity adjustment computes FWER directly from the joint distribution of the $m$ test statistics via resampling, avoiding any dependence assumption. It is the gold standard for FWER control when strict validity is required (e.g., FDA-regulated settings) and computation is affordable. formalML: has the full treatment.

Remark 30 Bayesian multiplicity

In a fully Bayesian framework, multiplicity is handled by the prior: placing a sparse prior on the $m$-dimensional parameter vector (e.g., spike-and-slab, horseshoe) automatically shrinks most estimates toward zero, achieving a form of implicit multiple-testing correction. The posterior probability that the $i$-th alternative is true is the Bayesian counterpart to the frequentist q-value. Topic 25 §25.10 names this pointer; the full local-FDR framework is developed in Topic 27 §27.9.

Remark 31 Closed-testing full proof

We stated Thm 0 (closed testing) without proof in §20.2. The full Marcus–Peritz–Gabriel argument is 15–20 lines and lives in formalML: . The key step is the observation — already stated in §20.2 — that a false rejection requires the true-null intersection’s level-$\alpha$ test to reject, which happens with probability $\le \alpha$.

Remark 32 Hommel, Rom, and other step-wise procedures

Hommel (1988) and Rom (1990) are two additional stepwise procedures that sharpen Hochberg within the FWER-control hierarchy; Hommel’s is the exact shortcut of the Simes-based closed procedure. Both require independence (or PRDS); both improve on Hochberg’s power at the cost of less transparent critical constants. They are named here for completeness; practitioners overwhelmingly default to Hochberg or BH.


References

  1. Olive Jean Dunn. (1961). Multiple Comparisons Among Means. Journal of the American Statistical Association, 56(293), 52–64.
  2. Sture Holm. (1979). A Simple Sequentially Rejective Multiple Test Procedure. Scandinavian Journal of Statistics, 6(2), 65–70.
  3. Zbyněk Šidák. (1967). Rectangular Confidence Regions for the Means of Multivariate Normal Distributions. Journal of the American Statistical Association, 62(318), 626–633.
  4. Yosef Hochberg. (1988). A Sharper Bonferroni Procedure for Multiple Tests of Significance. Biometrika, 75(4), 800–803.
  5. Yoav Benjamini and Yosef Hochberg. (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society: Series B, 57(1), 289–300.
  6. Yoav Benjamini and Daniel Yekutieli. (2001). The Control of the False Discovery Rate in Multiple Testing Under Dependency. The Annals of Statistics, 29(4), 1165–1188.
  7. John D. Storey. (2002). A Direct Approach to False Discovery Rates. Journal of the Royal Statistical Society: Series B, 64(3), 479–498.
  8. John P. A. Ioannidis. (2005). Why Most Published Research Findings Are False. PLoS Medicine, 2(8), e124.
  9. Andrew Gelman and Eric Loken. (2014). The Statistical Crisis in Science. American Scientist, 102(6), 460–466.
  10. Erich L. Lehmann and Joseph P. Romano. (2005). Testing Statistical Hypotheses (3rd ed.). Springer.
  11. George Casella and Roger L. Berger. (2002). Statistical Inference (2nd ed.). Duxbury.