
Simple & Multiple Linear Regression

The first conditional model — OLS as orthogonal projection, Gauss–Markov's BLUE property, the Wilks-specialized F-test, and the diagnostic anchors that motivate Topic 22's GLM machinery.

21.1 From Predictions to Lines

Every scatter plot invites a line. A biologist plots height against age and wants to know how growth rate depends on sex. An economist plots a firm's revenue against advertising spend and wants the marginal return. A machine-learning engineer plots validation loss against training-set size and wants to extrapolate. In all three cases the question is the same: given paired observations (x_1, y_1), \ldots, (x_n, y_n), what linear function of x best explains y — and how much of the variation in y is it actually explaining?

Two-panel scatter of height-vs-age data. Left: the OLS line overlaid with residuals drawn as short vertical segments. Right: the same scatter without any line — 'where would you draw it?'

The classical answer — ordinary least squares (OLS) — picks the line that minimizes the sum of squared vertical distances from each point to the line. That choice is not arbitrary. It has a geometric meaning (the residual vector lies orthogonal to the column space of the design matrix), a probabilistic meaning (it is the maximum-likelihood estimator under Normal errors), and an optimality property (Gauss–Markov: among all linear unbiased estimators, OLS has the smallest variance). Topic 21 builds all three in turn.

Definition 1 Simple linear regression model

Given paired data (x_1, y_1), \ldots, (x_n, y_n) with the x_i treated as fixed (non-random), the simple linear regression model asserts

y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad i = 1, \ldots, n,

where \beta_0, \beta_1 \in \mathbb{R} are unknown parameters (the intercept and slope) and the errors \epsilon_1, \ldots, \epsilon_n are random variables with E[\epsilon_i] = 0 and \text{Var}(\epsilon_i) = \sigma^2 for all i. The errors are assumed uncorrelated: \text{Cov}(\epsilon_i, \epsilon_j) = 0 for i \neq j.

Example 1 Galton's height data and the origin of 'regression'

Galton's 1886 study of hereditary stature plotted adult-child height against mid-parent height and found the regression line's slope was less than 1 — tall parents tended to have tall children, but on average less tall than the parents. He called this "regression toward mediocrity" (later softened to "regression toward the mean"). Figure 1 shows the Galton-style scatter; the fit line has slope \approx 0.65, cutting through the 45° line at the sample mean. The vertical segments are residuals — the part of each child's height unexplained by the parental-height linear model.

Remark 1 'Linear' means linear-in-parameters, not linear-in-predictors

The model y_i = \beta_0 + \beta_1 x_i^2 + \beta_2 \log x_i + \epsilon_i is a linear regression: the response depends linearly on the unknown parameters (\beta_0, \beta_1, \beta_2), even though x_i^2 and \log x_i are nonlinear transformations of the predictor. The matrix form \mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\epsilon of §21.4 accommodates any such design matrix. What this topic does not cover is nonlinear regression — models like y_i = \beta_0 e^{-\beta_1 x_i} + \epsilon_i where \boldsymbol\beta enters nonlinearly. Those require iterative fitting (Gauss–Newton, Levenberg–Marquardt) and belong to Topic 22 or later.
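
To make the distinction concrete, here is a minimal NumPy sketch (our illustration, not part of the original remark; the column choices are arbitrary) of a design matrix that is nonlinear in x but linear in the parameters:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Columns (1, x^2, log x): nonlinear transformations of the predictor,
# but the model X @ beta is still linear in beta — an ordinary OLS problem.
X = np.column_stack([np.ones_like(x), x**2, np.log(x)])
print(X.shape)   # (5, 3): n rows, one column per parameter
```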

Remark 2 Historical note — Galton's etymology

The word "regression" comes from Galton's 1886 observation. A century earlier, Legendre (1805) and Gauss (1809, 1823) had introduced least squares for astronomical-orbit estimation. The two traditions — Galton's statistical "regression toward the mean" and Gauss's computational least-squares — merged into the modern discipline in the early 20th century, with Fisher's 1922 foundational paper on estimation theory providing the bridge.

21.2 The Least-Squares Criterion

What does “best fit” mean? OLS picks the line that minimizes the sum of squared residuals. For simple regression:

Definition 2 Sum of squared residuals (OLS objective)

Given data (x_1, y_1), \ldots, (x_n, y_n), the sum-of-squared-residuals loss is

L(\beta_0, \beta_1) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2.

The OLS estimator (\hat\beta_0, \hat\beta_1) is the minimizer of L over \mathbb{R}^2.

Squared loss is not the only option — absolute deviation (LAD / quantile regression) and Huber loss are standard alternatives — but OLS has three properties its competitors lack, and one of them is the featured theorem of this topic.

Theorem 1 OLS as orthogonal projection (featured)

In the simple-regression setup, let \hat{\mathbf{y}} = \hat\beta_0 \mathbf{1} + \hat\beta_1 \mathbf{x} denote the fitted values (vectors in \mathbb{R}^n). The OLS solution (\hat\beta_0, \hat\beta_1) is the unique pair for which the residual vector \mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} satisfies

\mathbf{1}^\top \mathbf{e} = 0, \qquad \mathbf{x}^\top \mathbf{e} = 0.

Equivalently, \mathbf{e} is orthogonal to the column space \text{col}(\mathbf{1}, \mathbf{x}) \subseteq \mathbb{R}^n, and \hat{\mathbf{y}} is the orthogonal projection of \mathbf{y} onto that subspace. The OLS solution is unique whenever \mathbf{x} is not constant.

Proof 1 — OLS as orthogonal projection

Setup. Let L(\boldsymbol\beta) = \|\mathbf{y} - \mathbf{X}\boldsymbol\beta\|^2 with \mathbf{X} = [\mathbf{1}, \mathbf{x}] the n \times 2 design matrix. L is a positive-semidefinite quadratic form in \boldsymbol\beta, so a minimizer \hat{\boldsymbol\beta} exists by continuity and coercivity.

First-order condition. Expanding L(\boldsymbol\beta) = \mathbf{y}^\top \mathbf{y} - 2 \boldsymbol\beta^\top \mathbf{X}^\top \mathbf{y} + \boldsymbol\beta^\top \mathbf{X}^\top \mathbf{X} \boldsymbol\beta and differentiating,

\nabla_{\boldsymbol\beta} L(\boldsymbol\beta) = -2 \mathbf{X}^\top (\mathbf{y} - \mathbf{X} \boldsymbol\beta).

Setting the gradient to zero at \hat{\boldsymbol\beta} yields the normal equations

\mathbf{X}^\top (\mathbf{y} - \mathbf{X} \hat{\boldsymbol\beta}) = \mathbf{0}.

Geometric reading. The normal equations say the residual \mathbf{e} = \mathbf{y} - \mathbf{X} \hat{\boldsymbol\beta} is orthogonal to every column of \mathbf{X}, hence to every vector in the column space \text{col}(\mathbf{X}). Since \hat{\mathbf{y}} = \mathbf{X} \hat{\boldsymbol\beta} lies in \text{col}(\mathbf{X}), we have decomposed \mathbf{y} = \hat{\mathbf{y}} + \mathbf{e} with \hat{\mathbf{y}} \in \text{col}(\mathbf{X}) and \mathbf{e} \perp \text{col}(\mathbf{X}) — the defining property of the orthogonal projection onto \text{col}(\mathbf{X}). Figure 2 renders the picture.

3D schematic. The column space of X is drawn as a tilted plane through the origin. The vector y extends above the plane. The fitted vector ŷ = X β̂ is the perpendicular projection of y onto the plane. The residual vector e = y − ŷ is the perpendicular segment from ŷ up to y. Labels: 'col(X)', 'y', 'ŷ = X β̂', 'e ⊥ col(X)'.

Uniqueness via Pythagoras. For any alternative \boldsymbol\beta^\ast \in \mathbb{R}^{p+1}, write \mathbf{X} \boldsymbol\beta^\ast = \mathbf{X} \hat{\boldsymbol\beta} - \mathbf{X}(\hat{\boldsymbol\beta} - \boldsymbol\beta^\ast). Then

\|\mathbf{y} - \mathbf{X} \boldsymbol\beta^\ast\|^2 = \|(\mathbf{y} - \mathbf{X} \hat{\boldsymbol\beta}) + \mathbf{X}(\hat{\boldsymbol\beta} - \boldsymbol\beta^\ast)\|^2.

The two summands on the right are orthogonal: \mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta} is the residual and \mathbf{X}(\hat{\boldsymbol\beta} - \boldsymbol\beta^\ast) \in \text{col}(\mathbf{X}), so the cross term vanishes. By Pythagoras,

\|\mathbf{y} - \mathbf{X} \boldsymbol\beta^\ast\|^2 = \|\mathbf{y} - \mathbf{X} \hat{\boldsymbol\beta}\|^2 + \|\mathbf{X}(\hat{\boldsymbol\beta} - \boldsymbol\beta^\ast)\|^2.

Hence L(\boldsymbol\beta^\ast) \geq L(\hat{\boldsymbol\beta}), with equality iff \mathbf{X}(\hat{\boldsymbol\beta} - \boldsymbol\beta^\ast) = \mathbf{0}. When \mathbf{X} has full column rank (the simple-regression case when \mathbf{x} is non-constant), this forces \boldsymbol\beta^\ast = \hat{\boldsymbol\beta}. ∎ — using positive-definite quadratic minimization and uniqueness of orthogonal projection onto a closed subspace

Solving the normal equations in the simple-regression case gives the closed form

\hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2}, \qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x.

Two-panel. Left: a scatter with the fitted OLS line and signed residual segments drawn vertically above and below the line. Right: a bar chart of the residuals in observation order, showing sum-to-zero and the symmetry enforced by the normal equations.
Example 2 Five-point worked example

Take the data \mathbf{x} = (1, 2, 3, 4, 5), \mathbf{y} = (1.2, 1.9, 3.3, 4.1, 5.0). Then \bar x = 3, \bar y = 3.1, \sum (x_i - \bar x)^2 = 10, \sum (x_i - \bar x)(y_i - \bar y) = 9.8, so

\hat\beta_1 = \frac{9.8}{10} = 0.98, \qquad \hat\beta_0 = 3.1 - 0.98 \cdot 3 = 0.16.

The fitted line \hat y = 0.16 + 0.98 x produces residuals (0.06, -0.22, 0.20, 0.02, -0.06), which sum to exactly zero — as the normal equation \mathbf{1}^\top \mathbf{e} = 0 forces — and show the signature of an OLS fit: positive and negative residuals balanced around zero.
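
A quick numerical check of the worked example — a minimal NumPy sketch (the library choice is ours; any numerical package works):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.3, 4.1, 5.0])

# Closed-form simple-regression OLS.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
e = y - (b0 + b1 * x)                    # residuals

print(b1, b0)                            # 0.98 0.16
print(e)                                 # ≈ [ 0.06 -0.22  0.20  0.02 -0.06]
print(np.isclose(e.sum(), 0.0),          # 1ᵀe = 0
      np.isclose(x @ e, 0.0))            # xᵀe = 0 — the normal equations hold
```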

Interactive explorer: drag any point (or press Tab, then arrow keys) to nudge it. The line is the live least-squares fit; the grey segments are residuals; the bar panel shows signed residuals in observation order, alongside a running readout of \hat\beta_0, \hat\beta_1, SSE, and \hat\sigma.

Remark 3 Why squared loss?

Three reasons, in order of historical importance. (1) Analytic tractability: squared loss is differentiable and convex, so the minimizer has a closed form via the normal equations. Absolute-deviation loss is neither — LAD requires iterative linear programming or specialized algorithms. (2) Normal-errors connection: under Gaussian \epsilon_i, OLS coincides with the MLE (see §21.5 Rem 9). (3) Optimality certificate: Gauss–Markov (§21.6) shows OLS is BLUE under second-order assumptions — no competitor in the linear-unbiased class beats it on variance. The cost is sensitivity to outliers: squared loss magnifies large residuals quadratically, so one bad observation can swing the fit. Topic 15's M-estimators (Huber, Tukey) trade some of Gauss–Markov's optimality for robustness; §21.9 Rem 20 states this forward pointer explicitly.

Remark 4 OLS equals the MLE under Normal errors (teaser)

If we add the stronger assumption \epsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2), the log-likelihood is

\log L(\beta_0, \beta_1, \sigma^2) = -\tfrac{n}{2} \log(2\pi\sigma^2) - \tfrac{1}{2\sigma^2} \sum (y_i - \beta_0 - \beta_1 x_i)^2.

Maximizing over (\beta_0, \beta_1) for any fixed \sigma^2 is identical to minimizing the sum of squared residuals — OLS and MLE pick the same (\hat\beta_0, \hat\beta_1). Full treatment in §21.5 Rem 9 via Topic 14's MLE machinery.

21.3 Properties of Simple OLS

With the OLS estimator in hand, we can ask: how well does \hat\beta_1 recover the true \beta_1? What fraction of variation in y does the fitted line explain? These are the first questions a practitioner asks, and their answers set up the inferential machinery of §21.5 and §21.8.

Definition 3 Fitted values, residuals, and $R^2$

Given OLS estimates (\hat\beta_0, \hat\beta_1), define

\hat y_i = \hat\beta_0 + \hat\beta_1 x_i \qquad \text{(fitted values)},
e_i = y_i - \hat y_i \qquad \text{(residuals)},
\text{SSE} = \sum_{i=1}^n e_i^2, \qquad \text{SST} = \sum_{i=1}^n (y_i - \bar y)^2, \qquad \text{SSR} = \text{SST} - \text{SSE}.

The coefficient of determination is R^2 = 1 - \text{SSE}/\text{SST} = \text{SSR}/\text{SST}, interpreted as the fraction of total variation in y explained by the linear fit.
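
Continuing the five-point example, a short sketch (again our illustration) computing the sums of squares and R^2:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.3, 4.1, 5.0])
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

sse = np.sum((y - b0 - b1 * x) ** 2)     # ≈ 0.096
sst = np.sum((y - y.mean()) ** 2)        # 9.70
print(f"R^2 = {1 - sse / sst:.4f}")      # R^2 ≈ 0.9901
```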

Theorem 2 Unbiasedness and variance of the simple-regression slope

Under Definition 1 (fixed x_i, zero-mean uncorrelated errors with common variance \sigma^2),

E[\hat\beta_1] = \beta_1, \qquad \text{Var}(\hat\beta_1) = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2}.

Brief derivation. Substituting y_i = \beta_0 + \beta_1 x_i + \epsilon_i into the closed form,

\hat\beta_1 = \frac{\sum (x_i - \bar x)(y_i - \bar y)}{\sum (x_i - \bar x)^2} = \beta_1 + \frac{\sum (x_i - \bar x) \epsilon_i}{\sum (x_i - \bar x)^2}.

(The \beta_0 and \bar x \bar\epsilon terms cancel because \sum (x_i - \bar x) = 0.) Taking expectations: E[\hat\beta_1] = \beta_1 + 0 = \beta_1. The variance: \text{Var}(\hat\beta_1) = \frac{\sum (x_i - \bar x)^2 \sigma^2}{[\sum (x_i - \bar x)^2]^2} = \sigma^2 / \sum (x_i - \bar x)^2. ∎

Example 3 A 95% CI for the slope

Continuing Example 2 with \sigma assumed known for illustration at \sigma = 0.3: \sum (x_i - \bar x)^2 = 10, so \text{SE}(\hat\beta_1) = 0.3 / \sqrt{10} \approx 0.095. A 95% Wald CI for \beta_1 is 0.98 \pm 1.96 \cdot 0.095 = (0.79, 1.17). In practice \sigma is unknown and we plug in \hat\sigma with a t-distribution critical value — see §21.8. The test–CI duality of Topic 19 tells us this CI is exactly the set of \beta_1 values a two-sided level-0.05 t-test would not reject.

Remark 5 $R^2$ interpretation and its limitations

R^2 is the fraction of y-variance captured by the linear fit. R^2 = 1 means perfect fit; R^2 = 0 means the fit is no better than the constant \bar y. Two caveats. First, R^2 never decreases when predictors are added (the enlarged column space cannot shrink the projection's accuracy), so it is useless as a model-comparison tool across models of different dimension. The adjusted R^2, 1 - (1 - R^2)(n - 1)/(n - p - 1), penalizes for predictor count but is still ad hoc; principled model selection lives in Topic 24 (AIC/BIC/CV). Second, high R^2 does not imply the model is correct — the Anscombe quartet of Example 10 exhibits R^2 \approx 0.67 for four datasets with wildly different residual structures, only one of which is actually well-modeled by a line. Residual diagnostics (§21.9) are non-negotiable.

Remark 6 Correlation and regression share arithmetic but answer different questions

The sample Pearson correlation r = \sum (x_i - \bar x)(y_i - \bar y) / \sqrt{\sum (x_i - \bar x)^2 \sum (y_i - \bar y)^2} is related to the slope by \hat\beta_1 = r \cdot s_y / s_x and to R^2 by r^2 = R^2 (in simple regression). But correlation is symmetric (r(x, y) = r(y, x)) while regression is directional — the slope of y on x is not the reciprocal of the slope of x on y unless r = \pm 1. Regression asks "how does E[Y \mid x] depend on x?", a conditional-expectation question; correlation asks "how much do x and y co-vary?", a joint-distribution question.

21.4 The Matrix Form

Simple regression (p = 1, single predictor) is the pedagogical on-ramp; the real payoff is multiple regression with p predictors plus an intercept. The matrix form unifies every result from §21.2–§21.3 into a single geometric picture that generalizes without change.

Notation. Throughout this topic and Track 6, we use column vectors by default: \mathbf{y}, \boldsymbol\beta, \hat{\boldsymbol\beta}, \boldsymbol\epsilon, \mathbf{x}_i are column vectors; their transposes are row vectors. Transposes are written \mathbf{X}^\top (not \mathbf{X}'). Vectors are bold lowercase; matrices are bold uppercase. The sample size is n; p denotes the number of non-intercept predictors, so the design matrix \mathbf{X} is n \times (p+1) with a leading column of ones, and the residual degrees of freedom are n - p - 1. Inner products are \mathbf{u}^\top \mathbf{v}; \|\mathbf{v}\|^2 = \mathbf{v}^\top \mathbf{v}. For \mathbf{A} symmetric positive-semidefinite we write \mathbf{A} \succeq \mathbf{0}; \mathbf{A} \succeq \mathbf{B} means \mathbf{A} - \mathbf{B} is PSD.

Definition 4 Design matrix, parameter vector, error vector

Given n observations with p non-intercept predictors, the design matrix \mathbf{X} is n \times (p + 1) with a leading column of ones and i-th row (1, x_{i1}, \ldots, x_{ip}). The parameter vector \boldsymbol\beta = (\beta_0, \beta_1, \ldots, \beta_p)^\top \in \mathbb{R}^{p+1} collects intercept and slopes. The error vector \boldsymbol\epsilon = (\epsilon_1, \ldots, \epsilon_n)^\top \in \mathbb{R}^n has E[\boldsymbol\epsilon] = \mathbf{0} and \text{Cov}(\boldsymbol\epsilon) = \sigma^2 \mathbf{I}_n. The linear regression model is

\mathbf{y} = \mathbf{X} \boldsymbol\beta + \boldsymbol\epsilon.
Theorem 3 Normal equations

Assuming \mathbf{X} has full column rank p + 1, the OLS estimator minimizing L(\boldsymbol\beta) = \|\mathbf{y} - \mathbf{X} \boldsymbol\beta\|^2 is the unique solution to the normal equations

\mathbf{X}^\top \mathbf{X} \hat{\boldsymbol\beta} = \mathbf{X}^\top \mathbf{y},

explicitly given by \hat{\boldsymbol\beta} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}.

Brief derivation. Differentiating L(\boldsymbol\beta) = \|\mathbf{y} - \mathbf{X}\boldsymbol\beta\|^2 and setting the gradient to zero gives -2 \mathbf{X}^\top (\mathbf{y} - \mathbf{X}\boldsymbol\beta) = \mathbf{0}, i.e., \mathbf{X}^\top \mathbf{X} \boldsymbol\beta = \mathbf{X}^\top \mathbf{y}. Full column rank makes \mathbf{X}^\top \mathbf{X} invertible, so the solution is unique. Proof 1 of §21.2 verified that this critical point is the global minimum via Pythagorean uniqueness. ∎
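
A sketch of the matrix form in NumPy (assumed simulation settings: n = 50, p = 2, \sigma = 0.3). In practice one avoids forming (\mathbf{X}^\top \mathbf{X})^{-1} explicitly and uses a QR/SVD-based least-squares solver; both routes agree at full column rank:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # n × (p+1), leading 1s
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(scale=0.3, size=n)

# Normal equations, solved without explicitly forming the inverse:
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically preferred route (QR/SVD); identical at full column rank:
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))                    # True
```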

Two-panel. Left: block diagram of a 5×3 design matrix X with an intercept column of ones, two predictor columns, paired with a 3-vector β and a 5-vector y in the equation y = Xβ + ε. Right: a 3×3 matrix Xᵀ X with its entries labeled (n, Σx_{i1}, Σx_{i1}², etc.) showing the sufficient-statistic structure that Topic 16 anticipated.
Example 4 Recovering the scalar formulas

For simple regression (p = 1, so \mathbf{X} has columns (\mathbf{1}, \mathbf{x})):

\mathbf{X}^\top \mathbf{X} = \begin{pmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{pmatrix}, \qquad \mathbf{X}^\top \mathbf{y} = \begin{pmatrix} \sum y_i \\ \sum x_i y_i \end{pmatrix}.

Solving \mathbf{X}^\top \mathbf{X} \hat{\boldsymbol\beta} = \mathbf{X}^\top \mathbf{y} by hand (apply the 2 \times 2 matrix inverse) reproduces

\hat\beta_1 = \frac{\sum (x_i - \bar x)(y_i - \bar y)}{\sum (x_i - \bar x)^2}, \qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x,

matching §21.2. The sufficient pair (\mathbf{X}^\top \mathbf{X}, \mathbf{X}^\top \mathbf{y}) is all the data tells OLS — no other aspect of (\mathbf{x}, \mathbf{y}) matters for the point estimate. This discharges the bullet in Topic 16 §16.12: OLS is a function of a Normal-linear-model sufficient statistic.

Remark 7 Full column rank is necessary

(\mathbf{X}^\top \mathbf{X})^{-1} exists if and only if \mathbf{X} has full column rank p + 1 — equivalently, no predictor is a linear combination of the others. When this fails (perfect collinearity), the normal equations have infinitely many solutions and OLS is undefined. In practice even near-collinearity (multicollinearity) destabilizes \hat{\boldsymbol\beta}: variances inflate, standard errors explode, individual coefficients become uninterpretable. The classical diagnostic is the variance inflation factor (VIF); the classical remedy is ridge regression (Topic 23). For Topic 21 we assume full column rank throughout.

Remark 8 Connection to conditional MVN — 'this is not an approximation'

When (Y, \mathbf{X}) are jointly multivariate Normal with mean (\mu_Y, \boldsymbol\mu_X) and covariance \begin{pmatrix} \sigma_Y^2 & \boldsymbol\Sigma_{YX} \\ \boldsymbol\Sigma_{XY} & \boldsymbol\Sigma_{XX} \end{pmatrix}, Topic 8 §8.4 shows the conditional expectation is exactly linear:

E[Y \mid \mathbf{X} = \mathbf{x}] = \mu_Y + \boldsymbol\Sigma_{YX} \boldsymbol\Sigma_{XX}^{-1} (\mathbf{x} - \boldsymbol\mu_X).

So for jointly Normal data, multiple regression is not an approximation — it recovers the population regression line exactly. For non-Normal data, OLS is fitting the best linear approximation to E[Y \mid \mathbf{X}] in mean-squared error, which may or may not be the true conditional mean. The Gauss–Markov theorem (§21.6) handles the "best" part; Rem 11 under Normality promotes it to "unique UMVUE."

21.5 Distributional Theory under Normal Errors

Gauss–Markov does not require Normal errors — only zero mean, uncorrelated, common variance. But for exact distributional statements — t-tests for individual coefficients, F-tests for nested models, confidence intervals — we need more. The standard additional assumption is Normality.

Definition 5 Normal linear model

The Normal linear model extends Definition 4 with the distributional assumption

\boldsymbol\epsilon \sim \mathcal{N}_n(\mathbf{0}, \sigma^2 \mathbf{I}_n),

i.e., the errors are iid \mathcal{N}(0, \sigma^2). The design matrix \mathbf{X} remains fixed (non-random).

Theorem 4 Sampling distribution of OLS

Under Definition 5 (Normal linear model, full column rank),

\hat{\boldsymbol\beta} \sim \mathcal{N}_{p+1}\bigl(\boldsymbol\beta, \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}\bigr).

Brief derivation. \hat{\boldsymbol\beta} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} is a linear combination of \mathbf{y}, which is multivariate Normal under the Normal linear model. Linear combinations of MVN vectors are MVN, so \hat{\boldsymbol\beta} is MVN with mean (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top E[\mathbf{y}] = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{X} \boldsymbol\beta = \boldsymbol\beta and covariance (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \text{Cov}(\mathbf{y}) \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}. ∎
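
A Monte Carlo sketch of Theorem 4 under assumed settings (n = 50, one predictor, \sigma = 1, 5000 replications on a single fixed design): the replication mean of \hat{\boldsymbol\beta} should match \boldsymbol\beta, and the replication covariance should match \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 50, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # fixed design, p = 1
beta = np.array([0.5, 2.0])

# 5000 replications of the Normal linear model on the SAME fixed design.
B = np.array([
    np.linalg.lstsq(X, X @ beta + rng.normal(scale=sigma, size=n), rcond=None)[0]
    for _ in range(5000)
])

print(B.mean(axis=0))                       # ≈ (0.5, 2.0): unbiasedness
print(np.cov(B.T))                          # ≈ σ²(XᵀX)⁻¹, printed next:
print(sigma**2 * np.linalg.inv(X.T @ X))
```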

Theorem 5 Independence of OLS and variance estimator; $\chi^2$ distribution of the SSE

Under the Normal linear model, let \hat\sigma^2 = \text{SSE} / (n - p - 1). Then

\hat{\boldsymbol\beta} \perp\!\!\!\perp \hat\sigma^2, \qquad \frac{(n - p - 1) \hat\sigma^2}{\sigma^2} \sim \chi^2_{n - p - 1}.

Brief derivation. Write \hat{\boldsymbol\beta} - \boldsymbol\beta = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \boldsymbol\epsilon and \mathbf{e} = (\mathbf{I} - \mathbf{H}) \boldsymbol\epsilon, where \mathbf{H} = \mathbf{X}(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top is the hat matrix (§21.7). Both are linear functions of \boldsymbol\epsilon; their cross-covariance is

\text{Cov}(\hat{\boldsymbol\beta} - \boldsymbol\beta, \mathbf{e}) = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \sigma^2 \mathbf{I}_n (\mathbf{I} - \mathbf{H})^\top = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top (\mathbf{I} - \mathbf{H}) = \mathbf{0}

because \mathbf{X}^\top (\mathbf{I} - \mathbf{H}) = \mathbf{X}^\top - \mathbf{X}^\top \mathbf{H} = \mathbf{X}^\top - \mathbf{X}^\top = \mathbf{0} (the defining property of the hat matrix). Uncorrelated Normal vectors are independent, so \hat{\boldsymbol\beta} \perp\!\!\!\perp \mathbf{e}, hence \hat{\boldsymbol\beta} \perp\!\!\!\perp \hat\sigma^2 = \|\mathbf{e}\|^2 / (n - p - 1).

For the \chi^2 distribution: \text{SSE} = \boldsymbol\epsilon^\top (\mathbf{I} - \mathbf{H}) \boldsymbol\epsilon. The matrix \mathbf{I} - \mathbf{H} is symmetric idempotent with rank n - p - 1 (see §21.7 Thm 7). A symmetric idempotent quadratic form in iid \mathcal{N}(0, \sigma^2) entries is \sigma^2 \cdot \chi^2_r where r is the rank — this is a classical application of the Fisher–Cochran decomposition, cited from Topic 16 Ex 21. Hence \text{SSE} / \sigma^2 \sim \chi^2_{n-p-1}. ∎

Two-panel. Left: MC histogram of β̂_1 from 5000 simulations at n=50, p=1, β_1=2, σ=1, overlaid with the theoretical Normal density — a tight match. Right: joint scatter of (β̂_0, β̂_1) across simulations with a 95% elliptical confidence contour showing the (XᵀX)⁻¹ correlation structure.
Remark 9 OLS is the MLE under Normal errors

The Normal log-likelihood is \ell(\boldsymbol\beta, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \|\mathbf{y} - \mathbf{X}\boldsymbol\beta\|^2. For any \sigma^2 > 0, the \boldsymbol\beta that maximizes \ell is the one that minimizes \|\mathbf{y} - \mathbf{X}\boldsymbol\beta\|^2 — exactly the OLS problem. So \hat{\boldsymbol\beta}_{\text{MLE}} = \hat{\boldsymbol\beta}_{\text{OLS}}. The MLE of \sigma^2 is \text{SSE}/n (biased downward by a factor (n-p-1)/n); the unbiased \hat\sigma^2 of Theorem 5 divides by n - p - 1 instead. See Topic 14 for the general MLE framework.

Remark 10 The pivotal quantity for coefficient inference

Combining Theorems 4 and 5: for each coefficient j,

\frac{\hat\beta_j - \beta_j}{\sqrt{\hat\sigma^2 [(\mathbf{X}^\top \mathbf{X})^{-1}]_{jj}}} \sim t_{n - p - 1}.

The numerator is \mathcal{N}(0, \sigma^2 [(\mathbf{X}^\top \mathbf{X})^{-1}]_{jj}); the denominator is \sqrt{\sigma^2 \chi^2_{n-p-1} / (n-p-1)}, independent of the numerator. Their ratio is Student's t_{n-p-1}. This pivotal quantity is the foundation of §21.8's confidence intervals and hypothesis tests — the direct inheritor of Topic 17 §17.7's one-sample t-test.

21.6 The Gauss–Markov Theorem

OLS is unbiased and has a tractable sampling distribution. Is it also good? Gauss and Markov together answer: among all linear unbiased estimators of \boldsymbol\beta, OLS has the smallest variance. This optimality certificate is the second pillar of the topic, and — unlike the distributional results of §21.5 — it does not require Normality.

Definition 6 Best Linear Unbiased Estimator (BLUE)

An estimator \tilde{\boldsymbol\beta} = \mathbf{C} \mathbf{y} is a linear estimator of \boldsymbol\beta. It is unbiased if E[\tilde{\boldsymbol\beta}] = \boldsymbol\beta for all \boldsymbol\beta \in \mathbb{R}^{p+1} — equivalently, \mathbf{C} \mathbf{X} = \mathbf{I}_{p+1}. Among all linear unbiased estimators, the Best Linear Unbiased Estimator (BLUE) is the one with the smallest covariance matrix in the PSD sense: \text{Cov}(\tilde{\boldsymbol\beta}) - \text{Cov}(\hat{\boldsymbol\beta}) \succeq \mathbf{0} for every competing \tilde{\boldsymbol\beta}.

Theorem 6 Gauss–Markov

Under Definition 4 (the linear regression model with E[\boldsymbol\epsilon] = \mathbf{0}, \text{Cov}(\boldsymbol\epsilon) = \sigma^2 \mathbf{I}_n, full column rank \mathbf{X}) — without any Normality assumption — the OLS estimator \hat{\boldsymbol\beta} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} is BLUE.

Proof 6 — Gauss–Markov

Setup. Assume E[\boldsymbol\epsilon] = \mathbf{0}, \text{Cov}(\boldsymbol\epsilon) = \sigma^2 \mathbf{I}_n. No Normal assumption.

Linear form. Any linear estimator of \boldsymbol\beta has the form \tilde{\boldsymbol\beta} = \mathbf{C} \mathbf{y} for some (p+1) \times n matrix \mathbf{C}.

Unbiasedness constraint. E[\tilde{\boldsymbol\beta}] = \mathbf{C} E[\mathbf{y}] = \mathbf{C} \mathbf{X} \boldsymbol\beta. For this to equal \boldsymbol\beta for every \boldsymbol\beta \in \mathbb{R}^{p+1}, we need

\mathbf{C} \mathbf{X} = \mathbf{I}_{p+1}.

Decomposition. Write \mathbf{C} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top + \mathbf{D} for some (p+1) \times n matrix \mathbf{D}. The unbiasedness constraint becomes

\mathbf{I} = \mathbf{C} \mathbf{X} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{X} + \mathbf{D} \mathbf{X} = \mathbf{I} + \mathbf{D} \mathbf{X},

so \mathbf{D} \mathbf{X} = \mathbf{0}. This is the load-bearing identity.

Covariance expansion. The covariance of a linear estimator is \text{Cov}(\mathbf{C} \mathbf{y}) = \mathbf{C} \text{Cov}(\mathbf{y}) \mathbf{C}^\top = \sigma^2 \mathbf{C} \mathbf{C}^\top. Expanding:

\mathbf{C} \mathbf{C}^\top = \bigl[(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top + \mathbf{D}\bigr]\bigl[\mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} + \mathbf{D}^\top\bigr] = (\mathbf{X}^\top \mathbf{X})^{-1} + \mathbf{D} \mathbf{D}^\top + (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{D}^\top + \mathbf{D} \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1}.

The last two cross-terms vanish by \mathbf{D} \mathbf{X} = \mathbf{0} (and its transpose \mathbf{X}^\top \mathbf{D}^\top = \mathbf{0}). Therefore

\mathbf{C} \mathbf{C}^\top = (\mathbf{X}^\top \mathbf{X})^{-1} + \mathbf{D} \mathbf{D}^\top.

Comparison. Multiplying by \sigma^2,

\text{Cov}(\tilde{\boldsymbol\beta}) = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1} + \sigma^2 \mathbf{D} \mathbf{D}^\top = \text{Cov}(\hat{\boldsymbol\beta}) + \sigma^2 \mathbf{D} \mathbf{D}^\top.

Since \mathbf{D} \mathbf{D}^\top \succeq \mathbf{0} (any matrix times its transpose is PSD), we have

\text{Cov}(\tilde{\boldsymbol\beta}) - \text{Cov}(\hat{\boldsymbol\beta}) = \sigma^2 \mathbf{D} \mathbf{D}^\top \succeq \mathbf{0},

with equality iff \mathbf{D} = \mathbf{0}, i.e., \tilde{\boldsymbol\beta} = \hat{\boldsymbol\beta}.

Scalar corollary. For any linear combination \mathbf{a}^\top \boldsymbol\beta with \mathbf{a} \in \mathbb{R}^{p+1}:

\text{Var}(\mathbf{a}^\top \tilde{\boldsymbol\beta}) - \text{Var}(\mathbf{a}^\top \hat{\boldsymbol\beta}) = \sigma^2 \mathbf{a}^\top \mathbf{D} \mathbf{D}^\top \mathbf{a} = \sigma^2 \|\mathbf{D}^\top \mathbf{a}\|^2 \geq 0.

∎ — using unbiasedness to derive \mathbf{D}\mathbf{X} = \mathbf{0}; covariance expansion; PSD property of \mathbf{D}\mathbf{D}^\top

Example 5 OLS vs a naive linear unbiased competitor

Take n = 4, \mathbf{x} = (1, 2, 3, 4), \boldsymbol\beta = (0, 1)^\top, \sigma = 1. OLS gives \text{Var}(\hat\beta_1) = 1/\sum(x_i - \bar x)^2 = 1/5 = 0.2. A naive competitor — "use only the first three observations to fit a simple regression" — is also linear and unbiased, with \text{Var}(\tilde\beta_1) = 1/\sum_{i=1}^3 (x_i - \bar x_3)^2 = 1/2 = 0.5. Gauss–Markov guarantees OLS wins; the gap here is a factor of 2.5. Figure 6 shows the full MC comparison.
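
A simulation sketch of Example 5 (our code; 5000 replications assumed):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.array([1.0, 2.0, 3.0, 4.0])

def slope(xs, ys):
    return np.sum((xs - xs.mean()) * (ys - ys.mean())) / np.sum((xs - xs.mean()) ** 2)

ols, naive = [], []
for _ in range(5000):
    y = x + rng.normal(size=4)         # true model: β0 = 0, β1 = 1, σ = 1
    ols.append(slope(x, y))            # all four observations
    naive.append(slope(x[:3], y[:3]))  # first three only — linear, unbiased, worse

print(np.mean(ols), np.mean(naive))    # both ≈ 1: unbiased
print(np.var(ols), np.var(naive))      # ≈ 0.2 vs ≈ 0.5 — the Gauss–Markov gap
```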

Two-panel. Left: MC histograms of β̂_1 from 5000 simulations — OLS (tight, centered at 1) vs 'first-3-points only' estimator (wider, also centered at 1). Right: variance as a function of n. OLS variance decreases as 1/n; the naive estimator's variance is flat (it never uses the extra data). Gauss–Markov guarantees the picture.
Remark 11 Under Normal errors, BLUE promotes to UMVUE

Gauss–Markov says OLS beats every linear unbiased competitor. Could a nonlinear unbiased estimator do better? Under Normal errors, Topic 16 Thm 4 (Lehmann–Scheffé) answers no: (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} is the unique UMVUE — minimum variance among all unbiased estimators, linear or not — because it is a function of a complete sufficient statistic for the Normal-linear-model exponential family. Without Normality, BLUE is the best we can claim; with Normality, BLUE = UMVUE. This is the §16.12 Track-6 promise discharged.

Remark 12 What the proof does not assume

Gauss–Markov requires only: zero-mean errors, uncorrelated errors, common variance \sigma^2, full column rank \mathbf{X}. It does not require Normality, independence (only uncorrelatedness), or even identical distribution — errors can follow different distributions as long as they share variance. The one structural assumption that is load-bearing is homoscedasticity: all errors have the same variance. When this fails — heteroscedastic errors, \text{Cov}(\boldsymbol\epsilon) = \sigma^2 \text{diag}(w_1, \ldots, w_n) with unequal w_i — OLS is no longer BLUE. The weighted least squares (WLS) estimator reweights observations by 1/w_i and is BLUE in the heteroscedastic setting; see §21.9 Rem 21 for the one-sentence forward pointer to Topic 22.

21.7 Geometry of Fitted Values

Proof 1 of §21.2 identified \hat{\mathbf{y}} as the orthogonal projection of \mathbf{y} onto \text{col}(\mathbf{X}). That projection has a matrix representation — the hat matrix — and a rich algebraic structure that powers the diagnostic machinery of §21.9.

Definition 7 Hat matrix and leverage

Under full column rank, the hat matrix is

\mathbf{H} = \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \in \mathbb{R}^{n \times n}.

It "puts the hat on \mathbf{y}": \hat{\mathbf{y}} = \mathbf{H} \mathbf{y}. The diagonal entries h_{ii} are the leverage values of observation i — measuring how much the i-th observation's own response contributes to its own fit.

Theorem 7 Hat matrix properties

The hat matrix \mathbf{H} is:

  1. Symmetric: \mathbf{H}^\top = \mathbf{H}.
  2. Idempotent: \mathbf{H} \mathbf{H} = \mathbf{H}.
  3. Rank and trace: \text{rank}(\mathbf{H}) = \text{tr}(\mathbf{H}) = p + 1.
  4. Complementary projector: \mathbf{I} - \mathbf{H} is also symmetric idempotent, with \text{rank}(\mathbf{I} - \mathbf{H}) = n - p - 1.

Brief derivation. Symmetry: \mathbf{H}^\top = \mathbf{X} [(\mathbf{X}^\top \mathbf{X})^{-1}]^\top \mathbf{X}^\top = \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top = \mathbf{H} (using symmetry of (\mathbf{X}^\top \mathbf{X})^{-1}). Idempotence: \mathbf{H} \mathbf{H} = \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top = \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top = \mathbf{H}. For rank and trace: a symmetric idempotent matrix has eigenvalues \in \{0, 1\}, so its trace equals its rank. Using the cyclic property of trace, \text{tr}(\mathbf{H}) = \text{tr}(\mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top) = \text{tr}((\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{X}) = \text{tr}(\mathbf{I}_{p+1}) = p + 1. Finally, \mathbf{I} - \mathbf{H} is symmetric and idempotent (direct check), with trace n - (p+1). ∎

Example 6 Explicit hat matrix for $n = 3$, $p = 1$

Take \mathbf{x} = (0, 1, 2)^\top, so \mathbf{X} = \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{pmatrix}. Then \mathbf{X}^\top \mathbf{X} = \begin{pmatrix} 3 & 3 \\ 3 & 5 \end{pmatrix}, (\mathbf{X}^\top \mathbf{X})^{-1} = \tfrac{1}{6} \begin{pmatrix} 5 & -3 \\ -3 & 3 \end{pmatrix}, and

\mathbf{H} = \tfrac{1}{6} \begin{pmatrix} 5 & 2 & -1 \\ 2 & 2 & 2 \\ -1 & 2 & 5 \end{pmatrix}.

Verify: row sums equal 1 (as they must when the first column of \mathbf{X} is \mathbf{1}), trace = (5 + 2 + 5)/6 = 2 = p + 1. The diagonal entries h_{11} = h_{33} = 5/6 \approx 0.83 are high — the endpoint observations have the most leverage. The middle observation has h_{22} = 2/6 \approx 0.33, the lowest in this 3-point design.
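
A short NumPy verification of Example 6 and Theorem 7's properties (our sketch):

```python
import numpy as np

X = np.column_stack([np.ones(3), [0.0, 1.0, 2.0]])
H = X @ np.linalg.inv(X.T @ X) @ X.T

print(np.round(6 * H))                  # [[ 5  2 -1] [ 2  2  2] [-1  2  5]]
print(np.allclose(H, H.T),              # symmetric
      np.allclose(H @ H, H),            # idempotent
      np.isclose(np.trace(H), 2.0),     # trace = p + 1 = 2
      np.allclose(H.sum(axis=1), 1.0))  # row sums 1 (intercept column present)
```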

Two-panel. Left: scatter with 12 observations, 3 highlighted amber as high-leverage (endpoints and outliers in x). Right: bar chart of leverage values h_ii in observation order, with the 2(p+1)/n threshold drawn as a horizontal line; the 3 amber observations exceed the threshold.
Remark 13 Leverage and its rule of thumb

Leverage h_{ii} \in [0, 1] measures how "far" observation i's predictor vector \mathbf{x}_i is from the center of the predictor cloud. Since \sum_i h_{ii} = p + 1, the average leverage is (p + 1)/n. The common rule of thumb flags observations with h_{ii} > 2(p + 1)/n as high-leverage. High leverage is not itself pathological — a well-designed experiment may deliberately include extreme \mathbf{x}_i values to reduce \text{Var}(\hat{\boldsymbol\beta}) — but it is a prerequisite for being influential: an observation with low residual but high leverage contributes little to visible fit quality and much to the estimated coefficients. The diagnostic machinery of §21.9 combines leverage with studentized residuals via Cook's distance.

Remark 14 Variance decomposition via Eve's law

The ANOVA identity \text{SST} = \text{SSR} + \text{SSE} is the sample analog of Topic 4's Eve's law: \text{Var}(Y) = E[\text{Var}(Y \mid \mathbf{X})] + \text{Var}(E[Y \mid \mathbf{X}]). The fitted values \hat y_i play the role of the conditional expectation E[Y \mid \mathbf{X} = \mathbf{x}_i]; the residuals e_i play the role of the conditional deviation. The decomposition is exact in-sample (Pythagoras: \|\mathbf{y} - \bar y \mathbf{1}\|^2 = \|\hat{\mathbf{y}} - \bar y \mathbf{1}\|^2 + \|\mathbf{y} - \hat{\mathbf{y}}\|^2 when the intercept is included, because \hat{\mathbf{y}} - \bar y \mathbf{1} \in \text{col}(\mathbf{X}) \cap \{\mathbf{1}\}^\perp) and is the geometric basis of the one-way ANOVA example in §21.8.

21.8 Hypothesis Tests and Confidence Intervals

The distributional theory of §21.5 plus Topic 19’s test–CI duality delivers the full inferential package: CIs for individual coefficients, tests for nested-model hypotheses, simultaneous inference for families of coefficients.

Definition 8 Coefficient t-statistic

For the j-th coefficient under the Normal linear model, the t-statistic for testing H_0: \beta_j = \beta_j^0 against the two-sided alternative is

T_j = \frac{\hat\beta_j - \beta_j^0}{\widehat{\text{SE}}(\hat\beta_j)}, \qquad \widehat{\text{SE}}(\hat\beta_j) = \sqrt{\hat\sigma^2 [(\mathbf{X}^\top \mathbf{X})^{-1}]_{jj}}.
Theorem 8 Coefficient t-distribution

Under the Normal linear model and H_0: \beta_j = \beta_j^0, T_j \sim t_{n - p - 1}. Hence a level-\alpha two-sided test rejects when |T_j| > t_{n - p - 1, 1 - \alpha/2}, and the corresponding Wald-t confidence interval at level 1 - \alpha is

\hat\beta_j \pm t_{n - p - 1, 1 - \alpha/2} \cdot \widehat{\text{SE}}(\hat\beta_j).

Brief derivation. This is Remark 10 restated: the numerator \hat\beta_j - \beta_j is \mathcal{N}(0, \sigma^2 [(\mathbf{X}^\top\mathbf{X})^{-1}]_{jj}) under H_0. The denominator is \hat\sigma \sqrt{[(\mathbf{X}^\top\mathbf{X})^{-1}]_{jj}} with \hat\sigma^2 \perp\!\!\!\perp \hat\beta_j and (n-p-1)\hat\sigma^2/\sigma^2 \sim \chi^2_{n-p-1}. Their ratio is a standardized \mathcal{N}(0, 1) over \sqrt{\chi^2_{n-p-1}/(n-p-1)} — by definition, t_{n-p-1}. The CI follows from Topic 19 Thm 1: a two-sided t-test and its inversion produce the symmetric interval above. ∎

Example 7 Testing $H_0: \beta_1 = 0$

In a simple regression at n = 30, suppose \hat\beta_1 = 0.45 with \widehat{\text{SE}}(\hat\beta_1) = 0.12. Then T_1 = 0.45 / 0.12 = 3.75 on t_{28}. The two-sided p-value is 2 \cdot P(T_{28} > 3.75) \approx 8 \times 10^{-4}, strongly rejecting the null at conventional levels. The 95% Wald-t CI is 0.45 \pm t_{28, 0.975} \cdot 0.12 = 0.45 \pm 2.048 \cdot 0.12 = (0.20, 0.70) — the interval excludes zero, consistent with the rejection.
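
The same arithmetic as a SciPy sketch (stats.t is SciPy's Student-t distribution):

```python
from scipy import stats

n, p = 30, 1
beta_hat, se = 0.45, 0.12
df = n - p - 1                                    # 28

t_stat = beta_hat / se                            # H0: beta_1 = 0 → 3.75
p_value = 2 * stats.t.sf(abs(t_stat), df)         # ≈ 8e-4
t_crit = stats.t.ppf(0.975, df)                   # ≈ 2.048
print(f"T = {t_stat:.2f}, p = {p_value:.1e}, "
      f"95% CI = ({beta_hat - t_crit * se:.2f}, {beta_hat + t_crit * se:.2f})")
```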

The individual coefficient test is the simplest case. For nested-model hypotheses — "do these k coefficients add anything?" — the t-test generalizes to the F-test, and the F-test has a beautiful interpretation as the finite-sample sharpening of Wilks' theorem.

Theorem 9 F-test as Wilks specialization (third full proof)

Let \mathcal{M}_1 be the full Normal linear model with p + 1 parameters and \mathcal{M}_0 \subset \mathcal{M}_1 be the reduced model obtained by imposing k linear restrictions \mathbf{R} \boldsymbol\beta = \mathbf{r} (so \mathcal{M}_0 has p + 1 - k free parameters). Let \text{SSE}_1, \text{SSE}_0 be the residual sums of squares under each. Under H_0: \mathbf{R} \boldsymbol\beta = \mathbf{r} and Normal errors,

F = \frac{(\text{SSE}_0 - \text{SSE}_1) / k}{\text{SSE}_1 / (n - p - 1)} \sim F_{k, n - p - 1}

exactly (not just asymptotically).

Proof 9 — F-test as Wilks specialization

Setup. Under \mathcal{M}_1, \mathbf{y} = \mathbf{X} \boldsymbol\beta + \boldsymbol\epsilon with \boldsymbol\epsilon \sim \mathcal{N}_n(\mathbf{0}, \sigma^2 \mathbf{I}). The restriction subspace under \mathcal{M}_0 has dimension p + 1 - k; its orthogonal complement within \text{col}(\mathbf{X}) has dimension k.

Independence via Fisher–Cochran. The quadratic forms \text{SSE}_1 = \|\mathbf{y} - \hat{\mathbf{y}}_1\|^2 and \text{SSE}_0 - \text{SSE}_1 = \|\hat{\mathbf{y}}_1 - \hat{\mathbf{y}}_0\|^2 project onto orthogonal subspaces of \mathbb{R}^n: the residual subspace (dim n - p - 1) and the restriction-complement subspace (dim k). Topic 16 Ex 21's Fisher–Cochran theorem says quadratic forms in a Normal vector in orthogonal subspaces are independent.

Chi-squared distributions. \text{SSE}_1 / \sigma^2 \sim \chi^2_{n-p-1} (Theorem 5). Under H_0, (\text{SSE}_0 - \text{SSE}_1)/\sigma^2 is a quadratic form in Normal residuals projected onto a k-dimensional subspace, hence \sim \chi^2_k (Fisher–Cochran on the orthogonal complement).

F-statistic. By definition of the F distribution (Topic 6 Def 13), the ratio of two independent chi-squareds, each divided by its degrees of freedom, is distributed F:

F = \frac{(\text{SSE}_0 - \text{SSE}_1)/k}{\text{SSE}_1/(n-p-1)} \sim F_{k, n-p-1}.

Wilks connection. The log-likelihood-ratio statistic for \mathcal{M}_0 vs \mathcal{M}_1 works out to

-2 \log \Lambda_n = n \log(\text{SSE}_0 / \text{SSE}_1) = n \log\bigl(1 + kF/(n-p-1)\bigr).

For large n at fixed k, n \log(1 + kF/(n-p-1)) = kF + O(n^{-1}) by Taylor expansion, so -2 \log \Lambda_n \xrightarrow{d} \chi^2_k — matching Topic 18 Thm 4 (Wilks). But we just proved that the exact finite-sample distribution of F is F_{k, n-p-1} under Normal errors — not just asymptotically \chi^2_k. The F-test is Wilks' theorem sharpened to an exact distribution under the additional Normal-errors assumption.

∎ — using Topic 18 Thm 4 (Wilks, §18.6), Topic 16 Ex 21 (Fisher–Cochran), Topic 6 Def 13 (F distribution)

Example 8 Partial F-test for joint significance

Fit a four-predictor model y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \epsilon_i at n = 50. To test H_0: \beta_2 = \beta_3 = 0 (predictors 2 and 3 together add nothing), fit both the full and the reduced (two-predictor) models. With k = 2, n - p - 1 = 45, suppose \text{SSE}_0 = 180, \text{SSE}_1 = 144. Then F = (36/2)/(144/45) = 18/3.2 = 5.625 on F_{2, 45}. Since F_{2, 45, 0.95} \approx 3.20, we reject at \alpha = 0.05; the p-value is \approx 0.0065. Predictors 2 and 3 jointly contribute even though, as individual t-tests might show, neither alone is significant at 5%.
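
Example 8's arithmetic as a SciPy sketch:

```python
from scipy import stats

n, p, k = 50, 4, 2
sse_reduced, sse_full = 180.0, 144.0
df2 = n - p - 1                                          # 45

F = ((sse_reduced - sse_full) / k) / (sse_full / df2)    # 5.625
print(F,
      stats.f.sf(F, k, df2),                             # p ≈ 0.0065
      stats.f.ppf(0.95, k, df2))                         # critical value ≈ 3.20
```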

Example 9 One-way ANOVA as a nested F-test

Three groups with n_1 = n_2 = n_3 = 10 observations each (n = 30). The full model is "group-specific means" — y_{ij} = \mu_j + \epsilon_{ij} for j \in \{1, 2, 3\} — fit by an all-dummy design matrix. The reduced model is "common mean" — y_{ij} = \mu + \epsilon_{ij} — fit by a column of ones. The nested F-test has k = 2 (two group-mean differences), n - p - 1 = 27. Its F statistic is exactly the classical ANOVA F, and its exact finite-sample distribution under H_0: \mu_1 = \mu_2 = \mu_3 is F_{2, 27}. Topic 4's Eve's law framing is the decomposition:

\underbrace{\sum_{i, j} (y_{ij} - \bar y)^2}_{\text{SST}} = \underbrace{\sum_j n_j (\bar y_j - \bar y)^2}_{\text{between-group}} + \underbrace{\sum_{i, j} (y_{ij} - \bar y_j)^2}_{\text{within-group}}.

ANOVA is regression. Topic 21 treats it as a running example rather than a separate chapter.

Interactive: F-test power explorer. Blue: the null F density, with the upper-\alpha = 0.05 rejection region shaded grey and the observed F statistic drawn as a dashed vertical line inside it. Amber: non-central F(\lambda) densities showing power increasing with the non-centrality \lambda — F-test power grows with the effect-size non-centrality.

Remark 15 Wald-t and profile-likelihood CIs coincide under Normal errors

Topic 19 Thm 1 gave two paths to a CI for any parameter: invert a Wald test (symmetric CI built on \hat\theta \pm z \cdot \widehat{\text{SE}}) or invert an LRT (possibly asymmetric CI built by profiling the likelihood). For a coefficient \beta_j in the Normal linear model, the profile log-likelihood is exactly quadratic — because the log-likelihood itself is quadratic in \boldsymbol\beta — so the Wald and LRT inversions produce identical intervals. Under non-Normal errors (GLMs of Topic 22), the profile likelihood is no longer quadratic and the two CIs diverge, with the LRT/profile interval usually better calibrated. For Topic 21's Normal linear model: they agree.

Remark 16 Simultaneous inference — Bonferroni and Working–Hotelling

A regression report with p + 1 coefficient CIs at individual level 1 - \alpha carries a family-wise error rate that grows with p + 1. Topic 20 §20.9 Thm 8 hands us the Bonferroni-adjusted CI: widen each interval to individual level 1 - \alpha/(p+1) and the simultaneous coverage is \geq 1 - \alpha. This is the default simultaneous-inference tool for regression reports.

For continuous coefficient trajectories — e.g., the fitted regression function \mu(\mathbf{x}) = \mathbf{x}^\top \boldsymbol\beta evaluated over a grid of \mathbf{x} — Bonferroni is too conservative (an infinite grid would require infinite widening). The Working–Hotelling (1929) confidence band uses the F distribution:

\mathbf{x}^\top \hat{\boldsymbol\beta} \pm \sqrt{(p+1) \cdot F_{p+1, n-p-1, 1-\alpha}} \cdot \widehat{\text{SE}}(\mathbf{x}^\top \hat{\boldsymbol\beta}).

The factor \sqrt{(p+1) F_{p+1, n-p-1, 1-\alpha}} replaces Bonferroni's z_{1 - \alpha/(p+1)} and delivers exact simultaneous coverage over the entire continuum of \mathbf{x} values. This discharges the §19.10 closer and the §20.9 Rem 23 forward-pointer.
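
A sketch of the Working–Hotelling band on simulated data (settings assumed: n = 40, one predictor, \alpha = 0.05; the grid and seed are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, p, alpha = 40, 1, 0.05
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.5 * x + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ (X.T @ y)
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / (n - p - 1)

grid = np.column_stack([np.ones(200), np.linspace(0, 10, 200)])
se_fit = np.sqrt(sigma2_hat * np.einsum("ij,jk,ik->i", grid, XtX_inv, grid))

# Working–Hotelling multiplier: sqrt((p+1) · F_{p+1, n-p-1, 1-α}).
W = np.sqrt((p + 1) * stats.f.ppf(1 - alpha, p + 1, n - p - 1))
lower = grid @ beta_hat - W * se_fit
upper = grid @ beta_hat + W * se_fit
# (lower, upper) covers the whole line x ↦ β0 + β1 x simultaneously with prob 1 − α.
```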

Figure: coefficient CIs for \hat\beta_0 through \hat\beta_4, comparing three constructions — Wald-t at individual level 1 - \alpha, LRT/profile (identical under Normal errors, Rem 15), and Bonferroni-adjusted (FWER \leq \alpha, each per-CI level widened accordingly, Rem 16) — with the point estimates \hat\beta_j and the true \beta_j marked. Working–Hotelling (Rem 16) gives the analogous continuum band.

21.9 Diagnostics and Model Validation

All of §21.5–§21.8 assumes the Normal linear model holds. What if it doesn’t? Regression diagnostics are the tools for detecting model failures — heteroscedasticity, non-linearity, outliers, influential points — without which the inferential machinery of the last four sections can produce confident nonsense. The Anscombe quartet (Example 10) is the canonical warning.

Remark 17 The residual-vs-fitted plot — the single most important diagnostic

Plot e_i on the vertical axis against \hat y_i on the horizontal. Under a correctly specified homoscedastic linear model, the scatter should be structureless — roughly symmetric around the horizontal line e = 0, with constant vertical spread across the range of \hat y. Three common pathologies: (1) a funnel shape (variance grows or shrinks with \hat y) signals heteroscedasticity; (2) a curved trend signals that the linear-in-parameters specification is missing a quadratic or other nonlinear term; (3) one or two extreme points signal outliers or influential observations. If the residual-vs-fitted plot looks random, that is most of the diagnostic battle.

Remark 18 Q-Q plots for Normal-errors diagnostics

A Q-Q plot of the studentized residuals against theoretical t_{n-p-1} quantiles tests the Normal-errors assumption of Definition 5. Points on the 45° line indicate Normality; systematic departures (S-shaped curves, heavy tails, skew) flag distributional misspecification. Under non-Normal errors the point estimate \hat{\boldsymbol\beta} is still unbiased and Gauss–Markov still applies, but the exact t- and F-distributions of §21.8 become approximations; for moderate n this is usually acceptable, but near the tails (extreme p-values, wide CIs) it matters.

Remark 19 Heteroscedasticity — HC-robust SEs are in Topic 22

When \text{Cov}(\boldsymbol\epsilon) = \sigma^2 \text{diag}(w_1, \ldots, w_n) with unequal w_i, OLS coefficient estimates remain unbiased but their standard errors from \hat\sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1} are wrong — usually too small, producing over-confident inferences. Topic 22 handles this two ways: (1) HC-robust (sandwich) standard errors — White 1980, HC0 through HC3 — replace the homoscedastic SE formula with one that is consistent under arbitrary heteroscedasticity; (2) weighted least squares (WLS) — reweighting observations by 1/w_i — delivers BLUE when the w_i are known. Topic 21 surfaces the pathology (Figure 9, RegressionDiagnosticsExplorer) but defers the fix.

Remark 20 Outliers, influential points, and Cook's distance

An outlier has a large residual. A high-leverage point has an extreme \mathbf{x}_i (large h_{ii}). An influential point has both, and Cook's distance

D_i = \frac{r_i^2}{p + 1} \cdot \frac{h_{ii}}{1 - h_{ii}}

(where r_i is the internally studentized residual) is the standard one-number summary. Thresholds of D_i > 1 or D_i > 4/n flag points whose removal would substantially shift \hat{\boldsymbol\beta}. For robust fitting — M-estimators (Huber, Tukey) that explicitly downweight large residuals — see Topic 15 for the moment-conditions generalization and Topic 22 for robust regression practice.
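
A diagnostic sketch (our illustration, with one planted high-leverage outlier) computing leverage, internally studentized residuals, and Cook's distance from the hat matrix:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 30, 1
x = rng.normal(size=n)
x[0] = 6.0                                   # plant a high-leverage x-value...
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.5 * x + rng.normal(scale=0.5, size=n)
y[0] += 3.0                                  # ...and make that point an outlier too

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                               # leverage values h_ii
e = y - H @ y                                # residuals (I - H) y
s2 = e @ e / (n - p - 1)
r = e / np.sqrt(s2 * (1 - h))                # internally studentized residuals
D = r**2 / (p + 1) * h / (1 - h)             # Cook's distance

print(np.where(h > 2 * (p + 1) / n)[0])      # leverage flags — includes index 0
print(np.where(D > 4 / n)[0])                # influence flags — includes index 0
```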

Remark 21 WLS as diagonal re-weighting of OLS

WLS with known diagonal weights w_i solves the transformed OLS problem \tilde{\mathbf{y}} = \tilde{\mathbf{X}} \boldsymbol\beta + \tilde{\boldsymbol\epsilon} where \tilde y_i = y_i / \sqrt{w_i} and \tilde{\mathbf{X}}_i = \mathbf{X}_i / \sqrt{w_i} — classical OLS applied to scaled data. The general case with non-diagonal \text{Cov}(\boldsymbol\epsilon) = \sigma^2 \boldsymbol\Omega is generalized least squares (GLS), solving \hat{\boldsymbol\beta}_{\text{GLS}} = (\mathbf{X}^\top \boldsymbol\Omega^{-1} \mathbf{X})^{-1} \mathbf{X}^\top \boldsymbol\Omega^{-1} \mathbf{y}. Both are BLUE in their respective settings — the Gauss–Markov theorem generalizes. Topic 22's GLM section treats GLS and WLS as special cases of iteratively reweighted least squares (IRLS).
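
A minimal WLS-as-rescaled-OLS sketch (assumed variance structure w_i = x_i^2, taken as known for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
x = rng.uniform(1, 5, n)
X = np.column_stack([np.ones(n), x])
w = x**2                                      # Var(ε_i) = σ² w_i, known here
y = 1.0 + 0.5 * x + rng.normal(size=n) * np.sqrt(w)

# WLS = OLS on rows rescaled by 1/√w_i.
s = 1.0 / np.sqrt(w)
beta_wls, *_ = np.linalg.lstsq(X * s[:, None], y * s, rcond=None)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_ols, beta_wls)   # both ≈ (1.0, 0.5); WLS is BLUE under this weighting
```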

4-panel grid of residual-vs-fitted plots. Top-left: 'ideal' — structureless cloud centered on zero. Top-right: 'heteroscedastic' — fan shape, spread grows with fitted value. Bottom-left: 'one high-leverage outlier' — a single point pulls the fit. Bottom-right: 'non-linearity' — residuals curve, showing a missing quadratic term. Each panel captions its failure mode.
Example 10 The Anscombe quartet — why residual plots are irreducible

Anscombe (1973) constructed four datasets with identical (n, \bar x, \bar y, \hat\beta_0, \hat\beta_1, \text{SE}(\hat\beta_1), R^2, \text{ANOVA } F) — every summary statistic rounds to the same value. Yet their residual structures are wildly different: dataset I is a correctly modeled linear scatter; II is a perfect quadratic misfit by the linear model; III is linear except for a single outlier that tilts the fit; IV is a single-x-value cluster plus one high-leverage point. Only dataset I warrants the regression report; the other three require different models or diagnostic intervention. The punchline: no suite of summary statistics is a substitute for looking at the residual plot. The RegressionDiagnosticsExplorer below cycles through five preset scenarios — ideal, heteroscedastic, outlier, non-linear, high-leverage — each producing a distinct residual-vs-fitted, Q-Q, scale-location, and leverage-Cook's-distance signature. Each failure mode is named, diagnosed, and linked to the appropriate Topic-22/23 remedy.

Interactive: RegressionDiagnosticsExplorer with four linked panels — Residuals vs Fitted, Normal Q-Q, Scale-Location (\sqrt{|\text{studentized } r|} vs fitted), and Residuals vs Leverage colored by Cook's D with the 2(p+1)/n leverage threshold marked — plus a diagnosis readout. In the 'ideal' preset the readout reads: all assumptions met; no pattern in residuals; Q-Q plot is near-linear; leverage values cluster below 2(p+1)/n.

21.10 Forward Map

Topic 21 built the machinery of the fixed-design homoscedastic Normal linear model. Five directions lead out, and the rest of Track 6 plus Tracks 7–8 plus formalml.com cover them.

Remark 22 Topic 22 — Generalized Linear Models

The Normal-errors assumption is the restrictive one. Binary outcomes (logistic regression), count outcomes (Poisson regression), positive-skewed outcomes (gamma regression) all fit into the exponential-family GLM framework: a linear predictor \eta_i = \mathbf{x}_i^\top \boldsymbol\beta composed with a link function g such that g(E[y_i]) = \eta_i. The normal equations generalize to iteratively reweighted least squares (IRLS); the F-test generalizes to the deviance test (an exact LRT); HC-robust SEs and GLS both live here.

Remark 23 Topic 23 — Ridge, Lasso, and Elastic Net

When p is large or \mathbf{X} is ill-conditioned, OLS variance explodes. Ridge regression adds \lambda \|\boldsymbol\beta\|_2^2 to the OLS objective, shrinking coefficients toward zero in exchange for bias — a direct bias-variance trade-off controlled by \lambda. Lasso uses \lambda \|\boldsymbol\beta\|_1 and produces exactly-zero coefficients (variable selection). Elastic net combines both. Topic 23 covers the penalized objective, coordinate-descent solvers, the LARS path, and cross-validated \lambda selection.

Remark 24 Topic 24 — Model Selection (AIC, BIC, CV)

Adjusted R^2 is a limited criterion. AIC (Akaike) penalizes by 2k; BIC (Schwarz) by k \log n; cross-validation estimates out-of-sample predictive performance directly, and Stone's 1977 theorem identifies LOO-CV with AIC asymptotically. Mallows' C_p is the Gaussian-linear special case. The three criteria diverge in the large-p regime: BIC is selection-consistent when the true model is in the candidate set, AIC and CV are minimax-rate optimal for prediction, and Yang 2005 proves the two properties are formally incompatible. Topic 24 develops the full theory.

Remark 25 Track 7 — Bayesian linear regression

A conjugate Normal-inverse-gamma prior on (\boldsymbol\beta, \sigma^2) yields a closed-form Normal-inverse-gamma posterior, with the posterior mean an explicit ridge-regression estimate (the prior precision plays the role of \lambda). The Bayesian CI is the credible interval — a subset of parameter space with posterior mass 1 - \alpha — which agrees with the Wald-t CI only under flat priors and as n \to \infty. The conditional-MVN calculation of Topic 8 §8.4 is the mechanical engine. Topic 25 §25.5 covers the scalar Normal–Normal-Inverse-Gamma case; non-conjugate priors are handled via MCMC (Topic 26), with the regression extension developed in Topics 27–28.

Remark 26 Track 8 — Nonparametric regression

When the linearity assumption itself is too restrictive, nonparametric regression estimates E[Y \mid \mathbf{x}] without assuming a parametric form. Kernel regression (Nadaraya–Watson), local polynomial regression, and smoothing splines all replace the parametric \mathbf{x}^\top \boldsymbol\beta with a data-driven smooth function. The bias-variance trade-off is controlled by a bandwidth or smoothing parameter; Track 8's Topic 30 (Kernel Density Estimation) treats the density-estimation theory, and the covariate-conditional (regression) extension is developed at formalml.com.

Remark 27 formalml.com — Mixed effects and causal inference

Mixed-effects (hierarchical) models add random coefficients for grouped data, handling repeated measurements or nested designs that violate the independent-errors assumption of §21.5. The formalML: Mixed effects topic covers the REML estimator and shrinkage. Causal inference via IV, DiD, RDD, and synthetic control all live in the linear-model framework with identifying restrictions; see formalML: Causal inference methods.

Remark 28 formalml.com — High-dimensional regression

When p > n, OLS fails and penalized methods (Topic 23) become essential. The theory of high-dimensional regression — sparsity rates, minimax bounds, double descent — lives at formalML: High-dimensional regression. The modern regime where p \gg n with kernel-like implicit regularization (neural networks, random features) is an active research area that Track 8 and formalml.com's theory topics touch.

Track-6 spine of Topics 21–24 with forward arrows to Track 7 (Bayesian linear regression), Track 8 (nonparametric regression), formalml.com (mixed effects, causal inference, high-dimensional regression). Back-arrows to Topic 18 (Wilks) and Topic 19 (CI duality). Color-coded: Track 6 in blue, Tracks 7–8 in grey, formalml.com in amber.

Track 5 closed with simultaneous inference; Track 6 opens with the first conditional model. Every §21.8 result — the F-test, the coefficient CIs, the Bonferroni adjustment, the Working–Hotelling band — is a direct application of Track 5's machinery to the geometry of §21.2 and §21.7. The next three Track 6 topics widen the scope: GLMs drop the Normal-errors assumption, regularization handles p \gg n, model selection chooses among nested models. But the orthogonal-projection picture of Proof 1 — residual \perp column space, fitted value as the closest point in the column space — is the geometric anchor that every extension rests on. When in doubt, draw Figure 2 again.


References

  1. Erich L. Lehmann and Joseph P. Romano. (2005). Testing Statistical Hypotheses (3rd ed.). Springer.
  2. George Casella and Roger L. Berger. (2002). Statistical Inference (2nd ed.). Duxbury.
  3. George A. F. Seber and Alan J. Lee. (2003). Linear Regression Analysis (2nd ed.). Wiley.
  4. Sanford Weisberg. (2005). Applied Linear Regression (3rd ed.). Wiley.
  5. William H. Greene. (2012). Econometric Analysis (7th ed.). Pearson.
  6. Holbrook Working and Harold Hotelling. (1929). Applications of the Theory of Error to the Interpretation of Trends. Journal of the American Statistical Association, 24(165A), 73–85.
  7. Francis Galton. (1886). Regression Towards Mediocrity in Hereditary Stature. Journal of the Anthropological Institute of Great Britain and Ireland, 15, 246–263.
  8. Carl Friedrich Gauss. (1823). Theoria Combinationis Observationum Erroribus Minimis Obnoxiae. Werke, IV, 1–108.
  9. Andrey A. Markov. (1900). Wahrscheinlichkeitsrechnung. Teubner.