intermediate 60 min read · April 19, 2026

Regularization & Penalized Estimation

When OLS and IRLS break under collinearity or separation, a penalty term on the loss restores well-posedness. Ridge gives MSE-improvement (Hoerl–Kennard 1970); lasso gives variable selection through soft-thresholding (Tibshirani 1996); elastic net combines them; coordinate descent (Tseng 2001) is the algorithmic engine, and cross-validation picks the penalty parameter.

23.1 When OLS and IRLS fail: collinearity and separation

Topic 21 built the OLS framework on a single working assumption: $\mathbf{X}^\top\mathbf{X}$ is invertible. Topic 22 extended this to $\mathbf{X}^\top\mathbf{W}\mathbf{X}$ for IRLS. Both assumptions fail in production. The canonical failure modes are multicollinearity (predictors so correlated that $\det(\mathbf{X}^\top\mathbf{X}) \to 0$) and separation (a logistic-regression hyperplane that perfectly classifies the training data, sending $\|\hat{\boldsymbol\beta}\| \to \infty$). Both pathologies have one fix in common: add a penalty term $\lambda P(\boldsymbol\beta)$ that grows with $\|\boldsymbol\beta\|$, making the penalized objective coercive even when the unpenalized one is not.

Example 1 OLS blows up under collinearity (prostate-cancer design)

The prostate-cancer dataset (HTF Table 3.3, used as the running benchmark throughout this topic) has 97 patients and 8 predictors. Two of those predictors — lcavol (log cancer volume) and lcp (log capsular penetration) — are tightly correlated ($\rho \approx 0.68$). If we artificially amplify that correlation by replacing lcp with $\rho \cdot \text{lcavol} + \sqrt{1 - \rho^2} \cdot \text{lcp}$ for $\rho \to 1$, the OLS coefficient vector $\hat{\boldsymbol\beta}$ does not just become unstable — it diverges. The figure below tracks $\|\hat{\boldsymbol\beta}\|_2$ as a function of the artificial correlation strength: by $\rho = 0.999$ the L²-norm has grown by three orders of magnitude. Individual coefficients on lcavol and lcp swing in opposite directions, fitting noise.

This is the §21.5 multicollinearity warning made concrete. The variance inflation factor (VIF) flags it; ridge regression fixes it. The cure works by adding $\lambda \|\boldsymbol\beta\|_2^2$ to the objective: the closed form $(\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}$ is well-defined as long as $\lambda > 0$, no matter how singular $\mathbf{X}^\top\mathbf{X}$ becomes.

Example 2 Logistic IRLS diverges under near-separation

Topic 22 §22.4 Rem 8 flagged separation as the second canonical pathology: when a hyperplane in predictor space perfectly classifies the binary response, the logistic log-likelihood is monotone increasing along the separating direction, so scaling $\boldsymbol\beta$ up along it never stops helping and the IRLS iterates run away to $\pm\infty$. That remark promised the fix would arrive in Topic 23. Here it is.

The EXAMPLE_10_GLM_DATA preset (n = 120, p = 2) is a near-separation problem: the two classes are almost linearly separable, with only a handful of overlap points. Bare glmFit from Topic 22 reports converged = false after the maximum iteration count — by iteration 50, $\|\boldsymbol\beta^{(t)}\|$ has crossed 100 and is still growing. Adding a ridge penalty with $\lambda = 0.1$ changes the picture immediately: convergence in around 10 outer iterations to a finite, well-determined estimate. The §23.6 Ex 8 worked example develops this in detail, but the punchline lands here in §23.1: regularization is not just a remedy for the squared-error case — it restores well-posedness for IRLS too.

Two-panel motivation. Left: OLS coefficient L² norm vs artificial correlation strength on the prostate-cancer design — diverges sharply as correlation approaches 1. Right: logistic-IRLS coefficient L∞ norm vs iteration on the near-separation preset — runaway divergence past iteration 30. Both pathologies are repaired by adding a penalty term.
Remark 1 Historical note

Penalized estimation is older than computer-era statistics. Tikhonov 1943 introduced $(\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}$ for integral-equation regularization in physics — 27 years before its statistical rediscovery. Hoerl & Kennard 1970 put ridge on the statistical map with the MSE-improvement theorem (§23.3 Thm 1) and the ridge-trace diagnostic. Tibshirani 1996 introduced the lasso, replacing $\|\cdot\|_2^2$ with $\|\cdot\|_1$ to obtain variable selection — a paper that has accumulated more than 40,000 citations. Firth 1993 had earlier introduced a Jeffreys-prior penalty as a separation cure for logistic regression; it is one of two practical answers to the §23.1 Ex 2 pathology, with ridge being the other (§23.6 Rem 13). The unifying framework — penalty as an additive term on the negative log-likelihood — is what we develop next.

23.2 The penalized-objective framework

Notation (extending §21.4 and §22.2). Throughout Topic 23, $\lambda \ge 0$ is the penalty parameter and $P : \mathbb{R}^{p+1} \to \mathbb{R}_{\ge 0}$ is a convex penalty function ($P(\boldsymbol\beta) = \|\boldsymbol\beta\|_2^2$ for ridge, $\|\boldsymbol\beta\|_1$ for lasso, $\alpha\|\boldsymbol\beta\|_1 + (1-\alpha)\|\boldsymbol\beta\|_2^2/2$ for elastic net). The soft-thresholding operator is $S_\lambda(x) = \text{sign}(x)\max(|x| - \lambda, 0)$, applied componentwise to vectors. The active set is $\mathcal{A}(\boldsymbol\beta) = \{j : \beta_j \neq 0\}$. The partial score at coordinate $j$ is $s_j(\boldsymbol\beta) = \frac{1}{n}\mathbf{x}_j^\top(\mathbf{y} - \mathbf{X}\boldsymbol\beta)$ (the $1/n$ is dropped when the objective carries no $1/n$ scaling, as in §23.3). We write $\boldsymbol\beta^\star$ for the data-generating “true” coefficient and $\hat{\boldsymbol\beta}(\lambda)$ for the regularization path.

Definition 1 Ridge objective

The ridge regression estimator at penalty parameter $\lambda \ge 0$ is

$$\hat{\boldsymbol\beta}_{\text{ridge}}(\lambda) \;=\; \arg\min_{\boldsymbol\beta \in \mathbb{R}^{p+1}}\; \tfrac12 \|\mathbf{y} - \mathbf{X}\boldsymbol\beta\|_2^2 \,+\, \tfrac{\lambda}{2} \|\boldsymbol\beta\|_2^2.$$

The closed form, derivable by setting the gradient $-\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\boldsymbol\beta) + \lambda\boldsymbol\beta$ to zero, is

$$\hat{\boldsymbol\beta}_{\text{ridge}}(\lambda) \;=\; (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}.$$

The matrix $\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I}$ is positive-definite for any $\lambda > 0$, so the estimator exists and is unique even when $\mathbf{X}$ is rank-deficient — the regression.ts ridgeFit implements exactly this closed form via Cholesky on the augmented Gram matrix. The $\tfrac{\lambda}{2}$ scaling on the penalty matches the lasso convention below ($\tfrac12$ on residuals throughout) and lines up with sklearn.linear_model.Ridge(alpha=λ) directly: alpha = λ gives identical fits.
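To make the closed form concrete, here is a minimal TypeScript sketch that forms $\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I}$ and solves the normal equations by Gaussian elimination. It is illustrative only — the name ridgeClosedForm is hypothetical, and this is not the regression.ts ridgeFit (which uses a Cholesky factorization and handles standardization and the intercept).

```ts
// Illustrative ridge closed form: solve (XᵀX + λI) β = Xᵀy.
// Assumes X is already standardized and carries no intercept column.
type Matrix = number[][];

function ridgeClosedForm(X: Matrix, y: number[], lambda: number): number[] {
  const n = X.length;
  const p = X[0].length;

  // Gram matrix XᵀX with λ added to the diagonal
  const A: Matrix = Array.from({ length: p }, (_, j) =>
    Array.from({ length: p }, (_, k) => {
      let s = 0;
      for (let i = 0; i < n; i++) s += X[i][j] * X[i][k];
      return s + (j === k ? lambda : 0);
    }),
  );

  // Right-hand side Xᵀy
  const b: number[] = Array.from({ length: p }, (_, j) => {
    let s = 0;
    for (let i = 0; i < n; i++) s += X[i][j] * y[i];
    return s;
  });

  // Gaussian elimination with partial pivoting (A is positive-definite for λ > 0)
  for (let col = 0; col < p; col++) {
    let piv = col;
    for (let r = col + 1; r < p; r++) {
      if (Math.abs(A[r][col]) > Math.abs(A[piv][col])) piv = r;
    }
    [A[col], A[piv]] = [A[piv], A[col]];
    [b[col], b[piv]] = [b[piv], b[col]];
    for (let r = col + 1; r < p; r++) {
      const f = A[r][col] / A[col][col];
      for (let c = col; c < p; c++) A[r][c] -= f * A[col][c];
      b[r] -= f * b[col];
    }
  }

  // Back-substitution
  const beta = new Array<number>(p).fill(0);
  for (let r = p - 1; r >= 0; r--) {
    let s = b[r];
    for (let c = r + 1; c < p; c++) s -= A[r][c] * beta[c];
    beta[r] = s / A[r][r];
  }
  return beta;
}
```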

Definition 2 Lasso objective

The lasso estimator at penalty parameter $\lambda \ge 0$ is

$$\hat{\boldsymbol\beta}_{\text{lasso}}(\lambda) \;=\; \arg\min_{\boldsymbol\beta \in \mathbb{R}^{p+1}}\; \tfrac12 \|\mathbf{y} - \mathbf{X}\boldsymbol\beta\|_2^2 \,+\, \lambda \|\boldsymbol\beta\|_1.$$

The regression.ts lassoFit adopts this convention (no $1/n$ on the residual term); to get the same fit out of sklearn.linear_model.Lasso(alpha) — whose objective is $(2n)^{-1}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \alpha\|\mathbf{w}\|_1$ — call it with alpha = λ/n. There is no closed form in general; coordinate descent (§23.4) is the standard solver. Unlike ridge, the lasso produces exact zeros in $\hat{\boldsymbol\beta}$ — the variable-selection property, established formally in Thm 2 below.

Definition 3 Penalized MLE

For a general log-likelihood $\ell(\boldsymbol\beta)$ and a convex penalty $P : \mathbb{R}^{p+1} \to \mathbb{R}_{\ge 0}$, the penalized maximum-likelihood estimator is

$$\hat{\boldsymbol\beta}(\lambda) \;=\; \arg\min_{\boldsymbol\beta}\; -\ell(\boldsymbol\beta) \,+\, \lambda P(\boldsymbol\beta).$$

This is the unifying framework: ridge is the squared-error special case with $P = \|\cdot\|_2^2$; lasso is $P = \|\cdot\|_1$; penalized GLMs (§23.6) replace $-\ell$ with the GLM negative log-likelihood. The Bayesian reading via §23.7 Thm 5 is that $\lambda P(\boldsymbol\beta)$ is the negative log-prior, making penalized MLE identical to MAP estimation.

Example 3 Ridge closed-form derivation

Setting $\nabla_{\boldsymbol\beta}\bigl[\|\mathbf{y} - \mathbf{X}\boldsymbol\beta\|_2^2 + \lambda\|\boldsymbol\beta\|_2^2\bigr] = \mathbf{0}$:

$$-2\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\boldsymbol\beta) + 2\lambda\boldsymbol\beta = \mathbf{0}.$$

Rearranging,

$$(\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})\boldsymbol\beta = \mathbf{X}^\top\mathbf{y}.$$

The Hessian $\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I}$ is positive-definite for $\lambda > 0$, so the stationary point is the unique global minimum. Compare to OLS (Topic 21 §21.4 Thm 3): the only difference is the $\lambda\mathbf{I}$ term added to the diagonal. Geometrically, ridge replaces an ill-conditioned matrix with a well-conditioned one by inflating its smallest eigenvalues from near-zero up to at least $\lambda$.

Three penalty objective surfaces side-by-side on a 2D coefficient space (β₁, β₂). Left: ridge — circular contours center at OLS solution; the constrained solution (intersection with a circle of radius t) lies on a smooth curve. Middle: lasso — diamond contours, constrained solution lands on a corner of the L¹ ball, producing exact zeros. Right: elastic-net — egg-shaped contours combining both, smoother than lasso but still capable of corner selection.
Remark 2 Why standardization matters

Both ridge and lasso are not scale-equivariant: rescaling a predictor changes its penalty contribution. If $X_j$ is recorded in centimeters rather than meters, its values are 100 times larger and its coefficient is 100 times smaller, so the penalty bites that coefficient far less hard — the fit depends on the units. Standard practice is to standardize predictors before fitting — center each $X_j$ at zero and scale to unit variance. The intercept is then absorbed by centering $y$, and the recovered coefficients are converted back to original units after the fit. The regression.ts implementation does this by default (standardize: true).

This standardization preserves the geometric clarity of the penalty (every coefficient sits on the same scale) at the cost of breaking equivariance under reparameterization — the price of a penalty that ignores the model’s sufficient-statistic structure (Topic 16 §16.4). For genuine equivariance, one would need a scale-invariant penalty like $\sum_j |\beta_j| \cdot \text{sd}(X_j)$, which is rarely used in practice.

Remark 3 The intercept is never penalized

Standard convention — adopted by glmnet, sklearn, and regression.ts — is to leave the intercept unpenalized. The intuition: shifting all predictors by a constant should not change the model, but it would change the intercept; penalizing the intercept would couple the model to the location of the predictors. Mechanically, after centering $y$ and standardizing $\mathbf{X}$, the intercept becomes $\bar{y}$ and is recovered post-fit as $\hat\beta_0 = \bar{y} - \sum_j \hat\beta_j \cdot \text{mean}(X_j)$. The penalty applies only to the slope coefficients $j = 1, \dots, p$.
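A small sketch of this standardize-then-recover convention, assuming plain column means and standard deviations; the helper names (standardizeColumns, unstandardize) are illustrative, not the regression.ts API.

```ts
// Illustrative standardize / back-transform helpers for penalized fits.
function standardizeColumns(X: number[][]): { Z: number[][]; means: number[]; sds: number[] } {
  const n = X.length;
  const p = X[0].length;
  const means = Array.from({ length: p }, (_, j) => X.reduce((s, row) => s + row[j], 0) / n);
  const sds = Array.from({ length: p }, (_, j) =>
    Math.sqrt(X.reduce((s, row) => s + (row[j] - means[j]) ** 2, 0) / n),
  );
  const Z = X.map((row) => row.map((v, j) => (v - means[j]) / sds[j]));
  return { Z, means, sds };
}

// Map standardized-scale slopes back to original units and recover the unpenalized
// intercept as β̂₀ = ȳ − Σⱼ β̂ⱼ · mean(Xⱼ), exactly the formula in the remark above.
function unstandardize(betaStd: number[], means: number[], sds: number[], yBar: number) {
  const beta = betaStd.map((b, j) => b / sds[j]); // slope in original units
  const intercept = yBar - beta.reduce((s, b, j) => s + b * means[j], 0);
  return { intercept, beta };
}
```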

Remark 4 The SVD perspective on ridge

Write $\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^\top$ with $\mathbf{D} = \text{diag}(d_1, \dots, d_{p+1})$ the singular values. The ridge fit can be written

$$\hat{\boldsymbol\beta}_{\text{ridge}}(\lambda) \;=\; \mathbf{V} \cdot \text{diag}\!\left(\frac{d_j}{d_j^2 + \lambda}\right) \cdot \mathbf{U}^\top\mathbf{y}.$$

OLS is the $\lambda = 0$ case: $\hat{\boldsymbol\beta}_{\text{OLS}} = \mathbf{V} \cdot \text{diag}(1/d_j) \cdot \mathbf{U}^\top\mathbf{y}$. Ridge replaces $1/d_j$ with $d_j/(d_j^2 + \lambda)$ — a smooth shrinkage that hits hardest where $d_j$ is small (the directions of low data variance). Lasso is non-uniform: its shrinkage acts in the original predictor coordinates, not the SVD coordinates, and produces exact zeros rather than smooth shrinkage. The two penalties differ both in what they shrink and how.

23.3 Two featured theorems: ridge MSE-improvement and lasso soft-thresholding

Topic 22’s §22.3 paired two featured theorems — the canonical-link score identity and IRLS = Fisher scoring = Newton — and earned its place in the topic’s structure as the load-bearing block. §23.3 mirrors that structure with two featured theorems of its own.

Theorem 1 Ridge MSE-improvement (Hoerl–Kennard 1970)

Let $\boldsymbol\beta \neq \mathbf{0}$ be the true coefficient vector under the fixed-design Normal linear model $\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\varepsilon$ with $\boldsymbol\varepsilon \sim \mathcal{N}_n(\mathbf{0}, \sigma^2\mathbf{I})$ and $\mathbf{X}$ of full column rank. Then there exists $\lambda^\star > 0$ such that for every coordinate $j$,

$$\text{MSE}\!\left(\hat\beta_j^{\text{ridge}}(\lambda^\star)\right) \;<\; \text{MSE}\!\left(\hat\beta_j^{\text{OLS}}\right).$$

In particular, the total MSE $\sum_j \text{MSE}(\hat\beta_j)$ is strictly smaller at $\lambda^\star$ than at $\lambda = 0$.

Proof 1 Ridge MSE-improvement (Hoerl–Kennard 1970)

Setup. Fix $\boldsymbol\beta \neq \mathbf{0}$, $\sigma^2 > 0$, and a full-column-rank design $\mathbf{X}$. Let $\mathbf{A}(\lambda) = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{X}$, so that $\mathbb{E}[\hat{\boldsymbol\beta}_{\text{ridge}}(\lambda)] = \mathbf{A}(\lambda)\boldsymbol\beta$ and $\text{Var}(\hat{\boldsymbol\beta}_{\text{ridge}}(\lambda)) = \sigma^2\mathbf{A}(\lambda)(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{A}(\lambda)^\top$. At $\lambda = 0$, $\mathbf{A}(0) = \mathbf{I}$ and we recover OLS: unbiased, with variance $\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$.

Total MSE decomposition. Writing $\text{MSE}_{\text{tot}}(\lambda) = \sum_j \mathbb{E}[(\hat\beta_j^{\text{ridge}} - \beta_j)^2] = \|\text{Bias}\|_2^2 + \text{tr}(\text{Var})$,

$$\text{MSE}_{\text{tot}}(\lambda) \;=\; \underbrace{\|(\mathbf{A}(\lambda) - \mathbf{I})\boldsymbol\beta\|_2^2}_{\text{squared bias}} \;+\; \underbrace{\sigma^2\,\text{tr}\!\left[\mathbf{A}(\lambda)(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{A}(\lambda)^\top\right]}_{\text{total variance}}.$$

SVD coordinates. Write $\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^\top$ with $\mathbf{D} = \text{diag}(d_1, \dots, d_{p+1})$, $d_j > 0$. In the rotated coordinates $\boldsymbol\gamma = \mathbf{V}^\top\boldsymbol\beta$ (so $\|\boldsymbol\gamma\|_2 = \|\boldsymbol\beta\|_2$), the MSE decomposes coordinatewise:

$$\text{MSE}_{\text{tot}}(\lambda) \;=\; \sum_{j=1}^{p+1}\!\left[\frac{\lambda^2\gamma_j^2}{(d_j^2 + \lambda)^2} \;+\; \frac{\sigma^2 d_j^2}{(d_j^2 + \lambda)^2}\right].$$

First-order Taylor expansion at $\lambda = 0$. Expanding each summand (holding $j$ fixed):

  • Squared-bias contribution: $\lambda^2\gamma_j^2 / (d_j^2 + \lambda)^2 = \lambda^2\gamma_j^2/d_j^4 + O(\lambda^3)$ — of order $O(\lambda^2)$ at small $\lambda$.
  • Variance contribution: $\sigma^2 d_j^2 / (d_j^2 + \lambda)^2 = \sigma^2/d_j^2 - 2\sigma^2\lambda/d_j^4 + O(\lambda^2)$ — of order $O(\lambda)$ at small $\lambda$, with a negative coefficient.

Derivative at $\lambda = 0$. Summing over $j$:

$$\left.\frac{d\,\text{MSE}_{\text{tot}}}{d\lambda}\right|_{\lambda = 0} \;=\; -2\sigma^2 \sum_{j=1}^{p+1} \frac{1}{d_j^4} \;<\; 0.$$

Because the derivative is strictly negative at $\lambda = 0$ and $\text{MSE}_{\text{tot}}$ is continuous (indeed smooth) in $\lambda$, there exists $\lambda^\star > 0$ with $\text{MSE}_{\text{tot}}(\lambda^\star) < \text{MSE}_{\text{tot}}(0)$.

Component-wise strengthening. The same argument applied to each diagonal entry of the MSE matrix shows that the derivative of every component $\text{MSE}(\hat\beta_j^{\text{ridge}})$ is strictly negative at $\lambda = 0$, so each component is strictly smaller than $\text{MSE}(\hat\beta_j^{\text{OLS}})$ on some interval $(0, \lambda_j^\star]$. Taking $\lambda^\star = \min_j \lambda_j^\star$ gives uniform strict improvement. ∎ — using the SVD decomposition of $\mathbf{X}$ (Topic 8 §8.5) and the MSE = bias² + variance decomposition (Topic 13 §13.2).

The sign of the first-derivative term is the entire argument: variance shrinks linearly in $\lambda$ while bias grows quadratically, so for small $\lambda$ variance wins. The bias-variance tradeoff is not a vague principle but a precise inequality. The catch: $\lambda^\star$ depends on the unknown $\boldsymbol\beta$ and $\sigma^2$ — in practice, cross-validation (§23.8) picks it.

Figure: level-set geometry — why ridge shrinks smoothly and lasso selects. Loss contours of the OLS quadratic (a rotated ellipse) intersect the penalty's constraint set. Ridge (L²): constraint region $\|\boldsymbol\beta\|_2^2 \le t$ — the tangent point lands smoothly and both coordinates stay nonzero unless the OLS solution is inside the ball. Lasso (L¹): the tangent lands at a vertex of the diamond — a coordinate is exactly zero.
Theorem 2 Lasso KKT characterization + soft-thresholding (Tibshirani 1996)

Let $\hat{\boldsymbol\beta} = \arg\min_{\boldsymbol\beta} \tfrac12\|\mathbf{y} - \mathbf{X}\boldsymbol\beta\|_2^2 + \lambda\|\boldsymbol\beta\|_1$. Define the partial score $s_j(\boldsymbol\beta) = \mathbf{x}_j^\top(\mathbf{y} - \mathbf{X}\boldsymbol\beta)$. The KKT optimality conditions are

$$s_j(\hat{\boldsymbol\beta}) = \lambda \cdot \text{sign}(\hat\beta_j) \quad \text{for } j \in \mathcal{A}(\hat{\boldsymbol\beta}), \qquad |s_j(\hat{\boldsymbol\beta})| \le \lambda \quad \text{for } j \notin \mathcal{A}(\hat{\boldsymbol\beta}).$$

For orthonormal designs $\mathbf{X}^\top\mathbf{X} = \mathbf{I}$, the solution has the closed form

$$\hat\beta_j \;=\; S_\lambda\!\left(\mathbf{x}_j^\top\mathbf{y}\right),$$

where $S_\lambda(x) = \text{sign}(x)\,\max(|x| - \lambda, 0)$ is the soft-thresholding operator.

Proof 2 Lasso KKT characterization + soft-thresholding (Tibshirani 1996)

Setup. Let $f(\boldsymbol\beta) = \tfrac12\|\mathbf{y} - \mathbf{X}\boldsymbol\beta\|_2^2 + \lambda\|\boldsymbol\beta\|_1$. The first term is convex and differentiable; the second is convex but non-differentiable at any $\beta_j = 0$. The global minimum exists (the objective is coercive and continuous) and is characterized by the subgradient optimality condition $\mathbf{0} \in \partial f(\hat{\boldsymbol\beta})$.

Subgradient of the L¹ norm. The subgradient of $|\beta_j|$ at $\beta_j$ is

$$\partial|\beta_j| \;=\; \begin{cases}\{\text{sign}(\beta_j)\} & \beta_j \neq 0, \\ [-1, 1] & \beta_j = 0.\end{cases}$$

Subgradient of $f$. The gradient of the smooth part is $-\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\boldsymbol\beta)$. Hence

$$\partial f(\boldsymbol\beta)_j \;=\; -\mathbf{x}_j^\top(\mathbf{y} - \mathbf{X}\boldsymbol\beta) + \lambda \, \partial|\beta_j| \;=\; -s_j(\boldsymbol\beta) + \lambda \, \partial|\beta_j|.$$

KKT conditions. Setting $\mathbf{0} \in \partial f(\hat{\boldsymbol\beta})$ coordinatewise:

  • If $\hat\beta_j > 0$: $s_j(\hat{\boldsymbol\beta}) = \lambda$.
  • If $\hat\beta_j < 0$: $s_j(\hat{\boldsymbol\beta}) = -\lambda$.
  • If $\hat\beta_j = 0$: $s_j(\hat{\boldsymbol\beta}) \in [-\lambda, \lambda]$, i.e., $|s_j(\hat{\boldsymbol\beta})| \le \lambda$.

Active coordinates carry a sign-matched partial score equal to $\pm\lambda$; inactive coordinates have partial score bounded by $\lambda$.

Orthonormal-design corollary. When $\mathbf{X}^\top\mathbf{X} = \mathbf{I}_{p+1}$, the partial score at $\boldsymbol\beta$ simplifies to $s_j(\boldsymbol\beta) = \mathbf{x}_j^\top\mathbf{y} - \beta_j$. Substituting into the KKT cases:

  • $\hat\beta_j > 0$: $\hat\beta_j = \mathbf{x}_j^\top\mathbf{y} - \lambda$, valid when $\mathbf{x}_j^\top\mathbf{y} > \lambda$.
  • $\hat\beta_j < 0$: $\hat\beta_j = \mathbf{x}_j^\top\mathbf{y} + \lambda$, valid when $\mathbf{x}_j^\top\mathbf{y} < -\lambda$.
  • $\hat\beta_j = 0$: $|\mathbf{x}_j^\top\mathbf{y}| \le \lambda$.

Soft-thresholding formula. The three cases collapse into

$$\hat\beta_j \;=\; S_\lambda\!\left(\mathbf{x}_j^\top\mathbf{y}\right) \;=\; \text{sign}\!\left(\mathbf{x}_j^\top\mathbf{y}\right) \cdot \max\!\left(\left|\mathbf{x}_j^\top\mathbf{y}\right| - \lambda, \,0\right).$$

(Under the standardization convention used by lassoFit — columns centered with $\|\mathbf{x}_j\|^2 = n$ — this generalizes to $\hat\beta_j = S_\lambda(\mathbf{x}_j^\top\mathbf{r}_{-j})/n$ on the partial residual; the algorithmic version is §23.4.) Every univariate coordinate-descent update for the lasso is one application of $S_\lambda$. ∎ — using subgradient calculus (formalCalculus: convex optimization).

The KKT conditions are the headline result: the diamond-vs-circle picture in LevelSetExplorer literally shows active coordinates landing at $\pm\lambda$ on the diamond’s edge, and inactive coordinates landing at a vertex (where the L¹ ball is non-differentiable). The §23.4 coordinate-descent algorithm is one application of the orthonormal-design formula per coordinate per sweep, using the partial residual instead of $\mathbf{y}$ directly.

Soft-thresholding operator S_λ(x) plotted on the real line with the dead-zone |x| ≤ λ highlighted in light gray, and the two linear branches y = x − λ for x > λ and y = x + λ for x < −λ shown in color. The function is continuous but non-differentiable at ±λ.
Example 4 Bias-variance decomposition on prostate-cancer

Plugging the prostate-cancer design into the SVD-coordinate MSE formula from Proof 1, we can compute the bias² and variance contributions to total MSE as a function of $\lambda$. At $\lambda = 0$ (OLS), bias² is zero and variance carries all the weight. As $\lambda$ grows, bias² rises quadratically while variance falls, with initial slope $-2\sigma^2 \sum_j 1/d_j^4$. The total MSE traces a U-shape with its minimum somewhere in the middle. For the prostate dataset, that minimum lies near $\lambda \approx 1$ — large enough to substantially reduce variance along the small singular values, small enough to keep bias modest.
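A sketch of that computation, assuming you already have the singular values $d_j$, the rotated true coefficients $\gamma_j = (\mathbf{V}^\top\boldsymbol\beta)_j$, and $\sigma^2$ from an SVD of the standardized design (none of which the sketch computes for you); ridgeMSE is a hypothetical name.

```ts
// Bias², variance, and total MSE of ridge at one λ, in the SVD coordinates of Proof 1.
// d: singular values; gamma: Vᵀβ (rotated true coefficients); sigma2: noise variance.
function ridgeMSE(d: number[], gamma: number[], sigma2: number, lambda: number) {
  let bias2 = 0;
  let variance = 0;
  for (let j = 0; j < d.length; j++) {
    const denom = (d[j] ** 2 + lambda) ** 2;
    bias2 += (lambda ** 2 * gamma[j] ** 2) / denom;
    variance += (sigma2 * d[j] ** 2) / denom;
  }
  return { bias2, variance, total: bias2 + variance };
}

// Sweep a log-grid of λ to trace the U-shape; the minimum sits where the marginal
// increase in bias² first outweighs the marginal decrease in variance.
const lambdas = Array.from({ length: 50 }, (_, k) => 10 ** (-3 + 4 * (k / 49)));
// const curve = lambdas.map((l) => ridgeMSE(d, gamma, sigma2, l)); // d, gamma, sigma2 from the fit
```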

Example 5 Soft-thresholding on an orthonormal design

Take an orthonormal design ($\mathbf{X}^\top\mathbf{X} = \mathbf{I}_4$, the setting of Thm 2's corollary) with predictor–response correlations $\mathbf{x}_j^\top\mathbf{y} = (0.75, -0.5, 0.125, 0.025)^\top$. Apply soft-thresholding at $\lambda = 0.2$:

$$\hat{\boldsymbol\beta} \;=\; S_{0.2}(0.75, -0.5, 0.125, 0.025) \;=\; (0.55, -0.3, 0, 0).$$

The third and fourth coordinates land in the dead zone and are zeroed out — this is variable selection in action. As $\lambda$ grows, more coordinates fall into the dead zone; at $\lambda > 0.75$ the entire estimate is the zero vector. This is the geometric content of Thm 2’s orthonormal-design corollary made concrete.
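The operator itself is a one-liner; the following TypeScript sketch re-checks this example's numbers (softThreshold is an illustrative name).

```ts
// Soft-thresholding operator S_λ and a check of Example 5's numbers.
const softThreshold = (x: number, lambda: number): number =>
  Math.sign(x) * Math.max(Math.abs(x) - lambda, 0);

const scores = [0.75, -0.5, 0.125, 0.025]; // x_jᵀy under the orthonormal design
const betaHat = scores.map((s) => softThreshold(s, 0.2));
// betaHat ≈ [0.55, -0.3, 0, 0] (up to floating-point rounding):
// the last two coordinates fall inside the dead zone |x| ≤ λ and are zeroed out.
```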

Remark 5 Bias-variance tradeoff as a theorem, not an aphorism

Statistics courses introduce the bias-variance tradeoff with the equation $\text{MSE} = \text{bias}^2 + \text{variance}$ and an exhortation to “find the right balance.” Thm 1 sharpens this to a precise mathematical statement: for any full-rank design and any nonzero true $\boldsymbol\beta$, some positive $\lambda$ strictly beats OLS in total MSE — and component-wise, too. The tradeoff has a positive side: there is always free MSE on the table to be picked up by a small ridge penalty. The remaining question is how to find the right $\lambda$, which §23.8 answers.

Remark 6 Stein 1956: shrinkage's pre-history

The Hoerl–Kennard ridge MSE-improvement theorem has a 14-year-older sibling: Stein 1956’s inadmissibility result. For the multivariate Normal mean problem $\mathbf{X} \sim \mathcal{N}_p(\boldsymbol\mu, \mathbf{I})$ with $p \ge 3$, the natural estimator $\bar{\mathbf{X}}$ is inadmissible under squared-error loss — there exist estimators (the James–Stein shrinkage family) with strictly smaller risk for every $\boldsymbol\mu$. Ridge is a covariate version of the same idea; the full Stein-identity proof is Topic 28 §28.5 Thm 2. The Lehmann–Romano admissibility framework (LEH2005 Ch. 5) shows that in the regression setting, ridge dominates OLS for specific $\lambda$ ranges depending on $\sigma^2$ and the design. §23.7 Rem 17 returns to the Hodges-superefficiency tie-in flagged at Topic 14 §14.6 Rem 6.

Remark 7 Geometric picture: ellipse vs circle, ellipse vs diamond

Both ridge and lasso can be re-cast as constrained optimization problems via Lagrangian duality:

  • Ridge: $\min \|\mathbf{y} - \mathbf{X}\boldsymbol\beta\|^2$ subject to $\|\boldsymbol\beta\|_2 \le t$ — minimize a quadratic on a Euclidean ball.
  • Lasso: $\min \|\mathbf{y} - \mathbf{X}\boldsymbol\beta\|^2$ subject to $\|\boldsymbol\beta\|_1 \le t$ — minimize a quadratic on an L¹ ball (a “diamond” in 2D, a cross-polytope in higher dimensions).

The OLS solution is the center of the elliptical contours of the quadratic loss. Ridge’s solution is the point of the Euclidean ball where the expanding loss contours first touch it — a smooth shrinkage. Lasso’s solution is the corresponding tangency point on the diamond — and because the diamond has corners precisely where one or more coordinates are zero, that point typically lands on a corner. Sparsity is the geometric consequence of the L¹ ball’s vertices. This is the picture the LevelSetExplorer makes interactive.

Side-by-side level-set diagrams in 2D coefficient space. Left: OLS contour ellipses tangent to a circle (ridge constraint) — tangent point is interior to the L² ball, both coefficients nonzero, smooth shrinkage. Right: same ellipses tangent to a diamond (lasso constraint) — tangent point lands on a vertex of the diamond, one coefficient exactly zero. The geometric origin of sparsity.
Figure: regularization paths β̂_j(λ) on the prostate-cancer dataset. Two panels, x-axis log(λ): ridge coefficient trajectories and lasso trajectories, one colored curve per predictor (lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45). Sweeping λ from large (all coefficients shrunk to zero) to small (recovering OLS), ridge paths shrink smoothly toward zero as λ → ∞, while lasso paths are piecewise linear — each elbow is an active-set change and each coefficient hits zero at its own entry λ. λ_min and λ_1SE markers come from cross-validation.

23.4 Coordinate descent and the glmnet algorithm

Ridge has a closed form. Lasso does not. The standard solver is cyclic coordinate descent — update one coordinate at a time, holding the others fixed, sweep through all $p$ coordinates, and repeat until convergence. Friedman, Hastie & Tibshirani’s glmnet paper (FRI2010) made this approach the production workhorse for penalized linear models and GLMs.

Definition 4 Active set

The active set of a coefficient vector $\boldsymbol\beta$ is

$$\mathcal{A}(\boldsymbol\beta) \;=\; \{j \in \{1, \dots, p\} : \beta_j \neq 0\}.$$

For a lasso fit $\hat{\boldsymbol\beta}_{\text{lasso}}(\lambda)$, $\mathcal{A}(\hat{\boldsymbol\beta})$ is the selected variable set — the predictors the procedure judges worth keeping. As $\lambda$ shrinks, $|\mathcal{A}|$ grows; at $\lambda = 0$, $\mathcal{A} = \{1, \dots, p\}$ recovers OLS.

The cyclic-CD update for the lasso uses the soft-thresholding closed form (Thm 2’s orthonormal-design corollary) on the partial residual $\mathbf{r}^{(j)} = \mathbf{y} - \mathbf{X}_{-j}\boldsymbol\beta_{-j}^{\text{current}}$. This section follows glmnet’s $\tfrac{1}{2n}$-scaled objective $\tfrac{1}{2n}\|\mathbf{y} - \mathbf{X}\boldsymbol\beta\|_2^2 + \lambda\|\boldsymbol\beta\|_1$ (so $\lambda$ here plays the role of $\lambda/n$ in Def 2’s convention); the update is

$$\beta_j^{\text{new}} \;=\; S_\lambda\!\left(\tfrac{1}{n}\mathbf{x}_j^\top\mathbf{r}^{(j)}\right) \Big/ \left(\tfrac{1}{n}\|\mathbf{x}_j\|^2\right).$$

Standardization makes $\|\mathbf{x}_j\|^2/n = 1$, so the denominator drops out and each coordinate update is one soft-thresholding step. A full sweep updates all $p$ coordinates; the algorithm terminates when the maximum coordinate change falls below a tolerance. With warm starts from a neighbouring $\lambda$, paths over a 100-point $\lambda$-grid are computed in essentially the time of one cold-start fit.
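A minimal TypeScript sketch of the sweep logic under these assumptions (standardized columns with $\|\mathbf{x}_j\|^2/n = 1$, the $\tfrac{1}{2n}$-scaled objective); lassoCD is a hypothetical name, not the regression.ts lassoFit, and it omits the intercept and warm-start machinery.

```ts
// Minimal cyclic coordinate descent for the lasso under the 1/(2n)-scaled objective.
// Assumes the columns of X are standardized so that ‖x_j‖²/n = 1, and omits the intercept.
function lassoCD(
  X: number[][],
  y: number[],
  lambda: number,
  maxSweeps = 1000,
  tol = 1e-6,
): number[] {
  const n = X.length;
  const p = X[0].length;
  const soft = (x: number, t: number) => Math.sign(x) * Math.max(Math.abs(x) - t, 0);
  const beta = new Array<number>(p).fill(0);
  const r = y.slice(); // full residual y − Xβ, patched after every coordinate update

  for (let sweep = 0; sweep < maxSweeps; sweep++) {
    let maxChange = 0;
    for (let j = 0; j < p; j++) {
      // Partial-residual score (1/n) x_jᵀ r^(j) = (1/n) x_jᵀ r + β_j, since ‖x_j‖²/n = 1
      let score = beta[j];
      for (let i = 0; i < n; i++) score += (X[i][j] * r[i]) / n;
      const updated = soft(score, lambda); // one soft-thresholding step per coordinate
      const delta = updated - beta[j];
      if (delta !== 0) {
        for (let i = 0; i < n; i++) r[i] -= X[i][j] * delta; // keep r = y − Xβ in sync
        beta[j] = updated;
      }
      maxChange = Math.max(maxChange, Math.abs(delta));
    }
    if (maxChange < tol) break; // converged: no coordinate moved by more than tol
  }
  return beta;
}
```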

Theorem 6 Cyclic coordinate descent convergence (Tseng 2001)

Let $f(\boldsymbol\beta) = g(\boldsymbol\beta) + \sum_{j=1}^{p+1} h_j(\beta_j)$ where $g : \mathbb{R}^{p+1} \to \mathbb{R}$ is convex and continuously differentiable and each $h_j$ is convex (possibly non-differentiable). Assume $f$ is bounded below and coercive. Starting from $\boldsymbol\beta^{(0)}$ and updating coordinate-by-coordinate as

$$\beta_j^{(t+1)} \;=\; \arg\min_{\beta_j} f\!\left(\beta_1^{(t+1)}, \dots, \beta_{j-1}^{(t+1)}, \beta_j, \beta_{j+1}^{(t)}, \dots, \beta_{p+1}^{(t)}\right),$$

the sequence $\{\boldsymbol\beta^{(t)}\}$ converges to a global minimizer of $f$. The lasso objective satisfies the hypothesis with $g(\boldsymbol\beta) = \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\boldsymbol\beta\|_2^2$ and $h_j(\beta_j) = \lambda|\beta_j|$.

Proof 3 Cyclic coordinate descent convergence (Tseng 2001)

Setup. $f = g + \sum_j h_j$ as in the theorem, with $g$ convex and $C^1$, each $h_j$ convex (possibly non-differentiable), and $f$ coercive. The cyclic-CD iterates are defined as in the theorem statement.

Monotone descent. Each univariate update $\beta_j^{(t+1)} = \arg\min_{\beta_j} f(\dots, \beta_j, \dots)$ can only decrease the objective, $f(\boldsymbol\beta^{(t+1)}) \le f(\boldsymbol\beta^{(t)})$, with equality iff the previous $\beta_j^{(t)}$ already minimized the slice.

Boundedness of iterates. Because $f$ is coercive and the iterates are monotone descending, the sublevel set $\{\boldsymbol\beta : f(\boldsymbol\beta) \le f(\boldsymbol\beta^{(0)})\}$ is bounded. The sequence $\{\boldsymbol\beta^{(t)}\}$ is therefore bounded and admits a convergent subsequence $\boldsymbol\beta^{(t_k)} \to \boldsymbol\beta^\infty$ by Bolzano–Weierstrass.

Limit is a coordinatewise stationary point. By continuity of the univariate $\arg\min$ map (which holds because the slices are strictly convex — $g$'s second partial in $\beta_j$ plus the $h_j$ convexity), the limit $\boldsymbol\beta^\infty$ satisfies

$$\beta_j^\infty \;=\; \arg\min_{\beta_j} f(\beta_1^\infty, \dots, \beta_j, \dots, \beta_{p+1}^\infty)$$

for every $j$. This is the coordinatewise stationarity condition.

Coordinatewise ⇒ global, via separability. Because $f = g + \sum_j h_j$ with the non-smooth part separable (each $h_j$ depends on only one coordinate), the coordinatewise stationarity condition is equivalent to $\mathbf{0} \in \partial f(\boldsymbol\beta^\infty)$ — and for convex $f$, that is the global-minimum condition. The separability is essential: Tseng 2001 explicitly constructs counterexamples where cyclic CD on a non-separable non-smooth function fails to converge to the minimizer.

Full-sequence convergence. Because the slice minima are unique (strict convexity per coordinate), every convergent subsequence of $\{\boldsymbol\beta^{(t)}\}$ converges to the same global minimizer. By a standard sub-subsequence argument, the full sequence converges. ∎ — using Tseng 2001 Prop 5.1 and the lasso objective’s separable structure.

The separability step is where the structural insight lives: coordinate descent works on the lasso because $\|\boldsymbol\beta\|_1 = \sum_j |\beta_j|$ is a sum of single-coordinate functions. It would fail on the fused lasso ($\sum_j |\beta_{j+1} - \beta_j|$, used for piecewise-constant signals), which couples adjacent coordinates and breaks the separability. For non-separable non-smooth penalties, proximal-gradient methods (ISTA, FISTA) take over.

Figure: cyclic coordinate descent on a 2D lasso problem (Tseng 2001). Each step minimizes the objective along one coordinate via soft-thresholding, so the iterate path β⁽ᵗ⁾ is axis-aligned and the objective f(β⁽ᵗ⁾) is monotone decreasing — the entire convergence content of Thm 6.
Example 6 Coordinate-descent trace on a 2D lasso toy

A two-coefficient lasso problem ($p = 2$) lets us visualize the iterate path directly. With $\mathbf{X}$ a small synthetic design and $\boldsymbol\beta^\star = (2, 1)$, the cold-start CD trajectory zigzags through coefficient space: each axis-aligned step minimizes one coordinate while holding the other fixed. The objective decreases monotonically (the right panel of the visualizer); the path converges to the global minimizer in around 47 coordinate updates (about 24 sweeps of 2 coordinates) at tolerance $10^{-6}$. With a warm start from the neighbouring $\lambda$, the same problem solves in 5–10 coordinate updates.

Coordinate-descent iterate trace on a 2D lasso problem. Left panel: contour plot of the lasso objective with the iterate path overlaid as zigzag axis-aligned segments converging to the minimizer. Right panel: monotonically decreasing objective values per coordinate update step, with a logarithmic y-axis showing the rate of convergence.
Remark 8 LARS as the historical alternative

Before glmnet, the standard lasso solver was LARS (Least Angle Regression; Efron, Hastie, Johnstone & Tibshirani 2004). LARS computes the entire piecewise-linear regularization path exactly by tracking when each coordinate enters or leaves the active set as $\lambda$ decreases. It is elegant — the path is piecewise linear — but slower in practice than coordinate descent for large $p$, because each LARS step requires solving a linear system on the current active set. glmnet won out on engineering: cyclic CD is embarrassingly simple to vectorize, scales to $p$ in the millions, and supports any convex separable penalty (LARS is lasso-only). Topic 23 treats LARS as a historical pointer; for the algorithm itself, see HAS2009 §3.4.4.

Remark 9 Warm-start path computation

A defining feature of glmnet is path computation with warm starts: instead of solving for one $\lambda$ at a time from scratch, compute the path over a 100-point log-grid in one pass, using the solution at $\lambda_k$ as the starting point for $\lambda_{k+1}$. Because consecutive $\lambda$ values give nearby solutions, each warm-started fit needs only a few sweeps to converge — the per-$\lambda$ cost drops by roughly an order of magnitude. The grid runs from $\lambda_{\max} = \max_j |\mathbf{x}_j^\top\mathbf{y}/n|$ (the smallest $\lambda$ for which $\hat{\boldsymbol\beta} = \mathbf{0}$) down to $\lambda_{\min} = \lambda_{\max} \cdot 10^{-4}$. The regularizationPath function in regression.ts implements this scheme.
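A sketch of just the grid construction described above; lambdaGrid is an illustrative name, not the regularizationPath API. A warm-started path solver would loop over this grid from largest to smallest λ, seeding each fit (e.g. the lassoCD sketch above) with the previous solution.

```ts
// Construct the glmnet-style λ grid: λ_max is the smallest λ at which β̂ = 0
// (the largest absolute column–response correlation under the 1/(2n) scaling),
// and the grid descends geometrically to λ_max · 10⁻⁴.
function lambdaGrid(X: number[][], y: number[], nLambda = 100, ratio = 1e-4): number[] {
  const n = X.length;
  const p = X[0].length;
  let lambdaMax = 0;
  for (let j = 0; j < p; j++) {
    let s = 0;
    for (let i = 0; i < n; i++) s += X[i][j] * y[i];
    lambdaMax = Math.max(lambdaMax, Math.abs(s) / n);
  }
  const logMax = Math.log(lambdaMax);
  const logMin = Math.log(lambdaMax * ratio);
  return Array.from({ length: nLambda }, (_, k) =>
    Math.exp(logMax + ((logMin - logMax) * k) / (nLambda - 1)),
  );
}
```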

Remark 10 Strong-screening rules and proximal-gradient methods

For very large $p$ (gene-expression studies with $p = 10^5$ predictors), even cyclic CD becomes a bottleneck. Two acceleration techniques are widely deployed. Strong screening rules (Tibshirani et al. 2012) discard, before any coordinate-descent updates, the predictors that are very unlikely to be active at the next $\lambda$ (a cheap KKT check afterwards catches the rare misses) — a $10\times$–$100\times$ speedup. Proximal-gradient methods (ISTA, FISTA) take gradient steps on the smooth part and apply soft-thresholding as the proximal operator; they generalize beyond separable non-smooth penalties (where coordinate descent fails) at the cost of slightly worse constants. regression.ts ships only the basic cyclic CD; production deployments should reach for glmnet (which applies strong rules by default) or sklearn.linear_model.{Lasso, LassoCV} (also coordinate-descent implementations).

23.5 Elastic net

Lasso and ridge each have a failure mode. Lasso, on highly correlated predictors, tends to pick one and zero out its companions arbitrarily — a coin-flip choice that destabilizes interpretation across bootstrap samples. Ridge gives stable weight to all correlated predictors but never produces sparsity. The elastic net (Zou & Hastie 2005) interpolates: a convex combination of L¹ and L² penalties.

Definition 5 Elastic-net penalty and objective

For $\alpha \in [0, 1]$ and $\lambda \ge 0$, the elastic-net penalty is

$$P_\alpha(\boldsymbol\beta) \;=\; \alpha \|\boldsymbol\beta\|_1 \;+\; \tfrac{1 - \alpha}{2} \|\boldsymbol\beta\|_2^2,$$

and the elastic-net objective is

$$\hat{\boldsymbol\beta}_{\text{EN}}(\lambda, \alpha) \;=\; \arg\min_{\boldsymbol\beta}\; \tfrac12\|\mathbf{y} - \mathbf{X}\boldsymbol\beta\|_2^2 \;+\; \lambda P_\alpha(\boldsymbol\beta).$$

$\alpha = 1$ recovers the lasso ($\tfrac12\|\cdot\|^2 + \lambda\|\boldsymbol\beta\|_1$); $\alpha = 0$ recovers ridge with penalty $(\lambda/2)\|\boldsymbol\beta\|^2$ — matching Def 1’s convention exactly. For $\alpha \in (0, 1)$, the penalty is strictly convex even when $\mathbf{X}^\top\mathbf{X}$ is rank-deficient — a property ridge has but lasso does not.
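For intuition about how the two penalty pieces act in a solver, here is the single-coordinate elastic-net update in the same standardized, $\tfrac{1}{2n}$-scaled setting as §23.4's lasso update (the glmnet-style form: soft-threshold at $\lambda\alpha$, then shrink by $1 + \lambda(1-\alpha)$); elasticNetUpdate is an illustrative name.

```ts
// One elastic-net coordinate update: L¹ part acts as a soft threshold at λα,
// the L² part as a multiplicative shrink by 1 + λ(1 − α).
function elasticNetUpdate(partialScore: number, lambda: number, alpha: number): number {
  const soft = Math.sign(partialScore) * Math.max(Math.abs(partialScore) - lambda * alpha, 0);
  return soft / (1 + lambda * (1 - alpha));
}
// α = 1 reduces to the lasso update; α = 0 is a pure ridge shrink of the score by 1/(1 + λ).
```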

Theorem 3 Elastic-net strict convexity (Zou-Hastie 2005)

For any $\alpha \in [0, 1)$ and $\lambda > 0$, the elastic-net objective is strictly convex in $\boldsymbol\beta$, and the minimizer $\hat{\boldsymbol\beta}$ is unique — regardless of the rank of $\mathbf{X}$.

Brief argument: the Hessian of the smooth part is $\mathbf{X}^\top\mathbf{X} + \lambda(1-\alpha)\mathbf{I}$, which is positive-definite whenever $\lambda(1-\alpha) > 0$, so the smooth part is strictly convex; adding the convex L¹ term preserves strict convexity of the full objective. ZOU2005 Thm 1.

Example 7 Grouping effect on correlated features

Take a synthetic problem with three pairs of perfectly correlated predictors plus four independent ones (the CORRELATED_FEATURES_DATA preset). The true coefficient is nonzero only at $X_1$ and $X_3$ — both members of correlated pairs. Lasso, run at the CV-optimal $\lambda$, picks one predictor from each correlated pair and zeros the other — and which one it picks varies arbitrarily across bootstrap samples. Elastic net with $\alpha = 0.5$ at the same effective penalty strength assigns nonzero weight to all four members of the relevant correlated pairs, distributing weight across each group. The grouping effect: correlated predictors get similar coefficients, and the active set is more stable.

Remark 11 Grouping effect formalized

Zou & Hastie 2005 prove that for any two standardized predictors $\mathbf{x}_j$ and $\mathbf{x}_k$ whose fitted coefficients $\hat\beta_j, \hat\beta_k$ share a sign, the elastic-net coefficients satisfy

$$|\hat\beta_j - \hat\beta_k| \;\le\; \frac{\sqrt{2(1 - \rho)}}{\lambda(1-\alpha)} \cdot \frac{\|\mathbf{y}\|}{n},$$

where $\rho$ is the sample correlation between $\mathbf{x}_j$ and $\mathbf{x}_k$. As $\rho \to 1$, the bound forces $\hat\beta_j \to \hat\beta_k$ — the coefficients are pulled together. For the pure lasso ($\alpha = 1$), the bound is vacuous (the denominator vanishes), which is why lasso lacks the grouping effect.

Remark 12 The (α, λ) parameter plane

Elastic net introduces a second hyperparameter, $\alpha$, on top of $\lambda$. Most production workflows fix a small grid of $\alpha$ values ($\alpha \in \{0.1, 0.5, 0.9\}$, say), run cross-validation to pick $\lambda$ for each, and choose the combination with the smallest CV error. The 2D grid is more expensive than the 1D lasso/ridge search but rarely prohibitively so. The figure shows the $(\alpha, \lambda)$ parameter plane on the prostate-cancer data, with CV-error contours and the optimal point marked.

Elastic-net (α, λ) parameter plane heatmap on the prostate-cancer dataset. X-axis: log λ; y-axis: α from 0 (ridge) to 1 (lasso); color encodes 10-fold CV mean squared error. Optimal (α*, λ*) marked with crosshairs. The lasso boundary (α = 1) and ridge boundary (α = 0) labeled.

23.6 Penalized GLMs

Topic 22’s IRLS framework extends to penalized GLMs essentially unchanged. Replace the negative log-likelihood with the penalized version, and the inner WLS step becomes a penalized WLS step.

Definition 6 Penalized-GLM objective

For a GLM with log-likelihood $\ell(\boldsymbol\beta)$ (Topic 22 Def 3) and a convex penalty $P$, the penalized-GLM estimator is

$$\hat{\boldsymbol\beta}(\lambda) \;=\; \arg\min_{\boldsymbol\beta}\; -\ell(\boldsymbol\beta) \;+\; \lambda P(\boldsymbol\beta).$$

For ridge, $P = \|\boldsymbol\beta\|_2^2$; for lasso, $P = \|\boldsymbol\beta\|_1$; for elastic net, $P_\alpha$ as in Def 5. The unpenalized-intercept convention applies as usual.

Theorem 4 Penalized-IRLS update

Under a canonical link and a convex sub-differentiable penalty $P$, the penalized-IRLS update at iteration $t$ solves

$$\boldsymbol\beta^{(t+1)} \;=\; \arg\min_{\boldsymbol\beta}\; (\mathbf{z}^{(t)} - \mathbf{X}\boldsymbol\beta)^\top \mathbf{W}^{(t)} (\mathbf{z}^{(t)} - \mathbf{X}\boldsymbol\beta) \;+\; 2\lambda P(\boldsymbol\beta),$$

where $\mathbf{W}^{(t)}$ and $\mathbf{z}^{(t)} = \boldsymbol\eta^{(t)} + \mathbf{W}^{(t)\,-1}(\mathbf{y} - \boldsymbol\mu^{(t)})$ are Topic 22 Thm 2’s IRLS weights and adjusted response. For $P = \|\cdot\|_2^2$ this is a ridge-WLS problem with closed form $(\mathbf{X}^\top\mathbf{W}\mathbf{X} + 2\lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{W}\mathbf{z}$. For $P = \|\cdot\|_1$ it is a weighted-lasso coordinate-descent inner pass.

Brief derivation: a second-order Taylor expansion of $-\ell$ around $\boldsymbol\beta^{(t)}$ produces the WLS quadratic surrogate exactly as in Topic 22 Proof 2; adding $2\lambda P(\boldsymbol\beta)$ gives the penalized surrogate. Minimizing the surrogate is the next IRLS step. Convergence proceeds as in Tseng 2001 with the surrogate playing the role of $f$.
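A minimal sketch of the outer loop for the ridge-logistic case, assuming a no-intercept, pre-standardized design and a 0/1 response; the names ridgeLogisticIRLS and solveRidgeWLS are illustrative, not the regression.ts penalizedGLMFit.

```ts
// Illustrative penalized IRLS for ridge-logistic regression (canonical logit link),
// following Thm 4: each inner step solves (XᵀWX + 2λI) β = XᵀW z.
function solveRidgeWLS(X: number[][], W: number[], z: number[], lambda: number): number[] {
  const n = X.length;
  const p = X[0].length;
  const A = Array.from({ length: p }, (_, j) =>
    Array.from({ length: p }, (_, k) => {
      let s = 0;
      for (let i = 0; i < n; i++) s += X[i][j] * W[i] * X[i][k];
      return s + (j === k ? 2 * lambda : 0);
    }),
  );
  const b = Array.from({ length: p }, (_, j) => {
    let s = 0;
    for (let i = 0; i < n; i++) s += X[i][j] * W[i] * z[i];
    return s;
  });
  // Gaussian elimination; A is positive-definite for λ > 0
  for (let c = 0; c < p; c++) {
    for (let r = c + 1; r < p; r++) {
      const f = A[r][c] / A[c][c];
      for (let k = c; k < p; k++) A[r][k] -= f * A[c][k];
      b[r] -= f * b[c];
    }
  }
  const beta = new Array<number>(p).fill(0);
  for (let r = p - 1; r >= 0; r--) {
    let s = b[r];
    for (let k = r + 1; k < p; k++) s -= A[r][k] * beta[k];
    beta[r] = s / A[r][r];
  }
  return beta;
}

function ridgeLogisticIRLS(X: number[][], y: number[], lambda: number, maxIter = 50, tol = 1e-8): number[] {
  const p = X[0].length;
  let beta = new Array<number>(p).fill(0);
  for (let t = 0; t < maxIter; t++) {
    const eta = X.map((row) => row.reduce((s, x, j) => s + x * beta[j], 0)); // linear predictor
    const mu = eta.map((e) => 1 / (1 + Math.exp(-e)));                       // fitted probabilities
    const W = mu.map((m) => Math.max(m * (1 - m), 1e-10));                   // IRLS weights
    const z = eta.map((e, i) => e + (y[i] - mu[i]) / W[i]);                  // adjusted response
    const next = solveRidgeWLS(X, W, z, lambda);                             // penalized WLS step
    const change = Math.max(...next.map((b, j) => Math.abs(b - beta[j])));
    beta = next;
    if (change < tol) break;
  }
  return beta;
}
```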

Example 8 Ridge-logistic restores convergence on near-separation

The EXAMPLE_10_GLM_DATA preset (n = 120, p = 2) is the near-separation problem from §23.1 Ex 2. Topic 22’s bare glmFit reports converged = false after 50 iterations: $\|\boldsymbol\beta^{(t)}\|$ has crossed 100 and is still growing. Calling penalizedGLMFit with family: 'binomial', link: 'logit', penalty: 'ridge', lambda: 0.1 instead converges in around 10 outer IRLS iterations to a finite estimate $\hat{\boldsymbol\beta}$ with $\|\hat{\boldsymbol\beta}\|_\infty \approx 1$ — a small fraction of what the unpenalized fit would have produced. The ridge penalty makes the otherwise non-coercive objective coercive; the IRLS surrogate is then guaranteed to have a unique bounded minimizer at every step.

Side-by-side comparison of bare logistic IRLS (left) and ridge-logistic penalized IRLS (right) on the near-separation preset. Left panel: coefficient L∞ norm ‖β^(t)‖ vs IRLS iteration, growing without bound past iteration 30 in red. Right panel: same axes, ridge penalty (λ=0.1) applied — convergence to a bounded estimate around iteration 10 in green. Cross-symbols mark convergence.
Example 9 Lasso-Poisson on polynomial-expanded credit-default

The Topic 22 EXAMPLE_6_GLM_DATA problem (credit-default logistic, n = 200, p = 2) becomes a high-dimensional problem when its predictors are expanded to degree-3 polynomials. The expanded design has p = 34 features (the CREDIT_DEFAULT_EXPANDED_DATA preset). Fitting a lasso-logistic model at the CV-selected $\lambda$ produces a sparse $\hat{\boldsymbol\beta}$ with around 6–10 active coefficients, automatically selecting the polynomial terms that contribute predictive signal. Pure logistic regression on the expanded design overfits badly; lasso-logistic handles selection and shrinkage in one pass.

Remark 13 Firth 1993 as the alternative for separation

Ridge is one cure for logistic-regression separation; Firth’s penalized likelihood (FIR1993) is the other. Firth replaces the log-likelihood with $\ell(\boldsymbol\beta) + \tfrac12 \log|\mathbf{I}(\boldsymbol\beta)|$ where $\mathbf{I}$ is the Fisher information — equivalent to a Jeffreys prior on $\boldsymbol\beta$. This penalty bounds the estimator without requiring a tuning parameter. In practice, Firth is the default in many epidemiology workflows where a single hyperparameter-free fix is preferred; ridge wins when cross-validation is feasible and the user wants explicit control over the bias-variance tradeoff. Topic 23 develops only the ridge cure; logistf in R implements Firth.

Remark 14 Shrinkage as influence control

Topic 22 §22.9 Rem 19 flagged influential observations and outliers as a robustness concern, and pointed forward to two cures: M-estimation (deferred to a future topic) and shrinkage. Ridge is the shrinkage answer: by penalizing $\|\boldsymbol\beta\|$, it limits the leverage that any single observation can exert on individual coefficients. The §22.9 Rem 19 promise is discharged here. Lasso gives a different kind of robustness: by zeroing weak signals, it makes the surviving coefficients depend on stronger structural patterns rather than noise-driven fluctuations.

Remark 15 Inference under regularization is hazardous

A standard mistake: read off Wald confidence intervals from a penalized estimator using $(\mathbf{X}^\top\mathbf{W}\mathbf{X})^{-1}$ as if the penalty did not exist. This is wrong. The penalty introduces bias $\mathbb{E}[\hat{\boldsymbol\beta}] - \boldsymbol\beta \neq \mathbf{0}$, breaking the central condition for Wald inference. Profile-likelihood CIs on penalized estimators have similar issues. The correct procedure for inference under regularization is post-selection inference — selecting variables with the lasso, then refitting unpenalized OLS on the selected variables (with caveats), or using the debiased lasso (Javanmard–Montanari 2014). Topic 23 stops at the warning; the methodology lives in formalml’s high-dimensional regression topic.

23.7 The Bayesian bridge: MAP = penalized MLE

Topic 14 §14.11 Rem 9 previewed the correspondence in prose: every penalty is a prior; every prior is a penalty. The §23.7 theorem makes that exact.

Definition 7 MAP estimate

For a likelihood $L(\boldsymbol\beta)$ and a prior $\pi(\boldsymbol\beta)$, the maximum a posteriori (MAP) estimate is the mode of the posterior:

$$\hat{\boldsymbol\beta}_{\text{MAP}} \;=\; \arg\max_{\boldsymbol\beta}\, p(\boldsymbol\beta \mid \mathbf{y}) \;=\; \arg\max_{\boldsymbol\beta}\, [\ell(\boldsymbol\beta) + \log\pi(\boldsymbol\beta)].$$

The proportionality constant $p(\mathbf{y})$ drops out of the argmax. The MAP coincides with the posterior mean only for symmetric, unimodal posteriors (e.g. Gaussian); otherwise the two can diverge.

Theorem 5 MAP-penalization correspondence

Let the prior density be $\pi(\boldsymbol\beta) = C \exp(-\lambda P(\boldsymbol\beta))$ for a normalizing constant $C$ and convex $P$. Then

$$\hat{\boldsymbol\beta}_{\text{MAP}} \;=\; \arg\min_{\boldsymbol\beta}\, [-\ell(\boldsymbol\beta) + \lambda P(\boldsymbol\beta)] \;=\; \hat{\boldsymbol\beta}_{\text{penalized MLE}}.$$

Embedded derivation. The posterior is $p(\boldsymbol\beta \mid \mathbf{y}) \propto L(\boldsymbol\beta)\,\pi(\boldsymbol\beta)$. Taking logs, $\log p(\boldsymbol\beta \mid \mathbf{y}) = \ell(\boldsymbol\beta) - \lambda P(\boldsymbol\beta) + \text{const}$. The MAP maximizes this expression — equivalently, minimizes $-\ell + \lambda P$, which is the penalized-MLE objective. ∎

Specializations. The Gaussian prior $\boldsymbol\beta \sim \mathcal{N}_{p+1}(\mathbf{0}, \tau^2\mathbf{I})$ gives $\log\pi(\boldsymbol\beta) = -\|\boldsymbol\beta\|_2^2/(2\tau^2) + \text{const}$ — that is, ridge with $\lambda = 1/(2\tau^2)$. The Laplace prior $\pi(\boldsymbol\beta) = \prod_j \frac{1}{2b}\exp(-|\beta_j|/b)$ gives $\log\pi(\boldsymbol\beta) = -\|\boldsymbol\beta\|_1/b + \text{const}$ — that is, lasso with $\lambda = 1/b$. Discharges Topic 14 §14.11 Rem 9.

Example 10 Gaussian prior gives ridge

Suppose we observe $y_i = \mathbf{x}_i^\top\boldsymbol\beta + \varepsilon_i$ with $\varepsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2)$, and place a Gaussian prior $\boldsymbol\beta \sim \mathcal{N}_{p+1}(\mathbf{0}, \tau^2\mathbf{I})$. The log-posterior is

$$\log p(\boldsymbol\beta \mid \mathbf{y}) \;=\; -\tfrac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}\boldsymbol\beta\|_2^2 \;-\; \tfrac{1}{2\tau^2}\|\boldsymbol\beta\|_2^2 \;+\; \text{const}.$$

The MAP — equivalently, the minimizer of $-\log p$ — is the ridge solution with $\lambda = \sigma^2/\tau^2$ in Def 1’s parameterization. Smaller $\tau^2$ (a tighter prior on the coefficients) means larger $\lambda$ (heavier shrinkage). The ridge regularization parameter is the prior-precision-to-noise-precision ratio.

Example 11 Laplace prior gives lasso

With the same likelihood but a Laplace prior $\beta_j \overset{\text{iid}}{\sim} \text{Laplace}(0, b)$, the log-posterior is

$$\log p(\boldsymbol\beta \mid \mathbf{y}) \;=\; -\tfrac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}\boldsymbol\beta\|_2^2 \;-\; \tfrac{1}{b}\|\boldsymbol\beta\|_1 \;+\; \text{const}.$$

The MAP is the lasso solution with $\lambda = \sigma^2/b$. The Laplace prior — sharply peaked at zero with heavy tails — encodes the prior belief that “most coefficients are zero, but those that aren’t can be large.” That is exactly the sparsity intuition the lasso operationalizes.

Three prior densities side-by-side: Gaussian (smooth bell shape, peaked at 0), Laplace (sharp peak at 0 with exponential tails), and Cauchy (heavier tails than Laplace, no second moment). Each labeled with its corresponding penalty equivalence: Gaussian ⇒ ridge L², Laplace ⇒ lasso L¹, Cauchy ⇒ heavier-tailed shrinkage with no closed form.
Remark 16 Cauchy prior: heavier tails, no closed form

A natural extension of the Gaussian/Laplace pair is the Cauchy prior $\pi(\beta_j) \propto 1/(1 + (\beta_j/c)^2)$, used by Gelman et al. 2008 as a default weakly informative prior for logistic-regression coefficients. Cauchy has heavier tails than Laplace, expressing greater prior tolerance for occasional large coefficients while still pulling small ones toward zero. The corresponding penalty, $\sum_j \log(1 + (\beta_j/c)^2)$, is non-convex: the Cauchy MAP has no closed form, can have multiple local optima, and requires numerical optimization. The general lesson: the Gaussian–Laplace dichotomy is the convex special case of a much richer prior-as-penalty correspondence.

Remark 17 Hodges superefficiency tie-back

Topic 14 §14.6 Rem 6 introduced Hodges superefficiency: an estimator that beats the Cramér–Rao bound at a single point (e.g. $\boldsymbol\beta = \mathbf{0}$) by being shrinkage-like there. Ridge and lasso are exactly such estimators — they have superior MSE near $\boldsymbol\beta = \mathbf{0}$ at the cost of bias elsewhere. Le Cam’s discrediting of superefficient estimators in the 1950s was based on their pathological behavior at non-shrinkage targets; the modern MSE-improvement view (Thm 1) reframes the same algebra as a virtue. The Stein-shrinkage origin paper (STE1956) was the first to take the bias-for-variance trade seriously; ridge and lasso are its modern descendants.

Remark 18 The regularization parameter as prior precision

Thm 5 makes a crisp identification: $\lambda$ plays the role of a prior precision (inverse variance) per coefficient. The empirical-Bayes view says: choose $\lambda$ to maximize the marginal likelihood $p(\mathbf{y}) = \int L(\boldsymbol\beta)\pi(\boldsymbol\beta)\,d\boldsymbol\beta$, treating $\lambda$ as a hyperparameter of the prior. Cross-validation (§23.8) is a frequentist alternative that estimates predictive risk directly. Both procedures pick a $\lambda$ that adapts to the data; they typically agree to within an order of magnitude, with empirical Bayes preferring slightly smaller $\lambda$ (less shrinkage).

Remark 19 Weight decay and AdamW

In neural-network training, the term weight decay denotes the addition of $\lambda \|\mathbf{w}\|_2^2$ to the loss — Gaussian-prior MAP, by Thm 5. Equivalently, gradient-descent updates with multiplicative decay $\mathbf{w}_{t+1} = (1 - \eta\lambda)\mathbf{w}_t - \eta\nabla\mathcal{L}$ implement L² regularization for vanilla SGD (the implementation differs subtly for Adam-family optimizers, motivating decoupled weight decay in AdamW). Every modern deep-learning optimizer ships with a weight-decay knob; tuning it is one of the most reliable ways to improve generalization. Forward pointer to formalml’s weight-decay topic.
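A sketch of the two update forms for a single step, with illustrative function names. For plain SGD the two are algebraically identical, which is the equivalence the remark states; for adaptive optimizers they differ, and the decoupled form is what AdamW adopts.

```ts
// Two ways to apply weight decay in one gradient step.
function l2PenalizedStep(w: number[], grad: number[], eta: number, lambda: number): number[] {
  // Penalty folded into the gradient: ∇(L + (λ/2)‖w‖²) = ∇L + λw
  return w.map((wi, i) => wi - eta * (grad[i] + lambda * wi));
}

function decoupledWeightDecayStep(w: number[], grad: number[], eta: number, lambda: number): number[] {
  // Decoupled decay: multiplicative shrink of the weights, then the plain gradient step
  return w.map((wi, i) => (1 - eta * lambda) * wi - eta * grad[i]);
}
```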

23.8 Cross-validation for $\lambda$ selection

The penalty parameter $\lambda$ controls the bias-variance tradeoff. Choosing it well requires an estimate of out-of-sample predictive performance — exactly what cross-validation provides.

Definition 8 k-fold cross-validation estimator

Partition the data into $k$ disjoint folds via a fold-assignment map $\kappa: \{1, \dots, n\} \to \{1, \dots, k\}$. Let $\hat{f}^{-m}(\cdot\,; \lambda)$ denote the penalized estimator trained on all data except fold $m$. The $k$-fold CV estimator is

$$\widehat{\text{CV}}_k(\lambda) \;=\; \frac{1}{n}\sum_{i=1}^n L\!\left(y_i, \hat{f}^{-\kappa(i)}(\mathbf{x}_i; \lambda)\right),$$

where $L$ is the per-observation loss (squared error for regression, deviance for GLMs).

The CV curve $\widehat{\text{CV}}_k(\lambda)$ traces a U-shape as a function of $\log\lambda$: high $\lambda$ over-shrinks (high bias), low $\lambda$ under-regularizes (high variance), and the minimum sits in the middle. Two standard rules pick $\lambda$ from the curve:

  • $\hat\lambda_{\min}$ — the $\lambda$ minimizing $\widehat{\text{CV}}_k$. Usually chosen for prediction-focused workflows.
  • $\hat\lambda_{\text{1SE}}$ — the largest $\lambda$ such that $\widehat{\text{CV}}_k(\lambda) \le \widehat{\text{CV}}_k(\hat\lambda_{\min}) + \widehat{\text{SE}}(\hat\lambda_{\min})$. Chooses a parsimonious model that is statistically indistinguishable from the best.
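A minimal sketch of the whole procedure — Def 8's fold loop plus both selection rules — for a squared-error loss. The fitter argument stands in for any penalized solver at a fixed $\lambda$ (for instance the lasso CD sketch in §23.4); kFoldCV is an illustrative name, not the regression.ts API, and the deterministic fold assignment ignores the seed-based shuffling the examples mention.

```ts
// k-fold CV over a λ grid with the λ_min and one-SE selection rules.
type Fitter = (X: number[][], y: number[], lambda: number) => number[];

function kFoldCV(X: number[][], y: number[], lambdas: number[], fit: Fitter, k = 10) {
  const n = X.length;
  const fold = Array.from({ length: n }, (_, i) => i % k); // deterministic fold assignment κ(i)

  const results = lambdas.map((lambda) => {
    const foldErrors: number[] = [];
    for (let f = 0; f < k; f++) {
      const trainIdx: number[] = [];
      const testIdx: number[] = [];
      for (let i = 0; i < n; i++) (fold[i] === f ? testIdx : trainIdx).push(i);
      const beta = fit(trainIdx.map((i) => X[i]), trainIdx.map((i) => y[i]), lambda);
      const mse =
        testIdx.reduce((s, i) => {
          const pred = X[i].reduce((a, x, j) => a + x * beta[j], 0);
          return s + (y[i] - pred) ** 2;
        }, 0) / testIdx.length;
      foldErrors.push(mse);
    }
    const mean = foldErrors.reduce((a, b) => a + b, 0) / k;
    const sd = Math.sqrt(foldErrors.reduce((a, b) => a + (b - mean) ** 2, 0) / (k - 1));
    return { lambda, mean, se: sd / Math.sqrt(k) };
  });

  // λ_min minimizes the mean CV error; λ_1SE is the largest λ within one SE of that minimum
  const best = results.reduce((a, b) => (b.mean < a.mean ? b : a));
  const oneSE = results
    .filter((r) => r.mean <= best.mean + best.se)
    .reduce((a, b) => (b.lambda > a.lambda ? b : a));
  return { results, lambdaMin: best.lambda, lambdaOneSE: oneSE.lambda };
}
```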
Example 12 $k = 5$ vs $k = 10$ on prostate-cancer

Running 5-fold and 10-fold CV on the prostate-cancer lasso path with seed 42 gives slightly different $\hat\lambda_{\min}$ values, but typically of the same order of magnitude. 10-fold CV has lower bias (more data per fit) but higher variance (smaller test folds); 5-fold has the opposite tradeoff. For $n < 100$, the variance of $\hat\lambda_{\min}$ across CV runs is usually larger than the bias from choosing one $k$ over the other. 10-fold is the modern default.

Example 13 One-SE rule in action

On the prostate dataset, 10-fold lasso CV identifies $\hat\lambda_{\min}$ at a small value where most of the 8 predictors enter the active set. Applying the one-SE rule shifts to a substantially larger $\hat\lambda_{\text{1SE}}$ where only 4–5 predictors remain — a sparser, more interpretable model whose CV performance is statistically indistinguishable. The one-SE rule’s aggressive bias toward sparsity is a feature when interpretation matters (e.g. variable importance for medical research) and a bug when prediction is the only goal.

Figure: k-fold cross-validation curve — λ̂_min and the one-SE rule. Left panel: U-shaped 10-fold CV curve (mean squared error with one-SE error bars) against log λ, with vertical dashed lines at λ_min (minimizer of the mean CV error) and λ_1SE (the largest λ within one SE of the minimum, separated by a horizontal one-SE threshold line). Right panel: active-set size as a function of λ — a staircase descending from |𝒜| = 8, with annotated sizes |𝒜| = 8 and |𝒜| = 3 at the two markers. Prostate-cancer data, seed 42; λ_min = 3.71e-1, λ_1SE = 1.40e+1.
Reading the figure. Left: U-shaped CV curve with one-SE error bars at each λ. Right: active-set size (number of nonzero coefficients in the refit) drops as λ grows. The two vertical markers highlight the prediction-optimal λ_min and the parsimony-favoring λ_1SE. Both refits live as cv.pathAtLambdaMin and cv.pathAtLambdaOneSE in the result struct.
Remark 18 One-SE rule formal statement

Letting SE^(λ)=sd(CV^k(f)(λ))/k\widehat{\text{SE}}(\lambda) = \text{sd}(\widehat{\text{CV}}_k^{(f)}(\lambda)) / \sqrt{k} be the empirical standard error of the per-fold CV errors (squared error or deviance, as in Definition 8) at λ\lambda, the one-SE rule is

λ^1SE  =  max ⁣{λ  :  CV^k(λ)minλCV^k(λ)+SE^ ⁣(λ^min)}.\hat\lambda_{\text{1SE}} \;=\; \max\!\left\{\lambda \;:\; \widehat{\text{CV}}_k(\lambda) \le \min_\lambda \widehat{\text{CV}}_k(\lambda) + \widehat{\text{SE}}\!\left(\hat\lambda_{\min}\right)\right\}.

This is Breiman, Friedman, Olshen & Stone (CART 1984) Section 3.4. The rule is deliberately conservative — it picks the sparsest model whose predictive performance falls within sampling noise of the best.

Remark 19 Nested CV for honest generalization-error estimation

A subtle pitfall: if you choose λ\lambda by CV and then report the same CV curve’s minimum value as your “test error,” you’ve leaked tuning information into the test estimate. Nested cross-validation fixes this: an outer CV loop estimates generalization error, and within each outer fold, an inner CV loop selects λ\lambda. The outer CV’s mean error is then an honest estimate of generalization performance. Topic 24 §24.5 Rem 11 develops nested CV in the broader model-selection context; it’s mentioned here so the reader avoids the over-optimistic shortcut.
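A compact sketch of the nested scheme, reusing the `Fit` callback and `kFoldCV` helper from the §23.8 sketch above (both are illustrative stand-ins, not the codebase API). The key point is in the comment: λ is chosen from the outer-training data only.

```ts
function nestedCV(
  X: number[][], y: number[], lambdas: number[],
  kOuter: number, kInner: number, fitPenalized: Fit
): number {
  const n = y.length;
  const fold = Array.from({ length: n }, (_, i) => i % kOuter);
  let totalErr = 0;
  for (let j = 0; j < kOuter; j++) {
    const tr = [...Array(n).keys()].filter(i => fold[i] !== j);
    const te = [...Array(n).keys()].filter(i => fold[i] === j);
    const Xtr = tr.map(i => X[i]), ytr = tr.map(i => y[i]);
    // Inner loop: λ is tuned on the outer-training data only — no peeking at the outer test fold.
    const { lambdaMin } = kFoldCV(Xtr, ytr, lambdas, kInner, fitPenalized);
    const predict = fitPenalized(Xtr, ytr, lambdaMin);
    totalErr += te.reduce((a, i) => a + (y[i] - predict(X[i])) ** 2, 0);
  }
  return totalErr / n; // honest estimate of out-of-sample squared error
}
```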

Remark 20 AIC, BIC, $C_p$ as alternative selectors

Cross-validation is one way to pick λ\lambda. Information criteria — AIC (2+2k-2\ell + 2k), BIC (2+klogn-2\ell + k\log n), Mallows’s CpC_p — are alternatives that estimate out-of-sample risk via a complexity penalty rather than data-splitting. For penalized estimators, the effective number of parameters kk is replaced by the trace of the smoother matrix (ridge) or the size of the active set (lasso, with caveats). Topic 24 develops the asymptotic theory; for now the take-away is that AIC/BIC/CV typically give similar answers in the high-signal regime and diverge in the low-signal regime, with BIC favoring sparser models.

23.9 Worked examples: prostate-cancer and credit-default

This section closes the loop on the running examples by walking through two complete penalized-regression workflows.

Example 14 Prostate-cancer ridge + lasso paths (HTF Table 3.3)

The HTF Table 3.3 prostate-cancer dataset has 97 patients and 8 predictors; the response is lpsa. The standard workflow:

  1. Standardize predictors (center and scale to unit variance).
  2. Compute the regularization path on a 100-point logarithmic grid running from λmax\lambda_{\max} down to λmin=104λmax\lambda_{\min} = 10^{-4} \cdot \lambda_{\max}.
  3. Run 10-fold CV with seed 42 to identify λ^min\hat\lambda_{\min} and λ^1SE\hat\lambda_{\text{1SE}}.
  4. Refit on the full data at each chosen λ\lambda to get the final coefficient estimates.

The ridge path shows smooth shrinkage: every coefficient starts near OLS at small λ\lambda and decays smoothly toward zero. The lasso path is piecewise linear: each coefficient enters the active set at its own λ\lambda, with the strongest signals (lcavol, lweight) entering first and weaker ones (gleason, pgg45) entering only at small λ\lambda. At λ^1SE\hat\lambda_{\text{1SE}}, the lasso retains 4–5 predictors; ridge keeps all 8 with shrunken weights. The two estimators agree on the leading coefficients but diverge in their treatment of the marginal predictors.
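A sketch of steps 1–2 of that workflow. The λ_max formula used below — the smallest λ at which every lasso coefficient is zero, max_j |⟨x_j, y − ȳ⟩| / n for standardized predictors — is a standard convention (the one glmnet uses), stated here as an assumption rather than a quote from regression.ts.

```ts
// Step 1: center and scale each column to unit (population) variance.
function standardize(X: number[][]): number[][] {
  const n = X.length, p = X[0].length;
  const mean = Array.from({ length: p }, (_, j) => X.reduce((a, row) => a + row[j], 0) / n);
  const sd = Array.from({ length: p }, (_, j) =>
    Math.sqrt(X.reduce((a, row) => a + (row[j] - mean[j]) ** 2, 0) / n));
  return X.map(row => row.map((v, j) => (v - mean[j]) / sd[j]));
}

// Step 2: 100-point log-spaced grid from λ_max down to λ_min = ratio · λ_max.
function lambdaGrid(X: number[][], y: number[], nLambda = 100, ratio = 1e-4): number[] {
  const n = y.length, p = X[0].length;
  const yBar = y.reduce((a, b) => a + b, 0) / n;
  const yc = y.map(v => v - yBar);
  // λ_max = max_j |⟨x_j, y_c⟩| / n: above this, the entire lasso coefficient vector is zero.
  let lambdaMax = 0;
  for (let j = 0; j < p; j++) {
    const dot = X.reduce((a, row, i) => a + row[j] * yc[i], 0);
    lambdaMax = Math.max(lambdaMax, Math.abs(dot) / n);
  }
  const logMax = Math.log(lambdaMax), logMin = Math.log(lambdaMax * ratio);
  return Array.from({ length: nLambda }, (_, l) =>
    Math.exp(logMax + (l / (nLambda - 1)) * (logMin - logMax))); // descending log-spaced grid
}
```

Steps 3–4 then feed the standardized design and the grid into the CV machinery of §23.8 and refit on the full data at the two selected λ values.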

Full prostate-cancer worked-example figure. Two-panel ridge and lasso regularization paths, eight coefficient lines per panel, x-axis log λ, vertical dashed lines at λ_min (shorter) and λ_1SE (longer) from 10-fold CV. Coefficient labels at the right edge of each panel showing predictor names (lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45). Active-set size annotated below the x-axis.
Example 15 Lasso-logistic on credit-default polynomial expansion

Topic 22’s EXAMPLE_6_GLM_DATA has 200 observations and 2 raw predictors. Expanding to degree-3 polynomial features gives 34 columns — a high-dimensional regime where unpenalized logistic regression would overfit. Lasso-logistic regression at the CV-optimal λ\lambda selects 6–10 features automatically: the linear and squared terms in the original predictors, a handful of cross-terms, and a few cubic terms. The remaining 24\sim 24 polynomial features are zeroed out. This is the practical workflow that justifies the brief’s “high-dimensional p ≫ n” promise from Topic 22 §22.10 Rem 28: penalized GLMs make polynomial-feature expansion a viable modeling strategy.
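A sketch of the expansion step for two raw predictors, generating every monomial x₁ᵃ·x₂ᵇ with 1 ≤ a + b ≤ degree. The exact column count in the preset depends on the basis construction used there, so this is the generic recipe only; `Xraw` in the usage comment is a hypothetical input matrix.

```ts
// Expand a two-column design matrix into all monomials of total degree 1..degree.
function polyExpand2(X: number[][], degree: number): number[][] {
  return X.map(([x1, x2]) => {
    const feats: number[] = [];
    for (let d = 1; d <= degree; d++) {
      for (let a = d; a >= 0; a--) {
        feats.push(Math.pow(x1, a) * Math.pow(x2, d - a)); // monomial x1^a · x2^(d−a)
      }
    }
    return feats;
  });
}

// Usage (hypothetical input): the expanded columns feed straight into the lasso-logistic
// path and CV machinery above; the L¹ penalty then zeroes out the monomials the data
// cannot support.
// const Xpoly = polyExpand2(Xraw, 3);
```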

Remark 21 Coefficient interpretation under shrinkage

A common mistake: reading β^jridge\hat\beta_j^{\text{ridge}} as if it were an unbiased estimate of βj\beta_j. It is not. The ridge estimator is biased toward zero by design — that is the source of its variance reduction. The right interpretation: β^jridge\hat\beta_j^{\text{ridge}} is a posterior mode under a Gaussian prior with mean zero, telling you the predictor’s contribution after regularizing for likely small effects. Lasso is even more aggressive: β^j=0\hat\beta_j = 0 does not mean the predictor has no effect, just that the data do not provide enough evidence to overcome the L¹ penalty’s threshold. Selected variables can be reported as “the lasso identified these as relevant,” but their numerical coefficient values should be interpreted with the bias caveat.

Remark 22 Production tooling

The penalized-estimation universe has a small number of canonical implementations:

  • glmnet (R, also Python via glmnet-python) — Friedman, Hastie & Tibshirani’s reference implementation. Cyclic coord descent + warm-start path + screening rules + auto CV. The benchmark.
  • sklearn.linear_model.{Ridge, Lasso, ElasticNet, RidgeCV, LassoCV, ElasticNetCV} (Python) — coord-descent (lasso/elasticnet) and direct solver (ridge); CV variants automate λ\lambda selection. Slightly slower than glmnet for very large pp but vastly more convenient for sklearn pipelines.
  • MLJLinearModels.jl (Julia) — fast, supports a wider variety of penalties.
  • regression.ts (this codebase) — a faithful reimplementation of the core algorithms for browser-side computation in interactive components. Not a production solver — for real workloads, use one of the above.

23.10 Forward Map

Topic 23 establishes penalized estimation as a unifying framework for stabilizing OLS and IRLS, with three penalty families (ridge, lasso, elastic net), one core algorithm (cyclic coord descent), and one selection procedure (cross-validation). The forward map below catalogs where the framework leads.

Theorem 7 Lasso ℓ₂-consistency under sparsity (stated, Wainwright 2019)

Let β\boldsymbol\beta^\star have at most ss nonzero components and assume X/n\mathbf{X}/\sqrt{n} satisfies the restricted eigenvalue condition with constant κ>0\kappa > 0. Then for λσlogp/n\lambda \asymp \sigma\sqrt{\log p / n}, the lasso estimator satisfies

β^β22  =  Op ⁣(sσ2logpnκ2).\|\hat{\boldsymbol\beta} - \boldsymbol\beta^\star\|_2^2 \;=\; O_p\!\left(\frac{s \sigma^2 \log p}{n \kappa^2}\right).

The rate slogp/ns\log p / n is information-theoretically optimal up to constants — the lasso is minimax-rate optimal under sparsity. WAI2019 Thm 7.13 (statement adapted). Not proved here; the full development (RIP, RE conditions, dual certificates) is at formalml.

Remark 23 Topic 24 — model selection (AIC, BIC, $C_p$, CV theory)

The next Track 6 topic generalizes λ\lambda selection to a broader model-selection framework. AIC (2+2k-2\ell + 2k) estimates expected out-of-sample log-likelihood; BIC (2+klogn-2\ell + k\log n) approximates the Bayesian marginal likelihood; Mallows’s CpC_p is the Gaussian special case of AIC. Stone 1977 identifies LOO-CV with AIC asymptotically; Yang 2005 proves selection consistency and prediction efficiency are formally incompatible. Topic 24 develops the asymptotic theory and the trade-offs.

Remark 24 Track 7 — Bayesian regularization, hierarchical priors, horseshoe

Thm 5’s MAP-as-penalized-MLE correspondence is the gateway to the Bayesian treatment of regularization. Topic 25 §25.7–§25.8 develops prior selection (Jeffreys, weakly-informative, improper with integrable posterior) and shows MAP-as-penalized-MLE as Ex 8. Topic 28 (Hierarchical Bayes) develops hierarchical priors that learn λ\lambda from the data (empirical-Bayes ridge) and modern continuous shrinkage priors like the horseshoe (Carvalho–Polson–Scott 2010), which approximate lasso’s sparsity behavior with a globally adaptive scale. The horseshoe is, in a sense, the “lasso of priors.”

Remark 25 Track 8 — group, fused, and sparse-group lasso

Several practical generalizations of lasso are catalogued in HTW2015:

  • Group lasso (Yuan & Lin 2006): penalty λgβg2\lambda \sum_g \|\boldsymbol\beta_g\|_2 for predefined predictor groups, performing group-level selection.
  • Fused lasso (Tibshirani et al. 2005): penalty λ1jβj+1βj\lambda_1 \sum_j |\beta_{j+1} - \beta_j| for ordered coefficients, encouraging piecewise-constant fits — used in genomics and signal denoising.
  • Sparse-group lasso (Simon et al. 2013): combines L¹ within groups and L² across groups.

The group and sparse-group penalties remain block-separable, so blockwise coordinate descent still applies; the fused-lasso term couples adjacent coefficients and is non-separable, so cyclic CD does not apply directly — proximal-gradient or ADMM is the standard solver there. Track 8 covers the algorithms and theory.

Remark 26 Non-convex penalties: SCAD and MCP

Lasso has known biases at large coefficients — its constant L¹ shrinkage applies even when the data overwhelmingly support a strong nonzero effect. SCAD (Smoothly Clipped Absolute Deviation; Fan & Li 2001) and MCP (Minimax Concave Penalty; Zhang 2010) are non-convex penalties designed to give lasso-like sparsity for small coefficients but no shrinkage for large ones. The trade-off is computational: the resulting objective is non-convex, requiring local-linear-approximation algorithms or proximal-gradient methods with careful initialization. SCAD/MCP estimators have the oracle property — they perform as if the true active set were known a priori — under regularity conditions. The methodology is mature; Topic 23 leaves it as a deferral pointer (FAN2001).

Remark 27 formalml.com — high-dimensional theory and debiased lasso

The asymptotic theory of penalized estimation in pnp \gg n regimes is the domain of high-dimensional statistics. The key tools are oracle inequalities (Bickel–Ritov–Tsybakov 2009), restricted-eigenvalue conditions (Bickel et al. 2009), and the dual-certificate technique. The debiased lasso (Javanmard–Montanari 2014; van de Geer et al. 2014) gives valid post-selection confidence intervals by adding a one-step Newton correction to the lasso estimate. Double descent (Belkin et al. 2019) shows that predictive risk can have a second descent past the interpolation threshold — overparameterization is not always overfitting. All of this lives at formalml’s high-dimensional regression topic.

Remark 28 formalml.com — neural-net weight decay, SGD-with-decay, AdamW

Modern deep learning training relies heavily on weight decay — Thm 5’s Gaussian-prior MAP applied to neural-network weights. The SGD-with-decay update wt+1=(1ηλ)wtηL\mathbf{w}_{t+1} = (1 - \eta\lambda)\mathbf{w}_t - \eta \nabla\mathcal{L} implements L² regularization; the more-complicated AdamW decoupling separates weight decay from the adaptive learning rate. Implicit regularization — the bias of SGD toward minimum-norm solutions in overparameterized regimes — plays the role of explicit ridge in modern practice, often with no tuning required. The connection to Topic 23 is direct: every deep-learning regularizer is a penalty, every penalty is a prior, and the bias-variance tradeoff still rules.

Forward-map diagram for Topic 23. Central hub of penalized estimation with arrows out to Topic 24 (model selection), Track 7 (Bayesian regularization, horseshoe priors, GLMMs), Track 8 (group/fused/sparse-group lasso, non-convex penalties), and formalml.com (high-dim theory, debiased lasso, weight decay, double descent). Back-arrows to Topics 14 (MLE), 21 (linear regression), 22 (GLMs). Track-color coded with Track 6 in blue, Tracks 7–8 in amber, formalml.com in purple.

Topic 23 completes Track 6’s main estimation framework. Linear regression (Topic 21) is OLS as orthogonal projection. GLMs (Topic 22) are IRLS on the exponential family. Penalized estimation (this topic) is what you reach for when those frameworks break — collinearity, separation, pnp \gg n. Topic 24 closes Track 6 with the model-selection layer that picks which features to include in the first place; reciprocally, the λ\lambda-indexed family of this topic is itself a continuous model space with effective DOF trHλ\operatorname{tr}\mathbf{H}_\lambda — Topic 23 is a special case of Topic 24’s framework. Together, Topics 21–24 cover the full classical-regression toolkit; Tracks 7–8 and formalml.com take it from there.


References

  1. Hoerl, A. E. & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 55–67.
  2. Tikhonov, A. N. (1943). On the stability of inverse problems. Doklady Akademii Nauk SSSR, 39(5), 195–198.
  3. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1), 267–288.
  4. Zou, H. & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67(2), 301–320.
  5. Friedman, J., Hastie, T. & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1–22.
  6. Tseng, P. (2001). Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3), 475–494.
  7. Hastie, T., Tibshirani, R. & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer.
  8. Hastie, T., Tibshirani, R. & Wainwright, M. (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman & Hall/CRC.
  9. Stein, C. (1956). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, 197–206. University of California Press.
  10. Firth, D. (1993). Bias reduction of maximum likelihood estimates. Biometrika, 80(1), 27–38.
  11. Fan, J. & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360.
  12. Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press.
  13. Lehmann, E. L. & Romano, J. P. (2005). Testing Statistical Hypotheses (3rd ed.). Springer.
  14. Casella, G. & Berger, R. L. (2002). Statistical Inference (2nd ed.). Duxbury.