One of Ridge regression's most remarkable properties is that it admits a closed-form solution—an explicit formula that directly computes the optimal coefficients without iterative optimization. This is rare and valuable: most regularized models (including Lasso, neural networks, and many others) require iterative algorithms that only approximate the optimum.
The closed-form solution reveals deep connections between regularization, matrix algebra, and spectral theory. Understanding this derivation provides insights that extend far beyond Ridge regression itself.
By the end of this page, you will be able to derive the Ridge regression solution from first principles, understand why the solution always exists (unlike OLS), interpret the solution through eigenvalue decomposition, and implement Ridge regression efficiently using numerical best practices.
Let's derive the Ridge regression solution step by step, starting from the objective function and applying calculus.
The Ridge objective:
$$J(\boldsymbol{\beta}) = |\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2 + \lambda |\boldsymbol{\beta}|_2^2$$
Step 1: Expand the objective
Expanding the squared norms:
$$J(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + \lambda \boldsymbol{\beta}^T\boldsymbol{\beta}$$
Expanding the first term:
$$= \mathbf{y}^T\mathbf{y} - 2\mathbf{y}^T\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} + \lambda \boldsymbol{\beta}^T\boldsymbol{\beta}$$
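The expansion can be sanity-checked numerically. This sketch (synthetic random data, arbitrary $\boldsymbol{\beta}$ and $\lambda$) evaluates both forms of the objective and confirms they agree:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
beta = rng.normal(size=3)
lam = 0.5

# Objective as a residual norm plus penalty
J_norm = np.sum((y - X @ beta) ** 2) + lam * np.sum(beta ** 2)

# Expanded form: y^T y - 2 y^T X beta + beta^T X^T X beta + lam beta^T beta
J_expanded = (y @ y - 2 * (y @ X) @ beta
              + beta @ (X.T @ X) @ beta + lam * (beta @ beta))

print(np.isclose(J_norm, J_expanded))
```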
Step 2: Compute the gradient
To find the minimum, we take the gradient with respect to $\boldsymbol{\beta}$ and set it to zero:
$$\nabla_{\boldsymbol{\beta}} J = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} + 2\lambda \boldsymbol{\beta} = \mathbf{0}$$
Simplifying:
$$\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} + \lambda \boldsymbol{\beta} = \mathbf{X}^T\mathbf{y}$$
Factoring out $\boldsymbol{\beta}$:
$$(\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})\boldsymbol{\beta} = \mathbf{X}^T\mathbf{y}$$
This is the Ridge normal equation—analogous to the OLS normal equation but with $\lambda \mathbf{I}$ added to the matrix.
Step 3: Solve for β
Assuming $(\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})$ is invertible (which we'll prove shortly), we can solve directly:
$$\boxed{\hat{\boldsymbol{\beta}}_{\text{Ridge}} = (\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}}$$
This is the closed-form Ridge solution. Let's understand why this formula is so powerful.
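One way to convince yourself the boxed formula really is the minimizer: compute it on synthetic data and verify that the gradient from Step 2 vanishes there. A minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
y = rng.normal(size=30)
lam = 1.0
p = X.shape[1]

# Closed-form Ridge solution (solve the normal equation, no explicit inverse)
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Gradient of the objective: -2 X^T (y - X beta) + 2 lam beta; should be ~0
grad = -2 * X.T @ (y - X @ beta_hat) + 2 * lam * beta_hat
print(np.allclose(grad, 0, atol=1e-8))
```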
Comparing to OLS: $\hat{\boldsymbol{\beta}}_{\text{OLS}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$, the only difference is the addition of $\lambda \mathbf{I}$. This simple modification has profound effects on the solution's existence, stability, and statistical properties.
A critical advantage of Ridge regression over OLS is that the Ridge solution always exists whenever $\lambda > 0$, regardless of the relationship between $n$ (samples) and $p$ (features).
The key insight: positive definiteness
The matrix $\mathbf{X}^T\mathbf{X}$ is always positive semi-definite (PSD): for any vector $\mathbf{v}$,
$$\mathbf{v}^T(\mathbf{X}^T\mathbf{X})\mathbf{v} = |\mathbf{X}\mathbf{v}|_2^2 \geq 0$$
However, $\mathbf{X}^T\mathbf{X}$ may be singular (not full rank), especially when $p > n$ or when features are perfectly collinear.
Adding λI makes the matrix positive definite:
For any vector $\mathbf{v} eq \mathbf{0}$:
$$\mathbf{v}^T(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})\mathbf{v} = \mathbf{v}^T\mathbf{X}^T\mathbf{X}\mathbf{v} + \lambda \mathbf{v}^T\mathbf{v}$$
$$= |\mathbf{X}\mathbf{v}|_2^2 + \lambda |\mathbf{v}|_2^2$$
Since $\lambda > 0$ and $|\mathbf{v}|_2^2 > 0$ for $\mathbf{v} eq \mathbf{0}$:
$$\mathbf{v}^T(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})\mathbf{v} \geq \lambda |\mathbf{v}|_2^2 > 0$$
Therefore, $(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})$ is positive definite and hence invertible.
For any $\lambda > 0$, the matrix $(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})$ is positive definite and invertible. This means the Ridge solution exists, is unique, and can be computed directly—regardless of whether $n > p$, $n = p$, or $n < p$.
| Scenario | OLS Solution | Ridge Solution (λ > 0) |
|---|---|---|
| $n > p$, full rank | ✓ Exists, unique | ✓ Exists, unique |
| $n = p$, full rank | ✓ Exists, unique | ✓ Exists, unique |
| $n < p$ (underdetermined) | ✗ Infinitely many solutions | ✓ Exists, unique |
| Severe multicollinearity | ⚠ Exists but numerically unstable | ✓ Exists, unique, stable |
| Perfect collinearity (rank deficient) | ✗ Singular, infinitely many solutions | ✓ Exists, unique |
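The underdetermined row of the table is easy to demonstrate: with random data and $p > n$, $\mathbf{X}^T\mathbf{X}$ is rank-deficient, yet adding $\lambda\mathbf{I}$ restores full rank and the system solves cleanly. A sketch with made-up dimensions:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 10, 50          # underdetermined: more features than samples
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

G = X.T @ X
print(np.linalg.matrix_rank(G))   # rank is at most n = 10 < p: G is singular

lam = 0.1
A = G + lam * np.eye(p)
print(np.linalg.matrix_rank(A))   # full rank p = 50: Ridge system is solvable
beta = np.linalg.solve(A, X.T @ y)  # unique Ridge solution exists
```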
The deepest insights into Ridge regression come from analyzing the solution through eigenvalue decomposition. This spectral perspective reveals exactly how regularization modifies the OLS solution.
Eigendecomposition of X^TX:
Let $\mathbf{X}^T\mathbf{X}$ have eigenvalue decomposition:
$$\mathbf{X}^T\mathbf{X} = \mathbf{V}\mathbf{D}\mathbf{V}^T$$
where:
- $\mathbf{V}$ is an orthogonal $p \times p$ matrix whose columns $\mathbf{v}_j$ are eigenvectors,
- $\mathbf{D} = \text{diag}(d_1, \ldots, d_p)$ holds the corresponding eigenvalues $d_j \geq 0$.

The eigenvalues $d_j$ represent the "strength" of variation in each principal direction of the feature space.
Effect of adding λI:
Adding $\lambda\mathbf{I}$ simply shifts all eigenvalues:
$$\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I} = \mathbf{V}(\mathbf{D} + \lambda\mathbf{I})\mathbf{V}^T = \mathbf{V}\text{diag}(d_1 + \lambda, d_2 + \lambda, \ldots, d_p + \lambda)\mathbf{V}^T$$
The inverse becomes:
$$(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1} = \mathbf{V}\text{diag}\left(\frac{1}{d_1 + \lambda}, \frac{1}{d_2 + \lambda}, \ldots, \frac{1}{d_p + \lambda}\right)\mathbf{V}^T$$
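The "shift by $\lambda$" claim can be verified directly: the eigenvalues of $\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}$ are exactly $d_j + \lambda$. A quick check on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 5))
lam = 2.0

d = np.linalg.eigvalsh(X.T @ X)                          # eigenvalues of X^T X
d_shifted = np.linalg.eigvalsh(X.T @ X + lam * np.eye(5))

# Adding lambda*I shifts every eigenvalue by exactly lambda
print(np.allclose(d_shifted, d + lam))
```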
The shrinkage factor:
To understand how Ridge modifies OLS, express the Ridge solution as:
$$\hat{\boldsymbol{\beta}}_{\text{Ridge}} = \mathbf{V}\text{diag}\left(\frac{d_1}{d_1 + \lambda}, \frac{d_2}{d_2 + \lambda}, \ldots, \frac{d_p}{d_p + \lambda}\right)\mathbf{V}^T \hat{\boldsymbol{\beta}}_{\text{OLS}}$$
Define the shrinkage factor for direction $j$:
$$s_j = \frac{d_j}{d_j + \lambda}$$
This factor lies in $(0, 1)$ for $\lambda > 0$, indicating shrinkage.
Ridge regression shrinks each principal component of the OLS solution by a factor $s_j = d_j/(d_j + \lambda)$. Directions with small eigenvalues (low-variance directions in feature space) are shrunk more aggressively. Directions with large eigenvalues (high-variance directions) are shrunk less. This is precisely what we want: uncertain directions are regularized more.
| Eigenvalue $d_j$ | Shrinkage Factor $s_j$ | Interpretation |
|---|---|---|
| $d_j \gg \lambda$ | $s_j \approx 1$ | Almost no shrinkage; high-confidence direction |
| $d_j = \lambda$ | $s_j = 0.5$ | 50% shrinkage; moderate uncertainty |
| $d_j \ll \lambda$ | $s_j \approx 0$ | Severe shrinkage; low-confidence direction |
| $d_j = 0$ (singular) | $s_j = 0$ | Complete elimination; OLS would blow up |
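The per-direction shrinkage relation can be verified numerically: rotating $\hat{\boldsymbol{\beta}}_{\text{OLS}}$ into the eigenbasis, scaling each coordinate by $s_j$, and rotating back reproduces the Ridge solution exactly. A sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)
lam = 3.0
p = 4

G = X.T @ X
beta_ols = np.linalg.solve(G, X.T @ y)
beta_ridge = np.linalg.solve(G + lam * np.eye(p), X.T @ y)

# Shrink each principal direction of the OLS solution by s_j = d_j/(d_j+lam)
d, V = np.linalg.eigh(G)
s = d / (d + lam)
beta_shrunk = V @ (s * (V.T @ beta_ols))

print(np.allclose(beta_ridge, beta_shrunk))  # spectral form matches
print(np.all((s > 0) & (s < 1)))             # every factor lies in (0, 1)
```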
Effective degrees of freedom:
The sum of shrinkage factors defines the effective degrees of freedom of the Ridge model:
$$\text{df}(\lambda) = \sum_{j=1}^{p} \frac{d_j}{d_j + \lambda} = \text{tr}\left[\mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\right]$$
This quantity:
- equals $p$ when $\lambda = 0$ (for full-rank $\mathbf{X}$), recovering OLS,
- decreases monotonically as $\lambda$ increases,
- approaches $0$ as $\lambda \to \infty$.

The effective degrees of freedom is crucial for model selection criteria like AIC and BIC adapted for Ridge regression.
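Both expressions for $\text{df}(\lambda)$ — the eigenvalue sum and the trace of the hat matrix — can be computed and compared on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 6))
lam = 4.0
p = 6

# Spectral formula: sum of d_j / (d_j + lam)
d = np.linalg.eigvalsh(X.T @ X)
df_eig = np.sum(d / (d + lam))

# Trace formula: tr[X (X^T X + lam I)^{-1} X^T]
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
df_trace = np.trace(H)

print(np.isclose(df_eig, df_trace))  # the two formulas agree
print(df_trace < p)                  # strictly fewer than p effective params
```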
The Ridge solution can be expressed in several equivalent forms, each offering computational or conceptual advantages in different contexts.
Form 1: Primal form (standard)
$$\hat{\boldsymbol{\beta}}_{\text{Ridge}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}_p)^{-1}\mathbf{X}^T\mathbf{y}$$
This is a $p \times p$ matrix inversion. Preferred when $p < n$.
Form 2: Dual form (Woodbury identity)
Using the Woodbury matrix identity, we can rewrite:
$$\hat{\boldsymbol{\beta}}_{\text{Ridge}} = \mathbf{X}^T(\mathbf{X}\mathbf{X}^T + \lambda\mathbf{I}_n)^{-1}\mathbf{y}$$
This is an $n \times n$ matrix inversion. Preferred when $n < p$.
The dual form involves $(\mathbf{X}\mathbf{X}^T + \lambda\mathbf{I}_n)^{-1}$—the regularized Gram matrix. This is the foundation of kernel Ridge regression: replace $\mathbf{X}\mathbf{X}^T$ with a kernel matrix $\mathbf{K}$ to handle nonlinear relationships implicitly.
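The Woodbury identity guarantees the primal and dual forms coincide; this sketch checks that on a synthetic high-dimensional problem where the dual's $n \times n$ solve is the cheaper one:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 15, 40            # high-dimensional setting: n < p
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
lam = 1.5

# Primal: solve a p x p system
beta_primal = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Dual: solve an n x n system, then map back through X^T
beta_dual = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)

print(np.allclose(beta_primal, beta_dual))
```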
Form 3: SVD form
Let $\mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$ be the (thin) singular value decomposition where:
- $\mathbf{U}$ has orthonormal columns $\mathbf{u}_j$ (left singular vectors),
- $\mathbf{\Sigma} = \text{diag}(\sigma_1, \ldots, \sigma_r)$ holds the nonzero singular values,
- $\mathbf{V}$ has orthonormal columns $\mathbf{v}_j$ (right singular vectors).

The Ridge solution becomes:
$$\hat{\boldsymbol{\beta}}_{\text{Ridge}} = \sum_{j=1}^{r} \frac{\sigma_j}{\sigma_j^2 + \lambda} (\mathbf{u}_j^T\mathbf{y})\mathbf{v}_j$$
where $d_j = \sigma_j^2$ are the eigenvalues of $\mathbf{X}^T\mathbf{X}$.
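The SVD form can be checked against the primal form on synthetic data; the sum over singular triplets is vectorized as one matrix product:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(30, 5))
y = rng.normal(size=30)
lam = 2.5

beta_primal = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

# SVD form: sum_j sigma_j/(sigma_j^2 + lam) * (u_j^T y) v_j
U, s, Vt = np.linalg.svd(X, full_matrices=False)
beta_svd = Vt.T @ ((s / (s**2 + lam)) * (U.T @ y))

print(np.allclose(beta_primal, beta_svd))
```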
Form 4: Augmented data form
Ridge regression can be viewed as OLS on an augmented dataset:
$$\tilde{\mathbf{X}} = \begin{pmatrix} \mathbf{X} \\ \sqrt{\lambda}\mathbf{I}_p \end{pmatrix}, \quad \tilde{\mathbf{y}} = \begin{pmatrix} \mathbf{y} \\ \mathbf{0}_p \end{pmatrix}$$
Then:
$$\hat{\boldsymbol{\beta}}_{\text{Ridge}} = (\tilde{\mathbf{X}}^T\tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^T\tilde{\mathbf{y}}$$
This is useful for deriving properties using OLS theory.
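The augmented-data trick is also easy to test: stack $\sqrt{\lambda}\,\mathbf{I}_p$ under $\mathbf{X}$ and zeros under $\mathbf{y}$, run ordinary least squares, and compare with the primal Ridge solution:

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(25, 4))
y = rng.normal(size=25)
lam = 0.8
p = 4

beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# OLS on the augmented dataset (X with sqrt(lam)*I below, y padded with zeros)
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
y_aug = np.concatenate([y, np.zeros(p)])
beta_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

print(np.allclose(beta_ridge, beta_aug))
```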
| Form | Matrix Size | Cost | Best When |
|---|---|---|---|
| Primal | $p \times p$ | $O(np^2 + p^3)$ | $p < n$ |
| Dual | $n \times n$ | $O(n^2p + n^3)$ | $n < p$ |
| SVD | $r \times r$ | $O(\min(n,p) \cdot np)$ | Multiple $\lambda$ values |
| Cholesky | $p \times p$ | $O(np^2 + p^3/3)$ | Numerical stability |
While the closed-form solution is mathematically elegant, implementing it numerically requires care. Directly computing matrix inverses is rarely the best approach.
Never explicitly invert matrices:
Computing $(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}$ explicitly is:
- slower than necessary (we only need $\mathbf{A}^{-1}\mathbf{b}$, never $\mathbf{A}^{-1}$ itself),
- less numerically stable (rounding errors are amplified),
- wasteful of memory for large $p$.

Instead, solve the linear system directly.
```python
import numpy as np
from scipy import linalg

def ridge_regression_naive(X, y, lambda_param):
    """
    Ridge regression via explicit matrix inversion.
    NOT RECOMMENDED - shown for educational purposes only.
    """
    p = X.shape[1]
    A = X.T @ X + lambda_param * np.eye(p)
    b = X.T @ y
    # BAD: Explicit inverse
    beta = np.linalg.inv(A) @ b
    return beta

def ridge_regression_solve(X, y, lambda_param):
    """
    Ridge regression via direct linear solve.
    RECOMMENDED for a single lambda value.

    Solves: (X^T X + λI) β = X^T y
    """
    p = X.shape[1]
    A = X.T @ X + lambda_param * np.eye(p)
    b = X.T @ y
    # GOOD: Solve linear system directly
    beta = np.linalg.solve(A, b)
    return beta

def ridge_regression_cholesky(X, y, lambda_param):
    """
    Ridge regression via Cholesky decomposition.
    BEST for numerical stability when the matrix is well-conditioned.

    Since A = X^T X + λI is positive definite, we can use Cholesky.
    """
    p = X.shape[1]
    A = X.T @ X + lambda_param * np.eye(p)
    b = X.T @ y
    # Cholesky: A = L L^T where L is lower triangular
    L = linalg.cholesky(A, lower=True)
    # Solve L z = b, then L^T β = z
    z = linalg.solve_triangular(L, b, lower=True)
    beta = linalg.solve_triangular(L.T, z, lower=False)
    return beta

def ridge_regression_svd(X, y, lambda_param):
    """
    Ridge regression via SVD.
    BEST for trying multiple lambda values or handling
    ill-conditioned problems.

    Decomposes X = U Σ V^T, then:
    β = V diag(σ_j / (σ_j^2 + λ)) U^T y
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)  # s: singular values
    # Shrinkage factors: s / (s^2 + lambda)
    d = s / (s**2 + lambda_param)
    # Solution: V @ diag(d) @ U^T @ y
    beta = Vt.T @ (d * (U.T @ y))
    return beta

def ridge_regression_dual(X, y, lambda_param):
    """
    Ridge regression via dual form.
    PREFERRED when n << p (high-dimensional setting).

    Solves: β = X^T (X X^T + λI_n)^{-1} y
    """
    n = X.shape[0]
    K = X @ X.T + lambda_param * np.eye(n)
    alpha = np.linalg.solve(K, y)
    beta = X.T @ alpha
    return beta
```

Use `np.linalg.solve()` or Cholesky instead of an explicit matrix inverse. This is faster and more stable.

Ridge regression has an illuminating connection to the Moore-Penrose pseudoinverse, the generalization of the matrix inverse to non-square and singular matrices.
The pseudoinverse solution:
For an underdetermined or rank-deficient system $\mathbf{X}\boldsymbol{\beta} = \mathbf{y}$, the minimum-norm least squares solution is:
$$\hat{\boldsymbol{\beta}}_{\text{pseudo}} = \mathbf{X}^+ \mathbf{y}$$
where $\mathbf{X}^+$ is the Moore-Penrose pseudoinverse.
Using the SVD $\mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$:
$$\mathbf{X}^+ = \mathbf{V}\mathbf{\Sigma}^+\mathbf{U}^T$$
where $\mathbf{\Sigma}^+ = \text{diag}(1/\sigma_1, \ldots, 1/\sigma_r, 0, \ldots, 0)$.
Ridge as regularized pseudoinverse:
The Ridge solution can be written as:
$$\hat{\boldsymbol{\beta}}_{\text{Ridge}} = \mathbf{V}\mathbf{\Sigma}_\lambda^+\mathbf{U}^T\mathbf{y}$$
where $\mathbf{\Sigma}_\lambda^+ = \text{diag}\left(\frac{\sigma_1}{\sigma_1^2 + \lambda}, \ldots, \frac{\sigma_r}{\sigma_r^2 + \lambda}\right)$.
The limit as λ → 0⁺:
$$\lim_{\lambda \to 0^+} \hat{\boldsymbol{\beta}}_{\text{Ridge}} = \mathbf{X}^+ \mathbf{y}$$
Thus, Ridge regression interpolates between:
- the minimum-norm least squares (pseudoinverse) solution as $\lambda \to 0^+$, and
- the zero vector as $\lambda \to \infty$.

Ridge provides a continuous family of solutions parameterized by $\lambda$.
The pseudoinverse uses exact ratios $1/\sigma_j$, which explode when $\sigma_j$ is small. Ridge replaces these with $\sigma_j/(\sigma_j^2 + \lambda)$, which remains bounded even as $\sigma_j \to 0$. This "smoothing" of the inversion is precisely what makes Ridge stable.
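The limit can be illustrated numerically on a rank-deficient synthetic problem: with a very small $\lambda$, the Ridge solution should be close to `np.linalg.pinv(X) @ y`:

```python
import numpy as np

rng = np.random.default_rng(9)
n, p = 8, 20                       # rank-deficient: X^T X is singular
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

beta_pinv = np.linalg.pinv(X) @ y  # minimum-norm least squares solution

lam = 1e-8                         # nearly-zero regularization
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# As lambda -> 0+, Ridge approaches the pseudoinverse solution
print(np.allclose(beta_ridge, beta_pinv, atol=1e-5))
```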
The hat matrix (also called the smoother matrix) maps observed responses to fitted values. It plays a central role in understanding the geometry and diagnostics of regression.
OLS hat matrix:
$$\mathbf{H}_{\text{OLS}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$$
Fitted values: $\hat{\mathbf{y}} = \mathbf{H}_{\text{OLS}} \mathbf{y}$
Properties:
- Symmetric: $\mathbf{H}_{\text{OLS}}^T = \mathbf{H}_{\text{OLS}}$
- Idempotent: $\mathbf{H}_{\text{OLS}}^2 = \mathbf{H}_{\text{OLS}}$ (an orthogonal projection onto the column space of $\mathbf{X}$)
- $\text{tr}(\mathbf{H}_{\text{OLS}}) = p$, the number of parameters
Ridge hat matrix:
$$\mathbf{H}_{\text{Ridge}} = \mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T$$
Fitted values: $\hat{\mathbf{y}}_{\text{Ridge}} = \mathbf{H}_{\text{Ridge}} \mathbf{y}$
Key differences from OLS:
Not idempotent: $\mathbf{H}_{\text{Ridge}}^2 \neq \mathbf{H}_{\text{Ridge}}$ in general.
Trace gives effective degrees of freedom: $$\text{df}(\lambda) = \text{tr}(\mathbf{H}_{\text{Ridge}}) = \sum_{j=1}^{p} \frac{d_j}{d_j + \lambda}$$ This is less than $p$ and decreases as $\lambda$ increases.
Shrinks toward zero: Unlike OLS which projects onto the column space of $\mathbf{X}$, Ridge also shrinks the projection.
Eigenvalue expression:
Using the eigendecomposition $\mathbf{X}^T\mathbf{X} = \mathbf{V}\mathbf{D}\mathbf{V}^T$:
$$\mathbf{H}_{\text{Ridge}} = \mathbf{X}\mathbf{V}\text{diag}\left(\frac{1}{d_j + \lambda}\right)\mathbf{V}^T\mathbf{X}^T$$
$$= \mathbf{U}\text{diag}\left(\frac{d_j}{d_j + \lambda}\right)\mathbf{U}^T$$
where $\mathbf{U}$ comes from the SVD of $\mathbf{X}$.
In OLS, the degrees of freedom equals the number of parameters $p$. In Ridge, the regularization effectively "uses up" fewer degrees of freedom because it constrains the solution. The quantity $\text{df}(\lambda) = \text{tr}(\mathbf{H}_{\text{Ridge}})$ captures this: as $\lambda$ increases, we use fewer effective parameters, reducing model complexity.
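The contrast between the two hat matrices is easy to verify on synthetic data: the OLS hat matrix is idempotent, the Ridge one is not, and the nonzero eigenvalues of $\mathbf{H}_{\text{Ridge}}$ are exactly the shrinkage factors:

```python
import numpy as np

rng = np.random.default_rng(10)
X = rng.normal(size=(30, 5))
lam = 2.0
p = 5

H_ols = X @ np.linalg.solve(X.T @ X, X.T)
H_ridge = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)

# OLS hat matrix is a projection (idempotent); the Ridge one is not
print(np.allclose(H_ols @ H_ols, H_ols))
print(np.allclose(H_ridge @ H_ridge, H_ridge))

# Nonzero eigenvalues of H_ridge are the shrinkage factors d_j/(d_j + lam)
d = np.linalg.eigvalsh(X.T @ X)
top_evals = np.sort(np.linalg.eigvalsh(H_ridge))[-p:]
print(np.allclose(top_evals, np.sort(d / (d + lam))))
```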
We have derived and deeply analyzed the closed-form Ridge regression solution. Let's consolidate the key insights:
- The Ridge objective is minimized exactly by $\hat{\boldsymbol{\beta}}_{\text{Ridge}} = (\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$.
- For any $\lambda > 0$, the regularized matrix is positive definite, so the solution exists and is unique even when $p > n$ or features are collinear.
- Spectrally, Ridge shrinks each principal direction by $s_j = d_j/(d_j + \lambda)$, regularizing low-variance directions most aggressively.
- Equivalent primal, dual, SVD, and augmented-data forms trade off computational cost; in practice, solve the linear system rather than inverting a matrix.
- The trace of the Ridge hat matrix gives the effective degrees of freedom $\text{df}(\lambda) < p$, which decreases as $\lambda$ grows.
What's next:
With the mathematical solution in hand, we'll develop intuition for what Ridge regression does to our coefficients. The next page explores the shrinkage interpretation—how Ridge pulls coefficients toward zero, the geometry of this shrinkage, and why this leads to better predictions.
You now understand the closed-form Ridge solution from multiple perspectives: derivation, existence/uniqueness, spectral properties, computational implementation, and connections to general linear algebra concepts. This foundation prepares you for understanding the practical effects of regularization.