One of Ridge regression's most remarkable properties is that it admits a closed-form solution—an explicit formula that directly computes the optimal coefficients without iterative optimization. This is rare and valuable: most regularized models (including Lasso, neural networks, and many others) require iterative algorithms that only approximate the optimum.
The closed-form solution reveals deep connections between regularization, matrix algebra, and spectral theory. Understanding this derivation provides insights that extend far beyond Ridge regression itself.
By the end of this page, you will be able to derive the Ridge regression solution from first principles, understand why the solution always exists (unlike OLS), interpret the solution through eigenvalue decomposition, and implement Ridge regression efficiently using numerical best practices.
Let's derive the Ridge regression solution step by step, starting from the objective function and applying calculus.
The Ridge objective:
$$J(\boldsymbol{\beta}) = |\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2 + \lambda |\boldsymbol{\beta}|_2^2$$
Step 1: Expand the objective
Expanding the squared norms:
$$J(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + \lambda \boldsymbol{\beta}^T\boldsymbol{\beta}$$
Expanding the first term:
$$= \mathbf{y}^T\mathbf{y} - 2\mathbf{y}^T\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} + \lambda \boldsymbol{\beta}^T\boldsymbol{\beta}$$
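The expansion can be sanity-checked numerically. This sketch (synthetic random data, arbitrary $\boldsymbol{\beta}$ and $\lambda$) evaluates both forms of the objective and confirms they agree:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
beta = rng.normal(size=3)
lam = 0.5

# Objective as a residual norm plus penalty
J_norm = np.sum((y - X @ beta) ** 2) + lam * np.sum(beta ** 2)

# Expanded form: y^T y - 2 y^T X beta + beta^T X^T X beta + lam beta^T beta
J_expanded = (y @ y - 2 * (y @ X) @ beta
              + beta @ (X.T @ X) @ beta + lam * (beta @ beta))

print(np.isclose(J_norm, J_expanded))
```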
Step 2: Compute the gradient
To find the minimum, we take the gradient with respect to $\boldsymbol{\beta}$ and set it to zero:
$$\nabla_{\boldsymbol{\beta}} J = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} + 2\lambda \boldsymbol{\beta} = \mathbf{0}$$
Simplifying:
$$\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} + \lambda \boldsymbol{\beta} = \mathbf{X}^T\mathbf{y}$$
Factoring out $\boldsymbol{\beta}$:
$$(\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})\boldsymbol{\beta} = \mathbf{X}^T\mathbf{y}$$
This is the Ridge normal equation—analogous to the OLS normal equation but with $\lambda \mathbf{I}$ added to the matrix.
Step 3: Solve for β
Assuming $(\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})$ is invertible (which we'll prove shortly), we can solve directly:
$$\boxed{\hat{\boldsymbol{\beta}}_{\text{Ridge}} = (\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}}$$
This is the closed-form Ridge solution. Let's understand why this formula is so powerful.
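One way to convince yourself the boxed formula really is the minimizer: compute it on synthetic data and verify that the gradient from Step 2 vanishes there. A minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
y = rng.normal(size=30)
lam = 1.0
p = X.shape[1]

# Closed-form Ridge solution (solve the normal equation, no explicit inverse)
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Gradient of the objective: -2 X^T (y - X beta) + 2 lam beta; should be ~0
grad = -2 * X.T @ (y - X @ beta_hat) + 2 * lam * beta_hat
print(np.allclose(grad, 0, atol=1e-8))
```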
Comparing to OLS: $\hat{\boldsymbol{\beta}}_{\text{OLS}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$, the only difference is the addition of $\lambda \mathbf{I}$. This simple modification has profound effects on the solution's existence, stability, and statistical properties.
A critical advantage of Ridge regression over OLS is that the Ridge solution always exists whenever $\lambda > 0$, regardless of the relationship between $n$ (samples) and $p$ (features).
The key insight: positive definiteness
The matrix $\mathbf{X}^T\mathbf{X}$ is always positive semi-definite (PSD): for any vector $\mathbf{v}$,
$$\mathbf{v}^T(\mathbf{X}^T\mathbf{X})\mathbf{v} = |\mathbf{X}\mathbf{v}|_2^2 \geq 0$$
However, $\mathbf{X}^T\mathbf{X}$ may be singular (not full rank), especially when $p > n$ or when features are perfectly collinear.
Adding λI makes the matrix positive definite:
For any vector $\mathbf{v} eq \mathbf{0}$:
$$\mathbf{v}^T(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})\mathbf{v} = \mathbf{v}^T\mathbf{X}^T\mathbf{X}\mathbf{v} + \lambda \mathbf{v}^T\mathbf{v}$$
$$= |\mathbf{X}\mathbf{v}|_2^2 + \lambda |\mathbf{v}|_2^2$$
Since $\lambda > 0$ and $|\mathbf{v}|_2^2 > 0$ for $\mathbf{v} eq \mathbf{0}$:
$$\mathbf{v}^T(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})\mathbf{v} \geq \lambda |\mathbf{v}|_2^2 > 0$$
Therefore, $(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})$ is positive definite and hence invertible.
For any $\lambda > 0$, the matrix $(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})$ is positive definite and invertible. This means the Ridge solution exists, is unique, and can be computed directly—regardless of whether $n > p$, $n = p$, or $n < p$.
| Scenario | OLS Solution | Ridge Solution (λ > 0) |
|---|---|---|
| $n > p$, full rank | ✓ Exists, unique | ✓ Exists, unique |
| $n = p$, full rank | ✓ Exists, unique | ✓ Exists, unique |
| $n < p$ (underdetermined) | ✗ Infinitely many solutions | ✓ Exists, unique |
| Severe multicollinearity | ⚠ Exists but numerically unstable | ✓ Exists, unique, stable |
| Perfect collinearity (rank deficient) | ✗ Singular, infinitely many solutions | ✓ Exists, unique |
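The underdetermined row of the table is easy to demonstrate: with random data and $p > n$, $\mathbf{X}^T\mathbf{X}$ is rank-deficient, yet adding $\lambda\mathbf{I}$ restores full rank and the system solves cleanly. A sketch with made-up dimensions:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 10, 50          # underdetermined: more features than samples
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

G = X.T @ X
print(np.linalg.matrix_rank(G))   # rank is at most n = 10 < p: G is singular

lam = 0.1
A = G + lam * np.eye(p)
print(np.linalg.matrix_rank(A))   # full rank p = 50: Ridge system is solvable
beta = np.linalg.solve(A, X.T @ y)  # unique Ridge solution exists
```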
The deepest insights into Ridge regression come from analyzing the solution through eigenvalue decomposition. This spectral perspective reveals exactly how regularization modifies the OLS solution.
Eigendecomposition of X^TX:
Let $\mathbf{X}^T\mathbf{X}$ have eigenvalue decomposition:
$$\mathbf{X}^T\mathbf{X} = \mathbf{V}\mathbf{D}\mathbf{V}^T$$
where:
- $\mathbf{V}$ is an orthogonal $p \times p$ matrix whose columns $\mathbf{v}_j$ are eigenvectors,
- $\mathbf{D} = \text{diag}(d_1, \ldots, d_p)$ holds the corresponding eigenvalues $d_j \geq 0$.

The eigenvalues $d_j$ represent the "strength" of variation in each principal direction of the feature space.
Effect of adding λI:
Adding $\lambda\mathbf{I}$ simply shifts all eigenvalues:
$$\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I} = \mathbf{V}(\mathbf{D} + \lambda\mathbf{I})\mathbf{V}^T = \mathbf{V}\text{diag}(d_1 + \lambda, d_2 + \lambda, \ldots, d_p + \lambda)\mathbf{V}^T$$
The inverse becomes:
$$(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1} = \mathbf{V}\text{diag}\left(\frac{1}{d_1 + \lambda}, \frac{1}{d_2 + \lambda}, \ldots, \frac{1}{d_p + \lambda}\right)\mathbf{V}^T$$
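The "shift by $\lambda$" claim can be verified directly: the eigenvalues of $\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}$ are exactly $d_j + \lambda$. A quick check on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 5))
lam = 2.0

d = np.linalg.eigvalsh(X.T @ X)                          # eigenvalues of X^T X
d_shifted = np.linalg.eigvalsh(X.T @ X + lam * np.eye(5))

# Adding lambda*I shifts every eigenvalue by exactly lambda
print(np.allclose(d_shifted, d + lam))
```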
The shrinkage factor:
To understand how Ridge modifies OLS, express the Ridge solution as:
$$\hat{\boldsymbol{\beta}}_{\text{Ridge}} = \mathbf{V}\text{diag}\left(\frac{d_1}{d_1 + \lambda}, \frac{d_2}{d_2 + \lambda}, \ldots, \frac{d_p}{d_p + \lambda}\right)\mathbf{V}^T \hat{\boldsymbol{\beta}}_{\text{OLS}}$$
Define the shrinkage factor for direction $j$:
$$s_j = \frac{d_j}{d_j + \lambda}$$
This factor lies in $(0, 1)$ for $\lambda > 0$, indicating shrinkage.
Ridge regression shrinks each principal component of the OLS solution by a factor $s_j = d_j/(d_j + \lambda)$. Directions with small eigenvalues (low-variance directions in feature space) are shrunk more aggressively. Directions with large eigenvalues (high-variance directions) are shrunk less. This is precisely what we want: uncertain directions are regularized more.
| Eigenvalue $d_j$ | Shrinkage Factor $s_j$ | Interpretation |
|---|---|---|
| $d_j \gg \lambda$ | $s_j \approx 1$ | Almost no shrinkage; high-confidence direction |
| $d_j = \lambda$ | $s_j = 0.5$ | 50% shrinkage; moderate uncertainty |
| $d_j \ll \lambda$ | $s_j \approx 0$ | Severe shrinkage; low-confidence direction |
| $d_j = 0$ (singular) | $s_j = 0$ | Complete elimination; OLS would blow up |
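The per-direction shrinkage relation can be verified numerically: rotating $\hat{\boldsymbol{\beta}}_{\text{OLS}}$ into the eigenbasis, scaling each coordinate by $s_j$, and rotating back reproduces the Ridge solution exactly. A sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)
lam = 3.0
p = 4

G = X.T @ X
beta_ols = np.linalg.solve(G, X.T @ y)
beta_ridge = np.linalg.solve(G + lam * np.eye(p), X.T @ y)

# Shrink each principal direction of the OLS solution by s_j = d_j/(d_j+lam)
d, V = np.linalg.eigh(G)
s = d / (d + lam)
beta_shrunk = V @ (s * (V.T @ beta_ols))

print(np.allclose(beta_ridge, beta_shrunk))  # spectral form matches
print(np.all((s > 0) & (s < 1)))             # every factor lies in (0, 1)
```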
Effective degrees of freedom:
The sum of shrinkage factors defines the effective degrees of freedom of the Ridge model:
$$\text{df}(\lambda) = \sum_{j=1}^{p} \frac{d_j}{d_j + \lambda} = \text{tr}\left[\mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\right]$$
This quantity:
- equals $p$ when $\lambda = 0$ (for full-rank $\mathbf{X}$), recovering OLS,
- decreases monotonically as $\lambda$ increases,
- approaches $0$ as $\lambda \to \infty$.

The effective degrees of freedom is crucial for model selection criteria like AIC and BIC adapted for Ridge regression.
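Both expressions for $\text{df}(\lambda)$ — the eigenvalue sum and the trace of the hat matrix — can be computed and compared on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 6))
lam = 4.0
p = 6

# Spectral formula: sum of d_j / (d_j + lam)
d = np.linalg.eigvalsh(X.T @ X)
df_eig = np.sum(d / (d + lam))

# Trace formula: tr[X (X^T X + lam I)^{-1} X^T]
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
df_trace = np.trace(H)

print(np.isclose(df_eig, df_trace))  # the two formulas agree
print(df_trace < p)                  # strictly fewer than p effective params
```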
The Ridge solution can be expressed in several equivalent forms, each offering computational or conceptual advantages in different contexts.
Form 1: Primal form (standard)
$$\hat{\boldsymbol{\beta}}_{\text{Ridge}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}_p)^{-1}\mathbf{X}^T\mathbf{y}$$
This is a $p \times p$ matrix inversion. Preferred when $p < n$.
Form 2: Dual form (Woodbury identity)
Using the Woodbury matrix identity, we can rewrite:
$$\hat{\boldsymbol{\beta}}_{\text{Ridge}} = \mathbf{X}^T(\mathbf{X}\mathbf{X}^T + \lambda\mathbf{I}_n)^{-1}\mathbf{y}$$
This is an $n \times n$ matrix inversion. Preferred when $n < p$.
The dual form involves $(\mathbf{X}\mathbf{X}^T + \lambda\mathbf{I}_n)^{-1}$—the regularized Gram matrix. This is the foundation of kernel Ridge regression: replace $\mathbf{X}\mathbf{X}^T$ with a kernel matrix $\mathbf{K}$ to handle nonlinear relationships implicitly.
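The Woodbury identity guarantees the primal and dual forms coincide; this sketch checks that on a synthetic high-dimensional problem where the dual's $n \times n$ solve is the cheaper one:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 15, 40            # high-dimensional setting: n < p
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
lam = 1.5

# Primal: solve a p x p system
beta_primal = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Dual: solve an n x n system, then map back through X^T
beta_dual = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)

print(np.allclose(beta_primal, beta_dual))
```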
Form 3: SVD form
Let $\mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$ be the (thin) singular value decomposition where:
- $\mathbf{U}$ has orthonormal columns $\mathbf{u}_j$ (left singular vectors),
- $\mathbf{\Sigma} = \text{diag}(\sigma_1, \ldots, \sigma_r)$ holds the nonzero singular values,
- $\mathbf{V}$ has orthonormal columns $\mathbf{v}_j$ (right singular vectors).

The Ridge solution becomes:
$$\hat{\boldsymbol{\beta}}_{\text{Ridge}} = \sum_{j=1}^{r} \frac{\sigma_j}{\sigma_j^2 + \lambda} (\mathbf{u}_j^T\mathbf{y})\mathbf{v}_j$$
where $d_j = \sigma_j^2$ are the eigenvalues of $\mathbf{X}^T\mathbf{X}$.
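The SVD form can be checked against the primal form on synthetic data; the sum over singular triplets is vectorized as one matrix product:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(30, 5))
y = rng.normal(size=30)
lam = 2.5

beta_primal = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

# SVD form: sum_j sigma_j/(sigma_j^2 + lam) * (u_j^T y) v_j
U, s, Vt = np.linalg.svd(X, full_matrices=False)
beta_svd = Vt.T @ ((s / (s**2 + lam)) * (U.T @ y))

print(np.allclose(beta_primal, beta_svd))
```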
Form 4: Augmented data form
Ridge regression can be viewed as OLS on an augmented dataset:
$$\tilde{\mathbf{X}} = \begin{pmatrix} \mathbf{X} \\ \sqrt{\lambda}\mathbf{I}_p \end{pmatrix}, \quad \tilde{\mathbf{y}} = \begin{pmatrix} \mathbf{y} \\ \mathbf{0}_p \end{pmatrix}$$
Then:
$$\hat{\boldsymbol{\beta}}_{\text{Ridge}} = (\tilde{\mathbf{X}}^T\tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^T\tilde{\mathbf{y}}$$
This is useful for deriving properties using OLS theory.
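The augmented-data trick is also easy to test: stack $\sqrt{\lambda}\,\mathbf{I}_p$ under $\mathbf{X}$ and zeros under $\mathbf{y}$, run ordinary least squares, and compare with the primal Ridge solution:

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(25, 4))
y = rng.normal(size=25)
lam = 0.8
p = 4

beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# OLS on the augmented dataset (X with sqrt(lam)*I below, y padded with zeros)
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
y_aug = np.concatenate([y, np.zeros(p)])
beta_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

print(np.allclose(beta_ridge, beta_aug))
```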
| Form | Matrix Size | Cost | Best When |
|---|---|---|---|
| Primal | $p \times p$ | $O(np^2 + p^3)$ | $p < n$ |
| Dual | $n \times n$ | $O(n^2p + n^3)$ | $n < p$ |
| SVD | $r \times r$ | $O(\min(n,p) \cdot np)$ | Multiple $\lambda$ values |
| Cholesky | $p \times p$ | $O(np^2 + p^3/3)$ | Numerical stability |
While the closed-form solution is mathematically elegant, implementing it numerically requires care. Directly computing matrix inverses is rarely the best approach.
Never explicitly invert matrices:
Computing $(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}$ explicitly is:
- slower than necessary (we only need $\mathbf{A}^{-1}\mathbf{b}$, never $\mathbf{A}^{-1}$ itself),
- less numerically stable (rounding errors are amplified),
- wasteful of memory for large $p$.

Instead, solve the linear system directly.
```python
import numpy as np
from scipy import linalg

def ridge_regression_naive(X, y, lambda_param):
    """
    Ridge regression via explicit matrix inversion.
    NOT RECOMMENDED - shown for educational purposes only.
    """
    p = X.shape[1]
    A = X.T @ X + lambda_param * np.eye(p)
    b = X.T @ y
    # BAD: Explicit inverse
    beta = np.linalg.inv(A) @ b
    return beta

def ridge_regression_solve(X, y, lambda_param):
    """
    Ridge regression via direct linear solve.
    RECOMMENDED for a single lambda value.

    Solves: (X^T X + λI) β = X^T y
    """
    p = X.shape[1]
    A = X.T @ X + lambda_param * np.eye(p)
    b = X.T @ y
    # GOOD: Solve linear system directly
    beta = np.linalg.solve(A, b)
    return beta

def ridge_regression_cholesky(X, y, lambda_param):
    """
    Ridge regression via Cholesky decomposition.
    BEST for numerical stability when the matrix is well-conditioned.

    Since A = X^T X + λI is positive definite, we can use Cholesky.
    """
    p = X.shape[1]
    A = X.T @ X + lambda_param * np.eye(p)
    b = X.T @ y
    # Cholesky: A = L L^T where L is lower triangular
    L = linalg.cholesky(A, lower=True)
    # Solve L z = b, then L^T β = z
    z = linalg.solve_triangular(L, b, lower=True)
    beta = linalg.solve_triangular(L.T, z, lower=False)
    return beta

def ridge_regression_svd(X, y, lambda_param):
    """
    Ridge regression via SVD.
    BEST for trying multiple lambda values or handling
    ill-conditioned problems.

    Decomposes X = U Σ V^T, then:
    β = V diag(σ_j / (σ_j^2 + λ)) U^T y
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)  # s: singular values
    # Shrinkage factors: s / (s^2 + lambda)
    d = s / (s**2 + lambda_param)
    # Solution: V @ diag(d) @ U^T @ y
    beta = Vt.T @ (d * (U.T @ y))
    return beta

def ridge_regression_dual(X, y, lambda_param):
    """
    Ridge regression via dual form.
    PREFERRED when n << p (high-dimensional setting).

    Solves: β = X^T (X X^T + λI_n)^{-1} y
    """
    n = X.shape[0]
    K = X @ X.T + lambda_param * np.eye(n)
    alpha = np.linalg.solve(K, y)
    beta = X.T @ alpha
    return beta
```

Use `np.linalg.solve()` or Cholesky instead of an explicit matrix inverse. This is faster and more stable.

Ridge regression has an illuminating connection to the Moore-Penrose pseudoinverse, the generalization of the matrix inverse to non-square and singular matrices.
The pseudoinverse solution:
For an underdetermined or rank-deficient system $\mathbf{X}\boldsymbol{\beta} = \mathbf{y}$, the minimum-norm least squares solution is:
$$\hat{\boldsymbol{\beta}}_{\text{pseudo}} = \mathbf{X}^+ \mathbf{y}$$
where $\mathbf{X}^+$ is the Moore-Penrose pseudoinverse.
Using the SVD $\mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$:
$$\mathbf{X}^+ = \mathbf{V}\mathbf{\Sigma}^+\mathbf{U}^T$$
where $\mathbf{\Sigma}^+ = \text{diag}(1/\sigma_1, \ldots, 1/\sigma_r, 0, \ldots, 0)$.
Ridge as regularized pseudoinverse:
The Ridge solution can be written as:
$$\hat{\boldsymbol{\beta}}_{\text{Ridge}} = \mathbf{V}\mathbf{\Sigma}_\lambda^+\mathbf{U}^T\mathbf{y}$$
where $\mathbf{\Sigma}_\lambda^+ = \text{diag}\left(\frac{\sigma_1}{\sigma_1^2 + \lambda}, \ldots, \frac{\sigma_r}{\sigma_r^2 + \lambda}\right)$.
The limit as λ → 0⁺:
$$\lim_{\lambda \to 0^+} \hat{\boldsymbol{\beta}}_{\text{Ridge}} = \mathbf{X}^+ \mathbf{y}$$
Thus, Ridge regression interpolates between:
- the minimum-norm least squares (pseudoinverse) solution as $\lambda \to 0^+$, and
- the zero vector as $\lambda \to \infty$.

Ridge provides a continuous family of solutions parameterized by $\lambda$.
The pseudoinverse uses exact ratios $1/\sigma_j$, which explode when $\sigma_j$ is small. Ridge replaces these with $\sigma_j/(\sigma_j^2 + \lambda)$, which remains bounded even as $\sigma_j \to 0$. This "smoothing" of the inversion is precisely what makes Ridge stable.
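The limit can be illustrated numerically on a rank-deficient synthetic problem: with a very small $\lambda$, the Ridge solution should be close to `np.linalg.pinv(X) @ y`:

```python
import numpy as np

rng = np.random.default_rng(9)
n, p = 8, 20                       # rank-deficient: X^T X is singular
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

beta_pinv = np.linalg.pinv(X) @ y  # minimum-norm least squares solution

lam = 1e-8                         # nearly-zero regularization
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# As lambda -> 0+, Ridge approaches the pseudoinverse solution
print(np.allclose(beta_ridge, beta_pinv, atol=1e-5))
```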
The hat matrix (also called the smoother matrix) maps observed responses to fitted values. It plays a central role in understanding the geometry and diagnostics of regression.
OLS hat matrix:
$$\mathbf{H}_{\text{OLS}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$$
Fitted values: $\hat{\mathbf{y}} = \mathbf{H}_{\text{OLS}} \mathbf{y}$
Properties:
- Symmetric: $\mathbf{H}_{\text{OLS}}^T = \mathbf{H}_{\text{OLS}}$
- Idempotent: $\mathbf{H}_{\text{OLS}}^2 = \mathbf{H}_{\text{OLS}}$ (an orthogonal projection onto the column space of $\mathbf{X}$)
- $\text{tr}(\mathbf{H}_{\text{OLS}}) = p$, the number of parameters
Ridge hat matrix:
$$\mathbf{H}_{\text{Ridge}} = \mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T$$
Fitted values: $\hat{\mathbf{y}}_{\text{Ridge}} = \mathbf{H}_{\text{Ridge}} \mathbf{y}$
Key differences from OLS:
Not idempotent: $\mathbf{H}_{\text{Ridge}}^2 \neq \mathbf{H}_{\text{Ridge}}$ in general.
Trace gives effective degrees of freedom: $$\text{df}(\lambda) = \text{tr}(\mathbf{H}_{\text{Ridge}}) = \sum_{j=1}^{p} \frac{d_j}{d_j + \lambda}$$ This is less than $p$ and decreases as $\lambda$ increases.
Shrinks toward zero: Unlike OLS which projects onto the column space of $\mathbf{X}$, Ridge also shrinks the projection.
Eigenvalue expression:
Using the eigendecomposition $\mathbf{X}^T\mathbf{X} = \mathbf{V}\mathbf{D}\mathbf{V}^T$:
$$\mathbf{H}_{\text{Ridge}} = \mathbf{X}\mathbf{V}\text{diag}\left(\frac{1}{d_j + \lambda}\right)\mathbf{V}^T\mathbf{X}^T$$
$$= \mathbf{U}\text{diag}\left(\frac{d_j}{d_j + \lambda}\right)\mathbf{U}^T$$
where $\mathbf{U}$ comes from the SVD of $\mathbf{X}$.
In OLS, the degrees of freedom equals the number of parameters $p$. In Ridge, the regularization effectively "uses up" fewer degrees of freedom because it constrains the solution. The quantity $\text{df}(\lambda) = \text{tr}(\mathbf{H}_{\text{Ridge}})$ captures this: as $\lambda$ increases, we use fewer effective parameters, reducing model complexity.
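The contrast between the two hat matrices is easy to verify on synthetic data: the OLS hat matrix is idempotent, the Ridge one is not, and the nonzero eigenvalues of $\mathbf{H}_{\text{Ridge}}$ are exactly the shrinkage factors:

```python
import numpy as np

rng = np.random.default_rng(10)
X = rng.normal(size=(30, 5))
lam = 2.0
p = 5

H_ols = X @ np.linalg.solve(X.T @ X, X.T)
H_ridge = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)

# OLS hat matrix is a projection (idempotent); the Ridge one is not
print(np.allclose(H_ols @ H_ols, H_ols))
print(np.allclose(H_ridge @ H_ridge, H_ridge))

# Nonzero eigenvalues of H_ridge are the shrinkage factors d_j/(d_j + lam)
d = np.linalg.eigvalsh(X.T @ X)
top_evals = np.sort(np.linalg.eigvalsh(H_ridge))[-p:]
print(np.allclose(top_evals, np.sort(d / (d + lam))))
```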
We have derived and deeply analyzed the closed-form Ridge regression solution. Let's consolidate the key insights:
- The Ridge objective is minimized exactly by $\hat{\boldsymbol{\beta}}_{\text{Ridge}} = (\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$.
- For any $\lambda > 0$, the regularized matrix is positive definite, so the solution exists and is unique even when $p > n$ or features are collinear.
- Spectrally, Ridge shrinks each principal direction by $s_j = d_j/(d_j + \lambda)$, regularizing low-variance directions most aggressively.
- Equivalent primal, dual, SVD, and augmented-data forms trade off computational cost; in practice, solve the linear system rather than inverting a matrix.
- The trace of the Ridge hat matrix gives the effective degrees of freedom $\text{df}(\lambda) < p$, which decreases as $\lambda$ grows.
What's next:
With the mathematical solution in hand, we'll develop intuition for what Ridge regression does to our coefficients. The next page explores the shrinkage interpretation—how Ridge pulls coefficients toward zero, the geometry of this shrinkage, and why this leads to better predictions.
You now understand the closed-form Ridge solution from multiple perspectives: derivation, existence/uniqueness, spectral properties, computational implementation, and connections to general linear algebra concepts. This foundation prepares you for understanding the practical effects of regularization.