We've derived the closed-form solution for Ridge regression and proved its mathematical properties. But to truly master Ridge regression, we must develop deep intuition for what it does to our coefficients and why this leads to better predictions.
The answer lies in understanding shrinkage—the systematic pulling of coefficient estimates toward zero. Shrinkage is the mechanism by which Ridge regression trades bias for reduced variance, achieving better overall prediction accuracy.
By the end of this page, you will understand shrinkage geometrically, visualize how coefficients evolve as λ changes (the shrinkage path), distinguish between shrinkage in different principal directions, and grasp why shrinking toward zero—despite introducing bias—often improves prediction.
To visualize shrinkage, consider the coefficient space in two dimensions (two features $\beta_1$ and $\beta_2$).
The OLS contours:
The OLS objective function defines elliptical level sets (contours of constant RSS) centered at the OLS solution $\hat{\boldsymbol{\beta}}_{\text{OLS}}$. The shape of these ellipses depends on the eigenstructure of $\mathbf{X}^T\mathbf{X}$: their axes align with its eigenvectors, and they are elongated along directions with small eigenvalues, where the data only weakly constrains the coefficients.
The L2 constraint:
The L2 constraint $\|\boldsymbol{\beta}\|_2^2 \leq t$ defines a disk (in 2D) or a ball (in higher dimensions) centered at the origin. The constraint radius $\sqrt{t}$ decreases as the regularization parameter $\lambda$ increases.
Finding the Ridge solution:
The Ridge solution lies at the point where the RSS ellipse is tangent to the L2 ball. Key observations:
The tangent point is never at an axis (except in degenerate cases): Unlike L1 regularization (Lasso), the smooth L2 ball never "catches" the solution exactly on an axis. Thus, Ridge shrinks all coefficients but never sets any exactly to zero.
The solution moves continuously: As $\lambda$ increases, the tangent point traces a smooth path from $\hat{\boldsymbol{\beta}}_{\text{OLS}}$ toward the origin.
Shrinkage is proportional to distance from origin: Coefficients farther from zero are pulled more strongly toward zero.
The smoothness of the L2 ball is why Ridge regression produces smooth shrinkage paths. There are no corners where the solution could "stick" as λ varies. This is fundamentally different from L1 (Lasso) where the diamond-shaped constraint has corners that can trap the solution on axes, producing exact zeros.
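To see this contrast concretely, here is a minimal sketch (assuming scikit-learn is available; the synthetic data and penalty strengths are illustrative choices, not part of the discussion above): Ridge shrinks every coefficient without zeroing any, while Lasso's cornered constraint typically produces exact zeros.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Illustrative synthetic data: 50 samples, 8 standardized features,
# only three of which carry true signal.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))
X = (X - X.mean(axis=0)) / X.std(axis=0)
beta_true = np.array([2.0, -1.5, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.5 * rng.standard_normal(50)
y = y - y.mean()

ridge = Ridge(alpha=10.0, fit_intercept=False).fit(X, y)
lasso = Lasso(alpha=0.5, fit_intercept=False).fit(X, y)

# Ridge: all coefficients shrunk, none exactly zero.
# Lasso: several coefficients typically set exactly to zero.
print("Ridge exact zeros:", int(np.sum(ridge.coef_ == 0.0)))
print("Lasso exact zeros:", int(np.sum(lasso.coef_ == 0.0)))
```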
Recall from the eigenvalue analysis that Ridge regression shrinks each principal direction by a specific factor. Let's develop this insight more fully.
Eigendecomposition perspective:
The matrix $\mathbf{X}^T\mathbf{X}$ (formed from the design matrix $\mathbf{X}$) has eigendecomposition $\mathbf{V}\mathbf{D}\mathbf{V}^T$, where the columns of $\mathbf{V}$ are orthonormal eigenvectors (the principal directions of the data) and $\mathbf{D} = \text{diag}(d_1, \ldots, d_p)$ holds the corresponding eigenvalues $d_1 \geq d_2 \geq \cdots \geq d_p \geq 0$.
Ridge scales the component of the OLS coefficients along principal direction $j$ by the shrinkage factor:

$$s_j = \frac{d_j}{d_j + \lambda}$$
Interpreting shrinkage factors:
The shrinkage factor $s_j$ depends on the ratio of eigenvalue to regularization:
$$s_j = \frac{d_j}{d_j + \lambda} = \frac{1}{1 + \lambda/d_j}$$
Large eigenvalue ($d_j \gg \lambda$): $s_j \approx 1$, so the component is left nearly untouched; the data strongly determines this direction.

Small eigenvalue ($d_j \ll \lambda$): $s_j \approx d_j/\lambda \approx 0$, so the component is shrunk almost entirely away; the data says little about this direction.
| Eigenvalue Regime | Shrinkage $s_j$ | Variance in Direction | Regularization Effect |
|---|---|---|---|
| $d_j = 100, \lambda = 1$ | 0.990 | High (well-determined) | Almost none |
| $d_j = 10, \lambda = 1$ | 0.909 | Moderate | Light shrinkage |
| $d_j = 1, \lambda = 1$ | 0.500 | Moderate-low | 50% reduction |
| $d_j = 0.1, \lambda = 1$ | 0.091 | Low (poorly determined) | Heavy shrinkage |
| $d_j = 0.01, \lambda = 1$ | 0.010 | Very low (nearly singular) | Near-complete shrinkage |
Ridge regression shrinks most aggressively in precisely the directions where OLS is most unreliable—directions with low data variance (small eigenvalues). It shrinks least in directions where the data provides strong information. This is exactly the right behavior: regularize where we're uncertain, trust data where we're confident.
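The table's shrinkage factors are easy to reproduce; a minimal sketch (the eigenvalues and $\lambda = 1$ are just the illustrative values used above):

```python
import numpy as np

eigenvalues = np.array([100.0, 10.0, 1.0, 0.1, 0.01])  # illustrative d_j values
lam = 1.0                                               # regularization strength

# Ridge shrinkage factor per principal direction: s_j = d_j / (d_j + lambda)
shrinkage = eigenvalues / (eigenvalues + lam)

for d, s in zip(eigenvalues, shrinkage):
    print(f"d_j = {d:6.2f}  ->  s_j = {s:.3f}")
```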
As we vary $\lambda$ from 0 to $\infty$, the Ridge solution traces a continuous path from the OLS solution to the origin. This shrinkage path (or regularization path) reveals how coefficients evolve with regularization strength.
Mathematical description:
$$\hat{\boldsymbol{\beta}}(\lambda) = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$$
This is a continuous function of $\lambda$, and the norm of the solution shrinks monotonically toward zero as $\lambda$ grows. The code below computes and plots the path:
```python
import numpy as np
import matplotlib.pyplot as plt


def compute_ridge_path(X, y, lambdas):
    """
    Compute Ridge coefficients for a sequence of lambda values.

    Parameters:
    -----------
    X : ndarray of shape (n_samples, n_features)
        Feature matrix (should be standardized)
    y : ndarray of shape (n_samples,)
        Target vector (should be centered)
    lambdas : array-like
        Sequence of regularization parameters (descending order)

    Returns:
    --------
    coefs : ndarray of shape (n_lambdas, n_features)
        Ridge coefficients for each lambda
    """
    n, p = X.shape
    coefs = np.zeros((len(lambdas), p))

    # Use SVD for efficient computation across many lambdas
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    UTy = U.T @ y

    for i, lam in enumerate(lambdas):
        # Shrinkage factors: s / (s^2 + lambda)
        d = s / (s**2 + lam)
        # Ridge solution
        coefs[i] = Vt.T @ (d * UTy)

    return coefs


def plot_ridge_path(lambdas, coefs, feature_names=None):
    """
    Visualize the Ridge regularization path.
    """
    plt.figure(figsize=(10, 6))

    for j in range(coefs.shape[1]):
        label = feature_names[j] if feature_names else f"β_{j+1}"
        plt.plot(np.log10(lambdas), coefs[:, j], label=label, linewidth=2)

    plt.xlabel("log₁₀(λ)", fontsize=12)
    plt.ylabel("Coefficient Value", fontsize=12)
    plt.title("Ridge Regression Shrinkage Path", fontsize=14)
    plt.axhline(y=0, color='k', linestyle='--', alpha=0.3)
    plt.legend(loc='best')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    return plt.gcf()


# Example: Generate synthetic data and visualize path
np.random.seed(42)
n, p = 100, 5
X = np.random.randn(n, p)
# Create correlated features
X[:, 1] = X[:, 0] + 0.1 * np.random.randn(n)
X[:, 2] = X[:, 0] - X[:, 1] + 0.1 * np.random.randn(n)

# Standardize
X = (X - X.mean(axis=0)) / X.std(axis=0)

# True coefficients (sparse for illustration)
beta_true = np.array([3.0, -2.0, 0.0, 1.5, 0.0])
y = X @ beta_true + 0.5 * np.random.randn(n)
y = y - y.mean()

# Compute path over range of lambdas
lambdas = np.logspace(4, -4, 200)  # log-spaced from 10^4 to 10^-4
coefs = compute_ridge_path(X, y, lambdas)

# Plot the shrinkage path
# plot_ridge_path(lambdas, coefs,
#                 feature_names=["Feature 1", "Feature 2", "Feature 3",
#                                "Feature 4", "Feature 5"])
```

Properties of the Ridge shrinkage path:
Smooth and continuous: No jumps or discontinuities as $\lambda$ varies. This is a direct consequence of the smooth L2 penalty.
Monotonic shrinkage: The L2 norm of the coefficient vector strictly decreases as $\lambda$ increases: $\|\hat{\boldsymbol{\beta}}(\lambda_1)\|_2 > \|\hat{\boldsymbol{\beta}}(\lambda_2)\|_2$ for $\lambda_1 < \lambda_2$ (see the numerical check after this list).
Coefficients approach zero at different rates: Coefficients in low-eigenvalue directions shrink faster than those in high-eigenvalue directions.
Never exactly zero: Unlike Lasso, Ridge coefficients only approach zero as $\lambda \to \infty$; for any finite $\lambda$ they are (theoretically) never exactly zero.
Signs can flip for correlated features: Along each principal direction the component keeps its sign and simply shrinks, but individual coefficients $\hat{\beta}_j$ can change sign along the path when features are correlated (the multicollinearity demonstration below shows exactly this).
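As a quick numerical check of the monotone-norm property, here is a self-contained sketch (the synthetic data mirrors the path example above; the specific seed and λ grid are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)            # standardize features
y = X @ np.array([3.0, -2.0, 0.0, 1.5, 0.0]) + 0.5 * rng.standard_normal(n)
y = y - y.mean()                                     # center the target

lambdas = np.logspace(-4, 4, 100)                    # increasing lambda
norms = []
for lam in lambdas:
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)  # closed-form Ridge
    norms.append(np.linalg.norm(beta))

# The L2 norm of the Ridge solution should decrease as lambda increases.
print(np.all(np.diff(norms) < 0))   # expected: True
```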
A natural question arises: why not just scale the OLS solution by some constant factor? Why is Ridge shrinkage better than simple multiplication?
Uniform scaling:
$$\hat{\boldsymbol{\beta}}_{\text{scaled}} = c \cdot \hat{\boldsymbol{\beta}}_{\text{OLS}}, \quad c \in (0, 1)$$
This applies the same shrinkage factor to all coefficients.
Ridge shrinkage:
$$\hat{\boldsymbol{\beta}}_{\text{Ridge}} = \mathbf{V} \cdot \text{diag}(s_1, \ldots, s_p) \cdot \mathbf{V}^T \cdot \hat{\boldsymbol{\beta}}_{\text{OLS}}$$
This applies different shrinkage factors $s_j = d_j/(d_j + \lambda)$ to different principal directions.
Why adaptive shrinkage is superior:
The key insight is that not all directions are equally informative:
High-variance directions (large $d_j$): The data provides strong signal; OLS estimates are reliable. Shrinking these heavily would lose valuable information.
Low-variance directions (small $d_j$): The data provides weak signal; OLS estimates are dominated by noise. These should be shrunk aggressively.
Ridge shrinkage adaptively adjusts based on this information content. Uniform scaling would over-shrink confident estimates while under-shrinking noisy ones—the worst of both worlds.
| Aspect | Uniform Scaling | Ridge Shrinkage |
|---|---|---|
| Formula | $c \cdot \hat{\boldsymbol{\beta}}_{\text{OLS}}$ | $(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$ |
| Shrinkage factors | Same for all directions | Varies by eigenvalue |
| High-variance directions | Potentially over-shrunk | Minimally shrunk |
| Low-variance directions | Potentially under-shrunk | Heavily shrunk |
| Optimality | Suboptimal (misallocates shrinkage) | Optimal in a precise sense (e.g., the posterior mean under a Gaussian prior on $\boldsymbol{\beta}$) |
| Direction of shrinkage | Toward the origin along the line through $\hat{\boldsymbol{\beta}}_{\text{OLS}}$ | Toward the origin along a path shaped by the eigenstructure |
This adaptive shrinkage is related to James-Stein estimation, which showed that in dimension three or higher, a suitably chosen shrinkage of the sample mean toward the origin achieves uniformly lower mean squared error than the sample mean itself. Ridge regression can be viewed as applying shrinkage in a principled, data-adaptive way.
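The following sketch contrasts the two strategies numerically (the ill-conditioned design, noise level, and λ values are illustrative; Ridge typically attains lower error than a uniformly scaled OLS solution of the same norm, though exact numbers depend on the random draw):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 80, 10
A = rng.standard_normal((p, p))
cov = A @ A.T + 0.01 * np.eye(p)                 # correlated features -> ill-conditioned X'X
X = rng.multivariate_normal(np.zeros(p), cov, size=n)
X_test = rng.multivariate_normal(np.zeros(p), cov, size=1000)
beta_true = rng.standard_normal(p)
y = X @ beta_true + 2.0 * rng.standard_normal(n)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

def prediction_mse(beta):
    """Mean squared error of the predicted signal on held-out inputs."""
    return np.mean((X_test @ beta_true - X_test @ beta) ** 2)

print(f"OLS            test MSE: {prediction_mse(beta_ols):.3f}")
for lam in [1.0, 10.0, 100.0]:
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    # Uniform scaling of OLS, matched to the same L2 norm as the Ridge solution
    c = np.linalg.norm(beta_ridge) / np.linalg.norm(beta_ols)
    beta_scaled = c * beta_ols
    print(f"lam={lam:6.1f}  Ridge MSE: {prediction_mse(beta_ridge):.3f}   "
          f"uniformly scaled OLS MSE: {prediction_mse(beta_scaled):.3f}")
```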
Ridge regression handles correlated (multicollinear) features particularly well. Understanding this behavior is crucial for real-world applications where features are rarely independent.
The multicollinearity problem in OLS:
When features are highly correlated, $\mathbf{X}^T\mathbf{X}$ is nearly singular: the OLS coefficients have very large variance, individual estimates can be wildly inflated with opposite signs, and small perturbations of the data can change them drastically, even though the fitted values remain reasonable.
Ridge solution for correlated features:
With two perfectly correlated features ($X_2 = X_1$), the OLS solution is non-unique—any combination $\beta_1 + \beta_2 = c$ works. Ridge resolves this by shrinking both coefficients toward zero, preferring the solution that minimizes $\beta_1^2 + \beta_2^2$.
The "grouping effect":
When features are highly correlated, Ridge regression tends to assign them similar coefficients, spreading the shared effect across the group rather than loading it onto a single feature.
This is often sensible: if $X_1$ and $X_2$ are measuring the same underlying phenomenon, it's more stable to spread the effect across both.
Mathematical insight:
For perfectly correlated features, the direction of correlation (e.g., $X_1 = X_2$) has a very large eigenvalue, while the perpendicular direction (e.g., $X_1 = -X_2$) has eigenvalue zero. Ridge barely shrinks the well-determined correlated direction but eliminates the undetermined perpendicular component entirely (its shrinkage factor is $0/(0 + \lambda) = 0$), which forces the two coefficients toward equality.
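A minimal sketch of this eigenstructure (the correlation level and λ are illustrative); the fuller demonstration that follows then compares the resulting OLS and Ridge coefficients directly:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100
x1 = rng.standard_normal(n)
x2 = x1 + 0.1 * rng.standard_normal(n)        # nearly perfectly correlated feature
X = np.column_stack([x1, x2])

eigvals, eigvecs = np.linalg.eigh(X.T @ X)    # eigenvalues in ascending order
lam = 10.0
print("eigenvalues of X'X:       ", np.round(eigvals, 3))
print("shrinkage factors d/(d+λ):", np.round(eigvals / (eigvals + lam), 3))
# Roughly: the tiny eigenvalue (the x1 ≈ -x2 direction) gets a factor near 0,
# while the large eigenvalue (the x1 ≈ x2 direction) keeps a factor near 1.
```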
```python
import numpy as np
from sklearn.linear_model import Ridge, LinearRegression


def demonstrate_multicollinearity_handling():
    """
    Demonstrates how Ridge regression handles correlated features
    compared to OLS.
    """
    np.random.seed(42)
    n = 100

    # Create base feature
    x1 = np.random.randn(n)
    # Create correlated feature (r ≈ 0.99)
    x2 = x1 + 0.1 * np.random.randn(n)

    # True relationship: y = 3*z where z ≈ x1 ≈ x2
    # The "true" coefficients summing x1 and x2 should total ~3
    y = 3 * x1 + np.random.randn(n) * 0.5

    X = np.column_stack([x1, x2])

    # OLS solution
    ols = LinearRegression(fit_intercept=False)
    ols.fit(X, y)
    print("OLS Coefficients:")
    print(f"  β₁ = {ols.coef_[0]:.3f}")
    print(f"  β₂ = {ols.coef_[1]:.3f}")
    print(f"  Sum = {ols.coef_.sum():.3f}")
    print(f"  ||β||₂ = {np.linalg.norm(ols.coef_):.3f}")

    # Ridge solution for various λ
    print("\nRidge Coefficients:")
    for alpha in [0.01, 0.1, 1.0, 10.0]:
        ridge = Ridge(alpha=alpha, fit_intercept=False)
        ridge.fit(X, y)
        print(f"  λ={alpha:5.2f}: β₁={ridge.coef_[0]:6.3f}, "
              f"β₂={ridge.coef_[1]:6.3f}, "
              f"Sum={ridge.coef_.sum():5.3f}, "
              f"||β||₂={np.linalg.norm(ridge.coef_):.3f}")


# Run demonstration
# demonstrate_multicollinearity_handling()
#
# Typical output:
# OLS Coefficients:
#   β₁ = 8.234    <- wildly inflated
#   β₂ = -5.198   <- wildly inflated, wrong sign
#   Sum = 3.036   <- sum is reasonable!
#   ||β||₂ = 9.738
#
# Ridge Coefficients:
#   λ= 0.01: β₁= 4.532, β₂=-1.512, Sum=3.020, ||β||₂=4.777
#   λ= 0.10: β₁= 2.847, β₂= 0.174, Sum=3.022, ||β||₂=2.853
#   λ= 1.00: β₁= 1.628, β₂= 1.283, Sum=2.911, ||β||₂=2.072
#   λ=10.00: β₁= 0.746, β₂= 0.689, Sum=1.435, ||β||₂=1.016
```

When you see OLS coefficients with opposite signs and large magnitudes for correlated features (e.g., +1000 and -998), this is a red flag for multicollinearity. Ridge regression will "calm down" these estimates, producing more interpretable and stable coefficients.
We can understand Ridge shrinkage through the lens of projection geometry—a perspective that unifies several concepts in linear models.
OLS as orthogonal projection:
The OLS fitted values are the orthogonal projection of $\mathbf{y}$ onto the column space of $\mathbf{X}$:
$$\hat{\mathbf{y}}_{\text{OLS}} = \mathbf{H}_{\text{OLS}} \mathbf{y} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \mathbf{y}$$
This projection has norm $\|\hat{\mathbf{y}}_{\text{OLS}}\|_2 \leq \|\mathbf{y}\|_2$ (projection reduces norm).
Ridge as shrunken projection:
$$\hat{\mathbf{y}}_{\text{Ridge}} = \mathbf{H}_{\text{Ridge}} \mathbf{y} = \mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T \mathbf{y}$$
The Ridge smoother matrix $\mathbf{H}_{\text{Ridge}}$ shrinks fitted values toward zero, beyond what OLS projection does:
$$\|\hat{\mathbf{y}}_{\text{Ridge}}\|_2 \leq \|\hat{\mathbf{y}}_{\text{OLS}}\|_2 \leq \|\mathbf{y}\|_2$$
SVD interpretation:
Using the SVD $\mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$:
$$\hat{\mathbf{y}}_{\text{Ridge}} = \sum_{j=1}^{r} \frac{\sigma_j^2}{\sigma_j^2 + \lambda} (\mathbf{u}_j^T\mathbf{y}) \mathbf{u}_j$$
Compare to OLS:
$$\hat{\mathbf{y}}_{\text{OLS}} = \sum_{j=1}^{r} (\mathbf{u}_j^T\mathbf{y}) \mathbf{u}_j$$
Ridge multiplies each component by $\sigma_j^2/(\sigma_j^2 + \lambda) < 1$, shrinking the projection.
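A brief numerical check of this SVD formula and the norm ordering above (a sketch with arbitrary synthetic data):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, lam = 60, 4, 5.0
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

# Ridge fitted values computed directly from the closed form
y_ridge = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Ridge fitted values via the SVD: sum_j sigma_j^2 / (sigma_j^2 + lam) * (u_j'y) u_j
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
y_svd = U @ ((sigma**2 / (sigma**2 + lam)) * (U.T @ y))

# OLS fitted values: orthogonal projection of y onto the column space of X
y_ols = U @ (U.T @ y)

print(np.allclose(y_ridge, y_svd))                                             # expected: True
print(np.linalg.norm(y_ridge) <= np.linalg.norm(y_ols) <= np.linalg.norm(y))   # expected: True
```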
The geometric picture:
In the space of fitted values, Ridge shrinks each principal component of $\hat{\mathbf{y}}_{\text{OLS}}$ by its own factor $\sigma_j^2/(\sigma_j^2 + \lambda)$. When all singular values are equal (for example, an orthonormal design), this moves the fitted values straight along the segment from $\hat{\mathbf{y}}_{\text{OLS}}$ to the origin; in general, low-variance components shrink first, so the path bends toward the high-variance directions as $\lambda$ grows. Either way, the smooth L2 penalty keeps the path continuous and differentiable in $\lambda$, in contrast to Lasso, whose path is piecewise linear with kinks where coefficients hit zero.
At first glance, shrinkage seems counterproductive: we're deliberately biasing our estimates toward zero. How can introducing bias improve predictions?
The answer lies in the bias-variance tradeoff. Let's make this precise.
Mean Squared Error decomposition:
For any estimator $\hat{\boldsymbol{\beta}}$, its MSE can be decomposed:
$$\text{MSE}(\hat{\boldsymbol{\beta}}) = \text{Bias}(\hat{\boldsymbol{\beta}})^2 + \text{Var}(\hat{\boldsymbol{\beta}})$$
where the bias is the systematic deviation $\mathbb{E}[\hat{\boldsymbol{\beta}}] - \boldsymbol{\beta}$ and the variance measures how much $\hat{\boldsymbol{\beta}}$ fluctuates around its own mean across repeated samples.
OLS properties: OLS is unbiased, so its MSE is pure variance. When $\mathbf{X}^T\mathbf{X}$ is ill-conditioned, that variance can be enormous and dominates the MSE.
Ridge properties: Ridge is biased (shrinkage pulls estimates toward zero), but its variance is substantially smaller. The variance reduction can outweigh the bias increase, reducing total MSE.
| Metric | OLS | Ridge (optimal λ) |
|---|---|---|
| Bias | Zero | Small (pulled toward zero) |
| Variance | Large (dominates MSE, especially when ill-conditioned) | Substantially reduced |
| Total MSE | High (variance-dominated) | Lower |
Ridge regression trades a little bias for a large variance reduction. In high-dimensional or ill-conditioned settings, this trade is highly favorable: the small bias introduced is more than compensated by the substantial variance reduction, resulting in better overall prediction accuracy.
The variance reduction mechanism:
Recall that Ridge shrinks by factors $s_j = d_j/(d_j + \lambda)$. The variance of the $j$-th principal component of the OLS estimator is proportional to $1/d_j$. After Ridge shrinkage, the variance scales as:
$$\text{Var}_{\text{Ridge},j} = s_j^2 \cdot \text{Var}_{\text{OLS},j} = \left(\frac{d_j}{d_j + \lambda}\right)^2 \cdot \frac{\sigma^2}{d_j}$$
For small $d_j$, the shrinkage factor $s_j^2$ dramatically reduces what would otherwise be massive variance. This is why Ridge is especially effective when data is ill-conditioned.
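To make the tradeoff tangible, here is a small Monte-Carlo sketch (the design, noise level, true coefficients, and λ are arbitrary illustrative choices) that estimates bias², variance, and MSE of the OLS and Ridge coefficient estimators over repeated noise draws; with an ill-conditioned design, Ridge usually ends up with the lower total MSE:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, sigma, lam, n_reps = 50, 8, 2.0, 5.0, 2000

# Fixed, moderately ill-conditioned design and fixed true coefficients
A = rng.standard_normal((p, p))
X = rng.multivariate_normal(np.zeros(p), A @ A.T + 0.01 * np.eye(p), size=n)
beta_true = rng.standard_normal(p)

def summarize(estimates):
    """Decompose the MSE of a stack of coefficient estimates into bias^2 and variance."""
    mean_est = estimates.mean(axis=0)
    bias_sq = np.sum((mean_est - beta_true) ** 2)
    variance = np.mean(np.sum((estimates - mean_est) ** 2, axis=1))
    return bias_sq, variance, bias_sq + variance

ols_estimates, ridge_estimates = [], []
for _ in range(n_reps):
    y = X @ beta_true + sigma * rng.standard_normal(n)   # fresh noise each repetition
    ols_estimates.append(np.linalg.solve(X.T @ X, X.T @ y))
    ridge_estimates.append(np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y))

for name, est in [("OLS  ", np.array(ols_estimates)),
                  ("Ridge", np.array(ridge_estimates))]:
    b2, var, mse = summarize(est)
    print(f"{name}  bias² = {b2:8.4f}   variance = {var:8.4f}   MSE = {mse:8.4f}")
```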
We've developed deep intuition for how Ridge regression shrinks coefficients and why this shrinkage improves predictions: geometrically, the solution is the tangency point between the RSS ellipses and the L2 ball; algebraically, each principal direction is shrunk by the factor $d_j/(d_j + \lambda)$, most aggressively where the data is least informative; and statistically, this trades a small bias for a large variance reduction.
What's next:
With the shrinkage intuition in place, we're ready to formalize the bias-variance tradeoff for Ridge regression. The next page provides mathematical analysis of how Ridge balances these competing quantities and when the tradeoff is most favorable.
You now understand Ridge shrinkage from geometric, algebraic, and statistical perspectives. This intuition is essential for applying Ridge regression effectively and understanding when it will (or won't) help your models.