We've derived the closed-form solution for Ridge regression and proved its mathematical properties. But to truly master Ridge regression, we must develop deep intuition for what it does to our coefficients and why this leads to better predictions.
The answer lies in understanding shrinkage—the systematic pulling of coefficient estimates toward zero. Shrinkage is the mechanism by which Ridge regression trades bias for reduced variance, achieving better overall prediction accuracy.
By the end of this page, you will understand shrinkage geometrically, visualize how coefficients evolve as λ changes (the shrinkage path), distinguish between shrinkage in different principal directions, and grasp why shrinking toward zero—despite introducing bias—often improves prediction.
To visualize shrinkage, consider the coefficient space in two dimensions (two features $\beta_1$ and $\beta_2$).
The OLS contours:
The OLS objective function defines elliptical level sets (contours of constant RSS) centered at the OLS solution $\hat{\boldsymbol{\beta}}_{\text{OLS}}$. The shape of these ellipses depends on the eigenstructure of $\mathbf{X}^T\mathbf{X}$: their axes align with its eigenvectors, and they are elongated along directions with small eigenvalues, where the data only weakly constrains the coefficients.
The L2 constraint:
The L2 constraint $\|\boldsymbol{\beta}\|_2^2 \leq t$ defines a disk (in 2D) or a ball (in higher dimensions) centered at the origin. The constraint radius $\sqrt{t}$ decreases as the regularization parameter $\lambda$ increases.
Finding the Ridge solution:
The Ridge solution lies at the point where the RSS ellipse is tangent to the L2 ball. Key observations:
The tangent point is never at an axis (except in degenerate cases): Unlike L1 regularization (Lasso), the smooth L2 ball never "catches" the solution exactly on an axis. Thus, Ridge shrinks all coefficients but never sets any exactly to zero.
The solution moves continuously: As $\lambda$ increases, the tangent point traces a smooth path from $\hat{\boldsymbol{\beta}}_{\text{OLS}}$ toward the origin.
Shrinkage is proportional to distance from origin: Coefficients farther from zero are pulled more strongly toward zero.
The smoothness of the L2 ball is why Ridge regression produces smooth shrinkage paths. There are no corners where the solution could "stick" as λ varies. This is fundamentally different from L1 (Lasso) where the diamond-shaped constraint has corners that can trap the solution on axes, producing exact zeros.
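To see this contrast concretely, here is a minimal sketch (assuming scikit-learn is available; the synthetic data and penalty strengths are illustrative choices, not part of the discussion above): Ridge shrinks every coefficient without zeroing any, while Lasso's cornered constraint typically produces exact zeros.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Illustrative synthetic data: 50 samples, 8 standardized features,
# only three of which carry true signal.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))
X = (X - X.mean(axis=0)) / X.std(axis=0)
beta_true = np.array([2.0, -1.5, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.5 * rng.standard_normal(50)
y = y - y.mean()

ridge = Ridge(alpha=10.0, fit_intercept=False).fit(X, y)
lasso = Lasso(alpha=0.5, fit_intercept=False).fit(X, y)

# Ridge: all coefficients shrunk, none exactly zero.
# Lasso: several coefficients typically set exactly to zero.
print("Ridge exact zeros:", int(np.sum(ridge.coef_ == 0.0)))
print("Lasso exact zeros:", int(np.sum(lasso.coef_ == 0.0)))
```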
Recall from the eigenvalue analysis that Ridge regression shrinks each principal direction by a specific factor. Let's develop this insight more fully.
Eigendecomposition perspective:
The matrix $\mathbf{X}^T\mathbf{X}$ (formed from the design matrix $\mathbf{X}$) has eigendecomposition $\mathbf{V}\mathbf{D}\mathbf{V}^T$, where the columns of $\mathbf{V}$ are orthonormal eigenvectors (the principal directions of the data) and $\mathbf{D} = \text{diag}(d_1, \ldots, d_p)$ holds the corresponding eigenvalues $d_1 \geq d_2 \geq \cdots \geq d_p \geq 0$.
Ridge scales the component of the OLS coefficients along principal direction $j$ by the shrinkage factor:

$$s_j = \frac{d_j}{d_j + \lambda}$$
Interpreting shrinkage factors:
The shrinkage factor $s_j$ depends on the ratio of eigenvalue to regularization:
$$s_j = \frac{d_j}{d_j + \lambda} = \frac{1}{1 + \lambda/d_j}$$
Large eigenvalue ($d_j \gg \lambda$): $s_j \approx 1$, so the component is left nearly untouched; the data strongly determines this direction.

Small eigenvalue ($d_j \ll \lambda$): $s_j \approx d_j/\lambda \approx 0$, so the component is shrunk almost entirely away; the data says little about this direction.
| Eigenvalue Regime | Shrinkage $s_j$ | Variance in Direction | Regularization Effect |
|---|---|---|---|
| $d_j = 100, \lambda = 1$ | 0.990 | High (well-determined) | Almost none |
| $d_j = 10, \lambda = 1$ | 0.909 | Moderate | Light shrinkage |
| $d_j = 1, \lambda = 1$ | 0.500 | Moderate-low | 50% reduction |
| $d_j = 0.1, \lambda = 1$ | 0.091 | Low (poorly determined) | Heavy shrinkage |
| $d_j = 0.01, \lambda = 1$ | 0.010 | Very low (nearly singular) | Near-complete shrinkage |
Ridge regression shrinks most aggressively in precisely the directions where OLS is most unreliable—directions with low data variance (small eigenvalues). It shrinks least in directions where the data provides strong information. This is exactly the right behavior: regularize where we're uncertain, trust data where we're confident.
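The table's shrinkage factors are easy to reproduce; a minimal sketch (the eigenvalues and $\lambda = 1$ are just the illustrative values used above):

```python
import numpy as np

eigenvalues = np.array([100.0, 10.0, 1.0, 0.1, 0.01])  # illustrative d_j values
lam = 1.0                                               # regularization strength

# Ridge shrinkage factor per principal direction: s_j = d_j / (d_j + lambda)
shrinkage = eigenvalues / (eigenvalues + lam)

for d, s in zip(eigenvalues, shrinkage):
    print(f"d_j = {d:6.2f}  ->  s_j = {s:.3f}")
```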
As we vary $\lambda$ from 0 to $\infty$, the Ridge solution traces a continuous path from the OLS solution to the origin. This shrinkage path (or regularization path) reveals how coefficients evolve with regularization strength.
Mathematical description:
$$\hat{\boldsymbol{\beta}}(\lambda) = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$$
This is a continuous function of $\lambda$, and the norm of the solution shrinks monotonically toward zero as $\lambda$ grows. The code below computes and plots the path:
```python
import numpy as np
import matplotlib.pyplot as plt


def compute_ridge_path(X, y, lambdas):
    """
    Compute Ridge coefficients for a sequence of lambda values.

    Parameters:
    -----------
    X : ndarray of shape (n_samples, n_features)
        Feature matrix (should be standardized)
    y : ndarray of shape (n_samples,)
        Target vector (should be centered)
    lambdas : array-like
        Sequence of regularization parameters (descending order)

    Returns:
    --------
    coefs : ndarray of shape (n_lambdas, n_features)
        Ridge coefficients for each lambda
    """
    n, p = X.shape
    coefs = np.zeros((len(lambdas), p))

    # Use SVD for efficient computation across many lambdas
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    UTy = U.T @ y

    for i, lam in enumerate(lambdas):
        # Shrinkage factors: s / (s^2 + lambda)
        d = s / (s**2 + lam)
        # Ridge solution
        coefs[i] = Vt.T @ (d * UTy)

    return coefs


def plot_ridge_path(lambdas, coefs, feature_names=None):
    """
    Visualize the Ridge regularization path.
    """
    plt.figure(figsize=(10, 6))

    for j in range(coefs.shape[1]):
        label = feature_names[j] if feature_names else f"β_{j+1}"
        plt.plot(np.log10(lambdas), coefs[:, j], label=label, linewidth=2)

    plt.xlabel("log₁₀(λ)", fontsize=12)
    plt.ylabel("Coefficient Value", fontsize=12)
    plt.title("Ridge Regression Shrinkage Path", fontsize=14)
    plt.axhline(y=0, color='k', linestyle='--', alpha=0.3)
    plt.legend(loc='best')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    return plt.gcf()


# Example: Generate synthetic data and visualize path
np.random.seed(42)
n, p = 100, 5
X = np.random.randn(n, p)
# Create correlated features
X[:, 1] = X[:, 0] + 0.1 * np.random.randn(n)
X[:, 2] = X[:, 0] - X[:, 1] + 0.1 * np.random.randn(n)

# Standardize
X = (X - X.mean(axis=0)) / X.std(axis=0)

# True coefficients (sparse for illustration)
beta_true = np.array([3.0, -2.0, 0.0, 1.5, 0.0])
y = X @ beta_true + 0.5 * np.random.randn(n)
y = y - y.mean()

# Compute path over range of lambdas
lambdas = np.logspace(4, -4, 200)  # log-spaced from 10^4 to 10^-4
coefs = compute_ridge_path(X, y, lambdas)

# Plot the shrinkage path
# plot_ridge_path(lambdas, coefs,
#                 feature_names=["Feature 1", "Feature 2", "Feature 3",
#                                "Feature 4", "Feature 5"])
```

Properties of the Ridge shrinkage path:
Smooth and continuous: No jumps or discontinuities as $\lambda$ varies. This is a direct consequence of the smooth L2 penalty.
Monotonic shrinkage: The L2 norm of the coefficient vector strictly decreases as $\lambda$ increases: $\|\hat{\boldsymbol{\beta}}(\lambda_1)\|_2 > \|\hat{\boldsymbol{\beta}}(\lambda_2)\|_2$ for $\lambda_1 < \lambda_2$ (see the numerical check after this list).
Coefficients approach zero at different rates: Coefficients in low-eigenvalue directions shrink faster than those in high-eigenvalue directions.
Never exactly zero: Unlike Lasso, Ridge coefficients only approach zero as $\lambda \to \infty$; for any finite $\lambda$ they are (theoretically) never exactly zero.
Signs can flip for correlated features: Along each principal direction the component keeps its sign and simply shrinks, but individual coefficients $\hat{\beta}_j$ can change sign along the path when features are correlated (the multicollinearity demonstration below shows exactly this).
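As a quick numerical check of the monotone-norm property, here is a self-contained sketch (the synthetic data mirrors the path example above; the specific seed and λ grid are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)            # standardize features
y = X @ np.array([3.0, -2.0, 0.0, 1.5, 0.0]) + 0.5 * rng.standard_normal(n)
y = y - y.mean()                                     # center the target

lambdas = np.logspace(-4, 4, 100)                    # increasing lambda
norms = []
for lam in lambdas:
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)  # closed-form Ridge
    norms.append(np.linalg.norm(beta))

# The L2 norm of the Ridge solution should decrease as lambda increases.
print(np.all(np.diff(norms) < 0))   # expected: True
```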
A natural question arises: why not just scale the OLS solution by some constant factor? Why is Ridge shrinkage better than simple multiplication?
Uniform scaling:
$$\hat{\boldsymbol{\beta}}_{\text{scaled}} = c \cdot \hat{\boldsymbol{\beta}}_{\text{OLS}}, \quad c \in (0, 1)$$
This applies the same shrinkage factor to all coefficients.
Ridge shrinkage:
$$\hat{\boldsymbol{\beta}}_{\text{Ridge}} = \mathbf{V} \cdot \text{diag}(s_1, \ldots, s_p) \cdot \mathbf{V}^T \cdot \hat{\boldsymbol{\beta}}_{\text{OLS}}$$
This applies different shrinkage factors $s_j = d_j/(d_j + \lambda)$ to different principal directions.
Why adaptive shrinkage is superior:
The key insight is that not all directions are equally informative:
High-variance directions (large $d_j$): The data provides strong signal; OLS estimates are reliable. Shrinking these heavily would lose valuable information.
Low-variance directions (small $d_j$): The data provides weak signal; OLS estimates are dominated by noise. These should be shrunk aggressively.
Ridge shrinkage adaptively adjusts based on this information content. Uniform scaling would over-shrink confident estimates while under-shrinking noisy ones—the worst of both worlds.
| Aspect | Uniform Scaling | Ridge Shrinkage |
|---|---|---|
| Formula | $c \cdot \hat{\boldsymbol{\beta}}_{\text{OLS}}$ | $(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$ |
| Shrinkage factors | Same for all directions | Varies by eigenvalue |
| High-variance directions | Potentially over-shrunk | Minimally shrunk |
| Low-variance directions | Potentially under-shrunk | Heavily shrunk |
| Optimality | Suboptimal (misallocates shrinkage) | Optimal in a precise sense (e.g., the posterior mean under a Gaussian prior on $\boldsymbol{\beta}$) |
| Direction of shrinkage | Toward the origin along the line through $\hat{\boldsymbol{\beta}}_{\text{OLS}}$ | Toward the origin along a path shaped by the eigenstructure |
This adaptive shrinkage is related to James-Stein estimation, which showed that in dimension three or higher, a suitably chosen shrinkage of the sample mean toward the origin achieves uniformly lower mean squared error than the sample mean itself. Ridge regression can be viewed as applying shrinkage in a principled, data-adaptive way.
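The following sketch contrasts the two strategies numerically (the ill-conditioned design, noise level, and λ values are illustrative; Ridge typically attains lower error than a uniformly scaled OLS solution of the same norm, though exact numbers depend on the random draw):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 80, 10
A = rng.standard_normal((p, p))
cov = A @ A.T + 0.01 * np.eye(p)                 # correlated features -> ill-conditioned X'X
X = rng.multivariate_normal(np.zeros(p), cov, size=n)
X_test = rng.multivariate_normal(np.zeros(p), cov, size=1000)
beta_true = rng.standard_normal(p)
y = X @ beta_true + 2.0 * rng.standard_normal(n)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

def prediction_mse(beta):
    """Mean squared error of the predicted signal on held-out inputs."""
    return np.mean((X_test @ beta_true - X_test @ beta) ** 2)

print(f"OLS            test MSE: {prediction_mse(beta_ols):.3f}")
for lam in [1.0, 10.0, 100.0]:
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    # Uniform scaling of OLS, matched to the same L2 norm as the Ridge solution
    c = np.linalg.norm(beta_ridge) / np.linalg.norm(beta_ols)
    beta_scaled = c * beta_ols
    print(f"lam={lam:6.1f}  Ridge MSE: {prediction_mse(beta_ridge):.3f}   "
          f"uniformly scaled OLS MSE: {prediction_mse(beta_scaled):.3f}")
```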
Ridge regression handles correlated (multicollinear) features particularly well. Understanding this behavior is crucial for real-world applications where features are rarely independent.
The multicollinearity problem in OLS:
When features are highly correlated, $\mathbf{X}^T\mathbf{X}$ is nearly singular: the OLS coefficients have very large variance, individual estimates can be wildly inflated with opposite signs, and small perturbations of the data can change them drastically, even though the fitted values remain reasonable.
Ridge solution for correlated features:
With two perfectly correlated features ($X_2 = X_1$), the OLS solution is non-unique—any combination $\beta_1 + \beta_2 = c$ works. Ridge resolves this by shrinking both coefficients toward zero, preferring the solution that minimizes $\beta_1^2 + \beta_2^2$.
The "grouping effect":
When features are highly correlated, Ridge regression tends to assign them similar coefficients, spreading the shared effect across the group rather than loading it onto a single feature.
This is often sensible: if $X_1$ and $X_2$ are measuring the same underlying phenomenon, it's more stable to spread the effect across both.
Mathematical insight:
For perfectly correlated features, the direction of correlation (e.g., $X_1 = X_2$) has a very large eigenvalue, while the perpendicular direction (e.g., $X_1 = -X_2$) has eigenvalue zero. Ridge barely shrinks the well-determined correlated direction but eliminates the undetermined perpendicular component entirely (its shrinkage factor is $0/(0 + \lambda) = 0$), which forces the two coefficients toward equality.
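A minimal sketch of this eigenstructure (the correlation level and λ are illustrative); the fuller demonstration that follows then compares the resulting OLS and Ridge coefficients directly:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100
x1 = rng.standard_normal(n)
x2 = x1 + 0.1 * rng.standard_normal(n)        # nearly perfectly correlated feature
X = np.column_stack([x1, x2])

eigvals, eigvecs = np.linalg.eigh(X.T @ X)    # eigenvalues in ascending order
lam = 10.0
print("eigenvalues of X'X:       ", np.round(eigvals, 3))
print("shrinkage factors d/(d+λ):", np.round(eigvals / (eigvals + lam), 3))
# Roughly: the tiny eigenvalue (the x1 ≈ -x2 direction) gets a factor near 0,
# while the large eigenvalue (the x1 ≈ x2 direction) keeps a factor near 1.
```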
```python
import numpy as np
from sklearn.linear_model import Ridge, LinearRegression


def demonstrate_multicollinearity_handling():
    """
    Demonstrates how Ridge regression handles correlated features
    compared to OLS.
    """
    np.random.seed(42)
    n = 100

    # Create base feature
    x1 = np.random.randn(n)
    # Create correlated feature (r ≈ 0.99)
    x2 = x1 + 0.1 * np.random.randn(n)

    # True relationship: y = 3*z where z ≈ x1 ≈ x2
    # The "true" coefficients summing x1 and x2 should total ~3
    y = 3 * x1 + np.random.randn(n) * 0.5

    X = np.column_stack([x1, x2])

    # OLS solution
    ols = LinearRegression(fit_intercept=False)
    ols.fit(X, y)
    print("OLS Coefficients:")
    print(f"  β₁ = {ols.coef_[0]:.3f}")
    print(f"  β₂ = {ols.coef_[1]:.3f}")
    print(f"  Sum = {ols.coef_.sum():.3f}")
    print(f"  ||β||₂ = {np.linalg.norm(ols.coef_):.3f}")

    # Ridge solution for various λ
    print("\nRidge Coefficients:")
    for alpha in [0.01, 0.1, 1.0, 10.0]:
        ridge = Ridge(alpha=alpha, fit_intercept=False)
        ridge.fit(X, y)
        print(f"  λ={alpha:5.2f}: β₁={ridge.coef_[0]:6.3f}, "
              f"β₂={ridge.coef_[1]:6.3f}, "
              f"Sum={ridge.coef_.sum():5.3f}, "
              f"||β||₂={np.linalg.norm(ridge.coef_):.3f}")


# Run demonstration
# demonstrate_multicollinearity_handling()
#
# Typical output:
# OLS Coefficients:
#   β₁ = 8.234    <- wildly inflated
#   β₂ = -5.198   <- wildly inflated, wrong sign
#   Sum = 3.036   <- sum is reasonable!
#   ||β||₂ = 9.738
#
# Ridge Coefficients:
#   λ= 0.01: β₁= 4.532, β₂=-1.512, Sum=3.020, ||β||₂=4.777
#   λ= 0.10: β₁= 2.847, β₂= 0.174, Sum=3.022, ||β||₂=2.853
#   λ= 1.00: β₁= 1.628, β₂= 1.283, Sum=2.911, ||β||₂=2.072
#   λ=10.00: β₁= 0.746, β₂= 0.689, Sum=1.435, ||β||₂=1.016
```

When you see OLS coefficients with opposite signs and large magnitudes for correlated features (e.g., +1000 and -998), this is a red flag for multicollinearity. Ridge regression will "calm down" these estimates, producing more interpretable and stable coefficients.
We can understand Ridge shrinkage through the lens of projection geometry—a perspective that unifies several concepts in linear models.
OLS as orthogonal projection:
The OLS fitted values are the orthogonal projection of $\mathbf{y}$ onto the column space of $\mathbf{X}$:
$$\hat{\mathbf{y}}_{\text{OLS}} = \mathbf{H}_{\text{OLS}} \mathbf{y} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \mathbf{y}$$
This projection has norm $\|\hat{\mathbf{y}}_{\text{OLS}}\|_2 \leq \|\mathbf{y}\|_2$ (projection reduces norm).
Ridge as shrunken projection:
$$\hat{\mathbf{y}}_{\text{Ridge}} = \mathbf{H}_{\text{Ridge}} \mathbf{y} = \mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T \mathbf{y}$$
The Ridge smoother matrix $\mathbf{H}_{\text{Ridge}}$ shrinks fitted values toward zero, beyond what OLS projection does:
$$\|\hat{\mathbf{y}}_{\text{Ridge}}\|_2 \leq \|\hat{\mathbf{y}}_{\text{OLS}}\|_2 \leq \|\mathbf{y}\|_2$$
SVD interpretation:
Using the SVD $\mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$:
$$\hat{\mathbf{y}}_{\text{Ridge}} = \sum_{j=1}^{r} \frac{\sigma_j^2}{\sigma_j^2 + \lambda} (\mathbf{u}_j^T\mathbf{y}) \mathbf{u}_j$$
Compare to OLS:
$$\hat{\mathbf{y}}_{\text{OLS}} = \sum_{j=1}^{r} (\mathbf{u}_j^T\mathbf{y}) \mathbf{u}_j$$
Ridge multiplies each component by $\sigma_j^2/(\sigma_j^2 + \lambda) < 1$, shrinking the projection.
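A brief numerical check of this SVD formula and the norm ordering above (a sketch with arbitrary synthetic data):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, lam = 60, 4, 5.0
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

# Ridge fitted values computed directly from the closed form
y_ridge = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Ridge fitted values via the SVD: sum_j sigma_j^2 / (sigma_j^2 + lam) * (u_j'y) u_j
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
y_svd = U @ ((sigma**2 / (sigma**2 + lam)) * (U.T @ y))

# OLS fitted values: orthogonal projection of y onto the column space of X
y_ols = U @ (U.T @ y)

print(np.allclose(y_ridge, y_svd))                                             # expected: True
print(np.linalg.norm(y_ridge) <= np.linalg.norm(y_ols) <= np.linalg.norm(y))   # expected: True
```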
The geometric picture:
In the space of fitted values, Ridge shrinks each principal component of $\hat{\mathbf{y}}_{\text{OLS}}$ by its own factor $\sigma_j^2/(\sigma_j^2 + \lambda)$. When all singular values are equal (for example, an orthonormal design), this moves the fitted values straight along the segment from $\hat{\mathbf{y}}_{\text{OLS}}$ to the origin; in general, low-variance components shrink first, so the path bends toward the high-variance directions as $\lambda$ grows. Either way, the smooth L2 penalty keeps the path continuous and differentiable in $\lambda$, in contrast to Lasso, whose path is piecewise linear with kinks where coefficients hit zero.
At first glance, shrinkage seems counterproductive: we're deliberately biasing our estimates toward zero. How can introducing bias improve predictions?
The answer lies in the bias-variance tradeoff. Let's make this precise.
Mean Squared Error decomposition:
For any estimator $\hat{\boldsymbol{\beta}}$, its MSE can be decomposed:
$$\text{MSE}(\hat{\boldsymbol{\beta}}) = \text{Bias}(\hat{\boldsymbol{\beta}})^2 + \text{Var}(\hat{\boldsymbol{\beta}})$$
where the bias is the systematic deviation $\mathbb{E}[\hat{\boldsymbol{\beta}}] - \boldsymbol{\beta}$ and the variance measures how much $\hat{\boldsymbol{\beta}}$ fluctuates around its own mean across repeated samples.
OLS properties: OLS is unbiased, so its MSE is pure variance. When $\mathbf{X}^T\mathbf{X}$ is ill-conditioned, that variance can be enormous and dominates the MSE.
Ridge properties: Ridge is biased (shrinkage pulls estimates toward zero), but its variance is substantially smaller. The variance reduction can outweigh the bias increase, reducing total MSE.
| Metric | OLS | Ridge (optimal λ) |
|---|---|---|
| Bias | Zero | Small (pulled toward zero) |
| Variance | Large (dominates MSE, especially when ill-conditioned) | Substantially reduced |
| Total MSE | High (variance-dominated) | Lower |
Ridge regression trades a little bias for a large variance reduction. In high-dimensional or ill-conditioned settings, this trade is highly favorable: the small bias introduced is more than compensated by the substantial variance reduction, resulting in better overall prediction accuracy.
The variance reduction mechanism:
Recall that Ridge shrinks by factors $s_j = d_j/(d_j + \lambda)$. The variance of the $j$-th principal component of the OLS estimator is proportional to $1/d_j$. After Ridge shrinkage, the variance scales as:
$$\text{Var}_{\text{Ridge},j} = s_j^2 \cdot \text{Var}_{\text{OLS},j} = \left(\frac{d_j}{d_j + \lambda}\right)^2 \cdot \frac{\sigma^2}{d_j}$$
For small $d_j$, the shrinkage factor $s_j^2$ dramatically reduces what would otherwise be massive variance. This is why Ridge is especially effective when data is ill-conditioned.
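To make the tradeoff tangible, here is a small Monte-Carlo sketch (the design, noise level, true coefficients, and λ are arbitrary illustrative choices) that estimates bias², variance, and MSE of the OLS and Ridge coefficient estimators over repeated noise draws; with an ill-conditioned design, Ridge usually ends up with the lower total MSE:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, sigma, lam, n_reps = 50, 8, 2.0, 5.0, 2000

# Fixed, moderately ill-conditioned design and fixed true coefficients
A = rng.standard_normal((p, p))
X = rng.multivariate_normal(np.zeros(p), A @ A.T + 0.01 * np.eye(p), size=n)
beta_true = rng.standard_normal(p)

def summarize(estimates):
    """Decompose the MSE of a stack of coefficient estimates into bias^2 and variance."""
    mean_est = estimates.mean(axis=0)
    bias_sq = np.sum((mean_est - beta_true) ** 2)
    variance = np.mean(np.sum((estimates - mean_est) ** 2, axis=1))
    return bias_sq, variance, bias_sq + variance

ols_estimates, ridge_estimates = [], []
for _ in range(n_reps):
    y = X @ beta_true + sigma * rng.standard_normal(n)   # fresh noise each repetition
    ols_estimates.append(np.linalg.solve(X.T @ X, X.T @ y))
    ridge_estimates.append(np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y))

for name, est in [("OLS  ", np.array(ols_estimates)),
                  ("Ridge", np.array(ridge_estimates))]:
    b2, var, mse = summarize(est)
    print(f"{name}  bias² = {b2:8.4f}   variance = {var:8.4f}   MSE = {mse:8.4f}")
```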
We've developed deep intuition for how Ridge regression shrinks coefficients and why this shrinkage improves predictions: geometrically, the solution is the tangency point between the RSS ellipses and the L2 ball; algebraically, each principal direction is shrunk by the factor $d_j/(d_j + \lambda)$, most aggressively where the data is least informative; and statistically, this trades a small bias for a large variance reduction.
What's next:
With the shrinkage intuition in place, we're ready to formalize the bias-variance tradeoff for Ridge regression. The next page provides mathematical analysis of how Ridge balances these competing quantities and when the tradeoff is most favorable.
You now understand Ridge shrinkage from geometric, algebraic, and statistical perspectives. This intuition is essential for applying Ridge regression effectively and understanding when it will (or won't) help your models.