In the previous page, we developed intuition for why Ridge regression's shrinkage can improve predictions despite introducing bias. Now we make this precise: we'll derive exact expressions for bias and variance of Ridge estimators, analyze how they depend on $\lambda$, and understand when the tradeoff is most favorable.
This analysis answers the fundamental question: How do we know that trading bias for variance reduction is worthwhile?
By the end of this page, you will be able to write exact formulas for Ridge bias and variance, derive conditions under which Ridge outperforms OLS in MSE, understand an oracle bound for the optimal λ, and analyze how the tradeoff depends on signal-to-noise ratio and conditioning.
To derive bias and variance precisely, we need a clear probabilistic framework.
The linear model:
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta}^* + \boldsymbol{\epsilon}$$
where:
- $\mathbf{X} \in \mathbb{R}^{n \times p}$ is the design matrix (treated as fixed),
- $\boldsymbol{\beta}^* \in \mathbb{R}^p$ is the true coefficient vector,
- $\boldsymbol{\epsilon}$ is noise with $\mathbb{E}[\boldsymbol{\epsilon}] = \mathbf{0}$ and $\text{Var}(\boldsymbol{\epsilon}) = \sigma^2\mathbf{I}$.
Key quantities: the Ridge estimator $\hat{\boldsymbol{\beta}}_\lambda = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$ and, for comparison, the OLS estimator $\hat{\boldsymbol{\beta}}_{\text{OLS}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$.
Let $\mathbf{X}^T\mathbf{X}$ have eigenvalue decomposition: $$\mathbf{X}^T\mathbf{X} = \mathbf{V}\mathbf{D}\mathbf{V}^T = \mathbf{V} \cdot \text{diag}(d_1, \ldots, d_p) \cdot \mathbf{V}^T$$
Define $\boldsymbol{\gamma}^* = \mathbf{V}^T\boldsymbol{\beta}^*$ (the true coefficients expressed in the eigenbasis) and the shrinkage factors $s_j = \frac{d_j}{d_j + \lambda}$, both of which appear repeatedly below.
We treat X as fixed (conditional on the design). This is the classical framework for analyzing ridge regression. Bias and variance are computed over the randomness in the noise ε. The random-X framework (where X is also random) leads to similar qualitative conclusions but more complex expressions.
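To make this notation concrete, here is a minimal sketch (the design matrix, true coefficients, and settings are hypothetical) that forms the eigendecomposition of $\mathbf{X}^T\mathbf{X}$ with `np.linalg.eigh` and checks that the Ridge estimate computed directly agrees with the one computed in the eigenbasis:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam, sigma = 50, 4, 2.0, 0.5           # hypothetical sizes and settings
X = rng.normal(size=(n, p))                  # fixed design (we condition on it)
beta_star = np.array([2.0, -1.0, 0.5, 0.0])  # assumed "true" coefficients
y = X @ beta_star + sigma * rng.normal(size=n)

# Eigendecomposition X^T X = V diag(d) V^T
d, V = np.linalg.eigh(X.T @ X)

# Ridge estimate, computed directly ...
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# ... and in the eigenbasis: gamma_hat_j = (V^T X^T y)_j / (d_j + lambda)
gamma_hat = (V.T @ X.T @ y) / (d + lam)
beta_ridge_eig = V @ gamma_hat

print(np.allclose(beta_ridge, beta_ridge_eig))  # True
```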
Definition of bias:
The bias of an estimator is the difference between its expected value and the true parameter:
$$\text{Bias}(\hat{\boldsymbol{\beta}}_\lambda) = \mathbb{E}[\hat{\boldsymbol{\beta}}_\lambda] - \boldsymbol{\beta}^*$$
Computing the expected value:
\begin{align} \mathbb{E}[\hat{\boldsymbol{\beta}}_\lambda] &= \mathbb{E}[(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}] \\ &= (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbb{E}[\mathbf{y}] \\ &= (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}^* \end{align}
The last step uses $\mathbb{E}[\mathbf{y}] = \mathbf{X}\boldsymbol{\beta}^*$ (noise has zero mean).
The bias formula:
\begin{align} \text{Bias}(\hat{\boldsymbol{\beta}}_\lambda) &= (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}^* - \boldsymbol{\beta}^* \\ &= \left[(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{X} - \mathbf{I}\right]\boldsymbol{\beta}^* \\ &= -\lambda(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\boldsymbol{\beta}^* \end{align}
The last equality uses the identity: $$(\mathbf{A} + \lambda\mathbf{I})^{-1}\mathbf{A} = \mathbf{I} - \lambda(\mathbf{A} + \lambda\mathbf{I})^{-1}$$
Boxed result:
$$\boxed{\text{Bias}(\hat{\boldsymbol{\beta}}_\lambda) = -\lambda(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\boldsymbol{\beta}^*}$$
Eigenvalue decomposition form:
Using $\mathbf{X}^T\mathbf{X} = \mathbf{V}\mathbf{D}\mathbf{V}^T$ and $\boldsymbol{\gamma}^* = \mathbf{V}^T\boldsymbol{\beta}^*$:
$$\text{Bias}(\hat{\boldsymbol{\beta}}_\lambda) = -\mathbf{V}\cdot\text{diag}\left(\frac{\lambda}{d_1 + \lambda}, \ldots, \frac{\lambda}{d_p + \lambda}\right)\cdot\boldsymbol{\gamma}^*$$
Squared bias (summed over coordinates):
$$\|\text{Bias}\|_2^2 = \sum_{j=1}^{p} \left(\frac{\lambda}{d_j + \lambda}\right)^2 (\gamma_j^*)^2 = \lambda^2 \sum_{j=1}^{p} \frac{(\gamma_j^*)^2}{(d_j + \lambda)^2}$$
This is the total squared bias across all coefficient components.
The bias grows with λ and scales with the true coefficients β*: larger true coefficients suffer more bias. The bias component in direction $j$ is $-\frac{\lambda}{d_j + \lambda}\gamma_j^*$; its magnitude approaches $|\gamma_j^*|$ as $d_j \to 0$ (complete shrinkage) and approaches 0 as $d_j \to \infty$ (essentially no shrinkage).
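As a sanity check on the boxed bias formula, the following sketch (small simulated design and assumed true coefficients) averages the Ridge estimator over many noise draws and compares the empirical bias to $-\lambda(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\boldsymbol{\beta}^*$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam, sigma = 80, 3, 5.0, 1.0
X = rng.normal(size=(n, p))                 # fixed design throughout
beta_star = np.array([1.5, -2.0, 0.0])      # assumed true coefficients

A_inv = np.linalg.inv(X.T @ X + lam * np.eye(p))
theoretical_bias = -lam * A_inv @ beta_star

# Monte Carlo: average beta_hat over many noise realizations
n_sims = 20000
estimates = np.empty((n_sims, p))
for s in range(n_sims):
    y = X @ beta_star + sigma * rng.normal(size=n)
    estimates[s] = A_inv @ X.T @ y

empirical_bias = estimates.mean(axis=0) - beta_star
print(theoretical_bias)
print(empirical_bias)   # should agree up to Monte Carlo error
```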
Definition of variance:
The variance (covariance matrix) of the estimator measures how it fluctuates across different realizations of the noise:
$$\text{Var}(\hat{\boldsymbol{\beta}}_\lambda) = \mathbb{E}\left[(\hat{\boldsymbol{\beta}}_\lambda - \mathbb{E}[\hat{\boldsymbol{\beta}}_\lambda])(\hat{\boldsymbol{\beta}}_\lambda - \mathbb{E}[\hat{\boldsymbol{\beta}}_\lambda])^T\right]$$
Computing the variance:
Since $\hat{\boldsymbol{\beta}}_\lambda$ is linear in $\mathbf{y}$:
$$\hat{\boldsymbol{\beta}}_\lambda = \mathbf{W}\mathbf{y}, \quad \text{where } \mathbf{W} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T$$
Using $\text{Var}(\mathbf{W}\mathbf{y}) = \mathbf{W}\text{Var}(\mathbf{y})\mathbf{W}^T$ and $\text{Var}(\mathbf{y}) = \sigma^2\mathbf{I}$:
$$\text{Var}(\hat{\boldsymbol{\beta}}_\lambda) = \sigma^2 \mathbf{W}\mathbf{W}^T = \sigma^2 (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}$$
Eigenvalue decomposition form:
Substituting $\mathbf{X}^T\mathbf{X} = \mathbf{V}\mathbf{D}\mathbf{V}^T$:
$$\text{Var}(\hat{\boldsymbol{\beta}}_\lambda) = \sigma^2 \mathbf{V} \cdot \text{diag}\left(\frac{d_1}{(d_1 + \lambda)^2}, \ldots, \frac{d_p}{(d_p + \lambda)^2}\right) \cdot \mathbf{V}^T$$
Total variance (trace):
$$\text{tr}(\text{Var}(\hat{\boldsymbol{\beta}}_\lambda)) = \sigma^2 \sum_{j=1}^{p} \frac{d_j}{(d_j + \lambda)^2}$$
Boxed result:
$$\boxed{\text{Total Variance} = \sigma^2 \sum_{j=1}^{p} \frac{d_j}{(d_j + \lambda)^2}}$$
Comparison with OLS variance:
For OLS ($\lambda = 0$):
$$\text{Var}(\hat{\boldsymbol{\beta}}_{\text{OLS}}) = \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1} = \sigma^2 \mathbf{V} \cdot \text{diag}\left(\frac{1}{d_1}, \ldots, \frac{1}{d_p}\right) \cdot \mathbf{V}^T$$
Total OLS variance: $\sigma^2 \sum_{j=1}^{p} \frac{1}{d_j}$
Variance reduction factor:
For each principal direction $j$:
$$\frac{\text{Var}_{\text{Ridge}, j}}{\text{Var}_{\text{OLS}, j}} = \frac{d_j/(d_j + \lambda)^2}{1/d_j} = \frac{d_j^2}{(d_j + \lambda)^2} = s_j^2$$
Since $s_j < 1$, Ridge variance is strictly less than OLS variance in every direction when $\lambda > 0$.
| Eigenvalue $d_j$ (with $\lambda = 1$) | OLS Var (×$\sigma^2$) | Ridge Var (×$\sigma^2$) | Reduction |
|---|---|---|---|
| 100 | 0.01 | 0.0098 | 2% |
| 10 | 0.1 | 0.083 | 17% |
| 1 | 1.0 | 0.25 | 75% |
| 0.1 | 10.0 | 0.083 | 99.2% |
| 0.01 | 100.0 | 0.0098 | 99.99% |
Ridge regression achieves enormous variance reduction in ill-conditioned directions (small eigenvalues). Where OLS variance would explode (1/d_j → ∞ as d_j → 0), Ridge variance remains bounded (approaches 0 as d_j → 0 for fixed λ). This is the key mechanism behind Ridge's improved prediction.
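The per-direction figures in the table above follow directly from the two variance formulas; this short sketch reproduces them from the spectrum alone (same eigenvalues, $\lambda = 1$, $\sigma^2 = 1$):

```python
import numpy as np

d = np.array([100.0, 10.0, 1.0, 0.1, 0.01])   # eigenvalues of X^T X
lam, sigma2 = 1.0, 1.0

ols_var   = sigma2 / d                         # sigma^2 / d_j
ridge_var = sigma2 * d / (d + lam) ** 2        # sigma^2 d_j / (d_j + lambda)^2
reduction = 1.0 - ridge_var / ols_var          # = 1 - s_j^2

for dj, v0, v1, r in zip(d, ols_var, ridge_var, reduction):
    print(f"d_j={dj:>6}: OLS {v0:8.4f}  Ridge {v1:7.4f}  reduction {100*r:6.2f}%")
```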
MSE decomposition:
The Mean Squared Error of the Ridge estimator is:
$$\text{MSE}(\hat{\boldsymbol{\beta}}_\lambda) = \mathbb{E}\|\hat{\boldsymbol{\beta}}_\lambda - \boldsymbol{\beta}^*\|_2^2 = \|\text{Bias}\|_2^2 + \text{tr}(\text{Var})$$
Substituting our results:
$$\text{MSE}(\lambda) = \underbrace{\lambda^2 \sum_{j=1}^{p} \frac{(\gamma_j^*)^2}{(d_j + \lambda)^2}}_{\text{Squared Bias}} + \underbrace{\sigma^2 \sum_{j=1}^{p} \frac{d_j}{(d_j + \lambda)^2}}_{\text{Variance}}$$
This can be written per-component in principal direction $j$:
$$\text{MSE}_j(\lambda) = \frac{\lambda^2 (\gamma_j^*)^2 + \sigma^2 d_j}{(d_j + \lambda)^2}$$
Behavior of MSE with λ: as $\lambda$ increases from 0, the variance term decreases monotonically, the squared bias increases monotonically from zero, and the total MSE typically falls before rising again, attaining its minimum at some intermediate $\lambda^* > 0$.
The tradeoff visualized: the code below computes both components and their sum on a log-log scale and marks the minimizing λ.
```python
import numpy as np
import matplotlib.pyplot as plt

def compute_ridge_mse_components(eigenvalues, gamma_star, sigma2, lambdas):
    """
    Compute bias^2, variance, and total MSE for Ridge regression
    as a function of lambda.

    Parameters:
    -----------
    eigenvalues : ndarray
        Eigenvalues d_j of X^T X
    gamma_star : ndarray
        True coefficients in principal component basis
    sigma2 : float
        Noise variance
    lambdas : ndarray
        Grid of lambda values

    Returns:
    --------
    bias2 : ndarray
        Squared bias for each lambda
    variance : ndarray
        Total variance for each lambda
    mse : ndarray
        Total MSE for each lambda
    """
    bias2 = np.zeros(len(lambdas))
    variance = np.zeros(len(lambdas))

    for i, lam in enumerate(lambdas):
        # Squared bias: sum_j [lambda / (d_j + lambda)]^2 * gamma_j^2
        shrink_factors = lam / (eigenvalues + lam)
        bias2[i] = np.sum(shrink_factors**2 * gamma_star**2)

        # Variance: sigma^2 * sum_j d_j / (d_j + lambda)^2
        variance[i] = sigma2 * np.sum(
            eigenvalues / (eigenvalues + lam)**2
        )

    mse = bias2 + variance
    return bias2, variance, mse


def plot_bias_variance_tradeoff(eigenvalues, gamma_star, sigma2):
    """
    Visualize the bias-variance tradeoff for Ridge regression.
    """
    lambdas = np.logspace(-3, 3, 500)
    bias2, variance, mse = compute_ridge_mse_components(
        eigenvalues, gamma_star, sigma2, lambdas
    )

    # Find optimal lambda
    opt_idx = np.argmin(mse)
    opt_lambda = lambdas[opt_idx]
    opt_mse = mse[opt_idx]

    plt.figure(figsize=(10, 6))
    plt.loglog(lambdas, bias2, 'b-', label='Squared Bias', linewidth=2)
    plt.loglog(lambdas, variance, 'r-', label='Variance', linewidth=2)
    plt.loglog(lambdas, mse, 'k-', label='Total MSE', linewidth=2.5)
    plt.axvline(opt_lambda, color='g', linestyle='--',
                label=f'Optimal λ = {opt_lambda:.4f}')
    plt.scatter([opt_lambda], [opt_mse], color='g', s=100, zorder=5)
    plt.xlabel('λ (log scale)', fontsize=12)
    plt.ylabel('MSE Components (log scale)', fontsize=12)
    plt.title('Ridge Regression: Bias-Variance Tradeoff', fontsize=14)
    plt.legend(loc='best')
    plt.grid(True, alpha=0.3)

    return plt.gcf(), opt_lambda


# Example: ill-conditioned problem
eigenvalues = np.array([100, 10, 1, 0.1, 0.01])  # Condition number = 10000
gamma_star = np.array([1, 1, 1, 1, 1])           # Equal signal in all directions
sigma2 = 1.0                                      # Unit noise variance

# fig, opt_lambda = plot_bias_variance_tradeoff(eigenvalues, gamma_star, sigma2)
# plt.show()
```

Oracle optimal λ:
In principle, the optimal $\lambda^*$ minimizes the MSE:
$$\lambda^* = \arg\min_\lambda \text{MSE}(\lambda)$$
Taking the derivative of MSE and setting to zero yields a complex expression that depends on the true parameters $\boldsymbol{\beta}^*$ and noise variance $\sigma^2$—quantities we don't know in practice.
Simplified case (orthogonal design):
When $\mathbf{X}^T\mathbf{X} = \mathbf{I}$ (orthonormal features), the optimal λ for each coefficient $\beta_j^*$ is:
$$\lambda_j^* = \frac{\sigma^2}{(\beta_j^*)^2}$$
This is the ratio of noise variance to squared signal in that direction.
Per-component analysis:
For a single component with eigenvalue $d$ and true coefficient $\gamma^*$:
$$\text{MSE}(\lambda) = \frac{\lambda^2 (\gamma^*)^2 + \sigma^2 d}{(d + \lambda)^2}$$
Taking derivative and setting to zero:
$$\frac{d}{d\lambda}\text{MSE} = \frac{2\lambda (\gamma^*)^2 (d + \lambda)^2 - 2(d + \lambda)\left[\lambda^2(\gamma^*)^2 + \sigma^2 d\right]}{(d + \lambda)^4} = 0$$
Solving: the numerator factors as $2d(d+\lambda)\left[\lambda(\gamma^*)^2 - \sigma^2\right]$, so setting it to zero gives
$$\lambda_j^* = \frac{\sigma^2}{(\gamma_j^*)^2}$$
Notably, the optimum does not depend on $d_j$, which is consistent with the orthonormal case above.
The optimal λ for component j is inversely proportional to the squared signal $(\gamma_j^*)^2$ and proportional to the noise $\sigma^2$. High signal → small regularization. High noise → large regularization. This makes intuitive sense: regularize more where signal is weak relative to noise.
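A quick numerical check of the per-component optimum (illustrative values for $d$, $\gamma^*$, $\sigma^2$): minimizing $\text{MSE}_j(\lambda)$ over a fine grid recovers $\lambda^* = \sigma^2/(\gamma^*)^2$ regardless of the eigenvalue $d$:

```python
import numpy as np

sigma2, gamma_star = 1.0, 0.5                  # assumed noise variance and signal
lambdas = np.linspace(1e-4, 20, 200001)        # fine grid of lambda values

for d in [0.1, 1.0, 10.0]:                     # different eigenvalues
    # Per-component MSE: [lambda^2 gamma^2 + sigma^2 d] / (d + lambda)^2
    mse = (lambdas**2 * gamma_star**2 + sigma2 * d) / (d + lambdas) ** 2
    lam_opt = lambdas[np.argmin(mse)]
    print(f"d={d:>4}: grid minimizer {lam_opt:.3f}  vs  "
          f"sigma^2/gamma*^2 = {sigma2 / gamma_star**2:.3f}")
```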
The challenge: we don't know $\boldsymbol{\beta}^*$:
The oracle optimal λ depends on $\boldsymbol{\beta}^*$, which is exactly what we're trying to estimate. This creates a circular problem: choosing the best λ requires knowing the very coefficients that λ is supposed to help us estimate.
Practical solutions include data-driven selection rules such as cross-validation, generalized cross-validation (GCV), and information criteria.
We'll explore these methods in detail in the next page.
Existence of improvement:
A fundamental theorem guarantees that Ridge regression (for some $\lambda > 0$) always achieves lower MSE than OLS, except in degenerate cases.
Theorem (Hoerl and Kennard, 1970):
For any true parameter $\boldsymbol{\beta}^* \neq \mathbf{0}$ and any design matrix $\mathbf{X}$, there exists $\lambda > 0$ such that:
$$\text{MSE}(\hat{\boldsymbol{\beta}}_\lambda) < \text{MSE}(\hat{\boldsymbol{\beta}}_{\text{OLS}})$$
This is because the MSE curve has negative slope at $\lambda = 0$: the marginal variance reduction from a small λ always exceeds the marginal bias increase.
Magnitude of improvement:
The amount of MSE reduction depends chiefly on the conditioning of $\mathbf{X}^T\mathbf{X}$ (how small its smallest eigenvalues are) and the signal-to-noise ratio (the size of $\boldsymbol{\beta}^*$ relative to $\sigma^2$).
Quantitative bound:
For the optimal Ridge estimator:
$$\frac{\text{MSE}(\hat{\boldsymbol{\beta}}_{\lambda^*})}{\text{MSE}(\hat{\boldsymbol{\beta}}_{\text{OLS}})} \leq \frac{\text{effective df}}{p}$$
where effective df = $\sum_j d_j/(d_j + \lambda^*)$. Since effective df < p, this ratio is always less than 1.
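To see the guaranteed improvement concretely, the sketch below evaluates the MSE decomposition on the same illustrative spectrum used earlier and compares the best Ridge MSE with the OLS MSE; it also prints the effective-df ratio for reference (this is an illustration, not a proof of the bound):

```python
import numpy as np

d = np.array([100.0, 10.0, 1.0, 0.1, 0.01])    # eigenvalues (illustrative)
gamma_star = np.ones_like(d)                    # assumed signal per direction
sigma2 = 1.0

lambdas = np.logspace(-4, 3, 2000)
bias2 = np.array([np.sum((lam / (d + lam))**2 * gamma_star**2) for lam in lambdas])
var   = np.array([sigma2 * np.sum(d / (d + lam)**2) for lam in lambdas])
mse   = bias2 + var

mse_ols = sigma2 * np.sum(1.0 / d)              # OLS: unbiased, variance sigma^2 sum 1/d_j
i_opt = np.argmin(mse)
lam_opt = lambdas[i_opt]
eff_df = np.sum(d / (d + lam_opt))

print(f"OLS MSE          : {mse_ols:.3f}")
print(f"Ridge MSE (best) : {mse[i_opt]:.3f} at lambda = {lam_opt:.3f}")
print(f"MSE ratio        : {mse[i_opt] / mse_ols:.4f}")
print(f"effective df / p : {eff_df / len(d):.4f}")
```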
| Scenario | OLS Problem | Ridge Advantage |
|---|---|---|
| High dimensionality (p ≈ n) | Variance explosion, potential non-existence | Stabilizes estimates, guarantees solution |
| Multicollinearity | Inflated variances, unstable coefficients | Shrinks correlated coefficients together |
| Low signal-to-noise | Estimates dominated by noise | Shrinks noisy estimates toward zero |
| Small sample size | Insufficient data to estimate all parameters | Borrows strength via regularization |
| Many weak predictors | Each predictor's effect is uncertain | Shrinks weak effects, preserves strong ones |
Ridge provides minimal advantage when: (1) X^TX is well-conditioned (small condition number), (2) SNR is high (noise is small relative to signal), (3) n >> p (many observations relative to parameters), (4) true coefficients are large (shrinkage toward zero is harmful). In these cases, OLS and Ridge perform similarly.
We've focused on estimation MSE: how well we estimate $\boldsymbol{\beta}^*$. But often we care more about prediction: how well we predict new responses.
Prediction risk:
For a new observation $\mathbf{x}_{\text{new}}$ with true response $y_{\text{new}} = \mathbf{x}_{\text{new}}^T\boldsymbol{\beta}^* + \epsilon_{\text{new}}$, the prediction risk is:
$$\text{Prediction MSE} = \mathbb{E}[(\hat{y}_{\text{new}} - y_{\text{new}})^2]$$
where $\hat{y}_{\text{new}} = \mathbf{x}_{\text{new}}^T \hat{\boldsymbol{\beta}}_\lambda$.
Decomposition:
$$\text{Prediction MSE} = \underbrace{\sigma^2}_{\text{Irreducible}} + \underbrace{\mathbf{x}_{\text{new}}^T \, \mathbf{M}(\hat{\boldsymbol{\beta}}_\lambda) \, \mathbf{x}_{\text{new}}}_{\text{Estimation error contribution}}$$
where $\mathbf{M}(\hat{\boldsymbol{\beta}}_\lambda) = \text{Bias}\,\text{Bias}^T + \text{Var}(\hat{\boldsymbol{\beta}}_\lambda)$ is the estimation MSE matrix (the scalar MSE above is its trace). Shrinking this matrix therefore reduces the prediction MSE at any fixed test point.
In-sample vs. out-of-sample:
Ridge closes the generalization gap by trading some training fit for better generalization.
Expected out-of-sample error:
For random test points $\mathbf{x}_{\text{new}}$ with the same distribution as training:
$$\mathbb{E}[\text{Prediction MSE}] = \sigma^2 + \text{tr}(\boldsymbol{\Sigma}_x \cdot \text{MSE Matrix})$$
where $\boldsymbol{\Sigma}_x$ is the covariance of the features.
This shows that prediction performance integrates estimation MSE over the feature distribution—directionally weighting by where test points are likely to fall.
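Here is a minimal sketch of that identity (mean-zero Gaussian features with an assumed covariance $\boldsymbol{\Sigma}_x$, assumed $\boldsymbol{\beta}^*$ and $\sigma^2$): it evaluates $\sigma^2 + \text{tr}(\boldsymbol{\Sigma}_x \mathbf{M})$, with $\mathbf{M}$ the estimation MSE matrix for a fixed training design, and checks it against a Monte Carlo estimate over fresh test points:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam, sigma = 60, 4, 3.0, 1.0
beta_star = np.array([1.0, -0.5, 0.25, 0.0])   # assumed true coefficients
Sigma_x = np.diag([2.0, 1.0, 0.5, 0.1])        # assumed feature covariance (mean-zero features)

# One fixed training design drawn from the feature distribution
X = rng.multivariate_normal(np.zeros(p), Sigma_x, size=n)
A_inv = np.linalg.inv(X.T @ X + lam * np.eye(p))

# Estimation MSE matrix: M = bias bias^T + Var
bias = -lam * A_inv @ beta_star
Var = sigma**2 * A_inv @ X.T @ X @ A_inv
M = np.outer(bias, bias) + Var

formula = sigma**2 + np.trace(Sigma_x @ M)

# Monte Carlo: average squared prediction error over noise draws and fresh test points
n_sims, n_test = 2000, 200
errs = []
for _ in range(n_sims):
    y = X @ beta_star + sigma * rng.normal(size=n)
    beta_hat = A_inv @ X.T @ y
    X_new = rng.multivariate_normal(np.zeros(p), Sigma_x, size=n_test)
    y_new = X_new @ beta_star + sigma * rng.normal(size=n_test)
    errs.append(np.mean((X_new @ beta_hat - y_new) ** 2))

print(f"formula: {formula:.4f}   Monte Carlo: {np.mean(errs):.4f}")
```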
Whether you care about estimating coefficients or predicting new outcomes, Ridge regression with appropriate λ typically outperforms OLS. The bias-variance tradeoff operates similarly for both objectives: trading some accuracy for stability improves overall performance.
We've quantified precisely how Ridge regression trades bias for variance reduction: exact formulas for the bias and variance in the eigenbasis of $\mathbf{X}^T\mathbf{X}$, the resulting MSE decomposition, the per-component oracle $\lambda_j^* = \sigma^2/(\gamma_j^*)^2$, and the Hoerl-Kennard guarantee that some $\lambda > 0$ always improves on OLS.
What's next:
The theoretical analysis is complete, but we face a practical challenge: how do we choose λ when we don't know the true coefficients? The next page covers methods for selecting the regularization strength, including cross-validation, GCV, and information criteria.
You now understand the mathematical foundations of the bias-variance tradeoff in Ridge regression. This analysis explains why a biased estimator can outperform an unbiased one—a counterintuitive but fundamental insight in statistical learning.