In the previous page, we developed intuition for why Ridge regression's shrinkage can improve predictions despite introducing bias. Now we make this precise: we'll derive exact expressions for bias and variance of Ridge estimators, analyze how they depend on $\lambda$, and understand when the tradeoff is most favorable.
This analysis answers the fundamental question: How do we know that trading bias for variance reduction is worthwhile?
By the end of this page, you will be able to write exact formulas for Ridge bias and variance, derive conditions under which Ridge outperforms OLS in MSE, understand an oracle bound for the optimal λ, and analyze how the tradeoff depends on signal-to-noise ratio and conditioning.
To derive bias and variance precisely, we need a clear probabilistic framework.
The linear model:
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta}^* + \boldsymbol{\epsilon}$$
where:
- $\mathbf{X} \in \mathbb{R}^{n \times p}$ is the design matrix (treated as fixed),
- $\boldsymbol{\beta}^* \in \mathbb{R}^p$ is the true coefficient vector,
- $\boldsymbol{\epsilon}$ is noise with $\mathbb{E}[\boldsymbol{\epsilon}] = \mathbf{0}$ and $\text{Var}(\boldsymbol{\epsilon}) = \sigma^2\mathbf{I}$.
Key quantities: the Ridge estimator $\hat{\boldsymbol{\beta}}_\lambda = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$ and, for comparison, the OLS estimator $\hat{\boldsymbol{\beta}}_{\text{OLS}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$.
Let $\mathbf{X}^T\mathbf{X}$ have eigenvalue decomposition: $$\mathbf{X}^T\mathbf{X} = \mathbf{V}\mathbf{D}\mathbf{V}^T = \mathbf{V} \cdot \text{diag}(d_1, \ldots, d_p) \cdot \mathbf{V}^T$$
Define $\boldsymbol{\gamma}^* = \mathbf{V}^T\boldsymbol{\beta}^*$ (the true coefficients expressed in the eigenbasis) and the shrinkage factors $s_j = \frac{d_j}{d_j + \lambda}$, both of which appear repeatedly below.
We treat X as fixed (conditional on the design). This is the classical framework for analyzing ridge regression. Bias and variance are computed over the randomness in the noise ε. The random-X framework (where X is also random) leads to similar qualitative conclusions but more complex expressions.
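To make this notation concrete, here is a minimal sketch (the design matrix, true coefficients, and settings are hypothetical) that forms the eigendecomposition of $\mathbf{X}^T\mathbf{X}$ with `np.linalg.eigh` and checks that the Ridge estimate computed directly agrees with the one computed in the eigenbasis:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam, sigma = 50, 4, 2.0, 0.5           # hypothetical sizes and settings
X = rng.normal(size=(n, p))                  # fixed design (we condition on it)
beta_star = np.array([2.0, -1.0, 0.5, 0.0])  # assumed "true" coefficients
y = X @ beta_star + sigma * rng.normal(size=n)

# Eigendecomposition X^T X = V diag(d) V^T
d, V = np.linalg.eigh(X.T @ X)

# Ridge estimate, computed directly ...
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# ... and in the eigenbasis: gamma_hat_j = (V^T X^T y)_j / (d_j + lambda)
gamma_hat = (V.T @ X.T @ y) / (d + lam)
beta_ridge_eig = V @ gamma_hat

print(np.allclose(beta_ridge, beta_ridge_eig))  # True
```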
Definition of bias:
The bias of an estimator is the difference between its expected value and the true parameter:
$$\text{Bias}(\hat{\boldsymbol{\beta}}_\lambda) = \mathbb{E}[\hat{\boldsymbol{\beta}}_\lambda] - \boldsymbol{\beta}^*$$
Computing the expected value:
\begin{align} \mathbb{E}[\hat{\boldsymbol{\beta}}_\lambda] &= \mathbb{E}[(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}] \\ &= (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbb{E}[\mathbf{y}] \\ &= (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}^* \end{align}
The last step uses $\mathbb{E}[\mathbf{y}] = \mathbf{X}\boldsymbol{\beta}^*$ (noise has zero mean).
The bias formula:
\begin{align} \text{Bias}(\hat{\boldsymbol{\beta}}_\lambda) &= (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}^* - \boldsymbol{\beta}^* \\ &= \left[(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{X} - \mathbf{I}\right]\boldsymbol{\beta}^* \\ &= -\lambda(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\boldsymbol{\beta}^* \end{align}
The last equality uses the identity: $$(\mathbf{A} + \lambda\mathbf{I})^{-1}\mathbf{A} = \mathbf{I} - \lambda(\mathbf{A} + \lambda\mathbf{I})^{-1}$$
Boxed result:
$$\boxed{\text{Bias}(\hat{\boldsymbol{\beta}}_\lambda) = -\lambda(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\boldsymbol{\beta}^*}$$
Eigenvalue decomposition form:
Using $\mathbf{X}^T\mathbf{X} = \mathbf{V}\mathbf{D}\mathbf{V}^T$ and $\boldsymbol{\gamma}^* = \mathbf{V}^T\boldsymbol{\beta}^*$:
$$\text{Bias}(\hat{\boldsymbol{\beta}}_\lambda) = -\mathbf{V}\cdot\text{diag}\left(\frac{\lambda}{d_1 + \lambda}, \ldots, \frac{\lambda}{d_p + \lambda}\right)\cdot\boldsymbol{\gamma}^*$$
Squared bias (summed over coordinates):
$$\|\text{Bias}\|_2^2 = \sum_{j=1}^{p} \left(\frac{\lambda}{d_j + \lambda}\right)^2 (\gamma_j^*)^2 = \lambda^2 \sum_{j=1}^{p} \frac{(\gamma_j^*)^2}{(d_j + \lambda)^2}$$
This is the total squared bias across all coefficient components.
The bias grows with λ and scales with the true coefficients β*: larger true coefficients suffer more bias. The bias component in direction $j$ is $-\frac{\lambda}{d_j + \lambda}\gamma_j^*$; its magnitude approaches $|\gamma_j^*|$ as $d_j \to 0$ (complete shrinkage) and approaches 0 as $d_j \to \infty$ (essentially no shrinkage).
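As a sanity check on the boxed bias formula, the following sketch (small simulated design and assumed true coefficients) averages the Ridge estimator over many noise draws and compares the empirical bias to $-\lambda(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\boldsymbol{\beta}^*$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam, sigma = 80, 3, 5.0, 1.0
X = rng.normal(size=(n, p))                 # fixed design throughout
beta_star = np.array([1.5, -2.0, 0.0])      # assumed true coefficients

A_inv = np.linalg.inv(X.T @ X + lam * np.eye(p))
theoretical_bias = -lam * A_inv @ beta_star

# Monte Carlo: average beta_hat over many noise realizations
n_sims = 20000
estimates = np.empty((n_sims, p))
for s in range(n_sims):
    y = X @ beta_star + sigma * rng.normal(size=n)
    estimates[s] = A_inv @ X.T @ y

empirical_bias = estimates.mean(axis=0) - beta_star
print(theoretical_bias)
print(empirical_bias)   # should agree up to Monte Carlo error
```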
Definition of variance:
The variance (covariance matrix) of the estimator measures how it fluctuates across different realizations of the noise:
$$\text{Var}(\hat{\boldsymbol{\beta}}_\lambda) = \mathbb{E}\left[(\hat{\boldsymbol{\beta}}_\lambda - \mathbb{E}[\hat{\boldsymbol{\beta}}_\lambda])(\hat{\boldsymbol{\beta}}_\lambda - \mathbb{E}[\hat{\boldsymbol{\beta}}_\lambda])^T\right]$$
Computing the variance:
Since $\hat{\boldsymbol{\beta}}_\lambda$ is linear in $\mathbf{y}$:
$$\hat{\boldsymbol{\beta}}_\lambda = \mathbf{W}\mathbf{y}, \quad \text{where } \mathbf{W} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T$$
Using $\text{Var}(\mathbf{W}\mathbf{y}) = \mathbf{W}\text{Var}(\mathbf{y})\mathbf{W}^T$ and $\text{Var}(\mathbf{y}) = \sigma^2\mathbf{I}$:
$$\text{Var}(\hat{\boldsymbol{\beta}}_\lambda) = \sigma^2 \mathbf{W}\mathbf{W}^T = \sigma^2 (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}$$
Eigenvalue decomposition form:
Substituting $\mathbf{X}^T\mathbf{X} = \mathbf{V}\mathbf{D}\mathbf{V}^T$:
$$\text{Var}(\hat{\boldsymbol{\beta}}_\lambda) = \sigma^2 \mathbf{V} \cdot \text{diag}\left(\frac{d_1}{(d_1 + \lambda)^2}, \ldots, \frac{d_p}{(d_p + \lambda)^2}\right) \cdot \mathbf{V}^T$$
Total variance (trace):
$$\text{tr}(\text{Var}(\hat{\boldsymbol{\beta}}_\lambda)) = \sigma^2 \sum_{j=1}^{p} \frac{d_j}{(d_j + \lambda)^2}$$
Boxed result:
$$\boxed{\text{Total Variance} = \sigma^2 \sum_{j=1}^{p} \frac{d_j}{(d_j + \lambda)^2}}$$
Comparison with OLS variance:
For OLS ($\lambda = 0$):
$$\text{Var}(\hat{\boldsymbol{\beta}}_{\text{OLS}}) = \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1} = \sigma^2 \mathbf{V} \cdot \text{diag}\left(\frac{1}{d_1}, \ldots, \frac{1}{d_p}\right) \cdot \mathbf{V}^T$$
Total OLS variance: $\sigma^2 \sum_{j=1}^{p} \frac{1}{d_j}$
Variance reduction factor:
For each principal direction $j$:
$$\frac{\text{Var}_{\text{Ridge}, j}}{\text{Var}_{\text{OLS}, j}} = \frac{d_j/(d_j + \lambda)^2}{1/d_j} = \frac{d_j^2}{(d_j + \lambda)^2} = s_j^2$$
Since $s_j < 1$, Ridge variance is strictly less than OLS variance in every direction when $\lambda > 0$.
| Eigenvalue $d_j$ (with $\lambda = 1$) | OLS Var (×$\sigma^2$) | Ridge Var (×$\sigma^2$) | Reduction |
|---|---|---|---|
| 100 | 0.01 | 0.0098 | 2% |
| 10 | 0.1 | 0.083 | 17% |
| 1 | 1.0 | 0.25 | 75% |
| 0.1 | 10.0 | 0.083 | 99.2% |
| 0.01 | 100.0 | 0.0098 | 99.99% |
Ridge regression achieves enormous variance reduction in ill-conditioned directions (small eigenvalues). Where OLS variance would explode (1/d_j → ∞ as d_j → 0), Ridge variance remains bounded (approaches 0 as d_j → 0 for fixed λ). This is the key mechanism behind Ridge's improved prediction.
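The per-direction figures in the table above follow directly from the two variance formulas; this short sketch reproduces them from the spectrum alone (same eigenvalues, $\lambda = 1$, $\sigma^2 = 1$):

```python
import numpy as np

d = np.array([100.0, 10.0, 1.0, 0.1, 0.01])   # eigenvalues of X^T X
lam, sigma2 = 1.0, 1.0

ols_var   = sigma2 / d                         # sigma^2 / d_j
ridge_var = sigma2 * d / (d + lam) ** 2        # sigma^2 d_j / (d_j + lambda)^2
reduction = 1.0 - ridge_var / ols_var          # = 1 - s_j^2

for dj, v0, v1, r in zip(d, ols_var, ridge_var, reduction):
    print(f"d_j={dj:>6}: OLS {v0:8.4f}  Ridge {v1:7.4f}  reduction {100*r:6.2f}%")
```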
MSE decomposition:
The Mean Squared Error of the Ridge estimator is:
$$\text{MSE}(\hat{\boldsymbol{\beta}}_\lambda) = \mathbb{E}\|\hat{\boldsymbol{\beta}}_\lambda - \boldsymbol{\beta}^*\|_2^2 = \|\text{Bias}\|_2^2 + \text{tr}(\text{Var})$$
Substituting our results:
$$\text{MSE}(\lambda) = \underbrace{\lambda^2 \sum_{j=1}^{p} \frac{(\gamma_j^*)^2}{(d_j + \lambda)^2}}_{\text{Squared Bias}} + \underbrace{\sigma^2 \sum_{j=1}^{p} \frac{d_j}{(d_j + \lambda)^2}}_{\text{Variance}}$$
This can be written per-component in principal direction $j$:
$$\text{MSE}_j(\lambda) = \frac{\lambda^2 (\gamma_j^*)^2 + \sigma^2 d_j}{(d_j + \lambda)^2}$$
Behavior of MSE with λ: as $\lambda$ increases from 0, the variance term decreases monotonically, the squared bias increases monotonically from zero, and the total MSE typically falls before rising again, attaining its minimum at some intermediate $\lambda^* > 0$.
The tradeoff visualized: the code below computes both components and their sum on a log-log scale and marks the minimizing λ.
```python
import numpy as np
import matplotlib.pyplot as plt

def compute_ridge_mse_components(eigenvalues, gamma_star, sigma2, lambdas):
    """
    Compute bias^2, variance, and total MSE for Ridge regression
    as a function of lambda.

    Parameters:
    -----------
    eigenvalues : ndarray
        Eigenvalues d_j of X^T X
    gamma_star : ndarray
        True coefficients in principal component basis
    sigma2 : float
        Noise variance
    lambdas : ndarray
        Grid of lambda values

    Returns:
    --------
    bias2 : ndarray
        Squared bias for each lambda
    variance : ndarray
        Total variance for each lambda
    mse : ndarray
        Total MSE for each lambda
    """
    bias2 = np.zeros(len(lambdas))
    variance = np.zeros(len(lambdas))

    for i, lam in enumerate(lambdas):
        # Squared bias: sum_j [lambda / (d_j + lambda)]^2 * gamma_j^2
        shrink_factors = lam / (eigenvalues + lam)
        bias2[i] = np.sum(shrink_factors**2 * gamma_star**2)

        # Variance: sigma^2 * sum_j d_j / (d_j + lambda)^2
        variance[i] = sigma2 * np.sum(
            eigenvalues / (eigenvalues + lam)**2
        )

    mse = bias2 + variance
    return bias2, variance, mse


def plot_bias_variance_tradeoff(eigenvalues, gamma_star, sigma2):
    """
    Visualize the bias-variance tradeoff for Ridge regression.
    """
    lambdas = np.logspace(-3, 3, 500)
    bias2, variance, mse = compute_ridge_mse_components(
        eigenvalues, gamma_star, sigma2, lambdas
    )

    # Find optimal lambda
    opt_idx = np.argmin(mse)
    opt_lambda = lambdas[opt_idx]
    opt_mse = mse[opt_idx]

    plt.figure(figsize=(10, 6))
    plt.loglog(lambdas, bias2, 'b-', label='Squared Bias', linewidth=2)
    plt.loglog(lambdas, variance, 'r-', label='Variance', linewidth=2)
    plt.loglog(lambdas, mse, 'k-', label='Total MSE', linewidth=2.5)
    plt.axvline(opt_lambda, color='g', linestyle='--',
                label=f'Optimal λ = {opt_lambda:.4f}')
    plt.scatter([opt_lambda], [opt_mse], color='g', s=100, zorder=5)
    plt.xlabel('λ (log scale)', fontsize=12)
    plt.ylabel('MSE Components (log scale)', fontsize=12)
    plt.title('Ridge Regression: Bias-Variance Tradeoff', fontsize=14)
    plt.legend(loc='best')
    plt.grid(True, alpha=0.3)

    return plt.gcf(), opt_lambda


# Example: ill-conditioned problem
eigenvalues = np.array([100, 10, 1, 0.1, 0.01])  # Condition number = 10000
gamma_star = np.array([1, 1, 1, 1, 1])           # Equal signal in all directions
sigma2 = 1.0                                      # Unit noise variance

# fig, opt_lambda = plot_bias_variance_tradeoff(eigenvalues, gamma_star, sigma2)
# plt.show()
```

Oracle optimal λ:
In principle, the optimal $\lambda^*$ minimizes the MSE:
$$\lambda^* = \arg\min_\lambda \text{MSE}(\lambda)$$
Taking the derivative of MSE and setting to zero yields a complex expression that depends on the true parameters $\boldsymbol{\beta}^*$ and noise variance $\sigma^2$—quantities we don't know in practice.
Simplified case (orthogonal design):
When $\mathbf{X}^T\mathbf{X} = \mathbf{I}$ (orthonormal features), the optimal λ for each coefficient $\beta_j^*$ is:
$$\lambda_j^* = \frac{\sigma^2}{(\beta_j^*)^2}$$
This is the ratio of noise variance to squared signal in that direction.
Per-component analysis:
For a single component with eigenvalue $d$ and true coefficient $\gamma^*$:
$$\text{MSE}(\lambda) = \frac{\lambda^2 (\gamma^*)^2 + \sigma^2 d}{(d + \lambda)^2}$$
Taking derivative and setting to zero:
$$\frac{d}{d\lambda}\text{MSE} = \frac{2\lambda (\gamma^*)^2 (d + \lambda)^2 - 2(d + \lambda)\left[\lambda^2(\gamma^*)^2 + \sigma^2 d\right]}{(d + \lambda)^4} = 0$$
Solving: the numerator factors as $2d(d+\lambda)\left[\lambda(\gamma^*)^2 - \sigma^2\right]$, so setting it to zero gives
$$\lambda_j^* = \frac{\sigma^2}{(\gamma_j^*)^2}$$
Notably, the optimum does not depend on $d_j$, which is consistent with the orthonormal case above.
The optimal λ for component j is inversely proportional to the squared signal $(\gamma_j^*)^2$ and proportional to the noise $\sigma^2$. High signal → small regularization. High noise → large regularization. This makes intuitive sense: regularize more where signal is weak relative to noise.
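A quick numerical check of the per-component optimum (illustrative values for $d$, $\gamma^*$, $\sigma^2$): minimizing $\text{MSE}_j(\lambda)$ over a fine grid recovers $\lambda^* = \sigma^2/(\gamma^*)^2$ regardless of the eigenvalue $d$:

```python
import numpy as np

sigma2, gamma_star = 1.0, 0.5                  # assumed noise variance and signal
lambdas = np.linspace(1e-4, 20, 200001)        # fine grid of lambda values

for d in [0.1, 1.0, 10.0]:                     # different eigenvalues
    # Per-component MSE: [lambda^2 gamma^2 + sigma^2 d] / (d + lambda)^2
    mse = (lambdas**2 * gamma_star**2 + sigma2 * d) / (d + lambdas) ** 2
    lam_opt = lambdas[np.argmin(mse)]
    print(f"d={d:>4}: grid minimizer {lam_opt:.3f}  vs  "
          f"sigma^2/gamma*^2 = {sigma2 / gamma_star**2:.3f}")
```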
The challenge: we don't know $\boldsymbol{\beta}^*$:
The oracle optimal λ depends on $\boldsymbol{\beta}^*$, which is exactly what we're trying to estimate. This creates a circular problem: choosing the best λ requires knowing the very coefficients that λ is supposed to help us estimate.
Practical solutions include data-driven selection rules such as cross-validation, generalized cross-validation (GCV), and information criteria.
We'll explore these methods in detail in the next page.
Existence of improvement:
A fundamental theorem guarantees that Ridge regression (for some $\lambda > 0$) always achieves lower MSE than OLS, except in degenerate cases.
Theorem (Hoerl and Kennard, 1970):
For any true parameter $\boldsymbol{\beta}^* \neq \mathbf{0}$ and any design matrix $\mathbf{X}$, there exists $\lambda > 0$ such that:
$$\text{MSE}(\hat{\boldsymbol{\beta}}_\lambda) < \text{MSE}(\hat{\boldsymbol{\beta}}_{\text{OLS}})$$
This is because the MSE curve has negative slope at $\lambda = 0$: the marginal variance reduction from a small λ always exceeds the marginal bias increase.
Magnitude of improvement:
The amount of MSE reduction depends chiefly on the conditioning of $\mathbf{X}^T\mathbf{X}$ (how small its smallest eigenvalues are) and the signal-to-noise ratio (the size of $\boldsymbol{\beta}^*$ relative to $\sigma^2$).
Quantitative bound:
For the optimal Ridge estimator:
$$\frac{\text{MSE}(\hat{\boldsymbol{\beta}}_{\lambda^*})}{\text{MSE}(\hat{\boldsymbol{\beta}}_{\text{OLS}})} \leq \frac{\text{effective df}}{p}$$
where effective df = $\sum_j d_j/(d_j + \lambda^*)$. Since effective df < p, this ratio is always less than 1.
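To see the guaranteed improvement concretely, the sketch below evaluates the MSE decomposition on the same illustrative spectrum used earlier and compares the best Ridge MSE with the OLS MSE; it also prints the effective-df ratio for reference (this is an illustration, not a proof of the bound):

```python
import numpy as np

d = np.array([100.0, 10.0, 1.0, 0.1, 0.01])    # eigenvalues (illustrative)
gamma_star = np.ones_like(d)                    # assumed signal per direction
sigma2 = 1.0

lambdas = np.logspace(-4, 3, 2000)
bias2 = np.array([np.sum((lam / (d + lam))**2 * gamma_star**2) for lam in lambdas])
var   = np.array([sigma2 * np.sum(d / (d + lam)**2) for lam in lambdas])
mse   = bias2 + var

mse_ols = sigma2 * np.sum(1.0 / d)              # OLS: unbiased, variance sigma^2 sum 1/d_j
i_opt = np.argmin(mse)
lam_opt = lambdas[i_opt]
eff_df = np.sum(d / (d + lam_opt))

print(f"OLS MSE          : {mse_ols:.3f}")
print(f"Ridge MSE (best) : {mse[i_opt]:.3f} at lambda = {lam_opt:.3f}")
print(f"MSE ratio        : {mse[i_opt] / mse_ols:.4f}")
print(f"effective df / p : {eff_df / len(d):.4f}")
```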
| Scenario | OLS Problem | Ridge Advantage |
|---|---|---|
| High dimensionality (p ≈ n) | Variance explosion, potential non-existence | Stabilizes estimates, guarantees solution |
| Multicollinearity | Inflated variances, unstable coefficients | Shrinks correlated coefficients together |
| Low signal-to-noise | Estimates dominated by noise | Shrinks noisy estimates toward zero |
| Small sample size | Insufficient data to estimate all parameters | Borrows strength via regularization |
| Many weak predictors | Each predictor's effect is uncertain | Shrinks weak effects, preserves strong ones |
Ridge provides minimal advantage when: (1) X^TX is well-conditioned (small condition number), (2) SNR is high (noise is small relative to signal), (3) n >> p (many observations relative to parameters), (4) true coefficients are large (shrinkage toward zero is harmful). In these cases, OLS and Ridge perform similarly.
We've focused on estimation MSE: how well we estimate $\boldsymbol{\beta}^*$. But often we care more about prediction: how well we predict new responses.
Prediction risk:
For a new observation $\mathbf{x}_{\text{new}}$ with true response $y_{\text{new}} = \mathbf{x}_{\text{new}}^T\boldsymbol{\beta}^* + \epsilon_{\text{new}}$, the prediction risk is:
$$\text{Prediction MSE} = \mathbb{E}[(\hat{y}_{\text{new}} - y_{\text{new}})^2]$$
where $\hat{y}_{\text{new}} = \mathbf{x}_{\text{new}}^T \hat{\boldsymbol{\beta}}_\lambda$.
Decomposition:
$$\text{Prediction MSE} = \underbrace{\sigma^2}_{\text{Irreducible}} + \underbrace{\mathbf{x}_{\text{new}}^T \, \mathbf{M}(\hat{\boldsymbol{\beta}}_\lambda) \, \mathbf{x}_{\text{new}}}_{\text{Estimation error contribution}}$$
where $\mathbf{M}(\hat{\boldsymbol{\beta}}_\lambda) = \text{Bias}\,\text{Bias}^T + \text{Var}(\hat{\boldsymbol{\beta}}_\lambda)$ is the estimation MSE matrix (the scalar MSE above is its trace). Shrinking this matrix therefore reduces the prediction MSE at any fixed test point.
In-sample vs. out-of-sample:
Ridge closes the generalization gap by trading some training fit for better generalization.
Expected out-of-sample error:
For random test points $\mathbf{x}_{\text{new}}$ with the same distribution as training:
$$\mathbb{E}[\text{Prediction MSE}] = \sigma^2 + \text{tr}(\boldsymbol{\Sigma}_x \cdot \text{MSE Matrix})$$
where $\boldsymbol{\Sigma}_x$ is the covariance of the features.
This shows that prediction performance integrates estimation MSE over the feature distribution—directionally weighting by where test points are likely to fall.
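Here is a minimal sketch of that identity (mean-zero Gaussian features with an assumed covariance $\boldsymbol{\Sigma}_x$, assumed $\boldsymbol{\beta}^*$ and $\sigma^2$): it evaluates $\sigma^2 + \text{tr}(\boldsymbol{\Sigma}_x \mathbf{M})$, with $\mathbf{M}$ the estimation MSE matrix for a fixed training design, and checks it against a Monte Carlo estimate over fresh test points:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam, sigma = 60, 4, 3.0, 1.0
beta_star = np.array([1.0, -0.5, 0.25, 0.0])   # assumed true coefficients
Sigma_x = np.diag([2.0, 1.0, 0.5, 0.1])        # assumed feature covariance (mean-zero features)

# One fixed training design drawn from the feature distribution
X = rng.multivariate_normal(np.zeros(p), Sigma_x, size=n)
A_inv = np.linalg.inv(X.T @ X + lam * np.eye(p))

# Estimation MSE matrix: M = bias bias^T + Var
bias = -lam * A_inv @ beta_star
Var = sigma**2 * A_inv @ X.T @ X @ A_inv
M = np.outer(bias, bias) + Var

formula = sigma**2 + np.trace(Sigma_x @ M)

# Monte Carlo: average squared prediction error over noise draws and fresh test points
n_sims, n_test = 2000, 200
errs = []
for _ in range(n_sims):
    y = X @ beta_star + sigma * rng.normal(size=n)
    beta_hat = A_inv @ X.T @ y
    X_new = rng.multivariate_normal(np.zeros(p), Sigma_x, size=n_test)
    y_new = X_new @ beta_star + sigma * rng.normal(size=n_test)
    errs.append(np.mean((X_new @ beta_hat - y_new) ** 2))

print(f"formula: {formula:.4f}   Monte Carlo: {np.mean(errs):.4f}")
```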
Whether you care about estimating coefficients or predicting new outcomes, Ridge regression with appropriate λ typically outperforms OLS. The bias-variance tradeoff operates similarly for both objectives: trading some accuracy for stability improves overall performance.
We've quantified precisely how Ridge regression trades bias for variance reduction: exact formulas for the bias and variance in the eigenbasis of $\mathbf{X}^T\mathbf{X}$, the resulting MSE decomposition, the per-component oracle $\lambda_j^* = \sigma^2/(\gamma_j^*)^2$, and the Hoerl-Kennard guarantee that some $\lambda > 0$ always improves on OLS.
What's next:
The theoretical analysis is complete, but we face a practical challenge: how do we choose λ when we don't know the true coefficients? The next page covers methods for selecting the regularization strength, including cross-validation, GCV, and information criteria.
You now understand the mathematical foundations of the bias-variance tradeoff in Ridge regression. This analysis explains why a biased estimator can outperform an unbiased one—a counterintuitive but fundamental insight in statistical learning.