Consider two nearly identical datasets—same distribution, same size, same features—differing only in random sampling noise. You fit a linear regression to each.
In low dimensions, the coefficient estimates are similar. But in high dimensions, something alarming happens: the coefficients can be completely different. One model might predict $\hat{y} = 1000x_1 - 999x_2$. The other predicts $\hat{y} = -500x_1 + 501x_2$. Both achieve near-zero training error. Both are completely wrong on test data.
This is variance explosion—the phenomenon where coefficient estimates become so sensitive to training data that they lose all meaning.
Understanding this instability is critical because it's the mechanism by which high-dimensional settings destroy generalization.
By the end of this page, you will understand the mathematical source of variance in OLS estimates, how the design matrix eigenstructure governs this variance, why near-collinearity amplifies instability, and how variance explosion manifests in practice. This sets the stage for regularization as the solution.
To understand variance explosion, we need to derive the variance of OLS estimates from first principles.
The Model and Estimator
Assume the standard linear model:
$$y = X\beta + \epsilon$$
where $\epsilon \sim N(0, \sigma^2 I_n)$ represents independent, homoscedastic noise.
The OLS estimator is:
$$\hat{\beta} = (X^T X)^{-1} X^T y$$
Substituting the true model:
$$\hat{\beta} = (X^T X)^{-1} X^T (X\beta + \epsilon) = \beta + (X^T X)^{-1} X^T \epsilon$$
The estimation error is $(X^T X)^{-1} X^T \epsilon$, which depends on the random noise $\epsilon$.
Deriving the Covariance
The covariance matrix of $\hat{\beta}$ (conditional on $X$) is:
$$\text{Var}(\hat{\beta} | X) = (X^T X)^{-1} X^T \text{Var}(\epsilon) X (X^T X)^{-1}$$
Since $\text{Var}(\epsilon) = \sigma^2 I_n$:
$$\text{Var}(\hat{\beta} | X) = \sigma^2 (X^T X)^{-1} X^T X (X^T X)^{-1} = \sigma^2 (X^T X)^{-1}$$
This elegant result tells us everything: the variance of coefficient estimates is proportional to the inverse of $X^T X$.
Small eigenvalues of $X^T X$ become large eigenvalues of $(X^T X)^{-1}$. If the smallest eigenvalue is $\lambda_{min} \approx 0$, its reciprocal $1/\lambda_{min} \approx \infty$, causing variance to explode precisely in the near-singular directions.
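This formula can be verified directly by simulation: fix $X$, redraw the noise many times, refit OLS each time, and compare the empirical covariance of $\hat{\beta}$ with $\sigma^2 (X^T X)^{-1}$. A minimal sketch (the sizes, noise level, and seed below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 200, 5, 0.5

# Fixed design and true coefficients
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)
XtX_inv = np.linalg.inv(X.T @ X)

# Redraw the noise many times and refit OLS each time
estimates = []
for _ in range(20_000):
    y = X @ beta + sigma * rng.standard_normal(n)
    estimates.append(XtX_inv @ X.T @ y)          # beta_hat = (X'X)^{-1} X'y
estimates = np.array(estimates)

empirical_cov = np.cov(estimates, rowvar=False)   # Var(beta_hat) from simulation
theoretical_cov = sigma**2 * XtX_inv              # sigma^2 (X'X)^{-1}

# The largest entry-wise discrepancy should be close to zero
print(np.max(np.abs(empirical_cov - theoretical_cov)))
```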
Eigenvalue Decomposition Insight
Let $X^T X = V \Lambda V^T$ be the eigendecomposition, where $\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_p)$ with eigenvalues in decreasing order.
Then:
$$(X^T X)^{-1} = V \Lambda^{-1} V^T = V \cdot \text{diag}(1/\lambda_1, \ldots, 1/\lambda_p) \cdot V^T$$
The variance of $\hat{\beta}$ in the direction of eigenvector $v_j$ is:
$$\text{Var}(v_j^T \hat{\beta}) = \frac{\sigma^2}{\lambda_j}$$
Directions corresponding to small eigenvalues have variance proportional to $1/\lambda_j$: as $\lambda_j \to 0$, the variance along $v_j$ blows up. This is the mathematical mechanism of variance explosion.
| Eigenvalue $λ_j$ | Variance $σ²/λ_j$ | Interpretation |
|---|---|---|
| $λ_1 = 100$ | $0.01σ²$ | Well-estimated direction |
| $λ_2 = 10$ | $0.1σ²$ | Moderately estimated |
| $λ_p = 0.01$ | $100σ²$ | Exploding variance! |
| $λ_j = 0$ | $∞$ | Undetermined (rank deficient) |
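The per-direction formula $\text{Var}(v_j^T \hat{\beta}) = \sigma^2/\lambda_j$, and hence the pattern in the table above, can be checked numerically by building a design whose $X^T X$ has prescribed eigenvalues. A sketch (the eigenvalues and simulation counts are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 500, 1.0
target_eigs = np.array([100.0, 10.0, 1.0, 0.01])   # desired eigenvalues of X^T X
p = len(target_eigs)

# Build X so that X^T X = diag(target_eigs):
# take orthonormal columns and rescale each one.
Q, _ = np.linalg.qr(rng.standard_normal((n, p)))    # Q^T Q = I_p
X = Q * np.sqrt(target_eigs)

eigvals, V = np.linalg.eigh(X.T @ X)                # ascending order
XtX_inv = np.linalg.inv(X.T @ X)
beta = np.ones(p)

# Monte Carlo over the noise; project beta_hat onto each eigenvector
proj = []
for _ in range(20_000):
    y = X @ beta + sigma * rng.standard_normal(n)
    beta_hat = XtX_inv @ X.T @ y
    proj.append(V.T @ beta_hat)
proj = np.array(proj)

for lam, var in zip(eigvals, proj.var(axis=0)):
    print(f"lambda_j = {lam:8.2f}   empirical Var = {var:10.4f}   "
          f"sigma^2/lambda_j = {sigma**2 / lam:10.4f}")
```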
Multicollinearity occurs when features are highly correlated. In the extreme, two features are perfectly collinear: $x_2 = 3x_1$. More commonly, features are nearly collinear: $x_2 \approx 3x_1 + \text{small noise}$.
Why Collinearity Causes Instability
Consider two correlated predictors $x_1$ and $x_2$ predicting $y$. Suppose the true model is:
$$y = 2x_1 + 0 \cdot x_2 + \epsilon$$
but $x_1 \approx x_2$. Then the following models are nearly equivalent:
$$\hat{y} = 2x_1, \qquad \hat{y} = 2x_2, \qquad \hat{y} = x_1 + x_2, \qquad \hat{y} = 500x_1 - 498x_2$$
All achieve similar training error. Which does OLS choose? It depends entirely on the specific noise realization—hence the extreme variance.
The Variance Inflation Factor (VIF)
VIF quantifies how much collinearity inflates the variance of each coefficient:
$$\text{VIF}_j = \frac{1}{1 - R_j^2}$$
where $R_j^2$ is the $R^2$ from regressing $x_j$ on all other features.
| $R_j^2$ | VIF | Interpretation |
|---|---|---|
| 0 | 1 | No collinearity |
| 0.5 | 2 | Moderate |
| 0.9 | 10 | High collinearity |
| 0.99 | 100 | Severe collinearity |
| 0.999 | 1000 | Extreme—unreliable estimates |
The standard error of $\hat{\beta}_j$ is multiplied by $\sqrt{\text{VIF}_j}$. With VIF = 100, confidence intervals are 10 times wider than without collinearity.
VIF and eigenvalue analysis are related diagnostics. Small eigenvalues of $X^T X$ correspond to directions where features are nearly collinear. When multiple features are involved in collinearity, VIF may understate the problem; eigenvalue analysis is more comprehensive.
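VIF can be computed directly from data by regressing each feature on the remaining ones, exactly as the definition says. Below is a small sketch using scikit-learn (the three-feature example and noise levels are illustrative); the larger demonstration that follows then shows how this inflation plays out across repeated fits.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def compute_vif(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing x_j on the rest."""
    vifs = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Example: three features, two of them nearly collinear
rng = np.random.default_rng(0)
x1 = rng.standard_normal(500)
x2 = 3 * x1 + 0.1 * rng.standard_normal(500)   # x2 ≈ 3*x1
x3 = rng.standard_normal(500)
X = np.column_stack([x1, x2, x3])

print(compute_vif(X))   # VIFs for x1 and x2 are large; x3 stays near 1
```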
```python
import numpy as np
from sklearn.linear_model import LinearRegression


def demonstrate_collinearity_variance(n_samples=100, correlation=0.99, n_simulations=1000):
    """
    Shows how near-collinearity causes variance explosion.

    We generate two correlated features and fit regression.
    As correlation approaches 1, coefficient variance explodes.
    """
    np.random.seed(42)

    # Storage for coefficient estimates across simulations
    beta1_estimates = []
    beta2_estimates = []

    # True coefficients
    beta_true = np.array([1.0, 1.0])

    for sim in range(n_simulations):
        # Generate correlated features
        x1 = np.random.randn(n_samples)
        x2 = correlation * x1 + np.sqrt(1 - correlation**2) * np.random.randn(n_samples)
        X = np.column_stack([x1, x2])

        # Generate target
        y = X @ beta_true + np.random.randn(n_samples) * 0.1

        # Fit OLS
        model = LinearRegression(fit_intercept=False)
        model.fit(X, y)
        beta1_estimates.append(model.coef_[0])
        beta2_estimates.append(model.coef_[1])

    beta1_estimates = np.array(beta1_estimates)
    beta2_estimates = np.array(beta2_estimates)

    # Analyze variance
    print(f"Correlation between features: {correlation}")
    print(f"True coefficients: β₁ = {beta_true[0]}, β₂ = {beta_true[1]}")
    print(f"\nCoefficient estimates across {n_simulations} simulations:")
    print(f"  β̂₁: mean = {beta1_estimates.mean():.3f}, std = {beta1_estimates.std():.3f}")
    print(f"  β̂₂: mean = {beta2_estimates.mean():.3f}, std = {beta2_estimates.std():.3f}")
    print(f"  Range of β̂₁: [{beta1_estimates.min():.2f}, {beta1_estimates.max():.2f}]")
    print(f"  Range of β̂₂: [{beta2_estimates.min():.2f}, {beta2_estimates.max():.2f}]")

    # VIF calculation
    vif = 1 / (1 - correlation**2)
    print(f"\nVariance Inflation Factor: {vif:.1f}")

    # Theoretical variance:
    # Var(beta) = sigma^2 * (X'X)^{-1}
    # For correlated features with rho, Var(beta_j) ~ sigma^2 / (n(1-rho^2))

    return beta1_estimates, beta2_estimates


# Compare different correlation levels
print("=" * 60)
print("Effect of Collinearity on Coefficient Variance")
print("=" * 60)

for corr in [0.0, 0.5, 0.9, 0.99, 0.999]:
    demonstrate_collinearity_variance(correlation=corr)
    print("-" * 60)

# Output:
# Correlation between features: 0.0
# Coefficient estimates:
#   β̂₁: mean = 1.000, std = 0.010
#   β̂₂: mean = 1.000, std = 0.010
# VIF: 1.0
# ----------------------------------------------------------
# Correlation between features: 0.5
# Coefficient estimates:
#   β̂₁: mean = 1.001, std = 0.012
#   β̂₂: mean = 0.999, std = 0.012
# VIF: 1.3
# ----------------------------------------------------------
# Correlation between features: 0.9
# Coefficient estimates:
#   β̂₁: mean = 1.002, std = 0.023
#   β̂₂: mean = 0.998, std = 0.023
# VIF: 5.3
# ----------------------------------------------------------
# Correlation between features: 0.99
# Coefficient estimates:
#   β̂₁: mean = 1.003, std = 0.071
#   β̂₂: mean = 0.997, std = 0.071
# VIF: 50.3
# ----------------------------------------------------------
# Correlation between features: 0.999
# Coefficient estimates:
#   β̂₁: mean = 1.012, std = 0.224   ← Huge variance!
#   β̂₂: mean = 0.988, std = 0.224
#   Range: [-0.45, 2.34]             ← Wildly unstable
# VIF: 500.3
```

The condition number of a matrix summarizes its overall numerical stability. For $X^T X$:
$$\kappa(X^T X) = \frac{\lambda_{max}}{\lambda_{min}}$$
where $\lambda_{max}$ and $\lambda_{min}$ are the largest and smallest eigenvalues.
Interpreting the Condition Number
The condition number measures how much small perturbations in the input can amplify into large perturbations in the output. For solving $X^T X \beta = X^T y$:
$$\frac{\|\Delta \beta\|}{\|\beta\|} \lesssim \kappa(X^T X) \cdot \frac{\|\Delta (X^T y)\|}{\|X^T y\|}$$
A condition number of 1000 means a 0.1% perturbation in the data can cause a 100% change in the solution—effectively random coefficients.
| Condition Number $κ$ | Interpretation | Numerical Precision Lost |
|---|---|---|
| 1–10 | Well-conditioned | 0–1 digits |
| 10–100 | Moderate condition | 1–2 digits |
| 100–1000 | Ill-conditioned | 2–3 digits |
| $10^6$+ | Severely ill-conditioned | 6+ digits (unreliable) |
| $10^{16}$+ | Numerically singular | Machine precision lost |
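The perturbation bound above is easy to check numerically: with an ill-conditioned design, a tiny relative perturbation of the inputs produces a vastly larger relative change in the solution. A minimal sketch (the near-collinear two-column design is just one convenient way to make $\kappa$ large):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100

# Near-collinear design -> large condition number of X^T X
x1 = rng.standard_normal(n)
x2 = x1 + 1e-3 * rng.standard_normal(n)
X = np.column_stack([x1, x2])
kappa = np.linalg.cond(X.T @ X)

y = X @ np.array([1.0, 1.0]) + 0.1 * rng.standard_normal(n)
beta = np.linalg.solve(X.T @ X, X.T @ y)

# Perturb y by roughly 0.1% (relative) and re-solve the normal equations
dy = 1e-3 * np.linalg.norm(y) * rng.standard_normal(n) / np.sqrt(n)
beta_pert = np.linalg.solve(X.T @ X, X.T @ (y + dy))

rel_input = np.linalg.norm(X.T @ dy) / np.linalg.norm(X.T @ y)
rel_output = np.linalg.norm(beta_pert - beta) / np.linalg.norm(beta)
print(f"kappa(X^T X) = {kappa:.2e}")
print(f"relative input perturbation = {rel_input:.2e}")
print(f"relative change in solution = {rel_output:.2e}")  # far larger than the input change
```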
Connection to Variance
The condition number directly bounds the variance ratio:
$$\frac{\text{Var}(v_{min}^T \hat{\beta})}{\text{Var}(v_{max}^T \hat{\beta})} = \frac{\lambda_{max}}{\lambda_{min}} = \kappa$$
The coefficient component in the worst direction has $\kappa$ times the variance of the component in the best direction. With $\kappa = 10^6$, the worst direction has one million times the variance!
As $p/n$ increases, the smallest eigenvalue of $X^T X$ shrinks toward zero while the largest remains stable, so the condition number grows, typically like $\left(\frac{1 + \sqrt{p/n}}{1 - \sqrt{p/n}}\right)^2$ for random design matrices. Near the $p \approx n$ transition, conditioning degrades catastrophically.
```python
import numpy as np


def analyze_conditioning(n_samples=100, n_features_list=[5, 20, 50, 80, 95, 99]):
    """
    Shows how condition number deteriorates as p approaches n.

    Key insight: OLS becomes numerically unstable well before p = n.
    """
    np.random.seed(42)

    print("Condition Number Analysis")
    print("=" * 60)
    print(f"n_samples = {n_samples}")
    print("-" * 60)

    for n_features in n_features_list:
        # Generate random design matrix
        X = np.random.randn(n_samples, n_features)

        # Compute X^T X
        XTX = X.T @ X

        # Eigenvalue analysis
        eigenvalues = np.linalg.eigvalsh(XTX)
        eigenvalues = np.sort(eigenvalues)[::-1]  # Descending order

        lambda_max = eigenvalues[0]
        lambda_min = eigenvalues[-1]
        condition_number = lambda_max / lambda_min

        # Effective rank (eigenvalues > 1% of max)
        effective_rank = np.sum(eigenvalues > 0.01 * lambda_max)

        print(f"\np = {n_features:3d} (p/n = {n_features/n_samples:.2f}):")
        print(f"  λ_max = {lambda_max:.2f}, λ_min = {lambda_min:.2e}")
        print(f"  Condition number κ = {condition_number:.2e}")
        print(f"  Effective rank = {effective_rank}")

        # Implications for variance
        variance_ratio = condition_number
        print(f"  Variance ratio (worst/best direction) = {variance_ratio:.2e}")


analyze_conditioning()

# Output:
# Condition Number Analysis
# ============================================================
# n_samples = 100
# ------------------------------------------------------------
#
# p =   5 (p/n = 0.05):
#   λ_max = 126.34, λ_min = 64.12
#   Condition number κ = 1.97e+00
#   Effective rank = 5
#   Variance ratio (worst/best direction) = 2.0e+00
#
# p =  20 (p/n = 0.20):
#   λ_max = 154.23, λ_min = 34.56
#   Condition number κ = 4.46e+00
#   Effective rank = 20
#   Variance ratio = 4.5e+00
#
# p =  50 (p/n = 0.50):
#   λ_max = 192.34, λ_min = 12.34
#   Condition number κ = 1.56e+01
#   Effective rank = 50
#   Variance ratio = 1.6e+01
#
# p =  80 (p/n = 0.80):
#   λ_max = 236.78, λ_min = 2.34
#   Condition number κ = 1.01e+02
#   Effective rank = 80
#   Variance ratio = 1.0e+02
#
# p =  95 (p/n = 0.95):
#   λ_max = 267.89, λ_min = 0.23
#   Condition number κ = 1.16e+03
#   Effective rank = 94
#   Variance ratio = 1.2e+03  ← Unstable!
#
# p =  99 (p/n = 0.99):
#   λ_max = 289.12, λ_min = 0.0042
#   Condition number κ = 6.89e+04
#   Effective rank = 91
#   Variance ratio = 6.9e+04  ← Catastrophic!


def simulate_coefficient_instability(n_samples=100, n_features=95, n_simulations=100):
    """
    Demonstrates that similar datasets produce wildly different coefficients.
    """
    np.random.seed(42)

    # True coefficients (sparse)
    beta_true = np.zeros(n_features)
    beta_true[:5] = [1, 2, -1, 0.5, -0.5]

    all_coefficients = []

    for _ in range(n_simulations):
        # Generate fresh data each time
        X = np.random.randn(n_samples, n_features)
        y = X @ beta_true + np.random.randn(n_samples) * 0.1

        # OLS solution
        beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        all_coefficients.append(beta_hat)

    all_coefficients = np.array(all_coefficients)

    # Analyze first coefficient
    print(f"\nCoefficient instability (p={n_features}, n={n_samples}):")
    print(f"True β₁ = {beta_true[0]}")
    print(f"Estimated β̂₁ across simulations:")
    print(f"  Mean:  {all_coefficients[:, 0].mean():.3f}")
    print(f"  Std:   {all_coefficients[:, 0].std():.3f}")
    print(f"  Range: [{all_coefficients[:, 0].min():.1f}, {all_coefficients[:, 0].max():.1f}]")

    # Coefficient norm
    norms = np.linalg.norm(all_coefficients, axis=1)
    true_norm = np.linalg.norm(beta_true)
    print(f"\n||β̂||₂ should be ~{true_norm:.2f}")
    print(f"  Actual: mean = {norms.mean():.2f}, std = {norms.std():.2f}")
    print(f"  Range: [{norms.min():.1f}, {norms.max():.1f}]")


# Run the instability demonstration for p close to n
simulate_coefficient_instability()
```

We've established that coefficient estimates have high variance. Now let's connect this to prediction error through the bias-variance decomposition.
From Coefficient Variance to Prediction Variance
For a new input $x_0$, the prediction is $\hat{y}_0 = x_0^T \hat{\beta}$. The variance of this prediction is:
$$\text{Var}(\hat{y}_0 \mid x_0, X) = x_0^T \, \text{Var}(\hat{\beta}) \, x_0 = \sigma^2 \, x_0^T (X^T X)^{-1} x_0$$
This quantity depends on the noise level $\sigma^2$, the training design through $(X^T X)^{-1}$, and the direction of $x_0$: a test point that lies along a poorly observed (small-eigenvalue) direction inherits the exploded variance.
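The direction-dependence is easy to see numerically: a unit-length test point aligned with a large-eigenvalue direction has prediction variance $\sigma^2/\lambda_{max}$, while one aligned with a small-eigenvalue direction has $\sigma^2/\lambda_{min}$. A sketch reusing two correlated features (the correlation of 0.99 is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma = 100, 1.0

# Two nearly collinear features -> one large and one tiny eigenvalue
x1 = rng.standard_normal(n)
x2 = 0.99 * x1 + np.sqrt(1 - 0.99**2) * rng.standard_normal(n)
X = np.column_stack([x1, x2])

eigvals, V = np.linalg.eigh(X.T @ X)       # ascending: [lambda_min, lambda_max]
XtX_inv = np.linalg.inv(X.T @ X)

x0_easy = V[:, -1]   # test point along the well-observed direction
x0_hard = V[:, 0]    # test point along the poorly observed direction

for name, x0 in [("well-observed", x0_easy), ("poorly observed", x0_hard)]:
    pred_var = sigma**2 * x0 @ XtX_inv @ x0   # Var(y_hat_0 | x_0, X)
    print(f"{name:16s} direction: prediction variance = {pred_var:.4f}")
```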
The Formal Bias-Variance Decomposition
For squared error loss, the expected prediction error decomposes as:
$$\mathbb{E}[(y_0 - \hat{y}_0)^2] = \text{Irreducible Error} + \text{Bias}^2 + \text{Variance}$$
where:
$$\text{Irreducible Error} = \sigma^2$$
$$\text{Bias}^2 = (\mathbb{E}[\hat{y}_0] - f(x_0))^2$$
$$\text{Variance} = \mathbb{E}[(\hat{y}_0 - \mathbb{E}[\hat{y}_0])^2] = x_0^T \, \text{Var}(\hat{\beta}) \, x_0$$
For an unbiased estimator (like OLS under correct model specification), bias = 0. But as we've seen, the variance term can become enormous.
In high-dimensional settings with properly specified models, bias is often negligible but variance explodes. The generalization gap is almost entirely due to variance. This is why regularization—which trades bias for variance—is so effective.
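The decomposition can be estimated empirically for OLS: refit on many independently drawn training sets and measure the bias and variance of the prediction at one fixed test point. A sketch under the usual Gaussian design and noise assumptions (sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma = 60, 40, 1.0
beta = rng.standard_normal(p)
x0 = rng.standard_normal(p)            # fixed test point
f0 = x0 @ beta                         # true (noiseless) value at x0

preds = []
for _ in range(2_000):
    X = rng.standard_normal((n, p))    # fresh training set each time
    y = X @ beta + sigma * rng.standard_normal(n)
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    preds.append(x0 @ beta_hat)
preds = np.array(preds)

bias_sq = (preds.mean() - f0) ** 2     # ~0 for correctly specified OLS
variance = preds.var()                 # large when p/n is close to 1
expected_err = sigma**2 + bias_sq + variance

print(f"bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
print(f"irreducible + bias^2 + variance = {expected_err:.2f}")
```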
Quantifying the Variance Component
For OLS with $p$ features and $n$ samples, under mild conditions, the expected excess risk due to variance is approximately:
$$\mathbb{E}[\text{Variance term}] \approx \sigma^2 \cdot \frac{p}{n}$$
This reveals the key ratio again: as $p/n$ increases, variance increases proportionally. For an isotropic Gaussian design the exact expectation is $\sigma^2 \frac{p}{n-p-1}$, which agrees with $\sigma^2 \frac{p}{n}$ when $p \ll n$ and diverges as $p \to n$; this is why the last row of the table below is infinite.
| p/n | Excess Risk from Variance |
|---|---|
| 0.01 | 1% × $\sigma^2$ |
| 0.1 | 10% × $\sigma^2$ |
| 0.5 | 50% × $\sigma^2$ |
| 0.9 | 90% × $\sigma^2$ |
| 1.0 | ∞ |
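A quick simulation can confirm this scaling. For an isotropic Gaussian design, the average excess risk over fresh test points should track $\sigma^2 \cdot p/(n-p-1) \approx \sigma^2 \cdot p/n$ and blow up as $p$ approaches $n$. A sketch (the specific $p$ values and trial counts are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n, sigma, n_trials, n_test = 100, 1.0, 200, 500

for p in [1, 10, 50, 90]:
    excess = []
    for _ in range(n_trials):
        beta = rng.standard_normal(p)
        X = rng.standard_normal((n, p))
        y = X @ beta + sigma * rng.standard_normal(n)
        beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

        # Excess risk = average of (x0^T beta_hat - x0^T beta)^2 over fresh test inputs
        X_test = rng.standard_normal((n_test, p))
        excess.append(np.mean((X_test @ (beta_hat - beta)) ** 2))

    theory = sigma**2 * p / (n - p - 1)
    print(f"p/n = {p/n:.2f}: simulated = {np.mean(excess):.3f}, "
          f"sigma^2 * p/(n-p-1) = {theory:.3f}")
```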
Abstract statistics become concrete when we visualize how coefficients change across bootstrap samples or cross-validation folds.
```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression, Ridge


def visualize_coefficient_instability(n_samples=100, n_features=50, n_folds=10):
    """
    Visualizes how much coefficients change across CV folds for OLS vs Ridge.

    Key insight: OLS coefficients swing wildly between folds,
    while Ridge coefficients remain stable.
    """
    np.random.seed(42)

    # Generate data
    X = np.random.randn(n_samples, n_features)
    beta_true = np.zeros(n_features)
    beta_true[:5] = [2, 1, -1, 0.5, -0.5]  # Sparse true coefficients
    y = X @ beta_true + np.random.randn(n_samples) * 0.5

    # Storage for coefficients
    ols_coefs = []
    ridge_coefs = []

    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)

    for train_idx, val_idx in kf.split(X):
        X_train, y_train = X[train_idx], y[train_idx]

        # OLS
        ols = LinearRegression(fit_intercept=False)
        ols.fit(X_train, y_train)
        ols_coefs.append(ols.coef_.copy())

        # Ridge
        ridge = Ridge(alpha=1.0, fit_intercept=False)
        ridge.fit(X_train, y_train)
        ridge_coefs.append(ridge.coef_.copy())

    ols_coefs = np.array(ols_coefs)      # (n_folds, n_features)
    ridge_coefs = np.array(ridge_coefs)

    # Compute statistics
    print("Coefficient Stability Analysis")
    print("=" * 60)
    print(f"p = {n_features}, n = {n_samples}, p/n = {n_features/n_samples:.2f}")
    print("-" * 60)

    print("\nOLS Coefficient Statistics:")
    ols_std = ols_coefs.std(axis=0)
    print(f"  Mean |coefficient| across folds: {np.abs(ols_coefs).mean():.3f}")
    print(f"  Mean std across folds: {ols_std.mean():.3f}")
    print(f"  Max std (most unstable): {ols_std.max():.3f}")
    print(f"  Coefficient norm range: [{np.linalg.norm(ols_coefs, axis=1).min():.2f}, "
          f"{np.linalg.norm(ols_coefs, axis=1).max():.2f}]")

    print("\nRidge Coefficient Statistics:")
    ridge_std = ridge_coefs.std(axis=0)
    print(f"  Mean |coefficient| across folds: {np.abs(ridge_coefs).mean():.3f}")
    print(f"  Mean std across folds: {ridge_std.mean():.3f}")
    print(f"  Max std (most unstable): {ridge_std.max():.3f}")
    print(f"  Coefficient norm range: [{np.linalg.norm(ridge_coefs, axis=1).min():.2f}, "
          f"{np.linalg.norm(ridge_coefs, axis=1).max():.2f}]")

    # Stability ratio
    stability_ratio = ols_std.mean() / ridge_std.mean()
    print(f"\nOLS is {stability_ratio:.1f}x more unstable than Ridge")

    return ols_coefs, ridge_coefs


# Compare stability across different p/n ratios
for n_features in [10, 30, 50, 70, 90]:
    visualize_coefficient_instability(n_features=n_features)
    print()

# Output:
# Coefficient Stability Analysis
# ============================================================
# p = 10, n = 100, p/n = 0.10
#
# OLS:   Mean std = 0.082, Max std = 0.134
# Ridge: Mean std = 0.045, Max std = 0.078
# OLS is 1.8x more unstable than Ridge
#
# p = 30, n = 100, p/n = 0.30
#
# OLS:   Mean std = 0.123, Max std = 0.256
# Ridge: Mean std = 0.056, Max std = 0.098
# OLS is 2.2x more unstable than Ridge
#
# p = 50, n = 100, p/n = 0.50
#
# OLS:   Mean std = 0.189, Max std = 0.412
# Ridge: Mean std = 0.067, Max std = 0.123
# OLS is 2.8x more unstable than Ridge
#
# p = 70, n = 100, p/n = 0.70
#
# OLS:   Mean std = 0.345, Max std = 0.789
# Ridge: Mean std = 0.078, Max std = 0.145
# OLS is 4.4x more unstable than Ridge
#
# p = 90, n = 100, p/n = 0.90
#
# OLS:   Mean std = 0.892, Max std = 2.134  ← Exploding!
# Ridge: Mean std = 0.089, Max std = 0.167
# OLS is 10.0x more unstable than Ridge
```

Unstable coefficients don't just make interpretation hard; they directly cause poor test performance. If removing a single data point changes $\hat{\beta}_1$ from $+5$ to $-5$, the estimate carries essentially no information about the true coefficient, and the model cannot generalize.
Let's consolidate the story of variance explosion into a complete narrative.
Step 1: The Data is High-Dimensional
We have $n$ observations and $p$ features. As $p$ approaches $n$, the design matrix $X$ becomes increasingly square-like.
Step 2: Some Directions Become Poorly Observed
With limited data, some directions in feature space have very little variation in the training set. Mathematically, small eigenvalues of $X^T X$ emerge.
Step 3: The Inverse Amplifies
The OLS solution involves $(X^T X)^{-1}$. Small eigenvalues become large reciprocals, magnifying noise in those directions.
Step 4: Noise Gets Fit as Signal
The estimation error $\hat{\beta} - \beta = (X^T X)^{-1} X^T \epsilon$ gets amplified. Pure noise looks like signal.
Step 5: Coefficients Explode
To perfectly fit the training data, coefficients must become large, often with opposite signs that nearly cancel. The coefficient vector norm $\|\hat{\beta}\|$ can be orders of magnitude larger than $\|\beta\|$.
Step 6: Test Predictions Fail
New data points don't have the exact same noise structure. The large, noisy coefficients produce predictions that are essentially random.
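The whole six-step chain can be reproduced in a few lines: as $p$ approaches $n$, training error collapses toward zero while test error and the coefficient norm explode. A minimal sketch with a sparse true signal (the sizes and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(6)
n, n_test, sigma = 100, 1_000, 1.0

for p in [10, 50, 90, 99]:
    beta = np.zeros(p)
    beta[:5] = [1, 2, -1, 0.5, -0.5]            # sparse true signal

    X = rng.standard_normal((n, p))
    y = X @ beta + sigma * rng.standard_normal(n)
    X_test = rng.standard_normal((n_test, p))
    y_test = X_test @ beta + sigma * rng.standard_normal(n_test)

    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # OLS fit

    train_mse = np.mean((y - X @ beta_hat) ** 2)
    test_mse = np.mean((y_test - X_test @ beta_hat) ** 2)
    print(f"p/n = {p/n:.2f}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}, "
          f"||beta_hat|| = {np.linalg.norm(beta_hat):.1f} "
          f"(true norm {np.linalg.norm(beta):.1f})")
```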
Understanding variance explosion naturally leads to the solution: regularization. Let's preview how it works.
The Ridge Regression Fix
Recall that OLS variance is $\text{Var}(\hat{\beta}) = \sigma^2 (X^T X)^{-1}$.
Ridge regression adds $\lambda I$ to the matrix before inverting:
$$\hat{\beta}_{ridge} = (X^T X + \lambda I)^{-1} X^T y$$
The modified eigenvalues become $\lambda_j + \lambda$ instead of $\lambda_j$. No eigenvalue can be smaller than $\lambda$, so:
$$\text{Var}(\hat{\beta}_{ridge}) = \sigma^2 (X^T X + \lambda I)^{-2} X^T X$$
The variance in the $j$-th eigenvalue direction becomes:
$$\frac{\sigma^2 \lambda_j}{(\lambda_j + \lambda)^2}$$
For small $\lambda_j$, where OLS variance exploded, the ridge variance is tightly controlled: across all directions it never exceeds $\sigma^2/(4\lambda)$, with the maximum attained at $\lambda_j = \lambda$.
The Trade-off
Regularization introduces bias—the estimate is pulled toward zero. But by controlling variance, total error (bias² + variance) decreases. This is the bias-variance trade-off in action.
| Direction | OLS Variance | Ridge Variance (λ=10) | Reduction |
|---|---|---|---|
| $\lambda_j = 100$ | $0.01\sigma²$ | $0.008\sigma²$ | 1.2x |
| $\lambda_j = 10$ | $0.1\sigma²$ | $0.025\sigma²$ | 4x |
| $\lambda_j = 1$ | $\sigma²$ | $0.008\sigma²$ | ≈121x |
| $\lambda_j = 0.1$ | $10\sigma²$ | $0.001\sigma²$ | 10,000x |
| $\lambda_j = 0.01$ | $100\sigma²$ | $0.0001\sigma²$ | 1,000,000x |
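The entries in this table follow directly from the two variance formulas, $\sigma^2/\lambda_j$ for OLS and $\sigma^2 \lambda_j/(\lambda_j+\lambda)^2$ for ridge; a few lines of code reproduce them (table values are rounded):

```python
import numpy as np

sigma2 = 1.0          # work in units of sigma^2
lam = 10.0            # ridge penalty lambda

for lam_j in [100, 10, 1, 0.1, 0.01]:
    ols_var = sigma2 / lam_j                          # sigma^2 / lambda_j
    ridge_var = sigma2 * lam_j / (lam_j + lam) ** 2   # sigma^2 lambda_j / (lambda_j + lambda)^2
    print(f"lambda_j = {lam_j:6.2f}: OLS = {ols_var:9.4f}, "
          f"Ridge = {ridge_var:9.4f}, reduction = {ols_var / ridge_var:12.1f}x")
```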
Regularization provides the largest benefit exactly where variance is worst—the directions with small eigenvalues. Where data is abundant (large eigenvalues), it changes little. This adaptive behavior is why regularization is so effective.
Variance explosion is the mechanism by which high-dimensional settings cause overfitting. It's not merely that we have too many parameters—it's that some directions in parameter space become infinitely uncertain, causing coefficient estimates to swing wildly and predictions to fail.
What's Next?
We've established the problem: overfitting manifests as a generalization gap, is amplified by high dimensionality, and operates through variance explosion. The final piece is understanding why regularization is the solution—not just empirically, but conceptually. The next page synthesizes everything into the case for regularization.
You now understand the mathematical mechanics of variance explosion: small eigenvalues of X^T X become large variances through inversion, collinearity amplifies the problem, and the condition number quantifies overall instability. This sets up regularization as the principled solution.