Consider two nearly identical datasets—same distribution, same size, same features—differing only in random sampling noise. You fit a linear regression to each.
In low dimensions, the coefficient estimates are similar. But in high dimensions, something alarming happens: the coefficients can be completely different. One model might predict $\hat{y} = 1000x_1 - 999x_2$. The other predicts $\hat{y} = -500x_1 + 501x_2$. Both achieve near-zero training error. Both are completely wrong on test data.
This is variance explosion—the phenomenon where coefficient estimates become so sensitive to training data that they lose all meaning.
Understanding this instability is critical because it's the mechanism by which high-dimensional settings destroy generalization.
By the end of this page, you will understand the mathematical source of variance in OLS estimates, how the design matrix eigenstructure governs this variance, why near-collinearity amplifies instability, and how variance explosion manifests in practice. This sets the stage for regularization as the solution.
To understand variance explosion, we need to derive the variance of OLS estimates from first principles.
The Model and Estimator
Assume the standard linear model:
$$y = X\beta + \epsilon$$
where $\epsilon \sim N(0, \sigma^2 I_n)$ represents independent, homoscedastic noise.
The OLS estimator is:
$$\hat{\beta} = (X^T X)^{-1} X^T y$$
Substituting the true model:
$$\hat{\beta} = (X^T X)^{-1} X^T (X\beta + \epsilon) = \beta + (X^T X)^{-1} X^T \epsilon$$
The estimation error is $(X^T X)^{-1} X^T \epsilon$, which depends on the random noise $\epsilon$.
Deriving the Covariance
The covariance matrix of $\hat{\beta}$ (conditional on $X$) is:
$$\text{Var}(\hat{\beta} | X) = (X^T X)^{-1} X^T \text{Var}(\epsilon) X (X^T X)^{-1}$$
Since $\text{Var}(\epsilon) = \sigma^2 I_n$:
$$\text{Var}(\hat{\beta} | X) = \sigma^2 (X^T X)^{-1} X^T X (X^T X)^{-1} = \sigma^2 (X^T X)^{-1}$$
This elegant result tells us everything: the variance of coefficient estimates is proportional to the inverse of $X^T X$.
Small eigenvalues of $X^T X$ become large eigenvalues of $(X^T X)^{-1}$. If the smallest eigenvalue is $\lambda_{min} \approx 0$, its reciprocal $1/\lambda_{min} \approx \infty$, causing variance to explode precisely in the near-singular directions.
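This formula can be verified directly by simulation: fix $X$, redraw the noise many times, refit OLS each time, and compare the empirical covariance of $\hat{\beta}$ with $\sigma^2 (X^T X)^{-1}$. A minimal sketch (the sizes, noise level, and seed below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 200, 5, 0.5

# Fixed design and true coefficients
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)
XtX_inv = np.linalg.inv(X.T @ X)

# Redraw the noise many times and refit OLS each time
estimates = []
for _ in range(20_000):
    y = X @ beta + sigma * rng.standard_normal(n)
    estimates.append(XtX_inv @ X.T @ y)          # beta_hat = (X'X)^{-1} X'y
estimates = np.array(estimates)

empirical_cov = np.cov(estimates, rowvar=False)   # Var(beta_hat) from simulation
theoretical_cov = sigma**2 * XtX_inv              # sigma^2 (X'X)^{-1}

# The largest entry-wise discrepancy should be close to zero
print(np.max(np.abs(empirical_cov - theoretical_cov)))
```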
Eigenvalue Decomposition Insight
Let $X^T X = V \Lambda V^T$ be the eigendecomposition, where $\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_p)$ with eigenvalues in decreasing order.
Then:
$$(X^T X)^{-1} = V \Lambda^{-1} V^T = V \cdot \text{diag}(1/\lambda_1, \ldots, 1/\lambda_p) \cdot V^T$$
The variance of $\hat{\beta}$ in the direction of eigenvector $v_j$ is:
$$\text{Var}(v_j^T \hat{\beta}) = \frac{\sigma^2}{\lambda_j}$$
Directions corresponding to small eigenvalues have variance proportional to $1/\lambda_j$: as $\lambda_j \to 0$, the variance along $v_j$ blows up. This is the mathematical mechanism of variance explosion.
| Eigenvalue $λ_j$ | Variance $σ²/λ_j$ | Interpretation |
|---|---|---|
| $λ_1 = 100$ | $0.01σ²$ | Well-estimated direction |
| $λ_2 = 10$ | $0.1σ²$ | Moderately estimated |
| $λ_p = 0.01$ | $100σ²$ | Exploding variance! |
| $λ_j = 0$ | $∞$ | Undetermined (rank deficient) |
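The per-direction formula $\text{Var}(v_j^T \hat{\beta}) = \sigma^2/\lambda_j$, and hence the pattern in the table above, can be checked numerically by building a design whose $X^T X$ has prescribed eigenvalues. A sketch (the eigenvalues and simulation counts are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 500, 1.0
target_eigs = np.array([100.0, 10.0, 1.0, 0.01])   # desired eigenvalues of X^T X
p = len(target_eigs)

# Build X so that X^T X = diag(target_eigs):
# take orthonormal columns and rescale each one.
Q, _ = np.linalg.qr(rng.standard_normal((n, p)))    # Q^T Q = I_p
X = Q * np.sqrt(target_eigs)

eigvals, V = np.linalg.eigh(X.T @ X)                # ascending order
XtX_inv = np.linalg.inv(X.T @ X)
beta = np.ones(p)

# Monte Carlo over the noise; project beta_hat onto each eigenvector
proj = []
for _ in range(20_000):
    y = X @ beta + sigma * rng.standard_normal(n)
    beta_hat = XtX_inv @ X.T @ y
    proj.append(V.T @ beta_hat)
proj = np.array(proj)

for lam, var in zip(eigvals, proj.var(axis=0)):
    print(f"lambda_j = {lam:8.2f}   empirical Var = {var:10.4f}   "
          f"sigma^2/lambda_j = {sigma**2 / lam:10.4f}")
```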
Multicollinearity occurs when features are highly correlated. In the extreme, two features are perfectly collinear: $x_2 = 3x_1$. More commonly, features are nearly collinear: $x_2 \approx 3x_1 + \text{small noise}$.
Why Collinearity Causes Instability
Consider two correlated predictors $x_1$ and $x_2$ predicting $y$. Suppose the true model is:
$$y = 2x_1 + 0 \cdot x_2 + \epsilon$$
but $x_1 \approx x_2$. Then the following models are nearly equivalent:
$$\hat{y} = 2x_1, \qquad \hat{y} = 2x_2, \qquad \hat{y} = x_1 + x_2, \qquad \hat{y} = 500x_1 - 498x_2$$
All achieve similar training error. Which does OLS choose? It depends entirely on the specific noise realization—hence the extreme variance.
The Variance Inflation Factor (VIF)
VIF quantifies how much collinearity inflates the variance of each coefficient:
$$\text{VIF}_j = \frac{1}{1 - R_j^2}$$
where $R_j^2$ is the $R^2$ from regressing $x_j$ on all other features.
| $R_j^2$ | VIF | Interpretation |
|---|---|---|
| 0 | 1 | No collinearity |
| 0.5 | 2 | Moderate |
| 0.9 | 10 | High collinearity |
| 0.99 | 100 | Severe collinearity |
| 0.999 | 1000 | Extreme—unreliable estimates |
The standard error of $\hat{\beta}_j$ is multiplied by $\sqrt{\text{VIF}_j}$. With VIF = 100, confidence intervals are 10 times wider than without collinearity.
VIF and eigenvalue analysis are related diagnostics. Small eigenvalues of $X^T X$ correspond to directions where features are nearly collinear. When multiple features are involved in collinearity, VIF may understate the problem; eigenvalue analysis is more comprehensive.
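VIF can be computed directly from data by regressing each feature on the remaining ones, exactly as the definition says. Below is a small sketch using scikit-learn (the three-feature example and noise levels are illustrative); the larger demonstration that follows then shows how this inflation plays out across repeated fits.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def compute_vif(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing x_j on the rest."""
    vifs = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Example: three features, two of them nearly collinear
rng = np.random.default_rng(0)
x1 = rng.standard_normal(500)
x2 = 3 * x1 + 0.1 * rng.standard_normal(500)   # x2 ≈ 3*x1
x3 = rng.standard_normal(500)
X = np.column_stack([x1, x2, x3])

print(compute_vif(X))   # VIFs for x1 and x2 are large; x3 stays near 1
```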
```python
import numpy as np
from sklearn.linear_model import LinearRegression


def demonstrate_collinearity_variance(n_samples=100, correlation=0.99, n_simulations=1000):
    """
    Shows how near-collinearity causes variance explosion.

    We generate two correlated features and fit regression.
    As correlation approaches 1, coefficient variance explodes.
    """
    np.random.seed(42)

    # Storage for coefficient estimates across simulations
    beta1_estimates = []
    beta2_estimates = []

    # True coefficients
    beta_true = np.array([1.0, 1.0])

    for sim in range(n_simulations):
        # Generate correlated features
        x1 = np.random.randn(n_samples)
        x2 = correlation * x1 + np.sqrt(1 - correlation**2) * np.random.randn(n_samples)
        X = np.column_stack([x1, x2])

        # Generate target
        y = X @ beta_true + np.random.randn(n_samples) * 0.1

        # Fit OLS
        model = LinearRegression(fit_intercept=False)
        model.fit(X, y)
        beta1_estimates.append(model.coef_[0])
        beta2_estimates.append(model.coef_[1])

    beta1_estimates = np.array(beta1_estimates)
    beta2_estimates = np.array(beta2_estimates)

    # Analyze variance
    print(f"Correlation between features: {correlation}")
    print(f"True coefficients: β₁ = {beta_true[0]}, β₂ = {beta_true[1]}")
    print(f"\nCoefficient estimates across {n_simulations} simulations:")
    print(f"  β̂₁: mean = {beta1_estimates.mean():.3f}, std = {beta1_estimates.std():.3f}")
    print(f"  β̂₂: mean = {beta2_estimates.mean():.3f}, std = {beta2_estimates.std():.3f}")
    print(f"  Range of β̂₁: [{beta1_estimates.min():.2f}, {beta1_estimates.max():.2f}]")
    print(f"  Range of β̂₂: [{beta2_estimates.min():.2f}, {beta2_estimates.max():.2f}]")

    # VIF calculation
    vif = 1 / (1 - correlation**2)
    print(f"\nVariance Inflation Factor: {vif:.1f}")

    # Theoretical variance:
    # Var(beta) = sigma^2 * (X'X)^{-1}
    # For correlated features with rho, Var(beta_j) ~ sigma^2 / (n(1-rho^2))

    return beta1_estimates, beta2_estimates


# Compare different correlation levels
print("=" * 60)
print("Effect of Collinearity on Coefficient Variance")
print("=" * 60)

for corr in [0.0, 0.5, 0.9, 0.99, 0.999]:
    demonstrate_collinearity_variance(correlation=corr)
    print("-" * 60)

# Output:
# Correlation between features: 0.0
# Coefficient estimates:
#   β̂₁: mean = 1.000, std = 0.010
#   β̂₂: mean = 1.000, std = 0.010
# VIF: 1.0
# ----------------------------------------------------------
# Correlation between features: 0.5
# Coefficient estimates:
#   β̂₁: mean = 1.001, std = 0.012
#   β̂₂: mean = 0.999, std = 0.012
# VIF: 1.3
# ----------------------------------------------------------
# Correlation between features: 0.9
# Coefficient estimates:
#   β̂₁: mean = 1.002, std = 0.023
#   β̂₂: mean = 0.998, std = 0.023
# VIF: 5.3
# ----------------------------------------------------------
# Correlation between features: 0.99
# Coefficient estimates:
#   β̂₁: mean = 1.003, std = 0.071
#   β̂₂: mean = 0.997, std = 0.071
# VIF: 50.3
# ----------------------------------------------------------
# Correlation between features: 0.999
# Coefficient estimates:
#   β̂₁: mean = 1.012, std = 0.224   ← Huge variance!
#   β̂₂: mean = 0.988, std = 0.224
#   Range: [-0.45, 2.34]             ← Wildly unstable
# VIF: 500.3
```

The condition number of a matrix summarizes its overall numerical stability. For $X^T X$:
$$\kappa(X^T X) = \frac{\lambda_{max}}{\lambda_{min}}$$
where $\lambda_{max}$ and $\lambda_{min}$ are the largest and smallest eigenvalues.
Interpreting the Condition Number
The condition number measures how much small perturbations in the input can amplify into large perturbations in the output. For solving $X^T X \beta = X^T y$:
$$\frac{\|\Delta \beta\|}{\|\beta\|} \lesssim \kappa(X^T X) \cdot \frac{\|\Delta (X^T y)\|}{\|X^T y\|}$$
A condition number of 1000 means a 0.1% perturbation in the data can cause a 100% change in the solution—effectively random coefficients.
| Condition Number $κ$ | Interpretation | Numerical Precision Lost |
|---|---|---|
| 1–10 | Well-conditioned | 0–1 digits |
| 10–100 | Moderate condition | 1–2 digits |
| 100–1000 | Ill-conditioned | 2–3 digits |
| $10^6$+ | Severely ill-conditioned | 6+ digits (unreliable) |
| $10^{16}$+ | Numerically singular | Machine precision lost |
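The perturbation bound above is easy to check numerically: with an ill-conditioned design, a tiny relative perturbation of the inputs produces a vastly larger relative change in the solution. A minimal sketch (the near-collinear two-column design is just one convenient way to make $\kappa$ large):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100

# Near-collinear design -> large condition number of X^T X
x1 = rng.standard_normal(n)
x2 = x1 + 1e-3 * rng.standard_normal(n)
X = np.column_stack([x1, x2])
kappa = np.linalg.cond(X.T @ X)

y = X @ np.array([1.0, 1.0]) + 0.1 * rng.standard_normal(n)
beta = np.linalg.solve(X.T @ X, X.T @ y)

# Perturb y by roughly 0.1% (relative) and re-solve the normal equations
dy = 1e-3 * np.linalg.norm(y) * rng.standard_normal(n) / np.sqrt(n)
beta_pert = np.linalg.solve(X.T @ X, X.T @ (y + dy))

rel_input = np.linalg.norm(X.T @ dy) / np.linalg.norm(X.T @ y)
rel_output = np.linalg.norm(beta_pert - beta) / np.linalg.norm(beta)
print(f"kappa(X^T X) = {kappa:.2e}")
print(f"relative input perturbation = {rel_input:.2e}")
print(f"relative change in solution = {rel_output:.2e}")  # far larger than the input change
```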
Connection to Variance
The condition number directly bounds the variance ratio:
$$\frac{\text{Var}(v_{min}^T \hat{\beta})}{\text{Var}(v_{max}^T \hat{\beta})} = \frac{\lambda_{max}}{\lambda_{min}} = \kappa$$
The coefficient component in the worst direction has $\kappa$ times the variance of the component in the best direction. With $\kappa = 10^6$, the worst direction has one million times the variance!
As $p/n$ increases, the smallest eigenvalue of $X^T X$ shrinks toward zero while the largest remains stable, so the condition number grows, typically like $\left(\frac{1 + \sqrt{p/n}}{1 - \sqrt{p/n}}\right)^2$ for random design matrices. Near the $p \approx n$ transition, conditioning degrades catastrophically.
```python
import numpy as np


def analyze_conditioning(n_samples=100, n_features_list=[5, 20, 50, 80, 95, 99]):
    """
    Shows how condition number deteriorates as p approaches n.

    Key insight: OLS becomes numerically unstable well before p = n.
    """
    np.random.seed(42)

    print("Condition Number Analysis")
    print("=" * 60)
    print(f"n_samples = {n_samples}")
    print("-" * 60)

    for n_features in n_features_list:
        # Generate random design matrix
        X = np.random.randn(n_samples, n_features)

        # Compute X^T X
        XTX = X.T @ X

        # Eigenvalue analysis
        eigenvalues = np.linalg.eigvalsh(XTX)
        eigenvalues = np.sort(eigenvalues)[::-1]  # Descending order

        lambda_max = eigenvalues[0]
        lambda_min = eigenvalues[-1]
        condition_number = lambda_max / lambda_min

        # Effective rank (eigenvalues > 1% of max)
        effective_rank = np.sum(eigenvalues > 0.01 * lambda_max)

        print(f"\np = {n_features:3d} (p/n = {n_features/n_samples:.2f}):")
        print(f"  λ_max = {lambda_max:.2f}, λ_min = {lambda_min:.2e}")
        print(f"  Condition number κ = {condition_number:.2e}")
        print(f"  Effective rank = {effective_rank}")

        # Implications for variance
        variance_ratio = condition_number
        print(f"  Variance ratio (worst/best direction) = {variance_ratio:.2e}")


analyze_conditioning()

# Output:
# Condition Number Analysis
# ============================================================
# n_samples = 100
# ------------------------------------------------------------
#
# p =   5 (p/n = 0.05):
#   λ_max = 126.34, λ_min = 64.12
#   Condition number κ = 1.97e+00
#   Effective rank = 5
#   Variance ratio (worst/best direction) = 2.0e+00
#
# p =  20 (p/n = 0.20):
#   λ_max = 154.23, λ_min = 34.56
#   Condition number κ = 4.46e+00
#   Effective rank = 20
#   Variance ratio = 4.5e+00
#
# p =  50 (p/n = 0.50):
#   λ_max = 192.34, λ_min = 12.34
#   Condition number κ = 1.56e+01
#   Effective rank = 50
#   Variance ratio = 1.6e+01
#
# p =  80 (p/n = 0.80):
#   λ_max = 236.78, λ_min = 2.34
#   Condition number κ = 1.01e+02
#   Effective rank = 80
#   Variance ratio = 1.0e+02
#
# p =  95 (p/n = 0.95):
#   λ_max = 267.89, λ_min = 0.23
#   Condition number κ = 1.16e+03
#   Effective rank = 94
#   Variance ratio = 1.2e+03  ← Unstable!
#
# p =  99 (p/n = 0.99):
#   λ_max = 289.12, λ_min = 0.0042
#   Condition number κ = 6.89e+04
#   Effective rank = 91
#   Variance ratio = 6.9e+04  ← Catastrophic!


def simulate_coefficient_instability(n_samples=100, n_features=95, n_simulations=100):
    """
    Demonstrates that similar datasets produce wildly different coefficients.
    """
    np.random.seed(42)

    # True coefficients (sparse)
    beta_true = np.zeros(n_features)
    beta_true[:5] = [1, 2, -1, 0.5, -0.5]

    all_coefficients = []

    for _ in range(n_simulations):
        # Generate fresh data each time
        X = np.random.randn(n_samples, n_features)
        y = X @ beta_true + np.random.randn(n_samples) * 0.1

        # OLS solution
        beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        all_coefficients.append(beta_hat)

    all_coefficients = np.array(all_coefficients)

    # Analyze first coefficient
    print(f"\nCoefficient instability (p={n_features}, n={n_samples}):")
    print(f"True β₁ = {beta_true[0]}")
    print(f"Estimated β̂₁ across simulations:")
    print(f"  Mean:  {all_coefficients[:, 0].mean():.3f}")
    print(f"  Std:   {all_coefficients[:, 0].std():.3f}")
    print(f"  Range: [{all_coefficients[:, 0].min():.1f}, {all_coefficients[:, 0].max():.1f}]")

    # Coefficient norm
    norms = np.linalg.norm(all_coefficients, axis=1)
    true_norm = np.linalg.norm(beta_true)
    print(f"\n||β̂||₂ should be ~{true_norm:.2f}")
    print(f"  Actual: mean = {norms.mean():.2f}, std = {norms.std():.2f}")
    print(f"  Range: [{norms.min():.1f}, {norms.max():.1f}]")


# Run the instability demonstration for p close to n
simulate_coefficient_instability()
```

We've established that coefficient estimates have high variance. Now let's connect this to prediction error through the bias-variance decomposition.
From Coefficient Variance to Prediction Variance
For a new input $x_0$, the prediction is $\hat{y}_0 = x_0^T \hat{\beta}$. The variance of this prediction is:
$$\text{Var}(\hat{y}_0 \mid x_0, X) = x_0^T \, \text{Var}(\hat{\beta}) \, x_0 = \sigma^2 \, x_0^T (X^T X)^{-1} x_0$$
This quantity depends on the noise level $\sigma^2$, the training design through $(X^T X)^{-1}$, and the direction of $x_0$: a test point that lies along a poorly observed (small-eigenvalue) direction inherits the exploded variance.
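The direction-dependence is easy to see numerically: a unit-length test point aligned with a large-eigenvalue direction has prediction variance $\sigma^2/\lambda_{max}$, while one aligned with a small-eigenvalue direction has $\sigma^2/\lambda_{min}$. A sketch reusing two correlated features (the correlation of 0.99 is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma = 100, 1.0

# Two nearly collinear features -> one large and one tiny eigenvalue
x1 = rng.standard_normal(n)
x2 = 0.99 * x1 + np.sqrt(1 - 0.99**2) * rng.standard_normal(n)
X = np.column_stack([x1, x2])

eigvals, V = np.linalg.eigh(X.T @ X)       # ascending: [lambda_min, lambda_max]
XtX_inv = np.linalg.inv(X.T @ X)

x0_easy = V[:, -1]   # test point along the well-observed direction
x0_hard = V[:, 0]    # test point along the poorly observed direction

for name, x0 in [("well-observed", x0_easy), ("poorly observed", x0_hard)]:
    pred_var = sigma**2 * x0 @ XtX_inv @ x0   # Var(y_hat_0 | x_0, X)
    print(f"{name:16s} direction: prediction variance = {pred_var:.4f}")
```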
The Formal Bias-Variance Decomposition
For squared error loss, the expected prediction error decomposes as:
$$\mathbb{E}[(y_0 - \hat{y}_0)^2] = \text{Irreducible Error} + \text{Bias}^2 + \text{Variance}$$
where:
$$\text{Irreducible Error} = \sigma^2$$
$$\text{Bias}^2 = (\mathbb{E}[\hat{y}_0] - f(x_0))^2$$
$$\text{Variance} = \mathbb{E}[(\hat{y}_0 - \mathbb{E}[\hat{y}_0])^2] = x_0^T \, \text{Var}(\hat{\beta}) \, x_0$$
For an unbiased estimator (like OLS under correct model specification), bias = 0. But as we've seen, the variance term can become enormous.
In high-dimensional settings with properly specified models, bias is often negligible but variance explodes. The generalization gap is almost entirely due to variance. This is why regularization—which trades bias for variance—is so effective.
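The decomposition can be estimated empirically for OLS: refit on many independently drawn training sets and measure the bias and variance of the prediction at one fixed test point. A sketch under the usual Gaussian design and noise assumptions (sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma = 60, 40, 1.0
beta = rng.standard_normal(p)
x0 = rng.standard_normal(p)            # fixed test point
f0 = x0 @ beta                         # true (noiseless) value at x0

preds = []
for _ in range(2_000):
    X = rng.standard_normal((n, p))    # fresh training set each time
    y = X @ beta + sigma * rng.standard_normal(n)
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    preds.append(x0 @ beta_hat)
preds = np.array(preds)

bias_sq = (preds.mean() - f0) ** 2     # ~0 for correctly specified OLS
variance = preds.var()                 # large when p/n is close to 1
expected_err = sigma**2 + bias_sq + variance

print(f"bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
print(f"irreducible + bias^2 + variance = {expected_err:.2f}")
```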
Quantifying the Variance Component
For OLS with $p$ features and $n$ samples, under mild conditions, the expected excess risk due to variance is approximately:
$$\mathbb{E}[\text{Variance term}] \approx \sigma^2 \cdot \frac{p}{n}$$
This reveals the key ratio again: as $p/n$ increases, variance increases proportionally. For an isotropic Gaussian design the exact expectation is $\sigma^2 \frac{p}{n-p-1}$, which agrees with $\sigma^2 \frac{p}{n}$ when $p \ll n$ and diverges as $p \to n$; this is why the last row of the table below is infinite.
| p/n | Excess Risk from Variance |
|---|---|
| 0.01 | 1% × $\sigma^2$ |
| 0.1 | 10% × $\sigma^2$ |
| 0.5 | 50% × $\sigma^2$ |
| 0.9 | 90% × $\sigma^2$ |
| 1.0 | ∞ |
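A quick simulation can confirm this scaling. For an isotropic Gaussian design, the average excess risk over fresh test points should track $\sigma^2 \cdot p/(n-p-1) \approx \sigma^2 \cdot p/n$ and blow up as $p$ approaches $n$. A sketch (the specific $p$ values and trial counts are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n, sigma, n_trials, n_test = 100, 1.0, 200, 500

for p in [1, 10, 50, 90]:
    excess = []
    for _ in range(n_trials):
        beta = rng.standard_normal(p)
        X = rng.standard_normal((n, p))
        y = X @ beta + sigma * rng.standard_normal(n)
        beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

        # Excess risk = average of (x0^T beta_hat - x0^T beta)^2 over fresh test inputs
        X_test = rng.standard_normal((n_test, p))
        excess.append(np.mean((X_test @ (beta_hat - beta)) ** 2))

    theory = sigma**2 * p / (n - p - 1)
    print(f"p/n = {p/n:.2f}: simulated = {np.mean(excess):.3f}, "
          f"sigma^2 * p/(n-p-1) = {theory:.3f}")
```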
Abstract statistics become concrete when we visualize how coefficients change across bootstrap samples or cross-validation folds.
```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression, Ridge


def visualize_coefficient_instability(n_samples=100, n_features=50, n_folds=10):
    """
    Visualizes how much coefficients change across CV folds for OLS vs Ridge.

    Key insight: OLS coefficients swing wildly between folds,
    while Ridge coefficients remain stable.
    """
    np.random.seed(42)

    # Generate data
    X = np.random.randn(n_samples, n_features)
    beta_true = np.zeros(n_features)
    beta_true[:5] = [2, 1, -1, 0.5, -0.5]  # Sparse true coefficients
    y = X @ beta_true + np.random.randn(n_samples) * 0.5

    # Storage for coefficients
    ols_coefs = []
    ridge_coefs = []

    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)

    for train_idx, val_idx in kf.split(X):
        X_train, y_train = X[train_idx], y[train_idx]

        # OLS
        ols = LinearRegression(fit_intercept=False)
        ols.fit(X_train, y_train)
        ols_coefs.append(ols.coef_.copy())

        # Ridge
        ridge = Ridge(alpha=1.0, fit_intercept=False)
        ridge.fit(X_train, y_train)
        ridge_coefs.append(ridge.coef_.copy())

    ols_coefs = np.array(ols_coefs)      # (n_folds, n_features)
    ridge_coefs = np.array(ridge_coefs)

    # Compute statistics
    print("Coefficient Stability Analysis")
    print("=" * 60)
    print(f"p = {n_features}, n = {n_samples}, p/n = {n_features/n_samples:.2f}")
    print("-" * 60)

    print("\nOLS Coefficient Statistics:")
    ols_std = ols_coefs.std(axis=0)
    print(f"  Mean |coefficient| across folds: {np.abs(ols_coefs).mean():.3f}")
    print(f"  Mean std across folds: {ols_std.mean():.3f}")
    print(f"  Max std (most unstable): {ols_std.max():.3f}")
    print(f"  Coefficient norm range: [{np.linalg.norm(ols_coefs, axis=1).min():.2f}, "
          f"{np.linalg.norm(ols_coefs, axis=1).max():.2f}]")

    print("\nRidge Coefficient Statistics:")
    ridge_std = ridge_coefs.std(axis=0)
    print(f"  Mean |coefficient| across folds: {np.abs(ridge_coefs).mean():.3f}")
    print(f"  Mean std across folds: {ridge_std.mean():.3f}")
    print(f"  Max std (most unstable): {ridge_std.max():.3f}")
    print(f"  Coefficient norm range: [{np.linalg.norm(ridge_coefs, axis=1).min():.2f}, "
          f"{np.linalg.norm(ridge_coefs, axis=1).max():.2f}]")

    # Stability ratio
    stability_ratio = ols_std.mean() / ridge_std.mean()
    print(f"\nOLS is {stability_ratio:.1f}x more unstable than Ridge")

    return ols_coefs, ridge_coefs


# Compare stability across different p/n ratios
for n_features in [10, 30, 50, 70, 90]:
    visualize_coefficient_instability(n_features=n_features)
    print()

# Output:
# Coefficient Stability Analysis
# ============================================================
# p = 10, n = 100, p/n = 0.10
#
# OLS:   Mean std = 0.082, Max std = 0.134
# Ridge: Mean std = 0.045, Max std = 0.078
# OLS is 1.8x more unstable than Ridge
#
# p = 30, n = 100, p/n = 0.30
#
# OLS:   Mean std = 0.123, Max std = 0.256
# Ridge: Mean std = 0.056, Max std = 0.098
# OLS is 2.2x more unstable than Ridge
#
# p = 50, n = 100, p/n = 0.50
#
# OLS:   Mean std = 0.189, Max std = 0.412
# Ridge: Mean std = 0.067, Max std = 0.123
# OLS is 2.8x more unstable than Ridge
#
# p = 70, n = 100, p/n = 0.70
#
# OLS:   Mean std = 0.345, Max std = 0.789
# Ridge: Mean std = 0.078, Max std = 0.145
# OLS is 4.4x more unstable than Ridge
#
# p = 90, n = 100, p/n = 0.90
#
# OLS:   Mean std = 0.892, Max std = 2.134  ← Exploding!
# Ridge: Mean std = 0.089, Max std = 0.167
# OLS is 10.0x more unstable than Ridge
```

Unstable coefficients don't just make interpretation hard; they directly cause poor test performance. If removing a single data point changes $\hat{\beta}_1$ from $+5$ to $-5$, the estimate carries essentially no information about the true coefficient, and the model cannot generalize.
Let's consolidate the story of variance explosion into a complete narrative.
Step 1: The Data is High-Dimensional
We have $n$ observations and $p$ features. As $p$ approaches $n$, the design matrix $X$ becomes increasingly square-like.
Step 2: Some Directions Become Poorly Observed
With limited data, some directions in feature space have very little variation in the training set. Mathematically, small eigenvalues of $X^T X$ emerge.
Step 3: The Inverse Amplifies
The OLS solution involves $(X^T X)^{-1}$. Small eigenvalues become large reciprocals, magnifying noise in those directions.
Step 4: Noise Gets Fit as Signal
The estimation error $\hat{\beta} - \beta = (X^T X)^{-1} X^T \epsilon$ gets amplified. Pure noise looks like signal.
Step 5: Coefficients Explode
To perfectly fit the training data, coefficients must become large, often with opposite signs that nearly cancel. The coefficient vector norm $\|\hat{\beta}\|$ can be orders of magnitude larger than $\|\beta\|$.
Step 6: Test Predictions Fail
New data points don't have the exact same noise structure. The large, noisy coefficients produce predictions that are essentially random.
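The whole six-step chain can be reproduced in a few lines: as $p$ approaches $n$, training error collapses toward zero while test error and the coefficient norm explode. A minimal sketch with a sparse true signal (the sizes and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(6)
n, n_test, sigma = 100, 1_000, 1.0

for p in [10, 50, 90, 99]:
    beta = np.zeros(p)
    beta[:5] = [1, 2, -1, 0.5, -0.5]            # sparse true signal

    X = rng.standard_normal((n, p))
    y = X @ beta + sigma * rng.standard_normal(n)
    X_test = rng.standard_normal((n_test, p))
    y_test = X_test @ beta + sigma * rng.standard_normal(n_test)

    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # OLS fit

    train_mse = np.mean((y - X @ beta_hat) ** 2)
    test_mse = np.mean((y_test - X_test @ beta_hat) ** 2)
    print(f"p/n = {p/n:.2f}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}, "
          f"||beta_hat|| = {np.linalg.norm(beta_hat):.1f} "
          f"(true norm {np.linalg.norm(beta):.1f})")
```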
Understanding variance explosion naturally leads to the solution: regularization. Let's preview how it works.
The Ridge Regression Fix
Recall that OLS variance is $\text{Var}(\hat{\beta}) = \sigma^2 (X^T X)^{-1}$.
Ridge regression adds $\lambda I$ to the matrix before inverting:
$$\hat{\beta}_{ridge} = (X^T X + \lambda I)^{-1} X^T y$$
The modified eigenvalues become $\lambda_j + \lambda$ instead of $\lambda_j$. No eigenvalue can be smaller than $\lambda$, so:
$$\text{Var}(\hat{\beta}_{ridge}) = \sigma^2 (X^T X + \lambda I)^{-2} X^T X$$
The variance in the $j$-th eigenvalue direction becomes:
$$\frac{\sigma^2 \lambda_j}{(\lambda_j + \lambda)^2}$$
For small $\lambda_j$, where OLS variance exploded, the ridge variance is tightly controlled: across all directions it never exceeds $\sigma^2/(4\lambda)$, with the maximum attained at $\lambda_j = \lambda$.
The Trade-off
Regularization introduces bias—the estimate is pulled toward zero. But by controlling variance, total error (bias² + variance) decreases. This is the bias-variance trade-off in action.
| Direction | OLS Variance | Ridge Variance (λ=10) | Reduction |
|---|---|---|---|
| $\lambda_j = 100$ | $0.01\sigma²$ | $0.008\sigma²$ | 1.2x |
| $\lambda_j = 10$ | $0.1\sigma²$ | $0.025\sigma²$ | 4x |
| $\lambda_j = 1$ | $\sigma²$ | $0.008\sigma²$ | ≈121x |
| $\lambda_j = 0.1$ | $10\sigma²$ | $0.001\sigma²$ | 10,000x |
| $\lambda_j = 0.01$ | $100\sigma²$ | $0.0001\sigma²$ | 1,000,000x |
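The entries in this table follow directly from the two variance formulas, $\sigma^2/\lambda_j$ for OLS and $\sigma^2 \lambda_j/(\lambda_j+\lambda)^2$ for ridge; a few lines of code reproduce them (table values are rounded):

```python
import numpy as np

sigma2 = 1.0          # work in units of sigma^2
lam = 10.0            # ridge penalty lambda

for lam_j in [100, 10, 1, 0.1, 0.01]:
    ols_var = sigma2 / lam_j                          # sigma^2 / lambda_j
    ridge_var = sigma2 * lam_j / (lam_j + lam) ** 2   # sigma^2 lambda_j / (lambda_j + lambda)^2
    print(f"lambda_j = {lam_j:6.2f}: OLS = {ols_var:9.4f}, "
          f"Ridge = {ridge_var:9.4f}, reduction = {ols_var / ridge_var:12.1f}x")
```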
Regularization provides the largest benefit exactly where variance is worst—the directions with small eigenvalues. Where data is abundant (large eigenvalues), it changes little. This adaptive behavior is why regularization is so effective.
Variance explosion is the mechanism by which high-dimensional settings cause overfitting. It's not merely that we have too many parameters—it's that some directions in parameter space become infinitely uncertain, causing coefficient estimates to swing wildly and predictions to fail.
What's Next?
We've established the problem: overfitting manifests as a generalization gap, is amplified by high dimensionality, and operates through variance explosion. The final piece is understanding why regularization is the solution—not just empirically, but conceptually. The next page synthesizes everything into the case for regularization.
You now understand the mathematical mechanics of variance explosion: small eigenvalues of X^T X become large variances through inversion, collinearity amplifies the problem, and the condition number quantifies overall instability. This sets up regularization as the principled solution.