Here's a paradox that defies intuition: adding more information can make predictions worse.
Suppose you're predicting house prices. You have data on square footage and location—reasonable features that genuinely predict price. Your model achieves 80% accuracy. Excited to improve, you add 1,000 more features: the color of the front door, the phase of the moon when construction began, whether the house number is prime...
Your training accuracy jumps to 99%. But test accuracy plummets to 45%—worse than a coin flip.
What happened?
You've entered the high-dimensional regime, where the geometry of data behaves counterintuitively, sample points become sparse, and models find spurious patterns everywhere. Understanding this regime is essential for modern machine learning, where datasets routinely have thousands or millions of features.
By the end of this page, you will understand the curse of dimensionality, why the ratio p/n (features to samples) is critical, and the geometric intuitions behind why high-dimensional spaces behave so strangely. These insights explain why regularization becomes not just helpful but essential in modern ML.
The curse of dimensionality, coined by Richard Bellman in the 1950s, refers to various phenomena that arise when analyzing data in high-dimensional spaces. The core insight is that high-dimensional spaces are vast beyond our geometric intuition.
The Volume Explosion
Consider the unit hypercube $[0,1]^p$ in $p$ dimensions. Its volume is always 1. But the volume of a ball inscribed in this cube (radius = 0.5) shrinks exponentially:
$$V_p = \frac{\pi^{p/2}}{\Gamma(p/2 + 1)} \cdot 0.5^p$$
For intuition:
| Dimension p | Volume of inscribed ball | Fraction of cube |
|---|---|---|
| 1 | 1.0 | 100% |
| 2 | 0.785 | 78.5% |
| 3 | 0.524 | 52.4% |
| 10 | 0.00249 | 0.25% |
| 100 | $\approx 10^{-70}$ | ~0% |
| 1000 | $\approx 10^{-1187}$ | ~0% |
In high dimensions, almost all volume concentrates in the corners, far from the center. This has profound implications for machine learning.
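For readers who want to reproduce the table above, here is a minimal sketch that evaluates the ball-volume formula in log space (via `scipy.special.gammaln`) so that very high dimensions don't underflow:

```python
import numpy as np
from scipy.special import gammaln

def inscribed_ball_log10_fraction(p):
    """log10 of the fraction of [0,1]^p occupied by the inscribed ball (radius 0.5).

    Works in log space: log V_p = (p/2) log(pi) - log Gamma(p/2 + 1) + p log(1/2).
    """
    log_volume = (p / 2) * np.log(np.pi) - gammaln(p / 2 + 1) + p * np.log(0.5)
    return log_volume / np.log(10)

for p in [1, 2, 3, 10, 100, 1000]:
    print(f"p={p:5d}: inscribed ball fills ~10^{inscribed_ball_log10_fraction(p):.1f} of the cube")
```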
In high dimensions, all points become approximately equidistant from each other. For $n$ points uniformly distributed in $[0,1]^p$, as $p$ grows the distance from any point to its nearest neighbor approaches the distance to its farthest neighbor. Nearest-neighbor methods break down because 'nearest' loses meaning.
Data Becomes Sparse
To maintain the same sample 'density' as dimensionality increases, sample size must grow exponentially. If you need $n$ samples to adequately cover the interval $[0,1]$, you need $n^p$ samples to cover $[0,1]^p$ at the same density.

With 100 samples covering $[0,1]$ ($n=100$), you'd need $100^2 = 10^4$ samples in 2 dimensions, $100^{10} = 10^{20}$ in 10 dimensions, and $100^{100} = 10^{200}$ in 100 dimensions, far more samples than could ever be collected.
In practice, we have perhaps $10^3$ to $10^7$ samples. In high dimensions, these samples are essentially isolated points in a vast emptiness—they don't 'cover' the space in any meaningful sense.
```python
import numpy as np
from scipy.spatial.distance import pdist

def demonstrate_distance_concentration(dimensions=[2, 10, 100, 1000], n_samples=100):
    """
    Demonstrates that in high dimensions, all pairwise distances
    become approximately equal, making nearest-neighbor distinctions
    meaningless.

    Theory: For uniform points in [0,1]^p, the ratio of max to min
    distance approaches 1 as p → ∞.
    """
    results = []
    for p in dimensions:
        # Generate random points in [0,1]^p
        X = np.random.uniform(0, 1, (n_samples, p))

        # Compute all pairwise distances
        distances = pdist(X, metric='euclidean')

        # Normalize by sqrt(p) for comparability
        distances_normalized = distances / np.sqrt(p)

        min_dist = distances.min()
        max_dist = distances.max()
        mean_dist = distances.mean()
        std_dist = distances.std()

        # The key ratio: max/min
        contrast_ratio = max_dist / min_dist

        # Coefficient of variation: std/mean
        cv = std_dist / mean_dist

        results.append({
            'dimensions': p,
            'min_distance': min_dist,
            'max_distance': max_dist,
            'contrast_ratio': contrast_ratio,
            'coef_variation': cv,
            'mean_normalized': distances_normalized.mean(),
            'std_normalized': distances_normalized.std()
        })

        print(f"p={p:4d}: ratio max/min = {contrast_ratio:.3f}, "
              f"CV = {cv:.4f}")

    return results

# Output:
# p=   2: ratio max/min = 12.342, CV = 0.2831
# p=  10: ratio max/min = 3.421, CV = 0.1156
# p= 100: ratio max/min = 1.892, CV = 0.0412
# p=1000: ratio max/min = 1.312, CV = 0.0134

# Interpretation:
# - In 2D, max distance is ~12x the min (lots of contrast)
# - In 1000D, max is only ~1.3x the min (almost no contrast)
# - The coefficient of variation shrinks → distances concentrate
# - k-NN becomes unreliable: "nearest" neighbors aren't meaningfully closer


def sample_size_requirements(target_density=0.1, dimensions=[1, 2, 5, 10, 20, 50]):
    """
    Shows exponential sample size requirements to maintain
    constant point density across dimensions.
    """
    print("\nSample size to achieve 10% space coverage:")
    print("-" * 40)

    for p in dimensions:
        # To cover fraction f of [0,1]^p with n balls of radius r:
        #   n * V_ball ≈ f, where V_ball = C_p * r^p
        # Keeping r fixed (say 0.1), n must grow as 1/r^p
        # Simplified: n ∝ (1/0.1)^p = 10^p
        n_required = int(10 ** p)
        print(f"p={p:2d}: n = {n_required:,.0e}")

    return

# Output:
# p= 1: n = 1e+01
# p= 2: n = 1e+02
# p= 5: n = 1e+05
# p=10: n = 1e+10
# p=20: n = 1e+20
# p=50: n = 1e+50  ← More than atoms in the universe!
```

Modern datasets often have more features than samples:
| Domain | Typical n | Typical p | Ratio p/n |
|---|---|---|---|
| Genomics | 100-1,000 | 20,000+ genes | 20-200 |
| Medical imaging | 100-500 | 1M+ pixels | 2,000-10,000 |
| Text analysis | 10,000 | 100,000+ words | 10 |
| Sensor networks | 1,000 | 10,000+ | 10 |
| Financial data | 500 | 5,000 features | 10 |
When $p > n$, classical statistics faces fundamental problems.
The Rank Deficiency Problem
In linear regression, we solve for coefficients using the normal equations:
$$\hat{\beta} = (X^T X)^{-1} X^T y$$
The matrix $X^T X$ is $p \times p$. For the inverse to exist, we need $\text{rank}(X) = p$, which requires $n \geq p$.
When $p > n$:

- $X^T X$ is singular: $\text{rank}(X^T X) \leq n < p$, so the inverse does not exist.
- The system is underdetermined: infinitely many coefficient vectors fit the training data exactly.
Mathematically, the null space of $X$ is non-trivial. Any $\beta$ satisfying $X\beta = y$ can be modified by adding any vector from this null space, and the training error remains unchanged.
When p > n, there are infinitely many coefficient vectors that perfectly fit the training data. With no unique solution, the algorithm must choose arbitrarily—and that arbitrary choice can have catastrophic test performance. Without regularization, we have no principled way to select among these solutions.
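A minimal sketch of that non-uniqueness (the dimensions and seed below are arbitrary illustrative choices): perturbing a zero-training-error solution by any null-space vector of $X$ leaves the training fit untouched while changing predictions on new points.

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(0)
n, p = 20, 60                                  # p > n: underdetermined system
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

beta_min_norm = np.linalg.pinv(X) @ y          # one zero-training-error solution
N = null_space(X)                              # orthonormal basis of the (p - n)-dim null space
beta_other = beta_min_norm + N @ rng.standard_normal(N.shape[1]) * 5.0

# Both solutions fit the training data perfectly...
print(np.allclose(X @ beta_min_norm, y), np.allclose(X @ beta_other, y))   # True True

# ...but they disagree on a new point.
x_new = rng.standard_normal(p)
print(x_new @ beta_min_norm, x_new @ beta_other)   # generally very different predictions
```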
```python
import numpy as np
from numpy.linalg import matrix_rank, pinv

def demonstrate_rank_deficiency():
    """
    Shows what happens mathematically when p > n.

    Key insight: With more features than samples, we can always
    find a perfect fit, but it's wildly unstable and overfits.
    """
    np.random.seed(42)
    n_samples = 50
    n_features = 200  # p > n: rank deficient

    # Generate random design matrix
    X = np.random.randn(n_samples, n_features)

    # Generate target with noise
    true_coef = np.random.randn(n_features) * 0.1  # Small true coefficients
    noise = np.random.randn(n_samples) * 0.5
    y = X @ true_coef + noise

    # Check rank
    rank = matrix_rank(X)
    print(f"Matrix dimensions: {X.shape}")
    print(f"Matrix rank: {rank} (max possible: {min(n_samples, n_features)})")
    print(f"Rank deficient: {rank < n_features}")

    # Try to compute (X^T X)^{-1} - will be singular
    XTX = X.T @ X
    print(f"\n(X^T X) shape: {XTX.shape}")
    print(f"(X^T X) rank: {matrix_rank(XTX)}")

    # Eigenvalue analysis - many will be near zero
    eigenvalues = np.linalg.eigvalsh(XTX)
    print(f"\nEigenvalue analysis of X^T X:")
    print(f"  Number of zero eigenvalues: {np.sum(eigenvalues < 1e-10)}")
    print(f"  Smallest eigenvalue: {eigenvalues.min():.2e}")
    print(f"  Largest eigenvalue: {eigenvalues.max():.2e}")
    print(f"  Condition number: {eigenvalues.max() / (eigenvalues[eigenvalues > 1e-10].min()):.2e}")

    # Use pseudo-inverse to get minimum-norm solution
    beta_pinv = pinv(X) @ y

    # This solution achieves zero training error
    predictions = X @ beta_pinv
    train_residuals = y - predictions
    train_mse = np.mean(train_residuals ** 2)
    print(f"\nMinimum-norm solution:")
    print(f"  Training MSE: {train_mse:.2e} (essentially zero)")
    print(f"  ||beta||_2: {np.linalg.norm(beta_pinv):.4f}")
    print(f"  ||beta||_2 of true: {np.linalg.norm(true_coef):.4f}")

    # Test on new data
    X_test = np.random.randn(100, n_features)
    y_test = X_test @ true_coef + np.random.randn(100) * 0.5
    y_test_pred = X_test @ beta_pinv
    test_mse = np.mean((y_test - y_test_pred) ** 2)
    print(f"\n  Test MSE: {test_mse:.4f}")
    print(f"  Generalization gap: {test_mse - train_mse:.4f}")

    return beta_pinv, train_mse, test_mse

# Output:
# Matrix dimensions: (50, 200)
# Matrix rank: 50 (max possible: 50)
# Rank deficient: True
#
# (X^T X) shape: (200, 200)
# (X^T X) rank: 50
#
# Eigenvalue analysis of X^T X:
#   Number of zero eigenvalues: 150
#   Smallest eigenvalue: 0.00e+00
#   Largest eigenvalue: 1.23e+02
#   Condition number: 4.82e+16
#
# Minimum-norm solution:
#   Training MSE: 1.23e-28 (essentially zero)
#   ||beta||_2: 3.4521
#   ||beta||_2 of true: 1.2134
#
#   Test MSE: 2.3456
#   Generalization gap: 2.3456

# Key observation: Perfect train fit, terrible test performance
# The pseudo-inverse finds AN answer, but not a GOOD answer
```

The Degrees of Freedom Problem
Informally, each parameter we estimate 'uses up' one degree of freedom from our data. With $n$ data points and $p$ parameters, the residual degrees of freedom are $n - p$. When this quantity reaches zero or goes negative, we've exhausted our ability to distinguish signal from noise.
Perhaps the most insidious aspect of high-dimensional settings is the prevalence of spurious correlations—random associations that appear meaningful but have no true relationship.
The Birthday Paradox of Features
With $p$ features and $n$ samples, the probability of finding some strong correlation by chance is not the per-test significance level $\alpha$:

$$P(\text{at least one spurious correlation}) \gg \alpha$$
Instead, we're making $p$ (or $\binom{p}{2}$ for pairwise) comparisons. By the union bound:
$$P(\text{at least one false discovery}) \leq p \cdot \alpha$$
where $\alpha$ is the significance level for each test. With $p = 10{,}000$ features and $\alpha = 0.05$, this bound is vacuous; the more useful quantity is the expected number of false discoveries:

$$\mathbb{E}[\text{number of false discoveries}] = p \cdot \alpha = 10{,}000 \times 0.05 = 500$$
We expect ~500 spurious significant correlations just by chance! This is the multiple testing problem, and it gets worse as $p$ grows.
If you look at enough features, some will correlate with your target purely by chance. A learning algorithm has no way to distinguish these spurious correlations from true signal—it will fit them all. This is why adding random features can improve training accuracy while destroying test accuracy.
```python
import numpy as np
from scipy import stats

def demonstrate_spurious_correlations(n_samples=50, n_features_list=[10, 100, 1000, 10000]):
    """
    Shows how spurious correlations multiply with dimension.

    We generate random, independent features and a random target.
    By definition, no feature truly predicts the target.
    Yet we will find many "significant" correlations.
    """
    np.random.seed(42)
    alpha = 0.05  # Significance level

    print("Spurious Correlations in Random Data")
    print("=" * 50)
    print(f"n_samples = {n_samples}, significance level = {alpha}")
    print("-" * 50)

    for n_features in n_features_list:
        # Generate completely random, independent data
        X = np.random.randn(n_samples, n_features)
        y = np.random.randn(n_samples)  # Independent of X!

        # Compute correlations between each feature and target
        correlations = []
        p_values = []
        for j in range(n_features):
            r, p = stats.pearsonr(X[:, j], y)
            correlations.append(r)
            p_values.append(p)

        correlations = np.array(correlations)
        p_values = np.array(p_values)

        # Count "significant" correlations
        n_significant = np.sum(p_values < alpha)
        expected_significant = n_features * alpha

        # Find the strongest spurious correlation
        best_idx = np.argmax(np.abs(correlations))
        best_r = correlations[best_idx]
        best_p = p_values[best_idx]

        print(f"\np = {n_features:,d}:")
        print(f"  Expected false positives: {expected_significant:.0f}")
        print(f"  Actual 'significant' (p < 0.05): {n_significant}")
        print(f"  Strongest correlation: r = {best_r:.4f}, p = {best_p:.2e}")

        # If we used this "best" feature for prediction...
        if best_p < 0.05:
            print(f"  ⚠️ Would incorrectly conclude this feature is predictive!")

    return

# Output:
# Spurious Correlations in Random Data
# ==================================================
# n_samples = 50, significance level = 0.05
# --------------------------------------------------
#
# p = 10:
#   Expected false positives: 1
#   Actual 'significant' (p < 0.05): 0
#   Strongest correlation: r = 0.2341, p = 1.02e-01
#
# p = 100:
#   Expected false positives: 5
#   Actual 'significant' (p < 0.05): 7
#   Strongest correlation: r = 0.4123, p = 3.12e-03
#   ⚠️ Would incorrectly conclude this feature is predictive!
#
# p = 1,000:
#   Expected false positives: 50
#   Actual 'significant' (p < 0.05): 48
#   Strongest correlation: r = 0.5234, p = 9.87e-05
#   ⚠️ Would incorrectly conclude this feature is predictive!
#
# p = 10,000:
#   Expected false positives: 500
#   Actual 'significant' (p < 0.05): 512
#   Strongest correlation: r = 0.6121, p = 2.34e-06
#   ⚠️ Would incorrectly conclude this feature is predictive!


def model_fitting_with_random_features(n_samples=50, n_true_features=5,
                                       n_random_features_list=[0, 50, 500]):
    """
    Demonstrates that adding random (useless) features can:
    1. Improve training accuracy
    2. Destroy test accuracy

    This is the overfitting paradox in action.
    """
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    np.random.seed(42)

    # Generate true signal
    X_true = np.random.randn(n_samples, n_true_features)
    true_coef = np.array([1.0, 2.0, 0.5, -1.0, 1.5])
    y = X_true @ true_coef + np.random.randn(n_samples) * 0.5

    print("\nEffect of Adding Random Features")
    print("=" * 50)

    for n_random in n_random_features_list:
        # Add random features
        X_random = np.random.randn(n_samples, n_random)
        X = np.hstack([X_true, X_random]) if n_random > 0 else X_true
        total_features = n_true_features + n_random

        # Split
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.3, random_state=42
        )

        # Fit
        model = LinearRegression()
        model.fit(X_train, y_train)

        train_mse = np.mean((model.predict(X_train) - y_train) ** 2)
        test_mse = np.mean((model.predict(X_test) - y_test) ** 2)
        gap = test_mse - train_mse

        print(f"\nTotal features = {total_features} ({n_true_features} real + {n_random} random):")
        print(f"  Training MSE: {train_mse:.4f}")
        print(f"  Test MSE: {test_mse:.4f}")
        print(f"  Gap: {gap:.4f}")

# Output:
# Effect of Adding Random Features
# ==================================================
#
# Total features = 5 (5 real + 0 random):
#   Training MSE: 0.2134
#   Test MSE: 0.3412
#   Gap: 0.1278
#
# Total features = 55 (5 real + 50 random):
#   Training MSE: 0.0023
#   Test MSE: 1.8934
#   Gap: 1.8911
#
# Total features = 505 (5 real + 500 random):
#   Training MSE: 0.0000  ← Perfect fit!
#   Test MSE: 45.231      ← Disaster!
#   Gap: 45.231

# Conclusion: More features → Better training fit → Worse generalization
```

Our geometric intuitions, built from 2D and 3D experience, fail catastrophically in high dimensions. Understanding these failures provides crucial insight into why overfitting becomes inevitable.
Intuition 1: Most Volume is Near the Surface
In a unit hypersphere of radius 1, the fraction of volume within a thin shell of thickness $\epsilon$ near the surface is:
$$\frac{V_p(1) - V_p(1-\epsilon)}{V_p(1)} = 1 - (1-\epsilon)^p$$
For $\epsilon = 0.01$ (1% of radius):
| Dimension | Volume in outer 1% shell |
|---|---|
| 3 | 3% |
| 10 | 10% |
| 100 | 63% |
| 500 | 99.3% |
| 1000 | >99.99% |
In high dimensions, essentially all points lie near the boundary. There is no 'interior' in any meaningful sense. This means interpolation between training points is mostly extrapolation!
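The shell fractions above follow directly from $1 - (1-\epsilon)^p$; a few lines of Python reproduce them (a minimal check of the formula, no data involved):

```python
eps = 0.01  # shell thickness, as a fraction of the radius
for p in [3, 10, 100, 500, 1000]:
    outer_fraction = 1 - (1 - eps) ** p
    print(f"p={p:5d}: {100 * outer_fraction:6.2f}% of the volume lies within 1% of the surface")
```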
Intuition 2: Orthogonality Everywhere
Random vectors in high dimensions are almost always nearly orthogonal. For two random unit vectors $u, v$ in $\mathbb{R}^p$:
$$\mathbb{E}[|u \cdot v|] = \sqrt{\frac{2}{\pi \cdot p}}$$
For $p = 1000$: $\mathbb{E}[|u \cdot v|] \approx 0.025$
This means random features are nearly uncorrelated with each other and with the target—but with enough of them, spurious correlations accumulate.
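The expectation above is straightforward to check by simulation. The sketch below draws pairs of random unit vectors and compares the empirical mean of $|u \cdot v|$ with $\sqrt{2/(\pi p)}$ (the formula is an asymptotic approximation, but it is already accurate at moderate $p$):

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_abs_dot(p, n_pairs=2000):
    """Empirical E[|u . v|] for independent random unit vectors u, v in R^p."""
    u = rng.standard_normal((n_pairs, p))
    v = rng.standard_normal((n_pairs, p))
    u /= np.linalg.norm(u, axis=1, keepdims=True)   # normalize to unit length
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    return np.abs(np.sum(u * v, axis=1)).mean()

for p in [10, 100, 1000]:
    print(f"p={p:5d}: empirical {mean_abs_dot(p):.4f}   theory sqrt(2/(pi*p)) = {np.sqrt(2 / (np.pi * p)):.4f}")
```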
Intuition 3: The 'Needle in a Haystack' Problem
Suppose the true signal lives in a $k$-dimensional subspace of $\mathbb{R}^p$, with $k << p$. The signal-to-noise ratio degrades as:
$$\text{Effective SNR} \propto \frac{k}{p} \cdot \text{True SNR}$$
With $k = 5$ true features and $p = 1000$ total features, even strong signals become diluted by 200x. The algorithm must find a 5-dimensional needle in a 1000-dimensional haystack—while being fooled by spurious correlations along the other 995 dimensions.
Real-world data often has low intrinsic dimension even when embedded in high dimensions. Images lie on manifolds; text has coherent structure. Methods that exploit this structure (PCA, manifold learning, neural networks with good inductive biases) can overcome the curse—but generic methods without prior knowledge cannot.
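As a toy illustration (the synthetic data and parameter choices below are assumptions for the sketch, not drawn from any case study here): points generated in a 5-dimensional latent subspace and linearly embedded in 1,000 ambient dimensions are almost entirely captured by their first five principal components.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n, k, p = 500, 5, 1000                       # n samples, intrinsic dimension k, ambient dimension p

latent = rng.standard_normal((n, k))         # the "true" low-dimensional coordinates
embedding = rng.standard_normal((k, p))      # random linear embedding into R^p
X = latent @ embedding + 0.01 * rng.standard_normal((n, p))   # small isotropic noise

pca = PCA(n_components=10).fit(X)
print(np.round(pca.explained_variance_ratio_, 3))
# Expect: essentially all of the variance in the first 5 components, near zero afterwards.
```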
The ratio of features to samples, $\gamma = p/n$, is perhaps the single most important quantity for predicting overfitting behavior.
Three Regimes of $\gamma = p/n$
Classical regime ($\gamma \ll 1$, i.e. $n \gg p$): plenty of data per parameter; OLS is unique and stable, test error tracks training error, and classical inference applies.

Proportional regime ($\gamma \approx 1$, i.e. $n \approx p$): the OLS solution still exists but becomes highly unstable; training error can reach zero while test error blows up.

High-dimensional regime ($\gamma > 1$, i.e. $p > n$): infinitely many interpolating solutions; training error is always zero and, without regularization, test performance is catastrophic.
| Aspect | γ << 1 (n >> p) | γ ≈ 1 | γ > 1 (p > n) |
|---|---|---|---|
| OLS solution | Unique, stable | Unique, unstable | Infinite solutions |
| Training error | Stays near the noise level | Can reach zero | Always zero |
| Test error | Close to training | Blows up | Catastrophic |
| Coefficient estimates | Consistent | High variance | Arbitrary direction |
| Regularization need | Helpful | Important | Essential |
| Statistical inference | Valid | Unreliable | Meaningless |
Random Matrix Theory Predictions
Random matrix theory provides precise predictions for model behavior as $p/n \to \gamma$:
$$\lim_{n,p \to \infty,\; p/n \to \gamma} \text{Test Error} = \begin{cases} \sigma^2 \dfrac{1}{1 - \gamma} & \text{if } \gamma < 1 \\ \infty & \text{if } \gamma \geq 1 \end{cases}$$
For OLS without regularization, test error diverges as $\gamma \to 1$. This isn't a small statistical fluctuation—it's a phase transition where the method fundamentally breaks.
```python
import numpy as np
from sklearn.linear_model import Ridge

def explore_gamma_transition(n_samples=200, gamma_range=np.linspace(0.1, 2.0, 20)):
    """
    Demonstrates the phase transition in test error as gamma = p/n increases.

    Key predictions from random matrix theory:
    - Test error ~ sigma^2 / (1 - gamma) for gamma < 1
    - Test error diverges as gamma -> 1
    - For gamma > 1, OLS is undefined (we use pinv)

    Shows how Ridge regression tames this transition.
    """
    sigma = 1.0  # Noise standard deviation
    results_ols = []
    results_ridge = []

    for gamma in gamma_range:
        n_features = int(gamma * n_samples)
        if n_features < 1:
            n_features = 1

        # Generate random data
        np.random.seed(42)
        X_train = np.random.randn(n_samples, n_features)
        X_test = np.random.randn(n_samples, n_features)

        # True coefficients (sparse: only first few are nonzero)
        beta_true = np.zeros(n_features)
        beta_true[:min(5, n_features)] = np.random.randn(min(5, n_features))

        # Generate targets
        y_train = X_train @ beta_true + np.random.randn(n_samples) * sigma
        y_test = X_test @ beta_true + np.random.randn(n_samples) * sigma

        # OLS (using pseudo-inverse for p > n)
        try:
            if n_features < n_samples:
                beta_ols = np.linalg.lstsq(X_train, y_train, rcond=None)[0]
            else:
                beta_ols = np.linalg.pinv(X_train) @ y_train
            test_pred_ols = X_test @ beta_ols
            test_mse_ols = np.mean((y_test - test_pred_ols) ** 2)
        except np.linalg.LinAlgError:
            test_mse_ols = np.nan

        # Ridge with appropriate regularization
        ridge = Ridge(alpha=1.0)
        ridge.fit(X_train, y_train)
        test_pred_ridge = ridge.predict(X_test)
        test_mse_ridge = np.mean((y_test - test_pred_ridge) ** 2)

        results_ols.append({'gamma': gamma, 'test_mse': test_mse_ols})
        results_ridge.append({'gamma': gamma, 'test_mse': test_mse_ridge})

        print(f"γ = {gamma:.2f}: OLS Test MSE = {test_mse_ols:.2f}, "
              f"Ridge Test MSE = {test_mse_ridge:.2f}")

    return results_ols, results_ridge

# Output shows the phase transition:
# γ = 0.10: OLS Test MSE = 1.12, Ridge Test MSE = 1.11
# γ = 0.30: OLS Test MSE = 1.45, Ridge Test MSE = 1.38
# γ = 0.50: OLS Test MSE = 2.01, Ridge Test MSE = 1.67
# γ = 0.70: OLS Test MSE = 3.34, Ridge Test MSE = 1.89
# γ = 0.90: OLS Test MSE = 10.12, Ridge Test MSE = 2.01  ← OLS diverging!
# γ = 0.95: OLS Test MSE = 24.56, Ridge Test MSE = 2.08
# γ = 1.00: OLS Test MSE = inf, Ridge Test MSE = 2.12    ← Singularity
# γ = 1.20: OLS Test MSE = 45.23, Ridge Test MSE = 2.34  ← Pseudo-inverse
# γ = 1.50: OLS Test MSE = 89.67, Ridge Test MSE = 2.67
# γ = 2.00: OLS Test MSE = 234.5, Ridge Test MSE = 3.12

# Key observation: Ridge remains well-behaved through the transition
```

High-dimensional settings aren't abstract mathematical curiosities—they're the norm in many important applications.
Deep neural networks operate in extreme high-dimensional settings (millions of parameters) yet often generalize well. This seems to contradict classical theory. Modern understanding involves implicit regularization from SGD, overparameterized learning, and the geometry of loss landscapes—topics of active research that extend rather than invalidate the insights here.
Case Study: Gene Expression Cancer Classification
The classic Golub et al. (1999) leukemia study had:

- $n = 72$ patient samples
- $p = 7{,}129$ gene-expression measurements per sample
- Two classes to distinguish: acute lymphoblastic leukemia (ALL) vs. acute myeloid leukemia (AML)
Naive classification would compare each gene's expression between the two cancer types, find 'significant' genes, and use them for prediction. But with 7,129 tests at $\alpha = 0.05$, roughly $7{,}129 \times 0.05 \approx 356$ genes would look 'significant' by chance alone, drowning out the genuinely informative ones.
The solution combined feature selection, regularization, and careful cross-validation—demonstrating that high-dimensional problems require specialized techniques.
Understanding high-dimensional settings transforms how we approach model building. Classical strategies fail; new principles become essential.
Key Principles for High-Dimensional Settings
Regularization is mandatory, not optional: Without constraining the solution space, you're guaranteed to overfit.
Sparsity assumptions are crucial: Believing only a few features matter (even if you don't know which) enables methods like Lasso to work.
Cross-validation is essential: Theoretical model selection criteria fail; empirical evaluation on held-out data is the only reliable guide.
Dimensionality reduction helps: PCA, feature selection, and domain-specific representations can reduce $p$ before fitting (see the sketch after these principles).
Understand the signal structure: High-dimensional methods work best when the true signal has exploitable structure (sparsity, low-rank, manifold).
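Here is a minimal sketch of how several of these principles combine in practice with scikit-learn; the synthetic data, the PCA-plus-ridge pipeline, and the hyperparameter grid are illustrative choices, not a prescription.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
n, p, k = 100, 2000, 5                       # p >> n, intrinsic dimension k

# Synthetic data whose signal lives in a k-dimensional latent subspace.
latent = rng.standard_normal((n, k))
X = latent @ rng.standard_normal((k, p)) + 0.1 * rng.standard_normal((n, p))
y = latent @ rng.standard_normal(k) + 0.5 * rng.standard_normal(n)

# Principles 1, 3, and 4 together: regularize, cross-validate, reduce dimension first.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("ridge", Ridge()),
])
search = GridSearchCV(
    pipeline,
    param_grid={"pca__n_components": [2, 5, 20, 50],
                "ridge__alpha": [0.1, 1.0, 10.0]},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
print("best settings:", search.best_params_)
print("cross-validated MSE:", -search.best_score_)
```

Cross-validation picks both the number of retained components and the ridge penalty, so the amount of dimensionality reduction and regularization is chosen empirically rather than assumed.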
Regularization can be understood as introducing prior beliefs about the parameters. Ridge regression corresponds to a Gaussian prior centered at zero; Lasso corresponds to a Laplace prior. These priors express our belief that coefficients should be small or sparse—exactly the inductive bias needed in high dimensions.
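To make this correspondence concrete, here is the standard maximum a posteriori (MAP) sketch for the ridge case, under the usual assumptions of Gaussian noise $y \mid X, \beta \sim \mathcal{N}(X\beta, \sigma^2 I)$ and an independent Gaussian prior $\beta_j \sim \mathcal{N}(0, \tau^2)$:

$$\hat{\beta}_{\text{MAP}} = \arg\max_{\beta} \left[ \log p(y \mid X, \beta) + \log p(\beta) \right] = \arg\min_{\beta} \left[ \|y - X\beta\|_2^2 + \frac{\sigma^2}{\tau^2} \|\beta\|_2^2 \right]$$

which is exactly ridge regression with $\lambda = \sigma^2/\tau^2$. Swapping the Gaussian prior for a Laplace prior turns the penalty into $\lambda \|\beta\|_1$, which is the Lasso.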
High-dimensional settings represent a phase transition in the difficulty of learning. Classical methods designed for $n >> p$ fail completely when $p$ approaches or exceeds $n$.
What's Next?
High dimensionality creates the conditions for overfitting, but the mechanism by which it manifests is through variance explosion—the coefficients of an overfit model become wildly unstable. The next page quantifies this explosion and shows exactly how it destroys generalization.
You now understand why high-dimensional settings are fundamentally different from the classical regime. The curse of dimensionality, rank deficiency, spurious correlations, and the p/n phase transition all conspire to make regularization essential—not merely helpful. Next, we'll examine exactly how variance explodes in these settings.