The phrase "curse of dimensionality" was coined by mathematician Richard Bellman in 1961 to describe a fundamental challenge in optimization and data analysis: as the number of dimensions increases, the volume of the space increases so rapidly that available data becomes sparse and many intuitions from low-dimensional spaces fail catastrophically.
For grid search, this curse manifests in several devastating ways: the number of grid points required grows exponentially with dimension, distances concentrate until "near" and "far" grid points become indistinguishable, the volume of the space piles up near its boundary, and even dense grids cover a negligible fraction of the search space.
Understanding these phenomena is essential for knowing when grid search is appropriate—and when it is fundamentally inadequate regardless of computational resources.
By the end of this page, you will understand the mathematical foundations of the curse of dimensionality, develop geometric intuition for high-dimensional spaces, see how this curse specifically impacts grid search, and learn to recognize when dimensionality makes grid search fundamentally inappropriate.
Let us develop the mathematical foundations that explain why high-dimensional spaces behave so counterintuitively.
Exponential Volume Growth:
Consider a $d$-dimensional unit hypercube $[0,1]^d$. To place a uniform grid with $n$ points per dimension, we need $n^d$ total points.
| Dimensions | Points per dim = 3 | Points per dim = 5 | Points per dim = 10 |
|---|---|---|---|
| 2 | 9 | 25 | 100 |
| 5 | 243 | 3,125 | 100,000 |
| 10 | 59,049 | 9,765,625 | 10 billion |
| 20 | 3.5 billion | 95 trillion | 10^20 |
The numbers quickly exceed not just computational capacity but the number of atoms in the observable universe (~$10^{80}$).
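The growth is easy to check directly. A minimal sketch using only the standard library (the values mirror the table above):

```python
# Exact grid sizes n**d using Python's arbitrary-precision integers.
ATOMS_IN_UNIVERSE = 10 ** 80  # rough figure quoted above

for d in (2, 5, 10, 20, 100):
    for n in (3, 5, 10):
        total = n ** d
        flag = "  <-- exceeds atoms in the observable universe" if total > ATOMS_IN_UNIVERSE else ""
        print(f"d={d:>3}, n={n:>2}: {total:.3e} grid points{flag}")
```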
Volume of a Hypersphere:
The volume of a $d$-dimensional ball of radius $r$ is:
$$V_d(r) = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)} r^d$$
where $\Gamma$ is the gamma function.
As $d \to \infty$, $V_d(1) \to 0$: the unit ball's volume vanishes, even though its bounding hypercube $[-1,1]^d$ has volume $2^d$. The fraction of the cube occupied by the inscribed ball therefore goes to zero, which means almost all of the cube's volume lies in the corners, far from the center.
```python
import numpy as np
from scipy.special import gamma


def hypersphere_volume(d: int, r: float = 1.0) -> float:
    """
    Compute the volume of a d-dimensional ball of radius r.

    V_d(r) = π^(d/2) / Γ(d/2 + 1) * r^d

    Key insight: for d >> 1 this volume approaches 0, even though
    the bounding hypercube has volume (2r)^d.
    """
    return (np.pi ** (d / 2)) / gamma(d / 2 + 1) * (r ** d)


def hypersphere_to_hypercube_ratio(d: int) -> float:
    """
    Ratio of inscribed sphere volume to bounding cube volume.

    This ratio tells us what fraction of the hypercube's volume
    is "near the center" (within the inscribed sphere).

    For d=2:   π/4 ≈ 0.785 (circle covers ~78% of square)
    For d=10:  ≈ 0.0025    (hypersphere covers ~0.25% of hypercube)
    For d=100: ≈ 10^-70    (essentially zero)
    """
    sphere_vol = hypersphere_volume(d, r=1.0)
    cube_vol = 2 ** d  # cube from -1 to 1 in each dimension
    return sphere_vol / cube_vol


def fraction_in_shell(d: int, shell_thickness: float = 0.01) -> float:
    """
    Compute the fraction of a hypersphere's volume in its outer shell.

    For a ball of radius r=1, the shell is [1-ε, 1]. The fraction is
    1 - (1-ε)^d. For small ε and large d, almost ALL volume is in the
    shell. This is why "the center is empty" in high dimensions.
    """
    inner_radius = 1 - shell_thickness
    return 1 - (inner_radius ** d)


# Demonstrate the curse
print("=== Volume Ratios by Dimension ===")
print("Dimension | Sphere/Cube Ratio | % in Outer 1% Shell")
print("-" * 55)

for d in [2, 3, 5, 10, 20, 50, 100]:
    ratio = hypersphere_to_hypercube_ratio(d)
    shell_frac = fraction_in_shell(d, 0.01)
    print(f"{d:^9} | {ratio:>17.2e} | {shell_frac*100:>17.2f}%")


def demonstrate_distance_concentration(n_samples: int = 10000):
    """
    Show that in high dimensions, all pairwise distances converge
    to the same value.

    For uniformly distributed points in [0,1]^d:
    - Expected distance: √(d/6) (for large d)
    - Variance of distances: approaches 0 as d increases

    This means "near" and "far" become indistinguishable—
    nearest neighbor search loses meaning.
    """
    print("\n=== Distance Concentration ===")
    print("Dimension | Mean Distance | Std Dev | Coefficient of Variation")
    print("-" * 65)

    for d in [2, 5, 10, 20, 50, 100, 200]:
        # Generate random points in [0,1]^d
        points = np.random.uniform(0, 1, size=(n_samples, d))

        # Compute pairwise distances (sample pairs for efficiency)
        n_pairs = min(10000, n_samples * (n_samples - 1) // 2)
        distances = []
        for _ in range(n_pairs):
            i, j = np.random.choice(n_samples, 2, replace=False)
            distances.append(np.linalg.norm(points[i] - points[j]))

        mean_dist = np.mean(distances)
        std_dist = np.std(distances)
        cv = std_dist / mean_dist  # coefficient of variation

        print(f"{d:^9} | {mean_dist:>13.4f} | {std_dist:>7.4f} | {cv:>24.4f}")

    print("\nNote: Coefficient of variation → 0 means distances concentrate")
    print("      around the mean, making nearest/farthest indistinguishable.")


demonstrate_distance_concentration()


def corner_concentration():
    """
    Demonstrate that in high dimensions, random points concentrate
    near the boundary (and hence the corners).

    For a uniform point in [0,1]^d, the probability that it lies
    within distance ε of the boundary is:

        P(near boundary) = 1 - (1 - 2ε)^d

    For ε = 0.1 and d = 10:  P ≈ 0.89 (89% near boundary)
    For ε = 0.1 and d = 100: P ≈ 1.0  (essentially all points)
    """
    print("\n=== Corner Concentration ===")
    print("Dimension | P(within 10% of boundary)")
    print("-" * 40)

    epsilon = 0.1
    for d in [2, 5, 10, 20, 50, 100]:
        p_near_boundary = 1 - (1 - 2 * epsilon) ** d
        print(f"{d:^9} | {p_near_boundary*100:>23.1f}%")

    print("\nThis means grid points near the center are 'wasted'—")
    print("the actual optimal hyperparameters are likely near the edges.")


corner_concentration()
```

Our geometric intuition is built from 2D and 3D experience. In high dimensions: the center is empty, most volume is in the corners, all points are nearly equidistant, and the hypersphere vanishes inside its bounding hypercube. These counterintuitive facts fundamentally change the search problem.
One of the most striking effects of high dimensionality is distance concentration: as dimensions increase, the ratio between the nearest and farthest points approaches 1. All distances become approximately equal.
Formal Statement:
For uniformly distributed random points in $[0,1]^d$, let $D_{\min}$ and $D_{\max}$ be the minimum and maximum distances to a fixed query point. Then:
$$\lim_{d \to \infty} \frac{D_{\max} - D_{\min}}{D_{\min}} = 0$$
In words: the relative contrast between near and far vanishes.
Implications for Grid Search:
If all grid points are approximately equidistant from any target point, then the best grid point is barely closer to the optimum than a randomly chosen one, refining the grid adds little information, and the grid's structure gives no guidance about where the optimum might lie.
```python
import numpy as np


def analyze_distance_contrast(d: int, n_grid: int, n_trials: int = 100) -> dict:
    """
    Measure the contrast between nearest and farthest grid points
    from random target points.

    The contrast ratio (D_max - D_min) / D_min tells us how
    distinguishable near and far points are. In high dimensions,
    this ratio → 0.

    Parameters
    ----------
    d : int
        Number of dimensions
    n_grid : int
        Number of grid points per dimension
    n_trials : int
        Number of random target points to test
    """
    # Create uniform grid
    grid_1d = np.linspace(0, 1, n_grid)
    grid_coords = np.array(np.meshgrid(*[grid_1d] * d)).T.reshape(-1, d)

    contrasts = []
    for _ in range(n_trials):
        # Random target point
        target = np.random.uniform(0, 1, d)

        # Distances to all grid points
        distances = np.linalg.norm(grid_coords - target, axis=1)
        d_min = distances.min()
        d_max = distances.max()

        # Contrast ratio
        if d_min > 0:
            contrasts.append((d_max - d_min) / d_min)

    return {
        'dimension': d,
        'n_grid_points': len(grid_coords),
        'mean_contrast': np.mean(contrasts),
        'std_contrast': np.std(contrasts),
        'expected_d_min': np.mean([
            np.min(np.linalg.norm(grid_coords - np.random.uniform(0, 1, d), axis=1))
            for _ in range(100)
        ]),
    }


def demonstrate_contrast_collapse():
    """
    Show how distance contrast collapses with dimension.

    For grid search, this means:
    - In 2D, the nearest grid point is meaningfully closer than the farthest
    - In 20D, all grid points are approximately the same distance away
    - The grid structure provides no advantage
    """
    print("=== Distance Contrast by Dimension ===")
    print("Dimension | Grid Points | Mean Contrast | Interpretation")
    print("-" * 70)

    for d in [2, 3, 5, 10, 15, 20]:
        # Use 3 points per dimension (total 3^d points)
        if 3 ** d <= 1_000_000:  # keep computation reasonable
            result = analyze_distance_contrast(d, n_grid=3, n_trials=50)

            if result['mean_contrast'] > 1.0:
                interpretation = "Clear near/far distinction"
            elif result['mean_contrast'] > 0.5:
                interpretation = "Moderate distinction"
            elif result['mean_contrast'] > 0.2:
                interpretation = "Weak distinction"
            else:
                interpretation = "No meaningful distinction"

            print(f"{d:^9} | {result['n_grid_points']:>11} | "
                  f"{result['mean_contrast']:>13.3f} | {interpretation}")


def grid_coverage_analysis(d: int, n_grid: int, epsilon: float = 0.1) -> float:
    """
    Compute what fraction of the search space is within ε of a grid point.

    For a grid with n points per dimension in [0,1]^d:
    - Grid spacing is 1/(n-1)
    - An ε-ball around each point covers roughly ε^d volume
    - Total coverage ≈ n^d × ε^d ... but balls overlap and boundaries truncate

    In high dimensions, even dense grids cover negligible volume.
    """
    if d > 8:
        # For very high dimensions, use a theoretical approximation:
        # coverage ≈ n^d × V_d(ε), where V_d is the d-ball volume
        from scipy.special import gamma
        ball_volume = (np.pi ** (d / 2)) / gamma(d / 2 + 1) * (epsilon ** d)
        n_balls = n_grid ** d
        raw_coverage = n_balls * ball_volume
        # Account for overlap (very rough correction)
        return min(1.0, raw_coverage * 0.5)

    # Monte Carlo estimate for lower dimensions
    grid_1d = np.linspace(0, 1, n_grid)
    grid_coords = np.array(np.meshgrid(*[grid_1d] * d)).T.reshape(-1, d)

    n_samples = 10000
    random_points = np.random.uniform(0, 1, (n_samples, d))

    covered = 0
    for point in random_points:
        min_dist = np.min(np.linalg.norm(grid_coords - point, axis=1))
        if min_dist <= epsilon:
            covered += 1

    return covered / n_samples


def demonstrate_coverage_collapse():
    """
    Show how grid coverage of the space collapses with dimension.

    Even with 5 points per dimension, coverage → 0 rapidly.
    """
    print("\n=== Grid Coverage Analysis ===")
    print("Dimension | Grid Points | Coverage within ε=0.1")
    print("-" * 50)

    for d in [2, 3, 5, 7, 10, 15, 20]:
        n_grid = 5  # 5 points per dimension
        if 5 ** d <= 1e7:
            coverage = grid_coverage_analysis(d, n_grid, epsilon=0.1)
            print(f"{d:^9} | {5 ** d:>11,} | {coverage*100:>19.4f}%")
        else:
            print(f"{d:^9} | {5 ** d:>11.2e} | Negligible")


demonstrate_contrast_collapse()
demonstrate_coverage_collapse()
```

Imagine a needle (the optimal hyperparameters) hidden in a haystack (the hyperparameter space). In 2D, a fine grid will pass close to the needle. In 20D, even trillions of grid points may all be equally far from the needle—the grid structure provides no guidance.
The curse of dimensionality causes specific failure modes for grid search that practitioners must understand.
Failure Mode 1: Missing the Optimal Region Entirely
With $n$ points per dimension, the grid spacing is $\Delta = 1/(n-1)$ in normalized coordinates. The probability that the true optimum falls within $\epsilon$ of any grid point depends on coverage, which goes to zero exponentially.
If the objective function is smooth, the best grid point will be within $O(\Delta)$ of the optimum. But in high dimensions, even $\Delta = 0.1$ leaves vast unexplored regions.
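A quick worked bound makes this concrete: the farthest any point of $[0,1]^d$ can lie from its nearest grid point is half the diagonal of a grid cell. With $n = 5$ points per dimension ($\Delta = 0.25$) in $d = 10$ dimensions,

$$\text{worst-case distance} = \frac{\sqrt{d}\,\Delta}{2} = \frac{\sqrt{10} \times 0.25}{2} \approx 0.40,$$

so even a grid of $5^{10} \approx 9.8$ million points can leave the optimum a Euclidean distance of about 0.40 away from every evaluated configuration, despite each axis being sampled at spacing 0.25.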
Failure Mode 2: Axis-Aligned Blindness
Grid search places points on an axis-aligned lattice. If the optimal region is oriented diagonally (hyperparameters interact), the grid may completely miss it.
Example: If learning_rate and regularization have an optimal ratio (not absolute values), a grid that samples each independently will only accidentally hit configurations with the right ratio.
```python
import numpy as np
from itertools import product


def demonstrate_axis_aligned_failure():
    """
    Show how grid search fails when the optimal region is diagonal.

    Consider an objective whose optimum lies along y = x,
    not at any grid intersection. The grid misses it.
    """
    # Objective: minimum along the diagonal y = x, centered at (0.37, 0.37)
    def diagonal_objective(x, y):
        diagonal_dist = np.abs(y - x) / np.sqrt(2)   # distance from the diagonal
        center_dist = np.sqrt((x - 0.37) ** 2 + (y - 0.37) ** 2)
        return diagonal_dist + 0.1 * center_dist

    # Create a 5x5 grid
    grid_points = np.linspace(0, 1, 5)
    grid_coords = np.array([(x, y) for x in grid_points for y in grid_points])

    # Evaluate the grid
    grid_scores = [diagonal_objective(x, y) for x, y in grid_coords]
    best_grid_idx = np.argmin(grid_scores)
    best_grid_point = grid_coords[best_grid_idx]
    best_grid_score = grid_scores[best_grid_idx]

    # True optimum
    true_optimum = (0.37, 0.37)
    true_score = diagonal_objective(*true_optimum)

    print("=== Axis-Aligned Failure Demonstration ===")
    print("Objective: minimum along diagonal y = x, centered at (0.37, 0.37)")
    print("Grid size: 5x5 = 25 points")
    print(f"Best grid point: ({best_grid_point[0]:.2f}, {best_grid_point[1]:.2f})")
    print(f"Best grid score: {best_grid_score:.4f}")
    print(f"True optimum: {true_optimum}")
    print(f"True optimal score: {true_score:.4f}")
    print()
    # The true optimum scores exactly 0, so report the absolute gap
    # rather than a (division-by-zero) relative percentage.
    print(f"Grid suboptimality (absolute score gap): {best_grid_score - true_score:.4f}")
    print("The grid misses the diagonal optimum because it samples axis-aligned!")


def demonstrate_sparse_coverage(d: int = 10, n_grid: int = 3):
    """
    Show how sparse grid coverage becomes in high dimensions.

    Even with a reasonable number of points, the space is mostly unexplored.
    """
    n_total = n_grid ** d          # total grid points
    spacing = 1 / (n_grid - 1)     # grid spacing in [0,1]

    # Volume "near" a grid point: a hypercube of side spacing/2 around each point
    near_volume_per_point = (spacing / 2) ** d
    total_near_volume = n_total * near_volume_per_point

    # Upper bound on coverage (ignores overlap and boundary effects)
    coverage_fraction = min(1.0, total_near_volume)

    print(f"\n=== Sparse Coverage in {d}D ===")
    print(f"Grid: {n_grid} points per dimension = {n_total:,} total points")
    print(f"Grid spacing: {spacing:.3f}")
    print(f"Volume 'near' each point (spacing/2 hypercube): {near_volume_per_point:.2e}")
    print(f"Total covered volume (upper bound): {total_near_volume:.2e}")
    print(f"Fraction of space explored: {coverage_fraction*100:.2e}%")

    if coverage_fraction < 0.01:
        print("\n⚠ Less than 1% of the space is within half a grid spacing of any point!")
        print("  The grid is effectively useless for locating narrow optima.")


def marginal_value_analysis(d: int = 10):
    """
    Analyze the marginal value of adding grid points in high dimensions.

    Key insight: in high dimensions, adding more points provides
    diminishing returns much faster than in low dimensions.
    """
    print(f"\n=== Marginal Value of Grid Points in {d}D ===")
    print("Points/Dim | Total Points | Avg Distance to Random | Marginal Speedup")
    print("-" * 75)

    for n_grid in [2, 3, 4, 5, 6, 7, 8]:
        n_total = n_grid ** d

        if n_total > 1e9:
            print(f"{n_grid:^10} | {n_total:>12.2e} | Intractable")
            continue

        if d <= 5:
            # Measure the average distance from random points to the
            # nearest grid point directly.
            grid_1d = np.linspace(0, 1, n_grid)
            grid_coords = np.array(list(product(*[grid_1d] * d)))
            distances = []
            for _ in range(1000):
                point = np.random.uniform(0, 1, d)
                distances.append(np.min(np.linalg.norm(grid_coords - point, axis=1)))
            avg_distance = np.mean(distances)
        else:
            # Theoretical estimate: the worst-case nearest-grid-point distance
            # is about sqrt(d) * spacing / 2 for spacing 1/(n-1).
            spacing = 1 / (n_grid - 1)
            avg_distance = np.sqrt(d) * spacing / 2

        # Marginal cost of going from n-1 to n points per dimension
        if n_grid > 2:
            prev_total = (n_grid - 1) ** d
            marginal = f"{n_total / prev_total:.1f}x cost for modest improvement"
        else:
            marginal = "Baseline"

        print(f"{n_grid:^10} | {n_total:>12,} | {avg_distance:>22.4f} | {marginal}")


demonstrate_axis_aligned_failure()
demonstrate_sparse_coverage(d=10, n_grid=3)
marginal_value_analysis(d=8)
```

| Failure Mode | Cause | Symptom | Mitigation |
|---|---|---|---|
| Missing optima | Sparse coverage in high-d | Best grid point far from optimum | Use random search or Bayesian optimization |
| Axis-aligned blindness | Grid samples independently | Misses diagonal correlations | Use non-axis-aligned sampling (Latin hypercube) |
| Wasted computation | Uniform coverage | Many evaluations in poor regions | Use adaptive methods |
| False confidence | Best grid point reported | Appears successful despite suboptimality | Compare with random baseline |
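One of the mitigations named above, Latin hypercube sampling, is easy to try. The sketch below is a minimal illustration assuming SciPy's scipy.stats.qmc module (SciPy 1.7+) is available; the parameter bounds are made up for the example. Unlike a grid, every one-dimensional projection of the design lands in a distinct stratum.

```python
import numpy as np
from scipy.stats import qmc

# Latin hypercube design: n_samples points in d dimensions; each 1-D
# projection falls into n_samples distinct, evenly stratified bins.
d, n_samples = 5, 20
sampler = qmc.LatinHypercube(d=d, seed=0)
unit_samples = sampler.random(n=n_samples)   # points in [0, 1)^d

# Scale to hypothetical hyperparameter ranges (illustrative bounds only)
lower = [1e-4, 1e-5, 2, 0.1, 16]
upper = [1e-1, 1e-2, 10, 0.9, 256]
configs = qmc.scale(unit_samples, lower, upper)

print(f"{n_samples} Latin hypercube configurations in {d}D")
print("Distinct values per dimension:",
      [len(np.unique(unit_samples[:, j])) for j in range(d)])
```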
A landmark 2012 paper by James Bergstra and Yoshua Bengio provided both theoretical and empirical evidence that random search outperforms grid search for hyperparameter optimization, particularly as dimensionality increases.
Key Insight: Low Effective Dimensionality
In practice, not all hyperparameters matter equally. A few hyperparameters often dominate performance (e.g., learning rate), while others have minimal effect (e.g., momentum coefficient). This is the concept of low effective dimensionality.
Grid Search's Fatal Flaw:
Suppose only 2 of 10 hyperparameters actually matter. With a 5-point grid, grid search evaluates $5^{10} = 9{,}765{,}625$ configurations, yet only $5^2 = 25$ of them are distinct in the two dimensions that matter: each of those 25 combinations is re-evaluated $5^8 = 390{,}625$ times with different settings of the irrelevant hyperparameters.
With 25 random samples, we match grid search's exploration of the important dimensions while spending 400,000× less compute!
```python
import numpy as np
from typing import Dict
from itertools import product


def simulate_low_effective_dimensionality(
    n_total_dims: int = 10,
    n_important_dims: int = 2,
    n_grid_points: int = 5,
    n_random_samples: int = 25,
) -> Dict:
    """
    Demonstrate the Bergstra-Bengio phenomenon.

    Scenario:
    - n_total_dims hyperparameters in the search space
    - Only n_important_dims actually affect the objective
    - The rest are noise dimensions

    We compare:
    1. Full grid search (5^10 ≈ 9.7M points)
    2. Random search with just 25 samples

    Shows how random search achieves equivalent coverage of the
    important dimensions with vastly less computation.
    """
    np.random.seed(42)

    def objective(params: np.ndarray) -> float:
        # Only the first n_important_dims affect the objective:
        # sum of squared distances from 0.3 (an arbitrary optimum).
        important = params[:n_important_dims]
        return np.sum((important - 0.3) ** 2)

    # === Grid search ===
    grid_1d = np.linspace(0, 1, n_grid_points)

    # The full grid would have n_grid_points^n_total_dims configurations,
    # but only n_grid_points^n_important_dims of them are distinct in the
    # dimensions that matter.
    full_grid_size = n_grid_points ** n_total_dims
    unique_important_grid = n_grid_points ** n_important_dims

    # Simulate by evaluating only the important-dimension grid
    important_grid = np.array(list(product(*[grid_1d] * n_important_dims)))
    grid_scores = [
        objective(np.concatenate([p, np.zeros(n_total_dims - n_important_dims)]))
        for p in important_grid
    ]
    best_grid_score = min(grid_scores)

    # === Random search ===
    random_points = np.random.uniform(0, 1, (n_random_samples, n_total_dims))
    random_scores = [objective(p) for p in random_points]
    best_random_score = min(random_scores)

    # Coverage comparison: how many distinct "regions" (at grid resolution)
    # of the important dimensions does random search visit?
    grid_1d_spacing = 1 / (n_grid_points - 1)
    random_important = random_points[:, :n_important_dims]
    random_quantized = np.round(random_important / grid_1d_spacing) * grid_1d_spacing
    n_unique_random_regions = len(set(map(tuple, random_quantized)))

    print("=== Bergstra-Bengio Simulation ===")
    print(f"Total dimensions: {n_total_dims}")
    print(f"Important dimensions: {n_important_dims}")
    print()
    print("Grid Search:")
    print(f"  Total configurations: {full_grid_size:,}")
    print(f"  Unique important-dim combinations: {unique_important_grid}")
    print(f"  Best score found: {best_grid_score:.6f}")
    print()
    print("Random Search:")
    print(f"  Total samples: {n_random_samples}")
    print(f"  Unique important-dim regions covered: {n_unique_random_regions}")
    print(f"  Best score found: {best_random_score:.6f}")
    print()
    print(f"Efficiency ratio: {full_grid_size / n_random_samples:,.0f}x")
    print(f"Grid uses {full_grid_size / n_random_samples:,.0f}x more compute for similar coverage")
    print("of the dimensions that actually matter!")

    return {
        'grid_size': full_grid_size,
        'random_size': n_random_samples,
        'grid_important_unique': unique_important_grid,
        'random_important_unique': n_unique_random_regions,
        'best_grid': best_grid_score,
        'best_random': best_random_score,
    }


def visualize_projection_advantage():
    """
    Show why random search covers the important dimensions better.

    When projected onto a low-dimensional subspace (the important
    dimensions), random points spread out while grid points
    collapse onto a sparse lattice.
    """
    n_dims = 6
    n_important = 2
    n_grid = 3      # 3^6 = 729 grid points
    n_random = 30

    # Grid points, projected onto the important dimensions
    grid_1d = np.linspace(0, 1, n_grid)
    grid_full = np.array(list(product(*[grid_1d] * n_dims)))
    grid_projected = grid_full[:, :n_important]

    # Random points, projected the same way
    np.random.seed(42)
    random_full = np.random.uniform(0, 1, (n_random, n_dims))
    random_projected = random_full[:, :n_important]

    print("\n=== Projection Advantage ===")
    print(f"Grid: {len(grid_full)} points in {n_dims}D, projected to {n_important}D")
    print(f"Random: {len(random_full)} points in {n_dims}D, projected to {n_important}D")
    print()

    # Count unique projected positions (at grid resolution)
    grid_unique_projected = len(set(map(tuple, grid_projected)))
    random_quantized = np.round(random_projected * (n_grid - 1)) / (n_grid - 1)
    random_unique_projected = len(set(map(tuple, random_quantized)))

    print(f"Grid: {grid_unique_projected} unique positions when projected")
    print(f"  (only {n_grid}^{n_important} = {n_grid**n_important} distinct values!)")
    print(f"Random: {random_unique_projected} unique positions (at grid resolution)")
    print()
    print("Key insight: grid samples collapse to a few unique values in the important")
    print("dimensions, while random samples spread out, exploring more possibilities.")


simulate_low_effective_dimensionality()
visualize_projection_advantage()
```

For hyperparameter optimization with more than 3-4 dimensions, random search typically finds better configurations than grid search with the same computational budget. This is especially true when some hyperparameters are more important than others—which is almost always the case in practice.
Despite the curse of dimensionality, grid search remains appropriate in specific scenarios. Understanding these cases prevents both overuse and underuse of the technique.
Scenario 1: Truly Low-Dimensional Search Spaces (d ≤ 3)
With 2-3 hyperparameters, the curse has not yet taken hold: even 10 points per dimension yields at most 1,000 configurations, coverage of the space is genuinely dense, and the exhaustive, reproducible sweep that grid search provides is a real advantage.
Scenario 2: Discrete Hyperparameters with Few Options
When hyperparameters have limited discrete options:
- kernel ∈ {'linear', 'rbf', 'poly'}: 3 options
- criterion ∈ {'gini', 'entropy'}: 2 options

Even with 10 such hyperparameters, the total might be $3 \times 2 \times 4 \times \ldots = 1000$ — manageable.
Scenario 3: Known Narrow Effective Range
When domain knowledge provides tight bounds (for example, prior experiments that narrow the useful learning-rate range to a single order of magnitude), each dimension needs only a handful of grid points.
The effective dimensionality is reduced by prior knowledge.
A practical rule of thumb: if the total grid size (including CV folds) exceeds 10,000 model fits, seriously consider random search or Bayesian optimization instead. Beyond 100,000 fits, grid search is almost certainly the wrong choice.
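This rule of thumb is easy to automate before launching a search. A minimal sketch (the helper name and example grid are illustrative; ParameterGrid is scikit-learn's grid enumerator):

```python
from sklearn.model_selection import ParameterGrid

def grid_budget_check(param_grid: dict, cv_folds: int = 5) -> str:
    """Apply the rule of thumb: flag grids whose total model fits exceed
    10,000 (reconsider) or 100,000 (almost certainly the wrong tool)."""
    n_configs = len(ParameterGrid(param_grid))
    total_fits = n_configs * cv_folds
    if total_fits > 100_000:
        verdict = "grid search is almost certainly the wrong choice"
    elif total_fits > 10_000:
        verdict = "seriously consider random search or Bayesian optimization"
    else:
        verdict = "grid search is plausible"
    return f"{n_configs:,} configs x {cv_folds} folds = {total_fits:,} fits -> {verdict}"

# Example with the kind of discrete options from Scenario 2
example_grid = {
    "kernel": ["linear", "rbf", "poly"],
    "C": [0.01, 0.1, 1, 10, 100],
    "gamma": ["scale", "auto", 0.01, 0.1],
}
print(grid_budget_check(example_grid, cv_folds=5))
```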
When faced with high-dimensional hyperparameter spaces, strategies exist to reduce effective dimensionality and make grid search more viable.
Strategy 1: Hyperparameter Sensitivity Analysis
Before full optimization, identify which hyperparameters matter most: vary each one across its range while holding the others at sensible defaults, measure the spread of cross-validation scores, and rank the hyperparameters by that spread (implemented as hyperparameter_sensitivity_analysis in the code below).
Strategy 2: Hierarchical/Sequential Tuning
Tune hyperparameters in stages: tune the most influential hyperparameters first on a coarse grid, fix them at their best values, then tune the next group, and repeat until every hyperparameter is set.
This reduces a 10-dimensional grid to three 3-4 dimensional grids.
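The arithmetic behind that reduction, assuming for illustration 4 candidate values per hyperparameter and a 4 + 3 + 3 split of the dimensions:

$$\underbrace{4^{10}}_{\text{full grid}} = 1{,}048{,}576 \qquad \text{vs.} \qquad \underbrace{4^{4} + 4^{3} + 4^{3}}_{\text{three stages}} = 256 + 64 + 64 = 384,$$

a reduction of roughly 2,700×. The trade-off is that interactions between hyperparameters tuned in different stages can be missed.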
Strategy 3: Coupling Hyperparameters
Some hyperparameters should vary together: learning rate and batch size often scale jointly, and model capacity and regularization strength trade off against each other.
Create composite "meta-hyperparameters" that control coupled parameters.
```python
import numpy as np
from typing import Dict, List, Any, Callable
from itertools import product
from sklearn.model_selection import GridSearchCV, cross_val_score


def hyperparameter_sensitivity_analysis(
    model_class,
    param_ranges: Dict[str, List[Any]],
    X, y,
    default_params: Dict[str, Any],
    cv: int = 3,
) -> Dict[str, float]:
    """
    Analyze the sensitivity of model performance to each hyperparameter.

    For each hyperparameter:
    1. Hold all others at default values
    2. Vary this hyperparameter across its range
    3. Measure the spread of cross-validation scores

    Returns sensitivity scores (higher = more sensitive = more important).
    """
    sensitivities = {}

    for param_name, param_values in param_ranges.items():
        scores_for_param = []

        for value in param_values:
            # Configuration with only this parameter varied
            config = default_params.copy()
            config[param_name] = value

            model = model_class(**config)
            cv_scores = cross_val_score(model, X, y, cv=cv)
            scores_for_param.append(cv_scores.mean())

        # Sensitivity = range of scores when varying this parameter
        sensitivity = max(scores_for_param) - min(scores_for_param)
        sensitivities[param_name] = sensitivity

        print(f"{param_name}: sensitivity = {sensitivity:.4f}")
        print(f"  Values: {param_values}")
        print(f"  Scores: {[f'{s:.4f}' for s in scores_for_param]}")

    # Rank by sensitivity
    ranked = sorted(sensitivities.items(), key=lambda x: x[1], reverse=True)
    print("\nRanked by sensitivity:")
    for param, sens in ranked:
        print(f"  {param}: {sens:.4f}")

    return sensitivities


def hierarchical_tuning(
    model_class,
    stage1_grid: Dict[str, List[Any]],
    stage2_grid: Dict[str, List[Any]],
    X, y,
    cv: int = 5,
) -> Dict[str, Any]:
    """
    Two-stage hierarchical hyperparameter tuning.

    Stage 1: tune primary hyperparameters (coarse grid)
    Stage 2: tune secondary hyperparameters with Stage 1 values fixed

    This reduces exponential complexity: instead of
    (n1 × n2 × ... × nk) configurations we evaluate
    (n1 × n2) + (n3 × n4 × ...) configurations.
    """
    print("=== Hierarchical Tuning ===")

    # Stage 1: primary hyperparameters
    print("\nStage 1: Primary Hyperparameters")
    stage1_size = int(np.prod([len(v) for v in stage1_grid.values()]))
    print(f"  Grid size: {stage1_size}")

    grid1 = GridSearchCV(model_class(), stage1_grid, cv=cv, n_jobs=-1)
    grid1.fit(X, y)
    stage1_best = grid1.best_params_
    print(f"  Best params: {stage1_best}")
    print(f"  Best score: {grid1.best_score_:.4f}")

    # Stage 2: secondary hyperparameters, with the Stage 1 winners fixed on
    # the base estimator (standard scikit-learn estimators accept them directly).
    print("\nStage 2: Secondary Hyperparameters")
    stage2_size = int(np.prod([len(v) for v in stage2_grid.values()]))
    print(f"  Grid size: {stage2_size}")

    grid2 = GridSearchCV(model_class(**stage1_best), stage2_grid, cv=cv, n_jobs=-1)
    grid2.fit(X, y)
    stage2_best = grid2.best_params_
    print(f"  Best params: {stage2_best}")
    print(f"  Best score: {grid2.best_score_:.4f}")

    # Combine and compare against the full grid
    final_params = {**stage1_best, **stage2_best}
    full_grid = {**stage1_grid, **stage2_grid}
    full_size = int(np.prod([len(v) for v in full_grid.values()]))
    hierarchical_size = stage1_size + stage2_size

    print("\n=== Comparison ===")
    print(f"Full grid size: {full_size}")
    print(f"Hierarchical size: {hierarchical_size}")
    print(f"Reduction: {full_size / hierarchical_size:.1f}x")
    print(f"Final params: {final_params}")

    return final_params


def coupled_hyperparameter_grid(
    coupling_rules: Dict[str, Callable[[float], Dict[str, Any]]],
    meta_values: Dict[str, List[float]],
) -> List[Dict[str, Any]]:
    """
    Generate configurations with coupled hyperparameters.

    Instead of independent grids for learning_rate and batch_size,
    define a single "scale" meta-parameter that sets both according
    to a known relationship (e.g., the linear scaling rule).

    Parameters
    ----------
    coupling_rules : dict mapping meta-param names to functions
        Each function takes the meta-param value and returns a dict
        of actual hyperparameters.
    meta_values : dict mapping meta-param names to the values to try

    Example:
        coupling_rules = {
            'scale': lambda s: {'learning_rate': 0.01 * s, 'batch_size': int(32 * s)}
        }
        meta_values = {'scale': [0.5, 1.0, 2.0, 4.0]}
    """
    meta_names = list(meta_values.keys())
    all_meta_combos = list(product(*meta_values.values()))

    configs = []
    for combo in all_meta_combos:
        config = {}
        for name, value in zip(meta_names, combo):
            config.update(coupling_rules[name](value))
        configs.append(config)

    print(f"Generated {len(configs)} coupled configurations:")
    for config in configs:
        print(f"  {config}")

    return configs


# Example: coupled hyperparameters for a neural network
coupling_rules = {
    'scale': lambda s: {
        'learning_rate': 0.001 * s,
        'batch_size': int(32 * s),
    },
    'capacity': lambda c: {
        'hidden_layer_sizes': tuple([int(100 * c)] * 2),
        'alpha': 0.01 / c,   # stronger regularization for larger models
    },
}

meta_values = {
    'scale': [0.5, 1.0, 2.0, 4.0],
    'capacity': [0.5, 1.0, 2.0],
}

print("=== Coupled Hyperparameter Grid ===")
configs = coupled_hyperparameter_grid(coupling_rules, meta_values)
print(f"\nTotal configurations: {len(configs)}")
print("vs an independent grid over the same values: 4 × 4 × 3 × 3 = 144")
```

Preliminary sensitivity analysis often reveals that 2-3 hyperparameters account for 90% of performance variance. Concentrating grid search on these while fixing others at defaults can make even complex models tractable while losing little optimization quality.
We have explored how high-dimensional hyperparameter spaces fundamentally challenge grid search and developed strategies for mitigation. Let's consolidate the key concepts:
The Critical Threshold:
As a practical guideline: grid search is appropriate for roughly three or fewer hyperparameters or for small discrete spaces, random search or Bayesian optimization should be preferred beyond that, and any plan requiring more than about 10,000 model fits calls for a different method.
What's Next:
Despite its limitations, grid search remains a valuable tool when applied appropriately. The next page provides practical guidelines for designing effective grid searches, including heuristics for setting ranges, choosing resolution, and combining grid search with other methods.
You now understand why the curse of dimensionality fundamentally limits grid search to low-dimensional hyperparameter spaces. You can recognize when grid search will fail, apply dimensionality reduction strategies, and make informed decisions about when to use alternatives.