Every machine learning model carries two distinct types of parameters: model parameters learned automatically from data during training, and hyperparameters that must be specified before training begins. While gradient descent elegantly handles the former, the search for optimal hyperparameters presents a fundamentally different optimization challenge—one where the objective function is expensive to evaluate, potentially non-convex, and often lacks gradient information.
Grid search stands as the most intuitive and historically significant approach to this challenge. By systematically evaluating every combination of hyperparameter values from a predefined grid, it provides comprehensive coverage of the search space with mathematical guarantees that no candidate configuration is overlooked.
By the end of this page, you will understand the complete mechanics of exhaustive grid search, including its mathematical foundations, implementation strategies, evaluation protocols, and the theoretical guarantees it provides. You'll gain deep insight into why grid search remains a fundamental technique despite its apparent simplicity.
Before diving into grid search specifically, we must precisely formulate the hyperparameter optimization problem it aims to solve.
Formal Problem Definition:
Given a machine learning algorithm $\mathcal{A}$, a hyperparameter configuration space $\Lambda$, a training dataset $\mathcal{D}_{\text{train}}$, a validation dataset $\mathcal{D}_{\text{val}}$, and a performance metric $\mathcal{L}$, we seek:
$$\lambda^* = \arg\min_{\lambda \in \Lambda} \mathcal{L}(\mathcal{A}_{\lambda}(\mathcal{D}_{\text{train}}), \mathcal{D}_{\text{val}})$$
where $\mathcal{A}_{\lambda}$ denotes the model trained with hyperparameters $\lambda$.
This formulation reveals several critical characteristics that distinguish hyperparameter optimization from conventional optimization:
Black-Box Objective: The function $\mathcal{L}(\lambda)$ is a black box—we can evaluate it for any $\lambda$ but have no closed-form expression. Each evaluation requires training a model, which is computationally expensive.
No Gradient Information: Unlike training with backpropagation, we typically cannot compute $\nabla_\lambda \mathcal{L}$. The relationship between hyperparameters and validation performance is only observable through empirical evaluation.
Potentially Complex Landscape: The objective function may be non-convex, multi-modal, or contain regions of poor conditioning. There's no guarantee that local optimization methods will find global optima.
| Characteristic | Description | Implication for Grid Search |
|---|---|---|
| Expensive Evaluations | Each point requires full model training | Must minimize total evaluations or parallelize |
| Black-Box Nature | No analytical form for objective | Cannot use gradient-based methods directly |
| Mixed Variable Types | Continuous, discrete, categorical parameters | Must handle all types uniformly |
| Conditional Dependencies | Some parameters only relevant given others | Grid may contain invalid combinations |
| Non-Stationary | Optimal values may differ across datasets | Cannot rely on prior solutions directly |
The fundamental tension in hyperparameter optimization is between exploration quality (how thoroughly we search) and computational budget (how many evaluations we can afford). Grid search takes the position of maximum exploration quality within a discrete structured search space.
Grid search operationalizes exhaustive enumeration through a structured algorithm that systematically explores every point in a discrete hyperparameter grid.
Algorithm Definition:
Let $\Lambda = \Lambda_1 \times \Lambda_2 \times \cdots \times \Lambda_k$ be the Cartesian product of $k$ discrete hyperparameter spaces, where each $\Lambda_i = \{\lambda_i^{(1)}, \lambda_i^{(2)}, \ldots, \lambda_i^{(n_i)}\}$ contains $n_i$ candidate values for hyperparameter $i$.
Grid search evaluates the objective function at every point in this product space:
$$\text{Grid} = \{(\lambda_1, \lambda_2, \ldots, \lambda_k) : \lambda_i \in \Lambda_i \text{ for all } i\}$$
The total number of configurations to evaluate is:
$$|\text{Grid}| = \prod_{i=1}^{k} n_i = n_1 \times n_2 \times \cdots \times n_k$$
```python
from itertools import product
from typing import Dict, List, Any, Callable, Tuple
import numpy as np


def exhaustive_grid_search(
    param_grid: Dict[str, List[Any]],
    objective_fn: Callable[[Dict[str, Any]], float],
    minimize: bool = True
) -> Tuple[Dict[str, Any], float, List[Dict]]:
    """
    Exhaustive grid search implementation.

    Parameters:
    -----------
    param_grid : Dict[str, List[Any]]
        Dictionary mapping hyperparameter names to lists of candidate values.
        Example: {'learning_rate': [0.01, 0.1, 1.0], 'max_depth': [3, 5, 7]}
    objective_fn : Callable[[Dict[str, Any]], float]
        Function that takes a hyperparameter configuration and returns
        the validation performance (e.g., validation loss or accuracy).
    minimize : bool
        If True, seek minimum objective (e.g., loss).
        If False, seek maximum objective (e.g., accuracy).

    Returns:
    --------
    best_params : Dict[str, Any]
        The hyperparameter configuration achieving best performance.
    best_score : float
        The objective value for the best configuration.
    all_results : List[Dict]
        Complete results for all evaluated configurations.
    """
    # Extract parameter names and their candidate values
    param_names = list(param_grid.keys())
    param_values = list(param_grid.values())

    # Generate all combinations via Cartesian product
    all_combinations = list(product(*param_values))

    print(f"Grid Search: Evaluating {len(all_combinations)} configurations")
    print(f"Parameters: {param_names}")
    print(f"Grid dimensions: {[len(v) for v in param_values]}")

    # Track all results for analysis
    all_results = []
    best_params = None
    best_score = float('inf') if minimize else float('-inf')

    # Systematic enumeration of all grid points
    for idx, combination in enumerate(all_combinations):
        # Construct configuration dictionary
        config = dict(zip(param_names, combination))

        # Evaluate objective (typically involves model training)
        score = objective_fn(config)

        # Store results
        result = {
            'config': config.copy(),
            'score': score,
            'iteration': idx
        }
        all_results.append(result)

        # Update best if improved
        is_better = (score < best_score) if minimize else (score > best_score)
        if is_better:
            best_score = score
            best_params = config.copy()
            print(f"  [Iter {idx+1}/{len(all_combinations)}] New best: "
                  f"score={score:.6f}, params={config}")

    print("Grid Search Complete!")
    print(f"Best Score: {best_score:.6f}")
    print(f"Best Parameters: {best_params}")

    return best_params, best_score, all_results


# Example: Grid search for a ridge regression model
def example_ridge_grid_search():
    """Demonstrates grid search for ridge regression hyperparameters."""
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    # Generate synthetic regression data
    X, y = make_regression(n_samples=1000, n_features=20, noise=10,
                           random_state=42)

    # Define the hyperparameter grid
    param_grid = {
        'alpha': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0],  # Regularization strength
        'fit_intercept': [True, False],
        'solver': ['auto', 'svd', 'cholesky', 'lsqr']
    }

    def evaluate_ridge(config: Dict[str, Any]) -> float:
        """Objective: Negative mean cross-validated R² score."""
        model = Ridge(
            alpha=config['alpha'],
            fit_intercept=config['fit_intercept'],
            solver=config['solver']
        )
        # 5-fold cross-validation for robust performance estimate
        scores = cross_val_score(model, X, y, cv=5, scoring='r2')
        return -np.mean(scores)  # Negative because we minimize

    # Run exhaustive grid search
    best_params, best_score, results = exhaustive_grid_search(
        param_grid=param_grid,
        objective_fn=evaluate_ridge,
        minimize=True
    )

    print(f"Total configurations evaluated: {len(results)}")
    print(f"Grid size: 6 × 2 × 4 = {6 * 2 * 4}")

    return best_params, -best_score  # Return positive R² score
```

Algorithmic Properties:
Completeness: Grid search is complete with respect to the defined grid—it will always find the best configuration among those explicitly enumerated, assuming evaluation is deterministic or properly averaged.
Determinism: Given the same grid and evaluation function, grid search produces identical results regardless of execution order. This reproducibility is invaluable for scientific work.
Independence: Each grid point can be evaluated independently, enabling embarrassingly parallel execution across available compute resources.
Anytime Behavior: While not optimal in an anytime sense (we cannot predict which configuration will be best without full enumeration), partial results are always valid—we simply report the best configuration found so far.
The effectiveness of grid search crucially depends on thoughtful construction of the parameter grid. This involves determining which hyperparameters to tune, selecting appropriate value ranges, and choosing suitable discretization schemes.
Value Range Selection:
The range of values to explore for each hyperparameter should be guided by:
Domain Knowledge: Understanding the algorithm helps establish sensible bounds. For example, learning rates in neural networks typically range from $10^{-5}$ to $10^{-1}$.
Prior Experience: Previous experiments on similar problems provide empirical guidance on effective ranges.
Default Values: Library defaults are often well-calibrated starting points. The grid should encompass a neighborhood around these defaults.
Theoretical Constraints: Some parameters have natural bounds (e.g., regularization strength must be non-negative).
```python
import numpy as np
from typing import List, Dict, Any


def create_linear_grid(start: float, stop: float, num: int) -> List[float]:
    """
    Create linearly spaced grid for parameters with linear response.

    Use for: Number of trees, max depth, min samples split, etc.

    Example:
    >>> create_linear_grid(1, 10, 5)
    [1.0, 3.25, 5.5, 7.75, 10.0]
    """
    return list(np.linspace(start, stop, num))


def create_log_grid(start: float, stop: float, num: int) -> List[float]:
    """
    Create logarithmically spaced grid for parameters spanning orders of magnitude.

    Use for: Learning rate, regularization strength, kernel bandwidth, etc.

    The logarithmic spacing ensures equal representation across orders of
    magnitude, which is critical when optimal values could be anywhere
    from 10^-4 to 10^2.

    Example:
    >>> create_log_grid(0.001, 100, 6)
    [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
    """
    return list(np.logspace(np.log10(start), np.log10(stop), num))


def create_geometric_grid(start: float, ratio: float, num: int) -> List[float]:
    """
    Create geometrically spaced grid with fixed ratio between consecutive values.

    Use for: When you want consistent multiplicative steps (e.g., doubling).

    Example:
    >>> create_geometric_grid(1, 2, 5)  # Powers of 2
    [1, 2, 4, 8, 16]
    """
    return [start * (ratio ** i) for i in range(num)]


# ============================================================
# Complete example: Constructing a grid for XGBoost
# ============================================================
def construct_xgboost_grid(granularity: str = 'medium') -> Dict[str, List[Any]]:
    """
    Construct a hyperparameter grid for XGBoost with varying granularity.

    Parameters:
    -----------
    granularity : str
        'coarse' - Quick exploration (3-4 values per parameter)
        'medium' - Balanced coverage (5-6 values per parameter)
        'fine'   - Thorough search (7-10 values per parameter)

    Returns:
    --------
    param_grid : Dict[str, List[Any]]
        Complete hyperparameter grid for XGBoost.
    """
    grids = {
        'coarse': {
            # Learning rate: log-spaced (response is typically log-linear)
            'learning_rate': [0.01, 0.1, 0.3],
            # Tree depth: linear spacing (discrete, roughly linear impact)
            'max_depth': [3, 6, 10],
            # Number of estimators: often paired inversely with learning rate
            'n_estimators': [100, 300, 500],
            # Regularization: log-spaced
            'reg_alpha': [0, 0.1, 1.0],      # L1 regularization
            'reg_lambda': [0.1, 1.0, 10.0],  # L2 regularization
            # Subsampling: linear spacing in valid range [0, 1]
            'subsample': [0.7, 0.85, 1.0],
            'colsample_bytree': [0.7, 0.85, 1.0],
        },
        'medium': {
            'learning_rate': create_log_grid(0.01, 0.3, 5),
            'max_depth': [3, 4, 5, 6, 8, 10],
            'n_estimators': [100, 200, 300, 500, 700],
            'reg_alpha': [0, 0.01, 0.1, 1.0, 10.0],
            'reg_lambda': create_log_grid(0.01, 10, 5),
            'subsample': [0.6, 0.7, 0.8, 0.9, 1.0],
            'colsample_bytree': [0.6, 0.7, 0.8, 0.9, 1.0],
        },
        'fine': {
            'learning_rate': create_log_grid(0.005, 0.5, 8),
            'max_depth': list(range(2, 12)),              # 2, 3, ..., 11
            'n_estimators': list(range(100, 1001, 100)),  # 100, 200, ..., 1000
            'reg_alpha': [0] + list(create_log_grid(0.001, 100, 7)),
            'reg_lambda': create_log_grid(0.001, 100, 8),
            'subsample': list(np.arange(0.5, 1.05, 0.1)),
            'colsample_bytree': list(np.arange(0.5, 1.05, 0.1)),
            'min_child_weight': [1, 3, 5, 7, 10],
            'gamma': [0, 0.1, 0.5, 1.0, 5.0],
        }
    }

    grid = grids[granularity]

    # Calculate grid statistics
    total_configs = 1
    for param, values in grid.items():
        total_configs *= len(values)

    print(f"Grid granularity: {granularity}")
    print(f"Parameters: {len(grid)}")
    print(f"Grid dimensions: {[len(v) for v in grid.values()]}")
    print(f"Total configurations: {total_configs:,}")

    return grid


# Demonstrate grid size growth
print("=== Grid Size Comparison ===")
for level in ['coarse', 'medium', 'fine']:
    print()
    construct_xgboost_grid(level)
```

Use logarithmic spacing when the hyperparameter's effect on model performance scales multiplicatively (learning rate, regularization).
Use linear spacing when the effect is additive or when values are bounded integers (max depth, number of layers). Incorrect spacing wastes computational budget on redundant similar configurations.
Discretization Granularity:
The number of grid points per hyperparameter creates a fundamental trade-off:
Too Coarse: Risk missing the optimal region entirely. If the optimal learning rate is 0.05 but you only try [0.01, 0.1], you'll never find it.
Too Fine: Exponentially increases computational cost with diminishing returns. The difference between learning rate 0.051 and 0.052 is rarely significant.
Practical Heuristic: Start with 3-5 geometrically or logarithmically spaced values per continuous hyperparameter. After identifying promising regions, refine with focused local searches.
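This coarse-then-refine pattern can be sketched in a few lines. The toy objective and the `refine_log_grid` helper below are illustrative stand-ins (not library functions); in practice the objective would involve model training:

```python
import numpy as np

def refine_log_grid(best: float, factor: float = 3.0, num: int = 5):
    """Finer log-spaced grid centered on the winner of the coarse sweep."""
    return list(np.logspace(np.log10(best / factor), np.log10(best * factor), num))

# Toy objective with its minimum near lr ≈ 0.05 (stand-in for validation loss)
objective = lambda lr: (np.log10(lr) + 1.3) ** 2

# Stage 1: coarse log-spaced sweep
coarse = [0.001, 0.01, 0.1, 1.0]
best_coarse = min(coarse, key=objective)

# Stage 2: zoom in around the coarse winner with a finer local grid
fine = refine_log_grid(best_coarse)
best_fine = min(fine, key=objective)

print(best_coarse, round(best_fine, 4))  # refinement moves from 0.1 toward 0.05
```

Two cheap stages cover the promising region far more finely than a single grid of the same total size could.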
How we evaluate each grid point profoundly affects the reliability and validity of grid search results. Defining a robust evaluation protocol is as important as the search strategy itself.
Single Train-Validation Split:
The simplest approach trains on a single training set and evaluates on a held-out validation set:
$$\hat{\mathcal{L}}(\lambda) = \mathcal{L}(f_{\lambda}^{\mathcal{D}_{\text{train}}}, \mathcal{D}_{\text{val}})$$
This is fast but has high variance—a single split may not be representative, and validation performance can fluctuate significantly.
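A minimal sketch of the single-split protocol, using a synthetic scikit-learn dataset and logistic regression as stand-ins for a real problem:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
# One fixed split: cheap, but the estimate inherits this split's sampling noise
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

def single_split_objective(C: float) -> float:
    """Validation error rate for one candidate value of C."""
    model = LogisticRegression(C=C, max_iter=1000).fit(X_tr, y_tr)
    return 1.0 - model.score(X_val, y_val)

errors = {C: single_split_objective(C) for C in [0.01, 0.1, 1.0, 10.0]}
best_C = min(errors, key=errors.get)
print(f"best C = {best_C}, validation error = {errors[best_C]:.3f}")
```

Re-running with a different `random_state` in the split can change which `C` wins, which is exactly the variance problem that motivates cross-validation.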
K-Fold Cross-Validation:
For more robust estimates, we average performance across $K$ different train-validation splits:
$$\hat{\mathcal{L}}_{\text{CV}}(\lambda) = \frac{1}{K} \sum_{k=1}^{K} \mathcal{L}(f_{\lambda}^{\mathcal{D}_{-k}}, \mathcal{D}_k)$$
where $\mathcal{D}_k$ is fold $k$ and $\mathcal{D}_{-k}$ is all data except fold $k$.
This reduces variance but multiplies computational cost by factor $K$.
```python
from sklearn.model_selection import (
    cross_val_score, StratifiedKFold, GridSearchCV
)
import numpy as np
from typing import Dict, Any


def grid_search_with_cv(
    estimator_class,
    param_grid: Dict[str, list],
    X, y,
    cv: int = 5,
    scoring: str = 'accuracy',
    n_jobs: int = -1
):
    """
    Grid search with k-fold cross-validation for robust evaluation.

    Mathematical Foundation:
    For each configuration λ, we compute:
        CV(λ) = (1/K) Σ_{k=1}^K L(f_λ^{D_{-k}}, D_k)
    where D_k is fold k and D_{-k} is all data except fold k.

    The standard error of this estimate is:
        SE = σ / √K
    where σ is the standard deviation across folds.
    """
    # Create cross-validation strategy
    cv_strategy = StratifiedKFold(n_splits=cv, shuffle=True, random_state=42)

    # Initialize GridSearchCV
    grid_search = GridSearchCV(
        estimator=estimator_class(),
        param_grid=param_grid,
        cv=cv_strategy,
        scoring=scoring,
        n_jobs=n_jobs,            # Parallelize across cores
        return_train_score=True,  # Also track training performance
        verbose=2,
        refit=True                # Refit best model on full training data
    )

    # Execute grid search
    grid_search.fit(X, y)

    # Extract results
    results = {
        'best_params': grid_search.best_params_,
        'best_score': grid_search.best_score_,
        'best_estimator': grid_search.best_estimator_,
        'cv_results': grid_search.cv_results_,
    }

    # Analyze variance across folds for best configuration
    best_idx = grid_search.best_index_
    fold_scores = [
        grid_search.cv_results_[f'split{k}_test_score'][best_idx]
        for k in range(cv)
    ]
    results['best_score_std'] = np.std(fold_scores)
    results['best_score_ci_95'] = 1.96 * results['best_score_std'] / np.sqrt(cv)

    print(f"Best Score: {results['best_score']:.4f} "
          f"± {results['best_score_ci_95']:.4f} (95% CI)")
    print(f"Best Parameters: {results['best_params']}")

    return results


def analyze_grid_search_results(cv_results: Dict):
    """
    Comprehensive analysis of grid search results.

    Examines:
    1. Performance distribution across configurations
    2. Hyperparameter sensitivity
    3. Variance across cross-validation folds
    4. Potential overfitting indicators
    """
    import pandas as pd

    # Convert to DataFrame for analysis
    df = pd.DataFrame(cv_results)

    print("=" * 60)
    print("GRID SEARCH RESULTS ANALYSIS")
    print("=" * 60)

    # Performance distribution
    print("Performance Distribution:")
    print(f"  Mean test score: {df['mean_test_score'].mean():.4f}")
    print(f"  Std test score:  {df['mean_test_score'].std():.4f}")
    print(f"  Min test score:  {df['mean_test_score'].min():.4f}")
    print(f"  Max test score:  {df['mean_test_score'].max():.4f}")

    # Top 5 configurations
    print("Top 5 Configurations:")
    top5 = df.nlargest(5, 'mean_test_score')[
        ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
    ]
    for idx, row in top5.iterrows():
        print(f"  Rank {row['rank_test_score']}: "
              f"Score={row['mean_test_score']:.4f} ± {row['std_test_score']:.4f}")
        print(f"    Params: {row['params']}")

    # Train-test gap analysis (overfitting indicator)
    print("Train-Test Gap Analysis:")
    df['gap'] = df['mean_train_score'] - df['mean_test_score']
    print(f"  Mean gap: {df['gap'].mean():.4f}")
    print(f"  Max gap:  {df['gap'].max():.4f} (potential overfitting)")

    # Variance analysis
    print("Cross-Validation Variance:")
    high_variance = df[df['std_test_score'] > df['std_test_score'].median()]
    print(f"  Configs with high CV variance: {len(high_variance)}/{len(df)}")

    return df


# Example: Complete grid search workflow
def complete_grid_search_example():
    """End-to-end grid search with analysis."""
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split

    # Load data
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # Define parameter grid
    param_grid = {
        'n_estimators': [50, 100, 200],
        'max_depth': [3, 5, 7],
        'learning_rate': [0.01, 0.1, 0.2],
        'min_samples_split': [2, 5, 10]
    }

    print(f"Grid size: 3 × 3 × 3 × 3 = {3 * 3 * 3 * 3} configurations")
    print(f"With 5-fold CV: {81 * 5} model fits")

    # Run grid search
    results = grid_search_with_cv(
        estimator_class=GradientBoostingClassifier,
        param_grid=param_grid,
        X=X_train, y=y_train,
        cv=5,
        scoring='accuracy'
    )

    # Analyze results
    df = analyze_grid_search_results(results['cv_results'])

    # Final evaluation on test set
    test_score = results['best_estimator'].score(X_test, y_test)
    print(f"Final Test Score: {test_score:.4f}")

    return results
```

When using cross-validation for hyperparameter selection, the reported CV score is an optimistically biased estimate of true generalization performance. For unbiased estimation, nested cross-validation is required: an outer loop for performance estimation and an inner loop for hyperparameter selection. This multiplies computational cost by another factor of K.
Grid search's embarrassingly parallel structure—where each configuration evaluation is independent—makes it an ideal candidate for parallelization. The key insight is that the total wall-clock time can be reduced nearly linearly with available compute resources.
Levels of Parallelism:
Configuration-Level Parallelism: Evaluate multiple hyperparameter configurations simultaneously. Each worker processes one complete configuration.
Fold-Level Parallelism: Within each configuration, cross-validation folds can be evaluated in parallel. This multiplies parallelism by the CV factor.
Training-Level Parallelism: Some algorithms (XGBoost, LightGBM) can parallelize individual training runs. This adds another dimension of speedup.
The optimal strategy depends on the relative costs and available resources.
```python
from concurrent.futures import ProcessPoolExecutor, as_completed
from typing import Dict, List, Any, Callable, Tuple
import multiprocessing
from itertools import product
import time


def parallel_grid_search(
    param_grid: Dict[str, List[Any]],
    objective_fn: Callable[[Dict[str, Any]], float],
    n_workers: int = None,
    minimize: bool = True
) -> Tuple[Dict[str, Any], float, List[Dict]]:
    """
    Parallel grid search utilizing multiple CPU cores.

    Theoretical Speedup:
    With N configurations and W workers, ideal speedup is min(N, W).
    Actual speedup is typically 80-95% of ideal due to:
    - Process spawning overhead
    - Load imbalance (some configs train faster)
    - Memory bandwidth contention

    Parameters:
    -----------
    n_workers : int, optional
        Number of parallel workers. Defaults to CPU count.
    """
    if n_workers is None:
        n_workers = multiprocessing.cpu_count()

    # Generate all configurations
    param_names = list(param_grid.keys())
    param_values = list(param_grid.values())
    all_configs = [
        dict(zip(param_names, combo))
        for combo in product(*param_values)
    ]

    print("Parallel Grid Search")
    print(f"  Configurations: {len(all_configs)}")
    print(f"  Workers: {n_workers}")
    print(f"  Expected speedup: ~{min(len(all_configs), n_workers)}x")

    start_time = time.time()
    all_results = []

    # Parallel evaluation using ProcessPoolExecutor
    with ProcessPoolExecutor(max_workers=n_workers) as executor:
        # Submit all jobs
        futures = {
            executor.submit(objective_fn, config): config
            for config in all_configs
        }

        # Collect results as they complete
        completed = 0
        best_score = float('inf') if minimize else float('-inf')
        best_config = None

        for future in as_completed(futures):
            config = futures[future]
            try:
                score = future.result()
                result = {'config': config, 'score': score}
                all_results.append(result)

                # Track best
                is_better = (score < best_score) if minimize else (score > best_score)
                if is_better:
                    best_score = score
                    best_config = config

                completed += 1
                if completed % (len(all_configs) // 10 + 1) == 0:
                    print(f"  Progress: {completed}/{len(all_configs)} "
                          f"({100*completed/len(all_configs):.1f}%)")
            except Exception as e:
                print(f"Error evaluating {config}: {e}")

    elapsed = time.time() - start_time
    print(f"Completed in {elapsed:.2f}s")
    print(f"Throughput: {len(all_configs)/elapsed:.2f} configs/sec")
    print(f"Best score: {best_score:.6f}")

    return best_config, best_score, all_results


# ============================================================
# Distributed Grid Search with Joblib
# ============================================================
from joblib import Parallel, delayed


def distributed_grid_search_joblib(
    estimator_class,
    param_grid: Dict[str, List[Any]],
    X, y,
    cv: int = 5,
    scoring: str = 'accuracy',
    n_jobs: int = -1,
    backend: str = 'loky'
) -> Dict:
    """
    Grid search using joblib for flexible parallelization.

    Backends:
    - 'loky': Default, good for CPU-bound tasks
    - 'threading': For I/O-bound tasks or GIL-releasing code
    - 'multiprocessing': Alternative process-based backend
    - 'dask': For distributed computing clusters

    Memory Management:
    Joblib handles memory efficiently by:
    - Memory-mapping large arrays (auto_memmap)
    - Worker recycling to prevent memory leaks
    - Automatic batching of small tasks
    """
    from sklearn.model_selection import cross_val_score

    param_names = list(param_grid.keys())
    param_values = list(param_grid.values())
    all_configs = list(product(*param_values))

    def evaluate_single(combo):
        """Evaluate one configuration with cross-validation."""
        config = dict(zip(param_names, combo))
        model = estimator_class(**config)
        scores = cross_val_score(model, X, y, cv=cv, scoring=scoring)
        return {
            'config': config,
            'mean_score': scores.mean(),
            'std_score': scores.std(),
            'all_scores': scores.tolist()
        }

    print(f"Evaluating {len(all_configs)} configurations with {cv}-fold CV")
    print(f"Total model fits: {len(all_configs) * cv}")

    # Parallel execution
    results = Parallel(
        n_jobs=n_jobs,
        backend=backend,
        verbose=10,
        pre_dispatch='2*n_jobs'  # Limit memory usage
    )(
        delayed(evaluate_single)(combo) for combo in all_configs
    )

    # Find best
    best_result = max(results, key=lambda x: x['mean_score'])

    return {
        'best_params': best_result['config'],
        'best_score': best_result['mean_score'],
        'best_std': best_result['std_score'],
        'all_results': results
    }


# ============================================================
# Memory-Efficient Grid Search for Large Datasets
# ============================================================
def memory_efficient_grid_search(
    estimator_class,
    param_grid: Dict[str, List[Any]],
    X_memmap_path: str,
    y_memmap_path: str,
    cv: int = 5,
    n_jobs: int = -1
):
    """
    Grid search using memory-mapped files for large datasets.

    For datasets larger than available RAM:
    1. Convert data to memory-mapped format (numpy.memmap)
    2. Each worker reads directly from disk with OS-level caching
    3. Avoids copying full dataset to each worker process

    This pattern enables grid search on datasets 10x larger than RAM.
    """
    import numpy as np

    # Load as memory-mapped (no copy to memory)
    X = np.load(X_memmap_path, mmap_mode='r')
    y = np.load(y_memmap_path, mmap_mode='r')

    print(f"Dataset shape: {X.shape}")
    print(f"Dataset size: {X.nbytes / 1e9:.2f} GB (memory-mapped)")

    # Use joblib with memmap support
    return distributed_grid_search_joblib(
        estimator_class, param_grid, X, y, cv=cv, n_jobs=n_jobs
    )
```

| Strategy | Speedup Potential | Memory Overhead | Best Use Case |
|---|---|---|---|
| Single-threaded | 1x (baseline) | Minimal | Small grids, debugging |
| Multi-process (local) | Linear in cores | Moderate (per-process copies) | Standard workloads |
| Distributed (cluster) | Linear in nodes × cores | High (network overhead) | Large-scale HPO |
| GPU-accelerated | 10-100x for supported models | GPU memory constrained | Deep learning, XGBoost |
Grid search provides several important mathematical guarantees that make it a sound choice for hyperparameter optimization.
Guarantee 1: Global Optimality (within the grid)
Grid search finds the global optimum $\lambda^*_G$ within the defined grid $G$:
$$\lambda^*_G = \arg\min_{\lambda \in G} \mathcal{L}(\lambda)$$
This is a complete enumeration guarantee—no configuration in $G$ achieves lower loss than $\lambda^*_G$.
Guarantee 2: Reproducibility
Given deterministic model training and evaluation, grid search produces identical results across runs. If stochasticity exists (random initialization, stochastic optimization), averaging multiple evaluations restores determinism.
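A toy illustration of this seed-averaging trick; `noisy_objective` is a stand-in for stochastic model training, deterministic once a seed is fixed:

```python
import random

def noisy_objective(lr: float, seed: int) -> float:
    """Stand-in for stochastic training: deterministic given a fixed seed."""
    rng = random.Random(seed)
    return (lr - 0.05) ** 2 + rng.gauss(0, 0.001)

def averaged_objective(lr: float, n_seeds: int = 5) -> float:
    # Averaging over a fixed seed list makes each evaluation deterministic
    return sum(noisy_objective(lr, s) for s in range(n_seeds)) / n_seeds

grid = [0.01, 0.05, 0.1]
run1 = {lr: averaged_objective(lr) for lr in grid}
run2 = {lr: averaged_objective(lr) for lr in grid}
assert run1 == run2  # identical across repeated grid searches
```

Averaging also shrinks the noise in each estimate by a factor of roughly √n_seeds, at n_seeds times the cost.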
Guarantee 3: Coverage Bounds
For a grid with $n$ points along each of $d$ dimensions (total $n^d$ points), the maximum distance from any point in the hyperparameter space to the nearest grid point is bounded by half the grid spacing:
$$\max_{\lambda \in \Lambda} \min_{g \in G} \|\lambda - g\|_\infty \leq \frac{\Delta}{2}$$
where $\Delta$ is the largest grid spacing.
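A quick numeric check of this bound in one dimension (the grid and probe points are arbitrary choices):

```python
import numpy as np

grid = np.linspace(0.0, 1.0, 5)        # grid spacing Δ = 0.25
queries = np.linspace(0.0, 1.0, 1001)  # dense probe of the whole space

# Distance from every probe point to its nearest grid point
dist = np.min(np.abs(queries[:, None] - grid[None, :]), axis=1)
print(dist.max())  # worst case is Δ/2 = 0.125, attained midway between grid points
```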
Grid search implicitly assumes that the objective function is reasonably smooth—that nearby hyperparameter configurations produce similar performance. If the objective is highly discontinuous or chaotic, even a fine grid may miss optimal regions.
Limitation: Suboptimality vs True Global Optimum
The critical limitation is that $\lambda^*_G$ may be arbitrarily far from the true global optimum $\lambda^* = \arg\min_{\lambda \in \Lambda} \mathcal{L}(\lambda)$ if the grid doesn't sample the optimal region. The approximation error:
$$\mathcal{L}(\lambda^*_G) - \mathcal{L}(\lambda^*)$$
depends entirely on how well the grid covers the promising regions of $\Lambda$.
Uniform Coverage Property:
One often-overlooked property of grid search is uniform coverage of the search space. Every region receives equal attention, regardless of prior information about where good solutions might lie. This is both a strength (no regions are neglected) and a weakness (no regions are prioritized).
$$P(\text{sampling hyperparameter } i \text{ at value } v) = \frac{1}{n_i} \quad \forall v \in \Lambda_i$$
This uniformity means grid search treats all hyperparameters and values as equally important a priori, which may waste budget on unimportant dimensions.
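A small count makes this cost concrete. Which parameter is "important" here is of course a toy assumption; the point is that the full grid spends most of its budget re-testing values along dimensions that do not matter:

```python
from itertools import product

important = [0.01, 0.03, 0.1, 0.3, 1.0]  # assumed to drive performance
irrelevant = [1, 2, 3, 4, 5]             # assumed to have no effect

configs = list(product(important, irrelevant))
distinct_important = len(set(lr for lr, _ in configs))

print(len(configs))        # 25 evaluations in total...
print(distinct_important)  # ...but only 5 distinct values of the parameter that matters
```

This observation is the core argument for random search: 25 random samples would try 25 distinct values along every dimension.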
Grid search is not the only exhaustive approach. Understanding how it compares to alternatives illuminates when grid search is the right choice.
Brute Force on Continuous Spaces:
For truly continuous hyperparameters, exhaustive search is impossible. Grid search approximates exhaustive search through discretization. The approximation quality improves with finer grids but computational cost grows correspondingly.
Exhaustive Enumeration for Discrete Spaces:
When all hyperparameters are discrete with finite domains, grid search is exhaustive enumeration. For example, tuning only kernel ∈ {'linear', 'rbf', 'poly'} and C ∈ {0.1, 1, 10} involves exactly 9 configurations—grid search evaluates all of them.
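That enumeration can be written directly with `itertools.product`:

```python
from itertools import product

kernels = ['linear', 'rbf', 'poly']
Cs = [0.1, 1, 10]

configs = [{'kernel': k, 'C': C} for k, C in product(kernels, Cs)]
print(len(configs))  # 3 × 3 = 9 configurations
print(configs[0])    # first configuration: linear kernel, C = 0.1
```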
Latin Hypercube Sampling (LHS):
LHS provides structured coverage of the hyperparameter space without the rigid Cartesian structure of grid search. It divides each dimension into $N$ equal intervals and places exactly one sample in each interval per dimension.
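A minimal LHS sketch, assuming two hyperparameters scaled to the unit interval; the construction (jittered interval positions, independently permuted per dimension) is one standard variant, not the only one:

```python
import numpy as np

def latin_hypercube(n_samples: int, n_dims: int, seed=None) -> np.ndarray:
    """One sample in each of the n_samples intervals, in every dimension."""
    rng = np.random.default_rng(seed)
    # Sample i lands somewhere in interval [i/n, (i+1)/n), for each dimension
    u = (rng.random((n_samples, n_dims)) + np.arange(n_samples)[:, None]) / n_samples
    # Independently shuffle each column so intervals pair up randomly across dims
    for d in range(n_dims):
        u[:, d] = u[rng.permutation(n_samples), d]
    return u

samples = latin_hypercube(10, 2, seed=0)
# LHS property: each of the 10 intervals contains exactly one sample per dimension
for d in range(2):
    assert sorted(np.floor(samples[:, d] * 10).astype(int).tolist()) == list(range(10))
```

Unlike a 10 × 10 grid, this covers 10 distinct values in each dimension with only 10 evaluations.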
Even when using advanced HPO methods (Bayesian optimization, evolutionary strategies), running a coarse grid search first provides a valuable baseline. It's simple to implement, produces interpretable results, and establishes performance bounds that sophisticated methods should improve upon.
We have established the complete theoretical and practical foundations of grid search as an exhaustive hyperparameter optimization strategy. Let's consolidate the key concepts:
What's Next:
Now that we understand the mechanics of exhaustive grid search, the next page examines its computational cost in detail. We'll analyze how the number of hyperparameters, grid resolution, and evaluation complexity interact to determine total runtime, and develop strategies for estimating and managing this cost in practice.
You now understand the foundational principles of exhaustive grid search. You can implement grid search from scratch, construct effective parameter grids, set up robust evaluation protocols, and leverage parallelization. Next, we'll confront the elephant in the room: the computational cost that makes grid search impractical for high-dimensional search spaces.