Before any hyperparameter optimization algorithm can run, you must answer a fundamental question: Where should we search? This question sounds simple but carries profound implications.
Define your search space too narrowly, and the optimal configuration might lie outside your boundaries—no algorithm can find what you've excluded. Define it too broadly, and you waste precious computational budget exploring regions that were never viable. The search space definition is where domain knowledge meets optimization theory, and getting it right is often the difference between HPO success and failure.
In this page, we'll develop a systematic framework for search space definition that balances coverage with efficiency.
By the end of this page, you will:
• Understand the mathematical definition of hyperparameter search spaces
• Know how to choose appropriate bounds and ranges
• Apply transformations (log, logit) to improve search efficiency
• Design search spaces that encode prior knowledge
• Avoid common pitfalls that doom HPO campaigns
A hyperparameter search space $\Lambda$ is the set of all valid hyperparameter configurations. Each configuration $\lambda \in \Lambda$ is a point in this space, specifying a value for every hyperparameter.
Formally, if we have $d$ hyperparameters, each with its own domain:
$$\Lambda = \Lambda_1 \times \Lambda_2 \times \cdots \times \Lambda_d$$
where $\Lambda_i$ is the domain of the $i$-th hyperparameter. The total search space is the Cartesian product of individual domains.
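For example, a three-hyperparameter space can be written as a product of one domain per hyperparameter and sampled directly. The dictionary layout below is just one possible representation, not a standard API:

```python
import random

# Lambda = Lambda_1 x Lambda_2 x Lambda_3: one domain per hyperparameter
search_space = {
    "learning_rate": ("continuous", (1e-5, 1.0)),    # bounded interval in R
    "num_layers": ("integer", (1, 8)),               # finite integer range
    "optimizer": ("categorical", ["sgd", "adam"]),   # unordered finite set
}


def sample_configuration(space):
    """Draw one configuration, i.e. one point lambda in the product space."""
    config = {}
    for name, (kind, domain) in space.items():
        if kind == "continuous":
            config[name] = random.uniform(*domain)
        elif kind == "integer":
            config[name] = random.randint(*domain)
        else:  # categorical
            config[name] = random.choice(domain)
    return config


print(sample_configuration(search_space))
```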
Types of Individual Domains:
Each hyperparameter has a domain that falls into one of several categories:
Bounded Continuous: $\Lambda_i = [a, b] \subset \mathbb{R}$
Unbounded Continuous: $\Lambda_i = \mathbb{R}$ or $\mathbb{R}^+$
Discrete/Integer: $\Lambda_i = \{a, a+1, \ldots, b\}$
Categorical: $\Lambda_i = \{c_1, c_2, \ldots, c_k\}$
Ordinal: Discrete with meaningful order but non-uniform spacing
| Domain Type | Search Strategy | Encoding | Distance Metric |
|---|---|---|---|
| Continuous | Interpolation works | Direct value or transformed | Euclidean |
| Integer | Rounding from continuous | Integer-valued | Euclidean (rounded) |
| Categorical | Cannot interpolate | One-hot or integer | Hamming distance |
| Ordinal | Order meaningful, spacing not | Integer index | Ordinal difference |
Different HPO algorithms handle different domain types with varying efficiency. Bayesian optimization with Gaussian Processes works best on continuous spaces. Tree-based methods (TPE, SMAC) handle mixed spaces naturally. Understanding your search space's domain structure helps select the right optimizer.
Setting appropriate bounds for continuous and integer hyperparameters is critical. Here's a systematic approach for common hyperparameters:
Learning Rate
The learning rate is arguably the most important hyperparameter for neural networks, and its range spans many orders of magnitude.
Regularization Strength (L2/Weight Decay)
Weight decay also spans several orders of magnitude; a range of roughly $[10^{-6}, 10^{-1}]$ searched on a log scale covers most tasks (see the reference bounds below).
Dropout Rate
Dropout is a probability, so a linear scale over $[0.0, 0.5]$ is usually sufficient; values above 0.5 rarely help.
"""Common Search Space Bounds: A Practical Reference These bounds are derived from extensive empirical studies and representreasonable starting points for most tasks. Adjust based on domain knowledge."""from dataclasses import dataclassfrom typing import List, Tuple, Optional, Unionimport numpy as np @dataclassclass HyperparameterSpec: """Specification for a single hyperparameter.""" name: str type: str # 'continuous', 'integer', 'categorical' bounds: Union[Tuple[float, float], List[str]] log_scale: bool = False default: Optional[float] = None description: str = "" # ============================================================# Neural Network Hyperparameters# ============================================================ NEURAL_NETWORK_SEARCH_SPACE = [ HyperparameterSpec( name="learning_rate", type="continuous", bounds=(1e-5, 1.0), log_scale=True, # CRITICAL: Always log-scale for learning rate default=1e-3, description="Step size for gradient updates. Log-scale essential." ), HyperparameterSpec( name="weight_decay", type="continuous", bounds=(1e-6, 1e-1), log_scale=True, default=1e-4, description="L2 regularization. Log-scale for wide range." ), HyperparameterSpec( name="dropout", type="continuous", bounds=(0.0, 0.5), log_scale=False, # Linear scale works for probability default=0.1, description="Dropout probability. Linear scale, bounded by 0.5." ), HyperparameterSpec( name="batch_size", type="integer", bounds=(8, 256), log_scale=True, # Consider log-scale for powers of 2 default=32, description="Samples per gradient update. Powers of 2 preferred." ), HyperparameterSpec( name="hidden_units", type="integer", bounds=(32, 1024), log_scale=True, default=256, description="Units in hidden layers. Log-scale for wide range." ), HyperparameterSpec( name="num_layers", type="integer", bounds=(1, 10), log_scale=False, # Linear for small integer range default=3, description="Number of hidden layers." ), HyperparameterSpec( name="optimizer", type="categorical", bounds=["sgd", "adam", "adamw", "rmsprop"], default="adam", description="Optimization algorithm choice." ), HyperparameterSpec( name="activation", type="categorical", bounds=["relu", "gelu", "swish", "tanh"], default="relu", description="Activation function." ),] # ============================================================# Gradient Boosting Hyperparameters (XGBoost/LightGBM)# ============================================================ GRADIENT_BOOSTING_SEARCH_SPACE = [ HyperparameterSpec( name="learning_rate", type="continuous", bounds=(0.001, 0.3), log_scale=True, default=0.1, description="Boosting learning rate (shrinkage)." ), HyperparameterSpec( name="n_estimators", type="integer", bounds=(50, 2000), log_scale=True, default=500, description="Number of boosting rounds." ), HyperparameterSpec( name="max_depth", type="integer", bounds=(3, 12), log_scale=False, default=6, description="Maximum tree depth." ), HyperparameterSpec( name="min_child_weight", type="continuous", bounds=(1, 100), log_scale=True, default=1, description="Minimum sum of instance weight in child." ), HyperparameterSpec( name="subsample", type="continuous", bounds=(0.5, 1.0), log_scale=False, default=0.8, description="Row subsampling ratio." ), HyperparameterSpec( name="colsample_bytree", type="continuous", bounds=(0.5, 1.0), log_scale=False, default=0.8, description="Column subsampling ratio per tree." ), HyperparameterSpec( name="reg_alpha", type="continuous", bounds=(1e-8, 10.0), log_scale=True, default=1e-5, description="L1 regularization." 
), HyperparameterSpec( name="reg_lambda", type="continuous", bounds=(1e-8, 10.0), log_scale=True, default=1.0, description="L2 regularization." ),] # ============================================================# SVM Hyperparameters# ============================================================ SVM_SEARCH_SPACE = [ HyperparameterSpec( name="C", type="continuous", bounds=(1e-4, 1e4), log_scale=True, # 8 orders of magnitude! default=1.0, description="Regularization parameter. Log-scale essential." ), HyperparameterSpec( name="gamma", type="continuous", bounds=(1e-5, 1e2), log_scale=True, default=None, # Often 1/n_features description="RBF kernel bandwidth. Log-scale essential." ), HyperparameterSpec( name="kernel", type="categorical", bounds=["linear", "poly", "rbf", "sigmoid"], default="rbf", description="Kernel function." ),] def estimate_search_space_size(specs: List[HyperparameterSpec], resolution: int = 10) -> float: """ Estimate the effective size of a search space. For continuous parameters: assume 'resolution' distinct values For categorical: number of categories For integer: (upper - lower + 1), capped at resolution Args: specs: List of hyperparameter specifications resolution: Discretization level for continuous params Returns: Total number of grid points (can be astronomical) """ total_configs = 1 for spec in specs: if spec.type == "continuous": points = resolution elif spec.type == "integer": low, high = spec.bounds points = min(high - low + 1, resolution) elif spec.type == "categorical": points = len(spec.bounds) else: points = resolution total_configs *= points return total_configs # Example: Calculate search space sizeif __name__ == "__main__": print("Neural Network Search Space Analysis") print("=" * 50) nn_size = estimate_search_space_size(NEURAL_NETWORK_SEARCH_SPACE) print(f"Grid points (10 per continuous dim): {nn_size:,}") nn_size_fine = estimate_search_space_size(NEURAL_NETWORK_SEARCH_SPACE, resolution=100) print(f"Grid points (100 per continuous dim): {nn_size_fine:,}") print(f"\nAt 5 min/evaluation: {nn_size * 5 / 60:.1f} hours for coarse grid") print(f"This illustrates why grid search is infeasible!")One of the most impactful decisions in search space design is choosing the scale on which to search. The same range can behave very differently under linear versus logarithmic sampling.
Why Log-Scale Matters
Consider searching for a learning rate in [1e-5, 1.0]:
Linear sampling draws values uniformly from the interval. But roughly 90% of those draws land in $[0.1, 1.0]$, and only about 0.1% fall below $10^{-3}$, where good learning rates usually live.
Log-scale sampling uniformly samples the exponent: each of the five decades between $10^{-5}$ and $1.0$ receives about 20% of the samples.
This is not just more balanced—it reflects our prior belief that the optimal value could be anywhere in this range, regardless of order of magnitude.
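To make this concrete, compare the probability of drawing a value at or below $10^{-3}$ under the two schemes:

$$P_{\text{linear}}(\lambda \le 10^{-3}) = \frac{10^{-3} - 10^{-5}}{1 - 10^{-5}} \approx 0.001, \qquad P_{\text{log}}(\lambda \le 10^{-3}) = \frac{\log_{10} 10^{-3} - \log_{10} 10^{-5}}{\log_{10} 1 - \log_{10} 10^{-5}} = \frac{2}{5} = 0.4$$

Linear sampling assigns about 0.1% of its draws to the region below $10^{-3}$; log-uniform sampling assigns 40%.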
| Scale | When to Use | Examples | Transformation |
|---|---|---|---|
| Linear | Effect is roughly linear with value | Dropout rate, subsample ratio, momentum | $x = \text{sample}(a, b)$ |
| Log | Range spans multiple orders of magnitude | Learning rate, regularization, gamma (SVM) | $x = 10^{\text{sample}(\log a, \log b)}$ |
| Logit | Probability in (0,1), effects near boundaries | Rare: dropout near 0 or 1 | $x = \sigma(\text{sample}(\sigma^{-1}(a), \sigma^{-1}(b)))$ |
| Power | Custom scaling between linear and log | Batch size, layer widths | $x = a + (b-a) \cdot u^p$ |
If your upper bound is more than 100× your lower bound, you almost certainly want log-scale. For learning rate (1e-5 to 1.0 = 100,000× range), log-scale is mandatory. Linear scale would be nearly useless.
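This rule of thumb is easy to automate. The helper below is a minimal sketch; the `suggest_scale` name and the 100× cutoff are our own choices, not from any library:

```python
def suggest_scale(low: float, high: float) -> str:
    """Heuristic: if the upper bound exceeds ~100x the lower bound,
    search on a log scale; otherwise linear is usually adequate."""
    if low <= 0:
        return "linear"  # a log scale needs strictly positive bounds
    return "log" if high / low > 100 else "linear"


print(suggest_scale(1e-5, 1.0))  # 'log'    (learning rate, 5 orders of magnitude)
print(suggest_scale(0.0, 0.5))   # 'linear' (dropout: lower bound is 0)
print(suggest_scale(0.5, 1.0))   # 'linear' (subsample ratio, narrow range)
```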
"""Scale Transformations for Hyperparameter Search This demonstrates why scale choice dramatically affects search efficiency."""import numpy as npimport matplotlib.pyplot as pltfrom scipy import stats def sample_linear(low: float, high: float, n: int = 1000) -> np.ndarray: """Sample uniformly on linear scale.""" return np.random.uniform(low, high, n) def sample_log(low: float, high: float, n: int = 1000) -> np.ndarray: """Sample uniformly on log scale (log-uniform distribution).""" log_low, log_high = np.log10(low), np.log10(high) return 10 ** np.random.uniform(log_low, log_high, n) def sample_logit(low: float, high: float, n: int = 1000) -> np.ndarray: """Sample on logit scale (useful for probabilities near boundaries).""" # Map [low, high] to [logit(low), logit(high)], sample, invert def logit(p): return np.log(p / (1 - p)) def sigmoid(x): return 1 / (1 + np.exp(-x)) logit_low, logit_high = logit(low), logit(high) samples = np.random.uniform(logit_low, logit_high, n) return sigmoid(samples) def compare_sampling_strategies(): """ Visualize how different scales distribute samples. This demonstrates why log-scale is essential for wide ranges. """ learning_rate_range = (1e-5, 1.0) # 5 orders of magnitude linear_samples = sample_linear(*learning_rate_range, n=10000) log_samples = sample_log(*learning_rate_range, n=10000) print("Learning Rate Sampling: [1e-5, 1.0]") print("=" * 60) # Count samples in different ranges ranges = [ (1e-5, 1e-4, "1e-5 to 1e-4"), (1e-4, 1e-3, "1e-4 to 1e-3"), (1e-3, 1e-2, "1e-3 to 1e-2"), (1e-2, 1e-1, "1e-2 to 1e-1"), (1e-1, 1.0, "1e-1 to 1.0"), ] print(f"{'Range':<20} {'Linear %':>12} {'Log %':>12} {'Optimal?':<15}") print("-" * 60) for low, high, name in ranges: linear_pct = 100 * np.sum((linear_samples >= low) & (linear_samples < high)) / len(linear_samples) log_pct = 100 * np.sum((log_samples >= low) & (log_samples < high)) / len(log_samples) # Most optimal learning rates for Adam are in 1e-4 to 1e-2 range is_optimal = "← Sweet spot" if name in ["1e-4 to 1e-3", "1e-3 to 1e-2"] else "" print(f"{name:<20} {linear_pct:>11.2f}% {log_pct:>11.2f}% {is_optimal}") print("\n⚠️ Linear sampling wastes 90% of budget on suboptimal region [0.1, 1.0]!") print("✓ Log sampling gives equal probability to each order of magnitude.") def demonstrate_log_transform_benefit(): """ Show quantitatively how log-transform improves optimization. Simulate a response surface where optimal is at lr=0.001, and show how many random samples find the good region. 
""" np.random.seed(42) # Simulated response surface: optimum near lr=0.001 def response_surface(lr): """Validation error as function of learning rate.""" log_lr = np.log10(lr) optimal_log_lr = np.log10(0.001) # -3 # Parabola in log-space, minimum at log(0.001) = -3 return (log_lr - optimal_log_lr) ** 2 + 0.1 * np.random.randn() n_samples = 100 # Random search with linear sampling linear_samples = sample_linear(1e-5, 1.0, n_samples) linear_errors = [response_surface(lr) for lr in linear_samples] # Random search with log sampling log_samples = sample_log(1e-5, 1.0, n_samples) log_errors = [response_surface(lr) for lr in log_samples] print("\nRandom Search Comparison (100 samples)") print("=" * 50) print(f"Optimal learning rate: 0.001") print(f"\nLinear sampling:") print(f" Best lr found: {linear_samples[np.argmin(linear_errors)]:.6f}") print(f" Best error: {min(linear_errors):.4f}") print(f" Samples in [1e-4, 1e-2]: {np.sum((linear_samples >= 1e-4) & (linear_samples <= 1e-2))}") print(f"\nLog sampling:") print(f" Best lr found: {log_samples[np.argmin(log_errors)]:.6f}") print(f" Best error: {min(log_errors):.4f}") print(f" Samples in [1e-4, 1e-2]: {np.sum((log_samples >= 1e-4) & (log_samples <= 1e-2))}") if __name__ == "__main__": compare_sampling_strategies() demonstrate_log_transform_benefit()Search space definition is an opportunity to encode everything you know about what works. The more prior knowledge you encode, the more efficiently HPO can proceed.
Sources of Prior Knowledge: published results and benchmarks on similar tasks, the defaults shipped with well-tested libraries, your own previous experiments on related datasets, and hard constraints from theory or the model definition (for example, a dropout rate must lie in $[0, 1)$).
Dataset-Adaptive Bounds
Search space bounds should adapt to problem characteristics:
For small datasets (n < 10,000): favor stronger regularization (higher dropout, larger weight decay), smaller model capacity, and more conservative learning rates.
For large datasets (n > 1,000,000): allow larger models and higher learning rates, and relax regularization, since overfitting is less of a threat.
For high-dimensional features (d > 1000): account for the memory cost of the wider input layer and consider stronger regularization. The adaptive example below implements several of these adjustments.
Begin with a focused search space based on prior knowledge. If HPO consistently finds optima at boundaries, that's a signal to expand those bounds. This adaptive approach is more efficient than starting with an enormous search space covering 'all possibilities.'
"""Adaptive Search Space Definition Define search spaces that adapt to problem characteristics."""from dataclasses import dataclassfrom typing import Dict, Tuple, List, Optionalimport numpy as np @dataclass class ProblemCharacteristics: """Characteristics that inform search space design.""" n_samples: int n_features: int n_classes: int # 1 for regression is_imbalanced: bool task_type: str # 'classification', 'regression', 'multilabel' compute_budget: str # 'low', 'medium', 'high' memory_gb: float # Available GPU memory def create_neural_network_search_space( problem: ProblemCharacteristics) -> Dict: """ Create an adaptive search space for neural networks. The bounds adapt based on: - Dataset size (more data → can use more capacity) - Feature dimensionality (affects first layer) - Compute budget (limits what's feasible) - Memory constraints (limits batch size) """ space = {} # ============================================================ # Learning Rate: Adapt based on expected training dynamics # ============================================================ # Smaller datasets often need smaller learning rates if problem.n_samples < 1000: lr_range = (1e-5, 1e-2) # More conservative elif problem.n_samples < 100000: lr_range = (1e-5, 1e-1) # Standard range else: lr_range = (1e-4, 0.5) # Large datasets can use higher lr space['learning_rate'] = { 'type': 'continuous', 'bounds': lr_range, 'log_scale': True, 'rationale': f'Adapted for dataset size n={problem.n_samples}' } # ============================================================ # Model Capacity: Scale with data availability # ============================================================ # Rough heuristic: capacity should scale with log(n_samples) max_complexity = np.log10(problem.n_samples) # e.g., 3 for 1K, 6 for 1M # Number of layers max_layers = int(np.clip(max_complexity, 2, 10)) space['num_layers'] = { 'type': 'integer', 'bounds': (1, max_layers), 'log_scale': False, 'rationale': f'Max {max_layers} layers for n={problem.n_samples}' } # Hidden units max_units = int(np.clip(2 ** (4 + max_complexity), 64, 2048)) space['hidden_units'] = { 'type': 'integer', 'bounds': (32, max_units), 'log_scale': True, 'rationale': f'Max {max_units} units for n={problem.n_samples}' } # ============================================================ # Regularization: More for small data, less for large # ============================================================ # Dropout: smaller datasets need more regularization if problem.n_samples < 1000: dropout_range = (0.3, 0.6) elif problem.n_samples < 100000: dropout_range = (0.1, 0.5) else: dropout_range = (0.0, 0.3) space['dropout'] = { 'type': 'continuous', 'bounds': dropout_range, 'log_scale': False, 'rationale': f'Dropout adapted for dataset size' } # Weight decay if problem.n_samples < 10000: wd_range = (1e-4, 1e-1) # Strong regularization else: wd_range = (1e-6, 1e-2) # Can use weaker space['weight_decay'] = { 'type': 'continuous', 'bounds': wd_range, 'log_scale': True, 'rationale': 'Weight decay for overfitting control' } # ============================================================ # Batch Size: Based on memory and dataset size # ============================================================ # Estimate memory per sample (rough heuristic) bytes_per_sample = problem.n_features * 4 # float32 max_batch_from_memory = int(problem.memory_gb * 1e9 / bytes_per_sample / 100) max_batch_from_data = problem.n_samples // 10 # At least 10 batches max_batch = max(16, min(512, max_batch_from_memory, max_batch_from_data)) # 
Common batch sizes (powers of 2) batch_sizes = [bs for bs in [16, 32, 64, 128, 256, 512] if bs <= max_batch] if not batch_sizes: batch_sizes = [16] space['batch_size'] = { 'type': 'categorical', 'choices': batch_sizes, 'rationale': f'Batch sizes feasible for {problem.memory_gb}GB memory' } # ============================================================ # Optimizer: Include gradient clipping for imbalanced data # ============================================================ space['optimizer'] = { 'type': 'categorical', 'choices': ['adam', 'adamw', 'sgd'], 'rationale': 'Standard optimizers' } if problem.is_imbalanced: space['gradient_clip_norm'] = { 'type': 'continuous', 'bounds': (0.5, 5.0), 'log_scale': False, 'rationale': 'Gradient clipping for unstable gradients with imbalance' } return space def report_search_space(space: Dict) -> None: """Pretty print search space with rationale.""" print("\nAdaptive Search Space") print("=" * 70) for name, config in space.items(): print(f"\n{name}:") for key, value in config.items(): print(f" {key}: {value}") # Example usageif __name__ == "__main__": # Small tabular dataset small_problem = ProblemCharacteristics( n_samples=2000, n_features=50, n_classes=3, is_imbalanced=True, task_type='classification', compute_budget='medium', memory_gb=8.0 ) small_space = create_neural_network_search_space(small_problem) print("Search Space for Small Dataset (n=2000)") report_search_space(small_space) # Large dataset large_problem = ProblemCharacteristics( n_samples=1_000_000, n_features=100, n_classes=10, is_imbalanced=False, task_type='classification', compute_budget='high', memory_gb=32.0 ) large_space = create_neural_network_search_space(large_problem) print("\n" + "=" * 70) print("Search Space for Large Dataset (n=1,000,000)") report_search_space(large_space)Even experienced practitioners make mistakes in search space definition. Here are the most common errors and how to avoid them:
When your best hyperparameters are at the edge of your search space, you cannot know if you've found the true optimum or merely the best within your constraints. Always check if optima lie near boundaries, and expand if needed.
Diagnosis Protocol
After running HPO, perform these checks (a small boundary-check sketch follows the list):
Boundary check: Are any best values within 10% of bounds? → Expand that bound
Utilization check: Did search explore the full space? → Check if optimizer is working
Sensitivity check: How much does performance vary across the space? → Flat regions indicate unimportant hyperparameters
Stability check: Do multiple runs find similar optima? → High variance suggests noise or multimodality
Validation check: Do validation and test agree? → Large gaps suggest overfitting to validation
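As an illustration of the boundary check, here is a minimal sketch. The `near_boundary` name, the 10% tolerance, and the choice to measure position on the scale the parameter was searched on are our own assumptions:

```python
import math
from typing import Optional


def near_boundary(best: float, low: float, high: float,
                  log_scale: bool = False, tol: float = 0.10) -> Optional[str]:
    """Return 'lower' or 'upper' if the best value sits within `tol` of a bound
    (measured on the scale the parameter was searched on), else None."""
    if log_scale:
        best, low, high = math.log10(best), math.log10(low), math.log10(high)
    position = (best - low) / (high - low)  # 0 = at lower bound, 1 = at upper bound
    if position < tol:
        return "lower"
    if position > 1 - tol:
        return "upper"
    return None


# Example: best learning rate 0.8 found in [1e-5, 1.0], searched on a log scale
hit = near_boundary(0.8, 1e-5, 1.0, log_scale=True)
if hit:
    print(f"Best value is near the {hit} bound; consider expanding that bound.")
```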
Let's consolidate everything into actionable guidelines you can apply immediately (a code translation follows the table):
| Hyperparameter | Typical Range | Scale | Notes |
|---|---|---|---|
| Learning Rate (Adam) | [1e-5, 1e-2] | Log | Start at 1e-3 |
| Learning Rate (SGD) | [1e-3, 0.5] | Log | Needs momentum |
| Weight Decay | [1e-6, 1e-2] | Log | Start at 1e-4 |
| Dropout | [0.0, 0.5] | Linear | Start at 0.1 |
| Batch Size | [16, 256] | Log (or discrete) | Powers of 2 |
| Hidden Units | [64, 1024] | Log | Model capacity |
| Num Layers | [1, 8] | Linear | Depth |
| C (SVM) | [1e-4, 1e4] | Log | 8 orders of magnitude |
| RBF γ (SVM) | [1e-5, 1e2] | Log | Depends on features |
| Tree max_depth | [3, 12] | Linear | Limit for overfitting |
| Boosting n_estimators | [50, 2000] | Log | More with lower LR |
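If you use a library such as Optuna, the table above translates almost directly into a search space definition. The sketch below assumes Optuna's `suggest_float` / `suggest_int` / `suggest_categorical` API; `train_and_evaluate` is a hypothetical function standing in for your training loop:

```python
import optuna


def objective(trial: optuna.Trial) -> float:
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True),
        "weight_decay": trial.suggest_float("weight_decay", 1e-6, 1e-2, log=True),
        "dropout": trial.suggest_float("dropout", 0.0, 0.5),
        "batch_size": trial.suggest_categorical("batch_size", [16, 32, 64, 128, 256]),
        "hidden_units": trial.suggest_int("hidden_units", 64, 1024, log=True),
        "num_layers": trial.suggest_int("num_layers", 1, 8),
    }
    return train_and_evaluate(params)  # hypothetical: returns validation loss


study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
```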
Search space definition is where prior knowledge meets optimization. Done well, it focuses computational resources on viable configurations. Done poorly, it dooms HPO to searching in the wrong places.
What's Next
With search space definition covered, we'll next dive deeper into the distinction between continuous and discrete hyperparameters, examining how each type affects optimization strategies and how to handle mixed search spaces effectively.
You now have a systematic framework for defining hyperparameter search spaces. The quality of your search space definition will often determine HPO success more than the choice of optimization algorithm.