Before any hyperparameter optimization algorithm can run, you must answer a fundamental question: Where should we search? This question sounds simple but carries profound implications.
Define your search space too narrowly, and the optimal configuration might lie outside your boundaries—no algorithm can find what you've excluded. Define it too broadly, and you waste precious computational budget exploring regions that were never viable. The search space definition is where domain knowledge meets optimization theory, and getting it right is often the difference between HPO success and failure.
In this page, we'll develop a systematic framework for search space definition that balances coverage with efficiency.
By the end of this page, you will:
• Understand the mathematical definition of hyperparameter search spaces
• Know how to choose appropriate bounds and ranges
• Apply transformations (log, logit) to improve search efficiency
• Design search spaces that encode prior knowledge
• Avoid common pitfalls that doom HPO campaigns
A hyperparameter search space $\Lambda$ is the set of all valid hyperparameter configurations. Each configuration $\lambda \in \Lambda$ is a point in this space, specifying a value for every hyperparameter.
Formally, if we have $d$ hyperparameters, each with its own domain:
$$\Lambda = \Lambda_1 \times \Lambda_2 \times \cdots \times \Lambda_d$$
where $\Lambda_i$ is the domain of the $i$-th hyperparameter. The total search space is the Cartesian product of individual domains.
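For example, a three-hyperparameter space can be written as a product of one domain per hyperparameter and sampled directly. The dictionary layout below is just one possible representation, not a standard API:

```python
import random

# Lambda = Lambda_1 x Lambda_2 x Lambda_3: one domain per hyperparameter
search_space = {
    "learning_rate": ("continuous", (1e-5, 1.0)),    # bounded interval in R
    "num_layers": ("integer", (1, 8)),               # finite integer range
    "optimizer": ("categorical", ["sgd", "adam"]),   # unordered finite set
}


def sample_configuration(space):
    """Draw one configuration, i.e. one point lambda in the product space."""
    config = {}
    for name, (kind, domain) in space.items():
        if kind == "continuous":
            config[name] = random.uniform(*domain)
        elif kind == "integer":
            config[name] = random.randint(*domain)
        else:  # categorical
            config[name] = random.choice(domain)
    return config


print(sample_configuration(search_space))
```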
Types of Individual Domains:
Each hyperparameter has a domain that falls into one of several categories:
Bounded Continuous: $\Lambda_i = [a, b] \subset \mathbb{R}$
Unbounded Continuous: $\Lambda_i = \mathbb{R}$ or $\mathbb{R}^+$
Discrete/Integer: $\Lambda_i = \{a, a+1, \ldots, b\}$
Categorical: $\Lambda_i = \{c_1, c_2, \ldots, c_k\}$
Ordinal: Discrete with meaningful order but non-uniform spacing
| Domain Type | Search Strategy | Encoding | Distance Metric |
|---|---|---|---|
| Continuous | Interpolation works | Direct value or transformed | Euclidean |
| Integer | Rounding from continuous | Integer-valued | Euclidean (rounded) |
| Categorical | Cannot interpolate | One-hot or integer | Hamming distance |
| Ordinal | Order meaningful, spacing not | Integer index | Ordinal difference |
Different HPO algorithms handle different domain types with varying efficiency. Bayesian optimization with Gaussian Processes works best on continuous spaces. Tree-based methods (TPE, SMAC) handle mixed spaces naturally. Understanding your search space's domain structure helps select the right optimizer.
Setting appropriate bounds for continuous and integer hyperparameters is critical. Here's a systematic approach for common hyperparameters:
Learning Rate
The learning rate is arguably the most important hyperparameter for neural networks, and its range spans many orders of magnitude.
Regularization Strength (L2/Weight Decay)
Weight decay also spans several orders of magnitude; a range of roughly $[10^{-6}, 10^{-1}]$ searched on a log scale covers most tasks (see the reference bounds below).
Dropout Rate
Dropout is a probability, so a linear scale over $[0.0, 0.5]$ is usually sufficient; values above 0.5 rarely help.
"""Common Search Space Bounds: A Practical Reference These bounds are derived from extensive empirical studies and representreasonable starting points for most tasks. Adjust based on domain knowledge."""from dataclasses import dataclassfrom typing import List, Tuple, Optional, Unionimport numpy as np @dataclassclass HyperparameterSpec: """Specification for a single hyperparameter.""" name: str type: str # 'continuous', 'integer', 'categorical' bounds: Union[Tuple[float, float], List[str]] log_scale: bool = False default: Optional[float] = None description: str = "" # ============================================================# Neural Network Hyperparameters# ============================================================ NEURAL_NETWORK_SEARCH_SPACE = [ HyperparameterSpec( name="learning_rate", type="continuous", bounds=(1e-5, 1.0), log_scale=True, # CRITICAL: Always log-scale for learning rate default=1e-3, description="Step size for gradient updates. Log-scale essential." ), HyperparameterSpec( name="weight_decay", type="continuous", bounds=(1e-6, 1e-1), log_scale=True, default=1e-4, description="L2 regularization. Log-scale for wide range." ), HyperparameterSpec( name="dropout", type="continuous", bounds=(0.0, 0.5), log_scale=False, # Linear scale works for probability default=0.1, description="Dropout probability. Linear scale, bounded by 0.5." ), HyperparameterSpec( name="batch_size", type="integer", bounds=(8, 256), log_scale=True, # Consider log-scale for powers of 2 default=32, description="Samples per gradient update. Powers of 2 preferred." ), HyperparameterSpec( name="hidden_units", type="integer", bounds=(32, 1024), log_scale=True, default=256, description="Units in hidden layers. Log-scale for wide range." ), HyperparameterSpec( name="num_layers", type="integer", bounds=(1, 10), log_scale=False, # Linear for small integer range default=3, description="Number of hidden layers." ), HyperparameterSpec( name="optimizer", type="categorical", bounds=["sgd", "adam", "adamw", "rmsprop"], default="adam", description="Optimization algorithm choice." ), HyperparameterSpec( name="activation", type="categorical", bounds=["relu", "gelu", "swish", "tanh"], default="relu", description="Activation function." ),] # ============================================================# Gradient Boosting Hyperparameters (XGBoost/LightGBM)# ============================================================ GRADIENT_BOOSTING_SEARCH_SPACE = [ HyperparameterSpec( name="learning_rate", type="continuous", bounds=(0.001, 0.3), log_scale=True, default=0.1, description="Boosting learning rate (shrinkage)." ), HyperparameterSpec( name="n_estimators", type="integer", bounds=(50, 2000), log_scale=True, default=500, description="Number of boosting rounds." ), HyperparameterSpec( name="max_depth", type="integer", bounds=(3, 12), log_scale=False, default=6, description="Maximum tree depth." ), HyperparameterSpec( name="min_child_weight", type="continuous", bounds=(1, 100), log_scale=True, default=1, description="Minimum sum of instance weight in child." ), HyperparameterSpec( name="subsample", type="continuous", bounds=(0.5, 1.0), log_scale=False, default=0.8, description="Row subsampling ratio." ), HyperparameterSpec( name="colsample_bytree", type="continuous", bounds=(0.5, 1.0), log_scale=False, default=0.8, description="Column subsampling ratio per tree." ), HyperparameterSpec( name="reg_alpha", type="continuous", bounds=(1e-8, 10.0), log_scale=True, default=1e-5, description="L1 regularization." 
), HyperparameterSpec( name="reg_lambda", type="continuous", bounds=(1e-8, 10.0), log_scale=True, default=1.0, description="L2 regularization." ),] # ============================================================# SVM Hyperparameters# ============================================================ SVM_SEARCH_SPACE = [ HyperparameterSpec( name="C", type="continuous", bounds=(1e-4, 1e4), log_scale=True, # 8 orders of magnitude! default=1.0, description="Regularization parameter. Log-scale essential." ), HyperparameterSpec( name="gamma", type="continuous", bounds=(1e-5, 1e2), log_scale=True, default=None, # Often 1/n_features description="RBF kernel bandwidth. Log-scale essential." ), HyperparameterSpec( name="kernel", type="categorical", bounds=["linear", "poly", "rbf", "sigmoid"], default="rbf", description="Kernel function." ),] def estimate_search_space_size(specs: List[HyperparameterSpec], resolution: int = 10) -> float: """ Estimate the effective size of a search space. For continuous parameters: assume 'resolution' distinct values For categorical: number of categories For integer: (upper - lower + 1), capped at resolution Args: specs: List of hyperparameter specifications resolution: Discretization level for continuous params Returns: Total number of grid points (can be astronomical) """ total_configs = 1 for spec in specs: if spec.type == "continuous": points = resolution elif spec.type == "integer": low, high = spec.bounds points = min(high - low + 1, resolution) elif spec.type == "categorical": points = len(spec.bounds) else: points = resolution total_configs *= points return total_configs # Example: Calculate search space sizeif __name__ == "__main__": print("Neural Network Search Space Analysis") print("=" * 50) nn_size = estimate_search_space_size(NEURAL_NETWORK_SEARCH_SPACE) print(f"Grid points (10 per continuous dim): {nn_size:,}") nn_size_fine = estimate_search_space_size(NEURAL_NETWORK_SEARCH_SPACE, resolution=100) print(f"Grid points (100 per continuous dim): {nn_size_fine:,}") print(f"\nAt 5 min/evaluation: {nn_size * 5 / 60:.1f} hours for coarse grid") print(f"This illustrates why grid search is infeasible!")One of the most impactful decisions in search space design is choosing the scale on which to search. The same range can behave very differently under linear versus logarithmic sampling.
Why Log-Scale Matters
Consider searching for a learning rate in [1e-5, 1.0]:
Linear sampling draws values uniformly from the interval. But roughly 90% of those draws land in $[0.1, 1.0]$, and only about 0.1% fall below $10^{-3}$, where good learning rates usually live.
Log-scale sampling uniformly samples the exponent: each of the five decades between $10^{-5}$ and $1.0$ receives about 20% of the samples.
This is not just more balanced—it reflects our prior belief that the optimal value could be anywhere in this range, regardless of order of magnitude.
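To make this concrete, compare the probability of drawing a value at or below $10^{-3}$ under the two schemes:

$$P_{\text{linear}}(\lambda \le 10^{-3}) = \frac{10^{-3} - 10^{-5}}{1 - 10^{-5}} \approx 0.001, \qquad P_{\text{log}}(\lambda \le 10^{-3}) = \frac{\log_{10} 10^{-3} - \log_{10} 10^{-5}}{\log_{10} 1 - \log_{10} 10^{-5}} = \frac{2}{5} = 0.4$$

Linear sampling assigns about 0.1% of its draws to the region below $10^{-3}$; log-uniform sampling assigns 40%.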
| Scale | When to Use | Examples | Transformation |
|---|---|---|---|
| Linear | Effect is roughly linear with value | Dropout rate, subsample ratio, momentum | $x = \text{sample}(a, b)$ |
| Log | Range spans multiple orders of magnitude | Learning rate, regularization, gamma (SVM) | $x = 10^{\text{sample}(\log a, \log b)}$ |
| Logit | Probability in (0,1), effects near boundaries | Rare: dropout near 0 or 1 | $x = \sigma(\text{sample}(\sigma^{-1}(a), \sigma^{-1}(b)))$ |
| Power | Custom scaling between linear and log | Batch size, layer widths | $x = a + (b-a) \cdot u^p$ |
If your upper bound is more than 100× your lower bound, you almost certainly want log-scale. For learning rate (1e-5 to 1.0 = 100,000× range), log-scale is mandatory. Linear scale would be nearly useless.
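This rule of thumb is easy to automate. The helper below is a minimal sketch; the `suggest_scale` name and the 100× cutoff are our own choices, not from any library:

```python
def suggest_scale(low: float, high: float) -> str:
    """Heuristic: if the upper bound exceeds ~100x the lower bound,
    search on a log scale; otherwise linear is usually adequate."""
    if low <= 0:
        return "linear"  # a log scale needs strictly positive bounds
    return "log" if high / low > 100 else "linear"


print(suggest_scale(1e-5, 1.0))  # 'log'    (learning rate, 5 orders of magnitude)
print(suggest_scale(0.0, 0.5))   # 'linear' (dropout: lower bound is 0)
print(suggest_scale(0.5, 1.0))   # 'linear' (subsample ratio, narrow range)
```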
"""Scale Transformations for Hyperparameter Search This demonstrates why scale choice dramatically affects search efficiency."""import numpy as npimport matplotlib.pyplot as pltfrom scipy import stats def sample_linear(low: float, high: float, n: int = 1000) -> np.ndarray: """Sample uniformly on linear scale.""" return np.random.uniform(low, high, n) def sample_log(low: float, high: float, n: int = 1000) -> np.ndarray: """Sample uniformly on log scale (log-uniform distribution).""" log_low, log_high = np.log10(low), np.log10(high) return 10 ** np.random.uniform(log_low, log_high, n) def sample_logit(low: float, high: float, n: int = 1000) -> np.ndarray: """Sample on logit scale (useful for probabilities near boundaries).""" # Map [low, high] to [logit(low), logit(high)], sample, invert def logit(p): return np.log(p / (1 - p)) def sigmoid(x): return 1 / (1 + np.exp(-x)) logit_low, logit_high = logit(low), logit(high) samples = np.random.uniform(logit_low, logit_high, n) return sigmoid(samples) def compare_sampling_strategies(): """ Visualize how different scales distribute samples. This demonstrates why log-scale is essential for wide ranges. """ learning_rate_range = (1e-5, 1.0) # 5 orders of magnitude linear_samples = sample_linear(*learning_rate_range, n=10000) log_samples = sample_log(*learning_rate_range, n=10000) print("Learning Rate Sampling: [1e-5, 1.0]") print("=" * 60) # Count samples in different ranges ranges = [ (1e-5, 1e-4, "1e-5 to 1e-4"), (1e-4, 1e-3, "1e-4 to 1e-3"), (1e-3, 1e-2, "1e-3 to 1e-2"), (1e-2, 1e-1, "1e-2 to 1e-1"), (1e-1, 1.0, "1e-1 to 1.0"), ] print(f"{'Range':<20} {'Linear %':>12} {'Log %':>12} {'Optimal?':<15}") print("-" * 60) for low, high, name in ranges: linear_pct = 100 * np.sum((linear_samples >= low) & (linear_samples < high)) / len(linear_samples) log_pct = 100 * np.sum((log_samples >= low) & (log_samples < high)) / len(log_samples) # Most optimal learning rates for Adam are in 1e-4 to 1e-2 range is_optimal = "← Sweet spot" if name in ["1e-4 to 1e-3", "1e-3 to 1e-2"] else "" print(f"{name:<20} {linear_pct:>11.2f}% {log_pct:>11.2f}% {is_optimal}") print("\n⚠️ Linear sampling wastes 90% of budget on suboptimal region [0.1, 1.0]!") print("✓ Log sampling gives equal probability to each order of magnitude.") def demonstrate_log_transform_benefit(): """ Show quantitatively how log-transform improves optimization. Simulate a response surface where optimal is at lr=0.001, and show how many random samples find the good region. 
""" np.random.seed(42) # Simulated response surface: optimum near lr=0.001 def response_surface(lr): """Validation error as function of learning rate.""" log_lr = np.log10(lr) optimal_log_lr = np.log10(0.001) # -3 # Parabola in log-space, minimum at log(0.001) = -3 return (log_lr - optimal_log_lr) ** 2 + 0.1 * np.random.randn() n_samples = 100 # Random search with linear sampling linear_samples = sample_linear(1e-5, 1.0, n_samples) linear_errors = [response_surface(lr) for lr in linear_samples] # Random search with log sampling log_samples = sample_log(1e-5, 1.0, n_samples) log_errors = [response_surface(lr) for lr in log_samples] print("\nRandom Search Comparison (100 samples)") print("=" * 50) print(f"Optimal learning rate: 0.001") print(f"\nLinear sampling:") print(f" Best lr found: {linear_samples[np.argmin(linear_errors)]:.6f}") print(f" Best error: {min(linear_errors):.4f}") print(f" Samples in [1e-4, 1e-2]: {np.sum((linear_samples >= 1e-4) & (linear_samples <= 1e-2))}") print(f"\nLog sampling:") print(f" Best lr found: {log_samples[np.argmin(log_errors)]:.6f}") print(f" Best error: {min(log_errors):.4f}") print(f" Samples in [1e-4, 1e-2]: {np.sum((log_samples >= 1e-4) & (log_samples <= 1e-2))}") if __name__ == "__main__": compare_sampling_strategies() demonstrate_log_transform_benefit()Search space definition is an opportunity to encode everything you know about what works. The more prior knowledge you encode, the more efficiently HPO can proceed.
Sources of Prior Knowledge: published results and benchmarks on similar tasks, the defaults shipped with well-tested libraries, your own previous experiments on related datasets, and hard constraints from theory or the model definition (for example, a dropout rate must lie in $[0, 1)$).
Dataset-Adaptive Bounds
Search space bounds should adapt to problem characteristics:
For small datasets (n < 10,000): favor stronger regularization (higher dropout, larger weight decay), smaller model capacity, and more conservative learning rates.
For large datasets (n > 1,000,000): allow larger models and higher learning rates, and relax regularization, since overfitting is less of a threat.
For high-dimensional features (d > 1000): account for the memory cost of the wider input layer and consider stronger regularization. The adaptive example below implements several of these adjustments.
Begin with a focused search space based on prior knowledge. If HPO consistently finds optima at boundaries, that's a signal to expand those bounds. This adaptive approach is more efficient than starting with an enormous search space covering 'all possibilities.'
"""Adaptive Search Space Definition Define search spaces that adapt to problem characteristics."""from dataclasses import dataclassfrom typing import Dict, Tuple, List, Optionalimport numpy as np @dataclass class ProblemCharacteristics: """Characteristics that inform search space design.""" n_samples: int n_features: int n_classes: int # 1 for regression is_imbalanced: bool task_type: str # 'classification', 'regression', 'multilabel' compute_budget: str # 'low', 'medium', 'high' memory_gb: float # Available GPU memory def create_neural_network_search_space( problem: ProblemCharacteristics) -> Dict: """ Create an adaptive search space for neural networks. The bounds adapt based on: - Dataset size (more data → can use more capacity) - Feature dimensionality (affects first layer) - Compute budget (limits what's feasible) - Memory constraints (limits batch size) """ space = {} # ============================================================ # Learning Rate: Adapt based on expected training dynamics # ============================================================ # Smaller datasets often need smaller learning rates if problem.n_samples < 1000: lr_range = (1e-5, 1e-2) # More conservative elif problem.n_samples < 100000: lr_range = (1e-5, 1e-1) # Standard range else: lr_range = (1e-4, 0.5) # Large datasets can use higher lr space['learning_rate'] = { 'type': 'continuous', 'bounds': lr_range, 'log_scale': True, 'rationale': f'Adapted for dataset size n={problem.n_samples}' } # ============================================================ # Model Capacity: Scale with data availability # ============================================================ # Rough heuristic: capacity should scale with log(n_samples) max_complexity = np.log10(problem.n_samples) # e.g., 3 for 1K, 6 for 1M # Number of layers max_layers = int(np.clip(max_complexity, 2, 10)) space['num_layers'] = { 'type': 'integer', 'bounds': (1, max_layers), 'log_scale': False, 'rationale': f'Max {max_layers} layers for n={problem.n_samples}' } # Hidden units max_units = int(np.clip(2 ** (4 + max_complexity), 64, 2048)) space['hidden_units'] = { 'type': 'integer', 'bounds': (32, max_units), 'log_scale': True, 'rationale': f'Max {max_units} units for n={problem.n_samples}' } # ============================================================ # Regularization: More for small data, less for large # ============================================================ # Dropout: smaller datasets need more regularization if problem.n_samples < 1000: dropout_range = (0.3, 0.6) elif problem.n_samples < 100000: dropout_range = (0.1, 0.5) else: dropout_range = (0.0, 0.3) space['dropout'] = { 'type': 'continuous', 'bounds': dropout_range, 'log_scale': False, 'rationale': f'Dropout adapted for dataset size' } # Weight decay if problem.n_samples < 10000: wd_range = (1e-4, 1e-1) # Strong regularization else: wd_range = (1e-6, 1e-2) # Can use weaker space['weight_decay'] = { 'type': 'continuous', 'bounds': wd_range, 'log_scale': True, 'rationale': 'Weight decay for overfitting control' } # ============================================================ # Batch Size: Based on memory and dataset size # ============================================================ # Estimate memory per sample (rough heuristic) bytes_per_sample = problem.n_features * 4 # float32 max_batch_from_memory = int(problem.memory_gb * 1e9 / bytes_per_sample / 100) max_batch_from_data = problem.n_samples // 10 # At least 10 batches max_batch = max(16, min(512, max_batch_from_memory, max_batch_from_data)) # 
Common batch sizes (powers of 2) batch_sizes = [bs for bs in [16, 32, 64, 128, 256, 512] if bs <= max_batch] if not batch_sizes: batch_sizes = [16] space['batch_size'] = { 'type': 'categorical', 'choices': batch_sizes, 'rationale': f'Batch sizes feasible for {problem.memory_gb}GB memory' } # ============================================================ # Optimizer: Include gradient clipping for imbalanced data # ============================================================ space['optimizer'] = { 'type': 'categorical', 'choices': ['adam', 'adamw', 'sgd'], 'rationale': 'Standard optimizers' } if problem.is_imbalanced: space['gradient_clip_norm'] = { 'type': 'continuous', 'bounds': (0.5, 5.0), 'log_scale': False, 'rationale': 'Gradient clipping for unstable gradients with imbalance' } return space def report_search_space(space: Dict) -> None: """Pretty print search space with rationale.""" print("\nAdaptive Search Space") print("=" * 70) for name, config in space.items(): print(f"\n{name}:") for key, value in config.items(): print(f" {key}: {value}") # Example usageif __name__ == "__main__": # Small tabular dataset small_problem = ProblemCharacteristics( n_samples=2000, n_features=50, n_classes=3, is_imbalanced=True, task_type='classification', compute_budget='medium', memory_gb=8.0 ) small_space = create_neural_network_search_space(small_problem) print("Search Space for Small Dataset (n=2000)") report_search_space(small_space) # Large dataset large_problem = ProblemCharacteristics( n_samples=1_000_000, n_features=100, n_classes=10, is_imbalanced=False, task_type='classification', compute_budget='high', memory_gb=32.0 ) large_space = create_neural_network_search_space(large_problem) print("\n" + "=" * 70) print("Search Space for Large Dataset (n=1,000,000)") report_search_space(large_space)Even experienced practitioners make mistakes in search space definition. Here are the most common errors and how to avoid them:
When your best hyperparameters are at the edge of your search space, you cannot know if you've found the true optimum or merely the best within your constraints. Always check if optima lie near boundaries, and expand if needed.
Diagnosis Protocol
After running HPO, perform these checks (a small boundary-check sketch follows the list):
Boundary check: Are any best values within 10% of bounds? → Expand that bound
Utilization check: Did search explore the full space? → Check if optimizer is working
Sensitivity check: How much does performance vary across the space? → Flat regions indicate unimportant hyperparameters
Stability check: Do multiple runs find similar optima? → High variance suggests noise or multimodality
Validation check: Do validation and test agree? → Large gaps suggest overfitting to validation
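As an illustration of the boundary check, here is a minimal sketch. The `near_boundary` name, the 10% tolerance, and the choice to measure position on the scale the parameter was searched on are our own assumptions:

```python
import math
from typing import Optional


def near_boundary(best: float, low: float, high: float,
                  log_scale: bool = False, tol: float = 0.10) -> Optional[str]:
    """Return 'lower' or 'upper' if the best value sits within `tol` of a bound
    (measured on the scale the parameter was searched on), else None."""
    if log_scale:
        best, low, high = math.log10(best), math.log10(low), math.log10(high)
    position = (best - low) / (high - low)  # 0 = at lower bound, 1 = at upper bound
    if position < tol:
        return "lower"
    if position > 1 - tol:
        return "upper"
    return None


# Example: best learning rate 0.8 found in [1e-5, 1.0], searched on a log scale
hit = near_boundary(0.8, 1e-5, 1.0, log_scale=True)
if hit:
    print(f"Best value is near the {hit} bound; consider expanding that bound.")
```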
Let's consolidate everything into actionable guidelines you can apply immediately (a code translation follows the table):
| Hyperparameter | Typical Range | Scale | Notes |
|---|---|---|---|
| Learning Rate (Adam) | [1e-5, 1e-2] | Log | Start at 1e-3 |
| Learning Rate (SGD) | [1e-3, 0.5] | Log | Needs momentum |
| Weight Decay | [1e-6, 1e-2] | Log | Start at 1e-4 |
| Dropout | [0.0, 0.5] | Linear | Start at 0.1 |
| Batch Size | [16, 256] | Log (or discrete) | Powers of 2 |
| Hidden Units | [64, 1024] | Log | Model capacity |
| Num Layers | [1, 8] | Linear | Depth |
| C (SVM) | [1e-4, 1e4] | Log | 8 orders of magnitude |
| RBF γ (SVM) | [1e-5, 1e2] | Log | Depends on features |
| Tree max_depth | [3, 12] | Linear | Limit for overfitting |
| Boosting n_estimators | [50, 2000] | Log | More with lower LR |
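If you use a library such as Optuna, the table above translates almost directly into a search space definition. The sketch below assumes Optuna's `suggest_float` / `suggest_int` / `suggest_categorical` API; `train_and_evaluate` is a hypothetical function standing in for your training loop:

```python
import optuna


def objective(trial: optuna.Trial) -> float:
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True),
        "weight_decay": trial.suggest_float("weight_decay", 1e-6, 1e-2, log=True),
        "dropout": trial.suggest_float("dropout", 0.0, 0.5),
        "batch_size": trial.suggest_categorical("batch_size", [16, 32, 64, 128, 256]),
        "hidden_units": trial.suggest_int("hidden_units", 64, 1024, log=True),
        "num_layers": trial.suggest_int("num_layers", 1, 8),
    }
    return train_and_evaluate(params)  # hypothetical: returns validation loss


study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
```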
Search space definition is where prior knowledge meets optimization. Done well, it focuses computational resources on viable configurations. Done poorly, it dooms HPO to searching in the wrong places.
What's Next
With search space definition covered, we'll next dive deeper into the distinction between continuous and discrete hyperparameters, examining how each type affects optimization strategies and how to handle mixed search spaces effectively.
You now have a systematic framework for defining hyperparameter search spaces. The quality of your search space definition will often determine HPO success more than the choice of optimization algorithm.