With search spaces potentially containing billions of configurations, how do we find good solutions efficiently? Exhaustive search is impossible; random guessing is inefficient. We need intelligent search strategies that balance exploration (discovering new regions) with exploitation (refining known good regions).
This page covers the major search strategies used in AutoML: random search, grid search, Bayesian optimization, evolutionary methods, and multi-fidelity approaches. You'll understand when to use each and how they work under the hood.
You'll master the spectrum of search strategies: from baseline methods (random, grid) through sophisticated model-based approaches (Bayesian optimization with Gaussian Processes and TPE), to population-based methods (evolutionary algorithms). You'll understand the tradeoffs and know which to apply in different scenarios.
Before sophisticated optimization, practitioners used simple sampling strategies. Understanding their limitations motivates more advanced methods.
Grid Search:
Grid search evaluates every combination of pre-specified parameter values. With three candidate values for each of three hyperparameters, that is 3 × 3 × 3 = 27 combinations, each requiring a full training run; a sketch follows.
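As a concrete illustration, here is a minimal sketch of such a grid in Python; the parameter names and values (learning rate, max depth, number of estimators) are illustrative placeholders, not a recommended setup:

```python
from itertools import product

# Hypothetical 3 x 3 x 3 grid: parameter names and values are illustrative.
grid = {
    "learning_rate": [0.01, 0.1, 0.3],
    "max_depth": [3, 6, 9],
    "n_estimators": [100, 300, 500],
}

names = list(grid)
configs = [dict(zip(names, values)) for values in product(*grid.values())]
print(len(configs))  # 27 combinations, each requiring a full training run
```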
Grid Search Limitations:
Why Random Search Often Beats Grid Search (Bergstra & Bengio, 2012):
The key insight: most hyperparameters don't matter equally. Typically 1-2 parameters dominate performance while others have minimal impact.
With 9 grid points across 2 parameters, grid search tests only 3 unique values per parameter. Random search with 9 samples tests ~9 unique values of the important parameter.
If learning rate dominates and depth doesn't matter, grid search effectively explores only 3 learning rates (each repeated at depths that change nothing), while random search spends every one of its 9 evaluations on a new learning rate.
Random search is strictly better when some parameters matter more than others—which is almost always true.
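A toy sketch of this effect, assuming a synthetic objective in which only the learning rate matters; the value ranges are arbitrary:

```python
import random

random.seed(0)

def objective(lr, depth):
    # Synthetic objective: performance depends only on the learning rate.
    return -(lr - 0.07) ** 2

# Grid: 3 x 3 = 9 evaluations, but only 3 distinct learning rates.
grid_scores = [objective(lr, d) for lr in (0.001, 0.01, 0.1) for d in (3, 6, 9)]

# Random: 9 evaluations, 9 distinct learning rates.
random_scores = [objective(random.uniform(0.001, 0.1), random.choice((3, 6, 9)))
                 for _ in range(9)]

print(max(grid_scores), max(random_scores))  # random usually lands closer to lr = 0.07
```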
Random search is a surprisingly strong baseline. Many papers claiming sophisticated methods fail to beat well-tuned random search. Always compare advanced methods against random search with equal budget before claiming improvement.
Bayesian Optimization (BO) is the workhorse of modern hyperparameter optimization. Unlike random search, BO learns from previous evaluations to focus on promising regions.
Core Idea:
Fit a cheap surrogate model to the configurations evaluated so far, use an acquisition function to pick the configuration the surrogate suggests is most promising, evaluate it for real, update the surrogate with the result, and repeat. The surrogate is cheap to query; the true objective (training a model) is expensive. BO trades many cheap surrogate queries for a few expensive evaluations.
Gaussian Process Surrogate:
The classic BO surrogate is a Gaussian Process (GP)—a distribution over functions. Given observations $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, the GP provides a predictive mean $\mu(x)$ and an uncertainty estimate $\sigma^2(x)$ at any candidate configuration $x$.
The GP posterior is: $$p(f(x) | D) = \mathcal{N}(\mu(x), \sigma^2(x))$$
where closed-form formulas exist for $\mu(x)$ and $\sigma^2(x)$ based on the kernel function and observations.
The acquisition function decides which configuration to evaluate next:

- Expected Improvement (EI): the expected value of improvement over the current best. Balances exploitation (high μ) with exploration (high σ).
- Upper Confidence Bound (UCB): μ(x) + κσ(x), where κ controls exploration.
- Probability of Improvement (PI): the probability of exceeding the current best.

EI is the most common choice in practice.
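A minimal sketch of one BO step with a GP surrogate and Expected Improvement, using scikit-learn's GaussianProcessRegressor; the one-dimensional objective and search range are stand-ins for a real training run:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Toy 1-D objective standing in for an expensive training run.
def objective(x):
    return -(x - 0.3) ** 2

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(5, 1))          # configurations tried so far
y = objective(X).ravel()                    # their observed scores

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X, y)

candidates = np.linspace(0, 1, 200).reshape(-1, 1)
mu, sigma = gp.predict(candidates, return_std=True)

# Expected Improvement over the current best observation (maximization).
best = y.max()
z = (mu - best) / np.maximum(sigma, 1e-9)
ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

next_x = candidates[np.argmax(ei)]          # next configuration to evaluate
print(next_x)
```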
TPE is an alternative to GP-based BO that handles high-dimensional and categorical spaces better. It's the default in Optuna and HyperOpt.
Key Insight:
GPs model $p(y|x)$—the performance given a configuration. TPE models the inverse: $p(x|y)$—the configurations that achieve performance $y$.
TPE Approach:
Split the trials completed so far into a "good" group (the top quantile by performance) and a "bad" group. Fit a density $l(x)$ to the good configurations and $g(x)$ to the bad ones, then propose the candidate that maximizes the ratio $l(x)/g(x)$. This ratio is proportional to Expected Improvement!
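A minimal Optuna sketch using its default TPE sampler; the parameter names, ranges, and placeholder objective are illustrative assumptions rather than a recommended configuration:

```python
import optuna

def objective(trial):
    # Mixed search space: TPE handles the categorical choice natively.
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    depth = trial.suggest_int("depth", 2, 12)
    booster = trial.suggest_categorical("booster", ["gbtree", "dart"])
    # Placeholder score; in practice, train and validate a model here.
    return -(lr - 0.01) ** 2 - 0.001 * depth + (0.01 if booster == "dart" else 0.0)

study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=50)
print(study.best_params)
```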
| Aspect | Gaussian Process BO | TPE |
|---|---|---|
| Models | p(y\|x) directly | p(x\|y) via density estimation |
| Scalability | O(n³) in observations | O(n) in observations |
| Categorical Variables | Requires encoding | Handles natively |
| Conditional Spaces | Difficult | Natural handling |
| Parallelization | Limited (kriging believer) | Easy (independent sampling) |
| Implementation | Complex | Relatively simple |
Use TPE when: High-dimensional spaces (>10 params), categorical/conditional parameters, need parallelization. Use GP-BO when: Low-dimensional continuous spaces, need uncertainty estimates, very expensive evaluations where sample efficiency is critical.
Multi-fidelity methods exploit a key observation: bad configurations can be identified early. Instead of fully training every model, we can use cheap early-stopping to filter poor candidates.
Successive Halving:
Start many configurations with a small budget (for example, a few epochs each), evaluate them, keep only the top fraction (say the best third), increase the survivors' budget by the same factor, and repeat until a single configuration remains at the full budget. A minimal sketch of this schedule follows.
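This sketch assumes a hypothetical `evaluate(config, budget)` that stands in for partially training a model:

```python
import random

random.seed(0)

def evaluate(config, budget):
    # Placeholder for partially training a model for `budget` epochs:
    # a noisy score whose noise shrinks as the budget grows.
    return config["quality"] + random.gauss(0, 0.1) / budget ** 0.5

def successive_halving(configs, min_budget=1, eta=3):
    budget = min_budget
    while len(configs) > 1:
        scores = [(evaluate(c, budget), c) for c in configs]
        scores.sort(key=lambda s: s[0], reverse=True)
        configs = [c for _, c in scores[: max(1, len(configs) // eta)]]  # keep top 1/eta
        budget *= eta                                                    # grow the budget
    return configs[0]

# Hypothetical configurations with a hidden "quality" standing in for true performance.
pool = [{"id": i, "quality": random.random()} for i in range(27)]
print(successive_halving(pool))
```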
Hyperband:
Hyperband runs multiple brackets of Successive Halving with different starting configurations and budgets, hedging against the assumption that early performance predicts final performance.
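For concreteness, here is a small sketch of the bracket schedule, following the standard Hyperband formulation and assuming a maximum budget of R = 81 epochs with reduction factor η = 3:

```python
import math

R, eta = 81, 3                       # max budget per configuration, reduction factor
s_max = round(math.log(R, eta))      # floor(log_eta(R)); exact here since R = eta**4

for s in range(s_max, -1, -1):
    n = math.ceil((s_max + 1) * eta ** s / (s + 1))   # initial configurations in bracket
    r = R * eta ** (-s)                               # initial budget per configuration
    print(f"bracket s={s}: start {n} configs at budget {r:g}")
```

With these numbers the brackets start with 81, 34, 15, 8, and 5 configurations at budgets of 1, 3, 9, 27, and 81 epochs respectively, so aggressive early stopping and full-budget evaluation are both represented.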
BOHB: Bayesian Optimization + Hyperband
BOHB combines the best of both worlds: Hyperband's bracket schedule decides how many configurations to run and at what budgets, while a TPE-style model (rather than random sampling) proposes which configurations enter each bracket.
Result: Sample-efficient selection + aggressive early stopping = state-of-the-art efficiency.
Key BOHB Insight:
TPE builds density models from completed evaluations at each budget level. Configurations that perform well at low budgets inform higher-budget proposals. The model gradually shifts from exploration to exploitation.
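One way to approximate this combination in practice is Optuna's TPE sampler paired with its Hyperband pruner; this is a BOHB-style setup rather than the original BOHB implementation, and the learning-curve objective below is a placeholder:

```python
import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-4, 1e-1, log=True)
    score = 0.0
    for epoch in range(100):                                  # budget = training epochs
        score = 1.0 - (lr - 0.01) ** 2 - 1.0 / (epoch + 1)    # placeholder learning curve
        trial.report(score, epoch)                            # report intermediate performance
        if trial.should_prune():                              # Hyperband decides on early stopping
            raise optuna.TrialPruned()
    return score

study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(seed=0),
    pruner=optuna.pruners.HyperbandPruner(min_resource=1, max_resource=100,
                                          reduction_factor=3),
)
study.optimize(objective, n_trials=60)
print(study.best_value, study.best_params)
```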
Multi-fidelity methods are powerful when: (1) early performance correlates with final performance, (2) evaluation cost scales with budget, (3) you have many configurations to try. They're less useful when early training is uninformative (some architectures need many epochs to start learning).
Evolutionary algorithms maintain a population of configurations that evolve over generations through selection, mutation, and crossover. They're particularly effective for NAS and complex structured spaces.
Basic Evolutionary Loop:
Initialize a population of random configurations and evaluate their fitness. Each generation, select strong parents (e.g., via tournament selection), create children by mutating (and optionally crossing over) the parents, evaluate the children, and let them replace the weakest (or, in regularized evolution, the oldest) members of the population. Repeat for many generations; see the sketch below.
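A minimal sketch of this loop with tournament selection and mutation only; the configuration fields and fitness function are illustrative placeholders:

```python
import random

random.seed(0)

def fitness(config):
    # Placeholder for training and validating a model with this configuration.
    return -(config["lr"] - 0.05) ** 2 - 0.01 * abs(config["layers"] - 4)

def mutate(config):
    child = dict(config)
    child["lr"] = min(max(child["lr"] + random.gauss(0, 0.01), 1e-4), 0.3)
    if random.random() < 0.3:
        child["layers"] = random.randint(1, 8)
    return child

# Initialize a random population of configurations.
population = [{"lr": random.uniform(1e-4, 0.3), "layers": random.randint(1, 8)}
              for _ in range(20)]

for generation in range(30):
    # Tournament selection: the fitter of two random individuals becomes the parent.
    a, b = random.sample(population, 2)
    parent = max(a, b, key=fitness)
    child = mutate(parent)
    # Replace the worst individual (regularized evolution would remove the oldest instead).
    worst = min(population, key=fitness)
    population.remove(worst)
    population.append(child)

print(max(population, key=fitness))
```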
| Method | Key Idea | Best For |
|---|---|---|
| Genetic Algorithm | Crossover + mutation | Discrete, structured spaces |
| Evolution Strategies | Gaussian mutation, no crossover | Continuous spaces |
| Regularized Evolution | Tournament selection + aging | NAS (AmoebaNet) |
| NEAT | Evolving both topology and weights | Neural network structure |
| Population-Based Training | Online hyperparameter adaptation | RL, long training runs |
Evolutionary methods excel when: the space is highly structured or discrete, gradient information is unavailable, evaluation is parallelizable, and you need to explore diverse solutions. They're the method of choice for most NAS systems.
With many options available, how do you choose? Decision factors include: search space structure, evaluation cost, parallelization needs, and available budget.
| Scenario | Recommended Method | Why |
|---|---|---|
| Small budget (<50 evals) | Random Search | Not enough data for model-based |
| Continuous, low-dim (<10) | GP-BO (SMAC, BoTorch) | Sample-efficient with uncertainty |
| Mixed types, high-dim | TPE (Optuna) | Handles categoricals, scales well |
| Many cheap evaluations | Hyperband | Aggressive early stopping |
| General purpose | BOHB | Best of TPE + Hyperband |
| Neural Architecture Search | Regularized Evolution | Handles structured spaces |
| Need parallelization | TPE or Population-based | Independent sampling |
| Very long training | Population-Based Training | Online adaptation |
Complex methods have overhead. For small problems (<100 evaluations), sophisticated methods may not outperform random search despite added complexity. Match method sophistication to problem scale.
What's Next:
With search spaces defined and strategies to explore them, we need to efficiently evaluate candidate configurations. The next page covers evaluation strategies: cross-validation, early stopping, fidelity proxies, and how to make robust decisions from noisy evaluations.
You now understand the spectrum of search strategies: random/grid as baselines, Bayesian optimization for sample efficiency, TPE for complex spaces, multi-fidelity for early stopping, and evolutionary methods for NAS. Next, we'll explore how to evaluate configurations efficiently.