With search spaces potentially containing billions of configurations, how do we find good solutions efficiently? Exhaustive search is impossible; random guessing is inefficient. We need intelligent search strategies that balance exploration (discovering new regions) with exploitation (refining known good regions).
This page covers the major search strategies used in AutoML: random search, grid search, Bayesian optimization, evolutionary methods, and multi-fidelity approaches. You'll understand when to use each and how they work under the hood.
You'll master the spectrum of search strategies: from baseline methods (random, grid) through sophisticated model-based approaches (Bayesian optimization with Gaussian Processes and TPE), to population-based methods (evolutionary algorithms). You'll understand the tradeoffs and know which to apply in different scenarios.
Before sophisticated optimization, practitioners used simple sampling strategies. Understanding their limitations motivates more advanced methods.
Grid Search:
Grid search evaluates every combination of pre-specified parameter values. With three candidate values for each of three hyperparameters, that is 3 × 3 × 3 = 27 combinations, each requiring a full training run; a sketch follows.
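As a concrete illustration, here is a minimal sketch of such a grid in Python; the parameter names and values (learning rate, max depth, number of estimators) are illustrative placeholders, not a recommended setup:

```python
from itertools import product

# Hypothetical 3 x 3 x 3 grid: parameter names and values are illustrative.
grid = {
    "learning_rate": [0.01, 0.1, 0.3],
    "max_depth": [3, 6, 9],
    "n_estimators": [100, 300, 500],
}

names = list(grid)
configs = [dict(zip(names, values)) for values in product(*grid.values())]
print(len(configs))  # 27 combinations, each requiring a full training run
```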
Grid Search Limitations:
Why Random Search Often Beats Grid Search (Bergstra & Bengio, 2012):
The key insight: most hyperparameters don't matter equally. Typically 1-2 parameters dominate performance while others have minimal impact.
With 9 grid points across 2 parameters, grid search tests only 3 unique values per parameter. Random search with 9 samples tests ~9 unique values of the important parameter.
If learning rate dominates and depth doesn't matter, grid search effectively explores only 3 learning rates (each repeated at depths that change nothing), while random search spends every one of its 9 evaluations on a new learning rate.
Random search is strictly better when some parameters matter more than others—which is almost always true.
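A toy sketch of this effect, assuming a synthetic objective in which only the learning rate matters; the value ranges are arbitrary:

```python
import random

random.seed(0)

def objective(lr, depth):
    # Synthetic objective: performance depends only on the learning rate.
    return -(lr - 0.07) ** 2

# Grid: 3 x 3 = 9 evaluations, but only 3 distinct learning rates.
grid_scores = [objective(lr, d) for lr in (0.001, 0.01, 0.1) for d in (3, 6, 9)]

# Random: 9 evaluations, 9 distinct learning rates.
random_scores = [objective(random.uniform(0.001, 0.1), random.choice((3, 6, 9)))
                 for _ in range(9)]

print(max(grid_scores), max(random_scores))  # random usually lands closer to lr = 0.07
```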
Random search is a surprisingly strong baseline. Many papers claiming sophisticated methods fail to beat well-tuned random search. Always compare advanced methods against random search with equal budget before claiming improvement.
Bayesian Optimization (BO) is the workhorse of modern hyperparameter optimization. Unlike random search, BO learns from previous evaluations to focus on promising regions.
Core Idea:
Fit a cheap surrogate model to the configurations evaluated so far, use an acquisition function to pick the configuration the surrogate suggests is most promising, evaluate it for real, update the surrogate with the result, and repeat. The surrogate is cheap to query; the true objective (training a model) is expensive. BO trades many cheap surrogate queries for a few expensive evaluations.
Gaussian Process Surrogate:
The classic BO surrogate is a Gaussian Process (GP)—a distribution over functions. Given observations $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, the GP provides a predictive mean $\mu(x)$ and an uncertainty estimate $\sigma^2(x)$ at any candidate configuration $x$.
The GP posterior is: $$p(f(x) | D) = \mathcal{N}(\mu(x), \sigma^2(x))$$
where closed-form formulas exist for $\mu(x)$ and $\sigma^2(x)$ based on the kernel function and observations.
The acquisition function decides which configuration to evaluate next:

- Expected Improvement (EI): the expected value of improvement over the current best. Balances exploitation (high μ) with exploration (high σ).
- Upper Confidence Bound (UCB): μ(x) + κσ(x), where κ controls exploration.
- Probability of Improvement (PI): the probability of exceeding the current best.

EI is the most common choice in practice.
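A minimal sketch of one BO step with a GP surrogate and Expected Improvement, using scikit-learn's GaussianProcessRegressor; the one-dimensional objective and search range are stand-ins for a real training run:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Toy 1-D objective standing in for an expensive training run.
def objective(x):
    return -(x - 0.3) ** 2

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(5, 1))          # configurations tried so far
y = objective(X).ravel()                    # their observed scores

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X, y)

candidates = np.linspace(0, 1, 200).reshape(-1, 1)
mu, sigma = gp.predict(candidates, return_std=True)

# Expected Improvement over the current best observation (maximization).
best = y.max()
z = (mu - best) / np.maximum(sigma, 1e-9)
ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

next_x = candidates[np.argmax(ei)]          # next configuration to evaluate
print(next_x)
```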
TPE is an alternative to GP-based BO that handles high-dimensional and categorical spaces better. It's the default in Optuna and HyperOpt.
Key Insight:
GPs model $p(y|x)$—the performance given a configuration. TPE models the inverse: $p(x|y)$—the configurations that achieve performance $y$.
TPE Approach:
Split the trials completed so far into a "good" group (the top quantile by performance) and a "bad" group. Fit a density $l(x)$ to the good configurations and $g(x)$ to the bad ones, then propose the candidate that maximizes the ratio $l(x)/g(x)$. This ratio is proportional to Expected Improvement!
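A minimal Optuna sketch using its default TPE sampler; the parameter names, ranges, and placeholder objective are illustrative assumptions rather than a recommended configuration:

```python
import optuna

def objective(trial):
    # Mixed search space: TPE handles the categorical choice natively.
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    depth = trial.suggest_int("depth", 2, 12)
    booster = trial.suggest_categorical("booster", ["gbtree", "dart"])
    # Placeholder score; in practice, train and validate a model here.
    return -(lr - 0.01) ** 2 - 0.001 * depth + (0.01 if booster == "dart" else 0.0)

study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=50)
print(study.best_params)
```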
| Aspect | Gaussian Process BO | TPE |
|---|---|---|
| Models | p(y\|x) directly | p(x\|y) via density estimation |
| Scalability | O(n³) in observations | O(n) in observations |
| Categorical Variables | Requires encoding | Handles natively |
| Conditional Spaces | Difficult | Natural handling |
| Parallelization | Limited (kriging believer) | Easy (independent sampling) |
| Implementation | Complex | Relatively simple |
Use TPE when: High-dimensional spaces (>10 params), categorical/conditional parameters, need parallelization. Use GP-BO when: Low-dimensional continuous spaces, need uncertainty estimates, very expensive evaluations where sample efficiency is critical.
Multi-fidelity methods exploit a key observation: bad configurations can be identified early. Instead of fully training every model, we can use cheap early-stopping to filter poor candidates.
Successive Halving:
Start many configurations with a small budget (for example, a few epochs each), evaluate them, keep only the top fraction (say the best third), increase the survivors' budget by the same factor, and repeat until a single configuration remains at the full budget. A minimal sketch of this schedule follows.
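This sketch assumes a hypothetical `evaluate(config, budget)` that stands in for partially training a model:

```python
import random

random.seed(0)

def evaluate(config, budget):
    # Placeholder for partially training a model for `budget` epochs:
    # a noisy score whose noise shrinks as the budget grows.
    return config["quality"] + random.gauss(0, 0.1) / budget ** 0.5

def successive_halving(configs, min_budget=1, eta=3):
    budget = min_budget
    while len(configs) > 1:
        scores = [(evaluate(c, budget), c) for c in configs]
        scores.sort(key=lambda s: s[0], reverse=True)
        configs = [c for _, c in scores[: max(1, len(configs) // eta)]]  # keep top 1/eta
        budget *= eta                                                    # grow the budget
    return configs[0]

# Hypothetical configurations with a hidden "quality" standing in for true performance.
pool = [{"id": i, "quality": random.random()} for i in range(27)]
print(successive_halving(pool))
```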
Hyperband:
Hyperband runs multiple brackets of Successive Halving with different starting configurations and budgets, hedging against the assumption that early performance predicts final performance.
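For concreteness, here is a small sketch of the bracket schedule, following the standard Hyperband formulation and assuming a maximum budget of R = 81 epochs with reduction factor η = 3:

```python
import math

R, eta = 81, 3                       # max budget per configuration, reduction factor
s_max = round(math.log(R, eta))      # floor(log_eta(R)); exact here since R = eta**4

for s in range(s_max, -1, -1):
    n = math.ceil((s_max + 1) * eta ** s / (s + 1))   # initial configurations in bracket
    r = R * eta ** (-s)                               # initial budget per configuration
    print(f"bracket s={s}: start {n} configs at budget {r:g}")
```

With these numbers the brackets start with 81, 34, 15, 8, and 5 configurations at budgets of 1, 3, 9, 27, and 81 epochs respectively, so aggressive early stopping and full-budget evaluation are both represented.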
BOHB: Bayesian Optimization + Hyperband
BOHB combines the best of both worlds: Hyperband's bracket schedule decides how many configurations to run and at what budgets, while a TPE-style model (rather than random sampling) proposes which configurations enter each bracket.
Result: Sample-efficient selection + aggressive early stopping = state-of-the-art efficiency.
Key BOHB Insight:
TPE builds density models from completed evaluations at each budget level. Configurations that perform well at low budgets inform higher-budget proposals. The model gradually shifts from exploration to exploitation.
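One way to approximate this combination in practice is Optuna's TPE sampler paired with its Hyperband pruner; this is a BOHB-style setup rather than the original BOHB implementation, and the learning-curve objective below is a placeholder:

```python
import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-4, 1e-1, log=True)
    score = 0.0
    for epoch in range(100):                                  # budget = training epochs
        score = 1.0 - (lr - 0.01) ** 2 - 1.0 / (epoch + 1)    # placeholder learning curve
        trial.report(score, epoch)                            # report intermediate performance
        if trial.should_prune():                              # Hyperband decides on early stopping
            raise optuna.TrialPruned()
    return score

study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(seed=0),
    pruner=optuna.pruners.HyperbandPruner(min_resource=1, max_resource=100,
                                          reduction_factor=3),
)
study.optimize(objective, n_trials=60)
print(study.best_value, study.best_params)
```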
Multi-fidelity methods are powerful when: (1) early performance correlates with final performance, (2) evaluation cost scales with budget, (3) you have many configurations to try. They're less useful when early training is uninformative (some architectures need many epochs to start learning).
Evolutionary algorithms maintain a population of configurations that evolve over generations through selection, mutation, and crossover. They're particularly effective for NAS and complex structured spaces.
Basic Evolutionary Loop:
Initialize a population of random configurations and evaluate their fitness. Each generation, select strong parents (e.g., via tournament selection), create children by mutating (and optionally crossing over) the parents, evaluate the children, and let them replace the weakest (or, in regularized evolution, the oldest) members of the population. Repeat for many generations; see the sketch below.
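A minimal sketch of this loop with tournament selection and mutation only; the configuration fields and fitness function are illustrative placeholders:

```python
import random

random.seed(0)

def fitness(config):
    # Placeholder for training and validating a model with this configuration.
    return -(config["lr"] - 0.05) ** 2 - 0.01 * abs(config["layers"] - 4)

def mutate(config):
    child = dict(config)
    child["lr"] = min(max(child["lr"] + random.gauss(0, 0.01), 1e-4), 0.3)
    if random.random() < 0.3:
        child["layers"] = random.randint(1, 8)
    return child

# Initialize a random population of configurations.
population = [{"lr": random.uniform(1e-4, 0.3), "layers": random.randint(1, 8)}
              for _ in range(20)]

for generation in range(30):
    # Tournament selection: the fitter of two random individuals becomes the parent.
    a, b = random.sample(population, 2)
    parent = max(a, b, key=fitness)
    child = mutate(parent)
    # Replace the worst individual (regularized evolution would remove the oldest instead).
    worst = min(population, key=fitness)
    population.remove(worst)
    population.append(child)

print(max(population, key=fitness))
```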
| Method | Key Idea | Best For |
|---|---|---|
| Genetic Algorithm | Crossover + mutation | Discrete, structured spaces |
| Evolution Strategies | Gaussian mutation, no crossover | Continuous spaces |
| Regularized Evolution | Tournament selection + aging | NAS (AmoebaNet) |
| NEAT | Evolving both topology and weights | Neural network structure |
| Population-Based Training | Online hyperparameter adaptation | RL, long training runs |
Evolutionary methods excel when: the space is highly structured or discrete, gradient information is unavailable, evaluation is parallelizable, and you need to explore diverse solutions. They're the method of choice for most NAS systems.
With many options available, how do you choose? Decision factors include: search space structure, evaluation cost, parallelization needs, and available budget.
| Scenario | Recommended Method | Why |
|---|---|---|
| Small budget (<50 evals) | Random Search | Not enough data for model-based |
| Continuous, low-dim (<10) | GP-BO (SMAC, BoTorch) | Sample-efficient with uncertainty |
| Mixed types, high-dim | TPE (Optuna) | Handles categoricals, scales well |
| Many cheap evaluations | Hyperband | Aggressive early stopping |
| General purpose | BOHB | Best of TPE + Hyperband |
| Neural Architecture Search | Regularized Evolution | Handles structured spaces |
| Need parallelization | TPE or Population-based | Independent sampling |
| Very long training | Population-Based Training | Online adaptation |
Complex methods have overhead. For small problems (<100 evaluations), sophisticated methods may not outperform random search despite added complexity. Match method sophistication to problem scale.
What's Next:
With search spaces defined and strategies to explore them, we need to efficiently evaluate candidate configurations. The next page covers evaluation strategies: cross-validation, early stopping, fidelity proxies, and how to make robust decisions from noisy evaluations.
You now understand the spectrum of search strategies: random/grid as baselines, Bayesian optimization for sample efficiency, TPE for complex spaces, multi-fidelity for early stopping, and evolutionary methods for NAS. Next, we'll explore how to evaluate configurations efficiently.