Search strategies propose configurations, but how do we evaluate them? Evaluation seems straightforward (train, then measure performance), yet subtle issues abound: noisy estimates, evaluation cost, and the risk of overfitting the validation set.
This page covers evaluation strategies that balance accuracy, cost, and reliability in AutoML systems.
You'll master evaluation techniques: holdout vs cross-validation, stratification, early stopping, low-fidelity proxies, handling noise, and avoiding validation overfitting. You'll understand how to make reliable decisions from inherently noisy evaluations.
Every configuration evaluation requires a validation protocol. The choice significantly impacts both evaluation cost and reliability.
Holdout Validation:
Simplest approach: split the data into training and validation sets, train on the training set, and evaluate on the validation set.
K-Fold Cross-Validation:
Split data into K folds. Train K times, each time using K-1 folds for training and 1 for validation. Average results.
| Method | Cost | Variance | Data Efficiency | Best For |
|---|---|---|---|---|
| Holdout (80/20) | 1× | High | Low | Large datasets, fast iteration |
| 3-Fold CV | 3× | Medium | Medium | Moderate size, balanced cost/variance |
| 5-Fold CV | 5× | Low | High | Standard choice, good tradeoff |
| 10-Fold CV | 10× | Very Low | High | Small datasets, precise estimates |
| Leave-One-Out | N× | Minimal | Maximum | Very small datasets only |
| Repeated K-Fold | K×R | Very Low | High | When statistical precision is critical |
For imbalanced classification, always use stratified splits. Without stratification, some folds may have very different class distributions, leading to high variance and misleading estimates. StratifiedKFold ensures each fold mirrors overall class proportions.
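A minimal sketch of the difference in practice, using scikit-learn on a synthetic imbalanced dataset (the dataset, model, and split sizes are illustrative assumptions):

```python
# Sketch: holdout vs. stratified 5-fold CV on an imbalanced dataset.
# Dataset, model, and metric choices here are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# Holdout (80/20), stratified so the 10% minority class appears in both splits
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("holdout accuracy:", model.score(X_va, y_va))

# Stratified 5-fold CV: each fold mirrors the overall class proportions
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(f"5-fold CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The CV estimate costs five training runs but comes with a spread you can report, which the single holdout number cannot give you.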
Early stopping terminates training when validation performance stops improving. This serves two purposes: it prevents overfitting to the training data, and it avoids wasting compute on epochs that no longer help.
Patience-Based Early Stopping:
Monitor a validation metric; if it does not improve for p consecutive epochs, stop training.
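A framework-agnostic sketch of the patience rule (train_one_epoch and validate are hypothetical placeholders for your own training and validation routines):

```python
# Sketch of patience-based early stopping; train_one_epoch() and validate()
# are hypothetical placeholders for your own training/validation routines.
def fit_with_early_stopping(model, patience=5, max_epochs=200):
    best_score, best_epoch = float("-inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch(model)           # one pass over the training data
        score = validate(model)          # e.g., validation accuracy (higher is better)
        if score > best_score:
            best_score, best_epoch = score, epoch   # improvement: reset patience
        elif epoch - best_epoch >= patience:
            break                        # no improvement for `patience` epochs
    return best_score
```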
Learning Curve Extrapolation:
More sophisticated: predict final performance from partial training curves. If the predicted final performance is poor, stop early.
Typical methods fit parametric curves (for example, power-law or exponential saturation models) to the partial validation curve and extrapolate to the planned final epoch. This enables stopping even earlier than patience-based methods.
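Here is a minimal sketch of that idea using scipy.optimize.curve_fit; the power-law form, the observed scores, and the stopping threshold are all illustrative assumptions:

```python
# Sketch: extrapolate final validation accuracy from a partial learning curve
# by fitting a saturating power-law model. The functional form is one common
# choice, not the only one; the scores and threshold are illustrative.
import numpy as np
from scipy.optimize import curve_fit

def power_law(epoch, a, b, c):
    # Saturating curve: approaches `a` as epoch grows
    return a - b * epoch ** (-c)

epochs = np.arange(1, 11)   # first 10 epochs observed (illustrative values)
val_acc = np.array([0.62, 0.70, 0.74, 0.77, 0.79, 0.80, 0.81, 0.82, 0.825, 0.83])

params, _ = curve_fit(power_law, epochs, val_acc, p0=[0.9, 0.5, 0.5], maxfev=10000)
predicted_final = power_law(100, *params)   # extrapolate to the planned epoch 100
print(f"predicted accuracy at epoch 100: {predicted_final:.3f}")

if predicted_final < 0.85:   # hypothetical threshold for this search
    print("predicted final performance is poor -> stop this run early")
```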
During hyperparameter optimization, use more aggressive early stopping than for final training. For HPO, you care about ranking (is config A better than B?), not absolute performance. Aggressive stopping with patience=5 epochs often correctly ranks configurations while saving 70%+ compute.
Fidelity refers to how closely an evaluation approximates the true objective. Full training on all data is high-fidelity but expensive. Low-fidelity proxies trade accuracy for speed.
Common Fidelity Dimensions:
| Dimension | Low Fidelity | High Fidelity | Speedup |
|---|---|---|---|
| Training epochs | 1-10 epochs | 100+ epochs | 10-100× |
| Dataset size | 10% subsample | Full dataset | 10× |
| Model size | Scaled-down model | Full model | 4-16× |
| Resolution (images) | Low-res (32×32) | High-res (224×224) | 50× |
| CV folds | 1 fold / holdout | 5-fold CV | 5× |
Multi-Fidelity Optimization:
Low-fidelity evaluations can filter obviously poor configurations cheaply. The key question: does low-fidelity performance correlate with high-fidelity performance?
If it does, we can evaluate many configurations cheaply at low fidelity, discard the clearly weak ones, and promote only the most promising candidates to higher fidelities. This is the principle behind Successive Halving and Hyperband.
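A minimal sketch of Successive Halving under these assumptions (train_and_score is a hypothetical placeholder that trains a configuration at the given budget and returns its validation score):

```python
# Sketch of Successive Halving: start many configs at low fidelity, keep the
# top fraction, and re-evaluate survivors at higher fidelity. `train_and_score`
# is a hypothetical placeholder for your evaluation routine.
def successive_halving(configs, min_budget=1, max_budget=81, eta=3):
    budget, survivors = min_budget, list(configs)
    while budget <= max_budget and len(survivors) > 1:
        scores = [train_and_score(c, budget=budget) for c in survivors]
        k = max(1, len(survivors) // eta)            # keep the top 1/eta fraction
        ranked = sorted(range(len(survivors)), key=lambda i: scores[i], reverse=True)
        survivors = [survivors[i] for i in ranked[:k]]
        budget *= eta                                # promote survivors to higher fidelity
    return survivors[0]
```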
Multi-fidelity only works when low-fidelity rankings correlate with high-fidelity rankings. This assumption can fail: some models (like large transformers) perform poorly initially but excel eventually. Always validate the correlation for your domain before relying on aggressive early elimination.
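One simple way to validate that assumption is to run a small pilot set of configurations at both fidelities and check the rank correlation, for example with scipy (the scores below are illustrative):

```python
# Sketch: check whether low-fidelity rankings agree with high-fidelity rankings
# on a small pilot set of configurations before trusting aggressive elimination.
from scipy.stats import spearmanr

# Illustrative scores for the same 8 configs at two fidelities
low_fidelity_scores  = [0.61, 0.58, 0.70, 0.65, 0.55, 0.72, 0.60, 0.68]
high_fidelity_scores = [0.82, 0.79, 0.88, 0.84, 0.76, 0.90, 0.80, 0.86]

rho, pvalue = spearmanr(low_fidelity_scores, high_fidelity_scores)
print(f"Spearman rank correlation: {rho:.2f} (p={pvalue:.3f})")
# A rho well below ~0.7 (a rule of thumb, not a hard threshold) suggests
# low-fidelity results are not a reliable filter for this problem.
```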
ML training has inherent stochasticity: the same configuration may yield different results across runs. How do we make reliable decisions in the presence of noise?
Sources of Variance:
- Random weight initialization
- Data shuffling and batch ordering
- Stochastic regularization such as dropout
Strategies for Handling Noise:
1. Repeated Evaluation:
Run the same configuration multiple times and average. Compute confidence intervals.
2. Fixed Random Seeds:
Use consistent seeds across configurations for fair comparison. Same initialization, same splits.
3. Statistical Testing:
Don't compare point estimates; use t-tests or non-parametric tests to compare configurations.
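A sketch combining strategies 1 and 3: evaluate two configurations over the same seeds, report mean and standard deviation, and compare with a paired t-test (evaluate, config_a, and config_b are hypothetical placeholders for your own setup):

```python
# Sketch: repeated evaluation + paired t-test for comparing two configurations.
# evaluate(config, seed) is a hypothetical placeholder returning a validation score.
import numpy as np
from scipy.stats import ttest_rel

seeds = [0, 1, 2, 3, 4]
scores_a = np.array([evaluate(config_a, seed=s) for s in seeds])
scores_b = np.array([evaluate(config_b, seed=s) for s in seeds])

print(f"A: {scores_a.mean():.3f} +/- {scores_a.std():.3f}")
print(f"B: {scores_b.mean():.3f} +/- {scores_b.std():.3f}")

# Paired test: both configs were evaluated with the same seeds/splits
t_stat, p_value = ttest_rel(scores_a, scores_b)
if p_value < 0.05:
    print("difference is statistically significant")
else:
    print("no significant difference -- prefer the cheaper/simpler config")
```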
For HPO with many evaluations, use fixed seeds for fair comparison. For final model selection, use repeated evaluation (3-5 seeds) with confidence intervals. Report both mean and std in results. Never claim one config beats another without statistical evidence.
A subtle but critical issue: optimizing on the validation set eventually overfits to it. If you evaluate 1000 configurations on the same validation set, you're effectively "training" on that set.
The Danger:
Suppose the true performance of every configuration is 80%, but each validation estimate is noisy. After evaluating 1000 configurations, the best observed score might be 85% purely by chance. Deploying this "best" configuration yields 80%, not 85%.
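A quick numpy simulation makes the effect concrete (the noise level is an illustrative assumption):

```python
# Sketch: "winner's curse" from selecting the best of many noisy evaluations.
# All configs share the same true accuracy of 0.80; validation noise is illustrative.
import numpy as np

rng = np.random.default_rng(0)
true_accuracy, noise_std, n_configs = 0.80, 0.015, 1000

observed = true_accuracy + rng.normal(0.0, noise_std, size=n_configs)
print(f"best observed validation score: {observed.max():.3f}")   # typically around 0.85
print(f"true accuracy of that 'winner': {true_accuracy:.3f}")    # still 0.80
```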
Mitigations include holding out a final test set that the search never touches (see the warning below) and nested cross-validation:
Nested Cross-Validation:
The gold standard for unbiased evaluation:
For each outer fold:
1. Hold out test fold
2. On remaining data, run AutoML (inner CV for validation)
3. Select best configuration from inner search
4. Evaluate on held-out test fold
Final estimate = average of outer fold test performances
This separates model selection (inner) from performance estimation (outer).
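A compact sketch of nested CV with scikit-learn, where a GridSearchCV stands in for the AutoML search (the estimator and parameter grid are illustrative):

```python
# Sketch: nested cross-validation. The inner GridSearchCV plays the role of the
# AutoML search; the outer loop estimates performance of the whole procedure.
# Estimator and grid are illustrative stand-ins, not a recommendation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)   # model selection
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)   # performance estimation

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 200], "max_depth": [None, 10]},
    cv=inner_cv,
)

# Each outer fold: run the full search on the training part, test on the held-out part
outer_scores = cross_val_score(search, X, y, cv=outer_cv)
print(f"unbiased estimate: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```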
Many published ML results fail to replicate because of validation overfitting. Authors tune extensively on validation, report best results, but the "best" configuration was lucky. Always hold out a true test set. Never look at test data until you've committed to your final model.
Different configurations have different evaluation costs. A small neural network evaluates in seconds; a large one takes hours. How do we fairly compare configurations with different costs?
Cost-Aware Metrics:
Instead of comparing accuracy alone, consider:
| Approach | Method | When to Use |
|---|---|---|
| Fixed time budget | Run all configs for same wall-clock time | When wall-clock time is the binding constraint |
| Fixed compute budget | Allocate same FLOPs to each config | When compute cost is primary concern |
| Pareto optimization | Find non-dominated solutions | When trading off multiple objectives |
| Expected improvement per second | BO acquisition / expected eval time (sketched below) | When configs have varying costs |
| Adaptive budget allocation | Give more time to promising configs | Hyperband-style search |
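For the cost-aware acquisition row, here is a minimal sketch: expected improvement divided by predicted evaluation time. The surrogate means, uncertainties, and runtimes would normally come from fitted models; the values here are illustrative:

```python
# Sketch: cost-aware acquisition -- expected improvement (EI) per second.
# mu, sigma, and predicted_seconds would come from surrogate models
# (e.g., a GP and a runtime model); the values here are illustrative.
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best_so_far):
    z = (mu - best_so_far) / sigma
    return (mu - best_so_far) * norm.cdf(z) + sigma * norm.pdf(z)

mu = np.array([0.84, 0.86, 0.83])                 # surrogate mean for 3 candidates
sigma = np.array([0.02, 0.05, 0.01])              # surrogate uncertainty
predicted_seconds = np.array([120.0, 3600.0, 60.0])
best_so_far = 0.85

ei_per_second = expected_improvement(mu, sigma, best_so_far) / predicted_seconds
print("next config to evaluate:", int(np.argmax(ei_per_second)))
```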
Practical Advice:
AutoML can consume substantial cloud resources. Before running, estimate: configurations × evaluations × training time × cost per hour. Set hard budget limits. Monitor spending. A runaway AutoML job can generate surprising bills.
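For example, a back-of-the-envelope estimate before launching a run (all numbers are hypothetical):

```python
# Sketch: back-of-the-envelope cost estimate before launching an AutoML run.
# All numbers are hypothetical; plug in your own.
n_configs = 200            # configurations the search will try
evals_per_config = 5       # e.g., 5-fold CV per configuration
hours_per_training = 0.5   # average wall-clock time per training run
cost_per_gpu_hour = 2.50   # cloud price in dollars

total_gpu_hours = n_configs * evals_per_config * hours_per_training
estimated_cost = total_gpu_hours * cost_per_gpu_hour
print(f"~{total_gpu_hours:.0f} GPU-hours, roughly ${estimated_cost:,.0f}")
```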
Module Complete!
You've now completed the AutoML Overview module, covering the motivations for AutoML, its automatable components, search spaces, search strategies, and evaluation methods.
Congratulations! With this foundation, you're ready to apply these concepts in practical AutoML systems and to explore advanced topics like Neural Architecture Search in subsequent modules.