Search strategies propose configurations, but how do we evaluate them? Evaluation seems straightforward (train, then measure performance), yet subtle issues abound: noisy estimates, evaluation cost, and the risk of overfitting the validation set.
This page covers evaluation strategies that balance accuracy, cost, and reliability in AutoML systems.
You'll master evaluation techniques: holdout vs cross-validation, stratification, early stopping, low-fidelity proxies, handling noise, and avoiding validation overfitting. You'll understand how to make reliable decisions from inherently noisy evaluations.
Every configuration evaluation requires a validation protocol. The choice significantly impacts both evaluation cost and reliability.
Holdout Validation:
Simplest approach: split the data into training and validation sets, train on the training set, and evaluate on the validation set.
K-Fold Cross-Validation:
Split data into K folds. Train K times, each time using K-1 folds for training and 1 for validation. Average results.
| Method | Cost | Variance | Data Efficiency | Best For |
|---|---|---|---|---|
| Holdout (80/20) | 1× | High | Low | Large datasets, fast iteration |
| 3-Fold CV | 3× | Medium | Medium | Moderate size, balanced cost/variance |
| 5-Fold CV | 5× | Low | High | Standard choice, good tradeoff |
| 10-Fold CV | 10× | Very Low | High | Small datasets, precise estimates |
| Leave-One-Out | N× | Minimal | Maximum | Very small datasets only |
| Repeated K-Fold | K×R | Very Low | High | When statistical precision is critical |
For imbalanced classification, always use stratified splits. Without stratification, some folds may have very different class distributions, leading to high variance and misleading estimates. StratifiedKFold ensures each fold mirrors overall class proportions.
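A minimal sketch of the difference in practice, using scikit-learn on a synthetic imbalanced dataset (the dataset, model, and split sizes are illustrative assumptions):

```python
# Sketch: holdout vs. stratified 5-fold CV on an imbalanced dataset.
# Dataset, model, and metric choices here are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# Holdout (80/20), stratified so the 10% minority class appears in both splits
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("holdout accuracy:", model.score(X_va, y_va))

# Stratified 5-fold CV: each fold mirrors the overall class proportions
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(f"5-fold CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The CV estimate costs five training runs but comes with a spread you can report, which the single holdout number cannot give you.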
Early stopping terminates training when validation performance stops improving. This serves two purposes: it prevents overfitting to the training data, and it avoids wasting compute on epochs that no longer help.
Patience-Based Early Stopping:
Monitor a validation metric; if it does not improve for p consecutive epochs, stop training.
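A framework-agnostic sketch of the patience rule (train_one_epoch and validate are hypothetical placeholders for your own training and validation routines):

```python
# Sketch of patience-based early stopping; train_one_epoch() and validate()
# are hypothetical placeholders for your own training/validation routines.
def fit_with_early_stopping(model, patience=5, max_epochs=200):
    best_score, best_epoch = float("-inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch(model)           # one pass over the training data
        score = validate(model)          # e.g., validation accuracy (higher is better)
        if score > best_score:
            best_score, best_epoch = score, epoch   # improvement: reset patience
        elif epoch - best_epoch >= patience:
            break                        # no improvement for `patience` epochs
    return best_score
```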
Learning Curve Extrapolation:
More sophisticated: predict final performance from partial training curves. If the predicted final performance is poor, stop early.
Typical methods fit parametric curves (for example, power-law or exponential saturation models) to the partial validation curve and extrapolate to the planned final epoch. This enables stopping even earlier than patience-based methods.
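Here is a minimal sketch of that idea using scipy.optimize.curve_fit; the power-law form, the observed scores, and the stopping threshold are all illustrative assumptions:

```python
# Sketch: extrapolate final validation accuracy from a partial learning curve
# by fitting a saturating power-law model. The functional form is one common
# choice, not the only one; the scores and threshold are illustrative.
import numpy as np
from scipy.optimize import curve_fit

def power_law(epoch, a, b, c):
    # Saturating curve: approaches `a` as epoch grows
    return a - b * epoch ** (-c)

epochs = np.arange(1, 11)   # first 10 epochs observed (illustrative values)
val_acc = np.array([0.62, 0.70, 0.74, 0.77, 0.79, 0.80, 0.81, 0.82, 0.825, 0.83])

params, _ = curve_fit(power_law, epochs, val_acc, p0=[0.9, 0.5, 0.5], maxfev=10000)
predicted_final = power_law(100, *params)   # extrapolate to the planned epoch 100
print(f"predicted accuracy at epoch 100: {predicted_final:.3f}")

if predicted_final < 0.85:   # hypothetical threshold for this search
    print("predicted final performance is poor -> stop this run early")
```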
During hyperparameter optimization, use more aggressive early stopping than for final training. For HPO, you care about ranking (is config A better than B?), not absolute performance. Aggressive stopping with patience=5 epochs often correctly ranks configurations while saving 70%+ compute.
Fidelity refers to how closely an evaluation approximates the true objective. Full training on all data is high-fidelity but expensive. Low-fidelity proxies trade accuracy for speed.
Common Fidelity Dimensions:
| Dimension | Low Fidelity | High Fidelity | Speedup |
|---|---|---|---|
| Training epochs | 1-10 epochs | 100+ epochs | 10-100× |
| Dataset size | 10% subsample | Full dataset | 10× |
| Model size | Scaled-down model | Full model | 4-16× |
| Resolution (images) | Low-res (32×32) | High-res (224×224) | 50× |
| CV folds | 1 fold / holdout | 5-fold CV | 5× |
Multi-Fidelity Optimization:
Low-fidelity evaluations can filter obviously poor configurations cheaply. The key question: does low-fidelity performance correlate with high-fidelity performance?
If it does, we can evaluate many configurations cheaply at low fidelity, discard the clearly weak ones, and promote only the most promising candidates to higher fidelities. This is the principle behind Successive Halving and Hyperband.
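A minimal sketch of Successive Halving under these assumptions (train_and_score is a hypothetical placeholder that trains a configuration at the given budget and returns its validation score):

```python
# Sketch of Successive Halving: start many configs at low fidelity, keep the
# top fraction, and re-evaluate survivors at higher fidelity. `train_and_score`
# is a hypothetical placeholder for your evaluation routine.
def successive_halving(configs, min_budget=1, max_budget=81, eta=3):
    budget, survivors = min_budget, list(configs)
    while budget <= max_budget and len(survivors) > 1:
        scores = [train_and_score(c, budget=budget) for c in survivors]
        k = max(1, len(survivors) // eta)            # keep the top 1/eta fraction
        ranked = sorted(range(len(survivors)), key=lambda i: scores[i], reverse=True)
        survivors = [survivors[i] for i in ranked[:k]]
        budget *= eta                                # promote survivors to higher fidelity
    return survivors[0]
```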
Multi-fidelity only works when low-fidelity rankings correlate with high-fidelity rankings. This assumption can fail: some models (like large transformers) perform poorly initially but excel eventually. Always validate the correlation for your domain before relying on aggressive early elimination.
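One simple way to validate that assumption is to run a small pilot set of configurations at both fidelities and check the rank correlation, for example with scipy (the scores below are illustrative):

```python
# Sketch: check whether low-fidelity rankings agree with high-fidelity rankings
# on a small pilot set of configurations before trusting aggressive elimination.
from scipy.stats import spearmanr

# Illustrative scores for the same 8 configs at two fidelities
low_fidelity_scores  = [0.61, 0.58, 0.70, 0.65, 0.55, 0.72, 0.60, 0.68]
high_fidelity_scores = [0.82, 0.79, 0.88, 0.84, 0.76, 0.90, 0.80, 0.86]

rho, pvalue = spearmanr(low_fidelity_scores, high_fidelity_scores)
print(f"Spearman rank correlation: {rho:.2f} (p={pvalue:.3f})")
# A rho well below ~0.7 (a rule of thumb, not a hard threshold) suggests
# low-fidelity results are not a reliable filter for this problem.
```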
ML training has inherent stochasticity: the same configuration may yield different results across runs. How do we make reliable decisions in the presence of noise?
Sources of Variance:
- Random weight initialization
- Data shuffling and batch ordering
- Stochastic regularization such as dropout
Strategies for Handling Noise:
1. Repeated Evaluation:
Run the same configuration multiple times and average. Compute confidence intervals.
2. Fixed Random Seeds:
Use consistent seeds across configurations for fair comparison. Same initialization, same splits.
3. Statistical Testing:
Don't compare point estimates; use t-tests or non-parametric tests to compare configurations.
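A sketch combining strategies 1 and 3: evaluate two configurations over the same seeds, report mean and standard deviation, and compare with a paired t-test (evaluate, config_a, and config_b are hypothetical placeholders for your own setup):

```python
# Sketch: repeated evaluation + paired t-test for comparing two configurations.
# evaluate(config, seed) is a hypothetical placeholder returning a validation score.
import numpy as np
from scipy.stats import ttest_rel

seeds = [0, 1, 2, 3, 4]
scores_a = np.array([evaluate(config_a, seed=s) for s in seeds])
scores_b = np.array([evaluate(config_b, seed=s) for s in seeds])

print(f"A: {scores_a.mean():.3f} +/- {scores_a.std():.3f}")
print(f"B: {scores_b.mean():.3f} +/- {scores_b.std():.3f}")

# Paired test: both configs were evaluated with the same seeds/splits
t_stat, p_value = ttest_rel(scores_a, scores_b)
if p_value < 0.05:
    print("difference is statistically significant")
else:
    print("no significant difference -- prefer the cheaper/simpler config")
```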
For HPO with many evaluations, use fixed seeds for fair comparison. For final model selection, use repeated evaluation (3-5 seeds) with confidence intervals. Report both mean and std in results. Never claim one config beats another without statistical evidence.
A subtle but critical issue: optimizing on the validation set eventually overfits to it. If you evaluate 1000 configurations on the same validation set, you're effectively "training" on that set.
The Danger:
Suppose the true performance of every configuration is 80%, but each validation estimate is noisy. After evaluating 1000 configurations, the best observed score might be 85% purely by chance. Deploying this "best" configuration yields 80%, not 85%.
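A quick numpy simulation makes the effect concrete (the noise level is an illustrative assumption):

```python
# Sketch: "winner's curse" from selecting the best of many noisy evaluations.
# All configs share the same true accuracy of 0.80; validation noise is illustrative.
import numpy as np

rng = np.random.default_rng(0)
true_accuracy, noise_std, n_configs = 0.80, 0.015, 1000

observed = true_accuracy + rng.normal(0.0, noise_std, size=n_configs)
print(f"best observed validation score: {observed.max():.3f}")   # typically around 0.85
print(f"true accuracy of that 'winner': {true_accuracy:.3f}")    # still 0.80
```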
Mitigations include holding out a final test set that the search never touches (see the warning below) and nested cross-validation:
Nested Cross-Validation:
The gold standard for unbiased evaluation:
For each outer fold:
1. Hold out test fold
2. On remaining data, run AutoML (inner CV for validation)
3. Select best configuration from inner search
4. Evaluate on held-out test fold
Final estimate = average of outer fold test performances
This separates model selection (inner) from performance estimation (outer).
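A compact sketch of nested CV with scikit-learn, where a GridSearchCV stands in for the AutoML search (the estimator and parameter grid are illustrative):

```python
# Sketch: nested cross-validation. The inner GridSearchCV plays the role of the
# AutoML search; the outer loop estimates performance of the whole procedure.
# Estimator and grid are illustrative stand-ins, not a recommendation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)   # model selection
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)   # performance estimation

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 200], "max_depth": [None, 10]},
    cv=inner_cv,
)

# Each outer fold: run the full search on the training part, test on the held-out part
outer_scores = cross_val_score(search, X, y, cv=outer_cv)
print(f"unbiased estimate: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```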
Many published ML results fail to replicate because of validation overfitting. Authors tune extensively on validation, report best results, but the "best" configuration was lucky. Always hold out a true test set. Never look at test data until you've committed to your final model.
Different configurations have different evaluation costs. A small neural network evaluates in seconds; a large one takes hours. How do we fairly compare configurations with different costs?
Cost-Aware Metrics:
Instead of comparing accuracy alone, consider:
| Approach | Method | When to Use |
|---|---|---|
| Fixed time budget | Run all configs for same wall-clock time | When wall-clock time is the binding constraint |
| Fixed compute budget | Allocate same FLOPs to each config | When compute cost is primary concern |
| Pareto optimization | Find non-dominated solutions | When trading off multiple objectives |
| Expected improvement per second | BO acquisition / expected eval time (sketched below) | When configs have varying costs |
| Adaptive budget allocation | Give more time to promising configs | Hyperband-style search |
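For the cost-aware acquisition row, here is a minimal sketch: expected improvement divided by predicted evaluation time. The surrogate means, uncertainties, and runtimes would normally come from fitted models; the values here are illustrative:

```python
# Sketch: cost-aware acquisition -- expected improvement (EI) per second.
# mu, sigma, and predicted_seconds would come from surrogate models
# (e.g., a GP and a runtime model); the values here are illustrative.
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best_so_far):
    z = (mu - best_so_far) / sigma
    return (mu - best_so_far) * norm.cdf(z) + sigma * norm.pdf(z)

mu = np.array([0.84, 0.86, 0.83])                 # surrogate mean for 3 candidates
sigma = np.array([0.02, 0.05, 0.01])              # surrogate uncertainty
predicted_seconds = np.array([120.0, 3600.0, 60.0])
best_so_far = 0.85

ei_per_second = expected_improvement(mu, sigma, best_so_far) / predicted_seconds
print("next config to evaluate:", int(np.argmax(ei_per_second)))
```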
Practical Advice:
AutoML can consume substantial cloud resources. Before running, estimate: configurations × evaluations × training time × cost per hour. Set hard budget limits. Monitor spending. A runaway AutoML job can generate surprising bills.
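For example, a back-of-the-envelope estimate before launching a run (all numbers are hypothetical):

```python
# Sketch: back-of-the-envelope cost estimate before launching an AutoML run.
# All numbers are hypothetical; plug in your own.
n_configs = 200            # configurations the search will try
evals_per_config = 5       # e.g., 5-fold CV per configuration
hours_per_training = 0.5   # average wall-clock time per training run
cost_per_gpu_hour = 2.50   # cloud price in dollars

total_gpu_hours = n_configs * evals_per_config * hours_per_training
estimated_cost = total_gpu_hours * cost_per_gpu_hour
print(f"~{total_gpu_hours:.0f} GPU-hours, roughly ${estimated_cost:,.0f}")
```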
Module Complete!
You've now completed the AutoML Overview module, covering the motivations for AutoML, its automatable components, search spaces, search strategies, and evaluation methods.
Congratulations! With this foundation, you're ready to apply these concepts in practical AutoML systems and to explore advanced topics like Neural Architecture Search in subsequent modules.