Hyperparameter optimization presents a fundamental computational challenge: evaluating a single configuration—training a model to completion—can take hours, days, or even weeks for large-scale models. When search spaces contain thousands or millions of candidate configurations, exhaustive evaluation becomes computationally infeasible.
Early stopping approaches represent a paradigm shift in hyperparameter optimization. Rather than treating each configuration evaluation as an atomic, all-or-nothing operation, these methods recognize that partial evaluations carry predictive information. A configuration that performs poorly after 10% of training is unlikely to become the best configuration after 100% of training. This insight enables terminating unpromising configurations early, redirecting computational resources toward more promising candidates.
This idea—that final performance can be inferred from intermediate observations—underlies all multi-fidelity optimization methods and fundamentally changes how we approach hyperparameter search at scale.
By the end of this page, you will understand:
• The theoretical foundation for early stopping in hyperparameter optimization
• Different fidelity types and their implications for optimization
• Early stopping criteria and when they are valid
• The bias-variance tradeoffs in early stopping decisions
• Practical implementation patterns and common pitfalls
Multi-fidelity optimization refers to optimization methods that leverage multiple fidelity levels—approximations of the true objective function that are cheaper to evaluate but provide useful information about the full-fidelity result. In hyperparameter optimization, the "fidelity" typically corresponds to the computational budget allocated to evaluating a configuration.
The key insight is that the objective function (f(\lambda))—model performance as a function of hyperparameters (\lambda)—can be approximated by cheaper functions (f_b(\lambda)) where (b < B) represents a reduced budget compared to the full budget (B).
| Fidelity Type | Description | Low Fidelity | High Fidelity | Typical Correlation |
|---|---|---|---|---|
| Training epochs | Number of passes through training data | 10 epochs | 1000 epochs | High (0.8-0.95) |
| Dataset size | Fraction of training data used | 10% of data | 100% of data | Moderate-High (0.6-0.9) |
| Model size | Reduced model dimensions | Small proxy model | Full model | Variable (0.5-0.85) |
| Cross-validation folds | Number of CV folds | 2-fold CV | 10-fold CV | High (0.85-0.95) |
| Resolution/sampling | Input resolution or sampling rate | Low resolution | Full resolution | Domain-dependent |
Formal Definition:
Let (\mathcal{\Lambda}) be the hyperparameter space and let (f: \mathcal{\Lambda} \rightarrow \mathbb{R}) be the true objective function (e.g., validation loss after full training). A multi-fidelity approximation provides a family of functions:
$$f_b: \mathcal{\Lambda} \rightarrow \mathbb{R}, \quad b \in [b_{\min}, B]$$
where:
• (f_B = f) recovers the true objective at the full budget (B), and
• (f_b) approximates (f) increasingly faithfully as (b \rightarrow B).
The cost of evaluating (f_b(\lambda)) is typically proportional to (b), making low-fidelity evaluations dramatically cheaper.
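To make the budget knob concrete, here is a minimal runnable sketch in which the fidelity is the number of passes over the training data. The use of scikit-learn's `SGDClassifier` and the synthetic dataset are illustrative assumptions, not part of the formal definition.

```python
# A runnable sketch of a fidelity family f_b where b is the number of
# training epochs. SGDClassifier stands in for a real training loop
# (an assumption for illustration; requires a recent scikit-learn).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

def f_b(lam: dict, b: int) -> float:
    """Evaluate hyperparameters `lam` at fidelity b (number of epochs).

    Cost grows roughly linearly with b; f_B (b = full budget) is the
    true objective.
    """
    model = SGDClassifier(loss="log_loss", alpha=lam["alpha"],
                          learning_rate="constant", eta0=lam["eta0"],
                          random_state=0)
    for _ in range(b):  # b passes over the training data
        model.partial_fit(X_tr, y_tr, classes=np.unique(y))
    return log_loss(y_val, model.predict_proba(X_val))

lam = {"alpha": 1e-4, "eta0": 0.01}
print(f_b(lam, b=2), f_b(lam, b=20))  # cheap vs. expensive approximation
```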
Multi-fidelity optimization relies fundamentally on the assumption that performance ranking at low fidelity correlates with performance ranking at high fidelity. Formally, configurations that rank well under f_b should tend to rank well under f_B. When this assumption is violated—for example, when training dynamics change fundamentally between early and late stages—early stopping can eliminate the true optimal configuration.
The theoretical justification for early stopping rests on two complementary perspectives: learning curve analysis and bandit-based formulations.
Learning Curve Perspective:
During training, model performance typically follows a learning curve—a trajectory of performance (validation loss or accuracy) as a function of training progress. Empirically, these curves often exhibit characteristic shapes: rapid initial improvement, followed by diminishing returns, and eventually a plateau near the final performance.
Learning Curve Extrapolation:
Given partial observations of a learning curve, we can attempt to predict the final performance. If (L_\lambda(t)) denotes the learning curve for configuration (\lambda), and we observe values up to time (t_0), we can fit a parametric model and extrapolate:
$$\hat{\theta} = \arg\min_{\theta} \sum_{t=1}^{t_0} \left(L_\lambda(t) - g_\theta(t)\right)^2 + \text{regularization}, \qquad \hat{L}_\lambda(T) = g_{\hat{\theta}}(T)$$
where (g_\theta(t)) is a parametric curve model (e.g., exponential, power law) with parameters (\theta).
The Bayesian perspective treats learning curve prediction as a probabilistic inference problem, maintaining a posterior distribution over possible curves given observations. This enables principled uncertainty quantification about whether early termination is warranted.
The combination of multiple basis curves often works better than any single parametric form. A weighted combination like L(t) = w₁·exp(-a₁t) + w₂·t^(-α) + w₃·log(t) + c can capture diverse learning dynamics. Research on learning curve extrapolation (Domhan et al., 2015) demonstrated that ensemble models outperform single curve fits for early prediction.
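As an illustration, such a weighted basis-curve combination can be fit with ordinary least squares. The basis set, initial values, and synthetic observations below are illustrative assumptions, not the exact model from the cited work.

```python
# A minimal sketch of fitting the weighted basis-curve combination above
# with scipy; basis functions and initializations are illustrative.
import numpy as np
from scipy.optimize import curve_fit

def combined_curve(t, w1, a1, w2, alpha, w3, c):
    """L(t) = w1*exp(-a1*t) + w2*t^(-alpha) + w3*log(t) + c."""
    return w1 * np.exp(-a1 * t) + w2 * np.power(t, -alpha) + w3 * np.log(t) + c

t_obs = np.arange(1, 21)                        # first 20 epochs observed
L_obs = 0.9 * np.exp(-0.3 * t_obs) + 0.2        # synthetic learning curve

popt, _ = curve_fit(
    combined_curve, t_obs, L_obs,
    p0=[0.5, 0.1, 0.1, 0.5, 0.0, L_obs[-1]],    # rough initial guesses
    maxfev=5000,
)
print("Predicted loss at epoch 100:", combined_curve(100.0, *popt))
```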
Multi-Armed Bandit Perspective:
An alternative theoretical framework views hyperparameter optimization as a multi-armed bandit problem. Each configuration (\lambda_i) is an "arm" of the bandit, and pulling an arm reveals stochastic information about its reward (model performance).
The key insight is that we want to identify the best arm efficiently, not maximize cumulative reward. This is the pure exploration or best arm identification variant of bandits.
Successive elimination is a classic strategy: pull all arms equally until statistical evidence suggests some arms are suboptimal, then eliminate them. The budget saved from eliminated arms is reallocated to remaining candidates.
The upper confidence bound (UCB) approach maintains confidence intervals on each arm's mean reward. Arms whose upper confidence bound falls below another arm's lower confidence bound can be eliminated.
For configuration (i) with observed mean performance (\hat{\mu}_i) and (n_i) evaluations:
$$\text{UCB}_i = \hat{\mu}_i + \sqrt{\frac{2 \ln(1/\delta)}{n_i}}$$
$$\text{LCB}_i = \hat{\mu}_i - \sqrt{\frac{2 \ln(1/\delta)}{n_i}}$$
Eliminate configuration (i) if (\text{UCB}_i < \text{LCB}_j) for some remaining configuration (j).
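A minimal sketch of this elimination rule, assuming i.i.d. rewards per pull and treating higher mean reward as better:

```python
# Confidence-bound elimination using the Hoeffding-style radius above.
import numpy as np

def ucb_eliminate(means: np.ndarray, counts: np.ndarray, delta: float = 0.05):
    """Return indices of arms that can be eliminated.

    Arm i is eliminated when UCB_i < max_j LCB_j, i.e., its most optimistic
    value is below another arm's most pessimistic value.
    """
    radius = np.sqrt(2.0 * np.log(1.0 / delta) / counts)
    ucb = means + radius
    lcb = means - radius
    return np.where(ucb < lcb.max())[0]

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.8])          # arm 2 is best
counts = np.full(3, 200)                        # 200 pulls per arm
means = np.array([rng.normal(m, 0.1, 200).mean() for m in true_means])
print("Eliminate arms:", ucb_eliminate(means, counts))  # likely [0]
```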
Implementing early stopping requires defining when to terminate a configuration. Various criteria have been proposed, each with different properties:
```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm
from typing import List, Tuple


class EarlyStoppingCriteria:
    """Collection of early stopping criteria for hyperparameter optimization."""

    @staticmethod
    def median_stopping(
        current_performance: float,
        historical_performances: List[float],
        fidelity: int
    ) -> bool:
        """
        Median stopping rule: terminate if worse than the median at the same fidelity.

        Args:
            current_performance: Current configuration's performance (lower is better)
            historical_performances: Performances of completed configs
            fidelity: Current fidelity level (for filtering historical data)

        Returns:
            True if the configuration should be stopped
        """
        if len(historical_performances) < 3:  # Need sufficient history
            return False
        median = np.median(historical_performances)
        return current_performance > median  # Stop if worse than median

    @staticmethod
    def percentile_stopping(
        current_performance: float,
        historical_performances: List[float],
        percentile: float = 50.0
    ) -> bool:
        """
        Percentile stopping: terminate if worse than the p-th percentile.

        Lower percentile = more aggressive stopping.
        """
        if len(historical_performances) < 5:
            return False
        threshold = np.percentile(historical_performances, percentile)
        return current_performance > threshold

    @staticmethod
    def learning_curve_stopping(
        observations: List[Tuple[int, float]],  # (epoch, loss) pairs
        current_best: float,
        confidence_threshold: float = 0.95
    ) -> bool:
        """
        Curve-fitting stopping: extrapolate the learning curve and stop if
        the configuration is unlikely to beat the current best.

        Uses a power-law model: L(t) = a * t^(-alpha) + c
        """
        if len(observations) < 5:
            return False

        epochs = np.array([e for e, _ in observations])
        losses = np.array([l for _, l in observations])

        try:
            # Fit power-law decay model
            def power_law(t, a, alpha, c):
                return a * np.power(t + 1, -alpha) + c

            popt, pcov = curve_fit(
                power_law, epochs, losses,
                p0=[losses[0], 0.5, losses[-1]],
                bounds=([0, 0, 0], [np.inf, 2, np.inf]),
                maxfev=1000
            )

            # Extrapolate to completion (e.g., epoch 100)
            max_epochs = 100
            predicted_final = power_law(max_epochs, *popt)

            # Estimate uncertainty of the asymptote from the covariance
            predicted_std = float(np.sqrt(pcov[2, 2]))
            if not np.isfinite(predicted_std) or predicted_std <= 0:
                predicted_std = 0.1  # fallback when the fit is degenerate

            # Stop if unlikely to beat the current best:
            # P(final loss < current_best) below 1 - confidence_threshold
            prob_better = norm.cdf(current_best, loc=predicted_final,
                                   scale=predicted_std)
            return prob_better < (1 - confidence_threshold)

        except (RuntimeError, ValueError):
            # Curve fitting failed; don't stop
            return False

    @staticmethod
    def threshold_stopping(
        recent_performances: List[float],
        minimum_improvement: float = 0.001,
        patience: int = 5
    ) -> bool:
        """
        Threshold stopping: stop if no relative improvement above the
        threshold in the last `patience` evaluations.
        """
        if len(recent_performances) < patience + 1:
            return False
        recent = recent_performances[-(patience + 1):]
        best_before = min(recent[:-1])
        best_after = min(recent[-patience:])
        improvement = (best_before - best_after) / abs(best_before + 1e-10)
        return improvement < minimum_improvement


# Example usage
if __name__ == "__main__":
    criteria = EarlyStoppingCriteria()

    # Simulate historical performances
    historical = [0.15, 0.22, 0.18, 0.35, 0.12, 0.28, 0.19]

    # Test median stopping
    current = 0.25
    should_stop = criteria.median_stopping(current, historical, fidelity=10)
    print(f"Median stopping (perf={current}): {should_stop}")  # True

    # Test learning curve stopping
    observations = [(1, 0.8), (2, 0.5), (3, 0.35), (4, 0.28), (5, 0.24)]
    should_stop = criteria.learning_curve_stopping(observations, current_best=0.12)
    print(f"Curve stopping: {should_stop}")
```

In distributed and parallel computing environments, configurations are often evaluated asynchronously—different configurations start and finish at different times, and we cannot wait for all configurations to reach the same fidelity before making stopping decisions.
Asynchronous early stopping presents unique challenges: stopping decisions must be made before all configurations have reached the comparison fidelity, performance comparisons may mix configurations evaluated at different fidelities, and straggler workers can stall any rule that waits for a full cohort.
ASHA (Asynchronous Successive Halving Algorithm):
ASHA adapts successive halving for asynchronous parallel execution. The key idea is to promote configurations without waiting for all configurations at the current rung to complete:
```python
import numpy as np
from dataclasses import dataclass, field
from typing import Dict, List, Optional
from collections import defaultdict
from enum import Enum


class ConfigStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    PROMOTED = "promoted"
    STOPPED = "stopped"
    COMPLETED = "completed"


@dataclass
class Configuration:
    config_id: int
    hyperparameters: Dict
    status: ConfigStatus = ConfigStatus.PENDING
    current_rung: int = 0
    performance_by_rung: Dict[int, float] = field(default_factory=dict)


class ASHA:
    """
    Asynchronous Successive Halving Algorithm (ASHA).

    Key insight: promote a configuration as soon as it is in the top 1/eta
    of all configurations evaluated so far at its rung, rather than waiting
    for every configuration at that rung to complete.
    """

    def __init__(
        self,
        min_budget: int = 1,
        max_budget: int = 81,
        eta: int = 3,
        num_configs: int = 100
    ):
        """
        Args:
            min_budget: Minimum budget (rung 0)
            max_budget: Maximum budget (final rung)
            eta: Reduction factor (keep top 1/eta at each rung)
            num_configs: Total number of configurations to sample
        """
        self.min_budget = min_budget
        self.max_budget = max_budget
        self.eta = eta
        self.num_configs = num_configs

        # Calculate rungs: rung k has budget eta^k * min_budget
        self.rungs = []
        budget = min_budget
        while budget <= max_budget:
            self.rungs.append(budget)
            budget = int(budget * eta)
        self.num_rungs = len(self.rungs)

        # Track configurations at each rung
        self.rung_members: Dict[int, List[Configuration]] = defaultdict(list)
        self.promoted_counts: Dict[int, int] = defaultdict(int)

        # All configurations
        self.configurations: Dict[int, Configuration] = {}
        self.config_counter = 0

        print(f"ASHA initialized with rungs: {self.rungs}")
        print(f"Reduction factor eta={eta}, keeping top {100/eta:.1f}% at each rung")

    def get_next_configuration(self) -> Optional[Configuration]:
        """
        Get the next configuration to evaluate.

        Priority: promote existing configs > sample new configs.
        """
        # First, check if any configuration can be promoted
        promotable = self._find_promotable_configuration()
        if promotable:
            return promotable

        # Otherwise, sample a new configuration
        if self.config_counter < self.num_configs:
            return self._sample_new_configuration()
        return None

    def _find_promotable_configuration(self) -> Optional[Configuration]:
        """
        Find a configuration ready for promotion to the next rung.

        A config is promotable if it is in the top 1/eta of configs that
        have reported a result at its current rung.
        """
        for rung_idx in range(self.num_rungs - 1):  # Can't promote from last rung
            completed_at_rung = [
                c for c in self.rung_members[rung_idx]
                if c.status in (ConfigStatus.COMPLETED, ConfigStatus.PROMOTED)
                and rung_idx in c.performance_by_rung
            ]
            if not completed_at_rung:
                continue

            # Sort by performance (lower is better)
            completed_at_rung.sort(key=lambda c: c.performance_by_rung[rung_idx])

            # How many should be promoted from this rung?
            num_to_promote = max(1, len(completed_at_rung) // self.eta)

            # Find configs in the top 1/eta that haven't been promoted yet.
            # Checking current_rung prevents re-promoting a configuration
            # that has already moved past this rung.
            promotable = [
                c for c in completed_at_rung[:num_to_promote]
                if c.status == ConfigStatus.COMPLETED and c.current_rung == rung_idx
            ]
            if promotable:
                config = promotable[0]
                config.status = ConfigStatus.PROMOTED
                config.current_rung = rung_idx + 1
                self.rung_members[rung_idx + 1].append(config)
                self.promoted_counts[rung_idx] += 1
                print(f"Promoting config {config.config_id} to rung {rung_idx + 1} "
                      f"(budget {self.rungs[rung_idx + 1]})")
                return config
        return None

    def _sample_new_configuration(self) -> Configuration:
        """Sample a new random configuration."""
        config = Configuration(
            config_id=self.config_counter,
            hyperparameters=self._sample_hyperparameters(),
            status=ConfigStatus.RUNNING,
            current_rung=0
        )
        self.configurations[config.config_id] = config
        self.rung_members[0].append(config)
        self.config_counter += 1
        return config

    def _sample_hyperparameters(self) -> Dict:
        """Sample random hyperparameters (placeholder)."""
        return {
            "learning_rate": 10 ** np.random.uniform(-5, -1),
            "num_layers": np.random.randint(1, 10),
            "hidden_size": 2 ** np.random.randint(4, 10),
            "dropout": np.random.uniform(0, 0.5),
        }

    def report_result(self, config_id: int, rung: int, performance: float):
        """
        Report a result from an evaluation.

        Args:
            config_id: Configuration identifier
            rung: Rung at which the evaluation completed
            performance: Validation loss (lower is better)
        """
        config = self.configurations[config_id]
        config.performance_by_rung[rung] = performance
        # Mark as completed at this rung; if it qualifies, a later call to
        # get_next_configuration() will promote it to the next rung.
        config.status = ConfigStatus.COMPLETED
        print(f"Config {config_id} at rung {rung} (budget {self.rungs[rung]}): "
              f"performance = {performance:.4f}")

    def get_best_configuration(self) -> Optional[Configuration]:
        """Get the best configuration found so far."""
        final_rung = self.num_rungs - 1
        completed = [
            c for c in self.rung_members[final_rung]
            if final_rung in c.performance_by_rung
        ]
        if not completed:
            return None
        return min(completed, key=lambda c: c.performance_by_rung[final_rung])

    def get_statistics(self) -> Dict:
        """Get statistics about the optimization run."""
        stats = {
            "total_configs": len(self.configurations),
            "configs_per_rung": {
                rung: len(members) for rung, members in self.rung_members.items()
            },
            "promotions_per_rung": dict(self.promoted_counts),
        }
        best = self.get_best_configuration()
        if best:
            final_rung = self.num_rungs - 1
            stats["best_performance"] = best.performance_by_rung[final_rung]
            stats["best_hyperparameters"] = best.hyperparameters
        return stats
```

ASHA achieves near-linear speedup with the number of workers by decoupling the promotion decision from synchronization. A worker finishing an evaluation can immediately promote the configuration if it qualifies, without waiting for other workers. This enables efficient utilization even with heterogeneous workloads and straggler workers.
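A short simulated driver for the `ASHA` class above may help make the promotion loop concrete. The fake loss model—a per-configuration base score that improves logarithmically with budget—is an assumption purely for illustration.

```python
# Sequential simulation of an ASHA run with fake evaluations.
import numpy as np

rng = np.random.default_rng(42)
asha = ASHA(min_budget=1, max_budget=81, eta=3, num_configs=30)

base_quality: dict = {}  # per-config "true" quality, sampled once

for _ in range(300):  # simulated sequential worker steps
    config = asha.get_next_configuration()
    if config is None:
        break  # nothing left to promote or sample
    rung = config.current_rung
    base = base_quality.setdefault(config.config_id, rng.uniform(0.1, 1.0))
    # Fake loss: improves logarithmically with budget, plus noise.
    loss = base / np.log2(asha.rungs[rung] + 1) + rng.normal(0.0, 0.01)
    asha.report_result(config.config_id, rung, loss)

print(asha.get_statistics())
```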
Early stopping introduces both bias and variance into the optimization process. Understanding these effects is crucial for designing robust early stopping strategies.
Sources of Bias:
Selection bias: Configurations that happen to perform well early are more likely to be continued, even if early performance is not perfectly predictive of final performance.
Ranking distortion: The relative ordering of configurations can change dramatically between early and late training. Early stopping assumes this distortion is limited.
Hyperparameter-dependent convergence: Some hyperparameters (e.g., learning rate) affect how quickly a configuration reaches its final performance. Fast-converging configurations are favored even if slow-converging alternatives would ultimately be superior.
Quantifying Early Stopping Bias:
Let (\lambda^*) be the true optimal configuration and (\hat{\lambda}) be the configuration selected by early stopping. The bias is:
$$\text{Bias} = \mathbb{E}[f(\hat{\lambda})] - f(\lambda^*)$$
This bias arises because early stopping might eliminate (\lambda^*) if it happens to perform poorly at low fidelity.
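A small simulation makes this concrete. Assuming low- and high-fidelity losses are correlated Gaussians (an illustrative assumption), keeping only the configurations that look best early yields a positive expected gap to the true optimum whenever the correlation is imperfect:

```python
# Monte Carlo estimate of early stopping selection bias.
import numpy as np

rng = np.random.default_rng(0)
n_configs, n_runs, rho = 50, 2000, 0.7
gap = []
for _ in range(n_runs):
    f_high = rng.normal(0, 1, n_configs)                 # true final losses
    noise = rng.normal(0, 1, n_configs)
    f_low = rho * f_high + np.sqrt(1 - rho**2) * noise   # correlated proxy
    survivors = np.argsort(f_low)[: n_configs // 3]      # keep top third early
    chosen = survivors[np.argmin(f_high[survivors])]     # best survivor at full fidelity
    gap.append(f_high[chosen] - f_high.min())            # vs. true optimum
print(f"Mean bias E[f(chosen) - f(opt)]: {np.mean(gap):.3f}")  # > 0 when rho < 1
```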
Factors affecting bias:
| Scenario | Why Bias Occurs | Mitigation Strategy |
|---|---|---|
| Learning rate warmup | Performance is poor during warmup phase | Use minimum fidelity above warmup duration |
| Regularization effects | Regularization helps only late in training | Conservative stopping thresholds |
| Architecture search | Different architectures converge at different rates | Normalize by architecture type |
| Transfer learning | Fine-tuning dynamics differ from training from scratch | Use domain-specific fidelity lower bounds |
| Noisy objectives | High variance in early performance estimates | Multiple independent evaluations at each rung |
Sources of Variance:
Even with unbiased early stopping, the selected configuration can vary significantly across optimization runs, due to stochastic training (random initialization, minibatch ordering), noisy validation estimates at low fidelity, and scheduling nondeterminism in asynchronous execution.
Variance Reduction Strategies:
• Evaluate each configuration several times with different random seeds at each rung and average the results (as in the mitigation table above).
• Smooth noisy learning curves (e.g., with a moving average) before applying a stopping rule.
• Use conservative stopping thresholds at low fidelities, where noise is largest.
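As a sketch of the first two strategies above, assuming a user-supplied `evaluate(config, seed)` callback (a hypothetical stand-in for your training routine):

```python
# Averaging repeated evaluations and exponentially smoothing a learning
# curve before applying a stopping rule; parameter values are illustrative.
import numpy as np
from typing import Callable, List

def averaged_score(evaluate: Callable[[dict, int], float],
                   config: dict, n_repeats: int = 3) -> float:
    """Average several independent evaluations (different seeds) of one config."""
    return float(np.mean([evaluate(config, seed) for seed in range(n_repeats)]))

def ema_smooth(losses: List[float], beta: float = 0.7) -> List[float]:
    """Exponentially smooth a noisy learning curve."""
    smoothed, m = [], losses[0]
    for x in losses:
        m = beta * m + (1 - beta) * x
        smoothed.append(m)
    return smoothed

noisy = [0.80, 0.55, 0.60, 0.42, 0.47, 0.38]
print(ema_smooth(noisy))  # compare these, not the raw values, to a threshold
```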
There is an inherent tradeoff between computational efficiency and optimization quality. More aggressive early stopping saves computation but increases the risk of eliminating the optimal configuration. The right balance depends on the cost of computation versus the cost of suboptimal hyperparameters.
Implementing early stopping in production systems requires careful attention to several practical concerns:
"""Production-ready early stopping implementation with propercheckpoint management, logging, and resource cleanup."""import osimport jsonimport loggingfrom pathlib import Pathfrom datetime import datetimefrom typing import Optional, Callable, Dict, Anyfrom dataclasses import dataclass, asdict logger = logging.getLogger(__name__) @dataclassclass EvaluationResult: """Result from a single evaluation point.""" config_id: str epoch: int budget: float validation_loss: float validation_accuracy: float training_loss: float timestamp: str wall_time_seconds: float checkpoint_path: str class EarlyStoppingController: """ Controller for early stopping with proper lifecycle management. """ def __init__( self, experiment_dir: Path, stopping_criterion: Callable[[list, float], bool], checkpoint_callback: Callable[[str], None], cleanup_callback: Callable[[str], None], ): """ Args: experiment_dir: Directory for checkpoints and logs stopping_criterion: Function(historical_results, current_result) -> should_stop checkpoint_callback: Called to save checkpoint cleanup_callback: Called to clean up resources when stopping """ self.experiment_dir = Path(experiment_dir) self.experiment_dir.mkdir(parents=True, exist_ok=True) self.stopping_criterion = stopping_criterion self.checkpoint_callback = checkpoint_callback self.cleanup_callback = cleanup_callback self.results: Dict[str, list] = {} # config_id -> list of results self.stopped_configs: set = set() # Load existing state if resuming self._load_state() def _state_path(self) -> Path: return self.experiment_dir / "early_stopping_state.json" def _load_state(self): """Load state from disk for resumption.""" state_path = self._state_path() if state_path.exists(): with open(state_path) as f: state = json.load(f) self.results = { k: [EvaluationResult(**r) for r in v] for k, v in state.get("results", {}).items() } self.stopped_configs = set(state.get("stopped_configs", [])) logger.info(f"Resumed state: {len(self.results)} configs, " f"{len(self.stopped_configs)} stopped") def _save_state(self): """Persist state to disk.""" state = { "results": { k: [asdict(r) for r in v] for k, v in self.results.items() }, "stopped_configs": list(self.stopped_configs), "last_updated": datetime.now().isoformat(), } with open(self._state_path(), 'w') as f: json.dump(state, f, indent=2) def report_evaluation( self, result: EvaluationResult ) -> bool: """ Report an evaluation result and determine if training should stop. 
Returns: True if training should continue, False if it should stop """ config_id = result.config_id if config_id in self.stopped_configs: logger.warning(f"Config {config_id} already stopped, ignoring result") return False # Store result if config_id not in self.results: self.results[config_id] = [] self.results[config_id].append(result) # Save checkpoint try: self.checkpoint_callback(result.checkpoint_path) except Exception as e: logger.error(f"Checkpoint failed for {config_id}: {e}") # Log result logger.info( f"Config {config_id} epoch {result.epoch}: " f"val_loss={result.validation_loss:.4f}, " f"val_acc={result.validation_accuracy:.4f}" ) # Collect historical results for stopping decision all_results = [] for cid, results in self.results.items(): if cid not in self.stopped_configs and results: # Get most recent result at similar budget matching = [r for r in results if abs(r.budget - result.budget) < 0.1] if matching: all_results.append(matching[-1].validation_loss) # Make stopping decision should_stop = self.stopping_criterion( all_results, result.validation_loss ) if should_stop: logger.info(f"Early stopping config {config_id} at epoch {result.epoch}") self.stopped_configs.add(config_id) # Cleanup resources try: self.cleanup_callback(config_id) except Exception as e: logger.error(f"Cleanup failed for {config_id}: {e}") # Persist state self._save_state() return not should_stop def get_best_config(self) -> Optional[str]: """Get the best configuration based on final results.""" best_config = None best_loss = float('inf') for config_id, results in self.results.items(): if results: final_loss = results[-1].validation_loss if final_loss < best_loss: best_loss = final_loss best_config = config_id return best_config def get_statistics(self) -> Dict[str, Any]: """Get statistics about the optimization run.""" total_configs = len(self.results) stopped_early = len(self.stopped_configs) completed = total_configs - stopped_early total_evaluations = sum(len(r) for r in self.results.values()) # Compute savings max_epochs_seen = max( (r[-1].epoch for r in self.results.values() if r), default=0 ) potential_evaluations = total_configs * max_epochs_seen savings = (potential_evaluations - total_evaluations) / max(1, potential_evaluations) return { "total_configurations": total_configs, "stopped_early": stopped_early, "completed_full_training": completed, "total_evaluations": total_evaluations, "potential_evaluations": potential_evaluations, "computational_savings": f"{savings:.1%}", }Most modern ML frameworks and hyperparameter optimization libraries provide built-in early stopping support:
• PyTorch Lightning: EarlyStopping callback with configurable patience and metrics
• TensorFlow/Keras: EarlyStopping callback integrated with model.fit()
• Ray Tune: ASHA scheduler with async support and cluster distribution
• Optuna: Pruners including MedianPruner, PercentilePruner, and HyperbandPruner
• Weights & Biases: Sweeps with early termination support
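For example, pruning in Optuna follows a report-and-check pattern. The objective below is a dummy curve assumed only for illustration; consult the Optuna documentation for current details.

```python
# A minimal sketch of Optuna's MedianPruner; the per-epoch loss curve
# is fake and exists only to exercise the pruning API.
import optuna

def objective(trial: optuna.Trial) -> float:
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    loss = 1.0
    for epoch in range(30):
        loss *= 0.9 + 0.5 * lr              # fake per-epoch improvement
        trial.report(loss, step=epoch)      # expose intermediate value
        if trial.should_prune():            # pruner applies the median rule
            raise optuna.TrialPruned()
    return loss

study = optuna.create_study(
    direction="minimize",
    pruner=optuna.pruners.MedianPruner(n_startup_trials=5),
)
study.optimize(objective, n_trials=25)
print(study.best_params)
```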
Despite its power, early stopping is not universally applicable. Recognizing scenarios where early stopping is unreliable prevents wasted effort and suboptimal results.
Detecting Early Stopping Failures:
Post-hoc analysis can reveal whether early stopping was appropriate for a given problem:
Learning curve correlation analysis: Plot low-fidelity vs high-fidelity performance for all configurations. Poor correlation (r < 0.7) indicates early stopping may be unreliable.
Rank correlation (Spearman/Kendall): Measure whether configurations that rank highly at low fidelity also rank highly at high fidelity. Low rank correlation suggests ranking distortion.
False elimination rate: Of configurations eliminated early, estimate what fraction would have been in the top-k set if allowed to complete. High false elimination indicates overly aggressive stopping.
```python
import numpy as np
from scipy.stats import spearmanr, pearsonr
import matplotlib.pyplot as plt
from typing import List

def validate_early_stopping(
    low_fidelity_scores: List[float],
    high_fidelity_scores: List[float],
    eliminated_at_low: List[int],  # Indices of eliminated configs
) -> dict:
    """
    Validate whether early stopping was appropriate for this problem.

    Args:
        low_fidelity_scores: Performance at low fidelity for all configs
        high_fidelity_scores: Performance at high fidelity (ground truth)
        eliminated_at_low: Indices of configs that were eliminated early

    Returns:
        Dictionary of validation metrics
    """
    low = np.array(low_fidelity_scores)
    high = np.array(high_fidelity_scores)

    # Correlation analysis
    pearson_r, pearson_p = pearsonr(low, high)
    spearman_r, spearman_p = spearmanr(low, high)

    # False elimination analysis:
    # which configs would have been in the top-k at high fidelity?
    k = max(1, len(high) // 10)  # Top 10%
    top_k_at_high = set(np.argsort(high)[:k])
    eliminated_set = set(eliminated_at_low)
    false_eliminations = top_k_at_high & eliminated_set
    false_elimination_rate = len(false_eliminations) / max(1, len(eliminated_set))

    # Best-configuration analysis
    true_best = np.argmin(high)
    was_best_eliminated = true_best in eliminated_set

    return {
        "pearson_correlation": pearson_r,
        "spearman_correlation": spearman_r,
        "false_elimination_rate": false_elimination_rate,
        "true_best_eliminated": was_best_eliminated,
        "recommendation": _get_recommendation(pearson_r, spearman_r,
                                              false_elimination_rate),
    }

def _get_recommendation(pearson_r, spearman_r, fer):
    """Generate a recommendation based on validation metrics."""
    if spearman_r > 0.8 and fer < 0.1:
        return "Early stopping is APPROPRIATE for this problem"
    elif spearman_r > 0.6 and fer < 0.2:
        return "Early stopping is ACCEPTABLE but use conservative thresholds"
    elif spearman_r > 0.4:
        return "Early stopping is RISKY - consider higher minimum fidelity"
    else:
        return "Early stopping is NOT RECOMMENDED for this problem"

def plot_fidelity_correlation(
    low: np.ndarray,
    high: np.ndarray,
    eliminated: List[int]
):
    """Visualize correlation between low- and high-fidelity performance."""
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))

    mask = np.ones(len(low), dtype=bool)
    mask[eliminated] = False

    # Scatter plot of raw scores
    ax1 = axes[0]
    ax1.scatter(low[mask], high[mask], alpha=0.6, label='Continued')
    ax1.scatter(low[~mask], high[~mask], alpha=0.6, c='red', label='Eliminated')
    ax1.plot([low.min(), low.max()], [low.min(), low.max()], 'k--', alpha=0.3)
    ax1.set_xlabel('Low Fidelity Performance')
    ax1.set_ylabel('High Fidelity Performance')
    ax1.set_title('Fidelity Correlation')
    ax1.legend()

    # Rank comparison
    ax2 = axes[1]
    low_ranks = np.argsort(np.argsort(low))
    high_ranks = np.argsort(np.argsort(high))
    ax2.scatter(low_ranks[mask], high_ranks[mask], alpha=0.6, label='Continued')
    ax2.scatter(low_ranks[~mask], high_ranks[~mask], alpha=0.6, c='red',
                label='Eliminated')
    ax2.plot([0, len(low)], [0, len(high)], 'k--', alpha=0.3)
    ax2.set_xlabel('Rank at Low Fidelity')
    ax2.set_ylabel('Rank at High Fidelity')
    ax2.set_title('Rank Correlation')
    ax2.legend()

    plt.tight_layout()
    return fig
```

Always validate early stopping assumptions on a representative subset of your search space before committing to an early stopping strategy. Run a pilot study where configurations are evaluated to completion, then retrospectively analyze whether early stopping would have made correct decisions.
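For instance, a pilot-study check might look like the following, with synthetic scores standing in for real pilot results:

```python
# Example pilot-study check using validate_early_stopping above.
import numpy as np

rng = np.random.default_rng(1)
high = rng.uniform(0.1, 0.5, size=40)              # full-fidelity losses
low = 0.8 * high + 0.05 * rng.normal(size=40)      # correlated proxy scores
eliminated = list(np.argsort(low)[20:])            # worst half stopped early

report = validate_early_stopping(list(low), list(high), eliminated)
print(report["spearman_correlation"], report["recommendation"])
```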
Early stopping represents a fundamental optimization in hyperparameter search, enabling dramatic computational savings when applied appropriately.
Looking Ahead:
The next page explores Successive Halving—a principled algorithm that formalizes the early stopping intuitions covered here into a structured procedure with provable properties. Successive Halving provides a rigorous framework for deciding how many configurations to sample, when to stop each, and how to allocate computational budget across the optimization process.
You now understand the theoretical foundations and practical considerations of early stopping in hyperparameter optimization. You can implement stopping criteria, evaluate their appropriateness for a given problem, and avoid common failure modes. Next, we'll see how Successive Halving structures these ideas into a complete algorithm.