The previous page established that early stopping can dramatically reduce computational costs in hyperparameter optimization. But the intuition of "stop bad configurations early" leaves critical questions unanswered:

- How many configurations should we start with?
- When, and on what evidence, should a configuration be stopped?
- How much additional budget should the survivors receive?
Successive Halving (SH) provides principled answers to all these questions. First introduced in the context of pure exploration in multi-armed bandits by Karnin et al. (2013) and adapted for hyperparameter optimization by Jamieson and Talwalkar (2016), Successive Halving transforms the vague notion of early stopping into a rigorous algorithm with theoretical guarantees.
The core insight is elegant: rather than evaluating all configurations to completion, iteratively eliminate the worst-performing half while doubling the budget allocated to survivors. This geometric elimination schedule provides near-optimal identification of the best configuration under reasonable assumptions.
By the end of this page, you will understand:

- The Successive Halving algorithm in complete detail
- Theoretical foundations and performance guarantees
- The n-vs-B/n tradeoff between exploration and exploitation
- Implementation with proper resource management
- When Successive Halving excels and when it struggles
Successive Halving operates in rounds (also called rungs or brackets). In each round:

1. Evaluate every surviving configuration with the current per-configuration budget.
2. Rank the configurations by validation performance.
3. Keep the top 1/η fraction (the best half when η = 2) and eliminate the rest.
4. Multiply the per-configuration budget by η for the next round.
Formal Algorithm:
Given:

- (n): number of initial configurations
- (r_0): minimum budget per configuration (e.g., 1 epoch)
- (R): maximum budget per configuration (e.g., 81 epochs)
- (\eta): reduction factor (default 3: keep the top 1/3 each round)
The algorithm proceeds through (s_{\max} + 1) rounds, where (s_{\max} = \lfloor \log_\eta(n) \rfloor).
At round (s \in \{0, 1, \ldots, s_{\max}\}):

- Evaluate the (n_s = \lfloor n / \eta^s \rfloor) surviving configurations with budget (r_s = r_0 \cdot \eta^s) each.
- Keep the top (\lfloor n_s / \eta \rfloor) performers for round (s + 1).

For example, with (n = 81), (\eta = 3), and (r_0 = 1):
| Round (s) | Configurations (nₛ) | Budget per Config (rₛ) | Round Budget (nₛ × rₛ) |
|---|---|---|---|
| 0 | 81 | 1 | 81 |
| 1 | 27 | 3 | 81 |
| 2 | 9 | 9 | 81 |
| 3 | 3 | 27 | 81 |
| 4 | 1 | 81 | 81 |
Notice the elegant property: each round consumes approximately the same total budget ((n_s \times r_s \approx B / (s_{\max} + 1)), where (B) is the total budget). This equal budget allocation across rounds is not coincidental—it emerges from the geometric relationship between configuration count and per-configuration budget.
```python
import numpy as np
from typing import List, Dict, Callable, Any, Tuple
from dataclasses import dataclass
import logging

logger = logging.getLogger(__name__)


@dataclass
class SHConfiguration:
    """A hyperparameter configuration being evaluated."""
    config_id: int
    hyperparameters: Dict[str, Any]
    performances: Dict[int, float]  # round -> performance

    def get_performance(self, round: int) -> float:
        return self.performances.get(round, float('inf'))


class SuccessiveHalving:
    """
    Successive Halving algorithm for hyperparameter optimization.

    Reference: Jamieson, K. and Talwalkar, A. "Non-stochastic Best Arm
    Identification and Hyperparameter Optimization." AISTATS 2016.
    """

    def __init__(
        self,
        n_configs: int,
        min_budget: float,
        max_budget: float,
        eta: int = 3,
        config_sampler: Callable[[], Dict[str, Any]] = None,
        evaluator: Callable[[Dict[str, Any], float], float] = None,
    ):
        """
        Args:
            n_configs: Initial number of configurations to sample
            min_budget: Minimum budget per configuration (e.g., 1 epoch)
            max_budget: Maximum budget (e.g., 81 epochs)
            eta: Reduction factor (default 3: keep top 1/3 each round)
            config_sampler: Function to sample random hyperparameters
            evaluator: Function(config, budget) -> validation_loss
        """
        self.n_configs = n_configs
        self.min_budget = min_budget
        self.max_budget = max_budget
        self.eta = eta
        self.config_sampler = config_sampler
        self.evaluator = evaluator

        # Calculate number of rounds
        self.s_max = int(np.floor(np.log(n_configs) / np.log(eta)))

        # Calculate budgets and config counts for each round
        self.round_budgets = []
        self.round_configs = []
        for s in range(self.s_max + 1):
            n_s = int(np.floor(n_configs * (eta ** (-s))))
            r_s = min_budget * (eta ** s)
            self.round_configs.append(n_s)
            self.round_budgets.append(min(r_s, max_budget))

        logger.info("Successive Halving initialized:")
        logger.info(f"  Initial configs: {n_configs}")
        logger.info(f"  Rounds: {self.s_max + 1}")
        logger.info(f"  Reduction factor: {eta}")
        logger.info(f"  Budget schedule: {self.round_budgets}")
        logger.info(f"  Config schedule: {self.round_configs}")

        # Storage
        self.configurations: List[SHConfiguration] = []
        self.current_round = 0
        self.surviving_configs: List[int] = []  # indices

    def initialize(self):
        """Sample initial configurations."""
        logger.info(f"Sampling {self.n_configs} initial configurations...")
        for i in range(self.n_configs):
            if self.config_sampler:
                hp = self.config_sampler()
            else:
                hp = self._default_sampler()
            config = SHConfiguration(
                config_id=i,
                hyperparameters=hp,
                performances={}
            )
            self.configurations.append(config)
        self.surviving_configs = list(range(self.n_configs))
        logger.info(f"Initialized {len(self.configurations)} configurations")

    def _default_sampler(self) -> Dict[str, Any]:
        """Default hyperparameter sampler."""
        return {
            "learning_rate": 10 ** np.random.uniform(-5, -1),
            "weight_decay": 10 ** np.random.uniform(-6, -2),
            "batch_size": 2 ** np.random.randint(4, 9),
            "num_layers": np.random.randint(1, 8),
        }

    def run_round(self) -> Tuple[int, List[int]]:
        """
        Run one round of Successive Halving.

        Returns:
            (round_number, surviving_config_indices)
        """
        if self.current_round > self.s_max:
            logger.warning("All rounds completed")
            return self.current_round, self.surviving_configs

        s = self.current_round
        budget = self.round_budgets[s]
        n_keep = self.round_configs[min(s + 1, self.s_max)]

        logger.info(f"\n=== Round {s} ===")
        logger.info(f"Evaluating {len(self.surviving_configs)} configs at budget {budget}")

        # Evaluate all surviving configurations
        performances = []
        for idx in self.surviving_configs:
            config = self.configurations[idx]
            if self.evaluator:
                perf = self.evaluator(config.hyperparameters, budget)
            else:
                # Simulate: better configs tend to perform better
                perf = self._simulate_performance(config, budget)
            config.performances[s] = perf
            performances.append((idx, perf))
            logger.debug(f"  Config {idx}: {perf:.4f}")

        # Sort by performance (lower is better)
        performances.sort(key=lambda x: x[1])

        # Keep top n_keep configurations
        self.surviving_configs = [idx for idx, _ in performances[:n_keep]]
        eliminated = [idx for idx, _ in performances[n_keep:]]

        logger.info(f"Keeping top {n_keep}, eliminating {len(eliminated)}")
        logger.info(f"Survivors: {self.surviving_configs}")

        self.current_round += 1
        return s, self.surviving_configs

    def _simulate_performance(self, config: SHConfiguration, budget: float) -> float:
        """Simulate performance for testing (realistic learning curve)."""
        # True quality is determined by hyperparameters
        lr = config.hyperparameters.get("learning_rate", 0.001)
        wd = config.hyperparameters.get("weight_decay", 0.0001)

        # Optimal learning rate around 0.001
        lr_quality = -((np.log10(lr) + 3) ** 2) / 4
        wd_quality = -((np.log10(wd) + 4) ** 2) / 4
        true_quality = 0.5 + 0.2 * (lr_quality + wd_quality)

        # Performance improves with budget (learning curve)
        progress = 1 - np.exp(-budget / (self.max_budget / 3))
        noise = np.random.normal(0, 0.02 / np.sqrt(budget))

        return true_quality * (1 - 0.8 * progress) + noise

    def run_all(self) -> SHConfiguration:
        """Run all rounds and return best configuration."""
        self.initialize()

        while self.current_round <= self.s_max:
            self.run_round()

        # Return best surviving configuration
        best_idx = self.surviving_configs[0]
        best_config = self.configurations[best_idx]
        final_perf = best_config.performances[self.s_max]

        logger.info("\n=== Optimization Complete ===")
        logger.info(f"Best config: {best_idx}")
        logger.info(f"Final performance: {final_perf:.4f}")
        logger.info(f"Hyperparameters: {best_config.hyperparameters}")

        return best_config

    def get_statistics(self) -> Dict:
        """Get statistics about the optimization run."""
        total_budget_used = sum(
            self.round_configs[s] * self.round_budgets[s]
            for s in range(self.current_round)
        )
        full_eval_budget = self.n_configs * self.max_budget

        return {
            "rounds_completed": self.current_round,
            "total_budget_used": total_budget_used,
            "full_evaluation_budget": full_eval_budget,
            "budget_savings": 1 - (total_budget_used / full_eval_budget),
            "configurations_evaluated": self.n_configs,
            "surviving_configurations": len(self.surviving_configs),
        }


# Example usage
if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)

    sh = SuccessiveHalving(
        n_configs=81,
        min_budget=1,
        max_budget=81,
        eta=3
    )

    best = sh.run_all()
    stats = sh.get_statistics()
    print("\nStatistics:")
    print(f"  Budget used: {stats['total_budget_used']}")
    print(f"  Savings: {stats['budget_savings']:.1%}")
```

Successive Halving's elegance extends beyond its intuitive appeal—it comes with rigorous theoretical guarantees.
Understanding these guarantees clarifies when the algorithm works well and guides practical design choices.
Sample Complexity:
Consider the problem of identifying the best among (n) configurations, where each configuration (i) has true quality (\mu_i), and evaluating at budget (b) yields an observation with variance (\sigma^2 / b).
Let (\Delta_i = \mu_i - \mu^*) be the suboptimality gap of configuration (i), where (\mu^*) is the quality of the best configuration. Define the complexity parameter:
$$H = \sum_{i=2}^{n} \frac{1}{\Delta_i^2}$$
Theorem (Karnin et al., 2013): Successive Halving identifies the best configuration with probability at least (1 - \delta) using total budget:
$$B = O\left( H \cdot \log(n) \cdot \log(1/\delta) \right)$$
This is nearly optimal—no algorithm can do better than (\Omega(H \cdot \log(1/\delta))) in the worst case.
The complexity H captures how hard the identification problem is. When gaps Δᵢ are large (easy to distinguish configurations), H is small. When many configurations are nearly tied with the best, H is large.
Importantly, H depends on the problem instance—the actual gaps between configurations—not just the number of configurations. This means Successive Halving adapts (implicitly) to problem difficulty.
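To make (H) concrete, here is a small self-contained sketch (the quality values are invented for illustration) contrasting an easy instance, where gaps are large, with a hard one, where several configurations nearly tie the best:

```python
import numpy as np

def complexity_H(mu: np.ndarray) -> float:
    """Complexity parameter H = sum of 1/gap^2 over suboptimal configurations."""
    mu_star = mu.min()                # best (lowest-loss) configuration, assumed unique
    gaps = np.sort(mu - mu_star)[1:]  # drop the zero gap of the best configuration
    return float(np.sum(1.0 / gaps ** 2))

# Easy instance: large gaps, configurations are cheap to distinguish
easy = np.array([0.10, 0.40, 0.55, 0.70, 0.85])
# Hard instance: several configurations almost tie the best
hard = np.array([0.10, 0.11, 0.12, 0.13, 0.85])

print(f"H (easy) = {complexity_H(easy):10.1f}")  # small H: little budget needed
print(f"H (hard) = {complexity_H(hard):10.1f}")  # large H: far more budget needed
```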
Equal Budget Allocation:
A key insight is that Successive Halving allocates roughly equal total budget to each round:
$$\text{Round budget} = n_s \times r_s = \frac{n}{\eta^s} \times r_0 \cdot \eta^s = n \cdot r_0 = \frac{B}{s_{\max} + 1}$$
This equal allocation is information-theoretically motivated: we don't know a priori which round will be "hardest" (require the most budget to distinguish survivors), so we hedge by allocating equally.
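A quick numeric check of this identity, using the same (n = 81), (\eta = 3), (r_0 = 1) schedule as the table above:

```python
import numpy as np

n, eta, r0 = 81, 3, 1
s_max = round(np.log(n) / np.log(eta))  # 4, so five rounds: 0 through 4

for s in range(s_max + 1):
    n_s = n // eta ** s   # configurations entering round s
    r_s = r0 * eta ** s   # per-configuration budget at round s
    print(f"round {s}: n_s = {n_s:2d}, r_s = {r_s:2d}, round budget = {n_s * r_s}")
# Every round prints a budget of 81 = n * r0, so the total is B = 81 * 5 = 405.
```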
Geometric Elimination:
The geometric reduction schedule (halving at each round, or more generally keeping a 1/η fraction) is also theoretically justified. With (n) configurations and budget (B):

- There are only (\log_\eta(n) + 1) rounds, so splitting (B) equally across them costs at most a logarithmic factor relative to an oracle that knew which round mattered most.
- Survivors of round (s) receive budget (r_s = r_0 \cdot \eta^s), exponentially more as the field narrows, which matches the fact that the remaining configurations are progressively harder to tell apart.
- This is the source of the (\log(n)) factor in the theorem above: it is the price of not knowing the gaps (\Delta_i) in advance.
Regret Analysis:
In the context of hyperparameter optimization, we care about the simple regret—the suboptimality of the returned configuration:
$$\text{Simple Regret} = \mu_{\hat{i}} - \mu^*$$
where (\hat{i}) is the configuration returned by the algorithm.
Theorem: For Successive Halving with total budget (B):
$$\mathbb{E}[\text{Simple Regret}] = O\left( \sqrt{\frac{n \log n}{B}} \right)$$
This shows the regret decreases as (1/\sqrt{B}), which is optimal for this class of problems.
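A small Monte Carlo sketch of this rate (all assumptions invented for illustration: true losses uniform on ([0, 1]), Gaussian observation noise with variance (\sigma^2 / b) at budget (b)). Quadrupling (B) should roughly halve the mean regret:

```python
import numpy as np

def sh_mean_regret(n: int, budget: float, eta: int = 3, sigma: float = 1.0,
                   trials: int = 1000) -> float:
    """Mean simple regret of Successive Halving on random synthetic instances."""
    s_max = round(np.log(n) / np.log(eta))
    r0 = budget / (n * (s_max + 1))  # equal budget per round
    regrets = []
    for _ in range(trials):
        mu = np.random.uniform(0, 1, n)  # true losses (lower is better)
        survivors = np.arange(n)
        for s in range(s_max + 1):
            r_s = r0 * eta ** s
            obs = mu[survivors] + np.random.normal(0, sigma / np.sqrt(r_s),
                                                   len(survivors))
            keep = max(1, len(survivors) // eta)
            survivors = survivors[np.argsort(obs)[:keep]]
        regrets.append(mu[survivors[0]] - mu.min())
    return float(np.mean(regrets))

for B in [405, 4 * 405, 16 * 405]:
    print(f"B = {B:5d}: mean regret = {sh_mean_regret(81, B):.4f}")
```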
Comparison to Uniform Allocation:
If we instead evaluated all (n) configurations with equal budget (B/n) each, the regret would be:
$$\mathbb{E}[\text{Simple Regret}]_{\text{uniform}} = O\left( \sqrt{\frac{n}{B}} \right)$$
On paper, uniform allocation looks better: Successive Halving pays an extra (\sqrt{\log n}) factor in the worst case. In practice, the worst case is rare. Gaps between configurations are usually uneven, early elimination concentrates budget on genuine contenders, and typical experiments see 10-100× speedups over uniform allocation at comparable final quality.
One of the most important—and often overlooked—aspects of Successive Halving is the fundamental tradeoff between breadth and depth: given a fixed total budget (B), should we evaluate many configurations shallowly (large (n), small (B/n)) or few configurations deeply (small (n), large (B/n))?
This tradeoff critically affects Successive Halving's performance and is not automatically resolved by the algorithm. The user must choose (n).
Formal Analysis of the Tradeoff:
Consider a search space where the best configuration has performance (\mu^* = 0) and other configurations have performance uniformly distributed on ([\varepsilon, 1]).
Case 1: Large n, small initial budget

- More samples mean a better chance that at least one configuration is near-optimal.
- But first-round observations have variance (\sigma^2 / r_0) with a tiny (r_0), so noise can eliminate the best configuration before it distinguishes itself.

Case 2: Small n, large initial budget

- Each evaluation is reliable, so elimination decisions are rarely wrong.
- But with few samples, even the best sampled configuration may be far from optimal; no amount of extra budget fixes a bad draw.

The sweet spot depends on:

- Quality density: the fraction of the search space that is near-optimal.
- Noise level: how unreliable low-budget evaluations are.
- Total budget: how much compute there is to spread across rounds.

The simulation below explores exactly these parameters.
```python
import numpy as np
import matplotlib.pyplot as plt
from typing import Tuple


def simulate_sh_performance(
    n: int,                  # Number of initial configurations
    budget: float,           # Total budget
    eta: int,                # Reduction factor
    quality_density: float,  # Fraction of configs within 10% of optimal
    noise_std: float,        # Observation noise
    n_trials: int = 100
) -> Tuple[float, float]:
    """
    Simulate Successive Halving performance for given n.

    Returns:
        (mean_regret, std_regret)
    """
    regrets = []

    for _ in range(n_trials):
        # Sample true qualities
        # quality_density fraction are "good" (quality < 0.1)
        n_good = int(n * quality_density)
        n_bad = n - n_good
        good_qualities = np.random.uniform(0, 0.1, n_good)
        bad_qualities = np.random.uniform(0.1, 1.0, n_bad)
        qualities = np.concatenate([good_qualities, bad_qualities])
        np.random.shuffle(qualities)
        true_optimal = qualities.min()

        # Run Successive Halving
        s_max = int(np.log(n) / np.log(eta))
        r_0 = budget / (n * (s_max + 1))

        survivors = np.arange(n)
        for s in range(s_max + 1):
            r_s = r_0 * (eta ** s)
            n_s = len(survivors)
            n_keep = max(1, n_s // eta)

            # Observe performances with noise
            observations = qualities[survivors] + np.random.normal(
                0, noise_std / np.sqrt(r_s), n_s
            )

            # Keep best n_keep
            keep_indices = np.argsort(observations)[:n_keep]
            survivors = survivors[keep_indices]

        # Final selection
        final_quality = qualities[survivors[0]]
        regret = final_quality - true_optimal
        regrets.append(regret)

    return np.mean(regrets), np.std(regrets)


def find_optimal_n(
    budget: float,
    eta: int,
    quality_density: float,
    noise_std: float,
    n_range: np.ndarray = None
) -> dict:
    """Find optimal n for given problem parameters."""
    if n_range is None:
        n_range = np.array([9, 27, 81, 243, 729])

    results = []
    for n in n_range:
        if n > budget:
            # Can't have more configs than budget
            continue

        mean_regret, std_regret = simulate_sh_performance(
            n, budget, eta, quality_density, noise_std
        )
        results.append({
            'n': n,
            'mean_regret': mean_regret,
            'std_regret': std_regret,
            'budget_per_config_initial': budget / (n * (int(np.log(n) / np.log(eta)) + 1))
        })
        print(f"n={n}: regret={mean_regret:.4f} ± {std_regret:.4f}")

    best = min(results, key=lambda x: x['mean_regret'])
    return {
        'optimal_n': best['n'],
        'optimal_regret': best['mean_regret'],
        'all_results': results
    }


def visualize_tradeoff(
    budget: float = 729,
    eta: int = 3,
    noise_std: float = 0.1
):
    """Visualize the n-vs-B/n tradeoff for different quality densities."""
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    densities = [0.01, 0.05, 0.1, 0.2]
    n_values = np.array([9, 27, 81, 243, 729])

    for density in densities:
        regrets = []
        for n in n_values:
            if n <= budget:
                mean_r, _ = simulate_sh_performance(
                    n, budget, eta, density, noise_std, n_trials=50
                )
                regrets.append(mean_r)
        axes[0].plot(n_values[:len(regrets)], regrets, 'o-',
                     label=f'{density*100:.0f}% good configs')

    axes[0].set_xlabel('Number of Initial Configurations (n)')
    axes[0].set_ylabel('Mean Regret (lower is better)')
    axes[0].set_xscale('log')
    axes[0].set_title('Effect of n on Regret')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)

    # Heatmap of optimal n
    noise_levels = [0.05, 0.1, 0.2, 0.3]
    optimal_ns = []
    for noise in noise_levels:
        row = []
        for density in densities:
            result = find_optimal_n(budget, eta, density, noise, n_values)
            row.append(result['optimal_n'])
        optimal_ns.append(row)

    im = axes[1].imshow(optimal_ns, aspect='auto')
    axes[1].set_xticks(range(len(densities)))
    axes[1].set_xticklabels([f'{d*100:.0f}%' for d in densities])
    axes[1].set_yticks(range(len(noise_levels)))
    axes[1].set_yticklabels([f'{n:.2f}' for n in noise_levels])
    axes[1].set_xlabel('Quality Density')
    axes[1].set_ylabel('Noise Level')
    axes[1].set_title('Optimal n (darker = larger)')
    plt.colorbar(im, ax=axes[1], label='Optimal n')

    plt.tight_layout()
    return fig


# Demonstration
if __name__ == "__main__":
    print("Analyzing n-vs-B/n tradeoff...")

    print("\nScenario: 1% of configs are good, moderate noise")
    find_optimal_n(budget=729, eta=3, quality_density=0.01, noise_std=0.1)

    print("\nScenario: 10% of configs are good, moderate noise")
    find_optimal_n(budget=729, eta=3, quality_density=0.10, noise_std=0.1)
```

When the search space is well-understood and prior knowledge suggests many configurations will perform well, use smaller n. When exploring a new problem with unknown structure, use larger n to hedge against uncertainty.
A common heuristic: set n such that the minimum budget r₀ = B/(n·(s_max+1)) allows at least 1-2 epochs of training—enough to distinguish clearly bad configurations from potentially good ones.
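A minimal sketch of that heuristic (the function name and the `r_min` threshold are illustrative): grow (n) by factors of (\eta) while the implied (r_0) stays above the minimum useful budget:

```python
import numpy as np

def choose_n(total_budget: float, eta: int, r_min: float) -> int:
    """Largest n (a power of eta) whose implied round-0 budget r0 >= r_min."""
    best_n, n = 1, eta
    while True:
        s_max = round(np.log(n) / np.log(eta))  # exact: n is a power of eta
        r0 = total_budget / (n * (s_max + 1))   # budget per config in round 0
        if r0 < r_min:
            return best_n
        best_n, n = n, n * eta

# Example: 1000 epochs of total budget, eta = 3, at least 2 epochs per config
print(choose_n(total_budget=1000, eta=3, r_min=2.0))  # -> 81 (r0 ~ 2.47 epochs)
```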
The original Successive Halving algorithm is synchronous: all configurations at a given round must complete before the elimination decision is made and the next round begins. This creates challenges in distributed settings:

- Workers sit idle at each synchronization barrier while the slowest evaluations (stragglers) finish.
- A single slow or failed evaluation delays the entire round.
- In late rounds there are fewer surviving configurations than workers, leaving much of the cluster unused.
Asynchronous Successive Halving (ASHA):
ASHA (Li et al., 2020) modifies Successive Halving for asynchronous parallel execution. The key insight: don't wait for all configurations at a rung to complete before promoting.
Instead, whenever a worker finishes evaluating a configuration at rung (s):

- If the configuration ranks in the top (1/\eta) of all results completed so far at rung (s), it is promoted and evaluated at rung (s + 1).
- Otherwise, the worker simply starts a fresh configuration at the lowest rung.

Promotion decisions are therefore made on partial information; ASHA tolerates occasional wrong promotions because mispromoted configurations are out-competed at higher rungs.
| Aspect | Synchronous SH | ASHA |
|---|---|---|
| Promotion decision | After all complete round s | After each individual completion |
| Worker utilization | Idle during sync barriers | Always busy |
| Speedup with k workers | Sub-linear (stragglers, sequential phases) | Near-linear (no sync) |
| Straggler impact | Entire round delayed | Only affects that config |
| Promotion threshold | Top 1/η of all at rung s | Top 1/η of completed at rung s |
| Theoretical guarantees | Original guarantees apply | Similar but slightly weaker |
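The promotion rule at the heart of ASHA is tiny. A minimal standalone sketch (mirroring the logic embedded in the full simulation below):

```python
from typing import Dict, List, Tuple

def should_promote(rung_results: Dict[int, List[Tuple[int, float]]],
                   config_id: int, rung: int, eta: int = 3) -> bool:
    """Is this config in the top 1/eta of results completed so far at this rung?
    Results are (config_id, performance) pairs; lower performance is better."""
    completed = sorted(rung_results[rung], key=lambda pair: pair[1])
    n_promotable = max(1, len(completed) // eta)
    top_ids = {cid for cid, _ in completed[:n_promotable]}
    return config_id in top_ids

# Example: three configs have completed rung 0; config 2 currently leads.
results = {0: [(0, 0.45), (1, 0.62), (2, 0.31)]}
print(should_promote(results, config_id=2, rung=0))  # True: top 1/3 of 3 completed
print(should_promote(results, config_id=0, rung=0))  # False: not in the top 1/3
```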
"""Comparison of synchronous and asynchronous Successive Halvingin a simulated parallel environment."""import numpy as npfrom concurrent.futures import ThreadPoolExecutor, as_completedimport timefrom typing import List, Dictimport threading class SimulatedWorker: """Simulates evaluation with variable runtime.""" def evaluate(self, config_id: int, budget: float, true_quality: float) -> tuple: """ Simulate evaluation with realistic timing. Returns: (config_id, budget, observed_performance, wall_time) """ # Simulate variable runtime (30s to 60s per unit budget) base_time = 0.01 # seconds per budget unit (fast for demo) variability = np.random.uniform(0.5, 1.5) wall_time = base_time * budget * variability time.sleep(wall_time) # Performance depends on quality and improves with budget progress = 1 - np.exp(-budget / 30) noise = np.random.normal(0, 0.05 / np.sqrt(budget)) performance = true_quality * (1 - 0.8 * progress) + noise return config_id, budget, performance, wall_time class SynchronousSH: """Synchronous Successive Halving with parallel evaluation.""" def __init__(self, n_workers: int, n_configs: int, eta: int = 3): self.n_workers = n_workers self.n_configs = n_configs self.eta = eta self.s_max = int(np.log(n_configs) / np.log(eta)) def run(self, qualities: np.ndarray) -> Dict: """Run with given true qualities.""" total_time = 0 workers = SimulatedWorker() survivors = list(range(self.n_configs)) for s in range(self.s_max + 1): budget = self.eta ** s n_keep = max(1, len(survivors) // self.eta) # Parallel evaluation with synchronization round_start = time.time() results = [] with ThreadPoolExecutor(max_workers=self.n_workers) as executor: futures = { executor.submit( workers.evaluate, idx, budget, qualities[idx] ): idx for idx in survivors } # Wait for ALL to complete (synchronization point) for future in as_completed(futures): results.append(future.result()) round_time = time.time() - round_start total_time += round_time # Sort and eliminate results.sort(key=lambda x: x[2]) # Sort by performance survivors = [r[0] for r in results[:n_keep]] print(f"Round {s}: {len(results)} configs, " f"{round_time:.2f}s, survivors: {survivors[:5]}...") return { 'winner': survivors[0], 'winner_quality': qualities[survivors[0]], 'total_time': total_time } class AsynchronousSH: """Asynchronous Successive Halving (ASHA-like).""" def __init__(self, n_workers: int, n_configs: int, eta: int = 3): self.n_workers = n_workers self.n_configs = n_configs self.eta = eta self.s_max = int(np.log(n_configs) / np.log(eta)) # Shared state (thread-safe) self.lock = threading.Lock() self.rung_results: Dict[int, List] = {s: [] for s in range(self.s_max + 1)} self.promoted: Dict[int, int] = {} # config_id -> current_rung self.pending_promotions: List = [] # (config_id, rung) to evaluate self.next_config_id = 0 def run(self, qualities: np.ndarray) -> Dict: """Run asynchronously.""" start_time = time.time() workers = SimulatedWorker() def worker_loop(): while True: with self.lock: # Check for promoted configs first if self.pending_promotions: config_id, rung = self.pending_promotions.pop(0) budget = self.eta ** rung elif self.next_config_id < self.n_configs: config_id = self.next_config_id self.next_config_id += 1 rung = 0 budget = 1 self.promoted[config_id] = 0 else: # No work left return # Evaluate (outside lock) _, _, performance, _ = workers.evaluate( config_id, budget, qualities[config_id] ) with self.lock: # Record result self.rung_results[rung].append((config_id, performance)) # Check for promotion if rung < self.s_max: 
n_at_rung = len(self.rung_results[rung]) n_to_promote = max(1, n_at_rung // self.eta) # Sort completed at this rung sorted_results = sorted( self.rung_results[rung], key=lambda x: x[1] ) # Is this config in top 1/eta? top_ids = {r[0] for r in sorted_results[:n_to_promote]} if config_id in top_ids: # Promote self.pending_promotions.append((config_id, rung + 1)) self.promoted[config_id] = rung + 1 # Run workers with ThreadPoolExecutor(max_workers=self.n_workers) as executor: futures = [executor.submit(worker_loop) for _ in range(self.n_workers)] for f in futures: f.result() total_time = time.time() - start_time # Find best at final rung final_results = self.rung_results[self.s_max] if final_results: best = min(final_results, key=lambda x: x[1]) return { 'winner': best[0], 'winner_quality': qualities[best[0]], 'total_time': total_time } return {'winner': None, 'total_time': total_time} # Comparisonif __name__ == "__main__": np.random.seed(42) n_configs = 27 qualities = np.random.uniform(0, 1, n_configs) qualities[7] = 0.05 # Plant a good config print("=== Synchronous SH ===") sync_sh = SynchronousSH(n_workers=4, n_configs=n_configs, eta=3) sync_result = sync_sh.run(qualities) print(f"Result: config {sync_result['winner']}, " f"quality {sync_result['winner_quality']:.3f}, " f"time {sync_result['total_time']:.2f}s") print("\n=== Asynchronous SH (ASHA) ===") async_sh = AsynchronousSH(n_workers=4, n_configs=n_configs, eta=3) async_result = async_sh.run(qualities) print(f"Result: config {async_result['winner']}, " f"quality {async_result['winner_quality']:.3f}, " f"time {async_result['total_time']:.2f}s") print(f"\nSpeedup: {sync_result['total_time'] / async_result['total_time']:.2f}x")Successive Halving is highly effective in certain scenarios, achieving dramatic speedups with minimal impact on solution quality. Understanding these conditions helps practitioners choose appropriate optimization strategies.
Empirical Success Stories:
Neural Network Architecture Search: When comparing architectures with fixed hyperparameters, early training epochs strongly predict final performance. SH achieves 10-100× speedup in NAS benchmarks.
Learning Rate Tuning: Learning rate primarily affects early training dynamics. Poorly chosen learning rates are immediately obvious (divergence or extremely slow learning).
Regularization Strength: Too much or too little regularization shows immediate effects on validation loss.
Model Capacity Selection: Underfitting is visible early; overfitting takes longer but configurations that underfit early rarely become optimal.
If you can reliably identify the bottom 50% of configurations after 10% of training, Successive Halving will likely provide substantial speedup. Run a small pilot study: train 10-20 random configurations to completion and check the rank correlation between early and final performance.
Despite its elegance, Successive Halving is not universally applicable. Recognizing its limitations prevents suboptimal results and wasted computational resources.
The Fixed n Problem:
Perhaps Successive Halving's most significant limitation is the requirement to specify (n) upfront. This creates a no-win situation:

- Choose (n) too large, and the initial budget (r_0 = B / (n (s_{\max} + 1))) becomes so small that noisy early evaluations may eliminate the best configuration.
- Choose (n) too small, and evaluations are reliable but the search may never sample a good configuration in the first place.
The optimal (n) depends on problem characteristics that are unknown before running the optimization!
Hyperband, covered in the next page, directly addresses this limitation by running multiple Successive Halving instances with different (n) values in parallel, automatically adapting to problem difficulty.
Never blindly apply Successive Halving without validating the rank correlation assumption. Run a pilot study with diverse configurations, evaluate to completion, and compute Spearman correlation between performance at your intended minimum budget and final performance. If ρ < 0.7, consider higher minimum budgets or alternative methods.
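A minimal sketch of that pilot check (the loss arrays below are placeholders; substitute your own measurements at the minimum budget and at full budget):

```python
import numpy as np
from scipy.stats import spearmanr

# Placeholder data: validation losses for 10 pilot configurations, measured
# at the intended minimum budget and again after training to completion.
early_loss = np.array([0.90, 0.55, 0.62, 0.88, 0.40, 0.71, 0.45, 0.95, 0.58, 0.50])
final_loss = np.array([0.70, 0.30, 0.35, 0.75, 0.20, 0.50, 0.28, 0.80, 0.40, 0.25])

rho, p_value = spearmanr(early_loss, final_loss)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")

if rho < 0.7:
    print("Weak rank correlation: raise the minimum budget or reconsider SH.")
else:
    print("Early performance predicts final performance: SH should pay off.")
```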
Deploying Successive Halving in production requires attention to several practical details:
Integration with Training Frameworks:
Most modern HPO frameworks provide built-in Successive Halving support:
```python
# Ray Tune integration with ASHA (Async Successive Halving)
from ray import tune
from ray.tune.schedulers import ASHAScheduler


def train_model(config):
    """Training function that reports intermediate results."""
    model = create_model(config)

    for epoch in range(config["max_epochs"]):
        # Train one epoch
        train_loss = train_epoch(model, config)
        val_loss = validate(model)

        # Report to Ray Tune (ASHA uses this for early stopping)
        tune.report(
            epoch=epoch,
            val_loss=val_loss,
            train_loss=train_loss
        )


# Configure ASHA scheduler
asha_scheduler = ASHAScheduler(
    metric="val_loss",
    mode="min",
    max_t=100,           # Max epochs (budget)
    grace_period=1,      # Min epochs before stopping
    reduction_factor=3,  # eta (keep top 1/3)
)

# Define search space
config = {
    "learning_rate": tune.loguniform(1e-5, 1e-1),
    "weight_decay": tune.loguniform(1e-6, 1e-2),
    "batch_size": tune.choice([16, 32, 64, 128]),
    "num_layers": tune.randint(1, 8),
    "max_epochs": 100,
}

# Run optimization
analysis = tune.run(
    train_model,
    config=config,
    num_samples=81,  # n configurations
    scheduler=asha_scheduler,
    resources_per_trial={"cpu": 2, "gpu": 1},
)

# Get best result
best = analysis.get_best_config(metric="val_loss", mode="min")
print(f"Best config: {best}")
```

Successive Halving provides a principled, theoretically-grounded approach to early stopping in hyperparameter optimization.
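Optuna exposes the same elimination schedule through its pruner interface. A minimal self-contained sketch (the toy objective simulates a learning curve and is purely illustrative):

```python
import numpy as np
import optuna

def objective(trial: optuna.Trial) -> float:
    """Toy objective: loss decays toward a floor determined by the learning rate."""
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True)
    floor = (np.log10(lr) + 3) ** 2 / 10  # best near lr = 1e-3

    loss = 1.0
    for epoch in range(30):
        loss = floor + (loss - floor) * 0.8 + np.random.normal(0, 0.01)
        trial.report(loss, epoch)    # intermediate result used by the pruner
        if trial.should_prune():     # Successive Halving elimination decision
            raise optuna.TrialPruned()
    return loss

study = optuna.create_study(
    direction="minimize",
    pruner=optuna.pruners.SuccessiveHalvingPruner(min_resource=1, reduction_factor=3),
)
study.optimize(objective, n_trials=81)
print(study.best_params)
```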
Looking Ahead:
The fundamental limitation of Successive Halving—the requirement to choose (n) upfront—motivates Hyperband, covered in the next page. Hyperband runs multiple Successive Halving brackets with different (n) values, automatically balancing the exploration-exploitation tradeoff without requiring prior knowledge of problem characteristics.
You now understand the Successive Halving algorithm in depth: its derivation, theoretical guarantees, the n-vs-B/n tradeoff, synchronous and asynchronous execution, ideal conditions, and practical implementation. Next, we'll see how Hyperband builds on these foundations to create a more robust algorithm.