The previous page established that early stopping can dramatically reduce computational costs in hyperparameter optimization. But the intuition of "stop bad configurations early" leaves critical questions unanswered:

- How many configurations should we start with?
- When, and on what evidence, should a configuration be stopped?
- How much additional budget should the survivors receive?
Successive Halving (SH) provides principled answers to all these questions. First introduced in the context of pure exploration in multi-armed bandits by Karnin et al. (2013) and adapted for hyperparameter optimization by Jamieson and Talwalkar (2016), Successive Halving transforms the vague notion of early stopping into a rigorous algorithm with theoretical guarantees.
The core insight is elegant: rather than evaluating all configurations to completion, iteratively eliminate the worst-performing half while doubling the budget allocated to survivors. This geometric elimination schedule provides near-optimal identification of the best configuration under reasonable assumptions.
By the end of this page, you will understand:

- The Successive Halving algorithm in complete detail
- Theoretical foundations and performance guarantees
- The n-vs-B/n tradeoff between exploration and exploitation
- Implementation with proper resource management
- When Successive Halving excels and when it struggles
Successive Halving operates in rounds (also called rungs or brackets). In each round:

1. Evaluate every surviving configuration with the current per-configuration budget.
2. Rank the configurations by validation performance.
3. Keep the top 1/η fraction (the best half when η = 2) and eliminate the rest.
4. Multiply the per-configuration budget by η for the next round.
Formal Algorithm:
Given:

- (n): number of initial configurations
- (r_0): minimum budget per configuration (e.g., 1 epoch)
- (R): maximum budget per configuration (e.g., 81 epochs)
- (\eta): reduction factor (default 3: keep the top 1/3 each round)
The algorithm proceeds through (s_{\max} + 1) rounds, where (s_{\max} = \lfloor \log_\eta(n) \rfloor).
At round (s \in \{0, 1, \ldots, s_{\max}\}):

- Evaluate the (n_s = \lfloor n / \eta^s \rfloor) surviving configurations with budget (r_s = r_0 \cdot \eta^s) each.
- Keep the top (\lfloor n_s / \eta \rfloor) performers for round (s + 1).

For example, with (n = 81), (\eta = 3), and (r_0 = 1):
| Round (s) | Configurations (nₛ) | Budget per Config (rₛ) | Round Budget (nₛ × rₛ) |
|---|---|---|---|
| 0 | 81 | 1 | 81 |
| 1 | 27 | 3 | 81 |
| 2 | 9 | 9 | 81 |
| 3 | 3 | 27 | 81 |
| 4 | 1 | 81 | 81 |
Notice the elegant property: each round consumes approximately the same total budget ((n_s \times r_s \approx B / (s_{\max} + 1)), where (B) is the total budget). This equal budget allocation across rounds is not coincidental—it emerges from the geometric relationship between configuration count and per-configuration budget.
```python
import numpy as np
from typing import List, Dict, Callable, Any, Tuple
from dataclasses import dataclass
import logging

logger = logging.getLogger(__name__)


@dataclass
class SHConfiguration:
    """A hyperparameter configuration being evaluated."""
    config_id: int
    hyperparameters: Dict[str, Any]
    performances: Dict[int, float]  # round -> performance

    def get_performance(self, round: int) -> float:
        return self.performances.get(round, float('inf'))


class SuccessiveHalving:
    """
    Successive Halving algorithm for hyperparameter optimization.

    Reference: Jamieson, K. and Talwalkar, A. "Non-stochastic Best Arm
    Identification and Hyperparameter Optimization." AISTATS 2016.
    """

    def __init__(
        self,
        n_configs: int,
        min_budget: float,
        max_budget: float,
        eta: int = 3,
        config_sampler: Callable[[], Dict[str, Any]] = None,
        evaluator: Callable[[Dict[str, Any], float], float] = None,
    ):
        """
        Args:
            n_configs: Initial number of configurations to sample
            min_budget: Minimum budget per configuration (e.g., 1 epoch)
            max_budget: Maximum budget (e.g., 81 epochs)
            eta: Reduction factor (default 3: keep top 1/3 each round)
            config_sampler: Function to sample random hyperparameters
            evaluator: Function(config, budget) -> validation_loss
        """
        self.n_configs = n_configs
        self.min_budget = min_budget
        self.max_budget = max_budget
        self.eta = eta
        self.config_sampler = config_sampler
        self.evaluator = evaluator

        # Calculate number of rounds
        self.s_max = int(np.floor(np.log(n_configs) / np.log(eta)))

        # Calculate budgets and config counts for each round
        self.round_budgets = []
        self.round_configs = []
        for s in range(self.s_max + 1):
            n_s = int(np.floor(n_configs * (eta ** (-s))))
            r_s = min_budget * (eta ** s)
            self.round_configs.append(n_s)
            self.round_budgets.append(min(r_s, max_budget))

        logger.info("Successive Halving initialized:")
        logger.info(f"  Initial configs: {n_configs}")
        logger.info(f"  Rounds: {self.s_max + 1}")
        logger.info(f"  Reduction factor: {eta}")
        logger.info(f"  Budget schedule: {self.round_budgets}")
        logger.info(f"  Config schedule: {self.round_configs}")

        # Storage
        self.configurations: List[SHConfiguration] = []
        self.current_round = 0
        self.surviving_configs: List[int] = []  # indices

    def initialize(self):
        """Sample initial configurations."""
        logger.info(f"Sampling {self.n_configs} initial configurations...")
        for i in range(self.n_configs):
            if self.config_sampler:
                hp = self.config_sampler()
            else:
                hp = self._default_sampler()
            config = SHConfiguration(
                config_id=i,
                hyperparameters=hp,
                performances={}
            )
            self.configurations.append(config)
        self.surviving_configs = list(range(self.n_configs))
        logger.info(f"Initialized {len(self.configurations)} configurations")

    def _default_sampler(self) -> Dict[str, Any]:
        """Default hyperparameter sampler."""
        return {
            "learning_rate": 10 ** np.random.uniform(-5, -1),
            "weight_decay": 10 ** np.random.uniform(-6, -2),
            "batch_size": 2 ** np.random.randint(4, 9),
            "num_layers": np.random.randint(1, 8),
        }

    def run_round(self) -> Tuple[int, List[int]]:
        """
        Run one round of Successive Halving.

        Returns:
            (round_number, surviving_config_indices)
        """
        if self.current_round > self.s_max:
            logger.warning("All rounds completed")
            return self.current_round, self.surviving_configs

        s = self.current_round
        budget = self.round_budgets[s]
        n_keep = self.round_configs[min(s + 1, self.s_max)]

        logger.info(f"\n=== Round {s} ===")
        logger.info(f"Evaluating {len(self.surviving_configs)} configs at budget {budget}")

        # Evaluate all surviving configurations
        performances = []
        for idx in self.surviving_configs:
            config = self.configurations[idx]
            if self.evaluator:
                perf = self.evaluator(config.hyperparameters, budget)
            else:
                # Simulate: better configs tend to perform better
                perf = self._simulate_performance(config, budget)
            config.performances[s] = perf
            performances.append((idx, perf))
            logger.debug(f"  Config {idx}: {perf:.4f}")

        # Sort by performance (lower is better)
        performances.sort(key=lambda x: x[1])

        # Keep top n_keep configurations
        self.surviving_configs = [idx for idx, _ in performances[:n_keep]]
        eliminated = [idx for idx, _ in performances[n_keep:]]

        logger.info(f"Keeping top {n_keep}, eliminating {len(eliminated)}")
        logger.info(f"Survivors: {self.surviving_configs}")

        self.current_round += 1
        return s, self.surviving_configs

    def _simulate_performance(self, config: SHConfiguration, budget: float) -> float:
        """Simulate performance for testing (realistic learning curve)."""
        # True quality is determined by hyperparameters
        lr = config.hyperparameters.get("learning_rate", 0.001)
        wd = config.hyperparameters.get("weight_decay", 0.0001)

        # Optimal learning rate around 0.001
        lr_quality = -((np.log10(lr) + 3) ** 2) / 4
        wd_quality = -((np.log10(wd) + 4) ** 2) / 4
        true_quality = 0.5 + 0.2 * (lr_quality + wd_quality)

        # Performance improves with budget (learning curve)
        progress = 1 - np.exp(-budget / (self.max_budget / 3))
        noise = np.random.normal(0, 0.02 / np.sqrt(budget))

        return true_quality * (1 - 0.8 * progress) + noise

    def run_all(self) -> SHConfiguration:
        """Run all rounds and return best configuration."""
        self.initialize()

        while self.current_round <= self.s_max:
            self.run_round()

        # Return best surviving configuration
        best_idx = self.surviving_configs[0]
        best_config = self.configurations[best_idx]
        final_perf = best_config.performances[self.s_max]

        logger.info("\n=== Optimization Complete ===")
        logger.info(f"Best config: {best_idx}")
        logger.info(f"Final performance: {final_perf:.4f}")
        logger.info(f"Hyperparameters: {best_config.hyperparameters}")

        return best_config

    def get_statistics(self) -> Dict:
        """Get statistics about the optimization run."""
        total_budget_used = sum(
            self.round_configs[s] * self.round_budgets[s]
            for s in range(self.current_round)
        )
        full_eval_budget = self.n_configs * self.max_budget

        return {
            "rounds_completed": self.current_round,
            "total_budget_used": total_budget_used,
            "full_evaluation_budget": full_eval_budget,
            "budget_savings": 1 - (total_budget_used / full_eval_budget),
            "configurations_evaluated": self.n_configs,
            "surviving_configurations": len(self.surviving_configs),
        }


# Example usage
if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)

    sh = SuccessiveHalving(
        n_configs=81,
        min_budget=1,
        max_budget=81,
        eta=3
    )

    best = sh.run_all()
    stats = sh.get_statistics()
    print("\nStatistics:")
    print(f"  Budget used: {stats['total_budget_used']}")
    print(f"  Savings: {stats['budget_savings']:.1%}")
```

Successive Halving's elegance extends beyond its intuitive appeal—it comes with rigorous theoretical guarantees.
Understanding these guarantees clarifies when the algorithm works well and guides practical design choices.
Sample Complexity:
Consider the problem of identifying the best among (n) configurations, where each configuration (i) has true quality (\mu_i), and evaluating at budget (b) yields an observation with variance (\sigma^2 / b).
Let (\Delta_i = \mu_i - \mu^*) be the suboptimality gap of configuration (i), where (\mu^*) is the quality of the best configuration. Define the complexity parameter:
$$H = \sum_{i=2}^{n} \frac{1}{\Delta_i^2}$$
Theorem (Karnin et al., 2013): Successive Halving identifies the best configuration with probability at least (1 - \delta) using total budget:
$$B = O\left( H \cdot \log(n) \cdot \log(1/\delta) \right)$$
This is nearly optimal—no algorithm can do better than (\Omega(H \cdot \log(1/\delta))) in the worst case.
The complexity H captures how hard the identification problem is. When gaps Δᵢ are large (easy to distinguish configurations), H is small. When many configurations are nearly tied with the best, H is large.
Importantly, H depends on the problem instance—the actual gaps between configurations—not just the number of configurations. This means Successive Halving adapts (implicitly) to problem difficulty.
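To make (H) concrete, here is a small self-contained sketch (the quality values are invented for illustration) contrasting an easy instance, where gaps are large, with a hard one, where several configurations nearly tie the best:

```python
import numpy as np

def complexity_H(mu: np.ndarray) -> float:
    """Complexity parameter H = sum of 1/gap^2 over suboptimal configurations."""
    mu_star = mu.min()                # best (lowest-loss) configuration, assumed unique
    gaps = np.sort(mu - mu_star)[1:]  # drop the zero gap of the best configuration
    return float(np.sum(1.0 / gaps ** 2))

# Easy instance: large gaps, configurations are cheap to distinguish
easy = np.array([0.10, 0.40, 0.55, 0.70, 0.85])
# Hard instance: several configurations almost tie the best
hard = np.array([0.10, 0.11, 0.12, 0.13, 0.85])

print(f"H (easy) = {complexity_H(easy):10.1f}")  # small H: little budget needed
print(f"H (hard) = {complexity_H(hard):10.1f}")  # large H: far more budget needed
```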
Equal Budget Allocation:
A key insight is that Successive Halving allocates roughly equal total budget to each round:
$$\text{Round budget} = n_s \times r_s = \frac{n}{\eta^s} \times r_0 \cdot \eta^s = n \cdot r_0 = \frac{B}{s_{\max} + 1}$$
This equal allocation is information-theoretically motivated: we don't know a priori which round will be "hardest" (require the most budget to distinguish survivors), so we hedge by allocating equally.
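A quick numeric check of this identity, using the same (n = 81), (\eta = 3), (r_0 = 1) schedule as the table above:

```python
import numpy as np

n, eta, r0 = 81, 3, 1
s_max = round(np.log(n) / np.log(eta))  # 4, so five rounds: 0 through 4

for s in range(s_max + 1):
    n_s = n // eta ** s   # configurations entering round s
    r_s = r0 * eta ** s   # per-configuration budget at round s
    print(f"round {s}: n_s = {n_s:2d}, r_s = {r_s:2d}, round budget = {n_s * r_s}")
# Every round prints a budget of 81 = n * r0, so the total is B = 81 * 5 = 405.
```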
Geometric Elimination:
The geometric reduction schedule (halving at each round, or more generally keeping a 1/η fraction) is also theoretically justified. With (n) configurations and budget (B):

- There are only (\log_\eta(n) + 1) rounds, so splitting (B) equally across them costs at most a logarithmic factor relative to an oracle that knew which round mattered most.
- Survivors of round (s) receive budget (r_s = r_0 \cdot \eta^s), exponentially more as the field narrows, which matches the fact that the remaining configurations are progressively harder to tell apart.
- This is the source of the (\log(n)) factor in the theorem above: it is the price of not knowing the gaps (\Delta_i) in advance.
Regret Analysis:
In the context of hyperparameter optimization, we care about the simple regret—the suboptimality of the returned configuration:
$$\text{Simple Regret} = \mu_{\hat{i}} - \mu^*$$
where (\hat{i}) is the configuration returned by the algorithm.
Theorem: For Successive Halving with total budget (B):
$$\mathbb{E}[\text{Simple Regret}] = O\left( \sqrt{\frac{n \log n}{B}} \right)$$
This shows the regret decreases as (1/\sqrt{B}), which is optimal for this class of problems.
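A small Monte Carlo sketch of this rate (all assumptions invented for illustration: true losses uniform on ([0, 1]), Gaussian observation noise with variance (\sigma^2 / b) at budget (b)). Quadrupling (B) should roughly halve the mean regret:

```python
import numpy as np

def sh_mean_regret(n: int, budget: float, eta: int = 3, sigma: float = 1.0,
                   trials: int = 1000) -> float:
    """Mean simple regret of Successive Halving on random synthetic instances."""
    s_max = round(np.log(n) / np.log(eta))
    r0 = budget / (n * (s_max + 1))  # equal budget per round
    regrets = []
    for _ in range(trials):
        mu = np.random.uniform(0, 1, n)  # true losses (lower is better)
        survivors = np.arange(n)
        for s in range(s_max + 1):
            r_s = r0 * eta ** s
            obs = mu[survivors] + np.random.normal(0, sigma / np.sqrt(r_s),
                                                   len(survivors))
            keep = max(1, len(survivors) // eta)
            survivors = survivors[np.argsort(obs)[:keep]]
        regrets.append(mu[survivors[0]] - mu.min())
    return float(np.mean(regrets))

for B in [405, 4 * 405, 16 * 405]:
    print(f"B = {B:5d}: mean regret = {sh_mean_regret(81, B):.4f}")
```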
Comparison to Uniform Allocation:
If we instead evaluated all (n) configurations with equal budget (B/n) each, the regret would be:
$$\mathbb{E}[\text{Simple Regret}]_{\text{uniform}} = O\left( \sqrt{\frac{n}{B}} \right)$$
On paper, uniform allocation looks better: Successive Halving pays an extra (\sqrt{\log n}) factor in the worst case. In practice, the worst case is rare. Gaps between configurations are usually uneven, early elimination concentrates budget on genuine contenders, and typical experiments see 10-100× speedups over uniform allocation at comparable final quality.
One of the most important—and often overlooked—aspects of Successive Halving is the fundamental tradeoff between breadth and depth: given a fixed total budget (B), should we evaluate many configurations shallowly (large (n), small (B/n)) or few configurations deeply (small (n), large (B/n))?
This tradeoff critically affects Successive Halving's performance and is not automatically resolved by the algorithm. The user must choose (n).
Formal Analysis of the Tradeoff:
Consider a search space where the best configuration has performance (\mu^* = 0) and other configurations have performance uniformly distributed on ([\varepsilon, 1]).
Case 1: Large n, small initial budget

- More samples mean a better chance that at least one configuration is near-optimal.
- But first-round observations have variance (\sigma^2 / r_0) with a tiny (r_0), so noise can eliminate the best configuration before it distinguishes itself.

Case 2: Small n, large initial budget

- Each evaluation is reliable, so elimination decisions are rarely wrong.
- But with few samples, even the best sampled configuration may be far from optimal; no amount of extra budget fixes a bad draw.

The sweet spot depends on:

- Quality density: the fraction of the search space that is near-optimal.
- Noise level: how unreliable low-budget evaluations are.
- Total budget: how much compute there is to spread across rounds.

The simulation below explores exactly these parameters.
```python
import numpy as np
import matplotlib.pyplot as plt
from typing import Tuple


def simulate_sh_performance(
    n: int,                  # Number of initial configurations
    budget: float,           # Total budget
    eta: int,                # Reduction factor
    quality_density: float,  # Fraction of configs within 10% of optimal
    noise_std: float,        # Observation noise
    n_trials: int = 100
) -> Tuple[float, float]:
    """
    Simulate Successive Halving performance for given n.

    Returns:
        (mean_regret, std_regret)
    """
    regrets = []

    for _ in range(n_trials):
        # Sample true qualities
        # quality_density fraction are "good" (quality < 0.1)
        n_good = int(n * quality_density)
        n_bad = n - n_good
        good_qualities = np.random.uniform(0, 0.1, n_good)
        bad_qualities = np.random.uniform(0.1, 1.0, n_bad)
        qualities = np.concatenate([good_qualities, bad_qualities])
        np.random.shuffle(qualities)
        true_optimal = qualities.min()

        # Run Successive Halving
        s_max = int(np.log(n) / np.log(eta))
        r_0 = budget / (n * (s_max + 1))

        survivors = np.arange(n)
        for s in range(s_max + 1):
            r_s = r_0 * (eta ** s)
            n_s = len(survivors)
            n_keep = max(1, n_s // eta)

            # Observe performances with noise
            observations = qualities[survivors] + np.random.normal(
                0, noise_std / np.sqrt(r_s), n_s
            )

            # Keep best n_keep
            keep_indices = np.argsort(observations)[:n_keep]
            survivors = survivors[keep_indices]

        # Final selection
        final_quality = qualities[survivors[0]]
        regret = final_quality - true_optimal
        regrets.append(regret)

    return np.mean(regrets), np.std(regrets)


def find_optimal_n(
    budget: float,
    eta: int,
    quality_density: float,
    noise_std: float,
    n_range: np.ndarray = None
) -> dict:
    """Find optimal n for given problem parameters."""
    if n_range is None:
        n_range = np.array([9, 27, 81, 243, 729])

    results = []
    for n in n_range:
        if n > budget:
            # Can't have more configs than budget
            continue

        mean_regret, std_regret = simulate_sh_performance(
            n, budget, eta, quality_density, noise_std
        )
        results.append({
            'n': n,
            'mean_regret': mean_regret,
            'std_regret': std_regret,
            'budget_per_config_initial': budget / (n * (int(np.log(n) / np.log(eta)) + 1))
        })
        print(f"n={n}: regret={mean_regret:.4f} ± {std_regret:.4f}")

    best = min(results, key=lambda x: x['mean_regret'])
    return {
        'optimal_n': best['n'],
        'optimal_regret': best['mean_regret'],
        'all_results': results
    }


def visualize_tradeoff(
    budget: float = 729,
    eta: int = 3,
    noise_std: float = 0.1
):
    """Visualize the n-vs-B/n tradeoff for different quality densities."""
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    densities = [0.01, 0.05, 0.1, 0.2]
    n_values = np.array([9, 27, 81, 243, 729])

    for density in densities:
        regrets = []
        for n in n_values:
            if n <= budget:
                mean_r, _ = simulate_sh_performance(
                    n, budget, eta, density, noise_std, n_trials=50
                )
                regrets.append(mean_r)
        axes[0].plot(n_values[:len(regrets)], regrets, 'o-',
                     label=f'{density*100:.0f}% good configs')

    axes[0].set_xlabel('Number of Initial Configurations (n)')
    axes[0].set_ylabel('Mean Regret (lower is better)')
    axes[0].set_xscale('log')
    axes[0].set_title('Effect of n on Regret')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)

    # Heatmap of optimal n
    noise_levels = [0.05, 0.1, 0.2, 0.3]
    optimal_ns = []
    for noise in noise_levels:
        row = []
        for density in densities:
            result = find_optimal_n(budget, eta, density, noise, n_values)
            row.append(result['optimal_n'])
        optimal_ns.append(row)

    im = axes[1].imshow(optimal_ns, aspect='auto')
    axes[1].set_xticks(range(len(densities)))
    axes[1].set_xticklabels([f'{d*100:.0f}%' for d in densities])
    axes[1].set_yticks(range(len(noise_levels)))
    axes[1].set_yticklabels([f'{n:.2f}' for n in noise_levels])
    axes[1].set_xlabel('Quality Density')
    axes[1].set_ylabel('Noise Level')
    axes[1].set_title('Optimal n (darker = larger)')
    plt.colorbar(im, ax=axes[1], label='Optimal n')

    plt.tight_layout()
    return fig


# Demonstration
if __name__ == "__main__":
    print("Analyzing n-vs-B/n tradeoff...")

    print("\nScenario: 1% of configs are good, moderate noise")
    find_optimal_n(budget=729, eta=3, quality_density=0.01, noise_std=0.1)

    print("\nScenario: 10% of configs are good, moderate noise")
    find_optimal_n(budget=729, eta=3, quality_density=0.10, noise_std=0.1)
```

When the search space is well-understood and prior knowledge suggests many configurations will perform well, use smaller n. When exploring a new problem with unknown structure, use larger n to hedge against uncertainty.
A common heuristic: set n such that the minimum budget r₀ = B/(n·(s_max+1)) allows at least 1-2 epochs of training—enough to distinguish clearly bad configurations from potentially good ones.
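A minimal sketch of that heuristic (the function name and the `r_min` threshold are illustrative): grow (n) by factors of (\eta) while the implied (r_0) stays above the minimum useful budget:

```python
import numpy as np

def choose_n(total_budget: float, eta: int, r_min: float) -> int:
    """Largest n (a power of eta) whose implied round-0 budget r0 >= r_min."""
    best_n, n = 1, eta
    while True:
        s_max = round(np.log(n) / np.log(eta))  # exact: n is a power of eta
        r0 = total_budget / (n * (s_max + 1))   # budget per config in round 0
        if r0 < r_min:
            return best_n
        best_n, n = n, n * eta

# Example: 1000 epochs of total budget, eta = 3, at least 2 epochs per config
print(choose_n(total_budget=1000, eta=3, r_min=2.0))  # -> 81 (r0 ~ 2.47 epochs)
```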
The original Successive Halving algorithm is synchronous: all configurations at a given round must complete before the elimination decision is made and the next round begins. This creates challenges in distributed settings:

- Workers sit idle at each synchronization barrier while the slowest evaluations (stragglers) finish.
- A single slow or failed evaluation delays the entire round.
- In late rounds there are fewer surviving configurations than workers, leaving much of the cluster unused.
Asynchronous Successive Halving (ASHA):
ASHA (Li et al., 2020) modifies Successive Halving for asynchronous parallel execution. The key insight: don't wait for all configurations at a rung to complete before promoting.
Instead, whenever a worker finishes evaluating a configuration at rung (s):

- If the configuration ranks in the top (1/\eta) of all results completed so far at rung (s), it is promoted and evaluated at rung (s + 1).
- Otherwise, the worker simply starts a fresh configuration at the lowest rung.

Promotion decisions are therefore made on partial information; ASHA tolerates occasional wrong promotions because mispromoted configurations are out-competed at higher rungs.
| Aspect | Synchronous SH | ASHA |
|---|---|---|
| Promotion decision | After all complete round s | After each individual completion |
| Worker utilization | Idle during sync barriers | Always busy |
| Speedup with k workers | Sub-linear (stragglers, sequential phases) | Near-linear (no sync) |
| Straggler impact | Entire round delayed | Only affects that config |
| Promotion threshold | Top 1/η of all at rung s | Top 1/η of completed at rung s |
| Theoretical guarantees | Original guarantees apply | Similar but slightly weaker |
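The promotion rule at the heart of ASHA is tiny. A minimal standalone sketch (mirroring the logic embedded in the full simulation below):

```python
from typing import Dict, List, Tuple

def should_promote(rung_results: Dict[int, List[Tuple[int, float]]],
                   config_id: int, rung: int, eta: int = 3) -> bool:
    """Is this config in the top 1/eta of results completed so far at this rung?
    Results are (config_id, performance) pairs; lower performance is better."""
    completed = sorted(rung_results[rung], key=lambda pair: pair[1])
    n_promotable = max(1, len(completed) // eta)
    top_ids = {cid for cid, _ in completed[:n_promotable]}
    return config_id in top_ids

# Example: three configs have completed rung 0; config 2 currently leads.
results = {0: [(0, 0.45), (1, 0.62), (2, 0.31)]}
print(should_promote(results, config_id=2, rung=0))  # True: top 1/3 of 3 completed
print(should_promote(results, config_id=0, rung=0))  # False: not in the top 1/3
```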
"""Comparison of synchronous and asynchronous Successive Halvingin a simulated parallel environment."""import numpy as npfrom concurrent.futures import ThreadPoolExecutor, as_completedimport timefrom typing import List, Dictimport threading class SimulatedWorker: """Simulates evaluation with variable runtime.""" def evaluate(self, config_id: int, budget: float, true_quality: float) -> tuple: """ Simulate evaluation with realistic timing. Returns: (config_id, budget, observed_performance, wall_time) """ # Simulate variable runtime (30s to 60s per unit budget) base_time = 0.01 # seconds per budget unit (fast for demo) variability = np.random.uniform(0.5, 1.5) wall_time = base_time * budget * variability time.sleep(wall_time) # Performance depends on quality and improves with budget progress = 1 - np.exp(-budget / 30) noise = np.random.normal(0, 0.05 / np.sqrt(budget)) performance = true_quality * (1 - 0.8 * progress) + noise return config_id, budget, performance, wall_time class SynchronousSH: """Synchronous Successive Halving with parallel evaluation.""" def __init__(self, n_workers: int, n_configs: int, eta: int = 3): self.n_workers = n_workers self.n_configs = n_configs self.eta = eta self.s_max = int(np.log(n_configs) / np.log(eta)) def run(self, qualities: np.ndarray) -> Dict: """Run with given true qualities.""" total_time = 0 workers = SimulatedWorker() survivors = list(range(self.n_configs)) for s in range(self.s_max + 1): budget = self.eta ** s n_keep = max(1, len(survivors) // self.eta) # Parallel evaluation with synchronization round_start = time.time() results = [] with ThreadPoolExecutor(max_workers=self.n_workers) as executor: futures = { executor.submit( workers.evaluate, idx, budget, qualities[idx] ): idx for idx in survivors } # Wait for ALL to complete (synchronization point) for future in as_completed(futures): results.append(future.result()) round_time = time.time() - round_start total_time += round_time # Sort and eliminate results.sort(key=lambda x: x[2]) # Sort by performance survivors = [r[0] for r in results[:n_keep]] print(f"Round {s}: {len(results)} configs, " f"{round_time:.2f}s, survivors: {survivors[:5]}...") return { 'winner': survivors[0], 'winner_quality': qualities[survivors[0]], 'total_time': total_time } class AsynchronousSH: """Asynchronous Successive Halving (ASHA-like).""" def __init__(self, n_workers: int, n_configs: int, eta: int = 3): self.n_workers = n_workers self.n_configs = n_configs self.eta = eta self.s_max = int(np.log(n_configs) / np.log(eta)) # Shared state (thread-safe) self.lock = threading.Lock() self.rung_results: Dict[int, List] = {s: [] for s in range(self.s_max + 1)} self.promoted: Dict[int, int] = {} # config_id -> current_rung self.pending_promotions: List = [] # (config_id, rung) to evaluate self.next_config_id = 0 def run(self, qualities: np.ndarray) -> Dict: """Run asynchronously.""" start_time = time.time() workers = SimulatedWorker() def worker_loop(): while True: with self.lock: # Check for promoted configs first if self.pending_promotions: config_id, rung = self.pending_promotions.pop(0) budget = self.eta ** rung elif self.next_config_id < self.n_configs: config_id = self.next_config_id self.next_config_id += 1 rung = 0 budget = 1 self.promoted[config_id] = 0 else: # No work left return # Evaluate (outside lock) _, _, performance, _ = workers.evaluate( config_id, budget, qualities[config_id] ) with self.lock: # Record result self.rung_results[rung].append((config_id, performance)) # Check for promotion if rung < self.s_max: 
n_at_rung = len(self.rung_results[rung]) n_to_promote = max(1, n_at_rung // self.eta) # Sort completed at this rung sorted_results = sorted( self.rung_results[rung], key=lambda x: x[1] ) # Is this config in top 1/eta? top_ids = {r[0] for r in sorted_results[:n_to_promote]} if config_id in top_ids: # Promote self.pending_promotions.append((config_id, rung + 1)) self.promoted[config_id] = rung + 1 # Run workers with ThreadPoolExecutor(max_workers=self.n_workers) as executor: futures = [executor.submit(worker_loop) for _ in range(self.n_workers)] for f in futures: f.result() total_time = time.time() - start_time # Find best at final rung final_results = self.rung_results[self.s_max] if final_results: best = min(final_results, key=lambda x: x[1]) return { 'winner': best[0], 'winner_quality': qualities[best[0]], 'total_time': total_time } return {'winner': None, 'total_time': total_time} # Comparisonif __name__ == "__main__": np.random.seed(42) n_configs = 27 qualities = np.random.uniform(0, 1, n_configs) qualities[7] = 0.05 # Plant a good config print("=== Synchronous SH ===") sync_sh = SynchronousSH(n_workers=4, n_configs=n_configs, eta=3) sync_result = sync_sh.run(qualities) print(f"Result: config {sync_result['winner']}, " f"quality {sync_result['winner_quality']:.3f}, " f"time {sync_result['total_time']:.2f}s") print("\n=== Asynchronous SH (ASHA) ===") async_sh = AsynchronousSH(n_workers=4, n_configs=n_configs, eta=3) async_result = async_sh.run(qualities) print(f"Result: config {async_result['winner']}, " f"quality {async_result['winner_quality']:.3f}, " f"time {async_result['total_time']:.2f}s") print(f"\nSpeedup: {sync_result['total_time'] / async_result['total_time']:.2f}x")Successive Halving is highly effective in certain scenarios, achieving dramatic speedups with minimal impact on solution quality. Understanding these conditions helps practitioners choose appropriate optimization strategies.
Empirical Success Stories:
Neural Network Architecture Search: When comparing architectures with fixed hyperparameters, early training epochs strongly predict final performance. SH achieves 10-100× speedup in NAS benchmarks.
Learning Rate Tuning: Learning rate primarily affects early training dynamics. Poorly chosen learning rates are immediately obvious (divergence or extremely slow learning).
Regularization Strength: Too much or too little regularization shows immediate effects on validation loss.
Model Capacity Selection: Underfitting is visible early; overfitting takes longer but configurations that underfit early rarely become optimal.
If you can reliably identify the bottom 50% of configurations after 10% of training, Successive Halving will likely provide substantial speedup. Run a small pilot study: train 10-20 random configurations to completion and check the rank correlation between early and final performance.
Despite its elegance, Successive Halving is not universally applicable. Recognizing its limitations prevents suboptimal results and wasted computational resources.
The Fixed n Problem:
Perhaps Successive Halving's most significant limitation is the requirement to specify (n) upfront. This creates a no-win situation:

- Choose (n) too large, and the initial budget (r_0 = B / (n (s_{\max} + 1))) becomes so small that noisy early evaluations may eliminate the best configuration.
- Choose (n) too small, and evaluations are reliable but the search may never sample a good configuration in the first place.
The optimal (n) depends on problem characteristics that are unknown before running the optimization!
Hyperband, covered in the next page, directly addresses this limitation by running multiple Successive Halving instances with different (n) values in parallel, automatically adapting to problem difficulty.
Never blindly apply Successive Halving without validating the rank correlation assumption. Run a pilot study with diverse configurations, evaluate to completion, and compute Spearman correlation between performance at your intended minimum budget and final performance. If ρ < 0.7, consider higher minimum budgets or alternative methods.
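A minimal sketch of that pilot check (the loss arrays below are placeholders; substitute your own measurements at the minimum budget and at full budget):

```python
import numpy as np
from scipy.stats import spearmanr

# Placeholder data: validation losses for 10 pilot configurations, measured
# at the intended minimum budget and again after training to completion.
early_loss = np.array([0.90, 0.55, 0.62, 0.88, 0.40, 0.71, 0.45, 0.95, 0.58, 0.50])
final_loss = np.array([0.70, 0.30, 0.35, 0.75, 0.20, 0.50, 0.28, 0.80, 0.40, 0.25])

rho, p_value = spearmanr(early_loss, final_loss)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")

if rho < 0.7:
    print("Weak rank correlation: raise the minimum budget or reconsider SH.")
else:
    print("Early performance predicts final performance: SH should pay off.")
```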
Deploying Successive Halving in production requires attention to several practical details:
Integration with Training Frameworks:
Most modern HPO frameworks provide built-in Successive Halving support:
```python
# Ray Tune integration with ASHA (Async Successive Halving)
from ray import tune
from ray.tune.schedulers import ASHAScheduler


def train_model(config):
    """Training function that reports intermediate results."""
    model = create_model(config)

    for epoch in range(config["max_epochs"]):
        # Train one epoch
        train_loss = train_epoch(model, config)
        val_loss = validate(model)

        # Report to Ray Tune (ASHA uses this for early stopping)
        tune.report(
            epoch=epoch,
            val_loss=val_loss,
            train_loss=train_loss
        )


# Configure ASHA scheduler
asha_scheduler = ASHAScheduler(
    metric="val_loss",
    mode="min",
    max_t=100,           # Max epochs (budget)
    grace_period=1,      # Min epochs before stopping
    reduction_factor=3,  # eta (keep top 1/3)
)

# Define search space
config = {
    "learning_rate": tune.loguniform(1e-5, 1e-1),
    "weight_decay": tune.loguniform(1e-6, 1e-2),
    "batch_size": tune.choice([16, 32, 64, 128]),
    "num_layers": tune.randint(1, 8),
    "max_epochs": 100,
}

# Run optimization
analysis = tune.run(
    train_model,
    config=config,
    num_samples=81,  # n configurations
    scheduler=asha_scheduler,
    resources_per_trial={"cpu": 2, "gpu": 1},
)

# Get best result
best = analysis.get_best_config(metric="val_loss", mode="min")
print(f"Best config: {best}")
```

Successive Halving provides a principled, theoretically-grounded approach to early stopping in hyperparameter optimization.
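Optuna exposes the same elimination schedule through its pruner interface. A minimal self-contained sketch (the toy objective simulates a learning curve and is purely illustrative):

```python
import numpy as np
import optuna

def objective(trial: optuna.Trial) -> float:
    """Toy objective: loss decays toward a floor determined by the learning rate."""
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True)
    floor = (np.log10(lr) + 3) ** 2 / 10  # best near lr = 1e-3

    loss = 1.0
    for epoch in range(30):
        loss = floor + (loss - floor) * 0.8 + np.random.normal(0, 0.01)
        trial.report(loss, epoch)    # intermediate result used by the pruner
        if trial.should_prune():     # Successive Halving elimination decision
            raise optuna.TrialPruned()
    return loss

study = optuna.create_study(
    direction="minimize",
    pruner=optuna.pruners.SuccessiveHalvingPruner(min_resource=1, reduction_factor=3),
)
study.optimize(objective, n_trials=81)
print(study.best_params)
```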
Looking Ahead:
The fundamental limitation of Successive Halving—the requirement to choose (n) upfront—motivates Hyperband, covered in the next page. Hyperband runs multiple Successive Halving brackets with different (n) values, automatically balancing the exploration-exploitation tradeoff without requiring prior knowledge of problem characteristics.
You now understand the Successive Halving algorithm in depth: its derivation, theoretical guarantees, the n-vs-B/n tradeoff, synchronous and asynchronous execution, ideal conditions, and practical implementation. Next, we'll see how Hyperband builds on these foundations to create a more robust algorithm.