Multi-fidelity optimization methods like Hyperband and BOHB provide frameworks for within-run budget allocation—how to distribute resources across configurations during a single optimization run. But practitioners also face broader strategic decisions: how much total budget to commit, when to stop a campaign, and how to divide resources across workers, phases, and related projects.
This page addresses these strategic considerations, providing frameworks for making principled budget allocation decisions across the full hyperparameter optimization lifecycle.
By the end of this page, you will understand:
• Theoretical foundations of budget allocation in HPO
• Parallel and distributed budget management
• Adaptive budget strategies for changing constraints
• Cross-campaign budget allocation and transfer
• Practical budget planning for production systems
Budget allocation in hyperparameter optimization can be analyzed through the lens of optimal stopping theory and resource allocation under uncertainty.
The Fundamental Tradeoff:
Given a total budget $B$, we must decide how to divide it between exploring new configurations, exploiting promising ones, and fully training the final selection.
Let $B = B_e + B_x + B_f$, where the terms denote the exploration, exploitation, and final-training budgets respectively.
Optimal Allocation Theory:
Under simplifying assumptions (configurations sampled from known distribution, performance improves monotonically with budget), optimal allocation satisfies:
$$\frac{\partial \mathbb{E}[\text{Improvement}]}{\partial B_e} = \frac{\partial \mathbb{E}[\text{Improvement}]}{\partial B_x}$$
At the optimum, marginal improvement from exploration equals marginal improvement from exploitation. Early in optimization, exploration dominates; as good configurations are found, exploitation becomes more valuable.
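This equal-marginal-value condition can be illustrated with a toy greedy allocator. The logarithmic improvement curves and their scales below are illustrative assumptions, not measurements: the greedy rule spends each budget unit wherever the marginal gain is currently higher, and it stops where the two marginal gains roughly coincide.

```python
"""Toy greedy split of a budget between exploration and exploitation.

Assumes hypothetical diminishing-returns curves of the form
scale * log(1 + B); a sketch of the equal-marginal-value condition,
not a model of any real optimizer.
"""

def marginal_gain(curve_scale: float, spent: float) -> float:
    # Derivative of curve_scale * log(1 + B) with respect to B
    return curve_scale / (1.0 + spent)

def allocate(total_budget: int, explore_scale: float, exploit_scale: float):
    b_e = b_x = 0
    for _ in range(total_budget):
        # Spend the next unit where the marginal improvement is larger
        if marginal_gain(explore_scale, b_e) >= marginal_gain(exploit_scale, b_x):
            b_e += 1
        else:
            b_x += 1
    return b_e, b_x

b_e, b_x = allocate(100, explore_scale=2.0, exploit_scale=1.0)
```

When exploration's improvement curve is steeper (here, twice the scale), the greedy rule ends with roughly two-thirds of the budget in exploration, and the two marginal gains are nearly equal at the stopping point—the discrete analogue of the condition above.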
Practical Implications:
| Phase | Exploration % | Exploitation % | Reserve % | Strategy |
|---|---|---|---|---|
| Early (0-25%) | 60-70% | 20-30% | 10% | Aggressive random search |
| Middle (25-60%) | 40-50% | 40-50% | 10% | Balanced Hyperband |
| Late (60-90%) | 20-30% | 50-60% | 20% | Focused BOHB |
| Final (90-100%) | 0% | 20-30% | 70-80% | Best config full training |
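As a rough sketch, the phase table can be encoded as a lookup keyed on campaign progress. The fractions below are mid-points of the ranges above—one reading of the table, not prescriptions:

```python
def phase_allocation(progress: float) -> dict:
    """Mid-point reading of the phase table; `progress` is the fraction
    of total budget already spent (0.0 to 1.0)."""
    if progress < 0.25:
        return {"explore": 0.65, "exploit": 0.25, "reserve": 0.10,
                "strategy": "aggressive random search"}
    elif progress < 0.60:
        return {"explore": 0.45, "exploit": 0.45, "reserve": 0.10,
                "strategy": "balanced Hyperband"}
    elif progress < 0.90:
        return {"explore": 0.25, "exploit": 0.55, "reserve": 0.20,
                "strategy": "focused BOHB"}
    else:
        return {"explore": 0.0, "exploit": 0.25, "reserve": 0.75,
                "strategy": "best config full training"}

early = phase_allocation(0.10)
```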
In distributed settings with multiple workers, budget allocation becomes more complex. Key considerations include worker heterogeneity, communication overhead, and load balancing.
Worker Heterogeneity:
Real clusters often contain heterogeneous hardware—different GPU types, varying memory, different network latencies. Naive allocation ignores this, leading to poor utilization.
```python
"""Parallel budget management with heterogeneous workers."""
import heapq
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional

import numpy as np


@dataclass
class Worker:
    worker_id: str
    gpu_type: str
    throughput: float  # Budget units per hour
    current_job: Optional[str] = None
    available_at: float = 0.0


@dataclass
class Job:
    job_id: str
    config_id: str
    budget: float
    priority: float = 0.0  # Higher = more important


class ParallelBudgetManager:
    """Manages budget allocation across heterogeneous workers."""

    def __init__(self, workers: List[Worker]):
        self.workers = {w.worker_id: w for w in workers}
        self.pending_jobs: list = []
        self.completed_jobs: List[Job] = []
        self.current_time = 0.0
        self._counter = 0  # Tiebreaker so the heap never compares Job objects
        # Track throughput by GPU type for estimation
        self.throughput_history: Dict[str, List[float]] = defaultdict(list)

    def submit_job(self, job: Job):
        """Submit a job to the pending queue."""
        heapq.heappush(self.pending_jobs, (-job.priority, job.budget, self._counter, job))
        self._counter += 1

    def get_assignment(self, worker_id: str) -> Optional[Job]:
        """Get next job assignment for a worker, using capability-aware scheduling."""
        if not self.pending_jobs:
            return None
        worker = self.workers[worker_id]
        # Find the best job for this worker:
        # prefer jobs whose budget matches the worker's capability
        best_idx = 0
        best_score = float('-inf')
        for i, (neg_priority, budget, _, job) in enumerate(self.pending_jobs):
            # Score based on priority and capability match
            efficiency = worker.throughput / self._avg_throughput()
            # High-budget jobs go to high-throughput workers
            capability_match = efficiency if budget > self._median_budget() else 1 / efficiency
            score = -neg_priority + 0.5 * capability_match
            if score > best_score:
                best_score = score
                best_idx = i
        # Remove and return the best job, then restore the heap invariant
        _, _, _, job = self.pending_jobs.pop(best_idx)
        heapq.heapify(self.pending_jobs)
        worker.current_job = job.job_id
        worker.available_at = self.current_time + job.budget / worker.throughput
        return job

    def complete_job(self, worker_id: str, job: Job, actual_time: float):
        """Record job completion and update throughput estimates."""
        worker = self.workers[worker_id]
        worker.current_job = None
        # Update throughput estimate via exponential moving average
        actual_throughput = job.budget / actual_time
        self.throughput_history[worker.gpu_type].append(actual_throughput)
        alpha = 0.3
        worker.throughput = alpha * actual_throughput + (1 - alpha) * worker.throughput
        self.completed_jobs.append(job)

    def _avg_throughput(self) -> float:
        return float(np.mean([w.throughput for w in self.workers.values()]))

    def _median_budget(self) -> float:
        if not self.pending_jobs:
            return 1.0
        budgets = [job.budget for _, _, _, job in self.pending_jobs]
        return float(np.median(budgets))

    def estimate_completion_time(self) -> float:
        """Estimate time to complete all pending jobs."""
        if not self.pending_jobs:
            return 0.0
        total_budget = sum(job.budget for _, _, _, job in self.pending_jobs)
        total_throughput = sum(w.throughput for w in self.workers.values())
        return total_budget / total_throughput

    def get_utilization(self) -> Dict[str, float]:
        """Get current worker utilization."""
        busy = sum(1 for w in self.workers.values() if w.current_job)
        return {
            "busy_workers": busy,
            "total_workers": len(self.workers),
            "utilization": busy / len(self.workers),
        }
```

Real-world optimization often faces changing budget constraints—cloud costs may limit resources, deadlines may shift, or new information may change priorities. Adaptive strategies adjust allocation dynamically.
Budget Reallocation Triggers:
```python
"""Adaptive budget allocation with dynamic reallocation."""
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class BudgetPhase(Enum):
    EXPLORATION = "exploration"
    BALANCED = "balanced"
    EXPLOITATION = "exploitation"
    FINALIZATION = "finalization"


@dataclass
class BudgetState:
    total_budget: float
    spent_budget: float
    best_performance: float
    iterations_without_improvement: int
    deadline_pressure: float  # 0-1, higher = more pressure

    @property
    def remaining_budget(self) -> float:
        return self.total_budget - self.spent_budget

    @property
    def progress(self) -> float:
        return self.spent_budget / self.total_budget


class AdaptiveBudgetAllocator:
    """Dynamically adjusts budget allocation based on optimization state."""

    def __init__(
        self,
        total_budget: float,
        convergence_patience: int = 10,
        phase_thresholds: Optional[Dict[str, float]] = None,
    ):
        self.total_budget = total_budget
        self.convergence_patience = convergence_patience
        self.phase_thresholds = phase_thresholds or {
            "exploration_end": 0.25,
            "balanced_end": 0.60,
            "exploitation_end": 0.90,
        }
        self.state = BudgetState(
            total_budget=total_budget,
            spent_budget=0.0,
            best_performance=float('inf'),
            iterations_without_improvement=0,
            deadline_pressure=0.0,
        )
        self.current_phase = BudgetPhase.EXPLORATION

    def update_state(
        self,
        budget_spent: float,
        best_performance: float,
        deadline_pressure: float = 0.0,
    ):
        """Update internal state after each iteration."""
        self.state.spent_budget += budget_spent
        self.state.deadline_pressure = deadline_pressure
        if best_performance < self.state.best_performance:
            self.state.best_performance = best_performance
            self.state.iterations_without_improvement = 0
        else:
            self.state.iterations_without_improvement += 1
        self._update_phase()

    def _update_phase(self):
        """Determine current optimization phase."""
        progress = self.state.progress
        converging = self.state.iterations_without_improvement > self.convergence_patience
        if progress < self.phase_thresholds["exploration_end"] and not converging:
            self.current_phase = BudgetPhase.EXPLORATION
        elif progress < self.phase_thresholds["balanced_end"] and not converging:
            self.current_phase = BudgetPhase.BALANCED
        elif progress < self.phase_thresholds["exploitation_end"]:
            self.current_phase = BudgetPhase.EXPLOITATION
        else:
            self.current_phase = BudgetPhase.FINALIZATION
        # Override for deadline pressure
        if self.state.deadline_pressure > 0.8:
            self.current_phase = BudgetPhase.FINALIZATION

    def get_allocation(self) -> Dict[str, object]:
        """Get recommended budget allocation for the current phase."""
        allocations = {
            BudgetPhase.EXPLORATION: {
                "n_configs": "high",
                "max_budget_fraction": 0.3,
                "random_fraction": 0.7,
                "aggressive_brackets": True,
            },
            BudgetPhase.BALANCED: {
                "n_configs": "medium",
                "max_budget_fraction": 0.5,
                "random_fraction": 0.4,
                "aggressive_brackets": True,
            },
            BudgetPhase.EXPLOITATION: {
                "n_configs": "low",
                "max_budget_fraction": 0.8,
                "random_fraction": 0.2,
                "aggressive_brackets": False,
            },
            BudgetPhase.FINALIZATION: {
                "n_configs": "minimal",
                "max_budget_fraction": 1.0,
                "random_fraction": 0.0,
                "aggressive_brackets": False,
            },
        }
        return allocations[self.current_phase]

    def should_early_stop_campaign(self) -> bool:
        """Check if the optimization campaign should terminate early."""
        # Stop if converged and sufficient budget spent
        if (self.state.iterations_without_improvement > 2 * self.convergence_patience
                and self.state.progress > 0.5):
            return True
        # Stop if deadline pressure is extreme
        if self.state.deadline_pressure > 0.95:
            return True
        return False
```

Organizations often run multiple optimization campaigns across different models, datasets, or problem domains. Strategic budget allocation across campaigns can dramatically improve overall efficiency.
Transfer Learning for HPO:
When optimizing hyperparameters for related problems, observations from previous campaigns can inform new ones:
Multi-Task Budget Allocation:
Given $k$ tasks and total budget $B$, how should budget be distributed?
One practical approach:
Maintain a "configuration portfolio"—a set of hyperparameter configurations that collectively perform well across past tasks. When starting a new campaign, evaluate portfolio configurations first at low fidelity. This often identifies good starting points within 5-10% of total budget, leaving 90%+ for refinement.
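A minimal sketch of this warm-start pattern, assuming a hypothetical `evaluate(config, budget)` callable that stands in for a real low-fidelity training run:

```python
"""Portfolio warm-start sketch: screen past-winning configs cheaply first.

`evaluate` is a stand-in for a real low-fidelity training run; the
function and its signature are assumptions for illustration.
"""
from typing import Callable, Dict, List, Tuple


def warm_start(
    portfolio: List[Dict],
    evaluate: Callable[[Dict, float], float],  # (config, budget) -> loss
    total_budget: float,
    screen_fraction: float = 0.1,
) -> Tuple[Dict, float]:
    """Spend ~screen_fraction of the budget screening the portfolio at low
    fidelity; return the best config and the budget left for refinement."""
    screen_budget = total_budget * screen_fraction
    per_config = screen_budget / max(len(portfolio), 1)
    scored = [(evaluate(cfg, per_config), cfg) for cfg in portfolio]
    _, best_cfg = min(scored, key=lambda t: t[0])
    return best_cfg, total_budget - screen_budget


# Toy usage: a fake loss that prefers learning rates near 0.01
portfolio = [{"lr": 0.1}, {"lr": 0.01}, {"lr": 1.0}]
toy_eval = lambda cfg, budget: abs(cfg["lr"] - 0.01)
best, remaining = warm_start(portfolio, toy_eval, total_budget=100.0)
```

With the default `screen_fraction=0.1`, the screening pass consumes 10% of the budget and the remaining 90% is available for refining the selected configuration, matching the 5-10% figure above.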
Translating theoretical frameworks into practical budget plans requires consideration of real-world constraints.
| Component | Typical % | Considerations |
|---|---|---|
| Initial exploration | 15-25% | Random/grid search to understand landscape |
| Main optimization | 40-50% | Hyperband/BOHB with progressive refinement |
| Validation | 10-15% | Cross-validation of top candidates |
| Final training | 20-30% | Full training of selected configuration |
| Buffer | 5-10% | Unexpected reruns, debugging, verification |
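One way to turn this table into a concrete plan is a small helper that splits a total GPU-hour budget by component. The default fractions below are mid-points of the ranges above—assumptions to adjust against your own constraints:

```python
from typing import Dict, Optional


def plan_budget(total_gpu_hours: float,
                fractions: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Split a total GPU-hour budget using mid-points of the component table.
    The default fractions are illustrative assumptions, not prescriptions."""
    fractions = fractions or {
        "initial_exploration": 0.20,
        "main_optimization": 0.45,
        "validation": 0.10,
        "final_training": 0.20,
        "buffer": 0.05,
    }
    # Fractions must account for the whole budget
    assert abs(sum(fractions.values()) - 1.0) < 1e-9
    return {k: total_gpu_hours * v for k, v in fractions.items()}


plan = plan_budget(1000.0)
```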
• Underestimating final training cost (often 2-5x the max HPO budget level)
• No buffer for infrastructure failures and debugging
• Ignoring cross-validation costs for final model selection
• Not accounting for hyperparameter-dependent training time (larger models train slower)
Production HPO systems require monitoring to detect issues and adjustment mechanisms to respond dynamically.
Key Metrics to Monitor:
Intervention Triggers:
| Signal | Possible Issue | Response |
|---|---|---|
| No improvement for many iterations | Convergence or stuck | Increase exploration, restart |
| High failure rate | Configuration issues | Narrow search space bounds |
| Low utilization | Scheduling inefficiency | Increase parallelism |
| Budget overrun | Underestimated costs | Reduce max budget, prune brackets |
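The intervention table can be wired into a simple periodic check. The metric names and thresholds below are illustrative assumptions:

```python
from typing import Dict, List


def check_triggers(metrics: Dict[str, float]) -> List[str]:
    """Map the monitoring signals from the table above to suggested responses.
    Metric keys and thresholds are illustrative assumptions."""
    actions = []
    # No improvement for many iterations -> convergence or stuck
    if metrics.get("iters_without_improvement", 0) > 20:
        actions.append("increase exploration or restart")
    # High failure rate -> configuration issues
    if metrics.get("failure_rate", 0.0) > 0.2:
        actions.append("narrow search space bounds")
    # Low utilization -> scheduling inefficiency
    if metrics.get("utilization", 1.0) < 0.6:
        actions.append("increase parallelism")
    # Spending ahead of schedule -> budget overrun risk
    if metrics.get("spend_fraction", 0.0) > metrics.get("time_fraction", 1.0):
        actions.append("reduce max budget and prune brackets")
    return actions


acts = check_triggers({"failure_rate": 0.5, "utilization": 0.9,
                       "spend_fraction": 0.3, "time_fraction": 0.5})
```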
Module Complete:
You have now completed the Multi-Fidelity Optimization module, covering early stopping approaches, Successive Halving, Hyperband, BOHB, and budget allocation strategies. These techniques form the foundation of efficient hyperparameter optimization at scale.
With this understanding of multi-fidelity optimization—from theoretical foundations through practical implementation—you can design and deploy HPO systems that find strong configurations at a small fraction of the cost of exhaustive search.