Imagine two runners in a race. One starts at the starting line. The other starts 80% of the way to the finish. Who wins?
This analogy captures the essence of warm starting in hyperparameter optimization. Without warm starting, optimization algorithms begin from random configurations—the starting line. With warm starting, they begin from configurations that meta-learning suggests will perform well—often already close to optimal.
The impact is dramatic: as the table below shows, meta-learned starting points routinely cut the number of evaluations needed to reach strong performance by a factor of 5–10x.
Warm starting is the practical bridge between meta-learning and production AutoML. It's how systems like Auto-sklearn achieve their remarkable efficiency—by never truly starting from scratch when solving a new problem.
This page explores warm starting techniques: from simple configuration transfer to sophisticated transfer acquisition functions, from single-fidelity to multi-fidelity approaches.
By the end of this page, you will understand warm-starting strategies for Bayesian optimization, how to transfer configurations from similar tasks, multi-fidelity warm starting with Hyperband, transfer acquisition functions, and practical implementation in Auto-sklearn and other systems.
Before diving into techniques, let's establish why warm starting matters and quantify its benefits.
The Cold-Start Problem in Optimization:
Standard hyperparameter optimization starts with no knowledge: the first evaluations are spent on random initialization and broad exploration before the surrogate model becomes useful.
The initial and exploration phases are wasteful if we have prior knowledge about what configurations work. Warm starting eliminates this waste.
Empirical Evidence:
Studies show that warm-started Bayesian optimization consistently reaches strong configurations in far fewer evaluations than cold-started search, with the largest relative gains early in the run.
When Warm Starting Helps Most:
| Scenario | Random Init (evals to 90% best) | Warm Start (evals to 90% best) | Speedup |
|---|---|---|---|
| Small dataset (< 1K samples) | ~30 | ~3 | 10x |
| Medium dataset (10K samples) | ~50 | ~8 | 6x |
| Large dataset (100K samples) | ~80 | ~15 | 5x |
| Novel domain | ~100 | ~40 | 2.5x |
Warm starting provides the largest relative gains early in optimization. With unlimited budget, cold-start eventually catches up. The value of warm starting is in reaching good performance faster, not necessarily in reaching better performance.
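The "evals to 90% best" metric in the table can be made concrete in a few lines. This is a minimal sketch (the `evals_to_fraction` helper and the accuracy traces are illustrative): it tracks the incumbent, i.e. best-so-far, score and reports how many evaluations it takes to reach 90% of the final best.

```python
import numpy as np

def evals_to_fraction(scores, fraction=0.9):
    """Number of evaluations needed for the incumbent (best-so-far)
    score to reach `fraction` of the final best score."""
    incumbent = np.maximum.accumulate(np.asarray(scores, dtype=float))
    target = fraction * incumbent[-1]
    # +1 converts a 0-based index into an evaluation count
    return int(np.argmax(incumbent >= target)) + 1

# Hypothetical accuracy traces over the same 10-evaluation budget
cold = [0.55, 0.60, 0.58, 0.70, 0.72, 0.71, 0.80, 0.82, 0.85, 0.86]
warm = [0.82, 0.84, 0.83, 0.85, 0.86, 0.85, 0.86, 0.86, 0.86, 0.86]

print(evals_to_fraction(cold))  # → 7: cold start needs most of the budget
print(evals_to_fraction(warm))  # → 1: warm start is near-best immediately
```

Both runs end at the same final score here; the difference warm starting buys is entirely in how quickly the incumbent gets there.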
The simplest warm-starting approach: find similar tasks from the meta-database and use their best configurations as starting points.
The Algorithm: compute meta-features for the new dataset, find its k nearest neighbors in meta-feature space, and collect each neighbor's best-known configuration as the initial design for the optimizer.
This is exactly what Auto-sklearn does with k=25 similar datasets.
Why It Works:
If similar datasets benefit from similar configurations, transferred configurations should perform well immediately and give the optimizer informative early observations.
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler
from typing import List, Dict, Tuple, Any
from dataclasses import dataclass

@dataclass
class Configuration:
    """A hyperparameter configuration with metadata."""
    algorithm: str
    hyperparameters: Dict[str, Any]
    performance: float
    source_dataset: str

class ConfigurationTransfer:
    """
    Transfer best configurations from similar past tasks.

    This is the core warm-starting mechanism in Auto-sklearn.
    """

    def __init__(self, k_neighbors: int = 25, metric: str = 'manhattan'):
        """
        Parameters:
            k_neighbors: Number of similar datasets to retrieve from
            metric: Distance metric for similarity
        """
        self.k = k_neighbors
        self.metric = metric
        self.scaler = StandardScaler()
        self.knn = NearestNeighbors(n_neighbors=k_neighbors, metric=metric)

    def fit(self, meta_features: np.ndarray, dataset_ids: List[str],
            configurations: Dict[str, List[Configuration]]):
        """
        Build the transfer model from historical data.

        Parameters:
            meta_features: (n_datasets, n_meta_features) array
            dataset_ids: List of dataset identifiers
            configurations: Dict mapping dataset_id to list of tried configurations
        """
        # Store and normalize meta-features
        self.meta_features = self.scaler.fit_transform(meta_features)
        self.dataset_ids = dataset_ids
        self.configurations = configurations

        # Fit k-NN model
        self.knn.fit(self.meta_features)

        # Precompute best configuration per dataset
        self.best_configs = {}
        for did in dataset_ids:
            if did in configurations and configurations[did]:
                best = max(configurations[did], key=lambda c: c.performance)
                self.best_configs[did] = best

    def get_warm_start_configs(self, new_meta_features: np.ndarray,
                               n_configs: int = 25,
                               diversity_aware: bool = True) -> List[Configuration]:
        """
        Get warm-start configurations for a new dataset.

        Parameters:
            new_meta_features: Meta-features of the new dataset
            n_configs: Maximum number of configurations to return
            diversity_aware: If True, ensure diverse algorithm coverage

        Returns:
            List of configurations to evaluate first
        """
        # Find similar datasets
        mf_scaled = self.scaler.transform(new_meta_features.reshape(1, -1))
        distances, indices = self.knn.kneighbors(mf_scaled)

        # Collect configurations from similar datasets
        candidates = []
        for idx, dist in zip(indices[0], distances[0]):
            dataset_id = self.dataset_ids[idx]
            if dataset_id in self.best_configs:
                config = self.best_configs[dataset_id]
                candidates.append({
                    'config': config,
                    'distance': dist,
                    'source': dataset_id
                })

        if diversity_aware:
            # Ensure we have diverse algorithms, not just 25 random forests
            selected = self._select_diverse(candidates, n_configs)
        else:
            # Simple: take top-n by distance
            candidates.sort(key=lambda x: x['distance'])
            selected = [c['config'] for c in candidates[:n_configs]]

        return selected

    def _select_diverse(self, candidates: List[Dict], n: int) -> List[Configuration]:
        """
        Select diverse configurations covering different algorithms.

        Greedy selection: pick best for each algorithm, then fill
        remaining slots with globally best.
        """
        # Group by algorithm
        by_algorithm = {}
        for c in candidates:
            algo = c['config'].algorithm
            if algo not in by_algorithm:
                by_algorithm[algo] = []
            by_algorithm[algo].append(c)

        # Sort within each group
        for algo in by_algorithm:
            by_algorithm[algo].sort(key=lambda x: x['distance'])

        selected = []
        selected_set = set()

        # First pass: one per algorithm
        for algo, algo_candidates in by_algorithm.items():
            if algo_candidates and len(selected) < n:
                config = algo_candidates[0]['config']
                config_key = (config.algorithm,
                              tuple(sorted(config.hyperparameters.items())))
                if config_key not in selected_set:
                    selected.append(config)
                    selected_set.add(config_key)

        # Second pass: fill remaining slots with best overall
        all_sorted = sorted(candidates, key=lambda x: x['distance'])
        for c in all_sorted:
            if len(selected) >= n:
                break
            config = c['config']
            config_key = (config.algorithm,
                          tuple(sorted(config.hyperparameters.items())))
            if config_key not in selected_set:
                selected.append(config)
                selected_set.add(config_key)

        return selected

    def estimate_transfer_difficulty(self, new_meta_features: np.ndarray) -> float:
        """
        Estimate how difficult transfer will be (how different is this dataset?).

        High distance to neighbors suggests warm starting may be less effective.
        """
        mf_scaled = self.scaler.transform(new_meta_features.reshape(1, -1))
        distances, _ = self.knn.kneighbors(mf_scaled)
        # Use median distance to k neighbors as difficulty estimate
        return np.median(distances[0])
```

Practical Considerations:
1. How many configurations to transfer?
Auto-sklearn uses k=25 as a balanced choice: enough configurations to cover several algorithm families, but few enough to leave most of the budget for model-guided search.
2. Configuration validity:
Transferred configurations must be valid in the current search space. If the new problem has a different search space (algorithms removed, hyperparameter ranges changed, conditional dependencies altered), invalid configurations must be filtered out or projected back into the space.
3. Staleness:
Older configurations may use outdated algorithm versions. Include recency in selection criteria.
If k similar datasets all favor Random Forest, you'll get k Random Forest configurations. Ensure diversity by selecting at least one configuration per algorithm type before filling remaining slots. This prevents missing algorithms that might excel on the new dataset despite not being best on similar ones.
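The validity check from consideration 2 is straightforward to enforce with a filtering pass before evaluation. A minimal sketch, assuming a simple search-space description of numeric ranges and categorical sets (the `filter_valid_configs` helper and the space structure are illustrative, not from a specific library):

```python
def filter_valid_configs(configs, space):
    """Keep only transferred configurations that are valid in the
    current search space. `space` maps hyperparameter name to either
    a (low, high) numeric range or a set of allowed categorical values."""
    valid = []
    for cfg in configs:
        ok = True
        for name, value in cfg.items():
            if name not in space:
                ok = False  # Hyperparameter no longer exists
                break
            spec = space[name]
            if isinstance(spec, tuple):
                low, high = spec
                ok = low <= value <= high  # Numeric range check
            else:
                ok = value in spec  # Categorical membership check
            if not ok:
                break
        if ok:
            valid.append(cfg)
    return valid

space = {'max_depth': (1, 10), 'criterion': {'gini', 'entropy'}}
configs = [
    {'max_depth': 6, 'criterion': 'gini'},      # valid
    {'max_depth': 25, 'criterion': 'gini'},     # out of range -> dropped
    {'max_depth': 4, 'criterion': 'log_loss'},  # unknown category -> dropped
]
print(len(filter_valid_configs(configs, space)))  # → 1
```

An alternative to dropping invalid configurations is clipping numeric values into range, which preserves more of the transferred signal at the cost of evaluating a configuration no source task actually ran.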
Bayesian optimization builds a surrogate model of the objective function and uses an acquisition function to select promising configurations. Warm starting can accelerate both components.
Approach 1: Warm-Starting the Surrogate Model
The surrogate model (Gaussian Process or Random Forest) maps configurations to expected performance. We can initialize it with observations from similar past tasks, treated as prior (pseudo-)observations.
Approach 2: Warm-Starting the Search
Keep the surrogate cold-started (no prior bias) but evaluate the transferred configurations first, so the surrogate's earliest real observations come from promising regions of the space.
Why Approach 2 Often Works Better: every observation the surrogate sees is ground truth on the target task, so a poorly matched source task cannot bias the model—at worst, a few early evaluations are wasted.
```python
import numpy as np
from typing import List, Dict, Callable, Tuple, Any
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel
from scipy.stats import norm

class WarmStartBayesianOptimization:
    """
    Bayesian optimization with warm-starting from transferred configurations.

    Combines configuration transfer with Gaussian Process-based BO.
    """

    def __init__(self, config_space,
                 warm_start_configs: List[Dict[str, Any]] = None,
                 n_random_init: int = 5):
        """
        Parameters:
            config_space: Configuration space (ConfigSpace object)
            warm_start_configs: Configurations to evaluate first
            n_random_init: Random configs to add if fewer warm-start configs
        """
        self.config_space = config_space
        self.warm_start_configs = warm_start_configs or []
        self.n_random_init = n_random_init

        # GP surrogate
        kernel = Matern(nu=2.5) + WhiteKernel(noise_level=0.1)
        self.gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)

        # Track observations
        self.X_observed = []
        self.y_observed = []

    def _config_to_vector(self, config: Dict) -> np.ndarray:
        """Convert configuration to numerical vector."""
        # Implementation depends on config_space
        # Simplified: assume all values are numeric
        return np.array(list(config.values()))

    def _expected_improvement(self, X: np.ndarray, y_best: float,
                              xi: float = 0.01) -> np.ndarray:
        """
        Expected Improvement acquisition function.

        Parameters:
            X: Candidate configurations (n_candidates, n_dims)
            y_best: Best observed value so far
            xi: Exploration-exploitation trade-off

        Returns:
            EI value for each candidate
        """
        mu, sigma = self.gp.predict(X, return_std=True)

        # Handle zero variance
        with np.errstate(divide='ignore', invalid='ignore'):
            z = (mu - y_best - xi) / sigma
            ei = (mu - y_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)
            ei[sigma < 1e-10] = 0.0

        return ei

    def _select_next_random(self) -> Dict:
        """Sample a random configuration."""
        return dict(self.config_space.sample_configuration())

    def _select_next_acquisition(self) -> Dict:
        """Select next configuration using acquisition function."""
        # Generate candidate pool
        n_candidates = 1000
        candidates = [self._config_to_vector(dict(self.config_space.sample_configuration()))
                      for _ in range(n_candidates)]
        X_candidates = np.array(candidates)

        # Compute acquisition values
        y_best = max(self.y_observed)  # Assuming maximization
        ei_values = self._expected_improvement(X_candidates, y_best)

        # Return best candidate
        best_idx = np.argmax(ei_values)
        # Convert back to config (simplified)
        # In practice, need reverse mapping from vector to config
        return dict(self.config_space.sample_configuration())

    def optimize(self, objective_func: Callable,
                 n_iterations: int = 50) -> Tuple[Dict, float]:
        """
        Run warm-started Bayesian optimization.

        Parameters:
            objective_func: Function mapping config to performance (maximize)
            n_iterations: Total number of configurations to evaluate

        Returns:
            (best_config, best_performance)
        """
        print(f"Starting warm BO with {len(self.warm_start_configs)} warm-start configs")

        iteration = 0

        # Phase 1: Evaluate warm-start configurations
        for config in self.warm_start_configs:
            if iteration >= n_iterations:
                break
            performance = objective_func(config)
            self.X_observed.append(self._config_to_vector(config))
            self.y_observed.append(performance)
            print(f"Warm-start {iteration + 1}: {performance:.4f}")
            iteration += 1

        # Phase 2: Random fill if needed (to have enough for GP)
        while len(self.X_observed) < max(self.n_random_init, 3):
            if iteration >= n_iterations:
                break
            config = self._select_next_random()
            performance = objective_func(config)
            self.X_observed.append(self._config_to_vector(config))
            self.y_observed.append(performance)
            print(f"Random init {iteration + 1}: {performance:.4f}")
            iteration += 1

        # Fit initial GP
        X = np.array(self.X_observed)
        y = np.array(self.y_observed)
        self.gp.fit(X, y)

        # Phase 3: Acquisition-guided search
        while iteration < n_iterations:
            config = self._select_next_acquisition()
            performance = objective_func(config)
            self.X_observed.append(self._config_to_vector(config))
            self.y_observed.append(performance)

            # Refit GP
            X = np.array(self.X_observed)
            y = np.array(self.y_observed)
            self.gp.fit(X, y)

            print(f"BO iteration {iteration + 1}: {performance:.4f}")
            iteration += 1

        # Return best
        best_idx = np.argmax(self.y_observed)
        # Would need to store configs, not just vectors
        return None, self.y_observed[best_idx]

def compare_warm_vs_cold(objective_func: Callable, config_space,
                         warm_configs: List[Dict],
                         n_iterations: int = 50,
                         n_repeats: int = 5):
    """
    Compare warm-started vs cold-started BO.

    Shows the benefit of warm starting empirically.
    """
    warm_curves = []
    cold_curves = []

    for repeat in range(n_repeats):
        print(f"=== Repeat {repeat + 1}/{n_repeats} ===")

        # Warm-started
        print("Warm-started:")
        warm_bo = WarmStartBayesianOptimization(config_space, warm_configs)
        _, warm_best = warm_bo.optimize(objective_func, n_iterations)
        warm_curves.append(np.maximum.accumulate(warm_bo.y_observed))

        # Cold-started
        print("Cold-started:")
        cold_bo = WarmStartBayesianOptimization(config_space, [])
        _, cold_best = cold_bo.optimize(objective_func, n_iterations)
        cold_curves.append(np.maximum.accumulate(cold_bo.y_observed))

    # Aggregate results
    warm_mean = np.mean(warm_curves, axis=0)
    cold_mean = np.mean(cold_curves, axis=0)

    print("=== Results ===")
    print(f"Warm-start reaches 90% of final in ~{np.argmax(warm_mean > 0.9 * warm_mean[-1])} evals")
    print(f"Cold-start reaches 90% of final in ~{np.argmax(cold_mean > 0.9 * cold_mean[-1])} evals")

    return warm_curves, cold_curves
```

SMAC (used by Auto-sklearn) implements warm-starting by treating transferred configurations as the initial design. After evaluating these, SMAC's random forest surrogate is fitted on all observations (including warm-start evaluations) and acquisition-guided search begins. The warm-start configurations influence the surrogate through their actual evaluated performance.
A more sophisticated approach: modify the acquisition function itself to incorporate transfer learning. Rather than just warm-starting with configurations, we transfer knowledge at the model level.
Transfer Acquisition Function (TAF):
The idea is to combine the acquisition signal from surrogate models fitted on source tasks with the acquisition from the target task's own surrogate.
Formulation:
α_transfer(x) = w · α_source(x) + (1-w) · α_target(x)
Where α_source(x) is the (aggregated) acquisition value under the source-task models, α_target(x) is the acquisition under the target model, and w ∈ [0, 1] controls how much the source tasks are trusted—typically decaying as target observations accumulate.
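The weighted combination above can be sketched in a few lines, with a weight that decays as target observations accumulate (the `transfer_acquisition` helper and its decay schedule are illustrative choices, not from a specific paper):

```python
import numpy as np

def transfer_acquisition(alpha_source, alpha_target, n_target_obs, decay=5.0):
    """Blend source and target acquisition values.

    w starts near 1 (trust the source tasks) and decays toward 0
    as target observations accumulate, shifting trust to the
    target model. The schedule w = decay / (decay + n) is one
    simple illustrative choice.
    """
    w = decay / (decay + n_target_obs)
    return w * np.asarray(alpha_source) + (1 - w) * np.asarray(alpha_target)

a_src = np.array([0.8, 0.1, 0.3])   # source tasks favor candidate 0
a_tgt = np.array([0.2, 0.6, 0.3])   # target evidence favors candidate 1

print(transfer_acquisition(a_src, a_tgt, n_target_obs=0))   # dominated by source
print(transfer_acquisition(a_src, a_tgt, n_target_obs=45))  # dominated by target
```

Early in the run the blend picks the candidate the source tasks like; once enough target data exists, the same formula follows the target model instead.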
The RGPE Approach (Ranking-Weighted Gaussian Process Ensemble):
A principled way to combine multiple source models: weight each model by how well it predicts the ranking of the target observations collected so far.
```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from scipy.stats import norm
from typing import List, Dict, Callable, Tuple

class RankingWeightedGPEnsemble:
    """
    RGPE: Ranking-Weighted Gaussian Process Ensemble.

    Combines surrogate models from source tasks with a target task model,
    weighting by how well source models predict target observations.

    Reference: Feurer et al., "Scalable Meta-Learning for Bayesian Optimization"
    """

    def __init__(self, source_models: List[GaussianProcessRegressor] = None):
        """
        Parameters:
            source_models: Pre-trained GP models from source tasks
        """
        self.source_models = source_models or []

        # Target model (trained as we collect data)
        kernel = Matern(nu=2.5)
        self.target_model = GaussianProcessRegressor(kernel=kernel, normalize_y=True)

        # Ensemble weights (uniform until we have target data)
        self.weights = np.ones(len(self.source_models) + 1)
        self.weights /= self.weights.sum()

        self.X_target = []
        self.y_target = []

    def update(self, x: np.ndarray, y: float):
        """
        Add a new observation from the target task.

        Parameters:
            x: Configuration (as vector)
            y: Observed performance
        """
        self.X_target.append(x)
        self.y_target.append(y)

        # Refit target model if enough data
        if len(self.X_target) >= 3:
            X = np.array(self.X_target)
            y_arr = np.array(self.y_target)
            self.target_model.fit(X, y_arr)

            # Update weights based on ranking agreement
            self._update_weights()

    def _update_weights(self):
        """
        Update ensemble weights based on how well each model
        predicts the ranking of target observations.
        """
        if len(self.y_target) < 3:
            return

        X = np.array(self.X_target)
        y_true = np.array(self.y_target)
        true_ranking = np.argsort(np.argsort(-y_true))  # Higher is better

        weights = []

        # Score each source model
        for source_model in self.source_models:
            try:
                y_pred = source_model.predict(X)
                pred_ranking = np.argsort(np.argsort(-y_pred))
                # Spearman-like correlation
                rank_diff = true_ranking - pred_ranking
                score = 1.0 / (1.0 + np.mean(rank_diff ** 2) + 1e-10)
                weights.append(score)
            except Exception:
                weights.append(0.1)  # Low weight for failed predictions

        # Score target model
        if len(self.X_target) >= 5:
            y_pred = self.target_model.predict(X)
            pred_ranking = np.argsort(np.argsort(-y_pred))
            rank_diff = true_ranking - pred_ranking
            target_score = 1.0 / (1.0 + np.mean(rank_diff ** 2) + 1e-10)
        else:
            target_score = 0.5  # Moderate weight until we have enough data
        weights.append(target_score)

        # Normalize
        self.weights = np.array(weights)
        self.weights /= self.weights.sum()

    def predict(self, X: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
        """
        Ensemble prediction for configurations.

        Returns:
            (mean, std) predictions
        """
        if len(X.shape) == 1:
            X = X.reshape(1, -1)

        predictions = []
        variances = []

        # Source model predictions
        for i, source_model in enumerate(self.source_models):
            try:
                mu, std = source_model.predict(X, return_std=True)
                predictions.append(mu * self.weights[i])
                variances.append((std ** 2) * (self.weights[i] ** 2))
            except Exception:
                predictions.append(np.zeros(len(X)))
                variances.append(np.ones(len(X)) * 0.1)

        # Target model prediction
        if len(self.X_target) >= 3:
            mu, std = self.target_model.predict(X, return_std=True)
            predictions.append(mu * self.weights[-1])
            variances.append((std ** 2) * (self.weights[-1] ** 2))
        else:
            predictions.append(np.zeros(len(X)))
            variances.append(np.ones(len(X)) * 0.5)

        # Combine
        ensemble_mean = np.sum(predictions, axis=0)
        ensemble_var = np.sum(variances, axis=0)
        ensemble_std = np.sqrt(ensemble_var)

        return ensemble_mean, ensemble_std

    def acquisition(self, X: np.ndarray, xi: float = 0.01) -> np.ndarray:
        """
        Expected Improvement using ensemble predictions.
        """
        mu, sigma = self.predict(X)
        y_best = max(self.y_target) if self.y_target else 0

        with np.errstate(divide='ignore', invalid='ignore'):
            z = (mu - y_best - xi) / sigma
            ei = (mu - y_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)
            ei[sigma < 1e-10] = 0.0

        return ei

def build_source_models(meta_database, config_vectorizer) -> List[GaussianProcessRegressor]:
    """
    Build source GP models from meta-database.

    Each source model is trained on one past task's data.
    """
    source_models = []

    for dataset_id, experiments in meta_database.items():
        if len(experiments) >= 10:  # Need enough data
            X = np.array([config_vectorizer(exp['config']) for exp in experiments])
            y = np.array([exp['performance'] for exp in experiments])

            kernel = Matern(nu=2.5)
            gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
            try:
                gp.fit(X, y)
                source_models.append(gp)
            except Exception:
                continue

    return source_models
```

Advantages of Transfer Acquisition Functions: knowledge transfers at the model level, so source tasks inform the search everywhere in the space (not just at transferred configurations), and the ranking-based weights automatically down-weight source tasks that turn out to be dissimilar.
Challenges: maintaining and querying many source surrogates adds computational overhead, the ensemble adds implementation complexity, and badly matched source tasks can mislead the search until their weights decay.
When to Use: when the meta-database contains dense evaluation histories for many related tasks. With only a few sparsely evaluated source tasks, simple configuration transfer is usually as effective and far cheaper.
Multi-fidelity optimization evaluates configurations at multiple resource levels (epochs, data subset, etc.). Warm starting in this context means seeding the low-fidelity rungs with transferred configurations so that promising candidates are fast-tracked to higher budgets.
Hyperband with Warm Starting:
Hyperband successively halves configurations, giving more resources to promising ones. A warm-started variant seeds the first bracket with transferred configurations instead of purely random samples.
BOHB (Bayesian Optimization and Hyperband):
Combines Bayesian optimization with Hyperband. Warm-started BOHB additionally initializes its surrogate model with transferred configurations and their expected performances from meta-learning.
```python
import numpy as np
from typing import List, Dict, Callable, Tuple, Any

class WarmStartHyperband:
    """
    Hyperband with warm-starting from transferred configurations.

    Transferred configurations are given priority in the initial brackets,
    potentially fast-tracking them to full evaluation.
    """

    def __init__(self, config_space,
                 min_budget: float = 1.0,
                 max_budget: float = 81.0,
                 eta: int = 3,
                 warm_start_configs: List[Dict] = None):
        """
        Parameters:
            config_space: Configuration space
            min_budget: Minimum resource (e.g., epochs)
            max_budget: Maximum resource
            eta: Reduction factor between successive halving rounds
            warm_start_configs: Configurations to prioritize
        """
        self.config_space = config_space
        self.min_budget = min_budget
        self.max_budget = max_budget
        self.eta = eta
        self.warm_start_configs = warm_start_configs or []

        # Compute number of rungs
        self.s_max = int(np.log(max_budget / min_budget) / np.log(eta))

    def _get_hyperband_schedule(self) -> List[List[Tuple[int, float]]]:
        """
        Compute the Hyperband bracket schedule.

        Returns:
            List of brackets, each bracket is list of (n_configs, budget) tuples
        """
        brackets = []
        for s in range(self.s_max, -1, -1):
            n = int(np.ceil((self.s_max + 1) / (s + 1) * (self.eta ** s)))
            budget = self.max_budget / (self.eta ** s)

            bracket = []
            for i in range(s + 1):
                n_i = int(n / (self.eta ** i))
                budget_i = budget * (self.eta ** i)
                bracket.append((max(n_i, 1), min(budget_i, self.max_budget)))
            brackets.append(bracket)

        return brackets

    def _sample_configs(self, n: int, include_warm_start: bool = True) -> List[Dict]:
        """
        Sample n configurations, prioritizing warm-start configs.
        """
        configs = []

        # Include warm-start configs first
        if include_warm_start:
            for config in self.warm_start_configs[:n]:
                configs.append(config)

        # Fill remaining with random samples
        while len(configs) < n:
            configs.append(dict(self.config_space.sample_configuration()))

        return configs

    def run(self, evaluate_fn: Callable[[Dict, float], float]) -> Tuple[Dict, float]:
        """
        Run warm-started Hyperband.

        Parameters:
            evaluate_fn: Function that takes (config, budget) -> performance

        Returns:
            (best_config, best_performance)
        """
        brackets = self._get_hyperband_schedule()
        best_config = None
        best_perf = float('-inf')

        # Track which warm-start configs we've used
        warm_start_used = 0

        for bracket_idx, bracket in enumerate(brackets):
            print(f"=== Bracket {bracket_idx + 1}/{len(brackets)} ===")

            # Initial configs for this bracket
            n_configs_initial = bracket[0][0]

            # For first bracket, heavily favor warm-start configs
            if bracket_idx == 0:
                configs = self._sample_configs(n_configs_initial,
                                               include_warm_start=True)
                warm_start_used = min(len(self.warm_start_configs), n_configs_initial)
                print(f"Including {warm_start_used} warm-start configs")
            else:
                configs = self._sample_configs(n_configs_initial,
                                               include_warm_start=False)

            # Run successive halving rounds
            for rung_idx, (n_configs, budget) in enumerate(bracket):
                print(f"  Rung {rung_idx + 1}: {len(configs)} configs at budget {budget:.1f}")

                # Evaluate all configs at this budget
                results = []
                for config in configs:
                    perf = evaluate_fn(config, budget)
                    results.append((config, perf))
                    if perf > best_perf:
                        best_perf = perf
                        best_config = config

                # Select top 1/eta for next rung
                if rung_idx < len(bracket) - 1:
                    results.sort(key=lambda x: x[1], reverse=True)
                    n_keep = max(1, len(results) // self.eta)
                    configs = [r[0] for r in results[:n_keep]]

        return best_config, best_perf

class WarmStartBOHB:
    """
    BOHB (Bayesian Optimization and Hyperband) with warm starting.

    Combines transfer learning with multi-fidelity evaluation
    for efficient hyperparameter optimization.
    """

    def __init__(self, config_space,
                 min_budget: float = 1.0,
                 max_budget: float = 81.0,
                 warm_start_configs: List[Dict] = None,
                 warm_start_performances: List[float] = None):
        """
        Parameters:
            config_space: Configuration space
            min_budget: Minimum resource
            max_budget: Maximum resource
            warm_start_configs: Transferred configurations
            warm_start_performances: Expected performances (from meta-learning)
        """
        self.config_space = config_space
        self.min_budget = min_budget
        self.max_budget = max_budget
        self.warm_start_configs = warm_start_configs or []
        self.warm_start_performances = warm_start_performances or []

        # Build a simple model from warm-start data for acquisition
        self._initialize_model()

    def _initialize_model(self):
        """
        Initialize the surrogate model with warm-start data.
        """
        from sklearn.ensemble import RandomForestRegressor
        self.model = RandomForestRegressor(n_estimators=10, random_state=42)

        # If we have warm-start data, fit initial model
        if len(self.warm_start_configs) > 0 and len(self.warm_start_performances) > 0:
            X = np.array([self._config_to_vector(c) for c in self.warm_start_configs])
            y = np.array(self.warm_start_performances)
            self.model.fit(X, y)
            self.model_fitted = True
        else:
            self.model_fitted = False

    def _config_to_vector(self, config: Dict) -> np.ndarray:
        """Convert config dict to numerical vector."""
        return np.array(list(config.values()), dtype=float)

    def suggest_configuration(self) -> Dict:
        """
        Suggest next configuration to evaluate.

        Uses transfer-informed acquisition if model is fitted,
        otherwise samples randomly (with warm-start priority).
        """
        # First, exhaust warm-start configs
        if hasattr(self, '_warm_start_idx'):
            self._warm_start_idx += 1
        else:
            self._warm_start_idx = 0

        if self._warm_start_idx < len(self.warm_start_configs):
            return self.warm_start_configs[self._warm_start_idx]

        # Then use model-based suggestion if available
        if self.model_fitted:
            # Sample candidates and pick best by model
            candidates = [dict(self.config_space.sample_configuration())
                          for _ in range(100)]
            X_cand = np.array([self._config_to_vector(c) for c in candidates])
            predictions = self.model.predict(X_cand)
            best_idx = np.argmax(predictions)
            return candidates[best_idx]

        # Fallback to random
        return dict(self.config_space.sample_configuration())
```

Advanced multi-fidelity transfer: predict entire learning curves from partial training. If similar tasks show that LightGBM converges in 50 iterations but neural networks need 500, we can allocate resources accordingly. Learning curve databases enable this.
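The learning-curve idea in the note above can be sketched with a simple parametric fit: observe the first iterations of training, fit a saturating power law, and extrapolate to a larger budget. The `fit_power_curve` helper and its grid-search fitting strategy are illustrative assumptions, not a production curve model:

```python
import numpy as np

def fit_power_curve(iters, scores):
    """Fit y = a - b * t^(-c) to a partial learning curve by
    grid-searching the exponent c and solving the remaining
    linear system for (a, b). Pure-numpy sketch."""
    t = np.asarray(iters, dtype=float)
    y = np.asarray(scores, dtype=float)
    best = None
    for c in np.linspace(0.05, 2.0, 40):
        # For fixed c, the model is linear in (a, b)
        X = np.column_stack([np.ones_like(t), -t ** -c])
        (a, b), _, _, _ = np.linalg.lstsq(X, y, rcond=None)
        err = np.sum((X @ np.array([a, b]) - y) ** 2)
        if best is None or err < best[0]:
            best = (err, a, b, c)
    return best[1], best[2], best[3]

# Partial curve: first 20 iterations of a synthetic run that
# asymptotes at 0.90
t = np.arange(1, 21)
y = 0.90 - 0.30 * t ** -0.7

a, b, c = fit_power_curve(t, y)
pred_500 = a - b * 500.0 ** -c  # Extrapolate to a 500-iteration budget
print(round(a, 2))  # → 0.9, the asymptote recovered from partial data
```

With curves like this fitted per algorithm family from the meta-database, a scheduler can cut off configurations whose extrapolated ceiling is below the current incumbent.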
We've explored how warm starting transforms hyperparameter optimization from cold search to informed optimization. Let's consolidate the key takeaways:
What's Next:
The final page of this module explores Portfolio Methods—an alternative to selection where we run a diverse set of algorithms and combine their predictions. Portfolio methods complement selection by providing robustness when selection is uncertain.
You now understand how warm starting accelerates hyperparameter optimization by leveraging meta-learning. This is the practical link between accumulated knowledge and efficient AutoML—starting from informed positions rather than random initialization.