Consider the problem of tuning hyperparameters for a deep neural network. Each evaluation requires training the model, which might take hours or even days. Grid search would require evaluating every point in a predefined grid—potentially thousands of configurations. Random search improves upon this but still operates without any memory: each evaluation is independent, learning nothing from previous attempts.
What if we could use information from past evaluations to guide future ones? What if, after observing that learning rate 0.01 yielded a validation error of 0.15 and learning rate 0.001 yielded 0.12, we could intelligently decide where to look next rather than randomly guessing?
This is the fundamental insight behind Sequential Model-Based Optimization (SMBO), the algorithmic framework that powers Bayesian Optimization. Instead of treating each evaluation as independent, SMBO builds a probabilistic model of the objective function and uses it to make informed decisions about where to evaluate next.
By the end of this page, you will understand the complete SMBO framework: how it builds probabilistic models of objective functions, why this leads to dramatically more efficient optimization, and the key algorithmic components that make Bayesian Optimization work. You'll grasp the theoretical foundations that make SMBO the gold standard for expensive black-box optimization.
Sequential Model-Based Optimization is a principled framework for optimizing expensive black-box functions. The term "black-box" means we can only evaluate the function—we have no access to gradients, no analytical form, and often no understanding of why certain inputs produce certain outputs.
The core SMBO loop:

1. Fit a probabilistic surrogate model to all observations gathered so far.
2. Maximize a cheap acquisition function over the surrogate to select the most promising point.
3. Evaluate the expensive objective at that point.
4. Add the new observation to the dataset and repeat until the budget is exhausted.
This simple loop encapsulates a powerful idea: rather than evaluating the expensive objective function everywhere, we evaluate a cheap surrogate model and use it to identify the most promising locations for the next expensive evaluation.
```python
import numpy as np
from typing import Callable, Tuple, List
from scipy.optimize import minimize


def sequential_model_based_optimization(
    objective_function: Callable,
    bounds: np.ndarray,
    n_initial: int = 5,
    n_iterations: int = 25,
    surrogate_model=None,
    acquisition_function=None
) -> Tuple[np.ndarray, float]:
    """
    Sequential Model-Based Optimization (SMBO) Framework

    The canonical algorithm for Bayesian Optimization, demonstrating
    how surrogate models and acquisition functions work together.

    Parameters:
    -----------
    objective_function : callable
        The expensive black-box function to optimize (minimize).
        Takes a point x and returns a scalar value.
    bounds : np.ndarray, shape (d, 2)
        Lower and upper bounds for each of d dimensions.
    n_initial : int
        Number of initial random evaluations to seed the model.
    n_iterations : int
        Number of SMBO iterations after initialization.
    surrogate_model : object
        Probabilistic model with fit() and predict() methods.
    acquisition_function : callable
        Function that scores candidate points for evaluation.

    Returns:
    --------
    best_x : np.ndarray
        The best input found during optimization.
    best_y : float
        The corresponding objective value.
    """
    # Phase 1: Initialization with random sampling
    # We need initial observations to fit our first surrogate model
    X_observed = sample_initial_points(bounds, n_initial)
    y_observed = np.array([objective_function(x) for x in X_observed])

    print(f"Initialization complete: {n_initial} points evaluated")
    print(f"Best initial value: {np.min(y_observed):.4f}")

    # Phase 2: Sequential optimization loop
    for iteration in range(n_iterations):
        # Step 1: Fit the surrogate model to all observed data
        # The surrogate provides both predictions AND uncertainty estimates
        surrogate_model.fit(X_observed, y_observed)

        # Step 2: Find the point that maximizes the acquisition function
        # This balances exploitation (what looks good) vs exploration (uncertainty)
        x_next = optimize_acquisition(
            acquisition_function,
            surrogate_model,
            bounds,
            y_best=np.min(y_observed)
        )

        # Step 3: Evaluate the expensive objective at the selected point
        y_next = objective_function(x_next)

        # Step 4: Augment our observed dataset
        X_observed = np.vstack([X_observed, x_next])
        y_observed = np.append(y_observed, y_next)

        # Logging for insight into the optimization progress
        print(f"Iteration {iteration + 1}: f(x) = {y_next:.4f}, "
              f"Best so far: {np.min(y_observed):.4f}")

    # Return the best observed configuration
    best_idx = np.argmin(y_observed)
    return X_observed[best_idx], y_observed[best_idx]


def sample_initial_points(bounds: np.ndarray, n_points: int) -> np.ndarray:
    """
    Generate initial points using Latin Hypercube Sampling (LHS).

    LHS provides better coverage of the search space than pure random
    sampling, ensuring we don't accidentally cluster all initial
    points in one region.
    """
    d = bounds.shape[0]
    points = np.zeros((n_points, d))

    for dim in range(d):
        # Create stratified intervals, sample within each, then shuffle
        intervals = np.linspace(0, 1, n_points + 1)
        for i in range(n_points):
            points[i, dim] = np.random.uniform(intervals[i], intervals[i + 1])
        np.random.shuffle(points[:, dim])

    # Scale to actual bounds
    lower, upper = bounds[:, 0], bounds[:, 1]
    return points * (upper - lower) + lower


def optimize_acquisition(
    acquisition_fn: Callable,
    model,
    bounds: np.ndarray,
    y_best: float,
    n_restarts: int = 10
) -> np.ndarray:
    """
    Optimize the acquisition function to find the next evaluation point.

    Since the acquisition function is typically multimodal, we use
    multi-start optimization to avoid local optima.
    """
    d = bounds.shape[0]
    best_x, best_acq = None, -np.inf

    def neg_acquisition(x):
        mu, sigma = model.predict(x.reshape(1, -1))
        # Return a scalar: the acquisition is evaluated at a single point
        return -float(acquisition_fn(mu, sigma, y_best)[0])

    for _ in range(n_restarts):
        # Random starting point
        x0 = np.random.uniform(bounds[:, 0], bounds[:, 1])

        # L-BFGS-B optimization (gradient-based, respects bounds)
        result = minimize(
            neg_acquisition,
            x0,
            bounds=[(bounds[i, 0], bounds[i, 1]) for i in range(d)],
            method='L-BFGS-B'
        )

        if -result.fun > best_acq:
            best_acq = -result.fun
            best_x = result.x

    return best_x
```

The power of SMBO lies in its ability to reuse information. Every single evaluation updates our belief about the entire objective function surface. A single data point (x, y) tells us not just about that point, but constrains what the function can look like nearby due to continuity assumptions in the surrogate model.
At first glance, sequential optimization seems inefficient compared to parallel approaches. If we have 100 CPUs, shouldn't we evaluate 100 points simultaneously? The answer reveals a fundamental insight about optimization under uncertainty.
The Value of Information:
Consider two scenarios:

- Parallel blind search: all 100 evaluations are chosen up front, before any results are seen.
- Sequential search: the 100 evaluations are chosen one at a time, each informed by every outcome observed so far.
In parallel blind search, the 100th evaluation is chosen with exactly the same knowledge as the 1st. In sequential search, the 100th evaluation benefits from information gathered in the first 99 evaluations.
Mathematical formalization:
Let $f(x)$ be the objective function and let $\mathcal{D}_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ be observations after $n$ evaluations. The expected improvement from a new evaluation at $x$ depends on our posterior belief:
$$\text{Value of evaluating } x \propto \mathbb{E}[\text{Improvement} | \mathcal{D}_n, x]$$
In parallel blind search, all evaluations are conditioned on $\mathcal{D}_0$ (the empty set). In sequential search, evaluation $k$ is conditioned on $\mathcal{D}_{k-1}$, which contains vastly more information.
| Aspect | Parallel Blind Search | Sequential SMBO |
|---|---|---|
| Information utilization | Zero—each point chosen independently | Maximum—each point informed by all prior observations |
| Sample efficiency | Low—many evaluations wasted in poor regions | High—evaluations concentrate in promising regions |
| Convergence rate | O(1/√n) for random search | Can achieve exponential convergence for smooth functions |
| Wall-clock time (100 evaluations, 100 CPUs) | 1 × evaluation time | 100 × evaluation time (naive sequential) |
| Best for | Cheap evaluations, massive parallelism | Expensive evaluations, limited budget |
The fundamental tradeoff:
SMBO trades parallelism for sample efficiency. When each function evaluation is cheap (milliseconds), parallel random search wins on wall-clock time. But when each evaluation is expensive (hours to days), the sample efficiency of SMBO becomes critical.
Hyperparameter optimization is the ideal use case:

- Each evaluation means training a model, which can take hours or days.
- Realistic budgets are tens to hundreds of evaluations, not thousands.
- The search space is moderate-dimensional, typically a handful to a few dozen hyperparameters.
In this regime, using evaluations wisely (SMBO) dramatically outperforms using evaluations quickly but wastefully (random search with parallelism).
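To make the sample-efficiency gap concrete, here is a small back-of-the-envelope calculation (an illustration, not from any specific benchmark): for blind random search, the probability that at least one of $n$ independent uniform samples lands in the best $\epsilon$-fraction of the search space is $1 - (1-\epsilon)^n$, no matter how the evaluations are scheduled.

```python
import numpy as np

# Probability that blind random search hits the top-eps fraction of the
# search space at least once in n evaluations: 1 - (1 - eps)^n.
# Parallelism changes wall-clock time, not this curve; only more
# samples (or smarter samples) do.
def p_hit(eps: float, n: int) -> float:
    return 1.0 - (1.0 - eps) ** n

for eps in [0.01, 0.001]:
    for n in [10, 100, 1000]:
        print(f"eps={eps:>6}, n={n:>5}: P(hit) = {p_hit(eps, n):.3f}")

# For eps=0.001 (the top 0.1% region), even 1000 random evaluations
# succeed only about 63% of the time. SMBO instead concentrates later
# evaluations where the surrogate says the optimum is likely.
```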
Modern SMBO implementations support batch-parallel optimization. Techniques like the q-EI (q-Expected Improvement) acquisition function select multiple points to evaluate simultaneously while accounting for the pending evaluations. This combines the best of both worlds for moderately parallel settings.
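As a minimal sketch of one such batch strategy, the "constant liar" heuristic of Ginsbourger et al.: points are selected greedily, and each pending point is temporarily assigned a made-up objective value (here the current best) so the surrogate's uncertainty collapses around it and the next selection is pushed elsewhere. The code below assumes the `BayesianOptimizer` interface defined later on this page; it is an illustration, not the q-EI algorithm itself.

```python
import numpy as np

def suggest_batch_constant_liar(optimizer, q: int) -> np.ndarray:
    """Greedily select q points to evaluate in parallel.

    Sketch of the 'constant liar' heuristic, assuming the
    BayesianOptimizer defined later on this page (suggest_next /
    observe / X_observed / y_observed / history) and that the
    initialization phase is already complete.
    """
    batch = []
    lie = min(optimizer.y_observed)     # the "constant lie" value
    n_real = len(optimizer.y_observed)  # number of genuine observations

    for _ in range(q):
        x = optimizer.suggest_next()
        batch.append(x)
        # Fake observation steers subsequent selections away from x
        optimizer.observe(x, lie)

    # Discard the fake observations before the real evaluations arrive
    optimizer.X_observed = optimizer.X_observed[:n_real]
    optimizer.y_observed = optimizer.y_observed[:n_real]
    optimizer.history = optimizer.history[:n_real]

    return np.array(batch)
```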
The heart of SMBO is the surrogate model (also called response surface model). This is a probabilistic model that approximates the true objective function based on observed data. Unlike the true objective, evaluating the surrogate is essentially free.
Key requirements for a good surrogate:

- Cheap to evaluate: predictions must cost orders of magnitude less than the true objective.
- Probabilistic: it must quantify its own uncertainty, not just output a point prediction.
- Data-efficient: it must produce useful predictions from only a handful of observations.
Why uncertainty is essential:
A deterministic surrogate (like a random forest without uncertainty quantification) can only tell us: "Point A has predicted value 0.5". A probabilistic surrogate tells us: "Point A has predicted value 0.5 ± 0.2 with 95% confidence."
This uncertainty is what enables intelligent exploration. High uncertainty regions are places where the model "doesn't know" the objective value—and therefore might contain the optimal solution.
```python
import numpy as np
from abc import ABC, abstractmethod
from typing import Tuple


class SurrogateModel(ABC):
    """
    Abstract base class defining the interface for SMBO surrogate models.

    Any surrogate model used in Bayesian Optimization must implement
    this interface, providing both predictions and uncertainty estimates.
    """

    @abstractmethod
    def fit(self, X: np.ndarray, y: np.ndarray) -> None:
        """
        Fit the surrogate model to observed data.

        Parameters:
        -----------
        X : np.ndarray, shape (n_samples, n_features)
            Input configurations where the objective was evaluated.
        y : np.ndarray, shape (n_samples,)
            Observed objective values.
        """
        pass

    @abstractmethod
    def predict(self, X: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
        """
        Predict objective values with uncertainty at new points.

        Parameters:
        -----------
        X : np.ndarray, shape (n_samples, n_features)
            Points to make predictions at.

        Returns:
        --------
        mean : np.ndarray, shape (n_samples,)
            Predicted mean objective value at each point.
        std : np.ndarray, shape (n_samples,)
            Predicted standard deviation (uncertainty) at each point.

        Note:
        -----
        The standard deviation is CRITICAL for Bayesian Optimization.
        It represents epistemic uncertainty (uncertainty due to lack
        of data) and drives the exploration-exploitation tradeoff.
        """
        pass

    def update(self, X_new: np.ndarray, y_new: np.ndarray) -> None:
        """
        Incrementally update the model with new observations.

        Default implementation simply refits from scratch. Efficient
        implementations may support true incremental updates.
        """
        # Combine with existing data (if stored) and refit
        self.fit(
            np.vstack([self._X_train, X_new]),
            np.concatenate([self._y_train, y_new])
        )


class DummySurrogate(SurrogateModel):
    """
    Illustrative example: a simple surrogate using distance-weighted
    interpolation with distance-based uncertainty.

    NOT recommended for production—use Gaussian Processes or TPE instead.
    This demonstrates the interface and the importance of uncertainty.
    """

    def __init__(self, length_scale: float = 1.0):
        self.length_scale = length_scale
        self._X_train = None
        self._y_train = None

    def fit(self, X: np.ndarray, y: np.ndarray) -> None:
        self._X_train = X.copy()
        self._y_train = y.copy()

    def predict(self, X: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
        n_test = X.shape[0]
        means = np.zeros(n_test)
        stds = np.zeros(n_test)

        for i in range(n_test):
            # Compute distances to all training points
            distances = np.linalg.norm(self._X_train - X[i], axis=1)

            # Inverse distance weighting for mean prediction
            weights = np.exp(-distances / self.length_scale)
            weights /= weights.sum()
            means[i] = np.dot(weights, self._y_train)

            # Uncertainty increases with distance to nearest neighbor
            min_distance = distances.min()
            stds[i] = 0.1 + min_distance / self.length_scale

        return means, stds


# Visualization of what a surrogate model captures
def visualize_surrogate_1d(surrogate, X_train, y_train, bounds, true_function=None):
    """
    Visualize a 1D surrogate model's predictions and uncertainty.

    This is extremely useful for building intuition about how
    surrogate models work in Bayesian Optimization.
    """
    import matplotlib.pyplot as plt

    # Dense grid for visualization
    X_test = np.linspace(bounds[0], bounds[1], 200).reshape(-1, 1)

    # Fit and predict
    surrogate.fit(X_train.reshape(-1, 1), y_train)
    mean, std = surrogate.predict(X_test)

    # Plot
    fig, ax = plt.subplots(figsize=(12, 6))

    # True function (if known)
    if true_function is not None:
        y_true = np.array([true_function(x) for x in X_test.flatten()])
        ax.plot(X_test, y_true, 'k--', label='True function', alpha=0.7)

    # Surrogate predictions
    ax.plot(X_test, mean, 'b-', label='Surrogate mean', linewidth=2)

    # Uncertainty bands (±1 and ±2 std)
    ax.fill_between(X_test.flatten(), mean - 2*std, mean + 2*std,
                    alpha=0.2, color='blue', label='95% CI')
    ax.fill_between(X_test.flatten(), mean - std, mean + std,
                    alpha=0.3, color='blue', label='68% CI')

    # Training points
    ax.scatter(X_train, y_train, c='red', s=100, zorder=5,
               label='Observations', edgecolors='black')

    ax.set_xlabel('x', fontsize=12)
    ax.set_ylabel('f(x)', fontsize=12)
    ax.set_title('Surrogate Model: Predictions with Uncertainty', fontsize=14)
    ax.legend()
    ax.grid(True, alpha=0.3)

    return fig
```

A common misconception is that we want to build an accurate surrogate of the entire objective function. In fact, we only need the surrogate to be accurate near the optimum. It's perfectly fine if the surrogate is inaccurate in clearly suboptimal regions—we're optimizing, not learning the function everywhere.
Given a surrogate model that provides predictions and uncertainty, how do we decide where to evaluate next? This is the role of the acquisition function (also called selection criterion or infill criterion).
The acquisition function encodes a strategy for trading off exploitation (evaluating where the predicted value is good) versus exploration (evaluating where uncertainty is high).
Why both are necessary:
Pure exploitation (always evaluate where predicted value is best): Gets stuck in local optima. If our initial data happens to miss the global optimum, we never discover it.
Pure exploration (always evaluate where uncertainty is highest): Wastes evaluations learning about clearly suboptimal regions. Eventually covers the whole space but doesn't focus on optimization.
The acquisition function is cheap to optimize:
Since the acquisition function is computed from the surrogate (which is cheap to evaluate), we can use standard optimization techniques (gradient descent, multi-start local search, evolutionary algorithms) to find its maximum. This additional optimization costs seconds, not hours.
The balance depends on context:

- Early in a run, with few observations, exploration is more valuable because the surrogate is unreliable everywhere.
- Late in a run, or with little remaining budget, exploitation pays off more.
- Noisy or highly multimodal objectives warrant extra exploration.
Good acquisition functions naturally adapt this balance. Expected Improvement (EI), for instance, automatically explores more when uncertainty is high and exploits more when the predicted value is low.
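As a quick numeric sketch of this adaptivity (using the same closed-form EI implemented in the block below, with illustrative numbers): compare a candidate whose prediction is slightly worse than the incumbent but nearly certain against one equally bad on average but highly uncertain.

```python
import numpy as np
from scipy.stats import norm

def ei(mu, sigma, y_best, xi=0.01):
    # Closed-form Expected Improvement under a Gaussian surrogate
    sigma = np.maximum(sigma, 1e-9)
    z = (y_best - mu - xi) / sigma
    return (y_best - mu - xi) * norm.cdf(z) + sigma * norm.pdf(z)

y_best = 0.12  # best validation error observed so far (illustrative)

# Candidate A: mean 0.13, almost no uncertainty
# Candidate B: mean 0.13, large uncertainty
mu = np.array([0.13, 0.13])
sigma = np.array([0.001, 0.05])

print(ei(mu, sigma, y_best))
# EI(A) is essentially 0: there is almost no chance of improving.
# EI(B) is about 0.012: the uncertainty leaves real probability mass
# below y_best, so EI favors exploring B despite the equal mean.
```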
```python
import numpy as np
from scipy.stats import norm


def expected_improvement(mu: np.ndarray, sigma: np.ndarray,
                         y_best: float, xi: float = 0.01) -> np.ndarray:
    """
    Expected Improvement (EI) acquisition function.

    The most popular choice for Bayesian Optimization.
    Balances exploitation and exploration automatically.

    EI(x) = E[max(0, y_best - f(x))]

    Computed in closed form under a Gaussian surrogate:

    EI(x) = (y_best - mu(x) - xi) * Φ(Z) + sigma(x) * φ(Z)
    where Z = (y_best - mu(x) - xi) / sigma(x)

    Parameters:
    -----------
    mu : predicted mean at each point
    sigma : predicted std at each point
    y_best : best objective value observed so far
    xi : exploration-exploitation tradeoff parameter
         (higher = more exploration)
    """
    # Avoid division by zero
    sigma = np.maximum(sigma, 1e-9)

    # Standardized improvement
    Z = (y_best - mu - xi) / sigma

    # Expected Improvement formula (closed-form for Gaussian)
    ei = (y_best - mu - xi) * norm.cdf(Z) + sigma * norm.pdf(Z)

    # EI is zero where sigma is effectively zero (already evaluated)
    ei[sigma <= 1e-9] = 0.0

    return ei


def probability_of_improvement(mu: np.ndarray, sigma: np.ndarray,
                               y_best: float, xi: float = 0.0) -> np.ndarray:
    """
    Probability of Improvement (PI) acquisition function.

    A simpler acquisition function that measures the probability
    that a point will improve upon the current best.

    PI(x) = P(f(x) < y_best - xi) = Φ((y_best - mu(x) - xi) / sigma(x))

    Note: PI can be too exploitative because it ignores the MAGNITUDE
    of potential improvement. A small certain improvement beats a
    potentially large uncertain improvement.
    """
    sigma = np.maximum(sigma, 1e-9)
    Z = (y_best - mu - xi) / sigma
    return norm.cdf(Z)


def lower_confidence_bound(mu: np.ndarray, sigma: np.ndarray,
                           kappa: float = 2.0) -> np.ndarray:
    """
    Lower Confidence Bound (LCB) acquisition function.

    Also called GP-LCB. Optimistic in the face of uncertainty.

    LCB(x) = mu(x) - kappa * sigma(x)

    We MINIMIZE LCB, which encourages evaluating points that either:
    - Have low predicted mean (exploitation), OR
    - Have high uncertainty (exploration)

    The parameter kappa controls the balance:
    - kappa = 0: Pure exploitation (greedy)
    - kappa → ∞: Pure exploration (uncertainty sampling)
    - kappa ≈ 2: Common balanced choice
    """
    return mu - kappa * sigma


def thompson_sampling(model, X_candidates: np.ndarray,
                      n_samples: int = 1) -> np.ndarray:
    """
    Thompson Sampling for Bayesian Optimization.

    Instead of computing an acquisition function explicitly, we:
    1. Draw a sample from the posterior (a random function consistent
       with the data)
    2. Find the minimum of this sampled function
    3. Evaluate the objective there

    This naturally balances exploration and exploitation through
    posterior randomness.
    """
    # Get posterior mean and std at the candidate points
    mu, sigma = model.predict(X_candidates)

    # For simplicity, sample independently (ignores correlation).
    # Full Thompson sampling would sample from the joint posterior.
    samples = np.random.normal(mu, sigma)

    # Return the candidate with the lowest sampled value
    best_idx = np.argmin(samples)
    return X_candidates[best_idx]


# Visualization of acquisition functions
def visualize_acquisition_landscape_1d(model, X_train, y_train, bounds):
    """
    Show how different acquisition functions select the next point.
    """
    import matplotlib.pyplot as plt

    X_test = np.linspace(bounds[0], bounds[1], 200).reshape(-1, 1)
    model.fit(X_train.reshape(-1, 1), y_train)
    mu, sigma = model.predict(X_test)
    y_best = y_train.min()

    # Compute acquisition functions
    ei = expected_improvement(mu, sigma, y_best)
    pi = probability_of_improvement(mu, sigma, y_best)
    lcb = lower_confidence_bound(mu, sigma, kappa=2.0)

    fig, axes = plt.subplots(4, 1, figsize=(12, 12), sharex=True)

    # Surrogate
    axes[0].plot(X_test, mu, 'b-', linewidth=2)
    axes[0].fill_between(X_test.flatten(), mu - 2*sigma, mu + 2*sigma, alpha=0.2)
    axes[0].scatter(X_train, y_train, c='red', s=100, zorder=5)
    axes[0].axhline(y_best, color='green', linestyle='--', label='y_best')
    axes[0].set_ylabel('f(x)')
    axes[0].set_title('Surrogate Model')
    axes[0].legend()

    # EI
    axes[1].plot(X_test, ei, 'g-', linewidth=2)
    axes[1].axvline(X_test[np.argmax(ei)], color='red', linestyle='--')
    axes[1].set_ylabel('EI(x)')
    axes[1].set_title(f'Expected Improvement (next: x={X_test[np.argmax(ei)][0]:.2f})')

    # PI
    axes[2].plot(X_test, pi, 'm-', linewidth=2)
    axes[2].axvline(X_test[np.argmax(pi)], color='red', linestyle='--')
    axes[2].set_ylabel('PI(x)')
    axes[2].set_title(f'Probability of Improvement (next: x={X_test[np.argmax(pi)][0]:.2f})')

    # LCB (negate for visualization since we minimize)
    axes[3].plot(X_test, -lcb, 'c-', linewidth=2)
    axes[3].axvline(X_test[np.argmin(lcb)], color='red', linestyle='--')
    axes[3].set_ylabel('-LCB(x)')
    axes[3].set_title(f'Lower Confidence Bound (next: x={X_test[np.argmin(lcb)][0]:.2f})')
    axes[3].set_xlabel('x')

    plt.tight_layout()
    return fig
```

Let's synthesize everything into the complete SMBO algorithm, with all components working together. This provides the canonical formulation that underlies practically all Bayesian Optimization implementations.
Algorithm: Sequential Model-Based Optimization
Input:

- Expensive objective $f$, search-space bounds, and an evaluation budget $N$
- A surrogate model, an acquisition function $\alpha$, and a number of initial points $n_0$

Output:

- The best configuration $x^+$ found and its objective value $f(x^+)$

Procedure:

1. Evaluate $f$ at $n_0$ space-filling initial points (e.g., via Latin Hypercube Sampling).
2. Repeat until the budget is exhausted: fit the surrogate to all observations, select $x_{\text{next}} = \arg\max_x \alpha(x)$, evaluate $y_{\text{next}} = f(x_{\text{next}})$, and append $(x_{\text{next}}, y_{\text{next}})$ to the dataset.
3. Return the best observation.
```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, ConstantKernel
from scipy.stats import norm
from scipy.optimize import minimize
from typing import Callable, Tuple, Dict, List, Optional
import warnings
warnings.filterwarnings('ignore')  # Suppress GP fitting warnings for a cleaner demo


class BayesianOptimizer:
    """
    Complete Bayesian Optimization implementation using the SMBO framework.

    A self-contained implementation that demonstrates all the core
    concepts of Sequential Model-Based Optimization.
    """

    def __init__(
        self,
        bounds: np.ndarray,
        n_initial: int = 5,
        acquisition: str = 'ei',
        xi: float = 0.01,
        kappa: float = 2.0,
        random_state: Optional[int] = None
    ):
        """
        Initialize the Bayesian Optimizer.

        Parameters:
        -----------
        bounds : np.ndarray, shape (d, 2)
            Lower and upper bounds for each dimension
        n_initial : int
            Number of initial random samples
        acquisition : str
            Choice of acquisition function: 'ei', 'pi', 'lcb'
        xi : float
            Exploration parameter for EI and PI
        kappa : float
            Exploration parameter for LCB
        random_state : int
            Seed for reproducibility
        """
        self.bounds = np.array(bounds)
        self.n_dims = self.bounds.shape[0]
        self.n_initial = n_initial
        self.acquisition = acquisition
        self.xi = xi
        self.kappa = kappa
        self.rng = np.random.RandomState(random_state)

        # Initialize Gaussian Process surrogate
        # Matern kernel is a robust default choice
        kernel = ConstantKernel(1.0) * Matern(
            length_scale=np.ones(self.n_dims),
            length_scale_bounds=(1e-3, 1e3),
            nu=2.5  # Twice differentiable (smooth)
        )
        self.gp = GaussianProcessRegressor(
            kernel=kernel,
            alpha=1e-6,  # Noise term for numerical stability
            normalize_y=True,
            n_restarts_optimizer=10,
            random_state=random_state
        )

        # Storage for observations
        self.X_observed: List[np.ndarray] = []
        self.y_observed: List[float] = []
        self.history: List[Dict] = []

    def _latin_hypercube_sample(self, n_samples: int) -> np.ndarray:
        """Generate Latin Hypercube Samples within bounds."""
        samples = np.zeros((n_samples, self.n_dims))

        for dim in range(self.n_dims):
            # Create n_samples intervals of equal probability
            intervals = np.linspace(0, 1, n_samples + 1)
            # Sample uniformly within each interval
            points = np.array([
                self.rng.uniform(intervals[i], intervals[i + 1])
                for i in range(n_samples)
            ])
            # Randomly permute to break correlation between dimensions
            self.rng.shuffle(points)
            samples[:, dim] = points

        # Scale to actual bounds
        lower, upper = self.bounds[:, 0], self.bounds[:, 1]
        return samples * (upper - lower) + lower

    def _compute_acquisition(self, X: np.ndarray) -> np.ndarray:
        """Compute acquisition function values at given points."""
        mu, sigma = self.gp.predict(X, return_std=True)

        if len(self.y_observed) == 0:
            return sigma  # Pure exploration before any data

        y_best = np.min(self.y_observed)

        if self.acquisition == 'ei':
            return self._expected_improvement(mu, sigma, y_best)
        elif self.acquisition == 'pi':
            return self._probability_of_improvement(mu, sigma, y_best)
        elif self.acquisition == 'lcb':
            return -self._lower_confidence_bound(mu, sigma)
        else:
            raise ValueError(f"Unknown acquisition function: {self.acquisition}")

    def _expected_improvement(self, mu, sigma, y_best) -> np.ndarray:
        """Expected Improvement acquisition function."""
        sigma = np.maximum(sigma, 1e-9)
        Z = (y_best - mu - self.xi) / sigma
        ei = (y_best - mu - self.xi) * norm.cdf(Z) + sigma * norm.pdf(Z)
        ei[sigma <= 1e-9] = 0.0
        return ei

    def _probability_of_improvement(self, mu, sigma, y_best) -> np.ndarray:
        """Probability of Improvement acquisition function."""
        sigma = np.maximum(sigma, 1e-9)
        Z = (y_best - mu - self.xi) / sigma
        return norm.cdf(Z)

    def _lower_confidence_bound(self, mu, sigma) -> np.ndarray:
        """Lower Confidence Bound acquisition function."""
        return mu - self.kappa * sigma

    def _optimize_acquisition(self, n_restarts: int = 25) -> np.ndarray:
        """Find the point that maximizes the acquisition function."""
        best_x = None
        best_acq = -np.inf

        # Multi-start optimization
        for _ in range(n_restarts):
            # Random starting point
            x0 = self.rng.uniform(self.bounds[:, 0], self.bounds[:, 1])

            try:
                result = minimize(
                    lambda x: -self._compute_acquisition(x.reshape(1, -1))[0],
                    x0,
                    bounds=list(self.bounds),
                    method='L-BFGS-B'
                )
                if -result.fun > best_acq:
                    best_acq = -result.fun
                    best_x = result.x
            except Exception:
                continue

        # Fallback to random if optimization fails
        if best_x is None:
            best_x = self.rng.uniform(self.bounds[:, 0], self.bounds[:, 1])

        return best_x

    def suggest_next(self) -> np.ndarray:
        """Suggest the next point to evaluate."""
        n_observed = len(self.X_observed)

        if n_observed < self.n_initial:
            # Still in initialization phase
            if n_observed == 0:
                # Generate all initial points at once for LHS
                self._initial_samples = self._latin_hypercube_sample(self.n_initial)
            return self._initial_samples[n_observed]
        else:
            # SMBO phase: fit GP and optimize acquisition
            X = np.array(self.X_observed)
            y = np.array(self.y_observed)

            # Fit the Gaussian Process to all observed data
            self.gp.fit(X, y)

            # Optimize acquisition function
            return self._optimize_acquisition()

    def observe(self, x: np.ndarray, y: float) -> None:
        """Record an observation."""
        self.X_observed.append(x.flatten())
        self.y_observed.append(y)

        # Track history for analysis
        self.history.append({
            'x': x.flatten().copy(),
            'y': y,
            'y_best': min(self.y_observed),
            'iteration': len(self.y_observed)
        })

    def optimize(
        self,
        objective: Callable,
        n_iterations: int = 25,
        verbose: bool = True
    ) -> Tuple[np.ndarray, float]:
        """
        Run the complete Bayesian Optimization loop.

        Parameters:
        -----------
        objective : callable
            Function to minimize. Takes array x and returns scalar.
        n_iterations : int
            Number of SMBO iterations after initialization.
        verbose : bool
            Print progress during optimization.

        Returns:
        --------
        best_x : np.ndarray
            Best configuration found.
        best_y : float
            Best objective value achieved.
        """
        total_evals = self.n_initial + n_iterations

        for i in range(total_evals):
            # Get next point to evaluate
            x_next = self.suggest_next()

            # Evaluate the objective (THE EXPENSIVE STEP)
            y_next = objective(x_next)

            # Record the observation
            self.observe(x_next, y_next)

            if verbose:
                phase = "Init" if i < self.n_initial else "SMBO"
                print(f"[{phase}] Eval {i+1}/{total_evals}: "
                      f"f(x) = {y_next:.6f}, "
                      f"Best = {min(self.y_observed):.6f}")

        # Return best observed
        best_idx = np.argmin(self.y_observed)
        return np.array(self.X_observed[best_idx]), self.y_observed[best_idx]


# Example usage and demonstration
def demonstration():
    """Demonstrate SMBO on the Branin benchmark function."""

    # Branin function: a classic optimization benchmark
    def branin(x):
        x1, x2 = x[0], x[1]
        a, b, c = 1, 5.1 / (4 * np.pi**2), 5 / np.pi
        r, s, t = 6, 10, 1 / (8 * np.pi)
        return a * (x2 - b*x1**2 + c*x1 - r)**2 + s*(1-t)*np.cos(x1) + s

    # Known optima: f* ≈ 0.397887
    # At: (-π, 12.275), (π, 2.275), (9.42478, 2.475)

    # Define search space
    bounds = np.array([
        [-5.0, 10.0],  # x1 bounds
        [0.0, 15.0]    # x2 bounds
    ])

    # Run optimization
    optimizer = BayesianOptimizer(
        bounds=bounds,
        n_initial=5,
        acquisition='ei',
        random_state=42
    )

    best_x, best_y = optimizer.optimize(branin, n_iterations=20, verbose=True)

    print("\nOptimization complete!")
    print(f"Best x found: [{best_x[0]:.4f}, {best_x[1]:.4f}]")
    print(f"Best f(x) found: {best_y:.6f}")
    print("Known optimum: 0.397887")
    print(f"Gap: {abs(best_y - 0.397887):.6f}")

    return optimizer


if __name__ == "__main__":
    demonstration()
```

A natural question is: does SMBO actually work? Can we prove that it will eventually find the optimum? The answer depends on the specific components used, but there are strong theoretical results.
Regret bounds:
The typical way to measure optimization performance is through cumulative regret:
$$R_N = \sum_{t=1}^{N} [f(x_t) - f(x^*)]$$
where $x^*$ is the true optimum. Good optimization algorithms have regret that grows slowly with $N$.
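As a sketch, both simple and cumulative regret can be computed directly from the `history` list kept by the `BayesianOptimizer` implemented above, assuming the true optimum `f_star` is known (as it is for benchmark functions like Branin):

```python
import numpy as np

def regret_curves(history, f_star):
    """Compute simple and cumulative regret from an optimizer's history.

    history : list of dicts with keys 'y' and 'y_best'
              (the format kept by the BayesianOptimizer above)
    f_star  : known optimal value of the objective
    """
    y = np.array([h['y'] for h in history])
    y_best = np.array([h['y_best'] for h in history])

    simple_regret = y_best - f_star            # gap of the best-so-far value
    cumulative_regret = np.cumsum(y - f_star)  # R_N from the formula above
    return simple_regret, cumulative_regret

# Example: after running optimizer.optimize(branin, ...) as in the demo,
# simple, cumulative = regret_curves(optimizer.history, f_star=0.397887)
```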
Key theoretical results:
GP-UCB [Srinivas et al., 2010]: For Gaussian Process surrogates with UCB acquisition, cumulative regret is $O(\sqrt{N \gamma_N})$, where $\gamma_N$ is the maximum information gain. For the squared-exponential kernel in $d$ dimensions, $\gamma_N$ grows only as $O((\log N)^{d+1})$; for Matérn kernels it grows polynomially in $N$, but still slowly enough to yield sublinear regret.
Convergence guarantees: Under mild conditions (Lipschitz-continuous objective, bounded search space, and an acquisition function that never stops exploring, such as EI or UCB), SMBO algorithms are guaranteed to converge to the global optimum as $N \rightarrow \infty$.
Sample efficiency: For smooth functions, SMBO can achieve exponential convergence rates—meaning the gap to the optimum shrinks exponentially with each evaluation. This is dramatically better than random search's polynomial rate.
| Method | Convergence Rate | Assumptions | Practical Impact |
|---|---|---|---|
| Random Search | O(1/√N) | None | Needs ~100× more evaluations for 10× improvement |
| Grid Search | O(1/N^(1/d)) | None | Curse of dimensionality—unusable beyond ~5D |
| SMBO (GP-UCB) | O(√(γ_N/N)) | Function in RKHS | Near-optimal for smooth functions |
| SMBO (GP-EI) | Sublinear | GP model is accurate | Excellent empirical performance |
The theoretical guarantees assume the surrogate model is well-specified (i.e., the true function is a sample from the GP prior). In practice, this is never exactly true. However, SMBO is remarkably robust to model misspecification. Empirically, it outperforms alternatives even when theoretical assumptions are violated.
Why SMBO is particularly effective for hyperparameter optimization:

- Evaluations are extremely expensive, so sample efficiency dominates wall-clock concerns.
- Budgets are small (tens to hundreds of trials), exactly the regime where model-based search shines.
- Hyperparameter response surfaces are typically smooth enough, and effectively low-dimensional enough, for surrogate models to capture with few observations.
While the SMBO framework is elegant, practical implementation requires careful attention to several details:
1. Initialization Strategy
The initial samples seed the surrogate model. Poor initialization can bias the entire optimization:

- Prefer space-filling designs (Latin Hypercube Sampling, Sobol sequences) over pure random sampling.
- Use enough initial points to fit a meaningful model; a common heuristic is on the order of a few points per dimension.
- Seed with known-good configurations from prior experience when available.
2. Surrogate Model Selection
The choice of surrogate model significantly impacts performance:

- Gaussian Processes excel on continuous, low-dimensional spaces, but scale cubically with the number of observations.
- Random forests (as in SMAC) handle categorical and conditional parameters naturally.
- Tree-structured Parzen Estimators (as in Hyperopt) scale well to large, tree-structured search spaces.
3. Acquisition Function Optimization
The acquisition function may have many local optima:

- Use multi-start local optimization (e.g., L-BFGS-B from many random starting points), as in the implementations above.
- Dense random sampling followed by local refinement is a robust alternative.
- A poorly optimized acquisition function silently degrades the whole loop: the points actually evaluated are no longer the most informative ones.
4. Hyperparameters of BO Itself
Bayesian Optimization has its own hyperparameters:

- The surrogate's kernel and its length-scale bounds (for GP surrogates)
- Exploration parameters: xi for EI/PI, kappa for LCB
- The number of initial points and the choice of acquisition function

Fortunately, optimization performance is usually far less sensitive to these choices than the underlying model is to its own hyperparameters; sensible defaults work well in practice. A configuration sketch follows below.
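As a usage sketch with the `BayesianOptimizer` defined earlier on this page, these knobs are just constructor arguments; the bounds and parameter values below are illustrative, not prescriptive:

```python
import numpy as np

# Illustrative 2D hyperparameter space (values are examples only)
bounds = np.array([[1e-5, 1e-1],   # learning rate (usually better log-scaled)
                   [0.0, 0.9]])    # dropout rate

# More exploratory configuration: LCB with a larger kappa
explorer = BayesianOptimizer(bounds, n_initial=8,
                             acquisition='lcb', kappa=3.0,
                             random_state=0)

# More exploitative configuration: EI with a small xi, fewer initial points
exploiter = BayesianOptimizer(bounds, n_initial=4,
                              acquisition='ei', xi=0.001,
                              random_state=0)
```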
SMBO is not universally superior. It may underperform random search when: (1) the objective is extremely noisy, (2) the function is highly discontinuous, (3) the budget is very large (>1000 evaluations), or (4) massive parallelism is available and evaluations are cheap. Always consider your specific context.
We've established the complete theoretical and practical foundation for Sequential Model-Based Optimization, the framework powering Bayesian Optimization.
Core Concepts:

- The SMBO loop: fit a surrogate, maximize an acquisition function, evaluate the objective, update the data.
- Surrogate models approximate the objective cheaply and, crucially, quantify their uncertainty.
- Acquisition functions (EI, PI, LCB, Thompson sampling) trade off exploration against exploitation.
- Sequential information reuse is what makes SMBO dramatically more sample-efficient than blind search.
What's next:
With the SMBO framework established, we'll dive deep into the surrogate models that make it work. The next page explores Gaussian Processes—the most popular surrogate for Bayesian Optimization—covering their mathematical foundation, kernel design, and how they provide the uncertainty estimates that drive intelligent acquisition.
You now understand Sequential Model-Based Optimization—the principled framework that makes Bayesian Optimization work. You can articulate why sequential search outperforms parallel blind search for expensive functions, how surrogates approximate objectives with uncertainty, and how acquisition functions balance exploration and exploitation.