Consider the problem of tuning hyperparameters for a deep neural network. Each evaluation requires training the model, which might take hours or even days. Grid search would require evaluating every point in a predefined grid—potentially thousands of configurations. Random search improves upon this but still operates without any memory: each evaluation is independent, learning nothing from previous attempts.
What if we could use information from past evaluations to guide future ones? What if, after observing that learning rate 0.01 yielded a validation error of 0.15 and learning rate 0.001 yielded 0.12, we could intelligently decide where to look next rather than randomly guessing?
This is the fundamental insight behind Sequential Model-Based Optimization (SMBO), the algorithmic framework that powers Bayesian Optimization. Instead of treating each evaluation as independent, SMBO builds a probabilistic model of the objective function and uses it to make informed decisions about where to evaluate next.
By the end of this page, you will understand the complete SMBO framework: how it builds probabilistic models of objective functions, why this leads to dramatically more efficient optimization, and the key algorithmic components that make Bayesian Optimization work. You'll grasp the theoretical foundations that make SMBO the gold standard for expensive black-box optimization.
Sequential Model-Based Optimization is a principled framework for optimizing expensive black-box functions. The term "black-box" means we can only evaluate the function—we have no access to gradients, no analytical form, and often no understanding of why certain inputs produce certain outputs.
The core SMBO loop:

1. Fit a probabilistic surrogate model to all observations gathered so far.
2. Maximize a cheap acquisition function over the surrogate to select the most promising point.
3. Evaluate the expensive objective at that point.
4. Add the new observation to the dataset and repeat until the budget is exhausted.
This simple loop encapsulates a powerful idea: rather than evaluating the expensive objective function everywhere, we evaluate a cheap surrogate model and use it to identify the most promising locations for the next expensive evaluation.
```python
import numpy as np
from typing import Callable, Tuple, List
from scipy.optimize import minimize


def sequential_model_based_optimization(
    objective_function: Callable,
    bounds: np.ndarray,
    n_initial: int = 5,
    n_iterations: int = 25,
    surrogate_model=None,
    acquisition_function=None
) -> Tuple[np.ndarray, float]:
    """
    Sequential Model-Based Optimization (SMBO) Framework

    The canonical algorithm for Bayesian Optimization, demonstrating
    how surrogate models and acquisition functions work together.

    Parameters:
    -----------
    objective_function : callable
        The expensive black-box function to optimize (minimize).
        Takes a point x and returns a scalar value.
    bounds : np.ndarray, shape (d, 2)
        Lower and upper bounds for each of d dimensions.
    n_initial : int
        Number of initial random evaluations to seed the model.
    n_iterations : int
        Number of SMBO iterations after initialization.
    surrogate_model : object
        Probabilistic model with fit() and predict() methods.
    acquisition_function : callable
        Function that scores candidate points for evaluation.

    Returns:
    --------
    best_x : np.ndarray
        The best input found during optimization.
    best_y : float
        The corresponding objective value.
    """
    # Phase 1: Initialization with random sampling
    # We need initial observations to fit our first surrogate model
    X_observed = sample_initial_points(bounds, n_initial)
    y_observed = np.array([objective_function(x) for x in X_observed])

    print(f"Initialization complete: {n_initial} points evaluated")
    print(f"Best initial value: {np.min(y_observed):.4f}")

    # Phase 2: Sequential optimization loop
    for iteration in range(n_iterations):
        # Step 1: Fit the surrogate model to all observed data
        # The surrogate provides both predictions AND uncertainty estimates
        surrogate_model.fit(X_observed, y_observed)

        # Step 2: Find the point that maximizes the acquisition function
        # This balances exploitation (what looks good) vs exploration (uncertainty)
        x_next = optimize_acquisition(
            acquisition_function,
            surrogate_model,
            bounds,
            y_best=np.min(y_observed)
        )

        # Step 3: Evaluate the expensive objective at the selected point
        y_next = objective_function(x_next)

        # Step 4: Augment our observed dataset
        X_observed = np.vstack([X_observed, x_next])
        y_observed = np.append(y_observed, y_next)

        # Logging for insight into the optimization progress
        print(f"Iteration {iteration + 1}: f(x) = {y_next:.4f}, "
              f"Best so far: {np.min(y_observed):.4f}")

    # Return the best observed configuration
    best_idx = np.argmin(y_observed)
    return X_observed[best_idx], y_observed[best_idx]


def sample_initial_points(bounds: np.ndarray, n_points: int) -> np.ndarray:
    """
    Generate initial points using Latin Hypercube Sampling (LHS).

    LHS provides better coverage of the search space than pure random
    sampling, ensuring we don't accidentally cluster all initial
    points in one region.
    """
    d = bounds.shape[0]
    points = np.zeros((n_points, d))

    for dim in range(d):
        # Create stratified intervals, sample within each, then shuffle
        intervals = np.linspace(0, 1, n_points + 1)
        for i in range(n_points):
            points[i, dim] = np.random.uniform(intervals[i], intervals[i + 1])
        np.random.shuffle(points[:, dim])

    # Scale to actual bounds
    lower, upper = bounds[:, 0], bounds[:, 1]
    return points * (upper - lower) + lower


def optimize_acquisition(
    acquisition_fn: Callable,
    model,
    bounds: np.ndarray,
    y_best: float,
    n_restarts: int = 10
) -> np.ndarray:
    """
    Optimize the acquisition function to find the next evaluation point.

    Since the acquisition function is typically multimodal, we use
    multi-start optimization to avoid local optima.
    """
    d = bounds.shape[0]
    best_x, best_acq = None, -np.inf

    def neg_acquisition(x):
        mu, sigma = model.predict(x.reshape(1, -1))
        # Return a scalar: the acquisition is evaluated at a single point
        return -float(acquisition_fn(mu, sigma, y_best)[0])

    for _ in range(n_restarts):
        # Random starting point
        x0 = np.random.uniform(bounds[:, 0], bounds[:, 1])

        # L-BFGS-B optimization (gradient-based, respects bounds)
        result = minimize(
            neg_acquisition,
            x0,
            bounds=[(bounds[i, 0], bounds[i, 1]) for i in range(d)],
            method='L-BFGS-B'
        )

        if -result.fun > best_acq:
            best_acq = -result.fun
            best_x = result.x

    return best_x
```

The power of SMBO lies in its ability to reuse information. Every single evaluation updates our belief about the entire objective function surface. A single data point (x, y) tells us not just about that point, but constrains what the function can look like nearby due to continuity assumptions in the surrogate model.
At first glance, sequential optimization seems inefficient compared to parallel approaches. If we have 100 CPUs, shouldn't we evaluate 100 points simultaneously? The answer reveals a fundamental insight about optimization under uncertainty.
The Value of Information:
Consider two scenarios:

- Parallel blind search: all 100 evaluations are chosen up front, before any results are seen.
- Sequential search: the 100 evaluations are chosen one at a time, each informed by every outcome observed so far.
In parallel blind search, the 100th evaluation is chosen with exactly the same knowledge as the 1st. In sequential search, the 100th evaluation benefits from information gathered in the first 99 evaluations.
Mathematical formalization:
Let $f(x)$ be the objective function and let $\mathcal{D}_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ be observations after $n$ evaluations. The expected improvement from a new evaluation at $x$ depends on our posterior belief:
$$\text{Value of evaluating } x \propto \mathbb{E}[\text{Improvement} | \mathcal{D}_n, x]$$
In parallel blind search, all evaluations are conditioned on $\mathcal{D}_0$ (the empty set). In sequential search, evaluation $k$ is conditioned on $\mathcal{D}_{k-1}$, which contains vastly more information.
| Aspect | Parallel Blind Search | Sequential SMBO |
|---|---|---|
| Information utilization | Zero—each point chosen independently | Maximum—each point informed by all prior observations |
| Sample efficiency | Low—many evaluations wasted in poor regions | High—evaluations concentrate in promising regions |
| Convergence rate | O(1/√n) for random search | Can achieve exponential convergence for smooth functions |
| Wall-clock time (100 evaluations, 100 CPUs) | 1 × evaluation time | 100 × evaluation time (naive sequential) |
| Best for | Cheap evaluations, massive parallelism | Expensive evaluations, limited budget |
The fundamental tradeoff:
SMBO trades parallelism for sample efficiency. When each function evaluation is cheap (milliseconds), parallel random search wins on wall-clock time. But when each evaluation is expensive (hours to days), the sample efficiency of SMBO becomes critical.
Hyperparameter optimization is the ideal use case:

- Each evaluation means training a model, which can take hours or days.
- Realistic budgets are tens to hundreds of evaluations, not thousands.
- The search space is moderate-dimensional, typically a handful to a few dozen hyperparameters.
In this regime, using evaluations wisely (SMBO) dramatically outperforms using evaluations quickly but wastefully (random search with parallelism).
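To make the sample-efficiency gap concrete, here is a small back-of-the-envelope calculation (an illustration, not from any specific benchmark): for blind random search, the probability that at least one of $n$ independent uniform samples lands in the best $\epsilon$-fraction of the search space is $1 - (1-\epsilon)^n$, no matter how the evaluations are scheduled.

```python
import numpy as np

# Probability that blind random search hits the top-eps fraction of the
# search space at least once in n evaluations: 1 - (1 - eps)^n.
# Parallelism changes wall-clock time, not this curve; only more
# samples (or smarter samples) do.
def p_hit(eps: float, n: int) -> float:
    return 1.0 - (1.0 - eps) ** n

for eps in [0.01, 0.001]:
    for n in [10, 100, 1000]:
        print(f"eps={eps:>6}, n={n:>5}: P(hit) = {p_hit(eps, n):.3f}")

# For eps=0.001 (the top 0.1% region), even 1000 random evaluations
# succeed only about 63% of the time. SMBO instead concentrates later
# evaluations where the surrogate says the optimum is likely.
```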
Modern SMBO implementations support batch-parallel optimization. Techniques like the q-EI (q-Expected Improvement) acquisition function select multiple points to evaluate simultaneously while accounting for the pending evaluations. This combines the best of both worlds for moderately parallel settings.
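As a minimal sketch of one such batch strategy, the "constant liar" heuristic of Ginsbourger et al.: points are selected greedily, and each pending point is temporarily assigned a made-up objective value (here the current best) so the surrogate's uncertainty collapses around it and the next selection is pushed elsewhere. The code below assumes the `BayesianOptimizer` interface defined later on this page; it is an illustration, not the q-EI algorithm itself.

```python
import numpy as np

def suggest_batch_constant_liar(optimizer, q: int) -> np.ndarray:
    """Greedily select q points to evaluate in parallel.

    Sketch of the 'constant liar' heuristic, assuming the
    BayesianOptimizer defined later on this page (suggest_next /
    observe / X_observed / y_observed / history) and that the
    initialization phase is already complete.
    """
    batch = []
    lie = min(optimizer.y_observed)     # the "constant lie" value
    n_real = len(optimizer.y_observed)  # number of genuine observations

    for _ in range(q):
        x = optimizer.suggest_next()
        batch.append(x)
        # Fake observation steers subsequent selections away from x
        optimizer.observe(x, lie)

    # Discard the fake observations before the real evaluations arrive
    optimizer.X_observed = optimizer.X_observed[:n_real]
    optimizer.y_observed = optimizer.y_observed[:n_real]
    optimizer.history = optimizer.history[:n_real]

    return np.array(batch)
```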
The heart of SMBO is the surrogate model (also called response surface model). This is a probabilistic model that approximates the true objective function based on observed data. Unlike the true objective, evaluating the surrogate is essentially free.
Key requirements for a good surrogate:

- Cheap to evaluate: predictions must cost orders of magnitude less than the true objective.
- Probabilistic: it must quantify its own uncertainty, not just output a point prediction.
- Data-efficient: it must produce useful predictions from only a handful of observations.
Why uncertainty is essential:
A deterministic surrogate (like a random forest without uncertainty quantification) can only tell us: "Point A has predicted value 0.5". A probabilistic surrogate tells us: "Point A has predicted value 0.5 ± 0.2 with 95% confidence."
This uncertainty is what enables intelligent exploration. High uncertainty regions are places where the model "doesn't know" the objective value—and therefore might contain the optimal solution.
```python
import numpy as np
from abc import ABC, abstractmethod
from typing import Tuple


class SurrogateModel(ABC):
    """
    Abstract base class defining the interface for SMBO surrogate models.

    Any surrogate model used in Bayesian Optimization must implement
    this interface, providing both predictions and uncertainty estimates.
    """

    @abstractmethod
    def fit(self, X: np.ndarray, y: np.ndarray) -> None:
        """
        Fit the surrogate model to observed data.

        Parameters:
        -----------
        X : np.ndarray, shape (n_samples, n_features)
            Input configurations where the objective was evaluated.
        y : np.ndarray, shape (n_samples,)
            Observed objective values.
        """
        pass

    @abstractmethod
    def predict(self, X: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
        """
        Predict objective values with uncertainty at new points.

        Parameters:
        -----------
        X : np.ndarray, shape (n_samples, n_features)
            Points to make predictions at.

        Returns:
        --------
        mean : np.ndarray, shape (n_samples,)
            Predicted mean objective value at each point.
        std : np.ndarray, shape (n_samples,)
            Predicted standard deviation (uncertainty) at each point.

        Note:
        -----
        The standard deviation is CRITICAL for Bayesian Optimization.
        It represents epistemic uncertainty (uncertainty due to lack
        of data) and drives the exploration-exploitation tradeoff.
        """
        pass

    def update(self, X_new: np.ndarray, y_new: np.ndarray) -> None:
        """
        Incrementally update the model with new observations.

        Default implementation simply refits from scratch. Efficient
        implementations may support true incremental updates.
        """
        # Combine with existing data (if stored) and refit
        self.fit(
            np.vstack([self._X_train, X_new]),
            np.concatenate([self._y_train, y_new])
        )


class DummySurrogate(SurrogateModel):
    """
    Illustrative example: a simple surrogate using distance-weighted
    interpolation with distance-based uncertainty.

    NOT recommended for production—use Gaussian Processes or TPE instead.
    This demonstrates the interface and the importance of uncertainty.
    """

    def __init__(self, length_scale: float = 1.0):
        self.length_scale = length_scale
        self._X_train = None
        self._y_train = None

    def fit(self, X: np.ndarray, y: np.ndarray) -> None:
        self._X_train = X.copy()
        self._y_train = y.copy()

    def predict(self, X: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
        n_test = X.shape[0]
        means = np.zeros(n_test)
        stds = np.zeros(n_test)

        for i in range(n_test):
            # Compute distances to all training points
            distances = np.linalg.norm(self._X_train - X[i], axis=1)

            # Inverse distance weighting for mean prediction
            weights = np.exp(-distances / self.length_scale)
            weights /= weights.sum()
            means[i] = np.dot(weights, self._y_train)

            # Uncertainty increases with distance to nearest neighbor
            min_distance = distances.min()
            stds[i] = 0.1 + min_distance / self.length_scale

        return means, stds


# Visualization of what a surrogate model captures
def visualize_surrogate_1d(surrogate, X_train, y_train, bounds, true_function=None):
    """
    Visualize a 1D surrogate model's predictions and uncertainty.

    This is extremely useful for building intuition about how
    surrogate models work in Bayesian Optimization.
    """
    import matplotlib.pyplot as plt

    # Dense grid for visualization
    X_test = np.linspace(bounds[0], bounds[1], 200).reshape(-1, 1)

    # Fit and predict
    surrogate.fit(X_train.reshape(-1, 1), y_train)
    mean, std = surrogate.predict(X_test)

    # Plot
    fig, ax = plt.subplots(figsize=(12, 6))

    # True function (if known)
    if true_function is not None:
        y_true = np.array([true_function(x) for x in X_test.flatten()])
        ax.plot(X_test, y_true, 'k--', label='True function', alpha=0.7)

    # Surrogate predictions
    ax.plot(X_test, mean, 'b-', label='Surrogate mean', linewidth=2)

    # Uncertainty bands (±1 and ±2 std)
    ax.fill_between(X_test.flatten(), mean - 2*std, mean + 2*std,
                    alpha=0.2, color='blue', label='95% CI')
    ax.fill_between(X_test.flatten(), mean - std, mean + std,
                    alpha=0.3, color='blue', label='68% CI')

    # Training points
    ax.scatter(X_train, y_train, c='red', s=100, zorder=5,
               label='Observations', edgecolors='black')

    ax.set_xlabel('x', fontsize=12)
    ax.set_ylabel('f(x)', fontsize=12)
    ax.set_title('Surrogate Model: Predictions with Uncertainty', fontsize=14)
    ax.legend()
    ax.grid(True, alpha=0.3)

    return fig
```

A common misconception is that we want to build an accurate surrogate of the entire objective function. In fact, we only need the surrogate to be accurate near the optimum. It's perfectly fine if the surrogate is inaccurate in clearly suboptimal regions—we're optimizing, not learning the function everywhere.
Given a surrogate model that provides predictions and uncertainty, how do we decide where to evaluate next? This is the role of the acquisition function (also called selection criterion or infill criterion).
The acquisition function encodes a strategy for trading off exploitation (evaluating where the predicted value is good) versus exploration (evaluating where uncertainty is high).
Why both are necessary:
Pure exploitation (always evaluate where predicted value is best): Gets stuck in local optima. If our initial data happens to miss the global optimum, we never discover it.
Pure exploration (always evaluate where uncertainty is highest): Wastes evaluations learning about clearly suboptimal regions. Eventually covers the whole space but doesn't focus on optimization.
The acquisition function is cheap to optimize:
Since the acquisition function is computed from the surrogate (which is cheap to evaluate), we can use standard optimization techniques (gradient descent, multi-start local search, evolutionary algorithms) to find its maximum. This additional optimization costs seconds, not hours.
The balance depends on context:

- Early in a run, with few observations, exploration is more valuable because the surrogate is unreliable everywhere.
- Late in a run, or with little remaining budget, exploitation pays off more.
- Noisy or highly multimodal objectives warrant extra exploration.
Good acquisition functions naturally adapt this balance. Expected Improvement (EI), for instance, automatically explores more when uncertainty is high and exploits more when the predicted value is low.
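As a quick numeric sketch of this adaptivity (using the same closed-form EI implemented in the block below, with illustrative numbers): compare a candidate whose prediction is slightly worse than the incumbent but nearly certain against one equally bad on average but highly uncertain.

```python
import numpy as np
from scipy.stats import norm

def ei(mu, sigma, y_best, xi=0.01):
    # Closed-form Expected Improvement under a Gaussian surrogate
    sigma = np.maximum(sigma, 1e-9)
    z = (y_best - mu - xi) / sigma
    return (y_best - mu - xi) * norm.cdf(z) + sigma * norm.pdf(z)

y_best = 0.12  # best validation error observed so far (illustrative)

# Candidate A: mean 0.13, almost no uncertainty
# Candidate B: mean 0.13, large uncertainty
mu = np.array([0.13, 0.13])
sigma = np.array([0.001, 0.05])

print(ei(mu, sigma, y_best))
# EI(A) is essentially 0: there is almost no chance of improving.
# EI(B) is about 0.012: the uncertainty leaves real probability mass
# below y_best, so EI favors exploring B despite the equal mean.
```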
```python
import numpy as np
from scipy.stats import norm


def expected_improvement(mu: np.ndarray, sigma: np.ndarray,
                         y_best: float, xi: float = 0.01) -> np.ndarray:
    """
    Expected Improvement (EI) acquisition function.

    The most popular choice for Bayesian Optimization.
    Balances exploitation and exploration automatically.

    EI(x) = E[max(0, y_best - f(x))]

    Computed in closed form under a Gaussian surrogate:

    EI(x) = (y_best - mu(x) - xi) * Φ(Z) + sigma(x) * φ(Z)
    where Z = (y_best - mu(x) - xi) / sigma(x)

    Parameters:
    -----------
    mu : predicted mean at each point
    sigma : predicted std at each point
    y_best : best objective value observed so far
    xi : exploration-exploitation tradeoff parameter
         (higher = more exploration)
    """
    # Avoid division by zero
    sigma = np.maximum(sigma, 1e-9)

    # Standardized improvement
    Z = (y_best - mu - xi) / sigma

    # Expected Improvement formula (closed-form for Gaussian)
    ei = (y_best - mu - xi) * norm.cdf(Z) + sigma * norm.pdf(Z)

    # EI is zero where sigma is effectively zero (already evaluated)
    ei[sigma <= 1e-9] = 0.0

    return ei


def probability_of_improvement(mu: np.ndarray, sigma: np.ndarray,
                               y_best: float, xi: float = 0.0) -> np.ndarray:
    """
    Probability of Improvement (PI) acquisition function.

    A simpler acquisition function that measures the probability
    that a point will improve upon the current best.

    PI(x) = P(f(x) < y_best - xi) = Φ((y_best - mu(x) - xi) / sigma(x))

    Note: PI can be too exploitative because it ignores the MAGNITUDE
    of potential improvement. A small certain improvement beats a
    potentially large uncertain improvement.
    """
    sigma = np.maximum(sigma, 1e-9)
    Z = (y_best - mu - xi) / sigma
    return norm.cdf(Z)


def lower_confidence_bound(mu: np.ndarray, sigma: np.ndarray,
                           kappa: float = 2.0) -> np.ndarray:
    """
    Lower Confidence Bound (LCB) acquisition function.

    Also called GP-LCB. Optimistic in the face of uncertainty.

    LCB(x) = mu(x) - kappa * sigma(x)

    We MINIMIZE LCB, which encourages evaluating points that either:
    - Have low predicted mean (exploitation), OR
    - Have high uncertainty (exploration)

    The parameter kappa controls the balance:
    - kappa = 0: Pure exploitation (greedy)
    - kappa → ∞: Pure exploration (uncertainty sampling)
    - kappa ≈ 2: Common balanced choice
    """
    return mu - kappa * sigma


def thompson_sampling(model, X_candidates: np.ndarray,
                      n_samples: int = 1) -> np.ndarray:
    """
    Thompson Sampling for Bayesian Optimization.

    Instead of computing an acquisition function explicitly, we:
    1. Draw a sample from the posterior (a random function consistent
       with the data)
    2. Find the minimum of this sampled function
    3. Evaluate the objective there

    This naturally balances exploration and exploitation through
    posterior randomness.
    """
    # Get posterior mean and std at the candidate points
    mu, sigma = model.predict(X_candidates)

    # For simplicity, sample independently (ignores correlation).
    # Full Thompson sampling would sample from the joint posterior.
    samples = np.random.normal(mu, sigma)

    # Return the candidate with the lowest sampled value
    best_idx = np.argmin(samples)
    return X_candidates[best_idx]


# Visualization of acquisition functions
def visualize_acquisition_landscape_1d(model, X_train, y_train, bounds):
    """
    Show how different acquisition functions select the next point.
    """
    import matplotlib.pyplot as plt

    X_test = np.linspace(bounds[0], bounds[1], 200).reshape(-1, 1)
    model.fit(X_train.reshape(-1, 1), y_train)
    mu, sigma = model.predict(X_test)
    y_best = y_train.min()

    # Compute acquisition functions
    ei = expected_improvement(mu, sigma, y_best)
    pi = probability_of_improvement(mu, sigma, y_best)
    lcb = lower_confidence_bound(mu, sigma, kappa=2.0)

    fig, axes = plt.subplots(4, 1, figsize=(12, 12), sharex=True)

    # Surrogate
    axes[0].plot(X_test, mu, 'b-', linewidth=2)
    axes[0].fill_between(X_test.flatten(), mu - 2*sigma, mu + 2*sigma, alpha=0.2)
    axes[0].scatter(X_train, y_train, c='red', s=100, zorder=5)
    axes[0].axhline(y_best, color='green', linestyle='--', label='y_best')
    axes[0].set_ylabel('f(x)')
    axes[0].set_title('Surrogate Model')
    axes[0].legend()

    # EI
    axes[1].plot(X_test, ei, 'g-', linewidth=2)
    axes[1].axvline(X_test[np.argmax(ei)], color='red', linestyle='--')
    axes[1].set_ylabel('EI(x)')
    axes[1].set_title(f'Expected Improvement (next: x={X_test[np.argmax(ei)][0]:.2f})')

    # PI
    axes[2].plot(X_test, pi, 'm-', linewidth=2)
    axes[2].axvline(X_test[np.argmax(pi)], color='red', linestyle='--')
    axes[2].set_ylabel('PI(x)')
    axes[2].set_title(f'Probability of Improvement (next: x={X_test[np.argmax(pi)][0]:.2f})')

    # LCB (negate for visualization since we minimize)
    axes[3].plot(X_test, -lcb, 'c-', linewidth=2)
    axes[3].axvline(X_test[np.argmin(lcb)], color='red', linestyle='--')
    axes[3].set_ylabel('-LCB(x)')
    axes[3].set_title(f'Lower Confidence Bound (next: x={X_test[np.argmin(lcb)][0]:.2f})')
    axes[3].set_xlabel('x')

    plt.tight_layout()
    return fig
```

Let's synthesize everything into the complete SMBO algorithm, with all components working together. This provides the canonical formulation that underlies practically all Bayesian Optimization implementations.
Algorithm: Sequential Model-Based Optimization
Input:

- Expensive objective $f$, search-space bounds, and an evaluation budget $N$
- A surrogate model, an acquisition function $\alpha$, and a number of initial points $n_0$

Output:

- The best configuration $x^+$ found and its objective value $f(x^+)$

Procedure:

1. Evaluate $f$ at $n_0$ space-filling initial points (e.g., via Latin Hypercube Sampling).
2. Repeat until the budget is exhausted: fit the surrogate to all observations, select $x_{\text{next}} = \arg\max_x \alpha(x)$, evaluate $y_{\text{next}} = f(x_{\text{next}})$, and append $(x_{\text{next}}, y_{\text{next}})$ to the dataset.
3. Return the best observation.
```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, ConstantKernel
from scipy.stats import norm
from scipy.optimize import minimize
from typing import Callable, Tuple, Dict, List, Optional
import warnings
warnings.filterwarnings('ignore')  # Suppress GP fitting warnings for a cleaner demo


class BayesianOptimizer:
    """
    Complete Bayesian Optimization implementation using the SMBO framework.

    A self-contained implementation that demonstrates all the core
    concepts of Sequential Model-Based Optimization.
    """

    def __init__(
        self,
        bounds: np.ndarray,
        n_initial: int = 5,
        acquisition: str = 'ei',
        xi: float = 0.01,
        kappa: float = 2.0,
        random_state: Optional[int] = None
    ):
        """
        Initialize the Bayesian Optimizer.

        Parameters:
        -----------
        bounds : np.ndarray, shape (d, 2)
            Lower and upper bounds for each dimension
        n_initial : int
            Number of initial random samples
        acquisition : str
            Choice of acquisition function: 'ei', 'pi', 'lcb'
        xi : float
            Exploration parameter for EI and PI
        kappa : float
            Exploration parameter for LCB
        random_state : int
            Seed for reproducibility
        """
        self.bounds = np.array(bounds)
        self.n_dims = self.bounds.shape[0]
        self.n_initial = n_initial
        self.acquisition = acquisition
        self.xi = xi
        self.kappa = kappa
        self.rng = np.random.RandomState(random_state)

        # Initialize Gaussian Process surrogate
        # Matern kernel is a robust default choice
        kernel = ConstantKernel(1.0) * Matern(
            length_scale=np.ones(self.n_dims),
            length_scale_bounds=(1e-3, 1e3),
            nu=2.5  # Twice differentiable (smooth)
        )
        self.gp = GaussianProcessRegressor(
            kernel=kernel,
            alpha=1e-6,  # Noise term for numerical stability
            normalize_y=True,
            n_restarts_optimizer=10,
            random_state=random_state
        )

        # Storage for observations
        self.X_observed: List[np.ndarray] = []
        self.y_observed: List[float] = []
        self.history: List[Dict] = []

    def _latin_hypercube_sample(self, n_samples: int) -> np.ndarray:
        """Generate Latin Hypercube Samples within bounds."""
        samples = np.zeros((n_samples, self.n_dims))

        for dim in range(self.n_dims):
            # Create n_samples intervals of equal probability
            intervals = np.linspace(0, 1, n_samples + 1)
            # Sample uniformly within each interval
            points = np.array([
                self.rng.uniform(intervals[i], intervals[i + 1])
                for i in range(n_samples)
            ])
            # Randomly permute to break correlation between dimensions
            self.rng.shuffle(points)
            samples[:, dim] = points

        # Scale to actual bounds
        lower, upper = self.bounds[:, 0], self.bounds[:, 1]
        return samples * (upper - lower) + lower

    def _compute_acquisition(self, X: np.ndarray) -> np.ndarray:
        """Compute acquisition function values at given points."""
        mu, sigma = self.gp.predict(X, return_std=True)

        if len(self.y_observed) == 0:
            return sigma  # Pure exploration before any data

        y_best = np.min(self.y_observed)

        if self.acquisition == 'ei':
            return self._expected_improvement(mu, sigma, y_best)
        elif self.acquisition == 'pi':
            return self._probability_of_improvement(mu, sigma, y_best)
        elif self.acquisition == 'lcb':
            return -self._lower_confidence_bound(mu, sigma)
        else:
            raise ValueError(f"Unknown acquisition function: {self.acquisition}")

    def _expected_improvement(self, mu, sigma, y_best) -> np.ndarray:
        """Expected Improvement acquisition function."""
        sigma = np.maximum(sigma, 1e-9)
        Z = (y_best - mu - self.xi) / sigma
        ei = (y_best - mu - self.xi) * norm.cdf(Z) + sigma * norm.pdf(Z)
        ei[sigma <= 1e-9] = 0.0
        return ei

    def _probability_of_improvement(self, mu, sigma, y_best) -> np.ndarray:
        """Probability of Improvement acquisition function."""
        sigma = np.maximum(sigma, 1e-9)
        Z = (y_best - mu - self.xi) / sigma
        return norm.cdf(Z)

    def _lower_confidence_bound(self, mu, sigma) -> np.ndarray:
        """Lower Confidence Bound acquisition function."""
        return mu - self.kappa * sigma

    def _optimize_acquisition(self, n_restarts: int = 25) -> np.ndarray:
        """Find the point that maximizes the acquisition function."""
        best_x = None
        best_acq = -np.inf

        # Multi-start optimization
        for _ in range(n_restarts):
            # Random starting point
            x0 = self.rng.uniform(self.bounds[:, 0], self.bounds[:, 1])

            try:
                result = minimize(
                    lambda x: -self._compute_acquisition(x.reshape(1, -1))[0],
                    x0,
                    bounds=list(self.bounds),
                    method='L-BFGS-B'
                )
                if -result.fun > best_acq:
                    best_acq = -result.fun
                    best_x = result.x
            except Exception:
                continue

        # Fallback to random if optimization fails
        if best_x is None:
            best_x = self.rng.uniform(self.bounds[:, 0], self.bounds[:, 1])

        return best_x

    def suggest_next(self) -> np.ndarray:
        """Suggest the next point to evaluate."""
        n_observed = len(self.X_observed)

        if n_observed < self.n_initial:
            # Still in initialization phase
            if n_observed == 0:
                # Generate all initial points at once for LHS
                self._initial_samples = self._latin_hypercube_sample(self.n_initial)
            return self._initial_samples[n_observed]
        else:
            # SMBO phase: fit GP and optimize acquisition
            X = np.array(self.X_observed)
            y = np.array(self.y_observed)

            # Fit the Gaussian Process to all observed data
            self.gp.fit(X, y)

            # Optimize acquisition function
            return self._optimize_acquisition()

    def observe(self, x: np.ndarray, y: float) -> None:
        """Record an observation."""
        self.X_observed.append(x.flatten())
        self.y_observed.append(y)

        # Track history for analysis
        self.history.append({
            'x': x.flatten().copy(),
            'y': y,
            'y_best': min(self.y_observed),
            'iteration': len(self.y_observed)
        })

    def optimize(
        self,
        objective: Callable,
        n_iterations: int = 25,
        verbose: bool = True
    ) -> Tuple[np.ndarray, float]:
        """
        Run the complete Bayesian Optimization loop.

        Parameters:
        -----------
        objective : callable
            Function to minimize. Takes array x and returns scalar.
        n_iterations : int
            Number of SMBO iterations after initialization.
        verbose : bool
            Print progress during optimization.

        Returns:
        --------
        best_x : np.ndarray
            Best configuration found.
        best_y : float
            Best objective value achieved.
        """
        total_evals = self.n_initial + n_iterations

        for i in range(total_evals):
            # Get next point to evaluate
            x_next = self.suggest_next()

            # Evaluate the objective (THE EXPENSIVE STEP)
            y_next = objective(x_next)

            # Record the observation
            self.observe(x_next, y_next)

            if verbose:
                phase = "Init" if i < self.n_initial else "SMBO"
                print(f"[{phase}] Eval {i+1}/{total_evals}: "
                      f"f(x) = {y_next:.6f}, "
                      f"Best = {min(self.y_observed):.6f}")

        # Return best observed
        best_idx = np.argmin(self.y_observed)
        return np.array(self.X_observed[best_idx]), self.y_observed[best_idx]


# Example usage and demonstration
def demonstration():
    """Demonstrate SMBO on the Branin benchmark function."""

    # Branin function: a classic optimization benchmark
    def branin(x):
        x1, x2 = x[0], x[1]
        a, b, c = 1, 5.1 / (4 * np.pi**2), 5 / np.pi
        r, s, t = 6, 10, 1 / (8 * np.pi)
        return a * (x2 - b*x1**2 + c*x1 - r)**2 + s*(1-t)*np.cos(x1) + s

    # Known optima: f* ≈ 0.397887
    # At: (-π, 12.275), (π, 2.275), (9.42478, 2.475)

    # Define search space
    bounds = np.array([
        [-5.0, 10.0],  # x1 bounds
        [0.0, 15.0]    # x2 bounds
    ])

    # Run optimization
    optimizer = BayesianOptimizer(
        bounds=bounds,
        n_initial=5,
        acquisition='ei',
        random_state=42
    )

    best_x, best_y = optimizer.optimize(branin, n_iterations=20, verbose=True)

    print("\nOptimization complete!")
    print(f"Best x found: [{best_x[0]:.4f}, {best_x[1]:.4f}]")
    print(f"Best f(x) found: {best_y:.6f}")
    print("Known optimum: 0.397887")
    print(f"Gap: {abs(best_y - 0.397887):.6f}")

    return optimizer


if __name__ == "__main__":
    demonstration()
```

A natural question is: does SMBO actually work? Can we prove that it will eventually find the optimum? The answer depends on the specific components used, but there are strong theoretical results.
Regret bounds:
The typical way to measure optimization performance is through cumulative regret:
$$R_N = \sum_{t=1}^{N} [f(x_t) - f(x^*)]$$
where $x^*$ is the true optimum. Good optimization algorithms have regret that grows slowly with $N$.
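As a sketch, both simple and cumulative regret can be computed directly from the `history` list kept by the `BayesianOptimizer` implemented above, assuming the true optimum `f_star` is known (as it is for benchmark functions like Branin):

```python
import numpy as np

def regret_curves(history, f_star):
    """Compute simple and cumulative regret from an optimizer's history.

    history : list of dicts with keys 'y' and 'y_best'
              (the format kept by the BayesianOptimizer above)
    f_star  : known optimal value of the objective
    """
    y = np.array([h['y'] for h in history])
    y_best = np.array([h['y_best'] for h in history])

    simple_regret = y_best - f_star            # gap of the best-so-far value
    cumulative_regret = np.cumsum(y - f_star)  # R_N from the formula above
    return simple_regret, cumulative_regret

# Example: after running optimizer.optimize(branin, ...) as in the demo,
# simple, cumulative = regret_curves(optimizer.history, f_star=0.397887)
```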
Key theoretical results:
GP-UCB [Srinivas et al., 2010]: For Gaussian Process surrogates with UCB acquisition, cumulative regret is $O(\sqrt{N \gamma_N})$, where $\gamma_N$ is the maximum information gain. For the squared-exponential kernel in $d$ dimensions, $\gamma_N$ grows only as $O((\log N)^{d+1})$; for Matérn kernels it grows polynomially in $N$, but still slowly enough to yield sublinear regret.
Convergence guarantees: Under mild conditions (Lipschitz-continuous objective, bounded search space, and an acquisition function that never stops exploring, such as EI or UCB), SMBO algorithms are guaranteed to converge to the global optimum as $N \rightarrow \infty$.
Sample efficiency: For smooth functions, SMBO can achieve exponential convergence rates—meaning the gap to the optimum shrinks exponentially with each evaluation. This is dramatically better than random search's polynomial rate.
| Method | Convergence Rate | Assumptions | Practical Impact |
|---|---|---|---|
| Random Search | O(1/√N) | None | Needs ~100× more evaluations for 10× improvement |
| Grid Search | O(1/N^(1/d)) | None | Curse of dimensionality—unusable beyond ~5D |
| SMBO (GP-UCB) | O(√(γ_N/N)) | Function in RKHS | Near-optimal for smooth functions |
| SMBO (GP-EI) | Sublinear | GP model is accurate | Excellent empirical performance |
The theoretical guarantees assume the surrogate model is well-specified (i.e., the true function is a sample from the GP prior). In practice, this is never exactly true. However, SMBO is remarkably robust to model misspecification. Empirically, it outperforms alternatives even when theoretical assumptions are violated.
Why SMBO is particularly effective for hyperparameter optimization:

- Evaluations are extremely expensive, so sample efficiency dominates wall-clock concerns.
- Budgets are small (tens to hundreds of trials), exactly the regime where model-based search shines.
- Hyperparameter response surfaces are typically smooth enough, and effectively low-dimensional enough, for surrogate models to capture with few observations.
While the SMBO framework is elegant, practical implementation requires careful attention to several details:
1. Initialization Strategy
The initial samples seed the surrogate model. Poor initialization can bias the entire optimization:

- Prefer space-filling designs (Latin Hypercube Sampling, Sobol sequences) over pure random sampling.
- Use enough initial points to fit a meaningful model; a common heuristic is on the order of a few points per dimension.
- Seed with known-good configurations from prior experience when available.
2. Surrogate Model Selection
The choice of surrogate model significantly impacts performance:

- Gaussian Processes excel on continuous, low-dimensional spaces, but scale cubically with the number of observations.
- Random forests (as in SMAC) handle categorical and conditional parameters naturally.
- Tree-structured Parzen Estimators (as in Hyperopt) scale well to large, tree-structured search spaces.
3. Acquisition Function Optimization
The acquisition function may have many local optima:

- Use multi-start local optimization (e.g., L-BFGS-B from many random starting points), as in the implementations above.
- Dense random sampling followed by local refinement is a robust alternative.
- A poorly optimized acquisition function silently degrades the whole loop: the points actually evaluated are no longer the most informative ones.
4. Hyperparameters of BO Itself
Bayesian Optimization has its own hyperparameters:

- The surrogate's kernel and its length-scale bounds (for GP surrogates)
- Exploration parameters: xi for EI/PI, kappa for LCB
- The number of initial points and the choice of acquisition function

Fortunately, optimization performance is usually far less sensitive to these choices than the underlying model is to its own hyperparameters; sensible defaults work well in practice. A configuration sketch follows below.
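As a usage sketch with the `BayesianOptimizer` defined earlier on this page, these knobs are just constructor arguments; the bounds and parameter values below are illustrative, not prescriptive:

```python
import numpy as np

# Illustrative 2D hyperparameter space (values are examples only)
bounds = np.array([[1e-5, 1e-1],   # learning rate (usually better log-scaled)
                   [0.0, 0.9]])    # dropout rate

# More exploratory configuration: LCB with a larger kappa
explorer = BayesianOptimizer(bounds, n_initial=8,
                             acquisition='lcb', kappa=3.0,
                             random_state=0)

# More exploitative configuration: EI with a small xi, fewer initial points
exploiter = BayesianOptimizer(bounds, n_initial=4,
                              acquisition='ei', xi=0.001,
                              random_state=0)
```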
SMBO is not universally superior. It may underperform random search when: (1) the objective is extremely noisy, (2) the function is highly discontinuous, (3) the budget is very large (>1000 evaluations), or (4) massive parallelism is available and evaluations are cheap. Always consider your specific context.
We've established the complete theoretical and practical foundation for Sequential Model-Based Optimization, the framework powering Bayesian Optimization.
Core Concepts:

- The SMBO loop: fit a surrogate, maximize an acquisition function, evaluate the objective, update the data.
- Surrogate models approximate the objective cheaply and, crucially, quantify their uncertainty.
- Acquisition functions (EI, PI, LCB, Thompson sampling) trade off exploration against exploitation.
- Sequential information reuse is what makes SMBO dramatically more sample-efficient than blind search.
What's next:
With the SMBO framework established, we'll dive deep into the surrogate models that make it work. The next page explores Gaussian Processes—the most popular surrogate for Bayesian Optimization—covering their mathematical foundation, kernel design, and how they provide the uncertainty estimates that drive intelligent acquisition.
You now understand Sequential Model-Based Optimization—the principled framework that makes Bayesian Optimization work. You can articulate why sequential search outperforms parallel blind search for expensive functions, how surrogates approximate objectives with uncertainty, and how acquisition functions balance exploration and exploitation.