Given a surrogate model that provides predictions and uncertainty, how do we decide where to evaluate the expensive objective function next? This is the role of acquisition functions.
The acquisition function encodes a strategy for selecting the next evaluation point. It takes the surrogate's predictions (mean and variance) and outputs a score for each candidate point. We then evaluate the objective at the point with the highest acquisition value.
The fundamental challenge: We want to find points that are likely to be good (exploitation) while also reducing our uncertainty about unexplored regions (exploration). Pure exploitation risks getting stuck in local optima; pure exploration wastes evaluations.
By the end of this page, you will master the major acquisition functions: Expected Improvement (EI), Probability of Improvement (PI), Lower Confidence Bound (LCB), and Thompson Sampling. You'll understand their mathematical formulations, tradeoffs, and when to use each.
Expected Improvement is the most widely used acquisition function. It measures the expected value of improvement over the current best observation.
Definition: Let $f^* = \min_i y_i$ be the best observed value. The improvement at point $x$ is:
$$I(x) = \max(0, f^* - f(x))$$
Since $f(x)$ is unknown, we take the expectation under the GP posterior:
$$\text{EI}(x) = \mathbb{E}[\max(0, f^* - f(x))]$$
Closed-form solution (for Gaussian posterior):
$$\text{EI}(x) = (f^* - \mu(x))\Phi(Z) + \sigma(x)\phi(Z)$$
where $Z = \frac{f^* - \mu(x)}{\sigma(x)}$, $\Phi$ is the CDF, and $\phi$ is the PDF of the standard normal.
```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.01):
    """
    Expected Improvement acquisition function.

    Parameters
    ----------
    mu : array, predicted means from GP
    sigma : array, predicted stds from GP
    f_best : float, best observed value (we minimize)
    xi : float, exploration-exploitation tradeoff
        - xi=0: pure exploitation
        - xi>0: encourages exploration

    Returns
    -------
    ei : array, EI values at each point
    """
    # Remember where sigma is (numerically) zero before clipping,
    # so those points can be zeroed out at the end
    zero_sigma = sigma < 1e-9
    sigma = np.maximum(sigma, 1e-9)

    # Improvement term (with optional exploration bonus xi)
    improvement = f_best - mu - xi

    # Standardized improvement
    Z = improvement / sigma

    # EI formula: two terms
    # Term 1: improvement * P(improvement)
    # Term 2: sigma * density (exploration bonus)
    ei = improvement * norm.cdf(Z) + sigma * norm.pdf(Z)

    # No improvement possible where we've already evaluated
    ei[zero_sigma] = 0.0

    return ei

# Example: why EI balances exploration and exploitation
# Point A: mu=0.5, sigma=0.1, f_best=0.4
# Point B: mu=0.6, sigma=0.5, f_best=0.4
#
# Point A: exploitative - low mean but low uncertainty
# Point B: exploratory - higher mean but high uncertainty
# EI will balance these based on improvement magnitude
```

The first term (improvement × CDF) rewards exploitation: points with low predicted mean. The second term (sigma × PDF) rewards exploration: points with high uncertainty. The genius of EI is that it naturally balances these: high uncertainty only matters if there is a reasonable probability of improvement.
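To see the numbers for the two commented points, here is a quick usage check (a small sketch; it assumes `expected_improvement` from the block above is in scope and passes xi=0 to match the closed-form formula):

```python
import numpy as np

# Assumes expected_improvement() defined above is in scope
f_best = 0.4
mu = np.array([0.5, 0.6])      # Point A, Point B
sigma = np.array([0.1, 0.5])

print(expected_improvement(mu, sigma, f_best, xi=0.0))
# approx. [0.008, 0.115]: the uncertain Point B scores higher, because its
# large sigma leaves real probability of a substantial improvement
```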
Probability of Improvement is the simplest acquisition function—the probability that a point will improve upon the current best:
$$\text{PI}(x) = P(f(x) < f^*) = \Phi\left(\frac{f^* - \mu(x)}{\sigma(x)}\right)$$
Advantages:
- Simple and interpretable: the score is a probability between 0 and 1.
- Closed form and cheap to compute from the GP mean and variance.

Disadvantages:
- Ignores the magnitude of improvement, only its probability.
- Tends to be greedy and prone to local optima unless ξ is increased.
```python
import numpy as np
from scipy.stats import norm

def probability_of_improvement(mu, sigma, f_best, xi=0.0):
    """
    Probability of Improvement acquisition function.

    PI = P(f(x) < f_best - xi)

    The xi parameter adds a margin: we want improvement
    of at least xi to count. This encourages exploration.
    """
    sigma = np.maximum(sigma, 1e-9)
    Z = (f_best - mu - xi) / sigma
    return norm.cdf(Z)

# PI vs EI example:
# Point A: mu=0.35, sigma=0.01 -> almost certain small improvement
# Point B: mu=0.50, sigma=0.30 -> uncertain but potentially big improvement
#
# PI: Prefers A (high probability of improvement)
# EI: May prefer B (accounts for improvement magnitude)
```

PI is often too greedy. In practice, a 99% chance of a 0.001 improvement beats a 30% chance of a 0.5 improvement under PI. This makes it prone to local optima. Use EI instead for most applications, or add a larger xi to PI.
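To make the greediness concrete, here is a small numerical comparison of the two candidates above (taking f_best = 0.4 as an assumed incumbent; both formulas are inlined so the snippet stands alone):

```python
import numpy as np
from scipy.stats import norm

f_best = 0.4  # assumed incumbent value for this illustration

# Point A: almost certain, tiny improvement; Point B: uncertain, possibly large
candidates = {"A": (0.35, 0.01), "B": (0.50, 0.30)}

for name, (mu, sigma) in candidates.items():
    Z = (f_best - mu) / sigma
    pi = norm.cdf(Z)                                         # Probability of Improvement
    ei = (f_best - mu) * norm.cdf(Z) + sigma * norm.pdf(Z)   # Expected Improvement
    print(f"{name}: PI = {pi:.3f}, EI = {ei:.3f}")

# A: PI = 1.000, EI = 0.050
# B: PI = 0.369, EI = 0.076   <- PI picks A, EI picks B
```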
Lower Confidence Bound (also called GP-LCB or UCB for maximization) takes an optimistic approach—it assumes the function could be as good as the lower bound of the confidence interval:
$$\text{LCB}(x) = \mu(x) - \kappa \cdot \sigma(x)$$
We minimize LCB, which means we prefer points with:
- Low predicted mean μ(x) (exploitation)
- High predicted uncertainty σ(x) (exploration)
The κ parameter controls the balance:
- Small κ: trust the mean and behave greedily (exploitation).
- Large κ: reward uncertainty and sample widely (exploration).
- κ ≈ 2.0 is a common static default; a short numerical sketch follows below.
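A quick numerical sketch of the effect of κ (the two candidate values below are hypothetical):

```python
import numpy as np

# Candidate A: confident, decent mean; candidate B: uncertain, worse mean
mu = np.array([0.50, 0.60])
sigma = np.array([0.05, 0.30])

for kappa in (0.1, 2.0):
    lcb = mu - kappa * sigma
    pick = "A" if np.argmin(lcb) == 0 else "B"
    print(f"kappa={kappa}: LCB={lcb} -> evaluate {pick}")

# kappa=0.1 -> evaluate A  (trust the means, exploit)
# kappa=2.0 -> evaluate B  (optimism about the uncertain point, explore)
```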
Theoretical advantage: GP-LCB has proven regret bounds. With appropriately scheduled κ that grows with log(t), cumulative regret is sublinear.
```python
import numpy as np

def lower_confidence_bound(mu, sigma, kappa=2.0):
    """
    Lower Confidence Bound acquisition function.

    LCB(x) = mu(x) - kappa * sigma(x)

    Minimizing LCB is equivalent to being optimistic:
    we evaluate where the function COULD be low.

    Parameters
    ----------
    kappa : float, exploration parameter
        Higher kappa = more exploration
        Theoretical optimal: kappa = sqrt(2 * log(t * d))
        where t is iteration number, d is dimensionality
    """
    return mu - kappa * sigma

def dynamic_lcb(mu, sigma, iteration, n_dims, delta=0.1):
    """
    LCB with theoretically-motivated dynamic kappa.

    Kappa grows logarithmically with iterations, providing
    more exploration early and more exploitation later.
    (n_dims appears in the full GP-UCB bound; this simplified
    schedule keeps only the log(t) dependence.)
    """
    beta = 2 * np.log(iteration**2 * np.pi**2 / (6 * delta))
    kappa = np.sqrt(beta)
    return mu - kappa * sigma
```

Thompson Sampling takes a different approach. Instead of computing an explicit acquisition function, we draw a random function from the GP posterior and evaluate wherever that sample is minimized.
Algorithm:
1. f_sample ~ GP(mu, K) # Draw function sample
2. x_next = argmin f_sample(x) # Find minimum
3. Return x_next
Why it works: The randomness of posterior sampling naturally balances exploration and exploitation. In uncertain regions, samples vary widely, so we sometimes evaluate there. In well-understood regions, samples agree, so we exploit if they're good.
Practical implementation: True posterior sampling requires drawing correlated function values, which is expensive. Common approximations (both shown in the code below):
- Sample independently at each candidate from N(μ(x), σ²(x)), ignoring posterior correlations. Very fast and often good enough.
- Sample jointly over a finite candidate set from the multivariate normal posterior. Respects correlations at the cost of a Cholesky factorization.
```python
import numpy as np

def thompson_sampling_simple(mu, sigma, n_samples=1):
    """
    Simplified Thompson Sampling using independent samples.

    For each candidate, sample from N(mu, sigma^2) and
    pick the candidate with the lowest sampled value.

    Note: This ignores posterior correlations but works
    well in practice and is very fast.
    """
    samples = np.random.normal(mu, sigma)
    return np.argmin(samples)

def thompson_sampling_correlated(gp, X_candidates, n_samples=1):
    """
    Correlated Thompson Sampling using the GP posterior.

    Draws from the joint posterior at all candidates,
    respecting correlations between nearby points.
    """
    # Posterior mean and covariance over the candidate set
    # (scikit-learn-style GaussianProcessRegressor API)
    mu, K = gp.predict(X_candidates, return_cov=True)

    # Sample from the multivariate Gaussian: f ~ N(mu, K)
    L = np.linalg.cholesky(K + 1e-6 * np.eye(len(K)))
    z = np.random.randn(len(mu))
    f_sample = mu + L @ z

    return np.argmin(f_sample)
```

Thompson Sampling naturally extends to parallel/batch settings. Simply draw k independent posterior samples and evaluate the minimum of each. Unlike batch EI, this requires no expensive computation of multi-point acquisition values.
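A minimal sketch of that batch extension, reusing the correlated sampler idea (again assuming a scikit-learn-style GP with `predict(..., return_cov=True)`):

```python
import numpy as np

def thompson_sampling_batch(gp, X_candidates, batch_size=4):
    """
    Batch Thompson Sampling: one posterior sample per batch slot,
    each proposing the minimizer of its own sampled function.

    Returns indices into X_candidates (duplicates are possible; in
    practice you may de-duplicate or jitter repeated picks).
    """
    # Joint posterior over the candidate set
    mu, K = gp.predict(X_candidates, return_cov=True)
    L = np.linalg.cholesky(K + 1e-6 * np.eye(len(K)))

    picks = []
    for _ in range(batch_size):
        f_sample = mu + L @ np.random.randn(len(mu))  # one correlated draw
        picks.append(int(np.argmin(f_sample)))
    return picks
```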
Each acquisition function has strengths and weaknesses. Here's a practical guide to selection:
| Function | Best For | Tradeoff Parameter | Recommendation |
|---|---|---|---|
| EI | General use, unknown objectives | ξ (small, e.g., 0.01) | Default choice for most problems |
| PI | Risk-averse, need any improvement | ξ (larger for exploration) | Use when small improvements matter |
| LCB | Theoretical guarantees needed | κ (typically 2.0) | Good when you want provable bounds |
| Thompson | Parallel optimization, simplicity | None | Excellent for batch settings |
Empirical findings and recommendations: For hyperparameter optimization, use EI as the default. Consider Thompson Sampling if you can evaluate multiple configurations in parallel.
Batch Acquisition (q-EI, q-LCB): Select multiple points to evaluate in parallel. q-EI optimizes the expected improvement from evaluating q points jointly. Computationally expensive but important for parallel computing.
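As a rough illustration of what the q-EI computation involves, here is a Monte Carlo sketch for scoring a single batch of q points (assuming a scikit-learn-style GP; maximizing q-EI then means searching over many such candidate batches, which is where the cost comes from):

```python
import numpy as np

def qei_monte_carlo(gp, X_batch, f_best, n_samples=1000, seed=0):
    """
    Monte Carlo estimate of q-EI for one batch of q points:
    the expected improvement of the BEST of the q sampled values
    over the current incumbent f_best.
    """
    rng = np.random.default_rng(seed)

    # Joint posterior over the q batch points
    mu, K = gp.predict(X_batch, return_cov=True)
    L = np.linalg.cholesky(K + 1e-9 * np.eye(len(K)))

    # Each row is one joint sample of the batch's function values
    samples = mu + rng.standard_normal((n_samples, len(mu))) @ L.T

    # Improvement achieved by the best point in each sampled outcome
    improvement = np.maximum(0.0, f_best - samples.min(axis=1))
    return improvement.mean()
```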
Knowledge Gradient: Instead of improvement in the objective, optimizes expected improvement in knowledge about the optimum. More sophisticated but computationally intensive.
Entropy Search: Minimizes entropy (uncertainty) about the location of the optimum. Principled information-theoretic approach, but complex to implement.
Portfolio Methods: Run multiple acquisition functions and use a bandit algorithm to select among them. Hedges against any single acquisition function performing poorly.
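A minimal sketch of the portfolio idea, using a softmax (Hedge-style) weighting over accumulated rewards; the reward signal here (negative observed objective) is a placeholder you would adapt to your setup:

```python
import numpy as np

class AcquisitionPortfolio:
    """Pick one acquisition function per round; functions that led to
    good (low) objective values get chosen more often over time."""

    def __init__(self, names, eta=0.1):
        self.names = list(names)
        self.eta = eta                        # learning rate for the weights
        self.gains = np.zeros(len(self.names))
        self.last = None

    def choose(self):
        # Softmax over accumulated gains (Hedge-style selection)
        logits = self.eta * self.gains
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        self.last = np.random.choice(len(self.names), p=probs)
        return self.names[self.last]

    def update(self, observed_y):
        # Lower observed objective -> larger reward for the chosen function
        self.gains[self.last] += -observed_y

# Usage sketch:
# portfolio = AcquisitionPortfolio(["ei", "pi", "lcb"])
# name = portfolio.choose()     # pick which acquisition to optimize this round
# ...run one BO step with it, observe y...
# portfolio.update(y)
```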
For most hyperparameter optimization tasks, EI with default parameters is sufficient. Advanced acquisition functions provide marginal improvements but add complexity. Focus on good surrogate modeling and search space design before optimizing acquisition functions.
What's next: We'll explore the exploration-exploitation tradeoff in depth, understanding the theoretical foundations and practical strategies for balancing these competing objectives.
You now understand the major acquisition functions for Bayesian Optimization. You can implement EI, PI, LCB, and Thompson Sampling, explain their tradeoffs, and select appropriate functions for different optimization scenarios.