Given a surrogate model that provides predictions and uncertainty, how do we decide where to evaluate the expensive objective function next? This is the role of acquisition functions.
The acquisition function encodes a strategy for selecting the next evaluation point. It takes the surrogate's predictions (mean and variance) and outputs a score for each candidate point. We then evaluate the objective at the point with the highest acquisition value.
The fundamental challenge: We want to find points that are likely to be good (exploitation) while also reducing our uncertainty about unexplored regions (exploration). Pure exploitation risks getting stuck in local optima; pure exploration wastes evaluations.
By the end of this page, you will master the major acquisition functions: Expected Improvement (EI), Probability of Improvement (PI), Lower Confidence Bound (LCB), and Thompson Sampling. You'll understand their mathematical formulations, tradeoffs, and when to use each.
Expected Improvement is the most widely used acquisition function. It measures the expected value of improvement over the current best observation.
Definition: Let $f^* = \min_i y_i$ be the best observed value. The improvement at point $x$ is:
$$I(x) = \max(0, f^* - f(x))$$
Since $f(x)$ is unknown, we take the expectation under the GP posterior:
$$\text{EI}(x) = \mathbb{E}[\max(0, f^* - f(x))]$$
Closed-form solution (for Gaussian posterior):
$$\text{EI}(x) = (f^* - \mu(x))\Phi(Z) + \sigma(x)\phi(Z)$$
where $Z = \frac{f^* - \mu(x)}{\sigma(x)}$, $\Phi$ is the CDF, and $\phi$ is the PDF of the standard normal.
```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.01):
    """
    Expected Improvement acquisition function.

    Parameters
    ----------
    mu : array, predicted means from GP
    sigma : array, predicted stds from GP
    f_best : float, best observed value (we minimize)
    xi : float, exploration-exploitation tradeoff
        - xi=0: pure exploitation
        - xi>0: encourages exploration

    Returns
    -------
    ei : array, EI values at each point
    """
    # Remember where sigma is (numerically) zero before clipping,
    # so those points can be zeroed out at the end
    zero_sigma = sigma < 1e-9
    sigma = np.maximum(sigma, 1e-9)

    # Improvement term (with optional exploration bonus xi)
    improvement = f_best - mu - xi

    # Standardized improvement
    Z = improvement / sigma

    # EI formula: two terms
    # Term 1: improvement * P(improvement)
    # Term 2: sigma * density (exploration bonus)
    ei = improvement * norm.cdf(Z) + sigma * norm.pdf(Z)

    # No improvement possible where we've already evaluated
    ei[zero_sigma] = 0.0

    return ei

# Example: why EI balances exploration and exploitation
# Point A: mu=0.5, sigma=0.1, f_best=0.4
# Point B: mu=0.6, sigma=0.5, f_best=0.4
#
# Point A: exploitative - low mean but low uncertainty
# Point B: exploratory - higher mean but high uncertainty
# EI will balance these based on improvement magnitude
```

The first term (improvement × CDF) rewards exploitation: points with low predicted mean. The second term (sigma × PDF) rewards exploration: points with high uncertainty. The genius of EI is that it naturally balances these: high uncertainty only matters if there is a reasonable probability of improvement.
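To see the numbers for the two commented points, here is a quick usage check (a small sketch; it assumes `expected_improvement` from the block above is in scope and passes xi=0 to match the closed-form formula):

```python
import numpy as np

# Assumes expected_improvement() defined above is in scope
f_best = 0.4
mu = np.array([0.5, 0.6])      # Point A, Point B
sigma = np.array([0.1, 0.5])

print(expected_improvement(mu, sigma, f_best, xi=0.0))
# approx. [0.008, 0.115]: the uncertain Point B scores higher, because its
# large sigma leaves real probability of a substantial improvement
```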
Probability of Improvement is the simplest acquisition function—the probability that a point will improve upon the current best:
$$\text{PI}(x) = P(f(x) < f^*) = \Phi\left(\frac{f^* - \mu(x)}{\sigma(x)}\right)$$
Advantages:
- Simple and interpretable: the score is a probability between 0 and 1.
- Closed form and cheap to compute from the GP mean and variance.

Disadvantages:
- Ignores the magnitude of improvement, only its probability.
- Tends to be greedy and prone to local optima unless ξ is increased.
```python
import numpy as np
from scipy.stats import norm

def probability_of_improvement(mu, sigma, f_best, xi=0.0):
    """
    Probability of Improvement acquisition function.

    PI = P(f(x) < f_best - xi)

    The xi parameter adds a margin: we want improvement
    of at least xi to count. This encourages exploration.
    """
    sigma = np.maximum(sigma, 1e-9)
    Z = (f_best - mu - xi) / sigma
    return norm.cdf(Z)

# PI vs EI example:
# Point A: mu=0.35, sigma=0.01 -> almost certain small improvement
# Point B: mu=0.50, sigma=0.30 -> uncertain but potentially big improvement
#
# PI: Prefers A (high probability of improvement)
# EI: May prefer B (accounts for improvement magnitude)
```

PI is often too greedy. In practice, a 99% chance of a 0.001 improvement beats a 30% chance of a 0.5 improvement under PI. This makes it prone to local optima. Use EI instead for most applications, or add a larger xi to PI.
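To make the greediness concrete, here is a small numerical comparison of the two candidates above (taking f_best = 0.4 as an assumed incumbent; both formulas are inlined so the snippet stands alone):

```python
import numpy as np
from scipy.stats import norm

f_best = 0.4  # assumed incumbent value for this illustration

# Point A: almost certain, tiny improvement; Point B: uncertain, possibly large
candidates = {"A": (0.35, 0.01), "B": (0.50, 0.30)}

for name, (mu, sigma) in candidates.items():
    Z = (f_best - mu) / sigma
    pi = norm.cdf(Z)                                         # Probability of Improvement
    ei = (f_best - mu) * norm.cdf(Z) + sigma * norm.pdf(Z)   # Expected Improvement
    print(f"{name}: PI = {pi:.3f}, EI = {ei:.3f}")

# A: PI = 1.000, EI = 0.050
# B: PI = 0.369, EI = 0.076   <- PI picks A, EI picks B
```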
Lower Confidence Bound (also called GP-LCB or UCB for maximization) takes an optimistic approach—it assumes the function could be as good as the lower bound of the confidence interval:
$$\text{LCB}(x) = \mu(x) - \kappa \cdot \sigma(x)$$
We minimize LCB, which means we prefer points with:
- Low predicted mean μ(x) (exploitation)
- High predicted uncertainty σ(x) (exploration)
The κ parameter controls the balance:
- Small κ: trust the mean and behave greedily (exploitation).
- Large κ: reward uncertainty and sample widely (exploration).
- κ ≈ 2.0 is a common static default; a short numerical sketch follows below.
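A quick numerical sketch of the effect of κ (the two candidate values below are hypothetical):

```python
import numpy as np

# Candidate A: confident, decent mean; candidate B: uncertain, worse mean
mu = np.array([0.50, 0.60])
sigma = np.array([0.05, 0.30])

for kappa in (0.1, 2.0):
    lcb = mu - kappa * sigma
    pick = "A" if np.argmin(lcb) == 0 else "B"
    print(f"kappa={kappa}: LCB={lcb} -> evaluate {pick}")

# kappa=0.1 -> evaluate A  (trust the means, exploit)
# kappa=2.0 -> evaluate B  (optimism about the uncertain point, explore)
```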
Theoretical advantage: GP-LCB has proven regret bounds. With appropriately scheduled κ that grows with log(t), cumulative regret is sublinear.
```python
import numpy as np

def lower_confidence_bound(mu, sigma, kappa=2.0):
    """
    Lower Confidence Bound acquisition function.

    LCB(x) = mu(x) - kappa * sigma(x)

    Minimizing LCB is equivalent to being optimistic:
    we evaluate where the function COULD be low.

    Parameters
    ----------
    kappa : float, exploration parameter
        Higher kappa = more exploration
        Theoretical optimal: kappa = sqrt(2 * log(t * d))
        where t is iteration number, d is dimensionality
    """
    return mu - kappa * sigma

def dynamic_lcb(mu, sigma, iteration, n_dims, delta=0.1):
    """
    LCB with theoretically-motivated dynamic kappa.

    Kappa grows logarithmically with iterations, providing
    more exploration early and more exploitation later.
    (n_dims appears in the full GP-UCB bound; this simplified
    schedule keeps only the log(t) dependence.)
    """
    beta = 2 * np.log(iteration**2 * np.pi**2 / (6 * delta))
    kappa = np.sqrt(beta)
    return mu - kappa * sigma
```

Thompson Sampling takes a different approach. Instead of computing an explicit acquisition function, we draw a random function from the GP posterior and evaluate wherever that sample is minimized.
Algorithm:
1. f_sample ~ GP(mu, K) # Draw function sample
2. x_next = argmin f_sample(x) # Find minimum
3. Return x_next
Why it works: The randomness of posterior sampling naturally balances exploration and exploitation. In uncertain regions, samples vary widely, so we sometimes evaluate there. In well-understood regions, samples agree, so we exploit if they're good.
Practical implementation: True posterior sampling requires drawing correlated function values, which is expensive. Common approximations (both shown in the code below):
- Sample independently at each candidate from N(μ(x), σ²(x)), ignoring posterior correlations. Very fast and often good enough.
- Sample jointly over a finite candidate set from the multivariate normal posterior. Respects correlations at the cost of a Cholesky factorization.
```python
import numpy as np

def thompson_sampling_simple(mu, sigma, n_samples=1):
    """
    Simplified Thompson Sampling using independent samples.

    For each candidate, sample from N(mu, sigma^2) and
    pick the candidate with the lowest sampled value.

    Note: This ignores posterior correlations but works
    well in practice and is very fast.
    """
    samples = np.random.normal(mu, sigma)
    return np.argmin(samples)

def thompson_sampling_correlated(gp, X_candidates, n_samples=1):
    """
    Correlated Thompson Sampling using the GP posterior.

    Draws from the joint posterior at all candidates,
    respecting correlations between nearby points.
    """
    # Posterior mean and covariance over the candidate set
    # (scikit-learn-style GaussianProcessRegressor API)
    mu, K = gp.predict(X_candidates, return_cov=True)

    # Sample from the multivariate Gaussian: f ~ N(mu, K)
    L = np.linalg.cholesky(K + 1e-6 * np.eye(len(K)))
    z = np.random.randn(len(mu))
    f_sample = mu + L @ z

    return np.argmin(f_sample)
```

Thompson Sampling naturally extends to parallel/batch settings. Simply draw k independent posterior samples and evaluate the minimum of each. Unlike batch EI, this requires no expensive computation of multi-point acquisition values.
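A minimal sketch of that batch extension, reusing the correlated sampler idea (again assuming a scikit-learn-style GP with `predict(..., return_cov=True)`):

```python
import numpy as np

def thompson_sampling_batch(gp, X_candidates, batch_size=4):
    """
    Batch Thompson Sampling: one posterior sample per batch slot,
    each proposing the minimizer of its own sampled function.

    Returns indices into X_candidates (duplicates are possible; in
    practice you may de-duplicate or jitter repeated picks).
    """
    # Joint posterior over the candidate set
    mu, K = gp.predict(X_candidates, return_cov=True)
    L = np.linalg.cholesky(K + 1e-6 * np.eye(len(K)))

    picks = []
    for _ in range(batch_size):
        f_sample = mu + L @ np.random.randn(len(mu))  # one correlated draw
        picks.append(int(np.argmin(f_sample)))
    return picks
```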
Each acquisition function has strengths and weaknesses. Here's a practical guide to selection:
| Function | Best For | Tradeoff Parameter | Recommendation |
|---|---|---|---|
| EI | General use, unknown objectives | ξ (small, e.g., 0.01) | Default choice for most problems |
| PI | Risk-averse, need any improvement | ξ (larger for exploration) | Use when small improvements matter |
| LCB | Theoretical guarantees needed | κ (typically 2.0) | Good when you want provable bounds |
| Thompson | Parallel optimization, simplicity | None | Excellent for batch settings |
Empirical findings and recommendations: For hyperparameter optimization, use EI as the default. Consider Thompson Sampling if you can evaluate multiple configurations in parallel.
Batch Acquisition (q-EI, q-LCB): Select multiple points to evaluate in parallel. q-EI optimizes the expected improvement from evaluating q points jointly. Computationally expensive but important for parallel computing.
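As a rough illustration of what the q-EI computation involves, here is a Monte Carlo sketch for scoring a single batch of q points (assuming a scikit-learn-style GP; maximizing q-EI then means searching over many such candidate batches, which is where the cost comes from):

```python
import numpy as np

def qei_monte_carlo(gp, X_batch, f_best, n_samples=1000, seed=0):
    """
    Monte Carlo estimate of q-EI for one batch of q points:
    the expected improvement of the BEST of the q sampled values
    over the current incumbent f_best.
    """
    rng = np.random.default_rng(seed)

    # Joint posterior over the q batch points
    mu, K = gp.predict(X_batch, return_cov=True)
    L = np.linalg.cholesky(K + 1e-9 * np.eye(len(K)))

    # Each row is one joint sample of the batch's function values
    samples = mu + rng.standard_normal((n_samples, len(mu))) @ L.T

    # Improvement achieved by the best point in each sampled outcome
    improvement = np.maximum(0.0, f_best - samples.min(axis=1))
    return improvement.mean()
```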
Knowledge Gradient: Instead of improvement in the objective, optimizes expected improvement in knowledge about the optimum. More sophisticated but computationally intensive.
Entropy Search: Minimizes entropy (uncertainty) about the location of the optimum. Principled information-theoretic approach, but complex to implement.
Portfolio Methods: Run multiple acquisition functions and use a bandit algorithm to select among them. Hedges against any single acquisition function performing poorly.
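A minimal sketch of the portfolio idea, using a softmax (Hedge-style) weighting over accumulated rewards; the reward signal here (negative observed objective) is a placeholder you would adapt to your setup:

```python
import numpy as np

class AcquisitionPortfolio:
    """Pick one acquisition function per round; functions that led to
    good (low) objective values get chosen more often over time."""

    def __init__(self, names, eta=0.1):
        self.names = list(names)
        self.eta = eta                        # learning rate for the weights
        self.gains = np.zeros(len(self.names))
        self.last = None

    def choose(self):
        # Softmax over accumulated gains (Hedge-style selection)
        logits = self.eta * self.gains
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        self.last = np.random.choice(len(self.names), p=probs)
        return self.names[self.last]

    def update(self, observed_y):
        # Lower observed objective -> larger reward for the chosen function
        self.gains[self.last] += -observed_y

# Usage sketch:
# portfolio = AcquisitionPortfolio(["ei", "pi", "lcb"])
# name = portfolio.choose()     # pick which acquisition to optimize this round
# ...run one BO step with it, observe y...
# portfolio.update(y)
```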
For most hyperparameter optimization tasks, EI with default parameters is sufficient. Advanced acquisition functions provide marginal improvements but add complexity. Focus on good surrogate modeling and search space design before optimizing acquisition functions.
What's next: We'll explore the exploration-exploitation tradeoff in depth, understanding the theoretical foundations and practical strategies for balancing these competing objectives.
You now understand the major acquisition functions for Bayesian Optimization. You can implement EI, PI, LCB, and Thompson Sampling, explain their tradeoffs, and select appropriate functions for different optimization scenarios.