At the core of every Bayesian Optimization algorithm lies a surrogate model—a probabilistic approximation of the expensive objective function. While several surrogate models exist, Gaussian Processes (GPs) remain the gold standard for continuous hyperparameter spaces.
What makes GPs special? They provide not just predictions, but calibrated uncertainty estimates. When a GP says "the predicted loss is 0.15 ± 0.08," that uncertainty is mathematically principled, derived from the data and prior assumptions. This uncertainty is what enables intelligent exploration in Bayesian Optimization.
By the end of this page, you will understand: the mathematical definition of Gaussian Processes, how they provide predictions with uncertainty, the role of kernel functions in encoding assumptions, and practical implementation details for hyperparameter optimization.
A Gaussian Process is a probability distribution over functions. Formally:
Definition: A Gaussian Process is a collection of random variables, any finite number of which have a joint Gaussian distribution.
A GP is completely specified by two functions: a mean function $m(x) = \mathbb{E}[f(x)]$ and a covariance (kernel) function $k(x, x') = \mathrm{Cov}(f(x), f(x'))$.
We write: $f(x) \sim \mathcal{GP}(m(x), k(x, x'))$
Intuition: Instead of parameterizing a function with a finite set of weights (like neural networks), GPs define a distribution over the infinite-dimensional space of all possible functions. The kernel encodes our beliefs about function properties like smoothness and periodicity.
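To make "a distribution over functions" concrete, here is a minimal sketch that draws a few sample functions from a zero-mean GP prior with an RBF kernel. The grid, length scale, and seed are arbitrary illustrative choices; the point is that any finite set of inputs has a joint Gaussian distribution, so sampling from the prior is just sampling from a multivariate normal.

```python
import numpy as np

# Evaluate the GP prior on a dense grid of inputs
x = np.linspace(0.0, 1.0, 100)[:, np.newaxis]

# RBF kernel: k(x, x') = exp(-||x - x'||^2 / (2 * l^2)), with l = 0.1
length_scale = 0.1
sq_dist = (x - x.T) ** 2
K = np.exp(-0.5 * sq_dist / length_scale**2)

# Any finite set of inputs has a joint Gaussian distribution,
# so sampling from the prior is just sampling from N(0, K).
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(
    mean=np.zeros(len(x)), cov=K + 1e-10 * np.eye(len(x)), size=3
)
# Each row of `samples` is one plausible function under the prior.
```

With the prior in hand, conditioning on observations gives the posterior used for prediction, which the from-scratch implementation below makes explicit.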
```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular


class GaussianProcess:
    """
    Gaussian Process Regressor from scratch.

    Demonstrates the core math of GP prediction:
    - Prior:     f ~ GP(0, K)
    - Posterior: f* | X, y, X* ~ N(mu*, Sigma*)
    """

    def __init__(self, kernel, noise: float = 1e-6):
        self.kernel = kernel
        self.noise = noise
        self.X_train = None
        self.y_train = None
        self.L = None      # Cholesky factor
        self.alpha = None  # Precomputed weights

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Fit GP to training data."""
        self.X_train = X
        self.y_train = y

        # Compute kernel matrix K(X, X)
        K = self.kernel(X, X) + self.noise * np.eye(len(X))

        # Cholesky decomposition: K = L @ L.T
        self.L = cholesky(K, lower=True)

        # Solve for alpha: K @ alpha = y
        self.alpha = solve_triangular(
            self.L.T, solve_triangular(self.L, y, lower=True)
        )

    def predict(self, X_test: np.ndarray):
        """
        Predict mean and std at test points.

        Key equations (from GP posterior):
        mu*    = K(X*, X) @ K(X, X)^{-1} @ y
        Sigma* = K(X*, X*) - K(X*, X) @ K(X, X)^{-1} @ K(X, X*)
        """
        # Cross-covariance K(X*, X)
        K_star = self.kernel(X_test, self.X_train)

        # Predictive mean: mu* = K* @ alpha
        mu = K_star @ self.alpha

        # Predictive variance
        v = solve_triangular(self.L, K_star.T, lower=True)
        K_ss = self.kernel(X_test, X_test)
        var = np.diag(K_ss) - np.sum(v**2, axis=0)
        std = np.sqrt(np.maximum(var, 1e-10))

        return mu, std
```

The kernel (covariance function) is the most important design choice in a GP. It encodes assumptions about the function we're modeling.
Common kernels for Bayesian Optimization:
| Kernel | Formula | Properties | Use Case |
|---|---|---|---|
| RBF (Squared Exponential) | $k(x,x') = \sigma^2 \exp(-\frac{||x-x'||^2}{2l^2})$ | Infinitely differentiable, very smooth | Smooth objectives |
| Matérn 5/2 | $k(x,x') = \sigma^2(1 + \frac{\sqrt{5}r}{l} + \frac{5r^2}{3l^2})\exp(-\frac{\sqrt{5}r}{l})$ | Twice differentiable | Default choice (recommended) |
| Matérn 3/2 | $k(x,x') = \sigma^2(1 + \frac{\sqrt{3}r}{l})\exp(-\frac{\sqrt{3}r}{l})$ | Once differentiable | Less smooth objectives |
| Rational Quadratic | $k(x,x') = \sigma^2(1 + \frac{r^2}{2\alpha l^2})^{-\alpha}$ | Mixture of RBFs | Multi-scale variation |
```python
import numpy as np


class Matern52Kernel:
    """
    Matérn 5/2 kernel - the recommended default for BO.

    Twice differentiable, providing enough smoothness for
    gradient-based acquisition optimization while being more
    realistic than the infinitely smooth RBF.
    """

    def __init__(self, length_scale: float = 1.0, variance: float = 1.0):
        self.length_scale = length_scale
        self.variance = variance

    def __call__(self, X1: np.ndarray, X2: np.ndarray) -> np.ndarray:
        # Compute pairwise distances
        dist = np.sqrt(np.sum(
            (X1[:, np.newaxis, :] - X2[np.newaxis, :, :]) ** 2,
            axis=-1
        ))

        # Scaled distance
        r = np.sqrt(5) * dist / self.length_scale

        # Matérn 5/2 formula
        return self.variance * (1 + r + r**2 / 3) * np.exp(-r)


class ARDKernel:
    """
    Automatic Relevance Determination (ARD) kernel.

    Uses a separate length scale per dimension, allowing the GP
    to automatically learn which hyperparameters matter most.
    """

    def __init__(self, length_scales: np.ndarray, variance: float = 1.0):
        self.length_scales = np.array(length_scales)
        self.variance = variance

    def __call__(self, X1: np.ndarray, X2: np.ndarray) -> np.ndarray:
        # Scale each dimension by its length scale
        X1_scaled = X1 / self.length_scales
        X2_scaled = X2 / self.length_scales

        # Compute squared distances
        sq_dist = np.sum(
            (X1_scaled[:, np.newaxis, :] - X2_scaled[np.newaxis, :, :]) ** 2,
            axis=-1
        )

        return self.variance * np.exp(-0.5 * sq_dist)
```

The length scale $l$ determines how far apart two points must be before their function values become uncorrelated. A small $l$ means the function can change rapidly; a large $l$ means the function is slowly varying. For hyperparameter optimization, length scales are typically learned from data.
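A tiny sketch makes the effect of the length scale visible. It uses the Matern52Kernel class above with two arbitrary length-scale values and compares the covariance between two points one unit apart:

```python
import numpy as np

x_a = np.array([[0.0]])
x_b = np.array([[1.0]])  # one unit away

short = Matern52Kernel(length_scale=0.2)
long_ = Matern52Kernel(length_scale=5.0)

# With l = 0.2 the points are essentially uncorrelated (~0.0008):
# the function is free to change rapidly between them.
print(short(x_a, x_b))

# With l = 5.0 they remain strongly correlated (~0.97):
# the function is assumed to vary slowly.
print(long_(x_a, x_b))
```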
The power of GPs comes from Bayesian updating. Given observations $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$, the posterior is also a GP with closed-form mean and covariance:
Posterior mean: $\mu_*(x) = k(x, X) K^{-1} y$
Posterior variance: $\sigma^2_*(x) = k(x, x) - k(x, X) K^{-1} k(X, x)$
where $K = k(X, X) + \sigma_n^2 I$ is the kernel matrix with observation noise.
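As a sanity check of these equations, here is a minimal usage sketch that combines the from-scratch GaussianProcess and Matern52Kernel classes defined above. The toy objective, data points, and length scale are made up purely for illustration:

```python
import numpy as np

# Toy "expensive" objective (stands in for, e.g., validation loss vs. a hyperparameter)
def objective(x):
    return np.sin(3 * x) + 0.5 * x

# A handful of observations
X_train = np.array([[0.1], [0.4], [0.7], [0.9]])
y_train = objective(X_train).ravel()

gp = GaussianProcess(kernel=Matern52Kernel(length_scale=0.3), noise=1e-6)
gp.fit(X_train, y_train)

# Posterior mean and uncertainty on a grid of candidate points
X_test = np.linspace(0, 1, 5)[:, np.newaxis]
mu, std = gp.predict(X_test)

# Near the training points std is close to zero; away from them it grows,
# exactly as the posterior variance formula predicts.
print(np.round(mu, 3))
print(np.round(std, 3))
```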
Key properties:

- The posterior is available in closed form; "training" is just linear algebra, with no iterative optimization of the latent function.
- At observed points, the posterior mean passes through the data (up to the noise level $\sigma_n^2$) and the posterior variance collapses toward zero.
- Far from the data, the posterior reverts to the prior: the mean approaches $m(x)$ and the variance approaches $k(x, x)$.
- The posterior variance depends only on the input locations $X$, not on the observed values $y$.
GP inference requires O(n³) time and O(n²) memory due to matrix inversion. For n > 1000 observations, this becomes prohibitive. Fortunately, hyperparameter optimization rarely exceeds a few hundred evaluations, making GPs practical. For larger scales, sparse GPs or other surrogates are needed.
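A rough timing sketch illustrates the cubic scaling. The matrix sizes are arbitrary and absolute timings depend on your machine and BLAS library; only the growth rate matters:

```python
import time
import numpy as np

rng = np.random.default_rng(0)

for n in (200, 400, 800, 1600):
    # Build a random positive-definite "kernel matrix" of size n x n
    A = rng.standard_normal((n, n))
    K = A @ A.T + n * np.eye(n)

    start = time.perf_counter()
    np.linalg.cholesky(K)  # the O(n^3) step in GP fitting
    elapsed = time.perf_counter() - start
    print(f"n = {n:5d}: {elapsed:.4f} s")

# Doubling n should increase the time by roughly 8x (2^3),
# which is why exact GPs are limited to a few thousand points.
```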
The kernel has hyperparameters (length scales, variance) that must be set. The standard approach is maximum marginal likelihood:
$$\log p(y | X, \theta) = -\frac{1}{2}y^T K^{-1} y - \frac{1}{2}\log|K| - \frac{n}{2}\log 2\pi$$
This balances model fit (the first term) against model complexity (the second term). We maximize it with gradient-based methods such as L-BFGS, usually from several random restarts to avoid poor local optima.
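To make the formula concrete, here is a minimal sketch that evaluates the log marginal likelihood for a given kernel, using the same zero-mean, Cholesky-based setup as the from-scratch class above. The function name and the commented usage are illustrative, not part of any library API; in practice you would maximize this quantity over the kernel hyperparameters $\theta$.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular


def log_marginal_likelihood(kernel, X, y, noise=1e-6):
    """Log p(y | X, theta) for a zero-mean GP, computed via Cholesky."""
    n = len(X)
    K = kernel(X, X) + noise * np.eye(n)

    # K = L @ L.T, so log|K| = 2 * sum(log(diag(L)))
    L = cholesky(K, lower=True)
    alpha = solve_triangular(L.T, solve_triangular(L, y, lower=True))

    data_fit = -0.5 * y @ alpha                 # -1/2 y^T K^{-1} y
    complexity = -np.sum(np.log(np.diag(L)))    # -1/2 log|K|
    constant = -0.5 * n * np.log(2 * np.pi)
    return data_fit + complexity + constant


# Hypothetical usage: compare a few candidate length scales on observed data
# for ls in (0.1, 1.0, 10.0):
#     print(ls, log_marginal_likelihood(Matern52Kernel(length_scale=ls),
#                                        X_observed, y_observed))
```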
Practical considerations: normalize the targets for numerical stability, add a small noise term (jitter) to the kernel matrix, and run the optimizer from multiple restarts. The scikit-learn snippet below puts these together:
```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, ConstantKernel


def create_optimized_gp():
    """
    Create a GP with automatic hyperparameter optimization.

    sklearn handles marginal likelihood optimization internally.
    """
    kernel = ConstantKernel(1.0, (1e-3, 1e3)) * Matern(
        length_scale=1.0,
        length_scale_bounds=(1e-2, 1e2),
        nu=2.5  # Matérn 5/2
    )

    return GaussianProcessRegressor(
        kernel=kernel,
        alpha=1e-6,               # Observation noise
        normalize_y=True,         # Recommended for stability
        n_restarts_optimizer=10,  # Multiple restarts
        random_state=42
    )


# Usage in Bayesian Optimization
gp = create_optimized_gp()
gp.fit(X_observed, y_observed)  # Learns hyperparameters
mean, std = gp.predict(X_candidates, return_std=True)
```

When using GPs as surrogates for hyperparameter optimization, it is also important to know their limitations:
GPs struggle with: (1) Categorical/discrete hyperparameters without special handling, (2) High-dimensional spaces (d > 20), (3) Very large datasets (n > 1000), (4) Highly discontinuous objectives. For mixed spaces with categoricals, consider Tree-structured Parzen Estimators (TPE) instead.
What's next: With surrogate models understood, we'll explore acquisition functions—the strategies that use GP predictions and uncertainties to decide where to evaluate next.
You now understand Gaussian Processes as surrogate models for Bayesian Optimization. You can explain how GPs provide predictions with uncertainty, the role of kernel functions, and practical considerations for hyperparameter optimization.