At the core of every Bayesian Optimization algorithm lies a surrogate model—a probabilistic approximation of the expensive objective function. While several surrogate models exist, Gaussian Processes (GPs) remain the gold standard for continuous hyperparameter spaces.
What makes GPs special? They provide not just predictions, but calibrated uncertainty estimates. When a GP says "the predicted loss is 0.15 ± 0.08," that uncertainty is mathematically principled, derived from the data and prior assumptions. This uncertainty is what enables intelligent exploration in Bayesian Optimization.
By the end of this page, you will understand: the mathematical definition of Gaussian Processes, how they provide predictions with uncertainty, the role of kernel functions in encoding assumptions, and practical implementation details for hyperparameter optimization.
A Gaussian Process is a probability distribution over functions. Formally:
Definition: A Gaussian Process is a collection of random variables, any finite number of which have a joint Gaussian distribution.
A GP is completely specified by two functions: a mean function $m(x) = \mathbb{E}[f(x)]$ and a covariance (kernel) function $k(x, x') = \mathrm{Cov}(f(x), f(x'))$.
We write: $f(x) \sim \mathcal{GP}(m(x), k(x, x'))$
Intuition: Instead of parameterizing a function with a finite set of weights (like neural networks), GPs define a distribution over the infinite-dimensional space of all possible functions. The kernel encodes our beliefs about function properties like smoothness and periodicity.
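To make "a distribution over functions" concrete, here is a minimal sketch that draws a few sample functions from a zero-mean GP prior with an RBF kernel. The grid, length scale, and seed are arbitrary illustrative choices; the point is that any finite set of inputs has a joint Gaussian distribution, so sampling from the prior is just sampling from a multivariate normal.

```python
import numpy as np

# Evaluate the GP prior on a dense grid of inputs
x = np.linspace(0.0, 1.0, 100)[:, np.newaxis]

# RBF kernel: k(x, x') = exp(-||x - x'||^2 / (2 * l^2)), with l = 0.1
length_scale = 0.1
sq_dist = (x - x.T) ** 2
K = np.exp(-0.5 * sq_dist / length_scale**2)

# Any finite set of inputs has a joint Gaussian distribution,
# so sampling from the prior is just sampling from N(0, K).
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(
    mean=np.zeros(len(x)), cov=K + 1e-10 * np.eye(len(x)), size=3
)
# Each row of `samples` is one plausible function under the prior.
```

With the prior in hand, conditioning on observations gives the posterior used for prediction, which the from-scratch implementation below makes explicit.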
```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular


class GaussianProcess:
    """
    Gaussian Process Regressor from scratch.

    Demonstrates the core math of GP prediction:
    - Prior:     f ~ GP(0, K)
    - Posterior: f* | X, y, X* ~ N(mu*, Sigma*)
    """

    def __init__(self, kernel, noise: float = 1e-6):
        self.kernel = kernel
        self.noise = noise
        self.X_train = None
        self.y_train = None
        self.L = None      # Cholesky factor
        self.alpha = None  # Precomputed weights

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Fit GP to training data."""
        self.X_train = X
        self.y_train = y

        # Compute kernel matrix K(X, X)
        K = self.kernel(X, X) + self.noise * np.eye(len(X))

        # Cholesky decomposition: K = L @ L.T
        self.L = cholesky(K, lower=True)

        # Solve for alpha: K @ alpha = y
        self.alpha = solve_triangular(
            self.L.T, solve_triangular(self.L, y, lower=True)
        )

    def predict(self, X_test: np.ndarray):
        """
        Predict mean and std at test points.

        Key equations (from GP posterior):
        mu*    = K(X*, X) @ K(X, X)^{-1} @ y
        Sigma* = K(X*, X*) - K(X*, X) @ K(X, X)^{-1} @ K(X, X*)
        """
        # Cross-covariance K(X*, X)
        K_star = self.kernel(X_test, self.X_train)

        # Predictive mean: mu* = K* @ alpha
        mu = K_star @ self.alpha

        # Predictive variance
        v = solve_triangular(self.L, K_star.T, lower=True)
        K_ss = self.kernel(X_test, X_test)
        var = np.diag(K_ss) - np.sum(v**2, axis=0)
        std = np.sqrt(np.maximum(var, 1e-10))

        return mu, std
```

The kernel (covariance function) is the most important design choice in a GP. It encodes assumptions about the function we're modeling.
Common kernels for Bayesian Optimization:
| Kernel | Formula | Properties | Use Case |
|---|---|---|---|
| RBF (Squared Exponential) | $k(x,x') = \sigma^2 \exp(-\frac{||x-x'||^2}{2l^2})$ | Infinitely differentiable, very smooth | Smooth objectives |
| Matérn 5/2 | $k(x,x') = \sigma^2(1 + \frac{\sqrt{5}r}{l} + \frac{5r^2}{3l^2})\exp(-\frac{\sqrt{5}r}{l})$ | Twice differentiable | Default choice (recommended) |
| Matérn 3/2 | $k(x,x') = \sigma^2(1 + \frac{\sqrt{3}r}{l})\exp(-\frac{\sqrt{3}r}{l})$ | Once differentiable | Less smooth objectives |
| Rational Quadratic | $k(x,x') = \sigma^2(1 + \frac{r^2}{2\alpha l^2})^{-\alpha}$ | Mixture of RBFs | Multi-scale variation |
```python
import numpy as np


class Matern52Kernel:
    """
    Matérn 5/2 kernel - the recommended default for BO.

    Twice differentiable, providing enough smoothness for
    gradient-based acquisition optimization while being more
    realistic than the infinitely smooth RBF.
    """

    def __init__(self, length_scale: float = 1.0, variance: float = 1.0):
        self.length_scale = length_scale
        self.variance = variance

    def __call__(self, X1: np.ndarray, X2: np.ndarray) -> np.ndarray:
        # Compute pairwise distances
        dist = np.sqrt(np.sum(
            (X1[:, np.newaxis, :] - X2[np.newaxis, :, :]) ** 2,
            axis=-1
        ))

        # Scaled distance
        r = np.sqrt(5) * dist / self.length_scale

        # Matérn 5/2 formula
        return self.variance * (1 + r + r**2 / 3) * np.exp(-r)


class ARDKernel:
    """
    Automatic Relevance Determination (ARD) kernel.

    Uses a separate length scale per dimension, allowing the GP
    to automatically learn which hyperparameters matter most.
    """

    def __init__(self, length_scales: np.ndarray, variance: float = 1.0):
        self.length_scales = np.array(length_scales)
        self.variance = variance

    def __call__(self, X1: np.ndarray, X2: np.ndarray) -> np.ndarray:
        # Scale each dimension by its length scale
        X1_scaled = X1 / self.length_scales
        X2_scaled = X2 / self.length_scales

        # Compute squared distances
        sq_dist = np.sum(
            (X1_scaled[:, np.newaxis, :] - X2_scaled[np.newaxis, :, :]) ** 2,
            axis=-1
        )

        return self.variance * np.exp(-0.5 * sq_dist)
```

The length scale $l$ determines how far apart two points must be before their function values become uncorrelated. A small $l$ means the function can change rapidly; a large $l$ means the function is slowly varying. For hyperparameter optimization, length scales are typically learned from data.
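A tiny sketch makes the effect of the length scale visible. It uses the Matern52Kernel class above with two arbitrary length-scale values and compares the covariance between two points one unit apart:

```python
import numpy as np

x_a = np.array([[0.0]])
x_b = np.array([[1.0]])  # one unit away

short = Matern52Kernel(length_scale=0.2)
long_ = Matern52Kernel(length_scale=5.0)

# With l = 0.2 the points are essentially uncorrelated (~0.0008):
# the function is free to change rapidly between them.
print(short(x_a, x_b))

# With l = 5.0 they remain strongly correlated (~0.97):
# the function is assumed to vary slowly.
print(long_(x_a, x_b))
```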
The power of GPs comes from Bayesian updating. Given observations $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$, the posterior is also a GP with closed-form mean and covariance:
Posterior mean: $\mu_*(x) = k(x, X) K^{-1} y$
Posterior variance: $\sigma^2_*(x) = k(x, x) - k(x, X) K^{-1} k(X, x)$
where $K = k(X, X) + \sigma_n^2 I$ is the kernel matrix with observation noise.
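As a sanity check of these equations, here is a minimal usage sketch that combines the from-scratch GaussianProcess and Matern52Kernel classes defined above. The toy objective, data points, and length scale are made up purely for illustration:

```python
import numpy as np

# Toy "expensive" objective (stands in for, e.g., validation loss vs. a hyperparameter)
def objective(x):
    return np.sin(3 * x) + 0.5 * x

# A handful of observations
X_train = np.array([[0.1], [0.4], [0.7], [0.9]])
y_train = objective(X_train).ravel()

gp = GaussianProcess(kernel=Matern52Kernel(length_scale=0.3), noise=1e-6)
gp.fit(X_train, y_train)

# Posterior mean and uncertainty on a grid of candidate points
X_test = np.linspace(0, 1, 5)[:, np.newaxis]
mu, std = gp.predict(X_test)

# Near the training points std is close to zero; away from them it grows,
# exactly as the posterior variance formula predicts.
print(np.round(mu, 3))
print(np.round(std, 3))
```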
Key properties:

- The posterior is available in closed form; "training" is just linear algebra, with no iterative optimization of the latent function.
- At observed points, the posterior mean passes through the data (up to the noise level $\sigma_n^2$) and the posterior variance collapses toward zero.
- Far from the data, the posterior reverts to the prior: the mean approaches $m(x)$ and the variance approaches $k(x, x)$.
- The posterior variance depends only on the input locations $X$, not on the observed values $y$.
GP inference requires O(n³) time and O(n²) memory due to matrix inversion. For n > 1000 observations, this becomes prohibitive. Fortunately, hyperparameter optimization rarely exceeds a few hundred evaluations, making GPs practical. For larger scales, sparse GPs or other surrogates are needed.
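A rough timing sketch illustrates the cubic scaling. The matrix sizes are arbitrary and absolute timings depend on your machine and BLAS library; only the growth rate matters:

```python
import time
import numpy as np

rng = np.random.default_rng(0)

for n in (200, 400, 800, 1600):
    # Build a random positive-definite "kernel matrix" of size n x n
    A = rng.standard_normal((n, n))
    K = A @ A.T + n * np.eye(n)

    start = time.perf_counter()
    np.linalg.cholesky(K)  # the O(n^3) step in GP fitting
    elapsed = time.perf_counter() - start
    print(f"n = {n:5d}: {elapsed:.4f} s")

# Doubling n should increase the time by roughly 8x (2^3),
# which is why exact GPs are limited to a few thousand points.
```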
The kernel has hyperparameters (length scales, variance) that must be set. The standard approach is maximum marginal likelihood:
$$\log p(y | X, \theta) = -\frac{1}{2}y^T K^{-1} y - \frac{1}{2}\log|K| - \frac{n}{2}\log 2\pi$$
This balances model fit (the first term) against model complexity (the second term). We maximize it with gradient-based methods such as L-BFGS, usually from several random restarts to avoid poor local optima.
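To make the formula concrete, here is a minimal sketch that evaluates the log marginal likelihood for a given kernel, using the same zero-mean, Cholesky-based setup as the from-scratch class above. The function name and the commented usage are illustrative, not part of any library API; in practice you would maximize this quantity over the kernel hyperparameters $\theta$.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular


def log_marginal_likelihood(kernel, X, y, noise=1e-6):
    """Log p(y | X, theta) for a zero-mean GP, computed via Cholesky."""
    n = len(X)
    K = kernel(X, X) + noise * np.eye(n)

    # K = L @ L.T, so log|K| = 2 * sum(log(diag(L)))
    L = cholesky(K, lower=True)
    alpha = solve_triangular(L.T, solve_triangular(L, y, lower=True))

    data_fit = -0.5 * y @ alpha                 # -1/2 y^T K^{-1} y
    complexity = -np.sum(np.log(np.diag(L)))    # -1/2 log|K|
    constant = -0.5 * n * np.log(2 * np.pi)
    return data_fit + complexity + constant


# Hypothetical usage: compare a few candidate length scales on observed data
# for ls in (0.1, 1.0, 10.0):
#     print(ls, log_marginal_likelihood(Matern52Kernel(length_scale=ls),
#                                        X_observed, y_observed))
```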
Practical considerations: normalize the targets for numerical stability, add a small noise term (jitter) to the kernel matrix, and run the optimizer from multiple restarts. The scikit-learn snippet below puts these together:
```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, ConstantKernel


def create_optimized_gp():
    """
    Create a GP with automatic hyperparameter optimization.

    sklearn handles marginal likelihood optimization internally.
    """
    kernel = ConstantKernel(1.0, (1e-3, 1e3)) * Matern(
        length_scale=1.0,
        length_scale_bounds=(1e-2, 1e2),
        nu=2.5  # Matérn 5/2
    )

    return GaussianProcessRegressor(
        kernel=kernel,
        alpha=1e-6,               # Observation noise
        normalize_y=True,         # Recommended for stability
        n_restarts_optimizer=10,  # Multiple restarts
        random_state=42
    )


# Usage in Bayesian Optimization
gp = create_optimized_gp()
gp.fit(X_observed, y_observed)  # Learns hyperparameters
mean, std = gp.predict(X_candidates, return_std=True)
```

When using GPs as surrogates for hyperparameter optimization, it is also important to know their limitations:
GPs struggle with: (1) Categorical/discrete hyperparameters without special handling, (2) High-dimensional spaces (d > 20), (3) Very large datasets (n > 1000), (4) Highly discontinuous objectives. For mixed spaces with categoricals, consider Tree-structured Parzen Estimators (TPE) instead.
What's next: With surrogate models understood, we'll explore acquisition functions—the strategies that use GP predictions and uncertainties to decide where to evaluate next.
You now understand Gaussian Processes as surrogate models for Bayesian Optimization. You can explain how GPs provide predictions with uncertainty, the role of kernel functions, and practical considerations for hyperparameter optimization.