Every kernel we've discussed has hyperparameters: length scales, signal variances, periods, smoothness parameters. These hyperparameters profoundly affect the GP's predictions and uncertainty estimates.
Setting them incorrectly leads to:
- Overfitting: length scales that are too short (or noise variance set too low) make the GP chase noise.
- Underfitting: length scales that are too long smooth away real structure in the data.
- Miscalibrated uncertainty: a mis-specified noise variance makes predictive intervals too narrow or too wide.
This page covers how to learn optimal hyperparameters from data.
By the end of this page, you will understand the Type-II Maximum Likelihood (empirical Bayes) approach to hyperparameter optimization, implement gradient-based optimization of the log marginal likelihood, and know practical strategies for avoiding local optima.
Let $\theta$ denote the vector of all hyperparameters. For an RBF kernel with noise:
$$\theta = (\ell, \sigma_f^2, \sigma_n^2)$$
Our goal is to find:
$$\theta^* = \arg\max_\theta p(\mathbf{y} | X, \theta)$$
This is the marginal likelihood (or evidence): the probability of the observed data given the hyperparameters, with the function values $f$ integrated out.
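Because both the prior over $f$ and the likelihood are Gaussian, this integral is available in closed form. Writing $K = K_\theta + \sigma_n^2 I$ for the kernel matrix plus noise:

$$\log p(\mathbf{y} | X, \theta) = -\frac{1}{2}\mathbf{y}^T K^{-1}\mathbf{y} - \frac{1}{2}\log|K| - \frac{n}{2}\log 2\pi$$

The first term measures data fit, the second penalizes model complexity, and the third is a normalization constant.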
We could maximize how well the GP fits training data, but this leads to overfitting. The marginal likelihood automatically balances fit and complexity through Bayesian Occam's razor. It penalizes overly complex models that 'waste' probability mass on functions that could explain many datasets.
The Optimization Landscape is Non-Convex
The log marginal likelihood is generally non-convex in the hyperparameters. This means:
- Multiple local optima can exist, often corresponding to different explanations of the data (e.g., a short length scale with low noise vs. a long length scale with high noise).
- The optimum found depends on the initialization.
- There is no guarantee of reaching the global maximum.
Despite this, gradient-based optimization works well in practice when combined with multiple restarts or careful initialization.
The standard approach is gradient ascent on the log marginal likelihood. We can compute the gradient analytically:
$$\frac{\partial \log p(\mathbf{y}|X,\theta)}{\partial \theta_j} = \frac{1}{2}\text{tr}\left((\boldsymbol{\alpha}\boldsymbol{\alpha}^T - K^{-1})\frac{\partial K}{\partial \theta_j}\right)$$
where $\boldsymbol{\alpha} = K^{-1}\mathbf{y}$ and $K = K_\theta + \sigma_n^2 I$.
This gradient has two interpretable terms: $\boldsymbol{\alpha}\boldsymbol{\alpha}^T$ is a data-fit term that rewards kernels which explain the observations, while $-K^{-1}$ is a complexity penalty that discourages overly flexible kernels. At an optimum, the two balance.
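As a sanity check, the analytic gradient for the RBF length scale can be compared against a central finite difference. This is a minimal sketch; the kernel, data, and function names here are illustrative, not part of the implementation below.

```python
import numpy as np

def rbf(X, ls, sv):
    """RBF kernel matrix for inputs X with length scale ls, signal variance sv."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return sv * np.exp(-d2 / (2 * ls**2))

def log_ml(X, y, ls, sv, nv):
    """Log marginal likelihood via a Cholesky factorization."""
    K = rbf(X, ls, sv) + nv * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha - np.sum(np.log(np.diag(L)))
            - 0.5 * len(y) * np.log(2 * np.pi))

def grad_ls(X, y, ls, sv, nv):
    """Analytic gradient w.r.t. the length scale using the trace formula."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = rbf(X, ls, sv) + nv * np.eye(len(X))
    dK = rbf(X, ls, sv) * d2 / ls**3  # dK/d(ls); the noise term has zero derivative
    Kinv = np.linalg.inv(K)
    alpha = Kinv @ y
    return 0.5 * np.trace((np.outer(alpha, alpha) - Kinv) @ dK)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(20, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)

analytic = grad_ls(X, y, 1.0, 1.0, 0.01)
eps = 1e-6
numeric = (log_ml(X, y, 1.0 + eps, 1.0, 0.01)
           - log_ml(X, y, 1.0 - eps, 1.0, 0.01)) / (2 * eps)
# The two estimates should agree to many decimal places
print(analytic, numeric)
```

Agreement between the two confirms the derivative $\partial K / \partial \ell$ was derived correctly; this is a worthwhile check whenever you hand-code kernel gradients.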
```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import cdist


class GPWithOptimization:
    """GP with gradient-based hyperparameter optimization."""

    def __init__(self, noise_variance=1e-2):
        self.noise_variance = noise_variance
        self.length_scale = 1.0
        self.signal_variance = 1.0

    def _kernel(self, X1, X2, ls, sv):
        """RBF kernel with given hyperparameters."""
        sq_dist = cdist(X1, X2, 'sqeuclidean')
        return sv * np.exp(-sq_dist / (2 * ls**2))

    def _neg_log_marginal_likelihood(self, theta, X, y):
        """Negative log marginal likelihood (for minimization)."""
        ls, sv, nv = np.exp(theta)  # Log-transform for unconstrained opt
        K = self._kernel(X, X, ls, sv) + nv * np.eye(len(X))
        try:
            L = np.linalg.cholesky(K)
        except np.linalg.LinAlgError:
            return 1e10  # Return large value if not PSD
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        # Log marginal likelihood
        log_ml = -0.5 * y.T @ alpha
        log_ml -= np.sum(np.log(np.diag(L)))
        log_ml -= 0.5 * len(y) * np.log(2 * np.pi)
        return -log_ml  # Negative for minimization

    def optimize(self, X, y, n_restarts=5):
        """Optimize hyperparameters with multiple restarts."""
        best_theta = None
        best_nll = np.inf
        for _ in range(n_restarts):
            # Random initialization in log space
            theta0 = np.random.randn(3)
            result = minimize(
                self._neg_log_marginal_likelihood,
                theta0,
                args=(X, y),
                method='L-BFGS-B',
            )
            if result.fun < best_nll:
                best_nll = result.fun
                best_theta = result.x
        # Set optimized hyperparameters
        self.length_scale, self.signal_variance, self.noise_variance = \
            np.exp(best_theta)
        return {
            'length_scale': self.length_scale,
            'signal_variance': self.signal_variance,
            'noise_variance': self.noise_variance,
            'log_marginal_likelihood': -best_nll,
        }
```

Several strategies improve optimization reliability:
- Optimize in log space, so hyperparameters stay positive without explicit constraints.
- Use multiple random restarts to escape poor local optima.
- Initialize from data-driven heuristics rather than arbitrary defaults.
- Add a small jitter to the kernel diagonal for numerical stability.
Very short length scales with low noise variance often indicate overfitting. Very long length scales may indicate the model is reducing to a constant. Always visualize predictions and check that hyperparameters are physically reasonable.
While gradient-based MLE is standard, alternatives exist:
| Method | Approach | Pros/Cons |
|---|---|---|
| MLE (Type-II ML) | Maximize marginal likelihood | Standard; efficient but prone to overfitting with little data |
| Cross-Validation | Maximize held-out prediction | More robust but computationally expensive; O(n³) per fold |
| Full Bayesian | MCMC over hyperparameters | Uncertainty quantified but very slow; often impractical |
| Grid Search | Evaluate on grid of values | Simple but scales poorly with dimensionality |
| Bayesian Optimization | GP over the hyperparameter space | Efficient for expensive objectives; can use for GP tuning |
Cross-Validation Alternative
Instead of marginal likelihood, we can use leave-one-out cross-validation (LOO-CV). The LOO predictive distribution can be computed efficiently:
$$\mu_{-i} = y_i - \frac{\alpha_i}{[K^{-1}]_{ii}}, \quad \sigma^2_{-i} = \frac{1}{[K^{-1}]_{ii}}$$
where $\boldsymbol{\alpha} = K^{-1}\mathbf{y}$. This requires only a single matrix inversion, yet yields the predictions the GP would make at each point if that point were left out of the training set.
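These formulas are direct to implement once $K^{-1}$ is available. Below is a minimal sketch (the kernel and data are assumptions for illustration); all $n$ leave-one-out predictions come from one inverse rather than $n$ refits.

```python
import numpy as np

def loo_cv(K, y):
    """LOO predictive means and variances from a single inverse of K.

    K must include the noise variance on its diagonal.
    """
    Kinv = np.linalg.inv(K)
    alpha = Kinv @ y
    diag = np.diag(Kinv)
    mu = y - alpha / diag   # leave-one-out predictive means
    var = 1.0 / diag        # leave-one-out predictive variances
    return mu, var

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(10, 1))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2) + 0.1 * np.eye(10)  # RBF kernel plus noise
y = np.sin(X[:, 0])

mu, var = loo_cv(K, y)
print(mu[0], var[0])
```

A useful check is to verify one entry against the brute-force alternative: delete point $i$, condition the GP on the remaining points, and predict at $x_i$; the results match to numerical precision.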
Good initialization dramatically improves optimization. Data-driven heuristics:
Length Scale Initialization: use the median pairwise distance between inputs (the "median heuristic").
Signal Variance Initialization: use the empirical variance of the outputs.
Noise Variance Initialization: use a small fraction (e.g., 10%) of the signal variance.
Period Initialization (for periodic kernels): use the dominant frequency from an FFT of the outputs, or the first peak of the autocorrelation.
```python
import numpy as np


def initialize_hyperparameters(X, y):
    """Data-driven hyperparameter initialization."""
    from scipy.spatial.distance import pdist

    # Length scale: median pairwise distance (assumes 1-D inputs here)
    pairwise_dists = pdist(X.reshape(-1, 1), 'euclidean')
    length_scale = np.median(pairwise_dists)

    # Signal variance: variance of outputs
    signal_variance = np.var(y)

    # Noise variance: 10% of signal variance
    noise_variance = 0.1 * signal_variance

    return {
        'length_scale': max(length_scale, 1e-3),
        'signal_variance': max(signal_variance, 1e-3),
        'noise_variance': max(noise_variance, 1e-6),
    }
```

You now understand how to optimize GP hyperparameters. Next, we'll dive deeper into the marginal likelihood: what it means and why it works as an objective function.