Gaussian Processes define probability distributions over functions. But a GP is not fully specified until we choose its hyperparameters—the parameters of the mean function $m(\mathbf{x})$ and covariance function $k(\mathbf{x}, \mathbf{x}')$ that determine the prior beliefs about the functions we're modeling.
Consider the ubiquitous Squared Exponential (RBF) kernel:
$$k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\ell^2}\right)$$
This kernel has two hyperparameters: the signal variance $\sigma_f^2$ (controlling the amplitude of functions) and the length-scale $\ell$ (controlling smoothness). Add the observation noise variance $\sigma_n^2$, and we have three hyperparameters demanding careful selection.
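As a concrete point of reference, here is a minimal NumPy sketch of this kernel. The function name `rbf_kernel` and its signature are illustrative conventions for this page, not from any particular library; later sketches reuse it.

```python
import numpy as np

def rbf_kernel(X1: np.ndarray, X2: np.ndarray,
               signal_var: float = 1.0, length_scale: float = 1.0) -> np.ndarray:
    """Squared Exponential (RBF) kernel matrix between two input sets.

    X1 has shape (n, d) and X2 has shape (m, d); the result has shape (n, m).
    """
    # Pairwise squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    return signal_var * np.exp(-0.5 * sq_dists / length_scale**2)
```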
How should we choose these? Unlike model parameters, which can be learned by directly optimizing a training objective, hyperparameters govern the shape of our prior, and poor choices lead to catastrophically wrong uncertainty estimates.
By the end of this page, you will understand the marginal likelihood as a principled Bayesian criterion for hyperparameter selection in GPs. You will work through its mathematical derivation and geometric interpretation, and understand why it naturally balances model fit against complexity, embodying Occam's Razor in closed form.
In a fully Bayesian treatment, hyperparameters $\boldsymbol{\theta}$ (comprising kernel parameters, noise variance, and mean function parameters) would be integrated out rather than point-estimated. Given training data $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$, the predictive distribution would be:
$$p(y_* | \mathbf{x}_*, \mathcal{D}) = \int p(y_* | \mathbf{x}_*, \mathcal{D}, \boldsymbol{\theta}) \, p(\boldsymbol{\theta} | \mathcal{D}) \, d\boldsymbol{\theta}$$
This integral marginalizes over all possible hyperparameter values, weighted by their posterior probability given the data. However, this integral is typically intractable for GP hyperparameters, as the posterior $p(\boldsymbol{\theta} | \mathcal{D})$ doesn't have a closed form.
Practical Resolution: The standard approach is to find a point estimate $\hat{\boldsymbol{\theta}}$ by maximizing the marginal likelihood (also called the evidence or type-II maximum likelihood):
$$\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}} \, p(\mathbf{y} | \mathbf{X}, \boldsymbol{\theta})$$
This effectively selects hyperparameters that make the observed data most probable under the GP prior.
This approach is called 'Type-II Maximum Likelihood' or 'Empirical Bayes' because we maximize the likelihood with respect to hyperparameters (Type-II) while integrating out the function values (which would be Type-I parameters). It occupies a middle ground between fully Bayesian (integrating out everything) and fully frequentist (optimizing everything) approaches.
Why not just maximize fit to training data?
A naive approach would be to choose hyperparameters that minimize training error or maximize the standard likelihood of the observed targets. This leads to severe overfitting: driving the noise variance and length-scale toward zero lets the GP interpolate the training data exactly, while producing wildly overconfident and erratic predictions away from the training points.
The marginal likelihood naturally penalizes such overly complex solutions, as we'll see in the derivation.
Let's derive the marginal likelihood from first principles. In GP regression with Gaussian observation noise, we model:
$$y_i = f(\mathbf{x}_i) + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma_n^2)$$
where $f$ is drawn from a GP prior: $f \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}'))$.
Step 1: Joint Distribution of Observations
Given the GP prior, the function values at training points $\mathbf{f} = [f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n)]^\top$ follow a multivariate Gaussian:
$$\mathbf{f} \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{K})$$
where $\boldsymbol{\mu} = [m(\mathbf{x}_1), \ldots, m(\mathbf{x}_n)]^\top$ and $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$.
The observations $\mathbf{y}$ are obtained by adding independent Gaussian noise:
$$\mathbf{y} | \mathbf{f} \sim \mathcal{N}(\mathbf{f}, \sigma_n^2 \mathbf{I})$$
Step 2: Marginalizing Over Function Values
The marginal likelihood is obtained by integrating out the latent function values:
$$p(\mathbf{y} | \mathbf{X}, \boldsymbol{\theta}) = \int p(\mathbf{y} | \mathbf{f}) \, p(\mathbf{f} | \mathbf{X}, \boldsymbol{\theta}) \, d\mathbf{f}$$
Since both $p(\mathbf{y} | \mathbf{f})$ and $p(\mathbf{f} | \mathbf{X}, \boldsymbol{\theta})$ are Gaussian, this integral has a closed-form solution. The marginal distribution of $\mathbf{y}$ is:
$$\mathbf{y} \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{K} + \sigma_n^2 \mathbf{I}) = \mathcal{N}(\boldsymbol{\mu}, \mathbf{K}_y)$$
where we define $\mathbf{K}_y = \mathbf{K} + \sigma_n^2 \mathbf{I}$ as the covariance matrix of the observations.
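To make this marginal concrete, the sketch below draws one noisy dataset directly from $\mathcal{N}(\mathbf{0}, \mathbf{K}_y)$, assuming a zero mean function and reusing the illustrative `rbf_kernel` helper from above; the specific inputs and hyperparameter values are arbitrary choices for demonstration.

```python
import numpy as np

n = 50
X = np.linspace(0.0, 5.0, n).reshape(-1, 1)      # training inputs
sigma_n2 = 0.1                                   # observation noise variance

K = rbf_kernel(X, X, signal_var=1.0, length_scale=1.0)
K_y = K + sigma_n2 * np.eye(n)                   # covariance of the observations y

rng = np.random.default_rng(0)
y = rng.multivariate_normal(mean=np.zeros(n), cov=K_y)   # one draw of y ~ N(0, K_y)
```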
Step 3: The Log Marginal Likelihood
The log of the multivariate Gaussian density gives us the log marginal likelihood (LML):
$$\log p(\mathbf{y} | \mathbf{X}, \boldsymbol{\theta}) = -\frac{1}{2}(\mathbf{y} - \boldsymbol{\mu})^\top \mathbf{K}_y^{-1} (\mathbf{y} - \boldsymbol{\mu}) - \frac{1}{2}\log |\mathbf{K}_y| - \frac{n}{2}\log(2\pi)$$
For convenience (and without loss of generality), assume a zero mean function $\boldsymbol{\mu} = \mathbf{0}$:
$$\boxed{\mathcal{L}(\boldsymbol{\theta}) = \log p(\mathbf{y} | \mathbf{X}, \boldsymbol{\theta}) = -\frac{1}{2}\mathbf{y}^\top \mathbf{K}_y^{-1} \mathbf{y} - \frac{1}{2}\log |\mathbf{K}_y| - \frac{n}{2}\log(2\pi)}$$
This is the central equation for GP hyperparameter learning. Let's examine each term: the quadratic form measures how well the model fits the data, the log-determinant penalizes model complexity, and the final term is a normalization constant independent of the hyperparameters.
The marginal likelihood automatically implements Occam's Razor—the principle that simpler explanations should be preferred over complex ones, all else being equal. This emerges naturally from the mathematical structure, not as an imposed regularization.
Geometric Intuition:
Consider the space of all possible datasets. A simple model (e.g., a GP with large length-scale, modeling only smooth functions) concentrates its probability mass on a restricted set of datasets—those exhibiting smooth patterns. A complex model (small length-scale, capable of fitting wiggly functions) spreads its probability mass across a much larger space of possible datasets.
For any particular dataset: if the data happen to lie in the restricted region the simple model favors, the simple model assigns them high probability density, because its mass is concentrated there.
Conversely: the complex model spreads its mass over many more possible datasets, so it assigns lower probability to that same dataset, even though it is capable of fitting it.
Since probabilities must sum to one, a model that can explain everything explains nothing particularly well. The marginal likelihood captures this: models are automatically penalized for wasted explanatory capacity. More flexible models must 'pay' for flexibility by lower probability on simpler data.
Mathematical Manifestation:
Consider how the two non-constant terms trade off:
Overly simple models (large $\ell$, small $\sigma_f^2$): the complexity penalty is mild, but the data fit term is very poor because the prior cannot explain the observed variation.
Overly complex models (small $\ell$, large $\sigma_f^2$): the data fit term can be made excellent, but the determinant is large and the complexity penalty is severe.
Appropriately complex models: both terms are moderate, and their sum, the log marginal likelihood, is maximized.
The marginal likelihood is maximized by models whose complexity matches the true complexity of the data-generating process.
| Model Complexity | Data Fit ($-\frac{1}{2}\mathbf{y}^\top \mathbf{K}_y^{-1} \mathbf{y}$) | Complexity Penalty ($-\frac{1}{2}\log|\mathbf{K}_y|$) | Net LML |
|---|---|---|---|
| Too Simple | Very negative (poor) | Mildly negative (good) | Low |
| Appropriate | Moderately negative | Moderately negative | Optimal |
| Too Complex | Near zero (excellent) | Very negative (severe) | Low |
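The following sketch makes this trade-off visible numerically. It continues from the sampling example above (reusing `X`, `y`, `sigma_n2`, and `rbf_kernel`, all of which are illustrative assumptions of this page) and evaluates the two non-constant LML terms for several length-scales; their sum typically peaks near the generating value $\ell = 1$.

```python
import numpy as np

def lml_terms(y, K_y):
    """Return (data_fit, complexity_penalty), the two non-constant LML terms."""
    L = np.linalg.cholesky(K_y)                            # K_y = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))    # alpha = K_y^{-1} y
    data_fit = -0.5 * y @ alpha                            # -0.5 * y^T K_y^{-1} y
    complexity = -np.sum(np.log(np.diag(L)))               # -0.5 * log|K_y|
    return data_fit, complexity

# Sweep the length-scale, holding the other hyperparameters at their generating values
for ell in [0.05, 0.3, 1.0, 3.0, 20.0]:
    K_y = rbf_kernel(X, X, signal_var=1.0, length_scale=ell) + sigma_n2 * np.eye(n)
    fit, pen = lml_terms(y, K_y)
    print(f"ell={ell:6.2f}  data fit={fit:9.2f}  complexity penalty={pen:9.2f}  sum={fit + pen:9.2f}")
```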
Let's develop deeper intuition for each term in the log marginal likelihood.
The Data Fit Term: $-\frac{1}{2}\mathbf{y}^\top \mathbf{K}_y^{-1} \mathbf{y}$
This term is the negative of a Mahalanobis distance—it measures how 'typical' the observations $\mathbf{y}$ are under the GP prior with covariance $\mathbf{K}_y$.
The inverse covariance $\mathbf{K}_y^{-1}$ (precision matrix) acts as a weighting that accounts for correlations between observations. Points that are modeled as highly correlated should have similar values; deviations are more severely penalized than for weakly correlated points.
Key insight: This is not simply a sum of squared residuals. The matrix structure means residuals are judged relative to the correlations the kernel implies: a deviation at one point can be partially explained by correlated neighboring observations, while isolated, surprising values are penalized heavily.
The Complexity Penalty: $-\frac{1}{2}\log |\mathbf{K}_y|$
The determinant $|\mathbf{K}_y|$ has a beautiful geometric interpretation: it represents the volume of the probability ellipsoid defined by the Gaussian prior.
For diagonal covariance (independent observations): $$|\mathbf{K}_y| = \prod_{i=1}^n K_{y,ii} = \prod_{i=1}^n \left(k(\mathbf{x}_i, \mathbf{x}_i) + \sigma_n^2\right)$$
Each factor represents the prior variance at one observation point. The product gives the volume of the n-dimensional hyperrectangle within which functions are likely.
For full covariance (correlated observations): Correlations reduce the volume—if knowing $y_1$ tells us something about $y_2$, the effective 'space' of possibilities shrinks. The determinant captures this reduction.
Connection to eigenvalues: $$\log|\mathbf{K}_y| = \sum_{i=1}^n \log \lambda_i$$
where $\lambda_i$ are eigenvalues of $\mathbf{K}_y$. More complex models have larger eigenvalues (each eigenvalue represents variance along a principal direction), leading to larger determinants and more severe penalties.
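A quick numerical check of this identity, assuming the symmetric positive-definite `K_y` built from the earlier illustrative sketches (any positive-definite covariance would do):

```python
import numpy as np

K_y = rbf_kernel(X, X, signal_var=1.0, length_scale=1.0) + sigma_n2 * np.eye(n)

eigvals = np.linalg.eigvalsh(K_y)             # eigenvalues of the symmetric matrix K_y
log_det_from_eigs = np.sum(np.log(eigvals))   # sum of log-eigenvalues
sign, log_det = np.linalg.slogdet(K_y)        # numerically stable log-determinant

assert sign > 0                               # K_y must be positive definite
print(np.allclose(log_det_from_eigs, log_det))   # True, up to floating-point error
```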
The marginal likelihood implicitly seeks the 'Goldilocks' model—not too simple (poor fit), not too complex (excessive penalty), but just right. Unlike cross-validation, which requires separate validation data, or explicit regularization, which requires tuning, the marginal likelihood achieves this balance automatically from the training data alone.
Understanding how each hyperparameter influences the marginal likelihood provides crucial intuition for debugging GP models and interpreting fitted values.
Length-scale $\ell$:
The length-scale controls the 'wiggliness' of functions sampled from the GP.
| Length-scale | Effect on Prior | Effect on Data Fit | Effect on Complexity | Net Effect |
|---|---|---|---|---|
| Very small | Allows very wiggly functions | Can interpolate exactly | Very high penalty (large det) | Poor LML unless data is truly noisy |
| Moderate | Smooth with local variation | Reasonable fit possible | Moderate penalty | Good LML for typical smooth data |
| Very large | Nearly constant functions | Can only fit means | Low penalty | Poor LML unless data is constant |
Signal Variance $\sigma_f^2$:
The signal variance controls the overall amplitude of functions.
| Signal Variance | Effect on Prior | Effect on Data Fit | Effect on Complexity | Net Effect |
|---|---|---|---|---|
| Very small | Functions close to mean | Cannot explain large deviations | Low penalty | Poor LML if data has large variance |
| Moderate | Functions of reasonable amplitude | Can fit typical data | Moderate penalty | Good LML for typical data |
| Very large | Wild amplitude swings | Explains everything | High penalty | Poor LML (wasted capacity) |
Noise Variance $\sigma_n^2$:
The noise variance is particularly interesting because it affects both terms in opposite directions.
| Noise Variance | Effect on Prior | Effect on Data Fit | Effect on Complexity | Net Effect |
|---|---|---|---|---|
| Very small | Must fit data exactly | Residuals heavily penalized | High penalty (large correlations) | Poor LML unless data is noise-free |
| Moderate | Allows some slack | Tolerates reasonable noise | Moderate penalty | Good LML for noisy data |
| Very large | Data almost ignored | Any data is 'explained' by noise | Low penalty (near-diagonal) | Poor LML (not learning from data) |
The noise variance essentially controls the signal-to-noise ratio the model assumes. Too low, and the GP overfits noise as signal. Too high, and it ignores genuine patterns.
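To see these opposite-direction effects in code, the sketch below reuses the earlier illustrative helpers (`rbf_kernel`, `lml_terms`) and the synthetic data, which were generated with $\sigma_n^2 = 0.1$, and sweeps the assumed noise variance; the LML (up to the constant term) is typically highest near the generating value.

```python
import numpy as np

K = rbf_kernel(X, X, signal_var=1.0, length_scale=1.0)
noise_grid = [1e-4, 1e-2, 0.1, 1.0, 10.0]

# The constant normalization term is omitted: it does not affect the comparison
lmls = [sum(lml_terms(y, K + s2 * np.eye(n))) for s2 in noise_grid]
best = noise_grid[int(np.argmax(lmls))]
print(f"LML is highest at assumed sigma_n^2 = {best}")   # typically near 0.1
```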
The hyperparameters $\sigma_f^2$ and $\sigma_n^2$ interact: scaling both by the same factor leaves the signal-to-noise ratio and the predictive mean unchanged (only the predictive variances rescale), while the marginal likelihood does change and is what pins down the overall scale. Always check for such near-degeneracies and slowly varying directions, especially when using automatic optimization; parameterizing in terms of SNR and total variance can sometimes help.
An alternative to marginal likelihood optimization is selecting hyperparameters via cross-validation (CV). Both approaches have merit, and understanding their differences illuminates what the marginal likelihood is fundamentally doing.
Leave-One-Out Cross-Validation (LOO-CV):
For GP regression, LOO-CV has a remarkable closed-form expression:
$$\text{LOO-CV} = \sum_{i=1}^n \log p(y_i | \mathbf{X}, \mathbf{y}_{-i}, \boldsymbol{\theta})$$
where $\mathbf{y}_{-i}$ denotes all observations except the $i$-th. This measures how well the model predicts held-out points when trained on the rest.
Using the GP predictive equations and the matrix inversion lemma, this can be computed efficiently:
$$\text{LOO-CV} = -\frac{1}{2}\sum_{i=1}^n \left( \log\sigma_i^2 + \frac{(y_i - \mu_i)^2}{\sigma_i^2} + \log 2\pi \right)$$
where $\mu_i$ and $\sigma_i^2$ are the LOO predictive mean and variance for point $i$.
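A short sketch of this computation, using the standard identities that express the LOO predictive moments through the precision matrix: with $\boldsymbol{\alpha} = \mathbf{K}_y^{-1}\mathbf{y}$, we have $\sigma_i^2 = 1/[\mathbf{K}_y^{-1}]_{ii}$ and $\mu_i = y_i - \alpha_i/[\mathbf{K}_y^{-1}]_{ii}$. The function name is illustrative, and explicit inversion is used only for clarity at moderate $n$.

```python
import numpy as np

def loo_cv_log_predictive(y: np.ndarray, K_y: np.ndarray) -> float:
    """Sum of leave-one-out log predictive densities for GP regression."""
    P = np.linalg.inv(K_y)            # precision matrix; fine for moderate n
    alpha = P @ y                     # K_y^{-1} y
    var_loo = 1.0 / np.diag(P)        # LOO predictive variances sigma_i^2
    mu_loo = y - alpha * var_loo      # LOO predictive means mu_i
    return -0.5 * np.sum(np.log(var_loo)
                         + (y - mu_loo) ** 2 / var_loo
                         + np.log(2.0 * np.pi))
```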
When do they differ?
In practice, marginal likelihood and LOO-CV often select similar hyperparameters. They diverge when:
Model misspecification is severe: If the true data-generating process is far from any GP prior, LOO-CV may find hyperparameters that minimize prediction error even if they don't correspond to a 'good' probabilistic model.
Small datasets: The marginal likelihood's complexity penalty is calibrated to the Bayesian framework; LOO-CV can underpenalize complexity when data is scarce.
Heteroscedastic or non-Gaussian noise: LOO-CV focuses on prediction accuracy regardless of how the noise model is specified; marginal likelihood requires correct noise modeling for valid results.
As a principled default, maximizing the marginal likelihood is preferred for GP hyperparameter learning. Cross-validation serves as a valuable sanity check and alternative when model assumptions are suspect.
Evaluating the log marginal likelihood requires computing two quantities: the quadratic form $\mathbf{y}^\top \mathbf{K}_y^{-1} \mathbf{y}$ and the log-determinant $\log|\mathbf{K}_y|$.
Both operations are dominated by inverting or factorizing the $n \times n$ covariance matrix.
Cholesky Decomposition Approach:
The standard approach is to compute the Cholesky decomposition:
$$\mathbf{K}_y = \mathbf{L} \mathbf{L}^\top$$
where $\mathbf{L}$ is lower triangular. This decomposition costs $O(n^3)$ time and $O(n^2)$ memory.
Once $\mathbf{L}$ is available:
Solve $\mathbf{K}_y^{-1} \mathbf{y}$: Forward substitution for $\mathbf{z} = \mathbf{L}^{-1}\mathbf{y}$, then backward substitution for $\boldsymbol{\alpha} = \mathbf{L}^{-\top}\mathbf{z}$. Cost: $O(n^2)$.
Compute $\log|\mathbf{K}_y|$: Since $|\mathbf{K}_y| = |\mathbf{L}|^2$ and $|\mathbf{L}| = \prod_i L_{ii}$, we have $\log|\mathbf{K}_y| = 2\sum_i \log L_{ii}$. Cost: $O(n)$.
The final expression using the Cholesky factor:
$$\mathcal{L}(\boldsymbol{\theta}) = -\frac{1}{2}\|\mathbf{L}^{-1}\mathbf{y}\|^2 - \sum_i \log L_{ii} - \frac{n}{2}\log(2\pi)$$
```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def log_marginal_likelihood(y: np.ndarray, K_y: np.ndarray) -> float:
    """
    Compute the log marginal likelihood for GP regression.

    Parameters
    ----------
    y : np.ndarray, shape (n,)
        Observed target values
    K_y : np.ndarray, shape (n, n)
        Covariance matrix K + sigma_n^2 * I

    Returns
    -------
    float
        Log marginal likelihood value
    """
    n = len(y)

    # Cholesky decomposition: K_y = L @ L.T
    L, lower = cho_factor(K_y, lower=True)

    # Solve for alpha = K_y^{-1} @ y via Cholesky
    alpha = cho_solve((L, lower), y)

    # Data fit term: -0.5 * y.T @ K_y^{-1} @ y
    data_fit = -0.5 * np.dot(y, alpha)

    # Complexity penalty: -0.5 * log|K_y| = -sum(log(diag(L)))
    complexity = -np.sum(np.log(np.diag(L)))

    # Normalization constant
    normalization = -0.5 * n * np.log(2 * np.pi)

    return data_fit + complexity + normalization
```

The Cholesky decomposition can fail if $\mathbf{K}_y$ is not positive definite (which can occur due to numerical errors, especially with very small noise variance). Adding a small 'jitter' term $\epsilon \mathbf{I}$ (typically $\epsilon = 10^{-6}$ to $10^{-10}$) to the diagonal stabilizes the computation without significantly affecting results.
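As a quick sanity check (an illustrative sketch, not part of any library), the function above can be compared against SciPy's multivariate normal log-density on a small synthetic problem with an RBF covariance plus noise and jitter on the diagonal:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(42)
n = 30
X = np.sort(rng.uniform(0.0, 5.0, size=(n, 1)), axis=0)

# RBF kernel (sigma_f^2 = 1, ell = 1) plus noise variance 0.1 and a small jitter
sq_dists = (X - X.T) ** 2
K_y = np.exp(-0.5 * sq_dists) + (0.1 + 1e-8) * np.eye(n)

y = rng.multivariate_normal(np.zeros(n), K_y)

lml = log_marginal_likelihood(y, K_y)
ref = multivariate_normal(mean=np.zeros(n), cov=K_y).logpdf(y)
print(np.allclose(lml, ref))   # True, up to floating-point error
```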
Understanding the shape of the marginal likelihood as a function of hyperparameters is crucial for successful optimization.
General Characteristics:
Non-convexity: The marginal likelihood is in general non-convex in hyperparameters. Multiple local maxima can exist, corresponding to qualitatively different models (e.g., 'smooth function + high noise' vs. 'wiggly function + low noise').
Log-scale parameterization: Hyperparameters like $\ell$, $\sigma_f^2$, $\sigma_n^2$ are strictly positive and often span multiple orders of magnitude. Working with $\log \ell$, $\log \sigma_f^2$, $\log \sigma_n^2$ instead is standard practice.
Flat regions: In sparsely sampled regions of input space, the marginal likelihood can be insensitive to length-scale (any scale larger than the data spacing has similar effect).
Sharp ridges: The noise-signal variance trade-off often creates sharp ridges in the marginal likelihood surface, making optimization challenging.
Common Pathologies:
| Pathology | Symptom | Cause | Resolution |
|---|---|---|---|
| Degenerate noise | $\sigma_n^2 \to 0$ | Overfitting noise | Add minimum noise bound or jitter |
| Infinite length-scale | $\ell \to \infty$ | Data looks constant | Verify data has genuine signal |
| Zero length-scale | $\ell \to 0$ | Extreme local fitting | Check for data issues; regularize |
| Multiple optima | Different runs → different $\theta$ | Non-convex landscape | Use multiple random restarts |
Local vs. Global Optima:
Different local optima can correspond to genuinely different explanations of the data: for example, a slowly varying function observed with substantial noise versus a rapidly varying function observed with little noise.
Both might achieve reasonable marginal likelihood values. Domain knowledge or external validation can help distinguish between such alternatives.
Always run optimization from multiple random initial points and select the solution with the highest marginal likelihood. For critical applications, visualize the marginal likelihood surface (at least in 2D slices) to understand the landscape and ensure you've found the global optimum.
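A minimal sketch of this advice, assuming the illustrative `rbf_kernel` and `log_marginal_likelihood` helpers defined earlier on this page: the optimizer works on log-hyperparameters and the best of several random restarts is kept. The helper names and the initialization range are arbitrary choices, not a prescribed recipe.

```python
import numpy as np
from scipy.optimize import minimize

def neg_lml(log_theta: np.ndarray, X: np.ndarray, y: np.ndarray) -> float:
    """Negative log marginal likelihood as a function of log hyperparameters."""
    sigma_f2, ell, sigma_n2 = np.exp(log_theta)           # exponentiate to enforce positivity
    K_y = rbf_kernel(X, X, signal_var=sigma_f2, length_scale=ell)
    K_y += (sigma_n2 + 1e-6) * np.eye(len(y))             # noise variance plus jitter
    try:
        return -log_marginal_likelihood(y, K_y)
    except np.linalg.LinAlgError:
        return 1e25                                       # treat failed factorizations as terrible fits

def fit_hyperparameters(X, y, n_restarts: int = 10, seed: int = 0):
    """Maximize the LML from several random starting points; return the best result."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        x0 = rng.uniform(-2.0, 2.0, size=3)               # random initialization in log-space
        res = minimize(neg_lml, x0, args=(X, y), method="L-BFGS-B")
        if best is None or res.fun < best.fun:
            best = res
    return np.exp(best.x), -best.fun                      # (sigma_f2, ell, sigma_n2), best LML
```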
The marginal likelihood connects to several other model selection criteria, providing a unifying perspective.
Connection to AIC and BIC:
The Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) approximate the marginal likelihood under certain assumptions:
$$\text{AIC} = -2\log p(\mathbf{y} | \hat{\boldsymbol{\theta}}) + 2k$$ $$\text{BIC} = -2\log p(\mathbf{y} | \hat{\boldsymbol{\theta}}) + k \log n$$
where $k$ is the number of parameters. For GP kernels, $k$ typically refers to hyperparameters. The marginal likelihood can be viewed as a more principled version that doesn't require ad hoc counting of 'effective parameters.'
Connection to Minimum Description Length (MDL):
MDL selects models that maximize compression of the data. The marginal likelihood has an MDL interpretation: $-\log p(\mathbf{y} | \mathbf{X}, \boldsymbol{\theta})$ represents the number of bits needed to encode the data using the model. Better models provide shorter codes; the marginal likelihood seeks the shortest description.
Connection to Regularization:
Maximizing the log marginal likelihood can be rewritten as:
$$\max_{\boldsymbol{\theta}} \left[ \text{Data Fit}(\boldsymbol{\theta}) - \frac{1}{2}\log|\mathbf{K}_y(\boldsymbol{\theta})| \right]$$
The second term functions as a data-dependent regularizer. Unlike fixed regularizers (L1, L2), this regularizer adapts to the data: more complex datasets justify more complex models without excessive penalty.
Connection to PAC-Bayes Bounds:
In learning theory, PAC-Bayes bounds provide generalization guarantees based on the KL divergence between posterior and prior. The marginal likelihood appears naturally in these bounds:
$$\text{Generalization Error} \leq \text{Empirical Error} + \sqrt{\frac{\text{KL}(\text{posterior} \,\|\, \text{prior}) + \log n}{n}}$$
Maximizing marginal likelihood implicitly minimizes this bound under certain conditions, providing theoretical justification for its use in model selection.
We've developed a comprehensive understanding of the marginal likelihood as the cornerstone of GP hyperparameter learning. The key insights: the marginal likelihood is the probability of the observed targets with the latent function values integrated out; its logarithm has a closed form comprising a data fit term, a complexity penalty, and a constant; the interplay of the first two terms implements Occam's Razor automatically; it can be computed in $O(n^3)$ via a Cholesky factorization; and its surface is non-convex, so optimization benefits from log-scale parameterization and multiple restarts.
Looking Ahead:
With the marginal likelihood established as our objective, the next page addresses how to actually optimize it. We'll derive the gradients of the marginal likelihood with respect to hyperparameters—the mathematical machinery that enables efficient gradient-based optimization for GP hyperparameter learning.
You now have a thorough understanding of the marginal likelihood: its derivation, interpretation as Occam's Razor, sensitivity to hyperparameters, and computational implementation. This foundation is essential for principled GP hyperparameter learning, which we continue developing in the following pages.