Gaussian Processes define probability distributions over functions. But a GP is not fully specified until we choose its hyperparameters—the parameters of the mean function $m(\mathbf{x})$ and covariance function $k(\mathbf{x}, \mathbf{x}')$ that determine the prior beliefs about the functions we're modeling.
Consider the ubiquitous Squared Exponential (RBF) kernel:
$$k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\ell^2}\right)$$
This kernel has two hyperparameters: the signal variance $\sigma_f^2$ (controlling the amplitude of functions) and the length-scale $\ell$ (controlling smoothness). Add the observation noise variance $\sigma_n^2$, and we have three hyperparameters demanding careful selection.
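As a concrete point of reference, here is a minimal NumPy sketch of this kernel. The function name `rbf_kernel` and its signature are illustrative conventions for this page, not from any particular library; later sketches reuse it.

```python
import numpy as np

def rbf_kernel(X1: np.ndarray, X2: np.ndarray,
               signal_var: float = 1.0, length_scale: float = 1.0) -> np.ndarray:
    """Squared Exponential (RBF) kernel matrix between two input sets.

    X1 has shape (n, d) and X2 has shape (m, d); the result has shape (n, m).
    """
    # Pairwise squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    return signal_var * np.exp(-0.5 * sq_dists / length_scale**2)
```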
How should we choose these? Unlike model parameters, which can be learned by directly optimizing a training objective, hyperparameters govern the shape of our prior, and poor choices lead to catastrophically wrong uncertainty estimates.
By the end of this page, you will understand the marginal likelihood as a principled Bayesian criterion for hyperparameter selection in GPs. You will work through its mathematical derivation and geometric interpretation, and understand why it naturally balances model fit against complexity, embodying Occam's Razor in closed form.
In a fully Bayesian treatment, hyperparameters $\boldsymbol{\theta}$ (comprising kernel parameters, noise variance, and mean function parameters) would be integrated out rather than point-estimated. Given training data $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$, the predictive distribution would be:
$$p(y_* | \mathbf{x}_*, \mathcal{D}) = \int p(y_* | \mathbf{x}_*, \mathcal{D}, \boldsymbol{\theta}) \, p(\boldsymbol{\theta} | \mathcal{D}) \, d\boldsymbol{\theta}$$
This integral marginalizes over all possible hyperparameter values, weighted by their posterior probability given the data. However, this integral is typically intractable for GP hyperparameters, as the posterior $p(\boldsymbol{\theta} | \mathcal{D})$ doesn't have a closed form.
Practical Resolution: The standard approach is to find a point estimate $\hat{\boldsymbol{\theta}}$ by maximizing the marginal likelihood (also called the evidence or type-II maximum likelihood):
$$\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}} \, p(\mathbf{y} | \mathbf{X}, \boldsymbol{\theta})$$
This effectively selects hyperparameters that make the observed data most probable under the GP prior.
This approach is called 'Type-II Maximum Likelihood' or 'Empirical Bayes' because we maximize the likelihood with respect to hyperparameters (Type-II) while integrating out the function values (which would be Type-I parameters). It occupies a middle ground between fully Bayesian (integrating out everything) and fully frequentist (optimizing everything) approaches.
Why not just maximize fit to training data?
A naive approach would be to choose hyperparameters that minimize training error or maximize the standard likelihood of the observed targets. This leads to severe overfitting: driving the noise variance and length-scale toward zero lets the GP interpolate the training data exactly, while producing wildly overconfident and erratic predictions away from the training points.
The marginal likelihood naturally penalizes such overly complex solutions, as we'll see in the derivation.
Let's derive the marginal likelihood from first principles. In GP regression with Gaussian observation noise, we model:
$$y_i = f(\mathbf{x}_i) + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma_n^2)$$
where $f$ is drawn from a GP prior: $f \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}'))$.
Step 1: Joint Distribution of Observations
Given the GP prior, the function values at training points $\mathbf{f} = [f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n)]^\top$ follow a multivariate Gaussian:
$$\mathbf{f} \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{K})$$
where $\boldsymbol{\mu} = [m(\mathbf{x}_1), \ldots, m(\mathbf{x}_n)]^\top$ and $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$.
The observations $\mathbf{y}$ are obtained by adding independent Gaussian noise:
$$\mathbf{y} | \mathbf{f} \sim \mathcal{N}(\mathbf{f}, \sigma_n^2 \mathbf{I})$$
Step 2: Marginalizing Over Function Values
The marginal likelihood is obtained by integrating out the latent function values:
$$p(\mathbf{y} | \mathbf{X}, \boldsymbol{\theta}) = \int p(\mathbf{y} | \mathbf{f}) \, p(\mathbf{f} | \mathbf{X}, \boldsymbol{\theta}) \, d\mathbf{f}$$
Since both $p(\mathbf{y} | \mathbf{f})$ and $p(\mathbf{f} | \mathbf{X}, \boldsymbol{\theta})$ are Gaussian, this integral has a closed-form solution. The marginal distribution of $\mathbf{y}$ is:
$$\mathbf{y} \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{K} + \sigma_n^2 \mathbf{I}) = \mathcal{N}(\boldsymbol{\mu}, \mathbf{K}_y)$$
where we define $\mathbf{K}_y = \mathbf{K} + \sigma_n^2 \mathbf{I}$ as the covariance matrix of the observations.
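To make this marginal concrete, the sketch below draws one noisy dataset directly from $\mathcal{N}(\mathbf{0}, \mathbf{K}_y)$, assuming a zero mean function and reusing the illustrative `rbf_kernel` helper from above; the specific inputs and hyperparameter values are arbitrary choices for demonstration.

```python
import numpy as np

n = 50
X = np.linspace(0.0, 5.0, n).reshape(-1, 1)      # training inputs
sigma_n2 = 0.1                                   # observation noise variance

K = rbf_kernel(X, X, signal_var=1.0, length_scale=1.0)
K_y = K + sigma_n2 * np.eye(n)                   # covariance of the observations y

rng = np.random.default_rng(0)
y = rng.multivariate_normal(mean=np.zeros(n), cov=K_y)   # one draw of y ~ N(0, K_y)
```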
Step 3: The Log Marginal Likelihood
The log of the multivariate Gaussian density gives us the log marginal likelihood (LML):
$$\log p(\mathbf{y} | \mathbf{X}, \boldsymbol{\theta}) = -\frac{1}{2}(\mathbf{y} - \boldsymbol{\mu})^\top \mathbf{K}_y^{-1} (\mathbf{y} - \boldsymbol{\mu}) - \frac{1}{2}\log |\mathbf{K}_y| - \frac{n}{2}\log(2\pi)$$
For convenience (and without loss of generality), assume a zero mean function $\boldsymbol{\mu} = \mathbf{0}$:
$$\boxed{\mathcal{L}(\boldsymbol{\theta}) = \log p(\mathbf{y} | \mathbf{X}, \boldsymbol{\theta}) = -\frac{1}{2}\mathbf{y}^\top \mathbf{K}_y^{-1} \mathbf{y} - \frac{1}{2}\log |\mathbf{K}_y| - \frac{n}{2}\log(2\pi)}$$
This is the central equation for GP hyperparameter learning. Let's examine each term: the quadratic form measures how well the model fits the data, the log-determinant penalizes model complexity, and the final term is a normalization constant independent of the hyperparameters.
The marginal likelihood automatically implements Occam's Razor—the principle that simpler explanations should be preferred over complex ones, all else being equal. This emerges naturally from the mathematical structure, not as an imposed regularization.
Geometric Intuition:
Consider the space of all possible datasets. A simple model (e.g., a GP with large length-scale, modeling only smooth functions) concentrates its probability mass on a restricted set of datasets—those exhibiting smooth patterns. A complex model (small length-scale, capable of fitting wiggly functions) spreads its probability mass across a much larger space of possible datasets.
For any particular dataset: if the data happen to lie in the restricted region the simple model favors, the simple model assigns them high probability density, because its mass is concentrated there.
Conversely: the complex model spreads its mass over many more possible datasets, so it assigns lower probability to that same dataset, even though it is capable of fitting it.
Since probabilities must sum to one, a model that can explain everything explains nothing particularly well. The marginal likelihood captures this: models are automatically penalized for wasted explanatory capacity. More flexible models must 'pay' for flexibility by lower probability on simpler data.
Mathematical Manifestation:
Consider how the two non-constant terms trade off:
Overly simple models (large $\ell$, small $\sigma_f^2$): the complexity penalty is mild, but the data fit term is very poor because the prior cannot explain the observed variation.
Overly complex models (small $\ell$, large $\sigma_f^2$): the data fit term can be made excellent, but the determinant is large and the complexity penalty is severe.
Appropriately complex models: both terms are moderate, and their sum, the log marginal likelihood, is maximized.
The marginal likelihood is maximized by models whose complexity matches the true complexity of the data-generating process.
| Model Complexity | Data Fit ($-\frac{1}{2}\mathbf{y}^\top \mathbf{K}_y^{-1} \mathbf{y}$) | Complexity Penalty ($-\frac{1}{2}\log|\mathbf{K}_y|$) | Net LML |
|---|---|---|---|
| Too Simple | Very negative (poor) | Mildly negative (good) | Low |
| Appropriate | Moderately negative | Moderately negative | Optimal |
| Too Complex | Near zero (excellent) | Very negative (severe) | Low |
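The following sketch makes this trade-off visible numerically. It continues from the sampling example above (reusing `X`, `y`, `sigma_n2`, and `rbf_kernel`, all of which are illustrative assumptions of this page) and evaluates the two non-constant LML terms for several length-scales; their sum typically peaks near the generating value $\ell = 1$.

```python
import numpy as np

def lml_terms(y, K_y):
    """Return (data_fit, complexity_penalty), the two non-constant LML terms."""
    L = np.linalg.cholesky(K_y)                            # K_y = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))    # alpha = K_y^{-1} y
    data_fit = -0.5 * y @ alpha                            # -0.5 * y^T K_y^{-1} y
    complexity = -np.sum(np.log(np.diag(L)))               # -0.5 * log|K_y|
    return data_fit, complexity

# Sweep the length-scale, holding the other hyperparameters at their generating values
for ell in [0.05, 0.3, 1.0, 3.0, 20.0]:
    K_y = rbf_kernel(X, X, signal_var=1.0, length_scale=ell) + sigma_n2 * np.eye(n)
    fit, pen = lml_terms(y, K_y)
    print(f"ell={ell:6.2f}  data fit={fit:9.2f}  complexity penalty={pen:9.2f}  sum={fit + pen:9.2f}")
```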
Let's develop deeper intuition for each term in the log marginal likelihood.
The Data Fit Term: $-\frac{1}{2}\mathbf{y}^\top \mathbf{K}_y^{-1} \mathbf{y}$
This term is the negative of a Mahalanobis distance—it measures how 'typical' the observations $\mathbf{y}$ are under the GP prior with covariance $\mathbf{K}_y$.
The inverse covariance $\mathbf{K}_y^{-1}$ (precision matrix) acts as a weighting that accounts for correlations between observations. Points that are modeled as highly correlated should have similar values; deviations are more severely penalized than for weakly correlated points.
Key insight: This is not simply a sum of squared residuals. The matrix structure means residuals are judged relative to the correlations the kernel implies: a deviation at one point can be partially explained by correlated neighboring observations, while isolated, surprising values are penalized heavily.
The Complexity Penalty: $-\frac{1}{2}\log |\mathbf{K}_y|$
The determinant $|\mathbf{K}_y|$ has a beautiful geometric interpretation: it represents the volume of the probability ellipsoid defined by the Gaussian prior.
For diagonal covariance (independent observations): $$|\mathbf{K}_y| = \prod_{i=1}^n K_{y,ii} = \prod_{i=1}^n \left(k(\mathbf{x}_i, \mathbf{x}_i) + \sigma_n^2\right)$$
Each factor represents the prior variance at one observation point. The product gives the volume of the n-dimensional hyperrectangle within which functions are likely.
For full covariance (correlated observations): Correlations reduce the volume—if knowing $y_1$ tells us something about $y_2$, the effective 'space' of possibilities shrinks. The determinant captures this reduction.
Connection to eigenvalues: $$\log|\mathbf{K}_y| = \sum_{i=1}^n \log \lambda_i$$
where $\lambda_i$ are eigenvalues of $\mathbf{K}_y$. More complex models have larger eigenvalues (each eigenvalue represents variance along a principal direction), leading to larger determinants and more severe penalties.
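A quick numerical check of this identity, assuming the symmetric positive-definite `K_y` built from the earlier illustrative sketches (any positive-definite covariance would do):

```python
import numpy as np

K_y = rbf_kernel(X, X, signal_var=1.0, length_scale=1.0) + sigma_n2 * np.eye(n)

eigvals = np.linalg.eigvalsh(K_y)             # eigenvalues of the symmetric matrix K_y
log_det_from_eigs = np.sum(np.log(eigvals))   # sum of log-eigenvalues
sign, log_det = np.linalg.slogdet(K_y)        # numerically stable log-determinant

assert sign > 0                               # K_y must be positive definite
print(np.allclose(log_det_from_eigs, log_det))   # True, up to floating-point error
```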
The marginal likelihood implicitly seeks the 'Goldilocks' model—not too simple (poor fit), not too complex (excessive penalty), but just right. Unlike cross-validation, which requires separate validation data, or explicit regularization, which requires tuning, the marginal likelihood achieves this balance automatically from the training data alone.
Understanding how each hyperparameter influences the marginal likelihood provides crucial intuition for debugging GP models and interpreting fitted values.
Length-scale $\ell$:
The length-scale controls the 'wiggliness' of functions sampled from the GP.
| Length-scale | Effect on Prior | Effect on Data Fit | Effect on Complexity | Net Effect |
|---|---|---|---|---|
| Very small | Allows very wiggly functions | Can interpolate exactly | Very high penalty (large det) | Poor LML unless data is truly noisy |
| Moderate | Smooth with local variation | Reasonable fit possible | Moderate penalty | Good LML for typical smooth data |
| Very large | Nearly constant functions | Can only fit means | Low penalty | Poor LML unless data is constant |
Signal Variance $\sigma_f^2$:
The signal variance controls the overall amplitude of functions.
| Signal Variance | Effect on Prior | Effect on Data Fit | Effect on Complexity | Net Effect |
|---|---|---|---|---|
| Very small | Functions close to mean | Cannot explain large deviations | Low penalty | Poor LML if data has large variance |
| Moderate | Functions of reasonable amplitude | Can fit typical data | Moderate penalty | Good LML for typical data |
| Very large | Wild amplitude swings | Explains everything | High penalty | Poor LML (wasted capacity) |
Noise Variance $\sigma_n^2$:
The noise variance is particularly interesting because it affects both terms in opposite directions.
| Noise Variance | Effect on Prior | Effect on Data Fit | Effect on Complexity | Net Effect |
|---|---|---|---|---|
| Very small | Must fit data exactly | Residuals heavily penalized | High penalty (large correlations) | Poor LML unless data is noise-free |
| Moderate | Allows some slack | Tolerates reasonable noise | Moderate penalty | Good LML for noisy data |
| Very large | Data almost ignored | Any data is 'explained' by noise | Low penalty (near-diagonal) | Poor LML (not learning from data) |
The noise variance essentially controls the signal-to-noise ratio the model assumes. Too low, and the GP overfits noise as signal. Too high, and it ignores genuine patterns.
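To see these opposite-direction effects in code, the sketch below reuses the earlier illustrative helpers (`rbf_kernel`, `lml_terms`) and the synthetic data, which were generated with $\sigma_n^2 = 0.1$, and sweeps the assumed noise variance; the LML (up to the constant term) is typically highest near the generating value.

```python
import numpy as np

K = rbf_kernel(X, X, signal_var=1.0, length_scale=1.0)
noise_grid = [1e-4, 1e-2, 0.1, 1.0, 10.0]

# The constant normalization term is omitted: it does not affect the comparison
lmls = [sum(lml_terms(y, K + s2 * np.eye(n))) for s2 in noise_grid]
best = noise_grid[int(np.argmax(lmls))]
print(f"LML is highest at assumed sigma_n^2 = {best}")   # typically near 0.1
```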
The hyperparameters $\sigma_f^2$ and $\sigma_n^2$ interact: scaling both by the same factor leaves the signal-to-noise ratio and the predictive mean unchanged (only the predictive variances rescale), while the marginal likelihood does change and is what pins down the overall scale. Always check for such near-degeneracies and slowly varying directions, especially when using automatic optimization; parameterizing in terms of SNR and total variance can sometimes help.
An alternative to marginal likelihood optimization is selecting hyperparameters via cross-validation (CV). Both approaches have merit, and understanding their differences illuminates what the marginal likelihood is fundamentally doing.
Leave-One-Out Cross-Validation (LOO-CV):
For GP regression, LOO-CV has a remarkable closed-form expression:
$$\text{LOO-CV} = \sum_{i=1}^n \log p(y_i | \mathbf{X}, \mathbf{y}_{-i}, \boldsymbol{\theta})$$
where $\mathbf{y}_{-i}$ denotes all observations except the $i$-th. This measures how well the model predicts held-out points when trained on the rest.
Using the GP predictive equations and the matrix inversion lemma, this can be computed efficiently:
$$\text{LOO-CV} = -\frac{1}{2}\sum_{i=1}^n \left( \log\sigma_i^2 + \frac{(y_i - \mu_i)^2}{\sigma_i^2} + \log 2\pi \right)$$
where $\mu_i$ and $\sigma_i^2$ are the LOO predictive mean and variance for point $i$.
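A short sketch of this computation, using the standard identities that express the LOO predictive moments through the precision matrix: with $\boldsymbol{\alpha} = \mathbf{K}_y^{-1}\mathbf{y}$, we have $\sigma_i^2 = 1/[\mathbf{K}_y^{-1}]_{ii}$ and $\mu_i = y_i - \alpha_i/[\mathbf{K}_y^{-1}]_{ii}$. The function name is illustrative, and explicit inversion is used only for clarity at moderate $n$.

```python
import numpy as np

def loo_cv_log_predictive(y: np.ndarray, K_y: np.ndarray) -> float:
    """Sum of leave-one-out log predictive densities for GP regression."""
    P = np.linalg.inv(K_y)            # precision matrix; fine for moderate n
    alpha = P @ y                     # K_y^{-1} y
    var_loo = 1.0 / np.diag(P)        # LOO predictive variances sigma_i^2
    mu_loo = y - alpha * var_loo      # LOO predictive means mu_i
    return -0.5 * np.sum(np.log(var_loo)
                         + (y - mu_loo) ** 2 / var_loo
                         + np.log(2.0 * np.pi))
```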
When do they differ?
In practice, marginal likelihood and LOO-CV often select similar hyperparameters. They diverge when:
Model misspecification is severe: If the true data-generating process is far from any GP prior, LOO-CV may find hyperparameters that minimize prediction error even if they don't correspond to a 'good' probabilistic model.
Small datasets: The marginal likelihood's complexity penalty is calibrated to the Bayesian framework; LOO-CV can underpenalize complexity when data is scarce.
Heteroscedastic or non-Gaussian noise: LOO-CV focuses on prediction accuracy regardless of how the noise model is specified; marginal likelihood requires correct noise modeling for valid results.
As a principled default, maximizing the marginal likelihood is preferred for GP hyperparameter learning. Cross-validation serves as a valuable sanity check and alternative when model assumptions are suspect.
Evaluating the log marginal likelihood requires computing two quantities: the quadratic form $\mathbf{y}^\top \mathbf{K}_y^{-1} \mathbf{y}$ and the log-determinant $\log|\mathbf{K}_y|$.
Both operations are dominated by inverting or factorizing the $n \times n$ covariance matrix.
Cholesky Decomposition Approach:
The standard approach is to compute the Cholesky decomposition:
$$\mathbf{K}_y = \mathbf{L} \mathbf{L}^\top$$
where $\mathbf{L}$ is lower triangular. This decomposition costs $O(n^3)$ time and $O(n^2)$ memory.
Once $\mathbf{L}$ is available:
Solve $\mathbf{K}_y^{-1} \mathbf{y}$: Forward substitution for $\mathbf{z} = \mathbf{L}^{-1}\mathbf{y}$, then backward substitution for $\boldsymbol{\alpha} = \mathbf{L}^{-\top}\mathbf{z}$. Cost: $O(n^2)$.
Compute $\log|\mathbf{K}_y|$: Since $|\mathbf{K}_y| = |\mathbf{L}|^2$ and $|\mathbf{L}| = \prod_i L_{ii}$, we have $\log|\mathbf{K}_y| = 2\sum_i \log L_{ii}$. Cost: $O(n)$.
The final expression using the Cholesky factor:
$$\mathcal{L}(\boldsymbol{\theta}) = -\frac{1}{2}\|\mathbf{L}^{-1}\mathbf{y}\|^2 - \sum_i \log L_{ii} - \frac{n}{2}\log(2\pi)$$
```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def log_marginal_likelihood(y: np.ndarray, K_y: np.ndarray) -> float:
    """
    Compute the log marginal likelihood for GP regression.

    Parameters
    ----------
    y : np.ndarray, shape (n,)
        Observed target values
    K_y : np.ndarray, shape (n, n)
        Covariance matrix K + sigma_n^2 * I

    Returns
    -------
    float
        Log marginal likelihood value
    """
    n = len(y)

    # Cholesky decomposition: K_y = L @ L.T
    L, lower = cho_factor(K_y, lower=True)

    # Solve for alpha = K_y^{-1} @ y via Cholesky
    alpha = cho_solve((L, lower), y)

    # Data fit term: -0.5 * y.T @ K_y^{-1} @ y
    data_fit = -0.5 * np.dot(y, alpha)

    # Complexity penalty: -0.5 * log|K_y| = -sum(log(diag(L)))
    complexity = -np.sum(np.log(np.diag(L)))

    # Normalization constant
    normalization = -0.5 * n * np.log(2 * np.pi)

    return data_fit + complexity + normalization
```

The Cholesky decomposition can fail if $\mathbf{K}_y$ is not positive definite (which can occur due to numerical errors, especially with very small noise variance). Adding a small 'jitter' term $\epsilon \mathbf{I}$ (typically $\epsilon = 10^{-6}$ to $10^{-10}$) to the diagonal stabilizes the computation without significantly affecting results.
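As a quick sanity check (an illustrative sketch, not part of any library), the function above can be compared against SciPy's multivariate normal log-density on a small synthetic problem with an RBF covariance plus noise and jitter on the diagonal:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(42)
n = 30
X = np.sort(rng.uniform(0.0, 5.0, size=(n, 1)), axis=0)

# RBF kernel (sigma_f^2 = 1, ell = 1) plus noise variance 0.1 and a small jitter
sq_dists = (X - X.T) ** 2
K_y = np.exp(-0.5 * sq_dists) + (0.1 + 1e-8) * np.eye(n)

y = rng.multivariate_normal(np.zeros(n), K_y)

lml = log_marginal_likelihood(y, K_y)
ref = multivariate_normal(mean=np.zeros(n), cov=K_y).logpdf(y)
print(np.allclose(lml, ref))   # True, up to floating-point error
```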
Understanding the shape of the marginal likelihood as a function of hyperparameters is crucial for successful optimization.
General Characteristics:
Non-convexity: The marginal likelihood is in general non-convex in hyperparameters. Multiple local maxima can exist, corresponding to qualitatively different models (e.g., 'smooth function + high noise' vs. 'wiggly function + low noise').
Log-scale parameterization: Hyperparameters like $\ell$, $\sigma_f^2$, $\sigma_n^2$ are strictly positive and often span multiple orders of magnitude. Working with $\log \ell$, $\log \sigma_f^2$, $\log \sigma_n^2$ instead is standard practice.
Flat regions: In sparsely sampled regions of input space, the marginal likelihood can be insensitive to length-scale (any scale larger than the data spacing has similar effect).
Sharp ridges: The noise-signal variance trade-off often creates sharp ridges in the marginal likelihood surface, making optimization challenging.
Common Pathologies:
| Pathology | Symptom | Cause | Resolution |
|---|---|---|---|
| Degenerate noise | $\sigma_n^2 \to 0$ | Overfitting noise | Add minimum noise bound or jitter |
| Infinite length-scale | $\ell \to \infty$ | Data looks constant | Verify data has genuine signal |
| Zero length-scale | $\ell \to 0$ | Extreme local fitting | Check for data issues; regularize |
| Multiple optima | Different runs → different $\theta$ | Non-convex landscape | Use multiple random restarts |
Local vs. Global Optima:
Different local optima can correspond to genuinely different explanations of the data: for example, a slowly varying function observed with substantial noise versus a rapidly varying function observed with little noise.
Both might achieve reasonable marginal likelihood values. Domain knowledge or external validation can help distinguish between such alternatives.
Always run optimization from multiple random initial points and select the solution with the highest marginal likelihood. For critical applications, visualize the marginal likelihood surface (at least in 2D slices) to understand the landscape and ensure you've found the global optimum.
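A minimal sketch of this advice, assuming the illustrative `rbf_kernel` and `log_marginal_likelihood` helpers defined earlier on this page: the optimizer works on log-hyperparameters and the best of several random restarts is kept. The helper names and the initialization range are arbitrary choices, not a prescribed recipe.

```python
import numpy as np
from scipy.optimize import minimize

def neg_lml(log_theta: np.ndarray, X: np.ndarray, y: np.ndarray) -> float:
    """Negative log marginal likelihood as a function of log hyperparameters."""
    sigma_f2, ell, sigma_n2 = np.exp(log_theta)           # exponentiate to enforce positivity
    K_y = rbf_kernel(X, X, signal_var=sigma_f2, length_scale=ell)
    K_y += (sigma_n2 + 1e-6) * np.eye(len(y))             # noise variance plus jitter
    try:
        return -log_marginal_likelihood(y, K_y)
    except np.linalg.LinAlgError:
        return 1e25                                       # treat failed factorizations as terrible fits

def fit_hyperparameters(X, y, n_restarts: int = 10, seed: int = 0):
    """Maximize the LML from several random starting points; return the best result."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        x0 = rng.uniform(-2.0, 2.0, size=3)               # random initialization in log-space
        res = minimize(neg_lml, x0, args=(X, y), method="L-BFGS-B")
        if best is None or res.fun < best.fun:
            best = res
    return np.exp(best.x), -best.fun                      # (sigma_f2, ell, sigma_n2), best LML
```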
The marginal likelihood connects to several other model selection criteria, providing a unifying perspective.
Connection to AIC and BIC:
The Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) approximate the marginal likelihood under certain assumptions:
$$\text{AIC} = -2\log p(\mathbf{y} | \hat{\boldsymbol{\theta}}) + 2k$$ $$\text{BIC} = -2\log p(\mathbf{y} | \hat{\boldsymbol{\theta}}) + k \log n$$
where $k$ is the number of parameters. For GP kernels, $k$ typically refers to hyperparameters. The marginal likelihood can be viewed as a more principled version that doesn't require ad hoc counting of 'effective parameters.'
Connection to Minimum Description Length (MDL):
MDL selects models that maximize compression of the data. The marginal likelihood has an MDL interpretation: $-\log p(\mathbf{y} | \mathbf{X}, \boldsymbol{\theta})$ represents the number of bits needed to encode the data using the model. Better models provide shorter codes; the marginal likelihood seeks the shortest description.
Connection to Regularization:
Maximizing the log marginal likelihood can be rewritten as:
$$\max_{\boldsymbol{\theta}} \left[ \text{Data Fit}(\boldsymbol{\theta}) - \frac{1}{2}\log|\mathbf{K}_y(\boldsymbol{\theta})| \right]$$
The second term functions as a data-dependent regularizer. Unlike fixed regularizers (L1, L2), this regularizer adapts to the data: more complex datasets justify more complex models without excessive penalty.
Connection to PAC-Bayes Bounds:
In learning theory, PAC-Bayes bounds provide generalization guarantees based on the KL divergence between posterior and prior. The marginal likelihood appears naturally in these bounds:
$$\text{Generalization Error} \leq \text{Empirical Error} + \sqrt{\frac{\text{KL}(\text{posterior} \,\|\, \text{prior}) + \log n}{n}}$$
Maximizing marginal likelihood implicitly minimizes this bound under certain conditions, providing theoretical justification for its use in model selection.
We've developed a comprehensive understanding of the marginal likelihood as the cornerstone of GP hyperparameter learning. The key insights: the marginal likelihood is the probability of the observed targets with the latent function values integrated out; its logarithm has a closed form comprising a data fit term, a complexity penalty, and a constant; the interplay of the first two terms implements Occam's Razor automatically; it can be computed in $O(n^3)$ via a Cholesky factorization; and its surface is non-convex, so optimization benefits from log-scale parameterization and multiple restarts.
Looking Ahead:
With the marginal likelihood established as our objective, the next page addresses how to actually optimize it. We'll derive the gradients of the marginal likelihood with respect to hyperparameters—the mathematical machinery that enables efficient gradient-based optimization for GP hyperparameter learning.
You now have a thorough understanding of the marginal likelihood: its derivation, interpretation as Occam's Razor, sensitivity to hyperparameters, and computational implementation. This foundation is essential for principled GP hyperparameter learning, which we continue developing in the following pages.