The marginal likelihood (also called model evidence) is the cornerstone of Bayesian model selection. It answers a fundamental question: How probable is the observed data under this model, averaging over all possible parameter values?
For Gaussian Processes, the marginal likelihood is:
$$p(\mathbf{y}|X,\theta) = \int p(\mathbf{y}|\mathbf{f})p(\mathbf{f}|X,\theta)d\mathbf{f}$$
The function values $\mathbf{f}$ are integrated out, leaving only the hyperparameters $\theta$.
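This integral can be checked on a toy problem by brute-force Monte Carlo: sample $\mathbf{f}$ from the GP prior and average the Gaussian likelihood of $\mathbf{y}$ under each sample. The sketch below does this for an assumed RBF kernel with illustrative data and a generous noise level (chosen so the Monte Carlo estimate is stable); it is a sanity check, not a practical method:

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
n = 3
X = np.linspace(0.0, 1.0, n)
y = np.array([0.5, -0.2, 0.1])   # toy observations (illustrative)
noise_var = 1.0                  # generous noise keeps the MC estimate stable

# RBF prior covariance (unit signal variance and length scale, assumed here)
K = np.exp(-0.5 * (X[:, None] - X[None, :])**2) + 1e-10 * np.eye(n)

# Monte Carlo: p(y|X) ~= mean over f ~ N(0, K) of N(y; f, noise_var * I)
L = np.linalg.cholesky(K)
f = (L @ rng.standard_normal((n, 200_000))).T        # prior samples of f
log_lik = (-0.5 * np.sum((y - f)**2, axis=1) / noise_var
           - 0.5 * n * np.log(2 * np.pi * noise_var))
mc_log_evidence = logsumexp(log_lik) - np.log(len(log_lik))

# Closed form (derived below): y ~ N(0, K + noise_var * I)
K_y = K + noise_var * np.eye(n)
closed_form = (-0.5 * y @ np.linalg.solve(K_y, y)
               - 0.5 * np.linalg.slogdet(K_y)[1]
               - 0.5 * n * np.log(2 * np.pi))
print(mc_log_evidence, closed_form)  # the two should roughly agree
```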
By the end of this page, you will understand the derivation and closed form of the GP marginal likelihood, interpret its three components (data fit, complexity penalty, constant), and appreciate the automatic Occam's razor that prevents overfitting.
For a GP with Gaussian likelihood (regression with Gaussian noise), the marginal likelihood has a closed-form solution.
Given:

- A GP prior over function values: $\mathbf{f} \sim \mathcal{N}(\mathbf{0}, K)$, where $K_{ij} = k(x_i, x_j)$
- A Gaussian likelihood: $\mathbf{y}|\mathbf{f} \sim \mathcal{N}(\mathbf{f}, \sigma_n^2 I)$

The marginal distribution of $\mathbf{y}$ is:
$$\mathbf{y} \sim \mathcal{N}(\mathbf{0}, K + \sigma_n^2 I)$$
Let $K_y = K + \sigma_n^2 I$. The log marginal likelihood is:
$$\log p(\mathbf{y}|X,\theta) = -\frac{1}{2}\mathbf{y}^T K_y^{-1}\mathbf{y} - \frac{1}{2}\log|K_y| - \frac{n}{2}\log(2\pi)$$
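This formula is simply the log-density of a zero-mean multivariate Gaussian with covariance $K_y$, which a generic library routine can confirm. A small sketch, with an assumed RBF kernel and toy data:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
n = 8
X = np.linspace(0, 5, n)
y = np.sin(X) + 0.1 * rng.standard_normal(n)

K = np.exp(-0.5 * (X[:, None] - X[None, :])**2)  # RBF, unit scales (assumed)
K_y = K + 0.1 * np.eye(n)                        # sigma_n^2 = 0.1

# Three-term formula from the text
lml = (-0.5 * y @ np.linalg.solve(K_y, y)
       - 0.5 * np.linalg.slogdet(K_y)[1]
       - 0.5 * n * np.log(2 * np.pi))

# Generic multivariate-normal log-density: y ~ N(0, K_y)
ref = multivariate_normal(mean=np.zeros(n), cov=K_y).logpdf(y)
print(np.isclose(lml, ref))  # True
```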
The closed form exists because Gaussians are closed under marginalization: integrating a Gaussian likelihood against a Gaussian prior yields another Gaussian. This is one of the key computational advantages of GP regression with Gaussian noise.
The log marginal likelihood has three interpretable terms:
$$\log p(\mathbf{y}|X,\theta) = \underbrace{-\frac{1}{2}\mathbf{y}^T K_y^{-1}\mathbf{y}}_{\text{Data Fit}} \underbrace{- \frac{1}{2}\log|K_y|}_{\text{Complexity Penalty}} \underbrace{- \frac{n}{2}\log(2\pi)}_{\text{Constant}}$$
The Tradeoff Between Fit and Complexity
The magic of the marginal likelihood is the automatic tradeoff between fit and complexity:

- The data-fit term $-\frac{1}{2}\mathbf{y}^T K_y^{-1}\mathbf{y}$ improves as the kernel becomes more flexible and explains the observations more closely.
- The complexity penalty $-\frac{1}{2}\log|K_y|$ worsens as the kernel becomes more flexible, because a kernel that can generate many datasets has a larger determinant $|K_y|$.

The optimal hyperparameters balance these competing effects.
The marginal likelihood implements Bayesian Occam's Razor: simpler models that explain the data are preferred over complex models.
Intuition: imagine two models:

- A simple model that can generate only a narrow range of datasets
- A complex model that can generate almost any dataset

If both models can explain the observed data, the simple model assigns higher probability to that specific dataset (its probability mass is concentrated). The complex model spreads its probability across many possible datasets, assigning lower probability to each.
Mathematically, since $\int p(D|M) dD = 1$ for any model $M$, a model that can fit many datasets must assign lower probability to each individual dataset.
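A toy discrete example makes the normalization argument concrete. Here the "dataset space" is just eight possible outcomes, and both models' likelihoods sum to one (the numbers are illustrative, not derived from any real model):

```python
import numpy as np

# Toy dataset space: 8 possible datasets. Each model's likelihoods sum to 1.
# "Simple" model: probability mass concentrated on the two datasets it can explain
p_simple = np.array([0.5, 0.5, 0, 0, 0, 0, 0, 0])

# "Complex" model: probability mass spread over all eight datasets
p_complex = np.full(8, 1 / 8)

# Both are valid likelihoods over dataset space
assert np.isclose(p_simple.sum(), 1) and np.isclose(p_complex.sum(), 1)

observed = 1  # a dataset both models can explain
print(p_simple[observed], p_complex[observed])  # 0.5 vs 0.125: simple model wins
```

Because the complex model must spread the same unit of probability over more datasets, the simple model automatically assigns more evidence to any dataset both can explain.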
Unlike methods that require explicit regularization (L1, L2 penalties), the marginal likelihood automatically penalizes complexity. This is an emergent property of Bayesian inference, not something we add by hand.
```python
import numpy as np
from scipy.spatial.distance import cdist

def log_marginal_likelihood(X, y, length_scale, signal_var, noise_var):
    """Compute the log marginal likelihood for GP regression."""
    n = len(y)

    # Build kernel matrix
    sq_dist = cdist(X.reshape(-1, 1), X.reshape(-1, 1), 'sqeuclidean')
    K = signal_var * np.exp(-sq_dist / (2 * length_scale**2))
    K_y = K + noise_var * np.eye(n)

    # Cholesky decomposition for numerical stability
    L = np.linalg.cholesky(K_y)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))

    # Three terms
    data_fit = -0.5 * y.T @ alpha
    complexity = -np.sum(np.log(np.diag(L)))  # = -0.5 * log|K_y|
    constant = -0.5 * n * np.log(2 * np.pi)

    return data_fit + complexity + constant, {
        'data_fit': float(data_fit),
        'complexity': float(complexity),
        'constant': float(constant)
    }

# Example: see how the terms change with the length scale
X = np.linspace(0, 10, 20)
y = np.sin(X) + 0.1 * np.random.randn(20)

for ls in [0.1, 0.5, 1.0, 2.0, 5.0]:
    lml, terms = log_marginal_likelihood(X, y, ls, 1.0, 0.01)
    print(f"l={ls}: LML={lml:.2f} | Fit={terms['data_fit']:.2f}, "
          f"Complexity={terms['complexity']:.2f}")
```

The marginal likelihood enables principled model comparison. Given two models $M_1$ and $M_2$ with different kernel structures:
$$\frac{p(M_1|D)}{p(M_2|D)} = \frac{p(D|M_1)}{p(D|M_2)} \cdot \frac{p(M_1)}{p(M_2)}$$
If we have no prior preference ($p(M_1) = p(M_2)$), the ratio of marginal likelihoods (called the Bayes factor) directly gives the posterior odds.
In practice, we compare log marginal likelihoods:
| Log Bayes Factor | Bayes Factor | Evidence Strength |
|---|---|---|
| 0 to 1 | 1 to 3 | Barely worth mentioning |
| 1 to 3 | 3 to 20 | Positive |
| 3 to 5 | 20 to 150 | Strong |
| > 5 | > 150 | Very strong |
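In code, the Bayes factor comes straight from subtracting log marginal likelihoods. The values below are hypothetical, standing in for two fitted kernels:

```python
import numpy as np

# Hypothetical log marginal likelihoods for two kernel choices
lml_m1 = -12.3   # e.g., an RBF kernel
lml_m2 = -16.1   # e.g., a periodic kernel

log_bf = lml_m1 - lml_m2          # log Bayes factor in favor of M1
bayes_factor = np.exp(log_bf)

print(f"log BF = {log_bf:.1f}, BF = {bayes_factor:.0f}")
# A log Bayes factor of 3.8 falls in the 3-5 band: strong evidence for M1
```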
Computing the log marginal likelihood requires care for numerical stability.

Use Cholesky decomposition:

1. Factor $K_y = LL^T$ with $L$ lower triangular
2. Compute $\log|K_y| = 2\sum_i \log L_{ii}$
3. Compute $K_y^{-1}\mathbf{y}$ via two triangular solves: $\boldsymbol{\alpha} = L^{-T}(L^{-1}\mathbf{y})$

Why Cholesky? It is numerically stable for positive definite matrices, costs about half as much as a general LU factorization, and avoids forming an explicit inverse, which is both slower and less accurate.
If Cholesky fails (matrix not positive definite), add a small jitter to the diagonal: `K_y += 1e-6 * I`. This can happen with very short length scales or numerical precision issues.
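One common way to package this fallback is a retry loop with escalating jitter. A minimal sketch (the function name, retry count, and starting jitter are arbitrary choices):

```python
import numpy as np

def safe_cholesky(K_y, max_tries=5, jitter=1e-6):
    """Cholesky factorization with escalating diagonal jitter as a fallback."""
    for _ in range(max_tries):
        try:
            return np.linalg.cholesky(K_y)
        except np.linalg.LinAlgError:
            # Not positive definite: nudge the diagonal and escalate
            K_y = K_y + jitter * np.eye(K_y.shape[0])
            jitter *= 10
    raise np.linalg.LinAlgError("matrix not positive definite even with jitter")

# Usage: np.ones((3, 3)) is singular, so plain Cholesky fails, but a tiny
# jitter makes it positive definite
L = safe_cholesky(np.ones((3, 3)))
print(L.shape)  # (3, 3)
```

The escalation keeps the first attempt exact: jitter is only added when the factorization actually fails, so well-conditioned matrices are untouched.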
You now understand the marginal likelihood deeply—its derivation, interpretation, and computation. Next, we'll cover Automatic Relevance Determination (ARD), which extends these ideas to automatically determine feature importance.