The marginal likelihood (also called model evidence) is the cornerstone of Bayesian model selection. It answers a fundamental question: How probable is the observed data under this model, averaging over all possible parameter values?
For Gaussian Processes, the marginal likelihood is:
$$p(\mathbf{y}|X,\theta) = \int p(\mathbf{y}|\mathbf{f})p(\mathbf{f}|X,\theta)d\mathbf{f}$$
The function values $\mathbf{f}$ are integrated out, leaving only the hyperparameters $\theta$.
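This integral can be checked on a toy problem by brute-force Monte Carlo: sample $\mathbf{f}$ from the GP prior and average the Gaussian likelihood of $\mathbf{y}$ under each sample. The sketch below does this for an assumed RBF kernel with illustrative data and a generous noise level (chosen so the Monte Carlo estimate is stable); it is a sanity check, not a practical method:

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
n = 3
X = np.linspace(0.0, 1.0, n)
y = np.array([0.5, -0.2, 0.1])   # toy observations (illustrative)
noise_var = 1.0                  # generous noise keeps the MC estimate stable

# RBF prior covariance (unit signal variance and length scale, assumed here)
K = np.exp(-0.5 * (X[:, None] - X[None, :])**2) + 1e-10 * np.eye(n)

# Monte Carlo: p(y|X) ~= mean over f ~ N(0, K) of N(y; f, noise_var * I)
L = np.linalg.cholesky(K)
f = (L @ rng.standard_normal((n, 200_000))).T        # prior samples of f
log_lik = (-0.5 * np.sum((y - f)**2, axis=1) / noise_var
           - 0.5 * n * np.log(2 * np.pi * noise_var))
mc_log_evidence = logsumexp(log_lik) - np.log(len(log_lik))

# Closed form (derived below): y ~ N(0, K + noise_var * I)
K_y = K + noise_var * np.eye(n)
closed_form = (-0.5 * y @ np.linalg.solve(K_y, y)
               - 0.5 * np.linalg.slogdet(K_y)[1]
               - 0.5 * n * np.log(2 * np.pi))
print(mc_log_evidence, closed_form)  # the two should roughly agree
```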
By the end of this page, you will understand the derivation and closed form of the GP marginal likelihood, interpret its three components (data fit, complexity penalty, constant), and appreciate the automatic Occam's razor that prevents overfitting.
For a GP with Gaussian likelihood (regression with Gaussian noise), the marginal likelihood has a closed-form solution.
Given:

- A GP prior over function values: $\mathbf{f} \sim \mathcal{N}(\mathbf{0}, K)$, where $K_{ij} = k(x_i, x_j)$
- A Gaussian likelihood: $\mathbf{y}|\mathbf{f} \sim \mathcal{N}(\mathbf{f}, \sigma_n^2 I)$

The marginal distribution of $\mathbf{y}$ is:
$$\mathbf{y} \sim \mathcal{N}(\mathbf{0}, K + \sigma_n^2 I)$$
Let $K_y = K + \sigma_n^2 I$. The log marginal likelihood is:
$$\log p(\mathbf{y}|X,\theta) = -\frac{1}{2}\mathbf{y}^T K_y^{-1}\mathbf{y} - \frac{1}{2}\log|K_y| - \frac{n}{2}\log(2\pi)$$
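This formula is simply the log-density of a zero-mean multivariate Gaussian with covariance $K_y$, which a generic library routine can confirm. A small sketch, with an assumed RBF kernel and toy data:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
n = 8
X = np.linspace(0, 5, n)
y = np.sin(X) + 0.1 * rng.standard_normal(n)

K = np.exp(-0.5 * (X[:, None] - X[None, :])**2)  # RBF, unit scales (assumed)
K_y = K + 0.1 * np.eye(n)                        # sigma_n^2 = 0.1

# Three-term formula from the text
lml = (-0.5 * y @ np.linalg.solve(K_y, y)
       - 0.5 * np.linalg.slogdet(K_y)[1]
       - 0.5 * n * np.log(2 * np.pi))

# Generic multivariate-normal log-density: y ~ N(0, K_y)
ref = multivariate_normal(mean=np.zeros(n), cov=K_y).logpdf(y)
print(np.isclose(lml, ref))  # True
```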
The closed form exists because Gaussians are closed under marginalization: integrating a Gaussian likelihood against a Gaussian prior yields another Gaussian. This is one of the key computational advantages of GP regression with Gaussian noise.
The log marginal likelihood has three interpretable terms:
$$\log p(\mathbf{y}|X,\theta) = \underbrace{-\frac{1}{2}\mathbf{y}^T K_y^{-1}\mathbf{y}}_{\text{Data Fit}} \underbrace{- \frac{1}{2}\log|K_y|}_{\text{Complexity Penalty}} \underbrace{- \frac{n}{2}\log(2\pi)}_{\text{Constant}}$$
The Tradeoff Between Fit and Complexity
The magic of the marginal likelihood is the automatic tradeoff between fit and complexity:

- The data-fit term $-\frac{1}{2}\mathbf{y}^T K_y^{-1}\mathbf{y}$ improves as the kernel becomes more flexible and explains the observations more closely.
- The complexity penalty $-\frac{1}{2}\log|K_y|$ worsens as the kernel becomes more flexible, because a kernel that can generate many datasets has a larger determinant $|K_y|$.

The optimal hyperparameters balance these competing effects.
The marginal likelihood implements Bayesian Occam's Razor: simpler models that explain the data are preferred over complex models.
Intuition: imagine two models:

- A simple model that can generate only a narrow range of datasets
- A complex model that can generate almost any dataset

If both models can explain the observed data, the simple model assigns higher probability to that specific dataset (its probability mass is concentrated). The complex model spreads its probability across many possible datasets, assigning lower probability to each.
Mathematically, since $\int p(D|M) dD = 1$ for any model $M$, a model that can fit many datasets must assign lower probability to each individual dataset.
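A toy discrete example makes the normalization argument concrete. Here the "dataset space" is just eight possible outcomes, and both models' likelihoods sum to one (the numbers are illustrative, not derived from any real model):

```python
import numpy as np

# Toy dataset space: 8 possible datasets. Each model's likelihoods sum to 1.
# "Simple" model: probability mass concentrated on the two datasets it can explain
p_simple = np.array([0.5, 0.5, 0, 0, 0, 0, 0, 0])

# "Complex" model: probability mass spread over all eight datasets
p_complex = np.full(8, 1 / 8)

# Both are valid likelihoods over dataset space
assert np.isclose(p_simple.sum(), 1) and np.isclose(p_complex.sum(), 1)

observed = 1  # a dataset both models can explain
print(p_simple[observed], p_complex[observed])  # 0.5 vs 0.125: simple model wins
```

Because the complex model must spread the same unit of probability over more datasets, the simple model automatically assigns more evidence to any dataset both can explain.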
Unlike methods that require explicit regularization (L1, L2 penalties), the marginal likelihood automatically penalizes complexity. This is an emergent property of Bayesian inference, not something we add by hand.
```python
import numpy as np
from scipy.spatial.distance import cdist

def log_marginal_likelihood(X, y, length_scale, signal_var, noise_var):
    """Compute the log marginal likelihood for GP regression."""
    n = len(y)

    # Build kernel matrix
    sq_dist = cdist(X.reshape(-1, 1), X.reshape(-1, 1), 'sqeuclidean')
    K = signal_var * np.exp(-sq_dist / (2 * length_scale**2))
    K_y = K + noise_var * np.eye(n)

    # Cholesky decomposition for numerical stability
    L = np.linalg.cholesky(K_y)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))

    # Three terms
    data_fit = -0.5 * y.T @ alpha
    complexity = -np.sum(np.log(np.diag(L)))  # = -0.5 * log|K_y|
    constant = -0.5 * n * np.log(2 * np.pi)

    return data_fit + complexity + constant, {
        'data_fit': float(data_fit),
        'complexity': float(complexity),
        'constant': float(constant)
    }

# Example: see how the terms change with the length scale
X = np.linspace(0, 10, 20)
y = np.sin(X) + 0.1 * np.random.randn(20)

for ls in [0.1, 0.5, 1.0, 2.0, 5.0]:
    lml, terms = log_marginal_likelihood(X, y, ls, 1.0, 0.01)
    print(f"l={ls}: LML={lml:.2f} | Fit={terms['data_fit']:.2f}, "
          f"Complexity={terms['complexity']:.2f}")
```

The marginal likelihood enables principled model comparison. Given two models $M_1$ and $M_2$ with different kernel structures:
$$\frac{p(M_1|D)}{p(M_2|D)} = \frac{p(D|M_1)}{p(D|M_2)} \cdot \frac{p(M_1)}{p(M_2)}$$
If we have no prior preference ($p(M_1) = p(M_2)$), the ratio of marginal likelihoods (called the Bayes factor) directly gives the posterior odds.
In practice, we compare log marginal likelihoods:
| Log Bayes Factor | Bayes Factor | Evidence Strength |
|---|---|---|
| 0 to 1 | 1 to 3 | Barely worth mentioning |
| 1 to 3 | 3 to 20 | Positive |
| 3 to 5 | 20 to 150 | Strong |
| > 5 | > 150 | Very strong |
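In code, the Bayes factor comes straight from subtracting log marginal likelihoods. The values below are hypothetical, standing in for two fitted kernels:

```python
import numpy as np

# Hypothetical log marginal likelihoods for two kernel choices
lml_m1 = -12.3   # e.g., an RBF kernel
lml_m2 = -16.1   # e.g., a periodic kernel

log_bf = lml_m1 - lml_m2          # log Bayes factor in favor of M1
bayes_factor = np.exp(log_bf)

print(f"log BF = {log_bf:.1f}, BF = {bayes_factor:.0f}")
# A log Bayes factor of 3.8 falls in the 3-5 band: strong evidence for M1
```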
Computing the log marginal likelihood requires care for numerical stability.

Use Cholesky decomposition:

1. Factor $K_y = LL^T$ with $L$ lower triangular
2. Compute $\log|K_y| = 2\sum_i \log L_{ii}$
3. Compute $K_y^{-1}\mathbf{y}$ via two triangular solves: $\boldsymbol{\alpha} = L^{-T}(L^{-1}\mathbf{y})$

Why Cholesky? It is numerically stable for positive definite matrices, costs about half as much as a general LU factorization, and avoids forming an explicit inverse, which is both slower and less accurate.
If Cholesky fails (matrix not positive definite), add a small jitter to the diagonal: `K_y += 1e-6 * I`. This can happen with very short length scales or numerical precision issues.
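One common way to package this fallback is a retry loop with escalating jitter. A minimal sketch (the function name, retry count, and starting jitter are arbitrary choices):

```python
import numpy as np

def safe_cholesky(K_y, max_tries=5, jitter=1e-6):
    """Cholesky factorization with escalating diagonal jitter as a fallback."""
    for _ in range(max_tries):
        try:
            return np.linalg.cholesky(K_y)
        except np.linalg.LinAlgError:
            # Not positive definite: nudge the diagonal and escalate
            K_y = K_y + jitter * np.eye(K_y.shape[0])
            jitter *= 10
    raise np.linalg.LinAlgError("matrix not positive definite even with jitter")

# Usage: np.ones((3, 3)) is singular, so plain Cholesky fails, but a tiny
# jitter makes it positive definite
L = safe_cholesky(np.ones((3, 3)))
print(L.shape)  # (3, 3)
```

The escalation keeps the first attempt exact: jitter is only added when the factorization actually fails, so well-conditioned matrices are untouched.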
You now understand the marginal likelihood deeply—its derivation, interpretation, and computation. Next, we'll cover Automatic Relevance Determination (ARD), which extends these ideas to automatically determine feature importance.