When faced with an intractable posterior, the most natural first approach is to find its mode and approximate the distribution around it. This is the essence of the Laplace approximation—a technique dating back to Pierre-Simon Laplace's 18th-century work on Bayesian inference, now the workhorse of practical GP classification.
The intuition is geometric: if the true posterior is unimodal and reasonably well-behaved, we can approximate it with a Gaussian centered at the mode. The Gaussian's covariance is determined by the curvature (Hessian) of the log-posterior at that point. For many distributions, this approximation is remarkably accurate—and for GP classification with logistic or probit likelihoods, it works particularly well due to the concavity of the log-posterior.
By the end of this page, you will understand: (1) The theoretical foundation of Laplace approximation, (2) How to find the posterior mode via Newton-Raphson optimization, (3) The complete algorithm for GP classification predictions, (4) How to approximate the marginal likelihood for hyperparameter learning, and (5) The strengths and limitations of the Laplace approach.
The Laplace approximation replaces an intractable distribution with a Gaussian by performing a second-order Taylor expansion of the log-density around its mode.
The Setup:
Given a target distribution $p(\mathbf{f}) \propto \exp(\Psi(\mathbf{f}))$, where $\Psi(\mathbf{f}) = \log \tilde{p}(\mathbf{f})$ is the log of the unnormalized density, we:
Find the mode: $\hat{\mathbf{f}} = \arg\max_{\mathbf{f}} \Psi(\mathbf{f})$
Taylor expand: Around $\hat{\mathbf{f}}$, approximate: $$\Psi(\mathbf{f}) \approx \Psi(\hat{\mathbf{f}}) + \underbrace{\nabla \Psi(\hat{\mathbf{f}})^\top (\mathbf{f} - \hat{\mathbf{f}})}_{= 0 \text{ at mode}} + \frac{1}{2}(\mathbf{f} - \hat{\mathbf{f}})^\top \nabla^2 \Psi(\hat{\mathbf{f}}) (\mathbf{f} - \hat{\mathbf{f}})$$
Recognize the Gaussian: The quadratic form in the exponent defines: $$q(\mathbf{f}) = \mathcal{N}(\mathbf{f} | \hat{\mathbf{f}}, A^{-1})$$
where $A = -\nabla^2 \Psi(\hat{\mathbf{f}})$ is the negative Hessian of the log-posterior (precision matrix).
Visualize the log-posterior as a landscape in n-dimensional space. The Laplace approximation finds the peak of this landscape, then fits a paraboloid (quadratic surface) tangent to the peak. Exponentiating this paraboloid gives a Gaussian. The approximation is exact if the true log-posterior is exactly quadratic (i.e., if the true posterior is Gaussian).
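As a minimal illustration of this recipe (independent of GPs), the sketch below applies it to a one-dimensional Gamma density, whose mode and log-density curvature are available in closed form. The variable names and the choice of target density are ours, for illustration only.

```python
import numpy as np

# Laplace approximation of a Gamma(shape=a, rate=b) density, a toy 1-D example.
# log p(f) = (a - 1) * log(f) - b * f + const, for f > 0 and a > 1.
a, b = 5.0, 2.0

# Step 1: find the mode by solving d/df log p(f) = (a - 1)/f - b = 0.
f_hat = (a - 1) / b

# Step 2: curvature of the log-density at the mode.
# d^2/df^2 log p(f) = -(a - 1)/f^2, so the precision is A = (a - 1)/f_hat^2.
A = (a - 1) / f_hat**2

# Step 3: the Laplace approximation is N(f_hat, 1/A).
print(f"Laplace approximation: N({f_hat:.3f}, {1 / A:.3f})")
print(f"True mean / variance of Gamma(a, b): {a / b:.3f} / {a / b**2:.3f}")
```

The approximation centers on the mode (2.0) rather than the mean (2.5) and slightly underestimates the variance, a typical pattern for skewed targets.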
Why It Works for GP Classification:
For GP classification, the log-posterior is:
$$\Psi(\mathbf{f}) = \log p(\mathbf{y} | \mathbf{f}) + \log p(\mathbf{f} | X) + \text{const}$$
$$= \sum_{i=1}^n \psi_i(f_i) - \frac{1}{2}(\mathbf{f} - \mathbf{m})^\top K^{-1} (\mathbf{f} - \mathbf{m}) + \text{const}$$
The GP prior contributes a quadratic term (exactly Gaussian). The likelihood contributes a sum of log-sigmoids. Since each $\psi_i$ is concave, $\Psi$ is concave—guaranteeing a unique global maximum.
Key Insight: The prior term is already quadratic, so the deviation from Gaussianity comes entirely from the likelihood. For data points where $|f_i|$ is large (confident predictions), the likelihood term is nearly linear, contributing little curvature. The non-Gaussianity is concentrated where predictions are uncertain.
The first step in Laplace approximation is finding the posterior mode $\hat{\mathbf{f}}$, the Maximum A Posteriori (MAP) estimate. We maximize:
$$\Psi(\mathbf{f}) = \sum_{i=1}^n \psi_i(f_i) - \frac{1}{2}(\mathbf{f} - \mathbf{m})^\top K^{-1} (\mathbf{f} - \mathbf{m})$$
Gradient:
$$\nabla \Psi = \nabla \boldsymbol{\psi} - K^{-1}(\mathbf{f} - \mathbf{m})$$
where $\nabla \boldsymbol{\psi} = [y_1 - \pi_1, ..., y_n - \pi_n]^\top$ with $\pi_i = \sigma(f_i)$.
Hessian:
$$\nabla^2 \Psi = -W - K^{-1}$$
where $W = \text{diag}(\pi_i(1-\pi_i))$ is the diagonal matrix of negative second derivatives of the log-likelihood.
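As a quick sanity check on these derivatives (an aside, not part of the main algorithm), the snippet below compares $y_i - \pi_i$ and $-\pi_i(1-\pi_i)$ against central finite differences of the Bernoulli log-likelihood $\psi_i(f_i)$:

```python
import numpy as np

def log_bernoulli(f, y):
    """psi_i(f) = y*log(sigma(f)) + (1 - y)*log(1 - sigma(f)),
    written stably as -log(1 + exp(-(2y - 1) * f)) for y in {0, 1}."""
    return -np.logaddexp(0.0, -(2 * y - 1) * f)

f, y, eps = 0.7, 1.0, 1e-4
pi = 1 / (1 + np.exp(-f))

# Analytic first and second derivatives of psi_i at f.
grad_analytic = y - pi
hess_analytic = -pi * (1 - pi)

# Central finite differences.
grad_fd = (log_bernoulli(f + eps, y) - log_bernoulli(f - eps, y)) / (2 * eps)
hess_fd = (log_bernoulli(f + eps, y) - 2 * log_bernoulli(f, y)
           + log_bernoulli(f - eps, y)) / eps**2

print(f"gradient: analytic {grad_analytic:.6f}, finite diff {grad_fd:.6f}")
print(f"hessian : analytic {hess_analytic:.6f}, finite diff {hess_fd:.6f}")
```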
Setting the gradient to zero gives the fixed-point equation:
$$\mathbf{f} = \mathbf{m} + K(\nabla \boldsymbol{\psi}) = \mathbf{m} + K(\mathbf{y} - \boldsymbol{\pi})$$
This is not closed-form because $\boldsymbol{\pi}$ depends on $\mathbf{f}$.
Newton-Raphson Iteration:
We solve for the mode using Newton's method. The update rule is:
$$\mathbf{f}^{(t+1)} = \mathbf{f}^{(t)} - (\nabla^2 \Psi)^{-1} \nabla \Psi$$
Substituting our expressions:
$$\mathbf{f}^{(t+1)} = \mathbf{f}^{(t)} + (W + K^{-1})^{-1}(\nabla \boldsymbol{\psi} - K^{-1}(\mathbf{f}^{(t)} - \mathbf{m}))$$
Substituting the gradient and collecting terms (taking the prior mean $\mathbf{m} = \mathbf{0}$ from here on; a nonzero mean simply adds $K^{-1}\mathbf{m}$ inside the final parentheses):
$$\mathbf{f}^{(t+1)} = (K^{-1} + W)^{-1}(W\mathbf{f}^{(t)} + \nabla \boldsymbol{\psi})$$
Defining $\mathbf{b} = W\mathbf{f}^{(t)} + \nabla \boldsymbol{\psi}$:
$$\mathbf{f}^{(t+1)} = (K^{-1} + W)^{-1}\mathbf{b}$$
Rather than inverting (K⁻¹ + W) directly, we use the matrix identity (K⁻¹ + W)⁻¹ = K - K W^{1/2} B⁻¹ W^{1/2} K, where B = I + W^{1/2} K W^{1/2}. Since W is diagonal, B is symmetric, well conditioned, and can be handled with a single Cholesky factorization per iteration. The per-iteration cost is O(n³) for the Cholesky decomposition, the same as standard GP regression.
```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular


def sigmoid(f):
    """Numerically stable sigmoid."""
    return np.where(f >= 0, 1 / (1 + np.exp(-f)), np.exp(f) / (1 + np.exp(f)))


def find_laplace_mode(K, y, m=None, max_iter=100, tol=1e-9):
    """
    Find the MAP estimate for GP classification using Newton-Raphson.

    Parameters:
    -----------
    K : ndarray (n, n)
        Kernel matrix
    y : ndarray (n,)
        Binary labels {0, 1}
    m : ndarray (n,) or None
        Prior mean (defaults to zero)
    max_iter : int
        Maximum Newton iterations
    tol : float
        Convergence tolerance

    Returns:
    --------
    f_hat : ndarray (n,)
        Mode of the posterior
    W : ndarray (n,)
        Diagonal of the negative likelihood Hessian at the mode
    L : ndarray (n, n)
        Cholesky factor of B = I + W^{1/2} K W^{1/2}
    """
    n = len(y)
    if m is None:
        m = np.zeros(n)
    f = m.copy()  # Initialize at the prior mean

    for iteration in range(max_iter):
        # Compute probabilities and likelihood derivatives
        pi = sigmoid(f)
        grad_psi = y - pi   # Gradient of the log-likelihood
        W = pi * (1 - pi)   # Diagonal of -Hessian of the log-likelihood

        # Numerically stable computation using Cholesky
        # B = I + W^{1/2} K W^{1/2}
        sqrt_W = np.sqrt(W)
        B = np.eye(n) + np.outer(sqrt_W, sqrt_W) * K
        L = cholesky(B, lower=True)

        # Newton step: f_new = (K^{-1} + W)^{-1} (W f + grad_psi + K^{-1} m),
        # rewritten as f_new = c - K W^{1/2} B^{-1} W^{1/2} c with c = K b + m
        b = W * f + grad_psi
        c = K @ b + m
        v = solve_triangular(L, sqrt_W * c, lower=True)
        f_new = c - K @ (sqrt_W * solve_triangular(L.T, v, lower=False))

        # Check convergence
        if np.max(np.abs(f_new - f)) < tol:
            print(f"Converged after {iteration + 1} iterations")
            f = f_new
            break
        f = f_new
    else:
        print(f"Warning: Did not converge after {max_iter} iterations")

    # Recompute final quantities at the mode
    pi = sigmoid(f)
    W = pi * (1 - pi)
    sqrt_W = np.sqrt(W)
    B = np.eye(n) + np.outer(sqrt_W, sqrt_W) * K
    L = cholesky(B, lower=True)

    return f, W, L


# Example usage
np.random.seed(42)
n = 50
X = np.random.randn(n, 1)
y = (X[:, 0] > 0).astype(float)  # Simple linearly separable problem


# RBF kernel
def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    sq_dist = np.sum((X1[:, None] - X2[None, :])**2, axis=-1)
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)


K = rbf_kernel(X, X)
f_hat, W, L = find_laplace_mode(K, y)
print(f"Mode found: f_hat in range [{f_hat.min():.2f}, {f_hat.max():.2f}]")
```

Once we have the mode $\hat{\mathbf{f}}$, the Laplace approximation to the posterior is:
$$q(\mathbf{f} | \mathbf{y}, X) = \mathcal{N}(\mathbf{f} | \hat{\mathbf{f}}, (K^{-1} + W)^{-1})$$
where $W = \text{diag}(\hat{\pi}_i(1-\hat{\pi}_i))$ is evaluated at the mode.
Interpreting the Covariance:
The posterior covariance $(K^{-1} + W)^{-1}$ has an intuitive interpretation:
The posterior precision (inverse covariance) is the sum of prior precision and likelihood precision—exactly as in Bayesian linear regression. This additive structure is a hallmark of Gaussian inference.
Using the matrix inversion lemma:
$$(K^{-1} + W)^{-1} = K - K(K + W^{-1})^{-1}K = K - K W^{1/2} B^{-1} W^{1/2} K$$
where $B = I + W^{1/2} K W^{1/2}$.
This shows the posterior covariance is always smaller than the prior covariance (in the positive-definite ordering), reflecting information gain from observations.
The diagonal elements W_ii = π_i(1-π_i) are maximized when π_i = 0.5 (uncertain prediction) and minimized when π_i ≈ 0 or 1 (confident prediction). Points where the model is uncertain contribute more 'information' to the posterior—they constrain the latent function more strongly. This is analogous to heteroscedastic regression where some observations are more informative than others.
Posterior Marginals:
The marginal distribution for a single training point is:
$$q(f_i | \mathbf{y}, X) = \mathcal{N}(f_i | \hat{f}_i, \Sigma_{ii})$$
where $\Sigma = (K^{-1} + W)^{-1}$.
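To make the covariance identity concrete, here is a small numerical check (a standalone sketch on a random positive-definite matrix standing in for $K$; in practice $W$ and $L$ come from `find_laplace_mode`). It also shows how the marginal variances $\Sigma_{ii}$ can be obtained from the Cholesky factor of $B$ without ever forming $K^{-1}$:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(0)
n = 6

# A random positive-definite matrix standing in for the kernel matrix K,
# and a random diagonal W (note that pi*(1-pi) never exceeds 0.25).
A = rng.standard_normal((n, n))
K = A @ A.T + n * np.eye(n)
W = rng.uniform(0.05, 0.25, size=n)
sqrt_W = np.sqrt(W)

# Direct computation of the posterior covariance (fine only for tiny n).
Sigma_direct = np.linalg.inv(np.linalg.inv(K) + np.diag(W))

# Stable route: Sigma = K - K W^{1/2} B^{-1} W^{1/2} K, with B = I + W^{1/2} K W^{1/2}.
B = np.eye(n) + np.outer(sqrt_W, sqrt_W) * K
L = cholesky(B, lower=True)
V = solve_triangular(L, sqrt_W[:, None] * K, lower=True)  # L^{-1} W^{1/2} K
Sigma_stable = K - V.T @ V

print(np.allclose(Sigma_direct, Sigma_stable))      # True
print(np.all(np.diag(Sigma_stable) <= np.diag(K)))  # True: posterior variance <= prior variance
```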
Important Caveat:
The Laplace approximation is a local approximation—it only captures the behavior near the mode. If the true posterior is multimodal, strongly skewed, or heavy-tailed, the Laplace approximation will be inaccurate. For GP classification with logistic/probit likelihoods, the true posterior is often slightly skewed, making Laplace less accurate than Expectation Propagation in some cases.
With the Laplace posterior in hand, we can make predictions at new test points $\mathbf{x}_*$.
Step 1: Latent Predictive Distribution
The predictive distribution for the latent function value $f_*$ is found by marginalizing over the training latents:
$$p(f_* | \mathbf{y}, X, \mathbf{x}_*) \approx \int p(f_* | \mathbf{f}, X, \mathbf{x}_*) \, q(\mathbf{f} | \mathbf{y}, X) \, d\mathbf{f}$$
Both terms are Gaussian, so the integral is tractable. The result is:
$$p(f_* | \mathbf{y}, X, \mathbf{x}_*) \approx \mathcal{N}(f_* | \mu_*, \sigma_*^2)$$
with:
$$\mu_* = m(\mathbf{x}_*) + \mathbf{k}_*^\top K^{-1}(\hat{\mathbf{f}} - \mathbf{m}) = m(\mathbf{x}_*) + \mathbf{k}_*^\top(\mathbf{y} - \hat{\boldsymbol{\pi}})$$
where the second form uses $\hat{\mathbf{f}} = \mathbf{m} + K(\mathbf{y} - \hat{\boldsymbol{\pi}})$.
$$\sigma_*^2 = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^\top (K + W^{-1})^{-1} \mathbf{k}_*$$
Step 2: Class Probability Prediction
The probability of class 1 is:
$$\pi_* = p(y_* = 1 | \mathbf{y}, X, \mathbf{x}_*) = \int \sigma(f_*) \, \mathcal{N}(f_* | \mu_*, \sigma_*^2) \, df_*$$
For the logistic likelihood, this integral has no closed form. Common approximations:
Point prediction: $\pi_* \approx \sigma(\mu_*)$ (ignores uncertainty)
Probit approximation: $\pi_* \approx \sigma(\kappa(\sigma_*^2) \cdot \mu_*)$ where $\kappa(v) = (1 + \pi v / 8)^{-1/2}$
Monte Carlo: Sample $f_*^{(s)} \sim \mathcal{N}(\mu_*, \sigma_*^2)$, compute $\pi_* \approx \frac{1}{S}\sum_s \sigma(f_*^{(s)})$
For the probit likelihood, there's a closed form:
$$\pi_* = \Phi\left(\frac{\mu_*}{\sqrt{1 + \sigma_*^2}}\right)$$
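A sketch of this closed form (assuming a probit likelihood was used for training as well; `mu_star` and `sigma_star_sq` denote the latent predictive mean and variance computed above):

```python
import numpy as np
from scipy.stats import norm

def probit_predictive(mu_star, sigma_star_sq):
    """Closed-form class-1 probability under a probit likelihood:
    pi_* = Phi(mu_* / sqrt(1 + sigma_*^2))."""
    return norm.cdf(mu_star / np.sqrt(1.0 + sigma_star_sq))

# The same latent mean gets pulled toward 0.5 as the predictive variance grows.
print(probit_predictive(2.0, 0.1))   # ~0.97
print(probit_predictive(2.0, 10.0))  # ~0.73
```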
Never use the point prediction σ(μ*) in safety-critical applications! It ignores predictive uncertainty and yields overconfident predictions far from training data. The full integral (approximated by probit or Monte Carlo) properly attenuates predictions toward 0.5 when uncertainty is high.
```python
import numpy as np
from scipy.linalg import solve_triangular


def laplace_predict(X_train, y_train, X_test, f_hat, W, L, K, kernel_fn,
                    m=None, method='probit_approx'):
    """
    Make predictions using the Laplace approximation.

    Parameters:
    -----------
    X_train, y_train : training data
    X_test : test points (n_test, d)
    f_hat : posterior mode
    W : Hessian diagonal at mode
    L : Cholesky factor of B
    K : training kernel matrix
    kernel_fn : function(X1, X2) -> kernel matrix
    m : prior mean (default zero)
    method : 'point', 'probit_approx', or 'monte_carlo'

    Returns:
    --------
    pi_star : predicted probabilities
    mu_star : latent mean
    sigma_star : latent std
    """
    n_train = len(y_train)
    n_test = X_test.shape[0]
    if m is None:
        m = np.zeros(n_train)

    # Compute test kernel matrices
    K_star = kernel_fn(X_test, X_train)      # (n_test, n_train)
    K_star_star = kernel_fn(X_test, X_test)  # (n_test, n_test) - only the diagonal is needed

    # Predictive mean
    # μ* = m(x*) + k*ᵀ(y - π̂)
    pi_hat = sigmoid(f_hat)
    mu_star = K_star @ (y_train - pi_hat)    # Assuming m(x*) = 0

    # Predictive variance
    # σ*² = k(x*, x*) - k*ᵀ(K + W⁻¹)⁻¹k*
    # Using: (K + W⁻¹)⁻¹ = W^{1/2} B⁻¹ W^{1/2} where B = I + W^{1/2}KW^{1/2}
    sqrt_W = np.sqrt(W)

    # v = L⁻¹ W^{1/2} k* for each test point
    v = solve_triangular(L, sqrt_W[:, None] * K_star.T, lower=True)  # (n_train, n_test)

    # Variance reduction: k*ᵀ W^{1/2} B⁻¹ W^{1/2} k* = ||v||²
    var_reduction = np.sum(v**2, axis=0)

    # Prior variance (diagonal only, for efficiency)
    prior_var = np.diag(K_star_star) if K_star_star.ndim > 1 else K_star_star
    sigma_star_sq = prior_var - var_reduction
    sigma_star = np.sqrt(np.maximum(sigma_star_sq, 1e-10))

    # Class probability
    if method == 'point':
        # Simple but overconfident
        pi_star = sigmoid(mu_star)
    elif method == 'probit_approx':
        # Approximate ∫σ(f)N(f|μ,σ²)df ≈ σ(κμ) where κ = 1/√(1 + πσ²/8)
        kappa = 1.0 / np.sqrt(1 + np.pi * sigma_star_sq / 8)
        pi_star = sigmoid(kappa * mu_star)
    elif method == 'monte_carlo':
        # Monte Carlo integration
        n_samples = 1000
        samples = np.random.randn(n_samples, n_test) * sigma_star + mu_star
        pi_star = np.mean(sigmoid(samples), axis=0)
    else:
        raise ValueError(f"Unknown method: {method}")

    return pi_star, mu_star, sigma_star


def sigmoid(f):
    """Numerically stable sigmoid."""
    return np.where(f >= 0, 1 / (1 + np.exp(-f)), np.exp(f) / (1 + np.exp(f)))
```

Hyperparameter optimization in GP classification requires the marginal likelihood $p(\mathbf{y} | X, \boldsymbol{\theta})$, which is intractable. The Laplace approximation also provides an approximation to this quantity.
The Derivation:
The marginal likelihood is:
$$p(\mathbf{y} | X) = \int p(\mathbf{y} | \mathbf{f}) \, p(\mathbf{f} | X) \, d\mathbf{f}$$
Using the Laplace approximation:
$$\log p(\mathbf{y} | X) \approx \Psi(\hat{\mathbf{f}}) + \frac{1}{2}\log|2\pi (K^{-1} + W)^{-1}|$$
Expanding $\Psi(\hat{\mathbf{f}})$:
$$\log p(\mathbf{y} | X) \approx \underbrace{\sum_{i=1}^n \log p(y_i | \hat{f}_i)}_{\text{log-likelihood at the mode}} \underbrace{- \frac{1}{2}\hat{\mathbf{f}}^\top K^{-1} \hat{\mathbf{f}}}_{\text{GP prior penalty}} - \frac{1}{2}\log|B|$$
where $B = I + W^{1/2} K W^{1/2}$ and we've assumed zero prior mean. The $\frac{1}{2}\log|K|$ and $\frac{n}{2}\log 2\pi$ terms from the prior normalization cancel exactly against $\frac{1}{2}\log|2\pi (K^{-1} + W)^{-1}|$, since $|K^{-1} + W| = |B| / |K|$.
Efficient Computation:
The log-determinant term is computed via the Cholesky factor:
$$\log|B| = 2 \sum_{i=1}^n \log L_{ii}$$
where $L$ is the lower Cholesky factor of $B$.
Final Formula:
$$\log q(\mathbf{y} | X, \boldsymbol{\theta}) = -\frac{1}{2}\hat{\mathbf{f}}^\top K^{-1}\hat{\mathbf{f}} + \sum_i \log p(y_i | \hat{f}_i) - \frac{1}{2}\log|B|$$
For logistic likelihood: $$\log p(y_i | \hat{f}_i) = -\log(1 + \exp(-y_i' \hat{f}_i))$$ where $y_i' \in \{-1, +1\}$ are the recoded labels.
For gradient-based optimization, we need ∂log q(y|X,θ)/∂θ. This involves the implicit dependence of f̂ on θ (through the Newton iteration). The full derivation uses implicit differentiation and is implemented in libraries like GPflow and GPyTorch. The complexity is O(n³) per gradient evaluation.
```python
import numpy as np
from scipy.linalg import cholesky


def laplace_log_marginal_likelihood(K, y, f_hat, W, L):
    """
    Compute the Laplace approximation to the log marginal likelihood.

    Parameters:
    -----------
    K : kernel matrix (n, n)
    y : labels in {0, 1}
    f_hat : posterior mode
    W : Hessian diagonal at mode
    L : Cholesky factor of B = I + W^{1/2} K W^{1/2}

    Returns:
    --------
    log_ml : approximate log marginal likelihood
    """
    n = len(y)

    # Convert labels to {-1, +1} for the cleaner likelihood formula
    y_pm = 2 * y - 1  # {0,1} -> {-1,+1}

    # Log-likelihood at the mode: sum of -log(1 + exp(-y*f)), computed stably
    log_lik = -np.sum(np.logaddexp(0.0, -y_pm * f_hat))

    # Prior penalty: -1/2 * f^T K^{-1} f, computed via a Cholesky factor of K
    L_K = cholesky(K + 1e-8 * np.eye(n), lower=True)
    alpha = np.linalg.solve(L_K.T, np.linalg.solve(L_K, f_hat))
    prior_penalty = -0.5 * np.dot(f_hat, alpha)

    # Log-determinant of B (from its Cholesky factor)
    log_det_B = 2 * np.sum(np.log(np.diag(L)))

    # Approximate log marginal likelihood
    log_ml = log_lik + prior_penalty - 0.5 * log_det_B
    return log_ml


# Note: the -n/2 * log(2π) and -1/2 * log|K| terms from the GP prior cancel
# exactly against the Gaussian normalization of the Laplace integral, so no
# additional constant terms are needed here.
```

Let's consolidate everything into the complete algorithm for Gaussian Process Classification with Laplace approximation.
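A minimal end-to-end sketch tying the pieces together, assuming the `rbf_kernel`, `find_laplace_mode`, `laplace_predict`, and `laplace_log_marginal_likelihood` functions defined above are in scope. Hyperparameters are selected here by a simple grid search over the approximate log marginal likelihood, standing in for the gradient-based optimization mentioned earlier:

```python
import numpy as np

# Toy 1-D data: class 1 whenever x > 0.
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 1))
y = (X[:, 0] > 0).astype(float)
X_test = np.linspace(-3, 3, 7).reshape(-1, 1)

# Grid search over kernel hyperparameters using the Laplace log marginal likelihood.
best = None
for lengthscale in [0.3, 1.0, 3.0]:
    for variance in [1.0, 5.0]:
        K = rbf_kernel(X, X, lengthscale=lengthscale, variance=variance)
        f_hat, W, L = find_laplace_mode(K, y)
        log_ml = laplace_log_marginal_likelihood(K, y, f_hat, W, L)
        if best is None or log_ml > best[0]:
            best = (log_ml, lengthscale, variance, K, f_hat, W, L)

log_ml, lengthscale, variance, K, f_hat, W, L = best
print(f"Selected lengthscale={lengthscale}, variance={variance}, log ML={log_ml:.2f}")

# Predict with the selected hyperparameters.
kernel_fn = lambda A, B: rbf_kernel(A, B, lengthscale=lengthscale, variance=variance)
pi_star, mu_star, sigma_star = laplace_predict(
    X, y, X_test, f_hat, W, L, K, kernel_fn, method='probit_approx')
print(np.round(pi_star, 2))
```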
| Operation | Complexity | Notes |
|---|---|---|
| Kernel matrix computation | O(n²d) | n points, d dimensions |
| Newton iteration (each) | O(n³) | Cholesky factorization |
| Newton convergence | ~5-20 iterations | Typically fast for classification |
| Prediction (per test point) | O(n²) | Matrix-vector products |
| Marginal likelihood | O(n³) | Same as mode finding |
| Total training time | O(n³) | Dominated by Cholesky |
Initialize f at the prior mean (typically zero). For well-specified kernels, Newton's method usually converges in 5-10 iterations. If convergence is slow, check for numerical issues in the kernel matrix (add jitter: K + 10⁻⁶I) or reconsider hyperparameters.
When to Use Laplace vs. Other Methods:
| Scenario | Recommendation |
|---|---|
| Rapid prototyping | Laplace (simplest) |
| Best predictive accuracy | Expectation Propagation |
| Sparse/large-scale | Variational Inference |
| Gold standard comparison | MCMC sampling |
| Limited computational budget | Laplace |
| Well-separated classes | Laplace (all methods similar) |
| Noisy/overlapping classes | EP or VI preferred |
For training points where the latent function is uncertain (near the decision boundary), the true posterior is skewed—asymmetric around the mode. Laplace, being a symmetric Gaussian, cannot capture this. EP addresses this by matching moments rather than approximating at a single point.
The Laplace approximation is the foundation of practical GP classification, providing a simple yet effective approach to the intractable posterior.
What's Next:
The next page introduces Expectation Propagation (EP), a more sophisticated approximation that often outperforms Laplace by matching moments rather than approximating at a single point. EP is particularly effective when the posterior is skewed or when accurate predictive probabilities are crucial.
You now understand the Laplace approximation for GP classification—from finding the posterior mode via Newton's method, to making predictions with uncertainty, to approximating the marginal likelihood for hyperparameter learning. This is the workhorse of practical GP classification.