When faced with an intractable posterior, the most natural first approach is to find its mode and approximate the distribution around it. This is the essence of the Laplace approximation—a technique dating back to Pierre-Simon Laplace's 18th-century work on Bayesian inference, now the workhorse of practical GP classification.
The intuition is geometric: if the true posterior is unimodal and reasonably well-behaved, we can approximate it with a Gaussian centered at the mode. The Gaussian's covariance is determined by the curvature (Hessian) of the log-posterior at that point. For many distributions, this approximation is remarkably accurate—and for GP classification with logistic or probit likelihoods, it works particularly well due to the concavity of the log-posterior.
By the end of this page, you will understand: (1) The theoretical foundation of Laplace approximation, (2) How to find the posterior mode via Newton-Raphson optimization, (3) The complete algorithm for GP classification predictions, (4) How to approximate the marginal likelihood for hyperparameter learning, and (5) The strengths and limitations of the Laplace approach.
The Laplace approximation replaces an intractable distribution with a Gaussian by performing a second-order Taylor expansion of the log-density around its mode.
The Setup:
Given a target distribution $p(\mathbf{f}) \propto \exp(\Psi(\mathbf{f}))$, where $\Psi(\mathbf{f}) = \log \tilde{p}(\mathbf{f})$ is the log of the unnormalized density, we:
Find the mode: $\hat{\mathbf{f}} = \arg\max_{\mathbf{f}} \Psi(\mathbf{f})$
Taylor expand: Around $\hat{\mathbf{f}}$, approximate: $$\Psi(\mathbf{f}) \approx \Psi(\hat{\mathbf{f}}) + \underbrace{\nabla \Psi(\hat{\mathbf{f}})^\top (\mathbf{f} - \hat{\mathbf{f}})}_{= 0 \text{ at mode}} + \frac{1}{2}(\mathbf{f} - \hat{\mathbf{f}})^\top \nabla^2 \Psi(\hat{\mathbf{f}}) (\mathbf{f} - \hat{\mathbf{f}})$$
Recognize the Gaussian: The quadratic form in the exponent defines: $$q(\mathbf{f}) = \mathcal{N}(\mathbf{f} | \hat{\mathbf{f}}, A^{-1})$$
where $A = -\nabla^2 \Psi(\hat{\mathbf{f}})$ is the negative Hessian of the log-posterior (precision matrix).
Visualize the log-posterior as a landscape in n-dimensional space. The Laplace approximation finds the peak of this landscape, then fits a paraboloid (quadratic surface) tangent to the peak. Exponentiating this paraboloid gives a Gaussian. The approximation is exact if the true log-posterior is exactly quadratic (i.e., if the true posterior is Gaussian).
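As a minimal illustration of this recipe (independent of GPs), the sketch below applies it to a one-dimensional Gamma density, whose mode and log-density curvature are available in closed form. The variable names and the choice of target density are ours, for illustration only.

```python
import numpy as np

# Laplace approximation of a Gamma(shape=a, rate=b) density, a toy 1-D example.
# log p(f) = (a - 1) * log(f) - b * f + const, for f > 0 and a > 1.
a, b = 5.0, 2.0

# Step 1: find the mode by solving d/df log p(f) = (a - 1)/f - b = 0.
f_hat = (a - 1) / b

# Step 2: curvature of the log-density at the mode.
# d^2/df^2 log p(f) = -(a - 1)/f^2, so the precision is A = (a - 1)/f_hat^2.
A = (a - 1) / f_hat**2

# Step 3: the Laplace approximation is N(f_hat, 1/A).
print(f"Laplace approximation: N({f_hat:.3f}, {1 / A:.3f})")
print(f"True mean / variance of Gamma(a, b): {a / b:.3f} / {a / b**2:.3f}")
```

The approximation centers on the mode (2.0) rather than the mean (2.5) and slightly underestimates the variance, a typical pattern for skewed targets.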
Why It Works for GP Classification:
For GP classification, the log-posterior is:
$$\Psi(\mathbf{f}) = \log p(\mathbf{y} | \mathbf{f}) + \log p(\mathbf{f} | X) + \text{const}$$
$$= \sum_{i=1}^n \psi_i(f_i) - \frac{1}{2}(\mathbf{f} - \mathbf{m})^\top K^{-1} (\mathbf{f} - \mathbf{m}) + \text{const}$$
The GP prior contributes a quadratic term (exactly Gaussian). The likelihood contributes a sum of log-sigmoids. Since each $\psi_i$ is concave, $\Psi$ is concave—guaranteeing a unique global maximum.
Key Insight: The prior term is already quadratic, so the deviation from Gaussianity comes entirely from the likelihood. For data points where $|f_i|$ is large (confident predictions), the likelihood term is nearly linear, contributing little curvature. The non-Gaussianity is concentrated where predictions are uncertain.
The first step in Laplace approximation is finding the posterior mode $\hat{\mathbf{f}}$, the Maximum A Posteriori (MAP) estimate. We maximize:
$$\Psi(\mathbf{f}) = \sum_{i=1}^n \psi_i(f_i) - \frac{1}{2}(\mathbf{f} - \mathbf{m})^\top K^{-1} (\mathbf{f} - \mathbf{m})$$
Gradient:
$$\nabla \Psi = \nabla \boldsymbol{\psi} - K^{-1}(\mathbf{f} - \mathbf{m})$$
where $\nabla \boldsymbol{\psi} = [y_1 - \pi_1, ..., y_n - \pi_n]^\top$ with $\pi_i = \sigma(f_i)$.
Hessian:
$$\nabla^2 \Psi = -W - K^{-1}$$
where $W = \text{diag}(\pi_i(1-\pi_i))$ is the diagonal matrix of negative second derivatives of the log-likelihood.
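As a quick sanity check on these derivatives (an aside, not part of the main algorithm), the snippet below compares $y_i - \pi_i$ and $-\pi_i(1-\pi_i)$ against central finite differences of the Bernoulli log-likelihood $\psi_i(f_i)$:

```python
import numpy as np

def log_bernoulli(f, y):
    """psi_i(f) = y*log(sigma(f)) + (1 - y)*log(1 - sigma(f)),
    written stably as -log(1 + exp(-(2y - 1) * f)) for y in {0, 1}."""
    return -np.logaddexp(0.0, -(2 * y - 1) * f)

f, y, eps = 0.7, 1.0, 1e-4
pi = 1 / (1 + np.exp(-f))

# Analytic first and second derivatives of psi_i at f.
grad_analytic = y - pi
hess_analytic = -pi * (1 - pi)

# Central finite differences.
grad_fd = (log_bernoulli(f + eps, y) - log_bernoulli(f - eps, y)) / (2 * eps)
hess_fd = (log_bernoulli(f + eps, y) - 2 * log_bernoulli(f, y)
           + log_bernoulli(f - eps, y)) / eps**2

print(f"gradient: analytic {grad_analytic:.6f}, finite diff {grad_fd:.6f}")
print(f"hessian : analytic {hess_analytic:.6f}, finite diff {hess_fd:.6f}")
```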
Setting the gradient to zero gives the fixed-point equation:
$$\mathbf{f} = \mathbf{m} + K(\nabla \boldsymbol{\psi}) = \mathbf{m} + K(\mathbf{y} - \boldsymbol{\pi})$$
This is not closed-form because $\boldsymbol{\pi}$ depends on $\mathbf{f}$.
Newton-Raphson Iteration:
We solve for the mode using Newton's method. The update rule is:
$$\mathbf{f}^{(t+1)} = \mathbf{f}^{(t)} - (\nabla^2 \Psi)^{-1} \nabla \Psi$$
Substituting our expressions:
$$\mathbf{f}^{(t+1)} = \mathbf{f}^{(t)} + (W + K^{-1})^{-1}(\nabla \boldsymbol{\psi} - K^{-1}(\mathbf{f}^{(t)} - \mathbf{m}))$$
Substituting the gradient and collecting terms (taking the prior mean $\mathbf{m} = \mathbf{0}$ from here on; a nonzero mean simply adds $K^{-1}\mathbf{m}$ inside the final parentheses):
$$\mathbf{f}^{(t+1)} = (K^{-1} + W)^{-1}(W\mathbf{f}^{(t)} + \nabla \boldsymbol{\psi})$$
Defining $\mathbf{b} = W\mathbf{f}^{(t)} + \nabla \boldsymbol{\psi}$:
$$\mathbf{f}^{(t+1)} = (K^{-1} + W)^{-1}\mathbf{b}$$
Rather than inverting (K⁻¹ + W) directly, we use the matrix identity (K⁻¹ + W)⁻¹ = K - K W^{1/2} B⁻¹ W^{1/2} K, where B = I + W^{1/2} K W^{1/2}. Since W is diagonal, B is symmetric, well conditioned, and can be handled with a single Cholesky factorization per iteration. The per-iteration cost is O(n³) for the Cholesky decomposition, the same as standard GP regression.
```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular


def sigmoid(f):
    """Numerically stable sigmoid."""
    return np.where(f >= 0, 1 / (1 + np.exp(-f)), np.exp(f) / (1 + np.exp(f)))


def find_laplace_mode(K, y, m=None, max_iter=100, tol=1e-9):
    """
    Find the MAP estimate for GP classification using Newton-Raphson.

    Parameters:
    -----------
    K : ndarray (n, n)
        Kernel matrix
    y : ndarray (n,)
        Binary labels {0, 1}
    m : ndarray (n,) or None
        Prior mean (defaults to zero)
    max_iter : int
        Maximum Newton iterations
    tol : float
        Convergence tolerance

    Returns:
    --------
    f_hat : ndarray (n,)
        Mode of the posterior
    W : ndarray (n,)
        Diagonal of the negative likelihood Hessian at the mode
    L : ndarray (n, n)
        Cholesky factor of B = I + W^{1/2} K W^{1/2}
    """
    n = len(y)
    if m is None:
        m = np.zeros(n)
    f = m.copy()  # Initialize at the prior mean

    for iteration in range(max_iter):
        # Compute probabilities and likelihood derivatives
        pi = sigmoid(f)
        grad_psi = y - pi   # Gradient of the log-likelihood
        W = pi * (1 - pi)   # Diagonal of -Hessian of the log-likelihood

        # Numerically stable computation using Cholesky
        # B = I + W^{1/2} K W^{1/2}
        sqrt_W = np.sqrt(W)
        B = np.eye(n) + np.outer(sqrt_W, sqrt_W) * K
        L = cholesky(B, lower=True)

        # Newton step: f_new = (K^{-1} + W)^{-1} (W f + grad_psi + K^{-1} m),
        # rewritten as f_new = c - K W^{1/2} B^{-1} W^{1/2} c with c = K b + m
        b = W * f + grad_psi
        c = K @ b + m
        v = solve_triangular(L, sqrt_W * c, lower=True)
        f_new = c - K @ (sqrt_W * solve_triangular(L.T, v, lower=False))

        # Check convergence
        if np.max(np.abs(f_new - f)) < tol:
            print(f"Converged after {iteration + 1} iterations")
            f = f_new
            break
        f = f_new
    else:
        print(f"Warning: Did not converge after {max_iter} iterations")

    # Recompute final quantities at the mode
    pi = sigmoid(f)
    W = pi * (1 - pi)
    sqrt_W = np.sqrt(W)
    B = np.eye(n) + np.outer(sqrt_W, sqrt_W) * K
    L = cholesky(B, lower=True)

    return f, W, L


# Example usage
np.random.seed(42)
n = 50
X = np.random.randn(n, 1)
y = (X[:, 0] > 0).astype(float)  # Simple linearly separable problem


# RBF kernel
def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    sq_dist = np.sum((X1[:, None] - X2[None, :])**2, axis=-1)
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)


K = rbf_kernel(X, X)
f_hat, W, L = find_laplace_mode(K, y)
print(f"Mode found: f_hat in range [{f_hat.min():.2f}, {f_hat.max():.2f}]")
```

Once we have the mode $\hat{\mathbf{f}}$, the Laplace approximation to the posterior is:
$$q(\mathbf{f} | \mathbf{y}, X) = \mathcal{N}(\mathbf{f} | \hat{\mathbf{f}}, (K^{-1} + W)^{-1})$$
where $W = \text{diag}(\hat{\pi}_i(1-\hat{\pi}_i))$ is evaluated at the mode.
Interpreting the Covariance:
The posterior covariance $(K^{-1} + W)^{-1}$ has an intuitive interpretation:
The posterior precision (inverse covariance) is the sum of prior precision and likelihood precision—exactly as in Bayesian linear regression. This additive structure is a hallmark of Gaussian inference.
Using the matrix inversion lemma:
$$(K^{-1} + W)^{-1} = K - K(K + W^{-1})^{-1}K = K - K W^{1/2} B^{-1} W^{1/2} K$$
where $B = I + W^{1/2} K W^{1/2}$.
This shows the posterior covariance is always smaller than the prior covariance (in the positive-definite ordering), reflecting information gain from observations.
The diagonal elements W_ii = π_i(1-π_i) are maximized when π_i = 0.5 (uncertain prediction) and minimized when π_i ≈ 0 or 1 (confident prediction). Points where the model is uncertain contribute more 'information' to the posterior—they constrain the latent function more strongly. This is analogous to heteroscedastic regression where some observations are more informative than others.
Posterior Marginals:
The marginal distribution for a single training point is:
$$q(f_i | \mathbf{y}, X) = \mathcal{N}(f_i | \hat{f}_i, \Sigma_{ii})$$
where $\Sigma = (K^{-1} + W)^{-1}$.
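To make the covariance identity concrete, here is a small numerical check (a standalone sketch on a random positive-definite matrix standing in for $K$; in practice $W$ and $L$ come from `find_laplace_mode`). It also shows how the marginal variances $\Sigma_{ii}$ can be obtained from the Cholesky factor of $B$ without ever forming $K^{-1}$:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(0)
n = 6

# A random positive-definite matrix standing in for the kernel matrix K,
# and a random diagonal W (note that pi*(1-pi) never exceeds 0.25).
A = rng.standard_normal((n, n))
K = A @ A.T + n * np.eye(n)
W = rng.uniform(0.05, 0.25, size=n)
sqrt_W = np.sqrt(W)

# Direct computation of the posterior covariance (fine only for tiny n).
Sigma_direct = np.linalg.inv(np.linalg.inv(K) + np.diag(W))

# Stable route: Sigma = K - K W^{1/2} B^{-1} W^{1/2} K, with B = I + W^{1/2} K W^{1/2}.
B = np.eye(n) + np.outer(sqrt_W, sqrt_W) * K
L = cholesky(B, lower=True)
V = solve_triangular(L, sqrt_W[:, None] * K, lower=True)  # L^{-1} W^{1/2} K
Sigma_stable = K - V.T @ V

print(np.allclose(Sigma_direct, Sigma_stable))      # True
print(np.all(np.diag(Sigma_stable) <= np.diag(K)))  # True: posterior variance <= prior variance
```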
Important Caveat:
The Laplace approximation is a local approximation—it only captures the behavior near the mode. If the true posterior is multimodal, strongly skewed, or heavy-tailed, the Laplace approximation will be inaccurate. For GP classification with logistic/probit likelihoods, the true posterior is often slightly skewed, making Laplace less accurate than Expectation Propagation in some cases.
With the Laplace posterior in hand, we can make predictions at new test points $\mathbf{x}_*$.
Step 1: Latent Predictive Distribution
The predictive distribution for the latent function value $f_*$ is found by marginalizing over the training latents:
$$p(f_* | \mathbf{y}, X, \mathbf{x}_*) \approx \int p(f_* | \mathbf{f}, X, \mathbf{x}_*) \, q(\mathbf{f} | \mathbf{y}, X) \, d\mathbf{f}$$
Both terms are Gaussian, so the integral is tractable. The result is:
$$p(f_* | \mathbf{y}, X, \mathbf{x}_*) \approx \mathcal{N}(f_* | \mu_*, \sigma_*^2)$$
with:
$$\mu_* = m(\mathbf{x}_*) + \mathbf{k}_*^\top K^{-1}(\hat{\mathbf{f}} - \mathbf{m}) = m(\mathbf{x}_*) + \mathbf{k}_*^\top(\mathbf{y} - \hat{\boldsymbol{\pi}})$$
where the second form uses $\hat{\mathbf{f}} = \mathbf{m} + K(\mathbf{y} - \hat{\boldsymbol{\pi}})$.
$$\sigma_*^2 = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^\top (K + W^{-1})^{-1} \mathbf{k}_*$$
Step 2: Class Probability Prediction
The probability of class 1 is:
$$\pi_* = p(y_* = 1 | \mathbf{y}, X, \mathbf{x}_*) = \int \sigma(f_*) \, \mathcal{N}(f_* | \mu_*, \sigma_*^2) \, df_*$$
For the logistic likelihood, this integral has no closed form. Common approximations:
Point prediction: $\pi_* \approx \sigma(\mu_*)$ (ignores uncertainty)
Probit approximation: $\pi_* \approx \sigma(\kappa(\sigma_*^2) \cdot \mu_*)$ where $\kappa(v) = (1 + \pi v / 8)^{-1/2}$
Monte Carlo: Sample $f_*^{(s)} \sim \mathcal{N}(\mu_*, \sigma_*^2)$, compute $\pi_* \approx \frac{1}{S}\sum_s \sigma(f_*^{(s)})$
For the probit likelihood, there's a closed form:
$$\pi_* = \Phi\left(\frac{\mu_*}{\sqrt{1 + \sigma_*^2}}\right)$$
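A sketch of this closed form (assuming a probit likelihood was used for training as well; `mu_star` and `sigma_star_sq` denote the latent predictive mean and variance computed above):

```python
import numpy as np
from scipy.stats import norm

def probit_predictive(mu_star, sigma_star_sq):
    """Closed-form class-1 probability under a probit likelihood:
    pi_* = Phi(mu_* / sqrt(1 + sigma_*^2))."""
    return norm.cdf(mu_star / np.sqrt(1.0 + sigma_star_sq))

# The same latent mean gets pulled toward 0.5 as the predictive variance grows.
print(probit_predictive(2.0, 0.1))   # ~0.97
print(probit_predictive(2.0, 10.0))  # ~0.73
```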
Never use the point prediction σ(μ*) in safety-critical applications! It ignores predictive uncertainty and yields overconfident predictions far from training data. The full integral (approximated by probit or Monte Carlo) properly attenuates predictions toward 0.5 when uncertainty is high.
```python
import numpy as np
from scipy.linalg import solve_triangular


def laplace_predict(X_train, y_train, X_test, f_hat, W, L, K, kernel_fn,
                    m=None, method='probit_approx'):
    """
    Make predictions using the Laplace approximation.

    Parameters:
    -----------
    X_train, y_train : training data
    X_test : test points (n_test, d)
    f_hat : posterior mode
    W : Hessian diagonal at mode
    L : Cholesky factor of B
    K : training kernel matrix
    kernel_fn : function(X1, X2) -> kernel matrix
    m : prior mean (default zero)
    method : 'point', 'probit_approx', or 'monte_carlo'

    Returns:
    --------
    pi_star : predicted probabilities
    mu_star : latent mean
    sigma_star : latent std
    """
    n_train = len(y_train)
    n_test = X_test.shape[0]
    if m is None:
        m = np.zeros(n_train)

    # Compute test kernel matrices
    K_star = kernel_fn(X_test, X_train)      # (n_test, n_train)
    K_star_star = kernel_fn(X_test, X_test)  # (n_test, n_test) - only the diagonal is needed

    # Predictive mean
    # μ* = m(x*) + k*ᵀ(y - π̂)
    pi_hat = sigmoid(f_hat)
    mu_star = K_star @ (y_train - pi_hat)    # Assuming m(x*) = 0

    # Predictive variance
    # σ*² = k(x*, x*) - k*ᵀ(K + W⁻¹)⁻¹k*
    # Using: (K + W⁻¹)⁻¹ = W^{1/2} B⁻¹ W^{1/2} where B = I + W^{1/2}KW^{1/2}
    sqrt_W = np.sqrt(W)

    # v = L⁻¹ W^{1/2} k* for each test point
    v = solve_triangular(L, sqrt_W[:, None] * K_star.T, lower=True)  # (n_train, n_test)

    # Variance reduction: k*ᵀ W^{1/2} B⁻¹ W^{1/2} k* = ||v||²
    var_reduction = np.sum(v**2, axis=0)

    # Prior variance (diagonal only, for efficiency)
    prior_var = np.diag(K_star_star) if K_star_star.ndim > 1 else K_star_star
    sigma_star_sq = prior_var - var_reduction
    sigma_star = np.sqrt(np.maximum(sigma_star_sq, 1e-10))

    # Class probability
    if method == 'point':
        # Simple but overconfident
        pi_star = sigmoid(mu_star)
    elif method == 'probit_approx':
        # Approximate ∫σ(f)N(f|μ,σ²)df ≈ σ(κμ) where κ = 1/√(1 + πσ²/8)
        kappa = 1.0 / np.sqrt(1 + np.pi * sigma_star_sq / 8)
        pi_star = sigmoid(kappa * mu_star)
    elif method == 'monte_carlo':
        # Monte Carlo integration
        n_samples = 1000
        samples = np.random.randn(n_samples, n_test) * sigma_star + mu_star
        pi_star = np.mean(sigmoid(samples), axis=0)
    else:
        raise ValueError(f"Unknown method: {method}")

    return pi_star, mu_star, sigma_star


def sigmoid(f):
    """Numerically stable sigmoid."""
    return np.where(f >= 0, 1 / (1 + np.exp(-f)), np.exp(f) / (1 + np.exp(f)))
```

Hyperparameter optimization in GP classification requires the marginal likelihood $p(\mathbf{y} | X, \boldsymbol{\theta})$, which is intractable. The Laplace approximation also provides an approximation to this quantity.
The Derivation:
The marginal likelihood is:
$$p(\mathbf{y} | X) = \int p(\mathbf{y} | \mathbf{f}) \, p(\mathbf{f} | X) \, d\mathbf{f}$$
Using the Laplace approximation:
$$\log p(\mathbf{y} | X) \approx \Psi(\hat{\mathbf{f}}) + \frac{1}{2}\log|2\pi (K^{-1} + W)^{-1}|$$
Expanding $\Psi(\hat{\mathbf{f}})$:
$$\log p(\mathbf{y} | X) \approx \underbrace{\sum_{i=1}^n \log p(y_i | \hat{f}_i)}_{\text{log-likelihood at the mode}} \underbrace{- \frac{1}{2}\hat{\mathbf{f}}^\top K^{-1} \hat{\mathbf{f}}}_{\text{GP prior penalty}} - \frac{1}{2}\log|B|$$
where $B = I + W^{1/2} K W^{1/2}$ and we've assumed zero prior mean. The $\frac{1}{2}\log|K|$ and $\frac{n}{2}\log 2\pi$ terms from the prior normalization cancel exactly against $\frac{1}{2}\log|2\pi (K^{-1} + W)^{-1}|$, since $|K^{-1} + W| = |B| / |K|$.
Efficient Computation:
The log-determinant term is computed via the Cholesky factor:
$$\log|B| = 2 \sum_{i=1}^n \log L_{ii}$$
where $L$ is the lower Cholesky factor of $B$.
Final Formula:
$$\log q(\mathbf{y} | X, \boldsymbol{\theta}) = -\frac{1}{2}\hat{\mathbf{f}}^\top K^{-1}\hat{\mathbf{f}} + \sum_i \log p(y_i | \hat{f}_i) - \frac{1}{2}\log|B|$$
For logistic likelihood: $$\log p(y_i | \hat{f}_i) = -\log(1 + \exp(-y_i' \hat{f}_i))$$ where $y_i' \in \{-1, +1\}$ are the recoded labels.
For gradient-based optimization, we need ∂log q(y|X,θ)/∂θ. This involves the implicit dependence of f̂ on θ (through the Newton iteration). The full derivation uses implicit differentiation and is implemented in libraries like GPflow and GPyTorch. The complexity is O(n³) per gradient evaluation.
```python
import numpy as np
from scipy.linalg import cholesky


def laplace_log_marginal_likelihood(K, y, f_hat, W, L):
    """
    Compute the Laplace approximation to the log marginal likelihood.

    Parameters:
    -----------
    K : kernel matrix (n, n)
    y : labels in {0, 1}
    f_hat : posterior mode
    W : Hessian diagonal at mode
    L : Cholesky factor of B = I + W^{1/2} K W^{1/2}

    Returns:
    --------
    log_ml : approximate log marginal likelihood
    """
    n = len(y)

    # Convert labels to {-1, +1} for the cleaner likelihood formula
    y_pm = 2 * y - 1  # {0,1} -> {-1,+1}

    # Log-likelihood at the mode: sum of -log(1 + exp(-y*f)), computed stably
    log_lik = -np.sum(np.logaddexp(0.0, -y_pm * f_hat))

    # Prior penalty: -1/2 * f^T K^{-1} f, computed via a Cholesky factor of K
    L_K = cholesky(K + 1e-8 * np.eye(n), lower=True)
    alpha = np.linalg.solve(L_K.T, np.linalg.solve(L_K, f_hat))
    prior_penalty = -0.5 * np.dot(f_hat, alpha)

    # Log-determinant of B (from its Cholesky factor)
    log_det_B = 2 * np.sum(np.log(np.diag(L)))

    # Approximate log marginal likelihood
    log_ml = log_lik + prior_penalty - 0.5 * log_det_B
    return log_ml


# Note: the -n/2 * log(2π) and -1/2 * log|K| terms from the GP prior cancel
# exactly against the Gaussian normalization of the Laplace integral, so no
# additional constant terms are needed here.
```

Let's consolidate everything into the complete algorithm for Gaussian Process Classification with Laplace approximation.
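A minimal end-to-end sketch tying the pieces together, assuming the `rbf_kernel`, `find_laplace_mode`, `laplace_predict`, and `laplace_log_marginal_likelihood` functions defined above are in scope. Hyperparameters are selected here by a simple grid search over the approximate log marginal likelihood, standing in for the gradient-based optimization mentioned earlier:

```python
import numpy as np

# Toy 1-D data: class 1 whenever x > 0.
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 1))
y = (X[:, 0] > 0).astype(float)
X_test = np.linspace(-3, 3, 7).reshape(-1, 1)

# Grid search over kernel hyperparameters using the Laplace log marginal likelihood.
best = None
for lengthscale in [0.3, 1.0, 3.0]:
    for variance in [1.0, 5.0]:
        K = rbf_kernel(X, X, lengthscale=lengthscale, variance=variance)
        f_hat, W, L = find_laplace_mode(K, y)
        log_ml = laplace_log_marginal_likelihood(K, y, f_hat, W, L)
        if best is None or log_ml > best[0]:
            best = (log_ml, lengthscale, variance, K, f_hat, W, L)

log_ml, lengthscale, variance, K, f_hat, W, L = best
print(f"Selected lengthscale={lengthscale}, variance={variance}, log ML={log_ml:.2f}")

# Predict with the selected hyperparameters.
kernel_fn = lambda A, B: rbf_kernel(A, B, lengthscale=lengthscale, variance=variance)
pi_star, mu_star, sigma_star = laplace_predict(
    X, y, X_test, f_hat, W, L, K, kernel_fn, method='probit_approx')
print(np.round(pi_star, 2))
```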
| Operation | Complexity | Notes |
|---|---|---|
| Kernel matrix computation | O(n²d) | n points, d dimensions |
| Newton iteration (each) | O(n³) | Cholesky factorization |
| Newton convergence | ~5-20 iterations | Typically fast for classification |
| Prediction (per test point) | O(n²) | Matrix-vector products |
| Marginal likelihood | O(n³) | Same as mode finding |
| Total training time | O(n³) | Dominated by Cholesky |
Initialize f at the prior mean (typically zero). For well-specified kernels, Newton's method usually converges in 5-10 iterations. If convergence is slow, check for numerical issues in the kernel matrix (add jitter: K + 10⁻⁶I) or reconsider hyperparameters.
When to Use Laplace vs. Other Methods:
| Scenario | Recommendation |
|---|---|
| Rapid prototyping | Laplace (simplest) |
| Best predictive accuracy | Expectation Propagation |
| Sparse/large-scale | Variational Inference |
| Gold standard comparison | MCMC sampling |
| Limited computational budget | Laplace |
| Well-separated classes | Laplace (all methods similar) |
| Noisy/overlapping classes | EP or VI preferred |
For training points where the latent function is uncertain (near the decision boundary), the true posterior is skewed—asymmetric around the mode. Laplace, being a symmetric Gaussian, cannot capture this. EP addresses this by matching moments rather than approximating at a single point.
The Laplace approximation is the foundation of practical GP classification, providing a simple yet effective approach to the intractable posterior.
What's Next:
The next page introduces Expectation Propagation (EP), a more sophisticated approximation that often outperforms Laplace by matching moments rather than approximating at a single point. EP is particularly effective when the posterior is skewed or when accurate predictive probabilities are crucial.
You now understand the Laplace approximation for GP classification—from finding the posterior mode via Newton's method, to making predictions with uncertainty, to approximating the marginal likelihood for hyperparameter learning. This is the workhorse of practical GP classification.