At the heart of Bayesian inference lies a fundamental computational challenge: the posterior distribution p(θ|D) = p(D|θ)p(θ)/p(D) is often intractable. The marginal likelihood p(D) = ∫p(D|θ)p(θ)dθ requires integrating over the entire parameter space—an operation that is analytically impossible for most realistic models.
We've seen how sampling methods (MCMC, importance sampling) handle this by drawing samples from the posterior. But these methods can be slow, require careful tuning, and provide stochastic rather than deterministic answers. For many applications—especially those requiring fast predictions or online learning—we need deterministic approximations that provide closed-form, tractable representations of complex posteriors.
The Laplace approximation is the simplest and most elegant of these methods. It approximates any posterior with a Gaussian distribution centered at the posterior mode. Despite its simplicity, it remains surprisingly effective for many machine learning applications and forms the conceptual foundation for more sophisticated methods.
By the end of this page, you will understand how Laplace approximation works from first principles, derive the Gaussian approximation using Taylor expansion, develop geometric intuition for when the approximation is accurate, and recognize the method's fundamental assumptions and limitations.
The Laplace approximation is named after Pierre-Simon Laplace, who used it in the 18th century for asymptotic expansions of integrals. The core insight is deceptively simple:
Any smooth, peaked probability density can be locally approximated by a Gaussian near its mode.
This isn't just a heuristic—it emerges from the mathematics of Taylor expansion. When a probability density p(θ) has a single sharp peak at θ̂, the behavior near that peak dominates all integrals involving p(θ). If we can approximate this local behavior with a Gaussian, we obtain a tractable representation that captures the essential structure of the distribution.
The success of Laplace approximation is deeply connected to the Central Limit Theorem. As we observe more data, the posterior becomes increasingly concentrated (by the Bernstein-von Mises theorem), and its shape becomes increasingly Gaussian. The Laplace approximation is exact in the asymptotic limit of infinite data.
Why Gaussians are special:
Gaussians are the "natural" choice for approximating peaked distributions because:
Quadratic log-density: log p(θ) is a quadratic function of θ, which is exactly what a second-order Taylor expansion produces
Analytical tractability: All moments, marginals, and conditionals of Gaussians have closed forms
Maximum entropy: Among distributions with specified mean and covariance, the Gaussian has maximum entropy—it assumes nothing beyond these moments
Closure under marginalization: Marginals of Gaussians are Gaussian, enabling recursive approximation schemes
Let's derive the Laplace approximation rigorously. Consider a posterior distribution:
$$p(\theta|D) = \frac{p(D|\theta)p(\theta)}{p(D)} \propto p(D|\theta)p(\theta)$$
Define the negative log posterior (ignoring the normalization constant):
$$\Psi(\theta) = -\log p(D|\theta) - \log p(\theta)$$
The posterior can be written as:
$$p(\theta|D) \propto \exp(-\Psi(\theta))$$
Our goal is to approximate Ψ(θ) with a simpler function that yields a Gaussian when exponentiated.
Working in log-space is crucial. Probabilities are products of many terms; in log-space these become sums. Taylor expansion of sums is more numerically stable and conceptually cleaner than expansion of products.
Step 1: Find the mode
First, find the Maximum A Posteriori (MAP) estimate:
$$\hat{\theta} = \arg\min_\theta \Psi(\theta) = \arg\max_\theta p(\theta|D)$$
At the mode, the gradient vanishes:
$$\nabla\Psi(\hat{\theta}) = 0$$
Step 2: Second-order Taylor expansion
Expand Ψ(θ) around θ̂ using Taylor's theorem:
$$\Psi(\theta) \approx \Psi(\hat{\theta}) + \underbrace{\nabla\Psi(\hat{\theta})^T(\theta - \hat{\theta})}_{= 0 \text{ (at mode)}} + \frac{1}{2}(\theta - \hat{\theta})^T H (\theta - \hat{\theta})$$
where H is the Hessian matrix of Ψ evaluated at θ̂:
$$H = \nabla^2\Psi(\hat{\theta}) = -\nabla^2\log p(D|\theta)\big|_{\theta=\hat{\theta}} - \nabla^2\log p(\theta)\big|_{\theta=\hat{\theta}}$$
Step 3: Exponentiate to get the Gaussian
Substituting back:
$$p(\theta|D) \propto \exp(-\Psi(\theta)) \approx \exp\left(-\Psi(\hat{\theta}) - \frac{1}{2}(\theta - \hat{\theta})^T H (\theta - \hat{\theta})\right)$$
This is proportional to a Gaussian:
$$q(\theta) = \mathcal{N}(\theta | \hat{\theta}, H^{-1})$$
where the mean is the MAP estimate θ̂ and the covariance is the inverse Hessian Σ = H⁻¹.
The complete Laplace approximation:
$$p(\theta|D) \approx \mathcal{N}(\theta | \hat{\theta}, (\nabla^2\Psi(\hat{\theta}))^{-1})$$
For the approximation to be valid, the Hessian H must be positive definite at the mode (otherwise H⁻¹ is not a valid covariance matrix). This holds when θ̂ is a strict (non-degenerate) local maximum of the posterior, but not at saddle points, local minima, or flat ridges.
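As a minimal worked example (the Beta posterior here is an assumed illustration, as would arise from coin-flip data with a Beta prior), take p(θ|D) = Beta(θ | α, β) with α, β > 1:

$$\Psi(\theta) = -(\alpha - 1)\log\theta - (\beta - 1)\log(1 - \theta) + \text{const}, \qquad \hat{\theta} = \frac{\alpha - 1}{\alpha + \beta - 2}$$

$$H = \Psi''(\hat{\theta}) = \frac{\alpha - 1}{\hat{\theta}^2} + \frac{\beta - 1}{(1 - \hat{\theta})^2} > 0, \qquad p(\theta|D) \approx \mathcal{N}\left(\theta \mid \hat{\theta}, H^{-1}\right)$$

For α = β = 2, the mode is θ̂ = 0.5 and H = 8, so the Laplace approximation is N(0.5, 1/8); as α and β grow, the Beta density sharpens and the Gaussian approximation becomes increasingly accurate.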
The practical challenge in Laplace approximation is computing the Hessian. For a model with d parameters, the Hessian is a d × d matrix containing all second-order partial derivatives:
$$H_{ij} = \frac{\partial^2 \Psi}{\partial \theta_i \partial \theta_j}$$
The Hessian decomposes into contributions from likelihood and prior:
$$H = -\nabla^2\log p(D|\theta) - \nabla^2\log p(\theta)$$
| Model | Likelihood Hessian | Prior Hessian | Total Complexity |
|---|---|---|---|
| Linear Regression | XᵀX (data outer product) | λI (if Gaussian prior) | O(nd² + d³) |
| Logistic Regression | XᵀWX (W = diag weights) | λI (if L2 prior) | O(nd² + d³) |
| Neural Networks | Fisher Information Matrix | Prior-dependent | O(nd²) per sample |
| Gaussian Processes | W (likelihood curvature, diagonal) | K⁻¹ (kernel inverse) | O(n³) |
The Fisher Information Matrix:
For many models, the Hessian of the negative log-likelihood equals the Fisher Information Matrix:
$$\mathcal{I}(\theta) = -\mathbb{E}[\nabla^2\log p(D|\theta)]$$
The Fisher Information measures the curvature of the log-likelihood, which determines how much information the data provides about θ. Higher curvature means more concentrated (less uncertain) posteriors.
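As a quick illustration (an assumed example, not taken from the table above): for n independent Bernoulli(p) observations, the Fisher information is

$$\mathcal{I}(p) = n\,\mathbb{E}\left[-\frac{\partial^2}{\partial p^2}\log p(y|p)\right] = n\left(\frac{1}{p} + \frac{1}{1-p}\right) = \frac{n}{p(1-p)}$$

so the Laplace posterior variance for p scales like p̂(1 − p̂)/n: more data, or a success probability near 0 or 1, means higher curvature and a tighter posterior.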
Observed vs. Expected Fisher Information:
The observed Fisher information is the Hessian of the negative log-likelihood evaluated at the actual data; the expected Fisher information averages that Hessian over datasets drawn from the model. For large datasets the two converge. The observed Fisher is preferred when available because it uses the actual data rather than expectations.
```python
import numpy as np
from scipy.optimize import minimize
from scipy.linalg import cho_solve, cholesky


def laplace_approximation(neg_log_posterior, theta_init, compute_hessian=None):
    """
    Compute the Laplace approximation to a posterior distribution.

    Parameters
    ----------
    neg_log_posterior : callable
        Function Ψ(θ) = -log p(θ|D) (unnormalized is fine).
    theta_init : ndarray
        Initial guess for the optimization.
    compute_hessian : callable, optional
        Function returning the Hessian at a given θ. If None, a numerical
        approximation is used.

    Returns
    -------
    mean : ndarray
        MAP estimate (mode of the posterior).
    covariance : ndarray
        Inverse Hessian (approximate posterior covariance).
    """
    # Step 1: Find the MAP estimate (mode)
    result = minimize(neg_log_posterior, theta_init, method='L-BFGS-B')
    theta_map = result.x
    if not result.success:
        raise ValueError(f"Optimization failed: {result.message}")

    # Step 2: Compute the Hessian at the mode
    if compute_hessian is not None:
        H = compute_hessian(theta_map)
    else:
        H = numerical_hessian(neg_log_posterior, theta_map)

    # Step 3: Invert the Hessian to get the covariance.
    # The Cholesky factorization both checks positive definiteness and gives
    # a numerically stable inverse.
    try:
        L = cholesky(H, lower=True)
    except np.linalg.LinAlgError:
        raise ValueError("Hessian not positive definite at mode")
    covariance = cho_solve((L, True), np.eye(len(theta_map)))

    return theta_map, covariance


def numerical_hessian(f, x, eps=1e-5):
    """Compute the Hessian via central differences."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            # Central difference for mixed partials
            x_pp = x.copy(); x_pp[i] += eps; x_pp[j] += eps
            x_pm = x.copy(); x_pm[i] += eps; x_pm[j] -= eps
            x_mp = x.copy(); x_mp[i] -= eps; x_mp[j] += eps
            x_mm = x.copy(); x_mm[i] -= eps; x_mm[j] -= eps
            H[i, j] = (f(x_pp) - f(x_pm) - f(x_mp) + f(x_mm)) / (4 * eps * eps)
            H[j, i] = H[i, j]  # Symmetry
    return H


# Example: Bayesian Logistic Regression
def logistic_neg_log_posterior(w, X, y, prior_variance=1.0):
    """Negative log posterior for logistic regression."""
    logits = X @ w
    # Log-likelihood (numerically stable)
    log_likelihood = np.sum(y * logits - np.logaddexp(0, logits))
    # Gaussian prior
    log_prior = -0.5 * np.sum(w**2) / prior_variance
    return -(log_likelihood + log_prior)


def logistic_hessian(w, X, y, prior_variance=1.0):
    """Exact Hessian of the negative log posterior for logistic regression."""
    logits = X @ w
    probs = 1 / (1 + np.exp(-logits))
    # Diagonal weight matrix W = p(1 - p)
    W = np.diag(probs * (1 - probs))
    # Hessian = X^T W X + prior precision
    H = X.T @ W @ X + np.eye(len(w)) / prior_variance
    return H
```
The Laplace approximation has an elegant geometric interpretation. Imagine the negative log posterior Ψ(θ) as an "energy landscape" where valleys correspond to high-probability regions.
The approximation process: locate the deepest valley (the mode θ̂), measure how sharply the landscape curves away from it, and replace the valley with a quadratic bowl; exponentiating that bowl yields the Gaussian approximation.
The Hessian captures the curvature of this landscape: steep curvature in a direction means the posterior falls off quickly, so the Gaussian assigns small variance along that direction, while shallow curvature corresponds to a flat, uncertain direction with large variance.
The eigenvalue interpretation:
The covariance matrix Σ = H⁻¹ has spectral decomposition:
$$\Sigma = V \Lambda V^T$$
where V contains eigenvectors (principal directions) and Λ contains eigenvalues (variances along each direction).
A large condition number (κ = λ_max/λ_min >> 1) signals potential numerical issues and suggests the parameters live on very different scales.
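The following minimal sketch (the Hessian values are made up purely for illustration) shows how to read off the principal directions, the per-direction standard deviations, and the condition number from a Laplace covariance:

```python
import numpy as np

# Example Hessian at the mode (values chosen only for illustration)
H = np.array([[50.0, 10.0],
              [10.0,  4.0]])
Sigma = np.linalg.inv(H)  # Laplace covariance

# Spectral decomposition Sigma = V Lambda V^T
eigvals, eigvecs = np.linalg.eigh(Sigma)
std_devs = np.sqrt(eigvals)                       # 1-sigma axis lengths of the Gaussian ellipse
condition_number = eigvals.max() / eigvals.min()  # kappa = lambda_max / lambda_min

print("principal directions (columns):\n", eigvecs)
print("standard deviations along each direction:", std_devs)
print("condition number:", condition_number)
```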
Beyond approximating the posterior, Laplace provides an approximation to the marginal likelihood (model evidence):
$$p(D) = \int p(D|\theta)p(\theta)d\theta$$
This integral is typically intractable, but Laplace approximation yields:
$$\log p(D) \approx \log p(D|\hat{\theta}) + \log p(\hat{\theta}) + \frac{d}{2}\log(2\pi) - \frac{1}{2}\log|H|$$
where d is the number of parameters.
The Bayesian Information Criterion (BIC) is essentially a simplified Laplace approximation where the Hessian is replaced by n·I (n = sample size). BIC ≈ -2 log p(D|θ̂) + d log n penalizes model complexity through the parameter count d.
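To see the connection, a sketch under the stated simplification: write H ≈ n·I₀ for a fixed per-observation curvature matrix I₀, so that log|H| = d log n + log|I₀|, and absorb every term that does not grow with n into O(1):

$$\log p(D) \approx \log p(D|\hat{\theta}) - \frac{d}{2}\log n + O(1) \quad\Longrightarrow\quad -2\log p(D) \approx -2\log p(D|\hat{\theta}) + d\log n$$

which is exactly the BIC expression above.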
Decomposition of the marginal likelihood:
$$\log p(D) \approx \underbrace{\log p(D|\hat{\theta})}_{\text{fit to data}} + \underbrace{\log p(\hat{\theta})}_{\text{prior consistency}} + \underbrace{\frac{d}{2}\log(2\pi) - \frac{1}{2}\log|H|}_{\text{Occam factor}}$$
The Occam factor automatically penalizes model complexity: a flexible model spreads its prior over a large parameter space, so the prior density at the mode is small, and the data force its posterior into a narrow region (large |H|), making the −½ log|H| term strongly negative. Both effects shrink the evidence unless the improved fit compensates.
This provides a principled trade-off between model fit and complexity, implementing Bayesian Occam's razor.
Computing log|H|:
The log-determinant can be computed efficiently via Cholesky decomposition:
$$\log|H| = 2\sum_{i=1}^d \log L_{ii}$$
where H = LLᵀ is the Cholesky factorization. This is O(d³) but numerically stable.
```python
import numpy as np
from scipy.linalg import cholesky


def laplace_log_marginal_likelihood(theta_map, X, y, neg_log_likelihood,
                                    log_prior, hessian):
    """
    Compute the Laplace approximation to the log marginal likelihood.

    Parameters
    ----------
    theta_map : ndarray
        MAP estimate.
    neg_log_likelihood : callable
        Returns -log p(D|θ).
    log_prior : callable
        Returns log p(θ).
    hessian : ndarray
        Hessian of the negative log posterior at the MAP.

    Returns
    -------
    log_ml : float
        Log marginal likelihood approximation.
    """
    d = len(theta_map)

    # Log-likelihood at the MAP
    log_lik = -neg_log_likelihood(theta_map, X, y)

    # Log prior at the MAP
    log_p = log_prior(theta_map)

    # Occam factor via Cholesky
    try:
        L = cholesky(hessian, lower=True)
        log_det_H = 2 * np.sum(np.log(np.diag(L)))
    except np.linalg.LinAlgError:
        raise ValueError("Hessian not positive definite")

    # Laplace approximation
    log_ml = (log_lik + log_p
              + (d / 2) * np.log(2 * np.pi)
              - 0.5 * log_det_H)
    return log_ml


def compare_models_laplace(models, X, y):
    """
    Compare models using the Laplace-approximated marginal likelihood.

    Parameters
    ----------
    models : list of dict
        Each dict has 'name', 'neg_log_post', 'neg_log_lik', 'log_prior',
        'hessian_fn', and 'init'.

    Returns
    -------
    results : dict
        Model names with log marginal likelihoods and posterior probabilities.
    """
    log_mls = []

    for model in models:
        # Fit the model (find the MAP)
        theta_map, cov = laplace_approximation(
            model['neg_log_post'],
            model['init'],
            compute_hessian=model['hessian_fn']
        )

        # Compute the marginal likelihood
        hessian = model['hessian_fn'](theta_map)
        log_ml = laplace_log_marginal_likelihood(
            theta_map, X, y,
            model['neg_log_lik'],
            model['log_prior'],
            hessian
        )
        log_mls.append((model['name'], log_ml))

    # Convert to posterior model probabilities (uniform model prior)
    log_mls_array = np.array([lm for _, lm in log_mls])
    log_mls_array -= np.max(log_mls_array)  # Numerical stability
    probs = np.exp(log_mls_array)
    probs /= np.sum(probs)

    return {name: {'log_ml': lm, 'prob': p}
            for (name, lm), p in zip(log_mls, probs)}
```
Implementing Laplace approximation in practice requires attention to several numerical and statistical details.
Scaling to high dimensions:
For models with many parameters (d >> 1000), storing and inverting the full Hessian becomes prohibitive. Several approximations help:
Diagonal approximation: Use only the diagonal of the Hessian, assuming parameter independence (a minimal sketch follows this list): $$\Sigma \approx \text{diag}(1/H_{11}, ..., 1/H_{dd})$$
Block-diagonal approximation: Capture correlations within groups (e.g., per-layer in neural networks) but not across groups
Low-rank approximation: Represent covariance as Σ ≈ D + VVᵀ where D is diagonal and V is low-rank
Kronecker-factored approximation (K-FAC): For neural networks, approximate Fisher as Kronecker product of smaller matrices
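A minimal sketch of the diagonal approximation, assuming only that the negative log posterior can be evaluated pointwise; the function and parameter names are illustrative, not part of any library:

```python
import numpy as np

def diagonal_laplace(neg_log_posterior, theta_map, eps=1e-5):
    """Approximate the posterior covariance by 1 / diag(H) at the mode."""
    d = len(theta_map)
    hess_diag = np.zeros(d)
    f0 = neg_log_posterior(theta_map)
    for i in range(d):
        e = np.zeros(d)
        e[i] = eps
        # Second-order central difference for the i-th diagonal entry of the Hessian
        hess_diag[i] = (neg_log_posterior(theta_map + e)
                        - 2 * f0
                        + neg_log_posterior(theta_map - e)) / eps**2
    if np.any(hess_diag <= 0):
        raise ValueError("Non-positive curvature along some coordinate at the mode")
    variances = 1.0 / hess_diag
    return variances  # diagonal of the approximate covariance
```

This stores d numbers instead of a d × d matrix, at the cost of discarding all parameter correlations.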
In very high dimensions, the posterior mode (MAP) may not be representative of typical samples. The mode sits at the peak of the density, but in high-d spaces, most of the probability mass is in a thin shell away from the peak. This is the 'typical set' phenomenon and represents a fundamental limitation of mode-based approximations.
Laplace approximation is widely used across machine learning, particularly in settings where speed or closed-form solutions are prioritized.
| Application | Why Laplace Works | Practical Notes |
|---|---|---|
| Bayesian Logistic Regression | Posterior is log-concave, unimodal | Exact Hessian available; very reliable |
| Gaussian Process Classification | Approximates non-Gaussian likelihood | Combined with EP or variational methods |
| Bayesian Neural Networks | Fast uncertainty after training | Works well with pre-trained MAP networks |
| Hyperparameter Optimization | Marginal likelihood for model selection | Enables gradient-based hyperparameter tuning |
| Online Learning | Recursive updating of Gaussian posterior | Posterior becomes prior for next batch |
Laplace for Neural Networks (LLA):
Recent work has revived Laplace approximation for deep learning through the Last-Layer Laplace Approximation: the network up to its penultimate layer is treated as a fixed, deterministic feature extractor, and the Laplace approximation is applied only to the weights of the final layer.
This provides calibrated uncertainty estimates with minimal computational overhead: the Hessian is only computed for the last layer, making it tractable even for large networks.
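The sketch below illustrates the idea for a binary classifier; it reuses the logistic_hessian helper from the earlier code block and assumes penultimate-layer features Phi and trained last-layer weights w_map are available. The probit-based predictive formula is one common way to average the sigmoid over a Gaussian, not a prescribed part of the method:

```python
import numpy as np

def last_layer_laplace(Phi, y, w_map, prior_variance=1.0):
    """Laplace approximation over last-layer weights, features Phi held fixed."""
    H = logistic_hessian(w_map, Phi, y, prior_variance)  # from the earlier code block
    cov = np.linalg.inv(H)
    return w_map, cov

def predictive_probability(phi_new, w_map, cov):
    """Approximate p(y=1 | phi_new) under the Gaussian over last-layer weights."""
    mean = phi_new @ w_map
    var = phi_new @ cov @ phi_new
    # Probit approximation to the Gaussian average of the sigmoid
    kappa = 1.0 / np.sqrt(1.0 + np.pi * var / 8.0)
    return 1.0 / (1.0 + np.exp(-kappa * mean))
```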
Laplace for continual learning:
Sequential Laplace approximation naturally supports continual learning: the Gaussian posterior obtained after one task (or data batch) becomes the prior for the next, so the quadratic penalty induced by the previous Hessian discourages parameters from drifting away from values that explained earlier data. A minimal sketch of this recursion follows.
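This sketch assumes the laplace_approximation routine defined earlier on this page and a per-batch negative log-likelihood neg_log_lik(θ, X, y); all names are illustrative:

```python
import numpy as np

def gaussian_neg_log_prior(theta, mean, cov_inv):
    """Quadratic penalty from a Gaussian prior N(mean, cov)."""
    diff = theta - mean
    return 0.5 * diff @ cov_inv @ diff

def sequential_laplace(batches, neg_log_lik, theta_init, prior_mean, prior_cov):
    """Recursive Laplace updates: each batch's posterior is the next batch's prior."""
    mean, cov = prior_mean, prior_cov
    for X_b, y_b in batches:
        cov_inv = np.linalg.inv(cov)
        neg_log_post = lambda th: (neg_log_lik(th, X_b, y_b)
                                   + gaussian_neg_log_prior(th, mean, cov_inv))
        mean, cov = laplace_approximation(neg_log_post, theta_init)
        theta_init = mean  # warm-start the next optimization at the current mode
    return mean, cov
```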
Understanding when Laplace approximation fails is as important as knowing how to apply it. Several fundamental limitations restrict its applicability: the approximation captures only a single mode, it cannot represent skewness or heavy tails, it requires a twice-differentiable log posterior with a positive-definite Hessian at the mode, and in high dimensions the mode may sit far from the bulk of the posterior mass.
Extensions to address limitations:
Mixture of Laplace approximations: Fit a Gaussian at each mode and combine them as a mixture (see the sketch after this list): $$q(\theta) = \sum_k \pi_k \mathcal{N}(\theta | \hat{\theta}_k, H_k^{-1})$$
Reparameterization: Transform parameters so posterior is more Gaussian (e.g., log-transform positive parameters)
Higher-order expansions: Include third/fourth-order terms for skewness and kurtosis correction (Edgeworth expansions)
Leave-one-out corrections: WAIC and LOO-CV can detect when Laplace marginal likelihood is unreliable
Variational refinement: Use Laplace as initialization for more flexible variational approximations
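A minimal sketch of the mixture extension listed above, reusing laplace_approximation from the earlier code block; the function and variable names are illustrative:

```python
import numpy as np

def mixture_of_laplace(neg_log_posterior, inits):
    """Build a mixture of Laplace components from several optimizer restarts."""
    components = []
    for theta0 in inits:
        mean, cov = laplace_approximation(neg_log_posterior, theta0)
        # Unnormalized log mixture weight: the Laplace estimate of the mass around
        # this mode, exp(-Psi(mode)) * (2*pi)^(d/2) * |cov|^(1/2)
        log_w = -neg_log_posterior(mean) + 0.5 * np.linalg.slogdet(2 * np.pi * cov)[1]
        components.append((mean, cov, log_w))

    # In practice, restarts that converged to the same mode should be merged first.
    log_ws = np.array([lw for _, _, lw in components])
    weights = np.exp(log_ws - log_ws.max())
    weights /= weights.sum()
    return [(m, c, w) for (m, c, _), w in zip(components, weights)]
```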
Laplace approximation excels when: (1) you need a fast approximation and can tolerate some error, (2) the posterior is expected to be approximately Gaussian (large data, log-concave), (3) you need marginal likelihood for model comparison, or (4) you're building a baseline before trying more sophisticated methods.
The Laplace approximation provides a remarkably simple yet powerful approach to approximate inference. By fitting a Gaussian to the posterior via second-order Taylor expansion, we obtain tractable representations that enable fast prediction, uncertainty quantification, and model comparison.
What's next:
Laplace approximation provides a single Gaussian centered at the mode, which may be too restrictive for complex posteriors. In the next page, we'll explore variational inference—a more flexible framework that optimizes over a family of distributions to find the best approximation, trading off fidelity and tractability through a principled objective function.
You now understand Laplace approximation from first principles—its derivation, geometric interpretation, and practical applications. This forms the foundation for understanding all deterministic approximation methods, which seek closed-form solutions to the intractable inference problem.