At the heart of Bayesian inference lies a fundamental computational challenge: the posterior distribution p(θ|D) = p(D|θ)p(θ)/p(D) is often intractable. The marginal likelihood p(D) = ∫p(D|θ)p(θ)dθ requires integrating over the entire parameter space—an operation that is analytically impossible for most realistic models.
We've seen how sampling methods (MCMC, importance sampling) handle this by drawing samples from the posterior. But these methods can be slow, require careful tuning, and provide stochastic rather than deterministic answers. For many applications—especially those requiring fast predictions or online learning—we need deterministic approximations that provide closed-form, tractable representations of complex posteriors.
The Laplace approximation is the simplest and most elegant of these methods. It approximates any posterior with a Gaussian distribution centered at the posterior mode. Despite its simplicity, it remains surprisingly effective for many machine learning applications and forms the conceptual foundation for more sophisticated methods.
By the end of this page, you will understand how Laplace approximation works from first principles, derive the Gaussian approximation using Taylor expansion, develop geometric intuition for when the approximation is accurate, and recognize the method's fundamental assumptions and limitations.
The Laplace approximation is named after Pierre-Simon Laplace, who used it in the 18th century for asymptotic expansions of integrals. The core insight is deceptively simple:
Any smooth, peaked probability density can be locally approximated by a Gaussian near its mode.
This isn't just a heuristic—it emerges from the mathematics of Taylor expansion. When a probability density p(θ) has a single sharp peak at θ̂, the behavior near that peak dominates all integrals involving p(θ). If we can approximate this local behavior with a Gaussian, we obtain a tractable representation that captures the essential structure of the distribution.
The success of Laplace approximation is deeply connected to the Central Limit Theorem. As we observe more data, the posterior becomes increasingly concentrated (by the Bernstein-von Mises theorem), and its shape becomes increasingly Gaussian. The Laplace approximation is exact in the asymptotic limit of infinite data.
Why Gaussians are special:
Gaussians are the "natural" choice for approximating peaked distributions because:
Quadratic log-density: log p(θ) is a quadratic function of θ, which is exactly what a second-order Taylor expansion produces
Analytical tractability: All moments, marginals, and conditionals of Gaussians have closed forms
Maximum entropy: Among distributions with specified mean and covariance, the Gaussian has maximum entropy—it assumes nothing beyond these moments
Closure under marginalization: Marginals of Gaussians are Gaussian, enabling recursive approximation schemes
Let's derive the Laplace approximation rigorously. Consider a posterior distribution:
$$p(\theta|D) = \frac{p(D|\theta)p(\theta)}{p(D)} \propto p(D|\theta)p(\theta)$$
Define the negative log posterior (ignoring the normalization constant):
$$\Psi(\theta) = -\log p(D|\theta) - \log p(\theta)$$
The posterior can be written as:
$$p(\theta|D) \propto \exp(-\Psi(\theta))$$
Our goal is to approximate Ψ(θ) with a simpler function that yields a Gaussian when exponentiated.
Working in log-space is crucial. Probabilities are products of many terms; in log-space these become sums. Taylor expansion of sums is more numerically stable and conceptually cleaner than expansion of products.
Step 1: Find the mode
First, find the Maximum A Posteriori (MAP) estimate:
$$\hat{\theta} = \arg\min_\theta \Psi(\theta) = \arg\max_\theta p(\theta|D)$$
At the mode, the gradient vanishes:
$$\nabla\Psi(\hat{\theta}) = 0$$
Step 2: Second-order Taylor expansion
Expand Ψ(θ) around θ̂ using Taylor's theorem:
$$\Psi(\theta) \approx \Psi(\hat{\theta}) + \underbrace{\nabla\Psi(\hat{\theta})^T(\theta - \hat{\theta})}_{= 0 \text{ (at mode)}} + \frac{1}{2}(\theta - \hat{\theta})^T H (\theta - \hat{\theta})$$
where H is the Hessian matrix of Ψ evaluated at θ̂:
$$H = \nabla^2\Psi(\hat{\theta}) = -\nabla^2\log p(D|\theta)\big|_{\theta=\hat{\theta}} - \nabla^2\log p(\theta)\big|_{\theta=\hat{\theta}}$$
Step 3: Exponentiate to get the Gaussian
Substituting back:
$$p(\theta|D) \propto \exp(-\Psi(\theta)) \approx \exp\left(-\Psi(\hat{\theta}) - \frac{1}{2}(\theta - \hat{\theta})^T H (\theta - \hat{\theta})\right)$$
This is proportional to a Gaussian:
$$q(\theta) = \mathcal{N}(\theta | \hat{\theta}, H^{-1})$$
where the mean is the MAP estimate θ̂ and the covariance is the inverse Hessian Σ = H⁻¹.
The complete Laplace approximation:
$$p(\theta|D) \approx \mathcal{N}(\theta | \hat{\theta}, (\nabla^2\Psi(\hat{\theta}))^{-1})$$
For the approximation to be valid, the Hessian H must be positive definite at the mode (otherwise H⁻¹ is not a valid covariance matrix). This holds when θ̂ is a strict (non-degenerate) local maximum of the posterior, but not at saddle points, local minima, or flat ridges.
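As a minimal worked example (the Beta posterior here is an assumed illustration, as would arise from coin-flip data with a Beta prior), take p(θ|D) = Beta(θ | α, β) with α, β > 1:

$$\Psi(\theta) = -(\alpha - 1)\log\theta - (\beta - 1)\log(1 - \theta) + \text{const}, \qquad \hat{\theta} = \frac{\alpha - 1}{\alpha + \beta - 2}$$

$$H = \Psi''(\hat{\theta}) = \frac{\alpha - 1}{\hat{\theta}^2} + \frac{\beta - 1}{(1 - \hat{\theta})^2} > 0, \qquad p(\theta|D) \approx \mathcal{N}\left(\theta \mid \hat{\theta}, H^{-1}\right)$$

For α = β = 2, the mode is θ̂ = 0.5 and H = 8, so the Laplace approximation is N(0.5, 1/8); as α and β grow, the Beta density sharpens and the Gaussian approximation becomes increasingly accurate.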
The practical challenge in Laplace approximation is computing the Hessian. For a model with d parameters, the Hessian is a d × d matrix containing all second-order partial derivatives:
$$H_{ij} = \frac{\partial^2 \Psi}{\partial \theta_i \partial \theta_j}$$
The Hessian decomposes into contributions from likelihood and prior:
$$H = -\nabla^2\log p(D|\theta) - \nabla^2\log p(\theta)$$
| Model | Likelihood Hessian | Prior Hessian | Total Complexity |
|---|---|---|---|
| Linear Regression | XᵀX (data outer product) | λI (if Gaussian prior) | O(nd² + d³) |
| Logistic Regression | XᵀWX (W = diag weights) | λI (if L2 prior) | O(nd² + d³) |
| Neural Networks | Fisher Information Matrix | Prior-dependent | O(nd²) per sample |
| Gaussian Processes | W (likelihood curvature, diagonal) | K⁻¹ (kernel inverse) | O(n³) |
The Fisher Information Matrix:
For many models, the Hessian of the negative log-likelihood equals the Fisher Information Matrix:
$$\mathcal{I}(\theta) = -\mathbb{E}[\nabla^2\log p(D|\theta)]$$
The Fisher Information measures the curvature of the log-likelihood, which determines how much information the data provides about θ. Higher curvature means more concentrated (less uncertain) posteriors.
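As a quick illustration (an assumed example, not taken from the table above): for n independent Bernoulli(p) observations, the Fisher information is

$$\mathcal{I}(p) = n\,\mathbb{E}\left[-\frac{\partial^2}{\partial p^2}\log p(y|p)\right] = n\left(\frac{1}{p} + \frac{1}{1-p}\right) = \frac{n}{p(1-p)}$$

so the Laplace posterior variance for p scales like p̂(1 − p̂)/n: more data, or a success probability near 0 or 1, means higher curvature and a tighter posterior.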
Observed vs. Expected Fisher Information:
The observed Fisher information is the Hessian of the negative log-likelihood evaluated at the actual data; the expected Fisher information averages that Hessian over datasets drawn from the model. For large datasets the two converge. The observed Fisher is preferred when available because it uses the actual data rather than expectations.
```python
import numpy as np
from scipy.optimize import minimize
from scipy.linalg import cho_solve, cholesky


def laplace_approximation(neg_log_posterior, theta_init, compute_hessian=None):
    """
    Compute the Laplace approximation to a posterior distribution.

    Parameters
    ----------
    neg_log_posterior : callable
        Function Ψ(θ) = -log p(θ|D) (unnormalized is fine).
    theta_init : ndarray
        Initial guess for the optimization.
    compute_hessian : callable, optional
        Function returning the Hessian at a given θ. If None, a numerical
        approximation is used.

    Returns
    -------
    mean : ndarray
        MAP estimate (mode of the posterior).
    covariance : ndarray
        Inverse Hessian (approximate posterior covariance).
    """
    # Step 1: Find the MAP estimate (mode)
    result = minimize(neg_log_posterior, theta_init, method='L-BFGS-B')
    theta_map = result.x
    if not result.success:
        raise ValueError(f"Optimization failed: {result.message}")

    # Step 2: Compute the Hessian at the mode
    if compute_hessian is not None:
        H = compute_hessian(theta_map)
    else:
        H = numerical_hessian(neg_log_posterior, theta_map)

    # Step 3: Invert the Hessian to get the covariance.
    # The Cholesky factorization both checks positive definiteness and gives
    # a numerically stable inverse.
    try:
        L = cholesky(H, lower=True)
    except np.linalg.LinAlgError:
        raise ValueError("Hessian not positive definite at mode")
    covariance = cho_solve((L, True), np.eye(len(theta_map)))

    return theta_map, covariance


def numerical_hessian(f, x, eps=1e-5):
    """Compute the Hessian via central differences."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            # Central difference for mixed partials
            x_pp = x.copy(); x_pp[i] += eps; x_pp[j] += eps
            x_pm = x.copy(); x_pm[i] += eps; x_pm[j] -= eps
            x_mp = x.copy(); x_mp[i] -= eps; x_mp[j] += eps
            x_mm = x.copy(); x_mm[i] -= eps; x_mm[j] -= eps
            H[i, j] = (f(x_pp) - f(x_pm) - f(x_mp) + f(x_mm)) / (4 * eps * eps)
            H[j, i] = H[i, j]  # Symmetry
    return H


# Example: Bayesian Logistic Regression
def logistic_neg_log_posterior(w, X, y, prior_variance=1.0):
    """Negative log posterior for logistic regression."""
    logits = X @ w
    # Log-likelihood (numerically stable)
    log_likelihood = np.sum(y * logits - np.logaddexp(0, logits))
    # Gaussian prior
    log_prior = -0.5 * np.sum(w**2) / prior_variance
    return -(log_likelihood + log_prior)


def logistic_hessian(w, X, y, prior_variance=1.0):
    """Exact Hessian of the negative log posterior for logistic regression."""
    logits = X @ w
    probs = 1 / (1 + np.exp(-logits))
    # Diagonal weight matrix W = p(1 - p)
    W = np.diag(probs * (1 - probs))
    # Hessian = X^T W X + prior precision
    H = X.T @ W @ X + np.eye(len(w)) / prior_variance
    return H
```
The Laplace approximation has an elegant geometric interpretation. Imagine the negative log posterior Ψ(θ) as an "energy landscape" where valleys correspond to high-probability regions.
The approximation process: locate the deepest valley (the mode θ̂), measure how sharply the landscape curves away from it, and replace the valley with a quadratic bowl; exponentiating that bowl yields the Gaussian approximation.
The Hessian captures the curvature of this landscape: steep curvature in a direction means the posterior falls off quickly, so the Gaussian assigns small variance along that direction, while shallow curvature corresponds to a flat, uncertain direction with large variance.
The eigenvalue interpretation:
The covariance matrix Σ = H⁻¹ has spectral decomposition:
$$\Sigma = V \Lambda V^T$$
where V contains eigenvectors (principal directions) and Λ contains eigenvalues (variances along each direction).
A large condition number (κ = λ_max/λ_min >> 1) signals potential numerical issues and suggests the parameters live on very different scales.
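The following minimal sketch (the Hessian values are made up purely for illustration) shows how to read off the principal directions, the per-direction standard deviations, and the condition number from a Laplace covariance:

```python
import numpy as np

# Example Hessian at the mode (values chosen only for illustration)
H = np.array([[50.0, 10.0],
              [10.0,  4.0]])
Sigma = np.linalg.inv(H)  # Laplace covariance

# Spectral decomposition Sigma = V Lambda V^T
eigvals, eigvecs = np.linalg.eigh(Sigma)
std_devs = np.sqrt(eigvals)                       # 1-sigma axis lengths of the Gaussian ellipse
condition_number = eigvals.max() / eigvals.min()  # kappa = lambda_max / lambda_min

print("principal directions (columns):\n", eigvecs)
print("standard deviations along each direction:", std_devs)
print("condition number:", condition_number)
```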
Beyond approximating the posterior, Laplace provides an approximation to the marginal likelihood (model evidence):
$$p(D) = \int p(D|\theta)p(\theta)d\theta$$
This integral is typically intractable, but Laplace approximation yields:
$$\log p(D) \approx \log p(D|\hat{\theta}) + \log p(\hat{\theta}) + \frac{d}{2}\log(2\pi) - \frac{1}{2}\log|H|$$
where d is the number of parameters.
The Bayesian Information Criterion (BIC) is essentially a simplified Laplace approximation where the Hessian is replaced by n·I (n = sample size). BIC ≈ -2 log p(D|θ̂) + d log n penalizes model complexity through the parameter count d.
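To see the connection, a sketch under the stated simplification: write H ≈ n·I₀ for a fixed per-observation curvature matrix I₀, so that log|H| = d log n + log|I₀|, and absorb every term that does not grow with n into O(1):

$$\log p(D) \approx \log p(D|\hat{\theta}) - \frac{d}{2}\log n + O(1) \quad\Longrightarrow\quad -2\log p(D) \approx -2\log p(D|\hat{\theta}) + d\log n$$

which is exactly the BIC expression above.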
Decomposition of the marginal likelihood:
$$\log p(D) \approx \underbrace{\log p(D|\hat{\theta})}_{\text{fit to data}} + \underbrace{\log p(\hat{\theta})}_{\text{prior consistency}} + \underbrace{\frac{d}{2}\log(2\pi) - \frac{1}{2}\log|H|}_{\text{Occam factor}}$$
The Occam factor automatically penalizes model complexity: a flexible model spreads its prior over a large parameter space, so the prior density at the mode is small, and the data force its posterior into a narrow region (large |H|), making the −½ log|H| term strongly negative. Both effects shrink the evidence unless the improved fit compensates.
This provides a principled trade-off between model fit and complexity, implementing Bayesian Occam's razor.
Computing log|H|:
The log-determinant can be computed efficiently via Cholesky decomposition:
$$\log|H| = 2\sum_{i=1}^d \log L_{ii}$$
where H = LLᵀ is the Cholesky factorization. This is O(d³) but numerically stable.
```python
import numpy as np
from scipy.linalg import cholesky


def laplace_log_marginal_likelihood(theta_map, X, y, neg_log_likelihood,
                                    log_prior, hessian):
    """
    Compute the Laplace approximation to the log marginal likelihood.

    Parameters
    ----------
    theta_map : ndarray
        MAP estimate.
    neg_log_likelihood : callable
        Returns -log p(D|θ).
    log_prior : callable
        Returns log p(θ).
    hessian : ndarray
        Hessian of the negative log posterior at the MAP.

    Returns
    -------
    log_ml : float
        Log marginal likelihood approximation.
    """
    d = len(theta_map)

    # Log-likelihood at the MAP
    log_lik = -neg_log_likelihood(theta_map, X, y)

    # Log prior at the MAP
    log_p = log_prior(theta_map)

    # Occam factor via Cholesky
    try:
        L = cholesky(hessian, lower=True)
        log_det_H = 2 * np.sum(np.log(np.diag(L)))
    except np.linalg.LinAlgError:
        raise ValueError("Hessian not positive definite")

    # Laplace approximation
    log_ml = (log_lik + log_p
              + (d / 2) * np.log(2 * np.pi)
              - 0.5 * log_det_H)
    return log_ml


def compare_models_laplace(models, X, y):
    """
    Compare models using the Laplace-approximated marginal likelihood.

    Parameters
    ----------
    models : list of dict
        Each dict has 'name', 'neg_log_post', 'neg_log_lik', 'log_prior',
        'hessian_fn', and 'init'.

    Returns
    -------
    results : dict
        Model names with log marginal likelihoods and posterior probabilities.
    """
    log_mls = []

    for model in models:
        # Fit the model (find the MAP)
        theta_map, cov = laplace_approximation(
            model['neg_log_post'],
            model['init'],
            compute_hessian=model['hessian_fn']
        )

        # Compute the marginal likelihood
        hessian = model['hessian_fn'](theta_map)
        log_ml = laplace_log_marginal_likelihood(
            theta_map, X, y,
            model['neg_log_lik'],
            model['log_prior'],
            hessian
        )
        log_mls.append((model['name'], log_ml))

    # Convert to posterior model probabilities (uniform model prior)
    log_mls_array = np.array([lm for _, lm in log_mls])
    log_mls_array -= np.max(log_mls_array)  # Numerical stability
    probs = np.exp(log_mls_array)
    probs /= np.sum(probs)

    return {name: {'log_ml': lm, 'prob': p}
            for (name, lm), p in zip(log_mls, probs)}
```
Implementing Laplace approximation in practice requires attention to several numerical and statistical details.
Scaling to high dimensions:
For models with many parameters (d >> 1000), storing and inverting the full Hessian becomes prohibitive. Several approximations help:
Diagonal approximation: Use only the diagonal of the Hessian, assuming parameter independence (a minimal sketch follows this list): $$\Sigma \approx \text{diag}(1/H_{11}, ..., 1/H_{dd})$$
Block-diagonal approximation: Capture correlations within groups (e.g., per-layer in neural networks) but not across groups
Low-rank approximation: Represent covariance as Σ ≈ D + VVᵀ where D is diagonal and V is low-rank
Kronecker-factored approximation (K-FAC): For neural networks, approximate Fisher as Kronecker product of smaller matrices
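A minimal sketch of the diagonal approximation, assuming only that the negative log posterior can be evaluated pointwise; the function and parameter names are illustrative, not part of any library:

```python
import numpy as np

def diagonal_laplace(neg_log_posterior, theta_map, eps=1e-5):
    """Approximate the posterior covariance by 1 / diag(H) at the mode."""
    d = len(theta_map)
    hess_diag = np.zeros(d)
    f0 = neg_log_posterior(theta_map)
    for i in range(d):
        e = np.zeros(d)
        e[i] = eps
        # Second-order central difference for the i-th diagonal entry of the Hessian
        hess_diag[i] = (neg_log_posterior(theta_map + e)
                        - 2 * f0
                        + neg_log_posterior(theta_map - e)) / eps**2
    if np.any(hess_diag <= 0):
        raise ValueError("Non-positive curvature along some coordinate at the mode")
    variances = 1.0 / hess_diag
    return variances  # diagonal of the approximate covariance
```

This stores d numbers instead of a d × d matrix, at the cost of discarding all parameter correlations.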
In very high dimensions, the posterior mode (MAP) may not be representative of typical samples. The mode sits at the peak of the density, but in high-d spaces, most of the probability mass is in a thin shell away from the peak. This is the 'typical set' phenomenon and represents a fundamental limitation of mode-based approximations.
Laplace approximation is widely used across machine learning, particularly in settings where speed or closed-form solutions are prioritized.
| Application | Why Laplace Works | Practical Notes |
|---|---|---|
| Bayesian Logistic Regression | Posterior is log-concave, unimodal | Exact Hessian available; very reliable |
| Gaussian Process Classification | Approximates non-Gaussian likelihood | Combined with EP or variational methods |
| Bayesian Neural Networks | Fast uncertainty after training | Works well with pre-trained MAP networks |
| Hyperparameter Optimization | Marginal likelihood for model selection | Enables gradient-based hyperparameter tuning |
| Online Learning | Recursive updating of Gaussian posterior | Posterior becomes prior for next batch |
Laplace for Neural Networks (LLA):
Recent work has revived Laplace approximation for deep learning through the Last-Layer Laplace Approximation: the network up to its penultimate layer is treated as a fixed, deterministic feature extractor, and the Laplace approximation is applied only to the weights of the final layer.
This provides calibrated uncertainty estimates with minimal computational overhead: the Hessian is only computed for the last layer, making it tractable even for large networks.
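The sketch below illustrates the idea for a binary classifier; it reuses the logistic_hessian helper from the earlier code block and assumes penultimate-layer features Phi and trained last-layer weights w_map are available. The probit-based predictive formula is one common way to average the sigmoid over a Gaussian, not a prescribed part of the method:

```python
import numpy as np

def last_layer_laplace(Phi, y, w_map, prior_variance=1.0):
    """Laplace approximation over last-layer weights, features Phi held fixed."""
    H = logistic_hessian(w_map, Phi, y, prior_variance)  # from the earlier code block
    cov = np.linalg.inv(H)
    return w_map, cov

def predictive_probability(phi_new, w_map, cov):
    """Approximate p(y=1 | phi_new) under the Gaussian over last-layer weights."""
    mean = phi_new @ w_map
    var = phi_new @ cov @ phi_new
    # Probit approximation to the Gaussian average of the sigmoid
    kappa = 1.0 / np.sqrt(1.0 + np.pi * var / 8.0)
    return 1.0 / (1.0 + np.exp(-kappa * mean))
```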
Laplace for continual learning:
Sequential Laplace approximation naturally supports continual learning: the Gaussian posterior obtained after one task (or data batch) becomes the prior for the next, so the quadratic penalty induced by the previous Hessian discourages parameters from drifting away from values that explained earlier data. A minimal sketch of this recursion follows.
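This sketch assumes the laplace_approximation routine defined earlier on this page and a per-batch negative log-likelihood neg_log_lik(θ, X, y); all names are illustrative:

```python
import numpy as np

def gaussian_neg_log_prior(theta, mean, cov_inv):
    """Quadratic penalty from a Gaussian prior N(mean, cov)."""
    diff = theta - mean
    return 0.5 * diff @ cov_inv @ diff

def sequential_laplace(batches, neg_log_lik, theta_init, prior_mean, prior_cov):
    """Recursive Laplace updates: each batch's posterior is the next batch's prior."""
    mean, cov = prior_mean, prior_cov
    for X_b, y_b in batches:
        cov_inv = np.linalg.inv(cov)
        neg_log_post = lambda th: (neg_log_lik(th, X_b, y_b)
                                   + gaussian_neg_log_prior(th, mean, cov_inv))
        mean, cov = laplace_approximation(neg_log_post, theta_init)
        theta_init = mean  # warm-start the next optimization at the current mode
    return mean, cov
```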
Understanding when Laplace approximation fails is as important as knowing how to apply it. Several fundamental limitations restrict its applicability: the approximation captures only a single mode, it cannot represent skewness or heavy tails, it requires a twice-differentiable log posterior with a positive-definite Hessian at the mode, and in high dimensions the mode may sit far from the bulk of the posterior mass.
Extensions to address limitations:
Mixture of Laplace approximations: Fit a Gaussian at each mode and combine them as a mixture (see the sketch after this list): $$q(\theta) = \sum_k \pi_k \mathcal{N}(\theta | \hat{\theta}_k, H_k^{-1})$$
Reparameterization: Transform parameters so posterior is more Gaussian (e.g., log-transform positive parameters)
Higher-order expansions: Include third/fourth-order terms for skewness and kurtosis correction (Edgeworth expansions)
Leave-one-out corrections: WAIC and LOO-CV can detect when Laplace marginal likelihood is unreliable
Variational refinement: Use Laplace as initialization for more flexible variational approximations
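A minimal sketch of the mixture extension listed above, reusing laplace_approximation from the earlier code block; the function and variable names are illustrative:

```python
import numpy as np

def mixture_of_laplace(neg_log_posterior, inits):
    """Build a mixture of Laplace components from several optimizer restarts."""
    components = []
    for theta0 in inits:
        mean, cov = laplace_approximation(neg_log_posterior, theta0)
        # Unnormalized log mixture weight: the Laplace estimate of the mass around
        # this mode, exp(-Psi(mode)) * (2*pi)^(d/2) * |cov|^(1/2)
        log_w = -neg_log_posterior(mean) + 0.5 * np.linalg.slogdet(2 * np.pi * cov)[1]
        components.append((mean, cov, log_w))

    # In practice, restarts that converged to the same mode should be merged first.
    log_ws = np.array([lw for _, _, lw in components])
    weights = np.exp(log_ws - log_ws.max())
    weights /= weights.sum()
    return [(m, c, w) for (m, c, _), w in zip(components, weights)]
```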
Laplace approximation excels when: (1) you need a fast approximation and can tolerate some error, (2) the posterior is expected to be approximately Gaussian (large data, log-concave), (3) you need marginal likelihood for model comparison, or (4) you're building a baseline before trying more sophisticated methods.
The Laplace approximation provides a remarkably simple yet powerful approach to approximate inference. By fitting a Gaussian to the posterior via second-order Taylor expansion, we obtain tractable representations that enable fast prediction, uncertainty quantification, and model comparison.
What's next:
Laplace approximation provides a single Gaussian centered at the mode, which may be too restrictive for complex posteriors. In the next page, we'll explore variational inference—a more flexible framework that optimizes over a family of distributions to find the best approximation, trading off fidelity and tractability through a principled objective function.
You now understand Laplace approximation from first principles—its derivation, geometric interpretation, and practical applications. This forms the foundation for understanding all deterministic approximation methods, which seek closed-form solutions to the intractable inference problem.