We've now encountered two fundamentally different approaches to the intractable inference problem in Bayesian machine learning:
Deterministic methods (Laplace, VI, EP) produce a closed-form approximating distribution q(θ) by optimizing some objective or matching moments. Given the same inputs, they produce the same output.
Stochastic methods (MCMC, importance sampling, SMC) generate samples θ⁽¹⁾, θ⁽²⁾, ..., θ⁽ᴺ⁾ from the posterior. Each run produces different samples, and expectations are estimated by Monte Carlo averaging.
Neither approach is universally superior. Understanding when each excels—and when each fails—is essential for practitioners. This page provides a systematic comparison to guide your choice of inference method.
By the end of this page, you will understand the fundamental trade-off between approximation bias (deterministic) and Monte Carlo variance (stochastic), recognize computational and statistical properties of each paradigm, develop intuition for when each approach is appropriate, and learn hybrid strategies that combine the best of both worlds.
The core difference lies in how each approach represents the approximate posterior:
Deterministic methods: represent the posterior by an explicit parametric density q(θ) — for example a Gaussian whose mean and covariance are chosen to approximate p(θ | D) — so expectations are computed under q.
Stochastic methods: represent the posterior implicitly through the samples θ⁽¹⁾, ..., θ⁽ᴺ⁾, so any expectation E[f(θ)] is estimated by the Monte Carlo average (1/N) Σᵢ f(θ⁽ⁱ⁾).
| Error Type | Deterministic Methods | Stochastic Methods |
|---|---|---|
| Primary error source | Approximation bias (q ≠ p) | Monte Carlo variance (1/√N) |
| Effect of more computation | None once converged (bias is fixed) | Variance shrinks as 1/√N |
| Worst case | Systematically wrong | High variance estimates |
| Convergence | To a biased fixed point | To true posterior (asymptotic) |
| Diagnosing problems | Compare to ground truth | Check mixing, ESS, R̂ |
This mirrors the classic bias-variance trade-off in estimation. Deterministic methods have nonzero bias but zero variance (for fixed inputs and initialization). Stochastic methods are asymptotically unbiased but have non-zero variance at any finite N. The choice depends on whether you can tolerate systematic error or prefer unbiased but noisy estimates.
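To make the trade-off concrete, here is a minimal sketch (the Gamma(3) toy "posterior" is illustrative, chosen so everything is known in closed form): a Laplace approximation to a skewed density carries a fixed bias in the posterior mean, while a Monte Carlo estimate's error keeps shrinking as N grows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Skewed toy "posterior": Gamma(a=3, scale=1); true mean = 3, mode = 2.
post = stats.gamma(a=3)
true_mean = post.mean()

# Deterministic route: Laplace approximation centered at the mode.
# d^2/dtheta^2 log p(theta) = -(a-1)/theta^2, so Laplace variance = theta^2/(a-1).
mode = 2.0
laplace_var = mode**2 / 2.0
laplace_bias = abs(mode - true_mean)  # fixed: more compute never shrinks it

# Stochastic route: Monte Carlo mean from exact samples, error ~ 1/sqrt(N).
for n in [100, 10_000, 1_000_000]:
    mc_err = abs(post.rvs(n, random_state=rng).mean() - true_mean)
    print(f"N={n:>9}: MC error {mc_err:.4f} vs fixed Laplace bias {laplace_bias:.4f}")
```

The Laplace mean is stuck at the mode (bias 1.0 here) no matter how long you compute, while the Monte Carlo error drops by roughly a factor of 10 for every 100× more samples.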
The computational profiles of deterministic and stochastic methods differ significantly, affecting their suitability for different problem scales.
| Property | Laplace | VI | EP | MCMC | Importance Sampling |
|---|---|---|---|---|---|
| Storage | O(d²) Hessian | O(d)–O(d²) | O(nd) | O(N·d) samples | O(N·d) samples |
| Per-iteration cost | O(nd² + d³) | O(d) per sample | O(n) per site | O(d) per sample | O(d) per sample |
| Parallelizable | Partially | Yes (SGD) | Limited | Multiple chains | Fully |
| Mini-batch friendly | Limited | Yes (SVI) | No | Limited | No |
| Convergence criterion | Gradient norm | ELBO | Site changes | R̂, ESS | ESS |
| Memory of past iterations | No | Optional | Yes (sites) | Chain history | All samples |
Scaling with data size (n):
Laplace: O(n) for likelihood computation, O(nd²) for Hessian. Scales well for moderate d.
VI (SVI): O(mini-batch) per iteration. Scales to millions of data points via stochastic gradients.
EP: O(n) per iteration, O(nT) total for T iterations. Sequential nature limits parallelization.
MCMC: Each likelihood evaluation is O(n). Becomes slow for large n without subsampling.
Importance Sampling: O(n) per sample. Requires all data for each weight computation.
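The practical consequence of these scalings can be seen in a small sketch (illustrative names, Gaussian toy data): a rescaled mini-batch sum is an unbiased O(m) stand-in for the O(n) full-data log-likelihood, which is what lets SVI iterate cheaply while plain MCMC pays the full O(n) on every step.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100_000, 100                      # data size, mini-batch size
data = rng.normal(loc=2.0, size=n)

def full_loglik(mu):                     # O(n): what MCMC pays every step
    return -0.5 * np.sum((data - mu) ** 2)

def minibatch_loglik(mu):                # O(m): what SVI pays per iteration
    batch = rng.choice(data, size=m, replace=False)
    return -0.5 * (n / m) * np.sum((batch - mu) ** 2)

# The rescaled mini-batch estimate is unbiased for the full-data sum
est = np.mean([minibatch_loglik(2.0) for _ in range(2000)])
print(f"full: {full_loglik(2.0):.0f}, averaged mini-batch estimate: {est:.0f}")
```

The mini-batch estimate is noisy on any single iteration, but its expectation matches the full sum — stochastic gradient methods exploit exactly this.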
Scaling with parameter dimension (d):
Laplace: O(d³) for Hessian inversion. Prohibitive beyond thousands of parameters.
VI: Mean-field is O(d), full covariance is O(d³). Diagonal or low-rank approximations for large d.
MCMC: Random walk scales poorly with d. HMC is O(d·L) where L is leapfrog steps, still struggles beyond thousands.
Importance Sampling: Weight degeneracy exponential in d. Fails catastrophically in high dimensions.
Importance sampling effective sample size (ESS) degrades exponentially with dimension: even a modest per-dimension mismatch between proposal and target compounds multiplicatively, so at d = 100 the number of samples needed for a usable ESS can exceed 10^50. MCMC with random walk proposals also scales poorly—HMC is the primary tool that maintains reasonable scaling in moderate dimensions.
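The collapse is easy to reproduce in a toy sketch (standard-normal target, slightly wider Gaussian proposal — both choices illustrative): the weight-based estimate ESS = 1/Σw̃² falls off geometrically in d even though the per-dimension mismatch is mild.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 5000
ess_by_d = {}

for d in [1, 10, 50, 100]:
    # Proposal N(0, 1.5^2 I) is only slightly wider than the target N(0, I)
    x = rng.normal(scale=1.5, size=(n, d))
    log_w = (stats.norm.logpdf(x).sum(axis=1)
             - stats.norm.logpdf(x, scale=1.5).sum(axis=1))
    w = np.exp(log_w - log_w.max())    # stabilize before normalizing
    w /= w.sum()
    ess_by_d[d] = 1.0 / np.sum(w**2)   # effective sample size from the weights
    print(f"d={d:>3}: ESS = {ess_by_d[d]:8.1f} of {n}")
```

Each extra dimension multiplies the expected ESS loss by a constant factor (about 1.2 here), so by d = 100 essentially a single sample carries all the weight.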
Beyond computation, the methods differ in their statistical guarantees and behavior.
Uncertainty quantification quality:
A critical practical question: how well do the methods capture posterior uncertainty?
| Scenario | Laplace | VI (Mean Field) | EP | MCMC |
|---|---|---|---|---|
| Posterior is Gaussian | ✓ Exact | ≈ Ignores correlations | ✓ Exact | ✓ Exact |
| Skewed posterior | ✗ Misses asymmetry | ✗ Underestimates tail | ≈ Better moments | ✓ Correct |
| Heavy tails | ✗ Gaussian tails | ✗ Gaussian tails | ≈ Depends on family | ✓ Correct |
| Multimodal | ✗ Single mode | ✗ Single mode | ✗ May oscillate | ✓ If mixing works |
| Posterior near boundary | ✗ Ignores boundary | ✗ May violate constraint | ≈ Can handle | ✓ Correct |
The key insight: Deterministic methods are guaranteed to be wrong for sufficiently complex posteriors. The question is whether they're wrong in ways that matter for your application.
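One of these failure modes can be shown exactly (a standard closed-form result for mean-field VI on a correlated Gaussian target; the numbers below are just an illustration): the reverse-KL mean-field solution matches the diagonal of the precision matrix, so each factor gets the conditional variance rather than the marginal one.

```python
import numpy as np

# True posterior: bivariate Gaussian with correlation rho, unit marginal variances
rho = 0.9
Sigma = np.array([[1.0, rho], [rho, 1.0]])

# Mean-field VI (reverse KL) on a Gaussian target has a closed form:
# each factor's precision equals the corresponding diagonal entry of the
# target precision, so its variance is 1 / Lambda_ii -- the conditional variance.
Lambda = np.linalg.inv(Sigma)
vi_var = 1.0 / np.diag(Lambda)  # = 1 - rho**2 per coordinate

print("true marginal variance:", Sigma[0, 0])  # 1.0
print("mean-field VI variance:", vi_var[0])    # ~0.19: badly underestimated
```

The stronger the correlation, the worse the shrinkage: at ρ = 0.99 the reported variance is about 2% of the truth, which is exactly the underestimation pattern flagged in the table.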
Knowing whether your inference has "worked" is as important as running the algorithm. The diagnostic tools differ fundamentally between paradigms.
Deterministic method diagnostics:
Stochastic method diagnostics:
```python
import numpy as np


def vi_diagnostics(elbo_history, params_history, n_restarts=3):
    """
    Diagnostics for variational inference.

    Parameters
    ----------
    elbo_history : list
        ELBO values over iterations
    params_history : list of arrays
        Variational parameters at each iteration
    n_restarts : int
        Number of random restarts to run
    """
    diagnostics = {}

    # 1. Check ELBO convergence: relative fluctuation over the last 100 iterations
    final_elbos = elbo_history[-100:]
    elbo_std = np.std(final_elbos)
    diagnostics['elbo_converged'] = elbo_std < 0.01 * abs(np.mean(final_elbos))

    # 2. Check parameter convergence
    final_params = np.array(params_history[-100:])
    param_std = np.std(final_params, axis=0)
    diagnostics['params_stable'] = np.all(param_std < 0.01)

    # 3. Check gradient (would need access to gradients)
    # diagnostics['grad_norm_small'] = grad_norm < 1e-5

    return diagnostics


def mcmc_diagnostics(chains, param_names=None):
    """
    Diagnostics for MCMC.

    Parameters
    ----------
    chains : ndarray
        Shape (n_chains, n_samples, n_params)
    param_names : list, optional
        Names of parameters
    """
    n_chains, n_samples, n_params = chains.shape
    diagnostics = {}

    # 1. Gelman-Rubin R-hat
    r_hats = []
    for p in range(n_params):
        between = np.var(np.mean(chains[:, :, p], axis=1), ddof=1) * n_samples
        within = np.mean(np.var(chains[:, :, p], axis=1, ddof=1))
        var_est = (n_samples - 1) / n_samples * within + between / n_samples
        r_hat = np.sqrt(var_est / within)
        r_hats.append(r_hat)
    diagnostics['r_hat'] = np.array(r_hats)
    diagnostics['r_hat_ok'] = np.all(np.array(r_hats) < 1.01)

    # 2. Effective sample size (simple autocorrelation truncation)
    def ess(chain):
        n = len(chain)
        mean = np.mean(chain)
        var = np.var(chain)
        if var == 0:
            return n
        acf = []
        for lag in range(1, n // 2):
            c = np.mean((chain[:-lag] - mean) * (chain[lag:] - mean)) / var
            if c < 0.05:
                break
            acf.append(c)
        tau = 1 + 2 * sum(acf)
        return n / tau

    ess_values = []
    for p in range(n_params):
        chain_ess = [ess(chains[c, :, p]) for c in range(n_chains)]
        ess_values.append(sum(chain_ess))
    diagnostics['ess'] = np.array(ess_values)
    diagnostics['ess_ok'] = np.all(np.array(ess_values) > 400)

    return diagnostics


def compare_methods(vi_samples, mcmc_samples, true_mean=None, true_cov=None):
    """Compare VI and MCMC approximations."""
    vi_mean = np.mean(vi_samples, axis=0)
    vi_cov = np.cov(vi_samples.T)
    mcmc_mean = np.mean(mcmc_samples, axis=0)
    mcmc_cov = np.cov(mcmc_samples.T)

    comparison = {
        'mean_diff': np.linalg.norm(vi_mean - mcmc_mean),
        # <1 means VI underestimates total variance
        'cov_trace_ratio': np.trace(vi_cov) / np.trace(mcmc_cov),
    }
    if true_mean is not None:
        comparison['vi_mean_error'] = np.linalg.norm(vi_mean - true_mean)
        comparison['mcmc_mean_error'] = np.linalg.norm(mcmc_mean - true_mean)
    return comparison
```

In practice, the choice often comes down to the speed-accuracy trade-off. Let's make this concrete with typical scenarios.
| Scenario | Recommended | Reasoning |
|---|---|---|
| 10K params, 1M data | SVI | Only SVI scales; mini-batches essential |
| 100 params, 1K data, high accuracy needed | MCMC (HMC/NUTS) | Sufficient scale for MCMC; need exact inference |
| 1K params, 10K data, real-time predictions | Laplace or VI | Need fast inference; can accept approximation |
| GP classification, 5K data | EP | Best calibration for non-Gaussian likelihood |
| Model comparison needed | Laplace or VI | Provide marginal likelihood approximation |
| Neural network uncertainty | VI (last layer) or MC Dropout | Full Bayesian NN too expensive |
| Online/streaming data | ADF or online VI | Can't wait for MCMC to converge |
Runtime Comparison (Empirical Guidelines):
For a typical model with d = 100 parameters and n = 10,000 observations:
| Method | Typical Runtime | Samples/Iterations |
|---|---|---|
| Laplace | ~1 second | 1 optimization |
| Mean-field VI | ~10 seconds | 1000 ELBO iterations |
| Full-covariance VI | ~100 seconds | 1000 ELBO iterations |
| EP | ~30 seconds | 10 EP sweeps |
| MCMC (RW-MH) | ~10 minutes | 10,000 samples |
| MCMC (HMC) | ~5 minutes | 2,000 samples |
| MCMC (NUTS) | ~3 minutes | 1,000 samples |
These are rough approximations; actual times depend heavily on model structure and implementation.
MCMC typically takes 10-100× longer than VI for comparable problems. If you need inference in under 1 second, MCMC is usually not an option. But if you have minutes to hours and need accurate uncertainty, MCMC is often worth the wait.
Modern Bayesian inference often combines deterministic and stochastic methods to leverage the strengths of both. These hybrid approaches are becoming increasingly common in practice.
Strategy 1: VI Warmstart for MCMC
```python
# Step 1: Run VI to get approximate posterior
q_mean, q_cov = run_variational_inference(model, data)

# Step 2: Initialize MCMC at the VI mean
initial_point = q_mean

# Step 3: Use the VI covariance to set per-dimension step sizes
step_sizes = np.sqrt(np.diag(q_cov))  # HMC step sizes

# Step 4: Run MCMC from the warm start
samples = run_hmc(model, data, init=initial_point, step_size=step_sizes)
```
This approach often reduces burn-in from thousands to hundreds of samples.
Strategy 2: Importance Sampling Correction
```python
from scipy.special import softmax

# Draw samples from the VI approximation
theta_samples = q.sample(n_samples)

# Compute importance weights
log_weights = log_posterior(theta_samples) - q.log_prob(theta_samples)
weights = softmax(log_weights)  # self-normalized weights

# Weighted estimates correct for VI bias
weighted_mean = sum(w * theta for w, theta in zip(weights, theta_samples))
```
If VI is close to the true posterior, ESS will be high; if not, weights degenerate but we detect the problem.
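A toy end-to-end version of this strategy (the Gamma target and the moment-matched Gaussian standing in for a "VI fit" are illustrative): draw from q, reweight, and read the ESS off the weights as the built-in failure detector.

```python
import numpy as np
from scipy import stats
from scipy.special import softmax

rng = np.random.default_rng(0)

# Toy target: Gamma(a=5) "posterior" (mean 5); "VI" fit: moment-matched Gaussian
target = stats.gamma(a=5)
q = stats.norm(loc=target.mean(), scale=target.std())

theta = q.rvs(size=5000, random_state=rng)
log_w = target.logpdf(theta) - q.logpdf(theta)  # -inf where theta <= 0 is fine
w = softmax(log_w)                              # normalized weights

is_mean = np.sum(w * theta)                     # self-normalized IS estimate
ess = 1.0 / np.sum(w**2)                        # weight-degeneracy diagnostic
print(f"q mean {q.mean():.3f} -> IS-corrected mean {is_mean:.3f} (true 5.0)")
print(f"ESS = {ess:.0f} of {len(theta)}")
```

Here q matches the target's first two moments, so the ESS stays high and the corrected mean is accurate; had q been badly mismatched, the same ESS line would collapse toward 1 and flag the problem.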
Different applications have different requirements that suggest specific inference methods. Here's guidance organized by application domain:
Recommended: MCMC (NUTS/HMC)
Reasoning:
When to deviate:
We can synthesize the preceding analysis into a practical decision framework. Answer these questions to guide your method selection:
Step 1: Assess computational constraints
Step 2: Assess accuracy requirements
Step 3: Assess model characteristics
When uncertain which method to use: (1) Start with VI—it's fastest and provides a baseline. (2) If VI approximation looks poor, try EP for better calibration. (3) If both fail, invest in MCMC with careful diagnostics. (4) If MCMC fails to mix, revisit model parameterization.
The Practitioner's Checklist:
✓ Run multiple initializations for deterministic methods
✓ Check convergence diagnostics (ELBO plateau, R̂ < 1.01)
✓ Validate uncertainty calibration on held-out data
✓ Compare VI and MCMC when computationally feasible
✓ Document inference choices and their justifications
✓ Be skeptical of overly confident predictions
✓ When in doubt, prefer conservative (wider) uncertainty estimates
The choice between deterministic and stochastic inference methods involves fundamental trade-offs between computational cost, approximation accuracy, and statistical guarantees. Neither approach dominates; the right choice depends on your specific problem and constraints.
What's next:
Having developed a thorough understanding of individual methods and their trade-offs, the final page of this module provides comprehensive guidance on choosing the right approximation method—synthesizing everything into actionable decision procedures for practitioners.
You now understand the fundamental distinction between deterministic and stochastic inference, their computational and statistical properties, and when to use each. This comparative framework will guide your inference method selection across diverse Bayesian modeling problems.