We've now encountered two fundamentally different approaches to the intractable inference problem in Bayesian machine learning:
Deterministic methods (Laplace, VI, EP) produce a closed-form approximating distribution q(θ) by optimizing some objective or matching moments. Given the same inputs, they produce the same output.
Stochastic methods (MCMC, importance sampling, SMC) generate samples θ⁽¹⁾, θ⁽²⁾, ..., θ⁽ᴺ⁾ from the posterior. Each run produces different samples, and expectations are estimated by Monte Carlo averaging.
Neither approach is universally superior. Understanding when each excels—and when each fails—is essential for practitioners. This page provides a systematic comparison to guide your choice of inference method.
By the end of this page, you will understand the fundamental trade-off between approximation bias (deterministic) and Monte Carlo variance (stochastic), recognize computational and statistical properties of each paradigm, develop intuition for when each approach is appropriate, and learn hybrid strategies that combine the best of both worlds.
The core difference lies in how each approach represents the approximate posterior:
Deterministic methods: represent the posterior by an explicit parametric density q(θ) — for example a Gaussian whose mean and covariance are chosen to approximate p(θ | D) — so expectations are computed under q.
Stochastic methods: represent the posterior implicitly through the samples θ⁽¹⁾, ..., θ⁽ᴺ⁾, so any expectation E[f(θ)] is estimated by the Monte Carlo average (1/N) Σᵢ f(θ⁽ⁱ⁾).
| Error Type | Deterministic Methods | Stochastic Methods |
|---|---|---|
| Primary error source | Approximation bias (q ≠ p) | Monte Carlo variance (1/√N) |
| Effect of more computation | None once converged (bias is fixed) | Variance shrinks as 1/√N |
| Worst case | Systematically wrong | High variance estimates |
| Convergence | To a biased fixed point | To true posterior (asymptotic) |
| Diagnosing problems | Compare to ground truth | Check mixing, ESS, R̂ |
This mirrors the classic bias-variance trade-off in estimation. Deterministic methods have nonzero bias but zero variance (for fixed inputs and initialization). Stochastic methods are asymptotically unbiased but have non-zero variance at any finite N. The choice depends on whether you can tolerate systematic error or prefer unbiased but noisy estimates.
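To make the trade-off concrete, here is a minimal sketch (the Gamma(3) toy "posterior" is illustrative, chosen so everything is known in closed form): a Laplace approximation to a skewed density carries a fixed bias in the posterior mean, while a Monte Carlo estimate's error keeps shrinking as N grows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Skewed toy "posterior": Gamma(a=3, scale=1); true mean = 3, mode = 2.
post = stats.gamma(a=3)
true_mean = post.mean()

# Deterministic route: Laplace approximation centered at the mode.
# d^2/dtheta^2 log p(theta) = -(a-1)/theta^2, so Laplace variance = theta^2/(a-1).
mode = 2.0
laplace_var = mode**2 / 2.0
laplace_bias = abs(mode - true_mean)  # fixed: more compute never shrinks it

# Stochastic route: Monte Carlo mean from exact samples, error ~ 1/sqrt(N).
for n in [100, 10_000, 1_000_000]:
    mc_err = abs(post.rvs(n, random_state=rng).mean() - true_mean)
    print(f"N={n:>9}: MC error {mc_err:.4f} vs fixed Laplace bias {laplace_bias:.4f}")
```

The Laplace mean is stuck at the mode (bias 1.0 here) no matter how long you compute, while the Monte Carlo error drops by roughly a factor of 10 for every 100× more samples.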
The computational profiles of deterministic and stochastic methods differ significantly, affecting their suitability for different problem scales.
| Property | Laplace | VI | EP | MCMC | Importance Sampling |
|---|---|---|---|---|---|
| Storage | O(d²) Hessian | O(d)–O(d²) | O(nd) | O(N·d) samples | O(N·d) samples |
| Per-iteration cost | O(nd² + d³) | O(d) per sample | O(n) per site | O(d) per sample | O(d) per sample |
| Parallelizable | Partially | Yes (SGD) | Limited | Multiple chains | Fully |
| Mini-batch friendly | Limited | Yes (SVI) | No | Limited | No |
| Convergence criterion | Gradient norm | ELBO | Site changes | R̂, ESS | ESS |
| Memory of past iterations | No | Optional | Yes (sites) | Chain history | All samples |
Scaling with data size (n):
Laplace: O(n) for likelihood computation, O(nd²) for Hessian. Scales well for moderate d.
VI (SVI): O(mini-batch) per iteration. Scales to millions of data points via stochastic gradients.
EP: O(n) per iteration, O(nT) total for T iterations. Sequential nature limits parallelization.
MCMC: Each likelihood evaluation is O(n). Becomes slow for large n without subsampling.
Importance Sampling: O(n) per sample. Requires all data for each weight computation.
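The practical consequence of these scalings can be seen in a small sketch (illustrative names, Gaussian toy data): a rescaled mini-batch sum is an unbiased O(m) stand-in for the O(n) full-data log-likelihood, which is what lets SVI iterate cheaply while plain MCMC pays the full O(n) on every step.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100_000, 100                      # data size, mini-batch size
data = rng.normal(loc=2.0, size=n)

def full_loglik(mu):                     # O(n): what MCMC pays every step
    return -0.5 * np.sum((data - mu) ** 2)

def minibatch_loglik(mu):                # O(m): what SVI pays per iteration
    batch = rng.choice(data, size=m, replace=False)
    return -0.5 * (n / m) * np.sum((batch - mu) ** 2)

# The rescaled mini-batch estimate is unbiased for the full-data sum
est = np.mean([minibatch_loglik(2.0) for _ in range(2000)])
print(f"full: {full_loglik(2.0):.0f}, averaged mini-batch estimate: {est:.0f}")
```

The mini-batch estimate is noisy on any single iteration, but its expectation matches the full sum — stochastic gradient methods exploit exactly this.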
Scaling with parameter dimension (d):
Laplace: O(d³) for Hessian inversion. Prohibitive beyond thousands of parameters.
VI: Mean-field is O(d), full covariance is O(d³). Diagonal or low-rank approximations for large d.
MCMC: Random walk scales poorly with d. HMC is O(d·L) where L is leapfrog steps, still struggles beyond thousands.
Importance Sampling: Weight degeneracy exponential in d. Fails catastrophically in high dimensions.
Importance sampling effective sample size (ESS) degrades exponentially with dimension: even a modest per-dimension mismatch between proposal and target compounds multiplicatively, so at d = 100 the number of samples needed for a usable ESS can exceed 10^50. MCMC with random walk proposals also scales poorly—HMC is the primary tool that maintains reasonable scaling in moderate dimensions.
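The collapse is easy to reproduce in a toy sketch (standard-normal target, slightly wider Gaussian proposal — both choices illustrative): the weight-based estimate ESS = 1/Σw̃² falls off geometrically in d even though the per-dimension mismatch is mild.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 5000
ess_by_d = {}

for d in [1, 10, 50, 100]:
    # Proposal N(0, 1.5^2 I) is only slightly wider than the target N(0, I)
    x = rng.normal(scale=1.5, size=(n, d))
    log_w = (stats.norm.logpdf(x).sum(axis=1)
             - stats.norm.logpdf(x, scale=1.5).sum(axis=1))
    w = np.exp(log_w - log_w.max())    # stabilize before normalizing
    w /= w.sum()
    ess_by_d[d] = 1.0 / np.sum(w**2)   # effective sample size from the weights
    print(f"d={d:>3}: ESS = {ess_by_d[d]:8.1f} of {n}")
```

Each extra dimension multiplies the expected ESS loss by a constant factor (about 1.2 here), so by d = 100 essentially a single sample carries all the weight.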
Beyond computation, the methods differ in their statistical guarantees and behavior.
Uncertainty quantification quality:
A critical practical question: how well do the methods capture posterior uncertainty?
| Scenario | Laplace | VI (Mean Field) | EP | MCMC |
|---|---|---|---|---|
| Posterior is Gaussian | ✓ Exact | ≈ Ignores correlations | ✓ Exact | ✓ Exact |
| Skewed posterior | ✗ Misses asymmetry | ✗ Underestimates tail | ≈ Better moments | ✓ Correct |
| Heavy tails | ✗ Gaussian tails | ✗ Gaussian tails | ≈ Depends on family | ✓ Correct |
| Multimodal | ✗ Single mode | ✗ Single mode | ✗ May oscillate | ✓ If mixing works |
| Posterior near boundary | ✗ Ignores boundary | ✗ May violate constraint | ≈ Can handle | ✓ Correct |
The key insight: Deterministic methods are guaranteed to be wrong for sufficiently complex posteriors. The question is whether they're wrong in ways that matter for your application.
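One of these failure modes can be shown exactly (a standard closed-form result for mean-field VI on a correlated Gaussian target; the numbers below are just an illustration): the reverse-KL mean-field solution matches the diagonal of the precision matrix, so each factor gets the conditional variance rather than the marginal one.

```python
import numpy as np

# True posterior: bivariate Gaussian with correlation rho, unit marginal variances
rho = 0.9
Sigma = np.array([[1.0, rho], [rho, 1.0]])

# Mean-field VI (reverse KL) on a Gaussian target has a closed form:
# each factor's precision equals the corresponding diagonal entry of the
# target precision, so its variance is 1 / Lambda_ii -- the conditional variance.
Lambda = np.linalg.inv(Sigma)
vi_var = 1.0 / np.diag(Lambda)  # = 1 - rho**2 per coordinate

print("true marginal variance:", Sigma[0, 0])  # 1.0
print("mean-field VI variance:", vi_var[0])    # ~0.19: badly underestimated
```

The stronger the correlation, the worse the shrinkage: at ρ = 0.99 the reported variance is about 2% of the truth, which is exactly the underestimation pattern flagged in the table.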
Knowing whether your inference has "worked" is as important as running the algorithm. The diagnostic tools differ fundamentally between paradigms.
Deterministic method diagnostics:
Stochastic method diagnostics:
```python
import numpy as np


def vi_diagnostics(elbo_history, params_history, n_restarts=3):
    """
    Diagnostics for variational inference.

    Parameters
    ----------
    elbo_history : list
        ELBO values over iterations
    params_history : list of arrays
        Variational parameters at each iteration
    n_restarts : int
        Number of random restarts to run
    """
    diagnostics = {}

    # 1. Check ELBO convergence: relative fluctuation over the last 100 iterations
    final_elbos = elbo_history[-100:]
    elbo_std = np.std(final_elbos)
    diagnostics['elbo_converged'] = elbo_std < 0.01 * abs(np.mean(final_elbos))

    # 2. Check parameter convergence
    final_params = np.array(params_history[-100:])
    param_std = np.std(final_params, axis=0)
    diagnostics['params_stable'] = np.all(param_std < 0.01)

    # 3. Check gradient (would need access to gradients)
    # diagnostics['grad_norm_small'] = grad_norm < 1e-5

    return diagnostics


def mcmc_diagnostics(chains, param_names=None):
    """
    Diagnostics for MCMC.

    Parameters
    ----------
    chains : ndarray
        Shape (n_chains, n_samples, n_params)
    param_names : list, optional
        Names of parameters
    """
    n_chains, n_samples, n_params = chains.shape
    diagnostics = {}

    # 1. Gelman-Rubin R-hat
    r_hats = []
    for p in range(n_params):
        between = np.var(np.mean(chains[:, :, p], axis=1), ddof=1) * n_samples
        within = np.mean(np.var(chains[:, :, p], axis=1, ddof=1))
        var_est = (n_samples - 1) / n_samples * within + between / n_samples
        r_hat = np.sqrt(var_est / within)
        r_hats.append(r_hat)
    diagnostics['r_hat'] = np.array(r_hats)
    diagnostics['r_hat_ok'] = np.all(np.array(r_hats) < 1.01)

    # 2. Effective sample size (simple autocorrelation truncation)
    def ess(chain):
        n = len(chain)
        mean = np.mean(chain)
        var = np.var(chain)
        if var == 0:
            return n
        acf = []
        for lag in range(1, n // 2):
            c = np.mean((chain[:-lag] - mean) * (chain[lag:] - mean)) / var
            if c < 0.05:
                break
            acf.append(c)
        tau = 1 + 2 * sum(acf)
        return n / tau

    ess_values = []
    for p in range(n_params):
        chain_ess = [ess(chains[c, :, p]) for c in range(n_chains)]
        ess_values.append(sum(chain_ess))
    diagnostics['ess'] = np.array(ess_values)
    diagnostics['ess_ok'] = np.all(np.array(ess_values) > 400)

    return diagnostics


def compare_methods(vi_samples, mcmc_samples, true_mean=None, true_cov=None):
    """Compare VI and MCMC approximations."""
    vi_mean = np.mean(vi_samples, axis=0)
    vi_cov = np.cov(vi_samples.T)
    mcmc_mean = np.mean(mcmc_samples, axis=0)
    mcmc_cov = np.cov(mcmc_samples.T)

    comparison = {
        'mean_diff': np.linalg.norm(vi_mean - mcmc_mean),
        # <1 means VI underestimates total variance
        'cov_trace_ratio': np.trace(vi_cov) / np.trace(mcmc_cov),
    }
    if true_mean is not None:
        comparison['vi_mean_error'] = np.linalg.norm(vi_mean - true_mean)
        comparison['mcmc_mean_error'] = np.linalg.norm(mcmc_mean - true_mean)
    return comparison
```

In practice, the choice often comes down to the speed-accuracy trade-off. Let's make this concrete with typical scenarios.
| Scenario | Recommended | Reasoning |
|---|---|---|
| 10K params, 1M data | SVI | Only SVI scales; mini-batches essential |
| 100 params, 1K data, high accuracy needed | MCMC (HMC/NUTS) | Sufficient scale for MCMC; need exact inference |
| 1K params, 10K data, real-time predictions | Laplace or VI | Need fast inference; can accept approximation |
| GP classification, 5K data | EP | Best calibration for non-Gaussian likelihood |
| Model comparison needed | Laplace or VI | Provide marginal likelihood approximation |
| Neural network uncertainty | VI (last layer) or MC Dropout | Full Bayesian NN too expensive |
| Online/streaming data | ADF or online VI | Can't wait for MCMC to converge |
Runtime Comparison (Empirical Guidelines):
For a typical model with d = 100 parameters and n = 10,000 observations:
| Method | Typical Runtime | Samples/Iterations |
|---|---|---|
| Laplace | ~1 second | 1 optimization |
| Mean-field VI | ~10 seconds | 1000 ELBO iterations |
| Full-covariance VI | ~100 seconds | 1000 ELBO iterations |
| EP | ~30 seconds | 10 EP sweeps |
| MCMC (RW-MH) | ~10 minutes | 10,000 samples |
| MCMC (HMC) | ~5 minutes | 2,000 samples |
| MCMC (NUTS) | ~3 minutes | 1,000 samples |
These are rough approximations; actual times depend heavily on model structure and implementation.
MCMC typically takes 10-100× longer than VI for comparable problems. If you need inference in under 1 second, MCMC is usually not an option. But if you have minutes to hours and need accurate uncertainty, MCMC is often worth the wait.
Modern Bayesian inference often combines deterministic and stochastic methods to leverage the strengths of both. These hybrid approaches are becoming increasingly common in practice.
Strategy 1: VI Warmstart for MCMC
```python
# Step 1: Run VI to get approximate posterior
q_mean, q_cov = run_variational_inference(model, data)

# Step 2: Initialize MCMC at the VI mean
initial_point = q_mean

# Step 3: Use the VI covariance to set per-dimension step sizes
step_sizes = np.sqrt(np.diag(q_cov))  # HMC step sizes

# Step 4: Run MCMC from the warm start
samples = run_hmc(model, data, init=initial_point, step_size=step_sizes)
```
This approach often reduces burn-in from thousands to hundreds of samples.
Strategy 2: Importance Sampling Correction
```python
from scipy.special import softmax

# Draw samples from the VI approximation
theta_samples = q.sample(n_samples)

# Compute importance weights
log_weights = log_posterior(theta_samples) - q.log_prob(theta_samples)
weights = softmax(log_weights)  # self-normalized weights

# Weighted estimates correct for VI bias
weighted_mean = sum(w * theta for w, theta in zip(weights, theta_samples))
```
If VI is close to the true posterior, ESS will be high; if not, weights degenerate but we detect the problem.
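A toy end-to-end version of this strategy (the Gamma target and the moment-matched Gaussian standing in for a "VI fit" are illustrative): draw from q, reweight, and read the ESS off the weights as the built-in failure detector.

```python
import numpy as np
from scipy import stats
from scipy.special import softmax

rng = np.random.default_rng(0)

# Toy target: Gamma(a=5) "posterior" (mean 5); "VI" fit: moment-matched Gaussian
target = stats.gamma(a=5)
q = stats.norm(loc=target.mean(), scale=target.std())

theta = q.rvs(size=5000, random_state=rng)
log_w = target.logpdf(theta) - q.logpdf(theta)  # -inf where theta <= 0 is fine
w = softmax(log_w)                              # normalized weights

is_mean = np.sum(w * theta)                     # self-normalized IS estimate
ess = 1.0 / np.sum(w**2)                        # weight-degeneracy diagnostic
print(f"q mean {q.mean():.3f} -> IS-corrected mean {is_mean:.3f} (true 5.0)")
print(f"ESS = {ess:.0f} of {len(theta)}")
```

Here q matches the target's first two moments, so the ESS stays high and the corrected mean is accurate; had q been badly mismatched, the same ESS line would collapse toward 1 and flag the problem.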
Different applications have different requirements that suggest specific inference methods. Here's guidance organized by application domain:
Recommended: MCMC (NUTS/HMC)
Reasoning:
When to deviate:
We can synthesize the preceding analysis into a practical decision framework. Answer these questions to guide your method selection:
Step 1: Assess computational constraints
Step 2: Assess accuracy requirements
Step 3: Assess model characteristics
When uncertain which method to use: (1) Start with VI—it's fastest and provides a baseline. (2) If VI approximation looks poor, try EP for better calibration. (3) If both fail, invest in MCMC with careful diagnostics. (4) If MCMC fails to mix, revisit model parameterization.
The Practitioner's Checklist:
✓ Run multiple initializations for deterministic methods
✓ Check convergence diagnostics (ELBO plateau, R̂ < 1.01)
✓ Validate uncertainty calibration on held-out data
✓ Compare VI and MCMC when computationally feasible
✓ Document inference choices and their justifications
✓ Be skeptical of overly confident predictions
✓ When in doubt, prefer conservative (wider) uncertainty estimates
The choice between deterministic and stochastic inference methods involves fundamental trade-offs between computational cost, approximation accuracy, and statistical guarantees. Neither approach dominates; the right choice depends on your specific problem and constraints.
What's next:
Having developed a thorough understanding of individual methods and their trade-offs, the final page of this module provides comprehensive guidance on choosing the right approximation method—synthesizing everything into actionable decision procedures for practitioners.
You now understand the fundamental distinction between deterministic and stochastic inference, their computational and statistical properties, and when to use each. This comparative framework will guide your inference method selection across diverse Bayesian modeling problems.