Bayesian inference provides an elegant, principled framework for reasoning under uncertainty. Given observed data $\mathcal{D}$, we update our beliefs about unknown quantities $\theta$ through Bayes' theorem:
$$p(\theta | \mathcal{D}) = \frac{p(\mathcal{D} | \theta) p(\theta)}{p(\mathcal{D})}$$
The posterior distribution $p(\theta | \mathcal{D})$ encapsulates everything we can know about $\theta$ given our prior beliefs and observed evidence. It's mathematically beautiful, philosophically coherent, and—in many practical cases—utterly intractable to compute.
This page confronts this fundamental tension directly. We will understand precisely why exact Bayesian inference fails in practice, what makes computation intractable, and why this motivates the development of variational methods. By the end, you will see variational inference not as an arbitrary approximation, but as a carefully designed solution to a fundamental computational bottleneck.
By completing this page, you will: (1) Understand why exact posterior computation is intractable in most real-world scenarios, (2) Identify the specific computational barriers—especially the marginal likelihood integral, (3) Compare approximation strategies and understand why optimization-based approaches are appealing, and (4) Develop intuition for when approximation is necessary versus when exact methods suffice.
Let's begin by examining exactly what Bayesian inference asks us to compute. Consider a probabilistic model with observed data $\mathbf{x}$, latent variables $\mathbf{z}$, and parameters $\theta$.
The joint distribution factorizes according to our modeling assumptions:
$$p(\mathbf{x}, \mathbf{z}, \theta) = p(\mathbf{x} | \mathbf{z}, \theta) \cdot p(\mathbf{z} | \theta) \cdot p(\theta)$$
To perform inference, we need the posterior distribution:
$$p(\mathbf{z}, \theta | \mathbf{x}) = \frac{p(\mathbf{x}, \mathbf{z}, \theta)}{p(\mathbf{x})}$$
The entire difficulty of Bayesian inference concentrates in the denominator. The numerator—the joint distribution—is typically easy to evaluate: we simply multiply our likelihood and prior terms. But the denominator, the marginal likelihood or "evidence," requires integration over all possible values of the latent variables and parameters. This is where computation becomes intractable.
The evidence (marginal likelihood) is defined as:
$$p(\mathbf{x}) = \int \int p(\mathbf{x}, \mathbf{z}, \theta) \, d\mathbf{z} \, d\theta = \int \int p(\mathbf{x} | \mathbf{z}, \theta) \cdot p(\mathbf{z} | \theta) \cdot p(\theta) \, d\mathbf{z} \, d\theta$$
This integral marginalizes out all latent variables and parameters by summing (or integrating) over every possible configuration they could take. In principle, this is straightforward—we're just computing a normalization constant. In practice, this integral is the source of nearly all computational difficulty in Bayesian inference.
Why is this integral hard?
High dimensionality: If $\mathbf{z}$ has $K$ dimensions and $\theta$ has $D$ dimensions, we integrate over a $(K + D)$-dimensional space. Modern models easily have millions of dimensions.
No closed-form solution: For most interesting models, the integral has no analytical solution. The integrand may be a complex, nonlinear function with no known antiderivative.
Exponential cost of numerical integration: Grid-based numerical integration requires $O(M^{K+D})$ evaluations for $M$ grid points per dimension. With 100 dimensions and 10 grid points, that's $10^{100}$ evaluations—more than atoms in the observable universe.
| Dimensionality | Grid points per dimension | Evaluations required | At 1 μs/evaluation |
|---|---|---|---|
| 2D | 100 | 10⁴ | 0.01 seconds |
| 10D | 100 | 10²⁰ | ≈3 million years |
| 100D | 100 | 10²⁰⁰ | Incomputable |
| 1,000,000D (typical neural net) | Any | Effectively infinite | Completely intractable |
The table above illustrates why direct computation is impossible for all but the simplest models. Even with just 10 latent dimensions, brute-force grid integration is already hopeless, and real machine learning models with millions of parameters are vastly further out of reach.
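To make the arithmetic concrete, here is a minimal back-of-the-envelope script (not part of the original material) that reproduces the table's figures under its stated assumption of one microsecond per evaluation:

```python
# Back-of-the-envelope cost of grid-based integration: M grid points per
# dimension, D dimensions, assuming 1 microsecond per integrand evaluation.
SECONDS_PER_YEAR = 3.15e7

for dims, points in [(2, 100), (10, 100), (100, 100)]:
    evaluations = float(points) ** dims          # M^D evaluations
    seconds = evaluations * 1e-6                 # at 1 microsecond each
    years = seconds / SECONDS_PER_YEAR
    print(f"{dims:>3}D, {points} pts/dim: {evaluations:.0e} evals "
          f"≈ {seconds:.1e} s ≈ {years:.1e} years")
```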
This is the fundamental computational barrier that motivates approximate inference. We cannot compute the exact posterior, so we must find principled ways to approximate it.
Before diving into approximations, let's understand when exact Bayesian inference is possible. Recognizing these special cases illuminates why most models fall outside them.
When the prior is conjugate to the likelihood (for example, when both come from compatible exponential families), the posterior has a closed-form expression in the same family as the prior. The marginal likelihood also has a closed form.
Example: Gaussian-Gaussian Conjugacy
For a Gaussian likelihood with known variance $\sigma^2$ and a Gaussian prior on the mean:
$$\begin{aligned} \text{Prior: } \mu &\sim \mathcal{N}(\mu_0, \tau_0^2) \\ \text{Likelihood: } x_i \mid \mu &\sim \mathcal{N}(\mu, \sigma^2) \end{aligned}$$
The posterior is: $$\mu | \mathbf{x} \sim \mathcal{N}\left( \frac{\frac{\mu_0}{\tau_0^2} + \frac{\sum_i x_i}{\sigma^2}}{\frac{1}{\tau_0^2} + \frac{N}{\sigma^2}}, \left( \frac{1}{\tau_0^2} + \frac{N}{\sigma^2} \right)^{-1} \right)$$
This is exact, elegant, and requires no approximation. The integral "works out" because Gaussians multiplied by Gaussians yield Gaussians.
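As a quick sanity check, a small script (a sketch with made-up data and hyperparameters, not from the original text) can compute this conjugate update in closed form and compare it against brute-force normalization on a grid:

```python
import numpy as np

# Hypothetical data and hyperparameters for the Normal-Normal example
rng = np.random.default_rng(0)
mu_true, sigma, mu0, tau0 = 1.5, 2.0, 0.0, 3.0
x = rng.normal(mu_true, sigma, size=50)
N = len(x)

# Closed-form conjugate posterior over the mean
precision = 1 / tau0**2 + N / sigma**2
post_var = 1 / precision
post_mean = post_var * (mu0 / tau0**2 + x.sum() / sigma**2)

# Brute-force check: normalize likelihood * prior on a dense grid
mu_grid = np.linspace(-10, 10, 20001)
dz = mu_grid[1] - mu_grid[0]
log_post = (-0.5 * ((x[:, None] - mu_grid) / sigma)**2).sum(axis=0) \
           - 0.5 * ((mu_grid - mu0) / tau0)**2
post = np.exp(log_post - log_post.max())
post /= post.sum() * dz                      # normalize numerically
grid_mean = np.sum(mu_grid * post) * dz

print(f"closed form: mean={post_mean:.4f}, var={post_var:.4f}")
print(f"grid check:  mean={grid_mean:.4f}")
```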
Common conjugate pairs include: Beta-Binomial, Dirichlet-Multinomial, Gamma-Poisson, Normal-Normal, and Normal-Inverse-Gamma. These are powerful when applicable, but they cover only a tiny fraction of useful models. The moment you need a neural network likelihood or a complex hierarchical structure, conjugacy breaks down.
When latent variables are discrete and take finitely many values, the integral becomes a sum:
$$p(\mathbf{x}) = \sum_{\mathbf{z} \in \mathcal{Z}} p(\mathbf{x}, \mathbf{z})$$
If $|\mathcal{Z}|$ is small enough, we can enumerate all configurations.
Example: Hidden Markov Model with 3 states, 10 time steps
Brute-force enumeration requires $3^{10} = 59,049$ terms—tractable with dynamic programming (the forward algorithm reduces this to $O(3^2 \times 10) = 90$ operations).
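A minimal forward-algorithm sketch, using an illustrative 3-state HMM whose transition and emission probabilities are made up for this example, shows how the sum over all state paths collapses into the $O(K^2 T)$ recursion:

```python
import numpy as np

def forward_log_likelihood(log_pi, log_A, log_B, obs):
    """Forward algorithm: p(x_{1:T}) summed over all K^T state paths,
    computed in O(K^2 T) time via dynamic programming."""
    alpha = log_pi + log_B[:, obs[0]]               # log p(z_1, x_1)
    for t in range(1, len(obs)):
        # log-sum-exp over the previous state, for each current state
        alpha = log_B[:, obs[t]] + np.logaddexp.reduce(
            alpha[:, None] + log_A, axis=0)
    return np.logaddexp.reduce(alpha)               # log p(x_{1:T})

# Illustrative 3-state HMM with binary observations (values are made up)
pi = np.array([0.5, 0.3, 0.2])
A = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7]])
B = np.array([[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]])
obs = [0, 0, 1, 1, 0, 1, 1, 1, 0, 0]                # 10 time steps

print(forward_log_likelihood(np.log(pi), np.log(A), np.log(B), obs))
```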
Multivariate Gaussian models remain tractable because Gaussians are closed under marginalization, conditioning, and affine transformations, so every integral we need reduces to linear algebra.
Gaussian Process Regression exploits this: for $N$ data points with a Gaussian process prior and a Gaussian likelihood, exact inference costs $O(N^3)$ for the matrix inversion (in practice, a Cholesky factorization).
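For concreteness, here is a compact sketch of exact GP regression with an RBF kernel on toy data (the kernel choice, data, and hyperparameters are illustrative assumptions, not from the text); the Cholesky factorization of the $N \times N$ matrix is the $O(N^3)$ step:

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    d2 = (X1[:, None] - X2[None, :])**2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

# Toy 1D regression problem (made up for illustration)
rng = np.random.default_rng(1)
X = np.linspace(0, 5, 40)
y = np.sin(X) + 0.1 * rng.standard_normal(40)
X_star = np.linspace(0, 5, 100)
noise = 0.1**2

# Exact posterior: the Cholesky factorization of the N x N matrix is O(N^3)
K = rbf_kernel(X, X) + noise * np.eye(len(X))
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))

K_star = rbf_kernel(X_star, X)
mean = K_star @ alpha                                    # posterior mean
v = np.linalg.solve(L, K_star.T)
var = np.diag(rbf_kernel(X_star, X_star)) - np.sum(v**2, axis=0)  # posterior variance

print(f"posterior mean at x=2.5: {mean[50]:.3f}, variance: {var[50]:.4f}")
```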
The message is clear: tractable exact inference is the exception, not the rule. The models that drive modern machine learning—deep generative models, Bayesian neural networks, topic models, probabilistic programs—all require approximate inference. This isn't a failure of Bayesian methods; it's simply the computational reality of working with expressive probabilistic models.
To design effective approximations, we must understand precisely where and why computation fails. Intractability arises from several distinct sources, often in combination.
The most direct source of intractability is the evidence $p(\mathbf{x})$ appearing in Bayes' rule. Without it, we cannot normalize the posterior.
However, some tasks don't require the normalization constant: finding the posterior mode (MAP estimation), comparing the relative posterior probability of two parameter settings, and Metropolis-Hastings sampling all work with the unnormalized joint $p(\mathbf{x}, \mathbf{z})$.
But many tasks do require proper normalization: computing posterior expectations and credible intervals, evaluating the marginal likelihood for model comparison, and forming predictive distributions all depend on the fully normalized posterior.
```python
import numpy as np
from scipy.integrate import quad, dblquad
import time

def demonstrate_intractability():
    """
    Demonstrate how integration cost explodes with dimensionality.
    Even for simple Gaussian integrands, high dimensions are impossible.
    """
    # 1D Gaussian integral: trivial
    start = time.time()
    result_1d, _ = quad(lambda x: np.exp(-x**2), -10, 10)
    print(f"1D integral: {result_1d:.6f}, Time: {time.time() - start:.6f}s")

    # 2D Gaussian integral: still easy
    start = time.time()
    result_2d, _ = dblquad(
        lambda x, y: np.exp(-x**2 - y**2),
        -10, 10, -10, 10
    )
    print(f"2D integral: {result_2d:.6f}, Time: {time.time() - start:.6f}s")

    # 10D Gaussian: Monte Carlo (importance sampling) estimate
    # with a standard normal proposal q(z) = N(0, I).
    start = time.time()
    n_samples = 100000
    samples = np.random.randn(n_samples, 10)
    # Importance weight = integrand / proposal density
    #   = exp(-||z||^2) / [(2*pi)^(-5) * exp(-||z||^2 / 2)]
    weights = (2 * np.pi)**5 * np.exp(-0.5 * np.sum(samples**2, axis=1))
    result_10d = np.mean(weights)
    print(f"10D integral (MC estimate): ~{result_10d:.4f}, Time: {time.time() - start:.4f}s")
    print(f"  (Note: MC variance is high; exact answer is pi^5 ≈ {np.pi**5:.4f})")

    # 1000D: even Monte Carlo struggles
    print("\n1000D integral: cannot be computed reliably with any generic method")
    print("This is why we need variational inference!")

# The posterior often looks like:
#   p(θ | x) ∝ likelihood(x | θ) × prior(θ)
#
# For a neural network:
#   - θ has millions of dimensions
#   - the likelihood is a complex nonlinear function
#   - no closed-form integral exists
#   - MC methods have enormous variance in high dimensions

demonstrate_intractability()
```

Even when individual integrals might be tractable, dependencies between variables can make the joint integral intractable.
Consider a hierarchical model:
$$\begin{aligned} \theta &\sim p(\theta) \\ z_n \mid \theta &\sim p(z_n \mid \theta) \quad \text{for } n = 1, \ldots, N \\ x_n \mid z_n &\sim p(x_n \mid z_n) \end{aligned}$$
The posterior is: $$p(\theta, \mathbf{z} | \mathbf{x}) = \frac{p(\theta) \prod_{n=1}^N p(z_n | \theta) p(x_n | z_n)}{p(\mathbf{x})}$$
Even if $p(z_n | \theta, x_n)$ has a nice form for fixed $\theta$, the variables $(\theta, z_1, \ldots, z_N)$ are coupled through the prior $p(z_n | \theta)$. Marginalizing out $\theta$ requires integrating over all ways $\theta$ could have generated the $z_n$'s—an intractable sum of integrals.
Classification presents a canonical example. For logistic regression:
$$p(y | \mathbf{x}, \mathbf{w}) = \sigma(\mathbf{w}^T \mathbf{x})^y (1 - \sigma(\mathbf{w}^T \mathbf{x}))^{1-y}$$
With a Gaussian prior on $\mathbf{w}$:
$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} | 0, \Sigma)$$
The posterior $p(\mathbf{w} | \mathcal{D})$ has no closed form. The sigmoid likelihood breaks conjugacy—we cannot analytically integrate out $\mathbf{w}$.
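To make the non-conjugacy concrete, the following sketch (toy data and an illustrative prior variance, not from the original page) shows that the unnormalized log posterior is trivial to evaluate pointwise, while the evidence would require integrating that same expression over all of $\mathbf{w}$:

```python
import numpy as np

# Toy logistic regression dataset (made up for illustration)
rng = np.random.default_rng(2)
X = rng.standard_normal((200, 5))
w_true = rng.standard_normal(5)
y = (rng.random(200) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

def log_joint(w, X, y, prior_var=1.0):
    """log p(y, w | X) = Bernoulli log-likelihood + Gaussian log prior
    (up to a constant). Easy to evaluate; impossible to integrate analytically."""
    logits = X @ w
    log_lik = np.sum(y * logits - np.log1p(np.exp(logits)))
    log_prior = -0.5 * np.sum(w**2) / prior_var
    return log_lik + log_prior

print(log_joint(np.zeros(5), X, y))   # evaluating the numerator is trivial
# The denominator p(y | X) = integral of exp(log_joint(w)) dw over all w
# has no closed form: the sigmoid likelihood breaks conjugacy.
```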
Neural network likelihoods are the most extreme example. A network $f_\theta(\mathbf{x})$ defines $p(\mathbf{y} | \mathbf{x}, \theta)$ through composition of nonlinear transformations. There is no hope of analytical integration over $\theta$. Bayesian deep learning is only possible through approximation—either variational methods or sampling.
| Source | Description | Example |
|---|---|---|
| Normalization | Cannot compute $p(\mathbf{x}) = \int p(\mathbf{x}, \mathbf{z}) d\mathbf{z}$ | Any non-conjugate model |
| High dimensionality | Integration over $D$ dimensions costs $O(M^D)$ | Bayesian neural networks ($D > 10^6$) |
| Variable coupling | Dependencies prevent factored integration | Hierarchical Bayesian models |
| Non-conjugacy | Likelihood form doesn't match prior | Logistic regression, any classification |
| Combinatorial explosion | Discrete variables with large state space | Latent sequence models, topic models |
Given that exact inference is impossible for most interesting models, how do we proceed? Approximate inference methods fall into two broad categories:
Idea: Instead of computing the posterior analytically, draw samples from it and use these samples to estimate quantities of interest.
Core methods: Markov chain Monte Carlo (Metropolis-Hastings, Gibbs sampling, Hamiltonian Monte Carlo) and importance sampling.
Advantages: asymptotically exact given enough samples, and makes no assumptions about the shape of the posterior.
Disadvantages: can be slow to converge, convergence is hard to diagnose, and it scales poorly to large datasets and very high-dimensional models.
Idea: Approximate the intractable posterior with a tractable distribution from a specified family. Find the member of this family closest to the true posterior.
Core methods: variational inference (including stochastic and black-box variants), expectation propagation, and the Laplace approximation.
Advantages: fast, deterministic, scales to large datasets through stochastic optimization, and integrates naturally with GPUs and automatic differentiation.
Disadvantages: the approximation is biased by the choice of family, it typically underestimates posterior variance, and the quality of the fit can be hard to assess.
In modern machine learning, variational methods have largely displaced MCMC for training large-scale models. The ability to use stochastic gradient descent, GPU acceleration, and mini-batching makes VI practical for datasets with millions of examples and models with millions of parameters. MCMC remains valuable for smaller problems where exact uncertainty quantification is critical, but VI powers most practical Bayesian deep learning.
The key insight of variational inference is transformative: we convert an integration problem into an optimization problem.
This isn't just a technical maneuver—it's a fundamental reframing that unlocks the entire optimization toolkit for inference:
$$p(\mathbf{z} | \mathbf{x}) = \frac{p(\mathbf{x}, \mathbf{z})}{\int p(\mathbf{x}, \mathbf{z}') d\mathbf{z}'}$$
Computing this requires evaluating a potentially infinite-dimensional integral.
$$q^*(\mathbf{z}) = \arg\min_{q \in \mathcal{Q}} D_{\text{KL}}(q(\mathbf{z}) \,\|\, p(\mathbf{z} \mid \mathbf{x}))$$
Find the distribution $q$ in a tractable family $\mathcal{Q}$ that is closest to the true posterior.
Why is optimization easier?
Iterative improvement: Optimization algorithms make incremental progress. Each step improves the current solution. Integration must consider all possible values simultaneously.
Gradient-based methods: If we can compute gradients of our objective with respect to variational parameters, we can use gradient descent—the workhorse of modern ML.
Stochastic approximation: Mini-batch gradients allow training on massive datasets without full passes over data.
GPU acceleration: Matrix operations in neural network-based variational families parallelize naturally.
Compositional structure: Modern auto-differentiation (PyTorch, JAX) computes gradients through arbitrarily complex models automatically.
```python
import numpy as np
from scipy.optimize import minimize

# True target distribution (1D for visualization)
def log_target(z):
    """Log of a complex multi-modal distribution (unnormalized)."""
    return np.log(0.3 * np.exp(-0.5 * (z - 2)**2)
                  + 0.7 * np.exp(-0.5 * (z + 1)**2 / 0.5))

# Integration approach: numerical quadrature
def integration_approach():
    """Compute the normalizing constant by brute-force numerical integration."""
    z_grid = np.linspace(-5, 5, 10000)
    p_unnorm = np.exp(log_target(z_grid))
    Z = np.sum(p_unnorm) * (z_grid[1] - z_grid[0])   # numerical integration
    return Z, z_grid, p_unnorm / Z

# Optimization approach: variational approximation
def variational_approach():
    """Approximate the target with a Gaussian by optimizing a sampled ELBO."""
    # Fixed noise draws ("common random numbers") keep the objective
    # deterministic, so the L-BFGS-B optimizer behaves well.
    n_samples = 100
    epsilon = np.random.default_rng(0).standard_normal(n_samples)

    def neg_elbo(params):
        """Negative ELBO (we minimize this)."""
        mu, log_sigma = params
        sigma = np.exp(log_sigma)
        z = mu + sigma * epsilon                      # reparameterization
        # ELBO = E_q[log p(z)] - E_q[log q(z)]
        log_p = log_target(z)
        log_q = -0.5 * epsilon**2 - log_sigma - 0.5 * np.log(2 * np.pi)
        return -np.mean(log_p - log_q)                # negative because we minimize

    result = minimize(neg_elbo, x0=[0.0, 0.0], method='L-BFGS-B')
    mu_opt, log_sigma_opt = result.x
    return mu_opt, np.exp(log_sigma_opt)

# Compare approaches
Z, z_grid, p_true = integration_approach()
mu_vi, sigma_vi = variational_approach()
print(f"Evidence by quadrature: Z = {Z:.4f}")
print(f"Variational fit: mu = {mu_vi:.3f}, sigma = {sigma_vi:.3f}")

# The key insight: integration required O(10000) grid evaluations.
# Optimization required O(100) iterations with O(100) samples each.
# In high dimensions, integration cost explodes; optimization stays manageable.
```

Variational inference trades bias for variance: the approximation is biased whenever the true posterior lies outside the chosen family, but its estimates are deterministic and low-variance, whereas sampling-based estimates are asymptotically unbiased yet noisy for any finite sample budget.
In high-dimensional problems, the variance reduction from VI often outweighs the bias, making it the practical choice for modern machine learning.
| Scenario | Favors VI | Favors MCMC |
|---|---|---|
| Millions of parameters | ✅ | |
| Large datasets (N > 10,000) | ✅ | |
| Need for fast iteration | ✅ | |
| Posterior uncertainty critical | | ✅ |
| Multi-modal posteriors | | ✅ (usually) |
| Computational budget limited | ✅ | |
| Interpretable uncertainty | | ✅ |
The name "variational inference" comes from the calculus of variations, a branch of mathematics concerned with finding functions that optimize functionals (functions of functions). This connection is deep and illuminating.
Variational principles pervade theoretical physics:
Principle of Least Action: A physical system evolves along the trajectory that minimizes the action integral: $$S[q(t)] = \int_{t_0}^{t_1} L(q, \dot{q}, t) \, dt$$
The actual trajectory satisfies the Euler-Lagrange equations—found by "varying" the function $q(t)$ and setting the variation of $S$ to zero.
Rayleigh-Ritz Method: Approximate the ground state of a quantum system by minimizing energy over a parameterized family of wave functions.
The link to machine learning comes through statistical mechanics. The variational free energy bounds the true free energy:
$$F \leq F_q = \langle E \rangle_q - TS_q$$
where $E$ is energy, $T$ is temperature, and $S_q$ is the entropy of distribution $q$. Minimizing $F_q$ over $q$ gives the best approximation to the Boltzmann distribution.
This is exactly the structure of variational inference: the true posterior plays the role of the Boltzmann distribution, the approximating distribution $q$ plays the role of the trial distribution, and the variational free energy plays the role of the objective we minimize.
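Concretely, with the standard identification $E(\mathbf{z}) = -\log p(\mathbf{x}, \mathbf{z})$ and $T = 1$ (a one-line check, not part of the original text):

$$F_q = \mathbb{E}_q[-\log p(\mathbf{x}, \mathbf{z})] - S_q = -\big(\mathbb{E}_q[\log p(\mathbf{x}, \mathbf{z})] - \mathbb{E}_q[\log q(\mathbf{z})]\big),$$

so minimizing the variational free energy is the same as maximizing the ELBO introduced below.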
Variational principles recur across science because they express fundamental optimization structures. In physics, nature minimizes action. In statistics, we minimize divergence from the truth. In machine learning, we maximize the ELBO. The mathematical framework is the same—only the interpretation changes.
In contemporary machine learning, we typically present variational inference in terms of KL divergence, the evidence lower bound (ELBO), and gradient-based optimization rather than free energies.
This framing emphasizes the optimization perspective and aligns VI with standard machine learning practice. The physics origins are intellectually satisfying but not necessary for practical application.
Key insight for practitioners: Variational inference is just optimization. If you understand gradient descent and probabilistic modeling, you understand 90% of VI. The remaining 10% is the mathematics that connects the ELBO to the posterior approximation—which we develop in the following pages.
Now that we understand why approximation is necessary and why optimization-based approaches are attractive, let's preview the complete variational inference framework you'll learn in this module.
Variational inference proceeds in four steps:
Step 1: Choose a Variational Family $\mathcal{Q}$
Select a family of distributions parameterized by $\phi$: $\{ q_\phi(\mathbf{z}) : \phi \in \Phi \}$
This family should be flexible enough to approximate the true posterior closely, yet simple enough that sampling from $q_\phi$ and evaluating its density (and hence the ELBO) remain tractable.
Step 2: Define the Objective (ELBO)
We maximize the Evidence Lower Bound: $$\mathcal{L}(\phi) = \mathbb{E}_{q_\phi}[\log p(\mathbf{x}, \mathbf{z})] - \mathbb{E}_{q_\phi}[\log q_\phi(\mathbf{z})]$$
The ELBO lower-bounds $\log p(\mathbf{x})$ and equals it when $q_\phi = p(\mathbf{z} | \mathbf{x})$.
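The bound itself follows in one line from Jensen's inequality (a standard derivation, sketched here for completeness):

$$\log p(\mathbf{x}) = \log \int q_\phi(\mathbf{z}) \frac{p(\mathbf{x}, \mathbf{z})}{q_\phi(\mathbf{z})} \, d\mathbf{z} \;\geq\; \int q_\phi(\mathbf{z}) \log \frac{p(\mathbf{x}, \mathbf{z})}{q_\phi(\mathbf{z})} \, d\mathbf{z} = \mathcal{L}(\phi),$$

and the gap is exactly $D_{\text{KL}}(q_\phi(\mathbf{z}) \,\|\, p(\mathbf{z} \mid \mathbf{x}))$, which is why maximizing the ELBO and minimizing the KL divergence are the same problem.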
Step 3: Optimize
Use gradient ascent on $\mathcal{L}(\phi)$: $$\phi \leftarrow \phi + \eta \nabla_\phi \mathcal{L}(\phi)$$
Step 4: Use the Approximate Posterior
Once optimized, $q_{\phi^*}(\mathbf{z})$ serves as a stand-in for the true posterior in downstream tasks.
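For example, posterior expectations reduce to Monte Carlo averages over $q_{\phi^*}$. A minimal sketch (the variational parameter values and the function `f` below are hypothetical placeholders, not from the original text):

```python
import numpy as np

# Suppose optimization returned these variational parameters (illustrative values)
mu_vi, sigma_vi = 0.1, 1.2

# Downstream use: estimate E_{p(z|x)}[f(z)] with samples from q instead
rng = np.random.default_rng(3)
z_samples = mu_vi + sigma_vi * rng.standard_normal(10_000)
f = lambda z: z**2              # any quantity of interest
estimate = f(z_samples).mean()  # ≈ E_q[f(z)] ≈ E_{p(z|x)}[f(z)]
print(estimate)
```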
You now understand why variational inference exists: exact Bayesian inference is intractable for most useful models, and optimization-based approximation provides a practical path forward. The next page introduces the variational family—the set of tractable distributions we'll optimize over to approximate intractable posteriors.