Bayesian inference provides an elegant, principled framework for reasoning under uncertainty. Given observed data $\mathcal{D}$, we update our beliefs about unknown quantities $\theta$ through Bayes' theorem:
$$p(\theta | \mathcal{D}) = \frac{p(\mathcal{D} | \theta) p(\theta)}{p(\mathcal{D})}$$
The posterior distribution $p(\theta | \mathcal{D})$ encapsulates everything we can know about $\theta$ given our prior beliefs and observed evidence. It's mathematically beautiful, philosophically coherent, and—in many practical cases—utterly intractable to compute.
This page confronts this fundamental tension directly. We will understand precisely why exact Bayesian inference fails in practice, what makes computation intractable, and why this motivates the development of variational methods. By the end, you will see variational inference not as an arbitrary approximation, but as a carefully designed solution to a fundamental computational bottleneck.
By completing this page, you will: (1) Understand why exact posterior computation is intractable in most real-world scenarios, (2) Identify the specific computational barriers—especially the marginal likelihood integral, (3) Compare approximation strategies and understand why optimization-based approaches are appealing, and (4) Develop intuition for when approximation is necessary versus when exact methods suffice.
Let's begin by examining exactly what Bayesian inference asks us to compute. Consider a probabilistic model with observed data $\mathbf{x}$, latent variables $\mathbf{z}$, and parameters $\theta$.
The joint distribution factorizes according to our modeling assumptions:
$$p(\mathbf{x}, \mathbf{z}, \theta) = p(\mathbf{x} | \mathbf{z}, \theta) \cdot p(\mathbf{z} | \theta) \cdot p(\theta)$$
To perform inference, we need the posterior distribution:
$$p(\mathbf{z}, \theta | \mathbf{x}) = \frac{p(\mathbf{x}, \mathbf{z}, \theta)}{p(\mathbf{x})}$$
The entire difficulty of Bayesian inference concentrates in the denominator. The numerator—the joint distribution—is typically easy to evaluate: we simply multiply our likelihood and prior terms. But the denominator, the marginal likelihood or "evidence," requires integration over all possible values of the latent variables and parameters. This is where computation becomes intractable.
The evidence (marginal likelihood) is defined as:
$$p(\mathbf{x}) = \int \int p(\mathbf{x}, \mathbf{z}, \theta) \, d\mathbf{z} \, d\theta = \int \int p(\mathbf{x} | \mathbf{z}, \theta) \cdot p(\mathbf{z} | \theta) \cdot p(\theta) \, d\mathbf{z} \, d\theta$$
This integral marginalizes out all latent variables and parameters by summing (or integrating) over every possible configuration they could take. In principle, this is straightforward—we're just computing a normalization constant. In practice, this integral is the source of nearly all computational difficulty in Bayesian inference.
Why is this integral hard?
High dimensionality: If $\mathbf{z}$ has $K$ dimensions and $\theta$ has $D$ dimensions, we integrate over a $(K + D)$-dimensional space. Modern models easily have millions of dimensions.
No closed-form solution: For most interesting models, the integral has no analytical solution. The integrand may be a complex, nonlinear function with no known antiderivative.
Exponential cost of numerical integration: Grid-based numerical integration requires $O(M^{K+D})$ evaluations for $M$ grid points per dimension. With 100 dimensions and 10 grid points, that's $10^{100}$ evaluations—more than atoms in the observable universe.
| Dimensionality | Grid points per dimension | Evaluations required | At 1 μs/evaluation |
|---|---|---|---|
| 2D | 100 | 10⁴ | 0.01 seconds |
| 10D | 100 | 10²⁰ | ≈3 million years |
| 100D | 100 | 10²⁰⁰ | Incomputable |
| 1,000,000D (typical neural net) | Any | Effectively infinite | Completely intractable |
The table above illustrates why direct computation is impossible for all but the simplest models. Even with just 10 latent dimensions, brute-force grid integration is already hopeless, and real machine learning models with millions of parameters are vastly further out of reach.
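To make the arithmetic concrete, here is a minimal back-of-the-envelope script (not part of the original material) that reproduces the table's figures under its stated assumption of one microsecond per evaluation:

```python
# Back-of-the-envelope cost of grid-based integration: M grid points per
# dimension, D dimensions, assuming 1 microsecond per integrand evaluation.
SECONDS_PER_YEAR = 3.15e7

for dims, points in [(2, 100), (10, 100), (100, 100)]:
    evaluations = float(points) ** dims          # M^D evaluations
    seconds = evaluations * 1e-6                 # at 1 microsecond each
    years = seconds / SECONDS_PER_YEAR
    print(f"{dims:>3}D, {points} pts/dim: {evaluations:.0e} evals "
          f"≈ {seconds:.1e} s ≈ {years:.1e} years")
```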
This is the fundamental computational barrier that motivates approximate inference. We cannot compute the exact posterior, so we must find principled ways to approximate it.
Before diving into approximations, let's understand when exact Bayesian inference is possible. Recognizing these special cases illuminates why most models fall outside them.
When the prior is conjugate to the likelihood (for example, when both come from compatible exponential families), the posterior has a closed-form expression in the same family as the prior. The marginal likelihood also has a closed form.
Example: Gaussian-Gaussian Conjugacy
For a Gaussian likelihood with known variance $\sigma^2$ and a Gaussian prior on the mean:
$$\begin{aligned} \text{Prior: } \mu &\sim \mathcal{N}(\mu_0, \tau_0^2) \\ \text{Likelihood: } x_i \mid \mu &\sim \mathcal{N}(\mu, \sigma^2) \end{aligned}$$
The posterior is: $$\mu | \mathbf{x} \sim \mathcal{N}\left( \frac{\frac{\mu_0}{\tau_0^2} + \frac{\sum_i x_i}{\sigma^2}}{\frac{1}{\tau_0^2} + \frac{N}{\sigma^2}}, \left( \frac{1}{\tau_0^2} + \frac{N}{\sigma^2} \right)^{-1} \right)$$
This is exact, elegant, and requires no approximation. The integral "works out" because Gaussians multiplied by Gaussians yield Gaussians.
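As a quick sanity check, a small script (a sketch with made-up data and hyperparameters, not from the original text) can compute this conjugate update in closed form and compare it against brute-force normalization on a grid:

```python
import numpy as np

# Hypothetical data and hyperparameters for the Normal-Normal example
rng = np.random.default_rng(0)
mu_true, sigma, mu0, tau0 = 1.5, 2.0, 0.0, 3.0
x = rng.normal(mu_true, sigma, size=50)
N = len(x)

# Closed-form conjugate posterior over the mean
precision = 1 / tau0**2 + N / sigma**2
post_var = 1 / precision
post_mean = post_var * (mu0 / tau0**2 + x.sum() / sigma**2)

# Brute-force check: normalize likelihood * prior on a dense grid
mu_grid = np.linspace(-10, 10, 20001)
dz = mu_grid[1] - mu_grid[0]
log_post = (-0.5 * ((x[:, None] - mu_grid) / sigma)**2).sum(axis=0) \
           - 0.5 * ((mu_grid - mu0) / tau0)**2
post = np.exp(log_post - log_post.max())
post /= post.sum() * dz                      # normalize numerically
grid_mean = np.sum(mu_grid * post) * dz

print(f"closed form: mean={post_mean:.4f}, var={post_var:.4f}")
print(f"grid check:  mean={grid_mean:.4f}")
```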
Common conjugate pairs include: Beta-Binomial, Dirichlet-Multinomial, Gamma-Poisson, Normal-Normal, and Normal-Inverse-Gamma. These are powerful when applicable, but they cover only a tiny fraction of useful models. The moment you need a neural network likelihood or a complex hierarchical structure, conjugacy breaks down.
When latent variables are discrete and take finitely many values, the integral becomes a sum:
$$p(\mathbf{x}) = \sum_{\mathbf{z} \in \mathcal{Z}} p(\mathbf{x}, \mathbf{z})$$
If $|\mathcal{Z}|$ is small enough, we can enumerate all configurations.
Example: Hidden Markov Model with 3 states, 10 time steps
Brute-force enumeration requires $3^{10} = 59,049$ terms—tractable with dynamic programming (the forward algorithm reduces this to $O(3^2 \times 10) = 90$ operations).
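A minimal forward-algorithm sketch, using an illustrative 3-state HMM whose transition and emission probabilities are made up for this example, shows how the sum over all state paths collapses into the $O(K^2 T)$ recursion:

```python
import numpy as np

def forward_log_likelihood(log_pi, log_A, log_B, obs):
    """Forward algorithm: p(x_{1:T}) summed over all K^T state paths,
    computed in O(K^2 T) time via dynamic programming."""
    alpha = log_pi + log_B[:, obs[0]]               # log p(z_1, x_1)
    for t in range(1, len(obs)):
        # log-sum-exp over the previous state, for each current state
        alpha = log_B[:, obs[t]] + np.logaddexp.reduce(
            alpha[:, None] + log_A, axis=0)
    return np.logaddexp.reduce(alpha)               # log p(x_{1:T})

# Illustrative 3-state HMM with binary observations (values are made up)
pi = np.array([0.5, 0.3, 0.2])
A = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7]])
B = np.array([[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]])
obs = [0, 0, 1, 1, 0, 1, 1, 1, 0, 0]                # 10 time steps

print(forward_log_likelihood(np.log(pi), np.log(A), np.log(B), obs))
```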
Multivariate Gaussian models remain tractable because Gaussians are closed under marginalization, conditioning, and affine transformations, so every integral we need reduces to linear algebra.
Gaussian Process Regression exploits this: for $N$ data points with a Gaussian process prior and a Gaussian likelihood, exact inference costs $O(N^3)$ for the matrix inversion (in practice, a Cholesky factorization).
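For concreteness, here is a compact sketch of exact GP regression with an RBF kernel on toy data (the kernel choice, data, and hyperparameters are illustrative assumptions, not from the text); the Cholesky factorization of the $N \times N$ matrix is the $O(N^3)$ step:

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    d2 = (X1[:, None] - X2[None, :])**2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

# Toy 1D regression problem (made up for illustration)
rng = np.random.default_rng(1)
X = np.linspace(0, 5, 40)
y = np.sin(X) + 0.1 * rng.standard_normal(40)
X_star = np.linspace(0, 5, 100)
noise = 0.1**2

# Exact posterior: the Cholesky factorization of the N x N matrix is O(N^3)
K = rbf_kernel(X, X) + noise * np.eye(len(X))
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))

K_star = rbf_kernel(X_star, X)
mean = K_star @ alpha                                    # posterior mean
v = np.linalg.solve(L, K_star.T)
var = np.diag(rbf_kernel(X_star, X_star)) - np.sum(v**2, axis=0)  # posterior variance

print(f"posterior mean at x=2.5: {mean[50]:.3f}, variance: {var[50]:.4f}")
```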
The message is clear: tractable exact inference is the exception, not the rule. The models that drive modern machine learning—deep generative models, Bayesian neural networks, topic models, probabilistic programs—all require approximate inference. This isn't a failure of Bayesian methods; it's simply the computational reality of working with expressive probabilistic models.
To design effective approximations, we must understand precisely where and why computation fails. Intractability arises from several distinct sources, often in combination.
The most direct source of intractability is the evidence $p(\mathbf{x})$ appearing in Bayes' rule. Without it, we cannot normalize the posterior.
However, some tasks don't require the normalization constant: finding the posterior mode (MAP estimation), comparing the relative posterior probability of two parameter settings, and Metropolis-Hastings sampling all work with the unnormalized joint $p(\mathbf{x}, \mathbf{z})$.
But many tasks do require proper normalization: computing posterior expectations and credible intervals, evaluating the marginal likelihood for model comparison, and forming predictive distributions all depend on the fully normalized posterior.
```python
import numpy as np
from scipy.integrate import quad, dblquad
import time

def demonstrate_intractability():
    """
    Demonstrate how integration cost explodes with dimensionality.
    Even for simple Gaussian integrands, high dimensions are impossible.
    """
    # 1D Gaussian integral: trivial
    start = time.time()
    result_1d, _ = quad(lambda x: np.exp(-x**2), -10, 10)
    print(f"1D integral: {result_1d:.6f}, Time: {time.time() - start:.6f}s")

    # 2D Gaussian integral: still easy
    start = time.time()
    result_2d, _ = dblquad(
        lambda x, y: np.exp(-x**2 - y**2),
        -10, 10, -10, 10
    )
    print(f"2D integral: {result_2d:.6f}, Time: {time.time() - start:.6f}s")

    # 10D Gaussian: Monte Carlo (importance sampling) estimate
    # with a standard normal proposal q(z) = N(0, I).
    start = time.time()
    n_samples = 100000
    samples = np.random.randn(n_samples, 10)
    # Importance weight = integrand / proposal density
    #   = exp(-||z||^2) / [(2*pi)^(-5) * exp(-||z||^2 / 2)]
    weights = (2 * np.pi)**5 * np.exp(-0.5 * np.sum(samples**2, axis=1))
    result_10d = np.mean(weights)
    print(f"10D integral (MC estimate): ~{result_10d:.4f}, Time: {time.time() - start:.4f}s")
    print(f"  (Note: MC variance is high; exact answer is pi^5 ≈ {np.pi**5:.4f})")

    # 1000D: even Monte Carlo struggles
    print("\n1000D integral: cannot be computed reliably with any generic method")
    print("This is why we need variational inference!")

# The posterior often looks like:
#   p(θ | x) ∝ likelihood(x | θ) × prior(θ)
#
# For a neural network:
#   - θ has millions of dimensions
#   - the likelihood is a complex nonlinear function
#   - no closed-form integral exists
#   - MC methods have enormous variance in high dimensions

demonstrate_intractability()
```

Even when individual integrals might be tractable, dependencies between variables can make the joint integral intractable.
Consider a hierarchical model:
$$\begin{aligned} \theta &\sim p(\theta) \\ z_n \mid \theta &\sim p(z_n \mid \theta) \quad \text{for } n = 1, \ldots, N \\ x_n \mid z_n &\sim p(x_n \mid z_n) \end{aligned}$$
The posterior is: $$p(\theta, \mathbf{z} | \mathbf{x}) = \frac{p(\theta) \prod_{n=1}^N p(z_n | \theta) p(x_n | z_n)}{p(\mathbf{x})}$$
Even if $p(z_n | \theta, x_n)$ has a nice form for fixed $\theta$, the variables $(\theta, z_1, \ldots, z_N)$ are coupled through the prior $p(z_n | \theta)$. Marginalizing out $\theta$ requires integrating over all ways $\theta$ could have generated the $z_n$'s—an intractable sum of integrals.
Classification presents a canonical example. For logistic regression:
$$p(y | \mathbf{x}, \mathbf{w}) = \sigma(\mathbf{w}^T \mathbf{x})^y (1 - \sigma(\mathbf{w}^T \mathbf{x}))^{1-y}$$
With a Gaussian prior on $\mathbf{w}$:
$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} | 0, \Sigma)$$
The posterior $p(\mathbf{w} | \mathcal{D})$ has no closed form. The sigmoid likelihood breaks conjugacy—we cannot analytically integrate out $\mathbf{w}$.
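To make the non-conjugacy concrete, the following sketch (toy data and an illustrative prior variance, not from the original page) shows that the unnormalized log posterior is trivial to evaluate pointwise, while the evidence would require integrating that same expression over all of $\mathbf{w}$:

```python
import numpy as np

# Toy logistic regression dataset (made up for illustration)
rng = np.random.default_rng(2)
X = rng.standard_normal((200, 5))
w_true = rng.standard_normal(5)
y = (rng.random(200) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

def log_joint(w, X, y, prior_var=1.0):
    """log p(y, w | X) = Bernoulli log-likelihood + Gaussian log prior
    (up to a constant). Easy to evaluate; impossible to integrate analytically."""
    logits = X @ w
    log_lik = np.sum(y * logits - np.log1p(np.exp(logits)))
    log_prior = -0.5 * np.sum(w**2) / prior_var
    return log_lik + log_prior

print(log_joint(np.zeros(5), X, y))   # evaluating the numerator is trivial
# The denominator p(y | X) = integral of exp(log_joint(w)) dw over all w
# has no closed form: the sigmoid likelihood breaks conjugacy.
```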
Neural network likelihoods are the most extreme example. A network $f_\theta(\mathbf{x})$ defines $p(\mathbf{y} | \mathbf{x}, \theta)$ through composition of nonlinear transformations. There is no hope of analytical integration over $\theta$. Bayesian deep learning is only possible through approximation—either variational methods or sampling.
| Source | Description | Example |
|---|---|---|
| Normalization | Cannot compute $p(\mathbf{x}) = \int p(\mathbf{x}, \mathbf{z}) d\mathbf{z}$ | Any non-conjugate model |
| High dimensionality | Integration over $D$ dimensions costs $O(M^D)$ | Bayesian neural networks ($D > 10^6$) |
| Variable coupling | Dependencies prevent factored integration | Hierarchical Bayesian models |
| Non-conjugacy | Likelihood form doesn't match prior | Logistic regression, any classification |
| Combinatorial explosion | Discrete variables with large state space | Latent sequence models, topic models |
Given that exact inference is impossible for most interesting models, how do we proceed? Approximate inference methods fall into two broad categories:
Idea: Instead of computing the posterior analytically, draw samples from it and use these samples to estimate quantities of interest.
Core methods: Markov chain Monte Carlo (Metropolis-Hastings, Gibbs sampling, Hamiltonian Monte Carlo) and importance sampling.
Advantages: asymptotically exact given enough samples, and makes no assumptions about the shape of the posterior.
Disadvantages: can be slow to converge, convergence is hard to diagnose, and it scales poorly to large datasets and very high-dimensional models.
Idea: Approximate the intractable posterior with a tractable distribution from a specified family. Find the member of this family closest to the true posterior.
Core methods: variational inference (including stochastic and black-box variants), expectation propagation, and the Laplace approximation.
Advantages: fast, deterministic, scales to large datasets through stochastic optimization, and integrates naturally with GPUs and automatic differentiation.
Disadvantages: the approximation is biased by the choice of family, it typically underestimates posterior variance, and the quality of the fit can be hard to assess.
In modern machine learning, variational methods have largely displaced MCMC for training large-scale models. The ability to use stochastic gradient descent, GPU acceleration, and mini-batching makes VI practical for datasets with millions of examples and models with millions of parameters. MCMC remains valuable for smaller problems where exact uncertainty quantification is critical, but VI powers most practical Bayesian deep learning.
The key insight of variational inference is transformative: we convert an integration problem into an optimization problem.
This isn't just a technical maneuver—it's a fundamental reframing that unlocks the entire optimization toolkit for inference:
$$p(\mathbf{z} | \mathbf{x}) = \frac{p(\mathbf{x}, \mathbf{z})}{\int p(\mathbf{x}, \mathbf{z}') d\mathbf{z}'}$$
Computing this requires evaluating a potentially infinite-dimensional integral.
$$q^*(\mathbf{z}) = \arg\min_{q \in \mathcal{Q}} D_{\text{KL}}(q(\mathbf{z}) \,\|\, p(\mathbf{z} \mid \mathbf{x}))$$
Find the distribution $q$ in a tractable family $\mathcal{Q}$ that is closest to the true posterior.
Why is optimization easier?
Iterative improvement: Optimization algorithms make incremental progress. Each step improves the current solution. Integration must consider all possible values simultaneously.
Gradient-based methods: If we can compute gradients of our objective with respect to variational parameters, we can use gradient descent—the workhorse of modern ML.
Stochastic approximation: Mini-batch gradients allow training on massive datasets without full passes over data.
GPU acceleration: Matrix operations in neural network-based variational families parallelize naturally.
Compositional structure: Modern auto-differentiation (PyTorch, JAX) computes gradients through arbitrarily complex models automatically.
```python
import numpy as np
from scipy.optimize import minimize

# True target distribution (1D for visualization)
def log_target(z):
    """Log of a complex multi-modal distribution (unnormalized)."""
    return np.log(0.3 * np.exp(-0.5 * (z - 2)**2)
                  + 0.7 * np.exp(-0.5 * (z + 1)**2 / 0.5))

# Integration approach: numerical quadrature
def integration_approach():
    """Compute the normalizing constant by brute-force numerical integration."""
    z_grid = np.linspace(-5, 5, 10000)
    p_unnorm = np.exp(log_target(z_grid))
    Z = np.sum(p_unnorm) * (z_grid[1] - z_grid[0])   # numerical integration
    return Z, z_grid, p_unnorm / Z

# Optimization approach: variational approximation
def variational_approach():
    """Approximate the target with a Gaussian by optimizing a sampled ELBO."""
    # Fixed noise draws ("common random numbers") keep the objective
    # deterministic, so the L-BFGS-B optimizer behaves well.
    n_samples = 100
    epsilon = np.random.default_rng(0).standard_normal(n_samples)

    def neg_elbo(params):
        """Negative ELBO (we minimize this)."""
        mu, log_sigma = params
        sigma = np.exp(log_sigma)
        z = mu + sigma * epsilon                      # reparameterization
        # ELBO = E_q[log p(z)] - E_q[log q(z)]
        log_p = log_target(z)
        log_q = -0.5 * epsilon**2 - log_sigma - 0.5 * np.log(2 * np.pi)
        return -np.mean(log_p - log_q)                # negative because we minimize

    result = minimize(neg_elbo, x0=[0.0, 0.0], method='L-BFGS-B')
    mu_opt, log_sigma_opt = result.x
    return mu_opt, np.exp(log_sigma_opt)

# Compare approaches
Z, z_grid, p_true = integration_approach()
mu_vi, sigma_vi = variational_approach()
print(f"Evidence by quadrature: Z = {Z:.4f}")
print(f"Variational fit: mu = {mu_vi:.3f}, sigma = {sigma_vi:.3f}")

# The key insight: integration required O(10000) grid evaluations.
# Optimization required O(100) iterations with O(100) samples each.
# In high dimensions, integration cost explodes; optimization stays manageable.
```

Variational inference trades bias for variance: the approximation is biased whenever the true posterior lies outside the chosen family, but its estimates are deterministic and low-variance, whereas sampling-based estimates are asymptotically unbiased yet noisy for any finite sample budget.
In high-dimensional problems, the variance reduction from VI often outweighs the bias, making it the practical choice for modern machine learning.
| Scenario | Favors VI | Favors MCMC |
|---|---|---|
| Millions of parameters | ✅ | |
| Large datasets (N > 10,000) | ✅ | |
| Need for fast iteration | ✅ | |
| Posterior uncertainty critical | | ✅ |
| Multi-modal posteriors | | ✅ (usually) |
| Computational budget limited | ✅ | |
| Interpretable uncertainty | | ✅ |
The name "variational inference" comes from the calculus of variations, a branch of mathematics concerned with finding functions that optimize functionals (functions of functions). This connection is deep and illuminating.
Variational principles pervade theoretical physics:
Principle of Least Action: A physical system evolves along the trajectory that minimizes the action integral: $$S[q(t)] = \int_{t_0}^{t_1} L(q, \dot{q}, t) \, dt$$
The actual trajectory satisfies the Euler-Lagrange equations—found by "varying" the function $q(t)$ and setting the variation of $S$ to zero.
Rayleigh-Ritz Method: Approximate the ground state of a quantum system by minimizing energy over a parameterized family of wave functions.
The link to machine learning comes through statistical mechanics. The variational free energy bounds the true free energy:
$$F \leq F_q = \langle E \rangle_q - TS_q$$
where $E$ is energy, $T$ is temperature, and $S_q$ is the entropy of distribution $q$. Minimizing $F_q$ over $q$ gives the best approximation to the Boltzmann distribution.
This is exactly the structure of variational inference: the true posterior plays the role of the Boltzmann distribution, the approximating distribution $q$ plays the role of the trial distribution, and the variational free energy plays the role of the objective we minimize.
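Concretely, with the standard identification $E(\mathbf{z}) = -\log p(\mathbf{x}, \mathbf{z})$ and $T = 1$ (a one-line check, not part of the original text):

$$F_q = \mathbb{E}_q[-\log p(\mathbf{x}, \mathbf{z})] - S_q = -\big(\mathbb{E}_q[\log p(\mathbf{x}, \mathbf{z})] - \mathbb{E}_q[\log q(\mathbf{z})]\big),$$

so minimizing the variational free energy is the same as maximizing the ELBO introduced below.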
Variational principles recur across science because they express fundamental optimization structures. In physics, nature minimizes action. In statistics, we minimize divergence from the truth. In machine learning, we maximize the ELBO. The mathematical framework is the same—only the interpretation changes.
In contemporary machine learning, we typically present variational inference in terms of KL divergence, the evidence lower bound (ELBO), and gradient-based optimization rather than free energies.
This framing emphasizes the optimization perspective and aligns VI with standard machine learning practice. The physics origins are intellectually satisfying but not necessary for practical application.
Key insight for practitioners: Variational inference is just optimization. If you understand gradient descent and probabilistic modeling, you understand 90% of VI. The remaining 10% is the mathematics that connects the ELBO to the posterior approximation—which we develop in the following pages.
Now that we understand why approximation is necessary and why optimization-based approaches are attractive, let's preview the complete variational inference framework you'll learn in this module.
Variational inference proceeds in four steps:
Step 1: Choose a Variational Family $\mathcal{Q}$
Select a family of distributions parameterized by $\phi$: $\{ q_\phi(\mathbf{z}) : \phi \in \Phi \}$
This family should be flexible enough to approximate the true posterior closely, yet simple enough that sampling from $q_\phi$ and evaluating its density (and hence the ELBO) remain tractable.
Step 2: Define the Objective (ELBO)
We maximize the Evidence Lower Bound: $$\mathcal{L}(\phi) = \mathbb{E}_{q_\phi}[\log p(\mathbf{x}, \mathbf{z})] - \mathbb{E}_{q_\phi}[\log q_\phi(\mathbf{z})]$$
The ELBO lower-bounds $\log p(\mathbf{x})$ and equals it when $q_\phi = p(\mathbf{z} | \mathbf{x})$.
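The bound itself follows in one line from Jensen's inequality (a standard derivation, sketched here for completeness):

$$\log p(\mathbf{x}) = \log \int q_\phi(\mathbf{z}) \frac{p(\mathbf{x}, \mathbf{z})}{q_\phi(\mathbf{z})} \, d\mathbf{z} \;\geq\; \int q_\phi(\mathbf{z}) \log \frac{p(\mathbf{x}, \mathbf{z})}{q_\phi(\mathbf{z})} \, d\mathbf{z} = \mathcal{L}(\phi),$$

and the gap is exactly $D_{\text{KL}}(q_\phi(\mathbf{z}) \,\|\, p(\mathbf{z} \mid \mathbf{x}))$, which is why maximizing the ELBO and minimizing the KL divergence are the same problem.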
Step 3: Optimize
Use gradient ascent on $\mathcal{L}(\phi)$: $$\phi \leftarrow \phi + \eta \nabla_\phi \mathcal{L}(\phi)$$
Step 4: Use the Approximate Posterior
Once optimized, $q_{\phi^*}(\mathbf{z})$ serves as a stand-in for the true posterior in downstream tasks.
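For example, posterior expectations reduce to Monte Carlo averages over $q_{\phi^*}$. A minimal sketch (the variational parameter values and the function `f` below are hypothetical placeholders, not from the original text):

```python
import numpy as np

# Suppose optimization returned these variational parameters (illustrative values)
mu_vi, sigma_vi = 0.1, 1.2

# Downstream use: estimate E_{p(z|x)}[f(z)] with samples from q instead
rng = np.random.default_rng(3)
z_samples = mu_vi + sigma_vi * rng.standard_normal(10_000)
f = lambda z: z**2              # any quantity of interest
estimate = f(z_samples).mean()  # ≈ E_q[f(z)] ≈ E_{p(z|x)}[f(z)]
print(estimate)
```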
You now understand why variational inference exists: exact Bayesian inference is intractable for most useful models, and optimization-based approximation provides a practical path forward. The next page introduces the variational family—the set of tractable distributions we'll optimize over to approximate intractable posteriors.