Variational inference frames posterior approximation as an optimization problem: find the distribution $q$ in our variational family that is "closest" to the true posterior $p(\mathbf{z} | \mathbf{x})$. But what does "closest" mean for probability distributions?
We need a notion of distance or divergence between distributions. The choice of this measure is not arbitrary—it fundamentally shapes the behavior of our approximation. In this page, we develop the Kullback-Leibler (KL) divergence as the standard objective for variational inference and understand why this choice, despite its asymmetry and peculiarities, is both theoretically grounded and practically effective.
The KL divergence is not just a technical choice—understanding its properties is essential for interpreting VI results, diagnosing approximation failures, and knowing when alternative divergences might be more appropriate.
By completing this page, you will: (1) Understand the mathematical definition and intuition behind KL divergence, (2) Master the critical distinction between forward KL(p||q) and reverse KL(q||p), (3) Explain mode-seeking vs. mass-covering behavior and their implications, (4) Understand why variational inference uses reverse KL, and (5) Know when alternative divergences might be preferable.
The Kullback-Leibler divergence (also called relative entropy) from distribution $p$ to distribution $q$ is defined as:
$$D_{\text{KL}}(p \| q) = \int p(\mathbf{z}) \log \frac{p(\mathbf{z})}{q(\mathbf{z})} \, d\mathbf{z} = \mathbb{E}_{p}\left[ \log \frac{p(\mathbf{z})}{q(\mathbf{z})} \right]$$
Equivalently:
$$D_{\text{KL}}(p \| q) = \mathbb{E}_{p}[\log p(\mathbf{z})] - \mathbb{E}_{p}[\log q(\mathbf{z})]$$
The first term is the negative entropy of $p$; the second term, $-\mathbb{E}_{p}[\log q(\mathbf{z})]$, is the cross-entropy from $p$ to $q$. KL divergence is therefore cross-entropy minus entropy.
Optimal coding interpretation: KL divergence measures the expected number of extra bits needed to encode samples from $p$ using a code optimized for $q$, compared to a code optimized for $p$.
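To make the coding interpretation concrete, here is a minimal sketch with a made-up three-symbol distribution; using base-2 logarithms gives the answer directly in bits.

```python
import numpy as np

# Hypothetical three-symbol alphabet
p = np.array([0.5, 0.25, 0.25])   # true symbol frequencies
q = np.array([1/3, 1/3, 1/3])     # model used to design the code

entropy_p = -np.sum(p * np.log2(p))       # optimal bits/symbol with a code built for p
cross_entropy = -np.sum(p * np.log2(q))   # bits/symbol if we instead use a code built for q
kl_bits = np.sum(p * np.log2(p / q))      # the expected overhead

print(f"H(p)     = {entropy_p:.3f} bits/symbol")       # 1.500
print(f"H(p, q)  = {cross_entropy:.3f} bits/symbol")   # 1.585
print(f"KL(p||q) = {kl_bits:.3f} extra bits/symbol")   # 0.085
# KL(p||q) = H(p, q) - H(p): the price of coding p-distributed data with q's code.
```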
Despite being called a "divergence," KL is not a true distance metric. It is:
Asymmetric: $D_{\text{KL}}(p \| q) \neq D_{\text{KL}}(q \| p)$ in general
Does not satisfy the triangle inequality: $D_{\text{KL}}(p \| r)$ can exceed $D_{\text{KL}}(p \| q) + D_{\text{KL}}(q \| r)$
Unbounded: Can be infinite if supports don't match
This asymmetry is crucial: the direction matters enormously for variational inference.
Property 1: Non-negativity $$D_{\text{KL}}(p \| q) \geq 0$$ with equality if and only if $p = q$ almost everywhere. This follows from Jensen's inequality applied to the concave logarithm.
Property 2: Not symmetric $$D_{\text{KL}}(p \| q) \neq D_{\text{KL}}(q \| p)$$ The asymmetry reflects different "penalty structures" depending on which distribution the expectation is taken over.
Property 3: Support requirement If there exists $\mathbf{z}$ such that $p(\mathbf{z}) > 0$ but $q(\mathbf{z}) = 0$, then: $$D_{\text{KL}}(p \| q) = +\infty$$ The approximation $q$ must cover the support of $p$.
Property 4: Decomposition for exponential families For distributions in the same exponential family, KL divergence has special closed forms, enabling tractable optimization.
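To see Property 3 numerically, here is a small sketch on made-up discrete distributions; `scipy.special.rel_entr` computes the elementwise terms $p_i \log(p_i/q_i)$ with the zero-handling KL requires.

```python
import numpy as np
from scipy.special import rel_entr  # elementwise p * log(p / q), with KL's zero conventions

p = np.array([0.4, 0.4, 0.2, 0.0])   # p puts mass on the first three outcomes
q = np.array([0.5, 0.5, 0.0, 0.0])   # q assigns zero probability to outcome 3

# p > 0 where q = 0  ->  infinite divergence
print("KL(p || q) =", rel_entr(p, q).sum())   # inf

# q's support is contained in p's  ->  finite divergence
print("KL(q || p) =", rel_entr(q, p).sum())   # ≈ 0.223 nats
```

The longer example below computes the KL divergence between two Gaussians both in closed form and by Monte Carlo.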
```python
import numpy as np
from scipy.stats import norm

def kl_divergence_gaussians(mu_p, sigma_p, mu_q, sigma_q):
    """
    Closed-form KL divergence between univariate Gaussians.

    KL(N(μ_p, σ_p²) || N(μ_q, σ_q²))
      = log(σ_q/σ_p) + (σ_p² + (μ_p - μ_q)²) / (2σ_q²) - 1/2
    """
    return (
        np.log(sigma_q / sigma_p)
        + (sigma_p**2 + (mu_p - mu_q)**2) / (2 * sigma_q**2)
        - 0.5
    )

def kl_divergence_monte_carlo(log_p, log_q, samples_from_p):
    """
    Monte Carlo estimate of KL(p || q).

    KL(p || q) ≈ (1/N) Σ [log p(z_i) - log q(z_i)]   where z_i ~ p
    """
    return np.mean(log_p(samples_from_p) - log_q(samples_from_p))

# Example: two Gaussians
p_mu, p_sigma = 0, 1
q_mu, q_sigma = 1, 2

# Analytical KL
kl_analytic = kl_divergence_gaussians(p_mu, p_sigma, q_mu, q_sigma)
print(f"KL(p || q) analytical: {kl_analytic:.4f}")

# Monte Carlo estimate
samples = np.random.normal(p_mu, p_sigma, 100000)
log_p = lambda z: norm.logpdf(z, p_mu, p_sigma)
log_q = lambda z: norm.logpdf(z, q_mu, q_sigma)
kl_mc = kl_divergence_monte_carlo(log_p, log_q, samples)
print(f"KL(p || q) Monte Carlo: {kl_mc:.4f}")

# Now compute the REVERSE KL
kl_reverse_analytic = kl_divergence_gaussians(q_mu, q_sigma, p_mu, p_sigma)
print(f"\nKL(q || p) analytical: {kl_reverse_analytic:.4f}")

# Note: KL(p || q) ≠ KL(q || p)!
# The difference reflects the asymmetric penalty structure.
print(f"\nDifference: {abs(kl_analytic - kl_reverse_analytic):.4f}")

# Special case: when supports don't match.
# If p has support where q = 0, KL(p || q) = ∞.
# This is why support coverage matters!
```

The asymmetry of KL divergence gives rise to two fundamentally different optimization objectives:
$$D_{\text{KL}}(p \| q) = \mathbb{E}_{p}\left[ \log \frac{p(\mathbf{z})}{q(\mathbf{z})} \right]$$
Expectation is over the true distribution $p$.
Minimizing forward KL means $q$ must place probability mass wherever $p$ does: any region where $p(\mathbf{z}) > 0$ but $q(\mathbf{z}) \approx 0$ incurs a large (potentially infinite) penalty.
Behavior: Mass-covering / Moment-matching / Inclusive
$$D_{\text{KL}}(q \| p) = \mathbb{E}_{q}\left[ \log \frac{q(\mathbf{z})}{p(\mathbf{z})} \right]$$
Expectation is over the approximate distribution $q$.
Minimizing reverse KL means $q$ must avoid placing mass where $p$ is small: regions where $q(\mathbf{z}) > 0$ but $p(\mathbf{z}) \approx 0$ are penalized heavily, while modes of $p$ that $q$ misses entirely cost nothing.
Behavior: Mode-seeking / Zero-forcing / Exclusive
Consider fitting a unimodal Gaussian $q$ to a bimodal target $p$:
Forward KL (moment-matching): The optimal $q$ will center between the two modes, trying to overlap with both. It will have high variance to "cover" both modes. But it places mass in the valley between modes where $p$ is low.
Reverse KL (mode-seeking): The optimal $q$ will lock onto one mode and match it tightly. It "ignores" the other mode entirely. The variance will be small, matching the single mode it captures.
Neither is "wrong"—they answer different questions:
For inference, we typically want reverse KL: we'd rather have a precise estimate of one mode than a diffuse estimate that doesn't match any mode well.
```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)

# True distribution: mixture of two Gaussians (bimodal)
def log_p(z):
    """Log of bimodal mixture: 0.5 * N(-2, 0.5²) + 0.5 * N(2, 0.5²)"""
    p1 = 0.5 * norm.pdf(z, -2, 0.5)
    p2 = 0.5 * norm.pdf(z, 2, 0.5)
    return np.log(p1 + p2 + 1e-10)

# Variational family: single Gaussian
def log_q(z, mu, sigma):
    return norm.logpdf(z, mu, sigma)

# Fixed random draws (common random numbers) so each objective is a
# deterministic function of the parameters and L-BFGS-B behaves well.
n_samples = 10000
component = rng.binomial(1, 0.5, n_samples)
p_samples = np.where(
    component,
    rng.normal(-2, 0.5, n_samples),
    rng.normal(2, 0.5, n_samples)
)
eps = rng.standard_normal(n_samples)

# FORWARD KL: D(p || q) - expectation under p, so we use samples from p
def forward_kl(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    # KL(p || q) = E_p[log p - log q]
    return np.mean(log_p(p_samples) - log_q(p_samples, mu, sigma))

# REVERSE KL: D(q || p) - expectation under q, so we reparameterize z = mu + sigma * eps
def reverse_kl(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    samples = mu + sigma * eps
    # KL(q || p) = E_q[log q - log p]
    return np.mean(log_q(samples, mu, sigma) - log_p(samples))

# Optimize forward KL
result_forward = minimize(forward_kl, x0=[0.0, 0.0], method='L-BFGS-B')
mu_forward, log_sigma_forward = result_forward.x
sigma_forward = np.exp(log_sigma_forward)
print(f"Forward KL optimal: μ={mu_forward:.2f}, σ={sigma_forward:.2f}")

# Optimize reverse KL (may converge to either mode depending on initialization)
result_reverse = minimize(reverse_kl, x0=[1.0, 0.0], method='L-BFGS-B')
mu_reverse, log_sigma_reverse = result_reverse.x
sigma_reverse = np.exp(log_sigma_reverse)
print(f"Reverse KL optimal: μ={mu_reverse:.2f}, σ={sigma_reverse:.2f}")

# Key observations:
# - Forward KL: μ ≈ 0, σ ≈ 2.1 (moment matching: covers both modes, high variance)
# - Reverse KL: μ ≈ ±2, σ ≈ 0.5 (locks onto one mode, matches its variance)

# This fundamental difference drives VI behavior!
```

"Q for Questions": In D(q || p), the expectation is over q. Think of it as asking: "If I sample from my approximation q, will I be surprised when I look up the true probability p?" Reverse KL minimizes this surprise by avoiding regions where p is small.
The mode-seeking behavior of reverse KL has profound implications for variational inference. Understanding these consequences is essential for interpreting VI results and knowing when the approximation might fail.
Consider the penalty structure of $D_{\text{KL}}(q \| p)$:
$$D_{\text{KL}}(q \| p) = \mathbb{E}_q\left[ \log \frac{q(\mathbf{z})}{p(\mathbf{z})} \right] = \mathbb{E}_q[\log q(\mathbf{z})] - \mathbb{E}_q[\log p(\mathbf{z})]$$
The critical term is $\mathbb{E}_q[\log p(\mathbf{z})]$: wherever $q$ places mass, $\log p(\mathbf{z})$ must not be extremely negative. If $q$ puts mass where $p(\mathbf{z}) \approx 0$, this term blows up; regions where $p$ is large but $q$ is near zero contribute nothing, because the expectation is taken under $q$.
This asymmetry means $q$ is punished for covering low-probability regions of $p$ but not punished for missing modes. The rational strategy: concentrate on a single mode and match it precisely.
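A back-of-the-envelope sketch of this penalty structure, using made-up density values: the contribution of a small region to $D_{\text{KL}}(q \| p)$ is roughly $q \log(q/p)$ times its width.

```python
import numpy as np

# Contribution of a small region of width dz to KL(q || p) is roughly q * log(q / p) * dz
dz = 0.01

# Case 1: q places mass where p is nearly zero -> large positive penalty
q_val, p_val = 0.3, 1e-6
print("q covers a low-p region: ", q_val * np.log(q_val / p_val) * dz)   # ≈ +0.038

# Case 2: p has a mode that q almost entirely misses -> essentially no penalty
q_val, p_val = 1e-6, 0.5
print("q misses a high-p region:", q_val * np.log(q_val / p_val) * dz)   # ≈ -1.3e-7
```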
| Scenario | Forward KL (Mass-Covering) | Reverse KL (Mode-Seeking) |
|---|---|---|
| Bimodal symmetric p | q centers between modes | q locks onto one mode |
| Heavy-tailed p | q matches tail behavior | q concentrates on mode, ignores tails |
| Skewed p | q covers entire support | q matches the mode precisely |
| Uniform p | q is broad/uniform | q is also broad (the entropy term favors spreading over p's support) |
| Uncertainty quantification | Overestimates uncertainty | Underestimates uncertainty |
1. Underestimated Uncertainty
Because reverse KL concentrates on modes, the resulting $q$ typically has smaller variance than the true posterior. This leads to overconfident predictions, credible intervals that are too narrow, and downstream decisions that underestimate risk.
2. Mode Collapse in Multi-Modal Posteriors
When the true posterior is multi-modal (common in mixture models, neural networks), VI will converge to a single mode and report it as if it were the whole posterior, silently discarding the probability mass in the other modes.
3. Sensitivity to Initialization
With multiple modes, which mode $q$ converges to depends on initialization. Different random seeds can give qualitatively different results.
4. Local Optima
The ELBO landscape inherits the multi-modality of the posterior. Gradient descent may get stuck in local optima corresponding to less probable modes.
```python
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

def demonstrate_mode_collapse():
    """
    Show how VI collapses to a single mode for a bimodal posterior.

    Setup: Bayesian inference on mean μ, with bimodal likelihood
    (e.g., from a mixture model or symmetric objective).
    """
    # True posterior is bimodal at μ = -2 and μ = +2
    def log_posterior(mu):
        """Log of bimodal posterior: mix of N(-2, 0.5²) and N(2, 0.5²)"""
        log_p1 = -0.5 * ((mu + 2) / 0.5)**2
        log_p2 = -0.5 * ((mu - 2) / 0.5)**2
        # Log-sum-exp for numerical stability
        log_mix = torch.logsumexp(torch.stack([log_p1, log_p2]), dim=0) - np.log(2)
        return log_mix

    class GaussianVI(nn.Module):
        """Simple Gaussian variational approximation."""
        def __init__(self, init_mu=0.0):
            super().__init__()
            self.mu = nn.Parameter(torch.tensor(init_mu))
            self.log_sigma = nn.Parameter(torch.tensor(0.0))

        def sample(self, n=100):
            sigma = torch.exp(self.log_sigma)
            eps = torch.randn(n)
            return self.mu + sigma * eps

        def log_q(self, z):
            sigma = torch.exp(self.log_sigma)
            return -0.5 * ((z - self.mu) / sigma)**2 - self.log_sigma - 0.5 * np.log(2 * np.pi)

        def elbo(self, n_samples=100):
            z = self.sample(n_samples)
            return (log_posterior(z) - self.log_q(z)).mean()

    # Run VI from different initializations
    results = []
    for init_mu in [-3.0, -1.0, 0.0, 1.0, 3.0]:
        model = GaussianVI(init_mu)
        optimizer = optim.Adam(model.parameters(), lr=0.1)
        for _ in range(500):
            optimizer.zero_grad()
            loss = -model.elbo()
            loss.backward()
            optimizer.step()
        final_mu = model.mu.item()
        final_sigma = torch.exp(model.log_sigma).item()
        results.append((init_mu, final_mu, final_sigma))
        print(f"Init μ={init_mu:+.1f} → Final μ={final_mu:+.2f}, σ={final_sigma:.3f}")

    # Observations:
    # - Init near -2: converges to μ ≈ -2
    # - Init near +2: converges to μ ≈ +2
    # - Init at 0: could go either way (depends on noise)
    # - VI NEVER finds both modes with a unimodal q!

    print("\n⚠️ Mode collapse: Each run finds only ONE mode.")
    print("True posterior has BOTH modes, but VI cannot represent this.")
    print("This is the fundamental limitation of reverse KL with simple families.")

demonstrate_mode_collapse()
```

Mode-seeking behavior is particularly problematic for:
• Safety-critical systems where rare events matter
• Bayesian model averaging where all modes contribute to predictions
• Exploration in reinforcement learning where underestimating uncertainty leads to poor policies
• Scientific inference where characterizing full posterior uncertainty is the goal
In these cases, consider MCMC, mixture variational families, or alternative divergences.
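To make consequence 1 (underestimated uncertainty) concrete, here is a minimal sketch of a classic textbook case: fitting a fully factorized (mean-field) Gaussian to a correlated 2-D Gaussian by minimizing reverse KL. For that target the optimal factor variances are the inverse diagonal of the precision matrix, which never exceeds the true marginal variance; the correlation value below is made up for illustration.

```python
import numpy as np

# Target: a correlated 2-D Gaussian
rho = 0.9
Sigma = np.array([[1.0, rho],
                  [rho, 1.0]])
Lambda = np.linalg.inv(Sigma)        # precision matrix

# Reverse-KL-optimal mean-field Gaussian: each factor gets variance 1 / Λ_ii
mean_field_var = 1.0 / np.diag(Lambda)
true_marginal_var = np.diag(Sigma)

print("True marginal variances:     ", true_marginal_var)    # [1.00, 1.00]
print("Mean-field (reverse KL) vars:", mean_field_var)        # [0.19, 0.19]
print("Ratio (approx / true):       ", mean_field_var / true_marginal_var)
# The factorized approximation reports the conditional variance (1 - ρ²) instead of the
# marginal variance, so reported uncertainty shrinks sharply as correlation grows.
```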
Given the mode-seeking limitations, why does variational inference use reverse KL? The answer lies in tractability—forward KL is essentially impossible to optimize directly.
Forward KL: $D_{\text{KL}}(p \| q)$
$$D_{\text{KL}}(p \| q) = \mathbb{E}_{p(\mathbf{z}|\mathbf{x})}\left[ \log \frac{p(\mathbf{z}|\mathbf{x})}{q(\mathbf{z})} \right]$$
To optimize this, we need to take expectations under the true posterior $p(\mathbf{z}|\mathbf{x})$, which means sampling from or integrating against exactly the distribution we are trying to approximate.
We cannot optimize forward KL without already having solved the inference problem.
Reverse KL: $D_{\text{KL}}(q \| p)$
$$D_{\text{KL}}(q \| p) = \mathbb{E}_{q(\mathbf{z})}\left[ \log \frac{q(\mathbf{z})}{p(\mathbf{z}|\mathbf{x})} \right]$$
To optimize this, we need only expectations under $q$ (which we can sample from) and evaluations of $\log p(\mathbf{z}|\mathbf{x})$, which at first glance still seems to require the intractable normalization constant $p(\mathbf{x})$.
But here's the key insight: We can rewrite reverse KL to avoid the normalization constant!
Expanding the reverse KL:
$$\begin{aligned} D_{\text{KL}}(q \| p(\cdot|\mathbf{x})) &= \mathbb{E}_q\left[ \log \frac{q(\mathbf{z})}{p(\mathbf{z}|\mathbf{x})} \right] \\ &= \mathbb{E}_q\left[ \log q(\mathbf{z}) - \log p(\mathbf{z}|\mathbf{x}) \right] \\ &= \mathbb{E}_q\left[ \log q(\mathbf{z}) - \log \frac{p(\mathbf{x}, \mathbf{z})}{p(\mathbf{x})} \right] \\ &= \mathbb{E}_q\left[ \log q(\mathbf{z}) - \log p(\mathbf{x}, \mathbf{z}) \right] + \log p(\mathbf{x}) \end{aligned}$$
Rearranging:
$$\log p(\mathbf{x}) = D_{\text{KL}}(q \| p(\cdot|\mathbf{x})) + \underbrace{\mathbb{E}_q\left[ \log p(\mathbf{x}, \mathbf{z}) - \log q(\mathbf{z}) \right]}_{\text{ELBO}(q)}$$
The Evidence Lower Bound (ELBO) is:
$$\mathcal{L}(q) = \mathbb{E}_q[\log p(\mathbf{x}, \mathbf{z})] - \mathbb{E}_q[\log q(\mathbf{z})]$$
Critical observation: The ELBO depends only on the joint $p(\mathbf{x}, \mathbf{z}) = p(\mathbf{x}|\mathbf{z})\,p(\mathbf{z})$, which we can evaluate pointwise, and on $q(\mathbf{z})$, which we chose to be tractable.
No normalization constant required!
$$\log p(\mathbf{x}) = \text{ELBO}(q) + D_{\text{KL}}(q \| p(\cdot|\mathbf{x}))$$
Since $D_{\text{KL}} \geq 0$, we have $\text{ELBO}(q) \leq \log p(\mathbf{x})$ — the ELBO is a lower bound on the log-evidence.
Maximizing the ELBO is equivalent to minimizing the reverse KL divergence!
This is why it's called the Evidence Lower BOund.
```python
import numpy as np

def illustrate_elbo_kl_relationship():
    """
    Demonstrate the fundamental identity:
    log p(x) = ELBO(q) + KL(q || p(z|x))

    When KL → 0, ELBO → log p(x)
    """
    # Simple example: Gaussian prior and likelihood
    # p(z) = N(0, 1)        (prior)
    # p(x|z) = N(z, 0.5²)   (likelihood)
    # Observed: x = 1.0
    x = 1.0
    prior_mu, prior_sigma = 0.0, 1.0
    likelihood_sigma = 0.5

    # True posterior (conjugate, so we know it)
    # p(z|x) = N(posterior_mu, posterior_sigma²)
    posterior_precision = 1/prior_sigma**2 + 1/likelihood_sigma**2
    posterior_sigma = 1/np.sqrt(posterior_precision)
    posterior_mu = posterior_sigma**2 * (prior_mu/prior_sigma**2 + x/likelihood_sigma**2)
    print(f"True posterior: N({posterior_mu:.3f}, {posterior_sigma:.3f}²)")

    # True log evidence (can compute for this simple case)
    # p(x) = ∫ p(x|z)p(z) dz = N(x | prior_mu, prior_sigma² + likelihood_sigma²)
    evidence_sigma = np.sqrt(prior_sigma**2 + likelihood_sigma**2)
    log_evidence = -0.5 * np.log(2*np.pi) - np.log(evidence_sigma) - 0.5 * ((x - prior_mu)/evidence_sigma)**2
    print(f"True log evidence: {log_evidence:.4f}")

    # Now compute ELBO for various q
    def compute_elbo_and_kl(q_mu, q_sigma, n_samples=10000):
        """Compute ELBO and KL for Gaussian q(z) = N(q_mu, q_sigma²)"""
        # Sample from q
        eps = np.random.randn(n_samples)
        z = q_mu + q_sigma * eps

        # log p(x, z) = log p(x|z) + log p(z)
        log_likelihood = -0.5 * np.log(2*np.pi) - np.log(likelihood_sigma) - 0.5 * ((x - z)/likelihood_sigma)**2
        log_prior = -0.5 * np.log(2*np.pi) - np.log(prior_sigma) - 0.5 * (z/prior_sigma)**2
        log_joint = log_likelihood + log_prior

        # log q(z)
        log_q = -0.5 * np.log(2*np.pi) - np.log(q_sigma) - 0.5 * ((z - q_mu)/q_sigma)**2

        # ELBO = E_q[log p(x,z) - log q(z)]
        elbo = np.mean(log_joint - log_q)

        # KL to true posterior (analytical for Gaussians)
        kl = np.log(posterior_sigma/q_sigma) + (q_sigma**2 + (q_mu - posterior_mu)**2)/(2*posterior_sigma**2) - 0.5
        return elbo, kl

    print("\nVarying q and checking ELBO + KL = log p(x):")
    print("-" * 60)
    for q_mu in [0.0, 0.5, posterior_mu]:            # Different means
        for q_sigma in [0.3, posterior_sigma, 1.0]:  # Different variances
            elbo, kl = compute_elbo_and_kl(q_mu, q_sigma)
            total = elbo + kl
            print(f"q=N({q_mu:.2f}, {q_sigma:.2f}²): ELBO={elbo:.4f}, KL={kl:.4f}, Sum={total:.4f}")

    print(f"\nTrue log p(x) = {log_evidence:.4f}")
    print("Note: ELBO + KL always equals log p(x) (up to MC noise)")

illustrate_elbo_kl_relationship()
```

While KL divergence dominates variational inference, alternative divergences address specific limitations. Understanding these alternatives helps you choose the right tool for your problem.
The α-divergence family generalizes KL:
$$D_\alpha(p \| q) = \frac{1}{\alpha(\alpha - 1)} \left( \int p(\mathbf{z})^\alpha q(\mathbf{z})^{1-\alpha} d\mathbf{z} - 1 \right)$$
Special cases: as $\alpha \to 1$ the formula recovers the forward KL $D_{\text{KL}}(p \| q)$; as $\alpha \to 0$ it recovers the reverse KL $D_{\text{KL}}(q \| p)$; and $\alpha = 1/2$ gives a symmetric divergence proportional to the squared Hellinger distance.
Variational Rényi bounds allow trading off between mass-covering ($\alpha > 0$) and mode-seeking ($\alpha < 0$) behavior.
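A small numerical sketch of these limits, approximating the integral on a grid; the two Gaussians are arbitrary choices, and $\alpha = 0.01$ and $\alpha = 0.99$ stand in for the exact limits:

```python
import numpy as np
from scipy.stats import norm

# Two example densities evaluated on a grid
z = np.linspace(-15, 15, 20001)
dz = z[1] - z[0]
p = norm.pdf(z, loc=0.0, scale=1.0)
q = norm.pdf(z, loc=1.0, scale=2.0)

def alpha_divergence(p, q, alpha):
    """D_alpha(p || q) = (∫ p^alpha q^(1-alpha) dz - 1) / (alpha * (alpha - 1))."""
    integral = np.sum(p**alpha * q**(1.0 - alpha)) * dz
    return (integral - 1.0) / (alpha * (alpha - 1.0))

kl_pq = np.sum(p * np.log(p / q)) * dz   # forward KL(p || q)
kl_qp = np.sum(q * np.log(q / p)) * dz   # reverse KL(q || p)

print(f"alpha=0.99: {alpha_divergence(p, q, 0.99):.4f}   (forward KL = {kl_pq:.4f})")
print(f"alpha=0.01: {alpha_divergence(p, q, 0.01):.4f}   (reverse KL = {kl_qp:.4f})")
print(f"alpha=0.50: {alpha_divergence(p, q, 0.50):.4f}   (balanced, Hellinger-like)")
```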
| Divergence | Behavior | Tractability | Use Case |
|---|---|---|---|
| Reverse KL | Mode-seeking | Tractable (ELBO) | Standard VI, VAEs |
| Forward KL | Mass-covering | Requires posterior samples | Expectation Propagation |
| α-divergence (α=0.5) | Balanced | Moderate complexity | When uncertainty matters |
| f-divergences | Flexible | Varies | Research, specialized cases |
| Wasserstein | Geometry-aware | Computationally expensive | When support mismatch is an issue |
| Stein discrepancy | Score-based | Kernel choice required | Implicit variational inference |
The Importance-Weighted ELBO (IWAE) tightens the bound using multiple samples:
$$\mathcal{L}_K = \mathbb{E}_{\mathbf{z}_1, \ldots, \mathbf{z}_K \sim q}\left[ \log \frac{1}{K} \sum_{k=1}^{K} \frac{p(\mathbf{x}, \mathbf{z}_k)}{q(\mathbf{z}_k)} \right]$$
As $K \to \infty$, $\mathcal{L}_K \to \log p(\mathbf{x})$. This provides a spectrum between variational (K=1) and Monte Carlo (K→∞) methods.
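A quick Monte Carlo sketch of this tightening, reusing the earlier toy conjugate model (prior $N(0,1)$, likelihood $N(z, 0.5^2)$, observation $x = 1$); the specific $q = N(0.3, 1.5^2)$ is a deliberately mediocre, made-up choice so the gap is visible.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(0)

# Same toy model as the earlier ELBO example
x = 1.0
prior = norm(0.0, 1.0)
lik_sigma = 0.5
log_evidence = norm(0.0, np.sqrt(1.0 + lik_sigma**2)).logpdf(x)  # exact log p(x)

# A deliberately mediocre variational distribution
q = norm(0.3, 1.5)

def iwae_bound(K, n_outer=2000):
    """Monte Carlo estimate of L_K = E[ log (1/K) Σ_k p(x, z_k) / q(z_k) ], z_k ~ q."""
    z = q.rvs(size=(n_outer, K), random_state=rng)
    log_w = norm(z, lik_sigma).logpdf(x) + prior.logpdf(z) - q.logpdf(z)
    # log of the average importance weight over K samples, then average over repetitions
    return np.mean(logsumexp(log_w, axis=1) - np.log(K))

for K in [1, 5, 50, 500]:
    print(f"K={K:4d}: L_K ≈ {iwae_bound(K):.4f}")
print(f"log p(x) = {log_evidence:.4f}")
# L_1 is the standard ELBO; the bound tightens toward log p(x) as K grows.
```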
When to use standard reverse KL (ELBO): when you need fast, scalable inference; when a precise summary of the dominant posterior mode is sufficient; and when training deep generative models such as VAEs, where the ELBO doubles as a training objective.
When to consider alternatives: when calibrated uncertainty is the primary deliverable, when the posterior is strongly multi-modal or heavy-tailed, or when diagnostics show the variational approximation badly underestimates posterior variance.
In practice, the choice of divergence matters less than the choice of variational family. A rich variational family (e.g., flows) with reverse KL often outperforms a simple family with a "better" divergence. Start with standard VI; switch divergences only if diagnostics indicate the approximation is failing for divergence-related reasons.
You now understand why KL divergence is used in variational inference and the profound implications of its asymmetric, mode-seeking nature. The next page derives the ELBO from first principles, showing exactly how minimizing reverse KL leads to a tractable optimization objective.