Variational inference frames posterior approximation as an optimization problem: find the distribution $q$ in our variational family that is "closest" to the true posterior $p(\mathbf{z} | \mathbf{x})$. But what does "closest" mean for probability distributions?
We need a notion of distance or divergence between distributions. The choice of this measure is not arbitrary—it fundamentally shapes the behavior of our approximation. In this page, we develop the Kullback-Leibler (KL) divergence as the standard objective for variational inference and understand why this choice, despite its asymmetry and peculiarities, is both theoretically grounded and practically effective.
The KL divergence is not just a technical choice—understanding its properties is essential for interpreting VI results, diagnosing approximation failures, and knowing when alternative divergences might be more appropriate.
By completing this page, you will: (1) Understand the mathematical definition and intuition behind KL divergence, (2) Master the critical distinction between forward KL(p||q) and reverse KL(q||p), (3) Explain mode-seeking vs. mass-covering behavior and their implications, (4) Understand why variational inference uses reverse KL, and (5) Know when alternative divergences might be preferable.
The Kullback-Leibler divergence (also called relative entropy) from distribution $p$ to distribution $q$ is defined as:
$$D_{\text{KL}}(p \| q) = \int p(\mathbf{z}) \log \frac{p(\mathbf{z})}{q(\mathbf{z})} \, d\mathbf{z} = \mathbb{E}_{p}\left[ \log \frac{p(\mathbf{z})}{q(\mathbf{z})} \right]$$
Equivalently:
$$D_{\text{KL}}(p \| q) = \mathbb{E}_{p}[\log p(\mathbf{z})] - \mathbb{E}_{p}[\log q(\mathbf{z})]$$
The first term is the negative entropy of $p$; the second term, $-\mathbb{E}_{p}[\log q(\mathbf{z})]$, is the cross-entropy from $p$ to $q$. KL divergence is therefore cross-entropy minus entropy.
Optimal coding interpretation: KL divergence measures the expected number of extra bits needed to encode samples from $p$ using a code optimized for $q$, compared to a code optimized for $p$.
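To make the coding interpretation concrete, here is a minimal sketch with a made-up three-symbol distribution; using base-2 logarithms gives the answer directly in bits.

```python
import numpy as np

# Hypothetical three-symbol alphabet
p = np.array([0.5, 0.25, 0.25])   # true symbol frequencies
q = np.array([1/3, 1/3, 1/3])     # model used to design the code

entropy_p = -np.sum(p * np.log2(p))       # optimal bits/symbol with a code built for p
cross_entropy = -np.sum(p * np.log2(q))   # bits/symbol if we instead use a code built for q
kl_bits = np.sum(p * np.log2(p / q))      # the expected overhead

print(f"H(p)     = {entropy_p:.3f} bits/symbol")       # 1.500
print(f"H(p, q)  = {cross_entropy:.3f} bits/symbol")   # 1.585
print(f"KL(p||q) = {kl_bits:.3f} extra bits/symbol")   # 0.085
# KL(p||q) = H(p, q) - H(p): the price of coding p-distributed data with q's code.
```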
Despite being called a "divergence," KL is not a true distance metric. It is:
Asymmetric: $D_{\text{KL}}(p \| q) \neq D_{\text{KL}}(q \| p)$ in general
Does not satisfy the triangle inequality: $D_{\text{KL}}(p \| r)$ can exceed $D_{\text{KL}}(p \| q) + D_{\text{KL}}(q \| r)$
Unbounded: Can be infinite if supports don't match
This asymmetry is crucial: the direction matters enormously for variational inference.
Property 1: Non-negativity $$D_{\text{KL}}(p \| q) \geq 0$$ with equality if and only if $p = q$ almost everywhere. This follows from Jensen's inequality applied to the concave logarithm.
Property 2: Not symmetric $$D_{\text{KL}}(p \| q) \neq D_{\text{KL}}(q \| p)$$ The asymmetry reflects different "penalty structures" depending on which distribution the expectation is taken over.
Property 3: Support requirement If there exists $\mathbf{z}$ such that $p(\mathbf{z}) > 0$ but $q(\mathbf{z}) = 0$, then: $$D_{\text{KL}}(p \| q) = +\infty$$ The approximation $q$ must cover the support of $p$.
Property 4: Decomposition for exponential families For distributions in the same exponential family, KL divergence has special closed forms, enabling tractable optimization.
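To see Property 3 numerically, here is a small sketch on made-up discrete distributions; `scipy.special.rel_entr` computes the elementwise terms $p_i \log(p_i/q_i)$ with the zero-handling KL requires.

```python
import numpy as np
from scipy.special import rel_entr  # elementwise p * log(p / q), with KL's zero conventions

p = np.array([0.4, 0.4, 0.2, 0.0])   # p puts mass on the first three outcomes
q = np.array([0.5, 0.5, 0.0, 0.0])   # q assigns zero probability to outcome 3

# p > 0 where q = 0  ->  infinite divergence
print("KL(p || q) =", rel_entr(p, q).sum())   # inf

# q's support is contained in p's  ->  finite divergence
print("KL(q || p) =", rel_entr(q, p).sum())   # ≈ 0.223 nats
```

The longer example below computes the KL divergence between two Gaussians both in closed form and by Monte Carlo.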
```python
import numpy as np
from scipy.stats import norm

def kl_divergence_gaussians(mu_p, sigma_p, mu_q, sigma_q):
    """
    Closed-form KL divergence between univariate Gaussians.

    KL(N(μ_p, σ_p²) || N(μ_q, σ_q²))
      = log(σ_q/σ_p) + (σ_p² + (μ_p - μ_q)²) / (2σ_q²) - 1/2
    """
    return (
        np.log(sigma_q / sigma_p)
        + (sigma_p**2 + (mu_p - mu_q)**2) / (2 * sigma_q**2)
        - 0.5
    )

def kl_divergence_monte_carlo(log_p, log_q, samples_from_p):
    """
    Monte Carlo estimate of KL(p || q).

    KL(p || q) ≈ (1/N) Σ [log p(z_i) - log q(z_i)]   where z_i ~ p
    """
    return np.mean(log_p(samples_from_p) - log_q(samples_from_p))

# Example: two Gaussians
p_mu, p_sigma = 0, 1
q_mu, q_sigma = 1, 2

# Analytical KL
kl_analytic = kl_divergence_gaussians(p_mu, p_sigma, q_mu, q_sigma)
print(f"KL(p || q) analytical: {kl_analytic:.4f}")

# Monte Carlo estimate
samples = np.random.normal(p_mu, p_sigma, 100000)
log_p = lambda z: norm.logpdf(z, p_mu, p_sigma)
log_q = lambda z: norm.logpdf(z, q_mu, q_sigma)
kl_mc = kl_divergence_monte_carlo(log_p, log_q, samples)
print(f"KL(p || q) Monte Carlo: {kl_mc:.4f}")

# Now compute the REVERSE KL
kl_reverse_analytic = kl_divergence_gaussians(q_mu, q_sigma, p_mu, p_sigma)
print(f"\nKL(q || p) analytical: {kl_reverse_analytic:.4f}")

# Note: KL(p || q) ≠ KL(q || p)!
# The difference reflects the asymmetric penalty structure.
print(f"\nDifference: {abs(kl_analytic - kl_reverse_analytic):.4f}")

# Special case: when supports don't match.
# If p has support where q = 0, KL(p || q) = ∞.
# This is why support coverage matters!
```

The asymmetry of KL divergence gives rise to two fundamentally different optimization objectives:
$$D_{\text{KL}}(p \| q) = \mathbb{E}_{p}\left[ \log \frac{p(\mathbf{z})}{q(\mathbf{z})} \right]$$
Expectation is over the true distribution $p$.
Minimizing forward KL means $q$ must place probability mass wherever $p$ does: any region where $p(\mathbf{z}) > 0$ but $q(\mathbf{z}) \approx 0$ incurs a large (potentially infinite) penalty.
Behavior: Mass-covering / Moment-matching / Inclusive
$$D_{\text{KL}}(q \| p) = \mathbb{E}_{q}\left[ \log \frac{q(\mathbf{z})}{p(\mathbf{z})} \right]$$
Expectation is over the approximate distribution $q$.
Minimizing reverse KL means $q$ must avoid placing mass where $p$ is small: regions where $q(\mathbf{z}) > 0$ but $p(\mathbf{z}) \approx 0$ are penalized heavily, while modes of $p$ that $q$ misses entirely cost nothing.
Behavior: Mode-seeking / Zero-forcing / Exclusive
Consider fitting a unimodal Gaussian $q$ to a bimodal target $p$:
Forward KL (moment-matching): The optimal $q$ will center between the two modes, trying to overlap with both. It will have high variance to "cover" both modes. But it places mass in the valley between modes where $p$ is low.
Reverse KL (mode-seeking): The optimal $q$ will lock onto one mode and match it tightly. It "ignores" the other mode entirely. The variance will be small, matching the single mode it captures.
Neither is "wrong"—they answer different questions:
For inference, we typically want reverse KL: we'd rather have a precise estimate of one mode than a diffuse estimate that doesn't match any mode well.
```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)

# True distribution: mixture of two Gaussians (bimodal)
def log_p(z):
    """Log of bimodal mixture: 0.5 * N(-2, 0.5²) + 0.5 * N(2, 0.5²)"""
    p1 = 0.5 * norm.pdf(z, -2, 0.5)
    p2 = 0.5 * norm.pdf(z, 2, 0.5)
    return np.log(p1 + p2 + 1e-10)

# Variational family: single Gaussian
def log_q(z, mu, sigma):
    return norm.logpdf(z, mu, sigma)

# Fixed random draws (common random numbers) so each objective is a
# deterministic function of the parameters and L-BFGS-B behaves well.
n_samples = 10000
component = rng.binomial(1, 0.5, n_samples)
p_samples = np.where(
    component,
    rng.normal(-2, 0.5, n_samples),
    rng.normal(2, 0.5, n_samples)
)
eps = rng.standard_normal(n_samples)

# FORWARD KL: D(p || q) - expectation under p, so we use samples from p
def forward_kl(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    # KL(p || q) = E_p[log p - log q]
    return np.mean(log_p(p_samples) - log_q(p_samples, mu, sigma))

# REVERSE KL: D(q || p) - expectation under q, so we reparameterize z = mu + sigma * eps
def reverse_kl(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    samples = mu + sigma * eps
    # KL(q || p) = E_q[log q - log p]
    return np.mean(log_q(samples, mu, sigma) - log_p(samples))

# Optimize forward KL
result_forward = minimize(forward_kl, x0=[0.0, 0.0], method='L-BFGS-B')
mu_forward, log_sigma_forward = result_forward.x
sigma_forward = np.exp(log_sigma_forward)
print(f"Forward KL optimal: μ={mu_forward:.2f}, σ={sigma_forward:.2f}")

# Optimize reverse KL (may converge to either mode depending on initialization)
result_reverse = minimize(reverse_kl, x0=[1.0, 0.0], method='L-BFGS-B')
mu_reverse, log_sigma_reverse = result_reverse.x
sigma_reverse = np.exp(log_sigma_reverse)
print(f"Reverse KL optimal: μ={mu_reverse:.2f}, σ={sigma_reverse:.2f}")

# Key observations:
# - Forward KL: μ ≈ 0, σ ≈ 2.1 (moment matching: covers both modes, high variance)
# - Reverse KL: μ ≈ ±2, σ ≈ 0.5 (locks onto one mode, matches its variance)

# This fundamental difference drives VI behavior!
```

"Q for Questions": In D(q || p), the expectation is over q. Think of it as asking: "If I sample from my approximation q, will I be surprised when I look up the true probability p?" Reverse KL minimizes this surprise by avoiding regions where p is small.
The mode-seeking behavior of reverse KL has profound implications for variational inference. Understanding these consequences is essential for interpreting VI results and knowing when the approximation might fail.
Consider the penalty structure of $D_{\text{KL}}(q \| p)$:
$$D_{\text{KL}}(q \| p) = \mathbb{E}_q\left[ \log \frac{q(\mathbf{z})}{p(\mathbf{z})} \right] = \mathbb{E}_q[\log q(\mathbf{z})] - \mathbb{E}_q[\log p(\mathbf{z})]$$
The critical term is $\mathbb{E}_q[\log p(\mathbf{z})]$: wherever $q$ places mass, $\log p(\mathbf{z})$ must not be extremely negative. If $q$ puts mass where $p(\mathbf{z}) \approx 0$, this term blows up; regions where $p$ is large but $q$ is near zero contribute nothing, because the expectation is taken under $q$.
This asymmetry means $q$ is punished for covering low-probability regions of $p$ but not punished for missing modes. The rational strategy: concentrate on a single mode and match it precisely.
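A back-of-the-envelope sketch of this penalty structure, using made-up density values: the contribution of a small region to $D_{\text{KL}}(q \| p)$ is roughly $q \log(q/p)$ times its width.

```python
import numpy as np

# Contribution of a small region of width dz to KL(q || p) is roughly q * log(q / p) * dz
dz = 0.01

# Case 1: q places mass where p is nearly zero -> large positive penalty
q_val, p_val = 0.3, 1e-6
print("q covers a low-p region: ", q_val * np.log(q_val / p_val) * dz)   # ≈ +0.038

# Case 2: p has a mode that q almost entirely misses -> essentially no penalty
q_val, p_val = 1e-6, 0.5
print("q misses a high-p region:", q_val * np.log(q_val / p_val) * dz)   # ≈ -1.3e-7
```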
| Scenario | Forward KL (Mass-Covering) | Reverse KL (Mode-Seeking) |
|---|---|---|
| Bimodal symmetric p | q centers between modes | q locks onto one mode |
| Heavy-tailed p | q matches tail behavior | q concentrates on mode, ignores tails |
| Skewed p | q covers entire support | q matches the mode precisely |
| Uniform p | q is broad/uniform | q is also broad (the entropy term favors spreading over p's support) |
| Uncertainty quantification | Overestimates uncertainty | Underestimates uncertainty |
1. Underestimated Uncertainty
Because reverse KL concentrates on modes, the resulting $q$ typically has smaller variance than the true posterior. This leads to overconfident predictions, credible intervals that are too narrow, and downstream decisions that underestimate risk.
2. Mode Collapse in Multi-Modal Posteriors
When the true posterior is multi-modal (common in mixture models, neural networks), VI will converge to a single mode and report it as if it were the whole posterior, silently discarding the probability mass in the other modes.
3. Sensitivity to Initialization
With multiple modes, which mode $q$ converges to depends on initialization. Different random seeds can give qualitatively different results.
4. Local Optima
The ELBO landscape inherits the multi-modality of the posterior. Gradient descent may get stuck in local optima corresponding to less probable modes.
```python
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

def demonstrate_mode_collapse():
    """
    Show how VI collapses to a single mode for a bimodal posterior.

    Setup: Bayesian inference on mean μ, with bimodal likelihood
    (e.g., from a mixture model or symmetric objective).
    """
    # True posterior is bimodal at μ = -2 and μ = +2
    def log_posterior(mu):
        """Log of bimodal posterior: mix of N(-2, 0.5²) and N(2, 0.5²)"""
        log_p1 = -0.5 * ((mu + 2) / 0.5)**2
        log_p2 = -0.5 * ((mu - 2) / 0.5)**2
        # Log-sum-exp for numerical stability
        log_mix = torch.logsumexp(torch.stack([log_p1, log_p2]), dim=0) - np.log(2)
        return log_mix

    class GaussianVI(nn.Module):
        """Simple Gaussian variational approximation."""
        def __init__(self, init_mu=0.0):
            super().__init__()
            self.mu = nn.Parameter(torch.tensor(init_mu))
            self.log_sigma = nn.Parameter(torch.tensor(0.0))

        def sample(self, n=100):
            sigma = torch.exp(self.log_sigma)
            eps = torch.randn(n)
            return self.mu + sigma * eps

        def log_q(self, z):
            sigma = torch.exp(self.log_sigma)
            return -0.5 * ((z - self.mu) / sigma)**2 - self.log_sigma - 0.5 * np.log(2 * np.pi)

        def elbo(self, n_samples=100):
            z = self.sample(n_samples)
            return (log_posterior(z) - self.log_q(z)).mean()

    # Run VI from different initializations
    results = []
    for init_mu in [-3.0, -1.0, 0.0, 1.0, 3.0]:
        model = GaussianVI(init_mu)
        optimizer = optim.Adam(model.parameters(), lr=0.1)
        for _ in range(500):
            optimizer.zero_grad()
            loss = -model.elbo()
            loss.backward()
            optimizer.step()
        final_mu = model.mu.item()
        final_sigma = torch.exp(model.log_sigma).item()
        results.append((init_mu, final_mu, final_sigma))
        print(f"Init μ={init_mu:+.1f} → Final μ={final_mu:+.2f}, σ={final_sigma:.3f}")

    # Observations:
    # - Init near -2: converges to μ ≈ -2
    # - Init near +2: converges to μ ≈ +2
    # - Init at 0: could go either way (depends on noise)
    # - VI NEVER finds both modes with a unimodal q!

    print("\n⚠️ Mode collapse: Each run finds only ONE mode.")
    print("True posterior has BOTH modes, but VI cannot represent this.")
    print("This is the fundamental limitation of reverse KL with simple families.")

demonstrate_mode_collapse()
```

Mode-seeking behavior is particularly problematic for:
• Safety-critical systems where rare events matter
• Bayesian model averaging where all modes contribute to predictions
• Exploration in reinforcement learning where underestimating uncertainty leads to poor policies
• Scientific inference where characterizing full posterior uncertainty is the goal
In these cases, consider MCMC, mixture variational families, or alternative divergences.
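To make consequence 1 (underestimated uncertainty) concrete, here is a minimal sketch of a classic textbook case: fitting a fully factorized (mean-field) Gaussian to a correlated 2-D Gaussian by minimizing reverse KL. For that target the optimal factor variances are the inverse diagonal of the precision matrix, which never exceeds the true marginal variance; the correlation value below is made up for illustration.

```python
import numpy as np

# Target: a correlated 2-D Gaussian
rho = 0.9
Sigma = np.array([[1.0, rho],
                  [rho, 1.0]])
Lambda = np.linalg.inv(Sigma)        # precision matrix

# Reverse-KL-optimal mean-field Gaussian: each factor gets variance 1 / Λ_ii
mean_field_var = 1.0 / np.diag(Lambda)
true_marginal_var = np.diag(Sigma)

print("True marginal variances:     ", true_marginal_var)    # [1.00, 1.00]
print("Mean-field (reverse KL) vars:", mean_field_var)        # [0.19, 0.19]
print("Ratio (approx / true):       ", mean_field_var / true_marginal_var)
# The factorized approximation reports the conditional variance (1 - ρ²) instead of the
# marginal variance, so reported uncertainty shrinks sharply as correlation grows.
```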
Given the mode-seeking limitations, why does variational inference use reverse KL? The answer lies in tractability—forward KL is essentially impossible to optimize directly.
Forward KL: $D_{\text{KL}}(p \| q)$
$$D_{\text{KL}}(p \| q) = \mathbb{E}_{p(\mathbf{z}|\mathbf{x})}\left[ \log \frac{p(\mathbf{z}|\mathbf{x})}{q(\mathbf{z})} \right]$$
To optimize this, we need to take expectations under the true posterior $p(\mathbf{z}|\mathbf{x})$, which means sampling from or integrating against exactly the distribution we are trying to approximate.
We cannot optimize forward KL without already having solved the inference problem.
Reverse KL: $D_{\text{KL}}(q \| p)$
$$D_{\text{KL}}(q \| p) = \mathbb{E}_{q(\mathbf{z})}\left[ \log \frac{q(\mathbf{z})}{p(\mathbf{z}|\mathbf{x})} \right]$$
To optimize this, we need only expectations under $q$ (which we can sample from) and evaluations of $\log p(\mathbf{z}|\mathbf{x})$, which at first glance still seems to require the intractable normalization constant $p(\mathbf{x})$.
But here's the key insight: We can rewrite reverse KL to avoid the normalization constant!
Expanding the reverse KL:
$$\begin{aligned} D_{\text{KL}}(q \| p(\cdot|\mathbf{x})) &= \mathbb{E}_q\left[ \log \frac{q(\mathbf{z})}{p(\mathbf{z}|\mathbf{x})} \right] \\ &= \mathbb{E}_q\left[ \log q(\mathbf{z}) - \log p(\mathbf{z}|\mathbf{x}) \right] \\ &= \mathbb{E}_q\left[ \log q(\mathbf{z}) - \log \frac{p(\mathbf{x}, \mathbf{z})}{p(\mathbf{x})} \right] \\ &= \mathbb{E}_q\left[ \log q(\mathbf{z}) - \log p(\mathbf{x}, \mathbf{z}) \right] + \log p(\mathbf{x}) \end{aligned}$$
Rearranging:
$$\log p(\mathbf{x}) = D_{\text{KL}}(q \| p(\cdot|\mathbf{x})) + \underbrace{\mathbb{E}_q\left[ \log p(\mathbf{x}, \mathbf{z}) - \log q(\mathbf{z}) \right]}_{\text{ELBO}(q)}$$
The Evidence Lower Bound (ELBO) is:
$$\mathcal{L}(q) = \mathbb{E}_q[\log p(\mathbf{x}, \mathbf{z})] - \mathbb{E}_q[\log q(\mathbf{z})]$$
Critical observation: The ELBO depends only on the joint $p(\mathbf{x}, \mathbf{z}) = p(\mathbf{x}|\mathbf{z})\,p(\mathbf{z})$, which we can evaluate pointwise, and on $q(\mathbf{z})$, which we chose to be tractable.
No normalization constant required!
$$\log p(\mathbf{x}) = \text{ELBO}(q) + D_{\text{KL}}(q \| p(\cdot|\mathbf{x}))$$
Since $D_{\text{KL}} \geq 0$, we have $\text{ELBO}(q) \leq \log p(\mathbf{x})$ — the ELBO is a lower bound on the log-evidence.
Maximizing the ELBO is equivalent to minimizing the reverse KL divergence!
This is why it's called the Evidence Lower BOund.
```python
import numpy as np

def illustrate_elbo_kl_relationship():
    """
    Demonstrate the fundamental identity:
    log p(x) = ELBO(q) + KL(q || p(z|x))

    When KL → 0, ELBO → log p(x)
    """
    # Simple example: Gaussian prior and likelihood
    # p(z) = N(0, 1)        (prior)
    # p(x|z) = N(z, 0.5²)   (likelihood)
    # Observed: x = 1.0
    x = 1.0
    prior_mu, prior_sigma = 0.0, 1.0
    likelihood_sigma = 0.5

    # True posterior (conjugate, so we know it)
    # p(z|x) = N(posterior_mu, posterior_sigma²)
    posterior_precision = 1/prior_sigma**2 + 1/likelihood_sigma**2
    posterior_sigma = 1/np.sqrt(posterior_precision)
    posterior_mu = posterior_sigma**2 * (prior_mu/prior_sigma**2 + x/likelihood_sigma**2)
    print(f"True posterior: N({posterior_mu:.3f}, {posterior_sigma:.3f}²)")

    # True log evidence (can compute for this simple case)
    # p(x) = ∫ p(x|z)p(z) dz = N(x | prior_mu, prior_sigma² + likelihood_sigma²)
    evidence_sigma = np.sqrt(prior_sigma**2 + likelihood_sigma**2)
    log_evidence = -0.5 * np.log(2*np.pi) - np.log(evidence_sigma) - 0.5 * ((x - prior_mu)/evidence_sigma)**2
    print(f"True log evidence: {log_evidence:.4f}")

    # Now compute ELBO for various q
    def compute_elbo_and_kl(q_mu, q_sigma, n_samples=10000):
        """Compute ELBO and KL for Gaussian q(z) = N(q_mu, q_sigma²)"""
        # Sample from q
        eps = np.random.randn(n_samples)
        z = q_mu + q_sigma * eps

        # log p(x, z) = log p(x|z) + log p(z)
        log_likelihood = -0.5 * np.log(2*np.pi) - np.log(likelihood_sigma) - 0.5 * ((x - z)/likelihood_sigma)**2
        log_prior = -0.5 * np.log(2*np.pi) - np.log(prior_sigma) - 0.5 * (z/prior_sigma)**2
        log_joint = log_likelihood + log_prior

        # log q(z)
        log_q = -0.5 * np.log(2*np.pi) - np.log(q_sigma) - 0.5 * ((z - q_mu)/q_sigma)**2

        # ELBO = E_q[log p(x,z) - log q(z)]
        elbo = np.mean(log_joint - log_q)

        # KL to true posterior (analytical for Gaussians)
        kl = np.log(posterior_sigma/q_sigma) + (q_sigma**2 + (q_mu - posterior_mu)**2)/(2*posterior_sigma**2) - 0.5
        return elbo, kl

    print("\nVarying q and checking ELBO + KL = log p(x):")
    print("-" * 60)
    for q_mu in [0.0, 0.5, posterior_mu]:            # Different means
        for q_sigma in [0.3, posterior_sigma, 1.0]:  # Different variances
            elbo, kl = compute_elbo_and_kl(q_mu, q_sigma)
            total = elbo + kl
            print(f"q=N({q_mu:.2f}, {q_sigma:.2f}²): ELBO={elbo:.4f}, KL={kl:.4f}, Sum={total:.4f}")

    print(f"\nTrue log p(x) = {log_evidence:.4f}")
    print("Note: ELBO + KL always equals log p(x) (up to MC noise)")

illustrate_elbo_kl_relationship()
```

While KL divergence dominates variational inference, alternative divergences address specific limitations. Understanding these alternatives helps you choose the right tool for your problem.
The α-divergence family generalizes KL:
$$D_\alpha(p \| q) = \frac{1}{\alpha(\alpha - 1)} \left( \int p(\mathbf{z})^\alpha q(\mathbf{z})^{1-\alpha} d\mathbf{z} - 1 \right)$$
Special cases: as $\alpha \to 1$ the formula recovers the forward KL $D_{\text{KL}}(p \| q)$; as $\alpha \to 0$ it recovers the reverse KL $D_{\text{KL}}(q \| p)$; and $\alpha = 1/2$ gives a symmetric divergence proportional to the squared Hellinger distance.
Variational Rényi bounds allow trading off between mass-covering ($\alpha > 0$) and mode-seeking ($\alpha < 0$) behavior.
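A small numerical sketch of these limits, approximating the integral on a grid; the two Gaussians are arbitrary choices, and $\alpha = 0.01$ and $\alpha = 0.99$ stand in for the exact limits:

```python
import numpy as np
from scipy.stats import norm

# Two example densities evaluated on a grid
z = np.linspace(-15, 15, 20001)
dz = z[1] - z[0]
p = norm.pdf(z, loc=0.0, scale=1.0)
q = norm.pdf(z, loc=1.0, scale=2.0)

def alpha_divergence(p, q, alpha):
    """D_alpha(p || q) = (∫ p^alpha q^(1-alpha) dz - 1) / (alpha * (alpha - 1))."""
    integral = np.sum(p**alpha * q**(1.0 - alpha)) * dz
    return (integral - 1.0) / (alpha * (alpha - 1.0))

kl_pq = np.sum(p * np.log(p / q)) * dz   # forward KL(p || q)
kl_qp = np.sum(q * np.log(q / p)) * dz   # reverse KL(q || p)

print(f"alpha=0.99: {alpha_divergence(p, q, 0.99):.4f}   (forward KL = {kl_pq:.4f})")
print(f"alpha=0.01: {alpha_divergence(p, q, 0.01):.4f}   (reverse KL = {kl_qp:.4f})")
print(f"alpha=0.50: {alpha_divergence(p, q, 0.50):.4f}   (balanced, Hellinger-like)")
```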
| Divergence | Behavior | Tractability | Use Case |
|---|---|---|---|
| Reverse KL | Mode-seeking | Tractable (ELBO) | Standard VI, VAEs |
| Forward KL | Mass-covering | Requires posterior samples | Expectation Propagation |
| α-divergence (α=0.5) | Balanced | Moderate complexity | When uncertainty matters |
| f-divergences | Flexible | Varies | Research, specialized cases |
| Wasserstein | Geometry-aware | Computationally expensive | When support mismatch is an issue |
| Stein discrepancy | Score-based | Kernel choice required | Implicit variational inference |
The Importance-Weighted ELBO (IWAE) tightens the bound using multiple samples:
$$\mathcal{L}_K = \mathbb{E}_{\mathbf{z}_1, \ldots, \mathbf{z}_K \sim q}\left[ \log \frac{1}{K} \sum_{k=1}^{K} \frac{p(\mathbf{x}, \mathbf{z}_k)}{q(\mathbf{z}_k)} \right]$$
As $K \to \infty$, $\mathcal{L}_K \to \log p(\mathbf{x})$. This provides a spectrum between variational (K=1) and Monte Carlo (K→∞) methods.
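A quick Monte Carlo sketch of this tightening, reusing the earlier toy conjugate model (prior $N(0,1)$, likelihood $N(z, 0.5^2)$, observation $x = 1$); the specific $q = N(0.3, 1.5^2)$ is a deliberately mediocre, made-up choice so the gap is visible.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(0)

# Same toy model as the earlier ELBO example
x = 1.0
prior = norm(0.0, 1.0)
lik_sigma = 0.5
log_evidence = norm(0.0, np.sqrt(1.0 + lik_sigma**2)).logpdf(x)  # exact log p(x)

# A deliberately mediocre variational distribution
q = norm(0.3, 1.5)

def iwae_bound(K, n_outer=2000):
    """Monte Carlo estimate of L_K = E[ log (1/K) Σ_k p(x, z_k) / q(z_k) ], z_k ~ q."""
    z = q.rvs(size=(n_outer, K), random_state=rng)
    log_w = norm(z, lik_sigma).logpdf(x) + prior.logpdf(z) - q.logpdf(z)
    # log of the average importance weight over K samples, then average over repetitions
    return np.mean(logsumexp(log_w, axis=1) - np.log(K))

for K in [1, 5, 50, 500]:
    print(f"K={K:4d}: L_K ≈ {iwae_bound(K):.4f}")
print(f"log p(x) = {log_evidence:.4f}")
# L_1 is the standard ELBO; the bound tightens toward log p(x) as K grows.
```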
When to use standard reverse KL (ELBO): when you need fast, scalable inference; when a precise summary of the dominant posterior mode is sufficient; and when training deep generative models such as VAEs, where the ELBO doubles as a training objective.
When to consider alternatives: when calibrated uncertainty is the primary deliverable, when the posterior is strongly multi-modal or heavy-tailed, or when diagnostics show the variational approximation badly underestimates posterior variance.
In practice, the choice of divergence matters less than the choice of variational family. A rich variational family (e.g., flows) with reverse KL often outperforms a simple family with a "better" divergence. Start with standard VI; switch divergences only if diagnostics indicate the approximation is failing for divergence-related reasons.
You now understand why KL divergence is used in variational inference and the profound implications of its asymmetric, mode-seeking nature. The next page derives the ELBO from first principles, showing exactly how minimizing reverse KL leads to a tractable optimization objective.