The Evidence Lower Bound (ELBO) is the central quantity in variational inference. It serves simultaneously as a lower bound on the log-evidence $\log p(\mathbf{x})$ and as the tractable objective that variational inference maximizes in place of the intractable posterior.
This page derives the ELBO rigorously from multiple perspectives. Each perspective offers different insights, and together they build a complete understanding of why the ELBO works and what it means.
By the end of this page, you will not just know the ELBO formula—you will understand why it takes this form and what each term contributes.
By completing this page, you will: (1) Derive the ELBO from Jensen's inequality, (2) Derive the ELBO from the KL divergence identity, (3) Decompose the ELBO into reconstruction and regularization terms, (4) Understand the rate-distortion interpretation, and (5) Compute the ELBO for concrete examples.
Let's begin by stating the ELBO clearly. Given observed data $\mathbf{x}$, latent variables $\mathbf{z}$, a joint model $p(\mathbf{x}, \mathbf{z})$, and a variational distribution $q(\mathbf{z})$ over the latents:
The Evidence Lower Bound (ELBO) is:
$$\mathcal{L}(q) = \mathbb{E}_{q(\mathbf{z})}[\log p(\mathbf{x}, \mathbf{z})] - \mathbb{E}_{q(\mathbf{z})}[\log q(\mathbf{z})]$$
Equivalently:
$$\mathcal{L}(q) = \mathbb{E}_{q(\mathbf{z})}[\log p(\mathbf{x}, \mathbf{z}) - \log q(\mathbf{z})] = \mathbb{E}_{q(\mathbf{z})}\left[ \log \frac{p(\mathbf{x}, \mathbf{z})}{q(\mathbf{z})} \right]$$
Or, decomposed:
$$\mathcal{L}(q) = \underbrace{\mathbb{E}_{q(\mathbf{z})}[\log p(\mathbf{x} | \mathbf{z})]}_{\text{Expected log-likelihood}} - \underbrace{D_{\text{KL}}(q(\mathbf{z}) \,\|\, p(\mathbf{z}))}_{\text{KL to prior}}$$
This decomposition separates the objective into a reconstruction term (how well latents explain data) and a regularization term (how close $q$ is to the prior).
The ELBO can be written as:
Form 1 (Expectations): $\mathbb{E}_q[\log p(\mathbf{x}, \mathbf{z})] - \mathbb{E}_q[\log q(\mathbf{z})]$
Form 2 (Ratio): $\mathbb{E}_q\big[\log \frac{p(\mathbf{x}, \mathbf{z})}{q(\mathbf{z})}\big]$
Form 3 (Decomposed): $\mathbb{E}_q[\log p(\mathbf{x}|\mathbf{z})] - D_{\text{KL}}(q(\mathbf{z}) \,\|\, p(\mathbf{z}))$
All three are identical; use whichever is most convenient for your derivation.
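To make the equivalence concrete, here is a minimal sketch (not taken from the text; it uses a hypothetical two-state latent with made-up probabilities) that evaluates all three forms exactly and confirms they agree, and that the result lower-bounds $\log p(\mathbf{x})$:

```python
import numpy as np

# Hypothetical toy model with a binary latent z in {0, 1}
p_z = np.array([0.5, 0.5])          # prior p(z)
p_x_given_z = np.array([0.8, 0.1])  # likelihood p(x | z) for the observed x
p_xz = p_z * p_x_given_z            # joint p(x, z)
q_z = np.array([0.7, 0.3])          # an arbitrary variational distribution

# Form 1: E_q[log p(x,z)] - E_q[log q(z)]
form1 = np.sum(q_z * np.log(p_xz)) - np.sum(q_z * np.log(q_z))

# Form 2: E_q[log (p(x,z) / q(z))]
form2 = np.sum(q_z * np.log(p_xz / q_z))

# Form 3: E_q[log p(x|z)] - KL(q(z) || p(z))
kl_q_prior = np.sum(q_z * np.log(q_z / p_z))
form3 = np.sum(q_z * np.log(p_x_given_z)) - kl_q_prior

print(form1, form2, form3)                 # all three match
print(np.allclose([form1, form2], form3))  # True
print("log p(x) =", np.log(p_xz.sum()), ">= ELBO =", form1)
```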
The ELBO satisfies:
$$\mathcal{L}(q) \leq \log p(\mathbf{x})$$
with equality if and only if $q(\mathbf{z}) = p(\mathbf{z} | \mathbf{x})$ (the true posterior).
This property, that the ELBO lower-bounds the log-evidence, is why we call it the Evidence Lower BOund. It allows us to maximize a tractable quantity while never overestimating $\log p(\mathbf{x})$, and to read the gap as a measure of how well $q$ approximates the posterior.
Let's now derive this bound from multiple angles.
The most direct derivation uses Jensen's inequality, which states that for a concave function $f$ and random variable $X$:
$$f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)]$$
with equality when $X$ is constant.
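As a quick numerical illustration (a made-up two-point random variable, not tied to any model on this page), the gap between $\log \mathbb{E}[X]$ and $\mathbb{E}[\log X]$ can be computed directly:

```python
import numpy as np

# A hypothetical two-point random variable: X = 1 or 4, each with probability 1/2
probs = np.array([0.5, 0.5])
values = np.array([1.0, 4.0])

log_of_mean = np.log(np.sum(probs * values))   # log E[X]  = log 2.5 ~ 0.916
mean_of_log = np.sum(probs * np.log(values))   # E[log X] = 0.5 * log 4 ~ 0.693

print(log_of_mean >= mean_of_log)   # True: log is concave
print(log_of_mean - mean_of_log)    # the Jensen gap (> 0 because X is not constant)
```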
We start with the log-marginal likelihood (log-evidence):
$$\log p(\mathbf{x}) = \log \int p(\mathbf{x}, \mathbf{z}) \, d\mathbf{z}$$
We introduce the variational distribution $q(\mathbf{z})$ by multiplying and dividing:
$$\log p(\mathbf{x}) = \log \int q(\mathbf{z}) \frac{p(\mathbf{x}, \mathbf{z})}{q(\mathbf{z})} \, d\mathbf{z}$$
This is an expectation under $q$:
$$\log p(\mathbf{x}) = \log \mathbb{E}_{q(\mathbf{z})}\left[ \frac{p(\mathbf{x}, \mathbf{z})}{q(\mathbf{z})} \right]$$
Now apply Jensen's inequality (log is concave, so moving the logarithm inside the expectation can only decrease the value):
$$\log \mathbb{E}_{q}\left[ \frac{p(\mathbf{x}, \mathbf{z})}{q(\mathbf{z})} \right] \geq \mathbb{E}_{q}\left[ \log \frac{p(\mathbf{x}, \mathbf{z})}{q(\mathbf{z})} \right]$$
Therefore:
$$\log p(\mathbf{x}) \geq \mathbb{E}_{q(\mathbf{z})}\left[ \log \frac{p(\mathbf{x}, \mathbf{z})}{q(\mathbf{z})} \right] = \mathcal{L}(q)$$
This is the ELBO!
Jensen's inequality becomes an equality when the random variable is constant:
$$\frac{p(\mathbf{x}, \mathbf{z})}{q(\mathbf{z})} = c \quad \text{for all } \mathbf{z}$$
This means:
$$q(\mathbf{z}) = \frac{p(\mathbf{x}, \mathbf{z})}{c} = \frac{p(\mathbf{x}, \mathbf{z})}{p(\mathbf{x})} = p(\mathbf{z} | \mathbf{x})$$
The bound is tight exactly when $q$ equals the true posterior. In general, the gap between $\mathcal{L}(q)$ and $\log p(\mathbf{x})$ measures how far $q$ is from the posterior.
```python
import numpy as np
from scipy.stats import norm

def demonstrate_jensen_bound():
    """
    Demonstrate the Jensen's inequality derivation of ELBO.

    We show that log E[f(z)] >= E[log f(z)] and how the gap
    relates to variational approximation quality.
    """
    # Setup: Simple Gaussian model
    # Prior: p(z) = N(0, 1)
    # Likelihood: p(x|z) = N(z, σ²), observe x = 2.0
    x = 2.0
    prior_mu, prior_sigma = 0.0, 1.0
    likelihood_sigma = 0.5

    # True posterior (conjugate case)
    post_precision = 1/prior_sigma**2 + 1/likelihood_sigma**2
    post_sigma = 1/np.sqrt(post_precision)
    post_mu = post_sigma**2 * (prior_mu/prior_sigma**2 + x/likelihood_sigma**2)

    # True log evidence (Gaussian integral)
    evidence_sigma = np.sqrt(prior_sigma**2 + likelihood_sigma**2)
    log_evidence = norm.logpdf(x, prior_mu, evidence_sigma)

    print(f"True log p(x): {log_evidence:.4f}")
    print(f"True posterior: N({post_mu:.3f}, {post_sigma:.3f}²)")
    print()

    def compute_elbo(q_mu, q_sigma, n_samples=100000):
        """
        Compute ELBO for Gaussian q(z) = N(q_mu, q_sigma²).

        ELBO = E_q[log p(x,z) / q(z)]
             = E_q[log p(x|z) + log p(z) - log q(z)]
        """
        # Sample from q
        z_samples = np.random.normal(q_mu, q_sigma, n_samples)

        # log p(x|z): Gaussian likelihood
        log_likelihood = norm.logpdf(x, z_samples, likelihood_sigma)

        # log p(z): Gaussian prior
        log_prior = norm.logpdf(z_samples, prior_mu, prior_sigma)

        # log q(z): Variational distribution
        log_q = norm.logpdf(z_samples, q_mu, q_sigma)

        # ELBO = E_q[log p(x,z) - log q(z)]
        elbo = np.mean(log_likelihood + log_prior - log_q)
        return elbo

    # Compare different variational approximations
    print("Variational approximations:")
    print("-" * 55)

    test_cases = [
        (0.0, 1.0, "Prior as variational"),
        (2.0, 0.5, "Centered on x"),
        (post_mu, post_sigma, "Exact posterior"),
        (post_mu, 2*post_sigma, "Too wide"),
        (post_mu, 0.5*post_sigma, "Too narrow"),
    ]

    for q_mu, q_sigma, description in test_cases:
        elbo = compute_elbo(q_mu, q_sigma)
        gap = log_evidence - elbo
        print(f"{description:25s}: ELBO={elbo:.4f}, Gap={gap:.4f}")

    print()
    print("Note: Gap ≈ 0 only when q = true posterior!")
    print("The gap IS the KL divergence D(q || p(z|x)).")

demonstrate_jensen_bound()
```

An alternative derivation starts from the KL divergence between $q$ and the posterior $p(\mathbf{z}|\mathbf{x})$ and rearranges to isolate the ELBO.
Begin with the definition of reverse KL:
$$D_{\text{KL}}(q(\mathbf{z}) \,\|\, p(\mathbf{z}|\mathbf{x})) = \mathbb{E}_{q}\left[ \log \frac{q(\mathbf{z})}{p(\mathbf{z}|\mathbf{x})} \right]$$
Expand the posterior using Bayes' rule:
$$p(\mathbf{z}|\mathbf{x}) = \frac{p(\mathbf{x}, \mathbf{z})}{p(\mathbf{x})}$$
Substitute:
$$\begin{aligned} D_{\text{KL}}(q \,\|\, p(\cdot|\mathbf{x})) &= \mathbb{E}_{q}\left[ \log \frac{q(\mathbf{z}) \cdot p(\mathbf{x})}{p(\mathbf{x}, \mathbf{z})} \right] \\ &= \mathbb{E}_{q}\left[ \log q(\mathbf{z}) - \log p(\mathbf{x}, \mathbf{z}) + \log p(\mathbf{x}) \right] \\ &= \mathbb{E}_{q}[\log q(\mathbf{z})] - \mathbb{E}_{q}[\log p(\mathbf{x}, \mathbf{z})] + \log p(\mathbf{x}) \end{aligned}$$
Note that $\log p(\mathbf{x})$ is constant with respect to $\mathbf{z}$, so it comes out of the expectation.
Rearrange to isolate $\log p(\mathbf{x})$:
$$\log p(\mathbf{x}) = D_{\text{KL}}(q \,\|\, p(\cdot|\mathbf{x})) + \mathbb{E}_{q}[\log p(\mathbf{x}, \mathbf{z})] - \mathbb{E}_{q}[\log q(\mathbf{z})]$$
Define the ELBO:
$$\mathcal{L}(q) = \mathbb{E}_{q}[\log p(\mathbf{x}, \mathbf{z})] - \mathbb{E}_{q}[\log q(\mathbf{z})]$$
Then we have the fundamental identity:
$$\boxed{\log p(\mathbf{x}) = \mathcal{L}(q) + D_{\text{KL}}(q(\mathbf{z}) \,\|\, p(\mathbf{z}|\mathbf{x}))}$$
1. ELBO is always a lower bound
Since $D_{\text{KL}} \geq 0$: $$\mathcal{L}(q) = \log p(\mathbf{x}) - D_{\text{KL}}(q \,\|\, p(\cdot|\mathbf{x})) \leq \log p(\mathbf{x})$$
2. Maximizing ELBO minimizes KL to posterior
Since $\log p(\mathbf{x})$ is fixed (doesn't depend on $q$): $$\max_q \mathcal{L}(q) \iff \min_q D_{\text{KL}}(q \,\|\, p(\cdot|\mathbf{x}))$$
Maximizing the ELBO is exactly equivalent to minimizing the KL divergence to the true posterior!
3. The gap is the KL divergence
$$\log p(\mathbf{x}) - \mathcal{L}(q) = D_{\text{KL}}(q \,\|\, p(\cdot|\mathbf{x}))$$
The difference between ELBO and log-evidence quantifies approximation quality.
$$\log p(\mathbf{x}) = \mathcal{L}(q) + D_{\text{KL}}(q \,\|\, p(\mathbf{z}|\mathbf{x}))$$
This equation is the conceptual heart of VI. Memorize it. Everything else follows:
• $\mathcal{L}(q) \leq \log p(\mathbf{x})$ because KL ≥ 0
• Maximizing ELBO minimizes KL to posterior
• When $q = p(\mathbf{z}|\mathbf{x})$, KL = 0 and ELBO = log evidence
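To make the identity tangible, here is a minimal sketch (assuming the same conjugate Gaussian model as the Jensen's-inequality demo above; the helpers `kl_gauss` and `elbo_mc` are illustrative names, not from the original code) that checks $\log p(\mathbf{x}) = \mathcal{L}(q) + D_{\text{KL}}(q \,\|\, p(\mathbf{z}|\mathbf{x}))$ numerically for an imperfect $q$:

```python
import numpy as np
from scipy.stats import norm

# Same conjugate Gaussian setup as the Jensen's-inequality demo:
# p(z) = N(0, 1), p(x|z) = N(z, 0.5^2), observed x = 2.0
x, prior_sigma, lik_sigma = 2.0, 1.0, 0.5
post_var = 1.0 / (1/prior_sigma**2 + 1/lik_sigma**2)
post_mu = post_var * (x / lik_sigma**2)
log_evidence = norm.logpdf(x, 0.0, np.sqrt(prior_sigma**2 + lik_sigma**2))

def kl_gauss(mu_q, var_q, mu_p, var_p):
    """Closed-form KL(N(mu_q, var_q) || N(mu_p, var_p))."""
    return 0.5 * (np.log(var_p / var_q) + (var_q + (mu_q - mu_p)**2) / var_p - 1)

def elbo_mc(q_mu, q_sigma, n=200_000):
    """Monte Carlo ELBO for q(z) = N(q_mu, q_sigma^2)."""
    z = np.random.normal(q_mu, q_sigma, n)
    return np.mean(norm.logpdf(x, z, lik_sigma) + norm.logpdf(z, 0, prior_sigma)
                   - norm.logpdf(z, q_mu, q_sigma))

# Pick an imperfect q and check: log p(x) = ELBO(q) + KL(q || p(z|x))
q_mu, q_sigma = 1.0, 0.8
lhs = log_evidence
rhs = elbo_mc(q_mu, q_sigma) + kl_gauss(q_mu, q_sigma**2, post_mu, post_var)
print(f"log p(x) = {lhs:.4f},  ELBO + KL = {rhs:.4f}")  # agree up to MC noise
```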
The ELBO has a natural decomposition that provides intuition for what variational inference optimizes:
$$\mathcal{L}(q) = \underbrace{\mathbb{E}_{q(\mathbf{z})}[\log p(\mathbf{x} | \mathbf{z})]}_{\text{Reconstruction term}} - \underbrace{D_{\text{KL}}(q(\mathbf{z}) \,\|\, p(\mathbf{z}))}_{\text{Regularization term}}$$
This decomposition emerges naturally from writing the joint as $p(\mathbf{x}, \mathbf{z}) = p(\mathbf{x}|\mathbf{z}) p(\mathbf{z})$.
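Writing out that single step:

$$\mathcal{L}(q) = \mathbb{E}_{q}\left[ \log \frac{p(\mathbf{x}|\mathbf{z})\, p(\mathbf{z})}{q(\mathbf{z})} \right] = \mathbb{E}_{q}[\log p(\mathbf{x}|\mathbf{z})] + \mathbb{E}_{q}\left[ \log \frac{p(\mathbf{z})}{q(\mathbf{z})} \right] = \mathbb{E}_{q}[\log p(\mathbf{x}|\mathbf{z})] - D_{\text{KL}}(q(\mathbf{z}) \,\|\, p(\mathbf{z}))$$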
The reconstruction term $\mathbb{E}_{q(\mathbf{z})}[\log p(\mathbf{x}|\mathbf{z})]$ measures how well the latent variables explain the observed data.
In VAEs: This is the expected negative reconstruction error. The decoder $p(\mathbf{x}|\mathbf{z})$ tries to reconstruct $\mathbf{x}$ from $\mathbf{z}$; this term rewards accurate reconstruction.
The regularization term $D_{\text{KL}}(q(\mathbf{z}) \,\|\, p(\mathbf{z}))$ penalizes deviation of $q$ from the prior.
In VAEs: Keeps the latent space structured. Without this term, the encoder could map each input to an arbitrary, unstructured point in latent space.
Maximizing the ELBO requires balancing these competing objectives:
Too much focus on reconstruction: $q$ drifts far from the prior, and the latent space loses the structure the prior was meant to impose.
Too much focus on regularization: $q$ stays close to the prior and encodes little information about the data, so reconstructions suffer.
The optimal balance depends on the model and the data. In practice, a common pathology is posterior collapse, where the KL term dominates and $q$ simply equals the prior, ignoring the data entirely.
To control this trade-off, the β-VAE introduces a hyperparameter:
$$\mathcal{L}_\beta(q) = \mathbb{E}_q[\log p(\mathbf{x} | \mathbf{z})] - \beta \cdot D_{\text{KL}}(q(\mathbf{z}) \,\|\, p(\mathbf{z}))$$
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """
    Variational Autoencoder demonstrating ELBO decomposition.

    ELBO = E_q[log p(x|z)] - KL(q(z|x) || p(z))
         = -Reconstruction_Loss - KL_Regularization
    """
    def __init__(self, input_dim, hidden_dim, latent_dim):
        super().__init__()
        # Encoder: x -> (mu, log_var)
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)

        # Decoder: z -> x_recon
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
        )

    def encode(self, x):
        h = self.encoder(x)
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, log_var):
        """z = mu + sigma * epsilon, epsilon ~ N(0, I)"""
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        return mu + std * eps

    def decode(self, z):
        return self.decoder(z)

    def forward(self, x):
        mu, log_var = self.encode(x)
        z = self.reparameterize(mu, log_var)
        x_recon = self.decode(z)
        return x_recon, mu, log_var

    def compute_elbo_components(self, x, beta=1.0):
        """
        Compute and return ELBO components separately.

        Returns:
            elbo: Total ELBO (to maximize)
            recon_term: E_q[log p(x|z)]
            kl_term: D_KL(q(z|x) || p(z))
        """
        x_recon, mu, log_var = self.forward(x)

        # Reconstruction term: E_q[log p(x|z)]
        # For Gaussian p(x|z) with unit variance: proportional to -MSE
        recon_loss = F.mse_loss(x_recon, x, reduction='sum')
        recon_term = -recon_loss  # Negative because ELBO is to be maximized

        # KL term: D_KL(N(mu, sigma^2) || N(0, I))
        # = 0.5 * sum(sigma^2 + mu^2 - 1 - log(sigma^2))
        kl_term = 0.5 * torch.sum(torch.exp(log_var) + mu**2 - 1 - log_var)

        # ELBO = Recon - beta * KL
        elbo = recon_term - beta * kl_term
        return elbo, recon_term, kl_term

# Example usage
vae = VAE(input_dim=784, hidden_dim=256, latent_dim=32)
x = torch.randn(64, 784)  # Batch of 64 samples

elbo, recon, kl = vae.compute_elbo_components(x, beta=1.0)
print(f"ELBO: {elbo.item():.2f}")
print(f"Reconstruction term: {recon.item():.2f}")
print(f"KL regularization: {kl.item():.2f}")
print(f"Check: Recon - KL = {recon.item() - kl.item():.2f} ≈ ELBO")

# Training: maximize ELBO = minimize -ELBO
loss = -elbo / 64  # Average over batch
print(f"\nTraining loss (per sample): {loss.item():.2f}")
```

The ELBO has a beautiful interpretation in terms of information theory. We can rewrite it using entropy and mutual information:
$$\mathcal{L}(q) = \mathbb{E}_q[\log p(\mathbf{x}|\mathbf{z})] + H[q(\mathbf{z})] - H[p(\mathbf{z})]_{q}$$
where $H[q] = -\mathbb{E}_q[\log q]$ is entropy and $H[p]_q = -\mathbb{E}_q[\log p]$ is cross-entropy.
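This form follows directly from expanding the KL term in the earlier decomposition:

$$D_{\text{KL}}(q(\mathbf{z}) \,\|\, p(\mathbf{z})) = \mathbb{E}_q[\log q(\mathbf{z})] - \mathbb{E}_q[\log p(\mathbf{z})] = -H[q(\mathbf{z})] + H[p(\mathbf{z})]_q$$

Substituting this into $\mathcal{L}(q) = \mathbb{E}_q[\log p(\mathbf{x}|\mathbf{z})] - D_{\text{KL}}(q(\mathbf{z}) \,\|\, p(\mathbf{z}))$ gives the expression above.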
Consider the VAE setting where we encode observations $\mathbf{x}$ into latents $\mathbf{z}$. The ELBO can be viewed through the lens of rate-distortion theory:
$$\mathcal{L} = -\underbrace{D(\mathbf{x}, \hat{\mathbf{x}})}_{\text{Distortion}} - \underbrace{I_q(\mathbf{x}; \mathbf{z})}_{\text{Rate}}$$
where $D(\mathbf{x}, \hat{\mathbf{x}})$ is the distortion (the expected reconstruction error) and $I_q(\mathbf{x}; \mathbf{z})$ is the rate (the amount of information the latent code carries about the data under $q$).
The ELBO objective naturally balances these: use enough "bits" (information in $\mathbf{z}$) to reconstruct well, but not more than necessary.
The ELBO can also be written as:
$$\mathcal{L}(q) = \mathbb{E}_q[\log p(\mathbf{x}, \mathbf{z})] + H[q(\mathbf{z})]$$
This separates the objective into an expected log-joint term, which pulls $q$ toward configurations where $p(\mathbf{x}, \mathbf{z})$ is large, and an entropy term $H[q(\mathbf{z})]$, which rewards $q$ for keeping its mass spread out.
Maximizing the ELBO therefore asks $q$ to both concentrate on high-probability regions of the joint and remain as broad (high-entropy) as possible.
This tension prevents the variational distribution from collapsing trivially.
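A minimal sketch of this tension, reusing the conjugate Gaussian toy model from earlier (true posterior $\mathcal{N}(1.6,\, 0.2)$); the loop below is illustrative, not part of the original examples. As $q$ narrows, the expected log-joint rises but the entropy term drops, and the ELBO peaks when $q$ matches the posterior width:

```python
import numpy as np
from scipy.stats import norm

# Same toy model as before: p(z) = N(0,1), p(x|z) = N(z, 0.5^2), observed x = 2.0
x, lik_sigma = 2.0, 0.5
post_mu, post_sigma = 1.6, np.sqrt(0.2)   # true posterior N(1.6, 0.2)

for q_sigma in [1.0, 0.7, post_sigma, 0.2, 0.05]:
    z = np.random.normal(post_mu, q_sigma, 200_000)   # q centered at posterior mean
    exp_log_joint = np.mean(norm.logpdf(x, z, lik_sigma) + norm.logpdf(z, 0, 1))
    entropy_q = 0.5 * np.log(2 * np.pi * np.e * q_sigma**2)  # Gaussian entropy
    print(f"sigma={q_sigma:.3f}: E_q[log p(x,z)]={exp_log_joint:7.3f}, "
          f"H[q]={entropy_q:6.3f}, ELBO={exp_log_joint + entropy_q:7.3f}")
```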
In coding theory, the ELBO represents the expected codelength for transmitting $\mathbf{x}$ via latent $\mathbf{z}$:
$$-\mathcal{L}(q) = \underbrace{-\mathbb{E}_q[\log p(\mathbf{x}|\mathbf{z})]}_{\text{Bits to encode x given z}} + \underbrace{D_{\text{KL}}(q \,\|\, p)}_{\text{Bits for z beyond prior}}$$
The "bits back" idea: we pay $D_{\text{KL}}$ bits to encode $\mathbf{z}$, but can recover some bits because $\mathbf{z}$ was drawn stochastically.
The ELBO unifies several perspectives:
Statistical: Approximate the posterior by minimizing KL divergence
Probabilistic: Lower bound the model evidence
Information-theoretic: Trade off rate (compression) and distortion (reconstruction)
Coding-theoretic: Minimize expected codelength for transmission
Each perspective illuminates different aspects of what VI optimizes.
The ELBO involves expectations under $q$, which generally don't have closed forms. Here are the main strategies for computing it:
For some models, all expectations have closed forms:
Example: Gaussian prior, Gaussian variational family
$$p(\mathbf{z}) = \mathcal{N}(0, \mathbf{I}), \quad q(\mathbf{z}) = \mathcal{N}(\boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2))$$
The KL term is:
$$D_{\text{KL}}(q \,\|\, p) = \frac{1}{2} \sum_j \left( \sigma_j^2 + \mu_j^2 - 1 - \log \sigma_j^2 \right)$$
This is exact and differentiable—no sampling required for the regularization term.
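As a sanity check (a sketch, not part of the original example code), the closed-form expression can be compared against `torch.distributions.kl_divergence` and a Monte Carlo estimate; the parameter values below are the same illustrative ones used in the code example further down:

```python
import torch
from torch.distributions import Normal, kl_divergence

mu = torch.tensor([0.5, -0.3, 1.2])
log_var = torch.tensor([-0.5, 0.2, -0.1])
sigma = torch.exp(0.5 * log_var)

# Closed-form expression from the text
kl_manual = 0.5 * torch.sum(torch.exp(log_var) + mu**2 - 1 - log_var)

# Library result: sum of per-dimension KLs for a diagonal Gaussian vs. N(0, I)
q = Normal(mu, sigma)
p = Normal(torch.zeros(3), torch.ones(3))
kl_library = kl_divergence(q, p).sum()

# Monte Carlo estimate of the same quantity
z = q.sample((100_000,))
kl_mc = (q.log_prob(z) - p.log_prob(z)).sum(dim=-1).mean()

print(kl_manual.item(), kl_library.item(), kl_mc.item())  # all approximately equal
```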
For general models, we estimate the ELBO via sampling:
$$\mathcal{L}(q) \approx \frac{1}{K} \sum_{k=1}^{K} \left[ \log p(\mathbf{x}, \mathbf{z}^{(k)}) - \log q(\mathbf{z}^{(k)}) \right]$$
where $\mathbf{z}^{(k)} \sim q(\mathbf{z})$.
This is unbiased but has variance. Common variance reduction techniques include the reparameterization trick (covered on the next page), control variates, and Rao-Blackwellization.
```python
import torch
import numpy as np

def elbo_closed_form_gaussian():
    """
    Compute ELBO with closed-form KL for Gaussian q and Gaussian prior.
    This is the standard computation in VAEs.
    """
    # Variational parameters
    mu = torch.tensor([0.5, -0.3, 1.2])
    log_var = torch.tensor([-0.5, 0.2, -0.1])

    # KL divergence: D_KL(N(mu, diag(exp(log_var))) || N(0, I))
    # = 0.5 * sum(exp(log_var) + mu^2 - 1 - log_var)
    kl = 0.5 * torch.sum(torch.exp(log_var) + mu**2 - 1 - log_var)
    print(f"Closed-form KL: {kl.item():.4f}")

    # Reconstruction term still needs MC estimation (unless likelihood is also Gaussian)
    # For demo, assume some log-likelihood values
    log_likelihood_samples = torch.tensor([-10.5, -11.2, -10.8, -10.1, -11.5])
    recon_term = log_likelihood_samples.mean()

    elbo = recon_term - kl
    print(f"ELBO: {elbo.item():.4f}")
    return elbo

def elbo_monte_carlo_general(log_p_joint, q, n_samples=1000):
    """
    General Monte Carlo ELBO estimation.

    Works for any q from which we can sample and evaluate log density.

    Args:
        log_p_joint: Function computing log p(x, z) for given z
        q: Variational distribution with .sample() and .log_prob() methods
        n_samples: Number of MC samples

    Returns:
        elbo_estimate: Monte Carlo estimate of ELBO
        std_error: Standard error of the estimate
    """
    # Sample from q
    z_samples = q.sample(n_samples)

    # Evaluate log p(x, z) and log q(z) for each sample
    log_p_values = torch.stack([log_p_joint(z) for z in z_samples])
    log_q_values = q.log_prob(z_samples)

    # ELBO = E_q[log p(x,z) - log q(z)]
    elbo_samples = log_p_values - log_q_values
    elbo_estimate = elbo_samples.mean()
    std_error = elbo_samples.std() / np.sqrt(n_samples)

    return elbo_estimate, std_error

def demonstrate_mc_variance():
    """Show how MC sample count affects ELBO estimate variance."""
    torch.manual_seed(42)

    # Simple 1D example; the true posterior mean for this model is 1.6,
    # so the q below (centered at 1.0) is deliberately imperfect
    true_posterior_mu = 1.6

    # Log joint (up to constant)
    def log_p_joint(z):
        log_prior = -0.5 * z**2
        log_likelihood = -0.5 * (z - 2.0)**2 / 0.25  # Observe x=2 with sigma=0.5
        return log_prior + log_likelihood

    # Variational distribution
    class SimpleGaussian:
        def __init__(self, mu, sigma):
            self.mu = mu
            self.sigma = sigma

        def sample(self, n):
            return torch.randn(n) * self.sigma + self.mu

        def log_prob(self, z):
            return -0.5 * ((z - self.mu) / self.sigma)**2 - np.log(self.sigma) - 0.5 * np.log(2*np.pi)

    q = SimpleGaussian(mu=1.0, sigma=0.4)

    print("ELBO estimates with different sample counts:")
    print("-" * 50)

    for n_samples in [10, 100, 1000, 10000]:
        estimates = []
        for _ in range(100):  # Repeat to measure variance
            elbo, _ = elbo_monte_carlo_general(log_p_joint, q, n_samples)
            estimates.append(elbo.item())

        mean_elbo = np.mean(estimates)
        std_elbo = np.std(estimates)
        print(f"N={n_samples:5d}: ELBO = {mean_elbo:.4f} ± {std_elbo:.4f}")

    print("\nNote: The standard deviation of the estimate decreases as 1/sqrt(N)")

# Run demonstrations
print("=== Closed-Form Gaussian KL ===")
elbo_closed_form_gaussian()
print()

print("=== Monte Carlo Variance Demo ===")
demonstrate_mc_variance()
```

You now have a thorough understanding of the ELBO from multiple perspectives. You can derive it, interpret its components, and understand what each term optimizes. The final page of this module covers the optimization perspective: how we actually maximize the ELBO using gradient-based methods and the crucial reparameterization trick.