The Evidence Lower Bound (ELBO) is the central quantity in variational inference. It serves simultaneously as a lower bound on the log-evidence $\log p(\mathbf{x})$ and as the tractable objective that variational inference maximizes in place of the intractable posterior.
This page derives the ELBO rigorously from multiple perspectives. Each perspective offers different insights, and together they build a complete understanding of why the ELBO works and what it means.
By the end of this page, you will not just know the ELBO formula—you will understand why it takes this form and what each term contributes.
By completing this page, you will: (1) Derive the ELBO from Jensen's inequality, (2) Derive the ELBO from the KL divergence identity, (3) Decompose the ELBO into reconstruction and regularization terms, (4) Understand the rate-distortion interpretation, and (5) Compute the ELBO for concrete examples.
Let's begin by stating the ELBO clearly. Given observed data $\mathbf{x}$, latent variables $\mathbf{z}$, a joint model $p(\mathbf{x}, \mathbf{z})$, and a variational distribution $q(\mathbf{z})$ over the latents:
The Evidence Lower Bound (ELBO) is:
$$\mathcal{L}(q) = \mathbb{E}_{q(\mathbf{z})}[\log p(\mathbf{x}, \mathbf{z})] - \mathbb{E}_{q(\mathbf{z})}[\log q(\mathbf{z})]$$
Equivalently:
$$\mathcal{L}(q) = \mathbb{E}_{q(\mathbf{z})}[\log p(\mathbf{x}, \mathbf{z}) - \log q(\mathbf{z})] = \mathbb{E}_{q(\mathbf{z})}\left[ \log \frac{p(\mathbf{x}, \mathbf{z})}{q(\mathbf{z})} \right]$$
Or, decomposed:
$$\mathcal{L}(q) = \underbrace{\mathbb{E}_{q(\mathbf{z})}[\log p(\mathbf{x} | \mathbf{z})]}_{\text{Expected log-likelihood}} - \underbrace{D_{\text{KL}}(q(\mathbf{z}) \,\|\, p(\mathbf{z}))}_{\text{KL to prior}}$$
This decomposition separates the objective into a reconstruction term (how well latents explain data) and a regularization term (how close $q$ is to the prior).
The ELBO can be written as:
Form 1 (Expectations): $\mathbb{E}_q[\log p(\mathbf{x}, \mathbf{z})] - \mathbb{E}_q[\log q(\mathbf{z})]$
Form 2 (Ratio): $\mathbb{E}_q\big[\log \frac{p(\mathbf{x}, \mathbf{z})}{q(\mathbf{z})}\big]$
Form 3 (Decomposed): $\mathbb{E}_q[\log p(\mathbf{x}|\mathbf{z})] - D_{\text{KL}}(q(\mathbf{z}) \,\|\, p(\mathbf{z}))$
All three are identical; use whichever is most convenient for your derivation.
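To make the equivalence concrete, here is a minimal sketch (not taken from the text; it uses a hypothetical two-state latent with made-up probabilities) that evaluates all three forms exactly and confirms they agree, and that the result lower-bounds $\log p(\mathbf{x})$:

```python
import numpy as np

# Hypothetical toy model with a binary latent z in {0, 1}
p_z = np.array([0.5, 0.5])          # prior p(z)
p_x_given_z = np.array([0.8, 0.1])  # likelihood p(x | z) for the observed x
p_xz = p_z * p_x_given_z            # joint p(x, z)
q_z = np.array([0.7, 0.3])          # an arbitrary variational distribution

# Form 1: E_q[log p(x,z)] - E_q[log q(z)]
form1 = np.sum(q_z * np.log(p_xz)) - np.sum(q_z * np.log(q_z))

# Form 2: E_q[log (p(x,z) / q(z))]
form2 = np.sum(q_z * np.log(p_xz / q_z))

# Form 3: E_q[log p(x|z)] - KL(q(z) || p(z))
kl_q_prior = np.sum(q_z * np.log(q_z / p_z))
form3 = np.sum(q_z * np.log(p_x_given_z)) - kl_q_prior

print(form1, form2, form3)                 # all three match
print(np.allclose([form1, form2], form3))  # True
print("log p(x) =", np.log(p_xz.sum()), ">= ELBO =", form1)
```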
The ELBO satisfies:
$$\mathcal{L}(q) \leq \log p(\mathbf{x})$$
with equality if and only if $q(\mathbf{z}) = p(\mathbf{z} | \mathbf{x})$ (the true posterior).
This property, that the ELBO lower-bounds the log-evidence, is why we call it the Evidence Lower BOund. It allows us to maximize a tractable quantity while never overestimating $\log p(\mathbf{x})$, and to read the gap as a measure of how well $q$ approximates the posterior.
Let's now derive this bound from multiple angles.
The most direct derivation uses Jensen's inequality, which states that for a concave function $f$ and random variable $X$:
$$f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)]$$
with equality when $X$ is constant.
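As a quick numerical illustration (a made-up two-point random variable, not tied to any model on this page), the gap between $\log \mathbb{E}[X]$ and $\mathbb{E}[\log X]$ can be computed directly:

```python
import numpy as np

# A hypothetical two-point random variable: X = 1 or 4, each with probability 1/2
probs = np.array([0.5, 0.5])
values = np.array([1.0, 4.0])

log_of_mean = np.log(np.sum(probs * values))   # log E[X]  = log 2.5 ~ 0.916
mean_of_log = np.sum(probs * np.log(values))   # E[log X] = 0.5 * log 4 ~ 0.693

print(log_of_mean >= mean_of_log)   # True: log is concave
print(log_of_mean - mean_of_log)    # the Jensen gap (> 0 because X is not constant)
```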
We start with the log-marginal likelihood (log-evidence):
$$\log p(\mathbf{x}) = \log \int p(\mathbf{x}, \mathbf{z}) \, d\mathbf{z}$$
We introduce the variational distribution $q(\mathbf{z})$ by multiplying and dividing:
$$\log p(\mathbf{x}) = \log \int q(\mathbf{z}) \frac{p(\mathbf{x}, \mathbf{z})}{q(\mathbf{z})} \, d\mathbf{z}$$
This is an expectation under $q$:
$$\log p(\mathbf{x}) = \log \mathbb{E}_{q(\mathbf{z})}\left[ \frac{p(\mathbf{x}, \mathbf{z})}{q(\mathbf{z})} \right]$$
Now apply Jensen's inequality (log is concave, so moving the logarithm inside the expectation can only decrease the value):
$$\log \mathbb{E}_{q}\left[ \frac{p(\mathbf{x}, \mathbf{z})}{q(\mathbf{z})} \right] \geq \mathbb{E}_{q}\left[ \log \frac{p(\mathbf{x}, \mathbf{z})}{q(\mathbf{z})} \right]$$
Therefore:
$$\log p(\mathbf{x}) \geq \mathbb{E}_{q(\mathbf{z})}\left[ \log \frac{p(\mathbf{x}, \mathbf{z})}{q(\mathbf{z})} \right] = \mathcal{L}(q)$$
This is the ELBO!
Jensen's inequality becomes an equality when the random variable is constant:
$$\frac{p(\mathbf{x}, \mathbf{z})}{q(\mathbf{z})} = c \quad \text{for all } \mathbf{z}$$
This means:
$$q(\mathbf{z}) = \frac{p(\mathbf{x}, \mathbf{z})}{c} = \frac{p(\mathbf{x}, \mathbf{z})}{p(\mathbf{x})} = p(\mathbf{z} | \mathbf{x})$$
The bound is tight exactly when $q$ equals the true posterior. In general, the gap between $\mathcal{L}(q)$ and $\log p(\mathbf{x})$ measures how far $q$ is from the posterior.
```python
import numpy as np
from scipy.stats import norm

def demonstrate_jensen_bound():
    """
    Demonstrate the Jensen's inequality derivation of ELBO.

    We show that log E[f(z)] >= E[log f(z)] and how the gap
    relates to variational approximation quality.
    """
    # Setup: Simple Gaussian model
    # Prior: p(z) = N(0, 1)
    # Likelihood: p(x|z) = N(z, σ²), observe x = 2.0
    x = 2.0
    prior_mu, prior_sigma = 0.0, 1.0
    likelihood_sigma = 0.5

    # True posterior (conjugate case)
    post_precision = 1/prior_sigma**2 + 1/likelihood_sigma**2
    post_sigma = 1/np.sqrt(post_precision)
    post_mu = post_sigma**2 * (prior_mu/prior_sigma**2 + x/likelihood_sigma**2)

    # True log evidence (Gaussian integral)
    evidence_sigma = np.sqrt(prior_sigma**2 + likelihood_sigma**2)
    log_evidence = norm.logpdf(x, prior_mu, evidence_sigma)

    print(f"True log p(x): {log_evidence:.4f}")
    print(f"True posterior: N({post_mu:.3f}, {post_sigma:.3f}²)")
    print()

    def compute_elbo(q_mu, q_sigma, n_samples=100000):
        """
        Compute ELBO for Gaussian q(z) = N(q_mu, q_sigma²).

        ELBO = E_q[log p(x,z) / q(z)]
             = E_q[log p(x|z) + log p(z) - log q(z)]
        """
        # Sample from q
        z_samples = np.random.normal(q_mu, q_sigma, n_samples)

        # log p(x|z): Gaussian likelihood
        log_likelihood = norm.logpdf(x, z_samples, likelihood_sigma)

        # log p(z): Gaussian prior
        log_prior = norm.logpdf(z_samples, prior_mu, prior_sigma)

        # log q(z): Variational distribution
        log_q = norm.logpdf(z_samples, q_mu, q_sigma)

        # ELBO = E_q[log p(x,z) - log q(z)]
        elbo = np.mean(log_likelihood + log_prior - log_q)
        return elbo

    # Compare different variational approximations
    print("Variational approximations:")
    print("-" * 55)

    test_cases = [
        (0.0, 1.0, "Prior as variational"),
        (2.0, 0.5, "Centered on x"),
        (post_mu, post_sigma, "Exact posterior"),
        (post_mu, 2*post_sigma, "Too wide"),
        (post_mu, 0.5*post_sigma, "Too narrow"),
    ]

    for q_mu, q_sigma, description in test_cases:
        elbo = compute_elbo(q_mu, q_sigma)
        gap = log_evidence - elbo
        print(f"{description:25s}: ELBO={elbo:.4f}, Gap={gap:.4f}")

    print()
    print("Note: Gap ≈ 0 only when q = true posterior!")
    print("The gap IS the KL divergence D(q || p(z|x)).")

demonstrate_jensen_bound()
```

An alternative derivation starts from the KL divergence between $q$ and the posterior $p(\mathbf{z}|\mathbf{x})$ and rearranges to isolate the ELBO.
Begin with the definition of reverse KL:
$$D_{\text{KL}}(q(\mathbf{z}) \,\|\, p(\mathbf{z}|\mathbf{x})) = \mathbb{E}_{q}\left[ \log \frac{q(\mathbf{z})}{p(\mathbf{z}|\mathbf{x})} \right]$$
Expand the posterior using Bayes' rule:
$$p(\mathbf{z}|\mathbf{x}) = \frac{p(\mathbf{x}, \mathbf{z})}{p(\mathbf{x})}$$
Substitute:
$$\begin{aligned} D_{\text{KL}}(q \,\|\, p(\cdot|\mathbf{x})) &= \mathbb{E}_{q}\left[ \log \frac{q(\mathbf{z}) \cdot p(\mathbf{x})}{p(\mathbf{x}, \mathbf{z})} \right] \\ &= \mathbb{E}_{q}\left[ \log q(\mathbf{z}) - \log p(\mathbf{x}, \mathbf{z}) + \log p(\mathbf{x}) \right] \\ &= \mathbb{E}_{q}[\log q(\mathbf{z})] - \mathbb{E}_{q}[\log p(\mathbf{x}, \mathbf{z})] + \log p(\mathbf{x}) \end{aligned}$$
Note that $\log p(\mathbf{x})$ is constant with respect to $\mathbf{z}$, so it comes out of the expectation.
Rearrange to isolate $\log p(\mathbf{x})$:
$$\log p(\mathbf{x}) = D_{\text{KL}}(q \,\|\, p(\cdot|\mathbf{x})) + \mathbb{E}_{q}[\log p(\mathbf{x}, \mathbf{z})] - \mathbb{E}_{q}[\log q(\mathbf{z})]$$
Define the ELBO:
$$\mathcal{L}(q) = \mathbb{E}_{q}[\log p(\mathbf{x}, \mathbf{z})] - \mathbb{E}_{q}[\log q(\mathbf{z})]$$
Then we have the fundamental identity:
$$\boxed{\log p(\mathbf{x}) = \mathcal{L}(q) + D_{\text{KL}}(q(\mathbf{z}) \,\|\, p(\mathbf{z}|\mathbf{x}))}$$
1. ELBO is always a lower bound
Since $D_{\text{KL}} \geq 0$: $$\mathcal{L}(q) = \log p(\mathbf{x}) - D_{\text{KL}}(q \,\|\, p(\cdot|\mathbf{x})) \leq \log p(\mathbf{x})$$
2. Maximizing ELBO minimizes KL to posterior
Since $\log p(\mathbf{x})$ is fixed (doesn't depend on $q$): $$\max_q \mathcal{L}(q) \iff \min_q D_{\text{KL}}(q \,\|\, p(\cdot|\mathbf{x}))$$
Maximizing the ELBO is exactly equivalent to minimizing the KL divergence to the true posterior!
3. The gap is the KL divergence
$$\log p(\mathbf{x}) - \mathcal{L}(q) = D_{\text{KL}}(q \,\|\, p(\cdot|\mathbf{x}))$$
The difference between ELBO and log-evidence quantifies approximation quality.
$$\log p(\mathbf{x}) = \mathcal{L}(q) + D_{\text{KL}}(q \,\|\, p(\mathbf{z}|\mathbf{x}))$$
This equation is the conceptual heart of VI. Memorize it. Everything else follows:
• $\mathcal{L}(q) \leq \log p(\mathbf{x})$ because KL ≥ 0
• Maximizing ELBO minimizes KL to posterior
• When $q = p(\mathbf{z}|\mathbf{x})$, KL = 0 and ELBO = log evidence
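To make the identity tangible, here is a minimal sketch (assuming the same conjugate Gaussian model as the Jensen's-inequality demo above; the helpers `kl_gauss` and `elbo_mc` are illustrative names, not from the original code) that checks $\log p(\mathbf{x}) = \mathcal{L}(q) + D_{\text{KL}}(q \,\|\, p(\mathbf{z}|\mathbf{x}))$ numerically for an imperfect $q$:

```python
import numpy as np
from scipy.stats import norm

# Same conjugate Gaussian setup as the Jensen's-inequality demo:
# p(z) = N(0, 1), p(x|z) = N(z, 0.5^2), observed x = 2.0
x, prior_sigma, lik_sigma = 2.0, 1.0, 0.5
post_var = 1.0 / (1/prior_sigma**2 + 1/lik_sigma**2)
post_mu = post_var * (x / lik_sigma**2)
log_evidence = norm.logpdf(x, 0.0, np.sqrt(prior_sigma**2 + lik_sigma**2))

def kl_gauss(mu_q, var_q, mu_p, var_p):
    """Closed-form KL(N(mu_q, var_q) || N(mu_p, var_p))."""
    return 0.5 * (np.log(var_p / var_q) + (var_q + (mu_q - mu_p)**2) / var_p - 1)

def elbo_mc(q_mu, q_sigma, n=200_000):
    """Monte Carlo ELBO for q(z) = N(q_mu, q_sigma^2)."""
    z = np.random.normal(q_mu, q_sigma, n)
    return np.mean(norm.logpdf(x, z, lik_sigma) + norm.logpdf(z, 0, prior_sigma)
                   - norm.logpdf(z, q_mu, q_sigma))

# Pick an imperfect q and check: log p(x) = ELBO(q) + KL(q || p(z|x))
q_mu, q_sigma = 1.0, 0.8
lhs = log_evidence
rhs = elbo_mc(q_mu, q_sigma) + kl_gauss(q_mu, q_sigma**2, post_mu, post_var)
print(f"log p(x) = {lhs:.4f},  ELBO + KL = {rhs:.4f}")  # agree up to MC noise
```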
The ELBO has a natural decomposition that provides intuition for what variational inference optimizes:
$$\mathcal{L}(q) = \underbrace{\mathbb{E}_{q(\mathbf{z})}[\log p(\mathbf{x} | \mathbf{z})]}_{\text{Reconstruction term}} - \underbrace{D_{\text{KL}}(q(\mathbf{z}) \,\|\, p(\mathbf{z}))}_{\text{Regularization term}}$$
This decomposition emerges naturally from writing the joint as $p(\mathbf{x}, \mathbf{z}) = p(\mathbf{x}|\mathbf{z}) p(\mathbf{z})$.
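Writing out that single step:

$$\mathcal{L}(q) = \mathbb{E}_{q}\left[ \log \frac{p(\mathbf{x}|\mathbf{z})\, p(\mathbf{z})}{q(\mathbf{z})} \right] = \mathbb{E}_{q}[\log p(\mathbf{x}|\mathbf{z})] + \mathbb{E}_{q}\left[ \log \frac{p(\mathbf{z})}{q(\mathbf{z})} \right] = \mathbb{E}_{q}[\log p(\mathbf{x}|\mathbf{z})] - D_{\text{KL}}(q(\mathbf{z}) \,\|\, p(\mathbf{z}))$$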
The reconstruction term $\mathbb{E}_{q(\mathbf{z})}[\log p(\mathbf{x}|\mathbf{z})]$ measures how well the latent variables explain the observed data.
In VAEs: This is the expected negative reconstruction error. The decoder $p(\mathbf{x}|\mathbf{z})$ tries to reconstruct $\mathbf{x}$ from $\mathbf{z}$; this term rewards accurate reconstruction.
The regularization term $D_{\text{KL}}(q(\mathbf{z}) \,\|\, p(\mathbf{z}))$ penalizes deviation of $q$ from the prior.
In VAEs: Keeps the latent space structured. Without this term, the encoder could map each input to an arbitrary, unstructured point in latent space.
Maximizing the ELBO requires balancing these competing objectives:
Too much focus on reconstruction: $q$ drifts far from the prior, and the latent space loses the structure the prior was meant to impose.
Too much focus on regularization: $q$ stays close to the prior and encodes little information about the data, so reconstructions suffer.
The optimal balance depends on the model and the data. In practice, a common pathology is posterior collapse, where the KL term dominates and $q$ simply equals the prior, ignoring the data entirely.
To control this trade-off, the β-VAE introduces a hyperparameter:
$$\mathcal{L}_\beta(q) = \mathbb{E}_q[\log p(\mathbf{x} | \mathbf{z})] - \beta \cdot D_{\text{KL}}(q(\mathbf{z}) \,\|\, p(\mathbf{z}))$$
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """
    Variational Autoencoder demonstrating ELBO decomposition.

    ELBO = E_q[log p(x|z)] - KL(q(z|x) || p(z))
         = -Reconstruction_Loss - KL_Regularization
    """
    def __init__(self, input_dim, hidden_dim, latent_dim):
        super().__init__()
        # Encoder: x -> (mu, log_var)
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)

        # Decoder: z -> x_recon
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
        )

    def encode(self, x):
        h = self.encoder(x)
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, log_var):
        """z = mu + sigma * epsilon, epsilon ~ N(0, I)"""
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        return mu + std * eps

    def decode(self, z):
        return self.decoder(z)

    def forward(self, x):
        mu, log_var = self.encode(x)
        z = self.reparameterize(mu, log_var)
        x_recon = self.decode(z)
        return x_recon, mu, log_var

    def compute_elbo_components(self, x, beta=1.0):
        """
        Compute and return ELBO components separately.

        Returns:
            elbo: Total ELBO (to maximize)
            recon_term: E_q[log p(x|z)]
            kl_term: D_KL(q(z|x) || p(z))
        """
        x_recon, mu, log_var = self.forward(x)

        # Reconstruction term: E_q[log p(x|z)]
        # For Gaussian p(x|z) with unit variance: proportional to -MSE
        recon_loss = F.mse_loss(x_recon, x, reduction='sum')
        recon_term = -recon_loss  # Negative because ELBO is to be maximized

        # KL term: D_KL(N(mu, sigma^2) || N(0, I))
        # = 0.5 * sum(sigma^2 + mu^2 - 1 - log(sigma^2))
        kl_term = 0.5 * torch.sum(torch.exp(log_var) + mu**2 - 1 - log_var)

        # ELBO = Recon - beta * KL
        elbo = recon_term - beta * kl_term
        return elbo, recon_term, kl_term

# Example usage
vae = VAE(input_dim=784, hidden_dim=256, latent_dim=32)
x = torch.randn(64, 784)  # Batch of 64 samples

elbo, recon, kl = vae.compute_elbo_components(x, beta=1.0)
print(f"ELBO: {elbo.item():.2f}")
print(f"Reconstruction term: {recon.item():.2f}")
print(f"KL regularization: {kl.item():.2f}")
print(f"Check: Recon - KL = {recon.item() - kl.item():.2f} ≈ ELBO")

# Training: maximize ELBO = minimize -ELBO
loss = -elbo / 64  # Average over batch
print(f"\nTraining loss (per sample): {loss.item():.2f}")
```

The ELBO has a beautiful interpretation in terms of information theory. We can rewrite it using entropy and mutual information:
$$\mathcal{L}(q) = \mathbb{E}_q[\log p(\mathbf{x}|\mathbf{z})] + H[q(\mathbf{z})] - H[p(\mathbf{z})]_{q}$$
where $H[q] = -\mathbb{E}_q[\log q]$ is entropy and $H[p]_q = -\mathbb{E}_q[\log p]$ is cross-entropy.
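This form follows directly from expanding the KL term in the earlier decomposition:

$$D_{\text{KL}}(q(\mathbf{z}) \,\|\, p(\mathbf{z})) = \mathbb{E}_q[\log q(\mathbf{z})] - \mathbb{E}_q[\log p(\mathbf{z})] = -H[q(\mathbf{z})] + H[p(\mathbf{z})]_q$$

Substituting this into $\mathcal{L}(q) = \mathbb{E}_q[\log p(\mathbf{x}|\mathbf{z})] - D_{\text{KL}}(q(\mathbf{z}) \,\|\, p(\mathbf{z}))$ gives the expression above.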
Consider the VAE setting where we encode observations $\mathbf{x}$ into latents $\mathbf{z}$. The ELBO can be viewed through the lens of rate-distortion theory:
$$\mathcal{L} = -\underbrace{D(\mathbf{x}, \hat{\mathbf{x}})}_{\text{Distortion}} - \underbrace{I_q(\mathbf{x}; \mathbf{z})}_{\text{Rate}}$$
where $D(\mathbf{x}, \hat{\mathbf{x}})$ is the distortion (the expected reconstruction error) and $I_q(\mathbf{x}; \mathbf{z})$ is the rate (the amount of information the latent code carries about the data under $q$).
The ELBO objective naturally balances these: use enough "bits" (information in $\mathbf{z}$) to reconstruct well, but not more than necessary.
The ELBO can also be written as:
$$\mathcal{L}(q) = \mathbb{E}_q[\log p(\mathbf{x}, \mathbf{z})] + H[q(\mathbf{z})]$$
This separates the objective into an expected log-joint term, which pulls $q$ toward configurations where $p(\mathbf{x}, \mathbf{z})$ is large, and an entropy term $H[q(\mathbf{z})]$, which rewards $q$ for keeping its mass spread out.
Maximizing the ELBO therefore asks $q$ to both concentrate on high-probability regions of the joint and remain as broad (high-entropy) as possible.
This tension prevents the variational distribution from collapsing trivially.
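A minimal sketch of this tension, reusing the conjugate Gaussian toy model from earlier (true posterior $\mathcal{N}(1.6,\, 0.2)$); the loop below is illustrative, not part of the original examples. As $q$ narrows, the expected log-joint rises but the entropy term drops, and the ELBO peaks when $q$ matches the posterior width:

```python
import numpy as np
from scipy.stats import norm

# Same toy model as before: p(z) = N(0,1), p(x|z) = N(z, 0.5^2), observed x = 2.0
x, lik_sigma = 2.0, 0.5
post_mu, post_sigma = 1.6, np.sqrt(0.2)   # true posterior N(1.6, 0.2)

for q_sigma in [1.0, 0.7, post_sigma, 0.2, 0.05]:
    z = np.random.normal(post_mu, q_sigma, 200_000)   # q centered at posterior mean
    exp_log_joint = np.mean(norm.logpdf(x, z, lik_sigma) + norm.logpdf(z, 0, 1))
    entropy_q = 0.5 * np.log(2 * np.pi * np.e * q_sigma**2)  # Gaussian entropy
    print(f"sigma={q_sigma:.3f}: E_q[log p(x,z)]={exp_log_joint:7.3f}, "
          f"H[q]={entropy_q:6.3f}, ELBO={exp_log_joint + entropy_q:7.3f}")
```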
In coding theory, the ELBO represents the expected codelength for transmitting $\mathbf{x}$ via latent $\mathbf{z}$:
$$-\mathcal{L}(q) = \underbrace{-\mathbb{E}_q[\log p(\mathbf{x}|\mathbf{z})]}_{\text{Bits to encode x given z}} + \underbrace{D_{\text{KL}}(q \,\|\, p)}_{\text{Bits for z beyond prior}}$$
The "bits back" idea: we pay $D_{\text{KL}}$ bits to encode $\mathbf{z}$, but can recover some bits because $\mathbf{z}$ was drawn stochastically.
The ELBO unifies several perspectives:
Statistical: Approximate the posterior by minimizing KL divergence
Probabilistic: Lower bound the model evidence
Information-theoretic: Trade off rate (compression) and distortion (reconstruction)
Coding-theoretic: Minimize expected codelength for transmission
Each perspective illuminates different aspects of what VI optimizes.
The ELBO involves expectations under $q$, which generally don't have closed forms. Here are the main strategies for computing it:
For some models, all expectations have closed forms:
Example: Gaussian prior, Gaussian variational family
$$p(\mathbf{z}) = \mathcal{N}(0, \mathbf{I}), \quad q(\mathbf{z}) = \mathcal{N}(\boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2))$$
The KL term is:
$$D_{\text{KL}}(q \,\|\, p) = \frac{1}{2} \sum_j \left( \sigma_j^2 + \mu_j^2 - 1 - \log \sigma_j^2 \right)$$
This is exact and differentiable—no sampling required for the regularization term.
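As a sanity check (a sketch, not part of the original example code), the closed-form expression can be compared against `torch.distributions.kl_divergence` and a Monte Carlo estimate; the parameter values below are the same illustrative ones used in the code example further down:

```python
import torch
from torch.distributions import Normal, kl_divergence

mu = torch.tensor([0.5, -0.3, 1.2])
log_var = torch.tensor([-0.5, 0.2, -0.1])
sigma = torch.exp(0.5 * log_var)

# Closed-form expression from the text
kl_manual = 0.5 * torch.sum(torch.exp(log_var) + mu**2 - 1 - log_var)

# Library result: sum of per-dimension KLs for a diagonal Gaussian vs. N(0, I)
q = Normal(mu, sigma)
p = Normal(torch.zeros(3), torch.ones(3))
kl_library = kl_divergence(q, p).sum()

# Monte Carlo estimate of the same quantity
z = q.sample((100_000,))
kl_mc = (q.log_prob(z) - p.log_prob(z)).sum(dim=-1).mean()

print(kl_manual.item(), kl_library.item(), kl_mc.item())  # all approximately equal
```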
For general models, we estimate the ELBO via sampling:
$$\mathcal{L}(q) \approx \frac{1}{K} \sum_{k=1}^{K} \left[ \log p(\mathbf{x}, \mathbf{z}^{(k)}) - \log q(\mathbf{z}^{(k)}) \right]$$
where $\mathbf{z}^{(k)} \sim q(\mathbf{z})$.
This is unbiased but has variance. Common variance reduction techniques include the reparameterization trick (covered on the next page), control variates, and Rao-Blackwellization.
```python
import torch
import numpy as np

def elbo_closed_form_gaussian():
    """
    Compute ELBO with closed-form KL for Gaussian q and Gaussian prior.
    This is the standard computation in VAEs.
    """
    # Variational parameters
    mu = torch.tensor([0.5, -0.3, 1.2])
    log_var = torch.tensor([-0.5, 0.2, -0.1])

    # KL divergence: D_KL(N(mu, diag(exp(log_var))) || N(0, I))
    # = 0.5 * sum(exp(log_var) + mu^2 - 1 - log_var)
    kl = 0.5 * torch.sum(torch.exp(log_var) + mu**2 - 1 - log_var)
    print(f"Closed-form KL: {kl.item():.4f}")

    # Reconstruction term still needs MC estimation (unless likelihood is also Gaussian)
    # For demo, assume some log-likelihood values
    log_likelihood_samples = torch.tensor([-10.5, -11.2, -10.8, -10.1, -11.5])
    recon_term = log_likelihood_samples.mean()

    elbo = recon_term - kl
    print(f"ELBO: {elbo.item():.4f}")
    return elbo

def elbo_monte_carlo_general(log_p_joint, q, n_samples=1000):
    """
    General Monte Carlo ELBO estimation.

    Works for any q from which we can sample and evaluate log density.

    Args:
        log_p_joint: Function computing log p(x, z) for given z
        q: Variational distribution with .sample() and .log_prob() methods
        n_samples: Number of MC samples

    Returns:
        elbo_estimate: Monte Carlo estimate of ELBO
        std_error: Standard error of the estimate
    """
    # Sample from q
    z_samples = q.sample(n_samples)

    # Evaluate log p(x, z) and log q(z) for each sample
    log_p_values = torch.stack([log_p_joint(z) for z in z_samples])
    log_q_values = q.log_prob(z_samples)

    # ELBO = E_q[log p(x,z) - log q(z)]
    elbo_samples = log_p_values - log_q_values
    elbo_estimate = elbo_samples.mean()
    std_error = elbo_samples.std() / np.sqrt(n_samples)

    return elbo_estimate, std_error

def demonstrate_mc_variance():
    """Show how MC sample count affects ELBO estimate variance."""
    torch.manual_seed(42)

    # Simple 1D example; the true posterior mean for this model is 1.6,
    # so the q below (centered at 1.0) is deliberately imperfect
    true_posterior_mu = 1.6

    # Log joint (up to constant)
    def log_p_joint(z):
        log_prior = -0.5 * z**2
        log_likelihood = -0.5 * (z - 2.0)**2 / 0.25  # Observe x=2 with sigma=0.5
        return log_prior + log_likelihood

    # Variational distribution
    class SimpleGaussian:
        def __init__(self, mu, sigma):
            self.mu = mu
            self.sigma = sigma

        def sample(self, n):
            return torch.randn(n) * self.sigma + self.mu

        def log_prob(self, z):
            return -0.5 * ((z - self.mu) / self.sigma)**2 - np.log(self.sigma) - 0.5 * np.log(2*np.pi)

    q = SimpleGaussian(mu=1.0, sigma=0.4)

    print("ELBO estimates with different sample counts:")
    print("-" * 50)

    for n_samples in [10, 100, 1000, 10000]:
        estimates = []
        for _ in range(100):  # Repeat to measure variance
            elbo, _ = elbo_monte_carlo_general(log_p_joint, q, n_samples)
            estimates.append(elbo.item())

        mean_elbo = np.mean(estimates)
        std_elbo = np.std(estimates)
        print(f"N={n_samples:5d}: ELBO = {mean_elbo:.4f} ± {std_elbo:.4f}")

    print("\nNote: The standard deviation of the estimate decreases as 1/sqrt(N)")

# Run demonstrations
print("=== Closed-Form Gaussian KL ===")
elbo_closed_form_gaussian()
print()

print("=== Monte Carlo Variance Demo ===")
demonstrate_mc_variance()
```

You now have a thorough understanding of the ELBO from multiple perspectives. You can derive it, interpret its components, and understand what each term optimizes. The final page of this module covers the optimization perspective: how we actually maximize the ELBO using gradient-based methods and the crucial reparameterization trick.