Variational Autoencoders represent one of the most elegant intersections of deep learning and probabilistic modeling. Unlike purely discriminative models that learn boundaries between classes, or simple autoencoders that merely compress and reconstruct, VAEs provide a principled probabilistic framework for learning the underlying distribution of data—enabling not just reconstruction, but genuine generation of new samples.
At the heart of VAEs lies a profound insight: we can frame generative modeling as a latent variable model and derive a tractable learning objective through variational inference. The result—the Evidence Lower Bound (ELBO)—gives us both a training signal and deep theoretical guarantees about what we're learning.
This page derives the VAE objective from first principles, ensuring you understand not just what the loss function is, but why it takes this particular form and what each term contributes to the model's behavior.
By the end of this page, you will: (1) Understand the probabilistic formulation of generative modeling with latent variables, (2) Derive the Evidence Lower Bound (ELBO) from maximum likelihood, (3) Interpret each term of the ELBO—reconstruction loss and KL divergence, (4) Understand why direct maximum likelihood is intractable and how variational inference provides a solution, and (5) Appreciate the theoretical foundations that make VAEs mathematically principled.
Before diving into the VAE objective, let's crystallize what we're trying to accomplish. In generative modeling, we have a dataset $\mathcal{D} = \{\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(N)}\}$ of samples (images, text, audio, etc.), and our goal is to learn the true underlying data distribution $p_{\text{data}}(\mathbf{x})$.
The maximum likelihood principle tells us the optimal approach: find the model parameters $\boldsymbol{\theta}$ that maximize the probability of observing our data:
$$\boldsymbol{\theta}^* = \arg\max_{\boldsymbol{\theta}} \prod_{i=1}^{N} p_{\boldsymbol{\theta}}(\mathbf{x}^{(i)})$$
Equivalently, we maximize the log-likelihood:
$$\boldsymbol{\theta}^* = \arg\max_{\boldsymbol{\theta}} \sum_{i=1}^{N} \log p_{\boldsymbol{\theta}}(\mathbf{x}^{(i)})$$
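To make the principle concrete before we hit the intractability below, here is a minimal sketch (a toy example of our own, with assumed data and a one-dimensional Gaussian model) in which the log-likelihood is tractable and can be maximized directly by gradient descent on the negative log-likelihood:

```python
import torch

# Toy illustration (assumed): maximum likelihood for a 1-D Gaussian model
# p_theta(x) = N(x; mu, sigma^2), where the log-likelihood is tractable --
# exactly the property that breaks down for high-dimensional images below.
torch.manual_seed(0)
data = 2.0 + 0.5 * torch.randn(1000)               # samples from the "true" distribution

mu = torch.zeros(1, requires_grad=True)
log_sigma = torch.zeros(1, requires_grad=True)     # parameterize sigma > 0 via its log
optimizer = torch.optim.Adam([mu, log_sigma], lr=0.05)

for step in range(500):
    optimizer.zero_grad()
    # Negative log-likelihood of the whole dataset under N(mu, sigma^2)
    nll = -torch.distributions.Normal(mu, log_sigma.exp()).log_prob(data).sum()
    nll.backward()
    optimizer.step()

print(f"MLE estimate: mu = {mu.item():.3f}, sigma = {log_sigma.exp().item():.3f}")
# Converges toward the sample mean (~2.0) and standard deviation (~0.5)
```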
This seems straightforward, but there's a fundamental challenge: for complex, high-dimensional data like images, directly defining and computing $p_{\boldsymbol{\theta}}(\mathbf{x})$ is extraordinarily difficult. How can we tractably model a distribution over all possible 256×256 RGB images?
A 256×256 RGB image has 196,608 dimensions. The space of possible images is astronomical—far beyond any tractable explicit probability distribution. We need a clever way to implicitly represent this distribution while maintaining computational tractability.
The key insight is to introduce latent variables $\mathbf{z}$ that capture the underlying factors of variation in our data. We hypothesize that each observed datapoint $\mathbf{x}$ was generated from some unobserved latent code $\mathbf{z}$ through a two-step generative process: first sample a latent code $\mathbf{z} \sim p(\mathbf{z})$ from the prior, then sample the observation $\mathbf{x} \sim p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})$ from the likelihood.
The marginal likelihood of data under this model is:
$$p_{\boldsymbol{\theta}}(\mathbf{x}) = \int p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) p(\mathbf{z}) \, d\mathbf{z}$$
This latent variable formulation is powerful because it decomposes a complex distribution over observations into simpler, well-understood components:
| Component | Notation | Role | Typical Choice in VAE |
|---|---|---|---|
| Prior | $p(\mathbf{z})$ | Distribution over latent codes before seeing data | Standard Gaussian $\mathcal{N}(0, I)$ |
| Likelihood | $p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})$ | How latent codes generate observations | Neural network (decoder) |
| Posterior | $p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x})$ | True distribution over latents given an observation | Intractable (requires variational approx.) |
| Marginal | $p_{\boldsymbol{\theta}}(\mathbf{x})$ | Total probability of an observation | Intractable (requires marginalization) |
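The sketch below illustrates the generative process implied by this table via ancestral sampling. The MLP decoder, its dimensions, and the Bernoulli output are placeholder assumptions for illustration, not a reference implementation:

```python
import torch
import torch.nn as nn

# Minimal sketch of the generative process in the table via ancestral sampling.
# The MLP decoder, its sizes, and the Bernoulli output are placeholder assumptions.
latent_dim, data_dim = 16, 784

decoder = nn.Sequential(                   # parameterizes p_theta(x | z)
    nn.Linear(latent_dim, 256),
    nn.ReLU(),
    nn.Linear(256, data_dim),
)

z = torch.randn(8, latent_dim)             # step 1: z ~ p(z) = N(0, I)
x_probs = torch.sigmoid(decoder(z))        # step 2: decoder maps z to Bernoulli means
x_sample = torch.bernoulli(x_probs)        # step 2 (cont.): x ~ p_theta(x | z)
print(x_sample.shape)                      # torch.Size([8, 784])
```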
The latent variable formulation seems elegant, but it introduces a severe computational problem. To maximize $\log p_{\boldsymbol{\theta}}(\mathbf{x})$, we need to compute:
$$\log p_{\boldsymbol{\theta}}(\mathbf{x}) = \log \int p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) p(\mathbf{z}) \, d\mathbf{z}$$
This integral is intractable for two fundamental reasons:
1. High-Dimensional Integration
The integral is over the entire latent space, which may have hundreds of dimensions. No closed-form solution exists, and numerical integration is computationally impossible in high dimensions (curse of dimensionality).
2. Posterior Intractability
To efficiently estimate this integral via Monte Carlo methods, we would ideally sample from the posterior $p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x})$—the distribution of latent codes that could have generated a specific observation. But by Bayes' theorem:
$$p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x}) = \frac{p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) p(\mathbf{z})}{p_{\boldsymbol{\theta}}(\mathbf{x})}$$
Computing this requires the very marginal likelihood $p_{\boldsymbol{\theta}}(\mathbf{x})$ we're trying to compute—a circular dependency.
We want to maximize $\log p_{\boldsymbol{\theta}}(\mathbf{x})$, but we can't compute it. We need $p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x})$ for efficient estimation, but computing it requires $p_{\boldsymbol{\theta}}(\mathbf{x})$. We're stuck in a circle. This is where variational inference comes to the rescue.
One might attempt to estimate the marginal likelihood directly:
$$p_{\boldsymbol{\theta}}(\mathbf{x}) \approx \frac{1}{K} \sum_{k=1}^{K} p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}^{(k)}), \quad \mathbf{z}^{(k)} \sim p(\mathbf{z})$$
This is called importance sampling with the prior as the proposal distribution. While unbiased, it has astronomically high variance for high-dimensional data.
Why it fails: Consider an image of a face. Most samples from the prior $p(\mathbf{z})$ will decode to noise or unrelated images. Only a tiny fraction of the latent space actually corresponds to faces. We would need an impossibly large number of samples to get even one that's relevant to our specific image.
Mathematically, if the posterior $p(\mathbf{z}|\mathbf{x})$ concentrates in a small region of latent space (which it does for meaningful data), samples from the prior $p(\mathbf{z})$ will almost never land in that region.
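The following sketch makes this failure concrete on a toy linear-Gaussian model (our own assumption, chosen because its exact marginal likelihood is known): even with many prior samples, the naive estimator badly underestimates the evidence and fluctuates between runs:

```python
import math
import torch

# Toy linear-Gaussian model (assumed for illustration): p(z) = N(0, I_d),
# p(x|z) = N(z, sigma2 * I_d), so the exact marginal p(x) = N(0, (1 + sigma2) I_d)
# is available to compare against.
torch.manual_seed(0)
d, sigma2 = 50, 0.1
x = torch.randn(d) * (1 + sigma2) ** 0.5           # one observation from the true marginal

exact_log_px = torch.distributions.MultivariateNormal(
    torch.zeros(d), (1 + sigma2) * torch.eye(d)
).log_prob(x)

def naive_log_px_estimate(num_samples: int) -> torch.Tensor:
    """log( (1/K) sum_k p(x|z_k) ) with z_k ~ p(z): importance sampling from the prior."""
    z = torch.randn(num_samples, d)                # z_k ~ N(0, I)
    log_pxz = torch.distributions.Normal(z, sigma2 ** 0.5).log_prob(x).sum(dim=1)
    return torch.logsumexp(log_pxz, dim=0) - math.log(num_samples)

print(f"exact log p(x):        {exact_log_px.item():.1f}")
for _ in range(3):
    print(f"naive estimate, K=1e5: {naive_log_px_estimate(100_000).item():.1f}")
# The estimates fall far below the truth and fluctuate between runs: with d = 50
# latent dimensions, essentially no prior sample lands where the posterior concentrates.
```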
The solution: Instead of sampling from the prior and hoping, we need to sample from a distribution that actually concentrates in the relevant regions of latent space for each datapoint. This is what the variational approach provides.
The breakthrough of VAEs is to address the intractability through variational inference. Since we can't compute the true posterior $p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x})$, we'll introduce an approximate posterior $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$ parameterized by $\boldsymbol{\phi}$.
This approximate posterior must be flexible enough to track the true posterior for each input, yet simple enough that we can sample from it and evaluate its density efficiently.
The variational approach converts our intractable optimization problem (maximize log-likelihood) into a tractable one (maximize a lower bound while making the approximate posterior accurate).
Traditional variational inference optimizes separate parameters for each datapoint. VAEs use amortized inference: a single neural network (the encoder) outputs the approximate posterior parameters for any input $\mathbf{x}$.
This amortization is crucial: it shares statistical strength across datapoints, and inference for a new observation requires only a single forward pass through the encoder rather than a separate per-datapoint optimization.
The most common choice for the approximate posterior is a diagonal Gaussian:
$$q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}) = \mathcal{N}(\mathbf{z}; \boldsymbol{\mu}_{\boldsymbol{\phi}}(\mathbf{x}), \text{diag}(\boldsymbol{\sigma}^2_{\boldsymbol{\phi}}(\mathbf{x})))$$
where $\boldsymbol{\mu}_{\boldsymbol{\phi}}(\mathbf{x})$ and $\boldsymbol{\sigma}^2_{\boldsymbol{\phi}}(\mathbf{x})$ are the outputs of the encoder network with parameters $\boldsymbol{\phi}$.
This choice satisfies our requirements: it is easy to sample from, its density can be evaluated in closed form, and, as we will see, its KL divergence to the standard Gaussian prior has a closed-form expression.
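A minimal sketch of such an amortized Gaussian encoder is shown below; the MLP backbone and dimensions are illustrative assumptions, and real encoder architectures are the topic of the next page:

```python
import torch
import torch.nn as nn

# Minimal sketch of an amortized diagonal-Gaussian encoder. The MLP backbone and
# dimensions are illustrative assumptions; encoder design is covered on the next page.
class GaussianEncoder(nn.Module):
    def __init__(self, data_dim: int = 784, hidden_dim: int = 256, latent_dim: int = 16):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(data_dim, hidden_dim), nn.ReLU())
        self.mu_head = nn.Linear(hidden_dim, latent_dim)       # outputs mu_phi(x)
        self.log_var_head = nn.Linear(hidden_dim, latent_dim)  # outputs log sigma^2_phi(x)

    def forward(self, x: torch.Tensor):
        h = self.backbone(x)
        return self.mu_head(h), self.log_var_head(h)

encoder = GaussianEncoder()
x = torch.rand(32, 784)
mu, log_var = encoder(x)          # q_phi(z | x) parameters for every x in the batch
print(mu.shape, log_var.shape)    # torch.Size([32, 16]) torch.Size([32, 16])
```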
Now we derive the central mathematical result: the Evidence Lower Bound (ELBO). This derivation is fundamental—understanding it deeply will inform your intuition about VAE behavior.
We want to maximize $\log p_{\boldsymbol{\theta}}(\mathbf{x})$. Let's introduce our approximate posterior by multiplying and dividing:
$$\log p_{\boldsymbol{\theta}}(\mathbf{x}) = \log \int p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) \, d\mathbf{z}$$
$$= \log \int p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) \frac{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})} \, d\mathbf{z}$$
$$= \log \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})} \left[ \frac{p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})}{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})} \right]$$
Since $\log$ is a concave function, Jensen's inequality gives us:
$$\log \mathbb{E}[X] \geq \mathbb{E}[\log X]$$
Applying this:
$$\log p_{\boldsymbol{\theta}}(\mathbf{x}) \geq \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})} \left[ \log \frac{p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})}{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})} \right]$$
The right-hand side is the Evidence Lower Bound (ELBO):
$$\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\phi}; \mathbf{x}) = \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})} \left[ \log \frac{p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})}{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})} \right]$$
So we have established:
$$\boxed{\log p_{\boldsymbol{\theta}}(\mathbf{x}) \geq \mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\phi}; \mathbf{x})}$$
In Bayesian terminology, $p(\mathbf{x})$ is called the 'evidence'—the probability of observing our data under the model. The ELBO provides a lower bound on the log of this evidence. Maximizing ELBO pushes up this lower bound, indirectly maximizing the log-evidence.
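The bound can be checked numerically on a toy conjugate Gaussian model (an assumption made purely for illustration, chosen because its exact evidence and posterior are known in closed form): a Monte Carlo estimate of the ELBO never exceeds $\log p(\mathbf{x})$, and it matches it exactly when $q$ equals the true posterior:

```python
import torch

torch.manual_seed(0)

# Toy conjugate model (assumed, for illustration): p(z) = N(0, 1), p(x|z) = N(z, sigma2),
# so the exact evidence p(x) = N(0, 1 + sigma2) and the exact posterior are both known.
sigma2 = 0.5
x = torch.tensor(1.3)

exact_log_px = torch.distributions.Normal(0.0, (1 + sigma2) ** 0.5).log_prob(x)

def elbo_estimate(q_mu, q_std, num_samples=200_000):
    """Monte Carlo ELBO: E_q[ log p(x|z) + log p(z) - log q(z) ]."""
    q = torch.distributions.Normal(q_mu, q_std)
    z = q.sample((num_samples,))
    log_pxz = torch.distributions.Normal(z, sigma2 ** 0.5).log_prob(x)
    log_pz = torch.distributions.Normal(0.0, 1.0).log_prob(z)
    return (log_pxz + log_pz - q.log_prob(z)).mean()

# Exact posterior for this model: N(x / (1 + sigma2), sigma2 / (1 + sigma2))
post_mu = x / (1 + sigma2)
post_std = (sigma2 / (1 + sigma2)) ** 0.5

print(f"log p(x)                  = {exact_log_px.item():.4f}")
print(f"ELBO with q = exact post. = {elbo_estimate(post_mu, post_std).item():.4f}")  # matches log p(x)
print(f"ELBO with q = N(0, 1)     = {elbo_estimate(0.0, 1.0).item():.4f}")           # strictly smaller
```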
The ELBO can be decomposed in two illuminating ways. Each reveals different aspects of what the VAE is learning.
Using $p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) = p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})p(\mathbf{z})$:
$$\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\phi}; \mathbf{x}) = \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})} \left[ \log p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z}) + \log \frac{p(\mathbf{z})}{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})} \right] = \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})} \left[ \log p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z}) \right] - D_{\text{KL}}(q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}))$$
This is the most common form, revealing two terms:
Term 1: Reconstruction Loss (Expected Log-Likelihood) $$\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})} \left[ \log p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z}) \right]$$
This measures how well the decoder reconstructs $\mathbf{x}$ from latent samples. Higher is better—we want the decoder to produce high probability for the original data.
Term 2: KL Regularization (Posterior Regularization) $$D_{\text{KL}}(q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}))$$
This penalizes deviation of the approximate posterior from the prior. It encourages the encoder to produce latent codes that stay close to the prior distribution.
There's an alternative decomposition that reveals exactly when ELBO equals the log-likelihood:
$$\log p_{\boldsymbol{\theta}}(\mathbf{x}) = \mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\phi}; \mathbf{x}) + D_{\text{KL}}(q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}) \,\|\, p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x}))$$
This equality tells us that the gap between the log-likelihood and the ELBO is exactly the KL divergence from the approximate posterior to the true posterior: the bound is tight precisely when $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}) = p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x})$.
Since KL divergence is always non-negative, we always have $\mathcal{L} \leq \log p_{\boldsymbol{\theta}}(\mathbf{x})$, confirming it's a lower bound.
The KL divergence $D_{\text{KL}}(q||p_{\theta})$ between approximate and true posterior is called the 'variational gap' or 'inference gap'. A larger gap means we're leaving potential log-likelihood on the table. This motivates research into more expressive approximate posteriors (normalizing flows, importance-weighted bounds, etc.).
Let's see how to actually compute the ELBO for training. We'll work with the standard choices: Gaussian prior, Gaussian approximate posterior, and either Gaussian or Bernoulli likelihood.
With a standard Gaussian prior $p(\mathbf{z}) = \mathcal{N}(0, I)$ and diagonal Gaussian posterior $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2))$:
$$D_{\text{KL}}(q || p) = -\frac{1}{2} \sum_{j=1}^{d} \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$
This has a closed-form solution! No sampling required. Each latent dimension contributes independently.
```python
import torch

def kl_divergence_gaussian(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """
    Compute KL divergence between N(mu, sigma^2) and N(0, I).

    Uses log_var = log(sigma^2) for numerical stability.

    Args:
        mu: Mean of approximate posterior, shape [batch, latent_dim]
        log_var: Log variance of approx. posterior, shape [batch, latent_dim]

    Returns:
        KL divergence per sample, shape [batch]

    Derivation:
        KL(q||p) = 0.5 * sum(sigma^2 + mu^2 - 1 - log(sigma^2))
                 = 0.5 * sum(exp(log_var) + mu^2 - 1 - log_var)
    """
    # Sum over latent dimensions, preserve batch dimension
    kl_per_sample = -0.5 * torch.sum(
        1 + log_var - mu.pow(2) - log_var.exp(),
        dim=1
    )
    return kl_per_sample  # Shape: [batch]

# Example usage
batch_size, latent_dim = 32, 64
mu = torch.randn(batch_size, latent_dim)
log_var = torch.randn(batch_size, latent_dim) - 1  # Initialize near unit variance

kl = kl_divergence_gaussian(mu, log_var)
print(f"KL divergence shape: {kl.shape}")  # [32]
print(f"Mean KL per sample: {kl.mean().item():.4f}")
```

The reconstruction term $\mathbb{E}_{q}[\log p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})]$ is estimated via sampling:
$$\mathbb{E}_{q}[\log p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})] \approx \frac{1}{L}\sum_{l=1}^{L} \log p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z}^{(l)}), \quad \mathbf{z}^{(l)} \sim q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$$
In practice, $L=1$ (single sample) works remarkably well due to minibatch averaging.
For Binary Data (Bernoulli Likelihood): $$\log p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z}) = \sum_i x_i \log \hat{x}_i + (1-x_i) \log(1-\hat{x}_i)$$
This is the binary cross-entropy loss! The decoder outputs probabilities $\hat{x}_i = \text{sigmoid}(f_{\boldsymbol{\theta}}(\mathbf{z}))_i$.
For Continuous Data (Gaussian Likelihood): $$\log p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z}) \propto -\frac{1}{2\sigma^2}||\mathbf{x} - \hat{\mathbf{x}}||^2$$
With fixed variance, this reduces to mean squared error! The decoder outputs $\hat{\mathbf{x}} = f_{\boldsymbol{\theta}}(\mathbf{z})$.
```python
import torch
import torch.nn.functional as F

def reconstruction_loss_bernoulli(x: torch.Tensor, x_recon: torch.Tensor) -> torch.Tensor:
    """
    Binary cross-entropy reconstruction loss for binary/binarized data.

    Args:
        x: Original data, shape [batch, ...], values in [0, 1]
        x_recon: Reconstructed logits from decoder, shape [batch, ...]

    Returns:
        Negative log-likelihood per sample, shape [batch]
    """
    # Flatten all dimensions except batch
    batch_size = x.shape[0]
    x_flat = x.view(batch_size, -1)
    x_recon_flat = x_recon.view(batch_size, -1)

    # Binary cross-entropy (reduction='none' gives per-element loss)
    bce = F.binary_cross_entropy_with_logits(
        x_recon_flat, x_flat, reduction='none'
    )

    # Sum over data dimensions (negative log-likelihood)
    return bce.sum(dim=1)  # Shape: [batch]

def reconstruction_loss_gaussian(x: torch.Tensor,
                                 x_recon: torch.Tensor,
                                 log_var: float = 0.0) -> torch.Tensor:
    """
    Gaussian reconstruction loss (MSE) for continuous data.

    Args:
        x: Original data, shape [batch, ...]
        x_recon: Reconstructed mean from decoder, shape [batch, ...]
        log_var: Log variance (fixed or learned), scalar

    Returns:
        Negative log-likelihood per sample, shape [batch]
    """
    batch_size = x.shape[0]
    x_flat = x.view(batch_size, -1)
    x_recon_flat = x_recon.view(batch_size, -1)

    # Gaussian NLL: 0.5 * (log(2*pi*sigma^2) + (x-mu)^2/sigma^2)
    # With fixed sigma, constant terms can be ignored
    # Proportional to MSE: (x - mu)^2 / (2 * sigma^2)
    var = torch.exp(torch.tensor(log_var))
    mse = ((x_flat - x_recon_flat) ** 2).sum(dim=1)
    nll = 0.5 * (mse / var + x_flat.shape[1] * log_var)

    return nll  # Shape: [batch]
```

Putting it all together, the VAE training objective is to maximize the ELBO, which we can equivalently express as minimizing the negative ELBO loss:
$$\mathcal{L}_{\text{VAE}} = \mathbb{E}_{\mathbf{z} \sim q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})} \left[-\log p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})\right] + D_{\text{KL}}(q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}))$$
$$= \underbrace{\text{Reconstruction Loss}}_{\text{How well can we recover } \mathbf{x}?} + \underbrace{\text{KL Divergence}}_{\text{How "normal" is the latent code?}}$$
In practice, a weighting factor β is often introduced:
$$\mathcal{L}_{\beta\text{-VAE}} = \text{Reconstruction Loss} + \beta \cdot D_{\text{KL}}$$
With $\beta = 1$, we have the standard VAE (maximizing ELBO). Different values of $\beta$ create different trade-offs and have spawned influential variants like β-VAE for disentangled representations.
```python
import torch
import torch.nn.functional as F
from typing import Tuple

def vae_loss(x: torch.Tensor,
             x_recon: torch.Tensor,
             mu: torch.Tensor,
             log_var: torch.Tensor,
             beta: float = 1.0,
             reduction: str = 'mean') -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """
    Complete VAE loss: Reconstruction + beta * KL divergence.

    Args:
        x: Original data, shape [batch, channels, height, width]
        x_recon: Reconstructed logits, shape [batch, channels, height, width]
        mu: Latent mean from encoder, shape [batch, latent_dim]
        log_var: Latent log-variance from encoder, shape [batch, latent_dim]
        beta: Weight for KL term (1.0 = standard VAE)
        reduction: 'mean' or 'sum' over batch

    Returns:
        total_loss, reconstruction_loss, kl_loss (all scalars if reduction='mean')
    """
    batch_size = x.shape[0]

    # Reconstruction loss (BCE for binary, MSE for continuous)
    # Using BCE with logits for numerical stability
    recon_loss = F.binary_cross_entropy_with_logits(
        x_recon.view(batch_size, -1),
        x.view(batch_size, -1),
        reduction='none'
    ).sum(dim=1)  # Sum over data dimensions

    # KL divergence: -0.5 * sum(1 + log_var - mu^2 - exp(log_var))
    kl_loss = -0.5 * torch.sum(
        1 + log_var - mu.pow(2) - log_var.exp(),
        dim=1
    )  # Sum over latent dimensions

    # Combine with beta weighting
    total_loss = recon_loss + beta * kl_loss

    # Apply reduction
    if reduction == 'mean':
        return total_loss.mean(), recon_loss.mean(), kl_loss.mean()
    elif reduction == 'sum':
        return total_loss.sum(), recon_loss.sum(), kl_loss.sum()
    else:
        return total_loss, recon_loss, kl_loss

# Training loop example
def train_step(model, optimizer, x_batch, beta=1.0):
    """Single VAE training step."""
    optimizer.zero_grad()

    # Forward pass through VAE
    x_recon, mu, log_var = model(x_batch)

    # Compute loss
    loss, recon_loss, kl_loss = vae_loss(
        x_batch, x_recon, mu, log_var, beta=beta
    )

    # Backward pass
    loss.backward()
    optimizer.step()

    return {
        'loss': loss.item(),
        'recon_loss': recon_loss.item(),
        'kl_loss': kl_loss.item()
    }
```

Track reconstruction loss and KL divergence separately during training. Healthy VAE training shows: (1) reconstruction loss decreasing as the decoder improves, and (2) KL loss increasing initially and then stabilizing, indicating the encoder is learning informative representations while staying close to the prior. If the KL stays near zero ('posterior collapse'), the model is ignoring the latent space.
Beyond the mathematics, it's essential to build intuition for what the VAE objective is achieving. Let's examine each component's role in shaping the learned representation.
The two terms of the ELBO create a productive tension:
Reconstruction wants specificity: To accurately reconstruct $\mathbf{x}$, the encoder should produce a narrow posterior centered exactly at the latent code that makes $\mathbf{x}$ most likely. Ideally, a delta function $q(\mathbf{z}|\mathbf{x}) = \delta(\mathbf{z} - \mathbf{z}^*)$.
KL regularization wants generality: The KL term penalizes posteriors that deviate from the prior. It wants the encoder to produce posteriors that are broad, close to $\mathcal{N}(0, I)$—essentially forgetting the input.
The compromise: The model finds a middle ground where posteriors encode enough information for reconstruction while maintaining a structured, prior-like distribution. This compromise is what gives VAEs their generative capabilities.
| β Value | Reconstruction | Latent Structure | Generation Quality | Use Case |
|---|---|---|---|---|
| β → 0 | Excellent (near-perfect) | Poor (unregularized, far from prior) | Poor (holes in latent space) | Compression/AE behavior |
| β = 1 | Good | Balanced | Good | Standard VAE |
| β > 1 | Moderate (blurrier) | Strong (disentangled) | Moderate | β-VAE for disentanglement |
| β >> 1 | Poor (very blurry) | Excellent (very structured) | Variable | Discovering factors of variation |
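To see that β only re-weights the two terms rather than changing how they are computed, here is an illustrative sweep on stand-in tensors (no model is trained here), assuming the `vae_loss` function defined earlier is in scope:

```python
import torch

# Illustrative sweep on stand-in tensors (no trained model), assuming the vae_loss
# function defined above is in scope: beta only re-weights the KL term.
torch.manual_seed(0)
x = torch.rand(16, 1, 28, 28)          # stand-in batch of images in [0, 1]
x_recon = torch.randn(16, 1, 28, 28)   # stand-in decoder logits
mu = torch.randn(16, 32)
log_var = torch.randn(16, 32)

for beta in [0.1, 1.0, 4.0, 10.0]:
    total, recon, kl = vae_loss(x, x_recon, mu, log_var, beta=beta)
    print(f"beta={beta:>4}: total={total.item():.1f}  recon={recon.item():.1f}  kl={kl.item():.1f}")
# recon and kl themselves do not change across the sweep; only their weighting in
# the total objective shifts, which drives the trade-offs summarized in the table.
```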
The VAE objective implicitly ensures good generative properties:
Coverage of the prior: The KL term forces $q(\mathbf{z}|\mathbf{x})$ to cover most of the prior support. If all posteriors collapse to a small region, KL would be high. This means random samples from the prior $p(\mathbf{z})$ will likely fall in regions where some training data mapped.
Smooth latent space: Overlapping posteriors create smooth transitions between encoded points. Moving in latent space corresponds to meaningful changes in generated outputs.
No holes: Unlike deterministic autoencoders where the decoder only learns to handle specific encoder outputs, VAE decoders must handle samples from broad Gaussian posteriors—making them robust to sampling from the prior.
The ELBO has an elegant information-theoretic interpretation:
$$\mathcal{L} = \underbrace{\mathbb{E}_{q}[\log p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})]}_{\text{bits to reconstruct } \mathbf{x} \text{ given } \mathbf{z}} - \underbrace{D_{\text{KL}}(q(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}))}_{\text{bits to transmit } \mathbf{z}}$$
The VAE learns an efficient encoding: minimize the bits needed to specify the latent code while keeping enough information to reconstruct the data. This is the rate-distortion trade-off from information theory!
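As a rough illustration of this reading (reusing the helper functions defined earlier on stand-in tensors, so the numbers only demonstrate the units, not a trained model), the KL term plays the role of the rate and the reconstruction negative log-likelihood the distortion, both convertible from nats to bits:

```python
import math
import torch

# Rough illustration of the rate-distortion reading, reusing kl_divergence_gaussian
# and reconstruction_loss_bernoulli from above on stand-in tensors (no trained model),
# and converting from nats to bits.
torch.manual_seed(0)
x = torch.rand(8, 784).round()         # binarized stand-in data
x_recon = torch.randn(8, 784)          # stand-in decoder logits
mu, log_var = torch.randn(8, 32), torch.randn(8, 32)

rate_nats = kl_divergence_gaussian(mu, log_var).mean()               # "bits to transmit z"
distortion_nats = reconstruction_loss_bernoulli(x, x_recon).mean()   # "bits to reconstruct x"

print(f"rate:       {(rate_nats / math.log(2)).item():.1f} bits")
print(f"distortion: {(distortion_nats / math.log(2)).item():.1f} bits")
# The negative ELBO is their sum (in nats); training trades one against the other.
```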
We've built up a complete understanding of the VAE objective from first principles. Let's consolidate the key insights: (1) generative modeling with latent variables makes the marginal likelihood intractable; (2) variational inference introduces an approximate posterior $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$ to sidestep this; (3) Jensen's inequality yields the ELBO, a tractable lower bound on $\log p_{\boldsymbol{\theta}}(\mathbf{x})$; (4) the ELBO decomposes into a reconstruction term and a KL regularizer whose tension shapes the latent space; and (5) the gap between the ELBO and the true log-likelihood is exactly the KL divergence from the approximate to the true posterior.
What's Next:
With the theoretical objective established, the next page explores the encoder and decoder networks—the neural network architectures that parameterize the approximate posterior and likelihood. We'll see how to design these networks for different data modalities and examine the architectural choices that impact VAE performance.
You now understand the deep mathematical motivation behind the VAE training objective. This isn't just a loss function we invented—it's derived from principled probabilistic reasoning. Every term has meaning, and understanding this meaning will guide your architectural and hyperparameter choices when building VAEs.