Variational Autoencoders represent one of the most elegant intersections of deep learning and probabilistic modeling. Unlike purely discriminative models that learn boundaries between classes, or simple autoencoders that merely compress and reconstruct, VAEs provide a principled probabilistic framework for learning the underlying distribution of data—enabling not just reconstruction, but genuine generation of new samples.
At the heart of VAEs lies a profound insight: we can frame generative modeling as a latent variable model and derive a tractable learning objective through variational inference. The result—the Evidence Lower Bound (ELBO)—gives us both a training signal and deep theoretical guarantees about what we're learning.
This page derives the VAE objective from first principles, ensuring you understand not just what the loss function is, but why it takes this particular form and what each term contributes to the model's behavior.
By the end of this page, you will: (1) Understand the probabilistic formulation of generative modeling with latent variables, (2) Derive the Evidence Lower Bound (ELBO) from maximum likelihood, (3) Interpret each term of the ELBO—reconstruction loss and KL divergence, (4) Understand why direct maximum likelihood is intractable and how variational inference provides a solution, and (5) Appreciate the theoretical foundations that make VAEs mathematically principled.
Before diving into the VAE objective, let's crystallize what we're trying to accomplish. In generative modeling, we have a dataset $\mathcal{D} = \{\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(N)}\}$ of samples (images, text, audio, etc.), and our goal is to learn the true underlying data distribution $p_{\text{data}}(\mathbf{x})$.
The maximum likelihood principle tells us the optimal approach: find the model parameters $\boldsymbol{\theta}$ that maximize the probability of observing our data:
$$\boldsymbol{\theta}^* = \arg\max_{\boldsymbol{\theta}} \prod_{i=1}^{N} p_{\boldsymbol{\theta}}(\mathbf{x}^{(i)})$$
Equivalently, we maximize the log-likelihood:
$$\boldsymbol{\theta}^* = \arg\max_{\boldsymbol{\theta}} \sum_{i=1}^{N} \log p_{\boldsymbol{\theta}}(\mathbf{x}^{(i)})$$
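To make the principle concrete before we hit the intractability below, here is a minimal sketch (a toy example of our own, with assumed data and a one-dimensional Gaussian model) in which the log-likelihood is tractable and can be maximized directly by gradient descent on the negative log-likelihood:

```python
import torch

# Toy illustration (assumed): maximum likelihood for a 1-D Gaussian model
# p_theta(x) = N(x; mu, sigma^2), where the log-likelihood is tractable --
# exactly the property that breaks down for high-dimensional images below.
torch.manual_seed(0)
data = 2.0 + 0.5 * torch.randn(1000)               # samples from the "true" distribution

mu = torch.zeros(1, requires_grad=True)
log_sigma = torch.zeros(1, requires_grad=True)     # parameterize sigma > 0 via its log
optimizer = torch.optim.Adam([mu, log_sigma], lr=0.05)

for step in range(500):
    optimizer.zero_grad()
    # Negative log-likelihood of the whole dataset under N(mu, sigma^2)
    nll = -torch.distributions.Normal(mu, log_sigma.exp()).log_prob(data).sum()
    nll.backward()
    optimizer.step()

print(f"MLE estimate: mu = {mu.item():.3f}, sigma = {log_sigma.exp().item():.3f}")
# Converges toward the sample mean (~2.0) and standard deviation (~0.5)
```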
This seems straightforward, but there's a fundamental challenge: for complex, high-dimensional data like images, directly defining and computing $p_{\boldsymbol{\theta}}(\mathbf{x})$ is extraordinarily difficult. How can we tractably model a distribution over all possible 256×256 RGB images?
A 256×256 RGB image has 196,608 dimensions. The space of possible images is astronomical—far beyond any tractable explicit probability distribution. We need a clever way to implicitly represent this distribution while maintaining computational tractability.
The key insight is to introduce latent variables $\mathbf{z}$ that capture the underlying factors of variation in our data. We hypothesize that each observed datapoint $\mathbf{x}$ was generated from some unobserved latent code $\mathbf{z}$ through a two-step generative process: first sample a latent code $\mathbf{z} \sim p(\mathbf{z})$ from the prior, then sample the observation $\mathbf{x} \sim p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})$ from the likelihood.
The marginal likelihood of data under this model is:
$$p_{\boldsymbol{\theta}}(\mathbf{x}) = \int p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) p(\mathbf{z}) \, d\mathbf{z}$$
This latent variable formulation is powerful because it decomposes a complex distribution over observations into simpler, well-understood components:
| Component | Notation | Role | Typical Choice in VAE |
|---|---|---|---|
| Prior | $p(\mathbf{z})$ | Distribution over latent codes before seeing data | Standard Gaussian $\mathcal{N}(0, I)$ |
| Likelihood | $p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})$ | How latent codes generate observations | Neural network (decoder) |
| Posterior | $p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x})$ | True distribution over latents given an observation | Intractable (requires variational approx.) |
| Marginal | $p_{\boldsymbol{\theta}}(\mathbf{x})$ | Total probability of an observation | Intractable (requires marginalization) |
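The sketch below illustrates the generative process implied by this table via ancestral sampling. The MLP decoder, its dimensions, and the Bernoulli output are placeholder assumptions for illustration, not a reference implementation:

```python
import torch
import torch.nn as nn

# Minimal sketch of the generative process in the table via ancestral sampling.
# The MLP decoder, its sizes, and the Bernoulli output are placeholder assumptions.
latent_dim, data_dim = 16, 784

decoder = nn.Sequential(                   # parameterizes p_theta(x | z)
    nn.Linear(latent_dim, 256),
    nn.ReLU(),
    nn.Linear(256, data_dim),
)

z = torch.randn(8, latent_dim)             # step 1: z ~ p(z) = N(0, I)
x_probs = torch.sigmoid(decoder(z))        # step 2: decoder maps z to Bernoulli means
x_sample = torch.bernoulli(x_probs)        # step 2 (cont.): x ~ p_theta(x | z)
print(x_sample.shape)                      # torch.Size([8, 784])
```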
The latent variable formulation seems elegant, but it introduces a severe computational problem. To maximize $\log p_{\boldsymbol{\theta}}(\mathbf{x})$, we need to compute:
$$\log p_{\boldsymbol{\theta}}(\mathbf{x}) = \log \int p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) p(\mathbf{z}) \, d\mathbf{z}$$
This integral is intractable for two fundamental reasons:
1. High-Dimensional Integration
The integral is over the entire latent space, which may have hundreds of dimensions. No closed-form solution exists, and numerical integration is computationally impossible in high dimensions (curse of dimensionality).
2. Posterior Intractability
To efficiently estimate this integral via Monte Carlo methods, we would ideally sample from the posterior $p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x})$—the distribution of latent codes that could have generated a specific observation. But by Bayes' theorem:
$$p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x}) = \frac{p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) p(\mathbf{z})}{p_{\boldsymbol{\theta}}(\mathbf{x})}$$
Computing this requires the very marginal likelihood $p_{\boldsymbol{\theta}}(\mathbf{x})$ we're trying to compute—a circular dependency.
We want to maximize $\log p_{\boldsymbol{\theta}}(\mathbf{x})$, but we can't compute it. We need $p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x})$ for efficient estimation, but computing it requires $p_{\boldsymbol{\theta}}(\mathbf{x})$. We're stuck in a circle. This is where variational inference comes to the rescue.
One might attempt to estimate the marginal likelihood directly:
$$p_{\boldsymbol{\theta}}(\mathbf{x}) \approx \frac{1}{K} \sum_{k=1}^{K} p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}^{(k)}), \quad \mathbf{z}^{(k)} \sim p(\mathbf{z})$$
This is called importance sampling with the prior as the proposal distribution. While unbiased, it has astronomically high variance for high-dimensional data.
Why it fails: Consider an image of a face. Most samples from the prior $p(\mathbf{z})$ will decode to noise or unrelated images. Only a tiny fraction of the latent space actually corresponds to faces. We would need an impossibly large number of samples to get even one that's relevant to our specific image.
Mathematically, if the posterior $p(\mathbf{z}|\mathbf{x})$ concentrates in a small region of latent space (which it does for meaningful data), samples from the prior $p(\mathbf{z})$ will almost never land in that region.
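The following sketch makes this failure concrete on a toy linear-Gaussian model (our own assumption, chosen because its exact marginal likelihood is known): even with many prior samples, the naive estimator badly underestimates the evidence and fluctuates between runs:

```python
import math
import torch

# Toy linear-Gaussian model (assumed for illustration): p(z) = N(0, I_d),
# p(x|z) = N(z, sigma2 * I_d), so the exact marginal p(x) = N(0, (1 + sigma2) I_d)
# is available to compare against.
torch.manual_seed(0)
d, sigma2 = 50, 0.1
x = torch.randn(d) * (1 + sigma2) ** 0.5           # one observation from the true marginal

exact_log_px = torch.distributions.MultivariateNormal(
    torch.zeros(d), (1 + sigma2) * torch.eye(d)
).log_prob(x)

def naive_log_px_estimate(num_samples: int) -> torch.Tensor:
    """log( (1/K) sum_k p(x|z_k) ) with z_k ~ p(z): importance sampling from the prior."""
    z = torch.randn(num_samples, d)                # z_k ~ N(0, I)
    log_pxz = torch.distributions.Normal(z, sigma2 ** 0.5).log_prob(x).sum(dim=1)
    return torch.logsumexp(log_pxz, dim=0) - math.log(num_samples)

print(f"exact log p(x):        {exact_log_px.item():.1f}")
for _ in range(3):
    print(f"naive estimate, K=1e5: {naive_log_px_estimate(100_000).item():.1f}")
# The estimates fall far below the truth and fluctuate between runs: with d = 50
# latent dimensions, essentially no prior sample lands where the posterior concentrates.
```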
The solution: Instead of sampling from the prior and hoping, we need to sample from a distribution that actually concentrates in the relevant regions of latent space for each datapoint. This is what the variational approach provides.
The breakthrough of VAEs is to address the intractability through variational inference. Since we can't compute the true posterior $p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x})$, we'll introduce an approximate posterior $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$ parameterized by $\boldsymbol{\phi}$.
This approximate posterior must be flexible enough to track the true posterior for each input, yet simple enough that we can sample from it and evaluate its density efficiently.
The variational approach converts our intractable optimization problem (maximize log-likelihood) into a tractable one (maximize a lower bound while making the approximate posterior accurate).
Traditional variational inference optimizes separate parameters for each datapoint. VAEs use amortized inference: a single neural network (the encoder) outputs the approximate posterior parameters for any input $\mathbf{x}$.
This amortization is crucial: it shares statistical strength across datapoints, and inference for a new observation requires only a single forward pass through the encoder rather than a separate per-datapoint optimization.
The most common choice for the approximate posterior is a diagonal Gaussian:
$$q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}) = \mathcal{N}(\mathbf{z}; \boldsymbol{\mu}_{\boldsymbol{\phi}}(\mathbf{x}), \text{diag}(\boldsymbol{\sigma}^2_{\boldsymbol{\phi}}(\mathbf{x})))$$
where $\boldsymbol{\mu}_{\boldsymbol{\phi}}(\mathbf{x})$ and $\boldsymbol{\sigma}^2_{\boldsymbol{\phi}}(\mathbf{x})$ are the outputs of the encoder network with parameters $\boldsymbol{\phi}$.
This choice satisfies our requirements: it is easy to sample from, its density can be evaluated in closed form, and, as we will see, its KL divergence to the standard Gaussian prior has a closed-form expression.
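A minimal sketch of such an amortized Gaussian encoder is shown below; the MLP backbone and dimensions are illustrative assumptions, and real encoder architectures are the topic of the next page:

```python
import torch
import torch.nn as nn

# Minimal sketch of an amortized diagonal-Gaussian encoder. The MLP backbone and
# dimensions are illustrative assumptions; encoder design is covered on the next page.
class GaussianEncoder(nn.Module):
    def __init__(self, data_dim: int = 784, hidden_dim: int = 256, latent_dim: int = 16):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(data_dim, hidden_dim), nn.ReLU())
        self.mu_head = nn.Linear(hidden_dim, latent_dim)       # outputs mu_phi(x)
        self.log_var_head = nn.Linear(hidden_dim, latent_dim)  # outputs log sigma^2_phi(x)

    def forward(self, x: torch.Tensor):
        h = self.backbone(x)
        return self.mu_head(h), self.log_var_head(h)

encoder = GaussianEncoder()
x = torch.rand(32, 784)
mu, log_var = encoder(x)          # q_phi(z | x) parameters for every x in the batch
print(mu.shape, log_var.shape)    # torch.Size([32, 16]) torch.Size([32, 16])
```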
Now we derive the central mathematical result: the Evidence Lower Bound (ELBO). This derivation is fundamental—understanding it deeply will inform your intuition about VAE behavior.
We want to maximize $\log p_{\boldsymbol{\theta}}(\mathbf{x})$. Let's introduce our approximate posterior by multiplying and dividing:
$$\log p_{\boldsymbol{\theta}}(\mathbf{x}) = \log \int p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) \, d\mathbf{z}$$
$$= \log \int p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) \frac{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})} \, d\mathbf{z}$$
$$= \log \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})} \left[ \frac{p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})}{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})} \right]$$
Since $\log$ is a concave function, Jensen's inequality gives us:
$$\log \mathbb{E}[X] \geq \mathbb{E}[\log X]$$
Applying this:
$$\log p_{\boldsymbol{\theta}}(\mathbf{x}) \geq \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})} \left[ \log \frac{p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})}{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})} \right]$$
The right-hand side is the Evidence Lower Bound (ELBO):
$$\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\phi}; \mathbf{x}) = \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})} \left[ \log \frac{p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})}{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})} \right]$$
So we have established:
$$\boxed{\log p_{\boldsymbol{\theta}}(\mathbf{x}) \geq \mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\phi}; \mathbf{x})}$$
In Bayesian terminology, $p(\mathbf{x})$ is called the 'evidence'—the probability of observing our data under the model. The ELBO provides a lower bound on the log of this evidence. Maximizing ELBO pushes up this lower bound, indirectly maximizing the log-evidence.
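The bound can be checked numerically on a toy conjugate Gaussian model (an assumption made purely for illustration, chosen because its exact evidence and posterior are known in closed form): a Monte Carlo estimate of the ELBO never exceeds $\log p(\mathbf{x})$, and it matches it exactly when $q$ equals the true posterior:

```python
import torch

torch.manual_seed(0)

# Toy conjugate model (assumed, for illustration): p(z) = N(0, 1), p(x|z) = N(z, sigma2),
# so the exact evidence p(x) = N(0, 1 + sigma2) and the exact posterior are both known.
sigma2 = 0.5
x = torch.tensor(1.3)

exact_log_px = torch.distributions.Normal(0.0, (1 + sigma2) ** 0.5).log_prob(x)

def elbo_estimate(q_mu, q_std, num_samples=200_000):
    """Monte Carlo ELBO: E_q[ log p(x|z) + log p(z) - log q(z) ]."""
    q = torch.distributions.Normal(q_mu, q_std)
    z = q.sample((num_samples,))
    log_pxz = torch.distributions.Normal(z, sigma2 ** 0.5).log_prob(x)
    log_pz = torch.distributions.Normal(0.0, 1.0).log_prob(z)
    return (log_pxz + log_pz - q.log_prob(z)).mean()

# Exact posterior for this model: N(x / (1 + sigma2), sigma2 / (1 + sigma2))
post_mu = x / (1 + sigma2)
post_std = (sigma2 / (1 + sigma2)) ** 0.5

print(f"log p(x)                  = {exact_log_px.item():.4f}")
print(f"ELBO with q = exact post. = {elbo_estimate(post_mu, post_std).item():.4f}")  # matches log p(x)
print(f"ELBO with q = N(0, 1)     = {elbo_estimate(0.0, 1.0).item():.4f}")           # strictly smaller
```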
The ELBO can be decomposed in two illuminating ways. Each reveals different aspects of what the VAE is learning.
Using $p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) = p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})p(\mathbf{z})$:
$$\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\phi}; \mathbf{x}) = \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})} \left[ \log p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z}) + \log \frac{p(\mathbf{z})}{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})} \right] = \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})} \left[ \log p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z}) \right] - D_{\text{KL}}(q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}))$$
This is the most common form, revealing two terms:
Term 1: Reconstruction Loss (Expected Log-Likelihood) $$\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})} \left[ \log p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z}) \right]$$
This measures how well the decoder reconstructs $\mathbf{x}$ from latent samples. Higher is better—we want the decoder to produce high probability for the original data.
Term 2: KL Regularization (Posterior Regularization) $$D_{\text{KL}}(q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}))$$
This penalizes deviation of the approximate posterior from the prior. It encourages the encoder to produce latent codes that stay close to the prior distribution.
There's an alternative decomposition that reveals exactly when ELBO equals the log-likelihood:
$$\log p_{\boldsymbol{\theta}}(\mathbf{x}) = \mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\phi}; \mathbf{x}) + D_{\text{KL}}(q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}) \,\|\, p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x}))$$
This equality tells us that the gap between the log-likelihood and the ELBO is exactly the KL divergence from the approximate posterior to the true posterior: the bound is tight precisely when $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}) = p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x})$.
Since KL divergence is always non-negative, we always have $\mathcal{L} \leq \log p_{\boldsymbol{\theta}}(\mathbf{x})$, confirming it's a lower bound.
The KL divergence $D_{\text{KL}}(q||p_{\theta})$ between approximate and true posterior is called the 'variational gap' or 'inference gap'. A larger gap means we're leaving potential log-likelihood on the table. This motivates research into more expressive approximate posteriors (normalizing flows, importance-weighted bounds, etc.).
Let's see how to actually compute the ELBO for training. We'll work with the standard choices: Gaussian prior, Gaussian approximate posterior, and either Gaussian or Bernoulli likelihood.
With a standard Gaussian prior $p(\mathbf{z}) = \mathcal{N}(0, I)$ and diagonal Gaussian posterior $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2))$:
$$D_{\text{KL}}(q || p) = -\frac{1}{2} \sum_{j=1}^{d} \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$
This has a closed-form solution! No sampling required. Each latent dimension contributes independently.
```python
import torch

def kl_divergence_gaussian(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """
    Compute KL divergence between N(mu, sigma^2) and N(0, I).

    Uses log_var = log(sigma^2) for numerical stability.

    Args:
        mu: Mean of approximate posterior, shape [batch, latent_dim]
        log_var: Log variance of approx. posterior, shape [batch, latent_dim]

    Returns:
        KL divergence per sample, shape [batch]

    Derivation:
        KL(q||p) = 0.5 * sum(sigma^2 + mu^2 - 1 - log(sigma^2))
                 = 0.5 * sum(exp(log_var) + mu^2 - 1 - log_var)
    """
    # Sum over latent dimensions, preserve batch dimension
    kl_per_sample = -0.5 * torch.sum(
        1 + log_var - mu.pow(2) - log_var.exp(),
        dim=1
    )
    return kl_per_sample  # Shape: [batch]

# Example usage
batch_size, latent_dim = 32, 64
mu = torch.randn(batch_size, latent_dim)
log_var = torch.randn(batch_size, latent_dim) - 1  # Initialize near unit variance

kl = kl_divergence_gaussian(mu, log_var)
print(f"KL divergence shape: {kl.shape}")  # [32]
print(f"Mean KL per sample: {kl.mean().item():.4f}")
```

The reconstruction term $\mathbb{E}_{q}[\log p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})]$ is estimated via sampling:
$$\mathbb{E}_{q}[\log p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})] \approx \frac{1}{L}\sum_{l=1}^{L} \log p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z}^{(l)}), \quad \mathbf{z}^{(l)} \sim q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$$
In practice, $L=1$ (single sample) works remarkably well due to minibatch averaging.
For Binary Data (Bernoulli Likelihood): $$\log p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z}) = \sum_i x_i \log \hat{x}_i + (1-x_i) \log(1-\hat{x}_i)$$
This is the binary cross-entropy loss! The decoder outputs probabilities $\hat{x}_i = \text{sigmoid}(f_{\boldsymbol{\theta}}(\mathbf{z}))_i$.
For Continuous Data (Gaussian Likelihood): $$\log p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z}) \propto -\frac{1}{2\sigma^2}||\mathbf{x} - \hat{\mathbf{x}}||^2$$
With fixed variance, this reduces to mean squared error! The decoder outputs $\hat{\mathbf{x}} = f_{\boldsymbol{\theta}}(\mathbf{z})$.
```python
import torch
import torch.nn.functional as F

def reconstruction_loss_bernoulli(x: torch.Tensor, x_recon: torch.Tensor) -> torch.Tensor:
    """
    Binary cross-entropy reconstruction loss for binary/binarized data.

    Args:
        x: Original data, shape [batch, ...], values in [0, 1]
        x_recon: Reconstructed logits from decoder, shape [batch, ...]

    Returns:
        Negative log-likelihood per sample, shape [batch]
    """
    # Flatten all dimensions except batch
    batch_size = x.shape[0]
    x_flat = x.view(batch_size, -1)
    x_recon_flat = x_recon.view(batch_size, -1)

    # Binary cross-entropy (reduction='none' gives per-element loss)
    bce = F.binary_cross_entropy_with_logits(
        x_recon_flat, x_flat, reduction='none'
    )

    # Sum over data dimensions (negative log-likelihood)
    return bce.sum(dim=1)  # Shape: [batch]

def reconstruction_loss_gaussian(x: torch.Tensor,
                                 x_recon: torch.Tensor,
                                 log_var: float = 0.0) -> torch.Tensor:
    """
    Gaussian reconstruction loss (MSE) for continuous data.

    Args:
        x: Original data, shape [batch, ...]
        x_recon: Reconstructed mean from decoder, shape [batch, ...]
        log_var: Log variance (fixed or learned), scalar

    Returns:
        Negative log-likelihood per sample, shape [batch]
    """
    batch_size = x.shape[0]
    x_flat = x.view(batch_size, -1)
    x_recon_flat = x_recon.view(batch_size, -1)

    # Gaussian NLL: 0.5 * (log(2*pi*sigma^2) + (x-mu)^2/sigma^2)
    # With fixed sigma, constant terms can be ignored
    # Proportional to MSE: (x - mu)^2 / (2 * sigma^2)
    var = torch.exp(torch.tensor(log_var))
    mse = ((x_flat - x_recon_flat) ** 2).sum(dim=1)
    nll = 0.5 * (mse / var + x_flat.shape[1] * log_var)

    return nll  # Shape: [batch]
```

Putting it all together, the VAE training objective is to maximize the ELBO, which we can equivalently express as minimizing the negative ELBO loss:
$$\mathcal{L}_{\text{VAE}} = \mathbb{E}_{\mathbf{z} \sim q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})} \left[-\log p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})\right] + D_{\text{KL}}(q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}))$$
$$= \underbrace{\text{Reconstruction Loss}}_{\text{How well can we recover } \mathbf{x}?} + \underbrace{\text{KL Divergence}}_{\text{How "normal" is the latent code?}}$$
In practice, a weighting factor β is often introduced:
$$\mathcal{L}_{\beta\text{-VAE}} = \text{Reconstruction Loss} + \beta \cdot D_{\text{KL}}$$
With $\beta = 1$, we have the standard VAE (maximizing ELBO). Different values of $\beta$ create different trade-offs and have spawned influential variants like β-VAE for disentangled representations.
```python
import torch
import torch.nn.functional as F
from typing import Tuple

def vae_loss(x: torch.Tensor,
             x_recon: torch.Tensor,
             mu: torch.Tensor,
             log_var: torch.Tensor,
             beta: float = 1.0,
             reduction: str = 'mean') -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """
    Complete VAE loss: Reconstruction + beta * KL divergence.

    Args:
        x: Original data, shape [batch, channels, height, width]
        x_recon: Reconstructed logits, shape [batch, channels, height, width]
        mu: Latent mean from encoder, shape [batch, latent_dim]
        log_var: Latent log-variance from encoder, shape [batch, latent_dim]
        beta: Weight for KL term (1.0 = standard VAE)
        reduction: 'mean' or 'sum' over batch

    Returns:
        total_loss, reconstruction_loss, kl_loss (all scalars if reduction='mean')
    """
    batch_size = x.shape[0]

    # Reconstruction loss (BCE for binary, MSE for continuous)
    # Using BCE with logits for numerical stability
    recon_loss = F.binary_cross_entropy_with_logits(
        x_recon.view(batch_size, -1),
        x.view(batch_size, -1),
        reduction='none'
    ).sum(dim=1)  # Sum over data dimensions

    # KL divergence: -0.5 * sum(1 + log_var - mu^2 - exp(log_var))
    kl_loss = -0.5 * torch.sum(
        1 + log_var - mu.pow(2) - log_var.exp(),
        dim=1
    )  # Sum over latent dimensions

    # Combine with beta weighting
    total_loss = recon_loss + beta * kl_loss

    # Apply reduction
    if reduction == 'mean':
        return total_loss.mean(), recon_loss.mean(), kl_loss.mean()
    elif reduction == 'sum':
        return total_loss.sum(), recon_loss.sum(), kl_loss.sum()
    else:
        return total_loss, recon_loss, kl_loss

# Training loop example
def train_step(model, optimizer, x_batch, beta=1.0):
    """Single VAE training step."""
    optimizer.zero_grad()

    # Forward pass through VAE
    x_recon, mu, log_var = model(x_batch)

    # Compute loss
    loss, recon_loss, kl_loss = vae_loss(
        x_batch, x_recon, mu, log_var, beta=beta
    )

    # Backward pass
    loss.backward()
    optimizer.step()

    return {
        'loss': loss.item(),
        'recon_loss': recon_loss.item(),
        'kl_loss': kl_loss.item()
    }
```

Track reconstruction loss and KL divergence separately during training. Healthy VAE training shows: (1) reconstruction loss decreasing as the decoder improves, and (2) KL loss increasing initially and then stabilizing, indicating the encoder is learning informative representations while staying close to the prior. If the KL stays near zero ('posterior collapse'), the model is ignoring the latent space.
Beyond the mathematics, it's essential to build intuition for what the VAE objective is achieving. Let's examine each component's role in shaping the learned representation.
The two terms of the ELBO create a productive tension:
Reconstruction wants specificity: To accurately reconstruct $\mathbf{x}$, the encoder should produce a narrow posterior centered exactly at the latent code that makes $\mathbf{x}$ most likely. Ideally, a delta function $q(\mathbf{z}|\mathbf{x}) = \delta(\mathbf{z} - \mathbf{z}^*)$.
KL regularization wants generality: The KL term penalizes posteriors that deviate from the prior. It wants the encoder to produce posteriors that are broad, close to $\mathcal{N}(0, I)$—essentially forgetting the input.
The compromise: The model finds a middle ground where posteriors encode enough information for reconstruction while maintaining a structured, prior-like distribution. This compromise is what gives VAEs their generative capabilities.
| β Value | Reconstruction | Latent Structure | Generation Quality | Use Case |
|---|---|---|---|---|
| β → 0 | Excellent (near-perfect) | Poor (unregularized, far from prior) | Poor (holes in latent space) | Compression/AE behavior |
| β = 1 | Good | Balanced | Good | Standard VAE |
| β > 1 | Moderate (blurrier) | Strong (disentangled) | Moderate | β-VAE for disentanglement |
| β >> 1 | Poor (very blurry) | Excellent (very structured) | Variable | Discovering factors of variation |
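To see that β only re-weights the two terms rather than changing how they are computed, here is an illustrative sweep on stand-in tensors (no model is trained here), assuming the `vae_loss` function defined earlier is in scope:

```python
import torch

# Illustrative sweep on stand-in tensors (no trained model), assuming the vae_loss
# function defined above is in scope: beta only re-weights the KL term.
torch.manual_seed(0)
x = torch.rand(16, 1, 28, 28)          # stand-in batch of images in [0, 1]
x_recon = torch.randn(16, 1, 28, 28)   # stand-in decoder logits
mu = torch.randn(16, 32)
log_var = torch.randn(16, 32)

for beta in [0.1, 1.0, 4.0, 10.0]:
    total, recon, kl = vae_loss(x, x_recon, mu, log_var, beta=beta)
    print(f"beta={beta:>4}: total={total.item():.1f}  recon={recon.item():.1f}  kl={kl.item():.1f}")
# recon and kl themselves do not change across the sweep; only their weighting in
# the total objective shifts, which drives the trade-offs summarized in the table.
```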
The VAE objective implicitly ensures good generative properties:
Coverage of the prior: The KL term forces $q(\mathbf{z}|\mathbf{x})$ to cover most of the prior support. If all posteriors collapse to a small region, KL would be high. This means random samples from the prior $p(\mathbf{z})$ will likely fall in regions where some training data mapped.
Smooth latent space: Overlapping posteriors create smooth transitions between encoded points. Moving in latent space corresponds to meaningful changes in generated outputs.
No holes: Unlike deterministic autoencoders where the decoder only learns to handle specific encoder outputs, VAE decoders must handle samples from broad Gaussian posteriors—making them robust to sampling from the prior.
The ELBO has an elegant information-theoretic interpretation:
$$\mathcal{L} = \underbrace{\mathbb{E}_{q}[\log p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})]}_{\text{bits to reconstruct } \mathbf{x} \text{ given } \mathbf{z}} - \underbrace{D_{\text{KL}}(q(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}))}_{\text{bits to transmit } \mathbf{z}}$$
The VAE learns an efficient encoding: minimize the bits needed to specify the latent code while keeping enough information to reconstruct the data. This is the rate-distortion trade-off from information theory!
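As a rough illustration of this reading (reusing the helper functions defined earlier on stand-in tensors, so the numbers only demonstrate the units, not a trained model), the KL term plays the role of the rate and the reconstruction negative log-likelihood the distortion, both convertible from nats to bits:

```python
import math
import torch

# Rough illustration of the rate-distortion reading, reusing kl_divergence_gaussian
# and reconstruction_loss_bernoulli from above on stand-in tensors (no trained model),
# and converting from nats to bits.
torch.manual_seed(0)
x = torch.rand(8, 784).round()         # binarized stand-in data
x_recon = torch.randn(8, 784)          # stand-in decoder logits
mu, log_var = torch.randn(8, 32), torch.randn(8, 32)

rate_nats = kl_divergence_gaussian(mu, log_var).mean()               # "bits to transmit z"
distortion_nats = reconstruction_loss_bernoulli(x, x_recon).mean()   # "bits to reconstruct x"

print(f"rate:       {(rate_nats / math.log(2)).item():.1f} bits")
print(f"distortion: {(distortion_nats / math.log(2)).item():.1f} bits")
# The negative ELBO is their sum (in nats); training trades one against the other.
```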
We've built up a complete understanding of the VAE objective from first principles. Let's consolidate the key insights: (1) generative modeling with latent variables makes the marginal likelihood intractable; (2) variational inference introduces an approximate posterior $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$ to sidestep this; (3) Jensen's inequality yields the ELBO, a tractable lower bound on $\log p_{\boldsymbol{\theta}}(\mathbf{x})$; (4) the ELBO decomposes into a reconstruction term and a KL regularizer whose tension shapes the latent space; and (5) the gap between the ELBO and the true log-likelihood is exactly the KL divergence from the approximate to the true posterior.
What's Next:
With the theoretical objective established, the next page explores the encoder and decoder networks—the neural network architectures that parameterize the approximate posterior and likelihood. We'll see how to design these networks for different data modalities and examine the architectural choices that impact VAE performance.
You now understand the deep mathematical motivation behind the VAE training objective. This isn't just a loss function we invented—it's derived from principled probabilistic reasoning. Every term has meaning, and understanding this meaning will guide your architectural and hyperparameter choices when building VAEs.