The observable world is complex, high-dimensional, and seemingly chaotic. Yet underlying this complexity, we often find simple structures: faces vary along dimensions of age, expression, and pose; music varies in tempo, key, and mood; physical systems evolve according to low-dimensional laws.
Latent variable models formalize this intuition. We posit that observed data $x$ is generated from hidden (latent) variables $z$ that capture the underlying structure. These latent variables are not directly observed—they must be inferred from data—but they can dramatically simplify our understanding and manipulation of complex distributions.
This conceptual framework is foundational for modern generative models. Variational autoencoders, factor analysis, mixture models, and representation learning all rest on latent variable foundations. Understanding this framework deeply is essential for mastering modern generative AI.
By the end of this page, you will understand the mathematical formulation of latent variable models, why marginalizing out latent variables creates expressive but intractable distributions, the relationship between latent spaces and manifold learning, inference challenges and the posterior inference problem, classical latent variable models (factor analysis, PCA, GMM), and how this framework extends to deep generative models.
The Generative Story:
A latent variable model posits that observed data $x$ is generated in two steps: first sample a latent $z \sim p(z)$ from the prior, then sample the observation $x \sim p(x|z)$ from the conditional (the decoder).
The joint distribution decomposes as: $$p(x, z) = p(z) \cdot p(x | z)$$
The Marginal Distribution:
The distribution over observations (marginal likelihood) is obtained by integrating out the latent: $$p(x) = \int p(x, z) \, dz = \int p(z) \cdot p(x | z) \, dz$$
This integral is typically intractable for interesting models, which is the source of most computational challenges.
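To make this concrete, note that the integral is an expectation over the prior, $p(x) = \mathbb{E}_{p(z)}[p(x|z)]$, so a naive Monte Carlo estimator always exists—it just converges slowly when the prior rarely lands where $p(x|z)$ is large. A minimal sketch in a toy linear-Gaussian model (the model and its parameters are assumptions chosen so the true marginal is known for comparison):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy model (assumed for illustration): z ~ N(0,1), x|z ~ N(2z, 0.5^2).
# Because it is linear-Gaussian, the true marginal is N(0, 2^2 + 0.5^2).
def mc_marginal(x, n_samples):
    """Naive Monte Carlo estimate: p(x) = E_{p(z)}[p(x|z)]."""
    z = rng.normal(0, 1, n_samples)              # Sample from the prior
    return norm.pdf(x, loc=2 * z, scale=0.5).mean()

x = 1.0
true_px = norm.pdf(x, loc=0, scale=np.sqrt(4 + 0.25))
for n in [10, 1000, 100000]:
    print(f"n={n:>6}: p_hat(x) = {mc_marginal(x, n):.4f}  (true {true_px:.4f})")
```

With a nonlinear decoder there is no closed form to compare against, and in high dimensions this estimator's variance explodes—which is precisely why cleverer inference machinery is needed.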
Key Terminology:
The word 'latent' comes from Latin 'latēre' meaning 'to lie hidden.' Latent variables are hidden because they are not directly observed—they must be inferred from the data. In psychology, latent variables might represent intelligence or personality traits; in ML, they represent abstract features like pose, style, or content.
Why Latent Variables Create Expressiveness:
Even if $p(z)$ and $p(x|z)$ are simple (e.g., Gaussian), the marginal $p(x)$ can be arbitrarily complex.
Example: Gaussian Mixture Model
Let $z$ be discrete, indicating cluster membership: $p(z = k) = \pi_k$ and $x \mid z = k \sim \mathcal{N}(\mu_k, \Sigma_k)$.
Marginal: $$p(x) = \sum_{k=1}^K \pi_k \cdot \mathcal{N}(x | \mu_k, \Sigma_k)$$
A sum of Gaussians can approximate any continuous distribution given enough components. The simple structure (discrete $z$, Gaussian conditionals) yields a universal approximator.
Example: Continuous Latent + Nonlinear Decoder
Let $z \in \mathbb{R}^d$ be continuous, with prior $z \sim \mathcal{N}(0, I)$ and observation $x = f(z) + \epsilon$ for a nonlinear decoder $f$ and Gaussian noise $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$.
The marginal $p(x)$ is a Gaussian convolved with the pushforward of the prior through $f$. For a complex $f$, this can model intricate distributions. This is the VAE setup.
```python
import numpy as np
from scipy.stats import norm

np.random.seed(42)

# Demonstrate how simple latents create complex marginals

# Example 1: Gaussian Mixture Model (discrete latent)
print("=== Gaussian Mixture Model (Discrete Latent) ===")

# Simple components
pi = [0.3, 0.4, 0.3]      # Mixing weights (prior over z)
mus = [-3, 0, 4]          # Component means
sigmas = [0.5, 1.0, 0.7]  # Component stds

# Generate from the model
def sample_gmm(n):
    z = np.random.choice(3, n, p=pi)  # Sample latent (cluster)
    x = np.array([np.random.normal(mus[zi], sigmas[zi]) for zi in z])
    return x, z

x_samples, z_samples = sample_gmm(5000)

print(f"  Prior p(z): {pi}")
print(f"  Conditionals: Gaussian with means {mus}")
print("  Sample statistics:")
print(f"    Mean: {x_samples.mean():.3f}")
print(f"    Std: {x_samples.std():.3f}")
print("  (Multimodal distribution from simple components)")

# Compute marginal density at a point
def gmm_marginal_density(x):
    """p(x) = sum_k pi_k * N(x | mu_k, sigma_k)"""
    return sum(pi[k] * norm.pdf(x, mus[k], sigmas[k]) for k in range(3))

print(f"  Marginal density at x=0: {gmm_marginal_density(0):.4f}")
print(f"  Marginal density at x=-3: {gmm_marginal_density(-3):.4f}")

# Example 2: Continuous latent with nonlinear decoder
print("=== Continuous Latent with Nonlinear Decoder ===")

# 1D latent, 2D observed (data lies on curve in 2D)
def nonlinear_decoder(z):
    """Map 1D latent to 2D: a curve"""
    x1 = z + 0.3 * z**2
    x2 = np.sin(2 * z)
    return np.array([x1, x2])

# Generate from model
n = 500
z = np.random.normal(0, 1, n)             # Prior
noise = np.random.normal(0, 0.1, (n, 2))  # Observation noise
x = np.array([nonlinear_decoder(zi) for zi in z]) + noise

print("  Latent dim: 1")
print("  Observed dim: 2")
print("  Data lies on 1D curve embedded in 2D")
print(f"  Latent z range: [{z.min():.2f}, {z.max():.2f}]")
print(f"  Observed x1 range: [{x[:, 0].min():.2f}, {x[:, 0].max():.2f}]")
print(f"  Observed x2 range: [{x[:, 1].min():.2f}, {x[:, 1].max():.2f}]")

# Example 3: Demonstrate posterior inference challenge
print("=== Posterior Inference Challenge ===")

# Given observed x, what is p(z|x)?
# For GMM, the posterior is available in closed form:
def gmm_posterior(x):
    """p(z=k|x) for GMM"""
    numerators = [pi[k] * norm.pdf(x, mus[k], sigmas[k]) for k in range(3)]
    denominator = sum(numerators)
    return [n / denominator for n in numerators]

x_observed = 1.5
posterior = gmm_posterior(x_observed)
print(f"  Observed x = {x_observed}")
print(f"  Posterior p(z|x): {[f'{p:.3f}' for p in posterior]}")
print(f"  (Component 1 is most likely given x={x_observed})")
```

While latent variable models are conceptually elegant, they present fundamental computational challenges. The core issue: computing the marginal likelihood $p(x)$ and the posterior $p(z|x)$ is typically intractable.
Why is $p(x)$ intractable?
$$p(x) = \int p(z) \, p(x|z) \, dz$$
This integral over all possible latent configurations is high-dimensional, has no closed form once the decoder is nonlinear, and cannot be evaluated by enumeration when $z$ is continuous.
Why is $p(z|x)$ intractable?
$$p(z|x) = \frac{p(x|z) p(z)}{p(x)}$$
We need $p(x)$ in the denominator—which is intractable!
Why does this matter?
Expressiveness and tractability are in tension. A simple model (linear decoder, Gaussian everything) may have closed-form solutions but cannot capture complex data. A flexible model (neural network decoder) can capture anything but the integrals become intractable. The history of latent variable models is the history of navigating this trade-off.
Solutions to Intractability:
1. Exact Methods (Limited Cases)
2. Sampling Methods (MCMC)
3. Variational Inference
4. Importance Weighting
5. Neural Likelihood Surrogates
| Approach | Exactness | Scalability | Limitations |
|---|---|---|---|
| Analytical (conjugacy) | Exact | High | Only for specific model families |
| MCMC | Asymptotically exact | Low-Medium | Slow convergence, hard to diagnose |
| Variational Inference | Approximate | High | Approximation gap, mode-seeking |
| Importance Sampling | Unbiased estimate | Medium | High variance in high dimensions |
The latent space is the abstract space where latent variables $z$ live. Understanding its geometry is crucial for manipulating and interpreting generative models.
The Manifold Hypothesis:
High-dimensional data (images, text, audio) is believed to lie on or near low-dimensional manifolds. For example, a $256 \times 256$ face image lives in a roughly 200,000-dimensional pixel space, yet plausible faces vary along only a handful of factors—pose, lighting, expression, identity.
Latent variable models formalize this: $z$ parameterizes the manifold, and the decoder $p(x|z)$ maps from manifold coordinates to observations.
Dimensionality Reduction:
If the intrinsic dimensionality of data is $d$, we hope to learn latent spaces of dimension $d$ or slightly higher. This compresses the data, acts as a regularizer, and yields representations that are easier to visualize, interpret, and manipulate.
Latent Traversals:
A key test of whether a model has learned meaningful structure: traverse the latent space and observe how generations change.
In a well-trained model, small steps in latent space produce small, semantically smooth changes in the output, and interpolating between two latent codes passes through plausible intermediate generations.
If traversals produce random, discontinuous changes, the latent space lacks structure.
A 'disentangled' latent space has dimensions that correspond to independent, interpretable factors of variation. Changing one dimension changes one factor (e.g., pose) without affecting others (identity, lighting). Disentanglement is desirable but notoriously difficult to achieve and measure. Methods like β-VAE, FactorVAE, and TC-VAE attempt to encourage disentanglement.
Latent Arithmetic:
A remarkable property of well-trained latent spaces: semantic concepts become directions.
Famous example from word embeddings: $$\vec{v}_{\text{king}} - \vec{v}_{\text{man}} + \vec{v}_{\text{woman}} \approx \vec{v}_{\text{queen}}$$

Similar operations work in image latent spaces: $$\vec{v}_{\text{smiling face}} - \vec{v}_{\text{neutral face}} + \vec{v}_{\text{another neutral}} \approx \vec{v}_{\text{another smiling}}$$
This enables powerful editing: compute the 'smile direction' by averaging differences between smiling and neutral faces, then add this direction to any face to make it smile.
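A minimal sketch of this editing recipe on synthetic latents—the 'smile' attribute is constructed by hand along dimension 0 here, an assumption for illustration rather than a property of any trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic latent codes: we *construct* dimension 0 to encode 'smile'
# (an assumption for illustration, not a property of any real model).
d = 8
neutral = rng.normal(0, 1, (100, d))
smiling = neutral + np.array([2.0] + [0.0] * (d - 1))  # Smile = +2 on dim 0
smiling += rng.normal(0, 0.1, smiling.shape)           # Per-face variation

# Estimate the 'smile direction' by averaging paired differences
smile_dir = (smiling - neutral).mean(axis=0)
print("Recovered direction:", smile_dir.round(2))      # close to [2, 0, ..., 0]

# Apply it to a new face's latent code to 'make it smile'
new_face = rng.normal(0, 1, d)
edited = new_face + smile_dir
print("Edit moves mainly along dim 0:",
      abs((edited - new_face)[0]) > np.abs((edited - new_face)[1:]).max())
```

In a real model the pairs would be encodings of smiling and neutral faces, and the averaging cancels identity, lighting, and other factors so only the shared attribute direction survives.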
Holes in Latent Space:
Not all regions of latent space correspond to valid data. If the prior is Gaussian and the model isn't well-regularized, the decoder may only map properly from regions where training samples exist. This creates 'holes'—regions that produce garbage outputs.
VAEs address this by explicitly matching the posterior to the prior, ensuring the entire prior region is 'populated' with meaningful encodings.
```python
import numpy as np

np.random.seed(42)

# Demonstrate latent space concepts with a simple model

# True generative process: 2D latent -> 3D observed (embedded surface)
def true_decoder(z):
    """
    Map 2D latent to 3D: a 'swiss roll' surface.
    z[0] controls angle, z[1] controls height.
    """
    t = 1.5 * np.pi * (1 + 2 * z[0])  # Angle
    x = t * np.cos(t)
    y = 10 * z[1]                     # Height
    z_coord = t * np.sin(t)
    return np.array([x, y, z_coord])

# Generate data
n = 500
z_true = np.random.uniform(-0.5, 0.5, (n, 2))       # 2D latent
x_data = np.array([true_decoder(z) for z in z_true])
x_data += np.random.normal(0, 0.3, x_data.shape)    # Add noise

print("=== Latent Space Geometry Demonstration ===")
print("  Latent dimension: 2")
print("  Observed dimension: 3")
print("  Data lies on 2D manifold (swiss roll) in 3D")

# Latent traversal: move along one latent dimension
print("--- Latent Traversals ---")
z_base = np.array([0.0, 0.0])

print("  Traversing z[0] (angle dimension):")
for val in [-0.4, -0.2, 0.0, 0.2, 0.4]:
    z = z_base.copy()
    z[0] = val
    x = true_decoder(z)
    print(f"    z[0]={val:+.1f}: x = [{x[0]:+6.2f}, {x[1]:+5.2f}, {x[2]:+6.2f}]")

print("  Traversing z[1] (height dimension):")
for val in [-0.4, -0.2, 0.0, 0.2, 0.4]:
    z = z_base.copy()
    z[1] = val
    x = true_decoder(z)
    print(f"    z[1]={val:+.1f}: x = [{x[0]:+6.2f}, {x[1]:+5.2f}, {x[2]:+6.2f}]")

# Interpolation in latent space
print("--- Latent Interpolation ---")
z_start = np.array([-0.4, -0.4])
z_end = np.array([0.4, 0.4])

print(f"  Start: z={z_start} -> x={true_decoder(z_start).round(2)}")
print(f"  End:   z={z_end} -> x={true_decoder(z_end).round(2)}")
print("  Interpolation:")
for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
    z_interp = (1 - alpha) * z_start + alpha * z_end
    x_interp = true_decoder(z_interp)
    print(f"    α={alpha:.2f}: x = [{x_interp[0]:+6.2f}, "
          f"{x_interp[1]:+5.2f}, {x_interp[2]:+6.2f}]")

# Latent arithmetic (conceptual demonstration)
print("--- Latent Arithmetic (Conceptual) ---")
print("  Imagine latent dimensions correspond to:")
print("    z[0] = 'curvature position'")
print("    z[1] = 'height'")
print("  'Raise' operation: add [0, 0.5] to any z")
z_example = np.array([0.2, -0.3])
z_raised = z_example + np.array([0, 0.5])
print(f"  Original: z={z_example} -> x={true_decoder(z_example).round(2)}")
print(f"  Raised:   z={z_raised} -> x={true_decoder(z_raised).round(2)}")
print("  (Only y-coordinate changes significantly)")
```

Before deep learning, several classical latent variable models were developed. Understanding these provides essential intuition and reveals the lineage of modern approaches.
Principal Component Analysis (PCA)
PCA can be viewed as a probabilistic latent variable model:
$$z \sim \mathcal{N}(0, I_k)$$ $$x | z \sim \mathcal{N}(Wz + \mu, \sigma^2 I_d)$$
The posterior is Gaussian: $$p(z | x) = \mathcal{N}(M^{-1} W^T (x - \mu), \sigma^2 M^{-1})$$ where $M = W^T W + \sigma^2 I$.
As $\sigma^2 \to 0$, this recovers deterministic PCA.
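The posterior formula can be checked numerically against the equivalent joint-Gaussian expression $\mathbb{E}[z|x] = W^T (WW^T + \sigma^2 I)^{-1}(x - \mu)$, which follows from conditioning in the joint Gaussian. A small sketch with random parameters (the dimensions are chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Probabilistic PCA: z ~ N(0, I_k), x|z ~ N(Wz + mu, sigma^2 I_d)
d, k = 5, 2
W = rng.normal(size=(d, k))
mu = rng.normal(size=d)
sigma2 = 0.1

# Posterior for an observation x: mean M^{-1} W^T (x - mu), cov sigma^2 M^{-1},
# with M = W^T W + sigma^2 I
x = rng.normal(size=d)
M = W.T @ W + sigma2 * np.eye(k)
post_mean = np.linalg.solve(M, W.T @ (x - mu))
post_cov = sigma2 * np.linalg.inv(M)

print("Posterior mean:", post_mean.round(3))
print("Posterior cov:\n", post_cov.round(4))

# Sanity check via the joint Gaussian: cov(x) = W W^T + sigma^2 I,
# E[z|x] = W^T (W W^T + sigma^2 I)^{-1} (x - mu)
C = W @ W.T + sigma2 * np.eye(d)
alt_mean = W.T @ np.linalg.solve(C, x - mu)
print("Agrees with joint-Gaussian formula:", np.allclose(post_mean, alt_mean))
```

The two expressions are algebraically identical (since $W^T C = M W^T$), but the $k \times k$ system involving $M$ is far cheaper to solve than the $d \times d$ system when $k \ll d$.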
Factor Analysis
Generalization of probabilistic PCA with non-isotropic noise:
$$x | z \sim \mathcal{N}(Wz + \mu, \Psi)$$
where $\Psi$ is a diagonal covariance (each observed dimension has its own noise level).
Used in psychology and social sciences to model latent 'factors' (intelligence, personality traits) from observed measurements.
Gaussian Mixture Models (GMM)
Discrete latent variable (cluster indicator):
$$z \sim \text{Categorical}(\pi_1, \ldots, \pi_K)$$ $$x | z = k \sim \mathcal{N}(\mu_k, \Sigma_k)$$
The posterior is: $$p(z = k | x) = \frac{\pi_k \mathcal{N}(x | \mu_k, \Sigma_k)}{\sum_j \pi_j \mathcal{N}(x | \mu_j, \Sigma_j)}$$
This is the 'responsibility' of cluster $k$ for observation $x$.
The EM algorithm alternates between an E-step, computing the responsibilities $p(z = k \mid x_i)$ under the current parameters, and an M-step, re-estimating $\pi_k, \mu_k, \Sigma_k$ from the responsibility-weighted data.
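The E-step/M-step loop can be sketched for a 1D two-component mixture (the synthetic data and initialization are assumptions for illustration):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Synthetic 1D data from two Gaussian clusters (30% / 70% mix)
x = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(3, 1.0, 700)])

# Initialize parameters (deliberately rough)
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sig = np.array([1.0, 1.0])

for _ in range(50):
    # E-step: responsibilities r[i, k] = p(z = k | x_i)
    dens = np.stack([p * norm.pdf(x, m, s)
                     for p, m, s in zip(pi, mu, sig)], axis=1)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from responsibility-weighted data
    Nk = r.sum(axis=0)
    pi = Nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / Nk
    sig = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)

print("weights:", pi.round(2))   # expect roughly [0.3, 0.7]
print("means:  ", mu.round(2))   # expect roughly [-2, 3]
print("stds:   ", sig.round(2))
```

Each iteration provably does not decrease the marginal log-likelihood, which is why EM converges reliably for well-separated mixtures like this one.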
Hidden Markov Models (HMM)
Latent variable model for sequences:
$$z_1 \sim \pi, \quad z_t | z_{t-1} \sim A_{z_{t-1}}$$ $$x_t | z_t \sim p(x | z_t)$$
The latent variables $z_t$ form a Markov chain; observations $x_t$ depend only on the current state.
Inference (forward-backward algorithm) and learning (Baum-Welch EM) are tractable due to the chain structure.
HMMs dominated speech recognition for decades before deep learning.
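The tractability comes from the forward recursion $\alpha_t(j) = \big(\sum_i \alpha_{t-1}(i)\, A_{ij}\big)\, p(x_t \mid z_t = j)$, which computes $p(x_{1:T})$ in $O(TK^2)$ instead of summing over all $K^T$ state paths. A tiny sketch with hypothetical parameters, checked against brute-force enumeration:

```python
import numpy as np
from itertools import product

# A tiny discrete-emission HMM (parameters are illustrative assumptions)
pi = np.array([0.6, 0.4])          # Initial state distribution
A = np.array([[0.7, 0.3],          # A[i, j] = p(z_t = j | z_{t-1} = i)
              [0.2, 0.8]])
B = np.array([[0.9, 0.1],          # B[i, o] = p(x_t = o | z_t = i)
              [0.3, 0.7]])
obs = [0, 1, 1, 0]                 # Observed sequence

# Forward algorithm: alpha[i] = p(x_1..x_t, z_t = i)
alpha = pi * B[:, obs[0]]
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]
likelihood = alpha.sum()           # p(x_1..x_T), latent chain marginalized out
print(f"forward:     p(x) = {likelihood:.6f}")

# Brute-force check: enumerate all 2^T state paths
total = 0.0
for path in product([0, 1], repeat=len(obs)):
    p = pi[path[0]] * B[path[0], obs[0]]
    for t in range(1, len(obs)):
        p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
    total += p
print(f"brute force: p(x) = {total:.6f}")
```

The two numbers agree exactly; the forward pass simply reorganizes the same sum using the Markov structure.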
PCA and Factor Analysis are linear models—the decoder is a linear transformation. This enables analytical solutions but limits expressiveness. GMMs are nonlinear (mixture of modes) but have discrete latents. The key innovation of VAEs was combining continuous latents with nonlinear (neural network) decoders, requiring new approximate inference techniques.
| Model | Latent Type | Decoder | Inference | Use Cases |
|---|---|---|---|---|
| PCA | Continuous (Gaussian) | Linear | Closed-form | Dimensionality reduction, visualization |
| Factor Analysis | Continuous (Gaussian) | Linear + noise per dim | Closed-form | Psychology, latent trait modeling |
| GMM | Discrete (Categorical) | Gaussian per cluster | Enumeration | Clustering, density estimation |
| HMM | Discrete (Sequential) | Emission per state | Forward-backward | Speech recognition, sequences |
The deep learning revolution enabled deep latent variable models—models with neural network decoders (and often encoders) that can capture arbitrarily complex relationships between latents and observations.
The Setup:
$$z \sim p(z) = \mathcal{N}(0, I)$$ $$x | z \sim p_\theta(x | z) = \mathcal{N}(f_\theta(z), \sigma^2 I)$$
where $f_\theta$ is a neural network (the decoder).
The Challenge:
With nonlinear $f_\theta$, the marginal and posterior become intractable:
$$p(x) = \int p(z) p_\theta(x|z) dz \quad \text{(no closed form)}$$ $$p(z|x) = \frac{p_\theta(x|z) p(z)}{p(x)} \quad \text{(requires intractable } p(x) \text{)}$$
The Variational Solution (VAE):
Introduce an encoder $q_\phi(z|x)$—a neural network that approximates the posterior, typically as a Gaussian $q_\phi(z|x) = \mathcal{N}(\mu_\phi(x), \text{diag}(\sigma^2_\phi(x)))$ whose parameters are output by the network.
Optimize the Evidence Lower Bound (ELBO):
$$\log p(x) \geq \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) \,\|\, p(z))$$
The first term encourages accurate reconstruction; the second regularizes the posterior to match the prior.
Classical variational inference optimizes a separate q for each observation—expensive! VAEs use 'amortized inference': the encoder q_φ(z|x) shares parameters across all x. One forward pass gives the approximate posterior. This enables efficient inference at test time and end-to-end training.
Why Deep Latent Models Are Powerful:
Expressiveness: Neural network decoders can model arbitrarily complex conditionals $p(x|z)$.
Scalability: Amortized inference enables processing millions of data points.
Representation Learning: The latent space often captures meaningful structure useful for downstream tasks.
Generative Capabilities: Sample $z \sim p(z)$, decode to $x$—create new data.
The VAE Trade-off:
VAEs optimize a lower bound, not the true likelihood. The gap:
$$\log p(x) - \text{ELBO} = D_{KL}(q_\phi(z|x) \,\|\, p(z|x)) \geq 0$$
This gap is zero only when $q = p(z|x)$—when the approximate posterior is exact. In practice, the approximation gap can be significant, especially when $q$ is restricted (e.g., Gaussian with diagonal covariance).
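The gap can be computed exactly in a toy conjugate model where the true posterior is known. A sketch, with the model ($z \sim \mathcal{N}(0,1)$, $x|z \sim \mathcal{N}(z,1)$, so the posterior is $\mathcal{N}(x/2, 1/2)$) assumed purely for illustration:

```python
import numpy as np
from scipy.stats import norm

def kl_gauss(m_q, v_q, m_p, v_p):
    """KL(N(m_q, v_q) || N(m_p, v_p)) for scalar Gaussians."""
    return 0.5 * (v_q / v_p + (m_q - m_p) ** 2 / v_p - 1 + np.log(v_p / v_q))

# Toy model: z ~ N(0,1), x|z ~ N(z,1) => marginal N(0,2), posterior N(x/2, 1/2)
x = 1.5
log_px = norm.logpdf(x, 0, np.sqrt(2))
m_post, v_post = x / 2, 0.5

# A restricted q with the right mean but a unit variance it cannot shrink
m_q, v_q = x / 2, 1.0
gap = kl_gauss(m_q, v_q, m_post, v_post)
elbo = log_px - gap                  # ELBO = log p(x) - KL(q || posterior)
print(f"log p(x) = {log_px:.4f}, ELBO = {elbo:.4f}, gap = {gap:.4f}")

# Monte Carlo ELBO agrees: E_q[log p(z) + log p(x|z) - log q(z)]
rng = np.random.default_rng(0)
z = rng.normal(m_q, np.sqrt(v_q), 200_000)
elbo_mc = np.mean(norm.logpdf(z, 0, 1) + norm.logpdf(x, z, 1)
                  - norm.logpdf(z, m_q, np.sqrt(v_q)))
print(f"Monte Carlo ELBO = {elbo_mc:.4f}")
```

Even with the posterior mean exactly right, the mismatched variance leaves a strictly positive gap—a miniature of what happens when a diagonal-Gaussian $q$ faces a correlated true posterior.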
Extensions include β-VAE (reweighting the KL term to encourage disentanglement), VQ-VAE (discrete latent codes), importance-weighted bounds (IWAE), and hierarchical models with multiple layers of latents.
```python
import numpy as np

# Conceptual demonstration of VAE-style latent variable model
np.random.seed(42)

# Simplified VAE for 1D data

# "Decoder": neural network f_theta(z) -> x
def decoder(z, theta):
    """Simple MLP: z -> hidden -> x"""
    W1, b1, W2, b2 = theta
    h = np.tanh(W1 * z + b1)
    x_mean = W2 * h + b2
    return x_mean

# "Encoder": neural network x -> (mu_z, sigma_z)
def encoder(x, phi):
    """Simple MLP: x -> (mu, log_sigma)"""
    W1, b1, W_mu, W_logsig = phi
    h = np.tanh(W1 * x + b1)
    mu = W_mu * h
    log_sigma = W_logsig * h
    return mu, np.exp(log_sigma)

# Initialize random parameters
theta = (np.random.randn(), np.random.randn(),
         np.random.randn(), np.random.randn())  # Decoder
phi = (np.random.randn(), np.random.randn(),
       np.random.randn(), np.random.randn())    # Encoder

print("=== Deep Latent Variable Model (VAE Concept) ===")
print("  Prior:   p(z) = N(0, 1)")
print("  Decoder: p(x|z) = N(f_θ(z), σ²)")
print("  Encoder: q(z|x) = N(μ_φ(x), σ²_φ(x))")

# Example: encode a data point
x_example = 2.5
mu_z, sigma_z = encoder(x_example, phi)
print(f"  Encoding x = {x_example}:")
print(f"    q(z|x) = N({mu_z:.3f}, {sigma_z:.3f}²)")

# Sample from approximate posterior
z_samples = np.random.normal(mu_z, sigma_z, 5)
print(f"  z samples: {z_samples.round(3)}")

# Decode samples
x_reconstructions = [decoder(z, theta) for z in z_samples]
print(f"  x reconstructions: {np.array(x_reconstructions).round(3)}")

# Generate new data by sampling the prior
print("  Generating from prior:")
z_prior_samples = np.random.normal(0, 1, 5)
x_generated = [decoder(z, theta) for z in z_prior_samples]
print(f"    z ~ N(0,1): {z_prior_samples.round(3)}")
print(f"    Generated x: {np.array(x_generated).round(3)}")

# ELBO components
print("=== ELBO Components ===")
print("  ELBO = E_q[log p(x|z)] - D_KL(q(z|x) || p(z))")
print("       = Reconstruction - Regularization")
print("  Reconstruction: encourages decoder to reconstruct x from z")
print("  Regularization: encourages posterior to match prior")

# KL divergence for Gaussian q vs standard Gaussian prior
def kl_divergence(mu, sigma):
    """KL(N(mu, sigma^2) || N(0, 1))"""
    return 0.5 * (sigma**2 + mu**2 - 1 - 2 * np.log(sigma))

kl = kl_divergence(mu_z, sigma_z)
print(f"  For x = {x_example}:")
print(f"    KL(q(z|x) || p(z)) = {kl:.4f}")
```

Computing or approximating the posterior $p(z|x)$ is called inference. Different approaches trade off accuracy, computation, and generality.
Exact Inference:
Possible only for specific model families: conjugate prior-likelihood pairs, linear-Gaussian models (probabilistic PCA, factor analysis), and discrete latents that can be enumerated (GMMs) or handled by dynamic programming (HMMs).
For deep generative models, exact inference is essentially never possible.
Variational Inference (VI):
Approximate $p(z|x)$ with a parameterized family $q_\phi(z|x)$, optimizing:
$$\phi^* = \arg\min_\phi D_{KL}(q_\phi(z|x) \,\|\, p(z|x))$$
Equivalently, maximize the ELBO:
$$\mathcal{L}(\phi) = \mathbb{E}_{q_\phi(z|x)}[\log p(x|z)] - D_{KL}(q_\phi(z|x) \,\|\, p(z))$$
VI characteristics: fast and scalable, amenable to stochastic gradient optimization, but biased—the KL objective is mode-seeking, and the gap to the true posterior never fully closes for a restricted family.
MCMC Inference:
Construct a Markov chain with stationary distribution $p(z|x)$—e.g., Metropolis-Hastings, Gibbs sampling, or Hamiltonian Monte Carlo. Run long enough, it yields asymptotically exact posterior samples, but convergence can be slow and hard to diagnose.
Hybrid Approaches:
Modern methods often combine VI and MCMC:
Importance-Weighted Inference:
Importance-weighted autoencoders (IWAE) use multiple samples:
$$\log p(x) \geq \mathbb{E}_{z_1, \ldots, z_K \sim q}\left[\log \frac{1}{K} \sum_{k=1}^K \frac{p(x, z_k)}{q(z_k|x)}\right]$$
Tighter bound than ELBO; approaches $\log p(x)$ as $K \to \infty$. But gradient variance increases with $K$.
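The tightening with $K$ can be observed in a toy model where $\log p(x)$ is known exactly, using a deliberately crude proposal ($q$ set to the prior—an assumption for illustration):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy model with known marginal: z ~ N(0,1), x|z ~ N(z,1) => p(x) = N(0, 2)
x = 1.5
log_px = norm.logpdf(x, 0, np.sqrt(2))

def iwae_bound(K, n_rep=2000):
    """Average IWAE bound: E[log (1/K) sum_k p(x, z_k) / q(z_k|x)],
    with the (deliberately crude) proposal q(z|x) = p(z) = N(0,1)."""
    z = rng.normal(0, 1, (n_rep, K))
    log_w = (norm.logpdf(z, 0, 1)        # log p(z)
             + norm.logpdf(x, z, 1)      # + log p(x|z)
             - norm.logpdf(z, 0, 1))     # - log q(z|x)
    return np.mean(np.log(np.mean(np.exp(log_w), axis=1)))

for K in [1, 5, 50, 500]:
    print(f"K={K:>3}: bound = {iwae_bound(K):.4f}   (log p(x) = {log_px:.4f})")
```

At $K=1$ this is the ordinary ELBO under $q$; as $K$ grows the bound climbs toward $\log p(x)$ even though the proposal never improves—the extra samples compensate for its mismatch.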
For VAEs, there are two sources of error: (1) the amortization gap—the encoder may not perfectly approximate the optimal q for each x, and (2) the approximation gap—even the optimal q in the family may differ from the true posterior. Semi-amortized methods address (1) by refining encoder outputs with optimization steps.
| Method | Exactness | Speed | Gradient-Friendly | Best For |
|---|---|---|---|---|
| Exact (conjugacy) | Exact | Fast | Yes | Simple models (PCA, GMM) |
| Mean-field VI | Approximate | Fast | Yes | Factorized posteriors |
| Amortized VI (VAE) | Approximate | Very fast | Yes | Deep generative models |
| MCMC | Asymptotically exact | Slow | Partially | Accurate posteriors |
| IWAE | Tighter bound | Medium | Yes | Better likelihood estimates |
Beyond generation, latent variable models are powerful tools for representation learning—learning useful features for downstream tasks.
Why Latent Representations?
The latent encoding $z$ of an observation $x$ can serve as a feature vector: it is low-dimensional, denoised, and often captures semantic structure useful for clustering, retrieval, and downstream classification.
Self-Supervised Learning Connection:
Latent variable models are fundamentally self-supervised: the training signal comes from the data itself—reconstructing or explaining $x$—with no labels required.
This connects to contrastive learning, masked modeling, and other self-supervised paradigms.
Evaluating Representations:
How do we know if learned representations are good?
Linear probing: Train a linear classifier on frozen representations. High accuracy = representations capture relevant structure.
Transfer learning: Use representations for downstream tasks. Good representations transfer broadly.
Latent space visualization: Do similar points cluster together? Are trajectories meaningful?
Disentanglement metrics: Do dimensions correspond to independent factors?
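Linear probing can be sketched end-to-end with synthetic 'frozen' features—the Gaussian-blob construction below stands in for encoder outputs, and a least-squares classifier stands in for the usual logistic probe (both are simplifying assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic frozen representations: one latent dimension carries class info
d, n = 16, 400
y = rng.integers(0, 2, n)
reps = rng.normal(0, 1, (n, d))
reps[:, 0] += 3.0 * y              # Class shifts dimension 0 by +3

# Linear probe on the frozen features (least-squares fit to +/-1 labels)
X = np.hstack([reps, np.ones((n, 1))])            # Add bias column
w = np.linalg.lstsq(X, 2.0 * y - 1.0, rcond=None)[0]
pred = (X @ w > 0).astype(int)
acc = (pred == y).mean()
print(f"Linear probe accuracy: {acc:.2f}")
```

High probe accuracy here simply reflects that the class signal is linearly accessible in the representation; with real encoder features, the same test distinguishes representations that expose structure from ones that bury it in nonlinear entanglements.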
Models optimized purely for generation may not learn the best representations. The VAE objective balances reconstruction and KL regularization—but this balance doesn't directly optimize for downstream task performance. Methods like VQ-VAE, contrastive learning, and self-distillation often produce better representations for classification and detection tasks.
Representation Learning Applications: semi-supervised learning, anomaly detection, compression, retrieval, and transfer learning all benefit from learned latent features.
The Foundation Model Paradigm:
Modern 'foundation models' (GPT, CLIP, DALL-E) can be viewed through this lens:
The latent variable framework helps us understand why this works: learning to generate forces the model to capture structure useful for many tasks.
Latent variables provide a powerful framework for understanding and building generative models. By positing hidden structure that explains observable data, we can create expressive models, learn meaningful representations, and enable controlled generation.
What's next:
We've established the foundation of generative models: the generative-discriminative distinction, density estimation, sampling, and latent variables. The final page of this module addresses a critical practical challenge: how do we evaluate generative models? Unlike classification, where accuracy is clear, evaluating generative quality is notoriously difficult. We'll explore the landscape of evaluation metrics and their limitations.
You now have a deep understanding of latent variable models—the conceptual and mathematical framework underlying VAEs, factor models, and modern representation learning. This foundation will illuminate the specific architectures (VAEs, GANs, diffusion models) we study in subsequent modules.