Standard variational inference faces a fundamental limitation: the expressiveness of the variational family directly bounds the quality of the posterior approximation. When we use simple distributions like factorized Gaussians, we inherently accept that our approximation cannot capture complex dependencies, multimodality, or heavy tails present in the true posterior.
Normalizing flows offer an elegant solution to this expressiveness problem. By applying a sequence of invertible transformations to a simple base distribution, flows can represent arbitrarily complex distributions while maintaining tractable density evaluation and sampling. This represents one of the most important advances in variational inference of the past decade.
By the end of this page, you will understand how normalizing flows work mathematically, master the change of variables formula and its computational implications, learn about different flow architectures (planar, radial, autoregressive, coupling), and understand how flows integrate with variational inference to create powerful approximate posteriors.
Before diving into normalizing flows, let's understand precisely why we need them. In standard variational inference, we approximate an intractable posterior p(z|x) with a tractable distribution q(z) from a variational family Q. The quality of our approximation is limited by the expressiveness of Q.
The Fundamental Tradeoff:
This creates a dilemma: we need distributions that are both expressive enough to capture complex posteriors and tractable enough for practical optimization.
| Variational Family | Expressiveness | Key Limitation | Approximation Gap |
|---|---|---|---|
| Factorized Gaussian | Low | Cannot capture correlations | Underestimates variance, misses dependencies |
| Full-covariance Gaussian | Medium | O(d²) parameters, unimodal only | Cannot represent multimodality or heavy tails |
| Mixture of Gaussians | High | Discrete optimization, mode collapse | Training instability, expensive density evaluation |
| Normalizing Flows | Arbitrary | Computational cost of transformations | Controllable via flow depth |
Mathematical Formalization of the Gap:
The optimal variational distribution minimizes the KL divergence:
$$q^*(z) = \arg\min_{q \in \mathcal{Q}} D_{KL}(q(z) \,\|\, p(z|x))$$
Even at optimality, we have a residual approximation gap:
$$\text{Gap} = D_{KL}(q^*(z) \,\|\, p(z|x)) \geq 0$$
This gap is zero if and only if the true posterior lies within our variational family. For most real problems, simple families exclude the true posterior entirely, guaranteeing a non-zero gap regardless of optimization quality.
Normalizing flows address this by making our variational family rich enough to include (or closely approximate) virtually any smooth posterior.
Mean-field approximations can dramatically underestimate posterior variance, especially in high dimensions. For a posterior with correlation ρ between variables, mean-field effectively assumes ρ = 0, leading to overconfident predictions. In Bayesian neural networks, this can cause catastrophic miscalibration.
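To make this concrete, the short sketch below uses the closed-form result for reverse-KL mean-field approximations of a multivariate Gaussian: the optimal factorized Gaussian matches the precision diagonal, so each marginal variance is 1/Λᵢᵢ rather than Σᵢᵢ. For a 2D posterior with correlation ρ = 0.9, mean-field recovers only 19% of the true marginal variance.

```python
import torch

# Mean-field (reverse-KL) approximation of a correlated 2D Gaussian.
# For p = N(0, Sigma), the optimal factorized Gaussian has variance
# 1 / Lambda_ii where Lambda = Sigma^{-1} -- strictly smaller than the
# true marginal variance whenever correlations are present.
rho = 0.9
Sigma = torch.tensor([[1.0, rho], [rho, 1.0]])
Lambda = torch.linalg.inv(Sigma)

true_marginal_var = torch.diag(Sigma)      # [1.0, 1.0]
mean_field_var = 1.0 / torch.diag(Lambda)  # [1 - rho^2, 1 - rho^2] = [0.19, 0.19]

print(f"true marginal variances: {true_marginal_var.tolist()}")
print(f"mean-field variances:    {mean_field_var.tolist()}")
```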
The mathematical foundation of normalizing flows is the change of variables formula from probability theory. This fundamental result tells us how probability densities transform under invertible mappings.
The Core Result:
Let z₀ be a random variable with density q₀(z₀), and let f: ℝᵈ → ℝᵈ be an invertible, differentiable function. Define z = f(z₀). Then the density of z is:
$$q(z) = q_0(f^{-1}(z)) \left| \det \frac{\partial f^{-1}}{\partial z} \right| = q_0(z_0) \left| \det \frac{\partial f}{\partial z_0} \right|^{-1}$$
The term |det ∂f/∂z₀| is the absolute value of the Jacobian determinant, measuring how f locally stretches or compresses volume.
Think of probability as a fluid. When we apply a transformation f, regions of space expand or contract. If f doubles the volume in some region, the density must halve to preserve total probability mass. The Jacobian determinant quantifies this volume change precisely.
Log-Density Form (Computationally Preferred):
For numerical stability and gradient computation, we work with log-densities:
$$\log q(z) = \log q_0(z_0) - \log \left| \det \frac{\partial f}{\partial z_0} \right|$$
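As a quick sanity check of this formula, the sketch below pushes a standard normal through the affine map f(z₀) = a·z₀ + b (a hypothetical toy transform) and confirms that the log-density obtained from the change of variables matches the exact density of the resulting Gaussian N(b, a²).

```python
import torch
from torch.distributions import Normal

# Numerical check of the log-density form of the change of variables:
# for the affine map f(z0) = a * z0 + b, log q(z) = log q0(z0) - log|a|.
a, b = 2.0, 1.0
base = Normal(0.0, 1.0)

z0 = torch.tensor(0.3)
z = a * z0 + b  # transformed sample
log_q = base.log_prob(z0) - torch.log(torch.tensor(abs(a)))

# Compare against the exact density of N(b, a^2), which is what f produces.
log_q_exact = Normal(b, abs(a)).log_prob(z)
print(torch.allclose(log_q, log_q_exact))  # True
```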
Composition of Flows:
The power of flows comes from composing multiple simple transformations. If we apply a sequence f = fₖ ∘ fₖ₋₁ ∘ ... ∘ f₁, the log-density becomes:
$$\log q_K(z_K) = \log q_0(z_0) - \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right|$$
Each simple transformation adds expressiveness, and by stacking many layers, we can represent arbitrarily complex distributions.
```python
import torch
import torch.nn as nn


class NormalizingFlow(nn.Module):
    """
    Base class for normalizing flows.
    A flow transforms a simple base distribution into a complex target distribution.
    """
    def __init__(self, base_distribution, transforms):
        super().__init__()
        self.base_dist = base_distribution
        self.transforms = nn.ModuleList(transforms)

    def forward(self, z0):
        """
        Transform samples from the base distribution.
        Returns: (zK, log_det_jacobian)
        """
        z = z0
        log_det_sum = 0.0
        for transform in self.transforms:
            z, log_det = transform(z)
            log_det_sum += log_det
        return z, log_det_sum

    def log_prob(self, zK):
        """
        Compute log probability of samples under the flow.
        Requires inverting all transformations.
        """
        z = zK
        log_det_sum = 0.0
        # Apply inverse transforms in reverse order
        for transform in reversed(self.transforms):
            z, log_det = transform.inverse(z)
            log_det_sum += log_det
        # z is now z0, compute the base density
        log_prob_base = self.base_dist.log_prob(z).sum(-1)
        # Each inverse returns its own log-determinant, so the terms are added
        return log_prob_base + log_det_sum

    def sample(self, num_samples):
        """Sample from the flow distribution."""
        z0 = self.base_dist.sample((num_samples,))
        zK, _ = self.forward(z0)
        return zK
```

The earliest normalizing flows introduced in the VI literature are planar and radial flows. While relatively simple, they illustrate the core concepts and remain useful for understanding flow design.
Planar Flows:
A planar flow applies a transformation along a hyperplane:
$$f(z) = z + u \cdot h(w^T z + b)$$
where u, w ∈ ℝᵈ, b ∈ ℝ, and h is a smooth activation function (typically tanh).
The Jacobian determinant has a simple form:
$$\det \frac{\partial f}{\partial z} = 1 + u^T \psi(z), \quad \text{where } \psi(z) = h'(w^T z + b) \cdot w$$
Invertibility Constraint: For the transformation to be invertible, we require 1 + uᵀψ(z) > 0 for all z. This is enforced through a reparameterization of u.
Radial Flows:
Radial flows apply transformations that are radially symmetric around a reference point:
$$f(z) = z + \beta h(\alpha, r)(z - z_0), \quad r = \|z - z_0\|$$
where z₀ is the center, α > 0 controls the smoothness, and β controls the magnitude. A common choice is:
$$h(\alpha, r) = \frac{1}{\alpha + r}$$
Radial flows can either contract or expand regions of space around z₀, depending on the sign of β.
Computational Complexity:
Both planar and radial flows have O(d) complexity for the Jacobian determinant, making them efficient for high-dimensional problems. However, their expressiveness is limited—each transformation can only "bend" the density in one direction (planar) or around one point (radial).
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PlanarFlow(nn.Module):
    """
    Planar flow: f(z) = z + u * h(w^T z + b)
    Efficient O(d) Jacobian determinant computation.
    """
    def __init__(self, dim):
        super().__init__()
        self.dim = dim
        self.w = nn.Parameter(torch.randn(dim))
        self.u = nn.Parameter(torch.randn(dim))
        self.b = nn.Parameter(torch.zeros(1))

    def _get_u_hat(self):
        """Reparameterize u to ensure invertibility: 1 + u^T psi(z) > 0."""
        wTu = torch.dot(self.w, self.u)
        m_wTu = -1 + F.softplus(wTu)
        u_hat = self.u + (m_wTu - wTu) * self.w / (self.w.norm() ** 2 + 1e-8)
        return u_hat

    def forward(self, z):
        """Apply the planar transformation."""
        u_hat = self._get_u_hat()
        # Compute w^T z + b
        linear = F.linear(z, self.w.unsqueeze(0), self.b)
        # Apply transformation: z + u * tanh(w^T z + b)
        h = torch.tanh(linear)
        z_new = z + u_hat * h
        # Compute log |det Jacobian|
        # det = 1 + u^T * h'(w^T z + b) * w = 1 + u^T * psi(z)
        h_prime = 1 - h ** 2  # derivative of tanh
        psi = h_prime * self.w
        log_det = torch.log(torch.abs(1 + psi @ u_hat) + 1e-8)
        return z_new, log_det


class RadialFlow(nn.Module):
    """
    Radial flow: f(z) = z + beta * h(alpha, r) * (z - z0)
    Contracts or expands around a center point.
    """
    def __init__(self, dim):
        super().__init__()
        self.dim = dim
        self.z0 = nn.Parameter(torch.zeros(dim))
        self.log_alpha = nn.Parameter(torch.zeros(1))
        self.beta = nn.Parameter(torch.zeros(1))

    def forward(self, z):
        """Apply the radial transformation."""
        alpha = F.softplus(self.log_alpha)
        # Compute distance from the center
        diff = z - self.z0
        r = diff.norm(dim=-1, keepdim=True) + 1e-8
        # h(alpha, r) = 1 / (alpha + r)
        h = 1 / (alpha + r)
        h_prime = -1 / (alpha + r) ** 2
        # Transform
        z_new = z + self.beta * h * diff
        # Log determinant (derived from radial symmetry)
        d = self.dim
        log_det = (d - 1) * torch.log(torch.abs(1 + self.beta * h)) + \
                  torch.log(torch.abs(1 + self.beta * h + self.beta * h_prime * r))
        return z_new, log_det.squeeze(-1)
```

While individual planar or radial flows have limited expressiveness, stacking many of them (typically 10-100 layers) can represent highly complex distributions. Universal approximation results for flows show that, with enough layers, smooth densities can be approximated arbitrarily well.
Autoregressive flows represent a major advancement in flow design, achieving both high expressiveness and efficient computation. They exploit the autoregressive structure to ensure the Jacobian is triangular, enabling O(d) determinant computation.
Key Insight:
An autoregressive transformation defines each output dimension as a function of the corresponding input dimension and the input dimensions that precede it:
$$z_i = f_i(z_0^{(1)}, z_0^{(2)}, ..., z_0^{(i)})$$
This constraint makes the Jacobian matrix triangular, and the determinant is simply the product of diagonal elements:
$$\det \frac{\partial z}{\partial z_0} = \prod_{i=1}^{d} \frac{\partial z_i}{\partial z_0^{(i)}}$$
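The following small sketch makes the triangular structure visible: it builds a toy autoregressive map (an illustrative choice, not a trained flow), computes the full Jacobian with torch.autograd.functional.jacobian, and confirms that the determinant equals the product of the diagonal entries.

```python
import torch
from torch.autograd.functional import jacobian

# Sanity check that an autoregressive map has a triangular Jacobian whose
# determinant is the product of its diagonal. The toy map below is an
# illustrative example, not a trained flow:
#   z_1 = 2*z0_1,  z_2 = z0_2 + z0_1**2,  z_3 = 3*z0_3 + z0_1*z0_2
def autoregressive_map(z0):
    return torch.stack([
        2.0 * z0[0],
        z0[1] + z0[0] ** 2,
        3.0 * z0[2] + z0[0] * z0[1],
    ])

z0 = torch.tensor([0.5, -1.0, 2.0])
J = jacobian(autoregressive_map, z0)

print(J)                                        # lower-triangular
print(torch.det(J), torch.prod(torch.diag(J)))  # both equal 6.0
```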
Affine Autoregressive Flows (MADE-style):
The most common autoregressive flow uses affine transformations:
$$z_i = \mu_i(z_0^{<i}) + \sigma_i(z_0^{<i}) \cdot z_0^{(i)}$$
where μᵢ and σᵢ are neural networks that take the first i-1 dimensions as input. The log-determinant is:
$$\log |\det J| = \sum_{i=1}^{d} \log |\sigma_i(z_0^{<i})|$$
Inverse Autoregressive Flow (IAF):
IAF keeps this parameterization: the conditioner reads the base-noise dimensions z₀^{<i}, which makes sampling efficient (parallel) but density evaluation sequential:
$$z_i = \mu_i(z_0^{<i}) + \sigma_i(z_0^{<i}) \cdot z_0^{(i)}$$
Masked Autoregressive Flow (MAF):
MAF reverses the dependency direction: the conditioner reads the already-transformed variables z^{<i}, making density evaluation efficient but sampling sequential. In the normalizing (density-evaluation) direction:
$$z_0^{(i)} = \left(z_i - \mu_i(z^{<i})\right) / \sigma_i(z^{<i})$$
| Flow Type | Sampling Complexity | Density Evaluation | Best Use Case |
|---|---|---|---|
| Inverse Autoregressive (IAF) | O(d) parallel | O(d²) sequential | Variational inference (need fast sampling) |
| Masked Autoregressive (MAF) | O(d²) sequential | O(d) parallel | Density estimation (need fast evaluation) |
| Neural Spline Flow | O(d) parallel | O(d) parallel | Both (using coupling layers) |
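For contrast with the MAF implementation that follows, here is a minimal IAF-style layer. It is a sketch, not a standalone module: it reuses the MADE conditioner defined in the next code block, and conditions on the base noise z₀, which is why sampling is a single parallel pass.

```python
import torch
import torch.nn as nn


class IAFLayer(nn.Module):
    """
    Minimal IAF layer (sketch). Relies on the MADE conditioner defined in the
    next code block; the conditioner reads the base noise z0, so the sampling
    direction below is a single parallel pass.
    """
    def __init__(self, dim, hidden_dim=64):
        super().__init__()
        self.made = MADE(dim, [hidden_dim, hidden_dim], output_dim_per_input=2)

    def forward(self, z0):
        # mu_i and log_sigma_i depend only on z0_{<i} thanks to the masking
        params = self.made(z0)
        mu, log_sigma = params[..., 0], params[..., 1]
        # IAF transform: z_i = mu_i(z0_{<i}) + sigma_i(z0_{<i}) * z0_i
        z = mu + torch.exp(log_sigma) * z0
        # log |det dz/dz0| = sum_i log sigma_i
        log_det = log_sigma.sum(dim=-1)
        return z, log_det
```

Evaluating the density of an externally given z under this layer would require recovering z₀ dimension by dimension, which is the sequential pass reflected in the table above.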
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MADE(nn.Module):
    """
    Masked Autoencoder for Distribution Estimation.
    Produces autoregressive outputs where output_i depends only on input_{<i}.
    """
    def __init__(self, input_dim, hidden_dims, output_dim_per_input=2):
        super().__init__()
        self.input_dim = input_dim
        self.output_per_input = output_dim_per_input

        # Build masked linear layers
        dims = [input_dim] + hidden_dims + [input_dim * output_dim_per_input]
        self.layers = nn.ModuleList(
            [nn.Linear(dims[i], dims[i + 1]) for i in range(len(dims) - 1)]
        )
        # Register masks as buffers so they follow the module across devices
        for i, mask in enumerate(self._create_masks(hidden_dims)):
            self.register_buffer(f"mask_{i}", mask)

    def _create_masks(self, hidden_dims):
        """Create masks ensuring the autoregressive property."""
        # Assign each unit a degree (the highest input index it may depend on)
        masks = []
        prev_degrees = torch.arange(self.input_dim)
        for h_dim in hidden_dims:
            # Hidden units may see inputs with degree <= their own degree
            degrees = torch.randint(0, self.input_dim - 1, (h_dim,))
            masks.append((prev_degrees.unsqueeze(1) <= degrees.unsqueeze(0)).float())
            prev_degrees = degrees
        # Output layer: outputs for dimension i may only see degrees < i
        output_degrees = torch.arange(self.input_dim).repeat_interleave(self.output_per_input)
        masks.append((prev_degrees.unsqueeze(1) < output_degrees.unsqueeze(0)).float())
        return masks

    def forward(self, x):
        """Forward pass with masked connections."""
        h = x
        for i, layer in enumerate(self.layers[:-1]):
            mask = getattr(self, f"mask_{i}")
            h = torch.relu(F.linear(h, layer.weight * mask.T, layer.bias))
        mask = getattr(self, f"mask_{len(self.layers) - 1}")
        out = F.linear(h, self.layers[-1].weight * mask.T, self.layers[-1].bias)
        # Reshape to (batch, input_dim, output_per_input)
        return out.view(-1, self.input_dim, self.output_per_input)


class MAF(nn.Module):
    """
    Masked Autoregressive Flow.
    Fast density evaluation, slow sampling.
    """
    def __init__(self, dim, num_layers=4, hidden_dim=64):
        super().__init__()
        self.dim = dim
        self.layers = nn.ModuleList([
            MADE(dim, [hidden_dim, hidden_dim], output_dim_per_input=2)
            for _ in range(num_layers)
        ])

    def forward(self, z):
        """
        Map samples to base noise (the normalizing direction).
        This is MAF's fast, parallel direction; the returned summed
        log-determinant is what density evaluation needs.
        """
        log_det_sum = 0.0
        for made in self.layers:
            # Get autoregressive parameters conditioned on z_{<i}
            params = made(z)
            mu, log_sigma = params[..., 0], params[..., 1]
            # Normalizing transform: z0_i = (z_i - mu_i) / exp(log_sigma_i)
            z = (z - mu) * torch.exp(-log_sigma)
            log_det_sum -= log_sigma.sum(dim=-1)
            # Reverse dimension order so each layer conditions on a different prefix
            z = z.flip(dims=[-1])
        return z, log_det_sum
```

Coupling flows achieve both efficient sampling and efficient density evaluation by splitting dimensions into two groups and transforming one group conditioned on the other.
Affine Coupling Layer:
Split the input z into two parts: z = [z₁, z₂]. The coupling transformation is:
$$z_1' = z_1, \qquad z_2' = z_2 \odot \exp(s(z_1)) + t(z_1)$$
where s and t are arbitrary neural networks (scale and translation). The key insight: because z₁ passes through unchanged, the Jacobian is triangular and the determinant is simply:
$$\log |\det J| = \sum_j s_j(z_1)$$
RealNVP (Real-valued Non-Volume Preserving):
RealNVP introduced affine coupling layers and demonstrated their effectiveness for image modeling. Its key innovations include alternating masking patterns (checkerboard and channel-wise splits) so that every dimension is eventually transformed, a multi-scale architecture that factors out part of the variables at intermediate stages, and batch normalization adapted to the flow setting.
Glow:
Glow improved upon RealNVP with three key innovations: activation normalization (actnorm) layers, invertible 1×1 convolutions that generalize fixed channel permutations, and a streamlined affine coupling design.
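As a sketch of the second of these components, here is a minimal implementation of an invertible 1×1 convolution written for flat latent vectors, where it reduces to an invertible linear flow; for image tensors, Glow applies the same weight matrix across channels at every spatial position and multiplies the log-determinant by the number of spatial locations. The class name `InvertibleLinear` is my own choice, not from a specific library.

```python
import torch
import torch.nn as nn


class InvertibleLinear(nn.Module):
    """
    Sketch of a Glow-style invertible 1x1 convolution, written for flat
    vectors (an invertible linear flow).
    """
    def __init__(self, dim):
        super().__init__()
        # Initialize with a random rotation so the layer starts well-conditioned
        W, _ = torch.linalg.qr(torch.randn(dim, dim))
        self.W = nn.Parameter(W)

    def forward(self, z):
        z_new = z @ self.W.T
        # log |det dz_new/dz| = log |det W|, shared across the batch
        _, logabsdet = torch.linalg.slogdet(self.W)
        return z_new, logabsdet.expand(z.shape[0])

    def inverse(self, z):
        W_inv = torch.linalg.inv(self.W)
        _, logabsdet = torch.linalg.slogdet(W_inv)
        return z @ W_inv.T, logabsdet.expand(z.shape[0])
```

Glow also proposes an LU-decomposed parameterization of W so the log-determinant costs O(d) instead of O(d³); the dense version above is kept for clarity.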
```python
import torch
import torch.nn as nn


class AffineCoupling(nn.Module):
    """
    Affine coupling layer as used in RealNVP/Glow.
    Splits the input, transforming the second half conditioned on the first half.
    """
    def __init__(self, dim, hidden_dim=256):
        super().__init__()
        self.dim = dim
        self.split_dim = dim // 2
        # Conditioner network: maps the first half to scale and translation
        self.conditioner = nn.Sequential(
            nn.Linear(self.split_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2 * (dim - self.split_dim)),  # scale + translation
        )
        # Initialize the final layer to zero (the coupling starts as the identity)
        nn.init.zeros_(self.conditioner[-1].weight)
        nn.init.zeros_(self.conditioner[-1].bias)

    def forward(self, z):
        """Apply the affine coupling transformation."""
        z1, z2 = z[..., :self.split_dim], z[..., self.split_dim:]
        # Get scale and translation from the conditioner
        params = self.conditioner(z1)
        log_s, t = params.chunk(2, dim=-1)
        # Clamp log_s for numerical stability
        log_s = torch.clamp(log_s, -5, 5)
        # Transform: z2' = z2 * exp(s) + t
        z2_new = z2 * torch.exp(log_s) + t
        # Concatenate (z1 unchanged)
        z_new = torch.cat([z1, z2_new], dim=-1)
        # Log determinant is the sum of log scales
        log_det = log_s.sum(dim=-1)
        return z_new, log_det

    def inverse(self, z):
        """Invert the affine coupling."""
        z1, z2 = z[..., :self.split_dim], z[..., self.split_dim:]
        params = self.conditioner(z1)
        log_s, t = params.chunk(2, dim=-1)
        log_s = torch.clamp(log_s, -5, 5)
        # Inverse: z2_orig = (z2 - t) * exp(-s)
        z2_orig = (z2 - t) * torch.exp(-log_s)
        z_orig = torch.cat([z1, z2_orig], dim=-1)
        log_det = -log_s.sum(dim=-1)
        return z_orig, log_det


class RealNVP(nn.Module):
    """Complete RealNVP flow with multiple coupling layers."""
    def __init__(self, dim, num_layers=6, hidden_dim=256):
        super().__init__()
        self.coupling_layers = nn.ModuleList(
            [AffineCoupling(dim, hidden_dim) for _ in range(num_layers)]
        )

    def forward(self, z0):
        z = z0
        log_det_sum = 0.0
        for coupling in self.coupling_layers:
            z, log_det = coupling(z)
            log_det_sum += log_det
            # Flip dimensions so the untouched half is transformed by the next layer
            z = z.flip(dims=[-1])
        return z, log_det_sum
```

Normalizing flows integrate naturally with variational inference, enriching the variational family while maintaining the ELBO optimization framework.
Flow-Enhanced ELBO:
With a flow-based variational distribution, the ELBO becomes:
$$\mathcal{L} = \mathbb{E}_{z_0 \sim q_0}\left[ \log p(x, z_K) - \log q_0(z_0) + \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right| \right]$$
where z_K = f_K ∘ ... ∘ f_1(z_0) is the transformed latent variable.
Gradient Estimation:
The gradient with respect to flow parameters uses the reparameterization trick:
$$\nabla_\phi \mathcal{L} = \mathbb{E}_{\epsilon \sim p(\epsilon)}\left[ \nabla_\phi \left( \log p(x, z_K(\epsilon; \phi)) + \text{log-det terms} \right) \right]$$
Because all flow operations are differentiable, gradients flow back through the entire transformation chain.
Flows are particularly valuable when the true posterior has: (1) strong correlations between variables, (2) multiple modes, (3) heavy tails, or (4) complex geometries. In these cases, even a few flow layers can dramatically reduce the KL divergence compared to Gaussian approximations.
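To show how these pieces fit together in practice, here is a short training-loop sketch. It assumes the FlowVI module implemented in the next code block and uses a hypothetical Gaussian observation model p(x|z) = N(x; z, 1) with matching latent and data dimensions; it is illustrative rather than a recipe.

```python
import torch
from torch.distributions import Normal

# A minimal training-loop sketch (assumes the FlowVI module implemented in the
# next code block). The Gaussian observation model p(x|z) = N(x; z, 1) is a
# hypothetical choice for illustration, with data and latent dimensions equal.
def gaussian_log_likelihood(x, z):
    return Normal(z, 1.0).log_prob(x).sum(dim=-1)

model = FlowVI(latent_dim=2, flow_layers=4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(128, 2)  # toy observations
for step in range(1000):
    optimizer.zero_grad()
    loss = -model.compute_elbo(x, gaussian_log_likelihood, num_samples=4)
    loss.backward()
    optimizer.step()
```

The FlowVI module used above is implemented next.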
```python
import torch
import torch.nn as nn
from torch.distributions import Normal


class FlowVI(nn.Module):
    """
    Variational Inference with Normalizing Flows.
    Uses a flow to transform a simple base distribution into a
    flexible approximate posterior.
    """
    def __init__(self, latent_dim, flow_layers, encoder=None):
        super().__init__()
        self.latent_dim = latent_dim
        # Base distribution: standard Gaussian
        self.base_dist = Normal(torch.zeros(latent_dim), torch.ones(latent_dim))
        # Flow transforms (RealNVP as defined above)
        self.flow = RealNVP(latent_dim, num_layers=flow_layers)
        # Optional: amortized inference encoder
        self.encoder = encoder

    def sample_posterior(self, x, num_samples=1):
        """Sample from the flow-based approximate posterior."""
        batch_size = x.shape[0]
        # Sample from the base distribution
        z0 = self.base_dist.sample((batch_size, num_samples))
        z0 = z0.view(-1, self.latent_dim)
        # Apply the flow
        zK, log_det = self.flow(z0)
        return zK.view(batch_size, num_samples, -1), log_det.view(batch_size, num_samples)

    def compute_elbo(self, x, log_likelihood_fn, num_samples=1):
        """
        Compute the ELBO with a flow-based posterior.

        Args:
            x: Observed data
            log_likelihood_fn: Function computing log p(x|z)
            num_samples: Number of posterior samples for the Monte Carlo estimate
        """
        batch_size = x.shape[0]
        # Sample from the base distribution and transform
        z0 = self.base_dist.sample((batch_size, num_samples))
        z0 = z0.view(-1, self.latent_dim)
        zK, log_det = self.flow(z0)
        zK = zK.view(batch_size, num_samples, -1)
        log_det = log_det.view(batch_size, num_samples)
        # Base distribution log probability
        z0_reshaped = z0.view(batch_size, num_samples, -1)
        log_q0 = self.base_dist.log_prob(z0_reshaped).sum(dim=-1)
        # Log probability under the flow: log q(zK) = log q0(z0) - log_det
        log_qK = log_q0 - log_det
        # Prior: assume a standard Gaussian prior for simplicity
        log_prior = Normal(0.0, 1.0).log_prob(zK).sum(dim=-1)
        # Likelihood
        x_expanded = x.unsqueeze(1).expand(-1, num_samples, -1)
        log_lik = log_likelihood_fn(x_expanded, zK)
        # ELBO = E[log p(x|z) + log p(z) - log q(z)]
        elbo = (log_lik + log_prior - log_qK).mean(dim=1)
        return elbo.mean()
```

We've covered the foundations of normalizing flows and their application to variational inference: the change of variables formula and its log-density form, simple planar and radial flows, autoregressive and coupling architectures with their sampling versus density-evaluation trade-offs, and the flow-enhanced ELBO optimized via the reparameterization trick.
You now understand normalizing flows as a powerful tool for variational inference. Next, we'll explore amortized inference, which enables efficient inference across many data points by learning to map observations directly to approximate posteriors.