Among the pantheon of generative models—VAEs, GANs, diffusion models—normalizing flows occupy a unique and mathematically elegant position. Unlike VAEs, which rely on variational approximations that provide only lower bounds on the likelihood, and unlike GANs, which eschew likelihood entirely in favor of adversarial training, normalizing flows offer something remarkable: exact, tractable likelihood computation combined with efficient, exact sampling.
This is not a trivial achievement. The fundamental challenge in generative modeling is transforming simple random noise—typically drawn from a standard Gaussian—into samples that faithfully represent complex, high-dimensional data distributions like natural images, audio waveforms, or molecular structures. Most approaches must compromise: either the transformation is simple but the likelihood is intractable, or the likelihood is tractable but the transformation is too constrained to model complex distributions.
Normalizing flows resolve this tension through a brilliant insight: construct the transformation as a composition of simple, invertible functions whose Jacobian determinants can be computed efficiently. The result is a generative model where we can both evaluate the exact probability density of any data point and sample new points efficiently—capabilities that open doors to applications ranging from density estimation to variational inference to anomaly detection.
By the end of this page, you will understand the core principles of normalizing flows: how they leverage the change of variables formula from probability theory, why invertibility is essential, and how careful architectural design enables tractable likelihood computation. You'll see how flows fit into the broader landscape of generative models and appreciate their unique mathematical properties.
A normalizing flow is a generative model that learns a complex probability distribution by transforming a simple base distribution (typically a standard Gaussian) through a sequence of invertible, differentiable mappings. The term "normalizing" refers to the process of mapping complex data distributions back to simple ("normal") distributions, while "flow" evokes the continuous transformation of probability mass through the sequence of mappings.
The Core Idea:
Let $\mathbf{z} \sim p_Z(\mathbf{z})$ be a random variable drawn from a simple base distribution (e.g., $\mathcal{N}(\mathbf{0}, \mathbf{I})$). We define a transformation $f: \mathbb{R}^d \to \mathbb{R}^d$ that maps $\mathbf{z}$ to $\mathbf{x}$:
$$\mathbf{x} = f(\mathbf{z})$$
If $f$ is invertible and differentiable, we can express the density of $\mathbf{x}$ using the change of variables formula:
$$p_X(\mathbf{x}) = p_Z(f^{-1}(\mathbf{x})) \left| \det \frac{\partial f^{-1}}{\partial \mathbf{x}} \right|$$
Equivalently, using the inverse function theorem:
$$p_X(\mathbf{x}) = p_Z(\mathbf{z}) \left| \det \frac{\partial f}{\partial \mathbf{z}} \right|^{-1}$$
The determinant of the Jacobian matrix $\frac{\partial f}{\partial \mathbf{z}}$ accounts for how the transformation locally expands or contracts volumes in the probability space.
Think of the Jacobian determinant as measuring how much the transformation $f$ stretches or compresses infinitesimal volumes. If a region is stretched (determinant > 1), the probability density must decrease proportionally to preserve total probability mass. If compressed (determinant < 1), density increases. This is the fundamental conservation law that the change of variables formula captures.
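To see this conservation law in action, the short sketch below (illustrative code, not part of any flow library) pushes a standard 2D Gaussian through a fixed affine map $\mathbf{x} = \mathbf{A}\mathbf{z} + \mathbf{b}$ and checks the change-of-variables density against the closed-form answer, $\mathcal{N}(\mathbf{b}, \mathbf{A}\mathbf{A}^\top)$.

```python
import torch
from torch.distributions import MultivariateNormal

torch.manual_seed(0)
d = 2
base = MultivariateNormal(torch.zeros(d), torch.eye(d))  # p_Z

# A fixed, invertible affine transformation x = A z + b (chosen by hand for this example)
A = torch.tensor([[2.0, 0.3], [0.0, 0.8]])
b = torch.tensor([1.0, -1.0])

x = torch.tensor([0.7, -0.2])

# Change of variables: p_X(x) = p_Z(f^{-1}(x)) * |det J_{f^{-1}}(x)|
z = torch.linalg.solve(A, x - b)             # f^{-1}(x)
log_abs_det_Jf = torch.linalg.slogdet(A)[1]  # log |det A|; the Jacobian of f is A everywhere
log_px_flow = base.log_prob(z) - log_abs_det_Jf

# Closed form: an affine map of a Gaussian is Gaussian with mean b and covariance A A^T
log_px_ref = MultivariateNormal(b, A @ A.T).log_prob(x)

print(torch.allclose(log_px_flow, log_px_ref, atol=1e-6))  # True
```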
Composing Transformations:
A single invertible transformation may not be expressive enough to map a simple Gaussian to a complex data distribution. The key insight is that we can compose multiple invertible transformations:
$$\mathbf{x} = f_K \circ f_{K-1} \circ \cdots \circ f_1(\mathbf{z})$$
Since the composition of invertible functions is invertible, and the determinant of a product of matrices equals the product of determinants, the log-likelihood decomposes elegantly:
$$\log p_X(\mathbf{x}) = \log p_Z(\mathbf{z}_0) - \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k}{\partial \mathbf{z}_{k-1}} \right|$$
where $\mathbf{z}_0 = f^{-1}_1 \circ \cdots \circ f^{-1}_K(\mathbf{x})$.
This compositionality is the source of the "flow" metaphor: probability mass flows through the sequence of transformations, with each layer reshaping the distribution slightly until the complex target distribution is reached.
```python
import torch
import torch.nn as nn
import numpy as np


class NormalizingFlowConcept:
    """
    Conceptual implementation of a normalizing flow.

    A normalizing flow transforms samples from a simple base distribution
    (like a standard Gaussian) into samples from a complex target distribution
    through a composition of invertible transformations.
    """

    def __init__(self, base_distribution, flow_layers):
        """
        Args:
            base_distribution: The simple base distribution (e.g., StandardNormal)
            flow_layers: List of invertible transformation layers
        """
        self.base_dist = base_distribution
        self.flows = flow_layers

    def forward(self, z):
        """
        Transform samples from base distribution to target distribution.

        Args:
            z: Samples from base distribution [batch_size, dim]

        Returns:
            x: Transformed samples [batch_size, dim]
            log_det_jacobian: Sum of log det Jacobians [batch_size]
        """
        log_det_jacobian = torch.zeros(z.shape[0], device=z.device)
        x = z
        for flow in self.flows:
            x, log_det = flow.forward(x)
            log_det_jacobian += log_det
        return x, log_det_jacobian

    def inverse(self, x):
        """
        Transform samples from target distribution back to base distribution.

        Args:
            x: Samples from target distribution [batch_size, dim]

        Returns:
            z: Samples in base distribution space [batch_size, dim]
            log_det_jacobian: Sum of log det Jacobians [batch_size]
        """
        log_det_jacobian = torch.zeros(x.shape[0], device=x.device)
        z = x
        # Apply inverse flows in reverse order
        for flow in reversed(self.flows):
            z, log_det = flow.inverse(z)
            log_det_jacobian += log_det
        return z, log_det_jacobian

    def log_prob(self, x):
        """
        Compute the log probability density of x under the flow model.

        This is the key advantage of normalizing flows: exact likelihood!

        log p(x) = log p_z(f^{-1}(x)) + log |det(df^{-1}/dx)|
                 = log p_z(z) - log |det(df/dz)|

        Args:
            x: Data points [batch_size, dim]

        Returns:
            log_prob: Log probability density [batch_size]
        """
        z, inverse_log_det = self.inverse(x)

        # Log prob under base distribution
        log_pz = self.base_dist.log_prob(z)

        # Change of variables adjustment
        log_px = log_pz + inverse_log_det
        return log_px

    def sample(self, num_samples):
        """
        Generate samples from the learned distribution.

        Args:
            num_samples: Number of samples to generate

        Returns:
            samples: Generated samples [num_samples, dim]
        """
        # Sample from base distribution
        z = self.base_dist.sample((num_samples,))

        # Transform through flow
        x, _ = self.forward(z)
        return x
```

To truly understand normalizing flows, we must establish a rigorous mathematical foundation. The entire framework rests on the change of variables theorem from measure theory and multivariable calculus—a result that relates probability densities under differentiable, invertible transformations.
Change of Variables Theorem:
Let $\mathbf{z} \in \mathbb{R}^d$ be a random vector with probability density function $p_Z(\mathbf{z})$, and let $f: \mathbb{R}^d \to \mathbb{R}^d$ be a diffeomorphism (a smooth, invertible function with smooth inverse). Define $\mathbf{x} = f(\mathbf{z})$. Then the density of $\mathbf{x}$ is given by:
$$p_X(\mathbf{x}) = p_Z(f^{-1}(\mathbf{x})) \cdot \left| \det J_{f^{-1}}(\mathbf{x}) \right|$$
where $J_{f^{-1}}(\mathbf{x}) = \frac{\partial f^{-1}(\mathbf{x})}{\partial \mathbf{x}}$ is the Jacobian matrix of the inverse transformation.
Equivalently, using $\mathbf{z} = f^{-1}(\mathbf{x})$:
$$p_X(\mathbf{x}) = p_Z(\mathbf{z}) \cdot \left| \det J_f(\mathbf{z}) \right|^{-1}$$
The Jacobian Matrix:
For a vector-valued function $f: \mathbb{R}^d \to \mathbb{R}^d$ with components $f = (f_1, f_2, \ldots, f_d)$, the Jacobian matrix is:
$$J_f(\mathbf{z}) = \begin{bmatrix} \frac{\partial f_1}{\partial z_1} & \frac{\partial f_1}{\partial z_2} & \cdots & \frac{\partial f_1}{\partial z_d} \\ \frac{\partial f_2}{\partial z_1} & \frac{\partial f_2}{\partial z_2} & \cdots & \frac{\partial f_2}{\partial z_d} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_d}{\partial z_1} & \frac{\partial f_d}{\partial z_2} & \cdots & \frac{\partial f_d}{\partial z_d} \end{bmatrix}$$
The Jacobian determinant $\det J_f(\mathbf{z})$ measures the local volume change induced by the transformation. Geometrically, $|\det J_f| > 1$ means the transformation locally expands volume, $|\det J_f| < 1$ means it locally contracts volume, and $|\det J_f| = 1$ means it is locally volume-preserving.
Computing the determinant of a general $d \times d$ matrix requires $O(d^3)$ operations using LU decomposition. For high-dimensional data like images ($d = 784$ for MNIST, $d = 3072$ for CIFAR-10, $d > 10^6$ for high-res images), this becomes prohibitively expensive. The art of designing normalizing flows lies in constructing transformations whose Jacobians have special structure enabling efficient determinant computation—often $O(d)$ instead of $O(d^3)$.
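To make the cost gap concrete, here is a small check (an illustrative sketch, not library code) that computes the log-determinant of a triangular Jacobian two ways: once with a general-purpose $O(d^3)$ factorization, and once as the $O(d)$ sum of log-absolute diagonal entries that the triangular structure permits.

```python
import torch

torch.manual_seed(0)
d = 512

# A lower-triangular Jacobian, as an autoregressive transformation would produce.
# The diagonal is kept away from zero so the log-determinant is well defined.
J = torch.tril(torch.randn(d, d, dtype=torch.float64))
J.diagonal().copy_(torch.rand(d, dtype=torch.float64) + 0.5)

# Generic route: O(d^3) factorization of the full matrix
sign, logdet_full = torch.linalg.slogdet(J)

# Structured route: for a triangular matrix, det = product of the diagonal,
# so the log-determinant is an O(d) sum
logdet_diag = torch.log(torch.abs(J.diagonal())).sum()

print(torch.allclose(logdet_full, logdet_diag))  # True
```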
Log-Likelihood in Practice:
For numerical stability and convenience, we work with log-densities. Taking the logarithm of the change of variables formula:
$$\log p_X(\mathbf{x}) = \log p_Z(\mathbf{z}) - \log \left| \det J_f(\mathbf{z}) \right|$$
For a standard Gaussian base distribution $p_Z(\mathbf{z}) = \mathcal{N}(\mathbf{z}; \mathbf{0}, \mathbf{I})$:
$$\log p_Z(\mathbf{z}) = -\frac{d}{2} \log(2\pi) - \frac{1}{2} \|\mathbf{z}\|^2$$
Thus the complete log-likelihood becomes:
$$\log p_X(\mathbf{x}) = -\frac{d}{2} \log(2\pi) - \frac{1}{2} \|f^{-1}(\mathbf{x})\|^2 - \log \left| \det J_f(f^{-1}(\mathbf{x})) \right|$$
For a composition of $K$ transformations:
$$\log p_X(\mathbf{x}) = \log p_Z(\mathbf{z}_0) - \sum_{k=1}^{K} \log \left| \det J_{f_k}(\mathbf{z}_{k-1}) \right|$$
where $\mathbf{z}_0 = \mathbf{z}$, $\mathbf{z}_k = f_k(\mathbf{z}_{k-1})$, and $\mathbf{z}_K = \mathbf{x}$.
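The additivity of the per-layer terms can be verified directly with automatic differentiation. The sketch below (the functions `f1` and `f2` are made up for illustration) composes an invertible linear layer with an element-wise monotonic map and checks that the sum of the cheap per-layer log-determinants matches the log-determinant of the end-to-end Jacobian.

```python
import torch
from torch.autograd.functional import jacobian

torch.manual_seed(0)
d = 3

A = torch.randn(d, d) + 2.0 * torch.eye(d)   # an invertible matrix (with probability 1)
b = torch.randn(d)

def f1(z):
    # Invertible linear layer; its Jacobian is A everywhere
    return A @ z + b

def f2(u):
    # Element-wise strictly increasing map; diagonal Jacobian with entries 1 + 0.5*cos(u) > 0
    return u + 0.5 * torch.sin(u)

z0 = torch.randn(d)
z1 = f1(z0)

# Per-layer log|det| terms, each cheap to compute from the layer's structure
logdet_1 = torch.linalg.slogdet(A)[1]
logdet_2 = torch.log(1.0 + 0.5 * torch.cos(z1)).sum()

# End-to-end Jacobian of the composition f2(f1(z)), computed by brute force with autograd
J_total = jacobian(lambda z: f2(f1(z)), z0)
logdet_total = torch.linalg.slogdet(J_total)[1]

print(torch.allclose(logdet_1 + logdet_2, logdet_total, atol=1e-5))  # True
```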
| Transformation Type | Jacobian Structure | Determinant Complexity | Example |
|---|---|---|---|
| General nonlinear | Dense matrix | $O(d^3)$ | MLP, arbitrary neural net |
| Diagonal scaling | Diagonal matrix | $O(d)$ — product of diagonal | $x_i = \sigma_i \cdot z_i$ |
| Triangular (autoregressive) | Triangular matrix | $O(d)$ — product of diagonal | Autoregressive flows |
| Coupling layers | Block triangular | $O(d)$ — product of one block's diagonal | RealNVP, NICE |
| Orthogonal/Rotation | Orthogonal matrix | $O(d)$ — always ±1 | Rotation matrices |
| Permutation | Permutation matrix | $O(1)$ — always ±1 | Shuffling dimensions |
Designing effective normalizing flow layers requires satisfying three fundamental requirements simultaneously. Each requirement constrains the design space, and the interplay between them drives the architectural innovations in the field.
Requirement 1: Invertibility
Every flow layer $f$ must be bijective (one-to-one and onto). This ensures that every data point $\mathbf{x}$ maps to exactly one latent code $\mathbf{z} = f^{-1}(\mathbf{x})$, so density evaluation and latent inference are exact, and that every latent sample maps back to a valid data point, so sampling is exact. Bijectivity is precisely what the change of variables formula requires.
This is a non-trivial constraint. Many common neural network operations—like element-wise ReLU or max pooling—are not invertible. Flow architectures must use operations that preserve invertibility.
Requirement 2: Efficiently Computable Jacobian Determinant
The log-determinant of the Jacobian must be computable in $O(d)$ or at worst $O(d \log d)$ time, not $O(d^3)$. This is achieved through transformations with structured Jacobians: diagonal (element-wise maps), triangular (autoregressive transformations), block triangular (coupling layers), or orthogonal and permutation matrices, as summarized in the table above.
Requirement 3: Sufficient Expressiveness
The composition of flow layers must be expressive enough to transform a simple Gaussian into complex, multi-modal data distributions. This creates a tension with the previous requirements: the most expressive transformations tend to have dense Jacobians or lack cheap inverses, while the transformations that are easiest to invert and whose determinants are cheapest to compute tend to be the most restricted.
The field's major innovations resolve this tension through clever architectural designs that achieve expressiveness while maintaining tractability.
The Expressiveness-Tractability Trade-off:
Consider the spectrum of transformations:

- Linear transformations: $f(\mathbf{z}) = \mathbf{A}\mathbf{z} + \mathbf{b}$. Invertible whenever $\mathbf{A}$ is, but a linear map of a Gaussian is still a Gaussian, so expressiveness is minimal.
- Element-wise nonlinearities: $f_i(\mathbf{z}) = g(z_i)$ where $g$ is monotonic and differentiable. The Jacobian is diagonal, so the determinant is trivial, but the dimensions never interact.
- Autoregressive transformations: $x_i = g(z_i; \theta_i(\mathbf{z}_{<i}))$. The Jacobian is triangular and the layer is highly expressive, but computation is sequential in one direction.
- Coupling layers: Split dimensions, transform one half based on the other. The Jacobian is block triangular, and both directions are fast, at some cost in per-layer expressiveness.
```python
import numpy as np
import torch
import torch.nn as nn
from abc import ABC, abstractmethod


class FlowLayer(ABC, nn.Module):
    """
    Abstract base class for normalizing flow layers.

    Every flow layer must implement:
    1. forward: z -> x with log det Jacobian
    2. inverse: x -> z with log det Jacobian

    The log det Jacobians must be efficiently computable.
    """

    @abstractmethod
    def forward(self, z):
        """
        Forward transformation: z -> x

        Args:
            z: Input tensor [batch_size, dim]

        Returns:
            x: Transformed tensor [batch_size, dim]
            log_det: Log determinant of Jacobian [batch_size]
        """
        pass

    @abstractmethod
    def inverse(self, x):
        """
        Inverse transformation: x -> z

        Args:
            x: Input tensor [batch_size, dim]

        Returns:
            z: Inverse-transformed tensor [batch_size, dim]
            log_det: Log determinant of inverse Jacobian [batch_size]
        """
        pass


class ActNorm(FlowLayer):
    """
    Activation Normalization layer.

    Learnable per-channel scale and bias, initialized data-dependently so the
    output has zero mean and unit variance after the first batch. This is a
    simple but important flow layer that helps with training stability and
    acts as a learnable linear transformation.

    Jacobian: Diagonal matrix with entries = scale
    Log det = sum of log absolute scales
    Complexity: O(d)
    """

    def __init__(self, num_features):
        super().__init__()
        self.num_features = num_features

        # Learnable scale and bias
        self.scale = nn.Parameter(torch.ones(1, num_features))
        self.bias = nn.Parameter(torch.zeros(1, num_features))

        # Track initialization
        self.register_buffer('initialized', torch.tensor(False))

    def initialize(self, x):
        """Data-dependent initialization."""
        with torch.no_grad():
            # Compute per-channel statistics
            mean = x.mean(dim=0, keepdim=True)
            std = x.std(dim=0, keepdim=True) + 1e-6

            # Initialize so that scale * x + bias has zero mean and unit variance
            self.scale.data = 1.0 / std
            self.bias.data = -mean / std

            self.initialized.fill_(True)

    def forward(self, z):
        if not self.initialized:
            self.initialize(z)

        # Affine transformation: x = scale * z + bias
        x = self.scale * z + self.bias

        # Log det Jacobian = sum of log|scale|, identical for every sample in the batch
        log_det = torch.sum(torch.log(torch.abs(self.scale))) * torch.ones(z.shape[0], device=z.device)
        return x, log_det

    def inverse(self, x):
        # Inverse: z = (x - bias) / scale
        z = (x - self.bias) / self.scale

        # Log det of inverse = negative of forward
        log_det = -torch.sum(torch.log(torch.abs(self.scale))) * torch.ones(x.shape[0], device=x.device)
        return z, log_det


class InvertibleLeakyReLU(FlowLayer):
    """
    Invertible Leaky ReLU activation.

    f(z) = z if z >= 0 else alpha * z

    This is invertible when alpha > 0 (different from standard ReLU, where alpha = 0).

    Jacobian: Diagonal with entries 1 (if z >= 0) or alpha (if z < 0)
    Log det = sum of log(1) or log(alpha) for each element
    Complexity: O(d)
    """

    def __init__(self, alpha=0.2):
        super().__init__()
        assert alpha > 0, "Alpha must be positive for invertibility"
        self.alpha = alpha
        self.log_alpha = np.log(alpha)

    def forward(self, z):
        x = torch.where(z >= 0, z, self.alpha * z)

        # Log det: count negative elements and multiply by log(alpha)
        num_negative = (z < 0).float().sum(dim=1)
        log_det = num_negative * self.log_alpha
        return x, log_det

    def inverse(self, x):
        z = torch.where(x >= 0, x, x / self.alpha)

        # Log det of inverse is negative of forward
        num_negative = (x < 0).float().sum(dim=1)
        log_det = -num_negative * self.log_alpha
        return z, log_det
```

Understanding where normalizing flows fit in the generative modeling landscape illuminates both their strengths and limitations.
Each major family of generative models makes different choices in the likelihood-flexibility-efficiency trade-off space.
Variational Autoencoders (VAEs):
VAEs define an encoder $q_\phi(\mathbf{z}|\mathbf{x})$ and decoder $p_\theta(\mathbf{x}|\mathbf{z})$ connected by a latent space. The true log-likelihood is intractable, so VAEs optimize a variational lower bound (ELBO):
$$\log p(\mathbf{x}) \geq \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] - D_{KL}(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}))$$
The gap between the ELBO and true likelihood depends on how well $q_\phi$ approximates the true posterior.
Generative Adversarial Networks (GANs):
GANs abandon likelihood entirely, instead training a generator to fool a discriminator. This enables generation of sharp, high-quality samples but provides no probability density estimates and can suffer from mode collapse.
Diffusion Models:
Diffusion models define a forward process that gradually adds noise to data, then learn to reverse this process. They achieve state-of-the-art sample quality but require many sampling steps and only provide variational likelihood bounds.
| Property | Normalizing Flows | VAEs | GANs | Diffusion Models |
|---|---|---|---|---|
| Exact likelihood | ✓ Yes | ✗ Lower bound only | ✗ No density | ✗ Lower bound only |
| Exact sampling | ✓ Yes | ✓ Yes | ✓ Yes | ✓ Yes (but slow) |
| Latent representation | ✓ Exact inference | ~ Approximate | ✗ Not meaningful | ~ Implicit |
| Training stability | ✓ Stable (MLE) | ✓ Stable | ✗ Can be unstable | ✓ Stable |
| Sample quality | ~ Good | ~ Blurry on images | ✓ Sharp | ✓ Excellent |
| Sampling speed | ✓ Fast (one forward pass) | ✓ Fast | ✓ Fast | ✗ Slow (many steps) |
| Architectural constraints | ✗ Strict (invertibility) | ~ Moderate | ✓ Flexible | ✓ Flexible |
| Mode coverage | ✓ Good | ✓ Good | ~ Mode collapse risk | ✓ Excellent |
When to Choose Normalizing Flows:
Normalizing flows are particularly well-suited for applications where:

- Exact likelihood is required: density estimation, anomaly detection via likelihood, probabilistic inference
- Exact latent inference is valuable: learning meaningful latent representations where we need the exact $\mathbf{z}$ for a given $\mathbf{x}$
- Fast sampling is needed: unlike diffusion models that require hundreds of denoising steps, flows sample in a single forward pass
- Variational inference: flows can define flexible variational posteriors that go beyond simple Gaussians, tightening ELBO bounds
- Hybrid models: flows can enhance other models, e.g., using flows in the VAE prior or posterior, or as components in more complex probabilistic programs
The key limitation of normalizing flows is the invertibility requirement. This means you cannot use arbitrary neural network architectures—only carefully designed invertible ones. This constrains model capacity and has historically limited flows' performance on complex image generation compared to GANs and diffusion models. However, advances in flow architecture design continue to close this gap.
The development of normalizing flows is a story of progressively solving the tractability-expressiveness trade-off through architectural innovation. Understanding this history provides insight into why modern flow architectures are designed the way they are.
Early Foundations (Pre-2014):
The change of variables formula has been a staple of probability theory for centuries. Early applications in machine learning used simple parametric transformations—linear transformations, Box-Cox transforms—but these lacked the flexibility to model complex distributions.
NICE (2014): Non-linear Independent Components Estimation
Dinh, Krueger, and Bengio introduced NICE, the first modern normalizing flow architecture. NICE used additive coupling layers that split the input and transformed one half based on the other:
$$\mathbf{y}_{1:d/2} = \mathbf{x}_{1:d/2}$$
$$\mathbf{y}_{d/2+1:d} = \mathbf{x}_{d/2+1:d} + m(\mathbf{x}_{1:d/2})$$
where $m$ is an arbitrary neural network. The Jacobian is triangular with ones on the diagonal, so $\det(J) = 1$. This was easy to compute but somewhat limited in expressiveness.
RealNVP (2016): Real-valued Non-Volume Preserving
Building on NICE, Dinh et al. introduced affine coupling layers that add learnable scaling:
$$\mathbf{y}_{d/2+1:d} = \mathbf{x}_{d/2+1:d} \odot \exp(s(\mathbf{x}_{1:d/2})) + t(\mathbf{x}_{1:d/2})$$
The diagonal Jacobian entries are now $\exp(s_i)$, giving a non-trivial but still tractable determinant. This was a major expressiveness improvement.
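As a concrete illustration of the affine coupling idea, here is a minimal RealNVP-style coupling layer. This is a hedged sketch for exposition rather than the original implementation; the class name `AffineCoupling` and the small conditioner network are choices made for this example.

```python
import torch
import torch.nn as nn


class AffineCoupling(nn.Module):
    """Minimal RealNVP-style affine coupling layer (illustrative sketch).

    The first half of the dimensions passes through unchanged; the second half
    is scaled and shifted by functions of the first half. The Jacobian is block
    triangular, so log|det| is simply the sum of the scales.
    """

    def __init__(self, dim, hidden=64):
        super().__init__()
        self.d = dim // 2
        # Small network producing log-scale s and shift t from the first half
        self.net = nn.Sequential(
            nn.Linear(self.d, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.d)),
        )

    def _s_t(self, x1):
        s, t = self.net(x1).chunk(2, dim=-1)
        s = torch.tanh(s)  # keep scales bounded for numerical stability
        return s, t

    def forward(self, z):
        z1, z2 = z[:, :self.d], z[:, self.d:]
        s, t = self._s_t(z1)
        x2 = z2 * torch.exp(s) + t           # y_{d/2+1:d} = x_{d/2+1:d} * exp(s) + t
        log_det = s.sum(dim=1)               # log|det J| = sum_i s_i
        return torch.cat([z1, x2], dim=1), log_det

    def inverse(self, x):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        s, t = self._s_t(x1)
        z2 = (x2 - t) * torch.exp(-s)        # exact inverse; the conditioner net is never inverted
        log_det = -s.sum(dim=1)
        return torch.cat([x1, z2], dim=1), log_det


# Quick round-trip check
layer = AffineCoupling(dim=6)
z = torch.randn(4, 6)
x, ld = layer.forward(z)
z_rec, ld_inv = layer.inverse(x)
print(torch.allclose(z, z_rec, atol=1e-6), torch.allclose(ld, -ld_inv))
```

Note that the inverse never has to invert the conditioner network itself, which is why $s$ and $t$ can be arbitrarily complex functions without breaking invertibility.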
Glow (2018): Generative Flow with Invertible 1×1 Convolutions
Kingma and Dhariwal introduced Glow, which achieved state-of-the-art image generation for flows by adding activation normalization (ActNorm) layers and invertible 1×1 convolutions, which act as learned, generalized permutations between affine coupling layers within a multi-scale architecture.
Glow demonstrated that flows could generate high-resolution faces, approaching GAN quality.
Autoregressive Flows (2016-2017):
Simultaneously, researchers developed autoregressive flows like MAF (Masked Autoregressive Flow) and IAF (Inverse Autoregressive Flow):
$$x_i = z_i \cdot \sigma(\mathbf{x}_{<i}) + \mu(\mathbf{x}_{<i})$$
These leverage autoregressive structure for triangular Jacobians. MAF has fast density evaluation but slow sampling; IAF has fast sampling but slow density evaluation.
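The asymmetry between the two directions is easy to see in code. The toy sketch below is an assumption-laden illustration: for brevity the conditioner looks only at the immediately preceding dimension, whereas real MAF/IAF layers use masked networks over all of $\mathbf{x}_{<i}$. It shows that the density direction is a single vectorized pass while the sampling direction requires a sequential loop over dimensions.

```python
import torch

torch.manual_seed(0)
d = 5
# Toy causal conditioner: mu_i and log_sigma_i depend only on x_{i-1}
w_mu, w_s = torch.randn(d), 0.1 * torch.randn(d)

def conditioner(prev):
    # prev holds x_{i-1} for each position i (zero for i = 0)
    return w_mu * prev, w_s * prev          # mu, log_sigma

# Density direction (x -> z): every z_i is available from x in one vectorized pass
x = torch.randn(3, d)
prev = torch.cat([torch.zeros(3, 1), x[:, :-1]], dim=1)
mu, log_sigma = conditioner(prev)
z = (x - mu) * torch.exp(-log_sigma)        # fast: fully parallel across dimensions

# Sampling direction (z -> x): x_i needs x_{i-1}, so the loop is inherently sequential
z_sample = torch.randn(3, d)
x_gen = torch.zeros(3, d)
for i in range(d):
    prev_i = x_gen[:, i - 1] if i > 0 else torch.zeros(3)
    mu_i, log_sigma_i = w_mu[i] * prev_i, w_s[i] * prev_i
    x_gen[:, i] = z_sample[:, i] * torch.exp(log_sigma_i) + mu_i

# Round-trip check: transforming the generated samples back recovers z_sample
prev_gen = torch.cat([torch.zeros(3, 1), x_gen[:, :-1]], dim=1)
mu_g, ls_g = conditioner(prev_gen)
print(torch.allclose((x_gen - mu_g) * torch.exp(-ls_g), z_sample, atol=1e-6))  # True
```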
Continuous Normalizing Flows (2018):
Chen et al. introduced Neural ODEs, treating the transformation as the solution to an ordinary differential equation. This enabled free-form transformations without restrictive architectural constraints on the Jacobian: the log-determinant becomes the integral of a trace, which follow-up work such as FFJORD estimates efficiently with stochastic trace estimators.
Recent Advances (2019-Present):
| Year | Model | Key Innovation | Impact |
|---|---|---|---|
| 2014 | NICE | Additive coupling layers | First modern flow; volume-preserving |
| 2016 | RealNVP | Affine coupling layers | Non-volume-preserving; much more expressive |
| 2016 | MAF | Masked autoregressive structure | Triangular Jacobians; very expressive |
| 2016 | IAF | Inverse autoregressive structure | Fast sampling for VAE posteriors |
| 2018 | Glow | Invertible 1×1 convolutions | State-of-the-art image generation |
| 2018 | FFJORD | Continuous normalizing flows | Free-form transformations via neural ODEs |
| 2019 | Flow++ | Improved dequantization, attention | Better bits-per-dimension on images |
| 2019 | Neural Spline Flows | Monotonic spline transforms | Highly flexible 1D transformations |
| 2020 | Residual Flows | Invertible residual connections | Contractive residual networks |
Notice the progression: from volume-preserving (NICE) to affine (RealNVP) to expressive autoregressive (MAF) to continuous-time (FFJORD). Each step expanded the expressiveness of flows while maintaining tractable likelihood computation. Modern flows combine multiple innovations—coupling layers, autoregressive components, invertible convolutions, and learned permutations—to achieve strong performance across tasks.
Training normalizing flows is remarkably straightforward compared to GANs or diffusion models: we simply maximize the log-likelihood of the data under the flow model. This is direct maximum likelihood estimation, made possible by the tractable density.
The Objective:
Given a dataset $\mathcal{D} = \{\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(N)}\}$, we maximize:
$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \log p_\theta(\mathbf{x}^{(i)})$$
Expanding using the change of variables:
$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left[ \log p_Z(f^{-1}_\theta(\mathbf{x}^{(i)})) + \log \left| \det J_{f^{-1}_\theta}(\mathbf{x}^{(i)}) \right| \right]$$
For a standard Gaussian base distribution:
$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left[ -\frac{d}{2} \log(2\pi) - \frac{1}{2} \|\mathbf{z}^{(i)}\|^2 + \log \left| \det J_{f^{-1}_\theta}(\mathbf{x}^{(i)}) \right| \right]$$
where $\mathbf{z}^{(i)} = f^{-1}_\theta(\mathbf{x}^{(i)})$.
Understanding the Training Dynamics:
The loss has two interpretable components:
Prior term: $-\frac{1}{2} \|\mathbf{z}\|^2$ encourages the transformed data to lie near the origin in latent space (low Mahalanobis distance from zero under the Gaussian prior)
Volume term: $\log |\det J_{f^{-1}}|$ encourages the model to use the full latent space efficiently, neither over-compressing nor over-expanding
Together, these ensure the model maps data to regions of high prior probability without extreme volume distortion.
Bits Per Dimension (BPD):
For image modeling, we report performance in bits per dimension, which measures how many bits are needed on average to encode each pixel:
$$\text{BPD} = \frac{-\log_2 p(\mathbf{x})}{d} = \frac{-\log p(\mathbf{x})}{d \cdot \log 2}$$
Lower is better. For reference, a uniform distribution over 8-bit pixel values corresponds to 8 BPD, while state-of-the-art flow models reach roughly 3 BPD on CIFAR-10.
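The conversion is a pure change of units from nats to bits, as the hypothetical helper below illustrates (the numbers are made up for the example).

```python
import math

def bits_per_dimension(nll_nats, dim):
    """Convert an average negative log-likelihood (in nats per example)
    into bits per dimension."""
    return nll_nats / (dim * math.log(2.0))

# Example: an average NLL of 6600 nats on 32x32x3 images
print(bits_per_dimension(6600.0, 32 * 32 * 3))  # ~3.10 bits/dim
```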
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import MultivariateNormal


class FlowTrainer:
    """
    Trainer for normalizing flow models.

    Training is straightforward: maximize log-likelihood via gradient descent.
    """

    def __init__(self, flow_model, base_distribution, device='cuda'):
        """
        Args:
            flow_model: The normalizing flow (composition of flow layers)
            base_distribution: The base distribution (e.g., standard Gaussian)
            device: Device to train on
        """
        self.flow = flow_model.to(device)
        self.base_dist = base_distribution
        self.device = device

    def compute_log_likelihood(self, x):
        """
        Compute log p(x) under the flow model.

        log p(x) = log p_z(f^{-1}(x)) + log |det J_{f^{-1}}(x)|

        Args:
            x: Data batch [batch_size, dim]

        Returns:
            log_prob: Log probability [batch_size]
            z: Latent representations [batch_size, dim]
        """
        # Transform data to latent space
        z, log_det_inverse = self.flow.inverse(x)

        # Log probability under base distribution
        log_pz = self.base_dist.log_prob(z)

        # Total log probability (exact!)
        log_px = log_pz + log_det_inverse
        return log_px, z

    def compute_loss(self, x):
        """
        Compute negative log-likelihood loss.

        Args:
            x: Data batch [batch_size, dim]

        Returns:
            loss: Scalar loss (negative mean log-likelihood)
            metrics: Dict with additional metrics
        """
        log_px, z = self.compute_log_likelihood(x)

        # Negative log-likelihood
        nll = -log_px.mean()

        # Compute bits per dimension for interpretability
        dim = x.shape[1]
        bpd = nll / (dim * torch.log(torch.tensor(2.0)))

        metrics = {
            'nll': nll.item(),
            'bpd': bpd.item(),
            'z_norm': z.norm(dim=1).mean().item(),
        }
        return nll, metrics

    def train_epoch(self, dataloader, optimizer):
        """
        Train for one epoch.

        Args:
            dataloader: DataLoader yielding batches
            optimizer: Optimizer

        Returns:
            epoch_metrics: Dict with averaged metrics
        """
        self.flow.train()
        total_nll = 0.0
        total_bpd = 0.0
        num_batches = 0

        for batch in dataloader:
            x = batch[0].to(self.device)

            # Dequantization: add uniform noise to discrete data
            # This converts discrete pixel values to continuous values
            x = x + torch.rand_like(x) / 256.0

            optimizer.zero_grad()
            loss, metrics = self.compute_loss(x)
            loss.backward()

            # Gradient clipping for stability
            torch.nn.utils.clip_grad_norm_(self.flow.parameters(), max_norm=1.0)

            optimizer.step()

            total_nll += metrics['nll']
            total_bpd += metrics['bpd']
            num_batches += 1

        return {
            'nll': total_nll / num_batches,
            'bpd': total_bpd / num_batches,
        }

    @torch.no_grad()
    def evaluate(self, dataloader):
        """
        Evaluate on a dataset.

        Args:
            dataloader: DataLoader for evaluation

        Returns:
            metrics: Dict with evaluation metrics
        """
        self.flow.eval()
        total_nll = 0.0
        total_bpd = 0.0
        num_samples = 0

        for batch in dataloader:
            x = batch[0].to(self.device)
            x = x + torch.rand_like(x) / 256.0  # Dequantization

            log_px, _ = self.compute_log_likelihood(x)

            batch_size = x.shape[0]
            total_nll += -log_px.sum().item()
            total_bpd += (-log_px.sum() / (x.shape[1] * torch.log(torch.tensor(2.0)))).item()
            num_samples += batch_size

        return {
            'nll': total_nll / num_samples,
            'bpd': total_bpd / num_samples,
        }

    @torch.no_grad()
    def sample(self, num_samples, temperature=1.0):
        """
        Generate samples from the flow model.

        Args:
            num_samples: Number of samples to generate
            temperature: Sampling temperature (scales std of base dist)

        Returns:
            samples: Generated samples [num_samples, dim]
        """
        self.flow.eval()

        # Sample from base distribution with temperature scaling
        z = self.base_dist.sample((num_samples,)) * temperature
        z = z.to(self.device)

        # Transform through flow
        x, _ = self.flow.forward(z)
        return x
```

For discrete data like images (with integer pixel values), we must add continuous noise before training—a process called dequantization. Without this, the flow would assign infinite density to the discrete values and zero elsewhere, leading to degenerate training. The standard approach adds uniform noise: $\tilde{x} = x + u$ where $u \sim \text{Uniform}(0, 1/256)$ for 8-bit images.
We have established the foundational principles of normalizing flows, a powerful and mathematically elegant family of generative models. Let us consolidate the key insights before diving deeper into specific architectural innovations.
What's Next:
In the following pages, we will dive deep into the specific mechanisms that make modern normalizing flows work:
Each page will build on the foundations established here, developing both the mathematical understanding and practical implementation skills needed to work with normalizing flows in real applications.
You now understand the core principles that define normalizing flows: invertible transformations, tractable Jacobian determinants, and the change of variables formula. This conceptual framework will guide our exploration of specific flow architectures in the pages ahead.