Among the pantheon of generative models—VAEs, GANs, diffusion models—normalizing flows occupy a unique and mathematically elegant position. Unlike VAEs, which rely on variational approximations that provide only lower bounds on the likelihood, and unlike GANs, which eschew likelihood entirely in favor of adversarial training, normalizing flows offer something remarkable: exact, tractable likelihood computation combined with efficient, exact sampling.
This is not a trivial achievement. The fundamental challenge in generative modeling is transforming simple random noise—typically drawn from a standard Gaussian—into samples that faithfully represent complex, high-dimensional data distributions like natural images, audio waveforms, or molecular structures. Most approaches must compromise: either the transformation is simple but the likelihood is intractable, or the likelihood is tractable but the transformation is too constrained to model complex distributions.
Normalizing flows resolve this tension through a brilliant insight: construct the transformation as a composition of simple, invertible functions whose Jacobian determinants can be computed efficiently. The result is a generative model where we can both evaluate the exact probability density of any data point and sample new points efficiently—capabilities that open doors to applications ranging from density estimation to variational inference to anomaly detection.
By the end of this page, you will understand the core principles of normalizing flows: how they leverage the change of variables formula from probability theory, why invertibility is essential, and how careful architectural design enables tractable likelihood computation. You'll see how flows fit into the broader landscape of generative models and appreciate their unique mathematical properties.
A normalizing flow is a generative model that learns a complex probability distribution by transforming a simple base distribution (typically a standard Gaussian) through a sequence of invertible, differentiable mappings. The term "normalizing" refers to the process of mapping complex data distributions back to simple ("normal") distributions, while "flow" evokes the continuous transformation of probability mass through the sequence of mappings.
The Core Idea:
Let $\mathbf{z} \sim p_Z(\mathbf{z})$ be a random variable drawn from a simple base distribution (e.g., $\mathcal{N}(\mathbf{0}, \mathbf{I})$). We define a transformation $f: \mathbb{R}^d \to \mathbb{R}^d$ that maps $\mathbf{z}$ to $\mathbf{x}$:
$$\mathbf{x} = f(\mathbf{z})$$
If $f$ is invertible and differentiable, we can express the density of $\mathbf{x}$ using the change of variables formula:
$$p_X(\mathbf{x}) = p_Z(f^{-1}(\mathbf{x})) \left| \det \frac{\partial f^{-1}}{\partial \mathbf{x}} \right|$$
Equivalently, using the inverse function theorem:
$$p_X(\mathbf{x}) = p_Z(\mathbf{z}) \left| \det \frac{\partial f}{\partial \mathbf{z}} \right|^{-1}$$
The determinant of the Jacobian matrix $\frac{\partial f}{\partial \mathbf{z}}$ accounts for how the transformation locally expands or contracts volumes in the probability space.
Think of the Jacobian determinant as measuring how much the transformation $f$ stretches or compresses infinitesimal volumes. If a region is stretched (determinant > 1), the probability density must decrease proportionally to preserve total probability mass. If compressed (determinant < 1), density increases. This is the fundamental conservation law that the change of variables formula captures.
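To see this conservation law in action, the short sketch below (illustrative code, not part of any flow library) pushes a standard 2D Gaussian through a fixed affine map $\mathbf{x} = \mathbf{A}\mathbf{z} + \mathbf{b}$ and checks the change-of-variables density against the closed-form answer, $\mathcal{N}(\mathbf{b}, \mathbf{A}\mathbf{A}^\top)$.

```python
import torch
from torch.distributions import MultivariateNormal

torch.manual_seed(0)
d = 2
base = MultivariateNormal(torch.zeros(d), torch.eye(d))  # p_Z

# A fixed, invertible affine transformation x = A z + b (chosen by hand for this example)
A = torch.tensor([[2.0, 0.3], [0.0, 0.8]])
b = torch.tensor([1.0, -1.0])

x = torch.tensor([0.7, -0.2])

# Change of variables: p_X(x) = p_Z(f^{-1}(x)) * |det J_{f^{-1}}(x)|
z = torch.linalg.solve(A, x - b)             # f^{-1}(x)
log_abs_det_Jf = torch.linalg.slogdet(A)[1]  # log |det A|; the Jacobian of f is A everywhere
log_px_flow = base.log_prob(z) - log_abs_det_Jf

# Closed form: an affine map of a Gaussian is Gaussian with mean b and covariance A A^T
log_px_ref = MultivariateNormal(b, A @ A.T).log_prob(x)

print(torch.allclose(log_px_flow, log_px_ref, atol=1e-6))  # True
```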
Composing Transformations:
A single invertible transformation may not be expressive enough to map a simple Gaussian to a complex data distribution. The key insight is that we can compose multiple invertible transformations:
$$\mathbf{x} = f_K \circ f_{K-1} \circ \cdots \circ f_1(\mathbf{z})$$
Since the composition of invertible functions is invertible, and the determinant of a product of matrices equals the product of determinants, the log-likelihood decomposes elegantly:
$$\log p_X(\mathbf{x}) = \log p_Z(\mathbf{z}_0) - \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k}{\partial \mathbf{z}_{k-1}} \right|$$
where $\mathbf{z}_0 = f^{-1}_1 \circ \cdots \circ f^{-1}_K(\mathbf{x})$.
This compositionality is the source of the "flow" metaphor: probability mass flows through the sequence of transformations, with each layer reshaping the distribution slightly until the complex target distribution is reached.
```python
import torch
import torch.nn as nn
import numpy as np


class NormalizingFlowConcept:
    """
    Conceptual implementation of a normalizing flow.

    A normalizing flow transforms samples from a simple base distribution
    (like a standard Gaussian) into samples from a complex target distribution
    through a composition of invertible transformations.
    """

    def __init__(self, base_distribution, flow_layers):
        """
        Args:
            base_distribution: The simple base distribution (e.g., StandardNormal)
            flow_layers: List of invertible transformation layers
        """
        self.base_dist = base_distribution
        self.flows = flow_layers

    def forward(self, z):
        """
        Transform samples from base distribution to target distribution.

        Args:
            z: Samples from base distribution [batch_size, dim]

        Returns:
            x: Transformed samples [batch_size, dim]
            log_det_jacobian: Sum of log det Jacobians [batch_size]
        """
        log_det_jacobian = torch.zeros(z.shape[0], device=z.device)
        x = z
        for flow in self.flows:
            x, log_det = flow.forward(x)
            log_det_jacobian += log_det
        return x, log_det_jacobian

    def inverse(self, x):
        """
        Transform samples from target distribution back to base distribution.

        Args:
            x: Samples from target distribution [batch_size, dim]

        Returns:
            z: Samples in base distribution space [batch_size, dim]
            log_det_jacobian: Sum of log det Jacobians [batch_size]
        """
        log_det_jacobian = torch.zeros(x.shape[0], device=x.device)
        z = x
        # Apply inverse flows in reverse order
        for flow in reversed(self.flows):
            z, log_det = flow.inverse(z)
            log_det_jacobian += log_det
        return z, log_det_jacobian

    def log_prob(self, x):
        """
        Compute the log probability density of x under the flow model.

        This is the key advantage of normalizing flows: exact likelihood!

        log p(x) = log p_z(f^{-1}(x)) + log |det(df^{-1}/dx)|
                 = log p_z(z) - log |det(df/dz)|

        Args:
            x: Data points [batch_size, dim]

        Returns:
            log_prob: Log probability density [batch_size]
        """
        z, inverse_log_det = self.inverse(x)

        # Log prob under base distribution
        log_pz = self.base_dist.log_prob(z)

        # Change of variables adjustment
        log_px = log_pz + inverse_log_det
        return log_px

    def sample(self, num_samples):
        """
        Generate samples from the learned distribution.

        Args:
            num_samples: Number of samples to generate

        Returns:
            samples: Generated samples [num_samples, dim]
        """
        # Sample from base distribution
        z = self.base_dist.sample((num_samples,))

        # Transform through flow
        x, _ = self.forward(z)
        return x
```

To truly understand normalizing flows, we must establish a rigorous mathematical foundation. The entire framework rests on the change of variables theorem from measure theory and multivariable calculus—a result that relates probability densities under differentiable, invertible transformations.
Change of Variables Theorem:
Let $\mathbf{z} \in \mathbb{R}^d$ be a random vector with probability density function $p_Z(\mathbf{z})$, and let $f: \mathbb{R}^d \to \mathbb{R}^d$ be a diffeomorphism (a smooth, invertible function with smooth inverse). Define $\mathbf{x} = f(\mathbf{z})$. Then the density of $\mathbf{x}$ is given by:
$$p_X(\mathbf{x}) = p_Z(f^{-1}(\mathbf{x})) \cdot \left| \det J_{f^{-1}}(\mathbf{x}) \right|$$
where $J_{f^{-1}}(\mathbf{x}) = \frac{\partial f^{-1}(\mathbf{x})}{\partial \mathbf{x}}$ is the Jacobian matrix of the inverse transformation.
Equivalently, using $\mathbf{z} = f^{-1}(\mathbf{x})$:
$$p_X(\mathbf{x}) = p_Z(\mathbf{z}) \cdot \left| \det J_f(\mathbf{z}) \right|^{-1}$$
The Jacobian Matrix:
For a vector-valued function $f: \mathbb{R}^d \to \mathbb{R}^d$ with components $f = (f_1, f_2, \ldots, f_d)$, the Jacobian matrix is:
$$J_f(\mathbf{z}) = \begin{bmatrix} \frac{\partial f_1}{\partial z_1} & \frac{\partial f_1}{\partial z_2} & \cdots & \frac{\partial f_1}{\partial z_d} \\ \frac{\partial f_2}{\partial z_1} & \frac{\partial f_2}{\partial z_2} & \cdots & \frac{\partial f_2}{\partial z_d} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_d}{\partial z_1} & \frac{\partial f_d}{\partial z_2} & \cdots & \frac{\partial f_d}{\partial z_d} \end{bmatrix}$$
The Jacobian determinant $\det J_f(\mathbf{z})$ measures the local volume change induced by the transformation. Geometrically, $|\det J_f| > 1$ means the transformation locally expands volume, $|\det J_f| < 1$ means it locally contracts volume, and $|\det J_f| = 1$ means it is locally volume-preserving.
Computing the determinant of a general $d \times d$ matrix requires $O(d^3)$ operations using LU decomposition. For high-dimensional data like images ($d = 784$ for MNIST, $d = 3072$ for CIFAR-10, $d > 10^6$ for high-res images), this becomes prohibitively expensive. The art of designing normalizing flows lies in constructing transformations whose Jacobians have special structure enabling efficient determinant computation—often $O(d)$ instead of $O(d^3)$.
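To make the cost gap concrete, here is a small check (an illustrative sketch, not library code) that computes the log-determinant of a triangular Jacobian two ways: once with a general-purpose $O(d^3)$ factorization, and once as the $O(d)$ sum of log-absolute diagonal entries that the triangular structure permits.

```python
import torch

torch.manual_seed(0)
d = 512

# A lower-triangular Jacobian, as an autoregressive transformation would produce.
# The diagonal is kept away from zero so the log-determinant is well defined.
J = torch.tril(torch.randn(d, d, dtype=torch.float64))
J.diagonal().copy_(torch.rand(d, dtype=torch.float64) + 0.5)

# Generic route: O(d^3) factorization of the full matrix
sign, logdet_full = torch.linalg.slogdet(J)

# Structured route: for a triangular matrix, det = product of the diagonal,
# so the log-determinant is an O(d) sum
logdet_diag = torch.log(torch.abs(J.diagonal())).sum()

print(torch.allclose(logdet_full, logdet_diag))  # True
```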
Log-Likelihood in Practice:
For numerical stability and convenience, we work with log-densities. Taking the logarithm of the change of variables formula:
$$\log p_X(\mathbf{x}) = \log p_Z(\mathbf{z}) - \log \left| \det J_f(\mathbf{z}) \right|$$
For a standard Gaussian base distribution $p_Z(\mathbf{z}) = \mathcal{N}(\mathbf{z}; \mathbf{0}, \mathbf{I})$:
$$\log p_Z(\mathbf{z}) = -\frac{d}{2} \log(2\pi) - \frac{1}{2} \|\mathbf{z}\|^2$$
Thus the complete log-likelihood becomes:
$$\log p_X(\mathbf{x}) = -\frac{d}{2} \log(2\pi) - \frac{1}{2} \|f^{-1}(\mathbf{x})\|^2 - \log \left| \det J_f(f^{-1}(\mathbf{x})) \right|$$
For a composition of $K$ transformations:
$$\log p_X(\mathbf{x}) = \log p_Z(\mathbf{z}_0) - \sum_{k=1}^{K} \log \left| \det J_{f_k}(\mathbf{z}_{k-1}) \right|$$
where $\mathbf{z}_0 = \mathbf{z}$, $\mathbf{z}_k = f_k(\mathbf{z}_{k-1})$, and $\mathbf{z}_K = \mathbf{x}$.
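The additivity of the per-layer terms can be verified directly with automatic differentiation. The sketch below (the functions `f1` and `f2` are made up for illustration) composes an invertible linear layer with an element-wise monotonic map and checks that the sum of the cheap per-layer log-determinants matches the log-determinant of the end-to-end Jacobian.

```python
import torch
from torch.autograd.functional import jacobian

torch.manual_seed(0)
d = 3

A = torch.randn(d, d) + 2.0 * torch.eye(d)   # an invertible matrix (with probability 1)
b = torch.randn(d)

def f1(z):
    # Invertible linear layer; its Jacobian is A everywhere
    return A @ z + b

def f2(u):
    # Element-wise strictly increasing map; diagonal Jacobian with entries 1 + 0.5*cos(u) > 0
    return u + 0.5 * torch.sin(u)

z0 = torch.randn(d)
z1 = f1(z0)

# Per-layer log|det| terms, each cheap to compute from the layer's structure
logdet_1 = torch.linalg.slogdet(A)[1]
logdet_2 = torch.log(1.0 + 0.5 * torch.cos(z1)).sum()

# End-to-end Jacobian of the composition f2(f1(z)), computed by brute force with autograd
J_total = jacobian(lambda z: f2(f1(z)), z0)
logdet_total = torch.linalg.slogdet(J_total)[1]

print(torch.allclose(logdet_1 + logdet_2, logdet_total, atol=1e-5))  # True
```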
| Transformation Type | Jacobian Structure | Determinant Complexity | Example |
|---|---|---|---|
| General nonlinear | Dense matrix | $O(d^3)$ | MLP, arbitrary neural net |
| Diagonal scaling | Diagonal matrix | $O(d)$ — product of diagonal | $x_i = \sigma_i \cdot z_i$ |
| Triangular (autoregressive) | Triangular matrix | $O(d)$ — product of diagonal | Autoregressive flows |
| Coupling layers | Block triangular | $O(d)$ — product of one block's diagonal | RealNVP, NICE |
| Orthogonal/Rotation | Orthogonal matrix | $O(d)$ — always ±1 | Rotation matrices |
| Permutation | Permutation matrix | $O(1)$ — always ±1 | Shuffling dimensions |
Designing effective normalizing flow layers requires satisfying three fundamental requirements simultaneously. Each requirement constrains the design space, and the interplay between them drives the architectural innovations in the field.
Requirement 1: Invertibility
Every flow layer $f$ must be bijective (one-to-one and onto). This ensures that every data point $\mathbf{x}$ maps to exactly one latent code $\mathbf{z} = f^{-1}(\mathbf{x})$, so density evaluation and latent inference are exact, and that every latent sample maps back to a valid data point, so sampling is exact. Bijectivity is precisely what the change of variables formula requires.
This is a non-trivial constraint. Many common neural network operations—like element-wise ReLU or max pooling—are not invertible. Flow architectures must use operations that preserve invertibility.
Requirement 2: Efficiently Computable Jacobian Determinant
The log-determinant of the Jacobian must be computable in $O(d)$ or at worst $O(d \log d)$ time, not $O(d^3)$. This is achieved through transformations with structured Jacobians: diagonal (element-wise maps), triangular (autoregressive transformations), block triangular (coupling layers), or orthogonal and permutation matrices, as summarized in the table above.
Requirement 3: Sufficient Expressiveness
The composition of flow layers must be expressive enough to transform a simple Gaussian into complex, multi-modal data distributions. This creates a tension with the previous requirements: the most expressive transformations tend to have dense Jacobians or lack cheap inverses, while the transformations that are easiest to invert and whose determinants are cheapest to compute tend to be the most restricted.
The field's major innovations resolve this tension through clever architectural designs that achieve expressiveness while maintaining tractability.
The Expressiveness-Tractability Trade-off:
Consider the spectrum of transformations:

- Linear transformations: $f(\mathbf{z}) = \mathbf{A}\mathbf{z} + \mathbf{b}$. Invertible whenever $\mathbf{A}$ is, but a linear map of a Gaussian is still a Gaussian, so expressiveness is minimal.
- Element-wise nonlinearities: $f_i(\mathbf{z}) = g(z_i)$ where $g$ is monotonic and differentiable. The Jacobian is diagonal, so the determinant is trivial, but the dimensions never interact.
- Autoregressive transformations: $x_i = g(z_i; \theta_i(\mathbf{z}_{<i}))$. The Jacobian is triangular and the layer is highly expressive, but computation is sequential in one direction.
- Coupling layers: Split dimensions, transform one half based on the other. The Jacobian is block triangular, and both directions are fast, at some cost in per-layer expressiveness.
```python
import numpy as np
import torch
import torch.nn as nn
from abc import ABC, abstractmethod


class FlowLayer(ABC, nn.Module):
    """
    Abstract base class for normalizing flow layers.

    Every flow layer must implement:
    1. forward: z -> x with log det Jacobian
    2. inverse: x -> z with log det Jacobian

    The log det Jacobians must be efficiently computable.
    """

    @abstractmethod
    def forward(self, z):
        """
        Forward transformation: z -> x

        Args:
            z: Input tensor [batch_size, dim]

        Returns:
            x: Transformed tensor [batch_size, dim]
            log_det: Log determinant of Jacobian [batch_size]
        """
        pass

    @abstractmethod
    def inverse(self, x):
        """
        Inverse transformation: x -> z

        Args:
            x: Input tensor [batch_size, dim]

        Returns:
            z: Inverse-transformed tensor [batch_size, dim]
            log_det: Log determinant of inverse Jacobian [batch_size]
        """
        pass


class ActNorm(FlowLayer):
    """
    Activation Normalization layer.

    Learnable per-channel scale and bias, initialized data-dependently so the
    output has zero mean and unit variance after the first batch. This is a
    simple but important flow layer that helps with training stability and
    acts as a learnable linear transformation.

    Jacobian: Diagonal matrix with entries = scale
    Log det = sum of log absolute scales
    Complexity: O(d)
    """

    def __init__(self, num_features):
        super().__init__()
        self.num_features = num_features

        # Learnable scale and bias
        self.scale = nn.Parameter(torch.ones(1, num_features))
        self.bias = nn.Parameter(torch.zeros(1, num_features))

        # Track initialization
        self.register_buffer('initialized', torch.tensor(False))

    def initialize(self, x):
        """Data-dependent initialization."""
        with torch.no_grad():
            # Compute per-channel statistics
            mean = x.mean(dim=0, keepdim=True)
            std = x.std(dim=0, keepdim=True) + 1e-6

            # Initialize so that scale * x + bias has zero mean and unit variance
            self.scale.data = 1.0 / std
            self.bias.data = -mean / std

            self.initialized.fill_(True)

    def forward(self, z):
        if not self.initialized:
            self.initialize(z)

        # Affine transformation: x = scale * z + bias
        x = self.scale * z + self.bias

        # Log det Jacobian = sum of log|scale|, identical for every sample in the batch
        log_det = torch.sum(torch.log(torch.abs(self.scale))) * torch.ones(z.shape[0], device=z.device)
        return x, log_det

    def inverse(self, x):
        # Inverse: z = (x - bias) / scale
        z = (x - self.bias) / self.scale

        # Log det of inverse = negative of forward
        log_det = -torch.sum(torch.log(torch.abs(self.scale))) * torch.ones(x.shape[0], device=x.device)
        return z, log_det


class InvertibleLeakyReLU(FlowLayer):
    """
    Invertible Leaky ReLU activation.

    f(z) = z if z >= 0 else alpha * z

    This is invertible when alpha > 0 (different from standard ReLU, where alpha = 0).

    Jacobian: Diagonal with entries 1 (if z >= 0) or alpha (if z < 0)
    Log det = sum of log(1) or log(alpha) for each element
    Complexity: O(d)
    """

    def __init__(self, alpha=0.2):
        super().__init__()
        assert alpha > 0, "Alpha must be positive for invertibility"
        self.alpha = alpha
        self.log_alpha = np.log(alpha)

    def forward(self, z):
        x = torch.where(z >= 0, z, self.alpha * z)

        # Log det: count negative elements and multiply by log(alpha)
        num_negative = (z < 0).float().sum(dim=1)
        log_det = num_negative * self.log_alpha
        return x, log_det

    def inverse(self, x):
        z = torch.where(x >= 0, x, x / self.alpha)

        # Log det of inverse is negative of forward
        num_negative = (x < 0).float().sum(dim=1)
        log_det = -num_negative * self.log_alpha
        return z, log_det
```

Understanding where normalizing flows fit in the generative modeling landscape illuminates both their strengths and limitations.
Each major family of generative models makes different choices in the likelihood-flexibility-efficiency trade-off space.
Variational Autoencoders (VAEs):
VAEs define an encoder $q_\phi(\mathbf{z}|\mathbf{x})$ and decoder $p_\theta(\mathbf{x}|\mathbf{z})$ connected by a latent space. The true log-likelihood is intractable, so VAEs optimize a variational lower bound (ELBO):
$$\log p(\mathbf{x}) \geq \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] - D_{KL}(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}))$$
The gap between the ELBO and true likelihood depends on how well $q_\phi$ approximates the true posterior.
Generative Adversarial Networks (GANs):
GANs abandon likelihood entirely, instead training a generator to fool a discriminator. This enables generation of sharp, high-quality samples but provides no probability density estimates and can suffer from mode collapse.
Diffusion Models:
Diffusion models define a forward process that gradually adds noise to data, then learn to reverse this process. They achieve state-of-the-art sample quality but require many sampling steps and only provide variational likelihood bounds.
| Property | Normalizing Flows | VAEs | GANs | Diffusion Models |
|---|---|---|---|---|
| Exact likelihood | ✓ Yes | ✗ Lower bound only | ✗ No density | ✗ Lower bound only |
| Exact sampling | ✓ Yes | ✓ Yes | ✓ Yes | ✓ Yes (but slow) |
| Latent representation | ✓ Exact inference | ~ Approximate | ✗ Not meaningful | ~ Implicit |
| Training stability | ✓ Stable (MLE) | ✓ Stable | ✗ Can be unstable | ✓ Stable |
| Sample quality | ~ Good | ~ Blurry on images | ✓ Sharp | ✓ Excellent |
| Sampling speed | ✓ Fast (one forward pass) | ✓ Fast | ✓ Fast | ✗ Slow (many steps) |
| Architectural constraints | ✗ Strict (invertibility) | ~ Moderate | ✓ Flexible | ✓ Flexible |
| Mode coverage | ✓ Good | ✓ Good | ~ Mode collapse risk | ✓ Excellent |
When to Choose Normalizing Flows:
Normalizing flows are particularly well-suited for applications where:

- Exact likelihood is required: density estimation, anomaly detection via likelihood, probabilistic inference
- Exact latent inference is valuable: learning meaningful latent representations where we need the exact $\mathbf{z}$ for a given $\mathbf{x}$
- Fast sampling is needed: unlike diffusion models that require hundreds of denoising steps, flows sample in a single forward pass
- Variational inference: flows can define flexible variational posteriors that go beyond simple Gaussians, tightening ELBO bounds
- Hybrid models: flows can enhance other models, e.g., using flows in the VAE prior or posterior, or as components in more complex probabilistic programs
The key limitation of normalizing flows is the invertibility requirement. This means you cannot use arbitrary neural network architectures—only carefully designed invertible ones. This constrains model capacity and has historically limited flows' performance on complex image generation compared to GANs and diffusion models. However, advances in flow architecture design continue to close this gap.
The development of normalizing flows is a story of progressively solving the tractability-expressiveness trade-off through architectural innovation. Understanding this history provides insight into why modern flow architectures are designed the way they are.
Early Foundations (Pre-2014):
The change of variables formula has been a staple of probability theory for centuries. Early applications in machine learning used simple parametric transformations—linear transformations, Box-Cox transforms—but these lacked the flexibility to model complex distributions.
NICE (2014): Non-linear Independent Components Estimation
Dinh, Krueger, and Bengio introduced NICE, the first modern normalizing flow architecture. NICE used additive coupling layers that split the input and transformed one half based on the other:
$$\mathbf{y}_{1:d/2} = \mathbf{x}_{1:d/2}$$
$$\mathbf{y}_{d/2+1:d} = \mathbf{x}_{d/2+1:d} + m(\mathbf{x}_{1:d/2})$$
where $m$ is an arbitrary neural network. The Jacobian is triangular with ones on the diagonal, so $\det(J) = 1$. This was easy to compute but somewhat limited in expressiveness.
RealNVP (2016): Real-valued Non-Volume Preserving
Building on NICE, Dinh et al. introduced affine coupling layers that add learnable scaling:
$$\mathbf{y}_{d/2+1:d} = \mathbf{x}_{d/2+1:d} \odot \exp(s(\mathbf{x}_{1:d/2})) + t(\mathbf{x}_{1:d/2})$$
The diagonal Jacobian entries are now $\exp(s_i)$, giving a non-trivial but still tractable determinant. This was a major expressiveness improvement.
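As a concrete illustration of the affine coupling idea, here is a minimal RealNVP-style coupling layer. This is a hedged sketch for exposition rather than the original implementation; the class name `AffineCoupling` and the small conditioner network are choices made for this example.

```python
import torch
import torch.nn as nn


class AffineCoupling(nn.Module):
    """Minimal RealNVP-style affine coupling layer (illustrative sketch).

    The first half of the dimensions passes through unchanged; the second half
    is scaled and shifted by functions of the first half. The Jacobian is block
    triangular, so log|det| is simply the sum of the scales.
    """

    def __init__(self, dim, hidden=64):
        super().__init__()
        self.d = dim // 2
        # Small network producing log-scale s and shift t from the first half
        self.net = nn.Sequential(
            nn.Linear(self.d, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.d)),
        )

    def _s_t(self, x1):
        s, t = self.net(x1).chunk(2, dim=-1)
        s = torch.tanh(s)  # keep scales bounded for numerical stability
        return s, t

    def forward(self, z):
        z1, z2 = z[:, :self.d], z[:, self.d:]
        s, t = self._s_t(z1)
        x2 = z2 * torch.exp(s) + t           # y_{d/2+1:d} = x_{d/2+1:d} * exp(s) + t
        log_det = s.sum(dim=1)               # log|det J| = sum_i s_i
        return torch.cat([z1, x2], dim=1), log_det

    def inverse(self, x):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        s, t = self._s_t(x1)
        z2 = (x2 - t) * torch.exp(-s)        # exact inverse; the conditioner net is never inverted
        log_det = -s.sum(dim=1)
        return torch.cat([x1, z2], dim=1), log_det


# Quick round-trip check
layer = AffineCoupling(dim=6)
z = torch.randn(4, 6)
x, ld = layer.forward(z)
z_rec, ld_inv = layer.inverse(x)
print(torch.allclose(z, z_rec, atol=1e-6), torch.allclose(ld, -ld_inv))
```

Note that the inverse never has to invert the conditioner network itself, which is why $s$ and $t$ can be arbitrarily complex functions without breaking invertibility.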
Glow (2018): Generative Flow with Invertible 1×1 Convolutions
Kingma and Dhariwal introduced Glow, which achieved state-of-the-art image generation for flows by adding activation normalization (ActNorm) layers and invertible 1×1 convolutions, which act as learned, generalized permutations between affine coupling layers within a multi-scale architecture.
Glow demonstrated that flows could generate high-resolution faces, approaching GAN quality.
Autoregressive Flows (2016-2017):
Simultaneously, researchers developed autoregressive flows like MAF (Masked Autoregressive Flow) and IAF (Inverse Autoregressive Flow):
$$x_i = z_i \cdot \sigma(\mathbf{x}_{<i}) + \mu(\mathbf{x}_{<i})$$
These leverage autoregressive structure for triangular Jacobians. MAF has fast density evaluation but slow sampling; IAF has fast sampling but slow density evaluation.
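The asymmetry between the two directions is easy to see in code. The toy sketch below is an assumption-laden illustration: for brevity the conditioner looks only at the immediately preceding dimension, whereas real MAF/IAF layers use masked networks over all of $\mathbf{x}_{<i}$. It shows that the density direction is a single vectorized pass while the sampling direction requires a sequential loop over dimensions.

```python
import torch

torch.manual_seed(0)
d = 5
# Toy causal conditioner: mu_i and log_sigma_i depend only on x_{i-1}
w_mu, w_s = torch.randn(d), 0.1 * torch.randn(d)

def conditioner(prev):
    # prev holds x_{i-1} for each position i (zero for i = 0)
    return w_mu * prev, w_s * prev          # mu, log_sigma

# Density direction (x -> z): every z_i is available from x in one vectorized pass
x = torch.randn(3, d)
prev = torch.cat([torch.zeros(3, 1), x[:, :-1]], dim=1)
mu, log_sigma = conditioner(prev)
z = (x - mu) * torch.exp(-log_sigma)        # fast: fully parallel across dimensions

# Sampling direction (z -> x): x_i needs x_{i-1}, so the loop is inherently sequential
z_sample = torch.randn(3, d)
x_gen = torch.zeros(3, d)
for i in range(d):
    prev_i = x_gen[:, i - 1] if i > 0 else torch.zeros(3)
    mu_i, log_sigma_i = w_mu[i] * prev_i, w_s[i] * prev_i
    x_gen[:, i] = z_sample[:, i] * torch.exp(log_sigma_i) + mu_i

# Round-trip check: transforming the generated samples back recovers z_sample
prev_gen = torch.cat([torch.zeros(3, 1), x_gen[:, :-1]], dim=1)
mu_g, ls_g = conditioner(prev_gen)
print(torch.allclose((x_gen - mu_g) * torch.exp(-ls_g), z_sample, atol=1e-6))  # True
```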
Continuous Normalizing Flows (2018):
Chen et al. introduced Neural ODEs, treating the transformation as the solution to an ordinary differential equation. This enabled free-form transformations without restrictive architectural constraints on the Jacobian: the log-determinant becomes the integral of a trace, which follow-up work such as FFJORD estimates efficiently with stochastic trace estimators.
Recent Advances (2019-Present):
| Year | Model | Key Innovation | Impact |
|---|---|---|---|
| 2014 | NICE | Additive coupling layers | First modern flow; volume-preserving |
| 2016 | RealNVP | Affine coupling layers | Non-volume-preserving; much more expressive |
| 2016 | MAF | Masked autoregressive structure | Triangular Jacobians; very expressive |
| 2016 | IAF | Inverse autoregressive structure | Fast sampling for VAE posteriors |
| 2018 | Glow | Invertible 1×1 convolutions | State-of-the-art image generation |
| 2018 | FFJORD | Continuous normalizing flows | Free-form transformations via neural ODEs |
| 2019 | Flow++ | Improved dequantization, attention | Better bits-per-dimension on images |
| 2019 | Neural Spline Flows | Monotonic spline transforms | Highly flexible 1D transformations |
| 2020 | Residual Flows | Invertible residual connections | Contractive residual networks |
Notice the progression: from volume-preserving (NICE) to affine (RealNVP) to expressive autoregressive (MAF) to continuous-time (FFJORD). Each step expanded the expressiveness of flows while maintaining tractable likelihood computation. Modern flows combine multiple innovations—coupling layers, autoregressive components, invertible convolutions, and learned permutations—to achieve strong performance across tasks.
Training normalizing flows is remarkably straightforward compared to GANs or diffusion models: we simply maximize the log-likelihood of the data under the flow model. This is direct maximum likelihood estimation, made possible by the tractable density.
The Objective:
Given a dataset $\mathcal{D} = \{\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(N)}\}$, we maximize:
$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \log p_\theta(\mathbf{x}^{(i)})$$
Expanding using the change of variables:
$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left[ \log p_Z(f^{-1}_\theta(\mathbf{x}^{(i)})) + \log \left| \det J_{f^{-1}_\theta}(\mathbf{x}^{(i)}) \right| \right]$$
For a standard Gaussian base distribution:
$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left[ -\frac{d}{2} \log(2\pi) - \frac{1}{2} \|\mathbf{z}^{(i)}\|^2 + \log \left| \det J_{f^{-1}_\theta}(\mathbf{x}^{(i)}) \right| \right]$$
where $\mathbf{z}^{(i)} = f^{-1}_\theta(\mathbf{x}^{(i)})$.
Understanding the Training Dynamics:
The loss has two interpretable components:
Prior term: $-\frac{1}{2} \|\mathbf{z}\|^2$ encourages the transformed data to lie near the origin in latent space (low Mahalanobis distance from zero under the Gaussian prior)
Volume term: $\log |\det J_{f^{-1}}|$ encourages the model to use the full latent space efficiently, neither over-compressing nor over-expanding
Together, these ensure the model maps data to regions of high prior probability without extreme volume distortion.
Bits Per Dimension (BPD):
For image modeling, we report performance in bits per dimension, which measures how many bits are needed on average to encode each pixel:
$$\text{BPD} = \frac{-\log_2 p(\mathbf{x})}{d} = \frac{-\log p(\mathbf{x})}{d \cdot \log 2}$$
Lower is better. For reference, a uniform distribution over 8-bit pixel values corresponds to 8 BPD, while state-of-the-art flow models reach roughly 3 BPD on CIFAR-10.
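The conversion is a pure change of units from nats to bits, as the hypothetical helper below illustrates (the numbers are made up for the example).

```python
import math

def bits_per_dimension(nll_nats, dim):
    """Convert an average negative log-likelihood (in nats per example)
    into bits per dimension."""
    return nll_nats / (dim * math.log(2.0))

# Example: an average NLL of 6600 nats on 32x32x3 images
print(bits_per_dimension(6600.0, 32 * 32 * 3))  # ~3.10 bits/dim
```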
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import MultivariateNormal


class FlowTrainer:
    """
    Trainer for normalizing flow models.

    Training is straightforward: maximize log-likelihood via gradient descent.
    """

    def __init__(self, flow_model, base_distribution, device='cuda'):
        """
        Args:
            flow_model: The normalizing flow (composition of flow layers)
            base_distribution: The base distribution (e.g., standard Gaussian)
            device: Device to train on
        """
        self.flow = flow_model.to(device)
        self.base_dist = base_distribution
        self.device = device

    def compute_log_likelihood(self, x):
        """
        Compute log p(x) under the flow model.

        log p(x) = log p_z(f^{-1}(x)) + log |det J_{f^{-1}}(x)|

        Args:
            x: Data batch [batch_size, dim]

        Returns:
            log_prob: Log probability [batch_size]
            z: Latent representations [batch_size, dim]
        """
        # Transform data to latent space
        z, log_det_inverse = self.flow.inverse(x)

        # Log probability under base distribution
        log_pz = self.base_dist.log_prob(z)

        # Total log probability (exact!)
        log_px = log_pz + log_det_inverse
        return log_px, z

    def compute_loss(self, x):
        """
        Compute negative log-likelihood loss.

        Args:
            x: Data batch [batch_size, dim]

        Returns:
            loss: Scalar loss (negative mean log-likelihood)
            metrics: Dict with additional metrics
        """
        log_px, z = self.compute_log_likelihood(x)

        # Negative log-likelihood
        nll = -log_px.mean()

        # Compute bits per dimension for interpretability
        dim = x.shape[1]
        bpd = nll / (dim * torch.log(torch.tensor(2.0)))

        metrics = {
            'nll': nll.item(),
            'bpd': bpd.item(),
            'z_norm': z.norm(dim=1).mean().item(),
        }
        return nll, metrics

    def train_epoch(self, dataloader, optimizer):
        """
        Train for one epoch.

        Args:
            dataloader: DataLoader yielding batches
            optimizer: Optimizer

        Returns:
            epoch_metrics: Dict with averaged metrics
        """
        self.flow.train()
        total_nll = 0.0
        total_bpd = 0.0
        num_batches = 0

        for batch in dataloader:
            x = batch[0].to(self.device)

            # Dequantization: add uniform noise to discrete data
            # This converts discrete pixel values to continuous values
            x = x + torch.rand_like(x) / 256.0

            optimizer.zero_grad()
            loss, metrics = self.compute_loss(x)
            loss.backward()

            # Gradient clipping for stability
            torch.nn.utils.clip_grad_norm_(self.flow.parameters(), max_norm=1.0)

            optimizer.step()

            total_nll += metrics['nll']
            total_bpd += metrics['bpd']
            num_batches += 1

        return {
            'nll': total_nll / num_batches,
            'bpd': total_bpd / num_batches,
        }

    @torch.no_grad()
    def evaluate(self, dataloader):
        """
        Evaluate on a dataset.

        Args:
            dataloader: DataLoader for evaluation

        Returns:
            metrics: Dict with evaluation metrics
        """
        self.flow.eval()
        total_nll = 0.0
        total_bpd = 0.0
        num_samples = 0

        for batch in dataloader:
            x = batch[0].to(self.device)
            x = x + torch.rand_like(x) / 256.0  # Dequantization

            log_px, _ = self.compute_log_likelihood(x)

            batch_size = x.shape[0]
            total_nll += -log_px.sum().item()
            total_bpd += (-log_px.sum() / (x.shape[1] * torch.log(torch.tensor(2.0)))).item()
            num_samples += batch_size

        return {
            'nll': total_nll / num_samples,
            'bpd': total_bpd / num_samples,
        }

    @torch.no_grad()
    def sample(self, num_samples, temperature=1.0):
        """
        Generate samples from the flow model.

        Args:
            num_samples: Number of samples to generate
            temperature: Sampling temperature (scales std of base dist)

        Returns:
            samples: Generated samples [num_samples, dim]
        """
        self.flow.eval()

        # Sample from base distribution with temperature scaling
        z = self.base_dist.sample((num_samples,)) * temperature
        z = z.to(self.device)

        # Transform through flow
        x, _ = self.flow.forward(z)
        return x
```

For discrete data like images (with integer pixel values), we must add continuous noise before training—a process called dequantization. Without this, the flow would assign infinite density to the discrete values and zero elsewhere, leading to degenerate training. The standard approach adds uniform noise: $\tilde{x} = x + u$ where $u \sim \text{Uniform}(0, 1/256)$ for 8-bit images.
We have established the foundational principles of normalizing flows, a powerful and mathematically elegant family of generative models. Let us consolidate the key insights before diving deeper into specific architectural innovations.
What's Next:
In the following pages, we will dive deep into the specific mechanisms that make modern normalizing flows work:
Each page will build on the foundations established here, developing both the mathematical understanding and practical implementation skills needed to work with normalizing flows in real applications.
You now understand the core principles that define normalizing flows: invertible transformations, tractable Jacobian determinants, and the change of variables formula. This conceptual framework will guide our exploration of specific flow architectures in the pages ahead.