Standard variational inference faces a fundamental limitation: the expressiveness of the variational family directly bounds the quality of the posterior approximation. When we use simple distributions like factorized Gaussians, we inherently accept that our approximation cannot capture complex dependencies, multimodality, or heavy tails present in the true posterior.
Normalizing flows offer an elegant solution to this expressiveness problem. By applying a sequence of invertible transformations to a simple base distribution, flows can represent arbitrarily complex distributions while maintaining tractable density evaluation and sampling. This represents one of the most important advances in variational inference of the past decade.
By the end of this page, you will understand how normalizing flows work mathematically, master the change of variables formula and its computational implications, learn about different flow architectures (planar, radial, autoregressive, coupling), and understand how flows integrate with variational inference to create powerful approximate posteriors.
Before diving into normalizing flows, let's understand precisely why we need them. In standard variational inference, we approximate an intractable posterior p(z|x) with a tractable distribution q(z) from a variational family Q. The quality of our approximation is limited by the expressiveness of Q.
The Fundamental Tradeoff:
This creates a dilemma: we need distributions that are both expressive enough to capture complex posteriors and tractable enough for practical optimization.
| Variational Family | Expressiveness | Key Limitation | Approximation Gap |
|---|---|---|---|
| Factorized Gaussian | Low | Cannot capture correlations | Underestimates variance, misses dependencies |
| Full-covariance Gaussian | Medium | O(d²) parameters, unimodal only | Cannot represent multimodality or heavy tails |
| Mixture of Gaussians | High | Discrete optimization, mode collapse | Training instability, expensive density evaluation |
| Normalizing Flows | Arbitrary | Computational cost of transformations | Controllable via flow depth |
Mathematical Formalization of the Gap:
The optimal variational distribution minimizes the KL divergence:
$$q^*(z) = \arg\min_{q \in \mathcal{Q}} D_{KL}(q(z) \,\|\, p(z|x))$$
Even at optimality, we have a residual approximation gap:
$$\text{Gap} = D_{KL}(q^*(z) \,\|\, p(z|x)) \geq 0$$
This gap is zero if and only if the true posterior lies within our variational family. For most real problems, simple families exclude the true posterior entirely, guaranteeing a non-zero gap regardless of optimization quality.
Normalizing flows address this by making our variational family rich enough to include (or closely approximate) virtually any smooth posterior.
Mean-field approximations can dramatically underestimate posterior variance, especially in high dimensions. For a posterior with correlation ρ between variables, mean-field effectively assumes ρ = 0, leading to overconfident predictions. In Bayesian neural networks, this can cause catastrophic miscalibration.
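To make this concrete, the short sketch below uses the closed-form result for reverse-KL mean-field approximations of a multivariate Gaussian: the optimal factorized Gaussian matches the precision diagonal, so each marginal variance is 1/Λᵢᵢ rather than Σᵢᵢ. For a 2D posterior with correlation ρ = 0.9, mean-field recovers only 19% of the true marginal variance.

```python
import torch

# Mean-field (reverse-KL) approximation of a correlated 2D Gaussian.
# For p = N(0, Sigma), the optimal factorized Gaussian has variance
# 1 / Lambda_ii where Lambda = Sigma^{-1} -- strictly smaller than the
# true marginal variance whenever correlations are present.
rho = 0.9
Sigma = torch.tensor([[1.0, rho], [rho, 1.0]])
Lambda = torch.linalg.inv(Sigma)

true_marginal_var = torch.diag(Sigma)      # [1.0, 1.0]
mean_field_var = 1.0 / torch.diag(Lambda)  # [1 - rho^2, 1 - rho^2] = [0.19, 0.19]

print(f"true marginal variances: {true_marginal_var.tolist()}")
print(f"mean-field variances:    {mean_field_var.tolist()}")
```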
The mathematical foundation of normalizing flows is the change of variables formula from probability theory. This fundamental result tells us how probability densities transform under invertible mappings.
The Core Result:
Let z₀ be a random variable with density q₀(z₀), and let f: ℝᵈ → ℝᵈ be an invertible, differentiable function. Define z = f(z₀). Then the density of z is:
$$q(z) = q_0(f^{-1}(z)) \left| \det \frac{\partial f^{-1}}{\partial z} \right| = q_0(z_0) \left| \det \frac{\partial f}{\partial z_0} \right|^{-1}$$
The term |det ∂f/∂z₀| is the absolute value of the Jacobian determinant, measuring how f locally stretches or compresses volume.
Think of probability as a fluid. When we apply a transformation f, regions of space expand or contract. If f doubles the volume in some region, the density must halve to preserve total probability mass. The Jacobian determinant quantifies this volume change precisely.
Log-Density Form (Computationally Preferred):
For numerical stability and gradient computation, we work with log-densities:
$$\log q(z) = \log q_0(z_0) - \log \left| \det \frac{\partial f}{\partial z_0} \right|$$
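As a quick sanity check of this formula, the sketch below pushes a standard normal through the affine map f(z₀) = a·z₀ + b (a hypothetical toy transform) and confirms that the log-density obtained from the change of variables matches the exact density of the resulting Gaussian N(b, a²).

```python
import torch
from torch.distributions import Normal

# Numerical check of the log-density form of the change of variables:
# for the affine map f(z0) = a * z0 + b, log q(z) = log q0(z0) - log|a|.
a, b = 2.0, 1.0
base = Normal(0.0, 1.0)

z0 = torch.tensor(0.3)
z = a * z0 + b  # transformed sample
log_q = base.log_prob(z0) - torch.log(torch.tensor(abs(a)))

# Compare against the exact density of N(b, a^2), which is what f produces.
log_q_exact = Normal(b, abs(a)).log_prob(z)
print(torch.allclose(log_q, log_q_exact))  # True
```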
Composition of Flows:
The power of flows comes from composing multiple simple transformations. If we apply a sequence f = fₖ ∘ fₖ₋₁ ∘ ... ∘ f₁, the log-density becomes:
$$\log q_K(z_K) = \log q_0(z_0) - \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right|$$
Each simple transformation adds expressiveness, and by stacking many layers, we can represent arbitrarily complex distributions.
```python
import torch
import torch.nn as nn


class NormalizingFlow(nn.Module):
    """
    Base class for normalizing flows.
    A flow transforms a simple base distribution into a complex target distribution.
    """
    def __init__(self, base_distribution, transforms):
        super().__init__()
        self.base_dist = base_distribution
        self.transforms = nn.ModuleList(transforms)

    def forward(self, z0):
        """
        Transform samples from the base distribution.
        Returns: (zK, log_det_jacobian)
        """
        z = z0
        log_det_sum = 0.0
        for transform in self.transforms:
            z, log_det = transform(z)
            log_det_sum += log_det
        return z, log_det_sum

    def log_prob(self, zK):
        """
        Compute log probability of samples under the flow.
        Requires inverting all transformations.
        """
        z = zK
        log_det_sum = 0.0
        # Apply inverse transforms in reverse order
        for transform in reversed(self.transforms):
            z, log_det = transform.inverse(z)
            log_det_sum += log_det
        # z is now z0, compute the base density
        log_prob_base = self.base_dist.log_prob(z).sum(-1)
        # Each inverse returns its own log-determinant, so the terms are added
        return log_prob_base + log_det_sum

    def sample(self, num_samples):
        """Sample from the flow distribution."""
        z0 = self.base_dist.sample((num_samples,))
        zK, _ = self.forward(z0)
        return zK
```

The earliest normalizing flows introduced in the VI literature are planar and radial flows. While relatively simple, they illustrate the core concepts and remain useful for understanding flow design.
Planar Flows:
A planar flow applies a transformation along a hyperplane:
$$f(z) = z + u \cdot h(w^T z + b)$$
where u, w ∈ ℝᵈ, b ∈ ℝ, and h is a smooth activation function (typically tanh).
The Jacobian determinant has a simple form:
$$\det \frac{\partial f}{\partial z} = 1 + u^T \psi(z), \quad \text{where } \psi(z) = h'(w^T z + b) \cdot w$$
Invertibility Constraint: For the transformation to be invertible, we require 1 + uᵀψ(z) > 0 for all z. This is enforced through a reparameterization of u.
Radial Flows:
Radial flows apply transformations that are radially symmetric around a reference point:
$$f(z) = z + \beta h(\alpha, r)(z - z_0), \quad r = \|z - z_0\|$$
where z₀ is the center, α > 0 controls the smoothness, and β controls the magnitude. A common choice is:
$$h(\alpha, r) = \frac{1}{\alpha + r}$$
Radial flows can either contract or expand regions of space around z₀, depending on the sign of β.
Computational Complexity:
Both planar and radial flows have O(d) complexity for the Jacobian determinant, making them efficient for high-dimensional problems. However, their expressiveness is limited—each transformation can only "bend" the density in one direction (planar) or around one point (radial).
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PlanarFlow(nn.Module):
    """
    Planar flow: f(z) = z + u * h(w^T z + b)
    Efficient O(d) Jacobian determinant computation.
    """
    def __init__(self, dim):
        super().__init__()
        self.dim = dim
        self.w = nn.Parameter(torch.randn(dim))
        self.u = nn.Parameter(torch.randn(dim))
        self.b = nn.Parameter(torch.zeros(1))

    def _get_u_hat(self):
        """Reparameterize u to ensure invertibility: 1 + u^T psi(z) > 0."""
        wTu = torch.dot(self.w, self.u)
        m_wTu = -1 + F.softplus(wTu)
        u_hat = self.u + (m_wTu - wTu) * self.w / (self.w.norm() ** 2 + 1e-8)
        return u_hat

    def forward(self, z):
        """Apply the planar transformation."""
        u_hat = self._get_u_hat()
        # Compute w^T z + b
        linear = F.linear(z, self.w.unsqueeze(0), self.b)
        # Apply transformation: z + u * tanh(w^T z + b)
        h = torch.tanh(linear)
        z_new = z + u_hat * h
        # Compute log |det Jacobian|
        # det = 1 + u^T * h'(w^T z + b) * w = 1 + u^T * psi(z)
        h_prime = 1 - h ** 2  # derivative of tanh
        psi = h_prime * self.w
        log_det = torch.log(torch.abs(1 + psi @ u_hat) + 1e-8)
        return z_new, log_det


class RadialFlow(nn.Module):
    """
    Radial flow: f(z) = z + beta * h(alpha, r) * (z - z0)
    Contracts or expands around a center point.
    """
    def __init__(self, dim):
        super().__init__()
        self.dim = dim
        self.z0 = nn.Parameter(torch.zeros(dim))
        self.log_alpha = nn.Parameter(torch.zeros(1))
        self.beta = nn.Parameter(torch.zeros(1))

    def forward(self, z):
        """Apply the radial transformation."""
        alpha = F.softplus(self.log_alpha)
        # Compute distance from the center
        diff = z - self.z0
        r = diff.norm(dim=-1, keepdim=True) + 1e-8
        # h(alpha, r) = 1 / (alpha + r)
        h = 1 / (alpha + r)
        h_prime = -1 / (alpha + r) ** 2
        # Transform
        z_new = z + self.beta * h * diff
        # Log determinant (derived from radial symmetry)
        d = self.dim
        log_det = (d - 1) * torch.log(torch.abs(1 + self.beta * h)) + \
                  torch.log(torch.abs(1 + self.beta * h + self.beta * h_prime * r))
        return z_new, log_det.squeeze(-1)
```

While individual planar or radial flows have limited expressiveness, stacking many of them (typically 10-100 layers) can represent highly complex distributions. Universal approximation results for flows show that, with enough layers, smooth densities can be approximated arbitrarily well.
Autoregressive flows represent a major advancement in flow design, achieving both high expressiveness and efficient computation. They exploit the autoregressive structure to ensure the Jacobian is triangular, enabling O(d) determinant computation.
Key Insight:
An autoregressive transformation defines each output dimension as a function of the corresponding input dimension and the input dimensions that precede it:
$$z_i = f_i(z_0^{(1)}, z_0^{(2)}, ..., z_0^{(i)})$$
This constraint makes the Jacobian matrix triangular, and the determinant is simply the product of diagonal elements:
$$\det \frac{\partial z}{\partial z_0} = \prod_{i=1}^{d} \frac{\partial z_i}{\partial z_0^{(i)}}$$
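The following small sketch makes the triangular structure visible: it builds a toy autoregressive map (an illustrative choice, not a trained flow), computes the full Jacobian with torch.autograd.functional.jacobian, and confirms that the determinant equals the product of the diagonal entries.

```python
import torch
from torch.autograd.functional import jacobian

# Sanity check that an autoregressive map has a triangular Jacobian whose
# determinant is the product of its diagonal. The toy map below is an
# illustrative example, not a trained flow:
#   z_1 = 2*z0_1,  z_2 = z0_2 + z0_1**2,  z_3 = 3*z0_3 + z0_1*z0_2
def autoregressive_map(z0):
    return torch.stack([
        2.0 * z0[0],
        z0[1] + z0[0] ** 2,
        3.0 * z0[2] + z0[0] * z0[1],
    ])

z0 = torch.tensor([0.5, -1.0, 2.0])
J = jacobian(autoregressive_map, z0)

print(J)                                        # lower-triangular
print(torch.det(J), torch.prod(torch.diag(J)))  # both equal 6.0
```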
Affine Autoregressive Flows (MADE-style):
The most common autoregressive flow uses affine transformations:
$$z_i = \mu_i(z_0^{<i}) + \sigma_i(z_0^{<i}) \cdot z_0^{(i)}$$
where μᵢ and σᵢ are neural networks that take the first i-1 dimensions as input. The log-determinant is:
$$\log |\det J| = \sum_{i=1}^{d} \log |\sigma_i(z_0^{<i})|$$
Inverse Autoregressive Flow (IAF):
IAF keeps this parameterization: the conditioner reads the base-noise dimensions z₀^{<i}, which makes sampling efficient (parallel) but density evaluation sequential:
$$z_i = \mu_i(z_0^{<i}) + \sigma_i(z_0^{<i}) \cdot z_0^{(i)}$$
Masked Autoregressive Flow (MAF):
MAF reverses the dependency direction: the conditioner reads the already-transformed variables z^{<i}, making density evaluation efficient but sampling sequential. In the normalizing (density-evaluation) direction:
$$z_0^{(i)} = \left(z_i - \mu_i(z^{<i})\right) / \sigma_i(z^{<i})$$
| Flow Type | Sampling Complexity | Density Evaluation | Best Use Case |
|---|---|---|---|
| Inverse Autoregressive (IAF) | O(d) parallel | O(d²) sequential | Variational inference (need fast sampling) |
| Masked Autoregressive (MAF) | O(d²) sequential | O(d) parallel | Density estimation (need fast evaluation) |
| Neural Spline Flow | O(d) parallel | O(d) parallel | Both (using coupling layers) |
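For contrast with the MAF implementation that follows, here is a minimal IAF-style layer. It is a sketch, not a standalone module: it reuses the MADE conditioner defined in the next code block, and conditions on the base noise z₀, which is why sampling is a single parallel pass.

```python
import torch
import torch.nn as nn


class IAFLayer(nn.Module):
    """
    Minimal IAF layer (sketch). Relies on the MADE conditioner defined in the
    next code block; the conditioner reads the base noise z0, so the sampling
    direction below is a single parallel pass.
    """
    def __init__(self, dim, hidden_dim=64):
        super().__init__()
        self.made = MADE(dim, [hidden_dim, hidden_dim], output_dim_per_input=2)

    def forward(self, z0):
        # mu_i and log_sigma_i depend only on z0_{<i} thanks to the masking
        params = self.made(z0)
        mu, log_sigma = params[..., 0], params[..., 1]
        # IAF transform: z_i = mu_i(z0_{<i}) + sigma_i(z0_{<i}) * z0_i
        z = mu + torch.exp(log_sigma) * z0
        # log |det dz/dz0| = sum_i log sigma_i
        log_det = log_sigma.sum(dim=-1)
        return z, log_det
```

Evaluating the density of an externally given z under this layer would require recovering z₀ dimension by dimension, which is the sequential pass reflected in the table above.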
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MADE(nn.Module):
    """
    Masked Autoencoder for Distribution Estimation.
    Produces autoregressive outputs where output_i depends only on input_{<i}.
    """
    def __init__(self, input_dim, hidden_dims, output_dim_per_input=2):
        super().__init__()
        self.input_dim = input_dim
        self.output_per_input = output_dim_per_input

        # Build masked linear layers
        dims = [input_dim] + hidden_dims + [input_dim * output_dim_per_input]
        self.layers = nn.ModuleList(
            [nn.Linear(dims[i], dims[i + 1]) for i in range(len(dims) - 1)]
        )
        # Register masks as buffers so they follow the module across devices
        for i, mask in enumerate(self._create_masks(hidden_dims)):
            self.register_buffer(f"mask_{i}", mask)

    def _create_masks(self, hidden_dims):
        """Create masks ensuring the autoregressive property."""
        # Assign each unit a degree (the highest input index it may depend on)
        masks = []
        prev_degrees = torch.arange(self.input_dim)
        for h_dim in hidden_dims:
            # Hidden units may see inputs with degree <= their own degree
            degrees = torch.randint(0, self.input_dim - 1, (h_dim,))
            masks.append((prev_degrees.unsqueeze(1) <= degrees.unsqueeze(0)).float())
            prev_degrees = degrees
        # Output layer: outputs for dimension i may only see degrees < i
        output_degrees = torch.arange(self.input_dim).repeat_interleave(self.output_per_input)
        masks.append((prev_degrees.unsqueeze(1) < output_degrees.unsqueeze(0)).float())
        return masks

    def forward(self, x):
        """Forward pass with masked connections."""
        h = x
        for i, layer in enumerate(self.layers[:-1]):
            mask = getattr(self, f"mask_{i}")
            h = torch.relu(F.linear(h, layer.weight * mask.T, layer.bias))
        mask = getattr(self, f"mask_{len(self.layers) - 1}")
        out = F.linear(h, self.layers[-1].weight * mask.T, self.layers[-1].bias)
        # Reshape to (batch, input_dim, output_per_input)
        return out.view(-1, self.input_dim, self.output_per_input)


class MAF(nn.Module):
    """
    Masked Autoregressive Flow.
    Fast density evaluation, slow sampling.
    """
    def __init__(self, dim, num_layers=4, hidden_dim=64):
        super().__init__()
        self.dim = dim
        self.layers = nn.ModuleList([
            MADE(dim, [hidden_dim, hidden_dim], output_dim_per_input=2)
            for _ in range(num_layers)
        ])

    def forward(self, z):
        """
        Map samples to base noise (the normalizing direction).
        This is MAF's fast, parallel direction; the returned summed
        log-determinant is what density evaluation needs.
        """
        log_det_sum = 0.0
        for made in self.layers:
            # Get autoregressive parameters conditioned on z_{<i}
            params = made(z)
            mu, log_sigma = params[..., 0], params[..., 1]
            # Normalizing transform: z0_i = (z_i - mu_i) / exp(log_sigma_i)
            z = (z - mu) * torch.exp(-log_sigma)
            log_det_sum -= log_sigma.sum(dim=-1)
            # Reverse dimension order so each layer conditions on a different prefix
            z = z.flip(dims=[-1])
        return z, log_det_sum
```

Coupling flows achieve both efficient sampling and efficient density evaluation by splitting dimensions into two groups and transforming one group conditioned on the other.
Affine Coupling Layer:
Split the input z into two parts: z = [z₁, z₂]. The coupling transformation is:
$$z_1' = z_1, \qquad z_2' = z_2 \odot \exp(s(z_1)) + t(z_1)$$
where s and t are arbitrary neural networks (scale and translation). The key insight: because z₁ passes through unchanged, the Jacobian is triangular and the determinant is simply:
$$\log |\det J| = \sum_j s_j(z_1)$$
RealNVP (Real-valued Non-Volume Preserving):
RealNVP introduced affine coupling layers and demonstrated their effectiveness for image modeling. Its key innovations include alternating masking patterns (checkerboard and channel-wise splits) so that every dimension is eventually transformed, a multi-scale architecture that factors out part of the variables at intermediate stages, and batch normalization adapted to the flow setting.
Glow:
Glow improved upon RealNVP with three key innovations: activation normalization (actnorm) layers, invertible 1×1 convolutions that generalize fixed channel permutations, and a streamlined affine coupling design.
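As a sketch of the second of these components, here is a minimal implementation of an invertible 1×1 convolution written for flat latent vectors, where it reduces to an invertible linear flow; for image tensors, Glow applies the same weight matrix across channels at every spatial position and multiplies the log-determinant by the number of spatial locations. The class name `InvertibleLinear` is my own choice, not from a specific library.

```python
import torch
import torch.nn as nn


class InvertibleLinear(nn.Module):
    """
    Sketch of a Glow-style invertible 1x1 convolution, written for flat
    vectors (an invertible linear flow).
    """
    def __init__(self, dim):
        super().__init__()
        # Initialize with a random rotation so the layer starts well-conditioned
        W, _ = torch.linalg.qr(torch.randn(dim, dim))
        self.W = nn.Parameter(W)

    def forward(self, z):
        z_new = z @ self.W.T
        # log |det dz_new/dz| = log |det W|, shared across the batch
        _, logabsdet = torch.linalg.slogdet(self.W)
        return z_new, logabsdet.expand(z.shape[0])

    def inverse(self, z):
        W_inv = torch.linalg.inv(self.W)
        _, logabsdet = torch.linalg.slogdet(W_inv)
        return z @ W_inv.T, logabsdet.expand(z.shape[0])
```

Glow also proposes an LU-decomposed parameterization of W so the log-determinant costs O(d) instead of O(d³); the dense version above is kept for clarity.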
```python
import torch
import torch.nn as nn


class AffineCoupling(nn.Module):
    """
    Affine coupling layer as used in RealNVP/Glow.
    Splits the input, transforming the second half conditioned on the first half.
    """
    def __init__(self, dim, hidden_dim=256):
        super().__init__()
        self.dim = dim
        self.split_dim = dim // 2
        # Conditioner network: maps the first half to scale and translation
        self.conditioner = nn.Sequential(
            nn.Linear(self.split_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2 * (dim - self.split_dim)),  # scale + translation
        )
        # Initialize the final layer to zero (the coupling starts as the identity)
        nn.init.zeros_(self.conditioner[-1].weight)
        nn.init.zeros_(self.conditioner[-1].bias)

    def forward(self, z):
        """Apply the affine coupling transformation."""
        z1, z2 = z[..., :self.split_dim], z[..., self.split_dim:]
        # Get scale and translation from the conditioner
        params = self.conditioner(z1)
        log_s, t = params.chunk(2, dim=-1)
        # Clamp log_s for numerical stability
        log_s = torch.clamp(log_s, -5, 5)
        # Transform: z2' = z2 * exp(s) + t
        z2_new = z2 * torch.exp(log_s) + t
        # Concatenate (z1 unchanged)
        z_new = torch.cat([z1, z2_new], dim=-1)
        # Log determinant is the sum of log scales
        log_det = log_s.sum(dim=-1)
        return z_new, log_det

    def inverse(self, z):
        """Invert the affine coupling."""
        z1, z2 = z[..., :self.split_dim], z[..., self.split_dim:]
        params = self.conditioner(z1)
        log_s, t = params.chunk(2, dim=-1)
        log_s = torch.clamp(log_s, -5, 5)
        # Inverse: z2_orig = (z2 - t) * exp(-s)
        z2_orig = (z2 - t) * torch.exp(-log_s)
        z_orig = torch.cat([z1, z2_orig], dim=-1)
        log_det = -log_s.sum(dim=-1)
        return z_orig, log_det


class RealNVP(nn.Module):
    """Complete RealNVP flow with multiple coupling layers."""
    def __init__(self, dim, num_layers=6, hidden_dim=256):
        super().__init__()
        self.coupling_layers = nn.ModuleList(
            [AffineCoupling(dim, hidden_dim) for _ in range(num_layers)]
        )

    def forward(self, z0):
        z = z0
        log_det_sum = 0.0
        for coupling in self.coupling_layers:
            z, log_det = coupling(z)
            log_det_sum += log_det
            # Flip dimensions so the untouched half is transformed by the next layer
            z = z.flip(dims=[-1])
        return z, log_det_sum
```

Normalizing flows integrate naturally with variational inference, enriching the variational family while maintaining the ELBO optimization framework.
Flow-Enhanced ELBO:
With a flow-based variational distribution, the ELBO becomes:
$$\mathcal{L} = \mathbb{E}_{z_0 \sim q_0}\left[ \log p(x, z_K) - \log q_0(z_0) + \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right| \right]$$
where z_K = f_K ∘ ... ∘ f_1(z_0) is the transformed latent variable.
Gradient Estimation:
The gradient with respect to flow parameters uses the reparameterization trick:
$$\nabla_\phi \mathcal{L} = \mathbb{E}_{\epsilon \sim p(\epsilon)}\left[ \nabla_\phi \left( \log p(x, z_K(\epsilon; \phi)) + \text{log-det terms} \right) \right]$$
Because all flow operations are differentiable, gradients flow back through the entire transformation chain.
Flows are particularly valuable when the true posterior has: (1) strong correlations between variables, (2) multiple modes, (3) heavy tails, or (4) complex geometries. In these cases, even a few flow layers can dramatically reduce the KL divergence compared to Gaussian approximations.
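To show how these pieces fit together in practice, here is a short training-loop sketch. It assumes the FlowVI module implemented in the next code block and uses a hypothetical Gaussian observation model p(x|z) = N(x; z, 1) with matching latent and data dimensions; it is illustrative rather than a recipe.

```python
import torch
from torch.distributions import Normal

# A minimal training-loop sketch (assumes the FlowVI module implemented in the
# next code block). The Gaussian observation model p(x|z) = N(x; z, 1) is a
# hypothetical choice for illustration, with data and latent dimensions equal.
def gaussian_log_likelihood(x, z):
    return Normal(z, 1.0).log_prob(x).sum(dim=-1)

model = FlowVI(latent_dim=2, flow_layers=4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(128, 2)  # toy observations
for step in range(1000):
    optimizer.zero_grad()
    loss = -model.compute_elbo(x, gaussian_log_likelihood, num_samples=4)
    loss.backward()
    optimizer.step()
```

The FlowVI module used above is implemented next.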
```python
import torch
import torch.nn as nn
from torch.distributions import Normal


class FlowVI(nn.Module):
    """
    Variational Inference with Normalizing Flows.
    Uses a flow to transform a simple base distribution into a
    flexible approximate posterior.
    """
    def __init__(self, latent_dim, flow_layers, encoder=None):
        super().__init__()
        self.latent_dim = latent_dim
        # Base distribution: standard Gaussian
        self.base_dist = Normal(torch.zeros(latent_dim), torch.ones(latent_dim))
        # Flow transforms (RealNVP as defined above)
        self.flow = RealNVP(latent_dim, num_layers=flow_layers)
        # Optional: amortized inference encoder
        self.encoder = encoder

    def sample_posterior(self, x, num_samples=1):
        """Sample from the flow-based approximate posterior."""
        batch_size = x.shape[0]
        # Sample from the base distribution
        z0 = self.base_dist.sample((batch_size, num_samples))
        z0 = z0.view(-1, self.latent_dim)
        # Apply the flow
        zK, log_det = self.flow(z0)
        return zK.view(batch_size, num_samples, -1), log_det.view(batch_size, num_samples)

    def compute_elbo(self, x, log_likelihood_fn, num_samples=1):
        """
        Compute the ELBO with a flow-based posterior.

        Args:
            x: Observed data
            log_likelihood_fn: Function computing log p(x|z)
            num_samples: Number of posterior samples for the Monte Carlo estimate
        """
        batch_size = x.shape[0]
        # Sample from the base distribution and transform
        z0 = self.base_dist.sample((batch_size, num_samples))
        z0 = z0.view(-1, self.latent_dim)
        zK, log_det = self.flow(z0)
        zK = zK.view(batch_size, num_samples, -1)
        log_det = log_det.view(batch_size, num_samples)
        # Base distribution log probability
        z0_reshaped = z0.view(batch_size, num_samples, -1)
        log_q0 = self.base_dist.log_prob(z0_reshaped).sum(dim=-1)
        # Log probability under the flow: log q(zK) = log q0(z0) - log_det
        log_qK = log_q0 - log_det
        # Prior: assume a standard Gaussian prior for simplicity
        log_prior = Normal(0.0, 1.0).log_prob(zK).sum(dim=-1)
        # Likelihood
        x_expanded = x.unsqueeze(1).expand(-1, num_samples, -1)
        log_lik = log_likelihood_fn(x_expanded, zK)
        # ELBO = E[log p(x|z) + log p(z) - log q(z)]
        elbo = (log_lik + log_prior - log_qK).mean(dim=1)
        return elbo.mean()
```

We've covered the foundations of normalizing flows and their application to variational inference: the change of variables formula and its log-density form, simple planar and radial flows, autoregressive and coupling architectures with their sampling versus density-evaluation trade-offs, and the flow-enhanced ELBO optimized via the reparameterization trick.
You now understand normalizing flows as a powerful tool for variational inference. Next, we'll explore amortized inference, which enables efficient inference across many data points by learning to map observations directly to approximate posteriors.