The flow architectures we've studied so far—RealNVP, Glow—use discrete sequences of transformations: fixed layers stacked in a predetermined order. Continuous normalizing flows (CNFs) take a fundamentally different approach, modeling the transformation as a continuous evolution through time governed by an ordinary differential equation (ODE).
This perspective, introduced through Neural ODEs (Chen et al., 2018), offers remarkable flexibility: the transformation is no longer constrained to specific layer architectures, and the Jacobian can take any form, since its log-determinant is handled by efficient trace estimation rather than exact computation. CNFs also provide theoretical insights connecting discrete flows to dynamical systems and optimal transport.
In this lesson, you will understand how ODEs define continuous transformations, derive the instantaneous change of variables formula, learn efficient trace estimation for log-determinant computation, and master the FFJORD architecture for practical continuous flows.
From ResNets to ODEs:
A residual network layer computes: $$\mathbf{h}_{t+1} = \mathbf{h}_t + f(\mathbf{h}_t, \theta_t)$$
With small step sizes and many layers, this resembles Euler discretization of an ODE: $$\frac{d\mathbf{h}(t)}{dt} = f(\mathbf{h}(t), t, \theta)$$
Neural ODEs take this limit: instead of discrete layers, define the continuous dynamics $f$ as a neural network and solve the ODE to transform inputs.
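To make the ResNet analogy concrete, here is a minimal sketch (not from any paper or library; the tiny network, dimensionality, and step count are arbitrary choices) that integrates time-independent dynamics with fixed-step Euler updates, where each update has exactly the form of a residual layer:

```python
import torch
import torch.nn as nn

# Illustrative, time-independent dynamics network: f(z) -> dz/dt.
dynamics = nn.Sequential(nn.Linear(3, 32), nn.Tanh(), nn.Linear(32, 3))

def euler_integrate(z0, num_steps=100, t0=0.0, t1=1.0):
    """Fixed-step Euler integration: every update is a (scaled) residual step."""
    z = z0
    dt = (t1 - t0) / num_steps
    for _ in range(num_steps):
        z = z + dt * dynamics(z)  # h_{t+1} = h_t + dt * f(h_t)
    return z

z0 = torch.randn(8, 3)    # a batch of 8 three-dimensional states
z1 = euler_integrate(z0)  # approximation of z(1)
```

An adaptive ODE solver replaces this fixed-step loop with error-controlled steps; that is what torchdiffeq's `odeint` does in the code below.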
The Transformation:
Given initial state $\mathbf{z}_0$ at time $t=0$, the final state at time $t=1$ is: $$\mathbf{z}_1 = \mathbf{z}_0 + \int_0^1 f(\mathbf{z}(t), t, \theta) \, dt$$
This defines an invertible transformation from $\mathbf{z}_0$ to $\mathbf{z}_1$ (and vice versa by integrating backward in time).
```python
import torch
import torch.nn as nn
from torchdiffeq import odeint


class ODEFunc(nn.Module):
    """
    Neural network defining ODE dynamics.
    dz/dt = f(z, t)
    """
    def __init__(self, dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden_dim),  # +1 for time
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, dim)
        )

    def forward(self, t, z):
        # Concatenate time to input
        t_vec = torch.ones(z.shape[0], 1, device=z.device) * t
        z_t = torch.cat([z, t_vec], dim=1)
        return self.net(z_t)


class NeuralODE(nn.Module):
    """
    Neural ODE transformation.
    """
    def __init__(self, dim, hidden_dim=64):
        super().__init__()
        self.func = ODEFunc(dim, hidden_dim)

    def forward(self, z0, t_span=torch.tensor([0., 1.])):
        """
        Integrate from t=0 to t=1.
        """
        solution = odeint(self.func, z0, t_span, method='dopri5')
        return solution[-1]  # Return state at t=1

    def inverse(self, z1, t_span=torch.tensor([1., 0.])):
        """
        Integrate backward from t=1 to t=0.
        """
        solution = odeint(self.func, z1, t_span, method='dopri5')
        return solution[-1]
```

ODE dynamics are inherently invertible—we can run time forward or backward. This gives us invertibility for free, without the architectural constraints of coupling layers. Any neural network can define the dynamics f(z,t).
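As a quick sanity check of invertibility (assuming torchdiffeq is installed and the `NeuralODE` class above is in scope), integrating forward and then backward should recover the input up to solver tolerance:

```python
import torch

# Quick round-trip check with the NeuralODE class above.
model = NeuralODE(dim=2)
z0 = torch.randn(16, 2)

z1 = model(z0)                    # integrate t: 0 -> 1
z0_recovered = model.inverse(z1)  # integrate t: 1 -> 0

# The reconstruction error should be on the order of the solver tolerance.
print((z0 - z0_recovered).abs().max())
```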
For continuous flows, the standard change of variables formula becomes a differential equation for the log-density.
The Key Result:
If $\mathbf{z}(t)$ evolves according to $\frac{d\mathbf{z}}{dt} = f(\mathbf{z}(t), t)$, then the log-density evolves as:
$$\frac{d \log p(\mathbf{z}(t))}{dt} = -\text{tr}\left(\frac{\partial f}{\partial \mathbf{z}}\right)$$
This is the instantaneous change of variables formula, also known as Liouville's equation in physics.
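As a consistency check, consider linear dynamics, chosen purely for illustration because the ODE can be solved in closed form:

$$f(\mathbf{z}) = \mathbf{A}\mathbf{z} \quad\Longrightarrow\quad \mathbf{z}(t) = e^{t\mathbf{A}}\mathbf{z}_0, \qquad \log\left|\det\frac{\partial \mathbf{z}(t)}{\partial \mathbf{z}_0}\right| = \log\left|\det e^{t\mathbf{A}}\right| = t\,\text{tr}(\mathbf{A}).$$

So $\log p(\mathbf{z}(t)) = \log p(\mathbf{z}_0) - t\,\text{tr}(\mathbf{A})$, exactly what integrating $-\text{tr}\left(\frac{\partial f}{\partial \mathbf{z}}\right) = -\text{tr}(\mathbf{A})$ over $[0, t]$ predicts.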
Derivation Sketch:
For an infinitesimal time step $\delta t$, the Euler update $\mathbf{z}(t+\delta t) = \mathbf{z}(t) + \delta t \, f(\mathbf{z}(t), t) + O(\delta t^2)$ has Jacobian $\mathbf{I} + \delta t \frac{\partial f}{\partial \mathbf{z}} + O(\delta t^2)$, whose log-determinant is $\delta t \, \text{tr}\left(\frac{\partial f}{\partial \mathbf{z}}\right) + O(\delta t^2)$. Applying the discrete change of variables gives $\log p(\mathbf{z}(t+\delta t)) = \log p(\mathbf{z}(t)) - \delta t \, \text{tr}\left(\frac{\partial f}{\partial \mathbf{z}}\right) + O(\delta t^2)$; dividing by $\delta t$ and letting $\delta t \to 0$ yields the formula above.
Implications:
Integrating the trace along the trajectory gives the total change in log-density: $$\log p(\mathbf{z}_1) = \log p(\mathbf{z}_0) - \int_0^1 \text{tr}\left(\frac{\partial f(\mathbf{z}(t), t)}{\partial \mathbf{z}}\right) dt$$
The Computational Challenge:
The Jacobian $\frac{\partial f}{\partial \mathbf{z}}$ is a $d \times d$ matrix. Computing its trace directly requires $O(d)$ evaluations of $f$ (via finite differences or autodiff) or $O(d^2)$ memory to store the full Jacobian.
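For small $d$, the brute-force computation can be written out directly; this toy sketch (arbitrary network and dimensionality) exists only to illustrate the cost being avoided:

```python
import torch
import torch.nn as nn

d = 5
net = nn.Sequential(nn.Linear(d, 32), nn.Tanh(), nn.Linear(32, d))
z = torch.randn(d)

# Materialize the full d x d Jacobian: internally this performs one backward
# pass per output dimension, so time and memory both grow with d.
J = torch.autograd.functional.jacobian(net, z)
exact_trace = torch.diagonal(J).sum()
print(exact_trace)
```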
For high-dimensional data, this is prohibitive. FFJORD (Free-Form Jacobian of Reversible Dynamics) solves this via Hutchinson's trace estimator:
$$\text{tr}(\mathbf{A}) = \mathbb{E}_{\boldsymbol{\epsilon}}[\boldsymbol{\epsilon}^T \mathbf{A} \boldsymbol{\epsilon}]$$
where $\boldsymbol{\epsilon}$ is a random vector with $\mathbb{E}[\boldsymbol{\epsilon}] = \mathbf{0}$ and $\text{Cov}(\boldsymbol{\epsilon}) = \mathbf{I}$ (e.g., Gaussian or Rademacher).
We can compute $\boldsymbol{\epsilon}^T \frac{\partial f}{\partial \mathbf{z}}$ as a single vector-Jacobian product (VJP) using reverse-mode autodiff, then take its dot product with $\boldsymbol{\epsilon}$. This costs $O(d)$ per sample, without ever materializing the full Jacobian!
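Before applying this to Jacobians, the identity itself can be checked numerically on an explicit matrix (a toy verification; the matrix size and number of probes are arbitrary):

```python
import torch

torch.manual_seed(0)
A = torch.randn(10, 10)

# Rademacher probe vectors: entries are +1 or -1 with equal probability.
num_probes = 10000
eps = torch.randint(0, 2, (num_probes, 10)).float() * 2 - 1

# One eps^T A eps per probe, then average over probes.
estimates = ((eps @ A) * eps).sum(dim=1)
print(estimates.mean().item(), torch.trace(A).item())  # the two values should be close
```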
```python
import torch


def hutchinson_trace_estimator(f, z, num_samples=1):
    """
    Estimate tr(df/dz) using Hutchinson's estimator:
        tr(A) = E[eps^T A eps], where E[eps] = 0 and Cov(eps) = I.
    We compute eps^T (df/dz) via a vector-Jacobian product without forming df/dz.
    """
    trace_estimate = 0
    for _ in range(num_samples):
        # Random probe vector (Rademacher: +/-1 with equal probability)
        epsilon = torch.randint(0, 2, z.shape, device=z.device).float() * 2 - 1

        # Compute f(z) with gradient tracking enabled
        z = z.requires_grad_(True)
        f_z = f(z)

        # Vector-Jacobian product: eps^T @ (df/dz), via reverse-mode autodiff,
        # without ever materializing the Jacobian
        vjp = torch.autograd.grad(f_z, z, epsilon, create_graph=True)[0]

        # tr(df/dz) ~= eps^T @ (df/dz) @ eps = vjp . eps
        trace_estimate += (epsilon * vjp).sum(dim=1)

    return trace_estimate / num_samples


class CNFFunc(torch.nn.Module):
    """
    Joint dynamics for a continuous normalizing flow.
    Evolves both z and log p(z) simultaneously.
    """
    def __init__(self, dynamics_net):
        super().__init__()
        self.net = dynamics_net

    def forward(self, t, state):
        """
        state = (z, log_p)
        Returns (dz/dt, d(log_p)/dt)
        """
        z, log_p = state

        with torch.enable_grad():
            z = z.requires_grad_(True)
            dz_dt = self.net(t, z)

            # Estimate the trace of the Jacobian df/dz
            trace = hutchinson_trace_estimator(
                lambda z: self.net(t, z), z, num_samples=1
            )

        # d(log p)/dt = -tr(df/dz)
        dlog_p_dt = -trace

        return dz_dt, dlog_p_dt
```

FFJORD (Free-Form Jacobian of Reversible Dynamics) combines neural ODEs with Hutchinson's trace estimator to create continuous normalizing flows with unrestricted dynamics.
Key Components:
Free-form dynamics: Any neural network can define $f(\mathbf{z}, t)$—no coupling layers, no triangular Jacobians
Hutchinson estimator: Unbiased trace estimation in $O(d)$ time per integration step
Adjoint method: Memory-efficient backpropagation through the ODE solver (see the snippet after this list)
Adaptive solvers: Use adaptive ODE solvers (like Dormand-Prince) that adjust step size automatically
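In torchdiffeq, the adjoint method is exposed as `odeint_adjoint`, a drop-in replacement for `odeint` whose dynamics must be an `nn.Module` so its parameters can be tracked. A minimal sketch, reusing the `ODEFunc` class defined earlier:

```python
import torch
from torchdiffeq import odeint_adjoint

# Reuses the ODEFunc class defined earlier. Backpropagation solves an adjoint
# ODE backward in time instead of storing solver activations, so memory does
# not grow with the number of solver steps.
func = ODEFunc(dim=2)
z0 = torch.randn(32, 2)
t_span = torch.tensor([0., 1.])

z1 = odeint_adjoint(func, z0, t_span, method='dopri5')[-1]
z1.sum().backward()  # gradients w.r.t. func's parameters flow through the adjoint
```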
Training FFJORD:
The loss is negative log-likelihood: $$\mathcal{L} = -\mathbb{E}_{\mathbf{x}}\left[\log p_Z(\mathbf{z}_0) - \int_0^1 \text{tr}\left(\frac{\partial f}{\partial \mathbf{z}}\right) dt\right]$$
where $\mathbf{z}_0$ is obtained by solving the ODE backward from $\mathbf{x}$ (at $t=1$) to $t=0$.
```python
import torch
import torch.nn as nn
from torchdiffeq import odeint


class FFJORD(nn.Module):
    """
    Free-Form Jacobian of Reversible Dynamics.
    """
    def __init__(self, dim, hidden_dims=[64, 64]):
        super().__init__()

        # Dynamics network: unrestricted architecture!
        layers = []
        in_dim = dim + 1  # +1 for time
        for h_dim in hidden_dims:
            layers.extend([nn.Linear(in_dim, h_dim), nn.Softplus()])
            in_dim = h_dim
        layers.append(nn.Linear(in_dim, dim))

        self.dynamics = nn.Sequential(*layers)
        self.dim = dim

    def f(self, t, z):
        """Dynamics function."""
        t_vec = torch.ones(z.shape[0], 1, device=z.device) * t
        return self.dynamics(torch.cat([z, t_vec], dim=1))

    def augmented_dynamics(self, t, state):
        """
        Joint dynamics of z and the log-density accumulator.
        state = [z, accumulator]; returns [dz/dt, -trace estimate].
        """
        z = state[:, :self.dim]

        with torch.enable_grad():
            z = z.requires_grad_(True)
            dz = self.f(t, z)

            # Hutchinson estimate: eps^T (df/dz) eps via one vector-Jacobian product
            epsilon = torch.randn_like(z)
            vjp = torch.autograd.grad(dz, z, epsilon, create_graph=True)[0]
            trace = (epsilon * vjp).sum(dim=1, keepdim=True)

        return torch.cat([dz, -trace], dim=1)

    def forward(self, z0):
        """
        Forward pass: z0 -> z1 (sampling direction).
        Returns z1 and log_det = -∫_0^1 tr(df/dz) dt,
        so that log p(z1) = log p(z0) + log_det.
        """
        t_span = torch.tensor([0., 1.], device=z0.device)
        init_state = torch.cat(
            [z0, torch.zeros(z0.shape[0], 1, device=z0.device)], dim=1
        )
        final_state = odeint(self.augmented_dynamics, init_state, t_span)[-1]

        z1 = final_state[:, :self.dim]
        log_det = final_state[:, self.dim]
        return z1, log_det

    def inverse(self, x):
        """
        Inverse pass: x = z(1) -> z0 (density-estimation direction).
        The solver handles the decreasing time span, so the same augmented
        dynamics are used. The accumulator ends at +∫_0^1 tr(df/dz) dt,
        so that log p(x) = log p_Z(z0) - log_det.
        """
        t_span = torch.tensor([1., 0.], device=x.device)
        init_state = torch.cat(
            [x, torch.zeros(x.shape[0], 1, device=x.device)], dim=1
        )
        final_state = odeint(self.augmented_dynamics, init_state, t_span)[-1]

        z0 = final_state[:, :self.dim]
        log_det = final_state[:, self.dim]
        return z0, log_det
```

Discrete vs. Continuous Flows:

| Aspect | Discrete Flows (Glow) | Continuous Flows (FFJORD) |
|---|---|---|
| Architecture | Constrained (coupling layers) | Free-form (any network) |
| Jacobian computation | Exact (O(d) per layer) | Estimated (stochastic) |
| Depth | Fixed number of layers | Adaptive (solver-determined) |
| Memory | O(depth × d) | O(d) via adjoint method |
| Training speed | Fast | Slower (ODE solving) |
| Inference speed | Fast (one forward pass) | Slower (ODE solving) |
| Sample quality | State-of-the-art | Competitive but not best |
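Putting the pieces together, here is a minimal training sketch for the FFJORD class above that implements the negative log-likelihood loss from the Training FFJORD section; the two-dimensional toy data, base distribution, and optimizer settings are all illustrative assumptions:

```python
import torch

# Illustrative setup: 2-D toy data and a standard-normal base distribution.
model = FFJORD(dim=2)
base = torch.distributions.Normal(torch.zeros(2), torch.ones(2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(1000):
    x = torch.randn(128, 2) * 0.5 + 1.0  # placeholder for a real data batch

    # Density-estimation direction: x (t=1) -> z0 (t=0).
    # Per the inverse() docstring, log p(x) = log p_Z(z0) - log_det.
    z0, log_det = model.inverse(x)
    log_px = base.log_prob(z0).sum(dim=1) - log_det

    loss = -log_px.mean()  # negative log-likelihood
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```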
Regularization for Faster Solving:
Continuous flows can learn dynamics that are expensive to integrate (highly curved trajectories). Regularization techniques encourage straighter paths:
Kinetic regularization: Penalize $\|f(\mathbf{z}, t)\|^2$ along the trajectory to encourage slow, smooth dynamics
Jacobian regularization: Penalize the Frobenius norm $\|\frac{\partial f}{\partial \mathbf{z}}\|_F^2$ to encourage nearly linear dynamics (a sketch of both penalties follows)
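Both penalties can be estimated from quantities the augmented dynamics already compute, in the spirit of Finlay et al.'s "How to Train Your Neural ODE"; the sketch below shows per-sample estimates (the helper name and weights are illustrative, and in practice the penalties are accumulated along the trajectory as extra augmented states):

```python
import torch

def regularization_terms(f, t, z):
    """
    Per-sample estimates of the two penalties for dynamics f(t, z):
      kinetic   = ||f(z, t)||^2
      frobenius ~ ||df/dz||_F^2, estimated as E[||eps^T (df/dz)||^2]
    The Frobenius estimate reuses the same VJP as Hutchinson's trace estimator.
    """
    with torch.enable_grad():
        z = z.requires_grad_(True)
        dz = f(t, z)
        epsilon = torch.randn_like(z)
        vjp = torch.autograd.grad(dz, z, epsilon, create_graph=True)[0]

    kinetic = (dz ** 2).sum(dim=1)     # encourages slow, straight trajectories
    frobenius = (vjp ** 2).sum(dim=1)  # encourages nearly linear dynamics
    return kinetic, frobenius

# Illustrative use inside a training loop (the weights 0.01 are arbitrary):
#   loss = nll + 0.01 * kinetic.mean() + 0.01 * frobenius.mean()
```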
Variants and Extensions:
The main drawback of continuous flows is computational cost. Each forward/inverse pass requires solving an ODE, which can take many function evaluations. For applications requiring millions of density evaluations or samples, discrete flows are often more practical despite their architectural constraints.
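Since this cost is dominated by the number of function evaluations (NFE) the adaptive solver performs, a simple counter wrapped around the dynamics is a useful diagnostic; the wrapper below is a sketch, not part of torchdiffeq:

```python
import torch.nn as nn

class NFECounter(nn.Module):
    """Wraps a dynamics module and counts how often the ODE solver calls it."""
    def __init__(self, dynamics):
        super().__init__()
        self.dynamics = dynamics
        self.nfe = 0

    def forward(self, t, z):
        self.nfe += 1
        return self.dynamics(t, z)

# Example: counted = NFECounter(ODEFunc(dim=2)); after odeint(counted, z0, t_span),
# counted.nfe holds the number of dynamics evaluations used for that solve.
```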
Connection to Optimal Transport:
Continuous flows define a path between the base distribution and data distribution. Optimal transport (OT) seeks the most efficient such path. Regularizing CNFs toward OT solutions can improve both training and sample quality.
Connection to Diffusion Models:
Diffusion models can be viewed through a continuous flow lens. The score function $\nabla_\mathbf{x} \log p_t(\mathbf{x})$ defines the dynamics of a probability flow ODE: $$\frac{d\mathbf{x}}{dt} = f(\mathbf{x}, t) - \frac{1}{2}g(t)^2 \nabla_\mathbf{x} \log p_t(\mathbf{x})$$ where $f(\mathbf{x}, t)$ and $g(t)$ are the drift and diffusion coefficients of the forward diffusion SDE (not the CNF dynamics above).
This connection has inspired hybrid approaches combining the best of flows and diffusion.
Applications:
Congratulations! You've completed the Flow-Based Models module. You now understand normalizing flows from mathematical foundations through practical architectures to cutting-edge continuous formulations. These skills enable you to apply flows to density estimation, generative modeling, and variational inference across diverse domains.