The flow architectures we've studied so far—RealNVP, Glow—use discrete sequences of transformations: fixed layers stacked in a predetermined order. Continuous normalizing flows (CNFs) take a fundamentally different approach, modeling the transformation as a continuous evolution through time governed by an ordinary differential equation (ODE).
This perspective, introduced through Neural ODEs (Chen et al., 2018), offers remarkable flexibility: the transformation is no longer constrained to specific layer architectures, and the Jacobian can take any form, since its log-determinant is handled by efficient trace estimation rather than exact computation. CNFs also provide theoretical insights connecting discrete flows to dynamical systems and optimal transport.
In this lesson, you will understand how ODEs define continuous transformations, derive the instantaneous change of variables formula, learn efficient trace estimation for log-determinant computation, and master the FFJORD architecture for practical continuous flows.
From ResNets to ODEs:
A residual network layer computes: $$\mathbf{h}_{t+1} = \mathbf{h}_t + f(\mathbf{h}_t, \theta_t)$$
With small step sizes and many layers, this resembles Euler discretization of an ODE: $$\frac{d\mathbf{h}(t)}{dt} = f(\mathbf{h}(t), t, \theta)$$
Neural ODEs take this limit: instead of discrete layers, define the continuous dynamics $f$ as a neural network and solve the ODE to transform inputs.
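To make the ResNet analogy concrete, here is a minimal sketch (not from any paper or library; the tiny network, dimensionality, and step count are arbitrary choices) that integrates time-independent dynamics with fixed-step Euler updates, where each update has exactly the form of a residual layer:

```python
import torch
import torch.nn as nn

# Illustrative, time-independent dynamics network: f(z) -> dz/dt.
dynamics = nn.Sequential(nn.Linear(3, 32), nn.Tanh(), nn.Linear(32, 3))

def euler_integrate(z0, num_steps=100, t0=0.0, t1=1.0):
    """Fixed-step Euler integration: every update is a (scaled) residual step."""
    z = z0
    dt = (t1 - t0) / num_steps
    for _ in range(num_steps):
        z = z + dt * dynamics(z)  # h_{t+1} = h_t + dt * f(h_t)
    return z

z0 = torch.randn(8, 3)    # a batch of 8 three-dimensional states
z1 = euler_integrate(z0)  # approximation of z(1)
```

An adaptive ODE solver replaces this fixed-step loop with error-controlled steps; that is what torchdiffeq's `odeint` does in the code below.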
The Transformation:
Given initial state $\mathbf{z}_0$ at time $t=0$, the final state at time $t=1$ is: $$\mathbf{z}_1 = \mathbf{z}_0 + \int_0^1 f(\mathbf{z}(t), t, \theta) \, dt$$
This defines an invertible transformation from $\mathbf{z}_0$ to $\mathbf{z}_1$ (and vice versa by integrating backward in time).
```python
import torch
import torch.nn as nn
from torchdiffeq import odeint


class ODEFunc(nn.Module):
    """
    Neural network defining ODE dynamics.
    dz/dt = f(z, t)
    """
    def __init__(self, dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden_dim),  # +1 for time
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, dim)
        )

    def forward(self, t, z):
        # Concatenate time to input
        t_vec = torch.ones(z.shape[0], 1, device=z.device) * t
        z_t = torch.cat([z, t_vec], dim=1)
        return self.net(z_t)


class NeuralODE(nn.Module):
    """
    Neural ODE transformation.
    """
    def __init__(self, dim, hidden_dim=64):
        super().__init__()
        self.func = ODEFunc(dim, hidden_dim)

    def forward(self, z0, t_span=torch.tensor([0., 1.])):
        """
        Integrate from t=0 to t=1.
        """
        solution = odeint(self.func, z0, t_span, method='dopri5')
        return solution[-1]  # Return state at t=1

    def inverse(self, z1, t_span=torch.tensor([1., 0.])):
        """
        Integrate backward from t=1 to t=0.
        """
        solution = odeint(self.func, z1, t_span, method='dopri5')
        return solution[-1]
```

ODE dynamics are inherently invertible—we can run time forward or backward. This gives us invertibility for free, without the architectural constraints of coupling layers. Any neural network can define the dynamics f(z,t).
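As a quick sanity check of invertibility (assuming torchdiffeq is installed and the `NeuralODE` class above is in scope), integrating forward and then backward should recover the input up to solver tolerance:

```python
import torch

# Quick round-trip check with the NeuralODE class above.
model = NeuralODE(dim=2)
z0 = torch.randn(16, 2)

z1 = model(z0)                    # integrate t: 0 -> 1
z0_recovered = model.inverse(z1)  # integrate t: 1 -> 0

# The reconstruction error should be on the order of the solver tolerance.
print((z0 - z0_recovered).abs().max())
```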
For continuous flows, the standard change of variables formula becomes a differential equation for the log-density.
The Key Result:
If $\mathbf{z}(t)$ evolves according to $\frac{d\mathbf{z}}{dt} = f(\mathbf{z}(t), t)$, then the log-density evolves as:
$$\frac{d \log p(\mathbf{z}(t))}{dt} = -\text{tr}\left(\frac{\partial f}{\partial \mathbf{z}}\right)$$
This is the instantaneous change of variables formula, also known as Liouville's equation in physics.
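As a consistency check, consider linear dynamics, chosen purely for illustration because the ODE can be solved in closed form:

$$f(\mathbf{z}) = \mathbf{A}\mathbf{z} \quad\Longrightarrow\quad \mathbf{z}(t) = e^{t\mathbf{A}}\mathbf{z}_0, \qquad \log\left|\det\frac{\partial \mathbf{z}(t)}{\partial \mathbf{z}_0}\right| = \log\left|\det e^{t\mathbf{A}}\right| = t\,\text{tr}(\mathbf{A}).$$

So $\log p(\mathbf{z}(t)) = \log p(\mathbf{z}_0) - t\,\text{tr}(\mathbf{A})$, exactly what integrating $-\text{tr}\left(\frac{\partial f}{\partial \mathbf{z}}\right) = -\text{tr}(\mathbf{A})$ over $[0, t]$ predicts.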
Derivation Sketch:
For an infinitesimal time step $\delta t$, the Euler update $\mathbf{z}(t+\delta t) = \mathbf{z}(t) + \delta t \, f(\mathbf{z}(t), t) + O(\delta t^2)$ has Jacobian $\mathbf{I} + \delta t \frac{\partial f}{\partial \mathbf{z}} + O(\delta t^2)$, whose log-determinant is $\delta t \, \text{tr}\left(\frac{\partial f}{\partial \mathbf{z}}\right) + O(\delta t^2)$. Applying the discrete change of variables gives $\log p(\mathbf{z}(t+\delta t)) = \log p(\mathbf{z}(t)) - \delta t \, \text{tr}\left(\frac{\partial f}{\partial \mathbf{z}}\right) + O(\delta t^2)$; dividing by $\delta t$ and letting $\delta t \to 0$ yields the formula above.
Implications:
Integrating the trace along the trajectory gives the total change in log-density: $$\log p(\mathbf{z}_1) = \log p(\mathbf{z}_0) - \int_0^1 \text{tr}\left(\frac{\partial f(\mathbf{z}(t), t)}{\partial \mathbf{z}}\right) dt$$
The Computational Challenge:
The Jacobian $\frac{\partial f}{\partial \mathbf{z}}$ is a $d \times d$ matrix. Computing its trace directly requires $O(d)$ evaluations of $f$ (via finite differences or autodiff) or $O(d^2)$ memory to store the full Jacobian.
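For small $d$, the brute-force computation can be written out directly; this toy sketch (arbitrary network and dimensionality) exists only to illustrate the cost being avoided:

```python
import torch
import torch.nn as nn

d = 5
net = nn.Sequential(nn.Linear(d, 32), nn.Tanh(), nn.Linear(32, d))
z = torch.randn(d)

# Materialize the full d x d Jacobian: internally this performs one backward
# pass per output dimension, so time and memory both grow with d.
J = torch.autograd.functional.jacobian(net, z)
exact_trace = torch.diagonal(J).sum()
print(exact_trace)
```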
For high-dimensional data, this is prohibitive. FFJORD (Free-Form Jacobian of Reversible Dynamics) solves this via Hutchinson's trace estimator:
$$\text{tr}(\mathbf{A}) = \mathbb{E}_{\boldsymbol{\epsilon}}[\boldsymbol{\epsilon}^T \mathbf{A} \boldsymbol{\epsilon}]$$
where $\boldsymbol{\epsilon}$ is a random vector with $\mathbb{E}[\boldsymbol{\epsilon}] = \mathbf{0}$ and $\text{Cov}(\boldsymbol{\epsilon}) = \mathbf{I}$ (e.g., Gaussian or Rademacher).
We can compute $\boldsymbol{\epsilon}^T \frac{\partial f}{\partial \mathbf{z}}$ as a single vector-Jacobian product (VJP) using reverse-mode autodiff, then take its dot product with $\boldsymbol{\epsilon}$. This costs $O(d)$ per sample, without ever materializing the full Jacobian!
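Before applying this to Jacobians, the identity itself can be checked numerically on an explicit matrix (a toy verification; the matrix size and number of probes are arbitrary):

```python
import torch

torch.manual_seed(0)
A = torch.randn(10, 10)

# Rademacher probe vectors: entries are +1 or -1 with equal probability.
num_probes = 10000
eps = torch.randint(0, 2, (num_probes, 10)).float() * 2 - 1

# One eps^T A eps per probe, then average over probes.
estimates = ((eps @ A) * eps).sum(dim=1)
print(estimates.mean().item(), torch.trace(A).item())  # the two values should be close
```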
```python
import torch


def hutchinson_trace_estimator(f, z, num_samples=1):
    """
    Estimate tr(df/dz) using Hutchinson's estimator:
        tr(A) = E[eps^T A eps], where E[eps] = 0 and Cov(eps) = I.
    We compute eps^T (df/dz) via a vector-Jacobian product without forming df/dz.
    """
    trace_estimate = 0
    for _ in range(num_samples):
        # Random probe vector (Rademacher: +/-1 with equal probability)
        epsilon = torch.randint(0, 2, z.shape, device=z.device).float() * 2 - 1

        # Compute f(z) with gradient tracking enabled
        z = z.requires_grad_(True)
        f_z = f(z)

        # Vector-Jacobian product: eps^T @ (df/dz), via reverse-mode autodiff,
        # without ever materializing the Jacobian
        vjp = torch.autograd.grad(f_z, z, epsilon, create_graph=True)[0]

        # tr(df/dz) ~= eps^T @ (df/dz) @ eps = vjp . eps
        trace_estimate += (epsilon * vjp).sum(dim=1)

    return trace_estimate / num_samples


class CNFFunc(torch.nn.Module):
    """
    Joint dynamics for a continuous normalizing flow.
    Evolves both z and log p(z) simultaneously.
    """
    def __init__(self, dynamics_net):
        super().__init__()
        self.net = dynamics_net

    def forward(self, t, state):
        """
        state = (z, log_p)
        Returns (dz/dt, d(log_p)/dt)
        """
        z, log_p = state

        with torch.enable_grad():
            z = z.requires_grad_(True)
            dz_dt = self.net(t, z)

            # Estimate the trace of the Jacobian df/dz
            trace = hutchinson_trace_estimator(
                lambda z: self.net(t, z), z, num_samples=1
            )

        # d(log p)/dt = -tr(df/dz)
        dlog_p_dt = -trace

        return dz_dt, dlog_p_dt
```

FFJORD (Free-Form Jacobian of Reversible Dynamics) combines neural ODEs with Hutchinson's trace estimator to create continuous normalizing flows with unrestricted dynamics.
Key Components:
Free-form dynamics: Any neural network can define $f(\mathbf{z}, t)$—no coupling layers, no triangular Jacobians
Hutchinson estimator: Unbiased trace estimation in $O(d)$ time per integration step
Adjoint method: Memory-efficient backpropagation through the ODE solver (see the snippet after this list)
Adaptive solvers: Use adaptive ODE solvers (like Dormand-Prince) that adjust step size automatically
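In torchdiffeq, the adjoint method is exposed as `odeint_adjoint`, a drop-in replacement for `odeint` whose dynamics must be an `nn.Module` so its parameters can be tracked. A minimal sketch, reusing the `ODEFunc` class defined earlier:

```python
import torch
from torchdiffeq import odeint_adjoint

# Reuses the ODEFunc class defined earlier. Backpropagation solves an adjoint
# ODE backward in time instead of storing solver activations, so memory does
# not grow with the number of solver steps.
func = ODEFunc(dim=2)
z0 = torch.randn(32, 2)
t_span = torch.tensor([0., 1.])

z1 = odeint_adjoint(func, z0, t_span, method='dopri5')[-1]
z1.sum().backward()  # gradients w.r.t. func's parameters flow through the adjoint
```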
Training FFJORD:
The loss is negative log-likelihood: $$\mathcal{L} = -\mathbb{E}_{\mathbf{x}}\left[\log p_Z(\mathbf{z}_0) - \int_0^1 \text{tr}\left(\frac{\partial f}{\partial \mathbf{z}}\right) dt\right]$$
where $\mathbf{z}_0$ is obtained by solving the ODE backward from $\mathbf{x}$ (at $t=1$) to $t=0$.
```python
import torch
import torch.nn as nn
from torchdiffeq import odeint


class FFJORD(nn.Module):
    """
    Free-Form Jacobian of Reversible Dynamics.
    """
    def __init__(self, dim, hidden_dims=[64, 64]):
        super().__init__()

        # Dynamics network: unrestricted architecture!
        layers = []
        in_dim = dim + 1  # +1 for time
        for h_dim in hidden_dims:
            layers.extend([nn.Linear(in_dim, h_dim), nn.Softplus()])
            in_dim = h_dim
        layers.append(nn.Linear(in_dim, dim))

        self.dynamics = nn.Sequential(*layers)
        self.dim = dim

    def f(self, t, z):
        """Dynamics function."""
        t_vec = torch.ones(z.shape[0], 1, device=z.device) * t
        return self.dynamics(torch.cat([z, t_vec], dim=1))

    def augmented_dynamics(self, t, state):
        """
        Joint dynamics of z and the log-density accumulator.
        state = [z, accumulator]; returns [dz/dt, -trace estimate].
        """
        z = state[:, :self.dim]

        with torch.enable_grad():
            z = z.requires_grad_(True)
            dz = self.f(t, z)

            # Hutchinson estimate: eps^T (df/dz) eps via one vector-Jacobian product
            epsilon = torch.randn_like(z)
            vjp = torch.autograd.grad(dz, z, epsilon, create_graph=True)[0]
            trace = (epsilon * vjp).sum(dim=1, keepdim=True)

        return torch.cat([dz, -trace], dim=1)

    def forward(self, z0):
        """
        Forward pass: z0 -> z1 (sampling direction).
        Returns z1 and log_det = -∫_0^1 tr(df/dz) dt,
        so that log p(z1) = log p(z0) + log_det.
        """
        t_span = torch.tensor([0., 1.], device=z0.device)
        init_state = torch.cat(
            [z0, torch.zeros(z0.shape[0], 1, device=z0.device)], dim=1
        )
        final_state = odeint(self.augmented_dynamics, init_state, t_span)[-1]

        z1 = final_state[:, :self.dim]
        log_det = final_state[:, self.dim]
        return z1, log_det

    def inverse(self, x):
        """
        Inverse pass: x = z(1) -> z0 (density-estimation direction).
        The solver handles the decreasing time span, so the same augmented
        dynamics are used. The accumulator ends at +∫_0^1 tr(df/dz) dt,
        so that log p(x) = log p_Z(z0) - log_det.
        """
        t_span = torch.tensor([1., 0.], device=x.device)
        init_state = torch.cat(
            [x, torch.zeros(x.shape[0], 1, device=x.device)], dim=1
        )
        final_state = odeint(self.augmented_dynamics, init_state, t_span)[-1]

        z0 = final_state[:, :self.dim]
        log_det = final_state[:, self.dim]
        return z0, log_det
```

Discrete vs. Continuous Flows:

| Aspect | Discrete Flows (Glow) | Continuous Flows (FFJORD) |
|---|---|---|
| Architecture | Constrained (coupling layers) | Free-form (any network) |
| Jacobian computation | Exact (O(d) per layer) | Estimated (stochastic) |
| Depth | Fixed number of layers | Adaptive (solver-determined) |
| Memory | O(depth × d) | O(d) via adjoint method |
| Training speed | Fast | Slower (ODE solving) |
| Inference speed | Fast (one forward pass) | Slower (ODE solving) |
| Sample quality | State-of-the-art | Competitive but not best |
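Putting the pieces together, here is a minimal training sketch for the FFJORD class above that implements the negative log-likelihood loss from the Training FFJORD section; the two-dimensional toy data, base distribution, and optimizer settings are all illustrative assumptions:

```python
import torch

# Illustrative setup: 2-D toy data and a standard-normal base distribution.
model = FFJORD(dim=2)
base = torch.distributions.Normal(torch.zeros(2), torch.ones(2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(1000):
    x = torch.randn(128, 2) * 0.5 + 1.0  # placeholder for a real data batch

    # Density-estimation direction: x (t=1) -> z0 (t=0).
    # Per the inverse() docstring, log p(x) = log p_Z(z0) - log_det.
    z0, log_det = model.inverse(x)
    log_px = base.log_prob(z0).sum(dim=1) - log_det

    loss = -log_px.mean()  # negative log-likelihood
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```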
Regularization for Faster Solving:
Continuous flows can learn dynamics that are expensive to integrate (highly curved trajectories). Regularization techniques encourage straighter paths:
Kinetic regularization: Penalize $\|f(\mathbf{z}, t)\|^2$ along the trajectory to encourage slow, smooth dynamics
Jacobian regularization: Penalize the Frobenius norm $\|\frac{\partial f}{\partial \mathbf{z}}\|_F^2$ to encourage nearly linear dynamics (a sketch of both penalties follows)
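Both penalties can be estimated from quantities the augmented dynamics already compute, in the spirit of Finlay et al.'s "How to Train Your Neural ODE"; the sketch below shows per-sample estimates (the helper name and weights are illustrative, and in practice the penalties are accumulated along the trajectory as extra augmented states):

```python
import torch

def regularization_terms(f, t, z):
    """
    Per-sample estimates of the two penalties for dynamics f(t, z):
      kinetic   = ||f(z, t)||^2
      frobenius ~ ||df/dz||_F^2, estimated as E[||eps^T (df/dz)||^2]
    The Frobenius estimate reuses the same VJP as Hutchinson's trace estimator.
    """
    with torch.enable_grad():
        z = z.requires_grad_(True)
        dz = f(t, z)
        epsilon = torch.randn_like(z)
        vjp = torch.autograd.grad(dz, z, epsilon, create_graph=True)[0]

    kinetic = (dz ** 2).sum(dim=1)     # encourages slow, straight trajectories
    frobenius = (vjp ** 2).sum(dim=1)  # encourages nearly linear dynamics
    return kinetic, frobenius

# Illustrative use inside a training loop (the weights 0.01 are arbitrary):
#   loss = nll + 0.01 * kinetic.mean() + 0.01 * frobenius.mean()
```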
Variants and Extensions:
The main drawback of continuous flows is computational cost. Each forward/inverse pass requires solving an ODE, which can take many function evaluations. For applications requiring millions of density evaluations or samples, discrete flows are often more practical despite their architectural constraints.
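Since this cost is dominated by the number of function evaluations (NFE) the adaptive solver performs, a simple counter wrapped around the dynamics is a useful diagnostic; the wrapper below is a sketch, not part of torchdiffeq:

```python
import torch.nn as nn

class NFECounter(nn.Module):
    """Wraps a dynamics module and counts how often the ODE solver calls it."""
    def __init__(self, dynamics):
        super().__init__()
        self.dynamics = dynamics
        self.nfe = 0

    def forward(self, t, z):
        self.nfe += 1
        return self.dynamics(t, z)

# Example: counted = NFECounter(ODEFunc(dim=2)); after odeint(counted, z0, t_span),
# counted.nfe holds the number of dynamics evaluations used for that solve.
```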
Connection to Optimal Transport:
Continuous flows define a path between the base distribution and data distribution. Optimal transport (OT) seeks the most efficient such path. Regularizing CNFs toward OT solutions can improve both training and sample quality.
Connection to Diffusion Models:
Diffusion models can be viewed through a continuous flow lens. The score function $\nabla_\mathbf{x} \log p_t(\mathbf{x})$ defines the dynamics of a probability flow ODE: $$\frac{d\mathbf{x}}{dt} = f(\mathbf{x}, t) - \frac{1}{2}g(t)^2 \nabla_\mathbf{x} \log p_t(\mathbf{x})$$ where $f(\mathbf{x}, t)$ and $g(t)$ are the drift and diffusion coefficients of the forward diffusion SDE (not the CNF dynamics above).
This connection has inspired hybrid approaches combining the best of flows and diffusion.
Applications:
Congratulations! You've completed the Flow-Based Models module. You now understand normalizing flows from mathematical foundations through practical architectures to cutting-edge continuous formulations. These skills enable you to apply flows to density estimation, generative modeling, and variational inference across diverse domains.