The change of variables formula is the mathematical cornerstone upon which all normalizing flows are built. This elegant result from probability theory and measure theory tells us precisely how probability densities transform under differentiable, invertible mappings. While we introduced this formula on the previous page, here we develop a deeper, rigorous understanding that will inform architectural decisions and help debug numerical issues in practice.
Understanding the change of variables formula is not merely academic—it directly explains why certain flow architectures work and others fail, why some transformations are computationally tractable while others are not, and how to reason about the behavior of complex composed transformations.
Master the rigorous derivation of the change of variables formula, understand its geometric interpretation through volume elements, learn to compute Jacobians for various transformation types, and develop intuition for how different Jacobian structures affect computational tractability.
Before tackling the full multivariate case, let's build intuition from the one-dimensional setting where the concepts are most transparent.
Setup: Let $Z$ be a random variable with density $p_Z(z)$. Define $X = f(Z)$ where $f: \mathbb{R} \to \mathbb{R}$ is a strictly monotonic, differentiable function. What is the density $p_X(x)$?
Derivation via CDF:
For a monotonically increasing $f$: $$P(X \leq x) = P(f(Z) \leq x) = P(Z \leq f^{-1}(x))$$
Differentiating both sides with respect to $x$: $$p_X(x) = p_Z(f^{-1}(x)) \cdot \frac{d}{dx}f^{-1}(x) = p_Z(f^{-1}(x)) \cdot \frac{1}{f'(f^{-1}(x))}$$
For monotonically decreasing $f$, we instead pick up a minus sign, leading to the general formula: $$p_X(x) = p_Z(f^{-1}(x)) \cdot \left| \frac{d f^{-1}(x)}{dx} \right| = p_Z(z) \cdot \left| \frac{1}{f'(z)} \right|, \quad \text{where } z = f^{-1}(x)$$
The absolute value ensures the density remains positive regardless of whether $f$ is increasing or decreasing.
Total probability must always integrate to 1. If $f$ stretches a region (|f'| > 1), the probability mass spreads over a larger interval, so density decreases. If $f$ compresses (|f'| < 1), density increases. The factor |1/f'(z)| precisely compensates for this stretching/compression.
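As a quick worked example (illustrative only, not tied to any particular flow), take the affine map $f(z) = az + b$ with $a \neq 0$. Then $f^{-1}(x) = (x - b)/a$ and $f'(z) = a$, so

$$p_X(x) = p_Z\!\left(\frac{x - b}{a}\right) \cdot \frac{1}{|a|}$$

If $p_Z$ is a standard Gaussian, this is exactly the $\mathcal{N}(b, a^2)$ density: stretching by $|a| > 1$ widens the distribution and lowers its peak by the same factor. The code below carries out the same check numerically for the nonlinear map $f(z) = z^3 + z$, where the inverse must be found with Newton's method.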
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

def demonstrate_1d_change_of_variables():
    """
    Demonstrate change of variables in 1D.
    Transform a standard Gaussian through f(z) = z^3 + z
    """
    # Define transformation and its derivative
    def f(z):
        return z**3 + z

    def f_prime(z):
        return 3*z**2 + 1  # Always positive, so f is monotonic

    def f_inverse_numerical(x, tol=1e-10):
        # Newton's method to find z such that f(z) = x
        z = x  # Initial guess
        for _ in range(100):
            z_new = z - (f(z) - x) / f_prime(z)
            if abs(z_new - z) < tol:
                return z_new
            z = z_new
        return z

    # Base distribution: standard Gaussian
    p_z = lambda z: norm.pdf(z, 0, 1)

    # Transformed density via change of variables
    def p_x(x):
        z = f_inverse_numerical(x)
        return p_z(z) * abs(1 / f_prime(z))

    # Verify by sampling
    z_samples = np.random.randn(100000)
    x_samples = f(z_samples)

    # Plot comparison
    x_grid = np.linspace(-10, 10, 1000)
    p_x_analytical = [p_x(x) for x in x_grid]

    plt.figure(figsize=(10, 5))
    plt.hist(x_samples, bins=100, density=True, alpha=0.7, label='Empirical')
    plt.plot(x_grid, p_x_analytical, 'r-', lw=2, label='Analytical (CoV)')
    plt.xlabel('x')
    plt.ylabel('Density')
    plt.legend()
    plt.title('Change of Variables: z³ + z Transform of Gaussian')
    plt.show()

demonstrate_1d_change_of_variables()
```

The multivariate generalization replaces the scalar derivative with the Jacobian matrix and the absolute value with the absolute value of the determinant.
The Jacobian Matrix:
For $f: \mathbb{R}^d \to \mathbb{R}^d$ with $f(\mathbf{z}) = (f_1(\mathbf{z}), \ldots, f_d(\mathbf{z}))$, the Jacobian is:
$$J_f(\mathbf{z}) = \begin{bmatrix} \frac{\partial f_1}{\partial z_1} & \cdots & \frac{\partial f_1}{\partial z_d} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_d}{\partial z_1} & \cdots & \frac{\partial f_d}{\partial z_d} \end{bmatrix}$$
The Change of Variables Formula:
$$p_X(\mathbf{x}) = p_Z(f^{-1}(\mathbf{x})) \cdot \left| \det J_{f^{-1}}(\mathbf{x}) \right|$$
Or equivalently, letting $\mathbf{z} = f^{-1}(\mathbf{x})$:
$$p_X(\mathbf{x}) = p_Z(\mathbf{z}) \cdot \left| \det J_f(\mathbf{z}) \right|^{-1}$$
The key insight: $\det J_{f^{-1}}(\mathbf{x}) = (\det J_f(\mathbf{z}))^{-1}$ by the inverse function theorem.
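As a concrete numerical check of the multivariate formula, the following sketch (an illustration assuming only PyTorch's built-in distributions; the matrix $\mathbf{A}$ and evaluation points are arbitrary) compares the change-of-variables density of $\mathbf{x} = \mathbf{A}\mathbf{z} + \mathbf{b}$ with $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ against the known closed form $\mathcal{N}(\mathbf{b}, \mathbf{A}\mathbf{A}^\top)$:

```python
import torch

# Minimal check (illustration only): for x = A z + b with z ~ N(0, I),
# the change-of-variables density must equal the closed form N(b, A A^T).
torch.manual_seed(0)
d = 3
A = torch.randn(d, d) + 3.0 * torch.eye(d)   # arbitrary, well-conditioned, invertible
b = torch.randn(d)

base = torch.distributions.MultivariateNormal(torch.zeros(d), torch.eye(d))
target = torch.distributions.MultivariateNormal(b, A @ A.T)  # exact density of x

x = torch.randn(5, d)                          # arbitrary evaluation points
z = torch.linalg.solve(A, (x - b).T).T         # z = f^{-1}(x) = A^{-1}(x - b)

# log p_X(x) = log p_Z(z) - log|det A|   (the Jacobian of f is the constant matrix A)
log_px = base.log_prob(z) - torch.linalg.slogdet(A).logabsdet

print(torch.allclose(log_px, target.log_prob(x), atol=1e-4))  # expected: True
```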
Geometric Interpretation:
The Jacobian determinant measures how the transformation $f$ locally changes volumes: an infinitesimal volume element $d\mathbf{z}$ around $\mathbf{z}$ is mapped to a region of volume $|\det J_f(\mathbf{z})| \, d\mathbf{z}$ around $f(\mathbf{z})$. Where $|\det J_f| > 1$ the map locally expands space and the density must decrease; where $|\det J_f| < 1$ it compresses space and the density must increase, exactly mirroring the 1D stretching/compression intuition.
This volume-change interpretation explains why the density picks up the factor $|\det J_f(\mathbf{z})|^{-1}$ (probability mass is conserved while the volume carrying it changes), why $f$ must be invertible with a nowhere-vanishing Jacobian determinant, and why the structure of the Jacobian, summarized below, determines how cheaply the density can be evaluated.
| Transformation | Jacobian Matrix | Determinant | Complexity |
|---|---|---|---|
| $f(\mathbf{z}) = \mathbf{A}\mathbf{z} + \mathbf{b}$ (affine) | $\mathbf{A}$ | $\det(\mathbf{A})$ | $O(d^3)$ general, $O(d)$ if triangular |
| $f_i(z_i) = g(z_i)$ (element-wise) | Diagonal: $\text{diag}(g'(z_i))$ | $\prod_i g'(z_i)$ | $O(d)$ |
| Permutation $\mathbf{P}$ | $\mathbf{P}$ | $\pm 1$ | $O(1)$ |
| Rotation $\mathbf{R}$ | $\mathbf{R}$ | $1$ | $O(1)$ |
| Coupling layer | Block triangular | Product of one block's diagonal | $O(d)$ |
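To see where the coupling-layer row comes from, the sketch below builds a small, hypothetical affine coupling map (the conditioner weights `W_s`, `W_t` and the dimension are made up for illustration, not a published architecture), forms its full Jacobian with autograd, and checks that the general $O(d^3)$ log-determinant matches the $O(d)$ sum of log-scales:

```python
import torch

# Hypothetical minimal affine coupling map on R^4:
#   y[:2] = z[:2]
#   y[2:] = z[2:] * exp(s(z[:2])) + t(z[:2])
# Its Jacobian is block triangular, so log|det J| = sum(s) no matter what s and t are.
torch.manual_seed(0)
d_half = 2
W_s, W_t = torch.randn(d_half, d_half), torch.randn(d_half, d_half)

def coupling(z):
    z1, z2 = z[:d_half], z[d_half:]
    s, t = torch.tanh(W_s @ z1), W_t @ z1       # toy "conditioner networks"
    return torch.cat([z1, z2 * torch.exp(s) + t])

z = torch.randn(2 * d_half)
J = torch.autograd.functional.jacobian(coupling, z)   # full 4 x 4 Jacobian

log_det_general = torch.linalg.slogdet(J).logabsdet   # O(d^3) route through the full matrix
log_det_fast = torch.tanh(W_s @ z[:d_half]).sum()     # O(d) route: sum of log-scales s

print(torch.allclose(log_det_general, log_det_fast, atol=1e-4))  # expected: True
```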
In practice, we always work with log-determinants for numerical stability and computational convenience. The determinant of a Jacobian can be astronomically large or small, but its logarithm remains well-behaved.
Log-Determinant Properties:
Product rule: $\log |\det(\mathbf{AB})| = \log |\det(\mathbf{A})| + \log |\det(\mathbf{B})|$
Inverse rule: $\log |\det(\mathbf{A}^{-1})| = -\log |\det(\mathbf{A})|$
Triangular matrices: $\log |\det(\mathbf{T})| = \sum_i \log |T_{ii}|$
Diagonal matrices: $\log |\det(\mathbf{D})| = \sum_i \log |D_{ii}|$
These properties are essential for flows. The product rule allows us to sum log-determinants across composed layers. The triangular/diagonal rules show why such structures are computationally attractive.
```python
import torch
import torch.nn as nn

def log_det_triangular(L):
    """
    Compute log|det(L)| for triangular matrix L in O(d) time.

    Args:
        L: Triangular matrix [batch, d, d] or [d, d]
    Returns:
        log_det: Log absolute determinant [batch] or scalar
    """
    # Determinant = product of diagonal elements
    # Log determinant = sum of log absolute diagonal elements
    diag = torch.diagonal(L, dim1=-2, dim2=-1)
    return torch.sum(torch.log(torch.abs(diag)), dim=-1)

def log_det_diagonal(diag_elements):
    """
    Compute log|det(D)| for diagonal matrix represented by its diagonal.

    Args:
        diag_elements: Diagonal elements [batch, d]
    Returns:
        log_det: Log absolute determinant [batch]
    """
    return torch.sum(torch.log(torch.abs(diag_elements)), dim=-1)

def log_det_lu(A):
    """
    Compute log|det(A)| via LU decomposition in O(d³) time.
    This is the general method for arbitrary square matrices.

    Args:
        A: Square matrix [batch, d, d] or [d, d]
    Returns:
        log_det: Log absolute determinant
    """
    # PyTorch's slogdet computes sign and log|det|
    sign, logabsdet = torch.linalg.slogdet(A)
    return logabsdet

# Example: Verify triangular computation
def verify_log_det():
    d = 100
    # Create random lower triangular matrix
    L = torch.tril(torch.randn(d, d))

    # Efficient computation
    log_det_fast = log_det_triangular(L)

    # Direct computation for verification
    log_det_direct = torch.log(torch.abs(torch.det(L)))

    print(f"Triangular O(d): {log_det_fast.item():.6f}")
    print(f"Direct O(d³): {log_det_direct.item():.6f}")
    print(f"Match: {torch.allclose(log_det_fast, log_det_direct)}")

verify_log_det()
```

Always compute log|det(J)| directly rather than computing det(J) then taking the log. For high-dimensional matrices, the determinant can overflow or underflow even when its logarithm is perfectly reasonable. Most deep learning frameworks provide numerically stable log-determinant functions.
The power of normalizing flows comes from composing multiple simple transformations. The change of variables formula extends naturally to compositions.
Chain of Transformations:
Let $f = f_K \circ f_{K-1} \circ \cdots \circ f_1$, meaning $\mathbf{x} = f_K(f_{K-1}(\cdots f_1(\mathbf{z})))$.
Define intermediate variables: $\mathbf{z}_0 = \mathbf{z}$, $\mathbf{z}_k = f_k(\mathbf{z}_{k-1})$, $\mathbf{z}_K = \mathbf{x}$.
By the chain rule for Jacobians: $$J_f(\mathbf{z}) = J_{f_K}(\mathbf{z}_{K-1}) \cdot J_{f_{K-1}}(\mathbf{z}_{K-2}) \cdots J_{f_1}(\mathbf{z}_0)$$
Taking determinants and logarithms: $$\log |\det J_f(\mathbf{z})| = \sum_{k=1}^{K} \log |\det J_{f_k}(\mathbf{z}_{k-1})|$$
This additive decomposition means we can compute the total log-determinant by summing per-layer contributions during the forward or inverse pass—no need to ever form the full Jacobian matrix.
```python
import torch
import torch.nn as nn
from typing import List, Tuple

class ComposedFlow(nn.Module):
    """
    A normalizing flow as a composition of flow layers.
    """
    def __init__(self, layers: List[nn.Module]):
        super().__init__()
        self.layers = nn.ModuleList(layers)

    def forward(self, z: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """Forward: z -> x, accumulating log det Jacobian."""
        log_det_total = torch.zeros(z.shape[0], device=z.device)
        x = z
        for layer in self.layers:
            x, log_det = layer.forward(x)
            log_det_total += log_det  # Additive!
        return x, log_det_total

    def inverse(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """Inverse: x -> z, accumulating log det of inverse Jacobian."""
        log_det_total = torch.zeros(x.shape[0], device=x.device)
        z = x
        for layer in reversed(self.layers):  # Reverse order!
            z, log_det = layer.inverse(z)
            log_det_total += log_det
        return z, log_det_total

    def log_prob(self, x: torch.Tensor, base_dist) -> torch.Tensor:
        """Compute log p(x) using change of variables."""
        z, log_det_inverse = self.inverse(x)
        log_pz = base_dist.log_prob(z).sum(dim=-1)
        return log_pz + log_det_inverse
```

Implementing the change of variables formula in practice requires attention to several important details.
Forward vs. Inverse Direction:
Flows are parameterized in one direction, but in practice we need both: sampling pushes base-distribution draws through the forward map $\mathbf{x} = f(\mathbf{z})$, while density evaluation (and hence maximum-likelihood training) requires the inverse map $\mathbf{z} = f^{-1}(\mathbf{x})$ together with its log-determinant.
Some architectures (like autoregressive flows) are fast in one direction but slow in the other. Coupling layers are fast in both directions—a major advantage.
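As a minimal illustration of this two-direction interface, here is a sketch of an element-wise affine layer (the class name `ElementwiseAffine` and its parameterization are assumptions for illustration, not a published architecture). Its diagonal Jacobian makes both directions and both log-determinants $O(d)$, so it could plug directly into the `ComposedFlow` class above:

```python
import torch
import torch.nn as nn

class ElementwiseAffine(nn.Module):
    """Sketch: x_i = z_i * exp(log_scale_i) + shift_i, a diagonal-Jacobian flow layer."""
    def __init__(self, dim: int):
        super().__init__()
        self.log_scale = nn.Parameter(torch.zeros(dim))
        self.shift = nn.Parameter(torch.zeros(dim))

    def forward(self, z: torch.Tensor):
        x = z * torch.exp(self.log_scale) + self.shift
        # log|det J_f| = sum_i log_scale_i, the same for every sample in the batch
        log_det = self.log_scale.sum().expand(z.shape[0])
        return x, log_det

    def inverse(self, x: torch.Tensor):
        z = (x - self.shift) * torch.exp(-self.log_scale)
        # log|det J_{f^{-1}}| = -sum_i log_scale_i
        log_det = (-self.log_scale.sum()).expand(x.shape[0])
        return z, log_det
```

A flow could then be assembled as, e.g., `ComposedFlow([ElementwiseAffine(d) for _ in range(3)])`, though on its own such a stack is still just an affine map; the coupling layers discussed next supply the expressive, dimension-mixing transformations.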
Dimension Preservation:
The change of variables formula requires $f: \mathbb{R}^d \to \mathbb{R}^d$—same input and output dimensions. Flows that change dimension require modified formulations (e.g., augmented flows that pad dimensions, or factored architectures that marginalize dimensions).
When implementing flows, verify: (1) forward(inverse(x)) ≈ x (reconstruction), (2) log_det_forward(z) ≈ -log_det_inverse(f(z)) (determinant consistency), (3) likelihoods integrate to 1 (via importance sampling). These sanity checks catch many implementation bugs.
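The first two checks are easy to automate. The sketch below assumes a flow object exposing the `forward`/`inverse` interface used in the `ComposedFlow` example above (the normalization check via importance sampling is omitted for brevity):

```python
import torch

def run_sanity_checks(flow, dim, n=256, atol=1e-4):
    """Quick consistency checks for any flow whose forward/inverse return (output, log_det)."""
    z = torch.randn(n, dim)
    x, log_det_fwd = flow.forward(z)
    z_rec, log_det_inv = flow.inverse(x)

    # (1) Reconstruction: inverse(forward(z)) should recover z.
    print("reconstruction ok:", torch.allclose(z_rec, z, atol=atol))
    # (2) Determinant consistency: log|det J_f(z)| should equal -log|det J_{f^-1}(f(z))|.
    print("log-det consistency ok:", torch.allclose(log_det_fwd, -log_det_inv, atol=atol))

# Hypothetical usage with the sketches above:
# run_sanity_checks(ComposedFlow([ElementwiseAffine(8) for _ in range(3)]), dim=8)
```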
The change of variables formula is the mathematical foundation that makes normalizing flows possible. We've seen how it generalizes from the intuitive 1D case to the full multivariate setting, why the Jacobian determinant measures local volume change, and how log-determinants enable numerically stable computation.
With a solid understanding of the change of variables formula, we're now equipped to study the architectural innovations that make flows practical. Next, we'll explore coupling layers—the breakthrough that enabled flows to scale to high-dimensional data like images.