The change of variables formula is the mathematical cornerstone upon which all normalizing flows are built. This elegant result from probability theory and measure theory tells us precisely how probability densities transform under differentiable, invertible mappings. While we introduced this formula on the previous page, here we develop a deeper, rigorous understanding that will inform architectural decisions and help debug numerical issues in practice.
Understanding the change of variables formula is not merely academic—it directly explains why certain flow architectures work and others fail, why some transformations are computationally tractable while others are not, and how to reason about the behavior of complex composed transformations.
Master the rigorous derivation of the change of variables formula, understand its geometric interpretation through volume elements, learn to compute Jacobians for various transformation types, and develop intuition for how different Jacobian structures affect computational tractability.
Before tackling the full multivariate case, let's build intuition from the one-dimensional setting where the concepts are most transparent.
Setup: Let $Z$ be a random variable with density $p_Z(z)$. Define $X = f(Z)$ where $f: \mathbb{R} \to \mathbb{R}$ is a strictly monotonic, differentiable function. What is the density $p_X(x)$?
Derivation via CDF:
For a monotonically increasing $f$: $$P(X \leq x) = P(f(Z) \leq x) = P(Z \leq f^{-1}(x))$$
Differentiating both sides with respect to $x$: $$p_X(x) = p_Z(f^{-1}(x)) \cdot \frac{d}{dx}f^{-1}(x) = p_Z(f^{-1}(x)) \cdot \frac{1}{f'(f^{-1}(x))}$$
For monotonically decreasing $f$, we instead pick up a minus sign, leading to the general formula: $$p_X(x) = p_Z(f^{-1}(x)) \cdot \left| \frac{d f^{-1}(x)}{dx} \right| = p_Z(z) \cdot \left| \frac{1}{f'(z)} \right|, \quad \text{where } z = f^{-1}(x)$$
The absolute value ensures the density remains positive regardless of whether $f$ is increasing or decreasing.
Total probability must always integrate to 1. If $f$ stretches a region (|f'| > 1), the probability mass spreads over a larger interval, so density decreases. If $f$ compresses (|f'| < 1), density increases. The factor |1/f'(z)| precisely compensates for this stretching/compression.
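As a quick worked example (illustrative only, not tied to any particular flow), take the affine map $f(z) = az + b$ with $a \neq 0$. Then $f^{-1}(x) = (x - b)/a$ and $f'(z) = a$, so

$$p_X(x) = p_Z\!\left(\frac{x - b}{a}\right) \cdot \frac{1}{|a|}$$

If $p_Z$ is a standard Gaussian, this is exactly the $\mathcal{N}(b, a^2)$ density: stretching by $|a| > 1$ widens the distribution and lowers its peak by the same factor. The code below carries out the same check numerically for the nonlinear map $f(z) = z^3 + z$, where the inverse must be found with Newton's method.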
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

def demonstrate_1d_change_of_variables():
    """
    Demonstrate change of variables in 1D.
    Transform a standard Gaussian through f(z) = z^3 + z
    """
    # Define transformation and its derivative
    def f(z):
        return z**3 + z

    def f_prime(z):
        return 3*z**2 + 1  # Always positive, so f is monotonic

    def f_inverse_numerical(x, tol=1e-10):
        # Newton's method to find z such that f(z) = x
        z = x  # Initial guess
        for _ in range(100):
            z_new = z - (f(z) - x) / f_prime(z)
            if abs(z_new - z) < tol:
                return z_new
            z = z_new
        return z

    # Base distribution: standard Gaussian
    p_z = lambda z: norm.pdf(z, 0, 1)

    # Transformed density via change of variables
    def p_x(x):
        z = f_inverse_numerical(x)
        return p_z(z) * abs(1 / f_prime(z))

    # Verify by sampling
    z_samples = np.random.randn(100000)
    x_samples = f(z_samples)

    # Plot comparison
    x_grid = np.linspace(-10, 10, 1000)
    p_x_analytical = [p_x(x) for x in x_grid]

    plt.figure(figsize=(10, 5))
    plt.hist(x_samples, bins=100, density=True, alpha=0.7, label='Empirical')
    plt.plot(x_grid, p_x_analytical, 'r-', lw=2, label='Analytical (CoV)')
    plt.xlabel('x')
    plt.ylabel('Density')
    plt.legend()
    plt.title('Change of Variables: z³ + z Transform of Gaussian')
    plt.show()

demonstrate_1d_change_of_variables()
```

The multivariate generalization replaces the scalar derivative with the Jacobian matrix and the absolute value with the absolute value of the determinant.
The Jacobian Matrix:
For $f: \mathbb{R}^d \to \mathbb{R}^d$ with $f(\mathbf{z}) = (f_1(\mathbf{z}), \ldots, f_d(\mathbf{z}))$, the Jacobian is:
$$J_f(\mathbf{z}) = \begin{bmatrix} \frac{\partial f_1}{\partial z_1} & \cdots & \frac{\partial f_1}{\partial z_d} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_d}{\partial z_1} & \cdots & \frac{\partial f_d}{\partial z_d} \end{bmatrix}$$
The Change of Variables Formula:
$$p_X(\mathbf{x}) = p_Z(f^{-1}(\mathbf{x})) \cdot \left| \det J_{f^{-1}}(\mathbf{x}) \right|$$
Or equivalently, letting $\mathbf{z} = f^{-1}(\mathbf{x})$:
$$p_X(\mathbf{x}) = p_Z(\mathbf{z}) \cdot \left| \det J_f(\mathbf{z}) \right|^{-1}$$
The key insight: $\det J_{f^{-1}}(\mathbf{x}) = (\det J_f(\mathbf{z}))^{-1}$ by the inverse function theorem.
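As a concrete numerical check of the multivariate formula, the following sketch (an illustration assuming only PyTorch's built-in distributions; the matrix $\mathbf{A}$ and evaluation points are arbitrary) compares the change-of-variables density of $\mathbf{x} = \mathbf{A}\mathbf{z} + \mathbf{b}$ with $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ against the known closed form $\mathcal{N}(\mathbf{b}, \mathbf{A}\mathbf{A}^\top)$:

```python
import torch

# Minimal check (illustration only): for x = A z + b with z ~ N(0, I),
# the change-of-variables density must equal the closed form N(b, A A^T).
torch.manual_seed(0)
d = 3
A = torch.randn(d, d) + 3.0 * torch.eye(d)   # arbitrary, well-conditioned, invertible
b = torch.randn(d)

base = torch.distributions.MultivariateNormal(torch.zeros(d), torch.eye(d))
target = torch.distributions.MultivariateNormal(b, A @ A.T)  # exact density of x

x = torch.randn(5, d)                          # arbitrary evaluation points
z = torch.linalg.solve(A, (x - b).T).T         # z = f^{-1}(x) = A^{-1}(x - b)

# log p_X(x) = log p_Z(z) - log|det A|   (the Jacobian of f is the constant matrix A)
log_px = base.log_prob(z) - torch.linalg.slogdet(A).logabsdet

print(torch.allclose(log_px, target.log_prob(x), atol=1e-4))  # expected: True
```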
Geometric Interpretation:
The Jacobian determinant measures how the transformation $f$ locally changes volumes: an infinitesimal volume element $d\mathbf{z}$ around $\mathbf{z}$ is mapped to a region of volume $|\det J_f(\mathbf{z})| \, d\mathbf{z}$ around $f(\mathbf{z})$. Where $|\det J_f| > 1$ the map locally expands space and the density must decrease; where $|\det J_f| < 1$ it compresses space and the density must increase, exactly mirroring the 1D stretching/compression intuition.
This volume-change interpretation explains why the density picks up the factor $|\det J_f(\mathbf{z})|^{-1}$ (probability mass is conserved while the volume carrying it changes), why $f$ must be invertible with a nowhere-vanishing Jacobian determinant, and why the structure of the Jacobian, summarized below, determines how cheaply the density can be evaluated.
| Transformation | Jacobian Matrix | Determinant | Complexity |
|---|---|---|---|
| $f(\mathbf{z}) = \mathbf{A}\mathbf{z} + \mathbf{b}$ (affine) | $\mathbf{A}$ | $\det(\mathbf{A})$ | $O(d^3)$ general, $O(d)$ if triangular |
| $f_i(z_i) = g(z_i)$ (element-wise) | Diagonal: $\text{diag}(g'(z_i))$ | $\prod_i g'(z_i)$ | $O(d)$ |
| Permutation $\mathbf{P}$ | $\mathbf{P}$ | $\pm 1$ | $O(1)$ |
| Rotation $\mathbf{R}$ | $\mathbf{R}$ | $1$ | $O(1)$ |
| Coupling layer | Block triangular | Product of one block's diagonal | $O(d)$ |
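To see where the coupling-layer row comes from, the sketch below builds a small, hypothetical affine coupling map (the conditioner weights `W_s`, `W_t` and the dimension are made up for illustration, not a published architecture), forms its full Jacobian with autograd, and checks that the general $O(d^3)$ log-determinant matches the $O(d)$ sum of log-scales:

```python
import torch

# Hypothetical minimal affine coupling map on R^4:
#   y[:2] = z[:2]
#   y[2:] = z[2:] * exp(s(z[:2])) + t(z[:2])
# Its Jacobian is block triangular, so log|det J| = sum(s) no matter what s and t are.
torch.manual_seed(0)
d_half = 2
W_s, W_t = torch.randn(d_half, d_half), torch.randn(d_half, d_half)

def coupling(z):
    z1, z2 = z[:d_half], z[d_half:]
    s, t = torch.tanh(W_s @ z1), W_t @ z1       # toy "conditioner networks"
    return torch.cat([z1, z2 * torch.exp(s) + t])

z = torch.randn(2 * d_half)
J = torch.autograd.functional.jacobian(coupling, z)   # full 4 x 4 Jacobian

log_det_general = torch.linalg.slogdet(J).logabsdet   # O(d^3) route through the full matrix
log_det_fast = torch.tanh(W_s @ z[:d_half]).sum()     # O(d) route: sum of log-scales s

print(torch.allclose(log_det_general, log_det_fast, atol=1e-4))  # expected: True
```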
In practice, we always work with log-determinants for numerical stability and computational convenience. The determinant of a Jacobian can be astronomically large or small, but its logarithm remains well-behaved.
Log-Determinant Properties:
Product rule: $\log |\det(\mathbf{AB})| = \log |\det(\mathbf{A})| + \log |\det(\mathbf{B})|$
Inverse rule: $\log |\det(\mathbf{A}^{-1})| = -\log |\det(\mathbf{A})|$
Triangular matrices: $\log |\det(\mathbf{T})| = \sum_i \log |T_{ii}|$
Diagonal matrices: $\log |\det(\mathbf{D})| = \sum_i \log |D_{ii}|$
These properties are essential for flows. The product rule allows us to sum log-determinants across composed layers. The triangular/diagonal rules show why such structures are computationally attractive.
```python
import torch
import torch.nn as nn

def log_det_triangular(L):
    """
    Compute log|det(L)| for triangular matrix L in O(d) time.

    Args:
        L: Triangular matrix [batch, d, d] or [d, d]
    Returns:
        log_det: Log absolute determinant [batch] or scalar
    """
    # Determinant = product of diagonal elements
    # Log determinant = sum of log absolute diagonal elements
    diag = torch.diagonal(L, dim1=-2, dim2=-1)
    return torch.sum(torch.log(torch.abs(diag)), dim=-1)

def log_det_diagonal(diag_elements):
    """
    Compute log|det(D)| for diagonal matrix represented by its diagonal.

    Args:
        diag_elements: Diagonal elements [batch, d]
    Returns:
        log_det: Log absolute determinant [batch]
    """
    return torch.sum(torch.log(torch.abs(diag_elements)), dim=-1)

def log_det_lu(A):
    """
    Compute log|det(A)| via LU decomposition in O(d³) time.
    This is the general method for arbitrary square matrices.

    Args:
        A: Square matrix [batch, d, d] or [d, d]
    Returns:
        log_det: Log absolute determinant
    """
    # PyTorch's slogdet computes sign and log|det|
    sign, logabsdet = torch.linalg.slogdet(A)
    return logabsdet

# Example: Verify triangular computation
def verify_log_det():
    d = 100
    # Create random lower triangular matrix
    L = torch.tril(torch.randn(d, d))

    # Efficient computation
    log_det_fast = log_det_triangular(L)

    # Direct computation for verification
    log_det_direct = torch.log(torch.abs(torch.det(L)))

    print(f"Triangular O(d): {log_det_fast.item():.6f}")
    print(f"Direct O(d³): {log_det_direct.item():.6f}")
    print(f"Match: {torch.allclose(log_det_fast, log_det_direct)}")

verify_log_det()
```

Always compute log|det(J)| directly rather than computing det(J) then taking the log. For high-dimensional matrices, the determinant can overflow or underflow even when its logarithm is perfectly reasonable. Most deep learning frameworks provide numerically stable log-determinant functions.
The power of normalizing flows comes from composing multiple simple transformations. The change of variables formula extends naturally to compositions.
Chain of Transformations:
Let $f = f_K \circ f_{K-1} \circ \cdots \circ f_1$, meaning $\mathbf{x} = f_K(f_{K-1}(\cdots f_1(\mathbf{z})))$.
Define intermediate variables: $\mathbf{z}_0 = \mathbf{z}$, $\mathbf{z}_k = f_k(\mathbf{z}_{k-1})$, $\mathbf{z}_K = \mathbf{x}$.
By the chain rule for Jacobians: $$J_f(\mathbf{z}) = J_{f_K}(\mathbf{z}_{K-1}) \cdot J_{f_{K-1}}(\mathbf{z}_{K-2}) \cdots J_{f_1}(\mathbf{z}_0)$$
Taking determinants and logarithms: $$\log |\det J_f(\mathbf{z})| = \sum_{k=1}^{K} \log |\det J_{f_k}(\mathbf{z}_{k-1})|$$
This additive decomposition means we can compute the total log-determinant by summing per-layer contributions during the forward or inverse pass—no need to ever form the full Jacobian matrix.
```python
import torch
import torch.nn as nn
from typing import List, Tuple

class ComposedFlow(nn.Module):
    """
    A normalizing flow as a composition of flow layers.
    """
    def __init__(self, layers: List[nn.Module]):
        super().__init__()
        self.layers = nn.ModuleList(layers)

    def forward(self, z: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """Forward: z -> x, accumulating log det Jacobian."""
        log_det_total = torch.zeros(z.shape[0], device=z.device)
        x = z
        for layer in self.layers:
            x, log_det = layer.forward(x)
            log_det_total += log_det  # Additive!
        return x, log_det_total

    def inverse(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """Inverse: x -> z, accumulating log det of inverse Jacobian."""
        log_det_total = torch.zeros(x.shape[0], device=x.device)
        z = x
        for layer in reversed(self.layers):  # Reverse order!
            z, log_det = layer.inverse(z)
            log_det_total += log_det
        return z, log_det_total

    def log_prob(self, x: torch.Tensor, base_dist) -> torch.Tensor:
        """Compute log p(x) using change of variables."""
        z, log_det_inverse = self.inverse(x)
        log_pz = base_dist.log_prob(z).sum(dim=-1)
        return log_pz + log_det_inverse
```

Implementing the change of variables formula in practice requires attention to several important details.
Forward vs. Inverse Direction:
Flows are parameterized in one direction, but in practice we need both: sampling pushes base-distribution draws through the forward map $\mathbf{x} = f(\mathbf{z})$, while density evaluation (and hence maximum-likelihood training) requires the inverse map $\mathbf{z} = f^{-1}(\mathbf{x})$ together with its log-determinant.
Some architectures (like autoregressive flows) are fast in one direction but slow in the other. Coupling layers are fast in both directions—a major advantage.
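As a minimal illustration of this two-direction interface, here is a sketch of an element-wise affine layer (the class name `ElementwiseAffine` and its parameterization are assumptions for illustration, not a published architecture). Its diagonal Jacobian makes both directions and both log-determinants $O(d)$, so it could plug directly into the `ComposedFlow` class above:

```python
import torch
import torch.nn as nn

class ElementwiseAffine(nn.Module):
    """Sketch: x_i = z_i * exp(log_scale_i) + shift_i, a diagonal-Jacobian flow layer."""
    def __init__(self, dim: int):
        super().__init__()
        self.log_scale = nn.Parameter(torch.zeros(dim))
        self.shift = nn.Parameter(torch.zeros(dim))

    def forward(self, z: torch.Tensor):
        x = z * torch.exp(self.log_scale) + self.shift
        # log|det J_f| = sum_i log_scale_i, the same for every sample in the batch
        log_det = self.log_scale.sum().expand(z.shape[0])
        return x, log_det

    def inverse(self, x: torch.Tensor):
        z = (x - self.shift) * torch.exp(-self.log_scale)
        # log|det J_{f^{-1}}| = -sum_i log_scale_i
        log_det = (-self.log_scale.sum()).expand(x.shape[0])
        return z, log_det
```

A flow could then be assembled as, e.g., `ComposedFlow([ElementwiseAffine(d) for _ in range(3)])`, though on its own such a stack is still just an affine map; the coupling layers discussed next supply the expressive, dimension-mixing transformations.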
Dimension Preservation:
The change of variables formula requires $f: \mathbb{R}^d \to \mathbb{R}^d$—same input and output dimensions. Flows that change dimension require modified formulations (e.g., augmented flows that pad dimensions, or factored architectures that marginalize dimensions).
When implementing flows, verify: (1) forward(inverse(x)) ≈ x (reconstruction), (2) log_det_forward(z) ≈ -log_det_inverse(f(z)) (determinant consistency), (3) likelihoods integrate to 1 (via importance sampling). These sanity checks catch many implementation bugs.
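The first two checks are easy to automate. The sketch below assumes a flow object exposing the `forward`/`inverse` interface used in the `ComposedFlow` example above (the normalization check via importance sampling is omitted for brevity):

```python
import torch

def run_sanity_checks(flow, dim, n=256, atol=1e-4):
    """Quick consistency checks for any flow whose forward/inverse return (output, log_det)."""
    z = torch.randn(n, dim)
    x, log_det_fwd = flow.forward(z)
    z_rec, log_det_inv = flow.inverse(x)

    # (1) Reconstruction: inverse(forward(z)) should recover z.
    print("reconstruction ok:", torch.allclose(z_rec, z, atol=atol))
    # (2) Determinant consistency: log|det J_f(z)| should equal -log|det J_{f^-1}(f(z))|.
    print("log-det consistency ok:", torch.allclose(log_det_fwd, -log_det_inv, atol=atol))

# Hypothetical usage with the sketches above:
# run_sanity_checks(ComposedFlow([ElementwiseAffine(8) for _ in range(3)]), dim=8)
```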
The change of variables formula is the mathematical foundation that makes normalizing flows possible. We've seen how it generalizes from the intuitive 1D case to the full multivariate setting, why the Jacobian determinant measures local volume change, and how log-determinants enable numerically stable computation.
With a solid understanding of the change of variables formula, we're now equipped to study the architectural innovations that make flows practical. Next, we'll explore coupling layers—the breakthrough that enabled flows to scale to high-dimensional data like images.