The gradient beautifully captures how a scalar function changes. But neural networks aren't just scalar functions—each layer transforms vectors into vectors. A linear layer maps ℝⁿ → ℝᵐ. An attention mechanism maps sequences to sequences. How do we describe the derivative of such vector-valued functions?
The answer is the Jacobian matrix—a matrix that contains all partial derivatives of all output components with respect to all input components. If the gradient is a vector of partial derivatives, the Jacobian is a matrix of partial derivatives.
The Jacobian isn't merely a theoretical construct. It's the mathematical object that propagates gradients through neural network layers. When you call loss.backward() in PyTorch, you're implicitly computing vector-Jacobian products. Understanding Jacobians means understanding backpropagation at a fundamental level.
By the end of this page, you will understand Jacobian matrices: their definition, geometric interpretation, computation, and role in the chain rule. You'll see how Jacobians enable backpropagation through arbitrary differentiable layers and understand the relationship between Jacobians, gradients, and total derivatives.
Consider a vector-valued function f: ℝⁿ → ℝᵐ that maps n-dimensional inputs to m-dimensional outputs:
$$\mathbf{f}(\mathbf{x}) = \begin{bmatrix} f_1(x_1, \ldots, x_n) \\ f_2(x_1, \ldots, x_n) \\ \vdots \\ f_m(x_1, \ldots, x_n) \end{bmatrix}$$
Definition (Jacobian Matrix):
The Jacobian matrix J of f at point x is the m × n matrix of all first-order partial derivatives:
$$\mathbf{J} = \frac{\partial \mathbf{f}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$$
Matrix Dimensions: J has one row per output component and one column per input component, so J ∈ ℝᵐˣⁿ for f: ℝⁿ → ℝᵐ.
Row Interpretation:
Row i of J is the gradient of the i-th output component fᵢ:
$$\text{Row } i = \nabla f_i = \left[ \frac{\partial f_i}{\partial x_1}, \ldots, \frac{\partial f_i}{\partial x_n} \right]$$
Column Interpretation:
Column j contains all the effects of perturbing xⱼ on every output.
When m = 1 (scalar output), the Jacobian is a 1 × n matrix—a row vector. This is the transpose of the gradient: J = (∇f)ᵀ. Some conventions define the gradient as a row vector, in which case gradient equals Jacobian for scalar functions.
Examples:
Example 1: Linear Transformation
For f(x) = Ax + b where A is an m × n matrix:
$$\mathbf{J} = \mathbf{A}$$
The Jacobian of a linear function is simply the matrix itself!
Example 2: Element-wise Function
For f(x) = [σ(x₁), σ(x₂), ..., σ(xₙ)]ᵀ where σ is applied element-wise:
$$\mathbf{J} = \text{diag}(\sigma'(x_1), \sigma'(x_2), \ldots, \sigma'(x_n))$$
The Jacobian is diagonal because output i depends only on input i.
Example 3: Softmax Function
For softmax(z) where (softmax)ᵢ = exp(zᵢ)/Σⱼexp(zⱼ):
$$J_{ij} = \begin{cases} s_i(1 - s_i) & \text{if } i = j \\ -s_i s_j & \text{if } i \neq j \end{cases}$$
where sᵢ = softmax(z)ᵢ. This is a full (non-diagonal) matrix because each output depends on all inputs.
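These three Jacobians are easy to check numerically. The sketch below uses a small finite-difference helper of our own (`numerical_jacobian` is not a library function) to compare each analytic formula against central differences:

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Estimate the Jacobian of f at x with central finite differences."""
    y = f(x)
    J = np.zeros((y.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return J

rng = np.random.default_rng(0)
x = rng.normal(size=4)

# Example 1: linear map f(x) = Ax + b  ->  J = A
A, b = rng.normal(size=(3, 4)), rng.normal(size=3)
assert np.allclose(numerical_jacobian(lambda v: A @ v + b, x), A, atol=1e-5)

# Example 2: element-wise sigmoid  ->  J = diag(sigma'(x))
sigmoid = lambda v: 1 / (1 + np.exp(-v))
J_sig = numerical_jacobian(sigmoid, x)
assert np.allclose(J_sig, np.diag(sigmoid(x) * (1 - sigmoid(x))), atol=1e-5)

# Example 3: softmax  ->  J = diag(s) - s s^T
def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

s = softmax(x)
assert np.allclose(numerical_jacobian(softmax, x), np.diag(s) - np.outer(s, s), atol=1e-5)
print("All three analytic Jacobians match finite differences.")
```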
The Jacobian has a beautiful geometric meaning: it describes how a function locally transforms space.
Linear Approximation:
For small perturbations Δx around a point x₀:
$$\mathbf{f}(\mathbf{x}_0 + \Delta\mathbf{x}) \approx \mathbf{f}(\mathbf{x}_0) + \mathbf{J}|_{\mathbf{x}_0} \cdot \Delta\mathbf{x}$$
The Jacobian is the coefficient matrix of the best linear approximation to f near x₀. Just as the derivative f'(x) gives the slope of the tangent line to a curve, the Jacobian gives the linear map that best approximates f locally.
Transformation of Shapes:
Consider what happens to an infinitesimal sphere around x₀: to first order, the Jacobian maps it to an ellipsoid. The singular values of J give the stretching factors along the ellipsoid's principal axes, and the singular vectors give the directions of those axes.
Volume Scaling:
The Jacobian also tells us how volumes scale under the transformation:
$$\text{Volume scaling factor} = |\det(\mathbf{J})| \quad \text{(when } m = n \text{)}$$
This is crucial for change of variables in integration (the famous |det J| factor).
Invertibility:
If J is square (m = n) and non-singular (det J ≠ 0), the Inverse Function Theorem guarantees that f is locally invertible near that point.
```python
import numpy as np
import matplotlib.pyplot as plt

def f(x):
    """
    A nonlinear transformation R² → R².
    f(x, y) = (x² - y², 2xy)   [related to z² in the complex plane]
    """
    return np.array([x[0]**2 - x[1]**2, 2*x[0]*x[1]])

def jacobian_f(x):
    """
    Jacobian of f at point x.
    J = [[∂f₁/∂x, ∂f₁/∂y],      [[2x, -2y],
         [∂f₂/∂x, ∂f₂/∂y]]  =    [2y,  2x]]
    """
    return np.array([
        [2*x[0], -2*x[1]],
        [2*x[1],  2*x[0]]
    ])

# Create a circle of points
theta = np.linspace(0, 2*np.pi, 100)
circle_radius = 0.3

# Choose a base point
base_point = np.array([1.5, 0.5])

# Points on circle centered at base_point
circle_points = np.array([
    base_point + circle_radius * np.array([np.cos(t), np.sin(t)])
    for t in theta
])

# Transform circle through f
transformed_points = np.array([f(p) for p in circle_points])
transformed_center = f(base_point)

# Get Jacobian at base point
J = jacobian_f(base_point)
print("Jacobian at (1.5, 0.5):")
print(J)
print(f"\nDeterminant: {np.linalg.det(J):.4f}")
print(f"(Volume scaling: {abs(np.linalg.det(J)):.4f})")

# Singular value decomposition
U, S, Vt = np.linalg.svd(J)
print(f"\nSingular values: {S}")
print("(These are the stretching factors along principal axes)")

# Compute expected ellipse from Jacobian
# Circle -> J @ circle -> ellipse with axes = singular values
jacobian_circle = np.array([
    J @ (circle_radius * np.array([np.cos(t), np.sin(t)]))
    for t in theta
])
jacobian_ellipse = transformed_center + jacobian_circle

# Plotting
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Plot 1: Original circle
ax1 = axes[0]
ax1.plot(circle_points[:, 0], circle_points[:, 1], 'b-', linewidth=2)
ax1.plot(base_point[0], base_point[1], 'ro', markersize=10)
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_title('Original: Circle in Input Space')
ax1.set_aspect('equal')
ax1.grid(True, alpha=0.3)

# Plot 2: Transformed (actual nonlinear)
ax2 = axes[1]
ax2.plot(transformed_points[:, 0], transformed_points[:, 1], 'g-',
         linewidth=2, label='True transformation')
ax2.plot(jacobian_ellipse[:, 0], jacobian_ellipse[:, 1], 'r--',
         linewidth=2, label='Linear approximation (Jacobian)')
ax2.plot(transformed_center[0], transformed_center[1], 'ro', markersize=10)
ax2.set_xlabel('u')
ax2.set_ylabel('v')
ax2.set_title('Transformed Circle → Approximate Ellipse')
ax2.legend()
ax2.set_aspect('equal')
ax2.grid(True, alpha=0.3)

# Plot 3: Show how grid transforms
ax3 = axes[2]
x_grid = np.linspace(0.5, 2.5, 10)
y_grid = np.linspace(-0.5, 1.5, 10)

# Draw original grid lines and their images
for x_val in x_grid:
    y_line = np.linspace(-0.5, 1.5, 50)
    points = np.array([[x_val, y] for y in y_line])
    transformed = np.array([f(p) for p in points])
    ax3.plot(transformed[:, 0], transformed[:, 1], 'b-', alpha=0.5)

for y_val in y_grid:
    x_line = np.linspace(0.5, 2.5, 50)
    points = np.array([[x, y_val] for x in x_line])
    transformed = np.array([f(p) for p in points])
    ax3.plot(transformed[:, 0], transformed[:, 1], 'r-', alpha=0.5)

ax3.set_xlabel('u')
ax3.set_ylabel('v')
ax3.set_title('Grid Transformation (Jacobian varies by location)')
ax3.set_aspect('equal')
ax3.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('jacobian_geometry.png', dpi=150, bbox_inches='tight')
plt.show()

print("\nKey insight: The Jacobian provides the best linear approximation")
print("to the transformation at each point. Circles map roughly to")
print("ellipses determined by the Jacobian's singular values.")
```

Think of the Jacobian as answering: "If I zoom in enough at this point, what linear transformation does f look like?"
This local linearization is the foundation of gradient-based optimization—we approximate curved loss surfaces with their tangent planes.
The chain rule for Jacobians is the mathematical foundation of backpropagation. When functions compose, their Jacobians multiply.
Theorem (Chain Rule for Jacobians):
If g: ℝⁿ → ℝᵐ and f: ℝᵐ → ℝᵖ, then the Jacobian of the composition f ∘ g is:
$$\mathbf{J}_{\mathbf{f} \circ \mathbf{g}} = \mathbf{J}_{\mathbf{f}} \cdot \mathbf{J}_{\mathbf{g}}$$
Dimension Check: J_f is p × m and J_g is m × n, so their product J_f · J_g is p × n, exactly the right shape for the Jacobian of a map ℝⁿ → ℝᵖ.
Visualizing the Chain:
```
          J_g              J_f
  ℝⁿ ─────────→ ℝᵐ ─────────→ ℝᵖ
        (m×n)          (p×m)
   │                          │
   └──────────────────────────┘
  ℝⁿ ──────────────────────→ ℝᵖ
        J_{f∘g} = J_f · J_g
              (p×n)
```
Multi-Layer Composition:
For a deep composition h = f_L ∘ f_{L-1} ∘ ... ∘ f_1:
$$\mathbf{J}_{\mathbf{h}} = \mathbf{J}_{\mathbf{f}_L} \cdot \mathbf{J}_{\mathbf{f}_{L-1}} \cdots \mathbf{J}_{\mathbf{f}_1}$$
This product of Jacobian matrices is exactly what backpropagation computes!
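Here is a minimal check of this claim, using `torch.autograd.functional.jacobian` on a toy two-layer composition (the dimensions and the tanh nonlinearity are arbitrary choices for illustration):

```python
import torch
from torch.autograd.functional import jacobian

torch.manual_seed(0)
W1 = torch.randn(3, 4)   # layer 1: R^4 -> R^3
W2 = torch.randn(2, 3)   # layer 2: R^3 -> R^2

g = lambda x: torch.tanh(W1 @ x)    # inner function g
f = lambda h: W2 @ h                # outer function f
h_of_x = lambda x: f(g(x))          # composition f ∘ g

x = torch.randn(4)

J_g = jacobian(g, x)        # shape (3, 4)
J_f = jacobian(f, g(x))     # shape (2, 3), evaluated at g(x)
J_h = jacobian(h_of_x, x)   # shape (2, 4)

# Chain rule: J_{f∘g}(x) = J_f(g(x)) @ J_g(x)
print(torch.allclose(J_h, J_f @ J_g, atol=1e-5))  # True
```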
| Notation Style | Forward Pass | Chain Rule |
|---|---|---|
| Scalar calculus | y = f(g(x)) | dy/dx = (df/dg)(dg/dx) |
| Jacobian matrix | y = f(g(x)) | J_y = J_f · J_g |
| Index notation | yᵢ = fᵢ(g₁, ..., gₘ) | ∂yᵢ/∂xⱼ = Σₖ (∂fᵢ/∂gₖ)(∂gₖ/∂xⱼ) |
| Deep learning | h = layer₂(layer₁(x)) | dL/dx = (dL/dh₂)(dh₂/dh₁)(dh₁/dx) |
The chain rule multiplies Jacobians in reverse order of the forward pass. The Jacobian of the outermost function (last layer) goes on the left. This corresponds to backpropagation going from loss backward through the network.
In practice, we rarely compute full Jacobian matrices. They're too large—a layer mapping ℝ¹⁰⁰⁰ → ℝ¹⁰⁰⁰ has a million-entry Jacobian! Instead, backpropagation computes Vector-Jacobian Products (VJPs).
The Key Insight:
To minimize a scalar loss L with respect to parameters, we need ∇L—a vector. We don't need the full Jacobians of intermediate layers. We only need how the loss gradient propagates through each layer.
VJP Definition:
Given a vector v ∈ ℝᵐ and a Jacobian J ∈ ℝᵐˣⁿ, the VJP is:
$$\text{VJP}(\mathbf{v}, \mathbf{J}) = \mathbf{v}^\top \mathbf{J} \in \mathbb{R}^n$$
This is a row vector of dimension n (the input dimension).
Backpropagation as Chained VJPs:
For a composition y = f₃(f₂(f₁(x))) with scalar loss L(y), backpropagation starts from the vector v = ∂L/∂y and applies one VJP per layer, right to left: first vᵀJ₃, then (vᵀJ₃)J₂, then ((vᵀJ₃)J₂)J₁ = ∇ₓL. At every step only a vector is carried backward; no intermediate Jacobian is ever materialized.
Computational Advantage:
Each VJP costs O(m·n) operations—the same as matrix-vector multiplication. Computing the full Jacobian would cost O(m·n) per column, totaling O(m·n²). VJPs are asymptotically cheaper when we only need the final gradient.
Dual: Jacobian-Vector Products (JVP):
The dual operation Jv, multiplying the Jacobian by a vector on the right, is a Jacobian-Vector Product (JVP), used in forward-mode automatic differentiation. Different use cases favor different modes: reverse mode (VJPs) wins when there are many inputs and few outputs, the typical deep learning setting of a scalar loss and millions of parameters, while forward mode (JVPs) wins when there are few inputs and many outputs.
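Both products are available directly in PyTorch via `torch.autograd.functional`; the sketch below (toy dimensions, an arbitrary tanh layer) shows the shapes involved in each mode:

```python
import torch
from torch.autograd.functional import vjp, jvp

torch.manual_seed(0)
W = torch.randn(5, 3)
f = lambda x: torch.tanh(W @ x)   # f: R^3 -> R^5

x = torch.randn(3)

# Reverse mode: v^T J  (v lives in the output space, result in the input space)
v_out = torch.randn(5)
y, vjp_result = vjp(f, x, v_out)   # vjp_result has shape (3,)

# Forward mode: J v  (v lives in the input space, result in the output space)
v_in = torch.randn(3)
y, jvp_result = jvp(f, x, v_in)    # jvp_result has shape (5,)

print(vjp_result.shape, jvp_result.shape)  # torch.Size([3]) torch.Size([5])
```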
```python
import numpy as np

def vjp_linear_layer(v, W, x):
    """
    Vector-Jacobian Product for linear layer: y = Wx + b

    The Jacobian of y w.r.t. x is just W (shape: output_dim × input_dim).
    The Jacobian of y w.r.t. W is more complex (involves outer products).

    Parameters:
    - v: incoming gradient dL/dy, shape (output_dim,)
    - W: weight matrix, shape (output_dim, input_dim)
    - x: input vector, shape (input_dim,)

    Returns:
    - v_x: gradient dL/dx = v^T @ W, shape (input_dim,)
    - v_W: gradient dL/dW = outer(v, x), shape (output_dim, input_dim)
    """
    # VJP with respect to input: v^T @ J_x = v^T @ W
    v_x = v @ W  # shape (input_dim,)

    # VJP with respect to weights: each W_ij affects y_i as W_ij * x_j
    # So dL/dW_ij = (dL/dy_i) * x_j = v_i * x_j
    v_W = np.outer(v, x)  # shape (output_dim, input_dim)

    return v_x, v_W

def vjp_relu(v, z):
    """
    VJP for ReLU: y = max(0, z)

    Jacobian is diagonal: J_ii = 1 if z_i > 0, else 0.
    So the VJP is just element-wise multiplication.
    """
    return v * (z > 0).astype(float)

def demo_backpropagation_as_vjp():
    """
    Demonstrate backprop as a sequence of VJPs.

    Network: x -> W1 -> ReLU -> W2 -> y -> MSE Loss
    """
    np.random.seed(42)

    # Network dimensions
    input_dim = 4
    hidden_dim = 3
    output_dim = 2

    # Initialize weights
    W1 = np.random.randn(hidden_dim, input_dim) * 0.5
    W2 = np.random.randn(output_dim, hidden_dim) * 0.5

    # Sample input and target
    x = np.array([1.0, 2.0, -1.0, 0.5])
    target = np.array([1.0, -1.0])

    # ========== FORWARD PASS ==========
    print("="*60)
    print("FORWARD PASS (computing intermediate values)")
    print("="*60)

    # Layer 1: Linear
    z1 = W1 @ x
    print(f"z1 = W1 @ x: {z1}")

    # Layer 2: ReLU
    h1 = np.maximum(0, z1)
    print(f"h1 = ReLU(z1): {h1}")

    # Layer 3: Linear
    y = W2 @ h1
    print(f"y = W2 @ h1: {y}")

    # Loss: MSE
    loss = 0.5 * np.sum((y - target)**2)
    print(f"\nLoss = 0.5 * ||y - target||² = {loss:.6f}")

    # ========== BACKWARD PASS (VJPs) ==========
    print("\n" + "="*60)
    print("BACKWARD PASS (VJP chain)")
    print("="*60)

    # Gradient of loss w.r.t. prediction
    dL_dy = y - target  # shape (output_dim,)
    print(f"\n1. dL/dy = y - target = {dL_dy}")

    # VJP through W2 layer
    dL_dh1, dL_dW2 = vjp_linear_layer(dL_dy, W2, h1)
    print(f"\n2. VJP through W2:")
    print(f"   dL/dh1 = dL/dy @ W2 = {dL_dh1}")
    print(f"   dL/dW2 = outer(dL/dy, h1):\n{dL_dW2}")

    # VJP through ReLU
    dL_dz1 = vjp_relu(dL_dh1, z1)
    print(f"\n3. VJP through ReLU:")
    print(f"   dL/dz1 = dL/dh1 ⊙ (z1 > 0) = {dL_dz1}")

    # VJP through W1 layer
    dL_dx, dL_dW1 = vjp_linear_layer(dL_dz1, W1, x)
    print(f"\n4. VJP through W1:")
    print(f"   dL/dx = dL/dz1 @ W1 = {dL_dx}")
    print(f"   dL/dW1 = outer(dL/dz1, x):\n{dL_dW1}")

    print("\n" + "="*60)
    print("SUMMARY: We computed gradients for ALL parameters using VJPs")
    print("="*60)
    print(f"dL/dW2 shape: {dL_dW2.shape} (matches W2)")
    print(f"dL/dW1 shape: {dL_dW1.shape} (matches W1)")
    print("\nThis is exactly what autograd frameworks do internally!")

demo_backpropagation_as_vjp()
```

A single layer in a transformer might have dimensions 4096 → 4096. Its Jacobian has 16 million entries. But the VJP only requires multiplying a 4096-vector by the Jacobian operator—much cheaper. This efficiency is why reverse-mode autodiff (backprop) dominates deep learning.
Let's derive the Jacobians for the most common neural network components. Understanding these builds intuition for gradient flow.
1. Linear/Dense Layer: y = Wx + b
$$\mathbf{J}_{\mathbf{y} \leftarrow \mathbf{x}} = \mathbf{W}$$
The Jacobian is simply the weight matrix. This explains why weight magnitude affects gradient magnitude—large weights amplify gradients (see the short demo after this catalog).
2. Element-wise Activation: y = σ(x)
$$\mathbf{J} = \text{diag}(\sigma'(x_1), \sigma'(x_2), \ldots, \sigma'(x_n))$$
Diagonal matrix because each output depends only on corresponding input.
Specific activations: ReLU has σ′(x) ∈ {0, 1}; sigmoid has σ′(x) = σ(x)(1 − σ(x)) ≤ 0.25; tanh has σ′(x) = 1 − tanh²(x) ≤ 1.
3. Softmax: sᵢ = exp(zᵢ)/Σⱼexp(zⱼ)
$$J_{ij} = \begin{cases} s_i(1 - s_i) & i = j \\ -s_i s_j & i \neq j \end{cases} = s_i(\delta_{ij} - s_j)$$
Equivalently: J = diag(s) − ssᵀ
4. Batch Normalization: (simplified, ignoring running stats)
For y = γ · (x - μ)/σ + β:
The Jacobian is more complex because the batch mean and variance themselves depend on every input in the batch. See the original batch normalization paper (Ioffe & Szegedy, 2015) for the full derivation.
5. Dropout: (during training with mask m)
$$\mathbf{J} = \frac{1}{1-p} \cdot \text{diag}(m_1, m_2, \ldots, m_n)$$
Diagonal with entries 0 or 1/(1-p), stochastic per forward pass.
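To make the "large weights amplify gradients" note under the linear layer concrete, here is a small NumPy sketch (layer count, width, and weight scales are arbitrary) that pushes a unit gradient backward through a stack of random linear layers via repeated VJPs:

```python
import numpy as np

rng = np.random.default_rng(0)
depth, dim = 20, 64

def grad_norm_through_stack(weight_scale):
    """Push a unit gradient backward through `depth` random linear layers.

    Each step is the VJP v <- v @ W, so the gradient norm scales roughly like
    weight_scale ** depth: exploding for scales > 1, vanishing for scales < 1.
    """
    v = np.ones(dim) / np.sqrt(dim)                      # unit incoming gradient
    for _ in range(depth):
        W = rng.normal(size=(dim, dim)) * weight_scale / np.sqrt(dim)
        v = v @ W                                        # VJP through y = Wx
    return np.linalg.norm(v)

for scale in (0.5, 1.0, 2.0):
    print(f"weight scale {scale}: ||gradient|| after {depth} layers ≈ "
          f"{grad_norm_through_stack(scale):.2e}")
```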
| Layer | Jacobian Structure | Memory for VJP | Gradient Issues |
|---|---|---|---|
| Linear (m×n weights) | Dense m×n (= W) | O(m×n) | Explodes/vanishes with W magnitude |
| ReLU | Diagonal 0/1 | O(n) mask | Dead neurons (zeros) |
| Sigmoid | Diagonal in (0, 0.25) | O(n) activations | Vanishing (max derivative = 0.25) |
| Softmax | Dense n×n | O(n) softmax output | Full coupling between outputs |
| BatchNorm | Nearly diagonal | O(n) | Stabilizes gradient flow |
| Residual (x + f(x)) | I + J_f | O(n) | Identity helps gradient flow |
Residual connections y = x + f(x) have Jacobian J = I + J_f. Even if J_f vanishes (deep network), the identity I ensures gradients flow unimpeded. This is why ResNets train so much more easily than plain deep networks.
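A quick autograd check of this claim (toy dimensions; the tanh branch and the 0.01 weight scale are arbitrary choices that make J_f nearly vanish):

```python
import torch
from torch.autograd.functional import jacobian

torch.manual_seed(0)
dim = 4
W = torch.randn(dim, dim) * 0.01     # tiny weights: J_f nearly vanishes

f = lambda x: torch.tanh(W @ x)      # plain branch
residual = lambda x: x + f(x)        # residual block y = x + f(x)

x = torch.randn(dim)
J_f = jacobian(f, x)
J_res = jacobian(residual, x)

print(torch.allclose(J_res, torch.eye(dim) + J_f, atol=1e-6))  # True
print("||J_f||   =", torch.linalg.matrix_norm(J_f).item())     # ≈ 0 (branch alone would kill gradients)
print("||J_res|| =", torch.linalg.matrix_norm(J_res).item())   # ≈ ||I|| = 2 (identity keeps gradients flowing)
```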
When the Jacobian is square (n × n), its determinant carries important information about the transformation.
Definition:
The Jacobian determinant of f: ℝⁿ → ℝⁿ at x is:
$$\det\left(\mathbf{J}_\mathbf{f}(\mathbf{x})\right) = \det\left( \frac{\partial \mathbf{f}}{\partial \mathbf{x}} \right)$$
Geometric Meaning:
|det J| measures how infinitesimal volumes scale under the transformation:
$$\text{Volume}(\mathbf{f}(S)) \approx |\det \mathbf{J}| \cdot \text{Volume}(S)$$
for small regions S.
Implications: det J = 0 means the map locally collapses volume and is not invertible there; |det J| > 1 expands volumes while |det J| < 1 contracts them; and the sign of det J tells you whether orientation is preserved (positive) or flipped (negative).
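A small numerical check of the volume-scaling interpretation, reusing the map f(x, y) = (x² − y², 2xy) from the earlier plotting example (the base point and square size below are arbitrary):

```python
import numpy as np

f = lambda p: np.array([p[0]**2 - p[1]**2, 2 * p[0] * p[1]])
jac = lambda p: np.array([[2*p[0], -2*p[1]], [2*p[1], 2*p[0]]])

p0 = np.array([1.5, 0.5])
eps = 1e-3

# Map a tiny square of side 2*eps centered at p0 and measure the area of its
# image (nearly a parallelogram) with the shoelace formula on its corners.
corners = p0 + eps * np.array([[-1, -1], [1, -1], [1, 1], [-1, 1]])
img = np.array([f(c) for c in corners])
x, y = img[:, 0], img[:, 1]
area_img = 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

area_src = (2 * eps) ** 2
print("measured area scaling:", area_img / area_src)      # ≈ 10.0
print("|det J| at p0:        ", abs(np.linalg.det(jac(p0))))  # 4(x² + y²) = 10.0
```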
Change of Variables (Integration):
When integrating after a change of variables x = g(u):
$$\int f(\mathbf{x}) \, d\mathbf{x} = \int f(\mathbf{g}(\mathbf{u})) \, |\det \mathbf{J}_\mathbf{g}| \, d\mathbf{u}$$
The |det J| factor compensates for volume distortion.
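A concrete instance: integrating a Gaussian bump over a disk, once on a Cartesian grid and once after the polar substitution x = r cos θ, y = r sin θ, whose Jacobian determinant is r (the grid resolutions below are arbitrary):

```python
import numpy as np

# Integrate exp(-(x² + y²)) over the disk of radius R in two ways.
R = 2.0
n = 1001

# (a) Cartesian Riemann sum over a grid, keeping only points inside the disk
xs = np.linspace(-R, R, n)
dx = xs[1] - xs[0]
X, Y = np.meshgrid(xs, xs)
inside = X**2 + Y**2 <= R**2
cartesian = np.sum(np.exp(-(X**2 + Y**2)) * inside) * dx * dx

# (b) Polar substitution: the |det J| = r factor appears in the integrand
rs = np.linspace(0, R, n)
dr = rs[1] - rs[0]
polar = 2 * np.pi * np.sum(np.exp(-rs**2) * rs) * dr

exact = np.pi * (1 - np.exp(-R**2))
print(cartesian, polar, exact)   # all ≈ 3.08
```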
Application: Normalizing Flows
Normalizing flows model complex probability distributions by transforming a simple base distribution (like Gaussian) through invertible neural networks. The log-probability is:
$$\log p(\mathbf{x}) = \log p(\mathbf{z}) - \log|\det \mathbf{J}|$$
Efficiently computing log|det J| is a major research area, leading to architectures like RealNVP, Glow, and Neural ODEs.
Computing det(n×n matrix) costs O(n³). But for special structures—diagonal, triangular, or near-identity—it's O(n). Normalizing flow architectures like RealNVP use triangular Jacobians precisely for this computational efficiency.
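A minimal sketch of the idea, using the simplest possible flow: an element-wise affine map whose Jacobian is diagonal, so log|det J| is just a sum of logs (a real flow such as RealNVP uses triangular coupling layers, but the bookkeeping is the same):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 5

# Toy invertible "flow": element-wise affine map x = a ⊙ z + b.
# Its Jacobian w.r.t. z is diag(a), so log|det J| = Σ log|a_i|: O(n), not O(n³).
a = rng.uniform(0.5, 2.0, size=dim)
b = rng.normal(size=dim)

def log_prob(x):
    """log p(x) = log p(z) - log|det J|, with a standard Gaussian base p(z)."""
    z = (x - b) / a                                        # invert the flow
    log_pz = -0.5 * np.sum(z**2) - 0.5 * dim * np.log(2 * np.pi)
    log_det = np.sum(np.log(np.abs(a)))                    # diagonal shortcut
    return log_pz - log_det

# The O(n) diagonal formula matches the generic O(n³) determinant computation
sign, logdet_full = np.linalg.slogdet(np.diag(a))
print(np.isclose(np.sum(np.log(np.abs(a))), logdet_full))  # True

x = a * rng.normal(size=dim) + b       # a sample from the flow
print("log p(x) =", log_prob(x))
```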
Understanding the computational aspects of Jacobians is crucial for efficient deep learning implementation.
Memory vs. Compute Tradeoffs:
Backpropagation requires storing intermediate activations for the backward pass:
| Approach | Forward Memory | Backward Compute | Use Case |
|---|---|---|---|
| Store all | O(L × batch × dim) | O(1) recompute | Standard training |
| Checkpoint | O(√L × batch × dim) | O(L) recompute | Memory-limited |
| Recompute all | O(batch × dim) | O(L²) recompute | Extreme memory limit |
Gradient Checkpointing:
Instead of storing all activations, store checkpoints only every ~√L layers and recompute the rest during the backward pass. This trades roughly one extra forward pass (about 2× forward compute) for a ~√L reduction in activation memory.
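A hedged sketch using PyTorch's torch.utils.checkpoint utilities (layer sizes are illustrative, and the exact keyword arguments vary slightly across PyTorch versions):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# A deep stack of small blocks; sizes are purely illustrative.
blocks = nn.Sequential(*[nn.Sequential(nn.Linear(256, 256), nn.ReLU())
                         for _ in range(16)])
x = torch.randn(32, 256, requires_grad=True)

# Split the stack into 4 segments (≈ √L): only segment boundaries are stored,
# and activations inside each segment are recomputed during the backward pass.
out = checkpoint_sequential(blocks, 4, x, use_reentrant=False)
out.pow(2).mean().backward()

print(x.grad.shape)   # gradients match the non-checkpointed computation
```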
Mixed Precision:
Using FP16 for forward/backward passes halves memory and speeds up matrix operations (on supported hardware), with FP32 for accumulation to maintain numerical stability.
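A minimal sketch of the pattern, assuming PyTorch's autocast API (shown here with CPU bfloat16 so it runs anywhere; on CUDA one would typically use float16 autocast together with a gradient scaler):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 10))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x, target = torch.randn(64, 128), torch.randint(0, 10, (64,))

# Low-precision forward pass under autocast; parameters remain FP32 "master" weights.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = F.cross_entropy(model(x), target)

loss.backward()   # gradients accumulate into the FP32 parameters
opt.step()

print(model[0].weight.dtype)   # torch.float32: weights stay full precision
```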
Jacobian-Free Methods:
Some applications need Jacobian-vector (or Hessian-vector) products without explicit Jacobians: Hessian-vector products for second-order optimization, Krylov-subspace solvers such as conjugate gradient, and adjoint methods for Neural ODEs.
These use clever tricks to compute the product without forming the full matrix.
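For example, a Hessian-vector product can be computed with two backward passes and never materializes the n × n Hessian; the cross-check against torch.autograd.functional.hessian below is only feasible because the toy dimension is small (the loss function here is arbitrary):

```python
import torch
from torch.autograd.functional import hessian

def loss_fn(w):
    return torch.sum(torch.sin(w) * w**2)     # arbitrary smooth scalar loss

w = torch.randn(50, requires_grad=True)
v = torch.randn(50)

# Hessian-vector product via double backward: g = ∇L(w), then Hv = ∇_w (g · v).
# Only vector-sized quantities flow through autograd; H itself is never formed.
g, = torch.autograd.grad(loss_fn(w), w, create_graph=True)
Hv, = torch.autograd.grad(g @ v, w)

# Cross-check against the explicit Hessian (feasible only because n = 50 is tiny)
H = hessian(loss_fn, w.detach())
print(torch.allclose(Hv, H @ v, atol=1e-4))   # True
```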
```python
import numpy as np
import torch
import torch.nn.functional as F

def compute_full_jacobian(f, x):
    """
    Compute the full Jacobian matrix using autograd.

    Warning: This is expensive! O(output_dim) backward passes.
    Use only for small dimensions or debugging.
    """
    x = x.clone().requires_grad_(True)
    y = f(x)

    output_dim = y.numel()
    input_dim = x.numel()

    jacobian = torch.zeros(output_dim, input_dim)

    for i in range(output_dim):
        # Zero gradients
        if x.grad is not None:
            x.grad.zero_()

        # Compute gradient of y[i] w.r.t. x
        y_flat = y.view(-1)
        y_flat[i].backward(retain_graph=True)

        jacobian[i] = x.grad.view(-1).clone()

    return jacobian

def compute_vjp_efficient(f, x, v):
    """
    Compute the VJP v^T @ J efficiently using a single backward pass.
    This is what autograd does internally.
    """
    x = x.clone().requires_grad_(True)
    y = f(x)

    # v is the "incoming gradient" from upstream
    # The backward pass computes v^T @ J
    y.backward(v)

    return x.grad.clone()

# Example: Softmax Jacobian
def softmax_function(z):
    return F.softmax(z, dim=0)

# Small example for visualization
z = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
s = softmax_function(z)

print("Input logits z:", z.detach().numpy())
print("Softmax output s:", s.detach().numpy())

# Compute full Jacobian
J = compute_full_jacobian(softmax_function, z.detach())
print("\nFull Softmax Jacobian:")
print(J.numpy().round(4))

# Verify: J = diag(s) - s s^T
s_np = s.detach().numpy()
expected_J = np.diag(s_np) - np.outer(s_np, s_np)
print("\nExpected (diag(s) - s @ s^T):")
print(expected_J.round(4))

# VJP example
v = torch.tensor([1.0, 0.0, 0.0])  # gradient from loss w.r.t. first output
vjp_result = compute_vjp_efficient(softmax_function, z.detach(), v)
print(f"\nVJP with v = {v.numpy()}:")
print(f"Result: {vjp_result.numpy().round(4)}")
print(f"Verify (v @ J): {(v.numpy() @ J.numpy()).round(4)}")

# Cost comparison
print("\n" + "="*60)
print("COST COMPARISON")
print("="*60)
print(f"Full Jacobian: {J.shape[0]} backward passes")
print(f"VJP: 1 backward pass")
print("For a 1000-dim layer: 1000x speedup with VJP!")
```

The Jacobian matrix generalizes derivatives to vector-valued functions, enabling us to understand and compute gradients through complex neural network architectures.
Key Concepts: the Jacobian is the m × n matrix of all first-order partial derivatives and the best linear approximation to f near a point; its rows are the gradients of individual output components; composed functions have Jacobians that multiply (the chain rule), which is exactly what backpropagation implements; VJPs let us propagate gradients without ever materializing full Jacobians; and the Jacobian determinant measures local volume scaling, powering change of variables and normalizing flows.
What's Next:
Gradients and Jacobians capture first-order information—how functions change linearly near a point. But loss landscapes are curved, not flat. The next page introduces Taylor series expansion, which captures higher-order curvature information. This leads to understanding the Hessian matrix and second-order optimization methods that can outperform gradient descent.
You now understand Jacobian matrices—the complete derivative for vector-valued functions. You see how the chain rule for Jacobians underlies backpropagation, and why VJPs make gradient computation tractable. Next: Taylor series and second-order approximations.