The gradient beautifully captures how a scalar function changes. But neural networks aren't just scalar functions—each layer transforms vectors into vectors. A linear layer maps ℝⁿ → ℝᵐ. An attention mechanism maps sequences to sequences. How do we describe the derivative of such vector-valued functions?
The answer is the Jacobian matrix—a matrix that contains all partial derivatives of all output components with respect to all input components. If the gradient is a vector of partial derivatives, the Jacobian is a matrix of partial derivatives.
The Jacobian isn't merely a theoretical construct. It's the mathematical object that propagates gradients through neural network layers. When you call loss.backward() in PyTorch, you're implicitly computing vector-Jacobian products. Understanding Jacobians means understanding backpropagation at a fundamental level.
By the end of this page, you will understand Jacobian matrices: their definition, geometric interpretation, computation, and role in the chain rule. You'll see how Jacobians enable backpropagation through arbitrary differentiable layers and understand the relationship between Jacobians, gradients, and total derivatives.
Consider a vector-valued function f: ℝⁿ → ℝᵐ that maps n-dimensional inputs to m-dimensional outputs:
$$\mathbf{f}(\mathbf{x}) = \begin{bmatrix} f_1(x_1, \ldots, x_n) \\ f_2(x_1, \ldots, x_n) \\ \vdots \\ f_m(x_1, \ldots, x_n) \end{bmatrix}$$
Definition (Jacobian Matrix):
The Jacobian matrix J of f at point x is the m × n matrix of all first-order partial derivatives:
$$\mathbf{J} = \frac{\partial \mathbf{f}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$$
Matrix Dimensions: J has one row per output component and one column per input component, so J ∈ ℝᵐˣⁿ for f: ℝⁿ → ℝᵐ.
Row Interpretation:
Row i of J is the gradient of the i-th output component fᵢ:
$$\text{Row } i = \nabla f_i = \left[ \frac{\partial f_i}{\partial x_1}, \ldots, \frac{\partial f_i}{\partial x_n} \right]$$
Column Interpretation:
Column j contains all the effects of perturbing xⱼ on every output.
When m = 1 (scalar output), the Jacobian is a 1 × n matrix—a row vector. This is the transpose of the gradient: J = (∇f)ᵀ. Some conventions define the gradient as a row vector, in which case gradient equals Jacobian for scalar functions.
Examples:
Example 1: Linear Transformation
For f(x) = Ax + b where A is an m × n matrix:
$$\mathbf{J} = \mathbf{A}$$
The Jacobian of a linear function is simply the matrix itself!
Example 2: Element-wise Function
For f(x) = [σ(x₁), σ(x₂), ..., σ(xₙ)]ᵀ where σ is applied element-wise:
$$\mathbf{J} = \text{diag}(\sigma'(x_1), \sigma'(x_2), \ldots, \sigma'(x_n))$$
The Jacobian is diagonal because output i depends only on input i.
Example 3: Softmax Function
For softmax(z) where (softmax)ᵢ = exp(zᵢ)/Σⱼexp(zⱼ):
$$J_{ij} = \begin{cases} s_i(1 - s_i) & \text{if } i = j \\ -s_i s_j & \text{if } i \neq j \end{cases}$$
where sᵢ = softmax(z)ᵢ. This is a full (non-diagonal) matrix because each output depends on all inputs.
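These three Jacobians are easy to check numerically. The sketch below uses a small finite-difference helper of our own (`numerical_jacobian` is not a library function) to compare each analytic formula against central differences:

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Estimate the Jacobian of f at x with central finite differences."""
    y = f(x)
    J = np.zeros((y.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return J

rng = np.random.default_rng(0)
x = rng.normal(size=4)

# Example 1: linear map f(x) = Ax + b  ->  J = A
A, b = rng.normal(size=(3, 4)), rng.normal(size=3)
assert np.allclose(numerical_jacobian(lambda v: A @ v + b, x), A, atol=1e-5)

# Example 2: element-wise sigmoid  ->  J = diag(sigma'(x))
sigmoid = lambda v: 1 / (1 + np.exp(-v))
J_sig = numerical_jacobian(sigmoid, x)
assert np.allclose(J_sig, np.diag(sigmoid(x) * (1 - sigmoid(x))), atol=1e-5)

# Example 3: softmax  ->  J = diag(s) - s s^T
def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

s = softmax(x)
assert np.allclose(numerical_jacobian(softmax, x), np.diag(s) - np.outer(s, s), atol=1e-5)
print("All three analytic Jacobians match finite differences.")
```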
The Jacobian has a beautiful geometric meaning: it describes how a function locally transforms space.
Linear Approximation:
For small perturbations Δx around a point x₀:
$$\mathbf{f}(\mathbf{x}_0 + \Delta\mathbf{x}) \approx \mathbf{f}(\mathbf{x}_0) + \mathbf{J}|_{\mathbf{x}_0} \cdot \Delta\mathbf{x}$$
The Jacobian is the coefficient matrix of the best linear approximation to f near x₀. Just as the derivative f'(x) gives the slope of the tangent line to a curve, the Jacobian gives the linear map that best approximates f locally.
Transformation of Shapes:
Consider what happens to an infinitesimal sphere around x₀: to first order, the Jacobian maps it to an ellipsoid. The singular values of J give the stretching factors along the ellipsoid's principal axes, and the singular vectors give the directions of those axes.
Volume Scaling:
The Jacobian also tells us how volumes scale under the transformation:
$$\text{Volume scaling factor} = |\det(\mathbf{J})| \quad \text{(when } m = n \text{)}$$
This is crucial for change of variables in integration (the famous |det J| factor).
Invertibility:
If J is square (m = n) and non-singular (det J ≠ 0), the Inverse Function Theorem guarantees that f is locally invertible near that point.
```python
import numpy as np
import matplotlib.pyplot as plt

def f(x):
    """
    A nonlinear transformation R² → R².
    f(x, y) = (x² - y², 2xy)   [related to z² in the complex plane]
    """
    return np.array([x[0]**2 - x[1]**2, 2*x[0]*x[1]])

def jacobian_f(x):
    """
    Jacobian of f at point x.
    J = [[∂f₁/∂x, ∂f₁/∂y],      [[2x, -2y],
         [∂f₂/∂x, ∂f₂/∂y]]  =    [2y,  2x]]
    """
    return np.array([
        [2*x[0], -2*x[1]],
        [2*x[1],  2*x[0]]
    ])

# Create a circle of points
theta = np.linspace(0, 2*np.pi, 100)
circle_radius = 0.3

# Choose a base point
base_point = np.array([1.5, 0.5])

# Points on circle centered at base_point
circle_points = np.array([
    base_point + circle_radius * np.array([np.cos(t), np.sin(t)])
    for t in theta
])

# Transform circle through f
transformed_points = np.array([f(p) for p in circle_points])
transformed_center = f(base_point)

# Get Jacobian at base point
J = jacobian_f(base_point)
print("Jacobian at (1.5, 0.5):")
print(J)
print(f"\nDeterminant: {np.linalg.det(J):.4f}")
print(f"(Volume scaling: {abs(np.linalg.det(J)):.4f})")

# Singular value decomposition
U, S, Vt = np.linalg.svd(J)
print(f"\nSingular values: {S}")
print("(These are the stretching factors along principal axes)")

# Compute expected ellipse from Jacobian
# Circle -> J @ circle -> ellipse with axes = singular values
jacobian_circle = np.array([
    J @ (circle_radius * np.array([np.cos(t), np.sin(t)]))
    for t in theta
])
jacobian_ellipse = transformed_center + jacobian_circle

# Plotting
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Plot 1: Original circle
ax1 = axes[0]
ax1.plot(circle_points[:, 0], circle_points[:, 1], 'b-', linewidth=2)
ax1.plot(base_point[0], base_point[1], 'ro', markersize=10)
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_title('Original: Circle in Input Space')
ax1.set_aspect('equal')
ax1.grid(True, alpha=0.3)

# Plot 2: Transformed (actual nonlinear)
ax2 = axes[1]
ax2.plot(transformed_points[:, 0], transformed_points[:, 1], 'g-',
         linewidth=2, label='True transformation')
ax2.plot(jacobian_ellipse[:, 0], jacobian_ellipse[:, 1], 'r--',
         linewidth=2, label='Linear approximation (Jacobian)')
ax2.plot(transformed_center[0], transformed_center[1], 'ro', markersize=10)
ax2.set_xlabel('u')
ax2.set_ylabel('v')
ax2.set_title('Transformed Circle → Approximate Ellipse')
ax2.legend()
ax2.set_aspect('equal')
ax2.grid(True, alpha=0.3)

# Plot 3: Show how grid transforms
ax3 = axes[2]
x_grid = np.linspace(0.5, 2.5, 10)
y_grid = np.linspace(-0.5, 1.5, 10)

# Draw original grid lines and their images
for x_val in x_grid:
    y_line = np.linspace(-0.5, 1.5, 50)
    points = np.array([[x_val, y] for y in y_line])
    transformed = np.array([f(p) for p in points])
    ax3.plot(transformed[:, 0], transformed[:, 1], 'b-', alpha=0.5)

for y_val in y_grid:
    x_line = np.linspace(0.5, 2.5, 50)
    points = np.array([[x, y_val] for x in x_line])
    transformed = np.array([f(p) for p in points])
    ax3.plot(transformed[:, 0], transformed[:, 1], 'r-', alpha=0.5)

ax3.set_xlabel('u')
ax3.set_ylabel('v')
ax3.set_title('Grid Transformation (Jacobian varies by location)')
ax3.set_aspect('equal')
ax3.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('jacobian_geometry.png', dpi=150, bbox_inches='tight')
plt.show()

print("\nKey insight: The Jacobian provides the best linear approximation")
print("to the transformation at each point. Circles map roughly to")
print("ellipses determined by the Jacobian's singular values.")
```

Think of the Jacobian as answering: "If I zoom in enough at this point, what linear transformation does f look like?"
This local linearization is the foundation of gradient-based optimization—we approximate curved loss surfaces with their tangent planes.
The chain rule for Jacobians is the mathematical foundation of backpropagation. When functions compose, their Jacobians multiply.
Theorem (Chain Rule for Jacobians):
If g: ℝⁿ → ℝᵐ and f: ℝᵐ → ℝᵖ, then the Jacobian of the composition f ∘ g is:
$$\mathbf{J}_{\mathbf{f} \circ \mathbf{g}} = \mathbf{J}_{\mathbf{f}} \cdot \mathbf{J}_{\mathbf{g}}$$
Dimension Check: J_f is p × m and J_g is m × n, so their product J_f · J_g is p × n, exactly the right shape for the Jacobian of a map ℝⁿ → ℝᵖ.
Visualizing the Chain:
```
          J_g              J_f
  ℝⁿ ─────────→ ℝᵐ ─────────→ ℝᵖ
        (m×n)          (p×m)
   │                          │
   └──────────────────────────┘
  ℝⁿ ──────────────────────→ ℝᵖ
        J_{f∘g} = J_f · J_g
              (p×n)
```
Multi-Layer Composition:
For a deep composition h = f_L ∘ f_{L-1} ∘ ... ∘ f_1:
$$\mathbf{J}_{\mathbf{h}} = \mathbf{J}_{\mathbf{f}_L} \cdot \mathbf{J}_{\mathbf{f}_{L-1}} \cdots \mathbf{J}_{\mathbf{f}_1}$$
This product of Jacobian matrices is exactly what backpropagation computes!
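Here is a minimal check of this claim, using `torch.autograd.functional.jacobian` on a toy two-layer composition (the dimensions and the tanh nonlinearity are arbitrary choices for illustration):

```python
import torch
from torch.autograd.functional import jacobian

torch.manual_seed(0)
W1 = torch.randn(3, 4)   # layer 1: R^4 -> R^3
W2 = torch.randn(2, 3)   # layer 2: R^3 -> R^2

g = lambda x: torch.tanh(W1 @ x)    # inner function g
f = lambda h: W2 @ h                # outer function f
h_of_x = lambda x: f(g(x))          # composition f ∘ g

x = torch.randn(4)

J_g = jacobian(g, x)        # shape (3, 4)
J_f = jacobian(f, g(x))     # shape (2, 3), evaluated at g(x)
J_h = jacobian(h_of_x, x)   # shape (2, 4)

# Chain rule: J_{f∘g}(x) = J_f(g(x)) @ J_g(x)
print(torch.allclose(J_h, J_f @ J_g, atol=1e-5))  # True
```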
| Notation Style | Forward Pass | Chain Rule |
|---|---|---|
| Scalar calculus | y = f(g(x)) | dy/dx = (df/dg)(dg/dx) |
| Jacobian matrix | y = f(g(x)) | J_y = J_f · J_g |
| Index notation | yᵢ = fᵢ(g₁, ..., gₘ) | ∂yᵢ/∂xⱼ = Σₖ (∂fᵢ/∂gₖ)(∂gₖ/∂xⱼ) |
| Deep learning | h = layer₂(layer₁(x)) | dL/dx = (dL/dh₂)(dh₂/dh₁)(dh₁/dx) |
The chain rule multiplies Jacobians in reverse order of the forward pass. The Jacobian of the outermost function (last layer) goes on the left. This corresponds to backpropagation going from loss backward through the network.
In practice, we rarely compute full Jacobian matrices. They're too large—a layer mapping ℝ¹⁰⁰⁰ → ℝ¹⁰⁰⁰ has a million-entry Jacobian! Instead, backpropagation computes Vector-Jacobian Products (VJPs).
The Key Insight:
To minimize a scalar loss L with respect to parameters, we need ∇L—a vector. We don't need the full Jacobians of intermediate layers. We only need how the loss gradient propagates through each layer.
VJP Definition:
Given a vector v ∈ ℝᵐ and a Jacobian J ∈ ℝᵐˣⁿ, the VJP is:
$$\text{VJP}(\mathbf{v}, \mathbf{J}) = \mathbf{v}^\top \mathbf{J} \in \mathbb{R}^n$$
This is a row vector of dimension n (the input dimension).
Backpropagation as Chained VJPs:
For a composition y = f₃(f₂(f₁(x))) with scalar loss L(y), backpropagation starts from the vector v = ∂L/∂y and applies one VJP per layer, right to left: first vᵀJ₃, then (vᵀJ₃)J₂, then ((vᵀJ₃)J₂)J₁ = ∇ₓL. At every step only a vector is carried backward; no intermediate Jacobian is ever materialized.
Computational Advantage:
Each VJP costs O(m·n) operations—the same as matrix-vector multiplication. Computing the full Jacobian would cost O(m·n) per column, totaling O(m·n²). VJPs are asymptotically cheaper when we only need the final gradient.
Dual: Jacobian-Vector Products (JVP):
The dual operation Jv, multiplying the Jacobian by a vector on the right, is a Jacobian-Vector Product (JVP), used in forward-mode automatic differentiation. Different use cases favor different modes: reverse mode (VJPs) wins when there are many inputs and few outputs, the typical deep learning setting of a scalar loss and millions of parameters, while forward mode (JVPs) wins when there are few inputs and many outputs.
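Both products are available directly in PyTorch via `torch.autograd.functional`; the sketch below (toy dimensions, an arbitrary tanh layer) shows the shapes involved in each mode:

```python
import torch
from torch.autograd.functional import vjp, jvp

torch.manual_seed(0)
W = torch.randn(5, 3)
f = lambda x: torch.tanh(W @ x)   # f: R^3 -> R^5

x = torch.randn(3)

# Reverse mode: v^T J  (v lives in the output space, result in the input space)
v_out = torch.randn(5)
y, vjp_result = vjp(f, x, v_out)   # vjp_result has shape (3,)

# Forward mode: J v  (v lives in the input space, result in the output space)
v_in = torch.randn(3)
y, jvp_result = jvp(f, x, v_in)    # jvp_result has shape (5,)

print(vjp_result.shape, jvp_result.shape)  # torch.Size([3]) torch.Size([5])
```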
```python
import numpy as np

def vjp_linear_layer(v, W, x):
    """
    Vector-Jacobian Product for linear layer: y = Wx + b

    The Jacobian of y w.r.t. x is just W (shape: output_dim × input_dim).
    The Jacobian of y w.r.t. W is more complex (involves outer products).

    Parameters:
    - v: incoming gradient dL/dy, shape (output_dim,)
    - W: weight matrix, shape (output_dim, input_dim)
    - x: input vector, shape (input_dim,)

    Returns:
    - v_x: gradient dL/dx = v^T @ W, shape (input_dim,)
    - v_W: gradient dL/dW = outer(v, x), shape (output_dim, input_dim)
    """
    # VJP with respect to input: v^T @ J_x = v^T @ W
    v_x = v @ W  # shape (input_dim,)

    # VJP with respect to weights: each W_ij affects y_i as W_ij * x_j
    # So dL/dW_ij = (dL/dy_i) * x_j = v_i * x_j
    v_W = np.outer(v, x)  # shape (output_dim, input_dim)

    return v_x, v_W

def vjp_relu(v, z):
    """
    VJP for ReLU: y = max(0, z)

    Jacobian is diagonal: J_ii = 1 if z_i > 0, else 0.
    So the VJP is just element-wise multiplication.
    """
    return v * (z > 0).astype(float)

def demo_backpropagation_as_vjp():
    """
    Demonstrate backprop as a sequence of VJPs.

    Network: x -> W1 -> ReLU -> W2 -> y -> MSE Loss
    """
    np.random.seed(42)

    # Network dimensions
    input_dim = 4
    hidden_dim = 3
    output_dim = 2

    # Initialize weights
    W1 = np.random.randn(hidden_dim, input_dim) * 0.5
    W2 = np.random.randn(output_dim, hidden_dim) * 0.5

    # Sample input and target
    x = np.array([1.0, 2.0, -1.0, 0.5])
    target = np.array([1.0, -1.0])

    # ========== FORWARD PASS ==========
    print("="*60)
    print("FORWARD PASS (computing intermediate values)")
    print("="*60)

    # Layer 1: Linear
    z1 = W1 @ x
    print(f"z1 = W1 @ x: {z1}")

    # Layer 2: ReLU
    h1 = np.maximum(0, z1)
    print(f"h1 = ReLU(z1): {h1}")

    # Layer 3: Linear
    y = W2 @ h1
    print(f"y = W2 @ h1: {y}")

    # Loss: MSE
    loss = 0.5 * np.sum((y - target)**2)
    print(f"\nLoss = 0.5 * ||y - target||² = {loss:.6f}")

    # ========== BACKWARD PASS (VJPs) ==========
    print("\n" + "="*60)
    print("BACKWARD PASS (VJP chain)")
    print("="*60)

    # Gradient of loss w.r.t. prediction
    dL_dy = y - target  # shape (output_dim,)
    print(f"\n1. dL/dy = y - target = {dL_dy}")

    # VJP through W2 layer
    dL_dh1, dL_dW2 = vjp_linear_layer(dL_dy, W2, h1)
    print(f"\n2. VJP through W2:")
    print(f"   dL/dh1 = dL/dy @ W2 = {dL_dh1}")
    print(f"   dL/dW2 = outer(dL/dy, h1):\n{dL_dW2}")

    # VJP through ReLU
    dL_dz1 = vjp_relu(dL_dh1, z1)
    print(f"\n3. VJP through ReLU:")
    print(f"   dL/dz1 = dL/dh1 ⊙ (z1 > 0) = {dL_dz1}")

    # VJP through W1 layer
    dL_dx, dL_dW1 = vjp_linear_layer(dL_dz1, W1, x)
    print(f"\n4. VJP through W1:")
    print(f"   dL/dx = dL/dz1 @ W1 = {dL_dx}")
    print(f"   dL/dW1 = outer(dL/dz1, x):\n{dL_dW1}")

    print("\n" + "="*60)
    print("SUMMARY: We computed gradients for ALL parameters using VJPs")
    print("="*60)
    print(f"dL/dW2 shape: {dL_dW2.shape} (matches W2)")
    print(f"dL/dW1 shape: {dL_dW1.shape} (matches W1)")
    print("\nThis is exactly what autograd frameworks do internally!")

demo_backpropagation_as_vjp()
```

A single layer in a transformer might have dimensions 4096 → 4096. Its Jacobian has 16 million entries. But the VJP only requires multiplying a 4096-vector by the Jacobian operator—much cheaper. This efficiency is why reverse-mode autodiff (backprop) dominates deep learning.
Let's derive the Jacobians for the most common neural network components. Understanding these builds intuition for gradient flow.
1. Linear/Dense Layer: y = Wx + b
$$\mathbf{J}_{\mathbf{y} \leftarrow \mathbf{x}} = \mathbf{W}$$
The Jacobian is simply the weight matrix. This explains why weight magnitude affects gradient magnitude—large weights amplify gradients (see the short demo after this catalog).
2. Element-wise Activation: y = σ(x)
$$\mathbf{J} = \text{diag}(\sigma'(x_1), \sigma'(x_2), \ldots, \sigma'(x_n))$$
Diagonal matrix because each output depends only on corresponding input.
Specific activations: ReLU has σ′(x) ∈ {0, 1}; sigmoid has σ′(x) = σ(x)(1 − σ(x)) ≤ 0.25; tanh has σ′(x) = 1 − tanh²(x) ≤ 1.
3. Softmax: sᵢ = exp(zᵢ)/Σⱼexp(zⱼ)
$$J_{ij} = \begin{cases} s_i(1 - s_i) & i = j \\ -s_i s_j & i \neq j \end{cases} = s_i(\delta_{ij} - s_j)$$
Equivalently: J = diag(s) − ssᵀ
4. Batch Normalization: (simplified, ignoring running stats)
For y = γ · (x - μ)/σ + β:
The Jacobian is more complex because the batch mean and variance themselves depend on every input in the batch. See the original batch normalization paper (Ioffe & Szegedy, 2015) for the full derivation.
5. Dropout: (during training with mask m)
$$\mathbf{J} = \frac{1}{1-p} \cdot \text{diag}(m_1, m_2, \ldots, m_n)$$
Diagonal with entries 0 or 1/(1-p), stochastic per forward pass.
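To make the "large weights amplify gradients" note under the linear layer concrete, here is a small NumPy sketch (layer count, width, and weight scales are arbitrary) that pushes a unit gradient backward through a stack of random linear layers via repeated VJPs:

```python
import numpy as np

rng = np.random.default_rng(0)
depth, dim = 20, 64

def grad_norm_through_stack(weight_scale):
    """Push a unit gradient backward through `depth` random linear layers.

    Each step is the VJP v <- v @ W, so the gradient norm scales roughly like
    weight_scale ** depth: exploding for scales > 1, vanishing for scales < 1.
    """
    v = np.ones(dim) / np.sqrt(dim)                      # unit incoming gradient
    for _ in range(depth):
        W = rng.normal(size=(dim, dim)) * weight_scale / np.sqrt(dim)
        v = v @ W                                        # VJP through y = Wx
    return np.linalg.norm(v)

for scale in (0.5, 1.0, 2.0):
    print(f"weight scale {scale}: ||gradient|| after {depth} layers ≈ "
          f"{grad_norm_through_stack(scale):.2e}")
```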
| Layer | Jacobian Structure | Memory for VJP | Gradient Issues |
|---|---|---|---|
| Linear (m×n weights) | Dense m×n (= W) | O(m×n) | Explodes/vanishes with W magnitude |
| ReLU | Diagonal 0/1 | O(n) mask | Dead neurons (zeros) |
| Sigmoid | Diagonal in (0, 0.25) | O(n) activations | Vanishing (max derivative = 0.25) |
| Softmax | Dense n×n | O(n) softmax output | Full coupling between outputs |
| BatchNorm | Nearly diagonal | O(n) | Stabilizes gradient flow |
| Residual (x + f(x)) | I + J_f | O(n) | Identity helps gradient flow |
Residual connections y = x + f(x) have Jacobian J = I + J_f. Even if J_f vanishes (deep network), the identity I ensures gradients flow unimpeded. This is why ResNets train so much more easily than plain deep networks.
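A quick autograd check of this claim (toy dimensions; the tanh branch and the 0.01 weight scale are arbitrary choices that make J_f nearly vanish):

```python
import torch
from torch.autograd.functional import jacobian

torch.manual_seed(0)
dim = 4
W = torch.randn(dim, dim) * 0.01     # tiny weights: J_f nearly vanishes

f = lambda x: torch.tanh(W @ x)      # plain branch
residual = lambda x: x + f(x)        # residual block y = x + f(x)

x = torch.randn(dim)
J_f = jacobian(f, x)
J_res = jacobian(residual, x)

print(torch.allclose(J_res, torch.eye(dim) + J_f, atol=1e-6))  # True
print("||J_f||   =", torch.linalg.matrix_norm(J_f).item())     # ≈ 0 (branch alone would kill gradients)
print("||J_res|| =", torch.linalg.matrix_norm(J_res).item())   # ≈ ||I|| = 2 (identity keeps gradients flowing)
```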
When the Jacobian is square (n × n), its determinant carries important information about the transformation.
Definition:
The Jacobian determinant of f: ℝⁿ → ℝⁿ at x is:
$$\det\left(\mathbf{J}_\mathbf{f}(\mathbf{x})\right) = \det\left( \frac{\partial \mathbf{f}}{\partial \mathbf{x}} \right)$$
Geometric Meaning:
|det J| measures how infinitesimal volumes scale under the transformation:
$$\text{Volume}(\mathbf{f}(S)) \approx |\det \mathbf{J}| \cdot \text{Volume}(S)$$
for small regions S.
Implications: det J = 0 means the map locally collapses volume and is not invertible there; |det J| > 1 expands volumes while |det J| < 1 contracts them; and the sign of det J tells you whether orientation is preserved (positive) or flipped (negative).
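A small numerical check of the volume-scaling interpretation, reusing the map f(x, y) = (x² − y², 2xy) from the earlier plotting example (the base point and square size below are arbitrary):

```python
import numpy as np

f = lambda p: np.array([p[0]**2 - p[1]**2, 2 * p[0] * p[1]])
jac = lambda p: np.array([[2*p[0], -2*p[1]], [2*p[1], 2*p[0]]])

p0 = np.array([1.5, 0.5])
eps = 1e-3

# Map a tiny square of side 2*eps centered at p0 and measure the area of its
# image (nearly a parallelogram) with the shoelace formula on its corners.
corners = p0 + eps * np.array([[-1, -1], [1, -1], [1, 1], [-1, 1]])
img = np.array([f(c) for c in corners])
x, y = img[:, 0], img[:, 1]
area_img = 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

area_src = (2 * eps) ** 2
print("measured area scaling:", area_img / area_src)      # ≈ 10.0
print("|det J| at p0:        ", abs(np.linalg.det(jac(p0))))  # 4(x² + y²) = 10.0
```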
Change of Variables (Integration):
When integrating after a change of variables x = g(u):
$$\int f(\mathbf{x}) \, d\mathbf{x} = \int f(\mathbf{g}(\mathbf{u})) \, |\det \mathbf{J}_\mathbf{g}| \, d\mathbf{u}$$
The |det J| factor compensates for volume distortion.
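A concrete instance: integrating a Gaussian bump over a disk, once on a Cartesian grid and once after the polar substitution x = r cos θ, y = r sin θ, whose Jacobian determinant is r (the grid resolutions below are arbitrary):

```python
import numpy as np

# Integrate exp(-(x² + y²)) over the disk of radius R in two ways.
R = 2.0
n = 1001

# (a) Cartesian Riemann sum over a grid, keeping only points inside the disk
xs = np.linspace(-R, R, n)
dx = xs[1] - xs[0]
X, Y = np.meshgrid(xs, xs)
inside = X**2 + Y**2 <= R**2
cartesian = np.sum(np.exp(-(X**2 + Y**2)) * inside) * dx * dx

# (b) Polar substitution: the |det J| = r factor appears in the integrand
rs = np.linspace(0, R, n)
dr = rs[1] - rs[0]
polar = 2 * np.pi * np.sum(np.exp(-rs**2) * rs) * dr

exact = np.pi * (1 - np.exp(-R**2))
print(cartesian, polar, exact)   # all ≈ 3.08
```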
Application: Normalizing Flows
Normalizing flows model complex probability distributions by transforming a simple base distribution (like Gaussian) through invertible neural networks. The log-probability is:
$$\log p(\mathbf{x}) = \log p(\mathbf{z}) - \log|\det \mathbf{J}|$$
Efficiently computing log|det J| is a major research area, leading to architectures like RealNVP, Glow, and Neural ODEs.
Computing det(n×n matrix) costs O(n³). But for special structures—diagonal, triangular, or near-identity—it's O(n). Normalizing flow architectures like RealNVP use triangular Jacobians precisely for this computational efficiency.
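A minimal sketch of the idea, using the simplest possible flow: an element-wise affine map whose Jacobian is diagonal, so log|det J| is just a sum of logs (a real flow such as RealNVP uses triangular coupling layers, but the bookkeeping is the same):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 5

# Toy invertible "flow": element-wise affine map x = a ⊙ z + b.
# Its Jacobian w.r.t. z is diag(a), so log|det J| = Σ log|a_i|: O(n), not O(n³).
a = rng.uniform(0.5, 2.0, size=dim)
b = rng.normal(size=dim)

def log_prob(x):
    """log p(x) = log p(z) - log|det J|, with a standard Gaussian base p(z)."""
    z = (x - b) / a                                        # invert the flow
    log_pz = -0.5 * np.sum(z**2) - 0.5 * dim * np.log(2 * np.pi)
    log_det = np.sum(np.log(np.abs(a)))                    # diagonal shortcut
    return log_pz - log_det

# The O(n) diagonal formula matches the generic O(n³) determinant computation
sign, logdet_full = np.linalg.slogdet(np.diag(a))
print(np.isclose(np.sum(np.log(np.abs(a))), logdet_full))  # True

x = a * rng.normal(size=dim) + b       # a sample from the flow
print("log p(x) =", log_prob(x))
```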
Understanding the computational aspects of Jacobians is crucial for efficient deep learning implementation.
Memory vs. Compute Tradeoffs:
Backpropagation requires storing intermediate activations for the backward pass:
| Approach | Forward Memory | Backward Compute | Use Case |
|---|---|---|---|
| Store all | O(L × batch × dim) | O(1) recompute | Standard training |
| Checkpoint | O(√L × batch × dim) | O(L) recompute | Memory-limited |
| Recompute all | O(batch × dim) | O(L²) recompute | Extreme memory limit |
Gradient Checkpointing:
Instead of storing all activations, store checkpoints only every ~√L layers and recompute the rest during the backward pass. This trades roughly one extra forward pass (about 2× forward compute) for a ~√L reduction in activation memory.
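A hedged sketch using PyTorch's torch.utils.checkpoint utilities (layer sizes are illustrative, and the exact keyword arguments vary slightly across PyTorch versions):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# A deep stack of small blocks; sizes are purely illustrative.
blocks = nn.Sequential(*[nn.Sequential(nn.Linear(256, 256), nn.ReLU())
                         for _ in range(16)])
x = torch.randn(32, 256, requires_grad=True)

# Split the stack into 4 segments (≈ √L): only segment boundaries are stored,
# and activations inside each segment are recomputed during the backward pass.
out = checkpoint_sequential(blocks, 4, x, use_reentrant=False)
out.pow(2).mean().backward()

print(x.grad.shape)   # gradients match the non-checkpointed computation
```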
Mixed Precision:
Using FP16 for forward/backward passes halves memory and speeds up matrix operations (on supported hardware), with FP32 for accumulation to maintain numerical stability.
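A minimal sketch of the pattern, assuming PyTorch's autocast API (shown here with CPU bfloat16 so it runs anywhere; on CUDA one would typically use float16 autocast together with a gradient scaler):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 10))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x, target = torch.randn(64, 128), torch.randint(0, 10, (64,))

# Low-precision forward pass under autocast; parameters remain FP32 "master" weights.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = F.cross_entropy(model(x), target)

loss.backward()   # gradients accumulate into the FP32 parameters
opt.step()

print(model[0].weight.dtype)   # torch.float32: weights stay full precision
```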
Jacobian-Free Methods:
Some applications need Jacobian-vector (or Hessian-vector) products without explicit Jacobians: Hessian-vector products for second-order optimization, Krylov-subspace solvers such as conjugate gradient, and adjoint methods for Neural ODEs.
These use clever tricks to compute the product without forming the full matrix.
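For example, a Hessian-vector product can be computed with two backward passes and never materializes the n × n Hessian; the cross-check against torch.autograd.functional.hessian below is only feasible because the toy dimension is small (the loss function here is arbitrary):

```python
import torch
from torch.autograd.functional import hessian

def loss_fn(w):
    return torch.sum(torch.sin(w) * w**2)     # arbitrary smooth scalar loss

w = torch.randn(50, requires_grad=True)
v = torch.randn(50)

# Hessian-vector product via double backward: g = ∇L(w), then Hv = ∇_w (g · v).
# Only vector-sized quantities flow through autograd; H itself is never formed.
g, = torch.autograd.grad(loss_fn(w), w, create_graph=True)
Hv, = torch.autograd.grad(g @ v, w)

# Cross-check against the explicit Hessian (feasible only because n = 50 is tiny)
H = hessian(loss_fn, w.detach())
print(torch.allclose(Hv, H @ v, atol=1e-4))   # True
```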
```python
import numpy as np
import torch
import torch.nn.functional as F

def compute_full_jacobian(f, x):
    """
    Compute the full Jacobian matrix using autograd.

    Warning: This is expensive! O(output_dim) backward passes.
    Use only for small dimensions or debugging.
    """
    x = x.clone().requires_grad_(True)
    y = f(x)

    output_dim = y.numel()
    input_dim = x.numel()

    jacobian = torch.zeros(output_dim, input_dim)

    for i in range(output_dim):
        # Zero gradients
        if x.grad is not None:
            x.grad.zero_()

        # Compute gradient of y[i] w.r.t. x
        y_flat = y.view(-1)
        y_flat[i].backward(retain_graph=True)

        jacobian[i] = x.grad.view(-1).clone()

    return jacobian

def compute_vjp_efficient(f, x, v):
    """
    Compute the VJP v^T @ J efficiently using a single backward pass.
    This is what autograd does internally.
    """
    x = x.clone().requires_grad_(True)
    y = f(x)

    # v is the "incoming gradient" from upstream
    # The backward pass computes v^T @ J
    y.backward(v)

    return x.grad.clone()

# Example: Softmax Jacobian
def softmax_function(z):
    return F.softmax(z, dim=0)

# Small example for visualization
z = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
s = softmax_function(z)

print("Input logits z:", z.detach().numpy())
print("Softmax output s:", s.detach().numpy())

# Compute full Jacobian
J = compute_full_jacobian(softmax_function, z.detach())
print("\nFull Softmax Jacobian:")
print(J.numpy().round(4))

# Verify: J = diag(s) - s s^T
s_np = s.detach().numpy()
expected_J = np.diag(s_np) - np.outer(s_np, s_np)
print("\nExpected (diag(s) - s @ s^T):")
print(expected_J.round(4))

# VJP example
v = torch.tensor([1.0, 0.0, 0.0])  # gradient from loss w.r.t. first output
vjp_result = compute_vjp_efficient(softmax_function, z.detach(), v)
print(f"\nVJP with v = {v.numpy()}:")
print(f"Result: {vjp_result.numpy().round(4)}")
print(f"Verify (v @ J): {(v.numpy() @ J.numpy()).round(4)}")

# Cost comparison
print("\n" + "="*60)
print("COST COMPARISON")
print("="*60)
print(f"Full Jacobian: {J.shape[0]} backward passes")
print(f"VJP: 1 backward pass")
print("For a 1000-dim layer: 1000x speedup with VJP!")
```

The Jacobian matrix generalizes derivatives to vector-valued functions, enabling us to understand and compute gradients through complex neural network architectures.
Key Concepts: the Jacobian is the m × n matrix of all first-order partial derivatives and the best linear approximation to f near a point; its rows are the gradients of individual output components; composed functions have Jacobians that multiply (the chain rule), which is exactly what backpropagation implements; VJPs let us propagate gradients without ever materializing full Jacobians; and the Jacobian determinant measures local volume scaling, powering change of variables and normalizing flows.
What's Next:
Gradients and Jacobians capture first-order information—how functions change linearly near a point. But loss landscapes are curved, not flat. The next page introduces Taylor series expansion, which captures higher-order curvature information. This leads to understanding the Hessian matrix and second-order optimization methods that can outperform gradient descent.
You now understand Jacobian matrices—the complete derivative for vector-valued functions. You see how the chain rule for Jacobians underlies backpropagation, and why VJPs make gradient computation tractable. Next: Taylor series and second-order approximations.