If the forward pass is about computing outputs, the backward pass is about understanding how those outputs depend on inputs. Specifically, we need to compute gradients—partial derivatives of the loss with respect to every parameter. These gradients tell us how to adjust parameters to reduce the loss.
For a neural network with millions of parameters, computing these gradients might seem computationally infeasible. Remarkably, reverse mode automatic differentiation (commonly called backpropagation in the context of neural networks) computes all gradients in roughly the same time as a single forward pass.
This efficiency is not merely convenient—it's what makes deep learning possible. Without reverse mode autodiff, training networks with millions or billions of parameters would be computationally intractable.
In this section, we'll develop a complete understanding of reverse mode autodiff: its mathematical foundations, its implementation as an algorithm, and why it's exactly the right approach for neural networks.
By the end of this page, you'll understand: (1) The chain rule as the mathematical foundation of backpropagation, (2) Why reverse mode is efficient for neural networks (one output, many inputs), (3) The backward pass algorithm in detail, (4) Vector-Jacobian Products (VJPs) as the core operation, (5) Gradient accumulation for fan-out, and (6) Implementing backprop from scratch.
Reverse mode autodiff is, fundamentally, a systematic application of the chain rule of calculus. Let's build from simple to complex.
For composed functions $f(g(x))$, the derivative is:
$$\frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}$$
Example: If $f(u) = \sin(u)$ and $g(x) = x^2$, then $f(g(x)) = \sin(x^2)$:
$$\frac{d}{dx}\sin(x^2) = \cos(x^2) \cdot 2x$$
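As a quick numerical sanity check (a minimal sketch; the helper names are ours), the analytic chain-rule derivative matches a central finite difference:

```python
import numpy as np

def f(x):
    return np.sin(x ** 2)

def df_analytic(x):
    # Chain rule: d/dx sin(x^2) = cos(x^2) * 2x
    return np.cos(x ** 2) * 2 * x

x, eps = 1.3, 1e-6
df_numeric = (f(x + eps) - f(x - eps)) / (2 * eps)
print(df_analytic(x), df_numeric)  # the two values agree to ~8 decimal places
```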
For functions with multiple intermediate variables, we sum contributions:
If $z = f(u, v)$ where $u = g(x)$ and $v = h(x)$, then:
$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial u} \cdot \frac{\partial u}{\partial x} + \frac{\partial z}{\partial v} \cdot \frac{\partial v}{\partial x}$$
This summation is crucial: when a variable affects the output through multiple paths, we must sum the contributions from each path.
When a variable x influences the output through multiple intermediate values, the gradients from ALL paths must be summed. This is the mathematical basis for gradient accumulation in fan-out situations. Missing this summation is a common source of bugs in manual gradient implementations.
For a computational graph where output $L$ depends on $x$ through intermediate nodes $v_1, v_2, ..., v_n$:
$$\frac{\partial L}{\partial x} = \sum_{i: x \rightarrow v_i} \frac{\partial L}{\partial v_i} \cdot \frac{\partial v_i}{\partial x}$$
where the sum is over all nodes $v_i$ that directly depend on $x$.
This is exactly what backpropagation computes—systematically, efficiently, and automatically.
Key insight: $\frac{\partial z}{\partial x} = \frac{\partial z}{\partial u} \cdot \frac{\partial u}{\partial x} + \frac{\partial z}{\partial v} \cdot \frac{\partial v}{\partial x}$
Both paths from $x$ to $z$ contribute to the total derivative.
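To make the two-path sum concrete, here is a small check with $u = x^2$, $v = 3x$, and $z = u \cdot v$ (our own example), so analytically $z = 3x^3$ and $dz/dx = 9x^2$:

```python
x = 2.0
u, v = x ** 2, 3 * x        # two intermediates, both depending on x
z = u * v                   # z = 3x^3

# Path through u: (∂z/∂u)(∂u/∂x) = v * 2x
# Path through v: (∂z/∂v)(∂v/∂x) = u * 3
dz_dx = v * (2 * x) + u * 3
print(dz_dx, 9 * x ** 2)    # both print 36.0, the total derivative
```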
There are two fundamentally different ways to apply the chain rule through a computational graph: forward mode and reverse mode. Understanding their trade-offs explains why reverse mode is standard for neural networks.
Forward mode computes derivatives alongside the forward computation, propagating forward from inputs to outputs.
For each input $x_i$, we compute $\frac{\partial y}{\partial x_i}$ for all outputs $y$. We track tangent vectors (also called directional derivatives).
One forward pass per input variable.
Reverse mode first computes all forward values, then propagates derivatives backward from outputs to inputs.
For each output $y_j$, we compute $\frac{\partial y_j}{\partial x_i}$ for all inputs $x_i$. We track adjoint values (gradients of the output with respect to intermediate values).
One backward pass per output variable.
| Aspect | Forward Mode | Reverse Mode |
|---|---|---|
| Propagation direction | Input → Output | Output → Input |
| What's computed | ∂outputs/∂(one input) | ∂(one output)/∂inputs |
| Passes needed for full Jacobian | One per input | One per output |
| Efficient when | Few inputs, many outputs | Few outputs, many inputs |
| Neural network scenario | Inefficient (millions of inputs) | Efficient (one scalar loss) |
| Memory requirement | Forward values + tangents | All forward values (cached) |
| Common names | Tangent mode, forward AD | Adjoint mode, backprop, reverse AD |
In neural network training:
We need $\frac{\partial L}{\partial \theta_i}$ for all parameters $\theta_i$.
Forward mode cost: $n$ forward passes (one per parameter) → Infeasible
Reverse mode cost: 1 backward pass (one output) → Efficient
For a network with 1 million parameters, forward mode would need roughly 1,000,000 passes to obtain the full gradient, while reverse mode needs a single forward pass plus a single backward pass.
This is why backpropagation (reverse mode) is used universally in deep learning.
Forward mode is preferred when the function has few inputs and many outputs, or when only a directional derivative (a single Jacobian-vector product) is needed, for example in sensitivity analysis with a handful of parameters.
In practice, frameworks like JAX support both modes, allowing the optimal choice based on the computation structure.
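As a concrete illustration, a minimal JAX sketch (the toy function is ours) computes a directional derivative with forward mode and the full gradient with reverse mode:

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(jnp.sin(x) ** 2)   # many inputs -> one scalar output

x = jnp.arange(1.0, 4.0)              # [1., 2., 3.]

# Forward mode: one JVP gives the directional derivative along tangent v
v = jnp.ones_like(x)
_, dir_deriv = jax.jvp(f, (x,), (v,))

# Reverse mode: one VJP gives the full gradient at once
_, vjp_fn = jax.vjp(f, x)
(grad,) = vjp_fn(1.0)                 # upstream gradient of the scalar output

print(dir_deriv)   # v · ∇f(x)
print(grad)        # ∇f(x), same as jax.grad(f)(x)
```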
Computing a full gradient with forward mode requires O(n) passes, where n is the number of inputs; reverse mode requires O(m) passes, where m is the number of outputs. For neural networks with millions of parameters (n huge) and one scalar loss (m = 1), reverse mode is roughly a million times cheaper.
The core operation in reverse mode autodiff is the Vector-Jacobian Product (VJP). Understanding VJPs is essential for implementing and debugging backpropagation.
For a function $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$, the Jacobian is the $m \times n$ matrix of all partial derivatives:
$$J_f = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$$
Row $i$ contains gradients of output $f_i$ with respect to all inputs.
A VJP takes a vector $\mathbf{v} \in \mathbb{R}^m$ (the "upstream gradient") and produces:
$$\mathbf{v}^T J_f \in \mathbb{R}^n$$
This is the gradient of a scalar function with respect to inputs, where $\mathbf{v}$ represents how the scalar depends on $f$'s outputs.
Crucially: VJPs can be computed efficiently without ever forming the full Jacobian matrix. Each elementary operation has a VJP that directly computes the product.
```python
import numpy as np

# ===== VJP Examples for Common Operations =====

def add_vjp(v, x, y):
    """
    VJP for z = x + y
    Jacobian: J = [I, I] where I is identity
    VJP: v^T @ [I, I] = [v, v]
    Gradient flows through unchanged to both inputs.
    """
    grad_x = v  # ∂L/∂x = v * 1
    grad_y = v  # ∂L/∂y = v * 1
    return grad_x, grad_y

def multiply_vjp(v, x, y):
    """
    VJP for z = x * y (elementwise)
    ∂z/∂x = y, ∂z/∂y = x (elementwise)
    """
    grad_x = v * y  # ∂L/∂x = v * y
    grad_y = v * x  # ∂L/∂y = v * x
    return grad_x, grad_y

def matmul_vjp(v, X, W):
    """
    VJP for Y = X @ W
    X: (batch, n), W: (n, m), Y: (batch, m)
    v: (batch, m) - upstream gradient

    ∂L/∂X = v @ W^T  (shape: batch, n)
    ∂L/∂W = X^T @ v  (shape: n, m)
    """
    grad_X = v @ W.T
    grad_W = X.T @ v
    return grad_X, grad_W

def relu_vjp(v, x, output):
    """
    VJP for y = ReLU(x) = max(0, x)
    ∂y/∂x = 1 if x > 0 else 0
    """
    grad_x = v * (x > 0).astype(float)
    return grad_x

def softmax_cross_entropy_vjp(v, logits, labels):
    """
    VJP for L = -log(softmax(logits)[label])

    The beautiful result: gradient = softmax(logits) - one_hot(label)
    This is efficient and numerically stable!
    """
    # Compute softmax
    exp_logits = np.exp(logits - np.max(logits, axis=-1, keepdims=True))
    softmax = exp_logits / np.sum(exp_logits, axis=-1, keepdims=True)

    # Gradient is softmax - one_hot (for each example in batch)
    batch_size = logits.shape[0]
    grad_logits = softmax.copy()
    grad_logits[np.arange(batch_size), labels] -= 1

    # Scale by upstream gradient
    grad_logits *= v.reshape(-1, 1)
    return grad_logits

# Demonstration
print("=== VJP for Y = X @ W ===")
X = np.array([[1, 2], [3, 4]], dtype=float)
W = np.array([[0.5, 0.3, 0.2], [0.1, 0.4, 0.5]], dtype=float)
Y = X @ W

# Suppose upstream gradient is all ones
v = np.ones_like(Y)
grad_X, grad_W = matmul_vjp(v, X, W)

print(f"X shape: {X.shape}, W shape: {W.shape}")
print(f"Y shape: {Y.shape}, v shape: {v.shape}")
print(f"grad_X shape: {grad_X.shape} (same as X)")
print(f"grad_W shape: {grad_W.shape} (same as W)")
```

Consider matrix multiplication $Y = XW$ where $X \in \mathbb{R}^{b \times n}$ and $W \in \mathbb{R}^{n \times m}$.
The full Jacobian with respect to $W$ would be a $(bm) \times (nm)$ matrix—potentially enormous.
But the VJP $\frac{\partial L}{\partial W} = X^T \cdot \frac{\partial L}{\partial Y}$ computes this gradient directly with a single matrix product, producing an $n \times m$ array the same shape as $W$.
No large Jacobian is ever formed. This is the practical magic of reverse mode autodiff.
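To see the scale of the savings, here is a small sketch (with arbitrary sizes of our choosing) comparing the entry count of the full Jacobian with the arrays the VJP actually materializes:

```python
import numpy as np

b, n, m = 64, 512, 1024               # batch, input dim, output dim
X = np.random.randn(b, n)
W = np.random.randn(n, m)
v = np.random.randn(b, m)             # upstream gradient ∂L/∂Y

# The full Jacobian of Y with respect to W would hold (b*m) x (n*m) entries
print("Full Jacobian entries:", (b * m) * (n * m))   # ~3.4e10 numbers

# The VJP forms only arrays the size of the operands themselves
grad_W = X.T @ v                       # shape (n, m)
print("grad_W entries:", grad_W.size)  # 524,288 numbers
```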
For a framework to support autodiff for an operation, it must define that operation's VJP. This is the 'gradient function' or 'backward method' you register when defining custom operations. The VJP takes upstream gradients and returns gradients with respect to each input.
Now we can state the complete backward pass algorithm. This is reverse mode autodiff applied to a computational graph.
Input: a computational graph whose nodes hold cached forward values, plus the designated output (loss) node.

Output: the gradient of the output with respect to every node in the graph (in particular, every parameter).

Procedure: initialize the output's gradient to 1, visit nodes in reverse topological order, apply each node's VJP to its upstream gradient, and accumulate the results into its inputs:
```python
def backward_pass(output_node):
    """
    Compute gradients for all nodes via reverse mode autodiff.
    Assumes forward pass has been run and values are cached.
    """
    # Step 1: Initialize gradients
    grads = {node: 0 for node in all_nodes}
    grads[output_node] = 1.0  # ∂L/∂L = 1

    # Step 2: Get reverse topological order
    reverse_topo = reversed(topological_sort(graph))

    # Step 3: Propagate gradients backward
    for node in reverse_topo:
        if not node.inputs:
            continue  # Input node, no upstream

        # Get upstream gradient for this node
        upstream_grad = grads[node]
        if upstream_grad == 0:
            continue  # No gradient flows (optimization)

        # Get cached forward values
        input_values = [inp.cached_value for inp in node.inputs]
        output_value = node.cached_value

        # Compute VJP: gradients w.r.t each input
        input_grads = node.vjp_function(
            upstream_grad, output_value, *input_values
        )

        # Step 4: ACCUMULATE into input nodes (handle fan-out)
        for inp_node, inp_grad in zip(node.inputs, input_grads):
            grads[inp_node] += inp_grad  # += not = !!!

    return grads

# Example usage
class MultiplyNode:
    def __init__(self, x_node, y_node):
        self.inputs = [x_node, y_node]
        self.cached_value = None

    def forward(self):
        x, y = self.inputs[0].cached_value, self.inputs[1].cached_value
        self.cached_value = x * y
        return self.cached_value

    def vjp_function(self, upstream_grad, output_value, x, y):
        # z = x * y
        # ∂z/∂x = y, ∂z/∂y = x
        grad_x = upstream_grad * y
        grad_y = upstream_grad * x
        return [grad_x, grad_y]
```

Let's trace backpropagation through $L = (x + y) \cdot x$ with $x = 3, y = 2$.
Forward pass: $a = x + y = 3 + 2 = 5$, then $L = a \cdot x = 5 \cdot 3 = 15$.

Backward pass:
| Step | Node | Upstream $\bar{v}$ | VJP Computation | Result |
|---|---|---|---|---|
| 1 | $L$ | — | Initialize | $\bar{L} = 1$ |
| 2 | $L = a \cdot x$ | $\bar{L} = 1$ | $\bar{a} = \bar{L} \cdot x = 1 \cdot 3 = 3$ | $\bar{a} = 3$ |
| | | | $\bar{x} \mathrel{+}= \bar{L} \cdot a = 1 \cdot 5 = 5$ | $\bar{x} = 5$ |
| 3 | $a = x + y$ | $\bar{a} = 3$ | $\bar{x} \mathrel{+}= \bar{a} \cdot 1 = 3$ | $\bar{x} = 8$ |
| | | | $\bar{y} = \bar{a} \cdot 1 = 3$ | $\bar{y} = 3$ |
Final gradients: $\frac{\partial L}{\partial x} = 8$, $\frac{\partial L}{\partial y} = 3$
Verification: $L = (x + y) \cdot x = x^2 + xy$, so $\frac{\partial L}{\partial x} = 2x + y = 8$ and $\frac{\partial L}{\partial y} = x = 3$, matching the backward pass.
One of the most important details in backpropagation is gradient accumulation when a node's output is used by multiple downstream nodes (fan-out).
When a value $v$ contributes to the loss through multiple paths:
$$L = f(g_1(v), g_2(v), ...)$$
The total derivative is:
$$\frac{\partial L}{\partial v} = \sum_i \frac{\partial L}{\partial g_i} \cdot \frac{\partial g_i}{\partial v}$$
Each path contributes to the gradient. We must sum all contributions, not overwrite.
Consider $L = x^2 + x$. The variable $x$ is used twice: once in the squared term and once in the linear term.

Forward: $u = x \cdot x$, then $L = u + x$.

Backward: $\bar{u} = \bar{L} = 1$; the squared term contributes $\bar{x} \mathrel{+}= \bar{u} \cdot 2x$ and the linear term contributes $\bar{x} \mathrel{+}= 1$, so $\frac{\partial L}{\partial x} = 2x + 1$, as the short check below confirms.
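A short check of this accumulation (our own snippet), summing the per-path contributions for $L = x^2 + x$ at $x = 3$:

```python
x = 3.0
u = x * x                  # the multiplication sees x as both operands
L = u + x                  # x is also used directly by the addition

# Upstream gradients: ∂L/∂u = 1 and ∂L/∂x (direct, via the add) = 1
grad_x = 0.0
grad_x += 1.0 * x          # mul node, contribution through its first operand
grad_x += 1.0 * x          # mul node, contribution through its second operand
grad_x += 1.0              # add node, direct use of x
print(grad_x, 2 * x + 1)   # both 7.0
```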
Bug 1: Overwriting instead of accumulating
```python
# WRONG
for child in node.children:
    grads[node] = child.vjp(...)   # Overwrites previous!

# CORRECT
for child in node.children:
    grads[node] += child.vjp(...)  # Accumulates
```
Bug 2: Forgetting to initialize to zero
```python
# WRONG
# grads[node] is undefined, accumulation fails

# CORRECT
grads = {node: 0 for node in all_nodes}  # Initialize all to 0
```
Bug 3: Shared parameters across layers
In weight-sharing architectures (e.g., recurrent networks, Transformers with tied embeddings), the same parameter appears in multiple places. Gradients must accumulate across all uses:
```python
import torch
import torch.nn as nn

# Weight-tied architecture
class TiedEmbedding(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embedding = nn.Parameter(torch.randn(vocab_size, embed_dim))

    def forward(self, input_ids):
        # Embedding lookup (use 1)
        embedded = self.embedding[input_ids]
        # ... processing ... (intermediate layers elided)
        processed = embedded
        # Output projection (use 2 - same weights)
        logits = processed @ self.embedding.T
        return logits

# Gradients for self.embedding automatically accumulate from both uses!
```
Gradient accumulation from fan-out is mathematically required by the multivariate chain rule. No matter how your graph is structured, if a value is used multiple times, you MUST sum gradients from each use. Frameworks handle this automatically, but understanding it is essential for debugging and custom implementations.
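PyTorch's autograd applies exactly this rule. A tiny check (assuming PyTorch is available; the toy expression is ours) shows the summed gradient when a value feeds the loss through two paths:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
loss = x * x + x      # x reaches the loss through two paths
loss.backward()
print(x.grad)         # tensor(7.) = 2*x + 1, contributions summed automatically
```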
Let's implement a complete, working autograd system from scratch. This implementation demonstrates all the concepts we've discussed in practical code.
We'll build a system supporting: scalars and vectors, addition, multiplication, ReLU, sum reduction, and automatic gradient computation.
```python
import numpy as np
from typing import List, Tuple, Callable, Optional


def _unbroadcast(grad: np.ndarray, shape: Tuple[int, ...]) -> np.ndarray:
    """Sum `grad` over broadcast dimensions so it matches `shape`."""
    # Remove leading dimensions added by broadcasting
    while grad.ndim > len(shape):
        grad = grad.sum(axis=0)
    # Sum over axes that were size 1 in the original shape
    for axis, size in enumerate(shape):
        if size == 1 and grad.shape[axis] != 1:
            grad = grad.sum(axis=axis, keepdims=True)
    return grad.reshape(shape)


class Tensor:
    """
    A tensor with automatic differentiation support.
    Tracks the computation graph via _children and computes gradients via _backward.
    """

    def __init__(
        self,
        data: np.ndarray,
        requires_grad: bool = False,
        _children: Tuple['Tensor', ...] = (),
        _op: str = ''
    ):
        self.data = np.array(data, dtype=np.float64)
        self.requires_grad = requires_grad
        self.grad: Optional[np.ndarray] = None

        # Graph structure for autograd
        self._children = set(_children)
        self._op = _op
        self._backward: Callable[[], None] = lambda: None

    def __repr__(self) -> str:
        return f"Tensor({self.data}, grad={self.grad})"

    @property
    def shape(self) -> Tuple[int, ...]:
        return self.data.shape

    def zero_grad(self) -> None:
        """Zero out gradient."""
        self.grad = None

    def _accumulate_grad(self, grad: np.ndarray) -> None:
        """ACCUMULATE a gradient contribution (fan-out requires +=, never =)."""
        grad = _unbroadcast(grad, self.shape)
        self.grad = grad if self.grad is None else self.grad + grad

    # ============ Operations ============

    def __add__(self, other) -> 'Tensor':
        other = other if isinstance(other, Tensor) else Tensor(other)
        out = Tensor(
            self.data + other.data,
            requires_grad=self.requires_grad or other.requires_grad,
            _children=(self, other),
            _op='add'
        )

        def _backward():
            # z = x + y: gradient flows through unchanged to both inputs,
            # reduced over any broadcast dimensions.
            if self.requires_grad:
                self._accumulate_grad(out.grad)
            if other.requires_grad:
                other._accumulate_grad(out.grad)

        out._backward = _backward
        return out

    def __mul__(self, other) -> 'Tensor':
        other = other if isinstance(other, Tensor) else Tensor(other)
        out = Tensor(
            self.data * other.data,
            requires_grad=self.requires_grad or other.requires_grad,
            _children=(self, other),
            _op='mul'
        )

        def _backward():
            # z = x * y: ∂z/∂x = y, ∂z/∂y = x (elementwise)
            if self.requires_grad:
                self._accumulate_grad(out.grad * other.data)
            if other.requires_grad:
                other._accumulate_grad(out.grad * self.data)

        out._backward = _backward
        return out

    def __matmul__(self, other: 'Tensor') -> 'Tensor':
        out = Tensor(
            self.data @ other.data,
            requires_grad=self.requires_grad or other.requires_grad,
            _children=(self, other),
            _op='matmul'
        )

        def _backward():
            # Y = X @ W: ∂L/∂X = v @ W^T, ∂L/∂W = X^T @ v
            if self.requires_grad:
                self._accumulate_grad(out.grad @ other.data.T)
            if other.requires_grad:
                other._accumulate_grad(self.data.T @ out.grad)

        out._backward = _backward
        return out

    def __neg__(self) -> 'Tensor':
        return self * Tensor(-1.0)

    def __sub__(self, other: 'Tensor') -> 'Tensor':
        return self + (-other)

    def relu(self) -> 'Tensor':
        out = Tensor(
            np.maximum(0, self.data),
            requires_grad=self.requires_grad,
            _children=(self,),
            _op='relu'
        )

        def _backward():
            # ∂relu(x)/∂x = 1 where x > 0, else 0
            if self.requires_grad:
                self._accumulate_grad((self.data > 0) * out.grad)

        out._backward = _backward
        return out

    def sum(self) -> 'Tensor':
        out = Tensor(
            self.data.sum(),
            requires_grad=self.requires_grad,
            _children=(self,),
            _op='sum'
        )

        def _backward():
            # Gradient broadcasts to all elements
            if self.requires_grad:
                self._accumulate_grad(np.ones_like(self.data) * out.grad)

        out._backward = _backward
        return out

    def mean(self) -> 'Tensor':
        out = Tensor(
            self.data.mean(),
            requires_grad=self.requires_grad,
            _children=(self,),
            _op='mean'
        )

        def _backward():
            # Each element contributes 1/n to the mean
            if self.requires_grad:
                n = self.data.size
                self._accumulate_grad(np.ones_like(self.data) * out.grad / n)

        out._backward = _backward
        return out

    # ============ Backward Pass ============

    def backward(self) -> None:
        """
        Compute gradients via reverse-mode autodiff.
        Must be called on a scalar tensor.
        """
        assert self.data.size == 1, "backward() requires scalar output"

        # Build topological order
        topo: List['Tensor'] = []
        visited = set()

        def build_topo(tensor: 'Tensor'):
            if tensor not in visited:
                visited.add(tensor)
                for child in tensor._children:
                    build_topo(child)
                topo.append(tensor)

        build_topo(self)

        # Initialize gradient of output: ∂L/∂L = 1
        self.grad = np.ones_like(self.data)

        # Backpropagate in reverse topological order
        for tensor in reversed(topo):
            tensor._backward()
```
```python
# ============ Using Our Autograd System ============

# Example 1: Simple expression
print("=== Example 1: f(x,y) = (x+y)·x ===")
x = Tensor([3.0], requires_grad=True)
y = Tensor([2.0], requires_grad=True)

a = x + y   # a = 5
L = a * x   # L = 15

L.backward()
print(f"L = {L.data[0]}")
print(f"∂L/∂x = {x.grad[0]} (expected: 2x+y = 8)")
print(f"∂L/∂y = {y.grad[0]} (expected: x = 3)")

# Example 2: Neural network layer
print("=== Example 2: Linear layer + ReLU ===")
# y = relu(x @ W + b)
np.random.seed(42)
x = Tensor(np.random.randn(4, 3), requires_grad=False)       # Input (batch=4, features=3)
W = Tensor(np.random.randn(3, 2) * 0.1, requires_grad=True)  # Weights
b = Tensor(np.zeros(2), requires_grad=True)                  # Bias

# Forward pass
z = x @ W + b      # Linear combination
h = z.relu()       # Activation
loss = h.mean()    # Scalar loss

print(f"Output shape: {h.shape}")
print(f"Loss: {loss.data}")

# Backward pass
loss.backward()

print(f"W gradient shape: {W.grad.shape} (same as W)")
print(f"b gradient shape: {b.grad.shape} (same as b)")
print(f"W gradient:\n{W.grad}")

# Example 3: Verify correctness with numerical gradients
print("=== Gradient Check ===")

def numerical_gradient(f, x, eps=1e-5):
    grad = np.zeros_like(x.data)
    for idx in np.ndindex(x.data.shape):
        x.data[idx] += eps
        loss_plus = f().data
        x.data[idx] -= 2 * eps
        loss_minus = f().data
        x.data[idx] += eps  # restore
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

# Test gradient for a simple function
w = Tensor([2.0, 3.0], requires_grad=True)

def compute_loss():
    return ((w * w) + w * Tensor([1.0, 1.0])).sum()

loss = compute_loss()
loss.backward()

numerical_grad = numerical_gradient(compute_loss, w)

print(f"Autograd gradient:  {w.grad}")
print(f"Numerical gradient: {numerical_grad}")
print(f"Difference: {np.abs(w.grad - numerical_grad).max():.2e}")
```

This implementation follows the same principles as PyTorch's autograd. The main differences are: production systems use C++ for performance, handle edge cases exhaustively, support GPU tensors, and implement hundreds of operations. But the core algorithm—topological sort, backward traversal, VJP calls, gradient accumulation—is exactly what we've implemented.
Understanding the computational cost of backpropagation is essential for optimizing training performance and reasoning about model complexity.
Theorem: The backward pass takes only a small constant factor, roughly 2-3×, of the forward pass time.

Proof sketch: each elementary operation's VJP reads the same tensors as the forward operation and performs a comparable amount of arithmetic; the main exception is matrix multiplication, whose backward pass needs two matmuls (one per input) instead of one. Summing these per-operation costs over the graph gives the bound:
| Operation | Forward Cost | VJP Cost | Ratio |
|---|---|---|---|
| Add $z = x + y$ | $O(n)$ | $O(n)$ | 1× |
| Multiply $z = xy$ | $O(n)$ | $O(n)$ | 1× |
| MatMul $Y = XW$ | $O(bnm)$ | $O(2bnm)$ | ~2× |
| ReLU | $O(n)$ | $O(n)$ | 1× |
| Softmax | $O(nk)$ | $O(nk)$ | 1× |
| Cross-Entropy | $O(n)$ | $O(n)$ | 1× |
where $n$ is tensor size, $b$ is batch size, $m$ and $k$ are dimensions.
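To get a feel for the constant factor in practice, here is a rough timing sketch (assuming PyTorch; the model size is arbitrary and exact numbers depend on hardware):

```python
import time
import torch

# A small MLP; the sizes are arbitrary
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(),
    torch.nn.Linear(1024, 10),
)
x = torch.randn(256, 1024)

t0 = time.perf_counter()
loss = model(x).sum()
t1 = time.perf_counter()
loss.backward()
t2 = time.perf_counter()

print(f"forward:  {t1 - t0:.4f} s")
print(f"backward: {t2 - t1:.4f} s  (typically within ~1-3x of the forward time)")
```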
Backpropagation's memory requirements are dominated by activation caching:
Theorem: Memory for backprop scales as $O(n \cdot a)$ where $n$ is the number of layers and $a$ is the average activation size per layer.
For a transformer with:
- 12 layers
- batch size 32
- sequence length 512
- hidden dimension 768
- 12 attention heads
- 32-bit floats (4 bytes per value)
Activation memory per layer ≈ $32 \times 512 \times 768 \times 4$ bytes = 48 MB
Total activation memory ≈ $12 \times 48$ MB = 576 MB
Plus attention weight storage: $12 \times 32 \times 12 \times 512 \times 512 \times 4$ bytes ≈ 4.5 GB
Total: ~5 GB just for activations!
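The arithmetic above in a few lines of Python (a sketch using the same assumed configuration):

```python
layers, batch, seq_len, hidden, heads = 12, 32, 512, 768, 12
bytes_per_float = 4  # float32

act_per_layer = batch * seq_len * hidden * bytes_per_float
attn_per_layer = batch * heads * seq_len * seq_len * bytes_per_float

print(f"activations per layer:   {act_per_layer / 2**20:.0f} MB")            # ~48 MB
print(f"activations, all layers: {layers * act_per_layer / 2**20:.0f} MB")   # ~576 MB
print(f"attention weights:       {layers * attn_per_layer / 2**30:.1f} GB")  # ~4.5 GB
```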
This is why large models require significant GPU memory even at modest batch sizes.
Training large models is typically memory-bound, not compute-bound. Techniques like gradient checkpointing, mixed precision, and gradient accumulation across micro-batches all aim to reduce memory pressure. Understanding activation memory is crucial for fitting models on available hardware.
Gradient checkpointing trades computation for memory:
Without checkpointing: every layer's activations are cached for the backward pass, so activation memory grows as $O(n)$ in the number of layers $n$.

With checkpointing (segment size $k$): only the activations at segment boundaries are stored; activations inside a segment are recomputed when that segment's backward pass runs. Memory drops to roughly $O(n/k + k)$ at the cost of about one extra forward computation per segment.
Optimal choice: $k = \sqrt{n}$ gives $O(\sqrt{n})$ memory with $O(\sqrt{n})$ extra computation.
For a 100-layer network, checkpointing every 10 layers keeps roughly 10 boundary activations plus at most 10 recomputed layer activations at a time (around an 80% reduction in activation memory), while adding roughly one extra forward pass (about 30% more compute), as sketched below.
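In PyTorch, this trade-off is commonly made with `torch.utils.checkpoint`; the following is a minimal sketch (the toy 100-layer stack and segment count are our assumptions, not a prescription):

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

# A toy 100-layer stack standing in for the network above
layers = [torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU())
          for _ in range(100)]
model = torch.nn.Sequential(*layers)

x = torch.randn(32, 256, requires_grad=True)

# Split into 10 segments: only segment-boundary activations are kept;
# activations inside a segment are recomputed during its backward pass.
out = checkpoint_sequential(model, 10, x)
out.sum().backward()
```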
We've developed a comprehensive understanding of reverse mode automatic differentiation—the algorithm that makes deep learning computationally feasible. The key insights: backpropagation is a systematic application of the chain rule over a computational graph; reverse mode is efficient because a network has one scalar loss and millions of parameters; VJPs propagate gradients through each operation without ever forming a Jacobian; fan-out requires summing gradient contributions; and the backward pass costs only a small constant factor more time than the forward pass, with memory dominated by cached activations.
With forward and backward passes understood, we turn next to the infrastructure that makes them efficient in practice.
Backpropagation provides gradients; the next steps use those gradients to actually train networks.
You now understand reverse mode automatic differentiation in depth. You can explain why it's efficient for neural networks, implement it from scratch, and reason about its computational and memory costs. This algorithm is the engine that drives all of deep learning—every trained neural network relies on it for gradient computation.