If the forward pass is about computing outputs, the backward pass is about understanding how those outputs depend on inputs. Specifically, we need to compute gradients—partial derivatives of the loss with respect to every parameter. These gradients tell us how to adjust parameters to reduce the loss.
For a neural network with millions of parameters, computing these gradients might seem computationally infeasible. Remarkably, reverse mode automatic differentiation (commonly called backpropagation in the context of neural networks) computes all gradients in roughly the same time as a single forward pass.
This efficiency is not merely convenient—it's what makes deep learning possible. Without reverse mode autodiff, training networks with millions or billions of parameters would be computationally intractable.
In this section, we'll develop a complete understanding of reverse mode autodiff: its mathematical foundations, its implementation as an algorithm, and why it's exactly the right approach for neural networks.
By the end of this page, you'll understand: (1) The chain rule as the mathematical foundation of backpropagation, (2) Why reverse mode is efficient for neural networks (one output, many inputs), (3) The backward pass algorithm in detail, (4) Vector-Jacobian Products (VJPs) as the core operation, (5) Gradient accumulation for fan-out, and (6) Implementing backprop from scratch.
Reverse mode autodiff is, fundamentally, a systematic application of the chain rule of calculus. Let's build from simple to complex.
For composed functions $f(g(x))$, the derivative is:
$$\frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}$$
Example: If $f(u) = \sin(u)$ and $g(x) = x^2$, then $f(g(x)) = \sin(x^2)$:
$$\frac{d}{dx}\sin(x^2) = \cos(x^2) \cdot 2x$$
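As a quick numerical sanity check (a minimal sketch; the helper names are ours), the analytic chain-rule derivative matches a central finite difference:

```python
import numpy as np

def f(x):
    return np.sin(x ** 2)

def df_analytic(x):
    # Chain rule: d/dx sin(x^2) = cos(x^2) * 2x
    return np.cos(x ** 2) * 2 * x

x, eps = 1.3, 1e-6
df_numeric = (f(x + eps) - f(x - eps)) / (2 * eps)
print(df_analytic(x), df_numeric)  # the two values agree to ~8 decimal places
```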
For functions with multiple intermediate variables, we sum contributions:
If $z = f(u, v)$ where $u = g(x)$ and $v = h(x)$, then:
$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial u} \cdot \frac{\partial u}{\partial x} + \frac{\partial z}{\partial v} \cdot \frac{\partial v}{\partial x}$$
This summation is crucial: when a variable affects the output through multiple paths, we must sum the contributions from each path.
When a variable x influences the output through multiple intermediate values, the gradients from ALL paths must be summed. This is the mathematical basis for gradient accumulation in fan-out situations. Missing this summation is a common source of bugs in manual gradient implementations.
For a computational graph where output $L$ depends on $x$ through intermediate nodes $v_1, v_2, ..., v_n$:
$$\frac{\partial L}{\partial x} = \sum_{i: x \rightarrow v_i} \frac{\partial L}{\partial v_i} \cdot \frac{\partial v_i}{\partial x}$$
where the sum is over all nodes $v_i$ that directly depend on $x$.
This is exactly what backpropagation computes—systematically, efficiently, and automatically.
Key insight: $\frac{\partial z}{\partial x} = \frac{\partial z}{\partial u} \cdot \frac{\partial u}{\partial x} + \frac{\partial z}{\partial v} \cdot \frac{\partial v}{\partial x}$
Both paths from $x$ to $z$ contribute to the total derivative.
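To make the two-path sum concrete, here is a small check with $u = x^2$, $v = 3x$, and $z = u \cdot v$ (our own example), so analytically $z = 3x^3$ and $dz/dx = 9x^2$:

```python
x = 2.0
u, v = x ** 2, 3 * x        # two intermediates, both depending on x
z = u * v                   # z = 3x^3

# Path through u: (∂z/∂u)(∂u/∂x) = v * 2x
# Path through v: (∂z/∂v)(∂v/∂x) = u * 3
dz_dx = v * (2 * x) + u * 3
print(dz_dx, 9 * x ** 2)    # both print 36.0, the total derivative
```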
There are two fundamentally different ways to apply the chain rule through a computational graph: forward mode and reverse mode. Understanding their trade-offs explains why reverse mode is standard for neural networks.
Forward mode computes derivatives alongside the forward computation, propagating forward from inputs to outputs.
For each input $x_i$, we compute $\frac{\partial y}{\partial x_i}$ for all outputs $y$. We track tangent vectors (also called directional derivatives).
One forward pass per input variable.
Reverse mode first computes all forward values, then propagates derivatives backward from outputs to inputs.
For each output $y_j$, we compute $\frac{\partial y_j}{\partial x_i}$ for all inputs $x_i$. We track adjoint values (gradients of the output with respect to intermediate values).
One backward pass per output variable.
| Aspect | Forward Mode | Reverse Mode |
|---|---|---|
| Propagation direction | Input → Output | Output → Input |
| What's computed | ∂outputs/∂(one input) | ∂(one output)/∂inputs |
| Passes needed for full Jacobian | One per input | One per output |
| Efficient when | Few inputs, many outputs | Few outputs, many inputs |
| Neural network scenario | Inefficient (millions of inputs) | Efficient (one scalar loss) |
| Memory requirement | Forward values + tangents | All forward values (cached) |
| Common names | Tangent mode, forward AD | Adjoint mode, backprop, reverse AD |
In neural network training:
We need $\frac{\partial L}{\partial \theta_i}$ for all parameters $\theta_i$.
Forward mode cost: $n$ forward passes (one per parameter) → Infeasible
Reverse mode cost: 1 backward pass (one output) → Efficient
For a network with 1 million parameters, forward mode would need roughly 1,000,000 passes to obtain the full gradient, while reverse mode needs a single forward pass plus a single backward pass.
This is why backpropagation (reverse mode) is used universally in deep learning.
Forward mode is preferred when the function has few inputs and many outputs, or when only a directional derivative (a single Jacobian-vector product) is needed, for example in sensitivity analysis with a handful of parameters.
In practice, frameworks like JAX support both modes, allowing the optimal choice based on the computation structure.
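As a concrete illustration, a minimal JAX sketch (the toy function is ours) computes a directional derivative with forward mode and the full gradient with reverse mode:

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(jnp.sin(x) ** 2)   # many inputs -> one scalar output

x = jnp.arange(1.0, 4.0)              # [1., 2., 3.]

# Forward mode: one JVP gives the directional derivative along tangent v
v = jnp.ones_like(x)
_, dir_deriv = jax.jvp(f, (x,), (v,))

# Reverse mode: one VJP gives the full gradient at once
_, vjp_fn = jax.vjp(f, x)
(grad,) = vjp_fn(1.0)                 # upstream gradient of the scalar output

print(dir_deriv)   # v · ∇f(x)
print(grad)        # ∇f(x), same as jax.grad(f)(x)
```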
Computing a full gradient with forward mode requires O(n) passes, where n is the number of inputs; reverse mode requires O(m) passes, where m is the number of outputs. For neural networks with millions of parameters (n huge) and one scalar loss (m = 1), reverse mode is roughly a million times cheaper.
The core operation in reverse mode autodiff is the Vector-Jacobian Product (VJP). Understanding VJPs is essential for implementing and debugging backpropagation.
For a function $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$, the Jacobian is the $m \times n$ matrix of all partial derivatives:
$$J_f = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$$
Row $i$ contains gradients of output $f_i$ with respect to all inputs.
A VJP takes a vector $\mathbf{v} \in \mathbb{R}^m$ (the "upstream gradient") and produces:
$$\mathbf{v}^T J_f \in \mathbb{R}^n$$
This is the gradient of a scalar function with respect to inputs, where $\mathbf{v}$ represents how the scalar depends on $f$'s outputs.
Crucially: VJPs can be computed efficiently without ever forming the full Jacobian matrix. Each elementary operation has a VJP that directly computes the product.
```python
import numpy as np

# ===== VJP Examples for Common Operations =====

def add_vjp(v, x, y):
    """
    VJP for z = x + y
    Jacobian: J = [I, I] where I is identity
    VJP: v^T @ [I, I] = [v, v]
    Gradient flows through unchanged to both inputs.
    """
    grad_x = v  # ∂L/∂x = v * 1
    grad_y = v  # ∂L/∂y = v * 1
    return grad_x, grad_y

def multiply_vjp(v, x, y):
    """
    VJP for z = x * y (elementwise)
    ∂z/∂x = y, ∂z/∂y = x (elementwise)
    """
    grad_x = v * y  # ∂L/∂x = v * y
    grad_y = v * x  # ∂L/∂y = v * x
    return grad_x, grad_y

def matmul_vjp(v, X, W):
    """
    VJP for Y = X @ W
    X: (batch, n), W: (n, m), Y: (batch, m)
    v: (batch, m) - upstream gradient

    ∂L/∂X = v @ W^T  (shape: batch, n)
    ∂L/∂W = X^T @ v  (shape: n, m)
    """
    grad_X = v @ W.T
    grad_W = X.T @ v
    return grad_X, grad_W

def relu_vjp(v, x, output):
    """
    VJP for y = ReLU(x) = max(0, x)
    ∂y/∂x = 1 if x > 0 else 0
    """
    grad_x = v * (x > 0).astype(float)
    return grad_x

def softmax_cross_entropy_vjp(v, logits, labels):
    """
    VJP for L = -log(softmax(logits)[label])

    The beautiful result: gradient = softmax(logits) - one_hot(label)
    This is efficient and numerically stable!
    """
    # Compute softmax
    exp_logits = np.exp(logits - np.max(logits, axis=-1, keepdims=True))
    softmax = exp_logits / np.sum(exp_logits, axis=-1, keepdims=True)

    # Gradient is softmax - one_hot (for each example in batch)
    batch_size = logits.shape[0]
    grad_logits = softmax.copy()
    grad_logits[np.arange(batch_size), labels] -= 1

    # Scale by upstream gradient
    grad_logits *= v.reshape(-1, 1)
    return grad_logits

# Demonstration
print("=== VJP for Y = X @ W ===")
X = np.array([[1, 2], [3, 4]], dtype=float)
W = np.array([[0.5, 0.3, 0.2], [0.1, 0.4, 0.5]], dtype=float)
Y = X @ W

# Suppose upstream gradient is all ones
v = np.ones_like(Y)
grad_X, grad_W = matmul_vjp(v, X, W)

print(f"X shape: {X.shape}, W shape: {W.shape}")
print(f"Y shape: {Y.shape}, v shape: {v.shape}")
print(f"grad_X shape: {grad_X.shape} (same as X)")
print(f"grad_W shape: {grad_W.shape} (same as W)")
```

Consider matrix multiplication $Y = XW$ where $X \in \mathbb{R}^{b \times n}$ and $W \in \mathbb{R}^{n \times m}$.
The full Jacobian with respect to $W$ would be a $(bm) \times (nm)$ matrix—potentially enormous.
But the VJP $\frac{\partial L}{\partial W} = X^T \cdot \frac{\partial L}{\partial Y}$ computes this gradient directly with a single matrix product, producing an $n \times m$ array the same shape as $W$.
No large Jacobian is ever formed. This is the practical magic of reverse mode autodiff.
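To see the scale of the savings, here is a small sketch (with arbitrary sizes of our choosing) comparing the entry count of the full Jacobian with the arrays the VJP actually materializes:

```python
import numpy as np

b, n, m = 64, 512, 1024               # batch, input dim, output dim
X = np.random.randn(b, n)
W = np.random.randn(n, m)
v = np.random.randn(b, m)             # upstream gradient ∂L/∂Y

# The full Jacobian of Y with respect to W would hold (b*m) x (n*m) entries
print("Full Jacobian entries:", (b * m) * (n * m))   # ~3.4e10 numbers

# The VJP forms only arrays the size of the operands themselves
grad_W = X.T @ v                       # shape (n, m)
print("grad_W entries:", grad_W.size)  # 524,288 numbers
```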
For a framework to support autodiff for an operation, it must define that operation's VJP. This is the 'gradient function' or 'backward method' you register when defining custom operations. The VJP takes upstream gradients and returns gradients with respect to each input.
Now we can state the complete backward pass algorithm. This is reverse mode autodiff applied to a computational graph.
Input: a computational graph whose nodes hold cached forward values, plus the designated output (loss) node.

Output: the gradient of the output with respect to every node in the graph (in particular, every parameter).

Procedure: initialize the output's gradient to 1, visit nodes in reverse topological order, apply each node's VJP to its upstream gradient, and accumulate the results into its inputs:
```python
def backward_pass(output_node):
    """
    Compute gradients for all nodes via reverse mode autodiff.
    Assumes forward pass has been run and values are cached.
    """
    # Step 1: Initialize gradients
    grads = {node: 0 for node in all_nodes}
    grads[output_node] = 1.0  # ∂L/∂L = 1

    # Step 2: Get reverse topological order
    reverse_topo = reversed(topological_sort(graph))

    # Step 3: Propagate gradients backward
    for node in reverse_topo:
        if not node.inputs:
            continue  # Input node, no upstream

        # Get upstream gradient for this node
        upstream_grad = grads[node]
        if upstream_grad == 0:
            continue  # No gradient flows (optimization)

        # Get cached forward values
        input_values = [inp.cached_value for inp in node.inputs]
        output_value = node.cached_value

        # Compute VJP: gradients w.r.t each input
        input_grads = node.vjp_function(
            upstream_grad, output_value, *input_values
        )

        # Step 4: ACCUMULATE into input nodes (handle fan-out)
        for inp_node, inp_grad in zip(node.inputs, input_grads):
            grads[inp_node] += inp_grad  # += not = !!!

    return grads

# Example usage
class MultiplyNode:
    def __init__(self, x_node, y_node):
        self.inputs = [x_node, y_node]
        self.cached_value = None

    def forward(self):
        x, y = self.inputs[0].cached_value, self.inputs[1].cached_value
        self.cached_value = x * y
        return self.cached_value

    def vjp_function(self, upstream_grad, output_value, x, y):
        # z = x * y
        # ∂z/∂x = y, ∂z/∂y = x
        grad_x = upstream_grad * y
        grad_y = upstream_grad * x
        return [grad_x, grad_y]
```

Let's trace backpropagation through $L = (x + y) \cdot x$ with $x = 3, y = 2$.
Forward pass: $a = x + y = 3 + 2 = 5$, then $L = a \cdot x = 5 \cdot 3 = 15$.

Backward pass:
| Step | Node | Upstream $\bar{v}$ | VJP Computation | Result |
|---|---|---|---|---|
| 1 | $L$ | — | Initialize | $\bar{L} = 1$ |
| 2 | $L = a \cdot x$ | $\bar{L} = 1$ | $\bar{a} = \bar{L} \cdot x = 1 \cdot 3 = 3$ | $\bar{a} = 3$ |
| | | | $\bar{x} \mathrel{+}= \bar{L} \cdot a = 1 \cdot 5 = 5$ | $\bar{x} = 5$ |
| 3 | $a = x + y$ | $\bar{a} = 3$ | $\bar{x} \mathrel{+}= \bar{a} \cdot 1 = 3$ | $\bar{x} = 8$ |
| | | | $\bar{y} = \bar{a} \cdot 1 = 3$ | $\bar{y} = 3$ |
Final gradients: $\frac{\partial L}{\partial x} = 8$, $\frac{\partial L}{\partial y} = 3$
Verification: $L = (x + y) \cdot x = x^2 + xy$, so $\frac{\partial L}{\partial x} = 2x + y = 8$ and $\frac{\partial L}{\partial y} = x = 3$, matching the backward pass.
One of the most important details in backpropagation is gradient accumulation when a node's output is used by multiple downstream nodes (fan-out).
When a value $v$ contributes to the loss through multiple paths:
$$L = f(g_1(v), g_2(v), ...)$$
The total derivative is:
$$\frac{\partial L}{\partial v} = \sum_i \frac{\partial L}{\partial g_i} \cdot \frac{\partial g_i}{\partial v}$$
Each path contributes to the gradient. We must sum all contributions, not overwrite.
Consider $L = x^2 + x$. The variable $x$ is used twice: once in the squared term and once in the linear term.

Forward: $u = x \cdot x$, then $L = u + x$.

Backward: $\bar{u} = \bar{L} = 1$; the squared term contributes $\bar{x} \mathrel{+}= \bar{u} \cdot 2x$ and the linear term contributes $\bar{x} \mathrel{+}= 1$, so $\frac{\partial L}{\partial x} = 2x + 1$, as the short check below confirms.
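A short check of this accumulation (our own snippet), summing the per-path contributions for $L = x^2 + x$ at $x = 3$:

```python
x = 3.0
u = x * x                  # the multiplication sees x as both operands
L = u + x                  # x is also used directly by the addition

# Upstream gradients: ∂L/∂u = 1 and ∂L/∂x (direct, via the add) = 1
grad_x = 0.0
grad_x += 1.0 * x          # mul node, contribution through its first operand
grad_x += 1.0 * x          # mul node, contribution through its second operand
grad_x += 1.0              # add node, direct use of x
print(grad_x, 2 * x + 1)   # both 7.0
```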
Bug 1: Overwriting instead of accumulating
```python
# WRONG
for child in node.children:
    grads[node] = child.vjp(...)   # Overwrites previous!

# CORRECT
for child in node.children:
    grads[node] += child.vjp(...)  # Accumulates
```
Bug 2: Forgetting to initialize to zero
```python
# WRONG
# grads[node] is undefined, accumulation fails

# CORRECT
grads = {node: 0 for node in all_nodes}  # Initialize all to 0
```
Bug 3: Shared parameters across layers
In weight-sharing architectures (e.g., recurrent networks, Transformers with tied embeddings), the same parameter appears in multiple places. Gradients must accumulate across all uses:
```python
import torch
import torch.nn as nn

# Weight-tied architecture
class TiedEmbedding(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embedding = nn.Parameter(torch.randn(vocab_size, embed_dim))

    def forward(self, input_ids):
        # Embedding lookup (use 1)
        embedded = self.embedding[input_ids]
        # ... processing ... (intermediate layers elided)
        processed = embedded
        # Output projection (use 2 - same weights)
        logits = processed @ self.embedding.T
        return logits

# Gradients for self.embedding automatically accumulate from both uses!
```
Gradient accumulation from fan-out is mathematically required by the multivariate chain rule. No matter how your graph is structured, if a value is used multiple times, you MUST sum gradients from each use. Frameworks handle this automatically, but understanding it is essential for debugging and custom implementations.
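PyTorch's autograd applies exactly this rule. A tiny check (assuming PyTorch is available; the toy expression is ours) shows the summed gradient when a value feeds the loss through two paths:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
loss = x * x + x      # x reaches the loss through two paths
loss.backward()
print(x.grad)         # tensor(7.) = 2*x + 1, contributions summed automatically
```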
Let's implement a complete, working autograd system from scratch. This implementation demonstrates all the concepts we've discussed in practical code.
We'll build a system supporting: scalars and vectors, addition, multiplication, ReLU, sum reduction, and automatic gradient computation.
```python
import numpy as np
from typing import List, Tuple, Callable, Optional


def _unbroadcast(grad: np.ndarray, shape: Tuple[int, ...]) -> np.ndarray:
    """Sum `grad` over broadcast dimensions so it matches `shape`."""
    # Remove leading dimensions added by broadcasting
    while grad.ndim > len(shape):
        grad = grad.sum(axis=0)
    # Sum over axes that were size 1 in the original shape
    for axis, size in enumerate(shape):
        if size == 1 and grad.shape[axis] != 1:
            grad = grad.sum(axis=axis, keepdims=True)
    return grad.reshape(shape)


class Tensor:
    """
    A tensor with automatic differentiation support.
    Tracks the computation graph via _children and computes gradients via _backward.
    """

    def __init__(
        self,
        data: np.ndarray,
        requires_grad: bool = False,
        _children: Tuple['Tensor', ...] = (),
        _op: str = ''
    ):
        self.data = np.array(data, dtype=np.float64)
        self.requires_grad = requires_grad
        self.grad: Optional[np.ndarray] = None

        # Graph structure for autograd
        self._children = set(_children)
        self._op = _op
        self._backward: Callable[[], None] = lambda: None

    def __repr__(self) -> str:
        return f"Tensor({self.data}, grad={self.grad})"

    @property
    def shape(self) -> Tuple[int, ...]:
        return self.data.shape

    def zero_grad(self) -> None:
        """Zero out gradient."""
        self.grad = None

    def _accumulate_grad(self, grad: np.ndarray) -> None:
        """ACCUMULATE a gradient contribution (fan-out requires +=, never =)."""
        grad = _unbroadcast(grad, self.shape)
        self.grad = grad if self.grad is None else self.grad + grad

    # ============ Operations ============

    def __add__(self, other) -> 'Tensor':
        other = other if isinstance(other, Tensor) else Tensor(other)
        out = Tensor(
            self.data + other.data,
            requires_grad=self.requires_grad or other.requires_grad,
            _children=(self, other),
            _op='add'
        )

        def _backward():
            # z = x + y: gradient flows through unchanged to both inputs,
            # reduced over any broadcast dimensions.
            if self.requires_grad:
                self._accumulate_grad(out.grad)
            if other.requires_grad:
                other._accumulate_grad(out.grad)

        out._backward = _backward
        return out

    def __mul__(self, other) -> 'Tensor':
        other = other if isinstance(other, Tensor) else Tensor(other)
        out = Tensor(
            self.data * other.data,
            requires_grad=self.requires_grad or other.requires_grad,
            _children=(self, other),
            _op='mul'
        )

        def _backward():
            # z = x * y: ∂z/∂x = y, ∂z/∂y = x (elementwise)
            if self.requires_grad:
                self._accumulate_grad(out.grad * other.data)
            if other.requires_grad:
                other._accumulate_grad(out.grad * self.data)

        out._backward = _backward
        return out

    def __matmul__(self, other: 'Tensor') -> 'Tensor':
        out = Tensor(
            self.data @ other.data,
            requires_grad=self.requires_grad or other.requires_grad,
            _children=(self, other),
            _op='matmul'
        )

        def _backward():
            # Y = X @ W: ∂L/∂X = v @ W^T, ∂L/∂W = X^T @ v
            if self.requires_grad:
                self._accumulate_grad(out.grad @ other.data.T)
            if other.requires_grad:
                other._accumulate_grad(self.data.T @ out.grad)

        out._backward = _backward
        return out

    def __neg__(self) -> 'Tensor':
        return self * Tensor(-1.0)

    def __sub__(self, other: 'Tensor') -> 'Tensor':
        return self + (-other)

    def relu(self) -> 'Tensor':
        out = Tensor(
            np.maximum(0, self.data),
            requires_grad=self.requires_grad,
            _children=(self,),
            _op='relu'
        )

        def _backward():
            # ∂relu(x)/∂x = 1 where x > 0, else 0
            if self.requires_grad:
                self._accumulate_grad((self.data > 0) * out.grad)

        out._backward = _backward
        return out

    def sum(self) -> 'Tensor':
        out = Tensor(
            self.data.sum(),
            requires_grad=self.requires_grad,
            _children=(self,),
            _op='sum'
        )

        def _backward():
            # Gradient broadcasts to all elements
            if self.requires_grad:
                self._accumulate_grad(np.ones_like(self.data) * out.grad)

        out._backward = _backward
        return out

    def mean(self) -> 'Tensor':
        out = Tensor(
            self.data.mean(),
            requires_grad=self.requires_grad,
            _children=(self,),
            _op='mean'
        )

        def _backward():
            # Each element contributes 1/n to the mean
            if self.requires_grad:
                n = self.data.size
                self._accumulate_grad(np.ones_like(self.data) * out.grad / n)

        out._backward = _backward
        return out

    # ============ Backward Pass ============

    def backward(self) -> None:
        """
        Compute gradients via reverse-mode autodiff.
        Must be called on a scalar tensor.
        """
        assert self.data.size == 1, "backward() requires scalar output"

        # Build topological order
        topo: List['Tensor'] = []
        visited = set()

        def build_topo(tensor: 'Tensor'):
            if tensor not in visited:
                visited.add(tensor)
                for child in tensor._children:
                    build_topo(child)
                topo.append(tensor)

        build_topo(self)

        # Initialize gradient of output: ∂L/∂L = 1
        self.grad = np.ones_like(self.data)

        # Backpropagate in reverse topological order
        for tensor in reversed(topo):
            tensor._backward()
```
```python
# ============ Using Our Autograd System ============

# Example 1: Simple expression
print("=== Example 1: f(x,y) = (x+y)·x ===")
x = Tensor([3.0], requires_grad=True)
y = Tensor([2.0], requires_grad=True)

a = x + y   # a = 5
L = a * x   # L = 15

L.backward()
print(f"L = {L.data[0]}")
print(f"∂L/∂x = {x.grad[0]} (expected: 2x+y = 8)")
print(f"∂L/∂y = {y.grad[0]} (expected: x = 3)")

# Example 2: Neural network layer
print("=== Example 2: Linear layer + ReLU ===")
# y = relu(x @ W + b)
np.random.seed(42)
x = Tensor(np.random.randn(4, 3), requires_grad=False)       # Input (batch=4, features=3)
W = Tensor(np.random.randn(3, 2) * 0.1, requires_grad=True)  # Weights
b = Tensor(np.zeros(2), requires_grad=True)                  # Bias

# Forward pass
z = x @ W + b      # Linear combination
h = z.relu()       # Activation
loss = h.mean()    # Scalar loss

print(f"Output shape: {h.shape}")
print(f"Loss: {loss.data}")

# Backward pass
loss.backward()

print(f"W gradient shape: {W.grad.shape} (same as W)")
print(f"b gradient shape: {b.grad.shape} (same as b)")
print(f"W gradient:\n{W.grad}")

# Example 3: Verify correctness with numerical gradients
print("=== Gradient Check ===")

def numerical_gradient(f, x, eps=1e-5):
    grad = np.zeros_like(x.data)
    for idx in np.ndindex(x.data.shape):
        x.data[idx] += eps
        loss_plus = f().data
        x.data[idx] -= 2 * eps
        loss_minus = f().data
        x.data[idx] += eps  # restore
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

# Test gradient for a simple function
w = Tensor([2.0, 3.0], requires_grad=True)

def compute_loss():
    return ((w * w) + w * Tensor([1.0, 1.0])).sum()

loss = compute_loss()
loss.backward()

numerical_grad = numerical_gradient(compute_loss, w)

print(f"Autograd gradient:  {w.grad}")
print(f"Numerical gradient: {numerical_grad}")
print(f"Difference: {np.abs(w.grad - numerical_grad).max():.2e}")
```

This implementation follows the same principles as PyTorch's autograd. The main differences are: production systems use C++ for performance, handle edge cases exhaustively, support GPU tensors, and implement hundreds of operations. But the core algorithm—topological sort, backward traversal, VJP calls, gradient accumulation—is exactly what we've implemented.
Understanding the computational cost of backpropagation is essential for optimizing training performance and reasoning about model complexity.
Theorem: The backward pass takes only a small constant factor, roughly 2-3×, of the forward pass time.

Proof sketch: each elementary operation's VJP reads the same tensors as the forward operation and performs a comparable amount of arithmetic; the main exception is matrix multiplication, whose backward pass needs two matmuls (one per input) instead of one. Summing these per-operation costs over the graph gives the bound:
| Operation | Forward Cost | VJP Cost | Ratio |
|---|---|---|---|
| Add $z = x + y$ | $O(n)$ | $O(n)$ | 1× |
| Multiply $z = xy$ | $O(n)$ | $O(n)$ | 1× |
| MatMul $Y = XW$ | $O(bnm)$ | $O(2bnm)$ | ~2× |
| ReLU | $O(n)$ | $O(n)$ | 1× |
| Softmax | $O(nk)$ | $O(nk)$ | 1× |
| Cross-Entropy | $O(n)$ | $O(n)$ | 1× |
where $n$ is tensor size, $b$ is batch size, $m$ and $k$ are dimensions.
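To get a feel for the constant factor in practice, here is a rough timing sketch (assuming PyTorch; the model size is arbitrary and exact numbers depend on hardware):

```python
import time
import torch

# A small MLP; the sizes are arbitrary
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(),
    torch.nn.Linear(1024, 10),
)
x = torch.randn(256, 1024)

t0 = time.perf_counter()
loss = model(x).sum()
t1 = time.perf_counter()
loss.backward()
t2 = time.perf_counter()

print(f"forward:  {t1 - t0:.4f} s")
print(f"backward: {t2 - t1:.4f} s  (typically within ~1-3x of the forward time)")
```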
Backpropagation's memory requirements are dominated by activation caching:
Theorem: Memory for backprop scales as $O(n \cdot a)$ where $n$ is the number of layers and $a$ is the average activation size per layer.
For a transformer with:
- 12 layers
- batch size 32
- sequence length 512
- hidden dimension 768
- 12 attention heads
- 32-bit floats (4 bytes per value)
Activation memory per layer ≈ $32 \times 512 \times 768 \times 4$ bytes = 48 MB
Total activation memory ≈ $12 \times 48$ MB = 576 MB
Plus attention weight storage: $12 \times 32 \times 12 \times 512 \times 512 \times 4$ bytes ≈ 4.5 GB
Total: ~5 GB just for activations!
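The arithmetic above in a few lines of Python (a sketch using the same assumed configuration):

```python
layers, batch, seq_len, hidden, heads = 12, 32, 512, 768, 12
bytes_per_float = 4  # float32

act_per_layer = batch * seq_len * hidden * bytes_per_float
attn_per_layer = batch * heads * seq_len * seq_len * bytes_per_float

print(f"activations per layer:   {act_per_layer / 2**20:.0f} MB")            # ~48 MB
print(f"activations, all layers: {layers * act_per_layer / 2**20:.0f} MB")   # ~576 MB
print(f"attention weights:       {layers * attn_per_layer / 2**30:.1f} GB")  # ~4.5 GB
```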
This is why large models require significant GPU memory even at modest batch sizes.
Training large models is typically memory-bound, not compute-bound. Techniques like gradient checkpointing, mixed precision, and gradient accumulation across micro-batches all aim to reduce memory pressure. Understanding activation memory is crucial for fitting models on available hardware.
Gradient checkpointing trades computation for memory:
Without checkpointing: every layer's activations are cached for the backward pass, so activation memory grows as $O(n)$ in the number of layers $n$.

With checkpointing (segment size $k$): only the activations at segment boundaries are stored; activations inside a segment are recomputed when that segment's backward pass runs. Memory drops to roughly $O(n/k + k)$ at the cost of about one extra forward computation per segment.
Optimal choice: $k = \sqrt{n}$ gives $O(\sqrt{n})$ memory with $O(\sqrt{n})$ extra computation.
For a 100-layer network, checkpointing every 10 layers keeps roughly 10 boundary activations plus at most 10 recomputed layer activations at a time (around an 80% reduction in activation memory), while adding roughly one extra forward pass (about 30% more compute), as sketched below.
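In PyTorch, this trade-off is commonly made with `torch.utils.checkpoint`; the following is a minimal sketch (the toy 100-layer stack and segment count are our assumptions, not a prescription):

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

# A toy 100-layer stack standing in for the network above
layers = [torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU())
          for _ in range(100)]
model = torch.nn.Sequential(*layers)

x = torch.randn(32, 256, requires_grad=True)

# Split into 10 segments: only segment-boundary activations are kept;
# activations inside a segment are recomputed during its backward pass.
out = checkpoint_sequential(model, 10, x)
out.sum().backward()
```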
We've developed a comprehensive understanding of reverse mode automatic differentiation—the algorithm that makes deep learning computationally feasible. The key insights: backpropagation is a systematic application of the chain rule over a computational graph; reverse mode is efficient because a network has one scalar loss and millions of parameters; VJPs propagate gradients through each operation without ever forming a Jacobian; fan-out requires summing gradient contributions; and the backward pass costs only a small constant factor more time than the forward pass, with memory dominated by cached activations.
With forward and backward passes understood, we turn next to the infrastructure that makes them efficient in practice.
Backpropagation provides gradients; the next steps use those gradients to actually train networks.
You now understand reverse mode automatic differentiation in depth. You can explain why it's efficient for neural networks, implement it from scratch, and reason about its computational and memory costs. This algorithm is the engine that drives all of deep learning—every trained neural network relies on it for gradient computation.