If neural networks are the body of modern AI, then backpropagation is their heartbeat—the rhythmic flow of gradient information that enables learning. And at the very core of backpropagation lies a fundamental concept from calculus that every student encounters in their first year: the chain rule.
Yet the simplicity of the chain rule belies its profound power. When applied systematically to the computational graphs that represent neural networks, it transforms the seemingly intractable problem of computing gradients through millions of parameters into an elegant, efficient algorithm that runs in linear time.
In this page, we will develop a deep, rigorous understanding of how the chain rule serves as the mathematical foundation of backpropagation, building from first principles to the general formulation used in modern deep learning frameworks.
By the end of this page, you will understand: (1) The univariate and multivariate chain rule formulations, (2) How the chain rule applies to computational graphs, (3) The mathematical derivation of gradient propagation through composed functions, (4) Why backpropagation achieves optimal computational efficiency, and (5) The connection between local gradients and global optimization.
We begin with the simplest form of the chain rule, which you likely encountered in introductory calculus. This foundation will build toward the more general formulations needed for neural networks.
The Basic Chain Rule:
Given two differentiable functions $f: \mathbb{R} \to \mathbb{R}$ and $g: \mathbb{R} \to \mathbb{R}$, the derivative of the composition $h(x) = f(g(x))$ is:
$$\frac{dh}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}$$
Or using Leibniz notation more explicitly:
$$\frac{dh}{dx} = \frac{df}{du}\bigg|_{u=g(x)} \cdot \frac{dg}{dx}$$
This formula states a profound truth: the derivative of a composition is the product of the derivatives of its components. Each function's local rate of change combines multiplicatively to produce the global rate of change.
Imagine two connected gears. If the first gear amplifies rotation by factor 3 (df/dg = 3) and the second amplifies by factor 2 (dg/dx = 2), then the total amplification is 3 × 2 = 6. The chain rule captures exactly this multiplicative composition of rates of change.
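The gear analogy can be checked directly with two linear functions whose slopes play the role of the amplification factors (a minimal sketch; the specific functions are illustrative):

```python
# Two "gears": linear functions with slopes 3 and 2
f = lambda u: 3 * u        # df/du = 3
g = lambda x: 2 * x        # dg/dx = 2
h = lambda x: f(g(x))      # h(x) = 6x, so dh/dx = 6

# Finite-difference check of the composed slope
eps = 1e-7
slope = (h(1.0 + eps) - h(1.0 - eps)) / (2 * eps)
print(slope)  # ≈ 6.0 = 3 × 2
```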
Concrete Example:
Let $g(x) = x^2$ and $f(u) = \sin(u)$, so $h(x) = \sin(x^2)$.
Applying the chain rule with $u = g(x) = x^2$:

$$\frac{dh}{dx} = \frac{df}{du}\bigg|_{u=x^2} \cdot \frac{dg}{dx} = \cos(x^2) \cdot 2x$$

At $x = 2$, this gives $4\cos(4) \approx -2.6146$, which we can verify numerically.
Extension to Multiple Compositions:
The chain rule extends naturally to any finite composition. For $h(x) = f_n(f_{n-1}(\cdots f_1(x)))$:
$$\frac{dh}{dx} = \frac{df_n}{df_{n-1}} \cdot \frac{df_{n-1}}{df_{n-2}} \cdots \frac{df_2}{df_1} \cdot \frac{df_1}{dx}$$
This is a product of local derivatives—each factor represents how one stage amplifies or attenuates changes from the previous stage. This multiplicative structure is precisely what backpropagation exploits.
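The product-of-local-derivatives structure can be traced explicitly for a three-function chain (the particular functions here are an illustrative choice):

```python
import numpy as np

# Sketch: dh/dx for h(x) = f3(f2(f1(x))) as a product of local derivatives.
# Illustrative chain: f1(x) = x^2, f2(u) = sin(u), f3(v) = exp(v).
x = 0.7
a1 = x ** 2            # f1
a2 = np.sin(a1)        # f2
a3 = np.exp(a2)        # f3 -> h(x)

# Local derivatives, each evaluated at the forward values
d1 = 2 * x             # df1/dx
d2 = np.cos(a1)        # df2/df1
d3 = np.exp(a2)        # df3/df2

dh_dx = d3 * d2 * d1   # product of local derivatives

# Verify against a central difference
h = lambda t: np.exp(np.sin(t ** 2))
eps = 1e-6
numeric = (h(x + eps) - h(x - eps)) / (2 * eps)
print(dh_dx, numeric)  # the two values agree
```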
```python
import numpy as np

# Example: Computing derivatives via chain rule
# h(x) = sin(x^2) = f(g(x)) where g(x) = x^2, f(u) = sin(u)

def g(x):
    """Inner function: g(x) = x^2"""
    return x ** 2

def dg_dx(x):
    """Derivative of g: dg/dx = 2x"""
    return 2 * x

def f(u):
    """Outer function: f(u) = sin(u)"""
    return np.sin(u)

def df_du(u):
    """Derivative of f: df/du = cos(u)"""
    return np.cos(u)

def h(x):
    """Composition: h(x) = f(g(x)) = sin(x^2)"""
    return f(g(x))

def dh_dx_chain_rule(x):
    """Chain rule: dh/dx = df/du|_{u=g(x)} * dg/dx"""
    u = g(x)  # Intermediate value
    return df_du(u) * dg_dx(x)

# Verify with numerical differentiation
def numerical_derivative(func, x, epsilon=1e-7):
    """Compute derivative numerically using central difference"""
    return (func(x + epsilon) - func(x - epsilon)) / (2 * epsilon)

# Test at x = 2.0
x_test = 2.0
analytical = dh_dx_chain_rule(x_test)
numerical = numerical_derivative(h, x_test)

print(f"x = {x_test}")
print(f"h(x) = sin(x²) = {h(x_test):.6f}")
print(f"Analytical derivative (chain rule): {analytical:.6f}")
print(f"Numerical derivative: {numerical:.6f}")
print(f"Difference: {abs(analytical - numerical):.2e}")

# Output:
# x = 2.0
# h(x) = sin(x²) = -0.756802
# Analytical derivative (chain rule): -2.614575
# Numerical derivative: -2.614575
# Difference: 1.77e-10
```

Neural networks operate on vectors, matrices, and tensors—not single numbers. To apply the chain rule in this setting, we must generalize to multivariate functions. This generalization introduces the concept of partial derivatives and the Jacobian matrix.
Setup:
Consider a function $f: \mathbb{R}^n \to \mathbb{R}^m$ that maps an $n$-dimensional input to an $m$-dimensional output. The Jacobian matrix $\mathbf{J}_f$ is an $m \times n$ matrix containing all first-order partial derivatives:
$$\mathbf{J}_f = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$$
The entry $(\mathbf{J}_f)_{ij} = \frac{\partial f_i}{\partial x_j}$ tells us how the $i$-th output changes with respect to the $j$-th input.
The Jacobian matrix represents the best linear approximation to f near a point. For small perturbations δx, we have f(x + δx) ≈ f(x) + J_f · δx. This is the multivariate generalization of the tangent line approximation.
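This linear-approximation property is easy to verify numerically. The sketch below uses an illustrative function $f: \mathbb{R}^2 \to \mathbb{R}^2$ of my own choosing:

```python
import numpy as np

# Sketch: the Jacobian as the best linear approximation.
# Illustrative choice: f(x) = [x1*x2, x1 + x2^2]
def f(x):
    return np.array([x[0] * x[1], x[0] + x[1] ** 2])

def jacobian_f(x):
    return np.array([
        [x[1], x[0]],         # d(x1*x2)/dx1,   d(x1*x2)/dx2
        [1.0,  2 * x[1]],     # d(x1+x2^2)/dx1, d(x1+x2^2)/dx2
    ])

x = np.array([1.0, 2.0])
dx = np.array([1e-3, -2e-3])           # small perturbation

exact = f(x + dx)
linear = f(x) + jacobian_f(x) @ dx     # first-order approximation
print(np.max(np.abs(exact - linear)))  # tiny: the error is O(||dx||^2)
```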
The Multivariate Chain Rule:
For compositions of vector functions $\mathbf{h}(\mathbf{x}) = \mathbf{f}(\mathbf{g}(\mathbf{x}))$, the chain rule becomes:
$$\mathbf{J}_h = \mathbf{J}_f \cdot \mathbf{J}_g$$
This is matrix multiplication of Jacobians! The chain rule, which was scalar multiplication in one dimension, becomes matrix multiplication in multiple dimensions.
Dimensions: If $\mathbf{g}: \mathbb{R}^n \to \mathbb{R}^m$ and $\mathbf{f}: \mathbb{R}^m \to \mathbb{R}^p$, then $\mathbf{J}_g$ is $m \times n$, $\mathbf{J}_f$ is $p \times m$, and their product $\mathbf{J}_h = \mathbf{J}_f \cdot \mathbf{J}_g$ is $p \times n$, exactly the shape required for $\mathbf{h}: \mathbb{R}^n \to \mathbb{R}^p$.
Special Case: Scalar Output (The Gradient)
In deep learning, we typically have a scalar loss function $L: \mathbb{R}^n \to \mathbb{R}$. The Jacobian of a scalar function is a $1 \times n$ row vector, which we call the gradient (often transposed to a column vector):
$$\nabla L = \begin{bmatrix} \frac{\partial L}{\partial x_1} & \frac{\partial L}{\partial x_2} & \cdots & \frac{\partial L}{\partial x_n} \end{bmatrix}^T$$
For a composition $L = L_{\text{final}}(\mathbf{f}(\mathbf{x}))$, the gradient with respect to inputs is:
$$\nabla_\mathbf{x} L = \mathbf{J}_f^T \cdot \nabla_{\mathbf{f}} L$$
This is the core equation of backpropagation: gradients with respect to earlier variables are computed by multiplying the Jacobian transpose with gradients from later stages.
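The transpose-Jacobian rule can be checked end to end on a small example (the function $f$ and the toy loss $L = \sum_i f_i^2$ below are illustrative choices):

```python
import numpy as np

# Sketch of the core equation: grad_x L = J_f^T @ grad_f L.
# Illustrative f: R^3 -> R^2, with toy loss L(f) = sum(f^2).
def f(x):
    return np.array([x[0] * x[1], x[1] + x[2]])

def jacobian_f(x):
    return np.array([
        [x[1], x[0], 0.0],
        [0.0,  1.0,  1.0],
    ])  # shape (2, 3)

x = np.array([1.0, 2.0, 3.0])
fx = f(x)

grad_f_L = 2 * fx                        # dL/df for L = sum(f^2)
grad_x_L = jacobian_f(x).T @ grad_f_L    # (3,2) @ (2,) -> (3,)

# Numerical check, one input coordinate at a time
def L(x):
    return np.sum(f(x) ** 2)

eps = 1e-6
numeric = np.array([
    (L(x + eps * e) - L(x - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
print(grad_x_L, numeric)  # [8. 14. 10.] both ways
```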
```python
import numpy as np

# Example: Multivariate chain rule
# g: R^2 -> R^3, f: R^3 -> R^2
# h = f(g(x)): R^2 -> R^2

def g(x):
    """
    g: R^2 -> R^3
    g(x1, x2) = [x1^2, x1*x2, x2^2]
    """
    x1, x2 = x
    return np.array([x1**2, x1*x2, x2**2])

def jacobian_g(x):
    """
    Jacobian of g: 3x2 matrix
    J_g = [[2*x1, 0   ],
           [x2,   x1  ],
           [0,    2*x2]]
    """
    x1, x2 = x
    return np.array([
        [2*x1, 0],
        [x2, x1],
        [0, 2*x2]
    ])

def f(u):
    """
    f: R^3 -> R^2
    f(u1, u2, u3) = [sin(u1) + u2, u2 * u3]
    """
    u1, u2, u3 = u
    return np.array([np.sin(u1) + u2, u2 * u3])

def jacobian_f(u):
    """
    Jacobian of f: 2x3 matrix
    J_f = [[cos(u1), 1,  0 ],
           [0,       u3, u2]]
    """
    u1, u2, u3 = u
    return np.array([
        [np.cos(u1), 1, 0],
        [0, u3, u2]
    ])

def h(x):
    """Composition: h(x) = f(g(x))"""
    return f(g(x))

def jacobian_h_chain_rule(x):
    """Chain rule: J_h = J_f(g(x)) * J_g(x)"""
    u = g(x)
    J_f = jacobian_f(u)  # 2x3
    J_g = jacobian_g(x)  # 3x2
    return J_f @ J_g     # 2x2

# Numerical Jacobian for verification
def numerical_jacobian(func, x, epsilon=1e-7):
    """Compute Jacobian numerically"""
    n = len(x)
    f0 = func(x)
    m = len(f0)
    J = np.zeros((m, n))
    for j in range(n):
        x_plus = x.copy()
        x_minus = x.copy()
        x_plus[j] += epsilon
        x_minus[j] -= epsilon
        J[:, j] = (func(x_plus) - func(x_minus)) / (2 * epsilon)
    return J

# Test at x = [1.0, 2.0]
x_test = np.array([1.0, 2.0])

J_analytical = jacobian_h_chain_rule(x_test)
J_numerical = numerical_jacobian(h, x_test)

print("x =", x_test)
print("g(x) =", g(x_test))
print("h(x) = f(g(x)) =", h(x_test))
print()
print("Analytical Jacobian (chain rule):")
print(J_analytical)
print()
print("Numerical Jacobian:")
print(J_numerical)
print()
print("Max difference:", np.max(np.abs(J_analytical - J_numerical)))
```

Neural networks are naturally represented as computational graphs—directed acyclic graphs (DAGs) where nodes represent operations and edges represent data flow. The chain rule provides the mathematical machinery to compute gradients through these graphs.
Graph Structure:

Each node in the graph represents either a variable (inputs, weights, intermediate values) or an operation (multiply, add, activation), and each directed edge carries a value forward during evaluation. During differentiation, the same edges carry gradient information in the reverse direction.
Key Insight: Local-to-Global Gradient Computation
Each node in the graph performs a local computation: given its inputs, it produces outputs. Crucially, each node also knows its local Jacobian—how its outputs change with respect to its inputs. The chain rule tells us how to combine these local Jacobians to compute global gradients.
Consider a simple computational graph for a single neuron: $y = \sigma(w_1 x_1 + w_2 x_2)$, followed by a loss computation. To find $\frac{\partial L}{\partial w_1}$, we trace the path from $w_1$ to $L$ and multiply the local derivatives along the way.
Paths and the Sum-of-Products Rule:
When multiple paths exist from a variable to the output (which happens when variables are reused), we sum over all paths. This is captured mathematically by:
$$\frac{\partial L}{\partial x_i} = \sum_{j \in \text{children}(i)} \frac{\partial L}{\partial x_j} \cdot \frac{\partial x_j}{\partial x_i}$$
This recursive formula is the essence of backpropagation: compute gradients at later nodes first, then propagate backward, accumulating contributions from all paths.
When a variable feeds into multiple operations (fan-out), gradients from all downstream paths must be summed. This is the multivariate chain rule in action: when u depends on x through multiple intermediate variables, we sum the contributions. Forgetting to sum is a common source of bugs in manual gradient implementations.
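The summation over fan-out paths can be seen in a tiny example (an illustrative graph where the two branches $u = x^2$ and $v = 3x$ recombine):

```python
# Sketch of fan-out: x feeds two branches that recombine as L = u * v.
# Here L = x^2 * 3x = 3x^3, so dL/dx = 9x^2 = 36 at x = 2.
x = 2.0
u = x ** 2          # branch 1
v = 3 * x           # branch 2
L = u * v

# Backward pass
dL_du = v           # dL/du = v
dL_dv = u           # dL/dv = u
dL_dx = dL_du * 2 * x + dL_dv * 3   # SUM the contributions of both paths
print(dL_dx)  # 36.0
```

Dropping either term in the final line is exactly the "forgetting to sum" bug described above.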
```python
import numpy as np

class ComputationalGraph:
    """
    Demonstrates chain rule in a computational graph for:
        y = sigmoid(w1*x1 + w2*x2)
        L = (y - target)^2   (MSE loss)
    """

    def __init__(self):
        self.cache = {}  # Store intermediate values for backward pass

    def forward(self, x1, x2, w1, w2, target):
        """Forward pass: compute output and loss"""
        # Store inputs
        self.cache['x1'] = x1
        self.cache['x2'] = x2
        self.cache['w1'] = w1
        self.cache['w2'] = w2
        self.cache['target'] = target

        # Intermediate computations
        z1 = w1 * x1           # Multiply
        z2 = w2 * x2           # Multiply
        z = z1 + z2            # Add
        y = self.sigmoid(z)    # Activation
        L = (y - target) ** 2  # Loss

        # Cache for backward
        self.cache['z1'] = z1
        self.cache['z2'] = z2
        self.cache['z'] = z
        self.cache['y'] = y
        self.cache['L'] = L

        return L

    def backward(self):
        """
        Backward pass: compute gradients via chain rule

        The chain rule is applied recursively:
        dL/dw1 = dL/dy * dy/dz * dz/dz1 * dz1/dw1
        """
        # Retrieve cached values
        x1 = self.cache['x1']
        x2 = self.cache['x2']
        w1 = self.cache['w1']
        w2 = self.cache['w2']
        y = self.cache['y']
        target = self.cache['target']

        # Gradient of loss w.r.t. prediction
        # L = (y - target)^2  =>  dL/dy = 2(y - target)
        dL_dy = 2 * (y - target)

        # Gradient through sigmoid
        # y = sigmoid(z)  =>  dy/dz = sigmoid(z) * (1 - sigmoid(z)) = y * (1 - y)
        dy_dz = y * (1 - y)
        dL_dz = dL_dy * dy_dz  # Chain rule!

        # Gradient through addition
        # z = z1 + z2  =>  dz/dz1 = 1, dz/dz2 = 1
        dL_dz1 = dL_dz * 1
        dL_dz2 = dL_dz * 1

        # Gradient through multiplication
        # z1 = w1 * x1  =>  dz1/dw1 = x1, dz1/dx1 = w1
        dL_dw1 = dL_dz1 * x1  # Chain rule!
        dL_dx1 = dL_dz1 * w1

        # z2 = w2 * x2  =>  dz2/dw2 = x2, dz2/dx2 = w2
        dL_dw2 = dL_dz2 * x2
        dL_dx2 = dL_dz2 * w2

        return {
            'dL_dw1': dL_dw1,
            'dL_dw2': dL_dw2,
            'dL_dx1': dL_dx1,
            'dL_dx2': dL_dx2,
        }

    @staticmethod
    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

# Numerical gradient for verification
def numerical_gradient(graph, param_name, x1, x2, w1, w2, target, eps=1e-5):
    """Compute gradient numerically using finite differences"""
    params = {'x1': x1, 'x2': x2, 'w1': w1, 'w2': w2}

    params_plus = params.copy()
    params_plus[param_name] += eps
    L_plus = graph.forward(**params_plus, target=target)

    params_minus = params.copy()
    params_minus[param_name] -= eps
    L_minus = graph.forward(**params_minus, target=target)

    return (L_plus - L_minus) / (2 * eps)

# Test
graph = ComputationalGraph()

x1, x2 = 0.5, 0.8
w1, w2 = 0.3, -0.2
target = 0.7

# Forward and backward
loss = graph.forward(x1, x2, w1, w2, target)
grads = graph.backward()

print(f"Input: x1={x1}, x2={x2}")
print(f"Weights: w1={w1}, w2={w2}")
print(f"Target: {target}")
print(f"Loss: {loss:.6f}")
print()
print("Analytical gradients (chain rule):")
for name, grad in grads.items():
    print(f"  {name}: {grad:.6f}")

print()
print("Numerical gradients (verification):")
for param in ['w1', 'w2', 'x1', 'x2']:
    num_grad = numerical_gradient(graph, param, x1, x2, w1, w2, target)
    print(f"  dL_d{param}: {num_grad:.6f}")
```

The chain rule can be applied in two directions: forward mode (from inputs to outputs) or reverse mode (from outputs to inputs). The choice dramatically affects computational efficiency, and understanding this is crucial for grasping why backpropagation uses reverse mode.
Forward Mode Differentiation:
In forward mode, we propagate derivatives alongside values from inputs toward outputs. For a chain of functions $y = f_n(f_{n-1}(\cdots f_1(x)))$, we compute:
$$\frac{dy}{dx} = \frac{df_n}{df_{n-1}} \cdot \left( \frac{df_{n-1}}{df_{n-2}} \cdot \left( \cdots \left( \frac{df_2}{df_1} \cdot \frac{df_1}{dx} \right) \right) \right)$$
Note the right-to-left evaluation (parentheses grouping from the right). Each step produces the sensitivity of all intermediate values with respect to one input variable.
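Forward mode is often implemented with dual numbers, which carry a value and a derivative through every operation. The minimal sketch below (the `Dual` class and the operations supported are my own illustrative choices) differentiates the earlier example $h(x) = \sin(x^2)$ in one forward sweep:

```python
import numpy as np

# A minimal dual-number sketch of forward-mode differentiation:
# each quantity carries (value, derivative w.r.t. the chosen input).
class Dual:
    def __init__(self, val, dot):
        self.val = val   # value
        self.dot = dot   # derivative

    def __mul__(self, other):
        # Product rule propagated forward
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def sin(d):
    # sin's local derivative is cos, applied to the carried value
    return Dual(np.sin(d.val), np.cos(d.val) * d.dot)

# Differentiate h(x) = sin(x^2) at x = 2 by seeding dx/dx = 1
x = Dual(2.0, 1.0)
h = sin(x * x)
print(h.val, h.dot)   # sin(4) ≈ -0.7568, 4·cos(4) ≈ -2.6146
```

Note how the result matches the chain-rule computation from the univariate section: one seeded input yields the derivative of every downstream value with respect to that input.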
Reverse Mode Differentiation (Backpropagation):
In reverse mode, we first complete the forward pass, then propagate gradients backward from outputs to inputs:
$$\frac{dy}{dx} = \left( \left( \left( \frac{df_n}{df_{n-1}} \cdot \frac{df_{n-1}}{df_{n-2}} \right) \cdots \right) \cdot \frac{df_2}{df_1} \right) \cdot \frac{df_1}{dx}$$
Note the left-to-right evaluation. Each step produces the sensitivity of the final output with respect to all intermediate values at once.
Why This Matters:
For neural networks with millions of parameters but a single scalar loss:

- Forward mode would need one pass per parameter: millions of passes to assemble the full gradient.
- Reverse mode computes the gradient with respect to every parameter in a single backward pass, at a cost comparable to one forward pass.

This asymmetry makes reverse mode dramatically more efficient for deep learning: its speedup over forward mode grows linearly with the number of parameters.
| Aspect | Forward Mode | Reverse Mode (Backprop) |
|---|---|---|
| Direction | Input → Output | Output → Input |
| Computes per pass | ∂output/∂(one input) | ∂output/∂(all inputs) |
| Cost for n inputs, m outputs | O(n × forward_cost) | O(m × backward_cost) |
| Ideal when | Few inputs, many outputs | Many inputs, few outputs |
| Deep learning scenario | Inefficient (millions of weights) | Efficient (scalar loss) |
| Memory required | Low (no cache needed) | Higher (cache activations) |
| Also known as | Tangent mode, JVP | Adjoint mode, VJP |
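The JVP/VJP row of the table can be made concrete with an explicit Jacobian (the matrix and direction vectors below are illustrative):

```python
import numpy as np

# JVP vs VJP for a function f: R^3 -> R^2 with Jacobian J (2x3)
J = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

v_in = np.array([1.0, 0.0, 0.0])   # a direction in input space
v_out = np.array([1.0, -1.0])      # a direction in output space

jvp = J @ v_in     # forward mode: how the outputs move along v_in -> shape (2,)
vjp = v_out @ J    # reverse mode: gradient of v_out · f(x)      -> shape (3,)

print(jvp)   # [1. 4.]   (a column of J)
print(vjp)   # [-3. -3. -3.]
```

A JVP extracts one column's worth of sensitivity per pass, while a VJP extracts one row's worth; with a scalar loss there is only one row, so a single VJP gives the whole gradient.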
A neural network might have 100 million parameters but produces a single loss value. Forward mode would require 100 million passes; reverse mode requires just one. This is why backpropagation revolutionized neural network training—it's not just correct, it's computationally optimal.
```python
import numpy as np
import time

# Comparison of forward mode vs reverse mode autodiff

def simple_network_forward(x, weights):
    """
    Simple network: out = W3 @ relu(W2 @ relu(W1 @ x))
    Returns output and intermediate activations
    """
    a0 = x
    z1 = weights['W1'] @ a0
    a1 = np.maximum(0, z1)  # ReLU
    z2 = weights['W2'] @ a1
    a2 = np.maximum(0, z2)  # ReLU
    z3 = weights['W3'] @ a2
    out = z3
    cache = {'a0': a0, 'z1': z1, 'a1': a1, 'z2': z2, 'a2': a2, 'z3': z3}
    return out, cache

def forward_mode_gradient(x, weights, input_idx):
    """
    Forward-mode stand-in (finite differences): differentiate w.r.t. one
    input element. Like true forward mode, it requires one pass per
    input dimension.
    """
    eps = 1e-5
    x_plus = x.copy()
    x_plus[input_idx] += eps
    out_plus, _ = simple_network_forward(x_plus, weights)

    x_minus = x.copy()
    x_minus[input_idx] -= eps
    out_minus, _ = simple_network_forward(x_minus, weights)

    return (out_plus - out_minus) / (2 * eps)

def reverse_mode_gradient(x, weights, cache, upstream_grad=1.0):
    """
    Reverse mode (backprop): get gradients w.r.t. ALL inputs in one pass
    """
    # Upstream gradient (assuming scalar output or already aggregated)
    dout = np.atleast_1d(upstream_grad)

    # Backprop through W3 multiplication: z3 = W3 @ a2
    da2 = weights['W3'].T @ dout

    # Backprop through ReLU: a2 = relu(z2)
    dz2 = da2 * (cache['z2'] > 0)

    # Backprop through W2 multiplication: z2 = W2 @ a1
    da1 = weights['W2'].T @ dz2

    # Backprop through ReLU: a1 = relu(z1)
    dz1 = da1 * (cache['z1'] > 0)

    # Backprop through W1 multiplication: z1 = W1 @ x
    dx = weights['W1'].T @ dz1

    return dx

# Performance comparison
np.random.seed(42)

# Network dimensions
input_dim = 1000
hidden_dim = 500
output_dim = 1

# Random weights and input
weights = {
    'W1': np.random.randn(hidden_dim, input_dim) * 0.01,
    'W2': np.random.randn(hidden_dim, hidden_dim) * 0.01,
    'W3': np.random.randn(output_dim, hidden_dim) * 0.01,
}
x = np.random.randn(input_dim)

# Forward pass
out, cache = simple_network_forward(x, weights)

print(f"Network: {input_dim} → {hidden_dim} → {hidden_dim} → {output_dim}")
print(f"Total parameters: {sum(w.size for w in weights.values()):,}")
print(f"Input dimensions: {input_dim}")
print()

# Reverse mode: one pass gets ALL gradients
start = time.time()
dx_reverse = reverse_mode_gradient(x, weights, cache)
reverse_time = time.time() - start
print(f"Reverse mode (one pass): {reverse_time*1000:.3f} ms")
print(f"  → Gets gradient for all {input_dim} input dimensions")

# Forward mode: need one pass PER input dimension
# (We'll just time a few to estimate)
n_samples = 5
start = time.time()
for i in range(n_samples):
    _ = forward_mode_gradient(x, weights, i)
forward_time_per_input = (time.time() - start) / n_samples
total_forward_time_estimate = forward_time_per_input * input_dim

print(f"Forward mode (per input): {forward_time_per_input*1000:.3f} ms")
print(f"  → Would need {input_dim} passes = {total_forward_time_estimate*1000:.1f} ms total")
print()
print(f"Speedup from reverse mode: {total_forward_time_estimate/reverse_time:.0f}x")

# Verify correctness (sample a few dimensions)
print()
print("Verification (first 5 dimensions):")
for i in range(5):
    forward_grad = forward_mode_gradient(x, weights, i)[0]
    reverse_grad = dx_reverse[i]
    print(f"  dim {i}: forward={forward_grad:.6f}, reverse={reverse_grad:.6f}")
```

To solidify our understanding, let's derive the backward pass for fundamental neural network operations using the chain rule. These derivations form the building blocks of all neural network training.
Setup and Notation:
We use the convention where $\frac{\partial L}{\partial v}$ (often written as dv or grad_v in code) means the gradient of the final scalar loss $L$ with respect to variable $v$. The backward pass receives the "upstream gradient" $\frac{\partial L}{\partial \text{output}}$ and must compute $\frac{\partial L}{\partial \text{inputs}}$.
The Affine (Fully-Connected) Layer:

Forward: $Y = XW + b$ where $X$ is $(N \times D)$, $W$ is $(D \times M)$, $b$ is $(1 \times M)$, $Y$ is $(N \times M)$.
Backward: Given $\frac{\partial L}{\partial Y}$ (shape $N \times M$), compute:
Gradient w.r.t. bias $b$: $$\frac{\partial L}{\partial b} = \sum_{n=1}^{N} \frac{\partial L}{\partial Y_n} = \text{sum over batch}$$
Gradient w.r.t. weights $W$: $$\frac{\partial L}{\partial W} = X^T \frac{\partial L}{\partial Y}$$
Gradient w.r.t. input $X$: $$\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} W^T$$
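These three formulas can be verified numerically. The sketch below uses small illustrative shapes and the trick of taking $L = \sum (Y \odot G)$ for a fixed matrix $G$, so that the upstream gradient is exactly $\frac{\partial L}{\partial Y} = G$:

```python
import numpy as np

# Self-contained numerical check of the affine-layer gradient formulas
np.random.seed(0)
N, D, M = 4, 3, 2                # illustrative sizes
X = np.random.randn(N, D)
W = np.random.randn(D, M)
b = np.random.randn(M)
dY = np.random.randn(N, M)       # stand-in upstream gradient

# Analytical gradients from the formulas above
dW = X.T @ dY
db = dY.sum(axis=0)
dX = dY @ W.T

# Numerical gradient of L = sum(Y * dY) w.r.t. W
# (the same check works for X and b)
def loss(Wp):
    return np.sum((X @ Wp + b) * dY)

eps = 1e-6
dW_num = np.zeros_like(W)
for i in range(D):
    for j in range(M):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        dW_num[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)

print(np.max(np.abs(dW - dW_num)))  # agreement to roundoff
```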
```python
import numpy as np

def affine_backward(dY, cache):
    """
    Backward pass for affine layer: Y = X @ W + b

    Args:
        dY: Upstream gradient, shape (N, M)
        cache: (X, W, b) from forward pass

    Returns:
        dX: shape (N, D)
        dW: shape (D, M)
        db: shape (M,)
    """
    X, W, b = cache

    # Gradient of bias: sum over batch dimension
    db = np.sum(dY, axis=0)  # (M,)

    # Gradient of weights: dW = X.T @ dY
    dW = X.T @ dY            # (D, N) @ (N, M) = (D, M)

    # Gradient of input: dX = dY @ W.T
    dX = dY @ W.T            # (N, M) @ (M, D) = (N, D)

    return dX, dW, db
```

The multiplicative nature of the chain rule is a double-edged sword. While it enables efficient gradient computation, it also exposes deep networks to severe numerical stability issues.
The Problem:
For a deep network with $L$ layers, gradients at early layers involve products:
$$\frac{\partial \text{Loss}}{\partial W_1} = \frac{\partial \text{Loss}}{\partial a_L} \cdot \frac{\partial a_L}{\partial a_{L-1}} \cdots \frac{\partial a_2}{\partial a_1} \cdot \frac{\partial a_1}{\partial W_1}$$
If each factor $\frac{\partial a_i}{\partial a_{i-1}}$ has magnitude $< 1$, the product vanishes exponentially. If each factor has magnitude $> 1$, the product explodes exponentially.
Techniques like ReLU activations (derivative = 1 for positive inputs), residual connections (gradients can bypass layers), careful initialization (Xavier/He), and normalization layers all work by keeping the gradient scale factors close to 1, preventing exponential decay or growth through deep chains.
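The effect of residual connections can be caricatured with scalars (a sketch under the assumption of a constant small per-layer factor; real networks have matrix-valued Jacobians of the form $I + J_f$):

```python
# Scalar caricature of residual connections: a plain chain multiplies small
# local factors, while a residual path contributes (1 + factor) per layer.
n_layers = 50
local_factor = 0.05   # assumed small per-layer gradient factor

plain = local_factor ** n_layers            # product of small factors
residual = (1 + local_factor) ** n_layers   # product of (1 + factor)

print(f"plain chain:    {plain:.3e}")     # collapses toward zero
print(f"residual chain: {residual:.3e}")  # stays within ~1 order of magnitude of 1
```

Adding the identity path keeps each per-layer factor near 1, which is exactly the design principle stated above.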
```python
import numpy as np

def analyze_gradient_flow(n_layers, activation='sigmoid'):
    """
    Analyze how gradients scale through deep networks due to
    the multiplicative nature of the chain rule.
    """
    def sigmoid(x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

    def sigmoid_grad(x):
        s = sigmoid(x)
        return s * (1 - s)  # Max value: 0.25

    def relu_grad(x):
        return (x > 0).astype(float)  # Either 0 or 1

    def tanh_grad(x):
        return 1 - np.tanh(x)**2  # Max value: 1.0, but often << 1

    # Choose activation
    if activation == 'sigmoid':
        grad_fn = sigmoid_grad
    elif activation == 'relu':
        grad_fn = relu_grad
    else:
        grad_fn = tanh_grad

    # Simulate gradient flow
    gradient_magnitudes = []
    gradient = 1.0  # Start with upstream gradient of 1

    for layer in range(n_layers):
        # Random pre-activation (typical after random init)
        z = np.random.randn()
        # Local gradient from activation
        local_grad = grad_fn(z)
        # Chain rule: multiply
        gradient *= local_grad
        gradient_magnitudes.append(abs(gradient))

    return gradient_magnitudes

# Analyze different activations
np.random.seed(42)
n_layers = 50

activations = ['sigmoid', 'tanh', 'relu']
results = {}

for act in activations:
    # Average over many trials
    all_mags = []
    for trial in range(100):
        mags = analyze_gradient_flow(n_layers, act)
        all_mags.append(mags)
    results[act] = np.mean(all_mags, axis=0)

# Print analysis
print("Gradient magnitude after N layers (averaged over 100 trials):")
print(f"{'Layers':<10}", end="")
for act in activations:
    print(f"{act:<15}", end="")
print()

for n in [10, 20, 30, 40, 50]:
    print(f"{n:<10}", end="")
    for act in activations:
        mag = results[act][n-1]
        print(f"{mag:.2e}     ", end="")
    print()

print()
print("Analysis:")
print("- Sigmoid: Gradients vanish rapidly (factor <= 0.25 per layer)")
print("- Tanh: Gradients vanish, but slower than sigmoid")
print("- ReLU: Derivative is exactly 1 for active units, so surviving gradients are not shrunk")
print()
print("This is why modern deep networks use ReLU-family activations!")
```

We have built a comprehensive understanding of how the chain rule—a fundamental calculus concept—becomes the mathematical engine of backpropagation. Let's consolidate the key insights:

- The derivative of a composition is the product of local derivatives; in the multivariate case, this becomes a product of Jacobian matrices.
- For a scalar loss, gradients propagate backward via the transpose-Jacobian rule $\nabla_\mathbf{x} L = \mathbf{J}_f^T \cdot \nabla_{\mathbf{f}} L$.
- When a variable fans out along multiple paths, its gradient is the sum of the contributions from every path.
- Reverse-mode differentiation computes the gradient with respect to all parameters in a single backward pass, which is why backpropagation is efficient for networks with many parameters and a scalar loss.
- The same multiplicative structure causes vanishing and exploding gradients; modern architectures are designed to keep the per-layer factors close to 1.
Looking Ahead:
With the mathematical foundation of the chain rule established, we're ready to explore how gradients actually flow through network architectures in the next section. We'll visualize gradient propagation, understand bottlenecks, and see how architectural choices affect gradient dynamics.
The chain rule is the "what" of backpropagation; gradient flow is the "how": the way those gradients actually behave as they move through a network in practice.
You now have a rigorous understanding of the chain rule as applied to neural network training. This mathematical foundation is essential—every optimization step, every gradient computation, every architectural innovation builds upon the principles covered here. Next, we'll see how these gradients flow through actual network structures.