If there's one calculus concept that makes deep learning possible, it's the chain rule. Every time a neural network learns—every weight update, every backpropagation pass—the chain rule is at work, propagating error signals backward through layers of composed functions.
Without the chain rule, we couldn't train networks with more than one layer. We couldn't build the transformers, CNNs, or any deep architecture that powers modern AI. Understanding the chain rule deeply is understanding how neural networks learn.
By the end of this page, you'll understand the chain rule in single and multiple dimensions, see its direct connection to backpropagation, and be able to trace gradient flow through arbitrarily complex computational graphs.
Formal Statement:
If $y = f(u)$ and $u = g(x)$, then the derivative of the composite function $y = f(g(x))$ is:
$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} = f'(g(x)) \cdot g'(x)$$
Intuition: The rate of change of $y$ with respect to $x$ equals the rate of change of $y$ with respect to $u$, multiplied by the rate of change of $u$ with respect to $x$. Rates of change multiply through the chain of dependencies.
Example: Let $y = \sin(x^2)$. With $u = x^2$ and $y = \sin(u)$: $\frac{dy}{du} = \cos(u)$ and $\frac{du}{dx} = 2x$, so $\frac{dy}{dx} = \cos(x^2) \cdot 2x = 2x\cos(x^2)$.
When applying the chain rule: (1) Identify the outermost function, (2) Differentiate it treating the inside as a single variable, (3) Multiply by the derivative of the inside. For nested compositions, repeat from outside to inside.
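As a quick symbolic cross-check of this procedure (a minimal sketch assuming `sympy` is available; it is not part of this page's examples), you can let a computer algebra system apply the chain rule and compare the result against the hand-derived derivative, before the numerical checks below:

```python
import sympy as sp

x = sp.symbols('x')

# Let sympy apply the chain rule to y = sin(x²)
y = sp.sin(x**2)
dy_dx = sp.diff(y, x)
print(dy_dx)                               # 2*x*cos(x**2)

# Compare with the hand-derived result 2x·cos(x²)
hand_derived = 2*x*sp.cos(x**2)
print(sp.simplify(dy_dx - hand_derived))   # 0 → the expressions agree
```

The code below performs the same comparison numerically using finite differences.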
```python
import numpy as np

def numerical_derivative(f, x, h=1e-7):
    return (f(x + h) - f(x - h)) / (2 * h)

# Example 1: y = sin(x²)
f1 = lambda x: np.sin(x**2)
f1_analytical = lambda x: 2*x*np.cos(x**2)

x = 1.5
print(f"y = sin(x²) at x={x}")
print(f"  Numerical:  {numerical_derivative(f1, x):.8f}")
print(f"  Analytical: {f1_analytical(x):.8f}")

# Example 2: y = exp(sin(x)) - composition of exp and sin
f2 = lambda x: np.exp(np.sin(x))
# dy/dx = exp(sin(x)) · cos(x)
f2_analytical = lambda x: np.exp(np.sin(x)) * np.cos(x)

print(f"\ny = exp(sin(x)) at x={x}")
print(f"  Numerical:  {numerical_derivative(f2, x):.8f}")
print(f"  Analytical: {f2_analytical(x):.8f}")

# Example 3: Sigmoid of linear combination (ML-relevant)
# y = σ(wx + b) where σ(z) = 1/(1+e^(-z))
w, b = 2.0, -1.0
sigmoid = lambda z: 1 / (1 + np.exp(-z))
f3 = lambda x: sigmoid(w*x + b)
# dy/dx = σ(wx+b)(1-σ(wx+b)) · w
f3_analytical = lambda x: sigmoid(w*x + b) * (1 - sigmoid(w*x + b)) * w

print(f"\ny = σ(2x - 1) at x={x}")
print(f"  Numerical:  {numerical_derivative(f3, x):.8f}")
print(f"  Analytical: {f3_analytical(x):.8f}")
```

In machine learning, functions depend on many variables through complex pathways. The multivariable chain rule handles this:
Case 1: One input variable, multiple intermediates
If $z = f(x, y)$ where $x = g(t)$ and $y = h(t)$: $$\frac{dz}{dt} = \frac{\partial z}{\partial x}\frac{dx}{dt} + \frac{\partial z}{\partial y}\frac{dy}{dt}$$
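As a small sketch of Case 1 (the specific functions $z = x^2 y$, $x = \cos t$, $y = \sin t$ are illustrative choices, not taken from this page), the two-term sum for $\frac{dz}{dt}$ matches a finite-difference check:

```python
import numpy as np

# Assumed example: z = x²·y with x = cos(t), y = sin(t)
def z_of_t(t):
    x, y = np.cos(t), np.sin(t)
    return x**2 * y

def dz_dt_chain(t):
    x, y = np.cos(t), np.sin(t)
    dz_dx, dz_dy = 2*x*y, x**2             # partials of z w.r.t. the intermediates
    dx_dt, dy_dt = -np.sin(t), np.cos(t)   # derivatives of the intermediates w.r.t. t
    return dz_dx*dx_dt + dz_dy*dy_dt       # sum of contributions from both intermediates

t, h = 0.7, 1e-7
print(f"Chain rule: {dz_dt_chain(t):.8f}")
print(f"Numerical:  {(z_of_t(t + h) - z_of_t(t - h)) / (2*h):.8f}")
```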
Case 2: Multiple input variables, multiple intermediates
If $z = f(u, v)$ where $u = g(x, y)$ and $v = h(x, y)$: $$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial u}\frac{\partial u}{\partial x} + \frac{\partial z}{\partial v}\frac{\partial v}{\partial x}$$
Key Insight: Sum over all paths through which the variable can influence the output. Each path contributes its product of partial derivatives.
| Structure | Formula | Description |
|---|---|---|
| $z(u(x))$ | $\frac{dz}{dx} = \frac{dz}{du}\frac{du}{dx}$ | Simple composition |
| $z(u(x), v(x))$ | $\frac{dz}{dx} = \frac{\partial z}{\partial u}\frac{du}{dx} + \frac{\partial z}{\partial v}\frac{dv}{dx}$ | Two paths from x to z |
| $z(u(x,y))$ | $\frac{\partial z}{\partial x} = \frac{dz}{du}\frac{\partial u}{\partial x}$ | Intermediate depends on x,y |
| General | $\frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial u_j}\frac{\partial u_j}{\partial x_i}$ | Sum over all intermediate paths |
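To see the general row as matrix arithmetic (a sketch; the two-intermediate map $u_1 = x_1 x_2$, $u_2 = x_1 + x_2$, $z = u_1^2 + \sin(u_2)$ is an assumed example, not from this page), the sum over paths becomes a Jacobian-transpose-times-gradient product, which is also how backpropagation organizes the computation:

```python
import numpy as np

# Assumed example: inputs x ∈ R², intermediates u ∈ R², scalar output z
# u1 = x1·x2, u2 = x1 + x2, z = u1² + sin(u2)
def forward(x):
    u = np.array([x[0] * x[1], x[0] + x[1]])
    z = u[0]**2 + np.sin(u[1])
    return u, z

def grad_via_chain_rule(x):
    u, _ = forward(x)
    dz_du = np.array([2 * u[0], np.cos(u[1])])   # ∂z/∂u_j
    du_dx = np.array([[x[1], x[0]],              # Jacobian: row j holds ∂u_j/∂x_i
                      [1.0,  1.0]])
    return du_dx.T @ dz_du                       # ∂z/∂x_i = Σ_j ∂z/∂u_j · ∂u_j/∂x_i

x = np.array([1.0, 2.0])
print("Chain rule:", grad_via_chain_rule(x))

# Finite-difference check, one input at a time
h = 1e-7
for i in range(2):
    step = np.zeros(2); step[i] = h
    numerical = (forward(x + step)[1] - forward(x - step)[1]) / (2 * h)
    print(f"Numerical ∂z/∂x{i+1}: {numerical:.8f}")
```

The worked example below then applies the same sum-over-paths rule by hand to $z = (x+y)^2 \sin(xy)$.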
```python
import numpy as np

# Example: z = (x + y)² · sin(xy)
# Let u = x + y, v = xy
# Then z = u² · sin(v)

def z_direct(x, y):
    return (x + y)**2 * np.sin(x * y)

def dz_dx_analytical(x, y):
    """
    z = u² · sin(v) where u = x+y, v = xy

    ∂z/∂x = ∂z/∂u · ∂u/∂x + ∂z/∂v · ∂v/∂x

    ∂z/∂u = 2u · sin(v) = 2(x+y) · sin(xy)
    ∂u/∂x = 1
    ∂z/∂v = u² · cos(v) = (x+y)² · cos(xy)
    ∂v/∂x = y

    ∂z/∂x = 2(x+y)·sin(xy) + (x+y)²·cos(xy)·y
    """
    u = x + y
    v = x * y
    return 2*u*np.sin(v) + u**2 * np.cos(v) * y

# Numerical verification
def numerical_partial(f, x, y, var_idx, h=1e-7):
    if var_idx == 0:
        return (f(x+h, y) - f(x-h, y)) / (2*h)
    else:
        return (f(x, y+h) - f(x, y-h)) / (2*h)

x, y = 1.0, 2.0
print(f"z = (x+y)² · sin(xy) at ({x}, {y})")
print(f"\n∂z/∂x:")
print(f"  Numerical:  {numerical_partial(z_direct, x, y, 0):.8f}")
print(f"  Analytical: {dz_dx_analytical(x, y):.8f}")
```

The Deep Connection:
Backpropagation is simply the chain rule applied systematically to computational graphs. Consider a neural network layer:
$$z = Wx + b \quad \text{(linear transformation)}$$ $$a = \sigma(z) \quad \text{(activation)}$$ $$L = \text{Loss}(a, y) \quad \text{(loss computation)}$$
To update weight $W$, we need $\frac{\partial L}{\partial W}$:
$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial W}$$
This is the chain rule in action:
We propagate gradients backward because the chain rule multiplies its factors from the outside in: starting from the loss (the outermost function), we compute each factor and pass it back to the previous layer. Each layer receives the gradient from above and multiplies it by its own local derivative.
```python
import numpy as np

class SimpleNeuron:
    """Single neuron demonstrating backprop = chain rule."""

    def __init__(self):
        self.w = np.random.randn()
        self.b = np.random.randn()

    def forward(self, x):
        """Forward pass: store intermediates for backprop."""
        self.x = x
        self.z = self.w * x + self.b          # Linear
        self.a = 1 / (1 + np.exp(-self.z))    # Sigmoid
        return self.a

    def backward(self, dL_da):
        """
        Backward pass: Chain rule in action.
        Given ∂L/∂a, compute ∂L/∂w and ∂L/∂b.

        Chain: ∂L/∂w = ∂L/∂a · ∂a/∂z · ∂z/∂w
        """
        # ∂a/∂z = a(1-a) for sigmoid
        da_dz = self.a * (1 - self.a)

        # ∂z/∂w = x, ∂z/∂b = 1
        dz_dw = self.x
        dz_db = 1

        # Chain rule
        dL_dz = dL_da * da_dz
        dL_dw = dL_dz * dz_dw
        dL_db = dL_dz * dz_db

        return dL_dw, dL_db


# Test
neuron = SimpleNeuron()
x, y_true = 2.0, 1.0

# Forward
a = neuron.forward(x)
L = (y_true - a)**2  # MSE loss

# Backward
dL_da = -2 * (y_true - a)  # ∂L/∂a = -2(y - a)
dL_dw, dL_db = neuron.backward(dL_da)

print(f"Forward:  x={x}, a={a:.4f}, L={L:.4f}")
print(f"Backward: ∂L/∂w={dL_dw:.4f}, ∂L/∂b={dL_db:.4f}")

# Verify numerically
h = 1e-7
neuron.w += h
L_plus = (y_true - neuron.forward(x))**2
neuron.w -= 2*h
L_minus = (y_true - neuron.forward(x))**2
neuron.w += h
numerical_grad = (L_plus - L_minus) / (2*h)

print(f"\nVerification: Numerical ∂L/∂w = {numerical_grad:.4f}")
```

The chain rule reveals why deep networks can be hard to train. Consider $n$ layers, each with activation derivative $\sigma'$:
$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial a_n} \cdot \left(\prod_{i=2}^{n} \sigma'(z_i)\, W_i\right) \cdot \sigma'(z_1) \cdot \frac{\partial z_1}{\partial W_1}$$
The product of many terms can cause:
Vanishing Gradients: If $|\sigma'| < 1$ consistently (e.g., sigmoid where $\sigma' \leq 0.25$), the product shrinks exponentially. For 10 layers: $0.25^{10} \approx 10^{-6}$. Early layers receive virtually zero gradient.
Exploding Gradients: If $|\sigma' \cdot W| > 1$ consistently, gradients grow exponentially. Training becomes unstable.
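A quick numerical illustration of both effects (a sketch; the depth of 10 layers and the representative per-layer magnitudes are assumptions chosen purely for demonstration):

```python
# Treat σ'(z_i) and W_i as fixed representative magnitudes to isolate what
# happens when many per-layer factors are multiplied together.
def gradient_magnitude(depth, local_deriv, weight):
    return (local_deriv * weight) ** depth

depth = 10
# Vanishing: sigmoid's derivative is at most 0.25, so with |W| = 1 each factor ≤ 0.25
print("Vanishing:", gradient_magnitude(depth, local_deriv=0.25, weight=1.0))   # ≈ 9.5e-07
# Exploding: if |σ'(z)·W| > 1 in every layer (e.g. 0.25 · 5 = 1.25), growth is exponential
print("Exploding:", gradient_magnitude(depth, local_deriv=0.25, weight=5.0))   # ≈ 9.31
```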
You now understand the mathematical engine behind all neural network training. Next, we'll explore the gradient vector, which collects all the partial derivatives into a single vector that points in the direction of steepest ascent.