If there's one calculus concept that makes deep learning possible, it's the chain rule. Every time a neural network learns—every weight update, every backpropagation pass—the chain rule is at work, propagating error signals backward through layers of composed functions.
Without the chain rule, we couldn't train networks with more than one layer. We couldn't build the transformers, CNNs, or any deep architecture that powers modern AI. Understanding the chain rule deeply is understanding how neural networks learn.
By the end of this page, you'll understand the chain rule in single and multiple dimensions, see its direct connection to backpropagation, and be able to trace gradient flow through arbitrarily complex computational graphs.
Formal Statement:
If $y = f(u)$ and $u = g(x)$, then the derivative of the composite function $y = f(g(x))$ is:
$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} = f'(g(x)) \cdot g'(x)$$
Intuition: The rate of change of $y$ with respect to $x$ equals the rate of change of $y$ with respect to $u$, multiplied by the rate of change of $u$ with respect to $x$. Rates of change multiply through the chain of dependencies.
Example: Let $y = \sin(x^2)$. With $u = x^2$ and $y = \sin(u)$: $\frac{dy}{du} = \cos(u)$ and $\frac{du}{dx} = 2x$, so $\frac{dy}{dx} = \cos(x^2) \cdot 2x = 2x\cos(x^2)$.
When applying the chain rule: (1) Identify the outermost function, (2) Differentiate it treating the inside as a single variable, (3) Multiply by the derivative of the inside. For nested compositions, repeat from outside to inside.
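As a quick symbolic cross-check of this procedure (a minimal sketch assuming `sympy` is available; it is not part of this page's examples), you can let a computer algebra system apply the chain rule and compare the result against the hand-derived derivative, before the numerical checks below:

```python
import sympy as sp

x = sp.symbols('x')

# Let sympy apply the chain rule to y = sin(x²)
y = sp.sin(x**2)
dy_dx = sp.diff(y, x)
print(dy_dx)                               # 2*x*cos(x**2)

# Compare with the hand-derived result 2x·cos(x²)
hand_derived = 2*x*sp.cos(x**2)
print(sp.simplify(dy_dx - hand_derived))   # 0 → the expressions agree
```

The code below performs the same comparison numerically using finite differences.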
```python
import numpy as np

def numerical_derivative(f, x, h=1e-7):
    return (f(x + h) - f(x - h)) / (2 * h)

# Example 1: y = sin(x²)
f1 = lambda x: np.sin(x**2)
f1_analytical = lambda x: 2*x*np.cos(x**2)

x = 1.5
print(f"y = sin(x²) at x={x}")
print(f"  Numerical:  {numerical_derivative(f1, x):.8f}")
print(f"  Analytical: {f1_analytical(x):.8f}")

# Example 2: y = exp(sin(x)) - composition of exp and sin
f2 = lambda x: np.exp(np.sin(x))
# dy/dx = exp(sin(x)) · cos(x)
f2_analytical = lambda x: np.exp(np.sin(x)) * np.cos(x)

print(f"\ny = exp(sin(x)) at x={x}")
print(f"  Numerical:  {numerical_derivative(f2, x):.8f}")
print(f"  Analytical: {f2_analytical(x):.8f}")

# Example 3: Sigmoid of linear combination (ML-relevant)
# y = σ(wx + b) where σ(z) = 1/(1+e^(-z))
w, b = 2.0, -1.0
sigmoid = lambda z: 1 / (1 + np.exp(-z))
f3 = lambda x: sigmoid(w*x + b)
# dy/dx = σ(wx+b)(1-σ(wx+b)) · w
f3_analytical = lambda x: sigmoid(w*x + b) * (1 - sigmoid(w*x + b)) * w

print(f"\ny = σ(2x - 1) at x={x}")
print(f"  Numerical:  {numerical_derivative(f3, x):.8f}")
print(f"  Analytical: {f3_analytical(x):.8f}")
```

In machine learning, functions depend on many variables through complex pathways. The multivariable chain rule handles this:
Case 1: One input variable, multiple intermediates
If $z = f(x, y)$ where $x = g(t)$ and $y = h(t)$: $$\frac{dz}{dt} = \frac{\partial z}{\partial x}\frac{dx}{dt} + \frac{\partial z}{\partial y}\frac{dy}{dt}$$
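As a small sketch of Case 1 (the specific functions $z = x^2 y$, $x = \cos t$, $y = \sin t$ are illustrative choices, not taken from this page), the two-term sum for $\frac{dz}{dt}$ matches a finite-difference check:

```python
import numpy as np

# Assumed example: z = x²·y with x = cos(t), y = sin(t)
def z_of_t(t):
    x, y = np.cos(t), np.sin(t)
    return x**2 * y

def dz_dt_chain(t):
    x, y = np.cos(t), np.sin(t)
    dz_dx, dz_dy = 2*x*y, x**2             # partials of z w.r.t. the intermediates
    dx_dt, dy_dt = -np.sin(t), np.cos(t)   # derivatives of the intermediates w.r.t. t
    return dz_dx*dx_dt + dz_dy*dy_dt       # sum of contributions from both intermediates

t, h = 0.7, 1e-7
print(f"Chain rule: {dz_dt_chain(t):.8f}")
print(f"Numerical:  {(z_of_t(t + h) - z_of_t(t - h)) / (2*h):.8f}")
```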
Case 2: Multiple input variables, multiple intermediates
If $z = f(u, v)$ where $u = g(x, y)$ and $v = h(x, y)$: $$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial u}\frac{\partial u}{\partial x} + \frac{\partial z}{\partial v}\frac{\partial v}{\partial x}$$
Key Insight: Sum over all paths through which the variable can influence the output. Each path contributes its product of partial derivatives.
| Structure | Formula | Description |
|---|---|---|
| $z(u(x))$ | $\frac{dz}{dx} = \frac{dz}{du}\frac{du}{dx}$ | Simple composition |
| $z(u(x), v(x))$ | $\frac{dz}{dx} = \frac{\partial z}{\partial u}\frac{du}{dx} + \frac{\partial z}{\partial v}\frac{dv}{dx}$ | Two paths from x to z |
| $z(u(x,y))$ | $\frac{\partial z}{\partial x} = \frac{dz}{du}\frac{\partial u}{\partial x}$ | Intermediate depends on x,y |
| General | $\frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial u_j}\frac{\partial u_j}{\partial x_i}$ | Sum over all intermediate paths |
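To see the general row as matrix arithmetic (a sketch; the two-intermediate map $u_1 = x_1 x_2$, $u_2 = x_1 + x_2$, $z = u_1^2 + \sin(u_2)$ is an assumed example, not from this page), the sum over paths becomes a Jacobian-transpose-times-gradient product, which is also how backpropagation organizes the computation:

```python
import numpy as np

# Assumed example: inputs x ∈ R², intermediates u ∈ R², scalar output z
# u1 = x1·x2, u2 = x1 + x2, z = u1² + sin(u2)
def forward(x):
    u = np.array([x[0] * x[1], x[0] + x[1]])
    z = u[0]**2 + np.sin(u[1])
    return u, z

def grad_via_chain_rule(x):
    u, _ = forward(x)
    dz_du = np.array([2 * u[0], np.cos(u[1])])   # ∂z/∂u_j
    du_dx = np.array([[x[1], x[0]],              # Jacobian: row j holds ∂u_j/∂x_i
                      [1.0,  1.0]])
    return du_dx.T @ dz_du                       # ∂z/∂x_i = Σ_j ∂z/∂u_j · ∂u_j/∂x_i

x = np.array([1.0, 2.0])
print("Chain rule:", grad_via_chain_rule(x))

# Finite-difference check, one input at a time
h = 1e-7
for i in range(2):
    step = np.zeros(2); step[i] = h
    numerical = (forward(x + step)[1] - forward(x - step)[1]) / (2 * h)
    print(f"Numerical ∂z/∂x{i+1}: {numerical:.8f}")
```

The worked example below then applies the same sum-over-paths rule by hand to $z = (x+y)^2 \sin(xy)$.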
```python
import numpy as np

# Example: z = (x + y)² · sin(xy)
# Let u = x + y, v = xy
# Then z = u² · sin(v)

def z_direct(x, y):
    return (x + y)**2 * np.sin(x * y)

def dz_dx_analytical(x, y):
    """
    z = u² · sin(v) where u = x+y, v = xy

    ∂z/∂x = ∂z/∂u · ∂u/∂x + ∂z/∂v · ∂v/∂x

    ∂z/∂u = 2u · sin(v) = 2(x+y) · sin(xy)
    ∂u/∂x = 1
    ∂z/∂v = u² · cos(v) = (x+y)² · cos(xy)
    ∂v/∂x = y

    ∂z/∂x = 2(x+y)·sin(xy) + (x+y)²·cos(xy)·y
    """
    u = x + y
    v = x * y
    return 2*u*np.sin(v) + u**2 * np.cos(v) * y

# Numerical verification
def numerical_partial(f, x, y, var_idx, h=1e-7):
    if var_idx == 0:
        return (f(x+h, y) - f(x-h, y)) / (2*h)
    else:
        return (f(x, y+h) - f(x, y-h)) / (2*h)

x, y = 1.0, 2.0
print(f"z = (x+y)² · sin(xy) at ({x}, {y})")
print(f"\n∂z/∂x:")
print(f"  Numerical:  {numerical_partial(z_direct, x, y, 0):.8f}")
print(f"  Analytical: {dz_dx_analytical(x, y):.8f}")
```

The Deep Connection:
Backpropagation is simply the chain rule applied systematically to computational graphs. Consider a neural network layer:
$$z = Wx + b \quad \text{(linear transformation)}$$ $$a = \sigma(z) \quad \text{(activation)}$$ $$L = \text{Loss}(a, y) \quad \text{(loss computation)}$$
To update weight $W$, we need $\frac{\partial L}{\partial W}$:
$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial W}$$
This is the chain rule in action:
We propagate gradients backward because the chain rule multiplies its factors from the outside in: starting from the loss (the outermost function), we compute each factor and pass it back to the previous layer. Each layer receives the gradient from above and multiplies it by its own local derivative.
```python
import numpy as np

class SimpleNeuron:
    """Single neuron demonstrating backprop = chain rule."""

    def __init__(self):
        self.w = np.random.randn()
        self.b = np.random.randn()

    def forward(self, x):
        """Forward pass: store intermediates for backprop."""
        self.x = x
        self.z = self.w * x + self.b          # Linear
        self.a = 1 / (1 + np.exp(-self.z))    # Sigmoid
        return self.a

    def backward(self, dL_da):
        """
        Backward pass: Chain rule in action.
        Given ∂L/∂a, compute ∂L/∂w and ∂L/∂b.

        Chain: ∂L/∂w = ∂L/∂a · ∂a/∂z · ∂z/∂w
        """
        # ∂a/∂z = a(1-a) for sigmoid
        da_dz = self.a * (1 - self.a)

        # ∂z/∂w = x, ∂z/∂b = 1
        dz_dw = self.x
        dz_db = 1

        # Chain rule
        dL_dz = dL_da * da_dz
        dL_dw = dL_dz * dz_dw
        dL_db = dL_dz * dz_db

        return dL_dw, dL_db


# Test
neuron = SimpleNeuron()
x, y_true = 2.0, 1.0

# Forward
a = neuron.forward(x)
L = (y_true - a)**2  # MSE loss

# Backward
dL_da = -2 * (y_true - a)  # ∂L/∂a = -2(y - a)
dL_dw, dL_db = neuron.backward(dL_da)

print(f"Forward:  x={x}, a={a:.4f}, L={L:.4f}")
print(f"Backward: ∂L/∂w={dL_dw:.4f}, ∂L/∂b={dL_db:.4f}")

# Verify numerically
h = 1e-7
neuron.w += h
L_plus = (y_true - neuron.forward(x))**2
neuron.w -= 2*h
L_minus = (y_true - neuron.forward(x))**2
neuron.w += h
numerical_grad = (L_plus - L_minus) / (2*h)

print(f"\nVerification: Numerical ∂L/∂w = {numerical_grad:.4f}")
```

The chain rule reveals why deep networks can be hard to train. Consider $n$ layers, each with activation derivative $\sigma'$:
$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial a_n} \cdot \left(\prod_{i=2}^{n} \sigma'(z_i)\, W_i\right) \cdot \sigma'(z_1) \cdot \frac{\partial z_1}{\partial W_1}$$
The product of many terms can cause:
Vanishing Gradients: If $|\sigma'| < 1$ consistently (e.g., sigmoid where $\sigma' \leq 0.25$), the product shrinks exponentially. For 10 layers: $0.25^{10} \approx 10^{-6}$. Early layers receive virtually zero gradient.
Exploding Gradients: If $|\sigma' \cdot W| > 1$ consistently, gradients grow exponentially. Training becomes unstable.
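A quick numerical illustration of both effects (a sketch; the depth of 10 layers and the representative per-layer magnitudes are assumptions chosen purely for demonstration):

```python
# Treat σ'(z_i) and W_i as fixed representative magnitudes to isolate what
# happens when many per-layer factors are multiplied together.
def gradient_magnitude(depth, local_deriv, weight):
    return (local_deriv * weight) ** depth

depth = 10
# Vanishing: sigmoid's derivative is at most 0.25, so with |W| = 1 each factor ≤ 0.25
print("Vanishing:", gradient_magnitude(depth, local_deriv=0.25, weight=1.0))   # ≈ 9.5e-07
# Exploding: if |σ'(z)·W| > 1 in every layer (e.g. 0.25 · 5 = 1.25), growth is exponential
print("Exploding:", gradient_magnitude(depth, local_deriv=0.25, weight=5.0))   # ≈ 9.31
```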
You now understand the mathematical engine behind all neural network training. Next, we'll explore the gradient vector, which collects all the partial derivatives into a single vector that points in the direction of steepest ascent.