At its core, machine learning is about optimization—finding the best parameters that minimize prediction errors. But how do we mathematically describe "moving toward better"? How do we quantify the notion of improvement itself?
The answer lies in derivatives, the mathematical language of change. Every time a neural network updates its weights, every time a gradient descent step is taken, every time a loss function is minimized—derivatives are working behind the scenes and telling us the direction and magnitude of improvement.
This isn't merely calculus for calculus's sake. Understanding derivatives deeply will unlock your ability to reason precisely about rates of change, work confidently with partial derivatives of multivariate functions, and see exactly how gradients drive the optimization at the heart of every learning algorithm.
By the end of this page, you will have mastered the concept of derivatives as rates of change, understood partial derivatives for multivariate functions, and seen clearly how these concepts connect to the optimization process that is central to all machine learning algorithms.
Before we can apply derivatives to machine learning, we need a crystal-clear understanding of what a derivative actually is. While you may recall "the derivative is the slope of the tangent line," this intuition—though valuable—only scratches the surface.
Formal Definition:
The derivative of a function $f(x)$ at a point $a$ is defined as the limit:
$$f'(a) = \lim_{h \to 0} \frac{f(a + h) - f(a)}{h}$$
This represents the instantaneous rate of change of $f$ at the point $a$. Let's dissect each component: $h$ is a small perturbation of the input; $f(a + h) - f(a)$ is the resulting change in the output; their ratio is the average rate of change over the interval from $a$ to $a + h$; and the limit $h \to 0$ turns that average rate into an instantaneous one.
Geometrically, as $h \to 0$, the secant line between $(a, f(a))$ and $(a+h, f(a+h))$ approaches the tangent line at $a$. The derivative $f'(a)$ is the slope of this tangent line. This geometric intuition is powerful: the sign of the derivative tells us whether the function is increasing or decreasing at that point, and its magnitude tells us how quickly.
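To make the limit concrete, here is a minimal sketch (the function $f(x) = x^2$ and the point $a = 1$ are illustrative choices) showing the difference quotient closing in on the true slope $f'(1) = 2$ as $h$ shrinks:

```python
# Sketch: the difference quotient [f(a+h) - f(a)] / h approaches f'(a) as h → 0.
# Illustrative choices: f(x) = x², a = 1, so f'(1) = 2.
def f(x):
    return x ** 2

a = 1.0
for h in [1.0, 0.1, 0.01, 0.001, 0.0001]:
    quotient = (f(a + h) - f(a)) / h
    print(f"h = {h:<8} difference quotient = {quotient:.6f}")
# The printed values approach 2.0, the slope of the tangent line at a = 1.
```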
Alternative Notations:
Derivatives have several equivalent notations, each highlighting different aspects:
| Notation | Name | Usage Context |
|---|---|---|
| $f'(x)$ | Lagrange notation | General calculus |
| $\frac{df}{dx}$ | Leibniz notation | Emphasizes ratio of changes |
| $\frac{d}{dx}f$ | Operator notation | Treats differentiation as operator |
| $D_x f$ | Euler notation | Functional analysis |
| $\dot{f}$ | Newton notation | Physics, time derivatives |
In machine learning literature, Leibniz-style notation, usually in its partial-derivative form $\frac{\partial f}{\partial x}$, dominates because it makes the chain rule's structure explicit and clearly shows which variable we're differentiating with respect to.
Why Differentiability Matters:
For a function to be differentiable at a point, the limit must exist—meaning the function must be "smooth" at that point. Non-differentiable points occur at corners or kinks (such as $|x|$ at $x = 0$), cusps, jump discontinuities, and points with vertical tangent lines.
Machine Learning Connection:
This is why most activation functions in neural networks are designed to be differentiable everywhere—or at least have well-defined subgradients at non-differentiable points. The ReLU function $f(x) = \max(0, x)$ has a corner at $x = 0$, but we typically define its derivative as 0 at that point for practical implementation.
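To illustrate that convention, here is a small sketch (the function names are ours) of ReLU and the derivative most implementations use, with the value 0 assigned at the corner $x = 0$:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    # Not a true derivative at x = 0; by convention we return 0 there,
    # which is a valid subgradient and matches common practice.
    return (x > 0).astype(float)

xs = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("x:        ", xs)
print("ReLU(x):  ", relu(xs))
print("ReLU'(x): ", relu_derivative(xs))  # note the 0 assigned at x = 0
```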
```python
import numpy as np
import matplotlib.pyplot as plt

# Demonstrate derivative as limit of difference quotient
def numerical_derivative(f, x, h=1e-8):
    """
    Approximates f'(x) using the central difference formula.

    Central difference: [f(x+h) - f(x-h)] / (2h)
    is more accurate than forward difference: [f(x+h) - f(x)] / h
    """
    return (f(x + h) - f(x - h)) / (2 * h)

# Example 1: Polynomial f(x) = x^3
def f_cubic(x):
    return x ** 3

# The analytical derivative is f'(x) = 3x^2
def f_cubic_derivative_analytical(x):
    return 3 * x ** 2

# Test at x = 2
x_test = 2.0
numerical_deriv = numerical_derivative(f_cubic, x_test)
analytical_deriv = f_cubic_derivative_analytical(x_test)

print(f"For f(x) = x³ at x = {x_test}:")
print(f"  Numerical derivative:  {numerical_deriv:.10f}")
print(f"  Analytical derivative: {analytical_deriv:.10f}")
print(f"  Error: {abs(numerical_deriv - analytical_deriv):.2e}")

# Example 2: Exponential f(x) = e^x (its own derivative!)
def f_exp(x):
    return np.exp(x)

x_test = 1.0
print(f"\nFor f(x) = e^x at x = {x_test}:")
print(f"  Numerical derivative:  {numerical_derivative(f_exp, x_test):.10f}")
print(f"  Analytical derivative: {f_exp(x_test):.10f}")  # e^x is its own derivative

# Example 3: Sigmoid (crucial for ML)
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative_analytical(x):
    """
    The sigmoid derivative has a beautiful form:
    σ'(x) = σ(x) · (1 - σ(x))
    """
    s = sigmoid(x)
    return s * (1 - s)

x_test = 0.0
print(f"\nFor sigmoid at x = {x_test}:")
print(f"  Numerical derivative:  {numerical_derivative(sigmoid, x_test):.10f}")
print(f"  Analytical derivative: {sigmoid_derivative_analytical(x_test):.10f}")
```

The power of calculus lies in not having to compute limits from first principles every time. A set of differentiation rules allows us to find derivatives of complex functions by combining simpler pieces. These rules are not merely computational tools—they reveal the deep structure of how functions compose and interact.
The Foundational Rules:
| Rule | Formula | Intuition |
|---|---|---|
| Constant | $\frac{d}{dx}c = 0$ | Constants don't change; no rate of change |
| Power | $\frac{d}{dx}x^n = nx^{n-1}$ | Exponent drops down, power decreases |
| Sum | $\frac{d}{dx}[f + g] = f' + g'$ | Rate of sum = sum of rates |
| Difference | $\frac{d}{dx}[f - g] = f' - g'$ | Difference of rates |
| Constant Multiple | $\frac{d}{dx}[cf] = cf'$ | Constant factors pass through differentiation |
| Product | $\frac{d}{dx}[fg] = f'g + fg'$ | Each factor's contribution to change |
| Quotient | $\frac{d}{dx}\left[\frac{f}{g}\right] = \frac{f'g - fg'}{g^2}$ | Numerator-denominator interaction |
| Chain | $\frac{d}{dx}f(g(x)) = f'(g(x)) \cdot g'(x)$ | Composition rule: outer × inner derivative |
Deep Dive: The Product Rule
The product rule $(fg)' = f'g + fg'$ deserves careful attention because it appears constantly in ML contexts. Consider why this rule makes sense intuitively:
Imagine a rectangle with width $f(x)$ and height $g(x)$. As $x$ changes by a small amount $dx$, the width grows by roughly $f'(x)\,dx$ and the height by roughly $g'(x)\,dx$.
The change in area (approximately) comes from two strips: widening the rectangle contributes $f'(x)\,dx \cdot g(x)$, and heightening it contributes $f(x) \cdot g'(x)\,dx$; the tiny corner piece $f'g'\,dx^2$ is negligible.
Total: $(f'g + fg')\,dx$, giving us the product rule.
The product rule appears when computing gradients of attention mechanisms, gating mechanisms (like LSTMs), and any architecture that multiplies learned representations together. If $a(x)$ is an attention weight and $v(x)$ is a value vector, the gradient through their product $(av)' = a'v + av'$ involves both how attention changes and how values change.
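The rectangle argument can also be checked numerically: over a small step $dx$, the exact change in a product $f(x)g(x)$ matches the two product-rule terms, while the cross term of order $dx^2$ is orders of magnitude smaller. A minimal sketch, with $f$ and $g$ chosen purely for illustration:

```python
import numpy as np

# Illustrative choices: f(x) = x², g(x) = sin(x)
f  = lambda x: x**2
g  = lambda x: np.sin(x)
fp = lambda x: 2 * x        # f'
gp = lambda x: np.cos(x)    # g'

x, dx = 1.0, 1e-4
exact_change = f(x + dx) * g(x + dx) - f(x) * g(x)
first_order  = (fp(x) * g(x) + f(x) * gp(x)) * dx   # the product-rule terms
cross_term   = fp(x) * gp(x) * dx**2                # vanishes much faster than dx

print(f"Exact change:       {exact_change:.10f}")
print(f"(f'g + fg') dx:     {first_order:.10f}")
print(f"Cross term f'g'dx²: {cross_term:.2e}")      # negligible, as claimed
```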
Essential Function Derivatives:
Beyond the rules, you should internalize these fundamental derivatives:
| Function | Derivative | ML Application |
|---|---|---|
| $e^x$ | $e^x$ | Softmax, exponential family |
| $\ln(x)$ | $\frac{1}{x}$ | Log-likelihood, cross-entropy |
| $\sin(x)$ | $\cos(x)$ | Positional encodings (transformers) |
| $\cos(x)$ | $-\sin(x)$ | Positional encodings (transformers) |
| $a^x$ | $a^x \ln(a)$ | Learning rate schedules |
| $\log_a(x)$ | $\frac{1}{x \ln(a)}$ | Information theory |
| $\tanh(x)$ | $1 - \tanh^2(x) = \text{sech}^2(x)$ | Activation function |
| $\sigma(x)$ | $\sigma(x)(1-\sigma(x))$ | Sigmoid activation |
```python
import numpy as np

# Demonstrating each rule with numerical verification

def verify_derivative(name, f, f_prime, x, h=1e-7):
    """Verify analytical derivative against numerical approximation."""
    numerical = (f(x + h) - f(x - h)) / (2 * h)
    analytical = f_prime(x)
    error = abs(numerical - analytical)
    print(f"{name}:")
    print(f"  Numerical:  {numerical:.8f}")
    print(f"  Analytical: {analytical:.8f}")
    print(f"  Error: {error:.2e}\n")

# 1. Power Rule: d/dx[x^4] = 4x^3
verify_derivative(
    "Power Rule: d/dx[x⁴] = 4x³",
    f=lambda x: x**4,
    f_prime=lambda x: 4 * x**3,
    x=2.0
)

# 2. Product Rule: d/dx[x² · sin(x)] = 2x·sin(x) + x²·cos(x)
verify_derivative(
    "Product Rule: d/dx[x² · sin(x)]",
    f=lambda x: x**2 * np.sin(x),
    f_prime=lambda x: 2*x*np.sin(x) + x**2*np.cos(x),
    x=1.5
)

# 3. Quotient Rule: d/dx[sin(x)/x] = [cos(x)·x - sin(x)]/x²
verify_derivative(
    "Quotient Rule: d/dx[sin(x)/x]",
    f=lambda x: np.sin(x) / x,
    f_prime=lambda x: (np.cos(x)*x - np.sin(x)) / x**2,
    x=0.5
)

# 4. Chain Rule: d/dx[e^(x²)] = e^(x²) · 2x
verify_derivative(
    "Chain Rule: d/dx[e^(x²)]",
    f=lambda x: np.exp(x**2),
    f_prime=lambda x: np.exp(x**2) * 2*x,
    x=1.0
)

# 5. ML-relevant: Log-loss derivative component
# d/dx[-log(sigmoid(x))] = -(1-sigmoid(x)) = sigmoid(x) - 1
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

verify_derivative(
    "Log-Loss Gradient: d/dx[-log(σ(x))]",
    f=lambda x: -np.log(sigmoid(x)),
    f_prime=lambda x: sigmoid(x) - 1,  # or -(1 - sigmoid(x))
    x=0.5
)
```

In machine learning, we rarely work with functions of a single variable. A neural network's loss function depends on millions of parameters; a regression model depends on multiple features and coefficients. This necessitates extending our derivative concept to multivariate functions.
The Fundamental Idea:
Given a function $f(x_1, x_2, \ldots, x_n)$ of multiple variables, the partial derivative with respect to $x_i$ measures how $f$ changes when only $x_i$ changes, while all other variables are held constant.
$$\frac{\partial f}{\partial x_i} = \lim_{h \to 0} \frac{f(x_1, \ldots, x_i + h, \ldots, x_n) - f(x_1, \ldots, x_i, \ldots, x_n)}{h}$$
Notation: the partial derivative of $f$ with respect to $x_i$ is most often written $\frac{\partial f}{\partial x_i}$, with $f_{x_i}$ and $\partial_{x_i} f$ as common shorthands.
To compute a partial derivative with respect to variable $x_i$, treat all other variables as constants and apply ordinary differentiation rules. For example, if $f(x, y) = x^2 y + 3xy^2$, then $\frac{\partial f}{\partial x} = 2xy + 3y^2$ (treating $y$ as constant) and $\frac{\partial f}{\partial y} = x^2 + 6xy$ (treating $x$ as constant).
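As a quick numerical check of that example, here is a small sketch (the finite-difference helpers are ours, defined only for this illustration):

```python
f = lambda x, y: x**2 * y + 3 * x * y**2

# Central differences in each variable, holding the other fixed
def partial_x(f, x, y, h=1e-6):
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

def partial_y(f, x, y, h=1e-6):
    return (f(x, y + h) - f(x, y - h)) / (2 * h)

x, y = 2.0, 3.0
print("∂f/∂x:", partial_x(f, x, y), "vs analytical 2xy + 3y² =", 2*x*y + 3*y**2)
print("∂f/∂y:", partial_y(f, x, y), "vs analytical x² + 6xy =", x**2 + 6*x*y)
```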
Concrete Example: Two-Variable Function
Consider the function $f(x, y) = x^3 + x^2y - 2y^2 + 4xy$. Let's compute both partial derivatives:
Computing $\frac{\partial f}{\partial x}$ (treat $y$ as constant): term by term, $x^3 \to 3x^2$, $x^2y \to 2xy$, $-2y^2 \to 0$, and $4xy \to 4y$.
Result: $\frac{\partial f}{\partial x} = 3x^2 + 2xy + 4y$
Computing $\frac{\partial f}{\partial y}$ (treat $x$ as constant): term by term, $x^3 \to 0$, $x^2y \to x^2$, $-2y^2 \to -4y$, and $4xy \to 4x$.
Result: $\frac{\partial f}{\partial y} = x^2 - 4y + 4x$
Geometric Interpretation:
For a function $f(x, y)$: $\frac{\partial f}{\partial x}$ is the slope of the surface along a slice where $y$ is held fixed, and $\frac{\partial f}{\partial y}$ is the slope along a slice where $x$ is held fixed.
Think of standing on a hilly terrain: $\frac{\partial f}{\partial x}$ is how steeply you climb if you walk due east (keeping your north-south position fixed), and $\frac{\partial f}{\partial y}$ is how steeply you climb if you walk due north.
The gradient (coming next section) tells you the steepest uphill direction overall.
```python
import numpy as np

def partial_derivative(f, var_idx, point, h=1e-7):
    """
    Compute partial derivative of f with respect to variable at var_idx.

    Args:
        f: Function that takes a numpy array of coordinates
        var_idx: Index of variable to differentiate with respect to
        point: numpy array representing the point at which to evaluate
        h: Small step size for numerical differentiation

    Returns:
        Approximate partial derivative at the given point
    """
    point = np.array(point, dtype=float)

    # Create points for central difference
    point_plus = point.copy()
    point_minus = point.copy()
    point_plus[var_idx] += h
    point_minus[var_idx] -= h

    return (f(point_plus) - f(point_minus)) / (2 * h)

# Example 1: f(x, y) = x³ + x²y - 2y² + 4xy
def f_example(p):
    x, y = p[0], p[1]
    return x**3 + x**2 * y - 2 * y**2 + 4 * x * y

def df_dx_analytical(p):
    x, y = p[0], p[1]
    return 3 * x**2 + 2 * x * y + 4 * y

def df_dy_analytical(p):
    x, y = p[0], p[1]
    return x**2 - 4 * y + 4 * x

# Test at point (2, 3)
test_point = np.array([2.0, 3.0])

print("Function: f(x, y) = x³ + x²y - 2y² + 4xy")
print(f"Point: ({test_point[0]}, {test_point[1]})\n")

# Partial with respect to x
partial_x_numerical = partial_derivative(f_example, 0, test_point)
partial_x_analytical = df_dx_analytical(test_point)
print("∂f/∂x:")
print(f"  Numerical:  {partial_x_numerical:.8f}")
print(f"  Analytical: {partial_x_analytical:.8f}")
print(f"  Error: {abs(partial_x_numerical - partial_x_analytical):.2e}\n")

# Partial with respect to y
partial_y_numerical = partial_derivative(f_example, 1, test_point)
partial_y_analytical = df_dy_analytical(test_point)
print("∂f/∂y:")
print(f"  Numerical:  {partial_y_numerical:.8f}")
print(f"  Analytical: {partial_y_analytical:.8f}")
print(f"  Error: {abs(partial_y_numerical - partial_y_analytical):.2e}\n")

# Example 2: ML-relevant - MSE Loss
# L(w₀, w₁) = (1/n) Σᵢ (yᵢ - (w₀ + w₁xᵢ))²
# For simplicity, consider single data point: L = (y - w₀ - w₁x)²

def mse_single_point(params):
    """MSE for single point (x=2, y=5): L = (5 - w₀ - 2w₁)²"""
    w0, w1 = params[0], params[1]
    x, y = 2.0, 5.0
    return (y - w0 - w1 * x) ** 2

def mse_dw0_analytical(params):
    """∂L/∂w₀ = -2(y - w₀ - w₁x)"""
    w0, w1 = params[0], params[1]
    x, y = 2.0, 5.0
    return -2 * (y - w0 - w1 * x)

def mse_dw1_analytical(params):
    """∂L/∂w₁ = -2x(y - w₀ - w₁x)"""
    w0, w1 = params[0], params[1]
    x, y = 2.0, 5.0
    return -2 * x * (y - w0 - w1 * x)

# Test at w₀=1, w₁=1 (prediction = 1 + 2 = 3, actual = 5, error = 2)
params = np.array([1.0, 1.0])

print("\nMSE Loss: L = (5 - w₀ - 2w₁)² for single point (x=2, y=5)")
print(f"Parameters: w₀={params[0]}, w₁={params[1]}\n")

partial_w0_num = partial_derivative(mse_single_point, 0, params)
partial_w0_ana = mse_dw0_analytical(params)
print("∂L/∂w₀:")
print(f"  Numerical:  {partial_w0_num:.8f}")
print(f"  Analytical: {partial_w0_ana:.8f}")

partial_w1_num = partial_derivative(mse_single_point, 1, params)
partial_w1_ana = mse_dw1_analytical(params)
print("\n∂L/∂w₁:")
print(f"  Numerical:  {partial_w1_num:.8f}")
print(f"  Analytical: {partial_w1_ana:.8f}")
```

Just as we can take the derivative of a derivative (second derivative) for single-variable functions, we can compute second-order partial derivatives and mixed partial derivatives for multivariate functions. These are critical for understanding curvature—which determines convergence rates and optimal step sizes in optimization.
Types of Second-Order Partials:
For a function $f(x, y)$, we have four second-order partial derivatives:
$\frac{\partial^2 f}{\partial x^2}$: Differentiate with respect to $x$ twice (how the $x$-slope changes in the $x$-direction)
$\frac{\partial^2 f}{\partial y^2}$: Differentiate with respect to $y$ twice (how the $y$-slope changes in the $y$-direction)
$\frac{\partial^2 f}{\partial x \partial y}$: Differentiate first with respect to $y$, then with respect to $x$
$\frac{\partial^2 f}{\partial y \partial x}$: Differentiate first with respect to $x$, then with respect to $y$
For functions with continuous second partial derivatives, the order of differentiation doesn't matter: $\frac{\partial^2 f}{\partial x \partial y} = \frac{\partial^2 f}{\partial y \partial x}$. This symmetry is crucial—it means we only need to compute $n(n+1)/2$ second derivatives for a function of $n$ variables instead of $n^2$.
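This symmetry can also be confirmed symbolically. Here is a sketch using SymPy (not used elsewhere on this page), applied to the function of the worked example below:

```python
import sympy as sp

x, y = sp.symbols('x y')
f = x**3 * y**2 + 2*x*y**3 - x**2

f_xy = sp.diff(f, x, y)   # differentiate with respect to x, then y
f_yx = sp.diff(f, y, x)   # differentiate with respect to y, then x

print(f_xy)                       # 6*x**2*y + 6*y**2
print(f_yx)                       # 6*x**2*y + 6*y**2
print(sp.simplify(f_xy - f_yx))   # 0 → the mixed partials agree (Schwarz's theorem)
```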
Worked Example:
Let $f(x, y) = x^3y^2 + 2xy^3 - x^2$
First-order partials: $\frac{\partial f}{\partial x} = 3x^2y^2 + 2y^3 - 2x$ and $\frac{\partial f}{\partial y} = 2x^3y + 6xy^2$
Second-order partials: $\frac{\partial^2 f}{\partial x^2} = 6xy^2 - 2$, $\frac{\partial^2 f}{\partial y^2} = 2x^3 + 12xy$, $\frac{\partial^2 f}{\partial x \partial y} = 6x^2y + 6y^2$, and $\frac{\partial^2 f}{\partial y \partial x} = 6x^2y + 6y^2$
Note: $\frac{\partial^2 f}{\partial x \partial y} = \frac{\partial^2 f}{\partial y \partial x}$, confirming Schwarz's theorem.
Geometric Meaning of Second Derivatives:
Second-order partial derivatives describe curvature: a positive $\frac{\partial^2 f}{\partial x^2}$ means the surface curves upward (is convex) along the $x$-direction, a negative value means it curves downward, and the mixed partial $\frac{\partial^2 f}{\partial x \partial y}$ measures how the slope in one direction changes as you move in the other direction, the "twist" of the surface.
Why This Matters for ML:
Second-order information tells us how curved the loss surface is, which governs how large an optimization step we can safely take; it is what Newton-type methods use to rescale gradient steps; and it distinguishes minima, maxima, and saddle points at locations where the first derivatives vanish (see the one-dimensional sketch below).
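As a one-dimensional illustration of how curvature sets the step size, here is a sketch of a single Newton step $x_{\text{new}} = x - f'(x)/f''(x)$ on a quadratic chosen purely for illustration:

```python
# Sketch: one Newton step on f(x) = 3(x - 2)², whose minimum is at x = 2.
# The second derivative (curvature) f''(x) = 6 rescales the gradient step.
f_prime        = lambda x: 6 * (x - 2)   # first derivative
f_double_prime = lambda x: 6.0           # second derivative (constant curvature)

x = 5.0
x_new = x - f_prime(x) / f_double_prime(x)
print(x_new)  # 2.0 — lands exactly on the minimizer because the curvature is known
```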
```python
import numpy as np

def second_partial(f, i, j, point, h=1e-5):
    """
    Compute second partial derivative ∂²f/(∂xᵢ ∂xⱼ)

    Uses central difference for both derivatives.
    """
    n = len(point)
    point = np.array(point, dtype=float)

    # Create offset vectors
    def offset(idx, delta):
        v = np.zeros(n)
        v[idx] = delta
        return v

    # ∂²f/(∂xᵢ ∂xⱼ) ≈ [f(x+hᵢ+hⱼ) - f(x+hᵢ-hⱼ) - f(x-hᵢ+hⱼ) + f(x-hᵢ-hⱼ)] / (4h²)
    pp = point + offset(i, h) + offset(j, h)
    pm = point + offset(i, h) - offset(j, h)
    mp = point - offset(i, h) + offset(j, h)
    mm = point - offset(i, h) - offset(j, h)

    return (f(pp) - f(pm) - f(mp) + f(mm)) / (4 * h * h)

# Example: f(x, y) = x³y² + 2xy³ - x²
def f(p):
    x, y = p[0], p[1]
    return x**3 * y**2 + 2*x*y**3 - x**2

# Analytical second derivatives
def f_xx(p):
    x, y = p[0], p[1]
    return 6*x*y**2 - 2

def f_yy(p):
    x, y = p[0], p[1]
    return 2*x**3 + 12*x*y

def f_xy(p):
    x, y = p[0], p[1]
    return 6*x**2*y + 6*y**2

# Test at point (1, 2)
point = np.array([1.0, 2.0])
print("Function: f(x, y) = x³y² + 2xy³ - x²")
print(f"Point: ({point[0]}, {point[1]})\n")

print("Second-Order Partial Derivatives:\n")

# ∂²f/∂x²
fxx_num = second_partial(f, 0, 0, point)
fxx_ana = f_xx(point)
print("∂²f/∂x²:")
print(f"  Numerical:  {fxx_num:.6f}")
print(f"  Analytical: {fxx_ana:.6f}\n")

# ∂²f/∂y²
fyy_num = second_partial(f, 1, 1, point)
fyy_ana = f_yy(point)
print("∂²f/∂y²:")
print(f"  Numerical:  {fyy_num:.6f}")
print(f"  Analytical: {fyy_ana:.6f}\n")

# ∂²f/(∂x∂y)
fxy_num = second_partial(f, 0, 1, point)
fxy_ana = f_xy(point)
print("∂²f/(∂x∂y):")
print(f"  Numerical:  {fxy_num:.6f}")
print(f"  Analytical: {fxy_ana:.6f}\n")

# ∂²f/(∂y∂x) - should equal ∂²f/(∂x∂y) by Schwarz's theorem
fyx_num = second_partial(f, 1, 0, point)
print(f"∂²f/(∂y∂x): {fyx_num:.6f} (should equal ∂²f/(∂x∂y))\n")

# Curvature interpretation
print("Curvature Interpretation:")
print(f"  f_xx = {fxx_ana:.2f}: Curvature in x-direction")
print(f"  f_yy = {fyy_ana:.2f}: Curvature in y-direction")
if fxx_ana > 0:
    print("  → Surface curves upward in x-direction at this point")
if fyy_ana > 0:
    print("  → Surface curves upward in y-direction at this point")
```

Now let's connect everything to the practice of machine learning. Every modern ML model learns by computing derivatives—specifically, derivatives of a loss function with respect to model parameters.
The Universal Pattern:
1. Define a loss function that measures how wrong the model's predictions are.
2. Compute the derivatives (the gradient) of that loss with respect to every model parameter.
3. Update each parameter a small step in the direction that decreases the loss.
4. Repeat until the loss stops improving.
This simple pattern underlies everything from linear regression to GPT-4.
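Here is a minimal sketch of that loop on a toy one-parameter loss; the loss function, starting point, and learning rate are illustrative choices rather than any particular model:

```python
# Sketch of the universal pattern on a toy loss L(w) = (w - 4)².
loss      = lambda w: (w - 4) ** 2
grad_loss = lambda w: 2 * (w - 4)        # dL/dw, derived analytically

w, learning_rate = 0.0, 0.1
for step in range(50):
    g = grad_loss(w)                     # steps 1-2: evaluate the derivative of the loss at w
    w = w - learning_rate * g            # step 3: move against the gradient
print(f"w = {w:.4f}, loss = {loss(w):.6f}")  # w approaches 4.0 and the loss approaches 0
```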
Example: Linear Regression Gradient Derivation
Let's derive the gradient for linear regression from first principles. Our model is: $$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_d x_d = \mathbf{w}^T \mathbf{x}$$
Our loss function (MSE) for a single data point $(\mathbf{x}, y)$: $$\mathcal{L} = (y - \mathbf{w}^T \mathbf{x})^2$$
Computing $\frac{\partial \mathcal{L}}{\partial w_j}$:
Using the chain rule: $$\frac{\partial \mathcal{L}}{\partial w_j} = 2(y - \mathbf{w}^T \mathbf{x}) \cdot \frac{\partial}{\partial w_j}(y - \mathbf{w}^T \mathbf{x})$$
Now, $\frac{\partial}{\partial w_j}(y - \mathbf{w}^T \mathbf{x}) = -x_j$ (since $\mathbf{w}^T \mathbf{x} = w_0 x_0 + w_1 x_1 + \cdots$ and only the $w_j$ term contains $w_j$)
Therefore: $$\frac{\partial \mathcal{L}}{\partial w_j} = -2(y - \mathbf{w}^T \mathbf{x}) \cdot x_j = -2 \cdot \text{error} \cdot x_j$$
Intuition: The gradient is proportional to the prediction error and the feature value. Large errors on features with large values produce large gradients—which makes sense, as those features contributed more to the error.
```python
import numpy as np

class LinearRegressionFromScratch:
    """
    Linear Regression with explicit gradient computation.

    This implementation shows exactly how derivatives drive learning.
    """

    def __init__(self, n_features):
        # Initialize weights randomly
        self.weights = np.random.randn(n_features + 1) * 0.01
        # weights[0] is bias, weights[1:] are feature weights

    def predict(self, X):
        """Compute predictions: ŷ = Xw (with bias column added)"""
        X_with_bias = np.column_stack([np.ones(len(X)), X])
        return X_with_bias @ self.weights

    def compute_loss(self, X, y):
        """MSE Loss: (1/n) * Σ(yᵢ - ŷᵢ)²"""
        predictions = self.predict(X)
        return np.mean((y - predictions) ** 2)

    def compute_gradient(self, X, y):
        """
        Compute ∂L/∂w for all weights.

        Derivation:
            L = (1/n) Σ(yᵢ - wᵀxᵢ)²
            ∂L/∂wⱼ = (1/n) Σ 2(yᵢ - wᵀxᵢ)(-xᵢⱼ)
                   = (-2/n) Σ (yᵢ - ŷᵢ)xᵢⱼ
                   = (-2/n) Xᵀ(y - ŷ)
        """
        n = len(y)
        X_with_bias = np.column_stack([np.ones(n), X])
        predictions = X_with_bias @ self.weights
        errors = y - predictions  # Shape: (n,)

        # Gradient: (-2/n) * X^T * errors
        # Note: This is the analytical gradient derived from calculus
        gradient = (-2 / n) * X_with_bias.T @ errors
        return gradient

    def numerical_gradient(self, X, y, h=1e-7):
        """
        Compute gradient numerically for verification.

        This should match compute_gradient() if our derivation is correct.
        """
        gradient = np.zeros_like(self.weights)

        for i in range(len(self.weights)):
            # Compute f(w + h eᵢ)
            self.weights[i] += h
            loss_plus = self.compute_loss(X, y)

            # Compute f(w - h eᵢ)
            self.weights[i] -= 2 * h
            loss_minus = self.compute_loss(X, y)

            # Central difference
            gradient[i] = (loss_plus - loss_minus) / (2 * h)

            # Restore weight
            self.weights[i] += h

        return gradient

    def train_step(self, X, y, learning_rate=0.01):
        """Single gradient descent step."""
        gradient = self.compute_gradient(X, y)
        self.weights -= learning_rate * gradient
        return self.compute_loss(X, y)

# Demonstration
np.random.seed(42)

# Generate synthetic data: y = 2 + 3x₁ - x₂ + noise
n_samples = 100
X = np.random.randn(n_samples, 2)
true_weights = np.array([2.0, 3.0, -1.0])  # [bias, w1, w2]
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * np.random.randn(n_samples)

# Create model
model = LinearRegressionFromScratch(n_features=2)

print("Gradient Verification:")
print("=" * 50)

analytical_grad = model.compute_gradient(X, y)
numerical_grad = model.numerical_gradient(X, y)

print("\n Analytical vs Numerical Gradients:\n")
for i, (a, n) in enumerate(zip(analytical_grad, numerical_grad)):
    diff = abs(a - n)
    status = "✓" if diff < 1e-5 else "✗"
    print(f"  ∂L/∂w[{i}]: Analytical={a:+.8f}, Numerical={n:+.8f}, Error={diff:.2e} {status}")

print("\n" + "=" * 50)
print("\nTraining Progress:")
print("-" * 50)

# Train for several epochs
for epoch in range(0, 101, 20):
    loss = model.compute_loss(X, y)
    if epoch == 0:
        print(f"Epoch {epoch:4d}: Loss = {loss:.6f}")
    for _ in range(20):
        model.train_step(X, y, learning_rate=0.1)
    if epoch > 0:
        loss = model.compute_loss(X, y)
        print(f"Epoch {epoch:4d}: Loss = {loss:.6f}")

print("\n" + "-" * 50)
print(f"\nLearned weights: {model.weights}")
print(f"True weights:    {true_weights}")
```

Understanding derivatives is one thing; applying them correctly in practice is another. Let's examine common mistakes and how to avoid them.
Always verify analytical gradients against numerical gradients during development. Compute the relative difference: $\frac{|\nabla_{analytical} - \nabla_{numerical}|}{|\nabla_{analytical}| + |\nabla_{numerical}|}$. This should be less than $10^{-7}$ for correctly implemented gradients.
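A sketch of that check, using made-up gradient values purely to exercise the formula:

```python
import numpy as np

def relative_difference(grad_analytical, grad_numerical):
    """Relative difference used for gradient checking:
    |g_a - g_n| / (|g_a| + |g_n|), computed with vector norms."""
    num = np.linalg.norm(grad_analytical - grad_numerical)
    den = np.linalg.norm(grad_analytical) + np.linalg.norm(grad_numerical)
    return num / den

# Hypothetical gradients for illustration only
g_analytical = np.array([0.31415927, -1.27323954, 2.71828183])
g_numerical  = g_analytical + 1e-9   # tiny discrepancy, as expected from a correct gradient

rel = relative_difference(g_analytical, g_numerical)
print(f"Relative difference: {rel:.2e}")   # well below 1e-7 → gradient likely correct
```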
Best Practices for Derivative Work:
Derive on Paper First: Before coding, derive gradients symbolically. This catches errors early and builds intuition.
Test Component-wise: For complex functions, test derivatives of each component before combining.
Use Automatic Differentiation: In production, use autodiff (PyTorch, TensorFlow, JAX) rather than manual gradients. But understand what autodiff is doing under the hood.
Monitor Gradient Magnitudes: During training, track gradient norms (see the sketch after this list). Exploding gradients (very large) or vanishing gradients (very small) indicate problems.
Verify at Multiple Points: Test gradients at several random points, not just convenient ones like $(0, 0)$ where terms might cancel.
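For the gradient-monitoring practice above, here is a minimal sketch; the thresholds are illustrative, not universal constants:

```python
import numpy as np

def check_gradient_health(gradient, low=1e-7, high=1e3):
    """Flag suspiciously small or large gradient norms (illustrative thresholds)."""
    norm = np.linalg.norm(gradient)
    if norm < low:
        status = "possible vanishing gradient"
    elif norm > high:
        status = "possible exploding gradient"
    else:
        status = "looks healthy"
    return norm, status

# Hypothetical gradient vectors for illustration
for g in [np.full(10, 1e-9), np.full(10, 0.05), np.full(10, 1e4)]:
    norm, status = check_gradient_health(g)
    print(f"‖grad‖ = {norm:.2e} → {status}")
```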
We've built a rigorous foundation in derivatives and partial derivatives. To consolidate: the derivative is a limit of difference quotients and measures an instantaneous rate of change; a handful of rules (power, product, quotient, chain) let us differentiate complicated functions compositionally; partial derivatives extend the idea to multivariate functions by holding all other variables fixed; second-order partials capture curvature; and in machine learning, all of this comes together as the gradient of a loss function with respect to model parameters.
What's Next: The Chain Rule Deep Dive
In the next page, we'll explore the chain rule in exhaustive detail. While we touched on it here, the chain rule deserves its own treatment because deep models are built as compositions of many functions, and backpropagation is nothing more than the chain rule applied systematically through those compositions.
The chain rule transforms our understanding from "derivatives of simple functions" to "derivatives through arbitrary computational graphs"—which is exactly what automatic differentiation exploits.
You now possess a rigorous understanding of derivatives and partial derivatives—the atoms of calculus that combine to form every optimization algorithm in machine learning. Next, we'll see how the chain rule acts as the algebra that combines these atoms into the molecules of gradient computation.