At its core, machine learning is about optimization—finding the best parameters that minimize prediction errors. But how do we mathematically describe "moving toward better"? How do we quantify the notion of improvement itself?
The answer lies in derivatives, the mathematical language of change. Every time a neural network updates its weights, every time a gradient descent step is taken, every time a loss function is minimized—derivatives are working behind the scenes and telling us the direction and magnitude of improvement.
This isn't merely calculus for calculus's sake. Understanding derivatives deeply will unlock your ability to reason precisely about rates of change, work confidently with partial derivatives of multivariate functions, and see exactly how gradients drive the optimization at the heart of every learning algorithm.
By the end of this page, you will have mastered the concept of derivatives as rates of change, understood partial derivatives for multivariate functions, and seen clearly how these concepts connect to the optimization process that is central to all machine learning algorithms.
Before we can apply derivatives to machine learning, we need a crystal-clear understanding of what a derivative actually is. While you may recall "the derivative is the slope of the tangent line," this intuition—though valuable—only scratches the surface.
Formal Definition:
The derivative of a function $f(x)$ at a point $a$ is defined as the limit:
$$f'(a) = \lim_{h \to 0} \frac{f(a + h) - f(a)}{h}$$
This represents the instantaneous rate of change of $f$ at the point $a$. Let's dissect each component: $h$ is a small perturbation of the input; $f(a + h) - f(a)$ is the resulting change in the output; their ratio is the average rate of change over the interval from $a$ to $a + h$; and the limit $h \to 0$ turns that average rate into an instantaneous one.
Geometrically, as $h \to 0$, the secant line between $(a, f(a))$ and $(a+h, f(a+h))$ approaches the tangent line at $a$. The derivative $f'(a)$ is the slope of this tangent line. This geometric intuition is powerful: the sign of the derivative tells us whether the function is increasing or decreasing at that point, and its magnitude tells us how quickly.
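To make the limit concrete, here is a minimal sketch (the function $f(x) = x^2$ and the point $a = 1$ are illustrative choices) showing the difference quotient closing in on the true slope $f'(1) = 2$ as $h$ shrinks:

```python
# Sketch: the difference quotient [f(a+h) - f(a)] / h approaches f'(a) as h → 0.
# Illustrative choices: f(x) = x², a = 1, so f'(1) = 2.
def f(x):
    return x ** 2

a = 1.0
for h in [1.0, 0.1, 0.01, 0.001, 0.0001]:
    quotient = (f(a + h) - f(a)) / h
    print(f"h = {h:<8} difference quotient = {quotient:.6f}")
# The printed values approach 2.0, the slope of the tangent line at a = 1.
```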
Alternative Notations:
Derivatives have several equivalent notations, each highlighting different aspects:
| Notation | Name | Usage Context |
|---|---|---|
| $f'(x)$ | Lagrange notation | General calculus |
| $\frac{df}{dx}$ | Leibniz notation | Emphasizes ratio of changes |
| $\frac{d}{dx}f$ | Operator notation | Treats differentiation as operator |
| $D_x f$ | Euler notation | Functional analysis |
| $\dot{f}$ | Newton notation | Physics, time derivatives |
In machine learning literature, Leibniz-style notation, usually in its partial-derivative form $\frac{\partial f}{\partial x}$, dominates because it makes the chain rule's structure explicit and clearly shows which variable we're differentiating with respect to.
Why Differentiability Matters:
For a function to be differentiable at a point, the limit must exist—meaning the function must be "smooth" at that point. Non-differentiable points occur at corners or kinks (such as $|x|$ at $x = 0$), cusps, jump discontinuities, and points with vertical tangent lines.
Machine Learning Connection:
This is why most activation functions in neural networks are designed to be differentiable everywhere—or at least have well-defined subgradients at non-differentiable points. The ReLU function $f(x) = \max(0, x)$ has a corner at $x = 0$, but we typically define its derivative as 0 at that point for practical implementation.
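To illustrate that convention, here is a small sketch (the function names are ours) of ReLU and the derivative most implementations use, with the value 0 assigned at the corner $x = 0$:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    # Not a true derivative at x = 0; by convention we return 0 there,
    # which is a valid subgradient and matches common practice.
    return (x > 0).astype(float)

xs = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("x:        ", xs)
print("ReLU(x):  ", relu(xs))
print("ReLU'(x): ", relu_derivative(xs))  # note the 0 assigned at x = 0
```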
```python
import numpy as np
import matplotlib.pyplot as plt

# Demonstrate derivative as limit of difference quotient
def numerical_derivative(f, x, h=1e-8):
    """
    Approximates f'(x) using the central difference formula.

    Central difference: [f(x+h) - f(x-h)] / (2h)
    is more accurate than forward difference: [f(x+h) - f(x)] / h
    """
    return (f(x + h) - f(x - h)) / (2 * h)

# Example 1: Polynomial f(x) = x^3
def f_cubic(x):
    return x ** 3

# The analytical derivative is f'(x) = 3x^2
def f_cubic_derivative_analytical(x):
    return 3 * x ** 2

# Test at x = 2
x_test = 2.0
numerical_deriv = numerical_derivative(f_cubic, x_test)
analytical_deriv = f_cubic_derivative_analytical(x_test)

print(f"For f(x) = x³ at x = {x_test}:")
print(f"  Numerical derivative:  {numerical_deriv:.10f}")
print(f"  Analytical derivative: {analytical_deriv:.10f}")
print(f"  Error: {abs(numerical_deriv - analytical_deriv):.2e}")

# Example 2: Exponential f(x) = e^x (its own derivative!)
def f_exp(x):
    return np.exp(x)

x_test = 1.0
print(f"\nFor f(x) = e^x at x = {x_test}:")
print(f"  Numerical derivative:  {numerical_derivative(f_exp, x_test):.10f}")
print(f"  Analytical derivative: {f_exp(x_test):.10f}")  # e^x is its own derivative

# Example 3: Sigmoid (crucial for ML)
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative_analytical(x):
    """
    The sigmoid derivative has a beautiful form:
    σ'(x) = σ(x) · (1 - σ(x))
    """
    s = sigmoid(x)
    return s * (1 - s)

x_test = 0.0
print(f"\nFor sigmoid at x = {x_test}:")
print(f"  Numerical derivative:  {numerical_derivative(sigmoid, x_test):.10f}")
print(f"  Analytical derivative: {sigmoid_derivative_analytical(x_test):.10f}")
```

The power of calculus lies in not having to compute limits from first principles every time. A set of differentiation rules allows us to find derivatives of complex functions by combining simpler pieces. These rules are not merely computational tools—they reveal the deep structure of how functions compose and interact.
The Foundational Rules:
| Rule | Formula | Intuition |
|---|---|---|
| Constant | $\frac{d}{dx}c = 0$ | Constants don't change; no rate of change |
| Power | $\frac{d}{dx}x^n = nx^{n-1}$ | Exponent drops down, power decreases |
| Sum | $\frac{d}{dx}[f + g] = f' + g'$ | Rate of sum = sum of rates |
| Difference | $\frac{d}{dx}[f - g] = f' - g'$ | Difference of rates |
| Constant Multiple | $\frac{d}{dx}[cf] = cf'$ | Constant factors pass through differentiation |
| Product | $\frac{d}{dx}[fg] = f'g + fg'$ | Each factor's contribution to change |
| Quotient | $\frac{d}{dx}\left[\frac{f}{g}\right] = \frac{f'g - fg'}{g^2}$ | Numerator-denominator interaction |
| Chain | $\frac{d}{dx}f(g(x)) = f'(g(x)) \cdot g'(x)$ | Composition rule: outer × inner derivative |
Deep Dive: The Product Rule
The product rule $(fg)' = f'g + fg'$ deserves careful attention because it appears constantly in ML contexts. Consider why this rule makes sense intuitively:
Imagine a rectangle with width $f(x)$ and height $g(x)$. As $x$ changes by a small amount $dx$, the width grows by roughly $f'(x)\,dx$ and the height by roughly $g'(x)\,dx$.
The change in area (approximately) comes from two strips: widening the rectangle contributes $f'(x)\,dx \cdot g(x)$, and heightening it contributes $f(x) \cdot g'(x)\,dx$; the tiny corner piece $f'g'\,dx^2$ is negligible.
Total: $(f'g + fg')\,dx$, giving us the product rule.
The product rule appears when computing gradients of attention mechanisms, gating mechanisms (like LSTMs), and any architecture that multiplies learned representations together. If $a(x)$ is an attention weight and $v(x)$ is a value vector, the gradient through their product $(av)' = a'v + av'$ involves both how attention changes and how values change.
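The rectangle argument can also be checked numerically: over a small step $dx$, the exact change in a product $f(x)g(x)$ matches the two product-rule terms, while the cross term of order $dx^2$ is orders of magnitude smaller. A minimal sketch, with $f$ and $g$ chosen purely for illustration:

```python
import numpy as np

# Illustrative choices: f(x) = x², g(x) = sin(x)
f  = lambda x: x**2
g  = lambda x: np.sin(x)
fp = lambda x: 2 * x        # f'
gp = lambda x: np.cos(x)    # g'

x, dx = 1.0, 1e-4
exact_change = f(x + dx) * g(x + dx) - f(x) * g(x)
first_order  = (fp(x) * g(x) + f(x) * gp(x)) * dx   # the product-rule terms
cross_term   = fp(x) * gp(x) * dx**2                # vanishes much faster than dx

print(f"Exact change:       {exact_change:.10f}")
print(f"(f'g + fg') dx:     {first_order:.10f}")
print(f"Cross term f'g'dx²: {cross_term:.2e}")      # negligible, as claimed
```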
Essential Function Derivatives:
Beyond the rules, you should internalize these fundamental derivatives:
| Function | Derivative | ML Application |
|---|---|---|
| $e^x$ | $e^x$ | Softmax, exponential family |
| $\ln(x)$ | $\frac{1}{x}$ | Log-likelihood, cross-entropy |
| $\sin(x)$ | $\cos(x)$ | Positional encodings (transformers) |
| $\cos(x)$ | $-\sin(x)$ | Positional encodings (transformers) |
| $a^x$ | $a^x \ln(a)$ | Learning rate schedules |
| $\log_a(x)$ | $\frac{1}{x \ln(a)}$ | Information theory |
| $\tanh(x)$ | $1 - \tanh^2(x) = \text{sech}^2(x)$ | Activation function |
| $\sigma(x)$ | $\sigma(x)(1-\sigma(x))$ | Sigmoid activation |
```python
import numpy as np

# Demonstrating each rule with numerical verification

def verify_derivative(name, f, f_prime, x, h=1e-7):
    """Verify analytical derivative against numerical approximation."""
    numerical = (f(x + h) - f(x - h)) / (2 * h)
    analytical = f_prime(x)
    error = abs(numerical - analytical)
    print(f"{name}:")
    print(f"  Numerical:  {numerical:.8f}")
    print(f"  Analytical: {analytical:.8f}")
    print(f"  Error: {error:.2e}\n")

# 1. Power Rule: d/dx[x^4] = 4x^3
verify_derivative(
    "Power Rule: d/dx[x⁴] = 4x³",
    f=lambda x: x**4,
    f_prime=lambda x: 4 * x**3,
    x=2.0
)

# 2. Product Rule: d/dx[x² · sin(x)] = 2x·sin(x) + x²·cos(x)
verify_derivative(
    "Product Rule: d/dx[x² · sin(x)]",
    f=lambda x: x**2 * np.sin(x),
    f_prime=lambda x: 2*x*np.sin(x) + x**2*np.cos(x),
    x=1.5
)

# 3. Quotient Rule: d/dx[sin(x)/x] = [cos(x)·x - sin(x)]/x²
verify_derivative(
    "Quotient Rule: d/dx[sin(x)/x]",
    f=lambda x: np.sin(x) / x,
    f_prime=lambda x: (np.cos(x)*x - np.sin(x)) / x**2,
    x=0.5
)

# 4. Chain Rule: d/dx[e^(x²)] = e^(x²) · 2x
verify_derivative(
    "Chain Rule: d/dx[e^(x²)]",
    f=lambda x: np.exp(x**2),
    f_prime=lambda x: np.exp(x**2) * 2*x,
    x=1.0
)

# 5. ML-relevant: Log-loss derivative component
# d/dx[-log(sigmoid(x))] = -(1-sigmoid(x)) = sigmoid(x) - 1
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

verify_derivative(
    "Log-Loss Gradient: d/dx[-log(σ(x))]",
    f=lambda x: -np.log(sigmoid(x)),
    f_prime=lambda x: sigmoid(x) - 1,  # or -(1 - sigmoid(x))
    x=0.5
)
```

In machine learning, we rarely work with functions of a single variable. A neural network's loss function depends on millions of parameters; a regression model depends on multiple features and coefficients. This necessitates extending our derivative concept to multivariate functions.
The Fundamental Idea:
Given a function $f(x_1, x_2, \ldots, x_n)$ of multiple variables, the partial derivative with respect to $x_i$ measures how $f$ changes when only $x_i$ changes, while all other variables are held constant.
$$\frac{\partial f}{\partial x_i} = \lim_{h \to 0} \frac{f(x_1, \ldots, x_i + h, \ldots, x_n) - f(x_1, \ldots, x_i, \ldots, x_n)}{h}$$
Notation: the partial derivative of $f$ with respect to $x_i$ is most often written $\frac{\partial f}{\partial x_i}$, with $f_{x_i}$ and $\partial_{x_i} f$ as common shorthands.
To compute a partial derivative with respect to variable $x_i$, treat all other variables as constants and apply ordinary differentiation rules. For example, if $f(x, y) = x^2 y + 3xy^2$, then $\frac{\partial f}{\partial x} = 2xy + 3y^2$ (treating $y$ as constant) and $\frac{\partial f}{\partial y} = x^2 + 6xy$ (treating $x$ as constant).
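As a quick numerical check of that example, here is a small sketch (the finite-difference helpers are ours, defined only for this illustration):

```python
f = lambda x, y: x**2 * y + 3 * x * y**2

# Central differences in each variable, holding the other fixed
def partial_x(f, x, y, h=1e-6):
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

def partial_y(f, x, y, h=1e-6):
    return (f(x, y + h) - f(x, y - h)) / (2 * h)

x, y = 2.0, 3.0
print("∂f/∂x:", partial_x(f, x, y), "vs analytical 2xy + 3y² =", 2*x*y + 3*y**2)
print("∂f/∂y:", partial_y(f, x, y), "vs analytical x² + 6xy =", x**2 + 6*x*y)
```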
Concrete Example: Two-Variable Function
Consider the function $f(x, y) = x^3 + x^2y - 2y^2 + 4xy$. Let's compute both partial derivatives:
Computing $\frac{\partial f}{\partial x}$ (treat $y$ as constant): term by term, $x^3 \to 3x^2$, $x^2y \to 2xy$, $-2y^2 \to 0$, and $4xy \to 4y$.
Result: $\frac{\partial f}{\partial x} = 3x^2 + 2xy + 4y$
Computing $\frac{\partial f}{\partial y}$ (treat $x$ as constant): term by term, $x^3 \to 0$, $x^2y \to x^2$, $-2y^2 \to -4y$, and $4xy \to 4x$.
Result: $\frac{\partial f}{\partial y} = x^2 - 4y + 4x$
Geometric Interpretation:
For a function $f(x, y)$: $\frac{\partial f}{\partial x}$ is the slope of the surface along a slice where $y$ is held fixed, and $\frac{\partial f}{\partial y}$ is the slope along a slice where $x$ is held fixed.
Think of standing on a hilly terrain: $\frac{\partial f}{\partial x}$ is how steeply you climb if you walk due east (keeping your north-south position fixed), and $\frac{\partial f}{\partial y}$ is how steeply you climb if you walk due north.
The gradient (coming next section) tells you the steepest uphill direction overall.
```python
import numpy as np

def partial_derivative(f, var_idx, point, h=1e-7):
    """
    Compute partial derivative of f with respect to variable at var_idx.

    Args:
        f: Function that takes a numpy array of coordinates
        var_idx: Index of variable to differentiate with respect to
        point: numpy array representing the point at which to evaluate
        h: Small step size for numerical differentiation

    Returns:
        Approximate partial derivative at the given point
    """
    point = np.array(point, dtype=float)

    # Create points for central difference
    point_plus = point.copy()
    point_minus = point.copy()
    point_plus[var_idx] += h
    point_minus[var_idx] -= h

    return (f(point_plus) - f(point_minus)) / (2 * h)

# Example 1: f(x, y) = x³ + x²y - 2y² + 4xy
def f_example(p):
    x, y = p[0], p[1]
    return x**3 + x**2 * y - 2 * y**2 + 4 * x * y

def df_dx_analytical(p):
    x, y = p[0], p[1]
    return 3 * x**2 + 2 * x * y + 4 * y

def df_dy_analytical(p):
    x, y = p[0], p[1]
    return x**2 - 4 * y + 4 * x

# Test at point (2, 3)
test_point = np.array([2.0, 3.0])

print("Function: f(x, y) = x³ + x²y - 2y² + 4xy")
print(f"Point: ({test_point[0]}, {test_point[1]})\n")

# Partial with respect to x
partial_x_numerical = partial_derivative(f_example, 0, test_point)
partial_x_analytical = df_dx_analytical(test_point)
print("∂f/∂x:")
print(f"  Numerical:  {partial_x_numerical:.8f}")
print(f"  Analytical: {partial_x_analytical:.8f}")
print(f"  Error: {abs(partial_x_numerical - partial_x_analytical):.2e}\n")

# Partial with respect to y
partial_y_numerical = partial_derivative(f_example, 1, test_point)
partial_y_analytical = df_dy_analytical(test_point)
print("∂f/∂y:")
print(f"  Numerical:  {partial_y_numerical:.8f}")
print(f"  Analytical: {partial_y_analytical:.8f}")
print(f"  Error: {abs(partial_y_numerical - partial_y_analytical):.2e}\n")

# Example 2: ML-relevant - MSE Loss
# L(w₀, w₁) = (1/n) Σᵢ (yᵢ - (w₀ + w₁xᵢ))²
# For simplicity, consider single data point: L = (y - w₀ - w₁x)²

def mse_single_point(params):
    """MSE for single point (x=2, y=5): L = (5 - w₀ - 2w₁)²"""
    w0, w1 = params[0], params[1]
    x, y = 2.0, 5.0
    return (y - w0 - w1 * x) ** 2

def mse_dw0_analytical(params):
    """∂L/∂w₀ = -2(y - w₀ - w₁x)"""
    w0, w1 = params[0], params[1]
    x, y = 2.0, 5.0
    return -2 * (y - w0 - w1 * x)

def mse_dw1_analytical(params):
    """∂L/∂w₁ = -2x(y - w₀ - w₁x)"""
    w0, w1 = params[0], params[1]
    x, y = 2.0, 5.0
    return -2 * x * (y - w0 - w1 * x)

# Test at w₀=1, w₁=1 (prediction = 1 + 2 = 3, actual = 5, error = 2)
params = np.array([1.0, 1.0])

print("\nMSE Loss: L = (5 - w₀ - 2w₁)² for single point (x=2, y=5)")
print(f"Parameters: w₀={params[0]}, w₁={params[1]}\n")

partial_w0_num = partial_derivative(mse_single_point, 0, params)
partial_w0_ana = mse_dw0_analytical(params)
print("∂L/∂w₀:")
print(f"  Numerical:  {partial_w0_num:.8f}")
print(f"  Analytical: {partial_w0_ana:.8f}")

partial_w1_num = partial_derivative(mse_single_point, 1, params)
partial_w1_ana = mse_dw1_analytical(params)
print("\n∂L/∂w₁:")
print(f"  Numerical:  {partial_w1_num:.8f}")
print(f"  Analytical: {partial_w1_ana:.8f}")
```

Just as we can take the derivative of a derivative (second derivative) for single-variable functions, we can compute second-order partial derivatives and mixed partial derivatives for multivariate functions. These are critical for understanding curvature—which determines convergence rates and optimal step sizes in optimization.
Types of Second-Order Partials:
For a function $f(x, y)$, we have four second-order partial derivatives:
$\frac{\partial^2 f}{\partial x^2}$: Differentiate with respect to $x$ twice (how the $x$-slope changes in the $x$-direction)
$\frac{\partial^2 f}{\partial y^2}$: Differentiate with respect to $y$ twice (how the $y$-slope changes in the $y$-direction)
$\frac{\partial^2 f}{\partial x \partial y}$: Differentiate first with respect to $y$, then with respect to $x$
$\frac{\partial^2 f}{\partial y \partial x}$: Differentiate first with respect to $x$, then with respect to $y$
For functions with continuous second partial derivatives, the order of differentiation doesn't matter: $\frac{\partial^2 f}{\partial x \partial y} = \frac{\partial^2 f}{\partial y \partial x}$. This symmetry is crucial—it means we only need to compute $n(n+1)/2$ second derivatives for a function of $n$ variables instead of $n^2$.
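This symmetry can also be confirmed symbolically. Here is a sketch using SymPy (not used elsewhere on this page), applied to the function of the worked example below:

```python
import sympy as sp

x, y = sp.symbols('x y')
f = x**3 * y**2 + 2*x*y**3 - x**2

f_xy = sp.diff(f, x, y)   # differentiate with respect to x, then y
f_yx = sp.diff(f, y, x)   # differentiate with respect to y, then x

print(f_xy)                       # 6*x**2*y + 6*y**2
print(f_yx)                       # 6*x**2*y + 6*y**2
print(sp.simplify(f_xy - f_yx))   # 0 → the mixed partials agree (Schwarz's theorem)
```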
Worked Example:
Let $f(x, y) = x^3y^2 + 2xy^3 - x^2$
First-order partials: $\frac{\partial f}{\partial x} = 3x^2y^2 + 2y^3 - 2x$ and $\frac{\partial f}{\partial y} = 2x^3y + 6xy^2$
Second-order partials: $\frac{\partial^2 f}{\partial x^2} = 6xy^2 - 2$, $\frac{\partial^2 f}{\partial y^2} = 2x^3 + 12xy$, $\frac{\partial^2 f}{\partial x \partial y} = 6x^2y + 6y^2$, and $\frac{\partial^2 f}{\partial y \partial x} = 6x^2y + 6y^2$
Note: $\frac{\partial^2 f}{\partial x \partial y} = \frac{\partial^2 f}{\partial y \partial x}$, confirming Schwarz's theorem.
Geometric Meaning of Second Derivatives:
Second-order partial derivatives describe curvature: a positive $\frac{\partial^2 f}{\partial x^2}$ means the surface curves upward (is convex) along the $x$-direction, a negative value means it curves downward, and the mixed partial $\frac{\partial^2 f}{\partial x \partial y}$ measures how the slope in one direction changes as you move in the other direction, the "twist" of the surface.
Why This Matters for ML:
Second-order information tells us how curved the loss surface is, which governs how large an optimization step we can safely take; it is what Newton-type methods use to rescale gradient steps; and it distinguishes minima, maxima, and saddle points at locations where the first derivatives vanish (see the one-dimensional sketch below).
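As a one-dimensional illustration of how curvature sets the step size, here is a sketch of a single Newton step $x_{\text{new}} = x - f'(x)/f''(x)$ on a quadratic chosen purely for illustration:

```python
# Sketch: one Newton step on f(x) = 3(x - 2)², whose minimum is at x = 2.
# The second derivative (curvature) f''(x) = 6 rescales the gradient step.
f_prime        = lambda x: 6 * (x - 2)   # first derivative
f_double_prime = lambda x: 6.0           # second derivative (constant curvature)

x = 5.0
x_new = x - f_prime(x) / f_double_prime(x)
print(x_new)  # 2.0 — lands exactly on the minimizer because the curvature is known
```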
```python
import numpy as np

def second_partial(f, i, j, point, h=1e-5):
    """
    Compute second partial derivative ∂²f/(∂xᵢ ∂xⱼ)

    Uses central difference for both derivatives.
    """
    n = len(point)
    point = np.array(point, dtype=float)

    # Create offset vectors
    def offset(idx, delta):
        v = np.zeros(n)
        v[idx] = delta
        return v

    # ∂²f/(∂xᵢ ∂xⱼ) ≈ [f(x+hᵢ+hⱼ) - f(x+hᵢ-hⱼ) - f(x-hᵢ+hⱼ) + f(x-hᵢ-hⱼ)] / (4h²)
    pp = point + offset(i, h) + offset(j, h)
    pm = point + offset(i, h) - offset(j, h)
    mp = point - offset(i, h) + offset(j, h)
    mm = point - offset(i, h) - offset(j, h)

    return (f(pp) - f(pm) - f(mp) + f(mm)) / (4 * h * h)

# Example: f(x, y) = x³y² + 2xy³ - x²
def f(p):
    x, y = p[0], p[1]
    return x**3 * y**2 + 2*x*y**3 - x**2

# Analytical second derivatives
def f_xx(p):
    x, y = p[0], p[1]
    return 6*x*y**2 - 2

def f_yy(p):
    x, y = p[0], p[1]
    return 2*x**3 + 12*x*y

def f_xy(p):
    x, y = p[0], p[1]
    return 6*x**2*y + 6*y**2

# Test at point (1, 2)
point = np.array([1.0, 2.0])
print("Function: f(x, y) = x³y² + 2xy³ - x²")
print(f"Point: ({point[0]}, {point[1]})\n")

print("Second-Order Partial Derivatives:\n")

# ∂²f/∂x²
fxx_num = second_partial(f, 0, 0, point)
fxx_ana = f_xx(point)
print("∂²f/∂x²:")
print(f"  Numerical:  {fxx_num:.6f}")
print(f"  Analytical: {fxx_ana:.6f}\n")

# ∂²f/∂y²
fyy_num = second_partial(f, 1, 1, point)
fyy_ana = f_yy(point)
print("∂²f/∂y²:")
print(f"  Numerical:  {fyy_num:.6f}")
print(f"  Analytical: {fyy_ana:.6f}\n")

# ∂²f/(∂x∂y)
fxy_num = second_partial(f, 0, 1, point)
fxy_ana = f_xy(point)
print("∂²f/(∂x∂y):")
print(f"  Numerical:  {fxy_num:.6f}")
print(f"  Analytical: {fxy_ana:.6f}\n")

# ∂²f/(∂y∂x) - should equal ∂²f/(∂x∂y) by Schwarz's theorem
fyx_num = second_partial(f, 1, 0, point)
print(f"∂²f/(∂y∂x): {fyx_num:.6f} (should equal ∂²f/(∂x∂y))\n")

# Curvature interpretation
print("Curvature Interpretation:")
print(f"  f_xx = {fxx_ana:.2f}: Curvature in x-direction")
print(f"  f_yy = {fyy_ana:.2f}: Curvature in y-direction")
if fxx_ana > 0:
    print("  → Surface curves upward in x-direction at this point")
if fyy_ana > 0:
    print("  → Surface curves upward in y-direction at this point")
```

Now let's connect everything to the practice of machine learning. Every modern ML model learns by computing derivatives—specifically, derivatives of a loss function with respect to model parameters.
The Universal Pattern:
1. Define a loss function that measures how wrong the model's predictions are.
2. Compute the derivatives (the gradient) of that loss with respect to every model parameter.
3. Update each parameter a small step in the direction that decreases the loss.
4. Repeat until the loss stops improving.
This simple pattern underlies everything from linear regression to GPT-4.
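Here is a minimal sketch of that loop on a toy one-parameter loss; the loss function, starting point, and learning rate are illustrative choices rather than any particular model:

```python
# Sketch of the universal pattern on a toy loss L(w) = (w - 4)².
loss      = lambda w: (w - 4) ** 2
grad_loss = lambda w: 2 * (w - 4)        # dL/dw, derived analytically

w, learning_rate = 0.0, 0.1
for step in range(50):
    g = grad_loss(w)                     # steps 1-2: evaluate the derivative of the loss at w
    w = w - learning_rate * g            # step 3: move against the gradient
print(f"w = {w:.4f}, loss = {loss(w):.6f}")  # w approaches 4.0 and the loss approaches 0
```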
Example: Linear Regression Gradient Derivation
Let's derive the gradient for linear regression from first principles. Our model is: $$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_d x_d = \mathbf{w}^T \mathbf{x}$$
Our loss function (MSE) for a single data point $(\mathbf{x}, y)$: $$\mathcal{L} = (y - \mathbf{w}^T \mathbf{x})^2$$
Computing $\frac{\partial \mathcal{L}}{\partial w_j}$:
Using the chain rule: $$\frac{\partial \mathcal{L}}{\partial w_j} = 2(y - \mathbf{w}^T \mathbf{x}) \cdot \frac{\partial}{\partial w_j}(y - \mathbf{w}^T \mathbf{x})$$
Now, $\frac{\partial}{\partial w_j}(y - \mathbf{w}^T \mathbf{x}) = -x_j$ (since $\mathbf{w}^T \mathbf{x} = w_0 x_0 + w_1 x_1 + \cdots$ and only the $w_j$ term contains $w_j$)
Therefore: $$\frac{\partial \mathcal{L}}{\partial w_j} = -2(y - \mathbf{w}^T \mathbf{x}) \cdot x_j = -2 \cdot \text{error} \cdot x_j$$
Intuition: The gradient is proportional to the prediction error and the feature value. Large errors on features with large values produce large gradients—which makes sense, as those features contributed more to the error.
```python
import numpy as np

class LinearRegressionFromScratch:
    """
    Linear Regression with explicit gradient computation.

    This implementation shows exactly how derivatives drive learning.
    """

    def __init__(self, n_features):
        # Initialize weights randomly
        self.weights = np.random.randn(n_features + 1) * 0.01
        # weights[0] is bias, weights[1:] are feature weights

    def predict(self, X):
        """Compute predictions: ŷ = Xw (with bias column added)"""
        X_with_bias = np.column_stack([np.ones(len(X)), X])
        return X_with_bias @ self.weights

    def compute_loss(self, X, y):
        """MSE Loss: (1/n) * Σ(yᵢ - ŷᵢ)²"""
        predictions = self.predict(X)
        return np.mean((y - predictions) ** 2)

    def compute_gradient(self, X, y):
        """
        Compute ∂L/∂w for all weights.

        Derivation:
            L = (1/n) Σ(yᵢ - wᵀxᵢ)²
            ∂L/∂wⱼ = (1/n) Σ 2(yᵢ - wᵀxᵢ)(-xᵢⱼ)
                   = (-2/n) Σ (yᵢ - ŷᵢ)xᵢⱼ
                   = (-2/n) Xᵀ(y - ŷ)
        """
        n = len(y)
        X_with_bias = np.column_stack([np.ones(n), X])
        predictions = X_with_bias @ self.weights
        errors = y - predictions  # Shape: (n,)

        # Gradient: (-2/n) * X^T * errors
        # Note: This is the analytical gradient derived from calculus
        gradient = (-2 / n) * X_with_bias.T @ errors
        return gradient

    def numerical_gradient(self, X, y, h=1e-7):
        """
        Compute gradient numerically for verification.

        This should match compute_gradient() if our derivation is correct.
        """
        gradient = np.zeros_like(self.weights)

        for i in range(len(self.weights)):
            # Compute f(w + h eᵢ)
            self.weights[i] += h
            loss_plus = self.compute_loss(X, y)

            # Compute f(w - h eᵢ)
            self.weights[i] -= 2 * h
            loss_minus = self.compute_loss(X, y)

            # Central difference
            gradient[i] = (loss_plus - loss_minus) / (2 * h)

            # Restore weight
            self.weights[i] += h

        return gradient

    def train_step(self, X, y, learning_rate=0.01):
        """Single gradient descent step."""
        gradient = self.compute_gradient(X, y)
        self.weights -= learning_rate * gradient
        return self.compute_loss(X, y)

# Demonstration
np.random.seed(42)

# Generate synthetic data: y = 2 + 3x₁ - x₂ + noise
n_samples = 100
X = np.random.randn(n_samples, 2)
true_weights = np.array([2.0, 3.0, -1.0])  # [bias, w1, w2]
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * np.random.randn(n_samples)

# Create model
model = LinearRegressionFromScratch(n_features=2)

print("Gradient Verification:")
print("=" * 50)

analytical_grad = model.compute_gradient(X, y)
numerical_grad = model.numerical_gradient(X, y)

print("\n Analytical vs Numerical Gradients:\n")
for i, (a, n) in enumerate(zip(analytical_grad, numerical_grad)):
    diff = abs(a - n)
    status = "✓" if diff < 1e-5 else "✗"
    print(f"  ∂L/∂w[{i}]: Analytical={a:+.8f}, Numerical={n:+.8f}, Error={diff:.2e} {status}")

print("\n" + "=" * 50)
print("\nTraining Progress:")
print("-" * 50)

# Train for several epochs
for epoch in range(0, 101, 20):
    loss = model.compute_loss(X, y)
    if epoch == 0:
        print(f"Epoch {epoch:4d}: Loss = {loss:.6f}")
    for _ in range(20):
        model.train_step(X, y, learning_rate=0.1)
    if epoch > 0:
        loss = model.compute_loss(X, y)
        print(f"Epoch {epoch:4d}: Loss = {loss:.6f}")

print("\n" + "-" * 50)
print(f"\nLearned weights: {model.weights}")
print(f"True weights:    {true_weights}")
```

Understanding derivatives is one thing; applying them correctly in practice is another. Let's examine common mistakes and how to avoid them.
Always verify analytical gradients against numerical gradients during development. Compute the relative difference: $\frac{|\nabla_{analytical} - \nabla_{numerical}|}{|\nabla_{analytical}| + |\nabla_{numerical}|}$. This should be less than $10^{-7}$ for correctly implemented gradients.
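A sketch of that check, using made-up gradient values purely to exercise the formula:

```python
import numpy as np

def relative_difference(grad_analytical, grad_numerical):
    """Relative difference used for gradient checking:
    |g_a - g_n| / (|g_a| + |g_n|), computed with vector norms."""
    num = np.linalg.norm(grad_analytical - grad_numerical)
    den = np.linalg.norm(grad_analytical) + np.linalg.norm(grad_numerical)
    return num / den

# Hypothetical gradients for illustration only
g_analytical = np.array([0.31415927, -1.27323954, 2.71828183])
g_numerical  = g_analytical + 1e-9   # tiny discrepancy, as expected from a correct gradient

rel = relative_difference(g_analytical, g_numerical)
print(f"Relative difference: {rel:.2e}")   # well below 1e-7 → gradient likely correct
```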
Best Practices for Derivative Work:
Derive on Paper First: Before coding, derive gradients symbolically. This catches errors early and builds intuition.
Test Component-wise: For complex functions, test derivatives of each component before combining.
Use Automatic Differentiation: In production, use autodiff (PyTorch, TensorFlow, JAX) rather than manual gradients. But understand what autodiff is doing under the hood.
Monitor Gradient Magnitudes: During training, track gradient norms (see the sketch after this list). Exploding gradients (very large) or vanishing gradients (very small) indicate problems.
Verify at Multiple Points: Test gradients at several random points, not just convenient ones like $(0, 0)$ where terms might cancel.
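For the gradient-monitoring practice above, here is a minimal sketch; the thresholds are illustrative, not universal constants:

```python
import numpy as np

def check_gradient_health(gradient, low=1e-7, high=1e3):
    """Flag suspiciously small or large gradient norms (illustrative thresholds)."""
    norm = np.linalg.norm(gradient)
    if norm < low:
        status = "possible vanishing gradient"
    elif norm > high:
        status = "possible exploding gradient"
    else:
        status = "looks healthy"
    return norm, status

# Hypothetical gradient vectors for illustration
for g in [np.full(10, 1e-9), np.full(10, 0.05), np.full(10, 1e4)]:
    norm, status = check_gradient_health(g)
    print(f"‖grad‖ = {norm:.2e} → {status}")
```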
We've built a rigorous foundation in derivatives and partial derivatives. To consolidate: the derivative is a limit of difference quotients and measures an instantaneous rate of change; a handful of rules (power, product, quotient, chain) let us differentiate complicated functions compositionally; partial derivatives extend the idea to multivariate functions by holding all other variables fixed; second-order partials capture curvature; and in machine learning, all of this comes together as the gradient of a loss function with respect to model parameters.
What's Next: The Chain Rule Deep Dive
In the next page, we'll explore the chain rule in exhaustive detail. While we touched on it here, the chain rule deserves its own treatment because deep models are built as compositions of many functions, and backpropagation is nothing more than the chain rule applied systematically through those compositions.
The chain rule transforms our understanding from "derivatives of simple functions" to "derivatives through arbitrary computational graphs"—which is exactly what automatic differentiation exploits.
You now possess a rigorous understanding of derivatives and partial derivatives—the atoms of calculus that combine to form every optimization algorithm in machine learning. Next, we'll see how the chain rule acts as the algebra that combines these atoms into the molecules of gradient computation.