We've learned how partial derivatives measure rates of change in individual directions. But in optimization, we need something more powerful: a single object that captures all directional information simultaneously.
Enter the gradient—a vector that combines all partial derivatives and, remarkably, always points in the direction of steepest increase of the function. This property makes the gradient the essential tool for optimization: to minimize, we simply move opposite to the gradient.
By the end of this page, you'll understand the gradient as a vector, its geometric meaning as steepest ascent direction, and how it drives gradient descent—the workhorse of machine learning optimization.
Formal Definition:
For a scalar-valued function $f: \mathbb{R}^n \to \mathbb{R}$, the gradient is the vector of all partial derivatives:
$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$
Notation Variants: The gradient is written variously as $\nabla f$, $\operatorname{grad} f$, $\frac{\partial f}{\partial \mathbf{x}}$, or $\nabla_{\mathbf{x}} f$ when the variable of differentiation needs emphasis.
Example:
For $f(x, y, z) = x^2 + 3xy - z^2 + 2z$:
$$\nabla f = \begin{bmatrix} 2x + 3y \\ 3x \\ -2z + 2 \end{bmatrix}$$
At the point $(1, 2, 3)$: $\nabla f = [8, 3, -4]^T$
ML literature sometimes uses row vectors for gradients (following the Jacobian convention). We use column vectors here. Always check the convention in context—what matters is consistency in your computations.
```python
import numpy as np

def compute_gradient(f, point, h=1e-7):
    """Compute gradient numerically using central differences."""
    n = len(point)
    gradient = np.zeros(n)
    for i in range(n):
        point_plus = point.copy()
        point_minus = point.copy()
        point_plus[i] += h
        point_minus[i] -= h
        gradient[i] = (f(point_plus) - f(point_minus)) / (2 * h)
    return gradient

# Example: f(x, y, z) = x² + 3xy - z² + 2z
def f(p):
    x, y, z = p[0], p[1], p[2]
    return x**2 + 3*x*y - z**2 + 2*z

def gradient_analytical(p):
    x, y, z = p[0], p[1], p[2]
    return np.array([2*x + 3*y, 3*x, -2*z + 2])

point = np.array([1.0, 2.0, 3.0])
numerical_grad = compute_gradient(f, point)
analytical_grad = gradient_analytical(point)

print("f(x,y,z) = x² + 3xy - z² + 2z")
print(f"Point: {point}")
print(f"Numerical gradient: {numerical_grad}")
print(f"Analytical gradient: {analytical_grad}")
print(f"Max error: {np.max(np.abs(numerical_grad - analytical_grad)):.2e}")
```

The Fundamental Property:
The gradient $\nabla f(\mathbf{x})$ points in the direction of steepest increase of $f$ at the point $\mathbf{x}$. Its magnitude $|\nabla f|$ equals the rate of increase in that direction.
Why This Is True:
The directional derivative in the direction of a unit vector $\mathbf{u}$ is: $$D_{\mathbf{u}} f = \nabla f \cdot \mathbf{u} = |\nabla f| \cos(\theta)$$
where $\theta$ is the angle between $\nabla f$ and $\mathbf{u}$. Since $\cos(\theta)$ is maximized at $\theta = 0$, the directional derivative is largest when $\mathbf{u}$ points along $\nabla f$.
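As a quick numerical check of this identity (using $f(x, y) = x^2 + 2y^2$ as an illustrative choice, not a function from the text above): scanning unit vectors $\mathbf{u}(\theta) = [\cos\theta, \sin\theta]$ shows the directional derivative peaking when $\mathbf{u}$ aligns with $\nabla f$, with peak value $|\nabla f|$.

```python
import numpy as np

# Illustrative example (assumed): f(x, y) = x² + 2y², ∇f = [2x, 4y]
def f(p):
    return p[0]**2 + 2*p[1]**2

def grad_f(p):
    return np.array([2*p[0], 4*p[1]])

def directional_derivative(f, point, u, h=1e-7):
    """Central-difference estimate of D_u f at point (u must be a unit vector)."""
    return (f(point + h*u) - f(point - h*u)) / (2*h)

point = np.array([1.0, 1.0])
g = grad_f(point)                     # [2, 4]
g_unit = g / np.linalg.norm(g)

# Scan directions u(θ) = [cos θ, sin θ] and record D_u f for each
thetas = np.linspace(0, 2*np.pi, 1000)
derivs = [directional_derivative(f, point, np.array([np.cos(t), np.sin(t)]))
          for t in thetas]

t_best = thetas[np.argmax(derivs)]
best = np.array([np.cos(t_best), np.sin(t_best)])
print("Gradient direction:    ", g_unit)
print("Best scanned direction:", best)
print("Max directional derivative:", max(derivs), "≈ |∇f| =", np.linalg.norm(g))
```

The maximum over all scanned directions matches $|\nabla f| = \sqrt{20}$, and the maximizing direction matches the normalized gradient.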
Consequence for Optimization:
- To maximize $f$: move in the direction $\nabla f$
- To minimize $f$: move in the direction $-\nabla f$ (gradient descent!)
Gradient and Level Sets:
For a function $f(x, y)$, level sets are curves where $f = c$ (constant). A beautiful property: the gradient is always perpendicular to level sets.
Intuitively: along a level set, $f$ doesn't change. Moving perpendicular to it gives the fastest change. This is precisely the gradient direction.
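A small sketch of this perpendicularity (the choice $f(x, y) = x^2 + y^2$ is an illustrative assumption): its level set $f = 4$ is the circle of radius 2, parameterized as $(2\cos t, 2\sin t)$ with unit tangent $(-\sin t, \cos t)$, and the gradient dotted with the tangent vanishes everywhere on the curve.

```python
import numpy as np

# Illustrative example (assumed): f(x, y) = x² + y², ∇f = [2x, 2y]
def grad_f(x, y):
    return np.array([2*x, 2*y])

dots = []
for t in np.linspace(0, 2*np.pi, 8, endpoint=False):
    point = (2*np.cos(t), 2*np.sin(t))           # a point on the level set f = 4
    tangent = np.array([-np.sin(t), np.cos(t)])  # unit tangent along the circle
    dots.append(np.dot(grad_f(*point), tangent))

print("∇f · tangent along the level set:", np.round(dots, 12))  # all ≈ 0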
Physical Analogy:
Imagine a topographic map where $f(x, y)$ is elevation: the contour lines are level sets, and the gradient points straight uphill, perpendicular to the contours.
Picture a hill in 3D. At any point, if you place a ball, it rolls downhill—opposite to the gradient. The steeper the hill, the larger the gradient magnitude, the faster the ball accelerates. Gradient descent mimics this natural process.
The Algorithm:
Gradient descent uses the gradient to iteratively minimize a function:
$$\mathbf{x}_{t+1} = \mathbf{x}_t - \eta \nabla f(\mathbf{x}_t)$$
where $\eta > 0$ is the learning rate.
Why It Works: Each step moves opposite to the gradient, the direction of steepest decrease, so for sufficiently small $\eta$ the first-order expansion gives $f(\mathbf{x}_{t+1}) \approx f(\mathbf{x}_t) - \eta |\nabla f(\mathbf{x}_t)|^2 \le f(\mathbf{x}_t)$.
In Machine Learning: $f$ is the loss and $\mathbf{x}_t$ is the parameter vector; the choice of learning rate $\eta$ is critical:
| Learning Rate | Behavior | Result |
|---|---|---|
| Too small ($\eta \ll 1$) | Tiny steps, slow progress | Converges but very slowly |
| Just right | Balanced steps | Efficient convergence |
| Too large ($\eta \gg 1$) | Overshoots minimum | Oscillates or diverges |
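These three regimes can be seen on the simplest possible case (an assumed toy example, $f(x) = x^2$ with $f'(x) = 2x$), where the update reduces to $x \leftarrow (1 - 2\eta)\,x$ and convergence requires $|1 - 2\eta| < 1$:

```python
# Toy example (assumed): minimize f(x) = x² from x₀ = 1
# Update: x ← x - η·2x = (1 - 2η)·x
def run(eta, x0=1.0, steps=20):
    x = x0
    for _ in range(steps):
        x = x - eta * 2 * x
    return x

print(f"η = 0.01 (too small):  x = {run(0.01):+.4f}")  # still far from 0
print(f"η = 0.40 (just right): x = {run(0.40):+.2e}")  # essentially converged
print(f"η = 1.10 (too large):  x = {run(1.10):+.2e}")  # diverged
```

With $\eta = 1.1$ the multiplier is $-1.2$, so the iterate flips sign and grows each step, exactly the oscillate-and-diverge behavior in the table.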
```python
import numpy as np

def gradient_descent(f, grad_f, x0, learning_rate=0.1, max_iters=100, tol=1e-6):
    """
    Minimize f starting from x0 using gradient descent.

    Returns: (x_final, history of x values, history of f values)
    """
    x = x0.copy()
    x_history = [x.copy()]
    f_history = [f(x)]

    for i in range(max_iters):
        grad = grad_f(x)

        # Check convergence
        if np.linalg.norm(grad) < tol:
            print(f"Converged at iteration {i}")
            break

        # Gradient descent step
        x = x - learning_rate * grad
        x_history.append(x.copy())
        f_history.append(f(x))

    return x, x_history, f_history

# Example: Minimize f(x, y) = x² + 2y² (elliptic paraboloid)
# Minimum at (0, 0)
def f(p):
    return p[0]**2 + 2*p[1]**2

def grad_f(p):
    return np.array([2*p[0], 4*p[1]])

x0 = np.array([4.0, 3.0])
x_final, x_hist, f_hist = gradient_descent(f, grad_f, x0, learning_rate=0.2)

print(f"Starting point: {x0}, f = {f(x0):.4f}")
print(f"Final point: {x_final}, f = {f(x_final):.6f}")
print("Iterations:")
for i in [0, 1, 2, 5, 10, len(f_hist) - 1]:
    if i < len(f_hist):
        print(f"  {i:3d}: x = [{x_hist[i][0]:+.4f}, {x_hist[i][1]:+.4f}], f = {f_hist[i]:.6f}")
```

The ML Gradient:
For a loss function $\mathcal{L}(\theta)$ with parameters $\theta = [\theta_1, \ldots, \theta_n]^T$:
$$\nabla_\theta \mathcal{L} = \begin{bmatrix} \frac{\partial \mathcal{L}}{\partial \theta_1} & \cdots & \frac{\partial \mathcal{L}}{\partial \theta_n} \end{bmatrix}^T$$
Dimensions in Practice:
| Model | Parameters | Gradient Size |
|---|---|---|
| Linear Regression (100 features) | 101 | 101 |
| Small MLP (100-50-10) | 5,560 | 5,560 |
| ResNet-50 | 25.6M | 25.6M |
| GPT-3 | 175B | 175B |
Every parameter has a corresponding gradient component, computed via backpropagation (chain rule).
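As a sanity check on the table's MLP row, the parameter count for a fully connected 100 → 50 → 10 network (weights plus biases) can be computed directly:

```python
def mlp_param_count(sizes):
    """Total weights and biases of a fully connected network with given layer sizes."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

# 100·50 + 50 + 50·10 + 10 = 5560, matching the table
print(mlp_param_count([100, 50, 10]))  # → 5560
```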
Linearity of the Gradient:
$$\nabla(\alpha f + \beta g) = \alpha \nabla f + \beta \nabla g$$
This is crucial: loss functions that are sums (e.g., over data points) have gradients that are sums of per-point gradients.
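A small sketch of this additivity for a summed squared-error loss (the data and model here are illustrative assumptions): the gradient of $\mathcal{L}(\theta) = \sum_i (\theta \cdot \mathbf{x}_i - y_i)^2$ equals the sum of the per-point gradients $2(\theta \cdot \mathbf{x}_i - y_i)\mathbf{x}_i$.

```python
import numpy as np

# Made-up data for illustration: five points, three features
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.normal(size=5)
theta = rng.normal(size=3)

def point_grad(theta, x_i, y_i):
    """Gradient of the single-point loss (θ·xᵢ - yᵢ)²."""
    return 2 * (x_i @ theta - y_i) * x_i

# Gradient of the total loss, computed two ways
grad_sum = sum(point_grad(theta, X[i], y[i]) for i in range(5))  # sum of per-point gradients
grad_direct = 2 * X.T @ (X @ theta - y)                          # gradient of the summed loss

print(np.allclose(grad_sum, grad_direct))  # → True
```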
Product Rule for Gradients:
$$\nabla(fg) = f \nabla g + g \nabla f$$
Chain Rule for Gradients:
If $h(\mathbf{x}) = f(g(\mathbf{x}))$ where $g: \mathbb{R}^n \to \mathbb{R}^m$ and $f: \mathbb{R}^m \to \mathbb{R}$:
$$\nabla_\mathbf{x} h = J_g^T \nabla_\mathbf{u} f$$
where $J_g$ is the Jacobian matrix of $g$ and $\mathbf{u} = g(\mathbf{x})$. This generalizes the chain rule to vector functions.
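This identity can be verified numerically; the particular $g: \mathbb{R}^2 \to \mathbb{R}^3$ and $f: \mathbb{R}^3 \to \mathbb{R}$ below are illustrative assumptions:

```python
import numpy as np

# Assumed example: g(x) = [x₁², x₁x₂, sin x₂],  f(u) = u₁ + 2u₂ + 3u₃
def g(x):
    return np.array([x[0]**2, x[0]*x[1], np.sin(x[1])])

def f(u):
    return u[0] + 2*u[1] + 3*u[2]

def h(x):
    return f(g(x))

x = np.array([1.0, 2.0])

# Analytical pieces: Jacobian of g and gradient of f
J_g = np.array([[2*x[0], 0.0],
                [x[1],   x[0]],
                [0.0,    np.cos(x[1])]])
grad_f = np.array([1.0, 2.0, 3.0])
grad_h_chain = J_g.T @ grad_f            # ∇ₓh = J_gᵀ ∇ᵤf

# Central-difference gradient of h for comparison
eps = 1e-6
grad_h_num = np.array([(h(x + eps*e) - h(x - eps*e)) / (2*eps)
                       for e in np.eye(2)])

print("Chain rule:", grad_h_chain)
print("Numerical: ", grad_h_num)
```

The two computations agree to finite-difference precision, which is exactly what backpropagation exploits layer by layer.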
When $\nabla f = \mathbf{0}$, we're at a stationary point. This could be a minimum (what we want), a maximum (bad), or a saddle point (common in high dimensions). Second-order analysis (the Hessian) distinguishes these cases.
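A brief sketch of that second-order check, using the classic saddle $f(x, y) = x^2 - y^2$ at $(0, 0)$ (an illustrative choice): mixed-sign Hessian eigenvalues indicate a saddle point, all-positive a minimum, all-negative a maximum.

```python
import numpy as np

def hessian(f, p, h=1e-4):
    """Finite-difference Hessian of a scalar function at point p."""
    n = len(p)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            pp = p.copy(); pp[i] += h; pp[j] += h
            pm = p.copy(); pm[i] += h; pm[j] -= h
            mp = p.copy(); mp[i] -= h; mp[j] += h
            mm = p.copy(); mm[i] -= h; mm[j] -= h
            H[i, j] = (f(pp) - f(pm) - f(mp) + f(mm)) / (4*h*h)
    return H

# Assumed example: the saddle f(x, y) = x² - y², stationary at (0, 0)
f = lambda p: p[0]**2 - p[1]**2
eigs = np.linalg.eigvalsh(hessian(f, np.array([0.0, 0.0])))
print(eigs)  # eigenvalues ≈ [-2, 2]: mixed signs → saddle point
```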
You now understand gradients as the fundamental tool for optimization. Next, we'll explore directional derivatives—how to compute rates of change in arbitrary directions, not just along coordinate axes.