We've learned how partial derivatives measure rates of change in individual directions. But in optimization, we need something more powerful: a single object that captures all directional information simultaneously.
Enter the gradient—a vector that combines all partial derivatives and, remarkably, always points in the direction of steepest increase of the function. This property makes the gradient the essential tool for optimization: to minimize, we simply move opposite to the gradient.
By the end of this page, you'll understand the gradient as a vector, its geometric meaning as steepest ascent direction, and how it drives gradient descent—the workhorse of machine learning optimization.
Formal Definition:
For a scalar-valued function $f: \mathbb{R}^n \to \mathbb{R}$, the gradient is the vector of all partial derivatives:
$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$
Notation Variants: The gradient is written variously as $\nabla f$, $\operatorname{grad} f$, $\frac{\partial f}{\partial \mathbf{x}}$, or $\nabla_{\mathbf{x}} f$ when the variable of differentiation needs emphasis.
Example:
For $f(x, y, z) = x^2 + 3xy - z^2 + 2z$:
$$\nabla f = \begin{bmatrix} 2x + 3y \\ 3x \\ -2z + 2 \end{bmatrix}$$
At the point $(1, 2, 3)$: $\nabla f = [8, 3, -4]^T$
ML literature sometimes uses row vectors for gradients (following the Jacobian convention). We use column vectors here. Always check the convention in context—what matters is consistency in your computations.
```python
import numpy as np

def compute_gradient(f, point, h=1e-7):
    """Compute gradient numerically using central differences."""
    n = len(point)
    gradient = np.zeros(n)
    for i in range(n):
        point_plus = point.copy()
        point_minus = point.copy()
        point_plus[i] += h
        point_minus[i] -= h
        gradient[i] = (f(point_plus) - f(point_minus)) / (2 * h)
    return gradient

# Example: f(x, y, z) = x² + 3xy - z² + 2z
def f(p):
    x, y, z = p[0], p[1], p[2]
    return x**2 + 3*x*y - z**2 + 2*z

def gradient_analytical(p):
    x, y, z = p[0], p[1], p[2]
    return np.array([2*x + 3*y, 3*x, -2*z + 2])

point = np.array([1.0, 2.0, 3.0])
numerical_grad = compute_gradient(f, point)
analytical_grad = gradient_analytical(point)

print("f(x,y,z) = x² + 3xy - z² + 2z")
print(f"Point: {point}")
print(f"Numerical gradient: {numerical_grad}")
print(f"Analytical gradient: {analytical_grad}")
print(f"Max error: {np.max(np.abs(numerical_grad - analytical_grad)):.2e}")
```

The Fundamental Property:
The gradient $\nabla f(\mathbf{x})$ points in the direction of steepest increase of $f$ at the point $\mathbf{x}$. Its magnitude $|\nabla f|$ equals the rate of increase in that direction.
Why This Is True:
The directional derivative in the direction of a unit vector $\mathbf{u}$ is: $$D_{\mathbf{u}} f = \nabla f \cdot \mathbf{u} = |\nabla f| \cos(\theta)$$
where $\theta$ is the angle between $\nabla f$ and $\mathbf{u}$. Since $\cos(\theta)$ is maximized at $\theta = 0$, the directional derivative is largest when $\mathbf{u}$ points along $\nabla f$.
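As a quick numerical check of this identity (using $f(x, y) = x^2 + 2y^2$ as an illustrative choice, not a function from the text above): scanning unit vectors $\mathbf{u}(\theta) = [\cos\theta, \sin\theta]$ shows the directional derivative peaking when $\mathbf{u}$ aligns with $\nabla f$, with peak value $|\nabla f|$.

```python
import numpy as np

# Illustrative example (assumed): f(x, y) = x² + 2y², ∇f = [2x, 4y]
def f(p):
    return p[0]**2 + 2*p[1]**2

def grad_f(p):
    return np.array([2*p[0], 4*p[1]])

def directional_derivative(f, point, u, h=1e-7):
    """Central-difference estimate of D_u f at point (u must be a unit vector)."""
    return (f(point + h*u) - f(point - h*u)) / (2*h)

point = np.array([1.0, 1.0])
g = grad_f(point)                     # [2, 4]
g_unit = g / np.linalg.norm(g)

# Scan directions u(θ) = [cos θ, sin θ] and record D_u f for each
thetas = np.linspace(0, 2*np.pi, 1000)
derivs = [directional_derivative(f, point, np.array([np.cos(t), np.sin(t)]))
          for t in thetas]

t_best = thetas[np.argmax(derivs)]
best = np.array([np.cos(t_best), np.sin(t_best)])
print("Gradient direction:    ", g_unit)
print("Best scanned direction:", best)
print("Max directional derivative:", max(derivs), "≈ |∇f| =", np.linalg.norm(g))
```

The maximum over all scanned directions matches $|\nabla f| = \sqrt{20}$, and the maximizing direction matches the normalized gradient.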
Consequence for Optimization:
- To maximize $f$: move in the direction $\nabla f$
- To minimize $f$: move in the direction $-\nabla f$ (gradient descent!)
Gradient and Level Sets:
For a function $f(x, y)$, level sets are curves where $f = c$ (constant). A beautiful property: the gradient is always perpendicular to level sets.
Intuitively: along a level set, $f$ doesn't change. Moving perpendicular to it gives the fastest change. This is precisely the gradient direction.
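A small sketch of this perpendicularity (the choice $f(x, y) = x^2 + y^2$ is an illustrative assumption): its level set $f = 4$ is the circle of radius 2, parameterized as $(2\cos t, 2\sin t)$ with unit tangent $(-\sin t, \cos t)$, and the gradient dotted with the tangent vanishes everywhere on the curve.

```python
import numpy as np

# Illustrative example (assumed): f(x, y) = x² + y², ∇f = [2x, 2y]
def grad_f(x, y):
    return np.array([2*x, 2*y])

dots = []
for t in np.linspace(0, 2*np.pi, 8, endpoint=False):
    point = (2*np.cos(t), 2*np.sin(t))           # a point on the level set f = 4
    tangent = np.array([-np.sin(t), np.cos(t)])  # unit tangent along the circle
    dots.append(np.dot(grad_f(*point), tangent))

print("∇f · tangent along the level set:", np.round(dots, 12))  # all ≈ 0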
Physical Analogy:
Imagine a topographic map where $f(x, y)$ is elevation: the contour lines are level sets, and the gradient points straight uphill, perpendicular to the contours.
Picture a hill in 3D. At any point, if you place a ball, it rolls downhill—opposite to the gradient. The steeper the hill, the larger the gradient magnitude, the faster the ball accelerates. Gradient descent mimics this natural process.
The Algorithm:
Gradient descent uses the gradient to iteratively minimize a function:
$$\mathbf{x}_{t+1} = \mathbf{x}_t - \eta \nabla f(\mathbf{x}_t)$$
where $\eta > 0$ is the learning rate.
Why It Works: Each step moves opposite to the gradient, the direction of steepest decrease, so for sufficiently small $\eta$ the first-order expansion gives $f(\mathbf{x}_{t+1}) \approx f(\mathbf{x}_t) - \eta |\nabla f(\mathbf{x}_t)|^2 \le f(\mathbf{x}_t)$.
In Machine Learning: $f$ is the loss and $\mathbf{x}_t$ is the parameter vector; the choice of learning rate $\eta$ is critical:
| Learning Rate | Behavior | Result |
|---|---|---|
| Too small ($\eta \ll 1$) | Tiny steps, slow progress | Converges but very slowly |
| Just right | Balanced steps | Efficient convergence |
| Too large ($\eta \gg 1$) | Overshoots minimum | Oscillates or diverges |
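These three regimes can be seen on the simplest possible case (an assumed toy example, $f(x) = x^2$ with $f'(x) = 2x$), where the update reduces to $x \leftarrow (1 - 2\eta)\,x$ and convergence requires $|1 - 2\eta| < 1$:

```python
# Toy example (assumed): minimize f(x) = x² from x₀ = 1
# Update: x ← x - η·2x = (1 - 2η)·x
def run(eta, x0=1.0, steps=20):
    x = x0
    for _ in range(steps):
        x = x - eta * 2 * x
    return x

print(f"η = 0.01 (too small):  x = {run(0.01):+.4f}")  # still far from 0
print(f"η = 0.40 (just right): x = {run(0.40):+.2e}")  # essentially converged
print(f"η = 1.10 (too large):  x = {run(1.10):+.2e}")  # diverged
```

With $\eta = 1.1$ the multiplier is $-1.2$, so the iterate flips sign and grows each step, exactly the oscillate-and-diverge behavior in the table.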
```python
import numpy as np

def gradient_descent(f, grad_f, x0, learning_rate=0.1, max_iters=100, tol=1e-6):
    """
    Minimize f starting from x0 using gradient descent.

    Returns: (x_final, history of x values, history of f values)
    """
    x = x0.copy()
    x_history = [x.copy()]
    f_history = [f(x)]

    for i in range(max_iters):
        grad = grad_f(x)

        # Check convergence
        if np.linalg.norm(grad) < tol:
            print(f"Converged at iteration {i}")
            break

        # Gradient descent step
        x = x - learning_rate * grad
        x_history.append(x.copy())
        f_history.append(f(x))

    return x, x_history, f_history

# Example: Minimize f(x, y) = x² + 2y² (elliptic paraboloid)
# Minimum at (0, 0)
def f(p):
    return p[0]**2 + 2*p[1]**2

def grad_f(p):
    return np.array([2*p[0], 4*p[1]])

x0 = np.array([4.0, 3.0])
x_final, x_hist, f_hist = gradient_descent(f, grad_f, x0, learning_rate=0.2)

print(f"Starting point: {x0}, f = {f(x0):.4f}")
print(f"Final point: {x_final}, f = {f(x_final):.6f}")
print("Iterations:")
for i in [0, 1, 2, 5, 10, len(f_hist) - 1]:
    if i < len(f_hist):
        print(f"  {i:3d}: x = [{x_hist[i][0]:+.4f}, {x_hist[i][1]:+.4f}], f = {f_hist[i]:.6f}")
```

The ML Gradient:
For a loss function $\mathcal{L}(\theta)$ with parameters $\theta = [\theta_1, \ldots, \theta_n]^T$:
$$\nabla_\theta \mathcal{L} = \begin{bmatrix} \frac{\partial \mathcal{L}}{\partial \theta_1} & \cdots & \frac{\partial \mathcal{L}}{\partial \theta_n} \end{bmatrix}^T$$
Dimensions in Practice:
| Model | Parameters | Gradient Size |
|---|---|---|
| Linear Regression (100 features) | 101 | 101 |
| Small MLP (100-50-10) | 5,560 | 5,560 |
| ResNet-50 | 25.6M | 25.6M |
| GPT-3 | 175B | 175B |
Every parameter has a corresponding gradient component, computed via backpropagation (chain rule).
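As a sanity check on the table's MLP row, the parameter count for a fully connected 100 → 50 → 10 network (weights plus biases) can be computed directly:

```python
def mlp_param_count(sizes):
    """Total weights and biases of a fully connected network with given layer sizes."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

# 100·50 + 50 + 50·10 + 10 = 5560, matching the table
print(mlp_param_count([100, 50, 10]))  # → 5560
```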
Linearity of the Gradient:
$$\nabla(\alpha f + \beta g) = \alpha \nabla f + \beta \nabla g$$
This is crucial: loss functions that are sums (e.g., over data points) have gradients that are sums of per-point gradients.
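A small sketch of this additivity for a summed squared-error loss (the data and model here are illustrative assumptions): the gradient of $\mathcal{L}(\theta) = \sum_i (\theta \cdot \mathbf{x}_i - y_i)^2$ equals the sum of the per-point gradients $2(\theta \cdot \mathbf{x}_i - y_i)\mathbf{x}_i$.

```python
import numpy as np

# Made-up data for illustration: five points, three features
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.normal(size=5)
theta = rng.normal(size=3)

def point_grad(theta, x_i, y_i):
    """Gradient of the single-point loss (θ·xᵢ - yᵢ)²."""
    return 2 * (x_i @ theta - y_i) * x_i

# Gradient of the total loss, computed two ways
grad_sum = sum(point_grad(theta, X[i], y[i]) for i in range(5))  # sum of per-point gradients
grad_direct = 2 * X.T @ (X @ theta - y)                          # gradient of the summed loss

print(np.allclose(grad_sum, grad_direct))  # → True
```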
Product Rule for Gradients:
$$\nabla(fg) = f \nabla g + g \nabla f$$
Chain Rule for Gradients:
If $h(\mathbf{x}) = f(g(\mathbf{x}))$ where $g: \mathbb{R}^n \to \mathbb{R}^m$ and $f: \mathbb{R}^m \to \mathbb{R}$:
$$\nabla_\mathbf{x} h = J_g^T \nabla_\mathbf{u} f$$
where $J_g$ is the Jacobian matrix of $g$ and $\mathbf{u} = g(\mathbf{x})$. This generalizes the chain rule to vector functions.
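This identity can be verified numerically; the particular $g: \mathbb{R}^2 \to \mathbb{R}^3$ and $f: \mathbb{R}^3 \to \mathbb{R}$ below are illustrative assumptions:

```python
import numpy as np

# Assumed example: g(x) = [x₁², x₁x₂, sin x₂],  f(u) = u₁ + 2u₂ + 3u₃
def g(x):
    return np.array([x[0]**2, x[0]*x[1], np.sin(x[1])])

def f(u):
    return u[0] + 2*u[1] + 3*u[2]

def h(x):
    return f(g(x))

x = np.array([1.0, 2.0])

# Analytical pieces: Jacobian of g and gradient of f
J_g = np.array([[2*x[0], 0.0],
                [x[1],   x[0]],
                [0.0,    np.cos(x[1])]])
grad_f = np.array([1.0, 2.0, 3.0])
grad_h_chain = J_g.T @ grad_f            # ∇ₓh = J_gᵀ ∇ᵤf

# Central-difference gradient of h for comparison
eps = 1e-6
grad_h_num = np.array([(h(x + eps*e) - h(x - eps*e)) / (2*eps)
                       for e in np.eye(2)])

print("Chain rule:", grad_h_chain)
print("Numerical: ", grad_h_num)
```

The two computations agree to finite-difference precision, which is exactly what backpropagation exploits layer by layer.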
When $\nabla f = \mathbf{0}$, we're at a stationary point. This could be a minimum (what we want), a maximum (bad), or a saddle point (common in high dimensions). Second-order analysis (the Hessian) distinguishes these cases.
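A brief sketch of that second-order check, using the classic saddle $f(x, y) = x^2 - y^2$ at $(0, 0)$ (an illustrative choice): mixed-sign Hessian eigenvalues indicate a saddle point, all-positive a minimum, all-negative a maximum.

```python
import numpy as np

def hessian(f, p, h=1e-4):
    """Finite-difference Hessian of a scalar function at point p."""
    n = len(p)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            pp = p.copy(); pp[i] += h; pp[j] += h
            pm = p.copy(); pm[i] += h; pm[j] -= h
            mp = p.copy(); mp[i] -= h; mp[j] += h
            mm = p.copy(); mm[i] -= h; mm[j] -= h
            H[i, j] = (f(pp) - f(pm) - f(mp) + f(mm)) / (4*h*h)
    return H

# Assumed example: the saddle f(x, y) = x² - y², stationary at (0, 0)
f = lambda p: p[0]**2 - p[1]**2
eigs = np.linalg.eigvalsh(hessian(f, np.array([0.0, 0.0])))
print(eigs)  # eigenvalues ≈ [-2, 2]: mixed signs → saddle point
```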
You now understand gradients as the fundamental tool for optimization. Next, we'll explore directional derivatives—how to compute rates of change in arbitrary directions, not just along coordinate axes.