Every neural network you've ever used—from the language model generating this text to the image classifier in your phone—learned its parameters through gradient descent. This simple yet profound algorithm is the beating heart of modern machine learning, transforming random initial weights into intelligent systems that recognize speech, translate languages, and diagnose diseases.
Gradient descent is elegantly simple in concept: to minimize a function, repeatedly take small steps in the direction that decreases it most steeply. Yet this simplicity belies remarkable depth. Understanding gradient descent deeply—its geometry, its dynamics, its failure modes, and its variants—is essential for any serious practitioner of machine learning.
By the end of this page, you will understand the geometric intuition behind gradient descent, derive the algorithm mathematically, implement it from scratch, and analyze why it works. You'll see how a local search strategy can navigate high-dimensional parameter spaces to find models that generalize to unseen data.
Before diving into the algorithm, let's establish precisely what problem we're solving. In supervised machine learning, we have a training set of $n$ labeled examples $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, a model $f(\mathbf{x}; \boldsymbol{\theta})$ with parameters $\boldsymbol{\theta}$, and a loss function $\mathcal{L}(\hat{y}, y)$ that measures how far a prediction is from the true target.
Our goal is to find the parameters $\boldsymbol{\theta}^*$ that minimize the empirical risk—the average loss over our training data:
$$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = \arg\min_{\boldsymbol{\theta}} \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}(f(\mathbf{x}_i; \boldsymbol{\theta}), y_i)$$
Here, $J(\boldsymbol{\theta})$ is the cost function or objective function—a scalar-valued function of potentially millions of parameters that we must minimize.
A modern large language model may have 100 billion parameters. Gradient descent must navigate a 100-billion-dimensional space to find a good minimum. There's no closed-form solution—we must search iteratively. This is why efficient optimization algorithms are so critical.
Why can't we solve this analytically?
For simple linear regression, we can derive the optimal weights in closed form using the normal equations:
$$\boldsymbol{\theta}^* = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$
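As a concrete illustration, here is a minimal sketch (assuming NumPy and a small synthetic dataset invented for this example) of solving linear regression in closed form; `np.linalg.solve` is used rather than explicitly inverting $\mathbf{X}^T\mathbf{X}$, which is the numerically safer way to apply the normal equations:

```python
import numpy as np

# Hypothetical synthetic data: y ≈ 3*x + 2 plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + 0.1 * rng.normal(size=100)

# Append a column of ones so the bias is absorbed into θ
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

# Normal equations: θ* = (XᵀX)⁻¹ Xᵀy, solved as a linear system
theta_star = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
print(theta_star)  # ≈ [3.0, 2.0]
```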
But this approach fails for most practical ML models: a closed-form solution exists only for special cases such as linear models with squared loss; forming and inverting $\mathbf{X}^T \mathbf{X}$ costs roughly $O(d^3)$ time and $O(d^2)$ memory, which is prohibitive when the number of features is large; and nonlinear models such as neural networks have no closed-form solution at all. Iterative optimization algorithms like gradient descent overcome these limitations.
Imagine standing on a mountainous terrain in thick fog—you can feel the local slope beneath your feet but can't see the overall landscape. Your goal is to reach the lowest valley. What strategy would you use?
The most natural approach: at each step, feel which direction goes downhill most steeply, take a step in that direction, and repeat. This is exactly gradient descent.
The loss landscape is a high-dimensional surface where each point corresponds to a particular setting of the parameters $\boldsymbol{\theta}$, the height at that point is the loss $J(\boldsymbol{\theta})$, and the valleys correspond to parameter settings that fit the data well.
The gradient $\nabla J(\boldsymbol{\theta})$ is a vector pointing in the direction of steepest ascent. To descend, we move in the opposite direction: $-\nabla J(\boldsymbol{\theta})$.
Why the negative gradient?
The gradient $\nabla J(\boldsymbol{\theta})$ at point $\boldsymbol{\theta}$ is the vector of partial derivatives:
$$\nabla J(\boldsymbol{\theta}) = \begin{bmatrix} \frac{\partial J}{\partial \theta_1} \\ \frac{\partial J}{\partial \theta_2} \\ \vdots \\ \frac{\partial J}{\partial \theta_d} \end{bmatrix}$$
This vector points in the direction where $J$ increases most rapidly. By moving in the opposite direction, we descend the surface.
Key insight: Among all possible directions we could step, the negative gradient direction decreases the function value fastest (for infinitesimally small steps). This is provable from first-order Taylor expansion.
The directional derivative of J in direction u is ∇J · u. To minimize this dot product (maximize decrease), we choose u = -∇J/||∇J||. The negative gradient direction is optimal among all unit directions.
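Spelled out, this is a one-line application of the Cauchy–Schwarz inequality: for any unit vector $\mathbf{u}$,

$$\nabla J \cdot \mathbf{u} \;\geq\; -\|\nabla J\|\,\|\mathbf{u}\| \;=\; -\|\nabla J\|,$$

with equality exactly when $\mathbf{u} = -\nabla J / \|\nabla J\|$, so no direction can decrease $J$ faster to first order.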
Now let's formalize the algorithm precisely. Gradient descent is an iterative optimization algorithm that updates parameters according to:
$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \alpha \nabla J(\boldsymbol{\theta}^{(t)})$$
where $\boldsymbol{\theta}^{(t)}$ is the parameter vector at iteration $t$, $\alpha > 0$ is the learning rate (step size), and $\nabla J(\boldsymbol{\theta}^{(t)})$ is the gradient of the cost evaluated at the current parameters.
The algorithm terminates when convergence criteria are met (typically when the gradient norm falls below a threshold, or the loss stops decreasing significantly).
```python
import numpy as np
from typing import Callable, Tuple, List

def gradient_descent(
    objective_fn: Callable[[np.ndarray], float],
    gradient_fn: Callable[[np.ndarray], np.ndarray],
    theta_init: np.ndarray,
    learning_rate: float = 0.01,
    max_iterations: int = 1000,
    tolerance: float = 1e-6,
    verbose: bool = False
) -> Tuple[np.ndarray, List[float]]:
    """
    Perform gradient descent optimization.

    Args:
        objective_fn: Function J(θ) to minimize
        gradient_fn: Function computing ∇J(θ)
        theta_init: Initial parameter vector
        learning_rate: Step size α
        max_iterations: Maximum iterations before stopping
        tolerance: Stop when ||∇J|| < tolerance
        verbose: Print progress

    Returns:
        Tuple of (optimal_theta, loss_history)
    """
    theta = theta_init.copy()
    loss_history = []

    for iteration in range(max_iterations):
        # Compute current loss and gradient
        loss = objective_fn(theta)
        gradient = gradient_fn(theta)
        loss_history.append(loss)

        # Check convergence
        gradient_norm = np.linalg.norm(gradient)
        if gradient_norm < tolerance:
            if verbose:
                print(f"Converged at iteration {iteration}")
            break

        # Update parameters: θ ← θ - α∇J(θ)
        theta = theta - learning_rate * gradient

        if verbose and iteration % 100 == 0:
            print(f"Iter {iteration}: Loss = {loss:.6f}, ||∇J|| = {gradient_norm:.6f}")

    return theta, loss_history

# Example: Minimize a simple quadratic function
# J(θ) = (θ₁ - 3)² + (θ₂ + 2)² (minimum at θ* = [3, -2])

def quadratic_objective(theta: np.ndarray) -> float:
    return (theta[0] - 3)**2 + (theta[1] + 2)**2

def quadratic_gradient(theta: np.ndarray) -> np.ndarray:
    return np.array([
        2 * (theta[0] - 3),  # ∂J/∂θ₁
        2 * (theta[1] + 2)   # ∂J/∂θ₂
    ])

# Run gradient descent
theta_initial = np.array([0.0, 0.0])
theta_optimal, history = gradient_descent(
    quadratic_objective,
    quadratic_gradient,
    theta_initial,
    learning_rate=0.1,
    verbose=True
)
print(f"\nOptimal θ: {theta_optimal}")  # Should be close to [3, -2]
```

Algorithm Pseudocode (Formal)
```
Algorithm: Gradient Descent
Input:  J(θ)   - objective function
        ∇J(θ)  - gradient function
        θ₀     - initial parameters
        α      - learning rate
        ε      - convergence threshold
        T      - maximum iterations

1. t ← 0
2. while t < T and ||∇J(θₜ)|| > ε do
3.     gₜ ← ∇J(θₜ)            // Compute gradient
4.     θₜ₊₁ ← θₜ - α · gₜ      // Update parameters
5.     t ← t + 1
6. end while
7. return θₜ
```
This is the complete algorithm. Despite its simplicity, gradient descent (and its variants) trains everything from logistic regression to GPT-4.
Let's trace through gradient descent manually on a concrete example to build intuition.
Problem: Minimize $J(\theta) = \theta^2 + 4\theta + 5$
Step 1: Find the gradient $$\frac{dJ}{d\theta} = 2\theta + 4$$
Step 2: Initialize with $\theta_0 = 3$ and learning rate $\alpha = 0.1$
Step 3: Iterate
| Iteration $t$ | $\theta_t$ | $J(\theta_t)$ | $\nabla J(\theta_t)$ | $\theta_{t+1}$ |
|---|---|---|---|---|
| 0 | 3.000 | 26.00 | 10.00 | 3 - 0.1(10) = 2.00 |
| 1 | 2.000 | 17.00 | 8.00 | 2 - 0.1(8) = 1.20 |
| 2 | 1.200 | 11.24 | 6.40 | 1.2 - 0.1(6.4) = 0.56 |
| 3 | 0.560 | 7.55 | 5.12 | 0.56 - 0.1(5.12) = 0.048 |
| 4 | 0.048 | 5.19 | 4.10 | 0.048 - 0.1(4.1) = -0.36 |
| 5 | -0.362 | 3.68 | 3.28 | -0.362 - 0.1(3.28) = -0.69 |
| ... | ... | ... | ... | ... |
| 35 | -1.998 | 1.00 | 0.004 | ≈ -2.00 |
Verification: The true minimum is at $\theta^* = -2$ (where $\nabla J = 0$), and $J(-2) = 4 - 8 + 5 = 1$.
After 35 iterations, gradient descent has reached $\theta \approx -2.00$ with $J \approx 1.00$. Success!
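The same trace can be reproduced programmatically. This sketch reuses the `gradient_descent` function defined earlier (assumed to be in scope), treating $\theta$ as a one-element array:

```python
import numpy as np

# J(θ) = θ² + 4θ + 5 and its derivative, written for 1-D parameter arrays
objective = lambda theta: float(theta[0]**2 + 4*theta[0] + 5)
gradient = lambda theta: np.array([2*theta[0] + 4])

theta_opt, losses = gradient_descent(
    objective, gradient,
    theta_init=np.array([3.0]),
    learning_rate=0.1,
    max_iterations=50,
)
print(theta_opt)   # ≈ [-2.0]
print(losses[:4])  # 26.0, 17.0, 11.24, 7.55...  matches the table above
```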
Key observations from this example: the loss decreases at every iteration, the steps shrink automatically because the gradient itself shrinks as we approach the minimum, and convergence therefore slows down near $\theta^* = -2$ even though we never changed the learning rate.
For sufficiently small α, the loss is guaranteed to decrease each step. This follows from the first-order Taylor approximation: J(θ - α∇J) ≈ J(θ) - α||∇J||² < J(θ) since α||∇J||² > 0.
Now let's apply gradient descent to a real ML problem: linear regression.
Model: $\hat{y} = \mathbf{w}^T \mathbf{x} + b$
Loss function: Mean Squared Error (MSE) $$J(\mathbf{w}, b) = \frac{1}{2n} \sum_{i=1}^{n} (\mathbf{w}^T \mathbf{x}_i + b - y_i)^2$$
The factor of $\frac{1}{2}$ is conventional: it cancels the factor of 2 produced when differentiating the square.
Computing the gradients:
Let $\hat{y}_i = \mathbf{w}^T \mathbf{x}_i + b$ be the prediction for sample $i$.
$$\frac{\partial J}{\partial w_j} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) x_{ij}$$
$$\frac{\partial J}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)$$
In vector form: $$\nabla_{\mathbf{w}} J = \frac{1}{n} \mathbf{X}^T (\mathbf{X}\mathbf{w} + b\mathbf{1} - \mathbf{y})$$ $$\nabla_b J = \frac{1}{n} \mathbf{1}^T (\mathbf{X}\mathbf{w} + b\mathbf{1} - \mathbf{y})$$
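Before trusting hand-derived gradients, it is common practice to check them against a finite-difference approximation. The sketch below uses hypothetical helper functions (not part of the class that follows) to compare the analytic MSE gradient with central differences:

```python
import numpy as np

def mse_loss(w, X, y):
    """J(w) = (1/2n) * ||Xw - y||², with the bias folded into w via a ones column."""
    r = X @ w - y
    return 0.5 * np.mean(r ** 2)

def mse_grad(w, X, y):
    """Analytic gradient: ∇J = (1/n) Xᵀ(Xw - y)."""
    return X.T @ (X @ w - y) / len(y)

def numerical_grad(f, w, eps=1e-6):
    """Central-difference approximation of ∇f(w), one coordinate at a time."""
    g = np.zeros_like(w)
    for j in range(len(w)):
        e = np.zeros_like(w)
        e[j] = eps
        g[j] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(20, 3)), np.ones((20, 1))])
y = rng.normal(size=20)
w = rng.normal(size=4)

analytic = mse_grad(w, X, y)
numeric = numerical_grad(lambda v: mse_loss(v, X, y), w)
print(np.max(np.abs(analytic - numeric)))  # should be ~1e-8 or smaller
```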
```python
import numpy as np
import matplotlib.pyplot as plt

class LinearRegressionGD:
    """
    Linear regression trained via gradient descent.

    Model: y = Xw + b
    Loss: J(w,b) = (1/2n) * ||Xw + b - y||²
    """

    def __init__(self, learning_rate: float = 0.01, n_iterations: int = 1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = None
        self.bias = None
        self.loss_history = []

    def fit(self, X: np.ndarray, y: np.ndarray) -> 'LinearRegressionGD':
        """
        Train the model using gradient descent.

        Args:
            X: Training features, shape (n_samples, n_features)
            y: Target values, shape (n_samples,)
        """
        n_samples, n_features = X.shape

        # Initialize parameters to zeros
        self.weights = np.zeros(n_features)
        self.bias = 0.0

        for iteration in range(self.n_iterations):
            # Forward pass: compute predictions
            y_pred = X @ self.weights + self.bias

            # Compute residuals (errors)
            residuals = y_pred - y  # shape: (n_samples,)

            # Compute loss: J = (1/2n) * sum(residuals²)
            loss = np.mean(residuals ** 2) / 2
            self.loss_history.append(loss)

            # Compute gradients
            # ∂J/∂w = (1/n) * X^T @ residuals
            grad_weights = (X.T @ residuals) / n_samples
            # ∂J/∂b = (1/n) * sum(residuals)
            grad_bias = np.mean(residuals)

            # Update parameters: θ ← θ - α∇J
            self.weights -= self.learning_rate * grad_weights
            self.bias -= self.learning_rate * grad_bias

        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Generate predictions."""
        return X @ self.weights + self.bias

# Generate synthetic data
np.random.seed(42)
n_samples = 100
X = 2 * np.random.randn(n_samples, 1)  # Single feature
true_weights = np.array([3.0])
true_bias = 2.0
y = X @ true_weights + true_bias + 0.5 * np.random.randn(n_samples)

# Train model
model = LinearRegressionGD(learning_rate=0.1, n_iterations=100)
model.fit(X, y)

print(f"True weights: {true_weights}, Learned: {model.weights}")
print(f"True bias: {true_bias}, Learned: {model.bias:.4f}")
print(f"Final loss: {model.loss_history[-1]:.6f}")

# Plot convergence
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(model.loss_history)
plt.xlabel('Iteration')
plt.ylabel('MSE Loss')
plt.title('Training Loss Convergence')

plt.subplot(1, 2, 2)
plt.scatter(X, y, alpha=0.5, label='Data')
X_line = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)
plt.plot(X_line, model.predict(X_line), 'r-', linewidth=2, label='Learned')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.title('Linear Regression Fit')
plt.tight_layout()
plt.savefig('linear_regression_gd.png', dpi=150)
```

With sufficient iterations and an appropriate learning rate, gradient descent finds weights and bias very close to the true values used to generate the data. This demonstrates the algorithm's correctness on a convex problem.
Let's rigorously prove that gradient descent decreases the objective function (under appropriate conditions).
Theorem (Sufficient Decrease)
Let $J: \mathbb{R}^d \to \mathbb{R}$ be continuously differentiable with $L$-Lipschitz continuous gradient: $$\|\nabla J(\mathbf{x}) - \nabla J(\mathbf{y})\| \leq L \|\mathbf{x} - \mathbf{y}\| \quad \forall \mathbf{x}, \mathbf{y}$$
Then for learning rate $\alpha \leq \frac{1}{L}$, gradient descent satisfies: $$J(\boldsymbol{\theta}^{(t+1)}) \leq J(\boldsymbol{\theta}^{(t)}) - \frac{\alpha}{2} \|\nabla J(\boldsymbol{\theta}^{(t)})\|^2$$
Proof Sketch:
Using Taylor expansion with Lipschitz smoothness: $$J(\boldsymbol{\theta} - \alpha \nabla J) \leq J(\boldsymbol{\theta}) - \alpha \|\nabla J\|^2 + \frac{L \alpha^2}{2} \|\nabla J\|^2 = J(\boldsymbol{\theta}) - \alpha \left(1 - \frac{L\alpha}{2}\right) \|\nabla J\|^2$$
For $\alpha \leq \frac{1}{L}$, we have $1 - \frac{L\alpha}{2} \geq \frac{1}{2}$, giving: $$J(\boldsymbol{\theta}^{(t+1)}) \leq J(\boldsymbol{\theta}^{(t)}) - \frac{\alpha}{2} \|\nabla J(\boldsymbol{\theta}^{(t)})\|^2$$
A function has L-Lipschitz continuous gradient if its gradient doesn't change too fast. Geometrically, the surface can't curve too sharply. The constant L bounds the curvature: a larger L means the function can be more curved, requiring smaller step sizes for safe descent.
Corollary (Guaranteed Progress)
As long as $\|\nabla J(\boldsymbol{\theta})\| > 0$ (we haven't reached a critical point), the loss decreases by at least $\frac{\alpha}{2} \|\nabla J\|^2$ per step.
Implications: each iteration makes measurable progress proportional to the squared gradient norm, and if $J$ is bounded below, summing the inequality over iterations shows that the gradient norms must shrink toward zero, so gradient descent is driven toward a critical point rather than decreasing forever.
This theorem is the theoretical foundation for gradient descent's reliability on smooth functions.
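As a quick numerical sanity check (a sketch, not part of the derivation above), we can verify the sufficient-decrease inequality on the earlier quadratic $J(\theta) = \theta^2 + 4\theta + 5$, whose gradient $2\theta + 4$ is Lipschitz with $L = 2$, so any $\alpha \leq 0.5$ is covered by the theorem:

```python
J = lambda t: t**2 + 4*t + 5
grad = lambda t: 2*t + 4

L = 2.0        # Lipschitz constant of the gradient (J''(θ) = 2 everywhere)
alpha = 0.4    # any α ≤ 1/L = 0.5 is covered by the theorem

theta = 3.0
for step in range(10):
    g = grad(theta)
    theta_next = theta - alpha * g
    decrease = J(theta) - J(theta_next)
    bound = 0.5 * alpha * g**2      # (α/2)·||∇J||²
    print(f"step {step}: decrease = {decrease:.4f} >= bound = {bound:.4f}")
    assert decrease >= bound - 1e-12
    theta = theta_next
```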
The loss surface (or loss landscape) is the graph of $J(\boldsymbol{\theta})$ over parameter space. Understanding its topology is crucial for optimization.
Types of Critical Points:
| Point Type | Definition | Eigenvalues of Hessian | Behavior |
|---|---|---|---|
| Local Minimum | ∇J = 0, all curvatures positive | All λᵢ > 0 | Stable, GD converges here |
| Local Maximum | ∇J = 0, all curvatures negative | All λᵢ < 0 | Unstable, GD escapes |
| Saddle Point | ∇J = 0, mixed curvatures | Some λᵢ > 0, some < 0 | Unstable in some directions |
| Global Minimum | Lowest value of J across all θ | All λᵢ > 0 | Optimal solution |
The Hessian matrix $\mathbf{H} = \nabla^2 J$ captures local curvature. Its eigenvalues determine the nature of critical points.
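A brief sketch of this classification in code, using a hand-picked toy function not discussed above: $f(x, y) = x^2 - y^2$ has a critical point at the origin, and its constant Hessian has one positive and one negative eigenvalue, so the origin is a saddle point:

```python
import numpy as np

# f(x, y) = x² - y²; the gradient (2x, -2y) vanishes at (0, 0)
# Hessian is constant: [[2, 0], [0, -2]]
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

eigenvalues = np.linalg.eigvalsh(H)
print(eigenvalues)  # [-2.  2.]

if np.all(eigenvalues > 0):
    print("local minimum")
elif np.all(eigenvalues < 0):
    print("local maximum")
else:
    print("saddle point")  # mixed signs → saddle
```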
Convex vs Non-Convex Landscapes
Convex functions (like MSE loss for linear regression) have a bowl-shaped landscape: every local minimum is a global minimum, there are no saddle points to get stuck near, and gradient descent with a suitable learning rate converges to the optimum from any starting point.
Non-convex functions (like neural network losses) have complex landscapes: many local minima of varying quality, saddle points, flat plateaus, and narrow ravines, so the solution gradient descent finds can depend on where it starts.
Modern insight: In high-dimensional neural networks, local minima are often nearly as good as the global minimum. The real challenge is escaping saddle points and plateaus.
In high-dimensional spaces, saddle points are exponentially more common than local minima. For a random critical point in d dimensions, the probability of all d eigenvalues being positive (local minimum) is roughly 2^(-d). For d = 1000, this is astronomically unlikely (2^(-1000) ≈ 10^(-301)).
Moving from theory to practice requires addressing several implementation details:
1. Parameter Initialization
Starting point profoundly affects which minimum gradient descent finds: on convex problems any initialization reaches the same optimum, but on non-convex problems different starting points can land in different minima. Neural networks in particular need small random (not all-zero) initial weights so that symmetry between units is broken.
2. Numerical Precision
Use float32 or float64 appropriately (32-bit for speed, 64-bit for precision).
3. Termination Criteria
When to stop iterating: when the gradient norm falls below a threshold, when the loss improvement between iterations becomes negligible, when a maximum iteration budget is exhausted, or when validation loss stops improving (early stopping).
If features have vastly different scales (e.g., age in years vs income in thousands), gradient descent becomes inefficient. The loss surface becomes elongated, and a single learning rate can't work well for all dimensions. Always standardize features to zero mean and unit variance, or normalize to [0, 1].
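A minimal sketch of standardizing features before running gradient descent, reusing the `LinearRegressionGD` class defined earlier (assumed to be in scope); the raw feature scales here are hypothetical, and in practice a library transformer such as scikit-learn's `StandardScaler` performs the same computation:

```python
import numpy as np

np.random.seed(0)
# Hypothetical raw features on very different scales: age in years, income in dollars
X_raw = np.column_stack([
    np.random.uniform(20, 70, size=200),           # ages
    np.random.uniform(20_000, 150_000, size=200)   # incomes
])
y = 0.5 * X_raw[:, 0] + 0.0001 * X_raw[:, 1] + np.random.randn(200)

# Standardize each column to zero mean and unit variance
mean, std = X_raw.mean(axis=0), X_raw.std(axis=0)
X_scaled = (X_raw - mean) / std

# A single learning rate now works well for both dimensions
model = LinearRegressionGD(learning_rate=0.1, n_iterations=1000)
model.fit(X_scaled, y)
print(model.weights, model.bias)
```

Note that the same `mean` and `std` computed on the training data must be applied to any new data at prediction time.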
We've established the foundation of gradient descent—the workhorse algorithm of machine learning optimization. Let's consolidate the essential concepts:
- The update rule $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \alpha \nabla J(\boldsymbol{\theta})$ steps against the gradient, the direction of steepest local descent.
- The learning rate $\alpha$ controls step size; for $L$-smooth functions, any $\alpha \leq 1/L$ guarantees the loss decreases at every step.
- Convex losses form a single basin, while non-convex losses contain local minima, saddle points, and plateaus that shape what gradient descent can find.
- Practical details matter: sensible initialization, feature scaling, and clear termination criteria.
What's Next:
With the core algorithm understood, we turn to critical practical questions: how do we choose the learning rate, what happens when it is too large or too small, and how can we adapt it as training progresses?
You now understand gradient descent at a deep level—the geometric intuition, the algorithm, the mathematics, and practical implementation. This knowledge powers your understanding of how every neural network learns. Next, we'll explore the critical challenge of choosing the right learning rate.