At the heart of virtually every machine learning algorithm lies a deceptively simple idea: find the parameters that minimize a loss function by iteratively moving in the direction of steepest descent. This is gradient descent—the workhorse algorithm that powers everything from logistic regression to transformers with billions of parameters.
Batch gradient descent, also known as full-batch gradient descent or simply vanilla gradient descent, represents the purest form of this optimization paradigm. Before we can appreciate the sophistication of modern optimizers like Adam or understand why stochastic methods work, we must first master this foundational algorithm in its complete, unmodified form.
This page provides an exhaustive treatment of batch gradient descent: its mathematical foundations, geometric intuitions, convergence guarantees, implementation considerations, and the fundamental limitations that motivated all subsequent innovations in optimization.
By the end of this page, you will understand the precise mathematical formulation of batch gradient descent, visualize its behavior in different loss landscapes, derive its convergence conditions, implement it correctly from first principles, and identify when it is (and isn't) the right choice for optimization.
To understand batch gradient descent rigorously, we must first establish the precise mathematical setting. Consider a supervised learning problem where we have a dataset of N training examples:
$$\mathcal{D} = \{(\mathbf{x}^{(1)}, y^{(1)}), (\mathbf{x}^{(2)}, y^{(2)}), \ldots, (\mathbf{x}^{(N)}, y^{(N)})\}$$
Our goal is to find parameters $\boldsymbol{\theta} \in \mathbb{R}^d$ that minimize the empirical risk—the average loss over all training examples:
$$J(\boldsymbol{\theta}) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(f_{\boldsymbol{\theta}}(\mathbf{x}^{(i)}), y^{(i)})$$
Here, $f_{\boldsymbol{\theta}}$ is our model parameterized by $\boldsymbol{\theta}$, and $\mathcal{L}$ is a loss function measuring the discrepancy between predictions and targets.
We use $J(\boldsymbol{\theta})$ to denote the objective function (total cost across all examples) and $\mathcal{L}$ for the per-example loss. The gradient $\nabla_{\boldsymbol{\theta}} J$ is a vector in $\mathbb{R}^d$ pointing in the direction of steepest increase of $J$.
The Gradient Descent Update Rule
Batch gradient descent updates parameters by moving in the opposite direction of the gradient:
$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \eta \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}^{(t)})$$
where $\eta > 0$ is the learning rate, $t$ indexes the iteration, and $\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}^{(t)})$ is the gradient of the objective evaluated at the current parameters (see the symbol table below).
Expanding the gradient explicitly:
$$\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = \frac{1}{N} \sum_{i=1}^{N} \nabla_{\boldsymbol{\theta}} \mathcal{L}(f_{\boldsymbol{\theta}}(\mathbf{x}^{(i)}), y^{(i)})$$
This is the defining characteristic of batch gradient descent: at each iteration, we compute gradients with respect to every single training example and average them before making a parameter update.
| Symbol | Name | Role | Typical Range |
|---|---|---|---|
| $\boldsymbol{\theta}$ | Parameters | The values we're optimizing | Depends on model architecture |
| $\eta$ | Learning rate | Controls step size | $10^{-5}$ to $10^{-1}$ |
| $\nabla J$ | Gradient | Direction of steepest ascent | Computed, not chosen |
| $t$ | Iteration | Current optimization step | $0, 1, 2, \ldots$ |
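To make the update rule concrete, here is a minimal NumPy sketch of a single full-batch update, assuming a linear model with mean-squared-error loss so that $\nabla J(\boldsymbol{\theta}) = \frac{1}{N} X^\top (X\boldsymbol{\theta} - \mathbf{y})$ (the data and sizes below are illustrative, not from the text):

```python
import numpy as np

# Toy linear-regression problem (illustrative sizes and values)
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))          # N = 100 examples, d = 3 parameters
y = X @ np.array([2.0, -1.0, 0.5]) + 0.05 * rng.standard_normal(100)

theta = np.zeros(3)                        # initial parameters
eta = 0.1                                  # learning rate

# One batch gradient descent update: the gradient is averaged over ALL N examples
gradient = X.T @ (X @ theta - y) / len(y)  # (1/N) X^T (X theta - y)
theta = theta - eta * gradient
```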
Understanding gradient descent geometrically provides crucial intuition for debugging optimization problems and selecting hyperparameters. Let's build this understanding from first principles.
The Loss Surface
The objective function $J(\boldsymbol{\theta})$ defines a loss surface (or loss landscape) over the parameter space. For a model with $d$ parameters, this is a surface in $(d+1)$-dimensional space—$d$ dimensions for parameters and one for the loss value.
The gradient $\nabla J(\boldsymbol{\theta})$ at any point is a vector that points in the direction of steepest local increase of $J$, has magnitude equal to the rate of increase in that direction, and is orthogonal to the contour (level set) of $J$ passing through that point.
By moving in the negative gradient direction, we descend the loss surface toward lower values.
Imagine standing on a foggy mountain and trying to reach the lowest valley. The gradient tells you the direction of steepest uphill slope. By always walking opposite to this direction (downhill), you'll eventually reach a valley floor. The learning rate determines how large each step is—too large and you might overshoot the valley; too small and the journey takes forever.
Why Negative Gradient Works
Consider the first-order Taylor expansion of $J$ around the current point $\boldsymbol{\theta}^{(t)}$:
$$J(\boldsymbol{\theta}^{(t)} + \boldsymbol{\delta}) \approx J(\boldsymbol{\theta}^{(t)}) + \nabla J(\boldsymbol{\theta}^{(t)})^\top \boldsymbol{\delta}$$
To minimize $J$, we want to choose $\boldsymbol{\delta}$ such that $J(\boldsymbol{\theta}^{(t)} + \boldsymbol{\delta}) < J(\boldsymbol{\theta}^{(t)})$. This requires:
$$\nabla J(\boldsymbol{\theta}^{(t)})^\top \boldsymbol{\delta} < 0$$
The most negative this inner product can be (for a fixed $\|\boldsymbol{\delta}\|$) is achieved when $\boldsymbol{\delta}$ points exactly opposite to the gradient:
$$\boldsymbol{\delta} = -\eta \nabla J(\boldsymbol{\theta}^{(t)})$$
This is precisely the gradient descent update! The negative gradient is the direction of steepest local descent.
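To see this numerically, the following sketch compares the decrease in $J$ obtained by stepping the same distance along the negative gradient versus along two other directions. The quadratic and the specific directions are arbitrary choices for illustration, not taken from the text:

```python
import numpy as np

# Quadratic J(theta) = 0.5 * theta^T A theta with a fixed positive-definite A
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])

def J(theta):
    return 0.5 * theta @ A @ theta

theta = np.array([1.0, 2.0])
g = A @ theta                   # gradient of J at theta
step = 0.1

# Candidate directions, all of unit length
directions = {
    "negative gradient": -g / np.linalg.norm(g),
    "a fixed other direction": np.array([0.6, -0.8]),
    "coordinate axis": np.array([1.0, 0.0]),
}
for name, d in directions.items():
    decrease = J(theta) - J(theta + step * d)
    print(f"{name:24s}: decrease in J = {decrease:+.4f}")
# The negative-gradient direction yields the largest decrease for this step length.
```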
Visualizing Different Landscapes
The behavior of gradient descent varies dramatically depending on the loss landscape:
Convex quadratic: The ideal case. Contours are ellipsoids, and gradient descent moves directly toward the minimum (if the Hessian is isotropic) or follows a zig-zag path if contours are elongated.
Convex non-quadratic: Guaranteed to converge to the global minimum, but the path may be more complex.
Non-convex: The typical case in deep learning. Multiple local minima, saddle points, and flat regions make optimization challenging.
Ravines and narrow valleys: Gradient descent oscillates across the steep valley walls while making only slow progress along the valley floor—a major motivation for momentum methods (see the sketch after this list).
Plateau regions: Gradient nearly vanishes, causing extremely slow progress despite being far from any minimum.
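The ravine behavior is easy to reproduce. The sketch below uses an illustrative elongated quadratic (one steep and one shallow direction, with values chosen only for demonstration) and shows gradient descent oscillating across the steep axis while inching along the shallow one:

```python
import numpy as np

# Elongated quadratic: J(x, y) = 0.5 * (a * x^2 + b * y^2), steep in x, shallow in y
a, b = 20.0, 1.0              # condition number kappa = a / b = 20
eta = 0.09                    # close to the stability limit 2/a = 0.1 for the steep axis
theta = np.array([1.0, 1.0])

for t in range(10):
    grad = np.array([a * theta[0], b * theta[1]])
    theta = theta - eta * grad
    print(f"step {t:2d}: x = {theta[0]:+.4f}, y = {theta[1]:+.4f}")

# x flips sign on every step (oscillation across the ravine walls),
# while y decays only slowly (slow progress along the valley floor).
```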
Let's formalize the batch gradient descent algorithm and examine a rigorous implementation. Understanding the precise algorithmic steps is essential before we can appreciate the modifications introduced by more sophisticated optimizers.
Algorithm: Batch Gradient Descent
Input: Initial parameters θ₀, learning rate η, dataset D, max iterations T, tolerance ε
Output: Optimized parameters θ*
1. t ← 0
2. while t < T and ||∇J(θᵗ)|| > ε:
a. Compute gradient: g ← (1/N) Σᵢ ∇L(f_θᵗ(xⁱ), yⁱ) // Over ALL N examples
b. Update parameters: θᵗ⁺¹ ← θᵗ - η·g
c. t ← t + 1
3. return θᵗ
Notice that step 2a requires a full pass through the entire dataset for each parameter update. This is the computational signature of batch gradient descent.
```python
import numpy as np
from typing import Callable, Tuple, List


def batch_gradient_descent(
    X: np.ndarray,               # Training features (N x d)
    y: np.ndarray,               # Training targets (N,)
    theta_init: np.ndarray,      # Initial parameters (d,)
    loss_fn: Callable,           # Loss function L(predictions, targets)
    grad_fn: Callable,           # Gradient function ∇L w.r.t. theta
    learning_rate: float = 0.01,
    max_iterations: int = 1000,
    tolerance: float = 1e-6,
    verbose: bool = True
) -> Tuple[np.ndarray, List[float]]:
    """
    Full-batch gradient descent optimizer.

    Returns:
        theta: Optimized parameters
        loss_history: Loss value at each iteration
    """
    theta = theta_init.copy()
    N = X.shape[0]
    loss_history = []

    for t in range(max_iterations):
        # Step 1: Compute predictions for ALL examples
        predictions = X @ theta  # Assuming linear model for simplicity

        # Step 2: Compute loss over ENTIRE dataset
        current_loss = np.mean(loss_fn(predictions, y))
        loss_history.append(current_loss)

        # Step 3: Compute gradient over ENTIRE dataset
        # The gradient is averaged over all N examples
        gradient = grad_fn(X, y, theta, predictions)  # Shape: (d,)

        # Step 4: Check stopping criterion
        gradient_norm = np.linalg.norm(gradient)
        if gradient_norm < tolerance:
            if verbose:
                print(f"Converged at iteration {t} (gradient norm: {gradient_norm:.2e})")
            break

        # Step 5: Update parameters
        theta = theta - learning_rate * gradient

        if verbose and t % 100 == 0:
            print(f"Iteration {t}: Loss = {current_loss:.6f}, ||∇J|| = {gradient_norm:.6f}")

    return theta, loss_history


# Example: Linear regression with MSE loss
def mse_loss(predictions: np.ndarray, targets: np.ndarray) -> np.ndarray:
    """Mean squared error loss (per example)."""
    return 0.5 * (predictions - targets) ** 2


def mse_gradient(X: np.ndarray, y: np.ndarray, theta: np.ndarray,
                 predictions: np.ndarray) -> np.ndarray:
    """Gradient of MSE loss w.r.t. theta, averaged over all examples."""
    N = X.shape[0]
    residuals = predictions - y  # (N,)
    # ∇J = (1/N) * X^T * (X*theta - y)
    gradient = (1 / N) * X.T @ residuals
    return gradient


# Usage example
if __name__ == "__main__":
    # Generate synthetic data
    np.random.seed(42)
    N, d = 1000, 5
    X_true = np.random.randn(N, d)
    theta_true = np.array([1.0, -2.0, 0.5, 3.0, -1.5])
    y_true = X_true @ theta_true + 0.1 * np.random.randn(N)

    # Initialize parameters randomly
    theta_init = np.random.randn(d)

    # Run batch gradient descent
    theta_opt, history = batch_gradient_descent(
        X_true, y_true, theta_init,
        loss_fn=mse_loss,
        grad_fn=mse_gradient,
        learning_rate=0.1,
        max_iterations=500
    )

    print(f"\nTrue theta: {theta_true}")
    print(f"Learned theta: {theta_opt}")
    print(f"Parameter error: {np.linalg.norm(theta_opt - theta_true):.6f}")
```

The gradient MUST be computed over the entire dataset before making a single parameter update. A common mistake is to update parameters after each example—that's stochastic gradient descent, not batch gradient descent. In batch GD, one 'epoch' equals one parameter update.
Understanding when and how fast gradient descent converges is crucial for both theoretical guarantees and practical hyperparameter tuning. The convergence properties depend critically on the properties of the objective function.
Key Definitions
Before stating convergence results, we need several important definitions:
Lipschitz Continuity of the Gradient: The gradient $\nabla J$ is $L$-Lipschitz if: $$\|\nabla J(\boldsymbol{\theta}) - \nabla J(\boldsymbol{\phi})\| \leq L \|\boldsymbol{\theta} - \boldsymbol{\phi}\|$$ for all $\boldsymbol{\theta}, \boldsymbol{\phi}$. The constant $L$ is called the smoothness parameter.
Strong Convexity: The function $J$ is $\mu$-strongly convex if: $$J(\boldsymbol{\phi}) \geq J(\boldsymbol{\theta}) + \nabla J(\boldsymbol{\theta})^\top(\boldsymbol{\phi} - \boldsymbol{\theta}) + \frac{\mu}{2}\|\boldsymbol{\phi} - \boldsymbol{\theta}\|^2$$ Strong convexity ensures a unique global minimum and prevents 'flat' regions.
Condition Number: For a strongly convex function with $L$-Lipschitz gradient, the condition number is: $$\kappa = \frac{L}{\mu}$$ This ratio fundamentally governs convergence speed—higher $\kappa$ means slower convergence.
The condition number κ measures how 'elongated' the contours of the loss function are. For a quadratic loss, κ = λ_max/λ_min, where λ_max and λ_min are the largest and smallest eigenvalues of the Hessian. High κ means the loss decreases rapidly in some directions but slowly in others, causing oscillation and slow progress.
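For a quadratic loss the Hessian is constant, so $\kappa$ can be computed directly from its eigenvalues. A minimal sketch with an arbitrary example Hessian (the matrix below is illustrative):

```python
import numpy as np

# Example Hessian of a quadratic loss (assumed symmetric positive definite)
H = np.array([[10.0, 2.0],
              [ 2.0, 1.0]])

eigvals = np.linalg.eigvalsh(H)        # eigenvalues in ascending order
mu, L_smooth = eigvals[0], eigvals[-1]
kappa = L_smooth / mu

print(f"mu = {mu:.3f}, L = {L_smooth:.3f}, kappa = {kappa:.1f}")
print(f"per-step contraction factor (1 - 1/kappa) = {1 - 1/kappa:.3f}")
```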
Convergence Theorems
Theorem 1 (Convergence for Smooth Functions)
If $J$ has $L$-Lipschitz continuous gradient and $\eta \leq \frac{1}{L}$, then batch gradient descent satisfies:
$$J(\boldsymbol{\theta}^{(T)}) - J(\boldsymbol{\theta}^*) \leq \frac{\|\boldsymbol{\theta}^{(0)} - \boldsymbol{\theta}^*\|^2}{2\eta T}$$
This is a sublinear convergence rate of $O(1/T)$—to reduce the error by a factor of 10, we need 10× more iterations.
Theorem 2 (Convergence for Strongly Convex Functions)
If $J$ is $\mu$-strongly convex with $L$-Lipschitz gradient and $\eta = \frac{1}{L}$, then:
$$\|\boldsymbol{\theta}^{(T)} - \boldsymbol{\theta}^*\|^2 \leq \left(1 - \frac{\mu}{L}\right)^T \|\boldsymbol{\theta}^{(0)} - \boldsymbol{\theta}^*\|^2$$
This is linear convergence (also called geometric or exponential convergence). The error decreases by a constant factor $(1 - 1/\kappa)$ at each step.
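As a quick numerical sanity check of this rate, the sketch below runs gradient descent with $\eta = 1/L$ on an illustrative strongly convex quadratic (whose $\mu$ and $L$ are simply the smallest and largest Hessian eigenvalues) and compares the per-step error ratio against the bound $1 - \mu/L$:

```python
import numpy as np

# Strongly convex quadratic J(theta) = 0.5 * theta^T H theta, minimizer theta* = 0
H = np.array([[4.0, 0.0],
              [0.0, 1.0]])
mu, L_smooth = 1.0, 4.0       # smallest / largest eigenvalues of H
eta = 1.0 / L_smooth          # step size used in Theorem 2

theta = np.array([3.0, -2.0])
prev_err = np.linalg.norm(theta) ** 2

for t in range(5):
    theta = theta - eta * (H @ theta)
    err = np.linalg.norm(theta) ** 2
    print(f"step {t}: ||theta - theta*||^2 = {err:.4f}, "
          f"ratio = {err / prev_err:.3f} (bound: {1 - mu / L_smooth:.3f})")
    prev_err = err
```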
| Function Class | Rate | Iterations for ε-accuracy | Learning Rate |
|---|---|---|---|
| Convex, smooth | $O(1/T)$ | $O(1/\epsilon)$ | $\eta \leq 1/L$ |
| Strongly convex, smooth | $O((1-\mu/L)^T)$ | $O(\kappa \log(1/\epsilon))$ | $\eta = 1/L$ |
| Non-convex, smooth | To stationary point | $O(1/\epsilon^2)$ | $\eta \leq 1/L$ |
Practical Implications
Learning rate bounds: The convergence theorems provide concrete guidance: $\eta \leq 1/L$ ensures convergence. If you see divergence (loss exploding), your learning rate likely violates this bound.
Condition number determines difficulty: Problems with $\kappa \approx 1$ (isotropic) converge quickly. Problems with $\kappa \gg 1$ (ill-conditioned) converge slowly and may require preconditioning or adaptive methods.
Non-convex reality: Deep learning loss surfaces are non-convex. We can't guarantee finding global minima, but gradient descent still finds stationary points (where $\nabla J \approx 0$).
The learning rate $\eta$ is the single most critical hyperparameter in gradient descent. Its value determines whether optimization succeeds, fails spectacularly, or crawls painfully slowly. Let's develop a deep understanding of learning rate dynamics.
The Stability Analysis
Consider optimizing a simple quadratic: $J(\theta) = \frac{1}{2}a\theta^2$ where $a > 0$. The gradient is $\nabla J = a\theta$, so the update becomes:
$$\theta^{(t+1)} = \theta^{(t)} - \eta \cdot a \cdot \theta^{(t)} = (1 - \eta a)\theta^{(t)}$$
For convergence to $\theta^* = 0$, we need $|1 - \eta a| < 1$, which requires:
$$0 < \eta < \frac{2}{a}$$
This is the stability condition. When $\eta > 2/a$, the updates overshoot and oscillate with increasing magnitude—the optimization diverges.
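A few lines of Python make the stability condition visible; the curvature $a = 1$ and the learning rates below are illustrative choices:

```python
a = 1.0                        # curvature of J(theta) = 0.5 * a * theta^2
theta0 = 1.0

for eta in (0.5, 1.9, 2.5):    # below, just below, and above the limit 2/a = 2
    theta = theta0
    for _ in range(10):
        theta = (1 - eta * a) * theta
    print(f"eta = {eta}: theta after 10 steps = {theta:.4f}")

# eta = 0.5 converges smoothly, eta = 1.9 oscillates but still shrinks,
# eta = 2.5 violates the stability condition and diverges.
```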
If your loss suddenly jumps to NaN or infinity during training, the most likely cause is a learning rate that's too large. This violates the stability condition. The fix is simple: reduce the learning rate.
The Three Regimes of Learning Rate
1. Too Large ($\eta > 2/L$): Divergence. Updates overshoot the minimum and grow in magnitude; the loss increases and eventually explodes.
2. Too Small ($\eta \ll 1/L$): Slow Convergence. Each step makes only a tiny amount of progress, so reaching a given accuracy requires far more iterations.
3. Just Right ($\eta \approx 1/L$): Optimal Convergence. Steps are as large as stability allows, giving the fastest progress achievable with a fixed step size.
Finding the Right Learning Rate
In practice, since we don't know $L$ exactly, we use empirical strategies:
Grid search: Try values $\{10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}\}$ and pick the largest that converges
Learning rate finder: Exponentially increase $\eta$ during a short training run and plot loss vs. learning rate. Choose $\eta$ just before loss starts increasing.
Backtracking line search: Start with a large $\eta$ and halve it until the Armijo sufficient-decrease condition $J(\boldsymbol{\theta} - \eta \nabla J) \leq J(\boldsymbol{\theta}) - c\,\eta\,\|\nabla J\|^2$ holds, for a small constant such as $c = 10^{-4}$ (a minimal sketch appears after this list).
Adaptive methods: Use algorithms like Adam that automatically adjust per-parameter learning rates (covered later in this module).
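Here is a minimal sketch of backtracking line search on an illustrative quadratic. The constants ($\eta_0 = 1$, $c = 10^{-4}$, halving factor $0.5$) are common default choices, not values from the text:

```python
import numpy as np

def backtracking_line_search(J, grad_J, theta, eta0=1.0, c=1e-4, shrink=0.5):
    """Halve the step size until the Armijo sufficient-decrease condition holds."""
    g = grad_J(theta)
    eta = eta0
    while J(theta - eta * g) > J(theta) - c * eta * np.dot(g, g):
        eta *= shrink
    return eta

# Toy quadratic J(theta) = 0.5 * theta^T H theta
H = np.array([[5.0, 0.0],
              [0.0, 1.0]])

def J(theta):
    return 0.5 * theta @ H @ theta

def grad_J(theta):
    return H @ theta

theta = np.array([1.0, 1.0])
for t in range(5):
    eta = backtracking_line_search(J, grad_J, theta)
    theta = theta - eta * grad_J(theta)
    print(f"step {t}: eta = {eta:.3f}, J(theta) = {J(theta):.6f}")
```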
Understanding the computational cost of batch gradient descent is essential for determining its applicability to different problem scales. Let's analyze the time and space complexity rigorously.
Time Complexity per Iteration
For a dataset with $N$ examples and a model with $d$ parameters:
Forward pass: Computing $f_\theta(\mathbf{x}^{(i)})$ for all $N$ examples: $O(N \cdot C_f)$ where $C_f$ is the cost of one forward pass
Gradient computation: Computing $\nabla_\theta \mathcal{L}$ for all $N$ examples via backpropagation: $O(N \cdot C_b)$ where $C_b \approx C_f$ (backward pass is similar cost)
Gradient averaging: Summing $N$ gradients: $O(N \cdot d)$
Parameter update: One vector subtraction: $O(d)$
For neural networks, $C_f$ and $C_b$ are proportional to the number of parameters $d$, giving:
$$\text{Time per iteration} = O(N \cdot d)$$
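The linear scaling in $N$ can be observed directly. The timing sketch below uses the linear-regression gradient from earlier with illustrative problem sizes (small enough to fit comfortably in memory):

```python
import time
import numpy as np

d = 200                                   # number of parameters (illustrative)
theta = np.zeros(d)
rng = np.random.default_rng(0)

for N in (10_000, 50_000, 250_000):       # dataset sizes (illustrative)
    X = rng.standard_normal((N, d))
    y = rng.standard_normal(N)
    start = time.perf_counter()
    gradient = X.T @ (X @ theta - y) / N  # full-batch gradient: O(N * d) work
    elapsed = time.perf_counter() - start
    print(f"N = {N:>7,d}: gradient time = {elapsed:.4f} s")

# Each 5x increase in N multiplies the per-update cost by roughly 5x.
```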
| Operation | Time Complexity | Space Complexity | Bottleneck For |
|---|---|---|---|
| Single forward pass | $O(d)$ | $O(d)$ | Very deep networks |
| Full-batch forward | $O(N \cdot d)$ | $O(N \cdot d)$ | Large datasets |
| Single backprop | $O(d)$ | $O(d)$ | Memory for activations |
| Full-batch gradient | $O(N \cdot d)$ | $O(d)$ | Time on large $N$ |
| Parameter update | $O(d)$ | $O(d)$ | Rarely the bottleneck |
The Scalability Problem
Let's make this concrete with real numbers:
| Dataset Size ($N$) | Parameters ($d$) | Ops per Iteration | At 100 GFLOPS |
|---|---|---|---|
| 10,000 | 1 million | 10 billion | ~0.1 seconds |
| 1 million | 1 million | 1 trillion | ~10 seconds |
| 10 million | 1 million | 10 trillion | ~2 minutes |
| 100 million | 1 million | 100 trillion | ~17 minutes |
For a modern dataset with hundreds of millions of examples, a single gradient computation already takes many minutes, and real networks require several floating-point operations per parameter per example, inflating these times further. Since training typically requires thousands of iterations, total training time stretches into weeks or months, making batch gradient descent completely impractical.
This is the fundamental problem that stochastic gradient descent solves: instead of using all $N$ examples per update, use a small subset. We trade gradient accuracy for computational speed—and as we'll see, this trade-off is extremely favorable.
Batch gradient descent computes the most accurate gradient possible but pays a price: O(N) computation per update. Modern datasets have N in the millions or billions. This makes batch GD impractical for most real-world deep learning—motivating the shift to stochastic methods.
While batch gradient descent has strong theoretical properties, it suffers from several practical limitations that severely restrict its usefulness in modern machine learning.
Limitation 1: Computational Expense
As analyzed above, the $O(N \cdot d)$ cost per iteration becomes prohibitive for large datasets. But it's worse than just slow—the cost is unavoidable for each update. Even if the first 100 examples would give you a 'good enough' gradient direction, you still must process all $N$ examples before making any progress.
Limitation 2: Memory Requirements
Batch gradient descent requires either holding the entire dataset in memory or streaming all of it from disk for every single parameter update.
Modern datasets can be terabytes in size. Holding them in RAM is simply not feasible.
Limitation 3: No Implicit Regularization
Stochastic methods benefit from noise in the gradient estimates, which acts as a form of regularization. Batch gradient descent, with its exact gradients, lacks this beneficial noise. This often leads to finding sharper minima that generalize worse.
Limitation 4: Sensitivity to Curvature
A single learning rate must work for all parameters. But different parameters may need different step sizes: directions of high curvature need small steps to avoid oscillation, while directions of low curvature need large steps to make meaningful progress.
Using one fixed $\eta$ for all parameters is fundamentally suboptimal. This motivates adaptive learning rate methods like AdaGrad, RMSprop, and Adam, which we'll cover later.
Limitation 5: Deterministic Path
Batch gradient descent follows a completely deterministic path given the initial parameters. This means every run from the same initialization traces the identical trajectory to the identical solution: there is no exploration of the loss landscape and no chance of escaping a poor basin of attraction.
Despite its limitations, batch gradient descent remains the right choice in certain scenarios. Understanding these use cases helps you make informed optimizer choices.
Appropriate Use Cases
Very small datasets ($N < 1000$): When the entire dataset fits easily in memory and gradient computation is fast, the simplicity and stability of batch GD is advantageous.
Convex optimization problems: For convex losses (logistic regression, linear regression, SVMs), batch GD provides reliable convergence with well-understood behavior.
When exact gradients are required: Some scenarios require precise gradients, such as second-order optimization methods that depend on accurate Hessian-vector products.
Debugging and prototyping: When developing new models, batch GD's deterministic behavior makes debugging easier. You can reproduce exact training trajectories.
Fine-tuning with stability requirements: When stability matters more than speed (e.g., fine-tuning a nearly-converged model), exact gradients prevent the noise-induced oscillation of stochastic methods.
| Consider Batch GD When | Avoid Batch GD When |
|---|---|
| Dataset has < 1,000 examples | Dataset has > 10,000 examples |
| Problem is convex | Problem is highly non-convex |
| Memory allows full dataset | Dataset doesn't fit in memory |
| Need deterministic training | Training speed is critical |
| Debugging model issues | Production training at scale |
In contemporary deep learning practice, batch gradient descent is rarely used directly. Mini-batch SGD or adaptive methods dominate. However, understanding batch GD is essential—it's the baseline from which all other methods are derived, and many theoretical results assume this setting.
Historical and Pedagogical Importance
Even if you never use pure batch gradient descent in practice, understanding it deeply is valuable:
Foundation for theory: Convergence theorems, learning rate analysis, and stability conditions extend (with modifications) to all gradient-based methods.
Baseline for comparison: When evaluating new optimizers, batch GD provides a clear baseline. Any new method should outperform it on relevant metrics.
Building intuition: The geometric and mathematical understanding developed here applies directly to understanding more sophisticated methods.
In the next pages, we'll see how stochastic gradient descent addresses the major limitations of batch GD while preserving the core gradient descent paradigm.
We've completed a thorough examination of batch gradient descent—the foundational optimization algorithm. Let's consolidate the key insights:

- The update rule $\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \eta \nabla J(\boldsymbol{\theta}^{(t)})$ uses a gradient averaged over all $N$ training examples, so one epoch equals one parameter update.
- The negative gradient is the direction of steepest local descent, as the first-order Taylor expansion shows.
- Convergence is $O(1/T)$ for smooth convex objectives and linear, at rate $(1 - \mu/L)^T$, for strongly convex ones; the condition number $\kappa = L/\mu$ governs the speed.
- The learning rate must respect a stability bound (roughly $\eta < 2/L$): too large diverges, too small crawls.
- Each iteration costs $O(N \cdot d)$ time, which makes batch GD impractical for large datasets and motivates stochastic methods.
What's Next
The next page introduces Stochastic Gradient Descent (SGD)—the revolutionary modification that makes gradient-based optimization practical for large-scale machine learning. We'll see how using a single example (or small batch) instead of the full dataset dramatically changes the optimization dynamics while still converging to good solutions.
You now have a rigorous understanding of batch gradient descent—its mathematics, geometry, convergence properties, implementation, and limitations. This foundation is essential for understanding why stochastic methods work and when to apply different optimization strategies.