At the heart of virtually every machine learning algorithm lies a deceptively simple idea: find the parameters that minimize a loss function by iteratively moving in the direction of steepest descent. This is gradient descent—the workhorse algorithm that powers everything from logistic regression to transformers with billions of parameters.
Batch gradient descent, also known as full-batch gradient descent or simply vanilla gradient descent, represents the purest form of this optimization paradigm. Before we can appreciate the sophistication of modern optimizers like Adam or understand why stochastic methods work, we must first master this foundational algorithm in its complete, unmodified form.
This page provides an exhaustive treatment of batch gradient descent: its mathematical foundations, geometric intuitions, convergence guarantees, implementation considerations, and the fundamental limitations that motivated all subsequent innovations in optimization.
By the end of this page, you will understand the precise mathematical formulation of batch gradient descent, visualize its behavior in different loss landscapes, derive its convergence conditions, implement it correctly from first principles, and identify when it is (and isn't) the right choice for optimization.
To understand batch gradient descent rigorously, we must first establish the precise mathematical setting. Consider a supervised learning problem where we have a dataset of N training examples:
$$\mathcal{D} = \{(\mathbf{x}^{(1)}, y^{(1)}), (\mathbf{x}^{(2)}, y^{(2)}), \ldots, (\mathbf{x}^{(N)}, y^{(N)})\}$$
Our goal is to find parameters $\boldsymbol{\theta} \in \mathbb{R}^d$ that minimize the empirical risk—the average loss over all training examples:
$$J(\boldsymbol{\theta}) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(f_{\boldsymbol{\theta}}(\mathbf{x}^{(i)}), y^{(i)})$$
Here, $f_{\boldsymbol{\theta}}$ is our model parameterized by $\boldsymbol{\theta}$, and $\mathcal{L}$ is a loss function measuring the discrepancy between predictions and targets.
We use $J(\boldsymbol{\theta})$ to denote the objective function (total cost across all examples) and $\mathcal{L}$ for the per-example loss. The gradient $\nabla_{\boldsymbol{\theta}} J$ is a vector in $\mathbb{R}^d$ pointing in the direction of steepest increase of $J$.
The Gradient Descent Update Rule
Batch gradient descent updates parameters by moving in the opposite direction of the gradient:
$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \eta \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}^{(t)})$$
where $\eta > 0$ is the learning rate, $t$ indexes the iteration, and $\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}^{(t)})$ is the gradient of the objective evaluated at the current parameters (see the symbol table below).
Expanding the gradient explicitly:
$$\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = \frac{1}{N} \sum_{i=1}^{N} \nabla_{\boldsymbol{\theta}} \mathcal{L}(f_{\boldsymbol{\theta}}(\mathbf{x}^{(i)}), y^{(i)})$$
This is the defining characteristic of batch gradient descent: at each iteration, we compute gradients with respect to every single training example and average them before making a parameter update.
| Symbol | Name | Role | Typical Range |
|---|---|---|---|
| $\boldsymbol{\theta}$ | Parameters | The values we're optimizing | Depends on model architecture |
| $\eta$ | Learning rate | Controls step size | $10^{-5}$ to $10^{-1}$ |
| $\nabla J$ | Gradient | Direction of steepest ascent | Computed, not chosen |
| $t$ | Iteration | Current optimization step | $0, 1, 2, \ldots$ |
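To make the update rule concrete, here is a minimal NumPy sketch of a single full-batch update, assuming a linear model with mean-squared-error loss so that $\nabla J(\boldsymbol{\theta}) = \frac{1}{N} X^\top (X\boldsymbol{\theta} - \mathbf{y})$ (the data and sizes below are illustrative, not from the text):

```python
import numpy as np

# Toy linear-regression problem (illustrative sizes and values)
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))          # N = 100 examples, d = 3 parameters
y = X @ np.array([2.0, -1.0, 0.5]) + 0.05 * rng.standard_normal(100)

theta = np.zeros(3)                        # initial parameters
eta = 0.1                                  # learning rate

# One batch gradient descent update: the gradient is averaged over ALL N examples
gradient = X.T @ (X @ theta - y) / len(y)  # (1/N) X^T (X theta - y)
theta = theta - eta * gradient
```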
Understanding gradient descent geometrically provides crucial intuition for debugging optimization problems and selecting hyperparameters. Let's build this understanding from first principles.
The Loss Surface
The objective function $J(\boldsymbol{\theta})$ defines a loss surface (or loss landscape) over the parameter space. For a model with $d$ parameters, this is a surface in $(d+1)$-dimensional space—$d$ dimensions for parameters and one for the loss value.
The gradient $\nabla J(\boldsymbol{\theta})$ at any point is a vector that points in the direction of steepest local increase of $J$, has magnitude equal to the rate of increase in that direction, and is orthogonal to the contour (level set) of $J$ passing through that point.
By moving in the negative gradient direction, we descend the loss surface toward lower values.
Imagine standing on a foggy mountain and trying to reach the lowest valley. The gradient tells you the direction of steepest uphill slope. By always walking opposite to this direction (downhill), you'll eventually reach a valley floor. The learning rate determines how large each step is—too large and you might overshoot the valley; too small and the journey takes forever.
Why Negative Gradient Works
Consider the first-order Taylor expansion of $J$ around the current point $\boldsymbol{\theta}^{(t)}$:
$$J(\boldsymbol{\theta}^{(t)} + \boldsymbol{\delta}) \approx J(\boldsymbol{\theta}^{(t)}) + \nabla J(\boldsymbol{\theta}^{(t)})^\top \boldsymbol{\delta}$$
To minimize $J$, we want to choose $\boldsymbol{\delta}$ such that $J(\boldsymbol{\theta}^{(t)} + \boldsymbol{\delta}) < J(\boldsymbol{\theta}^{(t)})$. This requires:
$$\nabla J(\boldsymbol{\theta}^{(t)})^\top \boldsymbol{\delta} < 0$$
The most negative this inner product can be (for a fixed $\|\boldsymbol{\delta}\|$) is achieved when $\boldsymbol{\delta}$ points exactly opposite to the gradient:
$$\boldsymbol{\delta} = -\eta \nabla J(\boldsymbol{\theta}^{(t)})$$
This is precisely the gradient descent update! The negative gradient is the direction of steepest local descent.
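To see this numerically, the following sketch compares the decrease in $J$ obtained by stepping the same distance along the negative gradient versus along two other directions. The quadratic and the specific directions are arbitrary choices for illustration, not taken from the text:

```python
import numpy as np

# Quadratic J(theta) = 0.5 * theta^T A theta with a fixed positive-definite A
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])

def J(theta):
    return 0.5 * theta @ A @ theta

theta = np.array([1.0, 2.0])
g = A @ theta                   # gradient of J at theta
step = 0.1

# Candidate directions, all of unit length
directions = {
    "negative gradient": -g / np.linalg.norm(g),
    "a fixed other direction": np.array([0.6, -0.8]),
    "coordinate axis": np.array([1.0, 0.0]),
}
for name, d in directions.items():
    decrease = J(theta) - J(theta + step * d)
    print(f"{name:24s}: decrease in J = {decrease:+.4f}")
# The negative-gradient direction yields the largest decrease for this step length.
```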
Visualizing Different Landscapes
The behavior of gradient descent varies dramatically depending on the loss landscape:
Convex quadratic: The ideal case. Contours are ellipsoids, and gradient descent moves directly toward the minimum (if the Hessian is isotropic) or follows a zig-zag path if contours are elongated.
Convex non-quadratic: Guaranteed to converge to the global minimum, but the path may be more complex.
Non-convex: The typical case in deep learning. Multiple local minima, saddle points, and flat regions make optimization challenging.
Ravines and narrow valleys: Gradient descent oscillates across the steep valley walls while making only slow progress along the valley floor—a major motivation for momentum methods (see the sketch after this list).
Plateau regions: Gradient nearly vanishes, causing extremely slow progress despite being far from any minimum.
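The ravine behavior is easy to reproduce. The sketch below uses an illustrative elongated quadratic (one steep and one shallow direction, with values chosen only for demonstration) and shows gradient descent oscillating across the steep axis while inching along the shallow one:

```python
import numpy as np

# Elongated quadratic: J(x, y) = 0.5 * (a * x^2 + b * y^2), steep in x, shallow in y
a, b = 20.0, 1.0              # condition number kappa = a / b = 20
eta = 0.09                    # close to the stability limit 2/a = 0.1 for the steep axis
theta = np.array([1.0, 1.0])

for t in range(10):
    grad = np.array([a * theta[0], b * theta[1]])
    theta = theta - eta * grad
    print(f"step {t:2d}: x = {theta[0]:+.4f}, y = {theta[1]:+.4f}")

# x flips sign on every step (oscillation across the ravine walls),
# while y decays only slowly (slow progress along the valley floor).
```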
Let's formalize the batch gradient descent algorithm and examine a rigorous implementation. Understanding the precise algorithmic steps is essential before we can appreciate the modifications introduced by more sophisticated optimizers.
Algorithm: Batch Gradient Descent
Input: Initial parameters θ₀, learning rate η, dataset D, max iterations T, tolerance ε
Output: Optimized parameters θ*
1. t ← 0
2. while t < T and ||∇J(θᵗ)|| > ε:
a. Compute gradient: g ← (1/N) Σᵢ ∇L(f_θᵗ(xⁱ), yⁱ) // Over ALL N examples
b. Update parameters: θᵗ⁺¹ ← θᵗ - η·g
c. t ← t + 1
3. return θᵗ
Notice that step 2a requires a full pass through the entire dataset for each parameter update. This is the computational signature of batch gradient descent.
```python
import numpy as np
from typing import Callable, Tuple, List


def batch_gradient_descent(
    X: np.ndarray,               # Training features (N x d)
    y: np.ndarray,               # Training targets (N,)
    theta_init: np.ndarray,      # Initial parameters (d,)
    loss_fn: Callable,           # Loss function L(predictions, targets)
    grad_fn: Callable,           # Gradient function ∇L w.r.t. theta
    learning_rate: float = 0.01,
    max_iterations: int = 1000,
    tolerance: float = 1e-6,
    verbose: bool = True
) -> Tuple[np.ndarray, List[float]]:
    """
    Full-batch gradient descent optimizer.

    Returns:
        theta: Optimized parameters
        loss_history: Loss value at each iteration
    """
    theta = theta_init.copy()
    N = X.shape[0]
    loss_history = []

    for t in range(max_iterations):
        # Step 1: Compute predictions for ALL examples
        predictions = X @ theta  # Assuming linear model for simplicity

        # Step 2: Compute loss over ENTIRE dataset
        current_loss = np.mean(loss_fn(predictions, y))
        loss_history.append(current_loss)

        # Step 3: Compute gradient over ENTIRE dataset
        # The gradient is averaged over all N examples
        gradient = grad_fn(X, y, theta, predictions)  # Shape: (d,)

        # Step 4: Check stopping criterion
        gradient_norm = np.linalg.norm(gradient)
        if gradient_norm < tolerance:
            if verbose:
                print(f"Converged at iteration {t} (gradient norm: {gradient_norm:.2e})")
            break

        # Step 5: Update parameters
        theta = theta - learning_rate * gradient

        if verbose and t % 100 == 0:
            print(f"Iteration {t}: Loss = {current_loss:.6f}, ||∇J|| = {gradient_norm:.6f}")

    return theta, loss_history


# Example: Linear regression with MSE loss
def mse_loss(predictions: np.ndarray, targets: np.ndarray) -> np.ndarray:
    """Mean squared error loss (per example)."""
    return 0.5 * (predictions - targets) ** 2


def mse_gradient(X: np.ndarray, y: np.ndarray, theta: np.ndarray,
                 predictions: np.ndarray) -> np.ndarray:
    """Gradient of MSE loss w.r.t. theta, averaged over all examples."""
    N = X.shape[0]
    residuals = predictions - y  # (N,)
    # ∇J = (1/N) * X^T * (X*theta - y)
    gradient = (1 / N) * X.T @ residuals
    return gradient


# Usage example
if __name__ == "__main__":
    # Generate synthetic data
    np.random.seed(42)
    N, d = 1000, 5
    X_true = np.random.randn(N, d)
    theta_true = np.array([1.0, -2.0, 0.5, 3.0, -1.5])
    y_true = X_true @ theta_true + 0.1 * np.random.randn(N)

    # Initialize parameters randomly
    theta_init = np.random.randn(d)

    # Run batch gradient descent
    theta_opt, history = batch_gradient_descent(
        X_true, y_true, theta_init,
        loss_fn=mse_loss,
        grad_fn=mse_gradient,
        learning_rate=0.1,
        max_iterations=500
    )

    print(f"\nTrue theta: {theta_true}")
    print(f"Learned theta: {theta_opt}")
    print(f"Parameter error: {np.linalg.norm(theta_opt - theta_true):.6f}")
```

The gradient MUST be computed over the entire dataset before making a single parameter update. A common mistake is to update parameters after each example—that's stochastic gradient descent, not batch gradient descent. In batch GD, one 'epoch' equals one parameter update.
Understanding when and how fast gradient descent converges is crucial for both theoretical guarantees and practical hyperparameter tuning. The convergence properties depend critically on the properties of the objective function.
Key Definitions
Before stating convergence results, we need several important definitions:
Lipschitz Continuity of the Gradient: The gradient $\nabla J$ is $L$-Lipschitz if: $$\|\nabla J(\boldsymbol{\theta}) - \nabla J(\boldsymbol{\phi})\| \leq L \|\boldsymbol{\theta} - \boldsymbol{\phi}\|$$ for all $\boldsymbol{\theta}, \boldsymbol{\phi}$. The constant $L$ is called the smoothness parameter.
Strong Convexity: The function $J$ is $\mu$-strongly convex if: $$J(\boldsymbol{\phi}) \geq J(\boldsymbol{\theta}) + \nabla J(\boldsymbol{\theta})^\top(\boldsymbol{\phi} - \boldsymbol{\theta}) + \frac{\mu}{2}\|\boldsymbol{\phi} - \boldsymbol{\theta}\|^2$$ Strong convexity ensures a unique global minimum and prevents 'flat' regions.
Condition Number: For a strongly convex function with $L$-Lipschitz gradient, the condition number is: $$\kappa = \frac{L}{\mu}$$ This ratio fundamentally governs convergence speed—higher $\kappa$ means slower convergence.
The condition number κ measures how 'elongated' the contours of the loss function are. For a quadratic loss, κ = λ_max/λ_min, where λ_max and λ_min are the largest and smallest eigenvalues of the Hessian. High κ means the loss decreases rapidly in some directions but slowly in others, causing oscillation and slow progress.
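For a quadratic loss the Hessian is constant, so $\kappa$ can be computed directly from its eigenvalues. A minimal sketch with an arbitrary example Hessian (the matrix below is illustrative):

```python
import numpy as np

# Example Hessian of a quadratic loss (assumed symmetric positive definite)
H = np.array([[10.0, 2.0],
              [ 2.0, 1.0]])

eigvals = np.linalg.eigvalsh(H)        # eigenvalues in ascending order
mu, L_smooth = eigvals[0], eigvals[-1]
kappa = L_smooth / mu

print(f"mu = {mu:.3f}, L = {L_smooth:.3f}, kappa = {kappa:.1f}")
print(f"per-step contraction factor (1 - 1/kappa) = {1 - 1/kappa:.3f}")
```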
Convergence Theorems
Theorem 1 (Convergence for Smooth Functions)
If $J$ has $L$-Lipschitz continuous gradient and $\eta \leq \frac{1}{L}$, then batch gradient descent satisfies:
$$J(\boldsymbol{\theta}^{(T)}) - J(\boldsymbol{\theta}^*) \leq \frac{\|\boldsymbol{\theta}^{(0)} - \boldsymbol{\theta}^*\|^2}{2\eta T}$$
This is a sublinear convergence rate of $O(1/T)$—to reduce the error by a factor of 10, we need 10× more iterations.
Theorem 2 (Convergence for Strongly Convex Functions)
If $J$ is $\mu$-strongly convex with $L$-Lipschitz gradient and $\eta = \frac{1}{L}$, then:
$$\|\boldsymbol{\theta}^{(T)} - \boldsymbol{\theta}^*\|^2 \leq \left(1 - \frac{\mu}{L}\right)^T \|\boldsymbol{\theta}^{(0)} - \boldsymbol{\theta}^*\|^2$$
This is linear convergence (also called geometric or exponential convergence). The error decreases by a constant factor $(1 - 1/\kappa)$ at each step.
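As a quick numerical sanity check of this rate, the sketch below runs gradient descent with $\eta = 1/L$ on an illustrative strongly convex quadratic (whose $\mu$ and $L$ are simply the smallest and largest Hessian eigenvalues) and compares the per-step error ratio against the bound $1 - \mu/L$:

```python
import numpy as np

# Strongly convex quadratic J(theta) = 0.5 * theta^T H theta, minimizer theta* = 0
H = np.array([[4.0, 0.0],
              [0.0, 1.0]])
mu, L_smooth = 1.0, 4.0       # smallest / largest eigenvalues of H
eta = 1.0 / L_smooth          # step size used in Theorem 2

theta = np.array([3.0, -2.0])
prev_err = np.linalg.norm(theta) ** 2

for t in range(5):
    theta = theta - eta * (H @ theta)
    err = np.linalg.norm(theta) ** 2
    print(f"step {t}: ||theta - theta*||^2 = {err:.4f}, "
          f"ratio = {err / prev_err:.3f} (bound: {1 - mu / L_smooth:.3f})")
    prev_err = err
```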
| Function Class | Rate | Iterations for ε-accuracy | Learning Rate |
|---|---|---|---|
| Convex, smooth | $O(1/T)$ | $O(1/\epsilon)$ | $\eta \leq 1/L$ |
| Strongly convex, smooth | $O((1-\mu/L)^T)$ | $O(\kappa \log(1/\epsilon))$ | $\eta = 1/L$ |
| Non-convex, smooth | To stationary point | $O(1/\epsilon^2)$ | $\eta \leq 1/L$ |
Practical Implications
Learning rate bounds: The convergence theorems provide concrete guidance: $\eta \leq 1/L$ ensures convergence. If you see divergence (loss exploding), your learning rate likely violates this bound.
Condition number determines difficulty: Problems with $\kappa \approx 1$ (isotropic) converge quickly. Problems with $\kappa \gg 1$ (ill-conditioned) converge slowly and may require preconditioning or adaptive methods.
Non-convex reality: Deep learning loss surfaces are non-convex. We can't guarantee finding global minima, but gradient descent still finds stationary points (where $\nabla J \approx 0$).
The learning rate $\eta$ is the single most critical hyperparameter in gradient descent. Its value determines whether optimization succeeds, fails spectacularly, or crawls painfully slowly. Let's develop a deep understanding of learning rate dynamics.
The Stability Analysis
Consider optimizing a simple quadratic: $J(\theta) = \frac{1}{2}a\theta^2$ where $a > 0$. The gradient is $\nabla J = a\theta$, so the update becomes:
$$\theta^{(t+1)} = \theta^{(t)} - \eta \cdot a \cdot \theta^{(t)} = (1 - \eta a)\theta^{(t)}$$
For convergence to $\theta^* = 0$, we need $|1 - \eta a| < 1$, which requires:
$$0 < \eta < \frac{2}{a}$$
This is the stability condition. When $\eta > 2/a$, the updates overshoot and oscillate with increasing magnitude—the optimization diverges.
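A few lines of Python make the stability condition visible; the curvature $a = 1$ and the learning rates below are illustrative choices:

```python
a = 1.0                        # curvature of J(theta) = 0.5 * a * theta^2
theta0 = 1.0

for eta in (0.5, 1.9, 2.5):    # below, just below, and above the limit 2/a = 2
    theta = theta0
    for _ in range(10):
        theta = (1 - eta * a) * theta
    print(f"eta = {eta}: theta after 10 steps = {theta:.4f}")

# eta = 0.5 converges smoothly, eta = 1.9 oscillates but still shrinks,
# eta = 2.5 violates the stability condition and diverges.
```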
If your loss suddenly jumps to NaN or infinity during training, the most likely cause is a learning rate that's too large. This violates the stability condition. The fix is simple: reduce the learning rate.
The Three Regimes of Learning Rate
1. Too Large ($\eta > 2/L$): Divergence. Updates overshoot the minimum and grow in magnitude; the loss increases and eventually explodes.
2. Too Small ($\eta \ll 1/L$): Slow Convergence. Each step makes only a tiny amount of progress, so reaching a given accuracy requires far more iterations.
3. Just Right ($\eta \approx 1/L$): Optimal Convergence. Steps are as large as stability allows, giving the fastest progress achievable with a fixed step size.
Finding the Right Learning Rate
In practice, since we don't know $L$ exactly, we use empirical strategies:
Grid search: Try values $\{10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}\}$ and pick the largest that converges
Learning rate finder: Exponentially increase $\eta$ during a short training run and plot loss vs. learning rate. Choose $\eta$ just before loss starts increasing.
Backtracking line search: Start with a large $\eta$ and halve it until the Armijo sufficient-decrease condition $J(\boldsymbol{\theta} - \eta \nabla J) \leq J(\boldsymbol{\theta}) - c\,\eta\,\|\nabla J\|^2$ holds, for a small constant such as $c = 10^{-4}$ (a minimal sketch appears after this list).
Adaptive methods: Use algorithms like Adam that automatically adjust per-parameter learning rates (covered later in this module).
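Here is a minimal sketch of backtracking line search on an illustrative quadratic. The constants ($\eta_0 = 1$, $c = 10^{-4}$, halving factor $0.5$) are common default choices, not values from the text:

```python
import numpy as np

def backtracking_line_search(J, grad_J, theta, eta0=1.0, c=1e-4, shrink=0.5):
    """Halve the step size until the Armijo sufficient-decrease condition holds."""
    g = grad_J(theta)
    eta = eta0
    while J(theta - eta * g) > J(theta) - c * eta * np.dot(g, g):
        eta *= shrink
    return eta

# Toy quadratic J(theta) = 0.5 * theta^T H theta
H = np.array([[5.0, 0.0],
              [0.0, 1.0]])

def J(theta):
    return 0.5 * theta @ H @ theta

def grad_J(theta):
    return H @ theta

theta = np.array([1.0, 1.0])
for t in range(5):
    eta = backtracking_line_search(J, grad_J, theta)
    theta = theta - eta * grad_J(theta)
    print(f"step {t}: eta = {eta:.3f}, J(theta) = {J(theta):.6f}")
```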
Understanding the computational cost of batch gradient descent is essential for determining its applicability to different problem scales. Let's analyze the time and space complexity rigorously.
Time Complexity per Iteration
For a dataset with $N$ examples and a model with $d$ parameters:
Forward pass: Computing $f_\theta(\mathbf{x}^{(i)})$ for all $N$ examples: $O(N \cdot C_f)$ where $C_f$ is the cost of one forward pass
Gradient computation: Computing $\nabla_\theta \mathcal{L}$ for all $N$ examples via backpropagation: $O(N \cdot C_b)$ where $C_b \approx C_f$ (backward pass is similar cost)
Gradient averaging: Summing $N$ gradients: $O(N \cdot d)$
Parameter update: One vector subtraction: $O(d)$
For neural networks, $C_f$ and $C_b$ are proportional to the number of parameters $d$, giving:
$$\text{Time per iteration} = O(N \cdot d)$$
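The linear scaling in $N$ can be observed directly. The timing sketch below uses the linear-regression gradient from earlier with illustrative problem sizes (small enough to fit comfortably in memory):

```python
import time
import numpy as np

d = 200                                   # number of parameters (illustrative)
theta = np.zeros(d)
rng = np.random.default_rng(0)

for N in (10_000, 50_000, 250_000):       # dataset sizes (illustrative)
    X = rng.standard_normal((N, d))
    y = rng.standard_normal(N)
    start = time.perf_counter()
    gradient = X.T @ (X @ theta - y) / N  # full-batch gradient: O(N * d) work
    elapsed = time.perf_counter() - start
    print(f"N = {N:>7,d}: gradient time = {elapsed:.4f} s")

# Each 5x increase in N multiplies the per-update cost by roughly 5x.
```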
| Operation | Time Complexity | Space Complexity | Bottleneck For |
|---|---|---|---|
| Single forward pass | $O(d)$ | $O(d)$ | Very deep networks |
| Full-batch forward | $O(N \cdot d)$ | $O(N \cdot d)$ | Large datasets |
| Single backprop | $O(d)$ | $O(d)$ | Memory for activations |
| Full-batch gradient | $O(N \cdot d)$ | $O(d)$ | Time on large $N$ |
| Parameter update | $O(d)$ | $O(d)$ | Rarely the bottleneck |
The Scalability Problem
Let's make this concrete with real numbers:
| Dataset Size ($N$) | Parameters ($d$) | Ops per Iteration | At 100 GFLOPS |
|---|---|---|---|
| 10,000 | 1 million | 10 billion | ~0.1 seconds |
| 1 million | 1 million | 1 trillion | ~10 seconds |
| 10 million | 1 million | 10 trillion | ~2 minutes |
| 100 million | 1 million | 100 trillion | ~17 minutes |
For a modern dataset with hundreds of millions of examples, a single gradient computation already takes many minutes, and real networks require several floating-point operations per parameter per example, inflating these times further. Since training typically requires thousands of iterations, total training time stretches into weeks or months, making batch gradient descent completely impractical.
This is the fundamental problem that stochastic gradient descent solves: instead of using all $N$ examples per update, use a small subset. We trade gradient accuracy for computational speed—and as we'll see, this trade-off is extremely favorable.
Batch gradient descent computes the most accurate gradient possible but pays a price: O(N) computation per update. Modern datasets have N in the millions or billions. This makes batch GD impractical for most real-world deep learning—motivating the shift to stochastic methods.
While batch gradient descent has strong theoretical properties, it suffers from several practical limitations that severely restrict its usefulness in modern machine learning.
Limitation 1: Computational Expense
As analyzed above, the $O(N \cdot d)$ cost per iteration becomes prohibitive for large datasets. But it's worse than just slow—the cost is unavoidable for each update. Even if the first 100 examples would give you a 'good enough' gradient direction, you still must process all $N$ examples before making any progress.
Limitation 2: Memory Requirements
Batch gradient descent requires either holding the entire dataset in memory or streaming all of it from disk for every single parameter update.
Modern datasets can be terabytes in size. Holding them in RAM is simply not feasible.
Limitation 3: No Implicit Regularization
Stochastic methods benefit from noise in the gradient estimates, which acts as a form of regularization. Batch gradient descent, with its exact gradients, lacks this beneficial noise. This often leads to finding sharper minima that generalize worse.
Limitation 4: Sensitivity to Curvature
A single learning rate must work for all parameters. But different parameters may need different step sizes: directions of high curvature need small steps to avoid oscillation, while directions of low curvature need large steps to make meaningful progress.
Using one fixed $\eta$ for all parameters is fundamentally suboptimal. This motivates adaptive learning rate methods like AdaGrad, RMSprop, and Adam, which we'll cover later.
Limitation 5: Deterministic Path
Batch gradient descent follows a completely deterministic path given the initial parameters. This means every run from the same initialization traces the identical trajectory to the identical solution: there is no exploration of the loss landscape and no chance of escaping a poor basin of attraction.
Despite its limitations, batch gradient descent remains the right choice in certain scenarios. Understanding these use cases helps you make informed optimizer choices.
Appropriate Use Cases
Very small datasets ($N < 1000$): When the entire dataset fits easily in memory and gradient computation is fast, the simplicity and stability of batch GD is advantageous.
Convex optimization problems: For convex losses (logistic regression, linear regression, SVMs), batch GD provides reliable convergence with well-understood behavior.
When exact gradients are required: Some scenarios require precise gradients, such as second-order optimization methods that depend on accurate Hessian-vector products.
Debugging and prototyping: When developing new models, batch GD's deterministic behavior makes debugging easier. You can reproduce exact training trajectories.
Fine-tuning with stability requirements: When stability matters more than speed (e.g., fine-tuning a nearly-converged model), exact gradients prevent the noise-induced oscillation of stochastic methods.
| Consider Batch GD When | Avoid Batch GD When |
|---|---|
| Dataset has < 1,000 examples | Dataset has > 10,000 examples |
| Problem is convex | Problem is highly non-convex |
| Memory allows full dataset | Dataset doesn't fit in memory |
| Need deterministic training | Training speed is critical |
| Debugging model issues | Production training at scale |
In contemporary deep learning practice, batch gradient descent is rarely used directly. Mini-batch SGD or adaptive methods dominate. However, understanding batch GD is essential—it's the baseline from which all other methods are derived, and many theoretical results assume this setting.
Historical and Pedagogical Importance
Even if you never use pure batch gradient descent in practice, understanding it deeply is valuable:
Foundation for theory: Convergence theorems, learning rate analysis, and stability conditions extend (with modifications) to all gradient-based methods.
Baseline for comparison: When evaluating new optimizers, batch GD provides a clear baseline. Any new method should outperform it on relevant metrics.
Building intuition: The geometric and mathematical understanding developed here applies directly to understanding more sophisticated methods.
In the next pages, we'll see how stochastic gradient descent addresses the major limitations of batch GD while preserving the core gradient descent paradigm.
We've completed a thorough examination of batch gradient descent—the foundational optimization algorithm. Let's consolidate the key insights:

- The update rule $\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \eta \nabla J(\boldsymbol{\theta}^{(t)})$ uses a gradient averaged over all $N$ training examples, so one epoch equals one parameter update.
- The negative gradient is the direction of steepest local descent, as the first-order Taylor expansion shows.
- Convergence is $O(1/T)$ for smooth convex objectives and linear, at rate $(1 - \mu/L)^T$, for strongly convex ones; the condition number $\kappa = L/\mu$ governs the speed.
- The learning rate must respect a stability bound (roughly $\eta < 2/L$): too large diverges, too small crawls.
- Each iteration costs $O(N \cdot d)$ time, which makes batch GD impractical for large datasets and motivates stochastic methods.
What's Next
The next page introduces Stochastic Gradient Descent (SGD)—the revolutionary modification that makes gradient-based optimization practical for large-scale machine learning. We'll see how using a single example (or small batch) instead of the full dataset dramatically changes the optimization dynamics while still converging to good solutions.
You now have a rigorous understanding of batch gradient descent—its mathematics, geometry, convergence properties, implementation, and limitations. This foundation is essential for understanding why stochastic methods work and when to apply different optimization strategies.