In theory, gradient descent is straightforward: compute the gradient over all data, take a step, repeat. In practice, this becomes impossible. Modern datasets contain millions—sometimes billions—of samples. Computing the exact gradient requires processing every sample for every single update step.
Imagine training a language model on the entire internet. Each gradient computation would require reading terabytes of text before making one parameter update. At that rate, training would take centuries.
The solution? Approximate the gradient. Instead of computing the exact gradient over all data, use a subset—a mini-batch. This introduces noise but enables massive speedups. The art lies in balancing the noise-accuracy trade-off. This page explores the spectrum of gradient descent variants and when to use each.
By the end of this page, you will understand the fundamental differences between batch, stochastic, and mini-batch gradient descent, analyze the variance-computation trade-off, implement all three variants, and develop intuition for choosing batch sizes in practice.
All variants optimize the same objective—they differ in how they estimate the gradient:
Full (Batch) Gradient Descent
$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \alpha \nabla J(\boldsymbol{\theta}^{(t)}) = \boldsymbol{\theta}^{(t)} - \frac{\alpha}{n} \sum_{i=1}^{n} \nabla \mathcal{L}_i(\boldsymbol{\theta}^{(t)})$$
Computes the exact gradient over all $n$ samples. Precise but expensive.
Stochastic Gradient Descent (SGD)
$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \alpha \nabla \mathcal{L}_i(\boldsymbol{\theta}^{(t)})$$
Computes gradient from a single randomly sampled training example. Noisy but cheap.
Mini-batch Gradient Descent
$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \frac{\alpha}{|B|} \sum_{i \in B} \nabla \mathcal{L}_i(\boldsymbol{\theta}^{(t)})$$
Computes gradient from a randomly sampled batch $B$ of $b$ examples. Best of both worlds: reduced variance, manageable cost.
| Variant | Batch Size | Gradient Quality | Update Frequency | Memory Use | Best For |
|---|---|---|---|---|---|
| Full Batch | All n samples | Exact (no variance) | 1 per epoch | High | Small datasets, convex problems |
| Stochastic (SGD) | 1 sample | Very noisy | n per epoch | Minimal | Online learning, huge datasets |
| Mini-batch | b samples (16-512) | Moderate variance | n/b per epoch | Moderate | Deep learning (most common) |
Full batch gradient descent uses the entire dataset to compute each gradient. This is what we've been analyzing so far.
Advantages:
- Exact gradient with zero variance, so convergence analyses apply directly
- Deterministic, reproducible updates
- Stable, smooth progress with a well-chosen learning rate
Disadvantages:
- Each update requires a full pass over all $n$ samples
- High memory use: the whole dataset must be processed per step
- Only one update per epoch, so progress per unit of computation is slow
```python
import numpy as np

def batch_gradient_descent(X, y, theta_init, learning_rate=0.01,
                           n_epochs=100, tol=1e-6):
    """
    Full batch gradient descent for linear regression.

    Args:
        X: Feature matrix (n_samples, n_features)
        y: Target vector (n_samples,)
        theta_init: Initial parameters
        learning_rate: Step size α
        n_epochs: Maximum number of passes through data
        tol: Convergence tolerance

    Returns:
        theta: Learned parameters
        history: List of (loss, grad_norm) per epoch
    """
    n_samples = X.shape[0]
    theta = theta_init.copy()
    history = []

    for epoch in range(n_epochs):
        # Forward pass: compute predictions for ALL samples
        y_pred = X @ theta

        # Compute exact gradient over entire dataset
        residuals = y_pred - y
        gradient = (X.T @ residuals) / n_samples  # Average gradient

        # Compute loss for monitoring
        loss = np.mean(residuals ** 2) / 2
        grad_norm = np.linalg.norm(gradient)
        history.append({'epoch': epoch, 'loss': loss, 'grad_norm': grad_norm})

        # Check convergence
        if grad_norm < tol:
            print(f"Converged at epoch {epoch}")
            break

        # Update: one step per epoch
        theta = theta - learning_rate * gradient

    return theta, history

# Example usage
np.random.seed(42)
n_samples = 1000
n_features = 5

# Generate data
X = np.random.randn(n_samples, n_features)
true_theta = np.array([1, -2, 3, -4, 5])
y = X @ true_theta + 0.1 * np.random.randn(n_samples)

# Train
theta_init = np.zeros(n_features)
theta, history = batch_gradient_descent(X, y, theta_init,
                                        learning_rate=0.1, n_epochs=100)

print(f"True parameters: {true_theta}")
print(f"Learned parameters: {np.round(theta, 3)}")
print(f"Final loss: {history[-1]['loss']:.6f}")
```

Full batch GD is appropriate when: (1) the dataset fits in memory, (2) n ≤ 10,000 samples, (3) you need deterministic, reproducible training, or (4) you are using second-order methods that require exact gradients. For deep learning with large datasets, mini-batch is almost always preferred.
SGD uses a single randomly sampled example to estimate the gradient. The gradient of one sample is an unbiased estimator of the true gradient:
$$\mathbb{E}[\nabla \mathcal{L}_i(\boldsymbol{\theta})] = \nabla J(\boldsymbol{\theta})$$
where the expectation is over random sample $i$.
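This unbiasedness is easy to check numerically for the squared-error loss used in the code on this page; the data below is synthetic and the check is a sketch, not a proof:

```python
import numpy as np

# Verify that per-sample gradients average to the full gradient
# for squared-error linear regression (synthetic data).
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
y = X @ np.array([1.0, -2.0, 3.0]) + 0.1 * rng.standard_normal(500)
theta = rng.standard_normal(3)

# Full gradient: (1/n) * X^T (X theta - y)
full_grad = X.T @ (X @ theta - y) / len(X)

# Per-sample gradients g_i = x_i * (x_i . theta - y_i), stacked as rows
per_sample = X * (X @ theta - y)[:, None]   # shape (n, 3)

# The mean of the per-sample gradients equals the full gradient exactly,
# so a uniformly random sample i is an unbiased estimator.
assert np.allclose(per_sample.mean(axis=0), full_grad)
print("unbiased: per-sample gradients average to the full gradient")
```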
Advantages:
- Each update touches a single sample, so updates are extremely cheap
- Makes fast initial progress: $n$ updates per epoch
- Natural fit for online/streaming settings where data arrives one example at a time
- Gradient noise can help escape shallow local minima and saddle points
Disadvantages:
- High-variance updates make the loss oscillate
- Cannot converge precisely with a fixed learning rate (requires a decay schedule)
- Processing one sample at a time badly underutilizes vectorized hardware
The variance of SGD:
Let $g_i = \nabla \mathcal{L}_i(\boldsymbol{\theta})$ be the gradient for sample $i$. The variance of this estimator is:
$$\text{Var}[g_i] = \mathbb{E}[\|g_i\|^2] - \|\nabla J\|^2$$
This variance is typically large when samples have diverse gradients. Near optima (where $\nabla J \approx 0$), the noise doesn't vanish—updates continue to oscillate.
Key insight: SGD can get close to the optimum quickly due to many updates, but it can't converge precisely without learning rate decay.
```python
import numpy as np

def stochastic_gradient_descent(X, y, theta_init, learning_rate=0.01,
                                n_epochs=10, lr_decay=None):
    """
    Pure Stochastic Gradient Descent (batch size = 1).

    Args:
        X: Feature matrix (n_samples, n_features)
        y: Target vector (n_samples,)
        theta_init: Initial parameters
        learning_rate: Initial step size α
        n_epochs: Number of passes through entire dataset
        lr_decay: Optional learning rate schedule ('1/t', 'sqrt')

    Returns:
        theta: Learned parameters
        history: List of loss values (sampled periodically)
    """
    n_samples, n_features = X.shape
    theta = theta_init.copy()
    history = []
    step = 0

    for epoch in range(n_epochs):
        # Shuffle data at start of each epoch
        indices = np.random.permutation(n_samples)
        epoch_losses = []

        for i in indices:
            step += 1

            # Get single sample
            x_i = X[i:i+1]  # Shape (1, n_features)
            y_i = y[i:i+1]  # Shape (1,)

            # Forward pass: prediction for ONE sample
            y_pred = x_i @ theta

            # Compute gradient for ONE sample (unbiased estimator)
            residual = y_pred - y_i
            gradient = x_i.T.flatten() * residual[0]  # No averaging needed

            # Learning rate decay
            if lr_decay == '1/t':
                alpha = learning_rate / (1 + 0.01 * step)
            elif lr_decay == 'sqrt':
                alpha = learning_rate / np.sqrt(1 + 0.01 * step)
            else:
                alpha = learning_rate

            # Update with noisy gradient
            theta = theta - alpha * gradient

            # Track loss (for subset of steps to avoid overhead)
            if step % 100 == 0:
                loss = np.mean((X @ theta - y) ** 2) / 2
                epoch_losses.append(loss)

        # Epoch summary
        if epoch_losses:
            history.append({
                'epoch': epoch,
                'loss': np.mean(epoch_losses),
                'final_lr': alpha
            })

    return theta, history

# Example: Compare SGD with and without learning rate decay
np.random.seed(42)
n_samples = 1000
X = np.random.randn(n_samples, 3)
true_theta = np.array([1.0, 2.0, 3.0])
y = X @ true_theta + 0.1 * np.random.randn(n_samples)

theta_init = np.zeros(3)

# SGD without decay
theta_no_decay, hist_no_decay = stochastic_gradient_descent(
    X, y, theta_init, learning_rate=0.01, n_epochs=20, lr_decay=None)

# SGD with 1/t decay
theta_with_decay, hist_with_decay = stochastic_gradient_descent(
    X, y, theta_init, learning_rate=0.1, n_epochs=20, lr_decay='1/t')

print("True theta:", true_theta)
print(f"SGD (no decay):   {np.round(theta_no_decay, 3)}, final loss: {hist_no_decay[-1]['loss']:.6f}")
print(f"SGD (with decay): {np.round(theta_with_decay, 3)}, final loss: {hist_with_decay[-1]['loss']:.6f}")
```

With a fixed learning rate, SGD oscillates in a neighborhood of the optimum. The size of this neighborhood scales with the learning rate and gradient variance. To converge precisely, you must decrease the learning rate over time (e.g., $\alpha_t = \alpha_0 / t$).
Mini-batch gradient descent is the standard approach for deep learning. It strikes a balance between full batch (exact but slow) and pure SGD (fast but noisy).
Mini-batch gradient:
$$\nabla_{\text{batch}} = \frac{1}{b} \sum_{i \in B} \nabla \mathcal{L}_i(\boldsymbol{\theta})$$
where $B$ is a random batch of $b$ samples.
Variance reduction:
The variance of the mini-batch gradient estimator is: $$\text{Var}[\nabla_{\text{batch}}] = \frac{\text{Var}[g_i]}{b}$$
Variance falls as $1/b$: a batch of 64 has 1/64 the variance of single-sample SGD.
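The $1/b$ scaling can be checked with a quick Monte Carlo experiment; the sketch below uses synthetic data and samples batches with replacement, which matches the i.i.d. assumption behind the formula:

```python
import numpy as np

# Empirically check Var[grad_batch] ∝ 1/b on synthetic linear-regression data.
rng = np.random.default_rng(0)
n = 2000
X = rng.standard_normal((n, 3))
y = X @ np.array([1.0, -2.0, 3.0]) + 0.5 * rng.standard_normal(n)
theta = np.zeros(3)

full = X.T @ (X @ theta - y) / n  # exact full-batch gradient

def mc_variance(b, trials=3000):
    """Monte Carlo estimate of E||g_batch - full_grad||^2 for batch size b."""
    total = 0.0
    for _ in range(trials):
        idx = rng.integers(0, n, size=b)          # sample WITH replacement
        Xb, yb = X[idx], y[idx]
        g = Xb.T @ (Xb @ theta - yb) / b
        total += np.sum((g - full) ** 2)
    return total / trials

ratio = mc_variance(1) / mc_variance(64)
print(f"variance ratio, b=1 vs b=64: {ratio:.1f}")  # roughly 64
```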
| Batch Size | Variance | Updates/Epoch | GPU Utilization | Generalization |
|---|---|---|---|---|
| 1 (SGD) | Very high | n | Poor | Often best |
| 16 | High | n/16 | Moderate | Very good |
| 32 | Moderate | n/32 | Good | Good |
| 64 | Moderate | n/64 | Good | Good |
| 128 | Lower | n/128 | Very good | Good |
| 256 | Low | n/256 | Excellent | May degrade |
| 512+ | Very low | n/512+ | Excellent | Often worse |
| Full (n) | Zero | 1 | Excellent | May overfit |
```python
import numpy as np

def minibatch_gradient_descent(X, y, theta_init, batch_size=32,
                               learning_rate=0.01, n_epochs=50):
    """
    Mini-batch Gradient Descent - the deep learning standard.

    Args:
        X: Feature matrix (n_samples, n_features)
        y: Target vector (n_samples,)
        theta_init: Initial parameters
        batch_size: Number of samples per batch
        learning_rate: Step size α
        n_epochs: Number of passes through entire dataset

    Returns:
        theta: Learned parameters
        history: Loss and gradient info per epoch
    """
    n_samples, n_features = X.shape
    theta = theta_init.copy()
    history = []

    for epoch in range(n_epochs):
        # Shuffle data at start of each epoch
        indices = np.random.permutation(n_samples)
        epoch_losses = []

        # Process data in batches
        for start_idx in range(0, n_samples, batch_size):
            end_idx = min(start_idx + batch_size, n_samples)
            batch_indices = indices[start_idx:end_idx]
            current_batch_size = len(batch_indices)

            # Extract batch
            X_batch = X[batch_indices]
            y_batch = y[batch_indices]

            # Forward pass
            y_pred = X_batch @ theta

            # Compute gradient (averaged over batch)
            residuals = y_pred - y_batch
            gradient = (X_batch.T @ residuals) / current_batch_size

            # Update parameters
            theta = theta - learning_rate * gradient

            # Track batch loss
            batch_loss = np.mean(residuals ** 2) / 2
            epoch_losses.append(batch_loss)

        # Epoch statistics
        epoch_loss = np.mean(epoch_losses)
        full_loss = np.mean((X @ theta - y) ** 2) / 2
        history.append({
            'epoch': epoch,
            'batch_loss': epoch_loss,
            'full_loss': full_loss,
            'n_batches': len(epoch_losses)
        })

    return theta, history

# Compare different batch sizes
np.random.seed(42)
n_samples = 5000
n_features = 10

X = np.random.randn(n_samples, n_features)
true_theta = np.linspace(-5, 5, n_features)
y = X @ true_theta + 0.1 * np.random.randn(n_samples)

theta_init = np.zeros(n_features)

batch_sizes = [1, 16, 64, 256, n_samples]
results = {}

for bs in batch_sizes:
    name = f"bs_{bs}" if bs < n_samples else "full"
    theta, hist = minibatch_gradient_descent(
        X, y, theta_init, batch_size=bs, learning_rate=0.01, n_epochs=20)
    results[name] = {
        'theta': theta,
        'final_loss': hist[-1]['full_loss'],
        'history': hist
    }
    print(f"Batch size {bs:5d}: Final loss = {hist[-1]['full_loss']:.6f}")

# All converge to similar solutions, but the trajectories differ
```

Start with batch size 32 or 64. Increase batch size if: (1) the GPU is underutilized, (2) training is unstable, or (3) you can afford more compute. Decrease batch size if: (1) generalization is poor, (2) memory is limited, or (3) you want implicit regularization.
The fundamental trade-off in choosing batch size is between gradient quality (variance) and computational efficiency.
Fixed Compute Budget Analysis:
Suppose you have budget for $C$ gradient evaluations (sample-gradient computations).
With batch size $b$, each update consumes $b$ gradient evaluations, so the budget allows $T = C/b$ update steps.
For convex problems:
The error after $T$ steps with batch size $b$ and variance $\sigma^2$ is approximately: $$\text{Error} \propto \underbrace{\frac{1}{T}}_{\text{optimization}} + \underbrace{\frac{\alpha \sigma^2}{b}}_{\text{noise}}$$
Substituting $T = C/b$: $$\text{Error} \propto \frac{b}{C} + \frac{\alpha \sigma^2}{b}$$
This is minimized at an intermediate batch size—not too small (noise dominates), not too large (not enough updates).
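Plugging numbers into this expression makes the interior optimum concrete. The constants $\alpha$, $\sigma^2$, and $C$ below are illustrative assumptions, chosen only so the arithmetic is easy to follow:

```python
import numpy as np

# Illustrative (made-up) constants: step size, gradient variance, budget.
alpha, sigma2, C = 0.1, 100.0, 1_000_000

def error(b):
    # Error model from the text: optimization term b/C plus noise term a*s2/b
    return b / C + alpha * sigma2 / b

# Closed form: d/db (b/C + a*s2/b) = 1/C - a*s2/b^2 = 0  =>  b* = sqrt(a*s2*C)
b_star = np.sqrt(alpha * sigma2 * C)
print(f"optimal batch size ≈ {b_star:.0f}")  # ≈ 3162

# Among the usual power-of-two choices, the interior one wins
candidates = np.array([1, 16, 64, 256, 1024, 4096, 16384])
best = candidates[np.argmin(error(candidates))]
print(f"best of the usual sizes: {best}")  # 4096
```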
The "linear scaling rule":
As batch size increases, the variance decreases by $1/b$. To maintain the same noise-to-signal ratio of updates, the learning rate can be scaled up proportionally:
$$\alpha_{\text{large batch}} = \alpha_{\text{small batch}} \times \frac{b_{\text{large}}}{b_{\text{small}}}$$
Example: if $\alpha = 0.1$ works well at batch size 256, the rule suggests $\alpha = 0.4$ at batch size 1024.
This allows training with larger batches without losing convergence speed in terms of epochs. However, the larger batch is more expensive per epoch, so wall-clock speedup requires parallelism.
Caveats:
- The rule breaks down at very large batch sizes, where the scaled learning rate becomes unstable
- A warmup period (gradually ramping $\alpha$ up) is typically needed early in training
- It is an empirical heuristic, not a guarantee; validate it on your own problem
Empirically, smaller batches often generalize better than large batches, even when training loss is similar. Theories suggest SGD noise acts as implicit regularization, finding "flatter" minima that generalize better. This is an active research area.
These terms are often confused. Let's define them precisely:
Epoch: One complete pass through the entire training dataset.
Iteration/Step: One parameter update (one gradient descent step).
Batch: A subset of samples used to compute one gradient.
| Quantity | Value | Meaning |
|---|---|---|
| n | 60,000 | Total samples in dataset |
| b | 128 | Samples per mini-batch |
| Iterations per epoch | ⌈60000/128⌉ = 469 | Gradient updates per pass |
| Epochs trained | 10 | Full passes through data |
| Total iterations | 4,690 | Total gradient updates |
| Total samples seen | 600,000 | = n × epochs |
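The bookkeeping in the table can be reproduced in a few lines, using the same numbers:

```python
import math

# n samples, batch size b, number of epochs (values from the table above)
n, b, epochs = 60_000, 128, 10

iters_per_epoch = math.ceil(n / b)       # 469 gradient updates per pass
total_iters = iters_per_epoch * epochs   # 4690 total gradient updates
total_samples = n * epochs               # 600000 samples seen overall

print(iters_per_epoch, total_iters, total_samples)  # 469 4690 600000
```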
Why this matters:
- Learning rate schedules are often defined per epoch (e.g., "decay at epoch 30") or per iteration.
- Comparing methods: SGD makes $n$ updates per epoch versus 1 for full batch, so comparing "per epoch" progress is misleading.
- Convergence guarantees are often stated in terms of iterations, not epochs.
- Batch normalization statistics are computed over mini-batches, so batch size affects normalization.
Best practice: Report both iterations and epochs, along with batch size, so results are reproducible and comparable.
How you select mini-batches affects both convergence and variance.
Random Shuffling (Standard)
At the start of each epoch, randomly permute the dataset. Then iterate through in order.
With Replacement Sampling
For each batch, sample randomly with replacement from the full dataset.
Without Replacement (No Shuffling)
Process data in the same order every epoch.
```python
import numpy as np
from typing import Iterator, Tuple

def shuffled_batches(X: np.ndarray, y: np.ndarray,
                     batch_size: int) -> Iterator[Tuple[np.ndarray, np.ndarray]]:
    """
    Standard shuffled mini-batch iteration.
    Each sample seen exactly once per epoch.
    """
    n = len(X)
    indices = np.random.permutation(n)
    for start in range(0, n, batch_size):
        end = min(start + batch_size, n)
        batch_idx = indices[start:end]
        yield X[batch_idx], y[batch_idx]

def sampled_batches(X: np.ndarray, y: np.ndarray, batch_size: int,
                    n_batches: int) -> Iterator[Tuple[np.ndarray, np.ndarray]]:
    """
    Random sampling WITH replacement.
    Some samples may repeat, others may be missed.
    """
    n = len(X)
    for _ in range(n_batches):
        batch_idx = np.random.choice(n, size=batch_size, replace=True)
        yield X[batch_idx], y[batch_idx]

def stratified_batches(X: np.ndarray, y: np.ndarray,
                       batch_size: int) -> Iterator[Tuple[np.ndarray, np.ndarray]]:
    """
    Stratified batching: each batch has balanced class representation.
    Important for imbalanced datasets.
    """
    classes = np.unique(y)
    n_classes = len(classes)
    samples_per_class = batch_size // n_classes

    # Group indices by class
    class_indices = {c: np.where(y == c)[0] for c in classes}

    # Shuffle within each class
    for c in classes:
        np.random.shuffle(class_indices[c])

    # Create batches with balanced classes
    class_pointers = {c: 0 for c in classes}
    n = len(y)
    n_batches = n // batch_size

    for _ in range(n_batches):
        batch_idx = []
        for c in classes:
            start = class_pointers[c]
            end = start + samples_per_class
            # Handle wraparound
            if end > len(class_indices[c]):
                np.random.shuffle(class_indices[c])
                class_pointers[c] = 0
                start = 0
                end = samples_per_class
            batch_idx.extend(class_indices[c][start:end])
            class_pointers[c] = end
        batch_idx = np.array(batch_idx)
        np.random.shuffle(batch_idx)  # Shuffle within batch
        yield X[batch_idx], y[batch_idx]

# Example usage
X = np.random.randn(1000, 10)
y = np.random.randint(0, 5, 1000)  # 5 classes

print("Shuffled batches (standard):")
for i, (X_batch, y_batch) in enumerate(shuffled_batches(X, y, 64)):
    if i < 2:
        print(f"  Batch {i}: shape {X_batch.shape}, "
              f"class distribution: {np.bincount(y_batch, minlength=5)}")

print("Stratified batches:")
for i, (X_batch, y_batch) in enumerate(stratified_batches(X, y, 50)):
    if i < 2:
        print(f"  Batch {i}: shape {X_batch.shape}, "
              f"class distribution: {np.bincount(y_batch, minlength=5)}")
```

Never train on data in a fixed order. Even if you think your data is "already random," shuffle anyway. The cost is negligible, and the risk of subtle ordering biases is real. Most deep learning frameworks do this automatically, but verify!
Modern deep learning relies heavily on GPUs, which fundamentally changes the batch size calculus.
Why GPUs Love Large Batches:
GPUs have thousands of cores designed for parallel matrix operations. With batch size $b$:
- A forward/backward pass over a batch is one large matrix multiply, which parallelizes far better than $b$ separate small ones
- Larger $b$ raises arithmetic intensity, keeping cores busy instead of waiting on memory
- Below a hardware-dependent threshold, increasing $b$ costs almost no extra wall-clock time per step
Memory Constraints:
The batch size is often limited by GPU memory: $$\text{Memory} = \text{Model params} + \text{Activations} \times b + \text{Gradients}$$
Activations scale linearly with batch size. For large models (BERT, GPT), batch size 4-16 may be the maximum that fits.
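As a rough illustration of the memory formula above, here is a back-of-envelope sketch. The parameter count and per-sample activation footprint are made-up assumptions, and real frameworks add optimizer state and other overhead on top:

```python
# Back-of-envelope sketch of: Memory = params + activations * b + gradients.
# Inputs are illustrative assumptions, not measurements of any real model.
def gpu_memory_gb(n_params, act_floats_per_sample, batch_size,
                  bytes_per_float=4):
    params = n_params * bytes_per_float                      # weights
    grads = n_params * bytes_per_float                       # one grad per weight
    acts = act_floats_per_sample * batch_size * bytes_per_float
    return (params + grads + acts) / 1e9

# e.g. a hypothetical ~350M-parameter model with ~50M activation floats/sample:
for b in (4, 16, 64):
    print(b, round(gpu_memory_gb(350e6, 50e6, b), 1))
```

Because the activation term grows linearly in the batch size while the parameter and gradient terms are fixed, the memory bill is eventually dominated by activations.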
Gradient Accumulation:
When batch size is memory-limited, accumulate gradients over multiple mini-batches:
```python
for i, (X_batch, y_batch) in enumerate(data_loader):
    loss = model(X_batch, y_batch) / accumulation_steps
    loss.backward()  # Accumulates gradients
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()  # Apply accumulated gradients
        optimizer.zero_grad()
```
This simulates a larger "effective batch size" without the memory cost.
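For intuition, here is a framework-free NumPy sketch of the same accumulation loop, applied to the linear-regression setup used throughout this page; `micro_batch` and `accumulation_steps` are illustrative choices:

```python
import numpy as np

# Gradient accumulation without a framework: average micro-batch gradients,
# then apply one update as if a larger batch had been used.
rng = np.random.default_rng(0)
X = rng.standard_normal((256, 3))
y = X @ np.array([1.0, -2.0, 3.0])
theta = np.zeros(3)

micro_batch, accumulation_steps, lr = 8, 4, 0.1  # effective batch = 32
accum = np.zeros_like(theta)

for i, start in enumerate(range(0, len(X), micro_batch)):
    Xb, yb = X[start:start + micro_batch], y[start:start + micro_batch]
    g = Xb.T @ (Xb @ theta - yb) / len(Xb)   # micro-batch gradient
    accum += g / accumulation_steps          # running average of micro-grads
    if (i + 1) % accumulation_steps == 0:
        theta -= lr * accum                  # one "large batch" update
        accum[:] = 0.0

print("loss after one pass:", np.mean((X @ theta - y) ** 2) / 2)
```

Only one micro-batch of activations is alive at a time, which is exactly why this trick trades extra passes for lower peak memory.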
Coming Up Next:
Now that we understand gradient descent variants, we turn to acceleration techniques. Momentum and its variants (Nesterov acceleration) can dramatically speed up convergence, especially on ill-conditioned problems. Page 5 explores how to make gradient descent faster without changing the batch size.
You now understand the spectrum from full batch to pure SGD, the trade-offs involved, and how to choose batch sizes in practice. This knowledge is essential for scaling ML to real-world datasets. Next, we'll learn how momentum accelerates convergence.