When you train a neural network in PyTorch, TensorFlow, or any modern framework, you're almost certainly using mini-batch stochastic gradient descent. Not pure batch gradient descent (too slow), not pure SGD with batch size 1 (too noisy and inefficient)—but a carefully chosen middle ground.
Mini-batch SGD computes gradients over a subset of training examples—typically 32, 64, 128, or 256 at a time—and updates parameters based on this averaged gradient. This simple modification unlocks massive practical benefits: variance reduction compared to pure SGD, GPU parallelism for dramatic speedups, and a tunable noise level that affects both convergence and generalization.
Understanding mini-batch SGD deeply is essential because batch size is a fundamental hyperparameter affecting training speed, final model quality, memory usage, and even what hardware you can use. This page provides the complete picture.
By the end of this page, you will understand the mathematical framework of mini-batch SGD, analyze variance reduction from batching, master batch size selection principles, appreciate hardware efficiency considerations, and navigate the batch size-generalization trade-off.
Mini-batch SGD generalizes both batch gradient descent and pure SGD. At each iteration, instead of using the full dataset (batch GD) or one example (SGD), we sample a mini-batch of $B$ examples.
The Mini-batch Gradient
Let $\mathcal{B}_t \subset \{1, \ldots, N\}$ be a randomly selected subset of size $B$ at iteration $t$. The mini-batch gradient is:
$$\mathbf{g}_t = \frac{1}{B} \sum_{i \in \mathcal{B}_t} \nabla \mathcal{L}(f_{\boldsymbol{\theta}^{(t)}}(\mathbf{x}^{(i)}), y^{(i)})$$
The parameter update is:
$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \eta \mathbf{g}_t$$
This is still a stochastic gradient method because $\mathbf{g}_t$ is a random variable (depending on which mini-batch is sampled). However, the variance is reduced compared to pure SGD.
The mini-batch gradient remains an unbiased estimator of the true gradient: $\mathbb{E}[\mathbf{g}_t] = \nabla J(\boldsymbol{\theta})$. This is because each example in the batch is sampled uniformly at random, and expectation distributes over sums. Unbiasedness is a key ingredient in the convergence guarantees for SGD.
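The unbiasedness claim is easy to check numerically. Below is a minimal sketch on a toy linear-regression problem (the data, dimensions, and `mse_grad` helper are illustrative, not from the text): averaging many mini-batch gradients recovers the full-batch gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression setup (illustrative)
N, d, B = 1000, 3, 32
X = rng.normal(size=(N, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)
theta = rng.normal(size=d)

def mse_grad(Xb, yb, theta):
    """Gradient of 0.5 * mean squared error over a batch."""
    return Xb.T @ (Xb @ theta - yb) / Xb.shape[0]

full_grad = mse_grad(X, y, theta)  # the true gradient of J(theta)

# Average many mini-batch gradients: the mean approaches the full gradient
batch_grads = []
for _ in range(20_000):
    idx = rng.choice(N, size=B, replace=False)
    batch_grads.append(mse_grad(X[idx], y[idx], theta))
mean_batch_grad = np.mean(batch_grads, axis=0)

# Small, and shrinks as the number of sampled batches grows
print(np.linalg.norm(mean_batch_grad - full_grad))
```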
The Spectrum from SGD to Batch GD
Mini-batch size $B$ controls where we are on the SGD-to-batch-GD spectrum:
| Batch Size | Method | Gradient Variance | Compute/Update | Updates/Epoch |
|---|---|---|---|---|
| $B = 1$ | Pure SGD | $\sigma^2$ (high) | $O(d)$ | $N$ |
| $B = 32$ | Mini-batch | $\sigma^2/32$ | $O(32d)$ | $N/32$ |
| $B = 256$ | Larger batch | $\sigma^2/256$ | $O(256d)$ | $N/256$ |
| $B = N$ | Batch GD | $0$ | $O(Nd)$ | $1$ |
The key insight: increasing $B$ reduces variance but also reduces the number of updates per epoch. This creates a fundamental trade-off that requires careful analysis.
Variance Analysis
For independent samples, variance reduces linearly with batch size:
$$\text{Var}[\mathbf{g}_t] = \frac{\sigma^2}{B}$$
where $\sigma^2 = \text{Var}[\nabla \mathcal{L}(\boldsymbol{\theta}, \mathbf{x}, y)]$ is the per-example gradient variance.
Proof sketch: For independent random variables $X_1, \ldots, X_B$ with variance $\sigma^2$: $$\text{Var}\left[\frac{1}{B}\sum_{i=1}^{B} X_i\right] = \frac{1}{B^2} \cdot B \cdot \sigma^2 = \frac{\sigma^2}{B}$$
This $1/B$ variance reduction is central to understanding mini-batch behavior.
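This $1/B$ law can be verified with a quick simulation. The sketch below uses i.i.d. Gaussian draws as stand-ins for per-example gradients (the variance $\sigma^2 = 4$ is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for per-example gradients: i.i.d. draws with sigma^2 = 4
sigma = 2.0
samples = rng.normal(scale=sigma, size=100_000)

empirical = {}
for B in [1, 32, 256]:
    # Variance of the mean of B independent samples
    means = samples[: (len(samples) // B) * B].reshape(-1, B).mean(axis=1)
    empirical[B] = means.var()
    print(f"B={B:4d}: empirical var = {empirical[B]:.5f}, "
          f"predicted sigma^2/B = {sigma**2 / B:.5f}")
```

Each empirical variance tracks the predicted $\sigma^2/B$, confirming the linear reduction.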
One of the most important practical discoveries in deep learning optimization is the linear scaling rule: when you increase batch size, you should proportionally increase the learning rate.
The Rule
If training works well with batch size $B$ and learning rate $\eta$, then for batch size $kB$:
$$\eta_{\text{new}} = k \cdot \eta$$
For example, if batch size 32 works with learning rate 0.01, batch size 256 (8× larger) should use learning rate 0.08 (8× larger).
Why Does This Work?
Consider the expected parameter change over $k$ SGD updates with batch size $B$:
$$\mathbb{E}[\Delta \boldsymbol{\theta}] = k \cdot (-\eta \nabla J) = -k\eta \nabla J$$
Now consider one update with batch size $kB$ and learning rate $k\eta$:
$$\mathbb{E}[\Delta \boldsymbol{\theta}] = -k\eta \nabla J$$
The expected updates are identical! The linear scaling rule ensures that, on average, we make the same progress per epoch regardless of batch size.
The linear scaling rule works up to a point, but breaks down for very large batch sizes. Empirically, it often fails beyond batch sizes of several thousand. The reason: at large B, the gradient variance becomes so low that we're effectively doing batch GD, and we hit the learning rate stability limit.
Understanding the Limitation
Recall the stability condition from batch GD: $\eta < 2/L$ where $L$ is the smoothness constant. This bound doesn't scale with batch size—it's a property of the loss function.
As we increase $B$ and $\eta$ together, the scaled learning rate $k\eta$ eventually approaches this fixed threshold, and further scaling makes updates unstable even though the gradient variance keeps shrinking.
Practical Consequence: Beyond a certain batch size (depends on model and data), increasing $B$ further doesn't speed up training—you can't increase $\eta$ proportionally without instability.
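A one-dimensional quadratic makes the fixed stability threshold concrete. In the sketch below (illustrative values, with `L_smooth` playing the role of the smoothness constant $L$), gradient descent converges just below $\eta = 2/L$ and diverges just above it, regardless of how little gradient noise there is:

```python
# 1-D quadratic J(theta) = 0.5 * L_smooth * theta^2, so grad J = L_smooth * theta
# and the stability condition is eta < 2 / L_smooth -- independent of batch size.
L_smooth = 10.0  # illustrative smoothness constant, so 2/L = 0.2

def run_gd(eta: float, steps: int = 50, theta0: float = 1.0) -> float:
    theta = theta0
    for _ in range(steps):
        theta -= eta * L_smooth * theta  # exact (zero-variance) gradient
    return abs(theta)

print(run_gd(eta=0.19))  # below 2/L: |theta| shrinks toward 0
print(run_gd(eta=0.21))  # above 2/L: |theta| grows without bound
```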
The Warmup Solution
For large batch training, the linear scaling rule is often combined with learning rate warmup: start at a small learning rate and increase it linearly to the scaled target over the first few epochs.
Warmup helps because early training has larger gradients and less stable dynamics. Jumping directly to a large learning rate can cause divergence.
```python
def compute_scaled_lr(
    base_lr: float,
    base_batch_size: int,
    new_batch_size: int
) -> float:
    """
    Apply the linear scaling rule.

    Args:
        base_lr: Learning rate that worked for base_batch_size
        base_batch_size: Original batch size
        new_batch_size: New batch size we want to use

    Returns:
        Scaled learning rate for new batch size
    """
    scale_factor = new_batch_size / base_batch_size
    return base_lr * scale_factor


def warmup_lr_schedule(
    current_epoch: float,
    warmup_epochs: int,
    target_lr: float,
    base_lr: float = 0.0
) -> float:
    """
    Linear warmup schedule.

    Args:
        current_epoch: Current training epoch (can be fractional)
        warmup_epochs: Number of epochs for warmup
        target_lr: Learning rate to reach after warmup
        base_lr: Starting learning rate (usually 0 or small value)

    Returns:
        Current learning rate
    """
    if current_epoch < warmup_epochs:
        # Linear interpolation from base_lr to target_lr
        alpha = current_epoch / warmup_epochs
        return base_lr + alpha * (target_lr - base_lr)
    else:
        return target_lr


# Example: Scaling from batch 32 to batch 512
base_batch_size = 32
base_lr = 0.01

new_batch_size = 512
scaled_lr = compute_scaled_lr(base_lr, base_batch_size, new_batch_size)
print(f"Scaled LR: {scaled_lr}")  # 0.16

# With warmup
for epoch in [0, 1, 2, 3, 4, 5]:
    lr = warmup_lr_schedule(epoch, warmup_epochs=3, target_lr=scaled_lr)
    print(f"Epoch {epoch}: LR = {lr:.4f}")
```

Choosing the right batch size involves balancing multiple factors: training speed, generalization, memory constraints, and hardware efficiency. Let's analyze each systematically.
Factor 1: Training Speed
Training speed depends on two quantities: the time each update takes, and the number of updates needed to reach a target loss.
Larger batch sizes mean fewer but more expensive updates per epoch; each update uses a lower-variance gradient and makes better use of parallel hardware.
There's often a sweet spot where total training time is minimized.
| Factor | Small Batch (32-64) | Medium Batch (128-512) | Large Batch (1K+) |
|---|---|---|---|
| Gradient variance | High (noisy) | Moderate | Low (smooth) |
| Updates per epoch | Many | Moderate | Few |
| GPU utilization | Often poor | Good | Excellent |
| Memory usage | Low | Moderate | High |
| Generalization | Often better | Good | Can be worse |
| Learning rate | Smaller stable | Moderate | Large (with care) |
Factor 2: Generalization
Empirical evidence suggests that smaller batch sizes often generalize better. This counterintuitive finding has several potential explanations: gradient noise acting as an implicit regularizer, convergence to flatter minima, and more thorough exploration of the loss landscape.
The generalization gap (test error - train error) often increases with batch size. This is the generalization penalty of large-batch training.
Factor 3: Hardware Efficiency
GPUs are massively parallel processors. A single example's forward pass isn't enough work to utilize all GPU cores. Batching enables parallelism by turning many small per-example operations into a few large matrix operations.
However, returns diminish: doubling batch size typically less than doubles throughput due to memory bandwidth limits.
Start with batch size 32-64 for research/prototyping. Increase for production training where speed matters, but verify generalization doesn't degrade. Use powers of 2 (32, 64, 128, 256) for hardware efficiency. If memory-limited, use gradient accumulation to simulate larger batches.
Factor 4: Memory Constraints
Batch size is often constrained by GPU memory:
$$\text{Memory} \approx B \cdot (\text{activations} + \text{gradients}) + \text{parameters}$$
Activations scale linearly with batch size and can dominate for large models. For a model that uses 8GB with batch size 32, doubling to batch size 64 might need ~14GB.
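The memory model above can be sketched as a back-of-the-envelope calculator. The specific numbers below (2 GB fixed cost, 0.1875 GB of activations per example) are hypothetical values chosen to reproduce the 8 GB at batch 32 example; real values depend on the model and framework:

```python
def estimate_memory_gb(batch_size: int,
                       activation_gb_per_example: float,
                       fixed_gb: float) -> float:
    """Rough memory model: activations scale with batch size; parameters,
    gradients, and optimizer state are batch-independent."""
    return batch_size * activation_gb_per_example + fixed_gb

# Hypothetical numbers matching the ~8 GB @ B=32 example in the text
print(estimate_memory_gb(32, activation_gb_per_example=0.1875, fixed_gb=2.0))  # 8.0
print(estimate_memory_gb(64, activation_gb_per_example=0.1875, fixed_gb=2.0))  # 14.0
```

Because the fixed cost doesn't double, doubling the batch size less than doubles total memory.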
Gradient Accumulation addresses memory limits: run several small micro-batches, accumulate (sum) their gradients, and apply a single parameter update as if one large batch had been used.
This achieves large effective batch sizes without the memory of holding all activations simultaneously.
Implementing mini-batch SGD correctly requires attention to batching, shuffling, and edge cases. Let's examine a production-quality implementation.
Algorithm: Mini-batch SGD
Input: Dataset D of size N, batch size B, learning rate η, epochs E
Output: Optimized parameters θ
1. Initialize θ₀
2. for epoch = 1 to E:
a. Shuffle D to get permuted indices π
b. num_batches ← ⌈N/B⌉
c. for b = 1 to num_batches:
i. Select batch: B_b = {π[(b-1)B+1], ..., π[min(bB, N)]}
ii. Compute: g ← (1/|B_b|) Σ_{i∈B_b} ∇L(θ, x^i, y^i)
iii. Update: θ ← θ - η · g
3. return θ
Key implementation details: reshuffle the data each epoch, average the gradient over the actual batch size (the last batch may be smaller than $B$), and weight per-batch losses by batch size when computing epoch metrics.
```python
import numpy as np
from typing import Callable, Tuple, List, Iterator


def create_batches(
    X: np.ndarray,
    y: np.ndarray,
    batch_size: int,
    shuffle: bool = True
) -> Iterator[Tuple[np.ndarray, np.ndarray]]:
    """
    Generator that yields mini-batches from the dataset.

    Args:
        X: Features (N x d)
        y: Targets (N,)
        batch_size: Size of each batch
        shuffle: Whether to shuffle the data

    Yields:
        Tuples of (X_batch, y_batch)
    """
    N = X.shape[0]
    indices = np.arange(N)
    if shuffle:
        np.random.shuffle(indices)

    for start_idx in range(0, N, batch_size):
        end_idx = min(start_idx + batch_size, N)
        batch_indices = indices[start_idx:end_idx]
        yield X[batch_indices], y[batch_indices]


def minibatch_sgd(
    X: np.ndarray,
    y: np.ndarray,
    theta_init: np.ndarray,
    grad_fn: Callable,  # (X_batch, y_batch, theta) -> gradient
    loss_fn: Callable,  # (X, y, theta) -> scalar loss
    batch_size: int = 32,
    learning_rate: float = 0.01,
    num_epochs: int = 10,
    shuffle: bool = True,
    verbose: bool = True
) -> Tuple[np.ndarray, List[float]]:
    """
    Mini-batch Stochastic Gradient Descent.

    Args:
        X: Feature matrix (N x d)
        y: Target vector (N,)
        theta_init: Initial parameters
        grad_fn: Function to compute gradient on a batch
        loss_fn: Function to compute loss
        batch_size: Number of examples per batch
        learning_rate: Step size
        num_epochs: Number of passes through data
        shuffle: Whether to shuffle each epoch
        verbose: Print progress

    Returns:
        Optimized parameters and loss history
    """
    theta = theta_init.copy()
    loss_history = []

    for epoch in range(num_epochs):
        epoch_loss = 0.0
        num_samples = 0

        for X_batch, y_batch in create_batches(X, y, batch_size, shuffle):
            actual_batch_size = X_batch.shape[0]

            # Compute gradient for this batch
            # (gradient is already averaged over the batch by grad_fn)
            gradient = grad_fn(X_batch, y_batch, theta)
            theta = theta - learning_rate * gradient

            # Track batch loss for monitoring
            batch_loss = loss_fn(X_batch, y_batch, theta)
            epoch_loss += batch_loss * actual_batch_size
            num_samples += actual_batch_size

        # Epoch metrics
        avg_epoch_loss = epoch_loss / num_samples
        loss_history.append(avg_epoch_loss)

        if verbose:
            print(f"Epoch {epoch+1}/{num_epochs} - Loss: {avg_epoch_loss:.6f}")

    return theta, loss_history


def batch_mse_gradient(
    X_batch: np.ndarray,
    y_batch: np.ndarray,
    theta: np.ndarray
) -> np.ndarray:
    """MSE gradient averaged over a mini-batch."""
    B = X_batch.shape[0]
    predictions = X_batch @ theta
    residuals = predictions - y_batch
    return (1 / B) * X_batch.T @ residuals


def batch_mse_loss(
    X_batch: np.ndarray,
    y_batch: np.ndarray,
    theta: np.ndarray
) -> float:
    """MSE loss for a batch."""
    predictions = X_batch @ theta
    return 0.5 * np.mean((predictions - y_batch) ** 2)


# Example usage
if __name__ == "__main__":
    np.random.seed(42)

    # Generate data
    N, d = 10000, 5
    X = np.random.randn(N, d)
    theta_true = np.array([1.0, -2.0, 0.5, 3.0, -1.5])
    y = X @ theta_true + 0.1 * np.random.randn(N)

    # Initialize
    theta_init = np.random.randn(d)

    # Run mini-batch SGD with different batch sizes
    for batch_size in [16, 64, 256]:
        theta_opt, history = minibatch_sgd(
            X, y, theta_init,
            grad_fn=batch_mse_gradient,
            loss_fn=batch_mse_loss,
            batch_size=batch_size,
            learning_rate=0.1,
            num_epochs=20,
            verbose=False
        )
        error = np.linalg.norm(theta_opt - theta_true)
        print(f"Batch {batch_size}: Final error = {error:.6f}")
```

When N is not divisible by B, the last batch is smaller. Options: (1) drop it (common in training), (2) pad with zeros or repeated examples, (3) use a varying batch size. Most frameworks drop it by default. For validation, include all examples.
One of the primary motivations for mini-batching is hardware efficiency. Understanding how batch size affects GPU utilization is crucial for practical deep learning.
Why Batching Enables Parallelism
A GPU consists of thousands of cores that can execute the same operation on different data simultaneously (SIMD: Single Instruction, Multiple Data). Mini-batching provides independent data points that can be processed in parallel:
Matrix multiplication: $\mathbf{Y} = \mathbf{X}\mathbf{W}$ where $\mathbf{X}$ is $B \times d_{\text{in}}$. All $B$ examples are processed simultaneously.
Convolutions: Each of $B$ images is convolved with filters independently.
Activation functions: ReLU, sigmoid, etc. applied to all $B \cdot d$ values in parallel.
With batch size 1, most GPU cores sit idle. With batch size $B$, we can utilize $B$ times more parallelism (up to hardware limits).
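A small NumPy check illustrates why: processing a batch as one matrix multiply gives the same result as $B$ separate matrix-vector products, but as a single large operation that parallel hardware can saturate (shapes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
B, d_in, d_out = 64, 128, 32
X = rng.normal(size=(B, d_in))      # a mini-batch of B examples
W = rng.normal(size=(d_in, d_out))  # layer weights

# One example at a time: B separate matrix-vector products
Y_loop = np.stack([X[i] @ W for i in range(B)])

# Batched: a single (B x d_in) @ (d_in x d_out) matrix multiply
Y_batched = X @ W

print(np.allclose(Y_loop, Y_batched))  # same result, one big op
```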
The Throughput Curve
As batch size increases, throughput (examples per second) typically follows this pattern:
Small $B$ (1-16): Throughput increases nearly linearly with $B$. GPU underutilized.
Medium $B$ (32-256): Throughput still increases but sublinearly. Approaching GPU saturation.
Large $B$ (512+): Throughput plateaus. GPU fully utilized; memory bandwidth becomes bottleneck.
Very large $B$: May actually decrease due to memory thrashing or reduced caching efficiency.
The exact numbers depend on model architecture, GPU model, and memory bandwidth.
Data Loading Considerations
High GPU utilization requires the data pipeline to keep up. If data loading is slow, the GPU idles between batches and measured throughput drops no matter what batch size you choose.
Best practices for fast data loading:
- Set `num_workers > 0` in DataLoader to parallelize CPU preprocessing

The goal: data ready before the GPU finishes the previous batch.
The relationship between batch size and generalization is one of the most researched topics in deep learning optimization. The empirical observation is robust: small batch sizes often lead to better generalization, but the reasons are nuanced.
The Empirical Evidence
Multiple studies have documented the batch size-generalization relationship:
Sharp vs Flat Minima (Keskar et al., 2017): Large batch training tends to converge to 'sharp' minima with high curvature, while small batch finds 'flat' minima. Flat minima generalize better.
Generalization Gap: Training with large batches often achieves similar training loss but higher test loss compared to small batches.
Critical Batch Size: There exists a problem-dependent batch size beyond which increasing further provides no benefit and may hurt generalization.
Gradient noise from small batches acts as regularization. The noise scale is proportional to lr/B. When you increase B, you effectively reduce regularization. This partially explains why large-batch training needs modified techniques (like longer training, warmup, or explicit regularization) to match small-batch generalization.
Theoretical Perspectives
The Noise Scale: The ratio $\eta/B$ controls the 'temperature' of the stochastic optimization process:
$$\text{Noise Scale} \propto \frac{\eta \sigma^2}{B}$$
Small batches with same effective learning rate have higher noise scale, enabling more exploration and regularization.
The SDE View: SGD can be approximated as a stochastic differential equation:
$$d\boldsymbol{\theta} = -\nabla J(\boldsymbol{\theta})\,dt + \sqrt{\frac{\eta \sigma^2}{B}}\,d\mathbf{W}$$
The diffusion term (noise) is modulated by $\eta/B$. This continuous-time view helps explain the interplay between batch size and learning rate.
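A few lines of arithmetic show how the linear scaling rule interacts with the noise scale (the per-example variance $\sigma^2 = 1$ is an arbitrary placeholder):

```python
# Noise scale ~ eta * sigma^2 / B: scaling eta and B together (the linear
# scaling rule) leaves the SGD "temperature" unchanged.
sigma_sq = 1.0  # per-example gradient variance (illustrative placeholder)

def noise_scale(eta: float, B: int) -> float:
    return eta * sigma_sq / B

small_batch = noise_scale(eta=0.01, B=32)     # baseline
linear_scaled = noise_scale(eta=0.08, B=256)  # 8x batch, 8x learning rate
fixed_lr = noise_scale(eta=0.01, B=256)       # 8x batch, same learning rate

print(small_batch, linear_scaled, fixed_lr)
```

Linear scaling preserves the noise scale; keeping the learning rate fixed while growing the batch reduces it 8x, which is one way to see why large fixed-lr batches lose the implicit regularization.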
The Gradient Diversity Hypothesis: Small batches see more diverse gradients throughout training. This diversity may help escape narrow basins and explore the loss landscape more thoroughly.
Gradient accumulation is a technique to simulate large batch sizes when GPU memory is insufficient. It's essential for training large models on limited hardware.
The Core Idea
Instead of: one forward/backward pass over a large batch of, say, $4B$ examples, which may not fit in memory.
Do: four forward/backward passes over micro-batches of $B$ examples, accumulating (summing) the gradients, then apply a single parameter update.
Both approaches produce the same parameter update (mathematically identical), but gradient accumulation uses 4× less memory for activations.
```python
import torch
import torch.nn as nn


def train_with_gradient_accumulation(
    model: nn.Module,
    dataloader,
    optimizer,
    loss_fn,
    accumulation_steps: int = 4
):
    """
    Training loop with gradient accumulation.

    Effective batch size = dataloader.batch_size * accumulation_steps
    Memory usage scales with dataloader.batch_size (not effective batch size)

    Args:
        model: The neural network
        dataloader: Provides mini-batches
        optimizer: Parameter optimizer (SGD, Adam, etc.)
        loss_fn: Loss function
        accumulation_steps: Number of micro-batches to accumulate
    """
    model.train()
    optimizer.zero_grad()  # Zero gradients once at the start

    for i, (inputs, targets) in enumerate(dataloader):
        # Forward pass
        outputs = model(inputs)

        # Scale loss by accumulation steps so the accumulated
        # gradient equals the large-batch gradient
        loss = loss_fn(outputs, targets) / accumulation_steps

        # Backward pass - gradients are ACCUMULATED (not replaced)
        loss.backward()

        # Update only every accumulation_steps
        if (i + 1) % accumulation_steps == 0:
            optimizer.step()       # Apply accumulated gradients
            optimizer.zero_grad()  # Reset gradients for next accumulation

            print(f"Step {(i + 1) // accumulation_steps}: "
                  f"Loss = {loss.item() * accumulation_steps:.4f}")

    # Handle remaining batches if not divisible by accumulation_steps
    if (i + 1) % accumulation_steps != 0:
        optimizer.step()
        optimizer.zero_grad()


# Why we divide the loss by accumulation_steps:
#
# Normally, gradient = d(Loss)/d(theta)
#
# With accumulation over K micro-batches:
#   accumulated_grad = sum of K micro-batch gradients
#                    = K * (average gradient per micro-batch)
#
# But we want: accumulated_grad = gradient for the effective batch
#
# By dividing each loss by K:
#   accumulated_grad = sum of K * (gradient / K) = effective-batch gradient
#
# Alternatively, you can scale the learning rate by 1/K instead of scaling loss
```

Batch normalization computes statistics over the micro-batch, not the effective batch. This can cause issues if micro-batches are very small. Solutions: (1) use Group Normalization or Layer Normalization instead, (2) sync BatchNorm statistics across accumulation steps (complex), (3) use large enough micro-batches.
When to Use Gradient Accumulation
Use it whenever the effective batch size you want exceeds what fits in GPU memory: large models, high-resolution inputs, or reproducing a large-batch training recipe on smaller hardware.
Trade-offs
| Aspect | Large Batch (fits in memory) | Gradient Accumulation |
|---|---|---|
| Speed | Faster (one forward/backward) | Slower (K forward/backward) |
| Memory | Higher | Lower |
| Gradient | Exact | Mathematically identical |
| BatchNorm | Full batch stats | Micro-batch stats (potential issue) |
Mini-batch SGD naturally extends to distributed training across multiple GPUs or machines. Understanding this connection is important as modern training increasingly uses distributed setups.
Data Parallelism
The most common distributed approach: each of $K$ workers holds a full copy of the model, computes the gradient on its own mini-batch, and the gradients are averaged across workers (an all-reduce) before every worker applies the identical update.
Effectively, this multiplies batch size by $K$: $$B_{\text{effective}} = K \cdot B_{\text{per-worker}}$$
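The equivalence between gradient averaging and large-batch SGD can be checked directly. In this NumPy sketch (toy sizes and a hypothetical `mse_grad` helper, used for illustration), averaging per-worker gradients over equal-sized shards equals the gradient over the combined batch:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synchronous data parallelism: K workers, each with a micro-batch of B examples
K, B, d = 4, 8, 5
X = rng.normal(size=(K * B, d))
y = rng.normal(size=K * B)
theta = rng.normal(size=d)

def mse_grad(Xb, yb, theta):
    """Gradient of 0.5 * mean squared error over a batch."""
    return Xb.T @ (Xb @ theta - yb) / Xb.shape[0]

# Each worker's gradient on its own shard
worker_grads = [mse_grad(X[k*B:(k+1)*B], y[k*B:(k+1)*B], theta) for k in range(K)]
allreduced = np.mean(worker_grads, axis=0)  # what the all-reduce computes

# Single-worker gradient over the full effective batch of K*B examples
large_batch = mse_grad(X, y, theta)

print(np.allclose(allreduced, large_batch))  # identical (shards are equal-sized)
```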
Synchronous vs Asynchronous Updates
Synchronous SGD: all workers compute gradients on their mini-batches, wait for each other, average the gradients, and apply the same update in lockstep.
Asynchronous SGD: workers push gradients and pull parameters without waiting for each other; this avoids stragglers but means updates are computed from stale parameters.
Synchronous is more common in practice because it's equivalent to mini-batch SGD with larger batch size—the algorithm we understand well.
Even if you train on one GPU, understanding distributed training helps interpret papers and hyperparameters. When a paper says 'batch size 8192 on 32 GPUs', they mean 256 per GPU with synchronous updates. If you have 1 GPU, you'd use gradient accumulation over 32 steps with batch 256.
We've comprehensively covered mini-batch stochastic gradient descent—the workhorse algorithm of modern deep learning. Let's consolidate the essential knowledge: the mini-batch gradient is an unbiased estimator whose variance falls as $1/B$; the linear scaling rule (with warmup) links batch size and learning rate up to a stability limit; batch size trades off hardware efficiency, memory, and generalization; and gradient accumulation or data parallelism extend the same algorithm when a single device isn't enough.
What's Next
The next page covers learning rate selection—perhaps the single most important hyperparameter in optimization. We'll explore principled approaches to choosing the learning rate, from simple heuristics to sophisticated techniques like learning rate range tests and cyclical learning rates.
You now understand why batch size matters, how to choose it, and what trade-offs you're making. Combined with knowledge of batch GD and pure SGD, you have the complete picture of gradient descent variants. The remaining pages in this module will help you tune these methods for maximum effectiveness.