When you train a neural network in PyTorch, TensorFlow, or any modern framework, you're almost certainly using mini-batch stochastic gradient descent. Not pure batch gradient descent (too slow), not pure SGD with batch size 1 (too noisy and inefficient)—but a carefully chosen middle ground.
Mini-batch SGD computes gradients over a subset of training examples—typically 32, 64, 128, or 256 at a time—and updates parameters based on this averaged gradient. This simple modification unlocks massive practical benefits: variance reduction compared to pure SGD, GPU parallelism for dramatic speedups, and a tunable noise level that affects both convergence and generalization.
Understanding mini-batch SGD deeply is essential because batch size is a fundamental hyperparameter affecting training speed, final model quality, memory usage, and even what hardware you can use. This page provides the complete picture.
By the end of this page, you will understand the mathematical framework of mini-batch SGD, analyze variance reduction from batching, master batch size selection principles, appreciate hardware efficiency considerations, and navigate the batch size-generalization trade-off.
Mini-batch SGD generalizes both batch gradient descent and pure SGD. At each iteration, instead of using the full dataset (batch GD) or one example (SGD), we sample a mini-batch of $B$ examples.
The Mini-batch Gradient
Let $\mathcal{B}_t \subset \{1, \ldots, N\}$ be a randomly selected subset of size $B$ at iteration $t$. The mini-batch gradient is:
$$\mathbf{g}_t = \frac{1}{B} \sum_{i \in \mathcal{B}_t} \nabla \mathcal{L}(f_{\boldsymbol{\theta}^{(t)}}(\mathbf{x}^{(i)}), y^{(i)})$$
The parameter update is:
$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \eta \mathbf{g}_t$$
This is still a stochastic gradient method because $\mathbf{g}_t$ is a random variable (depending on which mini-batch is sampled). However, the variance is reduced compared to pure SGD.
The mini-batch gradient remains an unbiased estimator of the true gradient: $\mathbb{E}[\mathbf{g}_t] = \nabla J(\boldsymbol{\theta})$. This is because each example in the batch is sampled uniformly at random, and expectation distributes over sums. Unbiasedness is a key ingredient in the convergence guarantees for SGD.
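The unbiasedness claim is easy to check numerically. Below is a minimal sketch on a toy linear-regression problem (the data, dimensions, and `mse_grad` helper are illustrative, not from the text): averaging many mini-batch gradients recovers the full-batch gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression setup (illustrative)
N, d, B = 1000, 3, 32
X = rng.normal(size=(N, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)
theta = rng.normal(size=d)

def mse_grad(Xb, yb, theta):
    """Gradient of 0.5 * mean squared error over a batch."""
    return Xb.T @ (Xb @ theta - yb) / Xb.shape[0]

full_grad = mse_grad(X, y, theta)  # the true gradient of J(theta)

# Average many mini-batch gradients: the mean approaches the full gradient
batch_grads = []
for _ in range(20_000):
    idx = rng.choice(N, size=B, replace=False)
    batch_grads.append(mse_grad(X[idx], y[idx], theta))
mean_batch_grad = np.mean(batch_grads, axis=0)

# Small, and shrinks as the number of sampled batches grows
print(np.linalg.norm(mean_batch_grad - full_grad))
```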
The Spectrum from SGD to Batch GD
Mini-batch size $B$ controls where we are on the SGD-to-batch-GD spectrum:
| Batch Size | Method | Gradient Variance | Compute/Update | Updates/Epoch |
|---|---|---|---|---|
| $B = 1$ | Pure SGD | $\sigma^2$ (high) | $O(d)$ | $N$ |
| $B = 32$ | Mini-batch | $\sigma^2/32$ | $O(32d)$ | $N/32$ |
| $B = 256$ | Larger batch | $\sigma^2/256$ | $O(256d)$ | $N/256$ |
| $B = N$ | Batch GD | $0$ | $O(Nd)$ | $1$ |
The key insight: increasing $B$ reduces variance but also reduces the number of updates per epoch. This creates a fundamental trade-off that requires careful analysis.
Variance Analysis
For independent samples, variance reduces linearly with batch size:
$$\text{Var}[\mathbf{g}_t] = \frac{\sigma^2}{B}$$
where $\sigma^2 = \text{Var}[\nabla \mathcal{L}(\boldsymbol{\theta}, \mathbf{x}, y)]$ is the per-example gradient variance.
Proof sketch: For independent random variables $X_1, \ldots, X_B$ with variance $\sigma^2$: $$\text{Var}\left[\frac{1}{B}\sum_{i=1}^{B} X_i\right] = \frac{1}{B^2} \cdot B \cdot \sigma^2 = \frac{\sigma^2}{B}$$
This $1/B$ variance reduction is central to understanding mini-batch behavior.
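This $1/B$ law can be verified with a quick simulation. The sketch below uses i.i.d. Gaussian draws as stand-ins for per-example gradients (the variance $\sigma^2 = 4$ is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for per-example gradients: i.i.d. draws with sigma^2 = 4
sigma = 2.0
samples = rng.normal(scale=sigma, size=100_000)

empirical = {}
for B in [1, 32, 256]:
    # Variance of the mean of B independent samples
    means = samples[: (len(samples) // B) * B].reshape(-1, B).mean(axis=1)
    empirical[B] = means.var()
    print(f"B={B:4d}: empirical var = {empirical[B]:.5f}, "
          f"predicted sigma^2/B = {sigma**2 / B:.5f}")
```

Each empirical variance tracks the predicted $\sigma^2/B$, confirming the linear reduction.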
One of the most important practical discoveries in deep learning optimization is the linear scaling rule: when you increase batch size, you should proportionally increase the learning rate.
The Rule
If training works well with batch size $B$ and learning rate $\eta$, then for batch size $kB$:
$$\eta_{\text{new}} = k \cdot \eta$$
For example, if batch size 32 works with learning rate 0.01, batch size 256 (8× larger) should use learning rate 0.08 (8× larger).
Why Does This Work?
Consider the expected parameter change over $k$ SGD updates with batch size $B$:
$$\mathbb{E}[\Delta \boldsymbol{\theta}] = k \cdot (-\eta \nabla J) = -k\eta \nabla J$$
Now consider one update with batch size $kB$ and learning rate $k\eta$:
$$\mathbb{E}[\Delta \boldsymbol{\theta}] = -k\eta \nabla J$$
The expected updates are identical! The linear scaling rule ensures that, on average, we make the same progress per epoch regardless of batch size.
The linear scaling rule works up to a point, but breaks down for very large batch sizes. Empirically, it often fails beyond batch sizes of several thousand. The reason: at large B, the gradient variance becomes so low that we're effectively doing batch GD, and we hit the learning rate stability limit.
Understanding the Limitation
Recall the stability condition from batch GD: $\eta < 2/L$ where $L$ is the smoothness constant. This bound doesn't scale with batch size—it's a property of the loss function.
As we increase $B$ and $\eta$ together, the scaled learning rate $k\eta$ eventually approaches this fixed threshold, and further scaling makes updates unstable even though the gradient variance keeps shrinking.
Practical Consequence: Beyond a certain batch size (depends on model and data), increasing $B$ further doesn't speed up training—you can't increase $\eta$ proportionally without instability.
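A one-dimensional quadratic makes the fixed stability threshold concrete. In the sketch below (illustrative values, with `L_smooth` playing the role of the smoothness constant $L$), gradient descent converges just below $\eta = 2/L$ and diverges just above it, regardless of how little gradient noise there is:

```python
# 1-D quadratic J(theta) = 0.5 * L_smooth * theta^2, so grad J = L_smooth * theta
# and the stability condition is eta < 2 / L_smooth -- independent of batch size.
L_smooth = 10.0  # illustrative smoothness constant, so 2/L = 0.2

def run_gd(eta: float, steps: int = 50, theta0: float = 1.0) -> float:
    theta = theta0
    for _ in range(steps):
        theta -= eta * L_smooth * theta  # exact (zero-variance) gradient
    return abs(theta)

print(run_gd(eta=0.19))  # below 2/L: |theta| shrinks toward 0
print(run_gd(eta=0.21))  # above 2/L: |theta| grows without bound
```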
The Warmup Solution
For large batch training, the linear scaling rule is often combined with learning rate warmup: start at a small learning rate and increase it linearly to the scaled target over the first few epochs.
Warmup helps because early training has larger gradients and less stable dynamics. Jumping directly to a large learning rate can cause divergence.
```python
def compute_scaled_lr(
    base_lr: float,
    base_batch_size: int,
    new_batch_size: int
) -> float:
    """
    Apply the linear scaling rule.

    Args:
        base_lr: Learning rate that worked for base_batch_size
        base_batch_size: Original batch size
        new_batch_size: New batch size we want to use

    Returns:
        Scaled learning rate for new batch size
    """
    scale_factor = new_batch_size / base_batch_size
    return base_lr * scale_factor


def warmup_lr_schedule(
    current_epoch: float,
    warmup_epochs: int,
    target_lr: float,
    base_lr: float = 0.0
) -> float:
    """
    Linear warmup schedule.

    Args:
        current_epoch: Current training epoch (can be fractional)
        warmup_epochs: Number of epochs for warmup
        target_lr: Learning rate to reach after warmup
        base_lr: Starting learning rate (usually 0 or small value)

    Returns:
        Current learning rate
    """
    if current_epoch < warmup_epochs:
        # Linear interpolation from base_lr to target_lr
        alpha = current_epoch / warmup_epochs
        return base_lr + alpha * (target_lr - base_lr)
    else:
        return target_lr


# Example: Scaling from batch 32 to batch 512
base_batch_size = 32
base_lr = 0.01

new_batch_size = 512
scaled_lr = compute_scaled_lr(base_lr, base_batch_size, new_batch_size)
print(f"Scaled LR: {scaled_lr}")  # 0.16

# With warmup
for epoch in [0, 1, 2, 3, 4, 5]:
    lr = warmup_lr_schedule(epoch, warmup_epochs=3, target_lr=scaled_lr)
    print(f"Epoch {epoch}: LR = {lr:.4f}")
```

Choosing the right batch size involves balancing multiple factors: training speed, generalization, memory constraints, and hardware efficiency. Let's analyze each systematically.
Factor 1: Training Speed
Training speed depends on two quantities: the time each update takes, and the number of updates needed to reach a target loss.
Larger batch sizes mean fewer but more expensive updates per epoch; each update uses a lower-variance gradient and makes better use of parallel hardware.
There's often a sweet spot where total training time is minimized.
| Factor | Small Batch (32-64) | Medium Batch (128-512) | Large Batch (1K+) |
|---|---|---|---|
| Gradient variance | High (noisy) | Moderate | Low (smooth) |
| Updates per epoch | Many | Moderate | Few |
| GPU utilization | Often poor | Good | Excellent |
| Memory usage | Low | Moderate | High |
| Generalization | Often better | Good | Can be worse |
| Learning rate | Smaller stable | Moderate | Large (with care) |
Factor 2: Generalization
Empirical evidence suggests that smaller batch sizes often generalize better. This counterintuitive finding has several potential explanations: gradient noise acting as an implicit regularizer, convergence to flatter minima, and more thorough exploration of the loss landscape.
The generalization gap (test error - train error) often increases with batch size. This is the generalization penalty of large-batch training.
Factor 3: Hardware Efficiency
GPUs are massively parallel processors. A single example's forward pass isn't enough work to utilize all GPU cores. Batching enables parallelism by turning many small per-example operations into a few large matrix operations.
However, returns diminish: doubling batch size typically less than doubles throughput due to memory bandwidth limits.
Start with batch size 32-64 for research/prototyping. Increase for production training where speed matters, but verify generalization doesn't degrade. Use powers of 2 (32, 64, 128, 256) for hardware efficiency. If memory-limited, use gradient accumulation to simulate larger batches.
Factor 4: Memory Constraints
Batch size is often constrained by GPU memory:
$$\text{Memory} \approx B \cdot (\text{activations} + \text{gradients}) + \text{parameters}$$
Activations scale linearly with batch size and can dominate for large models. For a model that uses 8GB with batch size 32, doubling to batch size 64 might need ~14GB.
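The memory model above can be sketched as a back-of-the-envelope calculator. The specific numbers below (2 GB fixed cost, 0.1875 GB of activations per example) are hypothetical values chosen to reproduce the 8 GB at batch 32 example; real values depend on the model and framework:

```python
def estimate_memory_gb(batch_size: int,
                       activation_gb_per_example: float,
                       fixed_gb: float) -> float:
    """Rough memory model: activations scale with batch size; parameters,
    gradients, and optimizer state are batch-independent."""
    return batch_size * activation_gb_per_example + fixed_gb

# Hypothetical numbers matching the ~8 GB @ B=32 example in the text
print(estimate_memory_gb(32, activation_gb_per_example=0.1875, fixed_gb=2.0))  # 8.0
print(estimate_memory_gb(64, activation_gb_per_example=0.1875, fixed_gb=2.0))  # 14.0
```

Because the fixed cost doesn't double, doubling the batch size less than doubles total memory.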
Gradient Accumulation addresses memory limits: run several small micro-batches, accumulate (sum) their gradients, and apply a single parameter update as if one large batch had been used.
This achieves large effective batch sizes without the memory of holding all activations simultaneously.
Implementing mini-batch SGD correctly requires attention to batching, shuffling, and edge cases. Let's examine a production-quality implementation.
Algorithm: Mini-batch SGD
Input: Dataset D of size N, batch size B, learning rate η, epochs E
Output: Optimized parameters θ
1. Initialize θ₀
2. for epoch = 1 to E:
a. Shuffle D to get permuted indices π
b. num_batches ← ⌈N/B⌉
c. for b = 1 to num_batches:
i. Select batch: B_b = {π[(b-1)B+1], ..., π[min(bB, N)]}
ii. Compute: g ← (1/|B_b|) Σ_{i∈B_b} ∇L(θ, x^i, y^i)
iii. Update: θ ← θ - η · g
3. return θ
Key implementation details: reshuffle the data each epoch, average the gradient over the actual batch size (the last batch may be smaller than $B$), and weight per-batch losses by batch size when computing epoch metrics.
```python
import numpy as np
from typing import Callable, Tuple, List, Iterator


def create_batches(
    X: np.ndarray,
    y: np.ndarray,
    batch_size: int,
    shuffle: bool = True
) -> Iterator[Tuple[np.ndarray, np.ndarray]]:
    """
    Generator that yields mini-batches from the dataset.

    Args:
        X: Features (N x d)
        y: Targets (N,)
        batch_size: Size of each batch
        shuffle: Whether to shuffle the data

    Yields:
        Tuples of (X_batch, y_batch)
    """
    N = X.shape[0]
    indices = np.arange(N)
    if shuffle:
        np.random.shuffle(indices)

    for start_idx in range(0, N, batch_size):
        end_idx = min(start_idx + batch_size, N)
        batch_indices = indices[start_idx:end_idx]
        yield X[batch_indices], y[batch_indices]


def minibatch_sgd(
    X: np.ndarray,
    y: np.ndarray,
    theta_init: np.ndarray,
    grad_fn: Callable,  # (X_batch, y_batch, theta) -> gradient
    loss_fn: Callable,  # (X, y, theta) -> scalar loss
    batch_size: int = 32,
    learning_rate: float = 0.01,
    num_epochs: int = 10,
    shuffle: bool = True,
    verbose: bool = True
) -> Tuple[np.ndarray, List[float]]:
    """
    Mini-batch Stochastic Gradient Descent.

    Args:
        X: Feature matrix (N x d)
        y: Target vector (N,)
        theta_init: Initial parameters
        grad_fn: Function to compute gradient on a batch
        loss_fn: Function to compute loss
        batch_size: Number of examples per batch
        learning_rate: Step size
        num_epochs: Number of passes through data
        shuffle: Whether to shuffle each epoch
        verbose: Print progress

    Returns:
        Optimized parameters and loss history
    """
    theta = theta_init.copy()
    loss_history = []

    for epoch in range(num_epochs):
        epoch_loss = 0.0
        num_samples = 0

        for X_batch, y_batch in create_batches(X, y, batch_size, shuffle):
            actual_batch_size = X_batch.shape[0]

            # Compute gradient for this batch
            # (gradient is already averaged over the batch by grad_fn)
            gradient = grad_fn(X_batch, y_batch, theta)
            theta = theta - learning_rate * gradient

            # Track batch loss for monitoring
            batch_loss = loss_fn(X_batch, y_batch, theta)
            epoch_loss += batch_loss * actual_batch_size
            num_samples += actual_batch_size

        # Epoch metrics
        avg_epoch_loss = epoch_loss / num_samples
        loss_history.append(avg_epoch_loss)

        if verbose:
            print(f"Epoch {epoch+1}/{num_epochs} - Loss: {avg_epoch_loss:.6f}")

    return theta, loss_history


def batch_mse_gradient(
    X_batch: np.ndarray,
    y_batch: np.ndarray,
    theta: np.ndarray
) -> np.ndarray:
    """MSE gradient averaged over a mini-batch."""
    B = X_batch.shape[0]
    predictions = X_batch @ theta
    residuals = predictions - y_batch
    return (1 / B) * X_batch.T @ residuals


def batch_mse_loss(
    X_batch: np.ndarray,
    y_batch: np.ndarray,
    theta: np.ndarray
) -> float:
    """MSE loss for a batch."""
    predictions = X_batch @ theta
    return 0.5 * np.mean((predictions - y_batch) ** 2)


# Example usage
if __name__ == "__main__":
    np.random.seed(42)

    # Generate data
    N, d = 10000, 5
    X = np.random.randn(N, d)
    theta_true = np.array([1.0, -2.0, 0.5, 3.0, -1.5])
    y = X @ theta_true + 0.1 * np.random.randn(N)

    # Initialize
    theta_init = np.random.randn(d)

    # Run mini-batch SGD with different batch sizes
    for batch_size in [16, 64, 256]:
        theta_opt, history = minibatch_sgd(
            X, y, theta_init,
            grad_fn=batch_mse_gradient,
            loss_fn=batch_mse_loss,
            batch_size=batch_size,
            learning_rate=0.1,
            num_epochs=20,
            verbose=False
        )
        error = np.linalg.norm(theta_opt - theta_true)
        print(f"Batch {batch_size}: Final error = {error:.6f}")
```

When N is not divisible by B, the last batch is smaller. Options: (1) drop it (common in training), (2) pad with zeros or repeated examples, (3) use a varying batch size. Most frameworks drop it by default. For validation, include all examples.
One of the primary motivations for mini-batching is hardware efficiency. Understanding how batch size affects GPU utilization is crucial for practical deep learning.
Why Batching Enables Parallelism
A GPU consists of thousands of cores that can execute the same operation on different data simultaneously (SIMD: Single Instruction, Multiple Data). Mini-batching provides independent data points that can be processed in parallel:
Matrix multiplication: $\mathbf{Y} = \mathbf{X}\mathbf{W}$ where $\mathbf{X}$ is $B \times d_{\text{in}}$. All $B$ examples are processed simultaneously.
Convolutions: Each of $B$ images is convolved with filters independently.
Activation functions: ReLU, sigmoid, etc. applied to all $B \cdot d$ values in parallel.
With batch size 1, most GPU cores sit idle. With batch size $B$, we can utilize $B$ times more parallelism (up to hardware limits).
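A small NumPy check illustrates why: processing a batch as one matrix multiply gives the same result as $B$ separate matrix-vector products, but as a single large operation that parallel hardware can saturate (shapes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
B, d_in, d_out = 64, 128, 32
X = rng.normal(size=(B, d_in))      # a mini-batch of B examples
W = rng.normal(size=(d_in, d_out))  # layer weights

# One example at a time: B separate matrix-vector products
Y_loop = np.stack([X[i] @ W for i in range(B)])

# Batched: a single (B x d_in) @ (d_in x d_out) matrix multiply
Y_batched = X @ W

print(np.allclose(Y_loop, Y_batched))  # same result, one big op
```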
The Throughput Curve
As batch size increases, throughput (examples per second) typically follows this pattern:
Small $B$ (1-16): Throughput increases nearly linearly with $B$. GPU underutilized.
Medium $B$ (32-256): Throughput still increases but sublinearly. Approaching GPU saturation.
Large $B$ (512+): Throughput plateaus. GPU fully utilized; memory bandwidth becomes bottleneck.
Very large $B$: May actually decrease due to memory thrashing or reduced caching efficiency.
The exact numbers depend on model architecture, GPU model, and memory bandwidth.
Data Loading Considerations
High GPU utilization requires the data pipeline to keep up. If data loading is slow, the GPU idles between batches and measured throughput drops no matter what batch size you choose.
Best practices for fast data loading:
- Set `num_workers > 0` in DataLoader to parallelize CPU preprocessing

The goal: data ready before the GPU finishes the previous batch.
The relationship between batch size and generalization is one of the most researched topics in deep learning optimization. The empirical observation is robust: small batch sizes often lead to better generalization, but the reasons are nuanced.
The Empirical Evidence
Multiple studies have documented the batch size-generalization relationship:
Sharp vs Flat Minima (Keskar et al., 2017): Large batch training tends to converge to 'sharp' minima with high curvature, while small batch finds 'flat' minima. Flat minima generalize better.
Generalization Gap: Training with large batches often achieves similar training loss but higher test loss compared to small batches.
Critical Batch Size: There exists a problem-dependent batch size beyond which increasing further provides no benefit and may hurt generalization.
Gradient noise from small batches acts as regularization. The noise scale is proportional to lr/B. When you increase B, you effectively reduce regularization. This partially explains why large-batch training needs modified techniques (like longer training, warmup, or explicit regularization) to match small-batch generalization.
Theoretical Perspectives
The Noise Scale: The ratio $\eta/B$ controls the 'temperature' of the stochastic optimization process:
$$\text{Noise Scale} \propto \frac{\eta \sigma^2}{B}$$
Small batches with same effective learning rate have higher noise scale, enabling more exploration and regularization.
The SDE View: SGD can be approximated as a stochastic differential equation:
$$d\boldsymbol{\theta} = -\nabla J(\boldsymbol{\theta})\,dt + \sqrt{\frac{\eta \sigma^2}{B}}\,d\mathbf{W}$$
The diffusion term (noise) is modulated by $\eta/B$. This continuous-time view helps explain the interplay between batch size and learning rate.
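A few lines of arithmetic show how the linear scaling rule interacts with the noise scale (the per-example variance $\sigma^2 = 1$ is an arbitrary placeholder):

```python
# Noise scale ~ eta * sigma^2 / B: scaling eta and B together (the linear
# scaling rule) leaves the SGD "temperature" unchanged.
sigma_sq = 1.0  # per-example gradient variance (illustrative placeholder)

def noise_scale(eta: float, B: int) -> float:
    return eta * sigma_sq / B

small_batch = noise_scale(eta=0.01, B=32)     # baseline
linear_scaled = noise_scale(eta=0.08, B=256)  # 8x batch, 8x learning rate
fixed_lr = noise_scale(eta=0.01, B=256)       # 8x batch, same learning rate

print(small_batch, linear_scaled, fixed_lr)
```

Linear scaling preserves the noise scale; keeping the learning rate fixed while growing the batch reduces it 8x, which is one way to see why large fixed-lr batches lose the implicit regularization.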
The Gradient Diversity Hypothesis: Small batches see more diverse gradients throughout training. This diversity may help escape narrow basins and explore the loss landscape more thoroughly.
Gradient accumulation is a technique to simulate large batch sizes when GPU memory is insufficient. It's essential for training large models on limited hardware.
The Core Idea
Instead of: one forward/backward pass over a large batch of, say, $4B$ examples, which may not fit in memory.
Do: four forward/backward passes over micro-batches of $B$ examples, accumulating (summing) the gradients, then apply a single parameter update.
Both approaches produce the same parameter update (mathematically identical), but gradient accumulation uses 4× less memory for activations.
```python
import torch
import torch.nn as nn


def train_with_gradient_accumulation(
    model: nn.Module,
    dataloader,
    optimizer,
    loss_fn,
    accumulation_steps: int = 4
):
    """
    Training loop with gradient accumulation.

    Effective batch size = dataloader.batch_size * accumulation_steps
    Memory usage scales with dataloader.batch_size (not effective batch size)

    Args:
        model: The neural network
        dataloader: Provides mini-batches
        optimizer: Parameter optimizer (SGD, Adam, etc.)
        loss_fn: Loss function
        accumulation_steps: Number of micro-batches to accumulate
    """
    model.train()
    optimizer.zero_grad()  # Zero gradients once at the start

    for i, (inputs, targets) in enumerate(dataloader):
        # Forward pass
        outputs = model(inputs)

        # Scale loss by accumulation steps so the accumulated
        # gradient equals the large-batch gradient
        loss = loss_fn(outputs, targets) / accumulation_steps

        # Backward pass - gradients are ACCUMULATED (not replaced)
        loss.backward()

        # Update only every accumulation_steps
        if (i + 1) % accumulation_steps == 0:
            optimizer.step()       # Apply accumulated gradients
            optimizer.zero_grad()  # Reset gradients for next accumulation

            print(f"Step {(i + 1) // accumulation_steps}: "
                  f"Loss = {loss.item() * accumulation_steps:.4f}")

    # Handle remaining batches if not divisible by accumulation_steps
    if (i + 1) % accumulation_steps != 0:
        optimizer.step()
        optimizer.zero_grad()


# Why we divide the loss by accumulation_steps:
#
# Normally, gradient = d(Loss)/d(theta)
#
# With accumulation over K micro-batches:
#   accumulated_grad = sum of K micro-batch gradients
#                    = K * (average gradient per micro-batch)
#
# But we want: accumulated_grad = gradient for the effective batch
#
# By dividing each loss by K:
#   accumulated_grad = sum of K * (gradient / K) = effective-batch gradient
#
# Alternatively, you can scale the learning rate by 1/K instead of scaling loss
```

Batch normalization computes statistics over the micro-batch, not the effective batch. This can cause issues if micro-batches are very small. Solutions: (1) use Group Normalization or Layer Normalization instead, (2) sync BatchNorm statistics across accumulation steps (complex), (3) use large enough micro-batches.
When to Use Gradient Accumulation
Use it whenever the effective batch size you want exceeds what fits in GPU memory: large models, high-resolution inputs, or reproducing a large-batch training recipe on smaller hardware.
Trade-offs
| Aspect | Large Batch (fits in memory) | Gradient Accumulation |
|---|---|---|
| Speed | Faster (one forward/backward) | Slower (K forward/backward) |
| Memory | Higher | Lower |
| Gradient | Exact | Mathematically identical |
| BatchNorm | Full batch stats | Micro-batch stats (potential issue) |
Mini-batch SGD naturally extends to distributed training across multiple GPUs or machines. Understanding this connection is important as modern training increasingly uses distributed setups.
Data Parallelism
The most common distributed approach: each of $K$ workers holds a full copy of the model, computes the gradient on its own mini-batch, and the gradients are averaged across workers (an all-reduce) before every worker applies the identical update.
Effectively, this multiplies batch size by $K$: $$B_{\text{effective}} = K \cdot B_{\text{per-worker}}$$
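The equivalence between gradient averaging and large-batch SGD can be checked directly. In this NumPy sketch (toy sizes and a hypothetical `mse_grad` helper, used for illustration), averaging per-worker gradients over equal-sized shards equals the gradient over the combined batch:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synchronous data parallelism: K workers, each with a micro-batch of B examples
K, B, d = 4, 8, 5
X = rng.normal(size=(K * B, d))
y = rng.normal(size=K * B)
theta = rng.normal(size=d)

def mse_grad(Xb, yb, theta):
    """Gradient of 0.5 * mean squared error over a batch."""
    return Xb.T @ (Xb @ theta - yb) / Xb.shape[0]

# Each worker's gradient on its own shard
worker_grads = [mse_grad(X[k*B:(k+1)*B], y[k*B:(k+1)*B], theta) for k in range(K)]
allreduced = np.mean(worker_grads, axis=0)  # what the all-reduce computes

# Single-worker gradient over the full effective batch of K*B examples
large_batch = mse_grad(X, y, theta)

print(np.allclose(allreduced, large_batch))  # identical (shards are equal-sized)
```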
Synchronous vs Asynchronous Updates
Synchronous SGD: all workers compute gradients on their mini-batches, wait for each other, average the gradients, and apply the same update in lockstep.
Asynchronous SGD: workers push gradients and pull parameters without waiting for each other; this avoids stragglers but means updates are computed from stale parameters.
Synchronous is more common in practice because it's equivalent to mini-batch SGD with larger batch size—the algorithm we understand well.
Even if you train on one GPU, understanding distributed training helps interpret papers and hyperparameters. When a paper says 'batch size 8192 on 32 GPUs', they mean 256 per GPU with synchronous updates. If you have 1 GPU, you'd use gradient accumulation over 32 steps with batch 256.
We've comprehensively covered mini-batch stochastic gradient descent—the workhorse algorithm of modern deep learning. Let's consolidate the essential knowledge: the mini-batch gradient is an unbiased estimator whose variance falls as $1/B$; the linear scaling rule (with warmup) links batch size and learning rate up to a stability limit; batch size trades off hardware efficiency, memory, and generalization; and gradient accumulation or data parallelism extend the same algorithm when a single device isn't enough.
What's Next
The next page covers learning rate selection—perhaps the single most important hyperparameter in optimization. We'll explore principled approaches to choosing the learning rate, from simple heuristics to sophisticated techniques like learning rate range tests and cyclical learning rates.
You now understand why batch size matters, how to choose it, and what trade-offs you're making. Combined with knowledge of batch GD and pure SGD, you have the complete picture of gradient descent variants. The remaining pages in this module will help you tune these methods for maximum effectiveness.