In theory, gradient descent is straightforward: compute the gradient over all data, take a step, repeat. In practice, this becomes impossible. Modern datasets contain millions—sometimes billions—of samples. Computing the exact gradient requires processing every sample for every single update step.
Imagine training a language model on the entire internet. Each gradient computation would require reading terabytes of text before making one parameter update. At that rate, training would take centuries.
The solution? Approximate the gradient. Instead of computing the exact gradient over all data, use a subset—a mini-batch. This introduces noise but enables massive speedups. The art lies in balancing the noise-accuracy trade-off. This page explores the spectrum of gradient descent variants and when to use each.
By the end of this page, you will understand the fundamental differences between batch, stochastic, and mini-batch gradient descent, analyze the variance-computation trade-off, implement all three variants, and develop intuition for choosing batch sizes in practice.
All variants optimize the same objective—they differ in how they estimate the gradient:
Full (Batch) Gradient Descent
$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \alpha \nabla J(\boldsymbol{\theta}^{(t)}) = \boldsymbol{\theta}^{(t)} - \frac{\alpha}{n} \sum_{i=1}^{n} \nabla \mathcal{L}_i(\boldsymbol{\theta}^{(t)})$$
Computes the exact gradient over all $n$ samples. Precise but expensive.
Stochastic Gradient Descent (SGD)
$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \alpha \nabla \mathcal{L}_i(\boldsymbol{\theta}^{(t)})$$
Computes gradient from a single randomly sampled training example. Noisy but cheap.
Mini-batch Gradient Descent
$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \frac{\alpha}{|B|} \sum_{i \in B} \nabla \mathcal{L}_i(\boldsymbol{\theta}^{(t)})$$
Computes gradient from a randomly sampled batch $B$ of $b$ examples. Best of both worlds: reduced variance, manageable cost.
| Variant | Batch Size | Gradient Quality | Update Frequency | Memory Use | Best For |
|---|---|---|---|---|---|
| Full Batch | All n samples | Exact (no variance) | 1 per epoch | High | Small datasets, convex problems |
| Stochastic (SGD) | 1 sample | Very noisy | n per epoch | Minimal | Online learning, huge datasets |
| Mini-batch | b samples (16-512) | Moderate variance | n/b per epoch | Moderate | Deep learning (most common) |
Full batch gradient descent uses the entire dataset to compute each gradient. This is what we've been analyzing so far.
Advantages:
- Exact gradient with zero variance, so convergence analyses apply directly
- Deterministic, reproducible updates
- Stable, smooth progress with a well-chosen learning rate
Disadvantages:
- Each update requires a full pass over all $n$ samples
- High memory use: the whole dataset must be processed per step
- Only one update per epoch, so progress per unit of computation is slow
```python
import numpy as np

def batch_gradient_descent(X, y, theta_init, learning_rate=0.01,
                           n_epochs=100, tol=1e-6):
    """
    Full batch gradient descent for linear regression.

    Args:
        X: Feature matrix (n_samples, n_features)
        y: Target vector (n_samples,)
        theta_init: Initial parameters
        learning_rate: Step size α
        n_epochs: Maximum number of passes through data
        tol: Convergence tolerance

    Returns:
        theta: Learned parameters
        history: List of (loss, grad_norm) per epoch
    """
    n_samples = X.shape[0]
    theta = theta_init.copy()
    history = []

    for epoch in range(n_epochs):
        # Forward pass: compute predictions for ALL samples
        y_pred = X @ theta

        # Compute exact gradient over entire dataset
        residuals = y_pred - y
        gradient = (X.T @ residuals) / n_samples  # Average gradient

        # Compute loss for monitoring
        loss = np.mean(residuals ** 2) / 2
        grad_norm = np.linalg.norm(gradient)
        history.append({'epoch': epoch, 'loss': loss, 'grad_norm': grad_norm})

        # Check convergence
        if grad_norm < tol:
            print(f"Converged at epoch {epoch}")
            break

        # Update: one step per epoch
        theta = theta - learning_rate * gradient

    return theta, history

# Example usage
np.random.seed(42)
n_samples = 1000
n_features = 5

# Generate data
X = np.random.randn(n_samples, n_features)
true_theta = np.array([1, -2, 3, -4, 5])
y = X @ true_theta + 0.1 * np.random.randn(n_samples)

# Train
theta_init = np.zeros(n_features)
theta, history = batch_gradient_descent(X, y, theta_init,
                                        learning_rate=0.1, n_epochs=100)

print(f"True parameters: {true_theta}")
print(f"Learned parameters: {np.round(theta, 3)}")
print(f"Final loss: {history[-1]['loss']:.6f}")
```

Full batch GD is appropriate when: (1) the dataset fits in memory, (2) n ≤ 10,000 samples, (3) you need deterministic, reproducible training, or (4) you are using second-order methods that require exact gradients. For deep learning with large datasets, mini-batch is almost always preferred.
SGD uses a single randomly sampled example to estimate the gradient. The gradient of one sample is an unbiased estimator of the true gradient:
$$\mathbb{E}[\nabla \mathcal{L}_i(\boldsymbol{\theta})] = \nabla J(\boldsymbol{\theta})$$
where the expectation is over random sample $i$.
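This unbiasedness is easy to check numerically for the squared-error loss used in the code on this page; the data below is synthetic and the check is a sketch, not a proof:

```python
import numpy as np

# Verify that per-sample gradients average to the full gradient
# for squared-error linear regression (synthetic data).
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
y = X @ np.array([1.0, -2.0, 3.0]) + 0.1 * rng.standard_normal(500)
theta = rng.standard_normal(3)

# Full gradient: (1/n) * X^T (X theta - y)
full_grad = X.T @ (X @ theta - y) / len(X)

# Per-sample gradients g_i = x_i * (x_i . theta - y_i), stacked as rows
per_sample = X * (X @ theta - y)[:, None]   # shape (n, 3)

# The mean of the per-sample gradients equals the full gradient exactly,
# so a uniformly random sample i is an unbiased estimator.
assert np.allclose(per_sample.mean(axis=0), full_grad)
print("unbiased: per-sample gradients average to the full gradient")
```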
Advantages:
- Each update touches a single sample, so updates are extremely cheap
- Makes fast initial progress: $n$ updates per epoch
- Natural fit for online/streaming settings where data arrives one example at a time
- Gradient noise can help escape shallow local minima and saddle points
Disadvantages:
- High-variance updates make the loss oscillate
- Cannot converge precisely with a fixed learning rate (requires a decay schedule)
- Processing one sample at a time badly underutilizes vectorized hardware
The variance of SGD:
Let $g_i = \nabla \mathcal{L}_i(\boldsymbol{\theta})$ be the gradient for sample $i$. The variance of this estimator is:
$$\text{Var}[g_i] = \mathbb{E}[\|g_i\|^2] - \|\nabla J\|^2$$
This variance is typically large when samples have diverse gradients. Near optima (where $\nabla J \approx 0$), the noise doesn't vanish—updates continue to oscillate.
Key insight: SGD can get close to the optimum quickly due to many updates, but it can't converge precisely without learning rate decay.
```python
import numpy as np

def stochastic_gradient_descent(X, y, theta_init, learning_rate=0.01,
                                n_epochs=10, lr_decay=None):
    """
    Pure Stochastic Gradient Descent (batch size = 1).

    Args:
        X: Feature matrix (n_samples, n_features)
        y: Target vector (n_samples,)
        theta_init: Initial parameters
        learning_rate: Initial step size α
        n_epochs: Number of passes through entire dataset
        lr_decay: Optional learning rate schedule ('1/t', 'sqrt')

    Returns:
        theta: Learned parameters
        history: List of loss values (sampled periodically)
    """
    n_samples, n_features = X.shape
    theta = theta_init.copy()
    history = []
    step = 0

    for epoch in range(n_epochs):
        # Shuffle data at start of each epoch
        indices = np.random.permutation(n_samples)
        epoch_losses = []

        for i in indices:
            step += 1

            # Get single sample
            x_i = X[i:i+1]  # Shape (1, n_features)
            y_i = y[i:i+1]  # Shape (1,)

            # Forward pass: prediction for ONE sample
            y_pred = x_i @ theta

            # Compute gradient for ONE sample (unbiased estimator)
            residual = y_pred - y_i
            gradient = x_i.T.flatten() * residual[0]  # No averaging needed

            # Learning rate decay
            if lr_decay == '1/t':
                alpha = learning_rate / (1 + 0.01 * step)
            elif lr_decay == 'sqrt':
                alpha = learning_rate / np.sqrt(1 + 0.01 * step)
            else:
                alpha = learning_rate

            # Update with noisy gradient
            theta = theta - alpha * gradient

            # Track loss (for subset of steps to avoid overhead)
            if step % 100 == 0:
                loss = np.mean((X @ theta - y) ** 2) / 2
                epoch_losses.append(loss)

        # Epoch summary
        if epoch_losses:
            history.append({
                'epoch': epoch,
                'loss': np.mean(epoch_losses),
                'final_lr': alpha
            })

    return theta, history

# Example: Compare SGD with and without learning rate decay
np.random.seed(42)
n_samples = 1000
X = np.random.randn(n_samples, 3)
true_theta = np.array([1.0, 2.0, 3.0])
y = X @ true_theta + 0.1 * np.random.randn(n_samples)

theta_init = np.zeros(3)

# SGD without decay
theta_no_decay, hist_no_decay = stochastic_gradient_descent(
    X, y, theta_init, learning_rate=0.01, n_epochs=20, lr_decay=None)

# SGD with 1/t decay
theta_with_decay, hist_with_decay = stochastic_gradient_descent(
    X, y, theta_init, learning_rate=0.1, n_epochs=20, lr_decay='1/t')

print("True theta:", true_theta)
print(f"SGD (no decay):   {np.round(theta_no_decay, 3)}, final loss: {hist_no_decay[-1]['loss']:.6f}")
print(f"SGD (with decay): {np.round(theta_with_decay, 3)}, final loss: {hist_with_decay[-1]['loss']:.6f}")
```

With a fixed learning rate, SGD oscillates in a neighborhood of the optimum. The size of this neighborhood scales with the learning rate and gradient variance. To converge precisely, you must decrease the learning rate over time (e.g., $\alpha_t = \alpha_0 / t$).
Mini-batch gradient descent is the standard approach for deep learning. It strikes a balance between full batch (exact but slow) and pure SGD (fast but noisy).
Mini-batch gradient:
$$\nabla_{\text{batch}} = \frac{1}{b} \sum_{i \in B} \nabla \mathcal{L}_i(\boldsymbol{\theta})$$
where $B$ is a random batch of $b$ samples.
Variance reduction:
The variance of the mini-batch gradient estimator is: $$\text{Var}[\nabla_{\text{batch}}] = \frac{\text{Var}[g_i]}{b}$$
Variance falls as $1/b$: a batch of 64 has 1/64 the variance of single-sample SGD.
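The $1/b$ scaling can be checked with a quick Monte Carlo experiment; the sketch below uses synthetic data and samples batches with replacement, which matches the i.i.d. assumption behind the formula:

```python
import numpy as np

# Empirically check Var[grad_batch] ∝ 1/b on synthetic linear-regression data.
rng = np.random.default_rng(0)
n = 2000
X = rng.standard_normal((n, 3))
y = X @ np.array([1.0, -2.0, 3.0]) + 0.5 * rng.standard_normal(n)
theta = np.zeros(3)

full = X.T @ (X @ theta - y) / n  # exact full-batch gradient

def mc_variance(b, trials=3000):
    """Monte Carlo estimate of E||g_batch - full_grad||^2 for batch size b."""
    total = 0.0
    for _ in range(trials):
        idx = rng.integers(0, n, size=b)          # sample WITH replacement
        Xb, yb = X[idx], y[idx]
        g = Xb.T @ (Xb @ theta - yb) / b
        total += np.sum((g - full) ** 2)
    return total / trials

ratio = mc_variance(1) / mc_variance(64)
print(f"variance ratio, b=1 vs b=64: {ratio:.1f}")  # roughly 64
```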
| Batch Size | Variance | Updates/Epoch | GPU Utilization | Generalization |
|---|---|---|---|---|
| 1 (SGD) | Very high | n | Poor | Often best |
| 16 | High | n/16 | Moderate | Very good |
| 32 | Moderate | n/32 | Good | Good |
| 64 | Moderate | n/64 | Good | Good |
| 128 | Lower | n/128 | Very good | Good |
| 256 | Low | n/256 | Excellent | May degrade |
| 512+ | Very low | n/512+ | Excellent | Often worse |
| Full (n) | Zero | 1 | Excellent | May overfit |
```python
import numpy as np

def minibatch_gradient_descent(X, y, theta_init, batch_size=32,
                               learning_rate=0.01, n_epochs=50):
    """
    Mini-batch Gradient Descent - the deep learning standard.

    Args:
        X: Feature matrix (n_samples, n_features)
        y: Target vector (n_samples,)
        theta_init: Initial parameters
        batch_size: Number of samples per batch
        learning_rate: Step size α
        n_epochs: Number of passes through entire dataset

    Returns:
        theta: Learned parameters
        history: Loss and gradient info per epoch
    """
    n_samples, n_features = X.shape
    theta = theta_init.copy()
    history = []

    for epoch in range(n_epochs):
        # Shuffle data at start of each epoch
        indices = np.random.permutation(n_samples)
        epoch_losses = []

        # Process data in batches
        for start_idx in range(0, n_samples, batch_size):
            end_idx = min(start_idx + batch_size, n_samples)
            batch_indices = indices[start_idx:end_idx]
            current_batch_size = len(batch_indices)

            # Extract batch
            X_batch = X[batch_indices]
            y_batch = y[batch_indices]

            # Forward pass
            y_pred = X_batch @ theta

            # Compute gradient (averaged over batch)
            residuals = y_pred - y_batch
            gradient = (X_batch.T @ residuals) / current_batch_size

            # Update parameters
            theta = theta - learning_rate * gradient

            # Track batch loss
            batch_loss = np.mean(residuals ** 2) / 2
            epoch_losses.append(batch_loss)

        # Epoch statistics
        epoch_loss = np.mean(epoch_losses)
        full_loss = np.mean((X @ theta - y) ** 2) / 2
        history.append({
            'epoch': epoch,
            'batch_loss': epoch_loss,
            'full_loss': full_loss,
            'n_batches': len(epoch_losses)
        })

    return theta, history

# Compare different batch sizes
np.random.seed(42)
n_samples = 5000
n_features = 10

X = np.random.randn(n_samples, n_features)
true_theta = np.linspace(-5, 5, n_features)
y = X @ true_theta + 0.1 * np.random.randn(n_samples)

theta_init = np.zeros(n_features)

batch_sizes = [1, 16, 64, 256, n_samples]
results = {}

for bs in batch_sizes:
    name = f"bs_{bs}" if bs < n_samples else "full"
    theta, hist = minibatch_gradient_descent(
        X, y, theta_init, batch_size=bs, learning_rate=0.01, n_epochs=20)
    results[name] = {
        'theta': theta,
        'final_loss': hist[-1]['full_loss'],
        'history': hist
    }
    print(f"Batch size {bs:5d}: Final loss = {hist[-1]['full_loss']:.6f}")

# All converge to similar solutions, but the trajectories differ
```

Start with batch size 32 or 64. Increase batch size if: (1) the GPU is underutilized, (2) training is unstable, or (3) you can afford more compute. Decrease batch size if: (1) generalization is poor, (2) memory is limited, or (3) you want implicit regularization.
The fundamental trade-off in choosing batch size is between gradient quality (variance) and computational efficiency.
Fixed Compute Budget Analysis:
Suppose you have budget for $C$ gradient evaluations (sample-gradient computations).
With batch size $b$, each update consumes $b$ gradient evaluations, so the budget allows $T = C/b$ update steps.
For convex problems:
The error after $T$ steps with batch size $b$ and variance $\sigma^2$ is approximately: $$\text{Error} \propto \underbrace{\frac{1}{T}}_{\text{optimization}} + \underbrace{\frac{\alpha \sigma^2}{b}}_{\text{noise}}$$
Substituting $T = C/b$: $$\text{Error} \propto \frac{b}{C} + \frac{\alpha \sigma^2}{b}$$
This is minimized at an intermediate batch size—not too small (noise dominates), not too large (not enough updates).
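Plugging numbers into this expression makes the interior optimum concrete. The constants $\alpha$, $\sigma^2$, and $C$ below are illustrative assumptions, chosen only so the arithmetic is easy to follow:

```python
import numpy as np

# Illustrative (made-up) constants: step size, gradient variance, budget.
alpha, sigma2, C = 0.1, 100.0, 1_000_000

def error(b):
    # Error model from the text: optimization term b/C plus noise term a*s2/b
    return b / C + alpha * sigma2 / b

# Closed form: d/db (b/C + a*s2/b) = 1/C - a*s2/b^2 = 0  =>  b* = sqrt(a*s2*C)
b_star = np.sqrt(alpha * sigma2 * C)
print(f"optimal batch size ≈ {b_star:.0f}")  # ≈ 3162

# Among the usual power-of-two choices, the interior one wins
candidates = np.array([1, 16, 64, 256, 1024, 4096, 16384])
best = candidates[np.argmin(error(candidates))]
print(f"best of the usual sizes: {best}")  # 4096
```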
The "linear scaling rule":
As batch size increases, the variance decreases by $1/b$. To maintain the same noise-to-signal ratio of updates, the learning rate can be scaled up proportionally:
$$\alpha_{\text{large batch}} = \alpha_{\text{small batch}} \times \frac{b_{\text{large}}}{b_{\text{small}}}$$
Example: if $\alpha = 0.1$ works well at batch size 256, the rule suggests $\alpha = 0.4$ at batch size 1024.
This allows training with larger batches without losing convergence speed in terms of epochs. However, the larger batch is more expensive per epoch, so wall-clock speedup requires parallelism.
Caveats:
- The rule breaks down at very large batch sizes, where the scaled learning rate becomes unstable
- A warmup period (gradually ramping $\alpha$ up) is typically needed early in training
- It is an empirical heuristic, not a guarantee; validate it on your own problem
Empirically, smaller batches often generalize better than large batches, even when training loss is similar. Theories suggest SGD noise acts as implicit regularization, finding "flatter" minima that generalize better. This is an active research area.
These terms are often confused. Let's define them precisely:
Epoch: One complete pass through the entire training dataset.
Iteration/Step: One parameter update (one gradient descent step).
Batch: A subset of samples used to compute one gradient.
| Quantity | Value | Meaning |
|---|---|---|
| n | 60,000 | Total samples in dataset |
| b | 128 | Samples per mini-batch |
| Iterations per epoch | ⌈60000/128⌉ = 469 | Gradient updates per pass |
| Epochs trained | 10 | Full passes through data |
| Total iterations | 4,690 | Total gradient updates |
| Total samples seen | 600,000 | = n × epochs |
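The bookkeeping in the table can be reproduced in a few lines, using the same numbers:

```python
import math

# n samples, batch size b, number of epochs (values from the table above)
n, b, epochs = 60_000, 128, 10

iters_per_epoch = math.ceil(n / b)       # 469 gradient updates per pass
total_iters = iters_per_epoch * epochs   # 4690 total gradient updates
total_samples = n * epochs               # 600000 samples seen overall

print(iters_per_epoch, total_iters, total_samples)  # 469 4690 600000
```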
Why this matters:
- Learning rate schedules are often defined per epoch (e.g., "decay at epoch 30") or per iteration.
- Comparing methods: SGD makes $n$ updates per epoch versus 1 for full batch, so comparing "per epoch" progress is misleading.
- Convergence guarantees are often stated in terms of iterations, not epochs.
- Batch normalization statistics are computed over mini-batches, so batch size affects normalization.
Best practice: Report both iterations and epochs, along with batch size, so results are reproducible and comparable.
How you select mini-batches affects both convergence and variance.
Random Shuffling (Standard)
At the start of each epoch, randomly permute the dataset. Then iterate through in order.
With Replacement Sampling
For each batch, sample randomly with replacement from the full dataset.
Without Replacement (No Shuffling)
Process data in the same order every epoch.
```python
import numpy as np
from typing import Iterator, Tuple

def shuffled_batches(X: np.ndarray, y: np.ndarray,
                     batch_size: int) -> Iterator[Tuple[np.ndarray, np.ndarray]]:
    """
    Standard shuffled mini-batch iteration.
    Each sample seen exactly once per epoch.
    """
    n = len(X)
    indices = np.random.permutation(n)
    for start in range(0, n, batch_size):
        end = min(start + batch_size, n)
        batch_idx = indices[start:end]
        yield X[batch_idx], y[batch_idx]

def sampled_batches(X: np.ndarray, y: np.ndarray, batch_size: int,
                    n_batches: int) -> Iterator[Tuple[np.ndarray, np.ndarray]]:
    """
    Random sampling WITH replacement.
    Some samples may repeat, others may be missed.
    """
    n = len(X)
    for _ in range(n_batches):
        batch_idx = np.random.choice(n, size=batch_size, replace=True)
        yield X[batch_idx], y[batch_idx]

def stratified_batches(X: np.ndarray, y: np.ndarray,
                       batch_size: int) -> Iterator[Tuple[np.ndarray, np.ndarray]]:
    """
    Stratified batching: each batch has balanced class representation.
    Important for imbalanced datasets.
    """
    classes = np.unique(y)
    n_classes = len(classes)
    samples_per_class = batch_size // n_classes

    # Group indices by class
    class_indices = {c: np.where(y == c)[0] for c in classes}

    # Shuffle within each class
    for c in classes:
        np.random.shuffle(class_indices[c])

    # Create batches with balanced classes
    class_pointers = {c: 0 for c in classes}
    n = len(y)
    n_batches = n // batch_size

    for _ in range(n_batches):
        batch_idx = []
        for c in classes:
            start = class_pointers[c]
            end = start + samples_per_class
            # Handle wraparound
            if end > len(class_indices[c]):
                np.random.shuffle(class_indices[c])
                class_pointers[c] = 0
                start = 0
                end = samples_per_class
            batch_idx.extend(class_indices[c][start:end])
            class_pointers[c] = end
        batch_idx = np.array(batch_idx)
        np.random.shuffle(batch_idx)  # Shuffle within batch
        yield X[batch_idx], y[batch_idx]

# Example usage
X = np.random.randn(1000, 10)
y = np.random.randint(0, 5, 1000)  # 5 classes

print("Shuffled batches (standard):")
for i, (X_batch, y_batch) in enumerate(shuffled_batches(X, y, 64)):
    if i < 2:
        print(f"  Batch {i}: shape {X_batch.shape}, "
              f"class distribution: {np.bincount(y_batch, minlength=5)}")

print("Stratified batches:")
for i, (X_batch, y_batch) in enumerate(stratified_batches(X, y, 50)):
    if i < 2:
        print(f"  Batch {i}: shape {X_batch.shape}, "
              f"class distribution: {np.bincount(y_batch, minlength=5)}")
```

Never train on data in a fixed order. Even if you think your data is "already random," shuffle anyway. The cost is negligible, and the risk of subtle ordering biases is real. Most deep learning frameworks do this automatically, but verify!
Modern deep learning relies heavily on GPUs, which fundamentally changes the batch size calculus.
Why GPUs Love Large Batches:
GPUs have thousands of cores designed for parallel matrix operations. With batch size $b$:
- A forward/backward pass over a batch is one large matrix multiply, which parallelizes far better than $b$ separate small ones
- Larger $b$ raises arithmetic intensity, keeping cores busy instead of waiting on memory
- Below a hardware-dependent threshold, increasing $b$ costs almost no extra wall-clock time per step
Memory Constraints:
The batch size is often limited by GPU memory: $$\text{Memory} = \text{Model params} + \text{Activations} \times b + \text{Gradients}$$
Activations scale linearly with batch size. For large models (BERT, GPT), batch size 4-16 may be the maximum that fits.
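As a rough illustration of the memory formula above, here is a back-of-envelope sketch. The parameter count and per-sample activation footprint are made-up assumptions, and real frameworks add optimizer state and other overhead on top:

```python
# Back-of-envelope sketch of: Memory = params + activations * b + gradients.
# Inputs are illustrative assumptions, not measurements of any real model.
def gpu_memory_gb(n_params, act_floats_per_sample, batch_size,
                  bytes_per_float=4):
    params = n_params * bytes_per_float                      # weights
    grads = n_params * bytes_per_float                       # one grad per weight
    acts = act_floats_per_sample * batch_size * bytes_per_float
    return (params + grads + acts) / 1e9

# e.g. a hypothetical ~350M-parameter model with ~50M activation floats/sample:
for b in (4, 16, 64):
    print(b, round(gpu_memory_gb(350e6, 50e6, b), 1))
```

Because the activation term grows linearly in the batch size while the parameter and gradient terms are fixed, the memory bill is eventually dominated by activations.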
Gradient Accumulation:
When batch size is memory-limited, accumulate gradients over multiple mini-batches:
```python
for i, (X_batch, y_batch) in enumerate(data_loader):
    loss = model(X_batch, y_batch) / accumulation_steps
    loss.backward()  # Accumulates gradients
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()  # Apply accumulated gradients
        optimizer.zero_grad()
```
This simulates a larger "effective batch size" without the memory cost.
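For intuition, here is a framework-free NumPy sketch of the same accumulation loop, applied to the linear-regression setup used throughout this page; `micro_batch` and `accumulation_steps` are illustrative choices:

```python
import numpy as np

# Gradient accumulation without a framework: average micro-batch gradients,
# then apply one update as if a larger batch had been used.
rng = np.random.default_rng(0)
X = rng.standard_normal((256, 3))
y = X @ np.array([1.0, -2.0, 3.0])
theta = np.zeros(3)

micro_batch, accumulation_steps, lr = 8, 4, 0.1  # effective batch = 32
accum = np.zeros_like(theta)

for i, start in enumerate(range(0, len(X), micro_batch)):
    Xb, yb = X[start:start + micro_batch], y[start:start + micro_batch]
    g = Xb.T @ (Xb @ theta - yb) / len(Xb)   # micro-batch gradient
    accum += g / accumulation_steps          # running average of micro-grads
    if (i + 1) % accumulation_steps == 0:
        theta -= lr * accum                  # one "large batch" update
        accum[:] = 0.0

print("loss after one pass:", np.mean((X @ theta - y) ** 2) / 2)
```

Only one micro-batch of activations is alive at a time, which is exactly why this trick trades extra passes for lower peak memory.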
Coming Up Next:
Now that we understand gradient descent variants, we turn to acceleration techniques. Momentum and its variants (Nesterov acceleration) can dramatically speed up convergence, especially on ill-conditioned problems. Page 5 explores how to make gradient descent faster without changing the batch size.
You now understand the spectrum from full batch to pure SGD, the trade-offs involved, and how to choose batch sizes in practice. This knowledge is essential for scaling ML to real-world datasets. Next, we'll learn how momentum accelerates convergence.