In the early 2010s, as researchers attempted to train neural networks on increasingly large datasets, they encountered a seemingly insurmountable barrier: batch gradient descent simply couldn't scale. Computing exact gradients over millions or billions of training examples for each parameter update was computationally prohibitive.
The solution, ironically, came from an idea proposed over six decades earlier by Herbert Robbins and Sutton Monro in 1951. Stochastic Gradient Descent (SGD) makes a radical trade-off: instead of computing the exact gradient over all data, it estimates the gradient using just a single randomly selected example.
This seemingly reckless approximation—using one example instead of millions—not only works but often works better than batch methods. Understanding why requires diving deep into the mathematics of stochastic optimization, variance reduction, and the surprising benefits of gradient noise.
By the end of this page, you will understand the mathematical foundations of SGD, analyze why noisy gradients still lead to convergence, quantify the variance-convergence trade-off, implement SGD correctly, and appreciate why stochasticity is not just tolerable but beneficial for generalization.
Recall that batch gradient descent computes the gradient over the entire dataset:
$$\nabla J(\boldsymbol{\theta}) = \frac{1}{N} \sum_{i=1}^{N} \nabla \mathcal{L}(f_{\boldsymbol{\theta}}(\mathbf{x}^{(i)}), y^{(i)})$$
This is the expectation of the per-example gradient over the empirical data distribution. The key insight of SGD is that we can estimate this expectation using a single sample.
The SGD Update Rule
At each iteration, SGD samples an example index $i$ uniformly at random and updates:
$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \eta \nabla \mathcal{L}(f_{\boldsymbol{\theta}^{(t)}}(\mathbf{x}^{(i)}), y^{(i)})$$
Notice the critical difference: we use $\nabla \mathcal{L}$ (gradient of the loss for ONE example) instead of $\nabla J$ (average gradient over ALL examples).
The single-example gradient is an unbiased estimator of the true gradient: $\mathbb{E}_i[\nabla \mathcal{L}(f_{\boldsymbol{\theta}}(\mathbf{x}^{(i)}), y^{(i)})] = \nabla J(\boldsymbol{\theta})$, where the expectation is over the random choice of example index $i$. This unbiasedness is what ensures convergence despite using noisy estimates.
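A quick numerical check (using linear-regression MSE as an illustrative loss) confirms the claim: the average of the $N$ per-example gradients equals the full-batch gradient, so a uniformly sampled per-example gradient is an unbiased estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 3
X = rng.normal(size=(N, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)
theta = rng.normal(size=d)

def example_grad(x_i, y_i, theta):
    """MSE gradient for one example: (x_i @ theta - y_i) * x_i."""
    return (x_i @ theta - y_i) * x_i

# Full-batch gradient (the average over all N examples)
full_grad = X.T @ (X @ theta - y) / N

# The mean of the N per-example gradients matches the full gradient,
# which is exactly the unbiasedness property used by SGD.
per_example = np.array([example_grad(X[i], y[i], theta) for i in range(N)])
assert np.allclose(per_example.mean(axis=0), full_grad)
```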
The Fundamental Trade-off
SGD makes an extreme trade-off:
| Aspect | Batch GD | SGD |
|---|---|---|
| Gradient accuracy | Exact | Very noisy |
| Computation per update | O(N·d) | O(d) |
| Updates per epoch | 1 | N |
| Progress direction | Always downhill | Correct only in expectation |
| Convergence path | Smooth | Highly erratic |
The key question is: can we actually converge to a good solution with such noisy gradients? The answer is yes, under appropriate conditions—but the convergence behavior is fundamentally different from batch methods.
To understand SGD rigorously, we must analyze the stochastic gradient as a random variable and quantify its relationship to the true gradient.
Decomposition of the Stochastic Gradient
Let $\mathbf{g}^{(t)} = \nabla \mathcal{L}(f_{\boldsymbol{\theta}^{(t)}}(\mathbf{x}^{(i_t)}), y^{(i_t)})$ be the stochastic gradient at iteration $t$, where $i_t$ is the randomly selected index. We can decompose:
$$\mathbf{g}^{(t)} = \nabla J(\boldsymbol{\theta}^{(t)}) + \boldsymbol{\epsilon}^{(t)}$$
where $\boldsymbol{\epsilon}^{(t)}$ is the gradient noise with the crucial property:
$$\mathbb{E}[\boldsymbol{\epsilon}^{(t)}] = \mathbf{0}$$
This zero-mean property is what makes SGD work. The stochastic gradient is an unbiased estimator of the true gradient.
Variance of the Stochastic Gradient
While the expected value of $\mathbf{g}^{(t)}$ equals the true gradient, the variance can be substantial:
$$\text{Var}[\mathbf{g}^{(t)}] = \mathbb{E}[\|\boldsymbol{\epsilon}^{(t)}\|^2] = \sigma^2(\boldsymbol{\theta}^{(t)})$$
This variance $\sigma^2$ quantifies how far individual gradients stray from their mean. High variance means individual updates frequently point far from the true descent direction, producing an erratic optimization trajectory.
The variance is NOT uniform across parameter space. Near a minimum (where gradients are small), individual gradients might point in very different directions, leading to high relative noise.
If all training examples give similar gradients, variance is low and SGD behaves like batch GD. If examples give wildly different gradients (heterogeneous dataset), variance is high and SGD is noisier. Data normalization and proper scaling reduce variance.
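As a sketch of this effect, the snippet below estimates $\sigma^2(\boldsymbol{\theta})$ empirically for a linear-regression MSE loss with deliberately mis-scaled features (the constants are illustrative), and shows that standardizing the features shrinks the gradient variance:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 2000, 4
scales = np.array([1.0, 5.0, 0.2, 1.0])        # deliberately mis-scaled features
X = rng.normal(size=(N, d)) * scales
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)
theta = np.zeros(d)

def gradient_noise_var(X, y, theta):
    """Empirical sigma^2(theta): mean squared norm of g_i - grad J."""
    grads = (X @ theta - y)[:, None] * X       # one MSE gradient per example
    noise = grads - grads.mean(axis=0)
    return np.mean(np.sum(noise ** 2, axis=1))

sigma_sq_raw = gradient_noise_var(X, y, theta)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize features
sigma_sq_std = gradient_noise_var(X_std, y, theta)

# Standardizing the features substantially reduces the gradient variance
assert sigma_sq_std < sigma_sq_raw
```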
Bounded Variance Assumption
Convergence proofs typically assume bounded variance:
$$\mathbb{E}[\|\nabla \mathcal{L}(\boldsymbol{\theta}, \mathbf{x}^{(i)}, y^{(i)}) - \nabla J(\boldsymbol{\theta})\|^2] \leq \sigma^2$$
for all $\boldsymbol{\theta}$ and uniformly over all examples. This assumption says the noise is bounded—no single example can produce an arbitrarily large gradient that would destabilize optimization.
Why Unbiasedness Matters
The zero-mean noise property is crucial. Consider the expected update:
$$\mathbb{E}[\boldsymbol{\theta}^{(t+1)}] = \mathbb{E}[\boldsymbol{\theta}^{(t)} - \eta \mathbf{g}^{(t)}] = \boldsymbol{\theta}^{(t)} - \eta \nabla J(\boldsymbol{\theta}^{(t)})$$
(treating $\boldsymbol{\theta}^{(t)}$ as fixed given the history)
This shows that on average, SGD moves in the correct direction. Individual steps may be wrong, but the expected direction is correct.
The convergence of SGD is fundamentally different from batch gradient descent. The noise in gradient estimates creates new challenges that require modified analysis and, critically, decaying learning rates.
The Convergence Dilemma
With a fixed learning rate $\eta$, SGD cannot converge to the exact minimum. The reason: even at the minimum itself, the stochastic gradient is generally nonzero, so every update carries noise of magnitude roughly $\eta\sigma$. The parameters never settle; they keep bouncing around $\boldsymbol{\theta}^*$ in a region whose size is proportional to $\eta\sigma$.
The solution is to decrease the learning rate over time, eventually making steps small enough to overcome the noise.
For SGD to converge to the true minimum, the learning rate schedule $\eta_t$ must satisfy: (1) $\sum_t \eta_t = \infty$ (the total step length is unbounded, so we can reach anywhere), and (2) $\sum_t \eta_t^2 < \infty$ (the accumulated noise contribution vanishes). The classic schedule $\eta_t = \eta_0/(1 + t)$ satisfies both conditions.
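Finite partial sums can only suggest the limiting behavior, but a quick numerical comparison of common schedules illustrates the two conditions:

```python
import numpy as np

T = 200_000
t = np.arange(1, T + 1)

schedules = {
    "1/t":       1.0 / t,           # both conditions hold: sum diverges (slowly), sum of squares converges
    "1/sqrt(t)": 1.0 / np.sqrt(t),  # sum diverges, but sum of squares also diverges
    "constant":  np.full(T, 0.01),  # sum diverges, sum of squares diverges
}

for name, eta in schedules.items():
    print(f"{name:>9}: sum eta = {eta.sum():12.2f}, sum eta^2 = {(eta ** 2).sum():10.4f}")

# For eta_t = 1/t the squared sum stays below its limit pi^2/6 ~ 1.645
assert (schedules["1/t"] ** 2).sum() < np.pi ** 2 / 6
```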
Convergence Theorems
Theorem 1 (Convergence for Convex Functions)
Let $J$ be convex with $L$-Lipschitz gradient, and assume bounded variance $\sigma^2$. With learning rate $\eta_t = 1/(\sqrt{t})$, SGD satisfies:
$$\mathbb{E}[J(\boldsymbol{\theta}^{(T)}) - J(\boldsymbol{\theta}^*)] \leq O\left(\frac{\sigma + LR}{\sqrt{T}}\right)$$
where $R = \|\boldsymbol{\theta}^{(0)} - \boldsymbol{\theta}^*\|$.
This is a sublinear convergence rate of $O(1/\sqrt{T})$—slower than batch GD's $O(1/T)$ for convex functions.
Theorem 2 (Convergence for Strongly Convex Functions)
Let $J$ be $\mu$-strongly convex with $L$-Lipschitz gradient. With learning rate $\eta_t = 1/(\mu t)$, SGD achieves:
$$\mathbb{E}[J(\boldsymbol{\theta}^{(T)}) - J(\boldsymbol{\theta}^*)] \leq O\left(\frac{L\sigma^2}{\mu^2 T}\right)$$
This is $O(1/T)$—matching the batch GD rate for strongly convex functions! Strong convexity compensates for stochastic noise.
| Function Class | Batch GD | SGD | Notes |
|---|---|---|---|
| Convex | $O(1/T)$ | $O(1/\sqrt{T})$ | SGD slower by factor $\sqrt{T}$ |
| Strongly convex | $O(e^{-T/\kappa})$ | $O(1/T)$ | SGD loses exponential rate |
| Non-convex | Stationary point | Stationary point | Similar guarantees |
The Computation-Iteration Trade-off
While SGD converges slower in iterations, it wins overwhelmingly in wall-clock time for large datasets:
For large $N$, SGD's $O(1/\epsilon^2)$ total cost is far lower than batch GD's $O(N/\epsilon)$ whenever $N \gg 1/\epsilon$.
Example: For $N = 10^6$ and $\epsilon = 10^{-3}$, batch GD needs roughly $N/\epsilon = 10^9$ per-example gradient evaluations, while SGD needs roughly $1/\epsilon^2 = 10^6$.
SGD is 1000× faster in this scenario!
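The arithmetic behind that figure can be checked mechanically (constants suppressed, costs counted in per-example gradient evaluations):

```python
# Total cost to reach accuracy eps, in per-example gradient evaluations:
#   batch GD: O(1/eps) updates x N gradients each  -> O(N / eps)
#   SGD:      O(1/eps^2) updates x 1 gradient each -> O(1 / eps^2)
N = 10**6
eps = 10**-3

batch_cost = N / eps     # ~10^9 gradient evaluations
sgd_cost = 1 / eps**2    # ~10^6 gradient evaluations

print(f"batch GD: {batch_cost:.0e}, SGD: {sgd_cost:.0e}, "
      f"speedup: {batch_cost / sgd_cost:.0f}x")
```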
Perhaps the most counterintuitive aspect of SGD is that the noise—which seems like a pure negative—actually provides significant benefits. Understanding these benefits explains why SGD often outperforms batch methods even when computation is not a constraint.
Benefit 1: Escaping Local Minima
In non-convex optimization, gradient descent can get stuck in local minima. The gradient is zero, so no progress is made. But SGD's noisy gradients are rarely exactly zero—the noise provides 'kicks' that can push parameters out of shallow local minima.
This acts like simulated annealing: early in training (large learning rate = high 'temperature'), the noise helps explore; later (small learning rate = low 'temperature'), the algorithm settles into a good basin.
In high-dimensional spaces, saddle points are far more common than local minima. SGD's noise helps navigate away from saddle points, where the gradient may be very small but the point is not a minimum. This is a major advantage over deterministic methods.
Benefit 2: Implicit Regularization
SGD noise acts as a form of regularization that prefers 'flat' minima over 'sharp' ones. Why?
In a sharp minimum, the loss increases rapidly away from the optimum. The noisy SGD updates frequently push parameters away from the minimum. If the minimum is sharp, these excursions increase the loss significantly, and the parameters are pushed back.
In a flat minimum, small excursions don't change the loss much. The parameters can wander within the flat region without penalty.
The net effect: SGD spends more time in flat minima. This is important because flat minima generalize better—small changes to parameters (as might occur with new data) don't hurt performance much.
Benefit 3: Regularized Loss Landscape
Recent theoretical work shows that SGD effectively optimizes a smoothed version of the loss:
$$\tilde{J}(\boldsymbol{\theta}) = \mathbb{E}_{\boldsymbol{\epsilon}}[J(\boldsymbol{\theta} + \eta \boldsymbol{\epsilon})]$$
This smoothed loss has fewer sharp features, making optimization easier. The amount of smoothing is proportional to the learning rate and gradient variance.
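A toy 1-D illustration of this smoothing effect: a sharp narrow dip dominates the raw loss but is averaged away under Monte-Carlo smoothing, so the smoothed loss prefers the broad basin. The loss function here is invented purely for illustration.

```python
import numpy as np

def J(theta):
    """Toy 1-D loss: broad quadratic basin at 0 plus a sharp narrow dip at 1."""
    return theta ** 2 - 1.5 * np.exp(-(theta - 1.0) ** 2 / 0.001)

def J_smoothed(theta, eta_eps=0.2, samples=20_000, seed=0):
    """Monte-Carlo estimate of E_eps[J(theta + eta_eps * eps)], eps ~ N(0, 1)."""
    eps = np.random.default_rng(seed).normal(size=samples)
    return J(theta + eta_eps * eps).mean()

# The sharp dip wins on the raw loss...
assert J(1.0) < J(0.0)
# ...but disappears on the smoothed loss, which prefers the broad basin
assert J_smoothed(0.0) < J_smoothed(1.0)
```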
Benefit 4: Better Generalization
Empirical evidence consistently shows that models trained with SGD or small batches often generalize better than those trained with large batches or full-batch gradient descent, even when both reach similar training loss.
This suggests that the implicit regularization from SGD noise is genuinely beneficial for learning, not just a nuisance to be minimized.
Implementing SGD correctly requires attention to several practical details that significantly impact performance. Let's examine a production-quality implementation with careful handling of edge cases.
Algorithm: Stochastic Gradient Descent
Input: Initial θ₀, learning rate schedule η(t), dataset D, max epochs E
Output: Optimized parameters θ*
1. t ← 0
2. for epoch = 1 to E:
a. Shuffle dataset D randomly
b. for each example (x, y) in D:
i. Compute gradient: g ← ∇L(f_θ(x), y) // Single example
ii. Get learning rate: η_t ← η(t)
iii. Update: θ ← θ - η_t · g
iv. t ← t + 1
3. return θ
Critical detail: Shuffling the dataset each epoch ensures each example is used exactly once per epoch and in random order, which reduces the impact of any ordering in the original dataset.
```python
import numpy as np
from typing import Callable, Tuple, List, Optional


def sgd(
    X: np.ndarray,                # Training features (N x d)
    y: np.ndarray,                # Training targets (N,)
    theta_init: np.ndarray,       # Initial parameters (d,)
    grad_fn: Callable,            # Gradient for single example
    learning_rate: float = 0.01,
    lr_schedule: Optional[Callable] = None,  # Optional: eta(t) function
    num_epochs: int = 10,
    shuffle: bool = True,
    verbose: bool = True,
    log_every: int = 1000,
) -> Tuple[np.ndarray, List[float]]:
    """
    Stochastic Gradient Descent optimizer.

    Args:
        X: Feature matrix (N examples, d features)
        y: Target vector (N examples)
        theta_init: Initial parameter vector
        grad_fn: Function (x_i, y_i, theta) -> gradient for one example
        learning_rate: Base learning rate
        lr_schedule: Optional function (t, lr0) -> lr_t for scheduling
        num_epochs: Number of passes through the dataset
        shuffle: Whether to shuffle data each epoch
        verbose: Print progress
        log_every: How often to log (in iterations)

    Returns:
        theta: Optimized parameters
        loss_history: Loss values logged during training
    """
    theta = theta_init.copy()
    N = X.shape[0]
    loss_history = []
    t = 0  # Global iteration counter

    # Default: no schedule (constant learning rate)
    if lr_schedule is None:
        lr_schedule = lambda t, lr0: lr0

    for epoch in range(num_epochs):
        # Shuffle data at the start of each epoch
        if shuffle:
            indices = np.random.permutation(N)
        else:
            indices = np.arange(N)

        for idx in indices:
            # Extract single example
            x_i = X[idx]
            y_i = y[idx]

            # Compute gradient for this ONE example
            gradient = grad_fn(x_i, y_i, theta)

            # Get current learning rate from schedule
            current_lr = lr_schedule(t, learning_rate)

            # SGD update
            theta = theta - current_lr * gradient

            # Track full loss occasionally for monitoring
            # (MSE over all data; assumes a linear model, matching the example below)
            if t % log_every == 0:
                predictions = X @ theta
                current_loss = np.mean(0.5 * (predictions - y) ** 2)
                loss_history.append(current_loss)
                if verbose:
                    print(f"Epoch {epoch+1}, Iter {t}: Loss = {current_loss:.6f}, "
                          f"LR = {current_lr:.6f}")

            t += 1

        if verbose:
            # End-of-epoch summary
            final_loss = np.mean(0.5 * ((X @ theta) - y) ** 2)
            print(f"Epoch {epoch+1}/{num_epochs} complete. Loss: {final_loss:.6f}")

    return theta, loss_history


def single_example_gradient(x_i: np.ndarray, y_i: float,
                            theta: np.ndarray) -> np.ndarray:
    """
    Gradient of MSE loss for a single example.
    L(theta) = 0.5 * (x_i @ theta - y_i)^2
    grad L = (x_i @ theta - y_i) * x_i
    """
    residual = np.dot(x_i, theta) - y_i
    return residual * x_i


# Common learning rate schedules
def constant_schedule(t: int, lr0: float) -> float:
    """Constant learning rate."""
    return lr0


def step_decay(t: int, lr0: float, drop_rate: float = 0.5,
               drop_every: int = 10000) -> float:
    """Step decay: multiply by drop_rate every drop_every iterations."""
    return lr0 * (drop_rate ** (t // drop_every))


def inverse_schedule(t: int, lr0: float) -> float:
    """1/t schedule (Robbins-Monro)."""
    return lr0 / (1 + t)


def inverse_sqrt_schedule(t: int, lr0: float) -> float:
    """1/sqrt(t) schedule."""
    return lr0 / np.sqrt(1 + t)


# Example usage
if __name__ == "__main__":
    np.random.seed(42)

    # Generate data
    N, d = 10000, 5
    X = np.random.randn(N, d)
    theta_true = np.array([1.0, -2.0, 0.5, 3.0, -1.5])
    y = X @ theta_true + 0.1 * np.random.randn(N)

    # Random initialization
    theta_init = np.random.randn(d)

    # Run SGD
    theta_opt, history = sgd(
        X, y, theta_init,
        grad_fn=single_example_gradient,
        learning_rate=0.01,
        lr_schedule=lambda t, lr0: lr0 / np.sqrt(1 + t / 1000),
        num_epochs=5,
        log_every=2000,
    )

    print(f"True theta: {theta_true}")
    print(f"Learned theta: {theta_opt}")
    print(f"Parameter error: {np.linalg.norm(theta_opt - theta_true):.6f}")
```

Never iterate through data in a fixed order! If examples are sorted by class or have any structure, SGD will make biased updates. Shuffling each epoch randomizes the order, ensuring the stochastic gradient is truly an unbiased estimator.
Some frameworks shuffle into a buffer for efficiency.
Unlike batch gradient descent where a constant learning rate can work well, SGD typically requires a learning rate schedule—a systematic reduction of the learning rate over time—to converge properly.
Why Schedules Are Necessary
Recall the convergence dilemma: with fixed $\eta$, parameters oscillate around the minimum with magnitude proportional to $\eta \sigma$ (learning rate × gradient noise). To converge exactly, we must reduce $\eta$ over time so these oscillations vanish.
However, decreasing $\eta$ too quickly means we never reach the minimum. This tension is captured by the Robbins-Monro conditions:
$$\sum_{t=1}^{\infty} \eta_t = \infty \quad \text{(reach anywhere)}$$
$$\sum_{t=1}^{\infty} \eta_t^2 < \infty \quad \text{(noise contribution vanishes)}$$
| Schedule | Formula | Satisfies R-M | Use Case |
|---|---|---|---|
| Constant | $\eta_t = \eta_0$ | No | Mini-batch SGD (noise bounded) |
| Inverse | $\eta_t = \eta_0 / t$ | Yes | Theoretical analysis |
| Inverse sqrt | $\eta_t = \eta_0 / \sqrt{t}$ | No ($\sum \eta_t^2$ diverges) | Non-convex optimization |
| Step decay | $\eta_t = \eta_0 \cdot \gamma^{\lfloor t/k \rfloor}$ | No | Practical deep learning |
| Exponential | $\eta_t = \eta_0 \cdot e^{-\lambda t}$ | No | Fast initial decay |
In Practice: Modern Deep Learning Schedules
For deep learning, the theoretical Robbins-Monro schedules are often too aggressive. Modern practice uses:
Step Decay: Divide learning rate by a factor (e.g., 10) at fixed epochs
if epoch in [30, 60, 90]:
lr = lr * 0.1
Cosine Annealing: Smooth decrease following a cosine curve $$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{t}{T}\pi\right)\right)$$
Warmup + Decay: Start with a small learning rate, increase linearly, then decay $$\eta_t = \begin{cases} \frac{t}{t_w} \, \eta_{\max} & t < t_w \\ \text{decay}(\eta_{\max}, t - t_w) & t \geq t_w \end{cases}$$
The warmup phase helps stabilize training when initial gradients are large or the loss landscape is poorly conditioned near initialization.
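A minimal sketch of such a schedule, combining linear warmup with cosine annealing (the constants `t_warmup`, `t_total`, `lr_max`, `lr_min` are placeholders to tune per task):

```python
import math

def warmup_cosine(t, t_warmup=1000, t_total=100_000, lr_max=0.1, lr_min=1e-4):
    """Linear warmup to lr_max over t_warmup steps, then cosine-anneal to lr_min."""
    if t < t_warmup:
        return lr_max * t / t_warmup
    progress = (t - t_warmup) / (t_total - t_warmup)  # in [0, 1]
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# Ramps up from 0, peaks at the end of warmup, decays to lr_min at t_total
assert warmup_cosine(0) == 0.0
assert abs(warmup_cosine(1000) - 0.1) < 1e-12
assert abs(warmup_cosine(100_000) - 1e-4) < 1e-12
```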
For most deep learning tasks, start with cosine annealing or step decay. Use warmup if training is unstable initially. The exact schedule matters less than getting the initial learning rate right—always tune the base learning rate first, then worry about the schedule.
Understanding SGD from a variance reduction perspective provides insight into why mini-batches help and motivates more advanced algorithms.
The Variance-Convergence Trade-off
SGD's convergence rate depends on the gradient variance $\sigma^2$. Lower variance means faster convergence, tolerance for larger learning rates, and a smoother optimization trajectory.
This motivates strategies to reduce variance while maintaining computational efficiency.
Strategy 1: Averaging (The Polyak-Ruppert Average)
Instead of using the final iterate $\boldsymbol{\theta}^{(T)}$, average over the trajectory:
$$\bar{\boldsymbol{\theta}} = \frac{1}{T} \sum_{t=1}^{T} \boldsymbol{\theta}^{(t)}$$
This averaged iterate has lower variance than the final iterate and can achieve better convergence rates. The intuition: while individual iterates oscillate, their average stabilizes near the minimum.
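A small simulation on a 1-D noisy quadratic (illustrative constants) shows the effect: averaging the tail of the trajectory yields a much smaller steady-state error than the last iterate alone.

```python
import numpy as np

rng = np.random.default_rng(0)
eta, sigma, T, runs = 0.1, 1.0, 1000, 100
final_err, avg_err = [], []

for _ in range(runs):
    theta = 5.0
    trajectory = []
    for _ in range(T):
        g = theta + sigma * rng.normal()  # noisy gradient of J(theta) = 0.5 * theta^2
        theta -= eta * g
        trajectory.append(theta)
    final_err.append(theta ** 2)                       # error of the last iterate
    avg_err.append(np.mean(trajectory[T // 2:]) ** 2)  # error of the tail average

# The averaged iterate sits much closer to the minimum at 0
assert np.mean(avg_err) < np.mean(final_err)
```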
Strategy 2: Mini-batching (Preview)
Instead of using one example, use $B$ examples:
$$\mathbf{g} = \frac{1}{B} \sum_{j=1}^{B} \nabla \mathcal{L}(\boldsymbol{\theta}, \mathbf{x}^{(i_j)}, y^{(i_j)})$$
The variance decreases by factor $B$:
$$\text{Var}[\mathbf{g}] = \frac{\sigma^2}{B}$$
This is the key insight behind mini-batch SGD, which we'll cover in detail in the next page. It interpolates between pure SGD ($B=1$) and batch GD ($B=N$).
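This $1/B$ scaling is easy to verify empirically. The sketch below measures the variance of mini-batch gradients (sampled with replacement, which is what the $\sigma^2/B$ formula assumes) for $B=1$ versus $B=16$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5000, 3
X = rng.normal(size=(N, d))
y = X @ np.array([1.0, -1.0, 2.0]) + rng.normal(size=N)
theta = np.zeros(d)

grads = (X @ theta - y)[:, None] * X   # per-example MSE gradients at theta
full = grads.mean(axis=0)

def minibatch_grad_var(B, trials=4000):
    """Empirical Var[g] for size-B mini-batches sampled with replacement."""
    sq_errs = []
    for _ in range(trials):
        idx = rng.integers(0, N, size=B)
        sq_errs.append(np.sum((grads[idx].mean(axis=0) - full) ** 2))
    return np.mean(sq_errs)

v1, v16 = minibatch_grad_var(1), minibatch_grad_var(16)
# Variance drops by roughly the batch size: v1 / v16 ~ 16
assert 10 < v1 / v16 < 24
```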
Strategy 3: Advanced Variance Reduction Methods
More sophisticated methods actively reduce variance:
SVRG (Stochastic Variance Reduced Gradient): Periodically compute exact gradient and use it to correct stochastic estimates
SAGA: Maintain a table of gradients for each example, enabling variance reduction without periodic exact computation
These methods achieve the $O(1/T)$ or even linear convergence rates of batch methods while using $O(1)$ computation per update. They're less common in deep learning due to memory requirements but important in convex optimization.
There's a fundamental trade-off: reducing variance requires more computation per update. Pure SGD (B=1) has maximum variance but minimum computation. Batch GD has zero variance but maximum computation. Mini-batch SGD with carefully chosen B finds an efficient operating point on this trade-off curve.
SGD has a deep connection to online learning—a paradigm where data arrives sequentially and the model must update after each example. This connection provides theoretical grounding and suggests practical applications.
The Online Learning Setting
In online learning, data arrives as a stream: at each round the learner receives an example, makes a prediction, incurs a loss, and updates its parameters before the next example arrives. There is no fixed dataset to loop over.
SGD naturally fits this paradigm: each update uses only the current example, and we don't need to store or revisit past data.
Regret Analysis
Online learning is analyzed via regret—the difference between the learner's cumulative loss and the best fixed predictor in hindsight:
$$R_T = \sum_{t=1}^{T} \mathcal{L}_t(\boldsymbol{\theta}^{(t)}) - \min_{\boldsymbol{\theta}} \sum_{t=1}^{T} \mathcal{L}_t(\boldsymbol{\theta})$$

where $\mathcal{L}_t$ is the loss on the example received at round $t$.
A good algorithm has sublinear regret: $R_T = o(T)$, meaning per-round regret $R_T/T \to 0$.
Theorem (Online Gradient Descent Regret)
For convex losses with gradients bounded by $G$ and learning rate $\eta = \frac{R}{G\sqrt{T}}$:
$$R_T \leq \frac{RG\sqrt{T}}{2} + \frac{RG\sqrt{T}}{2} = RG\sqrt{T}$$
This $O(\sqrt{T})$ regret bound is optimal for online convex optimization, and SGD achieves it!
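A minimal online-gradient-descent simulation (squared losses on a synthetic stream; all constants illustrative) shows the average regret $R_T/T$ shrinking as $T$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 10_000, 3
theta_star = np.array([1.0, -0.5, 2.0])  # generator of the stream

# Stream of rounds: L_t(theta) = 0.5 * (x_t @ theta - y_t)^2
X = rng.normal(size=(T, d))
y = X @ theta_star + 0.1 * rng.normal(size=T)

theta = np.zeros(d)
learner_loss = np.empty(T)
for t in range(T):
    pred = X[t] @ theta
    learner_loss[t] = 0.5 * (pred - y[t]) ** 2
    eta = 0.1 / np.sqrt(t + 1)             # O(1/sqrt(t)) step size
    theta -= eta * (pred - y[t]) * X[t]    # online gradient step

# Comparator: best fixed predictor in hindsight (least-squares fit to the stream)
best, *_ = np.linalg.lstsq(X, y, rcond=None)
best_loss = 0.5 * (X @ best - y) ** 2
regret = np.cumsum(learner_loss - best_loss)

# Average regret R_T / T shrinks as T grows: sublinear regret
assert regret[-1] / T < regret[T // 10] / (T // 10)
```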
True online learning (single pass through data, no storage) is appropriate when: data is too large to store, data is non-stationary (distribution changes), real-time adaptation is needed, or examples are genuinely streaming. Most deep learning uses 'fake' online learning—multiple passes through a stored dataset.
We've thoroughly examined Stochastic Gradient Descent—the algorithm that made large-scale machine learning practical. The key insights: the single-example gradient is an unbiased but high-variance estimate of the full gradient; convergence requires a decaying learning rate satisfying the Robbins-Monro conditions; and the gradient noise itself aids exploration, acts as implicit regularization, and improves generalization.
What's Next
The next page introduces Mini-batch SGD—the practical middle ground that combines the computational benefits of stochastic methods with variance reduction through batching. Mini-batch SGD is what's actually used in nearly all deep learning, and understanding it requires understanding both batch GD and pure SGD, which we've now covered.
You now understand why SGD works despite using noisy gradients, when and why learning rate schedules are needed, how noise benefits optimization and generalization, and the connection to online learning. This foundation is essential for understanding the optimizer choices in modern deep learning.