In the early 2010s, as researchers attempted to train neural networks on increasingly large datasets, they encountered a seemingly insurmountable barrier: batch gradient descent simply couldn't scale. Computing exact gradients over millions or billions of training examples for each parameter update was computationally prohibitive.
The solution, ironically, came from an idea proposed over six decades earlier by Herbert Robbins and Sutton Monro in 1951. Stochastic Gradient Descent (SGD) makes a radical trade-off: instead of computing the exact gradient over all data, it estimates the gradient using just a single randomly selected example.
This seemingly reckless approximation—using one example instead of millions—not only works but often works better than batch methods. Understanding why requires diving deep into the mathematics of stochastic optimization, variance reduction, and the surprising benefits of gradient noise.
By the end of this page, you will understand the mathematical foundations of SGD, analyze why noisy gradients still lead to convergence, quantify the variance-convergence trade-off, implement SGD correctly, and appreciate why stochasticity is not just tolerable but beneficial for generalization.
Recall that batch gradient descent computes the gradient over the entire dataset:
$$\nabla J(\boldsymbol{\theta}) = \frac{1}{N} \sum_{i=1}^{N} \nabla \mathcal{L}(f_{\boldsymbol{\theta}}(\mathbf{x}^{(i)}), y^{(i)})$$
This is the expectation of the per-example gradient over the empirical data distribution. The key insight of SGD is that we can estimate this expectation using a single sample.
The SGD Update Rule
At each iteration, SGD samples an example index $i$ uniformly at random and updates:
$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \eta \nabla \mathcal{L}(f_{\boldsymbol{\theta}^{(t)}}(\mathbf{x}^{(i)}), y^{(i)})$$
Notice the critical difference: we use $\nabla \mathcal{L}$ (gradient of the loss for ONE example) instead of $\nabla J$ (average gradient over ALL examples).
The single-example gradient is an unbiased estimator of the true gradient: $\mathbb{E}_i[\nabla \mathcal{L}(f_{\boldsymbol{\theta}}(\mathbf{x}^{(i)}), y^{(i)})] = \nabla J(\boldsymbol{\theta})$, where the expectation is over the random choice of example index $i$. This unbiasedness is what ensures convergence despite using noisy estimates.
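A quick numerical check (using linear-regression MSE as an illustrative loss) confirms the claim: the average of the $N$ per-example gradients equals the full-batch gradient, so a uniformly sampled per-example gradient is an unbiased estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 3
X = rng.normal(size=(N, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)
theta = rng.normal(size=d)

def example_grad(x_i, y_i, theta):
    """MSE gradient for one example: (x_i @ theta - y_i) * x_i."""
    return (x_i @ theta - y_i) * x_i

# Full-batch gradient (the average over all N examples)
full_grad = X.T @ (X @ theta - y) / N

# The mean of the N per-example gradients matches the full gradient,
# which is exactly the unbiasedness property used by SGD.
per_example = np.array([example_grad(X[i], y[i], theta) for i in range(N)])
assert np.allclose(per_example.mean(axis=0), full_grad)
```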
The Fundamental Trade-off
SGD makes an extreme trade-off:
| Aspect | Batch GD | SGD |
|---|---|---|
| Gradient accuracy | Exact | Very noisy |
| Computation per update | O(N·d) | O(d) |
| Updates per epoch | 1 | N |
| Progress direction | Always downhill | Correct only in expectation |
| Convergence path | Smooth | Highly erratic |
The key question is: can we actually converge to a good solution with such noisy gradients? The answer is yes, under appropriate conditions—but the convergence behavior is fundamentally different from batch methods.
To understand SGD rigorously, we must analyze the stochastic gradient as a random variable and quantify its relationship to the true gradient.
Decomposition of the Stochastic Gradient
Let $\mathbf{g}^{(t)} = \nabla \mathcal{L}(f_{\boldsymbol{\theta}^{(t)}}(\mathbf{x}^{(i_t)}), y^{(i_t)})$ be the stochastic gradient at iteration $t$, where $i_t$ is the randomly selected index. We can decompose:
$$\mathbf{g}^{(t)} = \nabla J(\boldsymbol{\theta}^{(t)}) + \boldsymbol{\epsilon}^{(t)}$$
where $\boldsymbol{\epsilon}^{(t)}$ is the gradient noise with the crucial property:
$$\mathbb{E}[\boldsymbol{\epsilon}^{(t)}] = \mathbf{0}$$
This zero-mean property is what makes SGD work. The stochastic gradient is an unbiased estimator of the true gradient.
Variance of the Stochastic Gradient
While the expected value of $\mathbf{g}^{(t)}$ equals the true gradient, the variance can be substantial:
$$\text{Var}[\mathbf{g}^{(t)}] = \mathbb{E}[\|\boldsymbol{\epsilon}^{(t)}\|^2] = \sigma^2(\boldsymbol{\theta}^{(t)})$$
This variance $\sigma^2$ quantifies how far individual gradients stray from their mean. High variance means individual updates frequently point far from the true descent direction, producing an erratic optimization trajectory.
The variance is NOT uniform across parameter space. Near a minimum (where gradients are small), individual gradients might point in very different directions, leading to high relative noise.
If all training examples give similar gradients, variance is low and SGD behaves like batch GD. If examples give wildly different gradients (heterogeneous dataset), variance is high and SGD is noisier. Data normalization and proper scaling reduce variance.
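As a sketch of this effect, the snippet below estimates $\sigma^2(\boldsymbol{\theta})$ empirically for a linear-regression MSE loss with deliberately mis-scaled features (the constants are illustrative), and shows that standardizing the features shrinks the gradient variance:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 2000, 4
scales = np.array([1.0, 5.0, 0.2, 1.0])        # deliberately mis-scaled features
X = rng.normal(size=(N, d)) * scales
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)
theta = np.zeros(d)

def gradient_noise_var(X, y, theta):
    """Empirical sigma^2(theta): mean squared norm of g_i - grad J."""
    grads = (X @ theta - y)[:, None] * X       # one MSE gradient per example
    noise = grads - grads.mean(axis=0)
    return np.mean(np.sum(noise ** 2, axis=1))

sigma_sq_raw = gradient_noise_var(X, y, theta)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize features
sigma_sq_std = gradient_noise_var(X_std, y, theta)

# Standardizing the features substantially reduces the gradient variance
assert sigma_sq_std < sigma_sq_raw
```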
Bounded Variance Assumption
Convergence proofs typically assume bounded variance:
$$\mathbb{E}[\|\nabla \mathcal{L}(\boldsymbol{\theta}, \mathbf{x}^{(i)}, y^{(i)}) - \nabla J(\boldsymbol{\theta})\|^2] \leq \sigma^2$$
for all $\boldsymbol{\theta}$ and uniformly over all examples. This assumption says the noise is bounded—no single example can produce an arbitrarily large gradient that would destabilize optimization.
Why Unbiasedness Matters
The zero-mean noise property is crucial. Consider the expected update:
$$\mathbb{E}[\boldsymbol{\theta}^{(t+1)}] = \mathbb{E}[\boldsymbol{\theta}^{(t)} - \eta \mathbf{g}^{(t)}] = \boldsymbol{\theta}^{(t)} - \eta \nabla J(\boldsymbol{\theta}^{(t)})$$
(treating $\boldsymbol{\theta}^{(t)}$ as fixed given the history)
This shows that on average, SGD moves in the correct direction. Individual steps may be wrong, but the expected direction is correct.
The convergence of SGD is fundamentally different from batch gradient descent. The noise in gradient estimates creates new challenges that require modified analysis and, critically, decaying learning rates.
The Convergence Dilemma
With a fixed learning rate $\eta$, SGD cannot converge to the exact minimum. The reason: even at the minimum itself, the stochastic gradient is generally nonzero, so every update carries noise of magnitude roughly $\eta\sigma$. The parameters never settle; they keep bouncing around $\boldsymbol{\theta}^*$ in a region whose size is proportional to $\eta\sigma$.
The solution is to decrease the learning rate over time, eventually making steps small enough to overcome the noise.
For SGD to converge to the true minimum, the learning rate schedule $\eta_t$ must satisfy: (1) $\sum_t \eta_t = \infty$ (the total step length is unbounded, so we can reach anywhere), and (2) $\sum_t \eta_t^2 < \infty$ (the accumulated noise contribution vanishes). The classic schedule $\eta_t = \eta_0/(1 + t)$ satisfies both conditions.
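Finite partial sums can only suggest the limiting behavior, but a quick numerical comparison of common schedules illustrates the two conditions:

```python
import numpy as np

T = 200_000
t = np.arange(1, T + 1)

schedules = {
    "1/t":       1.0 / t,           # both conditions hold: sum diverges (slowly), sum of squares converges
    "1/sqrt(t)": 1.0 / np.sqrt(t),  # sum diverges, but sum of squares also diverges
    "constant":  np.full(T, 0.01),  # sum diverges, sum of squares diverges
}

for name, eta in schedules.items():
    print(f"{name:>9}: sum eta = {eta.sum():12.2f}, sum eta^2 = {(eta ** 2).sum():10.4f}")

# For eta_t = 1/t the squared sum stays below its limit pi^2/6 ~ 1.645
assert (schedules["1/t"] ** 2).sum() < np.pi ** 2 / 6
```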
Convergence Theorems
Theorem 1 (Convergence for Convex Functions)
Let $J$ be convex with $L$-Lipschitz gradient, and assume bounded variance $\sigma^2$. With learning rate $\eta_t = 1/(\sqrt{t})$, SGD satisfies:
$$\mathbb{E}[J(\boldsymbol{\theta}^{(T)}) - J(\boldsymbol{\theta}^*)] \leq O\left(\frac{\sigma + LR}{\sqrt{T}}\right)$$
where $R = \|\boldsymbol{\theta}^{(0)} - \boldsymbol{\theta}^*\|$.
This is a sublinear convergence rate of $O(1/\sqrt{T})$—slower than batch GD's $O(1/T)$ for convex functions.
Theorem 2 (Convergence for Strongly Convex Functions)
Let $J$ be $\mu$-strongly convex with $L$-Lipschitz gradient. With learning rate $\eta_t = 1/(\mu t)$, SGD achieves:
$$\mathbb{E}[J(\boldsymbol{\theta}^{(T)}) - J(\boldsymbol{\theta}^*)] \leq O\left(\frac{L\sigma^2}{\mu^2 T}\right)$$
This is $O(1/T)$—matching the batch GD rate for strongly convex functions! Strong convexity compensates for stochastic noise.
| Function Class | Batch GD | SGD | Notes |
|---|---|---|---|
| Convex | $O(1/T)$ | $O(1/\sqrt{T})$ | SGD slower by factor $\sqrt{T}$ |
| Strongly convex | $O(e^{-T/\kappa})$ | $O(1/T)$ | SGD loses exponential rate |
| Non-convex | Stationary point | Stationary point | Similar guarantees |
The Computation-Iteration Trade-off
While SGD converges slower in iterations, it wins overwhelmingly in wall-clock time for large datasets:
For large $N$, SGD's $O(1/\epsilon^2)$ total cost is far lower than batch GD's $O(N/\epsilon)$ whenever $N \gg 1/\epsilon$.
Example: For $N = 10^6$ and $\epsilon = 10^{-3}$, batch GD needs roughly $N/\epsilon = 10^9$ per-example gradient evaluations, while SGD needs roughly $1/\epsilon^2 = 10^6$.
SGD is 1000× faster in this scenario!
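The arithmetic behind that figure can be checked mechanically (constants suppressed, costs counted in per-example gradient evaluations):

```python
# Total cost to reach accuracy eps, in per-example gradient evaluations:
#   batch GD: O(1/eps) updates x N gradients each  -> O(N / eps)
#   SGD:      O(1/eps^2) updates x 1 gradient each -> O(1 / eps^2)
N = 10**6
eps = 10**-3

batch_cost = N / eps     # ~10^9 gradient evaluations
sgd_cost = 1 / eps**2    # ~10^6 gradient evaluations

print(f"batch GD: {batch_cost:.0e}, SGD: {sgd_cost:.0e}, "
      f"speedup: {batch_cost / sgd_cost:.0f}x")
```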
Perhaps the most counterintuitive aspect of SGD is that the noise—which seems like a pure negative—actually provides significant benefits. Understanding these benefits explains why SGD often outperforms batch methods even when computation is not a constraint.
Benefit 1: Escaping Local Minima
In non-convex optimization, gradient descent can get stuck in local minima. The gradient is zero, so no progress is made. But SGD's noisy gradients are rarely exactly zero—the noise provides 'kicks' that can push parameters out of shallow local minima.
This acts like simulated annealing: early in training (large learning rate = high 'temperature'), the noise helps explore; later (small learning rate = low 'temperature'), the algorithm settles into a good basin.
In high-dimensional spaces, saddle points are far more common than local minima. SGD's noise helps navigate away from saddle points, where the gradient may be very small but the point is not a minimum. This is a major advantage over deterministic methods.
Benefit 2: Implicit Regularization
SGD noise acts as a form of regularization that prefers 'flat' minima over 'sharp' ones. Why?
In a sharp minimum, the loss increases rapidly away from the optimum. The noisy SGD updates frequently push parameters away from the minimum. If the minimum is sharp, these excursions increase the loss significantly, and the parameters are pushed back.
In a flat minimum, small excursions don't change the loss much. The parameters can wander within the flat region without penalty.
The net effect: SGD spends more time in flat minima. This is important because flat minima generalize better—small changes to parameters (as might occur with new data) don't hurt performance much.
Benefit 3: Regularized Loss Landscape
Recent theoretical work shows that SGD effectively optimizes a smoothed version of the loss:
$$\tilde{J}(\boldsymbol{\theta}) = \mathbb{E}_{\boldsymbol{\epsilon}}[J(\boldsymbol{\theta} + \eta \boldsymbol{\epsilon})]$$
This smoothed loss has fewer sharp features, making optimization easier. The amount of smoothing is proportional to the learning rate and gradient variance.
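A toy 1-D illustration of this smoothing effect: a sharp narrow dip dominates the raw loss but is averaged away under Monte-Carlo smoothing, so the smoothed loss prefers the broad basin. The loss function here is invented purely for illustration.

```python
import numpy as np

def J(theta):
    """Toy 1-D loss: broad quadratic basin at 0 plus a sharp narrow dip at 1."""
    return theta ** 2 - 1.5 * np.exp(-(theta - 1.0) ** 2 / 0.001)

def J_smoothed(theta, eta_eps=0.2, samples=20_000, seed=0):
    """Monte-Carlo estimate of E_eps[J(theta + eta_eps * eps)], eps ~ N(0, 1)."""
    eps = np.random.default_rng(seed).normal(size=samples)
    return J(theta + eta_eps * eps).mean()

# The sharp dip wins on the raw loss...
assert J(1.0) < J(0.0)
# ...but disappears on the smoothed loss, which prefers the broad basin
assert J_smoothed(0.0) < J_smoothed(1.0)
```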
Benefit 4: Better Generalization
Empirical evidence consistently shows that models trained with SGD or small batches often generalize better than those trained with large batches or full-batch gradient descent, even when both reach similar training loss.
This suggests that the implicit regularization from SGD noise is genuinely beneficial for learning, not just a nuisance to be minimized.
Implementing SGD correctly requires attention to several practical details that significantly impact performance. Let's examine a production-quality implementation with careful handling of edge cases.
Algorithm: Stochastic Gradient Descent
Input: Initial θ₀, learning rate schedule η(t), dataset D, max epochs E
Output: Optimized parameters θ*
1. t ← 0
2. for epoch = 1 to E:
a. Shuffle dataset D randomly
b. for each example (x, y) in D:
i. Compute gradient: g ← ∇L(f_θ(x), y) // Single example
ii. Get learning rate: η_t ← η(t)
iii. Update: θ ← θ - η_t · g
iv. t ← t + 1
3. return θ
Critical detail: Shuffling the dataset each epoch ensures each example is used exactly once per epoch and in random order, which reduces the impact of any ordering in the original dataset.
```python
import numpy as np
from typing import Callable, Tuple, List, Optional


def sgd(
    X: np.ndarray,                # Training features (N x d)
    y: np.ndarray,                # Training targets (N,)
    theta_init: np.ndarray,       # Initial parameters (d,)
    grad_fn: Callable,            # Gradient for single example
    learning_rate: float = 0.01,
    lr_schedule: Optional[Callable] = None,  # Optional: eta(t) function
    num_epochs: int = 10,
    shuffle: bool = True,
    verbose: bool = True,
    log_every: int = 1000,
) -> Tuple[np.ndarray, List[float]]:
    """
    Stochastic Gradient Descent optimizer.

    Args:
        X: Feature matrix (N examples, d features)
        y: Target vector (N examples)
        theta_init: Initial parameter vector
        grad_fn: Function (x_i, y_i, theta) -> gradient for one example
        learning_rate: Base learning rate
        lr_schedule: Optional function (t, lr0) -> lr_t for scheduling
        num_epochs: Number of passes through the dataset
        shuffle: Whether to shuffle data each epoch
        verbose: Print progress
        log_every: How often to log (in iterations)

    Returns:
        theta: Optimized parameters
        loss_history: Loss values logged during training
    """
    theta = theta_init.copy()
    N = X.shape[0]
    loss_history = []
    t = 0  # Global iteration counter

    # Default: no schedule (constant learning rate)
    if lr_schedule is None:
        lr_schedule = lambda t, lr0: lr0

    for epoch in range(num_epochs):
        # Shuffle data at the start of each epoch
        if shuffle:
            indices = np.random.permutation(N)
        else:
            indices = np.arange(N)

        for idx in indices:
            # Extract single example
            x_i = X[idx]
            y_i = y[idx]

            # Compute gradient for this ONE example
            gradient = grad_fn(x_i, y_i, theta)

            # Get current learning rate from schedule
            current_lr = lr_schedule(t, learning_rate)

            # SGD update
            theta = theta - current_lr * gradient

            # Track full loss occasionally for monitoring
            # (MSE over all data; assumes a linear model, matching the example below)
            if t % log_every == 0:
                predictions = X @ theta
                current_loss = np.mean(0.5 * (predictions - y) ** 2)
                loss_history.append(current_loss)
                if verbose:
                    print(f"Epoch {epoch+1}, Iter {t}: Loss = {current_loss:.6f}, "
                          f"LR = {current_lr:.6f}")

            t += 1

        if verbose:
            # End-of-epoch summary
            final_loss = np.mean(0.5 * ((X @ theta) - y) ** 2)
            print(f"Epoch {epoch+1}/{num_epochs} complete. Loss: {final_loss:.6f}")

    return theta, loss_history


def single_example_gradient(x_i: np.ndarray, y_i: float,
                            theta: np.ndarray) -> np.ndarray:
    """
    Gradient of MSE loss for a single example.
    L(theta) = 0.5 * (x_i @ theta - y_i)^2
    grad L = (x_i @ theta - y_i) * x_i
    """
    residual = np.dot(x_i, theta) - y_i
    return residual * x_i


# Common learning rate schedules
def constant_schedule(t: int, lr0: float) -> float:
    """Constant learning rate."""
    return lr0


def step_decay(t: int, lr0: float, drop_rate: float = 0.5,
               drop_every: int = 10000) -> float:
    """Step decay: multiply by drop_rate every drop_every iterations."""
    return lr0 * (drop_rate ** (t // drop_every))


def inverse_schedule(t: int, lr0: float) -> float:
    """1/t schedule (Robbins-Monro)."""
    return lr0 / (1 + t)


def inverse_sqrt_schedule(t: int, lr0: float) -> float:
    """1/sqrt(t) schedule."""
    return lr0 / np.sqrt(1 + t)


# Example usage
if __name__ == "__main__":
    np.random.seed(42)

    # Generate data
    N, d = 10000, 5
    X = np.random.randn(N, d)
    theta_true = np.array([1.0, -2.0, 0.5, 3.0, -1.5])
    y = X @ theta_true + 0.1 * np.random.randn(N)

    # Random initialization
    theta_init = np.random.randn(d)

    # Run SGD
    theta_opt, history = sgd(
        X, y, theta_init,
        grad_fn=single_example_gradient,
        learning_rate=0.01,
        lr_schedule=lambda t, lr0: lr0 / np.sqrt(1 + t / 1000),
        num_epochs=5,
        log_every=2000,
    )

    print(f"True theta: {theta_true}")
    print(f"Learned theta: {theta_opt}")
    print(f"Parameter error: {np.linalg.norm(theta_opt - theta_true):.6f}")
```

Never iterate through data in a fixed order! If examples are sorted by class or have any structure, SGD will make biased updates. Shuffling each epoch randomizes the order, ensuring the stochastic gradient is truly an unbiased estimator.
Some frameworks shuffle into a buffer for efficiency.
Unlike batch gradient descent where a constant learning rate can work well, SGD typically requires a learning rate schedule—a systematic reduction of the learning rate over time—to converge properly.
Why Schedules Are Necessary
Recall the convergence dilemma: with fixed $\eta$, parameters oscillate around the minimum with magnitude proportional to $\eta \sigma$ (learning rate × gradient noise). To converge exactly, we must reduce $\eta$ over time so these oscillations vanish.
However, decreasing $\eta$ too quickly means we never reach the minimum. This tension is captured by the Robbins-Monro conditions:
$$\sum_{t=1}^{\infty} \eta_t = \infty \quad \text{(reach anywhere)}$$
$$\sum_{t=1}^{\infty} \eta_t^2 < \infty \quad \text{(noise contribution vanishes)}$$
| Schedule | Formula | Satisfies R-M | Use Case |
|---|---|---|---|
| Constant | $\eta_t = \eta_0$ | No | Mini-batch SGD (noise bounded) |
| Inverse | $\eta_t = \eta_0 / t$ | Yes | Theoretical analysis |
| Inverse sqrt | $\eta_t = \eta_0 / \sqrt{t}$ | No ($\sum \eta_t^2$ diverges) | Non-convex optimization |
| Step decay | $\eta_t = \eta_0 \cdot \gamma^{\lfloor t/k \rfloor}$ | No | Practical deep learning |
| Exponential | $\eta_t = \eta_0 \cdot e^{-\lambda t}$ | No | Fast initial decay |
In Practice: Modern Deep Learning Schedules
For deep learning, the theoretical Robbins-Monro schedules are often too aggressive. Modern practice uses:
Step Decay: Divide learning rate by a factor (e.g., 10) at fixed epochs
if epoch in [30, 60, 90]:
lr = lr * 0.1
Cosine Annealing: Smooth decrease following a cosine curve $$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{t}{T}\pi\right)\right)$$
Warmup + Decay: Start with a small learning rate, increase linearly, then decay $$\eta_t = \begin{cases} \frac{t}{t_w} \, \eta_{\max} & t < t_w \\ \text{decay}(\eta_{\max}, t - t_w) & t \geq t_w \end{cases}$$
The warmup phase helps stabilize training when initial gradients are large or the loss landscape is poorly conditioned near initialization.
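A minimal sketch of such a schedule, combining linear warmup with cosine annealing (the constants `t_warmup`, `t_total`, `lr_max`, `lr_min` are placeholders to tune per task):

```python
import math

def warmup_cosine(t, t_warmup=1000, t_total=100_000, lr_max=0.1, lr_min=1e-4):
    """Linear warmup to lr_max over t_warmup steps, then cosine-anneal to lr_min."""
    if t < t_warmup:
        return lr_max * t / t_warmup
    progress = (t - t_warmup) / (t_total - t_warmup)  # in [0, 1]
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# Ramps up from 0, peaks at the end of warmup, decays to lr_min at t_total
assert warmup_cosine(0) == 0.0
assert abs(warmup_cosine(1000) - 0.1) < 1e-12
assert abs(warmup_cosine(100_000) - 1e-4) < 1e-12
```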
For most deep learning tasks, start with cosine annealing or step decay. Use warmup if training is unstable initially. The exact schedule matters less than getting the initial learning rate right—always tune the base learning rate first, then worry about the schedule.
Understanding SGD from a variance reduction perspective provides insight into why mini-batches help and motivates more advanced algorithms.
The Variance-Convergence Trade-off
SGD's convergence rate depends on the gradient variance $\sigma^2$. Lower variance means faster convergence, tolerance for larger learning rates, and a smoother optimization trajectory.
This motivates strategies to reduce variance while maintaining computational efficiency.
Strategy 1: Averaging (The Polyak-Ruppert Average)
Instead of using the final iterate $\boldsymbol{\theta}^{(T)}$, average over the trajectory:
$$\bar{\boldsymbol{\theta}} = \frac{1}{T} \sum_{t=1}^{T} \boldsymbol{\theta}^{(t)}$$
This averaged iterate has lower variance than the final iterate and can achieve better convergence rates. The intuition: while individual iterates oscillate, their average stabilizes near the minimum.
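A small simulation on a 1-D noisy quadratic (illustrative constants) shows the effect: averaging the tail of the trajectory yields a much smaller steady-state error than the last iterate alone.

```python
import numpy as np

rng = np.random.default_rng(0)
eta, sigma, T, runs = 0.1, 1.0, 1000, 100
final_err, avg_err = [], []

for _ in range(runs):
    theta = 5.0
    trajectory = []
    for _ in range(T):
        g = theta + sigma * rng.normal()  # noisy gradient of J(theta) = 0.5 * theta^2
        theta -= eta * g
        trajectory.append(theta)
    final_err.append(theta ** 2)                       # error of the last iterate
    avg_err.append(np.mean(trajectory[T // 2:]) ** 2)  # error of the tail average

# The averaged iterate sits much closer to the minimum at 0
assert np.mean(avg_err) < np.mean(final_err)
```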
Strategy 2: Mini-batching (Preview)
Instead of using one example, use $B$ examples:
$$\mathbf{g} = \frac{1}{B} \sum_{j=1}^{B} \nabla \mathcal{L}(\boldsymbol{\theta}, \mathbf{x}^{(i_j)}, y^{(i_j)})$$
The variance decreases by factor $B$:
$$\text{Var}[\mathbf{g}] = \frac{\sigma^2}{B}$$
This is the key insight behind mini-batch SGD, which we'll cover in detail in the next page. It interpolates between pure SGD ($B=1$) and batch GD ($B=N$).
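This $1/B$ scaling is easy to verify empirically. The sketch below measures the variance of mini-batch gradients (sampled with replacement, which is what the $\sigma^2/B$ formula assumes) for $B=1$ versus $B=16$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5000, 3
X = rng.normal(size=(N, d))
y = X @ np.array([1.0, -1.0, 2.0]) + rng.normal(size=N)
theta = np.zeros(d)

grads = (X @ theta - y)[:, None] * X   # per-example MSE gradients at theta
full = grads.mean(axis=0)

def minibatch_grad_var(B, trials=4000):
    """Empirical Var[g] for size-B mini-batches sampled with replacement."""
    sq_errs = []
    for _ in range(trials):
        idx = rng.integers(0, N, size=B)
        sq_errs.append(np.sum((grads[idx].mean(axis=0) - full) ** 2))
    return np.mean(sq_errs)

v1, v16 = minibatch_grad_var(1), minibatch_grad_var(16)
# Variance drops by roughly the batch size: v1 / v16 ~ 16
assert 10 < v1 / v16 < 24
```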
Strategy 3: Advanced Variance Reduction Methods
More sophisticated methods actively reduce variance:
SVRG (Stochastic Variance Reduced Gradient): Periodically compute exact gradient and use it to correct stochastic estimates
SAGA: Maintain a table of gradients for each example, enabling variance reduction without periodic exact computation
These methods achieve the $O(1/T)$ or even linear convergence rates of batch methods while using $O(1)$ computation per update. They're less common in deep learning due to memory requirements but important in convex optimization.
There's a fundamental trade-off: reducing variance requires more computation per update. Pure SGD (B=1) has maximum variance but minimum computation. Batch GD has zero variance but maximum computation. Mini-batch SGD with carefully chosen B finds an efficient operating point on this trade-off curve.
SGD has a deep connection to online learning—a paradigm where data arrives sequentially and the model must update after each example. This connection provides theoretical grounding and suggests practical applications.
The Online Learning Setting
In online learning, data arrives as a stream: at each round the learner receives an example, makes a prediction, incurs a loss, and updates its parameters before the next example arrives. There is no fixed dataset to loop over.
SGD naturally fits this paradigm: each update uses only the current example, and we don't need to store or revisit past data.
Regret Analysis
Online learning is analyzed via regret—the difference between the learner's cumulative loss and the best fixed predictor in hindsight:
$$R_T = \sum_{t=1}^{T} \mathcal{L}_t(\boldsymbol{\theta}^{(t)}) - \min_{\boldsymbol{\theta}} \sum_{t=1}^{T} \mathcal{L}_t(\boldsymbol{\theta})$$

where $\mathcal{L}_t$ is the loss on the example received at round $t$.
A good algorithm has sublinear regret: $R_T = o(T)$, meaning per-round regret $R_T/T \to 0$.
Theorem (Online Gradient Descent Regret)
For convex losses with gradients bounded by $G$ and learning rate $\eta = \frac{R}{G\sqrt{T}}$:
$$R_T \leq \frac{RG\sqrt{T}}{2} + \frac{RG\sqrt{T}}{2} = RG\sqrt{T}$$
This $O(\sqrt{T})$ regret bound is optimal for online convex optimization, and SGD achieves it!
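A minimal online-gradient-descent simulation (squared losses on a synthetic stream; all constants illustrative) shows the average regret $R_T/T$ shrinking as $T$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 10_000, 3
theta_star = np.array([1.0, -0.5, 2.0])  # generator of the stream

# Stream of rounds: L_t(theta) = 0.5 * (x_t @ theta - y_t)^2
X = rng.normal(size=(T, d))
y = X @ theta_star + 0.1 * rng.normal(size=T)

theta = np.zeros(d)
learner_loss = np.empty(T)
for t in range(T):
    pred = X[t] @ theta
    learner_loss[t] = 0.5 * (pred - y[t]) ** 2
    eta = 0.1 / np.sqrt(t + 1)             # O(1/sqrt(t)) step size
    theta -= eta * (pred - y[t]) * X[t]    # online gradient step

# Comparator: best fixed predictor in hindsight (least-squares fit to the stream)
best, *_ = np.linalg.lstsq(X, y, rcond=None)
best_loss = 0.5 * (X @ best - y) ** 2
regret = np.cumsum(learner_loss - best_loss)

# Average regret R_T / T shrinks as T grows: sublinear regret
assert regret[-1] / T < regret[T // 10] / (T // 10)
```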
True online learning (single pass through data, no storage) is appropriate when: data is too large to store, data is non-stationary (distribution changes), real-time adaptation is needed, or examples are genuinely streaming. Most deep learning uses 'fake' online learning—multiple passes through a stored dataset.
We've thoroughly examined Stochastic Gradient Descent—the algorithm that made large-scale machine learning practical. The key insights: the single-example gradient is an unbiased but high-variance estimate of the full gradient; convergence requires a decaying learning rate satisfying the Robbins-Monro conditions; and the gradient noise itself aids exploration, acts as implicit regularization, and improves generalization.
What's Next
The next page introduces Mini-batch SGD—the practical middle ground that combines the computational benefits of stochastic methods with variance reduction through batching. Mini-batch SGD is what's actually used in nearly all deep learning, and understanding it requires understanding both batch GD and pure SGD, which we've now covered.
You now understand why SGD works despite using noisy gradients, when and why learning rate schedules are needed, how noise benefits optimization and generalization, and the connection to online learning. This foundation is essential for understanding the optimizer choices in modern deep learning.