Throughout this module, we've built intuition about gradient descent variants through examples and heuristics. But a fundamental question remains: Under what conditions does gradient descent actually converge, and how quickly?
Convergence analysis provides mathematically rigorous answers to these questions. It tells us not just that gradient descent 'eventually works,' but precisely how many iterations we need to reach a given accuracy, what properties of the loss function determine convergence speed, and why certain variants outperform others in specific settings.
This material may feel more theoretical than previous pages, but the insights are directly practical: they explain why learning rates must obey certain bounds, why condition number matters, and why stochastic methods dominate in large-scale optimization despite having 'worse' theoretical rates. Understanding these results transforms hyperparameter tuning from guesswork to principled engineering.
By the end of this page, you will understand formal convergence definitions, derive convergence rates for convex and strongly convex functions, analyze the impact of condition number on convergence, compare batch vs stochastic convergence rates, and connect theoretical results to practical optimization.
Before analyzing convergence rates, we must establish precise definitions and the assumptions that make analysis possible.
What Does Convergence Mean?
We say gradient descent converges if the sequence of iterates ${\boldsymbol{\theta}^{(t)}}$ approaches an optimal point as $t \to \infty$. This can be measured in several ways:
Function value convergence: $J(\boldsymbol{\theta}^{(t)}) - J(\boldsymbol{\theta}^*) \to 0$
Iterate convergence: $\|\boldsymbol{\theta}^{(t)} - \boldsymbol{\theta}^*\| \to 0$
Gradient convergence: $\|\nabla J(\boldsymbol{\theta}^{(t)})\| \to 0$ (convergence to a stationary point)
For convex problems, all these are equivalent. For non-convex problems (neural networks), we typically analyze convergence to stationary points where the gradient vanishes.
A convergence rate describes HOW FAST we approach the optimum. O(1/T) means the error decreases proportionally to 1/T. O(ρᵀ) for ρ < 1 is called linear or exponential convergence—the fastest typical rate. Lower bound results show we often can't do better than these rates.
Key Assumptions for Analysis
Assumption 1: L-Lipschitz Gradient (Smoothness)
$$\|\nabla J(\boldsymbol{\theta}) - \nabla J(\boldsymbol{\phi})\| \leq L \|\boldsymbol{\theta} - \boldsymbol{\phi}\|$$
This bounds the curvature of $J$. Equivalently, the Hessian eigenvalues are bounded by $L$.
Implication: The gradient can't change too rapidly, so the first-order Taylor approximation is valid in a region of size $1/L$.
Assumption 2: μ-Strong Convexity
$$J(\boldsymbol{\phi}) \geq J(\boldsymbol{\theta}) + \nabla J(\boldsymbol{\theta})^\top(\boldsymbol{\phi} - \boldsymbol{\theta}) + \frac{\mu}{2}\|\boldsymbol{\phi} - \boldsymbol{\theta}\|^2$$
This means $J$ curves upward at least as fast as a quadratic with parameter $\mu$.
Implication: There's a unique global minimum, and we can bound the distance to optimum using the gradient norm.
The Condition Number
$$\kappa = \frac{L}{\mu}$$
The condition number captures how 'elongated' the loss landscape is. Higher $\kappa$ means some directions have much steeper curvature than others, making optimization harder.
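To make this concrete, here is a minimal numerical sketch (the function name `gd_iterations_to_tol` and all constants are ours, purely for illustration): gradient descent on a two-dimensional quadratic whose Hessian eigenvalues are $L$ and $\mu$, counting how many iterations it takes to get within a tolerance of the optimum.

```python
# Sketch: GD on J(x, y) = (L*x^2 + mu*y^2) / 2, whose Hessian eigenvalues
# are L and mu, so kappa = L / mu. The optimum is the origin.

def gd_iterations_to_tol(L, mu, tol=1e-6, max_iter=1_000_000):
    """Run GD from (1, 1) with eta = 1/L until ||theta|| < tol."""
    eta = 1.0 / L
    x, y = 1.0, 1.0
    for t in range(max_iter):
        if (x * x + y * y) ** 0.5 < tol:
            return t
        # Gradient of J is (L*x, mu*y).
        x -= eta * L * x
        y -= eta * mu * y
    return max_iter

well = gd_iterations_to_tol(L=1.0, mu=0.5)    # kappa = 2
ill = gd_iterations_to_tol(L=1.0, mu=0.001)   # kappa = 1000
print(well, ill)  # the ill-conditioned problem needs vastly more iterations
```

The well-conditioned problem converges in a few dozen steps, while the ill-conditioned one needs thousands: the step size is dictated by the steep direction, so the flat direction crawls.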
We begin with the analysis of gradient descent on convex functions—the classical setting where the strongest results hold.
Theorem 1: Convergence for Smooth Convex Functions
Let $J$ be convex with $L$-Lipschitz continuous gradient. For batch gradient descent with learning rate $\eta \leq 1/L$:
$$J(\boldsymbol{\theta}^{(T)}) - J(\boldsymbol{\theta}^*) \leq \frac{\|\boldsymbol{\theta}^{(0)} - \boldsymbol{\theta}^*\|^2}{2\eta T}$$
Proof Sketch:
By smoothness, $J(\boldsymbol{\theta}^{(t+1)}) \leq J(\boldsymbol{\theta}^{(t)}) - \eta\left(1 - \frac{\eta L}{2}\right)\|\nabla J(\boldsymbol{\theta}^{(t)})\|^2$, so each step decreases the function value.
By convexity, $J(\boldsymbol{\theta}^{(t)}) - J(\boldsymbol{\theta}^*) \leq \nabla J(\boldsymbol{\theta}^{(t)})^\top(\boldsymbol{\theta}^{(t)} - \boldsymbol{\theta}^*)$, which bounds the suboptimality by the gradient.
Telescope the resulting inequalities over $t = 0, \ldots, T-1$, using that $\|\boldsymbol{\theta}^{(t)} - \boldsymbol{\theta}^*\|$ is non-increasing.
This $O(1/T)$ rate means doubling the number of iterations halves the error bound.
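Theorem 1's bound can be checked numerically. The sketch below (our own example, not from any library) runs GD on $J(x) = \sqrt{1 + x^2} - 1$, a smooth convex function with $L = 1$ that is not strongly convex away from its minimum at $x = 0$, and verifies the bound $J(\theta_T) - J^* \leq \|\theta_0 - \theta^*\|^2 / (2\eta T)$.

```python
import math

# Sanity check of Theorem 1's bound on J(x) = sqrt(1 + x^2) - 1:
# convex, 1-Lipschitz gradient, minimum value 0 at x = 0.

def run_gd(x0=5.0, eta=1.0, T=1000):
    J = lambda x: math.sqrt(1 + x * x) - 1
    grad = lambda x: x / math.sqrt(1 + x * x)
    x = x0
    for _ in range(T):
        x -= eta * grad(x)
    # Theorem 1 bound: J(theta_T) - J* <= ||theta_0 - theta*||^2 / (2 * eta * T)
    return J(x), x0 * x0 / (2 * eta * T)

val, bound = run_gd()
print(f"J(theta_T) = {val:.2e} <= bound {bound:.2e}")
```

On this particular function GD converges much faster than the bound; the theorem is a worst-case guarantee over the whole function class.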
Theorem 2: Linear Convergence for Strongly Convex Functions
Let $J$ be $\mu$-strongly convex with $L$-Lipschitz gradient. For gradient descent with $\eta = 1/L$:
$$\|\boldsymbol{\theta}^{(T)} - \boldsymbol{\theta}^*\|^2 \leq \left(1 - \frac{\mu}{L}\right)^T \|\boldsymbol{\theta}^{(0)} - \boldsymbol{\theta}^*\|^2$$
Equivalently: $$\|\boldsymbol{\theta}^{(T)} - \boldsymbol{\theta}^*\|^2 \leq \left(1 - \frac{1}{\kappa}\right)^T \|\boldsymbol{\theta}^{(0)} - \boldsymbol{\theta}^*\|^2$$
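A quick numerical check of Theorem 2 (a sketch with illustrative constants of our choosing): GD with $\eta = 1/L$ on a 1-D quadratic $J(x) = \mu x^2 / 2$, whose single Hessian eigenvalue is $\mu \leq L$, must satisfy the geometric contraction above.

```python
def check_linear_rate(L=1.0, mu=0.1, x0=10.0, T=50):
    """GD with eta = 1/L on J(x) = mu*x^2/2; optimum at x = 0."""
    eta = 1.0 / L
    x = x0
    for _ in range(T):
        x -= eta * mu * x          # gradient is mu * x
    # Theorem 2: ||x_T - x*||^2 <= (1 - mu/L)^T * ||x_0 - x*||^2
    bound = (1 - mu / L) ** T * x0 * x0
    return x * x, bound

dist_sq, bound = check_linear_rate()
print(dist_sq, bound)  # squared distance decays geometrically, within the bound
```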
Interpretation:
| Function Class | Rate | Iterations for ε-accuracy | Dependency |
|---|---|---|---|
| Convex, smooth | $O(1/T)$ | $O(LR^2/\epsilon)$ | Linear in $1/\epsilon$ |
| Strongly convex, smooth | $O((1-\mu/L)^T)$ | $O(\kappa \log(1/\epsilon))$ | Log in $1/\epsilon$ |
| Non-convex, smooth | $O(1/\sqrt{T})$ to stationary | $O(L/\epsilon^2)$ | Only find ∇J ≈ 0 |
Strong convexity changes the iteration complexity from O(1/ε) to O(log(1/ε)). To go from 0.1 to 0.001 error (100× reduction): convex needs ~100× more iterations; strongly convex needs only ~log(100) ≈ 7× more. This is why regularization (which adds strong convexity) can speed up optimization.
Stochastic gradient descent adds noise to gradients, fundamentally changing the convergence analysis. We need to track not just the sequence of iterates, but their expected behavior.
Key Assumptions for SGD
Bounded Variance: The stochastic gradient has bounded variance: $$\mathbb{E}[\|\nabla \mathcal{L}(\boldsymbol{\theta}, \mathbf{x}, y) - \nabla J(\boldsymbol{\theta})\|^2] \leq \sigma^2$$
This $\sigma^2$ is the fundamental noise parameter. Higher variance means noisier gradients.
Unbiasedness: $\mathbb{E}[\nabla \mathcal{L}] = \nabla J$ (the stochastic gradient is unbiased).
Theorem 3: SGD for Convex Functions
For convex $J$ with $\eta_t = \frac{R}{\sigma\sqrt{t}}$:
$$\mathbb{E}[J(\boldsymbol{\theta}^{(T)})] - J(\boldsymbol{\theta}^*) \leq O\left(\frac{R\sigma}{\sqrt{T}}\right)$$
where $R = |\boldsymbol{\theta}^{(0)} - \boldsymbol{\theta}^*|$.
Theorem 4: SGD for Strongly Convex Functions
For $\mu$-strongly convex $J$ with $\eta_t = 1/(\mu t)$:
$$\mathbb{E}[J(\boldsymbol{\theta}^{(T)})] - J(\boldsymbol{\theta}^*) \leq O\left(\frac{L\sigma^2}{\mu^2 T}\right)$$
Key Observation: SGD achieves $O(1/T)$ for strongly convex functions—the same order as batch GD for merely convex functions. The gradient noise prevents the linear convergence we get with exact gradients.
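The effect of gradient noise is easy to see in simulation. The sketch below (our own toy setup: $J(x) = \mu x^2/2$ with $\mu = 1$, additive Gaussian noise, an arbitrary constant step of 0.05) compares a constant step size, which stalls at a noise floor, against Theorem 4's decaying step $\eta_t = 1/(\mu t)$, which keeps shrinking the error.

```python
import random

def sgd(decay, mu=1.0, sigma=1.0, T=20_000, seed=0, tail=2_000):
    """SGD on J(x) = mu*x^2/2 with additive N(0, sigma^2) gradient noise.

    Returns the average of x^2 over the final `tail` iterations."""
    rng = random.Random(seed)
    x, tail_sq = 5.0, 0.0
    for t in range(1, T + 1):
        eta = 1.0 / (mu * t) if decay else 0.05   # 0.05: an arbitrary fixed step
        g = mu * x + rng.gauss(0.0, sigma)        # unbiased noisy gradient
        x -= eta * g
        if t > T - tail:
            tail_sq += x * x
    return tail_sq / tail

stalled, converged = sgd(decay=False), sgd(decay=True)
print(stalled, converged)  # constant step stalls at a noise floor; decaying step keeps improving
```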
Theorem 5: Mini-batch SGD
With mini-batch size $B$, averaging $B$ independent per-example gradients reduces the variance to $\sigma^2/B$. Substituting $\sigma^2/B$ for $\sigma^2$ in Theorems 3 and 4 gives the mini-batch rates.
Larger batches reduce variance and improve convergence—but recall that we get fewer updates per epoch, so the benefit isn't free.
| Method | Rate | Iterations for ε error | Compute per iter | Total compute |
|---|---|---|---|---|
| Batch GD | $O((1-1/\kappa)^T)$ | $O(\kappa\log(1/\epsilon))$ | $O(N)$ | $O(N\kappa\log(1/\epsilon))$ |
| SGD | $O(1/T)$ | $O(\sigma^2/\epsilon)$ | $O(1)$ | $O(\sigma^2/\epsilon)$ |
| Mini-batch $B$ | $O(1/(BT))$ | $O(\sigma^2/(B\epsilon))$ | $O(B)$ | $O(\sigma^2/\epsilon)$ |
SGD wins in total compute when N (dataset size) is large and we don't need extremely high precision ε. For typical ML where N >> σ²/ε, SGD's O(σ²/ε) beats batch GD's O(Nκlog(1/ε)). This is why SGD dominates in practice.
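The $\sigma^2/B$ substitution rests on a basic statistical fact: the mean of $B$ independent samples has $1/B$ the variance of one sample. A quick Monte Carlo check (all names and constants here are illustrative):

```python
import random

def empirical_var(B, sigma=2.0, trials=100_000, seed=1):
    """Empirical variance of a mini-batch mean of B noisy 'gradients' (true mean 0)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        g = sum(rng.gauss(0.0, sigma) for _ in range(B)) / B   # mini-batch mean
        total += g * g
    return total / trials

v1, v16 = empirical_var(B=1), empirical_var(B=16)
print(v1 / v16)  # close to 16: variance scales as sigma^2 / B
```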
The condition number $\kappa = L/\mu$ appears throughout convergence analysis. Let's develop a deep understanding of its role.
Geometric Interpretation
For a quadratic loss $J(\boldsymbol{\theta}) = \frac{1}{2}\boldsymbol{\theta}^\top \mathbf{H} \boldsymbol{\theta}$:
Geometrically, the level sets of the loss are ellipsoids whose longest axis is $\sqrt{\kappa}$ times the shortest, so $\kappa$ measures how elongated the landscape is:
In elongated landscapes, gradient descent oscillates across the narrow direction while making slow progress along the long direction.
Why Condition Number Hurts Convergence
With learning rate $\eta = 1/L$, we optimize well for directions with curvature $\approx L$ but poorly for directions with curvature $\approx \mu$:
Progress in the 'slow' directions takes $\kappa$ times longer than in the 'fast' directions.
The Convergence Rate Revisited
$$\rho = 1 - \frac{1}{\kappa} = \frac{\kappa - 1}{\kappa}$$
| $\kappa$ | $\rho$ | Iterations to halve error |
|---|---|---|
| 2 | 0.5 | 1 |
| 10 | 0.9 | 7 |
| 100 | 0.99 | 69 |
| 1000 | 0.999 | 693 |
| 10000 | 0.9999 | 6931 |
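The table above follows directly from the contraction factor: halving the error takes $\ln 2 / (-\ln \rho) \approx 0.693\,\kappa$ iterations for large $\kappa$. A short sketch reproducing it:

```python
import math

def iters_to_halve(kappa):
    """Iterations for the error to halve under contraction rho = 1 - 1/kappa."""
    rho = 1 - 1 / kappa
    return round(math.log(2) / -math.log(rho))

for kappa in (2, 10, 100, 1000, 10000):
    print(kappa, iters_to_halve(kappa))
```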
For neural networks, effective $\kappa$ can be $10^4$ to $10^6$, explaining why training can be fragile and why adaptive methods help.
If training is slow but stable, you likely have a high condition number problem. Solutions: try Adam instead of SGD, add batch normalization, or increase learning rate (it may be too conservative for the slow directions). If training is unstable, reduce LR or add gradient clipping.
Neural network loss functions are non-convex, invalidating the guarantees from convex optimization. What can we say about convergence in this more challenging setting?
The Relaxed Goal
For non-convex functions, we can't guarantee finding the global minimum. Instead, we aim for stationary points where $\|\nabla J(\boldsymbol{\theta})\| = 0$. These include local minima, local maxima, and saddle points.
Theorem 6: Convergence to Stationary Points
For smooth (not necessarily convex) $J$ with $L$-Lipschitz gradient, gradient descent with $\eta \leq 1/L$ satisfies:
$$\min_{t=0,\ldots,T-1} \|\nabla J(\boldsymbol{\theta}^{(t)})\|^2 \leq \frac{2[J(\boldsymbol{\theta}^{(0)}) - J(\boldsymbol{\theta}^*)]}{\eta T}$$
This says the minimum squared gradient norm over all iterations decreases as $O(1/T)$. To find a point with $\|\nabla J\| \leq \epsilon$, we need $O(1/\epsilon^2)$ iterations.
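Theorem 6 can be checked on a genuinely non-convex example. The sketch below (our own toy: a double well $J(x) = x^4/4 - x^2/2$ with minima at $x = \pm 1$ and $J^* = -1/4$) tracks the minimum squared gradient norm and compares it to the bound.

```python
# Checking Theorem 6 on the non-convex double well J(x) = x^4/4 - x^2/2,
# whose gradient is x^3 - x; global minima at x = +/-1 with J* = -1/4.

def check_stationarity_bound(x0=2.0, eta=0.05, T=1000):
    J = lambda x: x**4 / 4 - x**2 / 2
    grad = lambda x: x**3 - x
    x, min_sq = x0, float("inf")
    for _ in range(T):
        g = grad(x)
        min_sq = min(min_sq, g * g)
        x -= eta * g
    # Theorem 6: min_t ||grad||^2 <= 2 * (J(x0) - J*) / (eta * T)
    bound = 2 * (J(x0) - (-0.25)) / (eta * T)
    return min_sq, bound

min_sq, bound = check_stationarity_bound()
print(min_sq, bound)
```

Note that the guarantee only concerns the gradient norm: GD ends up at the nearby stationary point $x = 1$, with no claim about global optimality.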
SGD for Non-Convex Functions
SGD converges to stationary points at rate:
$$\mathbb{E}\left[\min_{t} \|\nabla J(\boldsymbol{\theta}^{(t)})\|^2\right] \leq O\left(\frac{1}{\sqrt{T}}\right)$$
This is slower than batch GD's $O(1/T)$, but SGD does far more updates per unit compute, often making it faster in wall-clock time.
The Blessing of Non-Convexity
Surprisingly, non-convexity isn't always bad:
Saddle points are escapable: In high dimensions, saddle points typically have directions of negative curvature. SGD's noise helps escape along these directions.
Local minima may be good enough: Empirically, different local minima of neural networks often have similar test performance. Finding ANY good minimum suffices.
Flatness correlates with generalization: Wide, flat minima (easier to find) often generalize better than sharp, narrow ones.
In high-dimensional spaces, strict local minima are rare—most critical points are saddle points. At a saddle point, a random direction has a good chance of pointing along negative curvature, where the loss decreases. SGD's noise naturally samples such directions, helping escape saddles.
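This escape mechanism is easy to demonstrate on the simplest saddle, $J(x, y) = (x^2 - y^2)/2$ (our own toy example). At the saddle the exact gradient is zero, so plain GD never moves; tiny gradient noise lets SGD slide down the negative-curvature direction.

```python
import random

def descend(noise, steps=500, eta=0.1, seed=0):
    rng = random.Random(seed)
    x, y = 0.0, 0.0                       # start exactly at the saddle
    for _ in range(steps):
        gx, gy = x, -y                    # gradient of (x^2 - y^2) / 2
        if noise:
            gx += rng.gauss(0.0, 1e-3)    # tiny stochastic-gradient noise
            gy += rng.gauss(0.0, 1e-3)
        x, y = x - eta * gx, y - eta * gy
    return abs(y)

print(descend(noise=False), descend(noise=True))  # 0.0 vs. a large |y|
```

Along $y$ the update multiplies the coordinate by $1 + \eta$ each step, so any injected noise is amplified exponentially; without noise the iterate is a fixed point.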
Practical Implications for Neural Networks
We can't guarantee finding global optima: But we don't need to. Finding good local optima is sufficient.
Multiple good solutions exist: Different runs with different initializations find different minima, often with similar performance.
Optimization and generalization differ: A lower training loss doesn't always mean better test performance. The path taken by optimization matters.
Hyperparameters affect the basin reached: Learning rate, batch size, and schedule don't just affect speed—they affect WHICH minimum we find.
The fundamental limitation of SGD is gradient variance, which prevents linear convergence on strongly convex problems. Variance reduction methods aim to get the best of both worlds: the cheap iterations of SGD with the fast convergence of batch GD.
The Key Insight
The gradient variance doesn't need to be $\sigma^2$ at all times. If we can periodically reduce the noise—especially as we approach the optimum—we can potentially recover linear convergence.
SVRG (Stochastic Variance Reduced Gradient)
SVRG keeps a snapshot $\tilde{\boldsymbol{\theta}}$ together with its full gradient $\nabla J(\tilde{\boldsymbol{\theta}})$, and for a randomly sampled example $i$ uses the estimator $\mathbf{g}^{(t)} = \nabla \mathcal{L}_i(\boldsymbol{\theta}^{(t)}) - \nabla \mathcal{L}_i(\tilde{\boldsymbol{\theta}}) + \nabla J(\tilde{\boldsymbol{\theta}})$. The subtraction cancels much of the variance, making $\mathbf{g}^{(t)}$ a lower-variance unbiased estimator of $\nabla J(\boldsymbol{\theta}^{(t)})$.
Why SVRG Works
The variance-reduced gradient has variance: $$\text{Var}[\mathbf{g}^{(t)}] \approx \text{Var}[\nabla \mathcal{L}(\boldsymbol{\theta}^{(t)}) - \nabla \mathcal{L}(\tilde{\boldsymbol{\theta}})]$$
When $\boldsymbol{\theta}^{(t)} \approx \tilde{\boldsymbol{\theta}}$, this variance is small because the two stochastic gradients are correlated—their randomness partially cancels.
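A minimal SVRG sketch on 1-D least squares (the function name, step size, and inner-loop length are illustrative choices, not prescribed by the algorithm):

```python
import random

def svrg(data, x0=0.0, eta=0.02, epochs=30, m=100, seed=0):
    """SVRG on 1-D least squares J(x) = (1/N) * sum((a*x - b)^2 / 2)."""
    rng = random.Random(seed)
    N = len(data)
    full_grad = lambda x: sum(a * (a * x - b) for a, b in data) / N
    x = x0
    for _ in range(epochs):
        snap, snap_grad = x, full_grad(x)   # snapshot and its exact full gradient
        for _ in range(m):
            a, b = data[rng.randrange(N)]
            # stochastic grad at x, minus stochastic grad at the snapshot,
            # plus the snapshot's full gradient: unbiased, low variance near snap
            g = a * (a * x - b) - a * (a * snap - b) + snap_grad
            x -= eta * g
    return x

data = [(1.0, 2.0), (2.0, 2.0), (0.5, 1.0)]
x_hat = svrg(data)
x_star = sum(a * b for a, b in data) / sum(a * a for a, _ in data)
print(x_hat, x_star)  # the SVRG iterate lands very close to the closed-form minimizer
```

Each epoch pays one full-gradient pass ($O(N)$) plus $m$ cheap inner steps, which is where the $O((N + \kappa)\log(1/\epsilon))$ complexity comes from.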
Theorem 7: SVRG Convergence
For strongly convex $J$, SVRG with proper parameters achieves: $$\mathbb{E}[J(\boldsymbol{\theta}^{(T)}) - J(\boldsymbol{\theta}^*)] \leq \rho^T [J(\boldsymbol{\theta}^{(0)}) - J(\boldsymbol{\theta}^*)]$$
where $\rho < 1$. This is linear convergence—the same rate as batch GD—but with SGD-like $O(1)$ per-iteration cost!
| Method | Convergence Rate | Total Complexity to ε-accuracy |
|---|---|---|
| Batch GD | Linear $O((1-1/\kappa)^T)$ | $O(N\kappa \log(1/\epsilon))$ |
| SGD | Sublinear $O(1/T)$ | $O(\sigma^2/\epsilon)$ |
| SVRG | Linear $O(\rho^T)$ | $O((N + \kappa) \log(1/\epsilon))$ |
| SAGA | Linear $O(\rho^T)$ | $O((N + \kappa) \log(1/\epsilon))$ |
Variance reduction methods shine for convex problems where high precision is needed. They're less common in deep learning because: (1) the non-convexity makes theoretical guarantees weaker, (2) implicit regularization from SGD noise may be beneficial, (3) storing full gradients for SAGA requires O(Nd) memory.
Are the convergence rates we've established the best possible, or could clever algorithms do better? Lower bound results answer this fundamental question.
Theorem 8: Lower Bound for First-Order Methods (Convex)
No first-order method (using only gradient information) can guarantee better than $O(1/T^2)$ convergence for smooth convex functions in the worst case.
Gradient descent achieves $O(1/T)$, so there's a gap. Accelerated gradient descent (Nesterov, 1983) achieves $O(1/T^2)$, matching the lower bound.
Theorem 9: Lower Bound for Strongly Convex Functions
No first-order method can converge faster than: $$\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^T$$
Gradient descent achieves $(1 - 1/\kappa)^T \approx e^{-T/\kappa}$. Optimal methods achieve $(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1})^T \approx e^{-2T/\sqrt{\kappa}}$.
The gap is $\sqrt{\kappa}$ vs $\kappa$—potentially huge for ill-conditioned problems!
Momentum-based methods like Nesterov's accelerated gradient can close this gap. Instead of κ iterations per order-of-magnitude improvement, accelerated methods need only √κ iterations. For κ = 10000, this is 100× faster!
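The $\sqrt{\kappa}$ vs $\kappa$ gap shows up clearly in simulation. The sketch below (our own toy: the quadratic $J(x, y) = (Lx^2 + \mu y^2)/2$ with $\kappa = 1000$) compares plain GD with a standard Nesterov-style update using the strongly convex momentum coefficient $\beta = (\sqrt{L} - \sqrt{\mu})/(\sqrt{L} + \sqrt{\mu})$.

```python
# Plain GD vs. Nesterov's accelerated gradient on an ill-conditioned quadratic.

def grad(p, L, mu):
    return (L * p[0], mu * p[1])   # gradient of (L*x^2 + mu*y^2) / 2

def run(accelerated, L=1.0, mu=0.001, T=300):
    eta = 1.0 / L
    # momentum coefficient for the strongly convex setting; 0 recovers plain GD
    beta = ((L ** 0.5 - mu ** 0.5) / (L ** 0.5 + mu ** 0.5)) if accelerated else 0.0
    x = prev = (1.0, 1.0)
    for _ in range(T):
        # look-ahead point y = x + beta * (x - x_prev), then a gradient step at y
        y = (x[0] + beta * (x[0] - prev[0]), x[1] + beta * (x[1] - prev[1]))
        g = grad(y, L, mu)
        prev, x = x, (y[0] - eta * g[0], y[1] - eta * g[1])
    return x[0] ** 2 + x[1] ** 2   # squared distance to the optimum (origin)

plain, nesterov = run(False), run(True)
print(plain, nesterov)  # nesterov is orders of magnitude closer to the optimum
```

After the same 300 iterations, plain GD has barely moved along the flat direction while the accelerated iterate is essentially converged, matching the $e^{-T/\kappa}$ vs $e^{-2T/\sqrt{\kappa}}$ comparison above.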
Lower Bounds for Stochastic Methods
For SGD with bounded variance $\sigma^2$:
$$\mathbb{E}[J(\boldsymbol{\theta}^{(T)}) - J(\boldsymbol{\theta}^*)] \geq \Omega\left(\frac{\sigma}{\sqrt{T}}\right) \text{ for convex}$$
$$\mathbb{E}[J(\boldsymbol{\theta}^{(T)}) - J(\boldsymbol{\theta}^*)] \geq \Omega\left(\frac{\sigma^2}{\mu T}\right) \text{ for strongly convex}$$
SGD matches these lower bounds (up to constants), making it optimal in this setting!
Practical Takeaways from Lower Bounds
SGD is optimal for what it does: You can't improve on SGD's rate without using structure SGD ignores (variance reduction, second-order info).
Momentum actually helps: The gap between GD and optimal accelerated methods is real. Momentum-based optimizers achieve it.
Second-order methods can beat first-order bounds: Newton's method converges quadratically, ignoring the $\kappa$ dependence entirely. The catch: computing the Hessian is expensive.
Batch size affects achievable rates: Larger batches reduce $\sigma^2/B$, directly improving SGD convergence bounds.
Convergence theory for convex functions, while elegant, doesn't directly apply to neural networks which are highly non-convex. Let's bridge the gap between theory and practice.
Why Theory Still Matters for Neural Networks
Local convexity: Near optima, loss functions are often approximately convex. Convergence theory applies locally.
Qualitative insights transfer: The importance of condition number, learning rate bounds, and variance effects carry over even when precise theorems don't.
Debugging guidance: When training fails, convex intuition often identifies the problem (LR too high, gradients too noisy, etc.).
Component analysis: Individual operations (linear layers, convolutions) are often convex in parameters—helps understand layer-wise behavior.
Practical Guidelines Derived from Theory
Set learning rate based on stability, not convergence speed
Expect slower convergence for larger models
Use batch size to control noise
Apply schedules for fine convergence
Use theory for intuition and debugging, not for direct parameter calculation. Neural networks violate theoretical assumptions, but the qualitative lessons remain valuable. When something goes wrong, check against theoretical predictions to identify likely causes.
We've completed a rigorous tour of convergence analysis for gradient descent methods, consolidating the key mathematical and practical insights along the way.
Module Complete
You've now completed the Gradient Descent Variants module. You understand formal convergence definitions and rates, how smoothness and strong convexity determine those rates, how the condition number governs convergence speed, the batch vs stochastic trade-off, variance reduction methods, and what lower bounds say is achievable.
This foundation prepares you for adaptive optimizers (Adam, RMSprop, etc.) covered in the next module, which build on these ideas to handle ill-conditioned problems more effectively.
You now have both the practical skills to tune gradient descent methods and the theoretical understanding to know WHY your choices work. This combination—practical effectiveness backed by principled understanding—is what separates excellent ML engineers from those who merely follow recipes.