Throughout this module, we've built intuition about gradient descent variants through examples and heuristics. But a fundamental question remains: Under what conditions does gradient descent actually converge, and how quickly?
Convergence analysis provides mathematically rigorous answers to these questions. It tells us not just that gradient descent 'eventually works,' but precisely how many iterations we need to reach a given accuracy, what properties of the loss function determine convergence speed, and why certain variants outperform others in specific settings.
This material may feel more theoretical than previous pages, but the insights are directly practical: they explain why learning rates must obey certain bounds, why condition number matters, and why stochastic methods dominate in large-scale optimization despite having 'worse' theoretical rates. Understanding these results transforms hyperparameter tuning from guesswork to principled engineering.
By the end of this page, you will understand formal convergence definitions, derive convergence rates for convex and strongly convex functions, analyze the impact of condition number on convergence, compare batch vs stochastic convergence rates, and connect theoretical results to practical optimization.
Before analyzing convergence rates, we must establish precise definitions and the assumptions that make analysis possible.
What Does Convergence Mean?
We say gradient descent converges if the sequence of iterates ${\boldsymbol{\theta}^{(t)}}$ approaches an optimal point as $t \to \infty$. This can be measured in several ways:
Function value convergence: $J(\boldsymbol{\theta}^{(t)}) - J(\boldsymbol{\theta}^*) \to 0$
Iterate convergence: $\|\boldsymbol{\theta}^{(t)} - \boldsymbol{\theta}^*\| \to 0$
Gradient convergence: $\|\nabla J(\boldsymbol{\theta}^{(t)})\| \to 0$ (convergence to a stationary point)
For convex problems, all these are equivalent. For non-convex problems (neural networks), we typically analyze convergence to stationary points where the gradient vanishes.
A convergence rate describes HOW FAST we approach the optimum. O(1/T) means the error decreases proportionally to 1/T. O(ρᵀ) for ρ < 1 is called linear or exponential convergence—the fastest typical rate. Lower bound results show we often can't do better than these rates.
Key Assumptions for Analysis
Assumption 1: L-Lipschitz Gradient (Smoothness)
$$\|\nabla J(\boldsymbol{\theta}) - \nabla J(\boldsymbol{\phi})\| \leq L \|\boldsymbol{\theta} - \boldsymbol{\phi}\|$$
This bounds the curvature of $J$. Equivalently, the Hessian eigenvalues are bounded by $L$.
Implication: The gradient can't change too rapidly, so the first-order Taylor approximation is valid in a region of size $1/L$.
Assumption 2: μ-Strong Convexity
$$J(\boldsymbol{\phi}) \geq J(\boldsymbol{\theta}) + \nabla J(\boldsymbol{\theta})^\top(\boldsymbol{\phi} - \boldsymbol{\theta}) + \frac{\mu}{2}\|\boldsymbol{\phi} - \boldsymbol{\theta}\|^2$$
This means $J$ curves upward at least as fast as a quadratic with parameter $\mu$.
Implication: There's a unique global minimum, and we can bound the distance to optimum using the gradient norm.
The Condition Number
$$\kappa = \frac{L}{\mu}$$
The condition number captures how 'elongated' the loss landscape is. Higher $\kappa$ means some directions have much steeper curvature than others, making optimization harder.
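To make this concrete, here is a minimal numerical sketch (the function name `gd_iterations_to_tol` and all constants are ours, purely for illustration): gradient descent on a two-dimensional quadratic whose Hessian eigenvalues are $L$ and $\mu$, counting how many iterations it takes to get within a tolerance of the optimum.

```python
# Sketch: GD on J(x, y) = (L*x^2 + mu*y^2) / 2, whose Hessian eigenvalues
# are L and mu, so kappa = L / mu. The optimum is the origin.

def gd_iterations_to_tol(L, mu, tol=1e-6, max_iter=1_000_000):
    """Run GD from (1, 1) with eta = 1/L until ||theta|| < tol."""
    eta = 1.0 / L
    x, y = 1.0, 1.0
    for t in range(max_iter):
        if (x * x + y * y) ** 0.5 < tol:
            return t
        # Gradient of J is (L*x, mu*y).
        x -= eta * L * x
        y -= eta * mu * y
    return max_iter

well = gd_iterations_to_tol(L=1.0, mu=0.5)    # kappa = 2
ill = gd_iterations_to_tol(L=1.0, mu=0.001)   # kappa = 1000
print(well, ill)  # the ill-conditioned problem needs vastly more iterations
```

The well-conditioned problem converges in a few dozen steps, while the ill-conditioned one needs thousands: the step size is dictated by the steep direction, so the flat direction crawls.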
We begin with the analysis of gradient descent on convex functions—the classical setting where the strongest results hold.
Theorem 1: Convergence for Smooth Convex Functions
Let $J$ be convex with $L$-Lipschitz continuous gradient. For batch gradient descent with learning rate $\eta \leq 1/L$:
$$J(\boldsymbol{\theta}^{(T)}) - J(\boldsymbol{\theta}^*) \leq \frac{\|\boldsymbol{\theta}^{(0)} - \boldsymbol{\theta}^*\|^2}{2\eta T}$$
Proof Sketch:
By smoothness, $J(\boldsymbol{\theta}^{(t+1)}) \leq J(\boldsymbol{\theta}^{(t)}) - \eta\left(1 - \frac{\eta L}{2}\right)\|\nabla J(\boldsymbol{\theta}^{(t)})\|^2$, so each step decreases the function value.
By convexity, $J(\boldsymbol{\theta}^{(t)}) - J(\boldsymbol{\theta}^*) \leq \nabla J(\boldsymbol{\theta}^{(t)})^\top(\boldsymbol{\theta}^{(t)} - \boldsymbol{\theta}^*)$, which bounds the suboptimality by the gradient.
Telescope the resulting inequalities over $t = 0, \ldots, T-1$, using that $\|\boldsymbol{\theta}^{(t)} - \boldsymbol{\theta}^*\|$ is non-increasing.
This $O(1/T)$ rate means doubling the number of iterations halves the error bound.
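Theorem 1's bound can be checked numerically. The sketch below (our own example, not from any library) runs GD on $J(x) = \sqrt{1 + x^2} - 1$, a smooth convex function with $L = 1$ that is not strongly convex away from its minimum at $x = 0$, and verifies the bound $J(\theta_T) - J^* \leq \|\theta_0 - \theta^*\|^2 / (2\eta T)$.

```python
import math

# Sanity check of Theorem 1's bound on J(x) = sqrt(1 + x^2) - 1:
# convex, 1-Lipschitz gradient, minimum value 0 at x = 0.

def run_gd(x0=5.0, eta=1.0, T=1000):
    J = lambda x: math.sqrt(1 + x * x) - 1
    grad = lambda x: x / math.sqrt(1 + x * x)
    x = x0
    for _ in range(T):
        x -= eta * grad(x)
    # Theorem 1 bound: J(theta_T) - J* <= ||theta_0 - theta*||^2 / (2 * eta * T)
    return J(x), x0 * x0 / (2 * eta * T)

val, bound = run_gd()
print(f"J(theta_T) = {val:.2e} <= bound {bound:.2e}")
```

On this particular function GD converges much faster than the bound; the theorem is a worst-case guarantee over the whole function class.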
Theorem 2: Linear Convergence for Strongly Convex Functions
Let $J$ be $\mu$-strongly convex with $L$-Lipschitz gradient. For gradient descent with $\eta = 1/L$:
$$\|\boldsymbol{\theta}^{(T)} - \boldsymbol{\theta}^*\|^2 \leq \left(1 - \frac{\mu}{L}\right)^T \|\boldsymbol{\theta}^{(0)} - \boldsymbol{\theta}^*\|^2$$
Equivalently: $$\|\boldsymbol{\theta}^{(T)} - \boldsymbol{\theta}^*\|^2 \leq \left(1 - \frac{1}{\kappa}\right)^T \|\boldsymbol{\theta}^{(0)} - \boldsymbol{\theta}^*\|^2$$
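A quick numerical check of Theorem 2 (a sketch with illustrative constants of our choosing): GD with $\eta = 1/L$ on a 1-D quadratic $J(x) = \mu x^2 / 2$, whose single Hessian eigenvalue is $\mu \leq L$, must satisfy the geometric contraction above.

```python
def check_linear_rate(L=1.0, mu=0.1, x0=10.0, T=50):
    """GD with eta = 1/L on J(x) = mu*x^2/2; optimum at x = 0."""
    eta = 1.0 / L
    x = x0
    for _ in range(T):
        x -= eta * mu * x          # gradient is mu * x
    # Theorem 2: ||x_T - x*||^2 <= (1 - mu/L)^T * ||x_0 - x*||^2
    bound = (1 - mu / L) ** T * x0 * x0
    return x * x, bound

dist_sq, bound = check_linear_rate()
print(dist_sq, bound)  # squared distance decays geometrically, within the bound
```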
Interpretation:
| Function Class | Rate | Iterations for ε-accuracy | Dependency |
|---|---|---|---|
| Convex, smooth | $O(1/T)$ | $O(LR^2/\epsilon)$ | Linear in $1/\epsilon$ |
| Strongly convex, smooth | $O((1-\mu/L)^T)$ | $O(\kappa \log(1/\epsilon))$ | Log in $1/\epsilon$ |
| Non-convex, smooth | $O(1/\sqrt{T})$ to stationary | $O(L/\epsilon^2)$ | Only find ∇J ≈ 0 |
Strong convexity changes the iteration complexity from O(1/ε) to O(log(1/ε)). To go from 0.1 to 0.001 error (100× reduction): convex needs ~100× more iterations; strongly convex needs only ~log(100) ≈ 7× more. This is why regularization (which adds strong convexity) can speed up optimization.
Stochastic gradient descent adds noise to gradients, fundamentally changing the convergence analysis. We need to track not just the sequence of iterates, but their expected behavior.
Key Assumptions for SGD
Bounded Variance: The stochastic gradient has bounded variance: $$\mathbb{E}[\|\nabla \mathcal{L}(\boldsymbol{\theta}, \mathbf{x}, y) - \nabla J(\boldsymbol{\theta})\|^2] \leq \sigma^2$$
This $\sigma^2$ is the fundamental noise parameter. Higher variance means noisier gradients.
Unbiasedness: $\mathbb{E}[\nabla \mathcal{L}] = \nabla J$ (the stochastic gradient is unbiased).
Theorem 3: SGD for Convex Functions
For convex $J$ with $\eta_t = \frac{R}{\sigma\sqrt{t}}$:
$$\mathbb{E}[J(\boldsymbol{\theta}^{(T)})] - J(\boldsymbol{\theta}^*) \leq O\left(\frac{R\sigma}{\sqrt{T}}\right)$$
where $R = |\boldsymbol{\theta}^{(0)} - \boldsymbol{\theta}^*|$.
Theorem 4: SGD for Strongly Convex Functions
For $\mu$-strongly convex $J$ with $\eta_t = 1/(\mu t)$:
$$\mathbb{E}[J(\boldsymbol{\theta}^{(T)})] - J(\boldsymbol{\theta}^*) \leq O\left(\frac{L\sigma^2}{\mu^2 T}\right)$$
Key Observation: SGD achieves $O(1/T)$ for strongly convex functions—the same order as batch GD for merely convex functions. The gradient noise prevents the linear convergence we get with exact gradients.
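The effect of gradient noise is easy to see in simulation. The sketch below (our own toy setup: $J(x) = \mu x^2/2$ with $\mu = 1$, additive Gaussian noise, an arbitrary constant step of 0.05) compares a constant step size, which stalls at a noise floor, against Theorem 4's decaying step $\eta_t = 1/(\mu t)$, which keeps shrinking the error.

```python
import random

def sgd(decay, mu=1.0, sigma=1.0, T=20_000, seed=0, tail=2_000):
    """SGD on J(x) = mu*x^2/2 with additive N(0, sigma^2) gradient noise.

    Returns the average of x^2 over the final `tail` iterations."""
    rng = random.Random(seed)
    x, tail_sq = 5.0, 0.0
    for t in range(1, T + 1):
        eta = 1.0 / (mu * t) if decay else 0.05   # 0.05: an arbitrary fixed step
        g = mu * x + rng.gauss(0.0, sigma)        # unbiased noisy gradient
        x -= eta * g
        if t > T - tail:
            tail_sq += x * x
    return tail_sq / tail

stalled, converged = sgd(decay=False), sgd(decay=True)
print(stalled, converged)  # constant step stalls at a noise floor; decaying step keeps improving
```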
Theorem 5: Mini-batch SGD
With mini-batch size $B$, averaging $B$ independent per-example gradients reduces the variance to $\sigma^2/B$. Substituting $\sigma^2/B$ for $\sigma^2$ in Theorems 3 and 4 gives the mini-batch rates.
Larger batches reduce variance and improve convergence—but recall that we get fewer updates per epoch, so the benefit isn't free.
| Method | Rate | Iterations for ε error | Compute per iter | Total compute |
|---|---|---|---|---|
| Batch GD | $O((1-1/\kappa)^T)$ | $O(\kappa\log(1/\epsilon))$ | $O(N)$ | $O(N\kappa\log(1/\epsilon))$ |
| SGD | $O(1/T)$ | $O(\sigma^2/\epsilon)$ | $O(1)$ | $O(\sigma^2/\epsilon)$ |
| Mini-batch $B$ | $O(1/(BT))$ | $O(\sigma^2/(B\epsilon))$ | $O(B)$ | $O(\sigma^2/\epsilon)$ |
SGD wins in total compute when N (dataset size) is large and we don't need extremely high precision ε. For typical ML where N >> σ²/ε, SGD's O(σ²/ε) beats batch GD's O(Nκlog(1/ε)). This is why SGD dominates in practice.
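The $\sigma^2/B$ substitution rests on a basic statistical fact: the mean of $B$ independent samples has $1/B$ the variance of one sample. A quick Monte Carlo check (all names and constants here are illustrative):

```python
import random

def empirical_var(B, sigma=2.0, trials=100_000, seed=1):
    """Empirical variance of a mini-batch mean of B noisy 'gradients' (true mean 0)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        g = sum(rng.gauss(0.0, sigma) for _ in range(B)) / B   # mini-batch mean
        total += g * g
    return total / trials

v1, v16 = empirical_var(B=1), empirical_var(B=16)
print(v1 / v16)  # close to 16: variance scales as sigma^2 / B
```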
The condition number $\kappa = L/\mu$ appears throughout convergence analysis. Let's develop a deep understanding of its role.
Geometric Interpretation
For a quadratic loss $J(\boldsymbol{\theta}) = \frac{1}{2}\boldsymbol{\theta}^\top \mathbf{H} \boldsymbol{\theta}$:
Geometrically, the level sets of the loss are ellipsoids whose longest axis is $\sqrt{\kappa}$ times the shortest, so $\kappa$ measures how elongated the landscape is:
In elongated landscapes, gradient descent oscillates across the narrow direction while making slow progress along the long direction.
Why Condition Number Hurts Convergence
With learning rate $\eta = 1/L$, we optimize well for directions with curvature $\approx L$ but poorly for directions with curvature $\approx \mu$:
Progress in the 'slow' directions takes $\kappa$ times longer than in the 'fast' directions.
The Convergence Rate Revisited
$$\rho = 1 - \frac{1}{\kappa} = \frac{\kappa - 1}{\kappa}$$
| $\kappa$ | $\rho$ | Iterations to halve error |
|---|---|---|
| 2 | 0.5 | 1 |
| 10 | 0.9 | 7 |
| 100 | 0.99 | 69 |
| 1000 | 0.999 | 693 |
| 10000 | 0.9999 | 6931 |
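The table above follows directly from the contraction factor: halving the error takes $\ln 2 / (-\ln \rho) \approx 0.693\,\kappa$ iterations for large $\kappa$. A short sketch reproducing it:

```python
import math

def iters_to_halve(kappa):
    """Iterations for the error to halve under contraction rho = 1 - 1/kappa."""
    rho = 1 - 1 / kappa
    return round(math.log(2) / -math.log(rho))

for kappa in (2, 10, 100, 1000, 10000):
    print(kappa, iters_to_halve(kappa))
```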
For neural networks, effective $\kappa$ can be $10^4$ to $10^6$, explaining why training can be fragile and why adaptive methods help.
If training is slow but stable, you likely have a high condition number problem. Solutions: try Adam instead of SGD, add batch normalization, or increase learning rate (it may be too conservative for the slow directions). If training is unstable, reduce LR or add gradient clipping.
Neural network loss functions are non-convex, invalidating the guarantees from convex optimization. What can we say about convergence in this more challenging setting?
The Relaxed Goal
For non-convex functions, we can't guarantee finding the global minimum. Instead, we aim for stationary points where $\|\nabla J(\boldsymbol{\theta})\| = 0$. These include local minima, local maxima, and saddle points.
Theorem 6: Convergence to Stationary Points
For smooth (not necessarily convex) $J$ with $L$-Lipschitz gradient, gradient descent with $\eta \leq 1/L$ satisfies:
$$\min_{t=0,\ldots,T-1} \|\nabla J(\boldsymbol{\theta}^{(t)})\|^2 \leq \frac{2[J(\boldsymbol{\theta}^{(0)}) - J(\boldsymbol{\theta}^*)]}{\eta T}$$
This says the minimum squared gradient norm over all iterations decreases as $O(1/T)$. To find a point with $\|\nabla J\| \leq \epsilon$, we need $O(1/\epsilon^2)$ iterations.
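Theorem 6 can be checked on a genuinely non-convex example. The sketch below (our own toy: a double well $J(x) = x^4/4 - x^2/2$ with minima at $x = \pm 1$ and $J^* = -1/4$) tracks the minimum squared gradient norm and compares it to the bound.

```python
# Checking Theorem 6 on the non-convex double well J(x) = x^4/4 - x^2/2,
# whose gradient is x^3 - x; global minima at x = +/-1 with J* = -1/4.

def check_stationarity_bound(x0=2.0, eta=0.05, T=1000):
    J = lambda x: x**4 / 4 - x**2 / 2
    grad = lambda x: x**3 - x
    x, min_sq = x0, float("inf")
    for _ in range(T):
        g = grad(x)
        min_sq = min(min_sq, g * g)
        x -= eta * g
    # Theorem 6: min_t ||grad||^2 <= 2 * (J(x0) - J*) / (eta * T)
    bound = 2 * (J(x0) - (-0.25)) / (eta * T)
    return min_sq, bound

min_sq, bound = check_stationarity_bound()
print(min_sq, bound)
```

Note that the guarantee only concerns the gradient norm: GD ends up at the nearby stationary point $x = 1$, with no claim about global optimality.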
SGD for Non-Convex Functions
SGD converges to stationary points at rate:
$$\mathbb{E}\left[\min_{t} \|\nabla J(\boldsymbol{\theta}^{(t)})\|^2\right] \leq O\left(\frac{1}{\sqrt{T}}\right)$$
This is slower than batch GD's $O(1/T)$, but SGD does far more updates per unit compute, often making it faster in wall-clock time.
The Blessing of Non-Convexity
Surprisingly, non-convexity isn't always bad:
Saddle points are escapable: In high dimensions, saddle points typically have directions of negative curvature. SGD's noise helps escape along these directions.
Local minima may be good enough: Empirically, different local minima of neural networks often have similar test performance. Finding ANY good minimum suffices.
Flatness correlates with generalization: Wide, flat minima (easier to find) often generalize better than sharp, narrow ones.
In high-dimensional spaces, strict local minima are rare—most critical points are saddle points. At a saddle point, a random direction has a good chance of pointing along negative curvature, where the loss decreases. SGD's noise naturally samples such directions, helping escape saddles.
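This escape mechanism is easy to demonstrate on the simplest saddle, $J(x, y) = (x^2 - y^2)/2$ (our own toy example). At the saddle the exact gradient is zero, so plain GD never moves; tiny gradient noise lets SGD slide down the negative-curvature direction.

```python
import random

def descend(noise, steps=500, eta=0.1, seed=0):
    rng = random.Random(seed)
    x, y = 0.0, 0.0                       # start exactly at the saddle
    for _ in range(steps):
        gx, gy = x, -y                    # gradient of (x^2 - y^2) / 2
        if noise:
            gx += rng.gauss(0.0, 1e-3)    # tiny stochastic-gradient noise
            gy += rng.gauss(0.0, 1e-3)
        x, y = x - eta * gx, y - eta * gy
    return abs(y)

print(descend(noise=False), descend(noise=True))  # 0.0 vs. a large |y|
```

Along $y$ the update multiplies the coordinate by $1 + \eta$ each step, so any injected noise is amplified exponentially; without noise the iterate is a fixed point.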
Practical Implications for Neural Networks
We can't guarantee finding global optima: But we don't need to. Finding good local optima is sufficient.
Multiple good solutions exist: Different runs with different initializations find different minima, often with similar performance.
Optimization and generalization differ: A lower training loss doesn't always mean better test performance. The path taken by optimization matters.
Hyperparameters affect the basin reached: Learning rate, batch size, and schedule don't just affect speed—they affect WHICH minimum we find.
The fundamental limitation of SGD is gradient variance, which prevents linear convergence on strongly convex problems. Variance reduction methods aim to get the best of both worlds: the cheap iterations of SGD with the fast convergence of batch GD.
The Key Insight
The gradient variance doesn't need to be $\sigma^2$ at all times. If we can periodically reduce the noise—especially as we approach the optimum—we can potentially recover linear convergence.
SVRG (Stochastic Variance Reduced Gradient)
SVRG keeps a snapshot $\tilde{\boldsymbol{\theta}}$ together with its full gradient $\nabla J(\tilde{\boldsymbol{\theta}})$, and for a randomly sampled example $i$ uses the estimator $\mathbf{g}^{(t)} = \nabla \mathcal{L}_i(\boldsymbol{\theta}^{(t)}) - \nabla \mathcal{L}_i(\tilde{\boldsymbol{\theta}}) + \nabla J(\tilde{\boldsymbol{\theta}})$. The subtraction cancels much of the variance, making $\mathbf{g}^{(t)}$ a lower-variance unbiased estimator of $\nabla J(\boldsymbol{\theta}^{(t)})$.
Why SVRG Works
The variance-reduced gradient has variance: $$\text{Var}[\mathbf{g}^{(t)}] \approx \text{Var}[\nabla \mathcal{L}(\boldsymbol{\theta}^{(t)}) - \nabla \mathcal{L}(\tilde{\boldsymbol{\theta}})]$$
When $\boldsymbol{\theta}^{(t)} \approx \tilde{\boldsymbol{\theta}}$, this variance is small because the two stochastic gradients are correlated—their randomness partially cancels.
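A minimal SVRG sketch on 1-D least squares (the function name, step size, and inner-loop length are illustrative choices, not prescribed by the algorithm):

```python
import random

def svrg(data, x0=0.0, eta=0.02, epochs=30, m=100, seed=0):
    """SVRG on 1-D least squares J(x) = (1/N) * sum((a*x - b)^2 / 2)."""
    rng = random.Random(seed)
    N = len(data)
    full_grad = lambda x: sum(a * (a * x - b) for a, b in data) / N
    x = x0
    for _ in range(epochs):
        snap, snap_grad = x, full_grad(x)   # snapshot and its exact full gradient
        for _ in range(m):
            a, b = data[rng.randrange(N)]
            # stochastic grad at x, minus stochastic grad at the snapshot,
            # plus the snapshot's full gradient: unbiased, low variance near snap
            g = a * (a * x - b) - a * (a * snap - b) + snap_grad
            x -= eta * g
    return x

data = [(1.0, 2.0), (2.0, 2.0), (0.5, 1.0)]
x_hat = svrg(data)
x_star = sum(a * b for a, b in data) / sum(a * a for a, _ in data)
print(x_hat, x_star)  # the SVRG iterate lands very close to the closed-form minimizer
```

Each epoch pays one full-gradient pass ($O(N)$) plus $m$ cheap inner steps, which is where the $O((N + \kappa)\log(1/\epsilon))$ complexity comes from.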
Theorem 7: SVRG Convergence
For strongly convex $J$, SVRG with proper parameters achieves: $$\mathbb{E}[J(\boldsymbol{\theta}^{(T)}) - J(\boldsymbol{\theta}^*)] \leq \rho^T [J(\boldsymbol{\theta}^{(0)}) - J(\boldsymbol{\theta}^*)]$$
where $\rho < 1$. This is linear convergence—the same rate as batch GD—but with SGD-like $O(1)$ per-iteration cost!
| Method | Convergence Rate | Total Complexity to ε-accuracy |
|---|---|---|
| Batch GD | Linear $O((1-1/\kappa)^T)$ | $O(N\kappa \log(1/\epsilon))$ |
| SGD | Sublinear $O(1/T)$ | $O(\sigma^2/\epsilon)$ |
| SVRG | Linear $O(\rho^T)$ | $O((N + \kappa) \log(1/\epsilon))$ |
| SAGA | Linear $O(\rho^T)$ | $O((N + \kappa) \log(1/\epsilon))$ |
Variance reduction methods shine for convex problems where high precision is needed. They're less common in deep learning because: (1) the non-convexity makes theoretical guarantees weaker, (2) implicit regularization from SGD noise may be beneficial, (3) storing full gradients for SAGA requires O(Nd) memory.
Are the convergence rates we've established the best possible, or could clever algorithms do better? Lower bound results answer this fundamental question.
Theorem 8: Lower Bound for First-Order Methods (Convex)
No first-order method (using only gradient information) can guarantee better than $O(1/T^2)$ convergence for smooth convex functions in the worst case.
Gradient descent achieves $O(1/T)$, so there's a gap. Accelerated gradient descent (Nesterov, 1983) achieves $O(1/T^2)$, matching the lower bound.
Theorem 9: Lower Bound for Strongly Convex Functions
No first-order method can converge faster than: $$\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^T$$
Gradient descent achieves $(1 - 1/\kappa)^T \approx e^{-T/\kappa}$. Optimal methods achieve $(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1})^T \approx e^{-2T/\sqrt{\kappa}}$.
The gap is $\sqrt{\kappa}$ vs $\kappa$—potentially huge for ill-conditioned problems!
Momentum-based methods like Nesterov's accelerated gradient can close this gap. Instead of κ iterations per order-of-magnitude improvement, accelerated methods need only √κ iterations. For κ = 10000, this is 100× faster!
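The $\sqrt{\kappa}$ vs $\kappa$ gap shows up clearly in simulation. The sketch below (our own toy: the quadratic $J(x, y) = (Lx^2 + \mu y^2)/2$ with $\kappa = 1000$) compares plain GD with a standard Nesterov-style update using the strongly convex momentum coefficient $\beta = (\sqrt{L} - \sqrt{\mu})/(\sqrt{L} + \sqrt{\mu})$.

```python
# Plain GD vs. Nesterov's accelerated gradient on an ill-conditioned quadratic.

def grad(p, L, mu):
    return (L * p[0], mu * p[1])   # gradient of (L*x^2 + mu*y^2) / 2

def run(accelerated, L=1.0, mu=0.001, T=300):
    eta = 1.0 / L
    # momentum coefficient for the strongly convex setting; 0 recovers plain GD
    beta = ((L ** 0.5 - mu ** 0.5) / (L ** 0.5 + mu ** 0.5)) if accelerated else 0.0
    x = prev = (1.0, 1.0)
    for _ in range(T):
        # look-ahead point y = x + beta * (x - x_prev), then a gradient step at y
        y = (x[0] + beta * (x[0] - prev[0]), x[1] + beta * (x[1] - prev[1]))
        g = grad(y, L, mu)
        prev, x = x, (y[0] - eta * g[0], y[1] - eta * g[1])
    return x[0] ** 2 + x[1] ** 2   # squared distance to the optimum (origin)

plain, nesterov = run(False), run(True)
print(plain, nesterov)  # nesterov is orders of magnitude closer to the optimum
```

After the same 300 iterations, plain GD has barely moved along the flat direction while the accelerated iterate is essentially converged, matching the $e^{-T/\kappa}$ vs $e^{-2T/\sqrt{\kappa}}$ comparison above.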
Lower Bounds for Stochastic Methods
For SGD with bounded variance $\sigma^2$:
$$\mathbb{E}[J(\boldsymbol{\theta}^{(T)}) - J(\boldsymbol{\theta}^*)] \geq \Omega\left(\frac{\sigma}{\sqrt{T}}\right) \text{ for convex}$$
$$\mathbb{E}[J(\boldsymbol{\theta}^{(T)}) - J(\boldsymbol{\theta}^*)] \geq \Omega\left(\frac{\sigma^2}{\mu T}\right) \text{ for strongly convex}$$
SGD matches these lower bounds (up to constants), making it optimal in this setting!
Practical Takeaways from Lower Bounds
SGD is optimal for what it does: You can't improve on SGD's rate without using structure SGD ignores (variance reduction, second-order info).
Momentum actually helps: The gap between GD and optimal accelerated methods is real. Momentum-based optimizers achieve it.
Second-order methods can beat first-order bounds: Newton's method converges quadratically, ignoring the $\kappa$ dependence entirely. The catch: computing the Hessian is expensive.
Batch size affects achievable rates: Larger batches reduce $\sigma^2/B$, directly improving SGD convergence bounds.
Convergence theory for convex functions, while elegant, doesn't directly apply to neural networks which are highly non-convex. Let's bridge the gap between theory and practice.
Why Theory Still Matters for Neural Networks
Local convexity: Near optima, loss functions are often approximately convex. Convergence theory applies locally.
Qualitative insights transfer: The importance of condition number, learning rate bounds, and variance effects carry over even when precise theorems don't.
Debugging guidance: When training fails, convex intuition often identifies the problem (LR too high, gradients too noisy, etc.).
Component analysis: Individual operations (linear layers, convolutions) are often convex in parameters—helps understand layer-wise behavior.
Practical Guidelines Derived from Theory
Set learning rate based on stability, not convergence speed
Expect slower convergence for larger models
Use batch size to control noise
Apply schedules for fine convergence
Use theory for intuition and debugging, not for direct parameter calculation. Neural networks violate theoretical assumptions, but the qualitative lessons remain valuable. When something goes wrong, check against theoretical predictions to identify likely causes.
We've completed a rigorous tour of convergence analysis for gradient descent methods, consolidating the key mathematical and practical insights along the way.
Module Complete
You've now completed the Gradient Descent Variants module. You understand formal convergence definitions and rates, how smoothness and strong convexity determine those rates, how the condition number governs convergence speed, the batch vs stochastic trade-off, variance reduction methods, and what lower bounds say is achievable.
This foundation prepares you for adaptive optimizers (Adam, RMSprop, etc.) covered in the next module, which build on these ideas to handle ill-conditioned problems more effectively.
You now have both the practical skills to tune gradient descent methods and the theoretical understanding to know WHY your choices work. This combination—practical effectiveness backed by principled understanding—is what separates excellent ML engineers from those who merely follow recipes.