When you launch a training run, a natural question arises: How long will this take? Will the model converge in 100 iterations or 100,000? Is the algorithm making adequate progress, or is it stuck? Answering these questions requires understanding convergence theory—the mathematical framework that characterizes when, whether, and how fast gradient descent reaches a solution.
Convergence analysis isn't just academic rigor. It provides practical guidance: which algorithms to choose for different problems, how to set stopping criteria, and when to expect diminishing returns from additional training. This page develops the theory you need to reason about optimization dynamics.
By the end of this page, you will understand convergence rates (sublinear, linear, superlinear), prove convergence for convex and strongly convex functions, appreciate the role of condition numbers, and connect theory to practical ML scenarios.
Before diving into convergence rates, let's formalize the key properties of objective functions that determine convergence behavior.
Smoothness (L-Lipschitz Gradient)
A function $J$ is L-smooth if its gradient is L-Lipschitz continuous: $$|\nabla J(\mathbf{x}) - \nabla J(\mathbf{y})| \leq L |\mathbf{x} - \mathbf{y}| \quad \forall \mathbf{x}, \mathbf{y}$$
Equivalently, for twice-differentiable functions: $\lambda_{\max}(\mathbf{H}) \leq L$ (the Hessian's largest eigenvalue is bounded).
Intuition: The gradient can't change too abruptly. The surface has bounded curvature. Smaller L means flatter surface, allowing larger steps.
Convexity
A function $J$ is convex if: $$J(\lambda \mathbf{x} + (1-\lambda) \mathbf{y}) \leq \lambda J(\mathbf{x}) + (1-\lambda) J(\mathbf{y}) \quad \forall \lambda \in [0,1]$$
Equivalently: any chord lies above the function. The Hessian is positive semi-definite: $\mathbf{H} \succeq 0$.
Intuition: Bowl-shaped. Any local minimum is the global minimum. No saddle points or local maxima (except at $\infty$).
Strong Convexity (μ-strongly convex)
A function $J$ is μ-strongly convex (μ > 0) if: $$J(\mathbf{y}) \geq J(\mathbf{x}) + \nabla J(\mathbf{x})^T (\mathbf{y} - \mathbf{x}) + \frac{\mu}{2} |\mathbf{y} - \mathbf{x}|^2$$
Equivalently: $\lambda_{\min}(\mathbf{H}) \geq \mu$ (the Hessian's smallest eigenvalue is bounded away from zero).
Intuition: The bowl has positive curvature in every direction. The function curves upward at least quadratically. Strong convexity guarantees a unique minimum and faster convergence.
The Condition Number
For L-smooth, μ-strongly convex functions, the condition number is: $$\kappa = \frac{L}{\mu}$$
This ratio determines convergence speed:
| Function Class | Smoothness | Convexity | Convergence Rate |
|---|---|---|---|
| General smooth | L-smooth | None assumed | O(1/√t) to stationary point |
| Convex smooth | L-smooth | Convex | O(1/t) to global minimum |
| Strongly convex smooth | L-smooth | μ-strongly convex | O((1-μ/L)^t) linear rate |
| Non-smooth convex | Non-smooth | Convex | O(1/√t) with subgradients |
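For a quadratic, these constants can be read directly off the Hessian. A minimal sketch (the matrix `A` is an arbitrary illustrative choice, not from the text): L and μ are the extreme Hessian eigenvalues, and κ is their ratio.

```python
import numpy as np

# Illustrative quadratic J(theta) = 0.5 * theta^T A theta; A is an assumed example.
A = np.array([[10.0, 0.0],
              [0.0, 1.0]])

eigvals = np.linalg.eigvalsh(A)  # the Hessian of J is exactly A
mu = eigvals.min()               # strong convexity constant: smallest eigenvalue
L = eigvals.max()                # smoothness constant: largest eigenvalue
kappa = L / mu                   # condition number

print(f"mu = {mu:.1f}, L = {L:.1f}, kappa = {kappa:.1f}")
```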
Let's prove the convergence rate for gradient descent on convex, L-smooth functions.
Theorem (Sublinear Convergence for Convex Functions)
Let $J$ be convex and L-smooth. Gradient descent with step size $\alpha = 1/L$ satisfies: $$J(\boldsymbol{\theta}^{(t)}) - J^* \leq \frac{L |\boldsymbol{\theta}^{(0)} - \boldsymbol{\theta}^*|^2}{2t}$$
where $J^* = J(\boldsymbol{\theta}^*)$ is the minimum value.
Interpretation: The error (suboptimality) decreases as $O(1/t)$. This is sublinear convergence—to halve the error, you roughly double the iterations. After 1000 iterations, error is about 1000× smaller than initially.
To achieve ε-accuracy (J - J* ≤ ε), you need O(1/ε) iterations. For ε = 0.001, expect ~1000 iterations. This is relatively slow—each decimal of precision costs 10× more iterations.
Proof Sketch:
Step 1: Sufficient Decrease Lemma
For L-smooth functions with $\alpha = 1/L$: $$J(\boldsymbol{\theta}^{(t+1)}) \leq J(\boldsymbol{\theta}^{(t)}) - \frac{1}{2L} |\nabla J(\boldsymbol{\theta}^{(t)})|^2$$
Step 2: Convexity Bound
For convex functions: $$J(\boldsymbol{\theta}^{(t)}) - J^* \leq \nabla J(\boldsymbol{\theta}^{(t)})^T (\boldsymbol{\theta}^{(t)} - \boldsymbol{\theta}^*)$$
Step 3: Combine and Telescope
Using the update rule and summing over iterations: $$\sum_{k=0}^{t-1} (J(\boldsymbol{\theta}^{(k)}) - J^*) \leq \frac{L}{2} |\boldsymbol{\theta}^{(0)} - \boldsymbol{\theta}^*|^2$$
Since $J(\boldsymbol{\theta}^{(k)})$ is non-increasing, the left side is at least $t(J(\boldsymbol{\theta}^{(t)}) - J^*)$, giving: $$J(\boldsymbol{\theta}^{(t)}) - J^* \leq \frac{L |\boldsymbol{\theta}^{(0)} - \boldsymbol{\theta}^*|^2}{2t}$$
∎
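The bound can be seen in action by running gradient descent on a toy convex, L-smooth quadratic (an assumed example, not from the text) and checking that the suboptimality never exceeds $L |\boldsymbol{\theta}^{(0)} - \boldsymbol{\theta}^*|^2 / (2t)$:

```python
import numpy as np

# Toy convex, L-smooth quadratic (assumed example): J(theta) = 0.5 * theta^T A theta,
# minimized at theta* = 0 with J* = 0. Largest Hessian eigenvalue gives L = 4.
A = np.diag([4.0, 1.0])
L = 4.0
J = lambda th: 0.5 * th @ A @ th

theta0 = np.array([3.0, -2.0])
theta = theta0.copy()
alpha = 1.0 / L

for t in range(1, 101):
    theta = theta - alpha * (A @ theta)      # gradient descent step
    bound = L * np.sum(theta0**2) / (2 * t)  # theorem's O(1/t) envelope
    assert J(theta) <= bound + 1e-12         # suboptimality stays below it

print("final suboptimality:", J(theta))
```

On this well-behaved quadratic the actual error decays much faster than the worst-case envelope; the theorem only promises the envelope.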
Strong convexity dramatically accelerates convergence from sublinear to linear (also called exponential or geometric).
Theorem (Linear Convergence for Strongly Convex Functions)
Let $J$ be μ-strongly convex and L-smooth. Gradient descent with $\alpha = 1/L$ satisfies: $$|\boldsymbol{\theta}^{(t)} - \boldsymbol{\theta}^*|^2 \leq \left(1 - \frac{\mu}{L}\right)^t |\boldsymbol{\theta}^{(0)} - \boldsymbol{\theta}^*|^2$$
Alternatively, in terms of function value: $$J(\boldsymbol{\theta}^{(t)}) - J^* \leq \left(1 - \frac{\mu}{L}\right)^t (J(\boldsymbol{\theta}^{(0)}) - J^*)$$
Interpretation: The error decreases by a constant factor $(1 - 1/\kappa)$ each iteration. This is linear convergence. Now, to halve the error requires only $O(\kappa \log 2)$ iterations, independent of current accuracy.
| Condition κ | Contraction Factor | Iterations to halve error | Iterations for 10^-6 error |
|---|---|---|---|
| 1 | 0 | 1 | 1 |
| 2 | 0.5 | 1 | ~20 |
| 10 | 0.9 | ~7 | ~140 |
| 100 | 0.99 | ~70 | ~1,400 |
| 1,000 | 0.999 | ~700 | ~14,000 |
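The contraction factor can be checked numerically. The sketch below (a toy quadratic with κ = 10, an illustrative assumption) verifies that the squared distance to the minimizer stays under the theorem's $(1 - \mu/L)^t$ envelope:

```python
import numpy as np

# Toy strongly convex quadratic with kappa = 10 (assumed example):
# J(theta) = 0.5 * theta^T A theta, minimizer theta* = 0, mu = 1, L = 10.
A = np.diag([10.0, 1.0])
mu, L = 1.0, 10.0
rho = 1 - mu / L             # per-iteration contraction factor, here 0.9

theta = np.array([1.0, 1.0])
dist0_sq = np.sum(theta**2)  # ||theta^(0) - theta*||^2
alpha = 1.0 / L

for t in range(1, 51):
    theta = theta - alpha * (A @ theta)
    # squared distance stays under the linear-rate envelope rho^t
    assert np.sum(theta**2) <= rho**t * dist0_sq + 1e-12

print("distance after 50 steps:", np.sqrt(np.sum(theta**2)))
```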
Why strong convexity helps:

Strong convexity guarantees a unique global minimum, a quadratic lower bound that rules out long flat plateaus, and a gradient whose squared norm grows with suboptimality (the PL inequality below). Each of these lets gradient descent make progress proportional to its current error, which is exactly what produces a constant contraction factor.

The condition number κ = L/μ determines the rate: the error contracts by $(1 - 1/\kappa)$ per iteration, so well-conditioned problems (κ near 1) converge in a handful of steps, while ill-conditioned ones need on the order of κ iterations per constant-factor reduction.
Proof Sketch:
The key insight is the Polyak-Łojasiewicz (PL) inequality for strongly convex functions: $$|\nabla J(\boldsymbol{\theta})|^2 \geq 2\mu (J(\boldsymbol{\theta}) - J^*)$$
Combined with the sufficient decrease lemma: $$J(\boldsymbol{\theta}^{(t+1)}) - J^* \leq (1 - \mu/L)(J(\boldsymbol{\theta}^{(t)}) - J^*)$$
Induction completes the proof.
For 10^-6 accuracy with initial error 1: Sublinear O(1/t) needs ~1,000,000 iterations. Linear rate with κ=100 needs ~1,400 iterations. That's 700× faster! This is why adding regularization (which induces strong convexity) can dramatically speed up training.
Most neural network loss functions are non-convex. What can we say about convergence in this case?
For non-convex functions, we can't guarantee convergence to a global minimum—there may be many local minima and saddle points. Instead, we analyze convergence to a stationary point (where ∇J = 0).
Theorem (Convergence to Stationary Point)
Let $J$ be L-smooth (not necessarily convex). Gradient descent with $\alpha = 1/L$ satisfies: $$\min_{k=0,\ldots,t-1} |\nabla J(\boldsymbol{\theta}^{(k)})|^2 \leq \frac{2L(J(\boldsymbol{\theta}^{(0)}) - J_{\inf})}{t}$$
where $J_{\inf}$ is the infimum of $J$.
Interpretation: The minimum gradient norm across all iterations decreases as $O(1/t)$. After $t$ iterations, at least one iterate had gradient norm at most $O(1/\sqrt{t})$.
To find ε-stationary point (||∇J|| ≤ ε):
$$t = O\left(\frac{L(J(\boldsymbol{\theta}^{(0)}) - J_{\inf})}{\epsilon^2}\right)$$
This is $O(1/\epsilon^2)$ complexity—slower than convex optimization.
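Here is a numerical check of the non-convex guarantee on a one-dimensional double well (an assumed toy example): gradient descent only converges to a stationary point, but the best squared gradient norm seen so far stays under the $2L(J(\theta^{(0)}) - J_{\inf})/t$ bound.

```python
import numpy as np

# Assumed 1-D non-convex example: double well J(x) = x^4/4 - x^2/2,
# stationary points at x = -1, 0, 1, infimum J_inf = -1/4.
J = lambda x: x**4 / 4 - x**2 / 2
grad = lambda x: x**3 - x

L = 6.0              # bound on |J''(x)| = |3x^2 - 1| over the trajectory [1, 1.5]
alpha = 1.0 / L
J_inf = -0.25
x0 = 1.5

x = x0
min_grad_sq = float("inf")
for t in range(1, 201):
    min_grad_sq = min(min_grad_sq, grad(x)**2)
    # theorem: best squared gradient norm over the first t iterates is O(1/t)
    assert min_grad_sq <= 2 * L * (J(x0) - J_inf) / t
    x = x - alpha * grad(x)

print("converged near stationary point x =", x)
```

From this start the iterates settle at the local minimum x = 1; a different initialization could land at a different stationary point, which is exactly the limitation the theorem accepts.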
Types of stationary points: local minima (Hessian positive semi-definite), local maxima (Hessian negative semi-definite), and saddle points (Hessian has both positive and negative eigenvalues). The guarantee above doesn't distinguish among these, though in practice gradient descent rarely terminates at maxima or strict saddles.
Modern insight on non-convex landscapes:
While non-convex optimization is theoretically NP-hard in general, neural network loss landscapes have special structure. Empirically, gradient descent with proper initialization finds good solutions. Theory is catching up to explain why, but practice often works better than worst-case theory suggests.
Let's synthesize the convergence rates across different scenarios:
| Setting | Rate Type | Error after t iters | Iters for ε-accuracy | Example |
|---|---|---|---|---|
| Convex + L-smooth | Sublinear O(1/t) | O(LD²/t), D = initial distance to θ* | O(LD²/ε) | Linear regression |
| μ-strongly convex + L-smooth | Linear O(ρ^t) | (1-μ/L)^t | O(κ·log(1/ε)) | Ridge regression |
| Non-convex + L-smooth | Sublinear O(1/√t) | O(1/√t) gradient | O(1/ε²) | Neural networks |
| Strongly convex + Nesterov | Linear O(ρ^t) | (1-1/√κ)^t | O(√κ·log(1/ε)) | Accelerated GD |
Key insights:
Strong convexity gives logarithmic dependence on accuracy: O(log(1/ε)) vs O(1/ε). Huge practical difference.
Condition number κ controls everything: Factor of κ in strongly convex rate, √κ in accelerated methods.
Non-convex is fundamentally harder: We settle for stationary points, not global optima.
Acceleration (Nesterov momentum): Improves condition number dependence from O(κ) to O(√κ)—significant for ill-conditioned problems.
```python
import numpy as np
import matplotlib.pyplot as plt

def plot_convergence_rates():
    """Visualize different convergence rate behaviors"""
    t = np.arange(1, 201)

    # Initial error
    E0 = 1.0

    # Sublinear: O(1/t)
    sublinear = E0 / t

    # Linear: O(ρ^t) with various condition numbers
    kappa_10 = E0 * (1 - 1/10)**t      # κ = 10
    kappa_100 = E0 * (1 - 1/100)**t    # κ = 100
    kappa_1000 = E0 * (1 - 1/1000)**t  # κ = 1000

    # Accelerated linear: O((1 - 1/√κ)^t)
    accel_100 = E0 * (1 - 1/np.sqrt(100))**t  # Accelerated, κ = 100

    plt.figure(figsize=(12, 5))

    # Linear scale
    plt.subplot(1, 2, 1)
    plt.plot(t, sublinear, 'b-', linewidth=2, label='Sublinear O(1/t)')
    plt.plot(t, kappa_10, 'g-', linewidth=2, label='Linear κ=10')
    plt.plot(t, kappa_100, 'orange', linewidth=2, label='Linear κ=100')
    plt.plot(t, kappa_1000, 'r-', linewidth=2, label='Linear κ=1000')
    plt.plot(t, accel_100, 'purple', linewidth=2, linestyle='--',
             label='Accelerated κ=100')
    plt.xlabel('Iteration t')
    plt.ylabel('Error')
    plt.title('Convergence Rates (Linear Scale)')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.ylim(0, 1)

    # Log scale
    plt.subplot(1, 2, 2)
    plt.semilogy(t, sublinear, 'b-', linewidth=2, label='Sublinear O(1/t)')
    plt.semilogy(t, kappa_10, 'g-', linewidth=2, label='Linear κ=10')
    plt.semilogy(t, kappa_100, 'orange', linewidth=2, label='Linear κ=100')
    plt.semilogy(t, kappa_1000, 'r-', linewidth=2, label='Linear κ=1000')
    plt.semilogy(t, accel_100, 'purple', linewidth=2, linestyle='--',
                 label='Accelerated κ=100')
    plt.xlabel('Iteration t')
    plt.ylabel('Error (log scale)')
    plt.title('Convergence Rates (Log Scale)')
    plt.legend()
    plt.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.savefig('convergence_rates.png', dpi=150)
    plt.show()

    # Print iterations needed for ε = 1e-6
    print("Iterations for ε = 1e-6 accuracy:")
    print(f"  Sublinear O(1/t):  ~{int(1/1e-6):,}")
    print(f"  Linear κ=10:       ~{int(np.log(1e-6) / np.log(0.9)):,}")
    print(f"  Linear κ=100:      ~{int(np.log(1e-6) / np.log(0.99)):,}")
    print(f"  Linear κ=1000:     ~{int(np.log(1e-6) / np.log(0.999)):,}")
    print(f"  Accelerated κ=100: ~{int(np.log(1e-6) / np.log(0.9)):,}")

plot_convergence_rates()
```

A natural question: can we do better than these rates with a cleverer algorithm? Information-theoretic lower bounds tell us the fundamental limits.
Theorem (Lower Bounds for First-Order Methods)
For the class of first-order methods (using only function values and gradients):
| Setting | Upper Bound (GD) | Lower Bound | Optimal Method |
|---|---|---|---|
| Convex, L-smooth | O(1/t) | Ω(1/t²) | Nesterov acceleration |
| μ-strongly convex | O((1-μ/L)^t) | Ω((1-√(μ/L))^t) | Nesterov acceleration |
Key insight: Gradient descent is not optimal for convex optimization!
Nesterov's accelerated gradient method achieves optimal rates by using momentum—a "look-ahead" step that leverages past gradients. This reduces the iteration complexity from O(κ) to O(√κ) for strongly convex problems. We'll cover momentum in detail on Page 5.
Why does this matter practically?
For a problem with κ = 10,000: plain gradient descent needs on the order of κ = 10,000 iterations per factor of log(1/ε), while the accelerated method needs only √κ = 100.
That's 100× fewer iterations! For large-scale problems, this translates to days vs weeks of training.
Caveat: Acceleration helps most for smooth, well-conditioned problems. For noisy gradients (SGD) and non-convex landscapes (deep learning), the benefits are less clear-cut.
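As a rough illustration of the κ-vs-√κ gap, the sketch below compares plain gradient descent with a standard constant-momentum form of Nesterov's method on an ill-conditioned quadratic. The setup (κ = 1000, momentum β = (√κ − 1)/(√κ + 1)) is an illustrative assumption, not a prescription:

```python
import numpy as np

# Assumed setup: quadratic J(theta) = 0.5 * theta^T A theta with kappa = 1000.
A = np.diag([1000.0, 1.0])
L, mu = 1000.0, 1.0
kappa = L / mu
grad = lambda th: A @ th

def iters_to_tol(step, theta0, tol=1e-6, max_iters=200000):
    """Count iterations until ||theta - theta*|| < tol (theta* = 0 here)."""
    theta, prev = theta0.copy(), None
    for t in range(1, max_iters + 1):
        theta, prev = step(theta, prev)
        if np.linalg.norm(theta) < tol:
            return t
    return None

def gd_step(theta, _):
    return theta - grad(theta) / L, None

beta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)  # momentum, strongly convex case

def nesterov_step(theta, prev):
    prev = theta if prev is None else prev
    y = theta + beta * (theta - prev)               # look-ahead point
    return y - grad(y) / L, theta

theta0 = np.ones(2)
gd_iters = iters_to_tol(gd_step, theta0)
nag_iters = iters_to_tol(nesterov_step, theta0)
print("GD iterations:      ", gd_iters)
print("Nesterov iterations:", nag_iters)
```

On this problem plain GD needs on the order of κ·log(1/ε) iterations while the accelerated variant needs on the order of √κ·log(1/ε), matching the lower-bound table above.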
Convergence theory isn't just for theoreticians. Here's how to apply these insights in practice:
1. Setting Stopping Criteria
Knowing convergence rates helps set realistic stopping criteria:

- Stop when the gradient norm drops below a threshold (an approximate stationary point).
- Stop when the per-iteration decrease in loss becomes negligible relative to its current value (diminishing returns).
- Budget a maximum iteration count from the expected rate: roughly O(1/ε) iterations for convex problems, O(κ·log(1/ε)) for strongly convex ones.
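A minimal sketch of such a stopping rule; the thresholds `grad_tol` and `rel_tol` are hypothetical values chosen for illustration, not recommended defaults:

```python
import numpy as np

def gradient_descent(grad, J, theta0, alpha,
                     grad_tol=1e-6, rel_tol=1e-9, max_iters=100000):
    """GD with a gradient-norm test and a relative-decrease test (sketch)."""
    theta, prev_J = theta0.copy(), float("inf")
    for t in range(max_iters):
        g = grad(theta)
        if np.linalg.norm(g) < grad_tol:          # near-stationary: stop
            return theta, t, "gradient norm"
        cur_J = J(theta)
        if prev_J - cur_J < rel_tol * max(abs(prev_J), 1.0):
            return theta, t, "relative decrease"  # diminishing returns
        prev_J = cur_J
        theta = theta - alpha * g
    return theta, max_iters, "max iterations"

# Usage on a toy quadratic with L = 4, alpha = 1/L:
A = np.diag([4.0, 1.0])
theta, iters, reason = gradient_descent(
    lambda th: A @ th, lambda th: 0.5 * th @ A @ th,
    np.array([1.0, 1.0]), alpha=0.25)
print(f"stopped after {iters} iterations ({reason})")
```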
2. Regularization for Faster Convergence
Adding L2 regularization to the loss: $$J_{reg}(\boldsymbol{\theta}) = J(\boldsymbol{\theta}) + \frac{\lambda}{2} |\boldsymbol{\theta}|^2$$
makes the problem μ-strongly convex with μ ≥ λ. This upgrades convergence from sublinear O(1/t) to linear O((1 − μ/L)^t), guarantees a unique minimizer, and caps the condition number of the regularized problem at (L + λ)/λ.
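A quick numerical illustration (toy quadratic; the eigenvalues and λ are assumed values): adding λ/2·‖θ‖² shifts every Hessian eigenvalue up by λ, which shrinks κ and cuts the iteration count.

```python
import numpy as np

# Toy illustration (assumed eigenvalues and lambda).
A = np.diag([100.0, 0.01])   # ill-conditioned Hessian, kappa = 10,000
lam = 1.0
H_reg = A + lam * np.eye(2)  # Hessian of the L2-regularized objective

def iters_to_converge(H, tol=1e-6, max_iters=1_000_000):
    """GD with alpha = 1/L on J = 0.5 * theta^T H theta; count iterations to tol."""
    L = np.linalg.eigvalsh(H).max()
    theta = np.ones(2)
    for t in range(1, max_iters + 1):
        theta = theta - (H @ theta) / L
        if np.linalg.norm(theta) < tol:
            return t
    return None

t_unreg = iters_to_converge(A)
t_reg = iters_to_converge(H_reg)
print("unregularized:", t_unreg, "iterations")
print("regularized:  ", t_reg, "iterations")
```

In a real model the regularizer also shifts the minimizer, so λ trades optimization speed against bias; in this toy the minimizer is the origin in both cases.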
3. Diagnosing Training Dynamics
| Observation | Likely Cause | Solution |
|---|---|---|
| Loss decreases very slowly | Learning rate too small or κ very large | Increase LR, add regularization, use adaptive optimizer |
| Loss oscillates | Learning rate too large | Reduce LR |
| Loss plateaus early | Stuck at saddle point or local minimum | Add momentum, use SGD noise, try different init |
| Loss curve on a log scale isn't straight | Non-convex effects or gradient noise | Expected for neural networks |
| Validation loss diverges while train decreases | Overfitting | Add regularization, use early stopping |
In practice, we often use stochastic gradient descent (SGD), where the gradient is estimated from a random mini-batch. This introduces noise. How does this affect convergence?
SGD Convergence (Strongly Convex, Noisy Gradients)
With learning rate $\alpha_t = \alpha_0 / t$ and bounded variance $\sigma^2$: $$\mathbb{E}[J(\boldsymbol{\theta}^{(t)})] - J^* = O\left(\frac{1}{t}\right)$$
Key difference from exact GD: Linear convergence becomes sublinear due to noise. The noise floor prevents arbitrarily precise convergence with fixed learning rate.
With decreasing learning rate: Convergence is guaranteed, but slower than exact GD.
Why SGD still works:
SGD trades exact gradient information for computational efficiency (processing one sample vs entire dataset). The noise is often beneficial for generalization but limits ultimate training accuracy. In practice, we often care more about test performance than training loss.
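A small simulation (a 1-D quadratic with additive gradient noise, an assumed setup) shows both effects: a fixed learning rate stalls at a noise floor, while $\alpha_t = \alpha_0 / t$ keeps shrinking the error.

```python
import numpy as np

# Assumed 1-D toy: J(theta) = 0.5 * theta^2 (mu = 1), stochastic gradient
# g = theta + noise with unit variance. Compare a fixed rate to alpha_t = 1/t.
rng = np.random.default_rng(0)

def run_sgd(lr, T=20000, theta=5.0):
    for t in range(1, T + 1):
        g = theta + rng.normal()  # unbiased but noisy gradient estimate
        theta = theta - lr(t) * g
    return theta

theta_fixed = run_sgd(lambda t: 0.1)      # fixed rate: stalls at a noise floor
theta_decay = run_sgd(lambda t: 1.0 / t)  # decaying rate: keeps converging

print("fixed rate error:   ", abs(theta_fixed))
print("decaying rate error:", abs(theta_decay))
```

With the fixed rate, the iterates fluctuate around the minimum at a scale set by α·σ²; the 1/t schedule averages the noise away, at the cost of slower early progress.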
Coming Up Next:
We've analyzed gradient descent on the full dataset. But modern ML operates on datasets with millions of samples—computing the exact gradient is prohibitively expensive. Page 4 explores gradient descent variants: batch, stochastic, and mini-batch approaches that make optimization tractable at scale.
You now understand the mathematical foundations of gradient descent convergence. You can analyze convergence rates, understand the role of condition numbers, and diagnose optimization issues. This theoretical grounding informs all practical optimization decisions in machine learning.