If there's one hyperparameter that can make or break your machine learning model, it's the learning rate (α). Set it too high, and your model explodes—losses oscillate wildly or diverge to infinity. Set it too low, and training crawls at a glacial pace, potentially getting stuck in poor local minima. Get it just right, and your model converges smoothly to an excellent solution.
Yet finding this "just right" value is notoriously challenging. There's no universal formula. The optimal learning rate depends on your model architecture, loss function, data distribution, batch size, and even where you are in training. This page equips you with the theory and practical techniques to navigate this critical decision.
By the end of this page, you will understand why the learning rate is so critical, derive theoretical bounds, visualize the effects of different learning rates, implement learning rate schedules, and master practical tuning strategies used by ML practitioners worldwide.
The learning rate α appears in the gradient descent update rule:
$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \alpha \nabla J(\boldsymbol{\theta}^{(t)})$$
It's a multiplicative factor that scales the gradient. Geometrically, the gradient tells us which direction to move, while α tells us how far to move in that direction.
Think of it like this:
Imagine you're blindfolded on a hilly landscape, trying to reach the lowest valley. You can feel the slope beneath your feet (the gradient). The learning rate determines whether you:

- Take tiny, cautious shuffles: you will never stumble, but reaching the valley takes ages.
- Take confident strides: steady, efficient progress downhill.
- Take giant leaps: you may jump clear over the valley floor and land higher up the opposite slope.
Mathematical perspective:
The gradient $\nabla J$ measures how fast the loss changes per unit change in each parameter. Scaling it by α converts that slope into an actual displacement: the magnitude $|\alpha \nabla J|$ is the step size taken in parameter space.
For a gradient with magnitude 10:

- α = 0.001 produces a step of size 0.01 (barely moves).
- α = 0.1 produces a step of size 1.0 (substantial progress).
- α = 1.0 produces a step of size 10 (likely overshoots).
The difference between these can mean convergence in 100 iterations versus 100,000 iterations—or divergence.
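To make the arithmetic concrete, here is a minimal sketch using a hypothetical 1D loss $J(\theta) = 5\theta^2$, whose gradient at θ = 1 has magnitude 10, as in the example above:

```python
# Hypothetical 1D loss: J(θ) = 5θ², so ∇J(θ) = 10θ.
# At θ = 1 the gradient magnitude is 10, matching the example above.
theta = 1.0
grad = 10 * theta

for alpha in [0.001, 0.01, 0.1, 1.0]:
    step = alpha * grad  # |α ∇J|: the actual distance moved in parameter space
    print(f"α = {alpha:>5}: step size = {step:>5.2f}, θ: 1.00 → {theta - step:.2f}")
```

Note the last case: with α = 1.0 the update jumps from θ = 1 to θ = −9, far past the minimum at 0.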
Theory provides guidance on learning rate selection. Recall from the previous page that for convergence, we need:
$$\alpha \leq \frac{1}{L}$$
where $L$ is the Lipschitz constant of the gradient (measuring how fast the gradient changes). But what does this mean practically?
For Quadratic Functions:
Consider $J(\theta) = \frac{1}{2} L \theta^2$ (a perfect parabola). The gradient is $\nabla J = L\theta$, and the Lipschitz constant is $L$ itself.
The gradient descent update becomes: $$\theta^{(t+1)} = \theta^{(t)} - \alpha L \theta^{(t)} = (1 - \alpha L) \theta^{(t)}$$
For convergence, we need $|1 - \alpha L| < 1$, which gives: $$0 < \alpha < \frac{2}{L}$$
If $\alpha > \frac{2}{L}$, the factor $(1 - \alpha L) < -1$, and $|\theta|$ grows each iteration—divergence!
For example, with $L = 2$:

| Learning Rate α | Factor (1 − αL) | Behavior | Convergence Speed |
|---|---|---|---|
| 0.001 | 0.998 | Converges very slowly | ~350 iters to halve |
| 0.01 | 0.98 | Converges slowly | ~35 iters to halve |
| 0.1 | 0.8 | Converges well | ~3 iters to halve |
| 0.5 | 0.0 | Converges in 1 step! | Optimal for quadratic |
| 0.9 | -0.8 | Oscillates but converges | Zig-zag descent |
| 1.0 | -1.0 | Oscillates forever | No convergence |
| 1.5 | -2.0 | Diverges exponentially | Loss explodes |
For a pure quadratic with curvature L, the optimal learning rate is α = 1/L, which gives convergence in a single step! In practice, loss functions aren't perfect quadratics, but this guides intuition: adapt α to local curvature.
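The table's entries follow directly from the contraction factor: $|\theta|$ halves once $|1 - \alpha L|^n = 0.5$. A quick numerical check (assuming $L = 2$, which matches the factors above):

```python
import numpy as np

# Verify the table: θ ← (1 − αL)·θ for J(θ) = ½·L·θ², with L = 2.
L = 2.0
for alpha in [0.001, 0.01, 0.1, 0.5, 0.9, 1.0, 1.5]:
    factor = 1 - alpha * L
    if factor == 0:
        note = "converges in one step"
    elif abs(factor) < 1:
        n_halve = np.log(0.5) / np.log(abs(factor))  # solve |factor|^n = 0.5
        note = f"~{n_halve:.0f} iterations to halve |θ|"
    elif abs(factor) == 1:
        note = "oscillates forever, never converges"
    else:
        note = "diverges"
    print(f"α = {alpha:>5}: factor = {factor:+.3f} → {note}")
```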
For General Smooth Functions:
The Lipschitz constant $L$ bounds the second derivative (Hessian). Intuitively:

- Large $L$: the gradient changes rapidly (sharp curvature), so only small steps are safe.
- Small $L$: the gradient changes slowly (gentle curvature), so large steps are safe.
Connection to the Hessian:
For twice-differentiable functions, $L \geq \lambda_{\max}(\mathbf{H})$ at every point, where $\lambda_{\max}$ is the largest eigenvalue of the Hessian ($L$ is its supremum over the domain). This means:
$$\alpha \leq \frac{1}{\lambda_{\max}(\mathbf{H})}$$
The more curved the loss surface, the smaller the safe learning rate.
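As a small sketch of this bound for the elongated quadratic used in the demo below (its Hessian is the constant matrix $\mathbf{A} = \mathrm{diag}(4, 1)$):

```python
import numpy as np

# For a quadratic J(θ) = ½ θᵀAθ, the Hessian is A itself.
H = np.array([[4.0, 0.0],
              [0.0, 1.0]])
lam_max = np.linalg.eigvalsh(H).max()  # largest eigenvalue

print(f"λ_max = {lam_max}")                             # 4.0
print(f"Safe step:      α ≤ 1/λ_max = {1 / lam_max}")   # 0.25
print(f"Divergence at:  α ≥ 2/λ_max = {2 / lam_max}")   # 0.5
```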
Let's make these concepts concrete with visualizations. We'll optimize a 2D quadratic and observe how different learning rates affect the trajectory.
```python
import numpy as np
import matplotlib.pyplot as plt

def quadratic_loss(theta, A=None):
    """
    Quadratic loss: J(θ) = 0.5 * θ^T A θ
    A determines the shape (curvature in each direction)
    """
    if A is None:
        A = np.array([[4, 0], [0, 1]])  # Elongated bowl
    return 0.5 * theta @ A @ theta

def quadratic_grad(theta, A=None):
    """Gradient of quadratic: ∇J = Aθ"""
    if A is None:
        A = np.array([[4, 0], [0, 1]])
    return A @ theta

def gradient_descent_trajectory(theta_init, lr, n_iters, A=None):
    """Run GD and record trajectory"""
    trajectory = [theta_init.copy()]
    losses = [quadratic_loss(theta_init, A)]
    theta = theta_init.copy()
    for _ in range(n_iters):
        grad = quadratic_grad(theta, A)
        theta = theta - lr * grad
        trajectory.append(theta.copy())
        losses.append(quadratic_loss(theta, A))
    return np.array(trajectory), np.array(losses)

# Test different learning rates
theta_init = np.array([3.0, 3.0])
A = np.array([[4, 0], [0, 1]])  # Lipschitz constant L = 4

learning_rates = [0.05, 0.2, 0.45, 0.55]
labels = ['α=0.05 (too small)', 'α=0.2 (good)',
          'α=0.45 (near optimal)', 'α=0.55 (too large)']
colors = ['blue', 'green', 'orange', 'red']

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left plot: Trajectories in 2D
ax1 = axes[0]

# Plot loss contours
x = np.linspace(-4, 4, 100)
y = np.linspace(-4, 4, 100)
X, Y = np.meshgrid(x, y)
Z = 0.5 * (4 * X**2 + Y**2)  # J = 0.5 * (4x² + y²)
ax1.contour(X, Y, Z, levels=20, cmap='gray', alpha=0.5)

for lr, label, color in zip(learning_rates, labels, colors):
    traj, losses = gradient_descent_trajectory(theta_init, lr, 50, A)
    ax1.plot(traj[:, 0], traj[:, 1], 'o-', color=color, markersize=3,
             linewidth=1.5, label=label, alpha=0.8)

ax1.scatter([0], [0], color='black', s=100, marker='*', zorder=10, label='Minimum')
ax1.scatter([theta_init[0]], [theta_init[1]], color='black', s=60, marker='s', zorder=10)
ax1.set_xlabel('θ₁')
ax1.set_ylabel('θ₂')
ax1.set_title('Gradient Descent Trajectories')
ax1.legend(loc='upper right', fontsize=8)
ax1.set_xlim(-4, 4)
ax1.set_ylim(-4, 4)

# Right plot: Loss curves
ax2 = axes[1]
for lr, label, color in zip(learning_rates, labels, colors):
    _, losses = gradient_descent_trajectory(theta_init, lr, 50, A)
    ax2.semilogy(losses, color=color, linewidth=2, label=label)

ax2.set_xlabel('Iteration')
ax2.set_ylabel('Loss (log scale)')
ax2.set_title('Loss Convergence (Log Scale)')
ax2.legend(fontsize=8)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('learning_rate_effects.png', dpi=150)
plt.show()

# Demonstrate divergence
print("\n=== Divergence Example (α too large) ===")
theta_init_div = np.array([1.0, 1.0])
traj_div, losses_div = gradient_descent_trajectory(theta_init_div, 0.6, 10, A)
print(f"Initial loss: {losses_div[0]:.4f}")
print(f"After 5 iters: {losses_div[5]:.4f}")
print(f"After 10 iters: {losses_div[10]:.4f}")
print("Loss is increasing → DIVERGENCE!")
```

What the visualization reveals:
α = 0.05 (blue): The trajectory inches toward the minimum but makes painfully slow progress. After 50 iterations, it's still far from optimal.
α = 0.2 (green): Healthy progress. The trajectory descends efficiently, though slightly zig-zagging due to the elongated loss surface.
α = 0.45 (orange): Near-optimal. Very fast convergence, though you can see slight oscillations.
α = 0.55 (red): Too large! With λ_max = 4, the contraction factor in the steep direction is 1 − 0.55 × 4 = −1.2, so each step overshoots the minimum and the oscillations grow; the iterates diverge.
The elongated bowl effect:
Notice that our loss surface is shaped like an elongated bowl (A = diag(4, 1)), with more curvature in the θ₁ direction. This means:

- The steep θ₁ direction dictates the safe learning rate: α < 2/4 = 0.5.
- Any α that is safe for θ₁ makes progress along the flat θ₂ direction (curvature 1) four times slower.
- The trajectory zig-zags across the steep direction while only inching along the flat one.
With a single learning rate, we're limited by the most curved direction. This is why ill-conditioned problems (very elongated loss surfaces) are difficult to optimize.
The condition number κ of a matrix measures how elongated the loss surface is:
$$\kappa = \frac{\lambda_{\max}}{\lambda_{\min}}$$
where λ_max and λ_min are the largest and smallest eigenvalues of the Hessian.
Why it matters:
With learning rate $\alpha = 1/\lambda_{\max}$ (the standard safe choice), progress in the flattest direction (λ_min) is: $$\Delta \theta_{\text{flat}} \propto \frac{\lambda_{\min}}{\lambda_{\max}} = \frac{1}{\kappa}$$
Convergence rate depends on the condition number: gradient descent needs roughly $O(\kappa \log(1/\epsilon))$ iterations to reach accuracy $\epsilon$.
Example:
| Condition κ | Shape | Iterations to Converge | Learning Rate Challenge |
|---|---|---|---|
| 1 | Perfect sphere | ~10 | Any α in range works well |
| 10 | Slightly elongated | ~100 | Moderate tuning needed |
| 100 | Very elongated | ~1,000 | Careful tuning required |
| 1,000 | Extremely elongated | ~10,000 | May need adaptive methods |
| 1,000,000 | Pathological | ~1,000,000 | Standard GD nearly useless |
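To see the κ-scaling empirically, here is a short sketch that runs gradient descent on $J(\boldsymbol{\theta}) = \frac{1}{2}(\kappa \theta_1^2 + \theta_2^2)$ with $\alpha = 1/\lambda_{\max}$ and counts iterations until $\|\boldsymbol{\theta}\|_\infty < 10^{-6}$ (the exact counts depend on the tolerance, but the linear growth in κ is the point):

```python
import numpy as np

def iters_to_converge(kappa, tol=1e-6):
    """GD on J(θ) = ½(κ·θ₁² + θ₂²) with α = 1/λ_max; count iterations."""
    A = np.array([[kappa, 0.0], [0.0, 1.0]])  # eigenvalues: κ and 1
    theta = np.array([1.0, 1.0])
    alpha = 1.0 / kappa                        # λ_max = κ
    for t in range(10**7):
        if np.abs(theta).max() < tol:
            return t
        theta = theta - alpha * (A @ theta)
    return None

for kappa in [1, 10, 100, 1000]:
    print(f"κ = {kappa:>5}: {iters_to_converge(kappa):>6} iterations")
```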
Real neural networks often have condition numbers in the thousands or millions before applying techniques like batch normalization, weight initialization, and adaptive optimizers. This is why vanilla gradient descent with a fixed learning rate struggles on deep networks.
Solutions to ill-conditioning:

- Feature scaling and standardization: normalizing inputs makes the loss surface rounder.
- Momentum: accumulates velocity along flat directions while damping oscillations in steep ones.
- Adaptive optimizers (AdaGrad, RMSprop, Adam): per-parameter effective learning rates (see below).
- Normalization layers (e.g., batch normalization) and careful weight initialization.
- Second-order or preconditioned methods: rescale updates using curvature information directly.
A fixed learning rate is often suboptimal. Early in training, we're far from the optimum—large steps are safe and efficient. As we approach the minimum, we need smaller steps for precision. Learning rate schedules adapt α over time.
Common Schedules:
| Schedule | Formula | Use Case | Advantage |
|---|---|---|---|
| Step Decay | α(t) = α₀ × γ^⌊t/s⌋ | When plateaus are predictable | Simple, interpretable |
| Exponential Decay | α(t) = α₀ × e^(-kt) | Smooth decrease | Continuous, no sudden drops |
| 1/t Decay | α(t) = α₀ / (1 + kt) | Theoretical guarantees | Provable convergence |
| Cosine Annealing | α(t) = α_min + ½(α₀-α_min)(1+cos(πt/T)) | Cyclical training | Smooth with warm restarts |
| Warmup + Decay | Linear increase, then decay | Large models, transformers | Stabilizes early training |
```python
import numpy as np
import matplotlib.pyplot as plt

def step_decay(epoch, initial_lr=0.1, drop_rate=0.5, epochs_drop=10):
    """Step decay: drop by factor every epochs_drop epochs"""
    return initial_lr * (drop_rate ** (epoch // epochs_drop))

def exponential_decay(epoch, initial_lr=0.1, decay_rate=0.05):
    """Exponential decay: α(t) = α₀ × e^(-kt)"""
    return initial_lr * np.exp(-decay_rate * epoch)

def inverse_decay(epoch, initial_lr=0.1, decay_rate=0.01):
    """Inverse time decay: α(t) = α₀ / (1 + kt)"""
    return initial_lr / (1 + decay_rate * epoch)

def cosine_annealing(epoch, initial_lr=0.1, min_lr=0.001, total_epochs=100):
    """Cosine annealing: smooth oscillation between max and min"""
    return min_lr + 0.5 * (initial_lr - min_lr) * (1 + np.cos(np.pi * epoch / total_epochs))

def warmup_decay(epoch, initial_lr=0.1, warmup_epochs=10, total_epochs=100):
    """Linear warmup followed by cosine decay"""
    if epoch < warmup_epochs:
        return initial_lr * epoch / warmup_epochs
    else:
        progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
        return initial_lr * 0.5 * (1 + np.cos(np.pi * progress))

# Visualize all schedules
epochs = np.arange(100)
schedules = {
    'Step Decay': [step_decay(e) for e in epochs],
    'Exponential Decay': [exponential_decay(e) for e in epochs],
    'Inverse Decay': [inverse_decay(e) for e in epochs],
    'Cosine Annealing': [cosine_annealing(e) for e in epochs],
    'Warmup + Decay': [warmup_decay(e) for e in epochs],
}

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
for name, lrs in schedules.items():
    plt.plot(epochs, lrs, linewidth=2, label=name)
plt.xlabel('Epoch')
plt.ylabel('Learning Rate')
plt.title('Learning Rate Schedules (Linear Scale)')
plt.legend()
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
for name, lrs in schedules.items():
    plt.semilogy(epochs, lrs, linewidth=2, label=name)
plt.xlabel('Epoch')
plt.ylabel('Learning Rate (Log Scale)')
plt.title('Learning Rate Schedules (Log Scale)')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('learning_rate_schedules.png', dpi=150)
plt.show()
```

Transformer models (BERT, GPT, etc.) universally use warmup. Early in training, with random weights, gradients can be unstable. Starting with a tiny learning rate and gradually increasing prevents catastrophic updates. After warmup, decay to refine the solution.
In practice, you can't compute the Lipschitz constant. Here are battle-tested strategies for finding good learning rates:
1. The "Baby Steps" Approach
Start with a very small learning rate (e.g., 1e-5). If training is stable, multiply by 3. Repeat until:

- the loss starts oscillating or increasing, or
- the loss diverges outright.

Then back off to the last stable value. A runnable toy version of this loop is sketched below.
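Here is that loop as a toy sketch; `train_briefly` is a stand-in for a short real training run (it does gradient descent on a small quadratic), not an API from any library:

```python
import numpy as np

def train_briefly(lr, n_steps=100):
    """Toy stand-in for a short training run: GD on a quadratic with λ_max = 4.
    Returns True if the loss decreased (stable), False otherwise."""
    A = np.array([[4.0, 0.0], [0.0, 1.0]])
    theta = np.array([3.0, 3.0])
    loss0 = 0.5 * theta @ A @ theta
    for _ in range(n_steps):
        theta = theta - lr * (A @ theta)
    return 0.5 * theta @ A @ theta < loss0

lr, best_lr = 1e-5, None
while train_briefly(lr):
    best_lr = lr
    lr *= 3  # roughly half an order of magnitude per trial
print(f"Largest stable learning rate found: {best_lr:.1e}")
```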
2. Grid Search / Random Search
Search over logarithmically spaced values: [1e-5, 3e-5, 1e-4, 3e-4, 1e-3, 3e-3, 1e-2]. For random search, sample the exponent uniformly instead (e.g., lr = 10^u with u ~ Uniform(−5, −2)). A compact sketch of the grid version follows.
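A minimal version of the grid search, with a toy quadratic standing in for the real train-and-evaluate step (`short_run_loss` is a hypothetical stand-in, not a real training API):

```python
import numpy as np

# Logarithmically spaced candidates, alternating 1×10^k and 3×10^k.
candidates = [1e-5, 3e-5, 1e-4, 3e-4, 1e-3, 3e-3, 1e-2]

def short_run_loss(lr, n_steps=200):
    """Stand-in for 'train briefly, report validation loss': GD on a quadratic."""
    A = np.array([[4.0, 0.0], [0.0, 1.0]])
    theta = np.array([3.0, 3.0])
    for _ in range(n_steps):
        theta = theta - lr * (A @ theta)
    return 0.5 * theta @ A @ theta

results = {lr: short_run_loss(lr) for lr in candidates}
best = min(results, key=results.get)
print(f"Best candidate: {best:.0e} (loss {results[best]:.3e})")
```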
3. Learning Rate Range Test (Leslie Smith)
The most efficient modern approach:

1. Start from a tiny learning rate (e.g., 1e-7).
2. Train for a few hundred mini-batches, multiplying the learning rate by a constant factor after each batch so it grows exponentially toward a large value (e.g., 10).
3. Record the (smoothed) loss at each step, stopping early if it explodes.
4. Plot loss versus learning rate on a log axis and pick a value from the steeply descending region.
```python
import numpy as np
import matplotlib.pyplot as plt

def lr_range_test(model, train_loader, criterion, optimizer_class,
                  min_lr=1e-7, max_lr=10, num_steps=100):
    """
    Learning Rate Range Test (LR Finder)

    Exponentially increase the learning rate while recording the loss.
    Plot loss vs lr to find the optimal learning rate region.
    """
    # Multiplier giving exponentially spaced learning rates
    lr_mult = (max_lr / min_lr) ** (1 / num_steps)
    lrs, losses = [], []
    lr = min_lr

    optimizer = optimizer_class(model.parameters(), lr=min_lr)

    # Save initial model state so the test leaves the model untouched
    initial_state = {k: v.clone() for k, v in model.state_dict().items()}
    model.train()

    smooth_loss = 0
    for step, (X, y) in enumerate(train_loader):
        if step >= num_steps:
            break

        # Set the learning rate for this step
        for param_group in optimizer.param_groups:
            param_group['lr'] = lr

        # Forward pass
        optimizer.zero_grad()
        loss = criterion(model(X), y)

        # Record exponentially smoothed loss
        if step == 0:
            smooth_loss = loss.item()
        else:
            smooth_loss = 0.98 * smooth_loss + 0.02 * loss.item()
        lrs.append(lr)
        losses.append(smooth_loss)

        # Stop if the loss explodes
        if step > 0 and smooth_loss > 4 * min(losses):
            break

        # Backward pass and update
        loss.backward()
        optimizer.step()

        # Increase the learning rate
        lr *= lr_mult

    # Restore the model to its initial state
    model.load_state_dict(initial_state)

    return lrs, losses

def suggest_lr(lrs, losses):
    """
    Suggest a learning rate from range test results:
    the point of steepest descent (most negative slope) before the minimum.
    """
    min_idx = np.argmin(losses)
    slopes = np.diff(losses) / np.diff(np.log10(lrs))
    steepest_idx = np.argmin(slopes[:min_idx])
    return lrs[steepest_idx]

# Visualization
def plot_lr_finder(lrs, losses, suggested_lr=None):
    plt.figure(figsize=(10, 4))
    plt.semilogx(lrs, losses, linewidth=2)
    plt.xlabel('Learning Rate (log scale)')
    plt.ylabel('Loss')
    plt.title('Learning Rate Finder')
    if suggested_lr:
        plt.axvline(x=suggested_lr, color='r', linestyle='--',
                    label=f'Suggested LR: {suggested_lr:.2e}')
        plt.legend()
    plt.grid(True, alpha=0.3)
    plt.savefig('lr_finder.png', dpi=150)
    plt.show()

# Usage example (pseudo-code):
# criterion = torch.nn.CrossEntropyLoss()
# lrs, losses = lr_range_test(model, train_loader, criterion, torch.optim.SGD)
# suggested = suggest_lr(lrs, losses)
# plot_lr_finder(lrs, losses, suggested)
```

How to interpret the LR Finder plot:

- Flat region at very small learning rates: steps are too tiny to change the loss.
- Steadily decreasing region: productive learning rates live here.
- Minimum of the curve: close to the edge of stability; usually too aggressive to train at.
- Sharp rise after the minimum: the loss explodes; these rates are unusable.
Rule of thumb: Choose a learning rate about 10× smaller than where the loss is minimum, or at the steepest point of descent.
The challenge of selecting a single learning rate that works for all parameters has motivated adaptive learning rate methods. These algorithms automatically adjust the learning rate for each parameter based on its gradient history.
Key Methods (covered in depth later):
| Method | Key Idea | When to Use |
|---|---|---|
| AdaGrad | Divide by accumulated squared gradients | Sparse data (NLP, recommendations) |
| RMSprop | Exponential moving average of squared gradients | Non-stationary objectives, RNNs |
| Adam | Combines momentum with RMSprop | Default choice for most problems |
| AdamW | Adam with decoupled weight decay | Transformers, large models |
The fundamental insight: parameters that consistently receive large gradients should get smaller effective learning rates (they already get plenty of update signal), while parameters with consistently small gradients should get larger effective rates (they need amplification).
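A minimal sketch of this idea in the spirit of AdaGrad (each method's details come later): accumulate squared gradients per parameter and divide each step by their square root.

```python
import numpy as np

def adagrad_step(theta, grad, accum, lr=0.1, eps=1e-8):
    """One AdaGrad-style update: per-parameter step scaling."""
    accum += grad ** 2                           # per-parameter gradient history
    theta -= lr * grad / (np.sqrt(accum) + eps)  # big history → small step
    return theta, accum

# Toy demo on the elongated quadratic: θ₁ sees large gradients (its steps
# get scaled down), θ₂ sees small ones (its steps stay relatively large).
A = np.array([[4.0, 0.0], [0.0, 1.0]])
theta, accum = np.array([3.0, 3.0]), np.zeros(2)
for _ in range(100):
    theta, accum = adagrad_step(theta, A @ theta, accum)
print(f"θ after 100 AdaGrad-style steps: {theta}")
```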
If you're unsure which optimizer to use, start with Adam (learning rate 3e-4 to 1e-3). It's less sensitive to hyperparameter choices than vanilla SGD and works well across a wide range of problems. SGD with momentum often achieves better final accuracy but requires more tuning.
The learning rate is not just a hyperparameter—it's the throttle controlling how aggressively your model learns. Master it, and optimization becomes tractable.
Coming Up Next:
With learning rate understood, we turn to convergence analysis: How fast does gradient descent converge? What theoretical guarantees can we prove? Understanding convergence rates helps us choose between algorithms and set expectations for training time.
You now understand why learning rate is so critical, the mathematical bounds that govern it, how to visualize its effects, and practical strategies for tuning. Next, we'll dive into convergence theory to understand exactly how fast gradient descent converges.