Ask any experienced deep learning practitioner which hyperparameter matters most, and the answer is almost always the same: the learning rate. Get it right, and training proceeds smoothly toward excellent solutions. Get it wrong, and your model either diverges into chaos or crawls toward mediocrity so slowly that you abandon the experiment.
The learning rate sits at the heart of a fundamental tension in optimization. Too large, and you overshoot the minimum, oscillating wildly or diverging entirely. Too small, and you crawl toward the optimum so slowly that training becomes impractical, or you get stuck in suboptimal regions.
This page provides a comprehensive treatment of learning rate selection: the theoretical bounds that constrain valid choices, practical heuristics for initial values, systematic search strategies, the learning rate range test, and how to diagnose and fix learning rate problems.
By the end of this page, you will understand the theoretical bounds on learning rate, apply practical heuristics for initial selection, conduct learning rate range tests, tune learning rates systematically, and diagnose problems from training curves.
Understanding the theoretical basis for learning rate bounds provides intuition for practical selection. Let's establish the key results from optimization theory.
The Smoothness Constant
The central quantity governing learning rate is the Lipschitz constant of the gradient, often denoted $L$. A function $J$ has $L$-Lipschitz continuous gradient if:
$$\|\nabla J(\boldsymbol{\theta}) - \nabla J(\boldsymbol{\phi})\| \leq L \|\boldsymbol{\theta} - \boldsymbol{\phi}\|$$
Intuitively, $L$ bounds how fast the gradient can change as parameters change. For quadratic functions, $L$ equals the largest eigenvalue of the Hessian: $L = \lambda_{\max}(\mathbf{H})$.
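This relationship is easy to verify numerically. The sketch below builds a random quadratic, takes $L = \lambda_{\max}(\mathbf{H})$, and checks the Lipschitz bound on random parameter pairs (the matrix size and sample count are arbitrary choices for illustration):

```python
import numpy as np

# For a quadratic J(theta) = 0.5 * theta^T H theta, the gradient is H @ theta,
# so ||grad J(a) - grad J(b)|| = ||H (a - b)|| <= lambda_max(H) * ||a - b||.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
H = A @ A.T  # symmetric positive semi-definite Hessian

L = np.linalg.eigvalsh(H)[-1]  # largest eigenvalue = smoothness constant

def grad(theta):
    return H @ theta

# Empirically check the bound on random parameter pairs
worst_ratio = 0.0
for _ in range(1000):
    a, b = rng.standard_normal(5), rng.standard_normal(5)
    ratio = np.linalg.norm(grad(a) - grad(b)) / np.linalg.norm(a - b)
    worst_ratio = max(worst_ratio, ratio)

# The worst observed ratio approaches lambda_max but never exceeds it
print(f"lambda_max = {L:.4f}, worst observed ratio = {worst_ratio:.4f}")
```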
The Stability Bound
For gradient descent to decrease the loss at each step, we need:
$$\eta < \frac{2}{L}$$
With $\eta > 2/L$, updates overshoot the minimum and oscillate with increasing amplitude—the algorithm diverges.
For a simple quadratic J(θ) = ½aθ², the gradient is aθ, and L = a. The stability condition η < 2/a is exact: larger η causes oscillation. This toy example builds intuition for the general case.
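A minimal numerical check of the toy example (the curvature `a = 2.0` and step count are arbitrary):

```python
# Gradient descent on J(theta) = 0.5 * a * theta^2, where L = a.
# Each update is theta <- theta * (1 - eta * a), so stability requires
# |1 - eta * a| < 1, i.e. eta < 2/a.
def run_gd(eta, a=2.0, theta0=1.0, steps=50):
    theta = theta0
    for _ in range(steps):
        theta -= eta * a * theta  # gradient of J is a * theta
    return theta

a = 2.0
stable   = run_gd(eta=0.9 * (2 / a))   # just inside the bound: converges (oscillating)
optimal  = run_gd(eta=1 / a)           # eta = 1/L: reaches the minimum in one step
unstable = run_gd(eta=1.1 * (2 / a))   # beyond the bound: diverges

print(stable, optimal, unstable)
```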
Optimal Learning Rate for Quadratics
For a quadratic objective, the optimal learning rate is:
$$\eta^* = \frac{1}{L}$$
This choice gives the fastest convergence without oscillation. For more complex (non-quadratic) losses, $\eta = 1/L$ remains a good heuristic, though the optimal value may differ.
The Condition Number
The condition number $\kappa = L/\mu$ (where $\mu$ is the strong convexity constant) determines convergence difficulty: gradient descent needs on the order of $\kappa \log(1/\epsilon)$ iterations to reach accuracy $\epsilon$, so ill-conditioned problems converge slowly along their flat directions.
For neural networks, the effective condition number can be enormous ($10^4$ to $10^6$), explaining why learning rate selection is so critical and why adaptive methods (which estimate per-parameter curvature) can help significantly.
The Problem: We Don't Know L
In practice, we rarely know the smoothness constant $L$. For neural networks, the loss is non-convex, the Hessian (and hence the local curvature) changes as the parameters move, and computing its largest eigenvalue at every step is prohibitively expensive.
This is why learning rate selection is empirical rather than purely theoretical. We use the theory to understand the qualitative behavior, then tune empirically.
While theory provides bounds, practical learning rate selection relies on accumulated empirical wisdom. These heuristics serve as starting points for systematic tuning.
Heuristic 1: Start with Default Values
Well-tested defaults exist for common optimizers:
| Optimizer | Common Default | Typical Range |
|---|---|---|
| SGD | 0.01 - 0.1 | $10^{-4}$ to $10^{0}$ |
| SGD + Momentum | 0.01 | $10^{-4}$ to $10^{-1}$ |
| Adam | 0.001 | $10^{-5}$ to $10^{-2}$ |
| AdamW | 0.001 | $10^{-5}$ to $10^{-2}$ |
| RMSprop | 0.001 | $10^{-5}$ to $10^{-2}$ |
Adaptive methods (Adam, RMSprop) typically use smaller learning rates because they internally scale gradients.
Heuristic 2: Scale with Batch Size
As covered in the mini-batch SGD page, the linear scaling rule suggests:
$$\eta_{\text{new}} = \eta_{\text{base}} \cdot \frac{B_{\text{new}}}{B_{\text{base}}}$$
If default learning rates assume batch size 32, adjust proportionally for your batch size.
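As a trivial sketch of this adjustment (the helper name and the batch-size-32 baseline are illustrative assumptions):

```python
def scale_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Linear scaling rule: learning rate grows proportionally with batch size."""
    return base_lr * new_batch / base_batch

# Default tuned at batch size 32, but we train at batch size 256:
print(scale_lr(0.1, base_batch=32, new_batch=256))  # -> 0.8
```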
Heuristic 3: Scale with Model/Data
Some practitioners suggest scaling with model and data: larger models and larger datasets often call for different learning rates than the defaults, so treat defaults as starting points rather than fixed values.
When grid-searching learning rates, use factors of ~3: {0.001, 0.003, 0.01, 0.03, 0.1}. Factors of 10 are too coarse (you might miss the optimal region); factors of 2 are too fine for an initial search. Refine with a finer grid after finding the right order of magnitude.
Heuristic 4: Architecture-Specific Guidelines
Common community practice includes SGD with momentum around 0.1 (with decay) for CNN image classifiers, AdamW around $10^{-4}$ to $3 \times 10^{-4}$ with warmup for Transformers, and Adam around $10^{-3}$ with gradient clipping for recurrent networks. These guidelines emerge from extensive community experimentation. When starting with a new architecture, check papers and community practice for initial values.
Heuristic 5: Loss-Specific Considerations
The scale of the loss matters: switching a loss from mean to sum reduction multiplies gradient magnitudes by the batch size, which effectively multiplies the learning rate by the same factor. Re-tune the learning rate whenever the loss formulation changes.
The learning rate range test (also called learning rate finder) is a systematic technique for finding a good initial learning rate. Proposed by Leslie Smith, it's become a standard practice in deep learning.
The Algorithm
The goal is to find the range where increasing learning rate actively improves training, before it becomes unstable.
```python
import numpy as np
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
from typing import Tuple, List


def learning_rate_finder(
    model: nn.Module,
    dataloader,
    loss_fn,
    optimizer_class,
    lr_min: float = 1e-7,
    lr_max: float = 10.0,
    num_steps: int = 100,
    smooth_factor: float = 0.98,
    diverge_threshold: float = 5.0
) -> Tuple[List[float], List[float]]:
    """
    Learning rate range test to find a good initial learning rate.

    Args:
        model: The neural network (restored to its initial state afterwards)
        dataloader: Training data loader
        loss_fn: Loss function
        optimizer_class: Optimizer class (e.g., torch.optim.SGD)
        lr_min: Starting learning rate
        lr_max: Maximum learning rate to test
        num_steps: Number of batches to process
        smooth_factor: Exponential smoothing for loss
        diverge_threshold: Stop if loss exceeds this multiple of best loss

    Returns:
        learning_rates: List of tested LRs
        losses: Smoothed losses at each LR
    """
    # Save initial state for restoration after the test
    initial_state = {k: v.clone() for k, v in model.state_dict().items()}

    # Start with the minimum LR
    optimizer = optimizer_class(model.parameters(), lr=lr_min)

    # Compute LR multiplier per step (geometric progression from lr_min to lr_max)
    lr_mult = (lr_max / lr_min) ** (1 / num_steps)

    learning_rates = []
    losses = []
    smoothed_loss = None
    best_loss = float('inf')
    current_lr = lr_min

    data_iter = iter(dataloader)
    model.train()

    for step in range(num_steps):
        # Get next batch (cycle if needed)
        try:
            inputs, targets = next(data_iter)
        except StopIteration:
            data_iter = iter(dataloader)
            inputs, targets = next(data_iter)

        # Forward pass
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)

        # Exponentially smooth the loss and check for divergence
        if smoothed_loss is None:
            smoothed_loss = loss.item()
        else:
            smoothed_loss = smooth_factor * smoothed_loss + (1 - smooth_factor) * loss.item()
        if smoothed_loss < best_loss:
            best_loss = smoothed_loss
        if smoothed_loss > diverge_threshold * best_loss:
            print(f"Stopping early: loss diverged at LR = {current_lr:.2e}")
            break

        # Record
        learning_rates.append(current_lr)
        losses.append(smoothed_loss)

        # Backward and step
        loss.backward()
        optimizer.step()

        # Increase learning rate
        current_lr *= lr_mult
        for param_group in optimizer.param_groups:
            param_group['lr'] = current_lr

    # Restore initial state so the range test does not affect real training
    model.load_state_dict(initial_state)

    return learning_rates, losses


def suggest_lr(learning_rates: List[float], losses: List[float]) -> float:
    """
    Suggest a learning rate based on the range test results.

    Strategy: find the point where the loss is decreasing fastest,
    then use a somewhat smaller LR for stability.
    """
    log_lrs = np.log10(learning_rates)
    losses = np.array(losses)

    # Numerical gradient of loss with respect to log-LR
    grad = np.gradient(losses, log_lrs)

    # Steepest-descent point
    min_grad_idx = np.argmin(grad)

    # Pick an LR below this point (a factor of 10 smaller often works)
    return learning_rates[min_grad_idx] / 10


def plot_lr_finder(learning_rates: List[float], losses: List[float],
                   suggested_lr: float = None):
    """Plot learning rate finder results."""
    plt.figure(figsize=(10, 6))
    plt.plot(learning_rates, losses, 'b-', linewidth=1)
    plt.xscale('log')
    plt.xlabel('Learning Rate (log scale)')
    plt.ylabel('Loss')
    plt.title('Learning Rate Range Test')
    if suggested_lr is not None:
        plt.axvline(x=suggested_lr, color='r', linestyle='--',
                    label=f'Suggested LR: {suggested_lr:.2e}')
        plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()
```
Interpreting the Plot
The learning rate finder produces a characteristic curve:
Flat region (very small LR): Loss barely changes. Learning rate too small to make progress.
Descent region: Loss decreases as LR increases. This is the useful range.
Minimum region: Loss is lowest. Potential good LR.
Ascent region: Loss increases. LR becoming too large.
Explosion: Loss becomes NaN or infinity. LR far too large.
The Selection Strategy: Choose a learning rate in the descent region, typically 1/10 to 1/3 of the way to the minimum. Choosing at the minimum often leads to instability during longer training.
The LR finder gives a starting point, not the final answer. The optimal LR changes during training (hence schedules). Results may vary with initialization. It's most reliable for finding the order of magnitude. Always validate with actual training runs.
Beyond the learning rate finder, systematic hyperparameter search provides principled approaches to learning rate selection.
Grid Search
The simplest approach: try a pre-defined set of values.
```python
learning_rates = [1e-5, 3e-5, 1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1]

for lr in learning_rates:
    train_and_evaluate(model, lr)
    # Track validation performance
```
Pros: Simple, reproducible, easy to parallelize.
Cons: May miss the optimum if it is not in the grid; scales poorly with dimensions.
Logarithmic Grid: For learning rates, always use log-scale spacing. $10^{-4}$ and $10^{-3}$ are as different as $10^{-3}$ and $10^{-2}$.
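NumPy's `logspace` produces exactly this kind of grid (the endpoints and grid size here are illustrative):

```python
import numpy as np

# Nine candidate LRs, evenly spaced in log10 between 1e-5 and 1e-1
grid = np.logspace(-5, -1, num=9)
print(grid)

# Consecutive candidates differ by a constant *factor* (~3.16x here),
# not a constant additive step, matching how learning rates behave.
ratios = grid[1:] / grid[:-1]
```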
Random Search
Sample learning rates uniformly in log space:
```python
import numpy as np

# Sample 50 learning rates in [1e-5, 1e-1], log-uniform
lr_samples = 10 ** np.random.uniform(-5, -1, 50)

for lr in lr_samples:
    train_and_evaluate(model, lr)
```
Advantage over grid: More likely to hit good values; doesn't waste evaluations on clearly bad regions.
Random search often outperforms grid search for the same budget, especially when not all hyperparameters are equally important (which is typically the case).
Successive Halving (Hyperband)
Efficient early stopping of bad configurations: train every candidate learning rate for a small budget, keep only the best-performing fraction, increase the budget, and repeat until a single candidate remains.
This allocates more resources to promising learning rates while quickly eliminating poor choices. Hyperband is a principled version of this that determines the optimal resource allocation.
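The procedure can be sketched in a few lines. This is a simplified illustration, not Hyperband itself; `train_briefly` is a hypothetical stand-in for your training routine, and the toy scoring function below just pretends the best LR is 0.01:

```python
import numpy as np

def successive_halving(candidates, train_briefly, budget_per_round=1, factor=2):
    """
    Simplified successive halving: train all candidates for a small budget,
    keep the best 1/factor fraction, multiply the budget, and repeat.

    `train_briefly(lr, budget)` must return a validation loss (lower is better).
    """
    budget = budget_per_round
    survivors = list(candidates)
    while len(survivors) > 1:
        scores = [train_briefly(lr, budget) for lr in survivors]
        k = max(1, len(survivors) // factor)  # keep at least one candidate
        order = np.argsort(scores)
        survivors = [survivors[i] for i in order[:k]]
        budget *= factor  # survivors get more training next round
    return survivors[0]

# Toy stand-in: pretend validation loss is minimized at lr = 0.01
best = successive_halving(
    [1e-4, 1e-3, 1e-2, 1e-1, 1.0],
    train_briefly=lambda lr, budget: abs(np.log10(lr) - np.log10(1e-2)),
)
print(best)  # -> 0.01
```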
For a single hyperparameter search: (1) Use LR finder to get the rough range, (2) Grid search over ~5 values around that range, (3) Refine with smaller grid. For multiple hyperparameters: use random search or Bayesian optimization with Optuna or similar tools.
Training curves encode rich information about learning rate appropriateness. Learning to read these curves is an essential skill for debugging optimization.
Symptom 1: Loss Explodes (NaN/Inf)
Cause: Learning rate far too large. Updates overshoot so badly that numerical overflow occurs.
Evidence: Loss becomes NaN or Inf within the first few steps, often after growing by orders of magnitude per update.
Fix: Reduce learning rate by 10× or more. If using warmup, slow the warmup.
Symptom 2: Loss Oscillates Wildly
Cause: Learning rate too large, but not catastrophically. Updates overshoot, then overcorrect.
Evidence: Loss jumps up and down between consecutive steps with little net progress; the curve looks like noise around a flat or slowly drifting level.
Fix: Reduce learning rate by 2-5×. Consider adding gradient clipping.
Symptom 3: Loss Decreases Extremely Slowly
Cause: Learning rate too small. Making negligible progress per step.
Evidence: Loss decreases smoothly but by a negligible amount per epoch; the curve is nearly flat even early in training.
Fix: Increase learning rate by 3-10×. Ensure you haven't accidentally applied decay twice (for example, two schedulers both stepping).
| Observation | Likely Cause | Action |
|---|---|---|
| Loss is NaN after few steps | LR way too high | Reduce LR by 10-100× |
| Loss spikes then recovers | LR slightly too high | Reduce LR by 2-3× |
| Wild oscillation in loss | LR too high / batch too small | Reduce LR; consider larger batch |
| Steady but slow decrease | LR too low | Increase LR by 3-10× |
| Loss flat for many epochs | LR way too low or stuck | Increase LR; check for dead ReLUs |
| Train decreases, val increases | Overfitting (not LR directly) | Regularization; maybe reduce LR late |
| Loss plateaus, won't improve | LR too high to find fine minimum | Decay LR; add schedule step |
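As a rough companion to the table above, a loss history can be triaged automatically. This is a simplistic sketch with arbitrary thresholds, not a substitute for looking at the curves:

```python
import math

def diagnose_lr(losses):
    """Crude triage of a loss history, mirroring the diagnosis table."""
    if any(math.isnan(x) or math.isinf(x) for x in losses):
        return "explosion: reduce LR by 10-100x"
    rel_change = (losses[0] - losses[-1]) / abs(losses[0])
    # Oscillation: average step-to-step movement far exceeds net progress
    jitter = sum(abs(b - a) for a, b in zip(losses, losses[1:])) / (len(losses) - 1)
    net_per_step = abs(losses[0] - losses[-1]) / (len(losses) - 1)
    if jitter > 5 * net_per_step and rel_change < 0.5:
        return "oscillation: reduce LR by 2-5x"
    if 0 <= rel_change < 0.01:
        return "too slow: increase LR by 3-10x"
    return "looks reasonable"

print(diagnose_lr([2.3, 1.8, 1.4, 1.1, 0.9]))            # steady decrease
print(diagnose_lr([2.3, 2.3, 2.29, 2.29, 2.29]))         # barely moving
print(diagnose_lr([2.0, 3.0, 1.5, 3.5, 1.8]))            # wild oscillation
print(diagnose_lr([2.3, 4.0, 1.5, 5.2, float('nan')]))   # blown up
```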
Symptom 4: Fast Initial Progress, Then Plateau
Cause: Learning rate appropriate for early training but too large for fine-tuning near minimum.
Evidence: Loss drops quickly at first, then flattens at a value above what the model should reach, often with small persistent oscillations around the plateau.
Fix: Apply learning rate decay. Use cosine annealing or step decay.
Symptom 5: Training and Validation Diverge
Cause: Could be overfitting (not directly LR), but large LR can exacerbate.
Evidence: Training loss keeps decreasing while validation loss bottoms out and begins to rise.
Fix: This is primarily a regularization issue, but reducing LR slightly may help. Focus on dropout, weight decay, or data augmentation first.
If you're uncertain whether to increase or decrease: try changing by 10×. If the problem gets worse, go the other direction. This may seem crude, but learning rate issues are often order-of-magnitude problems, not fine-tuning problems.
Not all parameters are equal. Different layers may benefit from different learning rates. This technique, called differential learning rates or layer-wise learning rates, is especially useful for transfer learning.
Why Different Layers Need Different Rates
Pre-trained layers: Should change slowly to preserve learned features. Small LR.
New layers: Need to learn from scratch. Larger LR.
Early vs. late layers: In transfer learning, early layers (edges, textures) are more general. Later layers (object parts, semantics) are more task-specific.
The Discriminative Fine-Tuning Strategy
For a model with $L$ layers, use learning rates: $$\eta_l = \eta_{\text{base}} \cdot \gamma^{L-l}$$
where $\gamma < 1$ (e.g., $\gamma = 0.9$ or $\gamma = 0.5$). This gives exponentially smaller LR to earlier layers.
Example: with base LR $= 0.001$, $\gamma = 0.5$, and 4 layers, the rates are $0.001 \cdot 0.5^3 = 0.000125$ for layer 1, $0.00025$ for layer 2, $0.0005$ for layer 3, and $0.001$ for layer 4.
```python
import torch
import torch.nn as nn


def get_parameter_groups(model: nn.Module, base_lr: float,
                         lr_decay: float = 0.9) -> list:
    """
    Create parameter groups with layer-wise learning rates.
    Earlier layers get smaller learning rates.

    Args:
        model: Neural network
        base_lr: Learning rate for the final layer
        lr_decay: Multiplicative factor for each layer back

    Returns:
        List of parameter groups for the optimizer
    """
    # Group parameters by top-level module name
    # (this example assumes a simple sequential model)
    layer_params = {}
    for name, param in model.named_parameters():
        layer_key = name.split('.')[0]  # e.g., "layer1", "conv2"
        layer_params.setdefault(layer_key, []).append(param)

    # Assign learning rates (reverse order: last layer gets base_lr)
    param_groups = []
    layer_names = list(layer_params.keys())
    num_layers = len(layer_names)
    for i, layer_name in enumerate(layer_names):
        # Earlier layers (smaller i) get smaller LR
        layer_lr = base_lr * (lr_decay ** (num_layers - 1 - i))
        param_groups.append({
            'params': layer_params[layer_name],
            'lr': layer_lr,
            'name': layer_name  # for logging
        })
        print(f"Layer {layer_name}: LR = {layer_lr:.6f}")

    return param_groups


# For the pre-trained backbone + new head scenario
def freeze_and_differential(model, pretrained_lr, head_lr):
    """Common pattern: small LR for pretrained layers, large LR for the new head."""
    pretrained_params = []
    head_params = []
    for name, param in model.named_parameters():
        if 'classifier' in name or 'fc' in name or 'head' in name:
            head_params.append(param)        # typically new layers
        else:
            pretrained_params.append(param)  # pretrained backbone

    return [
        {'params': pretrained_params, 'lr': pretrained_lr},
        {'params': head_params, 'lr': head_lr},
    ]


# Usage:
# param_groups = freeze_and_differential(model,
#                                        pretrained_lr=1e-5,
#                                        head_lr=1e-3)
# optimizer = torch.optim.Adam(param_groups)
```
Per-layer learning rates are most useful for: transfer learning (pre-trained + new layers), fine-tuning large models (BERT, GPT, etc.), and very deep networks where gradients vary by layer. For training from scratch on standard architectures, a uniform LR often suffices.
Learning rate doesn't exist in isolation. Its optimal value depends on and interacts with other hyperparameters. Understanding these interactions is crucial for effective tuning.
Interaction with Batch Size
As established by the linear scaling rule: when you multiply the batch size by $k$, multiply the learning rate by $k$ as well (typically combined with warmup), because larger batches produce lower-variance gradient estimates that tolerate larger steps.
Interaction with Weight Decay
Weight decay ($\lambda$) and learning rate interact: with classic (coupled) L2 regularization, each SGD step shrinks the weights by a factor of $(1 - \eta\lambda)$, so the effective regularization strength scales with the learning rate. AdamW decouples the two, which is one reason it is easier to tune.
If you change LR, you may need to re-tune weight decay.
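The coupling is easy to see with a parameter that receives zero gradient from the loss, so the only update comes from weight decay. This sketch uses plain `torch.optim.SGD` (the specific values are arbitrary):

```python
import torch

# With SGD's coupled weight decay, the per-step shrinkage is
# (1 - lr * weight_decay) -- it depends on the learning rate.
def decayed_weight(lr, weight_decay, steps=10):
    w = torch.nn.Parameter(torch.tensor([1.0]))
    opt = torch.optim.SGD([w], lr=lr, weight_decay=weight_decay)
    for _ in range(steps):
        opt.zero_grad()
        loss = 0.0 * w.sum()  # loss contributes no gradient
        loss.backward()
        opt.step()
    return w.item()

# Same weight_decay, different LR -> different effective regularization
print(decayed_weight(lr=0.1, weight_decay=0.01))   # ~ (1 - 0.001)^10
print(decayed_weight(lr=0.01, weight_decay=0.01))  # ~ (1 - 0.0001)^10
```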
| Hyperparameter | Interaction with LR | Tuning Strategy |
|---|---|---|
| Batch size | Linear scaling rule applies | Tune together; LR ∝ B |
| Weight decay | Effective regularization scales with LR | Re-tune if LR changes significantly |
| Momentum | Higher momentum → need smaller LR | Usually keep momentum fixed, tune LR |
| Gradient clipping | Clipping reduces effective LR on large-grad steps | May need larger LR if heavily clipping |
| Dropout | Indirect (affects gradient variance) | Usually tuned independently |
| Network depth/width | Deeper/wider often needs smaller LR | Check stability; adjust as needed |
Interaction with Momentum
Momentum accelerates updates by accumulating gradient history: $$\mathbf{v}^{(t)} = \beta \mathbf{v}^{(t-1)} + \nabla J(\boldsymbol{\theta}^{(t)})$$ $$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \eta \mathbf{v}^{(t)}$$
Higher momentum ($\beta$ closer to 1) means larger accumulated updates: with a roughly constant gradient, the velocity approaches $1/(1-\beta)$ times a single gradient, so the effective step size is about $\eta/(1-\beta)$. Raising momentum therefore usually requires lowering the learning rate to keep the effective step size comparable.
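The $\eta/(1-\beta)$ effective-step-size claim can be checked by iterating the velocity update with a constant gradient:

```python
# With a constant gradient g, the momentum buffer accumulates the
# geometric series g * (1 + beta + beta^2 + ...) = g / (1 - beta),
# so the effective step size is eta / (1 - beta).
def steady_state_velocity(beta, g=1.0, steps=500):
    v = 0.0
    for _ in range(steps):
        v = beta * v + g
    return v

for beta in (0.5, 0.9, 0.99):
    v = steady_state_velocity(beta)
    print(f"beta={beta}: velocity -> {v:.2f} (predicted {1 / (1 - beta):.2f})")
```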
Interaction with Optimizer
Different optimizers have different LR sensitivities: plain SGD is highly sensitive, where an LR wrong by 10× can mean divergence or stagnation, while Adam and other adaptive methods normalize per-parameter step sizes and tend to work across a wider range, though they still benefit from tuning.
When tuning multiple hyperparameters: (1) Fix others at reasonable defaults, (2) Tune learning rate first, (3) Tune batch size (adjust LR proportionally), (4) Tune regularization (weight decay, dropout), (5) Fine-tune LR again if needed. Learning rate is the most impactful, so start there.
Years of community experience have revealed common mistakes in learning rate selection. Knowing these pitfalls helps you avoid them.
Pitfall 1: Using the Same LR for All Tasks
A learning rate that works for ImageNet classification may be wrong for other setups: a different task (detection, language modeling), a different data scale, fine-tuning instead of training from scratch, or a different loss function.
Best practice: Always tune LR for your specific setup. Start from recommended defaults but validate.
Pitfall 2: Forgetting to Adjust for Precision
Mixed precision (FP16) training can require learning rate adjustments: FP16's narrow dynamic range lets small gradient values underflow to zero (loss scaling mitigates this), while very large updates risk overflow. Re-validate your learning rate after switching precision.
Pitfall 3: Not Using Schedules
Fixed learning rate is almost never optimal for long training. The loss surface changes as you approach different regions. Early training benefits from exploration (larger LR); late training needs precision (smaller LR).
Best practice: Always use a learning rate schedule. Cosine annealing is a robust default.
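A minimal sketch of cosine annealing using PyTorch's built-in scheduler (the model, base LR, and horizon are placeholder choices):

```python
import torch

model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
# Half a cosine from the base LR (0.1) down to ~0 over 100 steps
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=100)

lrs = []
for step in range(100):
    # ... forward/backward pass would go here ...
    opt.step()
    sched.step()
    lrs.append(opt.param_groups[0]['lr'])

# Starts near 0.1, ends near 0
print(lrs[0], lrs[-1])
```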
Pitfall 4: Over-Engineering the Schedule
Complex schedules (warm restarts, cyclical rates, custom rules) can help but also add tuning burden. Simple schedules often work nearly as well.
Best practice: Start simple (step decay or cosine). Add complexity only if simple schedules underperform after tuning.
We've thoroughly covered learning rate selection—the most critical hyperparameter in gradient-based optimization. Let's consolidate the essential knowledge:
What's Next
The final page in this module covers convergence analysis—the theoretical study of when and how fast gradient descent variants converge. We'll formalize the intuitions built in previous pages with rigorous mathematical treatment.
You now have the tools to select, tune, and diagnose learning rates effectively. This skill is foundational—virtually every ML training pipeline requires learning rate tuning. The techniques here apply whether you're training a simple linear model or a billion-parameter language model.