Ask any experienced deep learning practitioner which hyperparameter matters most, and the answer is almost always the same: the learning rate. Get it right, and training proceeds smoothly toward excellent solutions. Get it wrong, and your model either diverges into chaos or crawls toward mediocrity so slowly that you abandon the experiment.
The learning rate sits at the heart of a fundamental tension in optimization. Too large, and you overshoot the minimum, oscillating wildly or diverging entirely. Too small, and you crawl toward the optimum so slowly that training becomes impractical, or you get stuck in suboptimal regions.
This page provides a comprehensive treatment of learning rate selection: the theoretical bounds that constrain valid choices, practical heuristics for initial values, systematic search strategies, the learning rate range test, and how to diagnose and fix learning rate problems.
By the end of this page, you will understand the theoretical bounds on learning rate, apply practical heuristics for initial selection, conduct learning rate range tests, tune learning rates systematically, and diagnose problems from training curves.
Understanding the theoretical basis for learning rate bounds provides intuition for practical selection. Let's establish the key results from optimization theory.
The Smoothness Constant
The central quantity governing learning rate is the Lipschitz constant of the gradient, often denoted $L$. A function $J$ has $L$-Lipschitz continuous gradient if:
$$\|\nabla J(\boldsymbol{\theta}) - \nabla J(\boldsymbol{\phi})\| \leq L \|\boldsymbol{\theta} - \boldsymbol{\phi}\|$$
Intuitively, $L$ bounds how fast the gradient can change as parameters change. For quadratic functions, $L$ equals the largest eigenvalue of the Hessian: $L = \lambda_{\max}(\mathbf{H})$.
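This relationship is easy to verify numerically. The sketch below builds a random quadratic, takes $L = \lambda_{\max}(\mathbf{H})$, and checks the Lipschitz bound on random parameter pairs (the matrix size and sample count are arbitrary choices for illustration):

```python
import numpy as np

# For a quadratic J(theta) = 0.5 * theta^T H theta, the gradient is H @ theta,
# so ||grad J(a) - grad J(b)|| = ||H (a - b)|| <= lambda_max(H) * ||a - b||.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
H = A @ A.T  # symmetric positive semi-definite Hessian

L = np.linalg.eigvalsh(H)[-1]  # largest eigenvalue = smoothness constant

def grad(theta):
    return H @ theta

# Empirically check the bound on random parameter pairs
worst_ratio = 0.0
for _ in range(1000):
    a, b = rng.standard_normal(5), rng.standard_normal(5)
    ratio = np.linalg.norm(grad(a) - grad(b)) / np.linalg.norm(a - b)
    worst_ratio = max(worst_ratio, ratio)

# The worst observed ratio approaches lambda_max but never exceeds it
print(f"lambda_max = {L:.4f}, worst observed ratio = {worst_ratio:.4f}")
```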
The Stability Bound
For gradient descent to decrease the loss at each step, we need:
$$\eta < \frac{2}{L}$$
With $\eta > 2/L$, updates overshoot the minimum and oscillate with increasing amplitude—the algorithm diverges.
For a simple quadratic J(θ) = ½aθ², the gradient is aθ, and L = a. The stability condition η < 2/a is exact: larger η causes oscillation. This toy example builds intuition for the general case.
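A minimal numerical check of the toy example (the curvature `a = 2.0` and step count are arbitrary):

```python
# Gradient descent on J(theta) = 0.5 * a * theta^2, where L = a.
# Each update is theta <- theta * (1 - eta * a), so stability requires
# |1 - eta * a| < 1, i.e. eta < 2/a.
def run_gd(eta, a=2.0, theta0=1.0, steps=50):
    theta = theta0
    for _ in range(steps):
        theta -= eta * a * theta  # gradient of J is a * theta
    return theta

a = 2.0
stable   = run_gd(eta=0.9 * (2 / a))   # just inside the bound: converges (oscillating)
optimal  = run_gd(eta=1 / a)           # eta = 1/L: reaches the minimum in one step
unstable = run_gd(eta=1.1 * (2 / a))   # beyond the bound: diverges

print(stable, optimal, unstable)
```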
Optimal Learning Rate for Quadratics
For a quadratic objective, the optimal learning rate is:
$$\eta^* = \frac{1}{L}$$
This choice gives the fastest convergence without oscillation. For more complex (non-quadratic) losses, $\eta = 1/L$ remains a good heuristic, though the optimal value may differ.
The Condition Number
The condition number $\kappa = L/\mu$ (where $\mu$ is the strong convexity constant) determines convergence difficulty: gradient descent needs on the order of $\kappa \log(1/\epsilon)$ iterations to reach accuracy $\epsilon$, so ill-conditioned problems converge slowly along their flat directions.
For neural networks, the effective condition number can be enormous ($10^4$ to $10^6$), explaining why learning rate selection is so critical and why adaptive methods (which estimate per-parameter curvature) can help significantly.
The Problem: We Don't Know L
In practice, we rarely know the smoothness constant $L$. For neural networks, the loss is non-convex, the Hessian (and hence the local curvature) changes as the parameters move, and computing its largest eigenvalue at every step is prohibitively expensive.
This is why learning rate selection is empirical rather than purely theoretical. We use the theory to understand the qualitative behavior, then tune empirically.
While theory provides bounds, practical learning rate selection relies on accumulated empirical wisdom. These heuristics serve as starting points for systematic tuning.
Heuristic 1: Start with Default Values
Well-tested defaults exist for common optimizers:
| Optimizer | Common Default | Typical Range |
|---|---|---|
| SGD | 0.01 - 0.1 | $10^{-4}$ to $10^{0}$ |
| SGD + Momentum | 0.01 | $10^{-4}$ to $10^{-1}$ |
| Adam | 0.001 | $10^{-5}$ to $10^{-2}$ |
| AdamW | 0.001 | $10^{-5}$ to $10^{-2}$ |
| RMSprop | 0.001 | $10^{-5}$ to $10^{-2}$ |
Adaptive methods (Adam, RMSprop) typically use smaller learning rates because they internally scale gradients.
Heuristic 2: Scale with Batch Size
As covered in the mini-batch SGD page, the linear scaling rule suggests:
$$\eta_{\text{new}} = \eta_{\text{base}} \cdot \frac{B_{\text{new}}}{B_{\text{base}}}$$
If default learning rates assume batch size 32, adjust proportionally for your batch size.
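As a trivial sketch of this adjustment (the helper name and the batch-size-32 baseline are illustrative assumptions):

```python
def scale_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Linear scaling rule: learning rate grows proportionally with batch size."""
    return base_lr * new_batch / base_batch

# Default tuned at batch size 32, but we train at batch size 256:
print(scale_lr(0.1, base_batch=32, new_batch=256))  # -> 0.8
```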
Heuristic 3: Scale with Model/Data
Some practitioners suggest scaling with model and data: larger models and larger datasets often call for different learning rates than the defaults, so treat defaults as starting points rather than fixed values.
When grid-searching learning rates, use factors of ~3: {0.001, 0.003, 0.01, 0.03, 0.1}. Factors of 10 are too coarse (you might miss the optimal region); factors of 2 are too fine for an initial search. Refine with a finer grid after finding the right order of magnitude.
Heuristic 4: Architecture-Specific Guidelines
Common community practice includes SGD with momentum around 0.1 (with decay) for CNN image classifiers, AdamW around $10^{-4}$ to $3 \times 10^{-4}$ with warmup for Transformers, and Adam around $10^{-3}$ with gradient clipping for recurrent networks. These guidelines emerge from extensive community experimentation. When starting with a new architecture, check papers and community practice for initial values.
Heuristic 5: Loss-Specific Considerations
The scale of the loss matters: switching a loss from mean to sum reduction multiplies gradient magnitudes by the batch size, which effectively multiplies the learning rate by the same factor. Re-tune the learning rate whenever the loss formulation changes.
The learning rate range test (also called learning rate finder) is a systematic technique for finding a good initial learning rate. Proposed by Leslie Smith, it's become a standard practice in deep learning.
The Algorithm
The goal is to find the range where increasing learning rate actively improves training, before it becomes unstable.
```python
import numpy as np
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
from typing import Tuple, List


def learning_rate_finder(
    model: nn.Module,
    dataloader,
    loss_fn,
    optimizer_class,
    lr_min: float = 1e-7,
    lr_max: float = 10.0,
    num_steps: int = 100,
    smooth_factor: float = 0.98,
    diverge_threshold: float = 5.0
) -> Tuple[List[float], List[float]]:
    """
    Learning rate range test to find a good initial learning rate.

    Args:
        model: The neural network (restored to its initial state afterwards)
        dataloader: Training data loader
        loss_fn: Loss function
        optimizer_class: Optimizer class (e.g., torch.optim.SGD)
        lr_min: Starting learning rate
        lr_max: Maximum learning rate to test
        num_steps: Number of batches to process
        smooth_factor: Exponential smoothing for loss
        diverge_threshold: Stop if loss exceeds this multiple of best loss

    Returns:
        learning_rates: List of tested LRs
        losses: Smoothed losses at each LR
    """
    # Save initial state for restoration after the test
    initial_state = {k: v.clone() for k, v in model.state_dict().items()}

    # Start with the minimum LR
    optimizer = optimizer_class(model.parameters(), lr=lr_min)

    # Compute LR multiplier per step (geometric progression from lr_min to lr_max)
    lr_mult = (lr_max / lr_min) ** (1 / num_steps)

    learning_rates = []
    losses = []
    smoothed_loss = None
    best_loss = float('inf')
    current_lr = lr_min

    data_iter = iter(dataloader)
    model.train()

    for step in range(num_steps):
        # Get next batch (cycle if needed)
        try:
            inputs, targets = next(data_iter)
        except StopIteration:
            data_iter = iter(dataloader)
            inputs, targets = next(data_iter)

        # Forward pass
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)

        # Exponentially smooth the loss and check for divergence
        if smoothed_loss is None:
            smoothed_loss = loss.item()
        else:
            smoothed_loss = smooth_factor * smoothed_loss + (1 - smooth_factor) * loss.item()
        if smoothed_loss < best_loss:
            best_loss = smoothed_loss
        if smoothed_loss > diverge_threshold * best_loss:
            print(f"Stopping early: loss diverged at LR = {current_lr:.2e}")
            break

        # Record
        learning_rates.append(current_lr)
        losses.append(smoothed_loss)

        # Backward and step
        loss.backward()
        optimizer.step()

        # Increase learning rate
        current_lr *= lr_mult
        for param_group in optimizer.param_groups:
            param_group['lr'] = current_lr

    # Restore initial state so the range test does not affect real training
    model.load_state_dict(initial_state)

    return learning_rates, losses


def suggest_lr(learning_rates: List[float], losses: List[float]) -> float:
    """
    Suggest a learning rate based on the range test results.

    Strategy: find the point where the loss is decreasing fastest,
    then use a somewhat smaller LR for stability.
    """
    log_lrs = np.log10(learning_rates)
    losses = np.array(losses)

    # Numerical gradient of loss with respect to log-LR
    grad = np.gradient(losses, log_lrs)

    # Steepest-descent point
    min_grad_idx = np.argmin(grad)

    # Pick an LR below this point (a factor of 10 smaller often works)
    return learning_rates[min_grad_idx] / 10


def plot_lr_finder(learning_rates: List[float], losses: List[float],
                   suggested_lr: float = None):
    """Plot learning rate finder results."""
    plt.figure(figsize=(10, 6))
    plt.plot(learning_rates, losses, 'b-', linewidth=1)
    plt.xscale('log')
    plt.xlabel('Learning Rate (log scale)')
    plt.ylabel('Loss')
    plt.title('Learning Rate Range Test')
    if suggested_lr is not None:
        plt.axvline(x=suggested_lr, color='r', linestyle='--',
                    label=f'Suggested LR: {suggested_lr:.2e}')
        plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()
```
Interpreting the Plot
The learning rate finder produces a characteristic curve:
Flat region (very small LR): Loss barely changes. Learning rate too small to make progress.
Descent region: Loss decreases as LR increases. This is the useful range.
Minimum region: Loss is lowest. Potential good LR.
Ascent region: Loss increases. LR becoming too large.
Explosion: Loss becomes NaN or infinity. LR far too large.
The Selection Strategy: Choose a learning rate in the descent region, typically 1/10 to 1/3 of the way to the minimum. Choosing at the minimum often leads to instability during longer training.
The LR finder gives a starting point, not the final answer. The optimal LR changes during training (hence schedules). Results may vary with initialization. It's most reliable for finding the order of magnitude. Always validate with actual training runs.
Beyond the learning rate finder, systematic hyperparameter search provides principled approaches to learning rate selection.
Grid Search
The simplest approach: try a pre-defined set of values.
```python
learning_rates = [1e-5, 3e-5, 1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1]

for lr in learning_rates:
    train_and_evaluate(model, lr)
    # Track validation performance
```
Pros: Simple, reproducible, easy to parallelize.
Cons: May miss the optimum if it is not in the grid; scales poorly with dimensions.
Logarithmic Grid: For learning rates, always use log-scale spacing. $10^{-4}$ and $10^{-3}$ are as different as $10^{-3}$ and $10^{-2}$.
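NumPy's `logspace` produces exactly this kind of grid (the endpoints and grid size here are illustrative):

```python
import numpy as np

# Nine candidate LRs, evenly spaced in log10 between 1e-5 and 1e-1
grid = np.logspace(-5, -1, num=9)
print(grid)

# Consecutive candidates differ by a constant *factor* (~3.16x here),
# not a constant additive step, matching how learning rates behave.
ratios = grid[1:] / grid[:-1]
```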
Random Search
Sample learning rates uniformly in log space:
```python
import numpy as np

# Sample 50 learning rates in [1e-5, 1e-1], log-uniform
lr_samples = 10 ** np.random.uniform(-5, -1, 50)

for lr in lr_samples:
    train_and_evaluate(model, lr)
```
Advantage over grid: More likely to hit good values; doesn't waste evaluations on clearly bad regions.
Random search often outperforms grid search for the same budget, especially when not all hyperparameters are equally important (which is typically the case).
Successive Halving (Hyperband)
Efficient early stopping of bad configurations: train every candidate learning rate for a small budget, keep only the best-performing fraction, increase the budget, and repeat until a single candidate remains.
This allocates more resources to promising learning rates while quickly eliminating poor choices. Hyperband is a principled version of this that determines the optimal resource allocation.
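The procedure can be sketched in a few lines. This is a simplified illustration, not Hyperband itself; `train_briefly` is a hypothetical stand-in for your training routine, and the toy scoring function below just pretends the best LR is 0.01:

```python
import numpy as np

def successive_halving(candidates, train_briefly, budget_per_round=1, factor=2):
    """
    Simplified successive halving: train all candidates for a small budget,
    keep the best 1/factor fraction, multiply the budget, and repeat.

    `train_briefly(lr, budget)` must return a validation loss (lower is better).
    """
    budget = budget_per_round
    survivors = list(candidates)
    while len(survivors) > 1:
        scores = [train_briefly(lr, budget) for lr in survivors]
        k = max(1, len(survivors) // factor)  # keep at least one candidate
        order = np.argsort(scores)
        survivors = [survivors[i] for i in order[:k]]
        budget *= factor  # survivors get more training next round
    return survivors[0]

# Toy stand-in: pretend validation loss is minimized at lr = 0.01
best = successive_halving(
    [1e-4, 1e-3, 1e-2, 1e-1, 1.0],
    train_briefly=lambda lr, budget: abs(np.log10(lr) - np.log10(1e-2)),
)
print(best)  # -> 0.01
```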
For a single hyperparameter search: (1) Use LR finder to get the rough range, (2) Grid search over ~5 values around that range, (3) Refine with smaller grid. For multiple hyperparameters: use random search or Bayesian optimization with Optuna or similar tools.
Training curves encode rich information about learning rate appropriateness. Learning to read these curves is an essential skill for debugging optimization.
Symptom 1: Loss Explodes (NaN/Inf)
Cause: Learning rate far too large. Updates overshoot so badly that numerical overflow occurs.
Evidence: Loss becomes NaN or Inf within the first few steps, often after growing by orders of magnitude per update.
Fix: Reduce learning rate by 10× or more. If using warmup, slow the warmup.
Symptom 2: Loss Oscillates Wildly
Cause: Learning rate too large, but not catastrophically. Updates overshoot, then overcorrect.
Evidence: Loss jumps up and down between consecutive steps with little net progress; the curve looks like noise around a flat or slowly drifting level.
Fix: Reduce learning rate by 2-5×. Consider adding gradient clipping.
Symptom 3: Loss Decreases Extremely Slowly
Cause: Learning rate too small. Making negligible progress per step.
Evidence: Loss decreases smoothly but by a negligible amount per epoch; the curve is nearly flat even early in training.
Fix: Increase learning rate by 3-10×. Ensure you haven't accidentally applied decay twice (for example, two schedulers both stepping).
| Observation | Likely Cause | Action |
|---|---|---|
| Loss is NaN after few steps | LR way too high | Reduce LR by 10-100× |
| Loss spikes then recovers | LR slightly too high | Reduce LR by 2-3× |
| Wild oscillation in loss | LR too high / batch too small | Reduce LR; consider larger batch |
| Steady but slow decrease | LR too low | Increase LR by 3-10× |
| Loss flat for many epochs | LR way too low or stuck | Increase LR; check for dead ReLUs |
| Train decreases, val increases | Overfitting (not LR directly) | Regularization; maybe reduce LR late |
| Loss plateaus, won't improve | LR too high to find fine minimum | Decay LR; add schedule step |
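As a rough companion to the table above, a loss history can be triaged automatically. This is a simplistic sketch with arbitrary thresholds, not a substitute for looking at the curves:

```python
import math

def diagnose_lr(losses):
    """Crude triage of a loss history, mirroring the diagnosis table."""
    if any(math.isnan(x) or math.isinf(x) for x in losses):
        return "explosion: reduce LR by 10-100x"
    rel_change = (losses[0] - losses[-1]) / abs(losses[0])
    # Oscillation: average step-to-step movement far exceeds net progress
    jitter = sum(abs(b - a) for a, b in zip(losses, losses[1:])) / (len(losses) - 1)
    net_per_step = abs(losses[0] - losses[-1]) / (len(losses) - 1)
    if jitter > 5 * net_per_step and rel_change < 0.5:
        return "oscillation: reduce LR by 2-5x"
    if 0 <= rel_change < 0.01:
        return "too slow: increase LR by 3-10x"
    return "looks reasonable"

print(diagnose_lr([2.3, 1.8, 1.4, 1.1, 0.9]))            # steady decrease
print(diagnose_lr([2.3, 2.3, 2.29, 2.29, 2.29]))         # barely moving
print(diagnose_lr([2.0, 3.0, 1.5, 3.5, 1.8]))            # wild oscillation
print(diagnose_lr([2.3, 4.0, 1.5, 5.2, float('nan')]))   # blown up
```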
Symptom 4: Fast Initial Progress, Then Plateau
Cause: Learning rate appropriate for early training but too large for fine-tuning near minimum.
Evidence: Loss drops quickly at first, then flattens at a value above what the model should reach, often with small persistent oscillations around the plateau.
Fix: Apply learning rate decay. Use cosine annealing or step decay.
Symptom 5: Training and Validation Diverge
Cause: Could be overfitting (not directly LR), but large LR can exacerbate.
Evidence: Training loss keeps decreasing while validation loss bottoms out and begins to rise.
Fix: This is primarily a regularization issue, but reducing LR slightly may help. Focus on dropout, weight decay, or data augmentation first.
If you're uncertain whether to increase or decrease: try changing by 10×. If the problem gets worse, go the other direction. This may seem crude, but learning rate issues are often order-of-magnitude problems, not fine-tuning problems.
Not all parameters are equal. Different layers may benefit from different learning rates. This technique, called differential learning rates or layer-wise learning rates, is especially useful for transfer learning.
Why Different Layers Need Different Rates
Pre-trained layers: Should change slowly to preserve learned features. Small LR.
New layers: Need to learn from scratch. Larger LR.
Early vs. late layers: In transfer learning, early layers (edges, textures) are more general. Later layers (object parts, semantics) are more task-specific.
The Discriminative Fine-Tuning Strategy
For a model with $L$ layers, use learning rates: $$\eta_l = \eta_{\text{base}} \cdot \gamma^{L-l}$$
where $\gamma < 1$ (e.g., $\gamma = 0.9$ or $\gamma = 0.5$). This gives exponentially smaller LR to earlier layers.
Example: with base LR $= 0.001$, $\gamma = 0.5$, and 4 layers, the rates are $0.001 \cdot 0.5^3 = 0.000125$ for layer 1, $0.00025$ for layer 2, $0.0005$ for layer 3, and $0.001$ for layer 4.
```python
import torch
import torch.nn as nn


def get_parameter_groups(model: nn.Module, base_lr: float,
                         lr_decay: float = 0.9) -> list:
    """
    Create parameter groups with layer-wise learning rates.
    Earlier layers get smaller learning rates.

    Args:
        model: Neural network
        base_lr: Learning rate for the final layer
        lr_decay: Multiplicative factor for each layer back

    Returns:
        List of parameter groups for the optimizer
    """
    # Group parameters by top-level module name
    # (this example assumes a simple sequential model)
    layer_params = {}
    for name, param in model.named_parameters():
        layer_key = name.split('.')[0]  # e.g., "layer1", "conv2"
        layer_params.setdefault(layer_key, []).append(param)

    # Assign learning rates (reverse order: last layer gets base_lr)
    param_groups = []
    layer_names = list(layer_params.keys())
    num_layers = len(layer_names)
    for i, layer_name in enumerate(layer_names):
        # Earlier layers (smaller i) get smaller LR
        layer_lr = base_lr * (lr_decay ** (num_layers - 1 - i))
        param_groups.append({
            'params': layer_params[layer_name],
            'lr': layer_lr,
            'name': layer_name  # for logging
        })
        print(f"Layer {layer_name}: LR = {layer_lr:.6f}")

    return param_groups


# For the pre-trained backbone + new head scenario
def freeze_and_differential(model, pretrained_lr, head_lr):
    """Common pattern: small LR for pretrained layers, large LR for the new head."""
    pretrained_params = []
    head_params = []
    for name, param in model.named_parameters():
        if 'classifier' in name or 'fc' in name or 'head' in name:
            head_params.append(param)        # typically new layers
        else:
            pretrained_params.append(param)  # pretrained backbone

    return [
        {'params': pretrained_params, 'lr': pretrained_lr},
        {'params': head_params, 'lr': head_lr},
    ]


# Usage:
# param_groups = freeze_and_differential(model,
#                                        pretrained_lr=1e-5,
#                                        head_lr=1e-3)
# optimizer = torch.optim.Adam(param_groups)
```
Per-layer learning rates are most useful for: transfer learning (pre-trained + new layers), fine-tuning large models (BERT, GPT, etc.), and very deep networks where gradients vary by layer. For training from scratch on standard architectures, a uniform LR often suffices.
Learning rate doesn't exist in isolation. Its optimal value depends on and interacts with other hyperparameters. Understanding these interactions is crucial for effective tuning.
Interaction with Batch Size
As established by the linear scaling rule: when you multiply the batch size by $k$, multiply the learning rate by $k$ as well (typically combined with warmup), because larger batches produce lower-variance gradient estimates that tolerate larger steps.
Interaction with Weight Decay
Weight decay ($\lambda$) and learning rate interact: with classic (coupled) L2 regularization, each SGD step shrinks the weights by a factor of $(1 - \eta\lambda)$, so the effective regularization strength scales with the learning rate. AdamW decouples the two, which is one reason it is easier to tune.
If you change LR, you may need to re-tune weight decay.
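The coupling is easy to see with a parameter that receives zero gradient from the loss, so the only update comes from weight decay. This sketch uses plain `torch.optim.SGD` (the specific values are arbitrary):

```python
import torch

# With SGD's coupled weight decay, the per-step shrinkage is
# (1 - lr * weight_decay) -- it depends on the learning rate.
def decayed_weight(lr, weight_decay, steps=10):
    w = torch.nn.Parameter(torch.tensor([1.0]))
    opt = torch.optim.SGD([w], lr=lr, weight_decay=weight_decay)
    for _ in range(steps):
        opt.zero_grad()
        loss = 0.0 * w.sum()  # loss contributes no gradient
        loss.backward()
        opt.step()
    return w.item()

# Same weight_decay, different LR -> different effective regularization
print(decayed_weight(lr=0.1, weight_decay=0.01))   # ~ (1 - 0.001)^10
print(decayed_weight(lr=0.01, weight_decay=0.01))  # ~ (1 - 0.0001)^10
```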
| Hyperparameter | Interaction with LR | Tuning Strategy |
|---|---|---|
| Batch size | Linear scaling rule applies | Tune together; LR ∝ B |
| Weight decay | Effective regularization scales with LR | Re-tune if LR changes significantly |
| Momentum | Higher momentum → need smaller LR | Usually keep momentum fixed, tune LR |
| Gradient clipping | Clipping reduces effective LR on large-grad steps | May need larger LR if heavily clipping |
| Dropout | Indirect (affects gradient variance) | Usually tuned independently |
| Network depth/width | Deeper/wider often needs smaller LR | Check stability; adjust as needed |
Interaction with Momentum
Momentum accelerates updates by accumulating gradient history: $$\mathbf{v}^{(t)} = \beta \mathbf{v}^{(t-1)} + \nabla J(\boldsymbol{\theta}^{(t)})$$ $$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \eta \mathbf{v}^{(t)}$$
Higher momentum ($\beta$ closer to 1) means larger accumulated updates: with a roughly constant gradient, the velocity approaches $1/(1-\beta)$ times a single gradient, so the effective step size is about $\eta/(1-\beta)$. Raising momentum therefore usually requires lowering the learning rate to keep the effective step size comparable.
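The $\eta/(1-\beta)$ effective-step-size claim can be checked by iterating the velocity update with a constant gradient:

```python
# With a constant gradient g, the momentum buffer accumulates the
# geometric series g * (1 + beta + beta^2 + ...) = g / (1 - beta),
# so the effective step size is eta / (1 - beta).
def steady_state_velocity(beta, g=1.0, steps=500):
    v = 0.0
    for _ in range(steps):
        v = beta * v + g
    return v

for beta in (0.5, 0.9, 0.99):
    v = steady_state_velocity(beta)
    print(f"beta={beta}: velocity -> {v:.2f} (predicted {1 / (1 - beta):.2f})")
```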
Interaction with Optimizer
Different optimizers have different LR sensitivities: plain SGD is highly sensitive, where an LR wrong by 10× can mean divergence or stagnation, while Adam and other adaptive methods normalize per-parameter step sizes and tend to work across a wider range, though they still benefit from tuning.
When tuning multiple hyperparameters: (1) Fix others at reasonable defaults, (2) Tune learning rate first, (3) Tune batch size (adjust LR proportionally), (4) Tune regularization (weight decay, dropout), (5) Fine-tune LR again if needed. Learning rate is the most impactful, so start there.
Years of community experience have revealed common mistakes in learning rate selection. Knowing these pitfalls helps you avoid them.
Pitfall 1: Using the Same LR for All Tasks
A learning rate that works for ImageNet classification may be wrong for other setups: a different task (detection, language modeling), a different data scale, fine-tuning instead of training from scratch, or a different loss function.
Best practice: Always tune LR for your specific setup. Start from recommended defaults but validate.
Pitfall 2: Forgetting to Adjust for Precision
Mixed precision (FP16) training can require learning rate adjustments: FP16's narrow dynamic range lets small gradient values underflow to zero (loss scaling mitigates this), while very large updates risk overflow. Re-validate your learning rate after switching precision.
Pitfall 3: Not Using Schedules
Fixed learning rate is almost never optimal for long training. The loss surface changes as you approach different regions. Early training benefits from exploration (larger LR); late training needs precision (smaller LR).
Best practice: Always use a learning rate schedule. Cosine annealing is a robust default.
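A minimal sketch of cosine annealing using PyTorch's built-in scheduler (the model, base LR, and horizon are placeholder choices):

```python
import torch

model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
# Half a cosine from the base LR (0.1) down to ~0 over 100 steps
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=100)

lrs = []
for step in range(100):
    # ... forward/backward pass would go here ...
    opt.step()
    sched.step()
    lrs.append(opt.param_groups[0]['lr'])

# Starts near 0.1, ends near 0
print(lrs[0], lrs[-1])
```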
Pitfall 4: Over-Engineering the Schedule
Complex schedules (warm restarts, cyclical rates, custom rules) can help but also add tuning burden. Simple schedules often work nearly as well.
Best practice: Start simple (step decay or cosine). Add complexity only if simple schedules underperform after tuning.
We've thoroughly covered learning rate selection—the most critical hyperparameter in gradient-based optimization. Let's consolidate the essential knowledge:
What's Next
The final page in this module covers convergence analysis—the theoretical study of when and how fast gradient descent variants converge. We'll formalize the intuitions built in previous pages with rigorous mathematical treatment.
You now have the tools to select, tune, and diagnose learning rates effectively. This skill is foundational—virtually every ML training pipeline requires learning rate tuning. The techniques here apply whether you're training a simple linear model or a billion-parameter language model.