If there's one hyperparameter that can make or break your machine learning model, it's the learning rate (α). Set it too high, and your model explodes—losses oscillate wildly or diverge to infinity. Set it too low, and training crawls at a glacial pace, potentially getting stuck in poor local minima. Get it just right, and your model converges smoothly to an excellent solution.
Yet finding this "just right" value is notoriously challenging. There's no universal formula. The optimal learning rate depends on your model architecture, loss function, data distribution, batch size, and even where you are in training. This page equips you with the theory and practical techniques to navigate this critical decision.
By the end of this page, you will understand why the learning rate is so critical, derive theoretical bounds, visualize the effects of different learning rates, implement learning rate schedules, and master practical tuning strategies used by ML practitioners worldwide.
The learning rate α appears in the gradient descent update rule:
$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \alpha \nabla J(\boldsymbol{\theta}^{(t)})$$
It's a multiplicative factor that scales the gradient. Geometrically, the gradient tells us which direction to move, while α tells us how far to move in that direction.
Think of it like this:
Imagine you're blindfolded on a hilly landscape, trying to reach the lowest valley. You can feel the slope beneath your feet (the gradient). The learning rate determines whether you:

- Take tiny, cautious shuffles: you will never stumble, but reaching the valley takes ages.
- Take confident strides: steady, efficient progress downhill.
- Take giant leaps: you may jump clear over the valley floor and land higher up the opposite slope.
Mathematical perspective:
The gradient $\nabla J$ measures how fast the loss changes per unit change in each parameter. Scaling it by α converts that slope into an actual displacement: the magnitude $|\alpha \nabla J|$ is the step size taken in parameter space.
For a gradient with magnitude 10:

- α = 0.001 produces a step of size 0.01 (barely moves).
- α = 0.1 produces a step of size 1.0 (substantial progress).
- α = 1.0 produces a step of size 10 (likely overshoots).
The difference between these can mean convergence in 100 iterations versus 100,000 iterations—or divergence.
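To make the arithmetic concrete, here is a minimal sketch using a hypothetical 1D loss $J(\theta) = 5\theta^2$, whose gradient at θ = 1 has magnitude 10, as in the example above:

```python
# Hypothetical 1D loss: J(θ) = 5θ², so ∇J(θ) = 10θ.
# At θ = 1 the gradient magnitude is 10, matching the example above.
theta = 1.0
grad = 10 * theta

for alpha in [0.001, 0.01, 0.1, 1.0]:
    step = alpha * grad  # |α ∇J|: the actual distance moved in parameter space
    print(f"α = {alpha:>5}: step size = {step:>5.2f}, θ: 1.00 → {theta - step:.2f}")
```

Note the last case: with α = 1.0 the update jumps from θ = 1 to θ = −9, far past the minimum at 0.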
Theory provides guidance on learning rate selection. Recall from the previous page that for convergence, we need:
$$\alpha \leq \frac{1}{L}$$
where $L$ is the Lipschitz constant of the gradient (measuring how fast the gradient changes). But what does this mean practically?
For Quadratic Functions:
Consider $J(\theta) = \frac{1}{2} L \theta^2$ (a perfect parabola). The gradient is $\nabla J = L\theta$, and the Lipschitz constant is $L$ itself.
The gradient descent update becomes: $$\theta^{(t+1)} = \theta^{(t)} - \alpha L \theta^{(t)} = (1 - \alpha L) \theta^{(t)}$$
For convergence, we need $|1 - \alpha L| < 1$, which gives: $$0 < \alpha < \frac{2}{L}$$
If $\alpha > \frac{2}{L}$, the factor $(1 - \alpha L) < -1$, and $|\theta|$ grows each iteration—divergence!
For example, with $L = 2$:

| Learning Rate α | Factor (1 − αL) | Behavior | Convergence Speed |
|---|---|---|---|
| 0.001 | 0.998 | Converges very slowly | ~350 iters to halve |
| 0.01 | 0.98 | Converges slowly | ~35 iters to halve |
| 0.1 | 0.8 | Converges well | ~3 iters to halve |
| 0.5 | 0.0 | Converges in 1 step! | Optimal for quadratic |
| 0.9 | -0.8 | Oscillates but converges | Zig-zag descent |
| 1.0 | -1.0 | Oscillates forever | No convergence |
| 1.5 | -2.0 | Diverges exponentially | Loss explodes |
For a pure quadratic with curvature L, the optimal learning rate is α = 1/L, which gives convergence in a single step! In practice, loss functions aren't perfect quadratics, but this guides intuition: adapt α to local curvature.
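The table's entries follow directly from the contraction factor: $|\theta|$ halves once $|1 - \alpha L|^n = 0.5$. A quick numerical check (assuming $L = 2$, which matches the factors above):

```python
import numpy as np

# Verify the table: θ ← (1 − αL)·θ for J(θ) = ½·L·θ², with L = 2.
L = 2.0
for alpha in [0.001, 0.01, 0.1, 0.5, 0.9, 1.0, 1.5]:
    factor = 1 - alpha * L
    if factor == 0:
        note = "converges in one step"
    elif abs(factor) < 1:
        n_halve = np.log(0.5) / np.log(abs(factor))  # solve |factor|^n = 0.5
        note = f"~{n_halve:.0f} iterations to halve |θ|"
    elif abs(factor) == 1:
        note = "oscillates forever, never converges"
    else:
        note = "diverges"
    print(f"α = {alpha:>5}: factor = {factor:+.3f} → {note}")
```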
For General Smooth Functions:
The Lipschitz constant $L$ bounds the second derivative (Hessian). Intuitively:

- Large $L$: the gradient changes rapidly (sharp curvature), so only small steps are safe.
- Small $L$: the gradient changes slowly (gentle curvature), so large steps are safe.
Connection to the Hessian:
For twice-differentiable functions, $L \geq \lambda_{\max}(\mathbf{H})$ at every point, where $\lambda_{\max}$ is the largest eigenvalue of the Hessian ($L$ is its supremum over the domain). This means:
$$\alpha \leq \frac{1}{\lambda_{\max}(\mathbf{H})}$$
The more curved the loss surface, the smaller the safe learning rate.
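As a small sketch of this bound for the elongated quadratic used in the demo below (its Hessian is the constant matrix $\mathbf{A} = \mathrm{diag}(4, 1)$):

```python
import numpy as np

# For a quadratic J(θ) = ½ θᵀAθ, the Hessian is A itself.
H = np.array([[4.0, 0.0],
              [0.0, 1.0]])
lam_max = np.linalg.eigvalsh(H).max()  # largest eigenvalue

print(f"λ_max = {lam_max}")                             # 4.0
print(f"Safe step:      α ≤ 1/λ_max = {1 / lam_max}")   # 0.25
print(f"Divergence at:  α ≥ 2/λ_max = {2 / lam_max}")   # 0.5
```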
Let's make these concepts concrete with visualizations. We'll optimize a 2D quadratic and observe how different learning rates affect the trajectory.
```python
import numpy as np
import matplotlib.pyplot as plt

def quadratic_loss(theta, A=None):
    """
    Quadratic loss: J(θ) = 0.5 * θ^T A θ
    A determines the shape (curvature in each direction)
    """
    if A is None:
        A = np.array([[4, 0], [0, 1]])  # Elongated bowl
    return 0.5 * theta @ A @ theta

def quadratic_grad(theta, A=None):
    """Gradient of quadratic: ∇J = Aθ"""
    if A is None:
        A = np.array([[4, 0], [0, 1]])
    return A @ theta

def gradient_descent_trajectory(theta_init, lr, n_iters, A=None):
    """Run GD and record trajectory"""
    trajectory = [theta_init.copy()]
    losses = [quadratic_loss(theta_init, A)]
    theta = theta_init.copy()
    for _ in range(n_iters):
        grad = quadratic_grad(theta, A)
        theta = theta - lr * grad
        trajectory.append(theta.copy())
        losses.append(quadratic_loss(theta, A))
    return np.array(trajectory), np.array(losses)

# Test different learning rates
theta_init = np.array([3.0, 3.0])
A = np.array([[4, 0], [0, 1]])  # Lipschitz constant L = 4

learning_rates = [0.05, 0.2, 0.45, 0.55]
labels = ['α=0.05 (too small)', 'α=0.2 (good)',
          'α=0.45 (near optimal)', 'α=0.55 (too large)']
colors = ['blue', 'green', 'orange', 'red']

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left plot: Trajectories in 2D
ax1 = axes[0]

# Plot loss contours
x = np.linspace(-4, 4, 100)
y = np.linspace(-4, 4, 100)
X, Y = np.meshgrid(x, y)
Z = 0.5 * (4 * X**2 + Y**2)  # J = 0.5 * (4x² + y²)
ax1.contour(X, Y, Z, levels=20, cmap='gray', alpha=0.5)

for lr, label, color in zip(learning_rates, labels, colors):
    traj, losses = gradient_descent_trajectory(theta_init, lr, 50, A)
    ax1.plot(traj[:, 0], traj[:, 1], 'o-', color=color, markersize=3,
             linewidth=1.5, label=label, alpha=0.8)

ax1.scatter([0], [0], color='black', s=100, marker='*', zorder=10, label='Minimum')
ax1.scatter([theta_init[0]], [theta_init[1]], color='black', s=60, marker='s', zorder=10)
ax1.set_xlabel('θ₁')
ax1.set_ylabel('θ₂')
ax1.set_title('Gradient Descent Trajectories')
ax1.legend(loc='upper right', fontsize=8)
ax1.set_xlim(-4, 4)
ax1.set_ylim(-4, 4)

# Right plot: Loss curves
ax2 = axes[1]
for lr, label, color in zip(learning_rates, labels, colors):
    _, losses = gradient_descent_trajectory(theta_init, lr, 50, A)
    ax2.semilogy(losses, color=color, linewidth=2, label=label)

ax2.set_xlabel('Iteration')
ax2.set_ylabel('Loss (log scale)')
ax2.set_title('Loss Convergence (Log Scale)')
ax2.legend(fontsize=8)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('learning_rate_effects.png', dpi=150)
plt.show()

# Demonstrate divergence
print("\n=== Divergence Example (α too large) ===")
theta_init_div = np.array([1.0, 1.0])
traj_div, losses_div = gradient_descent_trajectory(theta_init_div, 0.6, 10, A)
print(f"Initial loss: {losses_div[0]:.4f}")
print(f"After 5 iters: {losses_div[5]:.4f}")
print(f"After 10 iters: {losses_div[10]:.4f}")
print("Loss is increasing → DIVERGENCE!")
```

What the visualization reveals:
α = 0.05 (blue): The trajectory inches toward the minimum but makes painfully slow progress. After 50 iterations, it's still far from optimal.
α = 0.2 (green): Healthy progress. The trajectory descends efficiently, though slightly zig-zagging due to the elongated loss surface.
α = 0.45 (orange): Near-optimal. Very fast convergence, though you can see slight oscillations.
α = 0.55 (red): Too large! With λ_max = 4, the contraction factor in the steep direction is 1 − 0.55 × 4 = −1.2, so each step overshoots the minimum and the oscillations grow; the iterates diverge.
The elongated bowl effect:
Notice that our loss surface is shaped like an elongated bowl (A = diag(4, 1)), with more curvature in the θ₁ direction. This means:

- The steep θ₁ direction dictates the safe learning rate: α < 2/4 = 0.5.
- Any α that is safe for θ₁ makes progress along the flat θ₂ direction (curvature 1) four times slower.
- The trajectory zig-zags across the steep direction while only inching along the flat one.
With a single learning rate, we're limited by the most curved direction. This is why ill-conditioned problems (very elongated loss surfaces) are difficult to optimize.
The condition number κ of a matrix measures how elongated the loss surface is:
$$\kappa = \frac{\lambda_{\max}}{\lambda_{\min}}$$
where λ_max and λ_min are the largest and smallest eigenvalues of the Hessian.
Why it matters:
With learning rate $\alpha = 1/\lambda_{\max}$ (the standard safe choice), progress in the flattest direction (λ_min) is: $$\Delta \theta_{\text{flat}} \propto \frac{\lambda_{\min}}{\lambda_{\max}} = \frac{1}{\kappa}$$
Convergence rate depends on the condition number: gradient descent needs roughly $O(\kappa \log(1/\epsilon))$ iterations to reach accuracy $\epsilon$.
Example:
| Condition κ | Shape | Iterations to Converge | Learning Rate Challenge |
|---|---|---|---|
| 1 | Perfect sphere | ~10 | Any α in range works well |
| 10 | Slightly elongated | ~100 | Moderate tuning needed |
| 100 | Very elongated | ~1,000 | Careful tuning required |
| 1,000 | Extremely elongated | ~10,000 | May need adaptive methods |
| 1,000,000 | Pathological | ~1,000,000 | Standard GD nearly useless |
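To see the κ-scaling empirically, here is a short sketch that runs gradient descent on $J(\boldsymbol{\theta}) = \frac{1}{2}(\kappa \theta_1^2 + \theta_2^2)$ with $\alpha = 1/\lambda_{\max}$ and counts iterations until $\|\boldsymbol{\theta}\|_\infty < 10^{-6}$ (the exact counts depend on the tolerance, but the linear growth in κ is the point):

```python
import numpy as np

def iters_to_converge(kappa, tol=1e-6):
    """GD on J(θ) = ½(κ·θ₁² + θ₂²) with α = 1/λ_max; count iterations."""
    A = np.array([[kappa, 0.0], [0.0, 1.0]])  # eigenvalues: κ and 1
    theta = np.array([1.0, 1.0])
    alpha = 1.0 / kappa                        # λ_max = κ
    for t in range(10**7):
        if np.abs(theta).max() < tol:
            return t
        theta = theta - alpha * (A @ theta)
    return None

for kappa in [1, 10, 100, 1000]:
    print(f"κ = {kappa:>5}: {iters_to_converge(kappa):>6} iterations")
```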
Real neural networks often have condition numbers in the thousands or millions before applying techniques like batch normalization, weight initialization, and adaptive optimizers. This is why vanilla gradient descent with a fixed learning rate struggles on deep networks.
Solutions to ill-conditioning:

- Feature scaling and standardization: normalizing inputs makes the loss surface rounder.
- Momentum: accumulates velocity along flat directions while damping oscillations in steep ones.
- Adaptive optimizers (AdaGrad, RMSprop, Adam): per-parameter effective learning rates (see below).
- Normalization layers (e.g., batch normalization) and careful weight initialization.
- Second-order or preconditioned methods: rescale updates using curvature information directly.
A fixed learning rate is often suboptimal. Early in training, we're far from the optimum—large steps are safe and efficient. As we approach the minimum, we need smaller steps for precision. Learning rate schedules adapt α over time.
Common Schedules:
| Schedule | Formula | Use Case | Advantage |
|---|---|---|---|
| Step Decay | α(t) = α₀ × γ^⌊t/s⌋ | When plateaus are predictable | Simple, interpretable |
| Exponential Decay | α(t) = α₀ × e^(-kt) | Smooth decrease | Continuous, no sudden drops |
| 1/t Decay | α(t) = α₀ / (1 + kt) | Theoretical guarantees | Provable convergence |
| Cosine Annealing | α(t) = α_min + ½(α₀-α_min)(1+cos(πt/T)) | Cyclical training | Smooth with warm restarts |
| Warmup + Decay | Linear increase, then decay | Large models, transformers | Stabilizes early training |
```python
import numpy as np
import matplotlib.pyplot as plt

def step_decay(epoch, initial_lr=0.1, drop_rate=0.5, epochs_drop=10):
    """Step decay: drop by factor every epochs_drop epochs"""
    return initial_lr * (drop_rate ** (epoch // epochs_drop))

def exponential_decay(epoch, initial_lr=0.1, decay_rate=0.05):
    """Exponential decay: α(t) = α₀ × e^(-kt)"""
    return initial_lr * np.exp(-decay_rate * epoch)

def inverse_decay(epoch, initial_lr=0.1, decay_rate=0.01):
    """Inverse time decay: α(t) = α₀ / (1 + kt)"""
    return initial_lr / (1 + decay_rate * epoch)

def cosine_annealing(epoch, initial_lr=0.1, min_lr=0.001, total_epochs=100):
    """Cosine annealing: smooth oscillation between max and min"""
    return min_lr + 0.5 * (initial_lr - min_lr) * (1 + np.cos(np.pi * epoch / total_epochs))

def warmup_decay(epoch, initial_lr=0.1, warmup_epochs=10, total_epochs=100):
    """Linear warmup followed by cosine decay"""
    if epoch < warmup_epochs:
        return initial_lr * epoch / warmup_epochs
    else:
        progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
        return initial_lr * 0.5 * (1 + np.cos(np.pi * progress))

# Visualize all schedules
epochs = np.arange(100)
schedules = {
    'Step Decay': [step_decay(e) for e in epochs],
    'Exponential Decay': [exponential_decay(e) for e in epochs],
    'Inverse Decay': [inverse_decay(e) for e in epochs],
    'Cosine Annealing': [cosine_annealing(e) for e in epochs],
    'Warmup + Decay': [warmup_decay(e) for e in epochs],
}

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
for name, lrs in schedules.items():
    plt.plot(epochs, lrs, linewidth=2, label=name)
plt.xlabel('Epoch')
plt.ylabel('Learning Rate')
plt.title('Learning Rate Schedules (Linear Scale)')
plt.legend()
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
for name, lrs in schedules.items():
    plt.semilogy(epochs, lrs, linewidth=2, label=name)
plt.xlabel('Epoch')
plt.ylabel('Learning Rate (Log Scale)')
plt.title('Learning Rate Schedules (Log Scale)')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('learning_rate_schedules.png', dpi=150)
plt.show()
```

Transformer models (BERT, GPT, etc.) universally use warmup. Early in training, with random weights, gradients can be unstable. Starting with a tiny learning rate and gradually increasing prevents catastrophic updates. After warmup, decay to refine the solution.
In practice, you can't compute the Lipschitz constant. Here are battle-tested strategies for finding good learning rates:
1. The "Baby Steps" Approach
Start with a very small learning rate (e.g., 1e-5). If training is stable, multiply by 3. Repeat until:

- the loss starts oscillating or increasing, or
- the loss diverges outright.

Then back off to the last stable value. A runnable toy version of this loop is sketched below.
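Here is that loop as a toy sketch; `train_briefly` is a stand-in for a short real training run (it does gradient descent on a small quadratic), not an API from any library:

```python
import numpy as np

def train_briefly(lr, n_steps=100):
    """Toy stand-in for a short training run: GD on a quadratic with λ_max = 4.
    Returns True if the loss decreased (stable), False otherwise."""
    A = np.array([[4.0, 0.0], [0.0, 1.0]])
    theta = np.array([3.0, 3.0])
    loss0 = 0.5 * theta @ A @ theta
    for _ in range(n_steps):
        theta = theta - lr * (A @ theta)
    return 0.5 * theta @ A @ theta < loss0

lr, best_lr = 1e-5, None
while train_briefly(lr):
    best_lr = lr
    lr *= 3  # roughly half an order of magnitude per trial
print(f"Largest stable learning rate found: {best_lr:.1e}")
```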
2. Grid Search / Random Search
Search over logarithmically spaced values: [1e-5, 3e-5, 1e-4, 3e-4, 1e-3, 3e-3, 1e-2]. For random search, sample the exponent uniformly instead (e.g., lr = 10^u with u ~ Uniform(−5, −2)). A compact sketch of the grid version follows.
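A minimal version of the grid search, with a toy quadratic standing in for the real train-and-evaluate step (`short_run_loss` is a hypothetical stand-in, not a real training API):

```python
import numpy as np

# Logarithmically spaced candidates, alternating 1×10^k and 3×10^k.
candidates = [1e-5, 3e-5, 1e-4, 3e-4, 1e-3, 3e-3, 1e-2]

def short_run_loss(lr, n_steps=200):
    """Stand-in for 'train briefly, report validation loss': GD on a quadratic."""
    A = np.array([[4.0, 0.0], [0.0, 1.0]])
    theta = np.array([3.0, 3.0])
    for _ in range(n_steps):
        theta = theta - lr * (A @ theta)
    return 0.5 * theta @ A @ theta

results = {lr: short_run_loss(lr) for lr in candidates}
best = min(results, key=results.get)
print(f"Best candidate: {best:.0e} (loss {results[best]:.3e})")
```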
3. Learning Rate Range Test (Leslie Smith)
The most efficient modern approach:

1. Start from a tiny learning rate (e.g., 1e-7).
2. Train for a few hundred mini-batches, multiplying the learning rate by a constant factor after each batch so it grows exponentially toward a large value (e.g., 10).
3. Record the (smoothed) loss at each step, stopping early if it explodes.
4. Plot loss versus learning rate on a log axis and pick a value from the steeply descending region.
```python
import numpy as np
import matplotlib.pyplot as plt

def lr_range_test(model, train_loader, criterion, optimizer_class,
                  min_lr=1e-7, max_lr=10, num_steps=100):
    """
    Learning Rate Range Test (LR Finder)

    Exponentially increase the learning rate while recording the loss.
    Plot loss vs lr to find the optimal learning rate region.
    """
    # Multiplier giving exponentially spaced learning rates
    lr_mult = (max_lr / min_lr) ** (1 / num_steps)
    lrs, losses = [], []
    lr = min_lr

    optimizer = optimizer_class(model.parameters(), lr=min_lr)

    # Save initial model state so the test leaves the model untouched
    initial_state = {k: v.clone() for k, v in model.state_dict().items()}
    model.train()

    smooth_loss = 0
    for step, (X, y) in enumerate(train_loader):
        if step >= num_steps:
            break

        # Set the learning rate for this step
        for param_group in optimizer.param_groups:
            param_group['lr'] = lr

        # Forward pass
        optimizer.zero_grad()
        loss = criterion(model(X), y)

        # Record exponentially smoothed loss
        if step == 0:
            smooth_loss = loss.item()
        else:
            smooth_loss = 0.98 * smooth_loss + 0.02 * loss.item()
        lrs.append(lr)
        losses.append(smooth_loss)

        # Stop if the loss explodes
        if step > 0 and smooth_loss > 4 * min(losses):
            break

        # Backward pass and update
        loss.backward()
        optimizer.step()

        # Increase the learning rate
        lr *= lr_mult

    # Restore the model to its initial state
    model.load_state_dict(initial_state)

    return lrs, losses

def suggest_lr(lrs, losses):
    """
    Suggest a learning rate from range test results:
    the point of steepest descent (most negative slope) before the minimum.
    """
    min_idx = np.argmin(losses)
    slopes = np.diff(losses) / np.diff(np.log10(lrs))
    steepest_idx = np.argmin(slopes[:min_idx])
    return lrs[steepest_idx]

# Visualization
def plot_lr_finder(lrs, losses, suggested_lr=None):
    plt.figure(figsize=(10, 4))
    plt.semilogx(lrs, losses, linewidth=2)
    plt.xlabel('Learning Rate (log scale)')
    plt.ylabel('Loss')
    plt.title('Learning Rate Finder')
    if suggested_lr:
        plt.axvline(x=suggested_lr, color='r', linestyle='--',
                    label=f'Suggested LR: {suggested_lr:.2e}')
        plt.legend()
    plt.grid(True, alpha=0.3)
    plt.savefig('lr_finder.png', dpi=150)
    plt.show()

# Usage example (pseudo-code):
# criterion = torch.nn.CrossEntropyLoss()
# lrs, losses = lr_range_test(model, train_loader, criterion, torch.optim.SGD)
# suggested = suggest_lr(lrs, losses)
# plot_lr_finder(lrs, losses, suggested)
```

How to interpret the LR Finder plot:

- Flat region at very small learning rates: steps are too tiny to change the loss.
- Steadily decreasing region: productive learning rates live here.
- Minimum of the curve: close to the edge of stability; usually too aggressive to train at.
- Sharp rise after the minimum: the loss explodes; these rates are unusable.
Rule of thumb: Choose a learning rate about 10× smaller than where the loss is minimum, or at the steepest point of descent.
The challenge of selecting a single learning rate that works for all parameters has motivated adaptive learning rate methods. These algorithms automatically adjust the learning rate for each parameter based on its gradient history.
Key Methods (covered in depth later):
| Method | Key Idea | When to Use |
|---|---|---|
| AdaGrad | Divide by accumulated squared gradients | Sparse data (NLP, recommendations) |
| RMSprop | Exponential moving average of squared gradients | Non-stationary objectives, RNNs |
| Adam | Combines momentum with RMSprop | Default choice for most problems |
| AdamW | Adam with decoupled weight decay | Transformers, large models |
The fundamental insight: parameters that consistently receive large gradients should get smaller effective learning rates (they already get plenty of update signal), while parameters with consistently small gradients should get larger effective rates (they need amplification).
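A minimal sketch of this idea in the spirit of AdaGrad (each method's details come later): accumulate squared gradients per parameter and divide each step by their square root.

```python
import numpy as np

def adagrad_step(theta, grad, accum, lr=0.1, eps=1e-8):
    """One AdaGrad-style update: per-parameter step scaling."""
    accum += grad ** 2                           # per-parameter gradient history
    theta -= lr * grad / (np.sqrt(accum) + eps)  # big history → small step
    return theta, accum

# Toy demo on the elongated quadratic: θ₁ sees large gradients (its steps
# get scaled down), θ₂ sees small ones (its steps stay relatively large).
A = np.array([[4.0, 0.0], [0.0, 1.0]])
theta, accum = np.array([3.0, 3.0]), np.zeros(2)
for _ in range(100):
    theta, accum = adagrad_step(theta, A @ theta, accum)
print(f"θ after 100 AdaGrad-style steps: {theta}")
```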
If you're unsure which optimizer to use, start with Adam (learning rate 3e-4 to 1e-3). It's less sensitive to hyperparameter choices than vanilla SGD and works well across a wide range of problems. SGD with momentum often achieves better final accuracy but requires more tuning.
The learning rate is not just a hyperparameter—it's the throttle controlling how aggressively your model learns. Master it, and optimization becomes tractable.
Coming Up Next:
With learning rate understood, we turn to convergence analysis: How fast does gradient descent converge? What theoretical guarantees can we prove? Understanding convergence rates helps us choose between algorithms and set expectations for training time.
You now understand why learning rate is so critical, the mathematical bounds that govern it, how to visualize its effects, and practical strategies for tuning. Next, we'll dive into convergence theory to understand exactly how fast gradient descent converges.