Every learning rate schedule we've examined so far follows a common pattern: the learning rate decreases monotonically throughout training. Step decay drops it discretely, exponential decay reduces it continuously, and cosine annealing traces a smooth S-shaped curve, but all head inexorably downward.
Cyclical Learning Rates (CLR) challenge this paradigm by deliberately oscillating the learning rate between bounds. Rather than committing to a single trajectory, CLR periodically increases the learning rate, enabling the model to escape local minima, explore new regions of parameter space, and potentially discover better solutions.
This counterintuitive approach was introduced by Leslie Smith in 2015 and has since proven remarkably effective, particularly for generating diverse ensemble models and achieving competitive performance with less hyperparameter tuning.
By the end of this page, you will understand the theoretical foundations of cyclical learning rates, implement triangular, triangular2, and exponential range policies, use the LR range test to find optimal bounds, and know when cyclical schedules outperform monotonic alternatives.
The Basic Idea:
Cyclical learning rates oscillate between a minimum bound (base_lr) and maximum bound (max_lr) over a defined cycle period (step_size):
$$\eta_t = \eta_{\text{base}} + (\eta_{\text{max}} - \eta_{\text{base}}) \cdot f(\text{cycle position})$$
Where $f(\cdot)$ is a function that varies between 0 and 1 based on position within the cycle.
Key Parameters:
base_lr: The lower bound of the oscillation; the LR returns to this value at the end of each cycle.
max_lr: The upper bound of the oscillation; the peak LR reached mid-cycle.
step_size: The number of iterations in a half-cycle (base to max, or max back to base); one full cycle spans 2 × step_size iterations.
Why Oscillation Helps:
Saddle Point Escape: High LR phases provide the momentum to escape saddle points that trap low-LR optimization.
Local Minimum Escape: Periodic LR increases can jolt the model out of suboptimal local minima.
Implicit Regularization: The varied learning rates act as a form of regularization, preventing overfitting to specific gradient trajectories.
Ensemble Diversity: Models sampled at different cycle phases represent diverse solutions, improving ensemble quality.
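As an illustration of that last point, here is a minimal sketch of harvesting one snapshot per cycle at the low-LR point, where the model has settled into a (locally) refined solution; `train_step` and `cycle_len` are assumed placeholders, not part of any library API:

```python
import copy

def collect_snapshots(model, optimizer, scheduler, train_step,
                      num_iterations, cycle_len):
    """Collect one model snapshot per LR cycle, taken at the low-LR point,
    to use as diverse ensemble members from a single training run."""
    snapshots = []
    for it in range(num_iterations):
        train_step(model, optimizer)   # assumed callback: one batch of training
        scheduler.step()               # CLR schedulers step once per iteration
        if (it + 1) % cycle_len == 0:  # end of a full cycle: LR is back at base_lr
            snapshots.append(copy.deepcopy(model.state_dict()))
    return snapshots
```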
| Policy | Behavior | Max LR Over Time | Best For |
|---|---|---|---|
| triangular | Linear oscillation between bounds | Constant | Baseline, exploration |
| triangular2 | Same, but max_lr halves each cycle | Decreasing | Progressive refinement |
| exp_range | max_lr decays exponentially | Exponential decay | Balance exploration/precision |
Triangular Policy:
The learning rate follows a triangular wave:
$$\text{cycle} = \left\lfloor 1 + \frac{\text{iteration}}{2 \times \text{step\_size}} \right\rfloor$$
$$x = \left| \frac{\text{iteration}}{\text{step\_size}} - 2 \times \text{cycle} + 1 \right|$$
$$\eta = \text{base\_lr} + (\text{max\_lr} - \text{base\_lr}) \times \max(0,\ 1 - x)$$
This creates a triangle-wave pattern: the LR rises linearly from base to max over step_size iterations, then falls linearly back to base over the next step_size iterations.
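For reference, a minimal NumPy sketch of the three formulas above (names mirror the equations; the printed checkpoints and hyperparameter values are illustrative):

```python
import numpy as np

def triangular_lr(iteration, base_lr, max_lr, step_size):
    """Triangular CLR policy: LR rises linearly from base_lr to max_lr
    over step_size iterations, then falls back linearly."""
    cycle = np.floor(1 + iteration / (2 * step_size))
    x = np.abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0, 1 - x)

# First cycle with step_size=1000:
# iteration 0 -> base_lr, 1000 -> max_lr, 2000 -> back to base_lr
for it in [0, 500, 1000, 1500, 2000]:
    print(it, triangular_lr(it, base_lr=1e-4, max_lr=1e-3, step_size=1000))
```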
Triangular2 Policy:
Same as triangular, but the amplitude (max_lr - base_lr) is halved after each cycle:
$$\eta = \text{base\_lr} + \frac{\text{max\_lr} - \text{base\_lr}}{2^{\text{cycle}-1}} \times \max(0,\ 1 - x)$$
This provides aggressive exploration early, with progressively finer oscillations later.
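In code, the only change from the triangular sketch above is a per-cycle amplitude scale; a minimal version:

```python
import numpy as np

def triangular2_lr(iteration, base_lr, max_lr, step_size):
    """triangular2 policy: triangular shape with the amplitude
    halved after each completed cycle."""
    cycle = np.floor(1 + iteration / (2 * step_size))
    x = np.abs(iteration / step_size - 2 * cycle + 1)
    scale = 1.0 / (2 ** (cycle - 1))   # 1, 1/2, 1/4, ... per cycle
    return base_lr + (max_lr - base_lr) * scale * max(0, 1 - x)
```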
Exp_range Policy:
The maximum learning rate decays exponentially:
$$\eta = \text{base\_lr} + (\text{max\_lr} - \text{base\_lr}) \times \gamma^{\text{iteration}} \times \max(0,\ 1 - x)$$
Where $\gamma$ is a constant slightly below 1 (typically between 0.99 and 0.99999) that controls the decay rate; values closer to 1 decay more slowly.
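Similarly, exp_range replaces the per-cycle halving with a per-iteration decay $\gamma^{\text{iteration}}$; a sketch with an assumed $\gamma$ of 0.9999:

```python
import numpy as np

def exp_range_lr(iteration, base_lr, max_lr, step_size, gamma=0.9999):
    """exp_range policy: triangular shape with the amplitude
    decayed by gamma**iteration."""
    cycle = np.floor(1 + iteration / (2 * step_size))
    x = np.abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * (gamma ** iteration) * max(0, 1 - x)
```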
Leslie Smith recommends setting step_size to 2-10 times the number of iterations per epoch. With step_size = 4 × iterations_per_epoch, one full cycle (2 × step_size iterations) spans 8 epochs.
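A quick worked example, assuming 50,000 training samples and a batch size of 100 (both numbers illustrative):

```python
# Illustrative sizing: 50,000 training samples, batch size 100
iterations_per_epoch = 50_000 // 100     # = 500 iterations per epoch
step_size = 4 * iterations_per_epoch     # = 2,000 (within Smith's 2-10x rule)
cycle_length = 2 * step_size             # = 4,000 iterations = 8 epochs per cycle
```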
The LR Range Test is a systematic procedure for finding optimal base_lr and max_lr values. This eliminates guesswork from CLR configuration.
Procedure:
1. Start from a very small learning rate (e.g., 1e-7).
2. Train for a fixed number of iterations, increasing the LR exponentially each iteration toward a large value (e.g., 10).
3. Record the (smoothed) training loss at each learning rate.
4. Stop early if the loss diverges (e.g., exceeds several times the best loss seen so far).
5. Plot loss versus learning rate on a log scale.
Interpreting Results:
The plot typically shows four regions:
Flat region: At very low LRs, the loss barely moves.
Steep descent: The LR is large enough for rapid progress; the steepest point suggests a good max_lr.
Minimum/plateau: The loss bottoms out and becomes noisy.
Divergence: Beyond some LR, the loss climbs sharply; stay below this point.
Setting Bounds:
Choose max_lr at or just below the steepest-descent point, before the loss bottoms out and diverges; choose base_lr roughly an order of magnitude below max_lr, or at the LR where the loss first starts to decrease.
Example: If loss decreases from LR=3e-4 to LR=3e-3 and then increases, set base_lr=1e-4, max_lr=3e-3.
```python
import torch
import numpy as np
import matplotlib.pyplot as plt


class LRRangeFinder:
    """
    Implements the LR range test for finding optimal CLR bounds.
    """

    def __init__(self, model, train_loader, criterion, optimizer_class,
                 optimizer_kwargs=None, device='cuda'):
        self.model = model
        self.train_loader = train_loader
        self.criterion = criterion
        self.optimizer_class = optimizer_class
        self.optimizer_kwargs = optimizer_kwargs or {}
        self.device = device

    def run(self, start_lr=1e-7, end_lr=10, num_iterations=100,
            smooth_factor=0.05):
        """
        Run the LR range test.

        Returns:
            lrs: List of learning rates tested
            losses: Smoothed losses at each LR
        """
        # Save initial model state
        initial_state = {k: v.clone() for k, v in self.model.state_dict().items()}

        # Create fresh optimizer
        optimizer = self.optimizer_class(
            self.model.parameters(), lr=start_lr, **self.optimizer_kwargs
        )

        # Compute LR schedule (exponential increase)
        lr_mult = (end_lr / start_lr) ** (1 / num_iterations)

        lrs, losses = [], []
        smoothed_loss = 0
        best_loss = float('inf')

        data_iter = iter(self.train_loader)

        for i in range(num_iterations):
            # Get batch (cycle through data if needed)
            try:
                inputs, targets = next(data_iter)
            except StopIteration:
                data_iter = iter(self.train_loader)
                inputs, targets = next(data_iter)

            inputs = inputs.to(self.device)
            targets = targets.to(self.device)

            # Forward pass
            optimizer.zero_grad()
            outputs = self.model(inputs)
            loss = self.criterion(outputs, targets)

            # Smooth loss
            if i == 0:
                smoothed_loss = loss.item()
            else:
                smoothed_loss = smooth_factor * loss.item() + \
                    (1 - smooth_factor) * smoothed_loss

            # Record
            current_lr = optimizer.param_groups[0]['lr']
            lrs.append(current_lr)
            losses.append(smoothed_loss)

            # Check for divergence
            if smoothed_loss > 4 * best_loss:
                print(f"Stopping early: loss diverged at LR={current_lr:.2e}")
                break

            best_loss = min(best_loss, smoothed_loss)

            # Backward and update
            loss.backward()
            optimizer.step()

            # Increase LR for next step
            for pg in optimizer.param_groups:
                pg['lr'] *= lr_mult

        # Restore initial model state
        self.model.load_state_dict(initial_state)

        return lrs, losses

    def plot(self, lrs, losses, skip_start=10, skip_end=5):
        """Plot LR range test results with suggested bounds."""
        fig, ax = plt.subplots(figsize=(10, 6))

        # Skip extreme ends
        lrs = lrs[skip_start:-skip_end]
        losses = losses[skip_start:-skip_end]

        ax.semilogx(lrs, losses)
        ax.set_xlabel('Learning Rate (log scale)')
        ax.set_ylabel('Loss')
        ax.set_title('LR Range Test')

        # Find suggested max_lr (steepest descent point)
        gradients = np.gradient(losses)
        min_grad_idx = np.argmin(gradients)
        suggested_max = lrs[min_grad_idx]

        ax.axvline(x=suggested_max, color='r', linestyle='--',
                   label=f'Suggested max_lr: {suggested_max:.2e}')
        ax.axvline(x=suggested_max / 10, color='g', linestyle='--',
                   label=f'Suggested base_lr: {suggested_max / 10:.2e}')

        ax.legend()
        ax.grid(True, alpha=0.3)

        return fig, suggested_max
```
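A usage sketch, assuming `model`, `train_loader`, and a CUDA device are already set up (all names here are placeholders):

```python
import torch.nn as nn
import torch.optim as optim

# Placeholder setup: substitute your own model and DataLoader.
finder = LRRangeFinder(
    model=model,
    train_loader=train_loader,
    criterion=nn.CrossEntropyLoss(),
    optimizer_class=optim.SGD,
    optimizer_kwargs={'momentum': 0.9},
    device='cuda',
)
lrs, losses = finder.run(start_lr=1e-7, end_lr=10, num_iterations=100)
fig, suggested_max = finder.plot(lrs, losses)
fig.savefig('lr_range_test.png')
print(f"max_lr ~ {suggested_max:.2e}, base_lr ~ {suggested_max / 10:.2e}")
```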
```python
import torch
from torch.optim.lr_scheduler import CyclicLR, OneCycleLR


# =====================================================
# PyTorch Native CyclicLR
# =====================================================
def create_cyclical_scheduler(optimizer, base_lr, max_lr, step_size_up,
                              mode='triangular'):
    """
    Create PyTorch CyclicLR scheduler.

    Args:
        optimizer: Wrapped optimizer
        base_lr: Minimum LR
        max_lr: Maximum LR
        step_size_up: Iterations to go from base to max
        mode: 'triangular', 'triangular2', or 'exp_range'
    """
    return CyclicLR(
        optimizer,
        base_lr=base_lr,
        max_lr=max_lr,
        step_size_up=step_size_up,
        mode=mode,
        # Cycle momentum inversely to the LR. Requires an optimizer with a
        # momentum parameter (e.g., SGD); set False for optimizers without one.
        cycle_momentum=True
    )


# =====================================================
# OneCycleLR: Single Cycle with Warmup
# =====================================================
def create_one_cycle_scheduler(optimizer, max_lr, total_steps,
                               pct_start=0.3, anneal_strategy='cos'):
    """
    OneCycleLR: Warmup to max_lr, then anneal to near-zero.
    Often achieves SOTA with less tuning than multi-cycle CLR.

    Args:
        optimizer: Wrapped optimizer
        max_lr: Peak learning rate
        total_steps: Total training steps
        pct_start: Fraction of training for warmup (default 30%)
        anneal_strategy: 'cos' or 'linear' for decay phase
    """
    return OneCycleLR(
        optimizer,
        max_lr=max_lr,
        total_steps=total_steps,
        pct_start=pct_start,
        anneal_strategy=anneal_strategy,
        cycle_momentum=True,
        div_factor=25.0,        # initial_lr = max_lr / 25
        final_div_factor=1e4    # final_lr = initial_lr / 1e4
    )


# =====================================================
# Training Loop with CLR
# =====================================================
def train_with_clr(model, train_loader, criterion, optimizer, scheduler,
                   device='cuda'):
    """
    Training loop with per-step CLR scheduling.

    CRITICAL: CLR scheduler.step() is called per ITERATION, not per epoch!
    """
    model.train()
    total_loss = 0
    lr_history = []

    for batch_idx, (inputs, targets) in enumerate(train_loader):
        inputs, targets = inputs.to(device), targets.to(device)

        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

        # Step scheduler AFTER optimizer.step(), EVERY iteration
        scheduler.step()

        total_loss += loss.item()
        lr_history.append(optimizer.param_groups[0]['lr'])

    return total_loss / len(train_loader), lr_history
```

Unlike most schedulers, CyclicLR expects scheduler.step() to be called after EVERY batch/iteration, not once per epoch. This is a common source of bugs: calling it per epoch results in a nearly constant LR.
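Putting the pieces together, a sketch that feeds the range-test output into a CyclicLR run; the epoch count, momentum, and base_lr = max_lr/10 rule are illustrative choices, not prescriptions:

```python
import torch.optim as optim

# Bounds from the LR range test above (suggested_max is its output)
max_lr = suggested_max
base_lr = max_lr / 10

optimizer = optim.SGD(model.parameters(), lr=base_lr, momentum=0.9)
scheduler = create_cyclical_scheduler(
    optimizer,
    base_lr=base_lr,
    max_lr=max_lr,
    step_size_up=4 * len(train_loader),  # 4x iterations per epoch
    mode='triangular2',
)

for epoch in range(20):  # illustrative epoch count
    avg_loss, lr_history = train_with_clr(
        model, train_loader, criterion, optimizer, scheduler)
    print(f"epoch {epoch}: loss={avg_loss:.4f}, final LR={lr_history[-1]:.2e}")
```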
OneCycleLR represents the evolution of cyclical learning rates into a highly practical, SOTA-achieving schedule. Rather than multiple symmetric cycles, it uses a single asymmetric cycle:
Warmup phase: The LR climbs from max_lr/div_factor to max_lr over the first pct_start fraction of training, while momentum decreases.
Annealing phase: The LR descends from max_lr to near zero (initial_lr/final_div_factor) over the remaining steps, while momentum recovers.
This combines the benefits of warmup (stable early training) with CLR's exploration (the peak) and cosine annealing's refinement (the final decay).
Why OneCycleLR Often Wins:
Super-convergence: A single cycle with a high peak LR can reach strong accuracy in far fewer steps, the phenomenon Smith later termed super-convergence.
Built-in warmup: The rising phase stabilizes early training, so no separate warmup schedule is needed.
Implicit regularization: The high-LR middle of the cycle provides CLR's exploration and regularization benefits.
Clean finish: The long anneal to a near-zero LR gives the final refinement that multi-cycle CLR lacks.
OneCycleLR Parameters:
| Parameter | Default | Purpose | Tuning Guidance |
|---|---|---|---|
| max_lr | Required | Peak learning rate | Use LR range test |
| pct_start | 0.3 | Warmup fraction | 0.2-0.4 typical |
| div_factor | 25 | initial_lr = max_lr/div_factor | 25 works broadly |
| final_div_factor | 1e4 | final_lr = initial_lr/factor | Large = near-zero end |
| anneal_strategy | 'cos' | Decay curve shape | 'cos' usually best |
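To make the factor arithmetic concrete, a tiny worked example assuming max_lr = 0.1 with the defaults above:

```python
max_lr = 0.1                              # assumed peak LR
div_factor = 25.0
final_div_factor = 1e4

initial_lr = max_lr / div_factor          # 0.004: LR at the first step
final_lr = initial_lr / final_div_factor  # 4e-07: LR at the last step
print(initial_lr, final_lr)
```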
For new projects without established baselines: try OneCycleLR first. Run the LR range test, set max_lr, and you often get competitive results with minimal tuning. For reproducing published results, use their exact schedule.
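A minimal end-to-end sketch of that recipe, reusing the helpers defined earlier; `suggested_max`, the epoch count, and the optimizer settings are illustrative:

```python
import torch.optim as optim

epochs = 30  # illustrative
# The constructor lr is overridden by OneCycleLR's warmup start
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = create_one_cycle_scheduler(
    optimizer,
    max_lr=suggested_max,                    # peak LR from the range test
    total_steps=epochs * len(train_loader),  # one scheduler step per batch
)

for epoch in range(epochs):
    avg_loss, _ = train_with_clr(model, train_loader, criterion,
                                 optimizer, scheduler)
```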
Module Complete:
You've now worked through the complete landscape of learning rate scheduling: foundational step decay, smooth exponential and cosine schedules, warmup strategies for training stability, and cyclical policies for enhanced exploration. This toolkit enables principled configuration of one of deep learning's most critical hyperparameters in any training scenario.