Every learning rate schedule we've examined so far follows a common pattern: the learning rate decreases monotonically throughout training. Step decay drops it discretely, exponential decay reduces it continuously, and cosine annealing traces a smooth S-shaped curve, but all head inexorably downward.
Cyclical Learning Rates (CLR) challenge this paradigm by deliberately oscillating the learning rate between bounds. Rather than committing to a single trajectory, CLR periodically increases the learning rate, enabling the model to escape local minima, explore new regions of parameter space, and potentially discover better solutions.
This counterintuitive approach was introduced by Leslie Smith in 2015 and has since proven remarkably effective, particularly for generating diverse ensemble models and achieving competitive performance with less hyperparameter tuning.
By the end of this page, you will understand the theoretical foundations of cyclical learning rates, implement triangular, triangular2, and exponential range policies, use the LR range test to find optimal bounds, and know when cyclical schedules outperform monotonic alternatives.
The Basic Idea:
Cyclical learning rates oscillate between a minimum bound (base_lr) and maximum bound (max_lr) over a defined cycle period (step_size):
$$\eta_t = \eta_{\text{base}} + (\eta_{\text{max}} - \eta_{\text{base}}) \cdot f(\text{cycle position})$$
Where $f(\cdot)$ is a function that varies between 0 and 1 based on position within the cycle.
Key Parameters:
base_lr: The lower bound of the oscillation; the LR returns to this value at the end of each cycle.
max_lr: The upper bound of the oscillation; the peak LR reached mid-cycle.
step_size: The number of iterations in a half-cycle (base to max, or max back to base); one full cycle spans 2 × step_size iterations.
Why Oscillation Helps:
Saddle Point Escape: High LR phases provide the momentum to escape saddle points that trap low-LR optimization.
Local Minimum Escape: Periodic LR increases can jolt the model out of suboptimal local minima.
Implicit Regularization: The varied learning rates act as a form of regularization, preventing overfitting to specific gradient trajectories.
Ensemble Diversity: Models sampled at different cycle phases represent diverse solutions, improving ensemble quality.
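As an illustration of that last point, here is a minimal sketch of harvesting one snapshot per cycle at the low-LR point, where the model has settled into a (locally) refined solution; `train_step` and `cycle_len` are assumed placeholders, not part of any library API:

```python
import copy

def collect_snapshots(model, optimizer, scheduler, train_step,
                      num_iterations, cycle_len):
    """Collect one model snapshot per LR cycle, taken at the low-LR point,
    to use as diverse ensemble members from a single training run."""
    snapshots = []
    for it in range(num_iterations):
        train_step(model, optimizer)   # assumed callback: one batch of training
        scheduler.step()               # CLR schedulers step once per iteration
        if (it + 1) % cycle_len == 0:  # end of a full cycle: LR is back at base_lr
            snapshots.append(copy.deepcopy(model.state_dict()))
    return snapshots
```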
| Policy | Behavior | Max LR Over Time | Best For |
|---|---|---|---|
| triangular | Linear oscillation between bounds | Constant | Baseline, exploration |
| triangular2 | Same, but max_lr halves each cycle | Decreasing | Progressive refinement |
| exp_range | max_lr decays exponentially | Exponential decay | Balance exploration/precision |
Triangular Policy:
The learning rate follows a triangular wave:
$$\text{cycle} = \left\lfloor 1 + \frac{\text{iteration}}{2 \times \text{step\_size}} \right\rfloor$$
$$x = \left| \frac{\text{iteration}}{\text{step\_size}} - 2 \times \text{cycle} + 1 \right|$$
$$\eta = \text{base\_lr} + (\text{max\_lr} - \text{base\_lr}) \times \max(0,\ 1 - x)$$
This creates a triangle-wave pattern: the LR rises linearly from base to max over step_size iterations, then falls linearly back to base over the next step_size iterations.
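For reference, a minimal NumPy sketch of the three formulas above (names mirror the equations; the printed checkpoints and hyperparameter values are illustrative):

```python
import numpy as np

def triangular_lr(iteration, base_lr, max_lr, step_size):
    """Triangular CLR policy: LR rises linearly from base_lr to max_lr
    over step_size iterations, then falls back linearly."""
    cycle = np.floor(1 + iteration / (2 * step_size))
    x = np.abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0, 1 - x)

# First cycle with step_size=1000:
# iteration 0 -> base_lr, 1000 -> max_lr, 2000 -> back to base_lr
for it in [0, 500, 1000, 1500, 2000]:
    print(it, triangular_lr(it, base_lr=1e-4, max_lr=1e-3, step_size=1000))
```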
Triangular2 Policy:
Same as triangular, but the amplitude (max_lr - base_lr) is halved after each cycle:
$$\eta = \text{base\_lr} + \frac{\text{max\_lr} - \text{base\_lr}}{2^{\text{cycle}-1}} \times \max(0,\ 1 - x)$$
This provides aggressive exploration early, with progressively finer oscillations later.
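In code, the only change from the triangular sketch above is a per-cycle amplitude scale; a minimal version:

```python
import numpy as np

def triangular2_lr(iteration, base_lr, max_lr, step_size):
    """triangular2 policy: triangular shape with the amplitude
    halved after each completed cycle."""
    cycle = np.floor(1 + iteration / (2 * step_size))
    x = np.abs(iteration / step_size - 2 * cycle + 1)
    scale = 1.0 / (2 ** (cycle - 1))   # 1, 1/2, 1/4, ... per cycle
    return base_lr + (max_lr - base_lr) * scale * max(0, 1 - x)
```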
Exp_range Policy:
The maximum learning rate decays exponentially:
$$\eta = \text{base\_lr} + (\text{max\_lr} - \text{base\_lr}) \times \gamma^{\text{iteration}} \times \max(0,\ 1 - x)$$
Where $\gamma$ is a constant slightly below 1 (typically between 0.99 and 0.99999) that controls the decay rate; values closer to 1 decay more slowly.
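Similarly, exp_range replaces the per-cycle halving with a per-iteration decay $\gamma^{\text{iteration}}$; a sketch with an assumed $\gamma$ of 0.9999:

```python
import numpy as np

def exp_range_lr(iteration, base_lr, max_lr, step_size, gamma=0.9999):
    """exp_range policy: triangular shape with the amplitude
    decayed by gamma**iteration."""
    cycle = np.floor(1 + iteration / (2 * step_size))
    x = np.abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * (gamma ** iteration) * max(0, 1 - x)
```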
Leslie Smith recommends setting step_size to 2-10 times the number of iterations per epoch. With step_size = 4 × iterations_per_epoch, one full cycle (2 × step_size iterations) spans 8 epochs.
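A quick worked example, assuming 50,000 training samples and a batch size of 100 (both numbers illustrative):

```python
# Illustrative sizing: 50,000 training samples, batch size 100
iterations_per_epoch = 50_000 // 100     # = 500 iterations per epoch
step_size = 4 * iterations_per_epoch     # = 2,000 (within Smith's 2-10x rule)
cycle_length = 2 * step_size             # = 4,000 iterations = 8 epochs per cycle
```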
The LR Range Test is a systematic procedure for finding optimal base_lr and max_lr values. This eliminates guesswork from CLR configuration.
Procedure:
1. Start from a very small learning rate (e.g., 1e-7).
2. Train for a fixed number of iterations, increasing the LR exponentially each iteration toward a large value (e.g., 10).
3. Record the (smoothed) training loss at each learning rate.
4. Stop early if the loss diverges (e.g., exceeds several times the best loss seen so far).
5. Plot loss versus learning rate on a log scale.
Interpreting Results:
The plot typically shows four regions:
Flat region: At very low LRs, the loss barely moves.
Steep descent: The LR is large enough for rapid progress; the steepest point suggests a good max_lr.
Minimum/plateau: The loss bottoms out and becomes noisy.
Divergence: Beyond some LR, the loss climbs sharply; stay below this point.
Setting Bounds:
Choose max_lr at or just below the steepest-descent point, before the loss bottoms out and diverges; choose base_lr roughly an order of magnitude below max_lr, or at the LR where the loss first starts to decrease.
Example: If loss decreases from LR=3e-4 to LR=3e-3 and then increases, set base_lr=1e-4, max_lr=3e-3.
```python
import torch
import numpy as np
import matplotlib.pyplot as plt


class LRRangeFinder:
    """
    Implements the LR range test for finding optimal CLR bounds.
    """

    def __init__(self, model, train_loader, criterion, optimizer_class,
                 optimizer_kwargs=None, device='cuda'):
        self.model = model
        self.train_loader = train_loader
        self.criterion = criterion
        self.optimizer_class = optimizer_class
        self.optimizer_kwargs = optimizer_kwargs or {}
        self.device = device

    def run(self, start_lr=1e-7, end_lr=10, num_iterations=100,
            smooth_factor=0.05):
        """
        Run the LR range test.

        Returns:
            lrs: List of learning rates tested
            losses: Smoothed losses at each LR
        """
        # Save initial model state
        initial_state = {k: v.clone() for k, v in self.model.state_dict().items()}

        # Create fresh optimizer
        optimizer = self.optimizer_class(
            self.model.parameters(), lr=start_lr, **self.optimizer_kwargs
        )

        # Compute LR schedule (exponential increase)
        lr_mult = (end_lr / start_lr) ** (1 / num_iterations)

        lrs, losses = [], []
        smoothed_loss = 0
        best_loss = float('inf')

        data_iter = iter(self.train_loader)

        for i in range(num_iterations):
            # Get batch (cycle through data if needed)
            try:
                inputs, targets = next(data_iter)
            except StopIteration:
                data_iter = iter(self.train_loader)
                inputs, targets = next(data_iter)

            inputs = inputs.to(self.device)
            targets = targets.to(self.device)

            # Forward pass
            optimizer.zero_grad()
            outputs = self.model(inputs)
            loss = self.criterion(outputs, targets)

            # Smooth loss
            if i == 0:
                smoothed_loss = loss.item()
            else:
                smoothed_loss = smooth_factor * loss.item() + \
                    (1 - smooth_factor) * smoothed_loss

            # Record
            current_lr = optimizer.param_groups[0]['lr']
            lrs.append(current_lr)
            losses.append(smoothed_loss)

            # Check for divergence
            if smoothed_loss > 4 * best_loss:
                print(f"Stopping early: loss diverged at LR={current_lr:.2e}")
                break

            best_loss = min(best_loss, smoothed_loss)

            # Backward and update
            loss.backward()
            optimizer.step()

            # Increase LR for next step
            for pg in optimizer.param_groups:
                pg['lr'] *= lr_mult

        # Restore initial model state
        self.model.load_state_dict(initial_state)

        return lrs, losses

    def plot(self, lrs, losses, skip_start=10, skip_end=5):
        """Plot LR range test results with suggested bounds."""
        fig, ax = plt.subplots(figsize=(10, 6))

        # Skip extreme ends
        lrs = lrs[skip_start:-skip_end]
        losses = losses[skip_start:-skip_end]

        ax.semilogx(lrs, losses)
        ax.set_xlabel('Learning Rate (log scale)')
        ax.set_ylabel('Loss')
        ax.set_title('LR Range Test')

        # Find suggested max_lr (steepest descent point)
        gradients = np.gradient(losses)
        min_grad_idx = np.argmin(gradients)
        suggested_max = lrs[min_grad_idx]

        ax.axvline(x=suggested_max, color='r', linestyle='--',
                   label=f'Suggested max_lr: {suggested_max:.2e}')
        ax.axvline(x=suggested_max / 10, color='g', linestyle='--',
                   label=f'Suggested base_lr: {suggested_max / 10:.2e}')

        ax.legend()
        ax.grid(True, alpha=0.3)

        return fig, suggested_max
```
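A usage sketch, assuming `model`, `train_loader`, and a CUDA device are already set up (all names here are placeholders):

```python
import torch.nn as nn
import torch.optim as optim

# Placeholder setup: substitute your own model and DataLoader.
finder = LRRangeFinder(
    model=model,
    train_loader=train_loader,
    criterion=nn.CrossEntropyLoss(),
    optimizer_class=optim.SGD,
    optimizer_kwargs={'momentum': 0.9},
    device='cuda',
)
lrs, losses = finder.run(start_lr=1e-7, end_lr=10, num_iterations=100)
fig, suggested_max = finder.plot(lrs, losses)
fig.savefig('lr_range_test.png')
print(f"max_lr ~ {suggested_max:.2e}, base_lr ~ {suggested_max / 10:.2e}")
```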
```python
import torch
from torch.optim.lr_scheduler import CyclicLR, OneCycleLR


# =====================================================
# PyTorch Native CyclicLR
# =====================================================
def create_cyclical_scheduler(optimizer, base_lr, max_lr, step_size_up,
                              mode='triangular'):
    """
    Create PyTorch CyclicLR scheduler.

    Args:
        optimizer: Wrapped optimizer
        base_lr: Minimum LR
        max_lr: Maximum LR
        step_size_up: Iterations to go from base to max
        mode: 'triangular', 'triangular2', or 'exp_range'
    """
    return CyclicLR(
        optimizer,
        base_lr=base_lr,
        max_lr=max_lr,
        step_size_up=step_size_up,
        mode=mode,
        # Cycle momentum inversely to the LR. Requires an optimizer with a
        # momentum parameter (e.g., SGD); set False for optimizers without one.
        cycle_momentum=True
    )


# =====================================================
# OneCycleLR: Single Cycle with Warmup
# =====================================================
def create_one_cycle_scheduler(optimizer, max_lr, total_steps,
                               pct_start=0.3, anneal_strategy='cos'):
    """
    OneCycleLR: Warmup to max_lr, then anneal to near-zero.
    Often achieves SOTA with less tuning than multi-cycle CLR.

    Args:
        optimizer: Wrapped optimizer
        max_lr: Peak learning rate
        total_steps: Total training steps
        pct_start: Fraction of training for warmup (default 30%)
        anneal_strategy: 'cos' or 'linear' for decay phase
    """
    return OneCycleLR(
        optimizer,
        max_lr=max_lr,
        total_steps=total_steps,
        pct_start=pct_start,
        anneal_strategy=anneal_strategy,
        cycle_momentum=True,
        div_factor=25.0,        # initial_lr = max_lr / 25
        final_div_factor=1e4    # final_lr = initial_lr / 1e4
    )


# =====================================================
# Training Loop with CLR
# =====================================================
def train_with_clr(model, train_loader, criterion, optimizer, scheduler,
                   device='cuda'):
    """
    Training loop with per-step CLR scheduling.

    CRITICAL: CLR scheduler.step() is called per ITERATION, not per epoch!
    """
    model.train()
    total_loss = 0
    lr_history = []

    for batch_idx, (inputs, targets) in enumerate(train_loader):
        inputs, targets = inputs.to(device), targets.to(device)

        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

        # Step scheduler AFTER optimizer.step(), EVERY iteration
        scheduler.step()

        total_loss += loss.item()
        lr_history.append(optimizer.param_groups[0]['lr'])

    return total_loss / len(train_loader), lr_history
```

Unlike most schedulers, CyclicLR expects scheduler.step() to be called after EVERY batch/iteration, not once per epoch. This is a common source of bugs: calling it per epoch results in a nearly constant LR.
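Putting the pieces together, a sketch that feeds the range-test output into a CyclicLR run; the epoch count, momentum, and base_lr = max_lr/10 rule are illustrative choices, not prescriptions:

```python
import torch.optim as optim

# Bounds from the LR range test above (suggested_max is its output)
max_lr = suggested_max
base_lr = max_lr / 10

optimizer = optim.SGD(model.parameters(), lr=base_lr, momentum=0.9)
scheduler = create_cyclical_scheduler(
    optimizer,
    base_lr=base_lr,
    max_lr=max_lr,
    step_size_up=4 * len(train_loader),  # 4x iterations per epoch
    mode='triangular2',
)

for epoch in range(20):  # illustrative epoch count
    avg_loss, lr_history = train_with_clr(
        model, train_loader, criterion, optimizer, scheduler)
    print(f"epoch {epoch}: loss={avg_loss:.4f}, final LR={lr_history[-1]:.2e}")
```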
OneCycleLR represents the evolution of cyclical learning rates into a highly practical, SOTA-achieving schedule. Rather than multiple symmetric cycles, it uses a single asymmetric cycle:
Warmup phase: The LR climbs from max_lr/div_factor to max_lr over the first pct_start fraction of training, while momentum decreases.
Annealing phase: The LR descends from max_lr to near zero (initial_lr/final_div_factor) over the remaining steps, while momentum recovers.
This combines the benefits of warmup (stable early training) with CLR's exploration (the peak) and cosine annealing's refinement (the final decay).
Why OneCycleLR Often Wins:
Super-convergence: A single cycle with a high peak LR can reach strong accuracy in far fewer steps, the phenomenon Smith later termed super-convergence.
Built-in warmup: The rising phase stabilizes early training, so no separate warmup schedule is needed.
Implicit regularization: The high-LR middle of the cycle provides CLR's exploration and regularization benefits.
Clean finish: The long anneal to a near-zero LR gives the final refinement that multi-cycle CLR lacks.
OneCycleLR Parameters:
| Parameter | Default | Purpose | Tuning Guidance |
|---|---|---|---|
| max_lr | Required | Peak learning rate | Use LR range test |
| pct_start | 0.3 | Warmup fraction | 0.2-0.4 typical |
| div_factor | 25 | initial_lr = max_lr/div_factor | 25 works broadly |
| final_div_factor | 1e4 | final_lr = initial_lr/factor | Large = near-zero end |
| anneal_strategy | 'cos' | Decay curve shape | 'cos' usually best |
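To make the factor arithmetic concrete, a tiny worked example assuming max_lr = 0.1 with the defaults above:

```python
max_lr = 0.1                              # assumed peak LR
div_factor = 25.0
final_div_factor = 1e4

initial_lr = max_lr / div_factor          # 0.004: LR at the first step
final_lr = initial_lr / final_div_factor  # 4e-07: LR at the last step
print(initial_lr, final_lr)
```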
For new projects without established baselines: try OneCycleLR first. Run the LR range test, set max_lr, and you often get competitive results with minimal tuning. For reproducing published results, use their exact schedule.
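A minimal end-to-end sketch of that recipe, reusing the helpers defined earlier; `suggested_max`, the epoch count, and the optimizer settings are illustrative:

```python
import torch.optim as optim

epochs = 30  # illustrative
# The constructor lr is overridden by OneCycleLR's warmup start
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = create_one_cycle_scheduler(
    optimizer,
    max_lr=suggested_max,                    # peak LR from the range test
    total_steps=epochs * len(train_loader),  # one scheduler step per batch
)

for epoch in range(epochs):
    avg_loss, _ = train_with_clr(model, train_loader, criterion,
                                 optimizer, scheduler)
```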
Module Complete:
You've now worked through the complete landscape of learning rate scheduling: foundational step decay, smooth exponential and cosine schedules, warmup strategies for training stability, and cyclical policies for enhanced exploration. This toolkit enables principled configuration of one of deep learning's most critical hyperparameters in any training scenario.