Among the various learning rate schedules, cosine annealing stands out for its mathematical elegance and remarkable empirical success. Unlike step decay's sharp transitions or exponential decay's fixed percentage reduction, cosine annealing traces a smooth, S-shaped curve that starts gently, accelerates in the middle, and gently approaches the minimum—mirroring the natural dynamics of optimization convergence.
The cosine schedule has become ubiquitous in modern deep learning, serving as the default choice for transformer training (BERT, GPT, Vision Transformers) and achieving state-of-the-art results across computer vision, NLP, and beyond. Its extension to warm restarts (SGDR: Stochastic Gradient Descent with Warm Restarts) further enhances its power by enabling exploration-exploitation cycles within a single training run.
This page provides a comprehensive treatment of cosine annealing: from the basic formulation through warm restart variants, implementation details, and the theoretical principles that make it work so well.
By the end of this page, you will understand why cosine annealing's shape matches optimization dynamics, implement both single-cycle and warm restart variants, tune the schedule's hyperparameters for your specific training scenario, and recognize when cosine annealing outperforms alternatives.
Basic Cosine Annealing:
The learning rate follows a half-period cosine function from the maximum to minimum:
$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{\pi t}{T}\right)\right)$$
Where $\eta_t$ is the learning rate at epoch $t$, $\eta_{\max}$ is the initial (maximum) learning rate, $\eta_{\min}$ is the final (minimum) learning rate, and $T$ is the total number of training epochs.
Key Properties of the Cosine Curve:
Smooth Start: At $t=0$, $\cos(0) = 1$, so $\eta_0 = \eta_{\max}$. The derivative is zero—meaning the schedule starts flat.
Smooth End: At $t=T$, $\cos(\pi) = -1$, so $\eta_T = \eta_{\min}$. Again, the derivative is zero—the schedule ends flat.
Steepest Decay at Midpoint: At $t = T/2$, the decay rate is maximal. This concentrates the transition in the middle of training.
Monotonically Decreasing: The schedule strictly decreases from $\eta_{\max}$ to $\eta_{\min}$ without oscillations.
| Training Progress (t/T) | cos(πt/T) | Learning Rate | Decay Rate |
|---|---|---|---|
| 0% | 1.0 | η_max | Minimal (flat) |
| 25% | 0.707 | 0.85 × η_max | Increasing |
| 50% | 0.0 | 0.5 × η_max | Maximum |
| 75% | -0.707 | 0.15 × η_max | Decreasing |
| 100% | -1.0 | η_min | Minimal (flat) |
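The table values follow directly from the formula above; a minimal sketch (with assumed values $\eta_{\max} = 0.1$, $\eta_{\min} = 0$, $T = 100$) reproduces them:

```python
import math

def cosine_lr(t, T, eta_max, eta_min=0.0):
    """Basic half-period cosine annealing, as in the formula above."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T))

T, eta_max = 100, 0.1  # assumed values for illustration
# Reproduce the table rows as fractions of eta_max at 0%, 25%, 50%, 75%, 100%
fractions = [cosine_lr(t, T, eta_max) / eta_max for t in (0, 25, 50, 75, 100)]
```

The 25% value comes out as 0.854 (rounded to 0.85 in the table), and the 75% value as 0.146.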
Intuitive Interpretation:
The cosine schedule embodies a compelling intuition about optimization:
Early Training (First 25%): Gentle reduction preserves exploration. The model is still finding good basins of attraction—we don't want to prematurely commit.
Middle Training (25-75%): Accelerated reduction as the model transitions from exploration to exploitation. This is where major phase changes happen.
Late Training (Last 25%): Gentle approach to minimum. Like a landing aircraft, we want to touch down smoothly rather than crash.
This matches empirical observations: most of the 'work' of optimization happens in the middle phase, with the beginning and end being more about setup and refinement.
The zero derivative at start and end isn't just aesthetic—it provides stability. At the start, a sudden LR drop would be jarring; the flat region lets training settle. At the end, the flat region lets the model converge precisely without the LR racing toward zero.
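The flat endpoints can be checked from the schedule's analytic derivative, $d\eta/dt = -\tfrac{1}{2}(\eta_{\max} - \eta_{\min})\tfrac{\pi}{T}\sin(\pi t/T)$, which vanishes at $t = 0$ and $t = T$ and peaks at the midpoint. A small check with assumed values $\eta_{\max} = 0.1$, $T = 100$:

```python
import math

def cosine_lr_slope(t, T, eta_max, eta_min=0.0):
    """Analytic derivative of the cosine schedule:
    d(eta)/dt = -0.5 * (eta_max - eta_min) * (pi/T) * sin(pi*t/T)."""
    return -0.5 * (eta_max - eta_min) * (math.pi / T) * math.sin(math.pi * t / T)

T, eta_max = 100, 0.1  # assumed values for illustration
slopes = {t: cosine_lr_slope(t, T, eta_max) for t in (0, 50, 100)}
# Flat at both endpoints; steepest (most negative) at the midpoint t = T/2.
```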
SGDR: Stochastic Gradient Descent with Warm Restarts extends basic cosine annealing by periodically resetting the learning rate to its maximum value, creating multiple cosine cycles within a single training run. This seemingly simple modification has profound effects on optimization dynamics.
The Core Idea:
Instead of a single cosine decay from $\eta_{\max}$ to $\eta_{\min}$, SGDR executes multiple cosine cycles. At the end of each cycle, the learning rate 'restarts' to $\eta_{\max}$ (or a reduced maximum), beginning a new exploration phase.
Mathematical Formulation:
For a training run with cycle length $T_i$ for cycle $i$:
$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{\pi T_{cur}}{T_i}\right)\right)$$
Where $T_{cur}$ is the number of epochs since the last restart.
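For fixed-length cycles, $T_{cur} = t \bmod T_0$, which makes the restart behavior visible in a few lines (a sketch with assumed values $T_0 = 50$, $\eta_{\max} = 0.1$):

```python
import math

def sgdr_lr(epoch, T_0, eta_max, eta_min=0.0):
    """SGDR with fixed-length cycles: T_cur is the epochs since the last restart."""
    T_cur = epoch % T_0
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * T_cur / T_0))

# Epoch 0 starts at eta_max, epoch 25 sits at the cycle midpoint,
# and epoch 50 jumps back to eta_max: the warm restart.
```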
Cycle Length Strategies:
Fixed Cycles: All cycles have the same length $T_0$. Train for $n \times T_0$ epochs with $n$ complete cycles.
Doubling Cycles: Each cycle is twice as long as the previous: $T_i = T_0 \times 2^i$. This provides increasingly long refinement periods.
Linear Growth: Cycles increase linearly: $T_i = T_0 \times (i + 1)$. Balanced between fixed and doubling.
| Strategy | Cycle Lengths | Total Epochs (4 cycles) | Best For |
|---|---|---|---|
| Fixed (T₀=50) | 50, 50, 50, 50 | 200 | Ensemble diversification |
| Doubling (T₀=10) | 10, 20, 40, 80 | 150 | Progressive refinement |
| Linear (T₀=25) | 25, 50, 75, 100 | 250 | Balanced exploration |
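The totals in this table can be checked with a short helper implementing the three strategies as defined above:

```python
def cycle_lengths(strategy, T_0, n):
    """Cycle lengths for n cycles under the three strategies above."""
    if strategy == "fixed":
        return [T_0] * n
    if strategy == "doubling":
        return [T_0 * 2 ** i for i in range(n)]
    if strategy == "linear":
        return [T_0 * (i + 1) for i in range(n)]
    raise ValueError(f"unknown strategy: {strategy}")

# Total epochs for 4 cycles, matching the table rows
totals = {s: sum(cycle_lengths(s, t0, 4))
          for s, t0 in [("fixed", 50), ("doubling", 10), ("linear", 25)]}
```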
Why Warm Restarts Help:
Escape Local Minima: The learning rate spike at restart can jolt the model out of poor local optima, enabling exploration of better basins.
Snapshot Ensembles: Models at cycle endpoints (just before restart) can be saved and ensembled, providing multiple diversified models from one training run.
Implicit Regularization: The periodic exploration prevents overfitting to specific features of the training data.
Faster Convergence: Counterintuitively, multiple shorter cycles often converge faster than one long decay, especially for complex loss landscapes.
The Snapshot Ensemble Bonus:
Each cycle endpoint represents a model that has converged with different initialization (the state at restart). These models, sharing early training but diverging in later cycles, provide diverse predictions that average well:
$$\text{Ensemble}(x) = \frac{1}{n}\sum_{i=1}^n f_{\theta_i}(x)$$
where $\theta_i$ are the parameters saved at the end of cycle $i$.
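A toy sketch of the averaging step; the linear "models" here are hypothetical stand-ins for the snapshot checkpoints $\theta_i$:

```python
# Hypothetical snapshot predictors; in practice these would be model
# checkpoints saved just before each warm restart.
snapshots = [lambda x, w=w: w * x for w in (0.9, 1.0, 1.1)]

def ensemble(x, models):
    """Average the snapshot predictions, as in the formula above."""
    return sum(f(x) for f in models) / len(models)
```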
The 'restart' only affects the learning rate—model parameters continue from where they left off. This 'warm' restart preserves learned representations while enabling new exploration. Cold restart (reinitializing weights) would waste all previous training.
The empirical success of cosine annealing invites theoretical explanation. Several complementary perspectives illuminate why this particular shape works so well.
Optimization Dynamics Correspondence:
Neural network training exhibits characteristic dynamics that cosine annealing naturally matches:
Early Phase (Rapid Feature Learning): Loss drops quickly as the model learns basic features. High, stable LR supports rapid progress.
Middle Phase (Refinement): The model refines feature representations and resolves competition between alternative solutions. Moderate, decreasing LR enables this transition.
Late Phase (Fine-Tuning): Small adjustments maximize performance. Very low LR prevents overshooting optimal point.
The cosine curve's S-shape naturally concentrates change in the middle phase where the most optimization 'action' occurs.
Comparison with Polynomial Decay:
Theoretical analysis of SGD convergence often suggests polynomial decay: $$\eta_t \propto t^{-\alpha}$$
for some $\alpha \in (0.5, 1)$. Cosine annealing roughly tracks this polynomial behavior through its middle phase, while its flat start and end provide a stability that pure polynomial decay lacks.
The Warm Restart Mechanism:
Warm restarts provide a theoretically-grounded mechanism for escaping local minima. When the learning rate spikes:
Increased Noise: SGD noise amplitude scales with learning rate, enabling escape from shallow local minima.
Basin Exploration: The high LR allows the model to traverse between different basins of attraction.
Diversity Injection: Each restart creates a divergence point, generating model diversity for ensembles.
The cosine shape of each cycle then guides convergence to a new local minimum, potentially better than the previous one.
Some researchers interpret warm restarts through an information-theoretic lens: each cycle introduces new 'information' (exploration) followed by 'compression' (convergence). This explore-compress cycle mirrors how humans learn complex skills through alternating broad exploration and focused practice.
PyTorch provides native support for cosine annealing, but understanding the implementation details enables customization and debugging.
```python
import numpy as np
import torch
from torch.optim.lr_scheduler import (
    CosineAnnealingLR,
    CosineAnnealingWarmRestarts,
    LambdaLR
)
from typing import Optional, List
import math


# =====================================================
# Implementation 1: PyTorch Native CosineAnnealingLR
# =====================================================

def create_cosine_annealing(
    optimizer,
    T_max: int,
    eta_min: float = 0,
    last_epoch: int = -1
):
    """
    Standard single-cycle cosine annealing.

    Args:
        optimizer: Wrapped optimizer
        T_max: Maximum number of epochs (full cosine period)
        eta_min: Minimum learning rate
        last_epoch: For resuming training

    LR formula: η_t = η_min + 0.5*(η_max - η_min)*(1 + cos(π*t/T_max))
    """
    return CosineAnnealingLR(
        optimizer,
        T_max=T_max,
        eta_min=eta_min,
        last_epoch=last_epoch
    )


# =====================================================
# Implementation 2: PyTorch Native Warm Restarts
# =====================================================

def create_warm_restarts(
    optimizer,
    T_0: int,
    T_mult: int = 1,
    eta_min: float = 0,
    last_epoch: int = -1
):
    """
    Cosine annealing with warm restarts (SGDR).

    Args:
        optimizer: Wrapped optimizer
        T_0: Number of epochs for first cycle
        T_mult: Multiplier for subsequent cycles (T_i = T_0 * T_mult^i)
        eta_min: Minimum learning rate
        last_epoch: For resuming

    T_mult=1: All cycles same length T_0
    T_mult=2: Cycles double (10, 20, 40, 80, ...)
    """
    return CosineAnnealingWarmRestarts(
        optimizer,
        T_0=T_0,
        T_mult=T_mult,
        eta_min=eta_min,
        last_epoch=last_epoch
    )


# =====================================================
# Implementation 3: Cosine with Warmup
# =====================================================

class CosineAnnealingWithWarmup:
    """
    Cosine annealing with linear warmup period.

    This is the standard schedule for transformer training:
    1. Linear warmup from near-zero to base LR
    2. Cosine decay from base LR to minimum
    """

    def __init__(
        self,
        optimizer,
        warmup_epochs: int,
        total_epochs: int,
        warmup_start_lr: float = 1e-7,
        eta_min: float = 0
    ):
        self.optimizer = optimizer
        self.warmup_epochs = warmup_epochs
        self.total_epochs = total_epochs
        self.warmup_start_lr = warmup_start_lr
        self.eta_min = eta_min

        # Store base LRs from optimizer
        self.base_lrs = [pg['lr'] for pg in optimizer.param_groups]
        self.current_epoch = 0
        self.cosine_epochs = total_epochs - warmup_epochs

    def get_lr(self) -> List[float]:
        if self.current_epoch < self.warmup_epochs:
            # Linear warmup
            alpha = self.current_epoch / self.warmup_epochs
            return [
                self.warmup_start_lr + alpha * (base_lr - self.warmup_start_lr)
                for base_lr in self.base_lrs
            ]
        else:
            # Cosine annealing
            cosine_epoch = self.current_epoch - self.warmup_epochs
            cosine_factor = 0.5 * (1 + math.cos(math.pi * cosine_epoch / self.cosine_epochs))
            return [
                self.eta_min + cosine_factor * (base_lr - self.eta_min)
                for base_lr in self.base_lrs
            ]

    def step(self, epoch: Optional[int] = None):
        if epoch is not None:
            self.current_epoch = epoch
        else:
            self.current_epoch += 1

        lrs = self.get_lr()
        for param_group, lr in zip(self.optimizer.param_groups, lrs):
            param_group['lr'] = lr

    def state_dict(self):
        return {
            'current_epoch': self.current_epoch,
            'warmup_epochs': self.warmup_epochs,
            'total_epochs': self.total_epochs,
            'base_lrs': self.base_lrs,
            'warmup_start_lr': self.warmup_start_lr,
            'eta_min': self.eta_min
        }

    def load_state_dict(self, state_dict):
        self.current_epoch = state_dict['current_epoch']
        self.warmup_epochs = state_dict['warmup_epochs']
        self.total_epochs = state_dict['total_epochs']
        self.base_lrs = state_dict['base_lrs']
        self.warmup_start_lr = state_dict['warmup_start_lr']
        self.eta_min = state_dict['eta_min']
        self.cosine_epochs = self.total_epochs - self.warmup_epochs


# =====================================================
# Implementation 4: Warm Restarts with Warmup
# =====================================================

class WarmRestartWithWarmup:
    """
    Full-featured cosine schedule with:
    - Initial linear warmup
    - Multiple cosine cycles with restarts
    - Configurable cycle length growth
    - Maximum LR decay across restarts

    This is production-ready for transformer training.
    """

    def __init__(
        self,
        optimizer,
        warmup_epochs: int,
        cycle_epochs: int,
        num_cycles: int,
        cycle_mult: float = 1.0,
        max_lr_decay: float = 1.0,
        warmup_start_lr: float = 1e-7,
        eta_min: float = 0
    ):
        """
        Args:
            optimizer: Wrapped optimizer
            warmup_epochs: Linear warmup duration
            cycle_epochs: First cycle duration (after warmup)
            num_cycles: Number of cosine cycles
            cycle_mult: Cycle length multiplier (1.0 = fixed, 2.0 = doubling)
            max_lr_decay: Factor to reduce max LR at each restart
            warmup_start_lr: LR at epoch 0
            eta_min: Minimum LR (floor)
        """
        self.optimizer = optimizer
        self.warmup_epochs = warmup_epochs
        self.cycle_epochs = cycle_epochs
        self.num_cycles = num_cycles
        self.cycle_mult = cycle_mult
        self.max_lr_decay = max_lr_decay
        self.warmup_start_lr = warmup_start_lr
        self.eta_min = eta_min

        self.base_lrs = [pg['lr'] for pg in optimizer.param_groups]
        self.current_epoch = 0

        # Precompute cycle boundaries
        self.cycle_ends = [warmup_epochs]
        for i in range(num_cycles):
            cycle_len = cycle_epochs * (cycle_mult ** i)
            self.cycle_ends.append(self.cycle_ends[-1] + cycle_len)

        self.total_epochs = int(self.cycle_ends[-1])
        print(f"WarmRestartWithWarmup: {num_cycles} cycles, "
              f"total {self.total_epochs} epochs")
        print(f"  Cycle ends: {[int(e) for e in self.cycle_ends]}")

    def _get_cycle_info(self, epoch: int):
        """Determine current cycle and position within it."""
        for cycle_idx in range(len(self.cycle_ends) - 1):
            if epoch < self.cycle_ends[cycle_idx + 1]:
                cycle_start = self.cycle_ends[cycle_idx]
                cycle_end = self.cycle_ends[cycle_idx + 1]
                cycle_progress = (epoch - cycle_start) / (cycle_end - cycle_start)
                return cycle_idx, cycle_progress
        # Beyond planned cycles: stay at minimum
        return self.num_cycles - 1, 1.0

    def get_lr(self) -> List[float]:
        if self.current_epoch < self.warmup_epochs:
            # Warmup phase
            alpha = self.current_epoch / self.warmup_epochs
            return [
                self.warmup_start_lr + alpha * (base_lr - self.warmup_start_lr)
                for base_lr in self.base_lrs
            ]

        cycle_idx, progress = self._get_cycle_info(self.current_epoch)

        # Compute max LR for this cycle (may decay across restarts)
        cycle_max_lr_factor = self.max_lr_decay ** cycle_idx

        # Cosine factor
        cosine_factor = 0.5 * (1 + math.cos(math.pi * progress))

        return [
            self.eta_min + cosine_factor * (base_lr * cycle_max_lr_factor - self.eta_min)
            for base_lr in self.base_lrs
        ]

    def step(self, epoch: Optional[int] = None):
        if epoch is not None:
            self.current_epoch = epoch
        else:
            self.current_epoch += 1

        lrs = self.get_lr()
        for param_group, lr in zip(self.optimizer.param_groups, lrs):
            param_group['lr'] = lr


# =====================================================
# Implementation 5: Lambda-Based for Maximum Flexibility
# =====================================================

def create_cosine_with_lambda(
    optimizer,
    warmup_epochs: int,
    total_epochs: int,
    min_lr_ratio: float = 0.0,
    num_cycles: float = 0.5  # 0.5 = half period (standard), 1.0 = full period
):
    """
    Use LambdaLR for customized cosine schedules.

    Allows fractional cycles and other customizations.
    """
    def lr_lambda(epoch: int) -> float:
        if epoch < warmup_epochs:
            return epoch / warmup_epochs

        progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
        cosine_value = math.cos(math.pi * num_cycles * 2 * progress)
        # Map [-1, 1] to [min_lr_ratio, 1]
        return min_lr_ratio + (1 + cosine_value) * (1 - min_lr_ratio) / 2

    return LambdaLR(optimizer, lr_lambda)


# =====================================================
# Visualization and Analysis
# =====================================================

def visualize_cosine_schedules(total_epochs: int = 200):
    """Compare different cosine schedule variants."""
    import matplotlib.pyplot as plt

    epochs = np.arange(total_epochs)
    base_lr = 0.1

    # Basic cosine
    basic = base_lr * 0.5 * (1 + np.cos(np.pi * epochs / total_epochs))

    # With warmup (10%)
    warmup_length = int(total_epochs * 0.1)
    with_warmup = np.where(
        epochs < warmup_length,
        base_lr * epochs / warmup_length,
        base_lr * 0.5 * (1 + np.cos(np.pi * (epochs - warmup_length)
                                    / (total_epochs - warmup_length)))
    )

    # Warm restarts (4 cycles)
    cycle_length = total_epochs // 4
    warm_restart = base_lr * 0.5 * (1 + np.cos(np.pi * (epochs % cycle_length) / cycle_length))

    # Warm restarts with doubling
    def doubling_restarts(epoch, base=50):
        # Find which cycle we're in
        cumsum = 0
        cycle = 0
        while cumsum + base * (2 ** cycle) <= epoch:
            cumsum += base * (2 ** cycle)
            cycle += 1
        cycle_len = base * (2 ** cycle)
        progress = (epoch - cumsum) / cycle_len
        return 0.5 * (1 + np.cos(np.pi * progress))

    doubling = np.array([base_lr * doubling_restarts(e) for e in epochs])

    fig, ax = plt.subplots(figsize=(12, 6))
    ax.plot(epochs, basic, label='Basic Cosine', linewidth=2)
    ax.plot(epochs, with_warmup, label='Cosine + Warmup', linewidth=2)
    ax.plot(epochs, warm_restart, label='Warm Restarts (fixed)', linewidth=2)
    ax.plot(epochs, doubling, label='Warm Restarts (doubling)', linewidth=2)
    ax.set_xlabel('Epoch')
    ax.set_ylabel('Learning Rate')
    ax.set_title('Cosine Annealing Variants Comparison')
    ax.legend()
    ax.grid(True, alpha=0.3)

    return fig
```

CosineAnnealingLR expects T_max in epochs, with scheduler.step() called once per epoch. If you call step() per batch/step, you'll complete the schedule in T_max steps, not epochs—a common source of bugs. For per-step scheduling, compute T_max = epochs × steps_per_epoch.
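The unit conversion behind this tip is plain arithmetic; a sketch with an assumed run of 90 epochs at 400 optimizer steps per epoch:

```python
import math

# Assumed run sizes for illustration
epochs, steps_per_epoch = 90, 400
T_max_steps = epochs * steps_per_epoch  # pass this as T_max when step() runs per batch

# With the per-step T_max, the cosine formula reaches eta_min exactly at the last step.
eta_max, eta_min = 0.1, 0.0
lr_final = eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * T_max_steps / T_max_steps))
```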
Cosine annealing has fewer hyperparameters than some alternatives, but each choice significantly impacts training dynamics.
T_max (Total Epochs/Cycle Length):
The most important parameter. T_max sets the decay horizon: too short, and the learning rate bottoms out at $\eta_{\min}$ before training ends; too long, and it never fully decays. For single-cycle training, T_max should equal the number of epochs remaining after warmup.
Practical Guidelines: for single-cycle schedules, set T_max to the full epoch budget (e.g., 90 for a 90-epoch ImageNet run); for warm restart variants, T_0 instead sets the first cycle's length. The table below summarizes typical settings.
eta_min (Minimum Learning Rate):
Controls the LR floor. Common options: 0 for full decay to zero (standard in vision training), or a small positive value such as 1e-5 or 1e-6 so late epochs retain a small amount of learning (common for transformer pretraining and fine-tuning).
| Scenario | T_max / T_0 | T_mult | eta_min | Warmup |
|---|---|---|---|---|
| ImageNet training (90 epochs) | 90 | — | 0 | 0 |
| Transformer pretraining (long) | Total epochs | — | 1e-5 | 10% of total |
| SGDR ensemble (100 epochs) | 25 | 1 | 0 | 5 epochs |
| Progressive refinement (150 epochs) | 10 | 2 | 1e-6 | 5 epochs |
| Fine-tuning pretrained (20 epochs) | 20 | — | 1e-6 | 2 epochs |
Warmup Duration:
For cosine schedules with warmup: a linear ramp covering roughly 5-10% of total training is typical, with larger models and larger batch sizes favoring the longer end of that range.
Cycle Multiplier (T_mult) for Warm Restarts: T_mult = 1 keeps every cycle the same length (the usual choice for snapshot ensembles), while T_mult = 2 doubles each cycle, giving the progressively longer refinement periods described under cycle length strategies above.
Max LR Decay Across Restarts:
Some implementations reduce the maximum learning rate at each restart: $$\eta_{\max}^{(i)} = \eta_{\max}^{(0)} \cdot \gamma^i$$
This provides additional annealing across cycles, ensuring later cycles are more about refinement than exploration.
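With a decay factor $\gamma$ (the value 0.5 below is just an assumed example), the per-cycle maxima form a geometric sequence:

```python
def cycle_max_lr(eta_max0: float, gamma: float, i: int) -> float:
    """Maximum LR for cycle i when decaying across restarts: eta_max0 * gamma**i."""
    return eta_max0 * gamma ** i

# Assumed eta_max0 = 0.1, gamma = 0.5: each restart peaks at half the previous max
maxes = [cycle_max_lr(0.1, 0.5, i) for i in range(4)]
```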
For most scenarios, 5-10% warmup followed by cosine decay works well. The intuition: warmup lets batch norm and optimizer state stabilize; cosine then provides smooth, principled decay. When in doubt, start with 5% warmup and adjust based on early training stability.
Cosine annealing has become the default choice for modern deep learning, but it's not universally optimal. Understanding when to use it—and when alternatives might be better—requires examining specific scenarios.
When Cosine Annealing Excels: long training runs (100+ epochs), transformer-style training where it is the published default, settings where smooth, stable decay matters, and workflows that build snapshot ensembles via warm restarts.
When Alternatives Might Be Better:
Following Established Recipes: If a seminal paper (e.g., original ResNet) uses step decay with specific milestones, reproducing their results requires their schedule.
Validation-Based Reduction: When you need adaptive scheduling based on training dynamics, ReduceLROnPlateau might be more appropriate.
Very Short Training: For very few epochs (< 20), the overhead of cosine's gentle start/end phases might not be worthwhile.
Debugging Training Issues: Step decay's discrete phases are easier to analyze when diagnosing problems.
| Criterion | Favor Cosine | Favor Step | Favor Exponential |
|---|---|---|---|
| Training length | 100+ epochs | 30-100 epochs | Long with stability needs |
| Published baseline | Transformer papers | CNN papers (ResNet) | Few formal baselines |
| Stability priority | High | Moderate | Very high |
| Interpretability need | Moderate | High | Low |
| Ensemble goal | Strong (warm restarts) | Weak | Weak |
If you're starting a new project without specific schedule requirements, cosine annealing with 5-10% linear warmup is the recommended default. It works well across architectures, has theoretical grounding, and produces reliable results with minimal tuning.
Cosine annealing represents the current state-of-the-art in learning rate scheduling, combining mathematical elegance with practical effectiveness. Its S-shaped curve naturally matches optimization dynamics, and the warm restart variant enables sophisticated exploration-exploitation strategies.
What's Next:
The next page explores warmup strategies in greater depth, examining why large models and large batches require careful initialization of the learning rate and how to design effective warmup schedules for your specific training scenario.
You now understand cosine annealing from mathematical foundations through advanced warm restart variants. This scheduling approach, combined with warmup strategies covered next, forms the backbone of modern deep learning training pipelines.