In deep learning optimization, the learning rate schedule is one of the most critical hyperparameters affecting training dynamics and final model performance. A well-designed schedule helps models converge faster, escape sharp minima, and achieve better generalization.
The Warmup + Cosine Annealing schedule has become one of the most popular and effective learning rate strategies in modern deep learning. It combines two distinct phases:
Phase 1: Linear Warmup (Steps 0 to W-1) During the initial warmup phase, the learning rate starts from 0 and increases linearly to the maximum learning rate (lr_max). This gradual ramp-up prevents the model from making large, potentially destabilizing updates when the weights are still randomly initialized. The warmup helps the optimizer "explore" the loss landscape carefully before committing to aggressive updates.
For step t in the warmup phase (where t ranges from 0 to W-1), the learning rate is:
$$lr(t) = \frac{t}{W} \times lr_{max}$$
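The warmup ramp can be sanity-checked directly (a minimal sketch; the helper name `warmup_lr` is illustrative, not part of the required interface):

```python
def warmup_lr(t, W, lr_max):
    # Linear ramp: step t (0 <= t < W) gets fraction t/W of lr_max.
    return (t / W) * lr_max

# With W = 3 and lr_max = 1.0, the first three steps are 0.0, 0.3333, 0.6667.
print([round(warmup_lr(t, 3, 1.0), 4) for t in range(3)])
```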
Phase 2: Cosine Annealing Decay (Steps W to T-1) After warmup completes at step W, the learning rate smoothly decays following a cosine curve from lr_max down to lr_min. The cosine decay provides a gentle, non-linear reduction that spends more time at moderate learning rates—helping the model refine its weights before settling into a local minimum.
For step t in the decay phase (where t ranges from W to T-1), the cosine annealing formula is:
$$lr(t) = lr_{min} + \frac{1}{2}(lr_{max} - lr_{min})\left(1 + \cos\left(\frac{\pi \cdot (t - W)}{T - W}\right)\right)$$
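The decay formula translates directly into code (a sketch; the helper name `cosine_lr` is illustrative). Note that the progress fraction (t − W)/(T − W) reaches at most (T − W − 1)/(T − W) < 1, so the schedule approaches but never hits lr_min:

```python
import math

def cosine_lr(t, T, W, lr_min, lr_max):
    # Cosine decay for W <= t < T; progress runs over T - W intervals,
    # so the final step stops just short of lr_min.
    progress = (t - W) / (T - W)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# First worked example below: T = 10, W = 3 -> step 4 uses cos(pi * 1/7).
print(round(cosine_lr(4, 10, 3, 0.0, 1.0), 4))  # 0.9505
```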
Your Task: Implement a function that generates the complete learning rate schedule for T training steps. The function should return a list containing the learning rate at each step, where each value is rounded to 4 decimal places.
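The two phases above can be combined into one function (a reference sketch; the name `lr_schedule` and the default for `lr_min` are illustrative choices, not mandated by the problem):

```python
import math

def lr_schedule(T, W, lr_max, lr_min=0.0):
    """Return the learning rate for each of T steps, rounded to 4 decimals."""
    lrs = []
    for t in range(T):
        if t < W:
            # Phase 1: linear warmup from 0 toward lr_max.
            lr = (t / W) * lr_max
        else:
            # Phase 2: cosine decay from lr_max toward lr_min.
            progress = (t - W) / (T - W)
            lr = lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
        lrs.append(round(lr, 4))
    return lrs

# First worked example: T = 10, W = 3, lr_max = 1.0, lr_min = 0.0
print(lr_schedule(10, 3, 1.0, 0.0))
```

Running this reproduces the expected output of the first example: [0.0, 0.3333, 0.6667, 1.0, 0.9505, 0.8117, 0.6113, 0.3887, 0.1883, 0.0495].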
Notes:
Example 1:
Input: T = 10, W = 3, lr_max = 1.0, lr_min = 0.0
Output: [0.0, 0.3333, 0.6667, 1.0, 0.9505, 0.8117, 0.6113, 0.3887, 0.1883, 0.0495]
Explanation:
Warmup Phase (Steps 0-2):
• Step 0: lr = (0/3) × 1.0 = 0.0
• Step 1: lr = (1/3) × 1.0 = 0.3333
• Step 2: lr = (2/3) × 1.0 = 0.6667
Transition Point (Step 3): • Step 3: lr = lr_max = 1.0 (warmup complete)
Cosine Decay Phase (Steps 4-9): The learning rate follows a cosine curve from 1.0 down toward 0.0, with the decay progress measured over T − W = 7 intervals:
• Step 4: lr = 0.5 × (1 + cos(π × 1/7)) = 0.9505
• Step 5: lr = 0.5 × (1 + cos(π × 2/7)) = 0.8117
• Step 6: lr = 0.5 × (1 + cos(π × 3/7)) = 0.6113
• Step 7: lr = 0.5 × (1 + cos(π × 4/7)) = 0.3887
• Step 8: lr = 0.5 × (1 + cos(π × 5/7)) = 0.1883
• Step 9: lr = 0.5 × (1 + cos(π × 6/7)) = 0.0495
Example 2:
Input: T = 5, W = 2, lr_max = 0.1, lr_min = 0.01
Output: [0.0, 0.05, 0.1, 0.0775, 0.0325]
Explanation:
Warmup Phase (Steps 0-1):
• Step 0: lr = (0/2) × 0.1 = 0.0
• Step 1: lr = (1/2) × 0.1 = 0.05
Transition Point (Step 2): • Step 2: lr = lr_max = 0.1
Cosine Decay Phase (Steps 3-4): The learning rate decays from 0.1 toward 0.01 following a cosine curve, with progress measured over T − W = 3 intervals:
• Step 3: lr = 0.01 + 0.045 × (1 + cos(π × 1/3)) = 0.0775
• Step 4: lr = 0.01 + 0.045 × (1 + cos(π × 2/3)) = 0.0325
Note that the final value doesn't reach exactly 0.01 because of the discrete step nature of the schedule.
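This gap can be seen directly: the last decay step uses progress (T − W − 1)/(T − W), which is strictly less than 1 (a small sketch using the second example's values):

```python
import math

T, W, lr_max, lr_min = 5, 2, 0.1, 0.01
# Last step t = T - 1 = 4: progress = 2/3 rather than 1, so the cosine
# evaluates to cos(2*pi/3) = -0.5 instead of cos(pi) = -1, leaving lr above lr_min.
progress = (T - 1 - W) / (T - W)
lr_last = lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
print(round(lr_last, 4))  # 0.0325, versus lr_min = 0.01
```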
Example 3:
Input: T = 8, W = 4, lr_max = 0.5, lr_min = 0.1
Output: [0.0, 0.125, 0.25, 0.375, 0.5, 0.4414, 0.3, 0.1586]
Explanation:
Warmup Phase (Steps 0-3): The learning rate increases linearly from 0 to 0.5 over 4 steps:
• Step 0: lr = 0.0
• Step 1: lr = (1/4) × 0.5 = 0.125
• Step 2: lr = (2/4) × 0.5 = 0.25
• Step 3: lr = (3/4) × 0.5 = 0.375
Transition Point (Step 4): • Step 4: lr = lr_max = 0.5
Cosine Decay Phase (Steps 5-7): The learning rate follows cosine decay from 0.5 toward 0.1, with progress measured over T − W = 4 intervals:
• Step 5: lr = 0.1 + 0.2 × (1 + cos(π × 1/4)) = 0.4414
• Step 6: lr = 0.1 + 0.2 × (1 + cos(π × 2/4)) = 0.3 (midpoint of the decay range)
• Step 7: lr = 0.1 + 0.2 × (1 + cos(π × 3/4)) = 0.1586 (approaching lr_min)
Constraints