If you could tune only one hyperparameter for fine-tuning, it should be the learning rate. This single value controls the magnitude of every weight update, determining whether training converges smoothly, oscillates wildly, or stagnates completely.
Fine-tuning amplifies the importance of learning rate selection. Unlike training from scratch—where the network must discover features from random initialization—fine-tuning starts from a pre-trained state containing valuable knowledge. The wrong learning rate can destroy this knowledge in a few epochs, or fail to adapt it at all.
This page explores the science and practice of learning rate strategies: warmup schedules that stabilize early training, decay policies that enable fine convergence, and techniques for finding optimal rates systematically.
By the end of this page, you will understand why warmup matters for fine-tuning, implement effective decay schedules, use learning rate finders to locate optimal values, and apply slanted triangular learning rates for maximum adaptation.
Why Fine-Tuning Needs Lower Learning Rates:
Pre-trained models occupy favorable regions of the loss landscape—local minima that generalize well to their training task. Fine-tuning should gently guide the model from this region to a nearby optimum for the target task.
High learning rates cause large weight updates that catapult the model out of favorable regions:
$$\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$$
If η is too large, the update magnitude ||η∇L|| overwhelms the careful structure in θ. The model 'forgets' pre-trained knowledge—a phenomenon called catastrophic forgetting.
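To make the scale argument concrete, the sketch below uses hypothetical, illustrative weight and gradient magnitudes to compare the relative update size ‖η∇L‖/‖θ‖ at a from-scratch learning rate versus a fine-tuning one:

```python
import torch

# Hypothetical magnitudes, only to illustrate the scale argument above:
# compare the relative update size ||eta * grad|| / ||theta|| for two learning rates.
theta = torch.randn(1000) * 0.02   # assumed pre-trained weight scale
grad = torch.randn(1000)           # assumed noisy early fine-tuning gradient

for eta in (1e-1, 1e-4):
    rel = (eta * grad).norm() / theta.norm()
    print(f"eta={eta:.0e}: update is ~{rel:.1%} of the weight norm")
# ~500% at eta=1e-1 (the pre-trained weights are effectively overwritten),
# ~0.5% at eta=1e-4 (gentle, knowledge-preserving adaptation)
```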
Empirical Guidelines:
| Model | Training from Scratch | Fine-Tuning | Notes |
|---|---|---|---|
| ResNet-50 | 0.1 (SGD) | 1e-4 to 1e-3 (Adam) | Lower for deeper nets |
| EfficientNet | 0.016 | 1e-4 to 5e-4 | Already optimized |
| ViT-Base | 1e-3 | 5e-6 to 1e-4 | Transformers need lower LR |
| BERT-Base | 1e-4 | 2e-5 to 5e-5 | Very sensitive |
| GPT-2/3 | 2.5e-4 | 1e-5 to 3e-5 | Large models need care |
- Too high: destroys pre-trained knowledge; causes unstable training or divergence.
- Too low: extremely slow convergence; may get stuck in poor local minima before adapting.
- Just right: preserves useful features while enabling task-specific adaptation.
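As a rough illustration of these guidelines (a sketch assuming a recent torchvision ResNet-50 and a hypothetical 10-class target task), a common setup gives the pre-trained backbone a small fine-tuning learning rate and the freshly initialized head a larger one:

```python
import torch
from torchvision import models

# Illustrative setup: pre-trained backbone gets a small fine-tuning LR,
# the newly initialized head a larger one. Values are starting points, not rules.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = torch.nn.Linear(model.fc.in_features, 10)   # replace the classification head

backbone_params = [p for n, p in model.named_parameters() if not n.startswith("fc.")]
optimizer = torch.optim.AdamW([
    {"params": backbone_params, "lr": 1e-4},        # pre-trained weights: low LR
    {"params": model.fc.parameters(), "lr": 1e-3},  # freshly initialized head: higher LR
])
```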
Why Warmup Matters:
At the start of fine-tuning, gradients are computed based on the pre-trained model's behavior on target data—data it may never have seen. These initial gradients can be noisy and misleading. A high learning rate amplifies these noisy updates.
Warmup addresses this by starting with a very small learning rate and gradually increasing it toward the target value. This gives adaptive optimizers (e.g., Adam's moment estimates) time to calibrate and keeps early, noisy gradient steps small enough that they do not overwrite pre-trained features. The schedulers below implement three common variants: linear warmup with linear decay, linear warmup with cosine decay, and slanted triangular learning rates.
```python
import torch
from torch.optim.lr_scheduler import LambdaLR
import math


def linear_warmup_scheduler(optimizer, warmup_steps: int, total_steps: int):
    """
    Linear warmup followed by linear decay.
    LR: 0 -> max_lr (warmup) -> 0 (decay)
    """
    def lr_lambda(step):
        if step < warmup_steps:
            return float(step) / float(max(1, warmup_steps))
        progress = float(step - warmup_steps) / float(max(1, total_steps - warmup_steps))
        return max(0.0, 1.0 - progress)

    return LambdaLR(optimizer, lr_lambda)


def cosine_warmup_scheduler(optimizer, warmup_steps: int, total_steps: int,
                            min_lr_ratio: float = 0.0):
    """
    Linear warmup followed by cosine decay.
    Smooth decay often works better than linear.
    """
    def lr_lambda(step):
        if step < warmup_steps:
            return float(step) / float(max(1, warmup_steps))
        progress = float(step - warmup_steps) / float(max(1, total_steps - warmup_steps))
        return max(min_lr_ratio, 0.5 * (1.0 + math.cos(math.pi * progress)))

    return LambdaLR(optimizer, lr_lambda)


def slanted_triangular_scheduler(optimizer, total_steps: int,
                                 cut_frac: float = 0.1, ratio: int = 32):
    """
    Slanted Triangular Learning Rate (STLR) from ULMFiT.
    Quick ramp up, slow decay. Optimal for fine-tuning.

    Args:
        cut_frac: Fraction of steps for warmup (0.1 = 10%)
        ratio: How much smaller initial LR is vs peak
    """
    cut = int(total_steps * cut_frac)

    def lr_lambda(step):
        if step < cut:
            p = step / cut
        else:
            p = 1 - (step - cut) / (total_steps - cut)
        return (1 + p * (ratio - 1)) / ratio

    return LambdaLR(optimizer, lr_lambda)
```

Warmup Duration Guidelines:
Rule of thumb: When in doubt, use 5-10% of total training time for warmup. More warmup rarely hurts; too little can cause instability.
STLR spends more time at high learning rates (for adaptation) and less time ramping up. This matches what fine-tuning needs: adapt aggressively early, then converge smoothly. Use cut_frac=0.1 and ratio=32 as starting points.
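A minimal sketch of wiring the schedulers defined above into a fine-tuning loop; the model and step counts are stand-ins, and the key detail is calling scheduler.step() once per optimizer step:

```python
import torch

# Illustrative wiring of the schedulers defined above (cosine_warmup_scheduler /
# slanted_triangular_scheduler); the model and step counts are placeholders.
model = torch.nn.Linear(10, 2)
total_steps = 1000                       # e.g. len(train_loader) * num_epochs
warmup_steps = int(0.05 * total_steps)   # 5% warmup, per the rule of thumb above

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # peak LR
scheduler = cosine_warmup_scheduler(optimizer, warmup_steps, total_steps)
# or: scheduler = slanted_triangular_scheduler(optimizer, total_steps, cut_frac=0.1, ratio=32)

for step in range(total_steps):
    # ... forward pass, loss.backward(), optimizer.step(), optimizer.zero_grad() ...
    scheduler.step()                     # advance the schedule once per optimizer step
```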
After warmup, the learning rate should decay to enable fine convergence. Different decay policies offer different trade-offs.
Linear Decay: $$\eta_t = \eta_0 \cdot (1 - t/T)$$
Simple and predictable. Good default choice.
Cosine Decay: $$\eta_t = \eta_{min} + \frac{1}{2}(\eta_0 - \eta_{min})(1 + \cos(\pi t/T))$$
Smooth decay that spends more time at moderate LRs. Often outperforms linear.
Exponential Decay: $$\eta_t = \eta_0 \cdot \gamma^{t}$$
Aggressive early decay, slow late decay. Better for long training runs.
Step Decay: $$\eta_t = \eta_0 \cdot \gamma^{\lfloor t/s \rfloor}$$
Drop LR by factor γ every s steps. Traditional but less smooth.
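For reference, these four formulas translate directly into code. The sketch below evaluates each one at the midpoint of an illustrative 1000-step run (γ and s values are assumptions chosen for the example):

```python
import math

# Direct translations of the decay formulas above; gamma and s are illustrative.
def linear_decay(eta0, t, T):
    return eta0 * (1 - t / T)

def cosine_decay(eta0, t, T, eta_min=0.0):
    return eta_min + 0.5 * (eta0 - eta_min) * (1 + math.cos(math.pi * t / T))

def exponential_decay(eta0, t, gamma=0.995):
    return eta0 * gamma ** t

def step_decay(eta0, t, s, gamma=0.1):
    return eta0 * gamma ** (t // s)

# Halfway through 1000 steps with eta0 = 1e-3:
print(linear_decay(1e-3, 500, 1000))    # 5.0e-4
print(cosine_decay(1e-3, 500, 1000))    # 5.0e-4 (cos(pi/2) = 0)
print(exponential_decay(1e-3, 500))     # ~8.2e-5 with gamma = 0.995
print(step_decay(1e-3, 500, s=300))     # 1.0e-4 (one drop by gamma = 0.1)
```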
| Policy | Pros | Cons | Best For |
|---|---|---|---|
| Linear | Simple, predictable | May decay too fast early | Short fine-tuning |
| Cosine | Smooth, effective | Fixed schedule | Default choice |
| Exponential | Aggressive adaptation | Needs γ tuning | Long training |
| Step | Clear phases | Discontinuities | When phases are natural |
| Cosine w/ Restarts | Escapes local minima | More hyperparameters | Difficult tasks |
```python
from torch.optim.lr_scheduler import (
    CosineAnnealingLR,
    CosineAnnealingWarmRestarts,
    ExponentialLR,
    StepLR,
)


def create_scheduler(optimizer, policy: str, total_steps: int, **kwargs):
    """Factory for creating learning rate schedulers."""
    if policy == "cosine":
        return CosineAnnealingLR(optimizer, T_max=total_steps,
                                 eta_min=kwargs.get('min_lr', 0))
    elif policy == "cosine_restarts":
        # T_0: initial period, T_mult: period multiplier after each restart
        return CosineAnnealingWarmRestarts(
            optimizer,
            T_0=kwargs.get('T_0', total_steps // 4),
            T_mult=kwargs.get('T_mult', 2)
        )
    elif policy == "exponential":
        return ExponentialLR(optimizer, gamma=kwargs.get('gamma', 0.95))
    elif policy == "step":
        return StepLR(
            optimizer,
            step_size=kwargs.get('step_size', total_steps // 3),
            gamma=kwargs.get('gamma', 0.1)
        )
    else:
        raise ValueError(f"Unknown policy: {policy}")
```

Warm restarts periodically reset the learning rate, allowing the model to escape local minima. Particularly useful for difficult fine-tuning tasks or when the loss plateaus. Start with T_0 = total_steps/4 and T_mult = 2.
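A hypothetical usage sketch of the factory above, picking cosine with warm restarts for a run where the loss is expected to plateau (the model and step count are placeholders):

```python
import torch

# Placeholder model and step count, purely to show how the factory is called.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
total_steps = 4000

scheduler = create_scheduler(optimizer, "cosine_restarts", total_steps,
                             T_0=total_steps // 4, T_mult=2)

for step in range(total_steps):
    # ... training step ...
    scheduler.step()   # the LR resets automatically at the end of each period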
Rather than guessing, we can systematically find a good learning rate using the LR Range Test introduced by Leslie Smith.
The Algorithm: start from a very small learning rate, increase it exponentially after every mini-batch while recording the (smoothed) loss, and stop once the loss clearly diverges.
Interpreting the Plot: the loss stays flat at very small learning rates, falls steeply across the useful range, then explodes once the rate becomes too large. The steep-descent region marks the candidate learning rates.
```python
import torch
import numpy as np
import matplotlib.pyplot as plt
from copy import deepcopy


class LRFinder:
    """
    Learning Rate Finder using exponential LR increase.
    Based on 'Cyclical Learning Rates for Training Neural Networks' (Smith, 2017).
    """

    def __init__(self, model, optimizer, criterion, device="cuda"):
        self.model = model
        self.optimizer = optimizer
        self.criterion = criterion
        self.device = device

        # Save initial state so the test leaves the model untouched
        self.model_state = deepcopy(model.state_dict())
        self.optimizer_state = deepcopy(optimizer.state_dict())

    def find(self, train_loader, start_lr=1e-7, end_lr=10,
             num_steps=100, smooth_factor=0.05):
        """
        Run LR range test.

        Returns:
            lrs: List of learning rates tested
            losses: Smoothed losses at each LR
        """
        # Multiplier that takes the LR from start_lr to end_lr in num_steps
        lr_mult = (end_lr / start_lr) ** (1 / num_steps)
        lr = start_lr
        self.optimizer.param_groups[0]['lr'] = lr

        lrs, losses = [], []
        best_loss = float('inf')
        avg_loss = 0.0

        self.model.train()
        iterator = iter(train_loader)

        for step in range(num_steps):
            try:
                inputs, targets = next(iterator)
            except StopIteration:
                iterator = iter(train_loader)
                inputs, targets = next(iterator)

            inputs, targets = inputs.to(self.device), targets.to(self.device)

            self.optimizer.zero_grad()
            outputs = self.model(inputs)
            loss = self.criterion(outputs, targets)

            # Exponential moving average of the loss, with bias correction
            avg_loss = smooth_factor * loss.item() + (1 - smooth_factor) * avg_loss
            smoothed_loss = avg_loss / (1 - (1 - smooth_factor) ** (step + 1))

            # Stop if loss explodes
            if step > 0 and smoothed_loss > 4 * best_loss:
                break

            if smoothed_loss < best_loss:
                best_loss = smoothed_loss

            lrs.append(lr)
            losses.append(smoothed_loss)

            loss.backward()
            self.optimizer.step()

            # Increase LR for the next step
            lr *= lr_mult
            self.optimizer.param_groups[0]['lr'] = lr

        # Restore initial state
        self.model.load_state_dict(self.model_state)
        self.optimizer.load_state_dict(self.optimizer_state)

        return lrs, losses

    def plot(self, lrs, losses, suggest=True):
        """Plot LR range test results."""
        plt.figure(figsize=(10, 6))
        plt.plot(lrs, losses)
        plt.xscale('log')
        plt.xlabel('Learning Rate')
        plt.ylabel('Loss')
        plt.title('LR Range Test')

        if suggest:
            # Suggest the LR at the point of steepest loss descent
            gradients = np.gradient(losses)
            min_idx = np.argmin(gradients)
            suggested_lr = lrs[min_idx]
            plt.axvline(x=suggested_lr, color='r', linestyle='--',
                        label=f'Suggested LR: {suggested_lr:.2e}')
            plt.legend()

        plt.show()
        return suggested_lr if suggest else None
```

For fine-tuning, the optimal LR from the range test is often at the point of steepest descent, NOT at the minimum. The minimum may be where the model starts 'overfitting' to the test batches. Choose an LR slightly before the steep descent flattens out.
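A self-contained sketch of running the finder defined above before fine-tuning; the model, data, and loss here are placeholders chosen only to make the example runnable:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model, data, and loss for demonstration purposes only.
model = torch.nn.Linear(784, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-7)   # overwritten by the test
criterion = torch.nn.CrossEntropyLoss()
train_loader = DataLoader(
    TensorDataset(torch.randn(512, 784), torch.randint(0, 10, (512,))),
    batch_size=32, shuffle=True,
)

finder = LRFinder(model, optimizer, criterion, device="cpu")
lrs, losses = finder.find(train_loader, start_lr=1e-7, end_lr=10, num_steps=100)
suggested_lr = finder.plot(lrs, losses)      # steepest-descent suggestion
print(f"Starting fine-tuning LR: ~{suggested_lr:.2e}")
```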
The One Cycle Policy combines warmup and decay into a single coherent strategy that often outperforms alternatives.
The Strategy: start at a small fraction of the peak rate, ramp up to max_lr over the first portion of training (e.g., 30% of steps), then anneal back down to a final rate far below the starting value; when using SGD with momentum, momentum is typically cycled in the opposite direction.
Key Insight: High learning rates provide regularization by preventing convergence to sharp minima. The cycle exploits this by spending significant time at high LRs.
```python
from torch.optim.lr_scheduler import OneCycleLR


def create_one_cycle_scheduler(
    optimizer,
    max_lr: float,
    total_steps: int,
    pct_start: float = 0.3,         # 30% warmup
    div_factor: float = 25,         # initial_lr = max_lr / 25
    final_div_factor: float = 1e4   # final_lr = initial_lr / 10000
):
    """
    Create OneCycleLR scheduler for fine-tuning.

    Args:
        max_lr: Peak learning rate
        total_steps: Total training steps
        pct_start: Fraction of cycle for warmup
        div_factor: initial_lr = max_lr / div_factor
        final_div_factor: final_lr = initial_lr / final_div_factor
    """
    return OneCycleLR(
        optimizer,
        max_lr=max_lr,
        total_steps=total_steps,
        pct_start=pct_start,
        div_factor=div_factor,
        final_div_factor=final_div_factor,
        anneal_strategy='cos'  # cosine annealing
    )


# Example usage for fine-tuning
"""
# Find max_lr using LR finder first, then:
scheduler = create_one_cycle_scheduler(
    optimizer,
    max_lr=1e-3,  # From LR finder
    total_steps=len(train_loader) * num_epochs,
    pct_start=0.3
)

for epoch in range(num_epochs):
    for batch in train_loader:
        # ... training step ...
        scheduler.step()  # Step after each batch!
"""
```

| Scenario | max_lr | pct_start | div_factor | Notes |
|---|---|---|---|---|
| ImageNet CNNs → Similar domain | 1e-3 | 0.3 | 25 | Standard settings |
| ImageNet CNNs → Different domain | 5e-4 | 0.4 | 25 | More warmup, lower peak |
| BERT → NLP task | 2e-5 | 0.1 | 10 | Shorter warmup, smaller range |
| ViT → Vision task | 1e-4 | 0.3 | 25 | Transformers need care |
You now understand learning rate strategies for fine-tuning: warmup, decay schedules, LR finding, and the one cycle policy. These techniques are essential for stable, effective transfer learning.
What's Next:
With learning rate strategies mastered, we turn to a critical challenge: catastrophic forgetting. The next page explores why neural networks forget pre-trained knowledge during fine-tuning and techniques to preserve important capabilities while adapting to new tasks.