If you could tune only one hyperparameter for fine-tuning, it should be the learning rate. This single value controls the magnitude of every weight update, determining whether training converges smoothly, oscillates wildly, or stagnates completely.
Fine-tuning amplifies the importance of learning rate selection. Unlike training from scratch—where the network must discover features from random initialization—fine-tuning starts from a pre-trained state containing valuable knowledge. The wrong learning rate can destroy this knowledge in a few epochs, or fail to adapt it at all.
This page explores the science and practice of learning rate strategies: warmup schedules that stabilize early training, decay policies that enable fine convergence, and techniques for finding optimal rates systematically.
By the end of this page, you will understand why warmup matters for fine-tuning, implement effective decay schedules, use learning rate finders to locate optimal values, and apply slanted triangular learning rates for maximum adaptation.
Why Fine-Tuning Needs Lower Learning Rates:
Pre-trained models occupy favorable regions of the loss landscape—local minima that generalize well to their training task. Fine-tuning should gently guide the model from this region to a nearby optimum for the target task.
High learning rates cause large weight updates that catapult the model out of favorable regions:
$$\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$$
If η is too large, the update magnitude ||η∇L|| overwhelms the careful structure in θ. The model 'forgets' pre-trained knowledge—a phenomenon called catastrophic forgetting.
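To make the scale argument concrete, the sketch below uses hypothetical, illustrative weight and gradient magnitudes to compare the relative update size ‖η∇L‖/‖θ‖ at a from-scratch learning rate versus a fine-tuning one:

```python
import torch

# Hypothetical magnitudes, only to illustrate the scale argument above:
# compare the relative update size ||eta * grad|| / ||theta|| for two learning rates.
theta = torch.randn(1000) * 0.02   # assumed pre-trained weight scale
grad = torch.randn(1000)           # assumed noisy early fine-tuning gradient

for eta in (1e-1, 1e-4):
    rel = (eta * grad).norm() / theta.norm()
    print(f"eta={eta:.0e}: update is ~{rel:.1%} of the weight norm")
# ~500% at eta=1e-1 (the pre-trained weights are effectively overwritten),
# ~0.5% at eta=1e-4 (gentle, knowledge-preserving adaptation)
```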
Empirical Guidelines:
| Model | Training from Scratch | Fine-Tuning | Notes |
|---|---|---|---|
| ResNet-50 | 0.1 (SGD) | 1e-4 to 1e-3 (Adam) | Lower for deeper nets |
| EfficientNet | 0.016 | 1e-4 to 5e-4 | Already optimized |
| ViT-Base | 1e-3 | 5e-6 to 1e-4 | Transformers need lower LR |
| BERT-Base | 1e-4 | 2e-5 to 5e-5 | Very sensitive |
| GPT-2/3 | 2.5e-4 | 1e-5 to 3e-5 | Large models need care |
- Too high: destroys pre-trained knowledge; causes unstable training or divergence.
- Too low: extremely slow convergence; may get stuck in poor local minima before adapting.
- Just right: preserves useful features while enabling task-specific adaptation.
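As a rough illustration of these guidelines (a sketch assuming a recent torchvision ResNet-50 and a hypothetical 10-class target task), a common setup gives the pre-trained backbone a small fine-tuning learning rate and the freshly initialized head a larger one:

```python
import torch
from torchvision import models

# Illustrative setup: pre-trained backbone gets a small fine-tuning LR,
# the newly initialized head a larger one. Values are starting points, not rules.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = torch.nn.Linear(model.fc.in_features, 10)   # replace the classification head

backbone_params = [p for n, p in model.named_parameters() if not n.startswith("fc.")]
optimizer = torch.optim.AdamW([
    {"params": backbone_params, "lr": 1e-4},        # pre-trained weights: low LR
    {"params": model.fc.parameters(), "lr": 1e-3},  # freshly initialized head: higher LR
])
```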
Why Warmup Matters:
At the start of fine-tuning, gradients are computed based on the pre-trained model's behavior on target data—data it may never have seen. These initial gradients can be noisy and misleading. A high learning rate amplifies these noisy updates.
Warmup addresses this by starting with a very small learning rate and gradually increasing it toward the target value. This gives adaptive optimizers (e.g., Adam's moment estimates) time to calibrate and keeps early, noisy gradient steps small enough that they do not overwrite pre-trained features. The schedulers below implement three common variants: linear warmup with linear decay, linear warmup with cosine decay, and slanted triangular learning rates.
```python
import torch
from torch.optim.lr_scheduler import LambdaLR
import math


def linear_warmup_scheduler(optimizer, warmup_steps: int, total_steps: int):
    """
    Linear warmup followed by linear decay.
    LR: 0 -> max_lr (warmup) -> 0 (decay)
    """
    def lr_lambda(step):
        if step < warmup_steps:
            return float(step) / float(max(1, warmup_steps))
        progress = float(step - warmup_steps) / float(max(1, total_steps - warmup_steps))
        return max(0.0, 1.0 - progress)

    return LambdaLR(optimizer, lr_lambda)


def cosine_warmup_scheduler(optimizer, warmup_steps: int, total_steps: int,
                            min_lr_ratio: float = 0.0):
    """
    Linear warmup followed by cosine decay.
    Smooth decay often works better than linear.
    """
    def lr_lambda(step):
        if step < warmup_steps:
            return float(step) / float(max(1, warmup_steps))
        progress = float(step - warmup_steps) / float(max(1, total_steps - warmup_steps))
        return max(min_lr_ratio, 0.5 * (1.0 + math.cos(math.pi * progress)))

    return LambdaLR(optimizer, lr_lambda)


def slanted_triangular_scheduler(optimizer, total_steps: int,
                                 cut_frac: float = 0.1, ratio: int = 32):
    """
    Slanted Triangular Learning Rate (STLR) from ULMFiT.
    Quick ramp up, slow decay. Optimal for fine-tuning.

    Args:
        cut_frac: Fraction of steps for warmup (0.1 = 10%)
        ratio: How much smaller initial LR is vs peak
    """
    cut = int(total_steps * cut_frac)

    def lr_lambda(step):
        if step < cut:
            p = step / cut
        else:
            p = 1 - (step - cut) / (total_steps - cut)
        return (1 + p * (ratio - 1)) / ratio

    return LambdaLR(optimizer, lr_lambda)
```

Warmup Duration Guidelines:
Rule of thumb: When in doubt, use 5-10% of total training time for warmup. More warmup rarely hurts; too little can cause instability.
STLR spends more time at high learning rates (for adaptation) and less time ramping up. This matches what fine-tuning needs: adapt aggressively early, then converge smoothly. Use cut_frac=0.1 and ratio=32 as starting points.
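A minimal sketch of wiring the schedulers defined above into a fine-tuning loop; the model and step counts are stand-ins, and the key detail is calling scheduler.step() once per optimizer step:

```python
import torch

# Illustrative wiring of the schedulers defined above (cosine_warmup_scheduler /
# slanted_triangular_scheduler); the model and step counts are placeholders.
model = torch.nn.Linear(10, 2)
total_steps = 1000                       # e.g. len(train_loader) * num_epochs
warmup_steps = int(0.05 * total_steps)   # 5% warmup, per the rule of thumb above

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # peak LR
scheduler = cosine_warmup_scheduler(optimizer, warmup_steps, total_steps)
# or: scheduler = slanted_triangular_scheduler(optimizer, total_steps, cut_frac=0.1, ratio=32)

for step in range(total_steps):
    # ... forward pass, loss.backward(), optimizer.step(), optimizer.zero_grad() ...
    scheduler.step()                     # advance the schedule once per optimizer step
```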
After warmup, the learning rate should decay to enable fine convergence. Different decay policies offer different trade-offs.
Linear Decay: $$\eta_t = \eta_0 \cdot (1 - t/T)$$
Simple and predictable. Good default choice.
Cosine Decay: $$\eta_t = \eta_{min} + \frac{1}{2}(\eta_0 - \eta_{min})(1 + \cos(\pi t/T))$$
Smooth decay that spends more time at moderate LRs. Often outperforms linear.
Exponential Decay: $$\eta_t = \eta_0 \cdot \gamma^{t}$$
Aggressive early decay, slow late decay. Better for long training runs.
Step Decay: $$\eta_t = \eta_0 \cdot \gamma^{\lfloor t/s \rfloor}$$
Drop LR by factor γ every s steps. Traditional but less smooth.
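For reference, these four formulas translate directly into code. The sketch below evaluates each one at the midpoint of an illustrative 1000-step run (γ and s values are assumptions chosen for the example):

```python
import math

# Direct translations of the decay formulas above; gamma and s are illustrative.
def linear_decay(eta0, t, T):
    return eta0 * (1 - t / T)

def cosine_decay(eta0, t, T, eta_min=0.0):
    return eta_min + 0.5 * (eta0 - eta_min) * (1 + math.cos(math.pi * t / T))

def exponential_decay(eta0, t, gamma=0.995):
    return eta0 * gamma ** t

def step_decay(eta0, t, s, gamma=0.1):
    return eta0 * gamma ** (t // s)

# Halfway through 1000 steps with eta0 = 1e-3:
print(linear_decay(1e-3, 500, 1000))    # 5.0e-4
print(cosine_decay(1e-3, 500, 1000))    # 5.0e-4 (cos(pi/2) = 0)
print(exponential_decay(1e-3, 500))     # ~8.2e-5 with gamma = 0.995
print(step_decay(1e-3, 500, s=300))     # 1.0e-4 (one drop by gamma = 0.1)
```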
| Policy | Pros | Cons | Best For |
|---|---|---|---|
| Linear | Simple, predictable | May decay too fast early | Short fine-tuning |
| Cosine | Smooth, effective | Fixed schedule | Default choice |
| Exponential | Aggressive adaptation | Needs γ tuning | Long training |
| Step | Clear phases | Discontinuities | When phases are natural |
| Cosine w/ Restarts | Escapes local minima | More hyperparameters | Difficult tasks |
```python
from torch.optim.lr_scheduler import (
    CosineAnnealingLR,
    CosineAnnealingWarmRestarts,
    ExponentialLR,
    StepLR,
)


def create_scheduler(optimizer, policy: str, total_steps: int, **kwargs):
    """Factory for creating learning rate schedulers."""
    if policy == "cosine":
        return CosineAnnealingLR(optimizer, T_max=total_steps,
                                 eta_min=kwargs.get('min_lr', 0))
    elif policy == "cosine_restarts":
        # T_0: initial period, T_mult: period multiplier after each restart
        return CosineAnnealingWarmRestarts(
            optimizer,
            T_0=kwargs.get('T_0', total_steps // 4),
            T_mult=kwargs.get('T_mult', 2)
        )
    elif policy == "exponential":
        return ExponentialLR(optimizer, gamma=kwargs.get('gamma', 0.95))
    elif policy == "step":
        return StepLR(
            optimizer,
            step_size=kwargs.get('step_size', total_steps // 3),
            gamma=kwargs.get('gamma', 0.1)
        )
    else:
        raise ValueError(f"Unknown policy: {policy}")
```

Warm restarts periodically reset the learning rate, allowing the model to escape local minima. Particularly useful for difficult fine-tuning tasks or when the loss plateaus. Start with T_0 = total_steps/4 and T_mult = 2.
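A hypothetical usage sketch of the factory above, picking cosine with warm restarts for a run where the loss is expected to plateau (the model and step count are placeholders):

```python
import torch

# Placeholder model and step count, purely to show how the factory is called.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
total_steps = 4000

scheduler = create_scheduler(optimizer, "cosine_restarts", total_steps,
                             T_0=total_steps // 4, T_mult=2)

for step in range(total_steps):
    # ... training step ...
    scheduler.step()   # the LR resets automatically at the end of each period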
Rather than guessing, we can systematically find a good learning rate using the LR Range Test introduced by Leslie Smith.
The Algorithm: start from a very small learning rate, increase it exponentially after every mini-batch while recording the (smoothed) loss, and stop once the loss clearly diverges.
Interpreting the Plot: the loss stays flat at very small learning rates, falls steeply across the useful range, then explodes once the rate becomes too large. The steep-descent region marks the candidate learning rates.
```python
import torch
import numpy as np
import matplotlib.pyplot as plt
from copy import deepcopy


class LRFinder:
    """
    Learning Rate Finder using exponential LR increase.
    Based on 'Cyclical Learning Rates for Training Neural Networks' (Smith, 2017).
    """

    def __init__(self, model, optimizer, criterion, device="cuda"):
        self.model = model
        self.optimizer = optimizer
        self.criterion = criterion
        self.device = device

        # Save initial state so the test leaves the model untouched
        self.model_state = deepcopy(model.state_dict())
        self.optimizer_state = deepcopy(optimizer.state_dict())

    def find(self, train_loader, start_lr=1e-7, end_lr=10,
             num_steps=100, smooth_factor=0.05):
        """
        Run LR range test.

        Returns:
            lrs: List of learning rates tested
            losses: Smoothed losses at each LR
        """
        # Multiplier that takes the LR from start_lr to end_lr in num_steps
        lr_mult = (end_lr / start_lr) ** (1 / num_steps)
        lr = start_lr
        self.optimizer.param_groups[0]['lr'] = lr

        lrs, losses = [], []
        best_loss = float('inf')
        avg_loss = 0.0

        self.model.train()
        iterator = iter(train_loader)

        for step in range(num_steps):
            try:
                inputs, targets = next(iterator)
            except StopIteration:
                iterator = iter(train_loader)
                inputs, targets = next(iterator)

            inputs, targets = inputs.to(self.device), targets.to(self.device)

            self.optimizer.zero_grad()
            outputs = self.model(inputs)
            loss = self.criterion(outputs, targets)

            # Exponential moving average of the loss, with bias correction
            avg_loss = smooth_factor * loss.item() + (1 - smooth_factor) * avg_loss
            smoothed_loss = avg_loss / (1 - (1 - smooth_factor) ** (step + 1))

            # Stop if loss explodes
            if step > 0 and smoothed_loss > 4 * best_loss:
                break

            if smoothed_loss < best_loss:
                best_loss = smoothed_loss

            lrs.append(lr)
            losses.append(smoothed_loss)

            loss.backward()
            self.optimizer.step()

            # Increase LR for the next step
            lr *= lr_mult
            self.optimizer.param_groups[0]['lr'] = lr

        # Restore initial state
        self.model.load_state_dict(self.model_state)
        self.optimizer.load_state_dict(self.optimizer_state)

        return lrs, losses

    def plot(self, lrs, losses, suggest=True):
        """Plot LR range test results."""
        plt.figure(figsize=(10, 6))
        plt.plot(lrs, losses)
        plt.xscale('log')
        plt.xlabel('Learning Rate')
        plt.ylabel('Loss')
        plt.title('LR Range Test')

        if suggest:
            # Suggest the LR at the point of steepest loss descent
            gradients = np.gradient(losses)
            min_idx = np.argmin(gradients)
            suggested_lr = lrs[min_idx]
            plt.axvline(x=suggested_lr, color='r', linestyle='--',
                        label=f'Suggested LR: {suggested_lr:.2e}')
            plt.legend()

        plt.show()
        return suggested_lr if suggest else None
```

For fine-tuning, the optimal LR from the range test is often at the point of steepest descent, NOT at the minimum. The minimum may be where the model starts 'overfitting' to the test batches. Choose an LR slightly before the steep descent flattens out.
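A self-contained sketch of running the finder defined above before fine-tuning; the model, data, and loss here are placeholders chosen only to make the example runnable:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model, data, and loss for demonstration purposes only.
model = torch.nn.Linear(784, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-7)   # overwritten by the test
criterion = torch.nn.CrossEntropyLoss()
train_loader = DataLoader(
    TensorDataset(torch.randn(512, 784), torch.randint(0, 10, (512,))),
    batch_size=32, shuffle=True,
)

finder = LRFinder(model, optimizer, criterion, device="cpu")
lrs, losses = finder.find(train_loader, start_lr=1e-7, end_lr=10, num_steps=100)
suggested_lr = finder.plot(lrs, losses)      # steepest-descent suggestion
print(f"Starting fine-tuning LR: ~{suggested_lr:.2e}")
```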
The One Cycle Policy combines warmup and decay into a single coherent strategy that often outperforms alternatives.
The Strategy: start at a small fraction of the peak rate, ramp up to max_lr over the first portion of training (e.g., 30% of steps), then anneal back down to a final rate far below the starting value; when using SGD with momentum, momentum is typically cycled in the opposite direction.
Key Insight: High learning rates provide regularization by preventing convergence to sharp minima. The cycle exploits this by spending significant time at high LRs.
```python
from torch.optim.lr_scheduler import OneCycleLR


def create_one_cycle_scheduler(
    optimizer,
    max_lr: float,
    total_steps: int,
    pct_start: float = 0.3,         # 30% warmup
    div_factor: float = 25,         # initial_lr = max_lr / 25
    final_div_factor: float = 1e4   # final_lr = initial_lr / 10000
):
    """
    Create OneCycleLR scheduler for fine-tuning.

    Args:
        max_lr: Peak learning rate
        total_steps: Total training steps
        pct_start: Fraction of cycle for warmup
        div_factor: initial_lr = max_lr / div_factor
        final_div_factor: final_lr = initial_lr / final_div_factor
    """
    return OneCycleLR(
        optimizer,
        max_lr=max_lr,
        total_steps=total_steps,
        pct_start=pct_start,
        div_factor=div_factor,
        final_div_factor=final_div_factor,
        anneal_strategy='cos'  # cosine annealing
    )


# Example usage for fine-tuning
"""
# Find max_lr using LR finder first, then:
scheduler = create_one_cycle_scheduler(
    optimizer,
    max_lr=1e-3,  # From LR finder
    total_steps=len(train_loader) * num_epochs,
    pct_start=0.3
)

for epoch in range(num_epochs):
    for batch in train_loader:
        # ... training step ...
        scheduler.step()  # Step after each batch!
"""
```

| Scenario | max_lr | pct_start | div_factor | Notes |
|---|---|---|---|---|
| ImageNet CNNs → Similar domain | 1e-3 | 0.3 | 25 | Standard settings |
| ImageNet CNNs → Different domain | 5e-4 | 0.4 | 25 | More warmup, lower peak |
| BERT → NLP task | 2e-5 | 0.1 | 10 | Shorter warmup, smaller range |
| ViT → Vision task | 1e-4 | 0.3 | 25 | Transformers need care |
You now understand learning rate strategies for fine-tuning: warmup, decay schedules, LR finding, and the one cycle policy. These techniques are essential for stable, effective transfer learning.
What's Next:
With learning rate strategies mastered, we turn to a critical challenge: catastrophic forgetting. The next page explores why neural networks forget pre-trained knowledge during fine-tuning and techniques to preserve important capabilities while adapting to new tasks.