For years, "weight decay" and "L2 regularization" have been used interchangeably. In the context of vanilla SGD, they are indeed mathematically equivalent. However, when using adaptive optimizers like Adam, RMSprop, or AdaGrad, they become fundamentally different techniques with dramatically different behaviors.
This distinction, formalized in the influential 2017 paper "Decoupled Weight Decay Regularization" (Loshchilov & Hutter), led to the creation of AdamW—now the default optimizer for training large language models, vision transformers, and most state-of-the-art deep learning systems.
Understanding this distinction is not merely academic—using the wrong formulation can significantly degrade model performance, particularly for large-scale training.
If you're using Adam with the weight_decay parameter expecting L2 regularization behavior, you're not getting true weight decay. This page explains why, and why it matters for your models.
Let's first establish why weight decay and L2 regularization have been conflated.
L2 Regularization Approach:
Add the penalty to the loss, then differentiate: $$\mathcal{L}_{\text{reg}} = \mathcal{L}_{\text{data}} + \frac{\lambda}{2}\|\boldsymbol{\theta}\|_2^2$$
$$\nabla_\theta \mathcal{L}_{\text{reg}} = \nabla_\theta \mathcal{L}_{\text{data}} + \lambda \boldsymbol{\theta}$$
SGD update: $$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta(\nabla_\theta \mathcal{L}_{\text{data}} + \lambda \boldsymbol{\theta}_t)$$
Weight Decay Approach:
Shrink weights directly by a factor, independently of the loss gradient: $$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \nabla_\theta \mathcal{L}_{\text{data}} - \eta \lambda_d \boldsymbol{\theta}_t$$ $$= (1 - \eta \lambda_d) \boldsymbol{\theta}_t - \eta \nabla_\theta \mathcal{L}_{\text{data}}$$
With vanilla SGD, setting $\lambda_d = \lambda$ makes these identical. The regularization gradient $\lambda \boldsymbol{\theta}$ produces the same effect as multiplying weights by $(1 - \eta \lambda)$.
```python
import numpy as np

def sgd_l2_regularization(theta, grad, lr, lambda_l2):
    """
    SGD with L2 regularization (added to gradient).
    θ_new = θ - lr * (∇L + λ*θ)
    """
    reg_grad = grad + lambda_l2 * theta
    return theta - lr * reg_grad

def sgd_weight_decay(theta, grad, lr, wd):
    """
    SGD with weight decay (direct shrinkage).
    θ_new = (1 - lr*wd)*θ - lr*∇L
    """
    return (1 - lr * wd) * theta - lr * grad

# These are EQUIVALENT for vanilla SGD
theta = np.array([1.0, 2.0, -1.5])
grad = np.array([0.1, -0.2, 0.3])
lr = 0.01
lambda_val = 0.001

result_l2 = sgd_l2_regularization(theta, grad, lr, lambda_val)
result_wd = sgd_weight_decay(theta, grad, lr, lambda_val)

print(f"L2 result: {result_l2}")
print(f"WD result: {result_wd}")
print(f"Difference: {np.max(np.abs(result_l2 - result_wd))}")  # ~0
```

Adaptive optimizers (Adam, RMSprop, AdaGrad) scale gradients by accumulated statistics. This breaks the equivalence.
Adam Update (simplified):
For each parameter $\theta_i$: $$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \quad \text{(momentum)}$$ $$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \quad \text{(squared gradient)}$$ $$\theta_{t+1} = \theta_t - \eta \frac{m_t}{\sqrt{v_t} + \epsilon}$$
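To make the update concrete, here is a minimal NumPy sketch of a single simplified Adam step (bias correction omitted, matching the equations above); the function and variable names are illustrative, not taken from any particular library:

```python
import numpy as np

def adam_step(theta, g, m, v, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One simplified Adam step (no bias correction, as in the equations above)."""
    m = beta1 * m + (1 - beta1) * g               # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g**2            # second moment (squared gradients)
    theta = theta - lr * m / (np.sqrt(v) + eps)   # per-parameter adaptive step
    return theta, m, v
```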
The Problem with L2 + Adam:
If we add L2 to the loss, the gradient becomes $g = \nabla \mathcal{L}_{\text{data}} + \lambda \theta$. This regularization term is scaled by Adam's adaptive learning rate:
$$\theta_{t+1} = \theta_t - \eta \frac{m_t + \lambda \theta_t}{\sqrt{v_t} + \epsilon}$$
The regularization effect on parameter $\theta_i$ depends on $v_{t,i}$—parameters with large gradient history get less regularization, while parameters with small gradient history get more regularization.
This is NOT the intended behavior of weight decay!
With Adam, L2 regularization causes: (1) Inconsistent regularization across parameters based on gradient history, (2) Parameters with frequently large gradients (active features) receive less regularization, (3) The intended uniform shrinkage effect is lost entirely.
| Parameter Type | Gradient History (v) | L2 Effect with Adam | True Weight Decay Effect |
|---|---|---|---|
| Frequently updated | Large v | Weak regularization (divided by √v) | Strong, uniform regularization |
| Rarely updated | Small v | Strong regularization | Same uniform regularization |
| Initial phase | Small v | Very strong (can be destabilizing) | Consistent from start |
The solution is decoupled weight decay: apply weight decay directly to parameters, not through the gradient.
AdamW Update:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla \mathcal{L}_{\text{data}}$$ $$v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla \mathcal{L}_{\text{data}})^2$$ $$\boldsymbol{\theta}_{t+1} = (1 - \eta \lambda_d) \boldsymbol{\theta}_t - \eta \frac{m_t}{\sqrt{v_t} + \epsilon}$$
Notice: weight decay $(1 - \eta \lambda_d)$ is applied before the adaptive update, and is not divided by $\sqrt{v}$.
```python
import torch
import torch.optim as optim

# Standard Adam with L2 (NOT recommended for most cases)
optimizer_adam_l2 = optim.Adam(
    model.parameters(),
    lr=1e-3,
    weight_decay=1e-2  # This is L2, NOT true weight decay!
)

# AdamW with decoupled weight decay (RECOMMENDED)
optimizer_adamw = optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=1e-2  # This IS true weight decay
)

# Manual implementation to understand the difference
class ManualAdamW:
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999),
                 eps=1e-8, weight_decay=0.01):
        self.params = list(params)
        self.lr = lr
        self.beta1, self.beta2 = betas
        self.eps = eps
        self.weight_decay = weight_decay
        self.t = 0
        # Initialize moment estimates
        self.m = [torch.zeros_like(p) for p in self.params]
        self.v = [torch.zeros_like(p) for p in self.params]

    def step(self):
        self.t += 1
        for i, p in enumerate(self.params):
            if p.grad is None:
                continue
            g = p.grad.data

            # Update biased moment estimates (from DATA gradient only)
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g**2

            # Bias correction
            m_hat = self.m[i] / (1 - self.beta1**self.t)
            v_hat = self.v[i] / (1 - self.beta2**self.t)

            # DECOUPLED weight decay: applied to parameters directly
            p.data = p.data * (1 - self.lr * self.weight_decay)

            # Adaptive gradient step (no weight decay here!)
            p.data = p.data - self.lr * m_hat / (torch.sqrt(v_hat) + self.eps)
```

For virtually all modern deep learning with adaptive optimizers, use AdamW (or decoupled weight decay variants). The performance difference can be substantial, especially for large models and long training runs.
The Loshchilov & Hutter paper and subsequent research demonstrated significant improvements from decoupled weight decay:
Key Findings:
| Task | SGD+WD | Adam+L2 | AdamW |
|---|---|---|---|
| CIFAR-10 (ResNet-32) | ~93.0% | ~91.5% | ~93.2% |
| ImageNet (ResNet-50) | ~76.5% | ~74.8% | ~76.7% |
| Penn Treebank LM | Baseline | -2% rel. | +0.5% rel. |
Why the Difference Matters at Scale:
For small models and short training runs, the difference may be negligible. But as model size and training duration increase, the inconsistent per-parameter regularization compounds over many more updates and many more parameters, and the gap between Adam+L2 and AdamW widens.
When switching from Adam+L2 to AdamW, hyperparameters need adjustment.
| Parameter | Adam + L2 | AdamW | Notes |
|---|---|---|---|
| Weight decay | Often 1e-4 to 1e-3 | Typically 0.01 to 0.1 | AdamW can handle larger values |
| Learning rate | Standard | Often can be larger | Decoupling allows more aggressive LR |
| LR schedule | Aggressive decay | Gentler decay often works | Weight decay provides ongoing regularization |
| Warmup | Often needed | Still recommended | Helps with initial gradient estimates |
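As a minimal sketch of the schedule-related rows above (the step counts and factors are placeholders, not recommendations from the paper, and `model` is assumed to exist), warmup followed by a gentle decay can be combined with AdamW using standard PyTorch schedulers:

```python
import torch.optim as optim
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

warmup_steps, total_steps = 1000, 100_000  # placeholder values

scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=0.01, total_iters=warmup_steps),  # warmup
        CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps),    # gentle decay
    ],
    milestones=[warmup_steps],
)

# Per optimization step: optimizer.step(); scheduler.step()
```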
AdamW's decoupled weight decay typically uses larger values (0.01 to 0.1) than L2 regularization (1e-4 to 1e-3). Because the decay is no longer rescaled by the per-parameter adaptive statistics, you're tuning a cleaner hyperparameter whose effect is consistent across parameters.
The principle of decoupled weight decay extends beyond Adam.
```python
import torch.optim as optim

# Modern best practices for different scenarios

# 1. General deep learning (most common)
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

# 2. Fine-tuning pre-trained models
optimizer = optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)

# 3. Vision models (traditional approach still works well)
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                      weight_decay=1e-4)  # coupled L2 (not exactly decoupled SGDW, but standard practice)

# 4. Large batch training
# Use LAMB or LARS (not built into PyTorch, use external libraries)

# 5. Memory-constrained training
# Consider Adafactor or 8-bit optimizers
```

You now understand the critical distinction between weight decay and L2 regularization with adaptive optimizers. This knowledge is essential for training state-of-the-art models. Next, we explore max-norm constraints, an alternative approach to controlling weight magnitudes.