For years, "weight decay" and "L2 regularization" have been used interchangeably. In the context of vanilla SGD, they are indeed mathematically equivalent. However, when using adaptive optimizers like Adam, RMSprop, or AdaGrad, they become fundamentally different techniques with dramatically different behaviors.
This distinction, formalized in the influential 2017 paper "Decoupled Weight Decay Regularization" (Loshchilov & Hutter), led to the creation of AdamW—now the default optimizer for training large language models, vision transformers, and most state-of-the-art deep learning systems.
Understanding this distinction is not merely academic—using the wrong formulation can significantly degrade model performance, particularly for large-scale training.
If you're using Adam with the weight_decay parameter expecting L2 regularization behavior, you're not getting true weight decay. This page explains why, and why it matters for your models.
Let's first establish why weight decay and L2 regularization have been conflated.
L2 Regularization Approach:
Add the penalty to the loss, then differentiate: $$\mathcal{L}_{\text{reg}} = \mathcal{L}_{\text{data}} + \frac{\lambda}{2}\|\boldsymbol{\theta}\|_2^2$$
$$\nabla_\theta \mathcal{L}_{\text{reg}} = \nabla_\theta \mathcal{L}_{\text{data}} + \lambda \boldsymbol{\theta}$$
SGD update: $$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta(\nabla_\theta \mathcal{L}_{\text{data}} + \lambda \boldsymbol{\theta}_t)$$
Weight Decay Approach:
Shrink weights directly by a factor, independently of the loss gradient: $$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \nabla_\theta \mathcal{L}_{\text{data}} - \eta \lambda_d \boldsymbol{\theta}_t$$ $$= (1 - \eta \lambda_d) \boldsymbol{\theta}_t - \eta \nabla_\theta \mathcal{L}_{\text{data}}$$
With vanilla SGD, setting $\lambda_d = \lambda$ makes these identical. The regularization gradient $\lambda \boldsymbol{\theta}$ produces the same effect as multiplying weights by $(1 - \eta \lambda)$.
```python
import numpy as np

def sgd_l2_regularization(theta, grad, lr, lambda_l2):
    """
    SGD with L2 regularization (added to gradient).
    θ_new = θ - lr * (∇L + λ*θ)
    """
    reg_grad = grad + lambda_l2 * theta
    return theta - lr * reg_grad

def sgd_weight_decay(theta, grad, lr, wd):
    """
    SGD with weight decay (direct shrinkage).
    θ_new = (1 - lr*wd)*θ - lr*∇L
    """
    return (1 - lr * wd) * theta - lr * grad

# These are EQUIVALENT for vanilla SGD
theta = np.array([1.0, 2.0, -1.5])
grad = np.array([0.1, -0.2, 0.3])
lr = 0.01
lambda_val = 0.001

result_l2 = sgd_l2_regularization(theta, grad, lr, lambda_val)
result_wd = sgd_weight_decay(theta, grad, lr, lambda_val)

print(f"L2 result: {result_l2}")
print(f"WD result: {result_wd}")
print(f"Difference: {np.max(np.abs(result_l2 - result_wd))}")  # ~0
```

Adaptive optimizers (Adam, RMSprop, AdaGrad) scale gradients by accumulated statistics. This breaks the equivalence.
Adam Update (simplified):
For each parameter $\theta_i$: $$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \quad \text{(momentum)}$$ $$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \quad \text{(squared gradient)}$$ $$\theta_{t+1} = \theta_t - \eta \frac{m_t}{\sqrt{v_t} + \epsilon}$$
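To make the update concrete, here is a minimal NumPy sketch of a single simplified Adam step (bias correction omitted, matching the equations above); the function and variable names are illustrative, not taken from any particular library:

```python
import numpy as np

def adam_step(theta, g, m, v, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One simplified Adam step (no bias correction, as in the equations above)."""
    m = beta1 * m + (1 - beta1) * g               # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g**2            # second moment (squared gradients)
    theta = theta - lr * m / (np.sqrt(v) + eps)   # per-parameter adaptive step
    return theta, m, v
```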
The Problem with L2 + Adam:
If we add L2 to the loss, the gradient becomes $g = \nabla \mathcal{L}_{\text{data}} + \lambda \theta$. This regularization term is scaled by Adam's adaptive learning rate:
$$\theta_{t+1} = \theta_t - \eta \frac{m_t + \lambda \theta_t}{\sqrt{v_t} + \epsilon}$$
The regularization effect on parameter $\theta_i$ depends on $v_{t,i}$—parameters with large gradient history get less regularization, while parameters with small gradient history get more regularization.
This is NOT the intended behavior of weight decay!
With Adam, L2 regularization causes: (1) Inconsistent regularization across parameters based on gradient history, (2) Parameters with frequently large gradients (active features) receive less regularization, (3) The intended uniform shrinkage effect is lost entirely.
| Parameter Type | Gradient History (v) | L2 Effect with Adam | True Weight Decay Effect |
|---|---|---|---|
| Frequently updated | Large v | Weak regularization (divided by √v) | Strong, uniform regularization |
| Rarely updated | Small v | Strong regularization | Same uniform regularization |
| Initial phase | Small v | Very strong (can be destabilizing) | Consistent from start |
The solution is decoupled weight decay: apply weight decay directly to parameters, not through the gradient.
AdamW Update:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla \mathcal{L}_{\text{data}}$$ $$v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla \mathcal{L}_{\text{data}})^2$$ $$\boldsymbol{\theta}_{t+1} = (1 - \eta \lambda_d) \boldsymbol{\theta}_t - \eta \frac{m_t}{\sqrt{v_t} + \epsilon}$$
Notice: weight decay $(1 - \eta \lambda_d)$ is applied before the adaptive update, and is not divided by $\sqrt{v}$.
```python
import torch
import torch.optim as optim

# Standard Adam with L2 (NOT recommended for most cases)
optimizer_adam_l2 = optim.Adam(
    model.parameters(),
    lr=1e-3,
    weight_decay=1e-2  # This is L2, NOT true weight decay!
)

# AdamW with decoupled weight decay (RECOMMENDED)
optimizer_adamw = optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=1e-2  # This IS true weight decay
)

# Manual implementation to understand the difference
class ManualAdamW:
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999),
                 eps=1e-8, weight_decay=0.01):
        self.params = list(params)
        self.lr = lr
        self.beta1, self.beta2 = betas
        self.eps = eps
        self.weight_decay = weight_decay
        self.t = 0
        # Initialize moment estimates
        self.m = [torch.zeros_like(p) for p in self.params]
        self.v = [torch.zeros_like(p) for p in self.params]

    def step(self):
        self.t += 1
        for i, p in enumerate(self.params):
            if p.grad is None:
                continue
            g = p.grad.data

            # Update biased moment estimates (from DATA gradient only)
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g**2

            # Bias correction
            m_hat = self.m[i] / (1 - self.beta1**self.t)
            v_hat = self.v[i] / (1 - self.beta2**self.t)

            # DECOUPLED weight decay: applied to parameters directly
            p.data = p.data * (1 - self.lr * self.weight_decay)

            # Adaptive gradient step (no weight decay here!)
            p.data = p.data - self.lr * m_hat / (torch.sqrt(v_hat) + self.eps)
```

For virtually all modern deep learning with adaptive optimizers, use AdamW (or decoupled weight decay variants). The performance difference can be substantial, especially for large models and long training runs.
The Loshchilov & Hutter paper and subsequent research demonstrated significant improvements from decoupled weight decay:
Key Findings:
| Task | SGD+WD | Adam+L2 | AdamW |
|---|---|---|---|
| CIFAR-10 (ResNet-32) | ~93.0% | ~91.5% | ~93.2% |
| ImageNet (ResNet-50) | ~76.5% | ~74.8% | ~76.7% |
| Penn Treebank LM | Baseline | -2% rel. | +0.5% rel. |
Why the Difference Matters at Scale:
For small models and short training runs, the difference may be negligible. But as model size and training duration increase, the inconsistent per-parameter regularization compounds over many more updates and many more parameters, and the gap between Adam+L2 and AdamW widens.
When switching from Adam+L2 to AdamW, hyperparameters need adjustment.
| Parameter | Adam + L2 | AdamW | Notes |
|---|---|---|---|
| Weight decay | Often 1e-4 to 1e-3 | Typically 0.01 to 0.1 | AdamW can handle larger values |
| Learning rate | Standard | Often can be larger | Decoupling allows more aggressive LR |
| LR schedule | Aggressive decay | Gentler decay often works | Weight decay provides ongoing regularization |
| Warmup | Often needed | Still recommended | Helps with initial gradient estimates |
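As a minimal sketch of the schedule-related rows above (the step counts and factors are placeholders, not recommendations from the paper, and `model` is assumed to exist), warmup followed by a gentle decay can be combined with AdamW using standard PyTorch schedulers:

```python
import torch.optim as optim
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

warmup_steps, total_steps = 1000, 100_000  # placeholder values

scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=0.01, total_iters=warmup_steps),  # warmup
        CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps),    # gentle decay
    ],
    milestones=[warmup_steps],
)

# Per optimization step: optimizer.step(); scheduler.step()
```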
AdamW's decoupled weight decay typically uses larger values (0.01 to 0.1) than L2 regularization (1e-4 to 1e-3). Because the decay is no longer rescaled by the per-parameter adaptive statistics, you're tuning a cleaner hyperparameter whose effect is consistent across parameters.
The principle of decoupled weight decay extends beyond Adam.
```python
import torch.optim as optim

# Modern best practices for different scenarios

# 1. General deep learning (most common)
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

# 2. Fine-tuning pre-trained models
optimizer = optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)

# 3. Vision models (traditional approach still works well)
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                      weight_decay=1e-4)  # coupled L2 (not exactly decoupled SGDW, but standard practice)

# 4. Large batch training
# Use LAMB or LARS (not built into PyTorch, use external libraries)

# 5. Memory-constrained training
# Consider Adafactor or 8-bit optimizers
```

You now understand the critical distinction between weight decay and L2 regularization with adaptive optimizers. This knowledge is essential for training state-of-the-art models. Next, we explore max-norm constraints, an alternative approach to controlling weight magnitudes.