In modern deep learning optimization, one of the most widely adopted techniques is the decoupled weight decay regularization approach, commonly known as AdamW. This algorithm addresses a critical flaw in traditional Adam optimization: the coupling of weight decay with the gradient-based update, which can lead to suboptimal regularization behavior.
The standard Adam optimizer combines adaptive learning rates with momentum, but when L2 regularization is applied by adding the regularization gradient directly to the loss gradient, the regularization term is divided by the same adaptive scaling (the square root of the second-moment estimate) as the gradient itself. Parameters with large historical gradients therefore receive weaker effective regularization, while rarely updated parameters receive stronger regularization: the opposite of what's typically desired.
AdamW solves this by applying weight decay directly to the parameters rather than incorporating it into the gradient computation. This decoupling ensures consistent regularization regardless of the gradient history.
Given a parameter vector w and its gradient g at time step t, the algorithm proceeds as follows:
Step 1: Update biased first moment estimate (momentum) $$m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$$
Step 2: Update biased second moment estimate (adaptive learning rate) $$v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$$
Step 3: Compute bias-corrected estimates $$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$ $$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
Step 4: Update parameters with decoupled weight decay $$w_t = w_{t-1} - \eta \cdot \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \cdot w_{t-1} \right)$$
Where: $m_t$ and $v_t$ are the first and second moment estimates, $g_t$ is the gradient at step $t$, $\beta_1$ and $\beta_2$ are the exponential decay rates for the moments, $\eta$ is the learning rate, $\epsilon$ is a small constant for numerical stability, and $\lambda$ is the weight decay coefficient.
Implement a function adamw_update(w, g, m, v, t, lr, beta1, beta2, epsilon, weight_decay) that performs one complete optimization step using the decoupled weight decay algorithm. The function should update the biased first and second moment estimates, apply bias correction, perform the decoupled parameter update, and return the updated values w_new, m_new, and v_new.
w = [1.0, 2.0]
g = [0.1, -0.2]
m = [0.0, 0.0]
v = [0.0, 0.0]
t = 1
lr = 0.01
beta1 = 0.9
beta2 = 0.999
epsilon = 1e-8
weight_decay = 0.1

w_new = [0.989, 2.008]
m_new = [0.01, -0.02]
v_new = [0.0, 0.0]

First Iteration Analysis:
First moment update: m_new = 0.9 · [0.0, 0.0] + 0.1 · [0.1, -0.2] = [0.01, -0.02]
Second moment update: v_new = 0.999 · [0.0, 0.0] + 0.001 · [0.01, 0.04] = [0.00001, 0.00004] ≈ [0.0, 0.0]
Bias correction (t=1): m_hat = m_new / (1 - 0.9) = [0.1, -0.2]; v_hat = v_new / (1 - 0.999) = [0.01, 0.04]
Parameter update with decoupled weight decay: w_new = w - 0.01 · (m_hat / (sqrt(v_hat) + 1e-8) + 0.1 · w) = [1.0 - 0.011, 2.0 + 0.008] = [0.989, 2.008]
The parameters move opposite to the gradient direction while being regularized toward zero by the weight decay term.
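The four steps above can be sketched as a Python function following the signature from the problem statement (a sketch using plain lists; a real implementation would typically vectorize with NumPy or PyTorch):

```python
import math

def adamw_update(w, g, m, v, t, lr, beta1, beta2, epsilon, weight_decay):
    """One AdamW step; returns (w_new, m_new, v_new) as new lists."""
    # Step 1: biased first moment (momentum)
    m_new = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    # Step 2: biased second moment (adaptive scaling)
    v_new = [beta2 * vi + (1 - beta2) * gi ** 2 for vi, gi in zip(v, g)]
    # Step 3: bias correction (t is 1-indexed)
    m_hat = [mi / (1 - beta1 ** t) for mi in m_new]
    v_hat = [vi / (1 - beta2 ** t) for vi in v_new]
    # Step 4: decoupled weight decay -- lambda * w is added OUTSIDE the
    # adaptive scaling, so decay strength does not depend on v_hat.
    w_new = [wi - lr * (mh / (math.sqrt(vh) + epsilon) + weight_decay * wi)
             for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w_new, m_new, v_new

# First-iteration example from above: one step from zero-initialized moments.
w_new, m_new, v_new = adamw_update(
    [1.0, 2.0], [0.1, -0.2], [0.0, 0.0], [0.0, 0.0],
    t=1, lr=0.01, beta1=0.9, beta2=0.999, epsilon=1e-8, weight_decay=0.1)
# w_new ≈ [0.989, 2.008], m_new = [0.01, -0.02]
```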
w = [0.5, -0.5, 1.0]
g = [0.2, 0.3, -0.1]
m = [0.01, -0.02, 0.03]
v = [0.001, 0.002, 0.001]
t = 5
lr = 0.001
beta1 = 0.9
beta2 = 0.999
epsilon = 1e-8
weight_decay = 0.01

w_new = [0.4998, -0.5, 0.9999]
m_new = [0.029, 0.012, 0.017]
v_new = [0.001, 0.0021, 0.001]

Mid-Training Update (t=5):
With pre-existing momentum (m) and adaptive learning rate history (v), this represents a typical mid-training scenario where the optimizer has already "warmed up."
First moment update: m_new = 0.9 · [0.01, -0.02, 0.03] + 0.1 · [0.2, 0.3, -0.1] = [0.029, 0.012, 0.017]
Second moment update: v_new = 0.999 · [0.001, 0.002, 0.001] + 0.001 · [0.04, 0.09, 0.01] ≈ [0.001, 0.0021, 0.001]
Bias correction factor at t=5: 1 - 0.9^5 ≈ 0.4095 for the first moment and 1 - 0.999^5 ≈ 0.00499 for the second, so the corrected estimates m_hat and v_hat are substantially larger than the raw running averages.
The smaller learning rate (0.001) and lower weight decay (0.01) result in more conservative parameter updates, typical of fine-tuning scenarios.
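The bias-correction factors and the resulting update can be checked numerically (a quick sketch reproducing only the first coordinate of the t=5 example; variable names are illustrative):

```python
import math

beta1, beta2, t = 0.9, 0.999, 5
bc1 = 1 - beta1 ** t   # first-moment correction factor, ≈ 0.40951
bc2 = 1 - beta2 ** t   # second-moment correction factor, ≈ 0.00499

# First coordinate: w = 0.5, g = 0.2, m_new = 0.029, v = 0.001 before the step.
m_hat = 0.029 / bc1
v_hat = (0.999 * 0.001 + 0.001 * 0.2 ** 2) / bc2
w_new = 0.5 - 0.001 * (m_hat / (math.sqrt(v_hat) + 1e-8) + 0.01 * 0.5)
# w_new ≈ 0.4998, matching the expected output above
```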
w = [1.0, -1.0, 0.5, -0.5]
g = [0.1, 0.2, -0.1, -0.2]
m = [0.0, 0.0, 0.0, 0.0]
v = [0.0, 0.0, 0.0, 0.0]
t = 1
lr = 0.01
beta1 = 0.9
beta2 = 0.999
epsilon = 1e-8
weight_decay = 0.05

w_new = [0.9895, -1.0095, 0.5097, -0.4898]
m_new = [0.01, 0.02, -0.01, -0.02]
v_new = [0.0, 0.0, 0.0, 0.0]

4-Dimensional Parameter Space:
This example demonstrates the algorithm's behavior across a higher-dimensional parameter space with mixed positive and negative values.
Notice the asymmetric updates: at t=1 with zero-initialized moments, m_hat / sqrt(v_hat) ≈ sign(g), so each gradient term contributes about ±lr = ±0.01, while the decay term lr · λ · w differs per parameter. Where the gradient step and the pull toward zero agree (w = 1.0 and w = -0.5), the total step is larger (0.0105 and 0.01025); where they oppose (w = -1.0 and w = 0.5), it is smaller (0.0095 and 0.00975).
The decoupled weight decay uniformly shrinks all parameters toward zero by a factor of (1 - lr × weight_decay) = (1 - 0.01 × 0.05) = 0.9995, independent of the gradient magnitude.
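This t=1 case can be verified with a one-liner, since with zero-initialized moments the bias-corrected estimates reduce to m_hat = g and v_hat = g², making the adaptive step ≈ sign(g) (a sketch; epsilon is neglected as its effect is far below the displayed precision):

```python
# t = 1, zero moments: update reduces to w - lr * (sign(g) + wd * w)
lr, wd = 0.01, 0.05
w = [1.0, -1.0, 0.5, -0.5]
g = [0.1, 0.2, -0.1, -0.2]
w_new = [wi - lr * ((1 if gi > 0 else -1) + wd * wi) for wi, gi in zip(w, g)]
# w_new ≈ [0.9895, -1.0095, 0.5097, -0.4898]
```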
This is the key insight of AdamW: regularization strength is consistent across all parameters, regardless of their update frequency or gradient variance.
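The contrast with coupled L2 regularization can be made concrete with a small numerical comparison (a sketch, not part of the required function; it isolates the decay contribution by taking the current gradient and momentum to be zero, so only the regularization term remains, and the names are illustrative):

```python
import math

def effective_decay(w, v_hat, lr=0.01, wd=0.1, eps=1e-8, decoupled=True):
    """Shrinkage applied to w in one step when the current gradient is zero.

    v_hat stands in for the bias-corrected second-moment history.
    """
    if decoupled:
        return lr * wd * w                          # AdamW: independent of v_hat
    # Coupled L2: the decay gradient wd*w passes through the adaptive scaling.
    return lr * (wd * w) / (math.sqrt(v_hat) + eps)

w = 1.0
rare     = effective_decay(w, v_hat=1e-4, decoupled=False)  # tiny gradient history
frequent = effective_decay(w, v_hat=4.0,  decoupled=False)  # large gradient history
adamw    = effective_decay(w, v_hat=4.0)                    # always lr * wd * w

# Coupled decay: rare ≈ 0.1, frequent = 0.0005 -- regularization varies
# inversely with sqrt(v_hat). Decoupled decay: 0.001 for every parameter.
```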
Constraints