The Adamax algorithm is a powerful variant of the Adam optimizer that leverages the infinity norm (L∞ norm) instead of the second moment estimate (L² norm) used in Adam. This modification often provides more stable training dynamics, particularly when gradients exhibit extreme values or heavy-tailed distributions.
Adamax was introduced by Kingma and Ba in 2015 as part of their seminal work on Adam. While Adam maintains an exponentially decaying average of squared gradients, Adamax tracks the maximum absolute gradient observed so far, which can be viewed as the limit of the Lp norm as p approaches infinity.
Given a parameter θ and its gradient gₜ at timestep t, the Adamax algorithm maintains two running statistics:
1. First Moment Estimate (Momentum): $$m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$$
This is identical to Adam and represents the exponentially weighted moving average of gradients.
2. Infinity Norm (Maximum Gradient Tracking): $$u_t = \max(\beta_2 \cdot u_{t-1}, |g_t|)$$
Instead of squaring gradients, Adamax takes the maximum of the decayed previous value and the current absolute gradient. This is the key difference from Adam.
3. Bias-Corrected Update: The final parameter update uses bias correction only for the first moment: $$\theta_t = \theta_{t-1} - \frac{\alpha}{u_t + \epsilon} \cdot \frac{m_t}{1 - \beta_1^t}$$
Where:
- α is the learning rate (step size)
- β₁ and β₂ are the exponential decay rates for the two statistics
- ε is a small constant added for numerical stability
- m₀ and u₀ are initialized to zero
Implement the Adamax update step function that:
- takes the current parameter, its gradient, the first moment m, the infinity norm u, and the timestep t;
- applies the three update rules above;
- returns the updated (parameter, m, u) values.
Default Hyperparameters: α = 0.002, β₁ = 0.9, β₂ = 0.999, ε = 1e-8
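With these defaults, the update rules above can be sketched as a small Python function (the name `adamax_step` and the exact signature are illustrative assumptions, not given by the problem):

```python
def adamax_step(parameter, grad, m, u, t,
                alpha=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adamax update step; returns the updated (parameter, m, u)."""
    # First moment: exponentially weighted average of gradients (same as Adam)
    m = beta1 * m + (1 - beta1) * grad
    # Infinity norm: max of the decayed previous value and the current |gradient|
    u = max(beta2 * u, abs(grad))
    # Bias-corrected update (only the first moment needs correction)
    parameter = parameter - (alpha / (u + eps)) * m / (1 - beta1 ** t)
    return parameter, m, u
```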
Example:
Input: parameter = 1.0, grad = 0.1, m = 0.0, u = 0.0, t = 1
Output: (0.998, 0.01, 0.1)
Explanation (step-by-step computation with initial states):
Given: parameter = 1.0, grad = 0.1, m = 0.0, u = 0.0, t = 1 Hyperparameters: α = 0.002, β₁ = 0.9, β₂ = 0.999, ε = 1e-8
Step 1: Update first moment estimate (m)
m₁ = β₁ · m₀ + (1 - β₁) · g₁ = 0.9 × 0.0 + 0.1 × 0.1 = 0.01

Step 2: Update infinity norm (u)
u₁ = max(β₂ · u₀, |g₁|) = max(0.999 × 0.0, |0.1|) = max(0, 0.1) = 0.1

Step 3: Compute bias-corrected first moment
m̂₁ = m₁ / (1 - β₁ᵗ) = 0.01 / (1 - 0.9¹) = 0.01 / 0.1 = 0.1

Step 4: Update parameter
θ₁ = θ₀ - (α / (u₁ + ε)) · m̂₁ = 1.0 - (0.002 / (0.1 + 1e-8)) × 0.1 = 1.0 - 0.002 = 0.998
Result: (0.998, 0.01, 0.1)
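The four steps above can be reproduced directly as a short script (variable names are illustrative):

```python
# Defaults and the example's initial state
beta1, beta2, alpha, eps = 0.9, 0.999, 0.002, 1e-8
theta, g, m, u, t = 1.0, 0.1, 0.0, 0.0, 1

m = beta1 * m + (1 - beta1) * g        # Step 1: 0.01
u = max(beta2 * u, abs(g))             # Step 2: 0.1
m_hat = m / (1 - beta1 ** t)           # Step 3: 0.1
theta -= (alpha / (u + eps)) * m_hat   # Step 4
print(round(theta, 6))                 # 0.998
```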
Example:
Input: parameter = [1, 2, 3], grad = [0.1, 0.2, 0.3], m = [0, 0, 0], u = [0, 0, 0], t = 1
Output: ([0.998, 1.998, 2.998], [0.01, 0.02, 0.03], [0.1, 0.2, 0.3])
Explanation (element-wise computation for array inputs):
The Adamax algorithm applies the same update rules element-wise across all dimensions. With three parameters:
For parameter[0] = 1, grad[0] = 0.1: m = 0.01, u = 0.1, update = (0.002 / 0.1) × (0.01 / 0.1) = 0.002, so θ = 0.998
For parameter[1] = 2, grad[1] = 0.2: m = 0.02, u = 0.2, update = (0.002 / 0.2) × (0.02 / 0.1) = 0.002, so θ = 1.998
For parameter[2] = 3, grad[2] = 0.3: m = 0.03, u = 0.3, update = (0.002 / 0.3) × (0.03 / 0.1) = 0.002, so θ = 2.998
Result: ([0.998, 1.998, 2.998], [0.01, 0.02, 0.03], [0.1, 0.2, 0.3])
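For array inputs, the same rules apply element-wise; a NumPy sketch (the function name is an assumption for illustration):

```python
import numpy as np

def adamax_step(parameter, grad, m, u, t,
                alpha=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """Vectorized Adamax step; broadcasting applies the update element-wise."""
    parameter, grad, m, u = map(np.asarray, (parameter, grad, m, u))
    m = beta1 * m + (1 - beta1) * grad
    u = np.maximum(beta2 * u, np.abs(grad))   # element-wise max
    parameter = parameter - (alpha / (u + eps)) * m / (1 - beta1 ** t)
    return parameter, m, u

p, m, u = adamax_step([1, 2, 3], [0.1, 0.2, 0.3], [0, 0, 0], [0, 0, 0], 1)
```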
Example:
Input: parameter = 0.5, grad = 0.05, m = 0.02, u = 0.15, t = 5
Output: (0.49925, 0.023, 0.14985)
Explanation (computation with non-zero history at a later timestep):
Given: parameter = 0.5, grad = 0.05, m = 0.02, u = 0.15, t = 5 Hyperparameters: α = 0.002, β₁ = 0.9, β₂ = 0.999, ε = 1e-8
Step 1: Update first moment estimate
m₅ = 0.9 × 0.02 + 0.1 × 0.05 = 0.018 + 0.005 = 0.023

Step 2: Update infinity norm
u₅ = max(0.999 × 0.15, |0.05|) = max(0.14985, 0.05) = 0.14985

Here, the decayed previous u value (0.14985) exceeds the current gradient magnitude (0.05), so the infinity norm decays slightly.

Step 3: Compute bias correction factor
1 - β₁⁵ = 1 - 0.9⁵ = 1 - 0.59049 = 0.40951

Step 4: Compute bias-corrected first moment
m̂₅ = 0.023 / 0.40951 ≈ 0.0562

Step 5: Update parameter
θ₅ = 0.5 - (0.002 / 0.14985) × 0.0562 ≈ 0.5 - 0.00075 = 0.49925
Result: (0.49925, 0.023, 0.14985)
Note how the infinity norm slowly decays when gradients become smaller, allowing for finer parameter updates as training converges.
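This decaying-norm behavior can be checked numerically with a few lines (variable names are illustrative):

```python
# Defaults and the example's non-zero starting state
beta1, beta2, alpha, eps = 0.9, 0.999, 0.002, 1e-8
theta, g, m, u, t = 0.5, 0.05, 0.02, 0.15, 5

m = beta1 * m + (1 - beta1) * g        # 0.023
u = max(beta2 * u, abs(g))             # 0.14985: the decayed u wins over |g| = 0.05
m_hat = m / (1 - beta1 ** t)           # ≈ 0.0562
theta -= (alpha / (u + eps)) * m_hat
print(round(theta, 5))                 # 0.49925
```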
Constraints