The Adamax algorithm is a powerful variant of the Adam optimizer that leverages the infinity norm (L∞ norm) instead of the second moment estimate (L² norm) used in Adam. This modification often provides more stable training dynamics, particularly when gradients exhibit extreme values or heavy-tailed distributions.
Adamax was introduced by Kingma and Ba in 2015 as part of their seminal work on Adam. While Adam maintains an exponentially decaying average of squared gradients, Adamax tracks the maximum absolute gradient observed so far, which can be viewed as the limit of the Lp norm as p approaches infinity.
Given a parameter θ and its gradient gₜ at timestep t, the Adamax algorithm maintains two running statistics:
1. First Moment Estimate (Momentum): $$m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$$
This is identical to Adam and represents the exponentially weighted moving average of gradients.
2. Infinity Norm (Maximum Gradient Tracking): $$u_t = \max(\beta_2 \cdot u_{t-1}, |g_t|)$$
Instead of squaring gradients, Adamax takes the maximum of the decayed previous value and the current absolute gradient. This is the key difference from Adam.
3. Bias-Corrected Update: The final parameter update uses bias correction only for the first moment: $$\theta_t = \theta_{t-1} - \frac{\alpha}{u_t + \epsilon} \cdot \frac{m_t}{1 - \beta_1^t}$$
Where:
- α is the learning rate (step size)
- β₁ and β₂ are the exponential decay rates for the two statistics
- ε is a small constant added for numerical stability
- m₀ and u₀ are initialized to zero
Implement the Adamax update step function that:
- takes the current parameter, its gradient, the first moment m, the infinity norm u, and the timestep t;
- applies the three update rules above;
- returns the updated (parameter, m, u) values.
Default Hyperparameters: α = 0.002, β₁ = 0.9, β₂ = 0.999, ε = 1e-8
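With these defaults, the update rules above can be sketched as a small Python function (the name `adamax_step` and the exact signature are illustrative assumptions, not given by the problem):

```python
def adamax_step(parameter, grad, m, u, t,
                alpha=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adamax update step; returns the updated (parameter, m, u)."""
    # First moment: exponentially weighted average of gradients (same as Adam)
    m = beta1 * m + (1 - beta1) * grad
    # Infinity norm: max of the decayed previous value and the current |gradient|
    u = max(beta2 * u, abs(grad))
    # Bias-corrected update (only the first moment needs correction)
    parameter = parameter - (alpha / (u + eps)) * m / (1 - beta1 ** t)
    return parameter, m, u
```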
Example:
Input: parameter = 1.0, grad = 0.1, m = 0.0, u = 0.0, t = 1
Output: (0.998, 0.01, 0.1)
Explanation (step-by-step computation with initial states):
Given: parameter = 1.0, grad = 0.1, m = 0.0, u = 0.0, t = 1 Hyperparameters: α = 0.002, β₁ = 0.9, β₂ = 0.999, ε = 1e-8
Step 1: Update first moment estimate (m)
m₁ = β₁ · m₀ + (1 - β₁) · g₁ = 0.9 × 0.0 + 0.1 × 0.1 = 0.01

Step 2: Update infinity norm (u)
u₁ = max(β₂ · u₀, |g₁|) = max(0.999 × 0.0, |0.1|) = max(0, 0.1) = 0.1

Step 3: Compute bias-corrected first moment
m̂₁ = m₁ / (1 - β₁ᵗ) = 0.01 / (1 - 0.9¹) = 0.01 / 0.1 = 0.1

Step 4: Update parameter
θ₁ = θ₀ - (α / (u₁ + ε)) · m̂₁ = 1.0 - (0.002 / (0.1 + 1e-8)) × 0.1 = 1.0 - 0.002 = 0.998
Result: (0.998, 0.01, 0.1)
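The four steps above can be reproduced directly as a short script (variable names are illustrative):

```python
# Defaults and the example's initial state
beta1, beta2, alpha, eps = 0.9, 0.999, 0.002, 1e-8
theta, g, m, u, t = 1.0, 0.1, 0.0, 0.0, 1

m = beta1 * m + (1 - beta1) * g        # Step 1: 0.01
u = max(beta2 * u, abs(g))             # Step 2: 0.1
m_hat = m / (1 - beta1 ** t)           # Step 3: 0.1
theta -= (alpha / (u + eps)) * m_hat   # Step 4
print(round(theta, 6))                 # 0.998
```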
Example:
Input: parameter = [1, 2, 3], grad = [0.1, 0.2, 0.3], m = [0, 0, 0], u = [0, 0, 0], t = 1
Output: ([0.998, 1.998, 2.998], [0.01, 0.02, 0.03], [0.1, 0.2, 0.3])
Explanation (element-wise computation for array inputs):
The Adamax algorithm applies the same update rules element-wise across all dimensions. With three parameters:
For parameter[0] = 1, grad[0] = 0.1: m = 0.01, u = 0.1, update = (0.002 / 0.1) × (0.01 / 0.1) = 0.002, so θ = 0.998
For parameter[1] = 2, grad[1] = 0.2: m = 0.02, u = 0.2, update = (0.002 / 0.2) × (0.02 / 0.1) = 0.002, so θ = 1.998
For parameter[2] = 3, grad[2] = 0.3: m = 0.03, u = 0.3, update = (0.002 / 0.3) × (0.03 / 0.1) = 0.002, so θ = 2.998
Result: ([0.998, 1.998, 2.998], [0.01, 0.02, 0.03], [0.1, 0.2, 0.3])
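For array inputs, the same rules apply element-wise; a NumPy sketch (the function name is an assumption for illustration):

```python
import numpy as np

def adamax_step(parameter, grad, m, u, t,
                alpha=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """Vectorized Adamax step; broadcasting applies the update element-wise."""
    parameter, grad, m, u = map(np.asarray, (parameter, grad, m, u))
    m = beta1 * m + (1 - beta1) * grad
    u = np.maximum(beta2 * u, np.abs(grad))   # element-wise max
    parameter = parameter - (alpha / (u + eps)) * m / (1 - beta1 ** t)
    return parameter, m, u

p, m, u = adamax_step([1, 2, 3], [0.1, 0.2, 0.3], [0, 0, 0], [0, 0, 0], 1)
```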
Example:
Input: parameter = 0.5, grad = 0.05, m = 0.02, u = 0.15, t = 5
Output: (0.49925, 0.023, 0.14985)
Explanation (computation with non-zero history at a later timestep):
Given: parameter = 0.5, grad = 0.05, m = 0.02, u = 0.15, t = 5 Hyperparameters: α = 0.002, β₁ = 0.9, β₂ = 0.999, ε = 1e-8
Step 1: Update first moment estimate
m₅ = 0.9 × 0.02 + 0.1 × 0.05 = 0.018 + 0.005 = 0.023

Step 2: Update infinity norm
u₅ = max(0.999 × 0.15, |0.05|) = max(0.14985, 0.05) = 0.14985

Here, the decayed previous u value (0.14985) exceeds the current gradient magnitude (0.05), so the infinity norm decays slightly.

Step 3: Compute bias correction factor
1 - β₁⁵ = 1 - 0.9⁵ = 1 - 0.59049 = 0.40951

Step 4: Compute bias-corrected first moment
m̂₅ = 0.023 / 0.40951 ≈ 0.0562

Step 5: Update parameter
θ₅ = 0.5 - (0.002 / 0.14985) × 0.0562 ≈ 0.5 - 0.00075 = 0.49925
Result: (0.49925, 0.023, 0.14985)
Note how the infinity norm slowly decays when gradients become smaller, allowing for finer parameter updates as training converges.
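This decaying-norm behavior can be checked numerically with a few lines (variable names are illustrative):

```python
# Defaults and the example's non-zero starting state
beta1, beta2, alpha, eps = 0.9, 0.999, 0.002, 1e-8
theta, g, m, u, t = 0.5, 0.05, 0.02, 0.15, 5

m = beta1 * m + (1 - beta1) * g        # 0.023
u = max(beta2 * u, abs(g))             # 0.14985: the decayed u wins over |g| = 0.05
m_hat = m / (1 - beta1 ** t)           # ≈ 0.0562
theta -= (alpha / (u + eps)) * m_hat
print(round(theta, 5))                 # 0.49925
```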
Constraints