The Adaptive Moment Estimation (Adam) algorithm is one of the most widely used optimization methods in modern machine learning. It elegantly combines the strengths of two powerful gradient descent variants—momentum and adaptive learning rates—to achieve robust and efficient parameter updates.
Adam maintains two exponentially decaying moving averages for each parameter:
First Moment (m): The exponentially weighted average of past gradients, similar to momentum. This helps the optimizer build up velocity in consistent gradient directions while dampening oscillations.
Second Moment (v): The exponentially weighted average of past squared gradients, similar to RMSProp. This provides per-parameter adaptive learning rates, allowing the algorithm to take larger steps for parameters with consistently small gradients and smaller steps for those with large or highly variable gradients.
At each timestep t, given a gradient g_t, Adam performs the following computations:
Step 1 - Update Biased First Moment Estimate: $$m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$$
Step 2 - Update Biased Second Moment Estimate: $$v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$$
Step 3 - Compute Bias-Corrected First Moment: $$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$
Step 4 - Compute Bias-Corrected Second Moment: $$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
Step 5 - Update Parameters: $$\theta_t = \theta_{t-1} - \frac{\alpha \cdot \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
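The five steps above translate directly into code. Below is a minimal sketch of a single Adam step; the function name `adam_step` and its signature are illustrative, and the default hyperparameters are the commonly used values (α = 0.001, β₁ = 0.9, β₂ = 0.999, ε = 1e-8):

```python
import math

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter, following steps 1-5."""
    m = beta1 * m + (1 - beta1) * grad           # Step 1: biased first moment
    v = beta2 * v + (1 - beta2) * grad ** 2      # Step 2: biased second moment
    m_hat = m / (1 - beta1 ** t)                 # Step 3: bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                 # Step 4: bias-corrected second moment
    theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)  # Step 5: update
    return theta, m, v
```

Note that the raw moments `m` and `v` are what get carried to the next timestep; the bias-corrected values are only used inside the update.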
The bias correction in steps 3 and 4 is crucial. Since m and v are initialized to zero, they are biased towards zero during the early timesteps. The correction factors (1 - β₁ᵗ) and (1 - β₂ᵗ) counteract this initialization bias, ensuring accurate moment estimates even at the beginning of training.
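A small illustration of why the correction matters (the constant gradient g = 0.5 is an assumed toy value): the raw first moment after one step is only (1 − 0.9) × 0.5 = 0.05, a tenfold underestimate of the true average gradient, while the corrected estimate recovers 0.5 exactly at every timestep.

```python
beta1, g = 0.9, 0.5
m = 0.0
for t in range(1, 4):
    m = beta1 * m + (1 - beta1) * g   # raw moment: 0.05, 0.095, 0.1355, ...
    m_hat = m / (1 - beta1 ** t)      # corrected moment: 0.5 every step
    print(t, round(m, 4), round(m_hat, 4))
```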
The original Adam paper (Kingma & Ba, 2014) recommends the defaults α = 0.001, β₁ = 0.9, β₂ = 0.999, and ε = 10⁻⁸; these values are used throughout the worked examples below.
Implement a function that performs a single Adam optimization step. Your function should return the updated parameter together with the updated raw (bias-uncorrected) moment estimates m and v.
Example 1
Input: parameter = 1.0, grad = 0.1, m = 0.0, v = 0.0, t = 1
Output: (0.999, 0.01, 1e-05)
Initial Setup: Starting with parameter θ = 1.0, gradient g = 0.1, and zero-initialized moments at timestep t = 1.
Step 1 - First Moment Update: m₁ = 0.9 × 0.0 + 0.1 × 0.1 = 0.01
Step 2 - Second Moment Update: v₁ = 0.999 × 0.0 + 0.001 × (0.1)² = 0.001 × 0.01 = 0.00001 = 1e-05
Step 3 - Bias-Corrected First Moment: m̂₁ = 0.01 / (1 - 0.9¹) = 0.01 / 0.1 = 0.1
Step 4 - Bias-Corrected Second Moment: v̂₁ = 1e-05 / (1 - 0.999¹) = 1e-05 / 0.001 = 0.01
Step 5 - Parameter Update: θ₁ = 1.0 - (0.001 × 0.1) / (√0.01 + 1e-08) = 1.0 - 0.0001 / 0.10000001 ≈ 0.999
The function returns (0.999, 0.01, 1e-05) representing the updated parameter and raw moment estimates.
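The walkthrough above can be checked numerically step by step (the standard default hyperparameters α = 0.001, β₁ = 0.9, β₂ = 0.999, ε = 1e-8 are assumed):

```python
import math

alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
theta, g, t = 1.0, 0.1, 1
m = beta1 * 0.0 + (1 - beta1) * g          # Step 1: 0.01
v = beta2 * 0.0 + (1 - beta2) * g ** 2     # Step 2: 1e-05
m_hat = m / (1 - beta1 ** t)               # Step 3: 0.1
v_hat = v / (1 - beta2 ** t)               # Step 4: 0.01
theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)  # Step 5
print(round(theta, 3), round(m, 6), round(v, 8))  # prints: 0.999 0.01 1e-05
```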
Example 2
Input: parameter = 2.0, grad = 0.5, m = 0.05, v = 0.001, t = 2
Output: (1.999, 0.095, 0.00125)
Continuing Optimization: At timestep t = 2 with existing moment estimates from a previous iteration.
Step 1 - First Moment Update: m₂ = 0.9 × 0.05 + 0.1 × 0.5 = 0.045 + 0.05 = 0.095
Step 2 - Second Moment Update: v₂ = 0.999 × 0.001 + 0.001 × (0.5)² = 0.000999 + 0.00025 = 0.001249 ≈ 0.00125
Step 3 - Bias-Corrected First Moment: m̂₂ = 0.095 / (1 - 0.9²) = 0.095 / 0.19 = 0.5
Step 4 - Bias-Corrected Second Moment: v̂₂ = 0.00125 / (1 - 0.999²) = 0.00125 / 0.001999 ≈ 0.6253
Step 5 - Parameter Update: θ₂ = 2.0 - (0.001 × 0.5) / (√0.6253 + 1e-08) = 2.0 - 0.0005 / 0.7908 ≈ 1.999
The accumulated momentum from previous gradients influences the current update direction.
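The t = 2 case can be verified the same way; the carried-over moments m = 0.05 and v = 0.001 are what inject the accumulated momentum into this update (default hyperparameters assumed as before):

```python
import math

alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
theta, g, m, v, t = 2.0, 0.5, 0.05, 0.001, 2
m = beta1 * m + (1 - beta1) * g            # Step 1: 0.095
v = beta2 * v + (1 - beta2) * g ** 2       # Step 2: ~0.001249
m_hat = m / (1 - beta1 ** t)               # Step 3: 0.5
v_hat = v / (1 - beta2 ** t)               # Step 4: ~0.6253
theta -= alpha * m_hat / (math.sqrt(v_hat) + eps)  # Step 5: ~1.999
```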
Example 3
Input: parameter = [1.0, 2.0, 3.0], grad = [0.1, 0.2, 0.3], m = [0.0, 0.0, 0.0], v = [0.0, 0.0, 0.0], t = 1
Output: ([0.999, 1.999, 2.999], [0.01, 0.02, 0.03], [1e-05, 4e-05, 9e-05])
Vector Optimization: Adam applied element-wise to a 3-dimensional parameter vector.
Each parameter is updated independently using its corresponding gradient:
For θ₁ = 1.0, g₁ = 0.1: m₁ = 0.01, v₁ = 1e-05, m̂₁ = 0.1, v̂₁ = 0.01, so θ₁ ← 1.0 - 0.001 × 0.1 / (√0.01 + 1e-08) ≈ 0.999
For θ₂ = 2.0, g₂ = 0.2: m₂ = 0.02, v₂ = 4e-05, m̂₂ = 0.2, v̂₂ = 0.04, so θ₂ ← 2.0 - 0.001 × 0.2 / (√0.04 + 1e-08) ≈ 1.999
For θ₃ = 3.0, g₃ = 0.3: m₃ = 0.03, v₃ = 9e-05, m̂₃ = 0.3, v̂₃ = 0.09, so θ₃ ← 3.0 - 0.001 × 0.3 / (√0.09 + 1e-08) ≈ 2.999
This demonstrates how Adam handles vector parameters—the optimization is performed element-wise, maintaining separate moment estimates for each parameter dimension.
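A vectorized sketch of the element-wise update, written with NumPy so every operation applies across the whole parameter array at once (the function name `adam_step` and the defaults are illustrative, not prescribed):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update applied element-wise to array-valued parameters."""
    m = beta1 * m + (1 - beta1) * grad        # per-element first moment
    v = beta2 * v + (1 - beta2) * grad ** 2   # per-element second moment
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return theta - alpha * m_hat / (np.sqrt(v_hat) + eps), m, v

theta, m, v = adam_step(np.array([1.0, 2.0, 3.0]), np.array([0.1, 0.2, 0.3]),
                        np.zeros(3), np.zeros(3), t=1)
```

Because NumPy broadcasts the arithmetic, no explicit loop over dimensions is needed; each element keeps its own moment estimates automatically.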
Constraints