The Adaptive Moment Estimation (AME) algorithm is one of the most powerful and widely adopted optimization techniques in modern machine learning. It elegantly combines the strengths of two earlier methods: Momentum-based Gradient Descent (which accelerates convergence by accumulating velocity in consistent gradient directions) and RMSProp (which adapts learning rates based on recent gradient magnitudes).
At its core, AME maintains two exponentially decaying moving averages for each parameter:
First Moment Estimate (m): Tracks the mean of gradients, acting as momentum to smooth out noisy updates and accelerate progress in consistent directions.
Second Moment Estimate (v): Tracks the uncentered variance of gradients, enabling per-parameter adaptive learning rates that scale inversely with the historical gradient magnitude.
The Algorithm:
For each iteration ( t ), the algorithm performs the following steps:
Compute Gradient: Calculate ( g_t = \nabla f(\theta_{t-1}) )
Update First Moment: ( m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t )
Update Second Moment: ( v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2 )
Bias Correction: Since ( m ) and ( v ) are initialized at zero, they are biased toward zero during initial iterations. Apply the corrections ( \hat{m}_t = \frac{m_t}{1 - \beta_1^t} ) and ( \hat{v}_t = \frac{v_t}{1 - \beta_2^t} )
Parameter Update: ( \theta_t = \theta_{t-1} - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} )
Where: ( \alpha ) is the step size, ( \beta_1 ) and ( \beta_2 ) are the decay rates of the first and second moment estimates, and ( \epsilon ) is a small stability constant that guards against division by zero.
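Concretely, one iteration of the steps above can be sketched as follows (a minimal illustration; the function name and the list-based parameter handling are our own choices, not prescribed by the problem):

```python
import math

def ame_step(theta, m, v, grad, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AME update; t is the 1-based iteration counter."""
    new_theta, new_m, new_v = [], [], []
    for th, mi, vi, g in zip(theta, m, v, grad):
        mi = beta1 * mi + (1 - beta1) * g       # first moment update
        vi = beta2 * vi + (1 - beta2) * g * g   # second moment update
        m_hat = mi / (1 - beta1 ** t)           # bias correction
        v_hat = vi / (1 - beta2 ** t)
        th = th - alpha * m_hat / (math.sqrt(v_hat) + eps)
        new_theta.append(th)
        new_m.append(mi)
        new_v.append(vi)
    return new_theta, new_m, new_v
```

On the very first step the bias correction cancels the ( 1 - \beta ) factors exactly, so the update magnitude is close to the full step size ( \alpha ) regardless of the gradient scale.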
Your Task:
Implement the AME optimizer that takes an objective function type and initial parameters, then iteratively applies the algorithm to find the parameter values that minimize the objective function. You must support the following function types: "sphere", "quadratic", and "booth".
Return the optimized parameter values after completing the specified number of iterations.
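One way the task could be implemented is a direct translation of the update rules above (a sketch under our own naming assumptions: `ame_optimize`, the gradient helpers, and the `GRADIENTS` dispatch table are not prescribed by the problem):

```python
import math

# Analytic gradients of the three supported objectives.
def sphere_grad(p, cfg):
    return [2.0 * x for x in p]                       # f = sum x_i^2

def quadratic_grad(p, cfg):
    c = cfg["coefficients"]
    return [2.0 * ci * x for ci, x in zip(c, p)]      # f = sum c_i * x_i^2

def booth_grad(p, cfg):
    x, y = p                                          # expanded partial derivatives
    return [10 * x + 8 * y - 34, 8 * x + 10 * y - 38]

GRADIENTS = {"sphere": sphere_grad, "quadratic": quadratic_grad, "booth": booth_grad}

def ame_optimize(func_type, initial_params, step_size, first_moment_decay,
                 second_moment_decay, stability_constant, max_iterations,
                 function_config=None):
    theta = list(initial_params)
    m = [0.0] * len(theta)   # first moment estimates
    v = [0.0] * len(theta)   # second moment estimates
    for t in range(1, max_iterations + 1):
        g = GRADIENTS[func_type](theta, function_config)
        for i, gi in enumerate(g):
            m[i] = first_moment_decay * m[i] + (1 - first_moment_decay) * gi
            v[i] = second_moment_decay * v[i] + (1 - second_moment_decay) * gi * gi
            m_hat = m[i] / (1 - first_moment_decay ** t)   # bias correction
            v_hat = v[i] / (1 - second_moment_decay ** t)
            theta[i] -= step_size * m_hat / (math.sqrt(v_hat) + stability_constant)
    return theta
```

With the inputs of the first example, `ame_optimize("sphere", [1.0, 1.0], 0.001, 0.9, 0.999, 1e-8, 1000)` should land near the sample output.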
func_type = "sphere"
initial_params = [1.0, 1.0]
step_size = 0.001
first_moment_decay = 0.9
second_moment_decay = 0.999
stability_constant = 1e-8
max_iterations = 1000

Output: [0.25766503, 0.25766503]

The sphere function f(x) = x₁² + x₂² has its global minimum at the origin [0, 0]. Starting from [1.0, 1.0], the AME optimizer progressively reduces both parameters toward zero.
With a step size of 0.001 and 1000 iterations, the optimizer makes steady progress but hasn't fully converged. The first and second moment estimates help adapt the effective learning rate for each parameter. After 1000 iterations, the parameters reach approximately [0.25766503, 0.25766503], demonstrating significant progress toward the minimum.
The symmetric result reflects the symmetry of both the objective function and the starting point, as both parameters experience identical optimization dynamics.
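A quick sanity check on these numbers is to compare the objective value at the reported final parameters against the starting point (an illustrative snippet, not part of the required solution):

```python
def sphere(p):
    # f(x) = x_1^2 + x_2^2
    return sum(x * x for x in p)

start = [1.0, 1.0]
final = [0.25766503, 0.25766503]
# the objective drops from 2.0 at the start to roughly 0.133
print(sphere(start), sphere(final))
```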
func_type = "quadratic"
initial_params = [2.0, -1.0, 0.5]
step_size = 0.01
first_moment_decay = 0.9
second_moment_decay = 0.999
stability_constant = 1e-8
max_iterations = 500
function_config = {"coefficients": [1.0, 2.0, 3.0]}

Output: [0.01126204, -0.0, -0.0]

The weighted quadratic function f(x) = 1.0·x₁² + 2.0·x₂² + 3.0·x₃² has its minimum at [0, 0, 0]. The coefficients affect gradient magnitudes: parameters with larger coefficients have steeper gradients and converge faster.
Starting from [2.0, -1.0, 0.5]:
• x₁ has coefficient 1.0 (shallowest curvature), converges slowest
• x₂ has coefficient 2.0, converges faster
• x₃ has coefficient 3.0 (steepest curvature), converges fastest
After 500 iterations with step_size = 0.01, parameters x₂ and x₃ have essentially reached zero, while x₁ ≈ 0.01126204 is still approaching the minimum. This demonstrates how AME adapts to different gradient scales across parameters.
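The scaling described above follows directly from the gradient formula ∂f/∂xᵢ = 2cᵢxᵢ; a small illustrative snippet (the helper name is our own):

```python
def quadratic_grad(params, coefficients):
    # Gradient of f(x) = sum_i c_i * x_i^2 is df/dx_i = 2 * c_i * x_i
    return [2.0 * c * x for c, x in zip(coefficients, params)]

# At equal displacement from the minimum, a larger coefficient yields a
# proportionally larger gradient:
print(quadratic_grad([1.0, 1.0, 1.0], [1.0, 2.0, 3.0]))  # [2.0, 4.0, 6.0]
```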
func_type = "booth"
initial_params = [0.0, 0.0]
step_size = 0.01
first_moment_decay = 0.9
second_moment_decay = 0.999
stability_constant = 1e-8
max_iterations = 1000

Output: [1.24813489, 2.74568331]

The Booth function f(x, y) = (x + 2y - 7)² + (2x + y - 5)² is a classic optimization benchmark with its global minimum at (x, y) = (1, 3).
Starting from the origin [0.0, 0.0], the optimizer navigates the curved loss landscape. Unlike simple quadratic functions, the Booth function has coupled gradients where each parameter's gradient depends on both variables:
• ∂f/∂x = 2(x + 2y - 7) + 4(2x + y - 5) = 10x + 8y - 34
• ∂f/∂y = 4(x + 2y - 7) + 2(2x + y - 5) = 8x + 10y - 38
After 1000 iterations, the optimizer reaches [1.24813489, 2.74568331], which is close to but not yet at the true minimum [1, 3]. The adaptive moment estimates help navigate the non-axis-aligned contours of this function efficiently.
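The expanded partial derivatives above can be checked at the known minimum, where both must vanish (an illustrative snippet; the helper name is our own):

```python
def booth_grad(x, y):
    # Expanded partials of the Booth function from above
    return (10 * x + 8 * y - 34, 8 * x + 10 * y - 38)

# The gradient is exactly zero at the global minimum (1, 3)
print(booth_grad(1.0, 3.0))  # (0.0, 0.0)
```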
Constraints