The Adaptive Moment Estimation (AME) algorithm is one of the most powerful and widely adopted optimization techniques in modern machine learning. It elegantly combines the strengths of two earlier methods: Momentum-based Gradient Descent (which accelerates convergence by accumulating velocity in consistent gradient directions) and RMSProp (which adapts learning rates based on recent gradient magnitudes).
At its core, AME maintains two exponentially decaying moving averages for each parameter:
First Moment Estimate (m): Tracks the mean of gradients, acting as momentum to smooth out noisy updates and accelerate progress in consistent directions.
Second Moment Estimate (v): Tracks the uncentered variance of gradients, enabling per-parameter adaptive learning rates that scale inversely with the historical gradient magnitude.
The Algorithm:
For each iteration ( t ), the algorithm performs the following steps:
Compute Gradient: Calculate ( g_t = \nabla f(\theta_{t-1}) )
Update First Moment: ( m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t )
Update Second Moment: ( v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2 )
Bias Correction: Since ( m ) and ( v ) are initialized at zero, they are biased toward zero during initial iterations. Apply the corrections ( \hat{m}_t = \frac{m_t}{1 - \beta_1^t} ) and ( \hat{v}_t = \frac{v_t}{1 - \beta_2^t} )
Parameter Update: ( \theta_t = \theta_{t-1} - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} )
Where: ( \alpha ) is the step size, ( \beta_1 ) and ( \beta_2 ) are the decay rates of the first and second moment estimates, and ( \epsilon ) is a small stability constant that guards against division by zero.
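Concretely, one iteration of the steps above can be sketched as follows (a minimal illustration; the function name and the list-based parameter handling are our own choices, not prescribed by the problem):

```python
import math

def ame_step(theta, m, v, grad, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AME update; t is the 1-based iteration counter."""
    new_theta, new_m, new_v = [], [], []
    for th, mi, vi, g in zip(theta, m, v, grad):
        mi = beta1 * mi + (1 - beta1) * g       # first moment update
        vi = beta2 * vi + (1 - beta2) * g * g   # second moment update
        m_hat = mi / (1 - beta1 ** t)           # bias correction
        v_hat = vi / (1 - beta2 ** t)
        th = th - alpha * m_hat / (math.sqrt(v_hat) + eps)
        new_theta.append(th)
        new_m.append(mi)
        new_v.append(vi)
    return new_theta, new_m, new_v
```

On the very first step the bias correction cancels the ( 1 - \beta ) factors exactly, so the update magnitude is close to the full step size ( \alpha ) regardless of the gradient scale.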
Your Task:
Implement the AME optimizer that takes an objective function type and initial parameters, then iteratively applies the algorithm to find the parameter values that minimize the objective function. You must support the following function types: "sphere", "quadratic", and "booth".
Return the optimized parameter values after completing the specified number of iterations.
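One way the task could be implemented is a direct translation of the update rules above (a sketch under our own naming assumptions: `ame_optimize`, the gradient helpers, and the `GRADIENTS` dispatch table are not prescribed by the problem):

```python
import math

# Analytic gradients of the three supported objectives.
def sphere_grad(p, cfg):
    return [2.0 * x for x in p]                       # f = sum x_i^2

def quadratic_grad(p, cfg):
    c = cfg["coefficients"]
    return [2.0 * ci * x for ci, x in zip(c, p)]      # f = sum c_i * x_i^2

def booth_grad(p, cfg):
    x, y = p                                          # expanded partial derivatives
    return [10 * x + 8 * y - 34, 8 * x + 10 * y - 38]

GRADIENTS = {"sphere": sphere_grad, "quadratic": quadratic_grad, "booth": booth_grad}

def ame_optimize(func_type, initial_params, step_size, first_moment_decay,
                 second_moment_decay, stability_constant, max_iterations,
                 function_config=None):
    theta = list(initial_params)
    m = [0.0] * len(theta)   # first moment estimates
    v = [0.0] * len(theta)   # second moment estimates
    for t in range(1, max_iterations + 1):
        g = GRADIENTS[func_type](theta, function_config)
        for i, gi in enumerate(g):
            m[i] = first_moment_decay * m[i] + (1 - first_moment_decay) * gi
            v[i] = second_moment_decay * v[i] + (1 - second_moment_decay) * gi * gi
            m_hat = m[i] / (1 - first_moment_decay ** t)   # bias correction
            v_hat = v[i] / (1 - second_moment_decay ** t)
            theta[i] -= step_size * m_hat / (math.sqrt(v_hat) + stability_constant)
    return theta
```

With the inputs of the first example, `ame_optimize("sphere", [1.0, 1.0], 0.001, 0.9, 0.999, 1e-8, 1000)` should land near the sample output.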
func_type = "sphere"
initial_params = [1.0, 1.0]
step_size = 0.001
first_moment_decay = 0.9
second_moment_decay = 0.999
stability_constant = 1e-8
max_iterations = 1000

Output: [0.25766503, 0.25766503]

The sphere function f(x) = x₁² + x₂² has its global minimum at the origin [0, 0]. Starting from [1.0, 1.0], the AME optimizer progressively reduces both parameters toward zero.
With a step size of 0.001 and 1000 iterations, the optimizer makes steady progress but hasn't fully converged. The first and second moment estimates help adapt the effective learning rate for each parameter. After 1000 iterations, the parameters reach approximately [0.25766503, 0.25766503], demonstrating significant progress toward the minimum.
The symmetric result reflects the symmetry of both the objective function and the starting point, as both parameters experience identical optimization dynamics.
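A quick sanity check on these numbers is to compare the objective value at the reported final parameters against the starting point (an illustrative snippet, not part of the required solution):

```python
def sphere(p):
    # f(x) = x_1^2 + x_2^2
    return sum(x * x for x in p)

start = [1.0, 1.0]
final = [0.25766503, 0.25766503]
# the objective drops from 2.0 at the start to roughly 0.133
print(sphere(start), sphere(final))
```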
func_type = "quadratic"
initial_params = [2.0, -1.0, 0.5]
step_size = 0.01
first_moment_decay = 0.9
second_moment_decay = 0.999
stability_constant = 1e-8
max_iterations = 500
function_config = {"coefficients": [1.0, 2.0, 3.0]}

Output: [0.01126204, -0.0, -0.0]

The weighted quadratic function f(x) = 1.0·x₁² + 2.0·x₂² + 3.0·x₃² has its minimum at [0, 0, 0]. The coefficients affect gradient magnitudes: parameters with larger coefficients have steeper gradients and converge faster.
Starting from [2.0, -1.0, 0.5]:
• x₁ has coefficient 1.0 (shallowest curvature), converges slowest
• x₂ has coefficient 2.0, converges faster
• x₃ has coefficient 3.0 (steepest curvature), converges fastest
After 500 iterations with step_size = 0.01, parameters x₂ and x₃ have essentially reached zero, while x₁ ≈ 0.01126204 is still approaching the minimum. This demonstrates how AME adapts to different gradient scales across parameters.
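The scaling described above follows directly from the gradient formula ∂f/∂xᵢ = 2cᵢxᵢ; a small illustrative snippet (the helper name is our own):

```python
def quadratic_grad(params, coefficients):
    # Gradient of f(x) = sum_i c_i * x_i^2 is df/dx_i = 2 * c_i * x_i
    return [2.0 * c * x for c, x in zip(coefficients, params)]

# At equal displacement from the minimum, a larger coefficient yields a
# proportionally larger gradient:
print(quadratic_grad([1.0, 1.0, 1.0], [1.0, 2.0, 3.0]))  # [2.0, 4.0, 6.0]
```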
func_type = "booth"
initial_params = [0.0, 0.0]
step_size = 0.01
first_moment_decay = 0.9
second_moment_decay = 0.999
stability_constant = 1e-8
max_iterations = 1000

Output: [1.24813489, 2.74568331]

The Booth function f(x, y) = (x + 2y - 7)² + (2x + y - 5)² is a classic optimization benchmark with its global minimum at (x, y) = (1, 3).
Starting from the origin [0.0, 0.0], the optimizer navigates the curved loss landscape. Unlike simple quadratic functions, the Booth function has coupled gradients where each parameter's gradient depends on both variables:
• ∂f/∂x = 2(x + 2y - 7) + 4(2x + y - 5) = 10x + 8y - 34
• ∂f/∂y = 4(x + 2y - 7) + 2(2x + y - 5) = 8x + 10y - 38
After 1000 iterations, the optimizer reaches [1.24813489, 2.74568331], which is close to but not yet at the true minimum [1, 3]. The adaptive moment estimates help navigate the non-axis-aligned contours of this function efficiently.
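The expanded partial derivatives above can be checked at the known minimum, where both must vanish (an illustrative snippet; the helper name is our own):

```python
def booth_grad(x, y):
    # Expanded partials of the Booth function from above
    return (10 * x + 8 * y - 34, 8 * x + 10 * y - 38)

# The gradient is exactly zero at the global minimum (1, 3)
print(booth_grad(1.0, 3.0))  # (0.0, 0.0)
```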
Constraints