In gradient descent optimization, momentum is a powerful enhancement that accelerates convergence and helps navigate challenging loss landscapes. Instead of relying solely on the current gradient, momentum maintains an exponentially decaying accumulation of past gradients, creating a "velocity" that carries the optimization forward along its accumulated direction.
Conceptual Foundation: Imagine rolling a ball down a hilly terrain to find the lowest point. Standard gradient descent moves the ball based only on the current slope, which can be slow in flat regions and erratic in steep, oscillating valleys. Momentum gives the ball inertia—it remembers its previous direction and speed, allowing it to build up velocity in consistent directions while dampening oscillations in inconsistent directions.
Mathematical Formulation: The momentum update rule consists of two interconnected equations:
Velocity Update: $$v_{t+1} = \gamma \cdot v_t + \alpha \cdot g_t$$
Parameter Update: $$\theta_{t+1} = \theta_t - v_{t+1}$$
Where: • θ_t: the parameter value at step t • g_t: the gradient of the loss at step t • v_t: the accumulated velocity at step t • α: the learning rate • γ: the momentum coefficient, typically 0.9
Key Benefits: • Builds up speed along directions where successive gradients agree, accelerating progress in flat regions • Dampens oscillations in steep, narrow valleys where gradients alternate direction • Accumulated velocity can carry the parameters past small bumps in the loss surface
Your Task: Implement a function that performs a single momentum-based gradient descent update step. The function should: • Accept the current parameter(s), gradient(s), and velocity, plus a learning rate α (default 0.01) and momentum coefficient γ (default 0.9) • Compute the new velocity v_new = γ × v + α × g • Return the updated parameter θ - v_new together with v_new • Handle scalars and arrays alike, applying the update element-wise
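One way the task above could be sketched in Python (the function name `momentum_update` and the NumPy-based handling of scalars and arrays are assumptions of this sketch, not a prescribed signature):

```python
import numpy as np

def momentum_update(parameter, grad, velocity, lr=0.01, momentum=0.9):
    """Perform one momentum-based gradient descent step.

    Works element-wise for scalars and arrays alike.
    Returns (updated_parameter, new_velocity).
    """
    parameter = np.asarray(parameter, dtype=float)
    grad = np.asarray(grad, dtype=float)
    velocity = np.asarray(velocity, dtype=float)

    # Velocity update: v_{t+1} = gamma * v_t + alpha * g_t
    new_velocity = momentum * velocity + lr * grad
    # Parameter update: theta_{t+1} = theta_t - v_{t+1}
    new_parameter = parameter - new_velocity
    return new_parameter, new_velocity
```

Converting inputs with `np.asarray` lets the same two lines of arithmetic cover both the scalar and the array cases.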
Example 1: parameter = 1.0, grad = 0.1, velocity = 0.1 → expected output: (0.909, 0.091)
Let's trace through the momentum update step-by-step:
Given values: • Current parameter θ = 1.0 • Current gradient g = 0.1 • Current velocity v = 0.1 • Learning rate α = 0.01 (default) • Momentum coefficient γ = 0.9 (default)
Step 1: Compute the new velocity:
v_new = γ × v + α × g = 0.9 × 0.1 + 0.01 × 0.1 = 0.09 + 0.001 = 0.091
Step 2: Update the parameter:
θ_new = θ - v_new = 1.0 - 0.091 = 0.909
The function returns (0.909, 0.091) representing the updated parameter and new velocity.
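The arithmetic of this trace can be checked directly in plain Python (no helper function assumed):

```python
gamma, alpha = 0.9, 0.01          # momentum coefficient and learning rate
theta, g, v = 1.0, 0.1, 0.1       # current parameter, gradient, velocity

v_new = gamma * v + alpha * g     # 0.09 + 0.001 = 0.091
theta_new = theta - v_new         # 1.0 - 0.091 = 0.909

print(round(theta_new, 3), round(v_new, 3))  # prints: 0.909 0.091
```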
Example 2: parameter = 2.0, grad = 0.5, velocity = 0.0 → expected output: (1.995, 0.005)
This example shows the first iteration when starting from zero velocity:
Given values: • Current parameter θ = 2.0 • Current gradient g = 0.5 • Current velocity v = 0.0 (initial state)
Step 1: Compute the new velocity:
v_new = 0.9 × 0.0 + 0.01 × 0.5 = 0 + 0.005 = 0.005
Step 2: Update parameter θ_new = 2.0 - 0.005 = 1.995
With zero initial velocity, the first update is entirely determined by the scaled gradient. Subsequent iterations will accumulate momentum.
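To see that accumulation concretely, a small sketch repeating the update a few times with a constant gradient (the loop setup is illustrative, not part of the task):

```python
gamma, alpha = 0.9, 0.01
theta, g, v = 2.0, 0.5, 0.0   # start from zero velocity, as in Example 2

for step in range(3):
    v = gamma * v + alpha * g   # velocity grows while the gradient stays consistent
    theta = theta - v
    print(f"step {step + 1}: v = {v:.6f}, theta = {theta:.6f}")
```

The per-step velocity climbs from 0.005 to 0.0095 to 0.01355, so each parameter step is larger than the last even though the gradient never changed.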
Example 3: parameter = [1.0, 2.0, 3.0], grad = [0.1, 0.2, 0.3], velocity = [0.0, 0.0, 0.0] → expected output: ([0.999, 1.998, 2.997], [0.001, 0.002, 0.003])
The momentum update extends seamlessly to arrays, applying the same logic element-wise:
For each element i: • v_new[i] = 0.9 × velocity[i] + 0.01 × grad[i] • θ_new[i] = parameter[i] - v_new[i]
Calculations: • Element 0: v_new = 0.001, θ_new = 1.0 - 0.001 = 0.999 • Element 1: v_new = 0.002, θ_new = 2.0 - 0.002 = 1.998 • Element 2: v_new = 0.003, θ_new = 3.0 - 0.003 = 2.997
This vectorized operation is how momentum is applied in practice to entire weight matrices or bias vectors in neural networks.
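A sketch of that vectorized step with NumPy (the specific array values are taken from Example 3; everything else is one line of broadcast arithmetic per equation):

```python
import numpy as np

gamma, alpha = 0.9, 0.01
theta = np.array([1.0, 2.0, 3.0])
grad = np.array([0.1, 0.2, 0.3])
v = np.zeros(3)

# Element-wise momentum update over the whole array at once
v_new = gamma * v + alpha * grad    # approximately [0.001, 0.002, 0.003]
theta_new = theta - v_new           # approximately [0.999, 1.998, 2.997]
```

The same two lines scale unchanged to weight matrices or bias vectors of any shape.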
Constraints