In gradient descent optimization, momentum is a powerful enhancement that accelerates convergence and helps navigate challenging loss landscapes. Instead of relying solely on the current gradient, momentum maintains an exponentially decaying accumulation of past gradients, creating a "velocity" that carries the optimization forward along its accumulated direction.
Conceptual Foundation: Imagine rolling a ball down a hilly terrain to find the lowest point. Standard gradient descent moves the ball based only on the current slope, which can be slow in flat regions and erratic in steep, oscillating valleys. Momentum gives the ball inertia—it remembers its previous direction and speed, allowing it to build up velocity in consistent directions while dampening oscillations in inconsistent directions.
Mathematical Formulation: The momentum update rule consists of two interconnected equations:
Velocity Update: $$v_{t+1} = \gamma \cdot v_t + \alpha \cdot g_t$$
Parameter Update: $$\theta_{t+1} = \theta_t - v_{t+1}$$
Where: • θ_t: the parameter value at step t • g_t: the gradient of the loss at step t • v_t: the accumulated velocity at step t • α: the learning rate • γ: the momentum coefficient, typically 0.9
Key Benefits: • Builds up speed along directions where successive gradients agree, accelerating progress in flat regions • Dampens oscillations in steep, narrow valleys where gradients alternate direction • Accumulated velocity can carry the parameters past small bumps in the loss surface
Your Task: Implement a function that performs a single momentum-based gradient descent update step. The function should: • Accept the current parameter(s), gradient(s), and velocity, plus a learning rate α (default 0.01) and momentum coefficient γ (default 0.9) • Compute the new velocity v_new = γ × v + α × g • Return the updated parameter θ - v_new together with v_new • Handle scalars and arrays alike, applying the update element-wise
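One way the task above could be sketched in Python (the function name `momentum_update` and the NumPy-based handling of scalars and arrays are assumptions of this sketch, not a prescribed signature):

```python
import numpy as np

def momentum_update(parameter, grad, velocity, lr=0.01, momentum=0.9):
    """Perform one momentum-based gradient descent step.

    Works element-wise for scalars and arrays alike.
    Returns (updated_parameter, new_velocity).
    """
    parameter = np.asarray(parameter, dtype=float)
    grad = np.asarray(grad, dtype=float)
    velocity = np.asarray(velocity, dtype=float)

    # Velocity update: v_{t+1} = gamma * v_t + alpha * g_t
    new_velocity = momentum * velocity + lr * grad
    # Parameter update: theta_{t+1} = theta_t - v_{t+1}
    new_parameter = parameter - new_velocity
    return new_parameter, new_velocity
```

Converting inputs with `np.asarray` lets the same two lines of arithmetic cover both the scalar and the array cases.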
Example 1: parameter = 1.0, grad = 0.1, velocity = 0.1 → expected output: (0.909, 0.091)
Let's trace through the momentum update step-by-step:
Given values: • Current parameter θ = 1.0 • Current gradient g = 0.1 • Current velocity v = 0.1 • Learning rate α = 0.01 (default) • Momentum coefficient γ = 0.9 (default)
Step 1: Compute the new velocity:
v_new = γ × v + α × g = 0.9 × 0.1 + 0.01 × 0.1 = 0.09 + 0.001 = 0.091
Step 2: Update the parameter:
θ_new = θ - v_new = 1.0 - 0.091 = 0.909
The function returns (0.909, 0.091) representing the updated parameter and new velocity.
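The arithmetic of this trace can be checked directly in plain Python (no helper function assumed):

```python
gamma, alpha = 0.9, 0.01          # momentum coefficient and learning rate
theta, g, v = 1.0, 0.1, 0.1       # current parameter, gradient, velocity

v_new = gamma * v + alpha * g     # 0.09 + 0.001 = 0.091
theta_new = theta - v_new         # 1.0 - 0.091 = 0.909

print(round(theta_new, 3), round(v_new, 3))  # prints: 0.909 0.091
```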
Example 2: parameter = 2.0, grad = 0.5, velocity = 0.0 → expected output: (1.995, 0.005)
This example shows the first iteration when starting from zero velocity:
Given values: • Current parameter θ = 2.0 • Current gradient g = 0.5 • Current velocity v = 0.0 (initial state)
Step 1: Compute the new velocity:
v_new = 0.9 × 0.0 + 0.01 × 0.5 = 0 + 0.005 = 0.005
Step 2: Update parameter θ_new = 2.0 - 0.005 = 1.995
With zero initial velocity, the first update is entirely determined by the scaled gradient. Subsequent iterations will accumulate momentum.
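To see that accumulation concretely, a small sketch repeating the update a few times with a constant gradient (the loop setup is illustrative, not part of the task):

```python
gamma, alpha = 0.9, 0.01
theta, g, v = 2.0, 0.5, 0.0   # start from zero velocity, as in Example 2

for step in range(3):
    v = gamma * v + alpha * g   # velocity grows while the gradient stays consistent
    theta = theta - v
    print(f"step {step + 1}: v = {v:.6f}, theta = {theta:.6f}")
```

The per-step velocity climbs from 0.005 to 0.0095 to 0.01355, so each parameter step is larger than the last even though the gradient never changed.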
Example 3: parameter = [1.0, 2.0, 3.0], grad = [0.1, 0.2, 0.3], velocity = [0.0, 0.0, 0.0] → expected output: ([0.999, 1.998, 2.997], [0.001, 0.002, 0.003])
The momentum update extends seamlessly to arrays, applying the same logic element-wise:
For each element i: • v_new[i] = 0.9 × velocity[i] + 0.01 × grad[i] • θ_new[i] = parameter[i] - v_new[i]
Calculations: • Element 0: v_new = 0.001, θ_new = 1.0 - 0.001 = 0.999 • Element 1: v_new = 0.002, θ_new = 2.0 - 0.002 = 1.998 • Element 2: v_new = 0.003, θ_new = 3.0 - 0.003 = 2.997
This vectorized operation is how momentum is applied in practice to entire weight matrices or bias vectors in neural networks.
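A sketch of that vectorized step with NumPy (the specific array values are taken from Example 3; everything else is one line of broadcast arithmetic per equation):

```python
import numpy as np

gamma, alpha = 0.9, 0.01
theta = np.array([1.0, 2.0, 3.0])
grad = np.array([0.1, 0.2, 0.3])
v = np.zeros(3)

# Element-wise momentum update over the whole array at once
v_new = gamma * v + alpha * grad    # approximately [0.001, 0.002, 0.003]
theta_new = theta - v_new           # approximately [0.999, 1.998, 2.997]
```

The same two lines scale unchanged to weight matrices or bias vectors of any shape.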
Constraints