In gradient-based optimization, the Nesterov Accelerated Gradient (NAG) method is a powerful enhancement to classical momentum optimization that provides faster convergence and better theoretical guarantees. Unlike standard momentum, which blindly follows the accumulated gradient direction, Nesterov momentum employs a look-ahead strategy that anticipates where the parameters will move before computing the gradient.
Imagine you're a skier descending a mountain. Standard momentum would have you continue in your current direction of travel while also considering the slope beneath your feet. Nesterov momentum is smarter: it first projects where you'll end up after applying your current momentum, then looks at the slope at that anticipated position to make a more informed adjustment. This "peek ahead" prevents overshooting and enables more responsive corrections.
The Nesterov Accelerated Gradient update consists of three key steps:
Step 1: Look-Ahead Position First, compute the anticipated position by projecting the current velocity: $$\theta_{\text{look-ahead}} = \theta_t + \mu \cdot v_t$$
Step 2: Compute Gradient at Look-Ahead Position Evaluate the gradient at this anticipated position: $$g = \nabla f(\theta_{\text{look-ahead}})$$
Step 3: Update Velocity and Parameters Update the velocity by incorporating both momentum and the gradient at the look-ahead position: $$v_{t+1} = \mu \cdot v_t - \alpha \cdot g$$
Update the parameters using the new velocity: $$\theta_{t+1} = \theta_t + v_{t+1}$$
Where: $\theta_t$ is the current parameter value, $v_t$ is the current velocity, $\mu$ is the momentum coefficient, and $\alpha$ is the learning rate.
Implement the nesterov_momentum_update function that performs a single Nesterov momentum optimization step. Your function should compute the look-ahead position, evaluate the gradient there, update the velocity, and return the updated parameter and velocity.
Round all output values to 4 decimal places for consistency.
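One possible sketch of such a function for scalar inputs (the name nesterov_momentum_update comes from the problem statement; the defaults μ = 0.9 and α = 0.1 are assumptions matching the worked examples below):

```python
def nesterov_momentum_update(parameter, grad_fn, velocity, mu=0.9, alpha=0.1):
    """One Nesterov momentum step for a scalar parameter."""
    # Step 1: look-ahead position
    look_ahead = parameter + mu * velocity
    # Step 2: gradient at the look-ahead position
    g = grad_fn(look_ahead)
    # Step 3: velocity and parameter updates
    new_velocity = mu * velocity - alpha * g
    new_parameter = parameter + new_velocity
    return [round(new_parameter, 4), round(new_velocity, 4)]
```

For example, `nesterov_momentum_update(1.0, lambda x: x, 0.1)` returns `[0.981, -0.019]`, matching the first worked example.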
parameter = 1.0
grad_fn = lambda x: x
velocity = 0.1
Output: [0.981, -0.019]
Step-by-step calculation:
Look-ahead position: θ_lookahead = θ + μ × v = 1.0 + 0.9 × 0.1 = 1.09
Gradient at look-ahead position: g = grad_fn(1.09) = 1.09
Update velocity: v_new = μ × v - α × g = 0.9 × 0.1 - 0.1 × 1.09 = 0.09 - 0.109 = -0.019
Update parameter: θ_new = θ + v_new = 1.0 + (-0.019) = 0.981
Result: [0.981, -0.019]
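The arithmetic above can be checked directly (μ = 0.9 and α = 0.1 are the values used in this example):

```python
mu, alpha = 0.9, 0.1          # momentum coefficient and learning rate
theta, v = 1.0, 0.1           # current parameter and velocity
grad_fn = lambda x: x

look_ahead = theta + mu * v            # 1.09
g = grad_fn(look_ahead)                # 1.09
v_new = mu * v - alpha * g             # -0.019
theta_new = theta + v_new              # 0.981
print(round(theta_new, 4), round(v_new, 4))  # 0.981 -0.019
```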
parameter = 2.0
grad_fn = lambda x: x
velocity = 0.0
Output: [1.8, -0.2]
Step-by-step calculation:
Look-ahead position: θ_lookahead = θ + μ × v = 2.0 + 0.9 × 0.0 = 2.0
Gradient at look-ahead position: g = grad_fn(2.0) = 2.0
Update velocity: v_new = μ × v - α × g = 0.9 × 0.0 - 0.1 × 2.0 = 0 - 0.2 = -0.2
Update parameter: θ_new = θ + v_new = 2.0 + (-0.2) = 1.8
Result: [1.8, -0.2]
Note: With zero initial velocity, the look-ahead position equals the current position, and the update behaves like a standard gradient descent step with momentum initialization.
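This equivalence is easy to verify: with $v = 0$ the look-ahead collapses to the current position, so the step reduces to $\theta - \alpha \cdot \nabla f(\theta)$. A quick check under the same assumed μ = 0.9 and α = 0.1:

```python
mu, alpha = 0.9, 0.1
theta, v = 2.0, 0.0
grad_fn = lambda x: x

# Nesterov step (look-ahead equals theta because v == 0)
v_new = mu * v - alpha * grad_fn(theta + mu * v)
nesterov_result = theta + v_new
# Plain gradient-descent step
gd_result = theta - alpha * grad_fn(theta)

print(nesterov_result, gd_result)  # 1.8 1.8
```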
parameter = [1.0, 2.0, 3.0]
grad_fn = lambda x: x
velocity = [0.1, 0.2, 0.3]
Output: [[0.981, 1.962, 2.943], [-0.019, -0.038, -0.057]]
Vector computation (element-wise):
The same algorithm applies independently to each element:
For element 0 (θ = 1.0, v = 0.1): look-ahead = 1.0 + 0.9 × 0.1 = 1.09, g = 1.09, v_new = 0.09 - 0.109 = -0.019, θ_new = 0.981
For element 1 (θ = 2.0, v = 0.2): look-ahead = 2.0 + 0.9 × 0.2 = 2.18, g = 2.18, v_new = 0.18 - 0.218 = -0.038, θ_new = 1.962
For element 2 (θ = 3.0, v = 0.3): look-ahead = 3.0 + 0.9 × 0.3 = 3.27, g = 3.27, v_new = 0.27 - 0.327 = -0.057, θ_new = 2.943
Result: [[0.981, 1.962, 2.943], [-0.019, -0.038, -0.057]]
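The element-wise computation above can be sketched with NumPy, which applies the same formulas to every element at once (the defaults μ = 0.9 and α = 0.1 are again assumptions drawn from the worked examples):

```python
import numpy as np

def nesterov_momentum_update(parameter, grad_fn, velocity, mu=0.9, alpha=0.1):
    """One Nesterov momentum step; accepts scalars or array-like inputs."""
    theta = np.asarray(parameter, dtype=float)
    v = np.asarray(velocity, dtype=float)
    g = grad_fn(theta + mu * v)      # gradient at the look-ahead position
    v_new = mu * v - alpha * g       # velocity update
    theta_new = theta + v_new        # parameter update
    return [np.round(theta_new, 4).tolist(), np.round(v_new, 4).tolist()]

print(nesterov_momentum_update([1.0, 2.0, 3.0], lambda x: x, [0.1, 0.2, 0.3]))
# [[0.981, 1.962, 2.943], [-0.019, -0.038, -0.057]]
```

Because the update touches each coordinate independently, the scalar and vector versions differ only in whether the arithmetic is broadcast over an array.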
Constraints