Vanilla gradient descent has a fundamental limitation: it's provably slow on ill-conditioned problems. When the loss surface is elongated (high condition number κ), gradient descent zig-zags back and forth, making painfully slow progress toward the optimum.
But what if we could remember our previous steps and build up velocity in consistent directions? This is exactly what momentum does. By incorporating a "memory" of past gradients, momentum methods can:
- Dampen oscillations along steep, high-curvature directions
- Accumulate speed along flat directions where gradients consistently point the same way
- Accelerate convergence from O(κ) to O(√κ) iterations on strongly convex problems
Momentum is not just a heuristic—it's mathematically guaranteed to accelerate convergence. This page explores the theory and practice of momentum-based optimization.
By the end of this page, you will understand the physics intuition behind momentum, implement classical and Nesterov momentum, analyze why momentum provides acceleration, and master practical tuning of momentum hyperparameters.
Consider optimizing a quadratic with different curvatures along each axis:
$$J(\theta_1, \theta_2) = \frac{1}{2}(100\theta_1^2 + \theta_2^2)$$
This is an elongated ellipse with:
- Curvature 100 along θ₁ (the steep direction)
- Curvature 1 along θ₂ (the flat direction)
- Condition number κ = 100/1 = 100
What happens with vanilla GD:
- Stability along the steep θ₁ direction forces a small learning rate (α < 2/100)
- With that small α, progress along the flat θ₂ direction is roughly 100× slower
- The trajectory zig-zags across the valley, oscillating in θ₁ while inching along θ₂
The fundamental issue: A single learning rate can't simultaneously:
- Be small enough to stay stable in the steep direction (α < 2/L)
- Be large enough to make meaningful progress in the flat direction
Convergence rate reminder:
For L-smooth, μ-strongly convex functions, vanilla GD converges as: $$(1 - \mu/L)^t = (1 - 1/\kappa)^t$$
For κ = 100, each iteration reduces the error by only 1%. Reducing the error by a factor of 10⁶ requires: $$t \approx \kappa \ln(10^6) \approx 1{,}400 \text{ iterations}$$
Can we do better? Yes! With acceleration, we can achieve: $$(1 - 1/\sqrt{\kappa})^t$$
For κ = 100, this means only ~140 iterations—10× faster!
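These back-of-the-envelope counts are easy to check in a few lines of Python (the exact values differ slightly from the rounded figures above):

```python
import math

def iterations_needed(rate_factor: float, error_reduction: float = 1e6) -> float:
    """Smallest t such that rate_factor**t <= 1/error_reduction."""
    return math.log(error_reduction) / -math.log(rate_factor)

kappa = 100
vanilla = iterations_needed(1 - 1 / kappa)                # (1 - 1/κ)^t
accelerated = iterations_needed(1 - 1 / math.sqrt(kappa)) # (1 - 1/√κ)^t

print(f"Vanilla GD:  ~{vanilla:.0f} iterations")
print(f"Accelerated: ~{accelerated:.0f} iterations")
print(f"Speedup:     ~{vanilla / accelerated:.1f}x")
```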
Classical momentum (also called Polyak momentum or heavy ball method) introduces a velocity vector that accumulates gradient information:
$$\mathbf{v}^{(t+1)} = \gamma \mathbf{v}^{(t)} + \nabla J(\boldsymbol{\theta}^{(t)})$$ $$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \alpha \mathbf{v}^{(t+1)}$$
where:
- $\mathbf{v}$ is the velocity vector (initialized to zero)
- $\gamma \in [0, 1)$ is the momentum coefficient (typically 0.9)
- $\alpha$ is the learning rate
Interpretation:
The velocity $\mathbf{v}$ is an exponentially weighted moving average of gradients: $$\mathbf{v}^{(t)} = \sum_{k=0}^{t} \gamma^{t-k} \nabla J(\boldsymbol{\theta}^{(k)})$$
Recent gradients contribute more; older gradients decay exponentially with factor $\gamma$.
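A quick numerical check (using synthetic gradients, since only the weighting matters here) confirms that the recursion and the unrolled sum agree:

```python
import numpy as np

rng = np.random.default_rng(0)
grads = rng.normal(size=(10, 3))  # 10 synthetic gradient vectors
gamma = 0.9

# Recursive form: v ← γv + g
v = np.zeros(3)
for g in grads:
    v = gamma * v + g

# Closed form: most recent gradient weighted by γ⁰, older ones decay by γ
T = len(grads)
v_closed = sum(gamma**(T - 1 - k) * grads[k] for k in range(T))

assert np.allclose(v, v_closed)
print("Recursive and closed-form velocities match:", v)
```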
The physics analogy:
Imagine a ball rolling down a hilly landscape:
- The ball accumulates speed as it rolls downhill
- Its inertia carries it through small bumps and shallow ravines
- Friction (the factor $1 - \gamma$) gradually bleeds off speed, so it eventually settles at the bottom
In optimization terms:
- The negative gradient plays the role of a force accelerating the ball
- The velocity $\mathbf{v}$ is the ball's accumulated speed
- $\gamma$ controls how much "inertia" carries over from one step to the next
Momentum acts as a low-pass filter on the gradients, smoothing out high-frequency oscillations while preserving the overall descent direction.
```python
import numpy as np
import matplotlib.pyplot as plt

def gradient_descent_with_momentum(theta_init, grad_fn, learning_rate=0.01,
                                   momentum=0.9, n_iterations=100):
    """
    Gradient descent with classical (Polyak) momentum.

    Args:
        theta_init: Initial parameters
        grad_fn: Function that computes gradient
        learning_rate: Step size α
        momentum: Momentum coefficient γ (0 to 1)
        n_iterations: Number of update steps

    Returns:
        trajectory: Array of parameter vectors
    """
    theta = theta_init.copy()
    velocity = np.zeros_like(theta)  # Initialize velocity to zero
    trajectory = [theta.copy()]

    for t in range(n_iterations):
        # Compute gradient at current position
        grad = grad_fn(theta)

        # Update velocity: v = γv + ∇J(θ)
        velocity = momentum * velocity + grad

        # Update parameters: θ = θ - αv
        theta = theta - learning_rate * velocity
        trajectory.append(theta.copy())

    return np.array(trajectory)

# Example: Elongated quadratic (ill-conditioned)
def elongated_quadratic_grad(theta):
    """Gradient of J(θ) = 50·θ₁² + 0.5·θ₂² (condition number κ = 100)"""
    return np.array([100 * theta[0], 1 * theta[1]])

# Compare vanilla GD vs momentum
theta_init = np.array([10.0, 10.0])

# Vanilla GD (learning rate limited by steep direction)
trajectory_vanilla = gradient_descent_with_momentum(
    theta_init, elongated_quadratic_grad,
    learning_rate=0.015, momentum=0.0, n_iterations=100)

# With momentum
trajectory_momentum = gradient_descent_with_momentum(
    theta_init, elongated_quadratic_grad,
    learning_rate=0.015, momentum=0.9, n_iterations=100)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Trajectory plot
ax1 = axes[0]
theta1 = np.linspace(-12, 12, 100)
theta2 = np.linspace(-12, 12, 100)
T1, T2 = np.meshgrid(theta1, theta2)
J = 50 * T1**2 + 0.5 * T2**2  # Loss surface

ax1.contour(T1, T2, J, levels=30, cmap='gray', alpha=0.5)
ax1.plot(trajectory_vanilla[:, 0], trajectory_vanilla[:, 1], 'b.-',
         linewidth=1.5, markersize=4, label='Vanilla GD', alpha=0.7)
ax1.plot(trajectory_momentum[:, 0], trajectory_momentum[:, 1], 'r.-',
         linewidth=1.5, markersize=4, label='Momentum (γ=0.9)', alpha=0.7)
ax1.scatter([0], [0], color='green', s=100, marker='*', zorder=10, label='Optimum')
ax1.set_xlabel('θ₁')
ax1.set_ylabel('θ₂')
ax1.set_title('Optimization Trajectories (κ = 100)')
ax1.legend()
ax1.set_xlim(-12, 12)
ax1.set_ylim(-12, 12)

# Loss curve
ax2 = axes[1]
def compute_loss(theta):
    return 50 * theta[0]**2 + 0.5 * theta[1]**2

losses_vanilla = [compute_loss(t) for t in trajectory_vanilla]
losses_momentum = [compute_loss(t) for t in trajectory_momentum]

ax2.semilogy(losses_vanilla, 'b-', linewidth=2, label='Vanilla GD')
ax2.semilogy(losses_momentum, 'r-', linewidth=2, label='Momentum (γ=0.9)')
ax2.set_xlabel('Iteration')
ax2.set_ylabel('Loss (log scale)')
ax2.set_title('Convergence Speed')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('momentum_comparison.png', dpi=150)
plt.show()

print(f"Vanilla GD - Final loss: {losses_vanilla[-1]:.6f}")
print(f"Momentum - Final loss: {losses_momentum[-1]:.6f}")
```

The momentum coefficient γ is typically set to 0.9 (90% of the previous velocity retained). Values closer to 1 provide more smoothing but risk overshooting. Some schedules increase momentum during training (e.g., from 0.5 to 0.99).
Nesterov momentum is a clever modification that provides provably optimal convergence rates. The key insight: compute the gradient at the "lookahead" position where momentum would take us, not at the current position.
Update equations:
$$\mathbf{v}^{(t+1)} = \gamma \mathbf{v}^{(t)} + \nabla J(\boldsymbol{\theta}^{(t)} - \alpha \gamma \mathbf{v}^{(t)})$$ $$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \alpha \mathbf{v}^{(t+1)}$$
Notice the gradient is evaluated at $\boldsymbol{\theta}^{(t)} - \alpha \gamma \mathbf{v}^{(t)}$, not $\boldsymbol{\theta}^{(t)}$.
Equivalent formulation (more practical):
$$\tilde{\boldsymbol{\theta}} = \boldsymbol{\theta}^{(t)} - \alpha \gamma \mathbf{v}^{(t)}$$ $$\mathbf{v}^{(t+1)} = \gamma \mathbf{v}^{(t)} + \nabla J(\tilde{\boldsymbol{\theta}})$$ $$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \alpha \mathbf{v}^{(t+1)}$$
Why lookahead helps:
Classical momentum computes the gradient at the current position, then applies momentum. This means we might accelerate in the wrong direction if the gradient is about to change.
Nesterov momentum asks: "If I'm going to move by momentum anyway, where will I end up? Let me compute the gradient there." This provides early correction:
- If the lookahead point has overshot, the gradient there already points back, braking the velocity before the overshoot occurs
- If the lookahead point is still descending, the gradient there reinforces the velocity
The result: Nesterov momentum converges faster with less oscillation, especially near the optimum.
```python
import numpy as np

def nesterov_momentum(theta_init, grad_fn, learning_rate=0.01,
                      momentum=0.9, n_iterations=100):
    """
    Gradient descent with Nesterov accelerated gradient.

    Key difference from classical momentum:
    Gradient is evaluated at the "lookahead" position.
    """
    theta = theta_init.copy()
    velocity = np.zeros_like(theta)
    trajectory = [theta.copy()]

    for t in range(n_iterations):
        # Lookahead: where would momentum take us?
        theta_lookahead = theta - learning_rate * momentum * velocity

        # Compute gradient at lookahead position (not current!)
        grad = grad_fn(theta_lookahead)

        # Update velocity
        velocity = momentum * velocity + grad

        # Update parameters
        theta = theta - learning_rate * velocity
        trajectory.append(theta.copy())

    return np.array(trajectory)

def nesterov_reparameterized(theta_init, grad_fn, learning_rate=0.01,
                             momentum=0.9, n_iterations=100):
    """
    Equivalent Nesterov formulation often used in practice.
    Extrapolates from the previous iterate, then takes a plain gradient step.
    """
    theta = theta_init.copy()
    theta_prev = theta.copy()
    trajectory = [theta.copy()]

    for t in range(n_iterations):
        # Extrapolation step (lookahead)
        theta_extrapolated = theta + momentum * (theta - theta_prev)

        # Gradient step
        grad = grad_fn(theta_extrapolated)

        # Store previous for next extrapolation
        theta_prev = theta.copy()

        # Update
        theta = theta_extrapolated - learning_rate * grad
        trajectory.append(theta.copy())

    return np.array(trajectory)

# Compare all three methods (gradient_descent_with_momentum is defined above)
def elongated_grad(theta):
    return np.array([100 * theta[0], 1 * theta[1]])

theta_init = np.array([10.0, 10.0])
lr = 0.015
gamma = 0.9

traj_vanilla = gradient_descent_with_momentum(
    theta_init, elongated_grad, lr, momentum=0.0, n_iterations=80)
traj_classical = gradient_descent_with_momentum(
    theta_init, elongated_grad, lr, momentum=gamma, n_iterations=80)
traj_nesterov = nesterov_momentum(
    theta_init, elongated_grad, lr, momentum=gamma, n_iterations=80)

def loss_fn(t):
    return 50 * t[0]**2 + 0.5 * t[1]**2

print("Loss after 80 iterations:")
print(f"  Vanilla GD:    {loss_fn(traj_vanilla[-1]):.6f}")
print(f"  Classical Mom: {loss_fn(traj_classical[-1]):.6f}")
print(f"  Nesterov Mom:  {loss_fn(traj_nesterov[-1]):.6f}")
```

For L-smooth, μ-strongly convex functions, Nesterov achieves the optimal rate (1 - 1/√κ)^t among first-order methods. This matches the lower bound proven in optimization theory: with only gradient information, you cannot do better (up to constant factors).
Let's understand why momentum provides acceleration through analysis of a simple quadratic.
Quadratic analysis:
Consider $J(\theta) = \frac{L}{2} \theta^2$. Vanilla GD with $\alpha = 1/L$: $$\theta^{(t+1)} = \theta^{(t)} - \frac{1}{L} \cdot L\theta^{(t)} = 0$$
One-step convergence! But with $\alpha = 1/(2L)$ (sub-optimal): $$\theta^{(t+1)} = (1 - 1/2) \theta^{(t)} = 0.5 \cdot \theta^{(t)}$$
Convergence rate is 0.5 per iteration.
Now add momentum (with $\gamma = 0.5$, $\alpha = 1/(2L)$):
With $\nabla J(\theta) = L\theta$, substituting the velocity update $v^{(t+1)} = 0.5\, v^{(t)} + L\theta^{(t)}$ into the parameter update gives the linear system: $$\begin{pmatrix} \theta^{(t+1)} \\ v^{(t+1)} \end{pmatrix} = \begin{pmatrix} 0.5 & -1/(4L) \\ L & 0.5 \end{pmatrix} \begin{pmatrix} \theta^{(t)} \\ v^{(t)} \end{pmatrix}$$
The eigenvalues of this matrix determine convergence. With proper tuning, they can be complex with modulus less than 1—producing spiraling convergence that's faster than pure contraction.
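This is easy to verify numerically. A minimal sketch with γ = 0.5 and α = 1/(2L) (any positive L gives the same moduli, since the trace and determinant don't depend on it):

```python
import numpy as np

L = 10.0                       # curvature of J(θ) = (L/2)θ²
gamma, alpha = 0.5, 1 / (2 * L)

# One momentum step, written as a linear map on (θ, v):
#   v' = γ·v + L·θ                     (velocity update, gradient = Lθ)
#   θ' = θ - α·v' = (1 - αL)·θ - αγ·v  (parameter update)
A = np.array([[1 - alpha * L, -alpha * gamma],
              [L,              gamma        ]])

eigvals = np.linalg.eigvals(A)
print("Eigenvalues:", eigvals)        # complex pair 0.5 ± 0.5i
print("Moduli:", np.abs(eigvals))     # both ≈ 0.707 < 1 → spiraling convergence
```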
| Method | Convergence Rate | Iterations for κ=100 | Iterations for κ=10000 |
|---|---|---|---|
| Vanilla GD | (1 - 1/κ)^t | ~1400 | ~140,000 |
| Classical Momentum (tuned) | (1 - 2/√κ)^t (approx) | ~140 | ~1,400 |
| Nesterov (optimal) | (1 - 1/√κ)^t | ~140 | ~1,400 |
| Improvement factor | √κ × | ~10× | ~100× |
The acceleration mystery solved:
Why can momentum achieve $O(\sqrt{\kappa})$ while vanilla GD requires $O(\kappa)$?
Vanilla GD perspective: Limited by the steepest direction. Can only take steps of size $1/L$. Progress in flat direction is $(\mu/L) = 1/\kappa$ per step.
Momentum perspective: Accumulates velocity in flat direction while canceling oscillations in steep direction. Effectively takes larger steps in flat directions without destabilizing steep directions.
Mathematical intuition: Momentum transforms the optimization into a second-order dynamical system (position + velocity). This richer state space allows faster navigation of the geometry.
Optimal momentum coefficient for strongly convex problems: $$\gamma^* = \frac{\sqrt{L} - \sqrt{\mu}}{\sqrt{L} + \sqrt{\mu}} = \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}$$
For κ = 100: $\gamma^* \approx 0.82$. For κ = 10000: $\gamma^* \approx 0.98$.
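The formula takes a couple of lines to evaluate (the helper name `optimal_momentum` is ours, for illustration):

```python
import math

def optimal_momentum(kappa: float) -> float:
    """γ* = (√κ - 1) / (√κ + 1) for an L-smooth, μ-strongly convex problem."""
    s = math.sqrt(kappa)
    return (s - 1) / (s + 1)

for kappa in (100, 10_000):
    print(f"κ = {kappa:>6}: γ* ≈ {optimal_momentum(kappa):.3f}")
```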
Different frameworks implement momentum in slightly different (but equivalent) ways.
PyTorch-style (velocity accumulation):
v = gamma * v + gradient
theta = theta - learning_rate * v
TensorFlow-style (scaled velocity):
v = gamma * v + learning_rate * gradient
theta = theta - v
These are equivalent with appropriate parameter mapping. Be careful when porting code between frameworks!
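The equivalence is easy to demonstrate: with a constant learning rate, the TensorFlow-style velocity is exactly `lr` times the PyTorch-style one, so the trajectories coincide. (The mapping only matters when the learning rate changes mid-training.) A sketch with synthetic gradients:

```python
import numpy as np

def pytorch_style(theta, grads, lr, gamma):
    """v ← γv + g;  θ ← θ - lr·v"""
    v = np.zeros_like(theta)
    for g in grads:
        v = gamma * v + g
        theta = theta - lr * v
    return theta

def tensorflow_style(theta, grads, lr, gamma):
    """v ← γv + lr·g;  θ ← θ - v  (velocity absorbs the learning rate)"""
    v = np.zeros_like(theta)
    for g in grads:
        v = gamma * v + lr * g
        theta = theta - v
    return theta

rng = np.random.default_rng(42)
grads = list(rng.normal(size=(20, 3)))
theta0 = np.ones(3)

a = pytorch_style(theta0, grads, lr=0.1, gamma=0.9)
b = tensorflow_style(theta0, grads, lr=0.1, gamma=0.9)
assert np.allclose(a, b)
print("Identical final parameters:", a)
```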
Nesterov in PyTorch:
v = gamma * v + gradient
theta = theta - learning_rate * (gradient + gamma * v)
This "look-ahead" is applied differently but achieves the same effect.
```python
import numpy as np

class SGDMomentum:
    """
    PyTorch-style SGD with momentum.
    Mirrors torch.optim.SGD's update rule (without dampening or weight decay).
    """

    def __init__(self, params, lr=0.01, momentum=0.9, nesterov=False):
        """
        Args:
            params: List of parameter arrays
            lr: Learning rate
            momentum: Momentum coefficient (default: 0.9)
            nesterov: Use Nesterov momentum (default: False)
        """
        self.params = params
        self.lr = lr
        self.momentum = momentum
        self.nesterov = nesterov
        # Initialize velocity for each parameter
        self.velocities = [np.zeros_like(p) for p in params]

    def step(self, gradients):
        """
        Perform one optimization step.

        Args:
            gradients: List of gradient arrays (same shape as params)
        """
        for i, (param, grad, vel) in enumerate(
                zip(self.params, gradients, self.velocities)):
            if self.momentum != 0:
                # Update velocity: v = γv + g
                vel = self.momentum * vel + grad
                self.velocities[i] = vel

                if self.nesterov:
                    # Nesterov: θ ← θ - α(g + γv)
                    param -= self.lr * (grad + self.momentum * vel)
                else:
                    # Classical: θ ← θ - αv
                    param -= self.lr * vel
            else:
                # Vanilla GD: θ ← θ - αg
                param -= self.lr * grad

    def zero_grad(self):
        """Reset gradients (in practice, done externally)."""
        pass

# Example usage
def rosenbrock_grad(x, y, a=1, b=100):
    """Gradient of Rosenbrock function: (a-x)² + b(y-x²)²"""
    dx = -2*(a - x) - 4*b*x*(y - x**2)
    dy = 2*b*(y - x**2)
    return np.array([dx, dy])

# Initialize
theta = np.array([-1.5, 1.5])
params = [theta]

# Train with momentum
optimizer = SGDMomentum(params, lr=0.0001, momentum=0.9, nesterov=True)
trajectory = [theta.copy()]

for step in range(1000):
    grad = rosenbrock_grad(theta[0], theta[1])
    optimizer.step([grad])
    if step % 100 == 0:
        loss = (1 - theta[0])**2 + 100*(theta[1] - theta[0]**2)**2
        print(f"Step {step}: loss = {loss:.4f}, θ = {theta}")
    trajectory.append(theta.copy())

print(f"Final: θ = {theta}, optimal = [1, 1]")
```

Momentum introduces a new hyperparameter γ. How do we tune it effectively?
Typical values:
- γ = 0.9: the standard default, a good fit for most problems
- γ = 0.99: heavier smoothing for very noisy (small-batch) gradients
- γ = 0.5-0.8: lighter momentum for well-conditioned problems
Learning rate interaction:
In steady state the velocity accumulates to roughly $1/(1-\gamma)$ times a typical gradient, so momentum amplifies the effective step size by about that factor. With γ = 0.9 that is a 10× amplification, which is why the learning rate is usually reduced (often by ~10×) relative to vanilla GD.
Momentum schedule:
Some practitioners use a momentum schedule:
- Start with γ ≈ 0.5 in early epochs, while gradients are large and directions shift quickly
- Gradually increase toward γ ≈ 0.99 as training stabilizes
This provides aggressive exploration early and precise convergence late.
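One simple way to implement such a schedule is a linear ramp. The function name and the 0.5 → 0.99 range below are illustrative, not a standard API:

```python
def momentum_schedule(step: int, total_steps: int,
                      gamma_start: float = 0.5, gamma_end: float = 0.99) -> float:
    """Linearly ramp the momentum coefficient from gamma_start to gamma_end."""
    frac = min(step / max(total_steps, 1), 1.0)
    return gamma_start + frac * (gamma_end - gamma_start)

# The coefficient ramps smoothly over training:
for step in (0, 250, 500, 750, 1000):
    print(f"step {step:>4}: γ = {momentum_schedule(step, 1000):.3f}")
```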
| Scenario | Recommended γ | Learning Rate Adjustment | Notes |
|---|---|---|---|
| Default/general | 0.9 | ~1/10 of vanilla GD | Safe starting point |
| Very noisy gradients (small batch) | 0.99 | ~1/100 of vanilla GD | More smoothing needed |
| Well-conditioned problem | 0.5-0.8 | ~1/2 of vanilla GD | Less acceleration needed |
| Fine-tuning pretrained model | 0.9 | Very small LR | Don't want to deviate far |
| Training from scratch | 0.9 → 0.99 | Follow schedule | Gradual increase works well |
Start with lr=0.01, momentum=0.9. If training is unstable, reduce lr. If convergence is slow, try increasing lr or momentum. For state-of-the-art results on vision models (ImageNet), lr=0.1, momentum=0.9 with cosine schedule is common.
Momentum addresses part of the ill-conditioning problem by building velocity in flat directions. But it still uses a single learning rate for all parameters. Adaptive methods go further by maintaining per-parameter learning rates.
The key insight: Different parameters may need different step sizes. Weights connecting to frequently-active features see large gradients; weights to rare features see small gradients. A single learning rate can't serve both well.
| Method | Core Idea | Key Innovation |
|---|---|---|
| AdaGrad | Divide lr by accumulated squared gradients | Automatic lr decay per parameter |
| RMSprop | Exponential moving average of squared gradients | Prevents lr from decaying to zero |
| Adam | RMSprop + momentum on gradients | Best of both worlds, widely used |
| AdamW | Adam with decoupled weight decay | Better regularization for transformers |
Adam is particularly important—it's the default optimizer for most deep learning:
$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t \quad \text{(momentum on gradients)}$$ $$v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \quad \text{(momentum on squared gradients)}$$ $$\theta_{t+1} = \theta_t - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
where $\hat{m}_t$ and $\hat{v}_t$ are bias-corrected estimates.
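A minimal sketch of a single Adam update with the standard defaults (`adam_step` is an illustrative helper, not a library function). Note how bias correction makes the very first step nearly the same size in every coordinate, regardless of gradient scale:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter (used for bias correction)."""
    m = beta1 * m + (1 - beta1) * grad      # momentum on gradients
    v = beta2 * v + (1 - beta2) * grad**2   # momentum on squared gradients
    m_hat = m / (1 - beta1**t)              # bias-corrected first moment
    v_hat = v / (1 - beta2**t)              # bias-corrected second moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# On step 1, bias correction gives m̂ = g and v̂ = g², so each coordinate
# moves by almost exactly alpha in the direction of -g.
theta = np.array([1.0, -2.0, 3.0])
grad = np.array([10.0, -0.1, 0.001])
m = np.zeros_like(theta)
v = np.zeros_like(theta)

theta_new, m, v = adam_step(theta, grad, m, v, t=1, alpha=0.001)
print("Per-coordinate step sizes:", np.abs(theta_new - theta))
```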
When to use what:
For computer vision (CNNs), well-tuned SGD+momentum often achieves slightly better final accuracy than Adam. For NLP (transformers), Adam/AdamW is almost universally preferred. For quick experiments or unknown domains, start with Adam.
Module Complete!
You've now mastered gradient descent—from the basic algorithm through learning rate selection, convergence theory, batch variants, and momentum acceleration. These foundations underpin all of modern machine learning optimization.
Next steps in the curriculum:
Congratulations! You've completed a comprehensive journey through gradient descent optimization. From the basic update rule to Nesterov acceleration, you now have the theoretical foundation and practical skills to optimize any machine learning model. The techniques you've learned power virtually every neural network trained today.