Vanilla gradient descent has a fundamental problem: it treats every gradient update as if it exists in isolation. Each step looks only at the current gradient, with no memory of previous steps. This myopic behavior leads to two devastating inefficiencies that cripple training in practice.
Problem 1: Oscillation in Ravines
Consider a loss surface shaped like a long, narrow valley—a ravine. The loss rises steeply toward the walls but only gently along the valley floor where the minimum lies. Vanilla gradient descent bounces back and forth between the walls, making tiny progress along the valley floor. Each step undoes much of the previous step's work.
Problem 2: Slow Progress on Plateaus
On flat regions of the loss surface, gradients become vanishingly small. Without any accumulated velocity, vanilla gradient descent slows to a crawl, taking thousands of iterations to cross what could be traversed quickly with momentum.
The breakthrough insight: give the optimizer memory. Instead of using only the current gradient, accumulate a velocity vector that builds up speed in consistent directions and dampens oscillations in inconsistent directions. This simple idea—borrowed directly from physics—transforms optimization.
By the end of this page, you will understand momentum optimization from first principles: the physics intuition, the mathematical formulation, the convergence properties, the hyperparameter effects, and the implementation details. You'll be able to analyze when momentum helps, when it hurts, and how to tune it effectively.
The name "momentum" comes directly from classical mechanics, and the analogy is remarkably precise. Imagine a heavy ball rolling down a hilly landscape where the height represents the loss function value.
Without momentum (vanilla gradient descent):
The ball is infinitely sticky—it has no inertia. At each moment, it moves in the direction of steepest descent at that exact location, then stops completely. The next moment, it evaluates the slope again from scratch. In a ravine, this ball bounces back and forth endlessly.
With momentum:
The ball has mass and velocity. It accelerates in the direction of the slope, but its velocity persists. When the gradient changes direction (walls of the ravine), the ball's existing velocity partially cancels the oscillation. When the gradient points consistently in one direction (down the valley), velocities accumulate, building speed.
The key physical concepts transfer directly to optimization: the ball's position corresponds to the parameters θ, the force pulling it downhill to the negative gradient −∇L(θ), its velocity to the accumulated update vector v, and friction to the per-step velocity decay (the fraction 1 − β lost each step).
Why this analogy matters:
This isn't just a pedagogical metaphor—it's the actual derivation. Momentum SGD is a discrete-time approximation of the physics equation:
$$m\frac{d^2 x}{dt^2} = -\nabla L(x) - \gamma \frac{dx}{dt}$$
Where m is mass, γ is friction, and ∇L(x) is the loss gradient. The momentum coefficient β controls the ratio of inertia to friction, determining how much the ball "remembers" its previous velocity versus responding to current gradients.
When evaluating any optimizer, imagine a long, narrow ravine. Vanilla gradient descent oscillates. Momentum builds velocity along the valley floor while canceling oscillations against the walls. Every adaptive optimizer must solve this fundamental problem.
Let's formalize momentum optimization with precise mathematical notation. Given parameters θ, loss function L(θ), learning rate α, and momentum coefficient β ∈ [0, 1):
Classical Momentum (Polyak's Heavy Ball Method):
$$v_t = \beta v_{t-1} + \nabla L(\theta_{t-1})$$ $$\theta_t = \theta_{t-1} - \alpha v_t$$
Here, v_t is the velocity vector, initialized to zero. The first equation updates velocity: multiply previous velocity by β (preserving momentum), then add the current gradient. The second equation updates parameters: move in the direction of velocity, scaled by the learning rate.
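As a minimal sketch (the variable names are illustrative, not from any particular framework), the two equations translate directly into a two-line update loop:

```python
import numpy as np

def momentum_step(theta, v, grad, alpha=0.01, beta=0.9):
    """One step of classical (heavy ball) momentum:
        v     <- beta * v + grad
        theta <- theta - alpha * v
    """
    v = beta * v + grad        # accumulate velocity
    theta = theta - alpha * v  # move along the velocity
    return theta, v

# Toy usage on L(θ) = ½‖θ‖², whose gradient is θ itself
theta = np.array([5.0, -3.0])
v = np.zeros_like(theta)       # velocity starts at zero
for _ in range(100):
    theta, v = momentum_step(theta, v, grad=theta)
print(theta)                   # approaches the minimum at the origin
```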
Alternative Formulation (with gradient scaling):
Some frameworks use a slightly different formulation:
$$v_t = \beta v_{t-1} + \alpha \nabla L(\theta_{t-1})$$ $$\theta_t = \theta_{t-1} - v_t$$
This is mathematically equivalent after absorbing α into the velocity definition. The choice affects hyperparameter interpretation but not the algorithm's behavior.
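A quick numerical check (a hypothetical sketch, not framework code) confirms this: with a fixed learning rate, the second formulation's velocity is exactly α times the first's, so the parameter trajectories coincide.

```python
import numpy as np

alpha, beta = 0.1, 0.9
grad = lambda theta: 2 * (theta - 3.0)   # gradient of (θ - 3)²

theta1, v1 = 0.0, 0.0   # Formulation 1: v = βv + g,  θ -= αv
theta2, v2 = 0.0, 0.0   # Formulation 2: v = βv + αg, θ -= v

for _ in range(20):
    v1 = beta * v1 + grad(theta1)
    theta1 -= alpha * v1

    v2 = beta * v2 + alpha * grad(theta2)
    theta2 -= v2

print(np.isclose(theta1, theta2))   # True: identical trajectories
print(np.isclose(alpha * v1, v2))   # True: velocities differ only by the factor α
```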
Expanding the recursion reveals the exponential averaging:
```python
# Unrolling the velocity update shows exponential weighting:
# v_t = β v_{t-1} + g_t
# v_t = β (β v_{t-2} + g_{t-1}) + g_t
# v_t = β² v_{t-2} + β g_{t-1} + g_t
# ...continuing...
# v_t = β^t v_0 + Σ_{i=0}^{t-1} β^i g_{t-i}

# Since v_0 = 0:
# v_t = g_t + β g_{t-1} + β² g_{t-2} + β³ g_{t-3} + ...

# This is an exponentially weighted average of all past gradients!
# Recent gradients have weight 1
# Older gradients decay exponentially with factor β

# Example with β = 0.9:
# g_t:       weight = 1.0
# g_{t-1}:   weight = 0.9
# g_{t-2}:   weight = 0.81
# g_{t-3}:   weight = 0.729
# g_{t-10}:  weight = 0.349
# g_{t-100}: weight ≈ 0.000027 (negligible)

# Effective window of influence:
# For β = 0.9,  ~90% of the weight is on the last 22 gradients
# For β = 0.99, ~90% of the weight is on the last 229 gradients
```

Interpretation as Exponential Moving Average:
The velocity v_t is an exponentially weighted moving average (EMA) of past gradients. This has profound implications:
Noise reduction: Random noise in stochastic gradients has inconsistent signs. The EMA cancels this noise while preserving the consistent signal.
Direction stabilization: If gradients consistently point in one direction, the EMA accumulates, increasing the effective step size. If gradients oscillate, the EMA dampens.
Implicit learning rate adaptation: The effective step in any direction is proportional to how consistently the gradients point that way.
The critical relationship between β and effective step count:
The "half-life" of the momentum—how many steps until a gradient's influence is halved—is:
$$\text{half-life} = \frac{\ln(0.5)}{\ln(\beta)} = \frac{-0.693}{\ln(\beta)}$$
For β = 0.9, half-life ≈ 6.6 steps. For β = 0.99, half-life ≈ 69 steps.
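These numbers are easy to reproduce; the sketch below (the helper names are ours) computes the half-life and the window holding ~90% of the weight for a few β values:

```python
import numpy as np

def half_life(beta):
    """Steps until a gradient's weight in the velocity is halved."""
    return np.log(0.5) / np.log(beta)

def window_90(beta):
    """Most-recent gradients holding ~90% of the total weight.

    The last k gradients carry a fraction 1 - beta**k of the weight,
    so we solve 1 - beta**k = 0.9 for k.
    """
    return np.log(0.1) / np.log(beta)

for beta in (0.5, 0.9, 0.99):
    print(f"β={beta}: half-life ≈ {half_life(beta):.1f} steps, "
          f"90% window ≈ {window_90(beta):.0f} steps")
# β=0.9  -> half-life ≈ 6.6,  90% window ≈ 22
# β=0.99 -> half-life ≈ 69,   90% window ≈ 229
```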
Understanding why momentum accelerates convergence requires analyzing the algorithm's behavior on well-understood loss surfaces. The canonical test case is a quadratic loss function, where we can derive exact convergence rates.
Quadratic Analysis:
Consider minimizing L(θ) = ½θᵀHθ where H is a positive definite Hessian. The eigenvalues λ₁ ≤ λ₂ ≤ ... ≤ λₙ of H determine the curvature in each eigendirection.
Condition number: κ = λₙ/λ₁
This ratio measures how "stretched" the loss surface is. A high condition number means a ravine—steep in some directions, shallow in others.
Vanilla gradient descent convergence:
The optimal learning rate is α* = 2/(λ₁ + λₙ), giving convergence rate:
$$\rho_{\text{GD}} = \frac{\kappa - 1}{\kappa + 1}$$
For κ = 100, ρ_GD ≈ 0.98, meaning each step reduces error by only 2%.
Momentum convergence:
With optimal hyperparameters α* and β*, momentum achieves:
$$\rho_{\text{momentum}} = \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}$$
For κ = 100, ρ_momentum ≈ 0.818, reducing error by ~18% per step—a 9× speedup!
| Condition Number κ | GD Rate ρ | GD Error Reduction | Momentum Rate ρ | Momentum Error Reduction | Speedup Factor |
|---|---|---|---|---|---|
| 10 | 0.818 | 18.2% | 0.519 | 48.1% | ~2.6× |
| 100 | 0.980 | 2.0% | 0.818 | 18.2% | ~9× |
| 1,000 | 0.998 | 0.2% | 0.939 | 6.1% | ~30× |
| 10,000 | 0.9998 | 0.02% | 0.980 | 2.0% | ~100× |
The square root phenomenon:
The key insight is that momentum replaces the dependence on the condition number κ with a dependence on its square root √κ. This is profound: a problem that would take on the order of 10,000 iterations with GD takes on the order of 100 with momentum.
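The per-step rates translate into iteration counts directly; this short sketch (our own illustration, using the two rate formulas above) computes how many steps each method needs to shrink the error by a factor of 10⁶:

```python
import numpy as np

def steps_to_reduce(rho, factor=1e6):
    """Iterations k needed so that rho**k <= 1/factor."""
    return int(np.ceil(np.log(factor) / -np.log(rho)))

for kappa in (10, 100, 1_000, 10_000):
    rho_gd = (kappa - 1) / (kappa + 1)
    rho_mom = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
    print(f"κ={kappa:>6}: GD needs {steps_to_reduce(rho_gd):>7} steps, "
          f"momentum needs {steps_to_reduce(rho_mom):>5} steps")
# The GD count grows roughly linearly in κ; the momentum count grows like √κ.
```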
Optimal hyperparameters for quadratics:
$$\alpha^* = \left(\frac{2}{\sqrt{\lambda_1} + \sqrt{\lambda_n}}\right)^2$$
$$\beta^* = \left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^2$$
Note that optimal β approaches 1 as the condition number increases. This explains the common practice of using high momentum values (0.9-0.99) in deep learning, where loss surfaces often have high condition numbers.
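Under the same assumptions (a quadratic loss with known extreme eigenvalues), the optimal settings can be computed directly; this is a sketch with made-up eigenvalues:

```python
import numpy as np

def optimal_heavy_ball(lambda_min, lambda_max):
    """Optimal α*, β* for Polyak's heavy ball on a quadratic."""
    kappa = lambda_max / lambda_min
    alpha = (2 / (np.sqrt(lambda_min) + np.sqrt(lambda_max))) ** 2
    beta = ((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)) ** 2
    return alpha, beta

# Hypothetical curvatures: shallow direction λ = 1, steep direction λ = 100
alpha_star, beta_star = optimal_heavy_ball(1.0, 100.0)
print(f"α* = {alpha_star:.4f}, β* = {beta_star:.3f}")
# κ = 100 gives α* ≈ 0.033 and β* ≈ 0.669 — and β* → 1 as κ grows
```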
Beyond quadratics:
Real neural network loss surfaces aren't quadratic, but the intuition transfers:
Plugging into the formula for β*, a value of 0.9 is optimal when the condition number is on the order of a thousand (for κ = 100 the optimum is closer to β ≈ 0.67), which is why it works well as a default on the ill-conditioned surfaces common in machine learning. For very ill-conditioned problems (deep networks, recurrent networks), β = 0.99 is sometimes used. Values well below 0.9 sacrifice acceleration; values above 0.99 risk overshooting and instability.
To truly understand momentum, we need to visualize what happens geometrically in different loss landscape scenarios.
Scenario 1: The Ravine (High Condition Number)
Imagine a loss surface shaped like a long, narrow valley—a paraboloid stretched in one direction. The gradient at any point has two components: a large component pointing across the valley toward the steep walls, and a small component pointing along the shallow valley floor toward the minimum.
Without momentum: Each step bounces toward a wall, then back. The valley-floor component makes tiny progress each iteration. Convergence is painfully slow.
With momentum: The wall-pointing components oscillate in sign and cancel in the velocity. The valley-floor components have consistent sign and accumulate. The ball accelerates along the valley while its wall-to-wall oscillations are damped.
Scenario 2: The Plateau (Low Gradient Region)
On flat regions of the loss surface, gradients are small. Without momentum, progress stalls. With momentum, velocity persists from previous steps, carrying the optimizer across the plateau.
Scenario 3: The Saddle Point
Saddle points have zero gradient but aren't minima—the surface curves up in some directions and down in others. Vanilla GD stops at saddle points (zero gradient = zero update). Momentum, with its accumulated velocity, can escape saddle points by continuing in the direction it was already moving.
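As a toy illustration (our own example): start just off the saddle of f(x, y) = x² − y² and count how many steps each method needs before |y| exceeds 1. Vanilla GD drifts away only slowly (and only because the start is slightly off the saddle's stable manifold), while momentum's accumulated velocity carries it out much sooner.

```python
import numpy as np

def steps_to_escape(beta, alpha=0.05, start=(1.0, 1e-3), max_steps=500):
    """Steps until |y| > 1 on the saddle f(x, y) = x² - y²."""
    theta = np.array(start)
    v = np.zeros(2)
    for step in range(1, max_steps + 1):
        grad = np.array([2 * theta[0], -2 * theta[1]])  # ∇f
        v = beta * v + grad
        theta = theta - alpha * v
        if abs(theta[1]) > 1.0:
            return step
    return max_steps

print("vanilla escape steps :", steps_to_escape(beta=0.0))   # slow drift away
print("momentum escape steps:", steps_to_escape(beta=0.9))   # escapes far sooner
```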
Scenario 4: Sharp Minima vs Flat Minima
There's evidence that momentum prefers flat minima over sharp minima. Sharp minima have high curvature, so a ball carrying velocity tends to overshoot and oscillate out of such narrow basins rather than settle into them. Flat minima have low curvature, so the ball slows down and settles smoothly. This may contribute to better generalization.
The velocity-gradient angle:
A useful diagnostic is the angle between velocity and current gradient. In smooth optimization, this angle should be small (velocity aligns with gradient). In oscillatory regimes, this angle becomes large (velocity fights gradient). Tracking this angle can reveal optimization issues.
```python
import numpy as np


def momentum_trajectory_2d(f, grad_f, x0, alpha, beta, n_steps):
    """
    Visualize momentum SGD trajectory on a 2D loss surface.

    Args:
        f: Loss function f(x, y) -> scalar
        grad_f: Gradient function grad_f(x, y) -> (dx, dy)
        x0: Initial point (x, y)
        alpha: Learning rate
        beta: Momentum coefficient
        n_steps: Number of optimization steps

    Returns:
        trajectory: List of (x, y) positions
        velocities: List of velocity vectors
    """
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)  # Initialize velocity to zero

    trajectory = [x.copy()]
    velocities = [v.copy()]

    for _ in range(n_steps):
        g = np.array(grad_f(x[0], x[1]))

        # Compute velocity-gradient angle (useful diagnostic)
        if np.linalg.norm(v) > 1e-10 and np.linalg.norm(g) > 1e-10:
            cos_angle = np.dot(v, g) / (np.linalg.norm(v) * np.linalg.norm(g))
            angle = np.arccos(np.clip(cos_angle, -1, 1)) * 180 / np.pi
            # Large angle indicates oscillation

        # Momentum update
        v = beta * v + g       # Update velocity
        x = x - alpha * v      # Update position

        trajectory.append(x.copy())
        velocities.append(v.copy())

    return trajectory, velocities


# Example: a quadratic ravine
def ravine_loss(x, y):
    """A ravine-shaped loss: steep in x, shallow in y."""
    return 100 * x**2 + y**2


def ravine_grad(x, y):
    return (200 * x, 2 * y)


# Compare vanilla GD (beta=0) vs momentum (beta=0.9)
x0 = [1.0, 1.0]
alpha = 0.009  # large enough that vanilla GD overshoots across the steep direction

traj_vanilla, _ = momentum_trajectory_2d(
    ravine_loss, ravine_grad, x0, alpha, beta=0.0, n_steps=100
)
traj_momentum, _ = momentum_trajectory_2d(
    ravine_loss, ravine_grad, x0, alpha, beta=0.9, n_steps=100
)

# Vanilla oscillates across the walls and crawls along the valley floor;
# momentum damps the oscillation and reaches the minimum far faster
print(f"Vanilla final position: {traj_vanilla[-1]}")
print(f"Momentum final position: {traj_momentum[-1]}")
```

Implementing momentum correctly requires attention to several practical details that aren't obvious from the mathematical formulation.
Memory Requirements:
Momentum requires storing one velocity vector v for each parameter. For a model with N parameters, that means N additional floats of optimizer state alongside the N parameter values—roughly doubling the parameter-related memory. Most frameworks store this buffer efficiently alongside the parameters.
Initialization:
The standard approach initializes velocity to zero: v₀ = 0. This means the first step is identical to vanilla gradient descent—no momentum contribution. The momentum effect builds up over the first 1/(1-β) steps.
Some practitioners use "warmup" strategies, starting with low momentum and increasing it, to avoid initial instability.
Numerical Considerations:
With high β values (0.99+), velocity can grow large. Combined with large gradients, this can cause numerical overflow. Framework implementations typically use float32, which handles this well, but monitoring velocity norms is wise for debugging.
```python
import numpy as np
from typing import Dict


class MomentumSGD:
    """
    Production-quality Momentum SGD optimizer.

    Implements classical momentum (Polyak's heavy ball method):
        v_t = β * v_{t-1} + ∇L(θ)
        θ_t = θ_{t-1} - α * v_t
    """

    def __init__(
        self,
        learning_rate: float = 0.01,
        momentum: float = 0.9,
        weight_decay: float = 0.0,
        nesterov: bool = False,  # Preview: covered in next page
    ):
        """
        Args:
            learning_rate: Step size α
            momentum: Momentum coefficient β ∈ [0, 1)
            weight_decay: L2 regularization coefficient
            nesterov: Whether to use Nesterov momentum
        """
        if not 0.0 <= momentum < 1.0:
            raise ValueError(f"Momentum must be in [0, 1), got {momentum}")
        if learning_rate <= 0:
            raise ValueError(f"Learning rate must be positive, got {learning_rate}")

        self.lr = learning_rate
        self.momentum = momentum
        self.weight_decay = weight_decay
        self.nesterov = nesterov

        # State: velocity buffer for each parameter
        self.velocities: Dict[str, np.ndarray] = {}
        self.steps = 0

    def step(
        self,
        parameters: Dict[str, np.ndarray],
        gradients: Dict[str, np.ndarray],
    ) -> Dict[str, np.ndarray]:
        """
        Perform one optimization step.

        Args:
            parameters: Dict mapping param names to parameter arrays
            gradients: Dict mapping param names to gradient arrays

        Returns:
            Updated parameters dict
        """
        self.steps += 1
        updated_params = {}

        for name, param in parameters.items():
            if name not in gradients:
                updated_params[name] = param
                continue

            grad = gradients[name]

            # Apply weight decay (L2 regularization)
            if self.weight_decay > 0:
                grad = grad + self.weight_decay * param

            # Initialize velocity if needed
            if name not in self.velocities:
                self.velocities[name] = np.zeros_like(param)

            v = self.velocities[name]

            # Momentum update: v = β * v + gradient
            v = self.momentum * v + grad
            self.velocities[name] = v

            # Parameter update
            if self.nesterov:
                # Nesterov: use "lookahead" gradient
                # θ = θ - α * (β * v + gradient)
                updated_params[name] = param - self.lr * (self.momentum * v + grad)
            else:
                # Classical: θ = θ - α * v
                updated_params[name] = param - self.lr * v

        return updated_params

    def reset(self):
        """Clear velocity state (useful for fine-tuning)."""
        self.velocities = {}
        self.steps = 0

    def get_velocity_norms(self) -> Dict[str, float]:
        """Monitor velocity magnitudes for debugging."""
        return {name: np.linalg.norm(v) for name, v in self.velocities.items()}


# Example usage
if __name__ == "__main__":
    # Simple quadratic optimization demo
    optimizer = MomentumSGD(learning_rate=0.1, momentum=0.9)

    # Parameters: we want to minimize (x-3)² + (y-2)²
    params = {"theta": np.array([0.0, 0.0])}
    target = np.array([3.0, 2.0])

    for step in range(50):
        # Gradient of (θ - target)²
        grads = {"theta": 2 * (params["theta"] - target)}
        params = optimizer.step(params, grads)

        if step % 10 == 0:
            loss = np.sum((params["theta"] - target) ** 2)
            print(f"Step {step}: θ = {params['theta']}, loss = {loss:.6f}")
```

Different frameworks implement momentum with subtle variations. PyTorch's SGD uses v = β·v + gradient (then θ -= lr·v), while TensorFlow/Keras may use v = β·v + lr·gradient (then θ -= v). The two are equivalent for a fixed learning rate, but hyperparameters from one framework may not directly transfer to another. Always check the documentation.
Momentum SGD has two primary hyperparameters—learning rate α and momentum coefficient β—with a complex interaction that requires careful tuning.
The Learning Rate (α):
With momentum, the effective step size is larger than with vanilla GD: for a roughly constant gradient g, the velocity saturates at g/(1 − β), so the effective step is about α/(1 − β) times the gradient. As a rule of thumb, when adding momentum β = 0.9 to a setup tuned for vanilla GD, reduce the learning rate by up to a factor of 10 to keep the effective step comparable.
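The saturation at g/(1 − β) is easy to see numerically; the short sketch below (illustrative only) feeds a constant gradient into the velocity recursion:

```python
beta = 0.9
g = 1.0   # pretend the gradient is constant
v = 0.0
for step in range(1, 101):
    v = beta * v + g
    if step in (1, 10, 50, 100):
        print(f"step {step:>3}: velocity = {v:.3f}")
# The velocity approaches g / (1 - beta) = 10, so each parameter update is
# effectively 10× the vanilla-GD step — one reason to shrink α when adding momentum.
```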
The Momentum Coefficient (β):
Higher β means stronger momentum effect:
| β Value | Effective Window | Use Case | Risks |
|---|---|---|---|
| 0.0 | 1 step | Baseline/debugging | Slow on ill-conditioned problems |
| 0.5 | ~2 steps | Mildly ill-conditioned | May not help much |
| 0.9 | ~10 steps | Default for most problems | Standard choice |
| 0.95 | ~20 steps | Deep networks, RNNs | Requires careful LR tuning |
| 0.99 | ~100 steps | Very ill-conditioned | Can overshoot, slow to adapt |
The α-β Tradeoff:
There's an inherent tradeoff: higher momentum allows faster convergence on smooth paths, but increases the risk of overshooting and oscillation at minima. A common strategy is to tune α and β jointly so that the effective step size α/(1 − β) stays roughly constant—for example, when raising β from 0.9 to 0.95, halve the learning rate.
Momentum Warmup:
Some practitioners start with low momentum (β = 0.5) and increase to the target value (β = 0.9) over the first few epochs. This prevents the initial "cold start" where zero velocity makes the first updates behave like vanilla GD.
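A warmup schedule can be as simple as a linear ramp on β over the first few hundred steps; the sketch below is one hypothetical way to implement it (the schedule shape and step counts are arbitrary choices, not a standard API):

```python
def momentum_schedule(step, warmup_steps=500, beta_start=0.5, beta_end=0.9):
    """Linearly ramp the momentum coefficient from beta_start to beta_end."""
    if step >= warmup_steps:
        return beta_end
    frac = step / warmup_steps
    return beta_start + frac * (beta_end - beta_start)

# Hypothetical training-loop excerpt (optimizer as in the class above):
# for step in range(total_steps):
#     optimizer.momentum = momentum_schedule(step)
#     params = optimizer.step(params, grads)
print(momentum_schedule(0), momentum_schedule(250), momentum_schedule(1000))
# -> 0.5 0.7 0.9
```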
Interaction with Batch Size:
Larger batches produce less noisy gradients. With low noise, momentum has fewer oscillations to dampen, so its benefit decreases. Very large batch training often uses lower momentum or other techniques (LARS, LAMB) that we'll cover later.
Start with the defaults: lr=0.1, momentum=0.9 for image tasks; lr=0.01, momentum=0.9 for language tasks. If training diverges, halve the learning rate. If training is too slow, try increasing momentum to 0.95 while halving the learning rate. Monitor the loss curve: good momentum shows smooth descent; bad momentum shows oscillation or plateau.
Momentum is not universally beneficial. Understanding when it helps—and when it hurts—is essential for effective optimization.
The Sparsity Problem:
In models with sparse gradients (embeddings, attention patterns), many parameters receive zero gradients on most steps. Momentum accumulates all past gradients, including the zeros: between updates, a sparse parameter keeps drifting on decaying, stale velocity, and when it finally receives a fresh gradient, that signal is blended with the leftover velocity rather than acted on directly.
This is one motivation for per-parameter adaptive methods (AdaGrad, Adam) covered later.
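To make the issue concrete, here is a toy trace (entirely illustrative) of a parameter that receives a gradient only once every 20 steps: the momentum update keeps moving it between the rare updates, driven purely by stale velocity.

```python
alpha, beta = 0.1, 0.9
theta, v = 0.0, 0.0
moves_without_signal = 0

for step in range(100):
    g = 1.0 if step % 20 == 0 else 0.0   # sparse gradient: mostly zeros
    v = beta * v + g
    update = alpha * v
    theta -= update
    if g == 0.0 and abs(update) > 1e-6:
        moves_without_signal += 1

print(f"steps with a real gradient   : {100 // 20}")
print(f"steps moved on stale velocity: {moves_without_signal}")
```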
Online Learning Caution:
In online learning, the data distribution may shift over time. Momentum averages old gradients, which reflect the old distribution. If the distribution changes faster than 1/(1-β) steps, momentum actively hurts by pulling toward outdated optima. Use lower momentum (β ≤ 0.5) or none for online settings.
Near-Convergence Behavior:
As training approaches the minimum, the gradient becomes small. But momentum carries residual velocity from previous steps, causing the optimizer to overshoot and oscillate around the minimum. This is why learning rate schedules are important—reducing α near the end mitigates this oscillation.
Momentum represents a fundamental insight in optimization: past gradients contain valuable information. By accumulating an exponentially weighted average of gradients into a velocity vector, momentum achieves faster convergence on ill-conditioned problems (the dependence on the condition number drops from κ to √κ), damping of oscillations in ravines, smoothing of stochastic gradient noise, and the ability to carry the optimizer across plateaus and past saddle points.
The momentum coefficient β = 0.9 has become a standard default in deep learning, providing substantial acceleration on typical neural network loss surfaces while remaining stable with appropriate learning rates.
What's Next:
Classical momentum has a subtle inefficiency: it computes the gradient at the current position, then updates velocity. But what if we could compute the gradient at the position we're about to move to? This "lookahead" idea leads to Nesterov momentum, which further accelerates convergence with a simple modification. The next page explores this elegant improvement.
You now understand momentum optimization from the ground up: the physical intuition, the mathematical formulation, the convergence theory, and the practical implementation. Next, we'll see how Nesterov's insight takes this further with a simple but powerful modification.