In 2012, in an unpublished lecture slide for his Coursera course on neural networks, Geoffrey Hinton introduced RMSprop—a simple but transformative fix to AdaGrad's accumulation problem.
AdaGrad's issue was clear: it sums squared gradients forever, so learning rates only decrease, eventually becoming too small for continued learning. The solution, in hindsight, is obvious: make it forget.
Instead of accumulating all past squared gradients, RMSprop maintains an exponential moving average (EMA) of squared gradients. Recent gradients matter more than ancient ones. The denominator can now both grow (if current gradients are large) and shrink (if current gradients are small). The learning rate adapts to the current gradient regime, not the entire history.
This single change—replacing a sum with an EMA—unlocks adaptive optimization for deep learning.
RMSprop is one of those rare ideas that seems trivially simple yet has profound impact. It's the bridge between AdaGrad's theoretical elegance and Adam's practical dominance.
By the end of this page, you will understand RMSprop's mechanics, the mathematical properties of exponential moving averages, the decay rate hyperparameter's effect, and why RMSprop enabled the deep learning revolution before Adam arrived.
Let's trace the key insight that leads from AdaGrad to RMSprop.
AdaGrad's Accumulator:
$$G_t = G_{t-1} + g_t^2 = \sum_{\tau=1}^{t} g_\tau^2$$
This treats all historical gradients equally. A gradient from step 1 contributes as much to G₁₀₀₀₀ as a gradient from step 9999.
Why This Is Wrong for Deep Learning:
Non-stationary optimization: The loss landscape changes as we move. What mattered at step 100 may be irrelevant at step 10,000.
Curvature changes: The local Hessian (curvature) varies across the landscape. A parameter might need large updates in one region and small updates in another.
Phase changes in training: Early training explores; late training fine-tunes. The appropriate learning rate differs.
The Fix: Exponential Forgetting
Instead of summing, average with exponential decay:
$$v_t = \rho \cdot v_{t-1} + (1 - \rho) \cdot g_t^2$$
where ρ ∈ (0, 1) is the decay rate (typically 0.9 or 0.99).
This is an exponential moving average (EMA) of squared gradients. Old values decay exponentially:
After k = ln(0.5)/ln(ρ) = -0.693/ln(ρ) steps, the contribution halves.
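As a quick worked example, for ρ = 0.9 the half-life is

$$k_{1/2} = \frac{\ln 0.5}{\ln 0.9} \approx \frac{-0.693}{-0.105} \approx 6.6 \text{ steps},$$

which matches the ~7-step entry in the table below.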
| Decay Rate ρ | Half-Life (steps) | Steps to 90% of Weight | Use Case |
|---|---|---|---|
| 0.9 | ~7 steps | ~22 steps | Fast adaptation, small batches |
| 0.99 | ~69 steps | ~229 steps | Slower adaptation, more stability |
| 0.999 | ~693 steps | ~2,300 steps | Very slow adaptation, large batches |
Why EMA Solves the Problem:
Bounded accumulator: v_t stays in the range of recent squared gradients, never growing without bound.
Adaptable learning rate: If gradients shrink (entering flat region), v_t shrinks too, increasing the effective learning rate.
Local sensitivity: The accumulator reflects current conditions, not ancient history.
Stability: The averaging still smooths out noise, just over a finite window.
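A tiny sketch of the "adaptable learning rate" point (the gradient values are arbitrary illustrations): when gradients shrink, the EMA shrinks and the effective learning rate recovers, something AdaGrad's sum can never do.

```python
import numpy as np

# Gradients drop from ~1.0 to ~0.01 at step 100; watch the effective LR recover.
alpha, rho, eps = 0.001, 0.9, 1e-8
v = 0.0
for t in range(200):
    g = 1.0 if t < 100 else 0.01
    v = rho * v + (1 - rho) * g ** 2
    if t in (99, 110, 140, 199):
        print(f"t={t:3d}  g={g:5.2f}  sqrt(v)={np.sqrt(v):.4f}  eff_lr={alpha / (np.sqrt(v) + eps):.4f}")

# The effective LR climbs back toward alpha / 0.01 as the old, large gradients are
# forgotten; with AdaGrad the accumulated sum would keep the LR pinned low forever.
```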
RMSprop = Root Mean Square Propagation. The 'mean square' refers to the EMA of squared gradients—an approximation to the mean squared gradient. The denominator √v_t is the 'root mean square' (RMS) of recent gradients, which normalizes the update.
Let's formalize RMSprop precisely.
RMSprop Update Rule:
For parameters θ, learning rate α, decay rate ρ, and stability constant ε:
$$g_t = \nabla L(\theta_{t-1})$$
$$v_t = \rho \cdot v_{t-1} + (1 - \rho) \cdot g_t^2$$
$$\theta_t = \theta_{t-1} - \frac{\alpha}{\sqrt{v_t} + \epsilon} \cdot g_t$$
Where all operations are element-wise.
Component Analysis:
v_t: Exponential moving average of squared gradients. Tracks the "typical" squared gradient magnitude for each parameter.
√v_t: Root mean square (RMS) of recent gradients. Approximates local gradient scale.
α/(√v_t + ε): Effective learning rate, inversely proportional to RMS gradient.
α/(√v_t + ε) × g_t: Update step. Large gradient history → small step. Small gradient history → large step.
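To make the scaling concrete, here is a small worked example with α = 0.001 (the numbers are purely illustrative):

$$\text{Parameter A: } g_t \approx 10,\; v_t \approx 100 \;\Rightarrow\; \frac{\alpha}{\sqrt{v_t} + \epsilon}\, g_t \approx \frac{0.001}{10} \cdot 10 = 0.001$$

$$\text{Parameter B: } g_t \approx 0.01,\; v_t \approx 10^{-4} \;\Rightarrow\; \frac{\alpha}{\sqrt{v_t} + \epsilon}\, g_t \approx \frac{0.001}{0.01} \cdot 0.01 = 0.001$$

Despite gradient magnitudes differing by a factor of 1,000, both parameters move by roughly α per step: RMSprop equalizes update scales across parameters.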
Comparison to AdaGrad:
| Aspect | AdaGrad | RMSprop |
|---|---|---|
| Accumulator | Sum: G_t = Σg²_τ | EMA: v_t = ρv_{t-1} + (1-ρ)g²_t |
| Growth | Unbounded (→ ∞) | Bounded (≈ recent gradient scale) |
| Learning rate | Only decreases | Can increase or decrease |
| Sensitivity | All history equally | Recent history weighted more |
| Use case | Sparse/convex | Deep learning |
```python
import numpy as np


def rmsprop_step(theta, gradient, v, alpha=0.001, rho=0.9, epsilon=1e-8):
    """
    Single RMSprop update step.

    Args:
        theta: Parameter vector
        gradient: Current gradient
        v: Running EMA of squared gradients (modified in-place)
        alpha: Base learning rate
        rho: Decay rate for EMA (typically 0.9)
        epsilon: Numerical stability constant

    Returns:
        Updated theta
    """
    # Update EMA of squared gradients
    v[:] = rho * v + (1 - rho) * gradient ** 2

    # Compute adapted update
    # Large v[i] -> small step for parameter i
    # Small v[i] -> large step for parameter i
    update = alpha / (np.sqrt(v) + epsilon) * gradient

    return theta - update


# Demonstration: RMSprop vs AdaGrad over long training
def compare_accumulator_growth(steps=10000):
    """Compare how accumulator/EMA evolves over time."""
    gradient_scale = 1.0

    # AdaGrad accumulator
    G_adagrad = 0.0
    adagrad_lrs = []

    # RMSprop EMA
    v_rmsprop = 0.0
    rho = 0.9
    rmsprop_lrs = []

    alpha = 1.0  # Base LR for comparison

    for t in range(1, steps + 1):
        g = np.random.randn() * gradient_scale

        # AdaGrad: accumulate
        G_adagrad += g ** 2
        adagrad_lrs.append(alpha / (np.sqrt(G_adagrad) + 1e-8))

        # RMSprop: EMA
        v_rmsprop = rho * v_rmsprop + (1 - rho) * g ** 2
        rmsprop_lrs.append(alpha / (np.sqrt(v_rmsprop) + 1e-8))

    return adagrad_lrs, rmsprop_lrs


adagrad_lrs, rmsprop_lrs = compare_accumulator_growth()

print("Effective learning rates over 10,000 steps:")
print("                  Step 100   Step 1000  Step 10000")
print(f"AdaGrad:          {adagrad_lrs[99]:.6f}   {adagrad_lrs[999]:.6f}   {adagrad_lrs[9999]:.6f}")
print(f"RMSprop (ρ=0.9):  {rmsprop_lrs[99]:.6f}   {rmsprop_lrs[999]:.6f}   {rmsprop_lrs[9999]:.6f}")

# RMSprop maintains roughly constant effective LR
# AdaGrad's effective LR decays as O(1/sqrt(t))
```

Steady-State Behavior:
For stationary, zero-mean gradients with variance σ², the EMA converges in expectation to:
$$\lim_{t \to \infty} \mathbb{E}[v_t] = \sigma^2$$
This means the effective learning rate stabilizes at:
$$\alpha_{\text{eff}} \approx \frac{\alpha}{\sigma}$$
The learning rate is automatically scaled by the inverse standard deviation of the gradients—exactly the adaptation we want! Parameters with noisy, high-variance gradients get smaller learning rates; parameters with stable, low-variance gradients get larger learning rates.
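A quick numerical check of this steady-state claim; the sketch below assumes zero-mean Gaussian gradients, and the σ, α, ρ values are arbitrary illustrations:

```python
import numpy as np

# For zero-mean gradients with std sigma, the EMA of g^2 settles near sigma^2,
# so the effective learning rate settles near alpha / sigma.
rng = np.random.default_rng(0)
sigma, alpha, rho, eps = 0.5, 0.001, 0.99, 1e-8

v = 0.0
for _ in range(5000):
    g = rng.normal(0.0, sigma)
    v = rho * v + (1 - rho) * g ** 2

print(f"v after 5000 steps: {v:.4f}   (sigma^2 = {sigma ** 2:.4f})")
print(f"effective LR:       {alpha / (np.sqrt(v) + eps):.5f}   (alpha / sigma = {alpha / sigma:.5f})")
```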
Understanding EMAs deeply is essential for understanding not just RMSprop, but all modern adaptive optimizers.
Definition:
For a sequence x₁, x₂, ..., the EMA with decay rate ρ is:
$$\text{EMA}_t = \rho \cdot \text{EMA}_{t-1} + (1 - \rho) \cdot x_t$$
Starting from EMA₀ = 0 (or some initial value).
Unrolling the Recursion:
$$\text{EMA}_t = (1-\rho)\sum_{k=0}^{t-1} \rho^k x_{t-k} + \rho^t \cdot \text{EMA}_0$$
The weights (1-ρ)ρᵏ sum to 1 (as t → ∞), making this a weighted average.
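This normalization is just the finite geometric series:

$$(1-\rho)\sum_{k=0}^{t-1} \rho^k = (1-\rho) \cdot \frac{1-\rho^t}{1-\rho} = 1 - \rho^t \;\longrightarrow\; 1 \quad (t \to \infty)$$

At finite t, the missing mass ρᵗ sits on the initial value EMA₀, which is exactly the source of the cold-start bias discussed below.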
Key Properties:
Bounded Memory: The effective window is ≈ 1/(1-ρ) steps. For ρ = 0.9, about 10 steps; for ρ = 0.99, about 100 steps.
Bias Toward Zero: Early in training, EMA_t is biased toward zero because the initial EMA₀ = 0 hasn't been "flushed out" yet. This creates a cold-start problem.
Lag: The EMA lags behind sudden changes in the sequence. If gradients suddenly increase, the EMA takes ~1/(1-ρ) steps to catch up.
Smoothing: High ρ means more smoothing (less variance, more lag). Low ρ means less smoothing (more variance, faster response).
```python
import numpy as np


def demonstrate_ema_properties():
    """Visualize key EMA properties."""

    # 1. Bias toward zero (cold start problem)
    print("=== Cold Start Bias ===")
    x = 1.0  # Constant input
    rho = 0.9
    ema = 0.0
    for t in range(1, 21):
        ema = rho * ema + (1 - rho) * x
        print(f"Step {t:2d}: EMA = {ema:.4f}, True mean = 1.0, Bias = {1.0 - ema:.4f}")
    # Takes ~22 steps to get within 10% of the true value!
    # This motivates "bias correction" (used in Adam)

    # 2. Effective window demonstration
    print("=== Effective Window ===")
    print("Weight of past values in EMA (ρ=0.9):")
    for k in range(10):
        weight = (1 - 0.9) * (0.9 ** k)
        cumulative = 1 - (0.9 ** (k + 1))
        print(f"  x[t-{k}]: weight = {weight:.4f}, cumulative = {cumulative:.4f}")

    # 3. Response to sudden change
    print("=== Response to Sudden Change ===")
    sequence = [1.0] * 50 + [10.0] * 50  # Jump from 1 to 10 at t=50
    rho = 0.9
    ema = 0.0
    for t, x in enumerate(sequence):
        ema = rho * ema + (1 - rho) * x
        if t in [48, 49, 50, 51, 55, 60, 70, 99]:
            print(f"Step {t}: x = {x:.1f}, EMA = {ema:.2f}")
    # EMA takes ~10 steps to mostly catch up to the new value


demonstrate_ema_properties()
```

Bias Correction:
The cold-start bias is significant. Adam (covered next page) addresses this with bias correction:
$$\hat{v}_t = \frac{v_t}{1 - \rho^t}$$
Early in training (small t), this correction is substantial. As t grows, ρᵗ → 0 and the correction vanishes. RMSprop in its original form doesn't include bias correction, which can cause issues in early training.
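A brief sketch of what that correction does during the cold start (the corrected quantity is what Adam uses; the constant input is just for illustration):

```python
# Compare a raw EMA against its bias-corrected version on a constant input.
rho = 0.9
x = 1.0      # constant input with true mean 1.0
v = 0.0

for t in range(1, 11):
    v = rho * v + (1 - rho) * x
    v_hat = v / (1 - rho ** t)   # bias-corrected estimate
    print(f"t={t:2d}  raw EMA = {v:.4f}  corrected = {v_hat:.4f}")

# The raw EMA creeps up from 0 toward 1.0, while the corrected estimate equals
# the true mean from the very first step for a constant input.
```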
Choosing ρ:
| ρ Value | Effective Window | Behavior |
|---|---|---|
| 0.9 | ~10 steps | Fast adaptation, noisier |
| 0.99 | ~100 steps | Balanced (common default) |
| 0.999 | ~1000 steps | Very smooth, slow adaptation |
For deep learning, ρ = 0.9 is common for smaller batches; ρ = 0.99 for larger batches or when stability is paramount.
An EMA with decay ρ is approximately equivalent to averaging over the last 1/(1-ρ) samples. For ρ=0.9, this is ~10 samples. For ρ=0.99, ~100 samples. But EMA is computationally cheaper: O(1) update vs O(window) for explicit averaging.
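A rough illustration of that equivalence; the step signal, noise level, and window size below are arbitrary choices:

```python
import numpy as np

# Compare an EMA (rho = 0.9) with an explicit 10-sample mean on a noisy step signal.
rng = np.random.default_rng(1)
signal = np.concatenate([np.ones(100), 3.0 * np.ones(100)])   # step change at t = 100
noisy = signal + 0.3 * rng.standard_normal(signal.shape)

rho, window = 0.9, 10
ema = np.zeros_like(noisy)
ema[0] = noisy[0]                      # start at the first sample to skip cold-start bias
for t in range(1, len(noisy)):
    ema[t] = rho * ema[t - 1] + (1 - rho) * noisy[t]

# Trailing 10-sample mean for comparison (O(window) work per step if done naively).
sliding = np.array([noisy[max(0, t - window + 1): t + 1].mean() for t in range(len(noisy))])

for t in [50, 99, 110, 150]:
    print(f"t={t:3d}  noisy={noisy[t]:5.2f}  ema={ema[t]:5.2f}  sliding10={sliding[t]:5.2f}")

# Both smooth the noise and track the jump on a similar ~10-step timescale, but the
# EMA carries a single number of state instead of the last `window` samples.
```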
A refinement of RMSprop addresses the fact that v_t tracks E[g²], not Var[g]. The difference matters when gradients have non-zero mean.
The Issue:
Standard RMSprop divides by √E[g²]. But we might want to divide by √Var[g] = √(E[g²] - E[g]²), which measures gradient variability rather than magnitude.
Consider: if a gradient is consistently 10 (low variance, high mean), standard RMSprop gives a small learning rate. But maybe we want a large learning rate since the direction is consistent!
Centered RMSprop:
Track both E[g²] and E[g]:
$$v_t = \rho \cdot v_{t-1} + (1 - \rho) \cdot g_t^2$$

$$\bar{g}_t = \rho \cdot \bar{g}_{t-1} + (1 - \rho) \cdot g_t$$

$$\tilde{v}_t = v_t - \bar{g}_t^2$$
Update using the variance estimate:
$$\theta_t = \theta_{t-1} - \frac{\alpha}{\sqrt{\tilde{v}_t} + \epsilon} \cdot g_t$$
Interpretation:
This separates "gradient magnitude" from "gradient consistency", allowing more nuanced adaptation.
```python
import numpy as np


def centered_rmsprop_step(theta, gradient, v, g_bar, alpha=0.001, rho=0.99, epsilon=1e-8):
    """
    Centered RMSprop: normalizes by variance instead of second moment.

    Args:
        theta: Parameters
        gradient: Current gradient
        v: EMA of squared gradients (E[g²])
        g_bar: EMA of gradients (E[g])
        alpha: Learning rate
        rho: Decay rate
        epsilon: Stability constant

    Returns:
        Updated theta
    """
    # Update EMAs
    v[:] = rho * v + (1 - rho) * gradient ** 2
    g_bar[:] = rho * g_bar + (1 - rho) * gradient

    # Variance estimate: Var[g] = E[g²] - E[g]²
    variance = v - g_bar ** 2

    # Ensure non-negative (numerical issues can make this slightly negative)
    variance = np.maximum(variance, 0)

    # Update with variance normalization
    update = alpha / (np.sqrt(variance) + epsilon) * gradient

    return theta - update


# Compare standard vs centered on biased gradients
def compare_variants():
    np.random.seed(42)

    theta_std = 0.0
    theta_ctr = 0.0
    v_std = 0.0
    v_ctr = 0.0
    g_bar = 0.0

    alpha = 0.1
    rho = 0.9

    # Gradient with large mean, small variance
    # Mean = 5, Std = 0.5
    for step in range(100):
        g = 5.0 + np.random.randn() * 0.5

        # Standard RMSprop
        v_std = rho * v_std + (1 - rho) * g ** 2
        update_std = alpha / (np.sqrt(v_std) + 1e-8) * g
        theta_std -= update_std

        # Centered RMSprop
        v_ctr = rho * v_ctr + (1 - rho) * g ** 2
        g_bar = rho * g_bar + (1 - rho) * g
        var = max(0, v_ctr - g_bar ** 2)
        update_ctr = alpha / (np.sqrt(var) + 1e-8) * g
        theta_ctr -= update_ctr

    print("After 100 steps with biased gradient (mean=5, std=0.5):")
    print(f"  Standard RMSprop: θ = {theta_std:.2f}")
    print(f"  Centered RMSprop: θ = {theta_ctr:.2f}")
    print("  Centered moved further (larger effective LR for consistent gradients)")


compare_variants()
```

Centered RMSprop requires tracking an additional EMA (g_bar), doubling optimizer state. It also has numerical edge cases when variance is near zero. In practice, standard RMSprop or Adam is usually preferred. Centered variants are useful when gradient bias is known to be significant.
Here's a complete, production-quality RMSprop implementation.
```python
import numpy as np
from typing import Dict


class RMSprop:
    """
    RMSprop (Root Mean Square Propagation) optimizer.

    Maintains an exponential moving average of squared gradients for
    adaptive per-parameter learning rate scaling.

    Update rule:
        v_t = ρ * v_{t-1} + (1-ρ) * g_t²
        θ_t = θ_{t-1} - α / (√v_t + ε) * g_t

    Key differences from AdaGrad:
    - Uses EMA instead of sum (bounded accumulator)
    - Learning rate can increase when gradients shrink
    - Suitable for long training runs

    Reference:
        Hinton, Srivastava, Swersky. "Neural Networks for Machine Learning",
        Lecture 6e (2012, unpublished).
    """

    def __init__(
        self,
        learning_rate: float = 0.01,
        rho: float = 0.9,
        epsilon: float = 1e-8,
        weight_decay: float = 0.0,
        momentum: float = 0.0,
        centered: bool = False,
    ):
        """
        Args:
            learning_rate: Base learning rate α
            rho: Decay rate for squared gradient EMA (typically 0.9 or 0.99)
            epsilon: Small constant for numerical stability
            weight_decay: L2 regularization coefficient
            momentum: Optional momentum term (not in original RMSprop)
            centered: If True, use centered variant (normalize by variance)
        """
        if not 0.0 < rho < 1.0:
            raise ValueError(f"rho must be in (0, 1), got {rho}")

        self.lr = learning_rate
        self.rho = rho
        self.eps = epsilon
        self.weight_decay = weight_decay
        self.momentum = momentum
        self.centered = centered

        # State
        self.v: Dict[str, np.ndarray] = {}       # EMA of squared gradients
        self.g_avg: Dict[str, np.ndarray] = {}   # EMA of gradients (centered)
        self.momentum_buffer: Dict[str, np.ndarray] = {}  # Momentum buffer
        self.steps = 0

    def step(
        self,
        parameters: Dict[str, np.ndarray],
        gradients: Dict[str, np.ndarray],
    ) -> Dict[str, np.ndarray]:
        """
        Perform one RMSprop step.

        Args:
            parameters: Dict mapping names to parameter arrays
            gradients: Dict mapping names to gradient arrays

        Returns:
            Updated parameters dict
        """
        self.steps += 1
        updated = {}

        for name, param in parameters.items():
            if name not in gradients:
                updated[name] = param
                continue

            grad = gradients[name].copy()

            # Apply weight decay
            if self.weight_decay > 0:
                grad = grad + self.weight_decay * param

            # Initialize state
            if name not in self.v:
                self.v[name] = np.zeros_like(param)
                if self.centered:
                    self.g_avg[name] = np.zeros_like(param)
                if self.momentum > 0:
                    self.momentum_buffer[name] = np.zeros_like(param)

            v = self.v[name]

            # Update EMA of squared gradients
            v[:] = self.rho * v + (1 - self.rho) * grad ** 2

            if self.centered:
                # Centered variant: normalize by variance
                g_avg = self.g_avg[name]
                g_avg[:] = self.rho * g_avg + (1 - self.rho) * grad
                avg = v - g_avg ** 2
                avg = np.maximum(avg, 0)  # Ensure non-negative
            else:
                avg = v

            # Compute update
            update = grad / (np.sqrt(avg) + self.eps)

            # Optional: apply momentum
            if self.momentum > 0:
                buf = self.momentum_buffer[name]
                buf[:] = self.momentum * buf + update
                update = buf

            updated[name] = param - self.lr * update

        return updated

    def get_state_summary(self) -> Dict[str, float]:
        """Get summary statistics for monitoring."""
        all_v = np.concatenate([v.flatten() for v in self.v.values()])
        all_lr = self.lr / (np.sqrt(all_v) + self.eps)
        return {
            'v_mean': float(np.mean(all_v)),
            'v_max': float(np.max(all_v)),
            'effective_lr_mean': float(np.mean(all_lr)),
            'effective_lr_min': float(np.min(all_lr)),
            'effective_lr_max': float(np.max(all_lr)),
        }


# Demonstration
if __name__ == "__main__":
    optimizer = RMSprop(learning_rate=0.01, rho=0.9)

    # Simple optimization: minimize (x - 3)² + (y - 2)²
    params = {"theta": np.array([0.0, 0.0])}
    target = np.array([3.0, 2.0])

    for step in range(200):
        grad = 2 * (params["theta"] - target)
        params = optimizer.step(params, {"theta": grad})

        if step % 50 == 49:
            loss = np.sum((params["theta"] - target) ** 2)
            stats = optimizer.get_state_summary()
            print(f"Step {step+1}: loss = {loss:.6f}, eff_lr = {stats['effective_lr_mean']:.6f}")
```

PyTorch uses 'alpha' for the decay rate ρ, which is confusingly also what we call the learning rate in mathematical notation. Be careful: optimizer.param_groups[0]['alpha'] is the EMA decay, not the learning rate!
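For reference, a minimal PyTorch usage sketch illustrating the naming clash (assumes PyTorch is installed; the linear model is just a placeholder):

```python
import torch

model = torch.nn.Linear(10, 1)   # placeholder model

# In torch.optim.RMSprop, `alpha` is the EMA decay (our ρ) and `lr` is the learning rate α.
optimizer = torch.optim.RMSprop(
    model.parameters(), lr=0.01, alpha=0.99, eps=1e-8, momentum=0.0, centered=False
)

print(optimizer.param_groups[0]["lr"])     # 0.01 -> learning rate (α in our notation)
print(optimizer.param_groups[0]["alpha"])  # 0.99 -> EMA decay (ρ in our notation)
```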
RMSprop has three key hyperparameters: learning rate α, decay rate ρ, and epsilon ε. Here's comprehensive tuning guidance.
Learning Rate (α):
RMSprop's effective learning rate is α/√v_t. Since v_t adapts, the appropriate base α differs from what works for SGD; values in the 0.001–0.01 range are common starting points.
The adaptation means RMSprop is less sensitive to α than SGD, but still requires tuning.
Decay Rate (ρ):
Controls the memory of the EMA. Hinton's original suggestion was ρ = 0.9, but ρ = 0.99 is common.
| ρ Value | Effective Window | When to Use | Trade-offs |
|---|---|---|---|
| 0.9 | ~10 steps | Small batches, fast-changing gradients | More responsive, noisier LR |
| 0.99 | ~100 steps | Default for most cases | Balanced stability/adaptation |
| 0.999 | ~1000 steps | Large batches, very stable training | Slower adaptation, very smooth |
Epsilon (ε):
Prevents division by zero when v_t is small. Common values range from 1e-8 (the default used in the implementation above) to around 1e-6.
Larger ε reduces adaptation strength; smaller ε allows more aggressive adaptation but risks numerical issues.
Hyperparameter Interactions:
α and ρ: Higher ρ means smoother v_t, which means more stable effective LR. This permits slightly higher α.
α and ε: If gradients are very small (as in some LLM or RNN training), ε may dominate √v_t, effectively undoing the adaptation. Reduce ε or increase the gradient scale (e.g., by clipping gradient norms to a minimum); the short numeric sketch after this list makes the effect concrete.
Batch size and ρ: Larger batches → less noisy gradients → can use larger ρ. Relationship: ρ ≈ 1 - batch_size / dataset_size is a rough heuristic.
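A minimal numeric sketch of the α and ε interaction mentioned above; the gradient scale and ε values are arbitrary:

```python
# When typical gradients are tiny, sqrt(v) can be comparable to epsilon,
# and epsilon then sets the effective learning rate instead of the adaptation.
alpha = 0.001
sqrt_v = 1e-4            # RMS of recent gradients for some parameter

for eps in [1e-8, 1e-6, 1e-4, 1e-3]:
    eff_lr = alpha / (sqrt_v + eps)
    print(f"eps = {eps:.0e}: effective LR = {eff_lr:.4f}")

# eps = 1e-8 -> ~10.0 (adaptation dominates: LR is scaled up by 1/sqrt_v)
# eps = 1e-3 -> ~0.91 (epsilon dominates: the adaptive scaling is mostly cancelled)
```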
Tuning Strategy:
Start from the defaults (α in the 0.001–0.01 range, ρ = 0.9, ε = 1e-8). Sweep α on a logarithmic grid first, since it has the largest effect; increase ρ if the effective learning rate is too noisy, decrease it if adaptation lags behind the training dynamics; touch ε only if gradients are extremely small.
RMSprop occupies a specific niche in the optimizer landscape. Understanding its relationship to other methods clarifies when to use it.
| Optimizer | Gradient Adaptation | Momentum | Bias Correction |
|---|---|---|---|
| SGD | None (global LR) | Optional | N/A |
| AdaGrad | Sum of g² | None | None |
| RMSprop | EMA of g² | Optional | None |
| Adam | EMA of g² | EMA of g | Yes |
RMSprop vs SGD+Momentum:
| Aspect | SGD+Momentum | RMSprop |
|---|---|---|
| LR Sensitivity | High (requires careful tuning) | Lower (adapts automatically) |
| Sparse gradients | Inefficient (global LR) | Efficient (per-param LR) |
| Ill-conditioned | Slow (requires small LR) | Fast (adapts to curvature) |
| Theoretical guarantees | Well understood | Less formal theory |
| Reproducibility | Easier (simpler dynamics) | Harder (adaptive dynamics) |
RMSprop vs Adam:
Adam adds momentum (an EMA of the gradients themselves) and bias correction to RMSprop. For most deep learning, Adam is preferred. RMSprop remains relevant when momentum causes problems (as in some reinforcement learning setups), when simpler optimizer state and dynamics are desired, or when matching legacy baselines that used it.
Historical Context:
RMSprop came from Hinton's 2012 Coursera course. It was never formally published, yet became hugely influential. The lecture slide's informal status contrasts with its massive impact—a testament to the deep learning community's rapid idea propagation.
In 2024, Adam or AdamW is the default choice for most tasks. RMSprop is still useful for specific cases: when momentum causes issues, when you want simpler dynamics, or when matching legacy baselines. Understanding RMSprop deeply prepares you for understanding Adam.
RMSprop solved AdaGrad's critical flaw with a simple but profound change: replace accumulating sums with exponentially weighted averages that can forget.
The Optimizer Evolution:
| Era | Optimizer | Key Innovation |
|---|---|---|
| Classic | GD, Momentum | Direction memory |
| Adaptive | AdaGrad | Per-parameter learning rates |
| Modern | RMSprop | Forgetting (EMA) |
| Current | Adam | Momentum + Adaptation + Bias Correction |
What's Next:
RMSprop adapts to gradient magnitudes but treats the gradient direction naively—just using the current gradient. What if we applied EMA to the gradients themselves, not just their squares? Combined with RMSprop's adaptation, this creates Adam: arguably the most widely-used optimizer in deep learning today.
The next page explores Adam's formulation, its bias correction mechanism, and the many variants that have emerged.
You now understand RMSprop: why forgetting fixes AdaGrad, how EMAs work, the decay rate's role, and when RMSprop remains useful. Next, we explore Adam—the synthesis of momentum and adaptive learning rates that dominates modern deep learning.