In 2012, in an unpublished lecture slide for his Coursera course on neural networks, Geoffrey Hinton introduced RMSprop—a simple but transformative fix to AdaGrad's accumulation problem.
AdaGrad's issue was clear: it sums squared gradients forever, so learning rates only decrease, eventually becoming too small for continued learning. The solution, in hindsight, is obvious: make it forget.
Instead of accumulating all past squared gradients, RMSprop maintains an exponential moving average (EMA) of squared gradients. Recent gradients matter more than ancient ones. The denominator can now both grow (if current gradients are large) and shrink (if current gradients are small). The learning rate adapts to the current gradient regime, not the entire history.
This single change—replacing a sum with an EMA—unlocks adaptive optimization for deep learning.
RMSprop is one of those rare ideas that seems trivially simple yet has profound impact. It's the bridge between AdaGrad's theoretical elegance and Adam's practical dominance.
By the end of this page, you will understand RMSprop's mechanics, the mathematical properties of exponential moving averages, the decay rate hyperparameter's effect, and why RMSprop enabled the deep learning revolution before Adam arrived.
Let's trace the key insight that leads from AdaGrad to RMSprop.
AdaGrad's Accumulator:
$$G_t = G_{t-1} + g_t^2 = \sum_{\tau=1}^{t} g_\tau^2$$
This treats all historical gradients equally. A gradient from step 1 contributes as much to G₁₀₀₀₀ as a gradient from step 9999.
Why This Is Wrong for Deep Learning:
Non-stationary optimization: The loss landscape changes as we move. What mattered at step 100 may be irrelevant at step 10,000.
Curvature changes: The local Hessian (curvature) varies across the landscape. A parameter might need large updates in one region and small updates in another.
Phase changes in training: Early training explores; late training fine-tunes. The appropriate learning rate differs.
The Fix: Exponential Forgetting
Instead of summing, average with exponential decay:
$$v_t = \rho \cdot v_{t-1} + (1 - \rho) \cdot g_t^2$$
where ρ ∈ (0, 1) is the decay rate (typically 0.9 or 0.99).
This is an exponential moving average (EMA) of squared gradients. Old values decay exponentially:
After k = ln(0.5)/ln(ρ) = -0.693/ln(ρ) steps, the contribution halves.
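As a quick worked example, for ρ = 0.9 the half-life is

$$k_{1/2} = \frac{\ln 0.5}{\ln 0.9} \approx \frac{-0.693}{-0.105} \approx 6.6 \text{ steps},$$

which matches the ~7-step entry in the table below.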
| Decay Rate ρ | Half-Life (steps) | Steps to 90% of Weight | Use Case |
|---|---|---|---|
| 0.9 | ~7 steps | ~22 steps | Fast adaptation, small batches |
| 0.99 | ~69 steps | ~229 steps | Slower adaptation, more stability |
| 0.999 | ~693 steps | ~2,300 steps | Very slow adaptation, large batches |
Why EMA Solves the Problem:
Bounded accumulator: v_t stays in the range of recent squared gradients, never growing without bound.
Adaptable learning rate: If gradients shrink (entering flat region), v_t shrinks too, increasing the effective learning rate.
Local sensitivity: The accumulator reflects current conditions, not ancient history.
Stability: The averaging still smooths out noise, just over a finite window.
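A tiny sketch of the "adaptable learning rate" point (the gradient values are arbitrary illustrations): when gradients shrink, the EMA shrinks and the effective learning rate recovers, something AdaGrad's sum can never do.

```python
import numpy as np

# Gradients drop from ~1.0 to ~0.01 at step 100; watch the effective LR recover.
alpha, rho, eps = 0.001, 0.9, 1e-8
v = 0.0
for t in range(200):
    g = 1.0 if t < 100 else 0.01
    v = rho * v + (1 - rho) * g ** 2
    if t in (99, 110, 140, 199):
        print(f"t={t:3d}  g={g:5.2f}  sqrt(v)={np.sqrt(v):.4f}  eff_lr={alpha / (np.sqrt(v) + eps):.4f}")

# The effective LR climbs back toward alpha / 0.01 as the old, large gradients are
# forgotten; with AdaGrad the accumulated sum would keep the LR pinned low forever.
```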
RMSprop = Root Mean Square Propagation. The 'mean square' refers to the EMA of squared gradients—an approximation to the mean squared gradient. The denominator √v_t is the 'root mean square' (RMS) of recent gradients, which normalizes the update.
Let's formalize RMSprop precisely.
RMSprop Update Rule:
For parameters θ, learning rate α, decay rate ρ, and stability constant ε:
$$g_t = \nabla L(\theta_{t-1})$$
$$v_t = \rho \cdot v_{t-1} + (1 - \rho) \cdot g_t^2$$
$$\theta_t = \theta_{t-1} - \frac{\alpha}{\sqrt{v_t} + \epsilon} \cdot g_t$$
Where all operations are element-wise.
Component Analysis:
v_t: Exponential moving average of squared gradients. Tracks the "typical" squared gradient magnitude for each parameter.
√v_t: Root mean square (RMS) of recent gradients. Approximates local gradient scale.
α/(√v_t + ε): Effective learning rate, inversely proportional to RMS gradient.
α/(√v_t + ε) × g_t: Update step. Large gradient history → small step. Small gradient history → large step.
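To make the scaling concrete, here is a small worked example with α = 0.001 (the numbers are purely illustrative):

$$\text{Parameter A: } g_t \approx 10,\; v_t \approx 100 \;\Rightarrow\; \frac{\alpha}{\sqrt{v_t} + \epsilon}\, g_t \approx \frac{0.001}{10} \cdot 10 = 0.001$$

$$\text{Parameter B: } g_t \approx 0.01,\; v_t \approx 10^{-4} \;\Rightarrow\; \frac{\alpha}{\sqrt{v_t} + \epsilon}\, g_t \approx \frac{0.001}{0.01} \cdot 0.01 = 0.001$$

Despite gradient magnitudes differing by a factor of 1,000, both parameters move by roughly α per step: RMSprop equalizes update scales across parameters.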
Comparison to AdaGrad:
| Aspect | AdaGrad | RMSprop |
|---|---|---|
| Accumulator | Sum: G_t = Σg²_τ | EMA: v_t = ρv_{t-1} + (1-ρ)g²_t |
| Growth | Unbounded (→ ∞) | Bounded (≈ recent gradient scale) |
| Learning rate | Only decreases | Can increase or decrease |
| Sensitivity | All history equally | Recent history weighted more |
| Use case | Sparse/convex | Deep learning |
```python
import numpy as np


def rmsprop_step(theta, gradient, v, alpha=0.001, rho=0.9, epsilon=1e-8):
    """
    Single RMSprop update step.

    Args:
        theta: Parameter vector
        gradient: Current gradient
        v: Running EMA of squared gradients (modified in-place)
        alpha: Base learning rate
        rho: Decay rate for EMA (typically 0.9)
        epsilon: Numerical stability constant

    Returns:
        Updated theta
    """
    # Update EMA of squared gradients
    v[:] = rho * v + (1 - rho) * gradient ** 2

    # Compute adapted update
    # Large v[i] -> small step for parameter i
    # Small v[i] -> large step for parameter i
    update = alpha / (np.sqrt(v) + epsilon) * gradient

    return theta - update


# Demonstration: RMSprop vs AdaGrad over long training
def compare_accumulator_growth(steps=10000):
    """Compare how accumulator/EMA evolves over time."""
    gradient_scale = 1.0

    # AdaGrad accumulator
    G_adagrad = 0.0
    adagrad_lrs = []

    # RMSprop EMA
    v_rmsprop = 0.0
    rho = 0.9
    rmsprop_lrs = []

    alpha = 1.0  # Base LR for comparison

    for t in range(1, steps + 1):
        g = np.random.randn() * gradient_scale

        # AdaGrad: accumulate
        G_adagrad += g ** 2
        adagrad_lrs.append(alpha / (np.sqrt(G_adagrad) + 1e-8))

        # RMSprop: EMA
        v_rmsprop = rho * v_rmsprop + (1 - rho) * g ** 2
        rmsprop_lrs.append(alpha / (np.sqrt(v_rmsprop) + 1e-8))

    return adagrad_lrs, rmsprop_lrs


adagrad_lrs, rmsprop_lrs = compare_accumulator_growth()

print("Effective learning rates over 10,000 steps:")
print("                  Step 100   Step 1000  Step 10000")
print(f"AdaGrad:          {adagrad_lrs[99]:.6f}   {adagrad_lrs[999]:.6f}   {adagrad_lrs[9999]:.6f}")
print(f"RMSprop (ρ=0.9):  {rmsprop_lrs[99]:.6f}   {rmsprop_lrs[999]:.6f}   {rmsprop_lrs[9999]:.6f}")

# RMSprop maintains roughly constant effective LR
# AdaGrad's effective LR decays as O(1/sqrt(t))
```

Steady-State Behavior:
For stationary, zero-mean gradients with variance σ², the EMA converges in expectation to:
$$\lim_{t \to \infty} \mathbb{E}[v_t] = \sigma^2$$
This means the effective learning rate stabilizes at:
$$\alpha_{\text{eff}} \approx \frac{\alpha}{\sigma}$$
The learning rate is automatically scaled by the inverse standard deviation of the gradients—exactly the adaptation we want! Parameters with noisy, high-variance gradients get smaller learning rates; parameters with stable, low-variance gradients get larger learning rates.
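A quick numerical check of this steady-state claim; the sketch below assumes zero-mean Gaussian gradients, and the σ, α, ρ values are arbitrary illustrations:

```python
import numpy as np

# For zero-mean gradients with std sigma, the EMA of g^2 settles near sigma^2,
# so the effective learning rate settles near alpha / sigma.
rng = np.random.default_rng(0)
sigma, alpha, rho, eps = 0.5, 0.001, 0.99, 1e-8

v = 0.0
for _ in range(5000):
    g = rng.normal(0.0, sigma)
    v = rho * v + (1 - rho) * g ** 2

print(f"v after 5000 steps: {v:.4f}   (sigma^2 = {sigma ** 2:.4f})")
print(f"effective LR:       {alpha / (np.sqrt(v) + eps):.5f}   (alpha / sigma = {alpha / sigma:.5f})")
```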
Understanding EMAs deeply is essential for understanding not just RMSprop, but all modern adaptive optimizers.
Definition:
For a sequence x₁, x₂, ..., the EMA with decay rate ρ is:
$$\text{EMA}_t = \rho \cdot \text{EMA}_{t-1} + (1 - \rho) \cdot x_t$$
Starting from EMA₀ = 0 (or some initial value).
Unrolling the Recursion:
$$\text{EMA}_t = (1-\rho)\sum_{k=0}^{t-1} \rho^k x_{t-k} + \rho^t \cdot \text{EMA}_0$$
The weights (1-ρ)ρᵏ sum to 1 (as t → ∞), making this a weighted average.
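This normalization is just the finite geometric series:

$$(1-\rho)\sum_{k=0}^{t-1} \rho^k = (1-\rho) \cdot \frac{1-\rho^t}{1-\rho} = 1 - \rho^t \;\longrightarrow\; 1 \quad (t \to \infty)$$

At finite t, the missing mass ρᵗ sits on the initial value EMA₀, which is exactly the source of the cold-start bias discussed below.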
Key Properties:
Bounded Memory: The effective window is ≈ 1/(1-ρ) steps. For ρ = 0.9, about 10 steps; for ρ = 0.99, about 100 steps.
Bias Toward Zero: Early in training, EMA_t is biased toward zero because the initial EMA₀ = 0 hasn't been "flushed out" yet. This creates a cold-start problem.
Lag: The EMA lags behind sudden changes in the sequence. If gradients suddenly increase, the EMA takes ~1/(1-ρ) steps to catch up.
Smoothing: High ρ means more smoothing (less variance, more lag). Low ρ means less smoothing (more variance, faster response).
```python
import numpy as np


def demonstrate_ema_properties():
    """Visualize key EMA properties."""

    # 1. Bias toward zero (cold start problem)
    print("=== Cold Start Bias ===")
    x = 1.0  # Constant input
    rho = 0.9
    ema = 0.0
    for t in range(1, 21):
        ema = rho * ema + (1 - rho) * x
        print(f"Step {t:2d}: EMA = {ema:.4f}, True mean = 1.0, Bias = {1.0 - ema:.4f}")
    # Takes ~22 steps to get within 10% of the true value!
    # This motivates "bias correction" (used in Adam)

    # 2. Effective window demonstration
    print("=== Effective Window ===")
    print("Weight of past values in EMA (ρ=0.9):")
    for k in range(10):
        weight = (1 - 0.9) * (0.9 ** k)
        cumulative = 1 - (0.9 ** (k + 1))
        print(f"  x[t-{k}]: weight = {weight:.4f}, cumulative = {cumulative:.4f}")

    # 3. Response to sudden change
    print("=== Response to Sudden Change ===")
    sequence = [1.0] * 50 + [10.0] * 50  # Jump from 1 to 10 at t=50
    rho = 0.9
    ema = 0.0
    for t, x in enumerate(sequence):
        ema = rho * ema + (1 - rho) * x
        if t in [48, 49, 50, 51, 55, 60, 70, 99]:
            print(f"Step {t}: x = {x:.1f}, EMA = {ema:.2f}")
    # EMA takes ~10 steps to mostly catch up to the new value


demonstrate_ema_properties()
```

Bias Correction:
The cold-start bias is significant. Adam (covered next page) addresses this with bias correction:
$$\hat{v}_t = \frac{v_t}{1 - \rho^t}$$
Early in training (small t), this correction is substantial. As t grows, ρᵗ → 0 and the correction vanishes. RMSprop in its original form doesn't include bias correction, which can cause issues in early training.
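A brief sketch of what that correction does during the cold start (the corrected quantity is what Adam uses; the constant input is just for illustration):

```python
# Compare a raw EMA against its bias-corrected version on a constant input.
rho = 0.9
x = 1.0      # constant input with true mean 1.0
v = 0.0

for t in range(1, 11):
    v = rho * v + (1 - rho) * x
    v_hat = v / (1 - rho ** t)   # bias-corrected estimate
    print(f"t={t:2d}  raw EMA = {v:.4f}  corrected = {v_hat:.4f}")

# The raw EMA creeps up from 0 toward 1.0, while the corrected estimate equals
# the true mean from the very first step for a constant input.
```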
Choosing ρ:
| ρ Value | Effective Window | Behavior |
|---|---|---|
| 0.9 | ~10 steps | Fast adaptation, noisier |
| 0.99 | ~100 steps | Balanced (common default) |
| 0.999 | ~1000 steps | Very smooth, slow adaptation |
For deep learning, ρ = 0.9 is common for smaller batches; ρ = 0.99 for larger batches or when stability is paramount.
An EMA with decay ρ is approximately equivalent to averaging over the last 1/(1-ρ) samples. For ρ=0.9, this is ~10 samples. For ρ=0.99, ~100 samples. But EMA is computationally cheaper: O(1) update vs O(window) for explicit averaging.
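A rough illustration of that equivalence; the step signal, noise level, and window size below are arbitrary choices:

```python
import numpy as np

# Compare an EMA (rho = 0.9) with an explicit 10-sample mean on a noisy step signal.
rng = np.random.default_rng(1)
signal = np.concatenate([np.ones(100), 3.0 * np.ones(100)])   # step change at t = 100
noisy = signal + 0.3 * rng.standard_normal(signal.shape)

rho, window = 0.9, 10
ema = np.zeros_like(noisy)
ema[0] = noisy[0]                      # start at the first sample to skip cold-start bias
for t in range(1, len(noisy)):
    ema[t] = rho * ema[t - 1] + (1 - rho) * noisy[t]

# Trailing 10-sample mean for comparison (O(window) work per step if done naively).
sliding = np.array([noisy[max(0, t - window + 1): t + 1].mean() for t in range(len(noisy))])

for t in [50, 99, 110, 150]:
    print(f"t={t:3d}  noisy={noisy[t]:5.2f}  ema={ema[t]:5.2f}  sliding10={sliding[t]:5.2f}")

# Both smooth the noise and track the jump on a similar ~10-step timescale, but the
# EMA carries a single number of state instead of the last `window` samples.
```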
A refinement of RMSprop addresses the fact that v_t tracks E[g²], not Var[g]. The difference matters when gradients have non-zero mean.
The Issue:
Standard RMSprop divides by √E[g²]. But we might want to divide by √Var[g] = √(E[g²] - E[g]²), which measures gradient variability rather than magnitude.
Consider: if a gradient is consistently 10 (low variance, high mean), standard RMSprop gives a small learning rate. But maybe we want a large learning rate since the direction is consistent!
Centered RMSprop:
Track both E[g²] and E[g]:
$$v_t = \rho \cdot v_{t-1} + (1 - \rho) \cdot g_t^2$$

$$\bar{g}_t = \rho \cdot \bar{g}_{t-1} + (1 - \rho) \cdot g_t$$

$$\tilde{v}_t = v_t - \bar{g}_t^2$$
Update using the variance estimate:
$$\theta_t = \theta_{t-1} - \frac{\alpha}{\sqrt{\tilde{v}_t} + \epsilon} \cdot g_t$$
Interpretation:
This separates "gradient magnitude" from "gradient consistency", allowing more nuanced adaptation.
```python
import numpy as np


def centered_rmsprop_step(theta, gradient, v, g_bar, alpha=0.001, rho=0.99, epsilon=1e-8):
    """
    Centered RMSprop: normalizes by variance instead of second moment.

    Args:
        theta: Parameters
        gradient: Current gradient
        v: EMA of squared gradients (E[g²])
        g_bar: EMA of gradients (E[g])
        alpha: Learning rate
        rho: Decay rate
        epsilon: Stability constant

    Returns:
        Updated theta
    """
    # Update EMAs
    v[:] = rho * v + (1 - rho) * gradient ** 2
    g_bar[:] = rho * g_bar + (1 - rho) * gradient

    # Variance estimate: Var[g] = E[g²] - E[g]²
    variance = v - g_bar ** 2

    # Ensure non-negative (numerical issues can make this slightly negative)
    variance = np.maximum(variance, 0)

    # Update with variance normalization
    update = alpha / (np.sqrt(variance) + epsilon) * gradient

    return theta - update


# Compare standard vs centered on biased gradients
def compare_variants():
    np.random.seed(42)

    theta_std = 0.0
    theta_ctr = 0.0
    v_std = 0.0
    v_ctr = 0.0
    g_bar = 0.0

    alpha = 0.1
    rho = 0.9

    # Gradient with large mean, small variance
    # Mean = 5, Std = 0.5
    for step in range(100):
        g = 5.0 + np.random.randn() * 0.5

        # Standard RMSprop
        v_std = rho * v_std + (1 - rho) * g ** 2
        update_std = alpha / (np.sqrt(v_std) + 1e-8) * g
        theta_std -= update_std

        # Centered RMSprop
        v_ctr = rho * v_ctr + (1 - rho) * g ** 2
        g_bar = rho * g_bar + (1 - rho) * g
        var = max(0, v_ctr - g_bar ** 2)
        update_ctr = alpha / (np.sqrt(var) + 1e-8) * g
        theta_ctr -= update_ctr

    print("After 100 steps with biased gradient (mean=5, std=0.5):")
    print(f"  Standard RMSprop: θ = {theta_std:.2f}")
    print(f"  Centered RMSprop: θ = {theta_ctr:.2f}")
    print("  Centered moved further (larger effective LR for consistent gradients)")


compare_variants()
```

Centered RMSprop requires tracking an additional EMA (g_bar), doubling optimizer state. It also has numerical edge cases when variance is near zero. In practice, standard RMSprop or Adam is usually preferred. Centered variants are useful when gradient bias is known to be significant.
Here's a complete, production-quality RMSprop implementation.
```python
import numpy as np
from typing import Dict


class RMSprop:
    """
    RMSprop (Root Mean Square Propagation) optimizer.

    Maintains an exponential moving average of squared gradients for
    adaptive per-parameter learning rate scaling.

    Update rule:
        v_t = ρ * v_{t-1} + (1-ρ) * g_t²
        θ_t = θ_{t-1} - α / (√v_t + ε) * g_t

    Key differences from AdaGrad:
    - Uses EMA instead of sum (bounded accumulator)
    - Learning rate can increase when gradients shrink
    - Suitable for long training runs

    Reference:
        Hinton, Srivastava, Swersky. "Neural Networks for Machine Learning",
        Lecture 6e (2012, unpublished).
    """

    def __init__(
        self,
        learning_rate: float = 0.01,
        rho: float = 0.9,
        epsilon: float = 1e-8,
        weight_decay: float = 0.0,
        momentum: float = 0.0,
        centered: bool = False,
    ):
        """
        Args:
            learning_rate: Base learning rate α
            rho: Decay rate for squared gradient EMA (typically 0.9 or 0.99)
            epsilon: Small constant for numerical stability
            weight_decay: L2 regularization coefficient
            momentum: Optional momentum term (not in original RMSprop)
            centered: If True, use centered variant (normalize by variance)
        """
        if not 0.0 < rho < 1.0:
            raise ValueError(f"rho must be in (0, 1), got {rho}")

        self.lr = learning_rate
        self.rho = rho
        self.eps = epsilon
        self.weight_decay = weight_decay
        self.momentum = momentum
        self.centered = centered

        # State
        self.v: Dict[str, np.ndarray] = {}       # EMA of squared gradients
        self.g_avg: Dict[str, np.ndarray] = {}   # EMA of gradients (centered)
        self.momentum_buffer: Dict[str, np.ndarray] = {}  # Momentum buffer
        self.steps = 0

    def step(
        self,
        parameters: Dict[str, np.ndarray],
        gradients: Dict[str, np.ndarray],
    ) -> Dict[str, np.ndarray]:
        """
        Perform one RMSprop step.

        Args:
            parameters: Dict mapping names to parameter arrays
            gradients: Dict mapping names to gradient arrays

        Returns:
            Updated parameters dict
        """
        self.steps += 1
        updated = {}

        for name, param in parameters.items():
            if name not in gradients:
                updated[name] = param
                continue

            grad = gradients[name].copy()

            # Apply weight decay
            if self.weight_decay > 0:
                grad = grad + self.weight_decay * param

            # Initialize state
            if name not in self.v:
                self.v[name] = np.zeros_like(param)
                if self.centered:
                    self.g_avg[name] = np.zeros_like(param)
                if self.momentum > 0:
                    self.momentum_buffer[name] = np.zeros_like(param)

            v = self.v[name]

            # Update EMA of squared gradients
            v[:] = self.rho * v + (1 - self.rho) * grad ** 2

            if self.centered:
                # Centered variant: normalize by variance
                g_avg = self.g_avg[name]
                g_avg[:] = self.rho * g_avg + (1 - self.rho) * grad
                avg = v - g_avg ** 2
                avg = np.maximum(avg, 0)  # Ensure non-negative
            else:
                avg = v

            # Compute update
            update = grad / (np.sqrt(avg) + self.eps)

            # Optional: apply momentum
            if self.momentum > 0:
                buf = self.momentum_buffer[name]
                buf[:] = self.momentum * buf + update
                update = buf

            updated[name] = param - self.lr * update

        return updated

    def get_state_summary(self) -> Dict[str, float]:
        """Get summary statistics for monitoring."""
        all_v = np.concatenate([v.flatten() for v in self.v.values()])
        all_lr = self.lr / (np.sqrt(all_v) + self.eps)
        return {
            'v_mean': float(np.mean(all_v)),
            'v_max': float(np.max(all_v)),
            'effective_lr_mean': float(np.mean(all_lr)),
            'effective_lr_min': float(np.min(all_lr)),
            'effective_lr_max': float(np.max(all_lr)),
        }


# Demonstration
if __name__ == "__main__":
    optimizer = RMSprop(learning_rate=0.01, rho=0.9)

    # Simple optimization: minimize (x - 3)² + (y - 2)²
    params = {"theta": np.array([0.0, 0.0])}
    target = np.array([3.0, 2.0])

    for step in range(200):
        grad = 2 * (params["theta"] - target)
        params = optimizer.step(params, {"theta": grad})

        if step % 50 == 49:
            loss = np.sum((params["theta"] - target) ** 2)
            stats = optimizer.get_state_summary()
            print(f"Step {step+1}: loss = {loss:.6f}, eff_lr = {stats['effective_lr_mean']:.6f}")
```

PyTorch uses 'alpha' for the decay rate ρ, which is confusingly also what we call the learning rate in mathematical notation. Be careful: optimizer.param_groups[0]['alpha'] is the EMA decay, not the learning rate!
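For reference, a minimal PyTorch usage sketch illustrating the naming clash (assumes PyTorch is installed; the linear model is just a placeholder):

```python
import torch

model = torch.nn.Linear(10, 1)   # placeholder model

# In torch.optim.RMSprop, `alpha` is the EMA decay (our ρ) and `lr` is the learning rate α.
optimizer = torch.optim.RMSprop(
    model.parameters(), lr=0.01, alpha=0.99, eps=1e-8, momentum=0.0, centered=False
)

print(optimizer.param_groups[0]["lr"])     # 0.01 -> learning rate (α in our notation)
print(optimizer.param_groups[0]["alpha"])  # 0.99 -> EMA decay (ρ in our notation)
```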
RMSprop has three key hyperparameters: learning rate α, decay rate ρ, and epsilon ε. Here's comprehensive tuning guidance.
Learning Rate (α):
RMSprop's effective learning rate is α/√v_t. Since v_t adapts, the appropriate base α differs from what works for SGD; values in the 0.001–0.01 range are common starting points.
The adaptation means RMSprop is less sensitive to α than SGD, but still requires tuning.
Decay Rate (ρ):
Controls the memory of the EMA. Hinton's original suggestion was ρ = 0.9, but ρ = 0.99 is common.
| ρ Value | Effective Window | When to Use | Trade-offs |
|---|---|---|---|
| 0.9 | ~10 steps | Small batches, fast-changing gradients | More responsive, noisier LR |
| 0.99 | ~100 steps | Default for most cases | Balanced stability/adaptation |
| 0.999 | ~1000 steps | Large batches, very stable training | Slower adaptation, very smooth |
Epsilon (ε):
Prevents division by zero when v_t is small. Common values range from 1e-8 (the default used in the implementation above) to around 1e-6.
Larger ε reduces adaptation strength; smaller ε allows more aggressive adaptation but risks numerical issues.
Hyperparameter Interactions:
α and ρ: Higher ρ means smoother v_t, which means more stable effective LR. This permits slightly higher α.
α and ε: If gradients are very small (as in some LLM or RNN training), ε may dominate √v_t, effectively undoing the adaptation. Reduce ε or increase the gradient scale (e.g., by clipping gradient norms to a minimum); the short numeric sketch after this list makes the effect concrete.
Batch size and ρ: Larger batches → less noisy gradients → can use larger ρ. Relationship: ρ ≈ 1 - batch_size / dataset_size is a rough heuristic.
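A minimal numeric sketch of the α and ε interaction mentioned above; the gradient scale and ε values are arbitrary:

```python
# When typical gradients are tiny, sqrt(v) can be comparable to epsilon,
# and epsilon then sets the effective learning rate instead of the adaptation.
alpha = 0.001
sqrt_v = 1e-4            # RMS of recent gradients for some parameter

for eps in [1e-8, 1e-6, 1e-4, 1e-3]:
    eff_lr = alpha / (sqrt_v + eps)
    print(f"eps = {eps:.0e}: effective LR = {eff_lr:.4f}")

# eps = 1e-8 -> ~10.0 (adaptation dominates: LR is scaled up by 1/sqrt_v)
# eps = 1e-3 -> ~0.91 (epsilon dominates: the adaptive scaling is mostly cancelled)
```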
Tuning Strategy:
Start from the defaults (α in the 0.001–0.01 range, ρ = 0.9, ε = 1e-8). Sweep α on a logarithmic grid first, since it has the largest effect; increase ρ if the effective learning rate is too noisy, decrease it if adaptation lags behind the training dynamics; touch ε only if gradients are extremely small.
RMSprop occupies a specific niche in the optimizer landscape. Understanding its relationship to other methods clarifies when to use it.
| Optimizer | Gradient Adaptation | Momentum | Bias Correction |
|---|---|---|---|
| SGD | None (global LR) | Optional | N/A |
| AdaGrad | Sum of g² | None | None |
| RMSprop | EMA of g² | Optional | None |
| Adam | EMA of g² | EMA of g | Yes |
RMSprop vs SGD+Momentum:
| Aspect | SGD+Momentum | RMSprop |
|---|---|---|
| LR Sensitivity | High (requires careful tuning) | Lower (adapts automatically) |
| Sparse gradients | Inefficient (global LR) | Efficient (per-param LR) |
| Ill-conditioned | Slow (requires small LR) | Fast (adapts to curvature) |
| Theoretical guarantees | Well understood | Less formal theory |
| Reproducibility | Easier (simpler dynamics) | Harder (adaptive dynamics) |
RMSprop vs Adam:
Adam adds momentum (an EMA of the gradients themselves) and bias correction to RMSprop. For most deep learning, Adam is preferred. RMSprop remains relevant when momentum causes problems (as in some reinforcement learning setups), when simpler optimizer state and dynamics are desired, or when matching legacy baselines that used it.
Historical Context:
RMSprop came from Hinton's 2012 Coursera course. It was never formally published, yet became hugely influential. The lecture slide's informal status contrasts with its massive impact—a testament to the deep learning community's rapid idea propagation.
In 2024, Adam or AdamW is the default choice for most tasks. RMSprop is still useful for specific cases: when momentum causes issues, when you want simpler dynamics, or when matching legacy baselines. Understanding RMSprop deeply prepares you for understanding Adam.
RMSprop solved AdaGrad's critical flaw with a simple but profound change: replace accumulating sums with exponentially weighted averages that can forget.
The Optimizer Evolution:
| Era | Optimizer | Key Innovation |
|---|---|---|
| Classic | GD, Momentum | Direction memory |
| Adaptive | AdaGrad | Per-parameter learning rates |
| Modern | RMSprop | Forgetting (EMA) |
| Current | Adam | Momentum + Adaptation + Bias Correction |
What's Next:
RMSprop adapts to gradient magnitudes but treats the gradient direction naively—just using the current gradient. What if we applied EMA to the gradients themselves, not just their squares? Combined with RMSprop's adaptation, this creates Adam: arguably the most widely-used optimizer in deep learning today.
The next page explores Adam's formulation, its bias correction mechanism, and the many variants that have emerged.
You now understand RMSprop: why forgetting fixes AdaGrad, how EMAs work, the decay rate's role, and when RMSprop remains useful. Next, we explore Adam—the synthesis of momentum and adaptive learning rates that dominates modern deep learning.