Every optimizer we've studied so far shares a fundamental limitation: it applies the same learning rate to all parameters. Momentum makes all parameters faster. Nesterov makes all parameters look ahead. But what if different parameters need fundamentally different treatment?
Consider a word embedding matrix. The embedding for "the" is updated on nearly every batch. The embedding for "quixotic" might appear once per epoch. If we use a learning rate suitable for "the", updates to "quixotic" are too small—it never learns. If we use a learning rate suitable for "quixotic", updates to "the" are too large—it oscillates wildly.
This is the sparse gradient problem, and it's pervasive in machine learning: word embeddings, click-through-rate models with millions of rare features, and recommender systems all mix a handful of frequently-updated parameters with a long tail of rarely-updated ones.
In 2011, John Duchi, Elad Hazan, and Yoram Singer proposed AdaGrad (Adaptive Gradient), which elegantly solves this problem by maintaining a per-parameter sum of squared gradients and using it to scale individual learning rates. Parameters with large cumulative gradients get smaller learning rates; parameters with small cumulative gradients get larger learning rates.
This simple idea launched the entire field of adaptive optimization.
By the end of this page, you will understand AdaGrad from first principles: the adaptation mechanism, the mathematical formulation, the regret analysis that motivated it, the sparse gradient handling, and critically—why its accumulation leads to learning rate decay that both helps and eventually hurts.
Let's make the sparse gradient problem concrete with a detailed example.
Example: Training Word Embeddings
Consider training a 100,000-word vocabulary with 300-dimensional embeddings (30 million parameters just for embeddings). In a typical training batch, a common word like "the" contributes a gradient to its embedding on essentially every example, while a rare word like "quixotic" may receive a gradient only once per epoch.
With a global learning rate α = 0.1, both embeddings take the same size step: the frequent word is overtrained (potentially oscillating), while the rare word barely moves.
The Geometric Intuition:
Imagine the loss surface as a function of two parameters: θ₁ lies along a steep direction and receives large gradients, while θ₂ lies along a shallow direction and receives small gradients.
With a global learning rate, we must choose between a small rate that keeps θ₁ stable but leaves θ₂ crawling, and a large rate that moves θ₂ but sends θ₁ oscillating across the valley (a minimal sketch of this failure follows the table below).
We need α₁ = 0.01 and α₂ = 0.1 simultaneously—different learning rates for different parameters.
| Parameter Type | Gradient Size | Ideal α | With Global α=0.1 |
|---|---|---|---|
| Frequent features | Large (~10) | Small (~0.01) | Overshoots, oscillates |
| Rare features | Small (~0.1) | Large (~0.5) | Barely moves, undertrained |
| Dense layers | Medium (~1) | Medium (~0.1) | Works reasonably |
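To make the table concrete, here is a minimal sketch of the failure mode; the quadratic and its curvatures are invented so that the two coordinates see gradients of roughly the sizes listed above:

```python
import numpy as np

# Toy quadratic: L(θ) = 0.5 * (100·θ₁² + 0.1·θ₂²)
# At the starting point the gradients are ≈10 (steep, "frequent-feature-like")
# and ≈0.1 (shallow, "rare-feature-like"), matching the table's rows.
curvatures = np.array([100.0, 0.1])
theta = np.array([0.1, 1.0])

alpha = 0.1  # one global learning rate for both coordinates
for _ in range(50):
    grad = curvatures * theta
    theta = theta - alpha * grad

print(theta)
# θ₁ oscillates and diverges (its per-step multiplier is 1 - 0.1·100 = -9),
# while θ₂ has barely moved (multiplier 1 - 0.1·0.1 = 0.99, so 0.99⁵⁰ ≈ 0.61).
```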
AdaGrad's Insight:
The gradient magnitude itself tells us how to scale! If a parameter receives large gradients, we should use a smaller learning rate. If it receives small gradients, we should use a larger learning rate.
But we can't just use the current gradient magnitude—that's too noisy. Instead, AdaGrad accumulates the sum of squared gradients over all time, creating a stable estimate of the typical gradient scale for each parameter.
This sum gives us a denominator that normalizes updates: frequently-updated parameters accumulate large sums (getting smaller effective learning rates), while rarely-updated parameters accumulate small sums (retaining larger effective learning rates).
AdaGrad divides each parameter's learning rate by the root-sum-of-squared-gradients for that parameter. Large accumulated gradients → small learning rate. Small accumulated gradients → large learning rate. Adaptation happens automatically!
Let's formalize AdaGrad precisely. For parameters θ ∈ ℝⁿ, learning rate α, and small constant ε for numerical stability:
AdaGrad Update Rule:
$$g_t = \nabla L(\theta_{t-1})$$
$$G_t = G_{t-1} + g_t \odot g_t$$
$$\theta_t = \theta_{t-1} - \frac{\alpha}{\sqrt{G_t} + \epsilon} \odot g_t$$
Where:
- g_t is the gradient at step t
- G_t accumulates the element-wise squares of all gradients seen so far (with G₀ = 0)
- ⊙ denotes element-wise (Hadamard) multiplication; the square root and division are also applied element-wise
- ε (typically 1e-8) prevents division by zero
Unpacking the Notation:
For a single parameter θᵢ:
$$G_{t,i} = \sum_{\tau=1}^{t} g_{\tau,i}^2$$
$$\theta_{t,i} = \theta_{t-1,i} - \frac{\alpha}{\sqrt{G_{t,i}} + \epsilon} g_{t,i}$$
The effective learning rate for parameter i is:
$$\alpha_{t,i}^{\text{eff}} = \frac{\alpha}{\sqrt{G_{t,i}} + \epsilon}$$
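As a quick numerical check with α = 0.1 (and ε small enough to ignore), compare a parameter whose accumulator has grown to 100 with one whose accumulator is only 0.04:

$$\frac{0.1}{\sqrt{100}} = 0.01 \qquad \text{versus} \qquad \frac{0.1}{\sqrt{0.04}} = 0.5$$

The heavily-updated parameter lands near the "small" ideal rate from the earlier table, while the rarely-updated one keeps a rate fifty times larger.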
Key Properties:
- Each parameter gets its own effective learning rate; no per-parameter tuning is required.
- G_t is a running sum of non-negative terms, so it never decreases and the effective learning rates only shrink.
- Memory and compute overhead are O(n): one extra accumulator value per parameter.
- A parameter that receives zero gradient on a step keeps both its value and its accumulator unchanged, which is what makes sparse gradients work well.
```python
import numpy as np

def adagrad_update(theta, gradient, G, alpha=0.01, epsilon=1e-8):
    """
    Single AdaGrad update step.

    Args:
        theta: Parameter vector (n,)
        gradient: Current gradient (n,)
        G: Accumulated squared gradients (n,), modified in-place
        alpha: Base learning rate
        epsilon: Numerical stability constant

    Returns:
        Updated theta
    """
    # Accumulate squared gradients (element-wise)
    G += gradient ** 2

    # Compute adapted learning rate per parameter
    # Large G[i] → small effective lr for parameter i
    # Small G[i] → large effective lr for parameter i
    adapted_lr = alpha / (np.sqrt(G) + epsilon)

    # Update parameters
    theta_new = theta - adapted_lr * gradient

    return theta_new


# Example: Sparse gradient scenario
np.random.seed(42)

# Two parameters: one receives frequent updates, one receives sparse updates
theta = np.array([0.0, 0.0])
G = np.array([0.0, 0.0])  # Accumulated squared gradients
alpha = 1.0  # Larger base rate since AdaGrad will reduce it

# Simulate 1000 steps
for step in range(1000):
    # Parameter 0: always gets gradient ~1
    # Parameter 1: gets gradient ~1 only 1% of the time
    g0 = np.random.randn() * 1.0
    g1 = np.random.randn() * 1.0 if np.random.rand() < 0.01 else 0.0

    gradient = np.array([g0, g1])
    theta = adagrad_update(theta, gradient, G, alpha)

print(f"Accumulated G: [{G[0]:.1f}, {G[1]:.1f}]")
print(f"Effective LRs: [{alpha/(np.sqrt(G[0])+1e-8):.6f}, {alpha/(np.sqrt(G[1])+1e-8):.6f}]")
# Parameter 0 (frequent): G >> 0, effective LR << 1
# Parameter 1 (sparse):   G ~ 0,  effective LR ~ 1
```

The Diagonal Approximation:
Technically, AdaGrad approximates an adaptive preconditioner. The "full" AdaGrad would maintain:
$$H_t = \sum_{\tau=1}^{t} g_\tau g_\tau^T$$
This n×n matrix captures gradient covariance, allowing updates that account for parameter interactions. The update would be:
$$\theta_t = \theta_{t-1} - \alpha H_t^{-1/2} g_t$$
But storing and inverting an n×n matrix is prohibitive for large n. The practical AdaGrad uses a diagonal approximation: only store the diagonal of H, which is exactly Gₜ. This loses covariance information but makes the algorithm O(n) in space and time.
Modern optimizers like Adam and variants also use this diagonal approximation. Full-matrix methods (K-FAC, natural gradient) recapture some covariance structure.
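Here is a minimal sketch of the contrast on a toy problem small enough that the full n×n matrix is affordable; the inverse matrix square root is computed via eigendecomposition purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 3, 200
alpha, eps = 0.1, 1e-8

H = np.zeros((n, n))  # full accumulator: Σ g gᵀ
G = np.zeros(n)       # diagonal accumulator: Σ g² (what practical AdaGrad stores)

for _ in range(T + 1):
    g = rng.normal(size=n)  # the last g plays the role of the current gradient g_t
    H += np.outer(g, g)
    G += g ** 2

# Full-matrix step: α · H^{-1/2} g, via eigendecomposition of the symmetric H
eigvals, eigvecs = np.linalg.eigh(H)
H_inv_sqrt = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
full_step = alpha * (H_inv_sqrt @ g)

# Diagonal step: α / (√G + ε) ⊙ g, i.e. the update rule given earlier
diag_step = alpha / (np.sqrt(G) + eps) * g

print("diag(H) equals G:", np.allclose(np.diag(H), G))  # True by construction
print("full-matrix step:", full_step)
print("diagonal step:   ", diag_step)
# With uncorrelated gradients the two steps are close; with correlated gradients
# the full-matrix version also rotates the update, at O(n²) memory and O(n³) cost.
```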
AdaGrad wasn't designed heuristically—it has rigorous theoretical foundations in online convex optimization. Understanding this gives insight into why the algorithm works and when it's appropriate.
Online Learning Setup:
At each time step t, we choose parameters θ_t, the environment reveals a convex loss function L_t, and we incur the loss L_t(θ_t).
The goal is to minimize regret: the difference between total loss incurred and the loss of the best fixed parameters in hindsight.
$$\text{Regret}_T = \sum_{t=1}^{T} L_t(\theta_t) - \min_{\theta^*} \sum_{t=1}^{T} L_t(\theta^*)$$
Regret Bounds:
For standard online gradient descent with learning rate α = O(1/√T):
$$\text{Regret}_T = O(\sqrt{T})$$
AdaGrad achieves the same O(√T) regret, but with a crucial improvement for sparse gradients.
AdaGrad's Data-Dependent Bound:
AdaGrad's regret depends on the actual gradient magnitudes observed, not their worst-case bounds:
$$\text{Regret}_T = O\left(\sum_{i=1}^{n} \sqrt{\sum_{t=1}^{T} g_{t,i}^2}\right)$$
Compare to standard bounds using the maximum gradient G_∞:
$$\text{Standard bound: } O(\sqrt{T} \cdot G_\infty \cdot \sqrt{n})$$
$$\text{AdaGrad bound: } O\left(\sum_i \sqrt{\sum_t g_{t,i}^2}\right)$$
If gradients are sparse (many g_{t,i} = 0), AdaGrad's bound is much tighter!
In sparse settings, AdaGrad can be exponentially better than non-adaptive methods in terms of regret. A feature that's only active 1% of the time contributes √(0.01T) to AdaGrad's bound instead of √T—a 10× improvement!
Worked Example: Sparse vs Dense Regret
Consider n = 100 features and T = 10,000 steps, with each nonzero gradient entry of size roughly 1:
Standard SGD regret: the bound √T · G_∞ · √n uses a worst-case gradient bound, so it is on the order of 100 · 10 · 10 = 10,000 regardless of how sparse the data actually is.
AdaGrad regret: with dense gradients, each coordinate contributes √(Σₜ g²ₜ,ᵢ) ≈ √T = 100, so the bound is about 100 · 100 = 10,000, no better than the standard bound (and, with its hidden constants, typically a little worse). But if each feature is active only 1% of the time, each coordinate contributes only √(0.01 · T) = 10, and the bound drops to about 100 · 10 = 1,000.
Wait—AdaGrad is worse on dense data? Yes, but only by constant factors. On sparse data, AdaGrad matches the effectively dense lower dimension. The adaptive learning rate automatically "discovers" the effective dimensionality.
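The calculation above can be checked by plugging simulated gradients directly into the two bound expressions. This is a sketch of the bounds themselves (constants dropped, not measured regret), with the 1% activity rate from the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 100, 10_000

# Dense: every coordinate sees a gradient of size ~1 at every step.
dense = rng.normal(size=(T, n))
# Sparse: each coordinate is active only ~1% of the time.
sparse = rng.normal(size=(T, n)) * (rng.random(size=(T, n)) < 0.01)

def adagrad_bound(g):
    # Data-dependent bound: Σ_i √(Σ_t g_{t,i}²)
    return np.sum(np.sqrt(np.sum(g ** 2, axis=0)))

# Non-adaptive bound: √T · G_∞ · √n with a worst-case a-priori bound G_∞ ≈ √n.
# It cannot "see" that most gradient entries are zero in the sparse case.
worst_case = np.sqrt(T) * np.sqrt(n) * np.sqrt(n)

print(f"worst-case bound (dense or sparse): {worst_case:,.0f}")
print(f"AdaGrad bound on dense data:        {adagrad_bound(dense):,.0f}")
print(f"AdaGrad bound on sparse data:       {adagrad_bound(sparse):,.0f}")
# Dense:  AdaGrad's bound is the same order as the worst-case bound.
# Sparse: AdaGrad's bound shrinks by roughly √(activity rate) = √0.01, i.e. 10×.
```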
The Implicit Regularization View:
Another interpretation: AdaGrad implicitly regularizes parameters proportionally to how much they've been updated. Frequently-updated parameters have more inertia against further change; rarely-updated parameters remain flexible. This is a form of Bayesian shrinkage.
AdaGrad has a critical weakness that limits its applicability: the learning rate only decreases, never recovers.
The Mechanism:
Since Gₜ is a sum of non-negative values, it can only increase:
$$G_t = G_{t-1} + g_t^2 \geq G_{t-1}$$
Therefore, the effective learning rate α/(√Gₜ + ε) can only decrease.
Why This Is a Problem:
In online convex optimization (AdaGrad's design domain), this is actually desirable—the learning rate should decay as we approach optimality.
But in deep learning, the objective is non-convex, training runs for hundreds of thousands of steps, and the optimizer may still need to make large moves late in training, for example to escape plateaus or to keep adapting as it enters new regions of the loss surface. An effective learning rate that has already decayed toward zero cannot do any of that.
Empirical Observation:
After sufficient training steps, AdaGrad's effective learning rate becomes negligibly small:
```python
import numpy as np

def track_adagrad_lr(steps, gradient_variance=1.0, alpha=0.1, epsilon=1e-8):
    """
    Track how AdaGrad's effective learning rate decays over time.
    """
    G = 0.0  # Accumulated squared gradients
    effective_lrs = []

    for t in range(1, steps + 1):
        # Simulate gradient with given variance
        g = np.random.randn() * np.sqrt(gradient_variance)

        # Accumulate
        G += g ** 2

        # Effective learning rate
        effective_lr = alpha / (np.sqrt(G) + epsilon)
        effective_lrs.append(effective_lr)

    return effective_lrs


# Track over 100,000 steps
steps = 100_000
lrs = track_adagrad_lr(steps)

print("Effective learning rate decay:")
print(f"  Step 100:     {lrs[99]:.6f}")
print(f"  Step 1,000:   {lrs[999]:.6f}")
print(f"  Step 10,000:  {lrs[9999]:.6f}")
print(f"  Step 100,000: {lrs[99999]:.6f}")

# Output shows dramatic decay:
#   Step 100:     0.010000
#   Step 1,000:   0.003162  (3x smaller)
#   Step 10,000:  0.001000  (10x smaller)
#   Step 100,000: 0.000316  (32x smaller)

# The learning rate decays as O(1/sqrt(t))
# After 1M steps, it's ~0.0001 - essentially no learning!
```

Theoretical Decay Rate:
For gradients with variance σ², the accumulated sum at step T is:
$$G_T \approx T \cdot \sigma^2$$
So the effective learning rate at step T is:
$$\alpha_T^{\text{eff}} = \frac{\alpha}{\sqrt{G_T} + \epsilon} \approx \frac{\alpha}{\sigma\sqrt{T}}$$
This O(1/√T) decay is too aggressive for non-convex optimization where we need sustained learning capability.
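Plugging the simulation's values (σ = 1, α = 0.1, T = 10,000) into this formula gives

$$\alpha_{10{,}000}^{\text{eff}} \approx \frac{0.1}{1 \cdot \sqrt{10{,}000}} = 0.001,$$

which matches the value printed at step 10,000 in the code above.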
When the Decay Is Acceptable: convex problems (AdaGrad's design domain), sparse-feature models such as large-scale logistic regression, and relatively short runs where the accumulator never grows large enough to stall progress.
When the Decay Is Fatal: long, non-convex deep learning runs, where sustained learning capability is needed for many epochs; once the accumulator is large, the effective learning rate has collapsed and training effectively stops.
AdaGrad's strength (adapting to gradient magnitudes) creates its weakness (relentless accumulation). Future optimizers—RMSprop and Adam—solve this by using exponential moving averages instead of sums, allowing the effective learning rate to both decrease and increase.
Here's a production-quality AdaGrad implementation with full features.
```python
import numpy as np
from typing import Dict, Optional


class AdaGrad:
    """
    AdaGrad (Adaptive Gradient) optimizer.

    Maintains per-parameter sum of squared gradients for adaptive
    learning rate scaling. Parameters with large accumulated gradients
    get smaller learning rates.

    Update rule:
        G_t = G_{t-1} + g_t²  (element-wise)
        θ_t = θ_{t-1} - α / (√G_t + ε) * g_t

    Reference:
        Duchi, Hazan, Singer. "Adaptive Subgradient Methods for Online
        Learning and Stochastic Optimization." JMLR 2011.
    """

    def __init__(
        self,
        learning_rate: float = 0.01,
        weight_decay: float = 0.0,
        initial_accumulator_value: float = 0.0,
        epsilon: float = 1e-8,
    ):
        """
        Args:
            learning_rate: Base learning rate α
            weight_decay: L2 regularization coefficient
            initial_accumulator_value: Initial value for G (can help
                stability; 0.1 is sometimes used)
            epsilon: Small constant for numerical stability
        """
        if learning_rate < 0:
            raise ValueError(f"Learning rate must be non-negative, got {learning_rate}")
        if epsilon <= 0:
            raise ValueError(f"Epsilon must be positive, got {epsilon}")

        self.lr = learning_rate
        self.weight_decay = weight_decay
        self.initial_accumulator = initial_accumulator_value
        self.eps = epsilon

        # State: accumulated squared gradients per parameter
        self.accumulators: Dict[str, np.ndarray] = {}
        self.steps = 0

    def step(
        self,
        parameters: Dict[str, np.ndarray],
        gradients: Dict[str, np.ndarray],
    ) -> Dict[str, np.ndarray]:
        """
        Perform one AdaGrad step.

        Args:
            parameters: Dict mapping names to parameter arrays
            gradients: Dict mapping names to gradient arrays

        Returns:
            Updated parameters dict
        """
        self.steps += 1
        updated = {}

        for name, param in parameters.items():
            if name not in gradients:
                updated[name] = param
                continue

            grad = gradients[name].copy()

            # Apply weight decay (L2 regularization)
            if self.weight_decay > 0:
                grad = grad + self.weight_decay * param

            # Initialize accumulator if needed
            if name not in self.accumulators:
                self.accumulators[name] = np.full_like(
                    param, self.initial_accumulator, dtype=np.float64
                )

            acc = self.accumulators[name]

            # Accumulate squared gradients (element-wise)
            acc += grad ** 2

            # Compute adapted update: α / (√G + ε) * g
            adapted_lr = self.lr / (np.sqrt(acc) + self.eps)
            updated[name] = param - adapted_lr * grad

        return updated

    def get_effective_learning_rates(self) -> Dict[str, np.ndarray]:
        """Get current per-parameter effective learning rates."""
        return {
            name: self.lr / (np.sqrt(acc) + self.eps)
            for name, acc in self.accumulators.items()
        }

    def get_accumulator_stats(self) -> Dict[str, Dict[str, float]]:
        """Get statistics about accumulated gradients."""
        stats = {}
        for name, acc in self.accumulators.items():
            stats[name] = {
                'min': float(np.min(acc)),
                'max': float(np.max(acc)),
                'mean': float(np.mean(acc)),
                'nonzero_frac': float(np.mean(acc > 1e-10)),
            }
        return stats

    def state_dict(self) -> Dict:
        """Save optimizer state."""
        return {
            'accumulators': {k: v.copy() for k, v in self.accumulators.items()},
            'steps': self.steps,
            'lr': self.lr,
        }

    def load_state_dict(self, state: Dict):
        """Restore optimizer state."""
        self.accumulators = {k: v.copy() for k, v in state['accumulators'].items()}
        self.steps = state['steps']


# Demonstration: Sparse gradient handling
if __name__ == "__main__":
    optimizer = AdaGrad(learning_rate=1.0)

    # Simulate embeddings: 10 words, 5-dim each
    n_words = 10
    dim = 5
    embeddings = np.random.randn(n_words, dim) * 0.1
    params = {"embeddings": embeddings}

    # Simulate 1000 training steps with sparse gradients
    # Words 0,1,2 are frequent; words 7,8,9 are rare
    np.random.seed(42)
    for step in range(1000):
        grad = np.zeros_like(embeddings)

        # Frequent words (80% chance each step)
        for word in [0, 1, 2]:
            if np.random.rand() < 0.8:
                grad[word] = np.random.randn(dim) * 0.5

        # Rare words (2% chance each step)
        for word in [7, 8, 9]:
            if np.random.rand() < 0.02:
                grad[word] = np.random.randn(dim) * 0.5

        params = optimizer.step(params, {"embeddings": grad})

    # Check effective learning rates
    eff_lr = optimizer.get_effective_learning_rates()["embeddings"]
    print("Effective learning rates by word:")
    for i in range(n_words):
        print(f"  Word {i}: {eff_lr[i].mean():.4f} (mean across dims)")

    # Frequent words have LOWER effective lr (more accumulated gradients)
    # Rare words have HIGHER effective lr (less accumulated gradients)
    # This is the adaptive behavior we want!
```

Despite its learning rate decay problem, AdaGrad remains valuable in specific scenarios.
Production Use Cases:
Large-Scale Logistic Regression (Google)
AdaGrad found early large-scale use at Google for ad click-through-rate prediction. With billions of sparse features, AdaGrad's per-feature learning rates are essential. The convex objective (logistic loss) makes LR decay appropriate.
Word Embedding Training
Training Word2Vec, GloVe, or similar embeddings benefits from AdaGrad. Common words receive more updates but smaller learning rates; rare words receive fewer but larger updates. This balances representation quality across vocabulary frequencies.
Collaborative Filtering
Matrix factorization for recommender systems has sparse gradients (only observed user-item pairs). AdaGrad handles the varying update frequency naturally.
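Here is a minimal sketch of that setting, using a made-up squared-error matrix-factorization loop rather than any particular library's API; only the user and item factors touched by an observed rating receive gradients, so only their accumulators grow:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 50, 200, 8
U = rng.normal(scale=0.1, size=(n_users, k))  # user factors
V = rng.normal(scale=0.1, size=(n_items, k))  # item factors
G_U = np.zeros_like(U)                         # AdaGrad accumulators, one per parameter
G_V = np.zeros_like(V)
alpha, eps = 0.1, 1e-8

def adagrad_mf_pass(ratings):
    """One pass over observed (user, item, rating) triples with AdaGrad updates."""
    for u, i, r in ratings:
        err = U[u] @ V[i] - r              # squared-error residual
        g_u, g_v = err * V[i], err * U[u]  # gradients touch only row u of U and row i of V
        G_U[u] += g_u ** 2
        G_V[i] += g_v ** 2
        U[u] -= alpha / (np.sqrt(G_U[u]) + eps) * g_u
        V[i] -= alpha / (np.sqrt(G_V[i]) + eps) * g_v

# Skewed item popularity: low-index items are rated far more often than the long tail.
ratings = [(int(rng.integers(n_users)),
            int(min(rng.geometric(0.05) - 1, n_items - 1)),
            float(rng.normal()))
           for _ in range(5000)]
adagrad_mf_pass(ratings)

counts = np.bincount([i for _, i, _ in ratings], minlength=n_items)
head = int(np.argmax(counts))
tail = int(np.argmin(np.where(counts > 0, counts, counts.max())))  # least-rated item that was rated
print(f"head item rated {counts[head]}x, mean effective lr ≈ "
      f"{(alpha / (np.sqrt(G_V[head]) + eps)).mean():.3f}")
print(f"tail item rated {counts[tail]}x, mean effective lr ≈ "
      f"{(alpha / (np.sqrt(G_V[tail]) + eps)).mean():.3f}")
# The frequently-rated item ends up with a much smaller effective learning rate
# than the rarely-rated one, which is exactly the adaptation described above.
```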
Tuning Guidance:
- Use a larger base learning rate than you would for SGD (often 0.01-1.0), since the accumulator shrinks it quickly.
- Leave ε at its default (around 1e-8); it only matters before the accumulator has grown.
- A small initial accumulator value (e.g. 0.1) can stabilize the first few steps, as in the implementation above.
- Monitor the effective learning rates; if they have collapsed toward zero and the loss has plateaued, AdaGrad's accumulation is likely the culprit.
In 2024, AdaGrad is rarely used directly in deep learning—RMSprop and Adam have supplanted it. But understanding AdaGrad is essential because: (1) it's still optimal for sparse/convex settings, (2) RMSprop/Adam are direct descendants that fix its accumulation problem, (3) the adaptive learning rate concept it pioneered is now standard.
AdaGrad spawned several variants addressing its limitations while preserving its benefits.
AdaDelta (Zeiler, 2012):
Replaces the unbounded accumulator with an exponential moving average:
$$E[g^2]_t = \rho E[g^2]_{t-1} + (1-\rho)g_t^2$$
Also introduces a "delta" accumulator for the numerator, removing the need to set a learning rate at all. Clever but not as widely adopted as RMSprop.
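Here is a minimal sketch of the AdaDelta update just described; ρ = 0.95 and ε = 1e-6 are commonly quoted defaults, but treat the exact values as assumptions:

```python
import numpy as np

def adadelta_update(theta, grad, state, rho=0.95, eps=1e-6):
    """One AdaDelta step. `state` holds the two EMAs: E[g²] and E[Δθ²]."""
    Eg2, Edx2 = state
    # EMA of squared gradients replaces AdaGrad's unbounded sum
    Eg2 = rho * Eg2 + (1 - rho) * grad ** 2
    # Step scaled by RMS(Δθ) / RMS(g): no explicit learning rate anywhere
    delta = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad
    # The "delta" accumulator: EMA of squared updates, used in the numerator
    Edx2 = rho * Edx2 + (1 - rho) * delta ** 2
    return theta + delta, (Eg2, Edx2)

# Usage: minimize a toy quadratic L(θ) = θ₁² + θ₂²
theta = np.array([5.0, -3.0])
state = (np.zeros_like(theta), np.zeros_like(theta))
for _ in range(2000):
    grad = 2 * theta
    theta, state = adadelta_update(theta, grad, state)
print(theta)  # both coordinates are driven toward 0; note the warm-up from tiny initial steps
```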
AdaGrad with Weight Decay:
Standard L2 regularization adds λθ to gradients. For AdaGrad, this is problematic—regularization gradients accumulate alongside data gradients. "Decoupled weight decay" applies regularization separately:
$$\theta_t = (1 - \lambda)\theta_{t-1} - \frac{\alpha}{\sqrt{G_t} + \epsilon} g_t$$
This idea becomes important in AdamW later.
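A minimal sketch contrasting the two forms, following the equation above rather than any specific library's API:

```python
import numpy as np

def adagrad_step_l2(theta, grad, G, alpha=0.01, wd=0.01, eps=1e-8):
    """Coupled L2: the λθ term is folded into the gradient, so it is also
    squared into G and rescaled by the adaptive denominator."""
    g = grad + wd * theta
    G += g ** 2
    return theta - alpha / (np.sqrt(G) + eps) * g

def adagrad_step_decoupled(theta, grad, G, alpha=0.01, wd=0.01, eps=1e-8):
    """Decoupled weight decay: shrink the weights directly, then apply the
    adaptive update to the data gradient alone (the idea behind AdamW)."""
    G += grad ** 2
    return (1 - wd) * theta - alpha / (np.sqrt(G) + eps) * grad
```

In the coupled version, the regularization term is divided by the same adaptive denominator as the data gradient, so heavily-updated parameters end up regularized less; decoupling keeps the decay strength uniform across parameters.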
AdaGrad with Learning Rate Decay:
Combine AdaGrad's per-parameter adaptation with a global schedule:
$$\alpha_t = \frac{\alpha_0}{1 + \gamma \cdot t}$$
This provides additional decay control, though it adds a hyperparameter.
Sparse AdaGrad:
For extremely sparse problems, only update accumulators for non-zero gradient entries. Saves memory and computation for very high-dimensional sparse problems.
```python
import numpy as np


class AdaGradWithDecay:
    """AdaGrad with additional global learning rate decay."""

    def __init__(self, lr=0.01, lr_decay=0.01, eps=1e-8):
        self.initial_lr = lr
        self.lr_decay = lr_decay
        self.eps = eps
        self.accumulators = {}
        self.steps = 0

    def step(self, params, grads):
        self.steps += 1

        # Global LR decay
        current_lr = self.initial_lr / (1 + self.lr_decay * self.steps)

        updated = {}
        for name, param in params.items():
            if name not in grads:
                updated[name] = param
                continue

            grad = grads[name]

            if name not in self.accumulators:
                self.accumulators[name] = np.zeros_like(param)

            self.accumulators[name] += grad ** 2

            # Combined adaptation: per-param (AdaGrad) + global (decay)
            adapted_lr = current_lr / (np.sqrt(self.accumulators[name]) + self.eps)
            updated[name] = param - adapted_lr * grad

        return updated


class SparseAdaGrad:
    """AdaGrad optimized for sparse gradients."""

    def __init__(self, lr=0.01, eps=1e-8):
        self.lr = lr
        self.eps = eps
        # Use dict of dicts for sparse accumulator storage
        # Only stores non-zero entries
        self.sparse_accumulators = {}
        self.steps = 0

    def step(self, params, sparse_grads):
        """
        sparse_grads: Dict mapping param names to (indices, values) tuples
            indices: array of updated indices
            values: array of gradient values at those indices
        """
        self.steps += 1
        updated = {}

        for name, param in params.items():
            if name not in sparse_grads:
                updated[name] = param
                continue

            indices, grad_values = sparse_grads[name]

            if name not in self.sparse_accumulators:
                # Store as dict: index -> accumulated value
                self.sparse_accumulators[name] = {}

            acc = self.sparse_accumulators[name]
            param_new = param.copy()

            for idx, gv in zip(indices, grad_values):
                # Only accumulate for non-zero gradients!
                if idx not in acc:
                    acc[idx] = 0.0
                acc[idx] += gv ** 2

                # Sparse update
                adapted_lr = self.lr / (np.sqrt(acc[idx]) + self.eps)
                param_new[idx] = param[idx] - adapted_lr * gv

            updated[name] = param_new

        return updated
```

AdaGrad represents a paradigm shift in optimization: from global to per-parameter learning rates, automatically adapted based on historical gradient information.
The Lineage So Far:
| Optimizer | Key Innovation |
|---|---|
| Gradient Descent | The foundation: follow negative gradient |
| Momentum | Memory: exponential average of past gradients |
| Nesterov | Lookahead: evaluate gradient at future position |
| AdaGrad | Adaptation: per-parameter learning rates |
What's Next:
AdaGrad's accumulation problem—learning rates that only decrease—makes it unsuitable for typical deep learning. The next page introduces RMSprop, Geoffrey Hinton's elegant fix: replace the sum of squared gradients with an exponential moving average. This allows the effective learning rate to both increase and decrease, adapting to the current (not historical) gradient regime.
RMSprop keeps AdaGrad's adaptation benefits while enabling sustained learning for arbitrarily long training runs.
You now understand AdaGrad: its motivation, mechanism, theoretical foundations, limitations, and appropriate use cases. Next, we'll see how RMSprop elegantly solves the accumulation problem by using exponential moving averages instead of sums.