Every optimizer we've studied so far shares a fundamental limitation: it applies the same learning rate to all parameters. Momentum makes all parameters faster. Nesterov makes all parameters look ahead. But what if different parameters need fundamentally different treatment?
Consider a word embedding matrix. The embedding for "the" is updated on nearly every batch. The embedding for "quixotic" might appear once per epoch. If we use a learning rate suitable for "the", updates to "quixotic" are too small—it never learns. If we use a learning rate suitable for "quixotic", updates to "the" are too large—it oscillates wildly.
This is the sparse gradient problem, and it's pervasive in machine learning: word embeddings, click-through-rate models with millions of rare features, and recommender systems all mix a handful of frequently-updated parameters with a long tail of rarely-updated ones.
In 2011, John Duchi, Elad Hazan, and Yoram Singer proposed AdaGrad (Adaptive Gradient), which elegantly solves this problem by maintaining a per-parameter sum of squared gradients and using it to scale individual learning rates. Parameters with large cumulative gradients get smaller learning rates; parameters with small cumulative gradients get larger learning rates.
This simple idea launched the entire field of adaptive optimization.
By the end of this page, you will understand AdaGrad from first principles: the adaptation mechanism, the mathematical formulation, the regret analysis that motivated it, the sparse gradient handling, and critically—why its accumulation leads to learning rate decay that both helps and eventually hurts.
Let's make the sparse gradient problem concrete with a detailed example.
Example: Training Word Embeddings
Consider training a 100,000-word vocabulary with 300-dimensional embeddings (30 million parameters just for embeddings). In a typical training batch, a common word like "the" contributes a gradient to its embedding on essentially every example, while a rare word like "quixotic" may receive a gradient only once per epoch.
With a global learning rate α = 0.1, both embeddings take the same size step: the frequent word is overtrained (potentially oscillating), while the rare word barely moves.
The Geometric Intuition:
Imagine the loss surface as a function of two parameters: θ₁ lies along a steep direction and receives large gradients, while θ₂ lies along a shallow direction and receives small gradients.
With a global learning rate, we must choose between a small rate that keeps θ₁ stable but leaves θ₂ crawling, and a large rate that moves θ₂ but sends θ₁ oscillating across the valley (a minimal sketch of this failure follows the table below).
We need α₁ = 0.01 and α₂ = 0.1 simultaneously—different learning rates for different parameters.
| Parameter Type | Gradient Size | Ideal α | With Global α=0.1 |
|---|---|---|---|
| Frequent features | Large (~10) | Small (~0.01) | Overshoots, oscillates |
| Rare features | Small (~0.1) | Large (~0.5) | Barely moves, undertrained |
| Dense layers | Medium (~1) | Medium (~0.1) | Works reasonably |
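To make the table concrete, here is a minimal sketch of the failure mode; the quadratic and its curvatures are invented so that the two coordinates see gradients of roughly the sizes listed above:

```python
import numpy as np

# Toy quadratic: L(θ) = 0.5 * (100·θ₁² + 0.1·θ₂²)
# At the starting point the gradients are ≈10 (steep, "frequent-feature-like")
# and ≈0.1 (shallow, "rare-feature-like"), matching the table's rows.
curvatures = np.array([100.0, 0.1])
theta = np.array([0.1, 1.0])

alpha = 0.1  # one global learning rate for both coordinates
for _ in range(50):
    grad = curvatures * theta
    theta = theta - alpha * grad

print(theta)
# θ₁ oscillates and diverges (its per-step multiplier is 1 - 0.1·100 = -9),
# while θ₂ has barely moved (multiplier 1 - 0.1·0.1 = 0.99, so 0.99⁵⁰ ≈ 0.61).
```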
AdaGrad's Insight:
The gradient magnitude itself tells us how to scale! If a parameter receives large gradients, we should use a smaller learning rate. If it receives small gradients, we should use a larger learning rate.
But we can't just use the current gradient magnitude—that's too noisy. Instead, AdaGrad accumulates the sum of squared gradients over all time, creating a stable estimate of the typical gradient scale for each parameter.
This sum gives us a denominator that normalizes updates: frequently-updated parameters accumulate large sums (getting smaller effective learning rates), while rarely-updated parameters accumulate small sums (retaining larger effective learning rates).
AdaGrad divides each parameter's learning rate by the root-sum-of-squared-gradients for that parameter. Large accumulated gradients → small learning rate. Small accumulated gradients → large learning rate. Adaptation happens automatically!
Let's formalize AdaGrad precisely. For parameters θ ∈ ℝⁿ, learning rate α, and small constant ε for numerical stability:
AdaGrad Update Rule:
$$g_t = \nabla L(\theta_{t-1})$$
$$G_t = G_{t-1} + g_t \odot g_t$$
$$\theta_t = \theta_{t-1} - \frac{\alpha}{\sqrt{G_t} + \epsilon} \odot g_t$$
Where:
- g_t is the gradient at step t
- G_t accumulates the element-wise squares of all gradients seen so far (with G₀ = 0)
- ⊙ denotes element-wise (Hadamard) multiplication; the square root and division are also applied element-wise
- ε (typically 1e-8) prevents division by zero
Unpacking the Notation:
For a single parameter θᵢ:
$$G_{t,i} = \sum_{\tau=1}^{t} g_{\tau,i}^2$$
$$\theta_{t,i} = \theta_{t-1,i} - \frac{\alpha}{\sqrt{G_{t,i}} + \epsilon} g_{t,i}$$
The effective learning rate for parameter i is:
$$\alpha_{t,i}^{\text{eff}} = \frac{\alpha}{\sqrt{G_{t,i}} + \epsilon}$$
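As a quick numerical check with α = 0.1 (and ε small enough to ignore), compare a parameter whose accumulator has grown to 100 with one whose accumulator is only 0.04:

$$\frac{0.1}{\sqrt{100}} = 0.01 \qquad \text{versus} \qquad \frac{0.1}{\sqrt{0.04}} = 0.5$$

The heavily-updated parameter lands near the "small" ideal rate from the earlier table, while the rarely-updated one keeps a rate fifty times larger.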
Key Properties:
- Each parameter gets its own effective learning rate; no per-parameter tuning is required.
- G_t is a running sum of non-negative terms, so it never decreases and the effective learning rates only shrink.
- Memory and compute overhead are O(n): one extra accumulator value per parameter.
- A parameter that receives zero gradient on a step keeps both its value and its accumulator unchanged, which is what makes sparse gradients work well.
```python
import numpy as np

def adagrad_update(theta, gradient, G, alpha=0.01, epsilon=1e-8):
    """
    Single AdaGrad update step.

    Args:
        theta: Parameter vector (n,)
        gradient: Current gradient (n,)
        G: Accumulated squared gradients (n,), modified in-place
        alpha: Base learning rate
        epsilon: Numerical stability constant

    Returns:
        Updated theta
    """
    # Accumulate squared gradients (element-wise)
    G += gradient ** 2

    # Compute adapted learning rate per parameter
    # Large G[i] → small effective lr for parameter i
    # Small G[i] → large effective lr for parameter i
    adapted_lr = alpha / (np.sqrt(G) + epsilon)

    # Update parameters
    theta_new = theta - adapted_lr * gradient

    return theta_new


# Example: Sparse gradient scenario
np.random.seed(42)

# Two parameters: one receives frequent updates, one receives sparse updates
theta = np.array([0.0, 0.0])
G = np.array([0.0, 0.0])  # Accumulated squared gradients
alpha = 1.0  # Larger base rate since AdaGrad will reduce it

# Simulate 1000 steps
for step in range(1000):
    # Parameter 0: always gets gradient ~1
    # Parameter 1: gets gradient ~1 only 1% of the time
    g0 = np.random.randn() * 1.0
    g1 = np.random.randn() * 1.0 if np.random.rand() < 0.01 else 0.0

    gradient = np.array([g0, g1])
    theta = adagrad_update(theta, gradient, G, alpha)

print(f"Accumulated G: [{G[0]:.1f}, {G[1]:.1f}]")
print(f"Effective LRs: [{alpha/(np.sqrt(G[0])+1e-8):.6f}, {alpha/(np.sqrt(G[1])+1e-8):.6f}]")
# Parameter 0 (frequent): G >> 0, effective LR << 1
# Parameter 1 (sparse):   G ~ 0,  effective LR ~ 1
```

The Diagonal Approximation:
Technically, AdaGrad approximates an adaptive preconditioner. The "full" AdaGrad would maintain:
$$H_t = \sum_{\tau=1}^{t} g_\tau g_\tau^T$$
This n×n matrix captures gradient covariance, allowing updates that account for parameter interactions. The update would be:
$$\theta_t = \theta_{t-1} - \alpha H_t^{-1/2} g_t$$
But storing and inverting an n×n matrix is prohibitive for large n. The practical AdaGrad uses a diagonal approximation: only store the diagonal of H, which is exactly Gₜ. This loses covariance information but makes the algorithm O(n) in space and time.
Modern optimizers like Adam and variants also use this diagonal approximation. Full-matrix methods (K-FAC, natural gradient) recapture some covariance structure.
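Here is a minimal sketch of the contrast on a toy problem small enough that the full n×n matrix is affordable; the inverse matrix square root is computed via eigendecomposition purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 3, 200
alpha, eps = 0.1, 1e-8

H = np.zeros((n, n))  # full accumulator: Σ g gᵀ
G = np.zeros(n)       # diagonal accumulator: Σ g² (what practical AdaGrad stores)

for _ in range(T + 1):
    g = rng.normal(size=n)  # the last g plays the role of the current gradient g_t
    H += np.outer(g, g)
    G += g ** 2

# Full-matrix step: α · H^{-1/2} g, via eigendecomposition of the symmetric H
eigvals, eigvecs = np.linalg.eigh(H)
H_inv_sqrt = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
full_step = alpha * (H_inv_sqrt @ g)

# Diagonal step: α / (√G + ε) ⊙ g, i.e. the update rule given earlier
diag_step = alpha / (np.sqrt(G) + eps) * g

print("diag(H) equals G:", np.allclose(np.diag(H), G))  # True by construction
print("full-matrix step:", full_step)
print("diagonal step:   ", diag_step)
# With uncorrelated gradients the two steps are close; with correlated gradients
# the full-matrix version also rotates the update, at O(n²) memory and O(n³) cost.
```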
AdaGrad wasn't designed heuristically—it has rigorous theoretical foundations in online convex optimization. Understanding this gives insight into why the algorithm works and when it's appropriate.
Online Learning Setup:
At each time step t, we choose parameters θ_t, the environment reveals a convex loss function L_t, and we incur the loss L_t(θ_t).
The goal is to minimize regret: the difference between total loss incurred and the loss of the best fixed parameters in hindsight.
$$\text{Regret}_T = \sum_{t=1}^{T} L_t(\theta_t) - \min_{\theta^*} \sum_{t=1}^{T} L_t(\theta^*)$$
Regret Bounds:
For standard online gradient descent with learning rate α = O(1/√T):
$$\text{Regret}_T = O(\sqrt{T})$$
AdaGrad achieves the same O(√T) regret, but with a crucial improvement for sparse gradients.
AdaGrad's Data-Dependent Bound:
AdaGrad's regret depends on the actual gradient magnitudes observed, not their worst-case bounds:
$$\text{Regret}_T = O\left(\sum_{i=1}^{n} \sqrt{\sum_{t=1}^{T} g_{t,i}^2}\right)$$
Compare to standard bounds using the maximum gradient G_∞:
$$\text{Standard bound: } O(\sqrt{T} \cdot G_\infty \cdot \sqrt{n})$$
$$\text{AdaGrad bound: } O\left(\sum_i \sqrt{\sum_t g_{t,i}^2}\right)$$
If gradients are sparse (many g_{t,i} = 0), AdaGrad's bound is much tighter!
In sparse settings, AdaGrad can be exponentially better than non-adaptive methods in terms of regret. A feature that's only active 1% of the time contributes √(0.01T) to AdaGrad's bound instead of √T—a 10× improvement!
Worked Example: Sparse vs Dense Regret
Consider n = 100 features and T = 10,000 steps, with each nonzero gradient entry of size roughly 1:
Standard SGD regret: the bound √T · G_∞ · √n uses a worst-case gradient bound, so it is on the order of 100 · 10 · 10 = 10,000 regardless of how sparse the data actually is.
AdaGrad regret: with dense gradients, each coordinate contributes √(Σₜ g²ₜ,ᵢ) ≈ √T = 100, so the bound is about 100 · 100 = 10,000, no better than the standard bound (and, with its hidden constants, typically a little worse). But if each feature is active only 1% of the time, each coordinate contributes only √(0.01 · T) = 10, and the bound drops to about 100 · 10 = 1,000.
Wait—AdaGrad is worse on dense data? Yes, but only by constant factors. On sparse data, AdaGrad matches the effectively dense lower dimension. The adaptive learning rate automatically "discovers" the effective dimensionality.
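The calculation above can be checked by plugging simulated gradients directly into the two bound expressions. This is a sketch of the bounds themselves (constants dropped, not measured regret), with the 1% activity rate from the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 100, 10_000

# Dense: every coordinate sees a gradient of size ~1 at every step.
dense = rng.normal(size=(T, n))
# Sparse: each coordinate is active only ~1% of the time.
sparse = rng.normal(size=(T, n)) * (rng.random(size=(T, n)) < 0.01)

def adagrad_bound(g):
    # Data-dependent bound: Σ_i √(Σ_t g_{t,i}²)
    return np.sum(np.sqrt(np.sum(g ** 2, axis=0)))

# Non-adaptive bound: √T · G_∞ · √n with a worst-case a-priori bound G_∞ ≈ √n.
# It cannot "see" that most gradient entries are zero in the sparse case.
worst_case = np.sqrt(T) * np.sqrt(n) * np.sqrt(n)

print(f"worst-case bound (dense or sparse): {worst_case:,.0f}")
print(f"AdaGrad bound on dense data:        {adagrad_bound(dense):,.0f}")
print(f"AdaGrad bound on sparse data:       {adagrad_bound(sparse):,.0f}")
# Dense:  AdaGrad's bound is the same order as the worst-case bound.
# Sparse: AdaGrad's bound shrinks by roughly √(activity rate) = √0.01, i.e. 10×.
```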
The Implicit Regularization View:
Another interpretation: AdaGrad implicitly regularizes parameters proportionally to how much they've been updated. Frequently-updated parameters have more inertia against further change; rarely-updated parameters remain flexible. This is a form of Bayesian shrinkage.
AdaGrad has a critical weakness that limits its applicability: the learning rate only decreases, never recovers.
The Mechanism:
Since Gₜ is a sum of non-negative values, it can only increase:
$$G_t = G_{t-1} + g_t^2 \geq G_{t-1}$$
Therefore, the effective learning rate α/(√Gₜ + ε) can only decrease.
Why This Is a Problem:
In online convex optimization (AdaGrad's design domain), this is actually desirable—the learning rate should decay as we approach optimality.
But in deep learning, the objective is non-convex, training runs for hundreds of thousands of steps, and the optimizer may still need to make large moves late in training, for example to escape plateaus or to keep adapting as it enters new regions of the loss surface. An effective learning rate that has already decayed toward zero cannot do any of that.
Empirical Observation:
After sufficient training steps, AdaGrad's effective learning rate becomes negligibly small:
```python
import numpy as np

def track_adagrad_lr(steps, gradient_variance=1.0, alpha=0.1, epsilon=1e-8):
    """
    Track how AdaGrad's effective learning rate decays over time.
    """
    G = 0.0  # Accumulated squared gradients
    effective_lrs = []

    for t in range(1, steps + 1):
        # Simulate gradient with given variance
        g = np.random.randn() * np.sqrt(gradient_variance)

        # Accumulate
        G += g ** 2

        # Effective learning rate
        effective_lr = alpha / (np.sqrt(G) + epsilon)
        effective_lrs.append(effective_lr)

    return effective_lrs


# Track over 100,000 steps
steps = 100_000
lrs = track_adagrad_lr(steps)

print("Effective learning rate decay:")
print(f"  Step 100:     {lrs[99]:.6f}")
print(f"  Step 1,000:   {lrs[999]:.6f}")
print(f"  Step 10,000:  {lrs[9999]:.6f}")
print(f"  Step 100,000: {lrs[99999]:.6f}")

# Output shows dramatic decay:
#   Step 100:     0.010000
#   Step 1,000:   0.003162  (3x smaller)
#   Step 10,000:  0.001000  (10x smaller)
#   Step 100,000: 0.000316  (32x smaller)

# The learning rate decays as O(1/sqrt(t))
# After 1M steps, it's ~0.0001 - essentially no learning!
```

Theoretical Decay Rate:
For gradients with variance σ², the accumulated sum at step T is:
$$G_T \approx T \cdot \sigma^2$$
So the effective learning rate at step T is:
$$\alpha_T^{\text{eff}} = \frac{\alpha}{\sqrt{G_T} + \epsilon} \approx \frac{\alpha}{\sigma\sqrt{T}}$$
This O(1/√T) decay is too aggressive for non-convex optimization where we need sustained learning capability.
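Plugging the simulation's values (σ = 1, α = 0.1, T = 10,000) into this formula gives

$$\alpha_{10{,}000}^{\text{eff}} \approx \frac{0.1}{1 \cdot \sqrt{10{,}000}} = 0.001,$$

which matches the value printed at step 10,000 in the code above.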
When the Decay Is Acceptable: convex problems (AdaGrad's design domain), sparse-feature models such as large-scale logistic regression, and relatively short runs where the accumulator never grows large enough to stall progress.
When the Decay Is Fatal: long, non-convex deep learning runs, where sustained learning capability is needed for many epochs; once the accumulator is large, the effective learning rate has collapsed and training effectively stops.
AdaGrad's strength (adapting to gradient magnitudes) creates its weakness (relentless accumulation). Future optimizers—RMSprop and Adam—solve this by using exponential moving averages instead of sums, allowing the effective learning rate to both decrease and increase.
Here's a production-quality AdaGrad implementation with full features.
```python
import numpy as np
from typing import Dict, Optional


class AdaGrad:
    """
    AdaGrad (Adaptive Gradient) optimizer.

    Maintains per-parameter sum of squared gradients for adaptive
    learning rate scaling. Parameters with large accumulated gradients
    get smaller learning rates.

    Update rule:
        G_t = G_{t-1} + g_t²  (element-wise)
        θ_t = θ_{t-1} - α / (√G_t + ε) * g_t

    Reference:
        Duchi, Hazan, Singer. "Adaptive Subgradient Methods for Online
        Learning and Stochastic Optimization." JMLR 2011.
    """

    def __init__(
        self,
        learning_rate: float = 0.01,
        weight_decay: float = 0.0,
        initial_accumulator_value: float = 0.0,
        epsilon: float = 1e-8,
    ):
        """
        Args:
            learning_rate: Base learning rate α
            weight_decay: L2 regularization coefficient
            initial_accumulator_value: Initial value for G (can help
                stability; 0.1 is sometimes used)
            epsilon: Small constant for numerical stability
        """
        if learning_rate < 0:
            raise ValueError(f"Learning rate must be non-negative, got {learning_rate}")
        if epsilon <= 0:
            raise ValueError(f"Epsilon must be positive, got {epsilon}")

        self.lr = learning_rate
        self.weight_decay = weight_decay
        self.initial_accumulator = initial_accumulator_value
        self.eps = epsilon

        # State: accumulated squared gradients per parameter
        self.accumulators: Dict[str, np.ndarray] = {}
        self.steps = 0

    def step(
        self,
        parameters: Dict[str, np.ndarray],
        gradients: Dict[str, np.ndarray],
    ) -> Dict[str, np.ndarray]:
        """
        Perform one AdaGrad step.

        Args:
            parameters: Dict mapping names to parameter arrays
            gradients: Dict mapping names to gradient arrays

        Returns:
            Updated parameters dict
        """
        self.steps += 1
        updated = {}

        for name, param in parameters.items():
            if name not in gradients:
                updated[name] = param
                continue

            grad = gradients[name].copy()

            # Apply weight decay (L2 regularization)
            if self.weight_decay > 0:
                grad = grad + self.weight_decay * param

            # Initialize accumulator if needed
            if name not in self.accumulators:
                self.accumulators[name] = np.full_like(
                    param, self.initial_accumulator, dtype=np.float64
                )

            acc = self.accumulators[name]

            # Accumulate squared gradients (element-wise)
            acc += grad ** 2

            # Compute adapted update: α / (√G + ε) * g
            adapted_lr = self.lr / (np.sqrt(acc) + self.eps)
            updated[name] = param - adapted_lr * grad

        return updated

    def get_effective_learning_rates(self) -> Dict[str, np.ndarray]:
        """Get current per-parameter effective learning rates."""
        return {
            name: self.lr / (np.sqrt(acc) + self.eps)
            for name, acc in self.accumulators.items()
        }

    def get_accumulator_stats(self) -> Dict[str, Dict[str, float]]:
        """Get statistics about accumulated gradients."""
        stats = {}
        for name, acc in self.accumulators.items():
            stats[name] = {
                'min': float(np.min(acc)),
                'max': float(np.max(acc)),
                'mean': float(np.mean(acc)),
                'nonzero_frac': float(np.mean(acc > 1e-10)),
            }
        return stats

    def state_dict(self) -> Dict:
        """Save optimizer state."""
        return {
            'accumulators': {k: v.copy() for k, v in self.accumulators.items()},
            'steps': self.steps,
            'lr': self.lr,
        }

    def load_state_dict(self, state: Dict):
        """Restore optimizer state."""
        self.accumulators = {k: v.copy() for k, v in state['accumulators'].items()}
        self.steps = state['steps']


# Demonstration: Sparse gradient handling
if __name__ == "__main__":
    optimizer = AdaGrad(learning_rate=1.0)

    # Simulate embeddings: 10 words, 5-dim each
    n_words = 10
    dim = 5
    embeddings = np.random.randn(n_words, dim) * 0.1
    params = {"embeddings": embeddings}

    # Simulate 1000 training steps with sparse gradients
    # Words 0,1,2 are frequent; words 7,8,9 are rare
    np.random.seed(42)
    for step in range(1000):
        grad = np.zeros_like(embeddings)

        # Frequent words (80% chance each step)
        for word in [0, 1, 2]:
            if np.random.rand() < 0.8:
                grad[word] = np.random.randn(dim) * 0.5

        # Rare words (2% chance each step)
        for word in [7, 8, 9]:
            if np.random.rand() < 0.02:
                grad[word] = np.random.randn(dim) * 0.5

        params = optimizer.step(params, {"embeddings": grad})

    # Check effective learning rates
    eff_lr = optimizer.get_effective_learning_rates()["embeddings"]
    print("Effective learning rates by word:")
    for i in range(n_words):
        print(f"  Word {i}: {eff_lr[i].mean():.4f} (mean across dims)")

    # Frequent words have LOWER effective lr (more accumulated gradients)
    # Rare words have HIGHER effective lr (less accumulated gradients)
    # This is the adaptive behavior we want!
```

Despite its learning rate decay problem, AdaGrad remains valuable in specific scenarios.
Production Use Cases:
Large-Scale Logistic Regression (Google)
AdaGrad found early large-scale use at Google for ad click-through-rate prediction. With billions of sparse features, AdaGrad's per-feature learning rates are essential. The convex objective (logistic loss) makes LR decay appropriate.
Word Embedding Training
Training Word2Vec, GloVe, or similar embeddings benefits from AdaGrad. Common words receive more updates but smaller learning rates; rare words receive fewer but larger updates. This balances representation quality across vocabulary frequencies.
Collaborative Filtering
Matrix factorization for recommender systems has sparse gradients (only observed user-item pairs). AdaGrad handles the varying update frequency naturally.
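Here is a minimal sketch of that setting, using a made-up squared-error matrix-factorization loop rather than any particular library's API; only the user and item factors touched by an observed rating receive gradients, so only their accumulators grow:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 50, 200, 8
U = rng.normal(scale=0.1, size=(n_users, k))  # user factors
V = rng.normal(scale=0.1, size=(n_items, k))  # item factors
G_U = np.zeros_like(U)                         # AdaGrad accumulators, one per parameter
G_V = np.zeros_like(V)
alpha, eps = 0.1, 1e-8

def adagrad_mf_pass(ratings):
    """One pass over observed (user, item, rating) triples with AdaGrad updates."""
    for u, i, r in ratings:
        err = U[u] @ V[i] - r              # squared-error residual
        g_u, g_v = err * V[i], err * U[u]  # gradients touch only row u of U and row i of V
        G_U[u] += g_u ** 2
        G_V[i] += g_v ** 2
        U[u] -= alpha / (np.sqrt(G_U[u]) + eps) * g_u
        V[i] -= alpha / (np.sqrt(G_V[i]) + eps) * g_v

# Skewed item popularity: low-index items are rated far more often than the long tail.
ratings = [(int(rng.integers(n_users)),
            int(min(rng.geometric(0.05) - 1, n_items - 1)),
            float(rng.normal()))
           for _ in range(5000)]
adagrad_mf_pass(ratings)

counts = np.bincount([i for _, i, _ in ratings], minlength=n_items)
head = int(np.argmax(counts))
tail = int(np.argmin(np.where(counts > 0, counts, counts.max())))  # least-rated item that was rated
print(f"head item rated {counts[head]}x, mean effective lr ≈ "
      f"{(alpha / (np.sqrt(G_V[head]) + eps)).mean():.3f}")
print(f"tail item rated {counts[tail]}x, mean effective lr ≈ "
      f"{(alpha / (np.sqrt(G_V[tail]) + eps)).mean():.3f}")
# The frequently-rated item ends up with a much smaller effective learning rate
# than the rarely-rated one, which is exactly the adaptation described above.
```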
Tuning Guidance:
- Use a larger base learning rate than you would for SGD (often 0.01-1.0), since the accumulator shrinks it quickly.
- Leave ε at its default (around 1e-8); it only matters before the accumulator has grown.
- A small initial accumulator value (e.g. 0.1) can stabilize the first few steps, as in the implementation above.
- Monitor the effective learning rates; if they have collapsed toward zero and the loss has plateaued, AdaGrad's accumulation is likely the culprit.
In 2024, AdaGrad is rarely used directly in deep learning—RMSprop and Adam have supplanted it. But understanding AdaGrad is essential because: (1) it's still optimal for sparse/convex settings, (2) RMSprop/Adam are direct descendants that fix its accumulation problem, (3) the adaptive learning rate concept it pioneered is now standard.
AdaGrad spawned several variants addressing its limitations while preserving its benefits.
AdaDelta (Zeiler, 2012):
Replaces the unbounded accumulator with an exponential moving average:
$$E[g^2]_t = \rho E[g^2]_{t-1} + (1-\rho)g_t^2$$
Also introduces a "delta" accumulator for the numerator, removing the need to set a learning rate at all. Clever but not as widely adopted as RMSprop.
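Here is a minimal sketch of the AdaDelta update just described; ρ = 0.95 and ε = 1e-6 are commonly quoted defaults, but treat the exact values as assumptions:

```python
import numpy as np

def adadelta_update(theta, grad, state, rho=0.95, eps=1e-6):
    """One AdaDelta step. `state` holds the two EMAs: E[g²] and E[Δθ²]."""
    Eg2, Edx2 = state
    # EMA of squared gradients replaces AdaGrad's unbounded sum
    Eg2 = rho * Eg2 + (1 - rho) * grad ** 2
    # Step scaled by RMS(Δθ) / RMS(g): no explicit learning rate anywhere
    delta = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad
    # The "delta" accumulator: EMA of squared updates, used in the numerator
    Edx2 = rho * Edx2 + (1 - rho) * delta ** 2
    return theta + delta, (Eg2, Edx2)

# Usage: minimize a toy quadratic L(θ) = θ₁² + θ₂²
theta = np.array([5.0, -3.0])
state = (np.zeros_like(theta), np.zeros_like(theta))
for _ in range(2000):
    grad = 2 * theta
    theta, state = adadelta_update(theta, grad, state)
print(theta)  # both coordinates are driven toward 0; note the warm-up from tiny initial steps
```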
AdaGrad with Weight Decay:
Standard L2 regularization adds λθ to gradients. For AdaGrad, this is problematic—regularization gradients accumulate alongside data gradients. "Decoupled weight decay" applies regularization separately:
$$\theta_t = (1 - \lambda)\theta_{t-1} - \frac{\alpha}{\sqrt{G_t} + \epsilon} g_t$$
This idea becomes important in AdamW later.
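A minimal sketch contrasting the two forms, following the equation above rather than any specific library's API:

```python
import numpy as np

def adagrad_step_l2(theta, grad, G, alpha=0.01, wd=0.01, eps=1e-8):
    """Coupled L2: the λθ term is folded into the gradient, so it is also
    squared into G and rescaled by the adaptive denominator."""
    g = grad + wd * theta
    G += g ** 2
    return theta - alpha / (np.sqrt(G) + eps) * g

def adagrad_step_decoupled(theta, grad, G, alpha=0.01, wd=0.01, eps=1e-8):
    """Decoupled weight decay: shrink the weights directly, then apply the
    adaptive update to the data gradient alone (the idea behind AdamW)."""
    G += grad ** 2
    return (1 - wd) * theta - alpha / (np.sqrt(G) + eps) * grad
```

In the coupled version, the regularization term is divided by the same adaptive denominator as the data gradient, so heavily-updated parameters end up regularized less; decoupling keeps the decay strength uniform across parameters.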
AdaGrad with Learning Rate Decay:
Combine AdaGrad's per-parameter adaptation with a global schedule:
$$\alpha_t = \frac{\alpha_0}{1 + \gamma \cdot t}$$
This provides additional decay control, though it adds a hyperparameter.
Sparse AdaGrad:
For extremely sparse problems, only update accumulators for non-zero gradient entries. Saves memory and computation for very high-dimensional sparse problems.
```python
import numpy as np


class AdaGradWithDecay:
    """AdaGrad with additional global learning rate decay."""

    def __init__(self, lr=0.01, lr_decay=0.01, eps=1e-8):
        self.initial_lr = lr
        self.lr_decay = lr_decay
        self.eps = eps
        self.accumulators = {}
        self.steps = 0

    def step(self, params, grads):
        self.steps += 1

        # Global LR decay
        current_lr = self.initial_lr / (1 + self.lr_decay * self.steps)

        updated = {}
        for name, param in params.items():
            if name not in grads:
                updated[name] = param
                continue

            grad = grads[name]

            if name not in self.accumulators:
                self.accumulators[name] = np.zeros_like(param)

            self.accumulators[name] += grad ** 2

            # Combined adaptation: per-param (AdaGrad) + global (decay)
            adapted_lr = current_lr / (np.sqrt(self.accumulators[name]) + self.eps)
            updated[name] = param - adapted_lr * grad

        return updated


class SparseAdaGrad:
    """AdaGrad optimized for sparse gradients."""

    def __init__(self, lr=0.01, eps=1e-8):
        self.lr = lr
        self.eps = eps
        # Use dict of dicts for sparse accumulator storage
        # Only stores non-zero entries
        self.sparse_accumulators = {}
        self.steps = 0

    def step(self, params, sparse_grads):
        """
        sparse_grads: Dict mapping param names to (indices, values) tuples
            indices: array of updated indices
            values: array of gradient values at those indices
        """
        self.steps += 1
        updated = {}

        for name, param in params.items():
            if name not in sparse_grads:
                updated[name] = param
                continue

            indices, grad_values = sparse_grads[name]

            if name not in self.sparse_accumulators:
                # Store as dict: index -> accumulated value
                self.sparse_accumulators[name] = {}

            acc = self.sparse_accumulators[name]
            param_new = param.copy()

            for idx, gv in zip(indices, grad_values):
                # Only accumulate for non-zero gradients!
                if idx not in acc:
                    acc[idx] = 0.0
                acc[idx] += gv ** 2

                # Sparse update
                adapted_lr = self.lr / (np.sqrt(acc[idx]) + self.eps)
                param_new[idx] = param[idx] - adapted_lr * gv

            updated[name] = param_new

        return updated
```

AdaGrad represents a paradigm shift in optimization: from global to per-parameter learning rates, automatically adapted based on historical gradient information.
The Lineage So Far:
| Optimizer | Key Innovation |
|---|---|
| Gradient Descent | The foundation: follow negative gradient |
| Momentum | Memory: exponential average of past gradients |
| Nesterov | Lookahead: evaluate gradient at future position |
| AdaGrad | Adaptation: per-parameter learning rates |
What's Next:
AdaGrad's accumulation problem—learning rates that only decrease—makes it unsuitable for typical deep learning. The next page introduces RMSprop, Geoffrey Hinton's elegant fix: replace the sum of squared gradients with an exponential moving average. This allows the effective learning rate to both increase and decrease, adapting to the current (not historical) gradient regime.
RMSprop keeps AdaGrad's adaptation benefits while enabling sustained learning for arbitrarily long training runs.
You now understand AdaGrad: its motivation, mechanism, theoretical foundations, limitations, and appropriate use cases. Next, we'll see how RMSprop elegantly solves the accumulation problem by using exponential moving averages instead of sums.