In 2014, Diederik Kingma and Jimmy Ba proposed Adam (Adaptive Moment Estimation)—a synthesis of the best ideas from momentum-based and adaptive optimization. Adam combines: an exponential moving average of gradients (momentum, the first moment), an exponential moving average of squared gradients for per-parameter scaling (the RMSprop idea, the second moment), and bias correction for both estimates.
This combination proved remarkably robust across problem domains, hyperparameter settings, and architectures. Adam became—and remains—the default optimizer for deep learning.
Why Adam Dominates: it works well out of the box with its default hyperparameters, adapts step sizes per parameter automatically, tolerates noisy and sparse gradients, and transfers across architectures and problem domains with minimal tuning.
But Adam isn't without controversy. Questions about its convergence properties, its interaction with weight decay, and its generalization performance have spawned numerous variants: AdamW, NAdam, RAdam, AdamP, LAMB, and more.
This page covers Adam comprehensively, then explores the variants that address its limitations.
By the end of this page, you will understand Adam's full algorithm including bias correction, the rationale behind each component, the convergence concerns, and how AdamW, NAdam, RAdam, and other variants address specific issues.
Let's formalize Adam precisely.
Adam Update Rule:
For parameters θ, learning rate α, first moment decay β₁, second moment decay β₂, and stability constant ε:
$$g_t = \nabla L(\theta_{t-1})$$
First moment estimate (momentum): $$m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$$
Second moment estimate (RMSprop): $$v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$$
Bias correction: $$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$ $$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
Parameter update: $$\theta_t = \theta_{t-1} - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
Default Hyperparameters (from the paper): α = 0.001, β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸.
```python
import numpy as np


def adam_step(theta, gradient, m, v, t,
              alpha=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """
    Single Adam update step.

    Args:
        theta: Parameters
        gradient: Current gradient
        m: First moment estimate (EMA of gradients)
        v: Second moment estimate (EMA of squared gradients)
        t: Current step number (1-indexed, for bias correction)
        alpha: Learning rate
        beta1: First moment decay rate
        beta2: Second moment decay rate
        epsilon: Numerical stability constant

    Returns:
        Updated theta, m, v
    """
    # First moment estimate (momentum)
    m = beta1 * m + (1 - beta1) * gradient

    # Second moment estimate (RMSprop-like)
    v = beta2 * v + (1 - beta2) * gradient ** 2

    # Bias-corrected estimates
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    # Parameter update
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + epsilon)

    return theta, m, v


# Simple optimization example
def optimize_with_adam():
    # Minimize f(x) = (x - 3)² + (y + 2)²
    theta = np.array([0.0, 0.0])
    target = np.array([3.0, -2.0])
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    alpha = 0.1  # Higher LR for this simple problem

    for t in range(1, 201):
        gradient = 2 * (theta - target)
        theta, m, v = adam_step(theta, gradient, m, v, t, alpha=alpha)

        if t % 50 == 0:
            loss = np.sum((theta - target) ** 2)
            print(f"Step {t}: θ = [{theta[0]:.4f}, {theta[1]:.4f}], loss = {loss:.6f}")

    return theta


optimize_with_adam()
```

Component-by-Component Analysis:
First Moment (m_t): This is essentially momentum. The EMA of gradients smooths out noise and provides acceleration in consistent directions. Unlike classical momentum which uses the raw sum, Adam uses a normalized EMA.
Second Moment (v_t): This is RMSprop's adaptation mechanism. The EMA of squared gradients tracks the typical scale of each parameter's gradients, enabling per-parameter learning rate scaling.
The Update Ratio (m̂/√v̂): This is the key innovation. The update direction is determined by the momentum (m̂), but the step size is scaled by 1/√v̂. Parameters with large squared gradients get smaller steps; parameters with small squared gradients get larger steps.
Interpretation as Trust Region:
The ratio m̂ₜ,ᵢ/√v̂ₜ,ᵢ has magnitude roughly bounded by 1. To see why: in the stationary case m̂ₜ estimates E[g] and v̂ₜ estimates E[g²], and |E[g]| ≤ √E[g²] (Jensen's inequality), so |m̂ₜ/√v̂ₜ| ≲ 1 and each parameter moves by at most about α per step, regardless of the raw gradient scale.
This automatically constrains updates, providing implicit trust region behavior.
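As a quick sanity check, here is a small sketch (the gradient distribution and constants are illustrative, not from the paper) that tracks |m̂/√v̂| for a stream of noisy gradients whose raw magnitude is in the hundreds; the ratio stays around 1 or below, so each step moves a parameter by at most roughly α.

```python
import numpy as np

# Illustrative sketch: the Adam update ratio m̂/√v̂ stays near or below 1
# even when the raw gradient scale is large and noisy.
rng = np.random.default_rng(0)
beta1, beta2, eps = 0.9, 0.999, 1e-8
m, v = 0.0, 0.0
ratios = []

for t in range(1, 1001):
    g = rng.normal(loc=50.0, scale=200.0)  # large, noisy gradient scale
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    ratios.append(m_hat / (np.sqrt(v_hat) + eps))

print(f"max  |m̂/√v̂| = {np.max(np.abs(ratios)):.3f}")   # close to 1
print(f"mean |m̂/√v̂| = {np.mean(np.abs(ratios)):.3f}")  # well below 1
```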
Bias correction is Adam's often-overlooked but critical innovation. It addresses a fundamental problem with EMAs: initialization bias.
The Problem:
When we initialize m₀ = 0 and v₀ = 0, the EMAs are biased toward zero in early steps:
$$m_t = (1 - \beta_1) \sum_{i=1}^{t} \beta_1^{t-i} g_i$$
The expected value (assuming E[gᵢ] = μ for all i):
$$\mathbb{E}[m_t] = \mu \cdot (1 - \beta_1^t)$$
For β₁ = 0.9 and t = 1: E[m₁] = 0.1μ (only 10% of the true value!)
For β₁ = 0.9 and t = 10: E[m₁₀] ≈ 0.65μ (still a 35% bias!)
The second moment has the same issue with β₂, which is typically larger (0.999), making the bias last longer.
The Solution:
Divide by the factor causing the bias:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$ $$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
Now E[m̂ₜ] = μ regardless of t.
```python
import numpy as np
import matplotlib.pyplot as plt


def demonstrate_bias_correction(beta=0.9, true_mean=1.0, steps=100):
    """Show why bias correction matters."""
    # Constant "gradient" for clear demonstration
    g = true_mean

    # Without bias correction
    m = 0.0
    uncorrected = []

    # With bias correction
    m_corrected = 0.0
    corrected = []

    for t in range(1, steps + 1):
        m = beta * m + (1 - beta) * g
        m_hat = m / (1 - beta ** t)
        uncorrected.append(m)
        corrected.append(m_hat)

    print(f"With β = {beta}, true mean = {true_mean}")
    print("\nStep | Uncorrected | Corrected | Correction Factor")
    print("-" * 55)
    for t in [1, 2, 5, 10, 20, 50, 100]:
        factor = 1 / (1 - beta ** t)
        print(f"{t:4d} | {uncorrected[t-1]:11.4f} | {corrected[t-1]:9.4f} | {factor:.4f}")

    print(f"\nAfter 100 steps, uncorrected = {uncorrected[-1]:.4f} (should be 1.0)")
    print(f"After 100 steps, corrected = {corrected[-1]:.4f} (should be 1.0)")


# Demonstrate for β₁ = 0.9 (first moment)
print("=== First Moment (β₁ = 0.9) ===")
demonstrate_bias_correction(beta=0.9)

print("\n=== Second Moment (β₂ = 0.999) ===")
demonstrate_bias_correction(beta=0.999)

# Note: With β = 0.999, it takes ~6900 steps to get within 0.1% of true value!
# Bias correction is essential for the second moment.
```

Practical Impact:
Without bias correction: v is underestimated in early steps, so the effective step size α/√v is too large and the first updates can be erratic or destabilizing.
With bias correction: the moment estimates are unbiased from step 1, so early updates are properly scaled and training starts smoothly without an artificially small initial learning rate.
Memory-Efficient Implementation:
Recomputing β^t by exponentiation at every step is unnecessary. Instead, track running factors:
```python
import numpy as np


# Efficient bias correction without computing β^t
class EfficientAdam:
    def __init__(self, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.alpha = alpha
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps

        # Track running β^t factors
        self.beta1_power = 1.0  # Will become β₁^t
        self.beta2_power = 1.0  # Will become β₂^t

        self.m = None
        self.v = None
        self.t = 0

    def step(self, theta, gradient):
        self.t += 1

        # Update running powers BEFORE using them
        self.beta1_power *= self.beta1
        self.beta2_power *= self.beta2
        # Now beta1_power = β₁^t, beta2_power = β₂^t

        if self.m is None:
            self.m = np.zeros_like(theta)
            self.v = np.zeros_like(theta)

        # EMA updates
        self.m = self.beta1 * self.m + (1 - self.beta1) * gradient
        self.v = self.beta2 * self.v + (1 - self.beta2) * gradient ** 2

        # Efficient bias correction (no exponentiation)
        m_hat = self.m / (1 - self.beta1_power)
        v_hat = self.v / (1 - self.beta2_power)

        return theta - self.alpha * m_hat / (np.sqrt(v_hat) + self.eps)
```

RMSprop without bias correction can have unstable early training—the underestimated v leads to overly large updates. Adam's bias-corrected v̂ provides more stable early-training dynamics. This is one reason Adam became preferred over RMSprop.
Despite Adam's practical success, theoretical analysis revealed concerning properties.
The Non-Convergence Paper (Reddi et al., 2018):
"On the Convergence of Adam and Beyond" constructed simple convex optimization problems where Adam fails to converge to the optimum.
The Intuition:
Consider a parameter that occasionally receives large, informative gradients but usually receives small ones. In Adam: the rare large gradient briefly inflates v, shrinking the step exactly when the signal is strongest, and its influence then decays out of the EMA—so the frequent small gradients (which may point the wrong way) end up dominating the accumulated updates.
A Simple Failure Case:
$$f_t(x) = \begin{cases} Cx & \text{with probability } p \\ -x & \text{with probability } 1-p \end{cases}$$
For appropriate C and small p, Adam's momentum can dominate, pushing away from x* = 0 faster than the adaptation mechanism can correct.
AMSGrad Fix:
Reddi et al. proposed AMSGrad: keep a running maximum of v rather than just the EMA:
$$\hat{v}_t = \max(\hat{v}_{t-1}, v_t)$$
This ensures the denominator never decreases, providing convergence guarantees.
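Here is a minimal sketch of that change (not the authors' reference code): it mirrors the adam_step function above, taking the max over the bias-corrected second moment, which matches the AMSGrad option in the full implementation later on this page.

```python
import numpy as np

def amsgrad_step(theta, gradient, m, v, v_max, t,
                 alpha=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One AMSGrad step: identical to adam_step except for v_max (sketch)."""
    m = beta1 * m + (1 - beta1) * gradient
    v = beta2 * v + (1 - beta2) * gradient ** 2

    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    # The AMSGrad change: a non-decreasing denominator
    v_max = np.maximum(v_max, v_hat)

    theta = theta - alpha * m_hat / (np.sqrt(v_max) + epsilon)
    return theta, m, v, v_max
```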
In practice, AMSGrad rarely outperforms Adam. The pathological cases are artificial; real deep learning losses don't exhibit these patterns.
| Optimizer | Convex Guarantee | Non-Convex Behavior | Practical Performance |
|---|---|---|---|
| SGD | Converges (with schedule) | Local minima/saddles | Good (with tuning) |
| Adam | May not converge | Works well empirically | Excellent (default) |
| AMSGrad | Converges | Similar to Adam | No real improvement |
| AdamW | Not analyzed thoroughly | Works well empirically | State-of-art for many tasks |
The Practical Perspective:
Deep learning losses are highly non-convex; we don't seek global optima anyway. What matters is: how quickly the optimizer reaches a good solution, how robust it is to hyperparameter and architecture choices, and how well the solutions it finds perform in practice.
Adam excels at all three for most problems, despite the theoretical concerns.
The Generalization Gap:
A more concerning observation: models trained with Adam often generalize worse than those trained with SGD+momentum (especially in vision). Hypotheses: the per-parameter scaling may steer optimization toward sharper minima that generalize poorly; Adam's adaptivity reduces the implicit regularization provided by SGD's gradient noise; and L2 regularization interacts badly with the adaptive denominator, weakening its effect on exactly the weights that need it most.
This motivated the AdamW variant, covered next.
Adam's theoretical non-convergence results don't translate to practical failures in deep learning. The constructed failure cases are pathological. However, the generalization gap compared to SGD is real and motivates using AdamW or carefully tuned SGD for production models when test performance is critical.
AdamW, introduced by Loshchilov and Hutter in 2017, addresses a subtle but important issue with L2 regularization in adaptive optimizers.
The Problem: L2 ≠ Weight Decay in Adam
In SGD, L2 regularization and weight decay are equivalent:
$$\nabla_{\text{L2}}(\theta) = \nabla L(\theta) + \lambda \theta$$ $$\theta_{t+1} = \theta_t - \alpha(\nabla L(\theta_t) + \lambda \theta_t) = \theta_t - \alpha\nabla L(\theta_t) - \alpha\lambda \theta_t$$
The last term is "weight decay"—shrinking weights by a fixed fraction each step.
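A quick numeric check of this equivalence, with made-up values for the gradient, learning rate, and λ: one SGD step with the L2 term folded into the gradient produces the same parameters as a plain gradient step followed by weight decay.

```python
import numpy as np

# Illustrative values
theta = np.array([1.5, -2.0])
grad = np.array([0.3, 0.7])   # ∇L(θ) without any penalty
alpha, lam = 0.1, 0.01

# (a) L2 regularization: penalty folded into the gradient
theta_l2 = theta - alpha * (grad + lam * theta)

# (b) Weight decay: plain gradient step, then shrink weights by α·λ·θ
theta_wd = theta - alpha * grad - alpha * lam * theta

print(np.allclose(theta_l2, theta_wd))  # True — identical for SGD
```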
In Adam, this equivalence breaks:
With L2 regularization, the gradient becomes g + λθ. This modified gradient affects both moments:
$$m_t = \beta_1 m_{t-1} + (1-\beta_1)(g_t + \lambda\theta_{t-1})$$ $$v_t = \beta_2 v_{t-1} + (1-\beta_2)(g_t + \lambda\theta_{t-1})^2$$
The regularization term λθ is scaled by the adaptive learning rate 1/√v̂. Large weights get less regularization (they have larger gradients → larger v → smaller effective regularization). This is backwards!
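The following sketch isolates that effect (illustrative numbers; it ignores the momentum term and looks only at the shrinkage each approach applies in a single step): under Adam+L2 the weight with large typical gradients barely decays, while AdamW shrinks both weights equally.

```python
import numpy as np

# Two weights of equal size; param 0 sees large gradients, param 1 small.
theta = np.array([1.0, 1.0])
v_hat = np.array([100.0, 0.01])   # second-moment estimates (illustrative)
lam, alpha, eps = 0.01, 1e-3, 1e-8

# Adam + L2: the decay term λθ passes through the adaptive denominator
l2_shrink = alpha * (lam * theta) / (np.sqrt(v_hat) + eps)

# AdamW: decoupled decay, independent of v
adamw_shrink = alpha * lam * theta

print(l2_shrink)     # ≈ [1e-06, 1e-04] — the high-gradient weight barely decays
print(adamw_shrink)  # ≈ [1e-05, 1e-05] — uniform decay
```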
AdamW's Solution: Decouple Weight Decay
Apply weight decay after the Adam update, not to the gradient:
$$g_t = \nabla L(\theta_{t-1})$$ (no L2 term!) $$m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t$$ $$v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2$$ $$\theta_t = \theta_{t-1} - \alpha\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} - \alpha\lambda\theta_{t-1}$$
Now weight decay is applied uniformly, not scaled by the adaptive denominator.
```python
import numpy as np
from typing import Dict


class AdamW:
    """
    AdamW: Adam with decoupled weight decay.

    Weight decay is applied directly to parameters, not through
    the gradient (which would be scaled by the adaptive learning rate).

    Reference: Loshchilov & Hutter.
    "Decoupled Weight Decay Regularization" (2017)
    """

    def __init__(
        self,
        learning_rate: float = 0.001,
        beta1: float = 0.9,
        beta2: float = 0.999,
        epsilon: float = 1e-8,
        weight_decay: float = 0.01,  # Note: larger than L2 default!
    ):
        self.lr = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = epsilon
        self.weight_decay = weight_decay

        self.m: Dict[str, np.ndarray] = {}
        self.v: Dict[str, np.ndarray] = {}
        self.t = 0

    def step(
        self,
        parameters: Dict[str, np.ndarray],
        gradients: Dict[str, np.ndarray],
    ) -> Dict[str, np.ndarray]:
        self.t += 1
        updated = {}

        for name, param in parameters.items():
            if name not in gradients:
                updated[name] = param
                continue

            grad = gradients[name]

            # Initialize moments
            if name not in self.m:
                self.m[name] = np.zeros_like(param)
                self.v[name] = np.zeros_like(param)

            # Standard Adam moment updates (NO L2 in grad!)
            self.m[name] = self.beta1 * self.m[name] + (1 - self.beta1) * grad
            self.v[name] = self.beta2 * self.v[name] + (1 - self.beta2) * grad ** 2

            # Bias correction
            m_hat = self.m[name] / (1 - self.beta1 ** self.t)
            v_hat = self.v[name] / (1 - self.beta2 ** self.t)

            # Adam step
            adam_update = self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

            # Decoupled weight decay (applied separately!)
            weight_decay_update = self.lr * self.weight_decay * param

            updated[name] = param - adam_update - weight_decay_update

        return updated


class AdamL2:
    """
    Standard Adam with L2 regularization (for comparison).

    L2 penalty is added to the gradient, then both moments see it.
    """

    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, l2=0.01):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.l2 = l2
        self.m = {}
        self.v = {}
        self.t = 0

    def step(self, parameters, gradients):
        self.t += 1
        updated = {}

        for name, param in parameters.items():
            if name not in gradients:
                updated[name] = param
                continue

            # L2 regularization ADDED TO GRADIENT
            grad = gradients[name] + self.l2 * param

            if name not in self.m:
                self.m[name] = np.zeros_like(param)
                self.v[name] = np.zeros_like(param)

            # Both moments see the L2 term (problematic!)
            self.m[name] = self.beta1 * self.m[name] + (1 - self.beta1) * grad
            self.v[name] = self.beta2 * self.v[name] + (1 - self.beta2) * grad ** 2

            m_hat = self.m[name] / (1 - self.beta1 ** self.t)
            v_hat = self.v[name] / (1 - self.beta2 ** self.t)

            updated[name] = param - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

        return updated
```

For new projects, use AdamW instead of Adam when regularization is needed. The decoupled weight decay provides more effective regularization and often leads to better generalization. The hyperparameter weight_decay in AdamW is typically 0.01-0.1 (larger than L2 coefficients because it's not scaled by gradient adaptation).
The success and limitations of Adam motivated numerous variants. Here are the most significant.
NAdam (Nesterov Adam):
Incorporates Nesterov momentum into Adam. Instead of using the current momentum m̂ₜ, NAdam uses a "lookahead" momentum:
$$\theta_t = \theta_{t-1} - \alpha \cdot \frac{\beta_1 \hat{m}_t + (1-\beta_1)g_t/(1-\beta_1^t)}{\sqrt{\hat{v}_t} + \epsilon}$$
The numerator combines the previous momentum (β₁m̂ₜ) with bias-corrected current gradient. This provides the same anticipatory behavior as Nesterov momentum.
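A simplified NAdam step matching the update above might look like the following sketch (Dozat's full formulation also schedules the momentum coefficient over time, which is omitted here):

```python
import numpy as np

def nadam_step(theta, gradient, m, v, t,
               alpha=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Simplified NAdam step following the update rule above (sketch)."""
    m = beta1 * m + (1 - beta1) * gradient
    v = beta2 * v + (1 - beta2) * gradient ** 2

    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    # Nesterov-style "lookahead" numerator: previous momentum plus a
    # bias-corrected share of the current gradient
    nesterov_m = beta1 * m_hat + (1 - beta1) * gradient / (1 - beta1 ** t)

    theta = theta - alpha * nesterov_m / (np.sqrt(v_hat) + epsilon)
    return theta, m, v
```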
RAdam (Rectified Adam):
Addresses variance issues in the second moment estimate during early training. The adaptive learning rate is unstable when v_t has high variance (early steps with few gradient samples).
RAdam uses a heuristic to compute the variance of the second moment estimator and disables adaptation when variance is high:
$$\rho_t = \rho_\infty - \frac{2t\beta_2^t}{1-\beta_2^t}$$
where ρ_∞ = 2/(1-β₂) - 1 (about 1999 for β₂ = 0.999) and ρ_t approximates the effective number of samples behind v_t.
When ρ_t > 4, the variance of the adaptive term is tractable and RAdam applies a rectified Adam update. When ρ_t ≤ 4 (the first few steps), it falls back to an SGD-with-momentum update and ignores v_t.
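A sketch of that logic follows (simplified from the RAdam paper; the rectification factor r_t is part of the full method, and constants follow the description above):

```python
import numpy as np

def radam_step(theta, gradient, m, v, t,
               alpha=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Sketch of an RAdam step, simplified from Liu et al. (2019)."""
    m = beta1 * m + (1 - beta1) * gradient
    v = beta2 * v + (1 - beta2) * gradient ** 2
    m_hat = m / (1 - beta1 ** t)

    rho_inf = 2 / (1 - beta2) - 1
    rho_t = rho_inf - 2 * t * beta2 ** t / (1 - beta2 ** t)

    if rho_t > 4:
        # Variance of the adaptive term is tractable: rectified Adam step
        v_hat = v / (1 - beta2 ** t)
        r_t = np.sqrt(((rho_t - 4) * (rho_t - 2) * rho_inf) /
                      ((rho_inf - 4) * (rho_inf - 2) * rho_t))
        theta = theta - alpha * r_t * m_hat / (np.sqrt(v_hat) + epsilon)
    else:
        # Too few samples to trust v: fall back to SGD with momentum
        theta = theta - alpha * m_hat

    return theta, m, v
```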
LAMB (Layer-wise Adaptive Moments for Batch training):
Designed for large batch training (batch sizes 32K+). LAMB normalizes updates per-layer:
$$r_t = \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$ $$\theta_t = \theta_{t-1} - \alpha \cdot \phi(\|\theta_{t-1}\|) \cdot \frac{r_t + \lambda\theta_{t-1}}{\|r_t + \lambda\theta_{t-1}\|}$$
where φ is a trust ratio that scales updates based on the parameter norm. Used to train BERT in 76 minutes!
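A per-layer sketch of the trust-ratio idea (simplified; the published algorithm also clips the norm ratio, and implementations differ in details such as where ε is applied):

```python
import numpy as np

def lamb_layer_update(param, m_hat, v_hat, alpha=0.001,
                      weight_decay=0.01, epsilon=1e-6):
    """Sketch of a LAMB update for one layer's parameter tensor."""
    # Adam-style direction plus decoupled weight decay
    r = m_hat / (np.sqrt(v_hat) + epsilon) + weight_decay * param

    # Layer-wise trust ratio: scale the step by ||θ|| / ||r||
    param_norm = np.linalg.norm(param)
    update_norm = np.linalg.norm(r)
    if param_norm > 0 and update_norm > 0:
        trust_ratio = param_norm / update_norm
    else:
        trust_ratio = 1.0

    return param - alpha * trust_ratio * r
```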
| Variant | Key Innovation | When to Use | Complexity |
|---|---|---|---|
| Adam | Momentum + Adaptation + Bias correction | General default | Baseline |
| AdamW | Decoupled weight decay | When regularization needed | Same as Adam |
| NAdam | Nesterov momentum | Slightly faster convergence | +1 multiply |
| RAdam | Variance rectification | Warmup-free training | +variance calculation |
| LAMB | Layer-wise scaling | Very large batch training | +norm calculations |
| AMSGrad | Max instead of EMA for v | Theoretical guarantees | +max tracking |
Practical Selection Guide: start with AdamW for most problems; try NAdam if you want marginally faster convergence; consider RAdam to avoid a hand-tuned warmup; reach for LAMB only when training with very large batches; reserve AMSGrad for cases where its convergence guarantee specifically matters.
The Warmup Connection:
Many Adam training recipes include "warmup" periods where the learning rate starts small and increases. This compensates for early instability when the second moment estimate is unreliable. RAdam automates this by detecting when adaptation is unreliable and falling back to SGD.
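A minimal warmup-plus-cosine schedule sketch (the function name and constants below are illustrative); recomputing the learning rate before each optimizer step is enough to pair it with any of the implementations on this page.

```python
import math

def warmup_cosine_lr(step, total_steps, base_lr=3e-4,
                     warmup_steps=None, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr (sketch)."""
    if warmup_steps is None:
        warmup_steps = max(1, int(0.05 * total_steps))  # ~5% warmup

    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps

    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Example: first 500 of 10,000 steps ramp up, the rest decay smoothly
lrs = [warmup_cosine_lr(s, total_steps=10_000, warmup_steps=500) for s in range(10_000)]
print(f"{lrs[0]:.2e} -> {max(lrs):.2e} -> {lrs[-1]:.2e}")
```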
Here's a complete, production-quality Adam implementation with all standard features.
```python
import numpy as np
from typing import Dict, Optional


class Adam:
    """
    Adam (Adaptive Moment Estimation) optimizer.

    Combines momentum (first moment) with adaptive learning rates
    (second moment), including bias correction for both.

    Reference: Kingma & Ba.
    "Adam: A Method for Stochastic Optimization" (2014)
    """

    def __init__(
        self,
        learning_rate: float = 0.001,
        beta1: float = 0.9,
        beta2: float = 0.999,
        epsilon: float = 1e-8,
        weight_decay: float = 0.0,
        amsgrad: bool = False,
        decoupled_weight_decay: bool = True,  # AdamW-style
    ):
        """
        Args:
            learning_rate: Base learning rate α
            beta1: First moment decay (momentum)
            beta2: Second moment decay (RMSprop-like)
            epsilon: Numerical stability constant
            weight_decay: Regularization strength
            amsgrad: Use AMSGrad variant (max v instead of EMA)
            decoupled_weight_decay: If True, use AdamW; else L2 in gradient
        """
        if not 0.0 <= beta1 < 1.0:
            raise ValueError(f"beta1 must be in [0, 1), got {beta1}")
        if not 0.0 <= beta2 < 1.0:
            raise ValueError(f"beta2 must be in [0, 1), got {beta2}")

        self.lr = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = epsilon
        self.weight_decay = weight_decay
        self.amsgrad = amsgrad
        self.decoupled = decoupled_weight_decay

        # State
        self.m: Dict[str, np.ndarray] = {}      # First moment
        self.v: Dict[str, np.ndarray] = {}      # Second moment
        self.v_max: Dict[str, np.ndarray] = {}  # AMSGrad max
        self.t = 0

        # For efficient bias correction
        self.beta1_power = 1.0
        self.beta2_power = 1.0

    def step(
        self,
        parameters: Dict[str, np.ndarray],
        gradients: Dict[str, np.ndarray],
    ) -> Dict[str, np.ndarray]:
        """Perform one Adam update step."""
        self.t += 1
        self.beta1_power *= self.beta1
        self.beta2_power *= self.beta2

        updated = {}

        for name, param in parameters.items():
            if name not in gradients:
                updated[name] = param
                continue

            grad = gradients[name].copy()

            # L2 regularization (if not decoupled)
            if self.weight_decay > 0 and not self.decoupled:
                grad = grad + self.weight_decay * param

            # Initialize state
            if name not in self.m:
                self.m[name] = np.zeros_like(param)
                self.v[name] = np.zeros_like(param)
                if self.amsgrad:
                    self.v_max[name] = np.zeros_like(param)

            # Update biased first moment estimate
            self.m[name] = self.beta1 * self.m[name] + (1 - self.beta1) * grad

            # Update biased second moment estimate
            self.v[name] = self.beta2 * self.v[name] + (1 - self.beta2) * grad ** 2

            # Bias-corrected estimates
            m_hat = self.m[name] / (1 - self.beta1_power)
            v_hat = self.v[name] / (1 - self.beta2_power)

            # AMSGrad: use max of v instead
            if self.amsgrad:
                self.v_max[name] = np.maximum(self.v_max[name], v_hat)
                v_hat = self.v_max[name]

            # Compute update
            update = self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

            # Decoupled weight decay (AdamW)
            if self.weight_decay > 0 and self.decoupled:
                update = update + self.lr * self.weight_decay * param

            updated[name] = param - update

        return updated

    def get_diagnostics(self) -> Dict[str, float]:
        """Get optimizer diagnostics for monitoring."""
        if not self.m:
            return {}

        all_m = np.concatenate([m.flatten() for m in self.m.values()])
        all_v = np.concatenate([v.flatten() for v in self.v.values()])

        return {
            'step': self.t,
            'm_mean': float(np.mean(all_m)),
            'm_std': float(np.std(all_m)),
            'v_mean': float(np.mean(all_v)),
            'v_std': float(np.std(all_v)),
            'eff_lr': float(self.lr * np.mean(
                1 / (np.sqrt(all_v / (1 - self.beta2_power)) + self.eps))),
        }

    def state_dict(self) -> Dict:
        """Save optimizer state for checkpointing."""
        return {
            'm': {k: v.copy() for k, v in self.m.items()},
            'v': {k: v.copy() for k, v in self.v.items()},
            'v_max': {k: v.copy() for k, v in self.v_max.items()} if self.amsgrad else {},
            't': self.t,
            'beta1_power': self.beta1_power,
            'beta2_power': self.beta2_power,
        }

    def load_state_dict(self, state: Dict):
        """Restore optimizer state."""
        self.m = {k: v.copy() for k, v in state['m'].items()}
        self.v = {k: v.copy() for k, v in state['v'].items()}
        if self.amsgrad and state['v_max']:
            self.v_max = {k: v.copy() for k, v in state['v_max'].items()}
        self.t = state['t']
        self.beta1_power = state['beta1_power']
        self.beta2_power = state['beta2_power']
```

After years of community experience with Adam, several best practices have emerged.
| Domain | Optimizer | Learning Rate | Notes |
|---|---|---|---|
| CNN Classification | SGD+momentum or AdamW | 0.1 (SGD) or 0.001 (Adam) | SGD often gives better accuracy |
| NLP / Transformers | AdamW | 1e-4 to 5e-4 | Warmup essential; β₂=0.98 sometimes used |
| GAN Training | Adam | 1e-4 to 2e-4 | β₁=0.5 often used (less momentum) |
| Reinforcement Learning | Adam or RMSprop | 3e-4 to 1e-3 | RMSprop still common; Adam stable |
| Fine-tuning | AdamW | 1e-5 to 5e-5 | Much smaller LR than pretraining |
Debugging Adam Issues:
Loss not decreasing: Check whether the effective LR is too small (v too large)—the diagnostic sketch after this list shows how. Try a larger base LR or smaller β₂.
Loss exploding: Gradient clipping, smaller LR, or check for data issues.
Training unstable early: Add warmup period (gradually increase LR from 0).
Good training loss, bad val loss: Add/increase weight decay, dropout, or try SGD for better generalization.
Loss plateaus: Try a learning rate schedule such as cosine annealing, adjust β₂, or check whether the learning rate is simply too small.
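To inspect the effective learning rate in practice, here is a small sketch that reuses the Adam class and its get_diagnostics method from the implementation above (the random gradients are stand-ins for real ones):

```python
import numpy as np

# Assumes the Adam class defined above is in scope.
opt = Adam(learning_rate=1e-3)
params = {"w": np.random.randn(100)}

for step in range(200):
    grads = {"w": np.random.randn(100) * 5.0}  # stand-in gradients
    params = opt.step(params, grads)

diag = opt.get_diagnostics()
print(f"effective LR ≈ {diag['eff_lr']:.2e}  (base LR = {opt.lr:.0e})")
print(f"v mean = {diag['v_mean']:.3f}")  # large v  →  small effective steps
```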
For most new projects: AdamW with lr=3e-4, weight_decay=0.01, betas=(0.9, 0.999), combined with cosine annealing and ~5% warmup steps. This configuration is robust across many domains. Tune from there based on validation performance.
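In PyTorch, a sketch of that starting configuration might look like the following (the placeholder model and step counts are illustrative; a warmup-plus-cosine schedule like the earlier sketch can be attached via a LambdaLR scheduler):

```python
import torch

model = torch.nn.Linear(10, 1)  # placeholder model for illustration

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.999),
    weight_decay=0.01,
)

# Pair with warmup + cosine annealing, e.g. via
# torch.optim.lr_scheduler.LambdaLR and a function like the
# warmup_cosine_lr sketch shown earlier on this page.
```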
Adam represents the synthesis of the major optimization innovations we've studied: momentum's direction smoothing, adaptive methods' per-parameter scaling, and principled bias correction.
The Complete Optimizer Lineage:
| Generation | Optimizers | Key Innovation |
|---|---|---|
| First | GD, Momentum, Nesterov | Direction memory, lookahead |
| Second | AdaGrad, RMSprop | Per-parameter learning rates |
| Third | Adam, AdamW | Combined momentum + adaptation |
| Fourth | LARS, LAMB | Scaling for large-batch training |
Module Complete:
You've now mastered the core adaptive optimizers that drive modern deep learning: momentum and Nesterov acceleration, AdaGrad and RMSprop's per-parameter scaling, Adam's synthesis of both with bias correction, and the AdamW, NAdam, RAdam, and LAMB variants that refine it.
These tools form the foundation for training any neural network. While the search for better optimizers continues, understanding these fundamentals prepares you to evaluate new methods, debug training issues, and tune optimization for your specific applications.
Congratulations! You've completed the Adaptive Optimizers module. You now understand the full evolution from basic gradient descent through Adam—the mathematical foundations, practical implementations, and their appropriate use cases. This knowledge is essential for effective deep learning practice.