The Adaptive Moment Estimation (Adam) algorithm is one of the most widely used optimization methods in modern machine learning. It elegantly combines the strengths of two powerful gradient descent variants—momentum and adaptive learning rates—to achieve robust and efficient parameter updates.
Adam maintains two exponentially decaying moving averages for each parameter:
First Moment (m): The exponentially weighted average of past gradients, similar to momentum. This helps the optimizer build up velocity in consistent gradient directions while dampening oscillations.
Second Moment (v): The exponentially weighted average of past squared gradients, similar to RMSProp. This provides per-parameter adaptive learning rates, allowing the algorithm to take larger steps for parameters with consistently small gradients and smaller steps for those with large or highly variable gradients.
At each timestep t, given a gradient g_t, Adam performs the following computations:
Step 1 - Update Biased First Moment Estimate: $$m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$$
Step 2 - Update Biased Second Moment Estimate: $$v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$$
Step 3 - Compute Bias-Corrected First Moment: $$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$
Step 4 - Compute Bias-Corrected Second Moment: $$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
Step 5 - Update Parameters: $$\theta_t = \theta_{t-1} - \frac{\alpha \cdot \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
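The five steps above translate directly into code. Below is a minimal sketch of a single Adam step; the function name `adam_step` and its signature are illustrative, and the default hyperparameters are the commonly used values (α = 0.001, β₁ = 0.9, β₂ = 0.999, ε = 1e-8):

```python
import math

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter, following steps 1-5."""
    m = beta1 * m + (1 - beta1) * grad           # Step 1: biased first moment
    v = beta2 * v + (1 - beta2) * grad ** 2      # Step 2: biased second moment
    m_hat = m / (1 - beta1 ** t)                 # Step 3: bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                 # Step 4: bias-corrected second moment
    theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)  # Step 5: update
    return theta, m, v
```

Note that the raw moments `m` and `v` are what get carried to the next timestep; the bias-corrected values are only used inside the update.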
The bias correction in steps 3 and 4 is crucial. Since m and v are initialized to zero, they are biased towards zero during the early timesteps. The correction factors (1 - β₁ᵗ) and (1 - β₂ᵗ) counteract this initialization bias, ensuring accurate moment estimates even at the beginning of training.
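A small illustration of why the correction matters (the constant gradient g = 0.5 is an assumed toy value): the raw first moment after one step is only (1 − 0.9) × 0.5 = 0.05, a tenfold underestimate of the true average gradient, while the corrected estimate recovers 0.5 exactly at every timestep.

```python
beta1, g = 0.9, 0.5
m = 0.0
for t in range(1, 4):
    m = beta1 * m + (1 - beta1) * g   # raw moment: 0.05, 0.095, 0.1355, ...
    m_hat = m / (1 - beta1 ** t)      # corrected moment: 0.5 every step
    print(t, round(m, 4), round(m_hat, 4))
```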
The original Adam paper (Kingma & Ba, 2014) recommends the defaults α = 0.001, β₁ = 0.9, β₂ = 0.999, and ε = 10⁻⁸; these values are used throughout the worked examples below.
Implement a function that performs a single Adam optimization step. Your function should return the updated parameter together with the updated raw (bias-uncorrected) moment estimates m and v.
Example 1
Input: parameter = 1.0, grad = 0.1, m = 0.0, v = 0.0, t = 1
Output: (0.999, 0.01, 1e-05)
Initial Setup: Starting with parameter θ = 1.0, gradient g = 0.1, and zero-initialized moments at timestep t = 1.
Step 1 - First Moment Update: m₁ = 0.9 × 0.0 + 0.1 × 0.1 = 0.01
Step 2 - Second Moment Update: v₁ = 0.999 × 0.0 + 0.001 × (0.1)² = 0.001 × 0.01 = 0.00001 = 1e-05
Step 3 - Bias-Corrected First Moment: m̂₁ = 0.01 / (1 - 0.9¹) = 0.01 / 0.1 = 0.1
Step 4 - Bias-Corrected Second Moment: v̂₁ = 1e-05 / (1 - 0.999¹) = 1e-05 / 0.001 = 0.01
Step 5 - Parameter Update: θ₁ = 1.0 - (0.001 × 0.1) / (√0.01 + 1e-08) = 1.0 - 0.0001 / 0.10000001 ≈ 0.999
The function returns (0.999, 0.01, 1e-05) representing the updated parameter and raw moment estimates.
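The walkthrough above can be checked numerically step by step (the standard default hyperparameters α = 0.001, β₁ = 0.9, β₂ = 0.999, ε = 1e-8 are assumed):

```python
import math

alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
theta, g, t = 1.0, 0.1, 1
m = beta1 * 0.0 + (1 - beta1) * g          # Step 1: 0.01
v = beta2 * 0.0 + (1 - beta2) * g ** 2     # Step 2: 1e-05
m_hat = m / (1 - beta1 ** t)               # Step 3: 0.1
v_hat = v / (1 - beta2 ** t)               # Step 4: 0.01
theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)  # Step 5
print(round(theta, 3), round(m, 6), round(v, 8))  # prints: 0.999 0.01 1e-05
```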
Example 2
Input: parameter = 2.0, grad = 0.5, m = 0.05, v = 0.001, t = 2
Output: (1.999, 0.095, 0.00125)
Continuing Optimization: At timestep t = 2 with existing moment estimates from a previous iteration.
Step 1 - First Moment Update: m₂ = 0.9 × 0.05 + 0.1 × 0.5 = 0.045 + 0.05 = 0.095
Step 2 - Second Moment Update: v₂ = 0.999 × 0.001 + 0.001 × (0.5)² = 0.000999 + 0.00025 = 0.001249 ≈ 0.00125
Step 3 - Bias-Corrected First Moment: m̂₂ = 0.095 / (1 - 0.9²) = 0.095 / 0.19 = 0.5
Step 4 - Bias-Corrected Second Moment: v̂₂ = 0.00125 / (1 - 0.999²) = 0.00125 / 0.001999 ≈ 0.6253
Step 5 - Parameter Update: θ₂ = 2.0 - (0.001 × 0.5) / (√0.6253 + 1e-08) = 2.0 - 0.0005 / 0.7908 ≈ 1.999
The accumulated momentum from previous gradients influences the current update direction.
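The t = 2 case can be verified the same way; the carried-over moments m = 0.05 and v = 0.001 are what inject the accumulated momentum into this update (default hyperparameters assumed as before):

```python
import math

alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
theta, g, m, v, t = 2.0, 0.5, 0.05, 0.001, 2
m = beta1 * m + (1 - beta1) * g            # Step 1: 0.095
v = beta2 * v + (1 - beta2) * g ** 2       # Step 2: ~0.001249
m_hat = m / (1 - beta1 ** t)               # Step 3: 0.5
v_hat = v / (1 - beta2 ** t)               # Step 4: ~0.6253
theta -= alpha * m_hat / (math.sqrt(v_hat) + eps)  # Step 5: ~1.999
```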
Example 3
Input: parameter = [1.0, 2.0, 3.0], grad = [0.1, 0.2, 0.3], m = [0.0, 0.0, 0.0], v = [0.0, 0.0, 0.0], t = 1
Output: ([0.999, 1.999, 2.999], [0.01, 0.02, 0.03], [1e-05, 4e-05, 9e-05])
Vector Optimization: Adam applied element-wise to a 3-dimensional parameter vector.
Each parameter is updated independently using its corresponding gradient:
For θ₁ = 1.0, g₁ = 0.1: m₁ = 0.01, v₁ = 1e-05, m̂₁ = 0.1, v̂₁ = 0.01, so θ₁ ← 1.0 - 0.001 × 0.1 / (√0.01 + 1e-08) ≈ 0.999
For θ₂ = 2.0, g₂ = 0.2: m₂ = 0.02, v₂ = 4e-05, m̂₂ = 0.2, v̂₂ = 0.04, so θ₂ ← 2.0 - 0.001 × 0.2 / (√0.04 + 1e-08) ≈ 1.999
For θ₃ = 3.0, g₃ = 0.3: m₃ = 0.03, v₃ = 9e-05, m̂₃ = 0.3, v̂₃ = 0.09, so θ₃ ← 3.0 - 0.001 × 0.3 / (√0.09 + 1e-08) ≈ 2.999
This demonstrates how Adam handles vector parameters—the optimization is performed element-wise, maintaining separate moment estimates for each parameter dimension.
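A vectorized sketch of the element-wise update, written with NumPy so every operation applies across the whole parameter array at once (the function name `adam_step` and the defaults are illustrative, not prescribed):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update applied element-wise to array-valued parameters."""
    m = beta1 * m + (1 - beta1) * grad        # per-element first moment
    v = beta2 * v + (1 - beta2) * grad ** 2   # per-element second moment
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return theta - alpha * m_hat / (np.sqrt(v_hat) + eps), m, v

theta, m, v = adam_step(np.array([1.0, 2.0, 3.0]), np.array([0.1, 0.2, 0.3]),
                        np.zeros(3), np.zeros(3), t=1)
```

Because NumPy broadcasts the arithmetic, no explicit loop over dimensions is needed; each element keeps its own moment estimates automatically.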
Constraints