In the landscape of optimization algorithms for machine learning, adaptive learning rate methods represent a significant advancement over fixed-rate gradient descent. One of the pioneering techniques in this category adjusts the learning rate individually for each parameter based on the history of gradients observed during training.
The core insight is elegant: parameters that frequently receive large gradients should have their effective learning rate reduced, while parameters with infrequent or small gradients should maintain a larger learning rate. This per-parameter adaptation is achieved by tracking the cumulative sum of squared gradients.
Given the following inputs at iteration t:
- θ: the current parameter value(s)
- g: the gradient of the loss with respect to θ at iteration t
- G: the accumulated sum of squared gradients from all previous iterations
- η: the global learning rate
- ε: a small constant added for numerical stability
The update procedure follows these steps:
Step 1: Update the gradient accumulator $$G_{new} = G + g^2$$
Step 2: Compute the adaptive learning rate and update parameters $$\theta_{new} = \theta - \frac{\eta}{\sqrt{G_{new}} + \epsilon} \cdot g$$
The division by (√G_new + ε) is the key mechanism: as squared gradients accumulate over time, the effective learning rate for each parameter automatically decreases. This provides a form of automatic annealing that can be particularly beneficial for sparse features and non-convex optimization landscapes.
Implement a function that performs a single parameter update step using this adaptive gradient method. Your implementation should:
- Update the gradient accumulator G by adding the squared gradient
- Apply the adaptive update to the parameter(s)
- Return both the updated parameter(s) and the updated accumulator
- Handle scalar and array inputs (element-wise for arrays)
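As a minimal sketch of the two steps above (not a reference solution — the defaults η = 0.01 and ε = 1e-8 are taken from the worked examples, and NumPy is assumed so the same code covers scalars and arrays):

```python
import numpy as np

def adagrad_step(theta, grad, G, lr=0.01, eps=1e-8):
    """One adaptive-gradient update; works element-wise for arrays."""
    theta, grad, G = (np.asarray(x, dtype=float) for x in (theta, grad, G))
    G_new = G + grad ** 2                                    # Step 1: accumulate squared gradients
    theta_new = theta - lr / (np.sqrt(G_new) + eps) * grad   # Step 2: scaled parameter update
    return theta_new, G_new
```

Returning the new accumulator alongside the parameters keeps the function stateless; the caller threads G through successive iterations.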
Input: parameter = 1.0, grad = 0.1, G = 1.0
Expected output: (0.999, 1.01)
Step-by-step calculation with scalar values:
Given: θ = 1.0, g = 0.1, G = 1.0, η = 0.01, ε = 1e-8
Step 1: Update gradient accumulator G_new = G + g² = 1.0 + (0.1)² = 1.0 + 0.01 = 1.01
Step 2: Calculate adaptive learning rate Adaptive rate = η / (√G_new + ε) = 0.01 / (√1.01 + 1e-8) ≈ 0.01 / 1.005 ≈ 0.00995
Step 3: Update parameter θ_new = θ - adaptive_rate × g = 1.0 - 0.00995 × 0.1 ≈ 1.0 - 0.000995 ≈ 0.999
The function returns (0.999, 1.01) representing the updated parameter and new gradient accumulator.
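The walkthrough above can be reproduced with plain arithmetic (η = 0.01 and ε = 1e-8 as stated in the example):

```python
import math

eta, eps = 0.01, 1e-8
theta, g, G = 1.0, 0.1, 1.0

G_new = G + g ** 2                     # 1.0 + 0.01 = 1.01
rate = eta / (math.sqrt(G_new) + eps)  # ~0.00995
theta_new = theta - rate * g           # ~0.999

print(round(theta_new, 3), G_new)      # 0.999 1.01
```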
Input: parameter = [1.0, 2.0, 3.0], grad = [0.1, 0.2, 0.3], G = [0.0, 0.0, 0.0]
Expected output: ([0.99, 1.99, 2.99], [0.01, 0.04, 0.09])
Element-wise calculation with array inputs (starting from zero accumulation):
Given: θ = [1.0, 2.0, 3.0], g = [0.1, 0.2, 0.3], G = [0.0, 0.0, 0.0], η = 0.01, ε = 1e-8
Step 1: Update gradient accumulator (element-wise) G_new = G + g² = [0.0 + 0.01, 0.0 + 0.04, 0.0 + 0.09] = [0.01, 0.04, 0.09]
Step 2: Calculate element-wise updates
For θ[0] = 1.0: rate = 0.01 / (√0.01 + 1e-8) ≈ 0.1, so θ_new[0] = 1.0 - 0.1 × 0.1 = 0.99
For θ[1] = 2.0: rate = 0.01 / (√0.04 + 1e-8) ≈ 0.05, so θ_new[1] = 2.0 - 0.05 × 0.2 = 1.99
For θ[2] = 3.0: rate = 0.01 / (√0.09 + 1e-8) ≈ 0.0333, so θ_new[2] = 3.0 - 0.0333 × 0.3 ≈ 2.99
Result: Updated parameters = [0.99, 1.99, 2.99], Updated G = [0.01, 0.04, 0.09]
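The element-wise arithmetic can be checked with a few lines of NumPy (η = 0.01 and ε = 1e-8 as in the scalar example):

```python
import numpy as np

eta, eps = 0.01, 1e-8
theta = np.array([1.0, 2.0, 3.0])
g = np.array([0.1, 0.2, 0.3])
G = np.zeros(3)

G_new = G + g ** 2                               # [0.01, 0.04, 0.09]
theta_new = theta - eta / (np.sqrt(G_new) + eps) * g  # ~[0.99, 1.99, 2.99]
```

Note that every parameter moved by the same amount (0.01): when G starts at zero, the update is η · g / (√(g²) + ε) ≈ η · sign(g), so the very first step has magnitude η regardless of the gradient's size.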
Input: parameter = [0.5, -0.5, 1.0], grad = [0.5, -0.3, 0.1], G = [0.5, 0.25, 0.1]
Expected output: ([0.494, -0.495, 0.997], [0.75, 0.34, 0.11])
Calculation with pre-existing gradient history:
Given: θ = [0.5, -0.5, 1.0], g = [0.5, -0.3, 0.1], G = [0.5, 0.25, 0.1], η = 0.01, ε = 1e-8
Step 1: Update gradient accumulator G_new = [0.5 + 0.25, 0.25 + 0.09, 0.1 + 0.01] = [0.75, 0.34, 0.11]
Step 2: Compute updates with accumulated history
For θ[0] = 0.5 (g = 0.5, G_new = 0.75): rate = 0.01 / (√0.75 + 1e-8) ≈ 0.01155, so θ_new[0] = 0.5 - 0.01155 × 0.5 ≈ 0.494
For θ[1] = -0.5 (g = -0.3, G_new = 0.34): rate = 0.01 / (√0.34 + 1e-8) ≈ 0.01715, so θ_new[1] = -0.5 - 0.01715 × (-0.3) ≈ -0.495
For θ[2] = 1.0 (g = 0.1, G_new = 0.11): rate = 0.01 / (√0.11 + 1e-8) ≈ 0.03015, so θ_new[2] = 1.0 - 0.03015 × 0.1 ≈ 0.997
Notice how the negative gradient for θ[1] causes the parameter to move in the positive direction, and how larger accumulated gradients result in smaller effective step sizes.
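Both observations can be confirmed numerically (same assumed constants, η = 0.01 and ε = 1e-8):

```python
import numpy as np

eta, eps = 0.01, 1e-8
theta = np.array([0.5, -0.5, 1.0])
g = np.array([0.5, -0.3, 0.1])
G = np.array([0.5, 0.25, 0.1])

G_new = G + g ** 2                                    # [0.75, 0.34, 0.11]
theta_new = theta - eta / (np.sqrt(G_new) + eps) * g  # ~[0.494, -0.495, 0.997]

# theta[1] increases because its gradient is negative; theta[0], with the
# largest accumulated G_new, takes the smallest effective step per unit gradient.
```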
Constraints