One of the most powerful innovations in gradient-based optimization is the concept of adaptive learning rates—algorithms that automatically adjust the step size for each parameter based on its historical gradient behavior. The Adaptive Gradient Scaling optimizer (commonly known as RMSProp) achieves this by maintaining an exponentially weighted moving average of squared gradients for each parameter.
The core insight is elegantly simple: parameters that consistently receive large gradients should take smaller steps (to avoid overshooting), while parameters with small gradients should take larger steps (to accelerate convergence in flat regions). This is accomplished by dividing each gradient by the square root of its running average squared magnitude.
Mathematical Formulation:
For each parameter θᵢ with gradient gᵢ and cache vᵢ:
Step 1: Update the squared gradient cache (exponential moving average) $$v_i^{(t)} = \beta \cdot v_i^{(t-1)} + (1 - \beta) \cdot g_i^2$$
Step 2: Update the parameter $$\theta_i^{(t)} = \theta_i^{(t-1)} - \eta \cdot \frac{g_i}{\sqrt{v_i^{(t)}} + \epsilon}$$
Where:
• θᵢ⁽ᵗ⁾ is the value of parameter i at step t
• gᵢ is the current gradient of parameter i
• vᵢ⁽ᵗ⁾ is the cache (the running average of squared gradients)
• β is the decay rate of the moving average
• η is the learning rate
• ε is a small constant for numerical stability
Why This Works:
The denominator √vᵢ + ε acts as an adaptive scaling factor. When gradients are consistently large, vᵢ grows large, reducing the effective step size. When gradients are small, vᵢ remains small, allowing larger steps. This automatic adjustment makes the optimizer robust to varying gradient scales across different parameters and training phases.
Your Task:
Implement a function that performs a single update step of the Adaptive Gradient Scaling optimizer. Given the current parameter values, their gradients, the cache of squared gradient averages, the learning rate, and the decay rate (beta), compute and return both the updated parameters and the updated cache.
Note: Use ε = 10⁻⁸ as the numerical stability constant.
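A minimal reference sketch of the required function (the name and list-based signature are illustrative choices, not mandated by the problem). Note that ε is added *after* the square root here; that placement reproduces the expected outputs below to six decimal places:

```python
import math

def adaptive_grad_step(params, grads, cache, lr, beta, eps=1e-8):
    """One update step of the adaptive gradient scaling (RMSProp-style)
    optimizer. Returns (new_params, new_cache) as fresh lists."""
    # Step 1: exponential moving average of squared gradients
    new_cache = [beta * v + (1.0 - beta) * g * g
                 for v, g in zip(cache, grads)]
    # Step 2: scale each gradient by the root of its running average
    new_params = [p - lr * g / (math.sqrt(v) + eps)
                  for p, g, v in zip(params, grads, new_cache)]
    return new_params, new_cache
```

On the first example below, `adaptive_grad_step([1.0], [0.1], [0.0], 0.1, 0.9)` returns values that round to `([0.683772], [0.001])`.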
params = [1.0]
grads = [0.1]
cache = [0.0]
lr = 0.1
beta = 0.9

Output: ([0.683772], [0.001])

Step 1: Update the cache (moving average of squared gradients)
• v_new = β × v_old + (1 - β) × g²
• v_new = 0.9 × 0.0 + 0.1 × (0.1)²
• v_new = 0 + 0.1 × 0.01 = 0.001
Step 2: Update the parameter
• θ_new = θ_old - lr × g / (√v_new + ε)
• θ_new = 1.0 - 0.1 × 0.1 / (√0.001 + 10⁻⁸)
• θ_new = 1.0 - 0.01 / 0.0316228
• θ_new = 1.0 - 0.316228 ≈ 0.683772
The scaled gradient step is much larger than the raw gradient (0.316 vs 0.1) because the cache value (0.001) is small, indicating this parameter hasn't received many large gradients historically.
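The arithmetic above can be checked directly with a standalone scalar version of the update (ε is added outside the square root, which is what reproduces the stated output at six decimals):

```python
import math

beta, lr, eps = 0.9, 0.1, 1e-8
theta, g, v = 1.0, 0.1, 0.0   # parameter, gradient, cache from the example

v = beta * v + (1 - beta) * g ** 2     # cache update: 0.001
step = lr * g / (math.sqrt(v) + eps)   # scaled step: ≈ 0.316228
theta = theta - step

print(round(theta, 6), round(v, 6))    # → 0.683772 0.001
```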
params = [1.0, 2.0, 3.0]
grads = [0.1, 0.2, 0.3]
cache = [0.0, 0.0, 0.0]
lr = 0.01
beta = 0.9

Output: ([0.968377, 1.968377, 2.968377], [0.001, 0.004, 0.009])

With three parameters starting from zero cache, we apply the update to each element:
Parameter 1:
• v₁ = 0.9 × 0.0 + 0.1 × 0.01 = 0.001
• θ₁ = 1.0 - 0.01 × 0.1 / (√0.001 + ε) = 0.968377

Parameter 2:
• v₂ = 0.9 × 0.0 + 0.1 × 0.04 = 0.004
• θ₂ = 2.0 - 0.01 × 0.2 / (√0.004 + ε) = 1.968377

Parameter 3:
• v₃ = 0.9 × 0.0 + 0.1 × 0.09 = 0.009
• θ₃ = 3.0 - 0.01 × 0.3 / (√0.009 + ε) = 2.968377
Notice that all three parameters move by the same effective amount (≈0.0316) despite having different gradient magnitudes. This is the adaptive scaling in action—the algorithm normalizes updates by gradient magnitude.
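This equal-step behavior follows algebraically: from a zero cache, the step is lr × g / √((1-β)g²) = lr / √(1-β) for any positive g. A small check (ε placement follows the example outputs):

```python
import math

lr, beta, eps = 0.01, 0.9, 1e-8
steps = []
for g in [0.1, 0.2, 0.3]:
    v = (1 - beta) * g ** 2                 # cache after one step from zero
    steps.append(lr * g / (math.sqrt(v) + eps))

# every gradient produces (almost exactly) the same effective step
print([round(s, 6) for s in steps])         # → [0.031623, 0.031623, 0.031623]
```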
params = [0.5, -0.5]
grads = [0.05, -0.05]
cache = [0.01, 0.01]
lr = 0.001
beta = 0.99

Output: ([0.499498, -0.499498], [0.009925, 0.009925])

With a high beta (0.99) and existing cache values, the cache changes slowly:
Parameter 1 (positive gradient):
• v₁ = 0.99 × 0.01 + 0.01 × 0.0025 = 0.0099 + 0.000025 = 0.009925
• θ₁ = 0.5 - 0.001 × 0.05 / (√0.009925 + ε) ≈ 0.499498

Parameter 2 (negative gradient):
• v₂ = 0.99 × 0.01 + 0.01 × 0.0025 = 0.009925 (the cache only uses g², so the sign doesn't matter)
• θ₂ = -0.5 - 0.001 × (-0.05) / (√0.009925 + ε) ≈ -0.499498
The high beta value (0.99) causes the cache to decay slowly, preserving gradient history over many iterations. The existing cache helps stabilize the update magnitude.
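The slow decay can be made concrete by repeating the same gradient for a few steps (an illustrative sketch; only the first iteration's value, 0.009925, comes from the example above):

```python
beta, g = 0.99, 0.05
v = 0.01                                 # existing cache
for t in range(1, 4):
    v = beta * v + (1 - beta) * g ** 2   # t=1 gives 0.009925, as in the example
    print(t, round(v, 6))

# the cache creeps from 0.01 toward its fixed point g² = 0.0025,
# moving only about 1% of the remaining gap per step
```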
Constraints