One of the most powerful innovations in gradient-based optimization is the concept of adaptive learning rates—algorithms that automatically adjust the step size for each parameter based on its historical gradient behavior. The Adaptive Gradient Scaling optimizer (commonly known as RMSProp) achieves this by maintaining an exponentially weighted moving average of squared gradients for each parameter.
The core insight is elegantly simple: parameters that consistently receive large gradients should take smaller steps (to avoid overshooting), while parameters with small gradients should take larger steps (to accelerate convergence in flat regions). This is accomplished by dividing each gradient by the square root of its running average squared magnitude.
Mathematical Formulation:
For each parameter θᵢ with gradient gᵢ and cache vᵢ:
Step 1: Update the squared gradient cache (exponential moving average) $$v_i^{(t)} = \beta \cdot v_i^{(t-1)} + (1 - \beta) \cdot g_i^2$$
Step 2: Update the parameter $$\theta_i^{(t)} = \theta_i^{(t-1)} - \eta \cdot \frac{g_i}{\sqrt{v_i^{(t)}} + \epsilon}$$
Where:
• θᵢ⁽ᵗ⁾ is the value of parameter i at step t
• gᵢ is the current gradient of parameter i
• vᵢ⁽ᵗ⁾ is the cache (the running average of squared gradients)
• β is the decay rate of the moving average
• η is the learning rate
• ε is a small constant for numerical stability
Why This Works:
The denominator √vᵢ + ε acts as an adaptive scaling factor. When gradients are consistently large, vᵢ grows large, reducing the effective step size. When gradients are small, vᵢ remains small, allowing larger steps. This automatic adjustment makes the optimizer robust to varying gradient scales across different parameters and training phases.
Your Task:
Implement a function that performs a single update step of the Adaptive Gradient Scaling optimizer. Given the current parameter values, their gradients, the cache of squared gradient averages, the learning rate, and the decay rate (beta), compute and return both the updated parameters and the updated cache.
Note: Use ε = 10⁻⁸ as the numerical stability constant.
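A minimal reference sketch of the required function (the name and list-based signature are illustrative choices, not mandated by the problem). Note that ε is added *after* the square root here; that placement reproduces the expected outputs below to six decimal places:

```python
import math

def adaptive_grad_step(params, grads, cache, lr, beta, eps=1e-8):
    """One update step of the adaptive gradient scaling (RMSProp-style)
    optimizer. Returns (new_params, new_cache) as fresh lists."""
    # Step 1: exponential moving average of squared gradients
    new_cache = [beta * v + (1.0 - beta) * g * g
                 for v, g in zip(cache, grads)]
    # Step 2: scale each gradient by the root of its running average
    new_params = [p - lr * g / (math.sqrt(v) + eps)
                  for p, g, v in zip(params, grads, new_cache)]
    return new_params, new_cache
```

On the first example below, `adaptive_grad_step([1.0], [0.1], [0.0], 0.1, 0.9)` returns values that round to `([0.683772], [0.001])`.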
params = [1.0]
grads = [0.1]
cache = [0.0]
lr = 0.1
beta = 0.9

Output: ([0.683772], [0.001])

Step 1: Update the cache (moving average of squared gradients)
• v_new = β × v_old + (1 - β) × g²
• v_new = 0.9 × 0.0 + 0.1 × (0.1)²
• v_new = 0 + 0.1 × 0.01 = 0.001
Step 2: Update the parameter
• θ_new = θ_old - lr × g / (√v_new + ε)
• θ_new = 1.0 - 0.1 × 0.1 / (√0.001 + 10⁻⁸)
• θ_new = 1.0 - 0.01 / 0.0316228
• θ_new = 1.0 - 0.316228 ≈ 0.683772
The scaled gradient step is much larger than the raw gradient (0.316 vs 0.1) because the cache value (0.001) is small, indicating this parameter hasn't received many large gradients historically.
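The arithmetic above can be checked directly with a standalone scalar version of the update (ε is added outside the square root, which is what reproduces the stated output at six decimals):

```python
import math

beta, lr, eps = 0.9, 0.1, 1e-8
theta, g, v = 1.0, 0.1, 0.0   # parameter, gradient, cache from the example

v = beta * v + (1 - beta) * g ** 2     # cache update: 0.001
step = lr * g / (math.sqrt(v) + eps)   # scaled step: ≈ 0.316228
theta = theta - step

print(round(theta, 6), round(v, 6))    # → 0.683772 0.001
```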
params = [1.0, 2.0, 3.0]
grads = [0.1, 0.2, 0.3]
cache = [0.0, 0.0, 0.0]
lr = 0.01
beta = 0.9

Output: ([0.968377, 1.968377, 2.968377], [0.001, 0.004, 0.009])

With three parameters starting from zero cache, we apply the update to each element:
Parameter 1:
• v₁ = 0.9 × 0.0 + 0.1 × 0.01 = 0.001
• θ₁ = 1.0 - 0.01 × 0.1 / (√0.001 + ε) = 0.968377

Parameter 2:
• v₂ = 0.9 × 0.0 + 0.1 × 0.04 = 0.004
• θ₂ = 2.0 - 0.01 × 0.2 / (√0.004 + ε) = 1.968377

Parameter 3:
• v₃ = 0.9 × 0.0 + 0.1 × 0.09 = 0.009
• θ₃ = 3.0 - 0.01 × 0.3 / (√0.009 + ε) = 2.968377
Notice that all three parameters move by the same effective amount (≈0.0316) despite having different gradient magnitudes. This is the adaptive scaling in action—the algorithm normalizes updates by gradient magnitude.
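This equal-step behavior follows algebraically: from a zero cache, the step is lr × g / √((1-β)g²) = lr / √(1-β) for any positive g. A small check (ε placement follows the example outputs):

```python
import math

lr, beta, eps = 0.01, 0.9, 1e-8
steps = []
for g in [0.1, 0.2, 0.3]:
    v = (1 - beta) * g ** 2                 # cache after one step from zero
    steps.append(lr * g / (math.sqrt(v) + eps))

# every gradient produces (almost exactly) the same effective step
print([round(s, 6) for s in steps])         # → [0.031623, 0.031623, 0.031623]
```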
params = [0.5, -0.5]
grads = [0.05, -0.05]
cache = [0.01, 0.01]
lr = 0.001
beta = 0.99

Output: ([0.499498, -0.499498], [0.009925, 0.009925])

With a high beta (0.99) and existing cache values, the cache changes slowly:
Parameter 1 (positive gradient):
• v₁ = 0.99 × 0.01 + 0.01 × 0.0025 = 0.0099 + 0.000025 = 0.009925
• θ₁ = 0.5 - 0.001 × 0.05 / (√0.009925 + ε) ≈ 0.499498

Parameter 2 (negative gradient):
• v₂ = 0.99 × 0.01 + 0.01 × 0.0025 = 0.009925 (the cache only uses g², so the sign doesn't matter)
• θ₂ = -0.5 - 0.001 × (-0.05) / (√0.009925 + ε) ≈ -0.499498
The high beta value (0.99) causes the cache to decay slowly, preserving gradient history over many iterations. The existing cache helps stabilize the update magnitude.
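The slow decay can be made concrete by repeating the same gradient for a few steps (an illustrative sketch; only the first iteration's value, 0.009925, comes from the example above):

```python
beta, g = 0.99, 0.05
v = 0.01                                 # existing cache
for t in range(1, 4):
    v = beta * v + (1 - beta) * g ** 2   # t=1 gives 0.009925, as in the example
    print(t, round(v, 6))

# the cache creeps from 0.01 toward its fixed point g² = 0.0025,
# moving only about 1% of the remaining gap per step
```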
Constraints