In the landscape of gradient-based optimization algorithms, adaptive methods that dynamically adjust learning rates have proven essential for training deep neural networks effectively. One particularly elegant approach maintains running averages of both squared gradients and squared parameter updates, using the ratio of these accumulated terms to automatically scale the learning rate—completely eliminating the need to specify an initial learning rate hyperparameter.
This optimization method addresses a fundamental limitation of earlier adaptive algorithms by using a window of accumulated gradients rather than accumulating all past squared gradients. The key insight is that the units of the update should match the units of the parameters, which is achieved by computing a ratio of root mean squared (RMS) values.
Given a parameter θ, its gradient g, and two running averages u (for squared gradients) and v (for squared parameter updates), the algorithm proceeds as follows:
Step 1: Update the gradient accumulator $$u_{new} = \rho \cdot u + (1 - \rho) \cdot g^2$$
Step 2: Compute the parameter update $$\Delta\theta = -\frac{\sqrt{v + \epsilon}}{\sqrt{u_{new} + \epsilon}} \cdot g$$
Step 3: Update the delta accumulator $$v_{new} = \rho \cdot v + (1 - \rho) \cdot (\Delta\theta)^2$$
Step 4: Apply the update $$\theta_{new} = \theta + \Delta\theta$$
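The four steps above can be sketched as a single function. This is a minimal sketch, assuming inputs may be scalars or NumPy arrays; the name `adadelta_step` is illustrative, not prescribed by the problem:

```python
import numpy as np

def adadelta_step(theta, grad, u, v, rho=0.95, eps=1e-6):
    """One adaptive-delta update step; inputs may be scalars or numpy arrays."""
    # Step 1: decay the running average of squared gradients
    u_new = rho * u + (1 - rho) * grad ** 2
    # Step 2: scale the gradient by the ratio of RMS values
    delta = -np.sqrt(v + eps) / np.sqrt(u_new + eps) * grad
    # Step 3: decay the running average of squared parameter updates
    v_new = rho * v + (1 - rho) * delta ** 2
    # Step 4: apply the update
    theta_new = theta + delta
    return theta_new, u_new, v_new
```

Because every operation is element-wise, the same code handles both the scalar and array cases below without modification.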
Implement a function that performs a single optimization step using this adaptive delta algorithm. Your implementation should:
- Accept the current parameter value, its gradient, the two accumulators u and v, the decay rate ρ, and the small constant ε.
- Handle both scalar and array-valued inputs, applying the update element-wise.
- Return the updated parameter together with the updated accumulators u and v.
Input: parameter = 1.0, grad = 0.1, u = 1.0, v = 1.0, rho = 0.95, epsilon = 1e-06
Output: [0.89743, 0.9505, 0.95053]
Scalar update computation:
Update gradient accumulator (u): u_new = 0.95 × 1.0 + 0.05 × (0.1)² = 0.95 + 0.0005 = 0.9505
Compute RMS values: RMS(v) = √(1.0 + 1e-6) ≈ 1.0000005 and RMS(u_new) = √(0.9505 + 1e-6) ≈ 0.97493
Calculate parameter delta: Δθ = -(1.0000005 / 0.97493) × 0.1 ≈ -0.10257
Update delta accumulator (v): v_new = 0.95 × 1.0 + 0.05 × (-0.10257)² ≈ 0.95053
Apply update: θ_new = 1.0 + (-0.10257) ≈ 0.89743
The output tuple contains [updated_parameter, updated_u, updated_v] = [0.89743, 0.9505, 0.95053].
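As a sanity check, the scalar arithmetic above can be reproduced line by line in plain Python:

```python
import math

rho, eps = 0.95, 1e-6
theta, grad, u, v = 1.0, 0.1, 1.0, 1.0

u_new = rho * u + (1 - rho) * grad ** 2   # 0.95 + 0.0005 = 0.9505
rms_v = math.sqrt(v + eps)                # ≈ 1.0000005
rms_u = math.sqrt(u_new + eps)            # ≈ 0.97493
delta = -(rms_v / rms_u) * grad           # ≈ -0.10257
v_new = rho * v + (1 - rho) * delta ** 2  # ≈ 0.95053
theta_new = theta + delta                 # ≈ 0.89743
```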
Input: parameter = [1.0, 2.0], grad = [0.1, 0.2], u = [1.0, 1.0], v = [1.0, 1.0], rho = 0.95, epsilon = 1e-06
Output: [[0.89743, 1.79502], [0.9505, 0.952], [0.95053, 0.9521]]
Array-wise update computation:
The algorithm processes each element independently:
For element 0 (parameter=1.0, grad=0.1): the computation is identical to the scalar example above, giving θ_new ≈ 0.89743, u_new = 0.9505, v_new ≈ 0.95053.
For element 1 (parameter=2.0, grad=0.2): u_new = 0.95 × 1.0 + 0.05 × (0.2)² = 0.952, Δθ = -(1.0000005 / √(0.952 + 1e-6)) × 0.2 ≈ -0.20498, so θ_new ≈ 1.79502 and v_new = 0.95 × 1.0 + 0.05 × (-0.20498)² ≈ 0.9521.
The output is a list of three arrays: [updated_parameters, updated_u, updated_v].
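A minimal NumPy sketch of the array case, assuming array inputs, shows that broadcasting handles all elements at once:

```python
import numpy as np

rho, eps = 0.95, 1e-6
theta = np.array([1.0, 2.0])
grad = np.array([0.1, 0.2])
u = np.array([1.0, 1.0])
v = np.array([1.0, 1.0])

# Every operation is element-wise, so each parameter is updated independently.
u_new = rho * u + (1 - rho) * grad ** 2
delta = -np.sqrt(v + eps) / np.sqrt(u_new + eps) * grad
v_new = rho * v + (1 - rho) * delta ** 2
theta_new = theta + delta
```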
Input: parameter = 0.5, grad = 0.5, u = 0.5, v = 0.5, rho = 0.9, epsilon = 1e-06
Output: [-0.01299, 0.475, 0.47632]
Larger gradient with different decay rate:
Using ρ = 0.9 (faster decay) and a larger gradient:
Update gradient accumulator: u_new = 0.9 × 0.5 + 0.1 × 0.25 = 0.475
Compute the update: RMS(v) = √(0.5 + 1e-6) ≈ 0.70711 and RMS(u_new) = √(0.475 + 1e-6) ≈ 0.68920, so Δθ = -(0.70711 / 0.68920) × 0.5 ≈ -0.51299
Update delta accumulator: v_new = 0.9 × 0.5 + 0.1 × (-0.51299)² ≈ 0.45 + 0.02632 ≈ 0.47632
Apply update: θ_new = 0.5 + (-0.51299) ≈ -0.01299
Notice how the large gradient (0.5) causes a significant update, moving the parameter from positive to slightly negative.
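This case can be checked the same way. Note that |Δθ| ≈ 0.513 actually exceeds the gradient itself, because the accumulated-delta RMS (≈0.707) is larger than the gradient RMS (≈0.689):

```python
import math

rho, eps = 0.9, 1e-6
theta, grad, u, v = 0.5, 0.5, 0.5, 0.5

u_new = rho * u + (1 - rho) * grad ** 2   # 0.45 + 0.025 = 0.475
delta = -math.sqrt(v + eps) / math.sqrt(u_new + eps) * grad  # ≈ -0.51299
v_new = rho * v + (1 - rho) * delta ** 2  # ≈ 0.47632
theta_new = theta + delta                 # ≈ -0.01299, crosses zero
```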
Constraints