In the landscape of gradient-based optimization algorithms, adaptive methods that dynamically adjust learning rates have proven essential for training deep neural networks effectively. One particularly elegant approach maintains running averages of both squared gradients and squared parameter updates, using the ratio of these accumulated terms to automatically scale the learning rate—completely eliminating the need to specify an initial learning rate hyperparameter.
This optimization method addresses a fundamental limitation of earlier adaptive algorithms by using a window of accumulated gradients rather than accumulating all past squared gradients. The key insight is that the units of the update should match the units of the parameters, which is achieved by computing a ratio of root mean squared (RMS) values.
Given a parameter θ, its gradient g, and two running averages u (for squared gradients) and v (for squared parameter updates), the algorithm proceeds as follows:
Step 1: Update the gradient accumulator $$u_{new} = \rho \cdot u + (1 - \rho) \cdot g^2$$
Step 2: Compute the parameter update $$\Delta\theta = -\frac{\sqrt{v + \epsilon}}{\sqrt{u_{new} + \epsilon}} \cdot g$$
Step 3: Update the delta accumulator $$v_{new} = \rho \cdot v + (1 - \rho) \cdot (\Delta\theta)^2$$
Step 4: Apply the update $$\theta_{new} = \theta + \Delta\theta$$
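The four steps above can be sketched as a single function. This is a minimal sketch, assuming inputs may be scalars or NumPy arrays; the name `adadelta_step` is illustrative, not prescribed by the problem:

```python
import numpy as np

def adadelta_step(theta, grad, u, v, rho=0.95, eps=1e-6):
    """One adaptive-delta update step; inputs may be scalars or numpy arrays."""
    # Step 1: decay the running average of squared gradients
    u_new = rho * u + (1 - rho) * grad ** 2
    # Step 2: scale the gradient by the ratio of RMS values
    delta = -np.sqrt(v + eps) / np.sqrt(u_new + eps) * grad
    # Step 3: decay the running average of squared parameter updates
    v_new = rho * v + (1 - rho) * delta ** 2
    # Step 4: apply the update
    theta_new = theta + delta
    return theta_new, u_new, v_new
```

Because every operation is element-wise, the same code handles both the scalar and array cases below without modification.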
Implement a function that performs a single optimization step using this adaptive delta algorithm. Your implementation should:
- Accept the current parameter value, its gradient, the two accumulators u and v, the decay rate ρ, and the small constant ε.
- Handle both scalar and array-valued inputs, applying the update element-wise.
- Return the updated parameter together with the updated accumulators u and v.
Input: parameter = 1.0, grad = 0.1, u = 1.0, v = 1.0, rho = 0.95, epsilon = 1e-06
Output: [0.89743, 0.9505, 0.95053]
Scalar update computation:
Update gradient accumulator (u): u_new = 0.95 × 1.0 + 0.05 × (0.1)² = 0.95 + 0.0005 = 0.9505
Compute RMS values: RMS(v) = √(1.0 + 1e-6) ≈ 1.0000005 and RMS(u_new) = √(0.9505 + 1e-6) ≈ 0.97493
Calculate parameter delta: Δθ = -(1.0000005 / 0.97493) × 0.1 ≈ -0.10257
Update delta accumulator (v): v_new = 0.95 × 1.0 + 0.05 × (-0.10257)² ≈ 0.95053
Apply update: θ_new = 1.0 + (-0.10257) ≈ 0.89743
The output tuple contains [updated_parameter, updated_u, updated_v] = [0.89743, 0.9505, 0.95053].
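As a sanity check, the scalar arithmetic above can be reproduced line by line in plain Python:

```python
import math

rho, eps = 0.95, 1e-6
theta, grad, u, v = 1.0, 0.1, 1.0, 1.0

u_new = rho * u + (1 - rho) * grad ** 2   # 0.95 + 0.0005 = 0.9505
rms_v = math.sqrt(v + eps)                # ≈ 1.0000005
rms_u = math.sqrt(u_new + eps)            # ≈ 0.97493
delta = -(rms_v / rms_u) * grad           # ≈ -0.10257
v_new = rho * v + (1 - rho) * delta ** 2  # ≈ 0.95053
theta_new = theta + delta                 # ≈ 0.89743
```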
Input: parameter = [1.0, 2.0], grad = [0.1, 0.2], u = [1.0, 1.0], v = [1.0, 1.0], rho = 0.95, epsilon = 1e-06
Output: [[0.89743, 1.79502], [0.9505, 0.952], [0.95053, 0.9521]]
Array-wise update computation:
The algorithm processes each element independently:
For element 0 (parameter=1.0, grad=0.1): the computation is identical to the scalar example above, giving θ_new ≈ 0.89743, u_new = 0.9505, v_new ≈ 0.95053.
For element 1 (parameter=2.0, grad=0.2): u_new = 0.95 × 1.0 + 0.05 × (0.2)² = 0.952, Δθ = -(1.0000005 / √(0.952 + 1e-6)) × 0.2 ≈ -0.20498, so θ_new ≈ 1.79502 and v_new = 0.95 × 1.0 + 0.05 × (-0.20498)² ≈ 0.9521.
The output is a list of three arrays: [updated_parameters, updated_u, updated_v].
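A minimal NumPy sketch of the array case, assuming array inputs, shows that broadcasting handles all elements at once:

```python
import numpy as np

rho, eps = 0.95, 1e-6
theta = np.array([1.0, 2.0])
grad = np.array([0.1, 0.2])
u = np.array([1.0, 1.0])
v = np.array([1.0, 1.0])

# Every operation is element-wise, so each parameter is updated independently.
u_new = rho * u + (1 - rho) * grad ** 2
delta = -np.sqrt(v + eps) / np.sqrt(u_new + eps) * grad
v_new = rho * v + (1 - rho) * delta ** 2
theta_new = theta + delta
```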
Input: parameter = 0.5, grad = 0.5, u = 0.5, v = 0.5, rho = 0.9, epsilon = 1e-06
Output: [-0.01299, 0.475, 0.47632]
Larger gradient with different decay rate:
Using ρ = 0.9 (faster decay) and a larger gradient:
Update gradient accumulator: u_new = 0.9 × 0.5 + 0.1 × 0.25 = 0.475
Compute the update: RMS(v) = √(0.5 + 1e-6) ≈ 0.70711 and RMS(u_new) = √(0.475 + 1e-6) ≈ 0.68920, so Δθ = -(0.70711 / 0.68920) × 0.5 ≈ -0.51299
Update delta accumulator: v_new = 0.9 × 0.5 + 0.1 × (-0.51299)² ≈ 0.45 + 0.02632 ≈ 0.47632
Apply update: θ_new = 0.5 + (-0.51299) ≈ -0.01299
Notice how the large gradient (0.5) causes a significant update, moving the parameter from positive to slightly negative.
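This case can be checked the same way. Note that |Δθ| ≈ 0.513 actually exceeds the gradient itself, because the accumulated-delta RMS (≈0.707) is larger than the gradient RMS (≈0.689):

```python
import math

rho, eps = 0.9, 1e-6
theta, grad, u, v = 0.5, 0.5, 0.5, 0.5

u_new = rho * u + (1 - rho) * grad ** 2   # 0.45 + 0.025 = 0.475
delta = -math.sqrt(v + eps) / math.sqrt(u_new + eps) * grad  # ≈ -0.51299
v_new = rho * v + (1 - rho) * delta ** 2  # ≈ 0.47632
theta_new = theta + delta                 # ≈ -0.01299, crosses zero
```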
Constraints