In machine learning and optimization, computing gradients correctly is absolutely critical. Whether you're training neural networks through backpropagation, implementing custom loss functions, or developing novel optimization algorithms, incorrect gradients lead to failed convergence, suboptimal solutions, or completely broken models.
The challenge is that analytical gradient computations can be complex, error-prone, and difficult to debug. A single sign error, missing derivative term, or incorrect chain rule application can silently corrupt your entire training process. This is where gradient verification becomes an indispensable debugging technique.
The centered finite difference approximation provides a reliable way to numerically estimate gradients. For a scalar-valued function $f: \mathbb{R}^n \to \mathbb{R}$, the partial derivative with respect to the $i$-th dimension can be approximated as:
$$\frac{\partial f}{\partial x_i} \approx \frac{f(x + \epsilon \cdot e_i) - f(x - \epsilon \cdot e_i)}{2\epsilon}$$
where $e_i$ is the unit vector in the $i$-th direction and $\epsilon$ is a small perturbation value (typically $10^{-7}$ to $10^{-5}$).
This centered difference formula achieves second-order accuracy, $O(\epsilon^2)$, making it more accurate than the forward difference formula, which has only first-order accuracy, $O(\epsilon)$.
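The accuracy gap is easy to observe empirically. The sketch below (function and evaluation point chosen purely for illustration) compares the two schemes on $\sin(x)$ at $x = 1$, where the true derivative is $\cos(1)$:

```python
import numpy as np

def forward_diff(f, x, eps):
    # First-order accurate: error shrinks like O(eps)
    return (f(x + eps) - f(x)) / eps

def centered_diff(f, x, eps):
    # Second-order accurate: error shrinks like O(eps^2)
    return (f(x + eps) - f(x - eps)) / (2 * eps)

true_grad = np.cos(1.0)  # d/dx sin(x) evaluated at x = 1.0
for eps in (1e-2, 1e-3, 1e-4):
    fwd = abs(forward_diff(np.sin, 1.0, eps) - true_grad)
    ctr = abs(centered_diff(np.sin, 1.0, eps) - true_grad)
    print(f"eps={eps:.0e}  forward error={fwd:.2e}  centered error={ctr:.2e}")
```

Shrinking $\epsilon$ by 10x reduces the forward-difference error by roughly 10x but the centered-difference error by roughly 100x, matching the stated orders of accuracy.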
To compare numerical and analytical gradients robustly, we use the relative error metric that normalizes by the magnitude of both gradients:
$$\text{relative\_error} = \frac{\|\nabla_{\text{numerical}} - \nabla_{\text{analytical}}\|_2}{\|\nabla_{\text{numerical}}\|_2 + \|\nabla_{\text{analytical}}\|_2}$$
This metric ranges from 0 (perfect match) to 1 (completely different gradients). A well-implemented gradient should achieve a relative error below $10^{-5}$, and often as low as $10^{-7}$.
Edge Case Handling: When both gradients are zero vectors, the relative error should be defined as 0 (since identical zero gradients represent a perfect match).
Implement a function that computes the numerical gradient using the centered finite difference method and calculates the relative error between the numerical and provided analytical gradients.
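One way to implement this is sketched below; the function name `check_gradient` and its exact structure are illustrative choices, not prescribed by the problem:

```python
import numpy as np

def check_gradient(f, x, analytical_grad, epsilon=1e-7):
    """Return (numerical_grad, relative_error) for f at the point x."""
    x = np.asarray(x, dtype=float)
    analytical_grad = np.asarray(analytical_grad, dtype=float)
    numerical_grad = np.zeros_like(x)

    # Centered difference, perturbing one coordinate at a time
    for i in range(x.size):
        e_i = np.zeros_like(x)
        e_i.flat[i] = epsilon
        numerical_grad.flat[i] = (f(x + e_i) - f(x - e_i)) / (2 * epsilon)

    denom = np.linalg.norm(numerical_grad) + np.linalg.norm(analytical_grad)
    if denom == 0.0:
        # Both gradients are zero vectors: perfect match by definition
        return numerical_grad, 0.0
    relative_error = np.linalg.norm(numerical_grad - analytical_grad) / denom
    return numerical_grad, relative_error
```

With `f(x) = x[0]**2 + x[1]**2` at `x = [3.0, 4.0]`, this returns a gradient close to `[6.0, 8.0]` with a relative error near the floating-point noise floor.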
Input Parameters:
- f: A scalar-valued function $f: \mathbb{R}^n \to \mathbb{R}$
- x: A numpy array representing the evaluation point
- analytical_grad: A numpy array representing the analytically computed gradient
- epsilon: The perturbation magnitude (default: $10^{-7}$)

Output: Return a tuple containing:

- numerical_grad: The numerically approximated gradient (same shape as x)
- relative_error: A scalar quantifying the difference between gradients

Example 1
Input:
f(x) = x[0]**2 + x[1]**2
x = [3.0, 4.0]
analytical_grad = [6.0, 8.0]
epsilon = 1e-7
Expected Output:
numerical_grad = [6.0, 8.0], relative_error ≈ 1e-9
Reasoning:
For the quadratic function f(x) = x₀² + x₁², the true gradient is ∇f = [2x₀, 2x₁].
At x = [3.0, 4.0], the analytical gradient is correctly given as [6.0, 8.0].
Computing the numerical gradient:
For x₀:
• f([3 + ε, 4]) = (3 + ε)² + 16 = 9 + 6ε + ε² + 16
• f([3 - ε, 4]) = (3 - ε)² + 16 = 9 - 6ε + ε² + 16
• ∂f/∂x₀ ≈ ((9 + 6ε + ε² + 16) - (9 - 6ε + ε² + 16)) / (2ε) = 12ε / (2ε) = 6.0

For x₁:
• f([3, 4 + ε]) = 9 + (4 + ε)² = 9 + 16 + 8ε + ε²
• f([3, 4 - ε]) = 9 + (4 - ε)² = 9 + 16 - 8ε + ε²
• ∂f/∂x₁ ≈ 16ε / (2ε) = 8.0
Since both gradients are [6.0, 8.0], the relative error is essentially 0, limited only by floating-point rounding (approximately 1e-9).
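This hand calculation is easy to reproduce numerically; the short NumPy check below evaluates both centered differences for Example 1 directly:

```python
import numpy as np

f = lambda v: v[0]**2 + v[1]**2
x = np.array([3.0, 4.0])
eps = 1e-7

# One centered difference per coordinate
numerical_grad = np.array([
    (f(x + np.array([eps, 0.0])) - f(x - np.array([eps, 0.0]))) / (2 * eps),
    (f(x + np.array([0.0, eps])) - f(x - np.array([0.0, eps]))) / (2 * eps),
])
print(numerical_grad)  # close to [6.0, 8.0]
```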
Example 2
Input:
f(x) = x[0] + x[1] + x[2]
x = [1.0, 2.0, 3.0]
analytical_grad = [1.0, 1.0, 1.0]
epsilon = 1e-7
Expected Output:
numerical_grad = [1.0, 1.0, 1.0], relative_error ≈ 1.9e-9
Reasoning:
For the linear function f(x) = x₀ + x₁ + x₂, the gradient is constant: ∇f = [1, 1, 1].
Computing the numerical gradient for each dimension:
For any dimension i:
• f(x + ε·eᵢ) - f(x - ε·eᵢ) = (sum of components + ε) - (sum of components - ε) = 2ε
• ∂f/∂xᵢ ≈ 2ε / (2ε) = 1.0
The numerical gradient matches [1.0, 1.0, 1.0] exactly. The relative error is at the noise floor of floating-point computation (approximately 1.9e-9).
Example 3
Input:
f(x) = x[0]**3 + x[1]**3
x = [2.0, -1.0]
analytical_grad = [12.0, 3.0]
epsilon = 1e-7
Expected Output:
numerical_grad = [12.0, 3.0], relative_error ≈ 3e-10
Reasoning:
For the cubic function f(x) = x₀³ + x₁³, the gradient is ∇f = [3x₀², 3x₁²].
At x = [2.0, -1.0]:
• ∂f/∂x₀ = 3 × (2.0)² = 12.0
• ∂f/∂x₁ = 3 × (-1.0)² = 3.0
Numerical verification using centered differences:
For x₀:
• f([2 + ε, -1]) = (2 + ε)³ + (-1)³ = 8 + 12ε + 6ε² + ε³ - 1
• f([2 - ε, -1]) = (2 - ε)³ + (-1)³ = 8 - 12ε + 6ε² - ε³ - 1
• ∂f/∂x₀ ≈ (24ε + 2ε³) / (2ε) = 12 + ε² ≈ 12.0

For x₁:
• f([2, -1 + ε]) = 8 + (-1 + ε)³ = 8 - 1 + 3ε - 3ε² + ε³
• f([2, -1 - ε]) = 8 + (-1 - ε)³ = 8 - 1 - 3ε - 3ε² - ε³
• ∂f/∂x₁ ≈ (6ε + 2ε³) / (2ε) = 3 + ε² ≈ 3.0
Both gradients agree, yielding an extremely small relative error (~3e-10), demonstrating correct analytical gradient implementation.
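To see why the metric is useful for debugging, suppose the analytical gradient of Example 3 had contained a sign error in its second component (a hypothetical bug, not part of the problem). The sketch below shows the relative error jumping from the noise floor to a clearly visible value:

```python
import numpy as np

def centered_grad(f, x, eps=1e-7):
    # Centered finite differences, one coordinate at a time
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

def rel_err(a, b):
    denom = np.linalg.norm(a) + np.linalg.norm(b)
    return 0.0 if denom == 0 else np.linalg.norm(a - b) / denom

f = lambda v: v[0]**3 + v[1]**3
x = np.array([2.0, -1.0])
num = centered_grad(f, x)

correct = np.array([12.0, 3.0])
buggy = np.array([12.0, -3.0])  # hypothetical sign error in the second component

print(rel_err(num, correct))  # tiny, near the floating-point noise floor
print(rel_err(num, buggy))    # large (around 0.24): the bug is immediately visible
```

A single flipped sign moves the relative error by roughly nine orders of magnitude, which is exactly the kind of unambiguous signal that makes gradient checking effective.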
Constraints