In deep learning and machine learning optimization, L2 regularization (also known as weight decay) is a fundamental technique used to prevent overfitting by penalizing large parameter values. This regularization approach adds a term proportional to the sum of squared weights to the loss function, encouraging the model to keep weights small and thus improve generalization.
When training neural networks, the standard gradient descent update rule is:
$$\theta_{\text{new}} = \theta_{\text{old}} - \eta \cdot \nabla L$$
where θ represents the model parameters, η is the learning rate, and ∇L is the gradient of the loss function. With L2 regularization, the update rule becomes:
$$\theta_{\text{new}} = \theta_{\text{old}} - \eta \cdot \nabla L - \eta \cdot \lambda \cdot \theta_{\text{old}}$$
This can be rewritten as:
$$\theta_{\text{new}} = \theta_{\text{old}} \cdot (1 - \eta \cdot \lambda) - \eta \cdot \nabla L$$
where λ is the weight decay coefficient that controls the strength of regularization.
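The equivalence of the two forms above can be checked numerically (a minimal sketch in plain Python; the variable names are illustrative):

```python
# Single-parameter L2-regularized gradient descent step, in both algebraic forms.
theta = 2.0          # current parameter value
grad = 0.2           # gradient of the loss w.r.t. theta
lr = 0.1             # learning rate (eta)
weight_decay = 0.01  # weight decay coefficient (lambda)

# Form 1: subtract the gradient term and the weight decay term separately.
theta_explicit = theta - lr * grad - lr * weight_decay * theta

# Form 2: shrink the old weight by (1 - lr * weight_decay), then take the gradient step.
theta_factored = theta * (1 - lr * weight_decay) - lr * grad

print(theta_explicit, theta_factored)  # both evaluate to 1.978
```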
Selective Application: In practice, weight decay is typically applied only to weight parameters and not to bias terms. This is because biases don't contribute to overfitting in the same way weights do, and penalizing biases can prevent the model from learning necessary offsets. Your implementation should support selectively applying weight decay to specific parameter groups based on a boolean flag array.
Your Task: Implement a function that performs gradient descent parameter updates with optional L2 regularization. The function should:
• Accept a list of parameter groups, a matching list of gradient groups, a learning rate, a weight decay coefficient, and one boolean flag per group indicating whether weight decay applies to that group.
• Update every parameter with the standard gradient descent step, subtracting the additional weight decay term only for groups whose flag is True.
• Return the updated parameters in the same nested structure.
parameters = [[1.0, 2.0]]
gradients = [[0.1, 0.2]]
lr = 0.1
weight_decay = 0.01
apply_to_all = [True]
Output: [[0.989, 1.978]]
With weight decay enabled for the first parameter group:
• For the first parameter (θ = 1.0, g = 0.1): θ_new = 1.0 - (0.1 × 0.1) - (0.1 × 0.01 × 1.0) = 1.0 - 0.01 - 0.001 = 0.989
• For the second parameter (θ = 2.0, g = 0.2): θ_new = 2.0 - (0.1 × 0.2) - (0.1 × 0.01 × 2.0) = 2.0 - 0.02 - 0.002 = 1.978
The gradient descent update (-lr × gradient) and the weight decay term (-lr × weight_decay × θ) are both applied.
parameters = [[1.0, 2.0], [0.5]]
gradients = [[0.1, 0.2], [0.05]]
lr = 0.1
weight_decay = 0.01
apply_to_all = [True, False]
Output: [[0.989, 1.978], [0.495]]
This example demonstrates selective weight decay application:
First parameter group (apply_to_all[0] = True, weight decay applied):
• θ₁ = 1.0 - (0.1 × 0.1) - (0.1 × 0.01 × 1.0) = 0.989
• θ₂ = 2.0 - (0.1 × 0.2) - (0.1 × 0.01 × 2.0) = 1.978
Second parameter group (apply_to_all[1] = False, weight decay NOT applied):
• θ₃ = 0.5 - (0.1 × 0.05) = 0.5 - 0.005 = 0.495
This is typical: weights get regularized, biases do not.
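The per-group logic above can be sketched as follows (the function name `sgd_update` and its signature are assumptions for illustration, not a required interface):

```python
def sgd_update(parameters, gradients, lr, weight_decay, apply_to_all):
    """One gradient descent step per parameter group, with optional L2 weight decay."""
    updated = []
    for group, grads, use_decay in zip(parameters, gradients, apply_to_all):
        new_group = []
        for theta, g in zip(group, grads):
            theta_new = theta - lr * g                      # standard gradient step
            if use_decay:
                theta_new -= lr * weight_decay * theta      # L2 weight decay term
            new_group.append(theta_new)
        updated.append(new_group)
    return updated

# The selective example from the text: decay on the first group only.
result = sgd_update([[1.0, 2.0], [0.5]], [[0.1, 0.2], [0.05]],
                    lr=0.1, weight_decay=0.01, apply_to_all=[True, False])
print([[round(v, 3) for v in g] for g in result])  # [[0.989, 1.978], [0.495]]
```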
parameters = [[1.0, 2.0, 3.0]]
gradients = [[0.1, 0.2, 0.3]]
lr = 0.1
weight_decay = 0.01
apply_to_all = [False]
Output: [[0.99, 1.98, 2.97]]
With weight decay disabled, only standard gradient descent is performed:
• θ₁ = 1.0 - (0.1 × 0.1) = 1.0 - 0.01 = 0.99
• θ₂ = 2.0 - (0.1 × 0.2) = 2.0 - 0.02 = 1.98
• θ₃ = 3.0 - (0.1 × 0.3) = 3.0 - 0.03 = 2.97
No L2 penalty is subtracted, only the gradient-based update.
Constraints