In deep learning and machine learning optimization, L2 regularization (also known as weight decay) is a fundamental technique used to prevent overfitting by penalizing large parameter values. This regularization approach adds a term proportional to the sum of squared weights to the loss function, encouraging the model to keep weights small and thus improve generalization.
When training neural networks, the standard gradient descent update rule is:
$$\theta_{\text{new}} = \theta_{\text{old}} - \eta \cdot \nabla L$$
where θ represents the model parameters, η is the learning rate, and ∇L is the gradient of the loss function. With L2 regularization, the update rule becomes:
$$\theta_{\text{new}} = \theta_{\text{old}} - \eta \cdot \nabla L - \eta \cdot \lambda \cdot \theta_{\text{old}}$$
This can be rewritten as:
$$\theta_{\text{new}} = \theta_{\text{old}} \cdot (1 - \eta \cdot \lambda) - \eta \cdot \nabla L$$
where λ is the weight decay coefficient that controls the strength of regularization.
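The equivalence of the two forms above can be checked numerically (a minimal sketch in plain Python; the variable names are illustrative):

```python
# Single-parameter L2-regularized gradient descent step, in both algebraic forms.
theta = 2.0          # current parameter value
grad = 0.2           # gradient of the loss w.r.t. theta
lr = 0.1             # learning rate (eta)
weight_decay = 0.01  # weight decay coefficient (lambda)

# Form 1: subtract the gradient term and the weight decay term separately.
theta_explicit = theta - lr * grad - lr * weight_decay * theta

# Form 2: shrink the old weight by (1 - lr * weight_decay), then take the gradient step.
theta_factored = theta * (1 - lr * weight_decay) - lr * grad

print(theta_explicit, theta_factored)  # both evaluate to 1.978
```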
Selective Application: In practice, weight decay is typically applied only to weight parameters and not to bias terms. This is because biases don't contribute to overfitting in the same way weights do, and penalizing biases can prevent the model from learning necessary offsets. Your implementation should support selectively applying weight decay to specific parameter groups based on a boolean flag array.
Your Task: Implement a function that performs gradient descent parameter updates with optional L2 regularization. The function should:
• Accept a list of parameter groups, a matching list of gradient groups, a learning rate, a weight decay coefficient, and one boolean flag per group indicating whether weight decay applies to that group.
• Update every parameter with the standard gradient descent step, subtracting the additional weight decay term only for groups whose flag is True.
• Return the updated parameters in the same nested structure.
parameters = [[1.0, 2.0]]
gradients = [[0.1, 0.2]]
lr = 0.1
weight_decay = 0.01
apply_to_all = [True]
Output: [[0.989, 1.978]]
With weight decay enabled for the first parameter group:
• For the first parameter (θ = 1.0, g = 0.1): θ_new = 1.0 - (0.1 × 0.1) - (0.1 × 0.01 × 1.0) = 1.0 - 0.01 - 0.001 = 0.989
• For the second parameter (θ = 2.0, g = 0.2): θ_new = 2.0 - (0.1 × 0.2) - (0.1 × 0.01 × 2.0) = 2.0 - 0.02 - 0.002 = 1.978
The gradient descent update (-lr × gradient) and the weight decay term (-lr × weight_decay × θ) are both applied.
parameters = [[1.0, 2.0], [0.5]]
gradients = [[0.1, 0.2], [0.05]]
lr = 0.1
weight_decay = 0.01
apply_to_all = [True, False]
Output: [[0.989, 1.978], [0.495]]
This example demonstrates selective weight decay application:
First parameter group (apply_to_all[0] = True, weight decay applied):
• θ₁ = 1.0 - (0.1 × 0.1) - (0.1 × 0.01 × 1.0) = 0.989
• θ₂ = 2.0 - (0.1 × 0.2) - (0.1 × 0.01 × 2.0) = 1.978
Second parameter group (apply_to_all[1] = False, weight decay NOT applied):
• θ₃ = 0.5 - (0.1 × 0.05) = 0.5 - 0.005 = 0.495
This is typical: weights get regularized, biases do not.
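The per-group logic above can be sketched as follows (the function name `sgd_update` and its signature are assumptions for illustration, not a required interface):

```python
def sgd_update(parameters, gradients, lr, weight_decay, apply_to_all):
    """One gradient descent step per parameter group, with optional L2 weight decay."""
    updated = []
    for group, grads, use_decay in zip(parameters, gradients, apply_to_all):
        new_group = []
        for theta, g in zip(group, grads):
            theta_new = theta - lr * g                      # standard gradient step
            if use_decay:
                theta_new -= lr * weight_decay * theta      # L2 weight decay term
            new_group.append(theta_new)
        updated.append(new_group)
    return updated

# The selective example from the text: decay on the first group only.
result = sgd_update([[1.0, 2.0], [0.5]], [[0.1, 0.2], [0.05]],
                    lr=0.1, weight_decay=0.01, apply_to_all=[True, False])
print([[round(v, 3) for v in g] for g in result])  # [[0.989, 1.978], [0.495]]
```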
parameters = [[1.0, 2.0, 3.0]]
gradients = [[0.1, 0.2, 0.3]]
lr = 0.1
weight_decay = 0.01
apply_to_all = [False]
Output: [[0.99, 1.98, 2.97]]
With weight decay disabled, only standard gradient descent is performed:
• θ₁ = 1.0 - (0.1 × 0.1) = 1.0 - 0.01 = 0.99
• θ₂ = 2.0 - (0.1 × 0.2) = 2.0 - 0.02 = 1.98
• θ₃ = 3.0 - (0.1 × 0.3) = 3.0 - 0.03 = 2.97
No L2 penalty is subtracted, only the gradient-based update.
Constraints