Deep neural networks possess an extraordinary capacity to memorize training data. Given sufficient parameters, a network can achieve zero training error on virtually any dataset—including randomly labeled data. This remarkable flexibility is both a blessing and a curse: while it enables learning complex patterns, it also creates a dangerous propensity for overfitting.
One key symptom of overfitting manifests in the magnitude of learned weights. When a network overfits, individual weights often grow to extreme values, creating decision boundaries that are exquisitely tuned to training examples but catastrophically wrong for new data. The weights become hypersensitive—small changes in input produce wild swings in output.
L2 weight decay addresses this directly by penalizing large weight magnitudes, encouraging the network to find solutions that use smaller, more balanced weights. This seemingly simple modification has profound implications for generalization, optimization dynamics, and the geometry of learned representations.
This page provides complete coverage of L2 weight decay: the mathematical formulation, geometric interpretation, gradient dynamics, relationship to Bayesian priors, effects on optimization, practical implementation considerations, and interaction with modern techniques like batch normalization and adaptive optimizers.
L2 regularization augments the standard loss function with a penalty term proportional to the squared Euclidean norm of the weight vector. Given a neural network with parameters $\boldsymbol{\theta}$ (encompassing all weight matrices and bias vectors), the regularized objective becomes:
$$\mathcal{L}_{\text{reg}}(\boldsymbol{\theta}) = \mathcal{L}_{\text{data}}(\boldsymbol{\theta}) + \frac{\lambda}{2} \|\boldsymbol{\theta}\|_2^2$$
where:

- $\mathcal{L}_{\text{data}}(\boldsymbol{\theta})$ is the original data loss (e.g., cross-entropy or mean squared error),
- $\lambda \geq 0$ is the regularization strength, and
- $\|\boldsymbol{\theta}\|_2^2 = \sum_i \theta_i^2$ is the squared Euclidean norm of the parameters.

Expanding for a network with $L$ layers, each with weight matrix $\mathbf{W}^{(l)}$:

$$\mathcal{L}_{\text{reg}} = \mathcal{L}_{\text{data}} + \frac{\lambda}{2} \sum_{l=1}^{L} \|\mathbf{W}^{(l)}\|_F^2$$
```python
# L2 regularization penalty computation
import numpy as np

def compute_l2_penalty(weights_list, lambda_reg):
    """
    Compute L2 regularization penalty for all weight matrices.

    Args:
        weights_list: List of weight matrices [W1, W2, ..., WL]
        lambda_reg: Regularization strength (λ)

    Returns:
        L2 penalty term: (λ/2) * Σ||W||²_F
    """
    l2_penalty = 0.0
    for W in weights_list:
        # Frobenius norm squared = sum of squared elements
        l2_penalty += np.sum(W ** 2)
    return (lambda_reg / 2) * l2_penalty

def regularized_loss(data_loss, weights_list, lambda_reg):
    """
    Compute total regularized loss.

    L_reg = L_data + (λ/2) * Σ||W||²_F
    """
    l2_term = compute_l2_penalty(weights_list, lambda_reg)
    return data_loss + l2_term
```

In practice, bias terms are often excluded from L2 regularization. Biases shift activations but don't control the sensitivity of outputs to inputs—they have lower impact on model capacity. Regularizing biases can also harm performance by preventing necessary activation shifts.
The Frobenius Norm Connection:
For weight matrices, the L2 penalty equals the squared Frobenius norm:
$$\|\mathbf{W}\|_F^2 = \sum_{i,j} W_{ij}^2 = \text{tr}(\mathbf{W}^\top \mathbf{W})$$
This is the natural extension of the Euclidean norm to matrices—the sum of squared elements. The trace formulation $\text{tr}(\mathbf{W}^\top \mathbf{W})$ is useful for theoretical analysis and efficient GPU computation.
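As a quick numerical check of this equivalence, here is a small NumPy sketch showing that the sum of squares, the squared Frobenius norm, and the trace form all agree:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))

sum_of_squares = np.sum(W ** 2)                # Σ_ij W_ij²
frobenius_sq = np.linalg.norm(W, 'fro') ** 2   # ||W||_F²
trace_form = np.trace(W.T @ W)                 # tr(WᵀW)

# All three formulations agree up to floating-point error
assert np.allclose([sum_of_squares, frobenius_sq], trace_form)
```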
The gradient of the L2 penalty with respect to any weight $w$ has an elegantly simple form:
$$\frac{\partial}{\partial w} \left( \frac{\lambda}{2} w^2 \right) = \lambda w$$
This means the gradient of the regularized loss is:
$$\nabla_w \mathcal{L}_{\text{reg}} = \nabla_w \mathcal{L}_{\text{data}} + \lambda w$$
The gradient descent update becomes:
$$w_{t+1} = w_t - \eta (\nabla_w \mathcal{L}_{\text{data}} + \lambda w_t) = w_t(1 - \eta\lambda) - \eta \nabla_w \mathcal{L}_{\text{data}}$$
Notice the term $(1 - \eta\lambda)$: at each step, weights are multiplied by a factor less than 1 before the gradient step. This is why L2 regularization is called weight decay—weights naturally decay toward zero unless the data gradient pushes them away.
```python
def sgd_with_l2_regularization(weights, gradients, lr, lambda_reg):
    """
    SGD update with L2 regularization added to gradient.

    w_new = w - lr * (grad + λ * w)
          = w * (1 - lr*λ) - lr * grad
    """
    updated = []
    for w, g in zip(weights, gradients):
        # Method 1: Add regularization to gradient
        regularized_grad = g + lambda_reg * w
        w_new = w - lr * regularized_grad

        # Equivalent Method 2: Weight decay form
        # w_new = w * (1 - lr * lambda_reg) - lr * g

        updated.append(w_new)
    return updated

def weight_decay_update(weights, gradients, lr, weight_decay):
    """
    Direct weight decay formulation.

    This multiplies weights by decay factor before gradient step.
    """
    decay_factor = 1 - lr * weight_decay
    updated = []
    for w, g in zip(weights, gradients):
        w_new = decay_factor * w - lr * g
        updated.append(w_new)
    return updated
```

Without any data gradient, a weight decays exponentially: w_t = w_0 · (1 - ηλ)^t. After many steps, weights approach zero. The equilibrium weight magnitude is determined by the balance between data gradients pushing weights away from zero and decay pulling them back.
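To see the decay behavior in isolation, here is a minimal sketch (a single scalar weight, zero data gradient) comparing the iterative update against the closed form $w_t = w_0 (1 - \eta\lambda)^t$:

```python
import numpy as np

w0, lr, lambda_reg, steps = 1.0, 0.1, 0.01, 500

# Iterative weight decay with zero data gradient
w = w0
for _ in range(steps):
    w = w * (1 - lr * lambda_reg)

# Closed-form prediction
w_closed = w0 * (1 - lr * lambda_reg) ** steps

print(f"iterative: {w:.6f}, closed form: {w_closed:.6f}")  # both ≈ 0.606
assert np.isclose(w, w_closed)
```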
L2 regularization has a beautiful geometric interpretation. Consider optimization in weight space:
The Constrained Optimization View:
Minimizing $\mathcal{L}_{\text{data}} + \frac{\lambda}{2}\|\boldsymbol{\theta}\|_2^2$ is equivalent (via Lagrange multipliers) to solving:

$$\min_{\boldsymbol{\theta}} \mathcal{L}_{\text{data}}(\boldsymbol{\theta}) \quad \text{subject to} \quad \|\boldsymbol{\theta}\|_2^2 \leq c$$

for some constraint radius $c$ determined by $\lambda$. Geometrically, the constraint restricts the search to a ball centered at the origin: larger $\lambda$ corresponds to a smaller ball, and $\lambda \to 0$ recovers the unconstrained problem.
The Contour Intersection:
Imagine level curves of the data loss $\mathcal{L}_{\text{data}}$ (ellipses in 2D) and circles representing the L2 constraint. The regularized optimum occurs where a loss contour is tangent to the constraint circle: at that point, no move along the constraint boundary can further reduce the data loss. The table below contrasts the two solutions, and a numerical sketch follows it.
| Aspect | Without Regularization | With L2 Regularization |
|---|---|---|
| Solution location | Unconstrained minimum | Projected toward origin |
| Weight magnitudes | Can grow arbitrarily large | Bounded by effective constraint |
| Solution space | Entire parameter space | Hypersphere centered at origin |
| Sensitivity | Can be extreme | Controlled, more uniform |
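The following sketch uses a hypothetical 2D quadratic data loss (chosen purely for illustration), for which both minimizers have closed forms, to show how increasing $\lambda$ pulls the solution toward the origin and shrinks its norm:

```python
import numpy as np

# Hypothetical 2D quadratic data loss: L(θ) = ½ (θ - θ*)ᵀ A (θ - θ*)
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])          # positive-definite curvature
theta_star = np.array([2.0, -3.0])  # unregularized minimum

def regularized_minimum(A, theta_star, lam):
    # Setting the gradient A(θ - θ*) + λθ to zero gives θ = (A + λI)⁻¹ A θ*
    return np.linalg.solve(A + lam * np.eye(2), A @ theta_star)

for lam in [0.0, 0.1, 1.0, 10.0]:
    theta = regularized_minimum(A, theta_star, lam)
    print(f"λ={lam:5.1f}  θ={theta}  ||θ||={np.linalg.norm(theta):.3f}")
# As λ grows, θ moves toward the origin and its norm shrinks.
```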
Why Small Weights Generalize:
The geometric view reveals why L2 regularization improves generalization:
Smoother Functions: Networks with smaller weights have smaller Lipschitz constants—output changes slowly as input varies. This smooth behavior is less likely to overfit noise (a short sketch after this list makes the bound concrete).
Implicit Capacity Control: By constraining weights to a sphere, we limit the effective capacity of the model. Fewer extreme weight configurations means less ability to memorize.
Stability: Small weights mean bounded activations and gradients, reducing the risk of exploding values during training and inference.
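To make the smoothness point concrete, here is a rough sketch (an upper bound, not a tight estimate): for a network with 1-Lipschitz activations such as ReLU, the product of the layers' spectral norms bounds the network's Lipschitz constant, so shrinking the weights shrinks the bound.

```python
import numpy as np

def lipschitz_upper_bound(weight_matrices):
    """Product of spectral norms: an upper bound on the Lipschitz constant
    of a feedforward network with 1-Lipschitz activations (e.g., ReLU)."""
    bound = 1.0
    for W in weight_matrices:
        bound *= np.linalg.norm(W, 2)  # largest singular value
    return bound

rng = np.random.default_rng(0)
weights = [rng.standard_normal((64, 32)), rng.standard_normal((32, 10))]

print("original bound:", lipschitz_upper_bound(weights))
# Scaling every weight matrix down scales the bound down as well
shrunk = [0.5 * W for W in weights]
print("shrunk bound:  ", lipschitz_upper_bound(shrunk))  # 0.25× the original
```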
L2 regularization has a profound probabilistic interpretation: it corresponds to placing a Gaussian prior on the weights and performing maximum a posteriori (MAP) estimation.
The Prior:
Assume each weight is drawn independently from a zero-mean Gaussian: $$p(w_i) = \mathcal{N}(0, \sigma_w^2) = \frac{1}{\sqrt{2\pi\sigma_w^2}} \exp\left(-\frac{w_i^2}{2\sigma_w^2}\right)$$
MAP Estimation:
Maximizing the posterior $p(\boldsymbol{\theta}|\mathcal{D}) \propto p(\mathcal{D}|\boldsymbol{\theta}) p(\boldsymbol{\theta})$ is equivalent to minimizing:
$$-\log p(\boldsymbol{\theta}|\mathcal{D}) = -\log p(\mathcal{D}|\boldsymbol{\theta}) - \log p(\boldsymbol{\theta}) + \text{const}$$
The negative log-prior becomes: $$-\log p(\boldsymbol{\theta}) = \frac{1}{2\sigma_w^2} \sum_i w_i^2 + \text{const} = \frac{1}{2\sigma_w^2} \|\boldsymbol{\theta}\|_2^2 + \text{const}$$
Comparing with the L2 penalty $\frac{\lambda}{2}\|\boldsymbol{\theta}\|_2^2$, we identify: $$\lambda = \frac{1}{\sigma_w^2}$$
Strong regularization (large λ) corresponds to a narrow prior (small σ_w)—we believe weights should be close to zero. Weak regularization (small λ) corresponds to a broad prior (large σ_w)—we're less certain about weight magnitudes.
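As a concrete sketch of the correspondence, consider linear regression with unit-variance Gaussian noise (a simplifying assumption): the MAP estimate under a $\mathcal{N}(0, \sigma_w^2)$ prior coincides with the L2-regularized least-squares solution at $\lambda = 1/\sigma_w^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(50)

sigma_w2 = 2.0        # prior variance on each weight
lam = 1.0 / sigma_w2  # equivalent L2 strength (unit noise variance assumed)

# Ridge / L2-regularized least squares: argmin ½||Xw - y||² + (λ/2)||w||²
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

# MAP with Gaussian likelihood (σ² = 1) and Gaussian prior N(0, σ_w² I)
# has the same closed form: (XᵀX + I/σ_w²)⁻¹ Xᵀ y
w_map = np.linalg.solve(X.T @ X + np.eye(5) / sigma_w2, X.T @ y)

assert np.allclose(w_ridge, w_map)
```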
Implications of the Bayesian View:
Principled Hyperparameter Selection: The prior variance $\sigma_w^2$ encodes our belief about weight scales. Domain knowledge can inform $\lambda$ selection.
Uncertainty Quantification: The Bayesian framing connects to posterior uncertainty estimation—though MAP is a point estimate, the framework extends to full Bayesian inference.
Hierarchical Models: We can place hyperpriors on $\lambda$ and learn it from data, enabling empirical Bayes approaches.
Comparison with Other Priors: L1 regularization corresponds to a Laplace prior (promoting exact sparsity), elastic net to a mixture, and so on.
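For reference, the same negative log-prior calculation with a zero-mean Laplace prior of scale $b$ recovers the L1 penalty:

$$p(w_i) = \frac{1}{2b} \exp\left(-\frac{|w_i|}{b}\right) \quad\Longrightarrow\quad -\log p(\boldsymbol{\theta}) = \frac{1}{b} \sum_i |w_i| + \text{const} = \frac{1}{b} \|\boldsymbol{\theta}\|_1 + \text{const}$$

i.e., an L1 penalty with strength $\lambda = 1/b$.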
L2 regularization fundamentally alters the loss landscape, affecting both the location and nature of minima.
Curvature Modification:
The Hessian of the regularized loss is: $$\mathbf{H}_{\text{reg}} = \mathbf{H}_{\text{data}} + \lambda \mathbf{I}$$
Adding $\lambda \mathbf{I}$ to the Hessian has critical effects:
Eigenvalue Shift: Every eigenvalue of the Hessian increases by $\lambda$. Writing the eigenvalues of $\mathbf{H}_{\text{data}}$ as $\{\mu_1, \mu_2, \ldots\}$ (to avoid confusion with the regularization strength $\lambda$), the eigenvalues of $\mathbf{H}_{\text{reg}}$ are $\{\mu_1 + \lambda, \mu_2 + \lambda, \ldots\}$.
Improved Conditioning: The condition number $\kappa = \mu_{\max}/\mu_{\min}$ decreases to $\kappa_{\text{reg}} = (\mu_{\max} + \lambda)/(\mu_{\min} + \lambda)$. This makes gradient descent converge faster and more stably.
Eliminating Flat Directions: If $\mathbf{H}_{\text{data}}$ has zero eigenvalues (flat directions), regularization makes them positive, removing ambiguity in the solution.
```python
import numpy as np

def analyze_hessian_conditioning(H_data, lambda_reg):
    """
    Analyze how L2 regularization affects Hessian conditioning.

    H_reg = H_data + λI
    """
    n = H_data.shape[0]
    H_reg = H_data + lambda_reg * np.eye(n)

    # Compute eigenvalues
    eigs_data = np.linalg.eigvalsh(H_data)
    eigs_reg = np.linalg.eigvalsh(H_reg)

    # Condition numbers (ratio of max to min eigenvalue)
    # Add small epsilon to avoid division by zero
    eps = 1e-10
    cond_data = np.max(eigs_data) / (np.min(np.abs(eigs_data)) + eps)
    cond_reg = np.max(eigs_reg) / np.min(eigs_reg)

    print(f"Data Hessian eigenvalues: [{eigs_data.min():.4f}, {eigs_data.max():.4f}]")
    print(f"Reg Hessian eigenvalues:  [{eigs_reg.min():.4f}, {eigs_reg.max():.4f}]")
    print(f"Data condition number: {cond_data:.2f}")
    print(f"Reg condition number:  {cond_reg:.2f}")

    return cond_data, cond_reg
```

Poor conditioning (large κ) means gradient descent takes tiny steps in some directions and large steps in others, causing slow, oscillatory convergence. L2 regularization compresses the eigenvalue spectrum, enabling faster, more stable optimization.
PyTorch provides built-in support for L2 regularization through the weight_decay parameter in optimizers. However, understanding the nuances is crucial for correct usage.
```python
import torch
import torch.nn as nn
import torch.optim as optim

# Method 1: Using optimizer's weight_decay parameter
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
)

# weight_decay applies L2 penalty to all parameters
optimizer = optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# Method 2: Manual L2 regularization (for custom control)
def train_with_manual_l2(model, dataloader, criterion, optimizer, lambda_reg):
    model.train()
    for inputs, targets in dataloader:
        optimizer.zero_grad()

        # Forward pass
        outputs = model(inputs)
        data_loss = criterion(outputs, targets)

        # Compute L2 penalty manually (exclude biases)
        l2_penalty = 0.0
        for name, param in model.named_parameters():
            if 'weight' in name:  # Only penalize weights, not biases
                l2_penalty += torch.sum(param ** 2)

        # Total loss
        total_loss = data_loss + (lambda_reg / 2) * l2_penalty

        # Backward and update
        total_loss.backward()
        optimizer.step()

# Method 3: Per-layer regularization (different λ per layer)
def create_param_groups_with_varying_decay(model):
    """
    Apply different weight decay to different layers.

    Often useful: less decay on early layers, more on later layers.
    """
    param_groups = []
    for i, (name, module) in enumerate(model.named_modules()):
        if isinstance(module, nn.Linear):
            # Increase decay for deeper layers
            layer_decay = 1e-4 * (1 + i * 0.5)
            param_groups.append({
                'params': module.weight,
                'weight_decay': layer_decay
            })
            param_groups.append({
                'params': module.bias,
                'weight_decay': 0.0  # No decay on biases
            })
    return param_groups

# Usage with per-layer decay
param_groups = create_param_groups_with_varying_decay(model)
optimizer = optim.SGD(param_groups, lr=0.01)
```

For adaptive optimizers like Adam, weight_decay in the optimizer is NOT equivalent to L2 regularization! This is addressed in AdamW (decoupled weight decay). We cover this critical distinction in a later page.
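As a brief preview (a minimal usage sketch reusing the `model` defined above; the details are left to that later page):

```python
import torch.optim as optim

# Adam with weight_decay adds λw to the gradient, which is then rescaled
# by Adam's per-parameter adaptive step sizes (an L2-style penalty).
adam = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-2)

# AdamW decouples the decay: weights are shrunk directly at each step,
# independently of the adaptive gradient rescaling.
adamw = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```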
Choosing $\lambda$ is one of the most important hyperparameter decisions. Too small, and you get no regularization benefit; too large, and you prevent the model from learning.
General Guidelines:
| Setting | Recommended λ | Rationale |
|---|---|---|
| Small datasets (<10K) | 1e-3 to 1e-2 | Strong regularization to prevent overfitting |
| Medium datasets (10K-100K) | 1e-4 to 1e-3 | Moderate regularization |
| Large datasets (>100K) | 1e-5 to 1e-4 | Light regularization; data itself regularizes |
| Pre-trained fine-tuning | 1e-5 to 1e-4 | Preserve learned features |
| Training from scratch | 1e-4 to 1e-3 | Allow learning but constrain |
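These ranges are starting points; in practice $\lambda$ is usually tuned with a validation sweep over a logarithmic grid. Below is a minimal sketch using synthetic stand-in data (purely to keep the example runnable):

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Synthetic stand-in data, just to make the sketch self-contained
torch.manual_seed(0)
X_train, y_train = torch.randn(512, 20), torch.randint(0, 2, (512,))
X_val, y_val = torch.randn(256, 20), torch.randint(0, 2, (256,))

def train_and_evaluate(weight_decay, epochs=30):
    """Train a small fresh model with the given weight decay; return val accuracy."""
    model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
    opt = optim.SGD(model.parameters(), lr=0.1, weight_decay=weight_decay)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(X_train), y_train).backward()
        opt.step()
    with torch.no_grad():
        return (model(X_val).argmax(dim=1) == y_val).float().mean().item()

# Sweep λ over a logarithmic grid and keep the best validation accuracy
results = {lam: train_and_evaluate(lam) for lam in [1e-5, 1e-4, 1e-3, 1e-2]}
best_lambda = max(results, key=results.get)
print({f"{lam:.0e}": round(acc, 3) for lam, acc in results.items()})
print(f"best λ = {best_lambda:.0e}")
```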
You now understand L2 weight decay from multiple perspectives: mathematical, geometric, probabilistic, and practical. Next, we explore L1 regularization (sparsity-inducing) and its fundamentally different behavior.