While L2 regularization shrinks all weights toward zero proportionally, it rarely produces weights that are exactly zero. Every connection in the network remains active, even if diminished. In many contexts—particularly when interpretability, compression, or feature selection matter—we desire something stronger: we want the model to set irrelevant weights to precisely zero.
L1 regularization (also known as Lasso in linear regression) achieves this by penalizing the absolute value of weights rather than their squares. This seemingly subtle change produces a qualitatively different effect: L1 encourages sparse solutions where many weights are exactly zero while others remain at substantial magnitudes.
This sparsity-inducing property has profound implications for neural network compression, automated feature selection, interpretability, and understanding which connections truly matter for a task.
This page provides complete coverage of L1 regularization: mathematical formulation, the geometry that produces sparsity, subgradient optimization, comparison with L2, proximal gradient methods, practical implementation, and when to prefer L1 over L2 in deep learning.
L1 regularization augments the loss function with a penalty proportional to the L1 norm (sum of absolute values) of the weight vector:
$$\mathcal{L}_{\text{reg}}(\boldsymbol{\theta}) = \mathcal{L}_{\text{data}}(\boldsymbol{\theta}) + \lambda \|\boldsymbol{\theta}\|_1$$
where:
- $\mathcal{L}_{\text{data}}$ is the original (unregularized) data loss
- $\lambda > 0$ is the regularization strength
- $\|\boldsymbol{\theta}\|_1 = \sum_i |\theta_i|$ is the L1 norm of the parameter vector
Expanding for all weights across layers:
$$\mathcal{L}_{\text{reg}} = \mathcal{L}_{\text{data}} + \lambda \sum_{l=1}^{L} \sum_{i,j} |W_{ij}^{(l)}|$$
Key Difference from L2:
The L1 norm is not differentiable at zero—the function $f(w) = |w|$ has a kink at $w = 0$. This non-differentiability is precisely what enables sparsity: the gradient is undefined at zero, creating a "sticky" point.
```python
import numpy as np


def compute_l1_penalty(weights_list, lambda_reg):
    """
    Compute L1 regularization penalty.

    Args:
        weights_list: List of weight matrices [W1, W2, ..., WL]
        lambda_reg: Regularization strength (λ)

    Returns:
        L1 penalty: λ * Σ|W|
    """
    l1_penalty = 0.0
    for W in weights_list:
        l1_penalty += np.sum(np.abs(W))
    return lambda_reg * l1_penalty


def l1_subgradient(W, lambda_reg):
    """
    Compute subgradient of L1 penalty.

    For w ≠ 0: ∂|w|/∂w = sign(w)
    For w = 0: subgradient is any value in [-1, 1]

    We use sign(w), which gives 0 at w=0.
    """
    return lambda_reg * np.sign(W)


def l1_regularized_loss(data_loss, weights_list, lambda_reg):
    """
    Total loss with L1 regularization.
    """
    l1_term = compute_l1_penalty(weights_list, lambda_reg)
    return data_loss + l1_term
```

The sparsity-inducing property of L1 has a beautiful geometric explanation. Consider the constrained optimization view:
$$\min_{\boldsymbol{\theta}} \mathcal{L}_{\text{data}}(\boldsymbol{\theta}) \quad \text{subject to} \quad \|\boldsymbol{\theta}\|_1 \leq c$$
The Key Geometric Difference:
The L1 constraint region $\|\boldsymbol{\theta}\|_1 \leq c$ is a diamond (cross-polytope) with sharp corners on the coordinate axes, while the L2 constraint region $\|\boldsymbol{\theta}\|_2 \leq c$ is a smooth sphere with no corners.
Why Corners Matter:
Imagine the loss function $\mathcal{L}_{\text{data}}$ as elliptical contours expanding from a minimum. As we inflate these contours, we seek the first point where a contour touches the constraint region.
In 2D, the L1 constraint is a diamond with corners at (±c, 0) and (0, ±c). A corner at (c, 0) means θ₂ = 0 exactly—one feature is completely eliminated. In high dimensions, corners have many zero coordinates, and smooth elliptical loss contours almost always touch these corners first.
| Property | L1 (Diamond) | L2 (Sphere) |
|---|---|---|
| Shape in 2D | Square rotated 45° | Circle |
| Shape in nD | Hyperoctahedron (cross-polytope) | Hypersphere |
| Corners | 2n axis-aligned corners | No corners (smooth) |
| Typical solution | On a corner (sparse) | On surface (all nonzero) |
| Sparsity | Natural outcome | Rare (requires exact alignment) |
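To make the corner argument concrete, here is a small sketch (not from the original text; the quadratic loss and constraint radius $c = 1$ are made up for illustration) that brute-forces the constrained problem in 2D: the L1-constrained minimizer lands exactly on a corner of the diamond, while the L2-constrained minimizer keeps both coordinates nonzero.

```python
import numpy as np

# Hypothetical 2-D "data loss" with its unconstrained minimum at (2.0, 0.5);
# elongated elliptical contours make a corner solution typical for L1.
def data_loss(theta):
    return 4.0 * (theta[:, 0] - 2.0) ** 2 + (theta[:, 1] - 0.5) ** 2

c = 1.0                                    # constraint radius: ||theta|| <= c
t = np.linspace(0.0, 2.0 * np.pi, 100001)

# Points on the boundary of each constraint region.
scale = np.abs(np.cos(t)) + np.abs(np.sin(t))
l1_boundary = np.stack([c * np.cos(t) / scale, c * np.sin(t) / scale], axis=1)  # diamond
l2_boundary = np.stack([c * np.cos(t), c * np.sin(t)], axis=1)                  # circle

l1_best = l1_boundary[np.argmin(data_loss(l1_boundary))]
l2_best = l2_boundary[np.argmin(data_loss(l2_boundary))]

print("L1-constrained minimizer:", np.round(l1_best, 3))  # [1. 0.]  -> theta_2 is exactly zero
print("L2-constrained minimizer:", np.round(l2_best, 3))  # ~[0.995 0.1]  -> both nonzero
```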
Mathematical Explanation via Subgradients:
At a corner where $w_j = 0$, the L1 penalty's subgradient at that coordinate is the interval $[-\lambda, +\lambda]$. For the total gradient to be zero (optimality condition):
$$\frac{\partial \mathcal{L}_{\text{data}}}{\partial w_j} \in [-\lambda, +\lambda]$$
This means if the data gradient at $w_j = 0$ is small enough in magnitude (less than $\lambda$), the optimal solution keeps $w_j = 0$. With L2, the gradient would instead be proportional to $w_j$ itself—near zero, the penalty gradient vanishes, providing no "stickiness" at zero.
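A minimal numeric sketch of this condition in one dimension (all constants here are made up for illustration): when the data gradient at zero has magnitude at most $\lambda$, the minimizer of the regularized loss sits exactly at zero; when it exceeds $\lambda$, the minimizer moves off zero.

```python
import numpy as np

# 1-D example: L_data(w) = 0.5*a*w^2 + g0*w, so dL_data/dw at w = 0 is g0.
def total_loss(w, a, g0, lam):
    return 0.5 * a * w**2 + g0 * w + lam * np.abs(w)

w_grid = np.linspace(-2, 2, 400001)
lam, a = 1.0, 3.0

for g0 in (0.6, 2.5):   # |g0| <= lam  vs.  |g0| > lam
    w_star = w_grid[np.argmin(total_loss(w_grid, a, g0, lam))]
    print(f"data grad at 0 = {g0:>4}: argmin = {w_star:+.4f}")

# data grad at 0 =  0.6: argmin = +0.0000   (zero is optimal)
# data grad at 0 =  2.5: argmin = -0.5000   (solves a*w + g0 + lam*sign(w) = 0)
```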
The L1 penalty is non-differentiable at zero, which complicates standard gradient descent. We use subgradients—a generalization of gradients for convex (but non-smooth) functions.
The Sign Function as Subgradient:
The subgradient of $|w|$ is:
$$\partial |w| = \begin{cases} \{+1\} & \text{if } w > 0 \\ \{-1\} & \text{if } w < 0 \\ [-1, +1] & \text{if } w = 0 \end{cases}$$
In practice, we use $\text{sign}(w)$, which returns 0 at $w = 0$. This works but can cause oscillation around zero.
The Update Rule:
$$w_{t+1} = w_t - \eta \left( \nabla_w \mathcal{L}_{\text{data}} + \lambda \cdot \text{sign}(w_t) \right)$$
```python
import numpy as np


def sgd_with_l1_regularization(weights, gradients, lr, lambda_reg):
    """
    SGD update with L1 regularization using the subgradient.

    w_new = w - lr * (∇L_data + λ * sign(w))

    Note: This can cause oscillation around zero.
    """
    updated = []
    for w, g in zip(weights, gradients):
        # Subgradient of L1 penalty
        l1_subgrad = lambda_reg * np.sign(w)
        # Combined gradient
        total_grad = g + l1_subgrad
        # Update
        w_new = w - lr * total_grad
        updated.append(w_new)
    return updated


def proximal_l1_update(weights, gradients, lr, lambda_reg):
    """
    Proximal gradient descent for L1 (soft thresholding).

    This is the CORRECT way to handle L1 regularization.

    Step 1: Gradient step on data loss only
        w_temp = w - lr * ∇L_data
    Step 2: Proximal operator (soft thresholding)
        w_new = sign(w_temp) * max(|w_temp| - lr*λ, 0)

    This naturally produces exact zeros!
    """
    updated = []
    threshold = lr * lambda_reg
    for w, g in zip(weights, gradients):
        # Gradient step
        w_temp = w - lr * g
        # Soft thresholding (proximal operator for L1)
        w_new = np.sign(w_temp) * np.maximum(np.abs(w_temp) - threshold, 0)
        updated.append(w_new)
    return updated
```

Soft thresholding (proximal gradient descent) is the theoretically correct and practically superior method for L1 optimization. It produces exact zeros and avoids oscillation. The threshold λη acts as a "dead zone"—weights smaller than this in magnitude collapse to exactly zero.
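A quick way to see both behaviors is the 1-D sketch below (the quadratic data loss and all constants are hypothetical): the plain subgradient iterate keeps flipping sign around zero, while the proximal iterate reaches exactly zero within a few steps and stays there.

```python
import numpy as np

# Data loss is 0.5*(w - 0.05)^2, so its gradient is (w - 0.05); with lambda = 1.0
# the regularized optimum is exactly w = 0, since |data gradient at 0| = 0.05 < 1.0.
lr, lam = 0.1, 1.0
w_sub, w_prox = 0.3, 0.3

for step in range(8):
    # Plain subgradient step: w <- w - lr * (grad + lam * sign(w))
    w_sub = w_sub - lr * ((w_sub - 0.05) + lam * np.sign(w_sub))
    # Proximal step: gradient step on the data loss, then soft thresholding
    w_tmp = w_prox - lr * (w_prox - 0.05)
    w_prox = np.sign(w_tmp) * max(abs(w_tmp) - lr * lam, 0.0)
    print(f"step {step}:  subgradient w = {w_sub:+.4f}   proximal w = {w_prox:+.4f}")

# The subgradient iterate oscillates around zero (roughly between -0.04 and +0.08)
# and never lands exactly on it; the proximal iterate is exactly 0.0 from step 2 on.
```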
Understanding when to use L1 versus L2 requires appreciating their fundamentally different behaviors.
| Aspect | L1 Regularization | L2 Regularization |
|---|---|---|
| Penalty term | $\lambda \sum|w_i|$ | $\frac{\lambda}{2} \sum w_i^2$ |
| Gradient/subgradient | $\lambda \cdot \text{sign}(w)$ (constant magnitude) | $\lambda w$ (proportional to weight) |
| Effect on large weights | Constant pressure | Strong pressure (quadratic) |
| Effect on small weights | Constant pressure toward zero | Weak pressure (linear in w) |
| Produces exact zeros | Yes (sparsity) | Rarely (shrinks but doesn't zero) |
| Bayesian prior | Laplace (double exponential) | Gaussian |
| Constraint geometry | Diamond (corners) | Sphere (smooth) |
| Optimization | Non-smooth, needs proximal methods | Smooth, standard gradient descent |
| Feature selection | Implicit (zeros = unused features) | No (all features used) |
The Gradient Behavior Is Key:
L2 gradient $\lambda w$: Near zero, the gradient vanishes. A weight at $w = 0.001$ receives negligible gradient $0.001\lambda$—no force to push it to exactly zero.
L1 subgradient $\lambda \cdot \text{sign}(w)$: Even at $w = 0.001$, the subgradient magnitude is $\lambda$ (constant). Small weights receive the same "pressure" as large ones, driving them all the way to zero.
This constant pressure on all weight magnitudes is what produces sparsity—L1 doesn't "give up" on small weights the way L2 does.
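The sketch below (λ and the weight values are arbitrary) simply prints the two penalty gradients side by side for weights of decreasing magnitude, making the "constant pressure" point explicit.

```python
import numpy as np

lam = 0.01
w = np.array([1.0, 0.1, 0.01, 0.001])

l2_grad = lam * w              # shrinks proportionally: vanishes as w -> 0
l1_grad = lam * np.sign(w)     # constant-magnitude push toward zero

for wi, g2, g1 in zip(w, l2_grad, l1_grad):
    print(f"w = {wi:<6}  L2 penalty grad = {g2:<10.6f}  L1 penalty (sub)grad = {g1}")

# w = 1.0     L2 penalty grad = 0.010000    L1 penalty (sub)grad = 0.01
# ...
# w = 0.001   L2 penalty grad = 0.000010    L1 penalty (sub)grad = 0.01
```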
Use L1 when you suspect many features are irrelevant, when you want interpretability via feature selection, or when you need model compression. Use L2 when all features are likely relevant, when you want smooth optimization, or when you are using adaptive optimizers that handle weight decay properly.
Just as L2 corresponds to a Gaussian prior, L1 corresponds to a Laplace (double exponential) prior:
$$p(w_i) = \frac{\lambda}{2} \exp(-\lambda |w_i|)$$
The Laplace distribution has:
- a sharp peak (a spike of density) at zero, and
- heavier tails than the Gaussian distribution.
MAP Estimation:
The negative log-prior is: $$-\log p(\boldsymbol{\theta}) = \lambda \sum_i |\theta_i| + \text{const}$$
This is exactly the L1 penalty! Minimizing the regularized loss is MAP estimation under a Laplace prior.
The Laplace prior's spike at zero assigns higher probability density to weights exactly at zero compared to small nonzero weights. This prior belief—that most weights should be zero—is what drives the MAP solution toward sparsity.
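As a small sanity check (the λ value and the `laplace_logpdf` helper below are hypothetical), the sketch verifies numerically that the negative log-density of a Laplace prior equals the L1 penalty up to an additive constant, and that the density is highest exactly at zero.

```python
import numpy as np

lam = 4.0

def laplace_logpdf(w, lam):
    # log p(w) for the Laplace prior p(w) = (lam / 2) * exp(-lam * |w|)
    return np.log(lam / 2.0) - lam * np.abs(w)

w = np.array([-1.5, -0.1, 0.0, 0.1, 1.5])

# Negative log-prior minus the L1 penalty is a constant (-log(lam/2)), so MAP
# estimation under this prior is exactly L1-regularized loss minimization.
neg_log_prior = -laplace_logpdf(w, lam)
l1_penalty = lam * np.abs(w)
print(np.allclose(neg_log_prior - l1_penalty, -np.log(lam / 2.0)))   # True

# The density peaks sharply at w = 0, encoding the belief that weights are zero.
print(np.exp(laplace_logpdf(0.0, lam)), np.exp(laplace_logpdf(0.1, lam)))  # 2.0  ~1.34
```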
Unlike L2 (which has built-in weight_decay), L1 regularization typically requires manual implementation in PyTorch.
```python
import torch
import torch.nn as nn
import torch.optim as optim


def compute_l1_loss(model, lambda_l1):
    """
    Compute L1 regularization term for all weights.
    """
    l1_loss = 0.0
    for name, param in model.named_parameters():
        if 'weight' in name:  # Exclude biases
            l1_loss += torch.sum(torch.abs(param))
    return lambda_l1 * l1_loss


def train_with_l1_regularization(model, dataloader, criterion, optimizer, lambda_l1):
    """
    Training loop with L1 regularization added to loss.
    """
    model.train()
    for inputs, targets in dataloader:
        optimizer.zero_grad()
        outputs = model(inputs)
        data_loss = criterion(outputs, targets)
        # Add L1 penalty
        l1_penalty = compute_l1_loss(model, lambda_l1)
        total_loss = data_loss + l1_penalty
        total_loss.backward()
        optimizer.step()


# Custom optimizer with proximal L1 (soft thresholding)
class ProximalSGD(optim.Optimizer):
    """
    SGD with proximal L1 regularization (soft thresholding).

    Properly produces exact zeros for L1 regularization.
    """
    def __init__(self, params, lr=0.01, l1_lambda=0.01):
        defaults = dict(lr=lr, l1_lambda=l1_lambda)
        super().__init__(params, defaults)

    def step(self, closure=None):
        loss = None
        if closure is not None:
            loss = closure()

        for group in self.param_groups:
            lr = group['lr']
            l1_lambda = group['l1_lambda']
            threshold = lr * l1_lambda
            for p in group['params']:
                if p.grad is None:
                    continue
                # Gradient step
                p.data.add_(p.grad, alpha=-lr)
                # Soft thresholding (proximal operator)
                p.data = torch.sign(p.data) * torch.clamp(
                    torch.abs(p.data) - threshold, min=0
                )
        return loss


# Usage
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
)

optimizer = ProximalSGD(model.parameters(), lr=0.01, l1_lambda=1e-4)
```

While L1 regularization is foundational in classical machine learning (Lasso regression), its role in deep learning is more nuanced.
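A natural follow-up, whichever of the two approaches above is used, is to check how sparse the trained network actually became. The `weight_sparsity` helper below is a hypothetical utility (not part of PyTorch) that counts exactly-zero weight entries.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def weight_sparsity(model, tol=0.0):
    """Fraction of weight entries whose magnitude is <= tol (exactly zero by default)."""
    total, zeros = 0, 0
    for name, param in model.named_parameters():
        if 'weight' in name:          # same convention as above: skip biases
            total += param.numel()
            zeros += (param.abs() <= tol).sum().item()
    return zeros / max(total, 1)

# After training with ProximalSGD (or the penalty-in-the-loss approach), report the
# fraction of weights that are exactly zero. On a freshly initialized model this is
# ~0%; it grows as L1 drives weights into the dead zone.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
print(f"weight sparsity: {weight_sparsity(model):.2%}")
```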
You now understand L1 regularization's sparsity-inducing properties from geometric, optimization, and probabilistic perspectives. Next, we explore the subtle but critical distinction between weight decay and L2 regularization—a difference that matters significantly for adaptive optimizers.