While L2 regularization shrinks all weights toward zero proportionally, it rarely produces weights that are exactly zero. Every connection in the network remains active, even if diminished. In many contexts—particularly when interpretability, compression, or feature selection matter—we desire something stronger: we want the model to set irrelevant weights to precisely zero.
L1 regularization (also known as Lasso in linear regression) achieves this by penalizing the absolute value of weights rather than their squares. This seemingly subtle change produces a qualitatively different effect: L1 encourages sparse solutions where many weights are exactly zero while others remain at substantial magnitudes.
This sparsity-inducing property has profound implications for neural network compression, automated feature selection, interpretability, and understanding which connections truly matter for a task.
This page provides complete coverage of L1 regularization: mathematical formulation, the geometry that produces sparsity, subgradient optimization, comparison with L2, proximal gradient methods, practical implementation, and when to prefer L1 over L2 in deep learning.
L1 regularization augments the loss function with a penalty proportional to the L1 norm (sum of absolute values) of the weight vector:
$$\mathcal{L}_{\text{reg}}(\boldsymbol{\theta}) = \mathcal{L}_{\text{data}}(\boldsymbol{\theta}) + \lambda \|\boldsymbol{\theta}\|_1$$
where:
- $\mathcal{L}_{\text{data}}$ is the original (unregularized) data loss
- $\lambda > 0$ is the regularization strength
- $\|\boldsymbol{\theta}\|_1 = \sum_i |\theta_i|$ is the L1 norm of the parameter vector
Expanding for all weights across layers:
$$\mathcal{L}_{\text{reg}} = \mathcal{L}_{\text{data}} + \lambda \sum_{l=1}^{L} \sum_{i,j} |W_{ij}^{(l)}|$$
Key Difference from L2:
The L1 norm is not differentiable at zero—the function $f(w) = |w|$ has a kink at $w = 0$. This non-differentiability is precisely what enables sparsity: the gradient is undefined at zero, creating a "sticky" point.
```python
import numpy as np


def compute_l1_penalty(weights_list, lambda_reg):
    """
    Compute L1 regularization penalty.

    Args:
        weights_list: List of weight matrices [W1, W2, ..., WL]
        lambda_reg: Regularization strength (λ)

    Returns:
        L1 penalty: λ * Σ|W|
    """
    l1_penalty = 0.0
    for W in weights_list:
        l1_penalty += np.sum(np.abs(W))
    return lambda_reg * l1_penalty


def l1_subgradient(W, lambda_reg):
    """
    Compute subgradient of L1 penalty.

    For w ≠ 0: ∂|w|/∂w = sign(w)
    For w = 0: subgradient is any value in [-1, 1]

    We use sign(w), which gives 0 at w=0.
    """
    return lambda_reg * np.sign(W)


def l1_regularized_loss(data_loss, weights_list, lambda_reg):
    """
    Total loss with L1 regularization.
    """
    l1_term = compute_l1_penalty(weights_list, lambda_reg)
    return data_loss + l1_term
```

The sparsity-inducing property of L1 has a beautiful geometric explanation. Consider the constrained optimization view:
$$\min_{\boldsymbol{\theta}} \mathcal{L}_{\text{data}}(\boldsymbol{\theta}) \quad \text{subject to} \quad \|\boldsymbol{\theta}\|_1 \leq c$$
The Key Geometric Difference:
The L1 constraint region $\|\boldsymbol{\theta}\|_1 \leq c$ is a diamond (cross-polytope) with sharp corners on the coordinate axes, while the L2 constraint region $\|\boldsymbol{\theta}\|_2 \leq c$ is a smooth sphere with no corners.
Why Corners Matter:
Imagine the loss function $\mathcal{L}_{\text{data}}$ as elliptical contours expanding from a minimum. As we inflate these contours, we seek the first point where a contour touches the constraint region.
In 2D, the L1 constraint is a diamond with corners at (±c, 0) and (0, ±c). A corner at (c, 0) means θ₂ = 0 exactly—one feature is completely eliminated. In high dimensions, corners have many zero coordinates, and smooth elliptical loss contours almost always touch these corners first.
| Property | L1 (Diamond) | L2 (Sphere) |
|---|---|---|
| Shape in 2D | Square rotated 45° | Circle |
| Shape in nD | Hyperoctahedron (cross-polytope) | Hypersphere |
| Corners | 2n axis-aligned corners | No corners (smooth) |
| Typical solution | On a corner (sparse) | On surface (all nonzero) |
| Sparsity | Natural outcome | Rare (requires exact alignment) |
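To make the corner argument concrete, here is a small sketch (not from the original text; the quadratic loss and constraint radius $c = 1$ are made up for illustration) that brute-forces the constrained problem in 2D: the L1-constrained minimizer lands exactly on a corner of the diamond, while the L2-constrained minimizer keeps both coordinates nonzero.

```python
import numpy as np

# Hypothetical 2-D "data loss" with its unconstrained minimum at (2.0, 0.5);
# elongated elliptical contours make a corner solution typical for L1.
def data_loss(theta):
    return 4.0 * (theta[:, 0] - 2.0) ** 2 + (theta[:, 1] - 0.5) ** 2

c = 1.0                                    # constraint radius: ||theta|| <= c
t = np.linspace(0.0, 2.0 * np.pi, 100001)

# Points on the boundary of each constraint region.
scale = np.abs(np.cos(t)) + np.abs(np.sin(t))
l1_boundary = np.stack([c * np.cos(t) / scale, c * np.sin(t) / scale], axis=1)  # diamond
l2_boundary = np.stack([c * np.cos(t), c * np.sin(t)], axis=1)                  # circle

l1_best = l1_boundary[np.argmin(data_loss(l1_boundary))]
l2_best = l2_boundary[np.argmin(data_loss(l2_boundary))]

print("L1-constrained minimizer:", np.round(l1_best, 3))  # [1. 0.]  -> theta_2 is exactly zero
print("L2-constrained minimizer:", np.round(l2_best, 3))  # ~[0.995 0.1]  -> both nonzero
```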
Mathematical Explanation via Subgradients:
At a corner where $w_j = 0$, the L1 penalty's subgradient at that coordinate is the interval $[-\lambda, +\lambda]$. For the total gradient to be zero (optimality condition):
$$\frac{\partial \mathcal{L}_{\text{data}}}{\partial w_j} \in [-\lambda, +\lambda]$$
This means if the data gradient at $w_j = 0$ is small enough in magnitude (less than $\lambda$), the optimal solution keeps $w_j = 0$. With L2, the gradient would instead be proportional to $w_j$ itself—near zero, the penalty gradient vanishes, providing no "stickiness" at zero.
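A minimal numeric sketch of this condition in one dimension (all constants here are made up for illustration): when the data gradient at zero has magnitude at most $\lambda$, the minimizer of the regularized loss sits exactly at zero; when it exceeds $\lambda$, the minimizer moves off zero.

```python
import numpy as np

# 1-D example: L_data(w) = 0.5*a*w^2 + g0*w, so dL_data/dw at w = 0 is g0.
def total_loss(w, a, g0, lam):
    return 0.5 * a * w**2 + g0 * w + lam * np.abs(w)

w_grid = np.linspace(-2, 2, 400001)
lam, a = 1.0, 3.0

for g0 in (0.6, 2.5):   # |g0| <= lam  vs.  |g0| > lam
    w_star = w_grid[np.argmin(total_loss(w_grid, a, g0, lam))]
    print(f"data grad at 0 = {g0:>4}: argmin = {w_star:+.4f}")

# data grad at 0 =  0.6: argmin = +0.0000   (zero is optimal)
# data grad at 0 =  2.5: argmin = -0.5000   (solves a*w + g0 + lam*sign(w) = 0)
```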
The L1 penalty is non-differentiable at zero, which complicates standard gradient descent. We use subgradients—a generalization of gradients for convex (but non-smooth) functions.
The Sign Function as Subgradient:
The subgradient of $|w|$ is:
$$\partial |w| = \begin{cases} \{+1\} & \text{if } w > 0 \\ \{-1\} & \text{if } w < 0 \\ [-1, +1] & \text{if } w = 0 \end{cases}$$
In practice, we use $\text{sign}(w)$, which returns 0 at $w = 0$. This works but can cause oscillation around zero.
The Update Rule:
$$w_{t+1} = w_t - \eta \left( \nabla_w \mathcal{L}_{\text{data}} + \lambda \cdot \text{sign}(w_t) \right)$$
```python
import numpy as np


def sgd_with_l1_regularization(weights, gradients, lr, lambda_reg):
    """
    SGD update with L1 regularization using the subgradient.

    w_new = w - lr * (∇L_data + λ * sign(w))

    Note: This can cause oscillation around zero.
    """
    updated = []
    for w, g in zip(weights, gradients):
        # Subgradient of L1 penalty
        l1_subgrad = lambda_reg * np.sign(w)
        # Combined gradient
        total_grad = g + l1_subgrad
        # Update
        w_new = w - lr * total_grad
        updated.append(w_new)
    return updated


def proximal_l1_update(weights, gradients, lr, lambda_reg):
    """
    Proximal gradient descent for L1 (soft thresholding).

    This is the CORRECT way to handle L1 regularization.

    Step 1: Gradient step on data loss only
        w_temp = w - lr * ∇L_data
    Step 2: Proximal operator (soft thresholding)
        w_new = sign(w_temp) * max(|w_temp| - lr*λ, 0)

    This naturally produces exact zeros!
    """
    updated = []
    threshold = lr * lambda_reg
    for w, g in zip(weights, gradients):
        # Gradient step
        w_temp = w - lr * g
        # Soft thresholding (proximal operator for L1)
        w_new = np.sign(w_temp) * np.maximum(np.abs(w_temp) - threshold, 0)
        updated.append(w_new)
    return updated
```

Soft thresholding (proximal gradient descent) is the theoretically correct and practically superior method for L1 optimization. It produces exact zeros and avoids oscillation. The threshold λη acts as a "dead zone"—weights smaller than this in magnitude collapse to exactly zero.
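A quick way to see both behaviors is the 1-D sketch below (the quadratic data loss and all constants are hypothetical): the plain subgradient iterate keeps flipping sign around zero, while the proximal iterate reaches exactly zero within a few steps and stays there.

```python
import numpy as np

# Data loss is 0.5*(w - 0.05)^2, so its gradient is (w - 0.05); with lambda = 1.0
# the regularized optimum is exactly w = 0, since |data gradient at 0| = 0.05 < 1.0.
lr, lam = 0.1, 1.0
w_sub, w_prox = 0.3, 0.3

for step in range(8):
    # Plain subgradient step: w <- w - lr * (grad + lam * sign(w))
    w_sub = w_sub - lr * ((w_sub - 0.05) + lam * np.sign(w_sub))
    # Proximal step: gradient step on the data loss, then soft thresholding
    w_tmp = w_prox - lr * (w_prox - 0.05)
    w_prox = np.sign(w_tmp) * max(abs(w_tmp) - lr * lam, 0.0)
    print(f"step {step}:  subgradient w = {w_sub:+.4f}   proximal w = {w_prox:+.4f}")

# The subgradient iterate oscillates around zero (roughly between -0.04 and +0.08)
# and never lands exactly on it; the proximal iterate is exactly 0.0 from step 2 on.
```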
Understanding when to use L1 versus L2 requires appreciating their fundamentally different behaviors.
| Aspect | L1 Regularization | L2 Regularization |
|---|---|---|
| Penalty term | $\lambda \sum|w_i|$ | $\frac{\lambda}{2} \sum w_i^2$ |
| Gradient/subgradient | $\lambda \cdot \text{sign}(w)$ (constant magnitude) | $\lambda w$ (proportional to weight) |
| Effect on large weights | Constant pressure | Strong pressure (quadratic) |
| Effect on small weights | Constant pressure toward zero | Weak pressure (linear in w) |
| Produces exact zeros | Yes (sparsity) | Rarely (shrinks but doesn't zero) |
| Bayesian prior | Laplace (double exponential) | Gaussian |
| Constraint geometry | Diamond (corners) | Sphere (smooth) |
| Optimization | Non-smooth, needs proximal methods | Smooth, standard gradient descent |
| Feature selection | Implicit (zeros = unused features) | No (all features used) |
The Gradient Behavior Is Key:
L2 gradient $\lambda w$: Near zero, the gradient vanishes. A weight at $w = 0.001$ receives negligible gradient $0.001\lambda$—no force to push it to exactly zero.
L1 subgradient $\lambda \cdot \text{sign}(w)$: Even at $w = 0.001$, the subgradient magnitude is $\lambda$ (constant). Small weights receive the same "pressure" as large ones, driving them all the way to zero.
This constant pressure on all weight magnitudes is what produces sparsity—L1 doesn't "give up" on small weights the way L2 does.
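The sketch below (λ and the weight values are arbitrary) simply prints the two penalty gradients side by side for weights of decreasing magnitude, making the "constant pressure" point explicit.

```python
import numpy as np

lam = 0.01
w = np.array([1.0, 0.1, 0.01, 0.001])

l2_grad = lam * w              # shrinks proportionally: vanishes as w -> 0
l1_grad = lam * np.sign(w)     # constant-magnitude push toward zero

for wi, g2, g1 in zip(w, l2_grad, l1_grad):
    print(f"w = {wi:<6}  L2 penalty grad = {g2:<10.6f}  L1 penalty (sub)grad = {g1}")

# w = 1.0     L2 penalty grad = 0.010000    L1 penalty (sub)grad = 0.01
# ...
# w = 0.001   L2 penalty grad = 0.000010    L1 penalty (sub)grad = 0.01
```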
Use L1 when you suspect many features are irrelevant, when you want interpretability via feature selection, or when you need model compression. Use L2 when all features are likely relevant, when you want smooth optimization, or when you are using adaptive optimizers that handle weight decay properly.
Just as L2 corresponds to a Gaussian prior, L1 corresponds to a Laplace (double exponential) prior:
$$p(w_i) = \frac{\lambda}{2} \exp(-\lambda |w_i|)$$
The Laplace distribution has:
- a sharp peak (a spike of density) at zero, and
- heavier tails than the Gaussian distribution.
MAP Estimation:
The negative log-prior is: $$-\log p(\boldsymbol{\theta}) = \lambda \sum_i |\theta_i| + \text{const}$$
This is exactly the L1 penalty! Minimizing the regularized loss is MAP estimation under a Laplace prior.
The Laplace prior's spike at zero assigns higher probability density to weights exactly at zero compared to small nonzero weights. This prior belief—that most weights should be zero—is what drives the MAP solution toward sparsity.
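As a small sanity check (the λ value and the `laplace_logpdf` helper below are hypothetical), the sketch verifies numerically that the negative log-density of a Laplace prior equals the L1 penalty up to an additive constant, and that the density is highest exactly at zero.

```python
import numpy as np

lam = 4.0

def laplace_logpdf(w, lam):
    # log p(w) for the Laplace prior p(w) = (lam / 2) * exp(-lam * |w|)
    return np.log(lam / 2.0) - lam * np.abs(w)

w = np.array([-1.5, -0.1, 0.0, 0.1, 1.5])

# Negative log-prior minus the L1 penalty is a constant (-log(lam/2)), so MAP
# estimation under this prior is exactly L1-regularized loss minimization.
neg_log_prior = -laplace_logpdf(w, lam)
l1_penalty = lam * np.abs(w)
print(np.allclose(neg_log_prior - l1_penalty, -np.log(lam / 2.0)))   # True

# The density peaks sharply at w = 0, encoding the belief that weights are zero.
print(np.exp(laplace_logpdf(0.0, lam)), np.exp(laplace_logpdf(0.1, lam)))  # 2.0  ~1.34
```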
Unlike L2 (which has built-in weight_decay), L1 regularization typically requires manual implementation in PyTorch.
```python
import torch
import torch.nn as nn
import torch.optim as optim


def compute_l1_loss(model, lambda_l1):
    """
    Compute L1 regularization term for all weights.
    """
    l1_loss = 0.0
    for name, param in model.named_parameters():
        if 'weight' in name:  # Exclude biases
            l1_loss += torch.sum(torch.abs(param))
    return lambda_l1 * l1_loss


def train_with_l1_regularization(model, dataloader, criterion, optimizer, lambda_l1):
    """
    Training loop with L1 regularization added to loss.
    """
    model.train()
    for inputs, targets in dataloader:
        optimizer.zero_grad()
        outputs = model(inputs)
        data_loss = criterion(outputs, targets)
        # Add L1 penalty
        l1_penalty = compute_l1_loss(model, lambda_l1)
        total_loss = data_loss + l1_penalty
        total_loss.backward()
        optimizer.step()


# Custom optimizer with proximal L1 (soft thresholding)
class ProximalSGD(optim.Optimizer):
    """
    SGD with proximal L1 regularization (soft thresholding).

    Properly produces exact zeros for L1 regularization.
    """
    def __init__(self, params, lr=0.01, l1_lambda=0.01):
        defaults = dict(lr=lr, l1_lambda=l1_lambda)
        super().__init__(params, defaults)

    def step(self, closure=None):
        loss = None
        if closure is not None:
            loss = closure()

        for group in self.param_groups:
            lr = group['lr']
            l1_lambda = group['l1_lambda']
            threshold = lr * l1_lambda
            for p in group['params']:
                if p.grad is None:
                    continue
                # Gradient step
                p.data.add_(p.grad, alpha=-lr)
                # Soft thresholding (proximal operator)
                p.data = torch.sign(p.data) * torch.clamp(
                    torch.abs(p.data) - threshold, min=0
                )
        return loss


# Usage
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
)

optimizer = ProximalSGD(model.parameters(), lr=0.01, l1_lambda=1e-4)
```

While L1 regularization is foundational in classical machine learning (Lasso regression), its role in deep learning is more nuanced.
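A natural follow-up, whichever of the two approaches above is used, is to check how sparse the trained network actually became. The `weight_sparsity` helper below is a hypothetical utility (not part of PyTorch) that counts exactly-zero weight entries.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def weight_sparsity(model, tol=0.0):
    """Fraction of weight entries whose magnitude is <= tol (exactly zero by default)."""
    total, zeros = 0, 0
    for name, param in model.named_parameters():
        if 'weight' in name:          # same convention as above: skip biases
            total += param.numel()
            zeros += (param.abs() <= tol).sum().item()
    return zeros / max(total, 1)

# After training with ProximalSGD (or the penalty-in-the-loss approach), report the
# fraction of weights that are exactly zero. On a freshly initialized model this is
# ~0%; it grows as L1 drives weights into the dead zone.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
print(f"weight sparsity: {weight_sparsity(model):.2%}")
```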
You now understand L1 regularization's sparsity-inducing properties from geometric, optimization, and probabilistic perspectives. Next, we explore the subtle but critical distinction between weight decay and L2 regularization—a difference that matters significantly for adaptive optimizers.