In the previous page, we derived the Lasso objective and examined the mathematical properties of the L1 norm. We saw that unlike the L2 penalty, the L1 penalty has a constant derivative (in magnitude) as coefficients approach zero. But this mathematical property alone doesn't fully explain why optimization settles at exactly zero.
This page answers the central question: Why does L1 regularization produce sparse solutions?
Sparsity—having many coefficients equal exactly zero—is not merely a mathematical curiosity. It represents automatic feature selection, where the model identifies which variables truly matter and discards the rest. This has profound implications for interpretability, prediction, and computation.
By the end of this page, you will understand multiple complementary perspectives on why L1 produces sparsity: the geometric argument (corner contact), the subdifferential argument (interval at zero), the Bayesian argument (Laplace prior concentration), and the path perspective (sequential variable entry).
The most intuitive explanation for L1 sparsity comes from constrained optimization geometry. Let's develop this carefully.
Setup:
Consider the 2D case with coefficients $(\beta_1, \beta_2)$. The Lasso solves:
$$\min_{\beta_1, \beta_2} |\mathbf{y} - \beta_1\mathbf{x}_1 - \beta_2\mathbf{x}_2|_2^2 \quad \text{subject to} \quad |\beta_1| + |\beta_2| \leq t$$
The Constraint Regions:

The L1 constraint $|\beta_1| + |\beta_2| \leq t$ defines a diamond (a square rotated 45°) whose corners lie on the coordinate axes, where one coefficient is exactly zero. By contrast, the L2 ball $\beta_1^2 + \beta_2^2 \leq t^2$ is a circle with no corners at all.
The Objective Function's Level Sets:
The residual sum of squares $|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2$ is a quadratic function of $\boldsymbol{\beta}$. Its level sets (curves of constant RSS) are ellipses centered at the OLS solution $\hat{\boldsymbol{\beta}}^{\text{OLS}}$.
Finding the Optimum:
We seek the point in the constraint region that achieves minimum RSS. Geometrically, we expand ellipses outward from the OLS solution until they first touch the constraint boundary.
Probability of Corner Contact:
Why are corners so likely to be optimal? Consider the geometric degrees of freedom: a flat edge of the diamond touches the expanding ellipse only when the ellipse's gradient matches that edge's single normal direction, whereas a corner is optimal for an entire cone of gradient directions.
Corners are "stickier"—the optimization must work harder to escape them. Unless the OLS solution aligns precisely with non-corner directions, the Lasso solution lands on a corner.
Formal Statement:
For generic data (OLS solution in general position), as we vary $t$ from 0 to $\infty$, the solution sits exactly on a corner of the constraint set (at least one coefficient equal to zero) for intervals of $t$ of positive length; contact at a non-corner point requires special alignment of the OLS solution and occurs only on a measure-zero set of configurations.
Imagine rolling a ball (the expanding loss ellipse) along a surface. On a smooth sphere (L2), it can touch anywhere. On a diamond (L1), it catches on corners and edges. The pointy geometry of L1 creates "attractors" at sparse solutions.
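The corner-contact picture can be checked numerically. The sketch below (assuming scikit-learn is available; the data, correlation level, and penalty are illustrative choices) fits a two-feature Lasso where only the first feature carries signal, and the solution lands exactly on a corner of the diamond: the second coefficient is exactly 0.0.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Two correlated features; only the first truly matters
rng = np.random.default_rng(0)
n = 200
x1 = rng.standard_normal(n)
x2 = 0.5 * x1 + 0.87 * rng.standard_normal(n)  # corr(x1, x2) ≈ 0.5
X = np.column_stack([x1, x2])
y = 2.0 * x1 + 0.1 * rng.standard_normal(n)

# With a moderate penalty, the loss ellipse first touches the diamond at a corner
model = Lasso(alpha=0.5).fit(X, y)
print("coefficients:", model.coef_)  # second entry is exactly 0.0
```

The zero here is a genuine zero produced by the optimizer, not a rounded-down small number.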
The geometric intuition from corners can be made rigorous through subdifferential calculus. This provides the mathematical machinery to understand why zero is a stable solution.
The Subdifferential of |β|:
Recall that for $f(\beta) = |\beta|$:
$$\partial f(\beta) = \begin{cases} \{+1\} & \beta > 0 \\ \{-1\} & \beta < 0 \\ [-1, +1] & \beta = 0 \end{cases}$$
Optimality at Zero:
For the Lasso objective $L(\beta) = \frac{1}{2}(y - x\beta)^2 + \lambda|\beta|$ in one dimension:
The condition for $\beta = 0$ to be optimal is:
$$0 \in \partial L(0) = \{-xy\} + \lambda \cdot [-1, 1] = [-xy - \lambda, -xy + \lambda]$$
This holds when $|xy| \leq \lambda$, i.e., when the correlation between feature and response is below the threshold.
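In one dimension this condition yields a closed-form solution via soft-thresholding. A minimal sketch (the helper `lasso_1d` and its inputs are illustrative, not from a library):

```python
import numpy as np

def lasso_1d(x, y, lam):
    """Minimizer of 0.5*(y - x*b)**2 + lam*abs(b) via soft-thresholding."""
    return np.sign(x * y) * max(abs(x * y) - lam, 0.0) / x**2

# When |xy| <= lam, zero satisfies 0 in [-xy - lam, -xy + lam]: exact zero
print(lasso_1d(x=1.0, y=0.3, lam=0.5))  # 0.0
# When |xy| > lam, the coefficient escapes zero but is shrunk by lam
print(lasso_1d(x=1.0, y=2.0, lam=0.5))  # 1.5
```

Note the hard switch: the output is exactly zero for an entire interval of inputs, not just at a single point.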
Why This Creates Exact Zeros:
The interval subdifferential at zero provides "slack" in the optimality conditions. Compare:
| Penalty | f(β) | ∂f(0) | Condition for β=0 Optimal |
|---|---|---|---|
| L2 | $\beta^2$ | $\{0\}$ (single point) | Gradient of data term = 0 exactly |
| L1 | $|\beta|$ | $[-1, 1]$ (interval) | Gradient of data term ∈ $[-\lambda, \lambda]$ |
The Critical Insight:
For L2 regularization, the subdifferential at zero is just $\{0\}$. To have $\beta = 0$ optimal, the data term's gradient must be exactly zero—a measure-zero event that essentially never happens.
For L1 regularization, the subdifferential at zero is the interval $[-1, 1]$. To have $\beta = 0$ optimal, the data term's gradient just needs to be bounded by $\lambda$—a condition with positive probability.
Mathematical Formulation (Multivariate):
For the full Lasso problem, the KKT conditions at an optimal $\hat{\boldsymbol{\beta}}$ are:
$$-\frac{1}{n}\mathbf{X}^T(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) + \lambda \hat{\mathbf{s}} = \mathbf{0}$$
where $\hat{s}_j \in \partial|\hat{\beta}_j|$.
For components where $\hat{\beta}_j = 0$:
$$\left|\frac{1}{n}\mathbf{x}_j^T\mathbf{r}\right| \leq \lambda$$
where $\mathbf{r} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}$ is the residual. Features with correlations below $\lambda$ are perfectly happy at zero.
Think of $\lambda$ as defining a 'highway' of width $2\lambda$ centered at zero. Any feature whose correlation with the residual stays within this highway remains at zero. Only features with stronger correlations 'escape' and take non-zero values. Larger $\lambda$ means wider highway, more zeros.
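This "highway" condition can be verified on a fitted model. The sketch below (assuming scikit-learn, whose Lasso minimizes $\frac{1}{2n}|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2 + \alpha|\boldsymbol{\beta}|_1$; the data are synthetic and illustrative) checks that inactive features satisfy $|\mathbf{x}_j^T\mathbf{r}|/n \leq \alpha$ while active features sit exactly at the boundary:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 200, 10
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]   # three true signals, seven noise features
y = X @ beta + 0.3 * rng.standard_normal(n)

alpha = 0.2
model = Lasso(alpha=alpha, tol=1e-10).fit(X, y)
r = y - model.predict(X)       # residual (mean zero, since an intercept is fitted)
corr = X.T @ r / n             # per-feature correlation with the residual
zero = np.abs(model.coef_) < 1e-12

# Inactive features stay inside the "highway" of width 2*alpha
print(np.all(np.abs(corr[zero]) <= alpha + 1e-6))
# Active features sit exactly at the edge of the highway
print(np.allclose(np.abs(corr[~zero]), alpha, atol=1e-3))
```

Both checks print `True`: the KKT conditions partition features cleanly into those pinned at zero and those at the boundary.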
The Lasso has an elegant Bayesian interpretation: it corresponds to Maximum A Posteriori (MAP) estimation under a Laplace prior on the coefficients.
The Laplace Distribution:
The Laplace (double-exponential) distribution has density:
$$p(\beta | b) = \frac{1}{2b} \exp\left(-\frac{|\beta|}{b}\right)$$
with scale parameter $b > 0$. It's symmetric around zero with heavier tails than the Gaussian and a sharp peak at zero—this peak is key to sparsity.
Prior Distribution on Coefficients:
Assume i.i.d. Laplace priors on each coefficient:
$$p(\boldsymbol{\beta}) = \prod_{j=1}^p \frac{1}{2b} \exp\left(-\frac{|\beta_j|}{b}\right) = \frac{1}{(2b)^p} \exp\left(-\frac{|\boldsymbol{\beta}|_1}{b}\right)$$
MAP Estimation:
With Gaussian likelihood $\mathbf{y} | \mathbf{X}, \boldsymbol{\beta} \sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I})$, the posterior is:
$$p(\boldsymbol{\beta} | \mathbf{y}, \mathbf{X}) \propto \exp\left(-\frac{1}{2\sigma^2}|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2 - \frac{|\boldsymbol{\beta}|_1}{b}\right)$$
Taking the negative log and minimizing (MAP):
$$\hat{\boldsymbol{\beta}}^{\text{MAP}} = \underset{\boldsymbol{\beta}}{\arg\min}\left\{\frac{1}{2\sigma^2}|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2 + \frac{1}{b}|\boldsymbol{\beta}|_1\right\}$$
This is exactly the Lasso with $\lambda = \sigma^2/b$.
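The equivalence can be checked on a grid in the one-coefficient case: the negative log posterior is $1/\sigma^2$ times the Lasso objective with $\lambda = \sigma^2/b$, so the two have the same minimizer. (A sketch; the values of $\sigma^2$ and $b$ below are illustrative.)

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 1))
y = 1.5 * X[:, 0] + 0.5 * rng.standard_normal(50)

sigma2, b = 0.25, 0.5    # noise variance and Laplace scale (illustrative)
lam = sigma2 / b         # implied Lasso penalty

betas = np.linspace(-3, 3, 2001)
rss = np.array([np.sum((y - X[:, 0] * beta) ** 2) for beta in betas])
neg_log_post = rss / (2 * sigma2) + np.abs(betas) / b  # MAP objective
lasso_obj = rss / 2 + lam * np.abs(betas)              # Lasso objective

# The objectives are proportional (factor 1/sigma2), so the argmin is identical
i, j = np.argmin(neg_log_post), np.argmin(lasso_obj)
print(betas[i] == betas[j])  # True
```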
Why the Laplace Prior Induces Sparsity:
The Laplace density has a cusp at zero—a discontinuous derivative. This creates a probability "spike" at exactly zero.
Intuitively, the prior strongly believes most coefficients should be near zero, with occasional large values (heavy tails). This matches our scientific intuition: in high-dimensional problems, most features are probably irrelevant, but a few might have substantial effects.
Contrast with the Gaussian:
The Gaussian prior has a smooth dome at zero. There's no particular attraction to exactly zero—values near zero receive essentially the same prior density as zero itself. This smoothness is why L2 regularization shrinks coefficients toward zero but cannot set them exactly to zero.
Hyperprior Perspective:
The Laplace prior can be written as a scale mixture of normals:
$$\beta_j | \tau_j^2 \sim \mathcal{N}(0, \tau_j^2), \quad \tau_j^2 \sim \text{Exponential}(\lambda^2/2)$$
This representation connects Lasso to hierarchical Bayesian models and enables extensions like the Bayesian Lasso, which provides full posterior inference rather than just point estimates.
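The scale-mixture representation can be verified by simulation: drawing $\tau_j^2$ from the exponential and then $\beta_j$ given $\tau_j^2$ should reproduce a Laplace marginal with rate $\lambda$, whose mean absolute value is $1/\lambda$ and variance $2/\lambda^2$. (A Monte Carlo sketch; the sample size and $\lambda$ are illustrative.)

```python
import numpy as np

rng = np.random.default_rng(3)
lam = 2.0
n = 500_000

# tau^2 ~ Exponential(rate = lam^2 / 2), i.e. scale = 2 / lam^2
tau2 = rng.exponential(scale=2 / lam**2, size=n)
# beta | tau^2 ~ N(0, tau^2)
beta = rng.normal(0.0, np.sqrt(tau2))

# Marginal should be Laplace(rate lam): E|beta| = 1/lam, Var(beta) = 2/lam^2
print(np.mean(np.abs(beta)))  # ≈ 0.5
print(np.var(beta))           # ≈ 0.5
```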
The correspondence between priors and regularizers is deep: Gaussian↔L2, Laplace↔L1, Horseshoe↔adaptive sparsity. Choosing a regularizer is equivalent to choosing a prior belief about coefficient structure. The Laplace prior encodes: 'Most coefficients are zero or tiny; a few might be large.'
Given that L1 produces sparsity, a natural question arises: How sparse will the solution be? The answer depends on $\lambda$, the data distribution, and the true underlying signal.
The Sparsity-λ Relationship:
As $\lambda$ increases, more coefficients become zero. We can characterize this relationship:
For $\lambda \geq \lambda_{\max}$:
$$\lambda_{\max} = \frac{1}{n}|\mathbf{X}^T\mathbf{y}|_\infty = \max_j \left|\frac{1}{n}\mathbf{x}_j^T\mathbf{y}\right|$$
All coefficients are zero. $\lambda_{\max}$ is the smallest penalty value at which the solution is entirely zero (all $p$ features removed).
As $\lambda \downarrow 0$:
More coefficients become non-zero. As $\lambda \to 0$, the Lasso solution approaches the OLS solution (if $n > p$) or a minimum-norm solution (if $n < p$).
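The $\lambda_{\max}$ formula can be verified directly (a sketch assuming scikit-learn, which centers the data when fitting an intercept, hence the centering below; the data are illustrative): a penalty just above $\lambda_{\max}$ zeroes every coefficient, while a smaller one lets features enter.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n, p = 100, 20
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] + rng.standard_normal(n)

# Center, matching sklearn's handling of the intercept
Xc, yc = X - X.mean(axis=0), y - y.mean()
lam_max = np.max(np.abs(Xc.T @ yc)) / n

all_zero = Lasso(alpha=lam_max * 1.001).fit(X, y).coef_   # just above lambda_max
some_nonzero = Lasso(alpha=lam_max * 0.5).fit(X, y).coef_ # well below it
print(np.count_nonzero(all_zero), np.count_nonzero(some_nonzero))
```

The first count is 0; the second is at least 1, with the most correlated feature entering first.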
The Active Set:
Define the active set as $\mathcal{A}(\lambda) = \{j : \hat{\beta}_j(\lambda) \neq 0\}$. Its cardinality $|\mathcal{A}(\lambda)|$ is the degrees of freedom of the model.
As $\lambda$ decreases, $|\mathcal{A}(\lambda)|$ generally increases, though coefficients can also exit the active set (become zero again) in some cases.
```python
import numpy as np
from sklearn.linear_model import lasso_path

# Generate synthetic data with true sparsity
np.random.seed(42)
n, p = 100, 50
true_support = [0, 5, 10, 15, 20]  # Only 5 features matter
X = np.random.randn(n, p)
beta_true = np.zeros(p)
beta_true[true_support] = np.array([3, -2, 1.5, -1, 2.5])
y = X @ beta_true + 0.5 * np.random.randn(n)

# Compute lasso path
alphas, coefs, _ = lasso_path(X, y, n_alphas=100)

# Analyze sparsity across lambda values
sparsity = np.sum(np.abs(coefs) > 1e-10, axis=0)  # Number of non-zeros

print("Sparsity Analysis:")
print("-" * 50)
print(f"λ_max (all zeros): {alphas[0]:.4f}")
print(f"λ_min (least sparse): {alphas[-1]:.4f}")
print(f"Non-zeros at λ_max: {sparsity[0]}")
print(f"Non-zeros at λ_min: {sparsity[-1]}")
print(f"True non-zeros: {len(true_support)}")

# Find lambda giving correct sparsity
optimal_idx = np.argmin(np.abs(sparsity - len(true_support)))
print(f"λ for {len(true_support)} non-zeros: {alphas[optimal_idx]:.4f}")
print(f"Recovered support: {np.where(np.abs(coefs[:, optimal_idx]) > 1e-10)[0]}")
print(f"True support: {true_support}")
```

Factors Affecting Sparsity:
Signal Strength: Strong true signals (large $|\beta_j^*|$) remain non-zero even at large $\lambda$. Weak signals are eliminated first.
Feature Correlations: Highly correlated features compete for selection. Lasso tends to select one and zero out others (instability).
Sample Size: More data provides better estimation of which features matter, stabilizing the active set.
Noise Level: Higher noise requires larger $\lambda$ for good prediction, which increases sparsity.
Degrees of Freedom of Lasso:
An important theoretical result: for the Lasso, the expected degrees of freedom equals the expected number of non-zero coefficients:
$$\mathbb{E}[\text{df}_{\text{lasso}}(\lambda)] = \mathbb{E}[|\mathcal{A}(\lambda)|]$$
This elegant result (proved via Stein's Lemma) enables unbiased estimation of prediction risk and informs model selection criteria like AIC and Cp.
More sparsity isn't always better. Overly large λ produces a model with few features but high bias (it misses true signals). Overly small λ includes noise features. The art is finding the λ that captures true signal structure—typically via cross-validation.
A fundamental question in high-dimensional statistics: Under what conditions does Lasso correctly identify the truly non-zero coefficients?
This is the problem of support recovery or variable selection consistency. The answer involves conditions on the design matrix $\mathbf{X}$ that govern whether Lasso can distinguish signal from noise.
The Irrepresentable Condition:
Let $S = \{j : \beta_j^* \neq 0\}$ be the true support (indices of non-zero coefficients) and $S^c$ its complement (zero coefficients).
Partition the Gram matrix $\mathbf{C} = \frac{1}{n}\mathbf{X}^T\mathbf{X}$:
$$\mathbf{C} = \begin{pmatrix} \mathbf{C}_{SS} & \mathbf{C}_{SS^c} \\ \mathbf{C}_{S^cS} & \mathbf{C}_{S^cS^c} \end{pmatrix}$$
The Irrepresentable Condition states:
$$\left|\mathbf{C}_{S^cS}\mathbf{C}_{SS}^{-1}\,\text{sign}(\boldsymbol{\beta}_S^*)\right|_\infty < 1$$
Interpretation:
This condition limits how much the irrelevant features ($S^c$) can be represented by the relevant features ($S$). If irrelevant features correlate too strongly with relevant ones weighted by the true coefficient signs, Lasso may include spurious variables.
When the Condition Fails:
If the irrepresentable condition fails, there exists a noise feature $j \in S^c$ that is "too well represented" by signal features. This noise feature can enter the Lasso model even in the large-sample limit.
Example of Failure:
Consider a noise feature that is correlated at level $\rho$ with each of two mutually uncorrelated signal features. With $\mathbf{C}_{SS} = \mathbf{I}$ and $\mathbf{C}_{S^cS} = (\rho, \rho)$, the irrepresentable quantity is $|\rho(s_1 + s_2)|$, where $s_1, s_2$ are the signs of the true coefficients.
If $|\rho| > 0.5$, the condition fails when the signs agree, and Lasso may include the noise feature even with infinite data.
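The failure threshold can be computed at the population level. The sketch below (helper names are illustrative) builds the Gram matrix for a noise feature correlated at level $\rho$ with two uncorrelated, positive-sign signal features and evaluates the irrepresentable quantity, which equals $2\rho$ in this configuration:

```python
import numpy as np

def irrepresentable_value(C, S, signs):
    """Max-norm of C[S^c, S] @ inv(C[S, S]) @ signs."""
    Sc = [j for j in range(C.shape[0]) if j not in S]
    v = C[np.ix_(Sc, S)] @ np.linalg.inv(C[np.ix_(S, S)]) @ signs
    return np.max(np.abs(v))

def make_C(rho):
    # Two uncorrelated signal features; one noise feature correlated rho with each
    return np.array([[1.0, 0.0, rho],
                     [0.0, 1.0, rho],
                     [rho, rho, 1.0]])

S = [0, 1]
signs = np.array([1.0, 1.0])  # both true coefficients positive
print(irrepresentable_value(make_C(0.4), S, signs))  # 0.8 -> condition holds
print(irrepresentable_value(make_C(0.6), S, signs))  # 1.2 -> condition fails
```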
Practical Implications:
The Beta-Min Condition:
Beyond structural conditions on $\mathbf{X}$, we also need the true non-zero coefficients to be large enough:
$$\min_{j \in S} |\beta_j^*| \geq c \cdot \sqrt{\frac{\log p}{n}}$$
for some constant $c > 0$. Coefficients below this threshold are undetectable in high-dimensional settings—they blend into the noise floor.
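A quick simulation illustrates the beta-min effect (a sketch with an assumed fixed penalty rather than cross-validation; all numbers are illustrative): a coefficient well above $\sqrt{\log p / n}$ survives, while one far below it is indistinguishable from noise.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[0] = 2.0    # strong signal, far above the detection threshold
beta[1] = 0.02   # weak signal, well below sqrt(log(p)/n) ≈ 0.14
y = X @ beta + rng.standard_normal(n)

coef = Lasso(alpha=0.15).fit(X, y).coef_
print("strong feature:", coef[0])  # large and non-zero
print("weak feature:", coef[1])    # typically exactly 0.0: lost in the noise floor
```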
Lasso can achieve good prediction even when support recovery fails. If features 1 and 2 are highly correlated and only feature 1 truly matters, selecting feature 2 instead barely affects predictions. The irrepresentable condition ensures correct identification, not just good prediction—a stronger guarantee relevant for scientific interpretation.
The theoretical guarantee of sparsity meets practical reality in ways that require understanding. Let's examine how sparsity manifests in real applications.
When Sparsity Assumptions Hold:
Lasso's sparsity assumption is that the true model involves a small number of features. This assumption is reasonable in many domains:
| Domain | Feature Space | Why Sparsity is Expected |
|---|---|---|
| Genomics | ~20,000 genes | Most genes irrelevant to a specific phenotype; disease involves few pathways |
| Text Analysis | ~100,000 words | Documents about specific topics use limited vocabulary |
| Finance | ~1000 factors | Asset returns driven by few systematic factors + idiosyncratic noise |
| Signal Processing | ~millions of coefficients | Natural signals are sparse in appropriate bases (wavelets, Fourier) |
| Image Analysis | ~millions of pixels | Images compressible; few features determine classification |
When Sparsity Fails:
Not all domains exhibit true sparsity:
Practical Sparsity Patterns:
```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.datasets import make_regression

# Simulate different sparsity scenarios
np.random.seed(42)

def analyze_sparsity(X, y, name):
    """Analyze Lasso sparsity pattern."""
    model = LassoCV(cv=5, random_state=42)
    model.fit(X, y)
    coefs = model.coef_
    non_zero = np.sum(np.abs(coefs) > 1e-10)
    sparsity_ratio = 1 - non_zero / len(coefs)
    print(f"{name}:")
    print(f"  Features: {len(coefs)}, Non-zero: {non_zero}")
    print(f"  Sparsity: {sparsity_ratio:.1%}")
    print(f"  Optimal λ: {model.alpha_:.4f}")
    return coefs

# Scenario 1: Truly sparse signal
X1, y1 = make_regression(n_samples=200, n_features=100, n_informative=5,
                         noise=1, random_state=42)
coefs1 = analyze_sparsity(X1, y1, "Sparse Signal (5 true features)")

# Scenario 2: Dense signal
X2, y2 = make_regression(n_samples=200, n_features=100, n_informative=50,
                         noise=1, random_state=42)
coefs2 = analyze_sparsity(X2, y2, "Dense Signal (50 true features)")

# Scenario 3: Correlated features
X3_base = np.random.randn(200, 20)
X3 = np.column_stack([X3_base + 0.1 * np.random.randn(200, 20)
                      for _ in range(5)])  # 5 groups of 20 correlated features
beta3 = np.zeros(100)
beta3[:20] = np.random.randn(20)  # First group matters
y3 = X3 @ beta3 + np.random.randn(200)
coefs3 = analyze_sparsity(X3, y3, "Correlated Groups")

print("Observation: Lasso achieves different sparsity levels")
print("based on underlying signal structure.")
```

Sparsity is not just a computational convenience—it's a modeling assumption. Before applying Lasso, ask: 'Is it plausible that only a few features truly matter?' If the answer is no, consider ridge regression, elastic net, or other methods that don't enforce hard sparsity.
In practice, the distinction between "exactly zero" and "negligibly small" coefficients matters for different purposes.
Exact Sparsity in Theory:
Mathematically, Lasso produces solutions where some $\hat{\beta}_j = 0$ exactly. This is a genuine zero, not a floating-point approximation.
Numerical Considerations:
In exact arithmetic, the coordinate-descent updates used by most solvers apply soft-thresholding, which sets coefficients to exactly 0.0, even in floating point. Other iterative solvers may instead leave tiny values like $10^{-16}$ that are effectively zero, so software implementations typically threshold coefficients below $10^{-10}$ or similar when reporting the support.
Approximate Sparsity:
The concept of approximate sparsity (or effective sparsity) is often more useful:
Real data often exhibits approximate sparsity: a few dominant coefficients, many small but non-zero contributions.
Thresholding for Interpretation:
For model interpretation, practitioners often apply a secondary threshold $\tau$ to Lasso coefficients: any $\hat{\beta}_j$ with $|\hat{\beta}_j| < \tau$ is set to zero, and ordinary least squares is optionally refit on the surviving features.
This two-stage approach separates selection (Lasso) from estimation (clean-up). The threshold $\tau$ can be chosen via stability selection (see below).
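A minimal sketch of such a two-stage workflow (assuming scikit-learn; the penalty and threshold `tau` are illustrative choices, not recommendations): Lasso selects, then an OLS refit on the selected features removes shrinkage bias.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(6)
n, p = 200, 30
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[[0, 3, 7]] = [2.0, -1.5, 1.0]
y = X @ beta + 0.5 * rng.standard_normal(n)

# Stage 1 (selection): Lasso, then an interpretation threshold tau
coef = Lasso(alpha=0.1).fit(X, y).coef_
tau = 0.05
support = np.where(np.abs(coef) >= tau)[0]

# Stage 2 (estimation): OLS refit on the selected features removes shrinkage bias
refit = LinearRegression().fit(X[:, support], y)
print("selected:", support)
print("refit coefficients:", np.round(refit.coef_, 2))
```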
Stability Selection:
A robust approach to determine which zeros are "real": fit the Lasso on many random subsamples of the data, record how often each feature is selected, and keep only the features whose selection frequency exceeds a threshold (say, 60–80%).
This addresses the instability of Lasso selection under perturbation, particularly with correlated features.
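Stability selection can be sketched in a few lines (assuming scikit-learn; the half-sampling scheme, penalty, and 80% cutoff are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(7)
n, p = 200, 30
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[[0, 5]] = [2.0, -1.5]
y = X @ beta + 0.5 * rng.standard_normal(n)

B = 100
freq = np.zeros(p)
for _ in range(B):
    idx = rng.choice(n, size=n // 2, replace=False)  # random half-sample
    coef = Lasso(alpha=0.1).fit(X[idx], y[idx]).coef_
    freq += np.abs(coef) > 1e-10  # record which features were selected
freq /= B

stable = np.where(freq >= 0.8)[0]  # keep features selected in >= 80% of runs
print("selection frequencies of true features:", freq[[0, 5]])  # both ≈ 1.0
print("stable set:", stable)
```

Strong features are selected in essentially every subsample; unstable selections of correlated noise features show up as intermediate frequencies and are filtered out.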
An exact zero from Lasso doesn't mean the corresponding feature is irrelevant—it means that feature wasn't needed given other selected features. With correlated features, a zero might swap to non-zero under slight data perturbation. For high-stakes interpretations, use stability selection or bootstrap confidence intervals.
We've examined the sparsity-inducing property of L1 regularization from multiple perspectives. Let's consolidate these insights.
What's Next:
We've established why L1 produces sparsity. The next page provides the geometric interpretation visually, showing how constraint regions interact with loss contours to produce solutions at corners. This geometric understanding is crucial for intuition about Lasso behavior.
You now understand why L1 regularization produces exact zeros through geometric, analytical, and Bayesian perspectives. You've learned about conditions for correct sparsity recovery and practical considerations for interpretation. Next, we'll visualize these concepts geometrically.