In the previous page, we derived the Lasso objective and examined the mathematical properties of the L1 norm. We saw that unlike the L2 penalty, the L1 penalty has a constant derivative (in magnitude) as coefficients approach zero. But this mathematical property alone doesn't fully explain why optimization settles at exactly zero.
This page answers the central question: Why does L1 regularization produce sparse solutions?
Sparsity—having many coefficients equal exactly zero—is not merely a mathematical curiosity. It represents automatic feature selection, where the model identifies which variables truly matter and discards the rest. This has profound implications for interpretability, prediction, and computation.
By the end of this page, you will understand multiple complementary perspectives on why L1 produces sparsity: the geometric argument (corner contact), the subdifferential argument (interval at zero), the Bayesian argument (Laplace prior concentration), and the path perspective (sequential variable entry).
The most intuitive explanation for L1 sparsity comes from constrained optimization geometry. Let's develop this carefully.
Setup:
Consider the 2D case with coefficients $(\beta_1, \beta_2)$. The Lasso solves:
$$\min_{\beta_1, \beta_2} |\mathbf{y} - \beta_1\mathbf{x}_1 - \beta_2\mathbf{x}_2|_2^2 \quad \text{subject to} \quad |\beta_1| + |\beta_2| \leq t$$
The Constraint Regions:

The L1 constraint $|\beta_1| + |\beta_2| \leq t$ defines a diamond (a square rotated 45°) whose corners lie on the coordinate axes, where one coefficient is exactly zero. By contrast, the L2 ball $\beta_1^2 + \beta_2^2 \leq t^2$ is a circle with no corners at all.
The Objective Function's Level Sets:
The residual sum of squares $|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2$ is a quadratic function of $\boldsymbol{\beta}$. Its level sets (curves of constant RSS) are ellipses centered at the OLS solution $\hat{\boldsymbol{\beta}}^{\text{OLS}}$.
Finding the Optimum:
We seek the point in the constraint region that achieves minimum RSS. Geometrically, we expand ellipses outward from the OLS solution until they first touch the constraint boundary.
Probability of Corner Contact:
Why are corners so likely to be optimal? Consider the geometric degrees of freedom: a flat edge of the diamond touches the expanding ellipse only when the ellipse's gradient matches that edge's single normal direction, whereas a corner is optimal for an entire cone of gradient directions.
Corners are "stickier"—the optimization must work harder to escape them. Unless the OLS solution aligns precisely with non-corner directions, the Lasso solution lands on a corner.
Formal Statement:
For generic data (OLS solution in general position), as we vary $t$ from 0 to $\infty$, the solution sits exactly on a corner of the constraint set (at least one coefficient equal to zero) for intervals of $t$ of positive length; contact at a non-corner point requires special alignment of the OLS solution and occurs only on a measure-zero set of configurations.
Imagine rolling a ball (the expanding loss ellipse) along a surface. On a smooth sphere (L2), it can touch anywhere. On a diamond (L1), it catches on corners and edges. The pointy geometry of L1 creates "attractors" at sparse solutions.
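The corner-contact picture can be checked numerically. The sketch below (assuming scikit-learn is available; the data, correlation level, and penalty are illustrative choices) fits a two-feature Lasso where only the first feature carries signal, and the solution lands exactly on a corner of the diamond: the second coefficient is exactly 0.0.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Two correlated features; only the first truly matters
rng = np.random.default_rng(0)
n = 200
x1 = rng.standard_normal(n)
x2 = 0.5 * x1 + 0.87 * rng.standard_normal(n)  # corr(x1, x2) ≈ 0.5
X = np.column_stack([x1, x2])
y = 2.0 * x1 + 0.1 * rng.standard_normal(n)

# With a moderate penalty, the loss ellipse first touches the diamond at a corner
model = Lasso(alpha=0.5).fit(X, y)
print("coefficients:", model.coef_)  # second entry is exactly 0.0
```

The zero here is a genuine zero produced by the optimizer, not a rounded-down small number.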
The geometric intuition from corners can be made rigorous through subdifferential calculus. This provides the mathematical machinery to understand why zero is a stable solution.
The Subdifferential of |β|:
Recall that for $f(\beta) = |\beta|$:
$$\partial f(\beta) = \begin{cases} \{+1\} & \beta > 0 \\ \{-1\} & \beta < 0 \\ [-1, +1] & \beta = 0 \end{cases}$$
Optimality at Zero:
For the Lasso objective $L(\beta) = \frac{1}{2}(y - x\beta)^2 + \lambda|\beta|$ in one dimension:
The condition for $\beta = 0$ to be optimal is:
$$0 \in \partial L(0) = \{-xy\} + \lambda \cdot [-1, 1] = [-xy - \lambda, -xy + \lambda]$$
This holds when $|xy| \leq \lambda$, i.e., when the correlation between feature and response is below the threshold.
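In one dimension this condition yields a closed-form solution via soft-thresholding. A minimal sketch (the helper `lasso_1d` and its inputs are illustrative, not from a library):

```python
import numpy as np

def lasso_1d(x, y, lam):
    """Minimizer of 0.5*(y - x*b)**2 + lam*abs(b) via soft-thresholding."""
    return np.sign(x * y) * max(abs(x * y) - lam, 0.0) / x**2

# When |xy| <= lam, zero satisfies 0 in [-xy - lam, -xy + lam]: exact zero
print(lasso_1d(x=1.0, y=0.3, lam=0.5))  # 0.0
# When |xy| > lam, the coefficient escapes zero but is shrunk by lam
print(lasso_1d(x=1.0, y=2.0, lam=0.5))  # 1.5
```

Note the hard switch: the output is exactly zero for an entire interval of inputs, not just at a single point.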
Why This Creates Exact Zeros:
The interval subdifferential at zero provides "slack" in the optimality conditions. Compare:
| Penalty | f(β) | ∂f(0) | Condition for β=0 Optimal |
|---|---|---|---|
| L2 | $\beta^2$ | $\{0\}$ (single point) | Gradient of data term = 0 exactly |
| L1 | $|\beta|$ | $[-1, 1]$ (interval) | Gradient of data term ∈ $[-\lambda, \lambda]$ |
The Critical Insight:
For L2 regularization, the subdifferential at zero is just $\{0\}$. To have $\beta = 0$ optimal, the data term's gradient must be exactly zero—a measure-zero event that essentially never happens.
For L1 regularization, the subdifferential at zero is the interval $[-1, 1]$. To have $\beta = 0$ optimal, the data term's gradient just needs to be bounded by $\lambda$—a condition with positive probability.
Mathematical Formulation (Multivariate):
For the full Lasso problem, the KKT conditions at an optimal $\hat{\boldsymbol{\beta}}$ are:
$$-\frac{1}{n}\mathbf{X}^T(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) + \lambda \hat{\mathbf{s}} = \mathbf{0}$$
where $\hat{s}_j \in \partial|\hat{\beta}_j|$.
For components where $\hat{\beta}_j = 0$:
$$\left|\frac{1}{n}\mathbf{x}_j^T\mathbf{r}\right| \leq \lambda$$
where $\mathbf{r} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}$ is the residual. Features with correlations below $\lambda$ are perfectly happy at zero.
Think of $\lambda$ as defining a 'highway' of width $2\lambda$ centered at zero. Any feature whose correlation with the residual stays within this highway remains at zero. Only features with stronger correlations 'escape' and take non-zero values. Larger $\lambda$ means wider highway, more zeros.
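This "highway" condition can be verified on a fitted model. The sketch below (assuming scikit-learn, whose Lasso minimizes $\frac{1}{2n}|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2 + \alpha|\boldsymbol{\beta}|_1$; the data are synthetic and illustrative) checks that inactive features satisfy $|\mathbf{x}_j^T\mathbf{r}|/n \leq \alpha$ while active features sit exactly at the boundary:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 200, 10
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]   # three true signals, seven noise features
y = X @ beta + 0.3 * rng.standard_normal(n)

alpha = 0.2
model = Lasso(alpha=alpha, tol=1e-10).fit(X, y)
r = y - model.predict(X)       # residual (mean zero, since an intercept is fitted)
corr = X.T @ r / n             # per-feature correlation with the residual
zero = np.abs(model.coef_) < 1e-12

# Inactive features stay inside the "highway" of width 2*alpha
print(np.all(np.abs(corr[zero]) <= alpha + 1e-6))
# Active features sit exactly at the edge of the highway
print(np.allclose(np.abs(corr[~zero]), alpha, atol=1e-3))
```

Both checks print `True`: the KKT conditions partition features cleanly into those pinned at zero and those at the boundary.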
The Lasso has an elegant Bayesian interpretation: it corresponds to Maximum A Posteriori (MAP) estimation under a Laplace prior on the coefficients.
The Laplace Distribution:
The Laplace (double-exponential) distribution has density:
$$p(\beta | b) = \frac{1}{2b} \exp\left(-\frac{|\beta|}{b}\right)$$
with scale parameter $b > 0$. It's symmetric around zero with heavier tails than the Gaussian and a sharp peak at zero—this peak is key to sparsity.
Prior Distribution on Coefficients:
Assume i.i.d. Laplace priors on each coefficient:
$$p(\boldsymbol{\beta}) = \prod_{j=1}^p \frac{1}{2b} \exp\left(-\frac{|\beta_j|}{b}\right) = \frac{1}{(2b)^p} \exp\left(-\frac{|\boldsymbol{\beta}|_1}{b}\right)$$
MAP Estimation:
With Gaussian likelihood $\mathbf{y} | \mathbf{X}, \boldsymbol{\beta} \sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I})$, the posterior is:
$$p(\boldsymbol{\beta} | \mathbf{y}, \mathbf{X}) \propto \exp\left(-\frac{1}{2\sigma^2}|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2 - \frac{|\boldsymbol{\beta}|_1}{b}\right)$$
Taking the negative log and minimizing (MAP):
$$\hat{\boldsymbol{\beta}}^{\text{MAP}} = \underset{\boldsymbol{\beta}}{\arg\min}\left\{\frac{1}{2\sigma^2}|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2 + \frac{1}{b}|\boldsymbol{\beta}|_1\right\}$$
This is exactly the Lasso with $\lambda = \sigma^2/b$.
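The equivalence can be checked on a grid in the one-coefficient case: the negative log posterior is $1/\sigma^2$ times the Lasso objective with $\lambda = \sigma^2/b$, so the two have the same minimizer. (A sketch; the values of $\sigma^2$ and $b$ below are illustrative.)

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 1))
y = 1.5 * X[:, 0] + 0.5 * rng.standard_normal(50)

sigma2, b = 0.25, 0.5    # noise variance and Laplace scale (illustrative)
lam = sigma2 / b         # implied Lasso penalty

betas = np.linspace(-3, 3, 2001)
rss = np.array([np.sum((y - X[:, 0] * beta) ** 2) for beta in betas])
neg_log_post = rss / (2 * sigma2) + np.abs(betas) / b  # MAP objective
lasso_obj = rss / 2 + lam * np.abs(betas)              # Lasso objective

# The objectives are proportional (factor 1/sigma2), so the argmin is identical
i, j = np.argmin(neg_log_post), np.argmin(lasso_obj)
print(betas[i] == betas[j])  # True
```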
Why the Laplace Prior Induces Sparsity:
The Laplace density has a cusp at zero—a discontinuous derivative. This creates a probability "spike" at exactly zero.
Intuitively, the prior strongly believes most coefficients should be near zero, with occasional large values (heavy tails). This matches our scientific intuition: in high-dimensional problems, most features are probably irrelevant, but a few might have substantial effects.
Contrast with the Gaussian:
The Gaussian prior has a smooth dome at zero. There's no particular attraction to exactly zero—values near zero receive essentially the same prior density as zero itself. This smoothness is why L2 regularization shrinks coefficients toward zero but cannot set them exactly to zero.
Hyperprior Perspective:
The Laplace prior can be written as a scale mixture of normals:
$$\beta_j | \tau_j^2 \sim \mathcal{N}(0, \tau_j^2), \quad \tau_j^2 \sim \text{Exponential}(\lambda^2/2)$$
This representation connects Lasso to hierarchical Bayesian models and enables extensions like the Bayesian Lasso, which provides full posterior inference rather than just point estimates.
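The scale-mixture representation can be verified by simulation: drawing $\tau_j^2$ from the exponential and then $\beta_j$ given $\tau_j^2$ should reproduce a Laplace marginal with rate $\lambda$, whose mean absolute value is $1/\lambda$ and variance $2/\lambda^2$. (A Monte Carlo sketch; the sample size and $\lambda$ are illustrative.)

```python
import numpy as np

rng = np.random.default_rng(3)
lam = 2.0
n = 500_000

# tau^2 ~ Exponential(rate = lam^2 / 2), i.e. scale = 2 / lam^2
tau2 = rng.exponential(scale=2 / lam**2, size=n)
# beta | tau^2 ~ N(0, tau^2)
beta = rng.normal(0.0, np.sqrt(tau2))

# Marginal should be Laplace(rate lam): E|beta| = 1/lam, Var(beta) = 2/lam^2
print(np.mean(np.abs(beta)))  # ≈ 0.5
print(np.var(beta))           # ≈ 0.5
```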
The correspondence between priors and regularizers is deep: Gaussian↔L2, Laplace↔L1, Horseshoe↔adaptive sparsity. Choosing a regularizer is equivalent to choosing a prior belief about coefficient structure. The Laplace prior encodes: 'Most coefficients are zero or tiny; a few might be large.'
Given that L1 produces sparsity, a natural question arises: How sparse will the solution be? The answer depends on $\lambda$, the data distribution, and the true underlying signal.
The Sparsity-λ Relationship:
As $\lambda$ increases, more coefficients become zero. We can characterize this relationship:
For $\lambda \geq \lambda_{\max}$:
$$\lambda_{\max} = \frac{1}{n}|\mathbf{X}^T\mathbf{y}|_\infty = \max_j \left|\frac{1}{n}\mathbf{x}_j^T\mathbf{y}\right|$$
All coefficients are zero. $\lambda_{\max}$ is the smallest penalty value at which the solution is entirely zero (all $p$ features removed).
As $\lambda \downarrow 0$:
More coefficients become non-zero. As $\lambda \to 0$, the Lasso solution approaches the OLS solution (if $n > p$) or a minimum-norm solution (if $n < p$).
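The $\lambda_{\max}$ formula can be verified directly (a sketch assuming scikit-learn, which centers the data when fitting an intercept, hence the centering below; the data are illustrative): a penalty just above $\lambda_{\max}$ zeroes every coefficient, while a smaller one lets features enter.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n, p = 100, 20
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] + rng.standard_normal(n)

# Center, matching sklearn's handling of the intercept
Xc, yc = X - X.mean(axis=0), y - y.mean()
lam_max = np.max(np.abs(Xc.T @ yc)) / n

all_zero = Lasso(alpha=lam_max * 1.001).fit(X, y).coef_   # just above lambda_max
some_nonzero = Lasso(alpha=lam_max * 0.5).fit(X, y).coef_ # well below it
print(np.count_nonzero(all_zero), np.count_nonzero(some_nonzero))
```

The first count is 0; the second is at least 1, with the most correlated feature entering first.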
The Active Set:
Define the active set as $\mathcal{A}(\lambda) = \{j : \hat{\beta}_j(\lambda) \neq 0\}$. Its cardinality $|\mathcal{A}(\lambda)|$ is the degrees of freedom of the model.
As $\lambda$ decreases, $|\mathcal{A}(\lambda)|$ generally increases, though coefficients can also exit the active set (become zero again) in some cases.
```python
import numpy as np
from sklearn.linear_model import lasso_path

# Generate synthetic data with true sparsity
np.random.seed(42)
n, p = 100, 50
true_support = [0, 5, 10, 15, 20]  # Only 5 features matter
X = np.random.randn(n, p)
beta_true = np.zeros(p)
beta_true[true_support] = np.array([3, -2, 1.5, -1, 2.5])
y = X @ beta_true + 0.5 * np.random.randn(n)

# Compute lasso path
alphas, coefs, _ = lasso_path(X, y, n_alphas=100)

# Analyze sparsity across lambda values
sparsity = np.sum(np.abs(coefs) > 1e-10, axis=0)  # Number of non-zeros

print("Sparsity Analysis:")
print("-" * 50)
print(f"λ_max (all zeros): {alphas[0]:.4f}")
print(f"λ_min (least sparse): {alphas[-1]:.4f}")
print(f"Non-zeros at λ_max: {sparsity[0]}")
print(f"Non-zeros at λ_min: {sparsity[-1]}")
print(f"True non-zeros: {len(true_support)}")

# Find lambda giving correct sparsity
optimal_idx = np.argmin(np.abs(sparsity - len(true_support)))
print(f"λ for {len(true_support)} non-zeros: {alphas[optimal_idx]:.4f}")
print(f"Recovered support: {np.where(np.abs(coefs[:, optimal_idx]) > 1e-10)[0]}")
print(f"True support: {true_support}")
```

Factors Affecting Sparsity:
Signal Strength: Strong true signals (large $|\beta_j^*|$) remain non-zero even at large $\lambda$. Weak signals are eliminated first.
Feature Correlations: Highly correlated features compete for selection. Lasso tends to select one and zero out others (instability).
Sample Size: More data provides better estimation of which features matter, stabilizing the active set.
Noise Level: Higher noise requires larger $\lambda$ for good prediction, which increases sparsity.
Degrees of Freedom of Lasso:
An important theoretical result: for the Lasso, the expected degrees of freedom equals the expected number of non-zero coefficients:
$$\mathbb{E}[\text{df}_{\text{lasso}}(\lambda)] = \mathbb{E}[|\mathcal{A}(\lambda)|]$$
This elegant result (proved via Stein's Lemma) enables unbiased estimation of prediction risk and informs model selection criteria like AIC and Cp.
More sparsity isn't always better. Overly large λ produces a model with few features but high bias (it misses true signals). Overly small λ includes noise features. The art is finding the λ that captures true signal structure—typically via cross-validation.
A fundamental question in high-dimensional statistics: Under what conditions does Lasso correctly identify the truly non-zero coefficients?
This is the problem of support recovery or variable selection consistency. The answer involves conditions on the design matrix $\mathbf{X}$ that govern whether Lasso can distinguish signal from noise.
The Irrepresentable Condition:
Let $S = \{j : \beta_j^* \neq 0\}$ be the true support (indices of non-zero coefficients) and $S^c$ its complement (zero coefficients).
Partition the Gram matrix $\mathbf{C} = \frac{1}{n}\mathbf{X}^T\mathbf{X}$:
$$\mathbf{C} = \begin{pmatrix} \mathbf{C}_{SS} & \mathbf{C}_{SS^c} \\ \mathbf{C}_{S^cS} & \mathbf{C}_{S^cS^c} \end{pmatrix}$$
The Irrepresentable Condition states:
$$\left|\mathbf{C}_{S^cS}\mathbf{C}_{SS}^{-1}\,\text{sign}(\boldsymbol{\beta}_S^*)\right|_\infty < 1$$
Interpretation:
This condition limits how much the irrelevant features ($S^c$) can be represented by the relevant features ($S$). If irrelevant features correlate too strongly with relevant ones weighted by the true coefficient signs, Lasso may include spurious variables.
When the Condition Fails:
If the irrepresentable condition fails, there exists a noise feature $j \in S^c$ that is "too well represented" by signal features. This noise feature can enter the Lasso model even in the large-sample limit.
Example of Failure:
Consider a noise feature that is correlated at level $\rho$ with each of two mutually uncorrelated signal features. With $\mathbf{C}_{SS} = \mathbf{I}$ and $\mathbf{C}_{S^cS} = (\rho, \rho)$, the irrepresentable quantity is $|\rho(s_1 + s_2)|$, where $s_1, s_2$ are the signs of the true coefficients.
If $|\rho| > 0.5$, the condition fails when the signs agree, and Lasso may include the noise feature even with infinite data.
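The failure threshold can be computed at the population level. The sketch below (helper names are illustrative) builds the Gram matrix for a noise feature correlated at level $\rho$ with two uncorrelated, positive-sign signal features and evaluates the irrepresentable quantity, which equals $2\rho$ in this configuration:

```python
import numpy as np

def irrepresentable_value(C, S, signs):
    """Max-norm of C[S^c, S] @ inv(C[S, S]) @ signs."""
    Sc = [j for j in range(C.shape[0]) if j not in S]
    v = C[np.ix_(Sc, S)] @ np.linalg.inv(C[np.ix_(S, S)]) @ signs
    return np.max(np.abs(v))

def make_C(rho):
    # Two uncorrelated signal features; one noise feature correlated rho with each
    return np.array([[1.0, 0.0, rho],
                     [0.0, 1.0, rho],
                     [rho, rho, 1.0]])

S = [0, 1]
signs = np.array([1.0, 1.0])  # both true coefficients positive
print(irrepresentable_value(make_C(0.4), S, signs))  # 0.8 -> condition holds
print(irrepresentable_value(make_C(0.6), S, signs))  # 1.2 -> condition fails
```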
Practical Implications:
The Beta-Min Condition:
Beyond structural conditions on $\mathbf{X}$, we also need the true non-zero coefficients to be large enough:
$$\min_{j \in S} |\beta_j^*| \geq c \cdot \sqrt{\frac{\log p}{n}}$$
for some constant $c > 0$. Coefficients below this threshold are undetectable in high-dimensional settings—they blend into the noise floor.
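A quick simulation illustrates the beta-min effect (a sketch with an assumed fixed penalty rather than cross-validation; all numbers are illustrative): a coefficient well above $\sqrt{\log p / n}$ survives, while one far below it is indistinguishable from noise.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[0] = 2.0    # strong signal, far above the detection threshold
beta[1] = 0.02   # weak signal, well below sqrt(log(p)/n) ≈ 0.14
y = X @ beta + rng.standard_normal(n)

coef = Lasso(alpha=0.15).fit(X, y).coef_
print("strong feature:", coef[0])  # large and non-zero
print("weak feature:", coef[1])    # typically exactly 0.0: lost in the noise floor
```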
Lasso can achieve good prediction even when support recovery fails. If features 1 and 2 are highly correlated and only feature 1 truly matters, selecting feature 2 instead barely affects predictions. The irrepresentable condition ensures correct identification, not just good prediction—a stronger guarantee relevant for scientific interpretation.
The theoretical guarantee of sparsity meets practical reality in ways that require understanding. Let's examine how sparsity manifests in real applications.
When Sparsity Assumptions Hold:
Lasso's sparsity assumption is that the true model involves a small number of features. This assumption is reasonable in many domains:
| Domain | Feature Space | Why Sparsity is Expected |
|---|---|---|
| Genomics | ~20,000 genes | Most genes irrelevant to a specific phenotype; disease involves few pathways |
| Text Analysis | ~100,000 words | Documents about specific topics use limited vocabulary |
| Finance | ~1000 factors | Asset returns driven by few systematic factors + idiosyncratic noise |
| Signal Processing | ~millions of coefficients | Natural signals are sparse in appropriate bases (wavelets, Fourier) |
| Image Analysis | ~millions of pixels | Images compressible; few features determine classification |
When Sparsity Fails:
Not all domains exhibit true sparsity:
Practical Sparsity Patterns:
```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.datasets import make_regression

# Simulate different sparsity scenarios
np.random.seed(42)

def analyze_sparsity(X, y, name):
    """Analyze Lasso sparsity pattern."""
    model = LassoCV(cv=5, random_state=42)
    model.fit(X, y)
    coefs = model.coef_
    non_zero = np.sum(np.abs(coefs) > 1e-10)
    sparsity_ratio = 1 - non_zero / len(coefs)
    print(f"{name}:")
    print(f"  Features: {len(coefs)}, Non-zero: {non_zero}")
    print(f"  Sparsity: {sparsity_ratio:.1%}")
    print(f"  Optimal λ: {model.alpha_:.4f}")
    return coefs

# Scenario 1: Truly sparse signal
X1, y1 = make_regression(n_samples=200, n_features=100, n_informative=5,
                         noise=1, random_state=42)
coefs1 = analyze_sparsity(X1, y1, "Sparse Signal (5 true features)")

# Scenario 2: Dense signal
X2, y2 = make_regression(n_samples=200, n_features=100, n_informative=50,
                         noise=1, random_state=42)
coefs2 = analyze_sparsity(X2, y2, "Dense Signal (50 true features)")

# Scenario 3: Correlated features
X3_base = np.random.randn(200, 20)
X3 = np.column_stack([X3_base + 0.1 * np.random.randn(200, 20)
                      for _ in range(5)])  # 5 groups of 20 correlated features
beta3 = np.zeros(100)
beta3[:20] = np.random.randn(20)  # First group matters
y3 = X3 @ beta3 + np.random.randn(200)
coefs3 = analyze_sparsity(X3, y3, "Correlated Groups")

print("Observation: Lasso achieves different sparsity levels")
print("based on underlying signal structure.")
```

Sparsity is not just a computational convenience—it's a modeling assumption. Before applying Lasso, ask: 'Is it plausible that only a few features truly matter?' If the answer is no, consider ridge regression, elastic net, or other methods that don't enforce hard sparsity.
In practice, the distinction between "exactly zero" and "negligibly small" coefficients matters for different purposes.
Exact Sparsity in Theory:
Mathematically, Lasso produces solutions where some $\hat{\beta}_j = 0$ exactly. This is a genuine zero, not a floating-point approximation.
Numerical Considerations:
In exact arithmetic, the coordinate-descent updates used by most solvers apply soft-thresholding, which sets coefficients to exactly 0.0, even in floating point. Other iterative solvers may instead leave tiny values like $10^{-16}$ that are effectively zero, so software implementations typically threshold coefficients below $10^{-10}$ or similar when reporting the support.
Approximate Sparsity:
The concept of approximate sparsity (or effective sparsity) is often more useful:
Real data often exhibits approximate sparsity: a few dominant coefficients, many small but non-zero contributions.
Thresholding for Interpretation:
For model interpretation, practitioners often apply a secondary threshold $\tau$ to Lasso coefficients: any $\hat{\beta}_j$ with $|\hat{\beta}_j| < \tau$ is set to zero, and ordinary least squares is optionally refit on the surviving features.
This two-stage approach separates selection (Lasso) from estimation (clean-up). The threshold $\tau$ can be chosen via stability selection (see below).
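A minimal sketch of such a two-stage workflow (assuming scikit-learn; the penalty and threshold `tau` are illustrative choices, not recommendations): Lasso selects, then an OLS refit on the selected features removes shrinkage bias.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(6)
n, p = 200, 30
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[[0, 3, 7]] = [2.0, -1.5, 1.0]
y = X @ beta + 0.5 * rng.standard_normal(n)

# Stage 1 (selection): Lasso, then an interpretation threshold tau
coef = Lasso(alpha=0.1).fit(X, y).coef_
tau = 0.05
support = np.where(np.abs(coef) >= tau)[0]

# Stage 2 (estimation): OLS refit on the selected features removes shrinkage bias
refit = LinearRegression().fit(X[:, support], y)
print("selected:", support)
print("refit coefficients:", np.round(refit.coef_, 2))
```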
Stability Selection:
A robust approach to determine which zeros are "real": fit the Lasso on many random subsamples of the data, record how often each feature is selected, and keep only the features whose selection frequency exceeds a threshold (say, 60–80%).
This addresses the instability of Lasso selection under perturbation, particularly with correlated features.
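Stability selection can be sketched in a few lines (assuming scikit-learn; the half-sampling scheme, penalty, and 80% cutoff are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(7)
n, p = 200, 30
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[[0, 5]] = [2.0, -1.5]
y = X @ beta + 0.5 * rng.standard_normal(n)

B = 100
freq = np.zeros(p)
for _ in range(B):
    idx = rng.choice(n, size=n // 2, replace=False)  # random half-sample
    coef = Lasso(alpha=0.1).fit(X[idx], y[idx]).coef_
    freq += np.abs(coef) > 1e-10  # record which features were selected
freq /= B

stable = np.where(freq >= 0.8)[0]  # keep features selected in >= 80% of runs
print("selection frequencies of true features:", freq[[0, 5]])  # both ≈ 1.0
print("stable set:", stable)
```

Strong features are selected in essentially every subsample; unstable selections of correlated noise features show up as intermediate frequencies and are filtered out.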
An exact zero from Lasso doesn't mean the corresponding feature is irrelevant—it means that feature wasn't needed given other selected features. With correlated features, a zero might swap to non-zero under slight data perturbation. For high-stakes interpretations, use stability selection or bootstrap confidence intervals.
We've examined the sparsity-inducing property of L1 regularization from multiple perspectives. Let's consolidate these insights.
What's Next:
We've established why L1 produces sparsity. The next page provides the geometric interpretation visually, showing how constraint regions interact with loss contours to produce solutions at corners. This geometric understanding is crucial for intuition about Lasso behavior.
You now understand why L1 regularization produces exact zeros through geometric, analytical, and Bayesian perspectives. You've learned about conditions for correct sparsity recovery and practical considerations for interpretation. Next, we'll visualize these concepts geometrically.