In the previous module, we explored Ridge Regression—the elegant solution that tames overfitting by adding an L2 penalty that shrinks all coefficients toward zero. Ridge regression is remarkably effective: it stabilizes ill-conditioned problems, handles multicollinearity gracefully, and provides a smooth bias-variance tradeoff.
But ridge regression has a fundamental limitation: it never produces exactly zero coefficients. Every feature, no matter how irrelevant, retains some non-zero weight in the final model. When you have 1,000 potential features and suspect only 20 truly matter, ridge gives you a model with 1,000 non-zero coefficients—making interpretation difficult and suggesting false relationships.
This is where Lasso regression (Least Absolute Shrinkage and Selection Operator) enters the picture. Proposed by Robert Tibshirani in 1996, Lasso makes a seemingly small modification to the regularization term: replace the squared L2 penalty with an absolute value L1 penalty. This single change produces dramatically different behavior—Lasso doesn't just shrink coefficients; it eliminates them entirely.
By the end of this page, you will deeply understand the L1 penalty formulation, derive the Lasso objective function, comprehend the mathematical distinction between L1 and L2 norms, and build intuition for why absolute value penalties produce fundamentally different optimization landscapes than squared penalties.
Let us begin by precisely defining the Lasso objective. Consider the standard linear regression setting: a response vector $\mathbf{y} \in \mathbb{R}^n$, a design matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$ whose columns are the $p$ features, and a coefficient vector $\boldsymbol{\beta} \in \mathbb{R}^p$ to be estimated.
The Lasso Objective Function:
$$\hat{\boldsymbol{\beta}}^{\text{lasso}} = \underset{\boldsymbol{\beta}}{\arg\min} \left\{ \frac{1}{2n} |\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2 + \lambda |\boldsymbol{\beta}|_1 \right\}$$
Let's dissect each component of this formulation: the first term is the familiar least-squares loss (the residual sum of squares, scaled by $\frac{1}{2n}$ for convenience), the second term is the L1 penalty $\lambda|\boldsymbol{\beta}|_1$ that charges for total absolute coefficient magnitude, and $\lambda \geq 0$ is the tuning parameter that controls the strength of that charge.
Different sources use different scaling conventions. Some use $\frac{1}{2}|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2 + \lambda|\boldsymbol{\beta}|_1$ without the $\frac{1}{n}$ factor. Others parameterize as $|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2 + \alpha|\boldsymbol{\beta}|_1$ with $\alpha = 2n\lambda$. The qualitative behavior is identical; only the numeric value of the regularization parameter changes.
The L1 norm (also called the Manhattan norm, taxicab norm, or ℓ₁ norm) is a fundamental concept in linear algebra and analysis. For a vector $\mathbf{v} = (v_1, v_2, \ldots, v_p)^T$, the L1 norm is defined as:
$$|\mathbf{v}|_1 = \sum_{j=1}^{p} |v_j| = |v_1| + |v_2| + \cdots + |v_p|$$
The name "Manhattan distance" comes from the grid-like street layout of Manhattan: to travel between two intersections, you must walk along streets (horizontal and vertical), not diagonally through buildings.
Properties of the L1 Norm:
Like any norm, $|\cdot|_1$ is non-negative (and zero only for the zero vector), absolutely homogeneous ($|c\mathbf{v}|_1 = |c|\,|\mathbf{v}|_1$), and satisfies the triangle inequality ($|\mathbf{u} + \mathbf{v}|_1 \leq |\mathbf{u}|_1 + |\mathbf{v}|_1$); consequently it is convex. Unlike the squared L2 norm, however, it is not differentiable wherever any coordinate $v_j$ equals zero.
The Subdifferential at Zero:
The non-differentiability of $|v|$ at $v = 0$ is not a bug—it's the feature that enables sparsity. Let's examine this carefully using the concept of subdifferentials from convex analysis.
For a convex function $f$, the subdifferential at a point $x$ is the set of all slopes of lines that lie entirely below the graph of $f$ while touching it at $x$:
$$\partial f(x) = \{g : f(y) \geq f(x) + g(y - x) \text{ for all } y\}$$
For the absolute value function $f(v) = |v|$:
$$\partial |v| = \begin{cases} \{1\} & \text{if } v > 0 \\ \{-1\} & \text{if } v < 0 \\ [-1, 1] & \text{if } v = 0 \end{cases}$$
This interval $[-1, 1]$ at zero is crucial. It means that any subgradient between $-1$ and $1$ is valid at the origin. This "slack" allows the optimization to settle at exactly zero when the gradient of the data term is small enough.
In ridge regression, the gradient of $v^2$ is $2v$, which approaches zero continuously as $v \to 0$. The optimal solution reaches $v = 0$ only asymptotically. But with $|v|$, the subgradient at zero is an entire interval, meaning zero can be an optimal solution even when the data term's gradient isn't exactly zero. This is the mathematical foundation of Lasso's sparsity.
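This contrast can be seen numerically with a scalar toy problem (a sketch, not part of the original text): minimize $\frac{1}{2}(v - z)^2 + \lambda\,p(v)$ over a fine grid for both penalties $p(v) = |v|$ and $p(v) = v^2$. With a weak data pull $z$, the L1 minimizer lands exactly at zero while the L2 minimizer is merely small.

```python
import numpy as np

z = 0.4          # the "data term" pulls v toward 0.4
lam = 1.0        # regularization strength
v = np.linspace(-2, 2, 400001)          # fine grid that includes v = 0

l1_obj = 0.5 * (v - z) ** 2 + lam * np.abs(v)   # Lasso-style scalar objective
l2_obj = 0.5 * (v - z) ** 2 + lam * v ** 2      # Ridge-style scalar objective

v_l1 = v[np.argmin(l1_obj)]
v_l2 = v[np.argmin(l2_obj)]

print(f"L1 minimizer: {v_l1:.4f}")   # 0.0000 exactly, since |z| = 0.4 <= lam
print(f"L2 minimizer: {v_l2:.4f}")   # ~0.1333 = z / (1 + 2*lam), small but nonzero
```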
To appreciate the Lasso's unique behavior, we must contrast it mathematically with ridge regression. Both methods add a penalty to ordinary least squares, but the nature of that penalty changes everything.
Side-by-Side Comparison:
| Property | Ridge (L2) | Lasso (L1) |
|---|---|---|
| Penalty Term | $\lambda \sum_{j=1}^p \beta_j^2 = \lambda|\boldsymbol{\beta}|_2^2$ | $\lambda \sum_{j=1}^p |\beta_j| = \lambda|\boldsymbol{\beta}|_1$ |
| Penalty Scaling | Quadratic in $|\beta_j|$ | Linear in $|\beta_j|$ |
| Closed-Form Solution | Yes: $(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$ | No (except in orthogonal design) |
| Differentiability | Everywhere differentiable | Not differentiable at $\beta_j = 0$ |
| Sparsity | No exact zeros | Produces exact zeros |
| Feature Selection | Includes all features | Automatic feature selection |
| Correlated Features | Spreads weight evenly | Tends to select one arbitrarily |
| Constraint Set Shape | Sphere/ellipsoid | Diamond/rhombus |
The Penalty Behavior Near Zero:
The fundamental difference emerges when we examine how each penalty behaves for small coefficient values.
Consider moving a coefficient from $\beta_j = 0.01$ to $\beta_j = 0$: the L1 penalty decreases by $\lambda \cdot 0.01$, while the L2 penalty decreases by only $\lambda \cdot 0.01^2 = \lambda \cdot 0.0001$, one hundred times less.
This asymmetry is profound. The L1 penalty gives the same "credit" for reducing $|\beta_j|$ from $0.01$ to $0$ as it does for reducing from $100.01$ to $100$. In contrast, L2 provides nearly zero incentive to shrink already-small coefficients to exactly zero.
Mathematically: for $\beta_j > 0$, the L1 penalty contributes a constant derivative $\frac{\partial}{\partial \beta_j}\lambda|\beta_j| = \lambda$, whereas the L2 penalty contributes $\frac{\partial}{\partial \beta_j}\lambda\beta_j^2 = 2\lambda\beta_j$, which vanishes as $\beta_j \to 0$.
The derivative of the L1 penalty doesn't diminish as coefficients approach zero—it maintains constant pressure to drive coefficients all the way to zero.
Lasso's sparsity is not free. The loss of differentiability means we cannot use standard gradient descent. The arbitrary selection among correlated features can be unstable. And when true underlying signals are distributed across many features, Lasso may underperform ridge. Understanding these tradeoffs is essential for proper application.
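The sparsity contrast is easy to see empirically. The following is a minimal sketch, assuming scikit-learn is available; the sample size, feature count, true coefficients, and alpha values are illustrative choices, not from the text.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p, k = 200, 50, 5                      # samples, features, truly relevant features

X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:k] = [3.0, -2.0, 1.5, -1.0, 2.5]
y = X @ beta_true + 0.5 * rng.standard_normal(n)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1, max_iter=10000).fit(X, y)

# Ridge typically keeps every coefficient nonzero; Lasso zeroes most of them,
# concentrating weight on (roughly) the k informative features.
print("Ridge nonzero coefficients:", np.sum(ridge.coef_ != 0))
print("Lasso nonzero coefficients:", np.sum(lasso.coef_ != 0))
```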
The Lasso can be equivalently expressed as a constrained optimization problem. This formulation provides complementary geometric intuition and connects to the broader theory of regularization.
Constrained Form:
$$\hat{\boldsymbol{\beta}}^{\text{lasso}} = \underset{\boldsymbol{\beta}}{\arg\min} \left\{ \frac{1}{2n}|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2 \right\} \quad \text{subject to} \quad |\boldsymbol{\beta}|_1 \leq t$$
Equivalence via Lagrange Duality:
For every value of the regularization parameter $\lambda \geq 0$, there exists a corresponding constraint bound $t \geq 0$ such that the solutions are identical. The relationship is governed by Lagrange duality: the constrained problem's Lagrangian is

$$\mathcal{L}(\boldsymbol{\beta}, \lambda) = \frac{1}{2n}|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2 + \lambda\left(|\boldsymbol{\beta}|_1 - t\right),$$

and minimizing it over $\boldsymbol{\beta}$ at the optimal multiplier $\lambda$ recovers the penalized form, since the constant $-\lambda t$ does not affect the minimizer. Conversely, the bound matching a given $\lambda$ is $t = |\hat{\boldsymbol{\beta}}^{\text{lasso}}(\lambda)|_1$.
Interpreting the Constraint $|\boldsymbol{\beta}|_1 \leq t$:
This constraint defines an L1 ball in coefficient space—the set of all coefficient vectors whose L1 norm is at most $t$. The geometric shape of this ball is crucial: in two dimensions it is a diamond (a square rotated 45°) with vertices on the coordinate axes at $(\pm t, 0)$ and $(0, \pm t)$; in $p$ dimensions it is a cross-polytope whose corners sit where all but one coordinate is exactly zero, and whose edges and faces likewise contain many zero coordinates.
Why Corners Matter for Sparsity:
Optimization seeks to minimize the loss function while staying within the constraint set. Geometrically, we expand contour ellipses of the loss until they just touch the constraint boundary.
For L2 constraints (a sphere), the touching point can occur anywhere on the smooth boundary. For L1 constraints (a diamond), the touching point tends to occur at corners—and corners are where coordinates equal zero.
This is the geometric revelation: the sharp corners of the L1 ball "catch" the expanding loss contours, naturally producing solutions on coordinate axes (sparse solutions).
Think of $t$ as a "complexity budget." You have a fixed total of absolute coefficient magnitude to spend. Placing weight on feature $j$ (making $|\beta_j| > 0$) costs you from this budget. The Lasso finds the best allocation, and often the optimal strategy is to concentrate budget on a few important features rather than spreading it thin across many.
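To make the budget picture concrete, here is a hedged sketch (synthetic data, scikit-learn, arbitrary alpha values): for each $\lambda$, the "budget actually spent" is the L1 norm of the fitted coefficients, which is also the matching constraint bound $t$; it shrinks, along with the number of active features, as $\lambda$ grows.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 30))
beta_true = np.concatenate([[2.0, -1.5, 1.0], np.zeros(27)])
y = X @ beta_true + 0.3 * rng.standard_normal(200)

for alpha in [1.0, 0.3, 0.1, 0.03]:
    fit = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    t_implied = np.abs(fit.coef_).sum()      # the L1 "budget" actually spent
    n_active = np.sum(fit.coef_ != 0)
    print(f"lambda={alpha:<5}  implied t={t_implied:6.3f}  active features={n_active}")
```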
While Lasso lacks a closed-form solution in general, there is one special case where the solution has a beautiful explicit form: orthonormal design matrices.
Setup: Orthonormal Design
Assume $\mathbf{X}$ has orthonormal columns: $\mathbf{X}^T\mathbf{X} = \mathbf{I}$. This is achievable through QR decomposition or when features are uncorrelated and standardized.
Let $\tilde{\boldsymbol{\beta}}^{\text{OLS}} = \mathbf{X}^T\mathbf{y}$ be the OLS solution (which equals $(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \mathbf{X}^T\mathbf{y}$ under orthonormality).
The Soft Thresholding Operator:
For orthonormal $\mathbf{X}$, the Lasso solution is:
$$\hat{\beta}_j^{\text{lasso}} = S_{\lambda}(\tilde{\beta}_j^{\text{OLS}}) = \text{sign}(\tilde{\beta}_j^{\text{OLS}}) \cdot \max(|\tilde{\beta}_j^{\text{OLS}}| - \lambda, 0)$$
This is the soft thresholding function (also called the shrinkage operator), often denoted $S_\lambda(z)$.
```python
import numpy as np

def soft_threshold(z, lambda_):
    """
    Soft thresholding operator (proximal operator of L1 norm).

    For scalar z: S_λ(z) = sign(z) * max(|z| - λ, 0)

    This shrinks z toward zero by λ, setting to exactly zero if |z| ≤ λ.

    Parameters
    ----------
    z : float or np.ndarray
        Input value(s) to threshold
    lambda_ : float
        Threshold parameter (λ ≥ 0)

    Returns
    -------
    float or np.ndarray
        Soft-thresholded value(s)
    """
    return np.sign(z) * np.maximum(np.abs(z) - lambda_, 0)

# Demonstration
z_values = np.linspace(-3, 3, 7)
lambda_ = 1.0

print("Soft Thresholding with λ = 1.0:")
print("-" * 40)
for z in z_values:
    result = soft_threshold(z, lambda_)
    print(f"S_λ({z:+.1f}) = {result:+.2f}")
```

Properties of Soft Thresholding:
Sparsity Zone: If $|\tilde{\beta}_j^{\text{OLS}}| \leq \lambda$, then $\hat{\beta}_j^{\text{lasso}} = 0$ (coefficient is eliminated)
Uniform Shrinkage: If $|\tilde{\beta}_j^{\text{OLS}}| > \lambda$, the coefficient is shrunk by exactly $\lambda$ in absolute value
Continuity: Unlike hard thresholding (which jumps from zero to non-zero), soft thresholding is continuous
Bias: Large coefficients are shrunk by $\lambda$, introducing bias for truly important features
Comparison: Soft vs. Hard vs. Ridge
| Method | Transformation | At $|\tilde{\beta}|=0.5$, $\lambda=1$ | At $|\tilde{\beta}|=2$, $\lambda=1$ |
|---|---|---|---|
| Ridge (L2) | $\frac{\tilde{\beta}}{1+\lambda}$ | $0.25$ | $1.0$ |
| Lasso (L1) | $\text{sign}(\tilde{\beta})\max(|\tilde{\beta}|-\lambda, 0)$ | $0$ (eliminated!) | $1.0$ |
| Hard Threshold | $\tilde{\beta} \cdot \mathbf{1}_{|\tilde{\beta}|>\lambda}$ | $0$ (eliminated) | $2.0$ (unchanged) |
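The table's entries can be reproduced with a few lines. This small sketch implements the three rules; the ridge formula $\tilde{\beta}/(1+\lambda)$ is the orthonormal-design special case discussed above.

```python
import numpy as np

def ridge_shrink(b, lam):
    # Orthonormal-design ridge: uniform rescaling toward zero
    return b / (1.0 + lam)

def soft_threshold(b, lam):
    # Lasso: shrink by lam, exactly zero inside [-lam, lam]
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

def hard_threshold(b, lam):
    # Keep-or-kill: unchanged if |b| > lam, zero otherwise
    return b * (np.abs(b) > lam)

lam = 1.0
for b in [0.5, 2.0]:
    print(f"beta_ols={b}: ridge={ridge_shrink(b, lam):.2f}, "
          f"lasso={soft_threshold(b, lam):.2f}, hard={hard_threshold(b, lam):.2f}")
```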
Soft thresholding is the proximal operator of the L1 norm: $\text{prox}_{\lambda|\cdot|_1}(z) = S_\lambda(z)$. This connects Lasso to the broader theory of proximal methods, enabling algorithms like ISTA and FISTA that we'll explore in later pages.
Since the L1 norm is not differentiable everywhere, we cannot simply set the gradient to zero. Instead, we use the Karush-Kuhn-Tucker (KKT) conditions from convex optimization, which generalize first-order optimality conditions to non-smooth functions.
The Lasso Optimality Conditions:
A vector $\hat{\boldsymbol{\beta}}$ is optimal for the Lasso problem if and only if:
$$\frac{1}{n}\mathbf{X}^T(\mathbf{X}\hat{\boldsymbol{\beta}} - \mathbf{y}) + \lambda \hat{\mathbf{s}} = \mathbf{0}$$
where $\hat{\mathbf{s}} \in \partial|\hat{\boldsymbol{\beta}}|_1$ is a subgradient of the L1 norm at $\hat{\boldsymbol{\beta}}$.
Subgradient Components:
For each coordinate $j$:
$$\hat{s}_j \in \partial|\hat{\beta}_j| = \begin{cases} \{\text{sign}(\hat{\beta}_j)\} & \text{if } \hat{\beta}_j \neq 0 \\ [-1, 1] & \text{if } \hat{\beta}_j = 0 \end{cases}$$
Coordinate-wise KKT Conditions:
Let $\mathbf{r} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}$ be the residual vector. The optimality conditions per coordinate are:

$$\frac{1}{n}\mathbf{x}_j^T\mathbf{r} = \lambda\,\text{sign}(\hat{\beta}_j) \quad \text{if } \hat{\beta}_j \neq 0, \qquad \left|\frac{1}{n}\mathbf{x}_j^T\mathbf{r}\right| \leq \lambda \quad \text{if } \hat{\beta}_j = 0$$
Interpretation:
These conditions reveal the fundamental mechanism of Lasso sparsity:
Active Features ($\hat{\beta}_j \neq 0$): Must have residual correlations of exactly $\pm\lambda$. They are "maxed out" in their contribution.
Inactive Features ($\hat{\beta}_j = 0$): Have residual correlations strictly less than $\lambda$ in magnitude. They "can't compete" with the active features.
The regularization parameter $\lambda$ acts as a threshold: only features sufficiently correlated with the residual (after accounting for other features) earn non-zero coefficients.
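These conditions can be checked directly on a fitted model. The sketch below assumes scikit-learn, whose alpha plays the role of $\lambda$ in the $\frac{1}{2n}$-scaled objective used on this page; the data are synthetic, and fit_intercept=False with a tight tolerance so the KKT conditions hold to high precision.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 300, 20
X = rng.standard_normal((n, p))
beta_true = np.concatenate([[2.0, -1.0, 0.5], np.zeros(p - 3)])
y = X @ beta_true + 0.5 * rng.standard_normal(n)

lam = 0.2
fit = Lasso(alpha=lam, fit_intercept=False, tol=1e-10, max_iter=100000).fit(X, y)

r = y - X @ fit.coef_        # residual vector
corr = X.T @ r / n           # (1/n) x_j^T r for each feature j

active = fit.coef_ != 0
print("active |corr| (each ~= lambda):", np.round(np.abs(corr[active]), 4))
print("inactive max |corr| (<= lambda):", np.abs(corr[~active]).max().round(4))
```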
Finding the Critical $\lambda$:
At what $\lambda$ does a coefficient first become non-zero? When $\lambda$ is very large, all coefficients are zero. As we decrease $\lambda$, coefficients enter the model one by one.
The first feature to enter is the one most correlated with the response:
$$\lambda_{\max} = \frac{1}{n}|\mathbf{X}^T\mathbf{y}|_\infty = \max_j \left|\frac{1}{n}\mathbf{x}_j^T\mathbf{y}\right|$$
For $\lambda \geq \lambda_{\max}$, the solution is $\hat{\boldsymbol{\beta}} = \mathbf{0}$.
As $\lambda$ decreases from $\lambda_{\max}$ to $0$, coefficients enter (and sometimes exit) the model sequentially. The trajectory of coefficients as a function of $\lambda$ is called the Lasso path. Remarkably, between entrance/exit events, the paths are piecewise linear. The LARS algorithm exploits this structure for efficient path computation.
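Both claims are easy to verify numerically. Below is a sketch assuming scikit-learn; the data are synthetic and explicitly centered so the $\lambda_{\max}$ formula applies without an intercept term.

```python
import numpy as np
from sklearn.linear_model import Lasso, lasso_path

rng = np.random.default_rng(3)
n, p = 200, 15
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * rng.standard_normal(n)

# Center so the lambda_max formula applies directly, with no intercept
Xc = X - X.mean(axis=0)
yc = y - y.mean()

lam_max = np.max(np.abs(Xc.T @ yc)) / n
print("lambda_max =", round(lam_max, 4))

# At lambda >= lambda_max the Lasso solution is identically zero
fit = Lasso(alpha=lam_max, fit_intercept=False).fit(Xc, yc)
print("all coefficients zero at lambda_max:", np.allclose(fit.coef_, 0))

# As lambda decreases below lambda_max, features enter the model one by one
alphas, coefs, _ = lasso_path(Xc, yc, alphas=lam_max * np.array([1.0, 0.7, 0.4, 0.1, 0.01]))
for a, c in zip(alphas, coefs.T):
    print(f"lambda={a:8.4f}  active features={np.sum(c != 0)}")
```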
Proper application of Lasso requires careful attention to standardization and intercept handling. Neglecting these considerations leads to models that penalize features unequally based on arbitrary scaling choices.
Why Standardization Matters:
The L1 penalty $\lambda\sum_j|\beta_j|$ treats all coefficients equally. But if features have different scales, this equality is misleading: a feature recorded in small units needs a proportionally larger coefficient to express the same effect as one recorded in large units, and the penalty charges for that magnitude. Without standardization, the Lasso would heavily penalize such a feature simply because of its measurement units.
Standard Preprocessing:
Center the response so that $\bar{y} = 0$, center each feature column so that $\bar{x}_j = 0$, and scale each column to unit variance (or unit norm) so that all coefficients are penalized on a comparable scale.
Handling the Intercept:
The intercept $\beta_0$ should generally not be penalized. Penalizing the intercept shifts predictions away from the data's natural center, introducing unnecessary bias.
The standard approach: center $\mathbf{y}$ and the columns of $\mathbf{X}$, fit the Lasso on the centered data with no intercept term, and then recover the intercept afterwards as $\hat{\beta}_0 = \bar{y} - \bar{\mathbf{x}}^T\hat{\boldsymbol{\beta}}$.
Mathematical Justification:
With centered data, the gradient of the RSS with respect to $\beta_0$ is:
$$\frac{\partial}{\partial \beta_0} \sum_i (y_i - \beta_0 - \mathbf{x}_i^T\boldsymbol{\beta})^2 = -2\sum_i(y_i - \beta_0 - \mathbf{x}_i^T\boldsymbol{\beta})$$
Setting to zero: $\hat{\beta}_0 = \bar{y} - \bar{\mathbf{x}}^T\hat{\boldsymbol{\beta}}$
When $\bar{y} = 0$ and $\bar{\mathbf{x}} = \mathbf{0}$ (centered data), this gives $\hat{\beta}_0 = 0$.
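As a quick numeric sanity check (a sketch assuming scikit-learn, which centers internally when fit_intercept=True), the library's reported intercept should match $\bar{y} - \bar{\mathbf{x}}^T\hat{\boldsymbol{\beta}}$:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
X = rng.standard_normal((150, 8)) + 5.0        # deliberately uncentered features
y = 3.0 + X @ np.array([1.0, -2.0, 0, 0, 0, 0, 0, 0.5]) + 0.3 * rng.standard_normal(150)

fit = Lasso(alpha=0.1, fit_intercept=True, max_iter=10000).fit(X, y)

beta0_formula = y.mean() - X.mean(axis=0) @ fit.coef_   # beta_0 = y_bar - x_bar^T beta_hat
print("sklearn intercept_:  ", round(float(fit.intercept_), 6))
print("y_bar - x_bar^T beta:", round(float(beta0_formula), 6))  # should agree
```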
Implementations differ in how much of this they handle for you. scikit-learn's Lasso, for example, fits an unpenalized intercept when fit_intercept=True, but it does not standardize the features (its older 'normalize' option was deprecated and removed); scaling must be done explicitly, for example with StandardScaler in a Pipeline. Always verify whether your implementation centers and scales the data, and whether it penalizes the intercept. Mismatched assumptions lead to incorrect hyperparameter values and suboptimal models.
We have established the mathematical foundation of Lasso regression. Let's consolidate the key insights before moving to the geometric interpretation in the next page.
What's Next:
We've established why the L1 penalty produces different behavior than L2, but we haven't yet visualized how sparsity emerges geometrically. The next page provides the crucial geometric interpretation—showing how the diamond-shaped L1 constraint set interacts with elliptical level sets to produce sparse solutions at corners.
You now understand the L1 penalty formulation mathematically: the objective function structure, the properties of the L1 norm, contrast with L2 regularization, the constrained formulation, soft thresholding, and KKT optimality conditions. Next, we'll build geometric intuition for why Lasso produces sparse solutions.