In the previous module, we explored Ridge Regression—the elegant solution that tames overfitting by adding an L2 penalty that shrinks all coefficients toward zero. Ridge regression is remarkably effective: it stabilizes ill-conditioned problems, handles multicollinearity gracefully, and provides a smooth bias-variance tradeoff.
But ridge regression has a fundamental limitation: it never produces exactly zero coefficients. Every feature, no matter how irrelevant, retains some non-zero weight in the final model. When you have 1,000 potential features and suspect only 20 truly matter, ridge gives you a model with 1,000 non-zero coefficients—making interpretation difficult and suggesting false relationships.
This is where Lasso regression (Least Absolute Shrinkage and Selection Operator) enters the picture. Proposed by Robert Tibshirani in 1996, Lasso makes a seemingly small modification to the regularization term: replace the squared L2 penalty with an absolute value L1 penalty. This single change produces dramatically different behavior—Lasso doesn't just shrink coefficients; it eliminates them entirely.
By the end of this page, you will deeply understand the L1 penalty formulation, derive the Lasso objective function, comprehend the mathematical distinction between L1 and L2 norms, and build intuition for why absolute value penalties produce fundamentally different optimization landscapes than squared penalties.
Let us begin by precisely defining the Lasso objective. Consider the standard linear regression setting: a response vector $\mathbf{y} \in \mathbb{R}^n$, a design matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$ whose columns are the $p$ features, and a coefficient vector $\boldsymbol{\beta} \in \mathbb{R}^p$ to be estimated.
The Lasso Objective Function:
$$\hat{\boldsymbol{\beta}}^{\text{lasso}} = \underset{\boldsymbol{\beta}}{\arg\min} \left\{ \frac{1}{2n} |\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2 + \lambda |\boldsymbol{\beta}|_1 \right\}$$
Let's dissect each component of this formulation: the first term is the familiar least-squares loss (the residual sum of squares, scaled by $\frac{1}{2n}$ for convenience), the second term is the L1 penalty $\lambda|\boldsymbol{\beta}|_1$ that charges for total absolute coefficient magnitude, and $\lambda \geq 0$ is the tuning parameter that controls the strength of that charge.
Different sources use different scaling conventions. Some use $\frac{1}{2}|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2 + \lambda|\boldsymbol{\beta}|_1$ without the $\frac{1}{n}$ factor. Others parameterize as $|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2 + \alpha|\boldsymbol{\beta}|_1$ with $\alpha = 2n\lambda$. The qualitative behavior is identical; only the numeric value of the regularization parameter changes.
The L1 norm (also called the Manhattan norm, taxicab norm, or ℓ₁ norm) is a fundamental concept in linear algebra and analysis. For a vector $\mathbf{v} = (v_1, v_2, \ldots, v_p)^T$, the L1 norm is defined as:
$$|\mathbf{v}|_1 = \sum_{j=1}^{p} |v_j| = |v_1| + |v_2| + \cdots + |v_p|$$
The name "Manhattan distance" comes from the grid-like street layout of Manhattan: to travel between two intersections, you must walk along streets (horizontal and vertical), not diagonally through buildings.
Properties of the L1 Norm:
Like any norm, $|\cdot|_1$ is non-negative (and zero only for the zero vector), absolutely homogeneous ($|c\mathbf{v}|_1 = |c|\,|\mathbf{v}|_1$), and satisfies the triangle inequality ($|\mathbf{u} + \mathbf{v}|_1 \leq |\mathbf{u}|_1 + |\mathbf{v}|_1$); consequently it is convex. Unlike the squared L2 norm, however, it is not differentiable wherever any coordinate $v_j$ equals zero.
The Subdifferential at Zero:
The non-differentiability of $|v|$ at $v = 0$ is not a bug—it's the feature that enables sparsity. Let's examine this carefully using the concept of subdifferentials from convex analysis.
For a convex function $f$, the subdifferential at a point $x$ is the set of all slopes of lines that lie entirely below the graph of $f$ while touching it at $x$:
$$\partial f(x) = \{g : f(y) \geq f(x) + g(y - x) \text{ for all } y\}$$
For the absolute value function $f(v) = |v|$:
$$\partial |v| = \begin{cases} \{1\} & \text{if } v > 0 \\ \{-1\} & \text{if } v < 0 \\ [-1, 1] & \text{if } v = 0 \end{cases}$$
This interval $[-1, 1]$ at zero is crucial. It means that any subgradient between $-1$ and $1$ is valid at the origin. This "slack" allows the optimization to settle at exactly zero when the gradient of the data term is small enough.
In ridge regression, the gradient of $v^2$ is $2v$, which approaches zero continuously as $v \to 0$. The optimal solution reaches $v = 0$ only asymptotically. But with $|v|$, the subgradient at zero is an entire interval, meaning zero can be an optimal solution even when the data term's gradient isn't exactly zero. This is the mathematical foundation of Lasso's sparsity.
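This contrast can be seen numerically with a scalar toy problem (a sketch, not part of the original text): minimize $\frac{1}{2}(v - z)^2 + \lambda\,p(v)$ over a fine grid for both penalties $p(v) = |v|$ and $p(v) = v^2$. With a weak data pull $z$, the L1 minimizer lands exactly at zero while the L2 minimizer is merely small.

```python
import numpy as np

z = 0.4          # the "data term" pulls v toward 0.4
lam = 1.0        # regularization strength
v = np.linspace(-2, 2, 400001)          # fine grid that includes v = 0

l1_obj = 0.5 * (v - z) ** 2 + lam * np.abs(v)   # Lasso-style scalar objective
l2_obj = 0.5 * (v - z) ** 2 + lam * v ** 2      # Ridge-style scalar objective

v_l1 = v[np.argmin(l1_obj)]
v_l2 = v[np.argmin(l2_obj)]

print(f"L1 minimizer: {v_l1:.4f}")   # 0.0000 exactly, since |z| = 0.4 <= lam
print(f"L2 minimizer: {v_l2:.4f}")   # ~0.1333 = z / (1 + 2*lam), small but nonzero
```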
To appreciate the Lasso's unique behavior, we must contrast it mathematically with ridge regression. Both methods add a penalty to ordinary least squares, but the nature of that penalty changes everything.
Side-by-Side Comparison:
| Property | Ridge (L2) | Lasso (L1) |
|---|---|---|
| Penalty Term | $\lambda \sum_{j=1}^p \beta_j^2 = \lambda|\boldsymbol{\beta}|_2^2$ | $\lambda \sum_{j=1}^p |\beta_j| = \lambda|\boldsymbol{\beta}|_1$ |
| Penalty Scaling | Quadratic in $|\beta_j|$ | Linear in $|\beta_j|$ |
| Closed-Form Solution | Yes: $(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$ | No (except in orthogonal design) |
| Differentiability | Everywhere differentiable | Not differentiable at $\beta_j = 0$ |
| Sparsity | No exact zeros | Produces exact zeros |
| Feature Selection | Includes all features | Automatic feature selection |
| Correlated Features | Spreads weight evenly | Tends to select one arbitrarily |
| Constraint Set Shape | Sphere/ellipsoid | Diamond/rhombus |
The Penalty Behavior Near Zero:
The fundamental difference emerges when we examine how each penalty behaves for small coefficient values.
Consider moving a coefficient from $\beta_j = 0.01$ to $\beta_j = 0$: the L1 penalty decreases by $\lambda \cdot 0.01$, while the L2 penalty decreases by only $\lambda \cdot 0.01^2 = \lambda \cdot 0.0001$, one hundred times less.
This asymmetry is profound. The L1 penalty gives the same "credit" for reducing $|\beta_j|$ from $0.01$ to $0$ as it does for reducing from $100.01$ to $100$. In contrast, L2 provides nearly zero incentive to shrink already-small coefficients to exactly zero.
Mathematically: for $\beta_j > 0$, the L1 penalty contributes a constant derivative $\frac{\partial}{\partial \beta_j}\lambda|\beta_j| = \lambda$, whereas the L2 penalty contributes $\frac{\partial}{\partial \beta_j}\lambda\beta_j^2 = 2\lambda\beta_j$, which vanishes as $\beta_j \to 0$.
The derivative of the L1 penalty doesn't diminish as coefficients approach zero—it maintains constant pressure to drive coefficients all the way to zero.
Lasso's sparsity is not free. The loss of differentiability means we cannot use standard gradient descent. The arbitrary selection among correlated features can be unstable. And when true underlying signals are distributed across many features, Lasso may underperform ridge. Understanding these tradeoffs is essential for proper application.
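The sparsity contrast is easy to see empirically. The following is a minimal sketch, assuming scikit-learn is available; the sample size, feature count, true coefficients, and alpha values are illustrative choices, not from the text.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p, k = 200, 50, 5                      # samples, features, truly relevant features

X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:k] = [3.0, -2.0, 1.5, -1.0, 2.5]
y = X @ beta_true + 0.5 * rng.standard_normal(n)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1, max_iter=10000).fit(X, y)

# Ridge typically keeps every coefficient nonzero; Lasso zeroes most of them,
# concentrating weight on (roughly) the k informative features.
print("Ridge nonzero coefficients:", np.sum(ridge.coef_ != 0))
print("Lasso nonzero coefficients:", np.sum(lasso.coef_ != 0))
```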
The Lasso can be equivalently expressed as a constrained optimization problem. This formulation provides complementary geometric intuition and connects to the broader theory of regularization.
Constrained Form:
$$\hat{\boldsymbol{\beta}}^{\text{lasso}} = \underset{\boldsymbol{\beta}}{\arg\min} \left\{ \frac{1}{2n}|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2 \right\} \quad \text{subject to} \quad |\boldsymbol{\beta}|_1 \leq t$$
Equivalence via Lagrange Duality:
For every value of the regularization parameter $\lambda \geq 0$, there exists a corresponding constraint bound $t \geq 0$ such that the solutions are identical. The relationship is governed by Lagrange duality: the constrained problem's Lagrangian is

$$\mathcal{L}(\boldsymbol{\beta}, \lambda) = \frac{1}{2n}|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2 + \lambda\left(|\boldsymbol{\beta}|_1 - t\right),$$

and minimizing it over $\boldsymbol{\beta}$ at the optimal multiplier $\lambda$ recovers the penalized form, since the constant $-\lambda t$ does not affect the minimizer. Conversely, the bound matching a given $\lambda$ is $t = |\hat{\boldsymbol{\beta}}^{\text{lasso}}(\lambda)|_1$.
Interpreting the Constraint $|\boldsymbol{\beta}|_1 \leq t$:
This constraint defines an L1 ball in coefficient space—the set of all coefficient vectors whose L1 norm is at most $t$. The geometric shape of this ball is crucial: in two dimensions it is a diamond (a square rotated 45°) with vertices on the coordinate axes at $(\pm t, 0)$ and $(0, \pm t)$; in $p$ dimensions it is a cross-polytope whose corners sit where all but one coordinate is exactly zero, and whose edges and faces likewise contain many zero coordinates.
Why Corners Matter for Sparsity:
Optimization seeks to minimize the loss function while staying within the constraint set. Geometrically, we expand contour ellipses of the loss until they just touch the constraint boundary.
For L2 constraints (a sphere), the touching point can occur anywhere on the smooth boundary. For L1 constraints (a diamond), the touching point tends to occur at corners—and corners are where coordinates equal zero.
This is the geometric revelation: the sharp corners of the L1 ball "catch" the expanding loss contours, naturally producing solutions on coordinate axes (sparse solutions).
Think of $t$ as a "complexity budget." You have a fixed total of absolute coefficient magnitude to spend. Placing weight on feature $j$ (making $|\beta_j| > 0$) costs you from this budget. The Lasso finds the best allocation, and often the optimal strategy is to concentrate budget on a few important features rather than spreading it thin across many.
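To make the budget picture concrete, here is a hedged sketch (synthetic data, scikit-learn, arbitrary alpha values): for each $\lambda$, the "budget actually spent" is the L1 norm of the fitted coefficients, which is also the matching constraint bound $t$; it shrinks, along with the number of active features, as $\lambda$ grows.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 30))
beta_true = np.concatenate([[2.0, -1.5, 1.0], np.zeros(27)])
y = X @ beta_true + 0.3 * rng.standard_normal(200)

for alpha in [1.0, 0.3, 0.1, 0.03]:
    fit = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    t_implied = np.abs(fit.coef_).sum()      # the L1 "budget" actually spent
    n_active = np.sum(fit.coef_ != 0)
    print(f"lambda={alpha:<5}  implied t={t_implied:6.3f}  active features={n_active}")
```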
While Lasso lacks a closed-form solution in general, there is one special case where the solution has a beautiful explicit form: orthonormal design matrices.
Setup: Orthonormal Design
Assume $\mathbf{X}$ has orthonormal columns: $\mathbf{X}^T\mathbf{X} = \mathbf{I}$. This is achievable through QR decomposition or when features are uncorrelated and standardized.
Let $\tilde{\boldsymbol{\beta}}^{\text{OLS}} = \mathbf{X}^T\mathbf{y}$ be the OLS solution (which equals $(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \mathbf{X}^T\mathbf{y}$ under orthonormality).
The Soft Thresholding Operator:
For orthonormal $\mathbf{X}$, the Lasso solution is:
$$\hat{\beta}_j^{\text{lasso}} = S_{\lambda}(\tilde{\beta}_j^{\text{OLS}}) = \text{sign}(\tilde{\beta}_j^{\text{OLS}}) \cdot \max(|\tilde{\beta}_j^{\text{OLS}}| - \lambda, 0)$$
This is the soft thresholding function (also called the shrinkage operator), often denoted $S_\lambda(z)$.
```python
import numpy as np

def soft_threshold(z, lambda_):
    """
    Soft thresholding operator (proximal operator of L1 norm).

    For scalar z: S_λ(z) = sign(z) * max(|z| - λ, 0)

    This shrinks z toward zero by λ, setting to exactly zero if |z| ≤ λ.

    Parameters
    ----------
    z : float or np.ndarray
        Input value(s) to threshold
    lambda_ : float
        Threshold parameter (λ ≥ 0)

    Returns
    -------
    float or np.ndarray
        Soft-thresholded value(s)
    """
    return np.sign(z) * np.maximum(np.abs(z) - lambda_, 0)

# Demonstration
z_values = np.linspace(-3, 3, 7)
lambda_ = 1.0

print("Soft Thresholding with λ = 1.0:")
print("-" * 40)
for z in z_values:
    result = soft_threshold(z, lambda_)
    print(f"S_λ({z:+.1f}) = {result:+.2f}")
```

Properties of Soft Thresholding:
Sparsity Zone: If $|\tilde{\beta}_j^{\text{OLS}}| \leq \lambda$, then $\hat{\beta}_j^{\text{lasso}} = 0$ (coefficient is eliminated)
Uniform Shrinkage: If $|\tilde{\beta}_j^{\text{OLS}}| > \lambda$, the coefficient is shrunk by exactly $\lambda$ in absolute value
Continuity: Unlike hard thresholding (which jumps from zero to non-zero), soft thresholding is continuous
Bias: Large coefficients are shrunk by $\lambda$, introducing bias for truly important features
Comparison: Soft vs. Hard vs. Ridge
| Method | Transformation | At $|\tilde{\beta}|=0.5$, $\lambda=1$ | At $|\tilde{\beta}|=2$, $\lambda=1$ |
|---|---|---|---|
| Ridge (L2) | $\frac{\tilde{\beta}}{1+\lambda}$ | $0.25$ | $1.0$ |
| Lasso (L1) | $\text{sign}(\tilde{\beta})\max(|\tilde{\beta}|-\lambda, 0)$ | $0$ (eliminated!) | $1.0$ |
| Hard Threshold | $\tilde{\beta} \cdot \mathbf{1}_{|\tilde{\beta}|>\lambda}$ | $0$ (eliminated) | $2.0$ (unchanged) |
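The table's entries can be reproduced with a few lines. This small sketch implements the three rules; the ridge formula $\tilde{\beta}/(1+\lambda)$ is the orthonormal-design special case discussed above.

```python
import numpy as np

def ridge_shrink(b, lam):
    # Orthonormal-design ridge: uniform rescaling toward zero
    return b / (1.0 + lam)

def soft_threshold(b, lam):
    # Lasso: shrink by lam, exactly zero inside [-lam, lam]
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

def hard_threshold(b, lam):
    # Keep-or-kill: unchanged if |b| > lam, zero otherwise
    return b * (np.abs(b) > lam)

lam = 1.0
for b in [0.5, 2.0]:
    print(f"beta_ols={b}: ridge={ridge_shrink(b, lam):.2f}, "
          f"lasso={soft_threshold(b, lam):.2f}, hard={hard_threshold(b, lam):.2f}")
```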
Soft thresholding is the proximal operator of the L1 norm: $\text{prox}_{\lambda|\cdot|_1}(z) = S_\lambda(z)$. This connects Lasso to the broader theory of proximal methods, enabling algorithms like ISTA and FISTA that we'll explore in later pages.
Since the L1 norm is not differentiable everywhere, we cannot simply set the gradient to zero. Instead, we use the Karush-Kuhn-Tucker (KKT) conditions from convex optimization, which generalize first-order optimality conditions to non-smooth functions.
The Lasso Optimality Conditions:
A vector $\hat{\boldsymbol{\beta}}$ is optimal for the Lasso problem if and only if:
$$\frac{1}{n}\mathbf{X}^T(\mathbf{X}\hat{\boldsymbol{\beta}} - \mathbf{y}) + \lambda \hat{\mathbf{s}} = \mathbf{0}$$
where $\hat{\mathbf{s}} \in \partial|\hat{\boldsymbol{\beta}}|_1$ is a subgradient of the L1 norm at $\hat{\boldsymbol{\beta}}$.
Subgradient Components:
For each coordinate $j$:
$$\hat{s}_j \in \partial|\hat{\beta}_j| = \begin{cases} \{\text{sign}(\hat{\beta}_j)\} & \text{if } \hat{\beta}_j \neq 0 \\ [-1, 1] & \text{if } \hat{\beta}_j = 0 \end{cases}$$
Coordinate-wise KKT Conditions:
Let $\mathbf{r} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}$ be the residual vector. The optimality conditions per coordinate are:

$$\frac{1}{n}\mathbf{x}_j^T\mathbf{r} = \lambda\,\text{sign}(\hat{\beta}_j) \quad \text{if } \hat{\beta}_j \neq 0, \qquad \left|\frac{1}{n}\mathbf{x}_j^T\mathbf{r}\right| \leq \lambda \quad \text{if } \hat{\beta}_j = 0$$
Interpretation:
These conditions reveal the fundamental mechanism of Lasso sparsity:
Active Features ($\hat{\beta}_j \neq 0$): Must have residual correlations of exactly $\pm\lambda$. They are "maxed out" in their contribution.
Inactive Features ($\hat{\beta}_j = 0$): Have residual correlations strictly less than $\lambda$ in magnitude. They "can't compete" with the active features.
The regularization parameter $\lambda$ acts as a threshold: only features sufficiently correlated with the residual (after accounting for other features) earn non-zero coefficients.
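These conditions can be checked directly on a fitted model. The sketch below assumes scikit-learn, whose alpha plays the role of $\lambda$ in the $\frac{1}{2n}$-scaled objective used on this page; the data are synthetic, and fit_intercept=False with a tight tolerance so the KKT conditions hold to high precision.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 300, 20
X = rng.standard_normal((n, p))
beta_true = np.concatenate([[2.0, -1.0, 0.5], np.zeros(p - 3)])
y = X @ beta_true + 0.5 * rng.standard_normal(n)

lam = 0.2
fit = Lasso(alpha=lam, fit_intercept=False, tol=1e-10, max_iter=100000).fit(X, y)

r = y - X @ fit.coef_        # residual vector
corr = X.T @ r / n           # (1/n) x_j^T r for each feature j

active = fit.coef_ != 0
print("active |corr| (each ~= lambda):", np.round(np.abs(corr[active]), 4))
print("inactive max |corr| (<= lambda):", np.abs(corr[~active]).max().round(4))
```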
Finding the Critical $\lambda$:
At what $\lambda$ does a coefficient first become non-zero? When $\lambda$ is very large, all coefficients are zero. As we decrease $\lambda$, coefficients enter the model one by one.
The first feature to enter is the one most correlated with the response:
$$\lambda_{\max} = \frac{1}{n}|\mathbf{X}^T\mathbf{y}|_\infty = \max_j \left|\frac{1}{n}\mathbf{x}_j^T\mathbf{y}\right|$$
For $\lambda \geq \lambda_{\max}$, the solution is $\hat{\boldsymbol{\beta}} = \mathbf{0}$.
As $\lambda$ decreases from $\lambda_{\max}$ to $0$, coefficients enter (and sometimes exit) the model sequentially. The trajectory of coefficients as a function of $\lambda$ is called the Lasso path. Remarkably, between entrance/exit events, the paths are piecewise linear. The LARS algorithm exploits this structure for efficient path computation.
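Both claims are easy to verify numerically. Below is a sketch assuming scikit-learn; the data are synthetic and explicitly centered so the $\lambda_{\max}$ formula applies without an intercept term.

```python
import numpy as np
from sklearn.linear_model import Lasso, lasso_path

rng = np.random.default_rng(3)
n, p = 200, 15
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * rng.standard_normal(n)

# Center so the lambda_max formula applies directly, with no intercept
Xc = X - X.mean(axis=0)
yc = y - y.mean()

lam_max = np.max(np.abs(Xc.T @ yc)) / n
print("lambda_max =", round(lam_max, 4))

# At lambda >= lambda_max the Lasso solution is identically zero
fit = Lasso(alpha=lam_max, fit_intercept=False).fit(Xc, yc)
print("all coefficients zero at lambda_max:", np.allclose(fit.coef_, 0))

# As lambda decreases below lambda_max, features enter the model one by one
alphas, coefs, _ = lasso_path(Xc, yc, alphas=lam_max * np.array([1.0, 0.7, 0.4, 0.1, 0.01]))
for a, c in zip(alphas, coefs.T):
    print(f"lambda={a:8.4f}  active features={np.sum(c != 0)}")
```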
Proper application of Lasso requires careful attention to standardization and intercept handling. Neglecting these considerations leads to models that penalize features unequally based on arbitrary scaling choices.
Why Standardization Matters:
The L1 penalty $\lambda\sum_j|\beta_j|$ treats all coefficients equally. But if features have different scales, this equality is misleading: a feature recorded in small units needs a proportionally larger coefficient to express the same effect as one recorded in large units, and the penalty charges for that magnitude. Without standardization, the Lasso would heavily penalize such a feature simply because of its measurement units.
Standard Preprocessing:
Center the response so that $\bar{y} = 0$, center each feature column so that $\bar{x}_j = 0$, and scale each column to unit variance (or unit norm) so that all coefficients are penalized on a comparable scale.
Handling the Intercept:
The intercept $\beta_0$ should generally not be penalized. Penalizing the intercept shifts predictions away from the data's natural center, introducing unnecessary bias.
The standard approach: center $\mathbf{y}$ and the columns of $\mathbf{X}$, fit the Lasso on the centered data with no intercept term, and then recover the intercept afterwards as $\hat{\beta}_0 = \bar{y} - \bar{\mathbf{x}}^T\hat{\boldsymbol{\beta}}$.
Mathematical Justification:
With centered data, the gradient of the RSS with respect to $\beta_0$ is:
$$\frac{\partial}{\partial \beta_0} \sum_i (y_i - \beta_0 - \mathbf{x}_i^T\boldsymbol{\beta})^2 = -2\sum_i(y_i - \beta_0 - \mathbf{x}_i^T\boldsymbol{\beta})$$
Setting to zero: $\hat{\beta}_0 = \bar{y} - \bar{\mathbf{x}}^T\hat{\boldsymbol{\beta}}$
When $\bar{y} = 0$ and $\bar{\mathbf{x}} = \mathbf{0}$ (centered data), this gives $\hat{\beta}_0 = 0$.
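As a quick numeric sanity check (a sketch assuming scikit-learn, which centers internally when fit_intercept=True), the library's reported intercept should match $\bar{y} - \bar{\mathbf{x}}^T\hat{\boldsymbol{\beta}}$:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
X = rng.standard_normal((150, 8)) + 5.0        # deliberately uncentered features
y = 3.0 + X @ np.array([1.0, -2.0, 0, 0, 0, 0, 0, 0.5]) + 0.3 * rng.standard_normal(150)

fit = Lasso(alpha=0.1, fit_intercept=True, max_iter=10000).fit(X, y)

beta0_formula = y.mean() - X.mean(axis=0) @ fit.coef_   # beta_0 = y_bar - x_bar^T beta_hat
print("sklearn intercept_:  ", round(float(fit.intercept_), 6))
print("y_bar - x_bar^T beta:", round(float(beta0_formula), 6))  # should agree
```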
Implementations differ in how much of this they handle for you. scikit-learn's Lasso, for example, fits an unpenalized intercept when fit_intercept=True, but it does not standardize the features (its older 'normalize' option was deprecated and removed); scaling must be done explicitly, for example with StandardScaler in a Pipeline. Always verify whether your implementation centers and scales the data, and whether it penalizes the intercept. Mismatched assumptions lead to incorrect hyperparameter values and suboptimal models.
We have established the mathematical foundation of Lasso regression. Let's consolidate the key insights before moving to the geometric interpretation in the next page.
What's Next:
We've established why the L1 penalty produces different behavior than L2, but we haven't yet visualized how sparsity emerges geometrically. The next page provides the crucial geometric interpretation—showing how the diamond-shaped L1 constraint set interacts with elliptical level sets to produce sparse solutions at corners.
You now understand the L1 penalty formulation mathematically: the objective function structure, the properties of the L1 norm, contrast with L2 regularization, the constrained formulation, soft thresholding, and KKT optimality conditions. Next, we'll build geometric intuition for why Lasso produces sparse solutions.