Mathematics provides the precision, but geometry provides the intuition. The geometric interpretation of Lasso versus Ridge regularization is one of the most illuminating visualizations in all of machine learning—it makes immediately clear why L1 produces sparsity while L2 does not.
This page develops that geometric understanding systematically: we will visualize the L1 and L2 constraint regions, see how the loss contours make first contact with them, follow the solution path as the constraint loosens, extend the picture to high dimensions, and connect the geometry to the penalized/constrained duality and to algorithm design.
By the end, you will have a mental picture that lets you reason about regularization geometrically—an intuition that transfers to many other constrained optimization problems.
This page provides visual and geometric intuition for L1 sparsity. You will understand why diamond corners attract solutions, how different constraint strengths affect solutions, and how the geometry generalizes to high dimensions where visualization fails but geometric intuition still applies.
Let's begin by visualizing the constraint regions for L1 and L2 regularization in two dimensions. Consider coefficients $(\beta_1, \beta_2)$.
The L2 Constraint (Ridge):
$$\beta_1^2 + \beta_2^2 \leq t^2$$
This defines a disk of radius $t$ centered at the origin. Its boundary is a circle—perfectly smooth with no corners or edges. Every point on the boundary has a unique normal direction.
The L1 Constraint (Lasso):
$$|\beta_1| + |\beta_2| \leq t$$
This defines a diamond (rotated square) with vertices at $(t, 0)$, $(-t, 0)$, $(0, t)$, $(0, -t)$. The boundary consists of four straight edges meeting at four corners. Corners lie exactly on the coordinate axes.
Key Geometric Properties:
| Property | L2 (Circle) | L1 (Diamond) |
|---|---|---|
| Shape | Smooth, round | Polyhedral, angular |
| Corners | None (continuously curved) | 4 vertices on axes |
| Edges | None (continuous boundary) | 4 straight edges |
| Normal directions | Infinite (radial) | 4 edge normals + corner normals |
| Convexity | Strictly convex | Convex but not strictly |
| Extreme points | Every boundary point | Only 4 vertices |
| Axis intersections | At distance $t$ | At distance $t$ (vertices) |
Why Corner Geometry Matters:
The crucial difference is the presence of corners (vertices) in the L1 constraint set. Corners are points where the boundary is not smooth—the normal direction is not uniquely defined.
At a corner, multiple normal directions are valid. This geometric property translates to the subdifferential being an interval rather than a single point at $\beta_j = 0$. The optimization can "rest" at a corner because there's no unique direction to move that decreases both the constraint violation and objective.
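To connect the picture to the algebra, recall the subdifferential of the absolute value (a standard fact, restated here for reference):

$$\partial|\beta_j| = \begin{cases} \{\operatorname{sign}(\beta_j)\} & \text{if } \beta_j \neq 0, \\ [-1, 1] & \text{if } \beta_j = 0. \end{cases}$$

The interval at zero is the analytic counterpart of the corner: a whole range of loss gradients can be absorbed there, so the optimum can sit exactly at $\beta_j = 0$.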
Higher-Dimensional Constraint Sets:
In $p$ dimensions, the L2 ball $\{\boldsymbol{\beta} : \|\boldsymbol{\beta}\|_2 \leq t\}$ is a solid hypersphere with a smooth boundary, while the L1 ball $\{\boldsymbol{\beta} : \|\boldsymbol{\beta}\|_1 \leq t\}$ is a cross-polytope.
The cross-polytope's vertices are $\{\pm t \cdot \mathbf{e}_j\}_{j=1}^p$, where $\mathbf{e}_j$ is the $j$-th standard basis vector. These vertices are the most sparse points—they have only one non-zero coordinate.
In the L1 polytope, vertices have one non-zero coordinate (maximally sparse), points on edges have two, points on two-dimensional faces have three, and so on; the interior imposes no sparsity at all. Constrained optimization naturally favors lower-dimensional faces (sparser solutions) because they have more directions available to "catch" the expanding objective.
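As a quick illustration (a minimal sketch, not tied to any dataset), the cross-polytope's vertices can be enumerated directly and each one is maximally sparse:

```python
import numpy as np

# Vertices of the L1 ball (cross-polytope) of radius t in R^p:
# {+t*e_j, -t*e_j} for each standard basis vector e_j.
p, t = 3, 1.0
vertices = np.vstack([s * t * np.eye(p) for s in (+1, -1)])  # shape (2p, p)

for v in vertices:
    print(v, "non-zeros:", np.count_nonzero(v))
# Every vertex has exactly one non-zero coordinate (maximal sparsity),
# while a generic interior point of the ball has all p coordinates non-zero.
```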
To understand the constrained optimization geometry, we need to visualize the level sets (contour curves) of the loss function.
The Ordinary Least Squares Loss:
$$\mathcal{L}(\boldsymbol{\beta}) = \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 = (\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}^{\text{OLS}})^T \mathbf{X}^T\mathbf{X} (\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}^{\text{OLS}}) + \text{const}$$
This is a quadratic form centered at the OLS solution $\hat{\boldsymbol{\beta}}^{\text{OLS}}$.
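Why the identity holds (one line of algebra): write $\mathbf{y} - \mathbf{X}\boldsymbol{\beta} = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}^{\text{OLS}}) + \mathbf{X}(\hat{\boldsymbol{\beta}}^{\text{OLS}} - \boldsymbol{\beta})$; the normal equations $\mathbf{X}^T(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}^{\text{OLS}}) = \mathbf{0}$ make the cross term vanish, leaving

$$\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 = \underbrace{\|\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}^{\text{OLS}}\|_2^2}_{\text{const}} + (\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}^{\text{OLS}})^T \mathbf{X}^T\mathbf{X} (\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}^{\text{OLS}})$$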
Level Set Geometry:
The level sets $\{\boldsymbol{\beta} : \mathcal{L}(\boldsymbol{\beta}) = c\}$ are ellipsoids (ellipses in 2D) centered at $\hat{\boldsymbol{\beta}}^{\text{OLS}}$. Their axes align with the eigenvectors of $\mathbf{X}^T\mathbf{X}$, and they are elongated along directions with small eigenvalues—the directions the data constrain weakly.
Special Cases: if $\mathbf{X}^T\mathbf{X} = \mathbf{I}$ (orthonormal features), the level sets are circles centered at the OLS solution; correlated features tilt and elongate the ellipses, as in the example below.
```python
import numpy as np
import matplotlib.pyplot as plt


def plot_constraint_and_loss(X, y, t_l1=1.0, t_l2=1.0, figsize=(14, 5)):
    """
    Visualize L1 and L2 constraint regions with loss contours.

    Parameters
    ----------
    X : ndarray of shape (n, 2)
        Design matrix (2 features for visualization)
    y : ndarray of shape (n,)
        Response vector
    t_l1 : float
        L1 constraint bound
    t_l2 : float
        L2 constraint bound (radius)
    """
    # Compute OLS solution
    beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

    # Create grid for contours
    beta_range = max(abs(beta_ols).max() * 1.5, t_l1 * 1.5, t_l2 * 1.5)
    b1 = np.linspace(-beta_range, beta_range, 200)
    b2 = np.linspace(-beta_range, beta_range, 200)
    B1, B2 = np.meshgrid(b1, b2)

    # Compute loss at each grid point
    Loss = np.zeros_like(B1)
    for i in range(len(b1)):
        for j in range(len(b2)):
            beta = np.array([B1[i, j], B2[i, j]])
            Loss[i, j] = np.sum((y - X @ beta) ** 2)

    fig, axes = plt.subplots(1, 3, figsize=figsize)

    # Plot 1: L2 constraint (Ridge)
    ax = axes[0]
    ax.contour(B1, B2, Loss, levels=15, cmap='coolwarm', alpha=0.7)
    theta = np.linspace(0, 2*np.pi, 100)
    ax.fill(t_l2*np.cos(theta), t_l2*np.sin(theta), alpha=0.3, color='blue',
            label='L2 constraint')
    ax.plot(t_l2*np.cos(theta), t_l2*np.sin(theta), 'b-', linewidth=2)
    ax.plot(*beta_ols, 'r*', markersize=15, label='OLS solution')
    ax.axhline(0, color='gray', linestyle='--', alpha=0.5)
    ax.axvline(0, color='gray', linestyle='--', alpha=0.5)
    ax.set_xlabel(r'$\beta_1$')
    ax.set_ylabel(r'$\beta_2$')
    ax.set_title('Ridge (L2): Circular Constraint')
    ax.legend()
    ax.set_aspect('equal')

    # Plot 2: L1 constraint (Lasso)
    ax = axes[1]
    ax.contour(B1, B2, Loss, levels=15, cmap='coolwarm', alpha=0.7)
    diamond = np.array([[t_l1, 0], [0, t_l1], [-t_l1, 0], [0, -t_l1], [t_l1, 0]])
    ax.fill(diamond[:, 0], diamond[:, 1], alpha=0.3, color='green',
            label='L1 constraint')
    ax.plot(diamond[:, 0], diamond[:, 1], 'g-', linewidth=2)
    ax.plot(*beta_ols, 'r*', markersize=15, label='OLS solution')
    ax.axhline(0, color='gray', linestyle='--', alpha=0.5)
    ax.axvline(0, color='gray', linestyle='--', alpha=0.5)
    ax.set_xlabel(r'$\beta_1$')
    ax.set_ylabel(r'$\beta_2$')
    ax.set_title('Lasso (L1): Diamond Constraint')
    ax.legend()
    ax.set_aspect('equal')

    # Plot 3: Both overlaid
    ax = axes[2]
    ax.contour(B1, B2, Loss, levels=15, cmap='coolwarm', alpha=0.7)
    ax.fill(t_l2*np.cos(theta), t_l2*np.sin(theta), alpha=0.2, color='blue', label='L2')
    ax.plot(t_l2*np.cos(theta), t_l2*np.sin(theta), 'b-', linewidth=2)
    ax.fill(diamond[:, 0], diamond[:, 1], alpha=0.2, color='green', label='L1')
    ax.plot(diamond[:, 0], diamond[:, 1], 'g-', linewidth=2)
    ax.plot(*beta_ols, 'r*', markersize=15, label='OLS')
    ax.axhline(0, color='gray', linestyle='--', alpha=0.5)
    ax.axvline(0, color='gray', linestyle='--', alpha=0.5)
    ax.set_xlabel(r'$\beta_1$')
    ax.set_ylabel(r'$\beta_2$')
    ax.set_title('L1 vs L2: Shape Comparison')
    ax.legend()
    ax.set_aspect('equal')

    plt.tight_layout()
    return fig


# Example usage
np.random.seed(42)
n = 50
X = np.random.randn(n, 2)
X[:, 1] = X[:, 0] * 0.8 + np.random.randn(n) * 0.3  # Correlated features
y = 2 * X[:, 0] + 0.5 * X[:, 1] + np.random.randn(n) * 0.5

# Visualization shows elongated ellipses due to correlation
# Note how L1 diamond corners align with axes (sparse directions)
# while L2 circle has no preferred sparse directions
fig = plot_constraint_and_loss(X, y)
```

The orientation of the loss ellipses relative to the constraint sets determines where contact occurs. With correlated features, the ellipses are tilted, potentially making corner contact more or less likely depending on the tilt direction relative to the diamond's geometry.
The constrained optimization problem seeks the point within the constraint region that minimizes the loss. Geometrically, we find where the smallest possible loss ellipse contacts the constraint boundary.
The Constrained Optimization View:
Starting from the OLS solution, we expand level set ellipses outward (increasing loss) until the first contact with the constraint boundary. This contact point is the constrained optimum.
For L2 (Circle) Constraint:
The first contact typically occurs at a point where the ellipse is tangent to the circle. Tangency means:
$$\nabla \mathcal{L}(\boldsymbol{\beta}^*) + \mu \nabla g(\boldsymbol{\beta}^*) = \mathbf{0}$$
where $g(\boldsymbol{\beta}) = \|\boldsymbol{\beta}\|_2^2 - t^2$ is the constraint function and $\mu > 0$ is the Lagrange multiplier.
Since the circle is smooth everywhere, this tangency condition has no special preference for axis-aligned points. The contact point generically has both coordinates non-zero.
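A quick empirical check of this contrast (a minimal sketch using scikit-learn's `Ridge` and `Lasso`; the regularization strengths are arbitrary illustrative choices, and the exact coefficient values depend on the random draw):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 100, 8
X = rng.standard_normal((n, p))
beta_true = np.array([3.0, 0.0, -2.0, 0.0, 0.0, 1.5, 0.0, 0.0])
y = X @ beta_true + 0.5 * rng.standard_normal(n)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.3).fit(X, y)

print("Ridge coefficients:", np.round(ridge.coef_, 3))  # small but non-zero everywhere
print("Lasso coefficients:", np.round(lasso.coef_, 3))  # several exact zeros
print("Ridge exact zeros:", np.sum(ridge.coef_ == 0.0))
print("Lasso exact zeros:", np.sum(lasso.coef_ == 0.0))
```

Ridge shrinks every coefficient toward zero but (generically) never to exactly zero; Lasso zeroes several out, which is the corner contact in action.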
For L1 (Diamond) Constraint:
The first contact can occur either in the interior of one of the four flat edges (both coefficients non-zero) or at one of the four corners (one coefficient exactly zero).
Why Corners are Preferred:
Consider expanding an ellipse outward from the OLS solution. The ellipse makes first contact with the constraint region at the boundary point of smallest loss. For the diamond, the corners protrude along the coordinate axes, so for a wide range of OLS positions and ellipse orientations the first point reached is a corner rather than the interior of an edge.
Mathematical Argument:
The normal cone at a corner contains all directions pointing "outward" from the corner. At the vertex $(t, 0)$:
$$N_{(t,0)} = \{(\alpha, \beta) : \alpha \geq |\beta|\}$$
This is a large cone—it spans 90 degrees in 2D (45 degrees to either side of the axis). For the negative loss gradient to lie in this cone (the optimality condition), there is significant "room": many possible gradient directions work.
At a point in the interior of an edge, the normal cone is a single ray (the edge normal). That point is optimal only when the negative loss gradient aligns exactly with this one direction—for any fixed edge point, a measure-zero event in the space of gradient directions.
Probability Interpretation:
If we randomly generate problems (random $\mathbf{X}$, $\mathbf{y}$), solutions with at least one coefficient exactly zero (corner contact) occur with substantial probability, and that probability grows as the constraint tightens. Under the smooth L2 constraint, by contrast, an exactly zero coefficient is a measure-zero event.
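A small simulation makes this concrete (a hedged sketch: it uses scikit-learn's penalized `Lasso` and `Ridge` as stand-ins for the constrained problems, with arbitrary regularization strengths, so the exact frequencies depend on how the random problems are generated):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n_trials, n = 500, 40
lasso_corner, ridge_zero = 0, 0

for _ in range(n_trials):
    X = rng.standard_normal((n, 2))
    beta = rng.standard_normal(2)
    y = X @ beta + rng.standard_normal(n)

    b_lasso = Lasso(alpha=0.5, fit_intercept=False).fit(X, y).coef_
    b_ridge = Ridge(alpha=10.0, fit_intercept=False).fit(X, y).coef_

    lasso_corner += np.any(b_lasso == 0.0)  # landed on a corner (or at the origin)
    ridge_zero += np.any(b_ridge == 0.0)    # essentially never happens

print(f"Lasso: fraction of trials with an exact zero = {lasso_corner / n_trials:.2f}")
print(f"Ridge: fraction of trials with an exact zero = {ridge_zero / n_trials:.2f}")
```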
Corners are 'sticky' attractors for the optimization. Once you're near a corner, many gradient directions lead you to the corner. This is why Lasso 'wants' to set coefficients to zero—corners are geometrically stable fixed points of coordinate descent and other algorithms.
As we vary the constraint bound $t$ (equivalently, the regularization parameter $\lambda$), the solution traces a path through coefficient space. This path reveals the regularization-sparsity relationship geometrically.
The Lasso Path Geometrically:
$t = 0$ (or $\lambda = \infty$): The constraint region shrinks to the origin. Solution: $\hat{\boldsymbol{\beta}} = \mathbf{0}$ (maximum sparsity).
$t$ very small ($\lambda$ large): Small diamond with corners on axes. Most likely contact at a single corner—only one non-zero coefficient.
$t$ moderate ($\lambda$ moderate): Diamond grows. As corners pass the OLS solution's projections onto axes, more coefficients become non-zero.
$t$ large ($\lambda$ small): Diamond becomes large enough to contain the OLS solution. Solution approaches OLS (minimum regularization).
$t = \infty$ ($\lambda = 0$): No constraint. Solution equals OLS.
Sequential Feature Entry:
As $t$ increases (or $\lambda$ decreases), coefficients "enter" the model sequentially: the feature most strongly correlated with the response enters first, and additional features join one at a time as the constraint loosens, each entry marking a move of the solution onto a higher-dimensional face of the diamond.
```python
import numpy as np
from sklearn.linear_model import lasso_path
import matplotlib.pyplot as plt


def trace_lasso_path_geometric(X, y, n_alphas=100):
    """
    Trace the Lasso path and visualize coefficient entry order.

    The path shows how coefficients transition from 0 to non-zero
    as regularization decreases—geometric sparsity in action.
    """
    # Compute Lasso path
    alphas, coefs, _ = lasso_path(X, y, n_alphas=n_alphas)

    # Find entry order (first alpha where each coef becomes non-zero)
    entry_alphas = []
    for j in range(coefs.shape[0]):
        non_zero_idx = np.where(np.abs(coefs[j, :]) > 1e-10)[0]
        if len(non_zero_idx) > 0:
            entry_alphas.append((j, alphas[non_zero_idx[0]]))
        else:
            entry_alphas.append((j, 0))

    # Sort by entry (highest alpha = earliest entry)
    entry_order = sorted(entry_alphas, key=lambda x: -x[1])

    print("Feature Entry Order (earliest to latest):")
    print("-" * 45)
    for rank, (feature_idx, alpha) in enumerate(entry_order, 1):
        if alpha > 0:
            print(f"  {rank}. Feature {feature_idx}: enters at λ = {alpha:.4f}")

    # Plot coefficient paths
    fig, ax = plt.subplots(figsize=(10, 6))
    for j in range(coefs.shape[0]):
        ax.plot(np.log(alphas), coefs[j, :], label=f'β_{j}')
    ax.axhline(0, color='gray', linestyle='--', alpha=0.5)
    ax.set_xlabel('log(λ)')
    ax.set_ylabel('Coefficient Value')
    ax.set_title('Lasso Path: Coefficients vs log(λ)')
    ax.legend()
    plt.gca().invert_xaxis()  # Decreasing λ from left to right

    return fig, entry_order


# Example: 10 features, only 3 truly relevant
np.random.seed(42)
n, p = 100, 10
X = np.random.randn(n, p)
beta_true = np.zeros(p)
beta_true[0] = 3.0    # Strong signal
beta_true[3] = -2.0   # Moderate signal
beta_true[7] = 1.5    # Weak signal
y = X @ beta_true + 0.5 * np.random.randn(n)

fig, entry_order = trace_lasso_path_geometric(X, y)
```

Geometric Interpretation of the Path:
The Lasso path traces through vertices, edges, and higher-dimensional faces of the L1 polytope: at $\lambda$ just below $\lambda_{\max}$ the solution sits at a vertex (one non-zero coefficient); as $\lambda$ decreases it moves along edges and onto higher-dimensional faces, and each transition corresponds to a coefficient entering (or, occasionally, leaving) the active set.
Between these transitions, the path is piecewise linear—coefficients change at constant rates. This remarkable property enables efficient Lasso path algorithms (LARS).
The Regularization Path Theorem:
For a fixed design matrix $\mathbf{X}$, as $\lambda$ decreases from $\lambda_{\max}$ to 0: the solution path $\hat{\boldsymbol{\beta}}(\lambda)$ is continuous and piecewise linear in $\lambda$; the active set changes only at finitely many breakpoints; and for $\lambda \geq \lambda_{\max} = \|\mathbf{X}^T\mathbf{y}\|_\infty$ (in the penalized form with the $\tfrac{1}{2}$ squared-error loss) the solution is exactly $\mathbf{0}$.
The Least Angle Regression (LARS) algorithm exploits the piecewise linear path structure. It traces the entire Lasso path in the same computational cost as a single least squares fit. This geometric insight translates directly to algorithmic efficiency.
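The piecewise-linear structure is easy to inspect with scikit-learn's `lars_path` (a minimal sketch on a small synthetic problem; the data here are hypothetical and only for illustration):

```python
import numpy as np
from sklearn.linear_model import lars_path

# Small synthetic problem
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 6))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5, 0.0]) + 0.3 * rng.standard_normal(60)

# method='lasso' returns the exact breakpoints of the Lasso path:
# between consecutive alphas the coefficients change linearly.
alphas, active, coefs = lars_path(X, y, method='lasso')

print("Breakpoints (alphas):", np.round(alphas, 4))
print("Order in which features become active:", active)
print("Coefficients at each breakpoint:\n", np.round(coefs, 3))
```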
While we visualize in 2D or 3D, the geometric principles extend to arbitrary dimensions. High-dimensional geometry has counterintuitive properties that actually strengthen the sparsity argument.
The Cross-Polytope in p Dimensions:
The L1 ball $\{\boldsymbol{\beta} : \|\boldsymbol{\beta}\|_1 \leq t\}$ in $\mathbb{R}^p$ is a cross-polytope (or hyperoctahedron): it has $2p$ vertices $\pm t\,\mathbf{e}_j$ (the maximally sparse points) and $2^p$ facets, one per orthant.
Concentration of Measure Phenomenon:
High-dimensional geometry exhibits surprising properties: most of the volume of a high-dimensional convex body lies very close to its boundary, and the L1 ball occupies a vanishing fraction of the L2 ball of the same radius—it is a "spiky" object whose reach is carried by the corners on the coordinate axes.
Volume Distribution:
Consider the volumes of the L1 and L2 balls of radius $t$ in $\mathbb{R}^p$:

$$V_p^{L1}(t) = \frac{(2t)^p}{p!}, \qquad V_p^{L2}(t) = \frac{\pi^{p/2}\, t^p}{\Gamma(p/2 + 1)}$$
The ratio $V_p^{L1}(t) / V_p^{L2}(t)$ decreases rapidly with $p$. The L1 ball is much "pointier" than the L2 ball in high dimensions.
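A few lines of arithmetic confirm how fast the ratio collapses (a quick sketch using the standard volume formulas above):

```python
import math

def l1_ball_volume(p, t=1.0):
    # Volume of {beta : ||beta||_1 <= t} in R^p
    return (2 * t) ** p / math.factorial(p)

def l2_ball_volume(p, t=1.0):
    # Volume of {beta : ||beta||_2 <= t} in R^p
    return math.pi ** (p / 2) * t ** p / math.gamma(p / 2 + 1)

for p in (2, 5, 10, 20, 50):
    ratio = l1_ball_volume(p) / l2_ball_volume(p)
    print(f"p = {p:3d}: V_L1 / V_L2 = {ratio:.3e}")
# The ratio plummets: in high dimensions the diamond is a tiny, spiky
# subset of the sphere, with its reach concentrated at the corners.
```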
Probability of Sparse Solutions:
For a randomly oriented loss ellipsoid in $\mathbb{R}^p$, the probability of contacting a face of dimension $k$ (having $p-k$ zeros) is:
$$P(\text{$k$-dimensional face contact}) \propto \binom{p}{k} \cdot (\text{geometry factor})$$
The geometry factor favors lower-dimensional faces (more zeros). Combined with the combinatorial factor, this gives a concrete probability distribution over sparsity levels.
The Typical Solution:
In high dimensions with moderate $\lambda$, the typical Lasso solution has only a small subset of non-zero coefficients—at most $\min(n, p)$ of them—with every remaining coefficient exactly zero.
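To see this, fit a Lasso on a $p \gg n$ problem and count the non-zeros (a minimal sketch; the exact count depends on the chosen $\alpha$ and the random draw):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 50, 500                                  # many more features than samples
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = [4.0, -3.0, 2.5, -2.0, 1.5]     # only 5 relevant features
y = X @ beta_true + rng.standard_normal(n)

lasso = Lasso(alpha=0.2, max_iter=50_000).fit(X, y)
n_nonzero = np.sum(lasso.coef_ != 0.0)

print(f"Non-zero coefficients: {n_nonzero} out of {p}")
print(f"Upper bound min(n, p): {min(n, p)}")
```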
The 2D diamond-vs-circle intuition transfers to high dimensions: corners (sparse) are sticky attractors, edges (less sparse) are unstable except under special alignment, and the full interior (no sparsity) requires the unconstrained OLS to lie inside the L1 ball.
The Lasso can be formulated as either a penalized problem or a constrained problem. Geometrically, these formulations are dual perspectives on the same optimization.
Penalized Form:
$$\min_{\boldsymbol{\beta}} \left\{ \frac{1}{2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda\|\boldsymbol{\beta}\|_1 \right\}$$
Geometric interpretation: Find the point where the sum of loss and penalty is minimized. The penalty "warps" the objective, creating valleys along coordinate axes.
Constrained Form:
$$\min_{\boldsymbol{\beta}} \frac{1}{2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 \quad \text{s.t.} \quad \|\boldsymbol{\beta}\|_1 \leq t$$
Geometric interpretation: Find the minimum loss point within the L1 diamond.
The Duality Correspondence:
For every $\lambda \geq 0$, there is a corresponding $t \geq 0$ (namely $t = \lVert\hat{\boldsymbol{\beta}}(\lambda)\rVert_1$) such that both formulations have the same solution. The correspondence is summarized below:
| Penalized | Constrained | Geometric View |
|---|---|---|
| $\lambda = 0$ | $t = \lVert\hat{\boldsymbol{\beta}}^{\text{OLS}}\rVert_1$ | Diamond contains OLS; solution = OLS |
| $\lambda$ small | $t$ large | Large diamond; solution near OLS |
| $\lambda$ moderate | $t$ moderate | Contact at edge or corner |
| $\lambda$ large | $t$ small | Small diamond; sparse solution at corner |
| $\lambda = \lambda_{\max}$ | $t = 0$ | Diamond shrinks to the origin; $\hat{\boldsymbol{\beta}} = \mathbf{0}$ |
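The correspondence can also be traced numerically: for each $\lambda$ on a grid, fit the penalized Lasso and record $t(\lambda) = \lVert\hat{\boldsymbol{\beta}}(\lambda)\rVert_1$ (a sketch using `lasso_path` on hypothetical data; the monotone growth of $t$ as $\lambda$ shrinks is the duality in action):

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(2)
n, p = 100, 5
X = rng.standard_normal((n, p))
y = X @ np.array([2.0, -1.0, 0.0, 0.5, 0.0]) + 0.3 * rng.standard_normal(n)

alphas, coefs, _ = lasso_path(X, y, n_alphas=8)
for lam, beta in zip(alphas, coefs.T):
    # t(lambda) = L1 norm of the penalized solution: the constrained problem
    # with this bound t has the same minimizer.
    print(f"λ = {lam:.4f}  ->  t = ||β(λ)||₁ = {np.abs(beta).sum():.4f}")
```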
Lagrangian Geometry:
The Lagrangian function combines both views:
$$L(\boldsymbol{\beta}, \mu) = \frac{1}{2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \mu(\|\boldsymbol{\beta}\|_1 - t)$$
At optimality, $\mu = \lambda$ and the gradients balance:
$$\nabla_{\boldsymbol{\beta}} \text{(loss)} + \lambda \cdot \partial\|\boldsymbol{\beta}\|_1 \ni \mathbf{0}$$
Geometrically: the loss gradient points toward the OLS solution; the penalty subgradient points "outward" from the constraint boundary. At the optimum, these forces balance.
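This balance can be verified numerically for a fitted model. The sketch below checks the stationarity conditions for scikit-learn's `Lasso`, which minimizes $\frac{1}{2n}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_2^2 + \alpha\|\boldsymbol{\beta}\|_1$ (note the $1/n$ factor in its loss scaling; the data are hypothetical):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p = 200, 6
X = rng.standard_normal((n, p))
y = X @ np.array([2.0, 0.0, -1.5, 0.0, 0.0, 1.0]) + 0.5 * rng.standard_normal(n)

alpha = 0.1
model = Lasso(alpha=alpha, fit_intercept=False, tol=1e-10, max_iter=100_000).fit(X, y)
residual = y - X @ model.coef_
grad = -X.T @ residual / n      # gradient of the smooth part, (1/(2n))||y - Xb||^2

for j, (g, b) in enumerate(zip(grad, model.coef_)):
    if b != 0:
        # Active coefficient: smooth gradient is balanced by -alpha * sign(b)
        print(f"β_{j} = {b:+.3f}:  grad = {g:+.4f}  vs  -α·sign(β) = {-alpha*np.sign(b):+.4f}")
    else:
        # Zero coefficient: gradient magnitude stays within [-α, α]
        print(f"β_{j} =  0.000:  |grad| = {abs(g):.4f}  <= α = {alpha}")
```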
The Penalty as Constraint Enforcement:
Another way to see the equivalence: the penalty $\lambda\|\boldsymbol{\beta}\|_1$ acts as a "soft" version of the constraint $\|\boldsymbol{\beta}\|_1 \leq t$. Instead of a hard boundary, violations are penalized proportionally. The regularization parameter $\lambda$ is the "price per unit" of constraint violation.
Use the constrained view for geometric intuition (visualizing diamond/ellipse contact). Use the penalized view for algorithms (gradient-based optimization). The mathematical equivalence means either perspective is valid; choose based on what insight you need.
The geometry of L1 regularization directly informs algorithm design. Understanding shape properties translates to understanding algorithmic behavior.
Why Coordinate Descent Works Well:
The L1 ball has axis-aligned faces and edges. Coordinate descent, which optimizes one coordinate at a time while holding the others fixed, naturally navigates this geometry: each one-dimensional subproblem has a closed-form solution (soft thresholding), each update moves parallel to a coordinate axis—the same directions along which the diamond's corners and edges are aligned—and an update can land exactly on zero, pinning the iterate to a face of the ball.
Soft Thresholding as Projection:
The soft thresholding operator $S_\lambda(z) = \text{sign}(z)\max(|z| - \lambda, 0)$ has a geometric interpretation:
It is the proximal operator of the scaled L1 norm: $S_\lambda(z) = \arg\min_x \left\{\frac{1}{2}(x - z)^2 + \lambda|x|\right\}$. Applied coordinate-wise, it shrinks points toward the coordinate axes and snaps a coordinate exactly to zero once it is within $\lambda$ of zero.
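Putting the two ideas together, here is a minimal coordinate-descent sketch for the penalized objective $\frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda\|\boldsymbol{\beta}\|_1$ (an illustrative implementation, not a production solver—it runs a fixed number of sweeps with no convergence check):

```python
import numpy as np

def soft_threshold(z, lam):
    """Proximal operator of lam * |.|: shrink toward zero, snap to zero if |z| <= lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iters=200):
    """Minimize 0.5*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq_norms = np.sum(X ** 2, axis=0)      # x_j^T x_j for each column
    residual = y - X @ beta
    for _ in range(n_iters):
        for j in range(p):
            # Partial residual: remove feature j's current contribution
            residual += X[:, j] * beta[j]
            rho = X[:, j] @ residual           # x_j^T (partial residual)
            beta[j] = soft_threshold(rho, lam) / col_sq_norms[j]
            residual -= X[:, j] * beta[j]
    return beta

# Tiny usage example (hypothetical data)
rng = np.random.default_rng(4)
X = rng.standard_normal((100, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + 0.3 * rng.standard_normal(100)
print(lasso_coordinate_descent(X, y, lam=10.0))   # some entries are exactly 0
```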
Active Set Methods:
The piecewise linear path structure suggests active set methods: maintain a working set of non-zero coefficients, solve the smaller problem restricted to that set, check the optimality (KKT) conditions on the excluded features, and add any violators before repeating.
This exploits the geometry: most computation happens on low-dimensional faces.
Screening Rules:
Geometry also enables safe screening—provably eliminating variables before running the full algorithm:
If a feature $j$ satisfies $|\mathbf{x}_j^T\mathbf{y}| < \lambda - \|\mathbf{x}_j\|_2 \cdot (\text{geometric term})$, then $\hat{\beta}_j = 0$ at the optimum.
Geometrically: this identifies features whose gradient directions are "too far" from the contact region to possibly be non-zero.
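As a concrete instance, here is a hedged sketch of the classical SAFE rule (El Ghaoui et al.), stated for the penalized form $\frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda\|\boldsymbol{\beta}\|_1$ with $\lambda_{\max} = \|\mathbf{X}^T\mathbf{y}\|_\infty$; consult the original papers before relying on the exact bound:

```python
import numpy as np

def safe_screen(X, y, lam):
    """Boolean mask of features that can be safely discarded (beta_j = 0).

    SAFE rule (sketch): discard feature j if
        |x_j^T y| < lam - ||x_j||_2 * ||y||_2 * (lam_max - lam) / lam_max
    """
    correlations = np.abs(X.T @ y)
    lam_max = correlations.max()
    threshold = lam - np.linalg.norm(X, axis=0) * np.linalg.norm(y) * (lam_max - lam) / lam_max
    return correlations < threshold

# Usage: screen first, then run the solver on the surviving columns only.
rng = np.random.default_rng(5)
X = rng.standard_normal((100, 2000))
y = X[:, 0] * 3.0 + rng.standard_normal(100)
lam = 0.8 * np.abs(X.T @ y).max()        # fairly strong regularization
discard = safe_screen(X, y, lam)
print(f"Safely discarded {discard.sum()} of {X.shape[1]} features before solving.")
```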
Practical Impact:
In high-dimensional problems ($p \gg n$), most coefficients are zero. Geometric insights enable algorithms that work mainly with the small active set, screen out clearly irrelevant features before solving, and reuse the piecewise-linear path structure (warm starts along a $\lambda$ grid) instead of solving each problem from scratch.
The best Lasso algorithms are designed with geometry in mind. They don't fight the L1 structure—they exploit it. Understanding the diamond-shaped constraint set and piecewise linear paths leads to algorithms orders of magnitude faster than black-box optimization.
We've developed deep geometric intuition for why L1 regularization produces sparse solutions: the diamond-shaped constraint set has corners on the coordinate axes, expanding loss ellipses tend to make first contact at those corners, the penalized and constrained formulations are two views of the same contact geometry, and the effect persists—indeed strengthens—in high dimensions.
What's Next:
We've established the mathematical formulation, sparsity mechanics, and geometric interpretation. The next page addresses a crucial practical question: How do we actually solve the Lasso problem? We'll explore coordinate descent, the proximal gradient method (ISTA/FISTA), and the LARS algorithm—each exploiting the geometric insights developed here.
You now have geometric intuition for L1 sparsity: the diamond catches expanding ellipses at corners, corners are stable attractors, and this geometry persists and strengthens in high dimensions. Next, we'll translate this understanding into practical algorithms.