Mathematics provides the precision, but geometry provides the intuition. The geometric interpretation of Lasso versus Ridge regularization is one of the most illuminating visualizations in all of machine learning—it makes immediately clear why L1 produces sparsity while L2 does not.
This page develops that geometric understanding systematically: we will visualize the L1 and L2 constraint regions, see how the loss contours make first contact with them, follow the solution path as the constraint loosens, extend the picture to high dimensions, and connect the geometry to the penalized/constrained duality and to algorithm design.
By the end, you will have a mental picture that lets you reason about regularization geometrically—an intuition that transfers to many other constrained optimization problems.
This page provides visual and geometric intuition for L1 sparsity. You will understand why diamond corners attract solutions, how different constraint strengths affect solutions, and how the geometry generalizes to high dimensions where visualization fails but geometric intuition still applies.
Let's begin by visualizing the constraint regions for L1 and L2 regularization in two dimensions. Consider coefficients $(\beta_1, \beta_2)$.
The L2 Constraint (Ridge):
$$\beta_1^2 + \beta_2^2 \leq t^2$$
This defines a disk of radius $t$ centered at the origin. Its boundary is a circle—perfectly smooth with no corners or edges. Every point on the boundary has a unique normal direction.
The L1 Constraint (Lasso):
$$|\beta_1| + |\beta_2| \leq t$$
This defines a diamond (rotated square) with vertices at $(t, 0)$, $(-t, 0)$, $(0, t)$, $(0, -t)$. The boundary consists of four straight edges meeting at four corners. Corners lie exactly on the coordinate axes.
Key Geometric Properties:
| Property | L2 (Circle) | L1 (Diamond) |
|---|---|---|
| Shape | Smooth, round | Polyhedral, angular |
| Corners | None (continuously curved) | 4 vertices on axes |
| Edges | None (continuous boundary) | 4 straight edges |
| Normal directions | Infinite (radial) | 4 edge normals + corner normals |
| Convexity | Strictly convex | Convex but not strictly |
| Extreme points | Every boundary point | Only 4 vertices |
| Axis intersections | At distance $t$ | At distance $t$ (vertices) |
Why Corner Geometry Matters:
The crucial difference is the presence of corners (vertices) in the L1 constraint set. Corners are points where the boundary is not smooth—the normal direction is not uniquely defined.
At a corner, multiple normal directions are valid. This geometric property translates to the subdifferential being an interval rather than a single point at $\beta_j = 0$. The optimization can "rest" at a corner because there's no unique direction to move that decreases both the constraint violation and objective.
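To connect the picture to the algebra, recall the subdifferential of the absolute value (a standard fact, restated here for reference):

$$\partial|\beta_j| = \begin{cases} \{\operatorname{sign}(\beta_j)\} & \text{if } \beta_j \neq 0, \\ [-1, 1] & \text{if } \beta_j = 0. \end{cases}$$

The interval at zero is the analytic counterpart of the corner: a whole range of loss gradients can be absorbed there, so the optimum can sit exactly at $\beta_j = 0$.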
Higher-Dimensional Constraint Sets:
In $p$ dimensions, the L2 ball $\{\boldsymbol{\beta} : \|\boldsymbol{\beta}\|_2 \leq t\}$ is a solid hypersphere with a smooth boundary, while the L1 ball $\{\boldsymbol{\beta} : \|\boldsymbol{\beta}\|_1 \leq t\}$ is a cross-polytope.
The cross-polytope's vertices are $\{\pm t \cdot \mathbf{e}_j\}_{j=1}^p$, where $\mathbf{e}_j$ is the $j$-th standard basis vector. These vertices are the most sparse points—they have only one non-zero coordinate.
In the L1 polytope, vertices have one non-zero coordinate (maximally sparse), points on edges have two, points on two-dimensional faces have three, and so on; the interior imposes no sparsity at all. Constrained optimization naturally favors lower-dimensional faces (sparser solutions) because they have more directions available to "catch" the expanding objective.
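As a quick illustration (a minimal sketch, not tied to any dataset), the cross-polytope's vertices can be enumerated directly and each one is maximally sparse:

```python
import numpy as np

# Vertices of the L1 ball (cross-polytope) of radius t in R^p:
# {+t*e_j, -t*e_j} for each standard basis vector e_j.
p, t = 3, 1.0
vertices = np.vstack([s * t * np.eye(p) for s in (+1, -1)])  # shape (2p, p)

for v in vertices:
    print(v, "non-zeros:", np.count_nonzero(v))
# Every vertex has exactly one non-zero coordinate (maximal sparsity),
# while a generic interior point of the ball has all p coordinates non-zero.
```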
To understand the constrained optimization geometry, we need to visualize the level sets (contour curves) of the loss function.
The Ordinary Least Squares Loss:
$$\mathcal{L}(\boldsymbol{\beta}) = \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 = (\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}^{\text{OLS}})^T \mathbf{X}^T\mathbf{X} (\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}^{\text{OLS}}) + \text{const}$$
This is a quadratic form centered at the OLS solution $\hat{\boldsymbol{\beta}}^{\text{OLS}}$.
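Why the identity holds (one line of algebra): write $\mathbf{y} - \mathbf{X}\boldsymbol{\beta} = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}^{\text{OLS}}) + \mathbf{X}(\hat{\boldsymbol{\beta}}^{\text{OLS}} - \boldsymbol{\beta})$; the normal equations $\mathbf{X}^T(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}^{\text{OLS}}) = \mathbf{0}$ make the cross term vanish, leaving

$$\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 = \underbrace{\|\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}^{\text{OLS}}\|_2^2}_{\text{const}} + (\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}^{\text{OLS}})^T \mathbf{X}^T\mathbf{X} (\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}^{\text{OLS}})$$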
Level Set Geometry:
The level sets $\{\boldsymbol{\beta} : \mathcal{L}(\boldsymbol{\beta}) = c\}$ are ellipsoids (ellipses in 2D) centered at $\hat{\boldsymbol{\beta}}^{\text{OLS}}$. Their axes align with the eigenvectors of $\mathbf{X}^T\mathbf{X}$, and they are elongated along directions with small eigenvalues—the directions the data constrain weakly.
Special Cases: if $\mathbf{X}^T\mathbf{X} = \mathbf{I}$ (orthonormal features), the level sets are circles centered at the OLS solution; correlated features tilt and elongate the ellipses, as in the example below.
```python
import numpy as np
import matplotlib.pyplot as plt


def plot_constraint_and_loss(X, y, t_l1=1.0, t_l2=1.0, figsize=(14, 5)):
    """
    Visualize L1 and L2 constraint regions with loss contours.

    Parameters
    ----------
    X : ndarray of shape (n, 2)
        Design matrix (2 features for visualization)
    y : ndarray of shape (n,)
        Response vector
    t_l1 : float
        L1 constraint bound
    t_l2 : float
        L2 constraint bound (radius)
    """
    # Compute OLS solution
    beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

    # Create grid for contours
    beta_range = max(abs(beta_ols).max() * 1.5, t_l1 * 1.5, t_l2 * 1.5)
    b1 = np.linspace(-beta_range, beta_range, 200)
    b2 = np.linspace(-beta_range, beta_range, 200)
    B1, B2 = np.meshgrid(b1, b2)

    # Compute loss at each grid point
    Loss = np.zeros_like(B1)
    for i in range(len(b1)):
        for j in range(len(b2)):
            beta = np.array([B1[i, j], B2[i, j]])
            Loss[i, j] = np.sum((y - X @ beta) ** 2)

    fig, axes = plt.subplots(1, 3, figsize=figsize)

    # Plot 1: L2 constraint (Ridge)
    ax = axes[0]
    ax.contour(B1, B2, Loss, levels=15, cmap='coolwarm', alpha=0.7)
    theta = np.linspace(0, 2*np.pi, 100)
    ax.fill(t_l2*np.cos(theta), t_l2*np.sin(theta), alpha=0.3, color='blue',
            label='L2 constraint')
    ax.plot(t_l2*np.cos(theta), t_l2*np.sin(theta), 'b-', linewidth=2)
    ax.plot(*beta_ols, 'r*', markersize=15, label='OLS solution')
    ax.axhline(0, color='gray', linestyle='--', alpha=0.5)
    ax.axvline(0, color='gray', linestyle='--', alpha=0.5)
    ax.set_xlabel(r'$\beta_1$')
    ax.set_ylabel(r'$\beta_2$')
    ax.set_title('Ridge (L2): Circular Constraint')
    ax.legend()
    ax.set_aspect('equal')

    # Plot 2: L1 constraint (Lasso)
    ax = axes[1]
    ax.contour(B1, B2, Loss, levels=15, cmap='coolwarm', alpha=0.7)
    diamond = np.array([[t_l1, 0], [0, t_l1], [-t_l1, 0], [0, -t_l1], [t_l1, 0]])
    ax.fill(diamond[:, 0], diamond[:, 1], alpha=0.3, color='green',
            label='L1 constraint')
    ax.plot(diamond[:, 0], diamond[:, 1], 'g-', linewidth=2)
    ax.plot(*beta_ols, 'r*', markersize=15, label='OLS solution')
    ax.axhline(0, color='gray', linestyle='--', alpha=0.5)
    ax.axvline(0, color='gray', linestyle='--', alpha=0.5)
    ax.set_xlabel(r'$\beta_1$')
    ax.set_ylabel(r'$\beta_2$')
    ax.set_title('Lasso (L1): Diamond Constraint')
    ax.legend()
    ax.set_aspect('equal')

    # Plot 3: Both overlaid
    ax = axes[2]
    ax.contour(B1, B2, Loss, levels=15, cmap='coolwarm', alpha=0.7)
    ax.fill(t_l2*np.cos(theta), t_l2*np.sin(theta), alpha=0.2, color='blue', label='L2')
    ax.plot(t_l2*np.cos(theta), t_l2*np.sin(theta), 'b-', linewidth=2)
    ax.fill(diamond[:, 0], diamond[:, 1], alpha=0.2, color='green', label='L1')
    ax.plot(diamond[:, 0], diamond[:, 1], 'g-', linewidth=2)
    ax.plot(*beta_ols, 'r*', markersize=15, label='OLS')
    ax.axhline(0, color='gray', linestyle='--', alpha=0.5)
    ax.axvline(0, color='gray', linestyle='--', alpha=0.5)
    ax.set_xlabel(r'$\beta_1$')
    ax.set_ylabel(r'$\beta_2$')
    ax.set_title('L1 vs L2: Shape Comparison')
    ax.legend()
    ax.set_aspect('equal')

    plt.tight_layout()
    return fig


# Example usage
np.random.seed(42)
n = 50
X = np.random.randn(n, 2)
X[:, 1] = X[:, 0] * 0.8 + np.random.randn(n) * 0.3  # Correlated features
y = 2 * X[:, 0] + 0.5 * X[:, 1] + np.random.randn(n) * 0.5

# Visualization shows elongated ellipses due to correlation
# Note how L1 diamond corners align with axes (sparse directions)
# while L2 circle has no preferred sparse directions
fig = plot_constraint_and_loss(X, y)
```

The orientation of the loss ellipses relative to the constraint sets determines where contact occurs. With correlated features, the ellipses are tilted, potentially making corner contact more or less likely depending on the tilt direction relative to the diamond's geometry.
The constrained optimization problem seeks the point within the constraint region that minimizes the loss. Geometrically, we find where the smallest possible loss ellipse contacts the constraint boundary.
The Constrained Optimization View:
Starting from the OLS solution, we expand level set ellipses outward (increasing loss) until the first contact with the constraint boundary. This contact point is the constrained optimum.
For L2 (Circle) Constraint:
The first contact typically occurs at a point where the ellipse is tangent to the circle. Tangency means:
$$\nabla \mathcal{L}(\boldsymbol{\beta}^*) + \mu \nabla g(\boldsymbol{\beta}^*) = \mathbf{0}$$
where $g(\boldsymbol{\beta}) = \|\boldsymbol{\beta}\|_2^2 - t^2$ is the constraint function and $\mu > 0$ is the Lagrange multiplier.
Since the circle is smooth everywhere, this tangency condition has no special preference for axis-aligned points. The contact point generically has both coordinates non-zero.
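A quick empirical check of this contrast (a minimal sketch using scikit-learn's `Ridge` and `Lasso`; the regularization strengths are arbitrary illustrative choices, and the exact coefficient values depend on the random draw):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 100, 8
X = rng.standard_normal((n, p))
beta_true = np.array([3.0, 0.0, -2.0, 0.0, 0.0, 1.5, 0.0, 0.0])
y = X @ beta_true + 0.5 * rng.standard_normal(n)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.3).fit(X, y)

print("Ridge coefficients:", np.round(ridge.coef_, 3))  # small but non-zero everywhere
print("Lasso coefficients:", np.round(lasso.coef_, 3))  # several exact zeros
print("Ridge exact zeros:", np.sum(ridge.coef_ == 0.0))
print("Lasso exact zeros:", np.sum(lasso.coef_ == 0.0))
```

Ridge shrinks every coefficient toward zero but (generically) never to exactly zero; Lasso zeroes several out, which is the corner contact in action.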
For L1 (Diamond) Constraint:
The first contact can occur either in the interior of one of the four flat edges (both coefficients non-zero) or at one of the four corners (one coefficient exactly zero).
Why Corners are Preferred:
Consider expanding an ellipse outward from the OLS solution. The ellipse makes first contact with the constraint region at the boundary point of smallest loss. For the diamond, the corners protrude along the coordinate axes, so for a wide range of OLS positions and ellipse orientations the first point reached is a corner rather than the interior of an edge.
Mathematical Argument:
The normal cone at a corner contains all directions pointing "outward" from the corner. At the vertex $(t, 0)$:
$$N_{(t,0)} = \{(\alpha, \beta) : \alpha \geq |\beta|\}$$
This is a large cone—it spans 90 degrees in 2D (45 degrees to either side of the axis). For the negative loss gradient to lie in this cone (the optimality condition), there is significant "room": many possible gradient directions work.
At a point in the interior of an edge, the normal cone is a single ray (the edge normal). That point is optimal only when the negative loss gradient aligns exactly with this one direction—for any fixed edge point, a measure-zero event in the space of gradient directions.
Probability Interpretation:
If we randomly generate problems (random $\mathbf{X}$, $\mathbf{y}$), solutions with at least one coefficient exactly zero (corner contact) occur with substantial probability, and that probability grows as the constraint tightens. Under the smooth L2 constraint, by contrast, an exactly zero coefficient is a measure-zero event.
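A small simulation makes this concrete (a hedged sketch: it uses scikit-learn's penalized `Lasso` and `Ridge` as stand-ins for the constrained problems, with arbitrary regularization strengths, so the exact frequencies depend on how the random problems are generated):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n_trials, n = 500, 40
lasso_corner, ridge_zero = 0, 0

for _ in range(n_trials):
    X = rng.standard_normal((n, 2))
    beta = rng.standard_normal(2)
    y = X @ beta + rng.standard_normal(n)

    b_lasso = Lasso(alpha=0.5, fit_intercept=False).fit(X, y).coef_
    b_ridge = Ridge(alpha=10.0, fit_intercept=False).fit(X, y).coef_

    lasso_corner += np.any(b_lasso == 0.0)  # landed on a corner (or at the origin)
    ridge_zero += np.any(b_ridge == 0.0)    # essentially never happens

print(f"Lasso: fraction of trials with an exact zero = {lasso_corner / n_trials:.2f}")
print(f"Ridge: fraction of trials with an exact zero = {ridge_zero / n_trials:.2f}")
```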
Corners are 'sticky' attractors for the optimization. Once you're near a corner, many gradient directions lead you to the corner. This is why Lasso 'wants' to set coefficients to zero—corners are geometrically stable fixed points of coordinate descent and other algorithms.
As we vary the constraint bound $t$ (equivalently, the regularization parameter $\lambda$), the solution traces a path through coefficient space. This path reveals the regularization-sparsity relationship geometrically.
The Lasso Path Geometrically:
$t = 0$ (or $\lambda = \infty$): The constraint region shrinks to the origin. Solution: $\hat{\boldsymbol{\beta}} = \mathbf{0}$ (maximum sparsity).
$t$ very small ($\lambda$ large): Small diamond with corners on axes. Most likely contact at a single corner—only one non-zero coefficient.
$t$ moderate ($\lambda$ moderate): Diamond grows. As corners pass the OLS solution's projections onto axes, more coefficients become non-zero.
$t$ large ($\lambda$ small): Diamond becomes large enough to contain the OLS solution. Solution approaches OLS (minimum regularization).
$t = \infty$ ($\lambda = 0$): No constraint. Solution equals OLS.
Sequential Feature Entry:
As $t$ increases (or $\lambda$ decreases), coefficients "enter" the model sequentially: the feature most strongly correlated with the response enters first, and additional features join one at a time as the constraint loosens, each entry marking a move of the solution onto a higher-dimensional face of the diamond.
```python
import numpy as np
from sklearn.linear_model import lasso_path
import matplotlib.pyplot as plt


def trace_lasso_path_geometric(X, y, n_alphas=100):
    """
    Trace the Lasso path and visualize coefficient entry order.

    The path shows how coefficients transition from 0 to non-zero
    as regularization decreases—geometric sparsity in action.
    """
    # Compute Lasso path
    alphas, coefs, _ = lasso_path(X, y, n_alphas=n_alphas)

    # Find entry order (first alpha where each coef becomes non-zero)
    entry_alphas = []
    for j in range(coefs.shape[0]):
        non_zero_idx = np.where(np.abs(coefs[j, :]) > 1e-10)[0]
        if len(non_zero_idx) > 0:
            entry_alphas.append((j, alphas[non_zero_idx[0]]))
        else:
            entry_alphas.append((j, 0))

    # Sort by entry (highest alpha = earliest entry)
    entry_order = sorted(entry_alphas, key=lambda x: -x[1])

    print("Feature Entry Order (earliest to latest):")
    print("-" * 45)
    for rank, (feature_idx, alpha) in enumerate(entry_order, 1):
        if alpha > 0:
            print(f"  {rank}. Feature {feature_idx}: enters at λ = {alpha:.4f}")

    # Plot coefficient paths
    fig, ax = plt.subplots(figsize=(10, 6))
    for j in range(coefs.shape[0]):
        ax.plot(np.log(alphas), coefs[j, :], label=f'β_{j}')
    ax.axhline(0, color='gray', linestyle='--', alpha=0.5)
    ax.set_xlabel('log(λ)')
    ax.set_ylabel('Coefficient Value')
    ax.set_title('Lasso Path: Coefficients vs log(λ)')
    ax.legend()
    plt.gca().invert_xaxis()  # Decreasing λ from left to right

    return fig, entry_order


# Example: 10 features, only 3 truly relevant
np.random.seed(42)
n, p = 100, 10
X = np.random.randn(n, p)
beta_true = np.zeros(p)
beta_true[0] = 3.0    # Strong signal
beta_true[3] = -2.0   # Moderate signal
beta_true[7] = 1.5    # Weak signal
y = X @ beta_true + 0.5 * np.random.randn(n)

fig, entry_order = trace_lasso_path_geometric(X, y)
```

Geometric Interpretation of the Path:
The Lasso path traces through vertices, edges, and higher-dimensional faces of the L1 polytope: at $\lambda$ just below $\lambda_{\max}$ the solution sits at a vertex (one non-zero coefficient); as $\lambda$ decreases it moves along edges and onto higher-dimensional faces, and each transition corresponds to a coefficient entering (or, occasionally, leaving) the active set.
Between these transitions, the path is piecewise linear—coefficients change at constant rates. This remarkable property enables efficient Lasso path algorithms (LARS).
The Regularization Path Theorem:
For a fixed design matrix $\mathbf{X}$, as $\lambda$ decreases from $\lambda_{\max}$ to 0: the solution path $\hat{\boldsymbol{\beta}}(\lambda)$ is continuous and piecewise linear in $\lambda$; the active set changes only at finitely many breakpoints; and for $\lambda \geq \lambda_{\max} = \|\mathbf{X}^T\mathbf{y}\|_\infty$ (in the penalized form with the $\tfrac{1}{2}$ squared-error loss) the solution is exactly $\mathbf{0}$.
The Least Angle Regression (LARS) algorithm exploits the piecewise linear path structure. It traces the entire Lasso path in the same computational cost as a single least squares fit. This geometric insight translates directly to algorithmic efficiency.
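The piecewise-linear structure is easy to inspect with scikit-learn's `lars_path` (a minimal sketch on a small synthetic problem; the data here are hypothetical and only for illustration):

```python
import numpy as np
from sklearn.linear_model import lars_path

# Small synthetic problem
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 6))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5, 0.0]) + 0.3 * rng.standard_normal(60)

# method='lasso' returns the exact breakpoints of the Lasso path:
# between consecutive alphas the coefficients change linearly.
alphas, active, coefs = lars_path(X, y, method='lasso')

print("Breakpoints (alphas):", np.round(alphas, 4))
print("Order in which features become active:", active)
print("Coefficients at each breakpoint:\n", np.round(coefs, 3))
```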
While we visualize in 2D or 3D, the geometric principles extend to arbitrary dimensions. High-dimensional geometry has counterintuitive properties that actually strengthen the sparsity argument.
The Cross-Polytope in p Dimensions:
The L1 ball $\{\boldsymbol{\beta} : \|\boldsymbol{\beta}\|_1 \leq t\}$ in $\mathbb{R}^p$ is a cross-polytope (or hyperoctahedron): it has $2p$ vertices $\pm t\,\mathbf{e}_j$ (the maximally sparse points) and $2^p$ facets, one per orthant.
Concentration of Measure Phenomenon:
High-dimensional geometry exhibits surprising properties: most of the volume of a high-dimensional convex body lies very close to its boundary, and the L1 ball occupies a vanishing fraction of the L2 ball of the same radius—it is a "spiky" object whose reach is carried by the corners on the coordinate axes.
Volume Distribution:
Consider the volumes of the L1 and L2 balls of radius $t$ in $\mathbb{R}^p$:

$$V_p^{L1}(t) = \frac{(2t)^p}{p!}, \qquad V_p^{L2}(t) = \frac{\pi^{p/2}\, t^p}{\Gamma(p/2 + 1)}$$
The ratio $V_p^{L1}(t) / V_p^{L2}(t)$ decreases rapidly with $p$. The L1 ball is much "pointier" than the L2 ball in high dimensions.
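A few lines of arithmetic confirm how fast the ratio collapses (a quick sketch using the standard volume formulas above):

```python
import math

def l1_ball_volume(p, t=1.0):
    # Volume of {beta : ||beta||_1 <= t} in R^p
    return (2 * t) ** p / math.factorial(p)

def l2_ball_volume(p, t=1.0):
    # Volume of {beta : ||beta||_2 <= t} in R^p
    return math.pi ** (p / 2) * t ** p / math.gamma(p / 2 + 1)

for p in (2, 5, 10, 20, 50):
    ratio = l1_ball_volume(p) / l2_ball_volume(p)
    print(f"p = {p:3d}: V_L1 / V_L2 = {ratio:.3e}")
# The ratio plummets: in high dimensions the diamond is a tiny, spiky
# subset of the sphere, with its reach concentrated at the corners.
```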
Probability of Sparse Solutions:
For a randomly oriented loss ellipsoid in $\mathbb{R}^p$, the probability of contacting a face of dimension $k$ (having $p-k$ zeros) is:
$$P(\text{$k$-dimensional face contact}) \propto \binom{p}{k} \cdot (\text{geometry factor})$$
The geometry factor favors lower-dimensional faces (more zeros). Combined with the combinatorial factor, this gives a concrete probability distribution over sparsity levels.
The Typical Solution:
In high dimensions with moderate $\lambda$, the typical Lasso solution has only a small subset of non-zero coefficients—at most $\min(n, p)$ of them—with every remaining coefficient exactly zero.
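To see this, fit a Lasso on a $p \gg n$ problem and count the non-zeros (a minimal sketch; the exact count depends on the chosen $\alpha$ and the random draw):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 50, 500                                  # many more features than samples
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = [4.0, -3.0, 2.5, -2.0, 1.5]     # only 5 relevant features
y = X @ beta_true + rng.standard_normal(n)

lasso = Lasso(alpha=0.2, max_iter=50_000).fit(X, y)
n_nonzero = np.sum(lasso.coef_ != 0.0)

print(f"Non-zero coefficients: {n_nonzero} out of {p}")
print(f"Upper bound min(n, p): {min(n, p)}")
```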
The 2D diamond-vs-circle intuition transfers to high dimensions: corners (sparse) are sticky attractors, edges (less sparse) are unstable except under special alignment, and the full interior (no sparsity) requires the unconstrained OLS to lie inside the L1 ball.
The Lasso can be formulated as either a penalized problem or a constrained problem. Geometrically, these formulations are dual perspectives on the same optimization.
Penalized Form:
$$\min_{\boldsymbol{\beta}} \left\{ \frac{1}{2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda\|\boldsymbol{\beta}\|_1 \right\}$$
Geometric interpretation: Find the point where the sum of loss and penalty is minimized. The penalty "warps" the objective, creating valleys along coordinate axes.
Constrained Form:
$$\min_{\boldsymbol{\beta}} \frac{1}{2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 \quad \text{s.t.} \quad \|\boldsymbol{\beta}\|_1 \leq t$$
Geometric interpretation: Find the minimum loss point within the L1 diamond.
The Duality Correspondence:
For every $\lambda \geq 0$, there is a corresponding $t \geq 0$ (namely $t = \lVert\hat{\boldsymbol{\beta}}(\lambda)\rVert_1$) such that both formulations have the same solution. The correspondence is summarized below:
| Penalized | Constrained | Geometric View |
|---|---|---|
| $\lambda = 0$ | $t = \lVert\hat{\boldsymbol{\beta}}^{\text{OLS}}\rVert_1$ | Diamond contains OLS; solution = OLS |
| $\lambda$ small | $t$ large | Large diamond; solution near OLS |
| $\lambda$ moderate | $t$ moderate | Contact at edge or corner |
| $\lambda$ large | $t$ small | Small diamond; sparse solution at corner |
| $\lambda = \lambda_{\max}$ | $t = 0$ | Diamond shrinks to the origin; $\hat{\boldsymbol{\beta}} = \mathbf{0}$ |
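The correspondence can also be traced numerically: for each $\lambda$ on a grid, fit the penalized Lasso and record $t(\lambda) = \lVert\hat{\boldsymbol{\beta}}(\lambda)\rVert_1$ (a sketch using `lasso_path` on hypothetical data; the monotone growth of $t$ as $\lambda$ shrinks is the duality in action):

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(2)
n, p = 100, 5
X = rng.standard_normal((n, p))
y = X @ np.array([2.0, -1.0, 0.0, 0.5, 0.0]) + 0.3 * rng.standard_normal(n)

alphas, coefs, _ = lasso_path(X, y, n_alphas=8)
for lam, beta in zip(alphas, coefs.T):
    # t(lambda) = L1 norm of the penalized solution: the constrained problem
    # with this bound t has the same minimizer.
    print(f"λ = {lam:.4f}  ->  t = ||β(λ)||₁ = {np.abs(beta).sum():.4f}")
```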
Lagrangian Geometry:
The Lagrangian function combines both views:
$$L(\boldsymbol{\beta}, \mu) = \frac{1}{2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \mu(\|\boldsymbol{\beta}\|_1 - t)$$
At optimality, $\mu = \lambda$ and the gradients balance:
$$\nabla_{\boldsymbol{\beta}} \text{(loss)} + \lambda \cdot \partial\|\boldsymbol{\beta}\|_1 \ni \mathbf{0}$$
Geometrically: the loss gradient points toward the OLS solution; the penalty subgradient points "outward" from the constraint boundary. At the optimum, these forces balance.
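This balance can be verified numerically for a fitted model. The sketch below checks the stationarity conditions for scikit-learn's `Lasso`, which minimizes $\frac{1}{2n}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_2^2 + \alpha\|\boldsymbol{\beta}\|_1$ (note the $1/n$ factor in its loss scaling; the data are hypothetical):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p = 200, 6
X = rng.standard_normal((n, p))
y = X @ np.array([2.0, 0.0, -1.5, 0.0, 0.0, 1.0]) + 0.5 * rng.standard_normal(n)

alpha = 0.1
model = Lasso(alpha=alpha, fit_intercept=False, tol=1e-10, max_iter=100_000).fit(X, y)
residual = y - X @ model.coef_
grad = -X.T @ residual / n      # gradient of the smooth part, (1/(2n))||y - Xb||^2

for j, (g, b) in enumerate(zip(grad, model.coef_)):
    if b != 0:
        # Active coefficient: smooth gradient is balanced by -alpha * sign(b)
        print(f"β_{j} = {b:+.3f}:  grad = {g:+.4f}  vs  -α·sign(β) = {-alpha*np.sign(b):+.4f}")
    else:
        # Zero coefficient: gradient magnitude stays within [-α, α]
        print(f"β_{j} =  0.000:  |grad| = {abs(g):.4f}  <= α = {alpha}")
```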
The Penalty as Constraint Enforcement:
Another way to see the equivalence: the penalty $\lambda\|\boldsymbol{\beta}\|_1$ acts as a "soft" version of the constraint $\|\boldsymbol{\beta}\|_1 \leq t$. Instead of a hard boundary, violations are penalized proportionally. The regularization parameter $\lambda$ is the "price per unit" of constraint violation.
Use the constrained view for geometric intuition (visualizing diamond/ellipse contact). Use the penalized view for algorithms (gradient-based optimization). The mathematical equivalence means either perspective is valid; choose based on what insight you need.
The geometry of L1 regularization directly informs algorithm design. Understanding shape properties translates to understanding algorithmic behavior.
Why Coordinate Descent Works Well:
The L1 ball has axis-aligned faces and edges. Coordinate descent, which optimizes one coordinate at a time while holding the others fixed, naturally navigates this geometry: each one-dimensional subproblem has a closed-form solution (soft thresholding), each update moves parallel to a coordinate axis—the same directions along which the diamond's corners and edges are aligned—and an update can land exactly on zero, pinning the iterate to a face of the ball.
Soft Thresholding as Projection:
The soft thresholding operator $S_\lambda(z) = \text{sign}(z)\max(|z| - \lambda, 0)$ has a geometric interpretation:
It is the proximal operator of the scaled L1 norm: $S_\lambda(z) = \arg\min_x \left\{\frac{1}{2}(x - z)^2 + \lambda|x|\right\}$. Applied coordinate-wise, it shrinks points toward the coordinate axes and snaps a coordinate exactly to zero once it is within $\lambda$ of zero.
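Putting the two ideas together, here is a minimal coordinate-descent sketch for the penalized objective $\frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda\|\boldsymbol{\beta}\|_1$ (an illustrative implementation, not a production solver—it runs a fixed number of sweeps with no convergence check):

```python
import numpy as np

def soft_threshold(z, lam):
    """Proximal operator of lam * |.|: shrink toward zero, snap to zero if |z| <= lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iters=200):
    """Minimize 0.5*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq_norms = np.sum(X ** 2, axis=0)      # x_j^T x_j for each column
    residual = y - X @ beta
    for _ in range(n_iters):
        for j in range(p):
            # Partial residual: remove feature j's current contribution
            residual += X[:, j] * beta[j]
            rho = X[:, j] @ residual           # x_j^T (partial residual)
            beta[j] = soft_threshold(rho, lam) / col_sq_norms[j]
            residual -= X[:, j] * beta[j]
    return beta

# Tiny usage example (hypothetical data)
rng = np.random.default_rng(4)
X = rng.standard_normal((100, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + 0.3 * rng.standard_normal(100)
print(lasso_coordinate_descent(X, y, lam=10.0))   # some entries are exactly 0
```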
Active Set Methods:
The piecewise linear path structure suggests active set methods: maintain a working set of non-zero coefficients, solve the smaller problem restricted to that set, check the optimality (KKT) conditions on the excluded features, and add any violators before repeating.
This exploits the geometry: most computation happens on low-dimensional faces.
Screening Rules:
Geometry also enables safe screening—provably eliminating variables before running the full algorithm:
If a feature $j$ satisfies $|\mathbf{x}_j^T\mathbf{y}| < \lambda - \|\mathbf{x}_j\|_2 \cdot (\text{geometric term})$, then $\hat{\beta}_j = 0$ at the optimum.
Geometrically: this identifies features whose gradient directions are "too far" from the contact region to possibly be non-zero.
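As a concrete instance, here is a hedged sketch of the classical SAFE rule (El Ghaoui et al.), stated for the penalized form $\frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda\|\boldsymbol{\beta}\|_1$ with $\lambda_{\max} = \|\mathbf{X}^T\mathbf{y}\|_\infty$; consult the original papers before relying on the exact bound:

```python
import numpy as np

def safe_screen(X, y, lam):
    """Boolean mask of features that can be safely discarded (beta_j = 0).

    SAFE rule (sketch): discard feature j if
        |x_j^T y| < lam - ||x_j||_2 * ||y||_2 * (lam_max - lam) / lam_max
    """
    correlations = np.abs(X.T @ y)
    lam_max = correlations.max()
    threshold = lam - np.linalg.norm(X, axis=0) * np.linalg.norm(y) * (lam_max - lam) / lam_max
    return correlations < threshold

# Usage: screen first, then run the solver on the surviving columns only.
rng = np.random.default_rng(5)
X = rng.standard_normal((100, 2000))
y = X[:, 0] * 3.0 + rng.standard_normal(100)
lam = 0.8 * np.abs(X.T @ y).max()        # fairly strong regularization
discard = safe_screen(X, y, lam)
print(f"Safely discarded {discard.sum()} of {X.shape[1]} features before solving.")
```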
Practical Impact:
In high-dimensional problems ($p \gg n$), most coefficients are zero. Geometric insights enable algorithms that work mainly with the small active set, screen out clearly irrelevant features before solving, and reuse the piecewise-linear path structure (warm starts along a $\lambda$ grid) instead of solving each problem from scratch.
The best Lasso algorithms are designed with geometry in mind. They don't fight the L1 structure—they exploit it. Understanding the diamond-shaped constraint set and piecewise linear paths leads to algorithms orders of magnitude faster than black-box optimization.
We've developed deep geometric intuition for why L1 regularization produces sparse solutions: the diamond-shaped constraint set has corners on the coordinate axes, expanding loss ellipses tend to make first contact at those corners, the penalized and constrained formulations are two views of the same contact geometry, and the effect persists—indeed strengthens—in high dimensions.
What's Next:
We've established the mathematical formulation, sparsity mechanics, and geometric interpretation. The next page addresses a crucial practical question: How do we actually solve the Lasso problem? We'll explore coordinate descent, the proximal gradient method (ISTA/FISTA), and the LARS algorithm—each exploiting the geometric insights developed here.
You now have geometric intuition for L1 sparsity: the diamond catches expanding ellipses at corners, corners are stable attractors, and this geometry persists and strengthens in high dimensions. Next, we'll translate this understanding into practical algorithms.