Unlike ridge regression, the Lasso does not have a closed-form solution. The L1 norm introduces non-differentiability at zero, preventing the simple matrix inversion approach. We must resort to iterative optimization algorithms.
This limitation is not a setback—it's an opportunity. The structure of the Lasso problem admits specialized algorithms that are remarkably efficient, exploiting the geometry we've developed. The workhorse algorithm for Lasso is coordinate descent, which leverages two facts: the L1 penalty is separable across coordinates, and each one-coordinate subproblem has a closed-form soft-thresholding solution.
This page develops coordinate descent in detail, then surveys alternative approaches: proximal gradient methods (ISTA/FISTA) and the path algorithm LARS.
By the end of this page, you will understand the coordinate descent algorithm in depth, be able to implement it efficiently, appreciate its convergence properties, and know when to choose alternative algorithms like ISTA or LARS.
Before developing specialized algorithms, let's understand why standard optimization methods struggle with the Lasso.
Gradient Descent Failure:
Standard gradient descent requires computing $\nabla f(\boldsymbol{\beta})$. But the L1 norm $|\boldsymbol{\beta}|_1 = \sum_j |\beta_j|$ has no gradient at $\beta_j = 0$:
$$\frac{d}{d\beta_j}|\beta_j| = \begin{cases} +1 & \beta_j > 0 \\ -1 & \beta_j < 0 \\ \text{undefined} & \beta_j = 0 \end{cases}$$
We can use subgradients (replacing gradients with elements of the subdifferential), but subgradient methods converge slowly—typically $O(1/\sqrt{k})$ to $O(1/k)$ compared to $O(\rho^k)$ for smooth problems.
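Concretely, the subdifferential of the absolute value collects every valid slope at each point:

$$\partial|\beta_j| = \begin{cases} \{+1\} & \beta_j > 0 \\ \{-1\} & \beta_j < 0 \\ [-1, +1] & \beta_j = 0 \end{cases}$$

A subgradient method must pick some element of this set at zero, and because no single choice reflects the local geometry, step sizes have to shrink over time—this is what produces the slow sublinear rates.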
Newton's Method Failure:
Newton's method requires the Hessian $\nabla^2 f(\boldsymbol{\beta})$. The L1 norm has no second derivative—it's not even first-order differentiable everywhere. Attempting Newton's method on Lasso is fundamentally ill-posed.
Why Special Structure Helps:

Although the Lasso objective is non-smooth, its non-smoothness is confined to a simple, separable penalty term, and that is exactly the structure specialized algorithms exploit.
The Lasso Objective Structure:
Write the Lasso objective as:
$$F(\boldsymbol{\beta}) = \underbrace{\frac{1}{2n}|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2}_{f(\boldsymbol{\beta}):\ \text{smooth, convex}} + \underbrace{\lambda|\boldsymbol{\beta}|_1}_{g(\boldsymbol{\beta}):\ \text{non-smooth, convex, separable}}$$
This "$f + g$" structure is the key. We can apply proximal methods that handle $f$ with gradients and $g$ with its proximal operator. Or we can apply coordinate descent, which exploits separability to obtain closed-form coordinate updates.
The Lasso is a prototypical 'composite' optimization problem. The theory and algorithms developed for Lasso extend to many ML problems sharing this structure: elastic net, group lasso, total variation regularization, sparse logistic regression, and more.
Coordinate descent is the method of choice for Lasso in practice. It's simple to implement, surprisingly fast, and scales well to high-dimensional problems.
The Core Idea:
Instead of optimizing all coordinates simultaneously, optimize one coordinate at a time while holding others fixed. Cycle through coordinates repeatedly until convergence.
Why This Works for Lasso:
When all coordinates except $\beta_j$ are fixed, the Lasso objective becomes a univariate optimization problem in $\beta_j$:
$$\min_{\beta_j} \left\{ \frac{1}{2n}|\mathbf{r}_{-j} - \beta_j \mathbf{x}_j|_2^2 + \lambda|\beta_j| \right\}$$
where $\mathbf{r}_{-j} = \mathbf{y} - \sum_{k \neq j} \beta_k \mathbf{x}_k$ is the partial residual (residual excluding feature $j$).
This is a 1D Lasso problem with a closed-form solution: soft thresholding!
The Soft Thresholding Update:
Let $\tilde{\beta}_j = \frac{1}{n}\mathbf{x}_j^T\mathbf{r}_{-j}$ be the OLS coefficient for feature $j$ regressed against the partial residual. If features are standardized ($\frac{1}{n}|\mathbf{x}_j|_2^2 = 1$):
$$\beta_j^{\text{new}} = S_\lambda(\tilde{\beta}_j) = \text{sign}(\tilde{\beta}_j) \max(|\tilde{\beta}_j| - \lambda, 0)$$
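A quick numeric check of the operator with $\lambda = 0.5$:

$$S_{0.5}(1.2) = 0.7, \qquad S_{0.5}(-0.3) = 0, \qquad S_{0.5}(-2.0) = -1.5$$

Small partial correlations are zeroed out exactly; larger ones are shrunk toward zero by the threshold.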
```python
import numpy as np


def soft_threshold(z, threshold):
    """Soft thresholding operator (proximal of L1 norm)."""
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0)


def lasso_coordinate_descent(X, y, lambda_, max_iters=1000, tol=1e-6):
    """
    Solve Lasso via coordinate descent.

    Minimizes: (1/2n)||y - Xβ||² + λ||β||₁

    Parameters
    ----------
    X : ndarray of shape (n, p)
        Design matrix (assumed standardized: columns have zero mean and
        unit variance, so (1/n)||x_j||² = 1)
    y : ndarray of shape (n,)
        Response vector (assumed centered)
    lambda_ : float
        Regularization parameter
    max_iters : int
        Maximum number of full cycles through coordinates
    tol : float
        Convergence tolerance (max absolute change in coefficients)

    Returns
    -------
    beta : ndarray of shape (p,)
        Estimated coefficients
    history : list
        Objective values at each iteration
    """
    n, p = X.shape
    beta = np.zeros(p)
    residual = y.copy()  # r = y - Xβ, initially β = 0
    history = []

    for iteration in range(max_iters):
        beta_old = beta.copy()

        for j in range(p):
            # Add back contribution of feature j to the residual
            residual += X[:, j] * beta[j]

            # OLS coefficient of feature j against the partial residual:
            # z_j = x_j' r_{-j} / n
            z_j = X[:, j] @ residual / n

            # Apply soft thresholding
            beta[j] = soft_threshold(z_j, lambda_)

            # Update residual with the new coefficient
            residual -= X[:, j] * beta[j]

        # Compute objective (optional, for monitoring)
        mse = np.sum(residual ** 2) / (2 * n)
        l1_penalty = lambda_ * np.sum(np.abs(beta))
        objective = mse + l1_penalty
        history.append(objective)

        # Check convergence
        if np.max(np.abs(beta - beta_old)) < tol:
            print(f"Converged at iteration {iteration + 1}")
            break

    return beta, history


# Example usage
np.random.seed(42)
n, p = 100, 20
X = np.random.randn(n, p)
# Standardize columns: zero mean, unit variance, so (1/n)||x_j||² = 1
X = (X - X.mean(axis=0)) / X.std(axis=0)

# True sparse coefficients
beta_true = np.zeros(p)
beta_true[:5] = [3, -2, 1.5, -1, 0.5]
y = X @ beta_true + 0.3 * np.random.randn(n)
y = y - y.mean()  # Center response

# Solve
lambda_ = 0.1
beta_hat, history = lasso_coordinate_descent(X, y, lambda_)

print(f"\nTrue non-zeros: {np.sum(beta_true != 0)}")
print(f"Estimated non-zeros: {np.sum(np.abs(beta_hat) > 1e-10)}")
print(f"\nTrue coefficients: {beta_true[:5]}")
print(f"Estimated coefficients: {beta_hat[:5].round(3)}")
```

The key efficiency trick is maintaining the residual vector r = y - Xβ incrementally. Each coordinate update changes only one term: r_new = r_old + x_j(β_j^old - β_j^new). This avoids recomputing the full matrix-vector product Xβ at each step.
Let's derive the coordinate descent update rigorously, handling the case of non-standardized features.
Setup:
We minimize over $\beta_j$ while holding $\boldsymbol{\beta}_{-j}$ (all other coefficients) fixed:
$$\min_{\beta_j} \underbrace{\frac{1}{2n}|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2}_{\text{quadratic in } \beta_j} + \lambda|\beta_j|$$
Expanding the Quadratic:
Let $\mathbf{r}_{-j} = \mathbf{y} - \sum_{k \neq j} \beta_k \mathbf{x}_k$ be the partial residual. Then:
$$|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2 = |\mathbf{r}_{-j} - \beta_j \mathbf{x}_j|_2^2 = |\mathbf{r}_{-j}|_2^2 - 2\beta_j \mathbf{x}_j^T \mathbf{r}_{-j} + \beta_j^2 |\mathbf{x}_j|_2^2$$
The Univariate Objective:
$$h(\beta_j) = \frac{1}{2n}\left(|\mathbf{r}_{-j}|_2^2 - 2\beta_j \mathbf{x}_j^T \mathbf{r}_{-j} + \beta_j^2 |\mathbf{x}_j|_2^2\right) + \lambda|\beta_j|$$
Dropping constants and rearranging:
$$h(\beta_j) = \frac{|\mathbf{x}_j|_2^2}{2n}\left(\beta_j - \frac{\mathbf{x}_j^T \mathbf{r}_{-j}}{|\mathbf{x}_j|_2^2}\right)^2 + \lambda|\beta_j| + \text{const}$$
Defining Key Quantities:

Let $c_j = \frac{1}{n}|\mathbf{x}_j|_2^2$ (the scaled squared norm of feature $j$) and $\tilde{\beta}_j = \frac{\mathbf{x}_j^T \mathbf{r}_{-j}}{|\mathbf{x}_j|_2^2}$ (the coefficient from regressing the partial residual on feature $j$ alone). In this notation, $h(\beta_j) = \frac{c_j}{2}(\beta_j - \tilde{\beta}_j)^2 + \lambda|\beta_j| + \text{const}$.
The Update Formula:
The univariate problem $\min_{\beta_j} \frac{c_j}{2}(\beta_j - \tilde{\beta}_j)^2 + \lambda|\beta_j|$ has solution:
$$\beta_j^{\text{new}} = S_{\lambda/c_j}(\tilde{\beta}_j) = \text{sign}(\tilde{\beta}_j) \max\left(|\tilde{\beta}_j| - \frac{\lambda}{c_j}, 0\right)$$
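For completeness, here is the short case analysis behind that formula (a standard derivation, with $s$ a subgradient of $|\beta_j|$). Setting the subgradient of the univariate objective to zero gives $c_j(\beta_j - \tilde{\beta}_j) + \lambda s = 0$, and the three cases are:

$$\beta_j^{\text{new}} = \begin{cases} \tilde{\beta}_j - \lambda/c_j & \tilde{\beta}_j > \lambda/c_j \\ 0 & |\tilde{\beta}_j| \leq \lambda/c_j \\ \tilde{\beta}_j + \lambda/c_j & \tilde{\beta}_j < -\lambda/c_j \end{cases}$$

which is exactly $S_{\lambda/c_j}(\tilde{\beta}_j)$.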
Simplified Form (Standardized Features):
If $\frac{1}{n}|\mathbf{x}_j|_2^2 = 1$ (standardized), then $c_j = 1$ and:
$$\beta_j^{\text{new}} = S_\lambda\left(\frac{1}{n}\mathbf{x}_j^T \mathbf{r}_{-j}\right)$$
This is the formula implemented in the algorithm above.
General Form (Non-standardized Features):
$$\beta_j^{\text{new}} = S_{n\lambda/|\mathbf{x}_j|_2^2}\left(\frac{\mathbf{x}_j^T \mathbf{r}_{-j}}{|\mathbf{x}_j|_2^2}\right)$$
Or equivalently:
$$\beta_j^{\text{new}} = S_{n\lambda}\left(\mathbf{x}_j^T \mathbf{r}_{-j}\right) / |\mathbf{x}_j|_2^2$$
Most Lasso implementations assume standardized features. Using non-standardized features with these implementations gives incorrect results. Always check your implementation's assumptions and preprocess data accordingly.
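As a minimal preprocessing sketch matching the convention used on this page ($\frac{1}{n}|\mathbf{x}_j|_2^2 = 1$ and a centered response); the function name and return layout are illustrative:

```python
import numpy as np


def standardize_for_lasso(X, y):
    """Center y and scale X so each column has zero mean and (1/n)*||x_j||^2 = 1."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean = X.mean(axis=0)
    x_std = X.std(axis=0)          # population std, so (1/n)*sum of squares = 1 after scaling
    X_std = (X - x_mean) / x_std
    y_mean = y.mean()
    y_centered = y - y_mean
    # Return the statistics so fitted coefficients can be mapped back to the original scale
    return X_std, y_centered, x_mean, x_std, y_mean
```

Coefficients fitted on the standardized data map back to the original scale via $\beta_j^{\text{orig}} = \beta_j^{\text{std}} / \text{std}(x_j)$, with the intercept recovered from the stored means.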
A natural question: Does coordinate descent converge? And if so, how fast?
Convergence Guarantee:
For the Lasso objective (convex with separable non-smooth part), coordinate descent converges to the global optimum. This is guaranteed by:
Theorem (Tseng, 2001): For minimizing $f(\boldsymbol{\beta}) + \sum_j g_j(\beta_j)$ where $f$ is convex, differentiable with Lipschitz gradient, and each $g_j$ is convex (not necessarily smooth), coordinate descent converges to the global minimum.
The Lasso satisfies these conditions with $f(\boldsymbol{\beta}) = \frac{1}{2n}|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2$ and $g_j(\beta_j) = \lambda|\beta_j|$.
Convergence Rate:
| Setting | Rate | Notes |
|---|---|---|
| General convex | $O(1/k)$ | Objective suboptimality after $k$ iterations |
| Strongly convex $f$ | $O(\rho^k)$, $\rho < 1$ | Linear (geometric) convergence |
| Lasso (typical) | $O(\rho^k)$ | Effectively linear due to restricted strong convexity |
| High correlation | Slower | Condition number affects rate |
| Path computation | Very fast | Warm starting from nearby λ values |
Linear Convergence for Lasso:
Although the Lasso objective is not strongly convex (the L1 norm has zero curvature), coordinate descent often exhibits linear convergence in practice. This is because:
Restricted Strong Convexity (RSC): The objective is effectively strongly convex in directions that matter, even if not globally.
Finite Active Set: After some iterations, the set of non-zero coefficients stabilizes. On this subspace, the objective is strongly convex (just least squares).
Exact Zeros: Once $\beta_j = 0$ and the optimality condition is satisfied, it stays zero—no oscillation.
Practical Convergence:
In practice, coordinate descent for Lasso is remarkably fast: a handful of full sweeps usually suffices for a single $\lambda$, and warm-started path solves (discussed below) often need only one to three sweeps per grid point.
Stopping Criteria:
Common convergence criteria include a small maximum absolute change in the coefficients between sweeps, a small relative decrease in the objective value, and direct verification of the KKT optimality conditions.
The KKT conditions provide exact optimality verification: for active variables, check gradient = ±λ; for inactive, check |gradient| ≤ λ. This is more informative than heuristic stopping rules and catches subtle convergence failures.
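A minimal sketch of such a KKT check for the $\frac{1}{2n}$-scaled objective used throughout this page (the function name and tolerance are illustrative):

```python
import numpy as np


def kkt_violation(X, y, beta, lambda_, active_tol=1e-10):
    """
    Largest KKT violation for (1/2n)||y - X beta||^2 + lambda * ||beta||_1.

    Optimality requires:
      - beta_j != 0:  (1/n) * x_j^T (y - X beta) == lambda * sign(beta_j)
      - beta_j == 0:  |(1/n) * x_j^T (y - X beta)| <= lambda
    """
    n = X.shape[0]
    corr = X.T @ (y - X @ beta) / n            # feature-residual correlations
    active = np.abs(beta) > active_tol

    # Active coordinates: correlation must equal lambda * sign(beta_j)
    viol_active = np.abs(corr[active] - lambda_ * np.sign(beta[active]))
    # Inactive coordinates: |correlation| must not exceed lambda
    viol_inactive = np.maximum(np.abs(corr[~active]) - lambda_, 0)

    return max(viol_active.max(initial=0), viol_inactive.max(initial=0))
```

A solution is accepted when this value falls below a small tolerance; this catches the subtle failures that coefficient-change or objective-change criteria can miss.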
Naive coordinate descent is already good. Optimized coordinate descent is exceptional. Here are key implementation techniques:
1. Warm Starting:
When computing solutions for multiple $\lambda$ values (the Lasso path), initialize each solve with the solution from the previous $\lambda$:
For dense grids with ratio $\lambda_{k+1}/\lambda_k \approx 0.9$, each solve typically requires only 1-3 iterations.
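A self-contained sketch of a warm-started path solve; it repeats the standardized-coordinate update from the solver above so it can run on its own, and the grid parameters are illustrative:

```python
import numpy as np


def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0)


def lasso_path_warm(X, y, n_lambdas=100, ratio=0.9, max_sweeps=50, tol=1e-6):
    """Coordinate descent over a geometric lambda grid, warm starting each solve.

    Assumes standardized X (zero-mean columns with (1/n)*||x_j||^2 = 1) and centered y.
    """
    n, p = X.shape
    lambda_max = np.max(np.abs(X.T @ y)) / n      # smallest lambda giving the all-zero solution
    lambdas = lambda_max * ratio ** np.arange(n_lambdas)

    beta = np.zeros(p)
    residual = y.copy()
    path = []
    for lam in lambdas:
        # beta and residual carry over from the previous lambda: the warm start
        for _ in range(max_sweeps):
            max_change = 0.0
            for j in range(p):
                old = beta[j]
                residual += X[:, j] * old
                beta[j] = soft_threshold(X[:, j] @ residual / n, lam)
                residual -= X[:, j] * beta[j]
                max_change = max(max_change, abs(beta[j] - old))
            if max_change < tol:
                break
        path.append(beta.copy())
    return lambdas, np.array(path)
```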
2. Active Set Strategy:
Focus computation on non-zero coefficients:
```python
import numpy as np


def lasso_optimized(X, y, lambda_, max_iters=1000, tol=1e-6):
    """
    Optimized Lasso coordinate descent with active set strategy.

    Key optimizations:
    1. Maintain residual incrementally
    2. Active set cycling (focus on non-zeros)
    3. Periodic full sweeps to check for new active variables
    """
    n, p = X.shape
    beta = np.zeros(p)
    residual = y.copy()

    # Precompute column norms (for non-standardized data)
    col_norms_sq = np.sum(X ** 2, axis=0)

    # Active set: indices of currently non-zero coefficients
    active_set = set()

    for iteration in range(max_iters):
        beta_old = beta.copy()

        # PHASE 1: Cycle through the active set (cheap)
        for _ in range(10):  # Multiple inner iterations on the active set
            for j in list(active_set):
                beta_j_old = beta[j]
                residual += X[:, j] * beta_j_old
                z_j = X[:, j] @ residual
                threshold = n * lambda_ / col_norms_sq[j]
                beta[j] = np.sign(z_j) * max(abs(z_j) / col_norms_sq[j] - threshold, 0)
                residual -= X[:, j] * beta[j]

                # Update active set
                if abs(beta[j]) < 1e-15:
                    active_set.discard(j)

        # PHASE 2: Full sweep to check inactive variables (periodic)
        max_violation = 0
        for j in range(p):
            if j in active_set:
                continue

            # Check KKT condition for inactive variable
            gradient_j = -X[:, j] @ residual / n
            violation = max(0, abs(gradient_j) - lambda_)
            if violation > max_violation:
                max_violation = violation

            # If violated, add to active set and update
            if abs(gradient_j) > lambda_ + tol:
                residual += X[:, j] * beta[j]
                z_j = X[:, j] @ residual
                threshold = n * lambda_ / col_norms_sq[j]
                beta[j] = np.sign(z_j) * max(abs(z_j) / col_norms_sq[j] - threshold, 0)
                residual -= X[:, j] * beta[j]
                if abs(beta[j]) > 1e-15:
                    active_set.add(j)

        # Convergence check
        if np.max(np.abs(beta - beta_old)) < tol and max_violation < tol:
            print(f"Converged at iteration {iteration + 1}")
            print(f"Active set size: {len(active_set)}")
            break

    return beta


# The active set strategy is crucial for high-dimensional problems.
# When p >> n and the solution is sparse, most time is spent on ~k active
# variables instead of cycling through all p variables.
```

3. Safe Screening Rules:
Before even starting optimization, eliminate features that are guaranteed to have zero coefficients:
SAFE Rule: If $|\mathbf{x}_j^T\mathbf{y}| \leq n\lambda - |\mathbf{x}_j|_2 \cdot |\mathbf{y}|_2 \cdot \theta$, then $\hat{\beta}_j = 0$.
Here $\theta$ is a geometric bound depending on the current and target $\lambda$ values.
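A sketch of the rule with one classical choice of the bound, $\theta = (\lambda_{\max} - \lambda)/\lambda_{\max}$ where $\lambda_{\max} = \max_j |\mathbf{x}_j^T\mathbf{y}|/n$; treat the exact constant as an assumption to verify against your own scaling conventions:

```python
import numpy as np


def safe_discard_mask(X, y, lambda_):
    """Boolean mask of features the SAFE rule certifies as inactive (beta_j = 0)."""
    n = X.shape[0]
    corr = np.abs(X.T @ y)                          # |x_j^T y|
    lambda_max = corr.max() / n
    theta = (lambda_max - lambda_) / lambda_max     # assumed geometric bound for this lambda
    bound = n * lambda_ - np.linalg.norm(X, axis=0) * np.linalg.norm(y) * theta
    return corr <= bound                            # True => safe to discard feature j
```

Features flagged by the mask can be dropped from the design matrix before optimization and, unlike strong-rule discards, never need to be re-checked for that $\lambda$.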
4. Matrix-Free Updates:
For very large $p$, store $\mathbf{X}$ in a sparse format or use matrix-free products, computing $\mathbf{x}_j^T\mathbf{r}$ on the fly instead of materializing large intermediate matrices.
5. Strong Rules (Heuristic Screening):
STRONG rule: Assume features with $|\mathbf{x}_j^T\mathbf{r}| < 2\lambda - \lambda_{\text{prev}}$ will have $\beta_j = 0$.
This is not safe (can miss active features), so verify with KKT conditions after convergence. But it's very effective in practice—often eliminates 99% of features.
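A sketch of the sequential strong-rule filter using $\frac{1}{n}$-scaled correlations, consistent with the rest of this page (the residual from the previous $\lambda$ is assumed available from the path solve):

```python
import numpy as np


def strong_rule_candidates(X, residual_prev, lambda_, lambda_prev):
    """
    Sequential strong rule: keep feature j only if
        |x_j^T r(lambda_prev)| / n >= 2*lambda - lambda_prev.
    Discarded features are assumed (not guaranteed) inactive, so the fit
    must be followed by a KKT check that re-adds any violators.
    """
    n = X.shape[0]
    corr = np.abs(X.T @ residual_prev) / n
    keep = corr >= 2 * lambda_ - lambda_prev
    return np.where(keep)[0]          # candidate feature indices to optimize over
```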
Libraries like glmnet (R/Python), scikit-learn, and liblinear implement all these optimizations. For $p = 10^6$ features, optimized coordinate descent can compute the full regularization path in seconds. Don't implement from scratch for production—use these battle-tested libraries.
Proximal gradient methods offer an alternative to coordinate descent, particularly useful when the smooth part $f$ has structure that coordinate descent can't exploit.
The Proximal Gradient Framework:
For objectives $F(\boldsymbol{\beta}) = f(\boldsymbol{\beta}) + g(\boldsymbol{\beta})$ where $f$ is smooth and $g$ is convex:
$$\boldsymbol{\beta}^{(k+1)} = \text{prox}_{t_k g}\left(\boldsymbol{\beta}^{(k)} - t_k \nabla f(\boldsymbol{\beta}^{(k)})\right)$$
where $t_k > 0$ is the step size and the proximal operator is:
$$\text{prox}_{t g}(\mathbf{z}) = \arg\min_{\mathbf{u}} \left\{ g(\mathbf{u}) + \frac{1}{2t}|\mathbf{u} - \mathbf{z}|_2^2 \right\}$$
For Lasso:
With $g(\boldsymbol{\beta}) = \lambda|\boldsymbol{\beta}|_1$, the proximal operator is component-wise soft thresholding:
$$\text{prox}_{t\lambda|\cdot|_1}(\mathbf{z}) = S_{t\lambda}(\mathbf{z})$$
ISTA (Iterative Shrinkage-Thresholding Algorithm):
$$\boldsymbol{\beta}^{(k+1)} = S_{t\lambda}\left(\boldsymbol{\beta}^{(k)} + \frac{t}{n}\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}^{(k)})\right)$$
where $t = 1/L$ with $L = |\mathbf{X}|_2^2/n$ being the Lipschitz constant of $\nabla f$.
```python
import numpy as np


def soft_threshold(z, threshold):
    """Component-wise soft thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0)


def ista(X, y, lambda_, max_iters=1000, tol=1e-6):
    """
    ISTA: Iterative Shrinkage-Thresholding Algorithm.

    Basic proximal gradient descent for Lasso.
    Convergence rate: O(1/k)
    """
    n, p = X.shape
    beta = np.zeros(p)

    # Step size: 1/L where L = ||X||^2 / n
    L = np.linalg.norm(X, ord=2) ** 2 / n
    t = 1 / L

    for k in range(max_iters):
        beta_old = beta.copy()

        # Gradient step
        gradient = -X.T @ (y - X @ beta) / n
        z = beta - t * gradient

        # Proximal step (soft thresholding)
        beta = soft_threshold(z, t * lambda_)

        if np.max(np.abs(beta - beta_old)) < tol:
            print(f"ISTA converged at iteration {k + 1}")
            break

    return beta


def fista(X, y, lambda_, max_iters=1000, tol=1e-6):
    """
    FISTA: Fast ISTA with Nesterov acceleration.

    Accelerated proximal gradient descent for Lasso.
    Convergence rate: O(1/k^2) - quadratically faster than ISTA!
    """
    n, p = X.shape
    beta = np.zeros(p)
    beta_prev = beta.copy()

    L = np.linalg.norm(X, ord=2) ** 2 / n
    t = 1 / L

    # Nesterov momentum parameters
    tau = 1
    tau_prev = 1

    for k in range(max_iters):
        # Momentum term
        momentum = (tau_prev - 1) / tau
        z = beta + momentum * (beta - beta_prev)

        # Gradient step
        gradient = -X.T @ (y - X @ z) / n
        v = z - t * gradient

        # Proximal step
        beta_new = soft_threshold(v, t * lambda_)

        # Convergence check
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new  # keep the latest iterate
            print(f"FISTA converged at iteration {k + 1}")
            break

        # Update momentum parameter
        tau_prev = tau
        tau = (1 + np.sqrt(1 + 4 * tau ** 2)) / 2

        beta_prev = beta
        beta = beta_new

    return beta


# Comparison
np.random.seed(42)
n, p = 500, 200
X = np.random.randn(n, p)
X = X / np.sqrt(n)
beta_true = np.zeros(p)
beta_true[:10] = np.random.randn(10)
y = X @ beta_true + 0.5 * np.random.randn(n)

lambda_ = 0.1

print("ISTA:")
beta_ista = ista(X, y, lambda_, max_iters=5000)

print("\nFISTA:")
beta_fista = fista(X, y, lambda_, max_iters=5000)

print(f"\nNon-zeros (ISTA): {np.sum(np.abs(beta_ista) > 1e-10)}")
print(f"Non-zeros (FISTA): {np.sum(np.abs(beta_fista) > 1e-10)}")
```

| Algorithm | Convergence | Per-Iteration Cost | Best For |
|---|---|---|---|
| Coord Descent | $O(\rho^k)$ (linear) | $O(np)$ per cycle | Standard Lasso, path computation |
| ISTA | $O(1/k)$ (sublinear) | $O(np)$ per iteration | Simple implementation, theory |
| FISTA | $O(1/k^2)$ (accelerated) | $O(np)$ per iteration | Faster than ISTA, still simple |
| LARS | Exact path | $O(np \min(n,p))$ total | Full path, feature entry order |
Coordinate descent often beats ISTA/FISTA for Lasso due to its ability to exploit sparsity. Proximal methods shine when: (1) the problem structure doesn't separate across coordinates, (2) distributed/parallel implementation is needed, or (3) extensions beyond L1 (e.g., total variation, nuclear norm) are used.
LARS (Least Angle Regression) is an elegant algorithm that computes the entire Lasso path—all solutions as $\lambda$ varies from $\lambda_{\max}$ to 0—in one shot.
Key Insight: Piecewise Linear Path
The Lasso solution path $\hat{\boldsymbol{\beta}}(\lambda)$ is piecewise linear in $\lambda$. Between certain critical values of $\lambda$, coefficients change at constant rates. This means the entire path can be recovered exactly by finding the breakpoints where the active set changes and interpolating linearly between them.
LARS Algorithm Overview:

Start with all coefficients at zero and find the feature most correlated with the residual. Move that coefficient toward its least-squares value until a second feature attains the same absolute correlation with the residual; add it to the active set and continue along the direction that keeps the active correlations tied, repeating until all features have entered or the correlations reach zero.
The Equicorrelation Property:
At any point on the Lasso path, all active features have the same absolute correlation with the current residual:
$$|\mathbf{x}_j^T\mathbf{r}| = \lambda \quad \text{for all } j \in \mathcal{A}$$
where $\mathcal{A}$ is the active set. This is the equicorrelation property.
LARS maintains this property exactly, moving coefficients together keeping correlations equal. When a new feature's correlation "catches up" to the active correlation, it joins the active set.
LARS vs. Lasso Path:
Plain LARS never removes a feature once it has entered the active set, so its coefficient paths can cross zero without being dropped. The LARS-Lasso modification handles this sign change: when an active coefficient hits zero, it is removed from the active set and the direction is recomputed, and with this change LARS traces the exact Lasso path.
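As a usage sketch with scikit-learn's `lars_path` (here `method='lasso'` selects the LARS-Lasso variant that allows drops; the toy data is illustrative):

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# alphas: the lambda values (knots) at which the active set changes
# coefs:  coefficients at each knot, shape (p, n_knots); the path is linear in between
alphas, active, coefs = lars_path(X, y, method="lasso")

print("Number of knots:", len(alphas))
print("Active features at the end of the path:", active)
print("Coefficients at the smallest lambda:", coefs[:, -1].round(2))
```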
Computational Complexity:

LARS computes the entire path in $O(np \min(n,p))$ operations, the same order of cost as a single ordinary least squares fit on the full data.
LARS is ideal when you need the full solution path for model selection, want to understand feature entry order, or need exact solutions. For a single λ value, coordinate descent is simpler. For very high dimensions with warm starting, coordinate descent with screening can be faster than LARS.
We've covered the algorithmic landscape for Lasso optimization. Here's what you should take away:
Algorithm Selection Guide:

Use coordinate descent (with warm starts and screening) for standard Lasso fits and regularization paths; use ISTA/FISTA when you want a simple proximal scheme, need parallel or distributed updates, or are working with penalties that do not separate across coordinates; use LARS when you need the exact path or the order in which features enter the model.
What's Next:
We've covered formulation, sparsity mechanics, geometry, and algorithms. The final page addresses the feature selection aspect—how Lasso's sparsity translates to practical variable selection, its reliability, and connections to statistical inference.
You now understand how to solve Lasso computationally: coordinate descent with soft thresholding updates, convergence guarantees, practical optimizations, and alternative methods (ISTA/FISTA, LARS). You're equipped to implement and understand production Lasso solvers.