Unlike ridge regression, the Lasso does not have a closed-form solution. The L1 norm introduces non-differentiability at zero, preventing the simple matrix inversion approach. We must resort to iterative optimization algorithms.
This limitation is not a setback—it's an opportunity. The structure of the Lasso problem admits specialized algorithms that are remarkably efficient, exploiting the geometry we've developed. The workhorse algorithm for Lasso is coordinate descent, which leverages two facts: the L1 penalty is separable across coordinates, and each one-coordinate subproblem has a closed-form soft-thresholding solution.
This page develops coordinate descent in detail, then surveys alternative approaches: proximal gradient methods (ISTA/FISTA) and the path algorithm LARS.
By the end of this page, you will understand the coordinate descent algorithm in depth, be able to implement it efficiently, appreciate its convergence properties, and know when to choose alternative algorithms like ISTA or LARS.
Before developing specialized algorithms, let's understand why standard optimization methods struggle with the Lasso.
Gradient Descent Failure:
Standard gradient descent requires computing $\nabla f(\boldsymbol{\beta})$. But the L1 norm $|\boldsymbol{\beta}|_1 = \sum_j |\beta_j|$ has no gradient at $\beta_j = 0$:
$$\frac{d}{d\beta_j}|\beta_j| = \begin{cases} +1 & \beta_j > 0 \\ -1 & \beta_j < 0 \\ \text{undefined} & \beta_j = 0 \end{cases}$$
We can use subgradients (replacing gradients with elements of the subdifferential), but subgradient methods converge slowly—typically $O(1/\sqrt{k})$ to $O(1/k)$ compared to $O(\rho^k)$ for smooth problems.
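Concretely, the subdifferential of the absolute value collects every valid slope at each point:

$$\partial|\beta_j| = \begin{cases} \{+1\} & \beta_j > 0 \\ \{-1\} & \beta_j < 0 \\ [-1, +1] & \beta_j = 0 \end{cases}$$

A subgradient method must pick some element of this set at zero, and because no single choice reflects the local geometry, step sizes have to shrink over time—this is what produces the slow sublinear rates.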
Newton's Method Failure:
Newton's method requires the Hessian $\nabla^2 f(\boldsymbol{\beta})$. The L1 norm has no second derivative—it's not even first-order differentiable everywhere. Attempting Newton's method on Lasso is fundamentally ill-posed.
Why Special Structure Helps:

Although the Lasso objective is non-smooth, its non-smoothness is confined to a simple, separable penalty term, and that is exactly the structure specialized algorithms exploit.
The Lasso Objective Structure:
Write the Lasso objective as:
$$F(\boldsymbol{\beta}) = \underbrace{\frac{1}{2n}|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2}_{f(\boldsymbol{\beta}):\ \text{smooth, convex}} + \underbrace{\lambda|\boldsymbol{\beta}|_1}_{g(\boldsymbol{\beta}):\ \text{non-smooth, convex, separable}}$$
This "$f + g$" structure is the key. We can apply proximal methods that handle $f$ with gradients and $g$ with its proximal operator. Or we can apply coordinate descent, which exploits separability to obtain closed-form coordinate updates.
The Lasso is a prototypical 'composite' optimization problem. The theory and algorithms developed for Lasso extend to many ML problems sharing this structure: elastic net, group lasso, total variation regularization, sparse logistic regression, and more.
Coordinate descent is the method of choice for Lasso in practice. It's simple to implement, surprisingly fast, and scales well to high-dimensional problems.
The Core Idea:
Instead of optimizing all coordinates simultaneously, optimize one coordinate at a time while holding others fixed. Cycle through coordinates repeatedly until convergence.
Why This Works for Lasso:
When all coordinates except $\beta_j$ are fixed, the Lasso objective becomes a univariate optimization problem in $\beta_j$:
$$\min_{\beta_j} \left\{ \frac{1}{2n}|\mathbf{r}_{-j} - \beta_j \mathbf{x}_j|_2^2 + \lambda|\beta_j| \right\}$$
where $\mathbf{r}_{-j} = \mathbf{y} - \sum_{k \neq j} \beta_k \mathbf{x}_k$ is the partial residual (residual excluding feature $j$).
This is a 1D Lasso problem with a closed-form solution: soft thresholding!
The Soft Thresholding Update:
Let $\tilde{\beta}_j = \frac{1}{n}\mathbf{x}_j^T\mathbf{r}_{-j}$ be the OLS coefficient for feature $j$ regressed against the partial residual. If features are standardized ($\frac{1}{n}|\mathbf{x}_j|_2^2 = 1$):
$$\beta_j^{\text{new}} = S_\lambda(\tilde{\beta}_j) = \text{sign}(\tilde{\beta}_j) \max(|\tilde{\beta}_j| - \lambda, 0)$$
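A quick numeric check of the operator with $\lambda = 0.5$:

$$S_{0.5}(1.2) = 0.7, \qquad S_{0.5}(-0.3) = 0, \qquad S_{0.5}(-2.0) = -1.5$$

Small partial correlations are zeroed out exactly; larger ones are shrunk toward zero by the threshold.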
```python
import numpy as np


def soft_threshold(z, threshold):
    """Soft thresholding operator (proximal of L1 norm)."""
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0)


def lasso_coordinate_descent(X, y, lambda_, max_iters=1000, tol=1e-6):
    """
    Solve Lasso via coordinate descent.

    Minimizes: (1/2n)||y - Xβ||² + λ||β||₁

    Parameters
    ----------
    X : ndarray of shape (n, p)
        Design matrix (assumed standardized: columns have zero mean and
        unit variance, so (1/n)||x_j||² = 1)
    y : ndarray of shape (n,)
        Response vector (assumed centered)
    lambda_ : float
        Regularization parameter
    max_iters : int
        Maximum number of full cycles through coordinates
    tol : float
        Convergence tolerance (max absolute change in coefficients)

    Returns
    -------
    beta : ndarray of shape (p,)
        Estimated coefficients
    history : list
        Objective values at each iteration
    """
    n, p = X.shape
    beta = np.zeros(p)
    residual = y.copy()  # r = y - Xβ, initially β = 0
    history = []

    for iteration in range(max_iters):
        beta_old = beta.copy()

        for j in range(p):
            # Add back contribution of feature j to the residual
            residual += X[:, j] * beta[j]

            # OLS coefficient of feature j against the partial residual:
            # z_j = x_j' r_{-j} / n
            z_j = X[:, j] @ residual / n

            # Apply soft thresholding
            beta[j] = soft_threshold(z_j, lambda_)

            # Update residual with the new coefficient
            residual -= X[:, j] * beta[j]

        # Compute objective (optional, for monitoring)
        mse = np.sum(residual ** 2) / (2 * n)
        l1_penalty = lambda_ * np.sum(np.abs(beta))
        objective = mse + l1_penalty
        history.append(objective)

        # Check convergence
        if np.max(np.abs(beta - beta_old)) < tol:
            print(f"Converged at iteration {iteration + 1}")
            break

    return beta, history


# Example usage
np.random.seed(42)
n, p = 100, 20
X = np.random.randn(n, p)
# Standardize columns: zero mean, unit variance, so (1/n)||x_j||² = 1
X = (X - X.mean(axis=0)) / X.std(axis=0)

# True sparse coefficients
beta_true = np.zeros(p)
beta_true[:5] = [3, -2, 1.5, -1, 0.5]
y = X @ beta_true + 0.3 * np.random.randn(n)
y = y - y.mean()  # Center response

# Solve
lambda_ = 0.1
beta_hat, history = lasso_coordinate_descent(X, y, lambda_)

print(f"\nTrue non-zeros: {np.sum(beta_true != 0)}")
print(f"Estimated non-zeros: {np.sum(np.abs(beta_hat) > 1e-10)}")
print(f"\nTrue coefficients: {beta_true[:5]}")
print(f"Estimated coefficients: {beta_hat[:5].round(3)}")
```

The key efficiency trick is maintaining the residual vector r = y - Xβ incrementally. Each coordinate update changes only one term: r_new = r_old + x_j(β_j^old - β_j^new). This avoids recomputing the full matrix-vector product Xβ at each step.
Let's derive the coordinate descent update rigorously, handling the case of non-standardized features.
Setup:
We minimize over $\beta_j$ while holding $\boldsymbol{\beta}_{-j}$ (all other coefficients) fixed:
$$\min_{\beta_j} \underbrace{\frac{1}{2n}|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2}_{\text{quadratic in } \beta_j} + \lambda|\beta_j|$$
Expanding the Quadratic:
Let $\mathbf{r}_{-j} = \mathbf{y} - \sum_{k \neq j} \beta_k \mathbf{x}_k$ be the partial residual. Then:
$$|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2 = |\mathbf{r}_{-j} - \beta_j \mathbf{x}_j|_2^2 = |\mathbf{r}_{-j}|_2^2 - 2\beta_j \mathbf{x}_j^T \mathbf{r}_{-j} + \beta_j^2 |\mathbf{x}_j|_2^2$$
The Univariate Objective:
$$h(\beta_j) = \frac{1}{2n}\left(|\mathbf{r}_{-j}|_2^2 - 2\beta_j \mathbf{x}_j^T \mathbf{r}_{-j} + \beta_j^2 |\mathbf{x}_j|_2^2\right) + \lambda|\beta_j|$$
Dropping constants and rearranging:
$$h(\beta_j) = \frac{|\mathbf{x}_j|_2^2}{2n}\left(\beta_j - \frac{\mathbf{x}_j^T \mathbf{r}_{-j}}{|\mathbf{x}_j|_2^2}\right)^2 + \lambda|\beta_j| + \text{const}$$
Defining Key Quantities:

Let $c_j = \frac{1}{n}|\mathbf{x}_j|_2^2$ (the scaled squared norm of feature $j$) and $\tilde{\beta}_j = \frac{\mathbf{x}_j^T \mathbf{r}_{-j}}{|\mathbf{x}_j|_2^2}$ (the coefficient from regressing the partial residual on feature $j$ alone). In this notation, $h(\beta_j) = \frac{c_j}{2}(\beta_j - \tilde{\beta}_j)^2 + \lambda|\beta_j| + \text{const}$.
The Update Formula:
The univariate problem $\min_{\beta_j} \frac{c_j}{2}(\beta_j - \tilde{\beta}_j)^2 + \lambda|\beta_j|$ has solution:
$$\beta_j^{\text{new}} = S_{\lambda/c_j}(\tilde{\beta}_j) = \text{sign}(\tilde{\beta}_j) \max\left(|\tilde{\beta}_j| - \frac{\lambda}{c_j}, 0\right)$$
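For completeness, here is the short case analysis behind that formula (a standard derivation, with $s$ a subgradient of $|\beta_j|$). Setting the subgradient of the univariate objective to zero gives $c_j(\beta_j - \tilde{\beta}_j) + \lambda s = 0$, and the three cases are:

$$\beta_j^{\text{new}} = \begin{cases} \tilde{\beta}_j - \lambda/c_j & \tilde{\beta}_j > \lambda/c_j \\ 0 & |\tilde{\beta}_j| \leq \lambda/c_j \\ \tilde{\beta}_j + \lambda/c_j & \tilde{\beta}_j < -\lambda/c_j \end{cases}$$

which is exactly $S_{\lambda/c_j}(\tilde{\beta}_j)$.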
Simplified Form (Standardized Features):
If $\frac{1}{n}|\mathbf{x}_j|_2^2 = 1$ (standardized), then $c_j = 1$ and:
$$\beta_j^{\text{new}} = S_\lambda\left(\frac{1}{n}\mathbf{x}_j^T \mathbf{r}_{-j}\right)$$
This is the formula implemented in the algorithm above.
General Form (Non-standardized Features):
$$\beta_j^{\text{new}} = S_{n\lambda/|\mathbf{x}_j|_2^2}\left(\frac{\mathbf{x}_j^T \mathbf{r}_{-j}}{|\mathbf{x}_j|_2^2}\right)$$
Or equivalently:
$$\beta_j^{\text{new}} = S_{n\lambda}\left(\mathbf{x}_j^T \mathbf{r}_{-j}\right) / |\mathbf{x}_j|_2^2$$
Most Lasso implementations assume standardized features. Using non-standardized features with these implementations gives incorrect results. Always check your implementation's assumptions and preprocess data accordingly.
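As a minimal preprocessing sketch matching the convention used on this page ($\frac{1}{n}|\mathbf{x}_j|_2^2 = 1$ and a centered response); the function name and return layout are illustrative:

```python
import numpy as np


def standardize_for_lasso(X, y):
    """Center y and scale X so each column has zero mean and (1/n)*||x_j||^2 = 1."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean = X.mean(axis=0)
    x_std = X.std(axis=0)          # population std, so (1/n)*sum of squares = 1 after scaling
    X_std = (X - x_mean) / x_std
    y_mean = y.mean()
    y_centered = y - y_mean
    # Return the statistics so fitted coefficients can be mapped back to the original scale
    return X_std, y_centered, x_mean, x_std, y_mean
```

Coefficients fitted on the standardized data map back to the original scale via $\beta_j^{\text{orig}} = \beta_j^{\text{std}} / \text{std}(x_j)$, with the intercept recovered from the stored means.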
A natural question: Does coordinate descent converge? And if so, how fast?
Convergence Guarantee:
For the Lasso objective (convex with separable non-smooth part), coordinate descent converges to the global optimum. This is guaranteed by:
Theorem (Tseng, 2001): For minimizing $f(\boldsymbol{\beta}) + \sum_j g_j(\beta_j)$ where $f$ is convex, differentiable with Lipschitz gradient, and each $g_j$ is convex (not necessarily smooth), coordinate descent converges to the global minimum.
The Lasso satisfies these conditions with $f(\boldsymbol{\beta}) = \frac{1}{2n}|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2$ and $g_j(\beta_j) = \lambda|\beta_j|$.
Convergence Rate:
| Setting | Rate | Notes |
|---|---|---|
| General convex | $O(1/k)$ | Objective suboptimality after $k$ iterations |
| Strongly convex $f$ | $O(\rho^k)$, $\rho < 1$ | Linear (geometric) convergence |
| Lasso (typical) | $O(\rho^k)$ | Effectively linear due to restricted strong convexity |
| High correlation | Slower | Condition number affects rate |
| Path computation | Very fast | Warm starting from nearby λ values |
Linear Convergence for Lasso:
Although the Lasso objective is not strongly convex (the L1 norm has zero curvature), coordinate descent often exhibits linear convergence in practice. This is because:
Restricted Strong Convexity (RSC): The objective is effectively strongly convex in directions that matter, even if not globally.
Finite Active Set: After some iterations, the set of non-zero coefficients stabilizes. On this subspace, the objective is strongly convex (just least squares).
Exact Zeros: Once $\beta_j = 0$ and the optimality condition is satisfied, it stays zero—no oscillation.
Practical Convergence:
In practice, coordinate descent for Lasso is remarkably fast: a handful of full sweeps usually suffices for a single $\lambda$, and warm-started path solves (discussed below) often need only one to three sweeps per grid point.
Stopping Criteria:
Common convergence criteria include a small maximum absolute change in the coefficients between sweeps, a small relative decrease in the objective value, and direct verification of the KKT optimality conditions.
The KKT conditions provide exact optimality verification: for active variables, check gradient = ±λ; for inactive, check |gradient| ≤ λ. This is more informative than heuristic stopping rules and catches subtle convergence failures.
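A minimal sketch of such a KKT check for the $\frac{1}{2n}$-scaled objective used throughout this page (the function name and tolerance are illustrative):

```python
import numpy as np


def kkt_violation(X, y, beta, lambda_, active_tol=1e-10):
    """
    Largest KKT violation for (1/2n)||y - X beta||^2 + lambda * ||beta||_1.

    Optimality requires:
      - beta_j != 0:  (1/n) * x_j^T (y - X beta) == lambda * sign(beta_j)
      - beta_j == 0:  |(1/n) * x_j^T (y - X beta)| <= lambda
    """
    n = X.shape[0]
    corr = X.T @ (y - X @ beta) / n            # feature-residual correlations
    active = np.abs(beta) > active_tol

    # Active coordinates: correlation must equal lambda * sign(beta_j)
    viol_active = np.abs(corr[active] - lambda_ * np.sign(beta[active]))
    # Inactive coordinates: |correlation| must not exceed lambda
    viol_inactive = np.maximum(np.abs(corr[~active]) - lambda_, 0)

    return max(viol_active.max(initial=0), viol_inactive.max(initial=0))
```

A solution is accepted when this value falls below a small tolerance; this catches the subtle failures that coefficient-change or objective-change criteria can miss.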
Naive coordinate descent is already good. Optimized coordinate descent is exceptional. Here are key implementation techniques:
1. Warm Starting:
When computing solutions for multiple $\lambda$ values (the Lasso path), initialize each solve with the solution from the previous $\lambda$:
For dense grids with ratio $\lambda_{k+1}/\lambda_k \approx 0.9$, each solve typically requires only 1-3 iterations.
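A self-contained sketch of a warm-started path solve; it repeats the standardized-coordinate update from the solver above so it can run on its own, and the grid parameters are illustrative:

```python
import numpy as np


def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0)


def lasso_path_warm(X, y, n_lambdas=100, ratio=0.9, max_sweeps=50, tol=1e-6):
    """Coordinate descent over a geometric lambda grid, warm starting each solve.

    Assumes standardized X (zero-mean columns with (1/n)*||x_j||^2 = 1) and centered y.
    """
    n, p = X.shape
    lambda_max = np.max(np.abs(X.T @ y)) / n      # smallest lambda giving the all-zero solution
    lambdas = lambda_max * ratio ** np.arange(n_lambdas)

    beta = np.zeros(p)
    residual = y.copy()
    path = []
    for lam in lambdas:
        # beta and residual carry over from the previous lambda: the warm start
        for _ in range(max_sweeps):
            max_change = 0.0
            for j in range(p):
                old = beta[j]
                residual += X[:, j] * old
                beta[j] = soft_threshold(X[:, j] @ residual / n, lam)
                residual -= X[:, j] * beta[j]
                max_change = max(max_change, abs(beta[j] - old))
            if max_change < tol:
                break
        path.append(beta.copy())
    return lambdas, np.array(path)
```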
2. Active Set Strategy:
Focus computation on non-zero coefficients:
```python
import numpy as np


def lasso_optimized(X, y, lambda_, max_iters=1000, tol=1e-6):
    """
    Optimized Lasso coordinate descent with active set strategy.

    Key optimizations:
    1. Maintain residual incrementally
    2. Active set cycling (focus on non-zeros)
    3. Periodic full sweeps to check for new active variables
    """
    n, p = X.shape
    beta = np.zeros(p)
    residual = y.copy()

    # Precompute column norms (for non-standardized data)
    col_norms_sq = np.sum(X ** 2, axis=0)

    # Active set: indices of currently non-zero coefficients
    active_set = set()

    for iteration in range(max_iters):
        beta_old = beta.copy()

        # PHASE 1: Cycle through the active set (cheap)
        for _ in range(10):  # Multiple inner iterations on the active set
            for j in list(active_set):
                beta_j_old = beta[j]
                residual += X[:, j] * beta_j_old
                z_j = X[:, j] @ residual
                threshold = n * lambda_ / col_norms_sq[j]
                beta[j] = np.sign(z_j) * max(abs(z_j) / col_norms_sq[j] - threshold, 0)
                residual -= X[:, j] * beta[j]

                # Update active set
                if abs(beta[j]) < 1e-15:
                    active_set.discard(j)

        # PHASE 2: Full sweep to check inactive variables (periodic)
        max_violation = 0
        for j in range(p):
            if j in active_set:
                continue

            # Check KKT condition for inactive variable
            gradient_j = -X[:, j] @ residual / n
            violation = max(0, abs(gradient_j) - lambda_)
            if violation > max_violation:
                max_violation = violation

            # If violated, add to active set and update
            if abs(gradient_j) > lambda_ + tol:
                residual += X[:, j] * beta[j]
                z_j = X[:, j] @ residual
                threshold = n * lambda_ / col_norms_sq[j]
                beta[j] = np.sign(z_j) * max(abs(z_j) / col_norms_sq[j] - threshold, 0)
                residual -= X[:, j] * beta[j]
                if abs(beta[j]) > 1e-15:
                    active_set.add(j)

        # Convergence check
        if np.max(np.abs(beta - beta_old)) < tol and max_violation < tol:
            print(f"Converged at iteration {iteration + 1}")
            print(f"Active set size: {len(active_set)}")
            break

    return beta


# The active set strategy is crucial for high-dimensional problems.
# When p >> n and the solution is sparse, most time is spent on ~k active
# variables instead of cycling through all p variables.
```

3. Safe Screening Rules:
Before even starting optimization, eliminate features that are guaranteed to have zero coefficients:
SAFE Rule: If $|\mathbf{x}_j^T\mathbf{y}| \leq n\lambda - |\mathbf{x}_j|_2 \cdot |\mathbf{y}|_2 \cdot \theta$, then $\hat{\beta}_j = 0$.
Here $\theta$ is a geometric bound depending on the current and target $\lambda$ values.
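A sketch of the rule with one classical choice of the bound, $\theta = (\lambda_{\max} - \lambda)/\lambda_{\max}$ where $\lambda_{\max} = \max_j |\mathbf{x}_j^T\mathbf{y}|/n$; treat the exact constant as an assumption to verify against your own scaling conventions:

```python
import numpy as np


def safe_discard_mask(X, y, lambda_):
    """Boolean mask of features the SAFE rule certifies as inactive (beta_j = 0)."""
    n = X.shape[0]
    corr = np.abs(X.T @ y)                          # |x_j^T y|
    lambda_max = corr.max() / n
    theta = (lambda_max - lambda_) / lambda_max     # assumed geometric bound for this lambda
    bound = n * lambda_ - np.linalg.norm(X, axis=0) * np.linalg.norm(y) * theta
    return corr <= bound                            # True => safe to discard feature j
```

Features flagged by the mask can be dropped from the design matrix before optimization and, unlike strong-rule discards, never need to be re-checked for that $\lambda$.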
4. Matrix-Free Updates:
For very large $p$, store $\mathbf{X}$ in a sparse format or use matrix-free products, computing $\mathbf{x}_j^T\mathbf{r}$ on the fly instead of materializing large intermediate matrices.
5. Strong Rules (Heuristic Screening):
STRONG rule: Assume features with $|\mathbf{x}_j^T\mathbf{r}| < 2\lambda - \lambda_{\text{prev}}$ will have $\beta_j = 0$.
This is not safe (can miss active features), so verify with KKT conditions after convergence. But it's very effective in practice—often eliminates 99% of features.
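A sketch of the sequential strong-rule filter using $\frac{1}{n}$-scaled correlations, consistent with the rest of this page (the residual from the previous $\lambda$ is assumed available from the path solve):

```python
import numpy as np


def strong_rule_candidates(X, residual_prev, lambda_, lambda_prev):
    """
    Sequential strong rule: keep feature j only if
        |x_j^T r(lambda_prev)| / n >= 2*lambda - lambda_prev.
    Discarded features are assumed (not guaranteed) inactive, so the fit
    must be followed by a KKT check that re-adds any violators.
    """
    n = X.shape[0]
    corr = np.abs(X.T @ residual_prev) / n
    keep = corr >= 2 * lambda_ - lambda_prev
    return np.where(keep)[0]          # candidate feature indices to optimize over
```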
Libraries like glmnet (R/Python), scikit-learn, and liblinear implement all these optimizations. For $p = 10^6$ features, optimized coordinate descent can compute the full regularization path in seconds. Don't implement from scratch for production—use these battle-tested libraries.
Proximal gradient methods offer an alternative to coordinate descent, particularly useful when the smooth part $f$ has structure that coordinate descent can't exploit.
The Proximal Gradient Framework:
For objectives $F(\boldsymbol{\beta}) = f(\boldsymbol{\beta}) + g(\boldsymbol{\beta})$ where $f$ is smooth and $g$ is convex:
$$\boldsymbol{\beta}^{(k+1)} = \text{prox}_{t_k g}\left(\boldsymbol{\beta}^{(k)} - t_k \nabla f(\boldsymbol{\beta}^{(k)})\right)$$
where $t_k > 0$ is the step size and the proximal operator is:
$$\text{prox}_{t g}(\mathbf{z}) = \arg\min_{\mathbf{u}} \left\{ g(\mathbf{u}) + \frac{1}{2t}|\mathbf{u} - \mathbf{z}|_2^2 \right\}$$
For Lasso:
With $g(\boldsymbol{\beta}) = \lambda|\boldsymbol{\beta}|_1$, the proximal operator is component-wise soft thresholding:
$$\text{prox}_{t\lambda|\cdot|_1}(\mathbf{z}) = S_{t\lambda}(\mathbf{z})$$
ISTA (Iterative Shrinkage-Thresholding Algorithm):
$$\boldsymbol{\beta}^{(k+1)} = S_{t\lambda}\left(\boldsymbol{\beta}^{(k)} + \frac{t}{n}\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}^{(k)})\right)$$
where $t = 1/L$ with $L = |\mathbf{X}|_2^2/n$ being the Lipschitz constant of $\nabla f$.
```python
import numpy as np


def soft_threshold(z, threshold):
    """Component-wise soft thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0)


def ista(X, y, lambda_, max_iters=1000, tol=1e-6):
    """
    ISTA: Iterative Shrinkage-Thresholding Algorithm.

    Basic proximal gradient descent for Lasso.
    Convergence rate: O(1/k)
    """
    n, p = X.shape
    beta = np.zeros(p)

    # Step size: 1/L where L = ||X||^2 / n
    L = np.linalg.norm(X, ord=2) ** 2 / n
    t = 1 / L

    for k in range(max_iters):
        beta_old = beta.copy()

        # Gradient step
        gradient = -X.T @ (y - X @ beta) / n
        z = beta - t * gradient

        # Proximal step (soft thresholding)
        beta = soft_threshold(z, t * lambda_)

        if np.max(np.abs(beta - beta_old)) < tol:
            print(f"ISTA converged at iteration {k + 1}")
            break

    return beta


def fista(X, y, lambda_, max_iters=1000, tol=1e-6):
    """
    FISTA: Fast ISTA with Nesterov acceleration.

    Accelerated proximal gradient descent for Lasso.
    Convergence rate: O(1/k^2) - quadratically faster than ISTA!
    """
    n, p = X.shape
    beta = np.zeros(p)
    beta_prev = beta.copy()

    L = np.linalg.norm(X, ord=2) ** 2 / n
    t = 1 / L

    # Nesterov momentum parameters
    tau = 1
    tau_prev = 1

    for k in range(max_iters):
        # Momentum term
        momentum = (tau_prev - 1) / tau
        z = beta + momentum * (beta - beta_prev)

        # Gradient step
        gradient = -X.T @ (y - X @ z) / n
        v = z - t * gradient

        # Proximal step
        beta_new = soft_threshold(v, t * lambda_)

        # Convergence check
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new  # keep the latest iterate
            print(f"FISTA converged at iteration {k + 1}")
            break

        # Update momentum parameter
        tau_prev = tau
        tau = (1 + np.sqrt(1 + 4 * tau ** 2)) / 2

        beta_prev = beta
        beta = beta_new

    return beta


# Comparison
np.random.seed(42)
n, p = 500, 200
X = np.random.randn(n, p)
X = X / np.sqrt(n)
beta_true = np.zeros(p)
beta_true[:10] = np.random.randn(10)
y = X @ beta_true + 0.5 * np.random.randn(n)

lambda_ = 0.1

print("ISTA:")
beta_ista = ista(X, y, lambda_, max_iters=5000)

print("\nFISTA:")
beta_fista = fista(X, y, lambda_, max_iters=5000)

print(f"\nNon-zeros (ISTA): {np.sum(np.abs(beta_ista) > 1e-10)}")
print(f"Non-zeros (FISTA): {np.sum(np.abs(beta_fista) > 1e-10)}")
```

| Algorithm | Convergence | Per-Iteration Cost | Best For |
|---|---|---|---|
| Coord Descent | $O(\rho^k)$ (linear) | $O(np)$ per cycle | Standard Lasso, path computation |
| ISTA | $O(1/k)$ (sublinear) | $O(np)$ per iteration | Simple implementation, theory |
| FISTA | $O(1/k^2)$ (accelerated) | $O(np)$ per iteration | Faster than ISTA, still simple |
| LARS | Exact path | $O(np \min(n,p))$ total | Full path, feature entry order |
Coordinate descent often beats ISTA/FISTA for Lasso due to its ability to exploit sparsity. Proximal methods shine when: (1) the problem structure doesn't separate across coordinates, (2) distributed/parallel implementation is needed, or (3) extensions beyond L1 (e.g., total variation, nuclear norm) are used.
LARS (Least Angle Regression) is an elegant algorithm that computes the entire Lasso path—all solutions as $\lambda$ varies from $\lambda_{\max}$ to 0—in one shot.
Key Insight: Piecewise Linear Path
The Lasso solution path $\hat{\boldsymbol{\beta}}(\lambda)$ is piecewise linear in $\lambda$. Between certain critical values of $\lambda$, coefficients change at constant rates. This means the entire path can be recovered exactly by finding the breakpoints where the active set changes and interpolating linearly between them.
LARS Algorithm Overview:

Start with all coefficients at zero and find the feature most correlated with the residual. Move that coefficient toward its least-squares value until a second feature attains the same absolute correlation with the residual; add it to the active set and continue along the direction that keeps the active correlations tied, repeating until all features have entered or the correlations reach zero.
The Equicorrelation Property:
At any point on the Lasso path, all active features have the same absolute correlation with the current residual:
$$|\mathbf{x}_j^T\mathbf{r}| = \lambda \quad \text{for all } j \in \mathcal{A}$$
where $\mathcal{A}$ is the active set. This is the equicorrelation property.
LARS maintains this property exactly, moving coefficients together keeping correlations equal. When a new feature's correlation "catches up" to the active correlation, it joins the active set.
LARS vs. Lasso Path:
Plain LARS never removes a feature once it has entered the active set, so its coefficient paths can cross zero without being dropped. The LARS-Lasso modification handles this sign change: when an active coefficient hits zero, it is removed from the active set and the direction is recomputed, and with this change LARS traces the exact Lasso path.
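As a usage sketch with scikit-learn's `lars_path` (here `method='lasso'` selects the LARS-Lasso variant that allows drops; the toy data is illustrative):

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# alphas: the lambda values (knots) at which the active set changes
# coefs:  coefficients at each knot, shape (p, n_knots); the path is linear in between
alphas, active, coefs = lars_path(X, y, method="lasso")

print("Number of knots:", len(alphas))
print("Active features at the end of the path:", active)
print("Coefficients at the smallest lambda:", coefs[:, -1].round(2))
```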
Computational Complexity:

LARS computes the entire path in $O(np \min(n,p))$ operations, the same order of cost as a single ordinary least squares fit on the full data.
LARS is ideal when you need the full solution path for model selection, want to understand feature entry order, or need exact solutions. For a single λ value, coordinate descent is simpler. For very high dimensions with warm starting, coordinate descent with screening can be faster than LARS.
We've covered the algorithmic landscape for Lasso optimization. Here's what you should take away:
Algorithm Selection Guide:

Use coordinate descent (with warm starts and screening) for standard Lasso fits and regularization paths; use ISTA/FISTA when you want a simple proximal scheme, need parallel or distributed updates, or are working with penalties that do not separate across coordinates; use LARS when you need the exact path or the order in which features enter the model.
What's Next:
We've covered formulation, sparsity mechanics, geometry, and algorithms. The final page addresses the feature selection aspect—how Lasso's sparsity translates to practical variable selection, its reliability, and connections to statistical inference.
You now understand how to solve Lasso computationally: coordinate descent with soft thresholding updates, convergence guarantees, practical optimizations, and alternative methods (ISTA/FISTA, LARS). You're equipped to implement and understand production Lasso solvers.