While L2 regularization provides stability by shrinking coefficients toward zero, it never eliminates features entirely—every predictor retains some influence, however small. In many real-world classification problems, we believe only a subset of available features are truly predictive, while others are noise or irrelevant.
L1 regularization (also known as Lasso regularization, from "Least Absolute Shrinkage and Selection Operator") addresses this by adding a penalty proportional to the absolute value of coefficients rather than their squares. This seemingly minor change has profound consequences: L1 regularization can drive coefficients exactly to zero, automatically performing feature selection as part of the fitting process.
This page develops the complete theory of L1-regularized logistic regression, explaining why the L1 penalty induces sparsity, how to solve the resulting optimization problem, and when L1 is preferred over L2.
The Lasso was introduced by Robert Tibshirani in 1996 for linear regression. Its extension to logistic regression and other generalized linear models followed naturally. The name "Lasso" captures both the selection aspect (roping in relevant features) and the mathematical operation (the L1 penalty forms a diamond-shaped constraint region that "lassos" the solution).
L1 regularization modifies the logistic regression objective by adding a penalty proportional to the L1 norm of the coefficient vector:
$$J(\boldsymbol{\beta}) = -\sum_{i=1}^{n} \left[ y_i \log p_i + (1-y_i) \log(1-p_i) \right] + \lambda |\boldsymbol{\beta}|_1$$
where:
- $y_i \in \{0, 1\}$ is the observed label for observation $i$
- $p_i = \sigma(\mathbf{x}_i^\top \boldsymbol{\beta})$ is the predicted probability that $y_i = 1$
- $\lambda \geq 0$ controls the regularization strength
- $|\boldsymbol{\beta}|_1 = \sum_j |\beta_j|$ is the L1 norm of the coefficient vector (the intercept is conventionally left unpenalized)
Key Difference from L2: The L1 penalty uses absolute values $|\beta_j|$ instead of squares $\beta_j^2$. This creates a fundamentally different penalty landscape with sharp corners at zero.
Equivalently, we can express L1 regularization as a constrained optimization problem:
$$\begin{aligned} \text{minimize} \quad & -\mathcal{L}(\boldsymbol{\beta}) \\ \text{subject to} \quad & |\boldsymbol{\beta}|_1 \leq t \end{aligned}$$
The constraint region $|\boldsymbol{\beta}|_1 \leq t$ is a diamond (more precisely, a cross-polytope) in the parameter space. In 2D, this is a square rotated 45°; in higher dimensions, it is the generalized diamond whose vertices lie on the coordinate axes.
Geometric Intuition for Sparsity: The corners of this diamond lie on the coordinate axes (e.g., at $(t, 0)$, $(-t, 0)$, $(0, t)$, $(0, -t)$ in 2D). When the optimal solution lies at a corner, one or more coordinates are exactly zero. Unlike the smooth L2 ball, the L1 diamond has sharp corners that "catch" the solution.
Imagine drawing level curves of the log-likelihood (ellipses centered at the MLE) and expanding them until they touch the L1 constraint region. Because the diamond has corners on the axes, the first point of contact is often at a corner—where one or more coefficients equal zero. The sharper the constraint angles, the more likely the solution includes exact zeros.
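The sketch below draws this picture in 2D. It is purely illustrative: the quadratic contours and the "unconstrained optimum" are made-up stand-ins for the log-likelihood level sets described above, not quantities from this lesson.

```python
import numpy as np
import matplotlib.pyplot as plt

# Quadratic contours standing in for the negative log-likelihood, centered at a
# hypothetical unconstrained optimum, plus an L1 diamond and L2 ball with budget t = 1.
b_opt = np.array([1.6, 0.6])
b1, b2 = np.meshgrid(np.linspace(-2, 3, 300), np.linspace(-2, 3, 300))
loss = (b1 - b_opt[0]) ** 2 + 3 * (b2 - b_opt[1]) ** 2

fig, ax = plt.subplots(figsize=(5, 5))
ax.contour(b1, b2, loss, levels=10, colors='gray', linewidths=0.7)

t = 1.0
theta = np.linspace(0, 2 * np.pi, 400)
ax.plot(t * np.cos(theta), t * np.sin(theta), 'b-', label='L2 ball')
ax.plot([t, 0, -t, 0, t], [0, t, 0, -t, 0], 'r-', label='L1 diamond')

ax.axhline(0, color='k', linewidth=0.5)
ax.axvline(0, color='k', linewidth=0.5)
ax.set_xlabel(r'$\beta_1$')
ax.set_ylabel(r'$\beta_2$')
ax.set_aspect('equal')
ax.legend()
plt.tight_layout()
plt.show()
```

Expanding the gray contours outward from the optimum, the L1 diamond is typically first touched at a vertex on an axis, while the L2 ball is touched at a generic point with both coordinates non-zero.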
The L1 penalty $|\beta_j|$ is not differentiable at $\beta_j = 0$. To handle this, we use subgradients:
$$\frac{\partial |\beta_j|}{\partial \beta_j} = \begin{cases} +1 & \text{if } \beta_j > 0 \\ -1 & \text{if } \beta_j < 0 \\ {[-1, +1]} & \text{if } \beta_j = 0 \end{cases}$$
The subdifferential at zero is the entire interval $[-1, +1]$. At the optimal solution, if $\beta_j = 0$, the optimality conditions require:
$$\frac{\partial (-\mathcal{L})}{\partial \beta_j} \in [-\lambda, +\lambda]$$
This means a coefficient remains at zero as long as the gradient of the log-likelihood is bounded by $\pm\lambda$. Features whose contributions are "not strong enough" to overcome the penalty threshold are excluded entirely.
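This optimality condition also gives a concrete "entry threshold" for the whole model. The minimal sketch below (the helper name `lasso_lambda_max` is my own, not from this lesson) computes the smallest $\lambda$ at which every feature coefficient is exactly zero, assuming an unpenalized intercept.

```python
import numpy as np


def lasso_lambda_max(X, y):
    """Smallest lambda for which all feature coefficients are exactly zero
    in L1-penalized logistic regression (intercept unpenalized).

    At beta = 0 with the intercept at its optimum, every p_i equals mean(y),
    so |d(-logL)/d beta_j| <= lambda holds for all j exactly when
    lambda >= max_j |X_j^T (y - mean(y))|.
    """
    return np.max(np.abs(X.T @ (y - y.mean())))


# Example with hypothetical data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = rng.integers(0, 2, size=100).astype(float)
print(lasso_lambda_max(X, y))  # any lambda above this value gives the all-zero model
```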
```python
import numpy as np
from scipy.special import expit


def l1_regularized_logistic_loss(beta, X, y, lambda_reg, fit_intercept=True):
    """
    Compute the L1-regularized logistic regression loss.

    Note: This function is non-differentiable at beta_j = 0.

    Parameters
    ----------
    beta : array, shape (p,) or (p+1,)
        Coefficient vector (includes intercept if fit_intercept=True)
    X : array, shape (n, p)
        Feature matrix
    y : array, shape (n,)
        Binary labels (0 or 1)
    lambda_reg : float
        Regularization strength
    fit_intercept : bool
        Whether first element of beta is the intercept

    Returns
    -------
    loss : float
        The penalized negative log-likelihood
    """
    n = X.shape[0]

    # Separate intercept from feature coefficients
    if fit_intercept:
        intercept = beta[0]
        coef = beta[1:]
        linear_pred = X @ coef + intercept
    else:
        linear_pred = X @ beta
        coef = beta

    # Compute probabilities
    prob = expit(linear_pred)
    eps = 1e-15
    prob = np.clip(prob, eps, 1 - eps)

    # Negative log-likelihood
    nll = -np.sum(y * np.log(prob) + (1 - y) * np.log(1 - prob))

    # L1 penalty on feature coefficients (not intercept)
    l1_penalty = lambda_reg * np.sum(np.abs(coef))

    return nll + l1_penalty


def l1_subgradient(beta, X, y, lambda_reg, fit_intercept=True):
    """
    Compute a subgradient of the L1-regularized loss.

    At points where |beta_j| > 0, this is the gradient.
    At beta_j = 0, we return 0 as a valid subgradient element.
    """
    n = X.shape[0]

    if fit_intercept:
        intercept = beta[0]
        coef = beta[1:]
        linear_pred = X @ coef + intercept
    else:
        linear_pred = X @ beta
        coef = beta

    prob = expit(linear_pred)
    residuals = prob - y

    # Subgradient of L1 penalty: sign(coef) where coef ≠ 0, else 0
    l1_subgrad = np.sign(coef)
    # For zero coefficients, we could use any value in [-1, 1]
    # Using 0 is a common convention for optimization

    if fit_intercept:
        grad_intercept = np.sum(residuals)
        grad_coef = X.T @ residuals + lambda_reg * l1_subgrad
        return np.concatenate([[grad_intercept], grad_coef])
    else:
        return X.T @ residuals + lambda_reg * l1_subgrad
```

The sparsity-inducing property of L1 regularization arises from the geometry of the constraint region. Consider minimizing a smooth convex function (the negative log-likelihood) subject to an L1 constraint:
L2 Constraint (Circle/Sphere): The constraint boundary is smooth everywhere. As we expand the log-likelihood contours until they touch the constraint, the tangent point generically has all coordinates non-zero.
L1 Constraint (Diamond): The constraint boundary has corners on the coordinate axes. These corners are "sticky"—when the likelihood gradient points toward a corner, the optimal solution sits at that corner with exact zeros.
To understand L1 sparsity analytically, consider the univariate problem of minimizing:
$$f(\beta) = \frac{1}{2}(\beta - z)^2 + \lambda|\beta|$$
where $z$ is the unconstrained optimum. The solution is the soft thresholding operator:
$$\hat{\beta} = \text{sign}(z) \cdot \max(|z| - \lambda, 0) = S_\lambda(z)$$
This has three regimes:
- If $z > \lambda$: $\hat{\beta} = z - \lambda$ (shrink toward zero by $\lambda$)
- If $|z| \leq \lambda$: $\hat{\beta} = 0$ (set exactly to zero)
- If $z < -\lambda$: $\hat{\beta} = z + \lambda$ (shrink toward zero by $\lambda$)
Contrast with L2: For the same problem with L2 penalty, the solution is $\hat{\beta} = z/(1 + \lambda)$—never exactly zero regardless of how small $z$ is.
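For a quick worked example, take $\lambda = 0.5$ and an unconstrained estimate $z = 0.3$:

$$S_{0.5}(0.3) = \text{sign}(0.3) \cdot \max(0.3 - 0.5,\, 0) = 0, \qquad \frac{z}{1+\lambda} = \frac{0.3}{1.5} = 0.2$$

The L1 solution drops the coefficient entirely, while the L2 solution merely shrinks it.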
Think of the L1 penalty as a "threshold" that effects must exceed to enter the model. With λ = 0.5, a feature needs a coefficient magnitude greater than 0.5 (in appropriate units) to appear in the final model. This creates a natural filter separating signal from noise.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import expit
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


def soft_threshold(z, lambda_reg):
    """
    Soft thresholding operator (proximal operator for L1 norm).

    S_λ(z) = sign(z) * max(|z| - λ, 0)

    This is the solution to: minimize_β (1/2)(β - z)² + λ|β|
    """
    return np.sign(z) * np.maximum(np.abs(z) - lambda_reg, 0)


def visualize_thresholding():
    """
    Visualize soft thresholding vs ridge shrinkage.
    """
    z = np.linspace(-3, 3, 1000)
    lambda_vals = [0.5, 1.0, 1.5]

    fig, axes = plt.subplots(1, 2, figsize=(12, 5))

    # Soft thresholding (L1)
    ax1 = axes[0]
    ax1.axhline(0, color='gray', linestyle='-', linewidth=0.5)
    ax1.axvline(0, color='gray', linestyle='-', linewidth=0.5)
    ax1.plot(z, z, 'k--', label='No penalty (β = z)', alpha=0.5)

    for lam in lambda_vals:
        beta_l1 = soft_threshold(z, lam)
        ax1.plot(z, beta_l1, label=f'λ = {lam}', linewidth=2)

    ax1.set_xlabel('Unconstrained estimate z', fontsize=12)
    ax1.set_ylabel('L1-regularized estimate β', fontsize=12)
    ax1.set_title('Soft Thresholding (L1/Lasso)', fontsize=14)
    ax1.legend()
    ax1.set_xlim(-3, 3)
    ax1.set_ylim(-3, 3)

    # Ridge shrinkage (L2)
    ax2 = axes[1]
    ax2.axhline(0, color='gray', linestyle='-', linewidth=0.5)
    ax2.axvline(0, color='gray', linestyle='-', linewidth=0.5)
    ax2.plot(z, z, 'k--', label='No penalty (β = z)', alpha=0.5)

    for lam in lambda_vals:
        beta_l2 = z / (1 + lam)
        ax2.plot(z, beta_l2, label=f'λ = {lam}', linewidth=2)

    ax2.set_xlabel('Unconstrained estimate z', fontsize=12)
    ax2.set_ylabel('L2-regularized estimate β', fontsize=12)
    ax2.set_title('Ridge Shrinkage (L2)', fontsize=14)
    ax2.legend()
    ax2.set_xlim(-3, 3)
    ax2.set_ylim(-3, 3)

    plt.tight_layout()
    plt.savefig('thresholding_comparison.png', dpi=150)
    plt.show()


# Demonstrate sparsity in multivariate case
def demonstrate_sparsity():
    """
    Show how L1 creates sparse solutions while L2 doesn't.
    """
    np.random.seed(42)

    # True sparse model
    p = 20
    true_coef = np.zeros(p)
    true_coef[:5] = [2.0, -1.5, 1.0, -0.5, 0.5]  # Only 5 non-zero

    # Generate data
    n = 200
    X = np.random.randn(n, p)
    prob = expit(X @ true_coef)
    y = (np.random.rand(n) < prob).astype(float)

    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Fit with L1 and L2
    l1_model = LogisticRegression(penalty='l1', C=1.0, solver='saga', max_iter=5000)
    l2_model = LogisticRegression(penalty='l2', C=1.0, solver='lbfgs', max_iter=5000)

    l1_model.fit(X_scaled, y)
    l2_model.fit(X_scaled, y)

    print("\nSparsity Comparison (20 features, true model has 5 non-zero)")
    print("=" * 60)
    print(f"{'Coefficient':<12} {'True':>8} {'L1':>10} {'L2':>10}")
    print("-" * 60)

    for j in range(p):
        l1_val = l1_model.coef_[0, j]
        l2_val = l2_model.coef_[0, j]
        print(f"β_{j+1:<10} {true_coef[j]:>8.3f} {l1_val:>10.4f} {l2_val:>10.4f}")

    print("-" * 60)
    print(f"Non-zero: {np.sum(true_coef != 0):>8} "
          f"{np.sum(np.abs(l1_model.coef_[0]) > 0.01):>10} "
          f"{np.sum(np.abs(l2_model.coef_[0]) > 0.01):>10}")


if __name__ == "__main__":
    demonstrate_sparsity()
```

Unlike L2 regularization, where the objective is smooth and standard Newton's method applies, L1 regularization creates a non-smooth optimization problem. The L1 penalty $\sum|\beta_j|$ is not differentiable at $\beta_j = 0$, exactly where we want the solution to lie for sparse models.
This requires specialized algorithms:
Coordinate Descent: Update one coefficient at a time while holding others fixed. For each coordinate, the subproblem has a closed-form solution via soft thresholding.
Proximal Gradient Methods: Apply gradient descent to the smooth part (log-likelihood), then apply the proximal operator (soft thresholding) for the L1 penalty.
Interior Point Methods: Transform the problem using slack variables and solve a sequence of smooth barrier problems.
Coordinate descent is the workhorse algorithm for L1-regularized problems. The key insight is that when all but one coefficient is fixed, the remaining subproblem has a closed-form solution.
Single-Coordinate Update:
Let $r_j = \beta_j^{\text{old}} - \frac{1}{L_j} \nabla_j (-\mathcal{L})$ be the Newton-style "partial residual" for coefficient $j$, where $L_j$ is the curvature (or an upper bound on it) of the negative log-likelihood along coordinate $j$. The updated coefficient is:
$$\beta_j^{\text{new}} = S_{\lambda'}(r_j) = \text{sign}(r_j) \cdot \max(|r_j| - \lambda', 0), \qquad \lambda' = \lambda / L_j$$
In the implementation below, $L_j$ is taken to be the $j$-th diagonal entry of $\mathbf{X}^\top \mathbf{W} \mathbf{X}$, with $\mathbf{W}$ the usual diagonal matrix of $p_i(1-p_i)$ weights.
Algorithm (cyclic coordinate descent, as implemented below):
1. Initialize the intercept and all coefficients to zero.
2. Update the unpenalized intercept with a Newton step.
3. For each coordinate $j = 1, \dots, p$: holding the other coefficients fixed, compute the gradient and curvature of the negative log-likelihood, form $r_j$, and apply soft thresholding with threshold $\lambda / L_j$.
4. Repeat the sweep until the largest change in any coefficient falls below a tolerance.
```python
import numpy as np
from scipy.special import expit


class L1LogisticRegressionCD:
    """
    L1-Regularized Logistic Regression using Coordinate Descent.

    This implementation uses the cyclic coordinate descent algorithm
    with soft thresholding updates.
    """

    def __init__(self, lambda_reg=1.0, max_iter=1000, tol=1e-6):
        self.lambda_reg = lambda_reg
        self.max_iter = max_iter
        self.tol = tol
        self.coef_ = None
        self.intercept_ = None
        self.n_iter_ = 0

    def _soft_threshold(self, z, threshold):
        """Soft thresholding operator."""
        return np.sign(z) * np.maximum(np.abs(z) - threshold, 0)

    def fit(self, X, y):
        """
        Fit L1-regularized logistic regression using coordinate descent.

        The algorithm cycles through coordinates, updating each using
        a Newton step followed by soft thresholding.
        """
        n, p = X.shape

        # Initialize
        intercept = 0.0
        coef = np.zeros(p)

        # Precompute squared norms of feature columns
        col_sq_norms = np.sum(X ** 2, axis=0)

        for iteration in range(self.max_iter):
            coef_old = coef.copy()
            intercept_old = intercept

            # Update intercept (no regularization)
            linear_pred = X @ coef + intercept
            prob = expit(linear_pred)
            grad_intercept = np.sum(prob - y)
            hess_intercept = np.sum(prob * (1 - prob))
            if hess_intercept > 1e-10:
                intercept = intercept - grad_intercept / hess_intercept

            # Update each coefficient
            for j in range(p):
                # Current predictions without feature j's contribution
                linear_pred_no_j = X @ coef + intercept - X[:, j] * coef[j]

                # Residual with linearization
                prob = expit(linear_pred_no_j + X[:, j] * coef[j])

                # Gradient and approximate Hessian for coordinate j
                residual = prob - y
                grad_j = np.dot(X[:, j], residual)

                # Hessian approximation: use diagonal of X'WX
                weights = prob * (1 - prob)
                hess_j = np.dot(X[:, j] ** 2, weights)

                if hess_j < 1e-10:
                    continue

                # Newton direction (without penalty)
                z = coef[j] - grad_j / hess_j

                # Soft thresholding
                threshold = self.lambda_reg / hess_j
                coef[j] = self._soft_threshold(z, threshold)

            # Check convergence
            coef_change = np.max(np.abs(coef - coef_old))
            intercept_change = np.abs(intercept - intercept_old)

            if max(coef_change, intercept_change) < self.tol:
                self.n_iter_ = iteration + 1
                break
        else:
            self.n_iter_ = self.max_iter

        self.coef_ = coef
        self.intercept_ = intercept
        return self

    def predict_proba(self, X):
        """Predict class probabilities."""
        linear_pred = X @ self.coef_ + self.intercept_
        prob_1 = expit(linear_pred)
        return np.column_stack([1 - prob_1, prob_1])

    def predict(self, X, threshold=0.5):
        """Predict class labels."""
        return (self.predict_proba(X)[:, 1] >= threshold).astype(int)


# Demonstration
if __name__ == "__main__":
    np.random.seed(42)

    # Generate sparse data
    n, p = 300, 50
    X = np.random.randn(n, p)

    # True sparse model
    true_coef = np.zeros(p)
    true_coef[:5] = [2.0, -1.5, 1.0, -0.8, 0.6]

    prob = expit(X @ true_coef)
    y = (np.random.rand(n) < prob).astype(float)

    # Standardize
    X = (X - X.mean(axis=0)) / X.std(axis=0)

    # Fit with different regularization strengths
    print("\nL1 Logistic Regression - Coordinate Descent")
    print("=" * 60)

    for lam in [0.01, 0.1, 0.5, 1.0]:
        model = L1LogisticRegressionCD(lambda_reg=lam, max_iter=2000)
        model.fit(X, y)

        n_nonzero = np.sum(np.abs(model.coef_) > 1e-6)
        accuracy = np.mean(model.predict(X) == y)

        print(f"λ={lam:5.2f}: {n_nonzero:2d} non-zero coefficients, "
              f"accuracy={accuracy:.4f}, iterations={model.n_iter_}")
```

Proximal gradient methods separate the objective into smooth and non-smooth parts:
$$J(\boldsymbol{\beta}) = \underbrace{-\mathcal{L}(\boldsymbol{\beta})}_{g(\boldsymbol{\beta}) \text{ (smooth)}} + \underbrace{\lambda |\boldsymbol{\beta}|_1}_{h(\boldsymbol{\beta}) \text{ (non-smooth)}}$$
ISTA (Iterative Shrinkage-Thresholding Algorithm):
$$\boldsymbol{\beta}^{(t+1)} = \text{prox}_{\alpha h}\left( \boldsymbol{\beta}^{(t)} - \alpha \nabla g(\boldsymbol{\beta}^{(t)}) \right)$$
where $\alpha$ is the step size and the proximal operator for L1 is soft thresholding:
$$\text{prox}_{\alpha h}(\mathbf{z}) = S_{\alpha\lambda}(\mathbf{z})$$
FISTA (Fast ISTA): Adds momentum (Nesterov acceleration) for $O(1/k^2)$ convergence instead of $O(1/k)$:
$$\boldsymbol{\beta}^{(t+1)} = \text{prox}_{\alpha h}\left( \mathbf{v}^{(t)} - \alpha \nabla g(\mathbf{v}^{(t)}) \right)$$ $$\mathbf{v}^{(t+1)} = \boldsymbol{\beta}^{(t+1)} + \frac{t}{t+3}(\boldsymbol{\beta}^{(t+1)} - \boldsymbol{\beta}^{(t)})$$
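The following is a compact, illustrative sketch of FISTA for this objective (the function name `fista_l1_logistic` is my own). It omits the intercept and uses the standard Lipschitz bound $\|\mathbf{X}\|_2^2 / 4$ for the logistic negative log-likelihood to set the step size.

```python
import numpy as np
from scipy.special import expit


def fista_l1_logistic(X, y, lam, n_iter=500):
    """Illustrative FISTA for L1-penalized logistic regression (no intercept)."""
    n, p = X.shape
    # Step size from a Lipschitz bound on the gradient of the logistic NLL
    step = 4.0 / (np.linalg.norm(X, 2) ** 2)
    beta = np.zeros(p)
    v = beta.copy()
    for t in range(n_iter):
        grad = X.T @ (expit(X @ v) - y)                                # gradient of smooth part at v
        z = v - step * grad                                            # gradient step
        beta_new = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0)  # prox = soft thresholding
        v = beta_new + (t / (t + 3)) * (beta_new - beta)               # Nesterov momentum
        beta = beta_new
    return beta
```

Dropping the momentum line (setting `v = beta_new`) recovers plain ISTA.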
Coordinate descent is typically fastest for L1-regularized logistic regression; it underlies glmnet and (in a proximal Newton form) scikit-learn's liblinear solver, while scikit-learn's saga solver handles the L1 penalty with a proximal stochastic gradient approach. Proximal methods are more general and work for any convex penalty with a computable proximal operator. Interior point methods provide high accuracy but scale poorly to very large problems.
Just as L2 regularization corresponds to a Gaussian prior, L1 regularization corresponds to Maximum A Posteriori (MAP) estimation with a Laplace (double-exponential) prior:
$$p(\beta_j) = \frac{\lambda}{2} \exp(-\lambda |\beta_j|)$$
Log-prior: $$\log p(\boldsymbol{\beta}) = \sum_j \log\left(\frac{\lambda}{2}\right) - \lambda \sum_j |\beta_j| = \text{const} - \lambda |\boldsymbol{\beta}|_1$$
MAP objective: $$\log p(\boldsymbol{\beta} | \mathbf{y}, \mathbf{X}) \propto \log p(\mathbf{y} | \mathbf{X}, \boldsymbol{\beta}) + \log p(\boldsymbol{\beta}) = \mathcal{L}(\boldsymbol{\beta}) - \lambda |\boldsymbol{\beta}|_1$$
Maximizing this is equivalent to minimizing the L1-penalized negative log-likelihood.
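As a quick numerical check of this equivalence, the snippet below evaluates the Laplace log-prior with scipy.stats.laplace and compares it to $\text{const} - \lambda |\boldsymbol{\beta}|_1$; the coefficient values are illustrative.

```python
import numpy as np
from scipy.stats import laplace

lam = 3.0
beta = np.array([0.7, -1.2, 0.0, 0.4])  # illustrative coefficients

# Log-density of independent Laplace(0, scale = 1/lam) priors on each coefficient
log_prior = laplace(loc=0.0, scale=1.0 / lam).logpdf(beta).sum()

# Should equal const - lam * ||beta||_1, with const = p * log(lam / 2)
const = beta.size * np.log(lam / 2)
print(log_prior, const - lam * np.sum(np.abs(beta)))  # identical up to rounding
```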
Gaussian (L2) Prior:
- Smooth, rounded peak at zero with light tails
- Its log-density contributes a quadratic penalty $\beta_j^2$, so estimates are shrunk but not exactly zero

Laplace (L1) Prior:
- Sharp cusp at zero with heavier, exponential tails
- Its log-density contributes a linear penalty $|\beta_j|$, so the MAP estimate can sit exactly at zero
The Laplace prior has more mass concentrated at zero and heavier tails than the Gaussian. This embodies a "spike-and-slab" intuition: most coefficients should be zero, but those that aren't can be substantial.
| Property | Gaussian (L2) | Laplace (L1) |
|---|---|---|
| Density formula | $\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\beta^2/(2\sigma^2)}$ | $\frac{\lambda}{2}e^{-\lambda\vert\beta\vert}$ |
| Shape at zero | Smooth, rounded peak | Sharp cusp (non-differentiable) |
| Tail behavior | Light (Gaussian) | Heavy (exponential) |
| Mode | At zero | At zero |
| Sparsity preference | No | Yes (exact zeros likely) |
| Log-penalty | Quadratic $\beta^2$ | Linear $\vert\beta\vert$ |
While L1-regularized estimation finds the MAP point estimate, full Bayesian inference would compute the posterior distribution. With a Laplace prior, the posterior peaks at sparse solutions, but unlike the spike-and-slab prior, it doesn't place exact point mass at zero. For true Bayesian variable selection, more sophisticated priors (spike-and-slab, horseshoe) are used.
L1 regularization's ability to set coefficients exactly to zero makes it a powerful tool for embedded feature selection—feature selection performed as part of model fitting rather than as a separate preprocessing step.
Advantages over manual selection:
- Selection and coefficient estimation happen jointly, so each feature is evaluated in the context of the others rather than screened one at a time.
- The degree of sparsity is governed by a single tuning parameter $\lambda$, which can be chosen by cross-validation.
- It scales to high-dimensional settings where manually vetting hundreds or thousands of candidate features is impractical.

Properties of L1 selection:
- Larger $\lambda$ values generally yield smaller selected sets, tracing out a path of increasingly sparse models.
- When $p > n$, at most $n$ features can receive non-zero coefficients.
- With correlated predictors, the selected set can be unstable, as discussed next.
L1 regularization has a well-known limitation: instability with correlated features.
Consider two highly correlated features $x_1$ and $x_2$ that are both predictive of $y$:
- The Lasso tends to retain one of them and set the other's coefficient exactly to zero.
- Which one is retained can depend on small amounts of noise; resampling the data can flip the choice.
- The retained feature's coefficient absorbs the shared signal of both.
Why this happens: The L1 penalty depends only on $|\beta_1| + |\beta_2|$, not on how the total weight is distributed between the two features. If the fit essentially requires $\beta_1 + \beta_2 = c$ (with $c > 0$), then:
- $(\beta_1, \beta_2) = (c, 0)$
- $(\beta_1, \beta_2) = (0, c)$
- $(\beta_1, \beta_2) = (c/2, c/2)$

All three have the same penalty $\lambda c$! The data determines where along this equal-penalty segment the solution lands (often at one of the endpoints, i.e., keeping just one feature), and that choice is sensitive to noise.
The arbitrary selection among correlated features is a significant issue for interpretation. A coefficient being zero doesn't mean the feature is unimportant—it might just be collinear with a retained feature. For stable feature selection with correlated predictors, consider Elastic Net or stability selection techniques.
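The small simulation below (illustrative, not from any particular dataset) makes this concrete: two nearly identical predictors are refit on bootstrap resamples, and how the L1 model splits the weight between them, including which one is dropped, can change from resample to resample.

```python
import numpy as np
from scipy.special import expit
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)              # correlation ≈ 0.999
X = np.column_stack([x1, x2])
y = (rng.random(n) < expit(1.5 * x1)).astype(int)

for seed in range(5):
    idx = np.random.default_rng(seed).integers(0, n, size=n)   # bootstrap resample
    model = LogisticRegression(penalty='l1', C=0.5, solver='liblinear')
    model.fit(X[idx], y[idx])
    print(f"resample {seed}: coefficients = {np.round(model.coef_[0], 3)}")
```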
Under certain conditions, the Lasso can recover the true sparse model consistently as $n \to \infty$. The key requirement is the Irrepresentable Condition; one common (sign-free) form requires:
$$\max_{j \notin S} \left\| (\mathbf{X}_S^\top \mathbf{X}_S)^{-1} \mathbf{X}_S^\top \mathbf{X}_j \right\|_1 \leq 1 - \epsilon$$
where $S$ is the true support (indices of non-zero coefficients), $\mathbf{X}_S$ is the submatrix of truly relevant predictors, and $\mathbf{X}_j$ is an irrelevant ("noise") column. A separate "beta-min" condition, requiring the non-zero coefficients to be large enough, is also needed for exact support recovery.
Interpretation: The correlation between irrelevant features and the relevant feature space must be bounded below 1. If an irrelevant feature is too correlated with relevant ones, the Lasso may incorrectly select it.
In practice: This condition often fails in real high-dimensional data, limiting the Lasso's reliability for identifying the "true" model. For prediction, however, L1 regularization remains excellent even when consistency fails.
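The sketch below (the helper name `irrepresentable_stat` is my own) computes the sign-free form of this quantity for a given design matrix and candidate support; values at or above 1 warn that support recovery may fail.

```python
import numpy as np


def irrepresentable_stat(X, support):
    """Max over irrelevant columns j of ||(X_S^T X_S)^{-1} X_S^T X_j||_1.
    Values below 1 satisfy the (sign-free) irrepresentable condition."""
    S = np.asarray(support)
    others = np.setdiff1d(np.arange(X.shape[1]), S)
    XS = X[:, S]
    G_inv = np.linalg.inv(XS.T @ XS)
    return max(np.abs(G_inv @ XS.T @ X[:, j]).sum() for j in others)


rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                                    # near-orthogonal design
print(irrepresentable_stat(X, support=[0, 1, 2, 3, 4]))           # typically well below 1

X[:, 5] = X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=500)       # noise column collinear with support
print(irrepresentable_stat(X, support=[0, 1, 2, 3, 4]))           # now far above 1
```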
```python
import numpy as np
from scipy.special import expit
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.preprocessing import StandardScaler


def analyze_l1_feature_selection(X, y, true_support=None):
    """
    Analyze L1 feature selection across multiple regularization strengths.

    Parameters
    ----------
    X : array, shape (n, p)
        Feature matrix
    y : array, shape (n,)
        Binary labels
    true_support : array-like, optional
        Indices of truly relevant features (for evaluation)

    Returns
    -------
    results : dict
        Analysis results including selected features at optimal lambda
    """
    # Standardize features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Range of regularization strengths
    C_values = np.logspace(-3, 2, 50)  # C = 1/lambda

    # Track which features are selected at each C
    selected_features = []
    for C in C_values:
        model = LogisticRegression(
            penalty='l1', C=C, solver='saga', max_iter=5000
        )
        model.fit(X_scaled, y)
        # Features with non-zero coefficients
        nonzero = np.where(np.abs(model.coef_[0]) > 1e-6)[0]
        selected_features.append(set(nonzero))

    # Cross-validation to find optimal C
    cv_model = LogisticRegressionCV(
        penalty='l1', solver='saga', Cs=20, cv=5, max_iter=5000
    )
    cv_model.fit(X_scaled, y)

    optimal_C = cv_model.C_[0]
    optimal_features = np.where(np.abs(cv_model.coef_[0]) > 1e-6)[0]

    # Evaluate selection if true support is known
    if true_support is not None:
        true_set = set(true_support)
        selected_set = set(optimal_features)

        true_positives = len(true_set & selected_set)
        false_positives = len(selected_set - true_set)
        false_negatives = len(true_set - selected_set)

        precision = true_positives / (true_positives + false_positives) if selected_set else 0
        recall = true_positives / len(true_set) if true_set else 1
    else:
        precision, recall = None, None

    return {
        'optimal_C': optimal_C,
        'optimal_lambda': 1 / optimal_C,
        'selected_features': optimal_features,
        'n_selected': len(optimal_features),
        'coefficients': cv_model.coef_[0],
        'precision': precision,
        'recall': recall
    }


def stability_selection(X, y, n_bootstrap=100, threshold=0.5):
    """
    Stability selection: run L1 on bootstrap samples and keep features
    selected in at least 'threshold' fraction of samples.
    """
    n, p = X.shape
    selection_counts = np.zeros(p)

    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    for _ in range(n_bootstrap):
        # Bootstrap sample
        indices = np.random.choice(n, size=n, replace=True)
        X_boot = X_scaled[indices]
        y_boot = y[indices]

        # Fit L1 with CV
        model = LogisticRegressionCV(
            penalty='l1', solver='saga', Cs=10, cv=3, max_iter=2000
        )
        model.fit(X_boot, y_boot)

        # Record selected features
        selected = np.abs(model.coef_[0]) > 1e-6
        selection_counts += selected

    # Features selected in at least threshold fraction of bootstraps
    selection_probs = selection_counts / n_bootstrap
    stable_features = np.where(selection_probs >= threshold)[0]

    return stable_features, selection_probs


# Example
if __name__ == "__main__":
    np.random.seed(42)

    # Data with sparse true model
    n, p = 500, 100
    X = np.random.randn(n, p)

    # Add correlation blocks
    X[:, 5:10] = X[:, 0:1] + 0.1 * np.random.randn(n, 5)  # Correlated with X_0

    true_support = [0, 1, 2, 3, 4]  # First 5 features are relevant
    true_coef = np.zeros(p)
    true_coef[true_support] = [1.5, -1.0, 0.8, -0.6, 0.5]

    prob = expit(X @ true_coef)
    y = (np.random.rand(n) < prob).astype(float)

    # Analyze L1 selection
    results = analyze_l1_feature_selection(X, y, true_support)

    print("\nL1 Feature Selection Analysis")
    print("=" * 60)
    print(f"Optimal λ: {results['optimal_lambda']:.4f}")
    print(f"Features selected: {results['selected_features']}")
    print(f"Number selected: {results['n_selected']} (true: {len(true_support)})")
    print(f"Precision: {results['precision']:.3f}")
    print(f"Recall: {results['recall']:.3f}")

    # Stability selection
    stable, probs = stability_selection(X, y, n_bootstrap=50, threshold=0.6)
    print(f"\nStability Selection (60% threshold): {stable}")
```

The choice between L1 and L2 regularization depends on your goals and data characteristics:
| Criterion | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Sparsity | Creates exact zeros | Shrinks but non-zero |
| Feature selection | Built-in | Requires separate step |
| Correlated features | Arbitrarily selects one | Shares weight proportionally |
| Optimization | Non-smooth, specialized algorithms | Smooth, Newton methods |
| Bayesian prior | Laplace (peaked at zero) | Gaussian (smooth at zero) |
| When p > n | Can select at most n features | Handles naturally |
| Interpretability | Sparse model, clear feature set | All features contribute |
| Prediction accuracy | Good when the true model is sparse | Often slightly better when many features carry signal |
Choose L1 (Lasso) when:
- You believe only a subset of the features is truly predictive and want a sparse, interpretable model
- Feature selection is itself a goal, not just prediction
- There are many candidate features and you suspect most are noise

Choose L2 (Ridge) when:
- You expect most features to contribute at least a little signal
- Features are highly correlated and you want their weights shared rather than one arbitrarily selected
- Smooth, stable optimization and prediction accuracy matter more than a compact feature set

Consider Elastic Net (coming next) when:
- You want sparsity but your predictors include groups of correlated features
- You want selection behavior that is more stable than pure L1
In real applications, try both L1 and L2 (and Elastic Net) with cross-validation. The "best" regularization often depends on the specific dataset. Use cross-validated prediction performance to guide the choice, and consider domain knowledge about the expected sparsity level.
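For example, a minimal cross-validated comparison along these lines might look like the sketch below; the synthetic data from make_classification is purely a stand-in for your own X and y.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: 50 features, only 5 informative
X, y = make_classification(n_samples=400, n_features=50, n_informative=5, random_state=0)

for penalty, solver in [('l1', 'saga'), ('l2', 'lbfgs')]:
    model = make_pipeline(
        StandardScaler(),
        LogisticRegressionCV(penalty=penalty, solver=solver, Cs=10, cv=5, max_iter=5000),
    )
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{penalty}: mean CV accuracy = {score:.3f}")
```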
The next page introduces Elastic Net regularization, which combines L1 and L2 penalties. Elastic Net inherits the sparsity of L1 while addressing its limitations with correlated features, offering the best of both worlds for many applications.