While L2 regularization provides stability by shrinking coefficients toward zero, it never eliminates features entirely—every predictor retains some influence, however small. In many real-world classification problems, we believe only a subset of available features are truly predictive, while others are noise or irrelevant.
L1 regularization (also known as Lasso regularization, from "Least Absolute Shrinkage and Selection Operator") addresses this by adding a penalty proportional to the absolute value of coefficients rather than their squares. This seemingly minor change has profound consequences: L1 regularization can drive coefficients exactly to zero, automatically performing feature selection as part of the fitting process.
This page develops the complete theory of L1-regularized logistic regression, explaining why the L1 penalty induces sparsity, how to solve the resulting optimization problem, and when L1 is preferred over L2.
The Lasso was introduced by Robert Tibshirani in 1996 for linear regression. Its extension to logistic regression and other generalized linear models followed naturally. The name "Lasso" captures both the selection aspect (roping in relevant features) and the mathematical operation (the L1 penalty forms a diamond-shaped constraint region that "lassos" the solution).
L1 regularization modifies the logistic regression objective by adding a penalty proportional to the L1 norm of the coefficient vector:
$$J(\boldsymbol{\beta}) = -\sum_{i=1}^{n} \left[ y_i \log p_i + (1-y_i) \log(1-p_i) \right] + \lambda |\boldsymbol{\beta}|_1$$
where:
- $y_i \in \{0, 1\}$ is the observed label for observation $i$
- $p_i = \sigma(\mathbf{x}_i^\top \boldsymbol{\beta})$ is the predicted probability that $y_i = 1$
- $\lambda \geq 0$ controls the regularization strength
- $|\boldsymbol{\beta}|_1 = \sum_j |\beta_j|$ is the L1 norm of the coefficient vector (the intercept is conventionally left unpenalized)
Key Difference from L2: The L1 penalty uses absolute values $|\beta_j|$ instead of squares $\beta_j^2$. This creates a fundamentally different penalty landscape with sharp corners at zero.
Equivalently, we can express L1 regularization as a constrained optimization problem:
$$\begin{aligned} \text{minimize} \quad & -\mathcal{L}(\boldsymbol{\beta}) \\ \text{subject to} \quad & |\boldsymbol{\beta}|_1 \leq t \end{aligned}$$
The constraint region $|\boldsymbol{\beta}|_1 \leq t$ is a diamond (more precisely, a cross-polytope) in the parameter space. In 2D, this is a square rotated 45°; in higher dimensions, it is the generalized diamond whose vertices lie on the coordinate axes.
Geometric Intuition for Sparsity: The corners of this diamond lie on the coordinate axes (e.g., at $(t, 0)$, $(-t, 0)$, $(0, t)$, $(0, -t)$ in 2D). When the optimal solution lies at a corner, one or more coordinates are exactly zero. Unlike the smooth L2 ball, the L1 diamond has sharp corners that "catch" the solution.
Imagine drawing level curves of the log-likelihood (ellipses centered at the MLE) and expanding them until they touch the L1 constraint region. Because the diamond has corners on the axes, the first point of contact is often at a corner—where one or more coefficients equal zero. The sharper the constraint angles, the more likely the solution includes exact zeros.
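The sketch below draws this picture in 2D. It is purely illustrative: the quadratic contours and the "unconstrained optimum" are made-up stand-ins for the log-likelihood level sets described above, not quantities from this lesson.

```python
import numpy as np
import matplotlib.pyplot as plt

# Quadratic contours standing in for the negative log-likelihood, centered at a
# hypothetical unconstrained optimum, plus an L1 diamond and L2 ball with budget t = 1.
b_opt = np.array([1.6, 0.6])
b1, b2 = np.meshgrid(np.linspace(-2, 3, 300), np.linspace(-2, 3, 300))
loss = (b1 - b_opt[0]) ** 2 + 3 * (b2 - b_opt[1]) ** 2

fig, ax = plt.subplots(figsize=(5, 5))
ax.contour(b1, b2, loss, levels=10, colors='gray', linewidths=0.7)

t = 1.0
theta = np.linspace(0, 2 * np.pi, 400)
ax.plot(t * np.cos(theta), t * np.sin(theta), 'b-', label='L2 ball')
ax.plot([t, 0, -t, 0, t], [0, t, 0, -t, 0], 'r-', label='L1 diamond')

ax.axhline(0, color='k', linewidth=0.5)
ax.axvline(0, color='k', linewidth=0.5)
ax.set_xlabel(r'$\beta_1$')
ax.set_ylabel(r'$\beta_2$')
ax.set_aspect('equal')
ax.legend()
plt.tight_layout()
plt.show()
```

Expanding the gray contours outward from the optimum, the L1 diamond is typically first touched at a vertex on an axis, while the L2 ball is touched at a generic point with both coordinates non-zero.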
The L1 penalty $|\beta_j|$ is not differentiable at $\beta_j = 0$. To handle this, we use subgradients:
$$\frac{\partial |\beta_j|}{\partial \beta_j} = \begin{cases} +1 & \text{if } \beta_j > 0 \\ -1 & \text{if } \beta_j < 0 \\ {[-1, +1]} & \text{if } \beta_j = 0 \end{cases}$$
The subdifferential at zero is the entire interval $[-1, +1]$. At the optimal solution, if $\beta_j = 0$, the optimality conditions require:
$$\frac{\partial (-\mathcal{L})}{\partial \beta_j} \in [-\lambda, +\lambda]$$
This means a coefficient remains at zero as long as the gradient of the log-likelihood is bounded by $\pm\lambda$. Features whose contributions are "not strong enough" to overcome the penalty threshold are excluded entirely.
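This optimality condition also gives a concrete "entry threshold" for the whole model. The minimal sketch below (the helper name `lasso_lambda_max` is my own, not from this lesson) computes the smallest $\lambda$ at which every feature coefficient is exactly zero, assuming an unpenalized intercept.

```python
import numpy as np


def lasso_lambda_max(X, y):
    """Smallest lambda for which all feature coefficients are exactly zero
    in L1-penalized logistic regression (intercept unpenalized).

    At beta = 0 with the intercept at its optimum, every p_i equals mean(y),
    so |d(-logL)/d beta_j| <= lambda holds for all j exactly when
    lambda >= max_j |X_j^T (y - mean(y))|.
    """
    return np.max(np.abs(X.T @ (y - y.mean())))


# Example with hypothetical data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = rng.integers(0, 2, size=100).astype(float)
print(lasso_lambda_max(X, y))  # any lambda above this value gives the all-zero model
```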
```python
import numpy as np
from scipy.special import expit


def l1_regularized_logistic_loss(beta, X, y, lambda_reg, fit_intercept=True):
    """
    Compute the L1-regularized logistic regression loss.

    Note: This function is non-differentiable at beta_j = 0.

    Parameters
    ----------
    beta : array, shape (p,) or (p+1,)
        Coefficient vector (includes intercept if fit_intercept=True)
    X : array, shape (n, p)
        Feature matrix
    y : array, shape (n,)
        Binary labels (0 or 1)
    lambda_reg : float
        Regularization strength
    fit_intercept : bool
        Whether first element of beta is the intercept

    Returns
    -------
    loss : float
        The penalized negative log-likelihood
    """
    n = X.shape[0]

    # Separate intercept from feature coefficients
    if fit_intercept:
        intercept = beta[0]
        coef = beta[1:]
        linear_pred = X @ coef + intercept
    else:
        linear_pred = X @ beta
        coef = beta

    # Compute probabilities
    prob = expit(linear_pred)
    eps = 1e-15
    prob = np.clip(prob, eps, 1 - eps)

    # Negative log-likelihood
    nll = -np.sum(y * np.log(prob) + (1 - y) * np.log(1 - prob))

    # L1 penalty on feature coefficients (not intercept)
    l1_penalty = lambda_reg * np.sum(np.abs(coef))

    return nll + l1_penalty


def l1_subgradient(beta, X, y, lambda_reg, fit_intercept=True):
    """
    Compute a subgradient of the L1-regularized loss.

    At points where |beta_j| > 0, this is the gradient.
    At beta_j = 0, we return 0 as a valid subgradient element.
    """
    n = X.shape[0]

    if fit_intercept:
        intercept = beta[0]
        coef = beta[1:]
        linear_pred = X @ coef + intercept
    else:
        linear_pred = X @ beta
        coef = beta

    prob = expit(linear_pred)
    residuals = prob - y

    # Subgradient of L1 penalty: sign(coef) where coef ≠ 0, else 0
    l1_subgrad = np.sign(coef)
    # For zero coefficients, we could use any value in [-1, 1]
    # Using 0 is a common convention for optimization

    if fit_intercept:
        grad_intercept = np.sum(residuals)
        grad_coef = X.T @ residuals + lambda_reg * l1_subgrad
        return np.concatenate([[grad_intercept], grad_coef])
    else:
        return X.T @ residuals + lambda_reg * l1_subgrad
```

The sparsity-inducing property of L1 regularization arises from the geometry of the constraint region. Consider minimizing a smooth convex function (the negative log-likelihood) subject to an L1 constraint:
L2 Constraint (Circle/Sphere): The constraint boundary is smooth everywhere. As we expand the log-likelihood contours until they touch the constraint, the tangent point generically has all coordinates non-zero.
L1 Constraint (Diamond): The constraint boundary has corners on the coordinate axes. These corners are "sticky"—when the likelihood gradient points toward a corner, the optimal solution sits at that corner with exact zeros.
To understand L1 sparsity analytically, consider the univariate problem of minimizing:
$$f(\beta) = \frac{1}{2}(\beta - z)^2 + \lambda|\beta|$$
where $z$ is the unconstrained optimum. The solution is the soft thresholding operator:
$$\hat{\beta} = \text{sign}(z) \cdot \max(|z| - \lambda, 0) = S_\lambda(z)$$
This has three regimes:
- If $z > \lambda$: $\hat{\beta} = z - \lambda$ (shrink toward zero by $\lambda$)
- If $|z| \leq \lambda$: $\hat{\beta} = 0$ (set exactly to zero)
- If $z < -\lambda$: $\hat{\beta} = z + \lambda$ (shrink toward zero by $\lambda$)
Contrast with L2: For the same problem with L2 penalty, the solution is $\hat{\beta} = z/(1 + \lambda)$—never exactly zero regardless of how small $z$ is.
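For a quick worked example, take $\lambda = 0.5$ and an unconstrained estimate $z = 0.3$:

$$S_{0.5}(0.3) = \text{sign}(0.3) \cdot \max(0.3 - 0.5,\, 0) = 0, \qquad \frac{z}{1+\lambda} = \frac{0.3}{1.5} = 0.2$$

The L1 solution drops the coefficient entirely, while the L2 solution merely shrinks it.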
Think of the L1 penalty as a "threshold" that effects must exceed to enter the model. With λ = 0.5, a feature needs a coefficient magnitude greater than 0.5 (in appropriate units) to appear in the final model. This creates a natural filter separating signal from noise.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import expit
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


def soft_threshold(z, lambda_reg):
    """
    Soft thresholding operator (proximal operator for L1 norm).

    S_λ(z) = sign(z) * max(|z| - λ, 0)

    This is the solution to: minimize_β (1/2)(β - z)² + λ|β|
    """
    return np.sign(z) * np.maximum(np.abs(z) - lambda_reg, 0)


def visualize_thresholding():
    """
    Visualize soft thresholding vs ridge shrinkage.
    """
    z = np.linspace(-3, 3, 1000)
    lambda_vals = [0.5, 1.0, 1.5]

    fig, axes = plt.subplots(1, 2, figsize=(12, 5))

    # Soft thresholding (L1)
    ax1 = axes[0]
    ax1.axhline(0, color='gray', linestyle='-', linewidth=0.5)
    ax1.axvline(0, color='gray', linestyle='-', linewidth=0.5)
    ax1.plot(z, z, 'k--', label='No penalty (β = z)', alpha=0.5)

    for lam in lambda_vals:
        beta_l1 = soft_threshold(z, lam)
        ax1.plot(z, beta_l1, label=f'λ = {lam}', linewidth=2)

    ax1.set_xlabel('Unconstrained estimate z', fontsize=12)
    ax1.set_ylabel('L1-regularized estimate β', fontsize=12)
    ax1.set_title('Soft Thresholding (L1/Lasso)', fontsize=14)
    ax1.legend()
    ax1.set_xlim(-3, 3)
    ax1.set_ylim(-3, 3)

    # Ridge shrinkage (L2)
    ax2 = axes[1]
    ax2.axhline(0, color='gray', linestyle='-', linewidth=0.5)
    ax2.axvline(0, color='gray', linestyle='-', linewidth=0.5)
    ax2.plot(z, z, 'k--', label='No penalty (β = z)', alpha=0.5)

    for lam in lambda_vals:
        beta_l2 = z / (1 + lam)
        ax2.plot(z, beta_l2, label=f'λ = {lam}', linewidth=2)

    ax2.set_xlabel('Unconstrained estimate z', fontsize=12)
    ax2.set_ylabel('L2-regularized estimate β', fontsize=12)
    ax2.set_title('Ridge Shrinkage (L2)', fontsize=14)
    ax2.legend()
    ax2.set_xlim(-3, 3)
    ax2.set_ylim(-3, 3)

    plt.tight_layout()
    plt.savefig('thresholding_comparison.png', dpi=150)
    plt.show()


# Demonstrate sparsity in multivariate case
def demonstrate_sparsity():
    """
    Show how L1 creates sparse solutions while L2 doesn't.
    """
    np.random.seed(42)

    # True sparse model
    p = 20
    true_coef = np.zeros(p)
    true_coef[:5] = [2.0, -1.5, 1.0, -0.5, 0.5]  # Only 5 non-zero

    # Generate data
    n = 200
    X = np.random.randn(n, p)
    prob = expit(X @ true_coef)
    y = (np.random.rand(n) < prob).astype(float)

    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Fit with L1 and L2
    l1_model = LogisticRegression(penalty='l1', C=1.0, solver='saga', max_iter=5000)
    l2_model = LogisticRegression(penalty='l2', C=1.0, solver='lbfgs', max_iter=5000)

    l1_model.fit(X_scaled, y)
    l2_model.fit(X_scaled, y)

    print("\nSparsity Comparison (20 features, true model has 5 non-zero)")
    print("=" * 60)
    print(f"{'Coefficient':<12} {'True':>8} {'L1':>10} {'L2':>10}")
    print("-" * 60)

    for j in range(p):
        l1_val = l1_model.coef_[0, j]
        l2_val = l2_model.coef_[0, j]
        print(f"β_{j+1:<10} {true_coef[j]:>8.3f} {l1_val:>10.4f} {l2_val:>10.4f}")

    print("-" * 60)
    print(f"Non-zero: {np.sum(true_coef != 0):>8} "
          f"{np.sum(np.abs(l1_model.coef_[0]) > 0.01):>10} "
          f"{np.sum(np.abs(l2_model.coef_[0]) > 0.01):>10}")


if __name__ == "__main__":
    demonstrate_sparsity()
```

Unlike L2 regularization, where the objective is smooth and standard Newton's method applies, L1 regularization creates a non-smooth optimization problem. The L1 penalty $\sum|\beta_j|$ is not differentiable at $\beta_j = 0$, exactly where we want the solution to lie for sparse models.
This requires specialized algorithms:
Coordinate Descent: Update one coefficient at a time while holding others fixed. For each coordinate, the subproblem has a closed-form solution via soft thresholding.
Proximal Gradient Methods: Apply gradient descent to the smooth part (log-likelihood), then apply the proximal operator (soft thresholding) for the L1 penalty.
Interior Point Methods: Transform the problem using slack variables and solve a sequence of smooth barrier problems.
Coordinate descent is the workhorse algorithm for L1-regularized problems. The key insight is that when all but one coefficient is fixed, the remaining subproblem has a closed-form solution.
Single-Coordinate Update:
Let $r_j = \beta_j^{\text{old}} - \frac{1}{L_j} \nabla_j (-\mathcal{L})$ be the Newton-style "partial residual" for coefficient $j$, where $L_j$ is the curvature (or an upper bound on it) of the negative log-likelihood along coordinate $j$. The updated coefficient is:
$$\beta_j^{\text{new}} = S_{\lambda'}(r_j) = \text{sign}(r_j) \cdot \max(|r_j| - \lambda', 0), \qquad \lambda' = \lambda / L_j$$
In the implementation below, $L_j$ is taken to be the $j$-th diagonal entry of $\mathbf{X}^\top \mathbf{W} \mathbf{X}$, with $\mathbf{W}$ the usual diagonal matrix of $p_i(1-p_i)$ weights.
Algorithm (cyclic coordinate descent, as implemented below):
1. Initialize the intercept and all coefficients to zero.
2. Update the unpenalized intercept with a Newton step.
3. For each coordinate $j = 1, \dots, p$: holding the other coefficients fixed, compute the gradient and curvature of the negative log-likelihood, form $r_j$, and apply soft thresholding with threshold $\lambda / L_j$.
4. Repeat the sweep until the largest change in any coefficient falls below a tolerance.
```python
import numpy as np
from scipy.special import expit


class L1LogisticRegressionCD:
    """
    L1-Regularized Logistic Regression using Coordinate Descent.

    This implementation uses the cyclic coordinate descent algorithm
    with soft thresholding updates.
    """

    def __init__(self, lambda_reg=1.0, max_iter=1000, tol=1e-6):
        self.lambda_reg = lambda_reg
        self.max_iter = max_iter
        self.tol = tol
        self.coef_ = None
        self.intercept_ = None
        self.n_iter_ = 0

    def _soft_threshold(self, z, threshold):
        """Soft thresholding operator."""
        return np.sign(z) * np.maximum(np.abs(z) - threshold, 0)

    def fit(self, X, y):
        """
        Fit L1-regularized logistic regression using coordinate descent.

        The algorithm cycles through coordinates, updating each using
        a Newton step followed by soft thresholding.
        """
        n, p = X.shape

        # Initialize
        intercept = 0.0
        coef = np.zeros(p)

        # Precompute squared norms of feature columns
        col_sq_norms = np.sum(X ** 2, axis=0)

        for iteration in range(self.max_iter):
            coef_old = coef.copy()
            intercept_old = intercept

            # Update intercept (no regularization)
            linear_pred = X @ coef + intercept
            prob = expit(linear_pred)
            grad_intercept = np.sum(prob - y)
            hess_intercept = np.sum(prob * (1 - prob))
            if hess_intercept > 1e-10:
                intercept = intercept - grad_intercept / hess_intercept

            # Update each coefficient
            for j in range(p):
                # Current predictions without feature j's contribution
                linear_pred_no_j = X @ coef + intercept - X[:, j] * coef[j]

                # Residual with linearization
                prob = expit(linear_pred_no_j + X[:, j] * coef[j])

                # Gradient and approximate Hessian for coordinate j
                residual = prob - y
                grad_j = np.dot(X[:, j], residual)

                # Hessian approximation: use diagonal of X'WX
                weights = prob * (1 - prob)
                hess_j = np.dot(X[:, j] ** 2, weights)

                if hess_j < 1e-10:
                    continue

                # Newton direction (without penalty)
                z = coef[j] - grad_j / hess_j

                # Soft thresholding
                threshold = self.lambda_reg / hess_j
                coef[j] = self._soft_threshold(z, threshold)

            # Check convergence
            coef_change = np.max(np.abs(coef - coef_old))
            intercept_change = np.abs(intercept - intercept_old)

            if max(coef_change, intercept_change) < self.tol:
                self.n_iter_ = iteration + 1
                break
        else:
            self.n_iter_ = self.max_iter

        self.coef_ = coef
        self.intercept_ = intercept
        return self

    def predict_proba(self, X):
        """Predict class probabilities."""
        linear_pred = X @ self.coef_ + self.intercept_
        prob_1 = expit(linear_pred)
        return np.column_stack([1 - prob_1, prob_1])

    def predict(self, X, threshold=0.5):
        """Predict class labels."""
        return (self.predict_proba(X)[:, 1] >= threshold).astype(int)


# Demonstration
if __name__ == "__main__":
    np.random.seed(42)

    # Generate sparse data
    n, p = 300, 50
    X = np.random.randn(n, p)

    # True sparse model
    true_coef = np.zeros(p)
    true_coef[:5] = [2.0, -1.5, 1.0, -0.8, 0.6]

    prob = expit(X @ true_coef)
    y = (np.random.rand(n) < prob).astype(float)

    # Standardize
    X = (X - X.mean(axis=0)) / X.std(axis=0)

    # Fit with different regularization strengths
    print("\nL1 Logistic Regression - Coordinate Descent")
    print("=" * 60)

    for lam in [0.01, 0.1, 0.5, 1.0]:
        model = L1LogisticRegressionCD(lambda_reg=lam, max_iter=2000)
        model.fit(X, y)

        n_nonzero = np.sum(np.abs(model.coef_) > 1e-6)
        accuracy = np.mean(model.predict(X) == y)

        print(f"λ={lam:5.2f}: {n_nonzero:2d} non-zero coefficients, "
              f"accuracy={accuracy:.4f}, iterations={model.n_iter_}")
```

Proximal gradient methods separate the objective into smooth and non-smooth parts:
$$J(\boldsymbol{\beta}) = \underbrace{-\mathcal{L}(\boldsymbol{\beta})}_{g(\boldsymbol{\beta}) \text{ (smooth)}} + \underbrace{\lambda |\boldsymbol{\beta}|_1}_{h(\boldsymbol{\beta}) \text{ (non-smooth)}}$$
ISTA (Iterative Shrinkage-Thresholding Algorithm):
$$\boldsymbol{\beta}^{(t+1)} = \text{prox}_{\alpha h}\left( \boldsymbol{\beta}^{(t)} - \alpha \nabla g(\boldsymbol{\beta}^{(t)}) \right)$$
where $\alpha$ is the step size and the proximal operator for L1 is soft thresholding:
$$\text{prox}_{\alpha h}(\mathbf{z}) = S_{\alpha\lambda}(\mathbf{z})$$
FISTA (Fast ISTA): Adds momentum (Nesterov acceleration) for $O(1/k^2)$ convergence instead of $O(1/k)$:
$$\boldsymbol{\beta}^{(t+1)} = \text{prox}_{\alpha h}\left( \mathbf{v}^{(t)} - \alpha \nabla g(\mathbf{v}^{(t)}) \right)$$ $$\mathbf{v}^{(t+1)} = \boldsymbol{\beta}^{(t+1)} + \frac{t}{t+3}(\boldsymbol{\beta}^{(t+1)} - \boldsymbol{\beta}^{(t)})$$
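The following is a compact, illustrative sketch of FISTA for this objective (the function name `fista_l1_logistic` is my own). It omits the intercept and uses the standard Lipschitz bound $\|\mathbf{X}\|_2^2 / 4$ for the logistic negative log-likelihood to set the step size.

```python
import numpy as np
from scipy.special import expit


def fista_l1_logistic(X, y, lam, n_iter=500):
    """Illustrative FISTA for L1-penalized logistic regression (no intercept)."""
    n, p = X.shape
    # Step size from a Lipschitz bound on the gradient of the logistic NLL
    step = 4.0 / (np.linalg.norm(X, 2) ** 2)
    beta = np.zeros(p)
    v = beta.copy()
    for t in range(n_iter):
        grad = X.T @ (expit(X @ v) - y)                                # gradient of smooth part at v
        z = v - step * grad                                            # gradient step
        beta_new = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0)  # prox = soft thresholding
        v = beta_new + (t / (t + 3)) * (beta_new - beta)               # Nesterov momentum
        beta = beta_new
    return beta
```

Dropping the momentum line (setting `v = beta_new`) recovers plain ISTA.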
Coordinate descent is typically fastest for L1-regularized logistic regression; it underlies glmnet and (in a proximal Newton form) scikit-learn's liblinear solver, while scikit-learn's saga solver handles the L1 penalty with a proximal stochastic gradient approach. Proximal methods are more general and work for any convex penalty with a computable proximal operator. Interior point methods provide high accuracy but scale poorly to very large problems.
Just as L2 regularization corresponds to a Gaussian prior, L1 regularization corresponds to Maximum A Posteriori (MAP) estimation with a Laplace (double-exponential) prior:
$$p(\beta_j) = \frac{\lambda}{2} \exp(-\lambda |\beta_j|)$$
Log-prior: $$\log p(\boldsymbol{\beta}) = \sum_j \log\left(\frac{\lambda}{2}\right) - \lambda \sum_j |\beta_j| = \text{const} - \lambda |\boldsymbol{\beta}|_1$$
MAP objective: $$\log p(\boldsymbol{\beta} | \mathbf{y}, \mathbf{X}) \propto \log p(\mathbf{y} | \mathbf{X}, \boldsymbol{\beta}) + \log p(\boldsymbol{\beta}) = \mathcal{L}(\boldsymbol{\beta}) - \lambda |\boldsymbol{\beta}|_1$$
Maximizing this is equivalent to minimizing the L1-penalized negative log-likelihood.
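As a quick numerical check of this equivalence, the snippet below evaluates the Laplace log-prior with scipy.stats.laplace and compares it to $\text{const} - \lambda |\boldsymbol{\beta}|_1$; the coefficient values are illustrative.

```python
import numpy as np
from scipy.stats import laplace

lam = 3.0
beta = np.array([0.7, -1.2, 0.0, 0.4])  # illustrative coefficients

# Log-density of independent Laplace(0, scale = 1/lam) priors on each coefficient
log_prior = laplace(loc=0.0, scale=1.0 / lam).logpdf(beta).sum()

# Should equal const - lam * ||beta||_1, with const = p * log(lam / 2)
const = beta.size * np.log(lam / 2)
print(log_prior, const - lam * np.sum(np.abs(beta)))  # identical up to rounding
```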
Gaussian (L2) Prior:
- Smooth, rounded peak at zero with light tails
- Its log-density contributes a quadratic penalty $\beta_j^2$, so estimates are shrunk but not exactly zero

Laplace (L1) Prior:
- Sharp cusp at zero with heavier, exponential tails
- Its log-density contributes a linear penalty $|\beta_j|$, so the MAP estimate can sit exactly at zero
The Laplace prior has more mass concentrated at zero and heavier tails than the Gaussian. This embodies a "spike-and-slab" intuition: most coefficients should be zero, but those that aren't can be substantial.
| Property | Gaussian (L2) | Laplace (L1) |
|---|---|---|
| Density formula | $\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\beta^2/(2\sigma^2)}$ | $\frac{\lambda}{2}e^{-\lambda\vert\beta\vert}$ |
| Shape at zero | Smooth, rounded peak | Sharp cusp (non-differentiable) |
| Tail behavior | Light (Gaussian) | Heavy (exponential) |
| Mode | At zero | At zero |
| Sparsity preference | No | Yes (exact zeros likely) |
| Log-penalty | Quadratic $\beta^2$ | Linear $\vert\beta\vert$ |
While L1-regularized estimation finds the MAP point estimate, full Bayesian inference would compute the posterior distribution. With a Laplace prior, the posterior peaks at sparse solutions, but unlike the spike-and-slab prior, it doesn't place exact point mass at zero. For true Bayesian variable selection, more sophisticated priors (spike-and-slab, horseshoe) are used.
L1 regularization's ability to set coefficients exactly to zero makes it a powerful tool for embedded feature selection—feature selection performed as part of model fitting rather than as a separate preprocessing step.
Advantages over manual selection:
- Selection and coefficient estimation happen jointly, so each feature is evaluated in the context of the others rather than screened one at a time.
- The degree of sparsity is governed by a single tuning parameter $\lambda$, which can be chosen by cross-validation.
- It scales to high-dimensional settings where manually vetting hundreds or thousands of candidate features is impractical.

Properties of L1 selection:
- Larger $\lambda$ values generally yield smaller selected sets, tracing out a path of increasingly sparse models.
- When $p > n$, at most $n$ features can receive non-zero coefficients.
- With correlated predictors, the selected set can be unstable, as discussed next.
L1 regularization has a well-known limitation: instability with correlated features.
Consider two highly correlated features $x_1$ and $x_2$ that are both predictive of $y$:
- The Lasso tends to retain one of them and set the other's coefficient exactly to zero.
- Which one is retained can depend on small amounts of noise; resampling the data can flip the choice.
- The retained feature's coefficient absorbs the shared signal of both.
Why this happens: The L1 penalty depends only on $|\beta_1| + |\beta_2|$, not on how the total weight is distributed between the two features. If the fit essentially requires $\beta_1 + \beta_2 = c$ (with $c > 0$), then:
- $(\beta_1, \beta_2) = (c, 0)$
- $(\beta_1, \beta_2) = (0, c)$
- $(\beta_1, \beta_2) = (c/2, c/2)$

All three have the same penalty $\lambda c$! The data determines where along this equal-penalty segment the solution lands (often at one of the endpoints, i.e., keeping just one feature), and that choice is sensitive to noise.
The arbitrary selection among correlated features is a significant issue for interpretation. A coefficient being zero doesn't mean the feature is unimportant—it might just be collinear with a retained feature. For stable feature selection with correlated predictors, consider Elastic Net or stability selection techniques.
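The small simulation below (illustrative, not from any particular dataset) makes this concrete: two nearly identical predictors are refit on bootstrap resamples, and how the L1 model splits the weight between them, including which one is dropped, can change from resample to resample.

```python
import numpy as np
from scipy.special import expit
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)              # correlation ≈ 0.999
X = np.column_stack([x1, x2])
y = (rng.random(n) < expit(1.5 * x1)).astype(int)

for seed in range(5):
    idx = np.random.default_rng(seed).integers(0, n, size=n)   # bootstrap resample
    model = LogisticRegression(penalty='l1', C=0.5, solver='liblinear')
    model.fit(X[idx], y[idx])
    print(f"resample {seed}: coefficients = {np.round(model.coef_[0], 3)}")
```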
Under certain conditions, the Lasso can recover the true sparse model consistently as $n \to \infty$. The key requirement is the Irrepresentable Condition; one common (sign-free) form requires:
$$\max_{j \notin S} \left\| (\mathbf{X}_S^\top \mathbf{X}_S)^{-1} \mathbf{X}_S^\top \mathbf{X}_j \right\|_1 \leq 1 - \epsilon$$
where $S$ is the true support (indices of non-zero coefficients), $\mathbf{X}_S$ is the submatrix of truly relevant predictors, and $\mathbf{X}_j$ is an irrelevant ("noise") column. A separate "beta-min" condition, requiring the non-zero coefficients to be large enough, is also needed for exact support recovery.
Interpretation: The correlation between irrelevant features and the relevant feature space must be bounded below 1. If an irrelevant feature is too correlated with relevant ones, the Lasso may incorrectly select it.
In practice: This condition often fails in real high-dimensional data, limiting the Lasso's reliability for identifying the "true" model. For prediction, however, L1 regularization remains excellent even when consistency fails.
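The sketch below (the helper name `irrepresentable_stat` is my own) computes the sign-free form of this quantity for a given design matrix and candidate support; values at or above 1 warn that support recovery may fail.

```python
import numpy as np


def irrepresentable_stat(X, support):
    """Max over irrelevant columns j of ||(X_S^T X_S)^{-1} X_S^T X_j||_1.
    Values below 1 satisfy the (sign-free) irrepresentable condition."""
    S = np.asarray(support)
    others = np.setdiff1d(np.arange(X.shape[1]), S)
    XS = X[:, S]
    G_inv = np.linalg.inv(XS.T @ XS)
    return max(np.abs(G_inv @ XS.T @ X[:, j]).sum() for j in others)


rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                                    # near-orthogonal design
print(irrepresentable_stat(X, support=[0, 1, 2, 3, 4]))           # typically well below 1

X[:, 5] = X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=500)       # noise column collinear with support
print(irrepresentable_stat(X, support=[0, 1, 2, 3, 4]))           # now far above 1
```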
```python
import numpy as np
from scipy.special import expit
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.preprocessing import StandardScaler


def analyze_l1_feature_selection(X, y, true_support=None):
    """
    Analyze L1 feature selection across multiple regularization strengths.

    Parameters
    ----------
    X : array, shape (n, p)
        Feature matrix
    y : array, shape (n,)
        Binary labels
    true_support : array-like, optional
        Indices of truly relevant features (for evaluation)

    Returns
    -------
    results : dict
        Analysis results including selected features at optimal lambda
    """
    # Standardize features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Range of regularization strengths
    C_values = np.logspace(-3, 2, 50)  # C = 1/lambda

    # Track which features are selected at each C
    selected_features = []
    for C in C_values:
        model = LogisticRegression(
            penalty='l1', C=C, solver='saga', max_iter=5000
        )
        model.fit(X_scaled, y)
        # Features with non-zero coefficients
        nonzero = np.where(np.abs(model.coef_[0]) > 1e-6)[0]
        selected_features.append(set(nonzero))

    # Cross-validation to find optimal C
    cv_model = LogisticRegressionCV(
        penalty='l1', solver='saga', Cs=20, cv=5, max_iter=5000
    )
    cv_model.fit(X_scaled, y)

    optimal_C = cv_model.C_[0]
    optimal_features = np.where(np.abs(cv_model.coef_[0]) > 1e-6)[0]

    # Evaluate selection if true support is known
    if true_support is not None:
        true_set = set(true_support)
        selected_set = set(optimal_features)

        true_positives = len(true_set & selected_set)
        false_positives = len(selected_set - true_set)
        false_negatives = len(true_set - selected_set)

        precision = true_positives / (true_positives + false_positives) if selected_set else 0
        recall = true_positives / len(true_set) if true_set else 1
    else:
        precision, recall = None, None

    return {
        'optimal_C': optimal_C,
        'optimal_lambda': 1 / optimal_C,
        'selected_features': optimal_features,
        'n_selected': len(optimal_features),
        'coefficients': cv_model.coef_[0],
        'precision': precision,
        'recall': recall
    }


def stability_selection(X, y, n_bootstrap=100, threshold=0.5):
    """
    Stability selection: run L1 on bootstrap samples and keep features
    selected in at least 'threshold' fraction of samples.
    """
    n, p = X.shape
    selection_counts = np.zeros(p)

    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    for _ in range(n_bootstrap):
        # Bootstrap sample
        indices = np.random.choice(n, size=n, replace=True)
        X_boot = X_scaled[indices]
        y_boot = y[indices]

        # Fit L1 with CV
        model = LogisticRegressionCV(
            penalty='l1', solver='saga', Cs=10, cv=3, max_iter=2000
        )
        model.fit(X_boot, y_boot)

        # Record selected features
        selected = np.abs(model.coef_[0]) > 1e-6
        selection_counts += selected

    # Features selected in at least threshold fraction of bootstraps
    selection_probs = selection_counts / n_bootstrap
    stable_features = np.where(selection_probs >= threshold)[0]

    return stable_features, selection_probs


# Example
if __name__ == "__main__":
    np.random.seed(42)

    # Data with sparse true model
    n, p = 500, 100
    X = np.random.randn(n, p)

    # Add correlation blocks
    X[:, 5:10] = X[:, 0:1] + 0.1 * np.random.randn(n, 5)  # Correlated with X_0

    true_support = [0, 1, 2, 3, 4]  # First 5 features are relevant
    true_coef = np.zeros(p)
    true_coef[true_support] = [1.5, -1.0, 0.8, -0.6, 0.5]

    prob = expit(X @ true_coef)
    y = (np.random.rand(n) < prob).astype(float)

    # Analyze L1 selection
    results = analyze_l1_feature_selection(X, y, true_support)

    print("\nL1 Feature Selection Analysis")
    print("=" * 60)
    print(f"Optimal λ: {results['optimal_lambda']:.4f}")
    print(f"Features selected: {results['selected_features']}")
    print(f"Number selected: {results['n_selected']} (true: {len(true_support)})")
    print(f"Precision: {results['precision']:.3f}")
    print(f"Recall: {results['recall']:.3f}")

    # Stability selection
    stable, probs = stability_selection(X, y, n_bootstrap=50, threshold=0.6)
    print(f"\nStability Selection (60% threshold): {stable}")
```

The choice between L1 and L2 regularization depends on your goals and data characteristics:
| Criterion | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Sparsity | Creates exact zeros | Shrinks but non-zero |
| Feature selection | Built-in | Requires separate step |
| Correlated features | Arbitrarily selects one | Shares weight proportionally |
| Optimization | Non-smooth, specialized algorithms | Smooth, Newton methods |
| Bayesian prior | Laplace (peaked at zero) | Gaussian (smooth at zero) |
| When p > n | Can select at most n features | Handles naturally |
| Interpretability | Sparse model, clear feature set | All features contribute |
| Prediction accuracy | Good when the true model is sparse | Often slightly better when many features carry signal |
Choose L1 (Lasso) when:
- You believe only a subset of the features is truly predictive and want a sparse, interpretable model
- Feature selection is itself a goal, not just prediction
- There are many candidate features and you suspect most are noise

Choose L2 (Ridge) when:
- You expect most features to contribute at least a little signal
- Features are highly correlated and you want their weights shared rather than one arbitrarily selected
- Smooth, stable optimization and prediction accuracy matter more than a compact feature set

Consider Elastic Net (coming next) when:
- You want sparsity but your predictors include groups of correlated features
- You want selection behavior that is more stable than pure L1
In real applications, try both L1 and L2 (and Elastic Net) with cross-validation. The "best" regularization often depends on the specific dataset. Use cross-validated prediction performance to guide the choice, and consider domain knowledge about the expected sparsity level.
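For example, a minimal cross-validated comparison along these lines might look like the sketch below; the synthetic data from make_classification is purely a stand-in for your own X and y.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: 50 features, only 5 informative
X, y = make_classification(n_samples=400, n_features=50, n_informative=5, random_state=0)

for penalty, solver in [('l1', 'saga'), ('l2', 'lbfgs')]:
    model = make_pipeline(
        StandardScaler(),
        LogisticRegressionCV(penalty=penalty, solver=solver, Cs=10, cv=5, max_iter=5000),
    )
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{penalty}: mean CV accuracy = {score:.3f}")
```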
The next page introduces Elastic Net regularization, which combines L1 and L2 penalties. Elastic Net inherits the sparsity of L1 while addressing its limitations with correlated features, offering the best of both worlds for many applications.