One of Lasso's most celebrated properties is automatic feature selection. By driving coefficients exactly to zero, Lasso doesn't just shrink the model—it identifies which features matter and discards the rest. This transforms a regression method into a variable selection tool.
But this power comes with nuances. Feature selection via Lasso is not a black box that always works. Understanding when and how Lasso selects features—and when it fails—is essential for practitioners who rely on Lasso for scientific inference or dimensionality reduction.
This page addresses the critical questions: When can you trust the features Lasso selects? When does selection fail, and what can you do about it?
By the end of this page, you will understand Lasso as a feature selection method, its guarantees and limitations, the crucial distinction between prediction and selection, stability selection for robust variable identification, and post-selection inference.
A fundamental distinction underlies much confusion about Lasso: prediction and feature selection are different objectives that may require different approaches.
Prediction Goal:
Find a model $\hat{f}$ that minimizes expected prediction error:
$$\mathbb{E}[(Y - \hat{f}(\mathbf{X}))^2]$$
We don't care which features are used, only that predictions are accurate. Including irrelevant features is fine if it doesn't hurt prediction.
Feature Selection Goal:
Identify the true support $S = \{j : \beta_j^* \neq 0\}$—the features that genuinely influence the outcome. We want:
$$\hat{S} = S$$
with high probability. False positives (including irrelevant features) and false negatives (missing relevant features) both matter.
Why the Distinction Matters:
| Aspect | For Prediction | For Selection |
|---|---|---|
| Correlated features | Interchangeable—keep any | Must identify the correct one |
| False inclusions | Mild penalty (bias) | Serious error (wrong science) |
| Optimal λ | Minimize CV prediction error | May need different λ |
| Noise variables | Tolerable if shrunk | Must exclude all |
| Validation | Test set MSE | Known-truth simulations |
Lasso's Dual Role:
Lasso attempts to serve both goals simultaneously, which creates tension: the $\lambda$ that minimizes cross-validated prediction error is typically smaller than the $\lambda$ needed for exact support recovery, so a prediction-tuned Lasso tends to include extra, irrelevant features.
Example of Divergence:
Consider 100 features where feature 1 carries the true signal and feature 6 is a noise variable that happens to be highly correlated with feature 1.
For prediction: Including feature 6 instead of feature 1 barely changes predictions—they're nearly interchangeable.
For selection: Lasso might select feature 6 (noise) and exclude feature 1 (true signal). This is scientifically wrong despite good predictions.
Good cross-validation prediction error does NOT guarantee correct feature selection. A model with wrong features can predict well. If your goal is scientific understanding (identifying true drivers), prediction accuracy is necessary but not sufficient.
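To make the divergence concrete, here is a minimal simulation sketch of the example above, with `x1` standing in for feature 1 (the true signal) and `x6` for feature 6 (correlated noise). The sample size, correlation of 0.99, and noise level are illustrative assumptions; the point is that the two single-feature models have nearly identical test error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n, rho = 1000, 0.99

# "x1" is the true driver; "x6" is a correlated noise variable
x1 = rng.standard_normal(n)
x6 = rho * x1 + np.sqrt(1 - rho**2) * rng.standard_normal(n)
y = 1.0 * x1 + 1.0 * rng.standard_normal(n)

train, test = slice(0, n // 2), slice(n // 2, None)
for name, x in [("true driver x1", x1), ("correlated proxy x6", x6)]:
    model = LinearRegression().fit(x[train].reshape(-1, 1), y[train])
    err = mean_squared_error(y[test], model.predict(x[test].reshape(-1, 1)))
    print(f"{name}: test MSE = {err:.3f}")
# The two test errors are nearly identical, yet only one model names the true cause.
```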
When does Lasso successfully recover the true support? This requires conditions on both the design matrix $\mathbf{X}$ and the true coefficients $\boldsymbol{\beta}^*$.
The Three Key Conditions:
1. Irrepresentable Condition (IC):
Let $S$ be the true support and $S^c$ its complement. The irrepresentable condition requires:
$$\left\|\mathbf{C}_{S^cS}\,\mathbf{C}_{SS}^{-1}\,\text{sign}(\boldsymbol{\beta}_S^*)\right\|_\infty < 1$$
where $\mathbf{C} = \frac{1}{n}\mathbf{X}^T\mathbf{X}$ is the sample Gram matrix (equal to the sample correlation matrix when the columns of $\mathbf{X}$ are standardized).
Interpretation: Irrelevant features cannot be perfectly "explained by" relevant features. If an irrelevant feature is highly correlated with relevant ones (weighted by their signs), Lasso may incorrectly select it.
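In a simulation where the true support and signs are known, the displayed quantity can be checked numerically. A minimal sketch (the function and variable names below are our own, not a standard API):

```python
import numpy as np

def irrepresentable_check(X, support, sign_beta_S):
    """Evaluate ||C_{S^c S} C_{SS}^{-1} sign(beta_S)||_inf for a known support."""
    n, p = X.shape
    C = X.T @ X / n                              # sample Gram matrix
    S = np.asarray(support)
    Sc = np.setdiff1d(np.arange(p), S)
    C_SS = C[np.ix_(S, S)]
    C_ScS = C[np.ix_(Sc, S)]
    value = np.max(np.abs(C_ScS @ np.linalg.solve(C_SS, sign_beta_S)))
    return value                                 # condition holds if value < 1

# Example with an (assumed) known truth: feature 5 nearly copies feature 0
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))
X[:, 5] = 0.9 * X[:, 0] + 0.1 * rng.standard_normal(200)
print(irrepresentable_check(X, support=[0, 1], sign_beta_S=np.array([1.0, 1.0])))
```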
2. Beta-Min Condition:
The smallest true non-zero coefficient must be large enough:
$$\min_{j \in S} |\beta_j^*| \geq c \cdot \sigma \sqrt{\frac{\log p}{n}}$$
for some constant $c > 0$. This ensures signals are detectable above the noise floor.
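As a quick illustration of the threshold's order of magnitude (the constant $c$ is unspecified by the theory, so $c = 1$ below is only a placeholder assumption):

```python
import numpy as np

n, p, sigma, c = 200, 1000, 1.0, 1.0   # c = 1 is a placeholder; theory only says "some constant"
threshold = c * sigma * np.sqrt(np.log(p) / n)
print(f"Smallest detectable |beta_j| is roughly {threshold:.3f}")   # ≈ 0.19 for these values
```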
3. Restricted Eigenvalue Condition:
The design matrix must be well-conditioned on sparse directions:
$$\left\|\mathbf{X}_S \mathbf{v}\right\|_2^2 \geq \kappa\, n \,\|\mathbf{v}\|_2^2$$
for all vectors $\mathbf{v}$ supported on sparse index sets $S$. This prevents near-collinearity among the relevant features.
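A simplified numerical proxy for this condition restricts attention to vectors supported on the true $S$, i.e., the smallest eigenvalue of $\frac{1}{n}\mathbf{X}_S^T\mathbf{X}_S$; the full restricted eigenvalue condition is stated over a cone of approximately sparse vectors and is harder to verify. A sketch under that simplification:

```python
import numpy as np

def restricted_eigenvalue_proxy(X, support):
    """Smallest eigenvalue of (1/n) X_S^T X_S -- a simplified stand-in for kappa."""
    n = X.shape[0]
    X_S = X[:, support]
    return np.linalg.eigvalsh(X_S.T @ X_S / n).min()

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 50))
print(restricted_eigenvalue_proxy(X, support=[0, 10, 20]))   # well above 0: well-conditioned on S
```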
Theoretical Guarantee:
Theorem (Selection Consistency): Under the irrepresentable condition, beta-min condition, and restricted eigenvalue condition, with $\lambda \asymp \sigma\sqrt{\frac{\log p}{n}}$:
$$P(\hat{S}_{\text{lasso}} = S) \to 1 \quad \text{as } n \to \infty$$
Reality Check:
These conditions are sufficient but not necessary. Lasso sometimes works even when conditions are violated (lucky data). But it can also fail when conditions appear satisfied (bad luck or finite-sample effects).
More importantly, we rarely know if conditions hold in practice. We don't know the true support $S$. The irrepresentable condition depends on $\text{sign}(\boldsymbol{\beta}^*)$, which is unknown. We must work with assumptions, not certainties.
High correlation between features is a red flag for selection reliability. If you know or suspect groups of correlated features, consider using Elastic Net or Group Lasso. If you need reliable selection, use stability selection (coming next) instead of trusting a single Lasso run.
The most common practical failure mode for Lasso selection is correlated features. This deserves detailed examination.
The Issue:
When features $j$ and $k$ are highly correlated ($|\rho_{jk}| \approx 1$) and the true model uses only feature $j$, Lasso may select either one (or split the coefficient between them), and small perturbations of the data can flip which one is chosen.
Demonstration:
```python
import numpy as np
from sklearn.linear_model import Lasso

def demonstrate_correlation_instability(rho=0.95, n_trials=100):
    """
    Show how Lasso selection becomes unstable with correlated features.

    True model: y = 2*x1 + noise
    x2 is correlated with x1 (correlation = rho)
    """
    np.random.seed(42)
    n = 100
    selections = {'x1_only': 0, 'x2_only': 0, 'both': 0, 'neither': 0}

    for trial in range(n_trials):
        # Generate correlated features
        x1 = np.random.randn(n)
        x2 = rho * x1 + np.sqrt(1 - rho**2) * np.random.randn(n)
        X = np.column_stack([x1, x2])

        # True model uses only x1
        y = 2 * x1 + 0.5 * np.random.randn(n)

        # Fit Lasso
        lasso = Lasso(alpha=0.1, fit_intercept=True)
        lasso.fit(X, y)

        coef1_nonzero = abs(lasso.coef_[0]) > 0.01
        coef2_nonzero = abs(lasso.coef_[1]) > 0.01

        if coef1_nonzero and not coef2_nonzero:
            selections['x1_only'] += 1
        elif coef2_nonzero and not coef1_nonzero:
            selections['x2_only'] += 1
        elif coef1_nonzero and coef2_nonzero:
            selections['both'] += 1
        else:
            selections['neither'] += 1

    print(f"Selection outcomes over {n_trials} trials (correlation = {rho}):")
    print(f"  x1 only (correct): {selections['x1_only']}%")
    print(f"  x2 only (wrong):   {selections['x2_only']}%")
    print(f"  Both selected:     {selections['both']}%")
    print(f"  Neither:           {selections['neither']}%")

    return selections

# Low correlation: stable selection
print("Low correlation (ρ = 0.3):")
demonstrate_correlation_instability(rho=0.3)

print("\nHigh correlation (ρ = 0.95):")
demonstrate_correlation_instability(rho=0.95)

# With high correlation, Lasso may select the wrong feature!
```

The Grouping Effect Problem:
When features form correlated groups, Lasso tends to pick one representative from each group (often arbitrarily) and zero out the rest, so which member gets selected can change from sample to sample.
Example: if five features all track the same underlying factor, Lasso will typically keep one of them and drop the other four, even though they are essentially interchangeable.
Solutions:
Elastic Net (α*L1 + (1-α)*L2) with α < 1 exhibits the 'grouping effect': correlated features tend to have similar coefficients, all entering or leaving together. This stabilizes selection among correlated groups, though it reduces sparsity.
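A minimal sketch of the grouping effect using scikit-learn (the `alpha` and `l1_ratio` values are arbitrary illustrations; `l1_ratio` plays the role of the mixing weight α above):

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(3)
n = 300
z = rng.standard_normal(n)
# Three highly correlated copies of the same latent signal, plus five noise features
X = np.column_stack([z + 0.1 * rng.standard_normal(n) for _ in range(3)]
                    + [rng.standard_normal(n) for _ in range(5)])
y = z + 0.3 * rng.standard_normal(n)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print("Lasso coefficients:      ", np.round(lasso.coef_[:3], 2))   # often one large, others near 0
print("Elastic Net coefficients:", np.round(enet.coef_[:3], 2))    # weight spread across the group
```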
Stability selection (Meinshausen & Bühlmann, 2010) addresses the instability of Lasso selection by aggregating results across many subsamples. Features that are consistently selected across perturbations are deemed reliable.
The Algorithm:
For $b = 1, \ldots, B$ bootstrap/subsample iterations: draw a random subsample of the data (typically half the observations, without replacement), fit the Lasso with penalty $\lambda$ on that subsample, and record which coefficients are non-zero.
Compute the selection probability for each feature: $$\hat{\Pi}_j^{(\lambda)} = \frac{1}{B}\sum_{b=1}^B \mathbf{1}\{\hat{\beta}_{j}^{(b)}(\lambda) \neq 0\}$$
Select features with selection probability above threshold $\pi_{\text{thr}}$: $$\hat{S}^{\text{stable}} = \{j : \hat{\Pi}_j^{(\lambda)} \geq \pi_{\text{thr}}\}$$
Typical choices: $\pi_{\text{thr}} = 0.6$ to $0.9$, $B = 100$ to $500$ subsamples.
```python
import numpy as np
from sklearn.linear_model import LassoCV

def stability_selection(X, y, n_bootstrap=100, sample_fraction=0.5, threshold=0.6):
    """
    Perform stability selection for robust feature identification.

    Parameters
    ----------
    X : ndarray of shape (n, p)
        Feature matrix
    y : ndarray of shape (n,)
        Response vector
    n_bootstrap : int
        Number of subsampling iterations
    sample_fraction : float
        Fraction of samples to use in each subsample
    threshold : float
        Selection probability threshold (0.5 to 1.0)

    Returns
    -------
    stable_features : ndarray
        Indices of stably selected features
    selection_probabilities : ndarray
        Selection probability for each feature
    """
    n, p = X.shape
    subsample_size = int(n * sample_fraction)

    # Track selection across bootstraps
    selection_counts = np.zeros(p)

    for b in range(n_bootstrap):
        # Random subsample
        indices = np.random.choice(n, size=subsample_size, replace=False)
        X_sub = X[indices]
        y_sub = y[indices]

        # Fit Lasso with CV-selected lambda
        lasso = LassoCV(cv=5, random_state=b)
        lasso.fit(X_sub, y_sub)

        # Record selections
        selected = np.abs(lasso.coef_) > 1e-10
        selection_counts += selected

    # Compute selection probabilities
    selection_probabilities = selection_counts / n_bootstrap

    # Identify stable features
    stable_features = np.where(selection_probabilities >= threshold)[0]

    return stable_features, selection_probabilities

# Example usage
np.random.seed(42)
n, p = 200, 50

# Create features with correlation structure
X = np.random.randn(n, p)
# Make features 0-4 correlated with each other
for j in range(1, 5):
    X[:, j] = 0.8 * X[:, 0] + 0.2 * np.random.randn(n)

# True model: features 0, 10, 20 matter
beta_true = np.zeros(p)
beta_true[0] = 2.0
beta_true[10] = -1.5
beta_true[20] = 1.0
y = X @ beta_true + 0.5 * np.random.randn(n)

# Single Lasso run
single_lasso = LassoCV(cv=5).fit(X, y)
single_selected = np.where(np.abs(single_lasso.coef_) > 1e-10)[0]
print(f"Single Lasso selected: {single_selected}")
print(f"True features: [0, 10, 20]")

# Stability selection
stable_features, probs = stability_selection(X, y, n_bootstrap=100, threshold=0.7)
print(f"\nStability selection (threshold=0.7): {stable_features}")
print(f"Selection probabilities for true features:")
print(f"  Feature 0:  {probs[0]:.2f}")
print(f"  Feature 10: {probs[10]:.2f}")
print(f"  Feature 20: {probs[20]:.2f}")
```

Theoretical Guarantees:
Stability selection provides control over false positives. Under mild conditions:
$$\mathbb{E}[|\hat{S}^{\text{stable}} \cap S^c|] \leq \frac{q^2}{(2\pi_{\text{thr}} - 1)p}$$
where $q$ is the expected number of selected variables per subsample and $S^c$ is the set of irrelevant variables.
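For a feel of the numbers, here is the bound plugged in with illustrative values (the values below are assumptions, not recommendations):

```python
# Illustrative plug-in of the false-positive bound
q, pi_thr, p = 10, 0.7, 1000                      # ~10 features selected per subsample
expected_false_positives = q**2 / ((2 * pi_thr - 1) * p)
print(expected_false_positives)                    # 0.25 expected irrelevant selections
```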
Benefits: far fewer false positives than a single Lasso fit, explicit error control via the bound above, and much less sensitivity to the exact choice of $\lambda$.
Limitations: it is computationally expensive ($B$ full Lasso fits), the threshold $\pi_{\text{thr}}$ is an additional tuning choice, and weak or highly correlated signals can still fall below the threshold.
The 'stability path' plot shows selection probability vs. regularization strength for each feature. Stable features maintain high selection probability across λ values. Unstable features show erratic selection patterns. This plot is diagnostic gold for understanding selection reliability.
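One way to compute a stability path by hand is sketched below (the regularization grid, subsampling fraction, and synthetic data are illustrative assumptions); each row of the result gives the selection frequency of every feature at one value of the penalty.

```python
import numpy as np
from sklearn.linear_model import Lasso

def stability_path(X, y, alphas, n_bootstrap=50, sample_fraction=0.5, seed=0):
    """Selection frequency of each feature at each regularization level."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    m = int(n * sample_fraction)
    freq = np.zeros((len(alphas), p))
    for b in range(n_bootstrap):
        idx = rng.choice(n, size=m, replace=False)
        for i, a in enumerate(alphas):
            coef = Lasso(alpha=a, max_iter=5000).fit(X[idx], y[idx]).coef_
            freq[i] += np.abs(coef) > 1e-10
    return freq / n_bootstrap   # rows: alphas, columns: features

# Small synthetic example; plot freq[:, j] against alphas to visualize each feature's path
rng = np.random.default_rng(4)
X = rng.standard_normal((150, 20))
beta = np.zeros(20); beta[:2] = [2.0, -1.5]
y = X @ beta + 0.5 * rng.standard_normal(150)
path = stability_path(X, y, alphas=np.logspace(-2, 0, 10))
print(path.shape)   # (10, 20); stable features stay near 1 across a wide range of alphas
```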
A subtle but critical issue arises when using Lasso for scientific inference: the same data used for selection cannot naively be used for inference.
The Problem:
Suppose Lasso selects features $\hat{S} = \{1, 5, 12\}$. We want to report coefficient estimates, confidence intervals, and p-values for these selected features.
Naive approach: Fit OLS on selected features, use standard inference.
Why this fails: The selection event $\hat{S} = \{1, 5, 12\}$ was chosen because features 1, 5, 12 had the strongest apparent associations. Standard inference assumes the model is fixed before seeing data. Selecting based on data and then testing on the same data inflates significance.
The Selection Bias:
Consider: we select feature 1 because $|\hat{\beta}_1^{\text{lasso}}| > 0$. This already tells us the data supports a non-zero coefficient. Testing $H_0: \beta_1 = 0$ after this selection is circular: the act of selection biases the test toward rejection far beyond its nominal level.
Simulation Evidence:
```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LassoCV

def demonstrate_selection_bias(n_sims=1000):
    """
    Show how naive post-selection inference is biased.

    True model: y = noise (no true signals)
    Yet naive p-values after Lasso selection are often significant!
    """
    np.random.seed(42)
    n, p = 100, 50
    naive_pvalues = []

    for sim in range(n_sims):
        # Generate null data (no true signal)
        X = np.random.randn(n, p)
        y = np.random.randn(n)  # Pure noise!

        # Lasso selection
        lasso = LassoCV(cv=5).fit(X, y)
        selected = np.where(np.abs(lasso.coef_) > 1e-10)[0]

        if len(selected) == 0:
            continue

        # Naive OLS on selected features
        X_sel = X[:, selected]
        beta_ols = np.linalg.lstsq(X_sel, y, rcond=None)[0]

        # Naive standard errors
        residuals = y - X_sel @ beta_ols
        sigma_hat = np.sqrt(np.sum(residuals**2) / (n - len(selected)))
        var_beta = sigma_hat**2 * np.diag(np.linalg.inv(X_sel.T @ X_sel))
        se_beta = np.sqrt(var_beta)

        # Naive t-test for first selected feature
        t_stat = beta_ols[0] / se_beta[0]
        pvalue = 2 * (1 - stats.t.cdf(abs(t_stat), df=n - len(selected)))
        naive_pvalues.append(pvalue)

    # Under the null, p-values should be uniform
    # But selection bias makes small p-values too common!
    naive_pvalues = np.array(naive_pvalues)

    print("Under the NULL (no true signals):")
    print(f"  Simulations with selection: {len(naive_pvalues)}")
    print(f"  P-values < 0.05: {np.mean(naive_pvalues < 0.05):.1%} (should be ~5%)")
    print(f"  P-values < 0.01: {np.mean(naive_pvalues < 0.01):.1%} (should be ~1%)")
    print(f"  Mean p-value: {np.mean(naive_pvalues):.3f} (should be ~0.5)")

    return naive_pvalues

pvals = demonstrate_selection_bias()
print("\nConclusion: Naive inference after selection is SEVERELY biased!")
```

Solutions for Valid Post-Selection Inference:
1. Data Splitting: Use one half of the data for Lasso selection and the other, untouched half for standard OLS inference on the selected features (see the sketch after this list).
2. Selective Inference (Conditional Inference): Compute p-values and intervals that condition on the selection event itself, as in the exact post-selection inference framework of Lee et al. (2016).
3. Debiased Lasso: Apply a one-step correction to the Lasso estimate so that each coefficient is asymptotically normal, enabling standard confidence intervals even in high dimensions.
4. Bootstrap Methods: Resample the data and rerun the entire selection-plus-estimation pipeline to assess the variability of both the selected set and the coefficients (use with care, since naive bootstraps can also be biased).
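A minimal sketch of option 1 (data splitting): because the inference half never influences selection, ordinary t-based p-values computed on it are valid for the selected model. The function and variable names below are our own, and the synthetic data are assumptions for illustration.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LassoCV

def split_sample_inference(X, y, seed=0):
    """Select features on one half of the data, do standard OLS inference on the other half."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.permutation(n)
    half = n // 2
    sel_idx, inf_idx = idx[:half], idx[half:]

    # Stage 1: selection on the first half only
    selected = np.where(np.abs(LassoCV(cv=5).fit(X[sel_idx], y[sel_idx]).coef_) > 1e-10)[0]
    if len(selected) == 0:
        return selected, np.array([])

    # Stage 2: ordinary OLS inference on the untouched second half
    X_inf = np.column_stack([np.ones(len(inf_idx)), X[inf_idx][:, selected]])
    y_inf = y[inf_idx]
    beta, *_ = np.linalg.lstsq(X_inf, y_inf, rcond=None)
    resid = y_inf - X_inf @ beta
    dof = len(y_inf) - X_inf.shape[1]
    sigma2 = resid @ resid / dof
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X_inf.T @ X_inf)))
    t = beta / se
    pvals = 2 * (1 - stats.t.cdf(np.abs(t), df=dof))
    return selected, pvals[1:]   # drop the intercept's p-value

# Quick illustration on synthetic data with two real signals
rng = np.random.default_rng(1)
X = rng.standard_normal((300, 50))
y = 2.0 * X[:, 0] - 1.5 * X[:, 10] + 0.5 * rng.standard_normal(300)
sel, pv = split_sample_inference(X, y)
print("Selected:", sel)
print("Split-sample p-values:", np.round(pv, 4))
```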
Using the same data for selection and inference is 'double dipping.' Papers reporting significant p-values for features selected by Lasso on the same data are often invalid. Always split data or use valid post-selection inference methods if making scientific claims.
Lasso is one of many feature selection approaches. Understanding alternatives helps choose the right tool.
The Feature Selection Landscape:
| Method | Type | Pros | Cons |
|---|---|---|---|
| Lasso | Embedded (in model) | Efficient, interpretable coefficients | Unstable with correlations |
| Elastic Net | Embedded | Handles correlations better | Less sparse than Lasso |
| Forward Stepwise | Wrapper | Well-understood, fast | Greedy, can't recover from errors |
| Best Subset | Wrapper | Optimal if computable | NP-hard for large p |
| Random Forest Importance | Embedded/Filter | Captures nonlinearities | Not for coefficient interpretation |
| Univariate Filtering | Filter | Very fast, simple | Ignores feature interactions |
| SCAD/MCP | Embedded | Less biased than Lasso | Non-convex, harder to optimize |
When to Prefer Lasso: when $p$ is large relative to $n$, the true model is plausibly sparse, features are not strongly correlated, and you want a fast, interpretable linear model or are primarily focused on prediction.
When to Consider Alternatives: when features form correlated groups (Elastic Net or Group Lasso), when relationships are nonlinear (tree-based importance measures), when you need nearly unbiased estimates of large coefficients (SCAD/MCP), or when $p$ is small enough that exhaustive search is feasible (best subset).
Non-Convex Penalties (SCAD, MCP):
SCAD (Smoothly Clipped Absolute Deviation) and MCP (Minimax Concave Penalty) reduce Lasso's bias on large coefficients:
$$\text{Lasso: } |\beta|$$
$$\text{SCAD: } \begin{cases} |\beta| & |\beta| \leq \lambda \\ \text{(quadratic transition)} & \lambda < |\beta| < a\lambda \\ \frac{(a+1)\lambda}{2} & |\beta| \geq a\lambda \end{cases}$$
SCAD/MCP still produce sparsity but don't shrink large coefficients. The cost: non-convex optimization (local minima possible).
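To make the piecewise form concrete, here is the SCAD penalty in its standard parameterization $p_\lambda(\beta)$ (the display above corresponds to $p_\lambda(\beta)/\lambda$), using the conventional default $a = 3.7$. This only evaluates the penalty; fitting a SCAD-penalized model requires a non-convex solver and is not shown here.

```python
import numpy as np

def scad_penalty(beta, lam, a=3.7):
    """Standard SCAD penalty p_lambda(beta); a = 3.7 is the conventional default."""
    b = np.atleast_1d(np.abs(np.asarray(beta, dtype=float)))
    small = b <= lam
    mid = (b > lam) & (b <= a * lam)
    large = b > a * lam
    out = np.empty_like(b)
    out[small] = lam * b[small]                                          # L1 region
    out[mid] = (2 * a * lam * b[mid] - b[mid] ** 2 - lam**2) / (2 * (a - 1))  # quadratic transition
    out[large] = lam**2 * (a + 1) / 2                                    # constant: no extra shrinkage
    return out

lam = 1.0
betas = np.array([0.5, 2.0, 10.0])
print("L1 penalty:  ", lam * np.abs(betas))         # keeps growing linearly
print("SCAD penalty:", scad_penalty(betas, lam))     # flattens out for large |beta|
```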
Consider combining methods: (1) Use univariate filtering to remove obviously irrelevant features, (2) Apply Lasso to the reduced set, (3) Use stability selection for robustness, (4) Apply debiased Lasso for inference. Each step addresses different challenges.
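A sketch of the first two stages of such a pipeline (the screening size `k=100` and the simulated data are arbitrary assumptions; stability selection and debiased inference would be layered on afterwards):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(5)
n, p = 300, 2000
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[[0, 10, 20]] = [2.0, -1.5, 1.0]
y = X @ beta + 0.5 * rng.standard_normal(n)

# Stage 1: univariate screening removes clearly irrelevant features (k=100 is an arbitrary choice)
screen = SelectKBest(f_regression, k=100).fit(X, y)
kept = screen.get_support(indices=True)

# Stage 2: Lasso on the reduced feature set
lasso = LassoCV(cv=5).fit(X[:, kept], y)
selected = kept[np.abs(lasso.coef_) > 1e-10]
print("Selected (original indices):", selected)
# Stages 3-4 (stability selection, post-selection inference) would follow on top of this.
```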
Based on theory and experience, here are practical recommendations for using Lasso for feature selection.
Step-by-Step Workflow: (1) standardize the features; (2) fit Lasso over a grid of $\lambda$ values and inspect the regularization path; (3) choose $\lambda$ by cross-validation, remembering that the prediction-optimal $\lambda$ may over-select; (4) run stability selection to see which features survive subsampling; (5) if you need p-values or confidence intervals, use data splitting or another valid post-selection method; (6) validate the final model on held-out data.
What to Report:
For scientific publications using Lasso selection, report how $\lambda$ was chosen, whether features were standardized, the selection probabilities from stability selection (not just the single selected set), and whether any reported p-values account for the selection step.
Common Mistakes to Avoid: trusting a single Lasso fit when features are correlated, treating the prediction-optimal $\lambda$ as selection-optimal, reporting naive p-values computed on the same data used for selection, and interpreting an excluded feature as evidence of no effect.
Feature selection is inherently uncertain. No method—Lasso included—magically identifies 'the true model.' Present results as 'features selected by Lasso under these conditions' rather than 'the features that matter.' Scientific humility protects against over-interpretation.
We've explored Lasso's role as an automatic feature selection method. Here are the key insights: prediction accuracy does not guarantee correct selection; exact support recovery requires strong, practically unverifiable conditions; correlated features make single-run selection unstable; stability selection restores reliability at extra computational cost; and valid inference after selection requires data splitting or dedicated post-selection methods.
The Big Picture:
Lasso's automatic feature selection is powerful but imperfect. It works best when the true model is genuinely sparse, the non-zero coefficients are large enough to detect, and the relevant features are not strongly correlated with each other or with irrelevant ones.
Used thoughtfully, Lasso is an invaluable tool for dimensionality reduction and generating hypotheses. Used naively, it can mislead. The difference is understanding its assumptions and limitations—which you now do.
Module Complete:
You've now mastered Lasso regression comprehensively: the L1 formulation, sparsity mechanics, geometric interpretation, optimization algorithms, and feature selection aspects. This knowledge equips you to apply Lasso effectively and understand its behavior in high-dimensional machine learning problems.
Congratulations! You have completed the comprehensive study of Lasso Regression (L1 Regularization). You understand the mathematical formulation, why L1 produces sparsity, the geometric interpretation, solution algorithms, and the crucial feature selection aspects including stability and post-selection inference.