A fundamental question haunts every feature selection study: Are these selected features truly important, or just lucky winners in this particular dataset?
This question strikes at the heart of reproducibility. If you apply LASSO to your data and select 15 features, what happens if you collect new data? Would the same 15 features emerge, or would a completely different set appear?
Stability refers to the consistency of feature selection results across different samples from the same underlying distribution. An unstable selection method is concerning because its output reflects sampling accidents as much as genuine signal, undermining both interpretation and replication.
Feature selection instability contributes to the replication crisis in data science. A study finds that gene X predicts disease Y. Another lab tries to replicate and finds gene Z instead. Both used valid methods—but unstable feature selection meant their results depended heavily on which patients happened to be in each sample.
1. Small Sample Sizes: With limited data, sampling variation is high. Different samples lead to different feature rankings.
2. Correlated Features: When multiple features carry similar information, selection methods may arbitrarily choose among them.
3. Borderline Features: Features near the selection threshold can flip in or out based on small perturbations.
4. High Dimensionality: With many features, chance correlations become more likely, leading to false positives.
5. Data-Dependent Thresholds: Using the same data to choose selection thresholds and evaluate selected features creates dependencies.
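The correlated-features failure mode (cause 2) is easy to demonstrate. The following minimal sketch, on synthetic data with an illustrative regularization strength, builds two near-duplicate features and checks which of the two LASSO favors on each subsample; the winner can flip from subsample to subsample:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 60
z = rng.normal(size=n)
x1 = z + 0.05 * rng.normal(size=n)   # x1 and x2 are near-duplicates of z
x2 = z + 0.05 * rng.normal(size=n)
noise = rng.normal(size=(n, 8))      # irrelevant features
X = np.column_stack([x1, x2, noise])
y = z + 0.3 * rng.normal(size=n)

winners = []
for seed in range(20):
    # Random half-sample, as a stability analysis would draw
    idx = np.random.default_rng(seed).choice(n, n // 2, replace=False)
    model = Lasso(alpha=0.1).fit(X[idx], y[idx])
    # Record which of the two correlated twins got the larger coefficient
    winners.append(int(np.argmax(np.abs(model.coef_[:2]))))

print(f"Feature 0 wins in {winners.count(0)}/20 subsamples, "
      f"feature 1 in {winners.count(1)}/20")
```

When both twins carry the same information, nothing in the data forces a consistent choice between them; the selection frequency across subsamples makes that arbitrariness visible.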
Before we can improve stability, we need ways to measure it. Several metrics quantify how consistent feature selection is across multiple runs or data splits.
For two feature sets $S_1$ and $S_2$:
$$J(S_1, S_2) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|}$$
Ranges from 0 (no overlap) to 1 (identical sets). Simple, but it does not correct for the overlap expected by chance given the set sizes.
A more sophisticated measure that corrects for chance agreement:
$$\text{KI}(S_1, S_2) = \frac{|S_1 \cap S_2| - \frac{k^2}{p}}{k - \frac{k^2}{p}}$$
where $k = |S_1| = |S_2|$ (the index assumes equal-sized sets) and $p$ is the total number of features. This index ranges from $-1$ to $1$, with $0$ indicating the overlap expected by chance alone.
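Plugging in small numbers makes the chance correction concrete. Suppose two runs each select $k = 5$ of $p = 20$ features and agree on 3 of them (values chosen purely for illustration):

```python
k, p, overlap = 5, 20, 3

expected = k * k / p                   # overlap expected by chance: 1.25
kuncheva = (overlap - expected) / (k - expected)
jaccard = overlap / (2 * k - overlap)  # |union| = 5 + 5 - 3 = 7

print(f"Jaccard:  {jaccard:.3f}")      # 0.429
print(f"Kuncheva: {kuncheva:.3f}")     # 0.467
```

The Kuncheva value comes out higher than Jaccard here because agreeing on 3 of 5 features is well above the 1.25 overlap expected by chance.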
For $M$ subsamples, compute pairwise stability and average:
$$\bar{S} = \frac{2}{M(M-1)} \sum_{i < j} \text{similarity}(S_i, S_j)$$
```python
import numpy as np
from itertools import combinations
from typing import List, Set

def jaccard_similarity(s1: Set[int], s2: Set[int]) -> float:
    """Jaccard similarity between two feature sets."""
    if len(s1) == 0 and len(s2) == 0:
        return 1.0
    intersection = len(s1 & s2)
    union = len(s1 | s2)
    return intersection / union if union > 0 else 0.0

def kuncheva_index(s1: Set[int], s2: Set[int], p: int) -> float:
    """
    Kuncheva's stability index, corrected for chance.

    Parameters
    ----------
    s1, s2 : Sets of selected feature indices
    p : Total number of features

    Returns
    -------
    Stability index in range [-1, 1], with 0 being random
    """
    k = len(s1)
    if k != len(s2):
        raise ValueError("Sets must have equal size for Kuncheva index")
    if k == 0 or k == p:
        return 0.0
    observed = len(s1 & s2)
    expected = k * k / p
    denominator = k - expected
    if denominator == 0:
        return 0.0
    return (observed - expected) / denominator

def average_stability(selections: List[Set[int]], p: int,
                      metric: str = 'kuncheva') -> float:
    """
    Compute average stability across multiple feature selections.

    Parameters
    ----------
    selections : List of feature sets from different runs
    p : Total number of features
    metric : 'jaccard' or 'kuncheva'

    Returns
    -------
    Average pairwise stability
    """
    n_selections = len(selections)
    if n_selections < 2:
        return 1.0
    similarities = []
    for s1, s2 in combinations(selections, 2):
        if metric == 'jaccard':
            sim = jaccard_similarity(s1, s2)
        elif metric == 'kuncheva':
            # For Kuncheva, ensure equal sizes
            min_size = min(len(s1), len(s2))
            s1_trimmed = set(list(s1)[:min_size])
            s2_trimmed = set(list(s2)[:min_size])
            sim = kuncheva_index(s1_trimmed, s2_trimmed, p)
        else:
            raise ValueError(f"Unknown metric: {metric}")
        similarities.append(sim)
    return float(np.mean(similarities))

# Example: Measure LASSO stability
from sklearn.linear_model import LassoCV
from sklearn.datasets import load_breast_cancer
from sklearn.utils import resample

data = load_breast_cancer()
X, y = data.data, data.target
n_features = X.shape[1]

# Run feature selection on multiple bootstrap samples
n_bootstrap = 20
selections = []

for i in range(n_bootstrap):
    X_boot, y_boot = resample(X, y, random_state=i)
    lasso = LassoCV(cv=5, random_state=42)
    lasso.fit(X_boot, y_boot)
    # Get selected features (non-zero coefficients)
    selected = set(np.where(np.abs(lasso.coef_) > 1e-10)[0])
    selections.append(selected)

print(f"Selection sizes: {[len(s) for s in selections]}")
print(f"Jaccard stability: {average_stability(selections, n_features, 'jaccard'):.4f}")
print("Feature selection frequency:")
freq = np.zeros(n_features)
for s in selections:
    for f in s:
        freq[f] += 1
freq /= n_bootstrap

for i in np.argsort(freq)[::-1][:10]:
    print(f"  {data.feature_names[i]}: {freq[i]:.0%}")
```

Stability Selection (Meinshausen & Bühlmann, 2010) is a principled framework that transforms any feature selection method into a stable one with finite-sample error control.
Instead of running feature selection once, stability selection runs the base method on many random subsamples of the data (typically half-samples) and records, for each feature, the fraction of runs in which it was selected. Features whose selection frequency exceeds a threshold $\pi_{thr}$ make the final cut.
The key insight: features that appear consistently across random subsamples are more likely to be truly important than those appearing sporadically.
Stability selection provides upper bounds on false positives:
$$\mathbb{E}[V] \leq \frac{q^2}{(2\pi_{thr} - 1)p}$$
where $V$ is the number of falsely selected features, $q$ is the average number of features selected per subsample, $p$ is the total number of features, and $\pi_{thr}$ is the selection-frequency threshold.
This bound holds under mild assumptions and allows practitioners to control expected false discoveries by choosing appropriate thresholds.
Subsampling introduces controlled randomness that 'tests' whether features are robustly important. True signal features will be selected regardless of which half of the data we use. Noise features will sometimes pass, sometimes fail—their selection frequency reveals their unreliability.
```python
import numpy as np
from sklearn.linear_model import LassoCV, LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
from typing import Tuple, List
import matplotlib.pyplot as plt

def stability_selection(X: np.ndarray, y: np.ndarray, base_selector,
                        n_bootstrap: int = 100,
                        sample_fraction: float = 0.5,
                        threshold: float = 0.6,
                        random_state: int = 42) -> Tuple[np.ndarray, np.ndarray]:
    """
    Stability Selection framework.

    Parameters
    ----------
    X : Feature matrix (n_samples, n_features)
    y : Target vector
    base_selector : Selector exposing coef_ or feature_importances_ after fit
    n_bootstrap : Number of subsampling iterations
    sample_fraction : Fraction of samples to use per iteration
    threshold : Selection frequency threshold for final inclusion
    random_state : Random seed for reproducibility

    Returns
    -------
    selected_features : Boolean mask of selected features
    selection_frequencies : Frequency of selection for each feature
    """
    np.random.seed(random_state)
    n_samples, n_features = X.shape
    subsample_size = int(n_samples * sample_fraction)

    # Track selection counts
    selection_counts = np.zeros(n_features)

    for b in range(n_bootstrap):
        # Random subsample without replacement
        indices = np.random.choice(n_samples, subsample_size, replace=False)
        X_sub = X[indices]
        y_sub = y[indices]

        # Fit a fresh clone of the selector
        selector = base_selector.__class__(**base_selector.get_params())
        selector.fit(X_sub, y_sub)

        # Get selected features
        if hasattr(selector, 'coef_'):
            coef = selector.coef_.ravel()
            selected = np.abs(coef) > 1e-10
        elif hasattr(selector, 'feature_importances_'):
            imp = selector.feature_importances_
            # Select top 20% by importance
            threshold_val = np.percentile(imp, 80)
            selected = imp > threshold_val
        else:
            raise ValueError("Selector must have coef_ or feature_importances_")

        selection_counts += selected

    # Compute frequencies
    selection_frequencies = selection_counts / n_bootstrap

    # Apply threshold
    selected_features = selection_frequencies >= threshold

    return selected_features, selection_frequencies

# Example usage
from sklearn.datasets import make_classification

# Create data with known important features
np.random.seed(42)
n_samples = 300
n_informative = 10
n_redundant = 5
n_noise = 85

X, y = make_classification(
    n_samples=n_samples, n_features=100,
    n_informative=n_informative, n_redundant=n_redundant,
    n_clusters_per_class=2, flip_y=0.03,
    shuffle=False,  # keep informative features first so their indices are known
    random_state=42)

# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Base selector
base_selector = LogisticRegressionCV(
    penalty='l1', solver='saga', cv=5, max_iter=1000, random_state=42)
base_selector.fit(X_scaled, y)

# Apply stability selection
selected, frequencies = stability_selection(
    X_scaled, y, base_selector, n_bootstrap=100, threshold=0.7)

print(f"Features selected: {np.sum(selected)} / {X.shape[1]}")
print(f"True informative features: first {n_informative}")
print("Top 15 features by selection frequency:")
top_idx = np.argsort(frequencies)[::-1][:15]
for idx in top_idx:
    is_informative = "✓ informative" if idx < n_informative else ""
    is_redundant = ("~ redundant"
                    if n_informative <= idx < n_informative + n_redundant else "")
    print(f"  Feature {idx:3d}: {frequencies[idx]:.2%} {is_informative}{is_redundant}")

# Plot selection frequency
plt.figure(figsize=(12, 5))
colors = ['green' if i < n_informative
          else 'orange' if i < n_informative + n_redundant
          else 'red' for i in range(100)]
plt.bar(range(100), frequencies, color=colors, alpha=0.7)
plt.axhline(y=0.7, color='black', linestyle='--', label='Threshold (0.7)')
plt.xlabel('Feature Index')
plt.ylabel('Selection Frequency')
plt.title('Stability Selection Frequencies '
          '(green = informative, orange = redundant, red = noise)')
plt.legend()
plt.tight_layout()
plt.show()
```

One of stability selection's most valuable properties is its finite-sample error bound. Unlike many heuristic methods, stability selection provides mathematical guarantees about false positive control.
Under mild conditions (an exchangeability assumption on the irrelevant features, plus the requirement that the base method performs no worse than random guessing):
$$\mathbb{E}[V] \leq \frac{q^2}{(2\pi_{thr} - 1)p}$$
where $V$ is the number of falsely selected variables.
Given a desired error bound $E_{max}$, we can solve for the required threshold:
$$\pi_{thr} \geq \frac{1}{2} + \frac{q^2}{2E_{max} \cdot p}$$
Example: With $p = 1000$ features, $q = 20$ average selected per run, and desired $E[V] \leq 1$ false positive:
$$\pi_{thr} \geq \frac{1}{2} + \frac{400}{2000} = 0.7$$
A threshold of 0.7 (70% selection frequency) controls expected false positives to ≤ 1.
| Desired E[False Positives] | Features (p) | Avg Selected (q) | Required Threshold |
|---|---|---|---|
| ≤ 1 | 100 | 10 | 1.00 |
| ≤ 1 | 1,000 | 20 | 0.70 |
| ≤ 1 | 10,000 | 50 | 0.625 |
| ≤ 5 | 1,000 | 30 | 0.59 |
| ≤ 0.5 | 500 | 15 | 0.95 |
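The threshold formula is easy to check numerically. The sketch below is a direct transcription of the inequality (not library code) and recomputes the worked example along with several table rows:

```python
def required_threshold(p: int, q: float, e_max: float) -> float:
    """Threshold needed so that E[V] <= e_max under the stability
    selection bound E[V] <= q^2 / ((2*pi_thr - 1) * p)."""
    return 0.5 + q * q / (2 * e_max * p)

print(f"{required_threshold(1000, 20, 1):.3f}")    # 0.700
print(f"{required_threshold(10000, 50, 1):.3f}")   # 0.625
print(f"{required_threshold(1000, 30, 5):.3f}")    # 0.590
print(f"{required_threshold(500, 15, 0.5):.3f}")   # 0.950
```

The function also makes the trade-offs explicit: larger $q$ (less sparse base selection) or a stricter error target pushes the required threshold toward 1.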
The error bound requires the 'exchangeability' assumption: the base selector doesn't systematically favor certain irrelevant features over others. This is typically satisfied by randomized methods (LASSO with random subsampling) but may be violated if the base method has intrinsic bias. Always consider whether your setting meets the assumptions.
Two related approaches strengthen stability selection's theoretical properties and practical performance.
Randomized LASSO adds an extra layer of randomization by rescaling features differently in each subsample:
$$\min_w \frac{1}{2n}\|y - Xw\|_2^2 + \lambda \sum_j \frac{|w_j|}{W_j}$$
where $W_j$ is drawn randomly from $[\alpha, 1]$ for each feature $j$ in each subsample.
This makes selection more exploratory: within a group of correlated features, the random reweighting gives each member a chance to be selected, so selection frequency spreads across the group instead of concentrating arbitrarily on one member.
Bolasso (Bach, 2008) takes a hard intersection approach: run LASSO on many bootstrap samples and keep only the features selected in every single run.
This is equivalent to stability selection with threshold $\pi_{thr} = 1$ (100% frequency).
Properties: Bolasso is extremely conservative; false positives are rare, but a weak true signal that misses even one bootstrap run is discarded.
```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

def randomized_lasso_stability(X: np.ndarray, y: np.ndarray,
                               n_bootstrap: int = 100,
                               alpha: float = 0.01,
                               weakness: float = 0.5,
                               sample_fraction: float = 0.5,
                               threshold: float = 0.6,
                               random_state: int = 42) -> tuple:
    """
    Randomized LASSO for stability selection.

    Parameters
    ----------
    X : Feature matrix
    y : Target vector
    n_bootstrap : Number of iterations
    alpha : Base LASSO regularization strength
    weakness : Lower bound for random feature scaling [weakness, 1]
    sample_fraction : Fraction of samples per subsample
    threshold : Selection frequency threshold

    Returns
    -------
    selected : Boolean mask of selected features
    frequencies : Selection frequency per feature
    """
    np.random.seed(random_state)
    n_samples, n_features = X.shape
    subsample_size = int(n_samples * sample_fraction)
    selection_counts = np.zeros(n_features)

    for b in range(n_bootstrap):
        # Subsample
        indices = np.random.choice(n_samples, subsample_size, replace=False)
        X_sub = X[indices]
        y_sub = y[indices]

        # Random feature rescaling
        scales = np.random.uniform(weakness, 1.0, n_features)
        X_scaled = X_sub * scales

        # Fit LASSO with randomized features
        lasso = Lasso(alpha=alpha, max_iter=10000)
        lasso.fit(X_scaled, y_sub)

        # Selected features (accounting for rescaling)
        selected = np.abs(lasso.coef_) > 1e-10
        selection_counts += selected

    frequencies = selection_counts / n_bootstrap
    selected = frequencies >= threshold
    return selected, frequencies

def bolasso(X: np.ndarray, y: np.ndarray,
            n_bootstrap: int = 100,
            alpha: float = 0.01,
            random_state: int = 42) -> tuple:
    """
    Bolasso: Bootstrap LASSO with intersection.

    Features must be selected in ALL bootstrap samples.
    """
    np.random.seed(random_state)
    n_samples, n_features = X.shape

    # Track which features are selected in ALL runs
    all_selected = np.ones(n_features, dtype=bool)
    selection_counts = np.zeros(n_features)

    for b in range(n_bootstrap):
        # Bootstrap sample
        indices = np.random.choice(n_samples, n_samples, replace=True)
        X_boot = X[indices]
        y_boot = y[indices]

        # Fit LASSO
        lasso = Lasso(alpha=alpha, max_iter=10000)
        lasso.fit(X_boot, y_boot)

        selected = np.abs(lasso.coef_) > 1e-10
        selection_counts += selected
        all_selected &= selected  # Intersection

    frequencies = selection_counts / n_bootstrap
    return all_selected, frequencies

# Compare methods
from sklearn.datasets import make_regression

X, y, true_coef = make_regression(
    n_samples=200, n_features=50, n_informative=5,
    noise=10, coef=True, random_state=42)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# True important features
true_support = np.abs(true_coef) > 0
print(f"True support: {np.where(true_support)[0]}")

# Standard LASSO
from sklearn.linear_model import LassoCV
lasso = LassoCV(cv=5)
lasso.fit(X_scaled, y)
lasso_selected = np.abs(lasso.coef_) > 1e-10
print(f"Standard LASSO selected: {np.where(lasso_selected)[0]}")

# Randomized LASSO stability selection
rand_selected, rand_freq = randomized_lasso_stability(
    X_scaled, y, n_bootstrap=100, alpha=0.05, threshold=0.7)
print(f"Randomized LASSO selected: {np.where(rand_selected)[0]}")

# Bolasso
bola_selected, bola_freq = bolasso(X_scaled, y, n_bootstrap=100, alpha=0.05)
print(f"Bolasso selected: {np.where(bola_selected)[0]}")

# Compare true positives and false positives
print("Method comparison:")
for name, selected in [("LASSO", lasso_selected),
                       ("Rand.LASSO", rand_selected),
                       ("Bolasso", bola_selected)]:
    tp = np.sum(selected & true_support)
    fp = np.sum(selected & ~true_support)
    fn = np.sum(~selected & true_support)
    print(f"  {name}: TP={tp}, FP={fp}, FN={fn}")
```

Complementary Pairs Stability Selection (CPSS) enhances the original framework with an elegant twist: instead of random subsamples, it uses complementary pairs of disjoint subsets.
For each iteration: randomly split the samples into two disjoint halves, run the base selector on each half, and record which features are selected in both halves simultaneously.
Then average this paired-selection count across many random splits.
Using complementary pairs guarantees the two subsets share no samples, so agreement between them cannot be driven by shared observations, and it supports a tighter error analysis.
The improved bound becomes:
$$\mathbb{E}[V] \leq \frac{q^2}{(2\pi_{thr} - 1)^2 p}$$
Note the squared denominator: for the paired-selection frequencies used by CPSS, this yields tighter false positive control (see Shah & Samworth, 2013, for the precise statements).
```python
import numpy as np
from sklearn.linear_model import LassoCV, LogisticRegressionCV

def complementary_pairs_stability_selection(
        X: np.ndarray, y: np.ndarray,
        base_selector_class, selector_params: dict,
        n_pairs: int = 50, threshold: float = 0.6,
        random_state: int = 42) -> tuple:
    """
    Complementary Pairs Stability Selection (CPSS).

    Uses pairs of disjoint subsets for tighter error control.

    Parameters
    ----------
    X : Feature matrix
    y : Target vector
    base_selector_class : Selector class (e.g., LassoCV)
    selector_params : Parameters for selector
    n_pairs : Number of complementary pairs to evaluate
    threshold : Selection frequency threshold

    Returns
    -------
    selected : Boolean mask of selected features
    frequencies : Selection frequency per feature
    """
    np.random.seed(random_state)
    n_samples, n_features = X.shape
    half = n_samples // 2

    # Count features selected in BOTH halves of a pair
    paired_selection_counts = np.zeros(n_features)

    for p in range(n_pairs):
        # Create random complementary pair
        perm = np.random.permutation(n_samples)
        first_half = perm[:half]
        second_half = perm[half:2 * half]  # Ensure equal sizes

        # Fit on first half
        selector1 = base_selector_class(**selector_params)
        selector1.fit(X[first_half], y[first_half])
        selected1 = np.abs(selector1.coef_.ravel()) > 1e-10

        # Fit on second (complementary) half
        selector2 = base_selector_class(**selector_params)
        selector2.fit(X[second_half], y[second_half])
        selected2 = np.abs(selector2.coef_.ravel()) > 1e-10

        # Feature counts only if selected in BOTH halves
        paired_selection = selected1 & selected2
        paired_selection_counts += paired_selection

    frequencies = paired_selection_counts / n_pairs
    selected = frequencies >= threshold

    return selected, frequencies

# Example
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=400, n_features=100, n_informative=10,
    n_redundant=5, random_state=42)
X = StandardScaler().fit_transform(X)

# Standard stability selection for comparison
def run_standard_stability(X, y, n_iter=100):
    n_samples, n_features = X.shape
    half = n_samples // 2
    counts = np.zeros(n_features)
    for i in range(n_iter):
        idx = np.random.choice(n_samples, half, replace=False)
        selector = LogisticRegressionCV(penalty='l1', solver='saga',
                                        cv=3, max_iter=500)
        selector.fit(X[idx], y[idx])
        counts += np.abs(selector.coef_.ravel()) > 1e-10
    return counts / n_iter

# Run both methods
standard_freq = run_standard_stability(X, y, n_iter=100)
cpss_selected, cpss_freq = complementary_pairs_stability_selection(
    X, y, LogisticRegressionCV,
    {'penalty': 'l1', 'solver': 'saga', 'cv': 3, 'max_iter': 500},
    n_pairs=50,
    threshold=0.5)  # Can use a lower threshold due to the tighter bound

print("Comparison: Standard vs CPSS")
print(f"Standard stability - features with freq > 0.7: {np.sum(standard_freq > 0.7)}")
print(f"CPSS - features with freq > 0.5: {np.sum(cpss_selected)}")
# CPSS frequencies are typically lower but come with better error control
```

Implementing stability selection effectively requires attention to several practical considerations.
The base selector should be sparse (select a manageable subset), cheap enough to refit many times, and expose its selections via a `coef_` or `feature_importances_` attribute.
Good choices: LASSO or Elastic Net for regression, L1-penalized logistic regression for classification, and tree ensembles with an importance threshold.
Number of subsamples ($B$): more iterations give more precise frequency estimates, but returns diminish; 50-100 iterations are usually sufficient.
Sample fraction: 0.5 is the standard choice and the setting for which the theory is developed.
Threshold ($\pi_{thr}$): values between 0.6 and 0.9 are common; higher is more conservative, and the error bound lets you derive the threshold needed for a target false positive rate.
| Priority | Recommended Settings |
|---|---|
| False positive control (strict) | High threshold (0.8-0.9), CPSS, more iterations |
| Power (find weak signal) | Lower threshold (0.5-0.6), randomized LASSO |
| Computational efficiency | Fewer iterations (50), simple base selector |
| Handling correlations | Elastic Net as base, randomized feature scaling |
| Small sample size | Leave-one-out or CPSS, conservative threshold |
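Whichever settings you choose, it is worth checking how sharply the final selection depends on the threshold. A quick sketch with hypothetical frequencies (assumed values, not output from any run above):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical frequencies: 8 strong features, 92 weak/noise features
frequencies = np.concatenate([rng.uniform(0.7, 1.0, 8),
                              rng.uniform(0.0, 0.4, 92)])

for thr in (0.5, 0.6, 0.7, 0.8, 0.9):
    print(f"threshold {thr}: {np.sum(frequencies >= thr)} features selected")
```

A plateau in the selected-feature count across a range of thresholds (here between 0.5 and 0.7) is a good sign that the threshold choice is not driving the result; a count that changes rapidly with the threshold suggests many borderline features.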
scikit-learn once shipped RandomizedLasso and RandomizedLogisticRegression for randomized stability selection, but both were deprecated and later removed. The scikit-learn-contrib project `stability-selection` provides a standalone implementation. Alternatively, the implementation shown earlier gives you full control over the process.
| Method | Error Control | Power | Correlation Handling | Computation |
|---|---|---|---|---|
| Standard Stability Selection | Moderate | Good | Moderate | B × model fits |
| Randomized LASSO | Better | Better | Better | B × model fits |
| CPSS | Best | Lower | Good | 2B × model fits |
| Bolasso | Very strict | Lowest | Good | B × model fits |
| Ensemble (multiple base selectors) | Varies | Good | Good | Higher |
Standard Stability Selection: Good default choice, reasonable error control, well-understood properties.
Randomized LASSO: When features are correlated and you want to avoid arbitrary selection among correlated groups.
CPSS: When false positives are especially costly (medical diagnosis, high-stakes decisions).
Bolasso: Extremely conservative settings where you only want features with near-certain importance.
Ensemble Approaches: When you're unsure which base selector is best—combine multiple selectors' stability results.
Stability selection ensures our selected features are robust. But even with stable selection, we face another subtle danger: Feature Selection Bias—the tendency to overestimate the performance of models when feature selection uses the same data as model evaluation. Next, we'll explore this bias and learn how to avoid misleading ourselves.