A fundamental question haunts every feature selection study: Are these selected features truly important, or just lucky winners in this particular dataset?
This question strikes at the heart of reproducibility. If you apply LASSO to your data and select 15 features, what happens if you collect new data? Would the same 15 features emerge, or would a completely different set appear?
Stability refers to the consistency of feature selection results across different samples from the same underlying distribution. An unstable selection method is concerning because its output reflects sampling accidents as much as genuine signal, undermining both interpretation and replication.
Feature selection instability contributes to the replication crisis in data science. A study finds that gene X predicts disease Y. Another lab tries to replicate and finds gene Z instead. Both used valid methods—but unstable feature selection meant their results depended heavily on which patients happened to be in each sample.
1. Small Sample Sizes: With limited data, sampling variation is high. Different samples lead to different feature rankings.
2. Correlated Features: When multiple features carry similar information, selection methods may arbitrarily choose among them.
3. Borderline Features: Features near the selection threshold can flip in or out based on small perturbations.
4. High Dimensionality: With many features, chance correlations become more likely, leading to false positives.
5. Data-Dependent Thresholds: Using the same data to choose selection thresholds and evaluate selected features creates dependencies.
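The correlated-features failure mode (cause 2) is easy to demonstrate. The following minimal sketch, on synthetic data with an illustrative regularization strength, builds two near-duplicate features and checks which of the two LASSO favors on each subsample; the winner can flip from subsample to subsample:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 60
z = rng.normal(size=n)
x1 = z + 0.05 * rng.normal(size=n)   # x1 and x2 are near-duplicates of z
x2 = z + 0.05 * rng.normal(size=n)
noise = rng.normal(size=(n, 8))      # irrelevant features
X = np.column_stack([x1, x2, noise])
y = z + 0.3 * rng.normal(size=n)

winners = []
for seed in range(20):
    # Random half-sample, as a stability analysis would draw
    idx = np.random.default_rng(seed).choice(n, n // 2, replace=False)
    model = Lasso(alpha=0.1).fit(X[idx], y[idx])
    # Record which of the two correlated twins got the larger coefficient
    winners.append(int(np.argmax(np.abs(model.coef_[:2]))))

print(f"Feature 0 wins in {winners.count(0)}/20 subsamples, "
      f"feature 1 in {winners.count(1)}/20")
```

When both twins carry the same information, nothing in the data forces a consistent choice between them; the selection frequency across subsamples makes that arbitrariness visible.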
Before we can improve stability, we need ways to measure it. Several metrics quantify how consistent feature selection is across multiple runs or data splits.
For two feature sets $S_1$ and $S_2$:
$$J(S_1, S_2) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|}$$
Ranges from 0 (no overlap) to 1 (identical sets). Simple, but it does not correct for the overlap expected by chance given the set sizes.
A more sophisticated measure that corrects for chance agreement:
$$\text{KI}(S_1, S_2) = \frac{|S_1 \cap S_2| - \frac{k^2}{p}}{k - \frac{k^2}{p}}$$
where $k = |S_1| = |S_2|$ (the index assumes equal-sized sets) and $p$ is the total number of features. This index ranges from $-1$ to $1$, with $0$ indicating the overlap expected by chance alone.
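Plugging in small numbers makes the chance correction concrete. Suppose two runs each select $k = 5$ of $p = 20$ features and agree on 3 of them (values chosen purely for illustration):

```python
k, p, overlap = 5, 20, 3

expected = k * k / p                   # overlap expected by chance: 1.25
kuncheva = (overlap - expected) / (k - expected)
jaccard = overlap / (2 * k - overlap)  # |union| = 5 + 5 - 3 = 7

print(f"Jaccard:  {jaccard:.3f}")      # 0.429
print(f"Kuncheva: {kuncheva:.3f}")     # 0.467
```

The Kuncheva value comes out higher than Jaccard here because agreeing on 3 of 5 features is well above the 1.25 overlap expected by chance.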
For $M$ subsamples, compute pairwise stability and average:
$$\bar{S} = \frac{2}{M(M-1)} \sum_{i < j} \text{similarity}(S_i, S_j)$$
```python
import numpy as np
from itertools import combinations
from typing import List, Set

def jaccard_similarity(s1: Set[int], s2: Set[int]) -> float:
    """Jaccard similarity between two feature sets."""
    if len(s1) == 0 and len(s2) == 0:
        return 1.0
    intersection = len(s1 & s2)
    union = len(s1 | s2)
    return intersection / union if union > 0 else 0.0

def kuncheva_index(s1: Set[int], s2: Set[int], p: int) -> float:
    """
    Kuncheva's stability index, corrected for chance.

    Parameters
    ----------
    s1, s2 : Sets of selected feature indices
    p : Total number of features

    Returns
    -------
    Stability index in range [-1, 1], with 0 being random
    """
    k = len(s1)
    if k != len(s2):
        raise ValueError("Sets must have equal size for Kuncheva index")
    if k == 0 or k == p:
        return 0.0
    observed = len(s1 & s2)
    expected = k * k / p
    denominator = k - expected
    if denominator == 0:
        return 0.0
    return (observed - expected) / denominator

def average_stability(selections: List[Set[int]], p: int,
                      metric: str = 'kuncheva') -> float:
    """
    Compute average stability across multiple feature selections.

    Parameters
    ----------
    selections : List of feature sets from different runs
    p : Total number of features
    metric : 'jaccard' or 'kuncheva'

    Returns
    -------
    Average pairwise stability
    """
    n_selections = len(selections)
    if n_selections < 2:
        return 1.0
    similarities = []
    for s1, s2 in combinations(selections, 2):
        if metric == 'jaccard':
            sim = jaccard_similarity(s1, s2)
        elif metric == 'kuncheva':
            # For Kuncheva, ensure equal sizes
            min_size = min(len(s1), len(s2))
            s1_trimmed = set(list(s1)[:min_size])
            s2_trimmed = set(list(s2)[:min_size])
            sim = kuncheva_index(s1_trimmed, s2_trimmed, p)
        else:
            raise ValueError(f"Unknown metric: {metric}")
        similarities.append(sim)
    return float(np.mean(similarities))

# Example: Measure LASSO stability
from sklearn.linear_model import LassoCV
from sklearn.datasets import load_breast_cancer
from sklearn.utils import resample

data = load_breast_cancer()
X, y = data.data, data.target
n_features = X.shape[1]

# Run feature selection on multiple bootstrap samples
n_bootstrap = 20
selections = []

for i in range(n_bootstrap):
    X_boot, y_boot = resample(X, y, random_state=i)
    lasso = LassoCV(cv=5, random_state=42)
    lasso.fit(X_boot, y_boot)
    # Get selected features (non-zero coefficients)
    selected = set(np.where(np.abs(lasso.coef_) > 1e-10)[0])
    selections.append(selected)

print(f"Selection sizes: {[len(s) for s in selections]}")
print(f"Jaccard stability: {average_stability(selections, n_features, 'jaccard'):.4f}")
print("Feature selection frequency:")
freq = np.zeros(n_features)
for s in selections:
    for f in s:
        freq[f] += 1
freq /= n_bootstrap

for i in np.argsort(freq)[::-1][:10]:
    print(f"  {data.feature_names[i]}: {freq[i]:.0%}")
```

Stability Selection (Meinshausen & Bühlmann, 2010) is a principled framework that transforms any feature selection method into a stable one with finite-sample error control.
Instead of running feature selection once, stability selection runs the base method on many random subsamples of the data (typically half-samples) and records, for each feature, the fraction of runs in which it was selected. Features whose selection frequency exceeds a threshold $\pi_{thr}$ make the final cut.
The key insight: features that appear consistently across random subsamples are more likely to be truly important than those appearing sporadically.
Stability selection provides upper bounds on false positives:
$$\mathbb{E}[V] \leq \frac{q^2}{(2\pi_{thr} - 1)p}$$
where $V$ is the number of falsely selected features, $q$ is the average number of features selected per subsample, $p$ is the total number of features, and $\pi_{thr}$ is the selection-frequency threshold.
This bound holds under mild assumptions and allows practitioners to control expected false discoveries by choosing appropriate thresholds.
Subsampling introduces controlled randomness that 'tests' whether features are robustly important. True signal features will be selected regardless of which half of the data we use. Noise features will sometimes pass, sometimes fail—their selection frequency reveals their unreliability.
```python
import numpy as np
from sklearn.linear_model import LassoCV, LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
from typing import Tuple, List
import matplotlib.pyplot as plt

def stability_selection(X: np.ndarray, y: np.ndarray, base_selector,
                        n_bootstrap: int = 100,
                        sample_fraction: float = 0.5,
                        threshold: float = 0.6,
                        random_state: int = 42) -> Tuple[np.ndarray, np.ndarray]:
    """
    Stability Selection framework.

    Parameters
    ----------
    X : Feature matrix (n_samples, n_features)
    y : Target vector
    base_selector : Selector exposing coef_ or feature_importances_ after fit
    n_bootstrap : Number of subsampling iterations
    sample_fraction : Fraction of samples to use per iteration
    threshold : Selection frequency threshold for final inclusion
    random_state : Random seed for reproducibility

    Returns
    -------
    selected_features : Boolean mask of selected features
    selection_frequencies : Frequency of selection for each feature
    """
    np.random.seed(random_state)
    n_samples, n_features = X.shape
    subsample_size = int(n_samples * sample_fraction)

    # Track selection counts
    selection_counts = np.zeros(n_features)

    for b in range(n_bootstrap):
        # Random subsample without replacement
        indices = np.random.choice(n_samples, subsample_size, replace=False)
        X_sub = X[indices]
        y_sub = y[indices]

        # Fit a fresh clone of the selector
        selector = base_selector.__class__(**base_selector.get_params())
        selector.fit(X_sub, y_sub)

        # Get selected features
        if hasattr(selector, 'coef_'):
            coef = selector.coef_.ravel()
            selected = np.abs(coef) > 1e-10
        elif hasattr(selector, 'feature_importances_'):
            imp = selector.feature_importances_
            # Select top 20% by importance
            threshold_val = np.percentile(imp, 80)
            selected = imp > threshold_val
        else:
            raise ValueError("Selector must have coef_ or feature_importances_")

        selection_counts += selected

    # Compute frequencies
    selection_frequencies = selection_counts / n_bootstrap

    # Apply threshold
    selected_features = selection_frequencies >= threshold

    return selected_features, selection_frequencies

# Example usage
from sklearn.datasets import make_classification

# Create data with known important features
np.random.seed(42)
n_samples = 300
n_informative = 10
n_redundant = 5
n_noise = 85

X, y = make_classification(
    n_samples=n_samples, n_features=100,
    n_informative=n_informative, n_redundant=n_redundant,
    n_clusters_per_class=2, flip_y=0.03,
    shuffle=False,  # keep informative features first so their indices are known
    random_state=42)

# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Base selector
base_selector = LogisticRegressionCV(
    penalty='l1', solver='saga', cv=5, max_iter=1000, random_state=42)
base_selector.fit(X_scaled, y)

# Apply stability selection
selected, frequencies = stability_selection(
    X_scaled, y, base_selector, n_bootstrap=100, threshold=0.7)

print(f"Features selected: {np.sum(selected)} / {X.shape[1]}")
print(f"True informative features: first {n_informative}")
print("Top 15 features by selection frequency:")
top_idx = np.argsort(frequencies)[::-1][:15]
for idx in top_idx:
    is_informative = "✓ informative" if idx < n_informative else ""
    is_redundant = ("~ redundant"
                    if n_informative <= idx < n_informative + n_redundant else "")
    print(f"  Feature {idx:3d}: {frequencies[idx]:.2%} {is_informative}{is_redundant}")

# Plot selection frequency
plt.figure(figsize=(12, 5))
colors = ['green' if i < n_informative
          else 'orange' if i < n_informative + n_redundant
          else 'red' for i in range(100)]
plt.bar(range(100), frequencies, color=colors, alpha=0.7)
plt.axhline(y=0.7, color='black', linestyle='--', label='Threshold (0.7)')
plt.xlabel('Feature Index')
plt.ylabel('Selection Frequency')
plt.title('Stability Selection Frequencies '
          '(green = informative, orange = redundant, red = noise)')
plt.legend()
plt.tight_layout()
plt.show()
```

One of stability selection's most valuable properties is its finite-sample error bound. Unlike many heuristic methods, stability selection provides mathematical guarantees about false positive control.
Under mild conditions (an exchangeability assumption on the irrelevant features, plus the requirement that the base method performs no worse than random guessing):
$$\mathbb{E}[V] \leq \frac{q^2}{(2\pi_{thr} - 1)p}$$
where $V$ is the number of falsely selected variables.
Given a desired error bound $E_{max}$, we can solve for the required threshold:
$$\pi_{thr} \geq \frac{1}{2} + \frac{q^2}{2E_{max} \cdot p}$$
Example: With $p = 1000$ features, $q = 20$ average selected per run, and desired $E[V] \leq 1$ false positive:
$$\pi_{thr} \geq \frac{1}{2} + \frac{400}{2000} = 0.7$$
A threshold of 0.7 (70% selection frequency) controls expected false positives to ≤ 1.
| Desired E[False Positives] | Features (p) | Avg Selected (q) | Required Threshold |
|---|---|---|---|
| ≤ 1 | 100 | 10 | 1.00 |
| ≤ 1 | 1,000 | 20 | 0.70 |
| ≤ 1 | 10,000 | 50 | 0.625 |
| ≤ 5 | 1,000 | 30 | 0.59 |
| ≤ 0.5 | 500 | 15 | 0.95 |
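The threshold formula is easy to check numerically. The sketch below is a direct transcription of the inequality (not library code) and recomputes the worked example along with several table rows:

```python
def required_threshold(p: int, q: float, e_max: float) -> float:
    """Threshold needed so that E[V] <= e_max under the stability
    selection bound E[V] <= q^2 / ((2*pi_thr - 1) * p)."""
    return 0.5 + q * q / (2 * e_max * p)

print(f"{required_threshold(1000, 20, 1):.3f}")    # 0.700
print(f"{required_threshold(10000, 50, 1):.3f}")   # 0.625
print(f"{required_threshold(1000, 30, 5):.3f}")    # 0.590
print(f"{required_threshold(500, 15, 0.5):.3f}")   # 0.950
```

The function also makes the trade-offs explicit: larger $q$ (less sparse base selection) or a stricter error target pushes the required threshold toward 1.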
The error bound requires the 'exchangeability' assumption: the base selector doesn't systematically favor certain irrelevant features over others. This is typically satisfied by randomized methods (LASSO with random subsampling) but may be violated if the base method has intrinsic bias. Always consider whether your setting meets the assumptions.
Two related approaches strengthen stability selection's theoretical properties and practical performance.
Randomized LASSO adds an extra layer of randomization by rescaling features differently in each subsample:
$$\min_w \frac{1}{2n}\|y - Xw\|_2^2 + \lambda \sum_j \frac{|w_j|}{W_j}$$
where $W_j$ is drawn randomly from $[\alpha, 1]$ for each feature $j$ in each subsample.
This makes selection more exploratory: within a group of correlated features, the random reweighting gives each member a chance to be selected, so selection frequency spreads across the group instead of concentrating arbitrarily on one member.
Bolasso (Bach, 2008) takes a hard intersection approach: run LASSO on many bootstrap samples and keep only the features selected in every single run.
This is equivalent to stability selection with threshold $\pi_{thr} = 1$ (100% frequency).
Properties: Bolasso is extremely conservative; false positives are rare, but a weak true signal that misses even one bootstrap run is discarded.
```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

def randomized_lasso_stability(X: np.ndarray, y: np.ndarray,
                               n_bootstrap: int = 100,
                               alpha: float = 0.01,
                               weakness: float = 0.5,
                               sample_fraction: float = 0.5,
                               threshold: float = 0.6,
                               random_state: int = 42) -> tuple:
    """
    Randomized LASSO for stability selection.

    Parameters
    ----------
    X : Feature matrix
    y : Target vector
    n_bootstrap : Number of iterations
    alpha : Base LASSO regularization strength
    weakness : Lower bound for random feature scaling [weakness, 1]
    sample_fraction : Fraction of samples per subsample
    threshold : Selection frequency threshold

    Returns
    -------
    selected : Boolean mask of selected features
    frequencies : Selection frequency per feature
    """
    np.random.seed(random_state)
    n_samples, n_features = X.shape
    subsample_size = int(n_samples * sample_fraction)
    selection_counts = np.zeros(n_features)

    for b in range(n_bootstrap):
        # Subsample
        indices = np.random.choice(n_samples, subsample_size, replace=False)
        X_sub = X[indices]
        y_sub = y[indices]

        # Random feature rescaling
        scales = np.random.uniform(weakness, 1.0, n_features)
        X_scaled = X_sub * scales

        # Fit LASSO with randomized features
        lasso = Lasso(alpha=alpha, max_iter=10000)
        lasso.fit(X_scaled, y_sub)

        # Selected features (accounting for rescaling)
        selected = np.abs(lasso.coef_) > 1e-10
        selection_counts += selected

    frequencies = selection_counts / n_bootstrap
    selected = frequencies >= threshold
    return selected, frequencies

def bolasso(X: np.ndarray, y: np.ndarray,
            n_bootstrap: int = 100,
            alpha: float = 0.01,
            random_state: int = 42) -> tuple:
    """
    Bolasso: Bootstrap LASSO with intersection.

    Features must be selected in ALL bootstrap samples.
    """
    np.random.seed(random_state)
    n_samples, n_features = X.shape

    # Track which features are selected in ALL runs
    all_selected = np.ones(n_features, dtype=bool)
    selection_counts = np.zeros(n_features)

    for b in range(n_bootstrap):
        # Bootstrap sample
        indices = np.random.choice(n_samples, n_samples, replace=True)
        X_boot = X[indices]
        y_boot = y[indices]

        # Fit LASSO
        lasso = Lasso(alpha=alpha, max_iter=10000)
        lasso.fit(X_boot, y_boot)

        selected = np.abs(lasso.coef_) > 1e-10
        selection_counts += selected
        all_selected &= selected  # Intersection

    frequencies = selection_counts / n_bootstrap
    return all_selected, frequencies

# Compare methods
from sklearn.datasets import make_regression

X, y, true_coef = make_regression(
    n_samples=200, n_features=50, n_informative=5,
    noise=10, coef=True, random_state=42)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# True important features
true_support = np.abs(true_coef) > 0
print(f"True support: {np.where(true_support)[0]}")

# Standard LASSO
from sklearn.linear_model import LassoCV
lasso = LassoCV(cv=5)
lasso.fit(X_scaled, y)
lasso_selected = np.abs(lasso.coef_) > 1e-10
print(f"Standard LASSO selected: {np.where(lasso_selected)[0]}")

# Randomized LASSO stability selection
rand_selected, rand_freq = randomized_lasso_stability(
    X_scaled, y, n_bootstrap=100, alpha=0.05, threshold=0.7)
print(f"Randomized LASSO selected: {np.where(rand_selected)[0]}")

# Bolasso
bola_selected, bola_freq = bolasso(X_scaled, y, n_bootstrap=100, alpha=0.05)
print(f"Bolasso selected: {np.where(bola_selected)[0]}")

# Compare true positives and false positives
print("Method comparison:")
for name, selected in [("LASSO", lasso_selected),
                       ("Rand.LASSO", rand_selected),
                       ("Bolasso", bola_selected)]:
    tp = np.sum(selected & true_support)
    fp = np.sum(selected & ~true_support)
    fn = np.sum(~selected & true_support)
    print(f"  {name}: TP={tp}, FP={fp}, FN={fn}")
```

Complementary Pairs Stability Selection (CPSS) enhances the original framework with an elegant twist: instead of random subsamples, it uses complementary pairs of disjoint subsets.
For each iteration: randomly split the samples into two disjoint halves, run the base selector on each half, and record which features are selected in both halves simultaneously.
Then average this paired-selection count across many random splits.
Using complementary pairs guarantees the two subsets share no samples, so agreement between them cannot be driven by shared observations, and it supports a tighter error analysis.
The improved bound becomes:
$$\mathbb{E}[V] \leq \frac{q^2}{(2\pi_{thr} - 1)^2 p}$$
Note the squared denominator: for the paired-selection frequencies used by CPSS, this yields tighter false positive control (see Shah & Samworth, 2013, for the precise statements).
```python
import numpy as np
from sklearn.linear_model import LassoCV, LogisticRegressionCV

def complementary_pairs_stability_selection(
        X: np.ndarray, y: np.ndarray,
        base_selector_class, selector_params: dict,
        n_pairs: int = 50, threshold: float = 0.6,
        random_state: int = 42) -> tuple:
    """
    Complementary Pairs Stability Selection (CPSS).

    Uses pairs of disjoint subsets for tighter error control.

    Parameters
    ----------
    X : Feature matrix
    y : Target vector
    base_selector_class : Selector class (e.g., LassoCV)
    selector_params : Parameters for selector
    n_pairs : Number of complementary pairs to evaluate
    threshold : Selection frequency threshold

    Returns
    -------
    selected : Boolean mask of selected features
    frequencies : Selection frequency per feature
    """
    np.random.seed(random_state)
    n_samples, n_features = X.shape
    half = n_samples // 2

    # Count features selected in BOTH halves of a pair
    paired_selection_counts = np.zeros(n_features)

    for p in range(n_pairs):
        # Create random complementary pair
        perm = np.random.permutation(n_samples)
        first_half = perm[:half]
        second_half = perm[half:2 * half]  # Ensure equal sizes

        # Fit on first half
        selector1 = base_selector_class(**selector_params)
        selector1.fit(X[first_half], y[first_half])
        selected1 = np.abs(selector1.coef_.ravel()) > 1e-10

        # Fit on second (complementary) half
        selector2 = base_selector_class(**selector_params)
        selector2.fit(X[second_half], y[second_half])
        selected2 = np.abs(selector2.coef_.ravel()) > 1e-10

        # Feature counts only if selected in BOTH halves
        paired_selection = selected1 & selected2
        paired_selection_counts += paired_selection

    frequencies = paired_selection_counts / n_pairs
    selected = frequencies >= threshold

    return selected, frequencies

# Example
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=400, n_features=100, n_informative=10,
    n_redundant=5, random_state=42)
X = StandardScaler().fit_transform(X)

# Standard stability selection for comparison
def run_standard_stability(X, y, n_iter=100):
    n_samples, n_features = X.shape
    half = n_samples // 2
    counts = np.zeros(n_features)
    for i in range(n_iter):
        idx = np.random.choice(n_samples, half, replace=False)
        selector = LogisticRegressionCV(penalty='l1', solver='saga',
                                        cv=3, max_iter=500)
        selector.fit(X[idx], y[idx])
        counts += np.abs(selector.coef_.ravel()) > 1e-10
    return counts / n_iter

# Run both methods
standard_freq = run_standard_stability(X, y, n_iter=100)
cpss_selected, cpss_freq = complementary_pairs_stability_selection(
    X, y, LogisticRegressionCV,
    {'penalty': 'l1', 'solver': 'saga', 'cv': 3, 'max_iter': 500},
    n_pairs=50,
    threshold=0.5)  # Can use a lower threshold due to the tighter bound

print("Comparison: Standard vs CPSS")
print(f"Standard stability - features with freq > 0.7: {np.sum(standard_freq > 0.7)}")
print(f"CPSS - features with freq > 0.5: {np.sum(cpss_selected)}")
# CPSS frequencies are typically lower but come with better error control
```

Implementing stability selection effectively requires attention to several practical considerations.
The base selector should be sparse (select a manageable subset), cheap enough to refit many times, and expose its selections via a `coef_` or `feature_importances_` attribute.
Good choices: LASSO or Elastic Net for regression, L1-penalized logistic regression for classification, and tree ensembles with an importance threshold.
Number of subsamples ($B$): more iterations give more precise frequency estimates, but returns diminish; 50-100 iterations are usually sufficient.
Sample fraction: 0.5 is the standard choice and the setting for which the theory is developed.
Threshold ($\pi_{thr}$): values between 0.6 and 0.9 are common; higher is more conservative, and the error bound lets you derive the threshold needed for a target false positive rate.
| Priority | Recommended Settings |
|---|---|
| False positive control (strict) | High threshold (0.8-0.9), CPSS, more iterations |
| Power (find weak signal) | Lower threshold (0.5-0.6), randomized LASSO |
| Computational efficiency | Fewer iterations (50), simple base selector |
| Handling correlations | Elastic Net as base, randomized feature scaling |
| Small sample size | Leave-one-out or CPSS, conservative threshold |
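Whichever settings you choose, it is worth checking how sharply the final selection depends on the threshold. A quick sketch with hypothetical frequencies (assumed values, not output from any run above):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical frequencies: 8 strong features, 92 weak/noise features
frequencies = np.concatenate([rng.uniform(0.7, 1.0, 8),
                              rng.uniform(0.0, 0.4, 92)])

for thr in (0.5, 0.6, 0.7, 0.8, 0.9):
    print(f"threshold {thr}: {np.sum(frequencies >= thr)} features selected")
```

A plateau in the selected-feature count across a range of thresholds (here between 0.5 and 0.7) is a good sign that the threshold choice is not driving the result; a count that changes rapidly with the threshold suggests many borderline features.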
scikit-learn once shipped RandomizedLasso and RandomizedLogisticRegression for randomized stability selection, but both were deprecated and later removed. The scikit-learn-contrib project `stability-selection` provides a standalone implementation. Alternatively, the implementation shown earlier gives you full control over the process.
| Method | Error Control | Power | Correlation Handling | Computation |
|---|---|---|---|---|
| Standard Stability Selection | Moderate | Good | Moderate | B × model fits |
| Randomized LASSO | Better | Better | Better | B × model fits |
| CPSS | Best | Lower | Good | 2B × model fits |
| Bolasso | Very strict | Lowest | Good | B × model fits |
| Ensemble (multiple base selectors) | Varies | Good | Good | Higher |
Standard Stability Selection: Good default choice, reasonable error control, well-understood properties.
Randomized LASSO: When features are correlated and you want to avoid arbitrary selection among correlated groups.
CPSS: When false positives are especially costly (medical diagnosis, high-stakes decisions).
Bolasso: Extremely conservative settings where you only want features with near-certain importance.
Ensemble Approaches: When you're unsure which base selector is best—combine multiple selectors' stability results.
Stability selection ensures our selected features are robust. But even with stable selection, we face another subtle danger: Feature Selection Bias—the tendency to overestimate the performance of models when feature selection uses the same data as model evaluation. Next, we'll explore this bias and learn how to avoid misleading ourselves.