Wrapper methods represent a paradigm shift from filter methods: rather than evaluating features through statistical proxies, they directly measure how feature subsets affect model performance. The name derives from how these methods "wrap" around a learning algorithm, using it as a black box to score feature combinations.
The fundamental insight is elegant: if our goal is to find features that lead to good predictions, why not evaluate features by... making predictions? A wrapper method trains a model on a candidate feature subset, evaluates its performance via cross-validation, and uses this performance as the subset's score.
This approach offers a compelling advantage over filters: selection is optimized for the specific learning algorithm you intend to use. Features that a Random Forest finds informative may differ from those optimal for a neural network, and wrapper methods capture these algorithm-specific preferences.
Wrapper methods treat feature selection as a search problem: explore the space of possible feature subsets, evaluate each using a model, and return the best-performing subset. The quality of a feature is judged entirely by its contribution to predictive performance—the ultimate criterion for most ML applications.
Before diving into specific wrapper algorithms, we must confront the fundamental computational challenge they face.
With $p$ features, there are $2^p$ possible feature subsets (including the empty set). This exponential growth rapidly becomes intractable:
| Features (p) | Possible Subsets ($2^p$) | Time at 1ms per evaluation |
|---|---|---|
| 10 | 1,024 | ~1 second |
| 20 | 1,048,576 | ~17 minutes |
| 30 | 1,073,741,824 | ~12 days |
| 50 | ~1.1 × 10^15 | ~36,000 years |
Exhaustive search is feasible only for very small feature sets. Real-world datasets with hundreds or thousands of features require heuristic search strategies that explore a fraction of the space while hopefully finding near-optimal solutions.
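To make the arithmetic behind the table concrete, here is a small illustrative snippet (not part of the original material) that estimates the wall-clock cost of exhaustive search for a given feature count, assuming a hypothetical fixed cost of 1 ms per subset evaluation:

```python
# Back-of-the-envelope cost of exhaustive search.
# Assumes a hypothetical fixed cost of 1 ms per subset evaluation.
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

def exhaustive_search_cost(n_features: int, seconds_per_eval: float = 0.001) -> None:
    n_subsets = 2 ** n_features              # every subset, including the empty one
    total_seconds = n_subsets * seconds_per_eval
    print(f"p={n_features:3d}: {n_subsets:.3e} subsets, "
          f"{total_seconds:.3e} s (~{total_seconds / SECONDS_PER_YEAR:.3e} years)")

for p in (10, 20, 30, 50, 100):
    exhaustive_search_cost(p)
```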
Wrapper methods face a meta-level bias-variance tradeoff: a more thorough search can find subsets that score better on the evaluation data, but it costs far more computation and increases the risk of overfitting the selection to that data, while a cheaper, greedier search may miss good subsets entirely. The art of wrapper methods lies in navigating this tradeoff.
If training your model takes 10 seconds and you have 100 features, evaluating all 2^100 ≈ 10^30 subsets would take longer than the age of the universe. Wrapper methods MUST use intelligent search strategies. The question isn't whether to approximate—it's how to approximate well.
Forward Selection is the most intuitive wrapper strategy: start with no features and greedily add the feature that most improves model performance, one at a time.
With $p$ features and selecting $k$:
Total evaluations: $\sum_{i=0}^{k-1}(p-i) = kp - \frac{k(k-1)}{2} = O(kp)$
For selecting $k = p/2$ features: $O(p^2)$ evaluations—far better than $2^p$!
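As a quick sanity check on this count (an illustrative addition, using hypothetical values $p = 100$ and $k = 10$):

```python
# Candidate evaluations for greedy forward selection:
# at step i (0-indexed), p - i features remain to be tried.
p, k = 100, 10  # hypothetical sizes, for illustration only

direct_count = sum(p - i for i in range(k))
closed_form = k * p - k * (k - 1) // 2

print(direct_count, closed_form)             # both print 955
print(f"exhaustive search: 2**p = {2**p:.3e} subsets")
```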
```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.base import clone
from typing import List, Tuple


def forward_selection(X: np.ndarray, y: np.ndarray, estimator,
                      max_features: int = None, cv: int = 5,
                      scoring: str = 'accuracy',
                      min_improvement: float = 0.001) -> Tuple[List[int], List[float]]:
    """
    Sequential Forward Selection for feature selection.

    Parameters:
    -----------
    X : Feature matrix of shape (n_samples, n_features)
    y : Target vector
    estimator : sklearn-compatible estimator
    max_features : Maximum features to select (default: all)
    cv : Number of cross-validation folds
    scoring : Scoring metric for evaluation
    min_improvement : Minimum improvement to continue adding features

    Returns:
    --------
    selected_features : List of selected feature indices (in order selected)
    scores : Cross-validation score after each addition
    """
    n_features = X.shape[1]
    max_features = max_features or n_features

    selected = []
    remaining = list(range(n_features))
    scores = []
    best_score = -np.inf

    for _ in range(max_features):
        best_feature = None
        best_new_score = -np.inf

        for feature in remaining:
            # Create candidate feature set
            candidate_features = selected + [feature]
            X_candidate = X[:, candidate_features]

            # Evaluate with cross-validation
            cv_scores = cross_val_score(
                clone(estimator), X_candidate, y,
                cv=cv, scoring=scoring
            )
            mean_score = np.mean(cv_scores)

            if mean_score > best_new_score:
                best_new_score = mean_score
                best_feature = feature

        # Check if improvement is sufficient
        if best_new_score - best_score < min_improvement:
            print(f"Stopping: improvement {best_new_score - best_score:.4f} < {min_improvement}")
            break

        # Add best feature
        selected.append(best_feature)
        remaining.remove(best_feature)
        best_score = best_new_score
        scores.append(best_score)
        print(f"Added feature {best_feature}: score = {best_score:.4f}")

    return selected, scores


# Example usage
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

data = load_breast_cancer()
X, y = data.data, data.target

estimator = LogisticRegression(max_iter=1000, random_state=42)
selected, scores = forward_selection(X, y, estimator, max_features=10)

print(f"\nSelected features: {[data.feature_names[i] for i in selected]}")
print(f"Final score: {scores[-1]:.4f}")
```

Think of forward selection as building a team: start with no players, and at each round, add the person who most improves team performance. Early picks might be star players (highly predictive features), while later picks fill complementary roles (features that work well with those already selected).
Backward Elimination takes the opposite approach: start with all features and greedily remove the feature whose elimination least degrades (or most improves) model performance.
Backward elimination often outperforms forward selection when features are individually weak but jointly informative: forward selection may never add such features, while backward elimination starts with them already present and can keep them together. It is also a natural choice when most features are relevant and only a few need to be removed.
```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.base import clone
from typing import List, Tuple


def backward_elimination(X: np.ndarray, y: np.ndarray, estimator,
                         min_features: int = 1, cv: int = 5,
                         scoring: str = 'accuracy',
                         max_degradation: float = 0.01) -> Tuple[List[int], List[float]]:
    """
    Sequential Backward Elimination for feature selection.

    Parameters:
    -----------
    X : Feature matrix of shape (n_samples, n_features)
    y : Target vector
    estimator : sklearn-compatible estimator
    min_features : Minimum features to retain
    cv : Number of cross-validation folds
    scoring : Scoring metric for evaluation
    max_degradation : Maximum allowed score degradation to continue

    Returns:
    --------
    remaining_features : List of retained feature indices
    scores : Cross-validation score after each removal
    """
    n_features = X.shape[1]
    remaining = list(range(n_features))
    scores = []

    # Initial score with all features
    cv_scores = cross_val_score(
        clone(estimator), X, y, cv=cv, scoring=scoring
    )
    best_score = np.mean(cv_scores)
    scores.append(best_score)
    print(f"Initial score (all {n_features} features): {best_score:.4f}")

    while len(remaining) > min_features:
        worst_feature = None
        best_score_after_removal = -np.inf

        for feature in remaining:
            # Create candidate feature set without this feature
            candidate_features = [f for f in remaining if f != feature]
            X_candidate = X[:, candidate_features]

            # Evaluate with cross-validation
            cv_scores = cross_val_score(
                clone(estimator), X_candidate, y,
                cv=cv, scoring=scoring
            )
            mean_score = np.mean(cv_scores)

            if mean_score > best_score_after_removal:
                best_score_after_removal = mean_score
                worst_feature = feature

        # Check if degradation is acceptable
        degradation = best_score - best_score_after_removal
        if degradation > max_degradation:
            print(f"Stopping: degradation {degradation:.4f} > {max_degradation}")
            break

        # Remove worst feature
        remaining.remove(worst_feature)
        best_score = best_score_after_removal
        scores.append(best_score)
        print(f"Removed feature {worst_feature}: score = {best_score:.4f} ({len(remaining)} remaining)")

    return remaining, scores


# Example usage
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X, y = data.data, data.target

estimator = RandomForestClassifier(n_estimators=50, random_state=42)
remaining, scores = backward_elimination(X, y, estimator, min_features=5)

print(f"\nRetained features: {[data.feature_names[i] for i in remaining]}")
print(f"Final score: {scores[-1]:.4f}")
```

Stepwise selection combines forward and backward approaches, allowing features to be both added and removed throughout the search. This addresses a key limitation of pure sequential methods: once a feature is added (forward) or removed (backward), that decision is permanent.
The algorithm alternates or combines forward steps, which add the feature that most improves performance, with backward steps, which remove a previously selected feature when doing so improves (or barely degrades) performance.
This allows the algorithm to revisit earlier decisions: a feature added early can be dropped later if newer additions make it redundant, and a previously removed feature can re-enter the subset if it becomes useful again.
```python
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

# Load data
data = load_breast_cancer()
X, y = data.data, data.target

# Create estimator
estimator = GradientBoostingClassifier(n_estimators=50, random_state=42)

# Bidirectional stepwise selection using mlxtend
sfs = SequentialFeatureSelector(
    estimator,
    k_features=10,       # Target number of features
    forward=True,        # Start with forward selection
    floating=True,       # Enable backward steps (SFFS/SBFS)
    scoring='accuracy',
    cv=5,
    n_jobs=-1
)

sfs = sfs.fit(X, y)

print(f"Selected feature indices: {sfs.k_feature_idx_}")
print(f"Selected feature names: {[data.feature_names[i] for i in sfs.k_feature_idx_]}")
print(f"Best CV score: {sfs.k_score_:.4f}")

# Examine selection history
import pandas as pd
results_df = pd.DataFrame.from_dict(sfs.get_metric_dict()).T
print("\nSelection history:")
print(results_df[['feature_idx', 'avg_score']].head(15))
```

Floating selection methods add flexibility to the stepwise process:
Sequential Floating Forward Selection (SFFS): after each forward addition, conditionally remove previously selected features, one at a time, as long as each removal improves the score.
Sequential Floating Backward Selection (SBFS): after each backward removal, conditionally re-add previously removed features, one at a time, as long as each re-addition improves the score.
The "floating" refers to how the number of selected features can fluctuate during the search, rather than monotonically increasing (forward) or decreasing (backward).
While floating methods often find better solutions than strict sequential approaches, they require more evaluations. In the worst case, SFFS might perform a full backward search after each forward step. The typical cost lies between pure forward selection and exhaustive search, depending on how often floating steps occur.
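To make the conditional-exclusion step concrete, the sketch below implements a minimal SFFS loop from scratch. It is a simplified illustration, not the mlxtend implementation: the `subset_score` helper, the fixed `k_target` stopping rule, and the `best_by_size` loop guard are all assumptions made for brevity.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def subset_score(X, y, estimator, features, cv=5):
    """Mean CV score of the estimator restricted to the given feature subset."""
    return cross_val_score(clone(estimator), X[:, features], y, cv=cv).mean()

def sffs(X, y, estimator, k_target, cv=5):
    """Minimal SFFS sketch: forward additions followed by conditional exclusions."""
    selected, remaining = [], list(range(X.shape[1]))
    best_by_size = {}  # best score seen for each subset size (guards against cycling)

    while len(selected) < k_target:
        # Forward step: add the feature that maximizes the CV score.
        best_f = max(remaining,
                     key=lambda f: subset_score(X, y, estimator, selected + [f], cv))
        selected.append(best_f)
        remaining.remove(best_f)
        current = subset_score(X, y, estimator, selected, cv)
        best_by_size[len(selected)] = max(best_by_size.get(len(selected), -np.inf), current)

        # Conditional exclusion: drop a feature (never the one just added) while the
        # smaller subset beats the best score ever recorded at that size.
        while len(selected) > 2:
            candidates = [f for f in selected if f != best_f]
            f_drop = max(candidates,
                         key=lambda f: subset_score(X, y, estimator,
                                                    [g for g in selected if g != f], cv))
            trial = [g for g in selected if g != f_drop]
            trial_score = subset_score(X, y, estimator, trial, cv)
            if trial_score > best_by_size.get(len(trial), -np.inf):
                selected, current = trial, trial_score
                remaining.append(f_drop)
                best_by_size[len(trial)] = trial_score
            else:
                break
    return selected

# Example usage (small k_target keeps the runtime modest)
data = load_breast_cancer()
estimator = LogisticRegression(max_iter=1000, random_state=42)
print(sffs(data.data, data.target, estimator, k_target=5))
```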
When computational resources permit, exhaustive search evaluates all possible feature subsets. This guarantees finding the globally optimal subset for the given model and evaluation metric.
Given the $2^p$ complexity, exhaustive search is only practical for small feature sets (roughly $p \leq 25$), fast-to-train models, and problems where a provably optimal subset justifies the cost.
Even with parallelization, exhaustive search hits limits quickly. For 25 features: ~33 million subsets. At 100ms per evaluation: ~38 days on a single core, ~1 day with 40 cores.
```python
from mlxtend.feature_selection import ExhaustiveFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
import time

# Load a small dataset
data = load_iris()
X, y = data.data, data.target

# Exhaustive search is only practical for very few features
# Iris has only 4 features: 2^4 = 16 subsets - trivially fast
estimator = LogisticRegression(max_iter=1000, random_state=42)

efs = ExhaustiveFeatureSelector(
    estimator,
    min_features=1,
    max_features=4,
    scoring='accuracy',
    cv=5,
    n_jobs=-1
)

start = time.time()
efs = efs.fit(X, y)
elapsed = time.time() - start

print(f"Evaluated {2**4 - 1} subsets in {elapsed:.2f} seconds")
print(f"Best subset: {efs.best_idx_}")
print(f"Best features: {[data.feature_names[i] for i in efs.best_idx_]}")
print(f"Best score: {efs.best_score_:.4f}")
```

When the search space is too large for exhaustive search but greedy methods feel limiting, genetic algorithms (GA) offer a middle ground. GAs treat feature subsets as "individuals" in a population that evolves over generations.
Key components: a binary chromosome that encodes each subset as a mask over the features, a fitness function based on cross-validated model performance, selection of fitter individuals as parents (tournament selection in the example below), crossover that recombines two parent subsets, and mutation that randomly flips individual feature bits.
Advantages: the population explores many regions of the search space simultaneously, crossover can combine good partial solutions, and mutation helps escape the local optima that trap greedy methods.
Disadvantages: many hyperparameters to tune (population size, number of generations, crossover and mutation rates), no optimality guarantees, and a high computational cost, since every individual in every generation requires a cross-validated model evaluation.
```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from typing import Tuple


def genetic_feature_selection(
    X: np.ndarray,
    y: np.ndarray,
    estimator,
    population_size: int = 50,
    generations: int = 100,
    mutation_rate: float = 0.1,
    crossover_rate: float = 0.8,
    cv: int = 5
) -> Tuple[np.ndarray, float]:
    """
    Genetic algorithm for feature selection.

    Returns:
    --------
    best_mask : Boolean array indicating selected features
    best_score : Fitness of best individual
    """
    n_features = X.shape[1]

    # Initialize population randomly
    population = np.random.rand(population_size, n_features) > 0.5

    # Ensure at least one feature per individual
    for i in range(population_size):
        if not population[i].any():
            population[i, np.random.randint(n_features)] = True

    def evaluate_fitness(mask):
        if not mask.any():
            return 0.0
        X_subset = X[:, mask]
        scores = cross_val_score(estimator, X_subset, y, cv=cv)
        return np.mean(scores)

    best_mask = None
    best_score = -np.inf

    for gen in range(generations):
        # Evaluate fitness
        fitness = np.array([evaluate_fitness(ind) for ind in population])

        # Track best
        gen_best_idx = np.argmax(fitness)
        if fitness[gen_best_idx] > best_score:
            best_score = fitness[gen_best_idx]
            best_mask = population[gen_best_idx].copy()

        if gen % 20 == 0:
            print(f"Gen {gen}: best={best_score:.4f}, features={best_mask.sum()}")

        # Selection (tournament)
        new_population = []
        for _ in range(population_size):
            # Tournament selection
            candidates = np.random.choice(population_size, 3, replace=False)
            winner = candidates[np.argmax(fitness[candidates])]
            new_population.append(population[winner].copy())

        # Crossover
        for i in range(0, population_size - 1, 2):
            if np.random.rand() < crossover_rate:
                point = np.random.randint(1, n_features)
                new_population[i][point:], new_population[i+1][point:] = \
                    new_population[i+1][point:].copy(), new_population[i][point:].copy()

        # Mutation
        for i in range(population_size):
            for j in range(n_features):
                if np.random.rand() < mutation_rate:
                    new_population[i][j] = not new_population[i][j]
            # Ensure at least one feature
            if not new_population[i].any():
                new_population[i][np.random.randint(n_features)] = True

        population = np.array(new_population)

    return best_mask, best_score


# Usage example
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target

estimator = RandomForestClassifier(n_estimators=30, random_state=42)
best_mask, best_score = genetic_feature_selection(
    X, y, estimator, population_size=30, generations=50
)

selected_features = [data.feature_names[i] for i, m in enumerate(best_mask) if m]
print(f"\nBest features ({len(selected_features)}): {selected_features}")
print(f"Best CV score: {best_score:.4f}")
```

Recursive Feature Elimination (RFE) is a popular wrapper method that leverages model-derived feature importance to guide elimination. Unlike pure backward elimination that evaluates all features equally at each step, RFE asks the model itself which features are least important.
The key insight: rather than retraining $O(p)$ models at each step to find which feature to remove, RFE trains one model and uses its internal ranking. This dramatically reduces computational cost.
RFE requires models that expose feature importance scores: linear models and linear SVMs (coefficient magnitudes via `coef_`) or tree-based models and ensembles (`feature_importances_`). Estimators that provide neither, such as k-nearest neighbors, cannot serve as the underlying model without a custom `importance_getter`.
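As a quick illustration of this requirement (an addition, not part of the original example), the snippet below runs scikit-learn's `RFE` with a logistic regression, which exposes `coef_`, and then with a k-nearest-neighbors classifier, which exposes no importance attribute and is therefore expected to raise an error:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Works: LogisticRegression exposes coef_, which RFE uses to rank features.
rfe_ok = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10).fit(X, y)
print("Linear model OK, selected:", rfe_ok.support_.sum(), "features")

# Fails: KNeighborsClassifier has neither coef_ nor feature_importances_.
try:
    RFE(KNeighborsClassifier(), n_features_to_select=10).fit(X, y)
except Exception as exc:  # scikit-learn raises because no importance attribute is found
    print("KNN failed as expected:", type(exc).__name__)
```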
```python
from sklearn.feature_selection import RFE, RFECV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
import numpy as np

# Load data
data = load_breast_cancer()
X, y = data.data, data.target

# RFE with specified number of features
estimator = RandomForestClassifier(n_estimators=100, random_state=42)
rfe = RFE(estimator, n_features_to_select=10, step=1)
rfe = rfe.fit(X, y)

print("RFE Results:")
print(f"Selected features: {[data.feature_names[i] for i, s in enumerate(rfe.support_) if s]}")
print(f"Feature rankings: {rfe.ranking_}")

# RFECV: RFE with cross-validation to find optimal number
rfecv = RFECV(
    estimator=LogisticRegression(max_iter=1000, random_state=42),
    step=1,
    cv=5,
    scoring='accuracy',
    min_features_to_select=1,
    n_jobs=-1
)
rfecv = rfecv.fit(X, y)

print(f"\nRFECV Results:")
print(f"Optimal number of features: {rfecv.n_features_}")
print(f"Selected features: {[data.feature_names[i] for i, s in enumerate(rfecv.support_) if s]}")

# Plot number of features vs. cross-validation scores
plt.figure(figsize=(10, 5))
n_features_range = range(1, len(rfecv.cv_results_['mean_test_score']) + 1)
plt.plot(n_features_range, rfecv.cv_results_['mean_test_score'], 'b-o')
plt.fill_between(
    n_features_range,
    rfecv.cv_results_['mean_test_score'] - rfecv.cv_results_['std_test_score'],
    rfecv.cv_results_['mean_test_score'] + rfecv.cv_results_['std_test_score'],
    alpha=0.2
)
plt.axvline(rfecv.n_features_, color='r', linestyle='--', label=f'Optimal: {rfecv.n_features_}')
plt.xlabel('Number of Features')
plt.ylabel('CV Accuracy')
plt.title('RFE Cross-Validation Performance')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
```

RFE requires only one model fit per elimination round, versus O(p) model fits per step for backward elimination. For selecting k features from p with step=1, RFE needs O(p − k) model fits, compared with roughly (p² − k²)/2 for backward elimination. This makes RFE dramatically faster, though it relies on the model's importance estimates being accurate.
Choosing among wrapper methods requires understanding their tradeoffs and proper evaluation protocols.
A critical concern with wrapper methods is selection bias. If you use the same data both to select the feature subset and to estimate the performance of the resulting model, your performance estimate will be optimistically biased: the selected features were chosen precisely because they happened to work well on this particular data.
Solution: Nested cross-validation
For each outer fold: run the complete feature-selection procedure (including its own inner cross-validation) on the training portion only, then evaluate the selected subset and the fitted model on the held-out outer fold.
This gives an unbiased estimate of the full pipeline's performance.
```python
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_breast_cancer
import numpy as np

# Load data
data = load_breast_cancer()
X, y = data.data, data.target

# Create pipeline: RFE + classifier
estimator = LogisticRegression(max_iter=1000, random_state=42)

# IMPORTANT: RFECV performs its own inner CV for feature selection
# Wrapping in a pipeline ensures outer CV uses fresh feature selection
pipeline = Pipeline([
    ('feature_selection', RFECV(estimator, step=1, cv=3, scoring='accuracy')),
    ('classifier', LogisticRegression(max_iter=1000, random_state=42))
])

# Outer cross-validation for unbiased performance estimate
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
nested_scores = cross_val_score(
    pipeline, X, y, cv=outer_cv, scoring='accuracy', n_jobs=-1
)

print(f"Nested CV scores: {nested_scores}")
print(f"Mean accuracy: {nested_scores.mean():.4f} (+/- {nested_scores.std()*2:.4f})")

# Compare with naive (biased) estimate
rfecv = RFECV(estimator, step=1, cv=5, scoring='accuracy')
rfecv.fit(X, y)
naive_score = rfecv.cv_results_['mean_test_score'][rfecv.n_features_ - 1]
print(f"\nNaive CV score (biased): {naive_score:.4f}")
print(f"Bias: {naive_score - nested_scores.mean():.4f}")
```

| Method | Complexity | Strengths | Weaknesses |
|---|---|---|---|
| Forward Selection | O(kp) | Fast, works well with few relevant features | Cannot correct early mistakes, misses interactions |
| Backward Elimination | O((p-k)p) | Captures interactions, handles correlations | Slow for large p, cannot recover removed features |
| Stepwise (Floating) | Between forward and exhaustive | Corrects mistakes, balances exploration | More complex, more evaluations needed |
| RFE | O(p-k) × model fit | Fast, leverages model insights | Depends on model importance accuracy |
| Exhaustive | O(2^p) | Guarantees optimal solution | Only feasible for small p |
| Genetic Algorithm | O(generations × population × CV) | Explores broadly, escapes local optima | Many hyperparameters, no guarantees |
Wrapper methods optimize feature selection for specific models but require many model evaluations. In the next section, we'll explore Embedded Methods—approaches where feature selection is built into the model training process itself, offering efficiency gains while maintaining algorithm-specific optimization.