Wrapper methods represent a paradigm shift from filter methods: rather than evaluating features through statistical proxies, they directly measure how feature subsets affect model performance. The name derives from how these methods "wrap" around a learning algorithm, using it as a black box to score feature combinations.
The fundamental insight is elegant: if our goal is to find features that lead to good predictions, why not evaluate features by... making predictions? A wrapper method trains a model on a candidate feature subset, evaluates its performance via cross-validation, and uses this performance as the subset's score.
This approach offers a compelling advantage over filters: selection is optimized for the specific learning algorithm you intend to use. Features that a Random Forest finds informative may differ from those optimal for a neural network, and wrapper methods capture these algorithm-specific preferences.
Wrapper methods treat feature selection as a search problem: explore the space of possible feature subsets, evaluate each using a model, and return the best-performing subset. The quality of a feature is judged entirely by its contribution to predictive performance—the ultimate criterion for most ML applications.
Before diving into specific wrapper algorithms, we must confront the fundamental computational challenge they face.
With $p$ features, there are $2^p$ possible feature subsets (including the empty set). This exponential growth rapidly becomes intractable:
| Features (p) | Possible Subsets ($2^p$) | Time at 1ms per evaluation |
|---|---|---|
| 10 | 1,024 | ~1 second |
| 20 | 1,048,576 | ~17 minutes |
| 30 | 1,073,741,824 | ~12 days |
| 50 | ~1.1 × 10^15 | ~36,000 years |
Exhaustive search is feasible only for very small feature sets. Real-world datasets with hundreds or thousands of features require heuristic search strategies that explore a fraction of the space while hopefully finding near-optimal solutions.
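To make the arithmetic behind the table concrete, here is a small illustrative snippet (not part of the original material) that estimates the wall-clock cost of exhaustive search for a given feature count, assuming a hypothetical fixed cost of 1 ms per subset evaluation:

```python
# Back-of-the-envelope cost of exhaustive search.
# Assumes a hypothetical fixed cost of 1 ms per subset evaluation.
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

def exhaustive_search_cost(n_features: int, seconds_per_eval: float = 0.001) -> None:
    n_subsets = 2 ** n_features              # every subset, including the empty one
    total_seconds = n_subsets * seconds_per_eval
    print(f"p={n_features:3d}: {n_subsets:.3e} subsets, "
          f"{total_seconds:.3e} s (~{total_seconds / SECONDS_PER_YEAR:.3e} years)")

for p in (10, 20, 30, 50, 100):
    exhaustive_search_cost(p)
```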
Wrapper methods face a meta-level bias-variance tradeoff: a more thorough search can find subsets that score better on the evaluation data, but it costs far more computation and increases the risk of overfitting the selection to that data, while a cheaper, greedier search may miss good subsets entirely. The art of wrapper methods lies in navigating this tradeoff.
If training your model takes 10 seconds and you have 100 features, evaluating all 2^100 ≈ 10^30 subsets would take longer than the age of the universe. Wrapper methods MUST use intelligent search strategies. The question isn't whether to approximate—it's how to approximate well.
Forward Selection is the most intuitive wrapper strategy: start with no features and greedily add the feature that most improves model performance, one at a time.
With $p$ features and selecting $k$:
Total evaluations: $\sum_{i=0}^{k-1}(p-i) = kp - \frac{k(k-1)}{2} = O(kp)$
For selecting $k = p/2$ features: $O(p^2)$ evaluations—far better than $2^p$!
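As a quick sanity check on this count (an illustrative addition, using hypothetical values $p = 100$ and $k = 10$):

```python
# Candidate evaluations for greedy forward selection:
# at step i (0-indexed), p - i features remain to be tried.
p, k = 100, 10  # hypothetical sizes, for illustration only

direct_count = sum(p - i for i in range(k))
closed_form = k * p - k * (k - 1) // 2

print(direct_count, closed_form)             # both print 955
print(f"exhaustive search: 2**p = {2**p:.3e} subsets")
```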
```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.base import clone
from typing import List, Tuple


def forward_selection(X: np.ndarray, y: np.ndarray, estimator,
                      max_features: int = None, cv: int = 5,
                      scoring: str = 'accuracy',
                      min_improvement: float = 0.001) -> Tuple[List[int], List[float]]:
    """
    Sequential Forward Selection for feature selection.

    Parameters:
    -----------
    X : Feature matrix of shape (n_samples, n_features)
    y : Target vector
    estimator : sklearn-compatible estimator
    max_features : Maximum features to select (default: all)
    cv : Number of cross-validation folds
    scoring : Scoring metric for evaluation
    min_improvement : Minimum improvement to continue adding features

    Returns:
    --------
    selected_features : List of selected feature indices (in order selected)
    scores : Cross-validation score after each addition
    """
    n_features = X.shape[1]
    max_features = max_features or n_features

    selected = []
    remaining = list(range(n_features))
    scores = []
    best_score = -np.inf

    for _ in range(max_features):
        best_feature = None
        best_new_score = -np.inf

        for feature in remaining:
            # Create candidate feature set
            candidate_features = selected + [feature]
            X_candidate = X[:, candidate_features]

            # Evaluate with cross-validation
            cv_scores = cross_val_score(
                clone(estimator), X_candidate, y,
                cv=cv, scoring=scoring
            )
            mean_score = np.mean(cv_scores)

            if mean_score > best_new_score:
                best_new_score = mean_score
                best_feature = feature

        # Check if improvement is sufficient
        if best_new_score - best_score < min_improvement:
            print(f"Stopping: improvement {best_new_score - best_score:.4f} < {min_improvement}")
            break

        # Add best feature
        selected.append(best_feature)
        remaining.remove(best_feature)
        best_score = best_new_score
        scores.append(best_score)
        print(f"Added feature {best_feature}: score = {best_score:.4f}")

    return selected, scores


# Example usage
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

data = load_breast_cancer()
X, y = data.data, data.target

estimator = LogisticRegression(max_iter=1000, random_state=42)
selected, scores = forward_selection(X, y, estimator, max_features=10)

print(f"\nSelected features: {[data.feature_names[i] for i in selected]}")
print(f"Final score: {scores[-1]:.4f}")
```

Think of forward selection as building a team: start with no players, and at each round, add the person who most improves team performance. Early picks might be star players (highly predictive features), while later picks fill complementary roles (features that work well with those already selected).
Backward Elimination takes the opposite approach: start with all features and greedily remove the feature whose elimination least degrades (or most improves) model performance.
Backward elimination often outperforms forward selection when features are individually weak but jointly informative: forward selection may never add such features, while backward elimination starts with them already present and can keep them together. It is also a natural choice when most features are relevant and only a few need to be removed.
```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.base import clone
from typing import List, Tuple


def backward_elimination(X: np.ndarray, y: np.ndarray, estimator,
                         min_features: int = 1, cv: int = 5,
                         scoring: str = 'accuracy',
                         max_degradation: float = 0.01) -> Tuple[List[int], List[float]]:
    """
    Sequential Backward Elimination for feature selection.

    Parameters:
    -----------
    X : Feature matrix of shape (n_samples, n_features)
    y : Target vector
    estimator : sklearn-compatible estimator
    min_features : Minimum features to retain
    cv : Number of cross-validation folds
    scoring : Scoring metric for evaluation
    max_degradation : Maximum allowed score degradation to continue

    Returns:
    --------
    remaining_features : List of retained feature indices
    scores : Cross-validation score after each removal
    """
    n_features = X.shape[1]
    remaining = list(range(n_features))
    scores = []

    # Initial score with all features
    cv_scores = cross_val_score(
        clone(estimator), X, y, cv=cv, scoring=scoring
    )
    best_score = np.mean(cv_scores)
    scores.append(best_score)
    print(f"Initial score (all {n_features} features): {best_score:.4f}")

    while len(remaining) > min_features:
        worst_feature = None
        best_score_after_removal = -np.inf

        for feature in remaining:
            # Create candidate feature set without this feature
            candidate_features = [f for f in remaining if f != feature]
            X_candidate = X[:, candidate_features]

            # Evaluate with cross-validation
            cv_scores = cross_val_score(
                clone(estimator), X_candidate, y,
                cv=cv, scoring=scoring
            )
            mean_score = np.mean(cv_scores)

            if mean_score > best_score_after_removal:
                best_score_after_removal = mean_score
                worst_feature = feature

        # Check if degradation is acceptable
        degradation = best_score - best_score_after_removal
        if degradation > max_degradation:
            print(f"Stopping: degradation {degradation:.4f} > {max_degradation}")
            break

        # Remove worst feature
        remaining.remove(worst_feature)
        best_score = best_score_after_removal
        scores.append(best_score)
        print(f"Removed feature {worst_feature}: score = {best_score:.4f} ({len(remaining)} remaining)")

    return remaining, scores


# Example usage
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X, y = data.data, data.target

estimator = RandomForestClassifier(n_estimators=50, random_state=42)
remaining, scores = backward_elimination(X, y, estimator, min_features=5)

print(f"\nRetained features: {[data.feature_names[i] for i in remaining]}")
print(f"Final score: {scores[-1]:.4f}")
```

Stepwise selection combines forward and backward approaches, allowing features to be both added and removed throughout the search. This addresses a key limitation of pure sequential methods: once a feature is added (forward) or removed (backward), that decision is permanent.
The algorithm alternates or combines forward steps, which add the feature that most improves performance, with backward steps, which remove a previously selected feature when doing so improves (or barely degrades) performance.
This allows the algorithm to revisit earlier decisions: a feature added early can be dropped later if newer additions make it redundant, and a previously removed feature can re-enter the subset if it becomes useful again.
```python
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

# Load data
data = load_breast_cancer()
X, y = data.data, data.target

# Create estimator
estimator = GradientBoostingClassifier(n_estimators=50, random_state=42)

# Bidirectional stepwise selection using mlxtend
sfs = SequentialFeatureSelector(
    estimator,
    k_features=10,       # Target number of features
    forward=True,        # Start with forward selection
    floating=True,       # Enable backward steps (SFFS/SBFS)
    scoring='accuracy',
    cv=5,
    n_jobs=-1
)

sfs = sfs.fit(X, y)

print(f"Selected feature indices: {sfs.k_feature_idx_}")
print(f"Selected feature names: {[data.feature_names[i] for i in sfs.k_feature_idx_]}")
print(f"Best CV score: {sfs.k_score_:.4f}")

# Examine selection history
import pandas as pd
results_df = pd.DataFrame.from_dict(sfs.get_metric_dict()).T
print("\nSelection history:")
print(results_df[['feature_idx', 'avg_score']].head(15))
```

Floating selection methods add flexibility to the stepwise process:
Sequential Floating Forward Selection (SFFS): after each forward addition, conditionally remove previously selected features, one at a time, as long as each removal improves the score.
Sequential Floating Backward Selection (SBFS): after each backward removal, conditionally re-add previously removed features, one at a time, as long as each re-addition improves the score.
The "floating" refers to how the number of selected features can fluctuate during the search, rather than monotonically increasing (forward) or decreasing (backward).
While floating methods often find better solutions than strict sequential approaches, they require more evaluations. In the worst case, SFFS might perform a full backward search after each forward step. The typical cost lies between pure forward selection and exhaustive search, depending on how often floating steps occur.
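To make the conditional-exclusion step concrete, the sketch below implements a minimal SFFS loop from scratch. It is a simplified illustration, not the mlxtend implementation: the `subset_score` helper, the fixed `k_target` stopping rule, and the `best_by_size` loop guard are all assumptions made for brevity.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def subset_score(X, y, estimator, features, cv=5):
    """Mean CV score of the estimator restricted to the given feature subset."""
    return cross_val_score(clone(estimator), X[:, features], y, cv=cv).mean()

def sffs(X, y, estimator, k_target, cv=5):
    """Minimal SFFS sketch: forward additions followed by conditional exclusions."""
    selected, remaining = [], list(range(X.shape[1]))
    best_by_size = {}  # best score seen for each subset size (guards against cycling)

    while len(selected) < k_target:
        # Forward step: add the feature that maximizes the CV score.
        best_f = max(remaining,
                     key=lambda f: subset_score(X, y, estimator, selected + [f], cv))
        selected.append(best_f)
        remaining.remove(best_f)
        current = subset_score(X, y, estimator, selected, cv)
        best_by_size[len(selected)] = max(best_by_size.get(len(selected), -np.inf), current)

        # Conditional exclusion: drop a feature (never the one just added) while the
        # smaller subset beats the best score ever recorded at that size.
        while len(selected) > 2:
            candidates = [f for f in selected if f != best_f]
            f_drop = max(candidates,
                         key=lambda f: subset_score(X, y, estimator,
                                                    [g for g in selected if g != f], cv))
            trial = [g for g in selected if g != f_drop]
            trial_score = subset_score(X, y, estimator, trial, cv)
            if trial_score > best_by_size.get(len(trial), -np.inf):
                selected, current = trial, trial_score
                remaining.append(f_drop)
                best_by_size[len(trial)] = trial_score
            else:
                break
    return selected

# Example usage (small k_target keeps the runtime modest)
data = load_breast_cancer()
estimator = LogisticRegression(max_iter=1000, random_state=42)
print(sffs(data.data, data.target, estimator, k_target=5))
```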
When computational resources permit, exhaustive search evaluates all possible feature subsets. This guarantees finding the globally optimal subset for the given model and evaluation metric.
Given the $2^p$ complexity, exhaustive search is only practical for small feature sets (roughly $p \leq 25$), fast-to-train models, and problems where a provably optimal subset justifies the cost.
Even with parallelization, exhaustive search hits limits quickly. For 25 features: ~33 million subsets. At 100ms per evaluation: ~38 days on a single core, ~1 day with 40 cores.
```python
from mlxtend.feature_selection import ExhaustiveFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
import time

# Load a small dataset
data = load_iris()
X, y = data.data, data.target

# Exhaustive search is only practical for very few features
# Iris has only 4 features: 2^4 = 16 subsets - trivially fast
estimator = LogisticRegression(max_iter=1000, random_state=42)

efs = ExhaustiveFeatureSelector(
    estimator,
    min_features=1,
    max_features=4,
    scoring='accuracy',
    cv=5,
    n_jobs=-1
)

start = time.time()
efs = efs.fit(X, y)
elapsed = time.time() - start

print(f"Evaluated {2**4 - 1} subsets in {elapsed:.2f} seconds")
print(f"Best subset: {efs.best_idx_}")
print(f"Best features: {[data.feature_names[i] for i in efs.best_idx_]}")
print(f"Best score: {efs.best_score_:.4f}")
```

When the search space is too large for exhaustive search but greedy methods feel limiting, genetic algorithms (GA) offer a middle ground. GAs treat feature subsets as "individuals" in a population that evolves over generations.
Key components: a binary chromosome that encodes each subset as a mask over the features, a fitness function based on cross-validated model performance, selection of fitter individuals as parents (tournament selection in the example below), crossover that recombines two parent subsets, and mutation that randomly flips individual feature bits.
Advantages: the population explores many regions of the search space simultaneously, crossover can combine good partial solutions, and mutation helps escape the local optima that trap greedy methods.
Disadvantages: many hyperparameters to tune (population size, number of generations, crossover and mutation rates), no optimality guarantees, and a high computational cost, since every individual in every generation requires a cross-validated model evaluation.
```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from typing import Tuple


def genetic_feature_selection(
    X: np.ndarray,
    y: np.ndarray,
    estimator,
    population_size: int = 50,
    generations: int = 100,
    mutation_rate: float = 0.1,
    crossover_rate: float = 0.8,
    cv: int = 5
) -> Tuple[np.ndarray, float]:
    """
    Genetic algorithm for feature selection.

    Returns:
    --------
    best_mask : Boolean array indicating selected features
    best_score : Fitness of best individual
    """
    n_features = X.shape[1]

    # Initialize population randomly
    population = np.random.rand(population_size, n_features) > 0.5

    # Ensure at least one feature per individual
    for i in range(population_size):
        if not population[i].any():
            population[i, np.random.randint(n_features)] = True

    def evaluate_fitness(mask):
        if not mask.any():
            return 0.0
        X_subset = X[:, mask]
        scores = cross_val_score(estimator, X_subset, y, cv=cv)
        return np.mean(scores)

    best_mask = None
    best_score = -np.inf

    for gen in range(generations):
        # Evaluate fitness
        fitness = np.array([evaluate_fitness(ind) for ind in population])

        # Track best
        gen_best_idx = np.argmax(fitness)
        if fitness[gen_best_idx] > best_score:
            best_score = fitness[gen_best_idx]
            best_mask = population[gen_best_idx].copy()

        if gen % 20 == 0:
            print(f"Gen {gen}: best={best_score:.4f}, features={best_mask.sum()}")

        # Selection (tournament)
        new_population = []
        for _ in range(population_size):
            # Tournament selection
            candidates = np.random.choice(population_size, 3, replace=False)
            winner = candidates[np.argmax(fitness[candidates])]
            new_population.append(population[winner].copy())

        # Crossover
        for i in range(0, population_size - 1, 2):
            if np.random.rand() < crossover_rate:
                point = np.random.randint(1, n_features)
                new_population[i][point:], new_population[i+1][point:] = \
                    new_population[i+1][point:].copy(), new_population[i][point:].copy()

        # Mutation
        for i in range(population_size):
            for j in range(n_features):
                if np.random.rand() < mutation_rate:
                    new_population[i][j] = not new_population[i][j]
            # Ensure at least one feature
            if not new_population[i].any():
                new_population[i][np.random.randint(n_features)] = True

        population = np.array(new_population)

    return best_mask, best_score


# Usage example
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target

estimator = RandomForestClassifier(n_estimators=30, random_state=42)
best_mask, best_score = genetic_feature_selection(
    X, y, estimator, population_size=30, generations=50
)

selected_features = [data.feature_names[i] for i, m in enumerate(best_mask) if m]
print(f"\nBest features ({len(selected_features)}): {selected_features}")
print(f"Best CV score: {best_score:.4f}")
```

Recursive Feature Elimination (RFE) is a popular wrapper method that leverages model-derived feature importance to guide elimination. Unlike pure backward elimination that evaluates all features equally at each step, RFE asks the model itself which features are least important.
The key insight: rather than retraining $O(p)$ models at each step to find which feature to remove, RFE trains one model and uses its internal ranking. This dramatically reduces computational cost.
RFE requires models that expose feature importance scores: linear models and linear SVMs (coefficient magnitudes via `coef_`) or tree-based models and ensembles (`feature_importances_`). Estimators that provide neither, such as k-nearest neighbors, cannot serve as the underlying model without a custom `importance_getter`.
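As a quick illustration of this requirement (an addition, not part of the original example), the snippet below runs scikit-learn's `RFE` with a logistic regression, which exposes `coef_`, and then with a k-nearest-neighbors classifier, which exposes no importance attribute and is therefore expected to raise an error:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Works: LogisticRegression exposes coef_, which RFE uses to rank features.
rfe_ok = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10).fit(X, y)
print("Linear model OK, selected:", rfe_ok.support_.sum(), "features")

# Fails: KNeighborsClassifier has neither coef_ nor feature_importances_.
try:
    RFE(KNeighborsClassifier(), n_features_to_select=10).fit(X, y)
except Exception as exc:  # scikit-learn raises because no importance attribute is found
    print("KNN failed as expected:", type(exc).__name__)
```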
```python
from sklearn.feature_selection import RFE, RFECV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
import numpy as np

# Load data
data = load_breast_cancer()
X, y = data.data, data.target

# RFE with specified number of features
estimator = RandomForestClassifier(n_estimators=100, random_state=42)
rfe = RFE(estimator, n_features_to_select=10, step=1)
rfe = rfe.fit(X, y)

print("RFE Results:")
print(f"Selected features: {[data.feature_names[i] for i, s in enumerate(rfe.support_) if s]}")
print(f"Feature rankings: {rfe.ranking_}")

# RFECV: RFE with cross-validation to find optimal number
rfecv = RFECV(
    estimator=LogisticRegression(max_iter=1000, random_state=42),
    step=1,
    cv=5,
    scoring='accuracy',
    min_features_to_select=1,
    n_jobs=-1
)
rfecv = rfecv.fit(X, y)

print(f"\nRFECV Results:")
print(f"Optimal number of features: {rfecv.n_features_}")
print(f"Selected features: {[data.feature_names[i] for i, s in enumerate(rfecv.support_) if s]}")

# Plot number of features vs. cross-validation scores
plt.figure(figsize=(10, 5))
n_features_range = range(1, len(rfecv.cv_results_['mean_test_score']) + 1)
plt.plot(n_features_range, rfecv.cv_results_['mean_test_score'], 'b-o')
plt.fill_between(
    n_features_range,
    rfecv.cv_results_['mean_test_score'] - rfecv.cv_results_['std_test_score'],
    rfecv.cv_results_['mean_test_score'] + rfecv.cv_results_['std_test_score'],
    alpha=0.2
)
plt.axvline(rfecv.n_features_, color='r', linestyle='--', label=f'Optimal: {rfecv.n_features_}')
plt.xlabel('Number of Features')
plt.ylabel('CV Accuracy')
plt.title('RFE Cross-Validation Performance')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
```

RFE requires only one model fit per elimination round, versus O(p) model fits per step for backward elimination. For selecting k features from p with step=1, RFE needs O(p − k) model fits, compared with roughly (p² − k²)/2 for backward elimination. This makes RFE dramatically faster, though it relies on the model's importance estimates being accurate.
Choosing among wrapper methods requires understanding their tradeoffs and proper evaluation protocols.
A critical concern with wrapper methods is selection bias. If you use the same data both to select the feature subset and to estimate the performance of the resulting model, your performance estimate will be optimistically biased: the selected features were chosen precisely because they happened to work well on this particular data.
Solution: Nested cross-validation
For each outer fold: run the complete feature-selection procedure (including its own inner cross-validation) on the training portion only, then evaluate the selected subset and the fitted model on the held-out outer fold.
This gives an unbiased estimate of the full pipeline's performance.
```python
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_breast_cancer
import numpy as np

# Load data
data = load_breast_cancer()
X, y = data.data, data.target

# Create pipeline: RFE + classifier
estimator = LogisticRegression(max_iter=1000, random_state=42)

# IMPORTANT: RFECV performs its own inner CV for feature selection
# Wrapping in a pipeline ensures outer CV uses fresh feature selection
pipeline = Pipeline([
    ('feature_selection', RFECV(estimator, step=1, cv=3, scoring='accuracy')),
    ('classifier', LogisticRegression(max_iter=1000, random_state=42))
])

# Outer cross-validation for unbiased performance estimate
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
nested_scores = cross_val_score(
    pipeline, X, y, cv=outer_cv, scoring='accuracy', n_jobs=-1
)

print(f"Nested CV scores: {nested_scores}")
print(f"Mean accuracy: {nested_scores.mean():.4f} (+/- {nested_scores.std()*2:.4f})")

# Compare with naive (biased) estimate
rfecv = RFECV(estimator, step=1, cv=5, scoring='accuracy')
rfecv.fit(X, y)
naive_score = rfecv.cv_results_['mean_test_score'][rfecv.n_features_ - 1]
print(f"\nNaive CV score (biased): {naive_score:.4f}")
print(f"Bias: {naive_score - nested_scores.mean():.4f}")
```

| Method | Complexity | Strengths | Weaknesses |
|---|---|---|---|
| Forward Selection | O(kp) | Fast, works well with few relevant features | Cannot correct early mistakes, misses interactions |
| Backward Elimination | O((p-k)p) | Captures interactions, handles correlations | Slow for large p, cannot recover removed features |
| Stepwise (Floating) | Between forward and exhaustive | Corrects mistakes, balances exploration | More complex, more evaluations needed |
| RFE | O(p-k) × model fit | Fast, leverages model insights | Depends on model importance accuracy |
| Exhaustive | O(2^p) | Guarantees optimal solution | Only feasible for small p |
| Genetic Algorithm | O(generations × population × CV) | Explores broadly, escapes local optima | Many hyperparameters, no guarantees |
Wrapper methods optimize feature selection for specific models but require many model evaluations. In the next section, we'll explore Embedded Methods—approaches where feature selection is built into the model training process itself, offering efficiency gains while maintaining algorithm-specific optimization.