Modern datasets arrive with hundreds, thousands, even millions of features. Gene expression data has 20,000+ features. Text documents become 100,000+ dimensional vectors. Click stream data generates countless behavioral signals. More features seem like more information—but this abundance is often a curse.
The feature selection imperative: Irrelevant features add noise, increasing variance without reducing bias. Redundant features waste computation and can destabilize optimization. High dimensionality triggers the curse of dimensionality, where distances become meaningless and data becomes sparse. And interpretability suffers—a model with 1,000 features cannot be explained to stakeholders.
AutoML systems must automatically identify which features genuinely contribute to prediction and which should be discarded. This page explores how modern AutoML achieves feature selection at scale—from simple statistical filters to sophisticated embedded methods.
By the end of this page, you will understand: the three families of feature selection (filter, wrapper, embedded), specific algorithms within each family and their tradeoffs, how AutoML systems automate feature selection, the relationship between regularization and feature selection, and strategies for extremely high-dimensional data.
Feature selection is not merely dimensionality reduction—it's about finding the minimal set of features that maximizes predictive power while satisfying practical constraints.
The Benefits of Fewer Features: lower variance and less overfitting, faster training and inference, reduced memory and compute cost, and models simple enough to explain to stakeholders.

The Three Families of Feature Selection:
| Family | Approach | Computational Cost | Considers Model |
|---|---|---|---|
| Filter Methods | Statistical measures of feature relevance | Low (O(n×d)) | No |
| Wrapper Methods | Search over feature subsets using model performance | High (O(2^d) worst case) | Yes |
| Embedded Methods | Feature selection during model training | Medium (model-dependent) | Yes |
Feature selection identifies predictive features, not causal ones. A feature that correlates with the target in training data may be spurious. Feature selection can actually increase the risk of learning spurious correlations if applied naively. Always validate selected features on held-out data.
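The safe pattern is to place selection inside the modeling pipeline so it is re-fit on each training fold. A minimal sketch, using synthetic data and illustrative parameters:

```python
# A minimal sketch: validating selected features by doing selection inside a
# Pipeline, so each CV fold selects from its own training split only.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=500, n_features=200, n_informative=10,
                           random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),    # selection happens inside CV
    ("clf", LogisticRegression(max_iter=1000)),
])

# The CV score now reflects how well the selection generalizes,
# not how well it memorized spurious correlations in the full dataset.
scores = cross_val_score(pipe, X, y, cv=5)
print(f"CV accuracy with in-fold selection: {scores.mean():.3f}")
```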
Filter methods evaluate features independently of any specific model. They compute a score for each feature based on its statistical relationship with the target, then select the top-k or those above a threshold. Fast and model-agnostic, but potentially suboptimal since they ignore feature interactions.
Common Filter Scores:
For Classification:
- Chi-squared (chi2): measures dependence between a non-negative feature and the class label.
- ANOVA F-test (f_classif): compares between-class to within-class variance; handles any real-valued feature.
- Mutual information (mutual_info_classif): captures non-linear dependence, at higher computational cost.

For Regression:
- F-statistic (f_regression): tests the strength of the linear relationship between feature and target.
- Mutual information (mutual_info_regression): captures non-linear relationships.
```python
from sklearn.feature_selection import (
    SelectKBest, SelectPercentile, chi2, f_classif,
    mutual_info_classif, f_regression, mutual_info_regression
)
import numpy as np
import pandas as pd


class AutoMLFilterSelector:
    """
    Automated filter-based feature selection.

    Selects appropriate statistical test based on:
    - Target type (classification vs regression)
    - Feature characteristics (non-negative, distribution)
    """

    def __init__(
        self,
        task: str = 'classification',
        method: str = 'auto',
        k: int = None,
        percentile: float = None,
        threshold: float = None
    ):
        self.task = task
        self.method = method
        self.k = k
        self.percentile = percentile
        self.threshold = threshold
        self.selector_ = None
        self.scores_ = None
        self.selected_features_ = None

    def _select_score_func(self, X):
        """Select appropriate scoring function."""
        if self.method != 'auto':
            return self._get_score_func_by_name(self.method)

        # Auto-select based on task and data properties
        has_negative = (X < 0).any().any()

        if self.task == 'classification':
            if has_negative:
                return f_classif  # ANOVA F, handles any values
            else:
                return chi2  # Chi-squared, fast for non-negative
        else:
            return f_regression  # F-statistic for regression

    def _get_score_func_by_name(self, name):
        funcs = {
            'chi2': chi2,
            'f_classif': f_classif,
            'mutual_info_classif': mutual_info_classif,
            'f_regression': f_regression,
            'mutual_info_regression': mutual_info_regression,
        }
        return funcs.get(name, f_classif)

    def fit(self, X, y):
        """Fit filter selector and compute feature scores."""
        score_func = self._select_score_func(X)

        # Determine selection criterion
        if self.k is not None:
            self.selector_ = SelectKBest(score_func, k=self.k)
        elif self.percentile is not None:
            self.selector_ = SelectPercentile(score_func, percentile=self.percentile)
        else:
            # Default: top 50%
            self.selector_ = SelectPercentile(score_func, percentile=50)

        self.selector_.fit(X, y)
        self.scores_ = self.selector_.scores_
        self.selected_features_ = self.selector_.get_support()
        return self

    def transform(self, X):
        """Select features based on fitted filter."""
        return self.selector_.transform(X)

    def get_feature_ranking(self, feature_names=None):
        """Return features ranked by score."""
        if feature_names is None:
            feature_names = [f"feature_{i}" for i in range(len(self.scores_))]

        ranking = pd.DataFrame({
            'feature': feature_names,
            'score': self.scores_,
            'selected': self.selected_features_
        }).sort_values('score', ascending=False)
        return ranking
```

Filter methods are fast enough to apply as preprocessing before more expensive model-based selection. A common AutoML pattern: use variance threshold and correlation filtering first to reduce from 10,000 features to 1,000, then apply wrapper or embedded methods on the reduced set.
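A brief usage sketch of the class above, on synthetic data with illustrative parameter choices:

```python
# Hypothetical usage of AutoMLFilterSelector defined above; data is synthetic.
import pandas as pd
from sklearn.datasets import make_classification

X_arr, y = make_classification(n_samples=1000, n_features=50, n_informative=8,
                               random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(X_arr.shape[1])])

selector = AutoMLFilterSelector(task='classification', method='auto', k=10)
selector.fit(X, y)

X_reduced = selector.transform(X)   # ndarray with the 10 best-scoring columns
print(X_reduced.shape)              # (1000, 10)
print(selector.get_feature_ranking(list(X.columns)).head())
```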
Wrapper methods treat feature selection as a search problem: find the subset of features that maximizes model performance. They use the actual model (or a proxy) to evaluate candidate feature sets.
The Search Space Challenge: With d features, there are 2^d possible subsets. For d=100, that's 10^30 subsets—exhaustive search is impossible. Wrapper methods use heuristics to search efficiently.
```python
from sklearn.feature_selection import RFE, RFECV, SequentialFeatureSelector
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import numpy as np


class AutoMLWrapperSelector:
    """
    Automated wrapper-based feature selection.

    Selects search strategy based on dataset size and time budget.
    """

    def __init__(
        self,
        n_features_to_select: int = None,
        time_budget_seconds: float = 300,
        cv: int = 5
    ):
        self.n_features_to_select = n_features_to_select
        self.time_budget = time_budget_seconds
        self.cv = cv
        self.selector_ = None
        self.selected_features_ = None

    def _select_strategy(self, X, y):
        """
        Select search strategy based on dataset characteristics.
        """
        n_samples, n_features = X.shape

        # Estimate time per model fit
        est_fit_time = n_samples * n_features * 1e-6  # rough heuristic
        est_cv_time = est_fit_time * self.cv

        # RFE iterations
        n_rfe_iters = n_features - (self.n_features_to_select or n_features // 2)
        est_rfe_time = est_cv_time * n_rfe_iters

        # SFS iterations (forward)
        n_sfs_iters = self.n_features_to_select or n_features // 2
        est_sfs_time = est_cv_time * n_sfs_iters * n_features / 2  # avg remaining

        if est_rfe_time < self.time_budget:
            return 'rfe_cv'        # RFECV if time allows
        elif est_rfe_time * 0.3 < self.time_budget:
            return 'rfe'           # RFE without CV
        else:
            return 'sfs_forward'   # Forward selection (usually faster)

    def fit(self, X, y, estimator=None):
        """
        Fit wrapper selector with automatic strategy selection.
        """
        if estimator is None:
            estimator = RandomForestClassifier(n_estimators=50, random_state=42)

        strategy = self._select_strategy(X, y)
        n_to_select = self.n_features_to_select or X.shape[1] // 2

        if strategy == 'rfe_cv':
            # RFECV: automatically selects optimal number of features
            self.selector_ = RFECV(
                estimator=estimator,
                step=1,
                cv=self.cv,
                scoring='accuracy',
                min_features_to_select=1
            )
        elif strategy == 'rfe':
            # RFE: fixed number of features
            self.selector_ = RFE(
                estimator=estimator,
                n_features_to_select=n_to_select,
                step=1
            )
        else:
            # Sequential forward selection
            self.selector_ = SequentialFeatureSelector(
                estimator=estimator,
                n_features_to_select=n_to_select,
                direction='forward',
                cv=self.cv,
                scoring='accuracy'
            )

        self.selector_.fit(X, y)
        self.selected_features_ = self.selector_.get_support()
        return self

    def transform(self, X):
        return self.selector_.transform(X)

    def get_feature_ranking(self, feature_names=None):
        """Return feature ranking from RFE."""
        if hasattr(self.selector_, 'ranking_'):
            if feature_names is None:
                feature_names = [f"feature_{i}" for i in range(len(self.selector_.ranking_))]
            import pandas as pd
            return pd.DataFrame({
                'feature': feature_names,
                'ranking': self.selector_.ranking_,
                'selected': self.selected_features_
            }).sort_values('ranking')
        return None
```

Wrapper methods require training models many times—potentially n×d times for n-fold CV with d features. For large datasets or complex models, this can take hours or days. AutoML systems must estimate runtime and fall back to faster methods when necessary.
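A brief usage sketch of the wrapper selector on synthetic data; the time budget and feature counts are illustrative:

```python
# Hypothetical usage of AutoMLWrapperSelector defined above; data is synthetic.
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=30, n_informative=6,
                           random_state=0)

wrapper = AutoMLWrapperSelector(n_features_to_select=10,
                                time_budget_seconds=120, cv=3)
wrapper.fit(X, y)                     # picks RFECV, RFE, or forward SFS

X_reduced = wrapper.transform(X)
print(X_reduced.shape)                # (300, n_selected)
print(wrapper.get_feature_ranking())  # RFE ranking DataFrame, or None for SFS
```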
Embedded methods perform feature selection as part of model training. The model itself learns which features are important through regularization or inherent feature importance mechanisms. More efficient than wrappers, but tied to specific model types.
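Tree-based models provide one embedded route: their built-in importances can drive selection directly. A minimal sketch (synthetic data, illustrative threshold) using scikit-learn's SelectFromModel, before turning to the L1 route below:

```python
# A minimal sketch of embedded selection via tree feature importances.
# Threshold and model settings are illustrative, not prescriptive.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=500, n_features=40, n_informative=8,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
selector = SelectFromModel(rf, threshold="mean")  # keep importance > mean
X_selected = selector.fit_transform(X, y)         # fits rf, then filters columns

print(X_selected.shape)                           # (500, n_kept)
```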
L1 (Lasso) Regularization
L1 penalty encourages sparsity—many coefficients become exactly zero. Features with zero coefficients are effectively selected out.
Why L1 produces zeros:
The L1 penalty λ|β| has a non-differentiable corner at β = 0. When a feature's contribution to the loss is weaker than the penalty, the optimum sits exactly at that corner, so the coefficient becomes exactly zero (the soft-thresholding behavior of Lasso solvers). The L2 penalty λβ² is smooth, so it shrinks coefficients toward zero but never makes them exactly zero.
Selection via regularization strength: The strength λ (alpha in scikit-learn) controls how aggressively coefficients are driven to zero. Small λ keeps most features; large λ zeroes out all but the strongest. Cross-validating over a path of λ values, as LassoCV does, picks the strength that balances sparsity against predictive accuracy. The sketch below illustrates the effect.
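A minimal sketch (synthetic regression data, illustrative alpha grid) showing the nonzero-coefficient count shrinking as regularization grows:

```python
# How the number of nonzero Lasso coefficients shrinks as alpha increases.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=5.0, random_state=0)

for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    lasso = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    n_nonzero = np.sum(lasso.coef_ != 0)
    print(f"alpha={alpha:>6}: {n_nonzero} nonzero coefficients")
```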
```python
from sklearn.linear_model import LassoCV, LogisticRegressionCV
from sklearn.feature_selection import SelectFromModel
import numpy as np
import pandas as pd


class L1FeatureSelector:
    """
    Feature selection using L1 regularization.

    Trains an L1-penalized model and selects features
    with nonzero coefficients.
    """

    def __init__(
        self,
        task: str = 'classification',
        cv: int = 5,
        threshold: str = 'mean'  # 'mean', 'median', or numeric
    ):
        self.task = task
        self.cv = cv
        self.threshold = threshold
        self.model_ = None
        self.selector_ = None

    def fit(self, X, y):
        """Fit L1 model and identify selected features."""
        if self.task == 'classification':
            self.model_ = LogisticRegressionCV(
                penalty='l1',
                solver='saga',
                cv=self.cv,
                max_iter=1000
            )
        else:
            self.model_ = LassoCV(cv=self.cv)

        self.model_.fit(X, y)

        # Select features whose |coefficient| exceeds the threshold
        self.selector_ = SelectFromModel(
            self.model_,
            threshold=self.threshold,
            prefit=True
        )
        return self

    def transform(self, X):
        return self.selector_.transform(X)

    def get_nonzero_features(self, feature_names=None):
        """Return features with nonzero coefficients."""
        coef = self.model_.coef_.ravel()
        nonzero_idx = np.where(coef != 0)[0]

        if feature_names is None:
            feature_names = [f"feature_{i}" for i in range(len(coef))]

        return pd.DataFrame({
            'feature': [feature_names[i] for i in nonzero_idx],
            'coefficient': coef[nonzero_idx],
            'abs_coefficient': np.abs(coef[nonzero_idx])
        }).sort_values('abs_coefficient', ascending=False)
```

Elastic Net combines L1 and L2 regularization: λ₁|β| + λ₂β². It produces sparse solutions like Lasso while handling correlated features better. AutoML systems often prefer Elastic Net over pure Lasso for this robustness.
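A minimal ElasticNetCV sketch on synthetic data; the l1_ratio grid is illustrative and lets cross-validation choose how Lasso-like the penalty should be:

```python
# ElasticNetCV searches both the regularization strength and the L1/L2 mix.
# Data and grid values here are illustrative.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=5.0, random_state=0)

enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5, max_iter=10000)
enet.fit(X, y)

print(f"chosen l1_ratio={enet.l1_ratio_}, alpha={enet.alpha_:.4f}")
print(f"nonzero coefficients: {np.sum(enet.coef_ != 0)} of {X.shape[1]}")
```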
AutoML systems must combine multiple feature selection approaches into a cohesive strategy that balances quality and computational cost. The optimal approach depends on dataset characteristics, time budget, and downstream model.
```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import (
    VarianceThreshold, SelectKBest, f_classif,
    mutual_info_classif, RFE, SelectFromModel
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from typing import List, Tuple


class AutoMLFeatureSelector:
    """
    Multi-stage feature selection pipeline for AutoML.

    Combines filter, embedded, and optionally wrapper methods
    in a computationally efficient pipeline.
    """

    def __init__(
        self,
        target_n_features: int = None,
        target_fraction: float = 0.5,
        time_budget_seconds: float = 60.0,
        use_wrapper: bool = True
    ):
        self.target_n_features = target_n_features
        self.target_fraction = target_fraction
        self.time_budget = time_budget_seconds
        self.use_wrapper = use_wrapper

        # Pipeline stages
        self.variance_selector_ = None
        self.correlation_dropped_ = None
        self.filter_selector_ = None
        self.embedded_selector_ = None
        self.final_features_ = None

    def fit(self, X: pd.DataFrame, y: pd.Series) -> 'AutoMLFeatureSelector':
        """
        Multi-stage feature selection:
        1. Remove constant/near-constant features
        2. Remove highly correlated redundant features
        3. Univariate filter (fast)
        4. Embedded selection (tree importance or L1)
        5. Optional wrapper refinement if time allows
        """
        current_features = list(X.columns)
        X_current = X.copy()

        # Stage 1: Variance Threshold
        print(f"Stage 1: Variance threshold ({len(current_features)} features)")
        X_current, current_features = self._variance_filter(X_current)
        print(f"  → {len(current_features)} features remain")

        # Stage 2: Correlation Filter
        print(f"Stage 2: Correlation filter")
        X_current, current_features = self._correlation_filter(X_current, y)
        print(f"  → {len(current_features)} features remain")

        # Stage 3: Univariate Filter
        target_after_filter = max(
            self.target_n_features or int(len(current_features) * 0.7),
            10
        )
        if len(current_features) > target_after_filter:
            print(f"Stage 3: Univariate filter (target: {target_after_filter})")
            X_current, current_features = self._univariate_filter(
                X_current, y, target_after_filter
            )
            print(f"  → {len(current_features)} features remain")

        # Stage 4: Embedded Selection (tree importance)
        final_target = self.target_n_features or int(
            len(X.columns) * self.target_fraction
        )
        final_target = max(final_target, 5)

        if len(current_features) > final_target:
            print(f"Stage 4: Tree importance (target: {final_target})")
            X_current, current_features = self._tree_importance_filter(
                X_current, y, final_target
            )
            print(f"  → {len(current_features)} features remain")

        self.final_features_ = current_features
        return self

    def _variance_filter(
        self, X: pd.DataFrame, threshold: float = 0.01
    ) -> Tuple[pd.DataFrame, List[str]]:
        """Remove low-variance features."""
        self.variance_selector_ = VarianceThreshold(threshold=threshold)
        X_filtered = self.variance_selector_.fit_transform(X)
        selected_mask = self.variance_selector_.get_support()
        selected_features = [c for c, m in zip(X.columns, selected_mask) if m]
        return pd.DataFrame(X_filtered, columns=selected_features), selected_features

    def _correlation_filter(
        self, X: pd.DataFrame, y: pd.Series, threshold: float = 0.95
    ) -> Tuple[pd.DataFrame, List[str]]:
        """Remove highly correlated features."""
        corr_matrix = X.corr().abs()
        target_corr = X.corrwith(y).abs()

        upper = corr_matrix.where(
            np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
        )

        to_drop = set()
        for col in upper.columns:
            for idx in upper.index:
                if upper.loc[idx, col] > threshold:
                    # Drop the one with lower target correlation
                    if target_corr.get(col, 0) > target_corr.get(idx, 0):
                        to_drop.add(idx)
                    else:
                        to_drop.add(col)

        self.correlation_dropped_ = list(to_drop)
        selected = [c for c in X.columns if c not in to_drop]
        return X[selected], selected

    def _univariate_filter(
        self, X: pd.DataFrame, y: pd.Series, k: int
    ) -> Tuple[pd.DataFrame, List[str]]:
        """Apply univariate statistical filter."""
        self.filter_selector_ = SelectKBest(f_classif, k=k)
        X_filtered = self.filter_selector_.fit_transform(X, y)
        selected_mask = self.filter_selector_.get_support()
        selected_features = [c for c, m in zip(X.columns, selected_mask) if m]
        return pd.DataFrame(X_filtered, columns=selected_features), selected_features

    def _tree_importance_filter(
        self, X: pd.DataFrame, y: pd.Series, k: int
    ) -> Tuple[pd.DataFrame, List[str]]:
        """Select top features by tree importance."""
        rf = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1)
        rf.fit(X, y)

        importances = rf.feature_importances_
        top_k_idx = np.argsort(importances)[::-1][:k]
        selected_features = [X.columns[i] for i in top_k_idx]
        return X[selected_features], selected_features

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        """Select the final features."""
        return X[self.final_features_]

    def get_selection_summary(self) -> dict:
        """Return summary of feature selection process."""
        return {
            'final_n_features': len(self.final_features_),
            'selected_features': self.final_features_,
            'correlation_dropped': self.correlation_dropped_
        }
```

Production AutoML systems use staged selection: fast filter methods first (O(n×d)), then embedded methods (O(n×d×log(n))), and expensive wrapper methods only if time allows. This gives good results quickly while allowing refinement given more time.
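A short end-to-end usage sketch of the staged pipeline above, on synthetic data with illustrative targets:

```python
# Hypothetical usage of AutoMLFeatureSelector defined above; data is synthetic.
import pandas as pd
from sklearn.datasets import make_classification

X_arr, y_arr = make_classification(n_samples=1000, n_features=100,
                                   n_informative=15, n_redundant=20,
                                   random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(X_arr.shape[1])])
y = pd.Series(y_arr)

pipeline = AutoMLFeatureSelector(target_n_features=20)
pipeline.fit(X, y)                     # runs the stages, printing progress

X_final = pipeline.transform(X)        # DataFrame with the surviving columns
print(pipeline.get_selection_summary()['final_n_features'])
```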
Feature selection is essential for building efficient, interpretable, and well-generalizing models. AutoML systems combine multiple approaches in staged pipelines to balance quality and computational cost.
You now understand how AutoML systems approach feature selection—from simple statistical filters through sophisticated embedded methods to multi-stage automated pipelines. Next, we'll explore automated data cleaning, where AutoML must detect and handle data quality issues.