Modern datasets arrive with hundreds, thousands, even millions of features. Gene expression data has 20,000+ features. Text documents become 100,000+ dimensional vectors. Click stream data generates countless behavioral signals. More features seem like more information—but this abundance is often a curse.
The feature selection imperative: Irrelevant features add noise, increasing variance without reducing bias. Redundant features waste computation and can destabilize optimization. High dimensionality triggers the curse of dimensionality, where distances become meaningless and data becomes sparse. And interpretability suffers—a model with 1,000 features cannot be explained to stakeholders.
AutoML systems must automatically identify which features genuinely contribute to prediction and which should be discarded. This page explores how modern AutoML achieves feature selection at scale—from simple statistical filters to sophisticated embedded methods.
By the end of this page, you will understand: the three families of feature selection (filter, wrapper, embedded), specific algorithms within each family and their tradeoffs, how AutoML systems automate feature selection, the relationship between regularization and feature selection, and strategies for extremely high-dimensional data.
Feature selection is not merely dimensionality reduction—it's about finding the minimal set of features that maximizes predictive power while satisfying practical constraints.
The Benefits of Fewer Features: lower variance and less overfitting, faster training and inference, reduced memory and compute cost, and models simple enough to explain to stakeholders.

The Three Families of Feature Selection:
| Family | Approach | Computational Cost | Considers Model |
|---|---|---|---|
| Filter Methods | Statistical measures of feature relevance | Low (O(n×d)) | No |
| Wrapper Methods | Search over feature subsets using model performance | High (O(2^d) worst case) | Yes |
| Embedded Methods | Feature selection during model training | Medium (model-dependent) | Yes |
Feature selection identifies predictive features, not causal ones. A feature that correlates with the target in training data may be spurious. Feature selection can actually increase the risk of learning spurious correlations if applied naively. Always validate selected features on held-out data.
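The safe pattern is to place selection inside the modeling pipeline so it is re-fit on each training fold. A minimal sketch, using synthetic data and illustrative parameters:

```python
# A minimal sketch: validating selected features by doing selection inside a
# Pipeline, so each CV fold selects from its own training split only.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=500, n_features=200, n_informative=10,
                           random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),    # selection happens inside CV
    ("clf", LogisticRegression(max_iter=1000)),
])

# The CV score now reflects how well the selection generalizes,
# not how well it memorized spurious correlations in the full dataset.
scores = cross_val_score(pipe, X, y, cv=5)
print(f"CV accuracy with in-fold selection: {scores.mean():.3f}")
```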
Filter methods evaluate features independently of any specific model. They compute a score for each feature based on its statistical relationship with the target, then select the top-k or those above a threshold. Fast and model-agnostic, but potentially suboptimal since they ignore feature interactions.
Common Filter Scores:
For Classification:
- Chi-squared (chi2): measures dependence between a non-negative feature and the class label.
- ANOVA F-test (f_classif): compares between-class to within-class variance; handles any real-valued feature.
- Mutual information (mutual_info_classif): captures non-linear dependence, at higher computational cost.

For Regression:
- F-statistic (f_regression): tests the strength of the linear relationship between feature and target.
- Mutual information (mutual_info_regression): captures non-linear relationships.
```python
from sklearn.feature_selection import (
    SelectKBest, SelectPercentile, chi2, f_classif,
    mutual_info_classif, f_regression, mutual_info_regression
)
import numpy as np
import pandas as pd


class AutoMLFilterSelector:
    """
    Automated filter-based feature selection.

    Selects appropriate statistical test based on:
    - Target type (classification vs regression)
    - Feature characteristics (non-negative, distribution)
    """

    def __init__(
        self,
        task: str = 'classification',
        method: str = 'auto',
        k: int = None,
        percentile: float = None,
        threshold: float = None
    ):
        self.task = task
        self.method = method
        self.k = k
        self.percentile = percentile
        self.threshold = threshold
        self.selector_ = None
        self.scores_ = None
        self.selected_features_ = None

    def _select_score_func(self, X):
        """Select appropriate scoring function."""
        if self.method != 'auto':
            return self._get_score_func_by_name(self.method)

        # Auto-select based on task and data properties
        has_negative = (X < 0).any().any()

        if self.task == 'classification':
            if has_negative:
                return f_classif  # ANOVA F, handles any values
            else:
                return chi2  # Chi-squared, fast for non-negative
        else:
            return f_regression  # F-statistic for regression

    def _get_score_func_by_name(self, name):
        funcs = {
            'chi2': chi2,
            'f_classif': f_classif,
            'mutual_info_classif': mutual_info_classif,
            'f_regression': f_regression,
            'mutual_info_regression': mutual_info_regression,
        }
        return funcs.get(name, f_classif)

    def fit(self, X, y):
        """Fit filter selector and compute feature scores."""
        score_func = self._select_score_func(X)

        # Determine selection criterion
        if self.k is not None:
            self.selector_ = SelectKBest(score_func, k=self.k)
        elif self.percentile is not None:
            self.selector_ = SelectPercentile(score_func, percentile=self.percentile)
        else:
            # Default: top 50%
            self.selector_ = SelectPercentile(score_func, percentile=50)

        self.selector_.fit(X, y)
        self.scores_ = self.selector_.scores_
        self.selected_features_ = self.selector_.get_support()
        return self

    def transform(self, X):
        """Select features based on fitted filter."""
        return self.selector_.transform(X)

    def get_feature_ranking(self, feature_names=None):
        """Return features ranked by score."""
        if feature_names is None:
            feature_names = [f"feature_{i}" for i in range(len(self.scores_))]

        ranking = pd.DataFrame({
            'feature': feature_names,
            'score': self.scores_,
            'selected': self.selected_features_
        }).sort_values('score', ascending=False)
        return ranking
```

Filter methods are fast enough to apply as preprocessing before more expensive model-based selection. A common AutoML pattern: use variance threshold and correlation filtering first to reduce from 10,000 features to 1,000, then apply wrapper or embedded methods on the reduced set.
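A brief usage sketch of the class above, on synthetic data with illustrative parameter choices:

```python
# Hypothetical usage of AutoMLFilterSelector defined above; data is synthetic.
import pandas as pd
from sklearn.datasets import make_classification

X_arr, y = make_classification(n_samples=1000, n_features=50, n_informative=8,
                               random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(X_arr.shape[1])])

selector = AutoMLFilterSelector(task='classification', method='auto', k=10)
selector.fit(X, y)

X_reduced = selector.transform(X)   # ndarray with the 10 best-scoring columns
print(X_reduced.shape)              # (1000, 10)
print(selector.get_feature_ranking(list(X.columns)).head())
```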
Wrapper methods treat feature selection as a search problem: find the subset of features that maximizes model performance. They use the actual model (or a proxy) to evaluate candidate feature sets.
The Search Space Challenge: With d features, there are 2^d possible subsets. For d=100, that's 10^30 subsets—exhaustive search is impossible. Wrapper methods use heuristics to search efficiently.
```python
from sklearn.feature_selection import RFE, RFECV, SequentialFeatureSelector
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import numpy as np


class AutoMLWrapperSelector:
    """
    Automated wrapper-based feature selection.

    Selects search strategy based on dataset size and time budget.
    """

    def __init__(
        self,
        n_features_to_select: int = None,
        time_budget_seconds: float = 300,
        cv: int = 5
    ):
        self.n_features_to_select = n_features_to_select
        self.time_budget = time_budget_seconds
        self.cv = cv
        self.selector_ = None
        self.selected_features_ = None

    def _select_strategy(self, X, y):
        """
        Select search strategy based on dataset characteristics.
        """
        n_samples, n_features = X.shape

        # Estimate time per model fit
        est_fit_time = n_samples * n_features * 1e-6  # rough heuristic
        est_cv_time = est_fit_time * self.cv

        # RFE iterations
        n_rfe_iters = n_features - (self.n_features_to_select or n_features // 2)
        est_rfe_time = est_cv_time * n_rfe_iters

        # SFS iterations (forward)
        n_sfs_iters = self.n_features_to_select or n_features // 2
        est_sfs_time = est_cv_time * n_sfs_iters * n_features / 2  # avg remaining

        if est_rfe_time < self.time_budget:
            return 'rfe_cv'        # RFECV if time allows
        elif est_rfe_time * 0.3 < self.time_budget:
            return 'rfe'           # RFE without CV
        else:
            return 'sfs_forward'   # Forward selection (usually faster)

    def fit(self, X, y, estimator=None):
        """
        Fit wrapper selector with automatic strategy selection.
        """
        if estimator is None:
            estimator = RandomForestClassifier(n_estimators=50, random_state=42)

        strategy = self._select_strategy(X, y)
        n_to_select = self.n_features_to_select or X.shape[1] // 2

        if strategy == 'rfe_cv':
            # RFECV: automatically selects optimal number of features
            self.selector_ = RFECV(
                estimator=estimator,
                step=1,
                cv=self.cv,
                scoring='accuracy',
                min_features_to_select=1
            )
        elif strategy == 'rfe':
            # RFE: fixed number of features
            self.selector_ = RFE(
                estimator=estimator,
                n_features_to_select=n_to_select,
                step=1
            )
        else:
            # Sequential forward selection
            self.selector_ = SequentialFeatureSelector(
                estimator=estimator,
                n_features_to_select=n_to_select,
                direction='forward',
                cv=self.cv,
                scoring='accuracy'
            )

        self.selector_.fit(X, y)
        self.selected_features_ = self.selector_.get_support()
        return self

    def transform(self, X):
        return self.selector_.transform(X)

    def get_feature_ranking(self, feature_names=None):
        """Return feature ranking from RFE."""
        if hasattr(self.selector_, 'ranking_'):
            if feature_names is None:
                feature_names = [f"feature_{i}" for i in range(len(self.selector_.ranking_))]
            import pandas as pd
            return pd.DataFrame({
                'feature': feature_names,
                'ranking': self.selector_.ranking_,
                'selected': self.selected_features_
            }).sort_values('ranking')
        return None
```

Wrapper methods require training models many times—potentially n×d times for n-fold CV with d features. For large datasets or complex models, this can take hours or days. AutoML systems must estimate runtime and fall back to faster methods when necessary.
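A brief usage sketch of the wrapper selector on synthetic data; the time budget and feature counts are illustrative:

```python
# Hypothetical usage of AutoMLWrapperSelector defined above; data is synthetic.
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=30, n_informative=6,
                           random_state=0)

wrapper = AutoMLWrapperSelector(n_features_to_select=10,
                                time_budget_seconds=120, cv=3)
wrapper.fit(X, y)                     # picks RFECV, RFE, or forward SFS

X_reduced = wrapper.transform(X)
print(X_reduced.shape)                # (300, n_selected)
print(wrapper.get_feature_ranking())  # RFE ranking DataFrame, or None for SFS
```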
Embedded methods perform feature selection as part of model training. The model itself learns which features are important through regularization or inherent feature importance mechanisms. More efficient than wrappers, but tied to specific model types.
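Tree-based models provide one embedded route: their built-in importances can drive selection directly. A minimal sketch (synthetic data, illustrative threshold) using scikit-learn's SelectFromModel, before turning to the L1 route below:

```python
# A minimal sketch of embedded selection via tree feature importances.
# Threshold and model settings are illustrative, not prescriptive.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=500, n_features=40, n_informative=8,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
selector = SelectFromModel(rf, threshold="mean")  # keep importance > mean
X_selected = selector.fit_transform(X, y)         # fits rf, then filters columns

print(X_selected.shape)                           # (500, n_kept)
```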
L1 (Lasso) Regularization
L1 penalty encourages sparsity—many coefficients become exactly zero. Features with zero coefficients are effectively selected out.
Why L1 produces zeros:
The L1 penalty λ|β| has a non-differentiable corner at β = 0. When a feature's contribution to the loss is weaker than the penalty, the optimum sits exactly at that corner, so the coefficient becomes exactly zero (the soft-thresholding behavior of Lasso solvers). The L2 penalty λβ² is smooth, so it shrinks coefficients toward zero but never makes them exactly zero.
Selection via regularization strength: The strength λ (alpha in scikit-learn) controls how aggressively coefficients are driven to zero. Small λ keeps most features; large λ zeroes out all but the strongest. Cross-validating over a path of λ values, as LassoCV does, picks the strength that balances sparsity against predictive accuracy. The sketch below illustrates the effect.
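A minimal sketch (synthetic regression data, illustrative alpha grid) showing the nonzero-coefficient count shrinking as regularization grows:

```python
# How the number of nonzero Lasso coefficients shrinks as alpha increases.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=5.0, random_state=0)

for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    lasso = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    n_nonzero = np.sum(lasso.coef_ != 0)
    print(f"alpha={alpha:>6}: {n_nonzero} nonzero coefficients")
```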
```python
from sklearn.linear_model import LassoCV, LogisticRegressionCV
from sklearn.feature_selection import SelectFromModel
import numpy as np
import pandas as pd


class L1FeatureSelector:
    """
    Feature selection using L1 regularization.

    Trains an L1-penalized model and selects features
    with nonzero coefficients.
    """

    def __init__(
        self,
        task: str = 'classification',
        cv: int = 5,
        threshold: str = 'mean'  # 'mean', 'median', or numeric
    ):
        self.task = task
        self.cv = cv
        self.threshold = threshold
        self.model_ = None
        self.selector_ = None

    def fit(self, X, y):
        """Fit L1 model and identify selected features."""
        if self.task == 'classification':
            self.model_ = LogisticRegressionCV(
                penalty='l1',
                solver='saga',
                cv=self.cv,
                max_iter=1000
            )
        else:
            self.model_ = LassoCV(cv=self.cv)

        self.model_.fit(X, y)

        # Select features whose |coefficient| exceeds the threshold
        self.selector_ = SelectFromModel(
            self.model_,
            threshold=self.threshold,
            prefit=True
        )
        return self

    def transform(self, X):
        return self.selector_.transform(X)

    def get_nonzero_features(self, feature_names=None):
        """Return features with nonzero coefficients."""
        coef = self.model_.coef_.ravel()
        nonzero_idx = np.where(coef != 0)[0]

        if feature_names is None:
            feature_names = [f"feature_{i}" for i in range(len(coef))]

        return pd.DataFrame({
            'feature': [feature_names[i] for i in nonzero_idx],
            'coefficient': coef[nonzero_idx],
            'abs_coefficient': np.abs(coef[nonzero_idx])
        }).sort_values('abs_coefficient', ascending=False)
```

Elastic Net combines L1 and L2 regularization: λ₁|β| + λ₂β². It produces sparse solutions like Lasso while handling correlated features better. AutoML systems often prefer Elastic Net over pure Lasso for this robustness.
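A minimal ElasticNetCV sketch on synthetic data; the l1_ratio grid is illustrative and lets cross-validation choose how Lasso-like the penalty should be:

```python
# ElasticNetCV searches both the regularization strength and the L1/L2 mix.
# Data and grid values here are illustrative.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=5.0, random_state=0)

enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5, max_iter=10000)
enet.fit(X, y)

print(f"chosen l1_ratio={enet.l1_ratio_}, alpha={enet.alpha_:.4f}")
print(f"nonzero coefficients: {np.sum(enet.coef_ != 0)} of {X.shape[1]}")
```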
AutoML systems must combine multiple feature selection approaches into a cohesive strategy that balances quality and computational cost. The optimal approach depends on dataset characteristics, time budget, and downstream model.
```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import (
    VarianceThreshold, SelectKBest, f_classif,
    mutual_info_classif, RFE, SelectFromModel
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from typing import List, Tuple


class AutoMLFeatureSelector:
    """
    Multi-stage feature selection pipeline for AutoML.

    Combines filter, embedded, and optionally wrapper methods
    in a computationally efficient pipeline.
    """

    def __init__(
        self,
        target_n_features: int = None,
        target_fraction: float = 0.5,
        time_budget_seconds: float = 60.0,
        use_wrapper: bool = True
    ):
        self.target_n_features = target_n_features
        self.target_fraction = target_fraction
        self.time_budget = time_budget_seconds
        self.use_wrapper = use_wrapper

        # Pipeline stages
        self.variance_selector_ = None
        self.correlation_dropped_ = None
        self.filter_selector_ = None
        self.embedded_selector_ = None
        self.final_features_ = None

    def fit(self, X: pd.DataFrame, y: pd.Series) -> 'AutoMLFeatureSelector':
        """
        Multi-stage feature selection:
        1. Remove constant/near-constant features
        2. Remove highly correlated redundant features
        3. Univariate filter (fast)
        4. Embedded selection (tree importance or L1)
        5. Optional wrapper refinement if time allows
        """
        current_features = list(X.columns)
        X_current = X.copy()

        # Stage 1: Variance Threshold
        print(f"Stage 1: Variance threshold ({len(current_features)} features)")
        X_current, current_features = self._variance_filter(X_current)
        print(f"  → {len(current_features)} features remain")

        # Stage 2: Correlation Filter
        print(f"Stage 2: Correlation filter")
        X_current, current_features = self._correlation_filter(X_current, y)
        print(f"  → {len(current_features)} features remain")

        # Stage 3: Univariate Filter
        target_after_filter = max(
            self.target_n_features or int(len(current_features) * 0.7),
            10
        )
        if len(current_features) > target_after_filter:
            print(f"Stage 3: Univariate filter (target: {target_after_filter})")
            X_current, current_features = self._univariate_filter(
                X_current, y, target_after_filter
            )
            print(f"  → {len(current_features)} features remain")

        # Stage 4: Embedded Selection (tree importance)
        final_target = self.target_n_features or int(
            len(X.columns) * self.target_fraction
        )
        final_target = max(final_target, 5)

        if len(current_features) > final_target:
            print(f"Stage 4: Tree importance (target: {final_target})")
            X_current, current_features = self._tree_importance_filter(
                X_current, y, final_target
            )
            print(f"  → {len(current_features)} features remain")

        self.final_features_ = current_features
        return self

    def _variance_filter(
        self, X: pd.DataFrame, threshold: float = 0.01
    ) -> Tuple[pd.DataFrame, List[str]]:
        """Remove low-variance features."""
        self.variance_selector_ = VarianceThreshold(threshold=threshold)
        X_filtered = self.variance_selector_.fit_transform(X)
        selected_mask = self.variance_selector_.get_support()
        selected_features = [c for c, m in zip(X.columns, selected_mask) if m]
        return pd.DataFrame(X_filtered, columns=selected_features), selected_features

    def _correlation_filter(
        self, X: pd.DataFrame, y: pd.Series, threshold: float = 0.95
    ) -> Tuple[pd.DataFrame, List[str]]:
        """Remove highly correlated features."""
        corr_matrix = X.corr().abs()
        target_corr = X.corrwith(y).abs()

        upper = corr_matrix.where(
            np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
        )

        to_drop = set()
        for col in upper.columns:
            for idx in upper.index:
                if upper.loc[idx, col] > threshold:
                    # Drop the one with lower target correlation
                    if target_corr.get(col, 0) > target_corr.get(idx, 0):
                        to_drop.add(idx)
                    else:
                        to_drop.add(col)

        self.correlation_dropped_ = list(to_drop)
        selected = [c for c in X.columns if c not in to_drop]
        return X[selected], selected

    def _univariate_filter(
        self, X: pd.DataFrame, y: pd.Series, k: int
    ) -> Tuple[pd.DataFrame, List[str]]:
        """Apply univariate statistical filter."""
        self.filter_selector_ = SelectKBest(f_classif, k=k)
        X_filtered = self.filter_selector_.fit_transform(X, y)
        selected_mask = self.filter_selector_.get_support()
        selected_features = [c for c, m in zip(X.columns, selected_mask) if m]
        return pd.DataFrame(X_filtered, columns=selected_features), selected_features

    def _tree_importance_filter(
        self, X: pd.DataFrame, y: pd.Series, k: int
    ) -> Tuple[pd.DataFrame, List[str]]:
        """Select top features by tree importance."""
        rf = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1)
        rf.fit(X, y)

        importances = rf.feature_importances_
        top_k_idx = np.argsort(importances)[::-1][:k]
        selected_features = [X.columns[i] for i in top_k_idx]
        return X[selected_features], selected_features

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        """Select the final features."""
        return X[self.final_features_]

    def get_selection_summary(self) -> dict:
        """Return summary of feature selection process."""
        return {
            'final_n_features': len(self.final_features_),
            'selected_features': self.final_features_,
            'correlation_dropped': self.correlation_dropped_
        }
```

Production AutoML systems use staged selection: fast filter methods first (O(n×d)), then embedded methods (O(n×d×log(n))), and expensive wrapper methods only if time allows. This gives good results quickly while allowing refinement given more time.
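A short end-to-end usage sketch of the staged pipeline above, on synthetic data with illustrative targets:

```python
# Hypothetical usage of AutoMLFeatureSelector defined above; data is synthetic.
import pandas as pd
from sklearn.datasets import make_classification

X_arr, y_arr = make_classification(n_samples=1000, n_features=100,
                                   n_informative=15, n_redundant=20,
                                   random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(X_arr.shape[1])])
y = pd.Series(y_arr)

pipeline = AutoMLFeatureSelector(target_n_features=20)
pipeline.fit(X, y)                     # runs the stages, printing progress

X_final = pipeline.transform(X)        # DataFrame with the surviving columns
print(pipeline.get_selection_summary()['final_n_features'])
```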
Feature selection is essential for building efficient, interpretable, and well-generalizing models. AutoML systems combine multiple approaches in staged pipelines to balance quality and computational cost.
You now understand how AutoML systems approach feature selection—from simple statistical filters through sophisticated embedded methods to multi-stage automated pipelines. Next, we'll explore automated data cleaning, where AutoML must detect and handle data quality issues.