Automated feature engineering is a double-edged sword. DFS can generate thousands of candidate features in minutes—far more than any human could manually create. But more features isn't always better.
Too many features lead to overfitting, longer training and inference times, higher memory use, and models that are harder to interpret and maintain.
The fundamental challenge:
From thousands of auto-generated features, how do we identify the subset that maximizes predictive power while minimizing complexity?
This page equips you with systematic methods to evaluate feature quality, detect redundancy, and select optimal feature subsets for your models.
By the end of this page, you will understand: univariate feature importance metrics, model-based feature selection methods, redundancy detection through correlation analysis, the feature selection taxonomy (filter, wrapper, embedded), and practical workflows for reducing feature dimensionality.
Univariate methods evaluate each feature independently, measuring its individual relationship with the target variable. These are the fastest evaluation methods—O(n) complexity for n features.
| Task Type | Feature Type | Target Type | Test |
|---|---|---|---|
| Classification | Numeric | Categorical | ANOVA F-test |
| Classification | Categorical | Categorical | Chi-squared test |
| Regression | Numeric | Numeric | Pearson correlation |
| Regression | Categorical | Numeric | Mutual information |
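The table above can be codified as a small lookup helper. This is only a sketch; select_univariate_test is a hypothetical name, not a scikit-learn function, and the mapping simply mirrors the table (f_regression is scikit-learn's F-test built on the Pearson correlation).

def select_univariate_test(task_type, feature_type):
    """Map (task type, feature type) to an appropriate univariate scoring function."""
    from sklearn.feature_selection import (
        f_classif, chi2, f_regression, mutual_info_regression
    )
    lookup = {
        ('classification', 'numeric'): f_classif,               # ANOVA F-test
        ('classification', 'categorical'): chi2,                # Chi-squared test
        ('regression', 'numeric'): f_regression,                # F-test from Pearson correlation
        ('regression', 'categorical'): mutual_info_regression,  # Mutual information
    }
    return lookup[(task_type, feature_type)]

score_func = select_univariate_test('classification', 'numeric')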
import numpy as np
import pandas as pd
from sklearn.feature_selection import (
    SelectKBest, SelectPercentile, f_classif, chi2,
    mutual_info_classif, f_regression, mutual_info_regression
)
from sklearn.preprocessing import MinMaxScaler

# Assume feature_matrix is from DFS and y is target
X = feature_matrix.fillna(0)  # Handle nulls
y = labels  # Binary classification target

# ANOVA F-test for classification (numeric features)
selector_f = SelectKBest(score_func=f_classif, k=50)
selector_f.fit(X, y)

# Get feature scores
f_scores = pd.DataFrame({
    'feature': X.columns,
    'f_score': selector_f.scores_,
    'p_value': selector_f.pvalues_
}).sort_values('f_score', ascending=False)

print("Top 10 Features by F-score:")
print(f_scores.head(10))

# Mutual Information (captures nonlinear relationships)
mi_scores = mutual_info_classif(X, y, random_state=42)
mi_df = pd.DataFrame({
    'feature': X.columns,
    'mi_score': mi_scores
}).sort_values('mi_score', ascending=False)

print("\nTop 10 Features by Mutual Information:")
print(mi_df.head(10))

# Compare rankings
top_f = set(f_scores.head(50)['feature'])
top_mi = set(mi_df.head(50)['feature'])
print(f"\nOverlap in top 50: {len(top_f & top_mi)} features")

A quick guide to interpreting these scores:

- F-statistic (ANOVA): ratio of between-class to within-class variance; large values mean the feature's mean shifts strongly across target classes, but only linear (mean-shift) effects are detected.
- Mutual Information: information-theoretic measure of dependence; captures nonlinear relationships, with a score of zero indicating independence.
- Chi-squared: tests independence between a non-negative (typically categorical or count) feature and a categorical target.
- Pearson Correlation (regression): measures the strength and direction of the linear relationship between a numeric feature and a numeric target.
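The chi-squared test requires non-negative inputs, which is why MinMaxScaler is imported in the block above. A minimal sketch of applying it to the same X and y, assuming the features are first scaled into [0, 1]:

# Chi-squared requires non-negative values, so rescale features to [0, 1] first
X_scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(X),
    columns=X.columns
)

selector_chi2 = SelectKBest(score_func=chi2, k=50)
selector_chi2.fit(X_scaled, y)

chi2_scores = pd.DataFrame({
    'feature': X.columns,
    'chi2_score': selector_chi2.scores_,
    'p_value': selector_chi2.pvalues_
}).sort_values('chi2_score', ascending=False)

print("Top 10 Features by Chi-squared:")
print(chi2_scores.head(10))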
Univariate methods ignore feature interactions. A feature with low individual importance might be highly predictive when combined with others (XOR-like patterns). Always complement univariate with multivariate evaluation for robust selection.
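A quick illustration of the XOR caveat on synthetic data (the variable names here are purely illustrative): each feature alone carries essentially no information about the target, yet the two together determine it exactly.

rng = np.random.RandomState(0)
xor_a = rng.randint(0, 2, 5000)
xor_b = rng.randint(0, 2, 5000)
xor_target = xor_a ^ xor_b  # perfectly predictable from (a, b) together

X_xor = pd.DataFrame({'a': xor_a, 'b': xor_b})

# Univariate mutual information on each feature alone is near zero...
print(mutual_info_classif(X_xor, xor_target, discrete_features=True, random_state=0))

# ...but a model that uses both features is nearly perfect
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
print(cross_val_score(DecisionTreeClassifier(), X_xor, xor_target, cv=5).mean())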
Model-based methods use trained models to assess feature importance, capturing both individual effects and interactions.
Decision trees and ensembles (Random Forest, XGBoost, LightGBM) provide built-in feature importance:
| Metric | Description | Best For |
|---|---|---|
| Gini Importance | Mean decrease in impurity | Fast, built-in |
| Split Count | Number of times feature is used | Understanding coverage |
| Gain | Total gain across all splits | Gradient boosting models |
| Permutation | Drop in score when shuffled | Model-agnostic, reliable |
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
import lightgbm as lgb

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)

# Method 1: Gini Importance (built-in)
gini_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

print("Top 10 Features by Gini Importance:")
print(gini_importance.head(10))

# Method 2: Permutation Importance (more reliable)
perm_importance = permutation_importance(
    rf, X_test, y_test, n_repeats=10, random_state=42, n_jobs=-1
)

perm_df = pd.DataFrame({
    'feature': X.columns,
    'importance_mean': perm_importance.importances_mean,
    'importance_std': perm_importance.importances_std
}).sort_values('importance_mean', ascending=False)

print("\nTop 10 Features by Permutation Importance:")
print(perm_df.head(10))

# Method 3: LightGBM Gain-based importance
lgb_model = lgb.LGBMClassifier(n_estimators=100, random_state=42)
lgb_model.fit(X_train, y_train)

lgb_importance = pd.DataFrame({
    'feature': X.columns,
    'gain': lgb_model.booster_.feature_importance(importance_type='gain'),
    'split': lgb_model.booster_.feature_importance(importance_type='split')
}).sort_values('gain', ascending=False)

print("\nTop 10 Features by LightGBM Gain:")
print(lgb_importance.head(10))

SHAP (SHapley Additive exPlanations) values provide theoretically grounded, consistent feature importance with several advantages: attributions sum exactly to each individual prediction (local accuracy), rankings remain consistent when a model comes to rely more heavily on a feature, and the same values support both per-prediction and global explanations.
import shap
import matplotlib.pyplot as plt

# Create SHAP explainer (use TreeExplainer for tree models)
explainer = shap.TreeExplainer(lgb_model)
shap_values = explainer.shap_values(X_test)

# For binary classification, shap_values is list of 2 arrays
# Use class 1 (positive class) for importance
if isinstance(shap_values, list):
    shap_vals = shap_values[1]
else:
    shap_vals = shap_values

# Global feature importance (mean absolute SHAP)
shap_importance = pd.DataFrame({
    'feature': X.columns,
    'mean_abs_shap': np.abs(shap_vals).mean(axis=0)
}).sort_values('mean_abs_shap', ascending=False)

print("Top 10 Features by SHAP Importance:")
print(shap_importance.head(10))

# Summary plot (beeswarm)
plt.figure(figsize=(10, 8))
shap.summary_plot(shap_vals, X_test, plot_type="bar", max_display=20)
plt.title("SHAP Feature Importance")
plt.tight_layout()
plt.savefig("shap_importance.png")

# Compare all methods
comparison = gini_importance.head(20).merge(
    perm_df.head(20), on='feature', how='outer'
).merge(
    shap_importance.head(20), on='feature', how='outer'
)
print("\nImportance Comparison (top 20 from each method):")
print(comparison)

Different importance methods often disagree on feature rankings. This isn't necessarily a problem—each method measures a different aspect of importance. When methods agree, you have high confidence. When they disagree, investigate why and consider keeping features that rank high in ANY method.
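One way to quantify how closely two methods agree is a rank correlation over the full importance vectors. A small sketch, assuming the gini_importance, perm_df, and shap_importance frames computed above:

from scipy.stats import spearmanr

# Align all scores on the same feature index
ranks = gini_importance.set_index('feature')[['importance']] \
    .join(perm_df.set_index('feature')[['importance_mean']]) \
    .join(shap_importance.set_index('feature')[['mean_abs_shap']])

rho_gini_perm, _ = spearmanr(ranks['importance'], ranks['importance_mean'])
rho_gini_shap, _ = spearmanr(ranks['importance'], ranks['mean_abs_shap'])
print(f"Gini vs permutation rank correlation: {rho_gini_perm:.2f}")
print(f"Gini vs SHAP rank correlation:        {rho_gini_shap:.2f}")

Correlations near 1 mean the methods would select largely the same features; low values are a cue to investigate the features on which they diverge.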
Automated feature engineering often produces highly correlated features—different paths to the same information. Redundancy detection identifies and removes these near-duplicates.
import numpy as np
import pandas as pd
from scipy import stats
from collections import defaultdict

def detect_redundant_features(X, threshold=0.95, method='pearson'):
    """
    Identify highly correlated feature pairs.

    Args:
        X: Feature DataFrame
        threshold: Correlation threshold (default 0.95)
        method: 'pearson', 'spearman', or 'kendall'

    Returns:
        List of (feature1, feature2, correlation) tuples
    """
    # Compute correlation matrix
    corr_matrix = X.corr(method=method)

    # Extract upper triangle (avoid duplicates)
    upper_tri = np.triu(np.ones_like(corr_matrix, dtype=bool), k=1)

    # Find pairs above threshold
    redundant_pairs = []
    for i, j in zip(*np.where(upper_tri)):
        if abs(corr_matrix.iloc[i, j]) >= threshold:
            redundant_pairs.append((
                corr_matrix.columns[i],
                corr_matrix.columns[j],
                corr_matrix.iloc[i, j]
            ))

    return sorted(redundant_pairs, key=lambda x: abs(x[2]), reverse=True)

# Find redundant features
redundant = detect_redundant_features(X, threshold=0.95)
print(f"Found {len(redundant)} highly correlated pairs (r >= 0.95):")
for f1, f2, corr in redundant[:10]:
    print(f"  {corr:.3f}: {f1} ↔ {f2}")

def remove_redundant_features(X, threshold=0.95, importance_scores=None):
    """
    Remove redundant features, keeping the more important one.

    Args:
        X: Feature DataFrame
        threshold: Correlation threshold
        importance_scores: Dict of feature -> importance (optional)

    Returns:
        List of features to keep
    """
    corr_matrix = X.corr().abs()
    features_to_drop = set()

    for i in range(len(corr_matrix.columns)):
        for j in range(i + 1, len(corr_matrix.columns)):
            if corr_matrix.iloc[i, j] >= threshold:
                col_i = corr_matrix.columns[i]
                col_j = corr_matrix.columns[j]

                # Drop the less important feature
                if importance_scores:
                    drop = col_i if importance_scores.get(col_i, 0) < \
                        importance_scores.get(col_j, 0) else col_j
                else:
                    # Without importance, drop the second one
                    drop = col_j

                features_to_drop.add(drop)

    features_to_keep = [c for c in X.columns if c not in features_to_drop]
    return features_to_keep

# Create importance dict from earlier analysis
importance_dict = dict(zip(shap_importance['feature'],
                           shap_importance['mean_abs_shap']))

features_to_keep = remove_redundant_features(
    X, threshold=0.95, importance_scores=importance_dict
)
print(f"\nOriginal features: {len(X.columns)}")
print(f"After redundancy removal: {len(features_to_keep)}")
print(f"Features removed: {len(X.columns) - len(features_to_keep)}")

When many features are inter-correlated, it's useful to identify feature clusters—groups of features that essentially measure the same underlying concept:
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
# Create distance matrix from correlation
corr = X.corr().abs()
distance_matrix = 1 - corr
# Hierarchical clustering
linkage_matrix = linkage(
    squareform(distance_matrix), method='average'
)
# Cut tree at threshold to get clusters
clusters = fcluster(linkage_matrix, t=0.3, criterion='distance')
# Group features by cluster
feature_clusters = defaultdict(list)
for feat, cluster_id in zip(X.columns, clusters):
    feature_clusters[cluster_id].append(feat)
# From each cluster, keep only the most important feature
representative_features = []
for cluster_id, features in feature_clusters.items():
    best_feat = max(features, key=lambda f: importance_dict.get(f, 0))
    representative_features.append(best_feat)
DFS often creates naturally redundant features. For example, SUM(orders.total_amount) and MEAN(orders.total_amount) × COUNT(orders) are perfectly correlated. Similarly, features at different depths may capture the same information. Redundancy removal is almost always necessary after DFS.
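A quick check of that identity on toy data (the orders table here is illustrative, not output from a real DFS run): per customer, the sum equals the mean times the count, so the two derived columns are identical and their correlation is exactly 1.

orders = pd.DataFrame({
    'customer_id': [1, 1, 1, 2, 2, 3],
    'total_amount': [10.0, 20.0, 30.0, 5.0, 15.0, 40.0]
})

agg = orders.groupby('customer_id')['total_amount'].agg(['sum', 'mean', 'count'])
agg['mean_times_count'] = agg['mean'] * agg['count']

# SUM and MEAN × COUNT are the same quantity, so their correlation is 1.0
print(agg[['sum', 'mean_times_count']])
print(agg['sum'].corr(agg['mean_times_count']))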
Feature selection methods fall into three categories based on how they interact with the learning algorithm:
Definition: Evaluate features independently of any machine learning model.
| Method | How it Works | Speed | Captures Interactions |
|---|---|---|---|
| Variance threshold | Remove near-constant features | Very fast | No |
| Univariate tests | Statistical tests per feature | Fast | No |
| Correlation filter | Remove highly correlated pairs | Fast | No |
| Mutual information | Information-theoretic measure | Moderate | Partially |
from sklearn.feature_selection import (
    VarianceThreshold, SelectKBest, f_classif, mutual_info_classif
)

# Step 1: Remove constant/near-constant features
variance_selector = VarianceThreshold(threshold=0.01)
X_var = variance_selector.fit_transform(X)
kept_cols = X.columns[variance_selector.get_support()]
print(f"After variance filter: {len(kept_cols)} features")

# Step 2: Remove highly correlated features
X_filtered = X[kept_cols]
corr_matrix = X_filtered.corr().abs()
upper_tri = corr_matrix.where(np.triu(np.ones_like(corr_matrix, dtype=bool), k=1))
to_drop = [col for col in upper_tri.columns if (upper_tri[col] > 0.95).any()]
X_decorr = X_filtered.drop(columns=to_drop)
print(f"After correlation filter: {len(X_decorr.columns)} features")

# Step 3: Select top k by univariate test
k = min(100, len(X_decorr.columns))
univariate_selector = SelectKBest(score_func=f_classif, k=k)
X_univariate = univariate_selector.fit_transform(X_decorr, y)
final_cols = X_decorr.columns[univariate_selector.get_support()]
print(f"After univariate filter: {len(final_cols)} features")

Definition: Use model performance to evaluate feature subsets.
| Method | How it Works | Speed | Optimal Guarantee |
|---|---|---|---|
| Forward selection | Add features greedily | Slow | No |
| Backward elimination | Remove features greedily | Slow | No |
| Recursive Feature Elimination | Remove least important iteratively | Moderate | No |
| Exhaustive search | Try all combinations | Very slow | Yes |
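The code below demonstrates recursive feature elimination. Forward selection (the first row of the table) can be sketched with scikit-learn's SequentialFeatureSelector; the stopping point of 20 features is an illustrative choice, and because the method is slow it should run on the pre-filtered columns (final_cols from the filter step above).

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Greedy forward selection: start empty, repeatedly add the feature that
# most improves cross-validated ROC-AUC. Slow, so pre-filter features first.
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=20,   # illustrative stopping point
    direction='forward',
    scoring='roc_auc',
    cv=3,
    n_jobs=-1
)
sfs.fit(X[final_cols], y)
forward_selected = X[final_cols].columns[sfs.get_support()]
print(f"Forward selection kept {len(forward_selected)} features")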
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Recursive Feature Elimination with Cross-Validation
estimator = LogisticRegression(max_iter=1000, random_state=42)

# RFECV automatically finds optimal number of features
rfecv = RFECV(
    estimator=estimator,
    step=0.1,                   # Remove 10% of features each step
    cv=5,                       # 5-fold cross-validation
    scoring='roc_auc',
    min_features_to_select=10,
    n_jobs=-1
)

# Note: Wrapper methods are slow with many features
# Consider pre-filtering with filter methods first
X_prefiltered = X[final_cols]  # From filter step
rfecv.fit(X_prefiltered, y)

print(f"Optimal number of features: {rfecv.n_features_}")
print(f"Best CV score: {rfecv.cv_results_['mean_test_score'].max():.4f}")

# Get selected features
rfe_selected = X_prefiltered.columns[rfecv.support_]
print(f"\nSelected Features ({len(rfe_selected)}):")
for feat in rfe_selected[:10]:
    print(f"  {feat}")

Definition: Feature selection is built into the model training process.
| Method | How it Works | Speed | Notes |
|---|---|---|---|
| L1 Regularization (Lasso) | Drives coefficients to zero | Fast | Linear models only |
| ElasticNet | Combines L1 and L2 | Fast | Handles correlation |
| Tree importance + threshold | Built-in importance scores | Fast | Tree models only |
| Feature importance from boosting | Iterative importance | Moderate | Very effective |
from sklearn.linear_model import LassoCV, ElasticNetCV
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# L1 (Lasso) for embedded feature selection
# (For a classification target, an L1-penalized LogisticRegression is a common alternative.)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('lasso', LassoCV(cv=5, random_state=42, max_iter=2000))
])

pipeline.fit(X_prefiltered, y)
lasso_model = pipeline.named_steps['lasso']

# Features with non-zero coefficients
lasso_coefs = pd.DataFrame({
    'feature': X_prefiltered.columns,
    'coefficient': lasso_model.coef_
})
lasso_selected = lasso_coefs[lasso_coefs['coefficient'] != 0]
print(f"Lasso selected {len(lasso_selected)} features")

# Tree-based embedded selection
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(n_estimators=100, random_state=42)
gbc.fit(X_prefiltered, y)

# Select features above median importance
selector = SelectFromModel(gbc, threshold='median', prefit=True)
X_embedded = selector.transform(X_prefiltered)
embedded_selected = X_prefiltered.columns[selector.get_support()]
print(f"\nGBM embedded selection: {len(embedded_selected)} features")

A critical but often overlooked aspect of feature selection is stability—do the same features get selected when you slightly perturb the data?
With correlated features or noisy data, feature selection can be unstable: rerunning the same procedure on a slightly different sample, or with a different random seed, may return a noticeably different feature set, often swapping one member of a correlated pair for another.
This instability indicates redundancy among the candidate features, a weak signal-to-noise ratio, or a sample too small to distinguish reliably between near-equivalent features.
Repeatedly subsample data, run selection, and count how often each feature is selected:
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
from collections import Counter

def stability_selection(X, y, n_iterations=100, sample_fraction=0.5,
                        threshold=0.5, random_state=42):
    """
    Perform stability selection to identify robust features.

    Args:
        X: Feature DataFrame
        y: Target variable
        n_iterations: Number of subsampling iterations
        sample_fraction: Fraction of data to use per iteration
        threshold: Minimum selection frequency to include feature
        random_state: Random seed

    Returns:
        (DataFrame of selection frequencies, list of stable feature names)
    """
    np.random.seed(random_state)
    n_samples = len(X)
    sample_size = int(n_samples * sample_fraction)

    selection_counts = Counter()
    scaler = StandardScaler()

    for i in range(n_iterations):
        # Subsample without replacement
        indices = np.random.choice(n_samples, sample_size, replace=False)
        X_sample = X.iloc[indices]
        y_sample = y.iloc[indices] if hasattr(y, 'iloc') else y[indices]

        # Scale features
        X_scaled = scaler.fit_transform(X_sample)

        # Fit Lasso with cross-validation
        lasso = LassoCV(cv=3, random_state=i, max_iter=2000)
        lasso.fit(X_scaled, y_sample)

        # Record selected features (non-zero coefficients)
        selected = X.columns[lasso.coef_ != 0]
        selection_counts.update(selected)

        if (i + 1) % 20 == 0:
            print(f"Completed {i + 1}/{n_iterations} iterations")

    # Calculate selection frequencies
    frequencies = pd.DataFrame({
        'feature': list(selection_counts.keys()),
        'count': list(selection_counts.values())
    })
    frequencies['frequency'] = frequencies['count'] / n_iterations
    frequencies = frequencies.sort_values('frequency', ascending=False)

    # Apply threshold
    stable_features = frequencies[frequencies['frequency'] >= threshold]

    return frequencies, stable_features['feature'].tolist()

# Run stability selection
frequencies, stable_features = stability_selection(
    X_prefiltered, y, n_iterations=100, sample_fraction=0.5, threshold=0.5
)

print(f"Stable features (selected >50% of time): {len(stable_features)}")
print("\nTop 15 most stable features:")
print(frequencies.head(15))

Features with high stability scores (>70%) are reliably important across data variations. These should form the core of your feature set. Features with low stability (<30%) may be artifacts of specific samples and should be treated skeptically.
Now let's put it all together in a systematic pipeline for evaluating and selecting features from DFS output:
┌─────────────────────────┐
│ DFS Output (1000s) │
└───────────┬─────────────┘
▼
┌─────────────────────────┐
│ 1. Remove Constants │ Fast filter
│ (variance < 0.01) │ (~30% reduction)
└───────────┬─────────────┘
▼
┌─────────────────────────┐
│ 2. Remove Redundant │ Correlation filter
│ (correlation > 0.95) │ (~40% reduction)
└───────────┬─────────────┘
▼
┌─────────────────────────┐
│ 3. Univariate Filter │ Top N by MI or F-test
│ (keep top 500) │ (Controlled reduction)
└───────────┬─────────────┘
▼
┌─────────────────────────┐
│ 4. Model Importance │ SHAP or permutation
│ (rank features) │ (Quality ranking)
└───────────┬─────────────┘
▼
┌─────────────────────────┐
│ 5. Stability Selection │ Multiple subsamples
│ (keep stable >50%) │ (Robustness filter)
└───────────┬─────────────┘
▼
┌─────────────────────────┐
│ Final Feature Set │
│ (50-100 features) │
└─────────────────────────┘
import pandas as pd
import numpy as np
from sklearn.feature_selection import VarianceThreshold, SelectKBest, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

class FeatureEvaluationPipeline:
    """
    Systematic pipeline for evaluating and selecting DFS features.
    """

    def __init__(self, variance_threshold=0.01, correlation_threshold=0.95,
                 univariate_k=500, stability_iterations=50,
                 stability_threshold=0.5):
        self.variance_threshold = variance_threshold
        self.correlation_threshold = correlation_threshold
        self.univariate_k = univariate_k
        self.stability_iterations = stability_iterations
        self.stability_threshold = stability_threshold
        self.selected_features_ = None
        self.evaluation_report_ = {}

    def fit(self, X, y, verbose=True):
        """
        Run the complete evaluation pipeline.
        """
        original_count = len(X.columns)

        # Step 1: Variance filter
        X_current = self._variance_filter(X)
        self.evaluation_report_['after_variance'] = len(X_current.columns)
        if verbose:
            print(f"[1/5] Variance filter: {original_count} → {len(X_current.columns)}")

        # Step 2: Correlation filter
        X_current = self._correlation_filter(X_current)
        self.evaluation_report_['after_correlation'] = len(X_current.columns)
        if verbose:
            print(f"[2/5] Correlation filter: → {len(X_current.columns)}")

        # Step 3: Univariate filter
        X_current, univariate_scores = self._univariate_filter(X_current, y)
        self.evaluation_report_['after_univariate'] = len(X_current.columns)
        self.evaluation_report_['univariate_scores'] = univariate_scores
        if verbose:
            print(f"[3/5] Univariate filter: → {len(X_current.columns)}")

        # Step 4: Model importance
        importance_df = self._model_importance(X_current, y)
        self.evaluation_report_['importance'] = importance_df
        if verbose:
            print(f"[4/5] Model importance calculated")

        # Step 5: Stability selection
        stable_features = self._stability_selection(X_current, y)
        self.evaluation_report_['stable_features'] = stable_features
        self.selected_features_ = stable_features
        if verbose:
            print(f"[5/5] Stability selection: → {len(stable_features)} final features")

        return self

    def _variance_filter(self, X):
        X_filled = X.fillna(X.median())
        selector = VarianceThreshold(threshold=self.variance_threshold)
        selector.fit(X_filled)
        return X[X.columns[selector.get_support()]]

    def _correlation_filter(self, X):
        X_filled = X.fillna(X.median())
        corr = X_filled.corr().abs()
        upper = np.triu(np.ones_like(corr, dtype=bool), k=1)

        to_drop = set()
        for i in range(len(corr.columns)):
            for j in range(i + 1, len(corr.columns)):
                if corr.iloc[i, j] > self.correlation_threshold:
                    # Drop the one with higher mean correlation
                    mean_i = corr.iloc[i].mean()
                    mean_j = corr.iloc[j].mean()
                    to_drop.add(corr.columns[j if mean_j > mean_i else i])

        return X.drop(columns=list(to_drop))

    def _univariate_filter(self, X, y):
        X_filled = X.fillna(X.median())
        k = min(self.univariate_k, len(X.columns))
        selector = SelectKBest(score_func=mutual_info_classif, k=k)
        selector.fit(X_filled, y)

        scores = pd.DataFrame({
            'feature': X.columns,
            'mi_score': selector.scores_
        }).sort_values('mi_score', ascending=False)

        return X[X.columns[selector.get_support()]], scores

    def _model_importance(self, X, y):
        X_filled = X.fillna(X.median())
        X_train, X_test, y_train, y_test = train_test_split(
            X_filled, y, test_size=0.2, random_state=42
        )

        rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
        rf.fit(X_train, y_train)

        perm = permutation_importance(rf, X_test, y_test, n_repeats=10,
                                      random_state=42, n_jobs=-1)

        return pd.DataFrame({
            'feature': X.columns,
            'importance': perm.importances_mean
        }).sort_values('importance', ascending=False)

    def _stability_selection(self, X, y):
        # Simplified stability selection
        from collections import Counter
        from sklearn.linear_model import LassoCV
        from sklearn.preprocessing import StandardScaler

        X_filled = X.fillna(X.median())
        counts = Counter()

        for i in range(self.stability_iterations):
            idx = np.random.choice(len(X), int(0.5 * len(X)), replace=False)
            X_sub = X_filled.iloc[idx]
            y_sub = y.iloc[idx] if hasattr(y, 'iloc') else y[idx]

            X_scaled = StandardScaler().fit_transform(X_sub)
            lasso = LassoCV(cv=3, random_state=i, max_iter=2000)
            lasso.fit(X_scaled, y_sub)

            selected = X.columns[lasso.coef_ != 0]
            counts.update(selected)

        freq = {f: c / self.stability_iterations for f, c in counts.items()}
        stable = [f for f, p in freq.items() if p >= self.stability_threshold]
        return stable

    def transform(self, X):
        return X[self.selected_features_]

    def get_report(self):
        return self.evaluation_report_

# Usage
pipeline = FeatureEvaluationPipeline()
pipeline.fit(feature_matrix.fillna(0), y)
X_selected = pipeline.transform(feature_matrix)
print(f"\nFinal feature set: {X_selected.shape[1]} features")

Beyond individual feature quality, we need to evaluate feature sets as a whole:
The ultimate test—does the feature set improve model accuracy?
| Metric | Task | Interpretation |
|---|---|---|
| ROC-AUC | Binary classification | Ranking quality |
| Log Loss | Classification | Probability calibration |
| RMSE | Regression | Error magnitude |
| Lift @ K | Ranking | Top-K performance |
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, log_loss
import time

def evaluate_feature_set(X, y, feature_subset, cv=5):
    """
    Evaluate a feature set on multiple criteria.
    """
    X_subset = X[feature_subset].fillna(0)

    # Model performance
    model = GradientBoostingClassifier(n_estimators=100, random_state=42)
    auc_scores = cross_val_score(
        model, X_subset, y, cv=cv, scoring='roc_auc', n_jobs=-1
    )

    # Training time
    start = time.time()
    model.fit(X_subset, y)
    train_time = time.time() - start

    # Feature efficiency
    features_per_auc_point = len(feature_subset) / auc_scores.mean()

    return {
        'n_features': len(feature_subset),
        'auc_mean': auc_scores.mean(),
        'auc_std': auc_scores.std(),
        'train_time_seconds': train_time,
        'features_per_auc_point': features_per_auc_point
    }

# Compare feature sets
feature_sets = {
    'all_features': list(X.columns),
    'top_100_importance': importance_df.head(100)['feature'].tolist(),
    'stable_features': stable_features,
    'pipeline_output': pipeline.selected_features_
}

print("Feature Set Comparison:")
print("-" * 70)
for name, features in feature_sets.items():
    if len(features) > 0:
        metrics = evaluate_feature_set(X, y, features)
        print(f"\n{name}:")
        print(f"  Features: {metrics['n_features']}")
        print(f"  AUC: {metrics['auc_mean']:.4f} ± {metrics['auc_std']:.4f}")
        print(f"  Train time: {metrics['train_time_seconds']:.2f}s")
        print(f"  Efficiency: {metrics['features_per_auc_point']:.1f} features/AUC point")

| Metric | Formula | Interpretation |
|---|---|---|
| Feature Efficiency | AUC / log(n_features) | Performance per complexity |
| Stability Score | Mean selection frequency | Robustness across samples |
| Redundancy Score | Mean pairwise correlation | Information overlap |
| Coverage Score | Unique entities represented | Schema coverage |
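These set-level metrics aren't computed by any single library call. A minimal sketch of the first three, assuming the frequencies frame from stability selection and the evaluate_feature_set helper above (feature_set_quality is a hypothetical helper; coverage requires DFS entity metadata and is omitted):

def feature_set_quality(X, features, auc, stability_frequencies):
    """Summarize a feature set: efficiency, stability, and redundancy."""
    subset = X[features].fillna(0)

    # Feature efficiency: performance per unit of complexity (AUC / log(n_features))
    efficiency = auc / np.log(len(features)) if len(features) > 1 else auc

    # Stability: mean selection frequency of the chosen features
    freq_lookup = dict(zip(stability_frequencies['feature'],
                           stability_frequencies['frequency']))
    stability = np.mean([freq_lookup.get(f, 0.0) for f in features])

    # Redundancy: mean absolute pairwise correlation (off-diagonal entries only)
    corr = subset.corr().abs().values
    n = corr.shape[0]
    redundancy = (corr.sum() - n) / (n * (n - 1)) if n > 1 else 0.0

    return {'efficiency': efficiency, 'stability': stability, 'redundancy': redundancy}

metrics = evaluate_feature_set(X, y, stable_features)
print(feature_set_quality(X, stable_features, metrics['auc_mean'], frequencies))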
Feature evaluation transforms the raw output of automated feature engineering into a curated, high-quality feature set. To consolidate the key methods: univariate statistics give fast individual screening but miss interactions; model-based importance (Gini, permutation, SHAP) ranks features while accounting for interactions; correlation analysis and clustering strip out redundant near-duplicates; the filter, wrapper, and embedded taxonomy trades speed against thoroughness; stability selection confirms that chosen features survive data perturbation; and a staged pipeline chains these steps to reduce thousands of DFS features to a manageable set.
What's Next:
With features generated and evaluated, the final challenge is computational efficiency. The next page covers computational considerations—strategies for scaling automated feature engineering to large datasets while managing memory and processing time constraints.
You now have a complete toolkit for feature evaluation and selection. From univariate statistics to stability selection, you can systematically reduce DFS output to a curated set that maximizes predictive power while minimizing complexity.