More features don't always mean better models. In fact, including irrelevant or redundant features can hurt gradient boosting performance through increased overfitting, longer training times, and decreased interpretability. Feature selection—the process of identifying and retaining only the most valuable features—is a critical skill for building production-quality boosting models.
Gradient boosting provides unique advantages for feature selection: built-in importance measures, iterative refinement through residual fitting, and natural handling of feature interactions. Understanding how to leverage these properties separates practitioners who achieve good results from those who achieve exceptional ones.
By the end of this page, you will understand feature importance metrics in boosting (split-based, gain-based, permutation), implement recursive feature elimination for GBDT, handle correlated features effectively, and build a systematic feature selection pipeline for production systems.
The Paradox of Boosting's Feature Handling:
Tree-based boosting is often described as "immune" to irrelevant features—the algorithm simply won't split on features that don't improve predictions. While partially true, this perspective misses important nuances:
| Metric | Effect of More Features | Mitigation |
|---|---|---|
| Training time | Linear increase in split evaluation | Feature selection, subsampling |
| Memory usage | Linear increase | Feature selection |
| Overfitting risk | Moderate increase (noise features used occasionally) | Regularization + selection |
| Interpretability | Decreases (importance spread thin) | Select top features |
| Model size | Increases (more unique split values) | Feature selection |
| Inference time | Slight increase | Usually negligible for trees |
Feature selection provides biggest benefits when: (1) you have many candidate features (>100), (2) training data is limited, (3) model interpretability is required, (4) production latency/memory is constrained, or (5) features have significant redundancy.
Gradient boosting provides multiple measures of feature importance, each capturing different aspects.
1. Split Count (Frequency) Importance:
Counts how many times each feature is used for splitting across all trees:
$$\text{Importance}_{\text{split}}(f) = \sum_{t \in \text{trees}} \sum_{n \in \text{nodes}(t)} \mathbb{1}[\text{feature}(n) = f]$$
**Pros:** Simple, fast to compute.
**Cons:** Biased toward high-cardinality features (more split points available).
2. Gain Importance:
Sums the improvement in the loss function contributed by splits on each feature:
$$\text{Importance}_{\text{gain}}(f) = \sum_{t \in \text{trees}} \sum_{n \in \text{nodes}(t)} \mathbb{1}[\text{feature}(n) = f] \cdot \text{Gain}(n)$$
**Pros:** Measures actual predictive contribution.
**Cons:** Still biased toward high-cardinality and continuous features.
```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
import lightgbm as lgb

def get_all_importance_types(model, X, y, feature_names):
    """
    Compute multiple importance metrics for comprehensive analysis.
    """
    importance_df = pd.DataFrame({'feature': feature_names})

    # 1. Split-based importance
    if hasattr(model, 'feature_importances_'):
        importance_df['split_importance'] = model.feature_importances_

    # 2. Gain-based importance (LightGBM specific)
    if hasattr(model, 'booster_'):
        gain = model.booster_.feature_importance(importance_type='gain')
        importance_df['gain_importance'] = gain

    # 3. Permutation importance (model-agnostic, unbiased)
    perm_imp = permutation_importance(
        model, X, y, n_repeats=10, random_state=42, n_jobs=-1
    )
    importance_df['permutation_importance'] = perm_imp.importances_mean
    importance_df['permutation_std'] = perm_imp.importances_std

    # 4. Compute rankings for each metric
    for col in ['split_importance', 'gain_importance', 'permutation_importance']:
        if col in importance_df.columns:
            importance_df[f'{col}_rank'] = importance_df[col].rank(ascending=False)

    return importance_df.sort_values('permutation_importance', ascending=False)

def shap_importance(model, X, feature_names):
    """
    SHAP-based feature importance - gold standard for interpretability.
    """
    import shap

    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)

    # Mean absolute SHAP value per feature
    if isinstance(shap_values, list):
        # Multi-class: average across classes
        mean_abs_shap = np.mean([np.abs(sv).mean(axis=0) for sv in shap_values], axis=0)
    else:
        mean_abs_shap = np.abs(shap_values).mean(axis=0)

    return pd.DataFrame({
        'feature': feature_names,
        'shap_importance': mean_abs_shap
    }).sort_values('shap_importance', ascending=False)
```

3. Permutation Importance:
Measures performance drop when a feature's values are randomly shuffled:
$$\text{Importance}_{\text{perm}}(f) = \text{Score}_{\text{original}} - \mathbb{E}[\text{Score}_{f\ \text{shuffled}}]$$
**Pros:** Unbiased, captures a feature's true predictive value.
**Cons:** Computationally expensive; affected by correlated features.
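The permutation formula translates almost line-for-line into code. A minimal sketch, assuming a fitted model with a `predict` method, `X` as a NumPy array, and `score_fn` as any higher-is-better metric (none of these names come from a particular library):

```python
import numpy as np

def permutation_importance_manual(model, X, y, score_fn, n_repeats=5, rng=None):
    """Average score drop when each column of X is shuffled in turn."""
    rng = np.random.default_rng(rng)
    base_score = score_fn(y, model.predict(X))
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            # Score_original - Score_{f shuffled}, averaged over repeats
            drops[j] += base_score - score_fn(y, model.predict(X_perm))
    return drops / n_repeats
```

A feature the model never uses yields a drop of exactly zero, since shuffling it leaves predictions unchanged.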
4. SHAP Importance:
Based on Shapley values from game theory—measures average marginal contribution:
**Pros:** Theoretically grounded, handles interactions, works locally and globally.
**Cons:** Computationally expensive for large datasets.
| Metric | Speed | Bias | Best For |
|---|---|---|---|
| Split count | Fast | High-cardinality bias | Quick overview |
| Gain | Fast | Moderate bias | Understanding split value |
| Permutation | Slow | Low | Reliable feature ranking |
| SHAP | Slowest | Very low | Interpretability, interactions |
Method 1: Importance Threshold
Simplest approach—keep features above an importance threshold:
```python
importance = model.feature_importances_
threshold = np.percentile(importance, 50)  # keep the top 50%
selected_features = np.array(feature_names)[importance >= threshold]
```
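The same threshold rule can be expressed with scikit-learn's `SelectFromModel`. A sketch on synthetic data (the dataset and model settings here are illustrative, not from the original):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=42)

model = GradientBoostingClassifier(n_estimators=50, random_state=42).fit(X, y)

# threshold="median" keeps features whose importance is >= the median (~top 50%)
selector = SelectFromModel(model, threshold="median", prefit=True)
X_selected = selector.transform(X)
mask = selector.get_support()   # boolean mask over the original columns
```

`prefit=True` reuses the already-fitted model instead of refitting inside the selector.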
Method 2: Recursive Feature Elimination (RFE)
Iteratively remove the least important features, retraining between rounds:
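A minimal version uses scikit-learn's generic `RFE`, which ranks features by the estimator's built-in importances (a sketch on synthetic data; a fuller LightGBM version with cross-validation appears later on this page):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=300, n_features=15, n_informative=4,
                           random_state=0)

# Drop 10% of the remaining features each round until 5 are left
rfe = RFE(GradientBoostingClassifier(n_estimators=30, random_state=0),
          n_features_to_select=5, step=0.1)
rfe.fit(X, y)

mask = rfe.support_      # boolean mask of kept features
ranking = rfe.ranking_   # 1 = kept; larger values were eliminated earlier
```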
Method 3: Boruta Algorithm
Creates "shadow" features (shuffled copies of originals) and selects features that consistently outperform shadows:
```python
import numpy as np
from boruta import BorutaPy

# BorutaPy expects numpy arrays and a tree-based estimator
# exposing feature_importances_
boruta = BorutaPy(model, n_estimators='auto', random_state=42)
boruta.fit(np.asarray(X), np.asarray(y))
selected = np.array(feature_names)[boruta.support_]
```
```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.model_selection import cross_val_score
import lightgbm as lgb

def recursive_feature_elimination_lgb(X, y, feature_names, min_features=10,
                                      step=0.1, cv=5, verbose=True):
    """
    RFE specifically designed for LightGBM with cross-validation.
    Uses permutation importance for unbiased feature ranking.
    """
    current_features = list(feature_names)
    best_score = -np.inf
    best_features = current_features.copy()
    history = []

    while len(current_features) > min_features:
        # Train model
        X_current = X[current_features]
        model = lgb.LGBMClassifier(n_estimators=100, random_state=42)

        # Cross-validation score
        scores = cross_val_score(model, X_current, y, cv=cv, scoring='roc_auc')
        mean_score = scores.mean()

        history.append({
            'n_features': len(current_features),
            'cv_score': mean_score,
            'cv_std': scores.std()
        })

        if verbose:
            print(f"Features: {len(current_features)}, "
                  f"CV AUC: {mean_score:.4f} ± {scores.std():.4f}")

        # Track best
        if mean_score > best_score:
            best_score = mean_score
            best_features = current_features.copy()

        # Get feature importance via permutation
        model.fit(X_current, y)
        perm_imp = permutation_importance(model, X_current, y,
                                          n_repeats=5, random_state=42)
        importances = perm_imp.importances_mean

        # Remove bottom features
        n_remove = max(1, int(len(current_features) * step))
        indices_to_remove = np.argsort(importances)[:n_remove]
        features_to_remove = [current_features[i] for i in indices_to_remove]
        current_features = [f for f in current_features if f not in features_to_remove]

    return best_features, history

def null_importance_selection(X, y, feature_names, n_runs=50, threshold=0.8):
    """
    Select features that perform better than null (shuffled-target) importance.
    Features must beat the random baseline in > threshold fraction of runs.
    `train_and_get_importance` is assumed to fit a model and return its
    per-feature importances as an array.
    """
    actual_importance = train_and_get_importance(X, y)

    null_importances = []
    for i in range(n_runs):
        y_shuffled = np.random.permutation(y)
        null_imp = train_and_get_importance(X, y_shuffled)
        null_importances.append(null_imp)
    null_importances = np.array(null_importances)

    # Count how many times actual > null
    beats_null = (actual_importance > null_importances).mean(axis=0)
    selected = np.array(feature_names)[beats_null >= threshold]

    return selected, beats_null
```

Correlated features pose challenges for feature selection: importance gets split between them, and removing one can either hurt or help depending on the correlation structure.
The Problem:
If features A and B are highly correlated (ρ > 0.9), the model can split on either one interchangeably, so importance is divided between them more or less arbitrarily, and both can appear weaker than the underlying signal actually is.
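The dilution is easy to demonstrate with a synthetic sketch (hypothetical data; how the shared importance divides between duplicates depends on the splitter's tie-breaking):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 1))
y = 3 * x[:, 0] + rng.normal(scale=0.1, size=500)
noise = rng.normal(size=(500, 1))

# Model A: the signal feature plus an irrelevant one
imp_a = GradientBoostingRegressor(random_state=0).fit(
    np.hstack([x, noise]), y).feature_importances_

# Model B: two identical copies of the signal feature plus the irrelevant one
imp_b = GradientBoostingRegressor(random_state=0).fit(
    np.hstack([x, x, noise]), y).feature_importances_

# The two copies now share what was a single feature's importance
```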
Strategy 1: Cluster-Based Selection
Group correlated features and select one representative per cluster:
```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

corr_matrix = X.corr().abs()
dist_matrix = 1 - corr_matrix
linkage_matrix = linkage(squareform(dist_matrix, checks=False), method='average')
clusters = fcluster(linkage_matrix, t=0.3, criterion='distance')  # ~0.7 correlation threshold

# Select one feature per cluster; `importance_by_feature` is assumed to
# map feature name -> importance score
feature_names = np.array(corr_matrix.columns)
selected = []
for cluster_id in np.unique(clusters):
    cluster_features = feature_names[clusters == cluster_id]
    # Keep the one with the highest importance
    selected.append(max(cluster_features, key=importance_by_feature.get))
```
Strategy 2: Variance Inflation Factor (VIF)
Remove features with high VIF (indicating multicollinearity):
```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif_data = pd.DataFrame()
vif_data['feature'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i)
                   for i in range(X.shape[1])]

# Remove features with VIF > 10 (high multicollinearity)
low_vif_features = vif_data.loc[vif_data['VIF'] <= 10, 'feature']
```
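Since dropping one feature changes every other feature's VIF, removal is best done iteratively: drop the worst offender, recompute, repeat. This sketch computes VIF directly from its definition, 1/(1 − R²), with NumPy, so it does not depend on statsmodels (the function names are illustrative):

```python
import numpy as np
import pandas as pd

def vif(X: np.ndarray, j: int) -> float:
    """VIF_j = 1 / (1 - R^2) from regressing column j on the remaining columns."""
    y = X[:, j]
    Z = np.delete(X, j, axis=1)
    Z = np.column_stack([np.ones(len(Z)), Z])  # add intercept
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    r2 = 1 - resid.var() / y.var()
    return 1.0 / max(1.0 - r2, 1e-12)

def drop_high_vif(X: pd.DataFrame, max_vif: float = 10.0) -> pd.DataFrame:
    """Iteratively drop the worst-VIF column until all VIFs are <= max_vif."""
    X = X.copy()
    while X.shape[1] > 1:
        vifs = [vif(X.values, j) for j in range(X.shape[1])]
        worst = int(np.argmax(vifs))
        if vifs[worst] <= max_vif:
            break
        X = X.drop(columns=[X.columns[worst]])
    return X
```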
Be careful with permutation importance when features are correlated. Shuffling one feature doesn't hurt performance much because the correlated feature remains. Consider using grouped permutation or SHAP interaction values for correlated feature sets.
Stability Considerations:
Feature selection should be stable—small changes in data shouldn't dramatically change selected features. Techniques to improve stability:
```python
import xgboost as xgb

# XGBoost with L1 regularization encourages feature sparsity
model = xgb.XGBClassifier(
    reg_alpha=1.0,          # L1 regularization
    colsample_bytree=0.8,   # random feature subsampling per tree
)
```
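Stability itself can be measured: rerun the selection on bootstrap resamples and compute the average pairwise Jaccard overlap of the selected sets. A sketch, where `select` is any function returning selected feature indices and `corr_top5` is a hypothetical example selector (both names are illustrative, not from a library):

```python
import numpy as np
from itertools import combinations

def selection_stability(X, y, select, n_boot=10, rng=None):
    """Mean pairwise Jaccard similarity of feature sets chosen on bootstraps."""
    rng = np.random.default_rng(rng)
    chosen = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), size=len(y))
        chosen.append(frozenset(select(X[idx], y[idx])))
    sims = [len(a & b) / len(a | b)
            for a, b in combinations(chosen, 2) if a | b]
    return float(np.mean(sims)) if sims else 1.0

def corr_top5(X, y):
    """Example selector: indices of the 5 features most correlated with y."""
    corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return np.argsort(corr)[-5:]
```

A score near 1.0 means the same features are chosen regardless of resampling; scores well below 1.0 suggest the selection is driven by noise.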
Congratulations! You've completed the Feature Engineering for Boosting module. You now have comprehensive knowledge of feature interactions, target encoding, frequency encoding, categorical handling, and feature selection—the complete toolkit for preparing features that maximize gradient boosting performance.