Feature importance scores promise to reveal which variables truly drive your model's predictions. But these scores can be systematically misleading—not through random error, but through structural biases inherent in how they're computed. Understanding these biases is crucial: an analyst who trusts biased importance scores may make catastrophic decisions about feature selection, data collection, or model interpretation.
This page exposes the most dangerous biases affecting feature importance methods, demonstrates how to detect them, and provides practical strategies for obtaining more reliable estimates.
Biased feature importance can lead to: (1) Selecting worthless features while discarding valuable ones, (2) Misinterpreting which factors drive outcomes, (3) Building models that fail when data distributions shift, (4) Spending resources collecting low-value data. These aren't edge cases—they happen regularly in practice.
By the end of this page, you will understand: (1) Cardinality bias and how it inflates importance of high-cardinality features, (2) How feature correlations distort importance estimates, (3) Data leakage detection through importance analysis, (4) Sampling and scale biases, and (5) Strategies for bias mitigation.
Cardinality bias is the most well-documented bias in impurity-based importance. Features with more unique values receive inflated importance scores simply because they offer more potential split points.
The mechanism:
Consider a decision tree choosing where to split. A continuous feature with 1000 unique values offers 999 potential split points. A binary feature offers exactly 1 split point. Even if both features are equally predictive, the high-cardinality feature has far more opportunities to find a split that happens to reduce impurity—especially if there's noise in the data.
This effect is purely mechanical: more candidates → higher probability of finding a good split → higher accumulated importance.
| Feature Type | Unique Values | Split Opportunities | Bias Level |
|---|---|---|---|
| Binary | 2 | 1 | Low |
| Ordinal (5 levels) | 5 | 4 | Low |
| Categorical (50 classes) | 50 | Many (combinatorial) | Medium-High |
| Continuous | ~N | N-1 ≈ thousands | High |
| ID/Unique identifier | N | N-1 (every sample) | Extreme |
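To see the split-opportunity counts from the table above concretely, here is a minimal sketch (synthetic data; the feature names are invented for illustration) that simply counts candidate thresholds per feature:

```python
import numpy as np

# Minimal sketch: count candidate split thresholds for features of
# different cardinality (a tree can split between any two distinct values).
rng = np.random.default_rng(0)
n = 1000

features = {
    "binary_flag": rng.integers(0, 2, n),   # 2 unique values
    "ordinal_5":   rng.integers(0, 5, n),   # 5 unique values
    "continuous":  rng.normal(size=n),      # ~n unique values
    "row_id":      np.arange(n),            # unique per sample
}

for name, values in features.items():
    n_unique = len(np.unique(values))
    print(f"{name:<12} unique={n_unique:<5} candidate splits={n_unique - 1}")
```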
If you accidentally include a random ID column in your features, impurity-based importance will often rank it as the MOST important feature—despite having zero predictive value. This is pure cardinality bias: unique values for every sample = maximum split opportunities.
```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

def demonstrate_cardinality_bias():
    """
    Demonstrate how cardinality bias affects impurity-based importance.
    """
    np.random.seed(42)
    n_samples = 2000

    # Create features with IDENTICAL predictive power but different cardinality
    # Binary feature (2 unique values) - Highly predictive
    binary_feature = (np.random.randn(n_samples) > 0).astype(float)

    # Continuous feature (many unique values) - Equally predictive
    continuous_feature = np.random.randn(n_samples)

    # Random ID (unique per sample) - ZERO predictive power
    random_id = np.random.permutation(n_samples).astype(float)

    # Target: depends equally on binary and continuous, NOT on random_id
    noise = np.random.randn(n_samples) * 0.3
    y = ((binary_feature * 2 - 1) + continuous_feature + noise > 0).astype(int)

    X = np.column_stack([binary_feature, continuous_feature, random_id])
    feature_names = ['binary_predictive', 'continuous_predictive', 'random_id_noise']

    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

    # Train random forest
    rf = RandomForestClassifier(n_estimators=200, max_depth=None, random_state=42)
    rf.fit(X_train, y_train)

    # Get impurity-based importance
    impurity_imp = rf.feature_importances_

    # Get permutation importance (on validation set)
    perm_result = permutation_importance(rf, X_val, y_val, n_repeats=30, random_state=42)
    perm_imp = perm_result.importances_mean

    # Display results
    print("Cardinality Bias Demonstration")
    print("=" * 75)
    print(f"{'Feature':<25} {'Cardinality':<12} {'Impurity Imp':<15} {'Permutation Imp'}")
    print("-" * 75)

    cardinalities = [2, len(np.unique(continuous_feature)), n_samples]
    for i, name in enumerate(feature_names):
        print(f"{name:<25} {cardinalities[i]:<12} {impurity_imp[i]:<15.4f} {perm_imp[i]:.4f}")

    print("\n📊 Analysis:")
    print(f"  • Binary feature cardinality: 2, Continuous: ~{int(len(np.unique(continuous_feature)))}, Random ID: {n_samples}")
    print(f"  • Impurity importance ranks: {np.argsort(-impurity_imp) + 1}")
    print(f"  • Permutation importance ranks: {np.argsort(-perm_imp) + 1}")

    if impurity_imp[2] > impurity_imp[0]:
        print("\n  ⚠️ CARDINALITY BIAS DETECTED!")
        print("  The random ID (zero predictive power) appears MORE important")
        print("  than the truly predictive binary feature according to impurity importance!")

    print("\n  ✅ Permutation importance correctly identifies the random ID as unimportant")

    return {
        'impurity': impurity_imp,
        'permutation': perm_imp,
        'feature_names': feature_names
    }

if __name__ == "__main__":
    demonstrate_cardinality_bias()
```

Mitigation strategies for cardinality bias:
Use permutation importance: It directly measures predictive contribution, so high-cardinality features that don't generalize show low importance
Regularize trees: Shallower trees (lower max_depth) are forced to use truly informative features first, reducing cardinality exploitation
Bin continuous features: Converting continuous to ordinal (e.g., quantile bins) equalizes cardinality across features, though this loses information
Use corrected importance measures: Unbiased split-selection algorithms (e.g., conditional inference forests) avoid favoring high-cardinality features by design; in scikit-learn, lowering max_features limits how often a noisy high-cardinality feature is even available at a split, which dampens the bias without fully removing it
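As a rough sketch of the binning and regularization strategies above (the dataset is synthetic, and the bin count and depth cap are arbitrary choices for illustration, not recommendations), quantile-binning and limiting tree depth both shrink the number of splits a noisy high-cardinality feature can exploit:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(42)
n = 2000

signal = rng.normal(size=n)         # truly predictive, continuous
noise_feature = rng.normal(size=n)  # high-cardinality, pure noise
y = (signal + rng.normal(scale=0.5, size=n) > 0).astype(int)
X = np.column_stack([signal, noise_feature])

# Mitigation 1: quantile-bin continuous features so both have ~10 levels
binner = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")
X_binned = binner.fit_transform(X)

# Mitigation 2: regularize trees so deep splits on noise disappear
deep    = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
shallow = RandomForestClassifier(n_estimators=200, max_depth=4, random_state=0).fit(X, y)
binned  = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_binned, y)

for label, model in [("deep/raw", deep), ("max_depth=4", shallow), ("binned", binned)]:
    imp = model.feature_importances_
    print(f"{label:<12} signal={imp[0]:.3f}  noise={imp[1]:.3f}")
```

The noise feature's share of impurity importance should drop noticeably under both mitigations, since it no longer enjoys an advantage in available split points.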
When features are correlated, importance gets distributed among them in ways that can be misleading. This affects all importance methods, though in different ways.
For impurity-based importance:
When features A and B are highly correlated, splits on either achieve similar impurity reduction. The tree might arbitrarily choose one over the other at different nodes, splitting the "credit" between them. Neither feature appears as important as it truly is for the target.
For permutation importance:
Shuffling feature A destroys its correlation with the target AND with feature B. But the model can still use B (which remains correlated with the target). This makes A appear less important than it would be if B didn't exist.
For drop-column importance:
Removing feature A allows the model to rely on B during retraining. If B can fully substitute for A, then A's drop-column importance is near zero—even if A is extremely predictive in isolation.
Correlated features can each show LOW individual importance while together being HIGHLY important. If you select features based on individual importance, you might exclude an entire cluster of correlated features that collectively carry most predictive signal.
Example: Height and Weight
Consider predicting heart disease risk, where both height and weight matter. Since height and weight are correlated:
| Method | Height | Weight |
|---|---|---|
| Impurity | 0.15 | 0.12 |
| Permutation | 0.08 | 0.06 |
| Drop-Column | 0.02 | 0.01 |
| Drop-Both | — | 0.45 combined |
```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.base import clone

def demonstrate_correlation_bias():
    """
    Show how correlated features split importance and each appears
    less important than they would individually.
    """
    np.random.seed(42)
    n_samples = 1500

    # Create a "true signal" feature
    true_signal = np.random.randn(n_samples)

    # Create 3 features: 2 correlated versions of true signal, 1 independent
    # Feature A: true signal + small noise
    feature_a = true_signal + np.random.randn(n_samples) * 0.2
    # Feature B: true signal + small noise (correlated with A)
    feature_b = true_signal + np.random.randn(n_samples) * 0.2
    # Feature C: independent, moderately predictive
    feature_c = np.random.randn(n_samples)

    # Target depends on true_signal AND feature_c
    y = (true_signal + feature_c * 0.5 + np.random.randn(n_samples) * 0.3 > 0).astype(int)

    print(f"Correlation between A and B: {np.corrcoef(feature_a, feature_b)[0,1]:.3f}")
    print(f"Correlation between A and true_signal: {np.corrcoef(feature_a, true_signal)[0,1]:.3f}")
    print()

    # Scenario 1: Model with ONLY feature A (no correlation issue)
    X_only_a = feature_a.reshape(-1, 1)
    rf_a = RandomForestClassifier(n_estimators=100, random_state=42)
    rf_a.fit(X_only_a, y)
    print(f"Model with only A - Accuracy: {rf_a.score(X_only_a, y):.3f}")

    # Scenario 2: Model with A, B, and C (correlation between A and B)
    X_all = np.column_stack([feature_a, feature_b, feature_c])
    rf_all = RandomForestClassifier(n_estimators=100, random_state=42)
    rf_all.fit(X_all, y)
    print(f"Model with A, B, C - Accuracy: {rf_all.score(X_all, y):.3f}")

    # Compare importance
    print("\nImpurity-Based Importance:")
    print(f"  Only A model: Feature A = {rf_a.feature_importances_[0]:.3f}")
    print(f"  With A,B,C:   Feature A = {rf_all.feature_importances_[0]:.3f}")
    print(f"                Feature B = {rf_all.feature_importances_[1]:.3f}")
    print(f"                Feature C = {rf_all.feature_importances_[2]:.3f}")
    print(f"  → A's importance DROPPED from {rf_a.feature_importances_[0]:.3f} to {rf_all.feature_importances_[0]:.3f}")
    print(f"  → A+B combined: {rf_all.feature_importances_[0] + rf_all.feature_importances_[1]:.3f}")

    # Drop-column analysis reveals the redundancy
    print("\nDrop-Column Importance (reveals redundancy):")
    baseline_score = rf_all.score(X_all, y)

    # Drop A only
    rf_no_a = clone(rf_all)
    rf_no_a.fit(X_all[:, 1:], y)  # Keep B and C
    score_no_a = rf_no_a.score(X_all[:, 1:], y)

    # Drop B only
    rf_no_b = clone(rf_all)
    rf_no_b.fit(X_all[:, [0, 2]], y)  # Keep A and C
    score_no_b = rf_no_b.score(X_all[:, [0, 2]], y)

    # Drop both A and B
    rf_no_ab = clone(rf_all)
    rf_no_ab.fit(X_all[:, 2:], y)  # Keep only C
    score_no_ab = rf_no_ab.score(X_all[:, 2:], y)

    print(f"  Baseline score: {baseline_score:.3f}")
    print(f"  Without A:      {score_no_a:.3f} (importance: {baseline_score - score_no_a:+.3f})")
    print(f"  Without B:      {score_no_b:.3f} (importance: {baseline_score - score_no_b:+.3f})")
    print(f"  Without A+B:    {score_no_ab:.3f} (joint importance: {baseline_score - score_no_ab:+.3f})")

    print("\n📊 Key Insight:")
    print("  • A and B each show LOW individual drop-column importance")
    print("  • But dropping BOTH shows HIGH importance")
    print("  • This is because B can compensate for A (and vice versa)")
    print("  • Their correlated signal is valuable, but individually redundant")

if __name__ == "__main__":
    demonstrate_correlation_bias()
```

Mitigation strategies for correlation bias:
Cluster correlated features: Group highly correlated features and analyze group importance
Hierarchical importance: First measure group importance, then within-group importance
Conditional permutation: Shuffle features only within strata defined by their correlated partners (maintains realistic joint distributions)
SHAP values: Shapley-based methods properly attribute credit among correlated features
Feature selection before importance: Use dimensionality reduction (PCA, feature clustering) first, then measure importance of the reduced features
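The following sketch illustrates the feature-clustering idea under simple assumptions: features are grouped by Spearman correlation, and each group is permuted jointly on a validation set to estimate group-level importance. The clustering threshold and number of repeats are arbitrary choices for demonstration, not a prescribed recipe:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic dataset with redundant (correlated) features
X, y = make_classification(n_samples=1500, n_features=10, n_informative=4,
                           n_redundant=4, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# 1. Cluster features by Spearman correlation (distance = 1 - |rho|)
corr = spearmanr(X_train).correlation
dist = 1 - np.abs(corr)
np.fill_diagonal(dist, 0.0)
clusters = fcluster(linkage(squareform(dist, checks=False), method="average"),
                    t=0.5, criterion="distance")

# 2. Permute entire clusters together to measure group importance
rng = np.random.default_rng(0)
baseline = rf.score(X_val, y_val)
for c in np.unique(clusters):
    cols = np.where(clusters == c)[0]
    drops = []
    for _ in range(10):
        X_perm = X_val.copy()
        idx = rng.permutation(len(X_val))
        X_perm[:, cols] = X_val[idx][:, cols]  # shuffle the whole group jointly
        drops.append(baseline - rf.score(X_perm, y_val))
    print(f"cluster {c} (features {cols.tolist()}): "
          f"group importance = {np.mean(drops):.4f}")
```

Because the group's columns are shuffled with the same row permutation, the within-group joint distribution is preserved while the group's relationship to the target is destroyed, so redundant partners cannot cover for each other.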
Feature importance analysis can be a powerful tool for detecting data leakage—when information from the future or the target leaks into features. Leaked features show suspiciously high importance.
Signs of data leakage in feature importance:
A single feature dominates: If one feature has importance of 0.5+ when you expect distributed importance, investigate
Too-good-to-be-true features: Features that shouldn't logically be so predictive yet rank at the top
Dramatic gap between top and rest: A sharp drop-off in importance after the first few features
Features available only at prediction time: Features that encode future information
Common leakage patterns: (1) 'days_until_churn' when predicting churn (encodes the label), (2) 'account_closed_date' in fraud detection (only exists after fraud investigation), (3) Aggregated statistics computed using future data, (4) Identifier columns that correlate with the target in training but won't generalize.
Detection methodology:
A systematic approach to leakage detection:
```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

def detect_potential_leakage(model, X_train, y_train, X_val, y_val,
                             feature_names, threshold=0.15):
    """
    Detect features that might represent data leakage.

    Args:
        model: Fitted model
        X_train, y_train: Training data
        X_val, y_val: Validation data
        feature_names: List of feature names
        threshold: Importance threshold to flag as suspicious

    Returns:
        DataFrame with leakage analysis
    """
    # Get impurity importance
    impurity_imp = model.feature_importances_

    # Get permutation importance on training AND validation
    perm_train = permutation_importance(model, X_train, y_train, n_repeats=20, n_jobs=-1)
    perm_val = permutation_importance(model, X_val, y_val, n_repeats=20, n_jobs=-1)

    results = pd.DataFrame({
        'feature': feature_names,
        'impurity_importance': impurity_imp,
        'perm_train': perm_train.importances_mean,
        'perm_val': perm_val.importances_mean,
    })

    # Compute leakage indicators
    # 1. Dominance: Does one feature account for a huge chunk?
    results['is_dominant'] = results['impurity_importance'] > threshold

    # 2. Train/val gap: Much higher on train than val suggests overfitting/leakage
    results['train_val_ratio'] = (results['perm_train'] / results['perm_val'].replace(0, 0.001))
    results['suspicious_gap'] = results['train_val_ratio'] > 2.0

    # 3. Negative validation importance: Feature hurts on unseen data
    results['negative_val'] = results['perm_val'] < 0

    # Flag overall suspicion
    results['suspected_leakage'] = (
        results['is_dominant'] | results['suspicious_gap'] | results['negative_val']
    )

    return results.sort_values('impurity_importance', ascending=False)

def create_leaky_dataset():
    """Create a dataset with intentional data leakage for demonstration."""
    np.random.seed(42)
    n_samples = 1000

    # Legitimate features
    feature_1 = np.random.randn(n_samples)
    feature_2 = np.random.randn(n_samples)
    feature_3 = np.random.randn(n_samples)

    # Target: depends on features 1 and 2
    y = (feature_1 + feature_2 * 0.5 + np.random.randn(n_samples) * 0.5 > 0).astype(int)

    # LEAKY FEATURE: directly encodes target information
    # Simulates something like "complaint_resolved" when predicting churn
    leaky_feature = y + np.random.randn(n_samples) * 0.1  # Almost perfect predictor

    # SUSPICIOUS FEATURE: ID that happened to correlate in training
    # Will not generalize
    suspicious_id = np.arange(n_samples).astype(float)

    X = np.column_stack([
        feature_1, feature_2, feature_3, leaky_feature, suspicious_id
    ])
    feature_names = [
        'legitimate_A', 'legitimate_B', 'legitimate_C',
        'LEAKY_outcome_derived', 'suspicious_id'
    ]
    return X, y, feature_names

# Demonstration
if __name__ == "__main__":
    X, y, feature_names = create_leaky_dataset()
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.3, random_state=42
    )

    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X_train, y_train)

    print("Data Leakage Detection Analysis")
    print("=" * 80)
    print(f"Training accuracy:   {rf.score(X_train, y_train):.4f}")
    print(f"Validation accuracy: {rf.score(X_val, y_val):.4f}")
    print()

    leakage_analysis = detect_potential_leakage(
        rf, X_train, y_train, X_val, y_val, feature_names
    )

    print("Feature Analysis:")
    print("-" * 80)
    cols = ['feature', 'impurity_importance', 'perm_train', 'perm_val',
            'train_val_ratio', 'suspected_leakage']
    print(leakage_analysis[cols].to_string(index=False))

    # Report suspicious features
    suspicious = leakage_analysis[leakage_analysis['suspected_leakage']]
    if len(suspicious) > 0:
        print("\n⚠️ SUSPECTED LEAKAGE DETECTED:")
        for _, row in suspicious.iterrows():
            reasons = []
            if row['is_dominant']:
                reasons.append(f"dominates importance ({row['impurity_importance']:.2f})")
            if row['suspicious_gap']:
                reasons.append(f"train/val ratio = {row['train_val_ratio']:.1f}x")
            if row['negative_val']:
                reasons.append("negative validation importance")
            print(f"  • {row['feature']}: {', '.join(reasons)}")

        print("\n  Recommendation: Review these features with domain experts before deployment!")
```

Unlike many ML algorithms, tree-based methods are generally scale-invariant for raw predictions—scaling features doesn't change how trees split. However, feature importance can still be affected by scale in subtle ways.
Where scale matters for importance:
Permutation importance scoring: If using metrics like MSE or MAE (rather than R²), the absolute importance values depend on target scale
Comparing across datasets: Raw importance values can't be compared between different datasets or even different train/val splits without normalization
Mixed-scale features in boosting: Tree-based boosters are largely scale-invariant, but boosters with linear base learners (e.g., XGBoost's gblinear) are sensitive to feature scale, so their importance values shift when features are rescaled
Numerical precision: Very large or very small feature values can cause numerical issues in some implementations
For tree-based impurity importance and permutation importance, feature scaling (standardization, min-max scaling) typically has NO effect on importance rankings. This is unlike coefficient-based importance for linear models, where scaling dramatically changes how the coefficients must be interpreted.
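A quick way to convince yourself of this scale-invariance is to standardize the features and compare importances before and after. The sketch below uses a synthetic dataset and identical random seeds so the two forests are directly comparable:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=6, n_informative=3,
                           random_state=0)
X_scaled = StandardScaler().fit_transform(X)

rf_raw = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
rf_scaled = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_scaled, y)

# Tree splits depend only on value ordering, so a monotone rescaling leaves
# impurity importances (and their ranking) essentially unchanged.
print("raw:   ", np.round(rf_raw.feature_importances_, 3))
print("scaled:", np.round(rf_scaled.feature_importances_, 3))
print("same ranking:",
      np.array_equal(np.argsort(rf_raw.feature_importances_),
                     np.argsort(rf_scaled.feature_importances_)))
```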
Normalization of importance scores:
To make importance scores interpretable and comparable:
Sum-to-one normalization: Divide each importance by the sum of all importances $$Imp_{norm}(j) = \frac{Imp(j)}{\sum_k Imp(k)}$$
Min-max normalization: Scale to [0, 1] range $$Imp_{scaled}(j) = \frac{Imp(j) - Imp_{min}}{Imp_{max} - Imp_{min}}$$
Rank normalization: Use ranks instead of raw values $$Rank(j) = \text{position of feature j when sorted by importance}$$
Standard z-score: Especially useful for comparing across models $$z_j = \frac{Imp(j) - \mu_{imp}}{\sigma_{imp}}$$
| Normalization | Use When | Interpretation |
|---|---|---|
| Sum-to-one | Comparing feature contributions within model | % of total importance |
| Min-max | Visualizing importance on [0,1] scale | Relative importance |
| Rank | Comparing across different model types | Ordinal importance |
| Z-score | Statistical significance testing | Std devs from mean importance |
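A small sketch of the normalization schemes summarized above (the raw importance values are made up for illustration):

```python
import numpy as np
import pandas as pd
from scipy.stats import rankdata

def normalize_importance(imp, method="sum"):
    """Normalize a vector of raw importance scores."""
    imp = np.asarray(imp, dtype=float)
    if method == "sum":      # fraction of total importance
        return imp / imp.sum()
    if method == "minmax":   # scale to [0, 1]
        return (imp - imp.min()) / (imp.max() - imp.min())
    if method == "rank":     # 1 = most important
        return rankdata(-imp, method="ordinal")
    if method == "zscore":   # std devs from mean importance
        return (imp - imp.mean()) / imp.std()
    raise ValueError(f"unknown method: {method}")

raw = np.array([0.42, 0.25, 0.18, 0.10, 0.05])  # illustrative raw scores
print(pd.DataFrame({m: normalize_importance(raw, m)
                    for m in ["sum", "minmax", "rank", "zscore"]},
                   index=[f"feature_{i}" for i in range(len(raw))]).round(3))
```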
Feature importance estimates can vary dramatically based on which samples are in your training and validation sets. This sampling instability is a form of bias when it causes you to draw confident conclusions from unstable estimates.
Sources of instability:
Random seed sensitivity: Different random seeds during training produce different tree structures, hence different importances
Train/validation split: Different splits can rank features very differently, especially for marginal features
Sample size effects: With small datasets, importance estimates have high variance
Class imbalance: Rare class samples can dramatically affect splits near leaves, causing volatile importance
If Feature A ranks 3rd in one run and 8th in another, the ranking difference might be noise. Always assess stability before drawing conclusions about feature rankings.
```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from scipy.stats import kendalltau

def assess_importance_stability(X, y, feature_names=None, n_runs=10, test_size=0.3):
    """
    Assess stability of feature importance across multiple runs.

    Args:
        X, y: Dataset
        feature_names: Feature names
        n_runs: Number of different random seeds to test
        test_size: Validation set size

    Returns:
        DataFrame with stability metrics
    """
    n_features = X.shape[1]
    if feature_names is None:
        feature_names = [f"feature_{i}" for i in range(n_features)]

    # Store importances across runs
    impurity_imps = np.zeros((n_runs, n_features))
    perm_imps = np.zeros((n_runs, n_features))

    for run in range(n_runs):
        # Different random seed for each run
        X_train, X_val, y_train, y_val = train_test_split(
            X, y, test_size=test_size, random_state=run
        )

        rf = RandomForestClassifier(n_estimators=100, random_state=run)
        rf.fit(X_train, y_train)

        impurity_imps[run] = rf.feature_importances_

        perm_result = permutation_importance(
            rf, X_val, y_val, n_repeats=10, random_state=run, n_jobs=-1
        )
        perm_imps[run] = perm_result.importances_mean

    # Compute stability metrics
    results = pd.DataFrame({
        'feature': feature_names,
        'impurity_mean': impurity_imps.mean(axis=0),
        'impurity_std': impurity_imps.std(axis=0),
        'perm_mean': perm_imps.mean(axis=0),
        'perm_std': perm_imps.std(axis=0),
    })

    # Coefficient of variation (lower = more stable)
    results['impurity_cv'] = results['impurity_std'] / results['impurity_mean'].replace(0, np.inf)
    results['perm_cv'] = results['perm_std'] / results['perm_mean'].abs().replace(0, np.inf)

    # Rank of each feature in each run (double argsort converts importance
    # values into per-feature ranks; 0 = most important)
    impurity_ranks = np.argsort(np.argsort(-impurity_imps, axis=1), axis=1)
    perm_ranks = np.argsort(np.argsort(-perm_imps, axis=1), axis=1)

    results['impurity_mean_rank'] = impurity_ranks.mean(axis=0)
    results['impurity_rank_std'] = impurity_ranks.std(axis=0)
    results['perm_mean_rank'] = perm_ranks.mean(axis=0)
    results['perm_rank_std'] = perm_ranks.std(axis=0)

    # Overall stability score (0-1, higher = more stable)
    max_rank_std = n_features / 2  # Maximum possible rank std
    results['stability_score'] = 1 - (results['perm_rank_std'] / max_rank_std)

    # Compute Kendall's Tau between consecutive runs (rank correlation)
    rank_correlations = []
    for i in range(n_runs - 1):
        tau, _ = kendalltau(perm_imps[i], perm_imps[i + 1])
        rank_correlations.append(tau)

    print("Importance Stability Analysis")
    print("=" * 70)
    print(f"Runs: {n_runs}, Dataset size: {len(X)}")
    print(f"Average rank correlation between runs: {np.mean(rank_correlations):.3f}")
    print()

    return results.sort_values('perm_mean', ascending=False)

# Example
if __name__ == "__main__":
    from sklearn.datasets import make_classification

    # Create dataset with some clearly important and some marginal features
    X, y = make_classification(
        n_samples=800,  # Moderate size - some instability expected
        n_features=15,
        n_informative=6,
        n_redundant=3,
        n_clusters_per_class=2,
        random_state=42
    )
    feature_names = [f"feature_{i}" for i in range(15)]

    stability = assess_importance_stability(X, y, feature_names, n_runs=10)

    print("Stability Results:")
    print("-" * 70)
    display_cols = ['feature', 'perm_mean', 'perm_std',
                    'perm_mean_rank', 'perm_rank_std', 'stability_score']
    print(stability[display_cols].round(3).to_string(index=False))

    # Identify unstable features
    unstable = stability[stability['stability_score'] < 0.7]
    stable = stability[stability['stability_score'] >= 0.7]

    print(f"\nStable features (stability >= 0.7): {len(stable)}")
    print(f"Unstable features (stability < 0.7): {len(unstable)}")

    if len(unstable) > 0:
        print("\n⚠️ Unstable features (rankings vary significantly across runs):")
        for _, row in unstable.iterrows():
            print(f"  • {row['feature']}: rank std = {row['perm_rank_std']:.1f}")
```

Having identified the major biases, let's consolidate strategies for obtaining more reliable feature importance estimates.
Tree regularization constraints such as max_depth and min_samples_split reduce cardinality bias. The table below summarizes each bias, which methods it affects, and how to mitigate it.

| Bias Type | Affects | Primary Mitigation | Secondary Mitigation |
|---|---|---|---|
| Cardinality | Impurity-based | Use permutation importance | Regularize trees |
| Correlation | All methods | Group analysis or SHAP | Report combined importance |
| Leakage | All methods | Domain expert review | Train/val gap analysis |
| Sampling | All methods | Multiple runs + stability | Cross-validation |
| Overfitting | Training-set importance | Use validation set | Regularization |
```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone
from scipy import stats

def robust_feature_importance(model, X, y, feature_names=None,
                              n_splits=5, n_repeats=3, perm_repeats=20):
    """
    Compute robust feature importance with bias-aware methodology.

    Combines multiple methods, cross-validation, and stability metrics
    to produce reliable importance estimates.

    Args:
        model: Base model
        X, y: Full dataset
        feature_names: Feature names
        n_splits: CV splits
        n_repeats: Number of full analysis repeats
        perm_repeats: Permutation importance repeats per split

    Returns:
        Comprehensive importance analysis DataFrame
    """
    n_features = X.shape[1]
    if feature_names is None:
        feature_names = [f"F{i}" for i in range(n_features)]

    # Storage for results across repeats
    all_impurity = []
    all_perm_train = []
    all_perm_val = []

    for repeat in range(n_repeats):
        kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=repeat * 42)

        for train_idx, val_idx in kfold.split(X, y):
            X_train, X_val = X[train_idx], X[val_idx]
            y_train, y_val = y[train_idx], y[val_idx]

            # Train model
            rf = clone(model)
            rf.fit(X_train, y_train)

            # Impurity importance
            all_impurity.append(rf.feature_importances_)

            # Permutation on train
            perm_train = permutation_importance(
                rf, X_train, y_train, n_repeats=perm_repeats, n_jobs=-1
            )
            all_perm_train.append(perm_train.importances_mean)

            # Permutation on validation
            perm_val = permutation_importance(
                rf, X_val, y_val, n_repeats=perm_repeats, n_jobs=-1
            )
            all_perm_val.append(perm_val.importances_mean)

    # Convert to arrays
    all_impurity = np.array(all_impurity)
    all_perm_train = np.array(all_perm_train)
    all_perm_val = np.array(all_perm_val)

    # Compile results
    results = pd.DataFrame({
        'feature': feature_names,
        # Impurity importance
        'impurity_mean': all_impurity.mean(axis=0),
        'impurity_std': all_impurity.std(axis=0),
        # Permutation importance (validation - the one we trust)
        'perm_val_mean': all_perm_val.mean(axis=0),
        'perm_val_std': all_perm_val.std(axis=0),
        # Train/val ratio (leakage indicator)
        'train_val_ratio': all_perm_train.mean(axis=0) / np.maximum(all_perm_val.mean(axis=0), 0.001),
    })

    # Statistical significance of importance
    n = len(all_perm_val)
    t_stats = []
    p_values = []
    for j in range(n_features):
        t_stat, p_val = stats.ttest_1samp(all_perm_val[:, j], 0)
        t_stats.append(t_stat)
        p_values.append(p_val / 2 if t_stat > 0 else 1)  # One-sided

    results['t_statistic'] = t_stats
    results['p_value'] = p_values
    results['significant'] = results['p_value'] < 0.05

    # Stability metrics: per-feature rank in each split (double argsort
    # turns importance values into ranks; 0 = most important)
    ranks = np.argsort(np.argsort(-all_perm_val, axis=1), axis=1)
    results['rank_mean'] = ranks.mean(axis=0)
    results['rank_std'] = ranks.std(axis=0)

    # Confidence intervals
    ci_factor = stats.t.ppf(0.975, df=n - 1)
    results['ci_lower'] = results['perm_val_mean'] - ci_factor * results['perm_val_std'] / np.sqrt(n)
    results['ci_upper'] = results['perm_val_mean'] + ci_factor * results['perm_val_std'] / np.sqrt(n)

    # Quality flags
    results['cardinality_bias_risk'] = (
        (results['impurity_mean'] > 0.1) & (results['perm_val_mean'] < 0.02)
    )
    results['leakage_risk'] = results['train_val_ratio'] > 3
    results['unstable'] = results['rank_std'] > n_features * 0.3

    # Overall reliability score
    results['reliable'] = (
        results['significant'] &
        ~results['cardinality_bias_risk'] &
        ~results['leakage_risk'] &
        ~results['unstable']
    )

    return results.sort_values('perm_val_mean', ascending=False)

# Example usage
if __name__ == "__main__":
    from sklearn.datasets import make_classification

    X, y = make_classification(
        n_samples=1000, n_features=15, n_informative=7,
        n_redundant=3, random_state=42
    )
    feature_names = [f"feature_{i}" for i in range(15)]

    rf = RandomForestClassifier(n_estimators=100)

    print("Robust Feature Importance Analysis")
    print("=" * 80)

    results = robust_feature_importance(
        rf, X, y, feature_names,
        n_splits=5, n_repeats=3, perm_repeats=15
    )

    print("\nTop Features (sorted by validation permutation importance):")
    print("-" * 80)
    display_cols = ['feature', 'perm_val_mean', 'ci_lower', 'ci_upper',
                    'significant', 'reliable']
    print(results.head(10)[display_cols].to_string(index=False))

    # Summary statistics
    reliable = results[results['reliable']]
    unreliable = results[~results['reliable']]

    print(f"\n📊 Summary:")
    print(f"  Reliable important features: {len(reliable[reliable['perm_val_mean'] > 0.02])}")
    print(f"  Flagged for potential bias:  {len(unreliable)}")

    if len(results[results['leakage_risk']]) > 0:
        print(f"\n⚠️ Leakage risk detected for: "
              f"{results[results['leakage_risk']]['feature'].tolist()}")
```

Feature importance biases are not edge cases—they're systematic effects that can fundamentally mislead analysis. Understanding them, and applying the mitigations summarized in the table above, is essential for reliable interpretation.
What's next:
With a thorough understanding of feature importance methods and their biases, the final page of this module provides Interpretation Guidelines—practical frameworks for translating importance scores into actionable insights, communicating findings to stakeholders, and making sound decisions based on feature importance analysis.
You now understand the major biases affecting feature importance estimates, including cardinality bias, correlation effects, data leakage, and sampling instability. You can detect these biases systematically and apply appropriate mitigation strategies to obtain reliable importance measures.