Feature importance scores promise to reveal which variables truly drive your model's predictions. But these scores can be systematically misleading—not through random error, but through structural biases inherent in how they're computed. Understanding these biases is crucial: an analyst who trusts biased importance scores may make catastrophic decisions about feature selection, data collection, or model interpretation.
This page exposes the most dangerous biases affecting feature importance methods, demonstrates how to detect them, and provides practical strategies for obtaining more reliable estimates.
Biased feature importance can lead to: (1) Selecting worthless features while discarding valuable ones, (2) Misinterpreting which factors drive outcomes, (3) Building models that fail when data distributions shift, (4) Spending resources collecting low-value data. These aren't edge cases—they happen regularly in practice.
By the end of this page, you will understand: (1) Cardinality bias and how it inflates importance of high-cardinality features, (2) How feature correlations distort importance estimates, (3) Data leakage detection through importance analysis, (4) Sampling and scale biases, and (5) Strategies for bias mitigation.
Cardinality bias is the most well-documented bias in impurity-based importance. Features with more unique values receive inflated importance scores simply because they offer more potential split points.
The mechanism:
Consider a decision tree choosing where to split. A continuous feature with 1000 unique values offers 999 potential split points. A binary feature offers exactly 1 split point. Even if both features are equally predictive, the high-cardinality feature has far more opportunities to find a split that happens to reduce impurity—especially if there's noise in the data.
This effect is purely mechanical: more candidates → higher probability of finding a good split → higher accumulated importance.
| Feature Type | Unique Values | Split Opportunities | Bias Level |
|---|---|---|---|
| Binary | 2 | 1 | Low |
| Ordinal (5 levels) | 5 | 4 | Low |
| Categorical (50 classes) | 50 | Many (combinatorial) | Medium-High |
| Continuous | ~N | N-1 ≈ thousands | High |
| ID/Unique identifier | N | N-1 (every sample) | Extreme |
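To see the split-opportunity counts from the table above concretely, here is a minimal sketch (synthetic data; the feature names are invented for illustration) that simply counts candidate thresholds per feature:

```python
import numpy as np

# Minimal sketch: count candidate split thresholds for features of
# different cardinality (a tree can split between any two distinct values).
rng = np.random.default_rng(0)
n = 1000

features = {
    "binary_flag": rng.integers(0, 2, n),   # 2 unique values
    "ordinal_5":   rng.integers(0, 5, n),   # 5 unique values
    "continuous":  rng.normal(size=n),      # ~n unique values
    "row_id":      np.arange(n),            # unique per sample
}

for name, values in features.items():
    n_unique = len(np.unique(values))
    print(f"{name:<12} unique={n_unique:<5} candidate splits={n_unique - 1}")
```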
If you accidentally include a random ID column in your features, impurity-based importance will often rank it as the MOST important feature—despite having zero predictive value. This is pure cardinality bias: unique values for every sample = maximum split opportunities.
```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

def demonstrate_cardinality_bias():
    """
    Demonstrate how cardinality bias affects impurity-based importance.
    """
    np.random.seed(42)
    n_samples = 2000

    # Create features with IDENTICAL predictive power but different cardinality
    # Binary feature (2 unique values) - Highly predictive
    binary_feature = (np.random.randn(n_samples) > 0).astype(float)

    # Continuous feature (many unique values) - Equally predictive
    continuous_feature = np.random.randn(n_samples)

    # Random ID (unique per sample) - ZERO predictive power
    random_id = np.random.permutation(n_samples).astype(float)

    # Target: depends equally on binary and continuous, NOT on random_id
    noise = np.random.randn(n_samples) * 0.3
    y = ((binary_feature * 2 - 1) + continuous_feature + noise > 0).astype(int)

    X = np.column_stack([binary_feature, continuous_feature, random_id])
    feature_names = ['binary_predictive', 'continuous_predictive', 'random_id_noise']

    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

    # Train random forest
    rf = RandomForestClassifier(n_estimators=200, max_depth=None, random_state=42)
    rf.fit(X_train, y_train)

    # Get impurity-based importance
    impurity_imp = rf.feature_importances_

    # Get permutation importance (on validation set)
    perm_result = permutation_importance(rf, X_val, y_val, n_repeats=30, random_state=42)
    perm_imp = perm_result.importances_mean

    # Display results
    print("Cardinality Bias Demonstration")
    print("=" * 75)
    print(f"{'Feature':<25} {'Cardinality':<12} {'Impurity Imp':<15} {'Permutation Imp'}")
    print("-" * 75)

    cardinalities = [2, len(np.unique(continuous_feature)), n_samples]
    for i, name in enumerate(feature_names):
        print(f"{name:<25} {cardinalities[i]:<12} {impurity_imp[i]:<15.4f} {perm_imp[i]:.4f}")

    print("\n📊 Analysis:")
    print(f"  • Binary feature cardinality: 2, Continuous: ~{int(len(np.unique(continuous_feature)))}, Random ID: {n_samples}")
    print(f"  • Impurity importance ranks: {np.argsort(-impurity_imp) + 1}")
    print(f"  • Permutation importance ranks: {np.argsort(-perm_imp) + 1}")

    if impurity_imp[2] > impurity_imp[0]:
        print("\n  ⚠️ CARDINALITY BIAS DETECTED!")
        print("  The random ID (zero predictive power) appears MORE important")
        print("  than the truly predictive binary feature according to impurity importance!")

    print("\n  ✅ Permutation importance correctly identifies the random ID as unimportant")

    return {
        'impurity': impurity_imp,
        'permutation': perm_imp,
        'feature_names': feature_names
    }

if __name__ == "__main__":
    demonstrate_cardinality_bias()
```

Mitigation strategies for cardinality bias:
Use permutation importance: It directly measures predictive contribution, so high-cardinality features that don't generalize show low importance
Regularize trees: Shallower trees (lower max_depth) are forced to use truly informative features first, reducing cardinality exploitation
Bin continuous features: Converting continuous to ordinal (e.g., quantile bins) equalizes cardinality across features, though this loses information
Use corrected importance measures: Unbiased split-selection algorithms (e.g., conditional inference forests) avoid favoring high-cardinality features by design; in scikit-learn, lowering max_features limits how often a noisy high-cardinality feature is even available at a split, which dampens the bias without fully removing it
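As a rough sketch of the binning and regularization strategies above (the dataset is synthetic, and the bin count and depth cap are arbitrary choices for illustration, not recommendations), quantile-binning and limiting tree depth both shrink the number of splits a noisy high-cardinality feature can exploit:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(42)
n = 2000

signal = rng.normal(size=n)         # truly predictive, continuous
noise_feature = rng.normal(size=n)  # high-cardinality, pure noise
y = (signal + rng.normal(scale=0.5, size=n) > 0).astype(int)
X = np.column_stack([signal, noise_feature])

# Mitigation 1: quantile-bin continuous features so both have ~10 levels
binner = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")
X_binned = binner.fit_transform(X)

# Mitigation 2: regularize trees so deep splits on noise disappear
deep    = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
shallow = RandomForestClassifier(n_estimators=200, max_depth=4, random_state=0).fit(X, y)
binned  = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_binned, y)

for label, model in [("deep/raw", deep), ("max_depth=4", shallow), ("binned", binned)]:
    imp = model.feature_importances_
    print(f"{label:<12} signal={imp[0]:.3f}  noise={imp[1]:.3f}")
```

The noise feature's share of impurity importance should drop noticeably under both mitigations, since it no longer enjoys an advantage in available split points.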
When features are correlated, importance gets distributed among them in ways that can be misleading. This affects all importance methods, though in different ways.
For impurity-based importance:
When features A and B are highly correlated, splits on either achieve similar impurity reduction. The tree might arbitrarily choose one over the other at different nodes, splitting the "credit" between them. Neither feature appears as important as it truly is for the target.
For permutation importance:
Shuffling feature A destroys its correlation with the target AND with feature B. But the model can still use B (which remains correlated with the target). This makes A appear less important than it would be if B didn't exist.
For drop-column importance:
Removing feature A allows the model to rely on B during retraining. If B can fully substitute for A, then A's drop-column importance is near zero—even if A is extremely predictive in isolation.
Correlated features can each show LOW individual importance while together being HIGHLY important. If you select features based on individual importance, you might exclude an entire cluster of correlated features that collectively carry most predictive signal.
Example: Height and Weight
Consider predicting heart disease risk, where both height and weight matter. Since height and weight are correlated:
| Method | Height | Weight |
|---|---|---|
| Impurity | 0.15 | 0.12 |
| Permutation | 0.08 | 0.06 |
| Drop-Column | 0.02 | 0.01 |
| Drop-Both | — | 0.45 combined |
```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.base import clone

def demonstrate_correlation_bias():
    """
    Show how correlated features split importance and each appears
    less important than they would individually.
    """
    np.random.seed(42)
    n_samples = 1500

    # Create a "true signal" feature
    true_signal = np.random.randn(n_samples)

    # Create 3 features: 2 correlated versions of true signal, 1 independent
    # Feature A: true signal + small noise
    feature_a = true_signal + np.random.randn(n_samples) * 0.2
    # Feature B: true signal + small noise (correlated with A)
    feature_b = true_signal + np.random.randn(n_samples) * 0.2
    # Feature C: independent, moderately predictive
    feature_c = np.random.randn(n_samples)

    # Target depends on true_signal AND feature_c
    y = (true_signal + feature_c * 0.5 + np.random.randn(n_samples) * 0.3 > 0).astype(int)

    print(f"Correlation between A and B: {np.corrcoef(feature_a, feature_b)[0,1]:.3f}")
    print(f"Correlation between A and true_signal: {np.corrcoef(feature_a, true_signal)[0,1]:.3f}")
    print()

    # Scenario 1: Model with ONLY feature A (no correlation issue)
    X_only_a = feature_a.reshape(-1, 1)
    rf_a = RandomForestClassifier(n_estimators=100, random_state=42)
    rf_a.fit(X_only_a, y)
    print(f"Model with only A - Accuracy: {rf_a.score(X_only_a, y):.3f}")

    # Scenario 2: Model with A, B, and C (correlation between A and B)
    X_all = np.column_stack([feature_a, feature_b, feature_c])
    rf_all = RandomForestClassifier(n_estimators=100, random_state=42)
    rf_all.fit(X_all, y)
    print(f"Model with A, B, C - Accuracy: {rf_all.score(X_all, y):.3f}")

    # Compare importance
    print("\nImpurity-Based Importance:")
    print(f"  Only A model: Feature A = {rf_a.feature_importances_[0]:.3f}")
    print(f"  With A,B,C:   Feature A = {rf_all.feature_importances_[0]:.3f}")
    print(f"                Feature B = {rf_all.feature_importances_[1]:.3f}")
    print(f"                Feature C = {rf_all.feature_importances_[2]:.3f}")
    print(f"  → A's importance DROPPED from {rf_a.feature_importances_[0]:.3f} to {rf_all.feature_importances_[0]:.3f}")
    print(f"  → A+B combined: {rf_all.feature_importances_[0] + rf_all.feature_importances_[1]:.3f}")

    # Drop-column analysis reveals the redundancy
    print("\nDrop-Column Importance (reveals redundancy):")
    baseline_score = rf_all.score(X_all, y)

    # Drop A only
    rf_no_a = clone(rf_all)
    rf_no_a.fit(X_all[:, 1:], y)  # Keep B and C
    score_no_a = rf_no_a.score(X_all[:, 1:], y)

    # Drop B only
    rf_no_b = clone(rf_all)
    rf_no_b.fit(X_all[:, [0, 2]], y)  # Keep A and C
    score_no_b = rf_no_b.score(X_all[:, [0, 2]], y)

    # Drop both A and B
    rf_no_ab = clone(rf_all)
    rf_no_ab.fit(X_all[:, 2:], y)  # Keep only C
    score_no_ab = rf_no_ab.score(X_all[:, 2:], y)

    print(f"  Baseline score: {baseline_score:.3f}")
    print(f"  Without A:      {score_no_a:.3f} (importance: {baseline_score - score_no_a:+.3f})")
    print(f"  Without B:      {score_no_b:.3f} (importance: {baseline_score - score_no_b:+.3f})")
    print(f"  Without A+B:    {score_no_ab:.3f} (joint importance: {baseline_score - score_no_ab:+.3f})")

    print("\n📊 Key Insight:")
    print("  • A and B each show LOW individual drop-column importance")
    print("  • But dropping BOTH shows HIGH importance")
    print("  • This is because B can compensate for A (and vice versa)")
    print("  • Their correlated signal is valuable, but individually redundant")

if __name__ == "__main__":
    demonstrate_correlation_bias()
```

Mitigation strategies for correlation bias:
Cluster correlated features: Group highly correlated features and analyze group importance
Hierarchical importance: First measure group importance, then within-group importance
Conditional permutation: Shuffle features only within strata defined by their correlated partners (maintains realistic joint distributions)
SHAP values: Shapley-based methods properly attribute credit among correlated features
Feature selection before importance: Use dimensionality reduction (PCA, feature clustering) first, then measure importance of the reduced features
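The following sketch illustrates the feature-clustering idea under simple assumptions: features are grouped by Spearman correlation, and each group is permuted jointly on a validation set to estimate group-level importance. The clustering threshold and number of repeats are arbitrary choices for demonstration, not a prescribed recipe:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic dataset with redundant (correlated) features
X, y = make_classification(n_samples=1500, n_features=10, n_informative=4,
                           n_redundant=4, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# 1. Cluster features by Spearman correlation (distance = 1 - |rho|)
corr = spearmanr(X_train).correlation
dist = 1 - np.abs(corr)
np.fill_diagonal(dist, 0.0)
clusters = fcluster(linkage(squareform(dist, checks=False), method="average"),
                    t=0.5, criterion="distance")

# 2. Permute entire clusters together to measure group importance
rng = np.random.default_rng(0)
baseline = rf.score(X_val, y_val)
for c in np.unique(clusters):
    cols = np.where(clusters == c)[0]
    drops = []
    for _ in range(10):
        X_perm = X_val.copy()
        idx = rng.permutation(len(X_val))
        X_perm[:, cols] = X_val[idx][:, cols]  # shuffle the whole group jointly
        drops.append(baseline - rf.score(X_perm, y_val))
    print(f"cluster {c} (features {cols.tolist()}): "
          f"group importance = {np.mean(drops):.4f}")
```

Because the group's columns are shuffled with the same row permutation, the within-group joint distribution is preserved while the group's relationship to the target is destroyed, so redundant partners cannot cover for each other.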
Feature importance analysis can be a powerful tool for detecting data leakage—when information from the future or the target leaks into features. Leaked features show suspiciously high importance.
Signs of data leakage in feature importance:
A single feature dominates: If one feature has importance of 0.5+ when you expect distributed importance, investigate
Too-good-to-be-true features: Features that shouldn't logically be so predictive yet rank at the top
Dramatic gap between top and rest: A sharp drop-off in importance after the first few features
Features available only at prediction time: Features that encode future information
Common leakage patterns: (1) 'days_until_churn' when predicting churn (encodes the label), (2) 'account_closed_date' in fraud detection (only exists after fraud investigation), (3) Aggregated statistics computed using future data, (4) Identifier columns that correlate with the target in training but won't generalize.
Detection methodology:
A systematic approach to leakage detection:
```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

def detect_potential_leakage(model, X_train, y_train, X_val, y_val,
                             feature_names, threshold=0.15):
    """
    Detect features that might represent data leakage.

    Args:
        model: Fitted model
        X_train, y_train: Training data
        X_val, y_val: Validation data
        feature_names: List of feature names
        threshold: Importance threshold to flag as suspicious

    Returns:
        DataFrame with leakage analysis
    """
    # Get impurity importance
    impurity_imp = model.feature_importances_

    # Get permutation importance on training AND validation
    perm_train = permutation_importance(model, X_train, y_train, n_repeats=20, n_jobs=-1)
    perm_val = permutation_importance(model, X_val, y_val, n_repeats=20, n_jobs=-1)

    results = pd.DataFrame({
        'feature': feature_names,
        'impurity_importance': impurity_imp,
        'perm_train': perm_train.importances_mean,
        'perm_val': perm_val.importances_mean,
    })

    # Compute leakage indicators
    # 1. Dominance: Does one feature account for a huge chunk?
    results['is_dominant'] = results['impurity_importance'] > threshold

    # 2. Train/val gap: Much higher on train than val suggests overfitting/leakage
    results['train_val_ratio'] = (results['perm_train'] / results['perm_val'].replace(0, 0.001))
    results['suspicious_gap'] = results['train_val_ratio'] > 2.0

    # 3. Negative validation importance: Feature hurts on unseen data
    results['negative_val'] = results['perm_val'] < 0

    # Flag overall suspicion
    results['suspected_leakage'] = (
        results['is_dominant'] | results['suspicious_gap'] | results['negative_val']
    )

    return results.sort_values('impurity_importance', ascending=False)

def create_leaky_dataset():
    """Create a dataset with intentional data leakage for demonstration."""
    np.random.seed(42)
    n_samples = 1000

    # Legitimate features
    feature_1 = np.random.randn(n_samples)
    feature_2 = np.random.randn(n_samples)
    feature_3 = np.random.randn(n_samples)

    # Target: depends on features 1 and 2
    y = (feature_1 + feature_2 * 0.5 + np.random.randn(n_samples) * 0.5 > 0).astype(int)

    # LEAKY FEATURE: directly encodes target information
    # Simulates something like "complaint_resolved" when predicting churn
    leaky_feature = y + np.random.randn(n_samples) * 0.1  # Almost perfect predictor

    # SUSPICIOUS FEATURE: ID that happened to correlate in training
    # Will not generalize
    suspicious_id = np.arange(n_samples).astype(float)

    X = np.column_stack([
        feature_1, feature_2, feature_3, leaky_feature, suspicious_id
    ])
    feature_names = [
        'legitimate_A', 'legitimate_B', 'legitimate_C',
        'LEAKY_outcome_derived', 'suspicious_id'
    ]
    return X, y, feature_names

# Demonstration
if __name__ == "__main__":
    X, y, feature_names = create_leaky_dataset()
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.3, random_state=42
    )

    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X_train, y_train)

    print("Data Leakage Detection Analysis")
    print("=" * 80)
    print(f"Training accuracy:   {rf.score(X_train, y_train):.4f}")
    print(f"Validation accuracy: {rf.score(X_val, y_val):.4f}")
    print()

    leakage_analysis = detect_potential_leakage(
        rf, X_train, y_train, X_val, y_val, feature_names
    )

    print("Feature Analysis:")
    print("-" * 80)
    cols = ['feature', 'impurity_importance', 'perm_train', 'perm_val',
            'train_val_ratio', 'suspected_leakage']
    print(leakage_analysis[cols].to_string(index=False))

    # Report suspicious features
    suspicious = leakage_analysis[leakage_analysis['suspected_leakage']]
    if len(suspicious) > 0:
        print("\n⚠️ SUSPECTED LEAKAGE DETECTED:")
        for _, row in suspicious.iterrows():
            reasons = []
            if row['is_dominant']:
                reasons.append(f"dominates importance ({row['impurity_importance']:.2f})")
            if row['suspicious_gap']:
                reasons.append(f"train/val ratio = {row['train_val_ratio']:.1f}x")
            if row['negative_val']:
                reasons.append("negative validation importance")
            print(f"  • {row['feature']}: {', '.join(reasons)}")

        print("\n  Recommendation: Review these features with domain experts before deployment!")
```

Unlike many ML algorithms, tree-based methods are generally scale-invariant for raw predictions—scaling features doesn't change how trees split. However, feature importance can still be affected by scale in subtle ways.
Where scale matters for importance:
Permutation importance scoring: If using metrics like MSE or MAE (rather than R²), the absolute importance values depend on target scale
Comparing across datasets: Raw importance values can't be compared between different datasets or even different train/val splits without normalization
Mixed-scale features in boosting: Tree-based boosters are largely scale-invariant, but boosters with linear base learners (e.g., XGBoost's gblinear) are sensitive to feature scale, so their importance values shift when features are rescaled
Numerical precision: Very large or very small feature values can cause numerical issues in some implementations
For tree-based impurity importance and permutation importance, feature scaling (standardization, min-max scaling) typically has NO effect on importance rankings. This is unlike coefficient-based importance for linear models, where scaling dramatically changes how the coefficients must be interpreted.
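A quick way to convince yourself of this scale-invariance is to standardize the features and compare importances before and after. The sketch below uses a synthetic dataset and identical random seeds so the two forests are directly comparable:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=6, n_informative=3,
                           random_state=0)
X_scaled = StandardScaler().fit_transform(X)

rf_raw = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
rf_scaled = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_scaled, y)

# Tree splits depend only on value ordering, so a monotone rescaling leaves
# impurity importances (and their ranking) essentially unchanged.
print("raw:   ", np.round(rf_raw.feature_importances_, 3))
print("scaled:", np.round(rf_scaled.feature_importances_, 3))
print("same ranking:",
      np.array_equal(np.argsort(rf_raw.feature_importances_),
                     np.argsort(rf_scaled.feature_importances_)))
```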
Normalization of importance scores:
To make importance scores interpretable and comparable:
Sum-to-one normalization: Divide each importance by the sum of all importances $$Imp_{norm}(j) = \frac{Imp(j)}{\sum_k Imp(k)}$$
Min-max normalization: Scale to [0, 1] range $$Imp_{scaled}(j) = \frac{Imp(j) - Imp_{min}}{Imp_{max} - Imp_{min}}$$
Rank normalization: Use ranks instead of raw values $$Rank(j) = \text{position of feature j when sorted by importance}$$
Standard z-score: Especially useful for comparing across models $$z_j = \frac{Imp(j) - \mu_{imp}}{\sigma_{imp}}$$
| Normalization | Use When | Interpretation |
|---|---|---|
| Sum-to-one | Comparing feature contributions within model | % of total importance |
| Min-max | Visualizing importance on [0,1] scale | Relative importance |
| Rank | Comparing across different model types | Ordinal importance |
| Z-score | Statistical significance testing | Std devs from mean importance |
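A small sketch of the normalization schemes summarized above (the raw importance values are made up for illustration):

```python
import numpy as np
import pandas as pd
from scipy.stats import rankdata

def normalize_importance(imp, method="sum"):
    """Normalize a vector of raw importance scores."""
    imp = np.asarray(imp, dtype=float)
    if method == "sum":      # fraction of total importance
        return imp / imp.sum()
    if method == "minmax":   # scale to [0, 1]
        return (imp - imp.min()) / (imp.max() - imp.min())
    if method == "rank":     # 1 = most important
        return rankdata(-imp, method="ordinal")
    if method == "zscore":   # std devs from mean importance
        return (imp - imp.mean()) / imp.std()
    raise ValueError(f"unknown method: {method}")

raw = np.array([0.42, 0.25, 0.18, 0.10, 0.05])  # illustrative raw scores
print(pd.DataFrame({m: normalize_importance(raw, m)
                    for m in ["sum", "minmax", "rank", "zscore"]},
                   index=[f"feature_{i}" for i in range(len(raw))]).round(3))
```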
Feature importance estimates can vary dramatically based on which samples are in your training and validation sets. This sampling instability is a form of bias when it causes you to draw confident conclusions from unstable estimates.
Sources of instability:
Random seed sensitivity: Different random seeds during training produce different tree structures, hence different importances
Train/validation split: Different splits can rank features very differently, especially for marginal features
Sample size effects: With small datasets, importance estimates have high variance
Class imbalance: Rare class samples can dramatically affect splits near leaves, causing volatile importance
If Feature A ranks 3rd in one run and 8th in another, the ranking difference might be noise. Always assess stability before drawing conclusions about feature rankings.
```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from scipy.stats import kendalltau

def assess_importance_stability(X, y, feature_names=None, n_runs=10, test_size=0.3):
    """
    Assess stability of feature importance across multiple runs.

    Args:
        X, y: Dataset
        feature_names: Feature names
        n_runs: Number of different random seeds to test
        test_size: Validation set size

    Returns:
        DataFrame with stability metrics
    """
    n_features = X.shape[1]
    if feature_names is None:
        feature_names = [f"feature_{i}" for i in range(n_features)]

    # Store importances across runs
    impurity_imps = np.zeros((n_runs, n_features))
    perm_imps = np.zeros((n_runs, n_features))

    for run in range(n_runs):
        # Different random seed for each run
        X_train, X_val, y_train, y_val = train_test_split(
            X, y, test_size=test_size, random_state=run
        )

        rf = RandomForestClassifier(n_estimators=100, random_state=run)
        rf.fit(X_train, y_train)

        impurity_imps[run] = rf.feature_importances_

        perm_result = permutation_importance(
            rf, X_val, y_val, n_repeats=10, random_state=run, n_jobs=-1
        )
        perm_imps[run] = perm_result.importances_mean

    # Compute stability metrics
    results = pd.DataFrame({
        'feature': feature_names,
        'impurity_mean': impurity_imps.mean(axis=0),
        'impurity_std': impurity_imps.std(axis=0),
        'perm_mean': perm_imps.mean(axis=0),
        'perm_std': perm_imps.std(axis=0),
    })

    # Coefficient of variation (lower = more stable)
    results['impurity_cv'] = results['impurity_std'] / results['impurity_mean'].replace(0, np.inf)
    results['perm_cv'] = results['perm_std'] / results['perm_mean'].abs().replace(0, np.inf)

    # Rank of each feature in each run (double argsort converts importance
    # values into per-feature ranks; 0 = most important)
    impurity_ranks = np.argsort(np.argsort(-impurity_imps, axis=1), axis=1)
    perm_ranks = np.argsort(np.argsort(-perm_imps, axis=1), axis=1)

    results['impurity_mean_rank'] = impurity_ranks.mean(axis=0)
    results['impurity_rank_std'] = impurity_ranks.std(axis=0)
    results['perm_mean_rank'] = perm_ranks.mean(axis=0)
    results['perm_rank_std'] = perm_ranks.std(axis=0)

    # Overall stability score (0-1, higher = more stable)
    max_rank_std = n_features / 2  # Maximum possible rank std
    results['stability_score'] = 1 - (results['perm_rank_std'] / max_rank_std)

    # Compute Kendall's Tau between consecutive runs (rank correlation)
    rank_correlations = []
    for i in range(n_runs - 1):
        tau, _ = kendalltau(perm_imps[i], perm_imps[i + 1])
        rank_correlations.append(tau)

    print("Importance Stability Analysis")
    print("=" * 70)
    print(f"Runs: {n_runs}, Dataset size: {len(X)}")
    print(f"Average rank correlation between runs: {np.mean(rank_correlations):.3f}")
    print()

    return results.sort_values('perm_mean', ascending=False)

# Example
if __name__ == "__main__":
    from sklearn.datasets import make_classification

    # Create dataset with some clearly important and some marginal features
    X, y = make_classification(
        n_samples=800,  # Moderate size - some instability expected
        n_features=15,
        n_informative=6,
        n_redundant=3,
        n_clusters_per_class=2,
        random_state=42
    )
    feature_names = [f"feature_{i}" for i in range(15)]

    stability = assess_importance_stability(X, y, feature_names, n_runs=10)

    print("Stability Results:")
    print("-" * 70)
    display_cols = ['feature', 'perm_mean', 'perm_std',
                    'perm_mean_rank', 'perm_rank_std', 'stability_score']
    print(stability[display_cols].round(3).to_string(index=False))

    # Identify unstable features
    unstable = stability[stability['stability_score'] < 0.7]
    stable = stability[stability['stability_score'] >= 0.7]

    print(f"\nStable features (stability >= 0.7): {len(stable)}")
    print(f"Unstable features (stability < 0.7): {len(unstable)}")

    if len(unstable) > 0:
        print("\n⚠️ Unstable features (rankings vary significantly across runs):")
        for _, row in unstable.iterrows():
            print(f"  • {row['feature']}: rank std = {row['perm_rank_std']:.1f}")
```

Having identified the major biases, let's consolidate strategies for obtaining more reliable feature importance estimates.
Tree regularization constraints such as max_depth and min_samples_split reduce cardinality bias. The table below summarizes each bias, which methods it affects, and how to mitigate it.

| Bias Type | Affects | Primary Mitigation | Secondary Mitigation |
|---|---|---|---|
| Cardinality | Impurity-based | Use permutation importance | Regularize trees |
| Correlation | All methods | Group analysis or SHAP | Report combined importance |
| Leakage | All methods | Domain expert review | Train/val gap analysis |
| Sampling | All methods | Multiple runs + stability | Cross-validation |
| Overfitting | Training-set importance | Use validation set | Regularization |
```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone
from scipy import stats

def robust_feature_importance(model, X, y, feature_names=None,
                              n_splits=5, n_repeats=3, perm_repeats=20):
    """
    Compute robust feature importance with bias-aware methodology.

    Combines multiple methods, cross-validation, and stability metrics
    to produce reliable importance estimates.

    Args:
        model: Base model
        X, y: Full dataset
        feature_names: Feature names
        n_splits: CV splits
        n_repeats: Number of full analysis repeats
        perm_repeats: Permutation importance repeats per split

    Returns:
        Comprehensive importance analysis DataFrame
    """
    n_features = X.shape[1]
    if feature_names is None:
        feature_names = [f"F{i}" for i in range(n_features)]

    # Storage for results across repeats
    all_impurity = []
    all_perm_train = []
    all_perm_val = []

    for repeat in range(n_repeats):
        kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=repeat * 42)

        for train_idx, val_idx in kfold.split(X, y):
            X_train, X_val = X[train_idx], X[val_idx]
            y_train, y_val = y[train_idx], y[val_idx]

            # Train model
            rf = clone(model)
            rf.fit(X_train, y_train)

            # Impurity importance
            all_impurity.append(rf.feature_importances_)

            # Permutation on train
            perm_train = permutation_importance(
                rf, X_train, y_train, n_repeats=perm_repeats, n_jobs=-1
            )
            all_perm_train.append(perm_train.importances_mean)

            # Permutation on validation
            perm_val = permutation_importance(
                rf, X_val, y_val, n_repeats=perm_repeats, n_jobs=-1
            )
            all_perm_val.append(perm_val.importances_mean)

    # Convert to arrays
    all_impurity = np.array(all_impurity)
    all_perm_train = np.array(all_perm_train)
    all_perm_val = np.array(all_perm_val)

    # Compile results
    results = pd.DataFrame({
        'feature': feature_names,
        # Impurity importance
        'impurity_mean': all_impurity.mean(axis=0),
        'impurity_std': all_impurity.std(axis=0),
        # Permutation importance (validation - the one we trust)
        'perm_val_mean': all_perm_val.mean(axis=0),
        'perm_val_std': all_perm_val.std(axis=0),
        # Train/val ratio (leakage indicator)
        'train_val_ratio': all_perm_train.mean(axis=0) / np.maximum(all_perm_val.mean(axis=0), 0.001),
    })

    # Statistical significance of importance
    n = len(all_perm_val)
    t_stats = []
    p_values = []
    for j in range(n_features):
        t_stat, p_val = stats.ttest_1samp(all_perm_val[:, j], 0)
        t_stats.append(t_stat)
        p_values.append(p_val / 2 if t_stat > 0 else 1)  # One-sided

    results['t_statistic'] = t_stats
    results['p_value'] = p_values
    results['significant'] = results['p_value'] < 0.05

    # Stability metrics: per-feature rank in each split (double argsort
    # turns importance values into ranks; 0 = most important)
    ranks = np.argsort(np.argsort(-all_perm_val, axis=1), axis=1)
    results['rank_mean'] = ranks.mean(axis=0)
    results['rank_std'] = ranks.std(axis=0)

    # Confidence intervals
    ci_factor = stats.t.ppf(0.975, df=n - 1)
    results['ci_lower'] = results['perm_val_mean'] - ci_factor * results['perm_val_std'] / np.sqrt(n)
    results['ci_upper'] = results['perm_val_mean'] + ci_factor * results['perm_val_std'] / np.sqrt(n)

    # Quality flags
    results['cardinality_bias_risk'] = (
        (results['impurity_mean'] > 0.1) & (results['perm_val_mean'] < 0.02)
    )
    results['leakage_risk'] = results['train_val_ratio'] > 3
    results['unstable'] = results['rank_std'] > n_features * 0.3

    # Overall reliability score
    results['reliable'] = (
        results['significant'] &
        ~results['cardinality_bias_risk'] &
        ~results['leakage_risk'] &
        ~results['unstable']
    )

    return results.sort_values('perm_val_mean', ascending=False)

# Example usage
if __name__ == "__main__":
    from sklearn.datasets import make_classification

    X, y = make_classification(
        n_samples=1000, n_features=15, n_informative=7,
        n_redundant=3, random_state=42
    )
    feature_names = [f"feature_{i}" for i in range(15)]

    rf = RandomForestClassifier(n_estimators=100)

    print("Robust Feature Importance Analysis")
    print("=" * 80)

    results = robust_feature_importance(
        rf, X, y, feature_names,
        n_splits=5, n_repeats=3, perm_repeats=15
    )

    print("\nTop Features (sorted by validation permutation importance):")
    print("-" * 80)
    display_cols = ['feature', 'perm_val_mean', 'ci_lower', 'ci_upper',
                    'significant', 'reliable']
    print(results.head(10)[display_cols].to_string(index=False))

    # Summary statistics
    reliable = results[results['reliable']]
    unreliable = results[~results['reliable']]

    print(f"\n📊 Summary:")
    print(f"  Reliable important features: {len(reliable[reliable['perm_val_mean'] > 0.02])}")
    print(f"  Flagged for potential bias:  {len(unreliable)}")

    if len(results[results['leakage_risk']]) > 0:
        print(f"\n⚠️ Leakage risk detected for: "
              f"{results[results['leakage_risk']]['feature'].tolist()}")
```

Feature importance biases are not edge cases—they're systematic effects that can fundamentally mislead analysis. Understanding them, and applying the mitigations summarized in the table above, is essential for reliable interpretation.
What's next:
With a thorough understanding of feature importance methods and their biases, the final page of this module provides Interpretation Guidelines—practical frameworks for translating importance scores into actionable insights, communicating findings to stakeholders, and making sound decisions based on feature importance analysis.
You now understand the major biases affecting feature importance estimates, including cardinality bias, correlation effects, data leakage, and sampling instability. You can detect these biases systematically and apply appropriate mitigation strategies to obtain reliable importance measures.