Impurity-based importance tells us which features the model uses internally—but does this translate to what actually matters for prediction? A feature might be heavily used during training splits yet contribute little to the model's ability to generalize. Conversely, a feature rarely selected during training might be the one keeping predictions accurate.
Permutation importance addresses this gap directly. Instead of looking at the model's internal mechanics, it asks a simple but profound question: "What happens to model performance if we break the relationship between this feature and the target?" By randomly shuffling each feature's values and measuring the resulting performance drop, we obtain a measure that reflects true predictive contribution.
By the end of this page, you will understand: (1) The theoretical motivation for permutation importance, (2) The complete algorithm and its implementation, (3) Statistical properties including variance and bias, (4) How to interpret negative importance values, and (5) When permutation importance outperforms impurity-based methods.
The brilliance of permutation importance lies in its simplicity. Consider what happens when you randomly shuffle the values of a single feature across samples:
If the model's performance drops significantly after shuffling, the feature must have been carrying information essential for prediction. If performance barely changes, the feature is either uninformative or redundant with other features.
Imagine you're a model trying to predict whether a customer will churn. If someone secretly shuffled the 'days_since_last_purchase' column, mixing up each customer's value with random other customers, you'd suddenly find that feature useless for prediction—even though it 'looks' the same statistically. Permutation importance quantifies exactly this scenario.
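To make the thought experiment concrete, here is a minimal, self-contained sketch (synthetic data rather than the churn example) that shuffles a single column and records the resulting score drop. For brevity it scores on the same data used to fit the model; later sections explain why held-out data is preferable.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Fit a small model on synthetic data
X, y = make_classification(n_samples=500, n_features=4, n_informative=2,
                           random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

baseline = accuracy_score(y, model.predict(X))

# Break the link between feature 0 and the target by shuffling its values
rng = np.random.default_rng(0)
X_shuffled = X.copy()
X_shuffled[:, 0] = rng.permutation(X_shuffled[:, 0])

shuffled = accuracy_score(y, model.predict(X_shuffled))
print(f"Score drop after shuffling feature 0: {baseline - shuffled:.4f}")
```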
Why shuffling instead of removing?
One might ask: why not just remove the feature entirely and retrain? While theoretically cleaner, this approach has critical drawbacks:
| Approach | Pros | Cons |
|---|---|---|
| Shuffling (Permutation) | No retraining required; Fast; Measures marginal contribution | Doesn't account for model adaptation |
| Removing + Retraining | Accounts for feature interactions; Measures true absence | Expensive (retrain per feature); Model changes confound importance |
Permutation importance strikes an excellent balance: it measures the importance of a feature given the current model structure, which is usually what we care about when interpreting a trained model. The next page covers drop-column importance for when retraining is acceptable.
The algorithm for computing permutation importance is straightforward and elegant:
Algorithm: Permutation Importance
Input: Fitted model f, dataset (X, y), scoring function S, number of permutations K
Output: Importance score for each feature
1. Compute baseline score: score_baseline = S(y, f(X))
2. For each feature j in {1, 2, ..., p}:
a. Initialize importance_j = 0
b. For k = 1 to K:
i. Create X_permuted by randomly shuffling column j of X
ii. Compute permuted score: score_permuted = S(y, f(X_permuted))
iii. importance_j += (score_baseline - score_permuted) / K
c. Store importance_j
3. Return importances for all features
Key design choices:
We define importance as (baseline - permuted), so positive importance means performance DROPPED after permutation (the feature was helpful). If performance improves after permutation, importance is negative—a highly suspicious situation we'll discuss later.
```python
import numpy as np
from sklearn.metrics import accuracy_score, r2_score
from typing import Callable, Tuple


def permutation_importance(
    model,
    X: np.ndarray,
    y: np.ndarray,
    scoring_func: Callable,
    n_repeats: int = 10,
    random_state: int = 42
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Compute permutation importance for all features.

    Args:
        model: Fitted model with predict method
        X: Feature matrix (n_samples, n_features)
        y: Target array (n_samples,)
        scoring_func: Function(y_true, y_pred) -> score (higher is better)
        n_repeats: Number of permutation iterations per feature
        random_state: Random seed for reproducibility

    Returns:
        importances_mean: Mean importance for each feature
        importances_std: Standard deviation of importance estimates
    """
    rng = np.random.RandomState(random_state)
    n_samples, n_features = X.shape

    # Compute baseline score
    baseline_score = scoring_func(y, model.predict(X))

    # Store all importance measurements
    importances = np.zeros((n_features, n_repeats))

    for feature_idx in range(n_features):
        for repeat_idx in range(n_repeats):
            # Create a copy to avoid modifying original data
            X_permuted = X.copy()

            # Randomly shuffle this feature's values
            X_permuted[:, feature_idx] = rng.permutation(X[:, feature_idx])

            # Score with permuted feature
            permuted_score = scoring_func(y, model.predict(X_permuted))

            # Importance = drop in performance
            importances[feature_idx, repeat_idx] = baseline_score - permuted_score

    importances_mean = importances.mean(axis=1)
    importances_std = importances.std(axis=1)

    return importances_mean, importances_std


# Example usage with sklearn's built-in implementation
from sklearn.inspection import permutation_importance as sklearn_perm_imp
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate data
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    n_redundant=2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Compute permutation importance on VALIDATION set
result = sklearn_perm_imp(
    rf, X_val, y_val,
    n_repeats=20,
    random_state=42,
    n_jobs=-1
)

# Display results
print("Permutation Importance (Validation Set)")
print("=" * 50)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"Feature {i}: {result.importances_mean[i]:.4f} "
          f"± {result.importances_std[i]:.4f}")
```

A critical decision when computing permutation importance is whether to use the training set or a held-out validation/test set. This choice has profound implications for what the importance scores actually measure.
Training set permutation importance: reflects how heavily the model relies on each feature to reproduce the data it was fit on, including any noise it memorized, so it can reward features that only enabled overfitting.
Validation/test set permutation importance: reflects how much each feature contributes to performance on unseen data, which is what matters for generalization.
Always compute permutation importance on held-out data (validation or test set) when evaluating feature importance for prediction. Training set importance can dramatically overstate the value of noise features and features that enabled overfitting.
```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Create dataset with informative AND pure noise features
np.random.seed(42)
n_samples = 1000
n_informative = 5
n_noise = 5

# Informative features
X_informative = np.random.randn(n_samples, n_informative)
y = (X_informative.sum(axis=1) > 0).astype(int)

# Pure noise features (no relationship to target)
X_noise = np.random.randn(n_samples, n_noise)

# Combine
X = np.hstack([X_informative, X_noise])
feature_names = ([f"informative_{i}" for i in range(n_informative)]
                 + [f"noise_{i}" for i in range(n_noise)])

# Split
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train a deep tree (prone to overfitting)
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=None,  # Fully grown trees - can overfit
    random_state=42
)
rf.fit(X_train, y_train)

print(f"Training accuracy: {rf.score(X_train, y_train):.4f}")
print(f"Validation accuracy: {rf.score(X_val, y_val):.4f}")
print()

# Compute importance on TRAINING set
perm_train = permutation_importance(
    rf, X_train, y_train, n_repeats=20, random_state=42
)

# Compute importance on VALIDATION set
perm_val = permutation_importance(
    rf, X_val, y_val, n_repeats=20, random_state=42
)

# Compare results
results = pd.DataFrame({
    'feature': feature_names,
    'train_importance': perm_train.importances_mean,
    'val_importance': perm_val.importances_mean,
    'true_type': ['informative'] * n_informative + ['noise'] * n_noise
})

print("Training vs Validation Permutation Importance")
print("=" * 70)
print(results.sort_values('train_importance', ascending=False).to_string(index=False))
print()
print("🔍 Key Insight: Notice how noise features may show higher importance")
print("   on training data due to overfitting, but correctly show low importance")
print("   on validation data.")
```

Understanding the statistical behavior of permutation importance helps us interpret results correctly and design reliable analyses.
Variance in permutation importance:
The importance estimate is itself a random quantity: it depends on which random permutations happen to be drawn, on the size of the evaluation set, and on the number of repeats K.
By repeating the permutation process K times, we can estimate the variance in our importance estimate:
$$Var(\hat{I}_j) = \frac{1}{K-1} \sum_{k=1}^{K} \left(I_j^{(k)} - \bar{I}_j\right)^2$$
This variance estimate is crucial for determining whether observed importance differences are statistically significant.
With K ≥ 20 repetitions, you can construct an approximate 95% confidence interval for each feature's mean importance as mean ± 1.96 × std/√K (the standard error of the mean). Features whose confidence intervals overlap likely don't differ significantly in importance.
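A short, self-contained sketch of this normal-approximation interval on synthetic data (the significance-testing example later uses the more careful t-based version):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=800, n_features=6, n_informative=3, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

result = permutation_importance(model, X_val, y_val, n_repeats=20, random_state=0)

# Normal-approximation 95% CI for each feature's MEAN importance:
# std across repeats divided by sqrt(K) is the standard error of the mean.
K = result.importances.shape[1]
se = result.importances_std / np.sqrt(K)
ci_lower = result.importances_mean - 1.96 * se
ci_upper = result.importances_mean + 1.96 * se

for j in range(X.shape[1]):
    print(f"Feature {j}: mean={result.importances_mean[j]:.4f}, "
          f"95% CI=[{ci_lower[j]:.4f}, {ci_upper[j]:.4f}]")
```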
Bias considerations:
Permutation importance has several potential sources of bias:
Feature correlation bias: When features are correlated, permuting one feature can create unrealistic data points that the model has never seen. This can either inflate or deflate importance unpredictably.
Extrapolation bias: If permutation creates out-of-distribution inputs (e.g., a height of 2 meters paired with a weight of 30 kg), the model's predictions become unreliable, and the measured importance reflects extrapolation behavior rather than true importance.
Distribution shift bias: If the dataset's feature distribution is skewed, different samples contribute unequally to the importance calculation.
Handling correlated features:
For highly correlated features, consider these approaches:
| Approach | Description | When to Use |
|---|---|---|
| Group permutation | Shuffle correlated features together | When features are known to be dependent |
| Conditional permutation | Shuffle only within local neighborhoods | When maintaining realistic combinations matters |
| SAGE/SHAP values | Proper Shapley-based attribution | When rigorous causal attribution is needed |
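As an illustration of the first row, here is a minimal sketch of the group-permutation idea. It is not a built-in sklearn function; the helper name and the `groups` mapping are illustrative, and for brevity it scores on the same data used to fit the model (use held-out data in practice, as discussed above). The key point: every column in a group is permuted with the same row shuffle, so within-group correlations stay intact while the group's link to the target is broken.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def group_permutation_importance(model, X, y, groups, n_repeats=10, seed=0):
    """Permute all columns in a group with the SAME row shuffle."""
    rng = np.random.default_rng(seed)
    baseline = accuracy_score(y, model.predict(X))
    out = {}
    for name, cols in groups.items():
        drops = []
        for _ in range(n_repeats):
            idx = rng.permutation(len(X))
            X_perm = X.copy()
            X_perm[:, cols] = X[idx][:, cols]  # one shuffle applied to the whole group
            drops.append(baseline - accuracy_score(y, model.predict(X_perm)))
        out[name] = (float(np.mean(drops)), float(np.std(drops)))
    return out

# Synthetic demo: feature 1 is a near-copy of feature 0, so they form a group.
X, y = make_classification(n_samples=600, n_features=5, n_informative=3, random_state=0)
X[:, 1] = X[:, 0] + 0.05 * np.random.default_rng(1).normal(size=len(X))
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

groups = {"correlated_pair": [0, 1], "feature_2": [2], "feature_3": [3], "feature_4": [4]}
print(group_permutation_importance(model, X, y, groups, n_repeats=10))
```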
```python
import numpy as np
from scipy import stats
from sklearn.inspection import permutation_importance


def compute_importance_with_significance(model, X, y, n_repeats=30,
                                          scoring='accuracy', alpha=0.05):
    """
    Compute permutation importance with statistical significance testing.

    Args:
        model: Fitted model
        X: Feature matrix
        y: Target vector
        n_repeats: Number of permutations (higher = more stable)
        scoring: Scoring metric name
        alpha: Significance level

    Returns:
        Dictionary with importance statistics and significance
    """
    result = permutation_importance(
        model, X, y,
        n_repeats=n_repeats,
        scoring=scoring,
        n_jobs=-1
    )

    n_features = X.shape[1]

    # Compute test statistic: is mean significantly different from 0?
    # Under null hypothesis (feature unimportant), mean importance = 0
    t_stats = []
    p_values = []
    significant = []

    for j in range(n_features):
        importances_j = result.importances[j]

        # One-sample t-test against 0
        t_stat, p_val = stats.ttest_1samp(importances_j, 0)
        t_stats.append(t_stat)
        p_values.append(p_val)

        # Is feature significantly important? (one-sided: importance > 0)
        # Using one-sided p-value for "is importance positive"
        p_one_sided = p_val / 2 if t_stat > 0 else 1 - p_val / 2
        significant.append(p_one_sided < alpha)

    # Compute confidence intervals
    n = n_repeats
    ci_factor = stats.t.ppf(1 - alpha/2, df=n-1)
    ci_half_width = ci_factor * result.importances_std / np.sqrt(n)

    return {
        'mean': result.importances_mean,
        'std': result.importances_std,
        'ci_lower': result.importances_mean - ci_half_width,
        'ci_upper': result.importances_mean + ci_half_width,
        't_statistic': np.array(t_stats),
        'p_value': np.array(p_values),
        'significant': np.array(significant)
    }


# Example: Significance testing for feature importance
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Create data with clearly informative and uninformative features
X, y = make_classification(
    n_samples=500,
    n_features=10,
    n_informative=4,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=42
)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Compute importance with significance
results = compute_importance_with_significance(
    rf, X_val, y_val, n_repeats=50
)

print("Permutation Importance with Statistical Significance")
print("=" * 75)
print(f"{'Feature':<10} {'Mean':>10} {'95% CI':>20} {'p-value':>12} {'Sig?'}")
print("-" * 75)

for i in range(len(results['mean'])):
    ci_str = f"[{results['ci_lower'][i]:.4f}, {results['ci_upper'][i]:.4f}]"
    sig_str = "✓" if results['significant'][i] else "✗"
    p_val = results['p_value'][i]
    p_str = f"{p_val:.4f}" if p_val >= 0.0001 else "<0.0001"
    print(f"Feature {i:<2} {results['mean'][i]:>10.4f} {ci_str:>20} {p_str:>12} {sig_str:>5}")
```

One of the most perplexing results in permutation importance analysis is encountering negative importance values. This occurs when the model performs better after shuffling a feature—an apparently paradoxical result.
What negative importance means:
Negative importance indicates that the feature, as used by the model, is actually hurting prediction performance. When the feature's relationship to the target is broken through shuffling, the model makes better predictions. This can happen for several reasons: the model may have overfit to noise in that feature, the feature may carry spurious or leaked signal that does not generalize, or the apparent effect may simply be sampling noise on a small evaluation set.
Strongly negative importance (not just near-zero) is a serious red flag. It suggests: (1) severe overfitting, (2) data leakage during training, (3) feature engineering bugs, or (4) train/test distribution mismatch. Investigate thoroughly before deploying the model.
How to respond to negative importance:
| Scenario | Likely Cause | Recommended Action |
|---|---|---|
| Slightly negative (-0.001 to 0) | Random noise | Treat as zero importance; safe to ignore |
| Moderately negative (-0.01 to -0.001) | Mild overfitting or noise | Consider regularization; validate on more data |
| Strongly negative (< -0.01) | Serious overfitting or leakage | Investigate feature; check for data leakage; consider removal |
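To make the table operational, here is a tiny helper that applies those cutoffs. The function name is illustrative, and the thresholds are the rules of thumb from the table, not universal constants:

```python
def triage_negative_importance(mean_importance: float) -> str:
    """Map a mean permutation importance onto the triage buckets above."""
    if mean_importance >= 0:
        return "non-negative: interpret normally"
    if mean_importance > -0.001:
        return "slightly negative: treat as zero importance"
    if mean_importance > -0.01:
        return "moderately negative: consider regularization, validate on more data"
    return "strongly negative: check for leakage/overfitting, consider removing the feature"

print(triage_negative_importance(-0.0004))  # slightly negative
print(triage_negative_importance(-0.02))    # strongly negative
```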
A worked example:
Imagine a model predicting customer satisfaction. One feature is 'customer_service_email' (the exact email address customers used to contact support). During training, the model memorizes which email addresses correspond to satisfied customers. On held-out data, these memorized associations don't generalize—in fact, they're misleading. Shuffling this feature breaks the spurious associations, and prediction improves.
```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split


def analyze_suspicious_features(model, X_val, y_val, feature_names=None,
                                n_repeats=30, threshold=-0.005):
    """
    Identify and analyze features with suspiciously negative importance.

    Args:
        model: Fitted model
        X_val: Validation features
        y_val: Validation target
        feature_names: List of feature names
        n_repeats: Number of permutation repeats
        threshold: Importance below this is flagged as suspicious

    Returns:
        DataFrame of suspicious features with diagnostic info
    """
    if feature_names is None:
        feature_names = [f"feature_{i}" for i in range(X_val.shape[1])]

    result = permutation_importance(
        model, X_val, y_val,
        n_repeats=n_repeats,
        n_jobs=-1
    )

    # Find suspicious features
    suspicious_mask = result.importances_mean < threshold
    suspicious_indices = np.where(suspicious_mask)[0]

    if len(suspicious_indices) == 0:
        print("✅ No features with suspiciously negative importance found.")
        return None

    suspicious = []
    for idx in suspicious_indices:
        # Check how consistently negative the importance is
        neg_fraction = (result.importances[idx] < 0).mean()

        suspicious.append({
            'feature': feature_names[idx],
            'index': idx,
            'mean_importance': result.importances_mean[idx],
            'std_importance': result.importances_std[idx],
            'fraction_negative': neg_fraction,
            'min_importance': result.importances[idx].min(),
            'max_importance': result.importances[idx].max(),
        })

    df = pd.DataFrame(suspicious)
    df = df.sort_values('mean_importance')

    print("⚠️ Suspicious Features Detected!")
    print("=" * 80)
    for _, row in df.iterrows():
        print(f"\n{row['feature']} (index {row['index']}):")
        print(f"  Mean importance: {row['mean_importance']:.4f}")
        print(f"  Fraction of trials with negative importance: {row['fraction_negative']:.1%}")
        print(f"  Range: [{row['min_importance']:.4f}, {row['max_importance']:.4f}]")

        if row['fraction_negative'] > 0.95:
            print("  🔴 CRITICAL: Consistently harmful - investigate data leakage")
        elif row['fraction_negative'] > 0.7:
            print("  🟠 WARNING: Frequently harmful - likely overfitting")
        else:
            print("  🟡 CAUTION: Sometimes harmful - may be noise")

    return df


# Demonstration: Create a feature that causes overfitting
np.random.seed(42)
n_samples = 500

# Genuine features
X_good = np.random.randn(n_samples, 5)
y = (X_good[:, 0] + X_good[:, 1] > 0).astype(int)

# Overfitting-prone feature: random ID that happens to correlate with y
# in training but won't generalize
overfit_feature = np.random.randn(n_samples)

# Pure noise feature
noise_feature = np.random.randn(n_samples)

X = np.column_stack([X_good, overfit_feature, noise_feature])
feature_names = [f"good_{i}" for i in range(5)] + ["overfit_prone", "pure_noise"]

# Split data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# Train deep tree that can overfit
rf = RandomForestClassifier(n_estimators=100, max_depth=20, random_state=42)
rf.fit(X_train, y_train)

# Analyze
print(f"Training accuracy: {rf.score(X_train, y_train):.4f}")
print(f"Validation accuracy: {rf.score(X_val, y_val):.4f}")
print()

suspicious_df = analyze_suspicious_features(
    rf, X_val, y_val,
    feature_names=feature_names,
    threshold=-0.001
)
```

Now that we understand both methods, let's directly compare their properties, strengths, and appropriate use cases.
Fundamental difference: impurity-based importance measures how much the model used each feature to reduce impurity during training, whereas permutation importance measures how much each feature contributes to predictive performance on evaluation data.
These can diverge significantly! A feature might be heavily used in splits but contribute little to generalization (overfitting), or rarely used but crucial when it is (high-value specialized feature).
| Aspect | Impurity-Based (MDI) | Permutation-Based |
|---|---|---|
| Computation cost | Free (computed during training) | O(n × p × K) — must run inference p × K times |
| Measures | Training-time usage patterns | Validation-time predictive contribution |
| Can detect overfitting? | No — can't distinguish helpful from harmful | Yes — negative importance reveals overfitting |
| Affected by cardinality? | Yes — strong bias toward high-cardinality | No — directly measures performance impact |
| Model-agnostic? | No — only for tree-based models | Yes — works for any model with predict() |
| Handles feature correlation | Poorly — splits importance arbitrarily | Poorly — can create unrealistic combinations |
| Reproducibility | Deterministic (given model) | Stochastic — varies with permutation seed |
| Sign of values | Always non-negative | Can be negative (feature hurts predictions) |
Use impurity-based when: you need quick feature screening, want to understand the model's internal logic, or are doing initial exploration. Use permutation-based when: you need reliable importance for feature selection, want to detect overfitting, or care about generalization performance.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split


def compare_importance_methods(X, y, feature_names=None, random_state=42):
    """
    Compare impurity-based and permutation importance for the same model.
    Demonstrates scenarios where they agree/disagree.
    """
    if feature_names is None:
        feature_names = [f"F{i}" for i in range(X.shape[1])]

    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.3, random_state=random_state
    )

    # Train model
    rf = RandomForestClassifier(n_estimators=200, max_depth=10,
                                random_state=random_state)
    rf.fit(X_train, y_train)

    # Impurity-based importance
    imp_impurity = rf.feature_importances_

    # Permutation importance (on validation set)
    perm_result = permutation_importance(
        rf, X_val, y_val, n_repeats=30,
        random_state=random_state, n_jobs=-1
    )
    imp_permutation = perm_result.importances_mean

    # Create comparison DataFrame
    comparison = pd.DataFrame({
        'feature': feature_names,
        'impurity': imp_impurity,
        'permutation': imp_permutation,
    })

    # Compute correlation and disagreement
    correlation = np.corrcoef(imp_impurity, imp_permutation)[0, 1]

    # Rank each feature by both methods
    comparison['rank_impurity'] = comparison['impurity'].rank(ascending=False)
    comparison['rank_permutation'] = comparison['permutation'].rank(ascending=False)
    comparison['rank_diff'] = abs(comparison['rank_impurity'] - comparison['rank_permutation'])

    print("Importance Method Comparison")
    print("=" * 70)
    print(f"Model accuracy: Train={rf.score(X_train, y_train):.4f}, "
          f"Val={rf.score(X_val, y_val):.4f}")
    print(f"Importance correlation: {correlation:.4f}")
    print()
    print(comparison.sort_values('rank_impurity').to_string(index=False))

    # Identify disagreements
    big_disagreements = comparison[comparison['rank_diff'] >= 3]
    if len(big_disagreements) > 0:
        print(f"\n⚠️ Features with rank difference >= 3:")
        for _, row in big_disagreements.iterrows():
            print(f"  {row['feature']}: "
                  f"Impurity rank={int(row['rank_impurity'])}, "
                  f"Permutation rank={int(row['rank_permutation'])}")

    return comparison


# Create a scenario where methods disagree
np.random.seed(42)
n_samples = 1000

# Feature 1: High cardinality, moderately predictive (impurity will overrate)
high_card = np.random.randn(n_samples)

# Feature 2: Binary, highly predictive (impurity may underrate)
binary_strong = (np.random.randn(n_samples) > 0).astype(float)

# Feature 3: Continuous, weak predictor
continuous_weak = np.random.randn(n_samples) * 0.3

# Target depends mostly on binary feature
y = (binary_strong * 2 + high_card * 0.5 + continuous_weak
     + np.random.randn(n_samples) * 0.5 > 1).astype(int)

X = np.column_stack([high_card, binary_strong, continuous_weak])
feature_names = ['high_cardinality', 'binary_strong', 'continuous_weak']

# Run comparison
comparison = compare_importance_methods(X, y, feature_names)

print("\n📊 Analysis:")
print("- 'high_cardinality' has many unique values → Impurity importance inflated")
print("- 'binary_strong' has only 2 values → Impurity importance deflated")
print("- Permutation importance correctly reflects true predictive value")
```

Permutation importance is more computationally expensive than impurity-based importance. Understanding the cost structure helps plan efficient analyses.
Time complexity:
For a model with inference time $T_{predict}$ on dataset of size $n$:
$$T_{permutation} = \mathcal{O}(p \times K \times T_{predict})$$
where $p$ is the number of features and $K$ is the number of permutation repeats.
Practical example (illustrative numbers): suppose a model with p = 100 features whose inference pass over the validation set takes about 1 second, evaluated with K = 20 repeats. Permutation importance then requires roughly 100 × 20 = 2,000 inference passes, on the order of half an hour of compute.
Optimization strategies:
- Parallelize across features and repeats (n_jobs=-1 in sklearn)
- Subsample the evaluation set for an initial screening pass
- Use a small number of repeats to screen all features, then run more repeats only for the top candidates
```python
import numpy as np
import time
from sklearn.inspection import permutation_importance
from sklearn.ensemble import RandomForestClassifier


def timed_permutation_importance(model, X, y, n_repeats, n_jobs=1):
    """Time permutation importance computation."""
    start = time.time()
    result = permutation_importance(
        model, X, y, n_repeats=n_repeats, n_jobs=n_jobs, random_state=42
    )
    elapsed = time.time() - start
    return result, elapsed


def efficient_importance_analysis(model, X, y, feature_names=None,
                                  initial_k=5, final_k=30,
                                  subsample_size=2000, top_n=10):
    """
    Two-stage efficient permutation importance analysis.

    Stage 1: Quick screening with subsampled data and few repeats
    Stage 2: Detailed analysis of top features

    Args:
        model: Fitted model
        X, y: Full validation data
        feature_names: Optional feature names
        initial_k: Repeats for initial screening
        final_k: Repeats for detailed analysis
        subsample_size: Samples for initial screening
        top_n: Number of top features for detailed analysis

    Returns:
        Dictionary mapping top feature names to mean/std importance
    """
    if feature_names is None:
        feature_names = [f"F{i}" for i in range(X.shape[1])]

    n_samples = X.shape[0]
    n_features = X.shape[1]

    # Stage 1: Quick screening
    print("Stage 1: Quick screening...")
    if n_samples > subsample_size:
        idx = np.random.choice(n_samples, subsample_size, replace=False)
        X_sub, y_sub = X[idx], y[idx]
    else:
        X_sub, y_sub = X, y

    result_quick, time_quick = timed_permutation_importance(
        model, X_sub, y_sub, n_repeats=initial_k, n_jobs=-1
    )
    print(f"  Completed in {time_quick:.2f}s")

    # Identify top features for detailed analysis
    top_indices = np.argsort(result_quick.importances_mean)[::-1][:top_n]
    print(f"  Top {top_n} features identified: {[feature_names[i] for i in top_indices]}")

    # Stage 2: Detailed analysis of top features only
    print(f"\nStage 2: Detailed analysis of top {top_n} features...")

    # For detailed analysis, we only permute the top features
    detailed_importances = {}
    start_detail = time.time()

    for idx in top_indices:
        # This is a simplified version - in production you'd modify
        # the sklearn implementation to only permute selected features
        X_work = X.copy()
        scores = []
        baseline = model.score(X, y)

        for k in range(final_k):
            X_work[:, idx] = np.random.permutation(X[:, idx])
            scores.append(model.score(X_work, y))

        detailed_importances[feature_names[idx]] = {
            'mean': baseline - np.mean(scores),
            'std': np.std([baseline - s for s in scores])
        }

    time_detail = time.time() - start_detail
    print(f"  Completed in {time_detail:.2f}s")

    # Combine results
    print(f"\nTotal time: {time_quick + time_detail:.2f}s")
    print(f"  (vs estimated full analysis: {n_features * final_k / initial_k * time_quick / 60:.1f} min)")

    return detailed_importances


# Example usage
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Large dataset
X, y = make_classification(
    n_samples=10000, n_features=100, n_informative=20,
    n_redundant=10, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)

# Compare: Full analysis vs Two-stage
print("=" * 60)
print("Two-Stage Efficient Analysis")
print("=" * 60)
efficient_results = efficient_importance_analysis(
    rf, X_val, y_val,
    initial_k=5, final_k=30,
    subsample_size=2000, top_n=15
)
```

Permutation importance provides a powerful, model-agnostic approach to measuring feature significance based on actual predictive contribution rather than training-time usage patterns.
Let's consolidate the key insights: permutation importance measures predictive contribution by shuffling a feature and recording the performance drop; it should be computed on held-out data; repeating the shuffle K times yields variance estimates and confidence intervals; negative importance is a warning sign of overfitting or leakage; and the method is model-agnostic but costs roughly p × K extra inference passes.
What's next:
We've now covered the two most common feature importance methods: impurity-based (fast but biased) and permutation-based (reliable but slower). The next page explores Drop-Column Importance—a method that measures what happens when we completely retrain the model without each feature. While computationally expensive, it provides the cleanest measure of feature value when training adaptation matters.
You now understand how to compute, interpret, and apply permutation importance. You can distinguish it from impurity-based importance, recognize when to use each method, handle negative importance values appropriately, and design efficient analyses for large-scale feature importance studies.