Impurity-based and permutation importance share a limitation: both evaluate features within the context of a model that was trained with those features present. But what if a feature's true value can only be understood by asking a more fundamental question: "How would the model perform if this feature had never existed?"
Drop-column importance (also called leave-one-covariate-out or ablation importance) answers this question directly. For each feature, we retrain the model entirely without it and measure the performance change. This approach captures something the other methods can't: how the model would adapt to the feature's absence.
By the end of this page, you will understand: (1) The theoretical justification for drop-column importance, (2) The complete algorithm and its variations, (3) How it differs from permutation importance, (4) Computational strategies for practical implementation, and (5) When this expensive method is worth the cost.
To understand why drop-column importance provides unique information, consider what happens when a feature is absent during training versus when it's merely shuffled:
Permutation importance (feature present but shuffled): the model was trained to rely on the feature, so shuffling feeds noise into a column the model still expects to carry signal. The resulting drop measures how much the frozen model leans on that feature.
Drop-column importance (feature never existed): the model is retrained from scratch and builds whatever strategy it can from the remaining features. The resulting drop measures the feature's marginal value after the model has adapted to its absence.
Imagine a basketball team where one player gets injured (permutation = substituting a random person) vs. having never drafted that player (drop-column = the team practiced all season without them). In the first case, the team's plays were designed around the missing player. In the second, the team developed strategies without depending on them. The performance difference tells you different things about that player's value.
Implications for correlated features:
This distinction is most pronounced for correlated features. Consider features A and B that are highly correlated:
| Scenario | Model behavior |
|---|---|
| Shuffle A (permutation) | Model can't adapt; it still relies on the now-broken A-B relationship |
| Drop A (drop-column) | Model retrains and learns to use B instead; may show minimal loss |
With permutation importance, both correlated features might appear important because shuffling breaks the correlation the model depends on. With drop-column importance, the model adapts—if B can fully substitute for A, then A's true marginal contribution is near zero.
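To see the adaptation effect concretely, here is a minimal sketch (assuming scikit-learn; the two-feature dataset and all names are invented for illustration) that scores a nearly duplicated feature pair both ways. The comments describe what each step computes, not a guaranteed numeric outcome:

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Feature A drives the target; feature B is a near-copy of A
a = rng.normal(size=n)
b = a + rng.normal(scale=0.05, size=n)
X = np.column_stack([a, b])
y = (a > 0).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_tr, y_tr)
baseline = accuracy_score(y_va, model.predict(X_va))

# Permutation view: the frozen model is scored with each column shuffled in place
perm = permutation_importance(model, X_va, y_va, n_repeats=10, random_state=0)
print("permutation:", dict(zip(["A", "B"], perm.importances_mean.round(3))))

# Drop-column view: the model is retrained without each column,
# so it can fall back on the surviving near-duplicate
for name, j in [("A", 0), ("B", 1)]:
    retrained = clone(model).fit(np.delete(X_tr, j, axis=1), y_tr)
    drop_imp = baseline - accuracy_score(y_va, retrained.predict(np.delete(X_va, j, axis=1)))
    print(f"drop-column {name}: {drop_imp:+.3f}")
```

Because either twin can stand in for the other after retraining, the drop-column scores for A and B tend toward zero even when the pair is jointly essential.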
| Method | Question Answered | Model Adaptation |
|---|---|---|
| Impurity-based | How much did splits on this feature reduce training impurity? | N/A (training only) |
| Permutation | How much worse does this trained model perform if we break feature-target relationship? | None—model frozen |
| Drop-column | How much worse is the best model we can train without this feature? | Full—model retrained |
The algorithm for drop-column importance is conceptually simple but computationally demanding:
Algorithm: Drop-Column Importance
```
Input:  Dataset (X, y), model class M, validation data (X_val, y_val), scoring function S
Output: Importance score for each feature

1. Train baseline model on all features:
       model_baseline = M.fit(X, y)
       score_baseline = S(y_val, model_baseline.predict(X_val))

2. For each feature j in {1, 2, ..., p}:
       a. Create dataset X_minus_j by removing column j from X
       b. Create X_val_minus_j by removing column j from X_val
       c. Train model without feature j:
              model_j = M.fit(X_minus_j, y)
       d. Score the reduced model:
              score_j = S(y_val, model_j.predict(X_val_minus_j))
       e. importance_j = score_baseline - score_j

3. Return importances for all features
```
Key differences from permutation importance:
- The model is retrained for every feature, so it can adapt to the feature's absence rather than being evaluated frozen.
- Each feature costs a full training run (p + 1 trainings in total) instead of a cheap shuffle-and-score.
- The result measures a feature's marginal contribution given the other features, not the trained model's reliance on it.
For a model that takes 1 hour to train with 100 features, drop-column importance requires ~101 hours of training time. This makes it impractical for large models or large feature sets without optimization strategies (covered later).
```python
import time
from typing import List, Optional

import numpy as np
import pandas as pd
from sklearn.base import clone


def drop_column_importance(
    model,
    X: np.ndarray,
    y: np.ndarray,
    X_val: np.ndarray,
    y_val: np.ndarray,
    scoring: str = 'accuracy',
    feature_names: Optional[List[str]] = None,
    verbose: bool = True
) -> pd.DataFrame:
    """
    Compute drop-column feature importance.

    Args:
        model: Sklearn-compatible model (will be cloned for each training)
        X: Training feature matrix
        y: Training target
        X_val: Validation feature matrix
        y_val: Validation target
        scoring: Scoring metric name
        feature_names: Optional list of feature names
        verbose: Print progress if True

    Returns:
        DataFrame with feature importances
    """
    from sklearn.metrics import get_scorer

    n_features = X.shape[1]
    if feature_names is None:
        feature_names = [f"feature_{i}" for i in range(n_features)]

    scorer = get_scorer(scoring)

    # Train baseline model with all features
    if verbose:
        print("Training baseline model with all features...")
    start_baseline = time.time()
    baseline_model = clone(model)
    baseline_model.fit(X, y)
    baseline_score = scorer(baseline_model, X_val, y_val)
    baseline_time = time.time() - start_baseline

    if verbose:
        print(f"  Baseline {scoring}: {baseline_score:.4f} (trained in {baseline_time:.1f}s)")
        estimated_total = baseline_time * (n_features + 1)
        print(f"  Estimated total time: {estimated_total / 60:.1f} minutes")

    # Drop each feature and retrain
    results = []
    for j in range(n_features):
        if verbose:
            print(f"  [{j + 1}/{n_features}] Dropping {feature_names[j]}...", end=" ")

        start_j = time.time()

        # Create reduced datasets
        X_reduced = np.delete(X, j, axis=1)
        X_val_reduced = np.delete(X_val, j, axis=1)

        # Train model without feature j
        reduced_model = clone(model)
        reduced_model.fit(X_reduced, y)
        reduced_score = scorer(reduced_model, X_val_reduced, y_val)

        # Importance = performance drop
        importance = baseline_score - reduced_score
        elapsed = time.time() - start_j

        if verbose:
            print(f"{scoring}={reduced_score:.4f}, importance={importance:+.4f} ({elapsed:.1f}s)")

        results.append({
            'feature': feature_names[j],
            'feature_index': j,
            'importance': importance,
            'score_without': reduced_score,
            'training_time': elapsed
        })

    df = pd.DataFrame(results)
    df['baseline_score'] = baseline_score
    df['importance_pct'] = (df['importance'] / baseline_score * 100).round(2)

    return df.sort_values('importance', ascending=False).reset_index(drop=True)


# Example usage
if __name__ == "__main__":
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Create dataset with mixed feature types
    np.random.seed(42)
    n_samples = 1000

    # 3 highly informative features
    X_info = np.random.randn(n_samples, 3)

    # 2 correlated redundant features (copies of informative with noise)
    X_redundant = X_info[:, :2] + np.random.randn(n_samples, 2) * 0.1

    # 3 pure noise features
    X_noise = np.random.randn(n_samples, 3)

    X = np.hstack([X_info, X_redundant, X_noise])
    y = (X_info[:, 0] + X_info[:, 1] + X_info[:, 2] > 0).astype(int)

    feature_names = ['info_0', 'info_1', 'info_2',
                     'redundant_0', 'redundant_1',
                     'noise_0', 'noise_1', 'noise_2']

    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.3, random_state=42
    )

    # Compute drop-column importance
    rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)

    importance_df = drop_column_importance(
        rf, X_train, y_train, X_val, y_val,
        feature_names=feature_names,
        verbose=True
    )

    print("\nDrop-Column Importance Results:")
    print("=" * 70)
    print(importance_df.to_string(index=False))
```

Drop-column importance reveals different information than other methods, requiring careful interpretation.
High drop-column importance: The model performs significantly worse without this feature, even after retraining to adapt. This means the feature carries information that no combination of the remaining features can replace; it provides genuinely unique predictive value.
Low or zero drop-column importance (for a feature that permutation marked as important): The model recovers performance by using other features after retraining. This indicates the feature's information is redundant: other (typically correlated) features can supply the same signal.
Low drop-column importance does NOT mean the feature is useless—it means it's replaceable. In production, you might still want redundant features for robustness (if one source fails, others compensate).
Negative drop-column importance: If dropping a feature improves model performance, the feature was actively hurting the model, typically by adding noise or spurious patterns that encourage overfitting. Such features are strong candidates for removal.
Common patterns to look for:
| Pattern | Permutation Imp. | Drop-Column Imp. | Interpretation |
|---|---|---|---|
| Unique predictor | High | High | Feature provides irreplaceable value |
| Redundant predictor | High | Low/Zero | Feature is valuable but substitutable |
| Harmful feature | Negative | Negative | Feature hurts predictions—consider removal |
| Used but uninformative | Near-zero | Near-zero | Feature neither helps nor hurts |
| Correlated pair (A & B) | Both high | Both low | Together valuable, individually redundant |
```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split


def classify_feature_role(perm_imp: float, drop_imp: float,
                          threshold: float = 0.01) -> str:
    """
    Classify a feature's role based on permutation and drop-column importance.

    Args:
        perm_imp: Permutation importance
        drop_imp: Drop-column importance
        threshold: Significance threshold

    Returns:
        Feature classification string
    """
    if perm_imp > threshold and drop_imp > threshold:
        return "UNIQUE_PREDICTOR"     # High both = irreplaceable
    elif perm_imp > threshold and abs(drop_imp) <= threshold:
        return "REDUNDANT_PREDICTOR"  # High perm, low drop = substitutable
    elif perm_imp < -threshold and drop_imp < -threshold:
        return "HARMFUL_FEATURE"      # Both negative = actively hurts model
    elif abs(perm_imp) <= threshold and abs(drop_imp) <= threshold:
        return "UNINFORMATIVE"        # Neither method shows importance
    elif perm_imp < -threshold and drop_imp > -threshold:
        return "OVERFIT_RECOVERED"    # Was hurting when frozen, recovers when retrained
    else:
        return "AMBIGUOUS"            # Unusual pattern requiring investigation


def comprehensive_feature_analysis(model, X_train, y_train, X_val, y_val,
                                   feature_names=None, perm_repeats=30):
    """
    Perform comprehensive feature analysis using both importance methods.
    """
    from sklearn.base import clone
    from sklearn.metrics import accuracy_score

    n_features = X_train.shape[1]
    if feature_names is None:
        feature_names = [f"F{i}" for i in range(n_features)]

    # Train baseline
    baseline = clone(model)
    baseline.fit(X_train, y_train)
    baseline_score = accuracy_score(y_val, baseline.predict(X_val))

    # Permutation importance
    perm_result = permutation_importance(
        baseline, X_val, y_val, n_repeats=perm_repeats, n_jobs=-1
    )

    # Drop-column importance
    drop_importances = []
    for j in range(n_features):
        X_train_j = np.delete(X_train, j, axis=1)
        X_val_j = np.delete(X_val, j, axis=1)

        model_j = clone(model)
        model_j.fit(X_train_j, y_train)
        score_j = accuracy_score(y_val, model_j.predict(X_val_j))
        drop_importances.append(baseline_score - score_j)

    # Compile results
    results = pd.DataFrame({
        'feature': feature_names,
        'perm_importance': perm_result.importances_mean,
        'perm_std': perm_result.importances_std,
        'drop_importance': drop_importances,
    })

    # Classify each feature
    results['role'] = results.apply(
        lambda row: classify_feature_role(row['perm_importance'],
                                          row['drop_importance']),
        axis=1
    )

    # Add actionable recommendations
    def get_recommendation(role):
        recommendations = {
            'UNIQUE_PREDICTOR': '✅ Keep - Critical feature',
            'REDUNDANT_PREDICTOR': '⚡ Consider keeping for robustness',
            'HARMFUL_FEATURE': '❌ Remove - Hurting predictions',
            'UNINFORMATIVE': '🔍 Review - May be droppable',
            'OVERFIT_RECOVERED': '⚠️ Regularize or remove',
            'AMBIGUOUS': '🔍 Investigate further'
        }
        return recommendations.get(role, 'Unknown')

    results['recommendation'] = results['role'].apply(get_recommendation)

    return results.sort_values('drop_importance', ascending=False)


# Example with clear feature patterns
np.random.seed(42)
n_samples = 1000

# Create features with different characteristics
# Strong unique predictor
x_unique = np.random.randn(n_samples)

# Correlated pair (redundant)
x_corr_a = np.random.randn(n_samples)
x_corr_b = x_corr_a + np.random.randn(n_samples) * 0.1  # Nearly identical

# Pure noise
x_noise = np.random.randn(n_samples)

# Target depends on unique and one of the correlated features
y = (x_unique + x_corr_a > 0).astype(int)

X = np.column_stack([x_unique, x_corr_a, x_corr_b, x_noise])
feature_names = ['unique_predictor', 'correlated_a', 'correlated_b', 'noise']

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
analysis = comprehensive_feature_analysis(
    rf, X_train, y_train, X_val, y_val, feature_names
)

print("Comprehensive Feature Analysis")
print("=" * 90)
print(analysis.to_string(index=False))
```

The O(p × training_time) complexity of drop-column importance makes it impractical for many real-world scenarios. Here are strategies to make it tractable:
Strategy 1: Pre-screening with faster methods
Use impurity-based or permutation importance first to identify candidates, then apply drop-column only to the top-k features:
```python
# Step 1: quick permutation screening on the already-trained model
perm_results = permutation_importance(model, X_val, y_val)
top_k_indices = np.argsort(perm_results.importances_mean)[-k:]

# Step 2: run the expensive drop-column retraining only for the top-k candidates
for j in top_k_indices:
    ...  # retrain without feature j (expensive, but worth it for key features)
```
Often, 20% of features provide 80% of predictive power. Running drop-column on the top 20% of features (by permutation importance) captures most of the interesting insights at a fraction of the cost.
Strategy 2: Use simpler proxy models
Instead of training your full complex model (e.g., XGBoost with 1000 trees), train a simpler version (e.g., 100 trees, lower depth) for importance estimation:
```python
# Use for drop-column importance
fast_model = RandomForestClassifier(n_estimators=50, max_depth=10)

# Use for final model training
full_model = RandomForestClassifier(n_estimators=500, max_depth=None)
```
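If you go this route, it is worth checking that the cheap proxy ranks features similarly to the full model before trusting it. Here is a hedged sketch, reusing the `drop_column_importance` function defined earlier on this page and a small synthetic pilot dataset (both assumptions for illustration):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Small pilot dataset just for checking proxy fidelity (your real data would be larger)
X, y = make_classification(n_samples=1500, n_features=12, n_informative=6, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

fast_model = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=0, n_jobs=-1)
full_model = RandomForestClassifier(n_estimators=500, random_state=0, n_jobs=-1)

# Reuse drop_column_importance() from the implementation earlier on this page
fast_imp = drop_column_importance(fast_model, X_tr, y_tr, X_va, y_va, verbose=False)
full_imp = drop_column_importance(full_model, X_tr, y_tr, X_va, y_va, verbose=False)

# Compare the two rankings; a high Spearman correlation suggests the cheap proxy
# orders features similarly to the expensive model
merged = fast_imp.merge(full_imp, on='feature', suffixes=('_fast', '_full'))
rho, _ = spearmanr(merged['importance_fast'], merged['importance_full'])
print(f"Rank agreement between proxy and full model (Spearman rho): {rho:.2f}")
```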
Strategy 3: Subsample training data
For importance estimation (not final training), use a subsample of training data:
```python
subsample_idx = np.random.choice(len(X_train), size=5000, replace=False)
X_sub, y_sub = X_train[subsample_idx], y_train[subsample_idx]
# Use the subsampled data for the drop-column analysis
```
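One way to choose the subsample size is to watch how the validation score changes as the training subset grows; once the curve flattens, a larger sample is unlikely to change the importance ranking much. A minimal sketch under that assumption (synthetic data, scikit-learn):

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, n_features=30, n_informative=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
model = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)

# Train on progressively larger subsamples and score on the fixed validation set
for size in [1000, 2000, 5000, len(X_train)]:
    idx = rng.choice(len(X_train), size=min(size, len(X_train)), replace=False)
    m = clone(model).fit(X_train[idx], y_train[idx])
    print(f"n={size:>6}: val accuracy = {accuracy_score(y_val, m.predict(X_val)):.4f}")
```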
```python
import time

import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.inspection import permutation_importance


def efficient_drop_column_importance(
    model,
    X_train: np.ndarray,
    y_train: np.ndarray,
    X_val: np.ndarray,
    y_val: np.ndarray,
    feature_names: list = None,
    screening_top_k: int = None,
    subsample_train: int = None,
    n_jobs: int = -1,
    verbose: bool = True
) -> pd.DataFrame:
    """
    Compute drop-column importance with optimization strategies.

    Args:
        model: Model to evaluate
        X_train, y_train: Training data
        X_val, y_val: Validation data
        feature_names: Feature names
        screening_top_k: Only analyze top-k by permutation importance
        subsample_train: Subsample training data to this size
        n_jobs: Number of parallel jobs (-1 for all cores)
        verbose: Print progress

    Returns:
        DataFrame with importance results
    """
    from sklearn.metrics import get_scorer
    from joblib import Parallel, delayed

    scorer = get_scorer('accuracy')
    n_features = X_train.shape[1]
    if feature_names is None:
        feature_names = [f"feature_{i}" for i in range(n_features)]

    # Apply subsampling if requested
    if subsample_train and subsample_train < len(X_train):
        if verbose:
            print(f"Subsampling training data: {len(X_train)} -> {subsample_train}")
        idx = np.random.choice(len(X_train), subsample_train, replace=False)
        X_train_eff = X_train[idx]
        y_train_eff = y_train[idx]
    else:
        X_train_eff = X_train
        y_train_eff = y_train

    # Determine which features to analyze
    if screening_top_k and screening_top_k < n_features:
        if verbose:
            print(f"Screening: identifying top {screening_top_k} features by permutation importance...")

        # Quick permutation screening with a smaller ensemble
        # (assumes the estimator exposes n_estimators, e.g. tree ensembles)
        quick_model = clone(model)
        quick_model.set_params(n_estimators=50)  # Faster
        quick_model.fit(X_train_eff, y_train_eff)

        perm_result = permutation_importance(
            quick_model, X_val, y_val, n_repeats=5, n_jobs=n_jobs
        )
        features_to_analyze = np.argsort(perm_result.importances_mean)[-screening_top_k:]

        if verbose:
            print(f"  Analyzing features: {[feature_names[i] for i in features_to_analyze]}")
    else:
        features_to_analyze = np.arange(n_features)

    # Train baseline
    if verbose:
        print("Training baseline model...")
    baseline = clone(model)
    baseline.fit(X_train_eff, y_train_eff)
    baseline_score = scorer(baseline, X_val, y_val)
    if verbose:
        print(f"  Baseline accuracy: {baseline_score:.4f}")

    # Parallel drop-column analysis
    if verbose:
        print(f"Running drop-column analysis on {len(features_to_analyze)} features...")

    start = time.time()

    def analyze_feature(j):
        X_train_j = np.delete(X_train_eff, j, axis=1)
        X_val_j = np.delete(X_val, j, axis=1)
        model_j = clone(model)
        model_j.fit(X_train_j, y_train_eff)
        return j, scorer(model_j, X_val_j, y_val)

    if n_jobs == 1:
        results_raw = [analyze_feature(j) for j in features_to_analyze]
    else:
        results_raw = Parallel(n_jobs=n_jobs)(
            delayed(analyze_feature)(j) for j in features_to_analyze
        )

    if verbose:
        elapsed = time.time() - start
        print(f"  Completed in {elapsed:.1f}s")

    # Compile results
    results = []
    analyzed_indices = set(j for j, _ in results_raw)
    for j in range(n_features):
        if j in analyzed_indices:
            score_j = next(score for idx, score in results_raw if idx == j)
            importance = baseline_score - score_j
        else:
            importance = np.nan  # Not analyzed
            score_j = np.nan

        results.append({
            'feature': feature_names[j],
            'drop_importance': importance,
            'score_without': score_j,
            'analyzed': j in analyzed_indices
        })

    df = pd.DataFrame(results)
    df['baseline_score'] = baseline_score

    return df.sort_values('drop_importance', ascending=False, na_position='last')


# Demonstration
if __name__ == "__main__":
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    # Create larger dataset
    X, y = make_classification(
        n_samples=5000,
        n_features=50,       # Many features
        n_informative=15,
        n_redundant=10,
        random_state=42
    )

    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

    rf = RandomForestClassifier(n_estimators=100, random_state=42)

    print("Efficient Drop-Column Analysis")
    print("=" * 60)

    # With optimizations
    results = efficient_drop_column_importance(
        rf, X_train, y_train, X_val, y_val,
        screening_top_k=15,      # Only analyze top 15
        subsample_train=2000,    # Use subset for training
        n_jobs=-1,
        verbose=True
    )

    print("\nResults (top 15 features analyzed):")
    print(results[results['analyzed']].to_string(index=False))
```

For robust importance estimates, single train-validation splits are often insufficient. Cross-validation provides more reliable estimates at the cost of additional computation.
Why cross-validation matters:
With a single split, importance estimates are sensitive to which samples ended up in training vs. validation. A feature might appear more or less important simply due to unlucky data allocation. Cross-validation averages over multiple splits, giving:
With k-fold CV and p features, you need (p+1) × k model training runs. For 5-fold CV with 100 features, that's 505 training runs—an order of magnitude more than single-split drop-column.
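As a quick budgeting aid, this arithmetic can be wrapped in a tiny helper (a hypothetical function for illustration, not part of any library):

```python
def estimate_cv_drop_column_cost(n_features: int, n_folds: int, minutes_per_fit: float) -> float:
    """Rough wall-clock estimate: (p + 1) model fits per fold, times k folds."""
    n_fits = (n_features + 1) * n_folds
    return n_fits * minutes_per_fit

# 100 features, 5 folds, ~1 minute per fit -> 505 fits, roughly 8.4 hours of training
print(estimate_cv_drop_column_cost(n_features=100, n_folds=5, minutes_per_fit=1.0), "minutes")
```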
Algorithm: Cross-Validated Drop-Column Importance
```
For each fold (k = 1 to K):
    Split data into (Train_k, Val_k)
    Train baseline on Train_k → Score on Val_k → baseline_score_k
    For each feature j:
        Drop feature j from Train_k and Val_k
        Train model → Score on Val_k → score_k_j
        importance_k_j = baseline_score_k - score_k_j

For each feature j:
    importance_mean_j = mean(importance_1_j, ..., importance_K_j)
    importance_std_j  = std(importance_1_j, ..., importance_K_j)
```
```python
import warnings

import numpy as np
import pandas as pd
from joblib import Parallel, delayed
from scipy import stats
from sklearn.base import clone
from sklearn.metrics import get_scorer
from sklearn.model_selection import KFold, StratifiedKFold


def cv_drop_column_importance(
    model,
    X: np.ndarray,
    y: np.ndarray,
    feature_names: list = None,
    cv: int = 5,
    stratified: bool = True,
    scoring: str = 'accuracy',
    n_jobs: int = -1,
    verbose: bool = True
) -> pd.DataFrame:
    """
    Compute drop-column importance with cross-validation for robust estimates.

    Args:
        model: Model to evaluate
        X: Full feature matrix
        y: Full target array
        feature_names: Feature names
        cv: Number of cross-validation folds
        stratified: Use stratified K-fold for classification
        scoring: Scoring metric
        n_jobs: Parallel jobs
        verbose: Print progress

    Returns:
        DataFrame with mean and std importance across folds
    """
    n_features = X.shape[1]
    if feature_names is None:
        feature_names = [f"feature_{i}" for i in range(n_features)]

    scorer = get_scorer(scoring)

    # Create CV splitter
    if stratified:
        kfold = StratifiedKFold(n_splits=cv, shuffle=True, random_state=42)
    else:
        kfold = KFold(n_splits=cv, shuffle=True, random_state=42)

    # Store importance for each fold and feature
    fold_importances = np.zeros((cv, n_features))
    fold_baseline_scores = np.zeros(cv)

    for fold_idx, (train_idx, val_idx) in enumerate(kfold.split(X, y)):
        if verbose:
            print(f"Fold {fold_idx + 1}/{cv}...")

        X_train, X_val = X[train_idx], X[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]

        # Baseline for this fold
        baseline = clone(model)
        baseline.fit(X_train, y_train)
        baseline_score = scorer(baseline, X_val, y_val)
        fold_baseline_scores[fold_idx] = baseline_score

        if verbose:
            print(f"  Baseline: {baseline_score:.4f}")

        # Drop each feature
        def drop_and_score(j):
            X_train_j = np.delete(X_train, j, axis=1)
            X_val_j = np.delete(X_val, j, axis=1)
            model_j = clone(model)
            with warnings.catch_warnings():
                warnings.simplefilter("ignore")
                model_j.fit(X_train_j, y_train)
            return scorer(model_j, X_val_j, y_val)

        if n_jobs == 1:
            scores = [drop_and_score(j) for j in range(n_features)]
        else:
            scores = Parallel(n_jobs=n_jobs)(
                delayed(drop_and_score)(j) for j in range(n_features)
            )

        for j, score_j in enumerate(scores):
            fold_importances[fold_idx, j] = baseline_score - score_j

    # Aggregate across folds (sample std across the cv folds)
    results = pd.DataFrame({
        'feature': feature_names,
        'importance_mean': fold_importances.mean(axis=0),
        'importance_std': fold_importances.std(axis=0, ddof=1),
        'importance_min': fold_importances.min(axis=0),
        'importance_max': fold_importances.max(axis=0),
    })

    # Add statistical measures: 95% CI with cv - 1 degrees of freedom
    n = cv
    t_factor = stats.t.ppf(0.975, df=cv - 1)
    results['ci_lower'] = results['importance_mean'] - t_factor * results['importance_std'] / np.sqrt(n)
    results['ci_upper'] = results['importance_mean'] + t_factor * results['importance_std'] / np.sqrt(n)

    # Coefficient of variation (stability measure)
    results['stability'] = 1 - (results['importance_std'] / results['importance_mean'].abs().replace(0, np.inf))
    results['stability'] = results['stability'].clip(0, 1)

    # Add fold-level details
    for fold_idx in range(cv):
        results[f'fold_{fold_idx + 1}'] = fold_importances[fold_idx]

    results['baseline_mean'] = fold_baseline_scores.mean()
    results['baseline_std'] = fold_baseline_scores.std()

    return results.sort_values('importance_mean', ascending=False).reset_index(drop=True)


# Example usage
if __name__ == "__main__":
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.datasets import make_classification

    # Create dataset
    X, y = make_classification(
        n_samples=1000,
        n_features=15,
        n_informative=8,
        n_redundant=3,
        class_sep=1.5,
        random_state=42
    )
    feature_names = [f"F{i}" for i in range(15)]

    rf = RandomForestClassifier(n_estimators=100, random_state=42)

    print("Cross-Validated Drop-Column Importance")
    print("=" * 70)

    results = cv_drop_column_importance(
        rf, X, y,
        feature_names=feature_names,
        cv=5,
        n_jobs=-1,
        verbose=True
    )

    print("\nResults (sorted by mean importance):")
    print("-" * 70)
    display_cols = ['feature', 'importance_mean', 'importance_std',
                    'ci_lower', 'ci_upper', 'stability']
    print(results[display_cols].to_string(index=False))

    # Identify significantly important features
    print("\nStatistically Significant Features (95% CI doesn't include 0):")
    sig_features = results[results['ci_lower'] > 0]['feature'].tolist()
    print(f"  {sig_features}")
```

When features naturally form groups (e.g., multiple features derived from the same sensor, demographic variables, temporal lags), analyzing group importance can be more meaningful and efficient than individual feature analysis.
Why group analysis?
Semantic meaning: "How important is demographic information?" is often more meaningful than "How important is age vs. income?"
Efficiency: With 100 features in 10 groups, analyzing groups requires only 11 models (baseline + 10 groups) vs. 101 models for individual features
Robustness: Group importance is more stable than individual importance when features within groups are correlated
Practical decisions: Data collection decisions often involve whole categories ("Should we collect sensor data?") rather than individual features
The way you define groups should reflect your domain knowledge and practical needs. Dropping 'all location features' answers different questions than dropping 'zip code' alone.
```python
from typing import Dict, List, Optional

import numpy as np
import pandas as pd
from sklearn.base import clone


def group_drop_column_importance(
    model,
    X_train: np.ndarray,
    y_train: np.ndarray,
    X_val: np.ndarray,
    y_val: np.ndarray,
    feature_groups: Dict[str, List[int]],
    feature_names: Optional[List[str]] = None,
    scoring: str = 'accuracy',
    verbose: bool = True
) -> pd.DataFrame:
    """
    Compute drop-column importance for feature groups.

    Args:
        model: Model to evaluate
        X_train, y_train: Training data
        X_val, y_val: Validation data
        feature_groups: Dict mapping group names to lists of feature indices
        feature_names: Optional individual feature names
        scoring: Scoring metric
        verbose: Print progress

    Returns:
        DataFrame with group importance results
    """
    from sklearn.metrics import get_scorer

    scorer = get_scorer(scoring)
    n_features = X_train.shape[1]
    if feature_names is None:
        feature_names = [f"F{i}" for i in range(n_features)]

    # Train baseline
    if verbose:
        print("Training baseline model with all features...")
    baseline = clone(model)
    baseline.fit(X_train, y_train)
    baseline_score = scorer(baseline, X_val, y_val)
    if verbose:
        print(f"  Baseline {scoring}: {baseline_score:.4f}")

    results = []
    for group_name, indices in feature_groups.items():
        if verbose:
            feature_list = [feature_names[i] for i in indices]
            print(f"Dropping group '{group_name}' ({len(indices)} features: {feature_list[:3]}...)")

        # Drop all features in this group
        keep_indices = [i for i in range(n_features) if i not in indices]
        X_train_reduced = X_train[:, keep_indices]
        X_val_reduced = X_val[:, keep_indices]

        # Train and score
        model_reduced = clone(model)
        model_reduced.fit(X_train_reduced, y_train)
        reduced_score = scorer(model_reduced, X_val_reduced, y_val)

        importance = baseline_score - reduced_score

        if verbose:
            print(f"  Score without: {reduced_score:.4f}, importance: {importance:+.4f}")

        results.append({
            'group': group_name,
            'n_features': len(indices),
            'features': [feature_names[i] for i in indices],
            'importance': importance,
            'score_without': reduced_score,
            'importance_per_feature': importance / len(indices)
        })

    df = pd.DataFrame(results)
    df['baseline_score'] = baseline_score
    df['importance_pct'] = (df['importance'] / baseline_score * 100).round(2)

    return df.sort_values('importance', ascending=False).reset_index(drop=True)


# Example with semantic feature groups
if __name__ == "__main__":
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    np.random.seed(42)
    n_samples = 1000

    # Create features that belong to different semantic groups
    # Demographic features (indices 0-2)
    age = np.random.randint(18, 80, n_samples) / 100               # Normalized
    income = np.random.exponential(50000, n_samples) / 200000      # Normalized
    education_years = np.random.randint(8, 22, n_samples) / 25     # Normalized

    # Behavioral features (indices 3-5)
    visit_frequency = np.random.poisson(10, n_samples) / 20
    avg_session_time = np.random.exponential(300, n_samples) / 1000
    pages_per_visit = np.random.poisson(5, n_samples) / 10

    # Location features (indices 6-8)
    distance_to_store = np.random.exponential(20, n_samples) / 50
    urban_score = np.random.beta(2, 5, n_samples)
    competitor_density = np.random.poisson(3, n_samples) / 10

    # Combine all features
    X = np.column_stack([
        age, income, education_years,
        visit_frequency, avg_session_time, pages_per_visit,
        distance_to_store, urban_score, competitor_density
    ])

    feature_names = [
        'age', 'income', 'education_years',
        'visit_frequency', 'avg_session_time', 'pages_per_visit',
        'distance_to_store', 'urban_score', 'competitor_density'
    ]

    # Target: depends mainly on behavioral features
    y = ((visit_frequency * 2 + avg_session_time * 3 + pages_per_visit
          - distance_to_store * 0.5 + income * 0.3
          + np.random.randn(n_samples) * 0.3) > 0.5).astype(int)

    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

    # Define semantic feature groups
    feature_groups = {
        'demographics': [0, 1, 2],  # age, income, education
        'behavior': [3, 4, 5],      # visit patterns
        'location': [6, 7, 8],      # geographic features
    }

    rf = RandomForestClassifier(n_estimators=100, random_state=42)

    print("Group Drop-Column Importance Analysis")
    print("=" * 70)

    group_results = group_drop_column_importance(
        rf, X_train, y_train, X_val, y_val,
        feature_groups=feature_groups,
        feature_names=feature_names,
        verbose=True
    )

    print("\nGroup Importance Summary:")
    print("-" * 70)
    display_cols = ['group', 'n_features', 'importance',
                    'importance_pct', 'importance_per_feature']
    print(group_results[display_cols].to_string(index=False))

    print("\n📊 Interpretation:")
    top_group = group_results.iloc[0]['group']
    print(f"  Most important group: '{top_group}'")
    print(f"  This informs data collection priorities and feature engineering focus.")
```

Given its computational cost, when is drop-column importance worth the investment?
Use drop-column importance when:
- Feature acquisition or retention is a real decision (e.g., whether to keep paying for a data source) and you need each feature's true marginal value.
- Correlated or redundant features make permutation importance hard to interpret.
- The model trains quickly enough, or the feature set is small enough, that p + 1 retrainings are affordable (possibly with the optimization strategies above).
- You need robust, defensible importance estimates, for example for stakeholders or regulators.
Prefer other methods when:
- Training is expensive or the feature count is large and even screening, proxy models, and subsampling cannot make retraining tractable.
- You only need a quick exploratory ranking of features.
- The question is how the current trained model relies on its inputs, not how a retrained model would cope without them.
| Criterion | Impurity-Based | Permutation | Drop-Column |
|---|---|---|---|
| Computational cost | ⭐⭐⭐⭐⭐ (Free) | ⭐⭐⭐ (Fast) | ⭐ (Expensive) |
| Measures generalization | ❌ | ✅ | ✅ |
| Handles feature adaptation | ❌ | ❌ | ✅ |
| Detects overfitting | ❌ | ✅ | ✅ |
| Works for any model | ❌ | ✅ | ✅ |
| Stable estimates | ⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| Captures redundancy | ❌ | ❌ | ✅ |
Drop-column importance represents the gold standard for understanding feature value—at the cost of significant computational expense. Let's consolidate the key concepts:
- Retraining without each feature measures its marginal contribution after the model has adapted, which frozen-model methods cannot capture.
- Low drop-column importance means replaceable, not useless: redundant features may still be worth keeping for robustness.
- The O(p × training time) cost can be tamed with pre-screening, proxy models, subsampling, and parallelism.
- Cross-validated estimates add stability and confidence intervals, and group-level analysis answers category-level questions at a fraction of the cost.
What's next:
We've now covered three methods for measuring feature importance, each with distinct strengths. The next page explores the biases inherent in feature importance methods—understanding when importance estimates can be misleading and how to recognize and mitigate these issues.
You now understand drop-column importance as the gold standard for measuring true feature value, including its theoretical advantages, computational optimization strategies, cross-validation extensions, and group analysis capabilities. You can make informed decisions about when this method justifies its computational cost.