When you train a Random Forest or Gradient Boosted ensemble and achieve remarkable predictive accuracy, a natural question emerges: Which features are actually driving these predictions? This question isn't merely academic curiosity—it's fundamental to model interpretability, feature engineering, domain validation, and ultimately building trust in machine learning systems.
Impurity-based importance, also known as Mean Decrease in Impurity (MDI) or Gini importance, represents the most widely used method for answering this question in tree-based models. It's computed automatically during training at no extra cost and provides intuitive rankings of feature significance. But beneath this elegant simplicity lies a sophisticated mathematical framework—and subtle biases that every practitioner must understand.
By the end of this page, you will understand: (1) The mathematical foundation of impurity-based importance, (2) How it's computed in decision trees and ensembles, (3) The relationship between Gini impurity, entropy, and importance scores, (4) Step-by-step calculation examples, and (5) The fundamental assumptions underlying this method.
Consider how decision trees make predictions. At each internal node, the tree examines a single feature and asks: "Based on this feature's value, how should we partition the data?" The goal is to create child nodes that are purer than the parent—meaning samples within each child node are more homogeneous with respect to the target variable.
The key insight: If a feature enables highly effective splits—creating much purer child nodes—then that feature must carry substantial predictive information. Features that consistently enable large impurity reductions across many splits are fundamentally more important for the model's decision-making process.
Impurity-based importance quantifies a feature's contribution as the total reduction in impurity it provides across all nodes where it's used for splitting, weighted by the number of samples reaching those nodes. Features that achieve large impurity reductions at well-populated nodes rank higher.
Formalizing the intuition:
Let's define the components mathematically. Consider a split of a node containing $N$ samples that sends $N_L$ samples to the left child and $N_R$ samples to the right child. The impurity decrease from this split is:
$$\Delta I = I_{parent} - \frac{N_L}{N} I_L - \frac{N_R}{N} I_R$$
This weighted average accounts for the fact that a split creating one very pure but tiny child and one large impure child isn't as valuable as a split creating two reasonably pure, well-balanced children.
| Term | Symbol | Meaning | Range |
|---|---|---|---|
| Parent impurity | $I_{parent}$ | Impurity before split | [0, max] |
| Left child impurity | $I_L$ | Impurity of left partition | [0, max] |
| Right child impurity | $I_R$ | Impurity of right partition | [0, max] |
| Left sample fraction | $N_L / N$ | Proportion going left | (0, 1) |
| Right sample fraction | $N_R / N$ | Proportion going right | (0, 1) |
| Impurity decrease | $\Delta I$ | Reduction achieved | [0, $I_{parent}$] |
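As a quick numeric illustration (the same numbers appear in the code example further below): a parent node with Gini impurity 0.5 splits 100 samples into a left child of 60 samples with Gini 0.32 and a right child of 40 samples with Gini 0.18, giving

$$\Delta I = 0.5 - \tfrac{60}{100}(0.32) - \tfrac{40}{100}(0.18) = 0.5 - 0.192 - 0.072 = 0.236$$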
For classification tasks, two impurity measures dominate: Gini impurity and entropy. Both quantify the "disorder" or "heterogeneity" of a node's class distribution, but they differ subtly in their mathematical properties and practical implications.
Gini Impurity:
Gini impurity measures the probability that a randomly chosen sample would be incorrectly classified if labeled according to the distribution of labels in the node:
$$I_{Gini}(t) = 1 - \sum_{k=1}^{K} p_k^2 = \sum_{k=1}^{K} p_k(1 - p_k)$$
where $p_k$ is the proportion of samples in node $t$ belonging to class $k$.
Key properties of Gini impurity: it equals 0 for a perfectly pure node, reaches its maximum of $1 - 1/K$ when all $K$ classes are equally represented, and is cheap to compute because it involves no logarithms.
The term $p_k(1-p_k)$ represents the probability of selecting a sample from class $k$ and then a sample NOT from class $k$. Summing over all classes gives the total probability of a classification disagreement.
Entropy (Information Gain):
Entropy originates from information theory and measures the expected information content (in bits) needed to describe the class of a random sample:
$$I_{entropy}(t) = -\sum_{k=1}^{K} p_k \log_2(p_k)$$
By convention, $0 \log 0 = 0$.
Key properties of entropy: it equals 0 for a pure node, reaches its maximum of $\log_2 K$ bits under a uniform class distribution (1 bit for a balanced binary node), and is slightly more sensitive than Gini to changes in skewed class distributions.
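For a concrete feel for both measures, take a binary node with class proportions (0.7, 0.3), the same node used in the code example below:

$$I_{Gini} = 1 - 0.7^2 - 0.3^2 = 1 - 0.49 - 0.09 = 0.42$$

$$I_{entropy} = -0.7\log_2 0.7 - 0.3\log_2 0.3 \approx 0.360 + 0.521 = 0.881 \text{ bits}$$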
| $p_1$ (Class 1) | $p_2$ (Class 2) | Gini Impurity | Entropy (bits) |
|---|---|---|---|
| 0.0 | 1.0 | 0.000 | 0.000 |
| 0.1 | 0.9 | 0.180 | 0.469 |
| 0.2 | 0.8 | 0.320 | 0.722 |
| 0.3 | 0.7 | 0.420 | 0.881 |
| 0.4 | 0.6 | 0.480 | 0.971 |
| 0.5 | 0.5 | 0.500 | 1.000 |
Practical behavior comparison:
Both measures achieve their minimum at pure nodes and maximum at uniform distributions. However, entropy is slightly more sensitive to changes in skewed distributions, while Gini tends to favor creating one large pure node when possible.
For computing feature importance, both measures typically produce highly correlated rankings. The choice between them rarely impacts which features are deemed most important, though absolute importance values will differ due to the different scales.
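As a quick check of this claim, one can train the same forest under both criteria and compare the resulting rankings. The sketch below is illustrative (the dataset and parameter values are assumptions, not taken from the text):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=12, n_informative=6,
                           random_state=0)

imp = {}
for criterion in ["gini", "entropy"]:
    rf = RandomForestClassifier(n_estimators=200, criterion=criterion,
                                random_state=0).fit(X, y)
    imp[criterion] = rf.feature_importances_

# Spearman rank correlation between the two importance vectors
rho, _ = spearmanr(imp["gini"], imp["entropy"])
print("Rank correlation of importances (gini vs entropy):", round(rho, 3))
# Typically very close to 1: the two criteria rarely disagree about which
# features matter, even though the absolute importance values differ.
```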
```python
import numpy as np


def gini_impurity(class_proportions):
    """
    Calculate Gini impurity for a node.

    Args:
        class_proportions: Array of class proportions (must sum to 1)

    Returns:
        Gini impurity value in [0, 1 - 1/K]
    """
    return 1 - np.sum(class_proportions ** 2)


def entropy(class_proportions):
    """
    Calculate entropy for a node.

    Args:
        class_proportions: Array of class proportions (must sum to 1)

    Returns:
        Entropy in bits, in [0, log2(K)]
    """
    # Filter out zero proportions to avoid log(0)
    p = class_proportions[class_proportions > 0]
    return -np.sum(p * np.log2(p))


def impurity_decrease(parent_impurity, left_impurity, right_impurity,
                      n_left, n_right):
    """
    Calculate the impurity decrease (information gain) from a split.

    Args:
        parent_impurity: Impurity of the parent node
        left_impurity: Impurity of left child
        right_impurity: Impurity of right child
        n_left: Number of samples in left child
        n_right: Number of samples in right child

    Returns:
        Weighted impurity decrease
    """
    n_total = n_left + n_right
    weighted_child_impurity = (
        (n_left / n_total) * left_impurity
        + (n_right / n_total) * right_impurity
    )
    return parent_impurity - weighted_child_impurity


# Example: Binary classification
# Node with 70% class 0, 30% class 1
p = np.array([0.7, 0.3])
print(f"Class proportions: {p}")
print(f"Gini impurity: {gini_impurity(p):.4f}")  # 0.42
print(f"Entropy: {entropy(p):.4f} bits")         # 0.881

# Example: Impurity decrease from a split
# Parent: 100 samples, 50/50 split → Gini = 0.5
# Left: 60 samples, 80% class 0 → Gini = 0.32
# Right: 40 samples, 10% class 0 → Gini = 0.18
parent_gini = gini_impurity(np.array([0.5, 0.5]))
left_gini = gini_impurity(np.array([0.8, 0.2]))
right_gini = gini_impurity(np.array([0.1, 0.9]))

gain = impurity_decrease(parent_gini, left_gini, right_gini, 60, 40)
print(f"\nImpurity decrease: {gain:.4f}")  # 0.5 - 0.264 = 0.236
```

For regression trees, we replace the class-based impurity measures with variance-based measures that quantify how spread out the target values are within a node.
Mean Squared Error (Variance):
The most common regression impurity measure is the variance (equivalent to MSE from the node's mean prediction):
$$I_{MSE}(t) = \frac{1}{N_t} \sum_{i \in t} (y_i - \bar{y}_t)^2 = Var(y_t)$$
where $\bar{y}_t$ is the mean target value of samples in node $t$.
Mean Absolute Error:
An alternative measure using absolute deviations:
$$I_{MAE}(t) = \frac{1}{N_t} \sum_{i \in t} |y_i - median(y_t)|$$
MAE is more robust to outliers but computationally more expensive due to median computation.
In regression, 'purity' means target values are tightly clustered around a single value (low variance). A perfect split would create child nodes in which every sample's target equals the node's predicted mean, giving zero variance.
Variance reduction calculation:
The impurity decrease for regression is computed identically to classification, substituting variance for Gini/entropy:
$$\Delta I = Var(y_{parent}) - \frac{N_L}{N} Var(y_L) - \frac{N_R}{N} Var(y_R)$$
This quantity is also known as the reduction in variance or variance reduction.
A key mathematical fact: The variance decrease from a split can be expressed in terms of the squared difference between child means:
$$\Delta I = \frac{N_L \cdot N_R}{(N_L + N_R)^2} (\bar{y}_L - \bar{y}_R)^2$$
This reveals that splits maximizing variance reduction are those that create children with maximally different mean predictions and balanced sample sizes.
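A brief sketch of why this identity holds, writing $w_L = N_L/N$ and $w_R = N_R/N$: by the law of total variance, the parent variance decomposes into a within-child part and a between-child part, and only the between-child part survives in $\Delta I$:

$$\Delta I = Var(y_{parent}) - w_L \, Var(y_L) - w_R \, Var(y_R) = w_L(\bar{y}_L - \bar{y})^2 + w_R(\bar{y}_R - \bar{y})^2$$

Since $\bar{y} = w_L \bar{y}_L + w_R \bar{y}_R$, we get $\bar{y}_L - \bar{y} = w_R(\bar{y}_L - \bar{y}_R)$ and $\bar{y}_R - \bar{y} = -w_L(\bar{y}_L - \bar{y}_R)$, so the right-hand side collapses to $w_L w_R (\bar{y}_L - \bar{y}_R)^2 = \frac{N_L N_R}{(N_L + N_R)^2}(\bar{y}_L - \bar{y}_R)^2$.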
```python
import numpy as np


def variance_impurity(y):
    """
    Calculate variance impurity for a regression node.

    Args:
        y: Array of target values in the node

    Returns:
        Variance of target values
    """
    return np.var(y)


def mae_impurity(y):
    """
    Calculate MAE impurity for a regression node.

    Args:
        y: Array of target values in the node

    Returns:
        Mean absolute deviation from median
    """
    return np.mean(np.abs(y - np.median(y)))


def variance_reduction(y_parent, y_left, y_right):
    """
    Calculate variance reduction from a regression split.

    Args:
        y_parent: Target values in parent node
        y_left: Target values in left child
        y_right: Target values in right child

    Returns:
        Variance reduction achieved by the split
    """
    n_parent = len(y_parent)
    n_left = len(y_left)
    n_right = len(y_right)

    var_parent = variance_impurity(y_parent)
    var_left = variance_impurity(y_left)
    var_right = variance_impurity(y_right)

    weighted_child_variance = (
        (n_left / n_parent) * var_left
        + (n_right / n_parent) * var_right
    )
    return var_parent - weighted_child_variance


# Example: Regression split
# Parent node: 10 samples with values [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y_parent = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Split at value 5: left = [1,2,3,4,5], right = [6,7,8,9,10]
y_left = np.array([1, 2, 3, 4, 5])
y_right = np.array([6, 7, 8, 9, 10])

print(f"Parent variance: {variance_impurity(y_parent):.4f}")      # 8.25
print(f"Left child variance: {variance_impurity(y_left):.4f}")    # 2.0
print(f"Right child variance: {variance_impurity(y_right):.4f}")  # 2.0
print(f"Variance reduction: {variance_reduction(y_parent, y_left, y_right):.4f}")  # 6.25

# Verify using the mean difference formula
n_l, n_r = len(y_left), len(y_right)
mean_diff = np.mean(y_left) - np.mean(y_right)
formula_result = (n_l * n_r / (n_l + n_r)**2) * mean_diff**2
print(f"Formula verification: {formula_result:.4f}")  # 6.25
```

With impurity decrease defined for individual splits, we can now compute feature importance for an entire decision tree. The algorithm is elegant:
Algorithm: Mean Decrease in Impurity (MDI)

1. For every internal node $t$, record the splitting feature and the weighted impurity decrease $N_t \cdot \Delta I_t$.
2. For each feature $j$, sum these weighted decreases over all nodes that split on $j$.
3. Normalize the per-feature totals so they sum to 1.
The impurity decrease is weighted by the number of samples $N_t$ reaching node $t$. This reflects the principle that splits affecting more samples should contribute more to feature importance. A split at the root (affecting all samples) contributes more than a deep split (affecting few samples).
Formal definition:
For a tree $T$ with a set of internal nodes $\mathcal{N}$, the importance of feature $j$ is:
$$Imp(j) = \sum_{t \in \mathcal{N} \,:\, \text{split\_feature}(t) = j} N_t \cdot \Delta I_t$$
where $N_t$ is the number of samples reaching node $t$ and $\Delta I_t$ is the impurity decrease at node $t$.
Normalization:
To make importances interpretable as proportions (summing to 1), we normalize:
$$Imp_{normalized}(j) = \frac{Imp(j)}{\sum_{k} Imp(k)}$$
This allows statements like "feature X accounts for 30% of the total impurity reduction in this tree."
```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

# Create synthetic dataset
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=5,
    n_redundant=2,
    n_clusters_per_class=2,
    random_state=42
)

# Train a decision tree
tree = DecisionTreeClassifier(max_depth=5, random_state=42)
tree.fit(X, y)

# Access the built-in impurity-based importance
importance_builtin = tree.feature_importances_
print("Built-in feature importances:")
for i, imp in enumerate(importance_builtin):
    print(f"  Feature {i}: {imp:.4f}")


# Manual computation of feature importance (for educational purposes)
def compute_mdi_importance(tree):
    """
    Manually compute Mean Decrease in Impurity importance.

    Args:
        tree: Fitted sklearn DecisionTreeClassifier

    Returns:
        Array of feature importances
    """
    tree_ = tree.tree_
    n_features = tree_.n_features
    importance = np.zeros(n_features)

    for node_id in range(tree_.node_count):
        # Skip leaf nodes
        if tree_.children_left[node_id] == -1:
            continue

        # Get node properties
        feature = tree_.feature[node_id]
        n_samples = tree_.n_node_samples[node_id]
        impurity = tree_.impurity[node_id]

        # Get children properties
        left_child = tree_.children_left[node_id]
        right_child = tree_.children_right[node_id]
        n_left = tree_.n_node_samples[left_child]
        n_right = tree_.n_node_samples[right_child]
        impurity_left = tree_.impurity[left_child]
        impurity_right = tree_.impurity[right_child]

        # Compute weighted impurity decrease
        impurity_decrease = (
            n_samples * impurity
            - n_left * impurity_left
            - n_right * impurity_right
        )
        importance[feature] += impurity_decrease

    # Normalize to sum to 1
    importance /= tree_.n_node_samples[0]  # Divide by root node samples
    importance /= importance.sum()
    return importance


importance_manual = compute_mdi_importance(tree)
print("\nManual computation (verification):")
for i, imp in enumerate(importance_manual):
    print(f"  Feature {i}: {imp:.4f}")

# Verify they match
print(f"\nImportances match: {np.allclose(importance_builtin, importance_manual)}")
```

In Random Forests and other ensemble methods, feature importance is computed by averaging the importances across all trees in the ensemble:
$$Imp_{RF}(j) = \frac{1}{B} \sum_{b=1}^{B} Imp_{T_b}(j)$$
where $B$ is the number of trees and $Imp_{T_b}(j)$ is the importance of feature $j$ in tree $b$.
Why averaging helps:
Single-tree importances can be unstable—different random seeds produce different tree structures and thus different importance rankings. By averaging over many trees (each trained on bootstrap samples with random feature subsets), the ensemble importance smooths out this variance and provides more robust estimates.
Just as Random Forests reduce prediction variance through averaging, they also reduce variance in importance estimates. This makes ensemble importances more reliable than single-tree importances, especially when feature correlations exist.
The role of feature subsampling:
Random Forests use a random subset of features at each split (typically $\sqrt{p}$ for classification or $p/3$ for regression). This has important implications for feature importance:
Decorrelation effect: Correlated features get opportunities to be selected independently across different trees, potentially sharing importance more equitably
Detection of weak features: Features that would never be selected in a greedy tree (overshadowed by stronger features) can prove their worth when stronger features are excluded from consideration
Importance inflation for rare splitters: Features that rarely appear (due to low max_features) get their importance concentrated in fewer splits, which can inflate or deflate their normalized importance. The sketch below illustrates how the max_features setting redistributes importance between correlated features.
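A rough illustration of the decorrelation effect (the data-generating setup and parameter values are assumptions for this sketch, not taken from the text): two noisy copies of the same signal are compared under different max_features settings:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 2000
signal = rng.normal(size=n)
x0 = signal + 0.05 * rng.normal(size=n)   # strong, nearly clean copy of the signal
x1 = signal + 0.30 * rng.normal(size=n)   # weaker, noisier copy of the same signal
x2 = rng.normal(size=n)                   # independent noise
X = np.column_stack([x0, x1, x2])
y = (signal > 0).astype(int)

for max_features in [None, 1]:            # None = consider all features at every split
    rf = RandomForestClassifier(n_estimators=300, max_features=max_features,
                                random_state=0).fit(X, y)
    print(f"max_features={max_features}: importances = "
          f"{np.round(rf.feature_importances_, 3)}")

# With max_features=None the cleaner copy (x0) tends to dominate; with max_features=1
# the noisier copy (x1) is regularly the only available candidate and typically picks
# up a larger share of the importance.
```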
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Create dataset with known informative features
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=4,   # Features 0-3 are truly informative
    n_redundant=2,     # Features 4-5 are linear combinations
    n_repeated=0,
    n_classes=2,
    random_state=42
)

# Train Random Forest
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    random_state=42,
    n_jobs=-1
)
rf.fit(X, y)

# Get ensemble importance (averaged across trees)
ensemble_importance = rf.feature_importances_

# Also compute individual tree importances to show variance
tree_importances = np.array([tree.feature_importances_ for tree in rf.estimators_])

# Compute mean and standard deviation across trees
importance_mean = tree_importances.mean(axis=0)
importance_std = tree_importances.std(axis=0)

# Display results
print("Feature Importance (Random Forest):")
print("=" * 55)
print(f"{'Feature':<10} {'Importance':<12} {'Std Dev':<12} {'Status'}")
print("-" * 55)

# True informative features are 0-3
for i in range(10):
    status = "Informative" if i < 4 else ("Redundant" if i < 6 else "Noise")
    print(f"Feature {i:<2} {importance_mean[i]:>10.4f} {importance_std[i]:>10.4f} {status}")

# Visualize importance with error bars
fig, ax = plt.subplots(figsize=(10, 6))
x_pos = np.arange(10)
ax.bar(x_pos, importance_mean, yerr=importance_std, capsize=4,
       color='steelblue', alpha=0.8)
ax.set_xlabel('Feature Index')
ax.set_ylabel('Mean Decrease in Impurity')
ax.set_title('Random Forest Feature Importance with Standard Deviation')
ax.set_xticks(x_pos)

# Add feature type annotations
for i, (mean, std) in enumerate(zip(importance_mean, importance_std)):
    if i < 4:
        ax.annotate('Informative', (i, mean + std + 0.01),
                    ha='center', fontsize=8, color='green')
    elif i < 6:
        ax.annotate('Redundant', (i, mean + std + 0.01),
                    ha='center', fontsize=8, color='orange')

plt.tight_layout()
```

Understanding the theoretical properties of impurity-based importance helps us interpret results correctly and recognize limitations.
Property 1: Non-negativity
Impurity-based importance is always non-negative: $Imp(j) \geq 0$ for all features $j$. This follows from the fact that any split must reduce weighted impurity (or at least not increase it), so $\Delta I_t \geq 0$ for all nodes.
Property 2: Summation to unity
After normalization, importances sum to 1: $\sum_j Imp(j) = 1$. This allows interpretation as proportional contributions to total impurity reduction.
Property 3: Unused features have zero importance
Features that never appear in any split have exactly zero importance. This can occur because the feature is genuinely uninformative, because it is redundant with a stronger feature that always wins the split, or because depth limits and feature subsampling never gave it a realistic chance to be chosen.
A feature with zero impurity-based importance may still be predictive! It might just be redundant with another feature that was chosen instead. Always verify with complementary methods like permutation importance.
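The following sketch (with an illustrative synthetic dataset chosen for this example) checks Properties 1 and 2 numerically and demonstrates the caveat above: an exact duplicate of an informative feature can end up with very little importance:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
X_dup = np.hstack([X, X[:, [0]]])   # feature 5 is an exact copy of feature 0

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_dup, y)
imp = tree.feature_importances_

print("Importances:", np.round(imp, 3))
print("Non-negative:", bool(np.all(imp >= 0)))        # Property 1
print("Sum to 1:", bool(np.isclose(imp.sum(), 1.0)))  # Property 2
# Feature 0 and its copy (feature 5) carry identical information, yet the splits,
# and hence the importance, are divided between them arbitrarily; either one can
# end up near zero without being any less predictive (the Property 3 caveat).
```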
Property 4: Bias toward features with more splits
This is the critical limitation of impurity-based importance. Consider two features: Feature A is a continuous variable with hundreds of distinct values, offering many candidate split thresholds at every node; Feature B is a binary flag, offering exactly one possible split.
Even if Feature B is more predictive per split, Feature A will likely accumulate more importance simply because it has more splitting opportunities. This bias is especially severe for high-cardinality categorical features, continuous features with many unique values, and identifier-like noise features, which can almost always find some split that reduces training impurity even when they carry no real signal.
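A minimal sketch of this bias, using made-up data (the feature construction, noise level, and sample size are assumptions for illustration): a pure-noise continuous feature competes with a genuinely informative binary feature and still collects a noticeable share of MDI, simply because it offers many split points deep in the trees:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000
informative = rng.integers(0, 2, size=n)                 # binary, truly predictive
noise = rng.normal(size=n)                               # continuous, pure noise
y = (informative ^ (rng.random(n) < 0.1)).astype(int)    # informative signal + 10% label noise

X = np.column_stack([informative, noise])
rf = RandomForestClassifier(n_estimators=200, max_features=None,
                            random_state=0).fit(X, y)
print("MDI, binary informative feature:", round(rf.feature_importances_[0], 3))
print("MDI, continuous noise feature:  ", round(rf.feature_importances_[1], 3))
# The noise feature typically receives a non-trivial share of the importance:
# once the binary split is exhausted, only the noise feature can keep reducing
# training impurity, so fully grown trees overfit to it split after split.
```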
Property 5: Sensitivity to tree depth
Deeper trees use more features and distribute importance more evenly. Shallow trees concentrate importance in root-proximal features. This means importance rankings can change dramatically based on max_depth settings.
| Tree Depth | Splits Used | Importance Concentration | Typical Use Case |
|---|---|---|---|
| max_depth=1 | 1 | Single feature gets 100% | Baseline/sanity check |
| max_depth=3 | 7 | Top 3-5 features dominate | Interpretable models |
| max_depth=10 | ~1000 | Many features contribute | Balanced models |
| max_depth=None | ~n_samples | Importance spread widely including noise | Maximum accuracy |
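To make the depth effect concrete, here is a small sketch (the dataset and parameter values are illustrative assumptions) that counts how many features receive non-zero importance and how concentrated the top of the ranking is as max_depth grows:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=15, n_informative=6,
                           random_state=0)

for depth in [1, 3, 10, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X, y)
    imp = tree.feature_importances_
    n_used = int(np.sum(imp > 0))                 # features with any importance
    top_share = np.sort(imp)[::-1][:3].sum()      # share held by the top 3 features
    print(f"max_depth={str(depth):>4}: features used={n_used:>2}, "
          f"top-3 importance share={top_share:.2f}")
```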
Let's walk through a complete example of computing impurity-based importance for a small decision tree.
Dataset: 100 samples, 3 features, binary classification (50 positive, 50 negative)
Tree Structure:
```
Root (n=100, 50+/50-)
Split: Feature 0 ≤ 0.5
├── Node A (n=60, 45+/15-)
│   Split: Feature 1 ≤ 0.3
│   ├── Leaf (n=45, 40+/5-)
│   └── Leaf (n=15, 5+/10-)
└── Node B (n=40, 5+/35-)
    Split: Feature 2 ≤ 0.7
    ├── Leaf (n=14, 4+/10-)
    └── Leaf (n=26, 1+/25-)
```
Step 1: Compute impurities for all nodes
Using Gini impurity: $I = 1 - p_+^2 - p_-^2$
| Node | Samples | Class Distribution | Gini Impurity |
|---|---|---|---|
| Root | 100 | 50+/50- = (0.5, 0.5) | $1 - 0.5^2 - 0.5^2 = 0.500$ |
| A | 60 | 45+/15- = (0.75, 0.25) | $1 - 0.75^2 - 0.25^2 = 0.375$ |
| B | 40 | 5+/35- = (0.125, 0.875) | $1 - 0.125^2 - 0.875^2 = 0.219$ |
| A-Left | 45 | 40+/5- = (0.889, 0.111) | $1 - 0.889^2 - 0.111^2 = 0.198$ |
| A-Right | 15 | 5+/10- = (0.333, 0.667) | $1 - 0.333^2 - 0.667^2 = 0.444$ |
| B-Left | 14 | 4+/10- = (0.286, 0.714) | $1 - 0.286^2 - 0.714^2 = 0.408$ |
| B-Right | 26 | 1+/25- = (0.038, 0.962) | $1 - 0.038^2 - 0.962^2 = 0.074$ |
Step 2: Compute impurity decrease for each split
Root split (Feature 0): $$\Delta I_{root} = 0.500 - \frac{60}{100}(0.375) - \frac{40}{100}(0.219)$$ $$= 0.500 - 0.225 - 0.088 = 0.187$$
Weighted by samples: $100 \times 0.187 = 18.7$
Node A split (Feature 1): $$\Delta I_A = 0.375 - \frac{45}{60}(0.198) - \frac{15}{60}(0.444)$$ $$= 0.375 - 0.149 - 0.111 = 0.115$$
Weighted by samples: $60 \times 0.115 = 6.9$
Node B split (Feature 2): $$\Delta I_B = 0.219 - \frac{14}{40}(0.408) - \frac{26}{40}(0.074)$$ $$= 0.219 - 0.143 - 0.048 = 0.028$$
Weighted by samples: $40 \times 0.028 = 1.1$
Step 3: Aggregate by feature
| Feature | Nodes Used | Weighted Contribution | Normalized Importance |
|---|---|---|---|
| Feature 0 | Root | 18.7 | $18.7 / 26.7 = 0.700$ |
| Feature 1 | Node A | 6.9 | $6.9 / 26.7 = 0.258$ |
| Feature 2 | Node B | 1.1 | $1.1 / 26.7 = 0.042$ |
| Total | — | 26.7 | 1.000 |
Interpretation: Feature 0 (used at the root) accounts for 70% of total impurity reduction, making it the most important feature by this metric. Feature 2 contributes only 4.2%, suggesting it provides minimal additional information beyond what Features 0 and 1 capture.
Features selected at the root tend to have highest importance because: (1) they affect all samples, and (2) the root has maximum impurity to reduce. This is often appropriate—the root feature typically IS most informative—but can exaggerate the gap between top features and others.
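For readers who want to verify the arithmetic, here is a minimal Python transcription of the worked example; the printed values reproduce the tables above up to rounding:

```python
# Sanity check of the worked example: weighted impurity decreases per feature
def gini(p_pos):
    """Gini impurity of a binary node with positive-class proportion p_pos."""
    return 1 - p_pos**2 - (1 - p_pos)**2

weighted = {
    "Feature 0": 100 * (gini(0.50)  - 0.60 * gini(0.75)    - 0.40 * gini(0.125)),
    "Feature 1":  60 * (gini(0.75)  - 0.75 * gini(40 / 45) - 0.25 * gini(5 / 15)),
    "Feature 2":  40 * (gini(0.125) - 0.35 * gini(4 / 14)  - 0.65 * gini(1 / 26)),
}
total = sum(weighted.values())
for name, contrib in weighted.items():
    print(f"{name}: weighted decrease {contrib:5.2f}, normalized {contrib / total:.3f}")
# Expected output (up to rounding): 18.75 → 0.700, 6.94 → 0.259, 1.11 → 0.042
```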
When using impurity-based importance in practice, several considerations will help you extract reliable insights.
```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor


def analyze_feature_importance(X, y, feature_names=None, n_runs=5,
                               is_classifier=True, **rf_params):
    """
    Comprehensive impurity-based importance analysis with stability assessment.

    Args:
        X: Feature matrix
        y: Target vector
        feature_names: List of feature names
        n_runs: Number of random seed runs for stability
        is_classifier: True for classification, False for regression
        **rf_params: Additional Random Forest parameters

    Returns:
        DataFrame with importance statistics
    """
    n_features = X.shape[1]
    if feature_names is None:
        feature_names = [f"Feature_{i}" for i in range(n_features)]

    # Collect importances across multiple runs
    all_importances = []
    for run in range(n_runs):
        if is_classifier:
            rf = RandomForestClassifier(random_state=run, n_jobs=-1, **rf_params)
        else:
            rf = RandomForestRegressor(random_state=run, n_jobs=-1, **rf_params)
        rf.fit(X, y)
        all_importances.append(rf.feature_importances_)

    all_importances = np.array(all_importances)

    # Compute statistics
    results = pd.DataFrame({
        'feature': feature_names,
        'mean_importance': all_importances.mean(axis=0),
        'std_importance': all_importances.std(axis=0),
        'min_importance': all_importances.min(axis=0),
        'max_importance': all_importances.max(axis=0),
    })

    # Compute coefficient of variation (stability metric)
    results['cv'] = results['std_importance'] / results['mean_importance'].replace(0, np.inf)

    # Compute mean rank across runs (rank 1 = most important in that run)
    ranks = np.array([np.argsort(np.argsort(-imp)) + 1 for imp in all_importances])
    results['mean_rank'] = ranks.mean(axis=0)

    # Sort by mean importance
    results = results.sort_values('mean_importance', ascending=False).reset_index(drop=True)
    return results


# Example usage
if __name__ == "__main__":
    from sklearn.datasets import make_classification

    X, y = make_classification(
        n_samples=1000, n_features=20, n_informative=10,
        n_redundant=5, random_state=42
    )
    feature_names = [f"feature_{i}" for i in range(20)]

    results = analyze_feature_importance(
        X, y, feature_names=feature_names, n_runs=10,
        n_estimators=100, max_depth=10
    )

    print("Feature Importance Analysis")
    print("=" * 80)
    print(results.to_string(index=False))
    print()

    # Identify stable vs unstable features
    print("\nStability Analysis:")
    stable = results[results['cv'] < 0.2]
    unstable = results[results['cv'] >= 0.2]
    print(f"  Stable features (CV < 0.2): {len(stable)}")
    print(f"  Unstable features (CV >= 0.2): {len(unstable)}")
```

We've comprehensively examined impurity-based importance—the most common method for quantifying feature significance in tree-based models. Let's consolidate the key concepts:

- MDI credits each feature with the total impurity reduction it achieves across all of its splits, weighted by the number of samples reaching each node.
- Classification trees use Gini impurity or entropy; regression trees use variance (or MAE). The choice of criterion rarely changes which features rank highest.
- Per-tree importances are normalized to sum to 1, and ensembles average them across trees, which substantially stabilizes the estimates.
- The scores come for free at training time, but they are biased toward features with many possible splits and reflect training-time behavior rather than generalization.
What's next:
While impurity-based importance is convenient and fast, we've identified its fundamental limitation: it measures training-time behavior rather than true predictive contribution. The next page introduces Permutation Importance—a method that directly measures how each feature affects model predictions on held-out data, providing a more accurate picture of feature relevance for generalization.
You now understand the mathematical foundation, computation process, and practical considerations for impurity-based feature importance. You can correctly interpret MDI values, recognize their biases, and apply best practices for reliable feature analysis.