When you train a Random Forest or Gradient Boosted ensemble and achieve remarkable predictive accuracy, a natural question emerges: Which features are actually driving these predictions? This question isn't merely academic curiosity—it's fundamental to model interpretability, feature engineering, domain validation, and ultimately building trust in machine learning systems.
Impurity-based importance, also known as Mean Decrease in Impurity (MDI) or Gini importance, represents the most widely used method for answering this question in tree-based models. It's computed automatically during training at no extra cost and provides intuitive rankings of feature significance. But beneath this elegant simplicity lies a sophisticated mathematical framework—and subtle biases that every practitioner must understand.
By the end of this page, you will understand: (1) The mathematical foundation of impurity-based importance, (2) How it's computed in decision trees and ensembles, (3) The relationship between Gini impurity, entropy, and importance scores, (4) Step-by-step calculation examples, and (5) The fundamental assumptions underlying this method.
Consider how decision trees make predictions. At each internal node, the tree examines a single feature and asks: "Based on this feature's value, how should we partition the data?" The goal is to create child nodes that are purer than the parent—meaning samples within each child node are more homogeneous with respect to the target variable.
The key insight: If a feature enables highly effective splits—creating much purer child nodes—then that feature must carry substantial predictive information. Features that consistently enable large impurity reductions across many splits are fundamentally more important for the model's decision-making process.
Impurity-based importance quantifies a feature's contribution as the total reduction in impurity it provides across all nodes where it's used for splitting, weighted by the number of samples reaching those nodes. Features that achieve large impurity reductions at well-populated nodes rank higher.
Formalizing the intuition:
Let's define the components mathematically. Consider a split of a node containing $N$ samples that sends $N_L$ samples to the left child and $N_R$ samples to the right child. The impurity decrease from this split is:
$$\Delta I = I_{parent} - \frac{N_L}{N} I_L - \frac{N_R}{N} I_R$$
This weighted average accounts for the fact that a split creating one very pure but tiny child and one large impure child isn't as valuable as a split creating two reasonably pure, well-balanced children.
| Term | Symbol | Meaning | Range |
|---|---|---|---|
| Parent impurity | $I_{parent}$ | Impurity before split | [0, max] |
| Left child impurity | $I_L$ | Impurity of left partition | [0, max] |
| Right child impurity | $I_R$ | Impurity of right partition | [0, max] |
| Left sample fraction | $N_L / N$ | Proportion going left | (0, 1) |
| Right sample fraction | $N_R / N$ | Proportion going right | (0, 1) |
| Impurity decrease | $\Delta I$ | Reduction achieved | [0, $I_{parent}$] |
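As a quick numeric illustration (the same numbers appear in the code example further below): a parent node with Gini impurity 0.5 splits 100 samples into a left child of 60 samples with Gini 0.32 and a right child of 40 samples with Gini 0.18, giving

$$\Delta I = 0.5 - \tfrac{60}{100}(0.32) - \tfrac{40}{100}(0.18) = 0.5 - 0.192 - 0.072 = 0.236$$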
For classification tasks, two impurity measures dominate: Gini impurity and entropy. Both quantify the "disorder" or "heterogeneity" of a node's class distribution, but they differ subtly in their mathematical properties and practical implications.
Gini Impurity:
Gini impurity measures the probability that a randomly chosen sample would be incorrectly classified if labeled according to the distribution of labels in the node:
$$I_{Gini}(t) = 1 - \sum_{k=1}^{K} p_k^2 = \sum_{k=1}^{K} p_k(1 - p_k)$$
where $p_k$ is the proportion of samples in node $t$ belonging to class $k$.
Key properties of Gini impurity: it equals 0 for a perfectly pure node, reaches its maximum of $1 - 1/K$ when all $K$ classes are equally represented, and is cheap to compute because it involves no logarithms.
The term $p_k(1-p_k)$ represents the probability of selecting a sample from class $k$ and then a sample NOT from class $k$. Summing over all classes gives the total probability of a classification disagreement.
Entropy (Information Gain):
Entropy originates from information theory and measures the expected information content (in bits) needed to describe the class of a random sample:
$$I_{entropy}(t) = -\sum_{k=1}^{K} p_k \log_2(p_k)$$
By convention, $0 \log 0 = 0$.
Key properties of entropy: it equals 0 for a pure node, reaches its maximum of $\log_2 K$ bits under a uniform class distribution (1 bit for a balanced binary node), and is slightly more sensitive than Gini to changes in skewed class distributions.
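For a concrete feel for both measures, take a binary node with class proportions (0.7, 0.3), the same node used in the code example below:

$$I_{Gini} = 1 - 0.7^2 - 0.3^2 = 1 - 0.49 - 0.09 = 0.42$$

$$I_{entropy} = -0.7\log_2 0.7 - 0.3\log_2 0.3 \approx 0.360 + 0.521 = 0.881 \text{ bits}$$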
| $p_1$ (Class 1) | $p_2$ (Class 2) | Gini Impurity | Entropy (bits) |
|---|---|---|---|
| 0.0 | 1.0 | 0.000 | 0.000 |
| 0.1 | 0.9 | 0.180 | 0.469 |
| 0.2 | 0.8 | 0.320 | 0.722 |
| 0.3 | 0.7 | 0.420 | 0.881 |
| 0.4 | 0.6 | 0.480 | 0.971 |
| 0.5 | 0.5 | 0.500 | 1.000 |
Practical behavior comparison:
Both measures achieve their minimum at pure nodes and maximum at uniform distributions. However, entropy is slightly more sensitive to changes in skewed distributions, while Gini tends to favor creating one large pure node when possible.
For computing feature importance, both measures typically produce highly correlated rankings. The choice between them rarely impacts which features are deemed most important, though absolute importance values will differ due to the different scales.
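As a quick check of this claim, one can train the same forest under both criteria and compare the resulting rankings. The sketch below is illustrative (the dataset and parameter values are assumptions, not taken from the text):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=12, n_informative=6,
                           random_state=0)

imp = {}
for criterion in ["gini", "entropy"]:
    rf = RandomForestClassifier(n_estimators=200, criterion=criterion,
                                random_state=0).fit(X, y)
    imp[criterion] = rf.feature_importances_

# Spearman rank correlation between the two importance vectors
rho, _ = spearmanr(imp["gini"], imp["entropy"])
print("Rank correlation of importances (gini vs entropy):", round(rho, 3))
# Typically very close to 1: the two criteria rarely disagree about which
# features matter, even though the absolute importance values differ.
```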
```python
import numpy as np


def gini_impurity(class_proportions):
    """
    Calculate Gini impurity for a node.

    Args:
        class_proportions: Array of class proportions (must sum to 1)

    Returns:
        Gini impurity value in [0, 1 - 1/K]
    """
    return 1 - np.sum(class_proportions ** 2)


def entropy(class_proportions):
    """
    Calculate entropy for a node.

    Args:
        class_proportions: Array of class proportions (must sum to 1)

    Returns:
        Entropy in bits, in [0, log2(K)]
    """
    # Filter out zero proportions to avoid log(0)
    p = class_proportions[class_proportions > 0]
    return -np.sum(p * np.log2(p))


def impurity_decrease(parent_impurity, left_impurity, right_impurity,
                      n_left, n_right):
    """
    Calculate the impurity decrease (information gain) from a split.

    Args:
        parent_impurity: Impurity of the parent node
        left_impurity: Impurity of left child
        right_impurity: Impurity of right child
        n_left: Number of samples in left child
        n_right: Number of samples in right child

    Returns:
        Weighted impurity decrease
    """
    n_total = n_left + n_right
    weighted_child_impurity = (
        (n_left / n_total) * left_impurity
        + (n_right / n_total) * right_impurity
    )
    return parent_impurity - weighted_child_impurity


# Example: Binary classification
# Node with 70% class 0, 30% class 1
p = np.array([0.7, 0.3])
print(f"Class proportions: {p}")
print(f"Gini impurity: {gini_impurity(p):.4f}")  # 0.42
print(f"Entropy: {entropy(p):.4f} bits")         # 0.881

# Example: Impurity decrease from a split
# Parent: 100 samples, 50/50 split → Gini = 0.5
# Left: 60 samples, 80% class 0 → Gini = 0.32
# Right: 40 samples, 10% class 0 → Gini = 0.18
parent_gini = gini_impurity(np.array([0.5, 0.5]))
left_gini = gini_impurity(np.array([0.8, 0.2]))
right_gini = gini_impurity(np.array([0.1, 0.9]))

gain = impurity_decrease(parent_gini, left_gini, right_gini, 60, 40)
print(f"\nImpurity decrease: {gain:.4f}")  # 0.5 - 0.264 = 0.236
```

For regression trees, we replace the class-based impurity measures with variance-based measures that quantify how spread out the target values are within a node.
Mean Squared Error (Variance):
The most common regression impurity measure is the variance (equivalent to MSE from the node's mean prediction):
$$I_{MSE}(t) = \frac{1}{N_t} \sum_{i \in t} (y_i - \bar{y}_t)^2 = Var(y_t)$$
where $\bar{y}_t$ is the mean target value of samples in node $t$.
Mean Absolute Error:
An alternative measure using absolute deviations:
$$I_{MAE}(t) = \frac{1}{N_t} \sum_{i \in t} |y_i - median(y_t)|$$
MAE is more robust to outliers but computationally more expensive due to median computation.
In regression, 'purity' means target values are tightly clustered around a single value (low variance). A perfect split would create child nodes in which every sample's target equals the node's predicted mean, giving zero variance.
Variance reduction calculation:
The impurity decrease for regression is computed identically to classification, substituting variance for Gini/entropy:
$$\Delta I = Var(y_{parent}) - \frac{N_L}{N} Var(y_L) - \frac{N_R}{N} Var(y_R)$$
This quantity is also known as the reduction in variance or variance reduction.
A key mathematical fact: The variance decrease from a split can be expressed in terms of the squared difference between child means:
$$\Delta I = \frac{N_L \cdot N_R}{(N_L + N_R)^2} (\bar{y}_L - \bar{y}_R)^2$$
This reveals that splits maximizing variance reduction are those that create children with maximally different mean predictions and balanced sample sizes.
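A brief sketch of why this identity holds, writing $w_L = N_L/N$ and $w_R = N_R/N$: by the law of total variance, the parent variance decomposes into a within-child part and a between-child part, and only the between-child part survives in $\Delta I$:

$$\Delta I = Var(y_{parent}) - w_L \, Var(y_L) - w_R \, Var(y_R) = w_L(\bar{y}_L - \bar{y})^2 + w_R(\bar{y}_R - \bar{y})^2$$

Since $\bar{y} = w_L \bar{y}_L + w_R \bar{y}_R$, we get $\bar{y}_L - \bar{y} = w_R(\bar{y}_L - \bar{y}_R)$ and $\bar{y}_R - \bar{y} = -w_L(\bar{y}_L - \bar{y}_R)$, so the right-hand side collapses to $w_L w_R (\bar{y}_L - \bar{y}_R)^2 = \frac{N_L N_R}{(N_L + N_R)^2}(\bar{y}_L - \bar{y}_R)^2$.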
```python
import numpy as np


def variance_impurity(y):
    """
    Calculate variance impurity for a regression node.

    Args:
        y: Array of target values in the node

    Returns:
        Variance of target values
    """
    return np.var(y)


def mae_impurity(y):
    """
    Calculate MAE impurity for a regression node.

    Args:
        y: Array of target values in the node

    Returns:
        Mean absolute deviation from median
    """
    return np.mean(np.abs(y - np.median(y)))


def variance_reduction(y_parent, y_left, y_right):
    """
    Calculate variance reduction from a regression split.

    Args:
        y_parent: Target values in parent node
        y_left: Target values in left child
        y_right: Target values in right child

    Returns:
        Variance reduction achieved by the split
    """
    n_parent = len(y_parent)
    n_left = len(y_left)
    n_right = len(y_right)

    var_parent = variance_impurity(y_parent)
    var_left = variance_impurity(y_left)
    var_right = variance_impurity(y_right)

    weighted_child_variance = (
        (n_left / n_parent) * var_left
        + (n_right / n_parent) * var_right
    )
    return var_parent - weighted_child_variance


# Example: Regression split
# Parent node: 10 samples with values [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y_parent = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Split at value 5: left = [1,2,3,4,5], right = [6,7,8,9,10]
y_left = np.array([1, 2, 3, 4, 5])
y_right = np.array([6, 7, 8, 9, 10])

print(f"Parent variance: {variance_impurity(y_parent):.4f}")      # 8.25
print(f"Left child variance: {variance_impurity(y_left):.4f}")    # 2.0
print(f"Right child variance: {variance_impurity(y_right):.4f}")  # 2.0
print(f"Variance reduction: {variance_reduction(y_parent, y_left, y_right):.4f}")  # 6.25

# Verify using the mean difference formula
n_l, n_r = len(y_left), len(y_right)
mean_diff = np.mean(y_left) - np.mean(y_right)
formula_result = (n_l * n_r / (n_l + n_r)**2) * mean_diff**2
print(f"Formula verification: {formula_result:.4f}")  # 6.25
```

With impurity decrease defined for individual splits, we can now compute feature importance for an entire decision tree. The algorithm is elegant:
Algorithm: Mean Decrease in Impurity (MDI)

1. For every internal node $t$, record the splitting feature and the weighted impurity decrease $N_t \cdot \Delta I_t$.
2. For each feature $j$, sum these weighted decreases over all nodes that split on $j$.
3. Normalize the per-feature totals so they sum to 1.
The impurity decrease is weighted by the number of samples $N_t$ reaching node $t$. This reflects the principle that splits affecting more samples should contribute more to feature importance. A split at the root (affecting all samples) contributes more than a deep split (affecting few samples).
Formal definition:
For a tree $T$ with a set of internal nodes $\mathcal{N}$, the importance of feature $j$ is:
$$Imp(j) = \sum_{t \in \mathcal{N} \,:\, \text{split\_feature}(t) = j} N_t \cdot \Delta I_t$$
where $N_t$ is the number of samples reaching node $t$ and $\Delta I_t$ is the impurity decrease at node $t$.
Normalization:
To make importances interpretable as proportions (summing to 1), we normalize:
$$Imp_{normalized}(j) = \frac{Imp(j)}{\sum_{k} Imp(k)}$$
This allows statements like "feature X accounts for 30% of the total impurity reduction in this tree."
```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

# Create synthetic dataset
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=5,
    n_redundant=2,
    n_clusters_per_class=2,
    random_state=42
)

# Train a decision tree
tree = DecisionTreeClassifier(max_depth=5, random_state=42)
tree.fit(X, y)

# Access the built-in impurity-based importance
importance_builtin = tree.feature_importances_
print("Built-in feature importances:")
for i, imp in enumerate(importance_builtin):
    print(f"  Feature {i}: {imp:.4f}")


# Manual computation of feature importance (for educational purposes)
def compute_mdi_importance(tree):
    """
    Manually compute Mean Decrease in Impurity importance.

    Args:
        tree: Fitted sklearn DecisionTreeClassifier

    Returns:
        Array of feature importances
    """
    tree_ = tree.tree_
    n_features = tree_.n_features
    importance = np.zeros(n_features)

    for node_id in range(tree_.node_count):
        # Skip leaf nodes
        if tree_.children_left[node_id] == -1:
            continue

        # Get node properties
        feature = tree_.feature[node_id]
        n_samples = tree_.n_node_samples[node_id]
        impurity = tree_.impurity[node_id]

        # Get children properties
        left_child = tree_.children_left[node_id]
        right_child = tree_.children_right[node_id]
        n_left = tree_.n_node_samples[left_child]
        n_right = tree_.n_node_samples[right_child]
        impurity_left = tree_.impurity[left_child]
        impurity_right = tree_.impurity[right_child]

        # Compute weighted impurity decrease
        impurity_decrease = (
            n_samples * impurity
            - n_left * impurity_left
            - n_right * impurity_right
        )
        importance[feature] += impurity_decrease

    # Normalize to sum to 1
    importance /= tree_.n_node_samples[0]  # Divide by root node samples
    importance /= importance.sum()
    return importance


importance_manual = compute_mdi_importance(tree)
print("\nManual computation (verification):")
for i, imp in enumerate(importance_manual):
    print(f"  Feature {i}: {imp:.4f}")

# Verify they match
print(f"\nImportances match: {np.allclose(importance_builtin, importance_manual)}")
```

In Random Forests and other ensemble methods, feature importance is computed by averaging the importances across all trees in the ensemble:
$$Imp_{RF}(j) = \frac{1}{B} \sum_{b=1}^{B} Imp_{T_b}(j)$$
where $B$ is the number of trees and $Imp_{T_b}(j)$ is the importance of feature $j$ in tree $b$.
Why averaging helps:
Single-tree importances can be unstable—different random seeds produce different tree structures and thus different importance rankings. By averaging over many trees (each trained on bootstrap samples with random feature subsets), the ensemble importance smooths out this variance and provides more robust estimates.
Just as Random Forests reduce prediction variance through averaging, they also reduce variance in importance estimates. This makes ensemble importances more reliable than single-tree importances, especially when feature correlations exist.
The role of feature subsampling:
Random Forests use a random subset of features at each split (typically $\sqrt{p}$ for classification or $p/3$ for regression). This has important implications for feature importance:
Decorrelation effect: Correlated features get opportunities to be selected independently across different trees, potentially sharing importance more equitably
Detection of weak features: Features that would never be selected in a greedy tree (overshadowed by stronger features) can prove their worth when stronger features are excluded from consideration
Importance inflation for rare splitters: Features that rarely appear (due to low max_features) get their importance concentrated in fewer splits, which can inflate or deflate their normalized importance. The sketch below illustrates how the max_features setting redistributes importance between correlated features.
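A rough illustration of the decorrelation effect (the data-generating setup and parameter values are assumptions for this sketch, not taken from the text): two noisy copies of the same signal are compared under different max_features settings:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 2000
signal = rng.normal(size=n)
x0 = signal + 0.05 * rng.normal(size=n)   # strong, nearly clean copy of the signal
x1 = signal + 0.30 * rng.normal(size=n)   # weaker, noisier copy of the same signal
x2 = rng.normal(size=n)                   # independent noise
X = np.column_stack([x0, x1, x2])
y = (signal > 0).astype(int)

for max_features in [None, 1]:            # None = consider all features at every split
    rf = RandomForestClassifier(n_estimators=300, max_features=max_features,
                                random_state=0).fit(X, y)
    print(f"max_features={max_features}: importances = "
          f"{np.round(rf.feature_importances_, 3)}")

# With max_features=None the cleaner copy (x0) tends to dominate; with max_features=1
# the noisier copy (x1) is regularly the only available candidate and typically picks
# up a larger share of the importance.
```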
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Create dataset with known informative features
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=4,   # Features 0-3 are truly informative
    n_redundant=2,     # Features 4-5 are linear combinations
    n_repeated=0,
    n_classes=2,
    random_state=42
)

# Train Random Forest
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    random_state=42,
    n_jobs=-1
)
rf.fit(X, y)

# Get ensemble importance (averaged across trees)
ensemble_importance = rf.feature_importances_

# Also compute individual tree importances to show variance
tree_importances = np.array([tree.feature_importances_ for tree in rf.estimators_])

# Compute mean and standard deviation across trees
importance_mean = tree_importances.mean(axis=0)
importance_std = tree_importances.std(axis=0)

# Display results
print("Feature Importance (Random Forest):")
print("=" * 55)
print(f"{'Feature':<10} {'Importance':<12} {'Std Dev':<12} {'Status'}")
print("-" * 55)

# True informative features are 0-3
for i in range(10):
    status = "Informative" if i < 4 else ("Redundant" if i < 6 else "Noise")
    print(f"Feature {i:<2} {importance_mean[i]:>10.4f} {importance_std[i]:>10.4f} {status}")

# Visualize importance with error bars
fig, ax = plt.subplots(figsize=(10, 6))
x_pos = np.arange(10)
ax.bar(x_pos, importance_mean, yerr=importance_std, capsize=4,
       color='steelblue', alpha=0.8)
ax.set_xlabel('Feature Index')
ax.set_ylabel('Mean Decrease in Impurity')
ax.set_title('Random Forest Feature Importance with Standard Deviation')
ax.set_xticks(x_pos)

# Add feature type annotations
for i, (mean, std) in enumerate(zip(importance_mean, importance_std)):
    if i < 4:
        ax.annotate('Informative', (i, mean + std + 0.01),
                    ha='center', fontsize=8, color='green')
    elif i < 6:
        ax.annotate('Redundant', (i, mean + std + 0.01),
                    ha='center', fontsize=8, color='orange')

plt.tight_layout()
```

Understanding the theoretical properties of impurity-based importance helps us interpret results correctly and recognize limitations.
Property 1: Non-negativity
Impurity-based importance is always non-negative: $Imp(j) \geq 0$ for all features $j$. This follows from the fact that any split must reduce weighted impurity (or at least not increase it), so $\Delta I_t \geq 0$ for all nodes.
Property 2: Summation to unity
After normalization, importances sum to 1: $\sum_j Imp(j) = 1$. This allows interpretation as proportional contributions to total impurity reduction.
Property 3: Unused features have zero importance
Features that never appear in any split have exactly zero importance. This can occur because the feature is genuinely uninformative, because it is redundant with a stronger feature that always wins the split, or because depth limits and feature subsampling never gave it a realistic chance to be chosen.
A feature with zero impurity-based importance may still be predictive! It might just be redundant with another feature that was chosen instead. Always verify with complementary methods like permutation importance.
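The following sketch (with an illustrative synthetic dataset chosen for this example) checks Properties 1 and 2 numerically and demonstrates the caveat above: an exact duplicate of an informative feature can end up with very little importance:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
X_dup = np.hstack([X, X[:, [0]]])   # feature 5 is an exact copy of feature 0

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_dup, y)
imp = tree.feature_importances_

print("Importances:", np.round(imp, 3))
print("Non-negative:", bool(np.all(imp >= 0)))        # Property 1
print("Sum to 1:", bool(np.isclose(imp.sum(), 1.0)))  # Property 2
# Feature 0 and its copy (feature 5) carry identical information, yet the splits,
# and hence the importance, are divided between them arbitrarily; either one can
# end up near zero without being any less predictive (the Property 3 caveat).
```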
Property 4: Bias toward features with more splits
This is the critical limitation of impurity-based importance. Consider two features: Feature A is a continuous variable with hundreds of distinct values, offering many candidate split thresholds at every node; Feature B is a binary flag, offering exactly one possible split.
Even if Feature B is more predictive per split, Feature A will likely accumulate more importance simply because it has more splitting opportunities. This bias is especially severe for high-cardinality categorical features, continuous features with many unique values, and identifier-like noise features, which can almost always find some split that reduces training impurity even when they carry no real signal.
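A minimal sketch of this bias, using made-up data (the feature construction, noise level, and sample size are assumptions for illustration): a pure-noise continuous feature competes with a genuinely informative binary feature and still collects a noticeable share of MDI, simply because it offers many split points deep in the trees:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000
informative = rng.integers(0, 2, size=n)                 # binary, truly predictive
noise = rng.normal(size=n)                               # continuous, pure noise
y = (informative ^ (rng.random(n) < 0.1)).astype(int)    # informative signal + 10% label noise

X = np.column_stack([informative, noise])
rf = RandomForestClassifier(n_estimators=200, max_features=None,
                            random_state=0).fit(X, y)
print("MDI, binary informative feature:", round(rf.feature_importances_[0], 3))
print("MDI, continuous noise feature:  ", round(rf.feature_importances_[1], 3))
# The noise feature typically receives a non-trivial share of the importance:
# once the binary split is exhausted, only the noise feature can keep reducing
# training impurity, so fully grown trees overfit to it split after split.
```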
Property 5: Sensitivity to tree depth
Deeper trees use more features and distribute importance more evenly. Shallow trees concentrate importance in root-proximal features. This means importance rankings can change dramatically based on max_depth settings.
| Tree Depth | Splits Used | Importance Concentration | Typical Use Case |
|---|---|---|---|
| max_depth=1 | 1 | Single feature gets 100% | Baseline/sanity check |
| max_depth=3 | 7 | Top 3-5 features dominate | Interpretable models |
| max_depth=10 | ~1000 | Many features contribute | Balanced models |
| max_depth=None | ~n_samples | Importance spread widely including noise | Maximum accuracy |
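To make the depth effect concrete, here is a small sketch (the dataset and parameter values are illustrative assumptions) that counts how many features receive non-zero importance and how concentrated the top of the ranking is as max_depth grows:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=15, n_informative=6,
                           random_state=0)

for depth in [1, 3, 10, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X, y)
    imp = tree.feature_importances_
    n_used = int(np.sum(imp > 0))                 # features with any importance
    top_share = np.sort(imp)[::-1][:3].sum()      # share held by the top 3 features
    print(f"max_depth={str(depth):>4}: features used={n_used:>2}, "
          f"top-3 importance share={top_share:.2f}")
```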
Let's walk through a complete example of computing impurity-based importance for a small decision tree.
Dataset: 100 samples, 3 features, binary classification (50 positive, 50 negative)
Tree Structure:
```
Root (n=100, 50+/50-)
Split: Feature 0 ≤ 0.5
├── Node A (n=60, 45+/15-)
│   Split: Feature 1 ≤ 0.3
│   ├── Leaf (n=45, 40+/5-)
│   └── Leaf (n=15, 5+/10-)
└── Node B (n=40, 5+/35-)
    Split: Feature 2 ≤ 0.7
    ├── Leaf (n=14, 4+/10-)
    └── Leaf (n=26, 1+/25-)
```
Step 1: Compute impurities for all nodes
Using Gini impurity: $I = 1 - p_+^2 - p_-^2$
| Node | Samples | Class Distribution | Gini Impurity |
|---|---|---|---|
| Root | 100 | 50+/50- = (0.5, 0.5) | $1 - 0.5^2 - 0.5^2 = 0.500$ |
| A | 60 | 45+/15- = (0.75, 0.25) | $1 - 0.75^2 - 0.25^2 = 0.375$ |
| B | 40 | 5+/35- = (0.125, 0.875) | $1 - 0.125^2 - 0.875^2 = 0.219$ |
| A-Left | 45 | 40+/5- = (0.889, 0.111) | $1 - 0.889^2 - 0.111^2 = 0.198$ |
| A-Right | 15 | 5+/10- = (0.333, 0.667) | $1 - 0.333^2 - 0.667^2 = 0.444$ |
| B-Left | 14 | 4+/10- = (0.286, 0.714) | $1 - 0.286^2 - 0.714^2 = 0.408$ |
| B-Right | 26 | 1+/25- = (0.038, 0.962) | $1 - 0.038^2 - 0.962^2 = 0.074$ |
Step 2: Compute impurity decrease for each split
Root split (Feature 0): $$\Delta I_{root} = 0.500 - \frac{60}{100}(0.375) - \frac{40}{100}(0.219)$$ $$= 0.500 - 0.225 - 0.088 = 0.187$$
Weighted by samples: $100 \times 0.187 = 18.7$
Node A split (Feature 1): $$\Delta I_A = 0.375 - \frac{45}{60}(0.198) - \frac{15}{60}(0.444)$$ $$= 0.375 - 0.149 - 0.111 = 0.115$$
Weighted by samples: $60 \times 0.115 = 6.9$
Node B split (Feature 2): $$\Delta I_B = 0.219 - \frac{14}{40}(0.408) - \frac{26}{40}(0.074)$$ $$= 0.219 - 0.143 - 0.048 = 0.028$$
Weighted by samples: $40 \times 0.028 = 1.1$
Step 3: Aggregate by feature
| Feature | Nodes Used | Weighted Contribution | Normalized Importance |
|---|---|---|---|
| Feature 0 | Root | 18.7 | $18.7 / 26.7 = 0.700$ |
| Feature 1 | Node A | 6.9 | $6.9 / 26.7 = 0.258$ |
| Feature 2 | Node B | 1.1 | $1.1 / 26.7 = 0.042$ |
| Total | — | 26.7 | 1.000 |
Interpretation: Feature 0 (used at the root) accounts for 70% of total impurity reduction, making it the most important feature by this metric. Feature 2 contributes only 4.2%, suggesting it provides minimal additional information beyond what Features 0 and 1 capture.
Features selected at the root tend to have highest importance because: (1) they affect all samples, and (2) the root has maximum impurity to reduce. This is often appropriate—the root feature typically IS most informative—but can exaggerate the gap between top features and others.
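For readers who want to verify the arithmetic, here is a minimal Python transcription of the worked example; the printed values reproduce the tables above up to rounding:

```python
# Sanity check of the worked example: weighted impurity decreases per feature
def gini(p_pos):
    """Gini impurity of a binary node with positive-class proportion p_pos."""
    return 1 - p_pos**2 - (1 - p_pos)**2

weighted = {
    "Feature 0": 100 * (gini(0.50)  - 0.60 * gini(0.75)    - 0.40 * gini(0.125)),
    "Feature 1":  60 * (gini(0.75)  - 0.75 * gini(40 / 45) - 0.25 * gini(5 / 15)),
    "Feature 2":  40 * (gini(0.125) - 0.35 * gini(4 / 14)  - 0.65 * gini(1 / 26)),
}
total = sum(weighted.values())
for name, contrib in weighted.items():
    print(f"{name}: weighted decrease {contrib:5.2f}, normalized {contrib / total:.3f}")
# Expected output (up to rounding): 18.75 → 0.700, 6.94 → 0.259, 1.11 → 0.042
```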
When using impurity-based importance in practice, several considerations will help you extract reliable insights.
```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor


def analyze_feature_importance(X, y, feature_names=None, n_runs=5,
                               is_classifier=True, **rf_params):
    """
    Comprehensive impurity-based importance analysis with stability assessment.

    Args:
        X: Feature matrix
        y: Target vector
        feature_names: List of feature names
        n_runs: Number of random seed runs for stability
        is_classifier: True for classification, False for regression
        **rf_params: Additional Random Forest parameters

    Returns:
        DataFrame with importance statistics
    """
    n_features = X.shape[1]
    if feature_names is None:
        feature_names = [f"Feature_{i}" for i in range(n_features)]

    # Collect importances across multiple runs
    all_importances = []
    for run in range(n_runs):
        if is_classifier:
            rf = RandomForestClassifier(random_state=run, n_jobs=-1, **rf_params)
        else:
            rf = RandomForestRegressor(random_state=run, n_jobs=-1, **rf_params)
        rf.fit(X, y)
        all_importances.append(rf.feature_importances_)

    all_importances = np.array(all_importances)

    # Compute statistics
    results = pd.DataFrame({
        'feature': feature_names,
        'mean_importance': all_importances.mean(axis=0),
        'std_importance': all_importances.std(axis=0),
        'min_importance': all_importances.min(axis=0),
        'max_importance': all_importances.max(axis=0),
    })

    # Compute coefficient of variation (stability metric)
    results['cv'] = results['std_importance'] / results['mean_importance'].replace(0, np.inf)

    # Compute mean rank across runs (rank 1 = most important in that run)
    ranks = np.array([np.argsort(np.argsort(-imp)) + 1 for imp in all_importances])
    results['mean_rank'] = ranks.mean(axis=0)

    # Sort by mean importance
    results = results.sort_values('mean_importance', ascending=False).reset_index(drop=True)
    return results


# Example usage
if __name__ == "__main__":
    from sklearn.datasets import make_classification

    X, y = make_classification(
        n_samples=1000, n_features=20, n_informative=10,
        n_redundant=5, random_state=42
    )
    feature_names = [f"feature_{i}" for i in range(20)]

    results = analyze_feature_importance(
        X, y, feature_names=feature_names, n_runs=10,
        n_estimators=100, max_depth=10
    )

    print("Feature Importance Analysis")
    print("=" * 80)
    print(results.to_string(index=False))
    print()

    # Identify stable vs unstable features
    print("\nStability Analysis:")
    stable = results[results['cv'] < 0.2]
    unstable = results[results['cv'] >= 0.2]
    print(f"  Stable features (CV < 0.2): {len(stable)}")
    print(f"  Unstable features (CV >= 0.2): {len(unstable)}")
```

We've comprehensively examined impurity-based importance—the most common method for quantifying feature significance in tree-based models. Let's consolidate the key concepts:

- MDI credits each feature with the total impurity reduction it achieves across all of its splits, weighted by the number of samples reaching each node.
- Classification trees use Gini impurity or entropy; regression trees use variance (or MAE). The choice of criterion rarely changes which features rank highest.
- Per-tree importances are normalized to sum to 1, and ensembles average them across trees, which substantially stabilizes the estimates.
- The scores come for free at training time, but they are biased toward features with many possible splits and reflect training-time behavior rather than generalization.
What's next:
While impurity-based importance is convenient and fast, we've identified its fundamental limitation: it measures training-time behavior rather than true predictive contribution. The next page introduces Permutation Importance—a method that directly measures how each feature affects model predictions on held-out data, providing a more accurate picture of feature relevance for generalization.
You now understand the mathematical foundation, computation process, and practical considerations for impurity-based feature importance. You can correctly interpret MDI values, recognize their biases, and apply best practices for reliable feature analysis.