Decision trees hold a unique position in machine learning: they are simultaneously powerful predictive models and naturally interpretable decision rules. When a loan officer needs to explain why an application was denied, or when a physician must justify a diagnostic recommendation, tree-based models provide explanations that non-technical stakeholders can actually understand.
The tree metaphor is intuitive:
Every prediction follows a path from root to leaf through a series of yes/no questions. At each internal node, the model asks a question about a single feature ('Is income > $50,000?'). Based on the answer, it moves to the left or right child. This continues until reaching a leaf that provides the prediction.
This transparency made decision trees the backbone of early expert systems and continues to make them valuable in high-stakes domains like medicine, finance, and criminal justice where 'black box' models are unacceptable.
This page covers: (1) Visualizing individual decision trees with full interpretability, (2) Understanding tree structure and decision logic, (3) Feature importance in trees and ensembles, (4) Partial dependence and individual conditional expectation plots, (5) Interpreting Random Forests and Gradient Boosting, and (6) Limitations of tree interpretability at scale.
Understanding tree visualization requires knowing the components every tree contains; each element carries interpretive value.

Structural Components:
- Root node: the first split, applied to every sample.
- Internal nodes: subsequent yes/no questions, each on a single feature and threshold.
- Branches: the left (condition true, ≤ threshold) and right (condition false, > threshold) paths out of each split.
- Leaf nodes: terminal regions that hold the final prediction.

Information at Each Node:
- The splitting feature and threshold (internal nodes only).
- The impurity measure (Gini or entropy for classification, variance for regression).
- The number or proportion of training samples reaching the node.
- The class distribution (or mean target value) and the resulting prediction.
```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_text
import matplotlib.pyplot as plt

# Load classic Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names
class_names = iris.target_names

# Train a shallow tree (max_depth=3 for interpretability)
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X, y)

# Visualization Method 1: Matplotlib plot_tree
fig, ax = plt.subplots(figsize=(20, 10))
plot_tree(tree,
          feature_names=feature_names,
          class_names=class_names,
          filled=True,      # Color nodes by majority class
          rounded=True,     # Round node edges
          proportion=True,  # Show proportions not counts
          fontsize=10,
          ax=ax)
plt.title("Decision Tree for Iris Classification", fontsize=14)
plt.tight_layout()
plt.savefig("iris_tree.png", dpi=150)
plt.show()

# Visualization Method 2: Text representation
print("Text Decision Rules:")
print(export_text(tree, feature_names=feature_names))

# Understanding the output:
# |--- petal width (cm) <= 0.80
# |   |--- class: setosa
# |--- petal width (cm) >  0.80
# |   |--- petal width (cm) <= 1.75
# |   |   |--- class: versicolor
# |   |--- petal width (cm) >  1.75
# |   |   |--- class: virginica
```

The most powerful form of tree interpretability is extracting explicit decision rules. Each path from root to leaf defines a rule that can be expressed in natural language or formal logic.
Rule Structure:
IF (condition_1) AND (condition_2) AND ... AND (condition_n) THEN prediction = X
Each condition is a simple feature threshold comparison. The conjunction of all conditions along a path defines the 'region' of feature space that gets a particular prediction.
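To make this concrete, here is a minimal sketch that turns one such path into a plain Python predicate. The thresholds are taken from the export_text output shown above; in practice these rules are generated automatically, as in the extraction code that follows.

```python
# One extracted path from the iris tree, written as an explicit predicate.
# The region (0.80, 1.75] corresponds to the 'versicolor' leaf above.
def rule_versicolor(petal_width_cm: float) -> bool:
    return (petal_width_cm > 0.80) and (petal_width_cm <= 1.75)

print(rule_versicolor(1.3))  # True  -> this instance falls in the versicolor region
print(rule_versicolor(0.2))  # False -> outside this rule's region of feature space
```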
Why Rules Matter:
- They can be audited and validated by domain experts with no ML background.
- They translate directly into policy documents, compliance reports, and adverse-action notices.
- They make individual predictions actionable: a rejected applicant can see exactly which conditions were not met.
```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.datasets import load_iris

# Train tree
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=4, random_state=42)
tree.fit(iris.data, iris.target)

def extract_rules(tree, feature_names, class_names):
    """Extract all decision rules as human-readable strings."""
    tree_ = tree.tree_
    feature_name = [
        feature_names[i] if i != -2 else "undefined!"
        for i in tree_.feature
    ]
    rules = []

    def recurse(node, conditions):
        if tree_.feature[node] != -2:  # Not a leaf
            name = feature_name[node]
            threshold = tree_.threshold[node]
            # Left branch: <= threshold
            left_conditions = conditions + [f"{name} <= {threshold:.2f}"]
            recurse(tree_.children_left[node], left_conditions)
            # Right branch: > threshold
            right_conditions = conditions + [f"{name} > {threshold:.2f}"]
            recurse(tree_.children_right[node], right_conditions)
        else:  # Leaf node
            # Get class with most samples
            class_idx = np.argmax(tree_.value[node])
            class_label = class_names[class_idx]
            # Get confidence (purity)
            total = tree_.n_node_samples[node]
            correct = tree_.value[node].flatten()[class_idx]
            confidence = correct / total * 100
            rule = (f"IF {' AND '.join(conditions)} THEN {class_label} "
                    f"(confidence: {confidence:.1f}%, samples: {total})")
            rules.append(rule)

    recurse(0, [])
    return rules

# Extract and print all rules
rules = extract_rules(tree, iris.feature_names, iris.target_names)
print(f"Extracted {len(rules)} decision rules:\n")
for i, rule in enumerate(rules, 1):
    print(f"Rule {i}: {rule}\n")

# For individual prediction explanation
def explain_prediction(tree, X_instance, feature_names, class_names):
    """Trace the decision path for a single prediction."""
    node_indicator = tree.decision_path([X_instance])
    leaf_id = tree.apply([X_instance])[0]
    node_path = node_indicator.indices[node_indicator.indptr[0]:node_indicator.indptr[1]]

    explanation = []
    for node_id in node_path[:-1]:  # Exclude leaf
        feature_idx = tree.tree_.feature[node_id]
        threshold = tree.tree_.threshold[node_id]
        feature_val = X_instance[feature_idx]
        if feature_val <= threshold:
            direction = "<="
        else:
            direction = ">"
        explanation.append(
            f"{feature_names[feature_idx]} = {feature_val:.2f} {direction} {threshold:.2f}"
        )

    prediction = class_names[tree.predict([X_instance])[0]]
    return f"Prediction: {prediction}\nPath: " + " → ".join(explanation)

# Example: Explain a specific prediction
sample = iris.data[100]  # A random sample
print("\n" + "="*60)
print("Sample Explanation:")
print(explain_prediction(tree, sample, iris.feature_names, iris.target_names))
```

Extracted rules are excellent for stakeholder communication. A credit risk model might produce: 'IF credit_score > 700 AND debt_to_income ≤ 0.40 AND employment_years > 2 THEN approve (confidence: 94%)'. This is far more actionable than 'the model output was 0.87'.
Beyond visualizing individual paths, we often want to understand the global importance of each feature across the entire tree or ensemble. Tree-based models provide natural importance measures based on how much each feature contributes to reducing impurity.
Impurity-Based Importance (Gini/Entropy Importance):
For each feature, sum the weighted impurity decrease across all nodes where that feature is used for splitting:
$$\text{Importance}(f) = \sum_{t \in T_f} \frac{n_t}{n} \cdot \Delta\text{impurity}_t$$
where $T_f$ is the set of nodes splitting on feature $f$, $n_t$ is samples at node $t$, $n$ is total samples, and $\Delta\text{impurity}$ is the impurity decrease from the split.
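As a sanity check on this formula, the sketch below recomputes impurity importance directly from scikit-learn's `tree_` arrays and compares it to the built-in `feature_importances_` (which additionally normalizes the importances to sum to 1).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(iris.data, iris.target)
t = clf.tree_

# Sum the weighted impurity decrease contributed by every internal node, per feature
importances = np.zeros(iris.data.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # leaf: no split, no contribution
        continue
    decrease = (t.weighted_n_node_samples[node] * t.impurity[node]
                - t.weighted_n_node_samples[left] * t.impurity[left]
                - t.weighted_n_node_samples[right] * t.impurity[right])
    importances[t.feature[node]] += decrease / t.weighted_n_node_samples[0]

importances /= importances.sum()  # scikit-learn normalizes importances to sum to 1
print(np.allclose(importances, clf.feature_importances_))  # expected: True
```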
Limitations of Impurity Importance:
- It is computed on the training data, so it can reward features the model has overfit to.
- It is biased toward high-cardinality and continuous features, which offer more candidate split points.
- Correlated features split the credit arbitrarily; one can look important while its equally informative partner looks useless.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.datasets import fetch_california_housing

# Load data (California housing; the classic Boston dataset is deprecated and removed)
housing = fetch_california_housing()
X, y = housing.data, housing.target
feature_names = housing.feature_names

# Train Random Forest
rf = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X, y)

# Method 1: Built-in Gini/Impurity Importance
impurity_importance = pd.DataFrame({
    'feature': feature_names,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

print("Impurity-Based Feature Importance:")
print(impurity_importance.to_string(index=False))

# Method 2: Permutation Importance (more reliable)
perm_importance = permutation_importance(rf, X, y, n_repeats=30,
                                          random_state=42, n_jobs=-1)

perm_df = pd.DataFrame({
    'feature': feature_names,
    'importance_mean': perm_importance.importances_mean,
    'importance_std': perm_importance.importances_std
}).sort_values('importance_mean', ascending=False)

print("\nPermutation Feature Importance:")
print(perm_df.to_string(index=False))

# Visualization: Compare both methods
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Impurity importance
ax1 = axes[0]
impurity_sorted = impurity_importance.sort_values('importance', ascending=True)
ax1.barh(impurity_sorted['feature'], impurity_sorted['importance'], color='steelblue')
ax1.set_xlabel('Importance')
ax1.set_title('Impurity-Based Importance')

# Permutation importance
ax2 = axes[1]
perm_sorted = perm_df.sort_values('importance_mean', ascending=True)
ax2.barh(perm_sorted['feature'], perm_sorted['importance_mean'],
         xerr=perm_sorted['importance_std'], color='darkorange')
ax2.set_xlabel('Mean Accuracy Decrease')
ax2.set_title('Permutation Importance')

plt.tight_layout()
plt.savefig('feature_importance_comparison.png', dpi=150)
plt.show()
```

| Method | Computation | Strengths | Weaknesses |
|---|---|---|---|
| Impurity (Gini) | Sum of weighted impurity decreases | Fast, built-in to training | Biased toward high-cardinality, correlated features |
| Permutation | Performance drop when feature shuffled | Model-agnostic, less biased | Slower, requires test set |
| Drop-Column | Retrain without each feature | Most accurate for importance | Very slow (retrain per feature) |
| SHAP TreeExplainer | Game-theoretic attribution | Accounts for interactions | Computationally intensive for large datasets |
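The table lists drop-column importance, which the code above does not demonstrate. Here is a minimal sketch of that approach (the model size and feature handling are illustrative), using a held-out split so the retrained models are compared fairly.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, random_state=42)

base = RandomForestRegressor(n_estimators=50, random_state=42, n_jobs=-1)
baseline_r2 = base.fit(X_train, y_train).score(X_test, y_test)
print(f"Baseline R² with all features: {baseline_r2:.4f}")

# Drop-column importance: retrain without each feature and measure the R² drop
for i, name in enumerate(housing.feature_names):
    model = clone(base).fit(np.delete(X_train, i, axis=1), y_train)
    drop = baseline_r2 - model.score(np.delete(X_test, i, axis=1), y_test)
    print(f"{name:<12} drop-column importance = {drop:+.4f}")
```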
Feature importance tells you THAT a feature matters, not HOW it affects predictions. A feature with high importance might increase or decrease the prediction depending on its value. Use partial dependence plots to understand the direction and shape of the relationship.
While feature importance tells us which features matter, Partial Dependence Plots (PDPs) show us how features affect predictions. PDPs visualize the marginal effect of one or two features on the predicted outcome.
Mathematical Definition:
For a feature $x_s$ (or set of features), the partial dependence function is:
$$\hat{f}_s(x_s) = \mathbb{E}_{x_c}\left[\hat{f}(x_s, x_c)\right] \approx \frac{1}{n}\sum_{i=1}^{n} \hat{f}(x_s, x_c^{(i)})$$
where $x_c$ represents all other features. We average predictions over all observed values of other features, creating a function of just $x_s$.
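The averaging in this definition is easy to implement by hand. The sketch below uses the same California housing setup as the later examples (with a smaller model for speed): it holds MedInc fixed at each grid value, keeps every instance's other features as observed, and averages the predictions.

```python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor

housing = fetch_california_housing()
X, y = housing.data, housing.target
gbm = GradientBoostingRegressor(n_estimators=50, random_state=42).fit(X, y)

# Manual partial dependence for feature 0 (MedInc): fix x_s at each grid value,
# keep the observed x_c^(i) for every instance, and average the predictions
grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 10)
for v in grid:
    X_mod = X.copy()
    X_mod[:, 0] = v
    print(f"MedInc = {v:5.2f}  ->  average prediction = {gbm.predict(X_mod).mean():.3f}")
```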
Interpretation:
The PDP shows the average prediction as the feature varies. A positive slope indicates increasing the feature increases predictions. Non-linear shapes reveal non-linear relationships captured by the model.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

# Load data
housing = fetch_california_housing()
X, y = housing.data, housing.target
feature_names = housing.feature_names

# Train model
gbm = GradientBoostingRegressor(n_estimators=100, max_depth=4, random_state=42)
gbm.fit(X, y)

# 1D Partial Dependence Plots for top features
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

features_to_plot = ['MedInc', 'AveRooms', 'HouseAge', 'Latitude']
feature_indices = [list(feature_names).index(f) for f in features_to_plot]

for idx, (feature_idx, ax) in enumerate(zip(feature_indices, axes.flat)):
    PartialDependenceDisplay.from_estimator(
        gbm, X, [feature_idx],
        feature_names=feature_names,
        ax=ax,
        line_kw={"color": "steelblue", "linewidth": 2}
    )
    ax.set_title(f"PDP for {features_to_plot[idx]}", fontsize=12)

plt.suptitle("Partial Dependence Plots", fontsize=14)
plt.tight_layout()
plt.savefig("pdp_1d.png", dpi=150)
plt.show()

# 2D Partial Dependence (interaction visualization)
fig, ax = plt.subplots(figsize=(10, 8))

PartialDependenceDisplay.from_estimator(
    gbm, X, [(0, 1)],  # MedInc vs AveRooms interaction
    feature_names=feature_names,
    ax=ax,
    kind='average'  # 2-way PDPs only support the averaged (contour) plot
)
plt.title("2D PDP: Median Income vs Average Rooms")
plt.tight_layout()
plt.savefig("pdp_2d.png", dpi=150)
plt.show()

# Interpretation guide:
# - Flat line: Feature has little effect
# - Monotonic increase: Higher values → higher predictions
# - Non-linear curve: Effect varies across feature range
# - 2D interactions: Effect of one feature depends on another
```

Partial Dependence Plots show the average effect, but this average can hide important heterogeneity. What if the effect of a feature differs substantially across different subgroups of the data?
ICE Plots to the Rescue:
Individual Conditional Expectation plots show one curve per instance (or a sample of instances). Instead of averaging, we see how the prediction for each individual changes as we vary the feature.
$$\text{ICE curve for instance } i: \hat{f}(x_s, x_c^{(i)})$$
The PDP is simply the average of all ICE curves.
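This relationship can be verified numerically with scikit-learn's `partial_dependence`. The sketch below assumes a recent scikit-learn version (where the result exposes `individual`, `average`, and `grid_values`) and uses a subset of rows to keep it fast.

```python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import partial_dependence

housing = fetch_california_housing()
X, y = housing.data[:500], housing.target[:500]
gbm = GradientBoostingRegressor(n_estimators=50, random_state=42).fit(X, y)

# kind='both' returns the per-instance ICE curves and their average (the PDP)
res = partial_dependence(gbm, X, [0], kind='both', grid_resolution=25)
ice = res['individual'][0]  # shape: (n_instances, n_grid_points)
pdp = res['average'][0]     # shape: (n_grid_points,)
print(np.allclose(ice.mean(axis=0), pdp))  # expected: True
```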
What ICE Reveals:
- Heterogeneous effects: curves with different slopes mean the feature affects different instances differently.
- Interactions: crossing or diverging curves signal that the feature's effect depends on other features.
- Subgroups: bundles of parallel curves at different levels often correspond to distinct segments of the data.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

# Load and train
housing = fetch_california_housing()
X, y = housing.data, housing.target
feature_names = list(housing.feature_names)

gbm = GradientBoostingRegressor(n_estimators=100, max_depth=4, random_state=42)
gbm.fit(X, y)

# ICE plots with varying heterogeneity
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Feature with homogeneous effect (parallel curves)
PartialDependenceDisplay.from_estimator(
    gbm, X, [feature_names.index('MedInc')],
    feature_names=feature_names,
    kind='both',      # Show both PDP (thick) and ICE (thin)
    subsample=100,    # Sample 100 ICE curves for clarity
    ax=axes[0],
    ice_lines_kw={'alpha': 0.3, 'color': 'steelblue'},
    pd_line_kw={'color': 'red', 'linewidth': 3}
)
axes[0].set_title("Homogeneous Effect: Median Income\n(Parallel ICE curves)", fontsize=12)

# Feature with heterogeneous effect (crossing curves)
PartialDependenceDisplay.from_estimator(
    gbm, X, [feature_names.index('Latitude')],
    feature_names=feature_names,
    kind='both',
    subsample=100,
    ax=axes[1],
    ice_lines_kw={'alpha': 0.3, 'color': 'darkorange'},
    pd_line_kw={'color': 'red', 'linewidth': 3}
)
axes[1].set_title("Heterogeneous Effect: Latitude\n(Crossing ICE curves = interactions)", fontsize=12)

plt.tight_layout()
plt.savefig("ice_plots.png", dpi=150)
plt.show()

# Centered ICE (c-ICE) for clearer comparison
def plot_centered_ice(model, X, feature_idx, feature_names, n_samples=100, ax=None):
    """Plot centered ICE curves anchored at first value."""
    from sklearn.inspection import partial_dependence
    result = partial_dependence(
        model, X, [feature_idx], kind='individual', grid_resolution=50
    )
    ice_curves = result['individual'][0]
    grid_values = result['grid_values'][0]

    # Sample instances
    np.random.seed(42)
    sample_idx = np.random.choice(len(ice_curves), min(n_samples, len(ice_curves)), replace=False)

    if ax is None:
        fig, ax = plt.subplots(figsize=(8, 6))

    # Center: subtract first value from each curve
    for i in sample_idx:
        centered = ice_curves[i] - ice_curves[i][0]
        ax.plot(grid_values, centered, alpha=0.2, color='steelblue')

    # Average centered curve
    avg_centered = np.mean(ice_curves - ice_curves[:, [0]], axis=0)
    ax.plot(grid_values, avg_centered, color='red', linewidth=3, label='Mean c-ICE')

    ax.axhline(y=0, color='gray', linestyle='--', alpha=0.5)
    ax.set_xlabel(feature_names[feature_idx])
    ax.set_ylabel('Prediction change from baseline')
    ax.set_title(f"Centered ICE Plot: {feature_names[feature_idx]}")
    ax.legend()
    return ax

fig, ax = plt.subplots(figsize=(8, 6))
plot_centered_ice(gbm, X, 0, feature_names, ax=ax)
plt.tight_layout()
plt.savefig("centered_ice.png", dpi=150)
plt.show()
```

Start with ICE plots to check for heterogeneity. If all curves are parallel, the PDP accurately represents the relationship for all instances. If curves cross or diverge significantly, the PDP average may be misleading; either stratify your analysis or investigate interactions.
Individual decision trees are highly interpretable, but ensemble methods like Random Forests and Gradient Boosting combine hundreds or thousands of trees. This creates a tension: ensembles are more accurate but less transparent.
The Interpretability-Accuracy Tradeoff: a single shallow tree can be read end to end but typically sacrifices accuracy; an ensemble of hundreds of deeper trees predicts better but no longer offers one traceable decision path.

Strategies for Ensemble Interpretability:
- Aggregate views: feature importance (preferably permutation-based), PDPs, and ICE plots.
- Representative trees: visualize a few individual trees, truncated to their top levels, to see the kinds of rules being learned.
- Global surrogates: approximate the ensemble with a single interpretable tree and report its fidelity (covered below).
- Attribution methods: SHAP's TreeExplainer for attributions that account for interactions.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.tree import plot_tree

# Load data
housing = fetch_california_housing()
X, y = housing.data, housing.target
feature_names = housing.feature_names

# Train both ensemble types
rf = RandomForestRegressor(n_estimators=100, max_depth=6, random_state=42)
rf.fit(X, y)

gbm = GradientBoostingRegressor(n_estimators=100, max_depth=4, random_state=42)
gbm.fit(X, y)

# Visualize individual trees from RF (sample)
fig, axes = plt.subplots(2, 2, figsize=(20, 16))
for idx, ax in enumerate(axes.flat):
    tree_idx = idx * 25  # Sample trees 0, 25, 50, 75
    plot_tree(rf.estimators_[tree_idx],
              feature_names=feature_names,
              max_depth=3,  # Show only top 3 levels
              filled=True,
              ax=ax,
              fontsize=8)
    ax.set_title(f"Random Forest Tree #{tree_idx}", fontsize=11)

plt.suptitle("Sample Trees from Random Forest (showing depth=3)", fontsize=14)
plt.tight_layout()
plt.savefig("rf_sample_trees.png", dpi=150)
plt.show()

# Compare feature importance stability across trees
def analyze_importance_stability(rf, feature_names):
    """Analyze variance in feature importance across RF trees."""
    all_importances = np.array([tree.feature_importances_ for tree in rf.estimators_])

    print("Feature Importance Stability:\n")
    print(f"{'Feature':<15} {'Mean':>8} {'Std':>8} {'CV':>8}")
    print("-" * 45)
    for i, name in enumerate(feature_names):
        mean = all_importances[:, i].mean()
        std = all_importances[:, i].std()
        cv = std / mean if mean > 0 else 0
        print(f"{name:<15} {mean:>8.4f} {std:>8.4f} {cv:>8.2f}")

    return all_importances

print("\n" + "="*50)
importances = analyze_importance_stability(rf, feature_names)

# Visualize importance distribution across trees
fig, ax = plt.subplots(figsize=(12, 6))
positions = np.arange(len(feature_names))
bp = ax.boxplot(importances, positions=positions, vert=True)
ax.set_xticks(positions)
ax.set_xticklabels(feature_names, rotation=45, ha='right')
ax.set_ylabel('Feature Importance')
ax.set_title('Distribution of Feature Importance Across 100 RF Trees')
plt.tight_layout()
plt.savefig("importance_stability.png", dpi=150)
plt.show()
```

When an ensemble model is too complex to interpret directly, we can train a simpler surrogate model to approximate its behavior. The surrogate is interpretable, and if it achieves high fidelity to the original model, its explanations transfer to the complex model.
Surrogate Model Process:
1. Train the complex model (e.g., a gradient boosting ensemble) on the original data.
2. Generate the complex model's predictions on a representative dataset.
3. Fit an interpretable model (here, a shallow decision tree) to those predictions rather than to the true labels.
4. Measure fidelity: how closely the surrogate reproduces the complex model's outputs (e.g., R² between the two sets of predictions).
5. Interpret the surrogate only if fidelity is acceptably high.

Important Caveats:
- The surrogate explains the model's behavior, not the underlying data-generating process.
- Fidelity must accompany every surrogate-based explanation; a poorly fitting surrogate misleads.
- A surrogate that matches the ensemble well on average can still be wrong in specific regions of feature space.
```python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor, plot_tree, export_text
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt

# Load data and train complex model
housing = fetch_california_housing()
X, y = housing.data, housing.target
feature_names = housing.feature_names

# Complex model
gbm = GradientBoostingRegressor(n_estimators=200, max_depth=6, random_state=42)
gbm.fit(X, y)

# Get complex model predictions
gbm_predictions = gbm.predict(X)

print(f"Complex model R² on true labels: {r2_score(y, gbm_predictions):.4f}")

# Train surrogate models of varying complexity
surrogates = {}
for depth in [2, 3, 4, 5, 6, 8]:
    surrogate = DecisionTreeRegressor(max_depth=depth, random_state=42)
    surrogate.fit(X, gbm_predictions)  # Train on GBM predictions, not true labels!

    surrogate_pred = surrogate.predict(X)
    fidelity = r2_score(gbm_predictions, surrogate_pred)
    surrogates[depth] = {'model': surrogate, 'fidelity': fidelity}
    print(f"Depth {depth} surrogate - Fidelity (R²): {fidelity:.4f}")

# Visualize best surrogate (balance interpretability and fidelity)
best_depth = 4  # Choose based on acceptable fidelity
best_surrogate = surrogates[best_depth]['model']

fig, ax = plt.subplots(figsize=(20, 12))
plot_tree(best_surrogate,
          feature_names=feature_names,
          filled=True,
          rounded=True,
          fontsize=10,
          ax=ax)
ax.set_title(f"Global Surrogate Tree (depth={best_depth}, "
             f"fidelity={surrogates[best_depth]['fidelity']:.3f})")
plt.tight_layout()
plt.savefig("surrogate_tree.png", dpi=150)
plt.show()

# Extract rules from surrogate
print("\n" + "="*60)
print("Surrogate Model Decision Rules:")
print("="*60)
print(export_text(best_surrogate, feature_names=list(feature_names)))

# Compare importance between complex model and surrogate
print("\n" + "="*60)
print("Feature Importance Comparison:")
print("="*60)
print(f"{'Feature':<15} {'GBM':>10} {'Surrogate':>12} {'Diff':>8}")
print("-" * 48)

for i, name in enumerate(feature_names):
    gbm_imp = gbm.feature_importances_[i]
    surr_imp = best_surrogate.feature_importances_[i]
    diff = abs(gbm_imp - surr_imp)
    print(f"{name:<15} {gbm_imp:>10.4f} {surr_imp:>12.4f} {diff:>8.4f}")
```

Never trust a surrogate with low fidelity. If the surrogate captures only 60% of the variance in the complex model's predictions, its explanations can diverge from the model's actual behavior in ways you cannot predict. Aim for R² > 0.9 for global surrogates, or use local surrogates (like LIME) for individual predictions.
Tree-based interpretability is powerful but has fundamental limitations that practitioners must understand:
Inherent Limitations:
Depth vs Interpretability Tradeoff: Deep trees are accurate but uninterpretable. Visualizing a depth-20 tree with thousands of leaves is meaningless.
Axis-Aligned Splits: Trees can only split perpendicular to feature axes. They approximate diagonal decision boundaries with many splits, creating jagged approximations that don't reflect the true underlying pattern.
Instability: Small changes in training data can produce completely different tree structures. Two nearly identical datasets might produce different split orders and feature importance rankings (see the short demonstration after this list).
Feature Importance Fallacies: Correlated features share importance arbitrarily. A 'zero importance' feature might matter greatly if its correlated partner was selected instead.
Ensemble Opacity: While individual trees are interpretable, 500 trees voting or summing contributions aren't. Importance measures and PDPs provide aggregate views but lose the granular traceability of single trees.
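To make the instability point concrete, the sketch below fits the same tree specification on two bootstrap resamples of the California housing data and prints the top of each tree. The exact output depends on the resamples, but thresholds and lower-level splits typically differ between the two fits even though the data distributions are nearly identical.

```python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor, export_text

housing = fetch_california_housing()
X, y = housing.data, housing.target
feature_names = list(housing.feature_names)

# Fit identical tree specifications on two bootstrap resamples and compare
for seed in (1, 2):
    rng = np.random.RandomState(seed)
    idx = rng.choice(len(X), size=len(X), replace=True)
    t = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X[idx], y[idx])
    print(f"--- Resample {seed} ---")
    print(export_text(t, feature_names=feature_names, max_depth=2))
```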
| Situation | Best Practice | Rationale |
|---|---|---|
| Single Decision Tree | Limit depth to 4-6, visualize entire tree | Full interpretability with manageable complexity |
| Random Forest | Use aggregate importance + PDPs, visualize sample trees | Individual trees unrepresentative due to randomization |
| Gradient Boosting | PDPs + SHAP, visualize early trees only (see sketch below) | Later trees capture residual patterns, not main effects |
| High-stakes decisions | Train interpretable surrogate for explanations | Regulatory and trust requirements often need rule-based explanations |
| Feature importance | Use permutation importance over Gini | Less biased, works on held-out data |
| Detecting interactions | Use ICE plots, 2D PDPs, or SHAP interaction values | Aggregate importance misses interaction effects |
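The table recommends inspecting only the early trees of a gradient boosting model. Here is a minimal sketch of how to pull out and plot the first boosting stage with scikit-learn; the model settings are illustrative, and `estimators_` is indexed by stage and output.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import plot_tree

housing = fetch_california_housing()
gbm = GradientBoostingRegressor(n_estimators=100, max_depth=3,
                                random_state=42).fit(housing.data, housing.target)

# estimators_ has shape (n_stages, n_outputs); each entry is a small
# DecisionTreeRegressor fit to the residuals at that boosting stage
first_tree = gbm.estimators_[0, 0]
fig, ax = plt.subplots(figsize=(16, 8))
plot_tree(first_tree, feature_names=housing.feature_names,
          filled=True, rounded=True, fontsize=9, ax=ax)
ax.set_title("First boosting stage: captures the dominant structure")
plt.tight_layout()
plt.show()
```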
Tree-based models offer unique interpretability that bridges machine learning predictions and human understanding. To consolidate the key insights:
- A single shallow tree can be visualized in full and converted into explicit IF-THEN rules.
- Prefer permutation importance (or SHAP) over impurity importance, which is biased and computed on training data.
- PDPs show a feature's average effect; ICE plots reveal whether that average hides heterogeneity or interactions.
- Ensembles trade transparency for accuracy; interpret them with aggregate importance, PDP/ICE plots, sample trees, or a high-fidelity global surrogate.
- All tree explanations inherit the limitations of axis-aligned splits, instability, and importance sharing among correlated features.
What's Next:
While trees offer intuitive interpretability, modern deep learning models present different challenges. In the next page, we'll explore Attention Visualization—how to interpret transformer-based models and other attention mechanisms by visualizing where models 'look' when making predictions.
You now have a comprehensive toolkit for interpreting tree-based models. Whether working with a simple decision tree for regulatory compliance or diagnosing a gradient boosting ensemble, you can extract meaningful insights about feature effects, importance, and decision logic.