In an era of black-box models—deep neural networks with millions of parameters, ensemble methods combining hundreds of trees—decision trees stand apart as inherently interpretable. A decision tree doesn't just predict; it explains. Every prediction comes with a complete logical justification: a sequence of feature-based conditions that anyone can follow.
This interpretability is not a minor convenience—it's often a requirement. In healthcare, a physician needs to understand why a model recommends a treatment. In finance, regulations demand explainable credit decisions. In criminal justice, defendants have the right to understand risk assessments. In safety-critical systems, engineers must verify that a model's logic is sound.
Decision trees offer something precious: the ability to open the model and see exactly what it learned. This transparency enables debugging, trust-building, knowledge extraction, and regulatory compliance. Understanding how to leverage this interpretability—and when complexity threatens it—is essential for effective tree-based modeling.
By the end of this page, you will understand: (1) why trees are inherently interpretable, (2) how to extract decision rules as human-readable logic, (3) visualization techniques for tree structure, (4) feature importance measures, (5) the complexity-interpretability tradeoff, and (6) when interpretability matters most.
Decision trees possess several properties that enable human understanding:
Humans naturally reason through sequences of yes/no questions. "Is the patient over 65? If yes, do they have diabetes? If yes, what's their blood pressure?" Decision trees mirror this cognitive process exactly.
Each path through the tree is a logical chain that humans can follow step-by-step. Unlike a neural network where thousands of weights combine in opaque ways, a tree's reasoning is laid out explicitly.
Each decision node tests one feature against one threshold: $$\text{Is } x_j \leq \theta \text{?}$$
This atomic structure is comprehensible. "Is age ≤ 50?" makes sense. "Is $0.034 \cdot \text{age} + 0.019 \cdot \text{income} - 0.087 \cdot \text{debt_ratio} \leq 0.412$?" does not.
Features that appear higher in the tree (closer to the root) are more important for prediction. This hierarchy conveys importance naturally: splits near the root affect every sample, while deeper splits apply only to small subsets.
For any prediction, the tree provides a complete audit trail: the exact path taken from root to leaf, the condition tested at each node, and the leaf statistics behind the output. No other information is hidden. The prediction is fully determined by these visible components.
Trees have a natural graphical form. Unlike a matrix of weights, a tree can be drawn as a diagram that intuitively shows structure. Visualization is a powerful interpretability tool that trees support natively.
Decision trees are 'intrinsically interpretable'—their structure directly reveals their logic. This contrasts with 'post-hoc interpretability' methods like LIME or SHAP that approximate a black-box model with an interpretable explanation. Intrinsic interpretability is generally preferred when available because the explanation perfectly matches the model's actual behavior.
Every decision tree can be converted into a set of IF-THEN rules. This provides maximum transparency: the model's logic is expressed in plain logical statements.
Each root-to-leaf path becomes one rule:
IF (condition_1) AND (condition_2) AND ... AND (condition_n)
THEN predict: [class/value]
The conditions are the split decisions along the path. The prediction is the leaf's output.
Consider a tree for credit approval:
```
                 [income > 50k?]
                /               \
             Yes                 No
             /                     \
  [debt_ratio ≤ 0.3?]      [credit_score > 650?]
     /          \              /          \
   Yes           No          Yes           No
    |             |           |             |
 APPROVE       REVIEW      REVIEW        REJECT
```
Extracted rules:
```
Rule 1: IF income > 50k AND debt_ratio ≤ 0.3
        THEN APPROVE
        [Confidence: 95%, Support: 23%]

Rule 2: IF income > 50k AND debt_ratio > 0.3
        THEN REVIEW
        [Confidence: 70%, Support: 15%]

Rule 3: IF income ≤ 50k AND credit_score > 650
        THEN REVIEW
        [Confidence: 65%, Support: 28%]

Rule 4: IF income ≤ 50k AND credit_score ≤ 650
        THEN REJECT
        [Confidence: 92%, Support: 34%]
```
```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
import numpy as np

def extract_rules(tree, feature_names, class_names):
    """Extract human-readable rules from a decision tree."""
    tree_ = tree.tree_
    feature_name = [
        feature_names[i] if i != -2 else "undefined!"
        for i in tree_.feature
    ]
    rules = []

    def recurse(node, conditions):
        if tree_.feature[node] != -2:  # Internal node
            name = feature_name[node]
            threshold = tree_.threshold[node]
            # Left branch: feature <= threshold
            left_condition = f"{name} <= {threshold:.2f}"
            recurse(tree_.children_left[node], conditions + [left_condition])
            # Right branch: feature > threshold
            right_condition = f"{name} > {threshold:.2f}"
            recurse(tree_.children_right[node], conditions + [right_condition])
        else:  # Leaf node
            # Get class prediction and confidence
            class_counts = tree_.value[node][0]
            total = sum(class_counts)
            predicted_class = int(np.argmax(class_counts))
            confidence = class_counts[predicted_class] / total
            rule = {
                'conditions': conditions,
                'prediction': class_names[predicted_class],
                'confidence': confidence,
                # n_node_samples is a true count regardless of sklearn version
                'support': int(tree_.n_node_samples[node]),
                'class_distribution': dict(zip(class_names, class_counts))
            }
            rules.append(rule)

    recurse(0, [])
    return rules

def format_rules(rules, total_samples):
    """Format rules as human-readable strings."""
    output = []
    for i, rule in enumerate(rules, 1):
        conditions = " AND ".join(rule['conditions'])
        support_pct = rule['support'] / total_samples * 100
        text = f"""Rule {i}:
  IF {conditions}
  THEN predict: {rule['prediction']}
  Confidence: {rule['confidence']:.1%}
  Support: {support_pct:.1f}% ({rule['support']} samples)
  Distribution: {rule['class_distribution']}"""
        output.append(text)
    return "\n\n".join(output)

# Example usage
iris = load_iris()
X, y = iris.data, iris.target

tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X, y)

rules = extract_rules(tree, iris.feature_names, iris.target_names)
formatted = format_rules(rules, len(y))

print("="*60)
print("EXTRACTED DECISION RULES")
print("="*60)
print(formatted)

# Show rule complexity statistics
print("="*60)
print("RULE STATISTICS")
print("="*60)
print(f"Total rules: {len(rules)}")
print(f"Average conditions per rule: {np.mean([len(r['conditions']) for r in rules]):.1f}")
print(f"Max conditions (most complex): {max(len(r['conditions']) for r in rules)}")
print(f"Min confidence: {min(r['confidence'] for r in rules):.1%}")
print(f"Max confidence: {max(r['confidence'] for r in rules):.1%}")
```

Confidence (Precision): What fraction of training samples in this leaf have the predicted class? $$\text{Confidence} = \frac{n_{\text{correct}}}{n_{\text{total in leaf}}}$$
Support (Coverage): What fraction of all training samples reach this leaf? $$\text{Support} = \frac{n_{\text{in leaf}}}{N}$$
Lift: How much better is the rule's precision compared to the base rate? $$\text{Lift} = \frac{\text{Confidence}}{P(\text{class})}$$
Rules with high confidence and reasonable support are the most actionable.
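These three metrics can be computed directly from a leaf's class counts. The sketch below uses hypothetical counts (a made-up leaf with 93/5/2 samples per class, out of 1,000 training samples); the helper name `rule_metrics` is illustrative, not a library function:

```python
def rule_metrics(leaf_counts, predicted_idx, class_totals, n_total):
    """Compute confidence, support, and lift for one extracted rule."""
    n_leaf = sum(leaf_counts)
    confidence = leaf_counts[predicted_idx] / n_leaf    # precision within the leaf
    support = n_leaf / n_total                          # coverage of the rule
    base_rate = class_totals[predicted_idx] / n_total   # P(class) over all samples
    lift = confidence / base_rate                       # improvement over base rate
    return confidence, support, lift

# Hypothetical leaf: 93 APPROVE, 5 REVIEW, 2 REJECT; 400 APPROVEs overall
conf, sup, lift = rule_metrics([93, 5, 2], 0, [400, 350, 250], 1000)
print(f"Confidence={conf:.1%}, Support={sup:.1%}, Lift={lift:.2f}")
```

A lift above 1 means the rule identifies the class more precisely than guessing the base rate; here the rule is roughly 2.3x better than chance.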
Visual representations of trees leverage human spatial reasoning to convey structure at a glance.
The classic visualization shows each node as a box annotated with its split condition, impurity, sample count, and class distribution. Common conventions:
- Color coding: nodes are filled by majority class, with color intensity indicating purity.
- Size scaling: sample counts or proportions show how much data each node handles.
- Information display: the split condition, impurity value, and class counts appear inside each box.
```python
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_text
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

# Train a tree for visualization
iris = load_iris()
X, y = iris.data, iris.target

tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X, y)

# Method 1: Matplotlib visualization
plt.figure(figsize=(20, 10))
plot_tree(tree,
          feature_names=iris.feature_names,
          class_names=iris.target_names,
          filled=True,      # Color by class
          rounded=True,     # Rounded boxes
          proportion=True,  # Show proportions
          fontsize=10)
plt.title("Decision Tree Visualization (Iris Dataset)")
plt.tight_layout()
plt.savefig('tree_visualization.png', dpi=150, bbox_inches='tight')
plt.show()

# Method 2: Text representation (useful for logging/documentation)
print("="*60)
print("TEXT REPRESENTATION")
print("="*60)
text_repr = export_text(tree,
                        feature_names=iris.feature_names,
                        class_names=list(iris.target_names))
print(text_repr)

# Method 3: Graphviz DOT format (for high-quality vector graphics)
from sklearn.tree import export_graphviz

dot_data = export_graphviz(
    tree,
    out_file=None,
    feature_names=iris.feature_names,
    class_names=iris.target_names,
    filled=True,
    rounded=True,
    special_characters=True)

# If graphviz is installed, this creates a PDF/PNG
try:
    import graphviz
    graph = graphviz.Source(dot_data)
    graph.render("tree_graphviz", format='pdf', cleanup=True)
    print("Graphviz PDF saved as 'tree_graphviz.pdf'")
except Exception:
    print("Graphviz not installed - DOT format available for external rendering")
    # DOT data can be used with online Graphviz viewers

# Method 4: Custom detailed view
print("\n" + "="*60)
print("DETAILED NODE ANALYSIS")
print("="*60)
tree_struct = tree.tree_
for node_id in range(tree_struct.node_count):
    is_leaf = tree_struct.feature[node_id] == -2
    # Walk up to the root to determine depth (children always have
    # larger ids than their parent in sklearn's preorder numbering)
    depth = 0
    parent = node_id
    while parent != 0:
        parent = [i for i in range(node_id)
                  if tree_struct.children_left[i] == parent
                  or tree_struct.children_right[i] == parent][0]
        depth += 1
    indent = "  " * depth
    n_samples = tree_struct.n_node_samples[node_id]
    impurity = tree_struct.impurity[node_id]
    if is_leaf:
        class_counts = tree_struct.value[node_id][0]
        pred_class = iris.target_names[int(class_counts.argmax())]
        print(f"{indent}Leaf (node {node_id}): {n_samples} samples, "
              f"predict {pred_class}, gini={impurity:.3f}")
    else:
        feature = iris.feature_names[tree_struct.feature[node_id]]
        threshold = tree_struct.threshold[node_id]
        print(f"{indent}Node {node_id}: {n_samples} samples, "
              f"split: {feature} <= {threshold:.2f}, gini={impurity:.3f}")
```

For stakeholder communication: (1) Keep trees shallow (depth ≤ 4) for visual clarity, (2) Use filled colors to highlight class predictions, (3) Annotate with sample counts to show confidence, (4) Highlight the most important paths, (5) Consider interactive visualizations for exploration. Remember: a tree that's too complex to visualize meaningfully may be too complex to interpret.
Trees naturally reveal which features matter most through the splitting process. Features used in higher-level splits, or that yield larger impurity reductions, are more important.
The most common feature importance measure for trees:
$$\text{Importance}(j) = \sum_{v : \text{feature}(v) = j} \frac{n_v}{N} \cdot \Delta\mathcal{I}_v$$
Where:
- $v$ ranges over all nodes that split on feature $j$
- $n_v$ is the number of training samples reaching node $v$
- $N$ is the total number of training samples
- $\Delta\mathcal{I}_v$ is the impurity decrease achieved by the split at node $v$
Importances are normalized to sum to 1 across all features.
Interpretation: Feature importance measures how much, on average, splitting on this feature reduces impurity across the tree.
For a single split at node $v$ on feature $j$:
$$\Delta\mathcal{I}_v = \mathcal{I}(\mathcal{D}_v) - \left( \frac{n_L}{n_v} \mathcal{I}(\mathcal{D}_L) + \frac{n_R}{n_v} \mathcal{I}(\mathcal{D}_R) \right)$$
Weighted by sample count: $$\text{Contribution} = \frac{n_v}{N} \cdot \Delta\mathcal{I}_v$$
Example: If feature "income" is used at both the root and a deeper node, its total importance is the sum of the weighted impurity decreases from those two splits.
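As a small worked example with hypothetical numbers (a 1,000-sample tree where "income" splits the root and one 200-sample node; the Gini values are made up for illustration):

```python
N = 1000  # total training samples

# Each contribution is (n_v / N) * (parent impurity - weighted child impurity)
contributions = [
    (1000 / N) * (0.48 - 0.30),  # root split on income: 1.0 * 0.18
    (200 / N) * (0.40 - 0.25),   # deeper split on income: 0.2 * 0.15
]
importance_income = sum(contributions)
print(importance_income)  # ≈ 0.21, before normalizing across all features
```

The root split dominates: even though the deeper split removes more impurity per sample, it touches only a fifth of the data.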
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

# Train a tree
iris = load_iris()
X, y = iris.data, iris.target

tree = DecisionTreeClassifier(max_depth=5, random_state=42)
tree.fit(X, y)

# Built-in feature importance (MDI)
importances = tree.feature_importances_
indices = np.argsort(importances)[::-1]

print("="*60)
print("FEATURE IMPORTANCE (Mean Decrease in Impurity)")
print("="*60)
for i, idx in enumerate(indices):
    print(f"{i+1}. {iris.feature_names[idx]}: {importances[idx]:.4f}")

# Visualize
plt.figure(figsize=(10, 6))
plt.bar(range(len(importances)), importances[indices])
plt.xticks(range(len(importances)),
           [iris.feature_names[i] for i in indices],
           rotation=45, ha='right')
plt.xlabel('Feature')
plt.ylabel('Importance (MDI)')
plt.title('Feature Importance from Decision Tree')
plt.tight_layout()
plt.savefig('feature_importance.png', dpi=150)
plt.show()

# Manual calculation to verify
print("\n" + "="*60)
print("MANUAL IMPORTANCE CALCULATION")
print("="*60)

tree_struct = tree.tree_
n_samples = tree_struct.n_node_samples[0]
n_features = X.shape[1]

# Calculate importance for each feature
manual_importance = np.zeros(n_features)

for node_id in range(tree_struct.node_count):
    if tree_struct.feature[node_id] != -2:  # Not a leaf
        feature = tree_struct.feature[node_id]

        # Get impurity reduction
        n_node = tree_struct.n_node_samples[node_id]
        impurity = tree_struct.impurity[node_id]

        left_child = tree_struct.children_left[node_id]
        right_child = tree_struct.children_right[node_id]
        n_left = tree_struct.n_node_samples[left_child]
        n_right = tree_struct.n_node_samples[right_child]
        impurity_left = tree_struct.impurity[left_child]
        impurity_right = tree_struct.impurity[right_child]

        # Weighted impurity decrease
        weighted_impurity = (n_left * impurity_left
                             + n_right * impurity_right) / n_node
        impurity_decrease = impurity - weighted_impurity

        # Add weighted contribution
        manual_importance[feature] += (n_node / n_samples) * impurity_decrease

# Normalize
manual_importance = manual_importance / manual_importance.sum()

print("Manual calculation matches sklearn:")
for i in range(n_features):
    match = "✓" if np.isclose(manual_importance[i], importances[i]) else "✗"
    print(f"  {iris.feature_names[i]}: "
          f"sklearn={importances[i]:.4f}, manual={manual_importance[i]:.4f} {match}")
```

MDI importance has known limitations: (1) Biased toward high-cardinality features (more potential splits), (2) Doesn't account for feature correlations (one of two correlated features may get all the importance), (3) Computed from training data only (may not reflect test-time importance). Permutation importance is often more reliable but computationally expensive.
Beyond global feature importance, trees offer instance-level explanations: why was this particular sample classified this way?
For any prediction, trace the path from root to leaf:
Sample #42: Income=$75k, Age=35, Credit_Score=720
Decision Path:
1. income > 50000? YES (75000 > 50000)
2. age > 40? NO (35 ≤ 40)
3. credit_score > 700? YES (720 > 700)
Reached Leaf #7: APPROVE
Leaf Statistics: 98 samples, 95% approved, 5% rejected
This complete audit trail shows which features were consulted, what thresholds were applied, which branch was taken at each step, and how much evidence (leaf statistics) supports the final prediction.
```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
import numpy as np

# Train tree
iris = load_iris()
X, y = iris.data, iris.target
tree = DecisionTreeClassifier(max_depth=4, random_state=42)
tree.fit(X, y)

def explain_prediction(tree, X_sample, feature_names, class_names):
    """Generate a complete explanation for a single prediction."""
    tree_struct = tree.tree_

    # Get decision path
    node_indicator = tree.decision_path([X_sample])
    node_ids = node_indicator.indices

    explanation = {
        'input_values': dict(zip(feature_names, X_sample)),
        'decision_path': [],
        'prediction': class_names[tree.predict([X_sample])[0]],
        'probabilities': dict(zip(class_names, tree.predict_proba([X_sample])[0]))
    }

    for node_id in node_ids:
        node_info = {'node_id': node_id}
        if tree_struct.feature[node_id] != -2:  # Internal node
            feature = feature_names[tree_struct.feature[node_id]]
            threshold = tree_struct.threshold[node_id]
            value = X_sample[tree_struct.feature[node_id]]
            went_left = value <= threshold

            node_info['type'] = 'decision'
            node_info['feature'] = feature
            node_info['threshold'] = threshold
            node_info['sample_value'] = value
            node_info['condition'] = f"{feature} <= {threshold:.2f}"
            node_info['result'] = 'YES' if went_left else 'NO'
            node_info['direction'] = 'left' if went_left else 'right'
        else:  # Leaf node
            class_counts = tree_struct.value[node_id][0]
            node_info['type'] = 'leaf'
            node_info['class_distribution'] = dict(zip(class_names, class_counts))
            node_info['n_samples'] = int(tree_struct.n_node_samples[node_id])
        explanation['decision_path'].append(node_info)

    return explanation

def format_explanation(explanation):
    """Format explanation as readable text."""
    lines = []
    lines.append("="*60)
    lines.append("PREDICTION EXPLANATION")
    lines.append("="*60)
    lines.append("Input Features:")
    for feat, val in explanation['input_values'].items():
        lines.append(f"  {feat}: {val:.2f}")
    lines.append("Decision Path:")
    for i, step in enumerate(explanation['decision_path']):
        if step['type'] == 'decision':
            lines.append(f"  Step {i+1}: {step['condition']}?")
            lines.append(f"    Sample value: {step['sample_value']:.2f}")
            lines.append(f"    Answer: {step['result']} -> go {step['direction']}")
        else:
            lines.append(f"  Step {i+1}: LEAF NODE")
            lines.append(f"    Samples: {step['n_samples']}")
            lines.append(f"    Class distribution: {step['class_distribution']}")
    lines.append(f"FINAL PREDICTION: {explanation['prediction']}")
    lines.append("Class Probabilities:")
    for cls, prob in explanation['probabilities'].items():
        lines.append(f"  {cls}: {prob:.1%}")
    return "\n".join(lines)

# Explain a specific sample
sample_idx = 100
X_sample = X[sample_idx]
y_true = iris.target_names[y[sample_idx]]

explanation = explain_prediction(tree, X_sample,
                                 iris.feature_names, iris.target_names)
print(format_explanation(explanation))
print(f"True class: {y_true}")
print(f"Correct: {explanation['prediction'] == y_true}")
```

Often the most useful question is: "What would need to change for a different prediction?"
For a rejected loan application:
```
Current: income=$45k, debt_ratio=0.4 -> REJECTED

To get APPROVED (requires income > $50k AND debt_ratio ≤ 0.3):
  Change 1: Increase income to > $50k (currently $45k)
  Change 2: Reduce debt_ratio to ≤ 0.3 (currently 0.4)

Minimum change: increase income by $5,001 and reduce debt_ratio by 0.1
```
This contrastive framing is actionable—it tells the applicant exactly what to change.
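A minimal sketch of this counterfactual computation, hard-coding the toy tree's thresholds from the example above (the function `minimal_changes` and all numbers are hypothetical; a real implementation would walk `tree_.children_left`/`children_right` instead):

```python
def minimal_changes(income, debt_ratio, credit_score):
    """Changes needed to reach the toy credit tree's APPROVE leaf.

    credit_score is accepted for completeness, but no APPROVE path in
    the toy tree depends on it.
    """
    changes = {}
    # APPROVE requires: income > 50_000 AND debt_ratio <= 0.3
    if income <= 50_000:
        changes['increase income by'] = 50_000 - income + 1
    if debt_ratio > 0.3:
        changes['reduce debt_ratio by'] = round(debt_ratio - 0.3, 2)
    return changes

changes = minimal_changes(income=45_000, debt_ratio=0.4, credit_score=640)
print(changes)  # {'increase income by': 5001, 'reduce debt_ratio by': 0.1}
```

For deeper trees with many leaves of the desired class, a counterfactual search would enumerate all such leaves and return the path whose conditions require the smallest total change.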
A fundamental tension exists: more complex trees may be more accurate, but they become harder to understand.
Manageable complexity: a tree of depth ≤ 4 with a dozen or so leaves can be read rule-by-rule and drawn in a single diagram.
Overwhelming complexity:
- 50 leaves: rule set becomes unwieldy
- 10 features used: difficult to prioritize attention
```
Accuracy
   |
   |              .------- Complex tree (higher accuracy)
   |            ./
   |          ./
   |        ./
   |      ./
   |    ./   Simple tree (lower accuracy, interpretable)
   |  ./
   +-------------------------> Complexity
            ^
            |
   Interpretability threshold
   (beyond this, tree is a "black box")
```
The optimal point depends on the application. For life-or-death medical decisions, pure accuracy may dominate. For credit decisions with regulatory requirements, interpretability is mandatory.
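The tradeoff can be made concrete by sweeping tree depth and watching accuracy and leaf count move together. A quick sketch on the Iris dataset (the depth grid is arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Deeper trees gain accuracy but multiply the number of rules to read
for depth in (1, 2, 3, 5, 8):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    acc = cross_val_score(clf, X, y, cv=5).mean()
    n_leaves = clf.fit(X, y).get_n_leaves()
    print(f"depth={depth}: CV accuracy={acc:.3f}, leaves={n_leaves}")
```

Often the accuracy curve flattens well before the leaf count does, which is exactly where the interpretability threshold in the diagram above should be drawn.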
Hard constraints: Set max_depth=4 or max_leaf_nodes=10 as non-negotiables
Regularization tuning: Use cross-validation to find the simplest tree within 1-2% of best accuracy
Rule simplification: Post-process rules to remove redundant conditions
Ensemble distillation: Train a complex ensemble, then distill into an interpretable tree that mimics its predictions
Selective complexity: Allow depth only where impurity gain justifies it (min_impurity_decrease)
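A minimal sketch of ensemble distillation, assuming a random forest as the teacher (dataset and hyperparameters are illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Teacher: accurate but opaque ensemble
teacher = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)

# Student: shallow tree trained on the teacher's predictions, not the true labels
student = DecisionTreeClassifier(max_depth=3, random_state=42)
student.fit(X, teacher.predict(X))

# Fidelity: how often the student reproduces the teacher's decisions
fidelity = np.mean(student.predict(X) == teacher.predict(X))
print(f"Student/teacher agreement: {fidelity:.1%}, leaves: {student.get_n_leaves()}")
```

The student's rules explain the teacher only up to its fidelity score, so fidelity should always be reported alongside any distilled explanation.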
Random Forests and Gradient Boosting combine many trees for higher accuracy—but lose interpretability. A 500-tree ensemble is essentially a black box, even though each individual tree is interpretable. This is a conscious tradeoff. When interpretability is required, a single shallow tree may be preferred even at some accuracy cost.
We have explored the interpretability properties that make decision trees uniquely valuable: intrinsic transparency, rule extraction, visualization, feature importance, instance-level explanations, and the complexity-interpretability tradeoff.
Module complete:
With this page, we conclude the Decision Tree Fundamentals module. You now have a comprehensive understanding of tree structure, splitting rules, leaf predictions, recursive partitioning, and interpretability. This foundation prepares you for the next modules: splitting criteria (Gini, entropy), tree growing algorithms (ID3, C4.5, CART), pruning strategies, and eventually ensemble methods that build on these fundamentals.
Congratulations! You have mastered the fundamentals of decision trees. You understand their structure, how they're built, how they make predictions, and why they're uniquely interpretable. This knowledge is essential for everything that follows in decision tree learning—from splitting criteria to pruning to ensemble methods. The next module explores the specific criteria (Gini impurity, entropy, information gain) used to select optimal splits.