If you were asked to design an impurity measure from scratch, misclassification error would likely be your first idea. It's stunningly simple: just measure the fraction of samples that would be incorrectly classified by the majority-vote rule.
A node is pure if all samples belong to one class (error = 0). A node is maximally impure if classes are evenly split (error = 0.5 for binary). What could be more natural?
Yet despite this intuitive appeal, misclassification error is rarely used as a splitting criterion. It is the odd one out among impurity measures: too blunt to guide tree growth, yet exactly right for final evaluation. Understanding why reveals deep insights into the mathematics of decision trees.
By the end of this page, you will understand why misclassification error, despite measuring exactly what we care about (errors!), fails as a splitting criterion. You'll see concrete examples where it gets 'stuck,' learn about strict concavity, and understand when misclassification error IS appropriate.
Misclassification error (also called classification error or 0-1 loss impurity) is defined as:
$$E = 1 - \max_k p_k$$
where $p_k$ is the proportion of class $k$ in the node.
Interpretation: If we predict the majority class for all samples in a node, $E$ is the fraction we get wrong.
For Binary Classification:
With class proportions $p$ and $(1-p)$:
$$E = 1 - \max(p, 1-p) = \min(p, 1-p)$$
This creates a piecewise linear function:
The function is tent-shaped (an inverted V): it is zero at $p = 0$ and $p = 1$ and reaches its maximum of 0.5 at $p = 0.5$.
```python
import numpy as np
from typing import Union


def misclassification_error(class_counts: Union[list, np.ndarray]) -> float:
    """
    Compute misclassification error from class counts.

    Misclassification error measures the proportion of samples that
    would be incorrectly classified by predicting the majority class.

    Args:
        class_counts: Array or list of counts for each class

    Returns:
        Misclassification error in range [0, 1 - 1/K]

    Mathematical Definition:
        E = 1 - max_k(p_k)

    Properties:
        - E = 0 for pure nodes (all samples same class)
        - E = 1 - 1/K for uniform distribution over K classes
        - For binary: maximum is 0.5 when p = 0.5
        - Piecewise linear in class proportions

    Example:
        >>> misclassification_error([50, 50])   # Balanced binary
        0.5
        >>> misclassification_error([100, 0])   # Pure
        0.0
        >>> misclassification_error([80, 20])   # Majority wins
        0.2
    """
    counts = np.array(class_counts, dtype=np.float64)
    n = counts.sum()
    if n == 0:
        return 0.0  # Empty node
    p = counts / n
    return 1.0 - np.max(p)


def misclassification_error_binary(p: float) -> float:
    """
    Compute misclassification error for binary classification.

    E = min(p, 1-p) = 1 - max(p, 1-p)

    This is a tent-shaped (inverted-V) function peaking at p = 0.5.

    Args:
        p: Proportion of positive class (0 ≤ p ≤ 1)

    Returns:
        Error rate in range [0, 0.5]
    """
    return min(p, 1 - p)


# Demonstration
if __name__ == "__main__":
    print("Misclassification Error Examples:")
    print("=" * 55)

    # Pure node
    print(f"Pure [100, 0]: E = {misclassification_error([100, 0]):.4f}")

    # Balanced binary
    print(f"Balanced [50, 50]: E = {misclassification_error([50, 50]):.4f}")

    # Imbalanced
    print(f"[80, 20]: E = {misclassification_error([80, 20]):.4f}")
    print(f"[90, 10]: E = {misclassification_error([90, 10]):.4f}")
    print(f"[99, 1]: E = {misclassification_error([99, 1]):.4f}")

    # Multi-class
    print(f"\n3-class uniform [33, 33, 34]: E = {misclassification_error([33, 33, 34]):.4f}")
    print(f"3-class max = 1 - 1/3 = {1 - 1/3:.4f}")
    print(f"\n4-class [70, 10, 10, 10]: E = {misclassification_error([70, 10, 10, 10]):.4f}")
    print(f"4-class [40, 30, 20, 10]: E = {misclassification_error([40, 30, 20, 10]):.4f}")
```

| p (Class 1) | 1-p (Class 0) | E = min(p, 1-p) | Majority Prediction |
|---|---|---|---|
| 0.0 | 1.0 | 0.0 | Class 0 (perfect) |
| 0.1 | 0.9 | 0.1 | Class 0 (90% correct) |
| 0.3 | 0.7 | 0.3 | Class 0 (70% correct) |
| 0.5 | 0.5 | 0.5 | Either (50% correct) |
| 0.7 | 0.3 | 0.3 | Class 1 (70% correct) |
| 0.9 | 0.1 | 0.1 | Class 1 (90% correct) |
| 1.0 | 0.0 | 0.0 | Class 1 (perfect) |
Here's the critical issue: misclassification error is not strictly concave.
Recall that Gini and entropy are strictly concave functions. This guarantees:
$$f\left(\frac{n_L \cdot p_L + n_R \cdot p_R}{n}\right) > \frac{n_L}{n} f(p_L) + \frac{n_R}{n} f(p_R)$$
whenever $p_L \neq p_R$. In words: the impurity of the parent is strictly greater than the weighted average impurity of children (if they differ). This means every split that separates classes reduces impurity.
Misclassification error is concave (its graph lies on or above every chord) but not strictly concave: it is piecewise linear. On the linear segments, infinitely many splits achieve exactly zero reduction in error!
When impurity is piecewise linear, valid splits can have ZERO impurity reduction even when they separate classes. The algorithm sees no benefit and may make arbitrary or suboptimal choices. Gini and entropy, being curved, always show positive reduction for class-separating splits.
Geometric Visualization:
Imagine the binary misclassification error curve: two straight lines meeting at $p = 0.5$.
A split from $(p=0.5)$ to $(p_L=0.3, p_R=0.7)$ with equal-sized children: parent error $= 0.5$; weighted child error $= 0.5(0.3) + 0.5(0.3) = 0.3$; reduction $= 0.2$. The split is rewarded.
A split from $(p=0.3)$ to $(p_L=0.1, p_R=0.5)$ with equal-sized children: parent error $= 0.3$; weighted child error $= 0.5(0.1) + 0.5(0.5) = 0.3$; reduction $= 0$.
The second split is clearly useful (one child is purer), but misclassification error doesn't see it.
```python
import numpy as np


def gini(p):
    return 2 * p * (1 - p)


def entropy(p):
    if p == 0 or p == 1:
        return 0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)


def misclass(p):
    return min(p, 1 - p)


def weighted_avg(impurity_fn, p_left, p_right, n_left, n_right):
    """Compute the sample-weighted average impurity of the two children."""
    n = n_left + n_right
    return (n_left / n) * impurity_fn(p_left) + (n_right / n) * impurity_fn(p_right)


def compute_reduction(impurity_fn, p_parent, p_left, p_right, n_left, n_right):
    """Compute impurity reduction for a split."""
    parent_impurity = impurity_fn(p_parent)
    child_impurity = weighted_avg(impurity_fn, p_left, p_right, n_left, n_right)
    return parent_impurity - child_impurity


print("Demonstrating the Concavity Problem")
print("=" * 70)

# Case 1: Split that works for all measures
print("\n--- Case 1: Split from p=0.5 to (0.3, 0.7) with equal children ---")
p_parent = 0.5
p_left, p_right = 0.3, 0.7
n_left, n_right = 50, 50

for name, fn in [("Gini", gini), ("Entropy", entropy), ("MisclassError", misclass)]:
    reduction = compute_reduction(fn, p_parent, p_left, p_right, n_left, n_right)
    print(f"{name:15} Parent={fn(p_parent):.4f}, Reduction={reduction:.4f}")

# Case 2: Split that fails for misclassification error
print("\n--- Case 2: Split from p=0.3 to (0.1, 0.5) with equal children ---")
print("(Left child is MUCH purer, but misclassification sees no benefit)")
p_parent = 0.3
p_left, p_right = 0.1, 0.5
n_left, n_right = 50, 50

for name, fn in [("Gini", gini), ("Entropy", entropy), ("MisclassError", misclass)]:
    reduction = compute_reduction(fn, p_parent, p_left, p_right, n_left, n_right)
    status = "✓" if reduction > 0.001 else "✗ ZERO GAIN"
    print(f"{name:15} Parent={fn(p_parent):.4f}, Reduction={reduction:.4f} {status}")

# Case 3: Another problematic case
print("\n--- Case 3: Split from p=0.2 to (0.1, 0.3) with equal children ---")
p_parent = 0.2
p_left, p_right = 0.1, 0.3
n_left, n_right = 50, 50

for name, fn in [("Gini", gini), ("Entropy", entropy), ("MisclassError", misclass)]:
    reduction = compute_reduction(fn, p_parent, p_left, p_right, n_left, n_right)
    status = "✓" if reduction > 0.001 else "✗ ZERO GAIN"
    print(f"{name:15} Parent={fn(p_parent):.4f}, Reduction={reduction:.4f} {status}")

# Case 4: Perfect split for all
print("\n--- Case 4: Perfect split from p=0.5 to (0.0, 1.0) ---")
p_parent = 0.5
p_left, p_right = 0.0, 1.0
n_left, n_right = 50, 50

for name, fn in [("Gini", gini), ("Entropy", entropy), ("MisclassError", misclass)]:
    reduction = compute_reduction(fn, p_parent, p_left, p_right, n_left, n_right)
    print(f"{name:15} Parent={fn(p_parent):.4f}, Reduction={reduction:.4f} (Perfect!)")

print("\n" + "=" * 70)
print("CONCLUSION: Misclassification error misses many valid splits!")
print("Gini and Entropy ALWAYS show positive reduction for class-separating splits.")
```

Let's analyze precisely when misclassification error shows zero gain.
Theorem: For a binary split, misclassification error reduction is zero if and only if:
$$\max(p_L, 1-p_L) \cdot w_L + \max(p_R, 1-p_R) \cdot w_R = \max(p_{parent}, 1-p_{parent})$$
where $w_L, w_R$ are the child weights.
When Does This Happen?
Both children keep the parent's majority class: if the parent and both children all have $p \leq 0.5$ (or all have $p \geq 0.5$), every point involved sits on the same linear branch of the tent. The weighted average of the child errors then equals the parent's error exactly, so the reduction is always zero, no matter how much purer one child becomes.
A child lands exactly on 0.5: this is Case 2 from the code above ($p = 0.3$ split into $0.1$ and $0.5$). The purer child's gain is exactly cancelled by the 50/50 child sitting at the peak. By contrast, if a child strictly crosses to the other side of 0.5 (its majority class flips), the reduction is strictly positive.
If the parent's majority class proportion is, say, 70%, any split that DOES separate examples but doesn't flip the majority class in either child (e.g., children at 60% and 80%) shows exactly zero reduction. The algorithm is 'stuck' even though the data is separable!
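To see this concretely, take the 70% parent split into equal-sized children at 60% and 80% (which indeed average back to 70%):

$$E_{\text{parent}} = 1 - 0.7 = 0.3, \qquad \tfrac{1}{2}(1 - 0.6) + \tfrac{1}{2}(1 - 0.8) = 0.2 + 0.1 = 0.3$$

The weighted child error equals the parent error exactly, so the measured gain is zero.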
Practical Consequence:
Consider a node with 300 samples: 210 Class A (70%), 90 Class B (30%).
Split Option 1: Left = (180 A, 20 B), Right = (30 A, 70 B). Weighted error = (200/300)(0.10) + (100/300)(0.30) ≈ 0.167, a reduction of about 0.133 (the right child's majority flips to Class B).
Split Option 2: Left = (190 A, 60 B), Right = (20 A, 30 B). Weighted error = (250/300)(0.24) + (50/300)(0.40) ≈ 0.267, a reduction of about 0.033.
Split Option 3: Left = (150 A, 50 B), Right = (60 A, 40 B). Weighted error = (200/300)(0.25) + (100/300)(0.40) = 0.30, exactly the parent's error of 0.30.
Split 3 separates the data but shows zero improvement under misclassification error, because both children still have Class A as their majority (verified in the check below).
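A quick verification of the three options (a minimal standalone sketch; the class counts are exactly the ones listed above):

```python
import numpy as np

def misclass_error(counts):
    """Misclassification error of a node given per-class counts."""
    counts = np.asarray(counts, dtype=float)
    return 1.0 - counts.max() / counts.sum()

def weighted_error(left, right):
    """Sample-weighted misclassification error of the two children."""
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    return (n_left / n) * misclass_error(left) + (n_right / n) * misclass_error(right)

parent = [210, 90]  # 210 Class A, 90 Class B
splits = {
    "Option 1": ([180, 20], [30, 70]),
    "Option 2": ([190, 60], [20, 30]),
    "Option 3": ([150, 50], [60, 40]),
}

E_parent = misclass_error(parent)
print(f"Parent error: {E_parent:.3f}")
for name, (left, right) in splits.items():
    E_children = weighted_error(left, right)
    print(f"{name}: weighted child error = {E_children:.3f}, "
          f"reduction = {E_parent - E_children:.3f}")
```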
Why is Misclassification Error a Lower Bound?
For binary classification at any $p$:
$$\min(p, 1-p) \leq 2p(1-p) \leq H(p)$$
where $H(p) = -p \log_2 p - (1-p) \log_2(1-p)$ is the binary entropy (base-2 logarithm, matching the table below).
Proof (for $p \leq 0.5$):
$E = p$ while $G = 2p(1-p) = 2p - 2p^2$.
Since $p \leq 0.5$, we have $1-p \geq 0.5 \geq p$, so $2(1-p) \geq 1$, thus $2p(1-p) \geq p = E$. The case $p > 0.5$ is symmetric. The second inequality, $2p(1-p) \leq H(p)$, also holds for every $p$; the table below shows it numerically at several points.
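If you want to sanity-check the full chain $\min(p, 1-p) \leq 2p(1-p) \leq H(p)$ numerically, here is a small sketch over a grid of interior points (base-2 entropy, as above):

```python
import numpy as np

p = np.linspace(0.001, 0.999, 999)        # avoid exact 0 and 1 for the log
misclass = np.minimum(p, 1 - p)
gini = 2 * p * (1 - p)
entropy = -p * np.log2(p) - (1 - p) * np.log2(1 - p)

# Both inequalities should hold at every grid point
print(np.all(misclass <= gini + 1e-12))    # True
print(np.all(gini <= entropy + 1e-12))     # True
```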
Intuition: Gini and entropy 'care about' the full distribution shape, not just the maximum. Moving probability mass between non-majority classes changes Gini/entropy but not misclassification error.
| p | Misclass E | Gini G | Entropy H | G/E | H/E |
|---|---|---|---|---|---|
| 0.10 | 0.100 | 0.180 | 0.469 | 1.80 | 4.69 |
| 0.20 | 0.200 | 0.320 | 0.722 | 1.60 | 3.61 |
| 0.30 | 0.300 | 0.420 | 0.881 | 1.40 | 2.94 |
| 0.40 | 0.400 | 0.480 | 0.971 | 1.20 | 2.43 |
| 0.50 | 0.500 | 0.500 | 1.000 | 1.00 | 2.00 |
For $K$ classes, misclassification error is:
$$E = 1 - \max_{k} p_k$$
Properties in the multi-class setting: $E = 0$ for a pure node, $E = 1 - 1/K$ for a uniform distribution over $K$ classes, and $E$ remains piecewise linear in the class proportions, so the zero-gain problem from the binary case carries over.
The Structural Blindness Problem:
Consider 4 classes with these distributions, each with the same 40% majority class: $[0.40, 0.20, 0.20, 0.20]$, $[0.40, 0.30, 0.15, 0.15]$, and $[0.40, 0.40, 0.10, 0.10]$.
All three have the same misclassification error: $E = 1 - 0.40 = 0.60$!
Yet their Gini impurities (0.72, 0.705, 0.66) and their entropies (roughly 1.92, 1.87, and 1.72 bits) all differ, because the non-majority mass is spread differently in each case.
Misclassification error loses all information about the distribution shape.
```python
import numpy as np


def misclass_error(p):
    """Misclassification error for probability distribution."""
    return 1 - np.max(p)


def gini_impurity(p):
    """Gini impurity for probability distribution."""
    return 1 - np.sum(p ** 2)


def entropy(p):
    """Shannon entropy for probability distribution."""
    p = np.array(p)
    p = p[p > 0]  # Filter zeros
    return -np.sum(p * np.log2(p))


print("Multi-class Structural Blindness Demonstration")
print("=" * 65)

distributions = [
    ("Uniform", [0.25, 0.25, 0.25, 0.25]),
    ("Dominant 40%", [0.40, 0.20, 0.20, 0.20]),
    ("Spread non-max", [0.40, 0.30, 0.15, 0.15]),
    ("Two dominants", [0.40, 0.40, 0.10, 0.10]),
    ("Very skewed", [0.70, 0.10, 0.10, 0.10]),
    ("Near-pure", [0.85, 0.05, 0.05, 0.05]),
    ("Pure", [1.00, 0.00, 0.00, 0.00]),
]

print(f"\n{'Distribution':<20} {'Misclass E':>12} {'Gini G':>10} {'Entropy H':>10}")
print("-" * 65)

for name, p in distributions:
    p = np.array(p)
    E = misclass_error(p)
    G = gini_impurity(p)
    H = entropy(p)
    print(f"{name:<20} {E:>12.4f} {G:>10.4f} {H:>10.4f}")

print("\n" + "-" * 65)
print("Notice: 'Dominant 40%', 'Spread non-max', 'Two dominants'")
print("all have SAME misclass error (0.60) but DIFFERENT Gini/Entropy!")
print("Misclassification error ignores distribution structure.")
```

Despite its problems as a splitting criterion, misclassification error is the right choice for several purposes:
1. Final Model Evaluation
When reporting a tree's performance, misclassification error (accuracy's complement) is exactly what matters. You want to know: 'What fraction of predictions will be wrong?'
2. Pruning Criterion
Cost-complexity pruning (used in CART) optimizes:
$$R_\alpha(T) = R(T) + \alpha |T|$$
where $R(T)$ is the misclassification rate and $|T|$ is the number of terminal nodes. Here misclassification error is the right quantity: pruning trades tree size directly against the errors the tree will actually make (a short scikit-learn sketch follows this list of uses).
3. Threshold Tuning at Leaves
When adjusting prediction thresholds for class imbalance, you're directly optimizing misclassification (or related metrics like F1).
4. Simple Baseline
For quick understanding: 'A node with 60% Class A has 40% error rate.' Intuitive and easy to communicate.
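To make the pruning use concrete, here is a minimal sketch, assuming scikit-learn is available: its `cost_complexity_pruning_path` method and `ccp_alpha` parameter implement the $R_\alpha(T) = R(T) + \alpha |T|$ trade-off, and the built-in breast-cancer dataset stands in for real data.

```python
# Sketch only: assumes scikit-learn is installed; dataset choice is arbitrary.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate alpha values along the cost-complexity pruning path
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

for alpha in path.ccp_alphas[::5]:  # sample a few alphas along the path
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    # The quantity we care about at this stage is misclassification error
    test_error = 1 - pruned.score(X_test, y_test)
    print(f"alpha={alpha:.5f}  leaves={pruned.get_n_leaves()}  test error={test_error:.3f}")
```

Larger values of alpha prune harder; the held-out misclassification error tells you where the size/error trade-off stops paying off.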
Use Gini or Entropy for BUILDING the tree (they're better at finding good splits). Use misclassification error for EVALUATING the tree (it directly measures accuracy). This is standard practice in CART and similar algorithms.
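As a concrete, minimal illustration of that division of labor (assuming scikit-learn is available, with its iris dataset standing in for real data):

```python
# Sketch only: assumes scikit-learn is installed; iris stands in for real data.
from sklearn.datasets import load_iris
from sklearn.metrics import zero_one_loss
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# BUILD with a strictly concave criterion (Gini here; criterion="entropy" also works)
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
tree.fit(X_train, y_train)

# EVALUATE with misclassification error, the complement of accuracy
print(f"Test misclassification error: {zero_one_loss(y_test, tree.predict(X_test)):.3f}")
```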
Misclassification error corresponds to the 0-1 loss function:
$$L_{0-1}(y, \hat{y}) = \mathbb{1}[y \neq \hat{y}] = \begin{cases} 1 & \text{if } y \neq \hat{y} \\ 0 & \text{if } y = \hat{y} \end{cases}$$
This is the most natural loss for classification: you're either right (loss = 0) or wrong (loss = 1).
Why Not Optimize 0-1 Loss Directly?
Non-differentiable: The 0-1 loss has no gradient—you can't do gradient descent.
Combinatorially hard: Finding the optimal tree under 0-1 loss is NP-hard.
Poor for greedy algorithms: As we've shown, greedy splits under 0-1 loss (i.e., misclassification error) get stuck.
Surrogate Losses:
Gini and entropy are examples of surrogate losses—smooth, differentiable approximations to 0-1 loss that are easier to optimize but still correlate with the true objective.
| Loss | Formula | Strictly concave (as impurity)? | Differentiable? | Good for Splitting? |
|---|---|---|---|---|
| 0-1 (Misclass) | $\mathbb{1}[y \neq \hat{y}]$ | No (piecewise linear) | No | No |
| Gini-based | $2p(1-p)$ | Yes | Yes | Yes |
| Cross-entropy | $-\log \hat{p}_y$ | Yes | Yes | Yes |
This is analogous to using log loss (cross-entropy) instead of 0-1 loss in logistic regression—it provides gradients for optimization.
Throughout machine learning, we optimize surrogate losses that: (1) are computationally tractable, (2) approximate the true objective, and (3) provide useful gradients. Gini and entropy are surrogate losses for 0-1 misclassification, just as hinge loss (SVM) and log loss (logistic regression) are.
A visual comparison crystallizes the differences between our three impurity measures. The key observation is the curvature difference at the boundaries.
```python
import numpy as np
import matplotlib.pyplot as plt


def gini(p):
    return 2 * p * (1 - p)


def entropy(p):
    result = np.zeros_like(p)
    mask = (p > 0) & (p < 1)
    pm = p[mask]
    result[mask] = -pm * np.log2(pm) - (1 - pm) * np.log2(1 - pm)
    return result


def misclass(p):
    return np.minimum(p, 1 - p)


def plot_impurity_comparison():
    """
    Create comprehensive comparison plot of all impurity measures.
    """
    fig, axes = plt.subplots(2, 2, figsize=(14, 12))
    p = np.linspace(0.001, 0.999, 1000)
    g = gini(p)
    h = entropy(p)
    m = misclass(p)

    # Top-left: Raw values
    ax1 = axes[0, 0]
    ax1.plot(p, g, 'b-', linewidth=2.5, label='Gini (max=0.5)')
    ax1.plot(p, h, 'r-', linewidth=2.5, label='Entropy (max=1.0)')
    ax1.plot(p, m, 'g-', linewidth=2.5, label='Misclass (max=0.5)')
    ax1.fill_between(p, m, alpha=0.1, color='green')
    ax1.fill_between(p, m, g, alpha=0.1, color='blue')
    ax1.fill_between(p, g, h, alpha=0.1, color='red')
    ax1.set_xlabel('Proportion of Class 1 (p)', fontsize=12)
    ax1.set_ylabel('Impurity Value', fontsize=12)
    ax1.set_title('Raw Impurity Values', fontsize=14)
    ax1.legend(loc='upper right')
    ax1.grid(True, alpha=0.3)
    ax1.set_xlim(0, 1)
    ax1.set_ylim(0, 1.1)

    # Top-right: Normalized to max=1
    ax2 = axes[0, 1]
    ax2.plot(p, g / 0.5, 'b-', linewidth=2.5, label='Gini (scaled)')
    ax2.plot(p, h / 1.0, 'r-', linewidth=2.5, label='Entropy (scaled)')
    ax2.plot(p, m / 0.5, 'g-', linewidth=2.5, label='Misclass (scaled)')
    ax2.set_xlabel('Proportion of Class 1 (p)', fontsize=12)
    ax2.set_ylabel('Normalized Impurity', fontsize=12)
    ax2.set_title('Normalized Comparison (all scaled to max=1)', fontsize=14)
    ax2.legend(loc='upper right')
    ax2.grid(True, alpha=0.3)
    ax2.set_xlim(0, 1)
    ax2.set_ylim(0, 1.1)

    # Bottom-left: Zoom near p=0 (boundary behavior)
    ax3 = axes[1, 0]
    p_zoom = np.linspace(0.001, 0.15, 500)
    ax3.plot(p_zoom, gini(p_zoom) / 0.5, 'b-', linewidth=2.5, label='Gini')
    ax3.plot(p_zoom, entropy(p_zoom) / 1.0, 'r-', linewidth=2.5, label='Entropy')
    ax3.plot(p_zoom, misclass(p_zoom) / 0.5, 'g-', linewidth=2.5, label='Misclass')
    ax3.set_xlabel('Proportion of Class 1 (p)', fontsize=12)
    ax3.set_ylabel('Normalized Impurity', fontsize=12)
    ax3.set_title('Zoom: Behavior Near p=0 (Boundary)', fontsize=14)
    ax3.legend(loc='upper right')
    ax3.grid(True, alpha=0.3)
    ax3.annotate(
        'Entropy has vertical\ntangent at 0',
        xy=(0.02, 0.12), xytext=(0.06, 0.35),
        fontsize=10, arrowprops=dict(arrowstyle='->', color='red')
    )

    # Bottom-right: Second derivative (curvature)
    ax4 = axes[1, 1]
    # Compute numerical second derivatives
    dp = p[1] - p[0]
    g2 = np.gradient(np.gradient(g, dp), dp)
    h2 = np.gradient(np.gradient(h, dp), dp)
    # Misclass has 0 second derivative everywhere except at p=0.5
    ax4.plot(p[10:-10], g2[10:-10], 'b-', linewidth=2.5, label='Gini d²/dp²')
    ax4.plot(p[10:-10], h2[10:-10], 'r-', linewidth=2.5, label='Entropy d²/dp²')
    ax4.axhline(y=0, color='green', linestyle='--', linewidth=2.5,
                label='Misclass d²/dp²=0 (linear)')
    ax4.set_xlabel('Proportion of Class 1 (p)', fontsize=12)
    ax4.set_ylabel('Second Derivative (Curvature)', fontsize=12)
    ax4.set_title('Curvature: Why Strict Concavity Matters', fontsize=14)
    ax4.legend(loc='upper right')
    ax4.grid(True, alpha=0.3)
    ax4.set_xlim(0, 1)

    plt.tight_layout()
    plt.savefig('impurity_comprehensive.png', dpi=150, bbox_inches='tight')
    plt.show()

    print("\nKey Observations:")
    print("=" * 60)
    print("1. Misclassification error is PIECEWISE LINEAR (zero curvature)")
    print("2. Gini has CONSTANT negative curvature (-4)")
    print("3. Entropy has curvature that goes to -INFINITY at boundaries")
    print("4. This curvature guarantees positive gain for class-separating splits")


if __name__ == "__main__":
    plot_impurity_comparison()
```

We've thoroughly analyzed why misclassification error, despite its intuitive appeal, fails as a splitting criterion. Let's consolidate:

- $E = 1 - \max_k p_k$ measures exactly what we report at the end: the fraction of wrong majority-vote predictions.
- It is concave but only piecewise linear (not strictly concave), so any split whose children keep the parent's majority class shows exactly zero reduction.
- It ignores how the non-majority probability mass is distributed, which hides structure that Gini and entropy can see, especially with many classes.
- It remains the right measure for evaluating, pruning, and reporting a tree; it is simply a poor guide for growing one.
What's Next:
Having thoroughly understood all three major impurity measures, the next page provides a systematic comparison. We'll explore when they differ, which to choose in practice, and how the theoretical differences translate to real-world tree construction.
You now understand misclassification error's fundamental limitation: its piecewise linear nature prevents it from 'seeing' many valid splits. This is why CART uses Gini internally but reports misclassification error for evaluation—each has its proper role.