In 1948, Claude Shannon published *A Mathematical Theory of Communication*, a paper that would revolutionize our understanding of information. Shannon asked: How do we quantify information? His answer—entropy—became the foundation of information theory and eventually found its way into machine learning.
When we use entropy as a splitting criterion in decision trees, we're applying Shannon's profound insight: the 'disorder' or 'uncertainty' in a node can be measured as the average number of bits needed to encode the class labels. A pure node requires no bits (we already know the answer); a maximally impure node requires maximum bits (maximum uncertainty).
This page explores entropy through the lens of information theory, revealing why it's not just a valid impurity measure, but the canonical measure from a communication perspective.
By the end of this page, you will understand entropy from its information-theoretic roots, master the mathematics of information gain, see how it connects to decision tree algorithms like ID3 and C4.5, and appreciate the subtle differences between entropy and Gini impurity.
Before defining entropy, we need to understand self-information—the fundamental unit that entropy aggregates.
The Key Insight: Rare events carry more information than common events.
Think about news headlines: 'The sun rose this morning' carries essentially no information because it was certain to happen, while 'Magnitude 8 earthquake strikes the capital' is front-page news precisely because it is so improbable.
Shannon formalized this with the self-information or surprisal of an event:
$$I(x) = -\log_2 P(x) = \log_2 \frac{1}{P(x)}$$
Properties of Self-Information:

- $I(x) \geq 0$: information is never negative.
- $I(x) = 0$ when $P(x) = 1$: certain events carry no information.
- $I(x)$ increases as $P(x)$ decreases: rarer events are more surprising.
- Additivity: for independent events, $I(x, y) = I(x) + I(y)$.
The choice of logarithm base determines the unit: base 2 gives bits, base $e$ gives nats, and base 10 gives hartleys. We use base 2 (bits) throughout this page.
| Probability P(x) | Self-Information I(x) in bits | Intuition |
|---|---|---|
| 1.0 | 0.0 | Certain event, no surprise |
| 0.5 | 1.0 | One bit of information (coin flip) |
| 0.25 | 2.0 | Two bits (like two coin flips) |
| 0.125 | 3.0 | Three bits |
| 0.01 | 6.64 | Rare event, high surprise |
| 0.001 | 9.97 | Very rare, ~10 bits |
The logarithm is the unique function satisfying additivity for independent events. If you see two independent events with probabilities p and q, the total information is I(p) + I(q) = -log(p) - log(q) = -log(pq), which equals the information of the joint event. No other function family has this property.
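To make this concrete, here is a minimal sketch (the `self_information` helper is my own, not from any library) that evaluates $I(x)$ for the probabilities in the table above and checks additivity for a pair of independent events:

```python
import numpy as np

def self_information(p: float) -> float:
    """Self-information I(x) = -log2(P(x)), in bits."""
    return np.log2(1.0 / p)

if __name__ == "__main__":
    # Reproduce the table above
    for p in [1.0, 0.5, 0.25, 0.125, 0.01, 0.001]:
        print(f"P(x) = {p:<6} -> I(x) = {self_information(p):.2f} bits")

    # Additivity for independent events: I(p) + I(q) = I(p * q)
    p, q = 0.5, 0.25
    print(f"I(0.5) + I(0.25) = {self_information(p) + self_information(q):.2f} bits")
    print(f"I(0.125)         = {self_information(p * q):.2f} bits")
```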
Entropy is the expected self-information of a random variable. For a discrete random variable $X$ with possible values $\{x_1, x_2, ..., x_K\}$ and probability distribution $P$:
$$H(X) = \mathbb{E}[I(X)] = -\sum_{k=1}^{K} P(x_k) \log_2 P(x_k) = -\sum_{k=1}^{K} p_k \log_2 p_k$$
Interpretation: Entropy measures the average number of bits needed to encode the outcome of $X$ using an optimal coding scheme.
For Decision Trees: In a node with $K$ classes and class proportions $p_1, p_2, ..., p_K$:
$$H = -\sum_{k=1}^{K} p_k \log_2 p_k$$
where we define $0 \cdot \log 0 = 0$ (by continuity, as $\lim_{p \to 0^+} p \log p = 0$).
Binary Entropy Function:
For binary classification with $p$ and $(1-p)$:
$$H(p) = -p \log_2 p - (1-p) \log_2 (1-p)$$
This is the famous binary entropy function, fundamental to coding theory and machine learning alike.
```python
import numpy as np
from typing import Union

def entropy(class_counts: Union[list, np.ndarray], base: float = 2) -> float:
    """
    Compute Shannon entropy from class counts.

    Shannon entropy measures the average information content or 'surprise'
    in the distribution. It's the expected number of bits needed to encode
    the class label optimally.

    Args:
        class_counts: Array or list of counts for each class
        base: Logarithm base (2 for bits, e for nats)

    Returns:
        Entropy value in range [0, log_base(K)]

    Mathematical Definition:
        H = -Σ(p_k * log(p_k))

    Properties:
        - H = 0 for pure nodes (all samples same class)
        - H = log(K) for uniform distribution over K classes
        - For binary: maximum is 1.0 bit when p = 0.5

    Convention: 0 * log(0) = 0 (by L'Hôpital's rule / continuity)
    """
    counts = np.array(class_counts, dtype=np.float64)
    n = counts.sum()
    if n == 0:
        return 0.0  # Empty node

    # Compute probabilities
    p = counts / n

    # Filter out zeros to avoid log(0)
    # Note: 0 * log(0) = 0 by convention
    p_nonzero = p[p > 0]

    # Entropy = -Σ(p * log(p))
    if base == 2:
        return -np.sum(p_nonzero * np.log2(p_nonzero))
    elif base == np.e:
        return -np.sum(p_nonzero * np.log(p_nonzero))
    else:
        return -np.sum(p_nonzero * np.log(p_nonzero)) / np.log(base)


def binary_entropy(p: float) -> float:
    """
    Compute binary entropy function H(p).

    H(p) = -p*log2(p) - (1-p)*log2(1-p)

    This is the entropy of a Bernoulli random variable.

    Args:
        p: Probability of positive class (0 ≤ p ≤ 1)

    Returns:
        Entropy in bits, range [0, 1]

    Special cases:
        H(0) = H(1) = 0 (pure)
        H(0.5) = 1.0 (maximum uncertainty)
    """
    if p == 0 or p == 1:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)


def cross_entropy(p: np.ndarray, q: np.ndarray) -> float:
    """
    Compute cross-entropy H(P, Q) between two distributions.

    H(P, Q) = -Σ p_k * log(q_k)

    Cross-entropy measures the average number of bits needed to encode
    samples from P using a code optimized for Q.

    Note: H(P, Q) ≥ H(P), with equality iff P = Q.
    The difference H(P, Q) - H(P) is the KL divergence D_KL(P||Q).
    """
    p = np.array(p, dtype=np.float64)
    q = np.array(q, dtype=np.float64)
    # Avoid log(0) where p > 0
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))


# Demonstration
if __name__ == "__main__":
    print("Shannon Entropy Examples:")
    print("=" * 55)

    # Pure node
    print(f"Pure [100, 0]: H = {entropy([100, 0]):.4f} bits")

    # Balanced binary
    print(f"Balanced [50, 50]: H = {entropy([50, 50]):.4f} bits")

    # Imbalanced
    print(f"[80, 20]: H = {entropy([80, 20]):.4f} bits")
    print(f"[90, 10]: H = {entropy([90, 10]):.4f} bits")
    print(f"[99, 1]: H = {entropy([99, 1]):.4f} bits")

    # Multi-class
    print(f"\n3-class uniform [33, 33, 34]: H = {entropy([33, 33, 34]):.4f} bits")
    print(f"3-class max = log2(3) = {np.log2(3):.4f} bits")

    print(f"\n4-class uniform [25, 25, 25, 25]: H = {entropy([25, 25, 25, 25]):.4f} bits")
    print(f"4-class max = log2(4) = {np.log2(4):.4f} bits")

    # Binary entropy curve
    print(f"\nBinary entropy H(p) at various p:")
    for p in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]:
        print(f"  H({p:.1f}) = {binary_entropy(p):.4f} bits")
```

Entropy possesses remarkable mathematical properties that make it the ideal measure of uncertainty. Understanding these properties deepens our grasp of why entropy works as a splitting criterion.
Shannon proved that entropy is the unique function—up to a positive scalar multiple, i.e., the choice of logarithm base—satisfying: (1) continuity, (2) monotonicity in the number of equally likely outcomes, and (3) additivity for independent choices. This is why entropy is THE canonical measure of uncertainty.
Concavity and Its Implications:
The strict concavity of entropy is crucial for decision trees. It means:
$$H\left(\frac{n_L}{n} p_L + \frac{n_R}{n} p_R\right) > \frac{n_L}{n} H(p_L) + \frac{n_R}{n} H(p_R)$$
whenever $p_L \neq p_R$.
Interpretation: The entropy of a mixture is strictly greater than the weighted average of component entropies. This guarantees that any split whose children have different class distributions achieves strictly positive information gain, so the tree always has an incentive to separate the classes.
Without strict concavity (like misclassification error), we might find splits that appear to give zero gain even when the data can be better separated.
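As a quick numerical check of this inequality (a sketch with made-up child distributions and weights), the mixture's entropy exceeds the weighted average of the children's entropies, so the split earns strictly positive gain:

```python
import numpy as np

def entropy_from_probs(p: np.ndarray) -> float:
    """Shannon entropy (bits) of a probability vector."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

if __name__ == "__main__":
    # Two children with different class distributions
    p_left = np.array([0.9, 0.1])   # mostly class 0
    p_right = np.array([0.3, 0.7])  # mostly class 1
    w_left, w_right = 0.6, 0.4      # n_L/n and n_R/n

    # The parent distribution is the weighted mixture of the children
    p_parent = w_left * p_left + w_right * p_right

    mixture_H = entropy_from_probs(p_parent)
    avg_child_H = w_left * entropy_from_probs(p_left) + w_right * entropy_from_probs(p_right)

    print(f"H(mixture)             = {mixture_H:.4f} bits")
    print(f"Weighted child entropy = {avg_child_H:.4f} bits")
    print(f"Information gain       = {mixture_H - avg_child_H:.4f} bits (> 0 by strict concavity)")
```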
Information gain measures how much knowing a feature's value reduces uncertainty about the class label. It's the entropy-based analog of Gini gain.
Definition: For a dataset $S$ and a feature $A$ with possible values $\{a_1, a_2, ..., a_V\}$:
$$\text{IG}(S, A) = H(S) - \sum_{v=1}^{V} \frac{|S_v|}{|S|} H(S_v)$$
where $S_v = \{x \in S : A(x) = v\}$ is the subset with feature $A$ equal to $v$.
Components:

- $H(S)$: entropy of the parent node before the split.
- $\sum_{v=1}^{V} \frac{|S_v|}{|S|} H(S_v)$: the weighted average entropy of the children, i.e., the conditional entropy $H(S|A)$.
Interpretation: Information gain equals the mutual information $I(Y; A)$ between the class label $Y$ and feature $A$:
$$\text{IG}(S, A) = I(Y; A) = H(Y) - H(Y|A)$$
This measures how much information the feature provides about the class.
```python
import numpy as np
from typing import List, Tuple

def entropy(class_counts: np.ndarray) -> float:
    """Compute Shannon entropy from class counts."""
    n = class_counts.sum()
    if n == 0:
        return 0.0
    p = class_counts / n
    p = p[p > 0]  # Filter zeros
    return -np.sum(p * np.log2(p))


def conditional_entropy(parent_counts: np.ndarray,
                        child_counts_list: List[np.ndarray]) -> float:
    """
    Compute conditional entropy H(Y|A) after a split.

    H(Y|A) = Σ P(A=v) * H(Y|A=v) = Σ (|S_v|/|S|) * H(S_v)

    This is the expected entropy after learning the split feature.
    """
    n_total = parent_counts.sum()
    weighted_entropy = 0.0
    for child_counts in child_counts_list:
        n_child = child_counts.sum()
        if n_child > 0:
            weight = n_child / n_total
            weighted_entropy += weight * entropy(child_counts)
    return weighted_entropy


def information_gain(parent_counts: np.ndarray,
                     child_counts_list: List[np.ndarray]) -> float:
    """
    Compute information gain for a split.

    IG(S, A) = H(S) - H(S|A)
             = H(parent) - Σ (|S_v|/|S|) * H(S_v)

    This equals the mutual information I(Y; A) between the class label Y
    and feature A.

    Args:
        parent_counts: Class counts in parent node
        child_counts_list: List of class counts for each child

    Returns:
        Information gain in bits (always ≥ 0)
    """
    parent_entropy = entropy(parent_counts)
    cond_entropy = conditional_entropy(parent_counts, child_counts_list)
    return parent_entropy - cond_entropy


def find_best_threshold_ig(X: np.ndarray, y: np.ndarray,
                           feature_idx: int) -> Tuple[float, float]:
    """
    Find best threshold for continuous feature using information gain.

    Returns:
        (best_threshold, best_information_gain)
    """
    n_samples = len(y)
    n_classes = len(np.unique(y))
    feature_values = X[:, feature_idx]

    sorted_indices = np.argsort(feature_values)
    sorted_features = feature_values[sorted_indices]
    sorted_labels = y[sorted_indices]

    parent_counts = np.bincount(y, minlength=n_classes)
    parent_H = entropy(parent_counts)

    best_ig = -np.inf
    best_threshold = None

    left_counts = np.zeros(n_classes, dtype=np.int64)
    right_counts = parent_counts.copy()

    for i in range(n_samples - 1):
        c = sorted_labels[i]
        left_counts[c] += 1
        right_counts[c] -= 1

        if sorted_features[i] == sorted_features[i + 1]:
            continue

        cond_H = conditional_entropy(parent_counts, [left_counts, right_counts])
        ig = parent_H - cond_H

        if ig > best_ig:
            best_ig = ig
            best_threshold = (sorted_features[i] + sorted_features[i + 1]) / 2

    return best_threshold, best_ig


# Demonstration: ID3-style categorical split
if __name__ == "__main__":
    print("Information Gain Examples")
    print("=" * 60)

    # Example: Weather dataset (classic ID3 example)
    # Parent: 9 Yes, 5 No for playing tennis
    parent = np.array([9, 5])
    print(f"\nParent node: {parent} (9 Yes, 5 No)")
    print(f"Parent entropy: H = {entropy(parent):.4f} bits")

    # Split on Outlook: Sunny(2Y,3N), Overcast(4Y,0N), Rain(3Y,2N)
    sunny = np.array([2, 3])
    overcast = np.array([4, 0])
    rain = np.array([3, 2])
    print(f"\nSplit on 'Outlook':")
    print(f"  Sunny {sunny}: H = {entropy(sunny):.4f}")
    print(f"  Overcast {overcast}: H = {entropy(overcast):.4f}")
    print(f"  Rain {rain}: H = {entropy(rain):.4f}")
    ig_outlook = information_gain(parent, [sunny, overcast, rain])
    print(f"  Information Gain = {ig_outlook:.4f} bits")

    # Split on Wind: Weak(6Y,2N), Strong(3Y,3N)
    weak = np.array([6, 2])
    strong = np.array([3, 3])
    print(f"\nSplit on 'Wind':")
    print(f"  Weak {weak}: H = {entropy(weak):.4f}")
    print(f"  Strong {strong}: H = {entropy(strong):.4f}")
    ig_wind = information_gain(parent, [weak, strong])
    print(f"  Information Gain = {ig_wind:.4f} bits")

    # Compare
    print(f"\n→ Choose 'Outlook' (IG={ig_outlook:.4f} > {ig_wind:.4f})")

    # Perfect binary split
    print(f"\n--- Perfect Split ---")
    parent_perfect = np.array([50, 50])
    left_pure = np.array([50, 0])
    right_pure = np.array([0, 50])
    print(f"Parent: {parent_perfect}, H = {entropy(parent_perfect):.4f}")
    ig_perfect = information_gain(parent_perfect, [left_pure, right_pure])
    print(f"Perfect split IG = {ig_perfect:.4f} bits (maximum possible)")
```

Information gain tells you exactly how many bits of information each feature provides about the class label. A perfect binary split has IG = 1 bit (you learn exactly one bit—which of two classes). This makes entropy interpretable in a way Gini isn't.
The most famous algorithms using entropy for splitting are ID3 (Iterative Dichotomiser 3) by Ross Quinlan (1986) and its successor C4.5 (1993). These algorithms shaped the field of decision tree learning.
ID3 Algorithm:

1. Compute the information gain of every available categorical feature on the current set of examples.
2. Split on the feature with the highest information gain, creating one branch per feature value.
3. Recurse on each branch with the remaining features, stopping when a node is pure or no features remain.
4. Label each leaf with the majority class of its examples.
ID3 Limitations:

- Handles only categorical features (no thresholds on continuous values).
- No built-in handling of missing values.
- No pruning, so trees readily overfit.
- Information gain is biased toward features with many distinct values.
ID3's information gain is biased toward features with many values. Consider a 'customer ID' feature—it perfectly separates examples (maximum IG!) but has zero predictive value on new data. Each ID is its own leaf. C4.5 addresses this with gain ratio.
C4.5 Improvements:
C4.5 addressed ID3's limitations with several enhancements:
$$\text{GainRatio}(S, A) = \frac{\text{IG}(S, A)}{\text{SplitInfo}(S, A)}$$
where: $$\text{SplitInfo}(S, A) = -\sum_{v=1}^{V} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}$$
SplitInfo is high for features with many values, penalizing the bias.
- Continuous features: binary splits on thresholds (like CART)
- Missing values: distribute samples fractionally across all branches
- Pruning: error-based post-pruning to prevent overfitting
C5.0 (commercial) further improved efficiency and added boosting.
```python
import numpy as np
from typing import List

def entropy(counts: np.ndarray) -> float:
    """Compute Shannon entropy."""
    n = counts.sum()
    if n == 0:
        return 0.0
    p = counts / n
    p = p[p > 0]
    return -np.sum(p * np.log2(p))


def split_info(child_sizes: List[int], total: int) -> float:
    """
    Compute split information (intrinsic value).

    SplitInfo(S, A) = -Σ (|S_v|/|S|) * log2(|S_v|/|S|)

    This is the entropy of the split itself (not the class distribution).
    High when a feature has many values of similar frequency.
    Used to penalize features with many values in gain ratio.
    """
    if total == 0:
        return 0.0
    split_entropy = 0.0
    for size in child_sizes:
        if size > 0:
            p = size / total
            split_entropy -= p * np.log2(p)
    return split_entropy


def information_gain(parent_counts: np.ndarray,
                     child_counts_list: List[np.ndarray]) -> float:
    """Compute information gain for a split."""
    parent_H = entropy(parent_counts)
    n_total = parent_counts.sum()
    weighted_child_H = 0.0
    for child_counts in child_counts_list:
        n_child = child_counts.sum()
        if n_child > 0:
            weighted_child_H += (n_child / n_total) * entropy(child_counts)
    return parent_H - weighted_child_H


def gain_ratio(parent_counts: np.ndarray,
               child_counts_list: List[np.ndarray]) -> float:
    """
    Compute gain ratio (C4.5 criterion).

    GainRatio(S, A) = IG(S, A) / SplitInfo(S, A)

    Normalizes information gain by the split's intrinsic value,
    reducing bias toward multi-valued features.

    Note: When SplitInfo is very small, gain ratio can be artificially
    high. C4.5 uses a heuristic: only consider features with IG above
    average.
    """
    ig = information_gain(parent_counts, child_counts_list)
    n_total = parent_counts.sum()
    child_sizes = [c.sum() for c in child_counts_list]
    si = split_info(child_sizes, n_total)
    if si == 0:
        return 0.0  # Avoid division by zero
    return ig / si


# Demonstration: Multi-value bias
if __name__ == "__main__":
    print("Gain Ratio vs Information Gain")
    print("=" * 60)

    # Parent: 7 Yes, 7 No
    parent = np.array([7, 7])
    print(f"Parent: {parent}")
    print(f"Parent entropy: {entropy(parent):.4f} bits\n")

    # Feature A: Binary split (good generalization)
    # Splits into (5Y, 2N) and (2Y, 5N)
    child_A = [np.array([5, 2]), np.array([2, 5])]
    ig_A = information_gain(parent, child_A)
    gr_A = gain_ratio(parent, child_A)
    si_A = split_info([7, 7], 14)
    print("Feature A (Binary: Low/High):")
    print(f"  Children: {[c.tolist() for c in child_A]}")
    print(f"  Information Gain: {ig_A:.4f}")
    print(f"  Split Info: {si_A:.4f}")
    print(f"  Gain Ratio: {gr_A:.4f}")

    # Feature B: 7-way split (like customer ID)
    # Each bucket: (1Y, 1N) - useless for generalization!
    child_B = [np.array([1, 1]) for _ in range(7)]
    ig_B = information_gain(parent, child_B)
    gr_B = gain_ratio(parent, child_B)
    si_B = split_info([2] * 7, 14)
    print(f"\nFeature B (7-way: like ID feature):")
    print(f"  Children: 7 × [1, 1]")
    print(f"  Information Gain: {ig_B:.4f}")
    print(f"  Split Info: {si_B:.4f}")
    print(f"  Gain Ratio: {gr_B:.4f}")

    # Feature C: 14-way split (pure ID - worst case)
    # Each sample is its own bucket
    child_C_yes = [np.array([1, 0]) for _ in range(7)]
    child_C_no = [np.array([0, 1]) for _ in range(7)]
    child_C = child_C_yes + child_C_no
    ig_C = information_gain(parent, child_C)
    gr_C = gain_ratio(parent, child_C)
    si_C = split_info([1] * 14, 14)
    print(f"\nFeature C (14-way: unique ID):")
    print(f"  Children: 14 × [single sample]")
    print(f"  Information Gain: {ig_C:.4f} (MAXIMUM - perfect separation!)")
    print(f"  Split Info: {si_C:.4f}")
    print(f"  Gain Ratio: {gr_C:.4f}")

    print(f"\n→ ID3 would choose Feature C (highest IG={ig_C:.4f})")
    print(f"→ SplitInfo shrinks C's advantage: its score falls from IG={ig_C:.4f} to GR={gr_C:.4f}")
    print(f"→ Note that GR_C={gr_C:.4f} still exceeds GR_A={gr_A:.4f}: gain ratio reduces,")
    print(f"  but does not always eliminate, the bias toward many-valued features;")
    print(f"  C4.5 also relies on its above-average-IG heuristic and on pruning.")
```

Now that we understand both Gini impurity and entropy, let's compare them systematically. The comparison reveals both theoretical and practical differences.
| Aspect | Gini Impurity | Entropy |
|---|---|---|
| Formula (K classes) | $1 - \sum p_k^2$ | $-\sum p_k \log_2 p_k$ |
| Binary formula | $2p(1-p)$ | $-p \log_2 p - (1-p) \log_2(1-p)$ |
| Maximum (binary) | 0.5 | 1.0 bit |
| Maximum (K classes) | $1 - 1/K$ | $\log_2 K$ bits |
| Computation | Fast (squares only) | Slower (logarithms) |
| Interpretation | Misclassification probability | Bits of information |
| Curvature at extremes | Moderate | Steeper near 0 and 1 |
| Sensitivity to small p | Less sensitive | More sensitive |
| Used in | CART, scikit-learn default | ID3, C4.5, scikit-learn option (`criterion='entropy'`) |
Mathematical Relationship:
For binary classification, the two measures track each other closely near $p = 0.5$. To second order,
$$1 - H(p) \approx \frac{0.5 - G(p)}{\ln 2}$$
that is, entropy's shortfall from its maximum is roughly $1/\ln 2 \approx 1.44$ times Gini's shortfall from its maximum.
More precisely, the Taylor expansion around $p = 0.5$ shows:
$$H(p) = 1 - \frac{2}{\ln 2}(p - 0.5)^2 - \frac{4}{3 \ln 2}(p - 0.5)^4 - ...$$
$$G(p) = 0.5 - 2(p - 0.5)^2$$
Both are concave, symmetric about $p = 0.5$, and zero at $p = 0, 1$. The key difference is curvature: the slope of entropy, $\log_2 \frac{1-p}{p}$, diverges as $p \to 0$ or $p \to 1$, while the slope of Gini, $2 - 4p$, never exceeds 2 in magnitude.
This makes entropy more sensitive to class imbalance and small probability classes.
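As a small check (a sketch; the helper names are mine), evaluating the exact functions against the quadratic terms of these expansions shows the approximations are tight near $p = 0.5$ and drift apart toward the extremes:

```python
import numpy as np

def H(p):
    """Exact binary entropy in bits."""
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def H_quad(p):
    """Second-order Taylor approximation of H around p = 0.5."""
    return 1 - (2 / np.log(2)) * (p - 0.5) ** 2

def G(p):
    """Binary Gini impurity (exactly quadratic: 0.5 - 2(p-0.5)^2)."""
    return 2 * p * (1 - p)

if __name__ == "__main__":
    print(f"{'p':>5} | {'H(p)':>7} | {'H_quad(p)':>9} | {'G(p)':>7}")
    for p in [0.50, 0.45, 0.40, 0.30, 0.20, 0.10]:
        print(f"{p:>5.2f} | {H(p):>7.4f} | {H_quad(p):>9.4f} | {G(p):>7.4f}")
```

The fuller script below plots all three impurity measures and tabulates their values at key points.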
```python
import numpy as np
import matplotlib.pyplot as plt

def gini_binary(p):
    return 2 * p * (1 - p)

def entropy_binary(p):
    """Binary entropy, handling edge cases."""
    result = np.zeros_like(p)
    mask = (p > 0) & (p < 1)
    result[mask] = -p[mask] * np.log2(p[mask]) - (1 - p[mask]) * np.log2(1 - p[mask])
    return result

def misclass_binary(p):
    return np.minimum(p, 1 - p)

def plot_comparison():
    """
    Compare Gini, Entropy, and Misclassification error for binary case.
    """
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    p = np.linspace(0.001, 0.999, 1000)
    gini = gini_binary(p)
    entropy_vals = entropy_binary(p)
    misclass = misclass_binary(p)

    # Normalize for comparison (scale to max = 1)
    gini_norm = gini / 0.5
    entropy_norm = entropy_vals / 1.0
    misclass_norm = misclass / 0.5

    # Left plot: Raw values
    ax1 = axes[0]
    ax1.plot(p, gini, 'b-', linewidth=2.5, label='Gini (max=0.5)')
    ax1.plot(p, entropy_vals, 'r-', linewidth=2.5, label='Entropy (max=1.0)')
    ax1.plot(p, misclass, 'g--', linewidth=2.5, label='Misclass (max=0.5)')
    ax1.set_xlabel('Proportion of Class 1 (p)', fontsize=12)
    ax1.set_ylabel('Impurity', fontsize=12)
    ax1.set_title('Impurity Measures: Raw Values', fontsize=14)
    ax1.legend(loc='upper right')
    ax1.grid(True, alpha=0.3)
    ax1.set_xlim(0, 1)
    ax1.set_ylim(0, 1.1)

    # Right plot: Normalized (scaled to max = 1)
    ax2 = axes[1]
    ax2.plot(p, gini_norm, 'b-', linewidth=2.5, label='Gini (normalized)')
    ax2.plot(p, entropy_norm, 'r-', linewidth=2.5, label='Entropy (normalized)')
    ax2.plot(p, misclass_norm, 'g--', linewidth=2.5, label='Misclass (normalized)')
    ax2.set_xlabel('Proportion of Class 1 (p)', fontsize=12)
    ax2.set_ylabel('Normalized Impurity', fontsize=12)
    ax2.set_title('Impurity Measures: Normalized to Max=1', fontsize=14)
    ax2.legend(loc='upper right')
    ax2.grid(True, alpha=0.3)
    ax2.set_xlim(0, 1)
    ax2.set_ylim(0, 1.1)

    plt.tight_layout()
    plt.savefig('impurity_comparison.png', dpi=150)
    plt.show()

    # Print numerical comparison at key points
    print("\nNumerical Comparison at Key Points:")
    print("=" * 60)
    print(f"{'p':>6} | {'Gini':>10} | {'Entropy':>10} | {'Misclass':>10}")
    print("-" * 60)
    for p_val in [0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5]:
        g = gini_binary(p_val)
        h = entropy_binary(np.array([p_val]))[0]
        m = misclass_binary(np.array([p_val]))[0]
        print(f"{p_val:>6.2f} | {g:>10.4f} | {h:>10.4f} | {m:>10.4f}")

def sensitivity_analysis():
    """
    Show how entropy is more sensitive to small probabilities.
    """
    print("\n\nSensitivity to Small Probabilities:")
    print("=" * 60)

    print("When p = 0.01 (rare class):")
    p = 0.01
    g = gini_binary(p)
    h = entropy_binary(np.array([p]))[0]
    print(f"  Gini = {g:.6f}")
    print(f"  Entropy = {h:.6f}")
    print(f"  Ratio (Entropy/Gini) = {h/g:.2f}")

    print("\nWhen p = 0.001 (very rare class):")
    p = 0.001
    g = gini_binary(p)
    h = entropy_binary(np.array([p]))[0]
    print(f"  Gini = {g:.6f}")
    print(f"  Entropy = {h:.6f}")
    print(f"  Ratio (Entropy/Gini) = {h/g:.2f}")

    print("\n→ Entropy gives more 'credit' for separating rare classes")

if __name__ == "__main__":
    plot_comparison()
    sensitivity_analysis()
```

Empirical studies show Gini and entropy produce identical tree structures in ~98% of cases. The remaining 2% occur when: (1) class imbalance is extreme, (2) there are many classes, or (3) splits have very similar quality. In practice, computational efficiency often favors Gini.
Entropy in decision trees connects deeply to cross-entropy loss, the standard loss function for classification in neural networks and other probabilistic models.
Cross-Entropy Reviewed:
For a true distribution $P$ and predicted distribution $Q$:
$$H(P, Q) = -\sum_k P(k) \log Q(k)$$
In classification with one-hot labels $y$ and predicted probabilities $\hat{p}$:
$$\mathcal{L}_{CE} = -\sum_{k=1}^{K} y_k \log \hat{p}_k = -\log \hat{p}_{\text{true class}}$$
The Connection:
When a decision tree leaf predicts class probabilities equal to the empirical class distribution in that leaf, the total cross-entropy loss incurred in the leaf is exactly $n_{\text{leaf}} \cdot H_{\text{leaf}}$ (derived below).
This means tree-building with entropy is implicitly performing maximum likelihood estimation for a piecewise-constant probability model!
Modern gradient boosting libraries (XGBoost, LightGBM) can use cross-entropy/log-loss directly rather than entropy. This unifies tree-based methods with neural network training under the same loss function, enabling gradient-based optimization of tree splits.
Mathematical Derivation:
Let a leaf contain $n$ samples with class distribution $(n_1, ..., n_K)$. The leaf predicts $\hat{p}_k = n_k/n$.
The total cross-entropy loss in this leaf:
$$\sum_{i=1}^{n} -\log \hat{p}_{y_i} = -\sum_{k=1}^{K} n_k \log \frac{n_k}{n} = n \cdot \left( -\sum_{k=1}^{K} \frac{n_k}{n} \log \frac{n_k}{n} \right) = n \cdot H$$
Total loss across tree = sum over all leaves of $n_{\text{leaf}} \cdot H_{\text{leaf}}$.
Minimizing this is equivalent to minimizing weighted average entropy—exactly what information gain does!
This proves that greedy information gain maximization is greedy cross-entropy minimization, providing a strong theoretical justification for entropy-based splitting.
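This equivalence is easy to verify numerically. The sketch below uses made-up leaf counts (not from any dataset on this page), lets each leaf predict its empirical class proportions, and confirms that the summed log-loss equals $\sum_{\text{leaves}} n_{\text{leaf}} \cdot H_{\text{leaf}}$:

```python
import numpy as np

def entropy_bits(counts: np.ndarray) -> float:
    """Shannon entropy (bits) of a count vector."""
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

if __name__ == "__main__":
    # Hypothetical leaves of a fitted tree: class counts per leaf
    leaves = [np.array([40, 10]), np.array([5, 25]), np.array([15, 15])]

    total_weighted_entropy = 0.0
    total_log_loss = 0.0
    for counts in leaves:
        n = counts.sum()
        p_hat = counts / n  # the leaf's predicted class probabilities
        total_weighted_entropy += n * entropy_bits(counts)
        # Cross-entropy loss (in bits) summed over the n samples in this leaf:
        # each of the counts[k] samples of class k contributes -log2(p_hat[k])
        mask = counts > 0
        total_log_loss += -np.sum(counts[mask] * np.log2(p_hat[mask]))

    print(f"Sum of n_leaf * H_leaf : {total_weighted_entropy:.4f} bits")
    print(f"Total cross-entropy    : {total_log_loss:.4f} bits")
```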
When should you use entropy over Gini? Here's practical guidance based on theoretical properties and empirical experience:
Start with Gini (the default in most libraries). Switch to entropy only if you have a specific reason: extreme class imbalance, many classes, or need to match cross-entropy loss elsewhere in your pipeline. Always validate with cross-validation—the empirical performance difference is usually insignificant.
We've explored entropy from its information-theoretic roots to its practical application in decision trees. Let's consolidate the key insights:

- Entropy is the expected self-information of the class label: the average number of bits needed to encode it optimally.
- Information gain is the reduction in entropy achieved by a split and equals the mutual information between the feature and the class.
- Strict concavity guarantees that any split separating the class distributions yields strictly positive gain.
- Gain ratio (C4.5) normalizes information gain by split information to curb the bias toward many-valued features.
- Entropy and Gini usually produce very similar trees; entropy is more sensitive to rare classes, while Gini is cheaper to compute.
- Minimizing weighted child entropy is equivalent to greedily minimizing cross-entropy loss, connecting trees to the broader machine learning landscape.
What's Next:
We've now covered the two most important splitting criteria in depth. The next page explores Misclassification Error—a simpler criterion that, despite its intuitive appeal, has subtle problems that make it less suitable for tree construction.
You now understand entropy and information gain from both theoretical and practical perspectives. The information-theoretic foundation provides deep insights into why trees work and connects to the broader machine learning landscape through cross-entropy loss.