In 1948, Claude Shannon published *A Mathematical Theory of Communication*, a paper that would revolutionize our understanding of information. Shannon asked: How do we quantify information? His answer—entropy—became the foundation of information theory and eventually found its way into machine learning.
When we use entropy as a splitting criterion in decision trees, we're applying Shannon's profound insight: the 'disorder' or 'uncertainty' in a node can be measured as the average number of bits needed to encode the class labels. A pure node requires no bits (we already know the answer); a maximally impure node requires maximum bits (maximum uncertainty).
This page explores entropy through the lens of information theory, revealing why it's not just a valid impurity measure, but the canonical measure from a communication perspective.
By the end of this page, you will understand entropy from its information-theoretic roots, master the mathematics of information gain, see how it connects to decision tree algorithms like ID3 and C4.5, and appreciate the subtle differences between entropy and Gini impurity.
Before defining entropy, we need to understand self-information—the fundamental unit that entropy aggregates.
The Key Insight: Rare events carry more information than common events.
Think about news headlines: 'The sun rose this morning' carries essentially no information because it was certain to happen, while 'Magnitude 8 earthquake strikes the capital' is front-page news precisely because it is so improbable.
Shannon formalized this with the self-information or surprisal of an event:
$$I(x) = -\log_2 P(x) = \log_2 \frac{1}{P(x)}$$
Properties of Self-Information:

- $I(x) \geq 0$: information is never negative.
- $I(x) = 0$ when $P(x) = 1$: certain events carry no information.
- $I(x)$ increases as $P(x)$ decreases: rarer events are more surprising.
- Additivity: for independent events, $I(x, y) = I(x) + I(y)$.
The choice of logarithm base determines the unit: base 2 gives bits, base $e$ gives nats, and base 10 gives hartleys. We use base 2 (bits) throughout this page.
| Probability P(x) | Self-Information I(x) in bits | Intuition |
|---|---|---|
| 1.0 | 0.0 | Certain event, no surprise |
| 0.5 | 1.0 | One bit of information (coin flip) |
| 0.25 | 2.0 | Two bits (like two coin flips) |
| 0.125 | 3.0 | Three bits |
| 0.01 | 6.64 | Rare event, high surprise |
| 0.001 | 9.97 | Very rare, ~10 bits |
The logarithm is the unique function satisfying additivity for independent events. If you see two independent events with probabilities p and q, the total information is I(p) + I(q) = -log(p) - log(q) = -log(pq), which equals the information of the joint event. No other function family has this property.
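To make this concrete, here is a minimal sketch (the `self_information` helper is my own, not from any library) that evaluates $I(x)$ for the probabilities in the table above and checks additivity for a pair of independent events:

```python
import numpy as np

def self_information(p: float) -> float:
    """Self-information I(x) = -log2(P(x)), in bits."""
    return np.log2(1.0 / p)

if __name__ == "__main__":
    # Reproduce the table above
    for p in [1.0, 0.5, 0.25, 0.125, 0.01, 0.001]:
        print(f"P(x) = {p:<6} -> I(x) = {self_information(p):.2f} bits")

    # Additivity for independent events: I(p) + I(q) = I(p * q)
    p, q = 0.5, 0.25
    print(f"I(0.5) + I(0.25) = {self_information(p) + self_information(q):.2f} bits")
    print(f"I(0.125)         = {self_information(p * q):.2f} bits")
```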
Entropy is the expected self-information of a random variable. For a discrete random variable $X$ with possible values $\{x_1, x_2, ..., x_K\}$ and probability distribution $P$:
$$H(X) = \mathbb{E}[I(X)] = -\sum_{k=1}^{K} P(x_k) \log_2 P(x_k) = -\sum_{k=1}^{K} p_k \log_2 p_k$$
Interpretation: Entropy measures the average number of bits needed to encode the outcome of $X$ using an optimal coding scheme.
For Decision Trees: In a node with $K$ classes and class proportions $p_1, p_2, ..., p_K$:
$$H = -\sum_{k=1}^{K} p_k \log_2 p_k$$
where we define $0 \cdot \log 0 = 0$ (by continuity, as $\lim_{p \to 0^+} p \log p = 0$).
Binary Entropy Function:
For binary classification with $p$ and $(1-p)$:
$$H(p) = -p \log_2 p - (1-p) \log_2 (1-p)$$
This is the famous binary entropy function, fundamental to coding theory and machine learning alike.
```python
import numpy as np
from typing import Union

def entropy(class_counts: Union[list, np.ndarray], base: float = 2) -> float:
    """
    Compute Shannon entropy from class counts.

    Shannon entropy measures the average information content or 'surprise'
    in the distribution. It's the expected number of bits needed to encode
    the class label optimally.

    Args:
        class_counts: Array or list of counts for each class
        base: Logarithm base (2 for bits, e for nats)

    Returns:
        Entropy value in range [0, log_base(K)]

    Mathematical Definition:
        H = -Σ(p_k * log(p_k))

    Properties:
        - H = 0 for pure nodes (all samples same class)
        - H = log(K) for uniform distribution over K classes
        - For binary: maximum is 1.0 bit when p = 0.5

    Convention: 0 * log(0) = 0 (by L'Hôpital's rule / continuity)
    """
    counts = np.array(class_counts, dtype=np.float64)
    n = counts.sum()
    if n == 0:
        return 0.0  # Empty node

    # Compute probabilities
    p = counts / n

    # Filter out zeros to avoid log(0)
    # Note: 0 * log(0) = 0 by convention
    p_nonzero = p[p > 0]

    # Entropy = -Σ(p * log(p))
    if base == 2:
        return -np.sum(p_nonzero * np.log2(p_nonzero))
    elif base == np.e:
        return -np.sum(p_nonzero * np.log(p_nonzero))
    else:
        return -np.sum(p_nonzero * np.log(p_nonzero)) / np.log(base)


def binary_entropy(p: float) -> float:
    """
    Compute binary entropy function H(p).

    H(p) = -p*log2(p) - (1-p)*log2(1-p)

    This is the entropy of a Bernoulli random variable.

    Args:
        p: Probability of positive class (0 ≤ p ≤ 1)

    Returns:
        Entropy in bits, range [0, 1]

    Special cases:
        H(0) = H(1) = 0 (pure)
        H(0.5) = 1.0 (maximum uncertainty)
    """
    if p == 0 or p == 1:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)


def cross_entropy(p: np.ndarray, q: np.ndarray) -> float:
    """
    Compute cross-entropy H(P, Q) between two distributions.

    H(P, Q) = -Σ p_k * log(q_k)

    Cross-entropy measures the average number of bits needed to encode
    samples from P using a code optimized for Q.

    Note: H(P, Q) ≥ H(P), with equality iff P = Q.
    The difference H(P, Q) - H(P) is the KL divergence D_KL(P||Q).
    """
    p = np.array(p, dtype=np.float64)
    q = np.array(q, dtype=np.float64)
    # Avoid log(0) where p > 0
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))


# Demonstration
if __name__ == "__main__":
    print("Shannon Entropy Examples:")
    print("=" * 55)

    # Pure node
    print(f"Pure [100, 0]: H = {entropy([100, 0]):.4f} bits")

    # Balanced binary
    print(f"Balanced [50, 50]: H = {entropy([50, 50]):.4f} bits")

    # Imbalanced
    print(f"[80, 20]: H = {entropy([80, 20]):.4f} bits")
    print(f"[90, 10]: H = {entropy([90, 10]):.4f} bits")
    print(f"[99, 1]: H = {entropy([99, 1]):.4f} bits")

    # Multi-class
    print(f"\n3-class uniform [33, 33, 34]: H = {entropy([33, 33, 34]):.4f} bits")
    print(f"3-class max = log2(3) = {np.log2(3):.4f} bits")

    print(f"\n4-class uniform [25, 25, 25, 25]: H = {entropy([25, 25, 25, 25]):.4f} bits")
    print(f"4-class max = log2(4) = {np.log2(4):.4f} bits")

    # Binary entropy curve
    print(f"\nBinary entropy H(p) at various p:")
    for p in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]:
        print(f"  H({p:.1f}) = {binary_entropy(p):.4f} bits")
```

Entropy possesses remarkable mathematical properties that make it the ideal measure of uncertainty. Understanding these properties deepens our grasp of why entropy works as a splitting criterion.
Shannon proved that entropy is the unique function—up to a positive scalar multiple, i.e., the choice of logarithm base—satisfying: (1) continuity, (2) monotonicity in the number of equally likely outcomes, and (3) additivity for independent choices. This is why entropy is THE canonical measure of uncertainty.
Concavity and Its Implications:
The strict concavity of entropy is crucial for decision trees. It means:
$$H\left(\frac{n_L}{n} p_L + \frac{n_R}{n} p_R\right) > \frac{n_L}{n} H(p_L) + \frac{n_R}{n} H(p_R)$$
whenever $p_L \neq p_R$.
Interpretation: The entropy of a mixture is strictly greater than the weighted average of component entropies. This guarantees that any split whose children have different class distributions achieves strictly positive information gain, so the tree always has an incentive to separate the classes.
Without strict concavity (like misclassification error), we might find splits that appear to give zero gain even when the data can be better separated.
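As a quick numerical check of this inequality (a sketch with made-up child distributions and weights), the mixture's entropy exceeds the weighted average of the children's entropies, so the split earns strictly positive gain:

```python
import numpy as np

def entropy_from_probs(p: np.ndarray) -> float:
    """Shannon entropy (bits) of a probability vector."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

if __name__ == "__main__":
    # Two children with different class distributions
    p_left = np.array([0.9, 0.1])   # mostly class 0
    p_right = np.array([0.3, 0.7])  # mostly class 1
    w_left, w_right = 0.6, 0.4      # n_L/n and n_R/n

    # The parent distribution is the weighted mixture of the children
    p_parent = w_left * p_left + w_right * p_right

    mixture_H = entropy_from_probs(p_parent)
    avg_child_H = w_left * entropy_from_probs(p_left) + w_right * entropy_from_probs(p_right)

    print(f"H(mixture)             = {mixture_H:.4f} bits")
    print(f"Weighted child entropy = {avg_child_H:.4f} bits")
    print(f"Information gain       = {mixture_H - avg_child_H:.4f} bits (> 0 by strict concavity)")
```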
Information gain measures how much knowing a feature's value reduces uncertainty about the class label. It's the entropy-based analog of Gini gain.
Definition: For a dataset $S$ and a feature $A$ with possible values $\{a_1, a_2, ..., a_V\}$:
$$\text{IG}(S, A) = H(S) - \sum_{v=1}^{V} \frac{|S_v|}{|S|} H(S_v)$$
where $S_v = \{x \in S : A(x) = v\}$ is the subset with feature $A$ equal to $v$.
Components:

- $H(S)$: entropy of the parent node before the split.
- $\sum_{v=1}^{V} \frac{|S_v|}{|S|} H(S_v)$: the weighted average entropy of the children, i.e., the conditional entropy $H(S|A)$.
Interpretation: Information gain equals the mutual information $I(Y; A)$ between the class label $Y$ and feature $A$:
$$\text{IG}(S, A) = I(Y; A) = H(Y) - H(Y|A)$$
This measures how much information the feature provides about the class.
```python
import numpy as np
from typing import List, Tuple

def entropy(class_counts: np.ndarray) -> float:
    """Compute Shannon entropy from class counts."""
    n = class_counts.sum()
    if n == 0:
        return 0.0
    p = class_counts / n
    p = p[p > 0]  # Filter zeros
    return -np.sum(p * np.log2(p))


def conditional_entropy(parent_counts: np.ndarray,
                        child_counts_list: List[np.ndarray]) -> float:
    """
    Compute conditional entropy H(Y|A) after a split.

    H(Y|A) = Σ P(A=v) * H(Y|A=v) = Σ (|S_v|/|S|) * H(S_v)

    This is the expected entropy after learning the split feature.
    """
    n_total = parent_counts.sum()
    weighted_entropy = 0.0
    for child_counts in child_counts_list:
        n_child = child_counts.sum()
        if n_child > 0:
            weight = n_child / n_total
            weighted_entropy += weight * entropy(child_counts)
    return weighted_entropy


def information_gain(parent_counts: np.ndarray,
                     child_counts_list: List[np.ndarray]) -> float:
    """
    Compute information gain for a split.

    IG(S, A) = H(S) - H(S|A)
             = H(parent) - Σ (|S_v|/|S|) * H(S_v)

    This equals the mutual information I(Y; A) between the class label Y
    and feature A.

    Args:
        parent_counts: Class counts in parent node
        child_counts_list: List of class counts for each child

    Returns:
        Information gain in bits (always ≥ 0)
    """
    parent_entropy = entropy(parent_counts)
    cond_entropy = conditional_entropy(parent_counts, child_counts_list)
    return parent_entropy - cond_entropy


def find_best_threshold_ig(X: np.ndarray, y: np.ndarray,
                           feature_idx: int) -> Tuple[float, float]:
    """
    Find best threshold for continuous feature using information gain.

    Returns:
        (best_threshold, best_information_gain)
    """
    n_samples = len(y)
    n_classes = len(np.unique(y))
    feature_values = X[:, feature_idx]

    sorted_indices = np.argsort(feature_values)
    sorted_features = feature_values[sorted_indices]
    sorted_labels = y[sorted_indices]

    parent_counts = np.bincount(y, minlength=n_classes)
    parent_H = entropy(parent_counts)

    best_ig = -np.inf
    best_threshold = None

    left_counts = np.zeros(n_classes, dtype=np.int64)
    right_counts = parent_counts.copy()

    for i in range(n_samples - 1):
        c = sorted_labels[i]
        left_counts[c] += 1
        right_counts[c] -= 1

        if sorted_features[i] == sorted_features[i + 1]:
            continue

        cond_H = conditional_entropy(parent_counts, [left_counts, right_counts])
        ig = parent_H - cond_H

        if ig > best_ig:
            best_ig = ig
            best_threshold = (sorted_features[i] + sorted_features[i + 1]) / 2

    return best_threshold, best_ig


# Demonstration: ID3-style categorical split
if __name__ == "__main__":
    print("Information Gain Examples")
    print("=" * 60)

    # Example: Weather dataset (classic ID3 example)
    # Parent: 9 Yes, 5 No for playing tennis
    parent = np.array([9, 5])
    print(f"\nParent node: {parent} (9 Yes, 5 No)")
    print(f"Parent entropy: H = {entropy(parent):.4f} bits")

    # Split on Outlook: Sunny(2Y,3N), Overcast(4Y,0N), Rain(3Y,2N)
    sunny = np.array([2, 3])
    overcast = np.array([4, 0])
    rain = np.array([3, 2])
    print(f"\nSplit on 'Outlook':")
    print(f"  Sunny {sunny}: H = {entropy(sunny):.4f}")
    print(f"  Overcast {overcast}: H = {entropy(overcast):.4f}")
    print(f"  Rain {rain}: H = {entropy(rain):.4f}")
    ig_outlook = information_gain(parent, [sunny, overcast, rain])
    print(f"  Information Gain = {ig_outlook:.4f} bits")

    # Split on Wind: Weak(6Y,2N), Strong(3Y,3N)
    weak = np.array([6, 2])
    strong = np.array([3, 3])
    print(f"\nSplit on 'Wind':")
    print(f"  Weak {weak}: H = {entropy(weak):.4f}")
    print(f"  Strong {strong}: H = {entropy(strong):.4f}")
    ig_wind = information_gain(parent, [weak, strong])
    print(f"  Information Gain = {ig_wind:.4f} bits")

    # Compare
    print(f"\n→ Choose 'Outlook' (IG={ig_outlook:.4f} > {ig_wind:.4f})")

    # Perfect binary split
    print(f"\n--- Perfect Split ---")
    parent_perfect = np.array([50, 50])
    left_pure = np.array([50, 0])
    right_pure = np.array([0, 50])
    print(f"Parent: {parent_perfect}, H = {entropy(parent_perfect):.4f}")
    ig_perfect = information_gain(parent_perfect, [left_pure, right_pure])
    print(f"Perfect split IG = {ig_perfect:.4f} bits (maximum possible)")
```

Information gain tells you exactly how many bits of information each feature provides about the class label. A perfect binary split has IG = 1 bit (you learn exactly one bit—which of two classes). This makes entropy interpretable in a way Gini isn't.
The most famous algorithms using entropy for splitting are ID3 (Iterative Dichotomiser 3) by Ross Quinlan (1986) and its successor C4.5 (1993). These algorithms shaped the field of decision tree learning.
ID3 Algorithm:

1. Compute the information gain of every available categorical feature on the current set of examples.
2. Split on the feature with the highest information gain, creating one branch per feature value.
3. Recurse on each branch with the remaining features, stopping when a node is pure or no features remain.
4. Label each leaf with the majority class of its examples.
ID3 Limitations:

- Handles only categorical features (no thresholds on continuous values).
- No built-in handling of missing values.
- No pruning, so trees readily overfit.
- Information gain is biased toward features with many distinct values.
ID3's information gain is biased toward features with many values. Consider a 'customer ID' feature—it perfectly separates examples (maximum IG!) but has zero predictive value on new data. Each ID is its own leaf. C4.5 addresses this with gain ratio.
C4.5 Improvements:
C4.5 addressed ID3's limitations with several enhancements:
$$\text{GainRatio}(S, A) = \frac{\text{IG}(S, A)}{\text{SplitInfo}(S, A)}$$
where: $$\text{SplitInfo}(S, A) = -\sum_{v=1}^{V} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}$$
SplitInfo is high for features with many values, penalizing the bias.
- Continuous features: binary splits on thresholds (like CART)
- Missing values: distribute samples fractionally across all branches
- Pruning: error-based post-pruning to prevent overfitting
C5.0 (commercial) further improved efficiency and added boosting.
```python
import numpy as np
from typing import List

def entropy(counts: np.ndarray) -> float:
    """Compute Shannon entropy."""
    n = counts.sum()
    if n == 0:
        return 0.0
    p = counts / n
    p = p[p > 0]
    return -np.sum(p * np.log2(p))


def split_info(child_sizes: List[int], total: int) -> float:
    """
    Compute split information (intrinsic value).

    SplitInfo(S, A) = -Σ (|S_v|/|S|) * log2(|S_v|/|S|)

    This is the entropy of the split itself (not the class distribution).
    High when a feature has many values of similar frequency.
    Used to penalize features with many values in gain ratio.
    """
    if total == 0:
        return 0.0
    split_entropy = 0.0
    for size in child_sizes:
        if size > 0:
            p = size / total
            split_entropy -= p * np.log2(p)
    return split_entropy


def information_gain(parent_counts: np.ndarray,
                     child_counts_list: List[np.ndarray]) -> float:
    """Compute information gain for a split."""
    parent_H = entropy(parent_counts)
    n_total = parent_counts.sum()
    weighted_child_H = 0.0
    for child_counts in child_counts_list:
        n_child = child_counts.sum()
        if n_child > 0:
            weighted_child_H += (n_child / n_total) * entropy(child_counts)
    return parent_H - weighted_child_H


def gain_ratio(parent_counts: np.ndarray,
               child_counts_list: List[np.ndarray]) -> float:
    """
    Compute gain ratio (C4.5 criterion).

    GainRatio(S, A) = IG(S, A) / SplitInfo(S, A)

    Normalizes information gain by the split's intrinsic value,
    reducing bias toward multi-valued features.

    Note: When SplitInfo is very small, gain ratio can be artificially
    high. C4.5 uses a heuristic: only consider features with IG above
    average.
    """
    ig = information_gain(parent_counts, child_counts_list)
    n_total = parent_counts.sum()
    child_sizes = [c.sum() for c in child_counts_list]
    si = split_info(child_sizes, n_total)
    if si == 0:
        return 0.0  # Avoid division by zero
    return ig / si


# Demonstration: Multi-value bias
if __name__ == "__main__":
    print("Gain Ratio vs Information Gain")
    print("=" * 60)

    # Parent: 7 Yes, 7 No
    parent = np.array([7, 7])
    print(f"Parent: {parent}")
    print(f"Parent entropy: {entropy(parent):.4f} bits\n")

    # Feature A: Binary split (good generalization)
    # Splits into (5Y, 2N) and (2Y, 5N)
    child_A = [np.array([5, 2]), np.array([2, 5])]
    ig_A = information_gain(parent, child_A)
    gr_A = gain_ratio(parent, child_A)
    si_A = split_info([7, 7], 14)
    print("Feature A (Binary: Low/High):")
    print(f"  Children: {[c.tolist() for c in child_A]}")
    print(f"  Information Gain: {ig_A:.4f}")
    print(f"  Split Info: {si_A:.4f}")
    print(f"  Gain Ratio: {gr_A:.4f}")

    # Feature B: 7-way split (like customer ID)
    # Each bucket: (1Y, 1N) - useless for generalization!
    child_B = [np.array([1, 1]) for _ in range(7)]
    ig_B = information_gain(parent, child_B)
    gr_B = gain_ratio(parent, child_B)
    si_B = split_info([2] * 7, 14)
    print(f"\nFeature B (7-way: like ID feature):")
    print(f"  Children: 7 × [1, 1]")
    print(f"  Information Gain: {ig_B:.4f}")
    print(f"  Split Info: {si_B:.4f}")
    print(f"  Gain Ratio: {gr_B:.4f}")

    # Feature C: 14-way split (pure ID - worst case)
    # Each sample is its own bucket
    child_C_yes = [np.array([1, 0]) for _ in range(7)]
    child_C_no = [np.array([0, 1]) for _ in range(7)]
    child_C = child_C_yes + child_C_no
    ig_C = information_gain(parent, child_C)
    gr_C = gain_ratio(parent, child_C)
    si_C = split_info([1] * 14, 14)
    print(f"\nFeature C (14-way: unique ID):")
    print(f"  Children: 14 × [single sample]")
    print(f"  Information Gain: {ig_C:.4f} (MAXIMUM - perfect separation!)")
    print(f"  Split Info: {si_C:.4f}")
    print(f"  Gain Ratio: {gr_C:.4f}")

    print(f"\n→ ID3 would choose Feature C (highest IG={ig_C:.4f})")
    print(f"→ SplitInfo shrinks C's advantage: its score falls from IG={ig_C:.4f} to GR={gr_C:.4f}")
    print(f"→ Note that GR_C={gr_C:.4f} still exceeds GR_A={gr_A:.4f}: gain ratio reduces,")
    print(f"  but does not always eliminate, the bias toward many-valued features;")
    print(f"  C4.5 also relies on its above-average-IG heuristic and on pruning.")
```

Now that we understand both Gini impurity and entropy, let's compare them systematically. The comparison reveals both theoretical and practical differences.
| Aspect | Gini Impurity | Entropy |
|---|---|---|
| Formula (K classes) | $1 - \sum p_k^2$ | $-\sum p_k \log_2 p_k$ |
| Binary formula | $2p(1-p)$ | $-p \log_2 p - (1-p) \log_2(1-p)$ |
| Maximum (binary) | 0.5 | 1.0 bit |
| Maximum (K classes) | $1 - 1/K$ | $\log_2 K$ bits |
| Computation | Fast (squares only) | Slower (logarithms) |
| Interpretation | Misclassification probability | Bits of information |
| Curvature at extremes | Moderate | Steeper near 0 and 1 |
| Sensitivity to small p | Less sensitive | More sensitive |
| Used in | CART, scikit-learn default | ID3, C4.5, scikit-learn option (`criterion='entropy'`) |
Mathematical Relationship:
For binary classification, the two measures track each other closely near $p = 0.5$. To second order,
$$1 - H(p) \approx \frac{0.5 - G(p)}{\ln 2}$$
that is, entropy's shortfall from its maximum is roughly $1/\ln 2 \approx 1.44$ times Gini's shortfall from its maximum.
More precisely, the Taylor expansion around $p = 0.5$ shows:
$$H(p) = 1 - \frac{2}{\ln 2}(p - 0.5)^2 - \frac{4}{3 \ln 2}(p - 0.5)^4 - ...$$
$$G(p) = 0.5 - 2(p - 0.5)^2$$
Both are concave, symmetric about $p = 0.5$, and zero at $p = 0, 1$. The key difference is curvature: the slope of entropy, $\log_2 \frac{1-p}{p}$, diverges as $p \to 0$ or $p \to 1$, while the slope of Gini, $2 - 4p$, never exceeds 2 in magnitude.
This makes entropy more sensitive to class imbalance and small probability classes.
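As a small check (a sketch; the helper names are mine), evaluating the exact functions against the quadratic terms of these expansions shows the approximations are tight near $p = 0.5$ and drift apart toward the extremes:

```python
import numpy as np

def H(p):
    """Exact binary entropy in bits."""
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def H_quad(p):
    """Second-order Taylor approximation of H around p = 0.5."""
    return 1 - (2 / np.log(2)) * (p - 0.5) ** 2

def G(p):
    """Binary Gini impurity (exactly quadratic: 0.5 - 2(p-0.5)^2)."""
    return 2 * p * (1 - p)

if __name__ == "__main__":
    print(f"{'p':>5} | {'H(p)':>7} | {'H_quad(p)':>9} | {'G(p)':>7}")
    for p in [0.50, 0.45, 0.40, 0.30, 0.20, 0.10]:
        print(f"{p:>5.2f} | {H(p):>7.4f} | {H_quad(p):>9.4f} | {G(p):>7.4f}")
```

The fuller script below plots all three impurity measures and tabulates their values at key points.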
```python
import numpy as np
import matplotlib.pyplot as plt

def gini_binary(p):
    return 2 * p * (1 - p)

def entropy_binary(p):
    """Binary entropy, handling edge cases."""
    result = np.zeros_like(p)
    mask = (p > 0) & (p < 1)
    result[mask] = -p[mask] * np.log2(p[mask]) - (1 - p[mask]) * np.log2(1 - p[mask])
    return result

def misclass_binary(p):
    return np.minimum(p, 1 - p)

def plot_comparison():
    """
    Compare Gini, Entropy, and Misclassification error for binary case.
    """
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    p = np.linspace(0.001, 0.999, 1000)
    gini = gini_binary(p)
    entropy_vals = entropy_binary(p)
    misclass = misclass_binary(p)

    # Normalize for comparison (scale to max = 1)
    gini_norm = gini / 0.5
    entropy_norm = entropy_vals / 1.0
    misclass_norm = misclass / 0.5

    # Left plot: Raw values
    ax1 = axes[0]
    ax1.plot(p, gini, 'b-', linewidth=2.5, label='Gini (max=0.5)')
    ax1.plot(p, entropy_vals, 'r-', linewidth=2.5, label='Entropy (max=1.0)')
    ax1.plot(p, misclass, 'g--', linewidth=2.5, label='Misclass (max=0.5)')
    ax1.set_xlabel('Proportion of Class 1 (p)', fontsize=12)
    ax1.set_ylabel('Impurity', fontsize=12)
    ax1.set_title('Impurity Measures: Raw Values', fontsize=14)
    ax1.legend(loc='upper right')
    ax1.grid(True, alpha=0.3)
    ax1.set_xlim(0, 1)
    ax1.set_ylim(0, 1.1)

    # Right plot: Normalized (scaled to max = 1)
    ax2 = axes[1]
    ax2.plot(p, gini_norm, 'b-', linewidth=2.5, label='Gini (normalized)')
    ax2.plot(p, entropy_norm, 'r-', linewidth=2.5, label='Entropy (normalized)')
    ax2.plot(p, misclass_norm, 'g--', linewidth=2.5, label='Misclass (normalized)')
    ax2.set_xlabel('Proportion of Class 1 (p)', fontsize=12)
    ax2.set_ylabel('Normalized Impurity', fontsize=12)
    ax2.set_title('Impurity Measures: Normalized to Max=1', fontsize=14)
    ax2.legend(loc='upper right')
    ax2.grid(True, alpha=0.3)
    ax2.set_xlim(0, 1)
    ax2.set_ylim(0, 1.1)

    plt.tight_layout()
    plt.savefig('impurity_comparison.png', dpi=150)
    plt.show()

    # Print numerical comparison at key points
    print("\nNumerical Comparison at Key Points:")
    print("=" * 60)
    print(f"{'p':>6} | {'Gini':>10} | {'Entropy':>10} | {'Misclass':>10}")
    print("-" * 60)
    for p_val in [0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5]:
        g = gini_binary(p_val)
        h = entropy_binary(np.array([p_val]))[0]
        m = misclass_binary(np.array([p_val]))[0]
        print(f"{p_val:>6.2f} | {g:>10.4f} | {h:>10.4f} | {m:>10.4f}")

def sensitivity_analysis():
    """
    Show how entropy is more sensitive to small probabilities.
    """
    print("\n\nSensitivity to Small Probabilities:")
    print("=" * 60)

    print("When p = 0.01 (rare class):")
    p = 0.01
    g = gini_binary(p)
    h = entropy_binary(np.array([p]))[0]
    print(f"  Gini = {g:.6f}")
    print(f"  Entropy = {h:.6f}")
    print(f"  Ratio (Entropy/Gini) = {h/g:.2f}")

    print("\nWhen p = 0.001 (very rare class):")
    p = 0.001
    g = gini_binary(p)
    h = entropy_binary(np.array([p]))[0]
    print(f"  Gini = {g:.6f}")
    print(f"  Entropy = {h:.6f}")
    print(f"  Ratio (Entropy/Gini) = {h/g:.2f}")

    print("\n→ Entropy gives more 'credit' for separating rare classes")

if __name__ == "__main__":
    plot_comparison()
    sensitivity_analysis()
```

Empirical studies show Gini and entropy produce identical tree structures in ~98% of cases. The remaining 2% occur when: (1) class imbalance is extreme, (2) there are many classes, or (3) splits have very similar quality. In practice, computational efficiency often favors Gini.
Entropy in decision trees connects deeply to cross-entropy loss, the standard loss function for classification in neural networks and other probabilistic models.
Cross-Entropy Reviewed:
For a true distribution $P$ and predicted distribution $Q$:
$$H(P, Q) = -\sum_k P(k) \log Q(k)$$
In classification with one-hot labels $y$ and predicted probabilities $\hat{p}$:
$$\mathcal{L}_{CE} = -\sum_{k=1}^{K} y_k \log \hat{p}_k = -\log \hat{p}_{\text{true class}}$$
The Connection:
When a decision tree leaf predicts class probabilities equal to the empirical class distribution in that leaf, the total cross-entropy loss incurred in the leaf is exactly $n_{\text{leaf}} \cdot H_{\text{leaf}}$ (derived below).
This means tree-building with entropy is implicitly performing maximum likelihood estimation for a piecewise-constant probability model!
Modern gradient boosting libraries (XGBoost, LightGBM) can use cross-entropy/log-loss directly rather than entropy. This unifies tree-based methods with neural network training under the same loss function, enabling gradient-based optimization of tree splits.
Mathematical Derivation:
Let a leaf contain $n$ samples with class distribution $(n_1, ..., n_K)$. The leaf predicts $\hat{p}_k = n_k/n$.
The total cross-entropy loss in this leaf:
$$\sum_{i=1}^{n} -\log \hat{p}_{y_i} = -\sum_{k=1}^{K} n_k \log \frac{n_k}{n} = n \cdot \left( -\sum_{k=1}^{K} \frac{n_k}{n} \log \frac{n_k}{n} \right) = n \cdot H$$
Total loss across tree = sum over all leaves of $n_{\text{leaf}} \cdot H_{\text{leaf}}$.
Minimizing this is equivalent to minimizing weighted average entropy—exactly what information gain does!
This proves that greedy information gain maximization is greedy cross-entropy minimization, providing a strong theoretical justification for entropy-based splitting.
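This equivalence is easy to verify numerically. The sketch below uses made-up leaf counts (not from any dataset on this page), lets each leaf predict its empirical class proportions, and confirms that the summed log-loss equals $\sum_{\text{leaves}} n_{\text{leaf}} \cdot H_{\text{leaf}}$:

```python
import numpy as np

def entropy_bits(counts: np.ndarray) -> float:
    """Shannon entropy (bits) of a count vector."""
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

if __name__ == "__main__":
    # Hypothetical leaves of a fitted tree: class counts per leaf
    leaves = [np.array([40, 10]), np.array([5, 25]), np.array([15, 15])]

    total_weighted_entropy = 0.0
    total_log_loss = 0.0
    for counts in leaves:
        n = counts.sum()
        p_hat = counts / n  # the leaf's predicted class probabilities
        total_weighted_entropy += n * entropy_bits(counts)
        # Cross-entropy loss (in bits) summed over the n samples in this leaf:
        # each of the counts[k] samples of class k contributes -log2(p_hat[k])
        mask = counts > 0
        total_log_loss += -np.sum(counts[mask] * np.log2(p_hat[mask]))

    print(f"Sum of n_leaf * H_leaf : {total_weighted_entropy:.4f} bits")
    print(f"Total cross-entropy    : {total_log_loss:.4f} bits")
```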
When should you use entropy over Gini? Here's practical guidance based on theoretical properties and empirical experience:
Start with Gini (the default in most libraries). Switch to entropy only if you have a specific reason: extreme class imbalance, many classes, or need to match cross-entropy loss elsewhere in your pipeline. Always validate with cross-validation—the empirical performance difference is usually insignificant.
We've explored entropy from its information-theoretic roots to its practical application in decision trees. Let's consolidate the key insights:

- Entropy is the expected self-information of the class label: the average number of bits needed to encode it optimally.
- Information gain is the reduction in entropy achieved by a split and equals the mutual information between the feature and the class.
- Strict concavity guarantees that any split separating the class distributions yields strictly positive gain.
- Gain ratio (C4.5) normalizes information gain by split information to curb the bias toward many-valued features.
- Entropy and Gini usually produce very similar trees; entropy is more sensitive to rare classes, while Gini is cheaper to compute.
- Minimizing weighted child entropy is equivalent to greedily minimizing cross-entropy loss, connecting trees to the broader machine learning landscape.
What's Next:
We've now covered the two most important splitting criteria in depth. The next page explores Misclassification Error—a simpler criterion that, despite its intuitive appeal, has subtle problems that make it less suitable for tree construction.
You now understand entropy and information gain from both theoretical and practical perspectives. The information-theoretic foundation provides deep insights into why trees work and connects to the broader machine learning landscape through cross-entropy loss.