In 1986, Ross Quinlan published "Induction of Decision Trees," a paper that would fundamentally shape how machines learn to make decisions. The ID3 algorithm (Iterative Dichotomiser 3) introduced a revolutionary idea: using information theory—the mathematical framework developed by Claude Shannon for communication systems—to build decision trees.

Before ID3, decision tree construction was largely heuristic or domain-specific. ID3 provided the first principled, general-purpose algorithm for learning trees from data, grounded in the elegant mathematics of entropy and information gain. It became the foundation for virtually all modern tree-based methods, from C4.5 and CART to random forests and gradient boosting.

Understanding ID3 deeply is essential because its core concepts—entropy-based splitting, greedy recursive partitioning, and the information gain criterion—pervade all subsequent tree algorithms. Master ID3, and you'll have the conceptual foundation for the entire family of tree-based methods.
By the end of this page, you will understand: (1) The complete ID3 algorithm with rigorous mathematical foundations; (2) How entropy and information gain drive splitting decisions; (3) The recursive tree-building process with detailed pseudocode; (4) ID3's assumptions, limitations, and historical significance; (5) Complete implementation with worked examples.
The Problem ID3 Solved:

Before ID3, machine learning researchers faced a fundamental challenge: given a dataset of examples with categorical attributes and class labels, how do you systematically construct a decision tree that accurately classifies new instances?

Early approaches were ad hoc:
- Expert systems required human experts to manually specify decision rules
- Enumeration methods were computationally intractable for large attribute spaces
- Statistical methods like discriminant analysis didn't produce interpretable rules

Quinlan's insight was to frame tree construction as an information optimization problem: at each node, choose the attribute that provides the most information about the class label. This elegant formulation transformed tree learning from art to science.
ID3 stands for 'Iterative Dichotomiser 3'—the third iteration of Quinlan's dichotomising (splitting) algorithms. ID1 and ID2 were earlier, less refined versions. The 'iterative' refers to the recursive nature of the algorithm, which iteratively partitions the data space.
Shannon's Information Theory Connection:

Claude Shannon's 1948 paper 'A Mathematical Theory of Communication' introduced entropy as a measure of uncertainty in a random variable. Shannon showed that entropy quantifies the average amount of information (in bits) needed to describe the outcome of a random process.

Quinlan recognized that classification is fundamentally about reducing uncertainty. Before seeing any attributes, we're uncertain about an instance's class. Each attribute we examine should reduce this uncertainty as much as possible. Entropy provides the perfect mathematical framework for measuring this uncertainty reduction.

Key Insight: The best attribute to split on is the one that reduces class uncertainty the most—the one with the highest information gain.
| Era | Approach | Notes |
|---|---|---|
| 1960s | Manual expert rules | Doesn't scale; requires domain experts |
| 1970s | AQ/Covering algorithms | Sensitive to rule ordering |
| 1979 | ID3 precursors (CLS) | Ad-hoc splitting criteria |
| 1986 | ID3 (Quinlan) | Categorical attributes only; no pruning |
| 1993 | C4.5 (Quinlan) | Handles continuous; includes pruning |
| 1984 | CART (Breiman et al.) | Binary splits; regression support |
Definition of Entropy:

For a discrete random variable $Y$ with $K$ possible values (classes) and probability distribution $P(Y = k) = p_k$, the entropy is defined as:

$$H(Y) = -\sum_{k=1}^{K} p_k \log_2(p_k)$$

where we adopt the convention that $0 \log_2(0) = 0$ (justified by the limit $\lim_{p \to 0^+} p \log_2 p = 0$).

Intuition:
- Entropy measures the average surprise or uncertainty in the distribution
- High entropy = high uncertainty (outcomes are unpredictable)
- Low entropy = low uncertainty (outcomes are predictable)
- Entropy is maximized when outcomes are equally likely
- Entropy is minimized (zero) when one outcome is certain
```python
import numpy as np
from typing import Union, List

def entropy(probs: Union[np.ndarray, List[float]]) -> float:
    """
    Compute entropy of a probability distribution.

    Parameters:
        probs: Array of probabilities (must sum to 1)

    Returns:
        Entropy in bits (log base 2)
    """
    probs = np.asarray(probs)
    # Filter out zero probabilities to avoid log(0)
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

def entropy_from_counts(counts: Union[np.ndarray, List[int]]) -> float:
    """
    Compute entropy from class counts.

    Parameters:
        counts: Array of counts for each class

    Returns:
        Entropy in bits
    """
    counts = np.asarray(counts)
    total = np.sum(counts)
    if total == 0:
        return 0.0
    probs = counts / total
    return entropy(probs)

# Demonstrate entropy behavior
print("Entropy Examples (Binary Classification)")
print("=" * 50)
print(f"{'Distribution':<25} | {'Entropy (bits)':<15}")
print("-" * 50)

distributions = [
    ([1.0, 0.0], "Certain (1.0, 0.0)"),
    ([0.9, 0.1], "Skewed (0.9, 0.1)"),
    ([0.7, 0.3], "Moderate (0.7, 0.3)"),
    ([0.5, 0.5], "Maximum (0.5, 0.5)"),
]

for probs, name in distributions:
    h = entropy(probs)
    print(f"{name:<25} | {h:<15.4f}")

print("\nKey insight: Maximum entropy occurs at uniform distribution")
```

Properties of Entropy:

1. Non-negativity: $H(Y) \geq 0$ always. Entropy cannot be negative.

2. Maximum value: For $K$ classes, $H(Y) \leq \log_2(K)$, with equality when $p_k = 1/K$ for all $k$.

3. Minimum value: $H(Y) = 0$ if and only if one class has probability 1 (complete certainty).

4. Concavity: Entropy is a concave function, meaning mixtures of distributions have higher entropy than the weighted average of individual entropies.

5. Symmetry: Entropy depends only on the probability values, not their ordering.

For Binary Classification ($K=2$):

With $p$ = probability of the positive class and $(1-p)$ = probability of the negative class:

$$H(Y) = -p \log_2(p) - (1-p) \log_2(1-p)$$

This is the binary entropy function, which reaches its maximum of 1 bit at $p = 0.5$.
Entropy in bits has a concrete interpretation: it's the minimum average number of yes/no questions needed to identify the outcome. For a fair coin (H=1 bit), you need exactly one question. For a fair 8-sided die (H=3 bits), you need three questions on average (binary search: is it in upper half? etc.).
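To make this concrete, here is a minimal sketch that evaluates these cases numerically. It reuses the `entropy` helper defined in the code block above; the loaded-coin distribution is an illustrative addition, not from the original examples.

```python
# Quick numeric check of the "average yes/no questions" reading of entropy.
# Assumes the entropy() helper from the previous code block is in scope.
fair_coin = [0.5, 0.5]        # one question: "heads?"
fair_d8 = [1/8] * 8           # three questions via binary search
loaded_coin = [0.9, 0.1]      # illustrative: under 1 bit on average

print(f"Fair coin:        H = {entropy(fair_coin):.3f} bits")    # ~1.000
print(f"Fair 8-sided die: H = {entropy(fair_d8):.3f} bits")      # ~3.000
print(f"Loaded coin:      H = {entropy(loaded_coin):.3f} bits")  # ~0.469
```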
Conditional Entropy:

Before defining information gain, we need conditional entropy—the expected entropy of $Y$ given that we know the value of attribute $A$.

If attribute $A$ has values $\{a_1, a_2, \ldots, a_v\}$, the conditional entropy is:

$$H(Y|A) = \sum_{j=1}^{v} P(A = a_j) \cdot H(Y|A = a_j)$$

where $H(Y|A = a_j)$ is the entropy of $Y$ among instances where $A = a_j$.

Information Gain:

The information gain of attribute $A$ is the reduction in entropy achieved by knowing $A$:

$$IG(Y, A) = H(Y) - H(Y|A)$$

Interpretation:
- $H(Y)$: Initial uncertainty about the class before seeing any attribute
- $H(Y|A)$: Remaining uncertainty after knowing attribute $A$
- $IG(Y, A)$: How much uncertainty $A$ eliminates = how much we learn from $A$
```python
import numpy as np
from collections import Counter
from typing import List, Tuple

def information_gain(y: np.ndarray, x: np.ndarray) -> float:
    """
    Compute information gain of splitting y by attribute x.

    Parameters:
        y: Class labels (n_samples,)
        x: Attribute values (n_samples,)

    Returns:
        Information gain in bits
    """
    n = len(y)

    # Compute parent entropy H(Y)
    parent_counts = np.array(list(Counter(y).values()))
    parent_entropy = entropy_from_counts(parent_counts)

    # Compute conditional entropy H(Y|X)
    conditional_entropy = 0.0
    unique_values = np.unique(x)

    for value in unique_values:
        # Subset where X = value
        mask = (x == value)
        subset_y = y[mask]

        # Weight by proportion of instances
        weight = np.sum(mask) / n

        # Entropy of this subset
        subset_counts = np.array(list(Counter(subset_y).values()))
        subset_entropy = entropy_from_counts(subset_counts)

        conditional_entropy += weight * subset_entropy

    # Information gain = parent entropy - conditional entropy
    return parent_entropy - conditional_entropy

# Classic "Play Tennis" example from Quinlan's papers
# Attributes: Outlook, Temperature, Humidity, Wind
# Class: PlayTennis (Yes/No)

outlook = np.array(['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Rain', 'Overcast',
                    'Sunny', 'Sunny', 'Rain', 'Sunny', 'Overcast', 'Overcast', 'Rain'])
temperature = np.array(['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool',
                        'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild'])
humidity = np.array(['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal',
                     'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'High'])
wind = np.array(['Weak', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong',
                 'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Strong'])
play_tennis = np.array(['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
                        'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No'])

print("Information Gain Analysis: Play Tennis Dataset")
print("=" * 55)

# Parent entropy
parent_counts = np.array([9, 5])  # 9 Yes, 5 No
parent_h = entropy_from_counts(parent_counts)
print(f"Parent entropy H(PlayTennis): {parent_h:.4f} bits")
print()

# Information gain for each attribute
attributes = {
    'Outlook': outlook,
    'Temperature': temperature,
    'Humidity': humidity,
    'Wind': wind
}

print(f"{'Attribute':<15} | {'Information Gain':<18} | {'Rank'}")
print("-" * 55)

gains = {}
for name, values in attributes.items():
    ig = information_gain(play_tennis, values)
    gains[name] = ig

# Sort by gain
sorted_attrs = sorted(gains.items(), key=lambda x: x[1], reverse=True)
for rank, (name, ig) in enumerate(sorted_attrs, 1):
    print(f"{name:<15} | {ig:<18.4f} | {rank}")

print()
print(f"Best split attribute: {sorted_attrs[0][0]}")
```

Information gain measures how much 'purer' the child nodes become after splitting. A high-gain attribute creates subsets that are more homogeneous (lower entropy) than the parent. ID3 greedily maximizes this purity improvement at each step, building trees that efficiently separate classes.
Worked Example: Outlook Attribute

Let's trace through the information gain calculation for the Outlook attribute.

Parent node: 9 Yes, 5 No out of 14 total

$$H(PlayTennis) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940 \text{ bits}$$

After splitting on Outlook:

| Outlook | Count | Yes | No | $H(Y \mid Outlook = v)$ |
|---------|-------|-----|----|--------------------------|
| Sunny | 5 | 2 | 3 | 0.971 bits |
| Overcast | 4 | 4 | 0 | 0.000 bits |
| Rain | 5 | 3 | 2 | 0.971 bits |

Conditional entropy:

$$H(Y|Outlook) = \frac{5}{14}(0.971) + \frac{4}{14}(0.000) + \frac{5}{14}(0.971) = 0.694 \text{ bits}$$

Information gain:

$$IG(PlayTennis, Outlook) = 0.940 - 0.694 = 0.246 \text{ bits}$$

This is the highest gain among all attributes, so ID3 chooses Outlook as the root split.
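These numbers can be verified in a few lines. The sketch below assumes the `entropy_from_counts` helper from the earlier code block is in scope:

```python
# Reproduce the Outlook calculation numerically.
h_parent = entropy_from_counts([9, 5])      # H(PlayTennis) ~ 0.940
h_sunny = entropy_from_counts([2, 3])       # ~ 0.971
h_overcast = entropy_from_counts([4, 0])    # 0.000
h_rain = entropy_from_counts([3, 2])        # ~ 0.971

h_conditional = (5/14) * h_sunny + (4/14) * h_overcast + (5/14) * h_rain
gain = h_parent - h_conditional

print(f"H(PlayTennis)           = {h_parent:.3f} bits")
print(f"H(PlayTennis | Outlook) = {h_conditional:.3f} bits")
print(f"IG(PlayTennis, Outlook) = {gain:.3f} bits")  # ~0.247 (0.246 with the rounded values above)
```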
Algorithm Overview:

ID3 constructs decision trees using a greedy, top-down, recursive partitioning strategy. Starting from the root with all training instances, it proceeds as follows:

1. If all instances have the same class → create a leaf with that class
2. If no attributes remain → create a leaf with the majority class
3. Otherwise → select the attribute with highest information gain, create a node, and recursively build subtrees for each attribute value

Formal Algorithm:
"""ID3 Algorithm: Complete Implementation This implementation follows Quinlan's original specification withdetailed comments explaining each decision point.""" import numpy as npfrom collections import Counterfrom typing import Dict, List, Any, Optional, Unionfrom dataclasses import dataclass @dataclassclass TreeNode: """Represents a node in the decision tree.""" is_leaf: bool = False prediction: Optional[str] = None # For leaf nodes attribute: Optional[str] = None # For internal nodes children: Optional[Dict[str, 'TreeNode']] = None # Branches def __repr__(self): if self.is_leaf: return f"Leaf({self.prediction})" return f"Node({self.attribute})" class ID3Classifier: """ ID3 Decision Tree Classifier. Implements the classic ID3 algorithm with information gain (entropy reduction) as the splitting criterion. Limitations (by design, matching original ID3): - Categorical attributes only - No pruning - No handling of missing values - No continuous attributes """ def __init__(self): self.root: Optional[TreeNode] = None self.attributes: List[str] = [] def _entropy(self, y: np.ndarray) -> float: """Compute entropy of label distribution.""" if len(y) == 0: return 0.0 counts = np.array(list(Counter(y).values())) probs = counts / len(y) probs = probs[probs > 0] return -np.sum(probs * np.log2(probs)) def _information_gain(self, X: np.ndarray, y: np.ndarray, attribute_idx: int) -> float: """Compute information gain for splitting on attribute.""" parent_entropy = self._entropy(y) # Get unique values for this attribute values = np.unique(X[:, attribute_idx]) # Compute weighted child entropy child_entropy = 0.0 for value in values: mask = X[:, attribute_idx] == value weight = np.sum(mask) / len(y) child_entropy += weight * self._entropy(y[mask]) return parent_entropy - child_entropy def _majority_class(self, y: np.ndarray) -> str: """Return the most common class label.""" counter = Counter(y) return counter.most_common(1)[0][0] def _build_tree(self, X: np.ndarray, y: np.ndarray, available_attrs: List[int], attribute_names: List[str]) -> TreeNode: """ Recursively build the decision tree. Base cases: 1. All examples have same class -> leaf node 2. No attributes left -> leaf with majority class 3. No examples -> leaf with parent's majority (handled in parent) Recursive case: Select best attribute, create branches for each value, recurse on subsets. 
""" # Base case 1: All examples have same class unique_classes = np.unique(y) if len(unique_classes) == 1: return TreeNode(is_leaf=True, prediction=unique_classes[0]) # Base case 2: No attributes remaining if len(available_attrs) == 0: return TreeNode(is_leaf=True, prediction=self._majority_class(y)) # Find attribute with maximum information gain best_gain = -1 best_attr_idx = None for attr_idx in available_attrs: gain = self._information_gain(X, y, attr_idx) if gain > best_gain: best_gain = gain best_attr_idx = attr_idx # Create internal node best_attr_name = attribute_names[best_attr_idx] node = TreeNode( is_leaf=False, attribute=best_attr_name, children={} ) # Remove this attribute from available set remaining_attrs = [i for i in available_attrs if i != best_attr_idx] # Create branches for each attribute value unique_values = np.unique(X[:, best_attr_idx]) majority_class = self._majority_class(y) # Default for empty branches for value in unique_values: mask = X[:, best_attr_idx] == value subset_X = X[mask] subset_y = y[mask] if len(subset_y) == 0: # Empty subset: use parent's majority class node.children[value] = TreeNode( is_leaf=True, prediction=majority_class ) else: # Recurse on subset node.children[value] = self._build_tree( subset_X, subset_y, remaining_attrs, attribute_names ) return node def fit(self, X: np.ndarray, y: np.ndarray, attribute_names: Optional[List[str]] = None): """ Build the decision tree from training data. Parameters: X: Feature matrix (n_samples, n_attributes), categorical values y: Class labels (n_samples,) attribute_names: Optional names for attributes """ n_samples, n_attrs = X.shape if attribute_names is None: attribute_names = [f"attr_{i}" for i in range(n_attrs)] self.attributes = attribute_names available_attrs = list(range(n_attrs)) self.root = self._build_tree(X, y, available_attrs, attribute_names) return self def predict_one(self, x: np.ndarray, attribute_name_to_idx: Dict[str, int]) -> str: """Predict class for a single instance.""" node = self.root while not node.is_leaf: attr_name = node.attribute attr_idx = attribute_name_to_idx[attr_name] value = x[attr_idx] if value in node.children: node = node.children[value] else: # Unseen value: return most common child prediction # (Simple handling; real systems do better) child_preds = [c.prediction for c in node.children.values() if c.is_leaf] if child_preds: return Counter(child_preds).most_common(1)[0][0] else: node = list(node.children.values())[0] return node.prediction def predict(self, X: np.ndarray) -> np.ndarray: """Predict classes for multiple instances.""" attr_to_idx = {name: i for i, name in enumerate(self.attributes)} predictions = [self.predict_one(x, attr_to_idx) for x in X] return np.array(predictions) def print_tree(self, node: Optional[TreeNode] = None, indent: str = "") -> None: """Pretty-print the decision tree.""" if node is None: node = self.root if node.is_leaf: print(f"{indent}→ {node.prediction}") else: print(f"{indent}[{node.attribute}]") for value, child in node.children.items(): print(f"{indent} = {value}:") self.print_tree(child, indent + " ")ID3 makes locally optimal choices at each node without backtracking. This greedy approach doesn't guarantee globally optimal trees, but it's computationally efficient (O(n × m × d) for n samples, m attributes, depth d) and works remarkably well in practice.
Let's trace through ID3 on the classic 'Play Tennis' dataset, showing every calculation and decision.
```python
import numpy as np

# Play Tennis Dataset
data = np.array([
    ['Sunny', 'Hot', 'High', 'Weak', 'No'],
    ['Sunny', 'Hot', 'High', 'Strong', 'No'],
    ['Overcast', 'Hot', 'High', 'Weak', 'Yes'],
    ['Rain', 'Mild', 'High', 'Weak', 'Yes'],
    ['Rain', 'Cool', 'Normal', 'Weak', 'Yes'],
    ['Rain', 'Cool', 'Normal', 'Strong', 'No'],
    ['Overcast', 'Cool', 'Normal', 'Strong', 'Yes'],
    ['Sunny', 'Mild', 'High', 'Weak', 'No'],
    ['Sunny', 'Cool', 'Normal', 'Weak', 'Yes'],
    ['Rain', 'Mild', 'Normal', 'Weak', 'Yes'],
    ['Sunny', 'Mild', 'Normal', 'Strong', 'Yes'],
    ['Overcast', 'Mild', 'High', 'Strong', 'Yes'],
    ['Overcast', 'Hot', 'Normal', 'Weak', 'Yes'],
    ['Rain', 'Mild', 'High', 'Strong', 'No'],
])

X = data[:, :-1]
y = data[:, -1]
attribute_names = ['Outlook', 'Temperature', 'Humidity', 'Wind']

# Build and display the tree
clf = ID3Classifier()
clf.fit(X, y, attribute_names)

print("ID3 Decision Tree for Play Tennis:")
print("=" * 40)
clf.print_tree()

print("\n" + "=" * 40)
print("Predictions on training data:")
predictions = clf.predict(X)
accuracy = np.mean(predictions == y)
print(f"Training accuracy: {accuracy:.2%}")
```

Step-by-Step Tree Construction:

Step 1: Root Node
- Compute information gain for all attributes
- Outlook has the highest IG (0.246) → select as root
- Create three branches: Sunny, Overcast, Rain

Step 2: Overcast Branch
- All 4 instances are "Yes" → create a leaf node with "Yes"
- Entropy = 0, no further splitting needed

Step 3: Sunny Branch
- 5 instances: 2 Yes, 3 No
- Compute IG for the remaining attributes (Temperature, Humidity, Wind)
- Humidity has the highest IG → split on Humidity
- High → No (pure), Normal → Yes (pure)

Step 4: Rain Branch
- 5 instances: 3 Yes, 2 No
- Compute IG for the remaining attributes
- Wind has the highest IG → split on Wind
- Weak → Yes (pure), Strong → No (pure)

Final Tree:

```
Outlook
├── Sunny → Humidity
│   ├── High → No
│   └── Normal → Yes
├── Overcast → Yes
└── Rain → Wind
    ├── Weak → Yes
    └── Strong → No
```
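To see the tree applied beyond the training data, here is a short usage sketch. It assumes the `clf` object fitted in the code block above is still in scope; the new day is a hypothetical instance chosen for illustration.

```python
# Classify a previously unseen day: Outlook=Sunny -> Humidity=High -> 'No'
new_day = np.array([['Sunny', 'Cool', 'High', 'Strong']])
print(clf.predict(new_day))   # expected output: ['No']
```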
ID3 was groundbreaking, but it has significant limitations that motivated later algorithms like C4.5 and CART:

- Categorical attributes only: continuous features must be discretized before use
- No pruning: trees are grown to purity, so they readily overfit noisy data
- No handling of missing attribute values
- Bias toward attributes with many values, which tend to receive inflated information gain
Consider an 'ID' attribute with a unique value for each instance. Its information gain would be maximal (it splits the data into pure singleton nodes), yet the attribute has zero predictive value for new instances! This flaw motivated the gain ratio in C4.5.
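A short experiment makes this flaw visible. The sketch below assumes the `information_gain` and `entropy_from_counts` helpers and the `play_tennis` array from the earlier code blocks are in scope, and fabricates a per-row ID purely for illustration:

```python
# Give every instance a unique ID and measure its information gain.
instance_id = np.array([str(i) for i in range(len(play_tennis))])

ig_id = information_gain(play_tennis, instance_id)
h_parent = entropy_from_counts([9, 5])

print(f"IG of the 'ID' attribute: {ig_id:.3f} bits")     # equals the parent entropy...
print(f"Parent entropy:           {h_parent:.3f} bits")  # ...~0.940, the maximum possible gain
```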
We've thoroughly examined the ID3 algorithm—the foundational tree-building method that introduced information-theoretic principles to decision tree learning.
What's Next:

ID3's limitations motivated Quinlan to develop C4.5, which addresses many of these shortcomings. The next page explores C4.5's innovations: the gain ratio to handle many-valued attributes, continuous attribute support through threshold selection, and pruning strategies to prevent overfitting.
You now understand the ID3 algorithm in depth: its information-theoretic foundations, the complete algorithm with implementation, and its historical significance as the ancestor of modern tree methods. Next, we'll explore how C4.5 overcame ID3's limitations.