In 1986, Ross Quinlan published "Induction of Decision Trees," a paper that would fundamentally shape how machines learn to make decisions. The ID3 algorithm (Iterative Dichotomiser 3) introduced a revolutionary idea: using information theory—the mathematical framework developed by Claude Shannon for communication systems—to build decision trees.

Before ID3, decision tree construction was largely heuristic or domain-specific. ID3 provided the first principled, general-purpose algorithm for learning trees from data, grounded in the elegant mathematics of entropy and information gain. It became the foundation for virtually all modern tree-based methods, from C4.5 and CART to random forests and gradient boosting.

Understanding ID3 deeply is essential because its core concepts—entropy-based splitting, greedy recursive partitioning, and the information gain criterion—pervade all subsequent tree algorithms. Master ID3, and you'll have the conceptual foundation for the entire family of tree-based methods.
By the end of this page, you will understand: (1) The complete ID3 algorithm with rigorous mathematical foundations; (2) How entropy and information gain drive splitting decisions; (3) The recursive tree-building process with detailed pseudocode; (4) ID3's assumptions, limitations, and historical significance; (5) Complete implementation with worked examples.
The Problem ID3 Solved:

Before ID3, machine learning researchers faced a fundamental challenge: given a dataset of examples with categorical attributes and class labels, how do you systematically construct a decision tree that accurately classifies new instances?

Early approaches were ad hoc:
- Expert systems required human experts to manually specify decision rules
- Enumeration methods were computationally intractable for large attribute spaces
- Statistical methods like discriminant analysis didn't produce interpretable rules

Quinlan's insight was to frame tree construction as an information optimization problem: at each node, choose the attribute that provides the most information about the class label. This elegant formulation transformed tree learning from art to science.
ID3 stands for 'Iterative Dichotomiser 3'—the third iteration of Quinlan's dichotomising (splitting) algorithms. ID1 and ID2 were earlier, less refined versions. The 'iterative' refers to the recursive nature of the algorithm, which iteratively partitions the data space.
Shannon's Information Theory Connection:

Claude Shannon's 1948 paper 'A Mathematical Theory of Communication' introduced entropy as a measure of uncertainty in a random variable. Shannon showed that entropy quantifies the average amount of information (in bits) needed to describe the outcome of a random process.

Quinlan recognized that classification is fundamentally about reducing uncertainty. Before seeing any attributes, we're uncertain about an instance's class. Each attribute we examine should reduce this uncertainty as much as possible. Entropy provides the perfect mathematical framework for measuring this uncertainty reduction.

Key Insight: The best attribute to split on is the one that reduces class uncertainty the most—the one with the highest information gain.
| Era | Approach | Notes |
|---|---|---|
| 1960s | Manual expert rules | Doesn't scale; requires domain experts |
| 1970s | AQ/Covering algorithms | Sensitive to rule ordering |
| 1979 | ID3 precursors (CLS) | Ad-hoc splitting criteria |
| 1986 | ID3 (Quinlan) | Categorical attributes only; no pruning |
| 1993 | C4.5 (Quinlan) | Handles continuous; includes pruning |
| 1984 | CART (Breiman et al.) | Binary splits; regression support |
Definition of Entropy:

For a discrete random variable $Y$ with $K$ possible values (classes) and probability distribution $P(Y = k) = p_k$, the entropy is defined as:

$$H(Y) = -\sum_{k=1}^{K} p_k \log_2(p_k)$$

where we adopt the convention that $0 \log_2(0) = 0$ (justified by the limit $\lim_{p \to 0^+} p \log_2 p = 0$).

Intuition:
- Entropy measures the average surprise or uncertainty in the distribution
- High entropy = high uncertainty (outcomes are unpredictable)
- Low entropy = low uncertainty (outcomes are predictable)
- Entropy is maximized when outcomes are equally likely
- Entropy is minimized (zero) when one outcome is certain
```python
import numpy as np
from typing import Union, List

def entropy(probs: Union[np.ndarray, List[float]]) -> float:
    """
    Compute entropy of a probability distribution.

    Parameters:
        probs: Array of probabilities (must sum to 1)

    Returns:
        Entropy in bits (log base 2)
    """
    probs = np.asarray(probs)
    # Filter out zero probabilities to avoid log(0)
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

def entropy_from_counts(counts: Union[np.ndarray, List[int]]) -> float:
    """
    Compute entropy from class counts.

    Parameters:
        counts: Array of counts for each class

    Returns:
        Entropy in bits
    """
    counts = np.asarray(counts)
    total = np.sum(counts)
    if total == 0:
        return 0.0
    probs = counts / total
    return entropy(probs)

# Demonstrate entropy behavior
print("Entropy Examples (Binary Classification)")
print("=" * 50)
print(f"{'Distribution':<25} | {'Entropy (bits)':<15}")
print("-" * 50)

distributions = [
    ([1.0, 0.0], "Certain (1.0, 0.0)"),
    ([0.9, 0.1], "Skewed (0.9, 0.1)"),
    ([0.7, 0.3], "Moderate (0.7, 0.3)"),
    ([0.5, 0.5], "Maximum (0.5, 0.5)"),
]

for probs, name in distributions:
    h = entropy(probs)
    print(f"{name:<25} | {h:<15.4f}")

print("\nKey insight: Maximum entropy occurs at uniform distribution")
```

Properties of Entropy:

1. Non-negativity: $H(Y) \geq 0$ always. Entropy cannot be negative.

2. Maximum value: For $K$ classes, $H(Y) \leq \log_2(K)$, with equality when $p_k = 1/K$ for all $k$.

3. Minimum value: $H(Y) = 0$ if and only if one class has probability 1 (complete certainty).

4. Concavity: Entropy is a concave function, meaning mixtures of distributions have higher entropy than the weighted average of individual entropies.

5. Symmetry: Entropy depends only on the probability values, not their ordering.

For Binary Classification ($K=2$):

With $p$ = probability of the positive class and $(1-p)$ = probability of the negative class:

$$H(Y) = -p \log_2(p) - (1-p) \log_2(1-p)$$

This is the binary entropy function, which reaches its maximum of 1 bit at $p = 0.5$.
Entropy in bits has a concrete interpretation: it's the minimum average number of yes/no questions needed to identify the outcome. For a fair coin (H=1 bit), you need exactly one question. For a fair 8-sided die (H=3 bits), you need three questions on average (binary search: is it in upper half? etc.).
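To make this concrete, here is a minimal sketch that evaluates these cases numerically. It reuses the `entropy` helper defined in the code block above; the loaded-coin distribution is an illustrative addition, not from the original examples.

```python
# Quick numeric check of the "average yes/no questions" reading of entropy.
# Assumes the entropy() helper from the previous code block is in scope.
fair_coin = [0.5, 0.5]        # one question: "heads?"
fair_d8 = [1/8] * 8           # three questions via binary search
loaded_coin = [0.9, 0.1]      # illustrative: under 1 bit on average

print(f"Fair coin:        H = {entropy(fair_coin):.3f} bits")    # ~1.000
print(f"Fair 8-sided die: H = {entropy(fair_d8):.3f} bits")      # ~3.000
print(f"Loaded coin:      H = {entropy(loaded_coin):.3f} bits")  # ~0.469
```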
Conditional Entropy:

Before defining information gain, we need conditional entropy—the expected entropy of $Y$ given that we know the value of attribute $A$.

If attribute $A$ has values $\{a_1, a_2, \ldots, a_v\}$, the conditional entropy is:

$$H(Y|A) = \sum_{j=1}^{v} P(A = a_j) \cdot H(Y|A = a_j)$$

where $H(Y|A = a_j)$ is the entropy of $Y$ among instances where $A = a_j$.

Information Gain:

The information gain of attribute $A$ is the reduction in entropy achieved by knowing $A$:

$$IG(Y, A) = H(Y) - H(Y|A)$$

Interpretation:
- $H(Y)$: Initial uncertainty about the class before seeing any attribute
- $H(Y|A)$: Remaining uncertainty after knowing attribute $A$
- $IG(Y, A)$: How much uncertainty $A$ eliminates = how much we learn from $A$
```python
import numpy as np
from collections import Counter
from typing import List, Tuple

def information_gain(y: np.ndarray, x: np.ndarray) -> float:
    """
    Compute information gain of splitting y by attribute x.

    Parameters:
        y: Class labels (n_samples,)
        x: Attribute values (n_samples,)

    Returns:
        Information gain in bits
    """
    n = len(y)

    # Compute parent entropy H(Y)
    parent_counts = np.array(list(Counter(y).values()))
    parent_entropy = entropy_from_counts(parent_counts)

    # Compute conditional entropy H(Y|X)
    conditional_entropy = 0.0
    unique_values = np.unique(x)

    for value in unique_values:
        # Subset where X = value
        mask = (x == value)
        subset_y = y[mask]

        # Weight by proportion of instances
        weight = np.sum(mask) / n

        # Entropy of this subset
        subset_counts = np.array(list(Counter(subset_y).values()))
        subset_entropy = entropy_from_counts(subset_counts)

        conditional_entropy += weight * subset_entropy

    # Information gain = parent entropy - conditional entropy
    return parent_entropy - conditional_entropy

# Classic "Play Tennis" example from Quinlan's papers
# Attributes: Outlook, Temperature, Humidity, Wind
# Class: PlayTennis (Yes/No)

outlook = np.array(['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Rain', 'Overcast',
                    'Sunny', 'Sunny', 'Rain', 'Sunny', 'Overcast', 'Overcast', 'Rain'])
temperature = np.array(['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool',
                        'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild'])
humidity = np.array(['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal',
                     'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'High'])
wind = np.array(['Weak', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong',
                 'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Strong'])
play_tennis = np.array(['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
                        'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No'])

print("Information Gain Analysis: Play Tennis Dataset")
print("=" * 55)

# Parent entropy
parent_counts = np.array([9, 5])  # 9 Yes, 5 No
parent_h = entropy_from_counts(parent_counts)
print(f"Parent entropy H(PlayTennis): {parent_h:.4f} bits")
print()

# Information gain for each attribute
attributes = {
    'Outlook': outlook,
    'Temperature': temperature,
    'Humidity': humidity,
    'Wind': wind
}

print(f"{'Attribute':<15} | {'Information Gain':<18} | {'Rank'}")
print("-" * 55)

gains = {}
for name, values in attributes.items():
    ig = information_gain(play_tennis, values)
    gains[name] = ig

# Sort by gain
sorted_attrs = sorted(gains.items(), key=lambda x: x[1], reverse=True)
for rank, (name, ig) in enumerate(sorted_attrs, 1):
    print(f"{name:<15} | {ig:<18.4f} | {rank}")

print()
print(f"Best split attribute: {sorted_attrs[0][0]}")
```

Information gain measures how much 'purer' the child nodes become after splitting. A high-gain attribute creates subsets that are more homogeneous (lower entropy) than the parent. ID3 greedily maximizes this purity improvement at each step, building trees that efficiently separate classes.
Worked Example: Outlook Attribute

Let's trace through the information gain calculation for the Outlook attribute.

Parent node: 9 Yes, 5 No out of 14 total

$$H(PlayTennis) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940 \text{ bits}$$

After splitting on Outlook:

| Outlook | Count | Yes | No | $H(Y \mid Outlook = v)$ |
|---------|-------|-----|----|--------------------------|
| Sunny | 5 | 2 | 3 | 0.971 bits |
| Overcast | 4 | 4 | 0 | 0.000 bits |
| Rain | 5 | 3 | 2 | 0.971 bits |

Conditional entropy:

$$H(Y|Outlook) = \frac{5}{14}(0.971) + \frac{4}{14}(0.000) + \frac{5}{14}(0.971) = 0.694 \text{ bits}$$

Information gain:

$$IG(PlayTennis, Outlook) = 0.940 - 0.694 = 0.246 \text{ bits}$$

This is the highest gain among all attributes, so ID3 chooses Outlook as the root split.
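These numbers can be verified in a few lines. The sketch below assumes the `entropy_from_counts` helper from the earlier code block is in scope:

```python
# Reproduce the Outlook calculation numerically.
h_parent = entropy_from_counts([9, 5])      # H(PlayTennis) ~ 0.940
h_sunny = entropy_from_counts([2, 3])       # ~ 0.971
h_overcast = entropy_from_counts([4, 0])    # 0.000
h_rain = entropy_from_counts([3, 2])        # ~ 0.971

h_conditional = (5/14) * h_sunny + (4/14) * h_overcast + (5/14) * h_rain
gain = h_parent - h_conditional

print(f"H(PlayTennis)           = {h_parent:.3f} bits")
print(f"H(PlayTennis | Outlook) = {h_conditional:.3f} bits")
print(f"IG(PlayTennis, Outlook) = {gain:.3f} bits")  # ~0.247 (0.246 with the rounded values above)
```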
Algorithm Overview:

ID3 constructs decision trees using a greedy, top-down, recursive partitioning strategy. Starting from the root with all training instances, it proceeds as follows:

1. If all instances have the same class → create a leaf with that class
2. If no attributes remain → create a leaf with the majority class
3. Otherwise → select the attribute with highest information gain, create a node, and recursively build subtrees for each attribute value

Formal Algorithm:
"""ID3 Algorithm: Complete Implementation This implementation follows Quinlan's original specification withdetailed comments explaining each decision point.""" import numpy as npfrom collections import Counterfrom typing import Dict, List, Any, Optional, Unionfrom dataclasses import dataclass @dataclassclass TreeNode: """Represents a node in the decision tree.""" is_leaf: bool = False prediction: Optional[str] = None # For leaf nodes attribute: Optional[str] = None # For internal nodes children: Optional[Dict[str, 'TreeNode']] = None # Branches def __repr__(self): if self.is_leaf: return f"Leaf({self.prediction})" return f"Node({self.attribute})" class ID3Classifier: """ ID3 Decision Tree Classifier. Implements the classic ID3 algorithm with information gain (entropy reduction) as the splitting criterion. Limitations (by design, matching original ID3): - Categorical attributes only - No pruning - No handling of missing values - No continuous attributes """ def __init__(self): self.root: Optional[TreeNode] = None self.attributes: List[str] = [] def _entropy(self, y: np.ndarray) -> float: """Compute entropy of label distribution.""" if len(y) == 0: return 0.0 counts = np.array(list(Counter(y).values())) probs = counts / len(y) probs = probs[probs > 0] return -np.sum(probs * np.log2(probs)) def _information_gain(self, X: np.ndarray, y: np.ndarray, attribute_idx: int) -> float: """Compute information gain for splitting on attribute.""" parent_entropy = self._entropy(y) # Get unique values for this attribute values = np.unique(X[:, attribute_idx]) # Compute weighted child entropy child_entropy = 0.0 for value in values: mask = X[:, attribute_idx] == value weight = np.sum(mask) / len(y) child_entropy += weight * self._entropy(y[mask]) return parent_entropy - child_entropy def _majority_class(self, y: np.ndarray) -> str: """Return the most common class label.""" counter = Counter(y) return counter.most_common(1)[0][0] def _build_tree(self, X: np.ndarray, y: np.ndarray, available_attrs: List[int], attribute_names: List[str]) -> TreeNode: """ Recursively build the decision tree. Base cases: 1. All examples have same class -> leaf node 2. No attributes left -> leaf with majority class 3. No examples -> leaf with parent's majority (handled in parent) Recursive case: Select best attribute, create branches for each value, recurse on subsets. 
""" # Base case 1: All examples have same class unique_classes = np.unique(y) if len(unique_classes) == 1: return TreeNode(is_leaf=True, prediction=unique_classes[0]) # Base case 2: No attributes remaining if len(available_attrs) == 0: return TreeNode(is_leaf=True, prediction=self._majority_class(y)) # Find attribute with maximum information gain best_gain = -1 best_attr_idx = None for attr_idx in available_attrs: gain = self._information_gain(X, y, attr_idx) if gain > best_gain: best_gain = gain best_attr_idx = attr_idx # Create internal node best_attr_name = attribute_names[best_attr_idx] node = TreeNode( is_leaf=False, attribute=best_attr_name, children={} ) # Remove this attribute from available set remaining_attrs = [i for i in available_attrs if i != best_attr_idx] # Create branches for each attribute value unique_values = np.unique(X[:, best_attr_idx]) majority_class = self._majority_class(y) # Default for empty branches for value in unique_values: mask = X[:, best_attr_idx] == value subset_X = X[mask] subset_y = y[mask] if len(subset_y) == 0: # Empty subset: use parent's majority class node.children[value] = TreeNode( is_leaf=True, prediction=majority_class ) else: # Recurse on subset node.children[value] = self._build_tree( subset_X, subset_y, remaining_attrs, attribute_names ) return node def fit(self, X: np.ndarray, y: np.ndarray, attribute_names: Optional[List[str]] = None): """ Build the decision tree from training data. Parameters: X: Feature matrix (n_samples, n_attributes), categorical values y: Class labels (n_samples,) attribute_names: Optional names for attributes """ n_samples, n_attrs = X.shape if attribute_names is None: attribute_names = [f"attr_{i}" for i in range(n_attrs)] self.attributes = attribute_names available_attrs = list(range(n_attrs)) self.root = self._build_tree(X, y, available_attrs, attribute_names) return self def predict_one(self, x: np.ndarray, attribute_name_to_idx: Dict[str, int]) -> str: """Predict class for a single instance.""" node = self.root while not node.is_leaf: attr_name = node.attribute attr_idx = attribute_name_to_idx[attr_name] value = x[attr_idx] if value in node.children: node = node.children[value] else: # Unseen value: return most common child prediction # (Simple handling; real systems do better) child_preds = [c.prediction for c in node.children.values() if c.is_leaf] if child_preds: return Counter(child_preds).most_common(1)[0][0] else: node = list(node.children.values())[0] return node.prediction def predict(self, X: np.ndarray) -> np.ndarray: """Predict classes for multiple instances.""" attr_to_idx = {name: i for i, name in enumerate(self.attributes)} predictions = [self.predict_one(x, attr_to_idx) for x in X] return np.array(predictions) def print_tree(self, node: Optional[TreeNode] = None, indent: str = "") -> None: """Pretty-print the decision tree.""" if node is None: node = self.root if node.is_leaf: print(f"{indent}→ {node.prediction}") else: print(f"{indent}[{node.attribute}]") for value, child in node.children.items(): print(f"{indent} = {value}:") self.print_tree(child, indent + " ")ID3 makes locally optimal choices at each node without backtracking. This greedy approach doesn't guarantee globally optimal trees, but it's computationally efficient (O(n × m × d) for n samples, m attributes, depth d) and works remarkably well in practice.
Let's trace through ID3 on the classic 'Play Tennis' dataset, showing every calculation and decision.
```python
import numpy as np

# Play Tennis Dataset
data = np.array([
    ['Sunny', 'Hot', 'High', 'Weak', 'No'],
    ['Sunny', 'Hot', 'High', 'Strong', 'No'],
    ['Overcast', 'Hot', 'High', 'Weak', 'Yes'],
    ['Rain', 'Mild', 'High', 'Weak', 'Yes'],
    ['Rain', 'Cool', 'Normal', 'Weak', 'Yes'],
    ['Rain', 'Cool', 'Normal', 'Strong', 'No'],
    ['Overcast', 'Cool', 'Normal', 'Strong', 'Yes'],
    ['Sunny', 'Mild', 'High', 'Weak', 'No'],
    ['Sunny', 'Cool', 'Normal', 'Weak', 'Yes'],
    ['Rain', 'Mild', 'Normal', 'Weak', 'Yes'],
    ['Sunny', 'Mild', 'Normal', 'Strong', 'Yes'],
    ['Overcast', 'Mild', 'High', 'Strong', 'Yes'],
    ['Overcast', 'Hot', 'Normal', 'Weak', 'Yes'],
    ['Rain', 'Mild', 'High', 'Strong', 'No'],
])

X = data[:, :-1]
y = data[:, -1]
attribute_names = ['Outlook', 'Temperature', 'Humidity', 'Wind']

# Build and display the tree
clf = ID3Classifier()
clf.fit(X, y, attribute_names)

print("ID3 Decision Tree for Play Tennis:")
print("=" * 40)
clf.print_tree()

print("\n" + "=" * 40)
print("Predictions on training data:")
predictions = clf.predict(X)
accuracy = np.mean(predictions == y)
print(f"Training accuracy: {accuracy:.2%}")
```

Step-by-Step Tree Construction:

Step 1: Root Node
- Compute information gain for all attributes
- Outlook has the highest IG (0.246) → select as root
- Create three branches: Sunny, Overcast, Rain

Step 2: Overcast Branch
- All 4 instances are "Yes" → create a leaf node with "Yes"
- Entropy = 0, no further splitting needed

Step 3: Sunny Branch
- 5 instances: 2 Yes, 3 No
- Compute IG for the remaining attributes (Temperature, Humidity, Wind)
- Humidity has the highest IG → split on Humidity
- High → No (pure), Normal → Yes (pure)

Step 4: Rain Branch
- 5 instances: 3 Yes, 2 No
- Compute IG for the remaining attributes
- Wind has the highest IG → split on Wind
- Weak → Yes (pure), Strong → No (pure)

Final Tree:

```
Outlook
├── Sunny → Humidity
│   ├── High → No
│   └── Normal → Yes
├── Overcast → Yes
└── Rain → Wind
    ├── Weak → Yes
    └── Strong → No
```
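To see the tree applied beyond the training data, here is a short usage sketch. It assumes the `clf` object fitted in the code block above is still in scope; the new day is a hypothetical instance chosen for illustration.

```python
# Classify a previously unseen day: Outlook=Sunny -> Humidity=High -> 'No'
new_day = np.array([['Sunny', 'Cool', 'High', 'Strong']])
print(clf.predict(new_day))   # expected output: ['No']
```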
ID3 was groundbreaking, but it has significant limitations that motivated later algorithms like C4.5 and CART:

- Categorical attributes only: continuous features must be discretized before use
- No pruning: trees are grown to purity, so they readily overfit noisy data
- No handling of missing attribute values
- Bias toward attributes with many values, which tend to receive inflated information gain
Consider an 'ID' attribute with a unique value for each instance. Its information gain would be maximal (it splits the data into pure singleton nodes), yet the attribute has zero predictive value for new instances! This flaw motivated the gain ratio in C4.5.
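A short experiment makes this flaw visible. The sketch below assumes the `information_gain` and `entropy_from_counts` helpers and the `play_tennis` array from the earlier code blocks are in scope, and fabricates a per-row ID purely for illustration:

```python
# Give every instance a unique ID and measure its information gain.
instance_id = np.array([str(i) for i in range(len(play_tennis))])

ig_id = information_gain(play_tennis, instance_id)
h_parent = entropy_from_counts([9, 5])

print(f"IG of the 'ID' attribute: {ig_id:.3f} bits")     # equals the parent entropy...
print(f"Parent entropy:           {h_parent:.3f} bits")  # ...~0.940, the maximum possible gain
```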
We've thoroughly examined the ID3 algorithm—the foundational tree-building method that introduced information-theoretic principles to decision tree learning.
What's Next:

ID3's limitations motivated Quinlan to develop C4.5, which addresses many of these shortcomings. The next page explores C4.5's innovations: the gain ratio to handle many-valued attributes, continuous attribute support through threshold selection, and pruning strategies to prevent overfitting.
You now understand the ID3 algorithm in depth: its information-theoretic foundations, the complete algorithm with implementation, and its historical significance as the ancestor of modern tree methods. Next, we'll explore how C4.5 overcame ID3's limitations.