In 1993, Ross Quinlan published C4.5: Programs for Machine Learning, introducing an algorithm that would become the most influential decision tree method in machine learning history. C4.5 wasn't just an incremental improvement over ID3—it was a comprehensive solution that addressed every major limitation of its predecessor.
C4.5 introduced four critical innovations: the gain ratio criterion for attribute selection, binary splits on continuous attributes, error-based post-pruning, and fractional-instance handling of missing values.
These innovations made C4.5 practical for real-world datasets and established design patterns that persist in modern tree implementations. Understanding C4.5 deeply provides the foundation for understanding nearly every tree-based algorithm that followed.
By the end of this page, you will understand: (1) The gain ratio and why it fixes ID3's many-values bias; (2) How C4.5 handles continuous attributes through optimal threshold selection; (3) Error-based pruning strategies and their theoretical foundation; (4) Missing value handling via fractional instances; (5) Complete C4.5 implementation with all features.
The Problem with Information Gain:
ID3's information gain has a fundamental flaw: it favors attributes with many distinct values. Consider an extreme case—an 'ID' attribute that uniquely identifies each training instance. Splitting on ID produces perfectly pure leaves (entropy = 0), yielding maximum information gain. But this attribute has zero predictive value for new instances!
More subtly, even legitimate attributes with more values tend to receive higher information gain simply because more splits allow more entropy reduction.
The Solution: Gain Ratio
C4.5 introduces split information to penalize attributes with many values:
$$SplitInfo(A) = -\sum_{j=1}^{v} \frac{|S_j|}{|S|} \log_2 \frac{|S_j|}{|S|}$$
where $S_j$ is the subset of instances where attribute $A$ takes value $a_j$.
Note: Split information is the entropy of the attribute itself—how much uncertainty there is about which branch an instance takes.
Gain Ratio:
$$GainRatio(A) = \frac{InformationGain(A)}{SplitInfo(A)}$$
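As a quick worked example, take the classic 14-instance weather dataset, where Outlook splits the data 5/4/5 across Sunny/Overcast/Rain (the figures commonly quoted for this dataset):

$$SplitInfo(Outlook) = -\frac{5}{14}\log_2\frac{5}{14} - \frac{4}{14}\log_2\frac{4}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 1.577 \text{ bits}$$

With $InformationGain(Outlook) \approx 0.247$ bits, the gain ratio is $0.247 / 1.577 \approx 0.157$ — a substantial discount relative to the raw gain.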
```python
import numpy as np
from collections import Counter
from typing import Tuple

def entropy(labels: np.ndarray) -> float:
    """Shannon entropy of a label array, in bits (as on the previous page)."""
    n = len(labels)
    return -sum((c / n) * np.log2(c / n) for c in Counter(labels).values())

def information_gain(y: np.ndarray, x: np.ndarray) -> float:
    """Information gain of splitting labels y by attribute x (as on the previous page)."""
    n = len(y)
    return entropy(y) - sum(
        (count / n) * entropy(y[x == value])
        for value, count in Counter(x).items()
    )

def split_information(x: np.ndarray) -> float:
    """
    Compute split information (intrinsic value) of an attribute.

    This measures the entropy of the split itself—how uniformly the
    instances are distributed across attribute values.

    Parameters:
        x: Attribute values (n_samples,)

    Returns:
        Split information in bits
    """
    n = len(x)
    split_info = 0.0
    for count in Counter(x).values():
        p = count / n
        split_info -= p * np.log2(p)
    return split_info

def gain_ratio(y: np.ndarray, x: np.ndarray) -> Tuple[float, float, float]:
    """
    Compute gain ratio for splitting y by attribute x.

    Parameters:
        y: Class labels
        x: Attribute values

    Returns:
        Tuple of (gain_ratio, information_gain, split_info)
    """
    ig = information_gain(y, x)
    si = split_information(x)
    if si == 0:  # Avoid division by zero (single-valued attribute)
        return 0.0, ig, si
    return ig / si, ig, si

# Demonstrate the many-values problem
print("Gain Ratio vs Information Gain: Many-Values Problem")
print("=" * 60)

n = 100

# Attribute 1: Binary, genuinely informative (predicts the class with some noise)
binary_attr = np.array(['A'] * 55 + ['B'] * 45)
# Class labels: roughly balanced (55 Yes / 45 No), mostly following binary_attr
y = np.array(['Yes'] * 50 + ['No'] * 5 + ['No'] * 40 + ['Yes'] * 5)

# Attribute 2: Many values (10 values), essentially uninformative
ten_values = np.array([f'V{i}' for i in range(10)] * 10)

# Attribute 3: Unique ID (100 values) - useless but maximal IG!
unique_id = np.array([f'ID_{i}' for i in range(n)])

attributes = [
    ('Binary (2 values)', binary_attr),
    ('Multi (10 values)', ten_values),
    ('UniqueID (100 values)', unique_id),
]

print(f"{'Attribute':<25} | {'IG':<8} | {'SplitInfo':<10} | {'GainRatio':<10}")
print("-" * 60)
for name, attr in attributes:
    gr, ig, si = gain_ratio(y, attr)
    print(f"{name:<25} | {ig:<8.4f} | {si:<10.4f} | {gr:<10.4f}")

print()
print("Note: UniqueID has the highest IG, but its gain ratio is divided")
print("by log2(100) ≈ 6.64 — the informative binary attribute wins.")
```

Split information measures how 'spread out' the split is. An attribute that splits into many small groups has high split information. Dividing IG by split info normalizes the gain—an attribute with twice the IG but twice the split info (due to more values) gets the same ratio. This creates a fair comparison across attributes with different numbers of values.
Practical Considerations with Gain Ratio:
The Zero Denominator Problem: If an attribute has only one value (all instances identical), split info = 0 and gain ratio is undefined. C4.5 handles this by excluding such attributes.
The Low IG Filter: C4.5 doesn't simply choose the highest gain ratio. An attribute with tiny information gain but also tiny split info could have a high ratio yet be practically useless. C4.5 uses a two-stage process: first, it computes the information gain of every candidate attribute and keeps only those whose gain is at least the average gain over all candidates; then, among those survivors, it selects the attribute with the highest gain ratio.
This prevents selection of low-gain attributes just because they have low split info.
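One common reading of this two-stage rule—filter to attributes with at least average information gain, then maximize gain ratio among the survivors—can be sketched as follows. The helper names (`gain_and_split_info`, `select_attribute`) are illustrative, not C4.5's actual code:

```python
import numpy as np
from collections import Counter

def entropy(labels: np.ndarray) -> float:
    """Shannon entropy of a label array, in bits."""
    n = len(labels)
    return -sum((c / n) * np.log2(c / n) for c in Counter(labels).values())

def gain_and_split_info(y: np.ndarray, x: np.ndarray):
    """Information gain and split information for a categorical attribute."""
    n = len(y)
    ig, si = entropy(y), 0.0
    for value, count in Counter(x).items():
        p = count / n
        ig -= p * entropy(y[x == value])
        si -= p * np.log2(p)
    return ig, si

def select_attribute(y, attributes):
    """Two-stage C4.5-style selection: keep attributes whose gain is at
    least the average gain, then pick the highest gain ratio among them."""
    stats = {name: gain_and_split_info(y, x) for name, x in attributes.items()}
    avg_gain = np.mean([ig for ig, _ in stats.values()])
    candidates = {name: ig / si for name, (ig, si) in stats.items()
                  if ig >= avg_gain and si > 0}
    return max(candidates, key=candidates.get) if candidates else None

# 'near_constant' has a slightly HIGHER raw gain ratio than 'informative'
# (tiny gain divided by tiny split info), but its gain is below average,
# so the filter rejects it and the genuinely informative attribute wins.
y = np.array(['Y'] * 8 + ['N'] * 4)
attributes = {
    'informative':   np.array(['a'] * 3 + ['b'] * 3 + ['c'] * 3 + ['d'] * 3),
    'near_constant': np.array(['p'] * 11 + ['q']),
}
print(select_attribute(y, attributes))  # → informative
```

Without the filter, a near-constant attribute can sneak ahead on ratio alone; the average-gain gate removes it from consideration before the ratio is compared.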
The Challenge:
ID3 requires categorical attributes, but real datasets often contain continuous features like age, temperature, income, or sensor readings. C4.5 handles continuous attributes by discretizing them into binary splits at optimal thresholds.
The Approach:
For a continuous attribute $A$ with values $\{v_1, v_2, \ldots, v_n\}$ (sorted), C4.5: (1) sorts the training instances by their value of $A$; (2) generates candidate thresholds at the midpoints between consecutive distinct values; (3) evaluates the binary split $A \le t$ versus $A > t$ for each candidate $t$; and (4) keeps the threshold with the highest information gain.
Threshold Selection:
Given sorted unique values $v_1 < v_2 < \ldots < v_m$, candidate thresholds are: $$t_i = \frac{v_i + v_{i+1}}{2} \quad \text{for } i = 1, \ldots, m-1$$
```python
import numpy as np
from collections import Counter
from typing import Optional, Tuple

def entropy_from_counts(counts: np.ndarray) -> float:
    """Entropy in bits from an array of class counts (as on the previous page)."""
    n = counts.sum()
    p = counts[counts > 0] / n
    return float(-(p * np.log2(p)).sum())

def find_best_threshold(x: np.ndarray, y: np.ndarray) -> Tuple[Optional[float], float]:
    """
    Find the optimal binary split threshold for a continuous attribute.

    Parameters:
        x: Continuous attribute values (n_samples,)
        y: Class labels (n_samples,)

    Returns:
        Tuple of (best_threshold, best_information_gain)
    """
    unique_values = np.unique(x)
    if len(unique_values) < 2:
        return None, 0.0  # Cannot split a constant attribute

    n = len(y)
    parent_entropy = entropy_from_counts(np.array(list(Counter(y).values())))

    best_threshold = None
    best_gain = -np.inf

    # Consider a threshold between each pair of consecutive unique values
    for i in range(len(unique_values) - 1):
        threshold = (unique_values[i] + unique_values[i + 1]) / 2

        # Create binary split
        left_mask = x <= threshold
        right_mask = ~left_mask
        n_left = np.sum(left_mask)
        n_right = n - n_left
        if n_left == 0 or n_right == 0:
            continue

        # Weighted child entropy
        left_entropy = entropy_from_counts(
            np.array(list(Counter(y[left_mask]).values()))
        )
        right_entropy = entropy_from_counts(
            np.array(list(Counter(y[right_mask]).values()))
        )
        child_entropy = (n_left / n) * left_entropy + (n_right / n) * right_entropy

        gain = parent_entropy - child_entropy
        if gain > best_gain:
            best_gain = gain
            best_threshold = threshold

    return best_threshold, best_gain

# Demonstration with synthetic data
np.random.seed(42)

# Class 0: temperatures mostly below 70
# Class 1: temperatures mostly above 70
temps_class0 = np.random.normal(60, 8, 50)
temps_class1 = np.random.normal(80, 8, 50)

x = np.concatenate([temps_class0, temps_class1])
y = np.array(['Cold'] * 50 + ['Hot'] * 50)

threshold, gain = find_best_threshold(x, y)
print(f"Optimal Temperature Threshold: {threshold:.2f}")
print(f"Information Gain at this threshold: {gain:.4f}")

# Show resulting split quality
left_mask = x <= threshold
print(f"Left branch (≤ {threshold:.1f}): {Counter(y[left_mask])}")
print(f"Right branch (> {threshold:.1f}): {Counter(y[~left_mask])}")
```

Naively, finding the best threshold requires O(n²) time—O(n) thresholds each requiring O(n) entropy computation. However, by sorting once and incrementally updating class counts as we slide the threshold, we achieve O(n log n) for the sort plus O(n) for the scan—efficient enough for practical use.
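The sort-once, incremental-count optimization can be sketched as follows — a minimal version with a hypothetical `best_threshold_fast`, which slides the split boundary through the sorted data and updates left/right class counts in place instead of re-counting per candidate:

```python
import numpy as np

def entropy_bits(counts: np.ndarray) -> float:
    """Entropy in bits from an array of class counts."""
    n = counts.sum()
    if n == 0:
        return 0.0
    p = counts[counts > 0] / n
    return float(-(p * np.log2(p)).sum())

def best_threshold_fast(x: np.ndarray, y: np.ndarray):
    """O(n log n) threshold search: sort once, then slide a boundary while
    incrementally updating left/right class counts."""
    order = np.argsort(x)
    x_sorted = x[order]
    classes, y_idx = np.unique(y[order], return_inverse=True)
    n, k = len(y), len(classes)

    left = np.zeros(k)                                      # counts below boundary
    right = np.bincount(y_idx, minlength=k).astype(float)   # counts above boundary
    parent = entropy_bits(right)

    best_gain, best_t = -np.inf, None
    for i in range(n - 1):
        left[y_idx[i]] += 1            # move one instance to the left side
        right[y_idx[i]] -= 1
        if x_sorted[i] == x_sorted[i + 1]:
            continue                   # no valid threshold between tied values
        n_left = i + 1
        child = (n_left * entropy_bits(left)
                 + (n - n_left) * entropy_bits(right)) / n
        gain = parent - child
        if gain > best_gain:
            best_gain = gain
            best_t = (x_sorted[i] + x_sorted[i + 1]) / 2
    return best_t, best_gain

# Example: classes separate cleanly around 5
x = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
t, g = best_threshold_fast(x, y)
print(t, g)  # → 5.0 1.0
```

Each boundary position costs O(k) for k classes instead of O(n), so the scan is linear in the number of instances.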
Important Details:
Reusing Continuous Attributes: Unlike categorical attributes (which are removed after use in ID3), C4.5 allows continuous attributes to be used multiple times at different thresholds along a path. A tree might split on 'age ≤ 30' at the root, then later split on 'age ≤ 45' in a subtree.
Gain Ratio for Continuous Splits: Binary splits always have split info ≤ 1 bit (maximized when the split is 50/50). This is lower than multi-way splits, giving continuous attributes a potential advantage. Later releases of C4.5 compensate with an MDL-style correction, subtracting roughly $\log_2(m)/n$ (for $m$ candidate thresholds over $n$ instances) from the gain of a continuous split, so the split pays for the cost of encoding its threshold.
Boundary Cases: The optimal threshold often lies exactly at a class boundary—the point where sorted instances switch from one class to another. This observation enables further optimization in production implementations.
Why Pruning Matters:
Unpruned trees grown to purity typically overfit the training data. They memorize noise, capturing spurious patterns that don't generalize. Pruning addresses this by removing subtrees that don't improve generalization.
C4.5 uses error-based pruning, a post-pruning method that grows the full tree first, then prunes back. This is more effective than pre-pruning (stopping early) because it can evaluate the full context before deciding what to remove.
C4.5's Error-Based Pruning:
The key insight is estimating the error rate on unseen data for each node, then pruning subtrees that increase this estimated error. C4.5 uses a statistical approach based on confidence intervals.
```python
import numpy as np
from scipy import stats

def pessimistic_error_estimate(n: int, e: int, confidence: float = 0.25) -> float:
    """
    Compute C4.5's pessimistic error estimate.

    Uses the upper bound of a one-sided confidence interval for a
    binomial proportion.

    Parameters:
        n: Total instances at node
        e: Misclassified instances at node
        confidence: Confidence level (C4.5 default: 0.25)

    Returns:
        Pessimistic error rate estimate
    """
    if n == 0:
        return 0.5

    # Observed error rate
    f = e / n

    # z for the one-sided upper confidence limit (CF = 0.25 gives z ≈ 0.67)
    z = stats.norm.ppf(1 - confidence)

    # Wilson score interval upper bound
    # More accurate than the plain normal approximation for small n
    numerator = f + z*z/(2*n) + z*np.sqrt(f*(1-f)/n + z*z/(4*n*n))
    denominator = 1 + z*z/n
    return numerator / denominator

def should_prune(leaf_n: int, leaf_errors: int,
                 subtree_pessimistic_errors: float,
                 confidence: float = 0.25) -> bool:
    """
    Determine if a subtree should be replaced with a leaf.

    The subtree is pruned when the leaf's pessimistic error count does
    not exceed the summed pessimistic error counts of the subtree's
    leaves (in practice, computed recursively for each leaf).
    """
    leaf_estimate = leaf_n * pessimistic_error_estimate(leaf_n, leaf_errors, confidence)
    return leaf_estimate <= subtree_pessimistic_errors

# Example: Pruning decision
print("C4.5 Pruning Decision Example")
print("=" * 50)

# Scenario: Node with 100 instances
# As leaf: predicts majority class, 20 errors
# As subtree: 5 leaves with 15 total observed errors

leaf_n = 100
leaf_errors = 20

print(f"Node as leaf: {leaf_n} instances, {leaf_errors} errors")
print(f"Observed error rate: {leaf_errors/leaf_n:.1%}")

# Pessimistic estimate
pessimistic = pessimistic_error_estimate(leaf_n, leaf_errors)
print(f"Pessimistic error estimate: {pessimistic:.1%}")
print(f"Expected errors (pessimistic): {leaf_n * pessimistic:.1f}")

print()
print("Subtree has 5 leaves with 15 total observed errors")
print("Each leaf's pessimistic estimate adds its own overhead...")
print()
print("If pessimistic leaf errors ≤ pessimistic subtree errors → PRUNE")
```

The Pruning Algorithm:
C4.5 prunes bottom-up, visiting each internal node after its subtrees have been processed. At every node it compares the estimated errors of three options: (1) keeping the subtree as it stands (summing pessimistic errors over its leaves), (2) replacing the node with a leaf predicting the majority class, and (3) replacing the node with its most frequently used branch (subtree raising). The option with the lowest estimated error is kept.
The Confidence Parameter:
C4.5 uses a default confidence level of 25% (CF=0.25), which produces moderately aggressive pruning. Lower CF → more pruning (simpler trees). Higher CF → less pruning (more complex trees). This is a hyperparameter to tune.
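To see the CF knob concretely, the sketch below (a hypothetical `pessimistic_rate` helper, assuming the one-sided Wilson upper bound) evaluates the same 100-instance, 20-error node at several confidence levels:

```python
import math
from statistics import NormalDist

def pessimistic_rate(n: int, e: int, cf: float) -> float:
    """One-sided Wilson upper confidence limit on the error rate e/n."""
    f = e / n
    z = NormalDist().inv_cdf(1 - cf)
    return (f + z*z/(2*n) + z*math.sqrt(f*(1-f)/n + z*z/(4*n*n))) / (1 + z*z/n)

# Same node (100 instances, 20 errors) under different confidence levels
for cf in (0.05, 0.25, 0.50):
    print(f"CF={cf:.2f}: pessimistic error = {pessimistic_rate(100, 20, cf):.1%}")
# Lower CF inflates the estimate more, so more subtrees look unhelpful → more pruning.
```

At CF=0.50 the bound collapses to the observed rate (z = 0); as CF shrinks, the estimate climbs and pruning becomes more aggressive.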
The Challenge:
Real datasets frequently have missing values. ID3 simply cannot handle them—instances with missing values are excluded. C4.5 introduces a principled approach using fractional instances.
C4.5's Approach:
When computing information gain for an attribute with missing values, C4.5 (1) computes the gain using only the instances whose value is known, and (2) multiplies the result by the fraction of instances with known values—so attributes that are frequently missing are penalized.
When an instance with a missing value reaches a split node during training or prediction, it is not sent down a single branch. Instead, it is split into fractional instances: a weighted copy is sent down every branch, with each weight equal to the fraction of known-valued training instances that took that branch.
```python
import numpy as np
from collections import Counter
from typing import Dict

def entropy(labels: np.ndarray) -> float:
    """Shannon entropy of a label array, in bits (as on the previous page)."""
    n = len(labels)
    return -sum((c / n) * np.log2(c / n) for c in Counter(labels).values())

def information_gain(y: np.ndarray, x: np.ndarray) -> float:
    """Information gain of splitting labels y by attribute x (as on the previous page)."""
    n = len(y)
    return entropy(y) - sum(
        (count / n) * entropy(y[x == value])
        for value, count in Counter(x).items()
    )

def information_gain_with_missing(y: np.ndarray, x: np.ndarray) -> float:
    """
    Compute information gain handling missing values (marked as None or NaN).

    C4.5 approach:
    1. Compute IG using only instances with known values
    2. Multiply by the fraction of known values
    """
    # Identify known values
    if x.dtype == object:
        known_mask = np.array([v is not None and v != '' for v in x])
    else:
        known_mask = ~np.isnan(x)

    n_total = len(y)
    n_known = np.sum(known_mask)
    if n_known == 0:
        return 0.0

    # Compute IG on the known subset, then adjust by the fraction known
    ig_known = information_gain(y[known_mask], x[known_mask])
    return ig_known * (n_known / n_total)

def distribute_fractional(instance_weight: float,
                          branch_probs: Dict[str, float]) -> Dict[str, float]:
    """
    Distribute a fractional instance across branches.

    Parameters:
        instance_weight: Current weight of the instance
        branch_probs: Probability of each branch (from training proportions)

    Returns:
        Dictionary mapping branch values to fractional weights
    """
    return {branch: instance_weight * prob
            for branch, prob in branch_probs.items()}

# Example demonstration
print("C4.5 Missing Value Handling Example")
print("=" * 50)

# Dataset with missing values
outlook = np.array(['Sunny', 'Sunny', None, 'Rain', 'Rain', None, 'Overcast',
                    'Sunny', 'Sunny', 'Rain', None, 'Overcast', 'Overcast', 'Rain'])
play = np.array(['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
                 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No'])

print("Dataset has 3 missing Outlook values (shown as None)")
print()

# IG computed on the known subset only (what ID3 would do at best)
known_mask = np.array([v is not None for v in outlook])
ig_known_only = information_gain(play[known_mask], outlook[known_mask])
print(f"IG using only known values: {ig_known_only:.4f}")

# C4.5 adjusted IG
fraction_known = np.sum(known_mask) / len(outlook)
ig_adjusted = ig_known_only * fraction_known
print(f"Fraction known: {fraction_known:.2%}")
print(f"C4.5 adjusted IG: {ig_adjusted:.4f}")

# Branch probabilities for fractional distribution
print()
print("When classifying an instance with missing Outlook:")
branch_probs = {'Sunny': 4/11, 'Overcast': 3/11, 'Rain': 4/11}
fractional = distribute_fractional(1.0, branch_probs)
print(f"Instance distributed: {fractional}")
```

Think of fractional instances as 'hedging your bets.' If an instance could belong to any branch (because the splitting attribute is missing), distribute it proportionally. During prediction, this becomes a weighted voting scheme—each branch contributes according to its probability, and the final prediction aggregates these weighted votes.
Here's a comprehensive C4.5 implementation incorporating all the innovations we've discussed:
```python
import numpy as np
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple
from collections import Counter

@dataclass
class C45Node:
    """Node in a C4.5 decision tree."""
    is_leaf: bool = False
    prediction: Optional[str] = None
    class_distribution: Optional[Dict[str, float]] = None

    # For internal nodes
    attribute_idx: Optional[int] = None
    is_continuous: bool = False
    threshold: Optional[float] = None  # For continuous attributes
    children: Dict[str, 'C45Node'] = field(default_factory=dict)

    # For pruning
    n_samples: int = 0
    n_errors: int = 0

class C45Classifier:
    """
    C4.5 Decision Tree Classifier.

    Implements gain ratio, continuous attributes, and missing-value
    adjusted gains; error-based pruning would be applied after fit().
    """

    def __init__(self, min_samples_split: int = 2,
                 min_gain_ratio: float = 0.0,
                 confidence: float = 0.25):
        self.root: Optional[C45Node] = None
        self.min_samples_split = min_samples_split
        self.min_gain_ratio = min_gain_ratio
        self.confidence = confidence

    # --- Entropy helpers (mirroring the earlier standalone functions) ---

    @staticmethod
    def _entropy(y: np.ndarray) -> float:
        n = len(y)
        return -sum((c / n) * np.log2(c / n) for c in Counter(y).values())

    def _information_gain(self, y: np.ndarray, x: np.ndarray) -> float:
        n = len(y)
        return self._entropy(y) - sum(
            (count / n) * self._entropy(y[x == value])
            for value, count in Counter(x).items()
        )

    @staticmethod
    def _split_information(x: np.ndarray) -> float:
        n = len(x)
        return -sum((c / n) * np.log2(c / n) for c in Counter(x).values())

    @staticmethod
    def _is_missing(x: np.ndarray) -> np.ndarray:
        """Boolean mask of missing entries (None, '', or NaN)."""
        if x.dtype == object:
            return np.array([v is None or v == '' for v in x])
        return np.isnan(x.astype(float))

    def _is_continuous(self, x: np.ndarray) -> bool:
        """Heuristic: numeric attributes with many distinct values are continuous."""
        try:
            x_numeric = x.astype(float)
            return len(np.unique(x_numeric[~np.isnan(x_numeric)])) > 10
        except (ValueError, TypeError):
            return False

    def _find_best_threshold(self, x: np.ndarray,
                             y: np.ndarray) -> Tuple[Optional[float], float]:
        """Best midpoint threshold by information gain (known values only)."""
        known = ~self._is_missing(x)
        xv, yv = x[known].astype(float), y[known]
        values = np.unique(xv)
        best_t, best_gain = None, -np.inf
        for i in range(len(values) - 1):
            t = (values[i] + values[i + 1]) / 2
            gain = self._information_gain(yv, np.where(xv <= t, 'left', 'right'))
            if gain > best_gain:
                best_t, best_gain = t, gain
        return best_t, best_gain

    def _compute_gain_ratio(self, y: np.ndarray, x: np.ndarray,
                            threshold: Optional[float] = None) -> float:
        """Compute gain ratio, handling missing values and continuous splits."""
        known_mask = ~self._is_missing(x)
        n_total = len(y)
        n_known = np.sum(known_mask)
        if n_known < self.min_samples_split:
            return 0.0

        y_known, x_known = y[known_mask], x[known_mask]

        # For continuous attributes: reduce to a binary left/right split
        if threshold is not None:
            x_known = np.where(x_known.astype(float) <= threshold, 'left', 'right')

        ig = self._information_gain(y_known, x_known)
        si = self._split_information(x_known)

        # Penalize attributes with many missing values
        ig *= (n_known / n_total)

        return 0.0 if si == 0 else ig / si

    def _find_best_split(self, X: np.ndarray, y: np.ndarray,
                         available: List[int]) -> Tuple[Optional[int], Optional[float], float]:
        """Find the best attribute (and threshold, if continuous) to split on."""
        best_ratio, best_attr, best_threshold = -np.inf, None, None

        for attr_idx in available:
            x = X[:, attr_idx]
            if self._is_continuous(x):
                thresh, _ = self._find_best_threshold(x, y)
                if thresh is None:
                    continue
                ratio = self._compute_gain_ratio(y, x, thresh)
                if ratio > best_ratio:
                    best_ratio, best_attr, best_threshold = ratio, attr_idx, thresh
            else:
                ratio = self._compute_gain_ratio(y, x)
                if ratio > best_ratio:
                    best_ratio, best_attr, best_threshold = ratio, attr_idx, None

        return best_attr, best_threshold, best_ratio

    def _build_tree(self, X: np.ndarray, y: np.ndarray,
                    available: List[int], depth: int = 0) -> C45Node:
        """Recursively build the C4.5 tree."""
        n_samples = len(y)
        class_counts = Counter(y)
        majority_class = class_counts.most_common(1)[0][0]
        errors = n_samples - class_counts[majority_class]

        # Base cases: pure node, no attributes left, or too few samples
        if (len(class_counts) == 1 or len(available) == 0
                or n_samples < self.min_samples_split):
            return C45Node(is_leaf=True, prediction=majority_class,
                           class_distribution=dict(class_counts),
                           n_samples=n_samples, n_errors=errors)

        # Find best split
        best_attr, threshold, ratio = self._find_best_split(X, y, available)
        if best_attr is None or ratio <= self.min_gain_ratio:
            return C45Node(is_leaf=True, prediction=majority_class,
                           class_distribution=dict(class_counts),
                           n_samples=n_samples, n_errors=errors)

        # Create internal node and recurse
        is_cont = threshold is not None
        node = C45Node(attribute_idx=best_attr, is_continuous=is_cont,
                       threshold=threshold, n_samples=n_samples,
                       n_errors=errors, class_distribution=dict(class_counts))

        # Determine child subsets (fractional routing of missing values
        # is omitted in this sketch: instances with missing values are dropped)
        x = X[:, best_attr]
        known = ~self._is_missing(x)
        if is_cont:
            xf = np.where(known, x, np.nan).astype(float)
            masks = {'left': known & (xf <= threshold),
                     'right': known & (xf > threshold)}
        else:
            masks = {str(v): x == v for v in np.unique(x[known])}

        # Categorical attributes are consumed; continuous ones can be reused
        new_available = available if is_cont else [i for i in available if i != best_attr]

        for value, mask in masks.items():
            if np.sum(mask) > 0:
                node.children[value] = self._build_tree(
                    X[mask], y[mask], new_available, depth + 1
                )

        return node

    def fit(self, X: np.ndarray, y: np.ndarray) -> 'C45Classifier':
        """Build the decision tree from training data."""
        self.root = self._build_tree(X, y, list(range(X.shape[1])))
        # Error-based pruning would be applied here
        return self

print("C4.5 Classifier implemented with all core features")
```

What's Next:
While C4.5 addressed ID3's limitations, a parallel development was occurring. In 1984, Breiman and colleagues introduced CART (Classification and Regression Trees)—an algorithm that took a different approach with binary splits, Gini impurity, and native support for regression. The next page explores CART's innovations and compares it with C4.5.
You now understand C4.5's innovations: gain ratio for fair attribute comparison, optimal thresholds for continuous features, error-based pruning for generalization, and fractional instances for missing values. Next, we'll explore CART's alternative approach to tree construction.