In 1984, two years before Quinlan published ID3, Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone published Classification and Regression Trees—a book that introduced CART, an algorithm born from the statistical tradition rather than the machine learning community.
While Quinlan was developing information-theoretic approaches in Australia, Breiman's group at Berkeley was constructing trees from a statistical perspective. The result was profoundly different in philosophy and implementation.
CART's influence is immense. It's the algorithm underlying scikit-learn's DecisionTreeClassifier and DecisionTreeRegressor, and its binary splitting approach is fundamental to all modern ensemble methods.
By the end of this page, you will understand: (1) Gini impurity and why it's preferred over entropy in many implementations; (2) The mathematical elegance of binary splits; (3) Cost-complexity pruning with cross-validation; (4) How CART handles regression through variance reduction; (5) Surrogate splits for missing values; (6) Complete CART implementation.
Definition:
For a node with class distribution $p_1, p_2, \ldots, p_K$, the Gini impurity is:
$$Gini = 1 - \sum_{k=1}^{K} p_k^2 = \sum_{k \neq k'} p_k p_{k'}$$
Interpretation:
Gini impurity measures the probability that a randomly chosen element would be incorrectly classified if labeled according to the class distribution. It's the expected error rate of a random classifier that assigns labels proportionally.
For binary classification ($K=2$): $$Gini = 1 - p^2 - (1-p)^2 = 2p(1-p)$$
This reaches maximum (0.5) when $p = 0.5$ and minimum (0) when $p = 0$ or $p = 1$.
```python
import numpy as np
from typing import Union, List

def gini_impurity(probs: Union[np.ndarray, List[float]]) -> float:
    """
    Compute Gini impurity of a probability distribution.

    Parameters:
        probs: Array of class probabilities (must sum to 1)

    Returns:
        Gini impurity (0 to 1 - 1/K)
    """
    probs = np.asarray(probs)
    return 1.0 - np.sum(probs ** 2)

def gini_from_counts(counts: Union[np.ndarray, List[int]]) -> float:
    """Compute Gini impurity from class counts."""
    counts = np.asarray(counts)
    total = np.sum(counts)
    if total == 0:
        return 0.0
    return gini_impurity(counts / total)

def entropy(probs: np.ndarray) -> float:
    """Compute entropy for comparison (zero probabilities contribute 0)."""
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

# Compare Gini and entropy across the probability range
print("Gini Impurity vs Entropy (Binary Classification)")
print("=" * 55)
print(f"{'p(class=1)':<12} | {'Gini':<10} | {'Entropy':<10} | {'Ratio':<10}")
print("-" * 55)

for p in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]:
    probs = np.array([p, 1 - p])
    g = gini_impurity(probs)
    h = entropy(probs)
    ratio = g / h if h > 0 else 0
    print(f"{p:<12.1f} | {g:<10.4f} | {h:<10.4f} | {ratio:<10.4f}")

print()
print("Key insight: Gini and entropy have very similar shapes,")
print("but Gini is computationally cheaper (no logarithms).")
```

In practice, Gini and entropy produce nearly identical trees. The choice rarely matters for final model quality—scikit-learn defaults to Gini because it's faster. The important insight is that both measure impurity and reach their minimum (maximum purity) when all instances belong to one class.
CART's Fundamental Choice:
Unlike ID3/C4.5 which can create multi-way splits (one branch per attribute value), CART uses binary splits exclusively. This seemingly simple choice has profound implications:
For Categorical Attributes: CART searches for the optimal subset of values to go left vs. right: $$\text{Find subset } S \subseteq \{v_1, v_2, \ldots, v_k\} \text{ that maximizes impurity reduction}$$
For a categorical attribute with $k$ values, there are $2^{k-1} - 1$ possible binary splits to consider.
For Continuous Attributes: Same as C4.5—find optimal threshold $t$: left if $x \leq t$, right otherwise.
For Ordinal Attributes: Binary splits respect ordering: split at some value, all lesser values go left.
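As a tiny illustration with made-up data (the shirt-size attribute below is hypothetical), an ordinal attribute with $k$ ordered levels admits only $k-1$ binary splits, one per cut point between adjacent levels:

```python
import numpy as np

# Hypothetical ordinal attribute: shirt sizes with a natural order
sizes = np.array(['S', 'M', 'M', 'L', 'XL', 'S', 'L'])
order = ['S', 'M', 'L', 'XL']

# Encode each value by its rank so thresholds respect the ordering
rank = np.array([order.index(v) for v in sizes])

# Only k - 1 candidate binary splits: one cut between each adjacent pair
k = len(order)
for cut in range(k - 1):
    left = np.sum(rank <= cut)
    right = np.sum(rank > cut)
    print(f"{{{', '.join(order[:cut + 1])}}} vs rest: {left} left, {right} right")
```

Compare this with the $2^{k-1} - 1$ subset splits a purely categorical treatment would require: respecting the order collapses the search to a simple threshold scan.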
```python
import numpy as np
from itertools import combinations
from typing import Tuple, FrozenSet

def find_best_categorical_split(x: np.ndarray, y: np.ndarray) -> Tuple[FrozenSet, float]:
    """
    Find the optimal binary split for a categorical attribute.

    For k unique values, examines up to 2^(k-1) - 1 possible splits.

    Parameters:
        x: Categorical attribute values
        y: Class labels

    Returns:
        Tuple of (left_values_set, gini_reduction)
    """
    unique_values = np.unique(x)
    classes = np.unique(y)
    k = len(unique_values)

    if k == 1:
        return frozenset(), 0.0

    # Compute parent Gini
    parent_counts = np.array([np.sum(y == c) for c in classes])
    parent_gini = gini_from_counts(parent_counts)

    best_split = None
    best_reduction = -np.inf
    n = len(y)

    # Only subsets up to half the values need checking:
    # complementary subsets produce the same split
    for size in range(1, k // 2 + 1):
        for left_values in combinations(unique_values, size):
            left_mask = np.isin(x, left_values)
            right_mask = ~left_mask
            n_left = np.sum(left_mask)
            n_right = np.sum(right_mask)

            if n_left == 0 or n_right == 0:
                continue

            # Weighted child Gini
            left_counts = np.array([np.sum(y[left_mask] == c) for c in classes])
            right_counts = np.array([np.sum(y[right_mask] == c) for c in classes])
            weighted_child_gini = (
                (n_left / n) * gini_from_counts(left_counts)
                + (n_right / n) * gini_from_counts(right_counts)
            )

            reduction = parent_gini - weighted_child_gini
            if reduction > best_reduction:
                best_reduction = reduction
                best_split = frozenset(left_values)

    return best_split, best_reduction

# Example with a categorical attribute
colors = np.array(['Red', 'Red', 'Blue', 'Blue', 'Green', 'Green',
                   'Red', 'Blue', 'Green', 'Yellow'])
buy = np.array(['Yes', 'Yes', 'No', 'No', 'Yes', 'No',
                'Yes', 'No', 'No', 'Yes'])

left_set, reduction = find_best_categorical_split(colors, buy)
print(f"Best split: {left_set} vs rest")
print(f"Gini reduction: {reduction:.4f}")
```

Why Binary Splits?
Consistent structure: Every node has exactly 2 children, simplifying tree traversal and analysis
Handles all attribute types uniformly: Continuous, categorical, and ordinal attributes all produce binary splits
No attribute exhaustion: Categorical attributes can be reused (different subsets) at deeper nodes
Optimal for regression: Variance reduction naturally applies to binary partitions
Required for ensembles: Random forests and boosting implementations assume binary trees
The Exponential Challenge:
For categorical attributes with many values (say, 20 US states), there are $2^{19} - 1 \approx 500{,}000$ possible binary splits. CART uses a key optimization:
For binary classification, the optimal split on a categorical attribute with $k$ values can be found efficiently: sort the categories by $P(\text{class}=1 \mid \text{category})$, then consider only the $k-1$ threshold splits along that ordering. This reduces the exponential search to $O(k \log k)$—a beautiful result from the original CART book.
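A quick numerical check of this result, as a sketch with synthetic data (the category probabilities below are made up): brute-force search over all subsets and the sorted prefix scan find the same best split.

```python
import numpy as np
from itertools import combinations

def weighted_gini(y: np.ndarray, mask: np.ndarray) -> float:
    """Weighted Gini impurity of the two children induced by mask."""
    n = len(y)
    total = 0.0
    for side in (mask, ~mask):
        m = side.sum()
        if m == 0:
            return np.inf  # degenerate split: one side empty
        p = y[side].mean()               # binary labels: p = P(class=1 | side)
        total += (m / n) * 2 * p * (1 - p)
    return total

rng = np.random.default_rng(0)
cats = rng.integers(0, 6, size=300)          # categorical attribute, k = 6
y = rng.random(300) < (0.1 + 0.12 * cats)    # class probability depends on category

values = np.unique(cats)

# Brute force: every subset-vs-complement split (exponential in k)
best_brute = min(
    weighted_gini(y, np.isin(cats, s))
    for r in range(1, len(values))
    for s in combinations(values, r)
)

# The CART trick: sort categories by P(class=1 | category);
# the optimal subset is always a prefix of this ordering
p_hat = np.array([y[cats == v].mean() for v in values])
ordered = values[np.argsort(p_hat)]
best_prefix = min(
    weighted_gini(y, np.isin(cats, ordered[:i]))
    for i in range(1, len(values))
)

print(f"brute force: {best_brute:.6f}, prefix scan: {best_prefix:.6f}")
```

The brute force examines $2^{k-1} - 1 = 31$ subsets; the prefix scan examines only $k - 1 = 5$, yet both reach the same minimum weighted Gini.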
The Regression Setting:
For regression, the target $y$ is continuous. CART predictions are piecewise constant: each leaf predicts the mean of training instances in that leaf.
Splitting Criterion: Variance Reduction
The goal is to create splits that reduce variance in the child nodes. For a node $t$ with instances $S$:
$$\text{Variance}(S) = \frac{1}{|S|} \sum_{i \in S} (y_i - \bar{y}_S)^2$$
where $\bar{y}_S$ is the mean of $y$ values in $S$.
Variance Reduction for a Split: $$\Delta V = \text{Var}(S) - \frac{|S_L|}{|S|}\text{Var}(S_L) - \frac{|S_R|}{|S|}\text{Var}(S_R)$$
Select the split that maximizes variance reduction (or equivalently, minimizes weighted child variance).
```python
import numpy as np
from typing import Tuple

def variance_reduction(y: np.ndarray, x: np.ndarray, threshold: float) -> float:
    """
    Compute variance reduction for a binary split.

    Parameters:
        y: Target values (continuous)
        x: Feature values
        threshold: Split threshold

    Returns:
        Variance reduction
    """
    left_mask = x <= threshold
    right_mask = ~left_mask

    n = len(y)
    n_left = np.sum(left_mask)
    n_right = np.sum(right_mask)

    if n_left == 0 or n_right == 0:
        return 0.0

    # Parent variance
    parent_var = np.var(y)

    # Child variances
    left_var = np.var(y[left_mask]) if n_left > 1 else 0
    right_var = np.var(y[right_mask]) if n_right > 1 else 0

    # Weighted child variance
    weighted_child_var = (n_left / n) * left_var + (n_right / n) * right_var

    return parent_var - weighted_child_var

def find_best_regression_split(X: np.ndarray, y: np.ndarray,
                               feature_idx: int) -> Tuple[float, float]:
    """Find the optimal split threshold for one feature."""
    x = X[:, feature_idx]

    # Candidate thresholds: midpoints between consecutive unique values
    unique_values = np.unique(x)

    best_threshold = None
    best_reduction = -np.inf

    for i in range(len(unique_values) - 1):
        threshold = (unique_values[i] + unique_values[i + 1]) / 2
        reduction = variance_reduction(y, x, threshold)
        if reduction > best_reduction:
            best_reduction = reduction
            best_threshold = threshold

    return best_threshold, best_reduction

# Example: housing price prediction
np.random.seed(42)
n = 200

# Features: square footage and age
sqft = np.random.uniform(1000, 4000, n)
age = np.random.uniform(0, 50, n)

# Price = base + sqft effect + age penalty + noise
# Non-linear: larger homes carry a premium, older homes a discount
price = (
    100000                               # Base price
    + 150 * sqft                         # Per sqft
    + 50 * np.maximum(sqft - 2500, 0)    # Premium for large homes
    - 2000 * age                         # Age penalty
    + np.random.normal(0, 30000, n)      # Noise
)

X = np.column_stack([sqft, age])

# Find the best split for each feature
for i, name in enumerate(['Square Footage', 'Age']):
    thresh, reduction = find_best_regression_split(X, price, i)
    print(f"{name}: threshold = {thresh:.1f}, variance reduction = {reduction:.2e}")
```

Minimizing variance in the children is equivalent to minimizing the mean squared error (MSE) of predictions: since each leaf predicts the mean, a leaf's MSE equals its variance. scikit-learn's 'squared_error' criterion (formerly 'mse') is computationally identical to variance reduction.
Other Regression Criteria:
While variance/MSE is standard, CART supports alternatives:
| Criterion | Formula | Use Case |
|---|---|---|
| MSE | $\frac{1}{n}\sum(y_i - \bar{y})^2$ | Standard regression |
| MAE | $\frac{1}{n}\sum_i \lvert y_i - \tilde{y} \rvert$, where $\tilde{y}$ is the median | Robust regression |
| Friedman MSE | MSE with improvement weight | Gradient boosting |
| Poisson | Deviance for count data | Count regression |
MAE (Mean Absolute Error) uses median instead of mean for predictions, providing robustness to outliers at the cost of computational efficiency (median is harder to update incrementally).
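A small numeric check of why the criterion determines the leaf prediction (the leaf values below, including one outlier, are made up): among constant predictions, the mean minimizes squared error while the median minimizes absolute error.

```python
import numpy as np

# Leaf targets with one large outlier (made-up values)
y = np.array([1.0, 2.0, 2.5, 3.0, 100.0])

# Grid of candidate constant predictions
candidates = np.linspace(0, 110, 2201)     # step 0.05
mse = np.array([np.mean((y - c) ** 2) for c in candidates])
mae = np.array([np.mean(np.abs(y - c)) for c in candidates])

best_mse = candidates[np.argmin(mse)]
best_mae = candidates[np.argmin(mae)]

print(f"mean   = {np.mean(y):.2f}, MSE-optimal constant = {best_mse:.2f}")
print(f"median = {np.median(y):.2f}, MAE-optimal constant = {best_mae:.2f}")
```

The outlier drags the MSE-optimal prediction far above the typical values, while the MAE-optimal prediction stays at the median: the robustness trade-off in miniature.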
The Pruning Framework:
CART's cost-complexity pruning (also called minimal cost-complexity or weakest link pruning) provides a principled way to find the optimal tree size. It defines a cost function balancing accuracy and complexity:
$$R_\alpha(T) = R(T) + \alpha |\tilde{T}|$$
where $R(T)$ is the tree's training error (misclassification rate for classification, MSE for regression), $|\tilde{T}|$ is the number of terminal nodes (leaves), and $\alpha \geq 0$ is the complexity parameter that penalizes tree size.
The Algorithm:
1. Grow a large tree $T_0$ with minimal stopping criteria.
2. For each internal node, compute the effective alpha at which pruning its subtree to a leaf becomes worthwhile.
3. Prune the weakest link (the node with the smallest effective alpha), producing the next tree in a nested sequence $T_0 \supset T_1 \supset \cdots$.
4. Repeat until only the root remains, then select among the sequence using cross-validation.
```python
import numpy as np
from typing import List, Tuple
from dataclasses import dataclass

@dataclass
class PruningResult:
    """Result of cost-complexity pruning analysis."""
    alpha: float        # Complexity parameter
    tree_size: int      # Number of leaves
    train_error: float  # Training error
    cv_error: float     # Cross-validation error
    cv_std: float       # CV error standard deviation

def compute_effective_alpha(node_error: float, subtree_error: float,
                            n_leaves: int) -> float:
    """
    Compute the effective alpha for pruning a subtree.

    This is the alpha at which it becomes worth pruning the
    subtree to a single leaf (weakest link principle).

    Parameters:
        node_error: Error if the node becomes a leaf
        subtree_error: Total error of the subtree's leaves
        n_leaves: Number of leaves in the subtree

    Returns:
        Effective alpha (complexity parameter threshold)
    """
    if n_leaves <= 1:
        return np.inf  # Already a leaf

    # At this alpha, the cost of the leaf equals the cost of the subtree:
    #   R(t) + alpha = R(T_t) + alpha * |T_t|
    #   alpha * (|T_t| - 1) = R(t) - R(T_t)
    #   alpha = (R(t) - R(T_t)) / (|T_t| - 1)
    return (node_error - subtree_error) / (n_leaves - 1)

def find_pruning_sequence(errors: List[Tuple[float, float, int]]) -> List[float]:
    """
    Given (node error, subtree error, leaf count) triples, find the
    sequence of effective alphas for weakest link pruning.
    """
    alphas = []
    for node_err, subtree_err, n_leaves in errors:
        alpha = compute_effective_alpha(node_err, subtree_err, n_leaves)
        if alpha < np.inf:
            alphas.append(alpha)
    return sorted(set(alphas))

# Demonstration of the cost-complexity concept
print("Cost-Complexity Pruning Intuition")
print("=" * 55)

# Example: a subtree with 5 leaves
# Training error if kept as subtree: 10 misclassifications
# Training error if pruned to a leaf: 25 misclassifications
node_error = 25     # Error if this node becomes a leaf
subtree_error = 10  # Total error across 5 leaves
n_leaves = 5

alpha_crit = compute_effective_alpha(node_error, subtree_error, n_leaves)

print(f"Subtree: {n_leaves} leaves, {subtree_error} errors")
print(f"As leaf: 1 leaf, {node_error} errors")
print(f"Critical alpha: {alpha_crit:.2f}")
print()
print(f"When alpha < {alpha_crit:.2f}: keep subtree (complexity penalty low)")
print(f"When alpha > {alpha_crit:.2f}: prune to leaf (complexity too costly)")
```

Cross-Validation for Alpha Selection:
The pruning sequence gives us candidate trees, but which is best? CART uses k-fold cross-validation:
The 1-SE Rule:
The 1-standard-error rule prefers simpler models when they're statistically indistinguishable from the best. If the minimum CV error is $E^*$ with standard error $SE^*$, choose the simplest tree whose error is at most $E^* + SE^*$. This provides regularization and often generalizes better.
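A minimal sketch of the 1-SE rule in action, using hypothetical cross-validation results (the alphas, leaf counts, and errors below are made up for illustration):

```python
import numpy as np

# Hypothetical CV results for a pruning sequence (alpha ascending,
# so trees shrink); cv_err is mean CV error, cv_se its standard error
alphas = np.array([0.000, 0.005, 0.010, 0.020, 0.050])
leaves = np.array([40,    25,    12,    6,     3])
cv_err = np.array([0.210, 0.190, 0.185, 0.195, 0.260])
cv_se  = np.array([0.015, 0.014, 0.013, 0.014, 0.018])

# Minimum-CV-error tree and the 1-SE threshold around it
best = int(np.argmin(cv_err))
threshold = cv_err[best] + cv_se[best]

# 1-SE rule: simplest tree (fewest leaves) whose error <= min error + 1 SE
within = np.where(cv_err <= threshold)[0]
one_se = int(within[np.argmin(leaves[within])])

print(f"min-error tree: alpha={alphas[best]}, {leaves[best]} leaves")
print(f"1-SE rule tree: alpha={alphas[one_se]}, {leaves[one_se]} leaves")
```

Here the minimum-error tree has 12 leaves, but the 6-leaf tree's error (0.195) lies within one standard error of the minimum (0.185 + 0.013), so the 1-SE rule picks the smaller tree.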
In scikit-learn, set ccp_alpha parameter to control pruning. Use cost_complexity_pruning_path() to get the (alpha, impurities) pairs for all subtrees, then cross-validate to find optimal alpha. The default ccp_alpha=0 means no pruning.
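A short sketch of that workflow using scikit-learn's documented API (the dataset choice and cv=5 here are arbitrary, and for brevity we cross-validate on the same data used to compute the path):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Pruning path: every alpha at which another subtree gets pruned away
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
ccp_alphas = path.ccp_alphas[:-1]   # the last alpha prunes the tree to a bare root

# Cross-validate each candidate alpha and keep the best
scores = [
    cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a),
                    X, y, cv=5).mean()
    for a in ccp_alphas
]
best_alpha = ccp_alphas[int(np.argmax(scores))]
print(f"{len(ccp_alphas)} candidate alphas, best alpha = {best_alpha:.5f}")
```

Fitting a final DecisionTreeClassifier with ccp_alpha=best_alpha then yields the pruned tree; the 1-SE rule can be layered on top by tracking the per-fold standard errors as well.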
The Surrogate Split Concept:
CART handles missing values through surrogate splits—backup splitting rules that mimic the primary split using other attributes.
Construction:
After the primary split is chosen, CART searches the remaining attributes for the binary split that best agrees with the primary split's left/right assignments. Candidate surrogates are ranked by agreement rate, keeping only those that beat the baseline of sending every instance in the majority direction.

Usage:
When an instance has a missing value for the primary split attribute:
1. Try the first (best) surrogate; if that attribute is also missing, try the next surrogate.
2. If all surrogates are missing, send the instance in the majority direction.
```python
import numpy as np
from typing import List, Tuple

def find_surrogate_splits(X: np.ndarray, primary_direction: np.ndarray,
                          primary_attr: int,
                          max_surrogates: int = 3) -> List[Tuple[int, float, float]]:
    """
    Find surrogate splits that predict the primary split direction.

    Parameters:
        X: Feature matrix
        primary_direction: Boolean array (True = left, False = right)
        primary_attr: Index of the primary splitting attribute
        max_surrogates: Maximum number of surrogates to return

    Returns:
        List of (attribute_index, threshold, agreement_rate)
    """
    n_samples, n_features = X.shape
    surrogates = []

    # Baseline: agreement achieved by always predicting the majority direction
    majority_left = np.mean(primary_direction)
    baseline = max(majority_left, 1 - majority_left)

    for attr in range(n_features):
        if attr == primary_attr:
            continue

        x = X[:, attr]

        # Skip attributes with too few observed values
        valid = ~np.isnan(x)
        if np.sum(valid) < 10:
            continue

        x_valid = x[valid]
        dir_valid = primary_direction[valid]

        # Find the threshold that best predicts the primary direction
        unique_vals = np.unique(x_valid)
        best_thresh = None
        best_agree = baseline  # Must beat the baseline

        for i in range(len(unique_vals) - 1):
            thresh = (unique_vals[i] + unique_vals[i + 1]) / 2

            # Predict left if x <= thresh
            pred_left = x_valid <= thresh
            agree = np.mean(pred_left == dir_valid)

            # Also try the reverse rule (left if x > thresh)
            agree_rev = np.mean((~pred_left) == dir_valid)

            if agree > best_agree:
                best_agree = agree
                best_thresh = thresh
            if agree_rev > best_agree:
                best_agree = agree_rev
                best_thresh = thresh

        if best_thresh is not None and best_agree > baseline:
            surrogates.append((attr, best_thresh, best_agree))

    # Sort by agreement rate and return the top surrogates
    surrogates.sort(key=lambda s: s[2], reverse=True)
    return surrogates[:max_surrogates]

# Demonstration
np.random.seed(42)
n = 200

# Create correlated features
age = np.random.uniform(20, 70, n)
income = 20000 + 1500 * age + np.random.normal(0, 10000, n)
credit_score = 400 + 5 * age + 0.005 * income + np.random.normal(0, 50, n)

X = np.column_stack([age, income, credit_score])

# Primary split on age <= 45
primary_thresh = 45
primary_dir = age <= primary_thresh

print("Surrogate Split Analysis")
print("=" * 50)
print(f"Primary split: Age <= {primary_thresh}")
print()

surrogates = find_surrogate_splits(X, primary_dir, primary_attr=0)
print("Surrogates (ranked by agreement with primary):")
names = ['Age', 'Income', 'Credit Score']
for attr, thresh, agree in surrogates:
    print(f"  {names[attr]} <= {thresh:.0f}: {agree:.1%} agreement")
```

Surrogates are more principled than C4.5's fractional instances for correlated data. If income predicts age well, using income when age is missing leverages the data's correlation structure; fractional instances ignore these relationships. However, surrogates require more computation and storage.
| Aspect | CART | C4.5 |
|---|---|---|
| Split Type | Binary only | Multi-way possible |
| Impurity Measure | Gini (classification), MSE (regression) | Information gain / Gain ratio |
| Pruning | Cost-complexity (cross-validation) | Error-based (pessimistic estimates) |
| Missing Values | Surrogate splits | Fractional instances |
| Regression | Native support | Not supported |
| Categorical Handling | Binary subset splits | One branch per value |
| Attribute Reuse | Yes (different thresholds/subsets) | No for categorical |
| Output | Binary tree | Multi-way tree |
| Default in sklearn | Yes | No (use C5.0 library) |
When to Prefer CART:
- Regression problems (C4.5 has no regression support)
- As the base learner for ensembles (random forests, gradient boosting)
- When working in scikit-learn, which implements only CART
- When principled, cross-validated pruning matters
When to Prefer C4.5:
- When interpretable multi-way splits over categorical attributes are desired
- When fractional-instance handling of missing values fits the data better
- When extracting human-readable rule sets from the tree
What's Next:
Now that we understand all three major tree-building algorithms, the next pages address practical challenges: handling missing values in depth, and managing categorical features with many values. These skills are essential for applying tree algorithms to real-world datasets.
You now understand CART: Gini impurity, binary splits, variance reduction for regression, cost-complexity pruning, and surrogate splits. This knowledge directly applies to using DecisionTreeClassifier and DecisionTreeRegressor in practice.