In 1984, two years before Quinlan published ID3, Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone published Classification and Regression Trees—a book that introduced CART, an algorithm born from the statistical tradition rather than the machine learning community.
While Quinlan was developing information-theoretic approaches in Australia, Breiman's group at Berkeley was constructing trees from a statistical perspective. The result was profoundly different in philosophy and implementation.
CART's influence is immense. It's the algorithm underlying scikit-learn's DecisionTreeClassifier and DecisionTreeRegressor, and its binary splitting approach is fundamental to all modern ensemble methods.
By the end of this page, you will understand: (1) Gini impurity and why it's preferred over entropy in many implementations; (2) The mathematical elegance of binary splits; (3) Cost-complexity pruning with cross-validation; (4) How CART handles regression through variance reduction; (5) Surrogate splits for missing values; (6) Complete CART implementation.
Definition:
For a node with class distribution $p_1, p_2, \ldots, p_K$, the Gini impurity is:
$$Gini = 1 - \sum_{k=1}^{K} p_k^2 = \sum_{k \neq k'} p_k p_{k'}$$
Interpretation:
Gini impurity measures the probability that a randomly chosen element would be incorrectly classified if labeled according to the class distribution. It's the expected error rate of a random classifier that assigns labels proportionally.
For binary classification ($K=2$): $$Gini = 1 - p^2 - (1-p)^2 = 2p(1-p)$$
This reaches maximum (0.5) when $p = 0.5$ and minimum (0) when $p = 0$ or $p = 1$.
```python
import numpy as np
from typing import Union, List

def gini_impurity(probs: Union[np.ndarray, List[float]]) -> float:
    """
    Compute Gini impurity of a probability distribution.

    Parameters:
        probs: Array of class probabilities (must sum to 1)

    Returns:
        Gini impurity (0 to 1 - 1/K)
    """
    probs = np.asarray(probs)
    return 1.0 - np.sum(probs ** 2)

def gini_from_counts(counts: Union[np.ndarray, List[int]]) -> float:
    """Compute Gini impurity from class counts."""
    counts = np.asarray(counts)
    total = np.sum(counts)
    if total == 0:
        return 0.0
    return gini_impurity(counts / total)

def entropy(probs: np.ndarray) -> float:
    """Compute entropy for comparison (zero probabilities contribute 0)."""
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

# Compare Gini and entropy across the probability range
print("Gini Impurity vs Entropy (Binary Classification)")
print("=" * 55)
print(f"{'p(class=1)':<12} | {'Gini':<10} | {'Entropy':<10} | {'Ratio':<10}")
print("-" * 55)

for p in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]:
    probs = np.array([p, 1 - p])
    g = gini_impurity(probs)
    h = entropy(probs)
    ratio = g / h if h > 0 else 0
    print(f"{p:<12.1f} | {g:<10.4f} | {h:<10.4f} | {ratio:<10.4f}")

print()
print("Key insight: Gini and entropy have very similar shapes,")
print("but Gini is computationally cheaper (no logarithms).")
```

In practice, Gini and entropy produce nearly identical trees. The choice rarely matters for final model quality—scikit-learn defaults to Gini because it's faster. The important insight is that both measure impurity and reach their minimum (maximum purity) when all instances belong to one class.
CART's Fundamental Choice:
Unlike ID3/C4.5 which can create multi-way splits (one branch per attribute value), CART uses binary splits exclusively. This seemingly simple choice has profound implications:
For Categorical Attributes: CART searches for the optimal subset of values to go left vs. right: $$\text{Find subset } S \subseteq \{v_1, v_2, \ldots, v_k\} \text{ that maximizes impurity reduction}$$
For a categorical attribute with $k$ values, there are $2^{k-1} - 1$ possible binary splits to consider.
For Continuous Attributes: Same as C4.5—find optimal threshold $t$: left if $x \leq t$, right otherwise.
For Ordinal Attributes: Binary splits respect ordering: split at some value, all lesser values go left.
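As a tiny illustration with made-up data (the shirt-size attribute below is hypothetical), an ordinal attribute with $k$ ordered levels admits only $k-1$ binary splits, one per cut point between adjacent levels:

```python
import numpy as np

# Hypothetical ordinal attribute: shirt sizes with a natural order
sizes = np.array(['S', 'M', 'M', 'L', 'XL', 'S', 'L'])
order = ['S', 'M', 'L', 'XL']

# Encode each value by its rank so thresholds respect the ordering
rank = np.array([order.index(v) for v in sizes])

# Only k - 1 candidate binary splits: one cut between each adjacent pair
k = len(order)
for cut in range(k - 1):
    left = np.sum(rank <= cut)
    right = np.sum(rank > cut)
    print(f"{{{', '.join(order[:cut + 1])}}} vs rest: {left} left, {right} right")
```

Compare this with the $2^{k-1} - 1$ subset splits a purely categorical treatment would require: respecting the order collapses the search to a simple threshold scan.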
```python
import numpy as np
from itertools import combinations
from typing import Tuple, FrozenSet

def find_best_categorical_split(x: np.ndarray, y: np.ndarray) -> Tuple[FrozenSet, float]:
    """
    Find the optimal binary split for a categorical attribute.

    For k unique values, examines up to 2^(k-1) - 1 possible splits.

    Parameters:
        x: Categorical attribute values
        y: Class labels

    Returns:
        Tuple of (left_values_set, gini_reduction)
    """
    unique_values = np.unique(x)
    classes = np.unique(y)
    k = len(unique_values)

    if k == 1:
        return frozenset(), 0.0

    # Compute parent Gini
    parent_counts = np.array([np.sum(y == c) for c in classes])
    parent_gini = gini_from_counts(parent_counts)

    best_split = None
    best_reduction = -np.inf
    n = len(y)

    # Only subsets up to half the values need checking:
    # complementary subsets produce the same split
    for size in range(1, k // 2 + 1):
        for left_values in combinations(unique_values, size):
            left_mask = np.isin(x, left_values)
            right_mask = ~left_mask
            n_left = np.sum(left_mask)
            n_right = np.sum(right_mask)

            if n_left == 0 or n_right == 0:
                continue

            # Weighted child Gini
            left_counts = np.array([np.sum(y[left_mask] == c) for c in classes])
            right_counts = np.array([np.sum(y[right_mask] == c) for c in classes])
            weighted_child_gini = (
                (n_left / n) * gini_from_counts(left_counts)
                + (n_right / n) * gini_from_counts(right_counts)
            )

            reduction = parent_gini - weighted_child_gini
            if reduction > best_reduction:
                best_reduction = reduction
                best_split = frozenset(left_values)

    return best_split, best_reduction

# Example with a categorical attribute
colors = np.array(['Red', 'Red', 'Blue', 'Blue', 'Green', 'Green',
                   'Red', 'Blue', 'Green', 'Yellow'])
buy = np.array(['Yes', 'Yes', 'No', 'No', 'Yes', 'No',
                'Yes', 'No', 'No', 'Yes'])

left_set, reduction = find_best_categorical_split(colors, buy)
print(f"Best split: {left_set} vs rest")
print(f"Gini reduction: {reduction:.4f}")
```

Why Binary Splits?
Consistent structure: Every node has exactly 2 children, simplifying tree traversal and analysis
Handles all attribute types uniformly: Continuous, categorical, and ordinal attributes all produce binary splits
No attribute exhaustion: Categorical attributes can be reused (different subsets) at deeper nodes
Optimal for regression: Variance reduction naturally applies to binary partitions
Required for ensembles: Random forests and boosting implementations assume binary trees
The Exponential Challenge:
For categorical attributes with many values (say, 20 US states), there are $2^{19} - 1 \approx 500{,}000$ possible binary splits. CART uses a key optimization:
For binary classification, the optimal split on a categorical attribute with $k$ values can be found efficiently: sort the categories by $P(\text{class}=1 \mid \text{category})$, then consider only the $k-1$ threshold splits along that ordering. This reduces the exponential search to $O(k \log k)$—a beautiful result from the original CART book.
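A quick numerical check of this result, as a sketch with synthetic data (the category probabilities below are made up): brute-force search over all subsets and the sorted prefix scan find the same best split.

```python
import numpy as np
from itertools import combinations

def weighted_gini(y: np.ndarray, mask: np.ndarray) -> float:
    """Weighted Gini impurity of the two children induced by mask."""
    n = len(y)
    total = 0.0
    for side in (mask, ~mask):
        m = side.sum()
        if m == 0:
            return np.inf  # degenerate split: one side empty
        p = y[side].mean()               # binary labels: p = P(class=1 | side)
        total += (m / n) * 2 * p * (1 - p)
    return total

rng = np.random.default_rng(0)
cats = rng.integers(0, 6, size=300)          # categorical attribute, k = 6
y = rng.random(300) < (0.1 + 0.12 * cats)    # class probability depends on category

values = np.unique(cats)

# Brute force: every subset-vs-complement split (exponential in k)
best_brute = min(
    weighted_gini(y, np.isin(cats, s))
    for r in range(1, len(values))
    for s in combinations(values, r)
)

# The CART trick: sort categories by P(class=1 | category);
# the optimal subset is always a prefix of this ordering
p_hat = np.array([y[cats == v].mean() for v in values])
ordered = values[np.argsort(p_hat)]
best_prefix = min(
    weighted_gini(y, np.isin(cats, ordered[:i]))
    for i in range(1, len(values))
)

print(f"brute force: {best_brute:.6f}, prefix scan: {best_prefix:.6f}")
```

The brute force examines $2^{k-1} - 1 = 31$ subsets; the prefix scan examines only $k - 1 = 5$, yet both reach the same minimum weighted Gini.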
The Regression Setting:
For regression, the target $y$ is continuous. CART predictions are piecewise constant: each leaf predicts the mean of training instances in that leaf.
Splitting Criterion: Variance Reduction
The goal is to create splits that reduce variance in the child nodes. For a node $t$ with instances $S$:
$$\text{Variance}(S) = \frac{1}{|S|} \sum_{i \in S} (y_i - \bar{y}_S)^2$$
where $\bar{y}_S$ is the mean of $y$ values in $S$.
Variance Reduction for a Split: $$\Delta V = \text{Var}(S) - \frac{|S_L|}{|S|}\text{Var}(S_L) - \frac{|S_R|}{|S|}\text{Var}(S_R)$$
Select the split that maximizes variance reduction (or equivalently, minimizes weighted child variance).
```python
import numpy as np
from typing import Tuple

def variance_reduction(y: np.ndarray, x: np.ndarray, threshold: float) -> float:
    """
    Compute variance reduction for a binary split.

    Parameters:
        y: Target values (continuous)
        x: Feature values
        threshold: Split threshold

    Returns:
        Variance reduction
    """
    left_mask = x <= threshold
    right_mask = ~left_mask

    n = len(y)
    n_left = np.sum(left_mask)
    n_right = np.sum(right_mask)

    if n_left == 0 or n_right == 0:
        return 0.0

    # Parent variance
    parent_var = np.var(y)

    # Child variances
    left_var = np.var(y[left_mask]) if n_left > 1 else 0
    right_var = np.var(y[right_mask]) if n_right > 1 else 0

    # Weighted child variance
    weighted_child_var = (n_left / n) * left_var + (n_right / n) * right_var

    return parent_var - weighted_child_var

def find_best_regression_split(X: np.ndarray, y: np.ndarray,
                               feature_idx: int) -> Tuple[float, float]:
    """Find the optimal split threshold for one feature."""
    x = X[:, feature_idx]

    # Candidate thresholds: midpoints between consecutive unique values
    unique_values = np.unique(x)

    best_threshold = None
    best_reduction = -np.inf

    for i in range(len(unique_values) - 1):
        threshold = (unique_values[i] + unique_values[i + 1]) / 2
        reduction = variance_reduction(y, x, threshold)
        if reduction > best_reduction:
            best_reduction = reduction
            best_threshold = threshold

    return best_threshold, best_reduction

# Example: housing price prediction
np.random.seed(42)
n = 200

# Features: square footage and age
sqft = np.random.uniform(1000, 4000, n)
age = np.random.uniform(0, 50, n)

# Price = base + sqft effect + age penalty + noise
# Non-linear: larger homes carry a premium, older homes a discount
price = (
    100000                               # Base price
    + 150 * sqft                         # Per sqft
    + 50 * np.maximum(sqft - 2500, 0)    # Premium for large homes
    - 2000 * age                         # Age penalty
    + np.random.normal(0, 30000, n)      # Noise
)

X = np.column_stack([sqft, age])

# Find the best split for each feature
for i, name in enumerate(['Square Footage', 'Age']):
    thresh, reduction = find_best_regression_split(X, price, i)
    print(f"{name}: threshold = {thresh:.1f}, variance reduction = {reduction:.2e}")
```

Minimizing variance in the children is equivalent to minimizing the mean squared error (MSE) of predictions: since each leaf predicts the mean, a leaf's MSE equals its variance. scikit-learn's 'squared_error' criterion (formerly 'mse') is computationally identical to variance reduction.
Other Regression Criteria:
While variance/MSE is standard, CART supports alternatives:
| Criterion | Formula | Use Case |
|---|---|---|
| MSE | $\frac{1}{n}\sum(y_i - \bar{y})^2$ | Standard regression |
| MAE | $\frac{1}{n}\sum_i \lvert y_i - \tilde{y} \rvert$, where $\tilde{y}$ is the median | Robust regression |
| Friedman MSE | MSE with improvement weight | Gradient boosting |
| Poisson | Deviance for count data | Count regression |
MAE (Mean Absolute Error) uses median instead of mean for predictions, providing robustness to outliers at the cost of computational efficiency (median is harder to update incrementally).
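A small numeric check of why the criterion determines the leaf prediction (the leaf values below, including one outlier, are made up): among constant predictions, the mean minimizes squared error while the median minimizes absolute error.

```python
import numpy as np

# Leaf targets with one large outlier (made-up values)
y = np.array([1.0, 2.0, 2.5, 3.0, 100.0])

# Grid of candidate constant predictions
candidates = np.linspace(0, 110, 2201)     # step 0.05
mse = np.array([np.mean((y - c) ** 2) for c in candidates])
mae = np.array([np.mean(np.abs(y - c)) for c in candidates])

best_mse = candidates[np.argmin(mse)]
best_mae = candidates[np.argmin(mae)]

print(f"mean   = {np.mean(y):.2f}, MSE-optimal constant = {best_mse:.2f}")
print(f"median = {np.median(y):.2f}, MAE-optimal constant = {best_mae:.2f}")
```

The outlier drags the MSE-optimal prediction far above the typical values, while the MAE-optimal prediction stays at the median: the robustness trade-off in miniature.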
The Pruning Framework:
CART's cost-complexity pruning (also called minimal cost-complexity or weakest link pruning) provides a principled way to find the optimal tree size. It defines a cost function balancing accuracy and complexity:
$$R_\alpha(T) = R(T) + \alpha |\tilde{T}|$$
where $R(T)$ is the tree's training error (misclassification rate for classification, MSE for regression), $|\tilde{T}|$ is the number of terminal nodes (leaves), and $\alpha \geq 0$ is the complexity parameter that penalizes tree size.
The Algorithm:
1. Grow a large tree $T_0$ with minimal stopping criteria.
2. For each internal node, compute the effective alpha at which pruning its subtree to a leaf becomes worthwhile.
3. Prune the weakest link (the node with the smallest effective alpha), producing the next tree in a nested sequence $T_0 \supset T_1 \supset \cdots$.
4. Repeat until only the root remains, then select among the sequence using cross-validation.
```python
import numpy as np
from typing import List, Tuple
from dataclasses import dataclass

@dataclass
class PruningResult:
    """Result of cost-complexity pruning analysis."""
    alpha: float        # Complexity parameter
    tree_size: int      # Number of leaves
    train_error: float  # Training error
    cv_error: float     # Cross-validation error
    cv_std: float       # CV error standard deviation

def compute_effective_alpha(node_error: float, subtree_error: float,
                            n_leaves: int) -> float:
    """
    Compute the effective alpha for pruning a subtree.

    This is the alpha at which it becomes worth pruning the
    subtree to a single leaf (weakest link principle).

    Parameters:
        node_error: Error if the node becomes a leaf
        subtree_error: Total error of the subtree's leaves
        n_leaves: Number of leaves in the subtree

    Returns:
        Effective alpha (complexity parameter threshold)
    """
    if n_leaves <= 1:
        return np.inf  # Already a leaf

    # At this alpha, the cost of the leaf equals the cost of the subtree:
    #   R(t) + alpha = R(T_t) + alpha * |T_t|
    #   alpha * (|T_t| - 1) = R(t) - R(T_t)
    #   alpha = (R(t) - R(T_t)) / (|T_t| - 1)
    return (node_error - subtree_error) / (n_leaves - 1)

def find_pruning_sequence(errors: List[Tuple[float, float, int]]) -> List[float]:
    """
    Given (node error, subtree error, leaf count) triples, find the
    sequence of effective alphas for weakest link pruning.
    """
    alphas = []
    for node_err, subtree_err, n_leaves in errors:
        alpha = compute_effective_alpha(node_err, subtree_err, n_leaves)
        if alpha < np.inf:
            alphas.append(alpha)
    return sorted(set(alphas))

# Demonstration of the cost-complexity concept
print("Cost-Complexity Pruning Intuition")
print("=" * 55)

# Example: a subtree with 5 leaves
# Training error if kept as subtree: 10 misclassifications
# Training error if pruned to a leaf: 25 misclassifications
node_error = 25     # Error if this node becomes a leaf
subtree_error = 10  # Total error across 5 leaves
n_leaves = 5

alpha_crit = compute_effective_alpha(node_error, subtree_error, n_leaves)

print(f"Subtree: {n_leaves} leaves, {subtree_error} errors")
print(f"As leaf: 1 leaf, {node_error} errors")
print(f"Critical alpha: {alpha_crit:.2f}")
print()
print(f"When alpha < {alpha_crit:.2f}: keep subtree (complexity penalty low)")
print(f"When alpha > {alpha_crit:.2f}: prune to leaf (complexity too costly)")
```

Cross-Validation for Alpha Selection:
The pruning sequence gives us candidate trees, but which is best? CART uses k-fold cross-validation:
The 1-SE Rule:
The 1-standard-error rule prefers simpler models when they're statistically indistinguishable from the best. If the minimum CV error is $E^*$ with standard error $SE^*$, choose the simplest tree whose error is at most $E^* + SE^*$. This provides regularization and often generalizes better.
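A minimal sketch of the 1-SE rule in action, using hypothetical cross-validation results (the alphas, leaf counts, and errors below are made up for illustration):

```python
import numpy as np

# Hypothetical CV results for a pruning sequence (alpha ascending,
# so trees shrink); cv_err is mean CV error, cv_se its standard error
alphas = np.array([0.000, 0.005, 0.010, 0.020, 0.050])
leaves = np.array([40,    25,    12,    6,     3])
cv_err = np.array([0.210, 0.190, 0.185, 0.195, 0.260])
cv_se  = np.array([0.015, 0.014, 0.013, 0.014, 0.018])

# Minimum-CV-error tree and the 1-SE threshold around it
best = int(np.argmin(cv_err))
threshold = cv_err[best] + cv_se[best]

# 1-SE rule: simplest tree (fewest leaves) whose error <= min error + 1 SE
within = np.where(cv_err <= threshold)[0]
one_se = int(within[np.argmin(leaves[within])])

print(f"min-error tree: alpha={alphas[best]}, {leaves[best]} leaves")
print(f"1-SE rule tree: alpha={alphas[one_se]}, {leaves[one_se]} leaves")
```

Here the minimum-error tree has 12 leaves, but the 6-leaf tree's error (0.195) lies within one standard error of the minimum (0.185 + 0.013), so the 1-SE rule picks the smaller tree.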
In scikit-learn, set ccp_alpha parameter to control pruning. Use cost_complexity_pruning_path() to get the (alpha, impurities) pairs for all subtrees, then cross-validate to find optimal alpha. The default ccp_alpha=0 means no pruning.
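A short sketch of that workflow using scikit-learn's documented API (the dataset choice and cv=5 here are arbitrary, and for brevity we cross-validate on the same data used to compute the path):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Pruning path: every alpha at which another subtree gets pruned away
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
ccp_alphas = path.ccp_alphas[:-1]   # the last alpha prunes the tree to a bare root

# Cross-validate each candidate alpha and keep the best
scores = [
    cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a),
                    X, y, cv=5).mean()
    for a in ccp_alphas
]
best_alpha = ccp_alphas[int(np.argmax(scores))]
print(f"{len(ccp_alphas)} candidate alphas, best alpha = {best_alpha:.5f}")
```

Fitting a final DecisionTreeClassifier with ccp_alpha=best_alpha then yields the pruned tree; the 1-SE rule can be layered on top by tracking the per-fold standard errors as well.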
The Surrogate Split Concept:
CART handles missing values through surrogate splits—backup splitting rules that mimic the primary split using other attributes.
Construction:
After the primary split is chosen, CART searches the remaining attributes for the binary split that best agrees with the primary split's left/right assignments. Candidate surrogates are ranked by agreement rate, keeping only those that beat the baseline of sending every instance in the majority direction.

Usage:
When an instance has a missing value for the primary split attribute:
1. Try the first (best) surrogate; if that attribute is also missing, try the next surrogate.
2. If all surrogates are missing, send the instance in the majority direction.
```python
import numpy as np
from typing import List, Tuple

def find_surrogate_splits(X: np.ndarray, primary_direction: np.ndarray,
                          primary_attr: int,
                          max_surrogates: int = 3) -> List[Tuple[int, float, float]]:
    """
    Find surrogate splits that predict the primary split direction.

    Parameters:
        X: Feature matrix
        primary_direction: Boolean array (True = left, False = right)
        primary_attr: Index of the primary splitting attribute
        max_surrogates: Maximum number of surrogates to return

    Returns:
        List of (attribute_index, threshold, agreement_rate)
    """
    n_samples, n_features = X.shape
    surrogates = []

    # Baseline: agreement achieved by always predicting the majority direction
    majority_left = np.mean(primary_direction)
    baseline = max(majority_left, 1 - majority_left)

    for attr in range(n_features):
        if attr == primary_attr:
            continue

        x = X[:, attr]

        # Skip attributes with too few observed values
        valid = ~np.isnan(x)
        if np.sum(valid) < 10:
            continue

        x_valid = x[valid]
        dir_valid = primary_direction[valid]

        # Find the threshold that best predicts the primary direction
        unique_vals = np.unique(x_valid)
        best_thresh = None
        best_agree = baseline  # Must beat the baseline

        for i in range(len(unique_vals) - 1):
            thresh = (unique_vals[i] + unique_vals[i + 1]) / 2

            # Predict left if x <= thresh
            pred_left = x_valid <= thresh
            agree = np.mean(pred_left == dir_valid)

            # Also try the reverse rule (left if x > thresh)
            agree_rev = np.mean((~pred_left) == dir_valid)

            if agree > best_agree:
                best_agree = agree
                best_thresh = thresh
            if agree_rev > best_agree:
                best_agree = agree_rev
                best_thresh = thresh

        if best_thresh is not None and best_agree > baseline:
            surrogates.append((attr, best_thresh, best_agree))

    # Sort by agreement rate and return the top surrogates
    surrogates.sort(key=lambda s: s[2], reverse=True)
    return surrogates[:max_surrogates]

# Demonstration
np.random.seed(42)
n = 200

# Create correlated features
age = np.random.uniform(20, 70, n)
income = 20000 + 1500 * age + np.random.normal(0, 10000, n)
credit_score = 400 + 5 * age + 0.005 * income + np.random.normal(0, 50, n)

X = np.column_stack([age, income, credit_score])

# Primary split on age <= 45
primary_thresh = 45
primary_dir = age <= primary_thresh

print("Surrogate Split Analysis")
print("=" * 50)
print(f"Primary split: Age <= {primary_thresh}")
print()

surrogates = find_surrogate_splits(X, primary_dir, primary_attr=0)
print("Surrogates (ranked by agreement with primary):")
names = ['Age', 'Income', 'Credit Score']
for attr, thresh, agree in surrogates:
    print(f"  {names[attr]} <= {thresh:.0f}: {agree:.1%} agreement")
```

Surrogates are more principled than C4.5's fractional instances for correlated data. If income predicts age well, using income when age is missing leverages the data's correlation structure; fractional instances ignore these relationships. However, surrogates require more computation and storage.
| Aspect | CART | C4.5 |
|---|---|---|
| Split Type | Binary only | Multi-way possible |
| Impurity Measure | Gini (classification), MSE (regression) | Information gain / Gain ratio |
| Pruning | Cost-complexity (cross-validation) | Error-based (pessimistic estimates) |
| Missing Values | Surrogate splits | Fractional instances |
| Regression | Native support | Not supported |
| Categorical Handling | Binary subset splits | One branch per value |
| Attribute Reuse | Yes (different thresholds/subsets) | No for categorical |
| Output | Binary tree | Multi-way tree |
| Default in sklearn | Yes | No (use C5.0 library) |
When to Prefer CART:
- Regression problems (C4.5 has no regression support)
- As the base learner for ensembles (random forests, gradient boosting)
- When working in scikit-learn, which implements only CART
- When principled, cross-validated pruning matters
When to Prefer C4.5:
- When interpretable multi-way splits over categorical attributes are desired
- When fractional-instance handling of missing values fits the data better
- When extracting human-readable rule sets from the tree
What's Next:
Now that we understand all three major tree-building algorithms, the next pages address practical challenges: handling missing values in depth, and managing categorical features with many values. These skills are essential for applying tree algorithms to real-world datasets.
You now understand CART: Gini impurity, binary splits, variance reduction for regression, cost-complexity pruning, and surrogate splits. This knowledge directly applies to using DecisionTreeClassifier and DecisionTreeRegressor in practice.