In 1984, Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone published their seminal book Classification and Regression Trees (CART), introducing what would become the most influential decision tree algorithm in machine learning history. While decision trees had existed before CART, this algorithm brought a level of mathematical rigor, computational efficiency, and practical elegance that transformed trees from a heuristic technique into a principled statistical methodology.
CART for regression addresses a fundamental question: How can we recursively partition a feature space to predict a continuous target variable? The answer lies in a beautifully simple yet powerful algorithm that has stood the test of time for four decades and remains the foundation of modern ensemble methods like Random Forests and Gradient Boosting.
By the end of this page, you will understand the complete CART regression algorithm: how it recursively partitions feature space, why binary splits are both necessary and sufficient, how optimal split points are computed, and the mathematical principles that govern its behavior. You will be equipped to implement CART from scratch and understand its role as the foundation of modern tree-based methods.
Before CART, regression analysis was dominated by parametric methods—primarily linear regression and its extensions. These methods make strong assumptions about the functional form of the relationship between predictors and the target. While powerful when assumptions hold, they often fail when the true relationship is complex, nonlinear, or involves intricate interactions between variables.
The limitations of parametric regression:
What CART offered:
CART provided a fundamentally different approach—one that makes no assumptions about functional form at all. Instead of fitting a global model, CART creates a piecewise model by recursively partitioning the feature space into regions where simple predictions (constants or linear models) suffice. This approach:
The name CART reflects that the same algorithmic framework handles both Classification (categorical targets) and Regression (continuous targets). The key differences lie in the splitting criterion (Gini/entropy for classification, MSE for regression) and the leaf predictions (class probabilities vs. mean values). This unified framework was a key insight of the original work.
The mathematical philosophy of CART:
CART operates on a deceptively simple principle: at each step, find the single binary split that most reduces prediction error. By composing many such locally optimal splits, the algorithm constructs a globally effective predictor. This greedy, recursive approach sacrifices global optimality (finding the best tree is NP-complete) for computational tractability while achieving remarkably good practical performance.
The algorithm's beauty lies in this balance: it is simple enough to understand completely, efficient enough to scale to large datasets, and flexible enough to capture complex patterns that would elude parametric methods.
The CART algorithm for regression is a recursive procedure that builds a binary tree by repeatedly splitting the feature space. Let's formalize this precisely.
Problem Setup:
Given training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, where $\mathbf{x}_i \in \mathbb{R}^p$ is a vector of $p$ feature values and $y_i \in \mathbb{R}$ is a continuous target.
Core Algorithm (Recursive Partitioning):
The algorithm maintains a partition of the feature space into disjoint regions $\{R_1, R_2, \ldots, R_M\}$ where $M$ is the number of terminal nodes (leaves). In each region $R_m$, the prediction is constant:
$$\hat{f}(\mathbf{x}) = \sum_{m=1}^{M} c_m \cdot \mathbb{1}(\mathbf{x} \in R_m)$$
where $c_m$ is the predicted value for region $R_m$ and $\mathbb{1}(\cdot)$ is the indicator function.
```python
import numpy as np


def cart_regression(X, y, max_depth=None, min_samples_split=2,
                    min_samples_leaf=1, current_depth=0):
    """
    CART Regression Algorithm - Core Implementation

    This function implements the recursive partitioning procedure
    that forms the heart of the CART regression algorithm.

    Parameters:
    -----------
    X : array-like, shape (n_samples, n_features)
        Training feature matrix
    y : array-like, shape (n_samples,)
        Target values (continuous)
    max_depth : int or None
        Maximum depth of the tree (None for unlimited)
    min_samples_split : int
        Minimum samples required to split a node
    min_samples_leaf : int
        Minimum samples required in each leaf
    current_depth : int
        Current depth in the recursion (internal use)

    Returns:
    --------
    node : dict
        A dictionary representing the tree node
    """
    n_samples, n_features = X.shape

    # BASE CASE 1: Check stopping criteria
    # ====================================
    # These conditions determine when to create a leaf node

    # Create leaf if max depth reached
    if max_depth is not None and current_depth >= max_depth:
        return create_leaf_node(y)

    # Create leaf if insufficient samples to split
    if n_samples < min_samples_split:
        return create_leaf_node(y)

    # Create leaf if all targets are identical (pure node)
    if np.var(y) == 0:
        return create_leaf_node(y)

    # RECURSIVE CASE: Find best split
    # ================================
    best_split = find_best_split(X, y, min_samples_leaf)

    # Create leaf if no valid split found
    if best_split is None:
        return create_leaf_node(y)

    feature_idx = best_split['feature_index']
    threshold = best_split['threshold']

    # Partition samples based on best split
    left_mask = X[:, feature_idx] <= threshold
    right_mask = ~left_mask

    X_left, y_left = X[left_mask], y[left_mask]
    X_right, y_right = X[right_mask], y[right_mask]

    # Recursively build left and right subtrees
    left_child = cart_regression(
        X_left, y_left, max_depth, min_samples_split,
        min_samples_leaf, current_depth + 1
    )
    right_child = cart_regression(
        X_right, y_right, max_depth, min_samples_split,
        min_samples_leaf, current_depth + 1
    )

    # Return internal node
    return {
        'type': 'internal',
        'feature_index': feature_idx,
        'threshold': threshold,
        'left': left_child,
        'right': right_child,
        'n_samples': n_samples,
        'impurity': np.var(y)  # MSE before split
    }


def create_leaf_node(y):
    """
    Create a leaf node with constant prediction.

    For regression, the optimal prediction that minimizes
    squared error is the mean of target values in the region.
    """
    return {
        'type': 'leaf',
        'prediction': np.mean(y),
        'n_samples': len(y),
        'variance': np.var(y),
        'std': np.std(y)
    }
```

Understanding the recursion:
The algorithm above embodies the divide-and-conquer strategy that defines CART: check the stopping criteria; if none are met, find the best split, partition the samples, and recurse on each part.
This recursive structure naturally produces a binary tree in which each internal node stores a split (a feature index and a threshold) and each leaf stores a constant prediction (the mean of its training targets).
The heart of CART lies in the find_best_split function—the procedure that identifies the single best way to divide a node into two children. For regression, this means finding the split that most reduces prediction error.
The Split Search Problem:
For a node containing $n$ samples with target values $y_1, \ldots, y_n$, we consider splits of the form:
$$\text{Split}(j, t): \quad X_j \leq t \quad \text{vs} \quad X_j > t$$
where $j$ indexes the feature and $t$ is the threshold value. This creates two regions: $R_L(j, t) = \{\mathbf{x} : x_j \leq t\}$ and $R_R(j, t) = \{\mathbf{x} : x_j > t\}$.
The Objective Function:
For regression, CART minimizes the sum of squared residuals (or equivalently, weighted variance) after the split:
$$\text{Cost}(j, t) = \sum_{i \in R_L} (y_i - \bar{y}_L)^2 + \sum_{i \in R_R} (y_i - \bar{y}_R)^2$$
where $\bar{y}_L = \frac{1}{|R_L|} \sum_{i \in R_L} y_i$ and $\bar{y}_R = \frac{1}{|R_R|} \sum_{i \in R_R} y_i$.
This is equivalent to minimizing the weighted sum of child variances:
$$\text{Cost}(j, t) = n_L \cdot \text{Var}(y_L) + n_R \cdot \text{Var}(y_R)$$
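As a quick numerical check (a minimal sketch with synthetic target values, not part of the original text), the sum-of-squared-residuals form and the weighted-variance form of the cost agree exactly, since $n \cdot \text{Var}(y)$ is the sum of squared deviations from the mean:

```python
import numpy as np

# Hypothetical child target values for some candidate split
rng = np.random.default_rng(0)
y_left = rng.normal(loc=2.0, size=30)
y_right = rng.normal(loc=5.0, size=20)

# Sum of squared residuals around each child's mean
ssr = (np.sum((y_left - y_left.mean()) ** 2)
       + np.sum((y_right - y_right.mean()) ** 2))

# Weighted sum of child variances (population variance, ddof=0)
weighted_var = len(y_left) * np.var(y_left) + len(y_right) * np.var(y_right)

print(np.isclose(ssr, weighted_var))  # True: the two criteria are identical
```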
```python
def find_best_split(X, y, min_samples_leaf=1):
    """
    Find the optimal binary split for a regression tree node.

    This function exhaustively searches all features and all possible
    split points to find the split that minimizes the weighted sum
    of child node variances (MSE).

    Parameters:
    -----------
    X : array-like, shape (n_samples, n_features)
        Feature matrix for current node
    y : array-like, shape (n_samples,)
        Target values for current node
    min_samples_leaf : int
        Minimum samples required in each child

    Returns:
    --------
    best_split : dict or None
        Dictionary with 'feature_index', 'threshold', and 'gain'
        Returns None if no valid split exists
    """
    n_samples, n_features = X.shape

    # Calculate impurity of current node (before split)
    # This is the variance (MSE relative to mean)
    current_impurity = np.var(y) * n_samples

    best_gain = 0.0
    best_split = None

    # Iterate through all features
    for feature_idx in range(n_features):
        # Get unique values for this feature, sorted
        feature_values = X[:, feature_idx]
        unique_values = np.unique(feature_values)

        # For efficiency, consider midpoints between consecutive values
        # This avoids redundant evaluations and handles continuous features
        if len(unique_values) > 1:
            thresholds = (unique_values[:-1] + unique_values[1:]) / 2
        else:
            continue  # Cannot split on single-valued feature

        # Evaluate each potential threshold
        for threshold in thresholds:
            # Partition samples
            left_mask = feature_values <= threshold
            right_mask = ~left_mask

            n_left = np.sum(left_mask)
            n_right = np.sum(right_mask)

            # Check minimum leaf size constraint
            if n_left < min_samples_leaf or n_right < min_samples_leaf:
                continue

            # Calculate child node impurities
            y_left = y[left_mask]
            y_right = y[right_mask]

            left_impurity = np.var(y_left) * n_left if n_left > 0 else 0
            right_impurity = np.var(y_right) * n_right if n_right > 0 else 0

            # Total impurity after split
            child_impurity = left_impurity + right_impurity

            # Impurity reduction (gain)
            gain = current_impurity - child_impurity

            # Update best if this split is better
            if gain > best_gain:
                best_gain = gain
                best_split = {
                    'feature_index': feature_idx,
                    'threshold': threshold,
                    'gain': gain,
                    'n_left': n_left,
                    'n_right': n_right,
                    'left_mean': np.mean(y_left),
                    'right_mean': np.mean(y_right)
                }

    return best_split


def find_best_split_efficient(X, y, min_samples_leaf=1):
    """
    Optimized split finding using sorted scan technique.

    This O(n log n) approach sorts each feature once, then uses
    incremental updates to compute split statistics in O(n)
    per feature instead of O(n²).
    """
    n_samples, n_features = X.shape

    current_impurity = np.sum((y - np.mean(y))**2)

    best_gain = 0.0
    best_split = None

    for feature_idx in range(n_features):
        # Sort by feature value
        sorted_indices = np.argsort(X[:, feature_idx])
        sorted_y = y[sorted_indices]
        sorted_x = X[sorted_indices, feature_idx]

        # Initialize running statistics for incremental updates
        n_left = 0
        sum_left = 0.0
        sum_sq_left = 0.0

        n_right = n_samples
        sum_right = np.sum(sorted_y)
        sum_sq_right = np.sum(sorted_y**2)

        # Scan through potential split points
        for i in range(n_samples - 1):
            # Move sample i from right to left
            yi = sorted_y[i]

            n_left += 1
            sum_left += yi
            sum_sq_left += yi**2

            n_right -= 1
            sum_right -= yi
            sum_sq_right -= yi**2

            # Skip if not at a valid split boundary
            # (must have distinct feature values)
            if sorted_x[i] == sorted_x[i + 1]:
                continue

            # Check minimum leaf size
            if n_left < min_samples_leaf or n_right < min_samples_leaf:
                continue

            # Compute variance using: Var = E[X²] - E[X]²
            mean_left = sum_left / n_left
            mean_right = sum_right / n_right

            var_left = (sum_sq_left / n_left) - mean_left**2
            var_right = (sum_sq_right / n_right) - mean_right**2

            # Weighted sum of child variances
            child_impurity = n_left * var_left + n_right * var_right

            gain = current_impurity - child_impurity

            if gain > best_gain:
                best_gain = gain
                # Threshold at midpoint between consecutive values
                threshold = (sorted_x[i] + sorted_x[i + 1]) / 2
                best_split = {
                    'feature_index': feature_idx,
                    'threshold': threshold,
                    'gain': gain,
                    'n_left': n_left,
                    'n_right': n_right
                }

    return best_split
```

The naive split search is O(n²p) per node—for each of p features, we consider O(n) thresholds, and for each threshold, we compute statistics in O(n). The efficient version reduces this to O(np log n) by sorting each feature once (O(n log n) each) and using O(n) incremental scans. For large n, this optimization is crucial.
Why variance minimization?
The choice of variance (or equivalently, MSE) as the splitting criterion for regression is not arbitrary—it corresponds to maximum likelihood estimation under a Gaussian noise model:
$$y_i = f(\mathbf{x}_i) + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2)$$
Under this model, minimizing $\sum_i (y_i - \hat{y}_i)^2$ is equivalent to maximizing likelihood. The mean prediction in each region is the maximum likelihood estimate of the true function value for that region.
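To spell out the connection (a standard derivation under the Gaussian model above, added here for completeness), the negative log-likelihood of the observations in a region $R_m$ with constant prediction $c$ is

$$-\log L(c, \sigma^2) = \frac{n_m}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i \in R_m}(y_i - c)^2.$$

For fixed $\sigma^2$, the first term is constant, so maximizing the likelihood is the same as minimizing the squared-error term; setting $\frac{\partial}{\partial c}\sum_{i \in R_m}(y_i - c)^2 = 0$ gives $c = \bar{y}_{R_m}$, the region mean.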
Connection to analysis of variance (ANOVA):
The split search can also be viewed through the lens of one-way ANOVA. We're essentially asking: "Which binary grouping of samples (defined by feature $j$ and threshold $t$) explains the most variance in $y$?" The information gain (variance reduction) is proportional to the between-group sum of squares in an ANOVA decomposition.
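This identity can be verified directly (a minimal sketch with synthetic data; the array names are hypothetical): the gain computed by CART equals the between-group sum of squares in the decomposition total SS = within SS + between SS.

```python
import numpy as np

rng = np.random.default_rng(1)
y_left = rng.normal(0.0, 1.0, size=40)   # hypothetical left-child targets
y_right = rng.normal(3.0, 1.0, size=60)  # hypothetical right-child targets
y = np.concatenate([y_left, y_right])

# Variance reduction (gain) as CART computes it, in sum-of-squares form
total_ss = np.sum((y - y.mean()) ** 2)
within_ss = (np.sum((y_left - y_left.mean()) ** 2)
             + np.sum((y_right - y_right.mean()) ** 2))
gain = total_ss - within_ss

# Between-group sum of squares from the one-way ANOVA decomposition
between_ss = (len(y_left) * (y_left.mean() - y.mean()) ** 2
              + len(y_right) * (y_right.mean() - y.mean()) ** 2)

print(np.isclose(gain, between_ss))  # True
```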
A distinctive feature of CART is its exclusive use of binary splits—each internal node has exactly two children. This design choice has profound implications for the algorithm's behavior and efficiency.
Why binary splits?
Computational tractability: For a continuous feature, we only need to consider O(n) threshold candidates. Multi-way splits would require evaluating an exponential number of partitions.
Sufficient expressiveness: Any multi-way split can be represented by a sequence of binary splits. A 3-way split on feature $X$ (e.g., $X < a$, $a \leq X < b$, $X \geq b$) is equivalent to two consecutive binary splits (see the sketch after this list).
Algorithmic elegance: Binary trees have well-studied properties, efficient traversal algorithms, and natural recursive structure.
Interaction detection: Binary splits can still capture complex interactions through the tree hierarchy—what a single node cannot express, multiple levels can.
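To make the sufficient-expressiveness point concrete, here is a minimal sketch (synthetic 1-D data with hypothetical cut points `a` and `b`, not from the original text) showing that two nested binary splits reproduce a three-way partition:

```python
import numpy as np

# Hypothetical 1-D feature values and cut points a < b
x = np.array([0.5, 1.5, 2.5, 3.5, 4.5])
a, b = 2.0, 4.0

# Direct 3-way partition: X < a, a <= X < b, X >= b
three_way = np.where(x < a, 0, np.where(x < b, 1, 2))

# Same partition from two consecutive binary splits:
# root split on X < a, then split the right branch on X < b
nested_binary = np.empty_like(three_way)
nested_binary[x < a] = 0                        # left child of the root
right_branch = ~(x < a)
nested_binary[right_branch & (x < b)] = 1       # left child of the right node
nested_binary[right_branch & ~(x < b)] = 2      # right child of the right node

print(np.array_equal(three_way, nested_binary))  # True
```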
Mathematical perspective:
Let $T$ be the set of all possible trees. Finding the optimal tree is:
$$\min_{T} \sum_{m=1}^{|T|} \sum_{i \in R_m} (y_i - \bar{y}_{R_m})^2 + \lambda |T|$$
where $|T|$ is tree complexity and $\lambda$ is a regularization parameter. This optimization is NP-complete because the number of possible trees grows super-exponentially with depth.
By restricting to binary splits and using greedy optimization, CART trades global optimality for polynomial-time solvability—a trade-off that has proven remarkably successful in practice.
| Aspect | Binary Splits | Multi-way Splits |
|---|---|---|
| Continuous features | O(n) threshold candidates | Must discretize or use exponential search |
| Categorical features | 2^(k-1)-1 subsets for k categories | Could use k-way split but loses hierarchy |
| Tree depth | May be deeper for same expressiveness | Shallower but wider |
| Interpretability | Each decision is a simple yes/no | More complex per-node decisions |
| Computational cost | O(np log n) per level | Generally exponential in categories |
| Expressiveness | Complete (can represent any partition) | More direct for some patterns |
Handling categorical features:
For categorical features with $k$ levels, CART considers all possible binary partitions into "left" and "right" subsets. There are $2^{k-1} - 1$ such partitions (excluding empty sets and the trivial all-in-one partition).
For regression, a remarkable optimization exists: sorting categories by their mean target value and only considering splits that separate contiguous groups in this ordering is guaranteed to find the optimal partition. This reduces complexity from $O(2^k)$ to $O(k \log k)$.
Proof sketch: For regression with MSE criterion, the optimal split always places categories with similar mean targets on the same side. This is because mixing categories with high and low means increases within-group variance.
```python
def find_best_categorical_split(feature, y, min_samples_leaf=1):
    """
    Find optimal binary split for a categorical feature using
    the sorted-means optimization for regression.

    This exploits the fact that for MSE, optimal splits always
    separate contiguous groups when categories are ordered by mean.
    """
    # Get unique categories and their statistics
    categories = np.unique(feature)
    n_categories = len(categories)

    if n_categories < 2:
        return None

    # Compute mean target for each category
    category_means = {}
    category_counts = {}
    for cat in categories:
        mask = feature == cat
        category_means[cat] = np.mean(y[mask])
        category_counts[cat] = np.sum(mask)

    # Sort categories by their mean (ascending)
    sorted_cats = sorted(categories, key=lambda c: category_means[c])

    # Now search only over contiguous splits
    # This is O(k) instead of O(2^k)
    best_gain = 0.0
    best_split = None

    total_var = np.var(y) * len(y)

    # Accumulate statistics as we scan through sorted categories
    n_left = 0
    sum_left = 0.0
    sum_sq_left = 0.0

    n_right = len(y)
    sum_right = np.sum(y)
    sum_sq_right = np.sum(y**2)

    for i in range(n_categories - 1):
        # Move category sorted_cats[i] from right to left
        cat = sorted_cats[i]
        mask = feature == cat
        cat_y = y[mask]
        n_cat = len(cat_y)

        n_left += n_cat
        sum_left += np.sum(cat_y)
        sum_sq_left += np.sum(cat_y**2)

        n_right -= n_cat
        sum_right -= np.sum(cat_y)
        sum_sq_right -= np.sum(cat_y**2)

        # Check minimum leaf size
        if n_left < min_samples_leaf or n_right < min_samples_leaf:
            continue

        # Compute impurity using incremental statistics
        mean_left = sum_left / n_left
        mean_right = sum_right / n_right

        var_left = (sum_sq_left / n_left) - mean_left**2
        var_right = (sum_sq_right / n_right) - mean_right**2

        child_impurity = n_left * var_left + n_right * var_right
        gain = total_var - child_impurity

        if gain > best_gain:
            best_gain = gain
            # Left group contains first i+1 categories in sorted order
            left_categories = set(sorted_cats[:i+1])
            best_split = {
                'left_categories': left_categories,
                'gain': gain,
                'n_left': n_left,
                'n_right': n_right
            }

    return best_split
```

Once the tree is built, prediction is straightforward: for a new observation $\mathbf{x}$, we traverse from root to leaf by following the appropriate branch at each internal node, then return the leaf's prediction.
Prediction algorithm:
$$\hat{y} = \text{predict}(\mathbf{x}) = \text{traverse}(\text{root}, \mathbf{x})$$
where traverse recursively follows splits: at each internal node with split $(j, t)$, descend to the left child if $x_j \leq t$ and to the right child otherwise, until a leaf is reached and its stored prediction is returned.
Computational complexity:
Traversal visits at most $d$ nodes where $d$ is tree depth, making prediction $O(d)$. For a balanced tree with $n$ training samples, $d = O(\log n)$. However, unbalanced trees can have $d = O(n)$ in degenerate cases.
```python
def predict(tree, X):
    """
    Make predictions for multiple samples.

    Parameters:
    -----------
    tree : dict
        Root node of the trained CART tree
    X : array-like, shape (n_samples, n_features)
        Feature matrix for prediction

    Returns:
    --------
    predictions : array, shape (n_samples,)
        Predicted target values
    """
    return np.array([predict_single(tree, x) for x in X])


def predict_single(node, x):
    """
    Traverse tree to predict for a single sample.

    This recursive function follows the decision path from
    root to leaf, returning the leaf's prediction.
    """
    # Base case: reached leaf node
    if node['type'] == 'leaf':
        return node['prediction']

    # Recursive case: follow appropriate branch
    feature_idx = node['feature_index']
    threshold = node['threshold']

    if x[feature_idx] <= threshold:
        return predict_single(node['left'], x)
    else:
        return predict_single(node['right'], x)


def predict_with_path(node, x, path=None):
    """
    Predict and return the decision path taken.

    Useful for interpretability and debugging.
    """
    if path is None:
        path = []

    if node['type'] == 'leaf':
        return node['prediction'], path

    feature_idx = node['feature_index']
    threshold = node['threshold']

    if x[feature_idx] <= threshold:
        path.append({
            'feature': feature_idx,
            'threshold': threshold,
            'direction': 'left',
            'condition': f'X[{feature_idx}] <= {threshold:.4f}'
        })
        return predict_with_path(node['left'], x, path)
    else:
        path.append({
            'feature': feature_idx,
            'threshold': threshold,
            'direction': 'right',
            'condition': f'X[{feature_idx}] > {threshold:.4f}'
        })
        return predict_with_path(node['right'], x, path)


def get_leaf_regions(tree, bounds=None, feature_names=None):
    """
    Extract all leaf regions with their bounds and predictions.

    This function recursively traverses the tree and accumulates
    the constraints that define each region.
    """
    regions = []
    _extract_regions(tree, bounds or {}, regions, feature_names)
    return regions


def _extract_regions(node, current_bounds, regions, feature_names):
    """Helper function to recursively extract regions."""
    if node['type'] == 'leaf':
        regions.append({
            'bounds': current_bounds.copy(),
            'prediction': node['prediction'],
            'n_samples': node['n_samples'],
            'variance': node.get('variance', 0)
        })
        return

    feature_idx = node['feature_index']
    threshold = node['threshold']
    fname = feature_names[feature_idx] if feature_names else f'X{feature_idx}'

    # Left child: add upper bound constraint
    left_bounds = current_bounds.copy()
    if fname in left_bounds:
        left_bounds[fname] = (left_bounds[fname][0],
                              min(left_bounds[fname][1], threshold))
    else:
        left_bounds[fname] = (float('-inf'), threshold)
    _extract_regions(node['left'], left_bounds, regions, feature_names)

    # Right child: add lower bound constraint
    right_bounds = current_bounds.copy()
    if fname in right_bounds:
        right_bounds[fname] = (max(right_bounds[fname][0], threshold),
                               right_bounds[fname][1])
    else:
        right_bounds[fname] = (threshold, float('inf'))
    _extract_regions(node['right'], right_bounds, regions, feature_names)
```

Each prediction path uniquely identifies a rectangular region in feature space. The prediction value is constant throughout this region. This property—that predictions are piecewise constant over axis-aligned boxes—is fundamental to understanding regression tree behavior and limitations.
Without stopping criteria, CART would grow until every leaf contains a single training sample (or all targets in a leaf are identical). This would perfectly fit the training data but generalize poorly—a classic case of overfitting.
Common stopping criteria:
- Maximum depth (`max_depth`): Limits tree height
- Minimum samples to split (`min_samples_split`): Requires minimum samples before considering a split
- Minimum samples per leaf (`min_samples_leaf`): Ensures each leaf has sufficient support
- Minimum impurity decrease (`min_impurity_decrease`): Requires splits to improve by a threshold
- Maximum leaf nodes (`max_leaf_nodes`): Caps total leaves (grows best-first rather than depth-first)

Trade-offs:
Each criterion balances bias and variance differently:
- Restrictive criteria (small `max_depth`, large `min_samples_leaf`) → High bias, low variance
- Permissive criteria (large `max_depth`, small `min_samples_leaf`) → Low bias, high variance
```python
class CARTRegressionTree:
    """
    Complete CART Regression Tree implementation with
    all standard stopping criteria.
    """

    def __init__(self, max_depth=None, min_samples_split=2,
                 min_samples_leaf=1, min_impurity_decrease=0.0,
                 max_leaf_nodes=None):
        """
        Initialize CART tree with stopping hyperparameters.

        Parameters:
        -----------
        max_depth : int or None
            Maximum tree depth. None for unlimited.
        min_samples_split : int
            Minimum samples required to attempt a split.
        min_samples_leaf : int
            Minimum samples required in each leaf.
        min_impurity_decrease : float
            Minimum impurity reduction to accept a split.
            A split is only made if:
            impurity_parent - weighted_impurity_children >= threshold
        max_leaf_nodes : int or None
            Maximum number of leaf nodes. If set, grows best-first
            (highest gain) rather than depth-first.
        """
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.min_samples_leaf = min_samples_leaf
        self.min_impurity_decrease = min_impurity_decrease
        self.max_leaf_nodes = max_leaf_nodes

        self.tree_ = None
        self.n_features_ = None

    def should_stop(self, n_samples, depth, impurity, best_gain):
        """
        Evaluate all stopping criteria.

        Returns True if any stopping condition is met.
        """
        # Maximum depth reached
        if self.max_depth is not None and depth >= self.max_depth:
            return True

        # Insufficient samples to split
        if n_samples < self.min_samples_split:
            return True

        # Pure node (zero impurity)
        if impurity <= 1e-10:
            return True

        # No valid split found (best_gain is None)
        if best_gain is None:
            return True

        # Insufficient impurity decrease
        if best_gain < self.min_impurity_decrease:
            return True

        return False

    def fit(self, X, y):
        """Fit the regression tree to training data."""
        self.n_features_ = X.shape[1]

        if self.max_leaf_nodes is not None:
            # Best-first growth (_grow_best_first) is referenced here
            # but not shown in this excerpt
            self.tree_ = self._grow_best_first(X, y)
        else:
            self.tree_ = self._grow_depth_first(X, y, depth=0)

        return self

    def _grow_depth_first(self, X, y, depth):
        """Standard depth-first recursive tree growing."""
        n_samples = len(y)
        impurity = np.var(y) * n_samples

        best_split = find_best_split(X, y, self.min_samples_leaf)
        best_gain = best_split['gain'] if best_split else None

        if self.should_stop(n_samples, depth, impurity, best_gain):
            return create_leaf_node(y)

        # Create split and recurse
        feature_idx = best_split['feature_index']
        threshold = best_split['threshold']

        left_mask = X[:, feature_idx] <= threshold
        right_mask = ~left_mask

        return {
            'type': 'internal',
            'feature_index': feature_idx,
            'threshold': threshold,
            'left': self._grow_depth_first(X[left_mask], y[left_mask], depth+1),
            'right': self._grow_depth_first(X[right_mask], y[right_mask], depth+1),
            'n_samples': n_samples,
            'impurity': impurity / n_samples
        }
```

Let's bring everything together into a complete, production-quality CART regression implementation that includes all the components we've discussed.
```python
import numpy as np
from typing import Optional, Dict, Any, List, Tuple


class CARTRegressor:
    """
    Complete CART Regression Tree Implementation

    This class implements the Classification and Regression Trees (CART)
    algorithm for regression as described by Breiman et al. (1984).

    The implementation uses the MSE (variance) splitting criterion
    and supports all standard hyperparameters for tree construction.

    Attributes:
    -----------
    tree_ : dict
        The trained tree structure after calling fit()
    n_features_ : int
        Number of features in training data
    feature_importances_ : ndarray
        Impurity-based feature importance scores (sum to 1)

    Examples:
    ---------
    >>> tree = CARTRegressor(max_depth=5, min_samples_leaf=10)
    >>> tree.fit(X_train, y_train)
    >>> predictions = tree.predict(X_test)
    >>> print(f"R² Score: {tree.score(X_test, y_test):.4f}")
    """

    def __init__(self,
                 max_depth: Optional[int] = None,
                 min_samples_split: int = 2,
                 min_samples_leaf: int = 1,
                 min_impurity_decrease: float = 0.0):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.min_samples_leaf = min_samples_leaf
        self.min_impurity_decrease = min_impurity_decrease

        # Attributes set during fitting
        self.tree_: Optional[Dict[str, Any]] = None
        self.n_features_: Optional[int] = None
        self.feature_importances_: Optional[np.ndarray] = None
        self._impurity_gains: List[Tuple[int, float]] = []

    def fit(self, X: np.ndarray, y: np.ndarray) -> 'CARTRegressor':
        """
        Build the regression tree from training data.

        Parameters:
        -----------
        X : ndarray, shape (n_samples, n_features)
            Training feature matrix
        y : ndarray, shape (n_samples,)
            Target values

        Returns:
        --------
        self : CARTRegressor
            Fitted estimator
        """
        X = np.atleast_2d(X)
        y = np.asarray(y).ravel()

        if len(X) != len(y):
            raise ValueError("X and y must have same number of samples")

        self.n_features_ = X.shape[1]
        self._impurity_gains = []

        # Grow tree recursively
        self.tree_ = self._build_tree(X, y, depth=0)

        # Compute feature importances from accumulated gains
        self._compute_feature_importances()

        return self

    def _build_tree(self, X: np.ndarray, y: np.ndarray,
                    depth: int) -> Dict[str, Any]:
        """Recursively build tree structure."""
        n_samples = len(y)
        impurity = np.var(y)

        # Check stopping criteria
        if self._should_stop(n_samples, depth, impurity):
            return self._create_leaf(y)

        # Find best split
        best = self._find_best_split(X, y)

        if best is None or best['gain'] < self.min_impurity_decrease:
            return self._create_leaf(y)

        # Record split for feature importance
        self._impurity_gains.append((best['feature'], best['gain']))

        # Partition data
        left_mask = X[:, best['feature']] <= best['threshold']
        right_mask = ~left_mask

        # Check child size constraints
        if (np.sum(left_mask) < self.min_samples_leaf or
                np.sum(right_mask) < self.min_samples_leaf):
            return self._create_leaf(y)

        # Build children recursively
        return {
            'type': 'internal',
            'feature': best['feature'],
            'threshold': best['threshold'],
            'left': self._build_tree(X[left_mask], y[left_mask], depth + 1),
            'right': self._build_tree(X[right_mask], y[right_mask], depth + 1),
            'n_samples': n_samples,
            'impurity': impurity,
            'gain': best['gain']
        }

    def _should_stop(self, n_samples: int, depth: int,
                     impurity: float) -> bool:
        """Evaluate stopping conditions."""
        if self.max_depth is not None and depth >= self.max_depth:
            return True
        if n_samples < self.min_samples_split:
            return True
        if impurity < 1e-10:  # Pure node
            return True
        return False

    def _create_leaf(self, y: np.ndarray) -> Dict[str, Any]:
        """Create leaf node with prediction."""
        return {
            'type': 'leaf',
            'prediction': float(np.mean(y)),
            'n_samples': len(y),
            'variance': float(np.var(y)),
            'std': float(np.std(y))
        }

    def _find_best_split(self, X: np.ndarray,
                         y: np.ndarray) -> Optional[Dict[str, Any]]:
        """Find optimal split using efficient sorted scan."""
        n_samples, n_features = X.shape
        parent_impurity = np.var(y) * n_samples

        best_gain = 0.0
        best_split = None

        for feat_idx in range(n_features):
            # Sort by feature
            order = np.argsort(X[:, feat_idx])
            sorted_y = y[order]
            sorted_x = X[order, feat_idx]

            # Running statistics
            n_left, sum_left, sum_sq_left = 0, 0.0, 0.0
            n_right = n_samples
            sum_right = np.sum(sorted_y)
            sum_sq_right = np.sum(sorted_y ** 2)

            for i in range(n_samples - 1):
                yi = sorted_y[i]

                # Update statistics
                n_left += 1
                sum_left += yi
                sum_sq_left += yi ** 2

                n_right -= 1
                sum_right -= yi
                sum_sq_right -= yi ** 2

                # Skip duplicate feature values
                if sorted_x[i] == sorted_x[i + 1]:
                    continue

                # Check min_samples_leaf
                if n_left < self.min_samples_leaf:
                    continue
                if n_right < self.min_samples_leaf:
                    break

                # Compute child variances
                mean_left = sum_left / n_left
                mean_right = sum_right / n_right

                var_left = max(0, sum_sq_left/n_left - mean_left**2)
                var_right = max(0, sum_sq_right/n_right - mean_right**2)

                child_impurity = n_left * var_left + n_right * var_right
                gain = parent_impurity - child_impurity

                if gain > best_gain:
                    best_gain = gain
                    best_split = {
                        'feature': feat_idx,
                        'threshold': (sorted_x[i] + sorted_x[i+1]) / 2,
                        'gain': gain
                    }

        return best_split

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict target values for samples."""
        X = np.atleast_2d(X)
        return np.array([self._predict_single(self.tree_, x) for x in X])

    def _predict_single(self, node: Dict, x: np.ndarray) -> float:
        """Traverse tree for single prediction."""
        if node['type'] == 'leaf':
            return node['prediction']
        if x[node['feature']] <= node['threshold']:
            return self._predict_single(node['left'], x)
        return self._predict_single(node['right'], x)

    def score(self, X: np.ndarray, y: np.ndarray) -> float:
        """Compute R² score."""
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1 - ss_res / ss_tot if ss_tot > 0 else 0.0

    def _compute_feature_importances(self):
        """Compute impurity-based feature importances."""
        importances = np.zeros(self.n_features_)
        for feat_idx, gain in self._impurity_gains:
            importances[feat_idx] += gain

        total = importances.sum()
        if total > 0:
            importances /= total

        self.feature_importances_ = importances
```

While CART forms the algorithmic foundation, modern practice involves additional considerations for robust deployment.
Using scikit-learn's DecisionTreeRegressor:
In production, you'll typically use established libraries rather than custom implementations. scikit-learn's DecisionTreeRegressor implements optimized CART with additional features:
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score, GridSearchCV

# Create and train regression tree
tree = DecisionTreeRegressor(
    max_depth=5,
    min_samples_split=10,
    min_samples_leaf=5,
    random_state=42
)
tree.fit(X_train, y_train)

# Evaluate performance
train_score = tree.score(X_train, y_train)
test_score = tree.score(X_test, y_test)
print(f"Training R²: {train_score:.4f}")
print(f"Test R²: {test_score:.4f}")

# Cross-validation for robust performance estimate
cv_scores = cross_val_score(tree, X, y, cv=5, scoring='r2')
print(f"CV R² Score: {cv_scores.mean():.4f} (+/- {cv_scores.std()*2:.4f})")

# Hyperparameter tuning
param_grid = {
    'max_depth': [3, 5, 7, 10, None],
    'min_samples_split': [2, 5, 10, 20],
    'min_samples_leaf': [1, 2, 5, 10]
}

grid_search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV MSE: {-grid_search.best_score_:.4f}")

# Feature importance analysis
importances = tree.feature_importances_
indices = np.argsort(importances)[::-1]

plt.figure(figsize=(10, 6))
plt.title("Feature Importances")
plt.bar(range(len(importances)), importances[indices])
plt.xticks(range(len(importances)), [feature_names[i] for i in indices],
           rotation=45)
plt.tight_layout()
plt.show()
```

Single decision trees excel when: (1) Interpretability is critical—you need to explain decisions; (2) Data has natural hierarchical/interaction structure; (3) You're building base learners for ensembles; (4) Features are mixed types (numeric and categorical); (5) Quick baseline models are needed. For maximum predictive power, trees are typically used within Random Forest or Gradient Boosting ensembles.
We've completed a comprehensive examination of the CART algorithm for regression—the foundational method for constructing regression decision trees.
What's next:
With the CART algorithm established, the next page examines the splitting criterion in depth—specifically, why Mean Squared Error (MSE) is used for regression and how it connects to broader statistical principles.
You now have a deep understanding of the CART regression algorithm—its mathematical foundations, computational implementation, and practical considerations. This knowledge is essential for understanding how modern tree-based methods like Random Forests and XGBoost work at their core.