In 1984, Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone published their seminal book Classification and Regression Trees (CART), introducing what would become the most influential decision tree algorithm in machine learning history. While decision trees had existed before CART, this algorithm brought a level of mathematical rigor, computational efficiency, and practical elegance that transformed trees from a heuristic technique into a principled statistical methodology.
CART for regression addresses a fundamental question: How can we recursively partition a feature space to predict a continuous target variable? The answer lies in a beautifully simple yet powerful algorithm that has stood the test of time for four decades and remains the foundation of modern ensemble methods like Random Forests and Gradient Boosting.
By the end of this page, you will understand the complete CART regression algorithm: how it recursively partitions feature space, why binary splits are both necessary and sufficient, how optimal split points are computed, and the mathematical principles that govern its behavior. You will be equipped to implement CART from scratch and understand its role as the foundation of modern tree-based methods.
Before CART, regression analysis was dominated by parametric methods—primarily linear regression and its extensions. These methods make strong assumptions about the functional form of the relationship between predictors and the target. While powerful when assumptions hold, they often fail when the true relationship is complex, nonlinear, or involves intricate interactions between variables.
The limitations of parametric regression:
What CART offered:
CART provided a fundamentally different approach—one that makes no assumptions about functional form at all. Instead of fitting a global model, CART creates a piecewise model by recursively partitioning the feature space into regions where simple predictions (constants or linear models) suffice. This approach:
The name CART reflects that the same algorithmic framework handles both Classification (categorical targets) and Regression (continuous targets). The key differences lie in the splitting criterion (Gini/entropy for classification, MSE for regression) and the leaf predictions (class probabilities vs. mean values). This unified framework was a key insight of the original work.
The mathematical philosophy of CART:
CART operates on a deceptively simple principle: at each step, find the single binary split that most reduces prediction error. By composing many such locally optimal splits, the algorithm constructs a globally effective predictor. This greedy, recursive approach sacrifices global optimality (finding the best tree is NP-complete) for computational tractability while achieving remarkably good practical performance.
The algorithm's beauty lies in this balance: it is simple enough to understand completely, efficient enough to scale to large datasets, and flexible enough to capture complex patterns that would elude parametric methods.
The CART algorithm for regression is a recursive procedure that builds a binary tree by repeatedly splitting the feature space. Let's formalize this precisely.
Problem Setup:
Given training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, where $\mathbf{x}_i \in \mathbb{R}^p$ is a vector of $p$ feature values and $y_i \in \mathbb{R}$ is a continuous target.
Core Algorithm (Recursive Partitioning):
The algorithm maintains a partition of the feature space into disjoint regions $\{R_1, R_2, \ldots, R_M\}$ where $M$ is the number of terminal nodes (leaves). In each region $R_m$, the prediction is constant:
$$\hat{f}(\mathbf{x}) = \sum_{m=1}^{M} c_m \cdot \mathbb{1}(\mathbf{x} \in R_m)$$
where $c_m$ is the predicted value for region $R_m$ and $\mathbb{1}(\cdot)$ is the indicator function.
```python
import numpy as np


def cart_regression(X, y, max_depth=None, min_samples_split=2,
                    min_samples_leaf=1, current_depth=0):
    """
    CART Regression Algorithm - Core Implementation

    This function implements the recursive partitioning procedure
    that forms the heart of the CART regression algorithm.

    Parameters:
    -----------
    X : array-like, shape (n_samples, n_features)
        Training feature matrix
    y : array-like, shape (n_samples,)
        Target values (continuous)
    max_depth : int or None
        Maximum depth of the tree (None for unlimited)
    min_samples_split : int
        Minimum samples required to split a node
    min_samples_leaf : int
        Minimum samples required in each leaf
    current_depth : int
        Current depth in the recursion (internal use)

    Returns:
    --------
    node : dict
        A dictionary representing the tree node
    """
    n_samples, n_features = X.shape

    # BASE CASE 1: Check stopping criteria
    # ====================================
    # These conditions determine when to create a leaf node

    # Create leaf if max depth reached
    if max_depth is not None and current_depth >= max_depth:
        return create_leaf_node(y)

    # Create leaf if insufficient samples to split
    if n_samples < min_samples_split:
        return create_leaf_node(y)

    # Create leaf if all targets are identical (pure node)
    if np.var(y) == 0:
        return create_leaf_node(y)

    # RECURSIVE CASE: Find best split
    # ================================
    best_split = find_best_split(X, y, min_samples_leaf)

    # Create leaf if no valid split found
    if best_split is None:
        return create_leaf_node(y)

    feature_idx = best_split['feature_index']
    threshold = best_split['threshold']

    # Partition samples based on best split
    left_mask = X[:, feature_idx] <= threshold
    right_mask = ~left_mask

    X_left, y_left = X[left_mask], y[left_mask]
    X_right, y_right = X[right_mask], y[right_mask]

    # Recursively build left and right subtrees
    left_child = cart_regression(
        X_left, y_left, max_depth, min_samples_split,
        min_samples_leaf, current_depth + 1
    )
    right_child = cart_regression(
        X_right, y_right, max_depth, min_samples_split,
        min_samples_leaf, current_depth + 1
    )

    # Return internal node
    return {
        'type': 'internal',
        'feature_index': feature_idx,
        'threshold': threshold,
        'left': left_child,
        'right': right_child,
        'n_samples': n_samples,
        'impurity': np.var(y)  # MSE before split
    }


def create_leaf_node(y):
    """
    Create a leaf node with constant prediction.

    For regression, the optimal prediction that minimizes
    squared error is the mean of target values in the region.
    """
    return {
        'type': 'leaf',
        'prediction': np.mean(y),
        'n_samples': len(y),
        'variance': np.var(y),
        'std': np.std(y)
    }
```

Understanding the recursion:
The algorithm above embodies the divide-and-conquer strategy that defines CART: check the stopping criteria; if none are met, find the best split, partition the samples, and recurse on each part.
This recursive structure naturally produces a binary tree in which each internal node stores a split (a feature index and a threshold) and each leaf stores a constant prediction (the mean of its training targets).
The heart of CART lies in the find_best_split function—the procedure that identifies the single best way to divide a node into two children. For regression, this means finding the split that most reduces prediction error.
The Split Search Problem:
For a node containing $n$ samples with target values $y_1, \ldots, y_n$, we consider splits of the form:
$$\text{Split}(j, t): \quad X_j \leq t \quad \text{vs} \quad X_j > t$$
where $j$ indexes the feature and $t$ is the threshold value. This creates two regions: $R_L(j, t) = \{\mathbf{x} : x_j \leq t\}$ and $R_R(j, t) = \{\mathbf{x} : x_j > t\}$.
The Objective Function:
For regression, CART minimizes the sum of squared residuals (or equivalently, weighted variance) after the split:
$$\text{Cost}(j, t) = \sum_{i \in R_L} (y_i - \bar{y}_L)^2 + \sum_{i \in R_R} (y_i - \bar{y}_R)^2$$
where $\bar{y}_L = \frac{1}{|R_L|} \sum_{i \in R_L} y_i$ and $\bar{y}_R = \frac{1}{|R_R|} \sum_{i \in R_R} y_i$.
This is equivalent to minimizing the weighted sum of child variances:
$$\text{Cost}(j, t) = n_L \cdot \text{Var}(y_L) + n_R \cdot \text{Var}(y_R)$$
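As a quick numerical check (a minimal sketch with synthetic target values, not part of the original text), the sum-of-squared-residuals form and the weighted-variance form of the cost agree exactly, since $n \cdot \text{Var}(y)$ is the sum of squared deviations from the mean:

```python
import numpy as np

# Hypothetical child target values for some candidate split
rng = np.random.default_rng(0)
y_left = rng.normal(loc=2.0, size=30)
y_right = rng.normal(loc=5.0, size=20)

# Sum of squared residuals around each child's mean
ssr = (np.sum((y_left - y_left.mean()) ** 2)
       + np.sum((y_right - y_right.mean()) ** 2))

# Weighted sum of child variances (population variance, ddof=0)
weighted_var = len(y_left) * np.var(y_left) + len(y_right) * np.var(y_right)

print(np.isclose(ssr, weighted_var))  # True: the two criteria are identical
```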
```python
def find_best_split(X, y, min_samples_leaf=1):
    """
    Find the optimal binary split for a regression tree node.

    This function exhaustively searches all features and all possible
    split points to find the split that minimizes the weighted sum
    of child node variances (MSE).

    Parameters:
    -----------
    X : array-like, shape (n_samples, n_features)
        Feature matrix for current node
    y : array-like, shape (n_samples,)
        Target values for current node
    min_samples_leaf : int
        Minimum samples required in each child

    Returns:
    --------
    best_split : dict or None
        Dictionary with 'feature_index', 'threshold', and 'gain'
        Returns None if no valid split exists
    """
    n_samples, n_features = X.shape

    # Calculate impurity of current node (before split)
    # This is the variance (MSE relative to mean)
    current_impurity = np.var(y) * n_samples

    best_gain = 0.0
    best_split = None

    # Iterate through all features
    for feature_idx in range(n_features):
        # Get unique values for this feature, sorted
        feature_values = X[:, feature_idx]
        unique_values = np.unique(feature_values)

        # For efficiency, consider midpoints between consecutive values
        # This avoids redundant evaluations and handles continuous features
        if len(unique_values) > 1:
            thresholds = (unique_values[:-1] + unique_values[1:]) / 2
        else:
            continue  # Cannot split on single-valued feature

        # Evaluate each potential threshold
        for threshold in thresholds:
            # Partition samples
            left_mask = feature_values <= threshold
            right_mask = ~left_mask

            n_left = np.sum(left_mask)
            n_right = np.sum(right_mask)

            # Check minimum leaf size constraint
            if n_left < min_samples_leaf or n_right < min_samples_leaf:
                continue

            # Calculate child node impurities
            y_left = y[left_mask]
            y_right = y[right_mask]

            left_impurity = np.var(y_left) * n_left if n_left > 0 else 0
            right_impurity = np.var(y_right) * n_right if n_right > 0 else 0

            # Total impurity after split
            child_impurity = left_impurity + right_impurity

            # Impurity reduction (gain)
            gain = current_impurity - child_impurity

            # Update best if this split is better
            if gain > best_gain:
                best_gain = gain
                best_split = {
                    'feature_index': feature_idx,
                    'threshold': threshold,
                    'gain': gain,
                    'n_left': n_left,
                    'n_right': n_right,
                    'left_mean': np.mean(y_left),
                    'right_mean': np.mean(y_right)
                }

    return best_split


def find_best_split_efficient(X, y, min_samples_leaf=1):
    """
    Optimized split finding using sorted scan technique.

    This O(n log n) approach sorts each feature once, then uses
    incremental updates to compute split statistics in O(n)
    per feature instead of O(n²).
    """
    n_samples, n_features = X.shape

    current_impurity = np.sum((y - np.mean(y))**2)

    best_gain = 0.0
    best_split = None

    for feature_idx in range(n_features):
        # Sort by feature value
        sorted_indices = np.argsort(X[:, feature_idx])
        sorted_y = y[sorted_indices]
        sorted_x = X[sorted_indices, feature_idx]

        # Initialize running statistics for incremental updates
        n_left = 0
        sum_left = 0.0
        sum_sq_left = 0.0

        n_right = n_samples
        sum_right = np.sum(sorted_y)
        sum_sq_right = np.sum(sorted_y**2)

        # Scan through potential split points
        for i in range(n_samples - 1):
            # Move sample i from right to left
            yi = sorted_y[i]

            n_left += 1
            sum_left += yi
            sum_sq_left += yi**2

            n_right -= 1
            sum_right -= yi
            sum_sq_right -= yi**2

            # Skip if not at a valid split boundary
            # (must have distinct feature values)
            if sorted_x[i] == sorted_x[i + 1]:
                continue

            # Check minimum leaf size
            if n_left < min_samples_leaf or n_right < min_samples_leaf:
                continue

            # Compute variance using: Var = E[X²] - E[X]²
            mean_left = sum_left / n_left
            mean_right = sum_right / n_right

            var_left = (sum_sq_left / n_left) - mean_left**2
            var_right = (sum_sq_right / n_right) - mean_right**2

            # Weighted sum of child variances
            child_impurity = n_left * var_left + n_right * var_right

            gain = current_impurity - child_impurity

            if gain > best_gain:
                best_gain = gain
                # Threshold at midpoint between consecutive values
                threshold = (sorted_x[i] + sorted_x[i + 1]) / 2
                best_split = {
                    'feature_index': feature_idx,
                    'threshold': threshold,
                    'gain': gain,
                    'n_left': n_left,
                    'n_right': n_right
                }

    return best_split
```

The naive split search is O(n²p) per node—for each of p features, we consider O(n) thresholds, and for each threshold, we compute statistics in O(n). The efficient version reduces this to O(np log n) by sorting each feature once (O(n log n) each) and using O(n) incremental scans. For large n, this optimization is crucial.
Why variance minimization?
The choice of variance (or equivalently, MSE) as the splitting criterion for regression is not arbitrary—it corresponds to maximum likelihood estimation under a Gaussian noise model:
$$y_i = f(\mathbf{x}_i) + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2)$$
Under this model, minimizing $\sum_i (y_i - \hat{y}_i)^2$ is equivalent to maximizing likelihood. The mean prediction in each region is the maximum likelihood estimate of the true function value for that region.
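To spell out the connection (a standard derivation under the Gaussian model above, added here for completeness), the negative log-likelihood of the observations in a region $R_m$ with constant prediction $c$ is

$$-\log L(c, \sigma^2) = \frac{n_m}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i \in R_m}(y_i - c)^2.$$

For fixed $\sigma^2$, the first term is constant, so maximizing the likelihood is the same as minimizing the squared-error term; setting $\frac{\partial}{\partial c}\sum_{i \in R_m}(y_i - c)^2 = 0$ gives $c = \bar{y}_{R_m}$, the region mean.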
Connection to analysis of variance (ANOVA):
The split search can also be viewed through the lens of one-way ANOVA. We're essentially asking: "Which binary grouping of samples (defined by feature $j$ and threshold $t$) explains the most variance in $y$?" The information gain (variance reduction) is proportional to the between-group sum of squares in an ANOVA decomposition.
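This identity can be verified directly (a minimal sketch with synthetic data; the array names are hypothetical): the gain computed by CART equals the between-group sum of squares in the decomposition total SS = within SS + between SS.

```python
import numpy as np

rng = np.random.default_rng(1)
y_left = rng.normal(0.0, 1.0, size=40)   # hypothetical left-child targets
y_right = rng.normal(3.0, 1.0, size=60)  # hypothetical right-child targets
y = np.concatenate([y_left, y_right])

# Variance reduction (gain) as CART computes it, in sum-of-squares form
total_ss = np.sum((y - y.mean()) ** 2)
within_ss = (np.sum((y_left - y_left.mean()) ** 2)
             + np.sum((y_right - y_right.mean()) ** 2))
gain = total_ss - within_ss

# Between-group sum of squares from the one-way ANOVA decomposition
between_ss = (len(y_left) * (y_left.mean() - y.mean()) ** 2
              + len(y_right) * (y_right.mean() - y.mean()) ** 2)

print(np.isclose(gain, between_ss))  # True
```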
A distinctive feature of CART is its exclusive use of binary splits—each internal node has exactly two children. This design choice has profound implications for the algorithm's behavior and efficiency.
Why binary splits?
Computational tractability: For a continuous feature, we only need to consider O(n) threshold candidates. Multi-way splits would require evaluating an exponential number of partitions.
Sufficient expressiveness: Any multi-way split can be represented by a sequence of binary splits. A 3-way split on feature $X$ (e.g., $X < a$, $a \leq X < b$, $X \geq b$) is equivalent to two consecutive binary splits (see the sketch after this list).
Algorithmic elegance: Binary trees have well-studied properties, efficient traversal algorithms, and natural recursive structure.
Interaction detection: Binary splits can still capture complex interactions through the tree hierarchy—what a single node cannot express, multiple levels can.
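To make the sufficient-expressiveness point concrete, here is a minimal sketch (synthetic 1-D data with hypothetical cut points `a` and `b`, not from the original text) showing that two nested binary splits reproduce a three-way partition:

```python
import numpy as np

# Hypothetical 1-D feature values and cut points a < b
x = np.array([0.5, 1.5, 2.5, 3.5, 4.5])
a, b = 2.0, 4.0

# Direct 3-way partition: X < a, a <= X < b, X >= b
three_way = np.where(x < a, 0, np.where(x < b, 1, 2))

# Same partition from two consecutive binary splits:
# root split on X < a, then split the right branch on X < b
nested_binary = np.empty_like(three_way)
nested_binary[x < a] = 0                        # left child of the root
right_branch = ~(x < a)
nested_binary[right_branch & (x < b)] = 1       # left child of the right node
nested_binary[right_branch & ~(x < b)] = 2      # right child of the right node

print(np.array_equal(three_way, nested_binary))  # True
```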
Mathematical perspective:
Let $T$ be the set of all possible trees. Finding the optimal tree is:
$$\min_{T} \sum_{m=1}^{|T|} \sum_{i \in R_m} (y_i - \bar{y}_{R_m})^2 + \lambda |T|$$
where $|T|$ is tree complexity and $\lambda$ is a regularization parameter. This optimization is NP-complete because the number of possible trees grows super-exponentially with depth.
By restricting to binary splits and using greedy optimization, CART trades global optimality for polynomial-time solvability—a trade-off that has proven remarkably successful in practice.
| Aspect | Binary Splits | Multi-way Splits |
|---|---|---|
| Continuous features | O(n) threshold candidates | Must discretize or use exponential search |
| Categorical features | 2^(k-1)-1 subsets for k categories | Could use k-way split but loses hierarchy |
| Tree depth | May be deeper for same expressiveness | Shallower but wider |
| Interpretability | Each decision is a simple yes/no | More complex per-node decisions |
| Computational cost | O(np log n) per level | Generally exponential in categories |
| Expressiveness | Complete (can represent any partition) | More direct for some patterns |
Handling categorical features:
For categorical features with $k$ levels, CART considers all possible binary partitions into "left" and "right" subsets. There are $2^{k-1} - 1$ such partitions (excluding empty sets and the trivial all-in-one partition).
For regression, a remarkable optimization exists: sorting categories by their mean target value and only considering splits that separate contiguous groups in this ordering is guaranteed to find the optimal partition. This reduces complexity from $O(2^k)$ to $O(k \log k)$.
Proof sketch: For regression with MSE criterion, the optimal split always places categories with similar mean targets on the same side. This is because mixing categories with high and low means increases within-group variance.
```python
def find_best_categorical_split(feature, y, min_samples_leaf=1):
    """
    Find optimal binary split for a categorical feature using
    the sorted-means optimization for regression.

    This exploits the fact that for MSE, optimal splits always
    separate contiguous groups when categories are ordered by mean.
    """
    # Get unique categories and their statistics
    categories = np.unique(feature)
    n_categories = len(categories)

    if n_categories < 2:
        return None

    # Compute mean target for each category
    category_means = {}
    category_counts = {}
    for cat in categories:
        mask = feature == cat
        category_means[cat] = np.mean(y[mask])
        category_counts[cat] = np.sum(mask)

    # Sort categories by their mean (ascending)
    sorted_cats = sorted(categories, key=lambda c: category_means[c])

    # Now search only over contiguous splits
    # This is O(k) instead of O(2^k)
    best_gain = 0.0
    best_split = None

    total_var = np.var(y) * len(y)

    # Accumulate statistics as we scan through sorted categories
    n_left = 0
    sum_left = 0.0
    sum_sq_left = 0.0

    n_right = len(y)
    sum_right = np.sum(y)
    sum_sq_right = np.sum(y**2)

    for i in range(n_categories - 1):
        # Move category sorted_cats[i] from right to left
        cat = sorted_cats[i]
        mask = feature == cat
        cat_y = y[mask]
        n_cat = len(cat_y)

        n_left += n_cat
        sum_left += np.sum(cat_y)
        sum_sq_left += np.sum(cat_y**2)

        n_right -= n_cat
        sum_right -= np.sum(cat_y)
        sum_sq_right -= np.sum(cat_y**2)

        # Check minimum leaf size
        if n_left < min_samples_leaf or n_right < min_samples_leaf:
            continue

        # Compute impurity using incremental statistics
        mean_left = sum_left / n_left
        mean_right = sum_right / n_right

        var_left = (sum_sq_left / n_left) - mean_left**2
        var_right = (sum_sq_right / n_right) - mean_right**2

        child_impurity = n_left * var_left + n_right * var_right
        gain = total_var - child_impurity

        if gain > best_gain:
            best_gain = gain
            # Left group contains first i+1 categories in sorted order
            left_categories = set(sorted_cats[:i+1])
            best_split = {
                'left_categories': left_categories,
                'gain': gain,
                'n_left': n_left,
                'n_right': n_right
            }

    return best_split
```

Once the tree is built, prediction is straightforward: for a new observation $\mathbf{x}$, we traverse from root to leaf by following the appropriate branch at each internal node, then return the leaf's prediction.
Prediction algorithm:
$$\hat{y} = \text{predict}(\mathbf{x}) = \text{traverse}(\text{root}, \mathbf{x})$$
where traverse recursively follows splits: at each internal node with split $(j, t)$, descend to the left child if $x_j \leq t$ and to the right child otherwise, until a leaf is reached and its stored prediction is returned.
Computational complexity:
Traversal visits at most $d$ nodes where $d$ is tree depth, making prediction $O(d)$. For a balanced tree with $n$ training samples, $d = O(\log n)$. However, unbalanced trees can have $d = O(n)$ in degenerate cases.
```python
def predict(tree, X):
    """
    Make predictions for multiple samples.

    Parameters:
    -----------
    tree : dict
        Root node of the trained CART tree
    X : array-like, shape (n_samples, n_features)
        Feature matrix for prediction

    Returns:
    --------
    predictions : array, shape (n_samples,)
        Predicted target values
    """
    return np.array([predict_single(tree, x) for x in X])


def predict_single(node, x):
    """
    Traverse tree to predict for a single sample.

    This recursive function follows the decision path from
    root to leaf, returning the leaf's prediction.
    """
    # Base case: reached leaf node
    if node['type'] == 'leaf':
        return node['prediction']

    # Recursive case: follow appropriate branch
    feature_idx = node['feature_index']
    threshold = node['threshold']

    if x[feature_idx] <= threshold:
        return predict_single(node['left'], x)
    else:
        return predict_single(node['right'], x)


def predict_with_path(node, x, path=None):
    """
    Predict and return the decision path taken.

    Useful for interpretability and debugging.
    """
    if path is None:
        path = []

    if node['type'] == 'leaf':
        return node['prediction'], path

    feature_idx = node['feature_index']
    threshold = node['threshold']

    if x[feature_idx] <= threshold:
        path.append({
            'feature': feature_idx,
            'threshold': threshold,
            'direction': 'left',
            'condition': f'X[{feature_idx}] <= {threshold:.4f}'
        })
        return predict_with_path(node['left'], x, path)
    else:
        path.append({
            'feature': feature_idx,
            'threshold': threshold,
            'direction': 'right',
            'condition': f'X[{feature_idx}] > {threshold:.4f}'
        })
        return predict_with_path(node['right'], x, path)


def get_leaf_regions(tree, bounds=None, feature_names=None):
    """
    Extract all leaf regions with their bounds and predictions.

    This function recursively traverses the tree and accumulates
    the constraints that define each region.
    """
    regions = []
    _extract_regions(tree, bounds or {}, regions, feature_names)
    return regions


def _extract_regions(node, current_bounds, regions, feature_names):
    """Helper function to recursively extract regions."""
    if node['type'] == 'leaf':
        regions.append({
            'bounds': current_bounds.copy(),
            'prediction': node['prediction'],
            'n_samples': node['n_samples'],
            'variance': node.get('variance', 0)
        })
        return

    feature_idx = node['feature_index']
    threshold = node['threshold']
    fname = feature_names[feature_idx] if feature_names else f'X{feature_idx}'

    # Left child: add upper bound constraint
    left_bounds = current_bounds.copy()
    if fname in left_bounds:
        left_bounds[fname] = (left_bounds[fname][0],
                              min(left_bounds[fname][1], threshold))
    else:
        left_bounds[fname] = (float('-inf'), threshold)
    _extract_regions(node['left'], left_bounds, regions, feature_names)

    # Right child: add lower bound constraint
    right_bounds = current_bounds.copy()
    if fname in right_bounds:
        right_bounds[fname] = (max(right_bounds[fname][0], threshold),
                               right_bounds[fname][1])
    else:
        right_bounds[fname] = (threshold, float('inf'))
    _extract_regions(node['right'], right_bounds, regions, feature_names)
```

Each prediction path uniquely identifies a rectangular region in feature space. The prediction value is constant throughout this region. This property—that predictions are piecewise constant over axis-aligned boxes—is fundamental to understanding regression tree behavior and limitations.
Without stopping criteria, CART would grow until every leaf contains a single training sample (or all targets in a leaf are identical). This would perfectly fit the training data but generalize poorly—a classic case of overfitting.
Common stopping criteria:
- Maximum depth (`max_depth`): Limits tree height
- Minimum samples to split (`min_samples_split`): Requires minimum samples before considering a split
- Minimum samples per leaf (`min_samples_leaf`): Ensures each leaf has sufficient support
- Minimum impurity decrease (`min_impurity_decrease`): Requires splits to improve by a threshold
- Maximum leaf nodes (`max_leaf_nodes`): Caps total leaves (grows best-first rather than depth-first)

Trade-offs:
Each criterion balances bias and variance differently:
- Restrictive criteria (small `max_depth`, large `min_samples_leaf`) → High bias, low variance
- Permissive criteria (large `max_depth`, small `min_samples_leaf`) → Low bias, high variance
```python
class CARTRegressionTree:
    """
    Complete CART Regression Tree implementation with
    all standard stopping criteria.
    """

    def __init__(self, max_depth=None, min_samples_split=2,
                 min_samples_leaf=1, min_impurity_decrease=0.0,
                 max_leaf_nodes=None):
        """
        Initialize CART tree with stopping hyperparameters.

        Parameters:
        -----------
        max_depth : int or None
            Maximum tree depth. None for unlimited.
        min_samples_split : int
            Minimum samples required to attempt a split.
        min_samples_leaf : int
            Minimum samples required in each leaf.
        min_impurity_decrease : float
            Minimum impurity reduction to accept a split.
            A split is only made if:
            impurity_parent - weighted_impurity_children >= threshold
        max_leaf_nodes : int or None
            Maximum number of leaf nodes. If set, grows best-first
            (highest gain) rather than depth-first.
        """
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.min_samples_leaf = min_samples_leaf
        self.min_impurity_decrease = min_impurity_decrease
        self.max_leaf_nodes = max_leaf_nodes

        self.tree_ = None
        self.n_features_ = None

    def should_stop(self, n_samples, depth, impurity, best_gain):
        """
        Evaluate all stopping criteria.

        Returns True if any stopping condition is met.
        """
        # Maximum depth reached
        if self.max_depth is not None and depth >= self.max_depth:
            return True

        # Insufficient samples to split
        if n_samples < self.min_samples_split:
            return True

        # Pure node (zero impurity)
        if impurity <= 1e-10:
            return True

        # No valid split found (best_gain is None)
        if best_gain is None:
            return True

        # Insufficient impurity decrease
        if best_gain < self.min_impurity_decrease:
            return True

        return False

    def fit(self, X, y):
        """Fit the regression tree to training data."""
        self.n_features_ = X.shape[1]

        if self.max_leaf_nodes is not None:
            # Best-first growth (_grow_best_first) is referenced here
            # but not shown in this excerpt
            self.tree_ = self._grow_best_first(X, y)
        else:
            self.tree_ = self._grow_depth_first(X, y, depth=0)

        return self

    def _grow_depth_first(self, X, y, depth):
        """Standard depth-first recursive tree growing."""
        n_samples = len(y)
        impurity = np.var(y) * n_samples

        best_split = find_best_split(X, y, self.min_samples_leaf)
        best_gain = best_split['gain'] if best_split else None

        if self.should_stop(n_samples, depth, impurity, best_gain):
            return create_leaf_node(y)

        # Create split and recurse
        feature_idx = best_split['feature_index']
        threshold = best_split['threshold']

        left_mask = X[:, feature_idx] <= threshold
        right_mask = ~left_mask

        return {
            'type': 'internal',
            'feature_index': feature_idx,
            'threshold': threshold,
            'left': self._grow_depth_first(X[left_mask], y[left_mask], depth+1),
            'right': self._grow_depth_first(X[right_mask], y[right_mask], depth+1),
            'n_samples': n_samples,
            'impurity': impurity / n_samples
        }
```

Let's bring everything together into a complete, production-quality CART regression implementation that includes all the components we've discussed.
```python
import numpy as np
from typing import Optional, Dict, Any, List, Tuple


class CARTRegressor:
    """
    Complete CART Regression Tree Implementation

    This class implements the Classification and Regression Trees (CART)
    algorithm for regression as described by Breiman et al. (1984).

    The implementation uses the MSE (variance) splitting criterion
    and supports all standard hyperparameters for tree construction.

    Attributes:
    -----------
    tree_ : dict
        The trained tree structure after calling fit()
    n_features_ : int
        Number of features in training data
    feature_importances_ : ndarray
        Impurity-based feature importance scores (sum to 1)

    Examples:
    ---------
    >>> tree = CARTRegressor(max_depth=5, min_samples_leaf=10)
    >>> tree.fit(X_train, y_train)
    >>> predictions = tree.predict(X_test)
    >>> print(f"R² Score: {tree.score(X_test, y_test):.4f}")
    """

    def __init__(self,
                 max_depth: Optional[int] = None,
                 min_samples_split: int = 2,
                 min_samples_leaf: int = 1,
                 min_impurity_decrease: float = 0.0):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.min_samples_leaf = min_samples_leaf
        self.min_impurity_decrease = min_impurity_decrease

        # Attributes set during fitting
        self.tree_: Optional[Dict[str, Any]] = None
        self.n_features_: Optional[int] = None
        self.feature_importances_: Optional[np.ndarray] = None
        self._impurity_gains: List[Tuple[int, float]] = []

    def fit(self, X: np.ndarray, y: np.ndarray) -> 'CARTRegressor':
        """
        Build the regression tree from training data.

        Parameters:
        -----------
        X : ndarray, shape (n_samples, n_features)
            Training feature matrix
        y : ndarray, shape (n_samples,)
            Target values

        Returns:
        --------
        self : CARTRegressor
            Fitted estimator
        """
        X = np.atleast_2d(X)
        y = np.asarray(y).ravel()

        if len(X) != len(y):
            raise ValueError("X and y must have same number of samples")

        self.n_features_ = X.shape[1]
        self._impurity_gains = []

        # Grow tree recursively
        self.tree_ = self._build_tree(X, y, depth=0)

        # Compute feature importances from accumulated gains
        self._compute_feature_importances()

        return self

    def _build_tree(self, X: np.ndarray, y: np.ndarray,
                    depth: int) -> Dict[str, Any]:
        """Recursively build tree structure."""
        n_samples = len(y)
        impurity = np.var(y)

        # Check stopping criteria
        if self._should_stop(n_samples, depth, impurity):
            return self._create_leaf(y)

        # Find best split
        best = self._find_best_split(X, y)

        if best is None or best['gain'] < self.min_impurity_decrease:
            return self._create_leaf(y)

        # Record split for feature importance
        self._impurity_gains.append((best['feature'], best['gain']))

        # Partition data
        left_mask = X[:, best['feature']] <= best['threshold']
        right_mask = ~left_mask

        # Check child size constraints
        if (np.sum(left_mask) < self.min_samples_leaf or
                np.sum(right_mask) < self.min_samples_leaf):
            return self._create_leaf(y)

        # Build children recursively
        return {
            'type': 'internal',
            'feature': best['feature'],
            'threshold': best['threshold'],
            'left': self._build_tree(X[left_mask], y[left_mask], depth + 1),
            'right': self._build_tree(X[right_mask], y[right_mask], depth + 1),
            'n_samples': n_samples,
            'impurity': impurity,
            'gain': best['gain']
        }

    def _should_stop(self, n_samples: int, depth: int,
                     impurity: float) -> bool:
        """Evaluate stopping conditions."""
        if self.max_depth is not None and depth >= self.max_depth:
            return True
        if n_samples < self.min_samples_split:
            return True
        if impurity < 1e-10:  # Pure node
            return True
        return False

    def _create_leaf(self, y: np.ndarray) -> Dict[str, Any]:
        """Create leaf node with prediction."""
        return {
            'type': 'leaf',
            'prediction': float(np.mean(y)),
            'n_samples': len(y),
            'variance': float(np.var(y)),
            'std': float(np.std(y))
        }

    def _find_best_split(self, X: np.ndarray,
                         y: np.ndarray) -> Optional[Dict[str, Any]]:
        """Find optimal split using efficient sorted scan."""
        n_samples, n_features = X.shape
        parent_impurity = np.var(y) * n_samples

        best_gain = 0.0
        best_split = None

        for feat_idx in range(n_features):
            # Sort by feature
            order = np.argsort(X[:, feat_idx])
            sorted_y = y[order]
            sorted_x = X[order, feat_idx]

            # Running statistics
            n_left, sum_left, sum_sq_left = 0, 0.0, 0.0
            n_right = n_samples
            sum_right = np.sum(sorted_y)
            sum_sq_right = np.sum(sorted_y ** 2)

            for i in range(n_samples - 1):
                yi = sorted_y[i]

                # Update statistics
                n_left += 1
                sum_left += yi
                sum_sq_left += yi ** 2

                n_right -= 1
                sum_right -= yi
                sum_sq_right -= yi ** 2

                # Skip duplicate feature values
                if sorted_x[i] == sorted_x[i + 1]:
                    continue

                # Check min_samples_leaf
                if n_left < self.min_samples_leaf:
                    continue
                if n_right < self.min_samples_leaf:
                    break

                # Compute child variances
                mean_left = sum_left / n_left
                mean_right = sum_right / n_right

                var_left = max(0, sum_sq_left/n_left - mean_left**2)
                var_right = max(0, sum_sq_right/n_right - mean_right**2)

                child_impurity = n_left * var_left + n_right * var_right
                gain = parent_impurity - child_impurity

                if gain > best_gain:
                    best_gain = gain
                    best_split = {
                        'feature': feat_idx,
                        'threshold': (sorted_x[i] + sorted_x[i+1]) / 2,
                        'gain': gain
                    }

        return best_split

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict target values for samples."""
        X = np.atleast_2d(X)
        return np.array([self._predict_single(self.tree_, x) for x in X])

    def _predict_single(self, node: Dict, x: np.ndarray) -> float:
        """Traverse tree for single prediction."""
        if node['type'] == 'leaf':
            return node['prediction']
        if x[node['feature']] <= node['threshold']:
            return self._predict_single(node['left'], x)
        return self._predict_single(node['right'], x)

    def score(self, X: np.ndarray, y: np.ndarray) -> float:
        """Compute R² score."""
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1 - ss_res / ss_tot if ss_tot > 0 else 0.0

    def _compute_feature_importances(self):
        """Compute impurity-based feature importances."""
        importances = np.zeros(self.n_features_)
        for feat_idx, gain in self._impurity_gains:
            importances[feat_idx] += gain

        total = importances.sum()
        if total > 0:
            importances /= total

        self.feature_importances_ = importances
```

While CART forms the algorithmic foundation, modern practice involves additional considerations for robust deployment.
Using scikit-learn's DecisionTreeRegressor:
In production, you'll typically use established libraries rather than custom implementations. scikit-learn's DecisionTreeRegressor implements optimized CART with additional features:
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score, GridSearchCV

# Create and train regression tree
tree = DecisionTreeRegressor(
    max_depth=5,
    min_samples_split=10,
    min_samples_leaf=5,
    random_state=42
)
tree.fit(X_train, y_train)

# Evaluate performance
train_score = tree.score(X_train, y_train)
test_score = tree.score(X_test, y_test)
print(f"Training R²: {train_score:.4f}")
print(f"Test R²: {test_score:.4f}")

# Cross-validation for robust performance estimate
cv_scores = cross_val_score(tree, X, y, cv=5, scoring='r2')
print(f"CV R² Score: {cv_scores.mean():.4f} (+/- {cv_scores.std()*2:.4f})")

# Hyperparameter tuning
param_grid = {
    'max_depth': [3, 5, 7, 10, None],
    'min_samples_split': [2, 5, 10, 20],
    'min_samples_leaf': [1, 2, 5, 10]
}

grid_search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV MSE: {-grid_search.best_score_:.4f}")

# Feature importance analysis
importances = tree.feature_importances_
indices = np.argsort(importances)[::-1]

plt.figure(figsize=(10, 6))
plt.title("Feature Importances")
plt.bar(range(len(importances)), importances[indices])
plt.xticks(range(len(importances)), [feature_names[i] for i in indices],
           rotation=45)
plt.tight_layout()
plt.show()
```

Single decision trees excel when: (1) Interpretability is critical—you need to explain decisions; (2) Data has natural hierarchical/interaction structure; (3) You're building base learners for ensembles; (4) Features are mixed types (numeric and categorical); (5) Quick baseline models are needed. For maximum predictive power, trees are typically used within Random Forest or Gradient Boosting ensembles.
We've completed a comprehensive examination of the CART algorithm for regression—the foundational method for constructing regression decision trees.
What's next:
With the CART algorithm established, the next page examines the splitting criterion in depth—specifically, why Mean Squared Error (MSE) is used for regression and how it connects to broader statistical principles.
You now have a deep understanding of the CART regression algorithm—its mathematical foundations, computational implementation, and practical considerations. This knowledge is essential for understanding how modern tree-based methods like Random Forests and XGBoost work at their core.