In a regression tree, internal nodes ask questions—they split the feature space. But leaf nodes answer questions—they provide the actual predictions. When a new sample traverses the tree from root to leaf, the journey's destination determines its predicted value.
The most common leaf prediction is deceptively simple: return the mean of training targets that reached that leaf. Yet this simplicity belies profound statistical optimality. Understanding why the mean is optimal, when alternatives might be better, and how to quantify prediction uncertainty transforms a basic understanding into genuine expertise.
By the end of this page, you will understand: why the mean minimizes squared error within each region, how to compute and interpret leaf statistics, alternative prediction strategies (median, quantiles, weighted), uncertainty quantification in leaf predictions, and advanced techniques like linear models in leaves. You'll also understand how ensemble methods leverage leaf predictions.
For a leaf node $m$ containing training samples with targets $\{y_1, y_2, \ldots, y_{n_m}\}$, the regression tree predicts a constant value $c_m$ for all samples falling in that region. What should $c_m$ be?
The optimization problem:
Given the MSE loss function, we seek:
$$c_m^* = \arg\min_{c} \sum_{i \in R_m} (y_i - c)^2$$
Derivation:
Taking the derivative and setting to zero:
$$\frac{d}{dc}\sum_{i \in R_m} (y_i - c)^2 = \sum_{i \in R_m} -2(y_i - c) = 0$$
$$\sum_{i \in R_m} y_i = n_m \cdot c$$
$$c_m^* = \frac{1}{n_m}\sum_{i \in R_m} y_i = \bar{y}_m$$
The optimal constant prediction is the sample mean.
Verification (second derivative test):
$$\frac{d^2}{dc^2}\sum_{i \in R_m} (y_i - c)^2 = 2n_m > 0$$
The second derivative is positive, confirming this is a minimum.
The sample mean minimizes the sum of squared deviations for any set of values—this is a fundamental result from statistics. In the context of trees, we condition on the sample reaching leaf $m$: the conditional mean $\mathbb{E}[Y \mid X \in R_m]$ is approximated by the sample mean $\bar{y}_m$. This estimator is the maximum likelihood estimate under Gaussian assumptions and is unbiased regardless of the distribution.
```python
import numpy as np
from scipy.optimize import minimize_scalar


def demonstrate_mean_optimality():
    """
    Demonstrate that the mean minimizes MSE for constant predictions.
    """
    # Sample leaf data
    y_leaf = np.array([2.1, 3.5, 2.8, 4.2, 3.1, 2.9, 3.8, 3.3])

    # Compute the mean
    mean_prediction = np.mean(y_leaf)

    # Define MSE as function of constant prediction
    def mse(c):
        return np.mean((y_leaf - c) ** 2)

    # Find minimum numerically
    result = minimize_scalar(mse, bounds=(0, 10), method='bounded')
    numerical_optimal = result.x

    print("Demonstration: Mean Minimizes MSE")
    print("=" * 50)
    print(f"Leaf targets: {y_leaf}")
    print(f"Sample mean: {mean_prediction:.6f}")
    print(f"Numerical minimum: {numerical_optimal:.6f}")
    print(f"Match: {np.isclose(mean_prediction, numerical_optimal)}")

    # Show MSE at different predictions
    print("\nMSE at various prediction values:")
    for c in [2.0, 2.5, mean_prediction, 3.5, 4.0]:
        print(f"  c = {c:.2f}: MSE = {mse(c):.6f}")

    # Evaluate the convex objective on a grid (useful for plotting)
    c_values = np.linspace(1, 5, 100)
    mse_values = [mse(c) for c in c_values]

    print(f"\nMinimum MSE at mean: {mse(mean_prediction):.6f}")
    print(f"This equals the variance: {np.var(y_leaf):.6f}")

    return mean_prediction, mse(mean_prediction)


def weighted_mean_prediction(y, weights=None):
    """
    Compute weighted mean for leaf prediction.

    Useful when samples have different importance (e.g., sample weights).
    """
    if weights is None:
        return np.mean(y)
    weights = np.asarray(weights)
    return np.sum(weights * y) / np.sum(weights)


def optimal_prediction_for_loss(y, loss='mse'):
    """
    Find optimal constant prediction for different loss functions.
    """
    if loss == 'mse':
        # Mean minimizes MSE
        return np.mean(y)
    elif loss == 'mae':
        # Median minimizes MAE
        return np.median(y)
    elif loss == 'huber':
        # Huber M-estimator: minimize the Huber loss numerically
        def huber_loss(c, delta=1.0):
            errors = y - c
            return np.mean(np.where(
                np.abs(errors) <= delta,
                0.5 * errors**2,
                delta * (np.abs(errors) - 0.5 * delta)
            ))
        result = minimize_scalar(huber_loss, bounds=(np.min(y), np.max(y)),
                                 method='bounded')
        return result.x
    elif loss == 'quantile':
        # Quantile prediction (default: median = 0.5 quantile)
        return np.percentile(y, 50)
    else:
        raise ValueError(f"Unknown loss: {loss}")


# Run demonstration
demonstrate_mean_optimality()
```

Beyond the mean prediction, leaf nodes can provide rich statistical information that aids interpretation, uncertainty quantification, and model diagnostics.
Essential leaf statistics:
```python
import numpy as np
from typing import Dict, Any, List
from scipy import stats


class LeafStatistics:
    """
    Comprehensive statistics for a regression tree leaf node.

    Provides point prediction, uncertainty quantification,
    and diagnostic information for interpretation.
    """

    def __init__(self, y: np.ndarray, sample_indices: np.ndarray = None):
        """
        Compute all leaf statistics from target values.

        Parameters:
        -----------
        y : ndarray
            Target values of samples in this leaf
        sample_indices : ndarray, optional
            Original indices of samples (for residual tracking)
        """
        self.y = np.asarray(y)
        self.n = len(self.y)
        self.sample_indices = sample_indices

        if self.n == 0:
            raise ValueError("Leaf cannot be empty")

        # Core statistics
        self.prediction = np.mean(self.y)
        self.variance = np.var(self.y, ddof=0)  # Population variance
        self.std = np.std(self.y, ddof=0)

        # For inference, use ddof=1 for unbiased variance estimate
        self.sample_variance = np.var(self.y, ddof=1) if self.n > 1 else 0.0
        self.sample_std = np.sqrt(self.sample_variance)

        # Standard error of the mean
        self.standard_error = self.sample_std / np.sqrt(self.n) if self.n > 1 else float('inf')

        # Range
        self.min_value = np.min(self.y)
        self.max_value = np.max(self.y)
        self.range = self.max_value - self.min_value

        # Residuals
        self.residuals = self.y - self.prediction
        self.sum_squared_residuals = np.sum(self.residuals ** 2)
        self.mse = self.sum_squared_residuals / self.n

        # Higher moments
        self.skewness = stats.skew(self.y) if self.n > 2 else 0.0
        self.kurtosis = stats.kurtosis(self.y) if self.n > 3 else 0.0

    def confidence_interval(self, confidence: float = 0.95) -> tuple:
        """
        Compute confidence interval for the mean prediction.

        Uses t-distribution for proper coverage with finite samples.
        """
        if self.n <= 1:
            return (float('-inf'), float('inf'))

        alpha = 1 - confidence
        t_critical = stats.t.ppf(1 - alpha/2, df=self.n - 1)
        margin = t_critical * self.standard_error

        return (self.prediction - margin, self.prediction + margin)

    def prediction_interval(self, confidence: float = 0.95) -> tuple:
        """
        Compute prediction interval for new observations.

        Wider than confidence interval because it accounts for
        both mean uncertainty AND individual variation.
        """
        if self.n <= 1:
            return (float('-inf'), float('inf'))

        alpha = 1 - confidence
        t_critical = stats.t.ppf(1 - alpha/2, df=self.n - 1)

        # Prediction interval standard error includes individual variance
        pred_se = self.sample_std * np.sqrt(1 + 1/self.n)
        margin = t_critical * pred_se

        return (self.prediction - margin, self.prediction + margin)

    def quantile_predictions(self, quantiles: List[float] = [0.1, 0.5, 0.9]) -> Dict[float, float]:
        """
        Compute quantile predictions from leaf distribution.

        Useful for prediction intervals and distributional forecasts.
        """
        return {q: np.percentile(self.y, 100 * q) for q in quantiles}

    def to_dict(self) -> Dict[str, Any]:
        """Export all statistics as dictionary."""
        ci = self.confidence_interval()
        pi = self.prediction_interval()

        return {
            'prediction': self.prediction,
            'n_samples': self.n,
            'variance': self.variance,
            'std': self.std,
            'standard_error': self.standard_error,
            'min': self.min_value,
            'max': self.max_value,
            'range': self.range,
            'mse': self.mse,
            'confidence_interval_95': ci,
            'prediction_interval_95': pi,
            'skewness': self.skewness,
            'kurtosis': self.kurtosis
        }

    def summary(self) -> str:
        """Human-readable summary of leaf statistics."""
        ci = self.confidence_interval()
        pi = self.prediction_interval()

        return f"""Leaf Summary (n = {self.n})
{'=' * 40}
Prediction:    {self.prediction:.4f}
Std Dev:       {self.std:.4f}
Std Error:     {self.standard_error:.4f}
Range:         [{self.min_value:.4f}, {self.max_value:.4f}]
95% CI (mean): [{ci[0]:.4f}, {ci[1]:.4f}]
95% PI (new):  [{pi[0]:.4f}, {pi[1]:.4f}]
MSE:           {self.mse:.4f}"""


def compute_tree_leaf_statistics(tree, X, y):
    """
    Compute statistics for all leaves of a trained tree.
    """
    # Get leaf assignments
    leaf_ids = tree.apply(X)
    unique_leaves = np.unique(leaf_ids)

    leaf_stats = {}
    for leaf_id in unique_leaves:
        mask = leaf_ids == leaf_id
        y_leaf = y[mask]
        indices = np.where(mask)[0]
        leaf_stats[leaf_id] = LeafStatistics(y_leaf, indices)

    return leaf_stats


# Demonstration
np.random.seed(42)
y_demo = np.random.normal(10, 2, 50)
stats_demo = LeafStatistics(y_demo)
print(stats_demo.summary())
```

| Statistic | Interpretation | When Useful |
|---|---|---|
| n_samples | Leaf support/reliability | Small n → uncertain predictions |
| variance | Within-leaf heterogeneity | High variance → more splits needed? |
| standard_error | Uncertainty of mean estimate | Statistical inference |
| range | Extreme value bounds | Outlier detection |
| confidence_interval | Where true mean likely lies | Assessing mean precision |
| prediction_interval | Where new values likely fall | Realistic error bounds |
While the mean is optimal for MSE, alternative prediction strategies serve different objectives.
Median prediction (robust):
The median minimizes Mean Absolute Error (MAE):
$$c_m^{\text{MAE}} = \arg\min_c \sum_{i \in R_m} |y_i - c| = \text{median}(y_m)$$
Advantages: robust to outliers within the leaf, optimal when MAE is the evaluation metric, and better behaved for skewed target distributions.
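As a quick sanity check (a minimal sketch assuming NumPy and SciPy; the leaf values below are illustrative), the following compares the MAE-minimizing constant found numerically with the sample median and the mean:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative leaf targets with one outlier (odd count, so the median is unique)
y_leaf = np.array([2.1, 3.5, 2.8, 4.2, 3.1, 2.9, 12.0])

def mae(c):
    # Mean absolute error of a constant prediction c
    return np.mean(np.abs(y_leaf - c))

result = minimize_scalar(mae, bounds=(y_leaf.min(), y_leaf.max()), method='bounded')

print(f"Median:             {np.median(y_leaf):.4f}")
print(f"Numerical argmin:   {result.x:.4f}")    # lands at (or very near) the median
print(f"Mean (for contrast): {np.mean(y_leaf):.4f}")  # pulled toward the outlier
```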
Quantile prediction (distributional):
For quantile $\tau \in (0, 1)$:
$$c_m^\tau = \arg\min_c \sum_{i \in R_m} \rho_\tau(y_i - c)$$
where $\rho_\tau(u) = u(\tau - \mathbb{1}_{u < 0})$ is the asymmetric pinball loss.
Useful for: prediction intervals, distributional forecasts, and applications where tail behavior matters (a small numerical check follows).
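To make the pinball-loss claim concrete, here is a minimal sketch (assuming NumPy and SciPy; the simulated targets are illustrative) verifying that the empirical $\tau$-quantile approximately minimizes the average pinball loss:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
y_leaf = rng.normal(10, 2, 500)  # illustrative leaf targets
tau = 0.9

def pinball(c):
    # rho_tau(u) = u * (tau - 1{u < 0}), averaged over the leaf
    u = y_leaf - c
    return np.mean(u * (tau - (u < 0)))

result = minimize_scalar(pinball, bounds=(y_leaf.min(), y_leaf.max()), method='bounded')

print(f"Empirical 90th percentile: {np.percentile(y_leaf, 90):.4f}")
print(f"Pinball-loss minimizer:    {result.x:.4f}")  # should agree closely
```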
Weighted prediction:
With sample weights $w_i$:
$$c_m^w = \frac{\sum_{i \in R_m} w_i y_i}{\sum_{i \in R_m} w_i}$$
Useful for: samples with unequal importance or reliability, cost-sensitive settings, and survey or importance weights.
```python
import numpy as np
from typing import Optional


class FlexibleLeafPredictor:
    """
    Leaf prediction with multiple strategies.

    Supports mean, median, quantile, and weighted predictions.
    """

    def __init__(self, strategy: str = 'mean', **kwargs):
        """
        Initialize predictor with specified strategy.

        Parameters:
        -----------
        strategy : str
            One of 'mean', 'median', 'quantile', 'weighted', 'trimmed_mean'
        **kwargs :
            Strategy-specific parameters:
            - quantile: 'q' (float, 0-1)
            - trimmed_mean: 'trim_fraction' (float, 0-0.5)
        """
        self.strategy = strategy
        self.kwargs = kwargs

    def predict(self, y: np.ndarray, weights: Optional[np.ndarray] = None) -> float:
        """Compute leaf prediction using configured strategy."""
        if len(y) == 0:
            raise ValueError("Cannot predict from empty leaf")

        if self.strategy == 'mean':
            if weights is not None:
                return np.average(y, weights=weights)
            return np.mean(y)

        elif self.strategy == 'median':
            if weights is not None:
                # Weighted median
                return self._weighted_percentile(y, weights, 50)
            return np.median(y)

        elif self.strategy == 'quantile':
            q = self.kwargs.get('q', 0.5)  # Default to median
            if weights is not None:
                return self._weighted_percentile(y, weights, 100 * q)
            return np.percentile(y, 100 * q)

        elif self.strategy == 'trimmed_mean':
            # Trim extreme values before computing mean
            trim = self.kwargs.get('trim_fraction', 0.1)
            from scipy.stats import trim_mean
            return trim_mean(y, trim)

        elif self.strategy == 'winsorized_mean':
            # Cap extreme values instead of trimming
            limit = self.kwargs.get('limit', 0.05)
            from scipy.stats.mstats import winsorize
            return np.mean(winsorize(y, limits=[limit, limit]))

        else:
            raise ValueError(f"Unknown strategy: {self.strategy}")

    def _weighted_percentile(self, y: np.ndarray, weights: np.ndarray,
                             percentile: float) -> float:
        """Compute weighted percentile."""
        sorted_idx = np.argsort(y)
        sorted_y = y[sorted_idx]
        sorted_w = weights[sorted_idx]

        # Cumulative weight
        cum_weight = np.cumsum(sorted_w)
        cum_weight_normalized = cum_weight / cum_weight[-1]

        # Find position
        target = percentile / 100
        idx = np.searchsorted(cum_weight_normalized, target)

        if idx == 0:
            return sorted_y[0]
        elif idx >= len(sorted_y):
            return sorted_y[-1]
        else:
            return sorted_y[idx]


def compare_prediction_strategies():
    """
    Compare prediction strategies on data with outliers.
    """
    np.random.seed(42)

    # Normal data
    y_normal = np.random.normal(10, 2, 100)

    # Data with outliers
    y_outliers = np.concatenate([
        np.random.normal(10, 2, 95),
        np.array([50, 60, 70, -20, -30])  # 5 extreme outliers
    ])

    strategies = ['mean', 'median', 'trimmed_mean', 'winsorized_mean']

    print("Prediction Strategy Comparison")
    print("=" * 60)

    print("\nNormal data (n=100, μ=10, σ=2):")
    for strat in strategies:
        predictor = FlexibleLeafPredictor(strat, trim_fraction=0.1, limit=0.05)
        pred = predictor.predict(y_normal)
        print(f"  {strat:20s}: {pred:.4f}")

    print("\nData with 5% extreme outliers:")
    for strat in strategies:
        predictor = FlexibleLeafPredictor(strat, trim_fraction=0.1, limit=0.05)
        pred = predictor.predict(y_outliers)
        print(f"  {strat:20s}: {pred:.4f}")

    print("\nTrue population mean: 10.0")
    print("Mean is sensitive to outliers; median and trimmed mean are robust.")


compare_prediction_strategies()
```

Use mean when: outliers are informative, squared error is the metric, data is roughly symmetric. Use median when: outliers are noise, MAE is the metric, distribution is skewed. Use quantile when: you need prediction intervals or care about tail behavior. Use weighted when: samples have different importance or costs.
Standard regression trees provide point predictions, but real-world decisions often require understanding prediction uncertainty. Leaves naturally enable uncertainty quantification through their sample distribution.
Types of uncertainty:
Epistemic uncertainty (model uncertainty): How sure are we about the mean? Measured by standard error, depends on $n_m$.
Aleatoric uncertainty (data uncertainty): How much do individual values vary? Measured by within-leaf variance, irreducible given features.
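The distinction is easy to see numerically: as a leaf accumulates more samples, the standard error of its mean shrinks, while the within-leaf spread does not. A minimal sketch (assuming NumPy; the simulated leaf targets are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated targets for one leaf region: true mean 10, true std 2 (illustrative)
for n in [10, 100, 1000]:
    y_leaf = rng.normal(10, 2, n)
    std = y_leaf.std(ddof=1)   # aleatoric: spread of individual values
    se = std / np.sqrt(n)      # epistemic: uncertainty about the leaf mean
    print(f"n = {n:5d}   within-leaf std = {std:.3f}   standard error of mean = {se:.3f}")

# The std stays near 2 regardless of n; the standard error shrinks like 1/sqrt(n).
```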
Confidence vs. prediction intervals:
Confidence interval for $\mu_m$: Where the true population mean likely lies $$\bar{y}_m \pm t_{\alpha/2,\, n_m-1} \cdot \frac{s_m}{\sqrt{n_m}}$$
Prediction interval for $y_{\text{new}}$: Where a new observation likely falls $$\bar{y}_m \pm t_{\alpha/2,\, n_m-1} \cdot s_m \sqrt{1 + \frac{1}{n_m}}$$
The prediction interval is always wider because it includes individual variability.
```python
import numpy as np
from scipy.stats import t as t_dist
from typing import Dict


class UncertaintyAwarePredictor:
    """
    Regression tree predictor with uncertainty quantification.

    Provides point predictions plus confidence/prediction intervals.
    """

    def __init__(self, tree, X_train, y_train):
        """
        Initialize with trained tree and training data for leaf statistics.
        """
        self.tree = tree
        self.leaf_ids = tree.apply(X_train)
        self.leaf_stats = self._compute_leaf_stats(X_train, y_train)

    def _compute_leaf_stats(self, X, y) -> Dict[int, Dict]:
        """Compute statistics for each leaf."""
        stats_dict = {}

        for leaf_id in np.unique(self.leaf_ids):
            mask = self.leaf_ids == leaf_id
            y_leaf = y[mask]
            n = len(y_leaf)

            stats_dict[leaf_id] = {
                'mean': np.mean(y_leaf),
                'std': np.std(y_leaf, ddof=1) if n > 1 else 0.0,
                'n': n,
                'quantiles': np.percentile(y_leaf, [5, 25, 50, 75, 95]) if n > 1 else [y_leaf[0]] * 5
            }

        return stats_dict

    def predict_with_uncertainty(self, X) -> Dict[str, np.ndarray]:
        """
        Predict with full uncertainty information.

        Returns:
        --------
        dict with keys:
        - 'prediction': point predictions
        - 'std': standard deviation in each leaf
        - 'n_samples': samples in each leaf
        - 'ci_lower', 'ci_upper': 95% confidence interval for mean
        - 'pi_lower', 'pi_upper': 95% prediction interval for new obs
        """
        X = np.atleast_2d(X)
        n_samples = X.shape[0]
        leaf_ids = self.tree.apply(X)

        predictions = np.zeros(n_samples)
        stds = np.zeros(n_samples)
        ns = np.zeros(n_samples, dtype=int)
        ci_lower = np.zeros(n_samples)
        ci_upper = np.zeros(n_samples)
        pi_lower = np.zeros(n_samples)
        pi_upper = np.zeros(n_samples)

        for i, leaf_id in enumerate(leaf_ids):
            leaf_stat = self.leaf_stats[leaf_id]
            predictions[i] = leaf_stat['mean']
            stds[i] = leaf_stat['std']
            ns[i] = leaf_stat['n']

            n = leaf_stat['n']
            std = leaf_stat['std']
            mean = leaf_stat['mean']

            if n > 1:
                t_crit = t_dist.ppf(0.975, df=n - 1)

                # Confidence interval for mean
                se = std / np.sqrt(n)
                ci_lower[i] = mean - t_crit * se
                ci_upper[i] = mean + t_crit * se

                # Prediction interval for new observation
                pred_se = std * np.sqrt(1 + 1/n)
                pi_lower[i] = mean - t_crit * pred_se
                pi_upper[i] = mean + t_crit * pred_se
            else:
                # Single sample - no uncertainty estimate possible
                ci_lower[i] = ci_upper[i] = mean
                pi_lower[i] = pi_upper[i] = mean

        return {
            'prediction': predictions,
            'std': stds,
            'n_samples': ns,
            'ci_lower': ci_lower,
            'ci_upper': ci_upper,
            'pi_lower': pi_lower,
            'pi_upper': pi_upper
        }

    def predict_quantiles(self, X, quantiles=[0.05, 0.5, 0.95]) -> np.ndarray:
        """
        Predict multiple quantiles based on leaf distributions.
        """
        X = np.atleast_2d(X)
        n_samples = X.shape[0]
        n_quantiles = len(quantiles)
        leaf_ids = self.tree.apply(X)

        results = np.zeros((n_samples, n_quantiles))
        stored = np.array([5, 25, 50, 75, 95])  # percentiles stored per leaf

        for i, leaf_id in enumerate(leaf_ids):
            leaf_quantiles = self.leaf_stats[leaf_id]['quantiles']
            # Map each requested quantile to the nearest stored percentile
            # (simplified - a full implementation would interpolate)
            for j, q in enumerate(quantiles):
                q_idx = int(np.argmin(np.abs(stored - 100 * q)))
                results[i, j] = leaf_quantiles[q_idx]

        return results


def calibration_analysis(y_true, predictions, intervals):
    """
    Check if prediction intervals are well-calibrated.

    For 95% intervals, ~95% of true values should fall within.
    """
    lower = intervals['pi_lower']
    upper = intervals['pi_upper']

    coverage = np.mean((y_true >= lower) & (y_true <= upper))
    avg_width = np.mean(upper - lower)

    print("Prediction Interval Calibration:")
    print("  Expected coverage: 95%")
    print(f"  Actual coverage:   {coverage*100:.1f}%")
    print(f"  Average interval width: {avg_width:.4f}")

    return coverage, avg_width
```

These intervals assume: (1) targets in each leaf follow an approximately normal distribution, (2) samples are independent, (3) the tree structure is fixed (ignoring model selection uncertainty). For small leaves or non-normal distributions, bootstrap or quantile-based intervals may be more reliable.
Standard regression trees predict constants in each leaf. A natural extension is to fit linear models within each leaf region—creating what are called model trees or linear regression trees.
The idea:
Instead of: $$\hat{y} = c_m \quad \text{for } \mathbf{x} \in R_m$$
We fit: $$\hat{y} = \mathbf{x}^\top \boldsymbol{\beta}_m + b_m \quad \text{for } \mathbf{x} \in R_m$$
Benefits: captures piecewise linear structure with far fewer leaves, produces smoother predictions than constant leaves, and yields an interpretable linear equation per region, at the cost of more parameters per leaf.
The M5 algorithm (Quinlan, 1992) is the classic model tree method: it grows the tree using standard deviation reduction as the split criterion, fits a linear model at each node, prunes subtrees whose linear model performs comparably, and smooths predictions by blending each leaf's model with those of its ancestors.
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import Ridge
from typing import Dict, Any


class ModelTree:
    """
    Regression tree with linear models in leaves.

    Combines the partitioning capability of trees with
    the smoothness of linear regression.
    """

    def __init__(self, max_depth=5, min_samples_leaf=10, ridge_alpha=1.0):
        """
        Parameters:
        -----------
        max_depth : int
            Maximum depth of the partitioning tree
        min_samples_leaf : int
            Minimum samples to fit linear model (needs >= p+1)
        ridge_alpha : float
            Regularization for leaf linear models
        """
        self.max_depth = max_depth
        self.min_samples_leaf = min_samples_leaf
        self.ridge_alpha = ridge_alpha
        self.tree = None
        self.leaf_models = {}

    def fit(self, X, y):
        """
        Fit model tree: partition then fit linear models.
        """
        X = np.atleast_2d(X)

        # Step 1: Fit partitioning tree
        self.tree = DecisionTreeRegressor(
            max_depth=self.max_depth,
            min_samples_leaf=max(self.min_samples_leaf, X.shape[1] + 2)
        )
        self.tree.fit(X, y)

        # Step 2: Fit linear model in each leaf
        leaf_ids = self.tree.apply(X)

        for leaf_id in np.unique(leaf_ids):
            mask = leaf_ids == leaf_id
            X_leaf = X[mask]
            y_leaf = y[mask]

            # Fit regularized linear regression
            model = Ridge(alpha=self.ridge_alpha)
            model.fit(X_leaf, y_leaf)

            self.leaf_models[leaf_id] = {
                'model': model,
                'n_samples': len(y_leaf),
                'mse': np.mean((y_leaf - model.predict(X_leaf))**2)
            }

        return self

    def predict(self, X):
        """
        Predict using leaf-specific linear models.
        """
        X = np.atleast_2d(X)
        leaf_ids = self.tree.apply(X)
        predictions = np.zeros(X.shape[0])

        for leaf_id in np.unique(leaf_ids):
            mask = leaf_ids == leaf_id
            X_leaf = X[mask]
            model = self.leaf_models[leaf_id]['model']
            predictions[mask] = model.predict(X_leaf)

        return predictions

    def get_leaf_equations(self, feature_names=None):
        """
        Extract linear equations for each leaf (interpretability).
        """
        equations = {}

        for leaf_id, info in self.leaf_models.items():
            model = info['model']
            coef = model.coef_
            intercept = model.intercept_

            if feature_names is None:
                feature_names = [f'x{i}' for i in range(len(coef))]

            terms = [f"{intercept:.4f}"]
            for name, c in zip(feature_names, coef):
                if abs(c) > 1e-10:
                    sign = '+' if c > 0 else '-'
                    terms.append(f"{sign} {abs(c):.4f}*{name}")

            equations[leaf_id] = ' '.join(terms)

        return equations


def compare_constant_vs_linear_leaves():
    """
    Compare standard tree (constant leaves) vs model tree (linear leaves).
    """
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error, r2_score

    np.random.seed(42)

    # Generate data with piecewise linear structure
    n = 500
    X = np.random.randn(n, 3)
    # True function: different linear functions in different regions
    y = np.where(
        X[:, 0] > 0,
        2 * X[:, 1] + 3 * X[:, 2] + 5,   # Region 1: linear
        -1 * X[:, 1] + 2 * X[:, 2] - 3   # Region 2: different linear
    ) + np.random.randn(n) * 0.5

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

    # Standard regression tree with constant leaves
    constant_tree = DecisionTreeRegressor(max_depth=5, min_samples_leaf=10)
    constant_tree.fit(X_train, y_train)
    y_pred_constant = constant_tree.predict(X_test)

    # Model tree with linear leaves
    model_tree = ModelTree(max_depth=3, min_samples_leaf=20)
    model_tree.fit(X_train, y_train)
    y_pred_linear = model_tree.predict(X_test)

    print("Constant vs Linear Leaf Comparison")
    print("=" * 50)
    print("\nConstant leaves (depth=5):")
    print(f"  Test MSE: {mean_squared_error(y_test, y_pred_constant):.4f}")
    print(f"  Test R²:  {r2_score(y_test, y_pred_constant):.4f}")
    print(f"  # Leaves: {constant_tree.get_n_leaves()}")

    print("\nLinear leaves (depth=3):")
    print(f"  Test MSE: {mean_squared_error(y_test, y_pred_linear):.4f}")
    print(f"  Test R²:  {r2_score(y_test, y_pred_linear):.4f}")
    print(f"  # Leaves: {len(model_tree.leaf_models)}")

    print("\nLeaf equations:")
    for leaf_id, eq in model_tree.get_leaf_equations(['x0', 'x1', 'x2']).items():
        print(f"  Leaf {leaf_id}: ŷ = {eq}")


compare_constant_vs_linear_leaves()
```

Model trees are particularly effective when: (1) the true function is piecewise linear, (2) you want smoother predictions than constant trees, (3) interpretable equations per region are valuable, (4) you're willing to have more parameters per leaf in exchange for fewer leaves. Libraries like M5P (Weka) and Cubist (R) implement sophisticated model trees.
Understanding leaf predictions becomes especially important in ensemble methods, where many trees contribute to the final prediction.
Random Forest perspective:
In Random Forests, each tree makes an independent prediction. The final prediction averages across trees:
$$\hat{y}_{\text{RF}} = \frac{1}{T}\sum_{t=1}^{T} \hat{y}^{(t)}$$
where $\hat{y}^{(t)}$ is tree $t$'s leaf mean. This averaging reduces variance relative to any single tree, smooths the overall prediction surface (different trees split in different places), and the spread of the per-tree predictions provides a natural uncertainty estimate.
Gradient Boosting perspective:
In Gradient Boosting, each tree predicts residuals/gradients:
$$\hat{y}_{\text{GB}} = \sum_{t=1}^{T} \eta \cdot \hat{r}^{(t)}$$
where $\eta$ is the learning rate and $\hat{r}^{(t)}$ is tree $t$'s prediction. Here, leaf values are fit to the negative gradients of the current ensemble (for squared error, the residuals), scaled by the learning rate, and summed across iterations rather than averaged.
```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor


def analyze_rf_leaf_predictions(X, y, n_trees=100):
    """
    Analyze how individual tree leaves contribute to RF prediction.
    """
    rf = RandomForestRegressor(n_estimators=n_trees, max_depth=5, random_state=42)
    rf.fit(X, y)

    # Get individual tree predictions
    individual_predictions = np.array([tree.predict(X) for tree in rf.estimators_])

    # RF prediction is the mean
    rf_prediction = rf.predict(X)

    # Prediction uncertainty from tree variance
    prediction_std = np.std(individual_predictions, axis=0)

    print("Random Forest Leaf Analysis")
    print("=" * 50)
    print(f"Number of trees: {n_trees}")
    print(f"Individual tree predictions shape: {individual_predictions.shape}")
    print("\nFor first 5 samples:")
    for i in range(min(5, len(y))):
        print(f"  Sample {i}: RF pred = {rf_prediction[i]:.3f}, "
              f"Std across trees = {prediction_std[i]:.3f}, "
              f"True = {y[i]:.3f}")

    return rf, individual_predictions, prediction_std


def analyze_gb_leaf_contributions(X, y, n_trees=100):
    """
    Analyze how boosting iterations contribute to final prediction.
    """
    gb = GradientBoostingRegressor(n_estimators=n_trees, max_depth=3,
                                   learning_rate=0.1, random_state=42)
    gb.fit(X, y)

    # Track prediction at each stage
    staged_predictions = list(gb.staged_predict(X))

    # Contribution of each tree (difference from previous stage)
    contributions = [staged_predictions[0]]  # First tree contribution
    for i in range(1, len(staged_predictions)):
        contribution = staged_predictions[i] - staged_predictions[i-1]
        contributions.append(contribution)
    contributions = np.array(contributions)

    print("\nGradient Boosting Leaf Analysis")
    print("=" * 50)
    print(f"Number of trees: {n_trees}")
    print("Learning rate: 0.1")
    print("\nFor first sample:")
    print(f"  Final prediction: {staged_predictions[-1][0]:.3f}")
    print(f"  True value: {y[0]:.3f}")
    print("  Contributions by tree (first 10):")
    for i in range(min(10, n_trees)):
        print(f"    Tree {i+1}: {contributions[i][0]:+.4f}, "
              f"Cumulative: {staged_predictions[i][0]:.3f}")

    return gb, staged_predictions, contributions


# Demo
if __name__ == "__main__":
    np.random.seed(42)
    X = np.random.randn(200, 5)
    y = 3 * X[:, 0] + 2 * X[:, 1] - X[:, 2] + np.random.randn(200) * 0.5

    analyze_rf_leaf_predictions(X, y)
    analyze_gb_leaf_contributions(X, y)
```

| Method | Leaf Predictions | Combination | Uncertainty Source |
|---|---|---|---|
| Single Tree | Mean of training samples in region | Direct output | Within-leaf variance |
| Random Forest | Mean per tree (different bootstrap) | Average across trees | Variance across trees |
| Gradient Boosting | Negative gradient (residual) | Sum with learning rate | Staged prediction path |
| Model Tree | Linear regression fit | Direct output | Regression standard errors |
Analyzing leaf characteristics provides insights into model behavior and potential issues.
Key diagnostic questions:
Leaf purity: Are leaves homogeneous? High within-leaf variance suggests more splits needed.
Leaf support: Do leaves have enough samples? Small leaves are unreliable.
Residual patterns: Are leaf residuals symmetric? Skewed residuals suggest model misspecification.
Coverage: Do leaves cover the feature space reasonably? Gaps indicate extrapolation risk.
Extreme predictions: Are some leaf predictions outliers? May indicate overfitting or unusual regions.
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


class LeafDiagnostics:
    """
    Comprehensive diagnostics for regression tree leaves.
    """

    def __init__(self, tree, X, y):
        self.tree = tree
        self.X = np.atleast_2d(X)
        self.y = np.asarray(y)
        self.leaf_ids = tree.apply(X)
        self.unique_leaves = np.unique(self.leaf_ids)
        self.n_leaves = len(self.unique_leaves)
        self._compute_diagnostics()

    def _compute_diagnostics(self):
        """Compute all diagnostic metrics."""
        self.leaf_info = {}

        for leaf_id in self.unique_leaves:
            mask = self.leaf_ids == leaf_id
            y_leaf = self.y[mask]
            n = len(y_leaf)
            prediction = np.mean(y_leaf)
            residuals = y_leaf - prediction

            self.leaf_info[leaf_id] = {
                'n_samples': n,
                'prediction': prediction,
                'variance': np.var(y_leaf),
                'std': np.std(y_leaf),
                'residuals': residuals,
                'residual_mean': np.mean(residuals),
                'residual_std': np.std(residuals),
                'min': np.min(y_leaf),
                'max': np.max(y_leaf),
                'skewness': self._safe_skewness(y_leaf),
                'fraction_data': n / len(self.y)
            }

    def _safe_skewness(self, y):
        from scipy.stats import skew
        return skew(y) if len(y) > 2 else 0.0

    def get_summary_report(self):
        """Generate summary diagnostic report."""
        n_samples_per_leaf = [info['n_samples'] for info in self.leaf_info.values()]
        variances = [info['variance'] for info in self.leaf_info.values()]
        predictions = [info['prediction'] for info in self.leaf_info.values()]

        report = f"""LEAF DIAGNOSTIC REPORT
{'=' * 60}

Overview:
  Total leaves: {self.n_leaves}
  Total samples: {len(self.y)}

Sample Distribution:
  Min samples per leaf: {min(n_samples_per_leaf)}
  Max samples per leaf: {max(n_samples_per_leaf)}
  Median samples per leaf: {np.median(n_samples_per_leaf):.0f}
  Leaves with < 10 samples: {sum(1 for n in n_samples_per_leaf if n < 10)}

Within-Leaf Variance:
  Min variance: {min(variances):.4f}
  Max variance: {max(variances):.4f}
  Mean variance: {np.mean(variances):.4f}
  Weighted mean variance: {np.average(variances, weights=n_samples_per_leaf):.4f}

Predictions:
  Min prediction: {min(predictions):.4f}
  Max prediction: {max(predictions):.4f}
  Prediction range: {max(predictions) - min(predictions):.4f}
  Target range: {self.y.max() - self.y.min():.4f}

Potential Issues:
"""

        # Check for issues
        issues = []

        # Small leaves
        small_leaves = [lid for lid, info in self.leaf_info.items()
                        if info['n_samples'] < 5]
        if small_leaves:
            issues.append(f"  ⚠ {len(small_leaves)} leaves with < 5 samples (unreliable)")

        # High variance leaves
        overall_var = np.var(self.y)
        high_var_leaves = [lid for lid, info in self.leaf_info.items()
                           if info['variance'] > 0.5 * overall_var]
        if high_var_leaves:
            issues.append(f"  ⚠ {len(high_var_leaves)} leaves with variance > 50% of total")

        # Skewed residuals
        skewed_leaves = [lid for lid, info in self.leaf_info.items()
                         if abs(info['skewness']) > 1]
        if skewed_leaves:
            issues.append(f"  ⚠ {len(skewed_leaves)} leaves with skewed residuals (|skew| > 1)")

        if not issues:
            issues.append("  ✓ No major issues detected")

        report += '\n'.join(issues)
        return report

    def get_problem_leaves(self, min_samples=5, max_variance_ratio=0.5):
        """Identify leaves that may need attention."""
        overall_var = np.var(self.y)
        problems = []

        for leaf_id, info in self.leaf_info.items():
            issues = []
            if info['n_samples'] < min_samples:
                issues.append(f"small sample (n={info['n_samples']})")
            if info['variance'] > max_variance_ratio * overall_var:
                issues.append(f"high variance ({info['variance']:.3f})")
            if abs(info['skewness']) > 1:
                issues.append(f"skewed (skew={info['skewness']:.2f})")

            if issues:
                problems.append({
                    'leaf_id': leaf_id,
                    'issues': issues,
                    'info': info
                })

        return problems


# Usage example
if __name__ == "__main__":
    np.random.seed(42)
    X = np.random.randn(200, 3)
    y = 2 * X[:, 0] + X[:, 1]**2 + np.random.randn(200) * 0.5

    tree = DecisionTreeRegressor(max_depth=4, min_samples_leaf=5)
    tree.fit(X, y)

    diagnostics = LeafDiagnostics(tree, X, y)
    print(diagnostics.get_summary_report())

    problems = diagnostics.get_problem_leaves()
    if problems:
        print("\nProblem leaves:")
        for p in problems[:5]:
            print(f"  Leaf {p['leaf_id']}: {', '.join(p['issues'])}")
```

We've explored the crucial topic of leaf predictions in regression trees—where the mathematical machinery of splitting translates into actual predictions.
What's next:
With leaf predictions understood, the next page examines piecewise constant approximation—the mathematical framework that explains how regression trees approximate continuous functions through constant pieces, including approximation theory, convergence properties, and the relationship to basis functions.
You now have a comprehensive understanding of regression tree leaf predictions—from the optimality of the mean through uncertainty quantification to advanced techniques like model trees. This knowledge is essential for both interpreting tree behavior and extending trees for specialized applications.