Every regression tree prediction function is fundamentally simple: it divides the feature space into rectangular regions and outputs a constant value within each region. This piecewise constant approximation is one of the most basic forms of function approximation—yet it is remarkably powerful.
Understanding regression trees through the lens of approximation theory reveals deep insights: why trees can approximate any continuous function arbitrarily well, how approximation error relates to tree complexity, what makes some functions easier to approximate than others, and how trees compare to other approximation methods like polynomials, splines, and neural networks.
By the end of this page, you will understand: the mathematical formulation of piecewise constant functions, universal approximation properties of tree functions, error bounds and convergence rates, the relationship between tree depth and approximation power, connections to histogram and step function approximation, and the fundamental trade-offs that govern tree-based regression.
A regression tree defines a piecewise constant function over the feature space. Let's establish the precise mathematical framework.
Definition: Tree Partition
A binary regression tree with $M$ leaves defines a partition of the feature space $\mathcal{X} \subseteq \mathbb{R}^p$ into $M$ disjoint regions:
$$\mathcal{X} = R_1 \cup R_2 \cup \cdots \cup R_M, \quad R_i \cap R_j = \emptyset \text{ for } i \neq j$$
Definition: Tree Function
The prediction function $\hat{f}: \mathcal{X} \to \mathbb{R}$ is:
$$\hat{f}(\mathbf{x}) = \sum_{m=1}^{M} c_m \cdot \mathbb{1}_{R_m}(\mathbf{x})$$
where $c_m \in \mathbb{R}$ is the constant value predicted in region $R_m$, and $\mathbb{1}_{R_m}(\mathbf{x})$ is the indicator function that equals 1 if $\mathbf{x} \in R_m$ and 0 otherwise.
Key Properties:

- $\hat{f}$ is constant within each region and can change only by jumping at region boundaries.
- Because the regions partition $\mathcal{X}$, exactly one indicator is active for any input, so the prediction is always one of the $M$ leaf values.
- The function is fully specified by the $M$ regions and the $M$ constants $c_1, \ldots, c_M$.
- Wherever the gradient exists it is zero, so $\hat{f}$ carries no local slope information.
```python
import numpy as np
from typing import List, Tuple, Callable

class PiecewiseConstantFunction:
    """
    Mathematical representation of a piecewise constant function.

    This class represents the output of a regression tree as a formal
    piecewise constant function, enabling analysis and manipulation
    independent of the tree structure.
    """

    def __init__(self, regions: List[dict], values: List[float]):
        """
        Initialize with regions and constant values.

        Parameters:
        -----------
        regions : list of dict
            Each dict has 'bounds': {feature_idx: (lower, upper)}
        values : list of float
            Constant value for each region
        """
        self.regions = regions
        self.values = values
        self.n_regions = len(regions)

    def __call__(self, x: np.ndarray) -> float:
        """Evaluate function at point x."""
        for region, value in zip(self.regions, self.values):
            if self._in_region(x, region):
                return value
        raise ValueError(f"Point {x} not in any region")

    def _in_region(self, x: np.ndarray, region: dict) -> bool:
        """Check if point x is in the given region."""
        for feat_idx, (lower, upper) in region['bounds'].items():
            if not (lower <= x[feat_idx] <= upper):
                return False
        return True

    def evaluate_batch(self, X: np.ndarray) -> np.ndarray:
        """Evaluate on multiple points."""
        return np.array([self(x) for x in X])

    def get_discontinuity_points(self, axis: int) -> List[float]:
        """
        Get points where function is discontinuous along an axis.
        """
        boundaries = set()
        for region in self.regions:
            bounds = region['bounds'].get(axis, (-np.inf, np.inf))
            boundaries.add(bounds[0])
            boundaries.add(bounds[1])
        boundaries.discard(-np.inf)
        boundaries.discard(np.inf)
        return sorted(boundaries)

    def integral(self, domain_bounds: dict) -> float:
        """
        Compute integral of function over domain.

        ∫_D f(x) dx = Σ_m c_m · Volume(R_m ∩ D)
        """
        total = 0.0
        for region, value in zip(self.regions, self.values):
            volume = self._region_volume(region, domain_bounds)
            total += value * volume
        return total

    def _region_volume(self, region: dict, domain: dict) -> float:
        """Compute volume of region intersected with domain."""
        volume = 1.0
        for feat_idx in domain:
            d_lower, d_upper = domain[feat_idx]
            r_lower, r_upper = region['bounds'].get(feat_idx, (-np.inf, np.inf))
            lower = max(d_lower, r_lower)
            upper = min(d_upper, r_upper)
            if upper <= lower:
                return 0.0  # No intersection
            volume *= (upper - lower)
        return volume

    @staticmethod
    def from_regression_tree(tree, feature_names=None):
        """
        Extract piecewise constant function from sklearn tree.
        """
        from sklearn.tree import DecisionTreeRegressor

        # Extract tree structure
        n_nodes = tree.tree_.node_count
        children_left = tree.tree_.children_left
        children_right = tree.tree_.children_right
        feature = tree.tree_.feature
        threshold = tree.tree_.threshold
        values = tree.tree_.value.flatten()
        n_features = tree.n_features_in_

        regions = []
        region_values = []

        def traverse(node_id, bounds):
            # Leaf node
            if children_left[node_id] == children_right[node_id]:
                regions.append({'bounds': bounds.copy()})
                region_values.append(values[node_id])
                return

            feat = feature[node_id]
            thresh = threshold[node_id]

            # Left child: feature <= threshold
            left_bounds = bounds.copy()
            if feat in left_bounds:
                left_bounds[feat] = (left_bounds[feat][0],
                                     min(left_bounds[feat][1], thresh))
            else:
                left_bounds[feat] = (-np.inf, thresh)
            traverse(children_left[node_id], left_bounds)

            # Right child: feature > threshold
            right_bounds = bounds.copy()
            if feat in right_bounds:
                right_bounds[feat] = (max(right_bounds[feat][0], thresh),
                                      right_bounds[feat][1])
            else:
                right_bounds[feat] = (thresh, np.inf)
            traverse(children_right[node_id], right_bounds)

        traverse(0, {})

        return PiecewiseConstantFunction(regions, region_values)


def demonstrate_piecewise_constant():
    """
    Demonstrate piecewise constant function representation.
    """
    from sklearn.tree import DecisionTreeRegressor

    # Create simple 1D data
    np.random.seed(42)
    X = np.linspace(0, 10, 100).reshape(-1, 1)
    y = np.sin(X.ravel()) + np.random.randn(100) * 0.1

    # Fit tree
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, y)

    # Extract as piecewise constant function
    pcf = PiecewiseConstantFunction.from_regression_tree(tree)

    print("Piecewise Constant Function from Tree (depth=3)")
    print("=" * 50)
    print(f"Number of regions (leaves): {pcf.n_regions}")
    print(f"\nRegions and values:")
    for i, (region, value) in enumerate(zip(pcf.regions, pcf.values)):
        bounds = region['bounds'].get(0, (-np.inf, np.inf))
        print(f"  Region {i+1}: [{bounds[0]:.3f}, {bounds[1]:.3f}] → {value:.4f}")

    # Verify matches tree prediction
    y_tree = tree.predict(X)
    y_pcf = pcf.evaluate_batch(X)
    print(f"\nPredictions match: {np.allclose(y_tree, y_pcf)}")

demonstrate_piecewise_constant()
```

A fundamental question in approximation theory: can piecewise constant functions approximate any continuous function? The answer is yes, with sufficient regions.
Theorem (Universal Approximation for Piecewise Constants):
Let $f: [a,b]^p \to \mathbb{R}$ be continuous on a compact domain. For any $\epsilon > 0$, there exists a piecewise constant function $\hat{f}$ with finitely many rectangular regions such that:
$$\sup_{\mathbf{x} \in [a,b]^p} |f(\mathbf{x}) - \hat{f}(\mathbf{x})| < \epsilon$$
Proof sketch: $f$ is uniformly continuous on the compact domain, so there is a $\delta > 0$ with $|f(\mathbf{x}) - f(\mathbf{y})| < \epsilon$ whenever $\|\mathbf{x} - \mathbf{y}\| < \delta$. Partition $[a,b]^p$ into finitely many axis-aligned rectangular cells of diameter less than $\delta$ and define $\hat{f}$ on each cell as the value of $f$ at the cell's center; the error is then below $\epsilon$ everywhere. Because the cells are rectangles, a sufficiently deep tree with axis-aligned splits can realize this partition.
Implication for regression trees:
This theorem establishes that regression trees are universal approximators—they can approximate any continuous function to arbitrary precision. The cost is the number of regions (tree complexity) needed for a given accuracy.
Neural networks are also universal approximators (via the Universal Approximation Theorem for MLPs). The key differences: (1) Trees achieve approximation through spatial partition; NNs through weighted combinations of nonlinear activations. (2) Trees are piecewise constant (discontinuous); NNs can be smooth. (3) Trees have axis-aligned discontinuities; NNs can have arbitrarily oriented decision boundaries.
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def demonstrate_universal_approximation():
    """
    Demonstrate that trees can approximate any function
    arbitrarily well given sufficient depth.
    """
    np.random.seed(42)

    # True function: combination of nonlinearities
    def true_function(x):
        return np.sin(3 * x) * np.exp(-0.3 * x) + 0.5 * np.cos(5 * x)

    # Dense sample for evaluation
    X_dense = np.linspace(0, 10, 1000).reshape(-1, 1)
    y_true = true_function(X_dense.ravel())

    # Training data (noisy samples)
    n_train = 200
    X_train = np.random.uniform(0, 10, n_train).reshape(-1, 1)
    y_train = true_function(X_train.ravel()) + np.random.randn(n_train) * 0.1

    print("Universal Approximation Demonstration")
    print("=" * 60)
    print("Target: f(x) = sin(3x)·exp(-0.3x) + 0.5·cos(5x)")
    print(f"Training samples: {n_train}")
    print()

    # Increase depth and track approximation quality
    depths = [1, 2, 3, 5, 7, 10, 15, 20]

    print(f"{'Depth':>6} {'Leaves':>8} {'Train MSE':>12} {'Test MSE':>12} {'Max Error':>12}")
    print("-" * 55)

    for depth in depths:
        tree = DecisionTreeRegressor(max_depth=depth, min_samples_leaf=1)
        tree.fit(X_train, y_train)

        # Predictions on dense grid (approximating test)
        y_pred = tree.predict(X_dense)

        # Metrics
        n_leaves = tree.get_n_leaves()
        train_mse = np.mean((tree.predict(X_train) - y_train)**2)
        test_mse = np.mean((y_pred - y_true)**2)
        max_error = np.max(np.abs(y_pred - y_true))

        print(f"{depth:>6} {n_leaves:>8} {train_mse:>12.6f} {test_mse:>12.6f} {max_error:>12.6f}")

    print("\nObservations:")
    print("- Error decreases with depth (more regions)")
    print("- Eventually overfits if data is noisy")
    print("- With clean data, error → 0 as depth → ∞")

    return depths


def approximation_regions_vs_error():
    """
    Analyze how number of regions relates to approximation error.
    """
    np.random.seed(42)

    # Clean 1D function (no noise)
    def f(x):
        return np.sin(2 * np.pi * x)

    X = np.linspace(0, 1, 1000).reshape(-1, 1)
    y = f(X.ravel())

    print("\nRegions vs Error (Clean Data)")
    print("=" * 50)
    print(f"{'Leaves':>8} {'L∞ Error':>12} {'L2 Error':>12} {'Bits/region':>12}")
    print("-" * 48)

    for n_leaves in [2, 4, 8, 16, 32, 64, 128]:
        # Use max_leaf_nodes to control exactly
        tree = DecisionTreeRegressor(max_leaf_nodes=n_leaves)
        tree.fit(X, y)
        y_pred = tree.predict(X)

        l_inf = np.max(np.abs(y - y_pred))
        l_2 = np.sqrt(np.mean((y - y_pred)**2))

        # Theoretical bit rate: log2(n_leaves) bits to specify region
        bits = np.log2(n_leaves)

        print(f"{n_leaves:>8} {l_inf:>12.6f} {l_2:>12.6f} {bits:>12.2f}")

    print("\nNote: Error ≈ O(1/leaves) for smooth functions")

demonstrate_universal_approximation()
approximation_regions_vs_error()
```

How quickly does approximation error decrease as we add more regions? The answer depends on the smoothness of the target function and the dimension of the feature space.
Error bounds for Lipschitz functions:
Let $f$ be Lipschitz continuous with constant $L$: $$|f(\mathbf{x}) - f(\mathbf{y})| \leq L \, \|\mathbf{x} - \mathbf{y}\|$$

For a partition into $M$ hypercubes of side length $h$: $$\|f - \hat{f}\|_\infty \leq L \cdot \sqrt{p} \cdot h$$

Since $M \propto h^{-p}$, we have $h \propto M^{-1/p}$, giving: $$\|f - \hat{f}\|_\infty = O(M^{-1/p})$$
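As a quick sanity check on this bound, the sketch below (the target $f(x_1, x_2) = \sin x_1 + \cos x_2$ with $L = \sqrt{2}$ and the uniform cell-center construction are illustrative choices, not from the text above) compares the observed sup-norm error of a uniform-grid piecewise constant approximation against $L \sqrt{p}\, h$:

```python
import numpy as np

def grid_bound_check(h: float, n_eval: int = 200) -> None:
    """Compare the observed sup-error of a uniform-grid piecewise constant
    approximation on [0, 1]^2 with the Lipschitz bound L * sqrt(p) * h."""
    p = 2
    L = np.sqrt(2.0)  # gradient norm of f below is at most sqrt(2)

    def f(x1, x2):
        return np.sin(x1) + np.cos(x2)

    # Dense evaluation grid on [0, 1]^2
    xs = np.linspace(0, 1, n_eval)
    X1, X2 = np.meshgrid(xs, xs)

    # Piecewise constant approximation: value of f at the center of each h-cell
    centers1 = (np.floor(X1 / h) + 0.5) * h
    centers2 = (np.floor(X2 / h) + 0.5) * h
    f_hat = f(centers1, centers2)

    sup_error = np.max(np.abs(f(X1, X2) - f_hat))
    bound = L * np.sqrt(p) * h
    print(f"h = {h:.4f}: sup error = {sup_error:.4f}, bound L·sqrt(p)·h = {bound:.4f}")

for h in [0.5, 0.25, 0.125, 0.0625]:
    grid_bound_check(h)
```

Halving $h$ halves both the observed error and the bound, while the number of cells grows by $2^p$.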
The curse of dimensionality:
The $M^{-1/p}$ rate reveals a fundamental limitation:
To halve the error in $p$ dimensions, we need roughly $2^p$ times more regions!
| Dimension p | Error Rate | Regions to halve error | Practical implication |
|---|---|---|---|
| 1 | O(1/M) | 2× | Excellent convergence |
| 2 | O(M^{-0.5}) | 4× | Good convergence |
| 5 | O(M^{-0.2}) | 32× | Moderate slowdown |
| 10 | O(M^{-0.1}) | 1024× | Significant challenges |
| 20 | O(M^{-0.05}) | 1M× | Practically infeasible |
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def curse_of_dimensionality_demo():
    """
    Demonstrate how approximation degrades with dimension.
    """
    np.random.seed(42)

    print("Curse of Dimensionality in Tree Approximation")
    print("=" * 60)
    print("Target: sum of sin functions (same complexity per dimension)")
    print()

    dimensions = [1, 2, 3, 5, 10, 20]
    n_samples = 5000   # Fixed sample size
    max_leaves = 256   # Fixed model complexity

    results = []

    print(f"{'Dim':>5} {'#Samples':>10} {'#Leaves':>10} {'Train MSE':>12} {'Approx Rate':>12}")
    print("-" * 52)

    for p in dimensions:
        # Generate data
        X = np.random.uniform(0, 1, (n_samples, p))
        # Target: sum of sin functions (separable)
        y = np.sum(np.sin(2 * np.pi * X), axis=1)

        # Fit tree
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaves)
        tree.fit(X, y)
        y_pred = tree.predict(X)

        train_mse = np.mean((y - y_pred)**2)

        # Theoretical rate: M^{-1/p}
        theoretical_rate = max_leaves ** (-1/p)

        # Compare to total variance
        total_var = np.var(y)

        results.append({
            'dim': p,
            'mse': train_mse,
            'rate': theoretical_rate,
            'total_var': total_var
        })

        print(f"{p:>5} {n_samples:>10} {tree.get_n_leaves():>10} {train_mse:>12.4f} {theoretical_rate:>12.4f}")

    print("\nObservations:")
    print("- Same model complexity performs worse in higher dimensions")
    print("- Error grows roughly with 1/p rate (curse of dimensionality)")
    print("- Trees can only overcome this with exponentially more leaves")

    return results


def adaptive_partition_advantage():
    """
    Show that adaptive (data-driven) partitions outperform uniform grids.
    """
    np.random.seed(42)

    # Function with localized complexity
    def f(x):
        # Smooth except near x=0.5 where it oscillates
        smooth_part = x
        complex_part = 0.3 * np.sin(20 * np.pi * x) * np.exp(-100 * (x - 0.5)**2)
        return smooth_part + complex_part

    X = np.linspace(0, 1, 1000).reshape(-1, 1)
    y = f(X.ravel())

    n_regions = 16

    # Uniform grid (histogram approximation)
    uniform_breaks = np.linspace(0, 1, n_regions + 1)
    y_uniform = np.zeros_like(y)
    for i in range(n_regions):
        mask = (X.ravel() >= uniform_breaks[i]) & (X.ravel() < uniform_breaks[i+1])
        if np.any(mask):
            y_uniform[mask] = np.mean(y[mask])

    # Adaptive tree
    tree = DecisionTreeRegressor(max_leaf_nodes=n_regions)
    tree.fit(X, y)
    y_adaptive = tree.predict(X)

    uniform_mse = np.mean((y - y_uniform)**2)
    adaptive_mse = np.mean((y - y_adaptive)**2)

    print("\nUniform vs Adaptive Partitioning")
    print("=" * 50)
    print(f"Function: smooth with localized oscillation at x=0.5")
    print(f"Number of regions: {n_regions}")
    print(f"\nUniform grid MSE:  {uniform_mse:.6f}")
    print(f"Adaptive tree MSE: {adaptive_mse:.6f}")
    print(f"Improvement ratio: {uniform_mse/adaptive_mse:.2f}x")
    print("\nAdaptive trees concentrate regions where function varies most")

curse_of_dimensionality_demo()
adaptive_partition_advantage()
```

Unlike uniform grids, trees partition adaptively—placing more regions where the function varies and fewer where it's smooth. If the function has low 'intrinsic dimensionality' (only a few features matter), trees discover this and achieve better rates than worst-case bounds suggest. This is why trees work surprisingly well in high dimensions when the true function has structure.
Regression trees generalize several fundamental approximation concepts.
In 1D: Simple step functions
A 1D regression tree is exactly a step function: $$\hat{f}(x) = \sum_{m=1}^{M} c_m \cdot \mathbb{1}_{[a_{m-1}, a_m)}(x)$$
where $a_0 < a_1 < \cdots < a_M$ are the interval endpoints: $a_0$ and $a_M$ bound the domain, and the interior points $a_1, \ldots, a_{M-1}$ are the split points (thresholds).
Connection to histograms:
If we fix the partition uniformly and set $c_m = \bar{y}_m$ (bin mean), we get histogram regression. Trees optimize both the partition AND the values—making them adaptive histograms.
In multiple dimensions: Tensor products
With axis-aligned splits, every region of a tree partition is a Cartesian product of one interval per axis—mathematically, a rectangle built as a tensor product of step-function pieces: $$R_m = I_1^{(m)} \times I_2^{(m)} \times \cdots \times I_p^{(m)}$$
where each $I_j^{(m)}$ is an interval on axis $j$.
Dyadic partitions:
A special case occurs when thresholds are at midpoints of parent intervals—creating a dyadic partition. This is related to Haar wavelets and provides a natural hierarchical decomposition.
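To make the dyadic case concrete, the short sketch below (an illustrative helper, not part of the tree code on this page) averages a function over each interval $[k\,2^{-d}, (k+1)\,2^{-d})$ to build the depth-$d$ dyadic step approximation, and shows the error shrinking as the partition is refined; the differences between successive levels are spanned by Haar wavelets at the corresponding scale.

```python
import numpy as np

def dyadic_approximation(f, depth: int, n_grid: int = 4096):
    """Piecewise constant approximation of f on [0, 1) over the dyadic
    partition at the given depth: 2**depth intervals of equal width."""
    x = (np.arange(n_grid) + 0.5) / n_grid           # dense grid on [0, 1)
    y = f(x)
    cell = np.floor(x * 2**depth).astype(int)        # dyadic cell index of each point
    counts = np.bincount(cell, minlength=2**depth)
    # Constant value per cell = average of f over that cell (least-squares choice)
    cell_means = np.bincount(cell, weights=y, minlength=2**depth) / counts
    return y, cell_means[cell]

f = lambda x: np.sin(2 * np.pi * x)
for depth in range(1, 8):
    y, y_hat = dyadic_approximation(f, depth)
    print(f"depth {depth}: {2**depth:4d} cells, "
          f"L2 error = {np.sqrt(np.mean((y - y_hat)**2)):.5f}")
```

Each extra level doubles the number of cells and roughly halves the error for this smooth target, mirroring the $O(1/M)$ rate in 1D.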
```python
import numpy as np
from typing import List, Tuple

class StepFunction:
    """
    1D step function representation.

    A tree with 1 feature is exactly a step function.
    """

    def __init__(self, breakpoints: List[float], values: List[float]):
        """
        Parameters:
        -----------
        breakpoints : list
            Points where function changes value (sorted)
        values : list
            Function value in each interval (len = len(breakpoints) + 1)
        """
        self.breakpoints = np.array(sorted(breakpoints))
        self.values = np.array(values)
        assert len(values) == len(breakpoints) + 1

    def __call__(self, x: np.ndarray) -> np.ndarray:
        """Evaluate step function at points x."""
        x = np.atleast_1d(x)
        # np.searchsorted gives index of interval
        indices = np.searchsorted(self.breakpoints, x)
        return self.values[indices]

    def total_variation(self) -> float:
        """
        Compute total variation (sum of jump sizes).

        TV measures the "roughness" of the function.
        """
        jumps = np.abs(np.diff(self.values))
        return np.sum(jumps)

    def refine(self, new_breakpoints: List[float], new_values_func=None) -> 'StepFunction':
        """
        Refine step function by adding more breakpoints.
        """
        all_breaks = sorted(set(self.breakpoints.tolist() + new_breakpoints))

        if new_values_func is None:
            # Maintain existing values
            new_values = [self(b - 1e-10) for b in all_breaks] + [self(all_breaks[-1] + 1e-10)]
        else:
            # Use provided function to set values
            midpoints = [(all_breaks[i] + all_breaks[i+1])/2 for i in range(len(all_breaks)-1)]
            new_values = ([new_values_func(-np.inf)]
                          + [new_values_func(m) for m in midpoints]
                          + [new_values_func(np.inf)])

        return StepFunction(all_breaks, new_values)


def histogram_regression(X, y, n_bins=10):
    """
    Uniform histogram regression (non-adaptive baseline).
    """
    X = np.asarray(X).ravel()
    y = np.asarray(y)

    # Uniform bins
    x_min, x_max = X.min(), X.max()
    edges = np.linspace(x_min - 1e-10, x_max + 1e-10, n_bins + 1)

    # Compute mean in each bin
    bin_indices = np.digitize(X, edges) - 1
    bin_indices = np.clip(bin_indices, 0, n_bins - 1)

    bin_means = np.zeros(n_bins)
    for i in range(n_bins):
        mask = bin_indices == i
        if np.any(mask):
            bin_means[i] = np.mean(y[mask])

    # Create step function
    return StepFunction(edges[1:-1].tolist(), bin_means.tolist())


def adaptive_vs_histogram():
    """
    Compare adaptive trees to uniform histograms.
    """
    from sklearn.tree import DecisionTreeRegressor

    np.random.seed(42)

    # Function with varying complexity
    def f(x):
        return np.where(x < 0.3, 0.5,
                        np.where(x < 0.7, np.sin(10 * np.pi * x), -0.3))

    X = np.random.uniform(0, 1, 200)
    y = f(X) + np.random.randn(200) * 0.1

    X_dense = np.linspace(0, 1, 500)
    y_true = f(X_dense)

    n_regions = 10

    # Histogram
    hist_func = histogram_regression(X, y, n_bins=n_regions)
    y_hist = hist_func(X_dense)

    # Adaptive tree
    tree = DecisionTreeRegressor(max_leaf_nodes=n_regions)
    tree.fit(X.reshape(-1, 1), y)
    y_tree = tree.predict(X_dense.reshape(-1, 1))

    hist_mse = np.mean((y_true - y_hist)**2)
    tree_mse = np.mean((y_true - y_tree)**2)

    print("Histogram vs Adaptive Tree")
    print("=" * 50)
    print(f"Function: flat-oscillating-flat structure")
    print(f"Regions: {n_regions}")
    print(f"\nHistogram MSE: {hist_mse:.5f}")
    print(f"Tree MSE:      {tree_mse:.5f}")
    print(f"\nTree advantage: {hist_mse/tree_mse:.2f}x better")

    # Also show where tree put its splits
    print("\nTree split points (concentrated in complex region):")
    structure = tree.tree_
    for i in range(structure.node_count):
        if structure.children_left[i] != structure.children_right[i]:
            # Internal node
            print(f"  x <= {structure.threshold[i]:.3f}")

adaptive_vs_histogram()
```

Tree predictions can be viewed as expansions in a basis of indicator functions—connecting trees to the broader theory of basis expansions in function approximation.
The tree basis:
Each leaf defines a basis function: $$\phi_m(\mathbf{x}) = \mathbb{1}_{R_m}(\mathbf{x})$$
The tree function is: $$\hat{f}(\mathbf{x}) = \sum_{m=1}^{M} c_m \phi_m(\mathbf{x})$$
This is a linear model in a nonlinear feature space—the features are the indicator functions for each region.
Properties of the tree basis:

- Disjoint supports: the regions partition $\mathcal{X}$, so exactly one $\phi_m(\mathbf{x})$ equals 1 for any input, and the basis functions sum to 1 (a partition of unity).
- Orthogonality: $\int \phi_i \phi_j = 0$ for $i \neq j$, because the supports do not overlap.
- Data-dependent: unlike a fixed polynomial or Fourier basis, the basis functions are learned from the training data.
- Piecewise constant: each $\phi_m$ is an indicator, so any expansion in this basis is discontinuous at region boundaries.
Comparison to other bases:
| Basis Type | Basis Functions | Locality | Smoothness |
|---|---|---|---|
| Tree indicators | 1_{R_m}(x) | Strictly local (disjoint) | Discontinuous (step) |
| Polynomials | x^k | Global | Infinitely smooth |
| B-splines | Local smooth functions | Local (overlapping) | Smooth (tunable degree) |
| Fourier | sin(kx), cos(kx) | Global | Infinitely smooth |
| RBF | exp(-||x-c||²/σ²) | Local (overlapping) | Infinitely smooth |
| ReLU | max(0, w·x + b) | Piecewise | Continuous, not smooth |
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def tree_to_feature_matrix(tree, X):
    """
    Convert tree predictions to linear model in indicator basis.

    Each leaf becomes a binary feature: 1 if sample reaches that leaf.
    Tree prediction = φ(X) @ coefficients
    """
    leaf_ids = tree.apply(X)
    unique_leaves = np.unique(leaf_ids)

    # Create binary feature matrix
    n_samples = X.shape[0]
    n_leaves = len(unique_leaves)

    # Mapping from leaf_id to feature index
    leaf_to_idx = {leaf: idx for idx, leaf in enumerate(unique_leaves)}

    # Feature matrix: Φ[i,j] = 1 if sample i goes to leaf j
    Phi = np.zeros((n_samples, n_leaves))
    for i, leaf_id in enumerate(leaf_ids):
        Phi[i, leaf_to_idx[leaf_id]] = 1

    # Coefficients are leaf predictions
    coefficients = np.array([tree.tree_.value[leaf].item() for leaf in unique_leaves])

    return Phi, coefficients, leaf_to_idx


def demonstrate_basis_interpretation():
    """
    Show that tree = linear model in indicator feature space.
    """
    np.random.seed(42)

    X = np.random.randn(100, 3)
    y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + np.random.randn(100) * 0.1

    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, y)

    # Get basis representation
    Phi, coefficients, _ = tree_to_feature_matrix(tree, X)

    print("Tree as Linear Model in Indicator Basis")
    print("=" * 55)
    print(f"\nOriginal feature dimension: {X.shape[1]}")
    print(f"Indicator basis dimension: {Phi.shape[1]} (number of leaves)")
    print(f"\nFeature matrix shape: {Phi.shape}")
    print(f"Coefficients shape: {coefficients.shape}")

    # Verify: Phi @ coefficients == tree.predict(X)
    y_from_basis = Phi @ coefficients
    y_from_tree = tree.predict(X)

    print(f"\nPrediction from basis: {y_from_basis[:5]}")
    print(f"Prediction from tree:  {y_from_tree[:5]}")
    print(f"Match: {np.allclose(y_from_basis, y_from_tree)}")

    # Properties of the basis
    print(f"\nBasis properties:")
    print(f"  Sum of each row: {Phi.sum(axis=1)[:5]} (partition of unity)")
    print(f"  Φᵀ @ Φ diagonal: {np.diag(Phi.T @ Phi)} (orthogonal, counts)")

    # This representation can be used in other models!
    from sklearn.linear_model import Ridge
    ridge = Ridge(alpha=0.1)
    ridge.fit(Phi, y)

    print(f"\nRidge on tree basis vs tree coefficients:")
    print(f"  Tree coefficients:  {coefficients[:5]}")
    print(f"  Ridge coefficients: {ridge.coef_[:5]}")


def hierarchical_basis():
    """
    Show hierarchical basis from tree structure.
    """
    print("\nHierarchical Tree Basis")
    print("=" * 50)
    print("""
    Tree structure defines a hierarchy of increasingly refined bases:

    Depth 0: φ₀(x) = 1                      (constant, single region)
    Depth 1: φ₁(x) = 1[x₀ ≤ t₁]             (left of first split)
             φ₂(x) = 1[x₀ > t₁]             (right of first split)
    Depth 2: φ₃(x) = 1[x₀ ≤ t₁ ∧ x₁ ≤ t₂]   (refined partitions)
    ...

    This is related to:
    - Haar wavelet decomposition (dyadic case)
    - Multiresolution analysis
    - Hierarchical models

    Key insight: Deeper = more basis functions = more flexibility
    """)

demonstrate_basis_interpretation()
hierarchical_basis()
```

A fundamental limitation of piecewise constant approximation is the presence of discontinuities at region boundaries. These have important practical implications.
The nature of discontinuities:
At each split boundary, the tree function jumps from one constant to another: the one-sided limits disagree, $$\lim_{\epsilon \to 0^+} \hat{f}(x - \epsilon) \neq \lim_{\epsilon \to 0^+} \hat{f}(x + \epsilon)$$
for $x$ on a split boundary.
Practical implications:
- Non-smooth predictions: small changes in input can cause large changes in output at boundaries.
- Gradient issues: $\nabla \hat{f}(\mathbf{x}) = \mathbf{0}$ everywhere except at boundaries, where it is undefined. This makes trees unusable as differentiable components in gradient-based optimization.
- Interpolation artifacts: for a smooth true function, the jagged tree approximation can look visually jarring.
- Extrapolation behavior: outside the range of the training data, a tree returns the constant value of whichever boundary region the query falls into—there is no trend extrapolation (see the sketch after this list).
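The flat extrapolation behavior is easy to verify directly. A minimal sketch (the linear-trend setup below is illustrative, not part of the demos on this page): fit a tree to $y = 2x$ on $[0, 1]$ and query it well outside that range, with an ordinary linear model for contrast.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Training data: a simple linear trend on [0, 1]
X_train = rng.uniform(0, 1, 200).reshape(-1, 1)
y_train = 2 * X_train.ravel() + rng.normal(0, 0.05, 200)

tree = DecisionTreeRegressor(max_depth=4).fit(X_train, y_train)
linear = LinearRegression().fit(X_train, y_train)

# Query points far outside the training range
X_out = np.array([[1.5], [2.0], [5.0]])
print("   x   tree prediction   linear prediction   true 2x")
for x, t_pred, l_pred in zip(X_out.ravel(), tree.predict(X_out), linear.predict(X_out)):
    print(f"{x:4.1f}  {t_pred:16.3f}  {l_pred:18.3f}  {2*x:8.1f}")

# The tree keeps repeating the constant of its right-most leaf (close to the
# training values near x = 1), while the linear model continues the trend.
```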
Mitigations: ensembles of trees (bagging, random forests, boosting) average many different partitions, which smooths the prediction surface, and "soft" trees replace hard splits with smooth gates to restore differentiability—both are illustrated below.
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

def analyze_discontinuities():
    """
    Analyze discontinuity behavior of tree predictions.
    """
    np.random.seed(42)

    # Smooth true function
    def f(x):
        return np.sin(2 * np.pi * x)

    X_train = np.random.uniform(0, 1, 50).reshape(-1, 1)
    y_train = f(X_train.ravel()) + np.random.randn(50) * 0.1

    # Fine grid for analysis
    X_fine = np.linspace(0, 1, 1000).reshape(-1, 1)
    y_true = f(X_fine.ravel())

    # Single tree
    tree = DecisionTreeRegressor(max_depth=4)
    tree.fit(X_train, y_train)
    y_tree = tree.predict(X_fine)

    # Find discontinuity locations (where prediction jumps)
    prediction_diffs = np.abs(np.diff(y_tree))
    jump_indices = np.where(prediction_diffs > 0.01)[0]
    jump_locations = X_fine.ravel()[jump_indices]
    jump_sizes = prediction_diffs[jump_indices]

    print("Discontinuity Analysis")
    print("=" * 50)
    print(f"Number of discontinuities: {len(jump_indices)}")
    print(f"\nJump locations and sizes:")
    for loc, size in zip(jump_locations[:5], jump_sizes[:5]):
        print(f"  x = {loc:.4f}: jump size = {size:.4f}")

    print(f"\nMax jump size: {max(jump_sizes):.4f}")
    print(f"Total variation: {sum(jump_sizes):.4f}")

    # Total variation of true function
    true_variation = np.sum(np.abs(np.diff(y_true)))
    print(f"True function variation: {true_variation:.4f}")

    return jump_locations, jump_sizes


def ensemble_smoothing_effect():
    """
    Show how ensembles smooth out discontinuities.
    """
    np.random.seed(42)

    def f(x):
        return np.sin(2 * np.pi * x)

    X_train = np.random.uniform(0, 1, 100).reshape(-1, 1)
    y_train = f(X_train.ravel()) + np.random.randn(100) * 0.1

    X_fine = np.linspace(0, 1, 1000).reshape(-1, 1)

    # Single tree
    tree = DecisionTreeRegressor(max_depth=4)
    tree.fit(X_train, y_train)
    y_tree = tree.predict(X_fine)

    # Random Forest (ensemble of trees)
    rf = RandomForestRegressor(n_estimators=100, max_depth=4, random_state=42)
    rf.fit(X_train, y_train)
    y_rf = rf.predict(X_fine)

    # Measure "roughness" via total variation
    tv_tree = np.sum(np.abs(np.diff(y_tree)))
    tv_rf = np.sum(np.abs(np.diff(y_rf)))

    print("\nEnsemble Smoothing Effect")
    print("=" * 50)
    print(f"Total variation (roughness measure):")
    print(f"  Single tree (depth 4):     {tv_tree:.4f}")
    print(f"  Random Forest (100 trees): {tv_rf:.4f}")
    print(f"  Smoothing factor: {tv_tree/tv_rf:.2f}x")

    # Also compare max gradient approximation
    grad_tree = np.max(np.abs(np.diff(y_tree) / np.diff(X_fine.ravel())))
    grad_rf = np.max(np.abs(np.diff(y_rf) / np.diff(X_fine.ravel())))

    print(f"\nMax gradient (smoothness measure):")
    print(f"  Single tree:   {grad_tree:.2f}")
    print(f"  Random Forest: {grad_rf:.2f}")


def soft_tree_concept():
    """
    Illustrate the concept of soft (differentiable) trees.
    """
    print("\nSoft Trees: Making Trees Differentiable")
    print("=" * 50)
    print("""
    Standard tree split:
        f(x) = c_L if x ≤ t else c_R
        Uses step function (discontinuous)

    Soft tree split:
        f(x) = σ((t-x)/τ) * c_L + (1-σ((t-x)/τ)) * c_R
        Uses sigmoid σ(z) = 1/(1+exp(-z))
        Temperature τ controls sharpness

    As τ → 0: soft tree → hard tree
    Larger τ: smoother transitions, differentiable everywhere

    Applications:
    - Neural networks with tree-like structure
    - Differentiable feature learning
    - End-to-end training with gradient descent

    Libraries: PyTorch's soft decision trees, Neural Oblivious Decision Trees
    """)

analyze_discontinuities()
ensemble_smoothing_effect()
soft_tree_concept()
```

Standard decision trees cannot be integrated into neural networks or other differentiable pipelines because their gradients are zero everywhere (or undefined).
This limitation has motivated research into 'soft' or 'differentiable' decision trees that use smooth approximations to step functions, enabling end-to-end gradient-based learning.
Under what conditions do regression trees converge to the true function as sample size grows? Theoretical analysis provides important guarantees.
Consistency theorem (informal):
Under mild regularity conditions—leaf diameters shrinking to zero while the number of training points per leaf grows without bound—the tree estimate is consistent: $\mathbb{E}\|\hat{f}_n - f\|_2^2 \to 0$ as $n \to \infty$.
Convergence rates:
For Lipschitz continuous functions in $p$ dimensions: $$\mathbb{E}\|\hat{f}_n - f\|_2^2 = O\left(n^{-2/(p+2)}\right)$$
This matches the minimax optimal rate for nonparametric regression over the class of Lipschitz functions—no estimator can do uniformly better on that class.
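A quick heuristic for where this rate comes from (a bias–variance balance, not a formal proof): with $M$ leaves of side $h \propto M^{-1/p}$, the squared bias of a piecewise constant fit to a Lipschitz function scales like $h^2 \propto M^{-2/p}$, while the estimation variance scales like the number of fitted constants over the sample size, $\sigma^2 M / n$. Balancing the two terms gives

$$M^{-2/p} \asymp \frac{M}{n} \;\Longrightarrow\; M \asymp n^{p/(p+2)} \;\Longrightarrow\; \mathbb{E}\|\hat{f}_n - f\|_2^2 \asymp n^{-2/(p+2)}.$$

The empirical study later on this page follows the same logic by letting the number of leaves grow with $n$ rather than holding it fixed.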
Key insight:
Trees achieve optimal rates for rough (Lipschitz) functions but suboptimal rates for smoother functions. Methods like splines or kernel smoothing are superior when the true function has higher regularity (continuous derivatives).
| Function Class | Tree Rate | Optimal Rate | Gap |
|---|---|---|---|
| Lipschitz (s = 1) | n^{-2/(p+2)} | n^{-2/(p+2)} | Optimal ✓ |
| Twice differentiable (s = 2) | n^{-2/(p+2)} | n^{-4/(p+4)} | Suboptimal |
| s-times differentiable | n^{-2/(p+2)} | n^{-2s/(p+2s)} | Increasingly suboptimal |
| Infinitely smooth | n^{-2/(p+2)} | near-parametric (≈ n^{-1}) | Very suboptimal |
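To make the smoothness gap concrete, here is a minimal sketch (illustrative target and settings, not taken from the text above) comparing a 4-parameter cubic polynomial fit to trees of increasing size on an infinitely smooth target; on such functions a handful of polynomial coefficients typically beats many tree leaves:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Smooth target: e^x on [-1, 1], observed with a little noise
rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, 500)
y_train = np.exp(X_train) + rng.normal(0, 0.05, 500)

X_test = np.linspace(-1, 1, 1000)
y_test = np.exp(X_test)

# Cubic polynomial: only 4 parameters
coeffs = np.polyfit(X_train, y_train, deg=3)
poly_mse = np.mean((np.polyval(coeffs, X_test) - y_test) ** 2)

print(f"{'Model':<24} {'# params':>9} {'Test MSE':>12}")
print(f"{'Cubic polynomial':<24} {4:>9} {poly_mse:>12.6f}")

# Trees need many more leaves to approach the same accuracy
for n_leaves in [4, 16, 64]:
    tree = DecisionTreeRegressor(max_leaf_nodes=n_leaves)
    tree.fit(X_train.reshape(-1, 1), y_train)
    tree_mse = np.mean((tree.predict(X_test.reshape(-1, 1)) - y_test) ** 2)
    print(f"{'Tree, ' + str(n_leaves) + ' leaves':<24} {n_leaves:>9} {tree_mse:>12.6f}")
```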
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def empirical_convergence_study():
    """
    Empirically study convergence of tree predictions as n grows.
    """
    np.random.seed(42)

    # True function
    def f(x):
        return np.sin(2 * np.pi * x)

    # Evaluate on fixed test points
    X_test = np.linspace(0, 1, 500).reshape(-1, 1)
    y_test = f(X_test.ravel())

    sample_sizes = [50, 100, 200, 500, 1000, 2000, 5000]
    n_trials = 20  # Average over multiple datasets

    print("Convergence as Sample Size Grows")
    print("=" * 60)
    print(f"{'n':>8} {'MSE':>12} {'Std':>12} {'Rate est':>12}")
    print("-" * 50)

    mses = []

    for n in sample_sizes:
        trial_mses = []

        for _ in range(n_trials):
            # Generate training data
            X_train = np.random.uniform(0, 1, n).reshape(-1, 1)
            y_train = f(X_train.ravel()) + np.random.randn(n) * 0.1

            # Fit tree (let complexity grow with n)
            max_leaves = int(np.sqrt(n))  # M ~ sqrt(n)
            tree = DecisionTreeRegressor(max_leaf_nodes=max_leaves, min_samples_leaf=5)
            tree.fit(X_train, y_train)

            # Test error
            y_pred = tree.predict(X_test)
            mse = np.mean((y_test - y_pred)**2)
            trial_mses.append(mse)

        avg_mse = np.mean(trial_mses)
        std_mse = np.std(trial_mses)
        mses.append(avg_mse)

        # Estimate convergence rate
        if len(mses) > 1:
            log_n = np.log(sample_sizes[:len(mses)])
            log_mse = np.log(mses)
            rate = -np.polyfit(log_n, log_mse, 1)[0]
        else:
            rate = float('nan')

        print(f"{n:>8} {avg_mse:>12.6f} {std_mse:>12.6f} {rate:>12.3f}")

    # Theoretical rate for 1D
    print(f"\nTheoretical rate for 1D: 2/(1+2) = 0.67")
    print(f"Observed rate: {-np.polyfit(np.log(sample_sizes), np.log(mses), 1)[0]:.3f}")


def bias_variance_with_sample_size():
    """
    Decompose error into bias and variance as n grows.
    """
    np.random.seed(42)

    def f(x):
        return x ** 2

    X_test = np.linspace(0, 1, 100).reshape(-1, 1)
    y_test = f(X_test.ravel())

    print("\nBias-Variance Decomposition vs Sample Size")
    print("=" * 55)
    print(f"{'n':>8} {'Bias²':>12} {'Variance':>12} {'Total MSE':>12}")
    print("-" * 50)

    for n in [50, 100, 200, 500, 1000]:
        predictions = []

        # Multiple trials to estimate variance
        for _ in range(50):
            X_train = np.random.uniform(0, 1, n).reshape(-1, 1)
            y_train = f(X_train.ravel()) + np.random.randn(n) * 0.1

            tree = DecisionTreeRegressor(max_leaf_nodes=int(np.sqrt(n)))
            tree.fit(X_train, y_train)
            predictions.append(tree.predict(X_test))

        predictions = np.array(predictions)

        # E[f̂]
        mean_pred = np.mean(predictions, axis=0)

        # Bias² = E[(E[f̂] - f)²]
        bias_sq = np.mean((mean_pred - y_test) ** 2)

        # Variance = E[(f̂ - E[f̂])²]
        variance = np.mean(np.var(predictions, axis=0))

        # Total MSE
        total_mse = bias_sq + variance

        print(f"{n:>8} {bias_sq:>12.6f} {variance:>12.6f} {total_mse:>12.6f}")

empirical_convergence_study()
bias_variance_with_sample_size()
```

We've explored the mathematical foundations of regression trees as piecewise constant function approximators—the theoretical lens through which tree behavior becomes transparent.
What's next:
The final page of this module provides a comprehensive comparison with linear models—analyzing when trees outperform linear regression, when the reverse is true, and how to choose between these fundamentally different approaches to regression.
You now have a deep understanding of regression trees from an approximation theory perspective—universal approximation, error bounds, basis functions, and convergence. This theoretical grounding is essential for understanding why trees work, when they might fail, and how they compare to other approximation methods.