In gradient boosting, each iteration computes pseudo-residuals—the direction of steepest descent in prediction space. But these pseudo-residuals exist only at training points. To generalize to unseen data, we need to approximate this gradient direction with a learnable function.
This is where base learners enter. Typically shallow decision trees, base learners are fitted to pseudo-residuals and then added to the ensemble. The choice of base learner architecture and the details of fitting profoundly impact both the optimization trajectory and final generalization.
This page explores base learner fitting in complete detail: why decision trees dominate, how tree construction differs for boosting versus standalone models, leaf value optimization for various losses, and the critical balance between approximation power and regularization.
By the end of this page, you will understand: why decision trees are the predominant base learners, how trees are fitted to pseudo-residuals, leaf value optimization for different loss functions, the role of tree depth in controlling the bias-variance tradeoff, and practical considerations for configuring base learners in production systems.
While gradient boosting is theoretically compatible with any base learner class, decision trees have become the de facto standard. Understanding why illuminates the fundamental requirements of effective boosting.
1. Non-parametric Flexibility: Trees can approximate arbitrarily complex functions given sufficient depth. They naturally capture interactions between features through their hierarchical splitting structure.
2. Automatic Feature Selection: Each split chooses the most informative feature for that region of the input space. This automatic relevance detection eliminates the need for manual feature engineering.
3. Scale Invariance: Tree splits are based on ordering, not magnitude. Features don't need normalization, and monotonic transformations don't affect splits (a short demonstration follows this list).
4. Fast Training: Optimal splits can be found efficiently by sorting features once and scanning thresholds. Modern implementations achieve $O(n \cdot d \cdot \log n)$ training complexity.
5. Interpretable Weak Learners: While the ensemble may be complex, individual trees remain interpretable. Each tree represents a simple rule set.
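To make property 3 concrete, here is a minimal sketch (the dataset and the log transform are illustrative choices, not from the text) showing that fitting the same tree on a feature and on a monotonic transform of that feature yields identical predictions, because the induced sample partitions are the same.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0.1, 10.0, size=(200, 1))            # strictly positive feature
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)      # nonlinear target

# Fit the same depth-3 tree on the raw feature and on a monotonic transform of it
tree_raw = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
tree_log = DecisionTreeRegressor(max_depth=3, random_state=0).fit(np.log(X), y)

# The thresholds differ, but the induced partition of the samples (and hence
# the predictions) should match, since split quality depends only on ordering.
pred_raw = tree_raw.predict(X)
pred_log = tree_log.predict(np.log(X))
print("Predictions identical:", np.allclose(pred_raw, pred_log))  # expected: True
```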
| Property | Decision Trees | Linear Models | Neural Networks |
|---|---|---|---|
| Captures interactions | Inherent (via splits) | Requires feature engineering | Inherent (via hidden layers) |
| Feature scaling | Not required | Critical | Recommended |
| Training speed | Fast (O(nd log n)) | Very fast | Slow (iterative) |
| Gradient approximation | Piecewise constant | Linear | Smooth nonlinear |
| Hyperparameter sensitivity | Low | Low | High |
| Implementation complexity | Low | Low | High |
| Practical popularity | ★★★★★ | ★★☆☆☆ | ★☆☆☆☆ |
Boosting requires 'weak learners'—models that perform slightly better than random guessing. Deep trees are strong learners that overfit; they violate the weak learner assumption. Shallow trees (depth 3-8) provide the ideal balance: expressive enough to capture local gradient structure, constrained enough to avoid overfitting.
Tree construction in gradient boosting follows the same greedy recursive partitioning as standard CART, but with important differences in leaf value assignment and potential split scoring modifications.
For a node containing samples $I$, we find the optimal split:
$$(j^*, s^*) = \arg\min_{j, s} \left[ \sum_{i \in I_L} (\tilde{r}_i - c_L)^2 + \sum_{i \in I_R} (\tilde{r}_i - c_R)^2 \right]$$
where $j$ indexes features and $s$ is the split threshold; $I_L = \{i \in I : x_{ij} \le s\}$ and $I_R = \{i \in I : x_{ij} > s\}$ are the samples sent left and right; and $c_L$, $c_R$ are the constant fits in each child (the mean pseudo-residuals over $I_L$ and $I_R$).
The optimal split equivalently maximizes variance reduction:
$$\text{Gain} = \frac{1}{|I|}\left[ \frac{S_L^2}{|I_L|} + \frac{S_R^2}{|I_R|} - \frac{(S_L + S_R)^2}{|I|} \right]$$
where $S_L = \sum_{i \in I_L} \tilde{r}_i$ and $S_R = \sum_{i \in I_R} \tilde{r}_i$.
For each feature: sort the node's samples by that feature's value ($O(n \log n)$), then sweep the sorted order once while maintaining running sums of pseudo-residuals, evaluating the gain at every candidate threshold ($O(n)$).
Total complexity: $O(d \cdot n \log n)$ per split, where $d$ is the number of features.
```python
import numpy as np

def find_best_split(X, pseudo_residuals, feature_idx):
    """
    Find the best split for a single feature using squared loss.

    Parameters:
    -----------
    X : array of shape (n_samples, n_features)
    pseudo_residuals : array of shape (n_samples,)
    feature_idx : int
        Index of feature to split on

    Returns:
    --------
    best_threshold : float
    best_gain : float
    """
    n = len(pseudo_residuals)
    feature_values = X[:, feature_idx]

    # Sort by feature values
    sorted_indices = np.argsort(feature_values)
    sorted_residuals = pseudo_residuals[sorted_indices]
    sorted_features = feature_values[sorted_indices]

    # Initialize running sums
    total_sum = np.sum(pseudo_residuals)
    sum_left = 0.0
    n_left = 0

    best_gain = -np.inf
    best_threshold = None

    # Scan thresholds (between distinct values only)
    for i in range(n - 1):
        sum_left += sorted_residuals[i]
        n_left += 1
        n_right = n - n_left
        sum_right = total_sum - sum_left

        # Skip if values are identical (can't split here)
        if sorted_features[i] == sorted_features[i + 1]:
            continue

        # Compute gain (variance reduction)
        gain = (sum_left ** 2 / n_left
                + sum_right ** 2 / n_right
                - total_sum ** 2 / n)

        if gain > best_gain:
            best_gain = gain
            # Threshold is midpoint between values
            best_threshold = (sorted_features[i] + sorted_features[i + 1]) / 2

    return best_threshold, best_gain / n

def find_best_split_all_features(X, pseudo_residuals):
    """
    Find the globally best split across all features.
    """
    best_feature = None
    best_threshold = None
    best_gain = -np.inf

    for j in range(X.shape[1]):
        threshold, gain = find_best_split(X, pseudo_residuals, j)
        if gain > best_gain:
            best_gain = gain
            best_feature = j
            best_threshold = threshold

    return best_feature, best_threshold, best_gain

# Example
np.random.seed(42)
X = np.random.randn(100, 5)
# Create pseudo-residuals with structure
pseudo_residuals = X[:, 0] + 0.5 * X[:, 1] + 0.1 * np.random.randn(100)

feature, threshold, gain = find_best_split_all_features(X, pseudo_residuals)
print(f"Best split: Feature {feature} at threshold {threshold:.3f}")
print(f"Variance reduction gain: {gain:.4f}")
```

Modern implementations (LightGBM, XGBoost) use histogram-based splitting. Features are discretized into bins (e.g., 255 bins), and instead of scanning all unique values we scan bins, reducing the per-feature threshold scan on large datasets from O(n) to O(#bins).
Once a tree structure is determined, we must assign values to leaves. This step is crucial and differs fundamentally between loss functions.
For a leaf $j$ containing samples $I_j$, we seek the optimal leaf value $\gamma_j$ that minimizes:
$$\gamma_j = \arg\min_\gamma \sum_{i \in I_j} L(y_i, F_{m-1}(x_i) + \gamma)$$
This is a per-leaf line search: find the best step size in the constant-function direction defined by this leaf.
For $L(y, F) = \frac{1}{2}(y - F)^2$:
$$\min_\gamma \sum_{i \in I_j} (y_i - F_{m-1}(x_i) - \gamma)^2 = \min_\gamma \sum_{i \in I_j} (\tilde{r}_i - \gamma)^2$$
Setting the derivative to zero: $$\gamma_j = \frac{1}{|I_j|} \sum_{i \in I_j} \tilde{r}_i$$
The optimal leaf value is simply the mean pseudo-residual in the leaf. This is why standard regression trees directly use mean values.
For $L(y, F) = |y - F|$:
$$\gamma_j = \arg\min_\gamma \sum_{i \in I_j} |y_i - F_{m-1}(x_i) - \gamma| = \arg\min_\gamma \sum_{i \in I_j} |\tilde{r}_i - \gamma|$$
The solution is the median pseudo-residual: $$\gamma_j = \text{median}_{i \in I_j}(\tilde{r}_i)$$
This is robust to outliers within the leaf.
For binary classification with log loss, the leaf value optimization is more complex. The loss is:
$$L(y, F) = -y \cdot F + \log(1 + e^F)$$
The optimal leaf value satisfies:
$$\sum_{i \in I_j} \left( y_i - \frac{e^{F_{m-1}(x_i) + \gamma}}{1 + e^{F_{m-1}(x_i) + \gamma}} \right) = 0$$
This doesn't have a closed-form solution. Options:
a) Newton-Raphson Approximation: Using a single Newton step from the gradient and Hessian (a derivation sketch follows this list): $$\gamma_j = \frac{\sum_{i \in I_j} \tilde{r}_i}{\sum_{i \in I_j} p_i(1 - p_i)}$$
where $p_i = \sigma(F_{m-1}(x_i))$ is the current predicted probability.
b) Numerical Optimization: Run a few iterations of Newton's method or use bounded line search.
c) Approximation (Common in Practice): Use the mean pseudo-residual as an approximation, accepting suboptimality for speed.
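For reference, here is the standard derivation behind option (a), sketched via a second-order Taylor expansion of the leaf objective in $\gamma$:

$$\sum_{i \in I_j} L\big(y_i, F_{m-1}(x_i) + \gamma\big) \approx \text{const} + \gamma \sum_{i \in I_j} g_i + \frac{\gamma^2}{2} \sum_{i \in I_j} h_i, \qquad g_i = \left.\frac{\partial L}{\partial F}\right|_{F_{m-1}(x_i)}, \quad h_i = \left.\frac{\partial^2 L}{\partial F^2}\right|_{F_{m-1}(x_i)}$$

Setting the derivative with respect to $\gamma$ to zero gives $\gamma_j = -\sum_{i \in I_j} g_i / \sum_{i \in I_j} h_i$. For log loss, $g_i = p_i - y_i = -\tilde{r}_i$ and $h_i = p_i(1 - p_i)$, which recovers the formula in option (a).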
```python
import numpy as np
from scipy.optimize import minimize_scalar

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.clip(x, -500, 500)))

def optimal_leaf_squared_loss(pseudo_residuals):
    """Optimal leaf value for squared loss: mean of pseudo-residuals."""
    return np.mean(pseudo_residuals)

def optimal_leaf_absolute_loss(pseudo_residuals):
    """Optimal leaf value for absolute loss: median of pseudo-residuals."""
    return np.median(pseudo_residuals)

def optimal_leaf_log_loss_newton(y_true, current_predictions):
    """
    Optimal leaf value for log loss using Newton approximation.

    gamma = sum(y - p) / sum(p * (1 - p))
    """
    p = sigmoid(current_predictions)
    pseudo_residuals = y_true - p
    hessian_diag = p * (1 - p)
    # Regularization to avoid division by tiny values
    gamma = np.sum(pseudo_residuals) / (np.sum(hessian_diag) + 1e-10)
    return gamma

def optimal_leaf_log_loss_exact(y_true, current_predictions):
    """Optimal leaf value for log loss using numerical optimization."""
    def objective(gamma):
        F_new = current_predictions + gamma
        # Log loss: -y*F + log(1 + exp(F))
        loss = np.sum(-y_true * F_new
                      + np.log(1 + np.exp(np.clip(F_new, -500, 500))))
        return loss

    result = minimize_scalar(objective, bounds=(-10, 10), method='bounded')
    return result.x

# Comparison
np.random.seed(42)
n = 100
y_true = np.random.binomial(1, 0.6, n).astype(float)
current_preds = np.random.randn(n) * 0.5  # Current log-odds predictions

gamma_newton = optimal_leaf_log_loss_newton(y_true, current_preds)
gamma_exact = optimal_leaf_log_loss_exact(y_true, current_preds)

print(f"Newton approximation: γ = {gamma_newton:.4f}")
print(f"Exact optimization: γ = {gamma_exact:.4f}")
print(f"Difference: {abs(gamma_newton - gamma_exact):.6f}")
# Typically very close; the Newton approximation is sufficient
```

XGBoost uses a second-order Taylor expansion of the loss, computing both gradients $g_i$ and Hessians $h_i$. The optimal leaf value becomes $\gamma = -\sum g_i / (\sum h_i + \lambda)$, where $\lambda$ is the L2 regularization strength. This generalizes the Newton approximation and enables split gain computation that accounts for leaf values.
The maximum depth of base learner trees is one of the most critical hyperparameters in gradient boosting. It controls both the complexity of individual trees and the nature of the approximation to the gradient.
A depth-1 tree (a decision stump) makes a single split on one feature at one threshold, partitioning the input space into exactly two regions; it captures a main effect of a single feature and no interactions.
Gradient approximation: the gradient is approximated by a step function with only 2 values. This is very coarse, so many iterations are needed to build complex models.
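As a quick illustration (a sketch with made-up data, using scikit-learn's DecisionTreeRegressor as a stand-in base learner), the following fits a single stump to pseudo-residuals and confirms the resulting approximation takes only two distinct values.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)

# Pseudo-residuals after a constant initial model (for squared loss: plain residuals)
F0 = np.full_like(y, y.mean())
residuals = y - F0

# A depth-1 tree approximates these residuals with a two-valued step function
stump = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
print("Distinct predicted values:", np.unique(stump.predict(X)))  # exactly 2 values
print("Chosen split threshold:", stump.tree_.threshold[0])
```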
Trees of depth 3 to 6 are the optimal range for most tabular data problems and serve as the canonical "weak learner" for gradient boosting.
| Depth | Leaves | Interaction Order | Bias | Variance | Typical Use |
|---|---|---|---|---|---|
| 1 | 2 | None (main effects) | High | Very Low | AdaBoost, simple problems |
| 2 | 4 | 2-way | Moderate | Low | Conservative baseline |
| 3-4 | 8-16 | 3-4 way | Low | Moderate | Most common choice |
| 5-6 | 32-64 | 5-6 way | Very Low | High | Complex problems, needs regularization |
| 7+ | 128+ | High-order | Minimal | Very High | Rarely used, high overfitting risk |
There's a fundamental tradeoff: deeper trees require fewer iterations but risk overfitting. Shallower trees need more iterations but generalize better. In practice, combining shallow trees (depth 3-6) with many iterations (100-1000+) and a small learning rate typically achieves the best results.
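The sketch below makes the tradeoff visible by comparing a few depth/iteration combinations at a fixed small learning rate. It uses scikit-learn's GradientBoostingRegressor on a synthetic Friedman dataset; the specific settings are illustrative assumptions, not prescribed values.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_friedman1(n_samples=600, noise=1.0, random_state=0)

# Same learning rate; trade tree depth against number of boosting iterations
settings = [
    {"max_depth": 1, "n_estimators": 1000},  # stumps: need many iterations
    {"max_depth": 3, "n_estimators": 300},   # common sweet spot
    {"max_depth": 8, "n_estimators": 50},    # deep trees: few iterations, higher variance
]

for params in settings:
    model = GradientBoostingRegressor(learning_rate=0.05, random_state=0, **params)
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    rmse = np.sqrt(-scores)
    print(f"depth={params['max_depth']:>2}, trees={params['n_estimators']:>4}: "
          f"CV RMSE = {rmse.mean():.3f} ± {rmse.std():.3f}")
```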
Beyond depth constraints, several regularization techniques apply to individual trees within a gradient boosting ensemble.
Minimum samples per leaf (min_samples_leaf) requires each leaf to contain at least $k$ samples. Effects: leaf values are estimated from more samples and are therefore less noisy, and the tree cannot carve out tiny leaves that chase individual outliers.
Typical values: 1-20 for gradient boosting (lower than standalone trees because the ensemble averages out noise).
Minimum samples to split (min_samples_split) requires at least $k$ samples in a node before it may be split; a node with fewer samples becomes a leaf. The effect is similar to min_samples_leaf but applied earlier in tree construction.
A minimum impurity decrease (minimum gain threshold) requires splits to reduce impurity by at least $\delta$, preventing splits with only marginal gain:
$$\text{Gain}(\text{split}) \geq \delta \cdot |I_{\text{parent}}|$$
Pruning weak splits reduces complexity without affecting high-value splits.
Feature subsampling (max_features): at each split, only a random subset of features is considered as split candidates. This decorrelates the trees in the ensemble and speeds up split finding.
Typical values: sqrt(n_features) or n_features (all features).
```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
import numpy as np

# Generate synthetic data with some noise
X, y = make_regression(n_samples=500, n_features=20, noise=20, random_state=42)

# Compare different regularization settings
configs = [
    {"max_depth": 3, "min_samples_leaf": 1, "name": "No regularization"},
    {"max_depth": 3, "min_samples_leaf": 10, "name": "min_samples_leaf=10"},
    {"max_depth": 5, "min_samples_leaf": 1, "name": "Deeper trees (depth=5)"},
    {"max_depth": 5, "min_samples_leaf": 20, "name": "Deeper + regularized"},
    {"max_depth": 3, "max_features": 0.5, "min_samples_leaf": 1, "name": "50% features"},
]

print("Cross-validation scores for different tree regularization:")
print("-" * 60)

for config in configs:
    name = config.pop("name")
    gbm = GradientBoostingRegressor(
        n_estimators=100, learning_rate=0.1, random_state=42, **config
    )
    scores = cross_val_score(gbm, X, y, cv=5, scoring='neg_mean_squared_error')
    rmse_per_fold = np.sqrt(-scores)  # Convert negative-MSE scores to per-fold RMSE
    print(f"{name:40s}: RMSE = {rmse_per_fold.mean():.2f} ± {rmse_per_fold.std():.2f}")

# Output (example):
# No regularization                       : RMSE = 21.45 ± 1.32
# min_samples_leaf=10                     : RMSE = 20.89 ± 1.28
# Deeper trees (depth=5)                  : RMSE = 22.31 ± 1.41
# Deeper + regularized                    : RMSE = 20.12 ± 1.15
# 50% features                            : RMSE = 20.67 ± 1.24
```

Modern implementations add explicit regularization to the objective: $\lambda$ (L2 on leaf weights) and $\alpha$ (L1 on leaf weights). These directly penalize complex leaf values, providing smoother predictions and better generalization. The regularized split gain becomes $$\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma,$$ where $\gamma$ penalizes each additional leaf.
How trees grow—the order in which nodes are expanded—significantly impacts the resulting structure and generalization.
Used by: XGBoost (default), standard CART
All nodes at depth $d$ are split before any node at depth $d+1$. The tree grows layer by layer.
Advantages: produces balanced trees with predictable size and memory use, is straightforward to parallelize level by level, and avoids growing very deep branches that overfit.
Disadvantages: splits every node at a level, including nodes whose best split yields little gain, which wastes computation and can add low-value structure.
Used by: LightGBM
Always split the leaf with the highest gain, regardless of depth. The tree grows wherever improvement is greatest.
Advantages: for the same number of leaves, typically achieves a lower training loss than level-wise growth, because each split is spent where it reduces the loss the most.
Disadvantages: trees can become deep and unbalanced, increasing overfitting risk on small datasets; constraints such as a leaf-count limit (num_leaves) and a maximum depth are needed to keep growth in check.
Used by: CatBoost
All leaves at the same level use the same split (feature and threshold). This creates perfectly balanced 'oblivious' trees.
Structure: At depth $d$, there are exactly $2^d$ leaves, and each path from root to leaf makes the same sequence of feature comparisons.
Advantages: prediction is extremely fast, since a sample's leaf is found by evaluating the $d$ shared comparisons and combining the results into an index (see the sketch below); the rigid structure also acts as a regularizer and vectorizes well on CPUs and GPUs.
Disadvantages: each individual tree is less expressive, because every node at a level must reuse the same split, so more trees may be needed to reach the same accuracy.
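To see why the symmetric structure makes prediction so fast, here is a minimal sketch in plain NumPy (illustrative only, not CatBoost's actual code): the $d$ shared comparisons are evaluated once per sample and combined into a leaf index, so prediction reduces to a few vectorized operations plus a table lookup.

```python
import numpy as np

def predict_oblivious(X, split_features, split_thresholds, leaf_values):
    """Predict with a symmetric (oblivious) tree of depth d.

    split_features, split_thresholds: length-d arrays, one shared split per level.
    leaf_values: array of 2**d leaf values, indexed by the bit pattern of comparisons.
    """
    d = len(split_features)
    leaf_index = np.zeros(X.shape[0], dtype=np.int64)
    for level in range(d):
        went_right = X[:, split_features[level]] > split_thresholds[level]
        leaf_index = (leaf_index << 1) | went_right.astype(np.int64)
    return leaf_values[leaf_index]

# Tiny example: depth-2 oblivious tree over 2 features (made-up splits and values)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
preds = predict_oblivious(
    X,
    split_features=np.array([0, 1]),
    split_thresholds=np.array([0.0, 0.5]),
    leaf_values=np.array([-1.0, -0.2, 0.3, 1.1]),  # 2**2 = 4 leaves
)
print(preds)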
For most tabular data problems, LightGBM's leaf-wise growth with num_leaves=31-127 provides an excellent balance of accuracy and speed. XGBoost's level-wise growth with max_depth=6 is a robust alternative. CatBoost's symmetric trees excel when prediction speed is critical or when using GPUs.
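As a rough sketch of how these strategies surface in configuration (assuming the xgboost, lightgbm, and catboost packages are installed; the values are illustrative starting points, not tuned recommendations):

```python
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostRegressor

# XGBoost: level-wise growth by default, depth-limited trees
xgb_model = xgb.XGBRegressor(max_depth=6, n_estimators=300, learning_rate=0.05)

# LightGBM: leaf-wise growth, controlled primarily by the number of leaves
lgb_model = lgb.LGBMRegressor(num_leaves=63, max_depth=-1, n_estimators=300,
                              learning_rate=0.05)

# CatBoost: symmetric (oblivious) trees, controlled by depth
cat_model = CatBoostRegressor(depth=6, iterations=300, learning_rate=0.05, verbose=0)
```

The key contrast: LightGBM caps the number of leaves rather than the depth, while XGBoost's default growth policy and CatBoost cap depth directly.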
Let's implement a complete base learner fitting procedure for gradient boosting, incorporating tree construction and leaf value optimization.
```python
import numpy as np

class BoostingTreeNode:
    """A node in a gradient boosting tree."""
    def __init__(self):
        self.feature_idx = None
        self.threshold = None
        self.left = None
        self.right = None
        self.value = None      # Leaf value (only if leaf)
        self.is_leaf = False

class BoostingTree:
    """
    Decision tree optimized for gradient boosting.

    Fits to pseudo-residuals with optional leaf value optimization
    for different loss functions.
    """

    def __init__(self, max_depth=3, min_samples_leaf=1,
                 min_samples_split=2, loss='squared'):
        self.max_depth = max_depth
        self.min_samples_leaf = min_samples_leaf
        self.min_samples_split = min_samples_split
        self.loss = loss
        self.root = None

    def _compute_gain(self, left_sum, left_count, right_sum, right_count):
        """Compute variance reduction gain for a split."""
        if left_count < self.min_samples_leaf or right_count < self.min_samples_leaf:
            return -np.inf

        total_count = left_count + right_count
        total_sum = left_sum + right_sum

        gain = (left_sum ** 2 / left_count
                + right_sum ** 2 / right_count
                - total_sum ** 2 / total_count)
        return gain / total_count

    def _find_best_split(self, X, residuals, sample_indices):
        """Find the best split for a node."""
        n_samples = len(sample_indices)
        n_features = X.shape[1]

        best_gain = -np.inf
        best_feature = None
        best_threshold = None

        for feat_idx in range(n_features):
            # Get feature values for samples in this node
            feature_values = X[sample_indices, feat_idx]
            res_values = residuals[sample_indices]

            # Sort by feature
            sorted_order = np.argsort(feature_values)
            sorted_features = feature_values[sorted_order]
            sorted_residuals = res_values[sorted_order]

            # Scan thresholds
            left_sum = 0.0
            left_count = 0
            total_sum = np.sum(sorted_residuals)

            for i in range(n_samples - 1):
                left_sum += sorted_residuals[i]
                left_count += 1

                # Skip if same value as next (can't split)
                if sorted_features[i] == sorted_features[i + 1]:
                    continue

                right_sum = total_sum - left_sum
                right_count = n_samples - left_count

                gain = self._compute_gain(left_sum, left_count,
                                          right_sum, right_count)

                if gain > best_gain:
                    best_gain = gain
                    best_feature = feat_idx
                    best_threshold = (sorted_features[i] + sorted_features[i + 1]) / 2

        return best_feature, best_threshold, best_gain

    def _compute_leaf_value(self, residuals, indices, y_true=None, current_preds=None):
        """Compute optimal leaf value based on loss function."""
        res = residuals[indices]

        if self.loss == 'squared':
            return np.mean(res)
        elif self.loss == 'absolute':
            return np.median(res)
        elif self.loss == 'log' and y_true is not None:
            # Newton approximation for log loss
            y = y_true[indices]
            p = 1.0 / (1.0 + np.exp(-current_preds[indices]))
            gradient = y - p
            hessian = p * (1 - p) + 1e-10
            return np.sum(gradient) / np.sum(hessian)
        else:
            return np.mean(res)

    def _build_tree(self, X, residuals, sample_indices, depth,
                    y_true=None, current_preds=None):
        """Recursively build the tree."""
        node = BoostingTreeNode()
        n_samples = len(sample_indices)

        # Check stopping conditions
        if depth >= self.max_depth or n_samples < self.min_samples_split:
            node.is_leaf = True
            node.value = self._compute_leaf_value(
                residuals, sample_indices, y_true, current_preds)
            return node

        # Find best split
        feat, thresh, gain = self._find_best_split(X, residuals, sample_indices)

        if feat is None or gain <= 0:
            node.is_leaf = True
            node.value = self._compute_leaf_value(
                residuals, sample_indices, y_true, current_preds)
            return node

        # Create split
        node.feature_idx = feat
        node.threshold = thresh

        left_mask = X[sample_indices, feat] <= thresh
        left_indices = sample_indices[left_mask]
        right_indices = sample_indices[~left_mask]

        # Check minimum samples constraint
        if len(left_indices) < self.min_samples_leaf or \
           len(right_indices) < self.min_samples_leaf:
            node.is_leaf = True
            node.value = self._compute_leaf_value(
                residuals, sample_indices, y_true, current_preds)
            return node

        # Recursively build children
        node.left = self._build_tree(X, residuals, left_indices, depth + 1,
                                     y_true, current_preds)
        node.right = self._build_tree(X, residuals, right_indices, depth + 1,
                                      y_true, current_preds)

        return node

    def fit(self, X, residuals, y_true=None, current_preds=None):
        """Fit tree to pseudo-residuals."""
        sample_indices = np.arange(X.shape[0])
        self.root = self._build_tree(X, residuals, sample_indices, 0,
                                     y_true, current_preds)
        return self

    def _predict_single(self, x, node):
        """Predict for a single sample."""
        if node.is_leaf:
            return node.value
        if x[node.feature_idx] <= node.threshold:
            return self._predict_single(x, node.left)
        else:
            return self._predict_single(x, node.right)

    def predict(self, X):
        """Predict for multiple samples."""
        return np.array([self._predict_single(x, self.root) for x in X])

# Test the implementation
np.random.seed(42)
X = np.random.randn(200, 5)
y_true = 2 * X[:, 0] + X[:, 1] ** 2 + np.random.randn(200) * 0.5

# Simulate first boosting iteration
current_preds = np.full(len(y_true), np.mean(y_true))
residuals = y_true - current_preds

tree = BoostingTree(max_depth=3, loss='squared')
tree.fit(X, residuals)

predictions = tree.predict(X)
print(f"Residual variance before tree: {np.var(residuals):.4f}")
print(f"Residual variance after tree:  {np.var(residuals - predictions):.4f}")
```

Real implementations (XGBoost, LightGBM) include many additional optimizations: histogram binning, cache-aware access patterns, multi-threading, GPU acceleration, native handling of missing values, and more. The above code illustrates the core algorithm; production code adds 10-100x speedup through engineering.
We have thoroughly explored how base learners, typically shallow decision trees, are fitted to pseudo-residuals in gradient boosting. The key takeaways: decision trees dominate as base learners because they are flexible, fast to train, scale-invariant, and perform automatic feature selection. Each tree is fitted to pseudo-residuals by greedy variance-reduction splits, and its leaf values are then optimized for the actual loss (mean for squared loss, median for absolute loss, a Newton step for log loss). Tree depth controls the interaction order and the bias-variance tradeoff, with shallow trees of depth 3-6 plus many iterations working best in practice. Further regularization comes from minimum samples per leaf, minimum gain thresholds, and feature subsampling, and the growth strategy (level-wise, leaf-wise, or symmetric) trades accuracy against speed and overfitting risk.
With base learner fitting understood, we next examine the learning rate (step size)—the shrinkage parameter that scales each tree's contribution. We'll see how it provides crucial regularization, affects the optimization trajectory, and interacts with the number of boosting iterations to control generalization.
You now understand base learner fitting at a fundamental level—how trees are constructed, how leaf values are optimized, and how tree complexity is controlled. This knowledge is essential for tuning gradient boosting models and understanding why certain hyperparameter choices work better than others.