In gradient boosting, each iteration computes pseudo-residuals—the direction of steepest descent in prediction space. But these pseudo-residuals exist only at training points. To generalize to unseen data, we need to approximate this gradient direction with a learnable function.
This is where base learners enter. Typically shallow decision trees, base learners are fitted to pseudo-residuals and then added to the ensemble. The choice of base learner architecture and the details of fitting profoundly impact both the optimization trajectory and final generalization.
This page explores base learner fitting in complete detail: why decision trees dominate, how tree construction differs for boosting versus standalone models, leaf value optimization for various losses, and the critical balance between approximation power and regularization.
By the end of this page, you will understand: why decision trees are the predominant base learners, how trees are fitted to pseudo-residuals, leaf value optimization for different loss functions, the role of tree depth in controlling the bias-variance tradeoff, and practical considerations for configuring base learners in production systems.
While gradient boosting is theoretically compatible with any base learner class, decision trees have become the de facto standard. Understanding why illuminates the fundamental requirements of effective boosting.
1. Non-parametric Flexibility: Trees can approximate arbitrarily complex functions given sufficient depth. They naturally capture interactions between features through their hierarchical splitting structure.
2. Automatic Feature Selection: Each split chooses the most informative feature for that region of the input space. This automatic relevance detection eliminates the need for manual feature engineering.
3. Scale Invariance: Tree splits are based on ordering, not magnitude. Features don't need normalization, and monotonic transformations don't affect splits (a short demonstration follows this list).
4. Fast Training: Optimal splits can be found efficiently by sorting features once and scanning thresholds. Modern implementations achieve $O(n \cdot d \cdot \log n)$ training complexity.
5. Interpretable Weak Learners: While the ensemble may be complex, individual trees remain interpretable. Each tree represents a simple rule set.
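To make property 3 concrete, here is a minimal sketch (the dataset and the log transform are illustrative choices, not from the text) showing that fitting the same tree on a feature and on a monotonic transform of that feature yields identical predictions, because the induced sample partitions are the same.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0.1, 10.0, size=(200, 1))            # strictly positive feature
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)      # nonlinear target

# Fit the same depth-3 tree on the raw feature and on a monotonic transform of it
tree_raw = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
tree_log = DecisionTreeRegressor(max_depth=3, random_state=0).fit(np.log(X), y)

# The thresholds differ, but the induced partition of the samples (and hence
# the predictions) should match, since split quality depends only on ordering.
pred_raw = tree_raw.predict(X)
pred_log = tree_log.predict(np.log(X))
print("Predictions identical:", np.allclose(pred_raw, pred_log))  # expected: True
```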
| Property | Decision Trees | Linear Models | Neural Networks |
|---|---|---|---|
| Captures interactions | Inherent (via splits) | Requires feature engineering | Inherent (via hidden layers) |
| Feature scaling | Not required | Critical | Recommended |
| Training speed | Fast (O(nd log n)) | Very fast | Slow (iterative) |
| Gradient approximation | Piecewise constant | Linear | Smooth nonlinear |
| Hyperparameter sensitivity | Low | Low | High |
| Implementation complexity | Low | Low | High |
| Practical popularity | ★★★★★ | ★★☆☆☆ | ★☆☆☆☆ |
Boosting requires 'weak learners'—models that perform slightly better than random guessing. Deep trees are strong learners that overfit; they violate the weak learner assumption. Shallow trees (depth 3-8) provide the ideal balance: expressive enough to capture local gradient structure, constrained enough to avoid overfitting.
Tree construction in gradient boosting follows the same greedy recursive partitioning as standard CART, but with important differences in leaf value assignment and potential split scoring modifications.
For a node containing samples $I$, we find the optimal split:
$$(j^*, s^*) = \arg\min_{j, s} \left[ \sum_{i \in I_L} (\tilde{r}_i - c_L)^2 + \sum_{i \in I_R} (\tilde{r}_i - c_R)^2 \right]$$
where $j$ indexes features and $s$ is the split threshold; $I_L = \{i \in I : x_{ij} \le s\}$ and $I_R = \{i \in I : x_{ij} > s\}$ are the samples sent left and right; and $c_L$, $c_R$ are the constant fits in each child (the mean pseudo-residuals over $I_L$ and $I_R$).
The optimal split equivalently maximizes variance reduction:
$$\text{Gain} = \frac{1}{|I|}\left[ \frac{S_L^2}{|I_L|} + \frac{S_R^2}{|I_R|} - \frac{(S_L + S_R)^2}{|I|} \right]$$
where $S_L = \sum_{i \in I_L} \tilde{r}_i$ and $S_R = \sum_{i \in I_R} \tilde{r}_i$.
For each feature: sort the node's samples by that feature's value ($O(n \log n)$), then sweep the sorted order once while maintaining running sums of pseudo-residuals, evaluating the gain at every candidate threshold ($O(n)$).
Total complexity: $O(d \cdot n \log n)$ per split, where $d$ is the number of features.
```python
import numpy as np

def find_best_split(X, pseudo_residuals, feature_idx):
    """
    Find the best split for a single feature using squared loss.

    Parameters:
    -----------
    X : array of shape (n_samples, n_features)
    pseudo_residuals : array of shape (n_samples,)
    feature_idx : int
        Index of feature to split on

    Returns:
    --------
    best_threshold : float
    best_gain : float
    """
    n = len(pseudo_residuals)
    feature_values = X[:, feature_idx]

    # Sort by feature values
    sorted_indices = np.argsort(feature_values)
    sorted_residuals = pseudo_residuals[sorted_indices]
    sorted_features = feature_values[sorted_indices]

    # Initialize running sums
    total_sum = np.sum(pseudo_residuals)
    sum_left = 0.0
    n_left = 0

    best_gain = -np.inf
    best_threshold = None

    # Scan thresholds (between distinct values only)
    for i in range(n - 1):
        sum_left += sorted_residuals[i]
        n_left += 1
        n_right = n - n_left
        sum_right = total_sum - sum_left

        # Skip if values are identical (can't split here)
        if sorted_features[i] == sorted_features[i + 1]:
            continue

        # Compute gain (variance reduction)
        gain = (sum_left ** 2 / n_left
                + sum_right ** 2 / n_right
                - total_sum ** 2 / n)

        if gain > best_gain:
            best_gain = gain
            # Threshold is midpoint between values
            best_threshold = (sorted_features[i] + sorted_features[i + 1]) / 2

    return best_threshold, best_gain / n

def find_best_split_all_features(X, pseudo_residuals):
    """
    Find the globally best split across all features.
    """
    best_feature = None
    best_threshold = None
    best_gain = -np.inf

    for j in range(X.shape[1]):
        threshold, gain = find_best_split(X, pseudo_residuals, j)
        if gain > best_gain:
            best_gain = gain
            best_feature = j
            best_threshold = threshold

    return best_feature, best_threshold, best_gain

# Example
np.random.seed(42)
X = np.random.randn(100, 5)
# Create pseudo-residuals with structure
pseudo_residuals = X[:, 0] + 0.5 * X[:, 1] + 0.1 * np.random.randn(100)

feature, threshold, gain = find_best_split_all_features(X, pseudo_residuals)
print(f"Best split: Feature {feature} at threshold {threshold:.3f}")
print(f"Variance reduction gain: {gain:.4f}")
```

Modern implementations (LightGBM, XGBoost) use histogram-based splitting. Features are discretized into bins (e.g., 255 bins), and instead of scanning all unique values we scan bins, reducing the per-feature threshold scan on large datasets from O(n) to O(#bins).
Once a tree structure is determined, we must assign values to leaves. This step is crucial and differs fundamentally between loss functions.
For a leaf $j$ containing samples $I_j$, we seek the optimal leaf value $\gamma_j$ that minimizes:
$$\gamma_j = \arg\min_\gamma \sum_{i \in I_j} L(y_i, F_{m-1}(x_i) + \gamma)$$
This is a per-leaf line search: find the best step size in the constant-function direction defined by this leaf.
For $L(y, F) = \frac{1}{2}(y - F)^2$:
$$\min_\gamma \sum_{i \in I_j} (y_i - F_{m-1}(x_i) - \gamma)^2 = \min_\gamma \sum_{i \in I_j} (\tilde{r}_i - \gamma)^2$$
Setting the derivative to zero: $$\gamma_j = \frac{1}{|I_j|} \sum_{i \in I_j} \tilde{r}_i$$
The optimal leaf value is simply the mean pseudo-residual in the leaf. This is why standard regression trees directly use mean values.
For $L(y, F) = |y - F|$:
$$\gamma_j = \arg\min_\gamma \sum_{i \in I_j} |y_i - F_{m-1}(x_i) - \gamma| = \arg\min_\gamma \sum_{i \in I_j} |\tilde{r}_i - \gamma|$$
The solution is the median pseudo-residual: $$\gamma_j = \text{median}_{i \in I_j}(\tilde{r}_i)$$
This is robust to outliers within the leaf.
For binary classification with log loss, the leaf value optimization is more complex. The loss is:
$$L(y, F) = -y \cdot F + \log(1 + e^F)$$
The optimal leaf value satisfies:
$$\sum_{i \in I_j} \left( y_i - \frac{e^{F_{m-1}(x_i) + \gamma}}{1 + e^{F_{m-1}(x_i) + \gamma}} \right) = 0$$
This doesn't have a closed-form solution. Options:
a) Newton-Raphson Approximation: Using a single Newton step from the gradient and Hessian (a derivation sketch follows this list): $$\gamma_j = \frac{\sum_{i \in I_j} \tilde{r}_i}{\sum_{i \in I_j} p_i(1 - p_i)}$$
where $p_i = \sigma(F_{m-1}(x_i))$ is the current predicted probability.
b) Numerical Optimization: Run a few iterations of Newton's method or use bounded line search.
c) Approximation (Common in Practice): Use the mean pseudo-residual as an approximation, accepting suboptimality for speed.
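For reference, here is the standard derivation behind option (a), sketched via a second-order Taylor expansion of the leaf objective in $\gamma$:

$$\sum_{i \in I_j} L\big(y_i, F_{m-1}(x_i) + \gamma\big) \approx \text{const} + \gamma \sum_{i \in I_j} g_i + \frac{\gamma^2}{2} \sum_{i \in I_j} h_i, \qquad g_i = \left.\frac{\partial L}{\partial F}\right|_{F_{m-1}(x_i)}, \quad h_i = \left.\frac{\partial^2 L}{\partial F^2}\right|_{F_{m-1}(x_i)}$$

Setting the derivative with respect to $\gamma$ to zero gives $\gamma_j = -\sum_{i \in I_j} g_i / \sum_{i \in I_j} h_i$. For log loss, $g_i = p_i - y_i = -\tilde{r}_i$ and $h_i = p_i(1 - p_i)$, which recovers the formula in option (a).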
```python
import numpy as np
from scipy.optimize import minimize_scalar

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.clip(x, -500, 500)))

def optimal_leaf_squared_loss(pseudo_residuals):
    """Optimal leaf value for squared loss: mean of pseudo-residuals."""
    return np.mean(pseudo_residuals)

def optimal_leaf_absolute_loss(pseudo_residuals):
    """Optimal leaf value for absolute loss: median of pseudo-residuals."""
    return np.median(pseudo_residuals)

def optimal_leaf_log_loss_newton(y_true, current_predictions):
    """
    Optimal leaf value for log loss using Newton approximation.

    gamma = sum(y - p) / sum(p * (1 - p))
    """
    p = sigmoid(current_predictions)
    pseudo_residuals = y_true - p
    hessian_diag = p * (1 - p)
    # Regularization to avoid division by tiny values
    gamma = np.sum(pseudo_residuals) / (np.sum(hessian_diag) + 1e-10)
    return gamma

def optimal_leaf_log_loss_exact(y_true, current_predictions):
    """Optimal leaf value for log loss using numerical optimization."""
    def objective(gamma):
        F_new = current_predictions + gamma
        # Log loss: -y*F + log(1 + exp(F))
        loss = np.sum(-y_true * F_new
                      + np.log(1 + np.exp(np.clip(F_new, -500, 500))))
        return loss

    result = minimize_scalar(objective, bounds=(-10, 10), method='bounded')
    return result.x

# Comparison
np.random.seed(42)
n = 100
y_true = np.random.binomial(1, 0.6, n).astype(float)
current_preds = np.random.randn(n) * 0.5  # Current log-odds predictions

gamma_newton = optimal_leaf_log_loss_newton(y_true, current_preds)
gamma_exact = optimal_leaf_log_loss_exact(y_true, current_preds)

print(f"Newton approximation: γ = {gamma_newton:.4f}")
print(f"Exact optimization: γ = {gamma_exact:.4f}")
print(f"Difference: {abs(gamma_newton - gamma_exact):.6f}")
# Typically very close; the Newton approximation is sufficient
```

XGBoost uses a second-order Taylor expansion of the loss, computing both gradients $g_i$ and Hessians $h_i$. The optimal leaf value becomes $\gamma = -\sum g_i / (\sum h_i + \lambda)$, where $\lambda$ is the L2 regularization strength. This generalizes the Newton approximation and enables split gain computation that accounts for leaf values.
The maximum depth of base learner trees is one of the most critical hyperparameters in gradient boosting. It controls both the complexity of individual trees and the nature of the approximation to the gradient.
A depth-1 tree (a decision stump) makes a single split on one feature at one threshold, partitioning the input space into exactly two regions; it captures a main effect of a single feature and no interactions.
Gradient approximation: the gradient is approximated by a step function with only 2 values. This is very coarse, so many iterations are needed to build complex models.
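As a quick illustration (a sketch with made-up data, using scikit-learn's DecisionTreeRegressor as a stand-in base learner), the following fits a single stump to pseudo-residuals and confirms the resulting approximation takes only two distinct values.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)

# Pseudo-residuals after a constant initial model (for squared loss: plain residuals)
F0 = np.full_like(y, y.mean())
residuals = y - F0

# A depth-1 tree approximates these residuals with a two-valued step function
stump = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
print("Distinct predicted values:", np.unique(stump.predict(X)))  # exactly 2 values
print("Chosen split threshold:", stump.tree_.threshold[0])
```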
Trees of depth 3 to 6 are the optimal range for most tabular data problems and serve as the canonical "weak learner" for gradient boosting.
| Depth | Leaves | Interaction Order | Bias | Variance | Typical Use |
|---|---|---|---|---|---|
| 1 | 2 | None (main effects) | High | Very Low | AdaBoost, simple problems |
| 2 | 4 | 2-way | Moderate | Low | Conservative baseline |
| 3-4 | 8-16 | 3-4 way | Low | Moderate | Most common choice |
| 5-6 | 32-64 | 5-6 way | Very Low | High | Complex problems, needs regularization |
| 7+ | 128+ | High-order | Minimal | Very High | Rarely used, high overfitting risk |
There's a fundamental tradeoff: deeper trees require fewer iterations but risk overfitting. Shallower trees need more iterations but generalize better. In practice, combining shallow trees (depth 3-6) with many iterations (100-1000+) and a small learning rate typically achieves the best results.
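The sketch below makes the tradeoff visible by comparing a few depth/iteration combinations at a fixed small learning rate. It uses scikit-learn's GradientBoostingRegressor on a synthetic Friedman dataset; the specific settings are illustrative assumptions, not prescribed values.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_friedman1(n_samples=600, noise=1.0, random_state=0)

# Same learning rate; trade tree depth against number of boosting iterations
settings = [
    {"max_depth": 1, "n_estimators": 1000},  # stumps: need many iterations
    {"max_depth": 3, "n_estimators": 300},   # common sweet spot
    {"max_depth": 8, "n_estimators": 50},    # deep trees: few iterations, higher variance
]

for params in settings:
    model = GradientBoostingRegressor(learning_rate=0.05, random_state=0, **params)
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    rmse = np.sqrt(-scores)
    print(f"depth={params['max_depth']:>2}, trees={params['n_estimators']:>4}: "
          f"CV RMSE = {rmse.mean():.3f} ± {rmse.std():.3f}")
```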
Beyond depth constraints, several regularization techniques apply to individual trees within a gradient boosting ensemble.
Minimum samples per leaf (min_samples_leaf) requires each leaf to contain at least $k$ samples. Effects: leaf values are estimated from more samples and are therefore less noisy, and the tree cannot carve out tiny leaves that chase individual outliers.
Typical values: 1-20 for gradient boosting (lower than standalone trees because the ensemble averages out noise).
Minimum samples to split (min_samples_split) requires at least $k$ samples in a node before it may be split; a node with fewer samples becomes a leaf. The effect is similar to min_samples_leaf but applied earlier in tree construction.
A minimum impurity decrease (minimum gain threshold) requires splits to reduce impurity by at least $\delta$, preventing splits with only marginal gain:
$$\text{Gain}(\text{split}) \geq \delta \cdot |I_{\text{parent}}|$$
Pruning weak splits reduces complexity without affecting high-value splits.
Feature subsampling (max_features): at each split, only a random subset of features is considered as split candidates. This decorrelates the trees in the ensemble and speeds up split finding.
Typical values: sqrt(n_features) or n_features (all features).
```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
import numpy as np

# Generate synthetic data with some noise
X, y = make_regression(n_samples=500, n_features=20, noise=20, random_state=42)

# Compare different regularization settings
configs = [
    {"max_depth": 3, "min_samples_leaf": 1, "name": "No regularization"},
    {"max_depth": 3, "min_samples_leaf": 10, "name": "min_samples_leaf=10"},
    {"max_depth": 5, "min_samples_leaf": 1, "name": "Deeper trees (depth=5)"},
    {"max_depth": 5, "min_samples_leaf": 20, "name": "Deeper + regularized"},
    {"max_depth": 3, "max_features": 0.5, "min_samples_leaf": 1, "name": "50% features"},
]

print("Cross-validation scores for different tree regularization:")
print("-" * 60)

for config in configs:
    name = config.pop("name")
    gbm = GradientBoostingRegressor(
        n_estimators=100, learning_rate=0.1, random_state=42, **config
    )
    scores = cross_val_score(gbm, X, y, cv=5, scoring='neg_mean_squared_error')
    rmse_per_fold = np.sqrt(-scores)  # Convert negative-MSE scores to per-fold RMSE
    print(f"{name:40s}: RMSE = {rmse_per_fold.mean():.2f} ± {rmse_per_fold.std():.2f}")

# Output (example):
# No regularization                       : RMSE = 21.45 ± 1.32
# min_samples_leaf=10                     : RMSE = 20.89 ± 1.28
# Deeper trees (depth=5)                  : RMSE = 22.31 ± 1.41
# Deeper + regularized                    : RMSE = 20.12 ± 1.15
# 50% features                            : RMSE = 20.67 ± 1.24
```

Modern implementations add explicit regularization to the objective: $\lambda$ (L2 on leaf weights) and $\alpha$ (L1 on leaf weights). These directly penalize complex leaf values, providing smoother predictions and better generalization. The regularized split gain becomes $$\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma,$$ where $\gamma$ penalizes each additional leaf.
How trees grow—the order in which nodes are expanded—significantly impacts the resulting structure and generalization.
Used by: XGBoost (default), standard CART
All nodes at depth $d$ are split before any node at depth $d+1$. The tree grows layer by layer.
Advantages: produces balanced trees with predictable size and memory use, is straightforward to parallelize level by level, and avoids growing very deep branches that overfit.
Disadvantages: splits every node at a level, including nodes whose best split yields little gain, which wastes computation and can add low-value structure.
Used by: LightGBM
Always split the leaf with the highest gain, regardless of depth. The tree grows wherever improvement is greatest.
Advantages: for the same number of leaves, typically achieves a lower training loss than level-wise growth, because each split is spent where it reduces the loss the most.
Disadvantages: trees can become deep and unbalanced, increasing overfitting risk on small datasets; constraints such as a leaf-count limit (num_leaves) and a maximum depth are needed to keep growth in check.
Used by: CatBoost
All leaves at the same level use the same split (feature and threshold). This creates perfectly balanced 'oblivious' trees.
Structure: At depth $d$, there are exactly $2^d$ leaves, and each path from root to leaf makes the same sequence of feature comparisons.
Advantages: prediction is extremely fast, since a sample's leaf is found by evaluating the $d$ shared comparisons and combining the results into an index (see the sketch below); the rigid structure also acts as a regularizer and vectorizes well on CPUs and GPUs.
Disadvantages: each individual tree is less expressive, because every node at a level must reuse the same split, so more trees may be needed to reach the same accuracy.
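To see why the symmetric structure makes prediction so fast, here is a minimal sketch in plain NumPy (illustrative only, not CatBoost's actual code): the $d$ shared comparisons are evaluated once per sample and combined into a leaf index, so prediction reduces to a few vectorized operations plus a table lookup.

```python
import numpy as np

def predict_oblivious(X, split_features, split_thresholds, leaf_values):
    """Predict with a symmetric (oblivious) tree of depth d.

    split_features, split_thresholds: length-d arrays, one shared split per level.
    leaf_values: array of 2**d leaf values, indexed by the bit pattern of comparisons.
    """
    d = len(split_features)
    leaf_index = np.zeros(X.shape[0], dtype=np.int64)
    for level in range(d):
        went_right = X[:, split_features[level]] > split_thresholds[level]
        leaf_index = (leaf_index << 1) | went_right.astype(np.int64)
    return leaf_values[leaf_index]

# Tiny example: depth-2 oblivious tree over 2 features (made-up splits and values)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
preds = predict_oblivious(
    X,
    split_features=np.array([0, 1]),
    split_thresholds=np.array([0.0, 0.5]),
    leaf_values=np.array([-1.0, -0.2, 0.3, 1.1]),  # 2**2 = 4 leaves
)
print(preds)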
For most tabular data problems, LightGBM's leaf-wise growth with num_leaves=31-127 provides an excellent balance of accuracy and speed. XGBoost's level-wise growth with max_depth=6 is a robust alternative. CatBoost's symmetric trees excel when prediction speed is critical or when using GPUs.
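As a rough sketch of how these strategies surface in configuration (assuming the xgboost, lightgbm, and catboost packages are installed; the values are illustrative starting points, not tuned recommendations):

```python
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostRegressor

# XGBoost: level-wise growth by default, depth-limited trees
xgb_model = xgb.XGBRegressor(max_depth=6, n_estimators=300, learning_rate=0.05)

# LightGBM: leaf-wise growth, controlled primarily by the number of leaves
lgb_model = lgb.LGBMRegressor(num_leaves=63, max_depth=-1, n_estimators=300,
                              learning_rate=0.05)

# CatBoost: symmetric (oblivious) trees, controlled by depth
cat_model = CatBoostRegressor(depth=6, iterations=300, learning_rate=0.05, verbose=0)
```

The key contrast: LightGBM caps the number of leaves rather than the depth, while XGBoost's default growth policy and CatBoost cap depth directly.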
Let's implement a complete base learner fitting procedure for gradient boosting, incorporating tree construction and leaf value optimization.
```python
import numpy as np

class BoostingTreeNode:
    """A node in a gradient boosting tree."""
    def __init__(self):
        self.feature_idx = None
        self.threshold = None
        self.left = None
        self.right = None
        self.value = None      # Leaf value (only if leaf)
        self.is_leaf = False

class BoostingTree:
    """
    Decision tree optimized for gradient boosting.

    Fits to pseudo-residuals with optional leaf value optimization
    for different loss functions.
    """

    def __init__(self, max_depth=3, min_samples_leaf=1,
                 min_samples_split=2, loss='squared'):
        self.max_depth = max_depth
        self.min_samples_leaf = min_samples_leaf
        self.min_samples_split = min_samples_split
        self.loss = loss
        self.root = None

    def _compute_gain(self, left_sum, left_count, right_sum, right_count):
        """Compute variance reduction gain for a split."""
        if left_count < self.min_samples_leaf or right_count < self.min_samples_leaf:
            return -np.inf

        total_count = left_count + right_count
        total_sum = left_sum + right_sum

        gain = (left_sum ** 2 / left_count
                + right_sum ** 2 / right_count
                - total_sum ** 2 / total_count)
        return gain / total_count

    def _find_best_split(self, X, residuals, sample_indices):
        """Find the best split for a node."""
        n_samples = len(sample_indices)
        n_features = X.shape[1]

        best_gain = -np.inf
        best_feature = None
        best_threshold = None

        for feat_idx in range(n_features):
            # Get feature values for samples in this node
            feature_values = X[sample_indices, feat_idx]
            res_values = residuals[sample_indices]

            # Sort by feature
            sorted_order = np.argsort(feature_values)
            sorted_features = feature_values[sorted_order]
            sorted_residuals = res_values[sorted_order]

            # Scan thresholds
            left_sum = 0.0
            left_count = 0
            total_sum = np.sum(sorted_residuals)

            for i in range(n_samples - 1):
                left_sum += sorted_residuals[i]
                left_count += 1

                # Skip if same value as next (can't split)
                if sorted_features[i] == sorted_features[i + 1]:
                    continue

                right_sum = total_sum - left_sum
                right_count = n_samples - left_count

                gain = self._compute_gain(left_sum, left_count,
                                          right_sum, right_count)

                if gain > best_gain:
                    best_gain = gain
                    best_feature = feat_idx
                    best_threshold = (sorted_features[i] + sorted_features[i + 1]) / 2

        return best_feature, best_threshold, best_gain

    def _compute_leaf_value(self, residuals, indices, y_true=None, current_preds=None):
        """Compute optimal leaf value based on loss function."""
        res = residuals[indices]

        if self.loss == 'squared':
            return np.mean(res)
        elif self.loss == 'absolute':
            return np.median(res)
        elif self.loss == 'log' and y_true is not None:
            # Newton approximation for log loss
            y = y_true[indices]
            p = 1.0 / (1.0 + np.exp(-current_preds[indices]))
            gradient = y - p
            hessian = p * (1 - p) + 1e-10
            return np.sum(gradient) / np.sum(hessian)
        else:
            return np.mean(res)

    def _build_tree(self, X, residuals, sample_indices, depth,
                    y_true=None, current_preds=None):
        """Recursively build the tree."""
        node = BoostingTreeNode()
        n_samples = len(sample_indices)

        # Check stopping conditions
        if depth >= self.max_depth or n_samples < self.min_samples_split:
            node.is_leaf = True
            node.value = self._compute_leaf_value(
                residuals, sample_indices, y_true, current_preds)
            return node

        # Find best split
        feat, thresh, gain = self._find_best_split(X, residuals, sample_indices)

        if feat is None or gain <= 0:
            node.is_leaf = True
            node.value = self._compute_leaf_value(
                residuals, sample_indices, y_true, current_preds)
            return node

        # Create split
        node.feature_idx = feat
        node.threshold = thresh

        left_mask = X[sample_indices, feat] <= thresh
        left_indices = sample_indices[left_mask]
        right_indices = sample_indices[~left_mask]

        # Check minimum samples constraint
        if len(left_indices) < self.min_samples_leaf or \
           len(right_indices) < self.min_samples_leaf:
            node.is_leaf = True
            node.value = self._compute_leaf_value(
                residuals, sample_indices, y_true, current_preds)
            return node

        # Recursively build children
        node.left = self._build_tree(X, residuals, left_indices, depth + 1,
                                     y_true, current_preds)
        node.right = self._build_tree(X, residuals, right_indices, depth + 1,
                                      y_true, current_preds)

        return node

    def fit(self, X, residuals, y_true=None, current_preds=None):
        """Fit tree to pseudo-residuals."""
        sample_indices = np.arange(X.shape[0])
        self.root = self._build_tree(X, residuals, sample_indices, 0,
                                     y_true, current_preds)
        return self

    def _predict_single(self, x, node):
        """Predict for a single sample."""
        if node.is_leaf:
            return node.value
        if x[node.feature_idx] <= node.threshold:
            return self._predict_single(x, node.left)
        else:
            return self._predict_single(x, node.right)

    def predict(self, X):
        """Predict for multiple samples."""
        return np.array([self._predict_single(x, self.root) for x in X])

# Test the implementation
np.random.seed(42)
X = np.random.randn(200, 5)
y_true = 2 * X[:, 0] + X[:, 1] ** 2 + np.random.randn(200) * 0.5

# Simulate first boosting iteration
current_preds = np.full(len(y_true), np.mean(y_true))
residuals = y_true - current_preds

tree = BoostingTree(max_depth=3, loss='squared')
tree.fit(X, residuals)

predictions = tree.predict(X)
print(f"Residual variance before tree: {np.var(residuals):.4f}")
print(f"Residual variance after tree:  {np.var(residuals - predictions):.4f}")
```

Real implementations (XGBoost, LightGBM) include many additional optimizations: histogram binning, cache-aware access patterns, multi-threading, GPU acceleration, native handling of missing values, and more. The above code illustrates the core algorithm; production code adds 10-100x speedup through engineering.
We have thoroughly explored how base learners, typically shallow decision trees, are fitted to pseudo-residuals in gradient boosting. The key takeaways: decision trees dominate as base learners because they are flexible, fast to train, scale-invariant, and perform automatic feature selection. Each tree is fitted to pseudo-residuals by greedy variance-reduction splits, and its leaf values are then optimized for the actual loss (mean for squared loss, median for absolute loss, a Newton step for log loss). Tree depth controls the interaction order and the bias-variance tradeoff, with shallow trees of depth 3-6 plus many iterations working best in practice. Further regularization comes from minimum samples per leaf, minimum gain thresholds, and feature subsampling, and the growth strategy (level-wise, leaf-wise, or symmetric) trades accuracy against speed and overfitting risk.
With base learner fitting understood, we next examine the learning rate (step size)—the shrinkage parameter that scales each tree's contribution. We'll see how it provides crucial regularization, affects the optimization trajectory, and interacts with the number of boosting iterations to control generalization.
You now understand base learner fitting at a fundamental level—how trees are constructed, how leaf values are optimized, and how tree complexity is controlled. This knowledge is essential for tuning gradient boosting models and understanding why certain hyperparameter choices work better than others.