Imagine you're an artist creating a portrait through layered glazes. You don't attempt to capture every detail in the first layer; instead, you build the image incrementally. The first glaze establishes broad tones. The second refines shadows. Each subsequent layer addresses what previous layers missed.
This is precisely how stage-wise optimization builds boosted ensembles. We don't attempt to find the globally optimal combination of M base learners—a computationally intractable problem. Instead, we construct the ensemble one term at a time, each new learner targeting what the current ensemble gets wrong.
This greedy approach, known as forward stagewise additive modeling (FSAM), is the algorithmic heart of gradient boosting.
This page develops the complete algorithm for stage-wise optimization: how to frame the optimization problem at each stage, how to solve it for specific loss functions, and how this connects to iteratively fitting base learners to gradients. You'll understand the exact mechanics of how gradient boosting ensembles are constructed.
Problem Setup:
We want to build an additive model: $$F_M(x) = \sum_{m=1}^{M} \beta_m h(x; \theta_m)$$
that minimizes the empirical loss: $$L(F) = \sum_{i=1}^{n} \ell(y_i, F(x_i))$$
where h(x; θ) is a base learner parameterized by θ (e.g., a decision tree with structure θ), βₘ are the weights, and ℓ is the loss function.
Joint Optimization (Intractable):
The ideal approach would solve: $$\{\beta_m^*, \theta_m^*\}_{m=1}^{M} = \arg\min_{\{\beta_m, \theta_m\}} \sum_{i=1}^{n} \ell\left(y_i, \sum_{m=1}^{M} \beta_m h(x_i; \theta_m)\right)$$
This is generally intractable: the base-learner parameters θₘ (e.g., tree structures) are discrete, the joint search over all M learners grows combinatorially, and the objective is non-convex in the full parameter set.
Stagewise Solution:
Instead, we build the model incrementally:
Stage 0: Initialize F₀(x) = γ₀ (often a constant, like the mean for regression)
Stage m (for m = 1, 2, ..., M):
Compute the optimal base learner to add: $$(\beta_m, \theta_m) = \arg\min_{\beta, \theta} \sum_{i=1}^{n} \ell(y_i, F_{m-1}(x_i) + \beta \cdot h(x_i; \theta))$$
Update the ensemble: $$F_m(x) = F_{m-1}(x) + \beta_m h(x; \theta_m)$$
Key Property: Once added, a base learner is never modified. We only ever add new terms.
Stagewise optimization finds a local, not global, solution. The base learner chosen at stage 1 might not be in the globally optimal ensemble. However, with many stages and regularization (shrinkage), this greedy approach achieves excellent results in practice—often matching or exceeding more sophisticated optimization methods.
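Before specializing to particular losses, here is a minimal sketch of this skeleton. The helpers `fit_base_learner` and `find_step` are placeholders for the stage-m subproblem developed next; only the overall loop structure is the point here.

```python
import numpy as np

def forward_stagewise(X, y, init_constant, fit_base_learner, find_step, n_stages):
    """Skeleton of forward stagewise additive modeling.

    fit_base_learner and find_step are placeholders for the loss-specific
    stage-m subproblem; concrete versions appear in the cases below.
    """
    F = np.full(len(y), init_constant)   # current predictions F_{m-1}(x_i)
    ensemble = []                        # (beta_m, h_m) pairs, never revisited

    for m in range(n_stages):
        h = fit_base_learner(X, y, F)    # choose theta_m given the current ensemble
        h_pred = h.predict(X)
        beta = find_step(y, F, h_pred)   # choose beta_m for the fitted h_m
        F = F + beta * h_pred            # update predictions
        ensemble.append((beta, h))       # earlier terms stay fixed

    return init_constant, ensemble
```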
At each stage m, we must solve:
$$(\beta_m, \theta_m) = \arg\min_{\beta, \theta} \sum_{i=1}^{n} \ell(y_i, F_{m-1}(x_i) + \beta \cdot h(x_i; \theta))$$
This is still a challenging optimization problem. Let's see how it's approached for different losses.
Case 1: Squared Error Loss (L2 Boosting)
For ℓ(y, F) = ½(y - F)², the stage-m problem becomes:
$$\min_{\beta, \theta} \sum_{i=1}^{n} \frac{1}{2}(y_i - F_{m-1}(x_i) - \beta \cdot h(x_i; \theta))^2$$
Define the residuals: rᵢ = yᵢ - Fₘ₋₁(xᵢ)
The problem becomes: $$\min_{\beta, \theta} \sum_{i=1}^{n} (r_i - \beta \cdot h(x_i; \theta))^2$$
Key Insight: We're fitting the base learner to the residuals! For β = 1, this is exactly least-squares regression of h on r. For optimal β, we can first find optimal θ by fitting to residuals, then optimize β by line search.
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def l2_boosting(X, y, n_stages=100, learning_rate=0.1, max_depth=3):
    """
    L2 Boosting (Gradient Boosting with squared error loss).

    At each stage:
        1. Compute residuals: r_i = y_i - F_{m-1}(x_i)
        2. Fit base learner h to residuals
        3. Update: F_m = F_{m-1} + learning_rate * h

    For squared error, the negative gradient IS the residual,
    so fitting to residuals = fitting to negative gradients.
    """
    n_samples = len(y)

    # Stage 0: Initialize with constant (mean)
    F = np.full(n_samples, np.mean(y))
    initial_pred = np.mean(y)

    base_learners = []

    for m in range(n_stages):
        # Step 1: Compute residuals (= negative gradient for L2 loss)
        residuals = y - F

        # Step 2: Fit base learner to residuals
        # This solves: argmin_theta sum_i (r_i - h(x_i; theta))^2
        h = DecisionTreeRegressor(max_depth=max_depth)
        h.fit(X, residuals)

        # Step 3: Compute predictions and update
        # With shrinkage (learning_rate < 1), we dampen each step
        h_predictions = h.predict(X)

        # Optional: line search for optimal step size
        # For L2, optimal beta = sum(r_i * h(x_i)) / sum(h(x_i)^2)
        # But we typically just use learning_rate for regularization
        F = F + learning_rate * h_predictions

        base_learners.append(h)

        # Monitor training loss
        if m % 20 == 0:
            train_loss = 0.5 * np.mean((y - F) ** 2)
            print(f"Stage {m}: Training MSE = {train_loss:.4f}")

    return base_learners, initial_pred, learning_rate


def predict_l2_boost(X, base_learners, initial_pred, learning_rate):
    """Predict using the trained L2 boosting ensemble."""
    predictions = np.full(len(X), initial_pred)
    for h in base_learners:
        predictions += learning_rate * h.predict(X)
    return predictions
```

Case 2: Exponential Loss (AdaBoost)
For ℓ(y, F) = exp(-y·F) with y ∈ {-1, +1}, the stage-m problem is:
$$\min_{\beta, \theta} \sum_{i=1}^{n} \exp(-y_i(F_{m-1}(x_i) + \beta \cdot h(x_i; \theta)))$$
Let wᵢ⁽ᵐ⁾ = exp(-yᵢFₘ₋₁(xᵢ)). These are the sample weights. The problem becomes:
$$\min_{\beta, \theta} \sum_{i=1}^{n} w_i^{(m)} \exp(-\beta y_i h(x_i; \theta))$$
Solution approach:
For any β > 0, the optimal θ minimizes weighted classification error: $$\theta_m = \arg\min_\theta \sum_{i=1}^{n} w_i^{(m)} \mathbf{1}(y_i \neq h(x_i; \theta))$$
Given θₘ, the optimal β is: $$\beta_m = \frac{1}{2} \log\left(\frac{1 - \epsilon_m}{\epsilon_m}\right)$$
where εₘ is the weighted error rate of hₘ.
This is exactly the AdaBoost algorithm! Stagewise optimization of exponential loss recovers AdaBoost.
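As a sanity check of this equivalence, here is a minimal sketch of stagewise exponential-loss boosting with decision stumps. Labels are assumed to be in {-1, +1}, and stumps come from sklearn's DecisionTreeClassifier (which accepts sample weights); the function names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def adaboost_via_stagewise(X, y, n_stages=50):
    """Stagewise minimization of exponential loss; y must be in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)          # w_i^(m) proportional to exp(-y_i F_{m-1}(x_i))
    stumps, betas = [], []

    for m in range(n_stages):
        # theta_m: minimize weighted classification error
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)

        # Weighted error and closed-form beta_m = 0.5 * log((1 - eps) / eps)
        eps = np.sum(w[pred != y]) / np.sum(w)
        eps = np.clip(eps, 1e-10, 1 - 1e-10)
        beta = 0.5 * np.log((1 - eps) / eps)

        # Updating F_m updates the weights: w_i <- w_i * exp(-beta * y_i * h(x_i))
        w = w * np.exp(-beta * y * pred)
        w = w / np.sum(w)

        stumps.append(stump)
        betas.append(beta)

    return stumps, betas


def adaboost_predict(X, stumps, betas):
    F = np.zeros(len(X))
    for stump, beta in zip(stumps, betas):
        F += beta * stump.predict(X)
    return np.sign(F)
```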
The magic of gradient boosting is that we can solve the stage-m optimization problem approximately using a simple strategy: fit the base learner to the negative gradient.
Key Insight:
Instead of directly solving: $$(\beta_m, \theta_m) = \arg\min_{\beta, \theta} \sum_{i=1}^{n} \ell(y_i, F_{m-1}(x_i) + \beta \cdot h(x_i; \theta))$$
We instead: (1) compute the negative gradient of the loss at the current predictions Fₘ₋₁(xᵢ), the pseudo-residuals; (2) fit the base learner to these pseudo-residuals by least squares; and (3) choose the step size separately, by line search or a fixed learning rate.
Why This Works:
The Taylor expansion of L(Fₘ₋₁ + βh) around Fₘ₋₁ is: $$L(F_{m-1} + \beta h) \approx L(F_{m-1}) + \beta \sum_{i=1}^{n} \frac{\partial \ell(y_i, F_{m-1}(x_i))}{\partial F} \cdot h(x_i)$$
To decrease loss, we want the inner product between h and ∂ℓ/∂F to be negative (indicating we're moving against the gradient). The h that achieves this best is the one that points in the direction of -∂ℓ/∂F.
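One way to make "points in the direction of −∂ℓ/∂F" precise: writing gᵢ for the gradient ∂ℓ(yᵢ, Fₘ₋₁(xᵢ))/∂F, fitting h to the negative gradients by least squares trades the alignment term from the expansion above against the size of h, since

$$\sum_{i=1}^{n} \big(-g_i - h(x_i)\big)^2 = \sum_{i=1}^{n} g_i^2 + 2\sum_{i=1}^{n} g_i\, h(x_i) + \sum_{i=1}^{n} h(x_i)^2.$$

Minimizing the left-hand side over h ∈ ℋ therefore makes Σᵢ gᵢ h(xᵢ) as negative as possible while keeping Σᵢ h(xᵢ)² controlled: the steepest descent direction achievable within the base learner class.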
| Loss Function | ℓ(y, F) | −∂ℓ/∂F (Negative Gradient) |
|---|---|---|
| Squared Error | ½(y - F)² | y - F (residual) |
| Absolute Error | \|y - F\| | sign(y - F) |
| Huber Loss | Huber_δ(y - F) | Clipped residual |
| Logistic (y ∈ {0,1}) | -y·log(p) - (1-y)·log(1-p) | y - p where p = sigmoid(F) |
| Exponential (y ∈ {-1,+1}) | exp(-y·F) | y·exp(-y·F) |
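A minimal sketch of these negative gradients as vectorized NumPy helpers. The function names and the Huber threshold `delta` are illustrative choices, not part of any library API; label conventions follow the table above.

```python
import numpy as np

def neg_grad_squared(y, F):
    return y - F                                  # the ordinary residual

def neg_grad_absolute(y, F):
    return np.sign(y - F)

def neg_grad_huber(y, F, delta=1.0):
    r = y - F
    return np.where(np.abs(r) <= delta, r, delta * np.sign(r))   # clipped residual

def neg_grad_logistic(y, F):                      # y in {0, 1}
    p = 1.0 / (1.0 + np.exp(-F))
    return y - p

def neg_grad_exponential(y, F):                   # y in {-1, +1}
    return y * np.exp(-y * F)
```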
We can now state the complete gradient boosting algorithm, synthesizing everything we've developed:
Algorithm: Gradient Boosting
Input: Training data {(xᵢ, yᵢ)}ⁿᵢ₌₁, differentiable loss function ℓ(y, F), number of stages M, learning rate ν, base learner class ℋ
Initialize: F₀(x) = argmin_γ Σᵢ ℓ(yᵢ, γ) (constant prediction)
For m = 1 to M:
Compute negative gradients (pseudo-residuals): $$r_{im} = -\frac{\partial \ell(y_i, F(x_i))}{\partial F(x_i)}\bigg|_{F=F_{m-1}}$$
Fit base learner to pseudo-residuals: $$h_m = \arg\min_{h \in \mathcal{H}} \sum_{i=1}^{n} (r_{im} - h(x_i))^2$$
(Optional) Line search for optimal step: $$\rho_m = \arg\min_\rho \sum_{i=1}^{n} \ell(y_i, F_{m-1}(x_i) + \rho \cdot h_m(x_i))$$
Update model with shrinkage: $$F_m(x) = F_{m-1}(x) + \nu \cdot \rho_m \cdot h_m(x)$$
Output: Fₘ(x)
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from scipy.optimize import minimize_scalar


class GradientBoosting:
    """
    Gradient Boosting with any differentiable loss function.

    This implementation follows the exact algorithm:
        1. Initialize with optimal constant
        2. For each stage:
            a. Compute negative gradient (pseudo-residuals)
            b. Fit tree to pseudo-residuals
            c. Line search for optimal step size
            d. Update with shrinkage
    """

    def __init__(self, loss, loss_gradient, n_estimators=100,
                 learning_rate=0.1, max_depth=3):
        self.loss = loss                    # loss(y, F) -> scalar
        self.loss_gradient = loss_gradient  # d(loss)/dF at each point
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.trees = []
        self.rhos = []
        self.F0 = None

    def _init_prediction(self, y):
        """Find optimal constant prediction F_0."""
        # For many losses, this is the mean (squared) or median (absolute)
        # Here we do a simple line search
        result = minimize_scalar(
            lambda gamma: np.sum([self.loss(yi, gamma) for yi in y])
        )
        return result.x

    def fit(self, X, y):
        n_samples = len(y)

        # Step 0: Initialize F_0
        self.F0 = self._init_prediction(y)
        F = np.full(n_samples, self.F0)

        for m in range(self.n_estimators):
            # Step 1: Compute negative gradient (pseudo-residuals)
            # r_i = -d loss(y_i, F_i) / dF_i
            pseudo_residuals = -self.loss_gradient(y, F)

            # Step 2: Fit regression tree to pseudo-residuals
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, pseudo_residuals)

            # Step 3: Line search for optimal step size rho
            h = tree.predict(X)
            result = minimize_scalar(
                lambda rho: sum(self.loss(y[i], F[i] + rho * h[i])
                                for i in range(n_samples)),
                bounds=(0, 2), method='bounded'
            )
            rho = result.x

            # Step 4: Update with shrinkage
            F = F + self.learning_rate * rho * h
            self.trees.append(tree)
            self.rhos.append(self.learning_rate * rho)

            if m % 20 == 0:
                total_loss = sum(self.loss(y[i], F[i]) for i in range(n_samples))
                print(f"Stage {m}: Loss = {total_loss:.4f}")

        return self

    def predict(self, X):
        F = np.full(len(X), self.F0)
        for tree, rho in zip(self.trees, self.rhos):
            F += rho * tree.predict(X)
        return F


# Example: Gradient Boosting with Huber loss
delta = 1.0

def huber_loss(y, f):
    r = y - f
    if abs(r) <= delta:
        return 0.5 * r ** 2
    else:
        return delta * (abs(r) - 0.5 * delta)

def huber_gradient(y, F):
    """d(huber_loss)/dF for each sample."""
    r = y - F
    grad = np.where(np.abs(r) <= delta, -r, -delta * np.sign(r))
    return grad

# Usage:
# gb = GradientBoosting(huber_loss, huber_gradient, n_estimators=100)
# gb.fit(X_train, y_train)
# predictions = gb.predict(X_test)
```

The line search step—finding the optimal ρₘ—deserves careful attention. Different implementations handle this differently.
Why Line Search Matters:
The base learner hₘ gives us a direction to update. The step size ρₘ tells us how far to step. Too small steps waste computation; too large steps may overshoot and increase loss.
Simple Approach (No Line Search):
Many implementations skip explicit line search and simply use ρₘ = 1, relying on the learning rate ν for step size control: $$F_m = F_{m-1} + \nu \cdot h_m$$
This works reasonably well when ν is tuned appropriately (typically 0.01-0.3).
Global Line Search:
Find ρ minimizing total loss: $$\rho_m = \arg\min_\rho \sum_{i=1}^{n} \ell(y_i, F_{m-1}(x_i) + \rho \cdot h_m(x_i))$$
This is a 1D optimization problem, solvable by bisection, golden section, or Newton's method.
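For instance, when the loss is twice differentiable, a few Newton iterations on the one-dimensional function L(ρ) = Σᵢ ℓ(yᵢ, Fᵢ + ρ·hᵢ) usually suffice. The sketch below assumes you supply elementwise first and second derivatives of the loss with respect to F; the function and parameter names are illustrative.

```python
import numpy as np

def newton_line_search(y, F, h, loss_grad, loss_hess, n_iter=5):
    """Approximate argmin_rho sum_i loss(y_i, F_i + rho * h_i) by Newton's method.

    loss_grad / loss_hess: user-supplied elementwise d loss/dF and d^2 loss/dF^2.
    """
    rho = 0.0
    for _ in range(n_iter):
        F_new = F + rho * h
        g = np.sum(loss_grad(y, F_new) * h)        # dL/d(rho)
        H = np.sum(loss_hess(y, F_new) * h ** 2)   # d^2 L/d(rho)^2
        if H <= 0:                                  # stop if curvature is unhelpful
            break
        rho -= g / H
    return rho
```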
Per-Leaf Line Search (Recommended for Trees):
When using decision trees, we can do better. Each leaf j defines a region Rⱼ where hₘ is constant. Instead of one global ρ, we find optimal γⱼ for each leaf:
$$\gamma_{jm} = \arg\min_\gamma \sum_{x_i \in R_{jm}} \ell(y_i, F_{m-1}(x_i) + \gamma)$$
The tree prediction becomes: hₘ(x) = γⱼₘ for x ∈ Rⱼₘ
This is more flexible—different regions can have different step sizes—and often improves performance.
Closed-Form Solutions for Specific Losses: for squared error, the optimal γⱼ is the mean of the residuals in leaf j; for absolute error, it is their median; for logistic loss, a single Newton step, γⱼ = Σᵢ rᵢ / Σᵢ pᵢ(1 − pᵢ) over the leaf's samples, gives a good approximation.
XGBoost's Approach:
XGBoost uses second-order (Newton) approximation, incorporating both gradient and Hessian to find per-leaf values: $$\gamma_j = -\frac{\sum_{i \in R_j} g_i}{\sum_{i \in R_j} h_i + \lambda}$$
where gᵢ is the gradient, hᵢ is the Hessian, and λ is L2 regularization.
Per-leaf optimization (as in XGBoost, LightGBM) is generally superior to global line search. It allows different regions to take different step sizes based on local loss curvature. This is one reason modern boosting implementations outperform classic gradient boosting.
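A sketch of this per-leaf Newton computation, given a fitted sklearn tree and per-sample gradients gᵢ and Hessians hᵢ. The helper name and the regularization argument `lam` are illustrative, not XGBoost's actual API.

```python
import numpy as np

def newton_leaf_values(tree, X, grad, hess, lam=1.0):
    """Per-leaf values gamma_j = -sum(g_i) / (sum(h_i) + lambda) over samples in leaf j."""
    leaf_ids = tree.apply(X)                 # which leaf each training sample falls into
    values = {}
    for leaf in np.unique(leaf_ids):
        mask = leaf_ids == leaf
        values[leaf] = -np.sum(grad[mask]) / (np.sum(hess[mask]) + lam)
    return values

# Usage: replace the tree's raw predictions with these leaf values before updating F,
# e.g. h = np.array([values[l] for l in tree.apply(X)]).
```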
Gradient boosting applies to regression, binary classification, and multi-class problems. The framework is the same; only the loss function changes.
Regression:
The model output F(x) is the prediction directly. Common losses: squared error, absolute error, Huber, and quantile (pinball) loss.
Binary Classification:
The model outputs a log-odds (logit): F(x) = log(p/(1-p))
Probability is: p(y=1|x) = sigmoid(F(x)) = 1/(1 + exp(-F(x)))
Common losses: binomial deviance (logistic loss) and exponential loss (which recovers AdaBoost).
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def gradient_boosting_classifier(X, y, n_estimators=100, learning_rate=0.1, max_depth=3):
    """
    Gradient Boosting for binary classification (logistic loss).

    Model outputs log-odds F(x). Probability is sigmoid(F(x)).
    Labels y should be in {0, 1}.

    Key insight: even for classification, we fit REGRESSION trees
    to the (continuous) negative gradient values.
    """
    n_samples = len(y)

    # Initialize: F_0 = log(p_1 / p_0) = log(mean / (1 - mean))
    p_init = np.clip(np.mean(y), 1e-6, 1 - 1e-6)
    F = np.full(n_samples, np.log(p_init / (1 - p_init)))
    F0 = F[0]

    trees = []

    for m in range(n_estimators):
        # Current probabilities
        p = 1.0 / (1.0 + np.exp(-F))

        # Negative gradient of logistic loss: y - p
        # Intuition: push F up where y=1 (want higher p)
        #            push F down where y=0 (want lower p)
        pseudo_residuals = y - p

        # Fit regression tree to pseudo-residuals
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, pseudo_residuals)

        # Per-leaf optimization for logistic loss
        # For each leaf, solve: gamma = argmin sum log(1 + exp(-y_i(F_i + gamma)))
        # Approximate solution: gamma = sum(r_i) / sum(p_i * (1 - p_i))
        leaves = tree.apply(X)
        leaf_values = {}
        for leaf in np.unique(leaves):
            mask = leaves == leaf
            # Newton step for logistic loss
            numerator = np.sum(pseudo_residuals[mask])
            denominator = np.sum(p[mask] * (1 - p[mask]))
            leaf_values[leaf] = numerator / (denominator + 1e-8)

        # Update F using per-leaf values
        h = np.array([leaf_values[l] for l in leaves])
        F = F + learning_rate * h

        trees.append((tree, leaf_values))

        if m % 20 == 0:
            log_loss = -np.mean(y * np.log(p + 1e-8) + (1 - y) * np.log(1 - p + 1e-8))
            print(f"Stage {m}: Log Loss = {log_loss:.4f}")

    return trees, F0, learning_rate


def predict_proba(X, trees, F0, learning_rate):
    """Predict probabilities using the trained classifier."""
    F = np.full(len(X), F0)
    for tree, leaf_values in trees:
        leaves = tree.apply(X)
        h = np.array([leaf_values.get(l, 0) for l in leaves])
        F += learning_rate * h
    return 1.0 / (1.0 + np.exp(-F))
```

Multi-class Classification:
For K classes, we maintain K separate models F₁(x), ..., Fₖ(x).
Probabilities via softmax: $$p(y=k|x) = \frac{\exp(F_k(x))}{\sum_{j=1}^{K} \exp(F_j(x))}$$
At each boosting iteration, we compute the softmax probabilities, form per-class pseudo-residuals rᵢₖ = 1(yᵢ = k) − pₖ(xᵢ), and fit one regression tree per class to those residuals, as sketched below.
This is computationally expensive (K trees per iteration), which is why efficient implementations like LightGBM use tricks like gradient sampling and one-vs-all reductions.
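A sketch of one multi-class boosting iteration, assuming current scores F of shape (n_samples, K) and integer labels y in {0, ..., K-1}; the function name and signature are illustrative. Fitting each tree proceeds exactly as in the binary case.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def multiclass_boosting_step(X, y, F, K, learning_rate=0.1, max_depth=3):
    """One iteration of K-class gradient boosting with softmax/cross-entropy loss."""
    # Softmax probabilities from current scores (stabilized by subtracting the row max)
    Z = F - F.max(axis=1, keepdims=True)
    P = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)

    Y = np.eye(K)[y]                     # one-hot targets, shape (n_samples, K)
    residuals = Y - P                    # negative gradient of cross-entropy w.r.t. F_k

    trees = []
    for k in range(K):                   # one regression tree per class
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals[:, k])
        F[:, k] += learning_rate * tree.predict(X)
        trees.append(tree)
    return F, trees
```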
A natural question: does this stagewise procedure converge? If so, how fast?
Training Loss Convergence:
Under mild conditions, gradient boosting monotonically decreases the training loss. At each stage, we add a term that (at least slightly) improves the fit.
Theorem (Informal): If the base learner class ℋ can approximate arbitrary gradients (e.g., regression trees with enough depth), then gradient boosting converges to the minimum training loss as M → ∞.
Convergence Rate:
For convex loss functions, gradient descent typically achieves a suboptimality gap of O(1/T) after T iterations, improving to a linear (geometric) rate when the loss is strongly convex and smooth.
Similar rates apply to functional gradient descent (gradient boosting), though the constants depend on the base learner's approximation quality.
Convergence of training loss doesn't imply good generalization. Without regularization, gradient boosting will eventually achieve near-zero training error while vastly overfitting. Shrinkage, early stopping, and tree constraints are essential for test performance.
Role of Shrinkage in Convergence:
With shrinkage rate ν < 1, each step contributes less. This may seem to slow convergence, but:
Regularization effect: Smaller steps require more iterations, but the resulting path through function space is "smoother," leading to better generalization.
Implicit L2 penalty: Analysis shows that shrinkage approximates L2 regularization on the coefficient path.
Better local optima: The greedy stagewise procedure can get stuck in suboptimal configurations. Shrinkage allows more gradual exploration.
Practical Guidance:
| Learning Rate ν | Typical M | Training Speed | Generalization |
|---|---|---|---|
| 1.0 (no shrinkage) | 50-100 | Fastest | Often overfits |
| 0.3 | 100-300 | Fast | Good baseline |
| 0.1 | 300-1000 | Moderate | Often better |
| 0.05 | 500-2000 | Slower | Usually excellent |
| 0.01 | 1000-5000 | Slowest | Best if compute allows |
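One practical way to explore this tradeoff is to fix a large M, train at several learning rates, and pick the stopping stage on a held-out set. The sketch below uses scikit-learn's GradientBoostingRegressor and its staged_predict iterator on a synthetic dataset; the dataset and split are placeholders for your own data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

for nu in [1.0, 0.3, 0.1, 0.05]:
    gb = GradientBoostingRegressor(learning_rate=nu, n_estimators=1000, max_depth=3)
    gb.fit(X_train, y_train)
    # staged_predict yields predictions after each stage, so we can find the best M
    val_mse = [mean_squared_error(y_val, pred) for pred in gb.staged_predict(X_val)]
    best_m = int(np.argmin(val_mse)) + 1
    print(f"nu={nu:>4}: best M = {best_m:4d}, validation MSE = {min(val_mse):.2f}")
```

Smaller learning rates typically reach their best validation error at a larger M, consistent with the table above.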
We've developed the complete algorithmic picture of how gradient boosting constructs ensembles through stagewise optimization. Let's consolidate:

- Forward stagewise additive modeling builds the ensemble one term at a time; once added, a term is never modified.
- For squared error, each stage fits the base learner to residuals; for exponential loss, the same stagewise procedure recovers AdaBoost.
- For general differentiable losses, each stage fits the base learner to the negative gradient (pseudo-residuals), then chooses a step size by line search or per-leaf optimization.
- Shrinkage (the learning rate ν) trades more stages for smoother paths through function space and better generalization.
What's Next:
We've covered how to fit base learners at each stage using gradients, but we haven't fully explored the connection between the loss function and the resulting algorithm behavior. The next page examines loss function formulation in depth: how different losses lead to different boosting algorithms, what properties make a loss suitable for boosting, and how to design custom losses for specialized problems.
You now understand the stagewise optimization procedure that constructs gradient boosting ensembles. This greedy algorithm—fit to negative gradients, add with shrinkage, repeat—is the practical realization of functional gradient descent, transforming the theoretical framework into a trainable algorithm.