Imagine you're an artist creating a portrait through layered glazes. You don't attempt to capture every detail in the first layer; instead, you build the image incrementally. The first glaze establishes broad tones. The second refines shadows. Each subsequent layer addresses what previous layers missed.
This is precisely how stage-wise optimization builds boosted ensembles. We don't attempt to find the globally optimal combination of M base learners—a computationally intractable problem. Instead, we construct the ensemble one term at a time, each new learner targeting what the current ensemble gets wrong.
This greedy approach, known as forward stagewise additive modeling (FSAM), is the algorithmic heart of gradient boosting.
This page develops the complete algorithm for stage-wise optimization: how to frame the optimization problem at each stage, how to solve it for specific loss functions, and how this connects to iteratively fitting base learners to gradients. You'll understand the exact mechanics of how gradient boosting ensembles are constructed.
Problem Setup:
We want to build an additive model: $$F_M(x) = \sum_{m=1}^{M} \beta_m h(x; \theta_m)$$
that minimizes the empirical loss: $$L(F) = \sum_{i=1}^{n} \ell(y_i, F(x_i))$$
where h(x; θ) is a base learner parameterized by θ (e.g., a decision tree with structure θ), βₘ are the weights, and ℓ is the loss function.
Joint Optimization (Intractable):
The ideal approach would solve: $$\{\beta_m^*, \theta_m^*\}_{m=1}^{M} = \arg\min_{\{\beta_m, \theta_m\}} \sum_{i=1}^{n} \ell\left(y_i, \sum_{m=1}^{M} \beta_m h(x_i; \theta_m)\right)$$
This is generally intractable: the base-learner parameters θₘ (e.g., tree structures) are discrete, the joint search over all M learners grows combinatorially, and the objective is non-convex in the full parameter set.
Stagewise Solution:
Instead, we build the model incrementally:
Stage 0: Initialize F₀(x) = γ₀ (often a constant, like the mean for regression)
Stage m (for m = 1, 2, ..., M):
Compute the optimal base learner to add: $$(\beta_m, \theta_m) = \arg\min_{\beta, \theta} \sum_{i=1}^{n} \ell(y_i, F_{m-1}(x_i) + \beta \cdot h(x_i; \theta))$$
Update the ensemble: $$F_m(x) = F_{m-1}(x) + \beta_m h(x; \theta_m)$$
Key Property: Once added, a base learner is never modified. We only ever add new terms.
Stagewise optimization finds a local, not global, solution. The base learner chosen at stage 1 might not be in the globally optimal ensemble. However, with many stages and regularization (shrinkage), this greedy approach achieves excellent results in practice—often matching or exceeding more sophisticated optimization methods.
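Before specializing to particular losses, here is a minimal sketch of this skeleton. The helpers `fit_base_learner` and `find_step` are placeholders for the stage-m subproblem developed next; only the overall loop structure is the point here.

```python
import numpy as np

def forward_stagewise(X, y, init_constant, fit_base_learner, find_step, n_stages):
    """Skeleton of forward stagewise additive modeling.

    fit_base_learner and find_step are placeholders for the loss-specific
    stage-m subproblem; concrete versions appear in the cases below.
    """
    F = np.full(len(y), init_constant)   # current predictions F_{m-1}(x_i)
    ensemble = []                        # (beta_m, h_m) pairs, never revisited

    for m in range(n_stages):
        h = fit_base_learner(X, y, F)    # choose theta_m given the current ensemble
        h_pred = h.predict(X)
        beta = find_step(y, F, h_pred)   # choose beta_m for the fitted h_m
        F = F + beta * h_pred            # update predictions
        ensemble.append((beta, h))       # earlier terms stay fixed

    return init_constant, ensemble
```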
At each stage m, we must solve:
$$(\beta_m, \theta_m) = \arg\min_{\beta, \theta} \sum_{i=1}^{n} \ell(y_i, F_{m-1}(x_i) + \beta \cdot h(x_i; \theta))$$
This is still a challenging optimization problem. Let's see how it's approached for different losses.
Case 1: Squared Error Loss (L2 Boosting)
For ℓ(y, F) = ½(y - F)², the stage-m problem becomes:
$$\min_{\beta, \theta} \sum_{i=1}^{n} \frac{1}{2}(y_i - F_{m-1}(x_i) - \beta \cdot h(x_i; \theta))^2$$
Define the residuals: rᵢ = yᵢ - Fₘ₋₁(xᵢ)
The problem becomes: $$\min_{\beta, \theta} \sum_{i=1}^{n} (r_i - \beta \cdot h(x_i; \theta))^2$$
Key Insight: We're fitting the base learner to the residuals! For β = 1, this is exactly least-squares regression of h on r. For optimal β, we can first find optimal θ by fitting to residuals, then optimize β by line search.
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def l2_boosting(X, y, n_stages=100, learning_rate=0.1, max_depth=3):
    """
    L2 Boosting (Gradient Boosting with squared error loss).

    At each stage:
        1. Compute residuals: r_i = y_i - F_{m-1}(x_i)
        2. Fit base learner h to residuals
        3. Update: F_m = F_{m-1} + learning_rate * h

    For squared error, the negative gradient IS the residual,
    so fitting to residuals = fitting to negative gradients.
    """
    n_samples = len(y)

    # Stage 0: Initialize with constant (mean)
    F = np.full(n_samples, np.mean(y))
    initial_pred = np.mean(y)

    base_learners = []

    for m in range(n_stages):
        # Step 1: Compute residuals (= negative gradient for L2 loss)
        residuals = y - F

        # Step 2: Fit base learner to residuals
        # This solves: argmin_theta sum_i (r_i - h(x_i; theta))^2
        h = DecisionTreeRegressor(max_depth=max_depth)
        h.fit(X, residuals)

        # Step 3: Compute predictions and update
        # With shrinkage (learning_rate < 1), we dampen each step
        h_predictions = h.predict(X)

        # Optional: line search for optimal step size
        # For L2, optimal beta = sum(r_i * h(x_i)) / sum(h(x_i)^2)
        # But we typically just use learning_rate for regularization
        F = F + learning_rate * h_predictions

        base_learners.append(h)

        # Monitor training loss
        if m % 20 == 0:
            train_loss = 0.5 * np.mean((y - F) ** 2)
            print(f"Stage {m}: Training MSE = {train_loss:.4f}")

    return base_learners, initial_pred, learning_rate


def predict_l2_boost(X, base_learners, initial_pred, learning_rate):
    """Predict using the trained L2 boosting ensemble."""
    predictions = np.full(len(X), initial_pred)
    for h in base_learners:
        predictions += learning_rate * h.predict(X)
    return predictions
```

Case 2: Exponential Loss (AdaBoost)
For ℓ(y, F) = exp(-y·F) with y ∈ {-1, +1}, the stage-m problem is:
$$\min_{\beta, \theta} \sum_{i=1}^{n} \exp(-y_i(F_{m-1}(x_i) + \beta \cdot h(x_i; \theta)))$$
Let wᵢ⁽ᵐ⁾ = exp(-yᵢFₘ₋₁(xᵢ)). These are the sample weights. The problem becomes:
$$\min_{\beta, \theta} \sum_{i=1}^{n} w_i^{(m)} \exp(-\beta y_i h(x_i; \theta))$$
Solution approach:
For any β > 0, the optimal θ minimizes weighted classification error: $$\theta_m = \arg\min_\theta \sum_{i=1}^{n} w_i^{(m)} \mathbf{1}(y_i \neq h(x_i; \theta))$$
Given θₘ, the optimal β is: $$\beta_m = \frac{1}{2} \log\left(\frac{1 - \epsilon_m}{\epsilon_m}\right)$$
where εₘ is the weighted error rate of hₘ.
This is exactly the AdaBoost algorithm! Stagewise optimization of exponential loss recovers AdaBoost.
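As a sanity check of this equivalence, here is a minimal sketch of stagewise exponential-loss boosting with decision stumps. Labels are assumed to be in {-1, +1}, and stumps come from sklearn's DecisionTreeClassifier (which accepts sample weights); the function names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def adaboost_via_stagewise(X, y, n_stages=50):
    """Stagewise minimization of exponential loss; y must be in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)          # w_i^(m) proportional to exp(-y_i F_{m-1}(x_i))
    stumps, betas = [], []

    for m in range(n_stages):
        # theta_m: minimize weighted classification error
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)

        # Weighted error and closed-form beta_m = 0.5 * log((1 - eps) / eps)
        eps = np.sum(w[pred != y]) / np.sum(w)
        eps = np.clip(eps, 1e-10, 1 - 1e-10)
        beta = 0.5 * np.log((1 - eps) / eps)

        # Updating F_m updates the weights: w_i <- w_i * exp(-beta * y_i * h(x_i))
        w = w * np.exp(-beta * y * pred)
        w = w / np.sum(w)

        stumps.append(stump)
        betas.append(beta)

    return stumps, betas


def adaboost_predict(X, stumps, betas):
    F = np.zeros(len(X))
    for stump, beta in zip(stumps, betas):
        F += beta * stump.predict(X)
    return np.sign(F)
```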
The magic of gradient boosting is that we can solve the stage-m optimization problem approximately using a simple strategy: fit the base learner to the negative gradient.
Key Insight:
Instead of directly solving: $$(\beta_m, \theta_m) = \arg\min_{\beta, \theta} \sum_{i=1}^{n} \ell(y_i, F_{m-1}(x_i) + \beta \cdot h(x_i; \theta))$$
We instead: (1) compute the negative gradient of the loss at the current predictions Fₘ₋₁(xᵢ), the pseudo-residuals; (2) fit the base learner to these pseudo-residuals by least squares; and (3) choose the step size separately, by line search or a fixed learning rate.
Why This Works:
The Taylor expansion of L(Fₘ₋₁ + βh) around Fₘ₋₁ is: $$L(F_{m-1} + \beta h) \approx L(F_{m-1}) + \beta \sum_{i=1}^{n} \frac{\partial \ell(y_i, F_{m-1}(x_i))}{\partial F} \cdot h(x_i)$$
To decrease loss, we want the inner product between h and ∂ℓ/∂F to be negative (indicating we're moving against the gradient). The h that achieves this best is the one that points in the direction of -∂ℓ/∂F.
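One way to make "points in the direction of −∂ℓ/∂F" precise: writing gᵢ for the gradient ∂ℓ(yᵢ, Fₘ₋₁(xᵢ))/∂F, fitting h to the negative gradients by least squares trades the alignment term from the expansion above against the size of h, since

$$\sum_{i=1}^{n} \big(-g_i - h(x_i)\big)^2 = \sum_{i=1}^{n} g_i^2 + 2\sum_{i=1}^{n} g_i\, h(x_i) + \sum_{i=1}^{n} h(x_i)^2.$$

Minimizing the left-hand side over h ∈ ℋ therefore makes Σᵢ gᵢ h(xᵢ) as negative as possible while keeping Σᵢ h(xᵢ)² controlled: the steepest descent direction achievable within the base learner class.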
| Loss Function | ℓ(y, F) | −∂ℓ/∂F (Negative Gradient) |
|---|---|---|
| Squared Error | ½(y - F)² | y - F (residual) |
| Absolute Error | \|y - F\| | sign(y - F) |
| Huber Loss | Huber_δ(y - F) | Clipped residual |
| Logistic (y ∈ {0,1}) | -y·log(p) - (1-y)·log(1-p) | y - p where p = sigmoid(F) |
| Exponential (y ∈ {-1,+1}) | exp(-y·F) | y·exp(-y·F) |
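A minimal sketch of these negative gradients as vectorized NumPy helpers. The function names and the Huber threshold `delta` are illustrative choices, not part of any library API; label conventions follow the table above.

```python
import numpy as np

def neg_grad_squared(y, F):
    return y - F                                  # the ordinary residual

def neg_grad_absolute(y, F):
    return np.sign(y - F)

def neg_grad_huber(y, F, delta=1.0):
    r = y - F
    return np.where(np.abs(r) <= delta, r, delta * np.sign(r))   # clipped residual

def neg_grad_logistic(y, F):                      # y in {0, 1}
    p = 1.0 / (1.0 + np.exp(-F))
    return y - p

def neg_grad_exponential(y, F):                   # y in {-1, +1}
    return y * np.exp(-y * F)
```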
We can now state the complete gradient boosting algorithm, synthesizing everything we've developed:
Algorithm: Gradient Boosting
Input: Training data {(xᵢ, yᵢ)}ⁿᵢ₌₁, differentiable loss function ℓ(y, F), number of stages M, learning rate ν, base learner class ℋ
Initialize: F₀(x) = argmin_γ Σᵢ ℓ(yᵢ, γ) (constant prediction)
For m = 1 to M:
Compute negative gradients (pseudo-residuals): $$r_{im} = -\frac{\partial \ell(y_i, F(x_i))}{\partial F(x_i)}\bigg|_{F=F_{m-1}}$$
Fit base learner to pseudo-residuals: $$h_m = \arg\min_{h \in \mathcal{H}} \sum_{i=1}^{n} (r_{im} - h(x_i))^2$$
(Optional) Line search for optimal step: $$\rho_m = \arg\min_\rho \sum_{i=1}^{n} \ell(y_i, F_{m-1}(x_i) + \rho \cdot h_m(x_i))$$
Update model with shrinkage: $$F_m(x) = F_{m-1}(x) + \nu \cdot \rho_m \cdot h_m(x)$$
Output: Fₘ(x)
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from scipy.optimize import minimize_scalar


class GradientBoosting:
    """
    Gradient Boosting with any differentiable loss function.

    This implementation follows the exact algorithm:
        1. Initialize with optimal constant
        2. For each stage:
            a. Compute negative gradient (pseudo-residuals)
            b. Fit tree to pseudo-residuals
            c. Line search for optimal step size
            d. Update with shrinkage
    """

    def __init__(self, loss, loss_gradient, n_estimators=100,
                 learning_rate=0.1, max_depth=3):
        self.loss = loss                    # loss(y, F) -> scalar
        self.loss_gradient = loss_gradient  # d(loss)/dF at each point
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.trees = []
        self.rhos = []
        self.F0 = None

    def _init_prediction(self, y):
        """Find optimal constant prediction F_0."""
        # For many losses, this is the mean (squared) or median (absolute)
        # Here we do a simple line search
        result = minimize_scalar(
            lambda gamma: np.sum([self.loss(yi, gamma) for yi in y])
        )
        return result.x

    def fit(self, X, y):
        n_samples = len(y)

        # Step 0: Initialize F_0
        self.F0 = self._init_prediction(y)
        F = np.full(n_samples, self.F0)

        for m in range(self.n_estimators):
            # Step 1: Compute negative gradient (pseudo-residuals)
            # r_i = -d loss(y_i, F_i) / dF_i
            pseudo_residuals = -self.loss_gradient(y, F)

            # Step 2: Fit regression tree to pseudo-residuals
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, pseudo_residuals)

            # Step 3: Line search for optimal step size rho
            h = tree.predict(X)
            result = minimize_scalar(
                lambda rho: sum(self.loss(y[i], F[i] + rho * h[i])
                                for i in range(n_samples)),
                bounds=(0, 2), method='bounded'
            )
            rho = result.x

            # Step 4: Update with shrinkage
            F = F + self.learning_rate * rho * h
            self.trees.append(tree)
            self.rhos.append(self.learning_rate * rho)

            if m % 20 == 0:
                total_loss = sum(self.loss(y[i], F[i]) for i in range(n_samples))
                print(f"Stage {m}: Loss = {total_loss:.4f}")

        return self

    def predict(self, X):
        F = np.full(len(X), self.F0)
        for tree, rho in zip(self.trees, self.rhos):
            F += rho * tree.predict(X)
        return F


# Example: Gradient Boosting with Huber loss
delta = 1.0

def huber_loss(y, f):
    r = y - f
    if abs(r) <= delta:
        return 0.5 * r ** 2
    else:
        return delta * (abs(r) - 0.5 * delta)

def huber_gradient(y, F):
    """d(huber_loss)/dF for each sample."""
    r = y - F
    grad = np.where(np.abs(r) <= delta, -r, -delta * np.sign(r))
    return grad

# Usage:
# gb = GradientBoosting(huber_loss, huber_gradient, n_estimators=100)
# gb.fit(X_train, y_train)
# predictions = gb.predict(X_test)
```

The line search step—finding the optimal ρₘ—deserves careful attention. Different implementations handle this differently.
Why Line Search Matters:
The base learner hₘ gives us a direction to update. The step size ρₘ tells us how far to step. Too small steps waste computation; too large steps may overshoot and increase loss.
Simple Approach (No Line Search):
Many implementations skip explicit line search and simply use ρₘ = 1, relying on the learning rate ν for step size control: $$F_m = F_{m-1} + \nu \cdot h_m$$
This works reasonably well when ν is tuned appropriately (typically 0.01-0.3).
Global Line Search:
Find ρ minimizing total loss: $$\rho_m = \arg\min_\rho \sum_{i=1}^{n} \ell(y_i, F_{m-1}(x_i) + \rho \cdot h_m(x_i))$$
This is a 1D optimization problem, solvable by bisection, golden section, or Newton's method.
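For instance, when the loss is twice differentiable, a few Newton iterations on the one-dimensional function L(ρ) = Σᵢ ℓ(yᵢ, Fᵢ + ρ·hᵢ) usually suffice. The sketch below assumes you supply elementwise first and second derivatives of the loss with respect to F; the function and parameter names are illustrative.

```python
import numpy as np

def newton_line_search(y, F, h, loss_grad, loss_hess, n_iter=5):
    """Approximate argmin_rho sum_i loss(y_i, F_i + rho * h_i) by Newton's method.

    loss_grad / loss_hess: user-supplied elementwise d loss/dF and d^2 loss/dF^2.
    """
    rho = 0.0
    for _ in range(n_iter):
        F_new = F + rho * h
        g = np.sum(loss_grad(y, F_new) * h)        # dL/d(rho)
        H = np.sum(loss_hess(y, F_new) * h ** 2)   # d^2 L/d(rho)^2
        if H <= 0:                                  # stop if curvature is unhelpful
            break
        rho -= g / H
    return rho
```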
Per-Leaf Line Search (Recommended for Trees):
When using decision trees, we can do better. Each leaf j defines a region Rⱼ where hₘ is constant. Instead of one global ρ, we find optimal γⱼ for each leaf:
$$\gamma_{jm} = \arg\min_\gamma \sum_{x_i \in R_{jm}} \ell(y_i, F_{m-1}(x_i) + \gamma)$$
The tree prediction becomes: hₘ(x) = γⱼₘ for x ∈ Rⱼₘ
This is more flexible—different regions can have different step sizes—and often improves performance.
Closed-Form Solutions for Specific Losses: for squared error, the optimal γⱼ is the mean of the residuals in leaf j; for absolute error, it is their median; for logistic loss, a single Newton step, γⱼ = Σᵢ rᵢ / Σᵢ pᵢ(1 − pᵢ) over the leaf's samples, gives a good approximation.
XGBoost's Approach:
XGBoost uses second-order (Newton) approximation, incorporating both gradient and Hessian to find per-leaf values: $$\gamma_j = -\frac{\sum_{i \in R_j} g_i}{\sum_{i \in R_j} h_i + \lambda}$$
where gᵢ is the gradient, hᵢ is the Hessian, and λ is L2 regularization.
Per-leaf optimization (as in XGBoost, LightGBM) is generally superior to global line search. It allows different regions to take different step sizes based on local loss curvature. This is one reason modern boosting implementations outperform classic gradient boosting.
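A sketch of this per-leaf Newton computation, given a fitted sklearn tree and per-sample gradients gᵢ and Hessians hᵢ. The helper name and the regularization argument `lam` are illustrative, not XGBoost's actual API.

```python
import numpy as np

def newton_leaf_values(tree, X, grad, hess, lam=1.0):
    """Per-leaf values gamma_j = -sum(g_i) / (sum(h_i) + lambda) over samples in leaf j."""
    leaf_ids = tree.apply(X)                 # which leaf each training sample falls into
    values = {}
    for leaf in np.unique(leaf_ids):
        mask = leaf_ids == leaf
        values[leaf] = -np.sum(grad[mask]) / (np.sum(hess[mask]) + lam)
    return values

# Usage: replace the tree's raw predictions with these leaf values before updating F,
# e.g. h = np.array([values[l] for l in tree.apply(X)]).
```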
Gradient boosting applies to regression, binary classification, and multi-class problems. The framework is the same; only the loss function changes.
Regression:
The model output F(x) is the prediction directly. Common losses: squared error, absolute error, Huber, and quantile (pinball) loss.
Binary Classification:
The model outputs a log-odds (logit): F(x) = log(p/(1-p))
Probability is: p(y=1|x) = sigmoid(F(x)) = 1/(1 + exp(-F(x)))
Common losses: binomial deviance (logistic loss) and exponential loss (which recovers AdaBoost).
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def gradient_boosting_classifier(X, y, n_estimators=100, learning_rate=0.1, max_depth=3):
    """
    Gradient Boosting for binary classification (logistic loss).

    Model outputs log-odds F(x). Probability is sigmoid(F(x)).
    Labels y should be in {0, 1}.

    Key insight: even for classification, we fit REGRESSION trees
    to the (continuous) negative gradient values.
    """
    n_samples = len(y)

    # Initialize: F_0 = log(p_1 / p_0) = log(mean / (1 - mean))
    p_init = np.clip(np.mean(y), 1e-6, 1 - 1e-6)
    F = np.full(n_samples, np.log(p_init / (1 - p_init)))
    F0 = F[0]

    trees = []

    for m in range(n_estimators):
        # Current probabilities
        p = 1.0 / (1.0 + np.exp(-F))

        # Negative gradient of logistic loss: y - p
        # Intuition: push F up where y=1 (want higher p)
        #            push F down where y=0 (want lower p)
        pseudo_residuals = y - p

        # Fit regression tree to pseudo-residuals
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, pseudo_residuals)

        # Per-leaf optimization for logistic loss
        # For each leaf, solve: gamma = argmin sum log(1 + exp(-y_i(F_i + gamma)))
        # Approximate solution: gamma = sum(r_i) / sum(p_i * (1 - p_i))
        leaves = tree.apply(X)
        leaf_values = {}
        for leaf in np.unique(leaves):
            mask = leaves == leaf
            # Newton step for logistic loss
            numerator = np.sum(pseudo_residuals[mask])
            denominator = np.sum(p[mask] * (1 - p[mask]))
            leaf_values[leaf] = numerator / (denominator + 1e-8)

        # Update F using per-leaf values
        h = np.array([leaf_values[l] for l in leaves])
        F = F + learning_rate * h

        trees.append((tree, leaf_values))

        if m % 20 == 0:
            log_loss = -np.mean(y * np.log(p + 1e-8) + (1 - y) * np.log(1 - p + 1e-8))
            print(f"Stage {m}: Log Loss = {log_loss:.4f}")

    return trees, F0, learning_rate


def predict_proba(X, trees, F0, learning_rate):
    """Predict probabilities using the trained classifier."""
    F = np.full(len(X), F0)
    for tree, leaf_values in trees:
        leaves = tree.apply(X)
        h = np.array([leaf_values.get(l, 0) for l in leaves])
        F += learning_rate * h
    return 1.0 / (1.0 + np.exp(-F))
```

Multi-class Classification:
For K classes, we maintain K separate models F₁(x), ..., Fₖ(x).
Probabilities via softmax: $$p(y=k|x) = \frac{\exp(F_k(x))}{\sum_{j=1}^{K} \exp(F_j(x))}$$
At each boosting iteration, we compute the softmax probabilities, form per-class pseudo-residuals rᵢₖ = 1(yᵢ = k) − pₖ(xᵢ), and fit one regression tree per class to those residuals, as sketched below.
This is computationally expensive (K trees per iteration), which is why efficient implementations like LightGBM use tricks like gradient sampling and one-vs-all reductions.
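A sketch of one multi-class boosting iteration, assuming current scores F of shape (n_samples, K) and integer labels y in {0, ..., K-1}; the function name and signature are illustrative. Fitting each tree proceeds exactly as in the binary case.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def multiclass_boosting_step(X, y, F, K, learning_rate=0.1, max_depth=3):
    """One iteration of K-class gradient boosting with softmax/cross-entropy loss."""
    # Softmax probabilities from current scores (stabilized by subtracting the row max)
    Z = F - F.max(axis=1, keepdims=True)
    P = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)

    Y = np.eye(K)[y]                     # one-hot targets, shape (n_samples, K)
    residuals = Y - P                    # negative gradient of cross-entropy w.r.t. F_k

    trees = []
    for k in range(K):                   # one regression tree per class
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals[:, k])
        F[:, k] += learning_rate * tree.predict(X)
        trees.append(tree)
    return F, trees
```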
A natural question: does this stagewise procedure converge? If so, how fast?
Training Loss Convergence:
Under mild conditions, gradient boosting monotonically decreases the training loss. At each stage, we add a term that (at least slightly) improves the fit.
Theorem (Informal): If the base learner class ℋ can approximate arbitrary gradients (e.g., regression trees with enough depth), then gradient boosting converges to the minimum training loss as M → ∞.
Convergence Rate:
For convex loss functions, gradient descent typically achieves a suboptimality gap of O(1/T) after T iterations, improving to a linear (geometric) rate when the loss is strongly convex and smooth.
Similar rates apply to functional gradient descent (gradient boosting), though the constants depend on the base learner's approximation quality.
Convergence of training loss doesn't imply good generalization. Without regularization, gradient boosting will eventually achieve near-zero training error while vastly overfitting. Shrinkage, early stopping, and tree constraints are essential for test performance.
Role of Shrinkage in Convergence:
With shrinkage rate ν < 1, each step contributes less. This may seem to slow convergence, but:
Regularization effect: Smaller steps require more iterations, but the resulting path through function space is "smoother," leading to better generalization.
Implicit L2 penalty: Analysis shows that shrinkage approximates L2 regularization on the coefficient path.
Better local optima: The greedy stagewise procedure can get stuck in suboptimal configurations. Shrinkage allows more gradual exploration.
Practical Guidance:
| Learning Rate ν | Typical M | Training Speed | Generalization |
|---|---|---|---|
| 1.0 (no shrinkage) | 50-100 | Fastest | Often overfits |
| 0.3 | 100-300 | Fast | Good baseline |
| 0.1 | 300-1000 | Moderate | Often better |
| 0.05 | 500-2000 | Slower | Usually excellent |
| 0.01 | 1000-5000 | Slowest | Best if compute allows |
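One practical way to explore this tradeoff is to fix a large M, train at several learning rates, and pick the stopping stage on a held-out set. The sketch below uses scikit-learn's GradientBoostingRegressor and its staged_predict iterator on a synthetic dataset; the dataset and split are placeholders for your own data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

for nu in [1.0, 0.3, 0.1, 0.05]:
    gb = GradientBoostingRegressor(learning_rate=nu, n_estimators=1000, max_depth=3)
    gb.fit(X_train, y_train)
    # staged_predict yields predictions after each stage, so we can find the best M
    val_mse = [mean_squared_error(y_val, pred) for pred in gb.staged_predict(X_val)]
    best_m = int(np.argmin(val_mse)) + 1
    print(f"nu={nu:>4}: best M = {best_m:4d}, validation MSE = {min(val_mse):.2f}")
```

Smaller learning rates typically reach their best validation error at a larger M, consistent with the table above.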
We've developed the complete algorithmic picture of how gradient boosting constructs ensembles through stagewise optimization. Let's consolidate:

- Forward stagewise additive modeling builds the ensemble one term at a time; once added, a term is never modified.
- For squared error, each stage fits the base learner to residuals; for exponential loss, the same stagewise procedure recovers AdaBoost.
- For general differentiable losses, each stage fits the base learner to the negative gradient (pseudo-residuals), then chooses a step size by line search or per-leaf optimization.
- Shrinkage (the learning rate ν) trades more stages for smoother paths through function space and better generalization.
What's Next:
We've covered how to fit base learners at each stage using gradients, but we haven't fully explored the connection between the loss function and the resulting algorithm behavior. The next page examines loss function formulation in depth: how different losses lead to different boosting algorithms, what properties make a loss suitable for boosting, and how to design custom losses for specialized problems.
You now understand the stagewise optimization procedure that constructs gradient boosting ensembles. This greedy algorithm—fit to negative gradients, add with shrinkage, repeat—is the practical realization of functional gradient descent, transforming the theoretical framework into a trainable algorithm.