The most remarkable aspect of boosting is how simple components combine to create powerful predictors. A single decision stump—a tree with just one split—is barely better than random guessing. Yet hundreds of stumps, carefully combined, can rival the performance of deep neural networks on tabular data.
The mathematical structure enabling this alchemy is the additive model: the final prediction is simply a weighted sum of base learner outputs. This seemingly simple constraint has profound implications for how we train boosted ensembles, analyze their behavior, and regularize them to prevent overfitting.
This page develops the formal theory of additive models, showing how they form the structural backbone of all boosting algorithms. You'll understand why additive structure enables greedy training, how it relates to basis function expansions, and what constraints it places on the functions we can learn.
An additive model expresses the prediction function as a sum of simpler component functions, called basis functions or base learners.
Formal Definition:
$$F(x) = \sum_{m=1}^{M} \beta_m \cdot b(x; \gamma_m)$$
where:
- F(x) is the final prediction function,
- M is the number of base learners (terms) in the sum,
- βₘ is the weight of the m-th term,
- b(x; γₘ) is a base learner parameterized by γₘ.
The Structure:
Each base learner b(x; γ) maps input x to a scalar output. The parameters γ can be simple (a threshold value for a stump) or complex (the complete structure of a decision tree). The final model is the sum of all base learners, each scaled by its weight β.
| Model Type | Base Function b(x; γ) | Parameters γ | Typical M |
|---|---|---|---|
| Linear Regression | xⱼ (the j-th feature) | j (feature index) | d (# features) |
| Polynomial Regression | xʲ (x to the j-th power) | j (degree) | degree + 1 |
| Splines | Bⱼ(x) (B-spline basis) | j (basis index), knots | knots + degree |
| Boosted Stumps | 𝟙(xⱼ > t) (threshold indicator) | j (feature), t (threshold) | 100-10,000 |
| Boosted Trees | Tree h(x) (full decision tree) | Tree structure | 100-10,000 |
| Neural Networks | σ(wᵀx + b) (hidden units) | weights w, bias b | 100s-millions |
Don't confuse additive models with linear models! In additive models, the base functions b(x; γ) can be highly nonlinear. Only the combination is additive (a sum). A boosted tree ensemble is additive in structure but can represent arbitrarily complex nonlinear decision boundaries.
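To make the structure concrete, here is a minimal sketch of an additive model over decision stumps (the function names are illustrative, not from any library): each base learner is nonlinear in x, yet the ensemble is just a weighted sum.

```python
import numpy as np

def stump(X, feature, threshold):
    """Base learner b(x; gamma) with gamma = (feature, threshold)."""
    return (X[:, feature] > threshold).astype(float)

def additive_predict(X, terms):
    """F(x) = sum_m beta_m * b(x; gamma_m)."""
    F = np.zeros(len(X))
    for beta, feature, threshold in terms:
        F += beta * stump(X, feature, threshold)
    return F

X = np.array([[0.2, 0.9], [0.6, 0.1], [0.8, 0.7]])
terms = [(1.5, 0, 0.5),    # beta=1.5, split feature 0 at 0.5
         (-0.5, 1, 0.4)]   # beta=-0.5, split feature 1 at 0.4
preds = additive_predict(X, terms)
```

Each stump output is a step function of one feature; only their weighted combination is constrained to be a sum.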
A natural question arises: how powerful are additive models? Can they approximate any function, or are they fundamentally limited?
Universal Approximation:
With sufficiently flexible base learners and enough terms, additive models can approximate any continuous function to arbitrary precision. This is a form of the universal approximation property.
Theorem (Informal): If the base learner class ℋ is "rich enough" (e.g., contains all indicator functions 𝟙(x ∈ R) for measurable regions R), then additive models over ℋ can approximate any integrable function.
Practical Implications:
For boosted decision trees:
Stumps (depth-1 trees): Can represent any function on the training data by using axis-aligned step functions. The approximation quality depends on M.
Shallow trees (depth 2-4): Each tree captures interactions between a few features. The ensemble captures complex multi-way interactions through composition.
Deeper trees: More expressive per-tree, but risk overfitting. Usually depth 4-8 is optimal for boosting.
How Additive Models Build Complexity:
Consider approximating a complex 2D function using additive stumps. Each stump partitions the plane with a single axis-aligned cut, assigning different constant values to each side.
Iteration 1: One cut creates 2 regions
Iteration 2: Another cut creates up to 4 regions
Iteration M: M cuts can create up to M+1 regions along each axis
By summing many stumps with different orientations and thresholds, we can approximate smooth curves, sharp boundaries, and complex contours. The granularity improves with M.
Example: Approximating x²
Suppose we want to approximate f(x) = x² on [0, 1] using stumps: place thresholds t₁ < t₂ < ... < t_M evenly across [0, 1], and fit a weight βᵢ to each stump bᵢ(x) = 𝟙(x > tᵢ).
The sum Σᵢ βᵢbᵢ(x) is a staircase approximating x². More stumps = finer staircase = better approximation.
```python
import numpy as np

def approximate_with_stumps(target_fn, x_range, n_stumps):
    """
    Demonstrate how additive stumps approximate a smooth function.

    This illustrates the universal approximation capability of additive
    models: any continuous function can be approximated to arbitrary
    precision with enough basis functions.
    """
    x = np.linspace(x_range[0], x_range[1], 1000)
    y_target = target_fn(x)

    # Create evenly-spaced stump thresholds
    thresholds = np.linspace(x_range[0], x_range[1], n_stumps + 1)[1:]

    # Each stump: b_m(x) = 1 if x > threshold_m, else 0
    # We want: sum_m beta_m * b_m(x) ≈ target(x)

    # Design matrix: row i is [b_1(x_i), b_2(x_i), ..., b_M(x_i)]
    X_design = np.zeros((len(x), n_stumps))
    for m, t in enumerate(thresholds):
        X_design[:, m] = (x > t).astype(float)

    # Solve least squares for the optimal weights:
    # beta = argmin ||X_design @ beta - y_target||^2
    beta = np.linalg.lstsq(X_design, y_target, rcond=None)[0]

    # Additive model prediction
    y_pred = X_design @ beta

    # Approximation error
    mse = np.mean((y_target - y_pred) ** 2)
    print(f"Stumps: {n_stumps}, MSE: {mse:.6f}")

    return x, y_target, y_pred, mse

# Demonstrate convergence as we add more stumps
target = lambda x: x ** 2  # Simple quadratic

for n in [5, 10, 25, 50, 100]:
    x, y_true, y_pred, mse = approximate_with_stumps(target, [0, 1], n)

# The printed MSE shrinks roughly as O(1/n²): each doubling of the number
# of stumps cuts the approximation error by about a factor of four.
```

Additive models can be understood through the lens of basis function expansion—a foundational concept in approximation theory and functional analysis.
The Idea:
Just as any vector in ℝ³ can be expressed as a linear combination of basis vectors {e₁, e₂, e₃}, any function (in a suitable function space) can be expressed as a combination of basis functions.
Classical Examples:
Fourier Series: $$f(x) = a_0 + \sum_{n=1}^{\infty} \left[a_n \cos(nx) + b_n \sin(nx)\right]$$
Basis functions: {1, cos(x), sin(x), cos(2x), sin(2x), ...}
Polynomial Expansion: $$f(x) = \sum_{k=0}^{K} a_k x^k$$
Basis functions: {1, x, x², x³, ...}
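For contrast with the adaptive bases that follow, here is a short sketch of fitting a fixed polynomial basis by ordinary least squares (the target function and degree are arbitrary choices for illustration):

```python
import numpy as np

# Fixed basis {1, x, x^2, x^3}: the basis functions are chosen up front,
# and only the coefficients a_k are fit to the data.
x = np.linspace(0, 1, 200)
y = np.sin(2 * np.pi * x)  # target to approximate

K = 3  # polynomial degree
X_design = np.vander(x, K + 1, increasing=True)  # columns: 1, x, ..., x^K

# Least-squares fit of the basis coefficients
coeffs, *_ = np.linalg.lstsq(X_design, y, rcond=None)
y_hat = X_design @ coeffs
mse = np.mean((y - y_hat) ** 2)
```

With a fixed basis the fit is a convex least-squares problem; the price is that the basis cannot concentrate resolution where the target function needs it.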
Boosting Analogy:
In boosting, the base learners form an adaptively-chosen basis. Unlike fixed bases (Fourier, polynomial), the boosting basis is determined by the data and the training procedure.
Why Adaptive Bases Excel:
Consider approximating a function with a sharp spike at x = 0.73. A polynomial basis would need many high-degree terms to capture this localized feature. A Fourier basis would exhibit Gibbs phenomenon (ringing artifacts).
A boosted stump basis, however, would naturally place thresholds near 0.73, focusing basis function resolution exactly where it's needed. The basis adapts to the target function's structure.
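This adaptive placement is easy to observe. The sketch below (a toy setup of my own, not from a library) boosts depth-1 trees on a function with a narrow spike at x = 0.73 and records where each stump places its split:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 500)).reshape(-1, 1)
y = np.exp(-((x.ravel() - 0.73) ** 2) / 0.001)  # narrow spike at 0.73

F = np.zeros(len(x))
thresholds = []
for _ in range(30):
    stump = DecisionTreeRegressor(max_depth=1)
    stump.fit(x, y - F)               # fit current residuals (squared loss)
    F += stump.predict(x)
    thresholds.append(stump.tree_.threshold[0])  # root node's split point

# The learned thresholds cluster around the spike, where resolution is needed
near_spike = sum(abs(t - 0.73) < 0.1 for t in thresholds)
```

A fixed, evenly spaced stump basis would spend most of its thresholds in flat regions; the boosted basis spends them where the residual error lives.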
The Trade-off:
Adaptive bases are more powerful but harder to fit: the optimization over basis parameters γ is generally non-convex, and the closed-form solutions available for fixed bases (e.g., least squares over a polynomial basis) no longer apply.
Boosting navigates this trade-off through the greedy stagewise procedure we'll examine next.
Readers familiar with kernel methods may recognize a connection: both kernels and boosting produce additive function expansions. In kernel methods, the basis functions are determined by the kernel and training points. In boosting, they're determined by the base learner class and the sequential fitting procedure. Both can be viewed as working in high-dimensional feature spaces.
Given the additive model structure F(x) = Σₘ βₘb(x; γₘ), how do we find the optimal parameters {βₘ, γₘ} for m = 1, ..., M?
The Objective:
$$\min_{\{\beta_m, \gamma_m\}_{m=1}^{M}} \sum_{i=1}^{n} \ell\left(y_i, \sum_{m=1}^{M} \beta_m b(x_i; \gamma_m)\right)$$
This is generally a challenging optimization problem: the objective is jointly non-convex in the parameters, the base learner parameters γₘ are often discrete (e.g., tree structures and split points), and all M terms are coupled through the sum inside the loss.
Approaches to Optimization:
| Approach | Description | Pros | Cons |
|---|---|---|---|
| Joint Optimization | Optimize all {βₘ, γₘ} simultaneously | Globally optimal (in theory) | Computationally intractable for large M |
| Backfitting | Iteratively update each term while holding others fixed | Simple; converges for convex ℓ | Slow convergence; sensitive to initialization |
| Forward Stagewise | Add one term at a time, never revising previous terms | Fast; greedy but effective | Suboptimal individual terms; requires many iterations |
| Gradient Boosting | Forward stagewise with gradient-based term selection | Principled; flexible loss functions | Sequential; hard to parallelize iterations |
Forward Stagewise Additive Modeling (FSAM):
The dominant approach in boosting is forward stagewise optimization. Instead of optimizing all M terms jointly, we start from an initial model F₀ (typically zero or a constant), add the single best new term at each stage, freeze it, and repeat for M stages.
The Stagewise Optimization:
At stage m, we solve:
$$(\beta_m, \gamma_m) = \arg\min_{\beta, \gamma} \sum_{i=1}^{n} \ell(y_i, F_{m-1}(x_i) + \beta \cdot b(x_i; \gamma))$$
This is a greedy approach: each step optimizes myopically, adding the single best term given the current model. Previous terms are never revised.
Forward stagewise optimization does not generally find the globally optimal additive model. The first few base learners might not be the best choices when considering all M terms jointly. However, empirically this greedy approach works remarkably well, especially with regularization (shrinkage, early stopping).
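The stagewise recursion can be sketched in a few lines under squared loss (illustrative code, not a library implementation): each stage fits one stump to the current residuals and freezes it, so the training error can only improve.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, (300, 1))
y = np.sin(X.ravel()) + rng.normal(0, 0.1, 300)

F = np.zeros(len(X))
train_mse = []
for m in range(100):
    # Choose gamma_m by fitting a stump to the residuals y - F_{m-1}
    stump = DecisionTreeRegressor(max_depth=1).fit(X, y - F)
    F += stump.predict(X)  # previous terms are never revised
    train_mse.append(np.mean((y - F) ** 2))
```

For squared loss the optimal βₘ is absorbed into the stump's leaf values, so no explicit line search is needed; for general losses the β step is a separate one-dimensional minimization.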
The additive structure provides natural regularization opportunities. Since the model is a sum of base learners, we can control complexity by constraining the number of terms M (early stopping), the contribution of each term (shrinkage), and the complexity of each individual base learner (e.g., tree depth).
Shrinkage (Learning Rate):
Instead of adding βₘb(x; γₘ) at each step, we add a shrunken version:
$$F_m(x) = F_{m-1}(x) + \nu \cdot \beta_m b(x; \gamma_m)$$
where ν ∈ (0, 1] is the shrinkage factor (learning rate). Smaller ν means each base learner contributes less, more iterations are needed to reach the same training loss, and the final model typically generalizes better.
```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

# Generate classification dataset
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=42)

# Compare different shrinkage values.
# Key insight: smaller shrinkage often gives better generalization
# but requires more trees to achieve similar training loss.

shrinkage_values = [1.0, 0.5, 0.1, 0.05, 0.01]
n_estimators_list = [50, 100, 200, 500, 1000]

results = {}

for nu in shrinkage_values:
    for n_trees in n_estimators_list:
        clf = GradientBoostingClassifier(
            learning_rate=nu,    # Shrinkage factor
            n_estimators=n_trees,
            max_depth=3,         # Shallow trees work well with shrinkage
            random_state=42
        )
        cv_scores = cross_val_score(clf, X, y, cv=5)
        results[(nu, n_trees)] = np.mean(cv_scores)

# Typical pattern observed:
# - High shrinkage (1.0): quick overfit; peaks early, then degrades
# - Medium shrinkage (0.1): slower convergence, better peak
# - Low shrinkage (0.01): very slow; needs many trees, often best generalization

# Example output interpretation (illustrative):
# learning_rate=1.0, n_trees=50:    0.89 (already peaked)
# learning_rate=0.1, n_trees=200:   0.92 (balanced)
# learning_rate=0.01, n_trees=1000: 0.93 (best but slow)

print("Shrinkage effectively regularizes by slowing down learning.")
```

Why Shrinkage Works:
Shrinkage works through a mechanism similar to L2 regularization. Mathematical analysis shows that:
$$F_M(x) = \sum_{m=1}^{M} \nu \cdot \beta_m b_m(x) = \nu \sum_{m=1}^{M} \beta_m b_m(x)$$
With small ν, the final model magnitude is constrained. Moreover, shrinkage allows more terms to contribute, averaging out the idiosyncrasies of any single base learner.
Intuition: With ν = 1.0, each tree greedily grabs as much signal as possible, including noise. With ν = 0.1, each tree only takes 10% of available signal, leaving room for subsequent trees to capture different aspects. The averaging effect reduces variance.
The Trade-off:
One of the most important empirical findings in boosting: learning slowly (small shrinkage, many iterations) almost always outperforms learning quickly (large steps, few iterations). This principle extends beyond boosting to many iterative machine learning algorithms.
Not all ensemble methods produce additive models. Understanding the differences reveals what makes additive structure special for boosting.
Comparison of Ensemble Structures:
| Ensemble Type | Output Structure | Key Property |
|---|---|---|
| Boosting | F(x) = Σₘ βₘhₘ(x) | Additive (weighted sum) |
| Bagging/RF (regression) | F(x) = (1/M)Σₘ hₘ(x) | Additive (average) |
| Bagging/RF (classification) | F(x) = mode{hₘ(x)} | Voting (non-additive for hard votes) |
| Stacking | F(x) = g(h₁(x), h₂(x), ...) | Meta-learner (potentially non-additive) |
| Mixture of Experts | F(x) = Σₘ wₘ(x)·hₘ(x) | Conditionally additive (weights depend on x) |
Why Additive Structure Suits Boosting:
Incremental Updates: Adding a new term Fₘ = Fₘ₋₁ + βₘhₘ is simple when structure is additive. No need to recompute how previous terms interact.
Gradient Interpretation: The additive update Fₘ = Fₘ₋₁ + η·h corresponds exactly to taking a gradient step in function space.
Analysis: Additive models admit clean theoretical analysis. Training error bounds, convergence rates, and generalization bounds are well-understood.
Regularization: Smooth operators (shrinkage, early stopping) apply naturally to sums.
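The gradient interpretation can be sketched directly (a toy setup of my own, using the logistic loss for labels y ∈ {0, 1}, whose per-sample negative gradient is y − σ(F)):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # labels in {0, 1}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Logistic loss l(y, F) = -y*F + log(1 + exp(F));
# the negative gradient with respect to F at each sample is y - sigmoid(F).
F = np.zeros(len(X))
eta = 0.3  # step size (shrinkage)
for _ in range(100):
    neg_grad = y - sigmoid(F)
    h = DecisionTreeRegressor(max_depth=2).fit(X, neg_grad).predict(X)
    F += eta * h  # the additive update IS the gradient step in function space

train_acc = np.mean((F > 0) == (y == 1))
```

Swapping the loss only changes the `neg_grad` line; the additive update itself is unchanged, which is exactly why the structure is so flexible.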
What Additive Structure Cannot Capture Directly:
Pure additive models cannot represent certain interaction patterns efficiently: XOR-style relationships between features, multiplicative effects such as xⱼ·xₖ, and any pattern where one feature's effect depends on another feature's value.
This is why boosting often uses shallow trees (depth 2-6) as base learners—each tree captures some feature interactions within its structure, then these are combined additively.
While the ensemble combination is additive, each tree within a boosted ensemble can capture feature interactions through its tree structure. A depth-d tree can represent interactions among up to d features. The additive combination of such trees allows complex multi-way interactions through composition.
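A quick experiment (my own illustrative setup) makes the depth point concrete: on XOR-labeled data, boosted stumps build a purely additive model in the raw features and fail out of sample, while depth-2 trees capture the interaction inside each tree.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.uniform(-1, 1, (600, 2))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)  # XOR of the two sign bits

Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

stumps = GradientBoostingClassifier(max_depth=1, n_estimators=300,
                                    random_state=0).fit(Xtr, ytr)
depth2 = GradientBoostingClassifier(max_depth=2, n_estimators=300,
                                    random_state=0).fit(Xtr, ytr)

acc_stumps = stumps.score(Xte, yte)  # near chance: XOR has no main effects
acc_depth2 = depth2.score(Xte, yte)  # high: each tree models the interaction
```

The depth-1 ensemble can memorize training points but cannot generalize an interaction it structurally cannot express; a single extra level of depth per tree fixes this.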
Additive models have a rich history in statistics, particularly in the form of Generalized Additive Models (GAMs).
Classical GAM:
$$F(x) = \beta_0 + \sum_{j=1}^{d} f_j(x_j)$$
where each fⱼ is a smooth function of the j-th feature. Unlike linear models (where fⱼ(xⱼ) = βⱼxⱼ), GAMs allow nonlinear effects while maintaining additive structure.
Key Property: Main Effects Only
Classical GAMs model only main effects—each feature contributes independently to the prediction. There are no interaction terms fⱼₖ(xⱼ, xₖ).
Boosted GAMs:
When boosting uses stumps (single-split trees) as base learners, the result is essentially a GAM:
$$F(x) = \sum_m \beta_m \cdot \mathbf{1}(x_{j_m} > t_m)$$
Each stump is a step function of one feature. The sum of stumps for feature j approximates a smooth function fⱼ(xⱼ). This connection enables interpretable gradient boosting (see: Explainable Boosting Machines, InterpretML).
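This connection can be checked numerically. The sketch below (an illustrative setup, not from InterpretML) boosts stumps on data with two additive nonlinear effects, then reads off one shape function by varying a single feature with the other held at zero — valid because the fitted model is exactly additive across features:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, (1000, 2))
y = X[:, 0] ** 2 + np.sin(3 * X[:, 1]) + rng.normal(0, 0.05, 1000)

# Boosting with stumps => a GAM: F(x) = f0(x0) + f1(x1) + const
gam = GradientBoostingRegressor(max_depth=1, n_estimators=500,
                                learning_rate=0.1, random_state=0).fit(X, y)

# Recover f0 up to a constant: predict along a grid in x0 with x1 fixed at 0
grid = np.linspace(-1, 1, 50)
pts = np.column_stack([grid, np.zeros(50)])
shape0 = gam.predict(pts)

# Compare the centered shape function to the true centered effect x0^2
err = np.mean((shape0 - shape0.mean()
               - (grid ** 2 - np.mean(grid ** 2))) ** 2)
```

Plotting `shape0` against `grid` would show a staircase tracking x², which is the sense in which boosted stumps yield an interpretable per-feature effect.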
GA²M: Adding Pairwise Interactions
A natural extension is to include pairwise interaction terms:
$$F(x) = \beta_0 + \sum_{j=1}^{d} f_j(x_j) + \sum_{j<k} f_{jk}(x_j, x_k)$$
This is called GA²M (Generalized Additive Model plus Interactions). It retains interpretability (can visualize each main effect and each pairwise interaction) while capturing important nonlinear interactions.
Explainable Boosting Machines (EBM):
Microsoft's InterpretML implements EBM, which is essentially boosted GA²M: it boosts many shallow trees, each restricted to a single feature (or a selected feature pair), cycling through the features with a very small learning rate, then aggregates each feature's trees into a single shape function that can be plotted and inspected.
We've developed a comprehensive understanding of additive models as the structural foundation of boosting. The key insights: the final model is a weighted sum of base learners; with a rich enough base learner class, additive models are universal approximators; the base learners form an adaptive, data-driven basis; the model is built greedily by forward stagewise fitting; and shrinkage and early stopping provide natural regularization.
What's Next:
We've established that boosting builds additive models through a forward stagewise procedure. The next page examines stage-wise optimization in detail: how exactly do we find the best base learner to add at each stage? This is where the connection to gradient descent becomes algorithmic—the fitting of base learners to negative gradients.
Understanding stagewise optimization will complete the picture: functional gradient descent (the principle) + additive models (the structure) + stagewise optimization (the algorithm) together define modern gradient boosting.
You now understand additive models as the structural foundation of boosting. This structure enables incremental construction through the forward stagewise procedure, admits powerful regularization through shrinkage, and connects boosting to the rich theory of basis function expansions.