In gradient descent, the step size determines how far we move in the gradient direction at each iteration. Too large, and we overshoot the optimum, oscillating wildly or diverging. Too small, and we crawl toward convergence, wasting computational resources. Finding the right balance is both science and art.
Gradient boosting inherits this challenge in function space. The learning rate (also called shrinkage or step size) scales each base learner's contribution before adding it to the ensemble. This seemingly simple parameter has profound implications for generalization, convergence speed, and the optimal number of boosting iterations.
This page explores the learning rate comprehensively: its mathematical role, the shrinkage effect, the tradeoff with iteration count, practical tuning strategies, and advanced techniques like adaptive and scheduled learning rates.
By the end of this page, you will understand: how the learning rate functions mathematically in gradient boosting, why shrinkage provides regularization, the fundamental tradeoff between learning rate and iteration count, strategies for selecting optimal learning rates, and advanced techniques for learning rate scheduling.
Recall the gradient boosting update rule:
$$F_m(x) = F_{m-1}(x) + \eta \cdot h_m(x)$$
where:

- $F_{m-1}(x)$ is the ensemble built through iteration $m-1$,
- $h_m(x)$ is the new base learner (tree) fitted at iteration $m$,
- $\eta \in (0, 1]$ is the learning rate.
The learning rate scales the contribution of each new tree. When $\eta = 1$, we add the full tree prediction. When $\eta < 1$, we add only a fraction, 'shrinking' the update.
In standard gradient descent: $$\theta^{(t+1)} = \theta^{(t)} - \eta \cdot \nabla_{\theta} \mathcal{L}$$
In gradient boosting (function space): $$F^{(m)} = F^{(m-1)} - \eta \cdot \text{(approximation to } \nabla_F \mathcal{L})$$
The tree $h_m$ approximates the negative gradient, and $\eta$ controls the step size along this direction. Smaller $\eta$ means smaller steps in function space.
After $M$ iterations with learning rate $\eta$, the final model is:
$$F_M(x) = F_0(x) + \eta \sum_{m=1}^{M} h_m(x)$$
The learning rate multiplicatively scales all tree contributions. A model with $\eta = 0.1$ and $M = 1000$ trees makes total contribution $0.1 \times 1000 = 100$ 'tree-equivalents.' The same total contribution could come from $\eta = 1.0$ and $M = 100$ trees—but with very different generalization properties.
Lower learning rates don't just slow down training—they change the NATURE of the learned function. Smaller steps allow the optimization to explore more of the path toward the minimum, often finding flatter, more generalizable solutions. This is why 'smaller η with more iterations' typically outperforms 'larger η with fewer iterations.'
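To make the shrunk update concrete, here is a minimal sketch of the boosting loop for squared loss. The synthetic sine data, the depth-2 trees, and the particular values of $\eta$ and $M$ are illustrative assumptions, not prescriptions: each round fits a tree to the current residuals (the negative gradient for squared loss) and adds only an $\eta$-scaled fraction of its prediction.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=200)

eta, M = 0.1, 50                      # learning rate and number of boosting rounds
F = np.full(len(y), y.mean())         # F_0: constant initial model
trees = []

for m in range(M):
    residuals = y - F                 # negative gradient of squared loss at F_{m-1}
    h = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    F = F + eta * h.predict(X)        # F_m = F_{m-1} + eta * h_m  (shrunk step)
    trees.append(h)

print(f"Training MSE after {M} shrunk steps: {np.mean((y - F) ** 2):.4f}")
```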
The term shrinkage emphasizes that the learning rate regularizes by shrinking each tree's impact. This has profound effects on the learned model.
1. Prevents Overcommitting Early
With $\eta = 1$, each tree fully corrects the errors it targets. If early trees overfit to noise, that noise becomes permanent in the model. With $\eta = 0.1$, early trees contribute only 10% of their full correction. Subsequent iterations can 'undo' mistakes by learning compensating patterns.
2. Explores Multiple Solutions
Small steps allow the ensemble to explore many correction paths. Rather than greedily jumping to the nearest minimum, the model meanders through function space, often finding better global solutions.
3. Ensemble Averaging Effect
With small $\eta$ and many trees, the final prediction averages many partially-fit models. This averaging reduces variance, similar to bagging. Each tree's idiosyncratic errors are diluted by the large ensemble.
4. Implicit L2 Regularization
Shrinkage with early stopping has been shown to be mathematically equivalent to L2 regularization on the function coefficients. Smaller $\eta$ corresponds to larger regularization strength.
| Learning Rate | Behavior | Generalization | Training Time |
|---|---|---|---|
| η = 1.0 | Full contribution per tree | Poor (overfits quickly) | Fast (fewer iterations needed) |
| η = 0.3 | Moderate shrinkage | Good for quick experiments | Moderate |
| η = 0.1 | Standard shrinkage | Typically optimal | Moderate-slow |
| η = 0.01 | Strong shrinkage | Excellent with many trees | Slow (many iterations) |
| η = 0.001 | Extreme shrinkage | Potentially best, but impractical | Very slow |
Friedman's 2001 paper introducing gradient boosting demonstrated that 'shrinkage dramatically improves the generalization ability' of gradient boosting. Setting η ≤ 0.1 consistently outperformed η = 1.0 across diverse datasets, even when accounting for increased training time.
There is a fundamental tradeoff between learning rate and the number of boosting iterations. Understanding this tradeoff is essential for efficient hyperparameter tuning.
For a fixed 'capacity' (total amount of learning), reducing the learning rate requires increasing iterations:
$$\text{Effective capacity} \approx \eta \times M$$
To maintain the same capacity:

- Halving $\eta$ requires roughly doubling $M$ (e.g., $\eta = 0.2$, $M = 500$ becomes $\eta = 0.1$, $M = 1000$).
- Cutting $\eta$ by a factor of 10 requires roughly $10\times$ as many iterations.
But equal capacity doesn't mean equal performance! The optimization path matters:
Small η, large M: the optimization takes many small, partially corrective steps. Each tree's idiosyncratic errors are diluted across the ensemble, the path through function space is smoother, and generalization is typically better—at the cost of longer training.

Large η, small M: the optimization converges quickly on the training data, but early trees' mistakes are locked in, the path is greedier, and the model is more prone to overfitting.
In practice, there's an optimal frontier where further reducing $\eta$ (with proportionally more iterations) no longer improves validation performance.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import make_friedman1
from sklearn.model_selection import train_test_split

def analyze_lr_iterations_tradeoff():
    """
    Demonstrate the learning rate vs iterations tradeoff.
    """
    # Generate data
    X, y = make_friedman1(n_samples=2000, noise=1.0, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )

    # Test different learning rate / iteration combinations
    # All have similar "capacity" (η × M ≈ 100)
    configs = [
        {"learning_rate": 1.0, "n_estimators": 100, "label": "η=1.0, M=100"},
        {"learning_rate": 0.5, "n_estimators": 200, "label": "η=0.5, M=200"},
        {"learning_rate": 0.1, "n_estimators": 1000, "label": "η=0.1, M=1000"},
        {"learning_rate": 0.05, "n_estimators": 2000, "label": "η=0.05, M=2000"},
        {"learning_rate": 0.01, "n_estimators": 10000, "label": "η=0.01, M=10000"},
    ]

    results = []
    for config in configs:
        label = config.pop("label")
        gbm = GradientBoostingRegressor(
            max_depth=4, random_state=42, **config
        )
        gbm.fit(X_train, y_train)

        train_score = gbm.score(X_train, y_train)
        test_score = gbm.score(X_test, y_test)

        results.append({
            "label": label,
            "train_r2": train_score,
            "test_r2": test_score,
            "gap": train_score - test_score
        })

        print(f"{label:25} | Train R²: {train_score:.4f} | Test R²: {test_score:.4f} | Gap: {train_score - test_score:.4f}")

    # Visualization
    fig, ax = plt.subplots(figsize=(10, 6))

    labels = [r["label"] for r in results]
    train_scores = [r["train_r2"] for r in results]
    test_scores = [r["test_r2"] for r in results]

    x = np.arange(len(labels))
    width = 0.35

    bars1 = ax.bar(x - width/2, train_scores, width, label='Train R²', color='steelblue')
    bars2 = ax.bar(x + width/2, test_scores, width, label='Test R²', color='coral')

    ax.set_ylabel('R² Score')
    ax.set_title('Learning Rate vs Iterations Tradeoff\n(Similar total capacity: η × M ≈ 100)')
    ax.set_xticks(x)
    ax.set_xticklabels(labels, rotation=45, ha='right')
    ax.legend()
    ax.set_ylim(0.8, 1.0)
    ax.grid(axis='y', alpha=0.3)

    plt.tight_layout()
    plt.savefig('lr_iterations_tradeoff.png', dpi=150)
    plt.show()

analyze_lr_iterations_tradeoff()

# Typical output:
# η=1.0, M=100     | Train R²: 0.9876 | Test R²: 0.8823 | Gap: 0.1053
# η=0.5, M=200     | Train R²: 0.9812 | Test R²: 0.8912 | Gap: 0.0900
# η=0.1, M=1000    | Train R²: 0.9734 | Test R²: 0.9045 | Gap: 0.0689
# η=0.05, M=2000   | Train R²: 0.9698 | Test R²: 0.9078 | Gap: 0.0620
# η=0.01, M=10000  | Train R²: 0.9645 | Test R²: 0.9089 | Gap: 0.0556
```

Test performance improves with smaller η: Despite similar 'capacity,' smaller learning rates achieve better test scores.
Train-test gap shrinks: The generalization gap (train - test) decreases with smaller η, indicating reduced overfitting.
Diminishing returns: The improvement from η=0.05 to η=0.01 is smaller than from η=0.5 to η=0.1. At some point, further shrinkage provides marginal benefit.
Training time increases: The η=0.01 model trains 100× longer than η=1.0 but achieves only modestly better test performance.
In practice, η=0.05 to η=0.3 with early stopping provides the best balance of performance and training time. Values below 0.01 rarely provide meaningful improvement and dramatically increase training cost. Always use early stopping rather than training for a fixed number of iterations.
Beyond a global learning rate, gradient boosting can perform per-iteration line search to find the optimal step size for each tree. This is the $\rho_m$ term in the full algorithm:
$$\rho_m = \arg\min_\rho \sum_{i=1}^n L(y_i, F_{m-1}(x_i) + \rho \cdot h_m(x_i))$$
After fitting tree $h_m$ to pseudo-residuals, we find the scaling factor $\rho_m$ that maximally reduces the original loss (not squared loss on residuals). This is a one-dimensional optimization problem.
For squared loss: $$\rho_m = \arg\min_\rho \sum_i (y_i - F_{m-1}(x_i) - \rho \cdot h_m(x_i))^2$$
This has a closed-form solution: $$\rho_m = \frac{\sum_i (y_i - F_{m-1}(x_i)) h_m(x_i)}{\sum_i h_m(x_i)^2} = \frac{\sum_i \tilde{r}_i h_m(x_i)}{\sum_i h_m(x_i)^2}$$
For other losses: Use numerical optimization (e.g., golden section search, Brent's method) to find $\rho_m$ on the interval $(0, \rho_{\max}]$.
When both line search and global shrinkage are used:
$$F_m(x) = F_{m-1}(x) + \eta \cdot \rho_m \cdot h_m(x)$$
The line search finds the optimal scale for this tree; the global $\eta$ then shrinks it for regularization.
```python
import numpy as np
from scipy.optimize import minimize_scalar

def line_search_squared_loss(y_true, current_preds, tree_predictions):
    """
    Find optimal step size for squared loss (closed form).

    rho = <residuals, tree_predictions> / <tree_predictions, tree_predictions>
    """
    residuals = y_true - current_preds
    numerator = np.dot(residuals, tree_predictions)
    denominator = np.dot(tree_predictions, tree_predictions) + 1e-10
    return numerator / denominator

def line_search_log_loss(y_true, current_preds, tree_predictions, max_rho=10.0):
    """
    Find optimal step size for log loss (numerical optimization).
    """
    def objective(rho):
        new_preds = current_preds + rho * tree_predictions
        # Log loss = -y*F + log(1 + exp(F))
        loss = np.sum(-y_true * new_preds + np.log(1 + np.exp(np.clip(new_preds, -500, 500))))
        return loss

    result = minimize_scalar(objective, bounds=(0, max_rho), method='bounded')
    return result.x

def line_search_absolute_loss(y_true, current_preds, tree_predictions, max_rho=10.0):
    """
    Find optimal step size for absolute loss (numerical optimization).
    """
    def objective(rho):
        new_preds = current_preds + rho * tree_predictions
        return np.sum(np.abs(y_true - new_preds))

    result = minimize_scalar(objective, bounds=(0, max_rho), method='bounded')
    return result.x

# Demonstration
np.random.seed(42)
n = 100
y_true = np.random.randn(n) + 5
current_preds = np.full(n, 5.0)  # Start at mean
tree_preds = np.random.randn(n) * 0.5 + (y_true - current_preds) * 0.3  # Approximate residuals

rho_l2 = line_search_squared_loss(y_true, current_preds, tree_preds)
rho_l1 = line_search_absolute_loss(y_true, current_preds, tree_preds)

print(f"Optimal step size (squared loss): {rho_l2:.4f}")
print(f"Optimal step size (absolute loss): {rho_l1:.4f}")

# Verify improvement
loss_before = np.sum((y_true - current_preds) ** 2)
loss_after = np.sum((y_true - current_preds - rho_l2 * tree_preds) ** 2)
print(f"\nSquared loss before: {loss_before:.2f}")
print(f"Squared loss after: {loss_after:.2f}")
print(f"Improvement: {(1 - loss_after/loss_before)*100:.1f}%")
```

Decision trees can perform line search separately for each leaf, finding optimal leaf values γⱼ rather than a single tree-wide ρ. This is the 'leaf value optimization' we covered earlier. XGBoost effectively does this by solving for optimal leaf values during tree construction.
While most gradient boosting implementations use a constant learning rate, adaptive learning rate schedules can improve convergence. The idea is to start with larger steps for fast initial progress, then reduce the step size for fine-tuning.
1. Constant (Standard) $$\eta_m = \eta_0$$
The default. Simple, predictable, works well with early stopping.
2. Step Decay $$\eta_m = \eta_0 \cdot \gamma^{\lfloor m / k \rfloor}$$
Reduce learning rate by factor $\gamma$ every $k$ iterations. Common in neural network training.
3. Exponential Decay $$\eta_m = \eta_0 \cdot e^{-\lambda m}$$
Gradually decrease over iterations. Smooth decay allows gentle refinement.
4. Cosine Annealing $$\eta_m = \eta_{\min} + \frac{1}{2}(\eta_0 - \eta_{\min})\left(1 + \cos\left(\frac{\pi m}{M}\right)\right)$$
Starts high, decreases to minimum, then rises again. Can help escape local minima.
```python
import numpy as np
import matplotlib.pyplot as plt

def constant_lr(m, lr_init, **kwargs):
    """Constant learning rate."""
    return lr_init

def step_decay_lr(m, lr_init, decay_factor=0.5, decay_every=100, **kwargs):
    """Step decay: reduce by factor every k iterations."""
    return lr_init * (decay_factor ** (m // decay_every))

def exponential_decay_lr(m, lr_init, decay_rate=0.001, **kwargs):
    """Exponential decay."""
    return lr_init * np.exp(-decay_rate * m)

def cosine_annealing_lr(m, lr_init, lr_min=0.001, total_iterations=1000, **kwargs):
    """Cosine annealing schedule."""
    return lr_min + 0.5 * (lr_init - lr_min) * (1 + np.cos(np.pi * m / total_iterations))

def inverse_time_lr(m, lr_init, decay_rate=0.01, **kwargs):
    """Inverse time decay."""
    return lr_init / (1 + decay_rate * m)

# Visualize different schedules
iterations = np.arange(1000)
lr_init = 0.1

schedules = {
    "Constant": [constant_lr(m, lr_init) for m in iterations],
    "Step Decay": [step_decay_lr(m, lr_init, decay_factor=0.5, decay_every=200) for m in iterations],
    "Exponential": [exponential_decay_lr(m, lr_init, decay_rate=0.003) for m in iterations],
    "Cosine": [cosine_annealing_lr(m, lr_init, lr_min=0.001, total_iterations=1000) for m in iterations],
    "Inverse Time": [inverse_time_lr(m, lr_init, decay_rate=0.01) for m in iterations],
}

plt.figure(figsize=(12, 6))
for name, lrs in schedules.items():
    plt.plot(iterations, lrs, label=name, linewidth=2)

plt.xlabel('Iteration')
plt.ylabel('Learning Rate')
plt.title('Learning Rate Schedules for Gradient Boosting')
plt.legend()
plt.grid(True, alpha=0.3)
plt.ylim(0, 0.12)
plt.savefig('lr_schedules.png', dpi=150)
plt.show()

# Print example values at key iterations
print("\nLearning rate values at different iterations:")
print("-" * 60)
print(f"{'Iteration':<12} | {'Constant':<10} | {'Step':<10} | {'Exp':<10} | {'Cosine':<10}")
for m in [0, 100, 250, 500, 750, 999]:
    print(f"{m:<12} | {schedules['Constant'][m]:.4f} | {schedules['Step Decay'][m]:.4f} | "
          f"{schedules['Exponential'][m]:.4f} | {schedules['Cosine'][m]:.4f}")
```

Most gradient boosting libraries don't natively support learning rate schedules, but we can implement them using callbacks or custom training loops:
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class GradientBoostingWithSchedule:
    """
    Custom gradient boosting with learning rate scheduling.
    """

    def __init__(self, n_estimators=100, max_depth=3,
                 lr_schedule='constant', lr_init=0.1, **schedule_params):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.lr_schedule = lr_schedule
        self.lr_init = lr_init
        self.schedule_params = schedule_params
        self.trees = []
        self.learning_rates = []
        self.initial_pred = None

    def _get_lr(self, iteration):
        """Get learning rate for current iteration."""
        m = iteration

        if self.lr_schedule == 'constant':
            return self.lr_init
        elif self.lr_schedule == 'exponential':
            decay = self.schedule_params.get('decay_rate', 0.003)
            return self.lr_init * np.exp(-decay * m)
        elif self.lr_schedule == 'cosine':
            lr_min = self.schedule_params.get('lr_min', 0.001)
            return lr_min + 0.5 * (self.lr_init - lr_min) * (
                1 + np.cos(np.pi * m / self.n_estimators)
            )
        elif self.lr_schedule == 'step':
            factor = self.schedule_params.get('decay_factor', 0.5)
            every = self.schedule_params.get('decay_every', 100)
            return self.lr_init * (factor ** (m // every))
        else:
            return self.lr_init

    def fit(self, X, y):
        n_samples = X.shape[0]

        # Initialize with mean
        self.initial_pred = np.mean(y)
        current_preds = np.full(n_samples, self.initial_pred)

        for m in range(self.n_estimators):
            # Get learning rate for this iteration
            lr = self._get_lr(m)
            self.learning_rates.append(lr)

            # Compute pseudo-residuals
            residuals = y - current_preds

            # Fit tree to residuals
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, residuals)
            self.trees.append(tree)

            # Update predictions with scheduled learning rate
            current_preds += lr * tree.predict(X)

            if (m + 1) % 100 == 0:
                mse = np.mean((y - current_preds) ** 2)
                print(f"Iteration {m + 1}: LR = {lr:.4f}, MSE = {mse:.4f}")

        return self

    def predict(self, X):
        preds = np.full(X.shape[0], self.initial_pred)
        for tree, lr in zip(self.trees, self.learning_rates):
            preds += lr * tree.predict(X)
        return preds

# Compare schedules
from sklearn.datasets import make_friedman1
from sklearn.model_selection import train_test_split

X, y = make_friedman1(n_samples=1000, noise=1.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

schedules_to_test = [
    ('constant', {}),
    ('exponential', {'decay_rate': 0.003}),
    ('cosine', {'lr_min': 0.01}),
]

print("\nComparing learning rate schedules:")
print("=" * 50)

for schedule_name, params in schedules_to_test:
    gbm = GradientBoostingWithSchedule(
        n_estimators=500, max_depth=4,
        lr_schedule=schedule_name, lr_init=0.1, **params
    )
    gbm.fit(X_train, y_train)

    train_mse = np.mean((y_train - gbm.predict(X_train)) ** 2)
    test_mse = np.mean((y_test - gbm.predict(X_test)) ** 2)

    print(f"{schedule_name:15} | Train MSE: {train_mse:.4f} | Test MSE: {test_mse:.4f}")
```

Learning rate schedules are most useful when training for a fixed number of iterations (e.g., competitions with training time limits). When using early stopping, a constant learning rate often works just as well because training stops automatically at the optimal iteration count.
Tuning the learning rate effectively requires understanding its interaction with other hyperparameters. Here are practical strategies used by practitioners.
The most common approach:

1. Fix the learning rate at a small value (e.g., $\eta = 0.05$ or $0.1$).
2. Set a generous tree budget (e.g., n_estimators = 5000).
3. Let early stopping on a validation set decide how many trees are actually used.
This automates the η-M tradeoff: with small η, early stopping runs more iterations.
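As a concrete illustration, here is a hedged sketch of this approach using LightGBM's scikit-learn wrapper (assuming a recent LightGBM version where early stopping is configured through callbacks; the dataset and parameter values are arbitrary choices for the example):

```python
import lightgbm as lgb
from sklearn.datasets import make_friedman1
from sklearn.model_selection import train_test_split

X, y = make_friedman1(n_samples=2000, noise=1.0, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Small fixed learning rate, generous tree budget; early stopping chooses M
model = lgb.LGBMRegressor(learning_rate=0.05, n_estimators=5000)
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    eval_metric="l2",
    callbacks=[lgb.early_stopping(stopping_rounds=50), lgb.log_evaluation(period=0)],
)
print("Trees actually used:", model.best_iteration_)
```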
Search over a geometric grid:
learning_rates = [0.01, 0.03, 0.05, 0.1, 0.2, 0.3]
Combined with early stopping, this finds the optimal point on the η-M tradeoff curve while considering training time.
A useful heuristic: when you halve the learning rate, approximately double the iterations. This maintains similar capacity while exploring the generalization benefit of smaller steps.
$$\eta' = \eta / 2 \implies M' \approx 2M$$
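A quick way to apply the heuristic is to generate matched (η, M) candidates from a starting configuration and evaluate each with your usual validation procedure; the starting values below are arbitrary:

```python
# Generate (learning_rate, n_estimators) pairs that keep eta * M roughly constant,
# following the "halve eta, double M" heuristic. Starting values are arbitrary.
eta, M = 0.2, 500
candidates = []
for _ in range(4):
    candidates.append((eta, M))
    eta, M = eta / 2, M * 2

for eta, M in candidates:
    print(f"eta = {eta:.3f}, n_estimators = {M}, capacity ≈ {eta * M:.0f}")
```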
For final optimization, use Bayesian optimization (e.g., Optuna, Hyperopt) to search learning_rate jointly with other hyperparameters. The surrogate model can capture the complex interactions between η, max_depth, regularization, etc.
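As an illustrative sketch (not a recommended recipe), the following uses Optuna to search the learning rate jointly with depth, tree count, and subsampling. The search ranges, the 3-fold cross-validation, and the make_friedman1 data are assumptions made for the example, and a recent Optuna version is assumed:

```python
import optuna
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import make_friedman1
from sklearn.model_selection import cross_val_score

X, y = make_friedman1(n_samples=1000, noise=1.0, random_state=42)

def objective(trial):
    # Sample learning_rate on a log scale together with interacting hyperparameters
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 2, 6),
        "n_estimators": trial.suggest_int("n_estimators", 100, 2000, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
    }
    model = GradientBoostingRegressor(random_state=42, **params)
    # Mean cross-validated R² is the quantity Optuna maximizes
    return cross_val_score(model, X, y, cv=3, scoring="r2").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print("Best params:", study.best_params)
print("Best CV R²:", round(study.best_value, 4))
```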
```python
import numpy as np
import time
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import make_friedman1
from sklearn.model_selection import train_test_split

def tune_learning_rate_with_early_stopping():
    """
    Demonstrate learning rate tuning with early stopping.
    """
    # Generate data
    X, y = make_friedman1(n_samples=2000, noise=1.0, random_state=42)
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    results = []

    for lr in [0.3, 0.1, 0.05, 0.03, 0.01]:
        start_time = time.time()

        # Use high n_estimators with early stopping
        gbm = GradientBoostingRegressor(
            n_estimators=5000,
            learning_rate=lr,
            max_depth=4,
            validation_fraction=0.15,
            n_iter_no_change=50,  # Early stopping patience
            random_state=42
        )
        gbm.fit(X_train, y_train)

        elapsed = time.time() - start_time
        train_score = gbm.score(X_train, y_train)
        val_score = gbm.score(X_val, y_val)
        n_trees = gbm.n_estimators_  # Actual trees used (after early stopping)

        results.append({
            'lr': lr,
            'n_trees': n_trees,
            'train_r2': train_score,
            'val_r2': val_score,
            'time': elapsed
        })

        print(f"LR={lr:.2f} | Trees={n_trees:4d} | "
              f"Train R²={train_score:.4f} | Val R²={val_score:.4f} | "
              f"Time={elapsed:.1f}s")

    # Find best
    best = max(results, key=lambda x: x['val_r2'])
    print(f"\nBest: LR={best['lr']} with Val R²={best['val_r2']:.4f}")

    return results

# Run the tuning
results = tune_learning_rate_with_early_stopping()

# Typical output:
# LR=0.30 | Trees= 287 | Train R²=0.9756 | Val R²=0.8934 | Time=2.1s
# LR=0.10 | Trees= 612 | Train R²=0.9687 | Val R²=0.9021 | Time=4.3s
# LR=0.05 | Trees=1045 | Train R²=0.9645 | Val R²=0.9056 | Time=7.2s
# LR=0.03 | Trees=1678 | Train R²=0.9612 | Val R²=0.9067 | Time=11.4s
# LR=0.01 | Trees=4532 | Train R²=0.9578 | Val R²=0.9071 | Time=29.8s
```

Notice in the example: η=0.01 achieves only 0.0015 better validation R² than η=0.05, but takes 4× longer to train. In practice, η=0.05-0.1 with early stopping often provides the best accuracy/time tradeoff for production systems.
The learning rate doesn't operate in isolation—it interacts with nearly every other hyperparameter. Understanding these interactions is crucial for effective tuning.
Deep trees + small η: Each tree captures complex patterns but contributes little. The ensemble builds complexity gradually. Generally good for complex problems.
Deep trees + large η: Risk of overfitting early. Deep trees can already fit training data well; large steps lock in this fit.
Shallow trees + small η: Many simple corrections gradually build a complex model. Often the most robust combination.
Shallow trees + large η: May underfit. Each tree contributes little total variance, and there aren't enough iterations to compensate.
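A small experiment along these lines, using arbitrary synthetic data and a fixed tree budget so the depth/η interaction is isolated (the specific depths and rates are illustrative assumptions):

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import make_friedman1
from sklearn.model_selection import train_test_split

X, y = make_friedman1(n_samples=1500, noise=1.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 2x2 grid: shallow/deep trees crossed with small/large learning rates,
# holding n_estimators fixed so the step-size effect is visible
for max_depth in (2, 6):
    for lr in (0.05, 0.5):
        gbm = GradientBoostingRegressor(
            n_estimators=300, max_depth=max_depth,
            learning_rate=lr, random_state=42
        ).fit(X_train, y_train)
        print(f"depth={max_depth}, eta={lr:<4} | "
              f"train R²={gbm.score(X_train, y_train):.3f} | "
              f"test R²={gbm.score(X_test, y_test):.3f}")
```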
Subsampling (using a fraction of data per tree) adds stochasticity:
Low subsample (e.g., 0.5) + small η: High variance reduction through averaging. Robust but potentially slower convergence.
High subsample (e.g., 1.0) + small η: More deterministic optimization. Faster convergence but less variance reduction.
XGBoost and LightGBM have explicit regularization parameters—for example reg_lambda (L2 penalty on leaf values), reg_alpha (L1 penalty), and a minimum loss reduction required to split (gamma in XGBoost, min_split_gain in LightGBM). These interact with the learning rate:
Strong regularization + small η: Double regularization. May need to reduce one if underfitting.
Weak regularization + large η: Double risk of overfitting. Generally avoid this combination.
| Other Hyperparameter | With Small η (≤0.1) | With Large η (>0.1) |
|---|---|---|
| max_depth | Can use deeper trees (4-8) | Keep shallower (2-4) |
| n_estimators | Allow many (500-5000+) | Fewer needed (50-500) |
| subsample | 0.5-1.0 both work | Lower (0.5-0.8) recommended |
| min_samples_leaf | Lower values OK | Higher values for regularization |
| reg_lambda (L2) | Lower values OK | Higher values recommended |
| colsample_bytree | 0.5-1.0 both work | Lower (0.5-0.8) for diversity |
When reducing learning rate: you may need to increase tree depth or number of estimators to compensate for reduced capacity per round. When increasing learning rate: add more regularization (lower max_depth, higher L2) to prevent overfitting.
We have thoroughly explored the learning rate—one of the most important hyperparameters in gradient boosting. Let's consolidate the key takeaways:

- The learning rate $\eta$ scales each tree's contribution: $F_m(x) = F_{m-1}(x) + \eta \cdot h_m(x)$.
- Shrinkage acts as regularization: small $\eta$ keeps early trees from overcommitting to noise and behaves like implicit L2 regularization on the function coefficients.
- Learning rate and iteration count trade off ($\eta \times M \approx$ constant capacity), but smaller $\eta$ with more iterations typically generalizes better, with diminishing returns at very small values.
- In practice, $\eta = 0.05$–$0.3$ combined with early stopping gives the best balance of accuracy and training time.
- The learning rate interacts with tree depth, subsampling, and explicit regularization; compensate with the other hyperparameters when you change it.
With the learning rate understood, we complete our exploration of gradient boosting with stopping criteria—how to determine when to stop adding trees. We'll cover early stopping in depth, validation strategies, and advanced techniques for monitoring training progress and preventing overfitting.
You now understand the learning rate at a fundamental level—its mathematical role, regularization effect, interaction with iteration count, and practical tuning strategies. This knowledge is essential for training high-performance gradient boosting models and understanding why the 'small η + early stopping' paradigm dominates in practice.