In machine learning, we typically optimize models to learn as quickly as possible—minimizing our objective function in the fewest iterations. Yet gradient boosting offers a profound counterexample: deliberately slowing down learning produces dramatically better results.
This technique, known as shrinkage (or learning rate reduction), is perhaps the single most important regularization method in modern boosting implementations. Every production-grade boosting library—XGBoost, LightGBM, CatBoost, scikit-learn's GradientBoosting—treats the learning rate as a primary hyperparameter, often the first one practitioners tune.
Understanding shrinkage requires us to rethink our intuitions about optimization. It reveals a deep truth about statistical learning: the path we take to a solution matters as much as the solution itself.
By the end of this page, you will understand: (1) the mathematical formulation of shrinkage and its effect on the boosting update rule, (2) why shrinkage provides regularization from an optimization and statistical perspective, (3) the trade-off between learning rate and number of iterations, (4) practical guidelines for selecting learning rates across different problem contexts, and (5) the theoretical analysis connecting shrinkage to improved generalization bounds.
Before introducing shrinkage, let's recall the fundamental gradient boosting update. At each iteration $m$, gradient boosting computes the pseudo-residuals (the negative gradient of the loss at the current predictions), fits a base learner to these residuals, and adds that learner to the ensemble.
The standard update rule, without shrinkage, takes the form:
$$F_m(x) = F_{m-1}(x) + h_m(x)$$
where $F_m(x)$ is the ensemble prediction after $m$ iterations and $h_m(x)$ is the $m$-th base learner fitted to the pseudo-residuals:
$$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F=F_{m-1}}$$
The base learner $h_m$ is trained to minimize the squared error against these residuals:
$$h_m = \underset{h}{\arg\min} \sum_{i=1}^{n} (r_{im} - h(x_i))^2$$
Without shrinkage, each base learner contributes its full prediction to the ensemble. This is analogous to gradient descent with a step size of 1.0—we move the full distance suggested by the gradient direction. While mathematically valid, this aggressive updating often leads to overfitting in practice.
The Overfitting Problem
The standard update has a subtle but critical flaw. Each base learner is optimized to fit the current residuals on the training data as closely as possible. When we add this learner at full strength, we're making a greedy optimization decision that fits the noise in the current residuals along with the signal, and that cannot be revised by later iterations.
The result is a model that fits the training data extremely well but generalizes poorly to unseen data. The ensemble becomes overly specialized to the training set's idiosyncrasies.
Shrinkage modifies the gradient boosting update by introducing a learning rate parameter $\nu \in (0, 1]$, also called the shrinkage factor or step size. The modified update rule becomes:
$$F_m(x) = F_{m-1}(x) + \nu \cdot h_m(x)$$
This simple modification—multiplying each base learner's contribution by a constant less than one—has profound effects on the learning dynamics:
The Key Insight: Instead of adding base learners at full strength, we add only a fraction $\nu$ of each learner. This means the ensemble moves more slowly toward fitting the training data, taking many small steps instead of few large ones.
Typical values of $\nu$ range from 0.001 to 0.3, with values around 0.01 to 0.1 being common in practice. A learning rate of 0.1 means each tree contributes only 10% of its "full" prediction.
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


class GradientBoostingWithShrinkage:
    """
    Gradient Boosting implementation demonstrating shrinkage.
    """

    def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3):
        """
        Parameters:
        -----------
        n_estimators : int
            Number of boosting iterations (trees to add)
        learning_rate : float
            Shrinkage factor nu in (0, 1]. Smaller values require
            more iterations but typically generalize better.
        max_depth : int
            Maximum depth of each decision tree base learner
        """
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate  # This is nu (shrinkage)
        self.max_depth = max_depth
        self.trees = []
        self.initial_prediction = None

    def fit(self, X, y):
        """
        Fit the gradient boosting model with shrinkage.
        """
        n_samples = X.shape[0]

        # Initialize with mean (optimal constant for squared error)
        self.initial_prediction = np.mean(y)

        # Current ensemble predictions
        F = np.full(n_samples, self.initial_prediction)

        for m in range(self.n_estimators):
            # Step 1: Compute pseudo-residuals (negative gradient of MSE)
            residuals = y - F  # For MSE: -d/dF[(y-F)^2/2] = y - F

            # Step 2: Fit a base learner to residuals
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, residuals)

            # Step 3: Get the tree's predictions
            h_m = tree.predict(X)

            # Step 4: Update with SHRINKAGE
            # Key difference: multiply by learning_rate before adding
            F = F + self.learning_rate * h_m

            self.trees.append(tree)

            # Track progress
            if (m + 1) % 20 == 0:
                mse = np.mean((y - F) ** 2)
                print(f"Iteration {m+1}: Train MSE = {mse:.6f}")

        return self

    def predict(self, X):
        """
        Make predictions using the trained ensemble.
        """
        # Start with initial prediction
        F = np.full(X.shape[0], self.initial_prediction)

        # Add shrunk contribution from each tree
        for tree in self.trees:
            F = F + self.learning_rate * tree.predict(X)

        return F


# Demonstration: Effect of different learning rates
if __name__ == "__main__":
    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split

    # Generate synthetic data
    X, y = make_regression(n_samples=1000, n_features=10, noise=10)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # Compare different learning rates
    learning_rates = [1.0, 0.5, 0.1, 0.01]

    for lr in learning_rates:
        # Adjust iterations inversely with learning rate for fair comparison
        n_iter = int(100 / lr) if lr < 1.0 else 100
        n_iter = min(n_iter, 1000)  # Cap at 1000 iterations

        model = GradientBoostingWithShrinkage(
            n_estimators=n_iter,
            learning_rate=lr,
            max_depth=3
        )
        model.fit(X_train, y_train)

        train_mse = np.mean((y_train - model.predict(X_train)) ** 2)
        test_mse = np.mean((y_test - model.predict(X_test)) ** 2)

        print(f"\nLearning Rate: {lr}")
        print(f"  Iterations: {n_iter}")
        print(f"  Train MSE: {train_mse:.4f}")
        print(f"  Test MSE: {test_mse:.4f}")
        print(f"  Generalization Gap: {test_mse - train_mse:.4f}")
```

The effectiveness of shrinkage can be understood from multiple complementary perspectives. Each viewpoint illuminates a different aspect of why deliberate slowness improves generalization.
Consider what happens as we add more base learners to the ensemble. With full-strength updates ($\nu = 1$), each learner aggressively fits the current residuals. The model quickly minimizes training error but simultaneously memorizes noise.
With shrinkage ($\nu < 1$), the model progresses more slowly along the regularization path—the trajectory from a simple model (high bias, low variance) to a complex model (low bias, high variance). This slower progression provides several benefits: no single tree can dominate the ensemble, mistakes made by one tree can be partially corrected by later trees, and the path passes through many intermediate models of well-controlled complexity at which we can choose to stop.
From a statistical viewpoint, shrinkage controls model complexity. The effective degrees of freedom of a gradient boosting model depend on both the number of trees and the learning rate. Smaller learning rates result in lower effective complexity per iteration.
Formally, if we define the effective complexity as related to how much the model can fit the training data, we observe:
$$\text{Complexity}(F_M) \propto M \cdot \nu \cdot \text{Complexity}(h)$$
where $M$ is the number of iterations and $\text{Complexity}(h)$ is the complexity of individual base learners. This relationship suggests that $M \cdot \nu$ acts as a combined complexity measure: we can achieve similar effective complexity with either many iterations at a small learning rate (large $M$, small $\nu$) or few iterations at a large learning rate (small $M$, large $\nu$).
However, empirical and theoretical results strongly favor the first option.
Friedman (2001) demonstrated empirically, and subsequent theoretical work has confirmed, that given a total 'budget' of complexity (measured by M × ν), distributing it across many small steps (small ν, large M) almost always outperforms few large steps (large ν, small M). The improvement is most pronounced when base learners are relatively weak.
In high-dimensional function spaces where boosting operates, the optimization landscape is complex with many local minima. Consider the difference between optimization strategies:
Large Learning Rate: Takes big jumps in the function space. Each jump optimizes greedily for current residuals. Easy to jump into a local minimum that represents overfitting to training data noise.
Small Learning Rate: Takes small steps, effectively exploring more of the function space. The path of solutions is smoother and more likely to find generalizable patterns that persist across different data samples.
This is analogous to descending a noisy landscape: large, erratic strides react to every bump in the training data, while small, controlled steps average out the noise and are more likely to track the true downhill direction.
A fundamental characteristic of shrinkage is the inverse relationship between learning rate and required iterations. Lower learning rates require more boosting iterations to achieve the same training error. This relationship is approximately:
$$M_{\text{required}} \approx \frac{C}{\nu}$$
where $C$ is a constant depending on the problem complexity. This means that halving the learning rate roughly doubles the number of iterations needed to reach the same training error, and moving from $\nu = 0.1$ to $\nu = 0.01$ requires roughly ten times as many trees.
This creates an important practical trade-off: smaller learning rates give better generalization but require more computational resources.
| Learning Rate (ν) | Typical Iterations | Training Speed | Generalization | Memory Usage |
|---|---|---|---|---|
| 0.3 - 1.0 | 50 - 200 | Fast | Poor to Fair | Low |
| 0.1 | 100 - 500 | Moderate | Good | Moderate |
| 0.05 | 200 - 1000 | Slow | Very Good | Higher |
| 0.01 | 1000 - 5000 | Very Slow | Excellent | High |
| 0.001 | 5000 - 20000 | Extremely Slow | Typically Best | Very High |
The Computational Burden
At first glance, the trade-off seems purely negative: we pay with computation time for better generalization. However, several factors mitigate this cost:
Early Stopping: We don't need to run all iterations. With proper early stopping, smaller learning rates often converge (in validation performance) faster than expected; a minimal code sketch follows these points.
Parallelization: While boosting iterations are inherently sequential, tree construction within each iteration can be parallelized.
Hardware Advances: Modern GPUs and distributed computing make higher iteration counts more feasible.
Value of Accuracy: In many applications, the improvement in model accuracy vastly outweighs additional training time.
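To make the early-stopping point concrete, here is a minimal sketch using scikit-learn's built-in validation-based stopping (`validation_fraction` and `n_iter_no_change`). The dataset and parameter values are illustrative assumptions, not tuned recommendations.

```python
# Minimal sketch: small learning rate with a generous tree budget, letting
# built-in early stopping decide how many trees are actually needed.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=2000, n_features=20, noise=15, random_state=0)

model = GradientBoostingRegressor(
    n_estimators=5000,        # upper bound; early stopping trims the rest
    learning_rate=0.02,       # small step size (illustrative value)
    max_depth=3,
    validation_fraction=0.1,  # hold out 10% of the training data internally
    n_iter_no_change=25,      # stop after 25 rounds without validation improvement
    tol=1e-4,
    random_state=0,
)
model.fit(X, y)

# n_estimators_ reports how many trees were fitted before stopping kicked in
print(f"Trees fitted before early stopping: {model.n_estimators_}")
```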
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error


def analyze_learning_rate_tradeoff(X_train, X_val, y_train, y_val):
    """
    Analyze the learning rate vs iterations trade-off experimentally.

    Key insight: For a fixed computational budget (M * lr ≈ constant),
    smaller learning rates with more iterations typically win.
    """
    results = {}

    # Test different learning rates with adjusted iterations
    configs = [
        (1.0, 100),      # High LR, few iterations
        (0.5, 200),      # Medium-high LR
        (0.1, 1000),     # Standard LR
        (0.05, 2000),    # Lower LR
        (0.01, 10000),   # Low LR, many iterations
    ]

    for lr, max_iterations in configs:
        print(f"\nTesting lr={lr}, max_iter={max_iterations}")

        # Track validation performance over iterations
        val_errors = []
        train_errors = []

        # We'll use staged_predict to get predictions at each iteration
        model = GradientBoostingRegressor(
            n_estimators=max_iterations,
            learning_rate=lr,
            max_depth=3,
            random_state=42
        )
        model.fit(X_train, y_train)

        # Get predictions at each stage
        for i, (y_train_pred, y_val_pred) in enumerate(zip(
            model.staged_predict(X_train),
            model.staged_predict(X_val)
        )):
            train_errors.append(mean_squared_error(y_train, y_train_pred))
            val_errors.append(mean_squared_error(y_val, y_val_pred))

        # Find best validation performance
        best_iter = np.argmin(val_errors) + 1
        best_val_error = min(val_errors)

        results[(lr, max_iterations)] = {
            'best_iter': best_iter,
            'best_val_mse': best_val_error,
            'train_mse_at_best': train_errors[best_iter - 1],
            'val_errors': val_errors,
            'train_errors': train_errors
        }

        print(f"  Best iteration: {best_iter}")
        print(f"  Best validation MSE: {best_val_error:.4f}")
        print(f"  Training MSE at best: {train_errors[best_iter - 1]:.4f}")

    return results


def plot_learning_curves(results):
    """
    Visualize learning curves for different learning rates.
    """
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    colors = plt.cm.viridis(np.linspace(0, 0.8, len(results)))

    for (lr, max_iter), result, color in zip(
        results.keys(), results.values(), colors
    ):
        iterations = range(1, len(result['val_errors']) + 1)

        # Plot validation errors
        axes[0].plot(
            iterations, result['val_errors'],
            label=f'lr={lr}', color=color, alpha=0.8
        )
        axes[0].axvline(
            result['best_iter'], color=color, linestyle='--', alpha=0.5
        )

    axes[0].set_xlabel('Iterations')
    axes[0].set_ylabel('Validation MSE')
    axes[0].set_title('Validation Error vs Iterations')
    axes[0].legend()
    axes[0].set_xscale('log')

    # Plot best validation error vs learning rate
    lrs = [k[0] for k in results.keys()]
    best_errors = [v['best_val_mse'] for v in results.values()]
    best_iters = [v['best_iter'] for v in results.values()]

    axes[1].scatter(lrs, best_errors, s=100, c=colors, edgecolors='black')
    for lr, err, it in zip(lrs, best_errors, best_iters):
        axes[1].annotate(
            f'iter={it}', (lr, err),
            textcoords="offset points", xytext=(0, 10),
            ha='center', fontsize=9
        )
    axes[1].set_xlabel('Learning Rate')
    axes[1].set_ylabel('Best Validation MSE')
    axes[1].set_title('Optimal Performance vs Learning Rate')
    axes[1].set_xscale('log')

    plt.tight_layout()
    plt.savefig('learning_rate_analysis.png', dpi=150)
    plt.show()


# Run the experiment
if __name__ == "__main__":
    # Generate challenging regression problem
    X, y = make_regression(
        n_samples=2000, n_features=20, n_informative=10,
        noise=20, random_state=42
    )
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.3, random_state=42
    )

    results = analyze_learning_rate_tradeoff(X_train, X_val, y_train, y_val)
    plot_learning_curves(results)
```

To rigorously understand shrinkage, we examine its effect on the bias-variance decomposition and derive connections to regularization theory.
Consider the ensemble prediction at iteration $M$:
$$F_M(x) = F_0(x) + \nu \sum_{m=1}^{M} h_m(x)$$
The expected squared error can be decomposed as:
$$\mathbb{E}[(y - F_M(x))^2] = \text{Bias}^2[F_M(x)] + \text{Var}[F_M(x)] + \sigma^2_{\text{noise}}$$
Effect on Bias: Shrinkage increases bias by preventing the model from fully fitting the training data. With $M$ iterations at learning rate $\nu$, the model has effectively taken $M \cdot \nu$ 'units' of steps toward fitting the data. Smaller $\nu$ means higher bias for fixed $M$.
Effect on Variance: This is where shrinkage shines. The variance of the ensemble is reduced because:
$$\text{Var}[F_M] = \nu^2 \sum_{m=1}^{M} \text{Var}[h_m] + \nu^2 \sum_{m \neq m'} \text{Cov}[h_m, h_{m'}]$$
The $\nu^2$ factor in front dramatically reduces variance, especially when base learners are correlated (common in sequential boosting).
Key observation: Variance scales with ν², while bias scales approximately with ν. This quadratic-vs-linear relationship means that for small ν, variance reduction dominates the bias increase, leading to lower total error. This is why aggressive shrinkage is almost universally beneficial.
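This variance claim can be probed empirically. The sketch below is an illustrative experiment under assumed synthetic-data settings (not part of the original analysis): it trains the same boosting configuration on bootstrap resamples and compares how much the predictions at fixed test points vary across resamples, for a full-strength configuration versus a shrunk configuration with matched $M \cdot \nu$. In runs of this kind the shrunk ensemble usually shows noticeably lower prediction variance, though the exact numbers depend on the data.

```python
# Rough Monte Carlo check of variance reduction under shrinkage:
# refit the same configuration on bootstrap resamples and measure the
# spread of predictions at fixed test points across the refits.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X, y = make_regression(n_samples=600, n_features=10, noise=15, random_state=0)
X_test = X[:50]                       # fixed evaluation points
X_pool, y_pool = X[50:], y[50:]       # pool used for bootstrap training sets

configs = {"nu=1.0, M=100": (1.0, 100), "nu=0.1, M=1000": (0.1, 1000)}  # matched M*nu

for name, (nu, M) in configs.items():
    preds = []
    for b in range(20):               # bootstrap resamples of the training pool
        idx = rng.randint(0, len(y_pool), len(y_pool))
        gbm = GradientBoostingRegressor(
            n_estimators=M, learning_rate=nu, max_depth=3, random_state=b
        )
        gbm.fit(X_pool[idx], y_pool[idx])
        preds.append(gbm.predict(X_test))
    preds = np.array(preds)           # shape: (n_resamples, n_test_points)
    # average (over test points) variance of predictions across resamples
    print(f"{name}: mean prediction variance = {preds.var(axis=0).mean():.2f}")
```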
Shrinkage can be interpreted as implicit L2 regularization on the ensemble weights. Consider constraining the sum of squared base learner contributions:
$$\min_{\alpha_1, ..., \alpha_M} \sum_{i=1}^{n} L\left(y_i, F_0(x_i) + \sum_{m=1}^M \alpha_m h_m(x_i)\right) + \lambda \sum_{m=1}^M \alpha_m^2$$
The KKT conditions for this optimization yield coefficients $\alpha_m$ that shrink toward zero as $\lambda$ increases—similar to the effect of using a small learning rate.
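To see the connection in its simplest form, consider a minimal worked case (an illustration, not a derivation from the source): fitting a single fixed base learner $h$ to residuals $r_i$ under squared loss with an L2 penalty. The penalized coefficient is

$$\alpha^{\text{ridge}} \;=\; \underset{\alpha}{\arg\min}\; \frac{1}{2}\sum_{i=1}^{n}\bigl(r_i - \alpha\, h(x_i)\bigr)^2 + \frac{\lambda}{2}\,\alpha^2 \;=\; \frac{\sum_i h(x_i)\, r_i}{\sum_i h(x_i)^2 + \lambda} \;=\; \frac{\sum_i h(x_i)^2}{\sum_i h(x_i)^2 + \lambda}\cdot \alpha^{\text{OLS}}$$

so the penalized solution is the unpenalized least-squares coefficient multiplied by a factor strictly less than one, which is precisely the role the learning rate plays for each tree's contribution.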
Formally, with shrinkage factor $\nu$, after $M$ iterations, each tree contributes with coefficient $\nu$. This is equivalent to constraining the L2 norm of the coefficient vector:
$$\|\alpha\|_2^2 = M \cdot \nu^2$$
Smaller $\nu$ implies a tighter constraint on the overall 'magnitude' of the model, directly corresponding to L2 regularization.
The convergence rate of gradient boosting with shrinkage has been studied extensively. Key results include:
Linear Convergence in the Population Loss: Under suitable conditions (Lipschitz smooth loss, bounded base learner class), gradient boosting with shrinkage $\nu$ achieves:
$$L(F_M) - L(F^*) \leq (1 - \nu \mu)^M \cdot C$$
where $\mu$ is related to the curvature of the loss and $C$ is an initial constant. This geometric decay (what optimization theory calls linear convergence) means that additional iterations can fully compensate for the smaller step size.
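As a quick numerical illustration of this bound, the snippet below computes the smallest $M$ with $(1 - \nu\mu)^M \cdot C$ below a tolerance; $\mu$, $C$, and the tolerance are arbitrary placeholder values chosen only to show how the required iteration count scales roughly like $1/\nu$, not estimates for any real problem.

```python
# Illustrative arithmetic on the convergence bound (1 - nu*mu)^M * C.
import numpy as np

mu, C, target = 0.05, 1.0, 1e-3       # assumed curvature, initial gap, tolerance

for nu in [0.5, 0.1, 0.05, 0.01]:
    # smallest M such that (1 - nu*mu)^M * C <= target
    M = int(np.ceil(np.log(target / C) / np.log(1.0 - nu * mu)))
    print(f"nu={nu:<5} -> roughly {M} iterations to reach the tolerance")
```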
Generalization Bounds: Perhaps more importantly, shrinkage improves generalization bounds. With appropriate early stopping, the generalization error satisfies:
$$R(F_{M^*}) - R(F^*) = O\left(\sqrt{\frac{M^* \cdot \nu \cdot \text{Complexity}(\mathcal{H})}{n}}\right)$$
where $M^*$ is the optimal number of iterations. The product $M^* \cdot \nu$ grows sublinearly with $n$, ensuring generalization.
Selecting the optimal learning rate is part science, part art. Here we synthesize practical wisdom from years of boosting research and application.
Based on extensive empirical studies and Kaggle competition results:
Classification vs. Regression: Classification can often tolerate slightly higher learning rates than regression because the loss landscape tends to be more forgiving near decision boundaries.
Noisy Labels: If you suspect label noise, use smaller learning rates (0.01-0.05). Slow learning allows the model to identify robust patterns rather than memorizing label errors.
High-Dimensional Features: With many features, base learners may overfit more easily. Compensate with smaller learning rates.
Imbalanced Classes: With class imbalance, smaller learning rates help prevent the model from overfitting to the majority class too quickly.
| Scenario | Recommended Range | Notes |
|---|---|---|
| Quick experimentation | 0.1 - 0.3 | Fast iteration for hyperparameter search |
| Standard production model | 0.05 - 0.1 | Good balance of speed and accuracy |
| Maximum accuracy needed | 0.01 - 0.05 | More iterations, better generalization |
| Very small dataset (<1K) | 0.01 - 0.03 | Aggressive regularization needed |
| Very large dataset (>1M) | 0.1 - 0.2 | Data provides implicit regularization |
| High label noise | 0.005 - 0.02 | Slow learning to avoid memorizing errors |
| Competition final submission | 0.01 - 0.03 | Maximize every 0.001 improvement |
```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
import optuna


def tune_learning_rate_with_early_stopping(X, y, task='classification',
                                           max_iter=10000):
    """
    Tune learning rate using Optuna with proper early stopping.

    Key insight: We tune learning_rate and use early stopping to
    automatically find the optimal number of iterations.
    """

    def objective(trial):
        # Define search space for learning rate
        learning_rate = trial.suggest_float(
            'learning_rate', 0.005, 0.3, log=True
        )

        # Other important hyperparameters
        max_depth = trial.suggest_int('max_depth', 2, 8)
        min_samples_split = trial.suggest_int('min_samples_split', 2, 20)
        subsample = trial.suggest_float('subsample', 0.6, 1.0)

        # Create model
        if task == 'classification':
            model = GradientBoostingClassifier(
                n_estimators=max_iter,
                learning_rate=learning_rate,
                max_depth=max_depth,
                min_samples_split=min_samples_split,
                subsample=subsample,
                validation_fraction=0.2,
                n_iter_no_change=20,  # Early stopping
                random_state=42
            )
        else:
            model = GradientBoostingRegressor(
                n_estimators=max_iter,
                learning_rate=learning_rate,
                max_depth=max_depth,
                min_samples_split=min_samples_split,
                subsample=subsample,
                validation_fraction=0.2,
                n_iter_no_change=20,
                random_state=42
            )

        # Cross-validation score
        cv_scores = cross_val_score(
            model, X, y, cv=5,
            scoring='accuracy' if task == 'classification' else 'neg_mean_squared_error'
        )

        return cv_scores.mean()

    # Run optimization
    study = optuna.create_study(direction='maximize')
    study.optimize(objective, n_trials=100)

    print(f"Best learning rate: {study.best_params['learning_rate']:.4f}")
    print(f"Best CV score: {study.best_value:.4f}")
    print(f"Full best params: {study.best_params}")

    return study


def practical_learning_rate_schedule(n_samples, n_features, task='classification'):
    """
    Heuristic for initial learning rate based on data characteristics.

    This provides a starting point; always validate with cross-validation.
    """
    # Base learning rate
    base_lr = 0.1

    # Adjust for dataset size
    if n_samples < 1000:
        size_factor = 0.3   # Small data: reduce LR
    elif n_samples < 10000:
        size_factor = 0.5
    elif n_samples < 100000:
        size_factor = 1.0
    else:
        size_factor = 1.5   # Large data: can use larger LR

    # Adjust for dimensionality
    if n_features > 100:
        dim_factor = 0.5    # High-dim: reduce LR
    elif n_features > 50:
        dim_factor = 0.7
    else:
        dim_factor = 1.0

    # Adjust for task
    task_factor = 1.0 if task == 'classification' else 0.8

    suggested_lr = base_lr * size_factor * dim_factor * task_factor

    # Clamp to reasonable range
    suggested_lr = max(0.01, min(0.3, suggested_lr))

    print(f"Dataset characteristics:")
    print(f"  Samples: {n_samples}")
    print(f"  Features: {n_features}")
    print(f"  Task: {task}")
    print(f"Suggested learning rate: {suggested_lr:.3f}")
    print(f"Suggested iterations: {int(10 / suggested_lr)} - {int(50 / suggested_lr)}")

    return suggested_lr
```

The learning rate does not operate in isolation; it interacts with every other boosting hyperparameter. Understanding these interactions is crucial for effective tuning.
This is the most fundamental interaction. The 'effective model complexity' is approximately proportional to:
$$\text{Effective Complexity} \propto n_{\text{estimators}} \times \text{learning\_rate}$$
Practical Implication: When you decrease the learning rate by a factor of $k$, increase the number of iterations by approximately a factor of $k$ to maintain similar effective complexity. The resulting model typically generalizes better despite similar training performance; a small helper illustrating the scaling appears below.
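A tiny helper reflecting this rule of thumb; the reference configuration in the usage line is hypothetical.

```python
def scale_iterations(ref_lr, ref_n_estimators, new_lr):
    """Suggest an iteration count for new_lr that keeps lr * n_estimators roughly constant."""
    k = ref_lr / new_lr
    return int(round(ref_n_estimators * k))


# e.g. a model tuned at lr=0.1 with 500 trees, retrained at lr=0.02
print(scale_iterations(0.1, 500, 0.02))   # -> 2500
```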
Tree depth and learning rate both control model complexity but in different ways:
Synergy: Shallow trees (depth 2-4) with low learning rates often perform best. Deep trees already capture complex interactions; adding them with large learning rates leads to rapid overfitting.
Guideline: If using deep trees (depth > 5), reduce learning rate to compensate. For very shallow trees (stumps or depth 2), you can use slightly higher learning rates.
Both learning rate and subsampling (row sampling) provide regularization:
Interaction: These regularization effects compound. With aggressive subsampling (0.5-0.7), you might be able to use a slightly higher learning rate. Conversely, with no subsampling, you should use a lower learning rate.
Early stopping and learning rate work synergistically:
Best Practice: When using early stopping, prefer lower learning rates (0.01-0.05). The additional iterations are not wasted—early stopping will terminate when needed.
Many practitioners have converged on a 'sweet spot' configuration: learning_rate=0.01-0.05, max_depth=3-5, subsample=0.7-0.9, n_estimators=1000-5000 with early stopping. This combination consistently performs well across diverse problems.
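As one possible instantiation of this sweet spot, the sketch below uses scikit-learn's GradientBoostingRegressor with built-in early stopping; the exact values sit inside the quoted ranges but remain assumptions to validate on your own data.

```python
# Sweet-spot sketch: small learning rate, shallow trees, subsampling,
# and a generous tree budget trimmed by early stopping.
from sklearn.ensemble import GradientBoostingRegressor

sweet_spot = GradientBoostingRegressor(
    learning_rate=0.03,       # within the 0.01-0.05 range
    max_depth=4,              # shallow trees (depth 3-5)
    subsample=0.8,            # stochastic boosting (0.7-0.9)
    n_estimators=3000,        # cap; early stopping decides the actual count
    validation_fraction=0.1,
    n_iter_no_change=30,
    random_state=42,
)
# sweet_spot.fit(X_train, y_train)  # X_train, y_train: your training data
```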
Shrinkage stands as one of the most elegant and effective regularization techniques in machine learning. The essential insights: (1) multiplying each base learner's contribution by a learning rate $\nu \in (0, 1]$ slows learning and acts as regularization; (2) for a fixed complexity budget $M \cdot \nu$, many small steps (small $\nu$, large $M$) generalize better than few large ones; (3) the price is computation, which early stopping largely mitigates; and (4) the learning rate interacts with tree depth, subsampling, and early stopping, so it should be tuned jointly with them.
You now understand shrinkage (learning rate) as the foundational regularization technique in gradient boosting. This sets the stage for exploring complementary regularization methods: subsampling, tree constraints, early stopping, and explicit L1/L2 penalties. Next, we examine stochastic gradient boosting through subsampling.