Of all gradient boosting hyperparameters, none are more fundamental than the learning rate (also called shrinkage or step size) and the number of boosting iterations (trees in the ensemble). These two parameters are inextricably linked through a simple but profound relationship: they jointly control how the model learns from residual errors.
The Core Insight: A gradient boosting model with 100 trees at learning rate 0.1 achieves roughly the same training loss as one with 1000 trees at learning rate 0.01. But their generalization behavior differs dramatically. Understanding this relationship—and knowing when to favor many small steps versus fewer large steps—separates practitioners who achieve good results from those who achieve exceptional ones.
By the end of this page, you will understand the mathematical role of learning rate in gradient boosting, the empirical relationship between learning rate and optimal tree count, why lower learning rates typically improve generalization, strategies for setting and tuning these parameters efficiently, and the art of using early stopping to automate iteration selection.
To understand learning rate deeply, we must examine its role in the gradient boosting update equation. At each iteration $m$, gradient boosting fits a new tree $h_m(x)$ to the pseudo-residuals and updates the ensemble prediction:
$$F_m(x) = F_{m-1}(x) + \nu \cdot h_m(x)$$
Where:

- $F_{m-1}(x)$ is the ensemble prediction after $m-1$ iterations,
- $h_m(x)$ is the new tree fit to the pseudo-residuals of the current loss, and
- $\nu \in (0, 1]$ is the learning rate (shrinkage factor) applied to the tree's contribution.
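To make the update concrete, here is a minimal from-scratch sketch of the loop, assuming squared-error loss (for which the pseudo-residuals are simply $y - F_{m-1}(x)$) and shallow regression trees as the weak learners; the dataset and settings are illustrative only:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression

# Toy data and boosting settings (illustrative values, not tuned)
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
nu, n_rounds = 0.1, 200          # learning rate and number of boosting iterations

F = np.full(len(y), y.mean())    # F_0: initialize with a constant prediction
trees = []

for m in range(n_rounds):
    residuals = y - F                          # pseudo-residuals for squared loss
    h = DecisionTreeRegressor(max_depth=3, random_state=m).fit(X, residuals)
    F = F + nu * h.predict(X)                  # F_m = F_{m-1} + nu * h_m
    trees.append(h)

print(f"Training MSE after {n_rounds} rounds at nu={nu}: {np.mean((y - F) ** 2):.2f}")
```

Halving `nu` in this sketch roughly doubles the number of rounds needed to reach the same training error, which is exactly the tradeoff explored below.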
The Shrinkage Interpretation:
The learning rate $\nu$ shrinks each tree's contribution before adding it to the ensemble. This seemingly simple modification has profound effects: each tree corrects only a fraction of the remaining error, so more trees are needed to reach a given training loss, and the accumulated shrinkage acts as a regularizer on the final function.
The Functional Gradient Descent View:
Gradient boosting performs gradient descent in function space. The learning rate controls the step size of this descent:
$$F_m = F_{m-1} - \nu \cdot \nabla_F \mathcal{L}(y, F_{m-1})$$
Where $h_m$ approximates the negative gradient $-\nabla_F \mathcal{L}$. Just as in numerical optimization, step size fundamentally affects convergence: large steps reduce the loss quickly but can overshoot and oscillate, while small steps descend slowly but more stably.
A Key Difference from Standard Gradient Descent:
Unlike neural network training where learning rate primarily affects convergence speed, in gradient boosting the learning rate also changes the type of solution found. Lower learning rates produce ensembles that are more robust to noise and generalize better, even when trained to the same loss.
Friedman (2001) demonstrated empirically that shrinkage acts as a powerful regularizer on the total prediction function F(x): lower learning rates implicitly penalize complex functions, favoring simpler solutions that generalize better. This is why shrinkage is often called 'the most important regularization technique in boosting.'
Learning rate and number of iterations are inversely related: reducing learning rate by factor $k$ requires approximately $k$ times more iterations to achieve similar training loss. But this tradeoff is not symmetric—lower rates with more iterations typically yield better generalization.
The Empirical Relationship:
For most datasets, the following approximate relationship holds:
$$\text{optimal\_iterations}(\nu_2) \approx \text{optimal\_iterations}(\nu_1) \times \frac{\nu_1}{\nu_2}$$
For example, if a model with $\nu_1 = 0.1$ reaches its validation optimum at roughly 500 trees, the same model with $\nu_2 = 0.01$ will typically need on the order of 5000 trees.
Why Lower Rates Win:

Either configuration can reach a similar training loss, but the low-rate, many-trees combination usually generalizes better, for the reasons examined in the mechanisms section below. The relationship isn't perfectly linear, partly because shrinkage changes the strength of the implicit regularization rather than just the number of steps; the experiment below tracks the $\nu \times M$ product across learning rates to show how far it drifts in practice.
```python
import xgboost as xgb
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Create dataset
X, y = make_classification(
    n_samples=10000, n_features=20, n_informative=10,
    n_redundant=5, flip_y=0.05, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)

# ============================================
# Compare Different Learning Rates
# ============================================
learning_rates = [0.3, 0.1, 0.05, 0.01, 0.005]
results = []

print("=== Learning Rate vs. Optimal Iterations ===\n")
for lr in learning_rates:
    model = xgb.XGBClassifier(
        n_estimators=10000,  # High ceiling, early stopping will find optimal
        learning_rate=lr,
        max_depth=6,
        subsample=0.8,
        colsample_bytree=0.8,
        early_stopping_rounds=100,
        random_state=42,
        verbosity=0
    )
    model.fit(
        X_train, y_train,
        eval_set=[(X_val, y_val)],
        verbose=False
    )

    # Evaluate on test set
    test_score = model.score(X_test, y_test)
    results.append({
        'learning_rate': lr,
        'optimal_trees': model.best_iteration,
        'test_accuracy': test_score
    })
    print(f"lr={lr:5.3f}: optimal_trees={model.best_iteration:5d}, "
          f"test_accuracy={test_score:.4f}")

# ============================================
# Analyze the Relationship
# ============================================
print("\n=== Relationship Analysis ===")
print("lr × trees product:")
for r in results:
    product = r['learning_rate'] * r['optimal_trees']
    print(f"  lr={r['learning_rate']:.3f}: {r['learning_rate']:.3f} × "
          f"{r['optimal_trees']} = {product:.1f}")

# The product would be roughly constant if the relationship were perfectly linear
# In practice, lower learning rates often yield slightly better results
```

| Learning Rate | Typical Iterations | Use Case | Training Time |
|---|---|---|---|
| 0.3 - 0.5 | 50 - 200 | Quick prototyping, baseline models | Very Fast |
| 0.1 - 0.2 | 200 - 1000 | Standard production models | Fast |
| 0.05 - 0.1 | 500 - 2000 | Performance-focused production | Moderate |
| 0.01 - 0.05 | 1000 - 5000 | Competition models, maximum accuracy | Slow |
| 0.001 - 0.01 | 5000 - 20000 | Final submission tuning | Very Slow |
Use learning_rate=0.1 for hyperparameter exploration (fast iterations). Once the other parameters are tuned, reduce the rate to 0.01-0.03 and increase iterations proportionally for the final model, as in the sketch below. This strategy maximizes both exploration efficiency and final performance.
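As a sketch of that two-phase workflow (using the XGBoost scikit-learn API already shown above; the dataset, the 0.03 final rate, and the patience values are illustrative assumptions, not prescriptions):

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Phase 1: explore other hyperparameters quickly at learning_rate=0.1
explore = xgb.XGBClassifier(
    n_estimators=2000, learning_rate=0.1, max_depth=6,
    early_stopping_rounds=50, random_state=0, verbosity=0
)
explore.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print(f"Exploration model stopped at {explore.best_iteration} trees")

# Phase 2: keep the tuned structure, lower the rate, and let early stopping
# raise the iteration count for the final model
final = xgb.XGBClassifier(
    n_estimators=20000, learning_rate=0.03, max_depth=6,
    early_stopping_rounds=150, random_state=0, verbosity=0
)
final.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print(f"Final model stopped at {final.best_iteration} trees")
```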
The empirical finding that lower learning rates improve generalization has multiple theoretical explanations. Understanding these mechanisms helps you make informed decisions about learning rate selection.
Mechanism 1: Ensemble Averaging Effect
With many small-contribution trees, the ensemble approximates an average of tree predictions rather than a sum dominated by a few trees. This averaging reduces variance:
$$\text{Var}\left(\sum_{m=1}^{M} \nu \cdot h_m\right) \approx \nu^2 \cdot M \cdot \text{Var}(h) + \text{covariance terms}$$
For a fixed final prediction magnitude $\nu \cdot M \approx c$, the leading variance term becomes $\nu^2 \cdot M \cdot \text{Var}(h) = \frac{c^2}{M} \cdot \text{Var}(h)$, which shrinks as $M$ grows: many lightly weighted trees average away individual-tree variance that a few heavily weighted trees would retain.
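A quick numerical check makes this concrete. The sketch below is a toy simulation under the simplifying assumption that tree outputs are independent with unit variance (so the covariance terms above are ignored); the fixed product $\nu \cdot M = 10$ and the sample counts are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
c = 10.0  # fixed total contribution: nu * M = c

for M in [100, 1000, 10000]:
    nu = c / M
    # Simulate 5000 ensembles, each summing M independent "tree outputs"
    # with unit variance, scaled by the learning rate nu
    tree_outputs = rng.normal(loc=0.0, scale=1.0, size=(5000, M))
    ensemble_predictions = nu * tree_outputs.sum(axis=1)
    print(f"nu={nu:.4f}, M={M:5d}: Var(prediction) ≈ {ensemble_predictions.var():.4f} "
          f"(theory: {c**2 / M:.4f})")
```

The simulated variance falls roughly tenfold each time $M$ grows tenfold, matching the $c^2 \cdot \text{Var}(h) / M$ expression above.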
Mechanism 2: Implicit Regularization Path
Lower learning rates trace a different optimization path through function space: the model approaches the training data gradually, passing through a sequence of progressively more complex fits, so stopping anywhere along the path yields a smoother, more regularized function.
Mechanism 3: Noise Robustness
Each boosting iteration fits pseudo-residuals that contain a mix of genuine signal (systematic error the ensemble has not yet captured) and noise (irreducible randomness in the labels).
With high learning rate, the model might "commit hard" to noisy patterns. With low learning rate, each tree captures less noise, and subsequent trees have opportunities to average out noisy contributions.
Mechanism 4: Flat Minima Preference
Machine learning theory suggests that flatter minima (regions where loss changes slowly with parameter changes) generalize better than sharp minima. Slow optimization with many small steps tends to converge to flatter regions, because small, repeated corrections cannot settle into narrow, sharp basins and the averaging across many trees smooths the effective solution.
Extremely low learning rates (< 0.001) rarely provide meaningful improvements and may even hurt performance if early stopping triggers too early due to flat loss curves. The sweet spot for most problems is between 0.01 and 0.1, with 0.05 being a robust default for performance-focused models.
Early stopping is the preferred method for determining optimal iteration count. Rather than fixing n_estimators, you set a high ceiling and let the algorithm stop when validation performance plateaus.
The Early Stopping Protocol:
1. Set n_estimators to a large value (e.g., 10000)
2. Hold out a validation set and pass it via eval_set
3. Set early_stopping_rounds so training halts once the validation metric stops improving
4. Use the recorded best iteration when making predictions

Configuring Early Stopping:
The key parameter is early_stopping_rounds (the patience or tolerance): training stops once the validation metric has failed to improve for that many consecutive rounds, and the best iteration seen so far is retained.

The right tolerance depends on learning rate: lower rates improve the metric more slowly and with more round-to-round noise, so they need larger patience before stopping. The patience analysis in the example below shows the effect of this setting.
```python
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostClassifier
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Create dataset
X, y = make_classification(n_samples=10000, n_features=20, random_state=42)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# ============================================
# XGBoost Early Stopping
# ============================================
print("=== XGBoost Early Stopping ===")
xgb_model = xgb.XGBClassifier(
    n_estimators=5000,
    learning_rate=0.05,
    max_depth=6,
    early_stopping_rounds=100,  # Stop after 100 rounds of no improvement
    eval_metric='logloss',
    random_state=42,
    verbosity=0
)
xgb_model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=False
)
print(f"Best iteration: {xgb_model.best_iteration}")
print(f"Best score: {xgb_model.best_score:.4f}")
print(f"Test accuracy: {xgb_model.score(X_test, y_test):.4f}")

# ============================================
# LightGBM Early Stopping
# ============================================
print("\n=== LightGBM Early Stopping ===")
lgb_model = lgb.LGBMClassifier(
    n_estimators=5000,
    learning_rate=0.05,
    max_depth=-1,
    num_leaves=63,
    random_state=42,
    verbosity=-1
)
lgb_model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    callbacks=[
        lgb.early_stopping(stopping_rounds=100),
        lgb.log_evaluation(period=0)  # Suppress output
    ]
)
print(f"Best iteration: {lgb_model.best_iteration_}")
print(f"Best score: {lgb_model.best_score_['valid_0']['binary_logloss']:.4f}")
print(f"Test accuracy: {lgb_model.score(X_test, y_test):.4f}")

# ============================================
# CatBoost Early Stopping
# ============================================
print("\n=== CatBoost Early Stopping ===")
cb_model = CatBoostClassifier(
    iterations=5000,
    learning_rate=0.05,
    depth=6,
    early_stopping_rounds=100,
    use_best_model=True,  # Important: restore to best iteration
    random_seed=42,
    verbose=0
)
cb_model.fit(
    X_train, y_train,
    eval_set=(X_val, y_val)
)
print(f"Best iteration: {cb_model.best_iteration_}")
print(f"Test accuracy: {cb_model.score(X_test, y_test):.4f}")

# ============================================
# Patience Analysis: Finding Optimal Tolerance
# ============================================
print("\n=== Patience Analysis ===")
for patience in [20, 50, 100, 200, 500]:
    model = xgb.XGBClassifier(
        n_estimators=10000,
        learning_rate=0.05,
        max_depth=6,
        early_stopping_rounds=patience,
        random_state=42,
        verbosity=0
    )
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
    test_acc = model.score(X_test, y_test)
    print(f"patience={patience:3d}: stopped at {model.best_iteration:4d}, "
          f"test_acc={test_acc:.4f}")
```

Early stopping reliability depends on validation set size. Too small, and the validation metric is noisy, causing premature or delayed stopping. Use at least 10-20% of your data for validation, or consider repeated random splits to reduce variance in the stopping point.
While constant learning rate is standard in gradient boosting, advanced practitioners sometimes use learning rate schedules or decay. These techniques can provide marginal improvements in specific scenarios.
Learning Rate Decay:
Reduce learning rate as training progresses, allowing coarser adjustments early and finer adjustments later:
$$\nu_m = \nu_0 \cdot \text{decay}^m$$
or
$$\nu_m = \frac{\nu_0}{1 + \text{decay} \cdot m}$$
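Most boosting trainers do not expose decay as a single switch, but LightGBM's `reset_parameter` callback can apply a per-iteration schedule. The sketch below assumes the exponential form $\nu_m = \nu_0 \cdot \text{decay}^m$ with illustrative values $\nu_0 = 0.1$ and decay $= 0.995$; the dataset is synthetic and the patience value is arbitrary:

```python
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

nu0, decay = 0.1, 0.995  # initial learning rate and per-iteration decay factor

model = lgb.LGBMClassifier(n_estimators=1000, learning_rate=nu0,
                           verbosity=-1, random_state=0)
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    callbacks=[
        # reset_parameter accepts a function of the iteration index m
        lgb.reset_parameter(learning_rate=lambda m: nu0 * (decay ** m)),
        lgb.early_stopping(stopping_rounds=100),
        lgb.log_evaluation(period=0),  # Suppress per-round output
    ],
)
print(f"Best iteration with decaying learning rate: {model.best_iteration_}")
```

In practice a well-chosen constant rate with early stopping usually performs comparably; treat schedules like this as an optional refinement rather than a default.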
DART: Dropout Additive Regression Trees
DART takes a different approach: it randomly drops a fraction of the previously built trees when fitting each new one. This prevents later trees from over-specializing to correct specific earlier trees and provides ensemble diversity.
Key DART parameters:
- `rate_drop`: Fraction of previous trees to drop (0.05-0.2)
- `skip_drop`: Probability of skipping dropout entirely (0.5)
- `sample_type`: How to sample dropped trees ('uniform' or 'weighted')
```python
import xgboost as xgb
import lightgbm as lgb
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score

# Create dataset
X, y = make_classification(n_samples=10000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ============================================
# XGBoost with DART Booster
# ============================================
print("=== DART Booster Comparison ===\n")

# Standard GBTREE
gbtree_model = xgb.XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    booster='gbtree',  # Standard gradient boosting
    random_state=42,
    verbosity=0
)
gbtree_scores = cross_val_score(gbtree_model, X_train, y_train, cv=5, scoring='roc_auc')
print(f"GBTREE: AUC = {np.mean(gbtree_scores):.4f} (+/- {np.std(gbtree_scores):.4f})")

# DART (Dropout)
dart_model = xgb.XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    booster='dart',
    rate_drop=0.1,          # Drop 10% of trees each iteration
    skip_drop=0.5,          # 50% chance to skip dropout entirely
    sample_type='uniform',
    normalize_type='tree',
    random_state=42,
    verbosity=0
)
dart_scores = cross_val_score(dart_model, X_train, y_train, cv=5, scoring='roc_auc')
print(f"DART:   AUC = {np.mean(dart_scores):.4f} (+/- {np.std(dart_scores):.4f})")

# ============================================
# LightGBM with GOSS (Gradient-based Sampling)
# ============================================
print("\n=== LightGBM Boosting Types ===\n")

for boosting_type in ['gbdt', 'dart', 'goss']:
    params = {
        'n_estimators': 500,
        'learning_rate': 0.05,
        'num_leaves': 63,
        'boosting_type': boosting_type,
        'random_state': 42,
        'verbosity': -1
    }

    # GOSS-specific parameters
    if boosting_type == 'goss':
        params['top_rate'] = 0.2    # Keep top 20% large gradients
        params['other_rate'] = 0.1  # Sample 10% of small gradients

    # DART-specific parameters
    if boosting_type == 'dart':
        params['drop_rate'] = 0.1
        params['skip_drop'] = 0.5

    model = lgb.LGBMClassifier(**params)
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
    print(f"{boosting_type.upper():5s}: AUC = {np.mean(scores):.4f} (+/- {np.std(scores):.4f})")

# ============================================
# Custom Learning Rate Callback
# ============================================
print("\n=== Custom Learning Rate Decay ===\n")

def make_lr_decay_callback(initial_lr=0.1, decay_rate=0.99):
    """Create a callback that decays learning rate each iteration."""
    def callback(env):
        new_lr = initial_lr * (decay_rate ** env.iteration)
        # Note: XGBoost doesn't support runtime LR changes easily
        # This is for illustration - use a fixed LR in practice
        pass
    return callback

# For XGBoost, you typically set learning_rate once
# True decay requires custom training loops or using DART

# Practical approximation: train in phases with decreasing LR
print("Phase Training with Decreasing LR:")
phases = [
    {'learning_rate': 0.1, 'n_estimators': 200},
    {'learning_rate': 0.05, 'n_estimators': 300},
    {'learning_rate': 0.01, 'n_estimators': 500},
]

# This would require incremental training, which XGBoost supports
# but is rarely worth the complexity vs. a single low LR
```

DART can improve generalization for some datasets but has significant downsides: (1) no early stopping during training, (2) slower training due to dropout overhead, (3) inconsistent predictions if using fewer trees than trained. Use GBTREE with a low learning rate as the default; try DART only if overfitting persists.
Based on extensive empirical experience and theoretical understanding, here are actionable guidelines for setting learning rate and iterations.
The Systematic Approach: fix a moderate learning rate (around 0.1) with early stopping while you tune the remaining hyperparameters, then lower the rate for the final model and let early stopping raise the iteration count accordingly. The table below summarizes sensible combinations by scenario.
| Scenario | Recommended LR | Iterations | Rationale |
|---|---|---|---|
| Quick experiment | 0.1 - 0.3 | 100 - 500 | Fast feedback, rough accuracy |
| Development/debugging | 0.1 | < 1000 | Reasonable accuracy, manageable time |
| Production model | 0.05 | 1000 - 2000 | Good accuracy, acceptable training time |
| Best accuracy (time ok) | 0.01 - 0.03 | 2000 - 10000 | Near-optimal performance |
| Competition final | 0.005 - 0.01 | 10000+ | Extract maximum performance |
Red Flags During Training: watch for validation loss that rises while training loss keeps falling (overfitting; lower the learning rate, reduce tree complexity, or stop earlier), early stopping that triggers almost immediately (patience too small for the chosen rate, or a rate so low that the loss curve is nearly flat), and validation loss that never plateaus within your iteration budget (the rate may be too low for the time available).
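One way to check for these red flags is to record both training and validation curves and inspect the gap. The sketch below uses the XGBoost scikit-learn wrapper's `evals_result()` history; the synthetic dataset and the 0.05 gap threshold are illustrative assumptions, not standard values:

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, flip_y=0.1, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = xgb.XGBClassifier(
    n_estimators=2000, learning_rate=0.1, max_depth=6,
    eval_metric='logloss', early_stopping_rounds=100,
    random_state=0, verbosity=0
)
# Track the metric on both the training data and the validation data
model.fit(
    X_train, y_train,
    eval_set=[(X_train, y_train), (X_val, y_val)],
    verbose=False
)

history = model.evals_result()
train_curve = history['validation_0']['logloss']  # first eval_set entry = training data
val_curve = history['validation_1']['logloss']    # second eval_set entry = validation data

best = model.best_iteration
gap = val_curve[best] - train_curve[best]
print(f"Best iteration: {best}, train logloss: {train_curve[best]:.4f}, "
      f"val logloss: {val_curve[best]:.4f}, gap: {gap:.4f}")
if gap > 0.05:  # arbitrary illustrative threshold
    print("Large train/validation gap: consider a lower learning rate or shallower trees.")
```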
Framework-Specific Defaults:
| Framework | Default LR | Default Iterations | Notes |
|---|---|---|---|
| XGBoost | 0.3 | 100 | Aggressive; almost always lower this |
| LightGBM | 0.1 | 100 | Reasonable; adjust based on data size |
| CatBoost | Auto (~0.03) | 1000 | Adaptive; usually good out-of-box |
XGBoost's default learning_rate=0.3 is almost always too high for production models. Always explicitly set learning_rate when using XGBoost. A safe starting point is 0.1, reduced to 0.05 or lower for final models.
Even experienced practitioners make errors when setting learning rate and iterations. Here are the most common pitfalls and how to avoid them.
```python
import xgboost as xgb
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, n_features=20, random_state=42)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# ============================================
# MISTAKE 1: Fixed n_estimators without early stopping
# ============================================
print("=== MISTAKE: Fixed n_estimators ===")
for n_est in [50, 200, 1000, 3000]:
    model = xgb.XGBClassifier(
        n_estimators=n_est,  # Fixed, no early stopping
        learning_rate=0.1,
        max_depth=6,
        random_state=42,
        verbosity=0
    )
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    print(f"n_estimators={n_est:4d}: train={train_acc:.4f}, test={test_acc:.4f}")

print("\n(Note: 1000 trees shows overfit - train >> test)")

# ============================================
# CORRECTION: Use early stopping
# ============================================
print("\n=== CORRECT: Early stopping ===")
model = xgb.XGBClassifier(
    n_estimators=10000,
    learning_rate=0.1,
    max_depth=6,
    early_stopping_rounds=50,
    random_state=42,
    verbosity=0
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print(f"Stopped at {model.best_iteration} trees, test acc={model.score(X_test, y_test):.4f}")

# ============================================
# MISTAKE 2: Reducing LR without adjusting iterations
# ============================================
print("\n=== MISTAKE: Lower LR, same iterations ===")
for lr in [0.1, 0.05, 0.01]:
    model = xgb.XGBClassifier(
        n_estimators=200,  # Fixed at 200
        learning_rate=lr,
        max_depth=6,
        random_state=42,
        verbosity=0
    )
    model.fit(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    print(f"lr={lr}, n_estimators=200: test_acc={test_acc:.4f}")

print("\n(Note: Lower LR with same iterations = undertrained)")

# ============================================
# CORRECTION: Scale iterations with LR
# ============================================
print("\n=== CORRECT: Scale iterations with LR ===")
configs = [
    (0.1, 200),
    (0.05, 400),
    (0.01, 2000),
]
for lr, n_est in configs:
    model = xgb.XGBClassifier(
        n_estimators=n_est,
        learning_rate=lr,
        max_depth=6,
        random_state=42,
        verbosity=0
    )
    model.fit(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    print(f"lr={lr}, n_estimators={n_est}: test_acc={test_acc:.4f}")
```

When in doubt: set learning_rate=0.05, n_estimators=10000, early_stopping_rounds=100. This configuration works well for most datasets and automatically finds the right iteration count. Only deviate when you have specific reasons.
Learning rate and iterations form the foundation of gradient boosting configuration. Mastering their interplay is essential for achieving optimal performance.
What's Next:
With learning rate and iterations mastered, we'll explore tree-specific parameters in the next page: depth, leaves, and split constraints that control the complexity of individual weak learners.
You now understand the fundamental relationship between learning rate and iterations, why lower rates improve generalization, and how to configure early stopping effectively. These insights apply across XGBoost, LightGBM, and CatBoost.