Knowing which hyperparameters to tune is only half the challenge. The other half is how to search the parameter space efficiently. With a dozen or more hyperparameters, exhaustive search is computationally infeasible—a grid with 5 values per parameter and 10 parameters requires 5¹⁰ ≈ 10 million evaluations. Practical tuning requires smart strategies that find excellent configurations with far fewer evaluations.
The Tuning Meta-Problem: Hyperparameter tuning is itself an optimization problem. The objective is clear (maximize validation performance), but the landscape is expensive to evaluate (each point requires training a model) and often non-convex with complex interactions. The strategies we'll cover address this meta-problem with increasing sophistication.
By the end of this page, you will understand when to use grid search and its limitations, why random search often outperforms grid search, how Bayesian optimization models the hyperparameter space, practical multi-stage tuning workflows for production, and framework-specific tools like Optuna, Hyperopt, and built-in tuning APIs.
Grid search systematically evaluates all combinations of discretized parameter values. It's the simplest approach and serves as a useful baseline, but its limitations become severe as dimensionality increases.
The Grid Search Algorithm:
1. Define a grid: each parameter has a list of values
2. For each combination in the Cartesian product:
a. Train model with this configuration
b. Evaluate on validation set
c. Record performance
3. Return best configuration
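To make the loop concrete, here is a minimal sketch of grid search by hand. The train_and_score() helper is a placeholder of my own (a toy formula so the sketch runs); in practice it would train a booster with the given configuration and return its validation score.

```python
from itertools import product

def train_and_score(config):
    """Placeholder for 'train a model and return its validation score'.

    A toy formula so the sketch runs end-to-end; in practice, fit your
    booster with `config` and score it on a held-out validation set.
    """
    return -(config['max_depth'] - 5) ** 2 - 10 * abs(config['learning_rate'] - 0.1)

param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1],
}

best_score, best_config = float('-inf'), None
# Cartesian product of all parameter values: 3 * 2 = 6 configurations
for values in product(*param_grid.values()):
    config = dict(zip(param_grid.keys(), values))
    score = train_and_score(config)   # train + evaluate on the validation set
    if score > best_score:
        best_score, best_config = score, config

print(f"Best config: {best_config}, score: {best_score}")
```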
Complexity Analysis:
With $k$ parameters, each with $n$ values: Total evaluations = $n^k$
This exponential growth is the curse of dimensionality applied to hyperparameter search.
```python
import xgboost as xgb
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

# Create dataset
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

# ============================================
# Small Grid Search (Feasible)
# ============================================
print("=== Grid Search Example ===\n")

param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1],
    'n_estimators': [100, 200],
    'reg_lambda': [0.1, 1, 10]
}

n_combinations = np.prod([len(v) for v in param_grid.values()])
print(f"Grid size: {n_combinations} combinations\n")

model = xgb.XGBClassifier(random_state=42, verbosity=0)

grid_search = GridSearchCV(
    model, param_grid,
    cv=3, scoring='roc_auc',
    n_jobs=-1, verbose=1
)
grid_search.fit(X, y)

print(f"\nBest score: {grid_search.best_score_:.4f}")
print(f"Best params: {grid_search.best_params_}")

# ============================================
# Grid Design Tips
# ============================================
print("\n=== Practical Grid Design ===\n")

# Tip 1: Use logarithmic spacing for learning rate
lr_grid = [10**x for x in np.arange(-3, 0, 0.5)]
print(f"Learning rate (log scale): {[f'{x:.3f}' for x in lr_grid]}")

# Tip 2: Focus on high-impact parameters first
focused_grid = {
    'learning_rate': [0.01, 0.05, 0.1],  # Most important
    'max_depth': [4, 6, 8],              # Second most important
    # Fix less important parameters at defaults
}

# Tip 3: Two-stage grid search
print("\nTwo-stage approach:")
print("Stage 1: Coarse grid on learning_rate, max_depth")
print("Stage 2: Fine grid around best values from Stage 1")

# Stage 1: Coarse
coarse_grid = {
    'learning_rate': [0.01, 0.1, 0.3],
    'max_depth': [3, 6, 9],
    'n_estimators': [100]
}
coarse_search = GridSearchCV(
    xgb.XGBClassifier(random_state=42, verbosity=0),
    coarse_grid, cv=3, scoring='roc_auc'
)
coarse_search.fit(X, y)
print(f"Coarse best: lr={coarse_search.best_params_['learning_rate']}, "
      f"depth={coarse_search.best_params_['max_depth']}")

# Stage 2: Fine around coarse best
best_lr = coarse_search.best_params_['learning_rate']
best_depth = coarse_search.best_params_['max_depth']
fine_grid = {
    'learning_rate': [best_lr * 0.5, best_lr, best_lr * 1.5],
    'max_depth': [max(1, best_depth - 1), best_depth, best_depth + 1],
    'n_estimators': [100, 200, 300]
}
fine_search = GridSearchCV(
    xgb.XGBClassifier(random_state=42, verbosity=0),
    fine_grid, cv=3, scoring='roc_auc'
)
fine_search.fit(X, y)
print(f"Fine best score: {fine_search.best_score_:.4f}")
```

Grid search is appropriate when: (1) you have ≤3 parameters to tune, (2) you have strong priors about good regions, (3) you want deterministic, reproducible results, or (4) you're doing final fine-tuning around a known good configuration. For initial exploration with many parameters, prefer random or Bayesian search.
Random search samples hyperparameter configurations randomly from specified distributions. Despite its simplicity, random search often outperforms grid search, especially in high-dimensional spaces.
The Random Search Algorithm:
1. Define distributions for each parameter
2. For each iteration (up to budget):
a. Sample configuration from distributions
b. Train model with this configuration
c. Evaluate on validation set
d. Record performance
3. Return best configuration seen
Why Random Beats Grid:
The key insight comes from Bergstra & Bengio (2012): In most problems, some parameters matter much more than others. Grid search wastes evaluations on unimportant parameter combinations, while random search samples more values of important parameters.
The Geometric Argument:
Consider 2 parameters: one important (x), one unimportant (y). With 9 evaluations:
Grid (3×3): 3 unique x values, 3 unique y values
Random (9 samples): 9 unique x values, 9 unique y values
Random search explores 3× more values of the important parameter!
This advantage grows with dimensionality. With 10 parameters where only 2-3 matter, random search is dramatically more efficient.
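The effect is easy to see in a toy simulation. The sketch below uses a made-up objective where only x matters and compares the best value found by a 3×3 grid against 9 random samples with the same budget.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy objective: only x matters; y contributes almost nothing
def objective(x, y):
    return np.sin(5 * x) * (1 - x) + 0.01 * y

# Grid: 3 x-values x 3 y-values = 9 evaluations, but only 3 distinct x values
grid_x = np.linspace(0, 1, 3)
grid_y = np.linspace(0, 1, 3)
grid_best = max(objective(x, y) for x in grid_x for y in grid_y)

# Random: 9 evaluations, 9 distinct x values
rand_points = rng.uniform(0, 1, size=(9, 2))
rand_best = max(objective(x, y) for x, y in rand_points)

print(f"Grid best:   {grid_best:.3f}")
print(f"Random best: {rand_best:.3f}")  # usually higher: more x values explored
```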
Distribution Choices:
Random search requires specifying distributions:
```python
import xgboost as xgb
import numpy as np
from scipy.stats import uniform, randint, loguniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.datasets import make_classification

# Create dataset
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

# ============================================
# Random Search with Distributions
# ============================================
print("=== Random Search Example ===\n")

param_distributions = {
    'learning_rate': loguniform(0.01, 0.3),   # Log-uniform: 0.01 to 0.3
    'max_depth': randint(3, 12),              # Uniform integer: 3-11
    'n_estimators': randint(50, 500),         # Uniform integer: 50-499
    'subsample': uniform(0.5, 0.5),           # Uniform: 0.5 to 1.0
    'colsample_bytree': uniform(0.5, 0.5),    # Uniform: 0.5 to 1.0
    'reg_lambda': loguniform(0.1, 10),        # Log-uniform: 0.1 to 10
    'reg_alpha': loguniform(0.01, 5),         # Log-uniform: 0.01 to 5
    'gamma': uniform(0, 0.5),                 # Uniform: 0 to 0.5
    'min_child_weight': randint(1, 20),       # Uniform integer: 1-19
}

model = xgb.XGBClassifier(random_state=42, verbosity=0)

# 50 random samples over a 9-dimensional space
random_search = RandomizedSearchCV(
    model, param_distributions,
    n_iter=50,                 # Number of random samples
    cv=3, scoring='roc_auc',
    random_state=42, n_jobs=-1, verbose=1
)
random_search.fit(X, y)

print(f"\nBest score: {random_search.best_score_:.4f}")
print("Best parameters:")
for param, value in random_search.best_params_.items():
    print(f"  {param}: {value}")

# ============================================
# Compare Grid vs Random (Same Budget)
# ============================================
print("\n=== Grid vs Random Comparison ===\n")

# Grid: 3 values × 3 params = 27 evaluations
grid_params = {
    'learning_rate': [0.01, 0.1, 0.3],
    'max_depth': [4, 6, 8],
    'reg_lambda': [0.1, 1, 10]
}

from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(
    xgb.XGBClassifier(random_state=42, verbosity=0),
    grid_params, cv=3, scoring='roc_auc'
)
grid_search.fit(X, y)
print(f"Grid Search (27 evals): {grid_search.best_score_:.4f}")

# Random: 27 evaluations, same 3 params
random_params = {
    'learning_rate': loguniform(0.01, 0.3),
    'max_depth': randint(3, 10),
    'reg_lambda': loguniform(0.1, 10)
}
random_search_3p = RandomizedSearchCV(
    xgb.XGBClassifier(random_state=42, verbosity=0),
    random_params, n_iter=27, cv=3,
    scoring='roc_auc', random_state=42
)
random_search_3p.fit(X, y)
print(f"Random Search (27 evals): {random_search_3p.best_score_:.4f}")

# ============================================
# Log-Uniform Distribution Importance
# ============================================
print("\n=== Distribution Choice Matters ===\n")

# Bad: Uniform for learning rate
samples_uniform = uniform(0.01, 0.29).rvs(1000, random_state=42)
print(f"Uniform(0.01, 0.3): median={np.median(samples_uniform):.3f}, "
      f"<0.05: {np.mean(samples_uniform < 0.05):.1%}")

# Good: Log-uniform for learning rate
samples_loguniform = loguniform(0.01, 0.3).rvs(1000, random_state=42)
print(f"LogUniform(0.01, 0.3): median={np.median(samples_loguniform):.3f}, "
      f"<0.05: {np.mean(samples_loguniform < 0.05):.1%}")

print("\nLog-uniform puts roughly 3-4× more of its samples below 0.05!")
```

With random search, 60 trials give a 95% probability that at least one sample lands in the top 5% of the search space, independent of the number of parameters. For quick exploration, even 20-30 trials often find good configurations.
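That rule of thumb follows directly from basic probability: if each random trial independently lands in the top 5% of the search space with probability 0.05, the chance that at least one of n trials does is 1 - 0.95^n. A quick check:

```python
# Probability that at least one of n random trials lands in the top 5%
for n in (10, 20, 30, 60, 100):
    p = 1 - 0.95 ** n
    print(f"n = {n:3d} trials -> P(at least one in top 5%) = {p:.1%}")
# n = 60 gives ~95.4%, independent of how many hyperparameters there are
```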
Bayesian optimization treats hyperparameter search as a Bayesian inference problem. It builds a probabilistic model of the objective function and uses it to decide where to evaluate next, balancing exploration (uncertain regions) and exploitation (promising regions).
The Bayesian Optimization Loop:
1. Initialize with a few random evaluations
2. Fit a surrogate model to observed (config, score) pairs
3. Use an acquisition function to select the next config to evaluate
4. Evaluate the selected config
5. Update the surrogate model
6. Repeat 3-5 until budget exhausted
Key Components:
Surrogate Model: Approximates the objective function. Common choices include Gaussian processes, tree-structured Parzen estimators (TPE, used by Optuna and Hyperopt), and random forests (used by SMAC).
Acquisition Function: Decides where to sample next. Common choices include Expected Improvement (EI), Probability of Improvement (PI), and Upper Confidence Bound (UCB), each balancing exploration and exploitation differently; a sketch of EI follows below.
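To make the acquisition step concrete, here is a minimal sketch of Expected Improvement for a maximization problem, assuming a Gaussian surrogate that predicts a mean mu and standard deviation sigma at each candidate configuration (the candidate values below are made up for illustration).

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.01):
    """Expected Improvement for maximization.

    mu, sigma : surrogate mean / std at candidate configs (sigma > 0 assumed)
    f_best    : best observed objective value so far
    xi        : exploration bonus (larger -> more exploration)
    """
    mu, sigma = np.asarray(mu, dtype=float), np.asarray(sigma, dtype=float)
    improvement = mu - f_best - xi
    z = improvement / sigma
    return improvement * norm.cdf(z) + sigma * norm.pdf(z)

# Three hypothetical candidates: pick the one with the highest EI next
mu = [0.84, 0.86, 0.80]      # surrogate means
sigma = [0.01, 0.05, 0.10]   # surrogate uncertainty
print(expected_improvement(mu, sigma, f_best=0.85))
# The uncertain candidates (larger sigma) score higher EI than the near-certain one
```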
Why Bayesian Optimization Excels: Unlike grid or random search, every completed trial informs the next one. The surrogate concentrates evaluations in promising regions while the acquisition function keeps probing uncertain ones, so good configurations are typically found in far fewer evaluations, which matters most when each evaluation means training a model.
Optuna: Modern Bayesian Optimization
Optuna is the recommended tool for Bayesian hyperparameter optimization. Key features: a define-by-run API (the search space is declared inside the objective function), TPE-based Bayesian sampling by default, pruning of unpromising trials, straightforward parallelization, and built-in visualization and parameter-importance utilities.
```python
import xgboost as xgb
import lightgbm as lgb
import numpy as np
import optuna
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split

# Create dataset
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# ============================================
# Optuna Bayesian Optimization
# ============================================
print("=== Optuna Bayesian Optimization ===\n")

def objective(trial):
    """Objective function for Optuna."""
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'reg_lambda': trial.suggest_float('reg_lambda', 0.1, 10.0, log=True),
        'reg_alpha': trial.suggest_float('reg_alpha', 0.01, 5.0, log=True),
        'gamma': trial.suggest_float('gamma', 0, 0.5),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 20),
        'random_state': 42,
        'verbosity': 0
    }

    model = xgb.XGBClassifier(**params)
    score = np.mean(cross_val_score(model, X_train, y_train, cv=3, scoring='roc_auc'))
    return score

# Create and run study
study = optuna.create_study(direction='maximize',
                            sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=50, show_progress_bar=True)

print(f"\nBest score: {study.best_value:.4f}")
print("Best parameters:")
for param, value in study.best_params.items():
    if isinstance(value, float):
        print(f"  {param}: {value:.4f}")
    else:
        print(f"  {param}: {value}")

# ============================================
# Optuna with Pruning (Early Termination)
# ============================================
print("\n=== Optuna with Pruning ===\n")

def objective_with_pruning(trial):
    """Objective with early pruning of bad trials."""
    params = {
        'objective': 'binary',   # LightGBM binary classification
        'verbosity': -1,
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'reg_lambda': trial.suggest_float('reg_lambda', 0.1, 10.0, log=True),
        'random_state': 42,
    }

    # Use LightGBM CV with a callback that reports AUC after every round
    dtrain = lgb.Dataset(X_train, label=y_train)
    pruning_callback = optuna.integration.LightGBMPruningCallback(trial, 'auc')

    cv_results = lgb.cv(
        params, dtrain,
        num_boost_round=500,        # High cap; unpromising trials get pruned
        callbacks=[pruning_callback],
        nfold=3,
        metrics='auc',
        seed=42
    )
    return cv_results['valid auc-mean'][-1]

study_pruned = optuna.create_study(direction='maximize')
study_pruned.optimize(objective_with_pruning, n_trials=30, show_progress_bar=True)

print(f"\nBest score (with pruning): {study_pruned.best_value:.4f}")

# ============================================
# Analyze Optuna Results
# ============================================
print("\n=== Optimization Analysis ===\n")

# Parameter importance
importance = optuna.importance.get_param_importances(study)
print("Parameter importance:")
for param, imp in sorted(importance.items(), key=lambda x: x[1], reverse=True):
    print(f"  {param}: {imp:.3f}")
```

Optuna's pruning feature terminates unpromising trials early based on intermediate results. For gradient boosting, use the integration with LightGBM or XGBoost callbacks to prune trials that are clearly underperforming.
This can reduce total compute by 50% or more.
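The same idea works with XGBoost. Below is a sketch using optuna.integration.XGBoostPruningCallback (depending on your Optuna version this class may require the separate optuna-integration package); the observation key must match the eval set name and metric passed to xgb.train, and the pruner settings here are illustrative assumptions.

```python
import optuna
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
dtrain, dvalid = xgb.DMatrix(X_tr, label=y_tr), xgb.DMatrix(X_val, label=y_val)

def objective(trial):
    params = {
        'objective': 'binary:logistic',
        'eval_metric': 'auc',
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
    }
    # Reports validation AUC after each boosting round; Optuna prunes the
    # trial if it is clearly worse than earlier trials at the same round
    pruning_cb = optuna.integration.XGBoostPruningCallback(trial, 'valid-auc')
    booster = xgb.train(params, dtrain, num_boost_round=300,
                        evals=[(dvalid, 'valid')], callbacks=[pruning_cb],
                        verbose_eval=False)
    # Final validation AUC (eval() returns a string like "[0]  eval-auc:0.93")
    return float(booster.eval(dvalid).split(':')[-1])

study = optuna.create_study(direction='maximize',
                            pruner=optuna.pruners.MedianPruner(n_warmup_steps=20))
study.optimize(objective, n_trials=30)
```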
Real-world hyperparameter tuning is most efficient when done in stages, progressively narrowing the search space as you gain information about what works.
The Three-Stage Framework:
1. Exploration: fast settings (moderate-to-high learning rate, few trees), wide search over structural parameters
2. Refinement: narrow the structure around Stage 1's best values and add regularization
3. Polish: production settings (low learning rate, many trees with early stopping), fine-tune the learning rate
```python
import xgboost as xgb
import numpy as np
import optuna
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split

# Create dataset
X, y = make_classification(n_samples=10000, n_features=30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ============================================
# STAGE 1: Exploration (Fast, Wide Search)
# ============================================
print("=== STAGE 1: Exploration ===\n")

def stage1_objective(trial):
    """Fast exploration with high LR, focus on structure."""
    params = {
        'n_estimators': 100,         # Fixed, fast
        'learning_rate': 0.1,        # High for fast iteration
        'tree_method': 'hist',
        'grow_policy': 'lossguide',  # So max_leaves (XGBoost's num_leaves) takes effect
        'max_depth': trial.suggest_int('max_depth', 3, 12),
        'max_leaves': trial.suggest_int('max_leaves', 7, 255),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 50),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'random_state': 42,
        'verbosity': 0
    }
    model = xgb.XGBClassifier(**params)
    return np.mean(cross_val_score(model, X_train, y_train, cv=3, scoring='roc_auc'))

study1 = optuna.create_study(direction='maximize')
study1.optimize(stage1_objective, n_trials=20)  # ~20 trials
print(f"Stage 1 best: {study1.best_value:.4f}")
print(f"Best structure: depth={study1.best_params['max_depth']}, "
      f"leaves={study1.best_params['max_leaves']}")

# ============================================
# STAGE 2: Refinement (Add Regularization)
# ============================================
print("\n=== STAGE 2: Refinement ===\n")

# Use Stage 1 findings
best_depth = study1.best_params['max_depth']
best_leaves = study1.best_params['max_leaves']
best_subsample = study1.best_params['subsample']
best_colsample = study1.best_params['colsample_bytree']

def stage2_objective(trial):
    """Refine with regularization, moderate LR."""
    params = {
        'n_estimators': 200,
        'learning_rate': 0.05,       # Moderate
        'tree_method': 'hist',
        'grow_policy': 'lossguide',
        # Structure from Stage 1 (narrow range)
        'max_depth': trial.suggest_int('max_depth', max(2, best_depth - 2), best_depth + 2),
        'max_leaves': trial.suggest_int('max_leaves', max(7, best_leaves - 50),
                                        min(255, best_leaves + 50)),
        # Fixed from Stage 1
        'subsample': best_subsample,
        'colsample_bytree': best_colsample,
        # New: Regularization
        'reg_lambda': trial.suggest_float('reg_lambda', 0.1, 10, log=True),
        'reg_alpha': trial.suggest_float('reg_alpha', 0.01, 5, log=True),
        'gamma': trial.suggest_float('gamma', 0, 0.5),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 30),
        'random_state': 42,
        'verbosity': 0
    }
    model = xgb.XGBClassifier(**params)
    return np.mean(cross_val_score(model, X_train, y_train, cv=3, scoring='roc_auc'))

study2 = optuna.create_study(direction='maximize')
study2.optimize(stage2_objective, n_trials=30)  # ~30 trials
print(f"Stage 2 best: {study2.best_value:.4f}")

# ============================================
# STAGE 3: Polish (Low LR, Many Trees)
# ============================================
print("\n=== STAGE 3: Polish ===\n")

# Use Stage 2 findings
stage2_best = study2.best_params

def stage3_objective(trial):
    """Final polish with low LR, early stopping."""
    params = {
        'n_estimators': 2000,        # High, with early stopping
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.05),
        'tree_method': 'hist',
        'grow_policy': 'lossguide',
        'max_depth': stage2_best['max_depth'],
        'max_leaves': stage2_best['max_leaves'],
        'min_child_weight': stage2_best['min_child_weight'],
        'subsample': best_subsample,
        'colsample_bytree': best_colsample,
        'reg_lambda': stage2_best['reg_lambda'],
        'reg_alpha': stage2_best['reg_alpha'],
        'gamma': stage2_best['gamma'],
        'early_stopping_rounds': 50,
        'random_state': 42,
        'verbosity': 0
    }

    # Use train-val split for early stopping
    X_tr, X_v, y_tr, y_v = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
    model = xgb.XGBClassifier(**params)
    model.fit(X_tr, y_tr, eval_set=[(X_v, y_v)], verbose=False)

    # CV score at the early-stopped tree count
    params['n_estimators'] = model.best_iteration
    del params['early_stopping_rounds']
    model_cv = xgb.XGBClassifier(**params)
    return np.mean(cross_val_score(model_cv, X_train, y_train, cv=3, scoring='roc_auc'))

study3 = optuna.create_study(direction='maximize')
study3.optimize(stage3_objective, n_trials=10)  # Fewer trials, more expensive
print(f"Stage 3 best: {study3.best_value:.4f}")

# ============================================
# Final Model
# ============================================
print("\n=== Final Model ===\n")
print(f"Improvement: {study1.best_value:.4f} → {study3.best_value:.4f}")
print(f"Total trials: 20 + 30 + 10 = {20 + 30 + 10}")
```

For a 100-trial budget: spend ~20 on exploration with fast settings, ~50 on refinement with moderate settings, and ~30 on final polish with production settings. This typically outperforms spending all 100 trials on a single-stage search.
Each gradient boosting framework offers built-in or closely integrated tuning capabilities. Understanding these can simplify your workflow.
XGBoost Tuning Features:
xgb.cv() for cross-validated evaluation with built-in early stopping, so the optimal n_estimators falls out almost for free (demonstrated below).

Recommended XGBoost Tuning Order:
1. Fix learning_rate around 0.1 and use xgb.cv() with early stopping to find a reasonable n_estimators
2. Tune tree structure: max_depth and min_child_weight
3. Tune gamma
4. Tune sampling: subsample and colsample_bytree
5. Tune regularization: reg_lambda and reg_alpha
6. Lower learning_rate, re-run early stopping to get more trees, and train the final model
```python
import xgboost as xgb
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Dataset (created here so the snippet is self-contained)
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# XGBoost DMatrix for efficient tuning
dtrain = xgb.DMatrix(X_train, label=y_train)

# Built-in CV with early stopping
params = {
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'max_depth': 6,
    'learning_rate': 0.1,
}

cv_results = xgb.cv(
    params, dtrain,
    num_boost_round=1000,
    nfold=5,
    early_stopping_rounds=50,
    seed=42,
    verbose_eval=False
)

print(f"Optimal trees: {len(cv_results)}")
print(f"Best CV AUC: {cv_results['test-auc-mean'].max():.4f}")
```

Years of hyperparameter tuning experience have revealed common pitfalls and effective practices. Here's distilled wisdom for production-quality tuning.
| Parameter | Importance | Typical Range | Search Scale |
|---|---|---|---|
| learning_rate | Very High | 0.01 - 0.3 | Log |
| n_estimators | Very High | 50 - 5000 | Early stopping |
| max_depth | High | 3 - 12 | Linear |
| num_leaves | High | 7 - 127 | Linear |
| min_child_weight | Medium | 1 - 100 | Linear or Log |
| subsample | Medium | 0.5 - 1.0 | Linear |
| colsample_bytree | Medium | 0.5 - 1.0 | Linear |
| reg_lambda | Medium | 0.1 - 10 | Log |
| reg_alpha | Low-Medium | 0.01 - 5 | Log |
| gamma | Low | 0 - 0.5 | Linear |
When tuning on the same validation set used for evaluation, you're effectively fitting to that set. The more trials you run, the more likely you are to find a configuration that happens to work well on that specific validation set. Use nested cross-validation or a completely held-out test set for final evaluation.
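One way to guard against this is nested cross-validation: an inner loop tunes hyperparameters, while an outer loop scores the tuned model on folds the tuner never saw. Here is a minimal sketch with scikit-learn (the small search space and fold counts are illustrative choices).

```python
import xgboost as xgb
from scipy.stats import loguniform, randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, cross_val_score

X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

# Inner loop: random search over a small space (3-fold CV)
inner_search = RandomizedSearchCV(
    xgb.XGBClassifier(random_state=42, verbosity=0),
    {
        'learning_rate': loguniform(0.01, 0.3),
        'max_depth': randint(3, 10),
    },
    n_iter=20, cv=3, scoring='roc_auc', random_state=42,
)

# Outer loop: each of the 5 folds is scored by a search that never saw it,
# giving an honest estimate of the performance of the tuning procedure itself
outer_scores = cross_val_score(inner_search, X, y, cv=5, scoring='roc_auc')
print(f"Nested CV AUC: {outer_scores.mean():.4f} ± {outer_scores.std():.4f}")
```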
Effective hyperparameter tuning combines the right search strategy with domain knowledge about which parameters matter most. The goal is finding excellent configurations with minimal compute.
Module Complete:
You've now completed the comprehensive guide to hyperparameter tuning for gradient boosting. From understanding which parameters matter (Page 0), to the learning rate-iterations tradeoff (Page 1), tree architecture (Page 2), regularization (Page 3), and efficient search strategies (Page 4), you have the knowledge to systematically optimize XGBoost, LightGBM, and CatBoost models for any problem.
Congratulations! You've mastered hyperparameter tuning for modern gradient boosting frameworks. You can now systematically configure and optimize XGBoost, LightGBM, and CatBoost models, understanding not just what to tune but how to tune efficiently.