In gradient boosting, each decision tree serves as a weak learner—a model that's only slightly better than random guessing. The genius of boosting lies in combining many such weak learners into a powerful ensemble. But how weak should each tree be? How complex? These questions are answered by tree-specific parameters.
The Goldilocks Principle: Trees that are too shallow capture only trivial patterns. Trees that are too deep overfit to training noise. Tree architecture parameters let you dial in exactly the right complexity for your problem—capturing meaningful feature interactions without fitting spurious correlations.
By the end of this page, you will understand how tree depth controls interaction order, the difference between depth-wise and leaf-wise tree growth, how minimum samples and weight constraints regularize splits, the role of histogram binning in modern implementations, and how to configure tree architecture for different problem types.
max_depth is the most intuitive tree complexity control. It limits how many sequential splits can occur from root to leaf, directly constraining the order of feature interactions the tree can capture.
The Depth-Interaction Relationship:
At depth $d$, a decision tree can model interactions involving up to $d$ features. This is because each level can split on a different feature:
```
if age > 30 then ...                                        (depth 1: one feature)
if age > 30 AND income > 50K then ...                       (depth 2: two features)
if age > 30 AND income > 50K AND location = urban then ...  (depth 3: three features)
```

The Practical Implication:
Most real-world patterns involve interactions of 2-5 features. Depths beyond 6-8 rarely capture genuine signal—they instead fit noise through complex, unlikely feature combinations.
| Depth | Max Leaves | Interaction Order | Use Case |
|---|---|---|---|
| 1-2 | 2-4 | 1-2 features | Additive models, high-noise data |
| 3-4 | 8-16 | 3-4 features | Standard problems, many trees |
| 5-6 | 32-64 | 5-6 features | Complex interactions, large datasets |
| 7-10 | 128-1024 | 7-10 features | Very complex patterns, regularized |
| 11+ | 2048+ | High-order | Rarely needed, high overfit risk |
The Depth-Regularization Trade-off:
Deeper trees have more capacity but require stronger regularization to prevent overfitting:
$$\text{Effective Complexity} \propto \text{Depth} \times \text{Trees} / \text{Regularization}$$
When you increase depth, compensate by:

- Lowering the learning rate (or using early stopping to cap n_estimators)
- Increasing reg_lambda and/or reg_alpha
- Raising min_child_weight / min_child_samples
- Subsampling rows and columns more aggressively
```python
import xgboost as xgb
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Create dataset with known interaction complexity
# Using only 5 informative features means shallow trees should suffice
X, y = make_classification(
    n_samples=10000,
    n_features=20,
    n_informative=5,   # Only 5 truly predictive features
    n_redundant=5,
    n_clusters_per_class=2,
    flip_y=0.05,       # Some label noise
    random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# ============================================
# Depth Analysis
# ============================================
print("=== Tree Depth Analysis ===\n")
print(f"{'Depth':<6} {'Train AUC':<12} {'CV AUC':<12} {'Gap':<10} {'Overfit?'}")
print("-" * 52)

for depth in [1, 2, 3, 4, 5, 6, 8, 10, 15]:
    model = xgb.XGBClassifier(
        n_estimators=200, learning_rate=0.1, max_depth=depth,
        random_state=42, verbosity=0
    )

    # Cross-validation score (generalization)
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
    cv_mean = np.mean(cv_scores)

    # Train score (fitting capacity)
    model.fit(X_train, y_train)
    train_pred = model.predict_proba(X_train)[:, 1]
    train_auc = roc_auc_score(y_train, train_pred)

    gap = train_auc - cv_mean
    overfit = "Yes" if gap > 0.03 else "No"
    print(f"{depth:<6} {train_auc:<12.4f} {cv_mean:<12.4f} {gap:<10.4f} {overfit}")

# ============================================
# Optimal Depth with Regularization
# ============================================
print("\n=== Depth with Regularization ===\n")
print("Demonstrating that deeper trees work when regularized:\n")

# No regularization: depth 8 might overfit
model_noreg = xgb.XGBClassifier(
    n_estimators=200, learning_rate=0.1, max_depth=8,
    reg_alpha=0, reg_lambda=0,
    random_state=42, verbosity=0
)
cv_noreg = cross_val_score(model_noreg, X_train, y_train, cv=5, scoring='roc_auc')
print(f"Depth 8, no reg:   CV AUC = {np.mean(cv_noreg):.4f}")

# With regularization: depth 8 works well
model_reg = xgb.XGBClassifier(
    n_estimators=200, learning_rate=0.1, max_depth=8,
    reg_alpha=0.5, reg_lambda=2.0,
    random_state=42, verbosity=0
)
cv_reg = cross_val_score(model_reg, X_train, y_train, cv=5, scoring='roc_auc')
print(f"Depth 8, with reg: CV AUC = {np.mean(cv_reg):.4f}")
```

XGBoost defaults to max_depth=6, which works well for most problems. LightGBM defaults to max_depth=-1 (unlimited) but controls complexity via num_leaves=31. CatBoost defaults to depth=6. For most tabular data, depths between 4 and 8 strike the best balance.
Modern gradient boosting frameworks offer two fundamentally different tree growth strategies. Understanding the distinction is crucial for proper hyperparameter configuration.
Depth-Wise (Level-Wise) Growth:
Grow trees level by level, splitting all leaves at the current depth before proceeding deeper.
```
Level 0: [root]
Level 1: [L1] [L2]
Level 2: [L3] [L4] [L5] [L6]
```
Characteristics:
- Splits every leaf at the current level before going deeper, producing balanced trees
- More conservative: complexity grows predictably with depth
- The XGBoost default (grow_policy='depthwise')

Leaf-Wise Growth:
Always split the leaf with maximum loss reduction, regardless of depth.
```
Split 1: [root] → [L1] [L2]
Split 2: [L2] was best → [L2a] [L2b]
Split 3: [L2b] was best → [L2b1] [L2b2]
```
Characteristics:
- Always splits wherever the loss falls fastest, producing deep, unbalanced trees
- Lower loss for the same number of leaves, but easier to overfit on small data
- The LightGBM default (XGBoost: grow_policy='lossguide' is the equivalent)

The num_leaves vs. max_depth Confusion:
When using leaf-wise growth (LightGBM), num_leaves is the primary complexity control, not max_depth. The relationship is:
$$\text{num\_leaves} \leq 2^{\text{max\_depth}}$$
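To see how a leaf budget maps to an approximate depth, here is a small sketch (the helper name `equivalent_depth` is ours, not a library function); it mirrors the conversion used in the comparison script below:

```python
import math

# Invert num_leaves <= 2**max_depth: the smallest depth whose
# fully-grown tree could hold this many leaves
def equivalent_depth(num_leaves: int) -> int:
    return math.ceil(math.log2(num_leaves + 1))

for leaves in [7, 15, 31, 63, 127, 255]:
    print(f"num_leaves={leaves:3d} -> ~depth {equivalent_depth(leaves)}")
```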
Configuration patterns:
- Pure leaf-wise: max_depth=-1 (unlimited), control complexity purely through num_leaves
- Depth-constrained: set num_leaves very high (e.g., 1024) and let max_depth constrain growth
- Hybrid: num_leaves provides the hard limit, max_depth prevents extreme imbalance

Recommended LightGBM settings:
- num_leaves in range [7, 4095], commonly 31-127
- max_depth either -1 (unlimited) or a safety cap (e.g., 12)
- Keep num_leaves < 2^max_depth when both are set
```python
import xgboost as xgb
import lightgbm as lgb
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Create dataset
X, y = make_classification(n_samples=20000, n_features=30, random_state=42)

# ============================================
# Compare Growth Strategies
# ============================================
print("=== Growth Strategy Comparison ===\n")

# XGBoost Depth-wise (default)
xgb_depthwise = xgb.XGBClassifier(
    n_estimators=200, learning_rate=0.1, max_depth=6,
    grow_policy='depthwise',
    random_state=42, verbosity=0
)
cv_depthwise = cross_val_score(xgb_depthwise, X, y, cv=5, scoring='roc_auc')
print(f"XGBoost Depth-wise (max_depth=6): AUC = {np.mean(cv_depthwise):.4f}")

# XGBoost Leaf-wise (loss-guided)
xgb_lossguide = xgb.XGBClassifier(
    n_estimators=200, learning_rate=0.1,
    max_leaves=64,               # ~equivalent to depth 6
    grow_policy='lossguide',
    random_state=42, verbosity=0
)
cv_lossguide = cross_val_score(xgb_lossguide, X, y, cv=5, scoring='roc_auc')
print(f"XGBoost Leaf-wise (max_leaves=64): AUC = {np.mean(cv_lossguide):.4f}")

# LightGBM (leaf-wise by default)
lgb_leafwise = lgb.LGBMClassifier(
    n_estimators=200, learning_rate=0.1,
    num_leaves=63,
    max_depth=-1,                # Unlimited depth, leaves control complexity
    random_state=42, verbosity=-1
)
cv_lgb_leafwise = cross_val_score(lgb_leafwise, X, y, cv=5, scoring='roc_auc')
print(f"LightGBM Leaf-wise (num_leaves=63): AUC = {np.mean(cv_lgb_leafwise):.4f}")

# ============================================
# num_leaves Configuration Study
# ============================================
print("\n=== num_leaves Impact (LightGBM) ===\n")
for num_leaves in [7, 15, 31, 63, 127, 255]:
    model = lgb.LGBMClassifier(
        n_estimators=200, learning_rate=0.1,
        num_leaves=num_leaves, max_depth=-1,
        random_state=42, verbosity=-1
    )
    cv_scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
    equiv_depth = int(np.ceil(np.log2(num_leaves + 1)))
    print(f"num_leaves={num_leaves:3d} (~depth {equiv_depth}): "
          f"AUC = {np.mean(cv_scores):.4f} (+/- {np.std(cv_scores):.4f})")

# ============================================
# Combining num_leaves and max_depth
# ============================================
print("\n=== Combining num_leaves and max_depth ===\n")
configs = [
    (31, -1, "31 leaves, unlimited depth"),
    (31, 6, "31 leaves, max_depth=6"),
    (127, -1, "127 leaves, unlimited depth"),
    (127, 6, "127 leaves, max_depth=6 (constrained)"),
    (127, 10, "127 leaves, max_depth=10"),
]
for num_leaves, max_depth, desc in configs:
    model = lgb.LGBMClassifier(
        n_estimators=200, learning_rate=0.1,
        num_leaves=num_leaves, max_depth=max_depth,
        random_state=42, verbosity=-1
    )
    cv_scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
    print(f"{desc}: AUC = {np.mean(cv_scores):.4f}")
```

Leaf-wise growth can create very deep trees if num_leaves is high and max_depth is unlimited. For small datasets (< 10K samples), either use num_leaves ≤ 31 or set a max_depth cap (e.g., 8). Leaf-wise growth's advantages emerge primarily with larger datasets.
Beyond depth and leaf count, gradient boosting frameworks provide fine-grained control over split decisions through minimum sample and weight constraints. These parameters prevent the creation of leaves based on too few observations.
Why Constrain Leaf Size?
Leaves with very few samples are statistically unreliable:

- Their leaf values average only a handful of observations, so estimates have high variance
- They tend to memorize label noise rather than genuine signal
- The tiny subgroups they isolate rarely recur in new data
Key Parameters:
| Concept | XGBoost | LightGBM | CatBoost |
|---|---|---|---|
| Min samples in leaf/split | min_child_weight* | min_child_samples | min_data_in_leaf |
| Min split gain | gamma | min_split_gain | Implicit in l2_leaf_reg |
Understanding min_child_weight (XGBoost):
Unlike LightGBM's simple sample count, XGBoost's min_child_weight is the minimum sum of instance weights (Hessians) in a child node:

- For regression with squared error, the Hessian is $h_i = 1$ for every sample, so min_child_weight equals the minimum sample count directly.
- For classification with log loss, the Hessian is $h_i = p_i(1-p_i)$ where $p_i$ is the predicted probability. Near the decision boundary ($p_i \approx 0.5$), $h_i \approx 0.25$, so min_child_weight=1 requires roughly four boundary samples.
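A minimal numpy sketch of this Hessian arithmetic, assuming log loss and illustrative probability values:

```python
import numpy as np

# Log-loss Hessian per sample: h = p * (1 - p)
p_boundary = np.array([0.5, 0.5, 0.5, 0.5])    # four maximally uncertain samples
print(np.sum(p_boundary * (1 - p_boundary)))   # 1.0 -> satisfies min_child_weight=1

p_confident = np.array([0.95, 0.95, 0.95, 0.95])  # four confident samples
print(np.sum(p_confident * (1 - p_confident)))    # 0.19 -> does NOT satisfy it
```

Note the asymmetry: confidently classified samples contribute little weight, so min_child_weight effectively constrains leaves built from uncertain, boundary-region samples.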
Practical Guidance:
| Dataset Size | XGBoost min_child_weight | LightGBM min_child_samples | Rationale |
|---|---|---|---|
| < 1,000 | 5-20 | 20-100 | Strong constraint to prevent overfit |
| 1,000 - 10,000 | 1-5 | 10-50 | Moderate constraint |
| 10,000 - 100,000 | 1-3 | 5-20 | Light constraint sufficient |
| > 100,000 | 1 | 1-10 | Large data is inherently regularizing |
```python
import xgboost as xgb
import lightgbm as lgb
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Analyze min-sample constraints on datasets of different sizes
def analyze_min_samples(n_samples, name):
    X, y = make_classification(
        n_samples=n_samples, n_features=20, n_informative=10,
        flip_y=0.05, random_state=42
    )
    print(f"\n=== {name} ({n_samples:,} samples) ===\n")

    # XGBoost min_child_weight analysis
    print("XGBoost min_child_weight:")
    for mcw in [1, 3, 5, 10, 20, 50]:
        model = xgb.XGBClassifier(
            n_estimators=100, learning_rate=0.1, max_depth=6,
            min_child_weight=mcw,
            random_state=42, verbosity=0
        )
        cv_scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
        print(f"  min_child_weight={mcw:2d}: AUC = {np.mean(cv_scores):.4f}")

    # LightGBM min_child_samples analysis
    print("\nLightGBM min_child_samples:")
    for mcs in [1, 5, 10, 20, 50, 100]:
        model = lgb.LGBMClassifier(
            n_estimators=100, learning_rate=0.1, num_leaves=63,
            min_child_samples=mcs,
            random_state=42, verbosity=-1
        )
        cv_scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
        print(f"  min_child_samples={mcs:3d}: AUC = {np.mean(cv_scores):.4f}")

# Analyze for different dataset sizes
analyze_min_samples(1000, "Small Dataset")
analyze_min_samples(10000, "Medium Dataset")
analyze_min_samples(50000, "Large Dataset")

# ============================================
# Interaction with Tree Depth
# ============================================
print("\n=== min_child_samples × max_depth Interaction ===\n")
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

for depth in [4, 6, 8, 10]:
    for mcs in [5, 20, 50]:
        model = lgb.LGBMClassifier(
            n_estimators=100, learning_rate=0.1,
            max_depth=depth, num_leaves=2**depth,
            min_child_samples=mcs,
            random_state=42, verbosity=-1
        )
        cv = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
        print(f"depth={depth:2d}, min_samples={mcs:2d}: AUC = {np.mean(cv):.4f}")
    print()
```

Start with min_child_samples = max(20, n_samples/1000). This scales the constraint with data size: small datasets get a strong floor of 20, while on large datasets the minimum stays a tiny fraction (0.1%) of the data. Increase it if you observe overfitting; decrease it if underfitting.
Modern gradient boosting implementations use histogram-based split finding for efficiency. Instead of evaluating every unique feature value as a potential split point, features are discretized into bins. Understanding binning parameters helps optimize the speed-accuracy tradeoff.
How Histogram Binning Works:
1. For each feature, sort the observed values and divide them into at most max_bin intervals
2. Replace raw values with bin indices and accumulate gradient/Hessian statistics per bin
3. Evaluate candidate splits only at bin boundaries rather than at every unique value
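A simplified sketch of quantile binning with numpy; real implementations also handle ties, missing values, and min_data_in_bin, which this ignores:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=10_000)   # one continuous feature
max_bin = 255

# Quantile-based bin edges, built once per feature before training
edges = np.quantile(x, np.linspace(0, 1, max_bin + 1)[1:-1])
bins = np.digitize(x, edges)  # each sample mapped to a bin index

# Split finding now scans at most max_bin candidate thresholds
print(len(np.unique(x)), "unique values ->", bins.max() + 1, "bins")
```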
Performance Impact:

Binning cuts split finding from scanning every unique feature value to scanning at most max_bin buckets per feature, reducing both compute and memory, usually with negligible accuracy loss.

| Parameter | XGBoost | LightGBM | CatBoost | Default |
|---|---|---|---|---|
| Max bins | max_bin | max_bin | border_count | 256/255/254 |
| Min data per bin | — | min_data_in_bin | — | —/3/— |
| Binning method | tree_method | Auto | Auto | Various |
max_bin Selection Guidelines:
When Fewer Bins Help:

- Very large datasets where training time or memory is the bottleneck
- Noisy features, where coarse bins act as mild regularization
- Features with few distinct values to begin with
When More Bins Help:

- High-precision continuous features where exact thresholds matter
- Small-to-medium datasets where speed is not a concern
- When coarser binning appears to cause underfitting
```python
import xgboost as xgb
import lightgbm as lgb
import numpy as np
import time
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score

# Create dataset with continuous features requiring fine-grained splits
X, y = make_regression(
    n_samples=50000,
    n_features=20,
    n_informative=10,
    noise=10,
    random_state=42
)

# ============================================
# max_bin Impact Analysis
# ============================================
print("=== max_bin Impact (LightGBM) ===\n")
print(f"{'max_bin':<10} {'CV R²':<12} {'Time (s)':<10}")
print("-" * 32)

for max_bin in [16, 32, 64, 128, 255, 512]:
    start = time.time()
    model = lgb.LGBMRegressor(
        n_estimators=100, learning_rate=0.1, num_leaves=63,
        max_bin=max_bin,
        random_state=42, verbosity=-1
    )
    cv_scores = cross_val_score(model, X, y, cv=3, scoring='r2')
    elapsed = time.time() - start
    print(f"{max_bin:<10} {np.mean(cv_scores):<12.4f} {elapsed:<10.2f}")

# ============================================
# XGBoost tree_method impact
# ============================================
print("\n=== XGBoost tree_method Comparison ===\n")
print(f"{'Method':<12} {'CV R²':<12} {'Time (s)':<10}")
print("-" * 34)

for tree_method in ['exact', 'hist', 'approx']:
    try:
        start = time.time()
        model = xgb.XGBRegressor(
            n_estimators=100, learning_rate=0.1, max_depth=6,
            tree_method=tree_method,
            random_state=42, verbosity=0
        )
        cv_scores = cross_val_score(model, X, y, cv=3, scoring='r2')
        elapsed = time.time() - start
        print(f"{tree_method:<12} {np.mean(cv_scores):<12.4f} {elapsed:<10.2f}")
    except Exception as e:
        print(f"{tree_method:<12} Error: {str(e)[:30]}")

# ============================================
# min_data_in_bin effect (LightGBM)
# ============================================
print("\n=== min_data_in_bin Effect (LightGBM) ===\n")
for min_data in [1, 3, 5, 10, 20]:
    model = lgb.LGBMRegressor(
        n_estimators=100, learning_rate=0.1, num_leaves=63,
        max_bin=255, min_data_in_bin=min_data,
        random_state=42, verbosity=-1
    )
    cv_scores = cross_val_score(model, X, y, cv=3, scoring='r2')
    print(f"min_data_in_bin={min_data:2d}: R² = {np.mean(cv_scores):.4f}")
```

The default max_bin=255 is a well-tuned choice that works for nearly all problems. Only reduce it (to 64-128) when training time is critical on large datasets, or increase it (to 512+) when you have known high-precision features where binning might lose important distinctions.
Beyond structural constraints, you can require that splits produce a minimum improvement in the loss function. This directly prevents splits that don't meaningfully improve predictions.
gamma / min_split_gain:
This parameter sets the minimum loss reduction required to make a split. The split is only made if:
$$\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] > \gamma$$
Where:

- $G_L, G_R$: sums of gradients over the samples in the left and right children
- $H_L, H_R$: sums of Hessians over the samples in the left and right children
- $\lambda$: L2 regularization on leaf weights (reg_lambda)
- $\gamma$: the minimum gain threshold (gamma)
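A small sketch that evaluates this formula for hypothetical gradient/Hessian sums (the function name split_gain is ours):

```python
def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    # XGBoost-style structure gain for a candidate split
    return 0.5 * (G_L**2 / (H_L + lam)
                  + G_R**2 / (H_R + lam)
                  - (G_L + G_R)**2 / (H_L + H_R + lam)) - gamma

# A split that cleanly separates opposing gradients yields a large gain...
print(split_gain(G_L=-8.0, H_L=10.0, G_R=9.0, H_R=12.0))   # ~6.0
# ...while a split leaving similar gradients on both sides barely helps,
# so even a small gamma would prune it
print(split_gain(G_L=0.5, H_L=10.0, G_R=0.7, H_R=12.0))    # ~-0.001
```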
The Pruning Effect:
Higher gamma values prune more aggressively:

- gamma=0: any split with positive gain is kept
- Small gamma (0.01-0.1): marginal splits that barely reduce the loss are pruned
- Large gamma (0.5+): only strong splits survive, yielding much smaller trees
Practical Gamma Values:

- 0 (default): no gain threshold; rely on other constraints
- 0.01-0.1: mild pruning, a good first range to try
- 0.5-5: aggressive pruning for very noisy data or deep trees
When to Use Gamma:

- Overfitting persists despite sensible depth and min-sample settings
- You want deep trees for complex interactions but need weak splits pruned away
- Labels are noisy and many candidate splits fit that noise
```python
import xgboost as xgb
import lightgbm as lgb
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Create noisy dataset
X, y = make_classification(
    n_samples=5000,
    n_features=30,
    n_informative=10,
    n_redundant=10,
    flip_y=0.1,  # 10% label noise
    random_state=42
)

# ============================================
# Gamma (min_split_gain) Analysis
# ============================================
print("=== XGBoost gamma (min split gain) ===\n")
for gamma in [0, 0.001, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0]:
    model = xgb.XGBClassifier(
        n_estimators=200, learning_rate=0.1, max_depth=6,
        gamma=gamma,
        random_state=42, verbosity=0
    )
    cv_scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')

    # Count average number of leaves (proxy for tree complexity)
    model.fit(X, y)
    trees = model.get_booster().trees_to_dataframe()
    avg_leaves = trees[trees['Feature'] == 'Leaf'].groupby('Tree').size().mean()

    print(f"gamma={gamma:5.3f}: AUC = {np.mean(cv_scores):.4f}, "
          f"avg_leaves = {avg_leaves:.1f}")

# ============================================
# LightGBM min_split_gain
# ============================================
print("\n=== LightGBM min_split_gain ===\n")
for msg in [0, 0.001, 0.01, 0.05, 0.1, 0.5, 1.0]:
    model = lgb.LGBMClassifier(
        n_estimators=200, learning_rate=0.1, num_leaves=63,
        min_split_gain=msg,
        random_state=42, verbosity=-1
    )
    cv_scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
    print(f"min_split_gain={msg:5.3f}: AUC = {np.mean(cv_scores):.4f}")

# ============================================
# Interaction: Deep Trees + High Gamma
# ============================================
print("\n=== Deep Trees with Gamma Regularization ===\n")
print("Showing that gamma allows deeper trees without overfitting:\n")

configs = [
    (6, 0, "depth=6, gamma=0 (baseline)"),
    (10, 0, "depth=10, gamma=0 (overfits)"),
    (10, 0.1, "depth=10, gamma=0.1 (regularized)"),
    (15, 0, "depth=15, gamma=0 (overfits more)"),
    (15, 0.5, "depth=15, gamma=0.5 (regularized)"),
]

for depth, gamma, desc in configs:
    model = xgb.XGBClassifier(
        n_estimators=200, learning_rate=0.1, max_depth=depth,
        gamma=gamma,
        random_state=42, verbosity=0
    )
    cv_scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
    print(f"{desc}: AUC = {np.mean(cv_scores):.4f}")
```

Start with gamma=0 and tune other parameters first. If you observe overfitting despite proper depth/min_samples, try gamma in [0.01, 0.1, 0.5]. Gamma is particularly useful when you want deep trees (for complex interactions) but need to prevent weak splits.
Feature subsampling at the tree level is a powerful regularization technique borrowed from Random Forests. Gradient boosting extends this concept with sampling at multiple granularities.
Sampling Granularities:
| Sampling Level | XGBoost | LightGBM | Effect |
|---|---|---|---|
| Per tree | colsample_bytree | feature_fraction | Each tree sees different features |
| Per level | colsample_bylevel | — | Each tree level resamples |
| Per node | colsample_bynode | feature_fraction_bynode | Each split considers different features |
Combined Effect:
In XGBoost, these sampling fractions multiply:
$$\text{features\_at\_split} = \text{total\_features} \times \text{colsample\_bytree} \times \text{colsample\_bylevel} \times \text{colsample\_bynode}$$
Example: 100 features with colsample_bytree=0.8, colsample_bylevel=0.8, colsample_bynode=0.8:

- Per tree: 100 × 0.8 = 80 features available
- Per level: 80 × 0.8 = 64 features available
- Per split: 64 × 0.8 ≈ 51 features actually considered
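A one-liner check of this arithmetic (the helper features_at_split is illustrative, not a library function):

```python
# Expected feature count at a single split when the three sampling
# levels stack multiplicatively (numbers from the example above)
def features_at_split(total, bytree=1.0, bylevel=1.0, bynode=1.0):
    return total * bytree * bylevel * bynode

print(features_at_split(100, 0.8, 0.8, 0.8))  # 51.2 -> ~51 features per split
print(features_at_split(100, bytree=0.8))     # 80.0 -> per-tree sampling only
```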
Why Multi-Level Sampling Helps:

- Decorrelates trees, so ensemble averaging cancels more variance
- Prevents a few dominant features from appearing in every split
- Forces the model to discover backup signal in correlated features
```python
import xgboost as xgb
import lightgbm as lgb
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Create dataset with many features (some redundant)
X, y = make_classification(
    n_samples=10000,
    n_features=50,
    n_informative=15,
    n_redundant=20,
    n_clusters_per_class=3,
    random_state=42
)

# ============================================
# colsample_bytree Analysis
# ============================================
print("=== colsample_bytree (XGBoost) ===\n")
for cs in [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]:
    model = xgb.XGBClassifier(
        n_estimators=200, learning_rate=0.1, max_depth=6,
        colsample_bytree=cs,
        random_state=42, verbosity=0
    )
    cv_scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
    print(f"colsample_bytree={cs:.1f}: AUC = {np.mean(cv_scores):.4f}")

# ============================================
# Multi-level Sampling
# ============================================
print("\n=== Multi-level Column Sampling (XGBoost) ===\n")
configs = [
    {'colsample_bytree': 1.0, 'colsample_bylevel': 1.0, 'colsample_bynode': 1.0},
    {'colsample_bytree': 0.8, 'colsample_bylevel': 1.0, 'colsample_bynode': 1.0},
    {'colsample_bytree': 1.0, 'colsample_bylevel': 0.8, 'colsample_bynode': 1.0},
    {'colsample_bytree': 1.0, 'colsample_bylevel': 1.0, 'colsample_bynode': 0.8},
    {'colsample_bytree': 0.8, 'colsample_bylevel': 0.8, 'colsample_bynode': 0.8},
]

for config in configs:
    model = xgb.XGBClassifier(
        n_estimators=200, learning_rate=0.1, max_depth=6,
        **config,
        random_state=42, verbosity=0
    )
    cv_scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
    print(f"bytree={config['colsample_bytree']:.1f}, "
          f"bylevel={config['colsample_bylevel']:.1f}, "
          f"bynode={config['colsample_bynode']:.1f}: "
          f"AUC = {np.mean(cv_scores):.4f}")

# ============================================
# LightGBM feature_fraction variants
# ============================================
print("\n=== LightGBM feature_fraction ===\n")
for ff in [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]:
    model = lgb.LGBMClassifier(
        n_estimators=200, learning_rate=0.1, num_leaves=63,
        feature_fraction=ff,
        random_state=42, verbosity=-1
    )
    cv_scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
    print(f"feature_fraction={ff:.1f}: AUC = {np.mean(cv_scores):.4f}")
```

For most problems, use colsample_bytree (or feature_fraction) between 0.6-0.9. Values below 0.5 often hurt performance. colsample_bynode provides additional diversity but may require more trees. Start with bytree only, add bynode if you need more regularization.
Tree-specific parameters control the complexity of individual weak learners. Proper configuration balances expressiveness against overfitting risk.
What's Next:
Having covered tree architecture, we'll explore regularization parameters in the next page—the L1/L2 penalties and shrinkage techniques that prevent overfitting through explicit complexity penalties.
You now understand the full spectrum of tree-specific parameters: depth and leaves, minimum samples, binning, split gain thresholds, and column sampling. These controls let you engineer weak learners with precisely the right complexity for your problem.