So far in this module, we've explored implicit regularization techniques: shrinkage controls contribution magnitude, subsampling adds diversity, tree constraints limit base learner complexity, and early stopping controls iteration count. All these work by indirectly limiting model complexity.
Modern gradient boosting implementations—particularly XGBoost, LightGBM, and CatBoost—go further by adding explicit regularization terms directly to the objective function. These L1 and L2 penalty terms mathematically penalize complex models, just as in regularized linear regression (Lasso and Ridge).
This explicit regularization operates at the tree level, penalizing both the number of leaves and the magnitude of the leaf weights.
Understanding these regularization terms is essential for practitioners because they appear as key hyperparameters (lambda, alpha, reg_lambda, reg_alpha) in every major boosting library.
By the end of this page, you will understand: (1) how L1 and L2 regularization are incorporated into the gradient boosting objective, (2) the mathematical effect of each regularization type on leaf weights, (3) the lambda and alpha parameters in XGBoost and analogous parameters in other libraries, (4) practical guidelines for tuning regularization strength, and (5) when and how to use L1 vs. L2 regularization.
Let's derive the regularized objective function used in modern gradient boosting, following XGBoost's formulation.
Traditional gradient boosting minimizes the empirical loss:
$$\mathcal{L} = \sum_{i=1}^{n} L(y_i, \hat{y}_i)$$
where $L$ is the loss function (e.g., squared error, log loss) and $\hat{y}_i$ is the prediction for sample $i$.
XGBoost adds a regularization term $\Omega$ that penalizes model complexity:
$$\mathcal{L}_{\text{reg}} = \sum_{i=1}^{n} L(y_i, \hat{y}_i) + \sum_{m=1}^{M} \Omega(h_m)$$
For each tree $h_m$ with $T$ leaves and leaf weights $w = (w_1, ..., w_T)$:
$$\Omega(h_m) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 + \alpha \sum_{j=1}^{T} |w_j|$$
Breaking this down:
| Term | Name | Type | Effect |
|---|---|---|---|
| $\gamma T$ | Gamma | Tree complexity | Penalizes number of leaves |
| $\frac{1}{2}\lambda \sum w_j^2$ | Lambda | L2 (Ridge) | Shrinks leaf weights toward zero |
| $\alpha \sum \vert w_j \vert$ | Alpha | L1 (Lasso) | Drives small leaf weights to exactly zero |
XGBoost has three distinct regularization parameters: gamma (min_split_loss) controls tree structure by penalizing splits, lambda (reg_lambda) provides L2 regularization on leaf weights, and alpha (reg_alpha) provides L1 regularization on leaf weights. Each serves a different purpose.
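To make the penalty concrete, here is a small standalone sketch (plain NumPy, not library code; the leaf weights and parameter values are invented for illustration) that evaluates $\Omega(h_m)$ for a hypothetical three-leaf tree:

```python
import numpy as np

def omega(leaf_weights, gamma=0.0, lam=1.0, alpha=0.0):
    """Complexity penalty: Omega(h) = gamma*T + 0.5*lambda*sum(w^2) + alpha*sum(|w|)."""
    w = np.asarray(leaf_weights, dtype=float)
    T = w.size
    return gamma * T + 0.5 * lam * np.sum(w ** 2) + alpha * np.sum(np.abs(w))

# Hypothetical tree with three leaf weights
w = [0.8, -0.3, 0.05]
print(omega(w, gamma=0.0, lam=1.0, alpha=0.0))  # pure L2 penalty: 0.5 * 1 * (0.64 + 0.09 + 0.0025)
print(omega(w, gamma=1.0, lam=1.0, alpha=0.5))  # adds gamma*T = 3 and the L1 term 0.5 * 1.15
```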
At iteration $m$, we add tree $h_m$ to the ensemble. The objective becomes:
$$\mathcal{L}^{(m)} = \sum_{i=1}^{n} L(y_i, F_{m-1}(x_i) + h_m(x_i)) + \Omega(h_m)$$
Using a second-order Taylor expansion (a key XGBoost innovation):
$$\mathcal{L}^{(m)} \approx \sum_{i=1}^{n} \left[ g_i h_m(x_i) + \frac{1}{2} H_i h_m(x_i)^2 \right] + \Omega(h_m) + \text{constant}$$
where $g_i = \partial_{F_{m-1}(x_i)} L(y_i, F_{m-1}(x_i))$ is the gradient (first derivative) of the loss and $H_i = \partial^2_{F_{m-1}(x_i)} L(y_i, F_{m-1}(x_i))$ is the Hessian (second derivative), both evaluated at the current ensemble's prediction $F_{m-1}(x_i)$.
This quadratic approximation enables closed-form optimal leaf weights.
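For concreteness, here is a small sketch of the per-sample gradients and Hessians for two common losses, using the convention $L = \tfrac{1}{2}(y - \hat{y})^2$ for squared error and treating $\hat{y}$ as the raw pre-sigmoid score for log loss. The numbers are illustrative only:

```python
import numpy as np

def squared_error_grad_hess(y, y_pred):
    """L = 0.5*(y - y_pred)^2  ->  g = y_pred - y, H = 1."""
    return y_pred - y, np.ones_like(y)

def logloss_grad_hess(y, raw_score):
    """Binary log loss on the raw (pre-sigmoid) score  ->  g = p - y, H = p*(1 - p)."""
    p = 1.0 / (1.0 + np.exp(-raw_score))
    return p - y, p * (1.0 - p)

y = np.array([1.0, 0.0, 1.0])
raw = np.array([0.2, -1.5, 2.0])
print(squared_error_grad_hess(y, raw))
print(logloss_grad_hess(y, raw))
```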
L2 regularization, controlled by the lambda parameter, adds a penalty proportional to the squared magnitude of leaf weights.
For a leaf $j$ containing sample indices $I_j$, the contribution to the objective is:
$$\sum_{i \in I_j} \left[ g_i w_j + \frac{1}{2} H_i w_j^2 \right] + \frac{1}{2}\lambda w_j^2$$
Setting the derivative with respect to $w_j$ to zero:
$$\sum_{i \in I_j} g_i + \left( \sum_{i \in I_j} H_i + \lambda \right) w_j = 0$$
Solving:
$$w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} H_i + \lambda} = -\frac{G_j}{H_j + \lambda}$$
where $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} H_i$.
Compare to the unregularized optimal weight:
$$w_j^{\text{unregularized}} = -\frac{G_j}{H_j}$$
With L2 regularization:
$$|w_j^*| = \frac{|G_j|}{H_j + \lambda} < \frac{|G_j|}{H_j} = |w_j^{\text{unregularized}}|$$
The weights are shrunk toward zero by a factor of $\frac{H_j}{H_j + \lambda}$. This shrinkage reduces each leaf's contribution, damping the influence of noisy leaves. The table below shows how the shrinkage factor changes with lambda:
| Lambda | Shrinkage Factor | Effect on Leaf Weights |
|---|---|---|
| 0 | 1.0 | No shrinkage (unregularized) |
| 0.1 | H/(H+0.1) | Light shrinkage |
| 1.0 | H/(H+1) | Moderate shrinkage |
| 10 | H/(H+10) | Strong shrinkage |
| 100 | H/(H+100) | Very strong shrinkage |
L2 regularization also provides numerical stability. Without regularization, if a leaf has very few samples (small $H_j$), the optimal weight can become extremely large:
$$w_j = -\frac{G_j}{H_j} \quad \text{(can explode if } H_j \approx 0 \text{)}$$
With $\lambda > 0$, the denominator is bounded away from zero:
$$w_j = -\frac{G_j}{H_j + \lambda} \quad \text{(bounded even if } H_j = 0 \text{)}$$
This is why lambda > 0 is almost always recommended, even if regularization strength isn't the primary concern.
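As a toy illustration of this stability argument (the gradient and Hessian sums below are made up, not taken from a trained model), compare the two formulas as the Hessian sum in a leaf approaches zero:

```python
G_j = -2.5                               # hypothetical gradient sum in a leaf
for H_j in [1.0, 0.1, 1e-6]:
    w_unreg = -G_j / H_j                 # can explode as H_j -> 0
    w_l2 = -G_j / (H_j + 1.0)            # lambda = 1 keeps the denominator bounded
    print(f"H_j={H_j:g}: unregularized w*={w_unreg:.3f}, with lambda=1 w*={w_l2:.3f}")
```

The longer example below then measures the effect of lambda on actual XGBoost models.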
```python
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, cross_val_score
import matplotlib.pyplot as plt


def demonstrate_lambda_effect(X, y):
    """
    Demonstrate the effect of lambda (L2 regularization) on XGBoost.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    lambdas = [0, 0.01, 0.1, 1, 5, 10, 50, 100]
    results = {
        'lambda': [], 'train_rmse': [], 'test_rmse': [], 'cv_rmse': []
    }

    print("Effect of L2 Regularization (lambda) on XGBoost")
    print("=" * 60)

    for lam in lambdas:
        model = xgb.XGBRegressor(
            n_estimators=100,
            learning_rate=0.1,
            max_depth=6,
            reg_lambda=lam,   # L2 regularization
            reg_alpha=0,      # No L1 for this test
            random_state=42,
            verbosity=0
        )

        # CV score
        cv_scores = -cross_val_score(
            model, X, y, cv=5, scoring='neg_root_mean_squared_error'
        )

        # Train final model
        model.fit(X_train, y_train)
        train_pred = model.predict(X_train)
        test_pred = model.predict(X_test)

        train_rmse = np.sqrt(np.mean((y_train - train_pred) ** 2))
        test_rmse = np.sqrt(np.mean((y_test - test_pred) ** 2))
        cv_rmse = np.mean(cv_scores)

        results['lambda'].append(lam)
        results['train_rmse'].append(train_rmse)
        results['test_rmse'].append(test_rmse)
        results['cv_rmse'].append(cv_rmse)

        gap = test_rmse - train_rmse
        print(f"λ={lam:6.2f}: Train RMSE={train_rmse:.4f}, "
              f"Test RMSE={test_rmse:.4f}, Gap={gap:.4f}")

    # Plot
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Training vs Test error
    axes[0].plot(results['lambda'], results['train_rmse'], 'b-o',
                 label='Train RMSE', linewidth=2)
    axes[0].plot(results['lambda'], results['test_rmse'], 'r-o',
                 label='Test RMSE', linewidth=2)
    axes[0].set_xlabel('Lambda (L2 Regularization)', fontsize=12)
    axes[0].set_ylabel('RMSE', fontsize=12)
    axes[0].set_xscale('symlog', linthresh=0.1)
    axes[0].legend()
    axes[0].set_title('Effect of L2 Regularization')
    axes[0].grid(True, alpha=0.3)

    # Generalization gap
    gap = np.array(results['test_rmse']) - np.array(results['train_rmse'])
    axes[1].bar(range(len(lambdas)), gap,
                tick_label=[str(l) for l in lambdas])
    axes[1].set_xlabel('Lambda', fontsize=12)
    axes[1].set_ylabel('Generalization Gap (Test - Train)', fontsize=12)
    axes[1].set_title('Generalization Gap vs Lambda')
    axes[1].axhline(0, color='black', linestyle='--')

    plt.tight_layout()
    plt.savefig('l2_regularization_effect.png', dpi=150)
    plt.show()

    # Find optimal lambda
    best_idx = np.argmin(results['cv_rmse'])
    print(f"\nOptimal lambda: {results['lambda'][best_idx]}")
    print(f"Best CV RMSE: {results['cv_rmse'][best_idx]:.4f}")

    return results


def visualize_weight_shrinkage():
    """
    Visualize how lambda shrinks leaf weights.
    """
    # Simulate: G_j = -10 (gradient sum), H_j varies
    G_j = -10
    H_j_values = np.linspace(0.1, 10, 100)
    lambda_values = [0, 0.5, 1, 2, 5]

    fig, ax = plt.subplots(figsize=(10, 6))

    for lam in lambda_values:
        weights = -G_j / (H_j_values + lam)
        ax.plot(H_j_values, weights, label=f'λ={lam}', linewidth=2)

    ax.set_xlabel('Hessian Sum (H_j)', fontsize=12)
    ax.set_ylabel('Optimal Leaf Weight', fontsize=12)
    ax.set_title('Leaf Weight Shrinkage with L2 Regularization', fontsize=14)
    ax.legend()
    ax.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.savefig('weight_shrinkage.png', dpi=150)
    plt.show()


if __name__ == "__main__":
    # Generate noisy data prone to overfitting
    X, y = make_regression(
        n_samples=500,       # Small dataset
        n_features=50,       # Many features
        n_informative=10,
        noise=20,
        random_state=42
    )

    demonstrate_lambda_effect(X, y)
    visualize_weight_shrinkage()
```

L1 regularization, controlled by the alpha parameter, adds a penalty proportional to the absolute value of leaf weights.
$$\Omega_{L1}(h) = \alpha \sum_{j=1}^{T} |w_j|$$
Unlike L2, the L1 penalty is not differentiable at $w_j = 0$. This creates a sparsity-inducing effect: small weights are pushed exactly to zero.
For small gradient sums, L1 regularization sets weights to exactly zero:
$$w_j^* = 0 \quad \text{if} \quad |G_j| \leq \alpha$$
This means leaves with weak signals (small $|G_j|$) are effectively pruned—their predictions become zero. This is feature selection at the leaf level.
The optimal weight under L1 regularization follows a soft-thresholding formula:
$$w_j^* = -\text{sign}(G_j) \cdot \max\left(0, \frac{|G_j| - \alpha}{H_j}\right)$$
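Here is a minimal sketch of this soft-thresholding rule next to the L2 formula, using the $G_j$, $H_j$ notation above (the numbers are invented for illustration):

```python
import numpy as np

def l1_leaf_weight(G, H, alpha):
    """Soft-thresholded leaf weight: exactly 0 if |G| <= alpha, else shrunk toward zero."""
    return -np.sign(G) * max(0.0, abs(G) - alpha) / H

def l2_leaf_weight(G, H, lam):
    """L2-regularized leaf weight: proportional shrinkage, never exactly zero."""
    return -G / (H + lam)

for G in [-0.3, -1.0, -5.0]:   # leaves with weak, borderline, and strong signal
    print(f"G={G}: L1(alpha=1) -> {l1_leaf_weight(G, H=2.0, alpha=1.0):.3f}, "
          f"L2(lambda=1) -> {l2_leaf_weight(G, H=2.0, lam=1.0):.3f}")
```

Note how the two weak-signal leaves are set to exactly zero under L1, while L2 merely shrinks them.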
Compared to L2 (which shrinks all weights proportionally), L1 applies soft thresholding: it zeroes out leaves with weak signals entirely and subtracts a constant amount from the rest. The table below summarizes the differences:
| Property | L2 (Lambda) | L1 (Alpha) |
|---|---|---|
| Penalty | $\lambda \sum w_j^2$ | $\alpha \sum \vert w_j \vert$ |
| Effect on weights | Proportional shrinkage | Soft thresholding |
| Sparsity | No (never exactly zero) | Yes (zeros out small weights) |
| Stability | High (smooth gradient) | Lower (non-differentiable) |
| Use case | General regularization | Feature/leaf selection |
Use L1 (alpha) when:
- the data is high-dimensional with many uninformative features and you expect only a small subset to matter,
- you want sparser trees in which weak leaves contribute exactly zero,
- you suspect many splits are fitting noise and want their leaves effectively pruned.
Caution: L1 alone provides less stability than L2. In practice, many practitioners use both L1 and L2 (Elastic Net regularization) to get sparsity plus stability.
```python
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, cross_val_score


def demonstrate_alpha_effect(X, y):
    """
    Demonstrate the effect of alpha (L1 regularization) on XGBoost.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    alphas = [0, 0.01, 0.1, 0.5, 1, 5, 10, 50]

    print("Effect of L1 Regularization (alpha) on XGBoost")
    print("=" * 60)

    for alpha in alphas:
        model = xgb.XGBRegressor(
            n_estimators=100,
            learning_rate=0.1,
            max_depth=6,
            reg_lambda=1,       # Keep L2 at baseline
            reg_alpha=alpha,    # Vary L1
            random_state=42,
            verbosity=0
        )

        cv_scores = -cross_val_score(
            model, X, y, cv=5, scoring='neg_root_mean_squared_error'
        )

        model.fit(X_train, y_train)
        train_rmse = np.sqrt(np.mean((y_train - model.predict(X_train)) ** 2))
        test_rmse = np.sqrt(np.mean((y_test - model.predict(X_test)) ** 2))

        # Check feature importance sparsity
        importances = model.feature_importances_
        n_important = np.sum(importances > 0.01)

        print(f"α={alpha:5.2f}: CV RMSE={np.mean(cv_scores):.4f}, "
              f"Test RMSE={test_rmse:.4f}, "
              f"Important features={n_important}/{len(importances)}")


def compare_l1_l2_elastic_net(X, y):
    """
    Compare pure L1, pure L2, and Elastic Net (L1+L2) regularization.
    """
    print("\nComparison: L1-only, L2-only, and Elastic Net")
    print("=" * 60)

    configs = [
        ('L2 only', {'reg_lambda': 1, 'reg_alpha': 0}),
        ('L1 only', {'reg_lambda': 0, 'reg_alpha': 1}),
        ('Elastic Net (both)', {'reg_lambda': 1, 'reg_alpha': 1}),
        ('Strong L2', {'reg_lambda': 10, 'reg_alpha': 0}),
        ('Strong L1', {'reg_lambda': 0, 'reg_alpha': 10}),
        ('Strong Both', {'reg_lambda': 10, 'reg_alpha': 10}),
    ]

    for name, params in configs:
        model = xgb.XGBRegressor(
            n_estimators=100,
            learning_rate=0.1,
            max_depth=6,
            **params,
            random_state=42,
            verbosity=0
        )

        cv_scores = -cross_val_score(
            model, X, y, cv=5, scoring='neg_root_mean_squared_error'
        )

        print(f"{name:18s}: CV RMSE = {np.mean(cv_scores):.4f} "
              f"± {np.std(cv_scores):.4f}")


if __name__ == "__main__":
    # Generate data with many noise features
    X, y = make_regression(
        n_samples=500,
        n_features=100,      # Many features
        n_informative=10,    # Only 10 are useful
        noise=20,
        random_state=42
    )

    demonstrate_alpha_effect(X, y)
    compare_l1_l2_elastic_net(X, y)
```

While not strictly L1/L2 regularization, gamma is the third regularization parameter in XGBoost that controls tree structure.
$$\Omega_{\gamma}(h) = \gamma \cdot T$$
where $T$ is the number of leaves. Gamma adds a constant penalty for each leaf, effectively requiring a minimum improvement to justify a split.
The gain from splitting a node into left and right children is:
$$\text{Gain} = \frac{1}{2}\left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$
Note the $-\gamma$ term at the end. A split is only made if $\text{Gain} > 0$, meaning:
$$\underbrace{\frac{1}{2}\left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{G^2}{H + \lambda} \right]}_{\text{Loss reduction}} > \gamma$$
Typical Values: 0-10 range. Start with 0 and increase if overfitting persists after tuning lambda/alpha.
Gamma is conceptually similar to sklearn's min_impurity_decrease, but operates on XGBoost's regularized objective, not raw impurity. This makes gamma values dependent on the scale of your loss function and other regularization parameters. Tuning gamma often requires experimentation.
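The sketch below evaluates the gain formula for one hypothetical split (invented gradient and Hessian sums, not taken from a trained model) to show how $\gamma$ acts as a minimum required loss reduction:

```python
def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    """Regularized gain of splitting a node into (L, R); the split is kept only if gain > 0."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - score(G_L + G_R, H_L + H_R)) - gamma

# Hypothetical child statistics
G_L, H_L, G_R, H_R = -6.0, 4.0, 5.0, 3.0
for gamma in [0.0, 2.0, 10.0]:
    g = split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=gamma)
    print(f"gamma={gamma}: gain={g:.3f} -> {'split' if g > 0 else 'prune'}")
```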
Each boosting library has its own names for regularization parameters.
| Regularization Type | XGBoost | LightGBM | CatBoost |
|---|---|---|---|
| L2 on leaf weights | reg_lambda (default: 1) | lambda_l2 (default: 0) | l2_leaf_reg (default: 3) |
| L1 on leaf weights | reg_alpha (default: 0) | lambda_l1 (default: 0) | N/A |
| Min split gain | gamma (default: 0) | min_gain_to_split (default: 0) | N/A |
| Min child weight | min_child_weight (default: 1) | min_sum_hessian_in_leaf (default: 1e-3) | min_data_in_leaf (default: 1) |
Note the important differences in defaults:
- XGBoost: L2 is ON by default (lambda=1), L1 is OFF (alpha=0)
- LightGBM: both L1 and L2 are OFF by default
- CatBoost: L2 is ON by default (l2_leaf_reg=3)
This means:
- an untuned XGBoost or CatBoost model is already lightly L2-regularized, while an untuned LightGBM model applies no explicit leaf-weight penalty,
- porting hyperparameters between libraries without accounting for these defaults silently changes the effective regularization,
- a true "no regularization" baseline requires explicitly setting reg_lambda=0 in XGBoost and l2_leaf_reg=0 in CatBoost.
The exact objective function differs slightly between libraries, but the core concepts are the same. LightGBM's objective for L2:
$$\mathcal{L} = \sum_{i} L(y_i, \hat{y}_i) + \frac{\lambda_{L2}}{2} \sum_j w_j^2$$
Note the $1/2$ factor is sometimes absorbed into the parameter definition, which can affect tuning. Always check library documentation for exact formulations.
```python
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostRegressor
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_regression
import numpy as np


def compare_library_regularization(X, y):
    """
    Compare regularization across XGBoost, LightGBM, and CatBoost.
    """
    print("Regularization Comparison Across Libraries")
    print("=" * 70)

    # Regularization settings to test
    settings = ['default', 'no_reg', 'light_l2', 'strong_l2', 'elastic_net']

    configs = {
        'default': {
            'xgb': {'reg_lambda': 1, 'reg_alpha': 0, 'gamma': 0},
            'lgb': {'lambda_l2': 0, 'lambda_l1': 0, 'min_gain_to_split': 0},
            'cat': {'l2_leaf_reg': 3},
        },
        'no_reg': {
            'xgb': {'reg_lambda': 0, 'reg_alpha': 0, 'gamma': 0},
            'lgb': {'lambda_l2': 0, 'lambda_l1': 0, 'min_gain_to_split': 0},
            'cat': {'l2_leaf_reg': 0},
        },
        'light_l2': {
            'xgb': {'reg_lambda': 1, 'reg_alpha': 0},
            'lgb': {'lambda_l2': 1, 'lambda_l1': 0},
            'cat': {'l2_leaf_reg': 1},
        },
        'strong_l2': {
            'xgb': {'reg_lambda': 10, 'reg_alpha': 0},
            'lgb': {'lambda_l2': 10, 'lambda_l1': 0},
            'cat': {'l2_leaf_reg': 10},
        },
        'elastic_net': {
            'xgb': {'reg_lambda': 1, 'reg_alpha': 1},
            'lgb': {'lambda_l2': 1, 'lambda_l1': 1},
            'cat': {'l2_leaf_reg': 1},  # CatBoost doesn't have native L1
        },
    }

    for setting_name in settings:
        print(f"\n{setting_name.upper()}:")

        # XGBoost
        xgb_model = xgb.XGBRegressor(
            n_estimators=100, learning_rate=0.1, max_depth=6,
            **configs[setting_name]['xgb'],
            random_state=42, verbosity=0
        )
        xgb_cv = -cross_val_score(
            xgb_model, X, y, cv=5, scoring='neg_root_mean_squared_error'
        )

        # LightGBM
        lgb_model = lgb.LGBMRegressor(
            n_estimators=100, learning_rate=0.1, num_leaves=63,  # ~ 2^6
            **configs[setting_name]['lgb'],
            random_state=42, verbose=-1
        )
        lgb_cv = -cross_val_score(
            lgb_model, X, y, cv=5, scoring='neg_root_mean_squared_error'
        )

        # CatBoost
        cat_model = CatBoostRegressor(
            iterations=100, learning_rate=0.1, depth=6,
            **configs[setting_name]['cat'],
            random_seed=42, verbose=False
        )
        cat_cv = -cross_val_score(
            cat_model, X, y, cv=5, scoring='neg_root_mean_squared_error'
        )

        print(f"  XGBoost:  {np.mean(xgb_cv):.4f} ± {np.std(xgb_cv):.4f}")
        print(f"  LightGBM: {np.mean(lgb_cv):.4f} ± {np.std(lgb_cv):.4f}")
        print(f"  CatBoost: {np.mean(cat_cv):.4f} ± {np.std(cat_cv):.4f}")


if __name__ == "__main__":
    X, y = make_regression(
        n_samples=1000, n_features=30, n_informative=15,
        noise=20, random_state=42
    )
    compare_library_regularization(X, y)
```

Tuning L1/L2 regularization effectively requires understanding their interactions with other hyperparameters.
Recommended order for tuning boosting hyperparameters:
1. Fix a moderate learning rate (e.g., 0.1) and use early stopping to set the number of trees.
2. Tune tree structure (max_depth or num_leaves, min_child_weight).
3. Tune subsampling (subsample, colsample_bytree).
4. Tune explicit regularization (reg_lambda, reg_alpha, gamma).
5. Lower the learning rate and refit with more trees.
L1/L2 regularization is typically tuned after structural hyperparameters because the optimal regularization strength depends on tree complexity.
L2 (lambda):
- Start at the XGBoost default of 1 and search a coarse grid such as [0, 1, 5, 10, 20] (or a 0-100 range with a tuner).
- Increase it for deeper trees, smaller datasets, or a large train-test gap.

L1 (alpha):
- Start at 0; add it (e.g., values in [0.1, 1, 10]) when the data has many uninformative features or you want sparser leaf predictions.
- Keep some L2 alongside L1 for stability (Elastic Net style).

Gamma:
- Start at 0 and increase (e.g., 0.5-10) only if overfitting persists after tuning lambda and alpha.
- Its scale depends on the loss function, so the search range must be problem-specific.
```python
import optuna
import xgboost as xgb
from sklearn.model_selection import cross_val_score
import numpy as np


def tune_regularization(X, y, fixed_params=None):
    """
    Tune regularization parameters using Optuna.
    This assumes structural parameters (max_depth, etc.) are already set.
    """
    if fixed_params is None:
        fixed_params = {
            'n_estimators': 100,
            'learning_rate': 0.1,
            'max_depth': 6,
            'subsample': 0.8,
            'colsample_bytree': 0.8,
        }

    def objective(trial):
        # Regularization search space
        reg_lambda = trial.suggest_float('reg_lambda', 0, 100, log=False)
        reg_alpha = trial.suggest_float('reg_alpha', 0, 10, log=False)
        gamma = trial.suggest_float('gamma', 0, 10, log=False)
        min_child_weight = trial.suggest_int('min_child_weight', 1, 10)

        model = xgb.XGBRegressor(
            **fixed_params,
            reg_lambda=reg_lambda,
            reg_alpha=reg_alpha,
            gamma=gamma,
            min_child_weight=min_child_weight,
            random_state=42,
            verbosity=0
        )

        cv_scores = cross_val_score(
            model, X, y, cv=5, scoring='neg_mean_squared_error'
        )
        return np.mean(cv_scores)

    study = optuna.create_study(direction='maximize')
    study.optimize(objective, n_trials=50, show_progress_bar=True)

    print("\nBest Regularization Parameters:")
    print(f"  reg_lambda: {study.best_params['reg_lambda']:.4f}")
    print(f"  reg_alpha: {study.best_params['reg_alpha']:.4f}")
    print(f"  gamma: {study.best_params['gamma']:.4f}")
    print(f"  min_child_weight: {study.best_params['min_child_weight']}")
    print(f"  Best CV Score: {-study.best_value:.4f}")

    return study


def grid_search_regularization(X, y):
    """
    Simpler grid search approach for regularization.
    """
    from sklearn.model_selection import GridSearchCV

    base_model = xgb.XGBRegressor(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=6,
        subsample=0.8,
        random_state=42,
        verbosity=0
    )

    param_grid = {
        'reg_lambda': [0, 1, 5, 10, 20],
        'reg_alpha': [0, 0.1, 1],
        'gamma': [0, 0.5, 1],
    }

    grid_search = GridSearchCV(
        base_model, param_grid, cv=5,
        scoring='neg_mean_squared_error', verbose=1
    )
    grid_search.fit(X, y)

    print("\nGrid Search Results:")
    print(f"  Best params: {grid_search.best_params_}")
    print(f"  Best CV MSE: {-grid_search.best_score_:.4f}")

    return grid_search


if __name__ == "__main__":
    from sklearn.datasets import make_regression

    X, y = make_regression(
        n_samples=1000, n_features=30, n_informative=15,
        noise=20, random_state=42
    )

    print("Grid Search Approach:")
    print("=" * 50)
    grid_search_regularization(X, y)
```

L1/L2 regularization interacts with all other regularization techniques in boosting. Understanding these interactions helps avoid both over-regularization and under-regularization.
| Tree Depth | Recommended L2 (lambda) |
|---|---|
| Shallow (2-3) | 0-1 (light) |
| Moderate (4-5) | 1-5 (moderate) |
| Deep (6-8) | 5-20 (stronger) |
| Very deep (10+) | 10-100 (heavy) |
Deeper trees have more parameters to regularize; increase L2 proportionally.
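A small helper that encodes the rule of thumb from the table above (these ranges are heuristics from this page, not a library feature):

```python
def suggested_reg_lambda_range(max_depth):
    """Rough L2 (lambda) search range by tree depth, per the rule-of-thumb table above."""
    if max_depth <= 3:
        return (0, 1)
    if max_depth <= 5:
        return (1, 5)
    if max_depth <= 8:
        return (5, 20)
    return (10, 100)

for depth in (3, 5, 7, 12):
    print(f"max_depth={depth}: try reg_lambda in {suggested_reg_lambda_range(depth)}")
```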
Subsampling is itself a regularizer, so the two effects compound: with aggressive subsampling (0.5-0.7), you may need less L2.
It's possible to over-regularize: low learning rate + aggressive subsampling + high L2 + early stopping can result in severely underfitting models. Signs include: training error that's unexpectedly high, validation error that stops improving very early, and flat learning curves.
For most problems, a balanced approach works well:
```python
balanced_config = {
    'learning_rate': 0.1,          # Moderate shrinkage
    'max_depth': 6,                # Moderate tree depth
    'subsample': 0.8,              # Light row sampling
    'colsample_bytree': 0.8,       # Light column sampling
    'reg_lambda': 1,               # Light L2
    'reg_alpha': 0,                # No L1 unless needed
    'gamma': 0,                    # No split penalty
    'early_stopping_rounds': 20,   # Dynamic stopping
}
```
Start with balanced settings and adjust based on validation performance.
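As a usage sketch (assuming a recent XGBoost version in which early_stopping_rounds is accepted as a constructor argument), the balanced configuration can be passed straight to XGBRegressor:

```python
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# balanced_config from above, repeated here so the snippet runs on its own
balanced_config = {
    'learning_rate': 0.1, 'max_depth': 6, 'subsample': 0.8,
    'colsample_bytree': 0.8, 'reg_lambda': 1, 'reg_alpha': 0,
    'gamma': 0, 'early_stopping_rounds': 20,
}

X, y = make_regression(n_samples=1000, n_features=30, noise=20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = xgb.XGBRegressor(n_estimators=1000, **balanced_config, random_state=42, verbosity=0)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print("Stopped at iteration:", model.best_iteration)
```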
L1 and L2 regularization are explicit penalty terms that directly constrain leaf weights in gradient boosting. Let's consolidate the essential insights:
- L2 (lambda) shrinks every leaf weight by the factor $\frac{H_j}{H_j + \lambda}$ and stabilizes leaves with few samples; it is on by default in XGBoost and CatBoost.
- L1 (alpha) soft-thresholds leaf weights, zeroing out leaves with weak signals; it is off by default and most useful for high-dimensional, noisy data.
- Gamma penalizes each additional leaf, acting as a minimum gain required to justify a split.
- Parameter names and defaults differ across XGBoost, LightGBM, and CatBoost, so configurations do not transfer directly between libraries.
- Tune regularization after structural hyperparameters, and remember that it compounds with shrinkage, subsampling, and early stopping.
Congratulations! You have now mastered the comprehensive suite of regularization techniques in gradient boosting: shrinkage (learning rate), subsampling (stochastic GB), tree constraints, early stopping, and L1/L2 regularization. Together, these techniques transform gradient boosting from a method prone to overfitting into a robust, production-ready algorithm that generalizes well. Mastering when and how to apply each technique is the mark of an expert boosting practitioner.