Gradient boosting, by design, is a deterministic sequential algorithm. Given the same training data and hyperparameters, it produces identical models. While this reproducibility is often desirable, it introduces a subtle vulnerability: the sequential learners can become overly correlated, each one making similar mistakes and overfitting to the same patterns in the training data.
Stochastic Gradient Boosting (SGB), introduced by Jerome Friedman in 2002, addresses this by introducing controlled randomness through subsampling—training each base learner on a random subset of the training data. This simple modification yields profound benefits:
Subsampling transforms gradient boosting from a purely greedy optimizer into a more robust ensemble method that inherits beneficial properties from bagging.
By the end of this page, you will understand: (1) the mathematical formulation of stochastic gradient boosting and its connection to stochastic optimization, (2) why subsampling provides regularization from variance and correlation perspectives, (3) the trade-off between sample fraction and model quality, (4) practical guidelines for selecting subsampling rates, and (5) how subsampling interacts with other hyperparameters.
To understand stochastic gradient boosting, let's first recall the standard gradient boosting update. At iteration $m$, we:
1. Compute pseudo-residuals for all $n$ training samples: $$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F=F_{m-1}}, \quad i = 1, \ldots, n$$
2. Fit a base learner $h_m$ to the entire set of $(x_i, r_{im})$ pairs
3. Update the ensemble: $$F_m(x) = F_{m-1}(x) + \nu h_m(x)$$
Stochastic gradient boosting modifies step 2: instead of fitting $h_m$ to all $n$ samples, we randomly sample a subset of size $\lfloor \eta \cdot n \rfloor$ samples (without replacement) and fit $h_m$ only on this subset. Here $\eta \in (0, 1]$ is the subsample ratio or sample fraction.
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


class StochasticGradientBoosting:
    """
    Stochastic Gradient Boosting with row subsampling.

    Key modification: Each tree is trained on a random subset of training
    samples, introducing diversity and regularization.
    """

    def __init__(
        self,
        n_estimators=100,
        learning_rate=0.1,
        max_depth=3,
        subsample=0.8,  # The key parameter: fraction of samples per tree
        random_state=None
    ):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.subsample = subsample  # eta in the algorithm
        self.random_state = random_state
        self.trees = []
        self.initial_prediction = None
        self.rng = np.random.RandomState(random_state)

    def fit(self, X, y):
        """
        Fit the stochastic gradient boosting model.
        """
        n_samples, n_features = X.shape

        # Number of samples to use per iteration
        n_subsample = int(self.subsample * n_samples)

        # Initialize with mean
        self.initial_prediction = np.mean(y)
        F = np.full(n_samples, self.initial_prediction)

        for m in range(self.n_estimators):
            # Step 1: Compute pseudo-residuals for ALL samples
            # This is important: we compute residuals on all data,
            # but train the tree on a subset
            residuals = y - F

            # Step 2: STOCHASTIC STEP - Sample indices without replacement
            sample_indices = self.rng.choice(
                n_samples, size=n_subsample, replace=False
            )

            # Step 3: Fit tree only on sampled data
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X[sample_indices], residuals[sample_indices])

            # Step 4: Update ensemble on ALL samples
            # Note: We predict on all samples, not just the subset
            h_m = tree.predict(X)
            F = F + self.learning_rate * h_m

            self.trees.append(tree)

            if (m + 1) % 20 == 0:
                mse = np.mean((y - F) ** 2)
                print(f"Iteration {m+1}: Train MSE = {mse:.6f}")

        return self

    def predict(self, X):
        """Make predictions."""
        F = np.full(X.shape[0], self.initial_prediction)
        for tree in self.trees:
            F = F + self.learning_rate * tree.predict(X)
        return F


def compare_standard_vs_stochastic():
    """
    Compare standard GB vs stochastic GB on a regression task.
    """
    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    # Generate data with some noise
    X, y = make_regression(
        n_samples=2000, n_features=20, noise=20, random_state=42
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )

    results = []

    # Test different subsample rates
    subsample_rates = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5]

    for subsample in subsample_rates:
        model = StochasticGradientBoosting(
            n_estimators=200,
            learning_rate=0.1,
            max_depth=4,
            subsample=subsample,
            random_state=42
        )
        model.fit(X_train, y_train)

        train_mse = mean_squared_error(y_train, model.predict(X_train))
        test_mse = mean_squared_error(y_test, model.predict(X_test))

        results.append({
            'subsample': subsample,
            'train_mse': train_mse,
            'test_mse': test_mse,
            'gap': test_mse - train_mse
        })

        print(f"Subsample={subsample:.1f}: "
              f"Train={train_mse:.2f}, Test={test_mse:.2f}, "
              f"Gap={test_mse - train_mse:.2f}")

    return results


if __name__ == "__main__":
    compare_standard_vs_stochastic()
```

The regularization effect of subsampling can be understood from multiple complementary perspectives.
In standard gradient boosting, each base learner is fitted on the same training set. If the training set has noise or outliers, every learner sees and potentially overfits to these artifacts. The learners become correlated in their errors.
With subsampling, each learner sees a different random subset, so noise and outliers influence only some of the learners and their errors become less correlated.
The variance of the ensemble prediction can be decomposed as:
$$\text{Var}\left[\frac{1}{M}\sum_{m=1}^{M} h_m(x)\right] = \frac{1}{M}\bar{\sigma}^2 + \frac{M-1}{M}\bar{\rho}\bar{\sigma}^2$$
where $\bar{\sigma}^2$ is the average variance of individual learners and $\bar{\rho}$ is the average correlation between learners. Subsampling reduces $\bar{\rho}$, which directly reduces ensemble variance.
The key insight: subsampling decorrelates the base learners. Even though each tree might be slightly worse (higher individual variance), the reduced correlation between trees leads to better ensemble performance. This is the same principle that makes Random Forests successful.
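To make the decomposition concrete, here is a small back-of-the-envelope calculation. The variances and correlations below are hypothetical illustrative values, not measurements from a real model:

```python
import numpy as np


def ensemble_variance(sigma2, rho, M):
    """Variance of the average of M learners with average individual
    variance sigma2 and average pairwise correlation rho."""
    return sigma2 / M + (M - 1) / M * rho * sigma2


M = 200
# Hypothetical values: without subsampling, learners are highly correlated;
# with subsampling, each learner is a bit noisier but much less correlated.
var_standard = ensemble_variance(sigma2=1.0, rho=0.9, M=M)
var_stochastic = ensemble_variance(sigma2=1.2, rho=0.6, M=M)

print(f"Highly correlated learners:   ensemble variance ~ {var_standard:.3f}")
print(f"Decorrelated, noisier trees:  ensemble variance ~ {var_stochastic:.3f}")
```

Even though the subsampled learners have higher individual variance in this toy example, the drop in correlation dominates and the ensemble variance falls.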
From an optimization perspective, subsampling transforms gradient boosting into a form of stochastic gradient descent in function space.
In standard gradient descent, we update parameters using the gradient computed on all data: $$\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$$
In stochastic gradient descent (SGD), we use a gradient computed on a random mini-batch: $$\theta_{t+1} = \theta_t - \eta \nabla L_{\text{batch}}(\theta_t)$$
SGD works despite (or because of) the noisy gradients. The noise has a regularizing effect, preventing the optimizer from settling into sharp minima that don't generalize well.
Stochastic gradient boosting achieves the same effect in function space: the pseudo-residuals from a random subset provide a noisy but unbiased estimate of the functional gradient, and each base learner takes a step along this noisy direction.
Subsampling gives gradient boosting some properties of bootstrap aggregating (bagging): each tree is fitted on a different random sample of the data, which injects diversity and reduces the variance of the ensemble.
However, there's a crucial difference: in bagging, trees are independent; in stochastic gradient boosting, trees are still sequential (each fitted to current residuals). This means SGB gets both the sequential error correction of boosting and the variance reduction of bagging.
The Best of Both Worlds: Stochastic GB can be viewed as a hybrid method that combines the sequential error correction of boosting with the variance reduction of bagging, as summarized below.
| Aspect | Standard GB | Stochastic GB |
|---|---|---|
| Training per tree | All samples | Random subset |
| Tree correlation | High | Lower |
| Variance | Higher | Lower |
| Training speed | Slower per tree | Faster per tree |
| Reproducibility | Deterministic | Random (seed-dependent) |
| Overfitting risk | Higher | Lower |
| Generalization | Good | Often better |
The subsample rate $\eta$ (often called `subsample` or `bagging_fraction`) controls the trade-off between regularization strength and information usage.
High Subsample Rate ($\eta \rightarrow 1.0$): behavior approaches standard gradient boosting; each tree sees nearly all of the data, so little signal is lost but the regularization effect is weak.
Low Subsample Rate ($\eta \rightarrow 0$): strong regularization and faster individual trees, but each tree sees so little data that its fit becomes noisy and the ensemble risks underfitting.
Extensive empirical studies and competition results have established practical guidelines:
The optimal subsample rate depends on your dataset size:
Large Datasets ($n > 100,000$): aggressive subsampling works well; rates around 0.5-0.7 provide strong regularization and substantial speedups without starving the trees of data.

Medium Datasets ($n \approx 1,000-100,000$): moderate rates around 0.7-0.85 balance regularization against the amount of signal each tree sees.

Small Datasets ($n < 1,000$): keep subsampling light (0.9-1.0); with few samples, every tree needs most of the data to find real structure.
```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt


def analyze_subsample_effect(X, y, task='regression'):
    """
    Analyze the effect of different subsample rates on model performance.

    Key insights:
    1. Subsampling almost always helps for sufficiently large datasets
    2. Optimal rate is typically 0.7-0.9
    3. Too aggressive subsampling hurts more than it helps
    """
    subsample_rates = [1.0, 0.95, 0.9, 0.85, 0.8, 0.75, 0.7, 0.6, 0.5, 0.4]

    results = {
        'subsample': [],
        'cv_mean': [],
        'cv_std': []
    }

    for subsample in subsample_rates:
        if task == 'classification':
            model = GradientBoostingClassifier(
                n_estimators=200,
                learning_rate=0.1,
                max_depth=4,
                subsample=subsample,
                random_state=42
            )
            scoring = 'accuracy'
        else:
            model = GradientBoostingRegressor(
                n_estimators=200,
                learning_rate=0.1,
                max_depth=4,
                subsample=subsample,
                random_state=42
            )
            scoring = 'neg_mean_squared_error'

        cv_scores = cross_val_score(model, X, y, cv=5, scoring=scoring)

        results['subsample'].append(subsample)
        results['cv_mean'].append(np.mean(cv_scores))
        results['cv_std'].append(np.std(cv_scores))

        print(f"Subsample={subsample:.2f}: "
              f"CV Score = {np.mean(cv_scores):.4f} (+/- {np.std(cv_scores):.4f})")

    # Plot results
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.errorbar(
        results['subsample'], results['cv_mean'], yerr=results['cv_std'],
        marker='o', capsize=5, capthick=2, linewidth=2, markersize=8
    )
    ax.set_xlabel('Subsample Rate', fontsize=12)
    ax.set_ylabel('Cross-Validation Score', fontsize=12)
    ax.set_title('Effect of Subsample Rate on Model Performance', fontsize=14)
    ax.grid(True, alpha=0.3)
    ax.invert_xaxis()  # Higher subsample on left

    # Mark optimal (both accuracy and neg-MSE scores are "higher is better")
    best_idx = np.argmax(results['cv_mean'])
    best_subsample = results['subsample'][best_idx]
    ax.axvline(best_subsample, color='green', linestyle='--', alpha=0.7,
               label=f'Best: subsample={best_subsample}')
    ax.legend()

    plt.tight_layout()
    plt.savefig('subsample_analysis.png', dpi=150)
    plt.show()

    return results


def dataset_size_recommendations(n_samples, n_features):
    """
    Provide subsample rate recommendations based on dataset characteristics.
    """
    # Base recommendation
    if n_samples >= 100000:
        base = 0.6
        reasoning = "Large dataset allows aggressive subsampling"
    elif n_samples >= 10000:
        base = 0.75
        reasoning = "Medium-large dataset benefits from moderate subsampling"
    elif n_samples >= 1000:
        base = 0.85
        reasoning = "Medium dataset: balance regularization with signal"
    else:
        base = 0.95
        reasoning = "Small dataset: preserve signal, use light subsampling"

    # Adjust for feature count
    if n_features > 100:
        adjustment = -0.05
        adj_reasoning = "High dimensions increase overfitting risk"
    elif n_features > 50:
        adjustment = -0.02
        adj_reasoning = "Moderate dimensions: slight adjustment"
    else:
        adjustment = 0
        adj_reasoning = "Low dimensions: no adjustment needed"

    final = max(0.3, min(1.0, base + adjustment))

    print(f"Dataset: {n_samples} samples, {n_features} features")
    print(f"Base recommendation: {base:.2f} ({reasoning})")
    print(f"Adjustment: {adjustment:+.2f} ({adj_reasoning})")
    print(f"\nFinal recommendation: subsample={final:.2f}")
    print(f"Search range for tuning: [{max(0.3, final-0.15):.2f}, {min(1.0, final+0.15):.2f}]")

    return final


if __name__ == "__main__":
    from sklearn.datasets import make_classification

    # Example 1: Large dataset
    print("=" * 50)
    dataset_size_recommendations(50000, 30)

    # Example 2: Small dataset
    print("\n" + "=" * 50)
    dataset_size_recommendations(500, 10)

    # Run actual analysis
    print("\n" + "=" * 50)
    print("Running cross-validation analysis...")
    X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
    analyze_subsample_effect(X, y, task='classification')
```

Modern boosting implementations offer two types of subsampling, which can be used independently or together:
Row (sample) subsampling is the classic stochastic gradient boosting approach we've discussed: each tree is trained on a random fraction of the rows, controlled by the `subsample` parameter.

Column (feature) subsampling is borrowed from Random Forests: each tree (or tree level, or split node) considers only a random fraction of the features, controlled by `colsample_bytree`, `colsample_bylevel`, or `colsample_bynode`.

Modern implementations like XGBoost, LightGBM, and CatBoost allow combining both:
| Parameter | XGBoost | LightGBM | CatBoost | Effect |
|---|---|---|---|---|
| Row sample | subsample | bagging_fraction | subsample | Fraction of rows per tree |
| Column per tree | colsample_bytree | feature_fraction | rsm | Fraction of features per tree |
| Column per level | colsample_bylevel | — | — | Fraction per depth level |
| Column per node | colsample_bynode | — | — | Fraction per split node |
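To make the table concrete, the following sketch spells out roughly equivalent "80% of rows, 80% of features per tree" configurations for the three libraries. Parameter names are taken from the table above; the caveats in the comments (LightGBM's row bagging requiring `bagging_freq > 0`, CatBoost's `subsample` requiring a compatible `bootstrap_type`) reflect my understanding of the documented behavior, so check your library version before relying on them:

```python
# Roughly equivalent row/column subsampling settings across libraries.
# These are plain parameter dictionaries; pass them to each library's
# estimator constructor or train() function.

xgboost_params = {
    "subsample": 0.8,           # fraction of rows per tree
    "colsample_bytree": 0.8,    # fraction of features per tree
}

lightgbm_params = {
    "bagging_fraction": 0.8,    # fraction of rows per tree
    "bagging_freq": 1,          # row bagging only takes effect when > 0
    "feature_fraction": 0.8,    # fraction of features per tree
}

catboost_params = {
    "subsample": 0.8,               # fraction of rows per tree
    "bootstrap_type": "Bernoulli",  # subsample applies to Bernoulli/Poisson/MVS bootstraps
    "rsm": 0.8,                     # feature sampling (CPU training)
}

for name, params in [("XGBoost", xgboost_params),
                     ("LightGBM", lightgbm_params),
                     ("CatBoost", catboost_params)]:
    print(f"{name}: {params}")
```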
Row subsampling is almost always beneficial: it decorrelates the trees, speeds up each iteration, and rarely hurts accuracy as long as the rate is not too aggressive.
Column subsampling is particularly useful when the dataset has many (possibly redundant) features, or when a few dominant features would otherwise appear in every tree.
Combined subsampling often yields the best results:
- `subsample=0.8`, `colsample_bytree=0.8`: a common, effective combination

Many Kaggle competition winners use `subsample=0.7-0.8` combined with `colsample_bytree=0.7-0.9`. This combination provides strong regularization while maintaining enough signal for effective learning. Start with both at 0.8 and tune from there.
```python
import xgboost as xgb
from sklearn.model_selection import cross_val_score
import numpy as np


def explore_combined_subsampling(X, y, task='classification'):
    """
    Explore the effect of combining row and column subsampling.
    """
    # Choose objective and metric for the XGBoost sklearn wrapper
    if task == 'classification':
        objective = 'binary:logistic'
        eval_metric = 'auc'
    else:
        objective = 'reg:squarederror'
        eval_metric = 'rmse'

    # Test grid of subsampling combinations
    row_samples = [1.0, 0.8, 0.6]
    col_samples = [1.0, 0.8, 0.6]

    results = []

    for subsample in row_samples:
        for colsample in col_samples:
            # XGBoost configuration
            params = {
                'objective': objective,
                'eval_metric': eval_metric,
                'max_depth': 4,
                'learning_rate': 0.1,
                'subsample': subsample,          # Row subsampling
                'colsample_bytree': colsample,   # Column subsampling
                'n_estimators': 200,
                'random_state': 42,
                'verbosity': 0
            }

            model = (xgb.XGBClassifier(**params) if task == 'classification'
                     else xgb.XGBRegressor(**params))

            scores = cross_val_score(
                model, X, y, cv=5,
                scoring='roc_auc' if task == 'classification' else 'neg_mean_squared_error'
            )

            result = {
                'row_subsample': subsample,
                'col_subsample': colsample,
                'effective_data': subsample * colsample,
                'cv_mean': np.mean(scores),
                'cv_std': np.std(scores)
            }
            results.append(result)

            print(f"Row={subsample:.1f}, Col={colsample:.1f} "
                  f"(Effective={subsample*colsample:.1%}): "
                  f"Score={np.mean(scores):.4f} ± {np.std(scores):.4f}")

    # Find best
    best = max(results, key=lambda x: x['cv_mean'])
    print(f"\nBest: Row={best['row_subsample']}, Col={best['col_subsample']}")
    print(f"Score: {best['cv_mean']:.4f}")

    return results


def subsampling_interaction_xgboost():
    """
    Demonstrate how different subsampling types interact in XGBoost.
    """
    # XGBoost allows fine-grained control.
    # The effective column fraction is the product of all column sampling params.
    config_examples = [
        {
            'name': 'No subsampling',
            'subsample': 1.0,
            'colsample_bytree': 1.0,
            'colsample_bylevel': 1.0,
            'colsample_bynode': 1.0,
        },
        {
            'name': 'Row subsampling only',
            'subsample': 0.8,
            'colsample_bytree': 1.0,
            'colsample_bylevel': 1.0,
            'colsample_bynode': 1.0,
        },
        {
            'name': 'Column subsampling per tree',
            'subsample': 1.0,
            'colsample_bytree': 0.8,
            'colsample_bylevel': 1.0,
            'colsample_bynode': 1.0,
        },
        {
            'name': 'Combined row + tree column',
            'subsample': 0.8,
            'colsample_bytree': 0.8,
            'colsample_bylevel': 1.0,
            'colsample_bynode': 1.0,
        },
        {
            'name': 'Maximum randomization',
            'subsample': 0.7,
            'colsample_bytree': 0.8,
            'colsample_bylevel': 0.9,
            'colsample_bynode': 0.9,
        },
    ]

    for config in config_examples:
        effective_rows = config['subsample']
        effective_cols = (config['colsample_bytree']
                          * config['colsample_bylevel']
                          * config['colsample_bynode'])
        effective_data = effective_rows * effective_cols

        print(f"{config['name']}:")
        print(f"  Effective rows: {effective_rows:.1%}")
        print(f"  Effective cols: {effective_cols:.1%}")
        print(f"  Total effective: {effective_data:.1%}")
        print()


if __name__ == "__main__":
    from sklearn.datasets import make_classification

    print("Subsampling Interaction Examples\n" + "=" * 50)
    subsampling_interaction_xgboost()

    print("\nCombined Subsampling Analysis\n" + "=" * 50)
    X, y = make_classification(
        n_samples=3000, n_features=30, n_informative=15, random_state=42
    )
    explore_combined_subsampling(X, y)
```

Subsampling interacts with other boosting hyperparameters in important ways. Understanding these interactions is crucial for effective tuning.
Subsample rate and learning rate both provide regularization, and their effects compound: lowering either one slows how quickly the ensemble fits the training data, so reducing both at once can underfit unless you add more trees.
Practical Insight: When you decrease subsampling, you can often slightly increase learning rate without overfitting. Conversely, with full data (subsample=1.0), prefer lower learning rates.
For example:

- `subsample=0.6`, `learning_rate=0.05`, `n_estimators=2000+`
- `subsample=0.8`, `learning_rate=0.1`, `n_estimators=500`

Subsample rate and tree depth also both control model complexity, but in different ways:
Compensation Strategy: If you use deep trees (which overfit easily), lower subsample rates can compensate. Conversely, with shallow trees (stumps), you can use higher subsample rates.
| Tree Depth | Recommended Subsample | Reasoning |
|---|---|---|
| 2-3 | 0.8-1.0 | Shallow trees don't overfit much |
| 4-6 | 0.7-0.9 | Standard regularization |
| 7-10 | 0.5-0.7 | Deep trees need more regularization |
| 10+ | 0.4-0.6 | Very deep: strong subsampling needed |
Lower subsample rates generally require more trees:
$$n_{\text{estimators}} \propto \frac{1}{\text{subsample}}$$
Rationale: Each tree learns less (from fewer samples), so more trees are needed to achieve the same effective learning. However, each tree is faster to train, so total training time may decrease.
Subsampling and early stopping work excellently together: subsampling changes how many trees the model ultimately needs, and early stopping on a validation set determines that number automatically instead of requiring you to guess `n_estimators` in advance.
Best Practice: When using subsampling, always enable early stopping. The combination is more robust than either alone.
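A minimal sketch of this combination using scikit-learn's built-in validation-based early stopping (`validation_fraction`, `n_iter_no_change`), on a synthetic dataset chosen only for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=5000, n_features=20, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Subsampling + early stopping: set n_estimators generously and let the
# held-out validation score decide when to stop adding trees.
model = GradientBoostingRegressor(
    n_estimators=2000,        # upper bound; early stopping picks the actual number
    learning_rate=0.1,
    max_depth=4,
    subsample=0.8,            # stochastic gradient boosting
    validation_fraction=0.2,  # fraction of training data held out for early stopping
    n_iter_no_change=20,      # stop after 20 rounds without improvement
    tol=1e-4,
    random_state=0,
)
model.fit(X_train, y_train)

print(f"Trees actually used: {model.n_estimators_}")
print(f"Test MSE: {mean_squared_error(y_test, model.predict(X_test)):.2f}")
```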
Think of regularization as a 'budget' distributed across methods: learning rate, subsampling, tree constraints, early stopping. Lowering one allows you to increase another. Start with moderate settings for all (lr=0.1, subsample=0.8, max_depth=4) and adjust based on validation performance.
Beyond regularization, subsampling provides significant computational advantages.
Fitting a decision tree has complexity $O(n \cdot d \cdot \log n)$ where $n$ is the number of samples and $d$ the number of features. With subsample rate $\eta$, each tree is fitted on $\eta n$ samples, so the per-tree cost drops to roughly $O(\eta n \cdot d \cdot \log(\eta n))$.
For $\eta = 0.5$, each tree takes roughly half the time. Even accounting for the need for more trees, total training time often decreases.
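A quick calculation, under the simplifying assumption that per-tree cost scales as $n \log n$ with constant factors ignored, shows how the per-tree cost shrinks with the subsample rate:

```python
import numpy as np

n = 100_000  # assumed dataset size, for illustration only
for eta in [1.0, 0.8, 0.7, 0.6, 0.5]:
    k = int(eta * n)
    relative_cost = (k * np.log(k)) / (n * np.log(n))  # cost relative to a full-data tree
    print(f"subsample={eta:.1f}: relative per-tree cost ~ {relative_cost:.2f}")
```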
With subsampling, only the sampled indices need to be materialized for tree fitting at each iteration. For very large datasets that don't fit comfortably in memory, this smaller per-tree working set reduces memory pressure and can make training feasible at all.
Subsampling can also improve parallel efficiency: each tree operates on a smaller slice of the data, so per-thread work is lighter and the working set is more cache-friendly.
| Subsample Rate | Relative Time/Tree | Trees Needed (approx) | Net Effect |
|---|---|---|---|
| 1.0 | 1.0x | 100 | Baseline |
| 0.8 | 0.75x | 125 | ~5-10% faster |
| 0.7 | 0.65x | 145 | ~5-15% faster |
| 0.6 | 0.55x | 170 | Similar time |
| 0.5 | 0.45x | 200 | Similar or slower |
The computational sweet spot is typically subsample=0.7-0.8. Below 0.6, the increase in required trees starts to outweigh the per-tree speedup. Above 0.9, there's little computational benefit. At 0.7-0.8, you often get both better generalization AND faster training—a rare win-win.
The regularization effect of subsampling can be formalized mathematically.
At each iteration, instead of the true gradient direction: $$g_m = -\nabla_F L(F_{m-1})$$
We compute a noisy estimate based on the subsample: $$\tilde{g}_m = -\nabla_F L_{S_m}(F_{m-1})$$
where $S_m$ is the random subset at iteration $m$. The noise in this estimate has mean zero but positive variance: $$\mathbb{E}[\tilde{g}_m] = g_m$$ $$\text{Var}[\tilde{g}_m] \propto \frac{1-\eta}{\eta n}$$
This added variance prevents the algorithm from following small, potentially spurious gradients (noise in the training data).
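The unbiasedness and the $(1-\eta)/(\eta n)$ variance scaling are easy to check numerically. The sketch below uses synthetic residuals as a stand-in for pseudo-residuals and compares the empirical variance of the subsample mean against the standard without-replacement sampling formula:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
residuals = rng.normal(size=n)   # stand-in for pseudo-residuals at one iteration
full_mean = residuals.mean()     # full-data (per-sample average) gradient direction

for eta in [0.9, 0.7, 0.5, 0.3]:
    k = int(eta * n)
    # Repeatedly estimate the mean from subsamples drawn without replacement
    estimates = np.array([
        residuals[rng.choice(n, size=k, replace=False)].mean()
        for _ in range(2000)
    ])
    predicted_var = (1 - eta) * residuals.var(ddof=1) / k  # ∝ (1 - eta) / (eta * n)
    print(f"eta={eta:.1f}: bias = {estimates.mean() - full_mean:+.5f}, "
          f"empirical var = {estimates.var():.2e}, predicted ~ {predicted_var:.2e}")
```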
Bühlmann and Hothorn (2007) showed that stochastic gradient boosting with subsample rate $\eta$ approximately equals gradient boosting with an effective learning rate of:
$$\nu_{\text{eff}} \approx \nu \cdot \sqrt{\eta}$$
This means a subsample rate of 0.5 scales the effective learning rate by about $\sqrt{0.5} \approx 0.71$ (a rate of 0.25 would halve it), providing additional regularization beyond learner diversity alone.
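A two-line check of this relationship, assuming a base learning rate of 0.1 purely for illustration:

```python
import numpy as np

nu = 0.1  # assumed base learning rate
for eta in [1.0, 0.8, 0.5, 0.25]:
    print(f"subsample={eta:.2f}: effective learning rate ~ {nu * np.sqrt(eta):.3f}")
```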
For stochastic gradient boosting, generalization bounds can be tighter than standard GB:
$$R(F_M) - \hat{R}(F_M) \leq O\left(\sqrt{\frac{M \cdot \nu^2 / \eta}{n}}\right) + O\left(\frac{M \cdot \nu}{\eta n}\right)$$
The $1/\eta$ factor accounts for the increased variance of individual learners, but this is offset by reduced correlation, often leading to better generalization in practice.
The theoretical analysis confirms what empirical studies consistently show: subsampling provides regularization that improves generalization. The exact optimal subsample rate is data-dependent, but theory supports the empirical finding that values around 0.7-0.8 work well across diverse problems.
Stochastic gradient boosting through subsampling is a simple yet powerful regularization technique: training each tree on a random subset of the rows decorrelates the base learners, reduces ensemble variance, speeds up each iteration, and typically improves generalization, with rates around 0.7-0.8 working well across a wide range of problems.
You now understand subsampling (stochastic gradient boosting) as a powerful regularization technique that provides diversity, computational benefits, and improved generalization. Next, we turn to tree constraints—another crucial family of regularization techniques that control the complexity of individual base learners.