Gradient boosting, by design, is a deterministic sequential algorithm. Given the same training data and hyperparameters, it produces identical models. While this reproducibility is often desirable, it introduces a subtle vulnerability: the sequential learners can become overly correlated, each one making similar mistakes and overfitting to the same patterns in the training data.
Stochastic Gradient Boosting (SGB), introduced by Jerome Friedman in 2002, addresses this by introducing controlled randomness through subsampling—training each base learner on a random subset of the training data. This simple modification yields profound benefits:
Subsampling transforms gradient boosting from a purely greedy optimizer into a more robust ensemble method that inherits beneficial properties from bagging.
By the end of this page, you will understand: (1) the mathematical formulation of stochastic gradient boosting and its connection to stochastic optimization, (2) why subsampling provides regularization from variance and correlation perspectives, (3) the trade-off between sample fraction and model quality, (4) practical guidelines for selecting subsampling rates, and (5) how subsampling interacts with other hyperparameters.
To understand stochastic gradient boosting, let's first recall the standard gradient boosting update. At iteration $m$, we:
1. Compute pseudo-residuals for all $n$ training samples: $$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F=F_{m-1}}, \quad i = 1, \ldots, n$$
2. Fit a base learner $h_m$ to the entire set of $(x_i, r_{im})$ pairs
3. Update the ensemble: $$F_m(x) = F_{m-1}(x) + \nu h_m(x)$$
Stochastic gradient boosting modifies step 2: instead of fitting $h_m$ to all $n$ samples, we randomly sample a subset of size $\lfloor \eta \cdot n \rfloor$ samples (without replacement) and fit $h_m$ only on this subset. Here $\eta \in (0, 1]$ is the subsample ratio or sample fraction.
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


class StochasticGradientBoosting:
    """
    Stochastic Gradient Boosting with row subsampling.

    Key modification: Each tree is trained on a random subset of training
    samples, introducing diversity and regularization.
    """

    def __init__(
        self,
        n_estimators=100,
        learning_rate=0.1,
        max_depth=3,
        subsample=0.8,  # The key parameter: fraction of samples per tree
        random_state=None
    ):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.subsample = subsample  # eta in the algorithm
        self.random_state = random_state
        self.trees = []
        self.initial_prediction = None
        self.rng = np.random.RandomState(random_state)

    def fit(self, X, y):
        """
        Fit the stochastic gradient boosting model.
        """
        n_samples, n_features = X.shape

        # Number of samples to use per iteration
        n_subsample = int(self.subsample * n_samples)

        # Initialize with mean
        self.initial_prediction = np.mean(y)
        F = np.full(n_samples, self.initial_prediction)

        for m in range(self.n_estimators):
            # Step 1: Compute pseudo-residuals for ALL samples
            # This is important: we compute residuals on all data,
            # but train the tree on a subset
            residuals = y - F

            # Step 2: STOCHASTIC STEP - Sample indices without replacement
            sample_indices = self.rng.choice(
                n_samples, size=n_subsample, replace=False
            )

            # Step 3: Fit tree only on sampled data
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X[sample_indices], residuals[sample_indices])

            # Step 4: Update ensemble on ALL samples
            # Note: We predict on all samples, not just the subset
            h_m = tree.predict(X)
            F = F + self.learning_rate * h_m

            self.trees.append(tree)

            if (m + 1) % 20 == 0:
                mse = np.mean((y - F) ** 2)
                print(f"Iteration {m+1}: Train MSE = {mse:.6f}")

        return self

    def predict(self, X):
        """Make predictions."""
        F = np.full(X.shape[0], self.initial_prediction)
        for tree in self.trees:
            F = F + self.learning_rate * tree.predict(X)
        return F


def compare_standard_vs_stochastic():
    """
    Compare standard GB vs stochastic GB on a regression task.
    """
    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    # Generate data with some noise
    X, y = make_regression(
        n_samples=2000, n_features=20, noise=20, random_state=42
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )

    results = []

    # Test different subsample rates
    subsample_rates = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5]

    for subsample in subsample_rates:
        model = StochasticGradientBoosting(
            n_estimators=200,
            learning_rate=0.1,
            max_depth=4,
            subsample=subsample,
            random_state=42
        )
        model.fit(X_train, y_train)

        train_mse = mean_squared_error(y_train, model.predict(X_train))
        test_mse = mean_squared_error(y_test, model.predict(X_test))

        results.append({
            'subsample': subsample,
            'train_mse': train_mse,
            'test_mse': test_mse,
            'gap': test_mse - train_mse
        })

        print(f"Subsample={subsample:.1f}: "
              f"Train={train_mse:.2f}, Test={test_mse:.2f}, "
              f"Gap={test_mse - train_mse:.2f}")

    return results


if __name__ == "__main__":
    compare_standard_vs_stochastic()
```

The regularization effect of subsampling can be understood from multiple complementary perspectives.
In standard gradient boosting, each base learner is fitted on the same training set. If the training set has noise or outliers, every learner sees and potentially overfits to these artifacts. The learners become correlated in their errors.
With subsampling, each learner sees a different random subset, so noise and outliers influence only some of the learners and their errors become less correlated.
The variance of the ensemble prediction can be decomposed as:
$$\text{Var}\left[\frac{1}{M}\sum_{m=1}^{M} h_m(x)\right] = \frac{1}{M}\bar{\sigma}^2 + \frac{M-1}{M}\bar{\rho}\bar{\sigma}^2$$
where $\bar{\sigma}^2$ is the average variance of individual learners and $\bar{\rho}$ is the average correlation between learners. Subsampling reduces $\bar{\rho}$, which directly reduces ensemble variance.
The key insight: subsampling decorrelates the base learners. Even though each tree might be slightly worse (higher individual variance), the reduced correlation between trees leads to better ensemble performance. This is the same principle that makes Random Forests successful.
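To make the decomposition concrete, here is a small back-of-the-envelope calculation. The variances and correlations below are hypothetical illustrative values, not measurements from a real model:

```python
import numpy as np


def ensemble_variance(sigma2, rho, M):
    """Variance of the average of M learners with average individual
    variance sigma2 and average pairwise correlation rho."""
    return sigma2 / M + (M - 1) / M * rho * sigma2


M = 200
# Hypothetical values: without subsampling, learners are highly correlated;
# with subsampling, each learner is a bit noisier but much less correlated.
var_standard = ensemble_variance(sigma2=1.0, rho=0.9, M=M)
var_stochastic = ensemble_variance(sigma2=1.2, rho=0.6, M=M)

print(f"Highly correlated learners:   ensemble variance ~ {var_standard:.3f}")
print(f"Decorrelated, noisier trees:  ensemble variance ~ {var_stochastic:.3f}")
```

Even though the subsampled learners have higher individual variance in this toy example, the drop in correlation dominates and the ensemble variance falls.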
From an optimization perspective, subsampling transforms gradient boosting into a form of stochastic gradient descent in function space.
In standard gradient descent, we update parameters using the gradient computed on all data: $$\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$$
In stochastic gradient descent (SGD), we use a gradient computed on a random mini-batch: $$\theta_{t+1} = \theta_t - \eta \nabla L_{\text{batch}}(\theta_t)$$
SGD works despite (or because of) the noisy gradients. The noise has a regularizing effect, preventing the optimizer from settling into sharp minima that don't generalize well.
Stochastic gradient boosting achieves the same effect in function space: the pseudo-residuals from a random subset provide a noisy but unbiased estimate of the functional gradient, and each base learner takes a step along this noisy direction.
Subsampling gives gradient boosting some properties of bootstrap aggregating (bagging): each tree is fitted on a different random sample of the data, which injects diversity and reduces the variance of the ensemble.
However, there's a crucial difference: in bagging, trees are independent; in stochastic gradient boosting, trees are still sequential (each fitted to current residuals). This means SGB gets both the sequential error correction of boosting and the variance reduction of bagging.
The Best of Both Worlds: Stochastic GB can be viewed as a hybrid method that combines the sequential error correction of boosting with the variance reduction of bagging, as summarized below.
| Aspect | Standard GB | Stochastic GB |
|---|---|---|
| Training per tree | All samples | Random subset |
| Tree correlation | High | Lower |
| Variance | Higher | Lower |
| Training speed | Slower per tree | Faster per tree |
| Reproducibility | Deterministic | Random (seed-dependent) |
| Overfitting risk | Higher | Lower |
| Generalization | Good | Often better |
The subsample rate $\eta$ (often called `subsample` or `bagging_fraction`) controls the trade-off between regularization strength and information usage.
High Subsample Rate ($\eta \rightarrow 1.0$): behavior approaches standard gradient boosting; each tree sees nearly all of the data, so little signal is lost but the regularization effect is weak.
Low Subsample Rate ($\eta \rightarrow 0$): strong regularization and faster individual trees, but each tree sees so little data that its fit becomes noisy and the ensemble risks underfitting.
Extensive empirical studies and competition results have established practical guidelines:
The optimal subsample rate depends on your dataset size:
Large Datasets ($n > 100,000$): aggressive subsampling works well; rates around 0.5-0.7 provide strong regularization and substantial speedups without starving the trees of data.

Medium Datasets ($n \approx 1,000-100,000$): moderate rates around 0.7-0.85 balance regularization against the amount of signal each tree sees.

Small Datasets ($n < 1,000$): keep subsampling light (0.9-1.0); with few samples, every tree needs most of the data to find real structure.
```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt


def analyze_subsample_effect(X, y, task='regression'):
    """
    Analyze the effect of different subsample rates on model performance.

    Key insights:
    1. Subsampling almost always helps for sufficiently large datasets
    2. Optimal rate is typically 0.7-0.9
    3. Too aggressive subsampling hurts more than it helps
    """
    subsample_rates = [1.0, 0.95, 0.9, 0.85, 0.8, 0.75, 0.7, 0.6, 0.5, 0.4]

    results = {
        'subsample': [],
        'cv_mean': [],
        'cv_std': []
    }

    for subsample in subsample_rates:
        if task == 'classification':
            model = GradientBoostingClassifier(
                n_estimators=200,
                learning_rate=0.1,
                max_depth=4,
                subsample=subsample,
                random_state=42
            )
            scoring = 'accuracy'
        else:
            model = GradientBoostingRegressor(
                n_estimators=200,
                learning_rate=0.1,
                max_depth=4,
                subsample=subsample,
                random_state=42
            )
            scoring = 'neg_mean_squared_error'

        cv_scores = cross_val_score(model, X, y, cv=5, scoring=scoring)

        results['subsample'].append(subsample)
        results['cv_mean'].append(np.mean(cv_scores))
        results['cv_std'].append(np.std(cv_scores))

        print(f"Subsample={subsample:.2f}: "
              f"CV Score = {np.mean(cv_scores):.4f} (+/- {np.std(cv_scores):.4f})")

    # Plot results
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.errorbar(
        results['subsample'], results['cv_mean'], yerr=results['cv_std'],
        marker='o', capsize=5, capthick=2, linewidth=2, markersize=8
    )
    ax.set_xlabel('Subsample Rate', fontsize=12)
    ax.set_ylabel('Cross-Validation Score', fontsize=12)
    ax.set_title('Effect of Subsample Rate on Model Performance', fontsize=14)
    ax.grid(True, alpha=0.3)
    ax.invert_xaxis()  # Higher subsample on left

    # Mark optimal (both accuracy and neg-MSE scores are "higher is better")
    best_idx = np.argmax(results['cv_mean'])
    best_subsample = results['subsample'][best_idx]
    ax.axvline(best_subsample, color='green', linestyle='--', alpha=0.7,
               label=f'Best: subsample={best_subsample}')
    ax.legend()

    plt.tight_layout()
    plt.savefig('subsample_analysis.png', dpi=150)
    plt.show()

    return results


def dataset_size_recommendations(n_samples, n_features):
    """
    Provide subsample rate recommendations based on dataset characteristics.
    """
    # Base recommendation
    if n_samples >= 100000:
        base = 0.6
        reasoning = "Large dataset allows aggressive subsampling"
    elif n_samples >= 10000:
        base = 0.75
        reasoning = "Medium-large dataset benefits from moderate subsampling"
    elif n_samples >= 1000:
        base = 0.85
        reasoning = "Medium dataset: balance regularization with signal"
    else:
        base = 0.95
        reasoning = "Small dataset: preserve signal, use light subsampling"

    # Adjust for feature count
    if n_features > 100:
        adjustment = -0.05
        adj_reasoning = "High dimensions increase overfitting risk"
    elif n_features > 50:
        adjustment = -0.02
        adj_reasoning = "Moderate dimensions: slight adjustment"
    else:
        adjustment = 0
        adj_reasoning = "Low dimensions: no adjustment needed"

    final = max(0.3, min(1.0, base + adjustment))

    print(f"Dataset: {n_samples} samples, {n_features} features")
    print(f"Base recommendation: {base:.2f} ({reasoning})")
    print(f"Adjustment: {adjustment:+.2f} ({adj_reasoning})")
    print(f"\nFinal recommendation: subsample={final:.2f}")
    print(f"Search range for tuning: [{max(0.3, final-0.15):.2f}, {min(1.0, final+0.15):.2f}]")

    return final


if __name__ == "__main__":
    from sklearn.datasets import make_classification

    # Example 1: Large dataset
    print("=" * 50)
    dataset_size_recommendations(50000, 30)

    # Example 2: Small dataset
    print("\n" + "=" * 50)
    dataset_size_recommendations(500, 10)

    # Run actual analysis
    print("\n" + "=" * 50)
    print("Running cross-validation analysis...")
    X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
    analyze_subsample_effect(X, y, task='classification')
```

Modern boosting implementations offer two types of subsampling, which can be used independently or together:
Row (sample) subsampling is the classic stochastic gradient boosting approach we've discussed: each tree is trained on a random fraction of the rows, controlled by the `subsample` parameter.

Column (feature) subsampling is borrowed from Random Forests: each tree (or tree level, or split node) considers only a random fraction of the features, controlled by `colsample_bytree`, `colsample_bylevel`, or `colsample_bynode`.

Modern implementations like XGBoost, LightGBM, and CatBoost allow combining both:
| Parameter | XGBoost | LightGBM | CatBoost | Effect |
|---|---|---|---|---|
| Row sample | subsample | bagging_fraction | subsample | Fraction of rows per tree |
| Column per tree | colsample_bytree | feature_fraction | rsm | Fraction of features per tree |
| Column per level | colsample_bylevel | — | — | Fraction per depth level |
| Column per node | colsample_bynode | — | — | Fraction per split node |
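To make the table concrete, the following sketch spells out roughly equivalent "80% of rows, 80% of features per tree" configurations for the three libraries. Parameter names are taken from the table above; the caveats in the comments (LightGBM's row bagging requiring `bagging_freq > 0`, CatBoost's `subsample` requiring a compatible `bootstrap_type`) reflect my understanding of the documented behavior, so check your library version before relying on them:

```python
# Roughly equivalent row/column subsampling settings across libraries.
# These are plain parameter dictionaries; pass them to each library's
# estimator constructor or train() function.

xgboost_params = {
    "subsample": 0.8,           # fraction of rows per tree
    "colsample_bytree": 0.8,    # fraction of features per tree
}

lightgbm_params = {
    "bagging_fraction": 0.8,    # fraction of rows per tree
    "bagging_freq": 1,          # row bagging only takes effect when > 0
    "feature_fraction": 0.8,    # fraction of features per tree
}

catboost_params = {
    "subsample": 0.8,               # fraction of rows per tree
    "bootstrap_type": "Bernoulli",  # subsample applies to Bernoulli/Poisson/MVS bootstraps
    "rsm": 0.8,                     # feature sampling (CPU training)
}

for name, params in [("XGBoost", xgboost_params),
                     ("LightGBM", lightgbm_params),
                     ("CatBoost", catboost_params)]:
    print(f"{name}: {params}")
```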
Row subsampling is almost always beneficial: it decorrelates the trees, speeds up each iteration, and rarely hurts accuracy as long as the rate is not too aggressive.
Column subsampling is particularly useful when the dataset has many (possibly redundant) features, or when a few dominant features would otherwise appear in every tree.
Combined subsampling often yields the best results:
- `subsample=0.8`, `colsample_bytree=0.8`: a common, effective combination

Many Kaggle competition winners use `subsample=0.7-0.8` combined with `colsample_bytree=0.7-0.9`. This combination provides strong regularization while maintaining enough signal for effective learning. Start with both at 0.8 and tune from there.
```python
import xgboost as xgb
from sklearn.model_selection import cross_val_score
import numpy as np


def explore_combined_subsampling(X, y, task='classification'):
    """
    Explore the effect of combining row and column subsampling.
    """
    # Choose objective and metric for the XGBoost sklearn wrapper
    if task == 'classification':
        objective = 'binary:logistic'
        eval_metric = 'auc'
    else:
        objective = 'reg:squarederror'
        eval_metric = 'rmse'

    # Test grid of subsampling combinations
    row_samples = [1.0, 0.8, 0.6]
    col_samples = [1.0, 0.8, 0.6]

    results = []

    for subsample in row_samples:
        for colsample in col_samples:
            # XGBoost configuration
            params = {
                'objective': objective,
                'eval_metric': eval_metric,
                'max_depth': 4,
                'learning_rate': 0.1,
                'subsample': subsample,          # Row subsampling
                'colsample_bytree': colsample,   # Column subsampling
                'n_estimators': 200,
                'random_state': 42,
                'verbosity': 0
            }

            model = (xgb.XGBClassifier(**params) if task == 'classification'
                     else xgb.XGBRegressor(**params))

            scores = cross_val_score(
                model, X, y, cv=5,
                scoring='roc_auc' if task == 'classification' else 'neg_mean_squared_error'
            )

            result = {
                'row_subsample': subsample,
                'col_subsample': colsample,
                'effective_data': subsample * colsample,
                'cv_mean': np.mean(scores),
                'cv_std': np.std(scores)
            }
            results.append(result)

            print(f"Row={subsample:.1f}, Col={colsample:.1f} "
                  f"(Effective={subsample*colsample:.1%}): "
                  f"Score={np.mean(scores):.4f} ± {np.std(scores):.4f}")

    # Find best
    best = max(results, key=lambda x: x['cv_mean'])
    print(f"\nBest: Row={best['row_subsample']}, Col={best['col_subsample']}")
    print(f"Score: {best['cv_mean']:.4f}")

    return results


def subsampling_interaction_xgboost():
    """
    Demonstrate how different subsampling types interact in XGBoost.
    """
    # XGBoost allows fine-grained control.
    # The effective column fraction is the product of all column sampling params.
    config_examples = [
        {
            'name': 'No subsampling',
            'subsample': 1.0,
            'colsample_bytree': 1.0,
            'colsample_bylevel': 1.0,
            'colsample_bynode': 1.0,
        },
        {
            'name': 'Row subsampling only',
            'subsample': 0.8,
            'colsample_bytree': 1.0,
            'colsample_bylevel': 1.0,
            'colsample_bynode': 1.0,
        },
        {
            'name': 'Column subsampling per tree',
            'subsample': 1.0,
            'colsample_bytree': 0.8,
            'colsample_bylevel': 1.0,
            'colsample_bynode': 1.0,
        },
        {
            'name': 'Combined row + tree column',
            'subsample': 0.8,
            'colsample_bytree': 0.8,
            'colsample_bylevel': 1.0,
            'colsample_bynode': 1.0,
        },
        {
            'name': 'Maximum randomization',
            'subsample': 0.7,
            'colsample_bytree': 0.8,
            'colsample_bylevel': 0.9,
            'colsample_bynode': 0.9,
        },
    ]

    for config in config_examples:
        effective_rows = config['subsample']
        effective_cols = (config['colsample_bytree']
                          * config['colsample_bylevel']
                          * config['colsample_bynode'])
        effective_data = effective_rows * effective_cols

        print(f"{config['name']}:")
        print(f"  Effective rows: {effective_rows:.1%}")
        print(f"  Effective cols: {effective_cols:.1%}")
        print(f"  Total effective: {effective_data:.1%}")
        print()


if __name__ == "__main__":
    from sklearn.datasets import make_classification

    print("Subsampling Interaction Examples\n" + "=" * 50)
    subsampling_interaction_xgboost()

    print("\nCombined Subsampling Analysis\n" + "=" * 50)
    X, y = make_classification(
        n_samples=3000, n_features=30, n_informative=15, random_state=42
    )
    explore_combined_subsampling(X, y)
```

Subsampling interacts with other boosting hyperparameters in important ways. Understanding these interactions is crucial for effective tuning.
Subsample rate and learning rate both provide regularization, and their effects compound: lowering either one slows how quickly the ensemble fits the training data, so reducing both at once can underfit unless you add more trees.
Practical Insight: When you decrease subsampling, you can often slightly increase learning rate without overfitting. Conversely, with full data (subsample=1.0), prefer lower learning rates.
For example:

- `subsample=0.6`, `learning_rate=0.05`, `n_estimators=2000+`
- `subsample=0.8`, `learning_rate=0.1`, `n_estimators=500`

Subsample rate and tree depth also both control model complexity, but in different ways:
Compensation Strategy: If you use deep trees (which overfit easily), lower subsample rates can compensate. Conversely, with shallow trees (stumps), you can use higher subsample rates.
| Tree Depth | Recommended Subsample | Reasoning |
|---|---|---|
| 2-3 | 0.8-1.0 | Shallow trees don't overfit much |
| 4-6 | 0.7-0.9 | Standard regularization |
| 7-10 | 0.5-0.7 | Deep trees need more regularization |
| 10+ | 0.4-0.6 | Very deep: strong subsampling needed |
Lower subsample rates generally require more trees:
$$n_{\text{estimators}} \propto \frac{1}{\text{subsample}}$$
Rationale: Each tree learns less (from fewer samples), so more trees are needed to achieve the same effective learning. However, each tree is faster to train, so total training time may decrease.
Subsampling and early stopping work excellently together: subsampling changes how many trees the model ultimately needs, and early stopping on a validation set determines that number automatically instead of requiring you to guess `n_estimators` in advance.
Best Practice: When using subsampling, always enable early stopping. The combination is more robust than either alone.
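A minimal sketch of this combination using scikit-learn's built-in validation-based early stopping (`validation_fraction`, `n_iter_no_change`), on a synthetic dataset chosen only for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=5000, n_features=20, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Subsampling + early stopping: set n_estimators generously and let the
# held-out validation score decide when to stop adding trees.
model = GradientBoostingRegressor(
    n_estimators=2000,        # upper bound; early stopping picks the actual number
    learning_rate=0.1,
    max_depth=4,
    subsample=0.8,            # stochastic gradient boosting
    validation_fraction=0.2,  # fraction of training data held out for early stopping
    n_iter_no_change=20,      # stop after 20 rounds without improvement
    tol=1e-4,
    random_state=0,
)
model.fit(X_train, y_train)

print(f"Trees actually used: {model.n_estimators_}")
print(f"Test MSE: {mean_squared_error(y_test, model.predict(X_test)):.2f}")
```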
Think of regularization as a 'budget' distributed across methods: learning rate, subsampling, tree constraints, early stopping. Lowering one allows you to increase another. Start with moderate settings for all (lr=0.1, subsample=0.8, max_depth=4) and adjust based on validation performance.
Beyond regularization, subsampling provides significant computational advantages.
Fitting a decision tree has complexity $O(n \cdot d \cdot \log n)$ where $n$ is the number of samples and $d$ the number of features. With subsample rate $\eta$, each tree is fitted on $\eta n$ samples, so the per-tree cost drops to roughly $O(\eta n \cdot d \cdot \log(\eta n))$.
For $\eta = 0.5$, each tree takes roughly half the time. Even accounting for the need for more trees, total training time often decreases.
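A quick calculation, under the simplifying assumption that per-tree cost scales as $n \log n$ with constant factors ignored, shows how the per-tree cost shrinks with the subsample rate:

```python
import numpy as np

n = 100_000  # assumed dataset size, for illustration only
for eta in [1.0, 0.8, 0.7, 0.6, 0.5]:
    k = int(eta * n)
    relative_cost = (k * np.log(k)) / (n * np.log(n))  # cost relative to a full-data tree
    print(f"subsample={eta:.1f}: relative per-tree cost ~ {relative_cost:.2f}")
```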
With subsampling, only the sampled indices need to be materialized for tree fitting at each iteration. For very large datasets that don't fit comfortably in memory, this smaller per-tree working set reduces memory pressure and can make training feasible at all.
Subsampling can also improve parallel efficiency: each tree operates on a smaller slice of the data, so per-thread work is lighter and the working set is more cache-friendly.
| Subsample Rate | Relative Time/Tree | Trees Needed (approx) | Net Effect |
|---|---|---|---|
| 1.0 | 1.0x | 100 | Baseline |
| 0.8 | 0.75x | 125 | ~5-10% faster |
| 0.7 | 0.65x | 145 | ~5-15% faster |
| 0.6 | 0.55x | 170 | Similar time |
| 0.5 | 0.45x | 200 | Similar or slower |
The computational sweet spot is typically subsample=0.7-0.8. Below 0.6, the increase in required trees starts to outweigh the per-tree speedup. Above 0.9, there's little computational benefit. At 0.7-0.8, you often get both better generalization AND faster training—a rare win-win.
The regularization effect of subsampling can be formalized mathematically.
At each iteration, instead of the true gradient direction: $$g_m = -\nabla_F L(F_{m-1})$$
We compute a noisy estimate based on the subsample: $$\tilde{g}_m = -\nabla_F L_{S_m}(F_{m-1})$$
where $S_m$ is the random subset at iteration $m$. The noise in this estimate has mean zero but positive variance: $$\mathbb{E}[\tilde{g}_m] = g_m$$ $$\text{Var}[\tilde{g}_m] \propto \frac{1-\eta}{\eta n}$$
This added variance prevents the algorithm from following small, potentially spurious gradients (noise in the training data).
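The unbiasedness and the $(1-\eta)/(\eta n)$ variance scaling are easy to check numerically. The sketch below uses synthetic residuals as a stand-in for pseudo-residuals and compares the empirical variance of the subsample mean against the standard without-replacement sampling formula:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
residuals = rng.normal(size=n)   # stand-in for pseudo-residuals at one iteration
full_mean = residuals.mean()     # full-data (per-sample average) gradient direction

for eta in [0.9, 0.7, 0.5, 0.3]:
    k = int(eta * n)
    # Repeatedly estimate the mean from subsamples drawn without replacement
    estimates = np.array([
        residuals[rng.choice(n, size=k, replace=False)].mean()
        for _ in range(2000)
    ])
    predicted_var = (1 - eta) * residuals.var(ddof=1) / k  # ∝ (1 - eta) / (eta * n)
    print(f"eta={eta:.1f}: bias = {estimates.mean() - full_mean:+.5f}, "
          f"empirical var = {estimates.var():.2e}, predicted ~ {predicted_var:.2e}")
```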
Bühlmann and Hothorn (2007) showed that stochastic gradient boosting with subsample rate $\eta$ approximately equals gradient boosting with an effective learning rate of:
$$\nu_{\text{eff}} \approx \nu \cdot \sqrt{\eta}$$
This means a subsample rate of 0.5 scales the effective learning rate by about $\sqrt{0.5} \approx 0.71$ (a rate of 0.25 would halve it), providing additional regularization beyond learner diversity alone.
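A two-line check of this relationship, assuming a base learning rate of 0.1 purely for illustration:

```python
import numpy as np

nu = 0.1  # assumed base learning rate
for eta in [1.0, 0.8, 0.5, 0.25]:
    print(f"subsample={eta:.2f}: effective learning rate ~ {nu * np.sqrt(eta):.3f}")
```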
For stochastic gradient boosting, generalization bounds can be tighter than standard GB:
$$R(F_M) - \hat{R}(F_M) \leq O\left(\sqrt{\frac{M \cdot \nu^2 / \eta}{n}}\right) + O\left(\frac{M \cdot \nu}{\eta n}\right)$$
The $1/\eta$ factor accounts for the increased variance of individual learners, but this is offset by reduced correlation, often leading to better generalization in practice.
The theoretical analysis confirms what empirical studies consistently show: subsampling provides regularization that improves generalization. The exact optimal subsample rate is data-dependent, but theory supports the empirical finding that values around 0.7-0.8 work well across diverse problems.
Stochastic gradient boosting through subsampling is a simple yet powerful regularization technique: training each tree on a random subset of the rows decorrelates the base learners, reduces ensemble variance, speeds up each iteration, and typically improves generalization, with rates around 0.7-0.8 working well across a wide range of problems.
You now understand subsampling (stochastic gradient boosting) as a powerful regularization technique that provides diversity, computational benefits, and improved generalization. Next, we turn to tree constraints—another crucial family of regularization techniques that control the complexity of individual base learners.