Bagging and boosting represent the two dominant paradigms in ensemble learning. Both combine multiple models to improve predictions, but they do so through fundamentally different philosophies.
Understanding when to use each approach—and why—is one of the most valuable skills in applied machine learning. This page provides a deep, rigorous comparison of these paradigms.
By the end of this page, you will understand: (1) The fundamental differences in how bagging and boosting work, (2) Bias-variance trade-offs for each paradigm, (3) Computational and practical trade-offs, (4) When to prefer bagging over boosting (and vice versa), (5) Hybrid approaches that combine both paradigms.
Let's establish the core differences between bagging and boosting at a conceptual level.
Bagging (Bootstrap Aggregating): train many copies of the same base learner independently, each on a bootstrap sample of the training data, and aggregate their predictions by averaging (regression) or voting (classification).

Key properties: models are trained in parallel, see slightly different data, carry equal weight, and primarily reduce variance.

Boosting: train base learners sequentially, with each new learner focused on the examples (or residuals) that the current ensemble gets wrong, and combine them as a weighted sum.

Key properties: models are trained in sequence, depend on their predecessors, are weighted by performance, and primarily reduce bias.
| Aspect | Bagging | Boosting |
|---|---|---|
| Training Order | Parallel (independent) | Sequential (dependent) |
| Data Sampling | Bootstrap samples | Weighted samples (hard examples) |
| Model Focus | Learn different aspects of data | Learn what previous models got wrong |
| Primary Benefit | Variance reduction | Bias reduction |
| Model Weights | Equal (typically) | Weighted by performance |
| Base Learner Type | High-variance (e.g., deep trees) | Low-variance (e.g., shallow trees) |
| Overfitting Risk | Low (averaging smooths) | Higher (can overfit noise) |
| Parallelization | Embarrassingly parallel | Inherently sequential |
Think of bagging as asking many independent experts and averaging their opinions to reduce noise. Think of boosting as iteratively building a team where each new member specifically addresses the weaknesses of the current team.
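To make the contrast concrete, here is a minimal sketch (not from the lesson's code) that trains a bagged ensemble of deep trees and a boosted ensemble of shallow trees on a synthetic dataset; the dataset and hyperparameter values are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic dataset
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Bagging: independent deep trees on bootstrap samples, trained in parallel
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=None),  # high-variance base learner
    n_estimators=100,
    n_jobs=-1,          # embarrassingly parallel
    random_state=0,
)

# Boosting: shallow trees fit sequentially to the current ensemble's errors
boosting = GradientBoostingClassifier(
    max_depth=3,        # weak, high-bias base learner
    n_estimators=100,
    learning_rate=0.1,  # shrinkage regularizes the sequential fit
    random_state=0,
)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```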
The bias-variance decomposition reveals why bagging and boosting complement each other—they address opposite ends of the trade-off.
Recall: $\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Noise}$
Bagging's Effect:
For an ensemble of models $f_1, ..., f_B$:
$$\text{Bias}[\bar{f}] = \text{Bias}[f_i]$$
Bias is unchanged—averaging doesn't fix systematic errors.
$$\text{Var}[\bar{f}] = \frac{\sigma^2}{B} + \frac{B-1}{B}\rho\sigma^2$$
Variance decreases as more models are added, but it is bounded below by $\rho\sigma^2$: the benefit of a larger ensemble is limited by how correlated the models are.
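A quick way to build intuition for this bound is to plug illustrative numbers into the variance formula; the sketch below assumes a single-model variance of $\sigma^2 = 1$ and a few correlation levels.

```python
# Plug illustrative numbers into Var[f_bar] = sigma^2/B + ((B-1)/B) * rho * sigma^2.
sigma2 = 1.0  # assumed variance of a single model (illustrative)

for rho in (0.0, 0.3, 0.7):
    for B in (1, 10, 100, 1000):
        var = sigma2 / B + (B - 1) / B * rho * sigma2
        print(f"rho={rho:.1f}  B={B:4d}  Var={var:.3f}")
    # As B grows, Var approaches rho * sigma^2, so decorrelating the models
    # (e.g. via feature randomization) matters as much as adding more of them.
```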
Boosting's Effect:
Boosting combines weak learners that individually have high bias:
$$F_M(x) = \sum_{m=1}^{M} \alpha_m f_m(x)$$
Each additional learner fits the residuals of the previous ensemble:
$$f_m \approx y - F_{m-1}(x)$$
This progressively reduces bias:
$$\text{Bias}[F_M] \xrightarrow{M \to \infty} 0$$
But variance can increase as boosting fits noise:
$$\text{Var}[F_M] \xrightarrow{M \to \infty} ?$$
The variance behavior depends on regularization (learning rate, early stopping).
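To see the residual-fitting recursion in code, here is a minimal hand-rolled sketch of gradient boosting for squared error with depth-1 stumps; it uses a synthetic dataset and omits the line search and regularization found in real libraries.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Illustrative 1-D regression data (not from the lesson)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=500)

nu, M = 0.1, 200                       # learning rate (nu) and number of stages
F = np.full_like(y, y.mean())          # F_0: constant initial prediction
stumps = []

for m in range(M):
    residuals = y - F                  # what the current ensemble still gets wrong
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
    F += nu * stump.predict(X)         # F_m = F_{m-1} + nu * f_m
    stumps.append(stump)

print(f"Training MSE after {M} stages: {np.mean((y - F) ** 2):.4f}")
```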
The experiment below estimates these components empirically: it repeatedly trains a single tree, a bagged ensemble, and a boosted ensemble on bootstrap samples and decomposes their test error across base-learner depths.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor
import matplotlib.pyplot as plt


def bias_variance_decomposition(model_factory, X, y, X_test, y_test, n_trials=100):
    """
    Empirically estimate bias and variance through repeated training.

    Train the model multiple times on bootstrapped training sets,
    then decompose the test error into bias and variance.
    """
    n_test = len(X_test)
    n_train = len(X)

    # Collect predictions across trials
    all_predictions = np.zeros((n_trials, n_test))

    for trial in range(n_trials):
        # Bootstrap sample of training data
        indices = np.random.choice(n_train, size=n_train, replace=True)
        X_boot = X[indices]
        y_boot = y[indices]

        # Train model
        model = model_factory()
        model.fit(X_boot, y_boot)

        # Predict
        all_predictions[trial] = model.predict(X_test)

    # Compute bias and variance
    mean_prediction = all_predictions.mean(axis=0)

    # Bias^2: (E[prediction] - true)^2
    bias_squared = (mean_prediction - y_test) ** 2

    # Variance: E[(prediction - E[prediction])^2]
    variance = ((all_predictions - mean_prediction) ** 2).mean(axis=0)

    # Total error
    total_error = ((all_predictions - y_test) ** 2).mean(axis=0)

    return {
        'bias_squared': bias_squared.mean(),
        'variance': variance.mean(),
        'total_error': total_error.mean(),
    }


def compare_bagging_boosting_bias_variance(X, y, X_test, y_test):
    """
    Compare bias-variance decomposition for bagging vs boosting
    across different complexity levels.
    """
    results = {
        'single_tree': [],
        'bagging': [],
        'boosting': [],
    }

    depths = [1, 2, 3, 5, 10, 15, None]  # None = unlimited

    for depth in depths:
        print(f"Testing depth={depth}...")

        # Single tree
        single = bias_variance_decomposition(
            lambda: DecisionTreeRegressor(max_depth=depth),
            X, y, X_test, y_test
        )
        results['single_tree'].append(single)

        # Bagging
        bagging = bias_variance_decomposition(
            lambda: BaggingRegressor(
                estimator=DecisionTreeRegressor(max_depth=depth),
                n_estimators=50
            ),
            X, y, X_test, y_test
        )
        results['bagging'].append(bagging)

        # Boosting (with learning rate for regularization)
        boosting = bias_variance_decomposition(
            lambda: GradientBoostingRegressor(
                max_depth=min(depth, 3) if depth else 3,  # Boosting prefers shallow trees
                n_estimators=100,
                learning_rate=0.1
            ),
            X, y, X_test, y_test
        )
        results['boosting'].append(boosting)

    return results, depths


def plot_bias_variance_comparison(results, depths):
    """Visualize bias-variance trade-off for each method."""
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))

    for ax, method in zip(axes, ['single_tree', 'bagging', 'boosting']):
        biases = [r['bias_squared'] for r in results[method]]
        variances = [r['variance'] for r in results[method]]
        totals = [r['total_error'] for r in results[method]]

        x = range(len(depths))
        ax.plot(x, biases, 'b-o', label='Bias²', linewidth=2)
        ax.plot(x, variances, 'r-o', label='Variance', linewidth=2)
        ax.plot(x, totals, 'g--', label='Total Error', linewidth=2)

        ax.set_xlabel('Max Depth')
        ax.set_ylabel('Error Component')
        ax.set_title(method.replace('_', ' ').title())
        ax.set_xticks(x)
        ax.set_xticklabels([str(d) if d else '∞' for d in depths])
        ax.legend()
        ax.grid(True, alpha=0.3)

    plt.tight_layout()
    return fig
```

Bagging with deep trees has low bias but high variance (which bagging reduces). Boosting with stumps has high bias (which boosting reduces) but low variance. This is why the optimal base learner differs between the two paradigms.
Beyond statistical properties, practical deployment requires understanding computational trade-offs.
Training Time Complexity:
Bagging: If training one model takes time $T$, bagging $B$ models takes $O(B \cdot T)$ on a single core; because the models are independent, this drops to roughly $O(B \cdot T / P)$ with $P$ parallel workers.

Boosting: Inherently sequential, because each model depends on the errors of the models before it. The $B$ stages cannot be trained concurrently, so total time remains $O(B \cdot T)$ regardless of the number of workers; parallelism is limited to split-finding within each tree.
Prediction Time:
Both paradigms require evaluating all models at prediction time:

$$T_{\text{predict}} = O(B \cdot T_{pred})$$

where $T_{pred}$ is the prediction time for one model. Both can parallelize prediction across the $B$ models.
Memory Requirements: a bagged ensemble stores $B$ full-depth trees and is typically the larger model; a boosted ensemble stores many shallow trees and is usually more compact.

Hyperparameter Sensitivity: bagging has few, largely independent hyperparameters (number of trees, tree depth, feature subsampling), while boosting has many interacting ones (learning rate, number of stages, tree depth, subsampling, regularization) and usually benefits from early stopping.
| Aspect | Bagging | Boosting |
|---|---|---|
| Training Parallelism | Fully parallel | Sequential (within-tree only) |
| GPUs Useful? | Limited (tree-based) | Yes (XGBoost, LightGBM, CatBoost) |
| Scales with Cores | Linearly | Limited |
| Hyperparameter Count | Few | Many (interacting) |
| Tuning Difficulty | Low | Moderate to high |
| Early Stopping Benefit | Minimal | Significant |
| Incremental Training | Easy (add trees) | Hard (model depends on history) |
| Model Size | Large (many full trees) | Moderate (many small trees) |
The benchmark below compares how training time scales for the two paradigms and how much bagging gains from parallelism.

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt


def timing_experiment(n_samples_list, n_features=50, n_estimators=100):
    """
    Compare training time scaling for bagging vs boosting.
    """
    results = {
        'n_samples': n_samples_list,
        'bagging_time': [],
        'boosting_time': [],
        'bagging_parallel_time': [],
    }

    for n_samples in n_samples_list:
        print(f"Testing n_samples={n_samples}...")

        X, y = make_classification(
            n_samples=n_samples,
            n_features=n_features,
            n_informative=20,
            random_state=42
        )

        # Bagging (1 core for fair comparison)
        rf_single = RandomForestClassifier(
            n_estimators=n_estimators, max_depth=10,
            n_jobs=1, random_state=42
        )
        start = time.time()
        rf_single.fit(X, y)
        results['bagging_time'].append(time.time() - start)

        # Bagging (all cores)
        rf_parallel = RandomForestClassifier(
            n_estimators=n_estimators, max_depth=10,
            n_jobs=-1, random_state=42
        )
        start = time.time()
        rf_parallel.fit(X, y)
        results['bagging_parallel_time'].append(time.time() - start)

        # Boosting (n_jobs doesn't parallelize tree training)
        gb = GradientBoostingClassifier(
            n_estimators=n_estimators,
            max_depth=3,  # Typical for boosting
            learning_rate=0.1,
            random_state=42
        )
        start = time.time()
        gb.fit(X, y)
        results['boosting_time'].append(time.time() - start)

    return results


def plot_timing_results(results):
    """Visualize timing comparison."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

    # Absolute times
    ax1.plot(results['n_samples'], results['bagging_time'], 'b-o',
             label='Bagging (1 core)', linewidth=2)
    ax1.plot(results['n_samples'], results['bagging_parallel_time'], 'g-s',
             label='Bagging (parallel)', linewidth=2)
    ax1.plot(results['n_samples'], results['boosting_time'], 'r-^',
             label='Boosting', linewidth=2)
    ax1.set_xlabel('Number of Samples')
    ax1.set_ylabel('Training Time (seconds)')
    ax1.set_title('Training Time Comparison')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    ax1.set_xscale('log')
    ax1.set_yscale('log')

    # Speedup from parallelism
    speedup = [b / p for b, p in zip(results['bagging_time'],
                                     results['bagging_parallel_time'])]
    ax2.plot(results['n_samples'], speedup, 'g-o', linewidth=2)
    ax2.axhline(y=1, color='r', linestyle='--', label='No speedup')
    ax2.set_xlabel('Number of Samples')
    ax2.set_ylabel('Speedup (single / parallel)')
    ax2.set_title('Bagging Parallelization Speedup')
    ax2.grid(True, alpha=0.3)
    ax2.set_xscale('log')

    plt.tight_layout()
    return fig


def production_considerations():
    """
    Summary of production deployment considerations.
    """
    considerations = """
    Production Deployment Checklist:

    BAGGING (Random Forest):
    ✓ Training: Use all cores (n_jobs=-1)
    ✓ Prediction: Can batch across trees
    ✓ Memory: ~O(n_estimators × tree_size)
    ✓ Updates: Can add new trees easily
    ✓ Partial fit: Not directly supported

    BOOSTING (XGBoost/LightGBM):
    ✓ Training: Use GPU if available
    ✓ Prediction: Highly optimized, often faster than RF
    ✓ Memory: More efficient (histogram-based)
    ✓ Updates: Difficult (sequential dependency)
    ✓ Early stopping: Essential for efficiency

    Recommendation by Use Case:
    - Batch training, parallel hardware → Prefer Bagging
    - Need absolute best accuracy → Prefer Boosting (tuned)
    - Fast iteration, exploration → Prefer Bagging
    - Production serving latency critical → Both similar (test)
    - Incremental learning needed → Neither (consider online methods)
    """
    return considerations
```

While bagging is theoretically parallelizable, modern boosting implementations (XGBoost, LightGBM) are so optimized that they often train faster than Random Forests in practice, especially on large datasets. Always benchmark on your specific data and hardware.
Choosing between bagging and boosting depends on your data, constraints, and objectives. Here's a decision framework based on practical experience:
Decision Matrix by Problem Characteristics:
| Characteristic | Recommendation | Rationale |
|---|---|---|
| High label noise | Bagging | Boosting overfits noisy labels |
| Class imbalance | Boosting | Better at focusing on minority class |
| Many irrelevant features | Random Forest | Feature randomization handles well |
| Feature interactions | Boosting | Residual fitting captures interactions |
| Need fast training | Random Forest (parallel) | Boosting is sequential |
| Need fast inference | Either (test) | Modern implementations are similar |
| Memory constrained | Boosting | Shallow trees use less memory |
| Want uncertainty estimates | Bagging | Natural through prediction variance |
When in doubt, try both. Start with Random Forest for a quick baseline (5 minutes to tune), then invest time in boosting (1-2 hours to tune) if you need the extra performance. The difference is often 1-3% accuracy, which may or may not matter for your application.
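As a sketch of that workflow (dataset, model choices, and parameter values here are placeholders, not recommendations): train an untuned Random Forest first, then a boosting model with early stopping, and compare. This version uses scikit-learn's `HistGradientBoostingClassifier` simply to avoid an extra dependency; XGBoost or LightGBM would follow the same pattern.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Placeholder data; substitute your own feature matrix and labels
X, y = make_classification(n_samples=10_000, n_features=40, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 1: quick Random Forest baseline with near-default settings
rf = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)
rf.fit(X_train, y_train)
print(f"Random Forest baseline: {rf.score(X_test, y_test):.4f}")

# Step 2: gradient boosting with early stopping; invest in further tuning
# (learning rate, depth, regularization) only if the gap justifies it
gb = HistGradientBoostingClassifier(
    learning_rate=0.1,
    max_iter=1000,
    early_stopping=True,
    validation_fraction=0.1,
    random_state=0,
)
gb.fit(X_train, y_train)
print(f"Gradient boosting: {gb.score(X_test, y_test):.4f}")
```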
Why choose when you can combine? Several approaches blend bagging and boosting to capture benefits of both.
1. Stochastic Gradient Boosting (Subsampling)
Train each boosting iteration on a random subsample of the data:
$$F_m = F_{m-1} + \nu \cdot f_m(x; \text{subsample})$$
- Adds bagging-style randomness: each tree is fit on a different random fraction of the rows
- Typically reduces variance and improves generalization
- Controlled by the `subsample` parameter in XGBoost/LightGBM

2. Feature Subsampling in Boosting
Like Random Forests, sample features at each split:
- `colsample_bytree`: sample features per tree
- `colsample_bylevel`: sample features per depth level
- `colsample_bynode`: sample features per split

This adds Random Forest-style diversity to boosting.
3. Dropout in Boosting (DART)
Dropout Additive Regression Trees (DART) randomly drops a subset of the previously fitted trees when computing the gradients for each new tree. This keeps early trees from dominating the ensemble and acts as an additional form of regularization, at the cost of slower training.
4. Bagging Boosted Models
Train multiple boosted models on bootstrap samples:
```python
import numpy as np
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.base import clone
import xgboost as xgb


class BaggedBoosting:
    """
    Ensemble that bags multiple boosted models.

    Combines variance reduction from bagging with
    bias reduction from boosting.
    """

    def __init__(
        self,
        n_bagging_iterations=5,
        boosting_params=None,
        random_state=42
    ):
        self.n_bagging_iterations = n_bagging_iterations
        self.boosting_params = boosting_params or {
            'n_estimators': 100,
            'max_depth': 3,
            'learning_rate': 0.1,
        }
        self.random_state = random_state
        self.boosted_models = []

    def fit(self, X, y):
        """Train multiple boosted models on bootstrap samples."""
        n_samples = X.shape[0]
        np.random.seed(self.random_state)

        self.boosted_models = []

        for i in range(self.n_bagging_iterations):
            # Bootstrap sample
            indices = np.random.choice(n_samples, size=n_samples, replace=True)
            X_boot = X[indices]
            y_boot = y[indices]

            # Train boosted model
            model = xgb.XGBClassifier(
                **self.boosting_params,
                random_state=self.random_state + i,
                use_label_encoder=False,
                eval_metric='logloss'
            )
            model.fit(X_boot, y_boot)
            self.boosted_models.append(model)

        return self

    def predict_proba(self, X):
        """Average probabilities across boosted models."""
        all_probs = []
        for model in self.boosted_models:
            probs = model.predict_proba(X)
            all_probs.append(probs)
        return np.mean(all_probs, axis=0)

    def predict(self, X):
        """Predict class with highest average probability."""
        return self.predict_proba(X).argmax(axis=1)


def stochastic_gradient_boosting_demo(X_train, y_train, X_test, y_test):
    """
    Demonstrate effect of subsampling in gradient boosting.
    """
    results = {}
    subsample_rates = [0.5, 0.7, 0.9, 1.0]  # 1.0 = no subsampling

    for rate in subsample_rates:
        model = xgb.XGBClassifier(
            n_estimators=200,
            max_depth=4,
            learning_rate=0.1,
            subsample=rate,
            colsample_bytree=0.8,  # Feature subsampling too
            random_state=42,
            use_label_encoder=False,
            eval_metric='logloss'
        )
        model.fit(
            X_train, y_train,
            eval_set=[(X_test, y_test)],
            verbose=False
        )

        train_acc = (model.predict(X_train) == y_train).mean()
        test_acc = (model.predict(X_test) == y_test).mean()

        results[rate] = {
            'train_acc': train_acc,
            'test_acc': test_acc,
            'gap': train_acc - test_acc,
        }
        print(f"Subsample={rate}: Train={train_acc:.4f}, "
              f"Test={test_acc:.4f}, Gap={train_acc - test_acc:.4f}")

    return results


def dart_boosting_demo(X_train, y_train, X_test, y_test):
    """
    Demonstrate DART (Dropout Additive Regression Trees).
    """
    # Standard boosting
    standard = xgb.XGBClassifier(
        n_estimators=200,
        max_depth=4,
        learning_rate=0.1,
        booster='gbtree',  # Standard
        random_state=42,
        use_label_encoder=False,
        eval_metric='logloss'
    )
    standard.fit(X_train, y_train)

    # DART boosting
    dart = xgb.XGBClassifier(
        n_estimators=200,
        max_depth=4,
        learning_rate=0.1,
        booster='dart',  # DART
        rate_drop=0.1,   # Dropout rate
        skip_drop=0.5,   # Keep 50% of iterations dropout-free
        random_state=42,
        use_label_encoder=False,
        eval_metric='logloss'
    )
    dart.fit(X_train, y_train)

    standard_acc = (standard.predict(X_test) == y_test).mean()
    dart_acc = (dart.predict(X_test) == y_test).mean()

    print(f"Standard GBT: {standard_acc:.4f}")
    print(f"DART: {dart_acc:.4f}")

    return {
        'standard_acc': standard_acc,
        'dart_acc': dart_acc,
    }
```

Stochastic gradient boosting (subsample=0.8, colsample_bytree=0.8) is the most practical hybrid—it's built into XGBoost/LightGBM, adds minimal overhead, and often improves generalization. Start here before exploring more complex hybrids.
Theory guides understanding, but empirical results inform decisions. Here's what extensive benchmarking has shown about bagging vs boosting:
Findings from Kaggle Competitions (2015-2023): on tabular problems, gradient boosting implementations (XGBoost, later LightGBM and CatBoost) dominate winning solutions, while Random Forests appear mostly as quick baselines or as components of blended submissions.
Academic Benchmark Studies:
Fernández-Delgado et al. (2014) tested 179 classifiers on 121 datasets: random forest variants were the strongest family overall (the study predates the modern gradient-boosting libraries).
Grinsztajn et al. (2022), "Why do tree-based models still outperform deep learning on tabular data?": both random forests and gradient-boosted trees beat neural networks on medium-sized tabular benchmarks, with gradient boosting typically slightly ahead.
| Dataset Type | Bagging Performance | Boosting Performance | Typical Gap |
|---|---|---|---|
| Small (< 5K samples) | Strong | Good (may overfit) | ~0-1% |
| Medium (5K-50K) | Strong | Very strong | ~1-2% |
| Large (> 50K) | Strong | Excellent | ~2-5% |
| Noisy labels | Robust | May overfit | Varies widely |
| Clean, engineered features | Good | Excellent | ~2-4% |
| High-dimensional sparse | Good | Very good | ~1-2% |
| Class imbalanced | Moderate | Good (with weights) | ~1-3% |
Benchmark results depend heavily on hyperparameter tuning effort. A well-tuned Random Forest often beats a poorly-tuned XGBoost. The 'boosting wins' narrative assumes equal tuning effort, which may not reflect real-world time constraints.
The Meta-Learning Perspective:
Recent work on automated machine learning (AutoML) suggests that no single ensemble family wins across all datasets: which paradigm comes out on top depends on the data, and strong AutoML systems routinely evaluate both.
The practical implication: include both in your AutoML search space, and consider ensembling their predictions for final submissions.
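One simple way to ensemble the two paradigms' predictions is soft voting, i.e. averaging predicted class probabilities. The sketch below uses scikit-learn's `VotingClassifier` with placeholder models and data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.model_selection import cross_val_score

# Placeholder data and models
X, y = make_classification(n_samples=5000, n_features=30, random_state=0)

blend = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)),
        ("gb", GradientBoostingClassifier(n_estimators=200, max_depth=3, random_state=0)),
    ],
    voting="soft",  # average predicted class probabilities instead of hard votes
)

print(f"Blended CV accuracy: {cross_val_score(blend, X, y, cv=5).mean():.4f}")
```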
Bagging and boosting represent complementary strategies for building powerful ensembles. Let's consolidate the key insights:

- Bagging trains independent models on bootstrap samples and averages them: it reduces variance, parallelizes trivially, tolerates label noise, and needs little tuning.
- Boosting trains models sequentially on the errors of the current ensemble: it reduces bias and usually wins on accuracy when tuned, but is more sensitive to noise and hyperparameters.
- The ideal base learner differs: deep, high-variance trees for bagging; shallow, high-bias trees for boosting.
- Hybrids (row subsampling, feature subsampling, DART, bagging boosted models) blend the two and are built into modern libraries.
- When in doubt, start with a Random Forest baseline and invest in tuning a boosted model only if the extra accuracy matters.
What's Next:
The final page in this module explores When Bagging Helps—synthesizing everything we've learned into practical guidelines for recognizing when bagging is the right tool for your problem.
You now have a deep understanding of the bagging vs boosting dichotomy. You can reason about which paradigm suits your problem, understand the trade-offs involved, and leverage hybrid approaches when appropriate.