Now that we understand the theoretical foundations of ensemble learning—variance reduction, crowd wisdom, error decomposition, and diversity—we're ready to survey the landscape of ensemble methods. How do practitioners actually build ensembles?
Ensemble strategies can be categorized along several dimensions: how the models are constructed, how diverse the base learners are, and how their predictions are combined.
This page provides a comprehensive taxonomy of ensemble strategies, preparing you to understand the specific algorithms (Bagging, Random Forests, AdaBoost, Gradient Boosting) that we'll study in subsequent chapters.
By the end of this page, you will understand the major families of ensemble methods, their relative strengths and weaknesses, and when to choose each approach. You'll have a mental map of the ensemble landscape that guides algorithm selection in practice.
Ensemble methods can be organized along several orthogonal dimensions:
Dimension 1: Construction Strategy
| Strategy | Description | Examples |
|---|---|---|
| Parallel | Models trained independently, no information flow | Bagging, Random Forests |
| Sequential | Each model corrects errors of predecessors | AdaBoost, Gradient Boosting |
| Cascading | Models applied in stages with early exit | Viola-Jones face detection |
Dimension 2: Base Learner Diversity
| Diversity Type | Description | Examples |
|---|---|---|
| Homogeneous | Same algorithm, different data/features | Random Forests, GBM |
| Heterogeneous | Different algorithms | Stacking, Voting Classifier |
| Hybrid | Mix of both | Super Learner |
Dimension 3: Combination Method
| Method | Task | Description |
|---|---|---|
| Averaging | Regression | Simple or weighted mean |
| Voting | Classification | Majority vote (hard or soft) |
| Stacking | Both | Meta-learner combines base predictions |
| Blending | Both | Like stacking with holdout-based meta-training |
The three most important ensemble paradigms are Bagging (parallel, variance reduction), Boosting (sequential, bias reduction), and Stacking (meta-learning for combination). Most practical ensemble methods are variants of these three core ideas.
Bagging (Bootstrap Aggregating) is the foundational parallel ensemble method. Introduced by Leo Breiman in 1996, it remains one of the most robust and widely-used techniques.
The Algorithm:
Given: Training set D of size N, base learner L, number of iterations M
For each m = 1, 2, ..., M:
1. Create bootstrap sample D_m by sampling N examples with replacement from D
2. Train model h_m = L(D_m)
For prediction:
- Regression: ŷ = (1/M) Σ h_m(x)
- Classification: ŷ = mode({h_1(x), h_2(x), ..., h_M(x)})
Why It Works: Each bootstrap sample gives its model a slightly different view of the training data. The resulting models make partially independent errors, so averaging (or voting over) their predictions cancels much of that error, reducing variance while leaving bias essentially unchanged.
Key Variants:
| Method | Additional Diversity | Key Innovation |
|---|---|---|
| Bagging (original) | Bootstrap sampling only | Variance reduction via averaging |
| Random Forests | Bootstrap sampling + random feature subset at each split | Further decorrelation of trees |
| Random Patches | Random subsets of both samples and features | Both row and column subsampling |
| Extra Trees | Random split thresholds (typically no bootstrap) | Maximum randomization |
| Random Subspace Method | Feature subsets (no bootstrap) | Feature-based diversity only |
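These variants map onto standard scikit-learn estimators. Below is a minimal comparison sketch on synthetic data; the subsampling fractions and estimator counts are illustrative choices, not recommended defaults.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

variants = {
    # Plain bagging: bootstrap rows only (a decision tree is the default base learner)
    "Bagging": BaggingClassifier(n_estimators=100, random_state=0),
    # Random Patches: subsample both rows and feature columns
    "Random Patches": BaggingClassifier(n_estimators=100, max_samples=0.8,
                                        max_features=0.8, random_state=0),
    # Random Forest: bootstrap rows + random feature subset at each split
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # Extra Trees: random split thresholds, no bootstrap by default
    "Extra Trees": ExtraTreesClassifier(n_estimators=100, random_state=0),
}

for name, model in variants.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:15s} accuracy: {scores.mean():.4f}")
```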
When to Use Bagging:
✅ Base learner has high variance and low bias (decision trees, neural nets)
✅ You want robust predictions with minimal tuning
✅ Training data is sufficient for bootstrap sampling
✅ Parallelization is available (significant speedup)
✅ Interpretability via feature importance is desired
❌ Base learner is stable (linear models, kNN with large k)
❌ Dataset is very small (bootstrap doesn't create enough diversity)
❌ You need to reduce bias, not variance
```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from collections import Counter


class BaggingEnsemble:
    """
    Bagging ensemble implementation from scratch.

    Demonstrates the core concepts:
    1. Bootstrap sampling for diversity
    2. Parallel model training
    3. Aggregation via averaging/voting
    4. Out-of-bag estimation
    """

    def __init__(self, base_estimator, n_estimators=10, max_samples=1.0,
                 bootstrap=True, random_state=None):
        self.base_estimator = base_estimator
        self.n_estimators = n_estimators
        self.max_samples = max_samples
        self.bootstrap = bootstrap
        self.random_state = random_state
        self.estimators_ = []
        self.oob_indices_ = []

    def _bootstrap_sample(self, X, y, rng):
        """Create a bootstrap sample."""
        n_samples = X.shape[0]
        n_draw = int(n_samples * self.max_samples)

        if self.bootstrap:
            indices = rng.choice(n_samples, size=n_draw, replace=True)
        else:
            indices = rng.choice(n_samples, size=n_draw, replace=False)

        # Out-of-bag indices
        oob = np.setdiff1d(np.arange(n_samples), indices)
        return X[indices], y[indices], oob

    def fit(self, X, y):
        """Fit the bagging ensemble."""
        rng = np.random.RandomState(self.random_state)
        self.estimators_ = []
        self.oob_indices_ = []

        # Store OOB predictions for OOB score
        n_samples = X.shape[0]
        self.oob_predictions_ = np.zeros((self.n_estimators, n_samples))
        self.oob_predictions_[:] = np.nan

        for i in range(self.n_estimators):
            # Bootstrap sample
            X_b, y_b, oob_idx = self._bootstrap_sample(X, y, rng)

            # Clone and fit estimator
            estimator = clone(self.base_estimator)
            estimator.fit(X_b, y_b)

            self.estimators_.append(estimator)
            self.oob_indices_.append(oob_idx)

            # OOB predictions for this estimator
            if len(oob_idx) > 0:
                self.oob_predictions_[i, oob_idx] = estimator.predict(X[oob_idx])

        return self

    def predict(self, X):
        """Aggregate predictions from all estimators."""
        predictions = np.array([est.predict(X) for est in self.estimators_])

        # For classification: majority vote
        # For regression: average
        if hasattr(self.estimators_[0], 'classes_'):
            # Classification: majority vote
            result = []
            for j in range(X.shape[0]):
                votes = predictions[:, j]
                result.append(Counter(votes).most_common(1)[0][0])
            return np.array(result)
        else:
            # Regression: average
            return predictions.mean(axis=0)

    def oob_score(self, y):
        """Compute out-of-bag score."""
        n_samples = len(y)
        oob_pred = []

        for j in range(n_samples):
            # Predictions from estimators that didn't see this sample
            valid_preds = self.oob_predictions_[~np.isnan(self.oob_predictions_[:, j]), j]
            if len(valid_preds) > 0:
                if hasattr(self.estimators_[0], 'classes_'):
                    pred = Counter(valid_preds).most_common(1)[0][0]
                else:
                    pred = valid_preds.mean()
                oob_pred.append(pred)
            else:
                oob_pred.append(np.nan)

        oob_pred = np.array(oob_pred)
        valid = ~np.isnan(oob_pred)

        if hasattr(self.estimators_[0], 'classes_'):
            return np.mean(oob_pred[valid] == y[valid])
        else:
            return 1 - np.mean((oob_pred[valid] - y[valid])**2) / np.var(y[valid])


def clone(estimator):
    """Simple clone for sklearn estimators."""
    return type(estimator)(**estimator.get_params())


# Demonstration
if __name__ == "__main__":
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )

    # Single tree
    tree = DecisionTreeClassifier(random_state=42)
    tree.fit(X_train, y_train)
    print(f"Single tree accuracy: {(tree.predict(X_test)==y_test).mean():.4f}")

    # Bagged trees
    bagging = BaggingEnsemble(
        DecisionTreeClassifier(), n_estimators=50, random_state=42
    )
    bagging.fit(X_train, y_train)
    print(f"Bagging accuracy: {(bagging.predict(X_test)==y_test).mean():.4f}")
    print(f"OOB score: {bagging.oob_score(y_train):.4f}")
```

Boosting is the other major ensemble paradigm. Unlike bagging's parallel diversity, boosting builds models sequentially, with each model focusing on the errors of its predecessors.
The Core Idea:
Boosting converts weak learners (each only slightly better than random guessing) into a strong learner by training models sequentially, focusing each new model on the examples its predecessors handled poorly, and combining all models in a weighted vote.
Historical Context:
Boosting has a remarkable theoretical origin. In 1988, Kearns and Valiant posed the question: "Can weak learners be combined to form a strong learner?" This was answered affirmatively by Schapire (1990) and made practical by Freund and Schapire's AdaBoost (1995).
The Key Insight:
Where bagging reduces variance by averaging independent models, boosting reduces bias by iteratively correcting systematic errors. Each new model targets what previous models got wrong.
| Method | Era | Key Innovation |
|---|---|---|
| AdaBoost | 1995 | Sample reweighting, exponential loss |
| Gradient Boosting | 2001 | Gradient descent in function space |
| XGBoost | 2016 | Regularization, second-order gradients |
| LightGBM | 2017 | Histogram binning, leaf-wise growth |
| CatBoost | 2017 | Ordered boosting, categorical handling |
Why Boosting Reduces Bias:
Consider a problem where no single tree can capture the pattern. Each tree might underfit in different regions. Boosting's sequential correction lets later models concentrate on the regions earlier models got wrong, so the combined prediction captures structure that no individual tree could represent.
AdaBoost vs. Gradient Boosting:
| Aspect | AdaBoost | Gradient Boosting |
|---|---|---|
| Error focus | Reweight samples | Fit residuals |
| Loss function | Exponential | Any differentiable |
| Weak learner | Weighted classification | Regression on gradients |
| Sensitivity | Outliers heavily weighted | More robust options |
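To make the "fit residuals" row concrete, here is a minimal sketch of gradient boosting for squared error, where each shallow regression tree is fit to the current residuals (the negative gradient of the loss). The function and parameter names are illustrative, not any library's API.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_residual_boosting(X, y, n_rounds=200, learning_rate=0.1, max_depth=2):
    """Gradient boosting for squared error: each tree fits the current residuals."""
    f0 = y.mean()                          # initial constant prediction
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred               # negative gradient of 0.5 * (y - f)^2
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        pred += learning_rate * tree.predict(X)   # small step toward the residuals
        trees.append(tree)
    return f0, trees

def predict_residual_boosting(f0, trees, X, learning_rate=0.1):
    """Sum the initial prediction and the shrunken contribution of every tree."""
    pred = np.full(X.shape[0], f0)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred
```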
When to Use Boosting:
✅ You need to reduce bias (underfitting)
✅ Data is relatively clean (outliers can hurt AdaBoost)
✅ You can tune hyperparameters carefully
✅ You want state-of-the-art predictive performance
❌ Training time is limited (sequential is slow)
❌ Data is very noisy (boosting may overfit to noise)
❌ Interpretability is paramount (harder than single trees)
Unlike bagging, boosting can overfit. As you add more boosters, training error keeps decreasing, but test error may increase. Early stopping, regularization, and careful hyperparameter tuning are essential. This is why modern implementations (XGBoost, LightGBM) include extensive regularization options.
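As a brief illustration, scikit-learn's GradientBoostingClassifier exposes this kind of early stopping directly; the specific values below are illustrative, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Allow up to 1000 boosting rounds, but stop once the internal holdout score
# fails to improve for 10 consecutive rounds.
gbm = GradientBoostingClassifier(
    n_estimators=1000,
    learning_rate=0.1,
    validation_fraction=0.1,   # holdout fraction used for early stopping
    n_iter_no_change=10,
    random_state=0,
)
gbm.fit(X_train, y_train)
print(f"Rounds actually used: {gbm.n_estimators_}")
print(f"Test accuracy: {gbm.score(X_test, y_test):.4f}")
```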
Stacking (Stacked Generalization) uses a meta-learner to combine base model predictions. Instead of simple averaging or voting, stacking learns the optimal combination function.
The Algorithm:
Level 0 (Base Learners):
Train diverse base models on training data
Generate out-of-fold predictions for training set
Generate predictions for test set
Level 1 (Meta-Learner):
Use base model predictions as features
Train meta-learner to predict true labels
Apply to test set base predictions
Why Out-of-Fold Predictions:
If we trained base models on all data and used their training predictions as meta-features, the meta-learner would learn to rely on overfit predictions. Using out-of-fold (cross-validation) predictions simulates what base models predict on unseen data.
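One lightweight way to generate such out-of-fold meta-features is scikit-learn's cross_val_predict; a minimal sketch, assuming a binary classification task (the base models and fold count are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

base_models = [
    RandomForestClassifier(n_estimators=100, random_state=0),
    GradientBoostingClassifier(random_state=0),
]

# Each column holds one base model's out-of-fold probability for class 1,
# so the meta-learner only ever sees predictions made on unseen folds.
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

meta_learner = LogisticRegression().fit(meta_features, y)
```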
Meta-Learner Choices:
| Meta-Learner | Behavior |
|---|---|
| Linear (Ridge, Logistic) | Learns optimal weights, simple combination |
| Tree-based | Can learn nonlinear interactions between base predictions |
| Neural Network | Complex combinations, needs more data |
| Simple Average | When meta-learning offers no benefit (reduces to voting) |
```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.base import clone


class StackingEnsemble:
    """
    Stacking ensemble with cross-validated meta-features.

    Key concepts:
    1. Base models generate out-of-fold predictions
    2. Meta-learner learns optimal combination
    3. Avoids information leakage via CV
    """

    def __init__(self, base_models, meta_model, n_folds=5):
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_folds = n_folds
        self.fitted_base_models_ = []
        self.fitted_meta_model_ = None

    def fit(self, X, y):
        """
        Fit the stacking ensemble.

        1. Generate out-of-fold predictions from base models
        2. Train meta-model on these meta-features
        """
        n_samples = X.shape[0]
        n_base = len(self.base_models)

        # Storage for out-of-fold predictions
        meta_features = np.zeros((n_samples, n_base))

        # For each base model, generate OOF predictions
        kf = KFold(n_splits=self.n_folds, shuffle=True, random_state=42)

        for i, base_model in enumerate(self.base_models):
            # Store models trained on each fold for prediction
            fold_models = []

            for train_idx, val_idx in kf.split(X):
                X_train_fold, X_val_fold = X[train_idx], X[val_idx]
                y_train_fold = y[train_idx]

                # Clone and fit on fold
                model = clone(base_model)
                model.fit(X_train_fold, y_train_fold)

                # Predict on validation fold
                if hasattr(model, 'predict_proba'):
                    meta_features[val_idx, i] = model.predict_proba(X_val_fold)[:, 1]
                else:
                    meta_features[val_idx, i] = model.predict(X_val_fold)

                fold_models.append(model)

            self.fitted_base_models_.append(fold_models)

        # Fit meta-model on out-of-fold predictions
        self.fitted_meta_model_ = clone(self.meta_model)
        self.fitted_meta_model_.fit(meta_features, y)

        # Refit base models on full data for prediction
        self.final_base_models_ = []
        for base_model in self.base_models:
            model = clone(base_model)
            model.fit(X, y)
            self.final_base_models_.append(model)

        return self

    def predict_proba(self, X):
        """Generate stacked predictions."""
        # Get base model predictions
        meta_features = np.zeros((X.shape[0], len(self.final_base_models_)))
        for i, model in enumerate(self.final_base_models_):
            if hasattr(model, 'predict_proba'):
                meta_features[:, i] = model.predict_proba(X)[:, 1]
            else:
                meta_features[:, i] = model.predict(X)

        # Meta-model prediction
        return self.fitted_meta_model_.predict_proba(meta_features)

    def predict(self, X):
        """Generate class predictions."""
        if hasattr(self.fitted_meta_model_, 'predict_proba'):
            return (self.predict_proba(X)[:, 1] > 0.5).astype(int)

        meta_features = np.zeros((X.shape[0], len(self.final_base_models_)))
        for i, model in enumerate(self.final_base_models_):
            meta_features[:, i] = model.predict(X)
        return self.fitted_meta_model_.predict(meta_features)


# Demonstration
if __name__ == "__main__":
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )

    # Base models
    base_models = [
        RandomForestClassifier(n_estimators=50, random_state=42),
        GradientBoostingClassifier(n_estimators=50, random_state=42),
        SVC(probability=True, random_state=42),
    ]

    # Meta-model
    meta_model = LogisticRegression()

    # Individual performances
    for name, model in [('RF', base_models[0]), ('GB', base_models[1]),
                        ('SVM', base_models[2])]:
        model_copy = clone(model)
        model_copy.fit(X_train, y_train)
        print(f"{name}: {(model_copy.predict(X_test)==y_test).mean():.4f}")

    # Stacked ensemble
    stacking = StackingEnsemble(base_models, meta_model)
    stacking.fit(X_train, y_train)
    print(f"Stacking: {(stacking.predict(X_test)==y_test).mean():.4f}")
```

Stacking provides the most benefit when base models have complementary strengths—one model excels where another struggles. If all base models make similar predictions, the meta-learner can't learn useful combinations. Maximize base model diversity!
Voting ensembles are the simplest form of model combination. Despite their simplicity, they're often surprisingly competitive.
Hard Voting:
Each model casts a vote for a class; the class with the most votes wins:
$$\hat{y} = \text{mode}\{h_1(x), h_2(x), \ldots, h_M(x)\}$$
Soft Voting:
Average the predicted probabilities; return the class with highest average probability:
$$\hat{y} = \arg\max_c \frac{1}{M}\sum_{i=1}^{M} P_i(y=c|x)$$
Weighted Voting:
Assign weights to models based on some criterion (e.g., validation accuracy):
$$\hat{y} = \arg\max_c \sum_{i=1}^{M} w_i \cdot P_i(y=c|x)$$
where $\sum_i w_i = 1$.
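A small numpy sketch of the three voting rules, using made-up probabilities from three hypothetical models for illustration:

```python
import numpy as np

# proba[i] is model i's predicted class-probability matrix, shape (n_samples, n_classes)
proba = np.array([
    [[0.9, 0.1], [0.4, 0.6]],   # model 1
    [[0.6, 0.4], [0.3, 0.7]],   # model 2
    [[0.2, 0.8], [0.4, 0.6]],   # model 3
])

# Hard voting: each model votes for its argmax class; the most common vote wins
hard_votes = proba.argmax(axis=2)                       # shape (n_models, n_samples)
hard_pred = np.array([np.bincount(col).argmax() for col in hard_votes.T])

# Soft voting: average the probabilities across models, then take the argmax
soft_pred = proba.mean(axis=0).argmax(axis=1)

# Weighted voting: weight each model (e.g., by validation accuracy); weights sum to 1
w = np.array([0.5, 0.3, 0.2])
weighted_pred = np.tensordot(w, proba, axes=1).argmax(axis=1)

print(hard_pred, soft_pred, weighted_pred)
```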
Comparison:
| Method | Pros | Cons |
|---|---|---|
| Hard Voting | Simple, no calibration needed | Ignores confidence |
| Soft Voting | Uses confidence information | Requires calibrated probabilities |
| Weighted Voting | Emphasizes better models | May overfit to validation |
| Plurality | Decides even when no class wins a majority | Winning class may hold only a small share of votes |
Blending is a simplified version of stacking that uses a single holdout set instead of cross-validation.
The Algorithm:
1. Split training data into train_1 (e.g., 70%) and train_2 (30%)
2. Train base models on train_1
3. Generate predictions for train_2 → meta-features
4. Train meta-model on train_2 meta-features
5. For test set: use base model predictions as meta-features
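A minimal sketch of these steps for a binary classification task; the 70/30 split and the particular base and meta models mirror the outline above and are illustrative choices:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 1. Split training data into train_1 (70%) and train_2 (30%)
X1, X2, y1, y2 = train_test_split(X_train, y_train, test_size=0.3, random_state=0)

# 2. Train base models on train_1
base_models = [
    RandomForestClassifier(n_estimators=100, random_state=0).fit(X1, y1),
    GradientBoostingClassifier(random_state=0).fit(X1, y1),
]

# 3-4. Base model predictions on train_2 become the meta-features for the meta-model
meta_train = np.column_stack([m.predict_proba(X2)[:, 1] for m in base_models])
meta_model = LogisticRegression().fit(meta_train, y2)

# 5. For the test set, stack base predictions and apply the meta-model
meta_test = np.column_stack([m.predict_proba(X_test)[:, 1] for m in base_models])
print(f"Blended accuracy: {meta_model.score(meta_test, y_test):.4f}")
```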
Blending vs. Stacking:
| Aspect | Stacking | Blending |
|---|---|---|
| Training data usage | Full (via CV) | Partial (holdout) |
| Computation | More expensive (K folds) | Less expensive |
| Variance in meta-features | Lower (averaged) | Higher (single holdout) |
| Information leakage risk | Lower (proper CV) | Slightly higher |
| Implementation complexity | Higher | Lower |
When to Use Blending: blending fits best when the dataset is large enough that a single holdout set is representative, when you want fast iteration, and when implementation simplicity matters more than squeezing out the last bit of accuracy.
In Kaggle competitions, teams often use both: start with blending for quick iterations, then switch to full stacking for final submissions. The averaging effect of k-fold stacking provides small but consistent improvements.
Given the many ensemble options, how do you choose? Here's a practical decision framework:
| Scenario | Recommended Method | Why |
|---|---|---|
| High variance problem | Bagging / Random Forests | Variance reduction via averaging |
| High bias problem | Boosting (GBM, XGBoost) | Sequential bias correction |
| Want best accuracy | Stacking with diverse bases | Optimal combination learning |
| Limited compute time | Random Forests | Parallelizable, fewer tuning knobs |
| Many heterogeneous models | Stacking or Blending | Learn complex combinations |
| Quick baseline | Voting Ensemble | Simple, no meta-training |
| Tabular data | Gradient Boosting variants | Consistently top performers |
| Need interpretability | Random Forests + importance | Feature importance available |
| Need robustness | Bagging | Stable across hyperparameters |
| Large dataset | LightGBM or Blending | Efficient implementations |
The Practical Default:
For most tabular machine learning problems:
1. Start with a Random Forest as a strong, low-tuning baseline.
2. Move to a tuned gradient boosting model (XGBoost, LightGBM, or CatBoost).
3. If you still need more accuracy, stack several diverse models with a simple meta-learner.
This sequence—RF → GBM → Stacking—captures ~90% of competitive tabular ML.
The Accuracy Hierarchy (typical):
$$\text{Stacked Ensemble} > \text{Tuned XGBoost} > \text{Random Forest} > \text{Single Model}$$
But remember: increased accuracy comes with increased complexity, training time, and overfitting risk.
A well-tuned Random Forest or XGBoost gets 80% of the possible improvement from ensembling. The remaining 20% from stacking often requires 80% of the effort. For most practical applications, stop at a single well-tuned ensemble method.
We've surveyed the landscape of ensemble methods. The key insights: bagging reduces variance through parallel training on bootstrap samples; boosting reduces bias through sequential error correction; stacking learns how to combine diverse models; and in every case, diversity among the base learners is what makes the ensemble better than its parts.
Module Complete:
With this page, we've completed our survey of Ensemble Learning Fundamentals. You now understand why ensembles work (variance reduction, error decomposition, diversity) and the major strategy families: bagging, boosting, stacking, voting, and blending.
In the next module, we'll dive deep into Bootstrap Aggregating (Bagging)—the foundational technique that spawned Random Forests.
Congratulations! You've mastered the fundamentals of ensemble learning. You understand the theory, the intuition, and the practical strategies. You're now ready to study specific ensemble algorithms in depth—starting with Bagging and Random Forests in the next module.