Before Random Forests revolutionized machine learning competitions, there was a simpler yet remarkably powerful technique: Bagged Decision Trees. This ensemble method, introduced by Leo Breiman in 1996, demonstrated that combining multiple unstable learners could dramatically reduce prediction variance while maintaining low bias—a breakthrough that laid the groundwork for modern ensemble learning.
Bagged Decision Trees represent the purest application of bootstrap aggregating to tree-based models. Understanding this fundamental technique is essential for grasping why Random Forests add feature randomization, when standard bagging suffices, and how to extend bagging principles to other model families.
By the end of this page, you will understand: (1) Why decision trees are ideally suited for bagging, (2) The precise mechanics of bagged tree construction, (3) Mathematical analysis of variance reduction in tree ensembles, (4) The relationship between bagged trees and Random Forests, and (5) Implementation considerations and computational trade-offs.
The effectiveness of bagging depends critically on a specific property of the base learner: high variance. Decision trees, particularly when grown deep without pruning, exhibit precisely this characteristic, making them natural candidates for variance reduction through ensemble averaging.
The Instability Property:
Decision trees are notoriously unstable learners. Small changes in the training data can produce dramatically different tree structures. Consider a split at the root node—if the optimal split on feature A has only a slightly better information gain than feature B, a small perturbation in the training data might flip the selection. This single change cascades through the entire tree, producing a completely different model.
Instability is not a flaw for bagging—it's a feature. High variance means that bootstrapped samples will produce diverse trees with different errors. When averaged, these errors tend to cancel out, leaving only the underlying signal. Low-variance models (like linear regression) don't benefit as much from bagging because they produce similar predictions on different bootstrap samples.
Mathematical Characterization of Tree Variance:
For a single decision tree T trained on dataset D, the prediction variance at a point x can be expressed as:
$$\text{Var}_D[T(x)] = \mathbb{E}_D[(T(x) - \mathbb{E}_D[T(x)])^2]$$
This variance arises from several sources, summarized in the table below: split selection, tree structure, leaf assignment, and leaf predictions.
Deep, unpruned trees exhibit high variance across all of these components, making them ideal candidates for the variance reduction that bagging provides.
| Variance Source | Cause | Effect on Predictions | Bagging Impact |
|---|---|---|---|
| Split Selection | Tie-breaking on information gain | Different features dominate different trees | High—diverse splits produce diverse trees |
| Tree Structure | Greedy recursive partitioning | Varying depths and branching patterns | High—structural diversity aids averaging |
| Leaf Assignment | Data point routing changes | Border points classified differently | Moderate—smooths decision boundaries |
| Leaf Predictions | Different sample statistics | Mean/mode varies per bootstrap | Low—individual variation cancels out |
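To make this concrete, the following sketch empirically estimates $\text{Var}_D[T(x)]$ by retraining an unpruned tree on many bootstrap resamples of the same data and measuring how much its prediction at a fixed query point fluctuates. The dataset, sample sizes, and number of resamples are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Toy dataset (assumed for illustration) and a fixed query point
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
x_query = X[:1]  # estimate prediction variance at a single point

rng = np.random.default_rng(0)
preds = []
for _ in range(200):
    # Simulate "different training sets" via bootstrap resampling
    idx = rng.choice(len(X), size=len(X), replace=True)
    tree = DecisionTreeRegressor()  # deep, unpruned tree
    tree.fit(X[idx], y[idx])
    preds.append(tree.predict(x_query)[0])

preds = np.array(preds)
print(f"Mean prediction at x:   {preds.mean():.2f}")
print(f"Empirical Var_D[T(x)]:  {preds.var():.2f}")  # large for unpruned trees
```

Rerunning this with a heavily pruned tree (e.g. `max_depth=2`) typically shows a much smaller empirical variance, which is exactly why shallow trees gain less from bagging.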
The bagged decision trees algorithm is elegantly simple, yet its simplicity belies its power. Let's examine the complete algorithm with all implementation details that matter in practice.
Algorithm: Bagged Decision Trees
Input: Training set $D = \{(x_1, y_1), ..., (x_n, y_n)\}$, number of trees $B$, tree hyperparameters $\theta$
Training Phase: For each $b = 1, \dots, B$, draw a bootstrap sample $D_b$ of size $n$ from $D$ by sampling with replacement, then train a decision tree $T_b$ on $D_b$ using hyperparameters $\theta$ (typically deep or fully grown, without pruning).
Prediction Phase (Regression): Average the individual tree outputs, $\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^{B} T_b(x)$.
Prediction Phase (Classification): Take a majority vote over the $B$ tree predictions (or average the per-class probability estimates and pick the most probable class).
```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from collections import Counter


class BaggedDecisionTrees:
    """
    Bagged Decision Trees implementation for classification and regression.

    This implementation demonstrates the core bagging algorithm with decision
    trees as base learners, including OOB estimation.
    """

    def __init__(self, n_estimators=100, max_depth=None, min_samples_split=2,
                 min_samples_leaf=1, task='classification', random_state=None):
        """
        Parameters
        ----------
        n_estimators : int
            Number of trees in the ensemble (B)
        max_depth : int or None
            Maximum depth of individual trees. None = fully grown
        min_samples_split : int
            Minimum samples required to split a node
        min_samples_leaf : int
            Minimum samples required at a leaf node
        task : str
            'classification' or 'regression'
        random_state : int or None
            Random seed for reproducibility
        """
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.min_samples_leaf = min_samples_leaf
        self.task = task
        self.random_state = random_state
        self.trees = []
        self.oob_indices = []  # Track OOB samples for each tree

    def fit(self, X, y):
        """
        Train the bagged ensemble on the dataset.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training features
        y : array-like of shape (n_samples,)
            Training targets
        """
        np.random.seed(self.random_state)
        n_samples = X.shape[0]
        self.trees = []
        self.oob_indices = []

        for b in range(self.n_estimators):
            # Step 1: Create bootstrap sample (sample n indices with replacement)
            bootstrap_indices = np.random.choice(
                n_samples, size=n_samples, replace=True
            )

            # Track which samples were NOT selected (OOB samples)
            oob_mask = np.ones(n_samples, dtype=bool)
            oob_mask[np.unique(bootstrap_indices)] = False
            self.oob_indices.append(np.where(oob_mask)[0])

            # Step 2: Train tree on bootstrap sample
            X_bootstrap = X[bootstrap_indices]
            y_bootstrap = y[bootstrap_indices]

            tree_seed = self.random_state + b if self.random_state is not None else None
            if self.task == 'classification':
                tree = DecisionTreeClassifier(
                    max_depth=self.max_depth,
                    min_samples_split=self.min_samples_split,
                    min_samples_leaf=self.min_samples_leaf,
                    random_state=tree_seed
                )
            else:
                tree = DecisionTreeRegressor(
                    max_depth=self.max_depth,
                    min_samples_split=self.min_samples_split,
                    min_samples_leaf=self.min_samples_leaf,
                    random_state=tree_seed
                )

            tree.fit(X_bootstrap, y_bootstrap)
            self.trees.append(tree)

        # Store training data for OOB score computation
        self._X_train = X
        self._y_train = y
        return self

    def predict(self, X):
        """
        Generate predictions by aggregating tree predictions.

        For classification: majority vote. For regression: arithmetic mean.
        """
        if self.task == 'classification':
            return self._predict_classification(X)
        else:
            return self._predict_regression(X)

    def _predict_classification(self, X):
        """Majority vote prediction for classification."""
        # Collect predictions from all trees: shape (n_trees, n_samples)
        all_predictions = np.array([tree.predict(X) for tree in self.trees])

        # Majority vote: mode of the per-tree predictions for each sample
        predictions = []
        for i in range(X.shape[0]):
            votes = all_predictions[:, i]
            winner = Counter(votes).most_common(1)[0][0]
            predictions.append(winner)
        return np.array(predictions)

    def _predict_regression(self, X):
        """Mean prediction for regression."""
        all_predictions = np.array([tree.predict(X) for tree in self.trees])
        return np.mean(all_predictions, axis=0)

    def predict_proba(self, X):
        """
        Probability estimates for classification.

        Returns the proportion of trees voting for each class.
        """
        if self.task != 'classification':
            raise ValueError("predict_proba only available for classification")

        # Get class labels from the first tree
        classes = self.trees[0].classes_
        n_samples = X.shape[0]
        n_classes = len(classes)

        # Sum up per-tree probability estimates
        proba = np.zeros((n_samples, n_classes))
        for tree in self.trees:
            proba += tree.predict_proba(X)

        # Normalize to get proportions
        proba /= self.n_estimators
        return proba

    def oob_score(self):
        """
        Compute Out-of-Bag score using samples not included in each bootstrap.

        This provides an unbiased estimate of generalization error.
        """
        n_samples = self._X_train.shape[0]

        if self.task == 'classification':
            # For classification: track votes per sample
            oob_votes = {}  # sample_idx -> list of predictions
            for b, tree in enumerate(self.trees):
                oob_idx = self.oob_indices[b]
                if len(oob_idx) > 0:
                    predictions = tree.predict(self._X_train[oob_idx])
                    for i, pred in zip(oob_idx, predictions):
                        oob_votes.setdefault(i, []).append(pred)

            # Compute accuracy over samples that received OOB predictions
            correct = 0
            total = 0
            for i, votes in oob_votes.items():
                predicted = Counter(votes).most_common(1)[0][0]
                if predicted == self._y_train[i]:
                    correct += 1
                total += 1
            return correct / total if total > 0 else 0.0
        else:
            # For regression: average predictions per sample
            oob_predictions = {}
            for b, tree in enumerate(self.trees):
                oob_idx = self.oob_indices[b]
                if len(oob_idx) > 0:
                    predictions = tree.predict(self._X_train[oob_idx])
                    for i, pred in zip(oob_idx, predictions):
                        oob_predictions.setdefault(i, []).append(pred)

            # Compute R² score on the OOB predictions
            ss_res = 0.0
            ss_tot = 0.0
            y_mean = np.mean(self._y_train)
            for i, preds in oob_predictions.items():
                y_pred = np.mean(preds)
                y_true = self._y_train[i]
                ss_res += (y_true - y_pred) ** 2
                ss_tot += (y_true - y_mean) ** 2
            return 1 - (ss_res / ss_tot) if ss_tot > 0 else 0.0
```

In production implementations, always track the out-of-bag (OOB) indices for each tree. This enables OOB error estimation—a virtually free estimate of generalization error that doesn't require a separate validation set. For a dataset of n samples, approximately 36.8% of samples remain out-of-bag for any given tree.
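A minimal usage sketch, assuming the `BaggedDecisionTrees` class above is in scope; the dataset and split are arbitrary illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Illustrative dataset (assumed choice)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fully grown trees (max_depth=None) keep bias low; bagging handles the variance
bagger = BaggedDecisionTrees(n_estimators=200, task='classification', random_state=42)
bagger.fit(X_train, y_train)

print("OOB accuracy: ", bagger.oob_score())  # "free" internal estimate
print("Test accuracy:", accuracy_score(y_test, bagger.predict(X_test)))
```

The OOB accuracy and the held-out test accuracy should typically land close together, which is the practical point of the OOB machinery.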
The power of bagging lies in its ability to reduce prediction variance. Let's develop the mathematical theory that explains this reduction and its limits.
The Variance of Averaged Predictions:
For $B$ bagged trees with predictions $T_1(x), T_2(x), ..., T_B(x)$, the bagged prediction is:
$$\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^{B} T_b(x)$$
The variance of this averaged prediction is:
$$\text{Var}[\hat{f}_{bag}(x)] = \text{Var}\left[\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right]$$
The Role of Correlation:
Let $\sigma^2 = \text{Var}[T_b(x)]$ be the variance of individual tree predictions (assumed equal), and let $\rho$ be the pairwise correlation between tree predictions. Then:
$$\text{Var}[\hat{f}_{bag}(x)] = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$$
Interpretation:
First term ($\rho\sigma^2$): This is the irreducible variance due to correlation between trees. Even with infinitely many trees ($B \to \infty$), this term remains.
Second term ($\frac{1-\rho}{B}\sigma^2$): This is the reducible variance that decreases as we add more trees. With enough trees, this term vanishes.
The Correlation Barrier:
As $B \to \infty$:
$$\lim_{B \to \infty} \text{Var}[\hat{f}_{bag}(x)] = \rho\sigma^2$$
This is the fundamental limit of bagging. Even with infinite trees, if the correlation $\rho$ is high, significant variance remains. This insight motivates Random Forests, which introduce feature randomization to reduce $\rho$.
| Correlation (ρ) | Variance with B=1 | Variance with B=10 | Variance with B=100 | Variance as B→∞ |
|---|---|---|---|---|
| 0.0 (independent) | σ² | 0.10σ² | 0.01σ² | 0 |
| 0.2 | σ² | 0.28σ² | 0.208σ² | 0.20σ² |
| 0.5 | σ² | 0.55σ² | 0.505σ² | 0.50σ² |
| 0.8 | σ² | 0.82σ² | 0.802σ² | 0.80σ² |
| 1.0 (identical) | σ² | σ² | σ² | σ² |
With bagged decision trees using all features, typical correlations range from 0.4-0.7 depending on the dataset. This means even with B=1000 trees, you can only reduce variance by about 30-60% of the theoretical maximum. Random Forests address this by randomizing feature selection, reducing ρ to 0.1-0.3 in many cases.
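As a quick check, the sketch below (purely illustrative) evaluates $\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$ for the correlations in the table, and shows one way to estimate $\rho$ empirically from per-tree predictions; the estimation helper assumes an ensemble object exposing a list of trained trees, like the `BaggedDecisionTrees` class sketched earlier.

```python
import numpy as np

def bagged_variance(rho, sigma2=1.0, B=100):
    """Variance of the average of B equally correlated tree predictions."""
    return rho * sigma2 + (1 - rho) / B * sigma2

for rho in [0.0, 0.2, 0.5, 0.8, 1.0]:
    print(rho, [round(bagged_variance(rho, B=B), 3) for B in (1, 10, 100)])
# Reproduces the table above, e.g. rho=0.5, B=10 -> 0.55 * sigma^2

def estimate_tree_correlation(trees, X_probe):
    """Average pairwise correlation of per-tree predictions on a probe set."""
    preds = np.array([t.predict(X_probe) for t in trees])  # shape (B, n_samples)
    corr = np.corrcoef(preds)                               # (B, B) matrix
    B = len(trees)
    return corr[~np.eye(B, dtype=bool)].mean()              # mean off-diagonal

# Example (assumes `bagger` from the earlier usage sketch):
# rho_hat = estimate_tree_correlation(bagger.trees, X_test)
```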
Empirical Variance Decomposition:
To measure the actual variance reduction achieved by bagging, we can decompose the prediction error:
$$\mathbb{E}[(y - \hat{f}_{bag}(x))^2] = \underbrace{(f(x) - \mathbb{E}[\hat{f}_{bag}(x)])^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}[(\hat{f}_{bag}(x) - \mathbb{E}[\hat{f}_{bag}(x)])^2]}_{\text{Variance}} + \underbrace{\sigma^2_\epsilon}_{\text{Irreducible Error}}$$
For deep, unpruned trees: the bias term is already low and is essentially unchanged by averaging (the expected prediction of the ensemble matches that of a single tree); the variance term shrinks according to the $\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$ formula; and the irreducible error is untouched.
This confirms that bagging is fundamentally a variance-reduction technique that preserves the low bias of complex models.
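One way to see this empirically is to repeatedly redraw training sets from a known generating function and compare the bias and variance of a single deep tree against a bagged ensemble. This is a rough sketch: the synthetic target, noise level, and sample counts are all assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor

rng = np.random.default_rng(0)

def f(x):
    """Known target function for the synthetic problem."""
    return np.sin(3 * x[:, 0])

x_grid = np.linspace(-1, 1, 50).reshape(-1, 1)

def predictions_over_datasets(make_model, n_repeats=50, n=200, noise=0.3):
    """Predictions on x_grid from models trained on independently drawn datasets."""
    preds = []
    for _ in range(n_repeats):
        X = rng.uniform(-1, 1, size=(n, 1))
        y = f(X) + rng.normal(0, noise, size=n)
        preds.append(make_model().fit(X, y).predict(x_grid))
    return np.array(preds)  # shape (n_repeats, 50)

for name, make_model in [
    ("single deep tree", lambda: DecisionTreeRegressor()),
    ("bagged trees (B=100)", lambda: BaggingRegressor(DecisionTreeRegressor(), n_estimators=100)),
]:
    P = predictions_over_datasets(make_model)
    bias2 = np.mean((P.mean(axis=0) - f(x_grid)) ** 2)
    variance = np.mean(P.var(axis=0))
    print(f"{name}: bias^2 ~ {bias2:.4f}, variance ~ {variance:.4f}")
```

Typically both models show similarly small bias, while the bagged ensemble's variance is several times smaller, matching the decomposition above.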
Random Forests are often viewed as the "upgraded" version of bagged decision trees. Understanding the precise relationship between these methods clarifies when each is appropriate.
The Key Difference: Feature Randomization
Bagged Decision Trees: every tree considers all $p$ features when searching for the best split at every node; the only source of diversity between trees is the bootstrap sample.
Random Forests: every tree considers only a random subset of features at each split (commonly $\sqrt{p}$ for classification and $p/3$ for regression), which decorrelates the trees but adds one more hyperparameter, max_features, to tune.
When to Prefer Bagged Trees Over Random Forests:
Low-dimensional datasets ($p < 10$): With few features, random feature selection may exclude the optimal split feature too often, degrading performance.
Strongly predictive features: If one or two features dominate prediction, bagged trees can leverage them at every split, while Random Forests may randomly exclude them.
Highly engineered features: When you've carefully engineered features and know they're all relevant, forcing the use of all features may be beneficial.
Maximum individual tree performance needed: If the individual trees themselves must be interpretable or accurate on their own, bagged trees without feature restrictions produce stronger standalone trees.
When Random Forests Excel:
High-dimensional datasets ($p > 50$): Feature randomization prevents a few noisy but seemingly predictive features from dominating.
Correlated features: Random Forests better handle situations where multiple features carry similar information.
Unknown feature relevance: When you don't know which features matter, Random Forests' implicit feature selection helps.
Maximum ensemble performance: For pure predictive accuracy, Random Forests usually outperform bagged trees.
In practice, try both methods. Start with Random Forests (the usual default), but if you're working with a low-dimensional, carefully engineered feature set, compare against bagged trees. The difference can be significant in either direction depending on the specific problem structure.
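In scikit-learn the two setups differ by essentially one knob; a minimal comparison sketch (the dataset, fold count, and ensemble sizes are illustrative assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Bagged trees: every split sees all features
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=200,
                           random_state=0, n_jobs=-1)

# Random Forest: each split sees a random subset of features (sqrt(p) here)
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                random_state=0, n_jobs=-1)

for name, model in [("Bagged trees", bagged), ("Random Forest", forest)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```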
Understanding the computational complexity of bagged decision trees is essential for deploying them at scale.
Training Complexity:
For a single decision tree trained on $n$ samples with $p$ features: training cost is roughly $O(p \cdot n \log n)$ for typical implementations that evaluate sorted feature values and grow the tree to depth $O(\log n)$; heavily unbalanced trees can be more expensive.
For a bagged ensemble of $B$ trees: total work is simply $B$ times the single-tree cost, roughly $O(B \cdot p \cdot n \log n)$, since each tree is trained independently on its own bootstrap sample.
Key Property: Embarrassingly Parallel
Bagging is a naturally parallelizable algorithm. Each tree is trained independently on its bootstrap sample. This means: training time scales down almost linearly with the number of cores; trees can be trained on separate machines and simply concatenated afterwards; and, unlike boosting, no tree has to wait for another to finish.
```python
from joblib import Parallel, delayed
from sklearn.tree import DecisionTreeClassifier
import numpy as np


def train_single_tree(X, y, random_state, max_depth):
    """Train a single tree on a bootstrap sample."""
    np.random.seed(random_state)
    n = X.shape[0]

    # Create bootstrap sample
    indices = np.random.choice(n, size=n, replace=True)
    X_boot = X[indices]
    y_boot = y[indices]

    # Train tree
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=random_state)
    tree.fit(X_boot, y_boot)

    # Return tree and OOB indices
    oob_mask = np.ones(n, dtype=bool)
    oob_mask[np.unique(indices)] = False
    return tree, np.where(oob_mask)[0]


def parallel_bagged_trees(X, y, n_estimators=100, max_depth=None, n_jobs=-1):
    """
    Train bagged decision trees in parallel.

    Parameters
    ----------
    X, y : Training data
    n_estimators : Number of trees
    max_depth : Maximum tree depth
    n_jobs : Number of parallel jobs (-1 = all cores)

    Returns
    -------
    List of (tree, oob_indices) tuples
    """
    results = Parallel(n_jobs=n_jobs, verbose=1)(
        delayed(train_single_tree)(X, y, seed, max_depth)
        for seed in range(n_estimators)
    )
    return results


# Example usage:
# trees_and_oob = parallel_bagged_trees(X_train, y_train, n_estimators=500)
# trees = [t[0] for t in trees_and_oob]
# oob_indices = [t[1] for t in trees_and_oob]
```

| n_samples | n_features | n_estimators | 1 Core | 4 Cores | 16 Cores |
|---|---|---|---|---|---|
| 10,000 | 50 | 100 | ~12 sec | ~3.2 sec | ~1 sec |
| 100,000 | 50 | 100 | ~180 sec | ~48 sec | ~13 sec |
| 100,000 | 100 | 500 | ~1,800 sec | ~470 sec | ~120 sec |
| 1,000,000 | 100 | 100 | ~3,600 sec | ~930 sec | ~240 sec |
Prediction Complexity:
For prediction, each tree requires $O(\log n)$ operations (the depth of a balanced tree), or $O(d)$ where $d$ is the actual tree depth. For $B$ trees the cost per query is therefore $O(B \cdot d)$—still fast in absolute terms, but it grows linearly with ensemble size and is itself trivially parallelizable across trees.
Memory Considerations:
Each tree stores: the split structure (feature index and threshold for every internal node), the child pointers that define the tree topology, and the prediction value (class counts or mean) at every leaf.
A fully grown tree on $n$ samples has up to $n$ leaf nodes, requiring $O(n)$ storage. For $B$ trees the ensemble therefore needs $O(B \cdot n)$ memory in the worst case, which becomes substantial for large datasets and large ensembles.
For very large ensembles or limited memory environments, consider: (1) Training trees with max_depth limits to reduce memory per tree, (2) Using float32 instead of float64 for thresholds, (3) Discarding trees after OOB scoring if only using for OOB-based selection, (4) Using lazy loading for trees during prediction.
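A quick way to sanity-check the memory footprint of a fitted scikit-learn ensemble is sketched below; the per-node cost is only an order-of-magnitude intuition, and the serialized size is the more reliable number. The dataset and ensemble size are illustrative assumptions.

```python
import pickle
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
model = BaggingClassifier(DecisionTreeClassifier(), n_estimators=200,
                          random_state=0).fit(X, y)

# Total node count across the ensemble (each node stores a feature index,
# threshold, child pointers, and a value array)
total_nodes = sum(est.tree_.node_count for est in model.estimators_)
print(f"Total nodes across {len(model.estimators_)} trees: {total_nodes}")

# Serialized size is a practical proxy for the in-memory footprint
print(f"Pickled ensemble size: {len(pickle.dumps(model)) / 1e6:.1f} MB")
```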
One of the most valuable properties of bagged ensembles is the Out-of-Bag (OOB) error estimate. This provides a nearly free estimate of generalization error without requiring a separate validation set.
The OOB Principle:
When we create a bootstrap sample of size $n$ by sampling with replacement, some training examples will be selected multiple times while others won't be selected at all. The probability that a specific example is not selected in any of the $n$ draws is:
$$P(\text{not selected}) = \left(1 - \frac{1}{n}\right)^n \xrightarrow{n \to \infty} e^{-1} \approx 0.368$$
This means approximately 36.8% of training samples are out-of-bag for any given tree. These OOB samples can be used to evaluate that tree's predictions on data it hasn't seen.
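The sketch below verifies both statements numerically: the closed-form probability $(1 - 1/n)^n$ for a few values of $n$, and the empirical OOB fraction of an actual bootstrap sample (the sample size is an arbitrary assumption).

```python
import numpy as np

# Closed-form probability of a sample never being drawn in n tries
for n in (10, 100, 1000, 100000):
    print(n, round((1 - 1 / n) ** n, 4))  # approaches e^{-1} ~ 0.3679

# Empirical check: fraction of indices missing from one bootstrap sample
rng = np.random.default_rng(0)
n = 10000
bootstrap = rng.choice(n, size=n, replace=True)
oob_fraction = 1 - len(np.unique(bootstrap)) / n
print(f"Empirical OOB fraction: {oob_fraction:.4f}")  # ~ 0.368
```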
Computing OOB Error:
For each training sample $i$: identify the trees for which $i$ was out-of-bag, collect those trees' predictions at $x_i$, and aggregate them (majority vote for classification, mean for regression) into the OOB prediction $\hat{y}_i^{OOB}$.
The OOB error is then the average error over all training samples:
$$\text{OOB Error} = \frac{1}{n}\sum_{i=1}^{n} L(y_i, \hat{y}_i^{OOB})$$
where $\hat{y}_i^{OOB}$ is the aggregated prediction from trees where $i$ was OOB.
Why OOB Error Works:
For sample $i$, the OOB prediction comes only from trees that never saw sample $i$ during training. This is closely analogous to cross-validation, but it is computed essentially for free during normal bagging training, and empirically the resulting error estimates track those from cross-validation closely.
```python
import numpy as np
from collections import defaultdict
import matplotlib.pyplot as plt


def detailed_oob_analysis(bagged_model, X, y, classes=None):
    """
    Perform detailed OOB analysis including per-class accuracy
    and confidence calibration.

    Parameters
    ----------
    bagged_model : trained BaggedDecisionTrees instance
    X : training features
    y : training labels
    classes : optional list of class labels

    Returns
    -------
    Dictionary with OOB metrics and diagnostics
    """
    n_samples = X.shape[0]
    oob_predictions = defaultdict(list)
    oob_probas = defaultdict(list)

    # Collect OOB predictions for each sample
    for b, tree in enumerate(bagged_model.trees):
        oob_idx = bagged_model.oob_indices[b]
        if len(oob_idx) > 0:
            preds = tree.predict(X[oob_idx])
            probas = tree.predict_proba(X[oob_idx])
            for i, (pred, proba) in zip(oob_idx, zip(preds, probas)):
                oob_predictions[i].append(pred)
                oob_probas[i].append(proba)

    # Aggregate predictions
    final_predictions = {}
    final_probas = {}
    n_trees_per_sample = {}

    for i in range(n_samples):
        if i in oob_predictions:
            preds = oob_predictions[i]
            probas = np.array(oob_probas[i])

            # Majority vote
            unique, counts = np.unique(preds, return_counts=True)
            final_predictions[i] = unique[np.argmax(counts)]

            # Average probability
            final_probas[i] = np.mean(probas, axis=0)

            # Track coverage
            n_trees_per_sample[i] = len(preds)

    # Compute overall OOB accuracy
    correct = sum(1 for i, pred in final_predictions.items() if pred == y[i])
    total = len(final_predictions)
    oob_accuracy = correct / total if total > 0 else 0

    # Per-class accuracy
    if classes is None:
        classes = np.unique(y)

    per_class_accuracy = {}
    for c in classes:
        class_mask = [i for i in final_predictions.keys() if y[i] == c]
        if len(class_mask) > 0:
            class_correct = sum(1 for i in class_mask if final_predictions[i] == c)
            per_class_accuracy[c] = class_correct / len(class_mask)
        else:
            per_class_accuracy[c] = None

    # Coverage statistics
    coverage_stats = {
        'mean_trees': np.mean(list(n_trees_per_sample.values())),
        'min_trees': np.min(list(n_trees_per_sample.values())),
        'max_trees': np.max(list(n_trees_per_sample.values())),
        'samples_with_oob': len(final_predictions),
        'samples_without_oob': n_samples - len(final_predictions),
    }

    return {
        'oob_accuracy': oob_accuracy,
        'per_class_accuracy': per_class_accuracy,
        'coverage_stats': coverage_stats,
        'final_predictions': final_predictions,
        'final_probas': final_probas,
    }


def plot_oob_convergence(X, y, max_trees=500, step=10):
    """
    Plot how OOB error converges as trees are added.

    This visualization helps determine optimal n_estimators.
    """
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    oob_errors = []
    n_trees_list = list(range(step, max_trees + 1, step))

    for n_trees in n_trees_list:
        model = BaggingClassifier(
            estimator=DecisionTreeClassifier(),
            n_estimators=n_trees,
            oob_score=True,
            random_state=42,
            n_jobs=-1
        )
        model.fit(X, y)
        oob_errors.append(1 - model.oob_score_)

    plt.figure(figsize=(10, 6))
    plt.plot(n_trees_list, oob_errors, 'b-', linewidth=2)
    plt.xlabel('Number of Trees')
    plt.ylabel('OOB Error')
    plt.title('OOB Error Convergence')
    plt.grid(True, alpha=0.3)

    # Mark convergence point (where improvement < 0.1%)
    for i in range(1, len(oob_errors)):
        if abs(oob_errors[i] - oob_errors[i - 1]) < 0.001:
            plt.axvline(x=n_trees_list[i], color='r', linestyle='--',
                        label=f'Convergence at {n_trees_list[i]} trees')
            break

    plt.legend()
    plt.tight_layout()
    return plt.gcf()
```

OOB estimation provides error estimates comparable to 5-10 fold cross-validation but without the computational cost of retraining.
For large datasets, this can save hours of computation time. The only caveat: with very few trees (B < 50), some samples may have too few OOB predictions for reliable estimates.
Bagged Decision Trees represent the purest application of ensemble averaging to high-variance learners. The key insights to retain: deep, unpruned trees are ideal base learners because their instability produces the diverse errors that averaging cancels; ensemble variance is governed by $\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$, so the correlation between trees, not the number of trees, is the binding constraint; out-of-bag samples give a nearly free estimate of generalization error; and Random Forests are simply bagged trees plus feature randomization to push $\rho$ lower.
What's Next:
With a solid understanding of bagged decision trees, we'll explore how bagging extends beyond trees. The next page examines Bagged Neural Networks, an approach that combines the variance reduction of bagging with the representational power of deep learning.
You now understand why decision trees are the canonical base learner for bagging, how to implement and analyze bagged tree ensembles, and when to prefer them over Random Forests. This foundation prepares you for understanding bagging with other model families.