Before Random Forests revolutionized machine learning competitions, there was a simpler yet remarkably powerful technique: Bagged Decision Trees. This ensemble method, introduced by Leo Breiman in 1996, demonstrated that combining multiple unstable learners could dramatically reduce prediction variance while maintaining low bias—a breakthrough that laid the groundwork for modern ensemble learning.
Bagged Decision Trees represent the purest application of bootstrap aggregating to tree-based models. Understanding this fundamental technique is essential for grasping why Random Forests add feature randomization, when standard bagging suffices, and how to extend bagging principles to other model families.
By the end of this page, you will understand: (1) Why decision trees are ideally suited for bagging, (2) The precise mechanics of bagged tree construction, (3) Mathematical analysis of variance reduction in tree ensembles, (4) The relationship between bagged trees and Random Forests, and (5) Implementation considerations and computational trade-offs.
The effectiveness of bagging depends critically on a specific property of the base learner: high variance. Decision trees, particularly when grown deep without pruning, exhibit precisely this characteristic, making them natural candidates for variance reduction through ensemble averaging.
The Instability Property:
Decision trees are notoriously unstable learners. Small changes in the training data can produce dramatically different tree structures. Consider a split at the root node—if the optimal split on feature A has only a slightly better information gain than feature B, a small perturbation in the training data might flip the selection. This single change cascades through the entire tree, producing a completely different model.
Instability is not a flaw for bagging—it's a feature. High variance means that bootstrapped samples will produce diverse trees with different errors. When averaged, these errors tend to cancel out, leaving only the underlying signal. Low-variance models (like linear regression) don't benefit as much from bagging because they produce similar predictions on different bootstrap samples.
Mathematical Characterization of Tree Variance:
For a single decision tree T trained on dataset D, the prediction variance at a point x can be expressed as:
$$\text{Var}_D[T(x)] = \mathbb{E}_D[(T(x) - \mathbb{E}_D[T(x)])^2]$$
This variance arises from several sources, summarized in the table below: split selection, tree structure, leaf assignment, and leaf predictions.
Deep, unpruned trees exhibit high variance across all of these components, making them ideal candidates for the variance reduction that bagging provides.
| Variance Source | Cause | Effect on Predictions | Bagging Impact |
|---|---|---|---|
| Split Selection | Tie-breaking on information gain | Different features dominate different trees | High—diverse splits produce diverse trees |
| Tree Structure | Greedy recursive partitioning | Varying depths and branching patterns | High—structural diversity aids averaging |
| Leaf Assignment | Data point routing changes | Border points classified differently | Moderate—smooths decision boundaries |
| Leaf Predictions | Different sample statistics | Mean/mode varies per bootstrap | Low—individual variation cancels out |
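To make this concrete, the following sketch empirically estimates $\text{Var}_D[T(x)]$ by retraining an unpruned tree on many bootstrap resamples of the same data and measuring how much its prediction at a fixed query point fluctuates. The dataset, sample sizes, and number of resamples are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Toy dataset (assumed for illustration) and a fixed query point
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
x_query = X[:1]  # estimate prediction variance at a single point

rng = np.random.default_rng(0)
preds = []
for _ in range(200):
    # Simulate "different training sets" via bootstrap resampling
    idx = rng.choice(len(X), size=len(X), replace=True)
    tree = DecisionTreeRegressor()  # deep, unpruned tree
    tree.fit(X[idx], y[idx])
    preds.append(tree.predict(x_query)[0])

preds = np.array(preds)
print(f"Mean prediction at x:   {preds.mean():.2f}")
print(f"Empirical Var_D[T(x)]:  {preds.var():.2f}")  # large for unpruned trees
```

Rerunning this with a heavily pruned tree (e.g. `max_depth=2`) typically shows a much smaller empirical variance, which is exactly why shallow trees gain less from bagging.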
The bagged decision trees algorithm is elegantly simple, yet its simplicity belies its power. Let's examine the complete algorithm with all implementation details that matter in practice.
Algorithm: Bagged Decision Trees
Input: Training set $D = \{(x_1, y_1), ..., (x_n, y_n)\}$, number of trees $B$, tree hyperparameters $\theta$
Training Phase: For each $b = 1, \dots, B$, draw a bootstrap sample $D_b$ of size $n$ from $D$ by sampling with replacement, then train a decision tree $T_b$ on $D_b$ using hyperparameters $\theta$ (typically deep or fully grown, without pruning).
Prediction Phase (Regression): Average the individual tree outputs, $\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^{B} T_b(x)$.
Prediction Phase (Classification): Take a majority vote over the $B$ tree predictions (or average the per-class probability estimates and pick the most probable class).
```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from collections import Counter


class BaggedDecisionTrees:
    """
    Bagged Decision Trees implementation for classification and regression.

    This implementation demonstrates the core bagging algorithm with decision
    trees as base learners, including OOB estimation.
    """

    def __init__(self, n_estimators=100, max_depth=None, min_samples_split=2,
                 min_samples_leaf=1, task='classification', random_state=None):
        """
        Parameters
        ----------
        n_estimators : int
            Number of trees in the ensemble (B)
        max_depth : int or None
            Maximum depth of individual trees. None = fully grown
        min_samples_split : int
            Minimum samples required to split a node
        min_samples_leaf : int
            Minimum samples required at a leaf node
        task : str
            'classification' or 'regression'
        random_state : int or None
            Random seed for reproducibility
        """
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.min_samples_leaf = min_samples_leaf
        self.task = task
        self.random_state = random_state
        self.trees = []
        self.oob_indices = []  # Track OOB samples for each tree

    def fit(self, X, y):
        """
        Train the bagged ensemble on the dataset.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training features
        y : array-like of shape (n_samples,)
            Training targets
        """
        np.random.seed(self.random_state)
        n_samples = X.shape[0]
        self.trees = []
        self.oob_indices = []

        for b in range(self.n_estimators):
            # Step 1: Create bootstrap sample (sample n indices with replacement)
            bootstrap_indices = np.random.choice(
                n_samples, size=n_samples, replace=True
            )

            # Track which samples were NOT selected (OOB samples)
            oob_mask = np.ones(n_samples, dtype=bool)
            oob_mask[np.unique(bootstrap_indices)] = False
            self.oob_indices.append(np.where(oob_mask)[0])

            # Step 2: Train tree on bootstrap sample
            X_bootstrap = X[bootstrap_indices]
            y_bootstrap = y[bootstrap_indices]

            tree_seed = self.random_state + b if self.random_state is not None else None
            if self.task == 'classification':
                tree = DecisionTreeClassifier(
                    max_depth=self.max_depth,
                    min_samples_split=self.min_samples_split,
                    min_samples_leaf=self.min_samples_leaf,
                    random_state=tree_seed
                )
            else:
                tree = DecisionTreeRegressor(
                    max_depth=self.max_depth,
                    min_samples_split=self.min_samples_split,
                    min_samples_leaf=self.min_samples_leaf,
                    random_state=tree_seed
                )

            tree.fit(X_bootstrap, y_bootstrap)
            self.trees.append(tree)

        # Store training data for OOB score computation
        self._X_train = X
        self._y_train = y
        return self

    def predict(self, X):
        """
        Generate predictions by aggregating tree predictions.

        For classification: majority vote. For regression: arithmetic mean.
        """
        if self.task == 'classification':
            return self._predict_classification(X)
        else:
            return self._predict_regression(X)

    def _predict_classification(self, X):
        """Majority vote prediction for classification."""
        # Collect predictions from all trees: shape (n_trees, n_samples)
        all_predictions = np.array([tree.predict(X) for tree in self.trees])

        # Majority vote: mode of the per-tree predictions for each sample
        predictions = []
        for i in range(X.shape[0]):
            votes = all_predictions[:, i]
            winner = Counter(votes).most_common(1)[0][0]
            predictions.append(winner)
        return np.array(predictions)

    def _predict_regression(self, X):
        """Mean prediction for regression."""
        all_predictions = np.array([tree.predict(X) for tree in self.trees])
        return np.mean(all_predictions, axis=0)

    def predict_proba(self, X):
        """
        Probability estimates for classification.

        Returns the proportion of trees voting for each class.
        """
        if self.task != 'classification':
            raise ValueError("predict_proba only available for classification")

        # Get class labels from the first tree
        classes = self.trees[0].classes_
        n_samples = X.shape[0]
        n_classes = len(classes)

        # Sum up per-tree probability estimates
        proba = np.zeros((n_samples, n_classes))
        for tree in self.trees:
            proba += tree.predict_proba(X)

        # Normalize to get proportions
        proba /= self.n_estimators
        return proba

    def oob_score(self):
        """
        Compute Out-of-Bag score using samples not included in each bootstrap.

        This provides an unbiased estimate of generalization error.
        """
        n_samples = self._X_train.shape[0]

        if self.task == 'classification':
            # For classification: track votes per sample
            oob_votes = {}  # sample_idx -> list of predictions
            for b, tree in enumerate(self.trees):
                oob_idx = self.oob_indices[b]
                if len(oob_idx) > 0:
                    predictions = tree.predict(self._X_train[oob_idx])
                    for i, pred in zip(oob_idx, predictions):
                        oob_votes.setdefault(i, []).append(pred)

            # Compute accuracy over samples that received OOB predictions
            correct = 0
            total = 0
            for i, votes in oob_votes.items():
                predicted = Counter(votes).most_common(1)[0][0]
                if predicted == self._y_train[i]:
                    correct += 1
                total += 1
            return correct / total if total > 0 else 0.0
        else:
            # For regression: average predictions per sample
            oob_predictions = {}
            for b, tree in enumerate(self.trees):
                oob_idx = self.oob_indices[b]
                if len(oob_idx) > 0:
                    predictions = tree.predict(self._X_train[oob_idx])
                    for i, pred in zip(oob_idx, predictions):
                        oob_predictions.setdefault(i, []).append(pred)

            # Compute R² score on the OOB predictions
            ss_res = 0.0
            ss_tot = 0.0
            y_mean = np.mean(self._y_train)
            for i, preds in oob_predictions.items():
                y_pred = np.mean(preds)
                y_true = self._y_train[i]
                ss_res += (y_true - y_pred) ** 2
                ss_tot += (y_true - y_mean) ** 2
            return 1 - (ss_res / ss_tot) if ss_tot > 0 else 0.0
```

In production implementations, always track the out-of-bag (OOB) indices for each tree. This enables OOB error estimation—a virtually free estimate of generalization error that doesn't require a separate validation set. For a dataset of n samples, approximately 36.8% of samples remain out-of-bag for any given tree.
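A minimal usage sketch, assuming the `BaggedDecisionTrees` class above is in scope; the dataset and split are arbitrary illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Illustrative dataset (assumed choice)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fully grown trees (max_depth=None) keep bias low; bagging handles the variance
bagger = BaggedDecisionTrees(n_estimators=200, task='classification', random_state=42)
bagger.fit(X_train, y_train)

print("OOB accuracy: ", bagger.oob_score())  # "free" internal estimate
print("Test accuracy:", accuracy_score(y_test, bagger.predict(X_test)))
```

The OOB accuracy and the held-out test accuracy should typically land close together, which is the practical point of the OOB machinery.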
The power of bagging lies in its ability to reduce prediction variance. Let's develop the mathematical theory that explains this reduction and its limits.
The Variance of Averaged Predictions:
For $B$ bagged trees with predictions $T_1(x), T_2(x), ..., T_B(x)$, the bagged prediction is:
$$\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^{B} T_b(x)$$
The variance of this averaged prediction is:
$$\text{Var}[\hat{f}_{bag}(x)] = \text{Var}\left[\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right]$$
The Role of Correlation:
Let $\sigma^2 = \text{Var}[T_b(x)]$ be the variance of individual tree predictions (assumed equal), and let $\rho$ be the pairwise correlation between tree predictions. Then:
$$\text{Var}[\hat{f}_{bag}(x)] = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$$
Interpretation:
First term ($\rho\sigma^2$): This is the irreducible variance due to correlation between trees. Even with infinitely many trees ($B \to \infty$), this term remains.
Second term ($\frac{1-\rho}{B}\sigma^2$): This is the reducible variance that decreases as we add more trees. With enough trees, this term vanishes.
The Correlation Barrier:
As $B \to \infty$:
$$\lim_{B \to \infty} \text{Var}[\hat{f}_{bag}(x)] = \rho\sigma^2$$
This is the fundamental limit of bagging. Even with infinite trees, if the correlation $\rho$ is high, significant variance remains. This insight motivates Random Forests, which introduce feature randomization to reduce $\rho$.
| Correlation (ρ) | Variance with B=1 | Variance with B=10 | Variance with B=100 | Variance as B→∞ |
|---|---|---|---|---|
| 0.0 (independent) | σ² | 0.10σ² | 0.01σ² | 0 |
| 0.2 | σ² | 0.28σ² | 0.208σ² | 0.20σ² |
| 0.5 | σ² | 0.55σ² | 0.505σ² | 0.50σ² |
| 0.8 | σ² | 0.82σ² | 0.802σ² | 0.80σ² |
| 1.0 (identical) | σ² | σ² | σ² | σ² |
With bagged decision trees using all features, typical correlations range from 0.4-0.7 depending on the dataset. This means even with B=1000 trees, you can only reduce variance by about 30-60% of the theoretical maximum. Random Forests address this by randomizing feature selection, reducing ρ to 0.1-0.3 in many cases.
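As a quick check, the sketch below (purely illustrative) evaluates $\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$ for the correlations in the table, and shows one way to estimate $\rho$ empirically from per-tree predictions; the estimation helper assumes an ensemble object exposing a list of trained trees, like the `BaggedDecisionTrees` class sketched earlier.

```python
import numpy as np

def bagged_variance(rho, sigma2=1.0, B=100):
    """Variance of the average of B equally correlated tree predictions."""
    return rho * sigma2 + (1 - rho) / B * sigma2

for rho in [0.0, 0.2, 0.5, 0.8, 1.0]:
    print(rho, [round(bagged_variance(rho, B=B), 3) for B in (1, 10, 100)])
# Reproduces the table above, e.g. rho=0.5, B=10 -> 0.55 * sigma^2

def estimate_tree_correlation(trees, X_probe):
    """Average pairwise correlation of per-tree predictions on a probe set."""
    preds = np.array([t.predict(X_probe) for t in trees])  # shape (B, n_samples)
    corr = np.corrcoef(preds)                               # (B, B) matrix
    B = len(trees)
    return corr[~np.eye(B, dtype=bool)].mean()              # mean off-diagonal

# Example (assumes `bagger` from the earlier usage sketch):
# rho_hat = estimate_tree_correlation(bagger.trees, X_test)
```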
Empirical Variance Decomposition:
To measure the actual variance reduction achieved by bagging, we can decompose the prediction error:
$$\mathbb{E}[(y - \hat{f}_{bag}(x))^2] = \underbrace{(f(x) - \mathbb{E}[\hat{f}_{bag}(x)])^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}[(\hat{f}_{bag}(x) - \mathbb{E}[\hat{f}_{bag}(x)])^2]}_{\text{Variance}} + \underbrace{\sigma^2_\epsilon}_{\text{Irreducible Error}}$$
For deep, unpruned trees: the bias term is already low and is essentially unchanged by averaging (the expected prediction of the ensemble matches that of a single tree); the variance term shrinks according to the $\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$ formula; and the irreducible error is untouched.
This confirms that bagging is fundamentally a variance-reduction technique that preserves the low bias of complex models.
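One way to see this empirically is to repeatedly redraw training sets from a known generating function and compare the bias and variance of a single deep tree against a bagged ensemble. This is a rough sketch: the synthetic target, noise level, and sample counts are all assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor

rng = np.random.default_rng(0)

def f(x):
    """Known target function for the synthetic problem."""
    return np.sin(3 * x[:, 0])

x_grid = np.linspace(-1, 1, 50).reshape(-1, 1)

def predictions_over_datasets(make_model, n_repeats=50, n=200, noise=0.3):
    """Predictions on x_grid from models trained on independently drawn datasets."""
    preds = []
    for _ in range(n_repeats):
        X = rng.uniform(-1, 1, size=(n, 1))
        y = f(X) + rng.normal(0, noise, size=n)
        preds.append(make_model().fit(X, y).predict(x_grid))
    return np.array(preds)  # shape (n_repeats, 50)

for name, make_model in [
    ("single deep tree", lambda: DecisionTreeRegressor()),
    ("bagged trees (B=100)", lambda: BaggingRegressor(DecisionTreeRegressor(), n_estimators=100)),
]:
    P = predictions_over_datasets(make_model)
    bias2 = np.mean((P.mean(axis=0) - f(x_grid)) ** 2)
    variance = np.mean(P.var(axis=0))
    print(f"{name}: bias^2 ~ {bias2:.4f}, variance ~ {variance:.4f}")
```

Typically both models show similarly small bias, while the bagged ensemble's variance is several times smaller, matching the decomposition above.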
Random Forests are often viewed as the "upgraded" version of bagged decision trees. Understanding the precise relationship between these methods clarifies when each is appropriate.
The Key Difference: Feature Randomization
Bagged Decision Trees: every tree considers all $p$ features when searching for the best split at every node; the only source of diversity between trees is the bootstrap sample.
Random Forests: every tree considers only a random subset of features at each split (commonly $\sqrt{p}$ for classification and $p/3$ for regression), which decorrelates the trees but adds one more hyperparameter, max_features, to tune.
When to Prefer Bagged Trees Over Random Forests:
Low-dimensional datasets ($p < 10$): With few features, random feature selection may exclude the optimal split feature too often, degrading performance.
Strongly predictive features: If one or two features dominate prediction, bagged trees can leverage them at every split, while Random Forests may randomly exclude them.
Highly engineered features: When you've carefully engineered features and know they're all relevant, forcing the use of all features may be beneficial.
Maximum individual tree performance needed: If the individual trees themselves must be interpretable or accurate on their own, bagged trees without feature restrictions produce stronger standalone trees.
When Random Forests Excel:
High-dimensional datasets ($p > 50$): Feature randomization prevents a few noisy but seemingly predictive features from dominating.
Correlated features: Random Forests better handle situations where multiple features carry similar information.
Unknown feature relevance: When you don't know which features matter, Random Forests' implicit feature selection helps.
Maximum ensemble performance: For pure predictive accuracy, Random Forests usually outperform bagged trees.
In practice, try both methods. Start with Random Forests (the usual default), but if you're working with a low-dimensional, carefully engineered feature set, compare against bagged trees. The difference can be significant in either direction depending on the specific problem structure.
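In scikit-learn the two setups differ by essentially one knob; a minimal comparison sketch (the dataset, fold count, and ensemble sizes are illustrative assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Bagged trees: every split sees all features
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=200,
                           random_state=0, n_jobs=-1)

# Random Forest: each split sees a random subset of features (sqrt(p) here)
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                random_state=0, n_jobs=-1)

for name, model in [("Bagged trees", bagged), ("Random Forest", forest)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```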
Understanding the computational complexity of bagged decision trees is essential for deploying them at scale.
Training Complexity:
For a single decision tree trained on $n$ samples with $p$ features: training cost is roughly $O(p \cdot n \log n)$ for typical implementations that evaluate sorted feature values and grow the tree to depth $O(\log n)$; heavily unbalanced trees can be more expensive.
For a bagged ensemble of $B$ trees: total work is simply $B$ times the single-tree cost, roughly $O(B \cdot p \cdot n \log n)$, since each tree is trained independently on its own bootstrap sample.
Key Property: Embarrassingly Parallel
Bagging is a naturally parallelizable algorithm. Each tree is trained independently on its bootstrap sample. This means: training time scales down almost linearly with the number of cores; trees can be trained on separate machines and simply concatenated afterwards; and, unlike boosting, no tree has to wait for another to finish.
```python
from joblib import Parallel, delayed
from sklearn.tree import DecisionTreeClassifier
import numpy as np


def train_single_tree(X, y, random_state, max_depth):
    """Train a single tree on a bootstrap sample."""
    np.random.seed(random_state)
    n = X.shape[0]

    # Create bootstrap sample
    indices = np.random.choice(n, size=n, replace=True)
    X_boot = X[indices]
    y_boot = y[indices]

    # Train tree
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=random_state)
    tree.fit(X_boot, y_boot)

    # Return tree and OOB indices
    oob_mask = np.ones(n, dtype=bool)
    oob_mask[np.unique(indices)] = False
    return tree, np.where(oob_mask)[0]


def parallel_bagged_trees(X, y, n_estimators=100, max_depth=None, n_jobs=-1):
    """
    Train bagged decision trees in parallel.

    Parameters
    ----------
    X, y : Training data
    n_estimators : Number of trees
    max_depth : Maximum tree depth
    n_jobs : Number of parallel jobs (-1 = all cores)

    Returns
    -------
    List of (tree, oob_indices) tuples
    """
    results = Parallel(n_jobs=n_jobs, verbose=1)(
        delayed(train_single_tree)(X, y, seed, max_depth)
        for seed in range(n_estimators)
    )
    return results


# Example usage:
# trees_and_oob = parallel_bagged_trees(X_train, y_train, n_estimators=500)
# trees = [t[0] for t in trees_and_oob]
# oob_indices = [t[1] for t in trees_and_oob]
```

| n_samples | n_features | n_estimators | 1 Core | 4 Cores | 16 Cores |
|---|---|---|---|---|---|
| 10,000 | 50 | 100 | ~12 sec | ~3.2 sec | ~1 sec |
| 100,000 | 50 | 100 | ~180 sec | ~48 sec | ~13 sec |
| 100,000 | 100 | 500 | ~1,800 sec | ~470 sec | ~120 sec |
| 1,000,000 | 100 | 100 | ~3,600 sec | ~930 sec | ~240 sec |
Prediction Complexity:
For prediction, each tree requires $O(\log n)$ operations (the depth of a balanced tree), or $O(d)$ where $d$ is the actual tree depth. For $B$ trees the cost per query is therefore $O(B \cdot d)$—still fast in absolute terms, but it grows linearly with ensemble size and is itself trivially parallelizable across trees.
Memory Considerations:
Each tree stores: the split structure (feature index and threshold for every internal node), the child pointers that define the tree topology, and the prediction value (class counts or mean) at every leaf.
A fully grown tree on $n$ samples has up to $n$ leaf nodes, requiring $O(n)$ storage. For $B$ trees the ensemble therefore needs $O(B \cdot n)$ memory in the worst case, which becomes substantial for large datasets and large ensembles.
For very large ensembles or limited memory environments, consider: (1) Training trees with max_depth limits to reduce memory per tree, (2) Using float32 instead of float64 for thresholds, (3) Discarding trees after OOB scoring if only using for OOB-based selection, (4) Using lazy loading for trees during prediction.
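A quick way to sanity-check the memory footprint of a fitted scikit-learn ensemble is sketched below; the per-node cost is only an order-of-magnitude intuition, and the serialized size is the more reliable number. The dataset and ensemble size are illustrative assumptions.

```python
import pickle
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
model = BaggingClassifier(DecisionTreeClassifier(), n_estimators=200,
                          random_state=0).fit(X, y)

# Total node count across the ensemble (each node stores a feature index,
# threshold, child pointers, and a value array)
total_nodes = sum(est.tree_.node_count for est in model.estimators_)
print(f"Total nodes across {len(model.estimators_)} trees: {total_nodes}")

# Serialized size is a practical proxy for the in-memory footprint
print(f"Pickled ensemble size: {len(pickle.dumps(model)) / 1e6:.1f} MB")
```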
One of the most valuable properties of bagged ensembles is the Out-of-Bag (OOB) error estimate. This provides a nearly free estimate of generalization error without requiring a separate validation set.
The OOB Principle:
When we create a bootstrap sample of size $n$ by sampling with replacement, some training examples will be selected multiple times while others won't be selected at all. The probability that a specific example is not selected in any of the $n$ draws is:
$$P(\text{not selected}) = \left(1 - \frac{1}{n}\right)^n \xrightarrow{n \to \infty} e^{-1} \approx 0.368$$
This means approximately 36.8% of training samples are out-of-bag for any given tree. These OOB samples can be used to evaluate that tree's predictions on data it hasn't seen.
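The sketch below verifies both statements numerically: the closed-form probability $(1 - 1/n)^n$ for a few values of $n$, and the empirical OOB fraction of an actual bootstrap sample (the sample size is an arbitrary assumption).

```python
import numpy as np

# Closed-form probability of a sample never being drawn in n tries
for n in (10, 100, 1000, 100000):
    print(n, round((1 - 1 / n) ** n, 4))  # approaches e^{-1} ~ 0.3679

# Empirical check: fraction of indices missing from one bootstrap sample
rng = np.random.default_rng(0)
n = 10000
bootstrap = rng.choice(n, size=n, replace=True)
oob_fraction = 1 - len(np.unique(bootstrap)) / n
print(f"Empirical OOB fraction: {oob_fraction:.4f}")  # ~ 0.368
```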
Computing OOB Error:
For each training sample $i$: identify the trees for which $i$ was out-of-bag, collect those trees' predictions at $x_i$, and aggregate them (majority vote for classification, mean for regression) into the OOB prediction $\hat{y}_i^{OOB}$.
The OOB error is then the average error over all training samples:
$$\text{OOB Error} = \frac{1}{n}\sum_{i=1}^{n} L(y_i, \hat{y}_i^{OOB})$$
where $\hat{y}_i^{OOB}$ is the aggregated prediction from trees where $i$ was OOB.
Why OOB Error Works:
For sample $i$, the OOB prediction comes only from trees that never saw sample $i$ during training. This is closely analogous to cross-validation, but it is computed essentially for free during normal bagging training, and empirically the resulting error estimates track those from cross-validation closely.
```python
import numpy as np
from collections import defaultdict
import matplotlib.pyplot as plt


def detailed_oob_analysis(bagged_model, X, y, classes=None):
    """
    Perform detailed OOB analysis including per-class accuracy
    and confidence calibration.

    Parameters
    ----------
    bagged_model : trained BaggedDecisionTrees instance
    X : training features
    y : training labels
    classes : optional list of class labels

    Returns
    -------
    Dictionary with OOB metrics and diagnostics
    """
    n_samples = X.shape[0]
    oob_predictions = defaultdict(list)
    oob_probas = defaultdict(list)

    # Collect OOB predictions for each sample
    for b, tree in enumerate(bagged_model.trees):
        oob_idx = bagged_model.oob_indices[b]
        if len(oob_idx) > 0:
            preds = tree.predict(X[oob_idx])
            probas = tree.predict_proba(X[oob_idx])
            for i, (pred, proba) in zip(oob_idx, zip(preds, probas)):
                oob_predictions[i].append(pred)
                oob_probas[i].append(proba)

    # Aggregate predictions
    final_predictions = {}
    final_probas = {}
    n_trees_per_sample = {}

    for i in range(n_samples):
        if i in oob_predictions:
            preds = oob_predictions[i]
            probas = np.array(oob_probas[i])

            # Majority vote
            unique, counts = np.unique(preds, return_counts=True)
            final_predictions[i] = unique[np.argmax(counts)]

            # Average probability
            final_probas[i] = np.mean(probas, axis=0)

            # Track coverage
            n_trees_per_sample[i] = len(preds)

    # Compute overall OOB accuracy
    correct = sum(1 for i, pred in final_predictions.items() if pred == y[i])
    total = len(final_predictions)
    oob_accuracy = correct / total if total > 0 else 0

    # Per-class accuracy
    if classes is None:
        classes = np.unique(y)

    per_class_accuracy = {}
    for c in classes:
        class_mask = [i for i in final_predictions.keys() if y[i] == c]
        if len(class_mask) > 0:
            class_correct = sum(1 for i in class_mask if final_predictions[i] == c)
            per_class_accuracy[c] = class_correct / len(class_mask)
        else:
            per_class_accuracy[c] = None

    # Coverage statistics
    coverage_stats = {
        'mean_trees': np.mean(list(n_trees_per_sample.values())),
        'min_trees': np.min(list(n_trees_per_sample.values())),
        'max_trees': np.max(list(n_trees_per_sample.values())),
        'samples_with_oob': len(final_predictions),
        'samples_without_oob': n_samples - len(final_predictions),
    }

    return {
        'oob_accuracy': oob_accuracy,
        'per_class_accuracy': per_class_accuracy,
        'coverage_stats': coverage_stats,
        'final_predictions': final_predictions,
        'final_probas': final_probas,
    }


def plot_oob_convergence(X, y, max_trees=500, step=10):
    """
    Plot how OOB error converges as trees are added.

    This visualization helps determine optimal n_estimators.
    """
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    oob_errors = []
    n_trees_list = list(range(step, max_trees + 1, step))

    for n_trees in n_trees_list:
        model = BaggingClassifier(
            estimator=DecisionTreeClassifier(),
            n_estimators=n_trees,
            oob_score=True,
            random_state=42,
            n_jobs=-1
        )
        model.fit(X, y)
        oob_errors.append(1 - model.oob_score_)

    plt.figure(figsize=(10, 6))
    plt.plot(n_trees_list, oob_errors, 'b-', linewidth=2)
    plt.xlabel('Number of Trees')
    plt.ylabel('OOB Error')
    plt.title('OOB Error Convergence')
    plt.grid(True, alpha=0.3)

    # Mark convergence point (where improvement < 0.1%)
    for i in range(1, len(oob_errors)):
        if abs(oob_errors[i] - oob_errors[i - 1]) < 0.001:
            plt.axvline(x=n_trees_list[i], color='r', linestyle='--',
                        label=f'Convergence at {n_trees_list[i]} trees')
            break

    plt.legend()
    plt.tight_layout()
    return plt.gcf()
```

OOB estimation provides error estimates comparable to 5-10 fold cross-validation but without the computational cost of retraining.
For large datasets, this can save hours of computation time. The only caveat: with very few trees (B < 50), some samples may have too few OOB predictions for reliable estimates.
Bagged Decision Trees represent the purest application of ensemble averaging to high-variance learners. The key insights to retain: deep, unpruned trees are ideal base learners because their instability produces the diverse errors that averaging cancels; ensemble variance is governed by $\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$, so the correlation between trees, not the number of trees, is the binding constraint; out-of-bag samples give a nearly free estimate of generalization error; and Random Forests are simply bagged trees plus feature randomization to push $\rho$ lower.
What's Next:
With a solid understanding of bagged decision trees, we'll explore how bagging extends beyond trees. The next page examines Bagged Neural Networks, an approach that combines the variance reduction of bagging with the representational power of deep learning.
You now understand why decision trees are the canonical base learner for bagging, how to implement and analyze bagged tree ensembles, and when to prefer them over Random Forests. This foundation prepares you for understanding bagging with other model families.