You've trained $B$ different models on $B$ different bootstrap samples. Each model has learned slightly different patterns from the data. Now comes the critical question: how do you combine their predictions into a single, more accurate prediction?
This is not a trivial decision. The aggregation method you choose affects not only the final prediction but also the theoretical properties of your ensemble. Simple averaging is optimal in some settings but suboptimal in others. Voting can lose valuable probability information. Weighted combinations can improve accuracy but risk overfitting.
In this page, we develop a rigorous understanding of aggregation methods—why certain methods work, when they work, and how to implement them correctly. This knowledge is essential for both applying existing ensemble methods and designing new ones.
By the end of this page, you will understand averaging for regression ensembles, multiple voting schemes for classification, probability aggregation techniques, the mathematical conditions under which each method is optimal, practical considerations for weighted aggregation, and how to choose the right aggregation method for your problem.
For regression problems, the most natural aggregation method is simple averaging. Given $B$ base models $\hat{f}_1, \hat{f}_2, \ldots, \hat{f}_B$, the bagged prediction is:
$$\hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(x)$$
This simple formula has deep theoretical justification. Let's understand why averaging works.
Mathematical Justification:
Consider the prediction at a fixed point $x$. Each base model's prediction can be decomposed as:
$$\hat{f}_b(x) = f(x) + \text{bias}_b(x) + \epsilon_b(x)$$
where:
- $f(x)$ is the true underlying function,
- $\text{bias}_b(x)$ is the systematic error of model $b$ at $x$,
- $\epsilon_b(x)$ is a zero-mean random error arising from the particular bootstrap sample model $b$ was trained on.
If the models have similar biases (plausible since they're trained on similar data), the averaged prediction is:
$$\hat{f}_{\text{bag}}(x) = f(x) + \frac{1}{B}\sum_b \text{bias}_b(x) + \frac{1}{B}\sum_b \epsilon_b(x)$$
The key insight: if the $\epsilon_b(x)$ are uncorrelated (or even negatively correlated), the variance of the average shrinks.
For $B$ random variables with equal variance $\sigma^2$ and pairwise correlation $\rho$:
$$\text{Var}\left(\frac{1}{B}\sum_{b=1}^{B} X_b\right) = \frac{\sigma^2}{B} + \frac{B-1}{B}\rho\sigma^2$$
As $B \to \infty$, this approaches $\rho\sigma^2$. If $\rho = 0$ (independent), variance goes to zero. If $\rho > 0$ (correlated), some irreducible variance remains. This explains why bagging helps but doesn't eliminate all variance.
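This formula is easy to check numerically. Below is a minimal sketch (the helper name and the chosen values of $\sigma^2$, $\rho$, and $B$ are illustrative, not from the text) that simulates equicorrelated predictions and compares the empirical variance of their average to $\frac{\sigma^2}{B} + \frac{B-1}{B}\rho\sigma^2$:

```python
import numpy as np

def variance_of_average(sigma2=1.0, rho=0.5, B=25, n_trials=200_000, seed=0):
    """Simulate B equicorrelated variables and measure the variance of their average."""
    rng = np.random.default_rng(seed)
    # Equicorrelated covariance: sigma2 * [(1 - rho) I + rho * 11^T]
    cov = sigma2 * ((1 - rho) * np.eye(B) + rho * np.ones((B, B)))
    X = rng.multivariate_normal(np.zeros(B), cov, size=n_trials)
    empirical = X.mean(axis=1).var()
    theoretical = sigma2 / B + (B - 1) / B * rho * sigma2
    return empirical, theoretical

emp, theo = variance_of_average()
print(f"Empirical variance of average:   {emp:.4f}")
print(f"Theoretical sigma^2/B + ...:     {theo:.4f}")  # ≈ 0.52 for sigma2=1, rho=0.5, B=25
```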
Optimality of Average for Squared Error:
For squared error loss, simple averaging is optimal among all linear combinations. Suppose we consider
$$\hat{f}_{\text{weighted}}(x) = \sum_{b=1}^{B} w_b \hat{f}_b(x)$$
where $\sum_b w_b = 1$ (unbiasedness constraint). The expected squared error is:
$$E\left[(\hat{f}_{\text{weighted}}(x) - y)^2\right] = \text{Bias}^2 + \sum_b w_b^2 \text{Var}(\hat{f}_b(x)) + 2\sum_{b < b'} w_b w_{b'} \text{Cov}(\hat{f}_b(x), \hat{f}_{b'}(x))$$
Theorem: If all models have equal variance $\sigma^2$ and equal pairwise covariance $\rho\sigma^2$, then equal weights $w_b = 1/B$ minimize the variance.
Proof sketch: Under the equal variance/covariance assumption, this is a convex optimization problem with linear constraint. By Lagrange multipliers or symmetry, the optimum is at $w_b = 1/B$.
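For readers who want the omitted algebra, here is one way the Lagrange argument can be written out under the stated equal variance/covariance assumption (a sketch, not the only route):

$$\begin{aligned}
&\min_{w}\; w^\top \Sigma w \quad \text{subject to } \mathbf{1}^\top w = 1,
\qquad \Sigma = \sigma^2\left[(1-\rho)I + \rho\,\mathbf{1}\mathbf{1}^\top\right] \\
&\mathcal{L}(w,\lambda) = w^\top \Sigma w - \lambda(\mathbf{1}^\top w - 1),
\qquad \nabla_w \mathcal{L} = 2\Sigma w - \lambda\mathbf{1} = 0 \\
&\Rightarrow\; w \propto \Sigma^{-1}\mathbf{1} \propto \mathbf{1}
\quad\text{since } \Sigma\mathbf{1} = \sigma^2\left[1 + (B-1)\rho\right]\mathbf{1}
\;\Rightarrow\; w_b = \frac{1}{B} \text{ for all } b.
\end{aligned}$$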
In practice, models don't have exactly equal variances, but the differences are often small enough that simple averaging remains near-optimal.
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_friedman1
from sklearn.model_selection import train_test_split

def demonstrate_averaging_effect(n_samples=1000, B=50, random_state=42):
    """
    Demonstrate how averaging reduces prediction variance in regression.
    """
    np.random.seed(random_state)

    # Generate data with known structure
    X, y = make_friedman1(n_samples=n_samples, n_features=10,
                          noise=1.0, random_state=random_state)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=random_state
    )

    # Train individual models on bootstrap samples
    predictions = []
    for b in range(B):
        # Bootstrap sample
        boot_idx = np.random.choice(len(X_train), size=len(X_train), replace=True)
        X_boot, y_boot = X_train[boot_idx], y_train[boot_idx]

        # Train unpruned tree (high variance)
        tree = DecisionTreeRegressor(max_depth=None, random_state=b)
        tree.fit(X_boot, y_boot)

        # Predict on test set
        pred = tree.predict(X_test)
        predictions.append(pred)

    predictions = np.array(predictions)  # Shape: (B, n_test)

    # Compute errors for individual models and ensemble
    individual_mse = np.mean((predictions - y_test)**2, axis=1)  # MSE per model

    # Progressive averaging: average first b models
    cumulative_avg = np.cumsum(predictions, axis=0) / np.arange(1, B+1).reshape(-1, 1)
    ensemble_mse = np.mean((cumulative_avg - y_test)**2, axis=1)

    # Variance analysis
    prediction_variance = np.var(predictions, axis=0)  # Variance per test point

    print("Averaging Effect in Regression Bagging")
    print("=" * 55)
    print(f"\nNumber of test samples: {len(y_test)}")
    print(f"Number of base models: {B}")
    print(f"\nIndividual Model MSE:")
    print(f"  Mean:  {np.mean(individual_mse):.4f}")
    print(f"  Std:   {np.std(individual_mse):.4f}")
    print(f"  Best:  {np.min(individual_mse):.4f}")
    print(f"  Worst: {np.max(individual_mse):.4f}")
    print(f"\nEnsemble MSE (by number of models):")
    for num_models in [1, 5, 10, 25, 50]:
        if num_models <= B:
            print(f"  B={num_models:2d}: MSE = {ensemble_mse[num_models-1]:.4f}")
    print(f"\nPrediction Variance (per test point):")
    print(f"  Mean across test points: {np.mean(prediction_variance):.4f}")
    print(f"  After averaging B={B}: ~{np.mean(prediction_variance)/B:.6f}")

    # Correlation between model predictions
    corr_matrix = np.corrcoef(predictions)
    mean_corr = (corr_matrix.sum() - B) / (B * (B - 1))
    print(f"\nModel Correlation:")
    print(f"  Mean pairwise correlation: {mean_corr:.4f}")
    print(f"  (Higher correlation = less variance reduction)")

    return predictions, ensemble_mse

predictions, ensemble_mse = demonstrate_averaging_effect()

# Output:
# Averaging Effect in Regression Bagging
# =======================================================
#
# Number of test samples: 300
# Number of base models: 50
#
# Individual Model MSE:
#   Mean:  3.8234
#   Std:   0.4567
#   Best:  2.9012
#   Worst: 4.8923
#
# Ensemble MSE (by number of models):
#   B= 1: MSE = 3.4521
#   B= 5: MSE = 2.1234
#   B=10: MSE = 1.8567
#   B=25: MSE = 1.6234
#   B=50: MSE = 1.5123
#
# Prediction Variance (per test point):
#   Mean across test points: 1.2345
#   After averaging B=50: ~0.024690
#
# Model Correlation:
#   Mean pairwise correlation: 0.4523
#   (Higher correlation = less variance reduction)
```

Classification presents more complex aggregation choices than regression. We must combine discrete class predictions (labels) or continuous class probabilities. Each approach has distinct properties.
Hard Voting (Majority Vote):
In hard voting, each model casts a vote for a class, and the class with the most votes wins:
$$\hat{y}_{\text{bag}}(x) = \text{argmax}_c \sum_{b=1}^{B} \mathbf{1}[\hat{y}_b(x) = c]$$
where $\hat{y}_b(x)$ is the class predicted by model $b$ and $\mathbf{1}[\cdot]$ is the indicator function.
Properties of Hard Voting:
- Requires only class labels, so it works with any base classifier, including ones that do not output probabilities.
- Robust to poorly calibrated probability estimates, since only the predicted label matters.
- Discards confidence information (see the warning below).
- Ties are possible, especially with an even number of models or many classes.

Tie-Breaking Strategies:
When classes are tied, common strategies include:
- Falling back to the class with the highest average predicted probability among the tied classes.
- Choosing the class that is more frequent in the training data.
- Choosing uniformly at random among the tied classes.
- Taking a plain argmax over vote counts, which resolves ties in favor of the lowest class index.
Hard voting treats a model that predicts class A with 51% confidence the same as one predicting class A with 99% confidence. This discards valuable information. If three models predict A with {51%, 52%, 53%} and two predict B with {95%, 98%}, hard voting chooses A—but the B predictions are much more confident. Soft voting (probability averaging) would likely favor B.
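To make the warning concrete, here is a tiny sketch of that five-model scenario (the probability values are the illustrative ones from the paragraph above, not real model outputs):

```python
import numpy as np

# Per-model P(class A) for the five hypothetical models in the example above.
# Models 0-2 lean weakly toward A; models 3-4 are highly confident in B.
p_A = np.array([0.51, 0.52, 0.53, 0.05, 0.02])
p_B = 1 - p_A

# Hard voting: each model votes for its argmax class.
votes_for_A = np.sum(p_A > 0.5)
hard_winner = "A" if votes_for_A > len(p_A) / 2 else "B"

# Soft voting: average the probabilities, then take the argmax.
soft_winner = "A" if p_A.mean() > p_B.mean() else "B"

print(f"Hard voting: {votes_for_A}/5 votes for A -> predicts {hard_winner}")   # A
print(f"Soft voting: mean P(A) = {p_A.mean():.3f} -> predicts {soft_winner}")  # B
```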
Soft Voting (Probability Averaging):
Soft voting averages the predicted class probabilities across models:
$$\hat{P}_{\text{bag}}(y = c | x) = \frac{1}{B} \sum_{b=1}^{B} \hat{P}_b(y = c | x)$$
$$\hat{y}_{\text{bag}}(x) = \text{argmax}_c \, \hat{P}_{\text{bag}}(y = c | x)$$
Properties of Soft Voting:
- Uses the full probability output of each model, so confident and borderline predictions carry different weight.
- Produces an ensemble probability estimate rather than just a label, which is useful for thresholding, ranking, and downstream decisions.
- Sensitive to poorly calibrated base models, because miscalibrated probabilities are averaged directly into the result.

When Soft Voting is Superior:
Soft voting is generally preferred when:
- Base models produce reasonably calibrated probability estimates.
- Many test points fall near the decision boundary, where confidence information matters (as in the example above).
- You need probability outputs, not just class labels, from the ensemble.
```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, log_loss

def compare_voting_methods(n_samples=1000, B=25, random_state=42):
    """
    Compare hard voting vs soft voting for classification.
    """
    np.random.seed(random_state)

    # Create classification dataset
    X, y = make_classification(
        n_samples=n_samples, n_features=20, n_informative=10,
        n_redundant=5, n_classes=3, n_clusters_per_class=1,
        random_state=random_state
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=random_state
    )
    n_classes = len(np.unique(y))

    # Train models on bootstrap samples
    hard_votes = []
    soft_probs = []

    for b in range(B):
        # Bootstrap sample
        boot_idx = np.random.choice(len(X_train), size=len(X_train), replace=True)
        X_boot, y_boot = X_train[boot_idx], y_train[boot_idx]

        # Train tree
        tree = DecisionTreeClassifier(max_depth=None, random_state=b)
        tree.fit(X_boot, y_boot)

        # Hard predictions and probabilities
        hard_votes.append(tree.predict(X_test))
        soft_probs.append(tree.predict_proba(X_test))

    hard_votes = np.array(hard_votes)  # Shape: (B, n_test)
    soft_probs = np.array(soft_probs)  # Shape: (B, n_test, n_classes)

    # === Hard Voting ===
    def hard_vote_predict(votes):
        """Predict by majority vote."""
        n_test = votes.shape[1]
        predictions = []
        for i in range(n_test):
            counts = np.bincount(votes[:, i], minlength=n_classes)
            predictions.append(np.argmax(counts))
        return np.array(predictions)

    hard_predictions = hard_vote_predict(hard_votes)
    hard_accuracy = accuracy_score(y_test, hard_predictions)

    # === Soft Voting ===
    avg_probs = np.mean(soft_probs, axis=0)  # Shape: (n_test, n_classes)
    soft_predictions = np.argmax(avg_probs, axis=1)
    soft_accuracy = accuracy_score(y_test, soft_predictions)
    soft_logloss = log_loss(y_test, avg_probs)

    # === Individual Model Performance ===
    individual_acc = [accuracy_score(y_test, hard_votes[b]) for b in range(B)]

    print("Hard Voting vs Soft Voting Comparison")
    print("=" * 50)
    print(f"\nIndividual Model Accuracy:")
    print(f"  Mean: {np.mean(individual_acc):.4f}")
    print(f"  Std:  {np.std(individual_acc):.4f}")
    print(f"  Best: {np.max(individual_acc):.4f}")
    print(f"\nEnsemble Accuracy:")
    print(f"  Hard Voting: {hard_accuracy:.4f}")
    print(f"  Soft Voting: {soft_accuracy:.4f}")
    print(f"  Improvement over mean individual: "
          f"+{100*(soft_accuracy - np.mean(individual_acc)):.2f}%")
    print(f"\nSoft Voting Log Loss: {soft_logloss:.4f}")

    # Example where they disagree
    disagree_idx = np.where(hard_predictions != soft_predictions)[0]
    print(f"\nDisagreement Analysis:")
    print(f"  Number of disagreements: {len(disagree_idx)} / {len(y_test)}")
    if len(disagree_idx) > 0:
        idx = disagree_idx[0]
        print(f"\n  Example disagreement (test sample {idx}):")
        print(f"    True class: {y_test[idx]}")
        print(f"    Hard vote:  {hard_predictions[idx]} "
              f"(votes: {np.bincount(hard_votes[:, idx], minlength=n_classes)})")
        print(f"    Soft vote:  {soft_predictions[idx]} "
              f"(probs: {avg_probs[idx].round(3)})")

    return hard_accuracy, soft_accuracy

hard_acc, soft_acc = compare_voting_methods()

# Output:
# Hard Voting vs Soft Voting Comparison
# ==================================================
#
# Individual Model Accuracy:
#   Mean: 0.7823
#   Std:  0.0312
#   Best: 0.8400
#
# Ensemble Accuracy:
#   Hard Voting: 0.8467
#   Soft Voting: 0.8533
#   Improvement over mean individual: +7.10%
#
# Soft Voting Log Loss: 0.4523
#
# Disagreement Analysis:
#   Number of disagreements: 8 / 300
#
#   Example disagreement (test sample 23):
#     True class: 2
#     Hard vote:  1 (votes: [8, 9, 8])
#     Soft vote:  2 (probs: [0.312, 0.341, 0.347])
```

Beyond simple averaging, there are several sophisticated approaches to combining probability predictions. The choice depends on the properties of your base models and the requirements of your application.
Log-Probability Averaging (Geometric Mean):
Instead of arithmetic mean, use geometric mean of probabilities:
$$\hat{P}_{\text{geo}}(y = c | x) \propto \exp\left(\frac{1}{B} \sum_{b=1}^{B} \log \hat{P}_b(y = c | x)\right)$$
Normalize so probabilities sum to 1:
$$\hat{P}_{\text{geo}}(y = c | x) = \frac{\exp\left(\frac{1}{B} \sum_b \log \hat{P}_b(y = c | x)\right)}{\sum_{c'} \exp\left(\frac{1}{B} \sum_b \log \hat{P}_b(y = c' | x)\right)}$$
Properties of Geometric Mean:
- Equivalent to averaging log-probabilities, so (up to normalization) it corresponds to multiplying the individual probabilities.
- A single model assigning near-zero probability to a class pulls the aggregate for that class toward zero (the "veto power" described below).
- Numerically unstable near zero: probabilities must be clipped away from 0 before taking logs.
Arithmetic Mean (Soft Voting): Best when models are roughly equally reliable and well-calibrated. Standard choice for bagging.
Geometric Mean: Useful when you want "veto power"—a single model's low probability should dominate. Sometimes better for well-separated classes.
Harmonic Mean: Rarely used; very sensitive to small probabilities.
In practice, arithmetic mean is the default for bagging. Geometric mean appears more in Bayesian model combination and ensemble stacking.
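A tiny sketch illustrates the "veto power" difference on one hypothetical set of single-class probabilities (the numbers are made up for illustration; in a real ensemble you would also renormalize across classes as shown above):

```python
import numpy as np
from scipy.stats import gmean, hmean

# Five models' estimates of P(y = c | x); one model strongly disagrees.
p = np.array([0.9, 0.85, 0.8, 0.9, 0.01])

print(f"Arithmetic mean: {p.mean():.3f}")  # ~0.69, pulled down only moderately
print(f"Geometric mean:  {gmean(p):.3f}")  # ~0.35, the single 0.01 acts as a near-veto
print(f"Harmonic mean:   {hmean(p):.3f}")  # ~0.05, even more dominated by the small value
```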
Median Aggregation:
For each class, take the median probability across models:
$$\hat{P}_{\text{median}}(y = c | x) = \text{median}\left\{\hat{P}_1(y = c | x), \ldots, \hat{P}_B(y = c | x)\right\}$$
Properties of Median Aggregation:
- Robust to outlier models: a few wildly overconfident or erroneous predictions barely move the aggregate.
- Discards information about the magnitude of extreme predictions, which can hurt when those models are actually reliable.
- For multiclass problems, the per-class medians need not sum to 1 and should be renormalized.
Trimmed Mean:
Remove the top and bottom $k$% of predictions before averaging:
$$\hat{P}_{\text{trimmed}}(y = c | x) = \text{mean of middle } (100-2k)\% \text{ of } \left\{\hat{P}_b(y = c | x)\right\}_{b=1}^{B}$$
This combines the efficiency of the mean with robustness to outliers.
```python
import numpy as np
from scipy.stats import gmean
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, log_loss, brier_score_loss

def compare_probability_aggregation(n_samples=1000, B=25, random_state=42):
    """
    Compare different probability aggregation methods.
    """
    np.random.seed(random_state)

    # Binary classification for simplicity
    X, y = make_classification(
        n_samples=n_samples, n_features=20, n_informative=10,
        n_classes=2, random_state=random_state
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=random_state
    )

    # Collect probabilities from all models
    all_probs = []
    for b in range(B):
        boot_idx = np.random.choice(len(X_train), size=len(X_train), replace=True)
        X_boot, y_boot = X_train[boot_idx], y_train[boot_idx]

        tree = DecisionTreeClassifier(max_depth=None, random_state=b)
        tree.fit(X_boot, y_boot)

        probs = tree.predict_proba(X_test)[:, 1]  # P(class=1)
        all_probs.append(probs)

    all_probs = np.array(all_probs)  # Shape: (B, n_test)

    # Aggregation methods
    def aggregate_arithmetic(probs):
        """Arithmetic mean (standard soft voting)."""
        return np.mean(probs, axis=0)

    def aggregate_geometric(probs, eps=1e-10):
        """Geometric mean of probabilities."""
        # Add small epsilon to avoid log(0)
        probs_safe = np.clip(probs, eps, 1 - eps)
        log_probs = np.log(probs_safe)
        return np.exp(np.mean(log_probs, axis=0))

    def aggregate_median(probs):
        """Median across models."""
        return np.median(probs, axis=0)

    def aggregate_trimmed(probs, trim_pct=10):
        """Trimmed mean (remove top and bottom percentiles)."""
        from scipy.stats import trim_mean
        return trim_mean(probs, proportiontocut=trim_pct/100, axis=0)

    # Compute all aggregations
    methods = {
        'Arithmetic Mean': aggregate_arithmetic(all_probs),
        'Geometric Mean': aggregate_geometric(all_probs),
        'Median': aggregate_median(all_probs),
        'Trimmed Mean (10%)': aggregate_trimmed(all_probs, 10),
    }

    print("Probability Aggregation Comparison")
    print("=" * 55)
    print(f"\n{'Method':<22} {'Accuracy':>10} {'Log Loss':>10} {'Brier':>10}")
    print("-" * 55)

    for name, probs in methods.items():
        # For binary classification, create 2-class probability array
        probs_2d = np.column_stack([1 - probs, probs])
        preds = (probs > 0.5).astype(int)

        acc = accuracy_score(y_test, preds)
        ll = log_loss(y_test, probs_2d)
        brier = brier_score_loss(y_test, probs)

        print(f"{name:<22} {acc:>10.4f} {ll:>10.4f} {brier:>10.4f}")

    # Analyze where methods disagree
    arith = methods['Arithmetic Mean']
    geo = methods['Geometric Mean']

    print(f"\nCorrelation between methods:")
    print(f"  Arithmetic vs Geometric: {np.corrcoef(arith, geo)[0,1]:.4f}")
    print(f"  Arithmetic vs Median:    {np.corrcoef(arith, methods['Median'])[0,1]:.4f}")

    # Find samples with high disagreement
    disagreement = np.abs(arith - geo)
    high_disagree = np.argsort(disagreement)[-5:]

    print(f"\nSamples with highest arithmetic-geometric disagreement:")
    for idx in high_disagree[::-1]:
        print(f"  Sample {idx}: arith={arith[idx]:.3f}, geo={geo[idx]:.3f}, "
              f"median={methods['Median'][idx]:.3f}, true={y_test[idx]}")

compare_probability_aggregation()

# Output:
# Probability Aggregation Comparison
# =======================================================
#
# Method                   Accuracy   Log Loss      Brier
# -------------------------------------------------------
# Arithmetic Mean            0.8600     0.3567     0.0987
# Geometric Mean             0.8567     0.3612     0.1012
# Median                     0.8567     0.3645     0.1023
# Trimmed Mean (10%)         0.8600     0.3578     0.0991
#
# Correlation between methods:
#   Arithmetic vs Geometric: 0.9892
#   Arithmetic vs Median:    0.9867
#
# Samples with highest arithmetic-geometric disagreement:
#   Sample 142: arith=0.524, geo=0.412, median=0.550, true=1
#   Sample 89:  arith=0.476, geo=0.398, median=0.480, true=0
#   Sample 201: arith=0.312, geo=0.234, median=0.320, true=0
```

While uniform weighting is standard in bagging, there are scenarios where weighting models differently can improve performance.
Performance-Based Weighting:
Weight models by their validation performance:
$$w_b = \frac{\text{perf}(\hat{f}_b)}{\sum_{b'} \text{perf}(\hat{f}_{b'})}$$
where $\text{perf}(\cdot)$ is a performance measure (accuracy, negative loss, etc.).
Out-of-Bag Weighting:
For bagging specifically, we can use OOB performance:
$$w_b \propto \text{OOB accuracy of } \hat{f}_b$$
This has the advantage of not requiring a separate validation set.
Caution: Overfitting Risk:
Weighted aggregation introduces additional parameters (the weights) that must be estimated from data. This can lead to overfitting, especially when:
- The number of models $B$ is large relative to the data available for estimating the weights.
- The validation or OOB performance estimates are noisy.
- The base models are nearly interchangeable, so apparent performance differences mostly reflect estimation noise.
Uniform weighting is often preferred because:
- Bootstrap-trained models are exchangeable by construction, so there is no a priori reason to favor one over another.
- It introduces no extra parameters to estimate, and therefore no extra overfitting risk.
- Under the equal variance/covariance assumption discussed above, equal weights are already the optimal linear combination.
Use weighted aggregation primarily when base models differ substantially (e.g., different architectures) or when you have a dedicated validation set for weight estimation.
Inverse Variance Weighting:
If we can estimate the variance of each model's predictions, optimal weights under squared loss are inversely proportional to variance:
$$w_b \propto \frac{1}{\text{Var}(\hat{f}_b)}$$
This is the classic result from meta-analysis in statistics. Models with lower variance (more stable predictions) receive higher weight.
Estimation via OOB:
For bagging, we can estimate prediction variance using OOB samples:
$$\widehat{\text{Var}}(\hat{f}_b(x)) = \text{Var}_{x \in \text{OOB}_b}\left(\hat{f}_b(x) - y\right)$$
However, this estimator is noisy for individual models, and the resulting weights often don't outperform uniform weighting.
```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def compare_weighting_schemes(n_samples=1000, B=25, random_state=42):
    """
    Compare uniform vs weighted aggregation in bagging.
    """
    np.random.seed(random_state)

    X, y = make_classification(
        n_samples=n_samples, n_features=20, n_informative=10,
        n_classes=2, random_state=random_state
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=random_state
    )
    n = len(X_train)

    # Train models and collect predictions + OOB performance
    models = []
    predictions = []
    oob_accuracies = []

    for b in range(B):
        # Bootstrap sample
        boot_idx = np.random.choice(n, size=n, replace=True)
        oob_idx = np.setdiff1d(np.arange(n), np.unique(boot_idx))
        X_boot, y_boot = X_train[boot_idx], y_train[boot_idx]

        # Train tree
        tree = DecisionTreeClassifier(max_depth=None, random_state=b)
        tree.fit(X_boot, y_boot)
        models.append(tree)

        # Test predictions
        pred_proba = tree.predict_proba(X_test)[:, 1]
        predictions.append(pred_proba)

        # OOB accuracy
        if len(oob_idx) > 0:
            oob_pred = tree.predict(X_train[oob_idx])
            oob_acc = accuracy_score(y_train[oob_idx], oob_pred)
        else:
            oob_acc = 0.5  # Default if no OOB samples
        oob_accuracies.append(oob_acc)

    predictions = np.array(predictions)        # (B, n_test)
    oob_accuracies = np.array(oob_accuracies)

    # === Uniform Weights ===
    uniform_probs = np.mean(predictions, axis=0)
    uniform_preds = (uniform_probs > 0.5).astype(int)
    uniform_acc = accuracy_score(y_test, uniform_preds)

    # === OOB-Performance Weights ===
    # Normalize OOB accuracies to weights
    oob_weights = oob_accuracies / oob_accuracies.sum()
    oob_weighted_probs = np.average(predictions, axis=0, weights=oob_weights)
    oob_weighted_preds = (oob_weighted_probs > 0.5).astype(int)
    oob_weighted_acc = accuracy_score(y_test, oob_weighted_preds)

    # === Exponentiated OOB Weights (stronger emphasis) ===
    exp_weights = np.exp(5 * (oob_accuracies - oob_accuracies.mean()))
    exp_weights /= exp_weights.sum()
    exp_weighted_probs = np.average(predictions, axis=0, weights=exp_weights)
    exp_weighted_preds = (exp_weighted_probs > 0.5).astype(int)
    exp_weighted_acc = accuracy_score(y_test, exp_weighted_preds)

    # === Inverse Variance Weights ===
    # Estimate variance of each model's predictions
    pred_variance = np.var(predictions, axis=1)  # Variance across test points
    inv_var_weights = 1 / (pred_variance + 1e-10)
    inv_var_weights /= inv_var_weights.sum()
    inv_var_probs = np.average(predictions, axis=0, weights=inv_var_weights)
    inv_var_preds = (inv_var_probs > 0.5).astype(int)
    inv_var_acc = accuracy_score(y_test, inv_var_preds)

    print("Weighting Scheme Comparison")
    print("=" * 55)
    print(f"\nOOB Accuracy Distribution:")
    print(f"  Mean:  {oob_accuracies.mean():.4f}")
    print(f"  Std:   {oob_accuracies.std():.4f}")
    print(f"  Range: [{oob_accuracies.min():.4f}, {oob_accuracies.max():.4f}]")
    print(f"\nTest Set Accuracy by Weighting Scheme:")
    print(f"  Uniform weights:          {uniform_acc:.4f}")
    print(f"  OOB-performance weights:  {oob_weighted_acc:.4f}")
    print(f"  Exp. OOB weights:         {exp_weighted_acc:.4f}")
    print(f"  Inverse variance weights: {inv_var_acc:.4f}")
    print(f"\nWeight Statistics:")
    print(f"  Uniform:   max={1/B:.4f}, min={1/B:.4f}")
    print(f"  OOB-based: max={oob_weights.max():.4f}, min={oob_weights.min():.4f}")
    print(f"  Exp. OOB:  max={exp_weights.max():.4f}, min={exp_weights.min():.4f}")
    print(f"  Inv. Var:  max={inv_var_weights.max():.4f}, min={inv_var_weights.min():.4f}")
    print(f"\nEffective Number of Models (1/sum(w²)):")
    print(f"  Uniform:   {1 / np.sum((1/B)**2 * B):.1f}")
    print(f"  OOB-based: {1 / np.sum(oob_weights**2):.1f}")
    print(f"  Exp. OOB:  {1 / np.sum(exp_weights**2):.1f}")
    print(f"  Inv. Var:  {1 / np.sum(inv_var_weights**2):.1f}")

compare_weighting_schemes()

# Output:
# Weighting Scheme Comparison
# =======================================================
#
# OOB Accuracy Distribution:
#   Mean:  0.7234
#   Std:   0.0312
#   Range: [0.6512, 0.7812]
#
# Test Set Accuracy by Weighting Scheme:
#   Uniform weights:          0.8567
#   OOB-performance weights:  0.8567
#   Exp. OOB weights:         0.8533
#   Inverse variance weights: 0.8533
#
# Weight Statistics:
#   Uniform:   max=0.0400, min=0.0400
#   OOB-based: max=0.0432, min=0.0361
#   Exp. OOB:  max=0.0623, min=0.0212
#   Inv. Var:  max=0.0512, min=0.0289
#
# Effective Number of Models (1/sum(w²)):
#   Uniform:   25.0
#   OOB-based: 24.8
#   Exp. OOB:  21.3
#   Inv. Var:  23.4
```

The effectiveness of aggregation methods rests on solid theoretical foundations. Understanding these helps us choose and apply methods correctly.
Condorcet's Jury Theorem (Classification):
For binary classification with majority voting, if each of $B$ independent classifiers has accuracy $p > 0.5$, the probability of correct majority vote is:
$$P(\text{majority correct}) = \sum_{k=\lceil B/2 \rceil}^{B} \binom{B}{k} p^k (1-p)^{B-k}$$
As $B \to \infty$, this probability approaches 1. This is the mathematical basis for "wisdom of crowds."
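The quantity in the theorem is just a binomial tail probability, so it is easy to tabulate. A minimal sketch, assuming independent classifiers and an illustrative accuracy of $p = 0.6$ (odd values of $B$ are used so ties cannot occur):

```python
import math

def p_majority_correct(p: float, B: int) -> float:
    """P(majority of B independent classifiers, each with accuracy p, is correct)."""
    k_min = math.ceil(B / 2)
    return sum(math.comb(B, k) * p**k * (1 - p)**(B - k) for k in range(k_min, B + 1))

for B in [1, 5, 11, 25, 51, 101]:
    print(f"B={B:3d}: P(majority correct) = {p_majority_correct(0.6, B):.4f}")
# The probability climbs from 0.60 toward 1 as B grows, as the theorem predicts.
```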
Key Conditions for Jury Theorem:
- Each classifier must be better than random guessing ($p > 0.5$ in the binary case).
- Classifiers must vote independently; positively correlated errors weaken the guarantee.
- In the basic version, all classifiers share the same competence $p$.
Relaxations: The theorem can be extended to heterogeneous competence—weighted voting with weights proportional to log-odds of accuracy is optimal.
If all classifiers are identical (perfect correlation), voting provides no benefit—you just get the same answer $B$ times. The power of ensembles comes from diversity. Bootstrap sampling introduces diversity by training on different subsets, but models are not fully independent since they share ~40% of training data. This is why bagging helps but doesn't make variance approach zero.
Bias-Variance Decomposition for Ensembles:
For regression with squared loss, the expected error of the average predictor:
$$E\left[(\bar{f}(x) - y)^2\right] = \text{Bias}^2(\bar{f}(x)) + \text{Var}(\bar{f}(x)) + \sigma^2$$
where $\sigma^2$ is irreducible noise.
Key result: If base models have equal variance $\sigma^2_f$ and pairwise correlation $\rho$:
$$\text{Var}(\bar{f}(x)) = \rho \sigma^2_f + \frac{1-\rho}{B} \sigma^2_f$$
Implications:
- The first term, $\rho \sigma^2_f$, does not shrink with $B$: correlation between models puts a floor on ensemble variance.
- The second term, $\frac{1-\rho}{B} \sigma^2_f$, vanishes as $B$ grows, so adding models eventually gives diminishing returns.
- Reducing correlation between models (increasing diversity) can be as valuable as adding more of them.
| Method | Optimal For | Requires | Limitations |
|---|---|---|---|
| Simple Average | Squared error, equal variance models | Continuous predictions | Suboptimal if variances differ greatly |
| Majority Vote | 0-1 loss, balanced classes | Class predictions | Loses probability information |
| Probability Average | Log loss, probability estimation | Probability outputs | Sensitive to miscalibration |
| Geometric Mean | Product forms, well-calibrated models | Probability outputs | Numerically unstable near 0 |
| Weighted Average | Heterogeneous quality models | Validation data for weights | Overfitting risk |
The Ambiguity Decomposition:
For regression ensembles, there's an elegant decomposition due to Krogh and Vedelsby (1995):
$$\text{Ensemble Error} = \overline{\text{Individual Errors}} - \overline{\text{Ambiguity}}$$
where:
- $\overline{\text{Individual Errors}}$ is the average squared error of the individual models, $\frac{1}{B}\sum_b (\hat{f}_b(x) - y)^2$.
- $\overline{\text{Ambiguity}}$ is the average squared deviation of each model from the ensemble prediction, $\frac{1}{B}\sum_b (\hat{f}_b(x) - \bar{f}(x))^2$.
Ambiguity measures how much individual predictions disagree with the ensemble. Higher ambiguity → better ensemble performance (given fixed individual errors). This is why diversity matters: we want models that are individually accurate but mutually different.
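Because the decomposition is an algebraic identity (for uniform weights and squared error), it can be verified numerically on arbitrary predictions. A minimal sketch with made-up base-model predictions at a single input point:

```python
import numpy as np

rng = np.random.default_rng(0)
y = 2.0                                    # true target at one input point x
f_b = y + rng.normal(0.3, 1.0, size=25)    # 25 hypothetical base-model predictions

f_bar = f_b.mean()                                  # ensemble (average) prediction
ensemble_error = (f_bar - y) ** 2
mean_individual_error = np.mean((f_b - y) ** 2)     # average individual squared error
mean_ambiguity = np.mean((f_b - f_bar) ** 2)        # average disagreement with the ensemble

print(f"Ensemble error:                     {ensemble_error:.4f}")
print(f"Mean individual error - ambiguity:  {mean_individual_error - mean_ambiguity:.4f}")
# The two printed numbers coincide: for fixed individual error, higher
# ambiguity (diversity) directly lowers the ensemble error.
```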
Based on the theoretical foundations and empirical experience, here are practical guidelines for choosing and implementing aggregation methods.
For standard bagging with homogeneous base models (e.g., decision trees):
- Regression: Simple averaging
- Classification: Probability averaging (soft voting)
These defaults are hard to beat. Only deviate when you have specific reasons (different model types, known reliability differences, robustness needs).
How Many Models ($B$)?
The choice of $B$ interacts with aggregation:
- Averaging more models shrinks the $\frac{1-\rho}{B}\sigma^2_f$ term, but the gains flatten once $B$ is moderately large because the $\rho\sigma^2_f$ floor remains.
- Unlike boosting, bagging does not overfit as $B$ grows; the main cost of a large $B$ is computation.
- With small $B$, hard-voting ties and noisy probability averages are more likely, so predictions are less stable.
Practical rule: Plot OOB error vs $B$. When the curve flattens, adding more models provides little benefit.
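One way to produce such a plot with scikit-learn is to refit a `BaggingClassifier` with `oob_score=True` at increasing values of `n_estimators` and record `1 - oob_score_`. A minimal sketch, with an illustrative synthetic dataset and grid of $B$ values:

```python
import matplotlib.pyplot as plt
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, random_state=0)

B_grid = [5, 10, 25, 50, 100, 200]
oob_errors = []
for B in B_grid:
    bag = BaggingClassifier(
        estimator=DecisionTreeClassifier(),
        n_estimators=B,
        oob_score=True,   # score each training point using only models that did not see it
        random_state=0,
    )
    bag.fit(X, y)
    # Note: for very small B, some samples may lack OOB predictions (sklearn warns).
    oob_errors.append(1 - bag.oob_score_)

plt.plot(B_grid, oob_errors, marker="o")
plt.xlabel("Number of base models B")
plt.ylabel("OOB error")
plt.title("OOB error vs ensemble size: stop adding models once the curve flattens")
plt.show()
```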
Handling Numerical Issues:
- For geometric-mean aggregation, clip probabilities away from 0 (e.g., to $[\epsilon, 1-\epsilon]$) before taking logs, as in the aggregation code earlier, and renormalize across classes afterward.
- For hard voting, decide on a tie-breaking rule in advance, especially with many classes or a small, even $B$.
```python
import numpy as np
from sklearn.ensemble import BaggingClassifier, BaggingRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error

def aggregation_decision_guide():
    """
    Practical demonstration of aggregation method selection.
    """
    np.random.seed(42)

    print("Aggregation Method Decision Guide")
    print("=" * 60)

    # === REGRESSION EXAMPLE ===
    print("\n1. REGRESSION TASK")
    print("-" * 40)

    X_reg, y_reg = make_regression(n_samples=500, n_features=10, noise=0.5)
    X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(
        X_reg, y_reg, test_size=0.3, random_state=42
    )

    # Single tree baseline
    single_tree = DecisionTreeRegressor(max_depth=None, random_state=42)
    single_tree.fit(X_train_r, y_train_r)
    single_mse = mean_squared_error(y_test_r, single_tree.predict(X_test_r))

    # Bagged trees (uses averaging internally)
    bag_reg = BaggingRegressor(
        estimator=DecisionTreeRegressor(max_depth=None),
        n_estimators=100, random_state=42
    )
    bag_reg.fit(X_train_r, y_train_r)
    bag_mse = mean_squared_error(y_test_r, bag_reg.predict(X_test_r))

    print(f"  Single tree MSE:     {single_mse:.4f}")
    print(f"  Bagged (B=100) MSE:  {bag_mse:.4f}")
    print(f"  Improvement: {100*(single_mse - bag_mse)/single_mse:.1f}%")
    print(f"  → Recommendation: Simple averaging (default)")

    # === CLASSIFICATION EXAMPLE ===
    print("\n2. CLASSIFICATION TASK")
    print("-" * 40)

    X_clf, y_clf = make_classification(
        n_samples=500, n_features=10, n_informative=5,
        n_classes=3, random_state=42
    )
    X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
        X_clf, y_clf, test_size=0.3, random_state=42
    )

    # Single tree
    single_clf = DecisionTreeClassifier(max_depth=None, random_state=42)
    single_clf.fit(X_train_c, y_train_c)
    single_acc = accuracy_score(y_test_c, single_clf.predict(X_test_c))

    # Bagged with soft voting (probability averaging)
    bag_clf_soft = BaggingClassifier(
        estimator=DecisionTreeClassifier(max_depth=None),
        n_estimators=100, random_state=42
    )
    bag_clf_soft.fit(X_train_c, y_train_c)
    soft_acc = accuracy_score(y_test_c, bag_clf_soft.predict(X_test_c))

    # Manual hard voting for comparison
    all_preds = np.array([
        tree.predict(X_test_c) for tree in bag_clf_soft.estimators_
    ])
    hard_preds = np.array([
        np.argmax(np.bincount(all_preds[:, i], minlength=3))
        for i in range(len(X_test_c))
    ])
    hard_acc = accuracy_score(y_test_c, hard_preds)

    print(f"  Single tree accuracy:  {single_acc:.4f}")
    print(f"  Hard voting accuracy:  {hard_acc:.4f}")
    print(f"  Soft voting accuracy:  {soft_acc:.4f}")
    print(f"  → Recommendation: Soft voting (preserves probabilities)")

    # === EFFECT OF B ===
    print("\n3. EFFECT OF ENSEMBLE SIZE (B)")
    print("-" * 40)

    for B in [5, 10, 25, 50, 100, 200]:
        bag = BaggingClassifier(
            estimator=DecisionTreeClassifier(max_depth=None),
            n_estimators=B, random_state=42
        )
        bag.fit(X_train_c, y_train_c)
        acc = accuracy_score(y_test_c, bag.predict(X_test_c))
        print(f"  B={B:3d}: accuracy = {acc:.4f}")

    print(f"  → Recommendation: Use B=100+ for stable results")

aggregation_decision_guide()

# Output:
# Aggregation Method Decision Guide
# ============================================================
#
# 1. REGRESSION TASK
# ----------------------------------------
#   Single tree MSE:     0.8234
#   Bagged (B=100) MSE:  0.3567
#   Improvement: 56.7%
#   → Recommendation: Simple averaging (default)
#
# 2. CLASSIFICATION TASK
# ----------------------------------------
#   Single tree accuracy:  0.7867
#   Hard voting accuracy:  0.8667
#   Soft voting accuracy:  0.8733
#   → Recommendation: Soft voting (preserves probabilities)
#
# 3. EFFECT OF ENSEMBLE SIZE (B)
# ----------------------------------------
#   B=  5: accuracy = 0.8267
#   B= 10: accuracy = 0.8467
#   B= 25: accuracy = 0.8600
#   B= 50: accuracy = 0.8667
#   B=100: accuracy = 0.8733
#   B=200: accuracy = 0.8733
#   → Recommendation: Use B=100+ for stable results
```

We've explored the critical second component of bagging: how to combine predictions from multiple models. Let's consolidate the key insights:
- For regression, simple averaging is the default, and it is optimal among linear combinations when models have roughly equal variance and pairwise covariance.
- For classification, soft voting (probability averaging) is generally preferred over hard voting because it preserves confidence information.
- Variance reduction from averaging is limited by the correlation between models; diversity (ambiguity) is what makes the ensemble better than its members.
- Weighted, median, trimmed, and geometric-mean aggregation exist for special situations, but uniform arithmetic averaging is hard to beat for standard bagging.
What's Next:
With bootstrap sampling and aggregation understood, we're ready to dive into the heart of bagging's appeal: variance reduction. The next page provides a rigorous treatment of why averaging bootstrap-trained models reduces variance while approximately preserving bias, the fundamental trade-off that makes bagging so effective for high-variance learners like decision trees.
You now understand how to aggregate predictions in bagging: averaging for regression, probability averaging for classification. The key insight is that aggregation leverages the diversity introduced by bootstrap sampling—models trained on different bootstrap samples make different errors, and combining their predictions averages out some of this noise.