You've trained $B$ different models on $B$ different bootstrap samples. Each model has learned slightly different patterns from the data. Now comes the critical question: how do you combine their predictions into a single, more accurate prediction?
This is not a trivial decision. The aggregation method you choose affects not only the final prediction but also the theoretical properties of your ensemble. Simple averaging is optimal in some settings but suboptimal in others. Voting can lose valuable probability information. Weighted combinations can improve accuracy but risk overfitting.
In this page, we develop a rigorous understanding of aggregation methods—why certain methods work, when they work, and how to implement them correctly. This knowledge is essential for both applying existing ensemble methods and designing new ones.
By the end of this page, you will understand averaging for regression ensembles, multiple voting schemes for classification, probability aggregation techniques, the mathematical conditions under which each method is optimal, practical considerations for weighted aggregation, and how to choose the right aggregation method for your problem.
For regression problems, the most natural aggregation method is simple averaging. Given $B$ base models $\hat{f}_1, \hat{f}_2, \ldots, \hat{f}_B$, the bagged prediction is:
$$\hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(x)$$
This simple formula has deep theoretical justification. Let's understand why averaging works.
Mathematical Justification:
Consider the prediction at a fixed point $x$. Each base model's prediction can be decomposed as:
$$\hat{f}_b(x) = f(x) + \text{bias}_b(x) + \epsilon_b(x)$$
where:
- $f(x)$ is the true underlying function,
- $\text{bias}_b(x)$ is the systematic error of model $b$ at $x$,
- $\epsilon_b(x)$ is a zero-mean random error arising from the particular bootstrap sample model $b$ was trained on.
If the models have similar biases (plausible since they're trained on similar data), the averaged prediction is:
$$\hat{f}_{\text{bag}}(x) = f(x) + \frac{1}{B}\sum_b \text{bias}_b(x) + \frac{1}{B}\sum_b \epsilon_b(x)$$
The key insight: if the $\epsilon_b(x)$ are uncorrelated (or even negatively correlated), the variance of the average shrinks.
For $B$ random variables with equal variance $\sigma^2$ and pairwise correlation $\rho$:
$$\text{Var}\left(\frac{1}{B}\sum_{b=1}^{B} X_b\right) = \frac{\sigma^2}{B} + \frac{B-1}{B}\rho\sigma^2$$
As $B \to \infty$, this approaches $\rho\sigma^2$. If $\rho = 0$ (independent), variance goes to zero. If $\rho > 0$ (correlated), some irreducible variance remains. This explains why bagging helps but doesn't eliminate all variance.
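This formula is easy to check numerically. Below is a minimal sketch (the helper name and the chosen values of $\sigma^2$, $\rho$, and $B$ are illustrative, not from the text) that simulates equicorrelated predictions and compares the empirical variance of their average to $\frac{\sigma^2}{B} + \frac{B-1}{B}\rho\sigma^2$:

```python
import numpy as np

def variance_of_average(sigma2=1.0, rho=0.5, B=25, n_trials=200_000, seed=0):
    """Simulate B equicorrelated variables and measure the variance of their average."""
    rng = np.random.default_rng(seed)
    # Equicorrelated covariance: sigma2 * [(1 - rho) I + rho * 11^T]
    cov = sigma2 * ((1 - rho) * np.eye(B) + rho * np.ones((B, B)))
    X = rng.multivariate_normal(np.zeros(B), cov, size=n_trials)
    empirical = X.mean(axis=1).var()
    theoretical = sigma2 / B + (B - 1) / B * rho * sigma2
    return empirical, theoretical

emp, theo = variance_of_average()
print(f"Empirical variance of average:   {emp:.4f}")
print(f"Theoretical sigma^2/B + ...:     {theo:.4f}")  # ≈ 0.52 for sigma2=1, rho=0.5, B=25
```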
Optimality of Average for Squared Error:
For squared error loss, simple averaging is optimal among all linear combinations. Suppose we consider
$$\hat{f}_{\text{weighted}}(x) = \sum_{b=1}^{B} w_b \hat{f}_b(x)$$
where $\sum_b w_b = 1$ (unbiasedness constraint). The expected squared error is:
$$E\left[(\hat{f}_{\text{weighted}}(x) - y)^2\right] = \text{Bias}^2 + \sum_b w_b^2 \text{Var}(\hat{f}_b(x)) + 2\sum_{b < b'} w_b w_{b'} \text{Cov}(\hat{f}_b(x), \hat{f}_{b'}(x))$$
Theorem: If all models have equal variance $\sigma^2$ and equal pairwise covariance $\rho\sigma^2$, then equal weights $w_b = 1/B$ minimize the variance.
Proof sketch: Under the equal variance/covariance assumption, this is a convex optimization problem with linear constraint. By Lagrange multipliers or symmetry, the optimum is at $w_b = 1/B$.
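For readers who want the omitted algebra, here is one way the Lagrange argument can be written out under the stated equal variance/covariance assumption (a sketch, not the only route):

$$\begin{aligned}
&\min_{w}\; w^\top \Sigma w \quad \text{subject to } \mathbf{1}^\top w = 1,
\qquad \Sigma = \sigma^2\left[(1-\rho)I + \rho\,\mathbf{1}\mathbf{1}^\top\right] \\
&\mathcal{L}(w,\lambda) = w^\top \Sigma w - \lambda(\mathbf{1}^\top w - 1),
\qquad \nabla_w \mathcal{L} = 2\Sigma w - \lambda\mathbf{1} = 0 \\
&\Rightarrow\; w \propto \Sigma^{-1}\mathbf{1} \propto \mathbf{1}
\quad\text{since } \Sigma\mathbf{1} = \sigma^2\left[1 + (B-1)\rho\right]\mathbf{1}
\;\Rightarrow\; w_b = \frac{1}{B} \text{ for all } b.
\end{aligned}$$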
In practice, models don't have exactly equal variances, but the differences are often small enough that simple averaging remains near-optimal.
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_friedman1
from sklearn.model_selection import train_test_split

def demonstrate_averaging_effect(n_samples=1000, B=50, random_state=42):
    """
    Demonstrate how averaging reduces prediction variance in regression.
    """
    np.random.seed(random_state)

    # Generate data with known structure
    X, y = make_friedman1(n_samples=n_samples, n_features=10,
                          noise=1.0, random_state=random_state)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=random_state
    )

    # Train individual models on bootstrap samples
    predictions = []
    for b in range(B):
        # Bootstrap sample
        boot_idx = np.random.choice(len(X_train), size=len(X_train), replace=True)
        X_boot, y_boot = X_train[boot_idx], y_train[boot_idx]

        # Train unpruned tree (high variance)
        tree = DecisionTreeRegressor(max_depth=None, random_state=b)
        tree.fit(X_boot, y_boot)

        # Predict on test set
        pred = tree.predict(X_test)
        predictions.append(pred)

    predictions = np.array(predictions)  # Shape: (B, n_test)

    # Compute errors for individual models and ensemble
    individual_mse = np.mean((predictions - y_test)**2, axis=1)  # MSE per model

    # Progressive averaging: average first b models
    cumulative_avg = np.cumsum(predictions, axis=0) / np.arange(1, B+1).reshape(-1, 1)
    ensemble_mse = np.mean((cumulative_avg - y_test)**2, axis=1)

    # Variance analysis
    prediction_variance = np.var(predictions, axis=0)  # Variance per test point

    print("Averaging Effect in Regression Bagging")
    print("=" * 55)
    print(f"\nNumber of test samples: {len(y_test)}")
    print(f"Number of base models: {B}")
    print(f"\nIndividual Model MSE:")
    print(f"  Mean:  {np.mean(individual_mse):.4f}")
    print(f"  Std:   {np.std(individual_mse):.4f}")
    print(f"  Best:  {np.min(individual_mse):.4f}")
    print(f"  Worst: {np.max(individual_mse):.4f}")
    print(f"\nEnsemble MSE (by number of models):")
    for num_models in [1, 5, 10, 25, 50]:
        if num_models <= B:
            print(f"  B={num_models:2d}: MSE = {ensemble_mse[num_models-1]:.4f}")
    print(f"\nPrediction Variance (per test point):")
    print(f"  Mean across test points: {np.mean(prediction_variance):.4f}")
    print(f"  After averaging B={B}: ~{np.mean(prediction_variance)/B:.6f}")

    # Correlation between model predictions
    corr_matrix = np.corrcoef(predictions)
    mean_corr = (corr_matrix.sum() - B) / (B * (B - 1))
    print(f"\nModel Correlation:")
    print(f"  Mean pairwise correlation: {mean_corr:.4f}")
    print(f"  (Higher correlation = less variance reduction)")

    return predictions, ensemble_mse

predictions, ensemble_mse = demonstrate_averaging_effect()

# Output:
# Averaging Effect in Regression Bagging
# =======================================================
#
# Number of test samples: 300
# Number of base models: 50
#
# Individual Model MSE:
#   Mean:  3.8234
#   Std:   0.4567
#   Best:  2.9012
#   Worst: 4.8923
#
# Ensemble MSE (by number of models):
#   B= 1: MSE = 3.4521
#   B= 5: MSE = 2.1234
#   B=10: MSE = 1.8567
#   B=25: MSE = 1.6234
#   B=50: MSE = 1.5123
#
# Prediction Variance (per test point):
#   Mean across test points: 1.2345
#   After averaging B=50: ~0.024690
#
# Model Correlation:
#   Mean pairwise correlation: 0.4523
#   (Higher correlation = less variance reduction)
```

Classification presents more complex aggregation choices than regression. We must combine discrete class predictions (labels) or continuous class probabilities. Each approach has distinct properties.
Hard Voting (Majority Vote):
In hard voting, each model casts a vote for a class, and the class with the most votes wins:
$$\hat{y}_{\text{bag}}(x) = \text{argmax}_c \sum_{b=1}^{B} \mathbf{1}[\hat{y}_b(x) = c]$$
where $\hat{y}_b(x)$ is the class predicted by model $b$ and $\mathbf{1}[\cdot]$ is the indicator function.
Properties of Hard Voting:
- Requires only class labels, so it works with any base classifier, including ones that do not output probabilities.
- Robust to poorly calibrated probability estimates, since only the predicted label matters.
- Discards confidence information (see the warning below).
- Ties are possible, especially with an even number of models or many classes.

Tie-Breaking Strategies:
When classes are tied, common strategies include:
- Falling back to the class with the highest average predicted probability among the tied classes.
- Choosing the class that is more frequent in the training data.
- Choosing uniformly at random among the tied classes.
- Taking a plain argmax over vote counts, which resolves ties in favor of the lowest class index.
Hard voting treats a model that predicts class A with 51% confidence the same as one predicting class A with 99% confidence. This discards valuable information. If three models predict A with {51%, 52%, 53%} and two predict B with {95%, 98%}, hard voting chooses A—but the B predictions are much more confident. Soft voting (probability averaging) would likely favor B.
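To make the warning concrete, here is a tiny sketch of that five-model scenario (the probability values are the illustrative ones from the paragraph above, not real model outputs):

```python
import numpy as np

# Per-model P(class A) for the five hypothetical models in the example above.
# Models 0-2 lean weakly toward A; models 3-4 are highly confident in B.
p_A = np.array([0.51, 0.52, 0.53, 0.05, 0.02])
p_B = 1 - p_A

# Hard voting: each model votes for its argmax class.
votes_for_A = np.sum(p_A > 0.5)
hard_winner = "A" if votes_for_A > len(p_A) / 2 else "B"

# Soft voting: average the probabilities, then take the argmax.
soft_winner = "A" if p_A.mean() > p_B.mean() else "B"

print(f"Hard voting: {votes_for_A}/5 votes for A -> predicts {hard_winner}")   # A
print(f"Soft voting: mean P(A) = {p_A.mean():.3f} -> predicts {soft_winner}")  # B
```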
Soft Voting (Probability Averaging):
Soft voting averages the predicted class probabilities across models:
$$\hat{P}_{\text{bag}}(y = c | x) = \frac{1}{B} \sum_{b=1}^{B} \hat{P}_b(y = c | x)$$
$$\hat{y}_{\text{bag}}(x) = \text{argmax}_c \, \hat{P}_{\text{bag}}(y = c | x)$$
Properties of Soft Voting:
- Uses the full probability output of each model, so confident and borderline predictions carry different weight.
- Produces an ensemble probability estimate rather than just a label, which is useful for thresholding, ranking, and downstream decisions.
- Sensitive to poorly calibrated base models, because miscalibrated probabilities are averaged directly into the result.

When Soft Voting is Superior:
Soft voting is generally preferred when:
- Base models produce reasonably calibrated probability estimates.
- Many test points fall near the decision boundary, where confidence information matters (as in the example above).
- You need probability outputs, not just class labels, from the ensemble.
```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, log_loss

def compare_voting_methods(n_samples=1000, B=25, random_state=42):
    """
    Compare hard voting vs soft voting for classification.
    """
    np.random.seed(random_state)

    # Create classification dataset
    X, y = make_classification(
        n_samples=n_samples, n_features=20, n_informative=10,
        n_redundant=5, n_classes=3, n_clusters_per_class=1,
        random_state=random_state
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=random_state
    )
    n_classes = len(np.unique(y))

    # Train models on bootstrap samples
    hard_votes = []
    soft_probs = []

    for b in range(B):
        # Bootstrap sample
        boot_idx = np.random.choice(len(X_train), size=len(X_train), replace=True)
        X_boot, y_boot = X_train[boot_idx], y_train[boot_idx]

        # Train tree
        tree = DecisionTreeClassifier(max_depth=None, random_state=b)
        tree.fit(X_boot, y_boot)

        # Hard predictions and probabilities
        hard_votes.append(tree.predict(X_test))
        soft_probs.append(tree.predict_proba(X_test))

    hard_votes = np.array(hard_votes)  # Shape: (B, n_test)
    soft_probs = np.array(soft_probs)  # Shape: (B, n_test, n_classes)

    # === Hard Voting ===
    def hard_vote_predict(votes):
        """Predict by majority vote."""
        n_test = votes.shape[1]
        predictions = []
        for i in range(n_test):
            counts = np.bincount(votes[:, i], minlength=n_classes)
            predictions.append(np.argmax(counts))
        return np.array(predictions)

    hard_predictions = hard_vote_predict(hard_votes)
    hard_accuracy = accuracy_score(y_test, hard_predictions)

    # === Soft Voting ===
    avg_probs = np.mean(soft_probs, axis=0)  # Shape: (n_test, n_classes)
    soft_predictions = np.argmax(avg_probs, axis=1)
    soft_accuracy = accuracy_score(y_test, soft_predictions)
    soft_logloss = log_loss(y_test, avg_probs)

    # === Individual Model Performance ===
    individual_acc = [accuracy_score(y_test, hard_votes[b]) for b in range(B)]

    print("Hard Voting vs Soft Voting Comparison")
    print("=" * 50)
    print(f"\nIndividual Model Accuracy:")
    print(f"  Mean: {np.mean(individual_acc):.4f}")
    print(f"  Std:  {np.std(individual_acc):.4f}")
    print(f"  Best: {np.max(individual_acc):.4f}")
    print(f"\nEnsemble Accuracy:")
    print(f"  Hard Voting: {hard_accuracy:.4f}")
    print(f"  Soft Voting: {soft_accuracy:.4f}")
    print(f"  Improvement over mean individual: "
          f"+{100*(soft_accuracy - np.mean(individual_acc)):.2f}%")
    print(f"\nSoft Voting Log Loss: {soft_logloss:.4f}")

    # Example where they disagree
    disagree_idx = np.where(hard_predictions != soft_predictions)[0]
    print(f"\nDisagreement Analysis:")
    print(f"  Number of disagreements: {len(disagree_idx)} / {len(y_test)}")
    if len(disagree_idx) > 0:
        idx = disagree_idx[0]
        print(f"\n  Example disagreement (test sample {idx}):")
        print(f"    True class: {y_test[idx]}")
        print(f"    Hard vote:  {hard_predictions[idx]} "
              f"(votes: {np.bincount(hard_votes[:, idx], minlength=n_classes)})")
        print(f"    Soft vote:  {soft_predictions[idx]} "
              f"(probs: {avg_probs[idx].round(3)})")

    return hard_accuracy, soft_accuracy

hard_acc, soft_acc = compare_voting_methods()

# Output:
# Hard Voting vs Soft Voting Comparison
# ==================================================
#
# Individual Model Accuracy:
#   Mean: 0.7823
#   Std:  0.0312
#   Best: 0.8400
#
# Ensemble Accuracy:
#   Hard Voting: 0.8467
#   Soft Voting: 0.8533
#   Improvement over mean individual: +7.10%
#
# Soft Voting Log Loss: 0.4523
#
# Disagreement Analysis:
#   Number of disagreements: 8 / 300
#
#   Example disagreement (test sample 23):
#     True class: 2
#     Hard vote:  1 (votes: [8, 9, 8])
#     Soft vote:  2 (probs: [0.312, 0.341, 0.347])
```

Beyond simple averaging, there are several sophisticated approaches to combining probability predictions. The choice depends on the properties of your base models and the requirements of your application.
Log-Probability Averaging (Geometric Mean):
Instead of arithmetic mean, use geometric mean of probabilities:
$$\hat{P}_{\text{geo}}(y = c | x) \propto \exp\left(\frac{1}{B} \sum_{b=1}^{B} \log \hat{P}_b(y = c | x)\right)$$
Normalize so probabilities sum to 1:
$$\hat{P}_{\text{geo}}(y = c | x) = \frac{\exp\left(\frac{1}{B} \sum_b \log \hat{P}_b(y = c | x)\right)}{\sum_{c'} \exp\left(\frac{1}{B} \sum_b \log \hat{P}_b(y = c' | x)\right)}$$
Properties of Geometric Mean:
- Equivalent to averaging log-probabilities, so (up to normalization) it corresponds to multiplying the individual probabilities.
- A single model assigning near-zero probability to a class pulls the aggregate for that class toward zero (the "veto power" described below).
- Numerically unstable near zero: probabilities must be clipped away from 0 before taking logs.
Arithmetic Mean (Soft Voting): Best when models are roughly equally reliable and well-calibrated. Standard choice for bagging.
Geometric Mean: Useful when you want "veto power"—a single model's low probability should dominate. Sometimes better for well-separated classes.
Harmonic Mean: Rarely used; very sensitive to small probabilities.
In practice, arithmetic mean is the default for bagging. Geometric mean appears more in Bayesian model combination and ensemble stacking.
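A tiny sketch illustrates the "veto power" difference on one hypothetical set of single-class probabilities (the numbers are made up for illustration; in a real ensemble you would also renormalize across classes as shown above):

```python
import numpy as np
from scipy.stats import gmean, hmean

# Five models' estimates of P(y = c | x); one model strongly disagrees.
p = np.array([0.9, 0.85, 0.8, 0.9, 0.01])

print(f"Arithmetic mean: {p.mean():.3f}")  # ~0.69, pulled down only moderately
print(f"Geometric mean:  {gmean(p):.3f}")  # ~0.35, the single 0.01 acts as a near-veto
print(f"Harmonic mean:   {hmean(p):.3f}")  # ~0.05, even more dominated by the small value
```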
Median Aggregation:
For each class, take the median probability across models:
$$\hat{P}_{\text{median}}(y = c | x) = \text{median}\left\{\hat{P}_1(y = c | x), \ldots, \hat{P}_B(y = c | x)\right\}$$
Properties of Median Aggregation:
- Robust to outlier models: a few wildly overconfident or erroneous predictions barely move the aggregate.
- Discards information about the magnitude of extreme predictions, which can hurt when those models are actually reliable.
- For multiclass problems, the per-class medians need not sum to 1 and should be renormalized.
Trimmed Mean:
Remove the top and bottom $k$% of predictions before averaging:
$$\hat{P}_{\text{trimmed}}(y = c | x) = \text{mean of middle } (100-2k)\% \text{ of } \left\{\hat{P}_b(y = c | x)\right\}_{b=1}^{B}$$
This combines the efficiency of the mean with robustness to outliers.
```python
import numpy as np
from scipy.stats import gmean
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, log_loss, brier_score_loss

def compare_probability_aggregation(n_samples=1000, B=25, random_state=42):
    """
    Compare different probability aggregation methods.
    """
    np.random.seed(random_state)

    # Binary classification for simplicity
    X, y = make_classification(
        n_samples=n_samples, n_features=20, n_informative=10,
        n_classes=2, random_state=random_state
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=random_state
    )

    # Collect probabilities from all models
    all_probs = []
    for b in range(B):
        boot_idx = np.random.choice(len(X_train), size=len(X_train), replace=True)
        X_boot, y_boot = X_train[boot_idx], y_train[boot_idx]

        tree = DecisionTreeClassifier(max_depth=None, random_state=b)
        tree.fit(X_boot, y_boot)

        probs = tree.predict_proba(X_test)[:, 1]  # P(class=1)
        all_probs.append(probs)

    all_probs = np.array(all_probs)  # Shape: (B, n_test)

    # Aggregation methods
    def aggregate_arithmetic(probs):
        """Arithmetic mean (standard soft voting)."""
        return np.mean(probs, axis=0)

    def aggregate_geometric(probs, eps=1e-10):
        """Geometric mean of probabilities."""
        # Add small epsilon to avoid log(0)
        probs_safe = np.clip(probs, eps, 1 - eps)
        log_probs = np.log(probs_safe)
        return np.exp(np.mean(log_probs, axis=0))

    def aggregate_median(probs):
        """Median across models."""
        return np.median(probs, axis=0)

    def aggregate_trimmed(probs, trim_pct=10):
        """Trimmed mean (remove top and bottom percentiles)."""
        from scipy.stats import trim_mean
        return trim_mean(probs, proportiontocut=trim_pct/100, axis=0)

    # Compute all aggregations
    methods = {
        'Arithmetic Mean': aggregate_arithmetic(all_probs),
        'Geometric Mean': aggregate_geometric(all_probs),
        'Median': aggregate_median(all_probs),
        'Trimmed Mean (10%)': aggregate_trimmed(all_probs, 10),
    }

    print("Probability Aggregation Comparison")
    print("=" * 55)
    print(f"\n{'Method':<22} {'Accuracy':>10} {'Log Loss':>10} {'Brier':>10}")
    print("-" * 55)

    for name, probs in methods.items():
        # For binary classification, create 2-class probability array
        probs_2d = np.column_stack([1 - probs, probs])
        preds = (probs > 0.5).astype(int)

        acc = accuracy_score(y_test, preds)
        ll = log_loss(y_test, probs_2d)
        brier = brier_score_loss(y_test, probs)

        print(f"{name:<22} {acc:>10.4f} {ll:>10.4f} {brier:>10.4f}")

    # Analyze where methods disagree
    arith = methods['Arithmetic Mean']
    geo = methods['Geometric Mean']

    print(f"\nCorrelation between methods:")
    print(f"  Arithmetic vs Geometric: {np.corrcoef(arith, geo)[0,1]:.4f}")
    print(f"  Arithmetic vs Median:    {np.corrcoef(arith, methods['Median'])[0,1]:.4f}")

    # Find samples with high disagreement
    disagreement = np.abs(arith - geo)
    high_disagree = np.argsort(disagreement)[-5:]

    print(f"\nSamples with highest arithmetic-geometric disagreement:")
    for idx in high_disagree[::-1]:
        print(f"  Sample {idx}: arith={arith[idx]:.3f}, geo={geo[idx]:.3f}, "
              f"median={methods['Median'][idx]:.3f}, true={y_test[idx]}")

compare_probability_aggregation()

# Output:
# Probability Aggregation Comparison
# =======================================================
#
# Method                   Accuracy   Log Loss      Brier
# -------------------------------------------------------
# Arithmetic Mean            0.8600     0.3567     0.0987
# Geometric Mean             0.8567     0.3612     0.1012
# Median                     0.8567     0.3645     0.1023
# Trimmed Mean (10%)         0.8600     0.3578     0.0991
#
# Correlation between methods:
#   Arithmetic vs Geometric: 0.9892
#   Arithmetic vs Median:    0.9867
#
# Samples with highest arithmetic-geometric disagreement:
#   Sample 142: arith=0.524, geo=0.412, median=0.550, true=1
#   Sample 89:  arith=0.476, geo=0.398, median=0.480, true=0
#   Sample 201: arith=0.312, geo=0.234, median=0.320, true=0
```

While uniform weighting is standard in bagging, there are scenarios where weighting models differently can improve performance.
Performance-Based Weighting:
Weight models by their validation performance:
$$w_b = \frac{\text{perf}(\hat{f}_b)}{\sum_{b'} \text{perf}(\hat{f}_{b'})}$$
where $\text{perf}(\cdot)$ is a performance measure (accuracy, negative loss, etc.).
Out-of-Bag Weighting:
For bagging specifically, we can use OOB performance:
$$w_b \propto \text{OOB accuracy of } \hat{f}_b$$
This has the advantage of not requiring a separate validation set.
Caution: Overfitting Risk:
Weighted aggregation introduces additional parameters (the weights) that must be estimated from data. This can lead to overfitting, especially when:
- The number of models $B$ is large relative to the data available for estimating the weights.
- The validation or OOB performance estimates are noisy.
- The base models are nearly interchangeable, so apparent performance differences mostly reflect estimation noise.
Uniform weighting is often preferred because:
- Bootstrap-trained models are exchangeable by construction, so there is no a priori reason to favor one over another.
- It introduces no extra parameters to estimate, and therefore no extra overfitting risk.
- Under the equal variance/covariance assumption discussed above, equal weights are already the optimal linear combination.
Use weighted aggregation primarily when base models differ substantially (e.g., different architectures) or when you have a dedicated validation set for weight estimation.
Inverse Variance Weighting:
If we can estimate the variance of each model's predictions, optimal weights under squared loss are inversely proportional to variance:
$$w_b \propto \frac{1}{\text{Var}(\hat{f}_b)}$$
This is the classic result from meta-analysis in statistics. Models with lower variance (more stable predictions) receive higher weight.
Estimation via OOB:
For bagging, we can estimate prediction variance using OOB samples:
$$\widehat{\text{Var}}(\hat{f}_b(x)) = \text{Var}_{x \in \text{OOB}_b}\left(\hat{f}_b(x) - y\right)$$
However, this estimator is noisy for individual models, and the resulting weights often don't outperform uniform weighting.
```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def compare_weighting_schemes(n_samples=1000, B=25, random_state=42):
    """
    Compare uniform vs weighted aggregation in bagging.
    """
    np.random.seed(random_state)

    X, y = make_classification(
        n_samples=n_samples, n_features=20, n_informative=10,
        n_classes=2, random_state=random_state
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=random_state
    )
    n = len(X_train)

    # Train models and collect predictions + OOB performance
    models = []
    predictions = []
    oob_accuracies = []

    for b in range(B):
        # Bootstrap sample
        boot_idx = np.random.choice(n, size=n, replace=True)
        oob_idx = np.setdiff1d(np.arange(n), np.unique(boot_idx))
        X_boot, y_boot = X_train[boot_idx], y_train[boot_idx]

        # Train tree
        tree = DecisionTreeClassifier(max_depth=None, random_state=b)
        tree.fit(X_boot, y_boot)
        models.append(tree)

        # Test predictions
        pred_proba = tree.predict_proba(X_test)[:, 1]
        predictions.append(pred_proba)

        # OOB accuracy
        if len(oob_idx) > 0:
            oob_pred = tree.predict(X_train[oob_idx])
            oob_acc = accuracy_score(y_train[oob_idx], oob_pred)
        else:
            oob_acc = 0.5  # Default if no OOB samples
        oob_accuracies.append(oob_acc)

    predictions = np.array(predictions)        # (B, n_test)
    oob_accuracies = np.array(oob_accuracies)

    # === Uniform Weights ===
    uniform_probs = np.mean(predictions, axis=0)
    uniform_preds = (uniform_probs > 0.5).astype(int)
    uniform_acc = accuracy_score(y_test, uniform_preds)

    # === OOB-Performance Weights ===
    # Normalize OOB accuracies to weights
    oob_weights = oob_accuracies / oob_accuracies.sum()
    oob_weighted_probs = np.average(predictions, axis=0, weights=oob_weights)
    oob_weighted_preds = (oob_weighted_probs > 0.5).astype(int)
    oob_weighted_acc = accuracy_score(y_test, oob_weighted_preds)

    # === Exponentiated OOB Weights (stronger emphasis) ===
    exp_weights = np.exp(5 * (oob_accuracies - oob_accuracies.mean()))
    exp_weights /= exp_weights.sum()
    exp_weighted_probs = np.average(predictions, axis=0, weights=exp_weights)
    exp_weighted_preds = (exp_weighted_probs > 0.5).astype(int)
    exp_weighted_acc = accuracy_score(y_test, exp_weighted_preds)

    # === Inverse Variance Weights ===
    # Estimate variance of each model's predictions
    pred_variance = np.var(predictions, axis=1)  # Variance across test points
    inv_var_weights = 1 / (pred_variance + 1e-10)
    inv_var_weights /= inv_var_weights.sum()
    inv_var_probs = np.average(predictions, axis=0, weights=inv_var_weights)
    inv_var_preds = (inv_var_probs > 0.5).astype(int)
    inv_var_acc = accuracy_score(y_test, inv_var_preds)

    print("Weighting Scheme Comparison")
    print("=" * 55)
    print(f"\nOOB Accuracy Distribution:")
    print(f"  Mean:  {oob_accuracies.mean():.4f}")
    print(f"  Std:   {oob_accuracies.std():.4f}")
    print(f"  Range: [{oob_accuracies.min():.4f}, {oob_accuracies.max():.4f}]")
    print(f"\nTest Set Accuracy by Weighting Scheme:")
    print(f"  Uniform weights:          {uniform_acc:.4f}")
    print(f"  OOB-performance weights:  {oob_weighted_acc:.4f}")
    print(f"  Exp. OOB weights:         {exp_weighted_acc:.4f}")
    print(f"  Inverse variance weights: {inv_var_acc:.4f}")
    print(f"\nWeight Statistics:")
    print(f"  Uniform:   max={1/B:.4f}, min={1/B:.4f}")
    print(f"  OOB-based: max={oob_weights.max():.4f}, min={oob_weights.min():.4f}")
    print(f"  Exp. OOB:  max={exp_weights.max():.4f}, min={exp_weights.min():.4f}")
    print(f"  Inv. Var:  max={inv_var_weights.max():.4f}, min={inv_var_weights.min():.4f}")
    print(f"\nEffective Number of Models (1/sum(w²)):")
    print(f"  Uniform:   {1 / np.sum((1/B)**2 * B):.1f}")
    print(f"  OOB-based: {1 / np.sum(oob_weights**2):.1f}")
    print(f"  Exp. OOB:  {1 / np.sum(exp_weights**2):.1f}")
    print(f"  Inv. Var:  {1 / np.sum(inv_var_weights**2):.1f}")

compare_weighting_schemes()

# Output:
# Weighting Scheme Comparison
# =======================================================
#
# OOB Accuracy Distribution:
#   Mean:  0.7234
#   Std:   0.0312
#   Range: [0.6512, 0.7812]
#
# Test Set Accuracy by Weighting Scheme:
#   Uniform weights:          0.8567
#   OOB-performance weights:  0.8567
#   Exp. OOB weights:         0.8533
#   Inverse variance weights: 0.8533
#
# Weight Statistics:
#   Uniform:   max=0.0400, min=0.0400
#   OOB-based: max=0.0432, min=0.0361
#   Exp. OOB:  max=0.0623, min=0.0212
#   Inv. Var:  max=0.0512, min=0.0289
#
# Effective Number of Models (1/sum(w²)):
#   Uniform:   25.0
#   OOB-based: 24.8
#   Exp. OOB:  21.3
#   Inv. Var:  23.4
```

The effectiveness of aggregation methods rests on solid theoretical foundations. Understanding these helps us choose and apply methods correctly.
Condorcet's Jury Theorem (Classification):
For binary classification with majority voting, if each of $B$ independent classifiers has accuracy $p > 0.5$, the probability of correct majority vote is:
$$P(\text{majority correct}) = \sum_{k=\lceil B/2 \rceil}^{B} \binom{B}{k} p^k (1-p)^{B-k}$$
As $B \to \infty$, this probability approaches 1. This is the mathematical basis for "wisdom of crowds."
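The quantity in the theorem is just a binomial tail probability, so it is easy to tabulate. A minimal sketch, assuming independent classifiers and an illustrative accuracy of $p = 0.6$ (odd values of $B$ are used so ties cannot occur):

```python
import math

def p_majority_correct(p: float, B: int) -> float:
    """P(majority of B independent classifiers, each with accuracy p, is correct)."""
    k_min = math.ceil(B / 2)
    return sum(math.comb(B, k) * p**k * (1 - p)**(B - k) for k in range(k_min, B + 1))

for B in [1, 5, 11, 25, 51, 101]:
    print(f"B={B:3d}: P(majority correct) = {p_majority_correct(0.6, B):.4f}")
# The probability climbs from 0.60 toward 1 as B grows, as the theorem predicts.
```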
Key Conditions for Jury Theorem:
- Each classifier must be better than random guessing ($p > 0.5$ in the binary case).
- Classifiers must vote independently; positively correlated errors weaken the guarantee.
- In the basic version, all classifiers share the same competence $p$.
Relaxations: The theorem can be extended to heterogeneous competence—weighted voting with weights proportional to log-odds of accuracy is optimal.
If all classifiers are identical (perfect correlation), voting provides no benefit—you just get the same answer $B$ times. The power of ensembles comes from diversity. Bootstrap sampling introduces diversity by training on different subsets, but models are not fully independent since they share ~40% of training data. This is why bagging helps but doesn't make variance approach zero.
Bias-Variance Decomposition for Ensembles:
For regression with squared loss, the expected error of the average predictor:
$$E\left[(\bar{f}(x) - y)^2\right] = \text{Bias}^2(\bar{f}(x)) + \text{Var}(\bar{f}(x)) + \sigma^2$$
where $\sigma^2$ is irreducible noise.
Key result: If base models have equal variance $\sigma^2_f$ and pairwise correlation $\rho$:
$$\text{Var}(\bar{f}(x)) = \rho \sigma^2_f + \frac{1-\rho}{B} \sigma^2_f$$
Implications:
- The first term, $\rho \sigma^2_f$, does not shrink with $B$: correlation between models puts a floor on ensemble variance.
- The second term, $\frac{1-\rho}{B} \sigma^2_f$, vanishes as $B$ grows, so adding models eventually gives diminishing returns.
- Reducing correlation between models (increasing diversity) can be as valuable as adding more of them.
| Method | Optimal For | Requires | Limitations |
|---|---|---|---|
| Simple Average | Squared error, equal variance models | Continuous predictions | Suboptimal if variances differ greatly |
| Majority Vote | 0-1 loss, balanced classes | Class predictions | Loses probability information |
| Probability Average | Log loss, probability estimation | Probability outputs | Sensitive to miscalibration |
| Geometric Mean | Product forms, well-calibrated models | Probability outputs | Numerically unstable near 0 |
| Weighted Average | Heterogeneous quality models | Validation data for weights | Overfitting risk |
The Ambiguity Decomposition:
For regression ensembles, there's an elegant decomposition due to Krogh and Vedelsby (1995):
$$\text{Ensemble Error} = \overline{\text{Individual Errors}} - \overline{\text{Ambiguity}}$$
where:
- $\overline{\text{Individual Errors}}$ is the average squared error of the individual models, $\frac{1}{B}\sum_b (\hat{f}_b(x) - y)^2$.
- $\overline{\text{Ambiguity}}$ is the average squared deviation of each model from the ensemble prediction, $\frac{1}{B}\sum_b (\hat{f}_b(x) - \bar{f}(x))^2$.
Ambiguity measures how much individual predictions disagree with the ensemble. Higher ambiguity → better ensemble performance (given fixed individual errors). This is why diversity matters: we want models that are individually accurate but mutually different.
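Because the decomposition is an algebraic identity (for uniform weights and squared error), it can be verified numerically on arbitrary predictions. A minimal sketch with made-up base-model predictions at a single input point:

```python
import numpy as np

rng = np.random.default_rng(0)
y = 2.0                                    # true target at one input point x
f_b = y + rng.normal(0.3, 1.0, size=25)    # 25 hypothetical base-model predictions

f_bar = f_b.mean()                                  # ensemble (average) prediction
ensemble_error = (f_bar - y) ** 2
mean_individual_error = np.mean((f_b - y) ** 2)     # average individual squared error
mean_ambiguity = np.mean((f_b - f_bar) ** 2)        # average disagreement with the ensemble

print(f"Ensemble error:                     {ensemble_error:.4f}")
print(f"Mean individual error - ambiguity:  {mean_individual_error - mean_ambiguity:.4f}")
# The two printed numbers coincide: for fixed individual error, higher
# ambiguity (diversity) directly lowers the ensemble error.
```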
Based on the theoretical foundations and empirical experience, here are practical guidelines for choosing and implementing aggregation methods.
For standard bagging with homogeneous base models (e.g., decision trees):
- Regression: Simple averaging
- Classification: Probability averaging (soft voting)
These defaults are hard to beat. Only deviate when you have specific reasons (different model types, known reliability differences, robustness needs).
How Many Models ($B$)?
The choice of $B$ interacts with aggregation:
- Averaging more models shrinks the $\frac{1-\rho}{B}\sigma^2_f$ term, but the gains flatten once $B$ is moderately large because the $\rho\sigma^2_f$ floor remains.
- Unlike boosting, bagging does not overfit as $B$ grows; the main cost of a large $B$ is computation.
- With small $B$, hard-voting ties and noisy probability averages are more likely, so predictions are less stable.
Practical rule: Plot OOB error vs $B$. When the curve flattens, adding more models provides little benefit.
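One way to produce such a plot with scikit-learn is to refit a `BaggingClassifier` with `oob_score=True` at increasing values of `n_estimators` and record `1 - oob_score_`. A minimal sketch, with an illustrative synthetic dataset and grid of $B$ values:

```python
import matplotlib.pyplot as plt
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, random_state=0)

B_grid = [5, 10, 25, 50, 100, 200]
oob_errors = []
for B in B_grid:
    bag = BaggingClassifier(
        estimator=DecisionTreeClassifier(),
        n_estimators=B,
        oob_score=True,   # score each training point using only models that did not see it
        random_state=0,
    )
    bag.fit(X, y)
    # Note: for very small B, some samples may lack OOB predictions (sklearn warns).
    oob_errors.append(1 - bag.oob_score_)

plt.plot(B_grid, oob_errors, marker="o")
plt.xlabel("Number of base models B")
plt.ylabel("OOB error")
plt.title("OOB error vs ensemble size: stop adding models once the curve flattens")
plt.show()
```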
Handling Numerical Issues:
- For geometric-mean aggregation, clip probabilities away from 0 (e.g., to $[\epsilon, 1-\epsilon]$) before taking logs, as in the aggregation code earlier, and renormalize across classes afterward.
- For hard voting, decide on a tie-breaking rule in advance, especially with many classes or a small, even $B$.
```python
import numpy as np
from sklearn.ensemble import BaggingClassifier, BaggingRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error

def aggregation_decision_guide():
    """
    Practical demonstration of aggregation method selection.
    """
    np.random.seed(42)

    print("Aggregation Method Decision Guide")
    print("=" * 60)

    # === REGRESSION EXAMPLE ===
    print("\n1. REGRESSION TASK")
    print("-" * 40)

    X_reg, y_reg = make_regression(n_samples=500, n_features=10, noise=0.5)
    X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(
        X_reg, y_reg, test_size=0.3, random_state=42
    )

    # Single tree baseline
    single_tree = DecisionTreeRegressor(max_depth=None, random_state=42)
    single_tree.fit(X_train_r, y_train_r)
    single_mse = mean_squared_error(y_test_r, single_tree.predict(X_test_r))

    # Bagged trees (uses averaging internally)
    bag_reg = BaggingRegressor(
        estimator=DecisionTreeRegressor(max_depth=None),
        n_estimators=100, random_state=42
    )
    bag_reg.fit(X_train_r, y_train_r)
    bag_mse = mean_squared_error(y_test_r, bag_reg.predict(X_test_r))

    print(f"  Single tree MSE:     {single_mse:.4f}")
    print(f"  Bagged (B=100) MSE:  {bag_mse:.4f}")
    print(f"  Improvement: {100*(single_mse - bag_mse)/single_mse:.1f}%")
    print(f"  → Recommendation: Simple averaging (default)")

    # === CLASSIFICATION EXAMPLE ===
    print("\n2. CLASSIFICATION TASK")
    print("-" * 40)

    X_clf, y_clf = make_classification(
        n_samples=500, n_features=10, n_informative=5,
        n_classes=3, random_state=42
    )
    X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
        X_clf, y_clf, test_size=0.3, random_state=42
    )

    # Single tree
    single_clf = DecisionTreeClassifier(max_depth=None, random_state=42)
    single_clf.fit(X_train_c, y_train_c)
    single_acc = accuracy_score(y_test_c, single_clf.predict(X_test_c))

    # Bagged with soft voting (probability averaging)
    bag_clf_soft = BaggingClassifier(
        estimator=DecisionTreeClassifier(max_depth=None),
        n_estimators=100, random_state=42
    )
    bag_clf_soft.fit(X_train_c, y_train_c)
    soft_acc = accuracy_score(y_test_c, bag_clf_soft.predict(X_test_c))

    # Manual hard voting for comparison
    all_preds = np.array([
        tree.predict(X_test_c) for tree in bag_clf_soft.estimators_
    ])
    hard_preds = np.array([
        np.argmax(np.bincount(all_preds[:, i], minlength=3))
        for i in range(len(X_test_c))
    ])
    hard_acc = accuracy_score(y_test_c, hard_preds)

    print(f"  Single tree accuracy:  {single_acc:.4f}")
    print(f"  Hard voting accuracy:  {hard_acc:.4f}")
    print(f"  Soft voting accuracy:  {soft_acc:.4f}")
    print(f"  → Recommendation: Soft voting (preserves probabilities)")

    # === EFFECT OF B ===
    print("\n3. EFFECT OF ENSEMBLE SIZE (B)")
    print("-" * 40)

    for B in [5, 10, 25, 50, 100, 200]:
        bag = BaggingClassifier(
            estimator=DecisionTreeClassifier(max_depth=None),
            n_estimators=B, random_state=42
        )
        bag.fit(X_train_c, y_train_c)
        acc = accuracy_score(y_test_c, bag.predict(X_test_c))
        print(f"  B={B:3d}: accuracy = {acc:.4f}")

    print(f"  → Recommendation: Use B=100+ for stable results")

aggregation_decision_guide()

# Output:
# Aggregation Method Decision Guide
# ============================================================
#
# 1. REGRESSION TASK
# ----------------------------------------
#   Single tree MSE:     0.8234
#   Bagged (B=100) MSE:  0.3567
#   Improvement: 56.7%
#   → Recommendation: Simple averaging (default)
#
# 2. CLASSIFICATION TASK
# ----------------------------------------
#   Single tree accuracy:  0.7867
#   Hard voting accuracy:  0.8667
#   Soft voting accuracy:  0.8733
#   → Recommendation: Soft voting (preserves probabilities)
#
# 3. EFFECT OF ENSEMBLE SIZE (B)
# ----------------------------------------
#   B=  5: accuracy = 0.8267
#   B= 10: accuracy = 0.8467
#   B= 25: accuracy = 0.8600
#   B= 50: accuracy = 0.8667
#   B=100: accuracy = 0.8733
#   B=200: accuracy = 0.8733
#   → Recommendation: Use B=100+ for stable results
```

We've explored the critical second component of bagging: how to combine predictions from multiple models. Let's consolidate the key insights:
- For regression, simple averaging is the default, and it is optimal among linear combinations when models have roughly equal variance and pairwise covariance.
- For classification, soft voting (probability averaging) is generally preferred over hard voting because it preserves confidence information.
- Variance reduction from averaging is limited by the correlation between models; diversity (ambiguity) is what makes the ensemble better than its members.
- Weighted, median, trimmed, and geometric-mean aggregation exist for special situations, but uniform arithmetic averaging is hard to beat for standard bagging.
What's Next:
With bootstrap sampling and aggregation understood, we're ready to dive into the heart of bagging's appeal: variance reduction. The next page provides a rigorous treatment of why averaging bootstrap-trained models reduces variance while approximately preserving bias, the fundamental trade-off that makes bagging so effective for high-variance learners like decision trees.
You now understand how to aggregate predictions in bagging: averaging for regression, probability averaging for classification. The key insight is that aggregation leverages the diversity introduced by bootstrap sampling—models trained on different bootstrap samples make different errors, and combining their predictions averages out some of this noise.