An ensemble is only as good as its aggregation strategy. Even a collection of excellent models can underperform if their predictions are combined naively. Conversely, sophisticated aggregation can extract surprising performance from mediocre individual models.
Model aggregation spans a rich spectrum of techniques—from simple voting to learned meta-models. Understanding this spectrum is essential for building ensembles that realize their full potential. This page provides a comprehensive treatment of aggregation strategies, their mathematical foundations, and practical guidance for selection.
By the end of this page, you will understand: (1) Voting schemes for classification (hard, soft, weighted), (2) Aggregation for regression (mean, median, weighted), (3) Learned aggregation through stacking, (4) Probability calibration in ensembles, (5) Theoretical guarantees for different aggregation methods, and (6) Practical selection criteria.
For classification tasks, the ensemble must convert multiple class predictions or probability distributions into a single decision. The choice of voting scheme affects both accuracy and calibration.
1. Hard Voting (Majority Vote)
Each model casts a vote for its predicted class. The ensemble predicts the class with the most votes.
$$\hat{y}_{ens} = \text{argmax}_c \sum_{b=1}^{B} \mathbb{1}[\hat{y}_b = c]$$
Properties:
- Requires only class labels, so it works with any classifier.
- Ignores each model's confidence: a 51% prediction counts the same as a 99% prediction.
- Ties are possible with an even number of models and need a tie-breaking rule.
2. Soft Voting (Probability Averaging)
Average the predicted probabilities across models, then select the class with highest average probability.
$$P_{ens}(y=c|x) = \frac{1}{B}\sum_{b=1}^{B} P_b(y=c|x)$$ $$\hat{y}_{ens} = \text{argmax}_c \, P_{ens}(y=c|x)$$
Properties:
- Uses the full probability distribution, so confident models influence the decision more.
- Generally preferred over hard voting when base models produce reasonable probability estimates.
- Requires every model to expose probabilities (e.g., a predict_proba method).
3. Weighted Voting
Assign weights $w_b$ to each model based on estimated quality.
$$P_{ens}(y=c|x) = \sum_{b=1}^{B} w_b \cdot P_b(y=c|x), \quad \sum_b w_b = 1$$
Weight Selection Strategies:
- Proportional to validation accuracy.
- Inversely proportional to validation error rate or log loss.
- Directly optimized on a held-out set (see compute_optimal_weights in the code below).
| Scheme | Uses Confidence | Weights Models | Best When | Calibrated Output |
|---|---|---|---|---|
| Hard Voting | No | Equal | Models are equally good, no probabilities available | No (deterministic) |
| Soft Voting | Yes | Equal | Models output well-calibrated probabilities | Yes (if inputs calibrated) |
| Weighted Hard | No | By quality | Model quality varies significantly | No |
| Weighted Soft | Yes | By quality | Models vary in quality AND calibration matters | Depends on calibration |
```python
import numpy as np
from collections import Counter
from typing import List, Optional, Literal


class VotingClassifier:
    """
    Comprehensive implementation of classification voting schemes.
    Supports hard voting, soft voting, and various weighting strategies.
    """

    def __init__(
        self,
        voting: Literal['hard', 'soft'] = 'soft',
        weights: Optional[List[float]] = None
    ):
        """
        Parameters:
        -----------
        voting : str
            'hard' for majority vote, 'soft' for probability averaging
        weights : list or None
            Model weights (length B). None = equal weights.
        """
        self.voting = voting
        self.weights = weights
        self.models = []

    def fit(self, models: List):
        """Store fitted models."""
        self.models = models
        if self.weights is None:
            self.weights = [1.0 / len(models)] * len(models)
        else:
            # Normalize weights
            total = sum(self.weights)
            self.weights = [w / total for w in self.weights]
        return self

    def predict(self, X):
        """Generate ensemble predictions."""
        if self.voting == 'hard':
            return self._hard_vote(X)
        else:
            probs = self.predict_proba(X)
            return np.argmax(probs, axis=1)

    def _hard_vote(self, X):
        """Weighted majority vote."""
        n_samples = X.shape[0]

        # Collect all predictions with weights
        weighted_votes = []
        for i, (model, weight) in enumerate(zip(self.models, self.weights)):
            preds = model.predict(X)
            weighted_votes.append((preds, weight))

        # Aggregate votes
        predictions = []
        for i in range(n_samples):
            vote_counts = {}
            for preds, weight in weighted_votes:
                vote = preds[i]
                vote_counts[vote] = vote_counts.get(vote, 0) + weight
            # Select class with highest weighted vote
            winner = max(vote_counts.items(), key=lambda x: x[1])[0]
            predictions.append(winner)

        return np.array(predictions)

    def predict_proba(self, X):
        """Weighted probability averaging (soft voting)."""
        # Get probabilities from each model
        all_probs = []
        for model in self.models:
            probs = model.predict_proba(X)
            all_probs.append(probs)

        # Stack: (n_models, n_samples, n_classes)
        stacked = np.stack(all_probs, axis=0)

        # Weighted average
        weights = np.array(self.weights).reshape(-1, 1, 1)
        weighted_probs = np.sum(stacked * weights, axis=0)

        return weighted_probs

    def predict_log_proba(self, X):
        """Weighted log-probability averaging."""
        all_log_probs = []
        for model in self.models:
            log_probs = np.log(model.predict_proba(X) + 1e-10)
            all_log_probs.append(log_probs)

        stacked = np.stack(all_log_probs, axis=0)
        weights = np.array(self.weights).reshape(-1, 1, 1)
        weighted_log_probs = np.sum(stacked * weights, axis=0)

        return weighted_log_probs


def compute_optimal_weights(
    models: List,
    X_val,
    y_val,
    method: str = 'accuracy'
) -> List[float]:
    """
    Compute optimal model weights from validation data.

    Methods:
    --------
    'accuracy'      : Weight proportional to accuracy
    'inverse_error' : Weight inversely proportional to error rate
    'log_loss'      : Weight inversely proportional to log loss
    'optimize'      : Directly optimize ensemble performance
    """
    n_models = len(models)

    if method == 'accuracy':
        accuracies = []
        for model in models:
            preds = model.predict(X_val)
            acc = (preds == y_val).mean()
            accuracies.append(acc)
        # Normalize
        total = sum(accuracies)
        return [a / total for a in accuracies]

    elif method == 'inverse_error':
        errors = []
        for model in models:
            preds = model.predict(X_val)
            err = (preds != y_val).mean() + 1e-6  # Avoid division by zero
            errors.append(1 / err)
        total = sum(errors)
        return [e / total for e in errors]

    elif method == 'log_loss':
        from sklearn.metrics import log_loss
        inv_losses = []
        for model in models:
            probs = model.predict_proba(X_val)
            loss = log_loss(y_val, probs) + 1e-6
            inv_losses.append(1 / loss)
        total = sum(inv_losses)
        return [l / total for l in inv_losses]

    elif method == 'optimize':
        from scipy.optimize import minimize

        # Collect all probabilities
        all_probs = []
        for model in models:
            probs = model.predict_proba(X_val)
            all_probs.append(probs)
        all_probs = np.stack(all_probs, axis=0)

        def ensemble_log_loss(weights):
            # Normalize weights
            w = np.clip(weights, 0.01, None)
            w = w / w.sum()
            # Compute ensemble probabilities
            weighted_probs = np.sum(
                all_probs * w.reshape(-1, 1, 1), axis=0
            )
            # Cross-entropy loss
            log_probs = np.log(weighted_probs + 1e-10)
            correct_log_probs = log_probs[range(len(y_val)), y_val]
            return -correct_log_probs.mean()

        # Optimize
        init_weights = np.ones(n_models) / n_models
        result = minimize(
            ensemble_log_loss,
            init_weights,
            method='L-BFGS-B',
            bounds=[(0.01, None)] * n_models
        )
        weights = result.x / result.x.sum()
        return weights.tolist()

    else:
        return [1.0 / n_models] * n_models
```

For bagged ensembles where all models are trained the same way, equal weights work well—the diversity comes from bootstrap sampling, not model quality differences. Use weighted voting when combining models with different architectures or hyperparameters.
Regression ensembles aggregate continuous predictions. The choice of aggregation affects both accuracy and robustness to outliers.
1. Arithmetic Mean
The most common approach: average all predictions.
$$\hat{y}_{ens}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{y}_b(x)$$
Properties:
- Optimal under squared error when models are unbiased and exchangeable.
- Gives the largest variance reduction among the simple aggregators.
- Sensitive to a single wildly wrong prediction.
2. Median
Takes the middle value of sorted predictions.
$$\hat{y}_{ens}(x) = \text{median}\{\hat{y}_1(x), \ldots, \hat{y}_B(x)\}$$
Properties:
- Robust: a minority of extreme predictions cannot move it far.
- Optimal under absolute error.
- Typically reduces variance less than the mean when predictions are well behaved.
3. Trimmed Mean
Discard extreme predictions before averaging.
$$\hat{y}_{ens}(x) = \frac{1}{B-2k}\sum_{b=k+1}^{B-k} \hat{y}_{(b)}(x)$$
where $\hat{y}_{(b)}$ are sorted predictions and $k$ predictions are trimmed from each end.
Properties:
- Interpolates between the mean ($k=0$) and the median (maximal trimming).
- Tolerates up to $k$ extreme predictions on each side while still averaging the rest.
4. Weighted Mean
Weight predictions by model quality.
$$\hat{y}_{ens}(x) = \sum_{b=1}^{B} w_b \cdot \hat{y}_b(x), \quad \sum_b w_b = 1$$
Weight Selection:
- Inversely proportional to validation error (e.g., MSE) of each model.
- Solved directly from the error covariance matrix (see the optimal-weights result later on this page).
5. Geometric Mean
Multiplicative aggregation (for positive targets).
$$\hat{y}_{ens}(x) = \left(\prod_{b=1}^{B} \hat{y}_b(x)\right)^{1/B}$$
Properties:
- Defined only for positive predictions.
- Dampens the influence of unusually large predictions; by the AM-GM inequality it never exceeds the arithmetic mean.
- Natural when the target is multiplicative or modeled on a log scale.
| Method | Optimal Loss | Outlier Robustness | Variance Reduction | Best For |
|---|---|---|---|---|
| Arithmetic Mean | Squared error | Low | Highest | Standard regression |
| Median | Absolute error | High | Moderate | Contaminated predictions |
| Trimmed Mean | Huber-like | Moderate | High | Slight contamination |
| Weighted Mean | Squared error | Depends | Highest | Heterogeneous models |
| Geometric Mean | Squared error on log scale | Moderate to high | High | Multiplicative targets |
```python
import numpy as np
from scipy import stats
from typing import List, Literal, Optional


class RegressionAggregator:
    """
    Comprehensive regression aggregation strategies.
    """

    def __init__(
        self,
        method: Literal['mean', 'median', 'trimmed_mean', 'weighted', 'geometric'] = 'mean',
        weights: Optional[List[float]] = None,
        trim_fraction: float = 0.1
    ):
        """
        Parameters:
        -----------
        method : str
            Aggregation method
        weights : list or None
            Model weights for weighted aggregation
        trim_fraction : float
            Fraction to trim from each end (for trimmed_mean)
        """
        self.method = method
        self.weights = weights
        self.trim_fraction = trim_fraction

    def aggregate(self, predictions: np.ndarray) -> np.ndarray:
        """
        Aggregate predictions from multiple models.

        Parameters:
        -----------
        predictions : ndarray of shape (n_models, n_samples)
            Predictions from each model

        Returns:
        --------
        aggregated : ndarray of shape (n_samples,)
        """
        if self.method == 'mean':
            return self._arithmetic_mean(predictions)
        elif self.method == 'median':
            return self._median(predictions)
        elif self.method == 'trimmed_mean':
            return self._trimmed_mean(predictions)
        elif self.method == 'weighted':
            return self._weighted_mean(predictions)
        elif self.method == 'geometric':
            return self._geometric_mean(predictions)
        else:
            raise ValueError(f"Unknown method: {self.method}")

    def _arithmetic_mean(self, predictions):
        """Simple average."""
        return np.mean(predictions, axis=0)

    def _median(self, predictions):
        """Median of predictions."""
        return np.median(predictions, axis=0)

    def _trimmed_mean(self, predictions):
        """Trimmed mean: remove extreme values before averaging."""
        return stats.trim_mean(predictions, self.trim_fraction, axis=0)

    def _weighted_mean(self, predictions):
        """Weighted average."""
        if self.weights is None:
            return self._arithmetic_mean(predictions)
        weights = np.array(self.weights).reshape(-1, 1)
        weighted = predictions * weights
        return np.sum(weighted, axis=0)

    def _geometric_mean(self, predictions):
        """Geometric mean (for positive predictions)."""
        # Handle potential zeros/negatives
        safe_preds = np.maximum(predictions, 1e-10)
        log_preds = np.log(safe_preds)
        mean_log = np.mean(log_preds, axis=0)
        return np.exp(mean_log)

    def aggregate_with_uncertainty(self, predictions: np.ndarray):
        """
        Return aggregated prediction with uncertainty estimate.
        Returns mean prediction and standard deviation across models.
        """
        mean_pred = self.aggregate(predictions)
        std_pred = np.std(predictions, axis=0)

        # Confidence intervals (assuming normal distribution)
        ci_lower = mean_pred - 1.96 * std_pred
        ci_upper = mean_pred + 1.96 * std_pred

        return {
            'prediction': mean_pred,
            'std': std_pred,
            'ci_95_lower': ci_lower,
            'ci_95_upper': ci_upper,
        }


def select_aggregation_method(predictions, y_val, verbose=True):
    """
    Evaluate different aggregation methods and recommend the best one.

    Parameters:
    -----------
    predictions : ndarray of shape (n_models, n_samples)
        Model predictions on validation set
    y_val : ndarray of shape (n_samples,)
        True values

    Returns:
    --------
    best_method : str
    results : dict with all method performances
    """
    aggregator = RegressionAggregator()
    results = {}

    for method in ['mean', 'median', 'trimmed_mean', 'geometric']:
        aggregator.method = method
        agg_pred = aggregator.aggregate(predictions)

        # Compute metrics
        mse = np.mean((y_val - agg_pred) ** 2)
        mae = np.mean(np.abs(y_val - agg_pred))

        results[method] = {'mse': mse, 'mae': mae, 'rmse': np.sqrt(mse)}

        if verbose:
            print(f"{method:15} - MSE: {mse:.4f}, MAE: {mae:.4f}")

    # Select best by MSE
    best_method = min(results.keys(), key=lambda m: results[m]['mse'])

    if verbose:
        print(f"\nRecommended method: {best_method}")

    return best_method, results
```

Consider using median when: (1) Some models may produce extreme predictions due to overfitting, (2) The target distribution has heavy tails, (3) You suspect some bootstrap samples are unrepresentative, (4) MAE is your primary metric rather than MSE.
Stacking (or stacked generalization) replaces fixed aggregation rules with a learned meta-model. Instead of averaging, we train a model to optimally combine base model predictions.
The Stacking Architecture:
- Level 0 (base models): each model is trained on the original features and produces predictions or class probabilities.
- Level 1 (meta-model): a second model takes the base predictions as its input features and learns how to combine them into the final prediction.
Critical Point: The meta-model must be trained on out-of-fold predictions from the base models. If it is instead trained on predictions the base models made on their own training data, those predictions reflect memorization rather than generalization, and the meta-model learns to over-trust them.
Generating Out-of-Fold Predictions:
```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.base import clone
from typing import List, Any


class StackingEnsemble:
    """
    Stacking ensemble with proper out-of-fold prediction generation.
    Supports both classification and regression with various meta-learners.
    """

    def __init__(
        self,
        base_models: List[Any],
        meta_model: Any = None,
        n_folds: int = 5,
        use_probas: bool = True,
        passthrough: bool = False,
        task: str = 'classification'
    ):
        """
        Parameters:
        -----------
        base_models : list
            Base model instances (will be cloned)
        meta_model : estimator
            Meta-learner (default: LogisticRegression/Ridge)
        n_folds : int
            Number of folds for OOF prediction generation
        use_probas : bool
            Use probabilities instead of class predictions (classification)
        passthrough : bool
            Include original features in meta-model input
        task : str
            'classification' or 'regression'
        """
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_folds = n_folds
        self.use_probas = use_probas
        self.passthrough = passthrough
        self.task = task

        # Set default meta-model
        if self.meta_model is None:
            if task == 'classification':
                self.meta_model = LogisticRegression(max_iter=1000)
            else:
                self.meta_model = Ridge()

        # These are populated during fit
        self.fitted_base_models_ = []
        self.fitted_meta_model_ = None
        self.n_classes_ = None

    def fit(self, X, y):
        """
        Fit the stacking ensemble.

        1. Generate out-of-fold predictions from base models
        2. Train meta-model on OOF predictions
        3. Retrain base models on full data
        """
        n_samples = X.shape[0]
        n_base = len(self.base_models)

        # Determine output dimensions
        if self.task == 'classification':
            self.n_classes_ = len(np.unique(y))
            if self.use_probas:
                n_features_per_model = self.n_classes_
            else:
                n_features_per_model = 1
        else:
            n_features_per_model = 1

        # Initialize OOF prediction matrix
        oof_predictions = np.zeros((n_samples, n_base * n_features_per_model))

        # Generate out-of-fold predictions
        kf = KFold(n_splits=self.n_folds, shuffle=True, random_state=42)

        for fold_idx, (train_idx, val_idx) in enumerate(kf.split(X)):
            X_train, X_val = X[train_idx], X[val_idx]
            y_train = y[train_idx]

            for model_idx, base_model in enumerate(self.base_models):
                # Clone and train on fold
                model = clone(base_model)
                model.fit(X_train, y_train)

                # Generate predictions for validation fold
                if self.task == 'classification' and self.use_probas:
                    preds = model.predict_proba(X_val)
                    start_col = model_idx * n_features_per_model
                    end_col = start_col + n_features_per_model
                    oof_predictions[val_idx, start_col:end_col] = preds
                else:
                    preds = model.predict(X_val)
                    oof_predictions[val_idx, model_idx] = preds

        # Prepare meta-features
        if self.passthrough:
            meta_X = np.hstack([X, oof_predictions])
        else:
            meta_X = oof_predictions

        # Train meta-model
        self.fitted_meta_model_ = clone(self.meta_model)
        self.fitted_meta_model_.fit(meta_X, y)

        # Retrain base models on full data
        self.fitted_base_models_ = []
        for base_model in self.base_models:
            model = clone(base_model)
            model.fit(X, y)
            self.fitted_base_models_.append(model)

        return self

    def _generate_meta_features(self, X):
        """Generate meta-features from base model predictions."""
        n_samples = X.shape[0]
        n_base = len(self.fitted_base_models_)

        if self.task == 'classification' and self.use_probas:
            n_features_per_model = self.n_classes_
        else:
            n_features_per_model = 1

        meta_features = np.zeros((n_samples, n_base * n_features_per_model))

        for model_idx, model in enumerate(self.fitted_base_models_):
            if self.task == 'classification' and self.use_probas:
                preds = model.predict_proba(X)
                start_col = model_idx * n_features_per_model
                end_col = start_col + n_features_per_model
                meta_features[:, start_col:end_col] = preds
            else:
                meta_features[:, model_idx] = model.predict(X)

        if self.passthrough:
            return np.hstack([X, meta_features])
        return meta_features

    def predict(self, X):
        """Generate predictions."""
        meta_X = self._generate_meta_features(X)
        return self.fitted_meta_model_.predict(meta_X)

    def predict_proba(self, X):
        """Generate probability predictions (classification)."""
        if self.task != 'classification':
            raise ValueError("predict_proba only for classification")
        meta_X = self._generate_meta_features(X)
        return self.fitted_meta_model_.predict_proba(meta_X)


def compare_stacking_to_averaging(base_models, X_train, y_train, X_test, y_test):
    """
    Compare stacking performance to simple averaging.
    """
    from sklearn.metrics import accuracy_score, mean_squared_error

    # Determine task from target
    is_classification = len(np.unique(y_train)) < 20

    # Train individual models
    fitted_models = []
    for model in base_models:
        m = clone(model)
        m.fit(X_train, y_train)
        fitted_models.append(m)

    # Simple averaging
    if is_classification:
        all_probs = np.stack([m.predict_proba(X_test) for m in fitted_models])
        avg_probs = all_probs.mean(axis=0)
        avg_preds = avg_probs.argmax(axis=1)
        avg_score = accuracy_score(y_test, avg_preds)
    else:
        all_preds = np.stack([m.predict(X_test) for m in fitted_models])
        avg_preds = all_preds.mean(axis=0)
        avg_score = -mean_squared_error(y_test, avg_preds)  # Negative for "higher is better"

    # Stacking
    stacker = StackingEnsemble(
        base_models=base_models,
        task='classification' if is_classification else 'regression'
    )
    stacker.fit(X_train, y_train)
    stack_preds = stacker.predict(X_test)

    if is_classification:
        stack_score = accuracy_score(y_test, stack_preds)
    else:
        stack_score = -mean_squared_error(y_test, stack_preds)

    print(f"Simple Averaging Score: {avg_score:.4f}")
    print(f"Stacking Score: {stack_score:.4f}")
    print(f"Improvement: {stack_score - avg_score:.4f}")

    return {
        'averaging_score': avg_score,
        'stacking_score': stack_score,
        'improvement': stack_score - avg_score
    }
```

Never train the meta-model on the same predictions used to evaluate base models. Always use out-of-fold predictions or a separate validation set. Failing to do so leads to severe overfitting—the meta-model learns to trust overfit base predictions.
When ensemble predictions are used for decision-making, probability calibration becomes critical. A well-calibrated model should have its predicted probabilities match empirical frequencies—when it says 70% confidence, it should be correct about 70% of the time.
Ensemble Calibration Properties:
- Ensemble averaging often improves calibration — individual overconfident or underconfident models tend to average toward better calibration.
- But ensembles can still be miscalibrated — especially if all base models share a systematic bias.
- Post-hoc calibration may be needed — apply calibration techniques to the ensemble's outputs.
Measuring Calibration:
Expected Calibration Error (ECE):
Partition predictions into bins by confidence, then measure the gap between confidence and accuracy:
$$\text{ECE} = \sum_{m=1}^{M} \frac{n_m}{N} \left|\text{acc}(m) - \text{conf}(m)\right|$$
where the $N$ predictions are partitioned into $M$ confidence bins, $n_m$ is the number of samples in bin $m$, $\text{acc}(m)$ is their accuracy, and $\text{conf}(m)$ is their average confidence.
Reliability Diagrams:
Visualize calibration by plotting accuracy vs. confidence per bin. A perfectly calibrated model produces a diagonal line.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression


def compute_ece(y_true, y_prob, n_bins=10):
    """
    Compute Expected Calibration Error.

    Parameters:
    -----------
    y_true : binary labels
    y_prob : predicted probabilities for positive class
    n_bins : number of calibration bins

    Returns:
    --------
    ece : Expected Calibration Error
    bin_data : list of per-bin statistics dicts
    """
    # Get confidence (max probability) and predictions
    confidences = np.maximum(y_prob, 1 - y_prob)
    predictions = (y_prob >= 0.5).astype(int)
    accuracies = (predictions == y_true).astype(float)

    # Bin boundaries
    bin_boundaries = np.linspace(0, 1, n_bins + 1)

    ece = 0.0
    bin_data = []

    for i in range(n_bins):
        lower, upper = bin_boundaries[i], bin_boundaries[i + 1]

        # Samples in this bin
        in_bin = (confidences > lower) & (confidences <= upper)
        n_in_bin = in_bin.sum()

        if n_in_bin > 0:
            bin_accuracy = accuracies[in_bin].mean()
            bin_confidence = confidences[in_bin].mean()
            ece += (n_in_bin / len(y_true)) * abs(bin_accuracy - bin_confidence)

            bin_data.append({
                'bin': i,
                'lower': lower,
                'upper': upper,
                'n_samples': n_in_bin,
                'accuracy': bin_accuracy,
                'confidence': bin_confidence,
                'gap': abs(bin_accuracy - bin_confidence)
            })

    return ece, bin_data


def plot_reliability_diagram(y_true, y_prob, n_bins=10, title='Reliability Diagram'):
    """Plot reliability diagram for calibration visualization."""
    fraction_of_positives, mean_predicted_value = calibration_curve(
        y_true, y_prob, n_bins=n_bins
    )

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

    # Reliability diagram
    ax1.plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
    ax1.plot(mean_predicted_value, fraction_of_positives, 's-', label='Model')
    ax1.fill_between(mean_predicted_value, fraction_of_positives,
                     mean_predicted_value, alpha=0.3, color='red')
    ax1.set_xlabel('Mean Predicted Probability')
    ax1.set_ylabel('Fraction of Positives')
    ax1.set_title(title)
    ax1.legend()
    ax1.grid(True, alpha=0.3)

    # Confidence histogram
    ax2.hist(y_prob, bins=n_bins, range=(0, 1), edgecolor='black', alpha=0.7)
    ax2.set_xlabel('Predicted Probability')
    ax2.set_ylabel('Count')
    ax2.set_title('Prediction Distribution')
    ax2.grid(True, alpha=0.3)

    plt.tight_layout()
    return fig


class CalibratedEnsemble:
    """
    Ensemble with post-hoc calibration.
    Applies calibration to the ensemble's probability outputs.
    """

    def __init__(self, ensemble, method='isotonic'):
        """
        Parameters:
        -----------
        ensemble : fitted ensemble model with predict_proba
        method : 'isotonic', 'platt', or 'temperature'
        """
        self.ensemble = ensemble
        self.method = method
        self.calibrators_ = []  # One per class for multi-class
        self.temperature_ = 1.0

    def fit_calibration(self, X_calib, y_calib):
        """
        Fit calibration on held-out calibration set.
        NEVER use training data - must be separate from ensemble training.
        """
        # Get ensemble probabilities
        probs = self.ensemble.predict_proba(X_calib)
        n_classes = probs.shape[1]

        if self.method == 'temperature':
            # Temperature scaling: find T that minimizes NLL
            self._fit_temperature(probs, y_calib)
        elif self.method in ['isotonic', 'platt']:
            # Per-class calibration
            self.calibrators_ = []
            for k in range(n_classes):
                # Binary indicators for class k
                y_binary = (y_calib == k).astype(int)

                if self.method == 'isotonic':
                    calibrator = IsotonicRegression(
                        y_min=0, y_max=1, out_of_bounds='clip'
                    )
                else:  # platt
                    calibrator = LogisticRegression()

                calibrator.fit(probs[:, k].reshape(-1, 1), y_binary)
                self.calibrators_.append(calibrator)

        return self

    def _fit_temperature(self, probs, y_true):
        """Fit temperature scaling."""
        from scipy.optimize import minimize_scalar

        def nll_loss(T):
            # Apply temperature
            scaled_probs = self._temp_scale(probs, T)
            # Cross-entropy loss
            log_probs = np.log(scaled_probs + 1e-10)
            selected = log_probs[range(len(y_true)), y_true]
            return -selected.mean()

        result = minimize_scalar(nll_loss, bounds=(0.1, 10), method='bounded')
        self.temperature_ = result.x

    def _temp_scale(self, probs, T):
        """Apply temperature scaling to probabilities."""
        # Convert to logits, scale, convert back
        logits = np.log(probs + 1e-10)
        scaled_logits = logits / T
        exp_logits = np.exp(scaled_logits - scaled_logits.max(axis=1, keepdims=True))
        return exp_logits / exp_logits.sum(axis=1, keepdims=True)

    def predict_proba(self, X):
        """Return calibrated probabilities."""
        probs = self.ensemble.predict_proba(X)

        if self.method == 'temperature':
            return self._temp_scale(probs, self.temperature_)
        elif self.method in ['isotonic', 'platt']:
            calibrated = np.zeros_like(probs)
            for k, calibrator in enumerate(self.calibrators_):
                col = probs[:, k].reshape(-1, 1)
                if self.method == 'platt':
                    # Platt: use the predicted probability of the positive class,
                    # not the hard 0/1 label from predict()
                    calibrated[:, k] = calibrator.predict_proba(col)[:, 1]
                else:
                    calibrated[:, k] = calibrator.predict(col)
            # Normalize
            calibrated = calibrated / calibrated.sum(axis=1, keepdims=True)
            return calibrated

        return probs

    def predict(self, X):
        """Return class predictions."""
        return self.predict_proba(X).argmax(axis=1)
```

Temperature scaling is the simplest and often most effective calibration method for ensembles. With a single parameter (temperature T), it softens overconfident predictions when T > 1 or sharpens underconfident ones when T < 1. It preserves ranking and requires minimal calibration data.
Understanding the theoretical basis for model aggregation helps explain when and why different strategies work.
The Wisdom of Crowds:
Condorcet's Jury Theorem (1785) states that for binary decisions: if each voter is correct with probability $p > 0.5$ and voters decide independently, then the probability that the majority vote is correct increases as voters are added and approaches 1.
Mathematically, for $B$ independent voters with accuracy $p$:
$$P(\text{majority correct}) = \sum_{k=\lceil B/2 \rceil}^{B} \binom{B}{k} p^k (1-p)^{B-k} \xrightarrow{B \to \infty} 1$$
Caveat: This requires $p > 0.5$. If voters are worse than random, the majority amplifies the error!
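Both the convergence and the caveat are easy to verify numerically. The sketch below is illustrative only (the helper name and the values of $p$ and $B$ are assumptions, not from the original page); it evaluates the majority-vote probability with `scipy.stats.binom`, assuming an odd number of voters so ties cannot occur.

```python
from scipy.stats import binom

def p_majority_correct(B: int, p: float) -> float:
    """P(a strict majority of B independent voters with accuracy p is correct); B assumed odd."""
    k_min = B // 2 + 1                       # smallest strict majority
    return float(binom.sf(k_min - 1, B, p))  # P(K >= k_min)

for p in (0.45, 0.55, 0.65):
    print(p, [round(p_majority_correct(B, p), 3) for B in (1, 11, 51, 101)])
# Accuracy above 0.5 climbs toward 1 as B grows; below 0.5 it collapses toward 0.
```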
Bias-Variance Decomposition for Ensembles:
For regression, the expected squared error decomposes as:
$$\mathbb{E}[(y - \hat{f}_{ens}(x))^2] = \text{Bias}^2 + \text{Variance} + \sigma^2$$
For an ensemble of averaged predictions:
$$\text{Bias}[\bar{f}] = \text{Bias}[f_i]$$ (bias unchanged)
$$\text{Var}[\bar{f}] = \frac{1}{B}\text{Var}[f_i] + \frac{B-1}{B}\text{Cov}[f_i, f_j]$$
Key Insight: Aggregation reduces variance but only to the extent that models disagree. If $\text{Cov} = \text{Var}$, ensemble variance equals individual variance—no benefit.
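To make this concrete, here is a small illustrative calculation (the helper name and the chosen values of $B$ and the correlation are assumptions for this sketch): it evaluates $\text{Var}[\bar{f}]$ for models with equal variance and a common pairwise covariance.

```python
def ensemble_variance(var: float, cov: float, B: int) -> float:
    """Variance of the mean of B predictors with equal variance `var` and pairwise covariance `cov`."""
    return var / B + (B - 1) / B * cov

for rho in (0.0, 0.3, 0.7, 1.0):
    print(f"rho={rho}: Var[ensemble] = {ensemble_variance(1.0, rho, 25):.3f}")
# rho=0 gives 1/B = 0.04; rho=1 gives 1.0 (no reduction), exactly as the formula predicts.
```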
Jensen's Inequality and Convex Losses:
For a convex loss function $L$ (e.g., squared error):
$$L\left(\frac{1}{B}\sum_b f_b(x)\right) \leq \frac{1}{B}\sum_b L(f_b(x))$$
Implication: Under a convex loss, the loss of the averaged prediction is at most the average of the individual losses. The ensemble can never do worse than the average of its members, though it can still do worse than the single best member.
However, for concave parts of non-convex losses (like 0-1 classification loss), this guarantee doesn't hold—ensembles can occasionally underperform individuals.
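As a quick numerical check of the convex case, the following sketch uses synthetic targets and noisy "model" predictions (all values are made up for illustration) and confirms that the squared error of the averaged prediction never exceeds the average of the individual squared errors.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=1_000)                          # synthetic targets
preds = y + rng.normal(scale=1.0, size=(5, 1_000))  # five noisy "models"

loss_of_average = np.mean((y - preds.mean(axis=0)) ** 2)
average_of_losses = np.mean((y - preds) ** 2)
print(loss_of_average, average_of_losses, loss_of_average <= average_of_losses)
# Jensen's inequality guarantees the comparison prints True for squared error.
```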
Optimal Aggregation Weights:
For weighted averaging with weights $w$, the optimal weights minimize:
$$\min_w \mathbb{E}\left[\left(y - \sum_b w_b f_b(x)\right)^2\right] \quad \text{s.t.} \sum_b w_b = 1$$
Solution: $w^* = \Sigma^{-1} \mathbf{1} / (\mathbf{1}^T \Sigma^{-1} \mathbf{1})$ where $\Sigma$ is the covariance matrix of model errors.
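In practice $\Sigma$ is unknown, but it can be estimated from validation residuals. The sketch below is an assumed implementation (the function name is hypothetical); note that the resulting weights can be negative when model errors are strongly correlated, which is why unconstrained solutions are sometimes clipped or regularized.

```python
import numpy as np

def optimal_weights_from_errors(errors: np.ndarray) -> np.ndarray:
    """
    errors : array of shape (n_models, n_samples) holding validation residuals y_true - y_pred.
    Returns the minimum-variance weights w* = Sigma^{-1} 1 / (1^T Sigma^{-1} 1).
    """
    sigma = np.cov(errors)             # estimated error covariance (n_models x n_models)
    ones = np.ones(errors.shape[0])
    w = np.linalg.solve(sigma, ones)   # Sigma^{-1} 1, without forming an explicit inverse
    return w / w.sum()

# Example with made-up residuals for three correlated models:
rng = np.random.default_rng(1)
base = rng.normal(size=500)
errors = np.stack([base + 0.2 * rng.normal(size=500) for _ in range(3)])
print(optimal_weights_from_errors(errors))
```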
| Result | Requirement | Implication |
|---|---|---|
| Condorcet Theorem | Independent voters, p > 0.5 | Majority voting improves with more models |
| Variance Reduction | Decorrelated predictions | Averaging reduces variance by up to 1/B |
| Jensen's Inequality | Convex loss function | Ensemble loss ≤ average individual loss |
| Optimal Weights | Known error covariance | Weight by inverse of error covariance |
| Diversity Bound | Diverse base learners | Ensemble error ≤ average error - diversity |
There's a tension: we want base models to be accurate (low individual error) AND diverse (low correlation). But increasing diversity often decreases individual accuracy (e.g., using weaker features). Optimal ensembles balance this trade-off—neither maximizing individual accuracy nor diversity alone.
With many aggregation options available, how do you choose? Here's a practical decision framework:
Decision Tree for Aggregation Selection:
Start
│
├── Are all models trained the same way (e.g., pure bagging)?
│ ├── Yes → Use simple averaging (equal weights)
│ └── No → Continue
│
├── Do models have significantly different validation performance?
│ ├── Yes → Use weighted averaging (inverse-error weights)
│ └── No → Use simple averaging
│
├── Do you have abundant held-out data (>10K samples)?
│ ├── Yes → Consider stacking
│ └── No → Stick with weighted averaging
│
├── Do you need calibrated probabilities?
│ ├── Yes → Use soft voting + post-hoc calibration
│ └── No → Hard or soft voting both acceptable
│
└── Is this a regression task with potential outlier predictions?
├── Yes → Consider median or trimmed mean
└── No → Use arithmetic mean
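For readers who prefer code to flowcharts, the tree above can be mirrored by a small helper. The function below is a hypothetical sketch (its name and signature are not from the original page) that returns one suggestion per call, with the 10K-sample cutoff taken directly from the tree.

```python
def choose_aggregation(
    homogeneous: bool,
    quality_varies: bool,
    n_holdout: int,
    need_calibrated_probs: bool,
    regression_with_outliers: bool,
) -> str:
    """Hypothetical helper mirroring the decision tree above."""
    if homogeneous:
        return "simple averaging (equal weights)"
    if regression_with_outliers:
        return "median or trimmed mean"
    if n_holdout > 10_000:
        return "stacking"
    if quality_varies:
        return "weighted averaging (inverse-error weights)"
    if need_calibrated_probs:
        return "soft voting + post-hoc calibration"
    return "simple averaging (soft voting / mean)"

print(choose_aggregation(False, True, 2_000, False, False))
# -> "weighted averaging (inverse-error weights)"
```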
| Aggregation Method | Best For | Avoid When |
|---|---|---|
| Simple Average/Vote | Homogeneous bagging ensembles | Models have very different qualities |
| Weighted Average | Diverse model qualities | All models are trained identically |
| Stacking | Heterogeneous ensembles, lots of data | Limited data, overfitting risk high |
| Median (regression) | Outlier-prone predictions | All models well-calibrated |
| Soft Voting | Need probability estimates | Models don't output probabilities |
| Temperature Scaling | Overconfident ensembles | Already well-calibrated |
When in doubt, start simple. Equal-weight soft voting (or mean for regression) is a strong baseline that's hard to beat without substantial effort. Only move to more complex aggregation when you have clear evidence it helps on your specific problem.
The way we combine predictions is as important as the models that generate them. Let's consolidate the key insights:
- Equal-weight averaging (or soft voting) is the default for homogeneous ensembles such as bagging.
- Weighted averaging and stacking pay off when base models differ in quality or architecture and enough held-out data exists to fit the weights or meta-model honestly.
- Median and trimmed means trade a little variance reduction for robustness to outlier predictions.
- Calibrate ensemble probabilities (e.g., with temperature scaling) before using them for decisions.
- Theory says the benefit of aggregation comes from decorrelated errors, so accuracy and diversity must be balanced.
What's Next:
With aggregation strategies mastered, we turn to comparing Bagging vs. Boosting—the two fundamental approaches to ensemble construction. Understanding when each paradigm excels is essential for selecting the right ensemble strategy for any problem.
You now have a comprehensive toolkit for combining model predictions. From simple voting to learned meta-models, you can select and implement the right aggregation strategy for any ensemble challenge.