An ensemble is only as good as its aggregation strategy. Even a collection of excellent models can underperform if their predictions are combined naively. Conversely, sophisticated aggregation can extract surprising performance from mediocre individual models.
Model aggregation spans a rich spectrum of techniques—from simple voting to learned meta-models. Understanding this spectrum is essential for building ensembles that realize their full potential. This page provides a comprehensive treatment of aggregation strategies, their mathematical foundations, and practical guidance for selection.
By the end of this page, you will understand: (1) Voting schemes for classification (hard, soft, weighted), (2) Aggregation for regression (mean, median, weighted), (3) Learned aggregation through stacking, (4) Probability calibration in ensembles, (5) Theoretical guarantees for different aggregation methods, and (6) Practical selection criteria.
For classification tasks, the ensemble must convert multiple class predictions or probability distributions into a single decision. The choice of voting scheme affects both accuracy and calibration.
1. Hard Voting (Majority Vote)
Each model casts a vote for its predicted class. The ensemble predicts the class with the most votes.
$$\hat{y}_{ens} = \text{argmax}_c \sum_{b=1}^{B} \mathbb{1}[\hat{y}_b = c]$$
Properties:
- Requires only class labels, so it works with any classifier.
- Ignores each model's confidence: a 51% prediction counts the same as a 99% prediction.
- Ties are possible with an even number of models and need a tie-breaking rule.
2. Soft Voting (Probability Averaging)
Average the predicted probabilities across models, then select the class with highest average probability.
$$P_{ens}(y=c|x) = \frac{1}{B}\sum_{b=1}^{B} P_b(y=c|x)$$ $$\hat{y}_{ens} = \text{argmax}_c \, P_{ens}(y=c|x)$$
Properties:
- Uses the full probability distribution, so confident models influence the decision more.
- Generally preferred over hard voting when base models produce reasonable probability estimates.
- Requires every model to expose probabilities (e.g., a predict_proba method).
3. Weighted Voting
Assign weights $w_b$ to each model based on estimated quality.
$$P_{ens}(y=c|x) = \sum_{b=1}^{B} w_b \cdot P_b(y=c|x), \quad \sum_b w_b = 1$$
Weight Selection Strategies:
- Proportional to validation accuracy.
- Inversely proportional to validation error rate or log loss.
- Directly optimized on a held-out set (see compute_optimal_weights in the code below).
| Scheme | Uses Confidence | Weights Models | Best When | Calibrated Output |
|---|---|---|---|---|
| Hard Voting | No | Equal | Models are equally good, no probabilities available | No (deterministic) |
| Soft Voting | Yes | Equal | Models output well-calibrated probabilities | Yes (if inputs calibrated) |
| Weighted Hard | No | By quality | Model quality varies significantly | No |
| Weighted Soft | Yes | By quality | Models vary in quality AND calibration matters | Depends on calibration |
```python
import numpy as np
from collections import Counter
from typing import List, Optional, Literal


class VotingClassifier:
    """
    Comprehensive implementation of classification voting schemes.
    Supports hard voting, soft voting, and various weighting strategies.
    """

    def __init__(
        self,
        voting: Literal['hard', 'soft'] = 'soft',
        weights: Optional[List[float]] = None
    ):
        """
        Parameters:
        -----------
        voting : str
            'hard' for majority vote, 'soft' for probability averaging
        weights : list or None
            Model weights (length B). None = equal weights.
        """
        self.voting = voting
        self.weights = weights
        self.models = []

    def fit(self, models: List):
        """Store fitted models."""
        self.models = models
        if self.weights is None:
            self.weights = [1.0 / len(models)] * len(models)
        else:
            # Normalize weights
            total = sum(self.weights)
            self.weights = [w / total for w in self.weights]
        return self

    def predict(self, X):
        """Generate ensemble predictions."""
        if self.voting == 'hard':
            return self._hard_vote(X)
        else:
            probs = self.predict_proba(X)
            return np.argmax(probs, axis=1)

    def _hard_vote(self, X):
        """Weighted majority vote."""
        n_samples = X.shape[0]

        # Collect all predictions with weights
        weighted_votes = []
        for i, (model, weight) in enumerate(zip(self.models, self.weights)):
            preds = model.predict(X)
            weighted_votes.append((preds, weight))

        # Aggregate votes
        predictions = []
        for i in range(n_samples):
            vote_counts = {}
            for preds, weight in weighted_votes:
                vote = preds[i]
                vote_counts[vote] = vote_counts.get(vote, 0) + weight
            # Select class with highest weighted vote
            winner = max(vote_counts.items(), key=lambda x: x[1])[0]
            predictions.append(winner)

        return np.array(predictions)

    def predict_proba(self, X):
        """Weighted probability averaging (soft voting)."""
        # Get probabilities from each model
        all_probs = []
        for model in self.models:
            probs = model.predict_proba(X)
            all_probs.append(probs)

        # Stack: (n_models, n_samples, n_classes)
        stacked = np.stack(all_probs, axis=0)

        # Weighted average
        weights = np.array(self.weights).reshape(-1, 1, 1)
        weighted_probs = np.sum(stacked * weights, axis=0)

        return weighted_probs

    def predict_log_proba(self, X):
        """Weighted log-probability averaging."""
        all_log_probs = []
        for model in self.models:
            log_probs = np.log(model.predict_proba(X) + 1e-10)
            all_log_probs.append(log_probs)

        stacked = np.stack(all_log_probs, axis=0)
        weights = np.array(self.weights).reshape(-1, 1, 1)
        weighted_log_probs = np.sum(stacked * weights, axis=0)

        return weighted_log_probs


def compute_optimal_weights(
    models: List,
    X_val,
    y_val,
    method: str = 'accuracy'
) -> List[float]:
    """
    Compute optimal model weights from validation data.

    Methods:
    --------
    'accuracy'      : Weight proportional to accuracy
    'inverse_error' : Weight inversely proportional to error rate
    'log_loss'      : Weight inversely proportional to log loss
    'optimize'      : Directly optimize ensemble performance
    """
    n_models = len(models)

    if method == 'accuracy':
        accuracies = []
        for model in models:
            preds = model.predict(X_val)
            acc = (preds == y_val).mean()
            accuracies.append(acc)
        # Normalize
        total = sum(accuracies)
        return [a / total for a in accuracies]

    elif method == 'inverse_error':
        errors = []
        for model in models:
            preds = model.predict(X_val)
            err = (preds != y_val).mean() + 1e-6  # Avoid division by zero
            errors.append(1 / err)
        total = sum(errors)
        return [e / total for e in errors]

    elif method == 'log_loss':
        from sklearn.metrics import log_loss
        inv_losses = []
        for model in models:
            probs = model.predict_proba(X_val)
            loss = log_loss(y_val, probs) + 1e-6
            inv_losses.append(1 / loss)
        total = sum(inv_losses)
        return [l / total for l in inv_losses]

    elif method == 'optimize':
        from scipy.optimize import minimize

        # Collect all probabilities
        all_probs = []
        for model in models:
            probs = model.predict_proba(X_val)
            all_probs.append(probs)
        all_probs = np.stack(all_probs, axis=0)

        def ensemble_log_loss(weights):
            # Normalize weights
            w = np.clip(weights, 0.01, None)
            w = w / w.sum()
            # Compute ensemble probabilities
            weighted_probs = np.sum(
                all_probs * w.reshape(-1, 1, 1), axis=0
            )
            # Cross-entropy loss
            log_probs = np.log(weighted_probs + 1e-10)
            correct_log_probs = log_probs[range(len(y_val)), y_val]
            return -correct_log_probs.mean()

        # Optimize
        init_weights = np.ones(n_models) / n_models
        result = minimize(
            ensemble_log_loss,
            init_weights,
            method='L-BFGS-B',
            bounds=[(0.01, None)] * n_models
        )
        weights = result.x / result.x.sum()
        return weights.tolist()

    else:
        return [1.0 / n_models] * n_models
```

For bagged ensembles where all models are trained the same way, equal weights work well—the diversity comes from bootstrap sampling, not model quality differences. Use weighted voting when combining models with different architectures or hyperparameters.
Regression ensembles aggregate continuous predictions. The choice of aggregation affects both accuracy and robustness to outliers.
1. Arithmetic Mean
The most common approach: average all predictions.
$$\hat{y}_{ens}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{y}_b(x)$$
Properties:
- Optimal under squared error when models are unbiased and exchangeable.
- Gives the largest variance reduction among the simple aggregators.
- Sensitive to a single wildly wrong prediction.
2. Median
Takes the middle value of sorted predictions.
$$\hat{y}_{ens}(x) = \text{median}\{\hat{y}_1(x), \ldots, \hat{y}_B(x)\}$$
Properties:
- Robust: a minority of extreme predictions cannot move it far.
- Optimal under absolute error.
- Typically reduces variance less than the mean when predictions are well behaved.
3. Trimmed Mean
Discard extreme predictions before averaging.
$$\hat{y}_{ens}(x) = \frac{1}{B-2k}\sum_{b=k+1}^{B-k} \hat{y}_{(b)}(x)$$
where $\hat{y}_{(b)}$ are sorted predictions and $k$ predictions are trimmed from each end.
Properties:
- Interpolates between the mean ($k=0$) and the median (maximal trimming).
- Tolerates up to $k$ extreme predictions on each side while still averaging the rest.
4. Weighted Mean
Weight predictions by model quality.
$$\hat{y}_{ens}(x) = \sum_{b=1}^{B} w_b \cdot \hat{y}_b(x), \quad \sum_b w_b = 1$$
Weight Selection:
- Inversely proportional to validation error (e.g., MSE) of each model.
- Solved directly from the error covariance matrix (see the optimal-weights result later on this page).
5. Geometric Mean
Multiplicative aggregation (for positive targets).
$$\hat{y}_{ens}(x) = \left(\prod_{b=1}^{B} \hat{y}_b(x)\right)^{1/B}$$
Properties:
- Defined only for positive predictions.
- Dampens the influence of unusually large predictions; by the AM-GM inequality it never exceeds the arithmetic mean.
- Natural when the target is multiplicative or modeled on a log scale.
| Method | Optimal Loss | Outlier Robustness | Variance Reduction | Best For |
|---|---|---|---|---|
| Arithmetic Mean | Squared error | Low | Highest | Standard regression |
| Median | Absolute error | High | Moderate | Contaminated predictions |
| Trimmed Mean | Huber-like | Moderate | High | Slight contamination |
| Weighted Mean | Squared error | Depends | Highest | Heterogeneous models |
| Geometric Mean | Squared error on log scale | Moderate to high | High | Multiplicative targets |
```python
import numpy as np
from scipy import stats
from typing import List, Literal, Optional


class RegressionAggregator:
    """
    Comprehensive regression aggregation strategies.
    """

    def __init__(
        self,
        method: Literal['mean', 'median', 'trimmed_mean', 'weighted', 'geometric'] = 'mean',
        weights: Optional[List[float]] = None,
        trim_fraction: float = 0.1
    ):
        """
        Parameters:
        -----------
        method : str
            Aggregation method
        weights : list or None
            Model weights for weighted aggregation
        trim_fraction : float
            Fraction to trim from each end (for trimmed_mean)
        """
        self.method = method
        self.weights = weights
        self.trim_fraction = trim_fraction

    def aggregate(self, predictions: np.ndarray) -> np.ndarray:
        """
        Aggregate predictions from multiple models.

        Parameters:
        -----------
        predictions : ndarray of shape (n_models, n_samples)
            Predictions from each model

        Returns:
        --------
        aggregated : ndarray of shape (n_samples,)
        """
        if self.method == 'mean':
            return self._arithmetic_mean(predictions)
        elif self.method == 'median':
            return self._median(predictions)
        elif self.method == 'trimmed_mean':
            return self._trimmed_mean(predictions)
        elif self.method == 'weighted':
            return self._weighted_mean(predictions)
        elif self.method == 'geometric':
            return self._geometric_mean(predictions)
        else:
            raise ValueError(f"Unknown method: {self.method}")

    def _arithmetic_mean(self, predictions):
        """Simple average."""
        return np.mean(predictions, axis=0)

    def _median(self, predictions):
        """Median of predictions."""
        return np.median(predictions, axis=0)

    def _trimmed_mean(self, predictions):
        """Trimmed mean: remove extreme values before averaging."""
        return stats.trim_mean(predictions, self.trim_fraction, axis=0)

    def _weighted_mean(self, predictions):
        """Weighted average."""
        if self.weights is None:
            return self._arithmetic_mean(predictions)
        weights = np.array(self.weights).reshape(-1, 1)
        weighted = predictions * weights
        return np.sum(weighted, axis=0)

    def _geometric_mean(self, predictions):
        """Geometric mean (for positive predictions)."""
        # Handle potential zeros/negatives
        safe_preds = np.maximum(predictions, 1e-10)
        log_preds = np.log(safe_preds)
        mean_log = np.mean(log_preds, axis=0)
        return np.exp(mean_log)

    def aggregate_with_uncertainty(self, predictions: np.ndarray):
        """
        Return aggregated prediction with uncertainty estimate.
        Returns mean prediction and standard deviation across models.
        """
        mean_pred = self.aggregate(predictions)
        std_pred = np.std(predictions, axis=0)

        # Confidence intervals (assuming normal distribution)
        ci_lower = mean_pred - 1.96 * std_pred
        ci_upper = mean_pred + 1.96 * std_pred

        return {
            'prediction': mean_pred,
            'std': std_pred,
            'ci_95_lower': ci_lower,
            'ci_95_upper': ci_upper,
        }


def select_aggregation_method(predictions, y_val, verbose=True):
    """
    Evaluate different aggregation methods and recommend the best one.

    Parameters:
    -----------
    predictions : ndarray of shape (n_models, n_samples)
        Model predictions on validation set
    y_val : ndarray of shape (n_samples,)
        True values

    Returns:
    --------
    best_method : str
    results : dict with all method performances
    """
    aggregator = RegressionAggregator()
    results = {}

    for method in ['mean', 'median', 'trimmed_mean', 'geometric']:
        aggregator.method = method
        agg_pred = aggregator.aggregate(predictions)

        # Compute metrics
        mse = np.mean((y_val - agg_pred) ** 2)
        mae = np.mean(np.abs(y_val - agg_pred))

        results[method] = {'mse': mse, 'mae': mae, 'rmse': np.sqrt(mse)}

        if verbose:
            print(f"{method:15} - MSE: {mse:.4f}, MAE: {mae:.4f}")

    # Select best by MSE
    best_method = min(results.keys(), key=lambda m: results[m]['mse'])

    if verbose:
        print(f"\nRecommended method: {best_method}")

    return best_method, results
```

Consider using median when: (1) Some models may produce extreme predictions due to overfitting, (2) The target distribution has heavy tails, (3) You suspect some bootstrap samples are unrepresentative, (4) MAE is your primary metric rather than MSE.
Stacking (or stacked generalization) replaces fixed aggregation rules with a learned meta-model. Instead of averaging, we train a model to optimally combine base model predictions.
The Stacking Architecture:
- Level 0 (base models): each model is trained on the original features and produces predictions or class probabilities.
- Level 1 (meta-model): a second model takes the base predictions as its input features and learns how to combine them into the final prediction.
Critical Point: The meta-model must be trained on out-of-fold predictions from the base models. If it is instead trained on predictions the base models made on their own training data, those predictions reflect memorization rather than generalization, and the meta-model learns to over-trust them.
Generating Out-of-Fold Predictions:
```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.base import clone
from typing import List, Any


class StackingEnsemble:
    """
    Stacking ensemble with proper out-of-fold prediction generation.
    Supports both classification and regression with various meta-learners.
    """

    def __init__(
        self,
        base_models: List[Any],
        meta_model: Any = None,
        n_folds: int = 5,
        use_probas: bool = True,
        passthrough: bool = False,
        task: str = 'classification'
    ):
        """
        Parameters:
        -----------
        base_models : list
            Base model instances (will be cloned)
        meta_model : estimator
            Meta-learner (default: LogisticRegression/Ridge)
        n_folds : int
            Number of folds for OOF prediction generation
        use_probas : bool
            Use probabilities instead of class predictions (classification)
        passthrough : bool
            Include original features in meta-model input
        task : str
            'classification' or 'regression'
        """
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_folds = n_folds
        self.use_probas = use_probas
        self.passthrough = passthrough
        self.task = task

        # Set default meta-model
        if self.meta_model is None:
            if task == 'classification':
                self.meta_model = LogisticRegression(max_iter=1000)
            else:
                self.meta_model = Ridge()

        # These are populated during fit
        self.fitted_base_models_ = []
        self.fitted_meta_model_ = None
        self.n_classes_ = None

    def fit(self, X, y):
        """
        Fit the stacking ensemble.

        1. Generate out-of-fold predictions from base models
        2. Train meta-model on OOF predictions
        3. Retrain base models on full data
        """
        n_samples = X.shape[0]
        n_base = len(self.base_models)

        # Determine output dimensions
        if self.task == 'classification':
            self.n_classes_ = len(np.unique(y))
            if self.use_probas:
                n_features_per_model = self.n_classes_
            else:
                n_features_per_model = 1
        else:
            n_features_per_model = 1

        # Initialize OOF prediction matrix
        oof_predictions = np.zeros((n_samples, n_base * n_features_per_model))

        # Generate out-of-fold predictions
        kf = KFold(n_splits=self.n_folds, shuffle=True, random_state=42)

        for fold_idx, (train_idx, val_idx) in enumerate(kf.split(X)):
            X_train, X_val = X[train_idx], X[val_idx]
            y_train = y[train_idx]

            for model_idx, base_model in enumerate(self.base_models):
                # Clone and train on fold
                model = clone(base_model)
                model.fit(X_train, y_train)

                # Generate predictions for validation fold
                if self.task == 'classification' and self.use_probas:
                    preds = model.predict_proba(X_val)
                    start_col = model_idx * n_features_per_model
                    end_col = start_col + n_features_per_model
                    oof_predictions[val_idx, start_col:end_col] = preds
                else:
                    preds = model.predict(X_val)
                    oof_predictions[val_idx, model_idx] = preds

        # Prepare meta-features
        if self.passthrough:
            meta_X = np.hstack([X, oof_predictions])
        else:
            meta_X = oof_predictions

        # Train meta-model
        self.fitted_meta_model_ = clone(self.meta_model)
        self.fitted_meta_model_.fit(meta_X, y)

        # Retrain base models on full data
        self.fitted_base_models_ = []
        for base_model in self.base_models:
            model = clone(base_model)
            model.fit(X, y)
            self.fitted_base_models_.append(model)

        return self

    def _generate_meta_features(self, X):
        """Generate meta-features from base model predictions."""
        n_samples = X.shape[0]
        n_base = len(self.fitted_base_models_)

        if self.task == 'classification' and self.use_probas:
            n_features_per_model = self.n_classes_
        else:
            n_features_per_model = 1

        meta_features = np.zeros((n_samples, n_base * n_features_per_model))

        for model_idx, model in enumerate(self.fitted_base_models_):
            if self.task == 'classification' and self.use_probas:
                preds = model.predict_proba(X)
                start_col = model_idx * n_features_per_model
                end_col = start_col + n_features_per_model
                meta_features[:, start_col:end_col] = preds
            else:
                meta_features[:, model_idx] = model.predict(X)

        if self.passthrough:
            return np.hstack([X, meta_features])
        return meta_features

    def predict(self, X):
        """Generate predictions."""
        meta_X = self._generate_meta_features(X)
        return self.fitted_meta_model_.predict(meta_X)

    def predict_proba(self, X):
        """Generate probability predictions (classification)."""
        if self.task != 'classification':
            raise ValueError("predict_proba only for classification")
        meta_X = self._generate_meta_features(X)
        return self.fitted_meta_model_.predict_proba(meta_X)


def compare_stacking_to_averaging(base_models, X_train, y_train, X_test, y_test):
    """
    Compare stacking performance to simple averaging.
    """
    from sklearn.metrics import accuracy_score, mean_squared_error

    # Determine task from target
    is_classification = len(np.unique(y_train)) < 20

    # Train individual models
    fitted_models = []
    for model in base_models:
        m = clone(model)
        m.fit(X_train, y_train)
        fitted_models.append(m)

    # Simple averaging
    if is_classification:
        all_probs = np.stack([m.predict_proba(X_test) for m in fitted_models])
        avg_probs = all_probs.mean(axis=0)
        avg_preds = avg_probs.argmax(axis=1)
        avg_score = accuracy_score(y_test, avg_preds)
    else:
        all_preds = np.stack([m.predict(X_test) for m in fitted_models])
        avg_preds = all_preds.mean(axis=0)
        avg_score = -mean_squared_error(y_test, avg_preds)  # Negative for "higher is better"

    # Stacking
    stacker = StackingEnsemble(
        base_models=base_models,
        task='classification' if is_classification else 'regression'
    )
    stacker.fit(X_train, y_train)
    stack_preds = stacker.predict(X_test)

    if is_classification:
        stack_score = accuracy_score(y_test, stack_preds)
    else:
        stack_score = -mean_squared_error(y_test, stack_preds)

    print(f"Simple Averaging Score: {avg_score:.4f}")
    print(f"Stacking Score: {stack_score:.4f}")
    print(f"Improvement: {stack_score - avg_score:.4f}")

    return {
        'averaging_score': avg_score,
        'stacking_score': stack_score,
        'improvement': stack_score - avg_score
    }
```

Never train the meta-model on the same predictions used to evaluate base models. Always use out-of-fold predictions or a separate validation set. Failing to do so leads to severe overfitting—the meta-model learns to trust overfit base predictions.
When ensemble predictions are used for decision-making, probability calibration becomes critical. A well-calibrated model should have its predicted probabilities match empirical frequencies—when it says 70% confidence, it should be correct about 70% of the time.
Ensemble Calibration Properties:
- Ensemble averaging often improves calibration — individual overconfident or underconfident models tend to average toward better calibration.
- But ensembles can still be miscalibrated — especially if all base models share a systematic bias.
- Post-hoc calibration may be needed — apply calibration techniques to the ensemble's outputs.
Measuring Calibration:
Expected Calibration Error (ECE):
Partition predictions into bins by confidence, then measure the gap between confidence and accuracy:
$$\text{ECE} = \sum_{m=1}^{M} \frac{n_m}{N} \left|\text{acc}(m) - \text{conf}(m)\right|$$
where the $N$ predictions are partitioned into $M$ confidence bins, $n_m$ is the number of samples in bin $m$, $\text{acc}(m)$ is their accuracy, and $\text{conf}(m)$ is their average confidence.
Reliability Diagrams:
Visualize calibration by plotting accuracy vs. confidence per bin. A perfectly calibrated model produces a diagonal line.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression


def compute_ece(y_true, y_prob, n_bins=10):
    """
    Compute Expected Calibration Error.

    Parameters:
    -----------
    y_true : binary labels
    y_prob : predicted probabilities for positive class
    n_bins : number of calibration bins

    Returns:
    --------
    ece : Expected Calibration Error
    bin_data : list of per-bin statistics dicts
    """
    # Get confidence (max probability) and predictions
    confidences = np.maximum(y_prob, 1 - y_prob)
    predictions = (y_prob >= 0.5).astype(int)
    accuracies = (predictions == y_true).astype(float)

    # Bin boundaries
    bin_boundaries = np.linspace(0, 1, n_bins + 1)

    ece = 0.0
    bin_data = []

    for i in range(n_bins):
        lower, upper = bin_boundaries[i], bin_boundaries[i + 1]

        # Samples in this bin
        in_bin = (confidences > lower) & (confidences <= upper)
        n_in_bin = in_bin.sum()

        if n_in_bin > 0:
            bin_accuracy = accuracies[in_bin].mean()
            bin_confidence = confidences[in_bin].mean()
            ece += (n_in_bin / len(y_true)) * abs(bin_accuracy - bin_confidence)

            bin_data.append({
                'bin': i,
                'lower': lower,
                'upper': upper,
                'n_samples': n_in_bin,
                'accuracy': bin_accuracy,
                'confidence': bin_confidence,
                'gap': abs(bin_accuracy - bin_confidence)
            })

    return ece, bin_data


def plot_reliability_diagram(y_true, y_prob, n_bins=10, title='Reliability Diagram'):
    """Plot reliability diagram for calibration visualization."""
    fraction_of_positives, mean_predicted_value = calibration_curve(
        y_true, y_prob, n_bins=n_bins
    )

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

    # Reliability diagram
    ax1.plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
    ax1.plot(mean_predicted_value, fraction_of_positives, 's-', label='Model')
    ax1.fill_between(mean_predicted_value, fraction_of_positives,
                     mean_predicted_value, alpha=0.3, color='red')
    ax1.set_xlabel('Mean Predicted Probability')
    ax1.set_ylabel('Fraction of Positives')
    ax1.set_title(title)
    ax1.legend()
    ax1.grid(True, alpha=0.3)

    # Confidence histogram
    ax2.hist(y_prob, bins=n_bins, range=(0, 1), edgecolor='black', alpha=0.7)
    ax2.set_xlabel('Predicted Probability')
    ax2.set_ylabel('Count')
    ax2.set_title('Prediction Distribution')
    ax2.grid(True, alpha=0.3)

    plt.tight_layout()
    return fig


class CalibratedEnsemble:
    """
    Ensemble with post-hoc calibration.
    Applies calibration to the ensemble's probability outputs.
    """

    def __init__(self, ensemble, method='isotonic'):
        """
        Parameters:
        -----------
        ensemble : fitted ensemble model with predict_proba
        method : 'isotonic', 'platt', or 'temperature'
        """
        self.ensemble = ensemble
        self.method = method
        self.calibrators_ = []  # One per class for multi-class
        self.temperature_ = 1.0

    def fit_calibration(self, X_calib, y_calib):
        """
        Fit calibration on held-out calibration set.
        NEVER use training data - must be separate from ensemble training.
        """
        # Get ensemble probabilities
        probs = self.ensemble.predict_proba(X_calib)
        n_classes = probs.shape[1]

        if self.method == 'temperature':
            # Temperature scaling: find T that minimizes NLL
            self._fit_temperature(probs, y_calib)
        elif self.method in ['isotonic', 'platt']:
            # Per-class calibration
            self.calibrators_ = []
            for k in range(n_classes):
                # Binary indicators for class k
                y_binary = (y_calib == k).astype(int)

                if self.method == 'isotonic':
                    calibrator = IsotonicRegression(
                        y_min=0, y_max=1, out_of_bounds='clip'
                    )
                else:  # platt
                    calibrator = LogisticRegression()

                calibrator.fit(probs[:, k].reshape(-1, 1), y_binary)
                self.calibrators_.append(calibrator)

        return self

    def _fit_temperature(self, probs, y_true):
        """Fit temperature scaling."""
        from scipy.optimize import minimize_scalar

        def nll_loss(T):
            # Apply temperature
            scaled_probs = self._temp_scale(probs, T)
            # Cross-entropy loss
            log_probs = np.log(scaled_probs + 1e-10)
            selected = log_probs[range(len(y_true)), y_true]
            return -selected.mean()

        result = minimize_scalar(nll_loss, bounds=(0.1, 10), method='bounded')
        self.temperature_ = result.x

    def _temp_scale(self, probs, T):
        """Apply temperature scaling to probabilities."""
        # Convert to logits, scale, convert back
        logits = np.log(probs + 1e-10)
        scaled_logits = logits / T
        exp_logits = np.exp(scaled_logits - scaled_logits.max(axis=1, keepdims=True))
        return exp_logits / exp_logits.sum(axis=1, keepdims=True)

    def predict_proba(self, X):
        """Return calibrated probabilities."""
        probs = self.ensemble.predict_proba(X)

        if self.method == 'temperature':
            return self._temp_scale(probs, self.temperature_)
        elif self.method in ['isotonic', 'platt']:
            calibrated = np.zeros_like(probs)
            for k, calibrator in enumerate(self.calibrators_):
                col = probs[:, k].reshape(-1, 1)
                if self.method == 'platt':
                    # Platt: use the predicted probability of the positive class,
                    # not the hard 0/1 label from predict()
                    calibrated[:, k] = calibrator.predict_proba(col)[:, 1]
                else:
                    calibrated[:, k] = calibrator.predict(col)
            # Normalize
            calibrated = calibrated / calibrated.sum(axis=1, keepdims=True)
            return calibrated

        return probs

    def predict(self, X):
        """Return class predictions."""
        return self.predict_proba(X).argmax(axis=1)
```

Temperature scaling is the simplest and often most effective calibration method for ensembles. With a single parameter (temperature T), it softens overconfident predictions when T > 1 or sharpens underconfident ones when T < 1. It preserves ranking and requires minimal calibration data.
Understanding the theoretical basis for model aggregation helps explain when and why different strategies work.
The Wisdom of Crowds:
Condorcet's Jury Theorem (1785) states that for binary decisions: if each voter is correct with probability $p > 0.5$ and voters decide independently, then the probability that the majority vote is correct increases as voters are added and approaches 1.
Mathematically, for $B$ independent voters with accuracy $p$:
$$P(\text{majority correct}) = \sum_{k=\lceil B/2 \rceil}^{B} \binom{B}{k} p^k (1-p)^{B-k} \xrightarrow{B \to \infty} 1$$
Caveat: This requires $p > 0.5$. If voters are worse than random, the majority amplifies the error!
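Both the convergence and the caveat are easy to verify numerically. The sketch below is illustrative only (the helper name and the values of $p$ and $B$ are assumptions, not from the original page); it evaluates the majority-vote probability with `scipy.stats.binom`, assuming an odd number of voters so ties cannot occur.

```python
from scipy.stats import binom

def p_majority_correct(B: int, p: float) -> float:
    """P(a strict majority of B independent voters with accuracy p is correct); B assumed odd."""
    k_min = B // 2 + 1                       # smallest strict majority
    return float(binom.sf(k_min - 1, B, p))  # P(K >= k_min)

for p in (0.45, 0.55, 0.65):
    print(p, [round(p_majority_correct(B, p), 3) for B in (1, 11, 51, 101)])
# Accuracy above 0.5 climbs toward 1 as B grows; below 0.5 it collapses toward 0.
```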
Bias-Variance Decomposition for Ensembles:
For regression, the expected squared error decomposes as:
$$\mathbb{E}[(y - \hat{f}_{ens}(x))^2] = \text{Bias}^2 + \text{Variance} + \sigma^2$$
For an ensemble of averaged predictions:
$$\text{Bias}[\bar{f}] = \text{Bias}[f_i]$$ (bias unchanged)
$$\text{Var}[\bar{f}] = \frac{1}{B}\text{Var}[f_i] + \frac{B-1}{B}\text{Cov}[f_i, f_j]$$
Key Insight: Aggregation reduces variance but only to the extent that models disagree. If $\text{Cov} = \text{Var}$, ensemble variance equals individual variance—no benefit.
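To make this concrete, here is a small illustrative calculation (the helper name and the chosen values of $B$ and the correlation are assumptions for this sketch): it evaluates $\text{Var}[\bar{f}]$ for models with equal variance and a common pairwise covariance.

```python
def ensemble_variance(var: float, cov: float, B: int) -> float:
    """Variance of the mean of B predictors with equal variance `var` and pairwise covariance `cov`."""
    return var / B + (B - 1) / B * cov

for rho in (0.0, 0.3, 0.7, 1.0):
    print(f"rho={rho}: Var[ensemble] = {ensemble_variance(1.0, rho, 25):.3f}")
# rho=0 gives 1/B = 0.04; rho=1 gives 1.0 (no reduction), exactly as the formula predicts.
```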
Jensen's Inequality and Convex Losses:
For a convex loss function $L$ (e.g., squared error):
$$L\left(\frac{1}{B}\sum_b f_b(x)\right) \leq \frac{1}{B}\sum_b L(f_b(x))$$
Implication: Under a convex loss, the loss of the averaged prediction is at most the average of the individual losses. The ensemble can never do worse than the average of its members, though it can still do worse than the single best member.
However, for concave parts of non-convex losses (like 0-1 classification loss), this guarantee doesn't hold—ensembles can occasionally underperform individuals.
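As a quick numerical check of the convex case, the following sketch uses synthetic targets and noisy "model" predictions (all values are made up for illustration) and confirms that the squared error of the averaged prediction never exceeds the average of the individual squared errors.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=1_000)                          # synthetic targets
preds = y + rng.normal(scale=1.0, size=(5, 1_000))  # five noisy "models"

loss_of_average = np.mean((y - preds.mean(axis=0)) ** 2)
average_of_losses = np.mean((y - preds) ** 2)
print(loss_of_average, average_of_losses, loss_of_average <= average_of_losses)
# Jensen's inequality guarantees the comparison prints True for squared error.
```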
Optimal Aggregation Weights:
For weighted averaging with weights $w$, the optimal weights minimize:
$$\min_w \mathbb{E}\left[\left(y - \sum_b w_b f_b(x)\right)^2\right] \quad \text{s.t.} \sum_b w_b = 1$$
Solution: $w^* = \Sigma^{-1} \mathbf{1} / (\mathbf{1}^T \Sigma^{-1} \mathbf{1})$ where $\Sigma$ is the covariance matrix of model errors.
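In practice $\Sigma$ is unknown, but it can be estimated from validation residuals. The sketch below is an assumed implementation (the function name is hypothetical); note that the resulting weights can be negative when model errors are strongly correlated, which is why unconstrained solutions are sometimes clipped or regularized.

```python
import numpy as np

def optimal_weights_from_errors(errors: np.ndarray) -> np.ndarray:
    """
    errors : array of shape (n_models, n_samples) holding validation residuals y_true - y_pred.
    Returns the minimum-variance weights w* = Sigma^{-1} 1 / (1^T Sigma^{-1} 1).
    """
    sigma = np.cov(errors)             # estimated error covariance (n_models x n_models)
    ones = np.ones(errors.shape[0])
    w = np.linalg.solve(sigma, ones)   # Sigma^{-1} 1, without forming an explicit inverse
    return w / w.sum()

# Example with made-up residuals for three correlated models:
rng = np.random.default_rng(1)
base = rng.normal(size=500)
errors = np.stack([base + 0.2 * rng.normal(size=500) for _ in range(3)])
print(optimal_weights_from_errors(errors))
```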
| Result | Requirement | Implication |
|---|---|---|
| Condorcet Theorem | Independent voters, p > 0.5 | Majority voting improves with more models |
| Variance Reduction | Decorrelated predictions | Averaging reduces variance by up to 1/B |
| Jensen's Inequality | Convex loss function | Ensemble loss ≤ average individual loss |
| Optimal Weights | Known error covariance | Weight by inverse of error covariance |
| Diversity Bound | Diverse base learners | Ensemble error ≤ average error - diversity |
There's a tension: we want base models to be accurate (low individual error) AND diverse (low correlation). But increasing diversity often decreases individual accuracy (e.g., using weaker features). Optimal ensembles balance this trade-off—neither maximizing individual accuracy nor diversity alone.
With many aggregation options available, how do you choose? Here's a practical decision framework:
Decision Tree for Aggregation Selection:
Start
│
├── Are all models trained the same way (e.g., pure bagging)?
│ ├── Yes → Use simple averaging (equal weights)
│ └── No → Continue
│
├── Do models have significantly different validation performance?
│ ├── Yes → Use weighted averaging (inverse-error weights)
│ └── No → Use simple averaging
│
├── Do you have abundant held-out data (>10K samples)?
│ ├── Yes → Consider stacking
│ └── No → Stick with weighted averaging
│
├── Do you need calibrated probabilities?
│ ├── Yes → Use soft voting + post-hoc calibration
│ └── No → Hard or soft voting both acceptable
│
└── Is this a regression task with potential outlier predictions?
├── Yes → Consider median or trimmed mean
└── No → Use arithmetic mean
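For readers who prefer code to flowcharts, the tree above can be mirrored by a small helper. The function below is a hypothetical sketch (its name and signature are not from the original page) that returns one suggestion per call, with the 10K-sample cutoff taken directly from the tree.

```python
def choose_aggregation(
    homogeneous: bool,
    quality_varies: bool,
    n_holdout: int,
    need_calibrated_probs: bool,
    regression_with_outliers: bool,
) -> str:
    """Hypothetical helper mirroring the decision tree above."""
    if homogeneous:
        return "simple averaging (equal weights)"
    if regression_with_outliers:
        return "median or trimmed mean"
    if n_holdout > 10_000:
        return "stacking"
    if quality_varies:
        return "weighted averaging (inverse-error weights)"
    if need_calibrated_probs:
        return "soft voting + post-hoc calibration"
    return "simple averaging (soft voting / mean)"

print(choose_aggregation(False, True, 2_000, False, False))
# -> "weighted averaging (inverse-error weights)"
```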
| Aggregation Method | Best For | Avoid When |
|---|---|---|
| Simple Average/Vote | Homogeneous bagging ensembles | Models have very different qualities |
| Weighted Average | Diverse model qualities | All models are trained identically |
| Stacking | Heterogeneous ensembles, lots of data | Limited data, overfitting risk high |
| Median (regression) | Outlier-prone predictions | All models well-calibrated |
| Soft Voting | Need probability estimates | Models don't output probabilities |
| Temperature Scaling | Overconfident ensembles | Already well-calibrated |
When in doubt, start simple. Equal-weight soft voting (or mean for regression) is a strong baseline that's hard to beat without substantial effort. Only move to more complex aggregation when you have clear evidence it helps on your specific problem.
The way we combine predictions is as important as the models that generate them. Let's consolidate the key insights:
- Equal-weight averaging (or soft voting) is the default for homogeneous ensembles such as bagging.
- Weighted averaging and stacking pay off when base models differ in quality or architecture and enough held-out data exists to fit the weights or meta-model honestly.
- Median and trimmed means trade a little variance reduction for robustness to outlier predictions.
- Calibrate ensemble probabilities (e.g., with temperature scaling) before using them for decisions.
- Theory says the benefit of aggregation comes from decorrelated errors, so accuracy and diversity must be balanced.
What's Next:
With aggregation strategies mastered, we turn to comparing Bagging vs. Boosting—the two fundamental approaches to ensemble construction. Understanding when each paradigm excels is essential for selecting the right ensemble strategy for any problem.
You now have a comprehensive toolkit for combining model predictions. From simple voting to learned meta-models, you can select and implement the right aggregation strategy for any ensemble challenge.