Neural networks are powerful function approximators, but they suffer from two challenges that make them natural candidates for ensemble methods: high variance due to random initialization and sensitivity to hyperparameters. Bagging provides a principled approach to address both issues.
While single neural networks have achieved remarkable success, combining multiple networks through bagging has proven essential in high-stakes applications—from medical diagnosis to autonomous driving—where reliability matters as much as accuracy. This page explores the theory and practice of bagged neural networks, revealing when and how to apply ensemble principles to deep learning.
By the end of this page, you will understand: (1) Why neural networks benefit from bagging despite being low-bias models, (2) Multiple sources of diversity in neural network ensembles, (3) Implementation strategies for efficient training and inference, (4) The relationship to dropout as implicit bagging, and (5) Practical considerations for production deployment.
Unlike decision trees, neural networks have multiple sources of variance beyond training data. Understanding these sources is crucial for designing effective neural network ensembles.
Source 1: Random Weight Initialization
Neural networks are initialized with random weights, and different initializations lead to different local optima. Even trained on identical data, two networks with different initializations will produce different predictions. This initialization variance is a significant source of disagreement between networks.
Source 2: Stochastic Optimization
Training uses stochastic gradient descent (SGD) or variants, which:
- estimate each gradient from a random mini-batch, injecting noise into every update,
- depend on the shuffling order of the training data, and
- interact with other stochastic operations such as dropout and data augmentation.
This optimization variance means the same initialization with the same data produces different trained networks due to optimization randomness.
Source 3: Architecture Sensitivity
Small changes in architecture—number of layers, hidden units, activation functions—produce networks that model different aspects of the data. Even within a fixed architecture, the network's capacity allows it to represent many equivalent functions, and training settles on one arbitrarily.
Source 4: Training Data Variance (Bootstrap)
When we apply bagging, we add the traditional source of variance from bootstrap sampling. Each network sees a different subset of the training data, encouraging it to learn different aspects of the underlying function.
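The bootstrap step itself is easy to sketch. The snippet below (a minimal illustration, not tied to any particular library) draws one bootstrap replicate of indices and checks the classic result that each network sees roughly 63% of the distinct training examples:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Sample n indices with replacement -- one bootstrap replicate
indices = rng.integers(0, n, size=n)

# Fraction of distinct training examples seen by this network,
# expected to be close to 1 - 1/e ≈ 0.632
unique_fraction = len(np.unique(indices)) / n
print(unique_fraction)
```

The remaining ~37% of examples are out-of-bag for that network and can serve as a free validation set, just as with bagged trees.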
| Variance Source | Controlled By | Diversity Impact | Computational Cost |
|---|---|---|---|
| Weight Initialization | Random seeds | High—different local optima | None (part of training) |
| Mini-batch Sampling | Batch size, shuffle seed | Moderate—affects convergence path | None (part of training) |
| Bootstrap Sampling | Bootstrap proportion | High—different training data | B × training cost |
| Dropout During Training | Dropout rate | Moderate—regularizes differently | Small overhead |
| Architecture Variation | Hyperparameter ranges | Very high—different representations | Hyperparameter search cost |
| Data Augmentation | Augmentation policy | High—different viewpoints | Augmentation overhead |
In practice, the most effective neural network ensembles combine multiple sources of diversity. Simply training the same architecture from different random seeds often provides significant improvement—sometimes matching more expensive approaches like architecture search.
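Why diversity pays can be quantified with the standard result for averaging correlated predictors: $B$ models with individual variance $\sigma^2$ and pairwise correlation $\rho$ give an ensemble variance of $\rho\sigma^2 + (1-\rho)\sigma^2/B$. A quick numeric check (illustrative only):

```python
def ensemble_variance(sigma2, rho, B):
    """Variance of the average of B equally correlated predictors."""
    return rho * sigma2 + (1 - rho) * sigma2 / B

# With sigma^2 = 1 and B = 10: independent members shrink variance
# like 1/B, while correlation rho puts a floor at rho * sigma^2.
for rho in (0.0, 0.5, 0.9):
    print(rho, ensemble_variance(1.0, rho, B=10))
```

The $\rho\sigma^2$ floor is why adding diversity sources (lowering $\rho$) often helps more than adding members (raising $B$).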
Neural network ensembles come in several flavors, each leveraging different sources of diversity:
1. Pure Bagging (Bootstrap Aggregating)
The classical approach: train each network on a different bootstrap sample of the training data.
2. Random Initialization Ensembles
Train each network on the same data but with different random seeds.
Surprisingly, this alone is highly effective: research on deep ensembles shows that random initialization by itself provides much of the diversity benefit.
3. Architecture-Diverse Ensembles
Combine networks with different architectures:
- different depths and widths (number of layers and hidden units),
- different activation functions or normalization layers,
- entirely different families (e.g., MLPs, CNNs, transformers) on the same task.
This approach maximizes diversity but requires tuning multiple architectures.
4. Hyperparameter-Diverse Ensembles
Keep the architecture fixed but vary training hyperparameters:
- learning rates and schedules,
- batch sizes,
- regularization strength (weight decay, dropout rate),
- data augmentation policies.
5. Snapshot Ensembles
A computationally efficient approach: save network snapshots during training with cyclic learning rate schedules.
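The key ingredient is the cyclic schedule: the learning rate is repeatedly annealed toward zero (where a snapshot is saved) and then restarted. A sketch of a cosine-cyclic schedule in the spirit of snapshot ensembles, with parameter names chosen for illustration:

```python
import math

def cyclic_cosine_lr(step, total_steps, n_cycles, lr_max):
    """Cosine schedule that restarts n_cycles times; a snapshot is
    saved at the end of each cycle, where the rate is near zero."""
    steps_per_cycle = total_steps // n_cycles
    t = step % steps_per_cycle
    return lr_max / 2 * (math.cos(math.pi * t / steps_per_cycle) + 1)

# Five cycles over 1000 steps with lr_max = 0.1: the rate restarts
# at steps 0, 200, 400, ... and decays toward 0 before each snapshot.
print(cyclic_cosine_lr(0, 1000, 5, 0.1))    # lr_max at the start of a cycle
print(cyclic_cosine_lr(199, 1000, 5, 0.1))  # nearly 0 just before a snapshot
print(cyclic_cosine_lr(200, 1000, 5, 0.1))  # back to lr_max after the restart
```

Because each restart lets the network escape its previous basin, the snapshots land in different local optima, giving ensemble-like diversity for the cost of a single training run.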
Studies have shown that ensembles of 3-5 networks often capture most of the ensemble benefit. Beyond 10 networks, improvements become marginal for most tasks. This makes neural network ensembles practical even when training is expensive.
Implementing bagged neural networks requires careful consideration of training efficiency, memory management, and prediction aggregation. Here's a comprehensive implementation framework:
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, SubsetRandomSampler
import numpy as np
from typing import List, Callable, Optional


class NeuralNetworkEnsemble:
    """
    Bagged Neural Network Ensemble with multiple diversity strategies.

    Supports:
    - Bootstrap sampling
    - Random initialization
    - Architecture diversity
    - Snapshot ensembles
    """

    def __init__(
        self,
        model_factory: Callable[[], nn.Module],
        n_models: int = 5,
        use_bootstrap: bool = True,
        bootstrap_ratio: float = 1.0,
        device: str = 'cuda' if torch.cuda.is_available() else 'cpu'
    ):
        """
        Parameters:
        -----------
        model_factory : callable
            Function that returns a new model instance
        n_models : int
            Number of models in the ensemble
        use_bootstrap : bool
            Whether to use bootstrap sampling
        bootstrap_ratio : float
            Fraction of data to sample (with replacement)
        device : str
            Device for training and inference
        """
        self.model_factory = model_factory
        self.n_models = n_models
        self.use_bootstrap = use_bootstrap
        self.bootstrap_ratio = bootstrap_ratio
        self.device = device
        self.models: List[nn.Module] = []
        self.training_histories: List[dict] = []

    def _create_bootstrap_sampler(self, dataset_size: int, seed: int):
        """Create a bootstrap sampler for DataLoader."""
        np.random.seed(seed)
        sample_size = int(dataset_size * self.bootstrap_ratio)
        indices = np.random.choice(dataset_size, size=sample_size, replace=True)
        return SubsetRandomSampler(indices)

    def fit(
        self,
        train_dataset,
        val_dataset=None,
        epochs: int = 100,
        batch_size: int = 32,
        learning_rate: float = 0.001,
        weight_decay: float = 1e-4,
        early_stopping_patience: int = 10,
        verbose: bool = True
    ):
        """
        Train the ensemble using bagging.

        Each model is trained independently on a bootstrap sample
        with different random initialization.
        """
        dataset_size = len(train_dataset)

        for i in range(self.n_models):
            if verbose:
                print(f"Training model {i+1}/{self.n_models}")

            # Create model with unique random seed
            torch.manual_seed(42 + i * 1000)
            model = self.model_factory().to(self.device)

            # Create bootstrap sampler if enabled
            if self.use_bootstrap:
                sampler = self._create_bootstrap_sampler(dataset_size, seed=i)
                train_loader = DataLoader(
                    train_dataset, batch_size=batch_size, sampler=sampler
                )
            else:
                train_loader = DataLoader(
                    train_dataset, batch_size=batch_size, shuffle=True
                )

            # Validation loader (no bootstrap)
            val_loader = None
            if val_dataset is not None:
                val_loader = DataLoader(val_dataset, batch_size=batch_size)

            # Train individual model
            history = self._train_single_model(
                model, train_loader, val_loader, epochs,
                learning_rate, weight_decay, early_stopping_patience, verbose
            )

            self.models.append(model)
            self.training_histories.append(history)

        return self

    def _train_single_model(
        self, model, train_loader, val_loader, epochs,
        learning_rate, weight_decay, patience, verbose
    ):
        """Train a single model with early stopping."""
        optimizer = optim.AdamW(
            model.parameters(), lr=learning_rate, weight_decay=weight_decay
        )
        scheduler = optim.lr_scheduler.ReduceLROnPlateau(
            optimizer, mode='min', patience=patience // 2
        )
        criterion = nn.CrossEntropyLoss()

        history = {'train_loss': [], 'val_loss': [], 'val_acc': []}
        best_val_loss = float('inf')
        patience_counter = 0
        best_state = None

        for epoch in range(epochs):
            # Training phase
            model.train()
            train_loss = 0
            for X_batch, y_batch in train_loader:
                X_batch = X_batch.to(self.device)
                y_batch = y_batch.to(self.device)

                optimizer.zero_grad()
                outputs = model(X_batch)
                loss = criterion(outputs, y_batch)
                loss.backward()
                optimizer.step()
                train_loss += loss.item()

            train_loss /= len(train_loader)
            history['train_loss'].append(train_loss)

            # Validation phase
            if val_loader is not None:
                val_loss, val_acc = self._evaluate(model, val_loader, criterion)
                history['val_loss'].append(val_loss)
                history['val_acc'].append(val_acc)
                scheduler.step(val_loss)

                # Early stopping
                if val_loss < best_val_loss:
                    best_val_loss = val_loss
                    patience_counter = 0
                    best_state = {
                        k: v.cpu().clone() for k, v in model.state_dict().items()
                    }
                else:
                    patience_counter += 1
                    if patience_counter >= patience:
                        if verbose:
                            print(f"  Early stopping at epoch {epoch+1}")
                        break

                if verbose and (epoch + 1) % 10 == 0:
                    print(f"  Epoch {epoch+1}: train_loss={train_loss:.4f}, "
                          f"val_loss={val_loss:.4f}, val_acc={val_acc:.4f}")

        # Restore best model
        if best_state is not None:
            model.load_state_dict(
                {k: v.to(self.device) for k, v in best_state.items()}
            )

        return history

    def _evaluate(self, model, loader, criterion):
        """Evaluate model on a dataset."""
        model.eval()
        total_loss = 0
        correct = 0
        total = 0

        with torch.no_grad():
            for X_batch, y_batch in loader:
                X_batch = X_batch.to(self.device)
                y_batch = y_batch.to(self.device)
                outputs = model(X_batch)
                loss = criterion(outputs, y_batch)
                total_loss += loss.item()
                _, predicted = outputs.max(1)
                correct += predicted.eq(y_batch).sum().item()
                total += y_batch.size(0)

        return total_loss / len(loader), correct / total

    def predict(self, X: torch.Tensor, return_individual: bool = False):
        """
        Generate ensemble predictions.

        Parameters:
        -----------
        X : torch.Tensor
            Input features
        return_individual : bool
            If True, return predictions from each model

        Returns:
        --------
        predictions : class predictions (argmax of averaged probabilities)
        probabilities : averaged class probabilities
        individual : (optional) list of individual model predictions
        """
        X = X.to(self.device)
        all_probs = []

        for model in self.models:
            model.eval()
            with torch.no_grad():
                logits = model(X)
                probs = torch.softmax(logits, dim=1)
                all_probs.append(probs)

        # Stack and average
        stacked = torch.stack(all_probs, dim=0)  # (n_models, batch, classes)
        avg_probs = stacked.mean(dim=0)          # (batch, classes)
        predictions = avg_probs.argmax(dim=1)    # (batch,)

        if return_individual:
            individual = [p.argmax(dim=1) for p in all_probs]
            return predictions, avg_probs, individual
        return predictions, avg_probs

    def predict_with_uncertainty(self, X: torch.Tensor):
        """
        Generate predictions with uncertainty estimates.

        Uses ensemble disagreement as a measure of epistemic uncertainty.
        """
        X = X.to(self.device)
        all_probs = []

        for model in self.models:
            model.eval()
            with torch.no_grad():
                logits = model(X)
                probs = torch.softmax(logits, dim=1)
                all_probs.append(probs)

        stacked = torch.stack(all_probs, dim=0)

        # Mean prediction
        mean_probs = stacked.mean(dim=0)
        predictions = mean_probs.argmax(dim=1)

        # Predictive entropy (total uncertainty)
        predictive_entropy = -torch.sum(
            mean_probs * torch.log(mean_probs + 1e-10), dim=1
        )

        # Expected entropy (aleatoric uncertainty)
        individual_entropies = -torch.sum(
            stacked * torch.log(stacked + 1e-10), dim=2
        )
        expected_entropy = individual_entropies.mean(dim=0)

        # Mutual information (epistemic uncertainty)
        mutual_info = predictive_entropy - expected_entropy

        # Prediction variance across models
        pred_variance = stacked.var(dim=0).sum(dim=1)

        return {
            'predictions': predictions,
            'probabilities': mean_probs,
            'predictive_entropy': predictive_entropy,
            'epistemic_uncertainty': mutual_info,
            'aleatoric_uncertainty': expected_entropy,
            'prediction_variance': pred_variance,
        }
```

For production deployment, consider: (1) Batching predictions across models using torch.vmap or parallel inference, (2) Using mixed precision (FP16) to reduce memory, (3) Model distillation to compress the ensemble into a single network, (4) Caching ensemble predictions for frequently queried inputs.
A remarkable connection exists between dropout and bagging: dropout can be viewed as training an exponentially large ensemble of sub-networks. Understanding this connection provides theoretical grounding for both techniques.
The Dropout Perspective:
During training with dropout rate $p$, each forward pass samples a random sub-network by zeroing out neurons with probability $p$. For a network with $n$ neurons, this creates $2^n$ possible sub-networks (each neuron is either present or absent).
Each sub-network:
- is trained only on the mini-batches for which it happens to be sampled,
- shares its weights with every other sub-network, and
- therefore behaves like a member of an enormous, heavily weight-tied ensemble.
At inference time, multiplying weights by $(1-p)$ approximates the geometric mean of all these sub-networks—an implicit ensemble average.
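The scaling rule is easy to verify for a single linear unit. In this minimal numpy sketch (values are arbitrary), the unit's pre-activation averaged over many random dropout masks agrees with a single deterministic pass using weights scaled by $(1-p)$:

```python
import numpy as np

rng = np.random.default_rng(42)
p = 0.5                                  # dropout rate
w = np.array([0.2, -1.0, 0.7, 0.4])      # weights of one linear unit
x = np.array([1.0, 2.0, -1.0, 3.0])      # a fixed input

# Monte Carlo: average the unit's pre-activation over random masks
n_masks = 200_000
masks = rng.random((n_masks, w.size)) >= p   # keep each input with prob 1-p
mc_mean = ((masks * x) @ w).mean()

# Deterministic pass with weights scaled by (1 - p)
scaled = (1 - p) * (w @ x)

print(mc_mean, scaled)  # the two values agree closely
```

This equality holds exactly in expectation for the pre-activation; for deep networks with nonlinearities the scaling is only an approximation, which is what motivates Monte Carlo Dropout below.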
Monte Carlo Dropout for Explicit Ensembling:
We can make dropout's ensemble nature explicit by keeping dropout active during inference and running multiple forward passes:
This technique, called MC Dropout, turns any dropout-trained network into an ensemble-like system without training multiple models.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np


class MCDropoutModel(nn.Module):
    """
    Neural network with MC Dropout for uncertainty estimation.

    Keeps dropout active during inference to generate
    ensemble-like predictions.
    """

    def __init__(self, input_dim, hidden_dims, output_dim, dropout_rate=0.5):
        super().__init__()
        layers = []
        prev_dim = input_dim
        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.ReLU(),
                nn.Dropout(dropout_rate)
            ])
            prev_dim = hidden_dim
        layers.append(nn.Linear(prev_dim, output_dim))
        self.network = nn.Sequential(*layers)
        self.dropout_rate = dropout_rate

    def forward(self, x):
        return self.network(x)

    def mc_predict(self, x, n_samples=100):
        """
        Monte Carlo Dropout prediction.

        Runs n_samples forward passes with dropout active,
        then aggregates predictions and estimates uncertainty.
        """
        self.train()  # Keep dropout active
        predictions = []

        with torch.no_grad():
            for _ in range(n_samples):
                logits = self.forward(x)
                probs = F.softmax(logits, dim=1)
                predictions.append(probs)

        # Stack predictions: (n_samples, batch_size, n_classes)
        stacked = torch.stack(predictions, dim=0)

        # Mean prediction
        mean_probs = stacked.mean(dim=0)
        pred_classes = mean_probs.argmax(dim=1)

        # Predictive entropy (total uncertainty)
        predictive_entropy = -torch.sum(
            mean_probs * torch.log(mean_probs + 1e-10), dim=1
        )

        # Variance of predictions (model uncertainty)
        pred_variance = stacked.var(dim=0).mean(dim=1)

        # Standard deviation of max probability
        max_probs = stacked.max(dim=2).values
        confidence_std = max_probs.std(dim=0)

        return {
            'predictions': pred_classes,
            'probabilities': mean_probs,
            'entropy': predictive_entropy,
            'variance': pred_variance,
            'confidence_std': confidence_std,
        }

    def enable_inference_dropout(self):
        """Enable dropout layers while leaving the rest in eval mode."""
        def enable_dropout(module):
            if isinstance(module, nn.Dropout):
                module.train()
        self.apply(enable_dropout)

    def get_ensemble_agreement(self, x, n_samples=50):
        """
        Measure agreement among MC samples.

        Returns the fraction of samples that agree with the
        majority prediction for each input.
        """
        self.enable_inference_dropout()
        predictions = []

        with torch.no_grad():
            for _ in range(n_samples):
                logits = self.forward(x)
                pred_class = logits.argmax(dim=1)
                predictions.append(pred_class)

        # Stack: (n_samples, batch_size)
        stacked = torch.stack(predictions, dim=0)

        # For each input, find the modal prediction and count agreement
        agreement = []
        for i in range(stacked.size(1)):
            sample_preds = stacked[:, i]
            unique, counts = torch.unique(sample_preds, return_counts=True)
            max_count = counts.max().item()
            agreement.append(max_count / n_samples)

        return torch.tensor(agreement)


def compare_mc_dropout_vs_ensemble(model, ensemble, test_loader, n_mc_samples=100):
    """
    Compare uncertainty estimates from MC Dropout vs explicit ensemble.

    Both approaches should show similar patterns in uncertainty.
    """
    mc_entropies = []
    ensemble_entropies = []

    for X_batch, _ in test_loader:
        # MC Dropout uncertainty
        mc_result = model.mc_predict(X_batch, n_samples=n_mc_samples)
        mc_entropies.extend(mc_result['entropy'].cpu().numpy())

        # Ensemble uncertainty
        ens_result = ensemble.predict_with_uncertainty(X_batch)
        ensemble_entropies.extend(ens_result['predictive_entropy'].cpu().numpy())

    correlation = np.corrcoef(mc_entropies, ensemble_entropies)[0, 1]
    return {
        'mc_entropies': mc_entropies,
        'ensemble_entropies': ensemble_entropies,
        'correlation': correlation,
    }
```

| Aspect | Dropout / MC Dropout | Explicit Ensemble |
|---|---|---|
| Training Cost | 1× (single model) | B× (B models) |
| Inference Cost | T× forward passes (T = MC samples) | B× forward passes |
| Memory (Training) | 1× model size | B× model size |
| Memory (Inference) | 1× model size | B× model size |
| Diversity Source | Random neuron masking | Bootstrap + initialization |
| Architecture Diversity | None (same structure) | Can vary architectures |
| Uncertainty Quality | Good approximation | Gold standard |
| Theoretical Basis | Approximate ensemble | True ensemble average |
MC Dropout provides cheaper uncertainty estimates, but explicit ensembles typically provide better calibrated uncertainties. For high-stakes applications (medical, autonomous systems), the computational cost of true ensembles is often justified by improved reliability.
While simple averaging is the default aggregation method, neural network ensembles benefit from more sophisticated combination strategies.
1. Probability Averaging (Soft Voting)
The standard approach: average the softmax probabilities from each network.
$$P_{ens}(y|x) = \frac{1}{B}\sum_{b=1}^{B} P_b(y|x)$$
Advantages:
- Uses each model's full confidence distribution, not just its top class.
- Smooths over individual models' errors and miscalibrations.
- Directly yields ensemble class probabilities for downstream thresholds.
2. Logit Averaging
Average the raw logits before softmax.
$$z_{ens}(x) = \frac{1}{B}\sum_{b=1}^{B} z_b(x), \quad P_{ens}(y|x) = \text{softmax}(z_{ens})$$
Advantages:
- Preserves relative confidence among classes that softmax saturation can wash out.
- Equivalent to taking a normalized geometric mean of the member probabilities.
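Logit averaging and probability averaging genuinely differ, and logit averaging has a tidy interpretation: softmax of the mean logits equals the normalized geometric mean of the member probabilities. A quick numpy check (values are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([[2.0, 0.5, -1.0],    # model 1
                   [0.0, 1.5, -0.5]])   # model 2

p = softmax(logits)                      # per-model probabilities

soft_vote = p.mean(axis=0)               # probability averaging
logit_avg = softmax(logits.mean(axis=0)) # logit averaging

# softmax(mean logits) == normalized geometric mean of probabilities
geo = np.exp(np.log(p).mean(axis=0))
geo /= geo.sum()

print(np.allclose(logit_avg, geo))        # True
print(np.allclose(soft_vote, logit_avg))  # False: the two rules differ
```

In practice the two rules usually agree on the argmax but can produce noticeably different probability estimates, which matters for calibration-sensitive applications.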
3. Weighted Averaging
Weight each model by its validation performance.
$$P_{ens}(y|x) = \sum_{b=1}^{B} w_b \cdot P_b(y|x), \quad \text{where } \sum_b w_b = 1$$
Weights can be computed as:
- inversely proportional to each model's validation error,
- a softmax over negative validation errors (with a temperature controlling sharpness), or
- by directly optimizing the ensemble's validation log-loss subject to the weights summing to 1.
4. Stacking (Meta-Learning)
Train a meta-model to combine base model predictions.
```python
import torch
import torch.nn as nn
import numpy as np
from scipy.optimize import minimize


class EnsembleAggregator:
    """
    Collection of aggregation strategies for neural network ensembles.
    """

    @staticmethod
    def probability_average(probabilities: torch.Tensor) -> torch.Tensor:
        """
        Simple arithmetic mean of probabilities.

        Input:  (n_models, batch_size, n_classes)
        Output: (batch_size, n_classes)
        """
        return probabilities.mean(dim=0)

    @staticmethod
    def logit_average(logits: torch.Tensor) -> torch.Tensor:
        """
        Average logits then apply softmax.

        Input:  (n_models, batch_size, n_classes)
        Output: (batch_size, n_classes)
        """
        avg_logits = logits.mean(dim=0)
        return torch.softmax(avg_logits, dim=1)

    @staticmethod
    def geometric_mean(probabilities: torch.Tensor) -> torch.Tensor:
        """
        Geometric mean of probabilities (normalized).

        Equivalent to averaging log-probabilities.
        """
        log_probs = torch.log(probabilities + 1e-10)
        avg_log_probs = log_probs.mean(dim=0)
        unnormalized = torch.exp(avg_log_probs)
        return unnormalized / unnormalized.sum(dim=1, keepdim=True)

    @staticmethod
    def weighted_average(
        probabilities: torch.Tensor, weights: torch.Tensor
    ) -> torch.Tensor:
        """
        Weighted average with given model weights.

        weights: (n_models,) - should sum to 1
        """
        # Reshape weights for broadcasting
        weights = weights.view(-1, 1, 1)  # (n_models, 1, 1)
        weighted = probabilities * weights
        return weighted.sum(dim=0)


def learn_ensemble_weights(
    models: list, val_loader, method: str = 'softmax', temperature: float = 1.0
):
    """
    Learn optimal ensemble weights from validation data.

    Methods:
    - 'inverse_error': weights inversely proportional to error
    - 'softmax': softmax of negative error
    - 'optimize': direct optimization of ensemble loss
    """
    device = next(models[0].parameters()).device

    # Collect predictions and labels
    all_probs = []
    all_labels = []
    for X_batch, y_batch in val_loader:
        X_batch = X_batch.to(device)
        batch_probs = []
        for model in models:
            model.eval()
            with torch.no_grad():
                logits = model(X_batch)
                probs = torch.softmax(logits, dim=1)
                batch_probs.append(probs.cpu())
        all_probs.append(torch.stack(batch_probs, dim=0))
        all_labels.append(y_batch)

    # Stack all: (n_models, total_samples, n_classes)
    all_probs = torch.cat(all_probs, dim=1)
    all_labels = torch.cat(all_labels, dim=0)

    # Compute per-model accuracy
    n_models = len(models)
    accuracies = []
    for i in range(n_models):
        preds = all_probs[i].argmax(dim=1)
        acc = (preds == all_labels).float().mean().item()
        accuracies.append(acc)
    errors = [1 - acc for acc in accuracies]

    if method == 'inverse_error':
        # Weight inversely proportional to error
        inv_errors = [1 / (e + 1e-6) for e in errors]
        total = sum(inv_errors)
        weights = [w / total for w in inv_errors]
    elif method == 'softmax':
        # Softmax of negative errors
        neg_errors = [-e / temperature for e in errors]
        exp_vals = [np.exp(x) for x in neg_errors]
        total = sum(exp_vals)
        weights = [v / total for v in exp_vals]
    elif method == 'optimize':
        # Direct optimization of ensemble log-loss
        def ensemble_loss(w):
            w = np.clip(w, 0.01, None)
            w = w / w.sum()
            weights_tensor = torch.tensor(w, dtype=torch.float32).view(-1, 1, 1)
            ensemble_probs = (all_probs * weights_tensor).sum(dim=0)
            # Cross-entropy loss
            log_probs = torch.log(ensemble_probs + 1e-10)
            selected_log_probs = log_probs[range(len(all_labels)), all_labels]
            return -selected_log_probs.mean().item()

        # Optimize weights
        init_weights = np.ones(n_models) / n_models
        result = minimize(
            ensemble_loss,
            init_weights,
            method='L-BFGS-B',
            bounds=[(0.01, None)] * n_models
        )
        weights = result.x / result.x.sum()
        weights = weights.tolist()
    else:
        # Equal weights
        weights = [1.0 / n_models] * n_models

    return torch.tensor(weights, dtype=torch.float32), accuracies


class StackingMetaLearner(nn.Module):
    """
    Meta-learner for stacking ensemble.

    Takes predictions from base models and learns optimal
    combination through a small neural network.
    """

    def __init__(self, n_models, n_classes, hidden_dim=32):
        super().__init__()
        input_dim = n_models * n_classes
        self.meta_network = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim, n_classes)
        )

    def forward(self, base_predictions):
        """
        base_predictions: (batch_size, n_models, n_classes)
        """
        # Flatten model predictions
        batch_size = base_predictions.size(0)
        flat = base_predictions.view(batch_size, -1)
        return self.meta_network(flat)

    def fit(self, base_predictions, labels, epochs=100, lr=0.01):
        """
        Train the meta-learner.

        base_predictions: (n_samples, n_models, n_classes)
        labels: (n_samples,)
        """
        optimizer = torch.optim.Adam(self.parameters(), lr=lr)
        criterion = nn.CrossEntropyLoss()
        dataset = torch.utils.data.TensorDataset(base_predictions, labels)
        loader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=True)

        for epoch in range(epochs):
            total_loss = 0
            for X_batch, y_batch in loader:
                optimizer.zero_grad()
                outputs = self.forward(X_batch)
                loss = criterion(outputs, y_batch)
                loss.backward()
                optimizer.step()
                total_loss += loss.item()
            if (epoch + 1) % 20 == 0:
                print(f"Epoch {epoch+1}: loss = {total_loss/len(loader):.4f}")
```

Start with simple probability averaging—it works well in most cases. Move to weighted averaging if models have significantly different performances. Use stacking only when you have sufficient held-out data and the task justifies the added complexity.
Deploying neural network ensembles in production requires careful attention to latency, memory, and maintainability. Here are key considerations and solutions:
Solution 1: Knowledge Distillation
Train a single "student" network to mimic the ensemble's predictions.
$$\mathcal{L}_{distill} = (1-\alpha) \cdot \mathcal{L}_{CE}(y, p_s) + \alpha \cdot \mathcal{L}_{KL}(p_t^\tau, p_s^\tau)$$
Where:
- $p_s$ and $p_t$ are the student and teacher (ensemble) predictive distributions,
- $\tau$ is a temperature, with the superscript denoting a softmax computed at that temperature to soften both distributions,
- $\alpha$ balances the hard-label cross-entropy term against the distillation (KL) term.
The student captures most of the ensemble's knowledge with 1/B the inference cost.
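The objective above can be sketched in a few lines of numpy. In this minimal illustration (function names and values are hypothetical; the teacher logits stand in for the ensemble's combined output), matching the teacher drives the KL term to zero:

```python
import numpy as np

def softmax(z, tau=1.0):
    z = z / tau
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, y, alpha=0.7, tau=2.0):
    """(1 - alpha) * CE(y, p_s) + alpha * KL(p_t^tau || p_s^tau)."""
    p_s = softmax(student_logits)            # hard-label predictions
    p_s_tau = softmax(student_logits, tau)   # softened student
    p_t_tau = softmax(teacher_logits, tau)   # softened teacher

    ce = -np.log(p_s[np.arange(len(y)), y] + 1e-10).mean()
    kl = (p_t_tau * (np.log(p_t_tau + 1e-10)
                     - np.log(p_s_tau + 1e-10))).sum(axis=1).mean()
    return (1 - alpha) * ce + alpha * kl

teacher = np.array([[3.0, 0.0, -1.0]])
y = np.array([0])

loss_match = distillation_loss(teacher, teacher, y)             # student == teacher
loss_off = distillation_loss(np.array([[0.0, 3.0, -1.0]]), y=y,
                             teacher_logits=teacher)            # disagreeing student
print(loss_match < loss_off)  # True: matching the teacher lowers the loss
```

Some formulations additionally scale the KL term by $\tau^2$ to keep gradient magnitudes comparable across temperatures; the sketch follows the equation as written above.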
Solution 2: Parallel Inference
Run the ensemble's members in parallel across multiple GPUs, or batch all member forward passes together on a single device, so that inference latency stays close to that of one model.
Solution 3: Model Pruning and Quantization
Reduce each model's footprint while preserving ensemble benefits:
- prune low-magnitude weights or entire channels from each member,
- quantize weights and activations (e.g., INT8 post-training quantization),
- share early layers across members and branch only in later layers.
Solution 4: Cascaded Ensembles
Only use the full ensemble when uncertainty is high:
1. Run a single fast model (or one ensemble member) on every input.
2. If its confidence exceeds a threshold, return its prediction immediately.
3. Otherwise, escalate to the full ensemble for a more reliable answer.
This reduces average inference cost while preserving accuracy on hard examples.
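The cascade logic itself is simple. A sketch with stand-in predict functions (the `fast_model` and `ensemble` callables and the threshold value are hypothetical), using the cheap model's predictive entropy as the escalation trigger:

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a probability vector."""
    return -(p * np.log(p + 1e-10)).sum()

def cascaded_predict(x, fast_model, ensemble, threshold=0.5):
    """Use the cheap model when it is confident; escalate otherwise.

    fast_model / ensemble are assumed to return class-probability
    vectors for input x (hypothetical interfaces for this sketch).
    """
    p_fast = fast_model(x)
    if entropy(p_fast) < threshold:
        return int(p_fast.argmax()), 'fast'
    p_ens = ensemble(x)
    return int(p_ens.argmax()), 'ensemble'

# Toy stand-ins: a confident and an uncertain "fast model"
confident = lambda x: np.array([0.95, 0.03, 0.02])
uncertain = lambda x: np.array([0.40, 0.35, 0.25])
full = lambda x: np.array([0.20, 0.70, 0.10])

print(cascaded_predict(None, confident, full))  # (0, 'fast')
print(cascaded_predict(None, uncertain, full))  # (1, 'ensemble')
```

The threshold is tuned on validation data to trade average latency against accuracy on the hard inputs that actually reach the full ensemble.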
| Strategy | Memory | Latency | Accuracy | Complexity |
|---|---|---|---|---|
| Full Ensemble | B× | B× (sequential) or 1× (parallel) | Best | High |
| Distilled Student | 1× | 1× | 95-99% of ensemble | Medium |
| Cascaded Ensemble | B× | 1-2× average | Near ensemble | High |
| Quantized Ensemble | B/4× | B/2-4× | 99% of ensemble | Medium |
| MC Dropout | 1× | T× | ~90% of ensemble | Low |
For most applications, train a full ensemble offline for best accuracy, then distill to a single model for production serving. Use the full ensemble as a fallback for high-stakes predictions or when the student model is uncertain.
Neural network ensembles represent a powerful approach to improving deep learning reliability and uncertainty quantification. Let's consolidate the key insights:
- Neural networks have several variance sources (initialization, mini-batch order, bootstrap sampling, architecture), and each is a lever for ensemble diversity.
- Training the same architecture from different random seeds captures much of the ensemble benefit, and 3-5 members usually suffice.
- Dropout is an implicit ensemble; MC Dropout makes it explicit at inference time, trading some uncertainty quality for single-model training cost.
- Probability averaging is a strong default aggregation; weighted averaging and stacking help when member quality varies markedly.
- For production, distill the ensemble into a single student and keep the full ensemble as a fallback for uncertain, high-stakes inputs.
What's Next:
Having mastered ensemble construction for both trees and neural networks, we'll now examine Model Aggregation Strategies in greater depth—exploring voting schemes, probability calibration, and advanced combination methods that apply across model families.
You now understand how to apply bagging principles to neural networks, the relationship to dropout, and practical strategies for training and deploying neural network ensembles. This knowledge enables you to build more reliable deep learning systems.