In our journey through ensemble learning, we've established why ensembles work intuitively. Now we dive deeper into the mathematics, performing a rigorous decomposition of prediction error that reveals exactly how ensemble methods achieve their improvements.
The classic bias-variance tradeoff is familiar to most ML practitioners. But ensembles introduce a crucial third term: covariance. Understanding this three-way decomposition is essential for designing optimal ensemble strategies—knowing when to add more models, how to balance diversity and accuracy, and why certain ensemble configurations outperform others.
This page presents the complete mathematical framework for ensemble error analysis, providing the theoretical foundation for everything from bagging to sophisticated stacked ensembles.
By the end of this page, you will master the bias-variance-covariance decomposition for ensemble methods. You'll understand how each component contributes to total error, derive the conditions for optimal variance reduction, and develop intuition for when ensembles will and won't help.
Before extending to ensembles, let's establish the classic decomposition for a single model.
Setup:
Suppose we have a true function $f(x)$ with additive noise:
$$y = f(x) + \epsilon$$
where $\mathbb{E}[\epsilon] = 0$ and $\text{Var}(\epsilon) = \sigma^2$ (irreducible noise).
We train a model $\hat{f}(x)$ on a random training set $D$ drawn from some distribution. The model $\hat{f}$ is random because it depends on the random training set.
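As a concrete, purely illustrative instance of this setup, here is a minimal data generator; the choice of $f$ and $\sigma$ below is an assumption for the example, not anything prescribed by the theory:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """An assumed 'true' function for illustration."""
    return np.sin(2 * np.pi * x)

sigma = 0.3                                    # irreducible noise level
x = rng.uniform(0, 1, size=200)                # inputs
y = f(x) + rng.normal(0, sigma, size=x.shape)  # y = f(x) + epsilon
# Re-running this block with a fresh seed plays the role of drawing a new training set D.
```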
Mean Squared Error Decomposition:
For a fixed test point $x$, the expected squared error over all possible training sets is:
$$\mathbb{E}_D[(y - \hat{f}(x))^2]$$
Expanding this expectation (where expectation is over both the random training set and the random noise):
$$\mathbb{E}[(y - \hat{f}(x))^2] = \underbrace{(f(x) - \mathbb{E}_D[\hat{f}(x)])^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}_D[(\hat{f}(x) - \mathbb{E}_D[\hat{f}(x)])^2]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible Noise}}$$
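Where this comes from, in one step: writing $\bar{f}(x) = \mathbb{E}_D[\hat{f}(x)]$, split the error as

$$y - \hat{f}(x) = \epsilon + \big(f(x) - \bar{f}(x)\big) + \big(\bar{f}(x) - \hat{f}(x)\big)$$

Squaring and taking expectations, every cross term vanishes because $\mathbb{E}[\epsilon] = 0$ (with $\epsilon$ independent of the training set) and $\mathbb{E}_D[\bar{f}(x) - \hat{f}(x)] = 0$, leaving exactly the three terms above.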
Bias² measures how far the average prediction is from the truth—systematic error that doesn't go away with more training sets. Variance measures how much predictions fluctuate across different training sets—instability in the learning process. Irreducible noise is randomness inherent in the data that no model can capture.
The Tradeoff:
Complex, flexible models (deep trees, neural networks) tend to have:
- Low bias: they can fit almost any underlying pattern
- High variance: their predictions change substantially from one training set to the next

Simple, constrained models (linear regression, shallow trees) tend to have:
- High bias: they may miss real structure in the data
- Low variance: their predictions are stable across training sets
The challenge: Can we achieve low bias AND low variance simultaneously?
The ensemble answer: Yes—by averaging multiple high-variance, low-bias models!
Now let's extend this analysis to an ensemble of $M$ models $\hat{f}_1, \hat{f}_2, \ldots, \hat{f}_M$, combined via simple averaging:
$$\hat{f}_{\text{ens}}(x) = \frac{1}{M}\sum_{i=1}^{M} \hat{f}_i(x)$$
Ensemble Bias:
The bias of the ensemble is the expected prediction minus the true value:
$$\text{Bias}_{\text{ens}} = \mathbb{E}\left[\frac{1}{M}\sum_{i=1}^{M} \hat{f}_i(x)\right] - f(x) = \frac{1}{M}\sum_{i=1}^{M} \mathbb{E}[\hat{f}_i(x)] - f(x)$$
If all base models have the same expected prediction (i.e., same bias), then:
$$\text{Bias}_{\text{ens}} = \mathbb{E}[\hat{f}_i(x)] - f(x) = \text{Bias}_{\text{individual}}$$
Key Insight: Averaging does NOT reduce bias. The ensemble bias equals the average bias of its components.
If every model in your ensemble systematically underestimates by 10%, the ensemble also underestimates by 10%. Averaging cannot correct systematic error—for that, you need boosting or bias-correcting techniques.
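A minimal numerical sanity check of this point (a toy sketch; the 10% underestimate and the noise level are made-up values, not trained models):

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 100.0
M = 50            # models per ensemble
n_repeats = 2000  # repeated "training runs" to estimate bias and spread

# Each model systematically underestimates by 10% and adds independent noise.
single = 0.9 * true_value + rng.normal(0, 5.0, size=n_repeats)
ensemble = (0.9 * true_value + rng.normal(0, 5.0, size=(n_repeats, M))).mean(axis=1)

print(f"Single model:    bias = {single.mean() - true_value:+.2f}, std = {single.std():.2f}")
print(f"Ensemble (M=50): bias = {ensemble.mean() - true_value:+.2f}, std = {ensemble.std():.2f}")
# The bias stays at roughly -10 in both cases; only the spread shrinks.
```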
Ensemble Variance:
This is where the magic happens. The variance of the ensemble average is:
$$\text{Var}\left(\frac{1}{M}\sum_{i=1}^{M} \hat{f}_i(x)\right) = \frac{1}{M^2}\text{Var}\left(\sum_{i=1}^{M} \hat{f}_i(x)\right)$$
Expanding the variance of the sum:
$$= \frac{1}{M^2}\left[\sum_{i=1}^{M}\text{Var}(\hat{f}_i(x)) + 2\sum_{i<j}\text{Cov}(\hat{f}_i(x), \hat{f}_j(x))\right]$$
If all models have the same variance $\sigma^2_{\text{model}}$ and pairwise covariance $\text{Cov}$:
$$= \frac{1}{M^2}\left[M\sigma^2_{\text{model}} + M(M-1)\text{Cov}\right]$$
$$= \frac{\sigma^2_{\text{model}}}{M} + \frac{M-1}{M}\text{Cov}$$
As M → ∞:
$$\text{Var}_{\text{ens}} \to \text{Cov}$$
The ensemble variance converges to the average pairwise covariance, NOT to zero!
Even with infinitely many models, variance reduction is limited by covariance. If models have correlation ρ, the variance floor is ρσ². This is why diversity (low correlation) is essential—it determines how much variance reduction is theoretically possible.
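The floor can be checked directly by simulation. The sketch below draws correlated "predictions" from an equicorrelated Gaussian (a toy stand-in for trained models, assumed here for illustration) and compares the empirical variance of their average with the formula:

```python
import numpy as np

def simulated_ensemble_variance(M, sigma2=1.0, rho=0.4, n_trials=20000, seed=0):
    """Empirical variance of the average of M equicorrelated 'predictions'."""
    rng = np.random.default_rng(seed)
    # Equicorrelated covariance: sigma2 on the diagonal, rho*sigma2 off it.
    cov = sigma2 * (rho * np.ones((M, M)) + (1 - rho) * np.eye(M))
    preds = rng.multivariate_normal(np.zeros(M), cov, size=n_trials)
    return preds.mean(axis=1).var()

for M in [1, 5, 25, 200]:
    empirical = simulated_ensemble_variance(M, rho=0.4)
    theoretical = 0.4 + 0.6 / M   # rho*sigma2 + (1 - rho)*sigma2 / M
    print(f"M = {M:3d}: empirical = {empirical:.3f}, theoretical = {theoretical:.3f}")
# Both columns approach the floor rho*sigma2 = 0.4 as M grows.
```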
Putting it all together, the expected squared error of an ensemble can be written as:
$$\text{MSE}_{\text{ens}} = \text{Bias}^2 + \frac{\sigma^2_{\text{model}}}{M} + \frac{M-1}{M}\cdot\text{Cov} + \sigma^2_{\text{noise}}$$
Or equivalently, using correlation $\rho = \text{Cov}/\sigma^2_{\text{model}}$:
$$\text{MSE}_{\text{ens}} = \text{Bias}^2 + \left[\rho + \frac{1-\rho}{M}\right]\sigma^2_{\text{model}} + \sigma^2_{\text{noise}}$$
The Three Regimes:
| Correlation (ρ) | Variance Term | Interpretation |
|---|---|---|
| $\rho = 0$ (independent) | $\frac{\sigma^2}{M}$ | Perfect scaling: variance decreases linearly with M |
| $\rho = 0.5$ (moderate) | $0.5\sigma^2 + \frac{0.5\sigma^2}{M}$ | Floor at 50% of original variance; diminishing returns |
| $\rho = 1$ (identical) | $\sigma^2$ | No reduction: ensemble = single model |
Visualization of Variance Reduction:
Let's see how ensemble variance behaves as we add more models under different correlation assumptions:
```python
import numpy as np
import matplotlib.pyplot as plt


def ensemble_variance(M, individual_var, correlation):
    """
    Compute ensemble variance using the BVC decomposition.

    Var_ensemble = ρσ² + (1-ρ)σ²/M

    Args:
        M: Number of models in ensemble
        individual_var: Variance of individual models (σ²)
        correlation: Average pairwise correlation (ρ)

    Returns:
        Ensemble variance
    """
    return correlation * individual_var + (1 - correlation) * individual_var / M


def plot_variance_reduction():
    """Visualize ensemble variance across model counts and correlations."""
    individual_var = 1.0  # Normalized
    M_values = np.arange(1, 101)
    correlations = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

    plt.figure(figsize=(12, 6))

    # Left plot: Variance vs number of models
    plt.subplot(1, 2, 1)
    for rho in correlations:
        variances = [ensemble_variance(M, individual_var, rho) for M in M_values]
        plt.plot(M_values, variances, label=f'ρ = {rho}', linewidth=2)
    plt.xlabel('Number of Models (M)', fontsize=12)
    plt.ylabel('Ensemble Variance', fontsize=12)
    plt.title('Variance Reduction vs Ensemble Size', fontsize=14)
    plt.legend(title='Correlation')
    plt.grid(True, alpha=0.3)
    plt.ylim(0, 1.1)

    # Right plot: Variance floor vs correlation
    plt.subplot(1, 2, 2)
    rho_values = np.linspace(0, 1, 100)
    M_scenarios = [5, 10, 25, 100, np.inf]
    for M in M_scenarios:
        if M == np.inf:
            variances = [rho * individual_var for rho in rho_values]
            label = 'M → ∞ (floor)'
        else:
            variances = [ensemble_variance(M, individual_var, rho) for rho in rho_values]
            label = f'M = {M}'
        plt.plot(rho_values, variances, label=label, linewidth=2)
    plt.xlabel('Correlation (ρ)', fontsize=12)
    plt.ylabel('Ensemble Variance', fontsize=12)
    plt.title('Variance vs Correlation (Diversity)', fontsize=14)
    plt.legend(title='Ensemble Size')
    plt.grid(True, alpha=0.3)
    plt.ylim(0, 1.1)

    plt.tight_layout()
    plt.savefig('ensemble_variance_analysis.png', dpi=150)
    plt.show()


def demonstration():
    """Print key insights from the decomposition."""
    print("Bias-Variance-Covariance Decomposition Analysis")
    print("=" * 60)

    # Scenario 1: Independent models
    print("\nScenario 1: Independent Models (ρ = 0)")
    print("-" * 40)
    for M in [1, 5, 10, 25, 100]:
        var = ensemble_variance(M, 1.0, 0.0)
        reduction = (1 - var) * 100
        print(f"  M = {M:3d}: Variance = {var:.4f} (↓ {reduction:.1f}%)")

    # Scenario 2: Realistic correlation
    print("\nScenario 2: Moderate Correlation (ρ = 0.4)")
    print("-" * 40)
    for M in [1, 5, 10, 25, 100]:
        var = ensemble_variance(M, 1.0, 0.4)
        reduction = (1 - var) * 100
        floor = 0.4  # ρσ²
        print(f"  M = {M:3d}: Variance = {var:.4f} (↓ {reduction:.1f}%, floor = {floor})")

    # Scenario 3: High correlation (similar models)
    print("\nScenario 3: High Correlation (ρ = 0.8)")
    print("-" * 40)
    for M in [1, 5, 10, 25, 100]:
        var = ensemble_variance(M, 1.0, 0.8)
        reduction = (1 - var) * 100
        print(f"  M = {M:3d}: Variance = {var:.4f} (↓ {reduction:.1f}%)")

    # Key insight
    print("\n" + "=" * 60)
    print("KEY INSIGHT: At ρ = 0.4, maximum variance reduction is 60%")
    print("             At ρ = 0.8, maximum variance reduction is only 20%")
    print("             Correlation/diversity is MORE important than ensemble size!")


if __name__ == "__main__":
    demonstration()
    plot_variance_reduction()
```

Given the variance formula, we can analyze when adding more models provides meaningful improvement.
Variance Reduction from Adding the (M+1)-th Model:
$$\Delta\text{Var} = \text{Var}(M) - \text{Var}(M+1)$$
Using our formula:
$$\Delta\text{Var} = \frac{(1-\rho)\sigma^2}{M} - \frac{(1-\rho)\sigma^2}{M+1} = \frac{(1-\rho)\sigma^2}{M(M+1)}$$
Key Observations:
- The marginal gain decays roughly as $1/M^2$: going from 10 to 11 models helps far less than going from 2 to 3.
- The gain scales with $(1-\rho)$: diverse ensembles keep benefiting from new members longer than highly correlated ones.
- No number of additional models removes the covariance floor $\rho\sigma^2$.
Practical Stopping Criterion:
Stop adding models when the marginal variance reduction is below some threshold δ:
$$\frac{(1-\rho)\sigma^2}{M(M+1)} < \delta$$
Solving for M:
$$M \approx \sqrt{\frac{(1-\rho)\sigma^2}{\delta}}$$
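A quick sketch of how this criterion plays out in numbers (the threshold δ below is an arbitrary illustrative choice):

```python
import numpy as np

def models_needed(rho, sigma2=1.0, delta=1e-3):
    """Approximate M at which the marginal variance reduction drops below delta."""
    # From (1 - rho) * sigma2 / (M * (M + 1)) < delta, using M*(M+1) ≈ M².
    return int(np.ceil(np.sqrt((1 - rho) * sigma2 / delta)))

for rho in [0.0, 0.3, 0.6, 0.9]:
    print(f"rho = {rho:.1f}: marginal gains fall below delta around M ≈ {models_needed(rho)}")
# The more correlated the models, the sooner additional members stop paying off.
```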
Worked example with $\sigma^2_{\text{model}} = 1$ and $\rho = 0.3$ (variance floor $\rho\sigma^2 = 0.3$):

| Models (M) | Variance | Reduction from M=1 | Marginal Gain |
|---|---|---|---|
| 1 | 1.000 | — | — |
| 2 | 0.650 | 35.0% | 35.0% |
| 5 | 0.440 | 56.0% | 4.2% per model |
| 10 | 0.370 | 63.0% | 1.4% per model |
| 25 | 0.328 | 67.2% | 0.3% per model |
| 50 | 0.314 | 68.6% | 0.1% per model |
| 100 | 0.307 | 69.3% | 0.03% per model |
| ∞ | 0.300 | 70.0% | 0% |
For most practical purposes, 50-200 base learners capture the majority of possible variance reduction. Random Forests typically use 100-500 trees, Gradient Boosting uses 100-1000 boosters. More isn't wrong—just not much better.
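One way to see why modest ensemble sizes suffice, using the variance formula above: the fraction of the *achievable* reduction $(1-\rho)\sigma^2_{\text{model}}$ captured by $M$ models works out to $1 - 1/M$, independent of $\rho$. A quick check of that claim:

```python
def fraction_of_achievable_reduction(M):
    """Share of the maximum possible variance reduction realized by an M-model ensemble."""
    # (sigma2 - Var(M)) / (sigma2 - rho*sigma2) simplifies to 1 - 1/M for any rho < 1.
    return 1 - 1 / M

for M in [10, 50, 100, 200]:
    print(f"M = {M:3d}: {fraction_of_achievable_reduction(M):.1%} of the possible reduction")
# 50 models already realize 98% of what is theoretically attainable.
```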
Theory is elegant, but can we actually measure these components in practice? Yes—through bootstrap resampling and careful experimental design.
Estimating Bias:
Train the same model on many bootstrap samples; the average prediction across all models estimates $\mathbb{E}[\hat{f}(x)]$:
$$\widehat{\text{Bias}} = \frac{1}{B}\sum_{b=1}^{B} \hat{f}^{(b)}(x) - y$$
Estimating Variance:
Variance is the spread of predictions across bootstrap samples:
$$\widehat{\text{Var}} = \frac{1}{B}\sum_{b=1}^{B} \left(\hat{f}^{(b)}(x) - \frac{1}{B}\sum_{b'} \hat{f}^{(b')}(x)\right)^2$$
Estimating Covariance:
Pairwise covariance between model predictions:
$$\widehat{\text{Cov}} = \frac{1}{B(B-1)}\sum_{b \neq b'} \left(\hat{f}^{(b)}(x) - \bar{f}(x)\right)\left(\hat{f}^{(b')}(x) - \bar{f}(x)\right)$$
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split


def estimate_bias_variance_covariance(X_train, y_train, X_test, y_test,
                                      model_class, n_bootstrap=200):
    """
    Empirically estimate bias, variance, and covariance for a model.

    Args:
        X_train, y_train: Training data
        X_test, y_test: Test data for evaluation
        model_class: Callable that returns a fresh model with fit() and predict()
        n_bootstrap: Number of bootstrap samples

    Returns:
        Dictionary with bias², variance, covariance estimates
    """
    n_test = len(X_test)
    predictions = np.zeros((n_bootstrap, n_test))

    # Generate predictions from models trained on bootstrap samples
    for b in range(n_bootstrap):
        # Bootstrap sample
        indices = np.random.choice(len(X_train), size=len(X_train), replace=True)
        X_b, y_b = X_train[indices], y_train[indices]

        # Train and predict
        model = model_class()
        model.fit(X_b, y_b)
        predictions[b] = model.predict(X_test)

    # Average prediction across bootstrap samples
    mean_prediction = predictions.mean(axis=0)

    # Bias² (squared difference between mean prediction and observed value)
    bias_squared = ((mean_prediction - y_test) ** 2).mean()

    # Variance (spread around the mean prediction, averaged over test points)
    variance = ((predictions - mean_prediction) ** 2).mean()

    # Covariance: average pairwise covariance between bootstrap models,
    # computed per test point and then averaged across test points
    centered = predictions - mean_prediction
    covariances = []
    for i in range(n_test):
        pred_i = centered[:, i]
        cov_matrix = np.outer(pred_i, pred_i)
        # Sum of off-diagonal elements divided by the number of ordered pairs
        off_diag_sum = cov_matrix.sum() - np.diag(cov_matrix).sum()
        covariances.append(off_diag_sum / (n_bootstrap * (n_bootstrap - 1)))
    covariance = np.mean(covariances)

    # Correlation coefficient
    correlation = covariance / variance if variance > 0 else 0

    # Average MSE of the individual models
    individual_mse = ((predictions - y_test) ** 2).mean()

    # MSE of the averaged (ensemble) prediction
    ensemble_prediction = predictions.mean(axis=0)
    ensemble_mse = ((ensemble_prediction - y_test) ** 2).mean()

    return {
        'bias_squared': bias_squared,
        'variance': variance,
        'covariance': covariance,
        'correlation': correlation,
        'individual_mse': individual_mse,
        'ensemble_mse': ensemble_mse,
        'variance_reduction': variance - ensemble_mse + bias_squared,
        'theoretical_ensemble_var': correlation * variance + (1 - correlation) * variance / n_bootstrap
    }


def demonstrate_bvc():
    """Demonstrate BVC decomposition on synthetic data."""
    # Create regression problem
    X, y = make_regression(n_samples=1000, n_features=10, noise=10, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    print("Bias-Variance-Covariance Decomposition Analysis")
    print("=" * 65)

    # Compare different model complexities
    configs = [
        ("Shallow Tree (depth=2)", lambda: DecisionTreeRegressor(max_depth=2)),
        ("Medium Tree (depth=5)", lambda: DecisionTreeRegressor(max_depth=5)),
        ("Deep Tree (depth=10)", lambda: DecisionTreeRegressor(max_depth=10)),
        ("Full Tree (no limit)", lambda: DecisionTreeRegressor(max_depth=None)),
    ]

    for name, model_class in configs:
        print(f"\n{name}")
        print("-" * 50)
        results = estimate_bias_variance_covariance(
            X_train, y_train, X_test, y_test, model_class, n_bootstrap=100
        )
        print(f"  Bias²:          {results['bias_squared']:.4f}")
        print(f"  Variance:       {results['variance']:.4f}")
        print(f"  Covariance:     {results['covariance']:.4f}")
        print(f"  Correlation ρ:  {results['correlation']:.4f}")
        print(f"  Individual MSE: {results['individual_mse']:.4f}")
        print(f"  Ensemble MSE:   {results['ensemble_mse']:.4f}")
        print(f"  Improvement:    {(1 - results['ensemble_mse']/results['individual_mse'])*100:.1f}%")


if __name__ == "__main__":
    demonstrate_bvc()
```

The code above will reveal a crucial pattern: deep trees have high variance but low bias, making them ideal ensemble components. Shallow trees have low variance but higher bias—ensembling them provides less improvement because there's less variance to reduce.
An alternative error decomposition, developed by Krogh and Vedelsby (1995), provides another perspective on ensemble error. For regression with simple averaging:
Krogh-Vedelsby Decomposition:
$$\bar{E} = \bar{E}_{\text{ensemble}} + \bar{A}$$
Where:
- $\bar{E} = \frac{1}{M}\sum_{i=1}^{M} (y - \hat{f}_i(x))^2$ is the average squared error of the individual models
- $\bar{E}_{\text{ensemble}} = (y - \hat{f}_{\text{ens}}(x))^2$ is the squared error of the ensemble prediction
- $\bar{A} = \frac{1}{M}\sum_{i=1}^{M} (\hat{f}_i(x) - \hat{f}_{\text{ens}}(x))^2$ is the ambiguity: the average squared deviation of the individual predictions from the ensemble prediction
Rearranging:
$$\bar{E}_{\text{ensemble}} = \bar{E} - \bar{A}$$
The Profound Implication:
Ensemble error = Average individual error - Ambiguity (disagreement)
This is a pure identity—always true, no assumptions needed!
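Because it is an identity, it can be verified on any numbers at all. A minimal check with arbitrary (untrained) predictions:

```python
import numpy as np

rng = np.random.default_rng(1)
y = 3.0                                # true target at one test point
preds = rng.normal(2.0, 1.5, size=7)   # arbitrary predictions from M = 7 "models"

f_ens = preds.mean()
avg_individual_error = ((y - preds) ** 2).mean()   # E-bar
ambiguity = ((preds - f_ens) ** 2).mean()          # A-bar
ensemble_error = (y - f_ens) ** 2                  # E-bar_ensemble

print(f"E_bar - A_bar  = {avg_individual_error - ambiguity:.6f}")
print(f"Ensemble error = {ensemble_error:.6f}")
# The two lines agree exactly (up to floating-point rounding), with no assumptions about the models.
```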
Interpretation:
Ambiguity measures how much base learners disagree. Higher disagreement → lower ensemble error (given fixed individual performance).
This provides direct mathematical justification for diversity: we want models that are individually accurate but mutually disagreeing.
The Krogh-Vedelsby decomposition makes diversity's value mathematically precise. Every unit of 'disagreement' (ambiguity) directly subtracts from ensemble error. This is why we should celebrate when our models make different predictions—it's the mechanism through which ensembles improve.
Connection to Bias-Variance-Covariance:
The two decompositions are related but different:
| Decomposition | Focus | When Useful |
|---|---|---|
| Bias-Variance-Covariance | Expected error over training sets | Understanding generalization |
| Krogh-Vedelsby (Ambiguity) | Actual error on a given dataset | Real-time ensemble diagnostics |
The BVC decomposition involves expectations over hypothetical training sets. The ambiguity decomposition is an exact identity for specific predictions—useful for monitoring ensemble behavior in production.
Given that low (or negative) correlation between model errors is valuable, can we explicitly train models to be negatively correlated?
Negative Correlation Learning (NCL):
Proposed by Liu and Yao (1999), NCL modifies the training objective to penalize models for making similar errors:
$$E_i = \frac{1}{N}\sum_{n=1}^{N} \left[(y_n - \hat{f}_i(x_n))^2 + \lambda \cdot p_i(x_n)\right]$$
Where $p_i(x)$ is a penalty term that encourages diversity:
$$p_i(x) = (\hat{f}_i(x) - \hat{f}_{\text{ens}}(x)) \sum_{j \neq i}(\hat{f}_j(x) - \hat{f}_{\text{ens}}(x))$$
Intuition:
If model $i$'s prediction is on the same side of the ensemble mean as other models, the penalty is positive—discouraging redundant predictions. If model $i$ is on the opposite side, the penalty is negative—rewarding diversity.
Trade-off:
A larger $\lambda$ pushes models apart (more diversity) at the cost of each model's individual accuracy; $\lambda = 0$ recovers independent training, so tuning $\lambda$ is itself a balancing act.
In practice, NCL is rarely used compared to simpler diversity-inducing methods like bagging and random subspaces. The added complexity often doesn't justify the marginal improvement over well-tuned Random Forests. But it remains theoretically important.
We've developed a rigorous mathematical understanding of how ensembles reduce prediction error. Let's consolidate the key results:
- Averaging does not change bias: the ensemble inherits the average bias of its members.
- Ensemble variance is $\rho\sigma^2_{\text{model}} + (1-\rho)\sigma^2_{\text{model}}/M$; as $M$ grows it converges to the covariance floor $\rho\sigma^2_{\text{model}}$, not to zero.
- Diversity (low correlation) sets how much variance reduction is possible; ensemble size determines how much of that potential is actually captured, with sharply diminishing returns.
- The Krogh-Vedelsby identity states that ensemble error equals average individual error minus ambiguity, an exact, assumption-free justification for diversity.
What's Next:
Having understood the mathematics, we now turn to the practical implications. The next page explores Diversity Requirement—how to create the uncorrelated errors that the theory demands, and how to measure diversity in practice.
You now have a complete mathematical framework for ensemble error analysis. The bias-variance-covariance decomposition and the ambiguity decomposition together explain when, why, and how much ensembles help. This foundation supports all the practical ensemble methods we'll study.