In our journey through ensemble learning, we've established why ensembles work intuitively. Now we dive deeper into the mathematics, performing a rigorous decomposition of prediction error that reveals exactly how ensemble methods achieve their improvements.
The classic bias-variance tradeoff is familiar to most ML practitioners. But ensembles introduce a crucial third term: covariance. Understanding this three-way decomposition is essential for designing optimal ensemble strategies—knowing when to add more models, how to balance diversity and accuracy, and why certain ensemble configurations outperform others.
This page presents the complete mathematical framework for ensemble error analysis, providing the theoretical foundation for everything from bagging to sophisticated stacked ensembles.
By the end of this page, you will master the bias-variance-covariance decomposition for ensemble methods. You'll understand how each component contributes to total error, derive the conditions for optimal variance reduction, and develop intuition for when ensembles will and won't help.
Before extending to ensembles, let's establish the classic decomposition for a single model.
Setup:
Suppose we have a true function $f(x)$ with additive noise:
$$y = f(x) + \epsilon$$
where $\mathbb{E}[\epsilon] = 0$ and $\text{Var}(\epsilon) = \sigma^2$ (irreducible noise).
We train a model $\hat{f}(x)$ on a random training set $D$ drawn from some distribution. The model $\hat{f}$ is random because it depends on the random training set.
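As a concrete, purely illustrative instance of this setup, here is a minimal data generator; the choice of $f$ and $\sigma$ below is an assumption for the example, not anything prescribed by the theory:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """An assumed 'true' function for illustration."""
    return np.sin(2 * np.pi * x)

sigma = 0.3                                    # irreducible noise level
x = rng.uniform(0, 1, size=200)                # inputs
y = f(x) + rng.normal(0, sigma, size=x.shape)  # y = f(x) + epsilon
# Re-running this block with a fresh seed plays the role of drawing a new training set D.
```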
Mean Squared Error Decomposition:
For a fixed test point $x$, the expected squared error over all possible training sets is:
$$\mathbb{E}_D[(y - \hat{f}(x))^2]$$
Expanding this expectation (where expectation is over both the random training set and the random noise):
$$\mathbb{E}[(y - \hat{f}(x))^2] = \underbrace{(f(x) - \mathbb{E}_D[\hat{f}(x)])^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}_D[(\hat{f}(x) - \mathbb{E}_D[\hat{f}(x)])^2]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible Noise}}$$
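Where this comes from, in one step: writing $\bar{f}(x) = \mathbb{E}_D[\hat{f}(x)]$, split the error as

$$y - \hat{f}(x) = \epsilon + \big(f(x) - \bar{f}(x)\big) + \big(\bar{f}(x) - \hat{f}(x)\big)$$

Squaring and taking expectations, every cross term vanishes because $\mathbb{E}[\epsilon] = 0$ (with $\epsilon$ independent of the training set) and $\mathbb{E}_D[\bar{f}(x) - \hat{f}(x)] = 0$, leaving exactly the three terms above.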
Bias² measures how far the average prediction is from the truth—systematic error that doesn't go away with more training sets. Variance measures how much predictions fluctuate across different training sets—instability in the learning process. Irreducible noise is randomness inherent in the data that no model can capture.
The Tradeoff:
Complex, flexible models (deep trees, neural networks) tend to have:
- Low bias: they can fit almost any underlying pattern
- High variance: their predictions change substantially from one training set to the next

Simple, constrained models (linear regression, shallow trees) tend to have:
- High bias: they may miss real structure in the data
- Low variance: their predictions are stable across training sets
The challenge: Can we achieve low bias AND low variance simultaneously?
The ensemble answer: Yes—by averaging multiple high-variance, low-bias models!
Now let's extend this analysis to an ensemble of $M$ models $\hat{f}_1, \hat{f}_2, \ldots, \hat{f}_M$, combined via simple averaging:
$$\hat{f}_{\text{ens}}(x) = \frac{1}{M}\sum_{i=1}^{M} \hat{f}_i(x)$$
Ensemble Bias:
The bias of the ensemble is the expected prediction minus the true value:
$$\text{Bias}_{\text{ens}} = \mathbb{E}\left[\frac{1}{M}\sum_{i=1}^{M} \hat{f}_i(x)\right] - f(x) = \frac{1}{M}\sum_{i=1}^{M} \mathbb{E}[\hat{f}_i(x)] - f(x)$$
If all base models have the same expected prediction (i.e., same bias), then:
$$\text{Bias}_{\text{ens}} = \mathbb{E}[\hat{f}_i(x)] - f(x) = \text{Bias}_{\text{individual}}$$
Key Insight: Averaging does NOT reduce bias. The ensemble bias equals the average bias of its components.
If every model in your ensemble systematically underestimates by 10%, the ensemble also underestimates by 10%. Averaging cannot correct systematic error—for that, you need boosting or bias-correcting techniques.
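A minimal numerical sanity check of this point (a toy sketch; the 10% underestimate and the noise level are made-up values, not trained models):

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 100.0
M = 50            # models per ensemble
n_repeats = 2000  # repeated "training runs" to estimate bias and spread

# Each model systematically underestimates by 10% and adds independent noise.
single = 0.9 * true_value + rng.normal(0, 5.0, size=n_repeats)
ensemble = (0.9 * true_value + rng.normal(0, 5.0, size=(n_repeats, M))).mean(axis=1)

print(f"Single model:    bias = {single.mean() - true_value:+.2f}, std = {single.std():.2f}")
print(f"Ensemble (M=50): bias = {ensemble.mean() - true_value:+.2f}, std = {ensemble.std():.2f}")
# The bias stays at roughly -10 in both cases; only the spread shrinks.
```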
Ensemble Variance:
This is where the magic happens. The variance of the ensemble average is:
$$\text{Var}\left(\frac{1}{M}\sum_{i=1}^{M} \hat{f}_i(x)\right) = \frac{1}{M^2}\text{Var}\left(\sum_{i=1}^{M} \hat{f}_i(x)\right)$$
Expanding the variance of the sum:
$$= \frac{1}{M^2}\left[\sum_{i=1}^{M}\text{Var}(\hat{f}_i(x)) + 2\sum_{i<j}\text{Cov}(\hat{f}_i(x), \hat{f}_j(x))\right]$$
If all models have the same variance $\sigma^2_{\text{model}}$ and pairwise covariance $\text{Cov}$:
$$= \frac{1}{M^2}\left[M\sigma^2_{\text{model}} + M(M-1)\text{Cov}\right]$$
$$= \frac{\sigma^2_{\text{model}}}{M} + \frac{M-1}{M}\text{Cov}$$
As M → ∞:
$$\text{Var}_{\text{ens}} \to \text{Cov}$$
The ensemble variance converges to the average pairwise covariance, NOT to zero!
Even with infinitely many models, variance reduction is limited by covariance. If models have correlation ρ, the variance floor is ρσ². This is why diversity (low correlation) is essential—it determines how much variance reduction is theoretically possible.
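The floor can be checked directly by simulation. The sketch below draws correlated "predictions" from an equicorrelated Gaussian (a toy stand-in for trained models, assumed here for illustration) and compares the empirical variance of their average with the formula:

```python
import numpy as np

def simulated_ensemble_variance(M, sigma2=1.0, rho=0.4, n_trials=20000, seed=0):
    """Empirical variance of the average of M equicorrelated 'predictions'."""
    rng = np.random.default_rng(seed)
    # Equicorrelated covariance: sigma2 on the diagonal, rho*sigma2 off it.
    cov = sigma2 * (rho * np.ones((M, M)) + (1 - rho) * np.eye(M))
    preds = rng.multivariate_normal(np.zeros(M), cov, size=n_trials)
    return preds.mean(axis=1).var()

for M in [1, 5, 25, 200]:
    empirical = simulated_ensemble_variance(M, rho=0.4)
    theoretical = 0.4 + 0.6 / M   # rho*sigma2 + (1 - rho)*sigma2 / M
    print(f"M = {M:3d}: empirical = {empirical:.3f}, theoretical = {theoretical:.3f}")
# Both columns approach the floor rho*sigma2 = 0.4 as M grows.
```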
Putting it all together, the expected squared error of an ensemble can be written as:
$$\text{MSE}_{\text{ens}} = \text{Bias}^2 + \frac{\sigma^2_{\text{model}}}{M} + \frac{M-1}{M}\cdot\text{Cov} + \sigma^2_{\text{noise}}$$
Or equivalently, using correlation $\rho = \text{Cov}/\sigma^2_{\text{model}}$:
$$\text{MSE}_{\text{ens}} = \text{Bias}^2 + \left[\rho + \frac{1-\rho}{M}\right]\sigma^2_{\text{model}} + \sigma^2_{\text{noise}}$$
The Three Regimes:
| Correlation (ρ) | Variance Term | Interpretation |
|---|---|---|
| $\rho = 0$ (independent) | $\frac{\sigma^2}{M}$ | Perfect scaling: variance decreases linearly with M |
| $\rho = 0.5$ (moderate) | $0.5\sigma^2 + \frac{0.5\sigma^2}{M}$ | Floor at 50% of original variance; diminishing returns |
| $\rho = 1$ (identical) | $\sigma^2$ | No reduction: ensemble = single model |
Visualization of Variance Reduction:
Let's see how ensemble variance behaves as we add more models under different correlation assumptions:
```python
import numpy as np
import matplotlib.pyplot as plt


def ensemble_variance(M, individual_var, correlation):
    """
    Compute ensemble variance using the BVC decomposition.

    Var_ensemble = ρσ² + (1-ρ)σ²/M

    Args:
        M: Number of models in ensemble
        individual_var: Variance of individual models (σ²)
        correlation: Average pairwise correlation (ρ)

    Returns:
        Ensemble variance
    """
    return correlation * individual_var + (1 - correlation) * individual_var / M


def plot_variance_reduction():
    """Visualize ensemble variance across model counts and correlations."""
    individual_var = 1.0  # Normalized
    M_values = np.arange(1, 101)
    correlations = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

    plt.figure(figsize=(12, 6))

    # Left plot: Variance vs number of models
    plt.subplot(1, 2, 1)
    for rho in correlations:
        variances = [ensemble_variance(M, individual_var, rho) for M in M_values]
        plt.plot(M_values, variances, label=f'ρ = {rho}', linewidth=2)
    plt.xlabel('Number of Models (M)', fontsize=12)
    plt.ylabel('Ensemble Variance', fontsize=12)
    plt.title('Variance Reduction vs Ensemble Size', fontsize=14)
    plt.legend(title='Correlation')
    plt.grid(True, alpha=0.3)
    plt.ylim(0, 1.1)

    # Right plot: Variance floor vs correlation
    plt.subplot(1, 2, 2)
    rho_values = np.linspace(0, 1, 100)
    M_scenarios = [5, 10, 25, 100, np.inf]
    for M in M_scenarios:
        if M == np.inf:
            variances = [rho * individual_var for rho in rho_values]
            label = 'M → ∞ (floor)'
        else:
            variances = [ensemble_variance(M, individual_var, rho) for rho in rho_values]
            label = f'M = {M}'
        plt.plot(rho_values, variances, label=label, linewidth=2)
    plt.xlabel('Correlation (ρ)', fontsize=12)
    plt.ylabel('Ensemble Variance', fontsize=12)
    plt.title('Variance vs Correlation (Diversity)', fontsize=14)
    plt.legend(title='Ensemble Size')
    plt.grid(True, alpha=0.3)
    plt.ylim(0, 1.1)

    plt.tight_layout()
    plt.savefig('ensemble_variance_analysis.png', dpi=150)
    plt.show()


def demonstration():
    """Print key insights from the decomposition."""
    print("Bias-Variance-Covariance Decomposition Analysis")
    print("=" * 60)

    # Scenario 1: Independent models
    print("\nScenario 1: Independent Models (ρ = 0)")
    print("-" * 40)
    for M in [1, 5, 10, 25, 100]:
        var = ensemble_variance(M, 1.0, 0.0)
        reduction = (1 - var) * 100
        print(f"  M = {M:3d}: Variance = {var:.4f} (↓ {reduction:.1f}%)")

    # Scenario 2: Realistic correlation
    print("\nScenario 2: Moderate Correlation (ρ = 0.4)")
    print("-" * 40)
    for M in [1, 5, 10, 25, 100]:
        var = ensemble_variance(M, 1.0, 0.4)
        reduction = (1 - var) * 100
        floor = 0.4  # ρσ²
        print(f"  M = {M:3d}: Variance = {var:.4f} (↓ {reduction:.1f}%, floor = {floor})")

    # Scenario 3: High correlation (similar models)
    print("\nScenario 3: High Correlation (ρ = 0.8)")
    print("-" * 40)
    for M in [1, 5, 10, 25, 100]:
        var = ensemble_variance(M, 1.0, 0.8)
        reduction = (1 - var) * 100
        print(f"  M = {M:3d}: Variance = {var:.4f} (↓ {reduction:.1f}%)")

    # Key insight
    print("\n" + "=" * 60)
    print("KEY INSIGHT: At ρ = 0.4, maximum variance reduction is 60%")
    print("             At ρ = 0.8, maximum variance reduction is only 20%")
    print("             Correlation/diversity is MORE important than ensemble size!")


if __name__ == "__main__":
    demonstration()
    plot_variance_reduction()
```

Given the variance formula, we can analyze when adding more models provides meaningful improvement.
Variance Reduction from Adding the (M+1)-th Model:
$$\Delta\text{Var} = \text{Var}(M) - \text{Var}(M+1)$$
Using our formula:
$$\Delta\text{Var} = \frac{(1-\rho)\sigma^2}{M} - \frac{(1-\rho)\sigma^2}{M+1} = \frac{(1-\rho)\sigma^2}{M(M+1)}$$
Key Observations:
- The marginal gain decays roughly as $1/M^2$: going from 10 to 11 models helps far less than going from 2 to 3.
- The gain scales with $(1-\rho)$: diverse ensembles keep benefiting from new members longer than highly correlated ones.
- No number of additional models removes the covariance floor $\rho\sigma^2$.
Practical Stopping Criterion:
Stop adding models when the marginal variance reduction is below some threshold δ:
$$\frac{(1-\rho)\sigma^2}{M(M+1)} < \delta$$
Solving for M:
$$M \approx \sqrt{\frac{(1-\rho)\sigma^2}{\delta}}$$
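A quick sketch of how this criterion plays out in numbers (the threshold δ below is an arbitrary illustrative choice):

```python
import numpy as np

def models_needed(rho, sigma2=1.0, delta=1e-3):
    """Approximate M at which the marginal variance reduction drops below delta."""
    # From (1 - rho) * sigma2 / (M * (M + 1)) < delta, using M*(M+1) ≈ M².
    return int(np.ceil(np.sqrt((1 - rho) * sigma2 / delta)))

for rho in [0.0, 0.3, 0.6, 0.9]:
    print(f"rho = {rho:.1f}: marginal gains fall below delta around M ≈ {models_needed(rho)}")
# The more correlated the models, the sooner additional members stop paying off.
```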
Worked example with $\sigma^2_{\text{model}} = 1$ and $\rho = 0.3$ (variance floor $\rho\sigma^2 = 0.3$):

| Models (M) | Variance | Reduction from M=1 | Marginal Gain |
|---|---|---|---|
| 1 | 1.000 | — | — |
| 2 | 0.650 | 35.0% | 35.0% |
| 5 | 0.440 | 56.0% | 4.2% per model |
| 10 | 0.370 | 63.0% | 1.4% per model |
| 25 | 0.328 | 67.2% | 0.3% per model |
| 50 | 0.314 | 68.6% | 0.1% per model |
| 100 | 0.307 | 69.3% | 0.03% per model |
| ∞ | 0.300 | 70.0% | 0% |
For most practical purposes, 50-200 base learners capture the majority of possible variance reduction. Random Forests typically use 100-500 trees, Gradient Boosting uses 100-1000 boosters. More isn't wrong—just not much better.
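One way to see why modest ensemble sizes suffice, using the variance formula above: the fraction of the *achievable* reduction $(1-\rho)\sigma^2_{\text{model}}$ captured by $M$ models works out to $1 - 1/M$, independent of $\rho$. A quick check of that claim:

```python
def fraction_of_achievable_reduction(M):
    """Share of the maximum possible variance reduction realized by an M-model ensemble."""
    # (sigma2 - Var(M)) / (sigma2 - rho*sigma2) simplifies to 1 - 1/M for any rho < 1.
    return 1 - 1 / M

for M in [10, 50, 100, 200]:
    print(f"M = {M:3d}: {fraction_of_achievable_reduction(M):.1%} of the possible reduction")
# 50 models already realize 98% of what is theoretically attainable.
```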
Theory is elegant, but can we actually measure these components in practice? Yes—through bootstrap resampling and careful experimental design.
Estimating Bias:
Train the same model on many bootstrap samples; the average prediction across all models estimates $\mathbb{E}[\hat{f}(x)]$:
$$\widehat{\text{Bias}} = \frac{1}{B}\sum_{b=1}^{B} \hat{f}^{(b)}(x) - y$$
Estimating Variance:
Variance is the spread of predictions across bootstrap samples:
$$\widehat{\text{Var}} = \frac{1}{B}\sum_{b=1}^{B} \left(\hat{f}^{(b)}(x) - \frac{1}{B}\sum_{b'} \hat{f}^{(b')}(x)\right)^2$$
Estimating Covariance:
Pairwise covariance between model predictions:
$$\widehat{\text{Cov}} = \frac{1}{B(B-1)}\sum_{b \neq b'} \left(\hat{f}^{(b)}(x) - \bar{f}(x)\right)\left(\hat{f}^{(b')}(x) - \bar{f}(x)\right)$$
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split


def estimate_bias_variance_covariance(X_train, y_train, X_test, y_test,
                                      model_class, n_bootstrap=200):
    """
    Empirically estimate bias, variance, and covariance for a model.

    Args:
        X_train, y_train: Training data
        X_test, y_test: Test data for evaluation
        model_class: Callable that returns a fresh model with fit() and predict()
        n_bootstrap: Number of bootstrap samples

    Returns:
        Dictionary with bias², variance, covariance estimates
    """
    n_test = len(X_test)
    predictions = np.zeros((n_bootstrap, n_test))

    # Generate predictions from models trained on bootstrap samples
    for b in range(n_bootstrap):
        # Bootstrap sample
        indices = np.random.choice(len(X_train), size=len(X_train), replace=True)
        X_b, y_b = X_train[indices], y_train[indices]

        # Train and predict
        model = model_class()
        model.fit(X_b, y_b)
        predictions[b] = model.predict(X_test)

    # Average prediction across bootstrap samples
    mean_prediction = predictions.mean(axis=0)

    # Bias² (squared difference between mean prediction and observed value)
    bias_squared = ((mean_prediction - y_test) ** 2).mean()

    # Variance (spread around the mean prediction, averaged over test points)
    variance = ((predictions - mean_prediction) ** 2).mean()

    # Covariance: average pairwise covariance between bootstrap models,
    # computed per test point and then averaged across test points
    centered = predictions - mean_prediction
    covariances = []
    for i in range(n_test):
        pred_i = centered[:, i]
        cov_matrix = np.outer(pred_i, pred_i)
        # Sum of off-diagonal elements divided by the number of ordered pairs
        off_diag_sum = cov_matrix.sum() - np.diag(cov_matrix).sum()
        covariances.append(off_diag_sum / (n_bootstrap * (n_bootstrap - 1)))
    covariance = np.mean(covariances)

    # Correlation coefficient
    correlation = covariance / variance if variance > 0 else 0

    # Average MSE of the individual models
    individual_mse = ((predictions - y_test) ** 2).mean()

    # MSE of the averaged (ensemble) prediction
    ensemble_prediction = predictions.mean(axis=0)
    ensemble_mse = ((ensemble_prediction - y_test) ** 2).mean()

    return {
        'bias_squared': bias_squared,
        'variance': variance,
        'covariance': covariance,
        'correlation': correlation,
        'individual_mse': individual_mse,
        'ensemble_mse': ensemble_mse,
        'variance_reduction': variance - ensemble_mse + bias_squared,
        'theoretical_ensemble_var': correlation * variance + (1 - correlation) * variance / n_bootstrap
    }


def demonstrate_bvc():
    """Demonstrate BVC decomposition on synthetic data."""
    # Create regression problem
    X, y = make_regression(n_samples=1000, n_features=10, noise=10, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    print("Bias-Variance-Covariance Decomposition Analysis")
    print("=" * 65)

    # Compare different model complexities
    configs = [
        ("Shallow Tree (depth=2)", lambda: DecisionTreeRegressor(max_depth=2)),
        ("Medium Tree (depth=5)", lambda: DecisionTreeRegressor(max_depth=5)),
        ("Deep Tree (depth=10)", lambda: DecisionTreeRegressor(max_depth=10)),
        ("Full Tree (no limit)", lambda: DecisionTreeRegressor(max_depth=None)),
    ]

    for name, model_class in configs:
        print(f"\n{name}")
        print("-" * 50)
        results = estimate_bias_variance_covariance(
            X_train, y_train, X_test, y_test, model_class, n_bootstrap=100
        )
        print(f"  Bias²:          {results['bias_squared']:.4f}")
        print(f"  Variance:       {results['variance']:.4f}")
        print(f"  Covariance:     {results['covariance']:.4f}")
        print(f"  Correlation ρ:  {results['correlation']:.4f}")
        print(f"  Individual MSE: {results['individual_mse']:.4f}")
        print(f"  Ensemble MSE:   {results['ensemble_mse']:.4f}")
        print(f"  Improvement:    {(1 - results['ensemble_mse']/results['individual_mse'])*100:.1f}%")


if __name__ == "__main__":
    demonstrate_bvc()
```

The code above will reveal a crucial pattern: deep trees have high variance but low bias, making them ideal ensemble components. Shallow trees have low variance but higher bias—ensembling them provides less improvement because there's less variance to reduce.
An alternative error decomposition, developed by Krogh and Vedelsby (1995), provides another perspective on ensemble error. For regression with simple averaging:
Krogh-Vedelsby Decomposition:
$$\bar{E} = \bar{E}_{\text{ensemble}} + \bar{A}$$
Where:
- $\bar{E} = \frac{1}{M}\sum_{i=1}^{M} (y - \hat{f}_i(x))^2$ is the average squared error of the individual models
- $\bar{E}_{\text{ensemble}} = (y - \hat{f}_{\text{ens}}(x))^2$ is the squared error of the ensemble prediction
- $\bar{A} = \frac{1}{M}\sum_{i=1}^{M} (\hat{f}_i(x) - \hat{f}_{\text{ens}}(x))^2$ is the ambiguity: the average squared deviation of the individual predictions from the ensemble prediction
Rearranging:
$$\bar{E}_{\text{ensemble}} = \bar{E} - \bar{A}$$
The Profound Implication:
Ensemble error = Average individual error - Ambiguity (disagreement)
This is a pure identity—always true, no assumptions needed!
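Because it is an identity, it can be verified on any numbers at all. A minimal check with arbitrary (untrained) predictions:

```python
import numpy as np

rng = np.random.default_rng(1)
y = 3.0                                # true target at one test point
preds = rng.normal(2.0, 1.5, size=7)   # arbitrary predictions from M = 7 "models"

f_ens = preds.mean()
avg_individual_error = ((y - preds) ** 2).mean()   # E-bar
ambiguity = ((preds - f_ens) ** 2).mean()          # A-bar
ensemble_error = (y - f_ens) ** 2                  # E-bar_ensemble

print(f"E_bar - A_bar  = {avg_individual_error - ambiguity:.6f}")
print(f"Ensemble error = {ensemble_error:.6f}")
# The two lines agree exactly (up to floating-point rounding), with no assumptions about the models.
```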
Interpretation:
Ambiguity measures how much base learners disagree. Higher disagreement → lower ensemble error (given fixed individual performance).
This provides direct mathematical justification for diversity: we want models that are individually accurate but mutually disagreeing.
The Krogh-Vedelsby decomposition makes diversity's value mathematically precise. Every unit of 'disagreement' (ambiguity) directly subtracts from ensemble error. This is why we should celebrate when our models make different predictions—it's the mechanism through which ensembles improve.
Connection to Bias-Variance-Covariance:
The two decompositions are related but different:
| Decomposition | Focus | When Useful |
|---|---|---|
| Bias-Variance-Covariance | Expected error over training sets | Understanding generalization |
| Krogh-Vedelsby (Ambiguity) | Actual error on a given dataset | Real-time ensemble diagnostics |
The BVC decomposition involves expectations over hypothetical training sets. The ambiguity decomposition is an exact identity for specific predictions—useful for monitoring ensemble behavior in production.
Given that low (or negative) correlation between model errors is valuable, can we explicitly train models to be negatively correlated?
Negative Correlation Learning (NCL):
Proposed by Liu and Yao (1999), NCL modifies the training objective to penalize models for making similar errors:
$$E_i = \frac{1}{N}\sum_{n=1}^{N} \left[(y_n - \hat{f}_i(x_n))^2 + \lambda \cdot p_i(x_n)\right]$$
Where $p_i(x)$ is a penalty term that encourages diversity:
$$p_i(x) = (\hat{f}_i(x) - \hat{f}_{\text{ens}}(x)) \sum_{j \neq i}(\hat{f}_j(x) - \hat{f}_{\text{ens}}(x))$$
Intuition:
If model $i$'s prediction is on the same side of the ensemble mean as other models, the penalty is positive—discouraging redundant predictions. If model $i$ is on the opposite side, the penalty is negative—rewarding diversity.
Trade-off:
A larger $\lambda$ pushes models apart (more diversity) at the cost of each model's individual accuracy; $\lambda = 0$ recovers independent training, so tuning $\lambda$ is itself a balancing act.
In practice, NCL is rarely used compared to simpler diversity-inducing methods like bagging and random subspaces. The added complexity often doesn't justify the marginal improvement over well-tuned Random Forests. But it remains theoretically important.
We've developed a rigorous mathematical understanding of how ensembles reduce prediction error. Let's consolidate the key results:
- Averaging does not change bias: the ensemble inherits the average bias of its members.
- Ensemble variance is $\rho\sigma^2_{\text{model}} + (1-\rho)\sigma^2_{\text{model}}/M$; as $M$ grows it converges to the covariance floor $\rho\sigma^2_{\text{model}}$, not to zero.
- Diversity (low correlation) sets how much variance reduction is possible; ensemble size determines how much of that potential is actually captured, with sharply diminishing returns.
- The Krogh-Vedelsby identity states that ensemble error equals average individual error minus ambiguity, an exact, assumption-free justification for diversity.
What's Next:
Having understood the mathematics, we now turn to the practical implications. The next page explores Diversity Requirement—how to create the uncorrelated errors that the theory demands, and how to measure diversity in practice.
You now have a complete mathematical framework for ensemble error analysis. The bias-variance-covariance decomposition and the ambiguity decomposition together explain when, why, and how much ensembles help. This foundation supports all the practical ensemble methods we'll study.