Throughout our analysis, one theme has emerged repeatedly: diversity is essential. The variance reduction formula shows that error correlation (ρ) determines the floor for ensemble improvement. The ambiguity decomposition proves that disagreement directly subtracts from ensemble error. The Condorcet theorem requires independent voters.
But what exactly is diversity? How do we measure it? How do we create it? And is there such a thing as too much diversity?
This page provides a comprehensive treatment of diversity in ensemble learning—the strategies for inducing it, the metrics for measuring it, and the fundamental tradeoff between diversity and individual model quality that governs ensemble design.
By the end of this page, you will understand multiple mechanisms for creating diverse ensembles, know how to quantify diversity using various metrics, and appreciate the diversity-accuracy tradeoff that constrains ensemble optimization.
Diversity in ensemble learning refers to the extent to which base learners make different errors. It's not about having different model architectures per se—it's about producing different predictions, particularly different mistakes.
Formal Definition:
Two models are diverse to the extent that their error patterns are uncorrelated. If model A and model B both fail on the same examples and succeed on the same examples, they are not diverse—regardless of how different their internal workings are.
Types of Diversity:
Not all types are equally valuable. Error diversity directly impacts ensemble performance; structural diversity is valuable only to the extent it produces error diversity.
Focus on error diversity, not superficial differences. Two neural networks with different architectures but identical predictions provide zero ensemble benefit. Two decision trees grown with the same algorithm and settings but on different training subsets may be highly diverse where it matters—in their errors.
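To make this concrete, here is a minimal sketch with hypothetical hard-coded prediction vectors (no real models involved): three copies of one 75%-accurate prediction gain nothing from voting, while three equally accurate predictions with disjoint errors vote their way to a perfect score.

```python
import numpy as np

def majority_vote(preds):
    """Column-wise majority vote over an (n_models, n_examples) array of 0/1 labels."""
    return (preds.mean(axis=0) > 0.5).astype(int)

y = np.array([1, 0, 1, 1, 0, 1, 0, 1])

# Hypothetical hard-coded predictions: each vector is 75% accurate (2 errors out of 8).
m1 = np.array([0, 1, 1, 1, 0, 1, 0, 1])  # errors on examples 0, 1
m2 = np.array([1, 0, 0, 0, 0, 1, 0, 1])  # errors on examples 2, 3
m3 = np.array([1, 0, 1, 1, 1, 0, 0, 1])  # errors on examples 4, 5

identical = np.vstack([m1, m1, m1])  # "different architectures", same errors
diverse = np.vstack([m1, m2, m3])    # same accuracy, disjoint errors

print((majority_vote(identical) == y).mean())  # 0.75 -- voting adds nothing
print((majority_vote(diverse) == y).mean())    # 1.00 -- every error gets outvoted
```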
There are four primary dimensions along which we can introduce diversity into ensemble members:
1. Data-Level Diversity
Modify what data each model sees:
2. Feature-Level Diversity
Modify what features each model uses:
3. Algorithm-Level Diversity
Use different learning algorithms:
4. Output-Level Diversity
Manipulate the target or output:
| Mechanism | Method | Typical Use Case |
|---|---|---|
| Data Sampling | Bootstrap | Bagging, Random Forests |
| Feature Sampling | Random Subspace | Random Forests, Feature Bagging |
| Feature Randomization | Random Feature @ Split | Random Forests, Extra Trees |
| Algorithm Mix | Heterogeneous | Stacking, Super Learner |
| Hyperparameter Variation | Grid/Random | Random Search Ensembles |
| Initialization Randomness | Different Seeds | Neural Network Ensembles |
| Output Manipulation | ECOC | Multi-class Classification |
The most successful ensemble methods combine multiple diversity mechanisms. Random Forests use both data sampling (bootstrap) and feature sampling (random splits). Extra Trees add randomized split thresholds. More diversity sources often mean more decorrelated errors.
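As an illustrative sketch (parameter values are placeholders, not recommendations), the snippet below shows how these mechanisms stack in scikit-learn: a `BaggingClassifier` that subsamples both rows and columns, and a Random Forest that pairs bootstrap rows with per-split feature randomization. Recent scikit-learn versions take the base model via `estimator`; older versions call the argument `base_estimator`.

```python
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Data-level + feature-level diversity: bootstrap rows and subsample columns per tree.
bagged = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=0.8,    # each tree trains on a bootstrap draw of 80% of the rows
    max_features=0.6,   # each tree sees a random 60% of the columns
    bootstrap=True,
    random_state=0,
)

# Random Forest: bootstrap rows plus a fresh random feature subset at every split.
rf = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",  # consider sqrt(p) candidate features per split
    bootstrap=True,
    random_state=0,
)
```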
Let's examine the most important diversity mechanisms in detail.
Bootstrap Sampling:
Given $N$ training examples, create a bootstrap sample by drawing $N$ examples with replacement. Key properties:
Mathematical Analysis:
Probability that a specific example is NOT selected in one draw: $\frac{N-1}{N}$
Probability it's NOT selected in $N$ draws: $\left(\frac{N-1}{N}\right)^N \approx e^{-1} \approx 0.368$
Probability it IS selected: $1 - 0.368 = 0.632$
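A quick numeric check of this limit, separate from the fuller analysis script below:

```python
import math

# (1 - 1/N)^N approaches e^-1, so roughly 63.2% of examples land in each bootstrap sample.
for N in (10, 100, 1000, 100000):
    p_excluded = (1 - 1 / N) ** N
    print(f"N={N:>6}: P(not selected) = {p_excluded:.4f}, P(selected) = {1 - p_excluded:.4f}")

print(f"limit: 1 - e^-1 = {1 - math.exp(-1):.4f}")
```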
Random Feature Selection at Splits:
At each split in a decision tree, instead of considering all $p$ features, consider only a random subset of size $m$:
This creates substantial diversity because:
```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from collections import Counter


def analyze_bootstrap_diversity(X, y, n_bootstraps=100):
    """
    Analyze the diversity created by bootstrap sampling.
    """
    N = len(X)

    # Track which samples appear in each bootstrap
    sample_appearances = np.zeros((n_bootstraps, N))

    for b in range(n_bootstraps):
        indices = np.random.choice(N, size=N, replace=True)
        unique_indices = np.unique(indices)
        sample_appearances[b, unique_indices] = 1

    # Statistics
    avg_unique_per_bootstrap = sample_appearances.sum(axis=1).mean()
    pct_unique = avg_unique_per_bootstrap / N * 100

    # Overlap between bootstrap samples
    overlaps = []
    for i in range(min(100, n_bootstraps)):
        for j in range(i + 1, min(100, n_bootstraps)):
            overlap = (sample_appearances[i] * sample_appearances[j]).sum()
            overlap_pct = overlap / avg_unique_per_bootstrap
            overlaps.append(overlap_pct)

    print("Bootstrap Sampling Analysis")
    print("=" * 50)
    print(f"Original dataset size: {N}")
    print(f"Average unique samples per bootstrap: {avg_unique_per_bootstrap:.1f}")
    print(f"Percentage of original data: {pct_unique:.1f}%")
    print(f"Average overlap between bootstraps: {np.mean(overlaps)*100:.1f}%")
    print("(Theoretical: 63.2% unique, 63.2% overlap)")


def analyze_feature_subspace_diversity(X, n_features_to_sample):
    """
    Analyze diversity from random feature subspacing.
    """
    n_features = X.shape[1]
    n_trials = 1000

    # How often do two random subsets share no features?
    no_overlap_count = 0
    overlap_sizes = []

    for _ in range(n_trials):
        subset1 = set(np.random.choice(n_features, n_features_to_sample, replace=False))
        subset2 = set(np.random.choice(n_features, n_features_to_sample, replace=False))
        overlap = len(subset1 & subset2)
        overlap_sizes.append(overlap)
        if overlap == 0:
            no_overlap_count += 1

    print(f"\nFeature Subspace Analysis (p={n_features}, m={n_features_to_sample})")
    print("=" * 50)
    print(f"Average feature overlap: {np.mean(overlap_sizes):.2f}")
    print(f"Probability of zero overlap: {no_overlap_count/n_trials*100:.1f}%")
    print(f"Expected overlap (theory): {n_features_to_sample**2/n_features:.2f}")


def measure_prediction_diversity(X_train, y_train, X_test, n_trees=50):
    """
    Measure how diverse predictions are across ensemble members.
    """
    predictions = []

    for i in range(n_trees):
        # Bootstrap sampling
        indices = np.random.choice(len(X_train), size=len(X_train), replace=True)
        X_b, y_b = X_train[indices], y_train[indices]

        # Random feature subspace (sqrt(p) features at each split)
        n_features = int(np.sqrt(X_train.shape[1]))

        # Train tree with random features
        tree = DecisionTreeClassifier(
            max_features=n_features,
            random_state=i
        )
        tree.fit(X_b, y_b)
        predictions.append(tree.predict(X_test))

    predictions = np.array(predictions)

    # Measure prediction entropy at each test point
    entropies = []
    for j in range(len(X_test)):
        counts = Counter(predictions[:, j])
        probs = np.array([c / n_trees for c in counts.values()])
        entropy = -np.sum(probs * np.log2(probs + 1e-10))
        entropies.append(entropy)

    print("\nPrediction Diversity Analysis")
    print("=" * 50)
    print(f"Number of trees: {n_trees}")
    print(f"Average prediction entropy: {np.mean(entropies):.4f}")
    print("Max possible entropy (binary): 1.0")
    # np.isclose guards against the tiny nonzero values the 1e-10 smoothing introduces
    print(f"% of test points with unanimous vote: {np.isclose(entropies, 0).mean()*100:.1f}%")

    # Correlation between tree predictions
    correlations = []
    for i in range(n_trees):
        for j in range(i + 1, n_trees):
            corr = np.corrcoef(predictions[i], predictions[j])[0, 1]
            if not np.isnan(corr):
                correlations.append(corr)
    print(f"Average pairwise prediction correlation: {np.mean(correlations):.4f}")

    return predictions


if __name__ == "__main__":
    # Create synthetic dataset
    X, y = make_classification(n_samples=1000, n_features=20,
                               n_informative=15, random_state=42)

    analyze_bootstrap_diversity(X, y)
    analyze_feature_subspace_diversity(X, n_features_to_sample=int(np.sqrt(20)))
    measure_prediction_diversity(X[:700], y[:700], X[700:])
```

Numerous metrics have been proposed to quantify ensemble diversity. We'll cover the most important ones.
For Classification:
1. Disagreement Measure (Pairwise):
$$D_{ij} = \frac{\text{Number of examples where } h_i \neq h_j}{\text{Total examples}}$$
Average over all pairs:
$$D_{\text{avg}} = \frac{2}{M(M-1)}\sum_{i<j} D_{ij}$$
2. Q-Statistic (Yule's Q):
For pairs of classifiers, build a contingency table:
| | $h_j$ correct | $h_j$ wrong |
|---|---|---|
| $h_i$ correct | $N^{11}$ | $N^{10}$ |
| $h_i$ wrong | $N^{01}$ | $N^{00}$ |
$$Q_{ij} = \frac{N^{11}N^{00} - N^{01}N^{10}}{N^{11}N^{00} + N^{01}N^{10}}$$
Q ranges from $-1$ to $+1$, and statistically independent classifiers give $Q \approx 0$. For good ensembles, we want $Q_{\text{avg}}$ to be low (near 0 or negative).
3. Correlation Coefficient (ρ):
$$\rho_{ij} = \frac{N^{11}N^{00} - N^{01}N^{10}}{\sqrt{(N^{11}+N^{10})(N^{01}+N^{00})(N^{11}+N^{01})(N^{10}+N^{00})}}$$
This is the correlation (phi coefficient) of the two classifiers' correct/incorrect indicators; it is related to Q but normalized differently, and lower values again indicate more diversity.
4. Entropy Measure:
For each example, compute how evenly the votes are spread across classes (a normalized, Gini-style impurity of the vote distribution):
$$E(x) = \frac{1}{1 - \frac{1}{M}}\frac{1}{M}\sum_{k=1}^{L} m_k(x)\left(1 - \frac{m_k(x)}{M}\right)$$
Where $m_k(x)$ is the number of classifiers predicting class $k$ for example $x$.
5. Kohavi-Wolpert Variance:
$$KW = \frac{1}{N}\sum_{n=1}^{N} \left(\frac{l(x_n)}{M}\right)\left(1 - \frac{l(x_n)}{M}\right)$$
Where $l(x_n)$ is the number of classifiers that misclassify example $x_n$.
Measures variance in the "correctness" of predictions across classifiers.
```python
import numpy as np
from itertools import combinations


def pairwise_disagreement(pred_i, pred_j):
    """Compute disagreement between two classifiers."""
    return np.mean(pred_i != pred_j)


def q_statistic(pred_i, pred_j, y_true):
    """Compute Yule's Q statistic between two classifiers."""
    correct_i = (pred_i == y_true)
    correct_j = (pred_j == y_true)

    N11 = np.sum(correct_i & correct_j)    # Both correct
    N00 = np.sum(~correct_i & ~correct_j)  # Both wrong
    N10 = np.sum(correct_i & ~correct_j)   # i correct, j wrong
    N01 = np.sum(~correct_i & correct_j)   # i wrong, j correct

    numerator = N11 * N00 - N01 * N10
    denominator = N11 * N00 + N01 * N10

    if denominator == 0:
        return 0
    return numerator / denominator


def correlation_coefficient(pred_i, pred_j, y_true):
    """Compute correlation coefficient of classifier errors."""
    correct_i = (pred_i == y_true).astype(int)
    correct_j = (pred_j == y_true).astype(int)
    corr = np.corrcoef(correct_i, correct_j)[0, 1]
    return corr if not np.isnan(corr) else 0


def entropy_measure(predictions, y_true):
    """
    Compute entropy-based diversity measure.

    Args:
        predictions: M x N array (M classifiers, N examples)
        y_true: True labels
    """
    M, N = predictions.shape
    entropies = []

    for n in range(N):
        votes = predictions[:, n]
        unique, counts = np.unique(votes, return_counts=True)
        probs = counts / M
        entropy = -np.sum(probs * np.log2(probs + 1e-10))
        # Normalize by max entropy
        max_entropy = np.log2(len(unique)) if len(unique) > 1 else 1
        entropies.append(entropy / max_entropy if max_entropy > 0 else 0)

    return np.mean(entropies)


def kohavi_wolpert_variance(predictions, y_true):
    """Compute Kohavi-Wolpert variance measure."""
    M, N = predictions.shape
    kw_sum = 0

    for n in range(N):
        # Number of classifiers that got this example wrong
        l_n = np.sum(predictions[:, n] != y_true[n])
        kw_sum += (l_n / M) * (1 - l_n / M)

    return kw_sum / N


def comprehensive_diversity_analysis(predictions, y_true):
    """
    Compute all diversity metrics for an ensemble.

    Args:
        predictions: M x N array (M classifiers, N examples)
        y_true: N true labels
    """
    M, N = predictions.shape

    # Individual accuracies
    accuracies = [np.mean(predictions[i] == y_true) for i in range(M)]

    # Pairwise metrics
    disagreements = []
    q_stats = []
    correlations = []

    for i, j in combinations(range(M), 2):
        disagreements.append(pairwise_disagreement(predictions[i], predictions[j]))
        q_stats.append(q_statistic(predictions[i], predictions[j], y_true))
        correlations.append(correlation_coefficient(predictions[i], predictions[j], y_true))

    # Entropy
    ent = entropy_measure(predictions, y_true)

    # Kohavi-Wolpert
    kw = kohavi_wolpert_variance(predictions, y_true)

    print("Diversity Analysis Results")
    print("=" * 60)
    print(f"Number of classifiers: {M}")
    print(f"Number of test examples: {N}")
    print()
    print("Individual Performance:")
    print(f"  Mean accuracy: {np.mean(accuracies):.4f}")
    print(f"  Std accuracy:  {np.std(accuracies):.4f}")
    print()
    print("Diversity Metrics:")
    print(f"  Avg Disagreement:   {np.mean(disagreements):.4f} (higher = more diverse)")
    print(f"  Avg Q-Statistic:    {np.mean(q_stats):.4f} (lower = more diverse)")
    print(f"  Avg Correlation:    {np.mean(correlations):.4f} (lower = more diverse)")
    print(f"  Entropy Measure:    {ent:.4f} (higher = more diverse)")
    print(f"  Kohavi-Wolpert Var: {kw:.4f} (higher = more diverse)")

    # Ensemble performance (majority vote)
    ensemble_pred = np.apply_along_axis(
        lambda x: np.bincount(x.astype(int)).argmax(), 0, predictions
    )
    ensemble_acc = np.mean(ensemble_pred == y_true)

    print()
    print(f"Ensemble Accuracy: {ensemble_acc:.4f}")
    print(f"Improvement over avg: {(ensemble_acc - np.mean(accuracies))*100:.2f}%")

    return {
        'disagreement': np.mean(disagreements),
        'q_statistic': np.mean(q_stats),
        'correlation': np.mean(correlations),
        'entropy': ent,
        'kw_variance': kw,
        'individual_acc': np.mean(accuracies),
        'ensemble_acc': ensemble_acc
    }
```

A fundamental tension exists in ensemble design: mechanisms that increase diversity often decrease individual accuracy. This tradeoff constrains how much we can diversify.
The Tradeoff in Action:
| Diversification | Effect on Individual Accuracy |
|---|---|
| Aggressive data subsampling | Each model sees less unique data |
| Smaller feature subsets | Each tree considers less information |
| Varying regularization strength | Higher individual errors |
| Random output codes | Harder subproblems |
Formal Analysis (Kuncheva-Whitaker):
Kuncheva and Whitaker (2003) analyzed the relationship between diversity and accuracy across many diversity measures. Their key finding:
"There is no single measure of diversity that consistently correlates with ensemble accuracy improvement."
This is because the optimal diversity level depends on:
Maximum diversity (completely random classifiers) produces ~50% accuracy on a balanced binary problem. Zero diversity (identical classifiers) produces no ensemble benefit. The optimal ensemble lives in between—sufficiently diverse to decorrelate errors, sufficiently accurate to be better than random.
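To see the correlation end of that spectrum numerically, here is a small simulation sketch (an assumption-laden toy, not a result from the text): 25 voters, each individually 70% accurate, make errors tied together by a shared latent factor, and we track majority-vote accuracy as the latent correlation grows. The realized error correlation is monotone in the latent ρ but not numerically equal to it.

```python
import numpy as np
from scipy.stats import norm

def simulated_vote_accuracy(p_correct, latent_rho, n_models=25, n_examples=100_000, seed=0):
    """Majority-vote accuracy when per-model errors share a common latent factor."""
    rng = np.random.default_rng(seed)
    common = rng.standard_normal(n_examples)
    noise = rng.standard_normal((n_models, n_examples))
    # Each latent score is standard normal, so thresholding at norm.ppf(p_correct)
    # makes every model individually correct with probability p_correct.
    latent = np.sqrt(latent_rho) * common + np.sqrt(1 - latent_rho) * noise
    correct = latent < norm.ppf(p_correct)        # (n_models, n_examples) correctness flags
    return (correct.sum(axis=0) > n_models / 2).mean()

for rho in (0.0, 0.2, 0.5, 0.8, 1.0):
    print(f"latent rho={rho:.1f}: majority-vote accuracy = "
          f"{simulated_vote_accuracy(0.70, rho):.3f}")
# Independent errors push 70%-accurate voters toward near-perfect accuracy;
# fully correlated errors collapse the ensemble back to 70%.
```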
Practical Guidelines:
Given this tradeoff, how do we find the sweet spot?
For Random Forests:
For Heterogeneous Ensembles:
For Boosting:
Rule of Thumb: Start with standard diversity settings (defaults in sklearn). Measure ensemble improvement. If minimal, consider increasing diversity. If individual models are too weak, reduce diversity.
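A sketch of that rule of thumb in code, under a couple of assumptions: a synthetic binary dataset with 0/1 labels (so each sub-tree's predictions compare directly against the test labels) and the fitted forest's `estimators_` attribute as the source of individual-model predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Accuracy of each fitted tree on its own, versus the forest as a whole.
tree_accs = [np.mean(tree.predict(X_te) == y_te) for tree in rf.estimators_]
ensemble_acc = rf.score(X_te, y_te)

print(f"mean tree accuracy:   {np.mean(tree_accs):.3f}")
print(f"ensemble accuracy:    {ensemble_acc:.3f}")
print(f"ensemble improvement: {ensemble_acc - np.mean(tree_accs):+.3f}")
# Small gap: members agree too much -- consider more diversity.
# Weak individual trees (barely above chance): dial diversity back.
```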
Let's examine specific strategies for creating diverse base learners in common ensemble frameworks.
Decision Trees (for Random Forests):
| Lever | Standard Setting | More Diverse / Less Accurate | Less Diverse / More Accurate |
|---|---|---|---|
| max_features | sqrt(p) or p/3 | 1 or 2 | p (all) |
| max_depth | None (full) | Shallow (3-5) | Full depth |
| min_samples_split | 2 | Larger (e.g., 20) | 2 |
| min_samples_leaf | 1 | Larger (e.g., 10) | 1 |
| bootstrap | True | True with small ratio | False |
| max_samples | 100% | 50-80% | 100% |
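A sketch of how these levers map onto `RandomForestClassifier` arguments; the specific numbers are illustrative placeholders rather than tuned recommendations.

```python
from sklearn.ensemble import RandomForestClassifier

# Pushed toward the "more diverse / less accurate" column of the table above.
rf_more_diverse = RandomForestClassifier(
    n_estimators=300,
    max_features=2,       # tiny random feature subset at each split
    max_depth=5,          # shallow trees
    min_samples_leaf=10,
    bootstrap=True,
    max_samples=0.6,      # each tree sees only ~60% of the rows
    random_state=0,
)

# Pushed toward the "less diverse / more accurate (individually)" column.
rf_less_diverse = RandomForestClassifier(
    n_estimators=300,
    max_features=None,    # consider all p features at every split
    max_depth=None,       # grow each tree to full depth
    min_samples_leaf=1,
    bootstrap=False,      # every tree sees the full training set
    random_state=0,
)
```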
Extra Trees (Extremely Randomized Trees):
Go beyond Random Forests by also randomizing split thresholds:
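For reference, a minimal sketch of the scikit-learn counterparts; the `splitter='random'` option on a single `DecisionTreeClassifier` applies the same threshold-randomization idea to one learner.

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Random Forest: bootstrap rows, random feature subset per split, best threshold per feature.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)

# Extra Trees: random feature subset per split AND a random threshold per candidate feature;
# by default it skips the bootstrap and relies on split randomness for diversity.
et = ExtraTreesClassifier(n_estimators=200, max_features="sqrt", bootstrap=False, random_state=0)

# The same threshold randomization applied to a single tree.
random_split_tree = DecisionTreeClassifier(splitter="random", random_state=0)
```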
Neural Network Ensembles:
Diversity sources for neural networks:
Heterogeneous Ensembles:
Combine fundamentally different algorithm families:
Different inductive biases → different error patterns → more diversity.
```python
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
import numpy as np


def create_maximally_diverse_ensemble(random_state=42):
    """
    Create an ensemble maximizing diversity through:
    1. Different algorithm families
    2. Different hyperparameters within families
    3. Different random seeds
    """
    models = []

    # Tree-based family - multiple variants
    models.extend([
        ('rf_default', RandomForestClassifier(n_estimators=50, random_state=random_state)),
        ('rf_shallow', RandomForestClassifier(n_estimators=50, max_depth=5,
                                              random_state=random_state + 1)),
        ('rf_deep', RandomForestClassifier(n_estimators=50, min_samples_leaf=1,
                                           random_state=random_state + 2)),
        ('et', ExtraTreesClassifier(n_estimators=50, random_state=random_state + 3)),
    ])

    # Boosting family
    models.extend([
        ('gb', GradientBoostingClassifier(n_estimators=50, random_state=random_state + 4)),
        ('ada', AdaBoostClassifier(n_estimators=50, random_state=random_state + 5)),
    ])

    # Linear family
    models.extend([
        ('lr', LogisticRegression(max_iter=1000, random_state=random_state + 6)),
        ('ridge', RidgeClassifier(random_state=random_state + 7)),
    ])

    # Kernel family
    models.extend([
        ('svm_rbf', SVC(probability=True, random_state=random_state + 8)),
        ('svm_poly', SVC(kernel='poly', degree=3, probability=True,
                         random_state=random_state + 9)),
    ])

    # Instance-based
    models.extend([
        ('knn_3', KNeighborsClassifier(n_neighbors=3)),
        ('knn_10', KNeighborsClassifier(n_neighbors=10)),
    ])

    # Neural
    models.extend([
        ('mlp_small', MLPClassifier(hidden_layer_sizes=(50,), max_iter=500,
                                    random_state=random_state + 10)),
        ('mlp_deep', MLPClassifier(hidden_layer_sizes=(100, 50, 25), max_iter=500,
                                   random_state=random_state + 11)),
    ])

    # Probabilistic
    models.append(('nb', GaussianNB()))

    return models


def analyze_ensemble_diversity(models, X_train, y_train, X_test, y_test):
    """Fit models and analyze diversity."""
    # Assumes the metrics listing above is saved alongside this script as diversity_metrics.py
    from diversity_metrics import comprehensive_diversity_analysis

    predictions = []

    print("Training individual models...")
    for name, model in models:
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        acc = np.mean(pred == y_test)
        predictions.append(pred)
        print(f"  {name}: {acc:.4f}")

    predictions = np.array(predictions)
    print()

    return comprehensive_diversity_analysis(predictions, y_test)


if __name__ == "__main__":
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=20,
                               n_informative=15, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )

    models = create_maximally_diverse_ensemble()
    analyze_ensemble_diversity(models, X_train, y_train, X_test, y_test)
```

Despite its importance, diversity isn't always achievable or beneficial. Understanding when diversity fails helps set realistic expectations.
Intrinsic Correlation:
Some problems have structure that forces correlation:
Diversity Doesn't Help When:
All models are biased the same way: Diversity reduces variance, not bias. If the problem requires a representation no model can learn, diversity won't help.
Individual models are too weak: Random diverse models are still random. Each model must be at least better than chance.
Aggregation is wrong: Averaging diverse predictions only helps if the truth is somewhere in the convex hull of predictions.
The problem is trivially easy: If one model achieves near-perfect accuracy, there's nothing left to improve.
Don't assume diversity—measure it. Compute Q-statistics or correlation between your ensemble members. If diversity is naturally low for your problem, focus on improving individual models rather than adding more diverse but weaker ones.
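As a concrete sketch of that advice (synthetic data and a stock Random Forest, with 0/1 labels so sub-tree predictions compare directly against the test labels), the snippet below estimates the average pairwise correlation ρ of the members' correct/incorrect indicators:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# (n_trees, n_test) matrix of correct/incorrect indicators for every ensemble member.
correct = np.array([tree.predict(X_te) == y_te for tree in rf.estimators_]).astype(float)

# Average pairwise correlation of those indicators (the rho from the metrics above).
upper = np.triu_indices(len(correct), k=1)
avg_rho = np.nanmean(np.corrcoef(correct)[upper])
print(f"average pairwise error correlation: {avg_rho:.3f}")
# Near 1: members fail together and voting has little room to help.
# Near 0: errors are decorrelated and the ensemble has headroom.
```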
We've comprehensively explored diversity—the essential ingredient for effective ensemble learning. Let's consolidate:
What's Next:
With our theoretical foundation complete—variance reduction, crowd wisdom, error decomposition, and diversity—we turn to Ensemble Strategies. We'll survey the major families of ensemble methods: bagging, boosting, stacking, and more, understanding how each implements the principles we've learned.
You now have a comprehensive understanding of diversity in ensemble learning—what it is, how to create it, how to measure it, and when it helps. This knowledge is essential for designing effective ensembles in practice.