Throughout our analysis, one theme has emerged repeatedly: diversity is essential. The variance reduction formula shows that error correlation (ρ) determines the floor for ensemble improvement. The ambiguity decomposition proves that disagreement directly subtracts from ensemble error. The Condorcet theorem requires independent voters.
But what exactly is diversity? How do we measure it? How do we create it? And is there such a thing as too much diversity?
This page provides a comprehensive treatment of diversity in ensemble learning—the strategies for inducing it, the metrics for measuring it, and the fundamental tradeoff between diversity and individual model quality that governs ensemble design.
By the end of this page, you will understand multiple mechanisms for creating diverse ensembles, know how to quantify diversity using various metrics, and appreciate the diversity-accuracy tradeoff that constrains ensemble optimization.
Diversity in ensemble learning refers to the extent to which base learners make different errors. It's not about having different model architectures per se—it's about producing different predictions, particularly different mistakes.
Formal Definition:
Two models are diverse to the extent that their error patterns are uncorrelated. If model A and model B both fail on the same examples and succeed on the same examples, they are not diverse—regardless of how different their internal workings are.
Types of Diversity:
Not all types are equally valuable. Error diversity directly impacts ensemble performance; structural diversity is valuable only to the extent it produces error diversity.
Focus on error diversity, not superficial differences. Two neural networks with different architectures but identical predictions provide zero ensemble benefit. Two decision trees grown with the same algorithm and settings but on different training subsets may be highly diverse where it matters—in their errors.
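To make this concrete, here is a minimal sketch with hypothetical hard-coded prediction vectors (no real models involved): three copies of one 75%-accurate prediction gain nothing from voting, while three equally accurate predictions with disjoint errors vote their way to a perfect score.

```python
import numpy as np

def majority_vote(preds):
    """Column-wise majority vote over an (n_models, n_examples) array of 0/1 labels."""
    return (preds.mean(axis=0) > 0.5).astype(int)

y = np.array([1, 0, 1, 1, 0, 1, 0, 1])

# Hypothetical hard-coded predictions: each vector is 75% accurate (2 errors out of 8).
m1 = np.array([0, 1, 1, 1, 0, 1, 0, 1])  # errors on examples 0, 1
m2 = np.array([1, 0, 0, 0, 0, 1, 0, 1])  # errors on examples 2, 3
m3 = np.array([1, 0, 1, 1, 1, 0, 0, 1])  # errors on examples 4, 5

identical = np.vstack([m1, m1, m1])  # "different architectures", same errors
diverse = np.vstack([m1, m2, m3])    # same accuracy, disjoint errors

print((majority_vote(identical) == y).mean())  # 0.75 -- voting adds nothing
print((majority_vote(diverse) == y).mean())    # 1.00 -- every error gets outvoted
```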
There are four primary dimensions along which we can introduce diversity into ensemble members:
1. Data-Level Diversity
Modify what data each model sees:
2. Feature-Level Diversity
Modify what features each model uses:
3. Algorithm-Level Diversity
Use different learning algorithms:
4. Output-Level Diversity
Manipulate the target or output:
| Mechanism | Method | Typical Use Case |
|---|---|---|
| Data Sampling | Bootstrap | Bagging, Random Forests |
| Feature Sampling | Random Subspace | Random Forests, Feature Bagging |
| Feature Randomization | Random Feature @ Split | Random Forests, Extra Trees |
| Algorithm Mix | Heterogeneous | Stacking, Super Learner |
| Hyperparameter Variation | Grid/Random | Random Search Ensembles |
| Initialization Randomness | Different Seeds | Neural Network Ensembles |
| Output Manipulation | ECOC | Multi-class Classification |
The most successful ensemble methods combine multiple diversity mechanisms. Random Forests use both data sampling (bootstrap) and feature sampling (random splits). Extra Trees add randomized split thresholds. More diversity sources often mean more decorrelated errors.
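As an illustrative sketch (parameter values are placeholders, not recommendations), the snippet below shows how these mechanisms stack in scikit-learn: a `BaggingClassifier` that subsamples both rows and columns, and a Random Forest that pairs bootstrap rows with per-split feature randomization. Recent scikit-learn versions take the base model via `estimator`; older versions call the argument `base_estimator`.

```python
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Data-level + feature-level diversity: bootstrap rows and subsample columns per tree.
bagged = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=0.8,    # each tree trains on a bootstrap draw of 80% of the rows
    max_features=0.6,   # each tree sees a random 60% of the columns
    bootstrap=True,
    random_state=0,
)

# Random Forest: bootstrap rows plus a fresh random feature subset at every split.
rf = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",  # consider sqrt(p) candidate features per split
    bootstrap=True,
    random_state=0,
)
```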
Let's examine the most important diversity mechanisms in detail.
Bootstrap Sampling:
Given $N$ training examples, create a bootstrap sample by drawing $N$ examples with replacement. Key properties:
Mathematical Analysis:
Probability that a specific example is NOT selected in one draw: $\frac{N-1}{N}$
Probability it's NOT selected in $N$ draws: $\left(\frac{N-1}{N}\right)^N \approx e^{-1} \approx 0.368$
Probability it IS selected: $1 - 0.368 = 0.632$
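A quick numeric check of this limit, separate from the fuller analysis script below:

```python
import math

# (1 - 1/N)^N approaches e^-1, so roughly 63.2% of examples land in each bootstrap sample.
for N in (10, 100, 1000, 100000):
    p_excluded = (1 - 1 / N) ** N
    print(f"N={N:>6}: P(not selected) = {p_excluded:.4f}, P(selected) = {1 - p_excluded:.4f}")

print(f"limit: 1 - e^-1 = {1 - math.exp(-1):.4f}")
```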
Random Feature Selection at Splits:
At each split in a decision tree, instead of considering all $p$ features, consider only a random subset of size $m$:
This creates substantial diversity because:
```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from collections import Counter


def analyze_bootstrap_diversity(X, y, n_bootstraps=100):
    """
    Analyze the diversity created by bootstrap sampling.
    """
    N = len(X)

    # Track which samples appear in each bootstrap
    sample_appearances = np.zeros((n_bootstraps, N))

    for b in range(n_bootstraps):
        indices = np.random.choice(N, size=N, replace=True)
        unique_indices = np.unique(indices)
        sample_appearances[b, unique_indices] = 1

    # Statistics
    avg_unique_per_bootstrap = sample_appearances.sum(axis=1).mean()
    pct_unique = avg_unique_per_bootstrap / N * 100

    # Overlap between bootstrap samples
    overlaps = []
    for i in range(min(100, n_bootstraps)):
        for j in range(i + 1, min(100, n_bootstraps)):
            overlap = (sample_appearances[i] * sample_appearances[j]).sum()
            overlap_pct = overlap / avg_unique_per_bootstrap
            overlaps.append(overlap_pct)

    print("Bootstrap Sampling Analysis")
    print("=" * 50)
    print(f"Original dataset size: {N}")
    print(f"Average unique samples per bootstrap: {avg_unique_per_bootstrap:.1f}")
    print(f"Percentage of original data: {pct_unique:.1f}%")
    print(f"Average overlap between bootstraps: {np.mean(overlaps)*100:.1f}%")
    print("(Theoretical: 63.2% unique, 63.2% overlap)")


def analyze_feature_subspace_diversity(X, n_features_to_sample):
    """
    Analyze diversity from random feature subspacing.
    """
    n_features = X.shape[1]
    n_trials = 1000

    # How often do two random subsets share no features?
    no_overlap_count = 0
    overlap_sizes = []

    for _ in range(n_trials):
        subset1 = set(np.random.choice(n_features, n_features_to_sample, replace=False))
        subset2 = set(np.random.choice(n_features, n_features_to_sample, replace=False))
        overlap = len(subset1 & subset2)
        overlap_sizes.append(overlap)
        if overlap == 0:
            no_overlap_count += 1

    print(f"\nFeature Subspace Analysis (p={n_features}, m={n_features_to_sample})")
    print("=" * 50)
    print(f"Average feature overlap: {np.mean(overlap_sizes):.2f}")
    print(f"Probability of zero overlap: {no_overlap_count/n_trials*100:.1f}%")
    print(f"Expected overlap (theory): {n_features_to_sample**2/n_features:.2f}")


def measure_prediction_diversity(X_train, y_train, X_test, n_trees=50):
    """
    Measure how diverse predictions are across ensemble members.
    """
    predictions = []

    for i in range(n_trees):
        # Bootstrap sampling
        indices = np.random.choice(len(X_train), size=len(X_train), replace=True)
        X_b, y_b = X_train[indices], y_train[indices]

        # Random feature subspace (sqrt(p) features at each split)
        n_features = int(np.sqrt(X_train.shape[1]))

        # Train tree with random features
        tree = DecisionTreeClassifier(
            max_features=n_features,
            random_state=i
        )
        tree.fit(X_b, y_b)
        predictions.append(tree.predict(X_test))

    predictions = np.array(predictions)

    # Measure prediction entropy at each test point
    entropies = []
    for j in range(len(X_test)):
        counts = Counter(predictions[:, j])
        probs = np.array([c / n_trees for c in counts.values()])
        entropy = -np.sum(probs * np.log2(probs + 1e-10))
        entropies.append(entropy)

    print("\nPrediction Diversity Analysis")
    print("=" * 50)
    print(f"Number of trees: {n_trees}")
    print(f"Average prediction entropy: {np.mean(entropies):.4f}")
    print("Max possible entropy (binary): 1.0")
    # np.isclose guards against the tiny nonzero values the 1e-10 smoothing introduces
    print(f"% of test points with unanimous vote: {np.isclose(entropies, 0).mean()*100:.1f}%")

    # Correlation between tree predictions
    correlations = []
    for i in range(n_trees):
        for j in range(i + 1, n_trees):
            corr = np.corrcoef(predictions[i], predictions[j])[0, 1]
            if not np.isnan(corr):
                correlations.append(corr)
    print(f"Average pairwise prediction correlation: {np.mean(correlations):.4f}")

    return predictions


if __name__ == "__main__":
    # Create synthetic dataset
    X, y = make_classification(n_samples=1000, n_features=20,
                               n_informative=15, random_state=42)

    analyze_bootstrap_diversity(X, y)
    analyze_feature_subspace_diversity(X, n_features_to_sample=int(np.sqrt(20)))
    measure_prediction_diversity(X[:700], y[:700], X[700:])
```

Numerous metrics have been proposed to quantify ensemble diversity. We'll cover the most important ones.
For Classification:
1. Disagreement Measure (Pairwise):
$$D_{ij} = \frac{\text{Number of examples where } h_i \neq h_j}{\text{Total examples}}$$
Average over all pairs:
$$D_{\text{avg}} = \frac{2}{M(M-1)}\sum_{i<j} D_{ij}$$
2. Q-Statistic (Yule's Q):
For pairs of classifiers, build a contingency table:
| | $h_j$ correct | $h_j$ wrong |
|---|---|---|
| $h_i$ correct | $N^{11}$ | $N^{10}$ |
| $h_i$ wrong | $N^{01}$ | $N^{00}$ |
$$Q_{ij} = \frac{N^{11}N^{00} - N^{01}N^{10}}{N^{11}N^{00} + N^{01}N^{10}}$$
Q ranges from $-1$ to $+1$, and statistically independent classifiers give $Q \approx 0$. For good ensembles, we want $Q_{\text{avg}}$ to be low (near 0 or negative).
3. Correlation Coefficient (ρ):
$$\rho_{ij} = \frac{N^{11}N^{00} - N^{01}N^{10}}{\sqrt{(N^{11}+N^{10})(N^{01}+N^{00})(N^{11}+N^{01})(N^{10}+N^{00})}}$$
This is the correlation (phi coefficient) of the two classifiers' correct/incorrect indicators; it is related to Q but normalized differently, and lower values again indicate more diversity.
4. Entropy Measure:
For each example, compute how evenly the votes are spread across classes (a normalized, Gini-style impurity of the vote distribution):
$$E(x) = \frac{1}{1 - \frac{1}{M}}\frac{1}{M}\sum_{k=1}^{L} m_k(x)\left(1 - \frac{m_k(x)}{M}\right)$$
Where $m_k(x)$ is the number of classifiers predicting class $k$ for example $x$.
5. Kohavi-Wolpert Variance:
$$KW = \frac{1}{N}\sum_{n=1}^{N} \left(\frac{l(x_n)}{M}\right)\left(1 - \frac{l(x_n)}{M}\right)$$
Where $l(x_n)$ is the number of classifiers that misclassify example $x_n$.
Measures variance in the "correctness" of predictions across classifiers.
```python
import numpy as np
from itertools import combinations


def pairwise_disagreement(pred_i, pred_j):
    """Compute disagreement between two classifiers."""
    return np.mean(pred_i != pred_j)


def q_statistic(pred_i, pred_j, y_true):
    """Compute Yule's Q statistic between two classifiers."""
    correct_i = (pred_i == y_true)
    correct_j = (pred_j == y_true)

    N11 = np.sum(correct_i & correct_j)    # Both correct
    N00 = np.sum(~correct_i & ~correct_j)  # Both wrong
    N10 = np.sum(correct_i & ~correct_j)   # i correct, j wrong
    N01 = np.sum(~correct_i & correct_j)   # i wrong, j correct

    numerator = N11 * N00 - N01 * N10
    denominator = N11 * N00 + N01 * N10

    if denominator == 0:
        return 0
    return numerator / denominator


def correlation_coefficient(pred_i, pred_j, y_true):
    """Compute correlation coefficient of classifier errors."""
    correct_i = (pred_i == y_true).astype(int)
    correct_j = (pred_j == y_true).astype(int)
    corr = np.corrcoef(correct_i, correct_j)[0, 1]
    return corr if not np.isnan(corr) else 0


def entropy_measure(predictions, y_true):
    """
    Compute entropy-based diversity measure.

    Args:
        predictions: M x N array (M classifiers, N examples)
        y_true: True labels
    """
    M, N = predictions.shape
    entropies = []

    for n in range(N):
        votes = predictions[:, n]
        unique, counts = np.unique(votes, return_counts=True)
        probs = counts / M
        entropy = -np.sum(probs * np.log2(probs + 1e-10))
        # Normalize by max entropy
        max_entropy = np.log2(len(unique)) if len(unique) > 1 else 1
        entropies.append(entropy / max_entropy if max_entropy > 0 else 0)

    return np.mean(entropies)


def kohavi_wolpert_variance(predictions, y_true):
    """Compute Kohavi-Wolpert variance measure."""
    M, N = predictions.shape
    kw_sum = 0

    for n in range(N):
        # Number of classifiers that got this example wrong
        l_n = np.sum(predictions[:, n] != y_true[n])
        kw_sum += (l_n / M) * (1 - l_n / M)

    return kw_sum / N


def comprehensive_diversity_analysis(predictions, y_true):
    """
    Compute all diversity metrics for an ensemble.

    Args:
        predictions: M x N array (M classifiers, N examples)
        y_true: N true labels
    """
    M, N = predictions.shape

    # Individual accuracies
    accuracies = [np.mean(predictions[i] == y_true) for i in range(M)]

    # Pairwise metrics
    disagreements = []
    q_stats = []
    correlations = []

    for i, j in combinations(range(M), 2):
        disagreements.append(pairwise_disagreement(predictions[i], predictions[j]))
        q_stats.append(q_statistic(predictions[i], predictions[j], y_true))
        correlations.append(correlation_coefficient(predictions[i], predictions[j], y_true))

    # Entropy
    ent = entropy_measure(predictions, y_true)

    # Kohavi-Wolpert
    kw = kohavi_wolpert_variance(predictions, y_true)

    print("Diversity Analysis Results")
    print("=" * 60)
    print(f"Number of classifiers: {M}")
    print(f"Number of test examples: {N}")
    print()
    print("Individual Performance:")
    print(f"  Mean accuracy: {np.mean(accuracies):.4f}")
    print(f"  Std accuracy:  {np.std(accuracies):.4f}")
    print()
    print("Diversity Metrics:")
    print(f"  Avg Disagreement:   {np.mean(disagreements):.4f} (higher = more diverse)")
    print(f"  Avg Q-Statistic:    {np.mean(q_stats):.4f} (lower = more diverse)")
    print(f"  Avg Correlation:    {np.mean(correlations):.4f} (lower = more diverse)")
    print(f"  Entropy Measure:    {ent:.4f} (higher = more diverse)")
    print(f"  Kohavi-Wolpert Var: {kw:.4f} (higher = more diverse)")

    # Ensemble performance (majority vote)
    ensemble_pred = np.apply_along_axis(
        lambda x: np.bincount(x.astype(int)).argmax(), 0, predictions
    )
    ensemble_acc = np.mean(ensemble_pred == y_true)

    print()
    print(f"Ensemble Accuracy: {ensemble_acc:.4f}")
    print(f"Improvement over avg: {(ensemble_acc - np.mean(accuracies))*100:.2f}%")

    return {
        'disagreement': np.mean(disagreements),
        'q_statistic': np.mean(q_stats),
        'correlation': np.mean(correlations),
        'entropy': ent,
        'kw_variance': kw,
        'individual_acc': np.mean(accuracies),
        'ensemble_acc': ensemble_acc
    }
```

A fundamental tension exists in ensemble design: mechanisms that increase diversity often decrease individual accuracy. This tradeoff constrains how much we can diversify.
The Tradeoff in Action:
| Diversification | Effect on Individual Accuracy |
|---|---|
| Aggressive data subsampling | Each model sees less unique data |
| Smaller feature subsets | Each tree considers less information |
| Varying regularization strength | Higher individual errors |
| Random output codes | Harder subproblems |
Formal Analysis (Kuncheva-Whitaker):
Kuncheva and Whitaker (2003) analyzed the relationship between diversity and accuracy across many diversity measures. Their key finding:
"There is no single measure of diversity that consistently correlates with ensemble accuracy improvement."
This is because the optimal diversity level depends on:
Maximum diversity (completely random classifiers) produces ~50% accuracy on a balanced binary problem. Zero diversity (identical classifiers) produces no ensemble benefit. The optimal ensemble lives in between—sufficiently diverse to decorrelate errors, sufficiently accurate to be better than random.
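To see the correlation end of that spectrum numerically, here is a small simulation sketch (an assumption-laden toy, not a result from the text): 25 voters, each individually 70% accurate, make errors tied together by a shared latent factor, and we track majority-vote accuracy as the latent correlation grows. The realized error correlation is monotone in the latent ρ but not numerically equal to it.

```python
import numpy as np
from scipy.stats import norm

def simulated_vote_accuracy(p_correct, latent_rho, n_models=25, n_examples=100_000, seed=0):
    """Majority-vote accuracy when per-model errors share a common latent factor."""
    rng = np.random.default_rng(seed)
    common = rng.standard_normal(n_examples)
    noise = rng.standard_normal((n_models, n_examples))
    # Each latent score is standard normal, so thresholding at norm.ppf(p_correct)
    # makes every model individually correct with probability p_correct.
    latent = np.sqrt(latent_rho) * common + np.sqrt(1 - latent_rho) * noise
    correct = latent < norm.ppf(p_correct)        # (n_models, n_examples) correctness flags
    return (correct.sum(axis=0) > n_models / 2).mean()

for rho in (0.0, 0.2, 0.5, 0.8, 1.0):
    print(f"latent rho={rho:.1f}: majority-vote accuracy = "
          f"{simulated_vote_accuracy(0.70, rho):.3f}")
# Independent errors push 70%-accurate voters toward near-perfect accuracy;
# fully correlated errors collapse the ensemble back to 70%.
```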
Practical Guidelines:
Given this tradeoff, how do we find the sweet spot?
For Random Forests:
For Heterogeneous Ensembles:
For Boosting:
Rule of Thumb: Start with standard diversity settings (defaults in sklearn). Measure ensemble improvement. If minimal, consider increasing diversity. If individual models are too weak, reduce diversity.
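A sketch of that rule of thumb in code, under a couple of assumptions: a synthetic binary dataset with 0/1 labels (so each sub-tree's predictions compare directly against the test labels) and the fitted forest's `estimators_` attribute as the source of individual-model predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Accuracy of each fitted tree on its own, versus the forest as a whole.
tree_accs = [np.mean(tree.predict(X_te) == y_te) for tree in rf.estimators_]
ensemble_acc = rf.score(X_te, y_te)

print(f"mean tree accuracy:   {np.mean(tree_accs):.3f}")
print(f"ensemble accuracy:    {ensemble_acc:.3f}")
print(f"ensemble improvement: {ensemble_acc - np.mean(tree_accs):+.3f}")
# Small gap: members agree too much -- consider more diversity.
# Weak individual trees (barely above chance): dial diversity back.
```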
Let's examine specific strategies for creating diverse base learners in common ensemble frameworks.
Decision Trees (for Random Forests):
| Lever | Standard Setting | More Diverse / Less Accurate | Less Diverse / More Accurate |
|---|---|---|---|
| max_features | sqrt(p) or p/3 | 1 or 2 | p (all) |
| max_depth | None (full) | Shallow (3-5) | Full depth |
| min_samples_split | 2 | Larger (e.g., 20) | 2 |
| min_samples_leaf | 1 | Larger (e.g., 10) | 1 |
| bootstrap | True | True with small ratio | False |
| max_samples | 100% | 50-80% | 100% |
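A sketch of how these levers map onto `RandomForestClassifier` arguments; the specific numbers are illustrative placeholders rather than tuned recommendations.

```python
from sklearn.ensemble import RandomForestClassifier

# Pushed toward the "more diverse / less accurate" column of the table above.
rf_more_diverse = RandomForestClassifier(
    n_estimators=300,
    max_features=2,       # tiny random feature subset at each split
    max_depth=5,          # shallow trees
    min_samples_leaf=10,
    bootstrap=True,
    max_samples=0.6,      # each tree sees only ~60% of the rows
    random_state=0,
)

# Pushed toward the "less diverse / more accurate (individually)" column.
rf_less_diverse = RandomForestClassifier(
    n_estimators=300,
    max_features=None,    # consider all p features at every split
    max_depth=None,       # grow each tree to full depth
    min_samples_leaf=1,
    bootstrap=False,      # every tree sees the full training set
    random_state=0,
)
```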
Extra Trees (Extremely Randomized Trees):
Go beyond Random Forests by also randomizing split thresholds:
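For reference, a minimal sketch of the scikit-learn counterparts; the `splitter='random'` option on a single `DecisionTreeClassifier` applies the same threshold-randomization idea to one learner.

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Random Forest: bootstrap rows, random feature subset per split, best threshold per feature.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)

# Extra Trees: random feature subset per split AND a random threshold per candidate feature;
# by default it skips the bootstrap and relies on split randomness for diversity.
et = ExtraTreesClassifier(n_estimators=200, max_features="sqrt", bootstrap=False, random_state=0)

# The same threshold randomization applied to a single tree.
random_split_tree = DecisionTreeClassifier(splitter="random", random_state=0)
```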
Neural Network Ensembles:
Diversity sources for neural networks:
Heterogeneous Ensembles:
Combine fundamentally different algorithm families:
Different inductive biases → different error patterns → more diversity.
```python
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
import numpy as np


def create_maximally_diverse_ensemble(random_state=42):
    """
    Create an ensemble maximizing diversity through:
    1. Different algorithm families
    2. Different hyperparameters within families
    3. Different random seeds
    """
    models = []

    # Tree-based family - multiple variants
    models.extend([
        ('rf_default', RandomForestClassifier(n_estimators=50, random_state=random_state)),
        ('rf_shallow', RandomForestClassifier(n_estimators=50, max_depth=5,
                                              random_state=random_state + 1)),
        ('rf_deep', RandomForestClassifier(n_estimators=50, min_samples_leaf=1,
                                           random_state=random_state + 2)),
        ('et', ExtraTreesClassifier(n_estimators=50, random_state=random_state + 3)),
    ])

    # Boosting family
    models.extend([
        ('gb', GradientBoostingClassifier(n_estimators=50, random_state=random_state + 4)),
        ('ada', AdaBoostClassifier(n_estimators=50, random_state=random_state + 5)),
    ])

    # Linear family
    models.extend([
        ('lr', LogisticRegression(max_iter=1000, random_state=random_state + 6)),
        ('ridge', RidgeClassifier(random_state=random_state + 7)),
    ])

    # Kernel family
    models.extend([
        ('svm_rbf', SVC(probability=True, random_state=random_state + 8)),
        ('svm_poly', SVC(kernel='poly', degree=3, probability=True,
                         random_state=random_state + 9)),
    ])

    # Instance-based
    models.extend([
        ('knn_3', KNeighborsClassifier(n_neighbors=3)),
        ('knn_10', KNeighborsClassifier(n_neighbors=10)),
    ])

    # Neural
    models.extend([
        ('mlp_small', MLPClassifier(hidden_layer_sizes=(50,), max_iter=500,
                                    random_state=random_state + 10)),
        ('mlp_deep', MLPClassifier(hidden_layer_sizes=(100, 50, 25), max_iter=500,
                                   random_state=random_state + 11)),
    ])

    # Probabilistic
    models.append(('nb', GaussianNB()))

    return models


def analyze_ensemble_diversity(models, X_train, y_train, X_test, y_test):
    """Fit models and analyze diversity."""
    # Assumes the metrics listing above is saved alongside this script as diversity_metrics.py
    from diversity_metrics import comprehensive_diversity_analysis

    predictions = []

    print("Training individual models...")
    for name, model in models:
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        acc = np.mean(pred == y_test)
        predictions.append(pred)
        print(f"  {name}: {acc:.4f}")

    predictions = np.array(predictions)
    print()

    return comprehensive_diversity_analysis(predictions, y_test)


if __name__ == "__main__":
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=20,
                               n_informative=15, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )

    models = create_maximally_diverse_ensemble()
    analyze_ensemble_diversity(models, X_train, y_train, X_test, y_test)
```

Despite its importance, diversity isn't always achievable or beneficial. Understanding when diversity fails helps set realistic expectations.
Intrinsic Correlation:
Some problems have structure that forces correlation:
Diversity Doesn't Help When:
All models are biased the same way: Diversity reduces variance, not bias. If the problem requires a representation no model can learn, diversity won't help.
Individual models are too weak: Random diverse models are still random. Each model must be at least better than chance.
Aggregation is wrong: Averaging diverse predictions only helps if the truth is somewhere in the convex hull of predictions.
The problem is trivially easy: If one model achieves near-perfect accuracy, there's nothing left to improve.
Don't assume diversity—measure it. Compute Q-statistics or correlation between your ensemble members. If diversity is naturally low for your problem, focus on improving individual models rather than adding more diverse but weaker ones.
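As a concrete sketch of that advice (synthetic data and a stock Random Forest, with 0/1 labels so sub-tree predictions compare directly against the test labels), the snippet below estimates the average pairwise correlation ρ of the members' correct/incorrect indicators:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# (n_trees, n_test) matrix of correct/incorrect indicators for every ensemble member.
correct = np.array([tree.predict(X_te) == y_te for tree in rf.estimators_]).astype(float)

# Average pairwise correlation of those indicators (the rho from the metrics above).
upper = np.triu_indices(len(correct), k=1)
avg_rho = np.nanmean(np.corrcoef(correct)[upper])
print(f"average pairwise error correlation: {avg_rho:.3f}")
# Near 1: members fail together and voting has little room to help.
# Near 0: errors are decorrelated and the ensemble has headroom.
```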
We've comprehensively explored diversity—the essential ingredient for effective ensemble learning. Let's consolidate:
What's Next:
With our theoretical foundation complete—variance reduction, crowd wisdom, error decomposition, and diversity—we turn to Ensemble Strategies. We'll survey the major families of ensemble methods: bagging, boosting, stacking, and more, understanding how each implements the principles we've learned.
You now have a comprehensive understanding of diversity in ensemble learning—what it is, how to create it, how to measure it, and when it helps. This knowledge is essential for designing effective ensembles in practice.