Imagine you're seeking investment advice. Would you rather consult:
A) 100 analysts who all read the same reports, use the same models, and generally agree with each other?
B) 100 analysts with diverse backgrounds, different methodologies, and independent perspectives?
Intuition correctly suggests option B provides more reliable collective wisdom. But why exactly? The answer lies in correlation—and understanding it is key to understanding why Random Forests work so remarkably well.
In ensemble learning, tree correlation is the hidden enemy. No matter how many trees we add, if they all make similar predictions (and similar errors), the ensemble can't escape their shared mistakes. This page develops the mathematical framework for understanding correlation in tree ensembles and shows how Random Forests systematically reduce it.
By the end of this page, you will understand the mathematical relationship between tree correlation and ensemble variance, how to measure and diagnose correlation in tree ensembles, why Random Forests specifically target correlation reduction, and quantitative techniques for further reducing tree correlation.
To understand why correlation matters, we need to analyze ensemble variance mathematically. Let's build up the framework step by step.
Setup: suppose we have $T$ trees $h_1(\mathbf{x}), \dots, h_T(\mathbf{x})$, each with the same variance $\sigma^2$ and pairwise correlation $\rho$, averaged into the ensemble prediction $H(\mathbf{x}) = \frac{1}{T}\sum_{t=1}^T h_t(\mathbf{x})$.
The Variance of an Average:
For a single tree with variance $\sigma^2$: $$\text{Var}[h_t(\mathbf{x})] = \sigma^2$$
For the average of $T$ independent trees ($\rho = 0$): $$\text{Var}[H(\mathbf{x})] = \text{Var}\left[\frac{1}{T}\sum_{t=1}^T h_t(\mathbf{x})\right] = \frac{1}{T^2} \cdot T\sigma^2 = \frac{\sigma^2}{T}$$
This is the ideal case: variance decreases as $1/T$. Adding more trees always helps.
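This $1/T$ scaling is easy to check empirically. The sketch below (a minimal simulation assuming only NumPy; variable names are illustrative) averages $T$ independent predictors with unit variance and compares the empirical variance of the average against $\sigma^2/T$:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 1.0

# 10,000 Monte Carlo trials: average T independent predictors, each with variance σ²
for T in [1, 10, 100]:
    preds = rng.normal(0.0, np.sqrt(sigma2), size=(10_000, T))
    ensemble = preds.mean(axis=1)  # H(x) for each trial
    print(f"T={T:4d}  empirical Var[H] = {ensemble.var():.4f}  theory σ²/T = {sigma2 / T:.4f}")
```

For $T = 100$ the empirical variance should land near 0.01, a hundredfold reduction over a single predictor.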
The Reality: Correlated Trees
When trees are correlated (which they always are in practice), the variance formula becomes:
$$\text{Var}[H(\mathbf{x})] = \frac{1}{T^2}\left[\sum_{t=1}^T \text{Var}[h_t] + \sum_{t \neq t'} \text{Cov}[h_t, h_{t'}]\right]$$
With identical variance $\sigma^2$ and identical pairwise correlation $\rho$:
$$\text{Var}[H(\mathbf{x})] = \frac{1}{T^2}\left[T\sigma^2 + T(T-1)\rho\sigma^2\right] = \frac{\sigma^2}{T} + \frac{T-1}{T}\rho\sigma^2$$
The Critical Insight:
As $T \to \infty$: $$\lim_{T \to \infty} \text{Var}[H(\mathbf{x})] = \lim_{T \to \infty}\left[\frac{\sigma^2}{T} + \frac{T-1}{T}\rho\sigma^2\right] = \rho\sigma^2$$
The ensemble variance converges to $\rho\sigma^2$, not zero. Correlation sets a floor on variance reduction.
If your trees have correlation ρ = 0.8 and variance σ² = 1.0, the ensemble variance cannot drop below 0.8 no matter how many trees you add. Writing the variance as ρσ² + (1−ρ)σ²/T: with 10 trees, variance = 0.8 + 0.02 = 0.82; with 1000 trees, variance = 0.8 + 0.0002 ≈ 0.800. The difference is negligible—correlation dominates!
| Correlation ρ | T=10 trees | T=100 trees | T=1000 trees | T→∞ limit |
|---|---|---|---|---|
| 0.0 (independent) | 0.100 | 0.010 | 0.001 | 0.000 |
| 0.2 | 0.280 | 0.208 | 0.201 | 0.200 |
| 0.5 | 0.550 | 0.505 | 0.501 | 0.500 |
| 0.8 | 0.820 | 0.802 | 0.800 | 0.800 |
| 0.95 | 0.955 | 0.951 | 0.950 | 0.950 |
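The table entries follow directly from the closed form above. A small helper (hypothetical name `ensemble_variance`) reproduces them:

```python
def ensemble_variance(rho, T, sigma2=1.0):
    """Var[H] = σ²/T + ((T-1)/T)·ρσ² for T equally correlated trees."""
    return sigma2 / T + (T - 1) / T * rho * sigma2

# Reproduce the table rows
for rho in [0.0, 0.2, 0.5, 0.8, 0.95]:
    row = "  ".join(f"{ensemble_variance(rho, T):.3f}" for T in (10, 100, 1000))
    print(f"ρ = {rho:4.2f}: {row}  (limit {rho:.3f})")
```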
Tree correlation in bagged ensembles comes from multiple sources. Understanding these sources helps us target correlation reduction strategies:
Source 1: Common Training Data
Even with bootstrap sampling, trees see overlapping data: each bootstrap sample contains about 63.2% of the unique training points, so any two trees share roughly 40% of their training data and learn from much of the same evidence.
Source 2: Dominant Features
When certain features are much more predictive than others: most trees split on the same dominant features near the root, so their top-level structure, and therefore their predictions, end up similar.
Source 3: Same Inductive Bias
All trees use the same algorithm (greedy recursive splitting): they share a single inductive bias, axis-aligned greedy splits, so they tend to make the same kinds of systematic errors even when trained on different samples.
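Source 1 can be quantified with a quick simulation (only NumPy assumed): each bootstrap sample contains about $1 - 1/e \approx 63.2\%$ of the unique points, so two independent bootstrap samples share about $0.632^2 \approx 40\%$ of the dataset:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Two independent bootstrap samples (sampling with replacement) of the same dataset
a = np.unique(rng.integers(0, n, size=n))
b = np.unique(rng.integers(0, n, size=n))

frac_unique = len(a) / n                     # ≈ 1 - 1/e ≈ 0.632
frac_shared = len(np.intersect1d(a, b)) / n  # ≈ 0.632² ≈ 0.400
print(f"unique points per bootstrap ≈ {frac_unique:.3f}")
print(f"points shared by two bootstraps ≈ {frac_shared:.3f}")
```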
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from itertools import combinations


def measure_tree_correlation(estimators, X, proba_class=1):
    """
    Measure average pairwise correlation between tree predictions.

    Parameters:
    -----------
    estimators : list of fitted trees
    X : test data to make predictions on
    proba_class : which class probability to use (for classification)

    Returns:
    --------
    tuple : (mean, std) of pairwise Pearson correlations
    """
    # Get predictions from each tree
    predictions = []
    for tree in estimators:
        if hasattr(tree, 'predict_proba'):
            pred = tree.predict_proba(X)[:, proba_class]
        else:
            pred = tree.predict(X).astype(float)
        predictions.append(pred)
    predictions = np.array(predictions)  # (n_trees, n_samples)

    # Calculate all pairwise correlations
    correlations = []
    for i, j in combinations(range(len(estimators)), 2):
        # Pearson correlation between tree i and tree j predictions
        corr = np.corrcoef(predictions[i], predictions[j])[0, 1]
        if not np.isnan(corr):
            correlations.append(corr)

    return np.mean(correlations), np.std(correlations)


def compare_correlation_by_method(X, y, n_estimators=50):
    """
    Compare tree correlation across different ensemble methods.
    """
    # Method 1: Bagged trees (no feature randomization)
    bagging = BaggingClassifier(
        estimator=DecisionTreeClassifier(),
        n_estimators=n_estimators,
        bootstrap=True,
        max_features=1.0,  # Using all features
        random_state=42
    )
    bagging.fit(X, y)

    # Method 2: Random Forest (sqrt features)
    rf_sqrt = RandomForestClassifier(
        n_estimators=n_estimators,
        max_features='sqrt',
        random_state=42
    )
    rf_sqrt.fit(X, y)

    # Method 3: Random Forest (log2 features)
    rf_log2 = RandomForestClassifier(
        n_estimators=n_estimators,
        max_features='log2',
        random_state=42
    )
    rf_log2.fit(X, y)

    # Method 4: Extremely aggressive randomization
    rf_extreme = RandomForestClassifier(
        n_estimators=n_estimators,
        max_features=2,  # Just 2 features per split
        random_state=42
    )
    rf_extreme.fit(X, y)

    results = {}
    print("Tree Correlation Comparison")
    print("=" * 60)
    for name, model in [
        ("Bagged Trees (all features)", bagging),
        ("Random Forest (sqrt)", rf_sqrt),
        ("Random Forest (log2)", rf_log2),
        ("Extreme Random (m=2)", rf_extreme),
    ]:
        mean_corr, std_corr = measure_tree_correlation(model.estimators_, X)
        accuracy = model.score(X, y)
        results[name] = {'correlation': mean_corr, 'accuracy': accuracy}
        print(f"{name:30} | ρ = {mean_corr:.4f} (±{std_corr:.4f}) | Acc = {accuracy:.4f}")

    return results


# Example with synthetic data
X, y = make_classification(
    n_samples=1000, n_features=50, n_informative=10,
    n_redundant=10, n_clusters_per_class=3, random_state=42
)
compare_correlation_by_method(X, y)
```

Notice that more aggressive feature randomization reduces correlation but may also reduce individual tree accuracy. The optimal balance minimizes ensemble error, not individual tree error or correlation alone. Random Forests with sqrt(p) features typically achieve this balance.
Random Forests attack tree correlation through two complementary mechanisms:
Mechanism 1: Feature Randomization Breaks Structural Similarity
By restricting each split to a random subset of features: trees are forced to explore different parts of the feature space, so two trees grown on nearly identical data can still develop very different structures.
Mechanism 2: Deep Trees Amplify Structural Differences
Random Forests typically grow unpruned trees: small structural differences introduced near the root compound at every level, so deep trees diverge far more than shallow ones would.
The probability that two trees make the same sequence of split decisions decreases exponentially with depth.
Quantifying the Decorrelation Effect:
Consider two trees at the same split point, each sampling $m = \sqrt{p}$ of the $p$ features as split candidates.
The probability that both trees choose the same feature is then roughly: $$P(\text{same feature chosen}) = \frac{m}{p} = \frac{\sqrt{p}}{p} = \frac{1}{\sqrt{p}}$$
For $p = 100$ features: $P = 0.10$ (only 10% chance of same feature)
Probability of same split sequence for the first $d$ splits: $$P(\text{same first } d \text{ splits}) \leq \left(\frac{1}{\sqrt{p}}\right)^d = p^{-d/2}$$
For $p = 100$ and $d = 5$ splits: $P \leq 100^{-2.5} = 10^{-5}$
This is a massive decorrelation effect. After just 5 splits, the probability of identical structure is essentially zero.
| p (features) | d=1 split | d=3 splits | d=5 splits | d=10 splits |
|---|---|---|---|---|
| 25 | 0.200 | 0.008 | 3.2×10⁻⁴ | 1.0×10⁻⁷ |
| 100 | 0.100 | 0.001 | 1.0×10⁻⁵ | 1.0×10⁻¹⁰ |
| 400 | 0.050 | 1.3×10⁻⁴ | 3.1×10⁻⁷ | 9.8×10⁻¹⁴ |
| 1000 | 0.032 | 3.2×10⁻⁵ | 3.2×10⁻⁸ | 1.0×10⁻¹⁵ |
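The table values come straight from the bound $p^{-d/2}$; a tiny helper (hypothetical name) reproduces them:

```python
def same_split_sequence_prob(p, d):
    """Upper bound p^(-d/2) on two trees making the same first d split choices."""
    return p ** (-d / 2)

for p in [25, 100, 400, 1000]:
    probs = "  ".join(f"{same_split_sequence_prob(p, d):.1e}" for d in (1, 3, 5, 10))
    print(f"p = {p:4d}: {probs}")
```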
The probability of structural similarity decreases EXPONENTIALLY with tree depth. This means that even if early splits happen to be similar (unlikely but possible), the trees diverge rapidly as depth increases. Deep Random Forest trees are essentially guaranteed to be structurally unique.
How do we diagnose and measure tree correlation in an actual Random Forest? Here are several practical approaches:
Method 1: Pairwise Prediction Correlation
The most direct approach—compute Pearson correlation between tree predictions:
$$\rho_{ij} = \frac{\text{Cov}(h_i(X), h_j(X))}{\sigma_i \sigma_j}$$
Then report the average: $\bar{\rho} = \frac{2}{T(T-1)} \sum_{i < j} \rho_{ij}$
Method 2: Out-of-Bag Prediction Diversity
Using OOB samples, measure how often trees disagree on classifications:
$$\text{Diversity} = \frac{1}{n}\sum_{i=1}^{n}\frac{\text{#trees disagreeing on sample } i}{\text{#trees that have } i \text{ in OOB}}$$
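A sketch of this diversity measure, simplified to use a held-out split rather than true OOB sets (extracting per-tree OOB indices would require scikit-learn internals); dataset and names are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# (n_trees, n_samples) matrix of per-tree class votes on the held-out data
votes = np.array([t.predict(X_ho) for t in rf.estimators_])
majority = (votes.mean(axis=0) >= 0.5).astype(votes.dtype)  # binary majority vote
disagree = (votes != majority).mean(axis=0)  # fraction of dissenting trees per sample
print(f"mean diversity = {disagree.mean():.3f}")
```

By construction the score lies between 0 (all trees always agree) and 0.5 (maximal disagreement with the majority).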
Method 3: Structural Similarity Metrics
Compare tree structures directly (which features appear, at what depths, etc.).
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from collections import defaultdict


def comprehensive_correlation_analysis(rf, X, y):
    """
    Perform comprehensive correlation analysis on a fitted Random Forest.
    """
    n_samples = X.shape[0]
    n_trees = len(rf.estimators_)

    # ============================================
    # Method 1: Prediction Correlation
    # ============================================
    # Get probability predictions from each tree
    tree_probas = np.array([
        tree.predict_proba(X)[:, 1] for tree in rf.estimators_
    ])  # Shape: (n_trees, n_samples)

    # Compute pairwise correlations
    correlations = []
    for i in range(n_trees):
        for j in range(i + 1, n_trees):
            r = np.corrcoef(tree_probas[i], tree_probas[j])[0, 1]
            if not np.isnan(r):
                correlations.append(r)

    print("=" * 60)
    print("Random Forest Correlation Diagnostics")
    print("=" * 60)
    print("\n1. PREDICTION CORRELATION")
    print(f"   Average pairwise ρ: {np.mean(correlations):.4f}")
    print(f"   Std pairwise ρ:     {np.std(correlations):.4f}")
    print(f"   Min/Max ρ: [{np.min(correlations):.4f}, {np.max(correlations):.4f}]")

    # ============================================
    # Method 2: Vote Disagreement (Classification)
    # ============================================
    tree_predictions = np.array([
        tree.predict(X) for tree in rf.estimators_
    ])  # Shape: (n_trees, n_samples)

    # For each sample, what fraction of trees agree with the majority?
    majority_agreement = []
    for i in range(n_samples):
        votes = tree_predictions[:, i]
        majority_class = np.bincount(votes.astype(int)).argmax()
        agreement = (votes == majority_class).mean()
        majority_agreement.append(agreement)

    avg_agreement = np.mean(majority_agreement)
    diversity_score = 1 - avg_agreement  # Higher = more diverse

    print("\n2. VOTE DIVERSITY")
    print(f"   Average majority agreement: {avg_agreement:.4f}")
    print(f"   Diversity score: {diversity_score:.4f}")
    print("   (Higher diversity = lower correlation)")

    # ============================================
    # Method 3: Structural Analysis
    # ============================================
    # Count feature usage at each depth
    depth_feature_usage = defaultdict(lambda: defaultdict(int))
    for tree in rf.estimators_:
        tree_struct = tree.tree_
        # DFS to track depth
        stack = [(0, 0)]  # (node_id, depth)
        while stack:
            node_id, depth = stack.pop()
            if tree_struct.feature[node_id] >= 0:  # Not a leaf
                feature = tree_struct.feature[node_id]
                depth_feature_usage[depth][feature] += 1
                stack.append((tree_struct.children_left[node_id], depth + 1))
                stack.append((tree_struct.children_right[node_id], depth + 1))

    print("\n3. STRUCTURAL DIVERSITY (Feature usage at depth)")
    for depth in sorted(depth_feature_usage.keys())[:5]:  # First 5 depths
        features_used = depth_feature_usage[depth]
        unique_features = len(features_used)
        total_splits = sum(features_used.values())
        top_feature = max(features_used.items(), key=lambda x: x[1])
        top_feature_pct = top_feature[1] / total_splits * 100
        print(f"   Depth {depth}: {unique_features} unique features, "
              f"top feature {top_feature[0]} used {top_feature_pct:.1f}% of splits")

    # ============================================
    # Method 4: Implied Variance Floor
    # ============================================
    ensemble_var = np.var(rf.predict_proba(X)[:, 1])
    individual_vars = np.var(tree_probas, axis=1)
    avg_individual_var = np.mean(individual_vars)

    # Implied correlation from the variance floor: Var_ensemble = ρσ² as T → ∞
    implied_rho = ensemble_var / avg_individual_var if avg_individual_var > 0 else 0

    print("\n4. VARIANCE ANALYSIS")
    print(f"   Avg individual tree variance: {avg_individual_var:.6f}")
    print(f"   Ensemble variance: {ensemble_var:.6f}")
    print(f"   Variance reduction ratio: {ensemble_var / avg_individual_var:.4f}")
    print(f"   Implied minimum correlation: {implied_rho:.4f}")

    return {
        'avg_correlation': np.mean(correlations),
        'diversity_score': diversity_score,
        'variance_reduction': ensemble_var / avg_individual_var,
    }


# Example usage
X, y = make_classification(
    n_samples=500, n_features=30, n_informative=10,
    n_redundant=5, random_state=42
)
rf = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=42)
rf.fit(X, y)
results = comprehensive_correlation_analysis(rf, X, y)
```

When tuning a Random Forest: (1) first check whether correlation is high (ρ > 0.5), and if so, reduce max_features; (2) check structural diversity, and if a single feature dominates root splits, consider feature engineering; (3) monitor the variance reduction ratio, since a ratio close to 1.0 means adding more trees won't help much.
Beyond the standard Random Forest configuration, several techniques can further reduce tree correlation:
Strategy 1: Extremely Randomized Trees (Extra-Trees)
Instead of finding the optimal threshold for each candidate feature, Extra-Trees draw a random threshold per candidate and keep the best of those random splits; by default they also train each tree on the full dataset rather than a bootstrap sample.
This adds a third layer of randomization, often reducing correlation substantially.
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score


def compare_rf_et(X, y, n_estimators=100):
    """
    Compare Random Forest vs Extra-Trees on correlation and accuracy.
    """
    rf = RandomForestClassifier(n_estimators=n_estimators, random_state=42)
    et = ExtraTreesClassifier(n_estimators=n_estimators, random_state=42)
    rf.fit(X, y)
    et.fit(X, y)

    def measure_avg_correlation(estimators, X):
        probas = np.array([t.predict_proba(X)[:, 1] for t in estimators])
        corrs = []
        for i in range(len(estimators)):
            for j in range(i + 1, len(estimators)):
                r = np.corrcoef(probas[i], probas[j])[0, 1]
                if not np.isnan(r):
                    corrs.append(r)
        return np.mean(corrs)

    rf_corr = measure_avg_correlation(rf.estimators_, X)
    et_corr = measure_avg_correlation(et.estimators_, X)

    rf_cv = cross_val_score(rf, X, y, cv=5).mean()
    et_cv = cross_val_score(et, X, y, cv=5).mean()

    print("Random Forest vs Extra-Trees")
    print("=" * 50)
    print(f"Random Forest: ρ = {rf_corr:.4f}, CV Accuracy = {rf_cv:.4f}")
    print(f"Extra-Trees:   ρ = {et_corr:.4f}, CV Accuracy = {et_cv:.4f}")
    print(f"Correlation reduction: {(rf_corr - et_corr) / rf_corr * 100:.1f}%")


X, y = make_classification(
    n_samples=1000, n_features=50, n_informative=15, random_state=42
)
compare_rf_et(X, y)
```

Strategy 2: Random Rotation Forests
Rotation Forests (Rodriguez et al., 2006) apply random rotations to feature space: for each tree, the features are partitioned into random subsets, PCA is run on each subset, and the tree is trained on the rotated (principal-component) axes.
This creates diversity not just in which features are used, but in how they're combined.
Strategy 3: Increased Bootstrap Variation
Adjust bootstrap sampling for more diversity: smaller bootstrap samples (max_samples < 1.0) overlap less, so trees see more distinct subsets of the data; subsampling without replacement can similarly increase diversity.
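A minimal sketch of this idea using scikit-learn's `max_samples` parameter, which caps each tree's bootstrap sample at a fraction of the training rows (dataset and values here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=30, n_informative=10, random_state=0)

# Each tree trains on a bootstrap sample of only 50% of the rows,
# so any two trees overlap on fewer training points
rf_sub = RandomForestClassifier(
    n_estimators=100,
    max_samples=0.5,
    random_state=0,
).fit(X, y)
print(f"training accuracy with max_samples=0.5: {rf_sub.score(X, y):.3f}")
```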
Strategy 4: Base Learner Variation
Use different types of base learners: for example, mix trees with different depth limits or split criteria (Gini vs. entropy) so that their inductive biases, and hence their errors, differ.
| Strategy | Correlation Reduction | Computational Cost | When to Use |
|---|---|---|---|
| More trees (baseline) | Marginal (hits floor) | Linear | When correlation is already low |
| Reduce max_features | Moderate | Actually faster | When trees are too similar |
| Extra-Trees | Significant | Faster (no threshold search) | Speed matters, slight accuracy OK |
| Subsampling | Moderate | Faster (smaller samples) | Large datasets |
| Rotation Forests | High | Higher (PCA overhead) | When RF correlation plateaus |
More aggressive randomization always reduces correlation, but may reduce individual tree accuracy. The goal is minimizing ENSEMBLE error, which depends on both. Typically, sqrt(p) for max_features achieves a near-optimal balance, but datasets with many noisy features may benefit from smaller values.
Leo Breiman, in his seminal 2001 paper, formalized the relationship between tree strength (individual accuracy) and correlation (prediction similarity) in determining Random Forest generalization error.
Definitions: the strength $s$ is the expected margin of the forest's trees (how strongly, on average, the votes favor the correct class over the best incorrect class), and $\bar{\rho}$ is the mean pairwise correlation between the trees' margin functions.
Breiman's Bound:
The generalization error $PE^*$ of a Random Forest is bounded by:
$$PE^* \leq \frac{\bar{\rho}(1 - s^2)}{s^2}$$
or equivalently:
$$PE^* \leq \frac{\bar{\rho}}{s^2} - \bar{\rho}$$
Key Implications: lowering $\bar{\rho}$ tightens the bound directly; raising $s$ tightens it through the $1/s^2$ factor; and because randomization lowers both $\bar{\rho}$ and $s$, the quantity to optimize is the ratio $\bar{\rho}/s^2$, not either term alone.
Interpreting the Bound:
This bound reveals why Random Forests work so well:
| Scenario | Correlation $\bar{\rho}$ | Strength $s$ | Error Bound |
|---|---|---|---|
| Strong correlated trees | 0.8 | 0.7 | 0.8/(0.49) - 0.8 = 0.83 |
| Weak uncorrelated trees | 0.2 | 0.4 | 0.2/(0.16) - 0.2 = 1.05 |
| RF sweet spot | 0.3 | 0.6 | 0.3/(0.36) - 0.3 = 0.53 |
Random Forests find the sweet spot: trees that are strong enough to be useful but different enough that their errors don't align.
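The error bounds in the table can be reproduced with a small helper (hypothetical name `breiman_bound`) implementing $\bar{\rho}(1 - s^2)/s^2$:

```python
def breiman_bound(rho_bar, s):
    """Breiman (2001): PE* ≤ ρ̄(1 − s²)/s²."""
    return rho_bar * (1 - s ** 2) / s ** 2

for name, rho_bar, s in [
    ("Strong correlated trees", 0.8, 0.7),
    ("Weak uncorrelated trees", 0.2, 0.4),
    ("RF sweet spot", 0.3, 0.6),
]:
    print(f"{name:25} bound = {breiman_bound(rho_bar, s):.2f}")
```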
The sqrt(p) Justification:
With $m_{try} = \sqrt{p}$: correlation $\bar{\rho}$ drops sharply because two trees rarely consider the same features at a given split, while strength $s$ declines only modestly because the strongest features still appear often enough, so the ratio $\bar{\rho}/s^2$ in Breiman's bound improves.
Breiman's bound explains why Random Forests are surprisingly robust to hyperparameter choices. The sqrt(p) default for max_features typically lands in a broad region where ρ̄/s² is near-optimal. You have to try quite hard (very small or very large max_features) to break out of this region.
Understanding tree correlation has immediate practical implications for Random Forest deployment:
When to Worry About Correlation: a measured average pairwise ρ above roughly 0.5, OOB error that stops improving as trees are added, or a single feature dominating the root splits across trees.
Remediation Steps:
1. Reduce max_features — easiest intervention
2. Set max_samples < 1.0 so trees see less overlapping data
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score


def diagnose_and_tune_correlation(X, y):
    """
    Diagnose correlation issues and suggest remediation.
    """
    n_features = X.shape[1]

    print("Random Forest Correlation Diagnostic")
    print("=" * 60)

    # Test different configurations
    configs = [
        ('Default (sqrt)', {'max_features': 'sqrt'}),
        ('p/3', {'max_features': max(1, n_features // 3)}),
        ('p/4', {'max_features': max(1, n_features // 4)}),
        ('log2', {'max_features': 'log2'}),
        ('3 features', {'max_features': 3}),
    ]

    results = []
    for name, params in configs:
        rf = RandomForestClassifier(n_estimators=100, random_state=42, **params)
        rf.fit(X, y)

        # Measure correlation
        probas = np.array([t.predict_proba(X)[:, 1] for t in rf.estimators_])
        corrs = []
        for i in range(len(probas)):
            for j in range(i + 1, len(probas)):
                r = np.corrcoef(probas[i], probas[j])[0, 1]
                if not np.isnan(r):
                    corrs.append(r)
        avg_corr = np.mean(corrs)

        # Cross-validation score
        cv_score = cross_val_score(rf, X, y, cv=5).mean()
        results.append((name, avg_corr, cv_score))
        print(f"{name:20} | ρ = {avg_corr:.4f} | CV = {cv_score:.4f}")

    # Find best configuration (highest CV score)
    best_idx = np.argmax([r[2] for r in results])
    print(f"\nRecommendation: {results[best_idx][0]} (highest CV score)")

    # Check if correlation is problematic
    default_corr = results[0][1]
    if default_corr > 0.6:
        print("⚠️ WARNING: High tree correlation detected!")
        print("   Consider reducing max_features or using Extra-Trees")
    elif default_corr > 0.4:
        print("📊 MODERATE correlation - some room for improvement")
    else:
        print("✅ LOW correlation - ensemble is well-diversified")

    return results


# Example
X, y = make_classification(
    n_samples=1000, n_features=50, n_informative=10,
    n_redundant=10, random_state=42
)
diagnose_and_tune_correlation(X, y)
```

Obsessive correlation reduction can hurt.
If you reduce max_features too much, trees become so weak that the ensemble suffers despite low correlation. Trust the sqrt(p) default unless you have clear evidence of correlation problems. The goal is low ENSEMBLE error, not low correlation per se.
We've developed a comprehensive understanding of tree correlation—the hidden factor that determines how much variance reduction an ensemble can achieve.
What's Next:
With a solid understanding of how Random Forests achieve diversity through feature randomization and correlation reduction, the next page covers hyperparameters—the full set of tuning knobs available and how to configure them for optimal performance.
You now understand why tree correlation matters and how Random Forests systematically reduce it. This insight explains why Random Forests often outperform bagged trees despite each individual RF tree being suboptimal—the key is uncorrelated errors that average away. Next, we'll explore the full hyperparameter landscape.