In the previous page, we explored the random subspace method—training ensemble members on different feature subsets. Random Forests take this idea and amplify it: instead of selecting features once per tree, they re-randomize feature selection at every single split decision.
This seemingly small modification has profound consequences. It's what transforms a collection of bagged decision trees into the remarkably powerful Random Forest algorithm that has become one of the most successful and widely-used machine learning methods in both research and industry.
In this page, we'll dissect exactly how per-split feature randomization works, understand why it's so effective, and explore the critical max_features (or mtry) hyperparameter that controls this randomization.
By the end of this page, you will master the per-split feature sampling mechanism, understand the mathematical motivation for mtry = sqrt(p), analyze how feature randomization affects the bias-variance tradeoff, and be able to tune max_features for different problem types.
The defining characteristic of Random Forests is randomizing feature selection at each internal node, not just once per tree. Let's trace through exactly what happens when building a Random Forest tree:
Standard Decision Tree (No Randomization): at every node, evaluate all $p$ features and choose the single best split among them.

Random Forest Tree (Per-Split Randomization): at every node, draw a fresh random subset of max_features features, then choose the best split among only those candidates.
The key difference: every split decision is constrained to a random subset of features, and that subset changes at every node.
```python
import numpy as np


class RandomForestTreeNode:
    """
    A single node in a Random Forest decision tree.
    Demonstrates per-split feature randomization.
    """

    def __init__(self, X, y, max_features, min_samples_leaf=1, depth=0, max_depth=None):
        self.n_samples, self.n_features = X.shape
        self.depth = depth
        self.min_samples_leaf = min_samples_leaf
        self.left = None
        self.right = None
        self.feature = None
        self.threshold = None
        self.prediction = self._compute_prediction(y)

        # Check stopping conditions
        if self._should_stop(y, max_depth):
            self.is_leaf = True
            return

        self.is_leaf = False

        # KEY RANDOM FOREST MECHANISM:
        # Sample max_features random features for THIS split.
        # This sampling happens independently at EVERY node.
        m = self._get_max_features(max_features)
        feature_candidates = np.random.choice(
            self.n_features, size=min(m, self.n_features), replace=False
        )

        # Find the best split among only the sampled features
        best_feature, best_threshold, best_gain = self._find_best_split(
            X, y, feature_candidates
        )

        if best_gain <= 0:
            self.is_leaf = True
            return

        self.feature = best_feature
        self.threshold = best_threshold

        # Split data and recurse
        left_mask = X[:, best_feature] <= best_threshold
        right_mask = ~left_mask

        # Note: child nodes will sample their OWN random feature sets
        self.left = RandomForestTreeNode(
            X[left_mask], y[left_mask], max_features,
            min_samples_leaf, depth + 1, max_depth
        )
        self.right = RandomForestTreeNode(
            X[right_mask], y[right_mask], max_features,
            min_samples_leaf, depth + 1, max_depth
        )

    def _get_max_features(self, max_features):
        """
        Interpret the max_features parameter. Supported values:
        - "sqrt" or "auto": sqrt(n_features) — sklearn default for classification
        - "log2": log2(n_features)
        - float in (0, 1]: fraction of features
        - int: exact number of features
        """
        if max_features in ("sqrt", "auto"):
            return max(1, int(np.sqrt(self.n_features)))
        elif max_features == "log2":
            return max(1, int(np.log2(self.n_features)))
        elif isinstance(max_features, float):
            return max(1, int(max_features * self.n_features))
        elif isinstance(max_features, int):
            return min(max_features, self.n_features)
        else:
            return self.n_features  # Consider all features

    def _find_best_split(self, X, y, feature_candidates):
        """
        Find the best split, but ONLY considering the randomly
        selected feature candidates.
        """
        best_gain = -np.inf
        best_feature = None
        best_threshold = None

        # ONLY iterate through candidate features (not all features)
        for feature_idx in feature_candidates:
            values = X[:, feature_idx]
            for threshold in np.unique(values):
                gain = self._compute_gain(y, values <= threshold)
                if gain > best_gain:
                    best_gain = gain
                    best_feature = feature_idx
                    best_threshold = threshold

        return best_feature, best_threshold, best_gain

    # --- Minimal helpers: majority-class prediction, stopping rule, Gini gain ---

    def _compute_prediction(self, y):
        values, counts = np.unique(y, return_counts=True)
        return values[np.argmax(counts)]

    def _should_stop(self, y, max_depth):
        return (
            len(np.unique(y)) == 1  # node is pure
            or (max_depth is not None and self.depth >= max_depth)
            or self.n_samples < 2 * self.min_samples_leaf  # too few samples to split
        )

    def _compute_gain(self, y, left_mask):
        def gini(labels):
            _, counts = np.unique(labels, return_counts=True)
            p = counts / counts.sum()
            return 1.0 - np.sum(p ** 2)

        n = len(y)
        n_left = int(left_mask.sum())
        n_right = n - n_left
        if n_left == 0 or n_right == 0:
            return -np.inf  # degenerate split
        return gini(y) - (n_left / n) * gini(y[left_mask]) - (n_right / n) * gini(y[~left_mask])
```

If features were sampled once per tree (classic RSM), trees would still have similar upper structure—just with a restricted feature set. Per-split sampling ensures that even within a single tree, different paths use different features. This creates massive internal diversity: the same region of the input space can end up being partitioned by entirely different features in different trees.
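To make that contrast concrete, here is a small comparison sketch using scikit-learn (the dataset and parameter values are illustrative): BaggingClassifier with an integer max_features draws one feature subset per estimator—essentially the classic random subspace setup—while RandomForestClassifier re-draws candidates at every split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=50, n_informative=10,
                           random_state=0)
m = int(np.sqrt(X.shape[1]))

# Classic random subspace: each tree sees ONE fixed feature subset
rsm = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # older scikit-learn versions: base_estimator=
    n_estimators=100,
    max_features=m,        # subset drawn once per tree
    bootstrap=True,
    random_state=0,
)

# Random Forest: a fresh candidate subset is drawn at EVERY split
rf = RandomForestClassifier(
    n_estimators=100, max_features="sqrt", bootstrap=True, random_state=0
)

for name, model in [("Per-tree subspace (RSM)", rsm), ("Per-split sampling (RF)", rf)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:26s} accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")
```

Exact numbers depend on the data, but the per-split version typically achieves lower tree correlation for comparable or better accuracy.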
The max_features parameter (also called mtry in the R community) is arguably the most important hyperparameter in Random Forests. It controls the size of the random feature subset considered at each split.
Possible Values:
| Value | Features at Each Split | When to Use |
|---|---|---|
"sqrt" | $\sqrt{p}$ | Default for classification |
"log2" | $\log_2(p)$ | Very high dimensional data |
1.0 or None | All $p$ features | Reduces to bagged trees |
0.5 | $p/2$ features | Moderate randomization |
"auto" | Library-dependent (usually sqrt for classification, all for regression) | Following defaults |
| Integer $m$ | Exactly $m$ features | Explicit control |
| Float in $(0,1)$ | That fraction of $p$ | Proportional control |
The Recommended Defaults:
Extensive empirical studies (Breiman 2001, and subsequent research) have established:

- Classification: max_features = $\sqrt{p}$ (the "sqrt" setting)
- Regression: max_features = $p/3$
These are remarkably robust defaults that work well across a wide variety of datasets. However, they are not universally optimal—the best value depends on the signal-to-noise ratio in your features.
```python
import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import cross_val_score


def compare_max_features(X, y, task='classification'):
    """
    Compare Random Forest performance across different max_features values.
    """
    n_features = X.shape[1]

    # Test different max_features values
    max_features_options = [
        ('1 (single feature)', 1),
        ('sqrt(p)', 'sqrt'),
        ('log2(p)', 'log2'),
        ('p/3', max(1, n_features // 3)),
        ('p/2', max(1, n_features // 2)),
        ('All features (bagging)', None),
    ]

    results = {}
    for name, max_features in max_features_options:
        if task == 'classification':
            rf = RandomForestClassifier(
                n_estimators=100, max_features=max_features, random_state=42
            )
        else:
            rf = RandomForestRegressor(
                n_estimators=100, max_features=max_features, random_state=42
            )

        scores = cross_val_score(rf, X, y, cv=5)
        results[name] = {
            'mean': scores.mean(),
            'std': scores.std(),
            'effective_features': _get_effective_features(max_features, n_features),
        }
        # Default scoring: accuracy for classification, R^2 for regression
        print(f"{name:25} | Score: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")

    return results


def _get_effective_features(max_features, n_features):
    """Calculate the actual number of features considered per split."""
    if max_features is None:
        return n_features
    elif max_features == 'sqrt':
        return int(np.sqrt(n_features))
    elif max_features == 'log2':
        return int(np.log2(n_features))
    elif isinstance(max_features, float):
        return int(max_features * n_features)
    else:
        return max_features


# Example with synthetic data
print("=" * 60)
print("max_features Comparison: Classification (100 features)")
print("=" * 60)
X_clf, y_clf = make_classification(
    n_samples=1000, n_features=100, n_informative=20,
    n_redundant=10, random_state=42
)
compare_max_features(X_clf, y_clf, 'classification')

print("\n" + "=" * 60)
print("max_features Comparison: Regression (100 features)")
print("=" * 60)
X_reg, y_reg = make_regression(
    n_samples=1000, n_features=100, n_informative=20,
    noise=10, random_state=42
)
compare_max_features(X_reg, y_reg, 'regression')
```

Start with the defaults ($\sqrt{p}$ for classification, $p/3$ for regression). If performance is unsatisfactory, try a grid search over $[\sqrt{p},\ 2\sqrt{p},\ p/3,\ p/2,\ 0.8p]$, as sketched below. For high-dimensional, noisy data, using fewer features often helps; for carefully engineered features with a high signal-to-noise ratio, more features may be better.
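A minimal grid-search sketch of that advice, assuming scikit-learn and an illustrative synthetic dataset (adjust n_estimators and the candidate grid to your problem):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=100, n_informative=20,
                           random_state=42)
p = X.shape[1]

# Candidate values built around the defaults discussed above
candidates = sorted({
    max(1, int(np.sqrt(p))),       # sqrt(p)
    max(1, int(2 * np.sqrt(p))),   # 2*sqrt(p)
    max(1, p // 3),                # p/3
    max(1, p // 2),                # p/2
    max(1, int(0.8 * p)),          # 0.8*p
})

search = GridSearchCV(
    RandomForestClassifier(n_estimators=200, random_state=42),
    param_grid={"max_features": candidates},
    cv=5,
    n_jobs=-1,
)
search.fit(X, y)
print(f"Best max_features: {search.best_params_['max_features']}")
print(f"Best CV score:     {search.best_score_:.4f}")
```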
Why does $m_{try} = \sqrt{p}$ work so well for classification? Let's develop the mathematical intuition.
The Strength-Correlation Tradeoff:
The key insight from Breiman's analysis is that ensemble accuracy depends on two factors: the strength of the individual trees and the correlation between them. For an ensemble of $T$ trees, the variance of the averaged prediction (the part of the generalization error that averaging can reduce) decomposes approximately as:

$$\text{Var}(\text{ensemble}) \approx \bar{\rho}\, s^2 + \frac{1 - \bar{\rho}}{T}\, s^2$$

Where:

- $s^2$ is the variance of a single tree's prediction
- $\bar{\rho}$ is the average pairwise correlation between trees' predictions
- $T$ is the number of trees

As $T$ grows, the second term vanishes, so $\bar{\rho}\, s^2$ sets the floor on ensemble variance—adding more trees cannot push past it. As $m_{try}$ decreases:

- $\bar{\rho}$ decreases: trees see different candidate features, make different splits, and their errors become less correlated (lowering the variance floor)
- Individual tree strength also decreases: each split is chosen from a weaker candidate pool (raising bias)

The optimal $m_{try}$ balances these two opposing effects.
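A quick numeric check of this decomposition, with illustrative values of $\bar{\rho}$, $s^2$, and $T$, shows why lowering correlation matters more than adding trees once $T$ is large:

```python
def ensemble_variance(rho_bar, s2, T):
    """Variance of an average of T predictors, each with variance s2
    and average pairwise correlation rho_bar."""
    return rho_bar * s2 + (1 - rho_bar) * s2 / T

s2 = 1.0
for T in (10, 100, 1000):
    for rho_bar in (0.7, 0.3):
        v = ensemble_variance(rho_bar, s2, T)
        print(f"T={T:5d}  rho_bar={rho_bar:.1f}  ensemble variance={v:.3f}")
# As T grows, the variance approaches rho_bar * s2: the correlation sets the floor.
```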
Probability of Including an Important Feature:
Suppose there are $k$ important features out of $p$ total. At a given split, the probability that at least one important feature is included in the random sample of $m$ features is:
$$P(\text{at least one important}) = 1 - \frac{\binom{p-k}{m}}{\binom{p}{m}} = 1 - \prod_{i=0}^{m-1}\frac{p-k-i}{p-i}$$
For $p = 100$, $k = 10$ important features, and $m = 10$ (i.e., $\sqrt{100}$):
$$P(\text{at least one important}) = 1 - \frac{\binom{90}{10}}{\binom{100}{10}} \approx 0.67$$

With $m = \sqrt{p}$, there's a ~67% chance of having at least one important feature among the candidates at each split. This is enough to build reasonably good trees while maintaining high diversity.
| m (features at split) | P(≥1 important feature) | Diversity Level |
|---|---|---|
| 2 | 0.191 | Very High (but weak splits) |
| 5 | 0.416 | High |
| 10 (√p) | 0.670 | Moderate-High (recommended) |
| 20 | 0.905 | Moderate |
| 33 (p/3) | 0.986 | Low-Moderate |
| 50 (p/2) | 0.999 | Low |
| 100 (all) | 1.000 | None (bagging only) |
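The table values can be reproduced in a few lines (assuming $p = 100$ total features and $k = 10$ important ones, as above):

```python
from math import comb

p, k = 100, 10  # total features, important features

def prob_at_least_one_important(m, p=p, k=k):
    """P(at least one of the k important features is among the m candidates)."""
    return 1 - comb(p - k, m) / comb(p, m)

for m in (2, 5, 10, 20, 33, 50, 100):
    print(f"m = {m:3d}  ->  P(>=1 important) = {prob_at_least_one_important(m):.3f}")
```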
Why $\sqrt{p}$ is the Sweet Spot:
The $\sqrt{p}$ choice emerges from balancing competing concerns:
Diversity requirement: With $m = \sqrt{p}$, any given feature has only an $m/p = 1/\sqrt{p}$ chance of even being a candidate at a particular split. For $p = 100$, that's just 10%, so different trees rarely make the same split decision at the same position.

Strength requirement: Even with small $m$, if $T$ is large enough, every important feature is a split candidate at any given node position in roughly $(m/p) \times T$ trees. With $T = 500$ trees and $m/p = 0.1$, each important feature is a candidate in ~50 trees at any given split depth—plenty of opportunities for it to be used somewhere in the forest.
Computational efficiency: Smaller $m$ means fewer feature evaluations per split: $O(m \cdot n)$ instead of $O(p \cdot n)$. For $p = 10000$, using $m = 100$ is 100× faster than using all features.
Individual trees don't need to be perfect—they need to be 'good enough' while being different from each other. The √p rule ensures each tree has a reasonable chance of using important features while being structurally dissimilar from other trees. The ensemble aggregates these diverse 'good enough' trees into excellent collective decisions.
Feature randomization fundamentally changes how Random Forest trees develop compared to standard decision trees. Let's examine these structural effects:
1. Deeper Feature Penetration:
In a standard decision tree, the best feature at the root often "shadows" related features—if income is the best split, years_employed (which correlates with income) may never appear high in the tree. With feature randomization:
- At some nodes, income is excluded from the candidate set
- years_employed then gets a chance to split at upper levels

2. Suboptimal Splits at Individual Nodes:
Each Random Forest split is locally suboptimal—we're not choosing the best split overall, just the best among $m$ random candidates. However, this is a price worth paying: averaging many slightly suboptimal trees typically beats a single greedy tree, and the local suboptimality is precisely what decorrelates the trees' errors.
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from collections import Counter


def analyze_feature_usage(X, y, n_trees=100):
    """
    Compare which features appear at the root across:
    1. Standard bagged decision trees (all features)
    2. Random Forest trees (sqrt features)
    """
    n_features = X.shape[1]

    # Bagged trees (no feature restriction)
    bagged_roots = []
    for i in range(n_trees):
        # Bootstrap sample
        indices = np.random.choice(len(X), size=len(X), replace=True)
        X_boot, y_boot = X[indices], y[indices]
        tree = DecisionTreeClassifier(random_state=i)
        tree.fit(X_boot, y_boot)
        bagged_roots.append(tree.tree_.feature[0])

    # Random Forest trees (sqrt feature restriction)
    rf = RandomForestClassifier(
        n_estimators=n_trees, max_features='sqrt', random_state=42
    )
    rf.fit(X, y)
    rf_roots = [tree.tree_.feature[0] for tree in rf.estimators_]

    # Analyze diversity
    print("Feature at Root Node Analysis")
    print("=" * 50)

    print(f"\nBagged Trees (all features considered):")
    bagged_counts = Counter(bagged_roots)
    print(f"  Unique features used at root: {len(bagged_counts)}")
    print(f"  Most common root feature: {bagged_counts.most_common(1)[0]}")
    print(f"  Top 3 root features: {bagged_counts.most_common(3)}")

    print(f"\nRandom Forest Trees (sqrt(p) features):")
    rf_counts = Counter(rf_roots)
    print(f"  Unique features used at root: {len(rf_counts)}")
    print(f"  Most common root feature: {rf_counts.most_common(1)[0]}")
    print(f"  Top 3 root features: {rf_counts.most_common(3)}")

    # Entropy of feature usage (measure of diversity)
    def feature_entropy(feature_list, n_features):
        counts = np.zeros(n_features)
        for f in feature_list:
            counts[f] += 1
        probs = counts / len(feature_list)
        probs = probs[probs > 0]
        return -np.sum(probs * np.log2(probs))

    print(f"\nDiversity Metrics:")
    print(f"  Bagged entropy: {feature_entropy(bagged_roots, n_features):.3f} bits")
    print(f"  RF entropy:     {feature_entropy(rf_roots, n_features):.3f} bits")
    print(f"  Max entropy:    {np.log2(n_features):.3f} bits (uniform distribution)")


# Example: 50 features with 5 dominant ones
X, y = make_classification(
    n_samples=1000, n_features=50, n_informative=5,
    n_redundant=5, n_clusters_per_class=2, random_state=42
)

analyze_feature_usage(X, y)
```

3. Smoother Decision Boundaries:
The ensemble of randomized trees creates smoother decision boundaries than individual trees or bagged trees:
| Method | Decision Boundary Character |
|---|---|
| Single Decision Tree | Jagged, axis-aligned, prone to overfitting |
| Bagged Trees | Less jagged (averaged), but still similar shapes |
| Random Forest | Smooth, can approximate curved boundaries |
The smoothing happens because different trees place their axis-aligned cuts on different features and at different thresholds; averaging many offset, staircase-shaped boundaries approximates a smooth curve.
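To see the smoothing visually, the sketch below (assuming matplotlib is available; the dataset and styling are illustrative) plots the predicted probability surface of a single tree next to that of a Random Forest on a toy two-moons dataset:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)

models = [
    ("Single Decision Tree", DecisionTreeClassifier(random_state=0)),
    ("Random Forest (max_features='sqrt')",
     RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)),
]

# Evaluate P(class 1) on a dense grid to visualize the boundary character
xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 300),
    np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 300),
)
grid = np.c_[xx.ravel(), yy.ravel()]

fig, axes = plt.subplots(1, 2, figsize=(11, 4))
for ax, (name, model) in zip(axes, models):
    model.fit(X, y)
    proba = model.predict_proba(grid)[:, 1].reshape(xx.shape)
    ax.contourf(xx, yy, proba, levels=20, cmap="RdBu_r", alpha=0.8)
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap="RdBu_r", edgecolor="k", s=15)
    ax.set_title(name)
plt.tight_layout()
plt.show()
```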
A common mistake is trying to understand Random Forests by examining individual trees. Individual RF trees are intentionally 'weird'—they use restricted feature sets and make locally suboptimal splits. The power emerges from aggregation, not from individual tree quality. Judge a Random Forest by its ensemble performance, not by any single tree.
Random Forests combine two orthogonal sources of randomization: bootstrap sampling of the training rows (bagging) and per-split sampling of the candidate features. These interact multiplicatively to create extraordinary diversity:
| Source | What Varies | Effect on Trees | Controlled By |
|---|---|---|---|
| Bootstrap Sampling | Which samples train each tree | Different trees overfit to different subsets | bootstrap=True/False |
| Per-Split Feature Sampling | Which features are candidates at each split | Different split decisions at each node | max_features |
| Interaction | Both simultaneously | Each tree is unique in both data and structure | Combination of above |
Quantifying the Diversity:
Consider building a tree of depth $d$ on data with $p$ features: at every internal node, the candidate set is one of $\binom{p}{m}$ possible feature subsets, drawn independently of every other node and every other tree.

For $p = 100$, $m = 10$, and $d = 20$: there are $\binom{100}{10} \approx 1.7 \times 10^{13}$ possible candidate sets at each node, so the probability that two trees draw identical candidate sets along even a single root-to-leaf path is vanishingly small—and that is before bootstrap sampling is taken into account.
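As a back-of-the-envelope check of these numbers (using Python's math.comb):

```python
from math import comb, log10

p, m, d = 100, 10, 20

per_node = comb(p, m)  # possible candidate sets at a single node
print(f"C({p},{m}) = {per_node:,} candidate sets per split")

# Order of magnitude of the chance that two trees draw identical
# candidate sets along one root-to-leaf path of depth d
print(f"P(identical candidate sets along a path) ~ 10^-{d * log10(per_node):.0f}")
```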
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from scipy.stats import pearsonr


def measure_prediction_diversity(X, y, n_estimators=100):
    """
    Measure how diverse predictions are across trees.
    Lower correlation = higher diversity = better ensemble.
    """
    # Fit Random Forest
    rf = RandomForestClassifier(
        n_estimators=n_estimators,
        max_features='sqrt',
        bootstrap=True,
        random_state=42
    )
    rf.fit(X, y)

    # Get individual tree probability predictions
    tree_predictions = np.array([
        tree.predict_proba(X)[:, 1] for tree in rf.estimators_
    ])  # Shape: (n_estimators, n_samples)

    # Calculate pairwise correlations between tree predictions
    correlations = []
    for i in range(n_estimators):
        for j in range(i + 1, n_estimators):
            r, _ = pearsonr(tree_predictions[i], tree_predictions[j])
            correlations.append(r)

    avg_correlation = np.mean(correlations)
    print(f"Average pairwise tree correlation: {avg_correlation:.4f}")
    print(f"Correlation std: {np.std(correlations):.4f}")
    print(f"Min correlation: {np.min(correlations):.4f}")
    print(f"Max correlation: {np.max(correlations):.4f}")

    # Compare with bagged trees (no feature randomization)
    rf_bagged = RandomForestClassifier(
        n_estimators=n_estimators,
        max_features=None,  # All features = bagging only
        bootstrap=True,
        random_state=42
    )
    rf_bagged.fit(X, y)

    tree_predictions_bagged = np.array([
        tree.predict_proba(X)[:, 1] for tree in rf_bagged.estimators_
    ])

    correlations_bagged = []
    for i in range(n_estimators):
        for j in range(i + 1, n_estimators):
            r, _ = pearsonr(tree_predictions_bagged[i], tree_predictions_bagged[j])
            correlations_bagged.append(r)

    print(f"\nBagged trees (no feature randomization):")
    print(f"Average pairwise tree correlation: {np.mean(correlations_bagged):.4f}")

    print(f"\nDiversity improvement from feature randomization:")
    print(f"Correlation reduction: {np.mean(correlations_bagged) - avg_correlation:.4f}")


# Generate example data
X, y = make_classification(
    n_samples=1000, n_features=50, n_informative=10,
    n_redundant=10, random_state=42
)

measure_prediction_diversity(X, y)
```

The key insight: bootstrap sampling and feature randomization are independent randomizations that multiply together. A tree that happens to see similar samples to another tree will still develop differently due to feature randomization. A tree with similar feature selections will still differ due to seeing different samples. This double insurance ensures genuine diversity.
Understanding the extreme values of max_features illuminates its effect:
Case 1: max_features = 1 (Maximum Randomization)
At each split, randomly select a single feature and find the best threshold for that feature.
Case 2: max_features = n_features or None (No Randomization)
Every split considers all features—equivalent to bagged decision trees.
Interpolating Between Extremes:
| max_features | Individual Tree Quality | Ensemble Diversity | Best Use Case |
|---|---|---|---|
| 1 | Very weak | Maximum | Rarely useful (too random) |
| log2(p) | Weak | Very high | Very high-dimensional data |
| sqrt(p) | Moderate | High | Classification (default) |
| p/3 | Good | Moderate | Regression (default) |
| p/2 | Strong | Low-Moderate | Low-dimensional, high-signal data |
| p (all) | Optimal per split | Low (bootstrap only) | Bagging; feature exploration not needed |
Special Case: Extra-Trees (Extremely Randomized Trees)
An even more aggressive approach than Random Forests: Extra-Trees also samples a random subset of candidate features at each split, but instead of searching for the best threshold on each candidate, it draws a threshold at random for each one and keeps the best of those random cuts. By default, each tree is also trained on the full dataset rather than a bootstrap sample.

This adds a third source of randomization: the split thresholds themselves.
Extra-Trees trade individual tree quality for even more diversity, often achieving similar accuracy with less computational cost (no threshold search).
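A small side-by-side sketch in scikit-learn (the dataset and settings are illustrative):

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=50, n_informative=15,
                           random_state=0)

models = [
    ("Random Forest", RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                             random_state=0)),
    ("Extra-Trees", ExtraTreesClassifier(n_estimators=200, max_features="sqrt",
                                         random_state=0)),
]

for name, model in models:
    start = time.perf_counter()
    scores = cross_val_score(model, X, y, cv=5)
    elapsed = time.perf_counter() - start
    print(f"{name:14s} accuracy: {scores.mean():.4f} +/- {scores.std():.4f} "
          f"(CV time: {elapsed:.1f}s)")
```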
Consider larger max_features when: features are carefully engineered, feature count is low (<20), or domain knowledge suggests most features are relevant. Consider smaller max_features when: dealing with very high dimensionality (p > 1000), many features are noisy, or you observe high tree correlation despite bootstrap sampling.
Correctly implementing feature randomization requires attention to several details:
1. Sampling Without vs With Replacement:
Within each split, the $m$ features should be sampled without replacement. Sampling with replacement could duplicate features, effectively reducing the candidate set.
```python
# Correct: without replacement
features = np.random.choice(n_features, size=m, replace=False)

# Incorrect: with replacement (could select the same feature multiple times)
features = np.random.choice(n_features, size=m, replace=True)
```
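A quick simulation (with illustrative values $p = 100$, $m = 10$) shows how much the candidate set shrinks when sampling with replacement:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, m, trials = 100, 10, 10_000

# Count DISTINCT candidates per split when sampling with replacement
unique_counts = [
    len(np.unique(rng.choice(n_features, size=m, replace=True)))
    for _ in range(trials)
]
print(f"Requested candidates per split:            {m}")
print(f"Average distinct candidates (replacement): {np.mean(unique_counts):.2f}")
# Sampling without replacement always yields exactly m distinct candidates.
```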
2. Independent Sampling at Each Split:
The feature subset must be freshly sampled at each split decision. A common bug is reusing the same subset for an entire tree level or caching it:
```python
# Correct: sample new features at each node
def build_node(X, y, max_features):
    candidates = sample_features(max_features)  # Fresh sample for this node
    best_split = find_best_split(X, y, candidates)
    # ... partition X, y into (left_X, left_y) and (right_X, right_y) using best_split ...
    left_node = build_node(left_X, left_y, max_features)    # Gets its own sample
    right_node = build_node(right_X, right_y, max_features)  # Gets its own sample
```
3. Handling max_features > n_features:
When max_features exceeds the number of features, implementations should gracefully default to using all features:
effective_max_features = min(max_features, n_features)
4. Consistent Random State Management:
For reproducibility, each tree needs its own random state that is derived from but different from the forest's random state:
```python
def build_forest(X, y, n_estimators, random_state):
    rng = np.random.RandomState(random_state)
    trees = []
    for i in range(n_estimators):
        # Each tree gets a unique derived seed
        tree_seed = rng.randint(np.iinfo(np.int32).max)
        tree = build_tree(X, y, random_state=tree_seed)
        trees.append(tree)
    return trees
```
```python
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Inspect how sklearn handles feature randomization
X = np.random.randn(100, 20)  # 20 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)

rf = RandomForestClassifier(
    n_estimators=10,
    max_features='sqrt',  # Will use 4 features per split (sqrt(20) ≈ 4)
    random_state=42
)
rf.fit(X, y)

print("Random Forest Feature Randomization Verification")
print("=" * 50)
print(f"Total features: {rf.n_features_in_}")
print(f"max_features setting: {rf.max_features}")
# Each fitted tree stores the actual computed value as max_features_
print(f"Effective max_features: {rf.estimators_[0].max_features_}")

# Check which features each tree uses at its root
print(f"\nRoot split feature per tree:")
for i, tree in enumerate(rf.estimators_):
    root_feature = tree.tree_.feature[0]
    root_threshold = tree.tree_.threshold[0]
    print(f"  Tree {i}: Feature {root_feature}, Threshold {root_threshold:.3f}")

# Verify different trees use different features at the root
root_features = [tree.tree_.feature[0] for tree in rf.estimators_]
print(f"\nUnique root features: {len(set(root_features))} out of {len(root_features)} trees")
print(f"Root feature distribution: {dict(zip(*np.unique(root_features, return_counts=True)))}")
```

Watch for: (1) Sampling features with replacement within a split, (2) Reusing the same feature subset across multiple splits, (3) Not properly propagating random state to ensure reproducibility, (4) Forgetting to constrain max_features by the actual number of features. These bugs can silently reduce diversity and degrade performance.
We've thoroughly explored feature randomization—the mechanism that defines Random Forests and distinguishes them from bagged decision trees.
What's Next:
Now that we understand how feature randomization creates diversity, the next page examines tree correlation reduction more deeply—how we measure correlation, why it matters for ensemble variance, and techniques to further reduce it.
You now have a deep understanding of feature randomization—the core innovation that makes Random Forests so powerful. The per-split sampling mechanism, controlled by max_features, creates an ensemble where each tree provides a unique perspective on the data. Next, we'll analyze tree correlation and its impact on ensemble variance.