In the previous page, we explored the random subspace method—training ensemble members on different feature subsets. Random Forests take this idea and amplify it: instead of selecting features once per tree, they re-randomize feature selection at every single split decision.
This seemingly small modification has profound consequences. It's what transforms a collection of bagged decision trees into the remarkably powerful Random Forest algorithm that has become one of the most successful and widely-used machine learning methods in both research and industry.
In this page, we'll dissect exactly how per-split feature randomization works, understand why it's so effective, and explore the critical max_features (or mtry) hyperparameter that controls this randomization.
By the end of this page, you will master the per-split feature sampling mechanism, understand the mathematical motivation for mtry = sqrt(p), analyze how feature randomization affects the bias-variance tradeoff, and be able to tune max_features for different problem types.
The defining characteristic of Random Forests is randomizing feature selection at each internal node, not just once per tree. Let's trace through exactly what happens when building a Random Forest tree:
Standard Decision Tree (No Randomization): at every node, evaluate all $p$ features and choose the single best split among them.

Random Forest Tree (Per-Split Randomization): at every node, draw a fresh random subset of max_features features, then choose the best split among only those candidates.
The key difference: every split decision is constrained to a random subset of features, and that subset changes at every node.
```python
import numpy as np


class RandomForestTreeNode:
    """
    A single node in a Random Forest decision tree.
    Demonstrates per-split feature randomization.
    """

    def __init__(self, X, y, max_features, min_samples_leaf=1, depth=0, max_depth=None):
        self.n_samples, self.n_features = X.shape
        self.depth = depth
        self.min_samples_leaf = min_samples_leaf
        self.left = None
        self.right = None
        self.feature = None
        self.threshold = None
        self.prediction = self._compute_prediction(y)

        # Check stopping conditions
        if self._should_stop(y, max_depth):
            self.is_leaf = True
            return

        self.is_leaf = False

        # KEY RANDOM FOREST MECHANISM:
        # Sample max_features random features for THIS split.
        # This sampling happens independently at EVERY node.
        m = self._get_max_features(max_features)
        feature_candidates = np.random.choice(
            self.n_features, size=min(m, self.n_features), replace=False
        )

        # Find the best split among only the sampled features
        best_feature, best_threshold, best_gain = self._find_best_split(
            X, y, feature_candidates
        )

        if best_gain <= 0:
            self.is_leaf = True
            return

        self.feature = best_feature
        self.threshold = best_threshold

        # Split data and recurse
        left_mask = X[:, best_feature] <= best_threshold
        right_mask = ~left_mask

        # Note: child nodes will sample their OWN random feature sets
        self.left = RandomForestTreeNode(
            X[left_mask], y[left_mask], max_features,
            min_samples_leaf, depth + 1, max_depth
        )
        self.right = RandomForestTreeNode(
            X[right_mask], y[right_mask], max_features,
            min_samples_leaf, depth + 1, max_depth
        )

    def _get_max_features(self, max_features):
        """
        Interpret the max_features parameter. Supported values:
        - "sqrt" or "auto": sqrt(n_features) — sklearn default for classification
        - "log2": log2(n_features)
        - float in (0, 1]: fraction of features
        - int: exact number of features
        """
        if max_features in ("sqrt", "auto"):
            return max(1, int(np.sqrt(self.n_features)))
        elif max_features == "log2":
            return max(1, int(np.log2(self.n_features)))
        elif isinstance(max_features, float):
            return max(1, int(max_features * self.n_features))
        elif isinstance(max_features, int):
            return min(max_features, self.n_features)
        else:
            return self.n_features  # Consider all features

    def _find_best_split(self, X, y, feature_candidates):
        """
        Find the best split, but ONLY considering the randomly
        selected feature candidates.
        """
        best_gain = -np.inf
        best_feature = None
        best_threshold = None

        # ONLY iterate through candidate features (not all features)
        for feature_idx in feature_candidates:
            values = X[:, feature_idx]
            for threshold in np.unique(values):
                gain = self._compute_gain(y, values <= threshold)
                if gain > best_gain:
                    best_gain = gain
                    best_feature = feature_idx
                    best_threshold = threshold

        return best_feature, best_threshold, best_gain

    # --- Minimal helpers: majority-class prediction, stopping rule, Gini gain ---

    def _compute_prediction(self, y):
        values, counts = np.unique(y, return_counts=True)
        return values[np.argmax(counts)]

    def _should_stop(self, y, max_depth):
        return (
            len(np.unique(y)) == 1  # node is pure
            or (max_depth is not None and self.depth >= max_depth)
            or self.n_samples < 2 * self.min_samples_leaf  # too few samples to split
        )

    def _compute_gain(self, y, left_mask):
        def gini(labels):
            _, counts = np.unique(labels, return_counts=True)
            p = counts / counts.sum()
            return 1.0 - np.sum(p ** 2)

        n = len(y)
        n_left = int(left_mask.sum())
        n_right = n - n_left
        if n_left == 0 or n_right == 0:
            return -np.inf  # degenerate split
        return gini(y) - (n_left / n) * gini(y[left_mask]) - (n_right / n) * gini(y[~left_mask])
```

If features were sampled once per tree (classic RSM), trees would still have similar upper structure—just with a restricted feature set. Per-split sampling ensures that even within a single tree, different paths use different features. This creates massive internal diversity: the same region of the input space can end up being partitioned by entirely different features in different trees.
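To make that contrast concrete, here is a small comparison sketch using scikit-learn (the dataset and parameter values are illustrative): BaggingClassifier with an integer max_features draws one feature subset per estimator—essentially the classic random subspace setup—while RandomForestClassifier re-draws candidates at every split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=50, n_informative=10,
                           random_state=0)
m = int(np.sqrt(X.shape[1]))

# Classic random subspace: each tree sees ONE fixed feature subset
rsm = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # older scikit-learn versions: base_estimator=
    n_estimators=100,
    max_features=m,        # subset drawn once per tree
    bootstrap=True,
    random_state=0,
)

# Random Forest: a fresh candidate subset is drawn at EVERY split
rf = RandomForestClassifier(
    n_estimators=100, max_features="sqrt", bootstrap=True, random_state=0
)

for name, model in [("Per-tree subspace (RSM)", rsm), ("Per-split sampling (RF)", rf)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:26s} accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")
```

Exact numbers depend on the data, but the per-split version typically achieves lower tree correlation for comparable or better accuracy.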
The max_features parameter (also called mtry in the R community) is arguably the most important hyperparameter in Random Forests. It controls the size of the random feature subset considered at each split.
Possible Values:
| Value | Features at Each Split | When to Use |
|---|---|---|
"sqrt" | $\sqrt{p}$ | Default for classification |
"log2" | $\log_2(p)$ | Very high dimensional data |
1.0 or None | All $p$ features | Reduces to bagged trees |
0.5 | $p/2$ features | Moderate randomization |
"auto" | Library-dependent (usually sqrt for classification, all for regression) | Following defaults |
| Integer $m$ | Exactly $m$ features | Explicit control |
| Float in $(0,1)$ | That fraction of $p$ | Proportional control |
The Recommended Defaults:
Extensive empirical studies (Breiman 2001, and subsequent research) have established:

- Classification: max_features = $\sqrt{p}$ (the "sqrt" setting)
- Regression: max_features = $p/3$
These are remarkably robust defaults that work well across a wide variety of datasets. However, they are not universally optimal—the best value depends on the signal-to-noise ratio in your features.
```python
import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import cross_val_score


def compare_max_features(X, y, task='classification'):
    """
    Compare Random Forest performance across different max_features values.
    """
    n_features = X.shape[1]

    # Test different max_features values
    max_features_options = [
        ('1 (single feature)', 1),
        ('sqrt(p)', 'sqrt'),
        ('log2(p)', 'log2'),
        ('p/3', max(1, n_features // 3)),
        ('p/2', max(1, n_features // 2)),
        ('All features (bagging)', None),
    ]

    results = {}
    for name, max_features in max_features_options:
        if task == 'classification':
            rf = RandomForestClassifier(
                n_estimators=100, max_features=max_features, random_state=42
            )
        else:
            rf = RandomForestRegressor(
                n_estimators=100, max_features=max_features, random_state=42
            )

        scores = cross_val_score(rf, X, y, cv=5)
        results[name] = {
            'mean': scores.mean(),
            'std': scores.std(),
            'effective_features': _get_effective_features(max_features, n_features),
        }
        # Default scoring: accuracy for classification, R^2 for regression
        print(f"{name:25} | Score: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")

    return results


def _get_effective_features(max_features, n_features):
    """Calculate the actual number of features considered per split."""
    if max_features is None:
        return n_features
    elif max_features == 'sqrt':
        return int(np.sqrt(n_features))
    elif max_features == 'log2':
        return int(np.log2(n_features))
    elif isinstance(max_features, float):
        return int(max_features * n_features)
    else:
        return max_features


# Example with synthetic data
print("=" * 60)
print("max_features Comparison: Classification (100 features)")
print("=" * 60)
X_clf, y_clf = make_classification(
    n_samples=1000, n_features=100, n_informative=20,
    n_redundant=10, random_state=42
)
compare_max_features(X_clf, y_clf, 'classification')

print("\n" + "=" * 60)
print("max_features Comparison: Regression (100 features)")
print("=" * 60)
X_reg, y_reg = make_regression(
    n_samples=1000, n_features=100, n_informative=20,
    noise=10, random_state=42
)
compare_max_features(X_reg, y_reg, 'regression')
```

Start with the defaults ($\sqrt{p}$ for classification, $p/3$ for regression). If performance is unsatisfactory, try a grid search over $[\sqrt{p},\ 2\sqrt{p},\ p/3,\ p/2,\ 0.8p]$, as sketched below. For high-dimensional, noisy data, using fewer features often helps; for carefully engineered features with a high signal-to-noise ratio, more features may be better.
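A minimal grid-search sketch of that advice, assuming scikit-learn and an illustrative synthetic dataset (adjust n_estimators and the candidate grid to your problem):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=100, n_informative=20,
                           random_state=42)
p = X.shape[1]

# Candidate values built around the defaults discussed above
candidates = sorted({
    max(1, int(np.sqrt(p))),       # sqrt(p)
    max(1, int(2 * np.sqrt(p))),   # 2*sqrt(p)
    max(1, p // 3),                # p/3
    max(1, p // 2),                # p/2
    max(1, int(0.8 * p)),          # 0.8*p
})

search = GridSearchCV(
    RandomForestClassifier(n_estimators=200, random_state=42),
    param_grid={"max_features": candidates},
    cv=5,
    n_jobs=-1,
)
search.fit(X, y)
print(f"Best max_features: {search.best_params_['max_features']}")
print(f"Best CV score:     {search.best_score_:.4f}")
```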
Why does $m_{try} = \sqrt{p}$ work so well for classification? Let's develop the mathematical intuition.
The Strength-Correlation Tradeoff:
The key insight from Breiman's analysis is that ensemble accuracy depends on two factors: the strength of the individual trees and the correlation between them. For an ensemble of $T$ trees, the variance of the averaged prediction (the part of the generalization error that averaging can reduce) decomposes approximately as:

$$\text{Var}(\text{ensemble}) \approx \bar{\rho}\, s^2 + \frac{1 - \bar{\rho}}{T}\, s^2$$

Where:

- $s^2$ is the variance of a single tree's prediction
- $\bar{\rho}$ is the average pairwise correlation between trees' predictions
- $T$ is the number of trees

As $T$ grows, the second term vanishes, so $\bar{\rho}\, s^2$ sets the floor on ensemble variance—adding more trees cannot push past it. As $m_{try}$ decreases:

- $\bar{\rho}$ decreases: trees see different candidate features, make different splits, and their errors become less correlated (lowering the variance floor)
- Individual tree strength also decreases: each split is chosen from a weaker candidate pool (raising bias)

The optimal $m_{try}$ balances these two opposing effects.
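A quick numeric check of this decomposition, with illustrative values of $\bar{\rho}$, $s^2$, and $T$, shows why lowering correlation matters more than adding trees once $T$ is large:

```python
def ensemble_variance(rho_bar, s2, T):
    """Variance of an average of T predictors, each with variance s2
    and average pairwise correlation rho_bar."""
    return rho_bar * s2 + (1 - rho_bar) * s2 / T

s2 = 1.0
for T in (10, 100, 1000):
    for rho_bar in (0.7, 0.3):
        v = ensemble_variance(rho_bar, s2, T)
        print(f"T={T:5d}  rho_bar={rho_bar:.1f}  ensemble variance={v:.3f}")
# As T grows, the variance approaches rho_bar * s2: the correlation sets the floor.
```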
Probability of Including an Important Feature:
Suppose there are $k$ important features out of $p$ total. At a given split, the probability that at least one important feature is included in the random sample of $m$ features is:
$$P(\text{at least one important}) = 1 - \frac{\binom{p-k}{m}}{\binom{p}{m}} = 1 - \prod_{i=0}^{m-1}\frac{p-k-i}{p-i}$$
For $p = 100$, $k = 10$ important features, and $m = 10$ (i.e., $\sqrt{100}$):
$$P(\text{at least one important}) = 1 - \frac{\binom{90}{10}}{\binom{100}{10}} \approx 0.67$$

With $m = \sqrt{p}$, there's a ~67% chance of having at least one important feature among the candidates at each split. This is enough to build reasonably good trees while maintaining high diversity.
| m (features at split) | P(≥1 important feature) | Diversity Level |
|---|---|---|
| 2 | 0.191 | Very High (but weak splits) |
| 5 | 0.416 | High |
| 10 (√p) | 0.670 | Moderate-High (recommended) |
| 20 | 0.905 | Moderate |
| 33 (p/3) | 0.986 | Low-Moderate |
| 50 (p/2) | 0.999 | Low |
| 100 (all) | 1.000 | None (bagging only) |
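The table values can be reproduced in a few lines (assuming $p = 100$ total features and $k = 10$ important ones, as above):

```python
from math import comb

p, k = 100, 10  # total features, important features

def prob_at_least_one_important(m, p=p, k=k):
    """P(at least one of the k important features is among the m candidates)."""
    return 1 - comb(p - k, m) / comb(p, m)

for m in (2, 5, 10, 20, 33, 50, 100):
    print(f"m = {m:3d}  ->  P(>=1 important) = {prob_at_least_one_important(m):.3f}")
```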
Why $\sqrt{p}$ is the Sweet Spot:
The $\sqrt{p}$ choice emerges from balancing competing concerns:
Diversity requirement: With $m = \sqrt{p}$, any given feature has only an $m/p = 1/\sqrt{p}$ chance of even being a candidate at a particular split. For $p = 100$, that's just 10%, so different trees rarely make the same split decision at the same position.

Strength requirement: Even with small $m$, if $T$ is large enough, every important feature is a split candidate at any given node position in roughly $(m/p) \times T$ trees. With $T = 500$ trees and $m/p = 0.1$, each important feature is a candidate in ~50 trees at any given split depth—plenty of opportunities for it to be used somewhere in the forest.
Computational efficiency: Smaller $m$ means fewer feature evaluations per split: $O(m \cdot n)$ instead of $O(p \cdot n)$. For $p = 10000$, using $m = 100$ is 100× faster than using all features.
Individual trees don't need to be perfect—they need to be 'good enough' while being different from each other. The √p rule ensures each tree has a reasonable chance of using important features while being structurally dissimilar from other trees. The ensemble aggregates these diverse 'good enough' trees into excellent collective decisions.
Feature randomization fundamentally changes how Random Forest trees develop compared to standard decision trees. Let's examine these structural effects:
1. Deeper Feature Penetration:
In a standard decision tree, the best feature at the root often "shadows" related features—if income is the best split, years_employed (which correlates with income) may never appear high in the tree. With feature randomization:
- At some nodes, income is excluded from the candidate set
- years_employed then gets a chance to split at upper levels

2. Suboptimal Splits at Individual Nodes:
Each Random Forest split is locally suboptimal—we're not choosing the best split overall, just the best among $m$ random candidates. However, this is a price worth paying: averaging many slightly suboptimal trees typically beats a single greedy tree, and the local suboptimality is precisely what decorrelates the trees' errors.
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from collections import Counter


def analyze_feature_usage(X, y, n_trees=100):
    """
    Compare which features appear at the root across:
    1. Standard bagged decision trees (all features)
    2. Random Forest trees (sqrt features)
    """
    n_features = X.shape[1]

    # Bagged trees (no feature restriction)
    bagged_roots = []
    for i in range(n_trees):
        # Bootstrap sample
        indices = np.random.choice(len(X), size=len(X), replace=True)
        X_boot, y_boot = X[indices], y[indices]
        tree = DecisionTreeClassifier(random_state=i)
        tree.fit(X_boot, y_boot)
        bagged_roots.append(tree.tree_.feature[0])

    # Random Forest trees (sqrt feature restriction)
    rf = RandomForestClassifier(
        n_estimators=n_trees, max_features='sqrt', random_state=42
    )
    rf.fit(X, y)
    rf_roots = [tree.tree_.feature[0] for tree in rf.estimators_]

    # Analyze diversity
    print("Feature at Root Node Analysis")
    print("=" * 50)

    print(f"\nBagged Trees (all features considered):")
    bagged_counts = Counter(bagged_roots)
    print(f"  Unique features used at root: {len(bagged_counts)}")
    print(f"  Most common root feature: {bagged_counts.most_common(1)[0]}")
    print(f"  Top 3 root features: {bagged_counts.most_common(3)}")

    print(f"\nRandom Forest Trees (sqrt(p) features):")
    rf_counts = Counter(rf_roots)
    print(f"  Unique features used at root: {len(rf_counts)}")
    print(f"  Most common root feature: {rf_counts.most_common(1)[0]}")
    print(f"  Top 3 root features: {rf_counts.most_common(3)}")

    # Entropy of feature usage (measure of diversity)
    def feature_entropy(feature_list, n_features):
        counts = np.zeros(n_features)
        for f in feature_list:
            counts[f] += 1
        probs = counts / len(feature_list)
        probs = probs[probs > 0]
        return -np.sum(probs * np.log2(probs))

    print(f"\nDiversity Metrics:")
    print(f"  Bagged entropy: {feature_entropy(bagged_roots, n_features):.3f} bits")
    print(f"  RF entropy:     {feature_entropy(rf_roots, n_features):.3f} bits")
    print(f"  Max entropy:    {np.log2(n_features):.3f} bits (uniform distribution)")


# Example: 50 features with 5 dominant ones
X, y = make_classification(
    n_samples=1000, n_features=50, n_informative=5,
    n_redundant=5, n_clusters_per_class=2, random_state=42
)

analyze_feature_usage(X, y)
```

3. Smoother Decision Boundaries:
The ensemble of randomized trees creates smoother decision boundaries than individual trees or bagged trees:
| Method | Decision Boundary Character |
|---|---|
| Single Decision Tree | Jagged, axis-aligned, prone to overfitting |
| Bagged Trees | Less jagged (averaged), but still similar shapes |
| Random Forest | Smooth, can approximate curved boundaries |
The smoothing happens because different trees place their axis-aligned cuts on different features and at different thresholds; averaging many offset, staircase-shaped boundaries approximates a smooth curve.
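To see the smoothing visually, the sketch below (assuming matplotlib is available; the dataset and styling are illustrative) plots the predicted probability surface of a single tree next to that of a Random Forest on a toy two-moons dataset:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)

models = [
    ("Single Decision Tree", DecisionTreeClassifier(random_state=0)),
    ("Random Forest (max_features='sqrt')",
     RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)),
]

# Evaluate P(class 1) on a dense grid to visualize the boundary character
xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 300),
    np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 300),
)
grid = np.c_[xx.ravel(), yy.ravel()]

fig, axes = plt.subplots(1, 2, figsize=(11, 4))
for ax, (name, model) in zip(axes, models):
    model.fit(X, y)
    proba = model.predict_proba(grid)[:, 1].reshape(xx.shape)
    ax.contourf(xx, yy, proba, levels=20, cmap="RdBu_r", alpha=0.8)
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap="RdBu_r", edgecolor="k", s=15)
    ax.set_title(name)
plt.tight_layout()
plt.show()
```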
A common mistake is trying to understand Random Forests by examining individual trees. Individual RF trees are intentionally 'weird'—they use restricted feature sets and make locally suboptimal splits. The power emerges from aggregation, not from individual tree quality. Judge a Random Forest by its ensemble performance, not by any single tree.
Random Forests combine two orthogonal sources of randomization: bootstrap sampling of the training rows (bagging) and per-split sampling of the candidate features. These interact multiplicatively to create extraordinary diversity:
| Source | What Varies | Effect on Trees | Controlled By |
|---|---|---|---|
| Bootstrap Sampling | Which samples train each tree | Different trees overfit to different subsets | bootstrap=True/False |
| Per-Split Feature Sampling | Which features are candidates at each split | Different split decisions at each node | max_features |
| Interaction | Both simultaneously | Each tree is unique in both data and structure | Combination of above |
Quantifying the Diversity:
Consider building a tree of depth $d$ on data with $p$ features: at every internal node, the candidate set is one of $\binom{p}{m}$ possible feature subsets, drawn independently of every other node and every other tree.

For $p = 100$, $m = 10$, and $d = 20$: there are $\binom{100}{10} \approx 1.7 \times 10^{13}$ possible candidate sets at each node, so the probability that two trees draw identical candidate sets along even a single root-to-leaf path is vanishingly small—and that is before bootstrap sampling is taken into account.
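As a back-of-the-envelope check of these numbers (using Python's math.comb):

```python
from math import comb, log10

p, m, d = 100, 10, 20

per_node = comb(p, m)  # possible candidate sets at a single node
print(f"C({p},{m}) = {per_node:,} candidate sets per split")

# Order of magnitude of the chance that two trees draw identical
# candidate sets along one root-to-leaf path of depth d
print(f"P(identical candidate sets along a path) ~ 10^-{d * log10(per_node):.0f}")
```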
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from scipy.stats import pearsonr


def measure_prediction_diversity(X, y, n_estimators=100):
    """
    Measure how diverse predictions are across trees.
    Lower correlation = higher diversity = better ensemble.
    """
    # Fit Random Forest
    rf = RandomForestClassifier(
        n_estimators=n_estimators,
        max_features='sqrt',
        bootstrap=True,
        random_state=42
    )
    rf.fit(X, y)

    # Get individual tree probability predictions
    tree_predictions = np.array([
        tree.predict_proba(X)[:, 1] for tree in rf.estimators_
    ])  # Shape: (n_estimators, n_samples)

    # Calculate pairwise correlations between tree predictions
    correlations = []
    for i in range(n_estimators):
        for j in range(i + 1, n_estimators):
            r, _ = pearsonr(tree_predictions[i], tree_predictions[j])
            correlations.append(r)

    avg_correlation = np.mean(correlations)
    print(f"Average pairwise tree correlation: {avg_correlation:.4f}")
    print(f"Correlation std: {np.std(correlations):.4f}")
    print(f"Min correlation: {np.min(correlations):.4f}")
    print(f"Max correlation: {np.max(correlations):.4f}")

    # Compare with bagged trees (no feature randomization)
    rf_bagged = RandomForestClassifier(
        n_estimators=n_estimators,
        max_features=None,  # All features = bagging only
        bootstrap=True,
        random_state=42
    )
    rf_bagged.fit(X, y)

    tree_predictions_bagged = np.array([
        tree.predict_proba(X)[:, 1] for tree in rf_bagged.estimators_
    ])

    correlations_bagged = []
    for i in range(n_estimators):
        for j in range(i + 1, n_estimators):
            r, _ = pearsonr(tree_predictions_bagged[i], tree_predictions_bagged[j])
            correlations_bagged.append(r)

    print(f"\nBagged trees (no feature randomization):")
    print(f"Average pairwise tree correlation: {np.mean(correlations_bagged):.4f}")

    print(f"\nDiversity improvement from feature randomization:")
    print(f"Correlation reduction: {np.mean(correlations_bagged) - avg_correlation:.4f}")


# Generate example data
X, y = make_classification(
    n_samples=1000, n_features=50, n_informative=10,
    n_redundant=10, random_state=42
)

measure_prediction_diversity(X, y)
```

The key insight: bootstrap sampling and feature randomization are independent randomizations that multiply together. A tree that happens to see similar samples to another tree will still develop differently due to feature randomization. A tree with similar feature selections will still differ due to seeing different samples. This double insurance ensures genuine diversity.
Understanding the extreme values of max_features illuminates its effect:
Case 1: max_features = 1 (Maximum Randomization)
At each split, randomly select a single feature and find the best threshold for that feature.
Case 2: max_features = n_features or None (No Randomization)
Every split considers all features—equivalent to bagged decision trees.
Interpolating Between Extremes:
| max_features | Individual Tree Quality | Ensemble Diversity | Best Use Case |
|---|---|---|---|
| 1 | Very weak | Maximum | Rarely useful (too random) |
| log2(p) | Weak | Very high | Very high-dimensional data |
| sqrt(p) | Moderate | High | Classification (default) |
| p/3 | Good | Moderate | Regression (default) |
| p/2 | Strong | Low-Moderate | Low-dimensional, high-signal data |
| p (all) | Optimal per split | Low (bootstrap only) | Bagging; feature exploration not needed |
Special Case: Extra-Trees (Extremely Randomized Trees)
An even more aggressive approach than Random Forests: Extra-Trees also samples a random subset of candidate features at each split, but instead of searching for the best threshold on each candidate, it draws a threshold at random for each one and keeps the best of those random cuts. By default, each tree is also trained on the full dataset rather than a bootstrap sample.

This adds a third source of randomization: the split thresholds themselves.
Extra-Trees trade individual tree quality for even more diversity, often achieving similar accuracy with less computational cost (no threshold search).
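A small side-by-side sketch in scikit-learn (the dataset and settings are illustrative):

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=50, n_informative=15,
                           random_state=0)

models = [
    ("Random Forest", RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                             random_state=0)),
    ("Extra-Trees", ExtraTreesClassifier(n_estimators=200, max_features="sqrt",
                                         random_state=0)),
]

for name, model in models:
    start = time.perf_counter()
    scores = cross_val_score(model, X, y, cv=5)
    elapsed = time.perf_counter() - start
    print(f"{name:14s} accuracy: {scores.mean():.4f} +/- {scores.std():.4f} "
          f"(CV time: {elapsed:.1f}s)")
```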
Consider larger max_features when: features are carefully engineered, feature count is low (<20), or domain knowledge suggests most features are relevant. Consider smaller max_features when: dealing with very high dimensionality (p > 1000), many features are noisy, or you observe high tree correlation despite bootstrap sampling.
Correctly implementing feature randomization requires attention to several details:
1. Sampling Without vs With Replacement:
Within each split, the $m$ features should be sampled without replacement. Sampling with replacement could duplicate features, effectively reducing the candidate set.
```python
# Correct: without replacement
features = np.random.choice(n_features, size=m, replace=False)

# Incorrect: with replacement (could select the same feature multiple times)
features = np.random.choice(n_features, size=m, replace=True)
```
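A quick simulation (with illustrative values $p = 100$, $m = 10$) shows how much the candidate set shrinks when sampling with replacement:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, m, trials = 100, 10, 10_000

# Count DISTINCT candidates per split when sampling with replacement
unique_counts = [
    len(np.unique(rng.choice(n_features, size=m, replace=True)))
    for _ in range(trials)
]
print(f"Requested candidates per split:            {m}")
print(f"Average distinct candidates (replacement): {np.mean(unique_counts):.2f}")
# Sampling without replacement always yields exactly m distinct candidates.
```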
2. Independent Sampling at Each Split:
The feature subset must be freshly sampled at each split decision. A common bug is reusing the same subset for an entire tree level or caching it:
```python
# Correct: sample new features at each node
def build_node(X, y, max_features):
    candidates = sample_features(max_features)  # Fresh sample for this node
    best_split = find_best_split(X, y, candidates)
    # ... partition X, y into (left_X, left_y) and (right_X, right_y) using best_split ...
    left_node = build_node(left_X, left_y, max_features)    # Gets its own sample
    right_node = build_node(right_X, right_y, max_features)  # Gets its own sample
```
3. Handling max_features > n_features:
When max_features exceeds the number of features, implementations should gracefully default to using all features:
effective_max_features = min(max_features, n_features)
4. Consistent Random State Management:
For reproducibility, each tree needs its own random state that is derived from but different from the forest's random state:
```python
def build_forest(X, y, n_estimators, random_state):
    rng = np.random.RandomState(random_state)
    trees = []
    for i in range(n_estimators):
        # Each tree gets a unique derived seed
        tree_seed = rng.randint(np.iinfo(np.int32).max)
        tree = build_tree(X, y, random_state=tree_seed)
        trees.append(tree)
    return trees
```
```python
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Inspect how sklearn handles feature randomization
X = np.random.randn(100, 20)  # 20 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)

rf = RandomForestClassifier(
    n_estimators=10,
    max_features='sqrt',  # Will use 4 features per split (sqrt(20) ≈ 4)
    random_state=42
)
rf.fit(X, y)

print("Random Forest Feature Randomization Verification")
print("=" * 50)
print(f"Total features: {rf.n_features_in_}")
print(f"max_features setting: {rf.max_features}")
# Each fitted tree stores the actual computed value as max_features_
print(f"Effective max_features: {rf.estimators_[0].max_features_}")

# Check which features each tree uses at its root
print(f"\nRoot split feature per tree:")
for i, tree in enumerate(rf.estimators_):
    root_feature = tree.tree_.feature[0]
    root_threshold = tree.tree_.threshold[0]
    print(f"  Tree {i}: Feature {root_feature}, Threshold {root_threshold:.3f}")

# Verify different trees use different features at the root
root_features = [tree.tree_.feature[0] for tree in rf.estimators_]
print(f"\nUnique root features: {len(set(root_features))} out of {len(root_features)} trees")
print(f"Root feature distribution: {dict(zip(*np.unique(root_features, return_counts=True)))}")
```

Watch for: (1) Sampling features with replacement within a split, (2) Reusing the same feature subset across multiple splits, (3) Not properly propagating random state to ensure reproducibility, (4) Forgetting to constrain max_features by the actual number of features. These bugs can silently reduce diversity and degrade performance.
We've thoroughly explored feature randomization—the mechanism that defines Random Forests and distinguishes them from bagged decision trees.
What's Next:
Now that we understand how feature randomization creates diversity, the next page examines tree correlation reduction more deeply—how we measure correlation, why it matters for ensemble variance, and techniques to further reduce it.
You now have a deep understanding of feature randomization—the core innovation that makes Random Forests so powerful. The per-split sampling mechanism, controlled by max_features, creates an ensemble where each tree provides a unique perspective on the data. Next, we'll analyze tree correlation and its impact on ensemble variance.