In the previous module, we explored Bagging—the technique of training multiple models on bootstrap samples and aggregating their predictions to reduce variance. Bagging is powerful, but it has a subtle limitation: when the training data contains strongly predictive features, every tree in the ensemble tends to use those features at the top splits, making the trees structurally similar and their errors correlated.
The Random Subspace Method (RSM), also known as attribute bagging or feature bagging, provides an elegant solution. Instead of giving each learner access to all features, we randomly restrict each learner to a subset of the available features. This simple modification has profound consequences for ensemble diversity and predictive power.
By the end of this page, you will understand the mathematical foundations of the random subspace method, why random feature selection creates more diverse ensembles, how subspace projection affects the bias-variance tradeoff, and the precise relationship between RSM and Random Forests.
To appreciate the random subspace method, we must first understand why pure bagging sometimes falls short. Consider a dataset with 100 features where 3 features are exceptionally predictive. When we build a bagged ensemble of decision trees, nearly every tree selects one of those 3 dominant features for its top splits, so the trees end up structurally similar and their prediction errors are correlated.
This phenomenon is sometimes called tree correlation or feature dominance, and it limits the variance reduction that bagging can achieve.
Recall from ensemble theory that the variance of an ensemble average is $\text{Var}_{\text{ensemble}} = \rho\sigma^2 + \frac{(1-\rho)\sigma^2}{n}$, where $\sigma^2$ is the variance of a single tree, $n$ is the number of trees, and $\rho$ is the average pairwise correlation between trees. When trees are highly correlated ($\rho \to 1$), the ensemble variance barely decreases with more trees. The random subspace method directly attacks this correlation.
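To make the effect of $\rho$ concrete, here is a small numerical sketch of the formula above; the values of $\sigma^2$ and $\rho$ are arbitrary illustrative choices, not results from any particular dataset.

```python
# Illustrative sketch: how tree correlation rho limits ensemble variance reduction.
# sigma_sq and the rho values below are assumed for demonstration only.
def ensemble_variance(rho, sigma_sq, n_trees):
    """Variance of an average of n_trees estimators with pairwise correlation rho."""
    return rho * sigma_sq + (1 - rho) * sigma_sq / n_trees

sigma_sq = 1.0  # variance of a single tree (assumed)
for rho in [0.1, 0.5, 0.9]:
    print(f"rho={rho}: 10 trees -> {ensemble_variance(rho, sigma_sq, 10):.3f}, "
          f"1000 trees -> {ensemble_variance(rho, sigma_sq, 1000):.3f}")
# With rho = 0.9, even 1000 trees leave the ensemble variance near 0.9 * sigma_sq.
```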
A Concrete Example:
Imagine predicting house prices with features including square_footage, location_score, num_bedrooms, and 97 other features. If square_footage and location_score alone explain 80% of variance:
Nearly every bagged tree places square_footage or location_score at the root, so the trees share the same top-level structure and make correlated errors. The random subspace method breaks this pattern by ensuring that some trees cannot use the dominant features, forcing them to find alternative predictive patterns.
The Random Subspace Method (RSM), introduced by Tin Kam Ho in 1998, constructs ensemble members by training each base learner on a randomly selected subset of the original feature space. The algorithm is remarkably simple:
```python
import numpy as np
import scipy.stats
from sklearn.tree import DecisionTreeClassifier


def random_subspace_ensemble(X, y, n_estimators, subspace_size):
    """
    Random Subspace Method for ensemble learning.

    Parameters:
    -----------
    X : array of shape (n_samples, n_features)
        Training data
    y : array of shape (n_samples,)
        Target values
    n_estimators : int
        Number of base learners in the ensemble
    subspace_size : int
        Number of features to select for each learner (m < n_features)

    Returns:
    --------
    ensemble : list of (model, feature_indices) tuples
    """
    n_samples, n_features = X.shape
    ensemble = []

    for i in range(n_estimators):
        # Step 1: Randomly select m features from p total features
        # This is sampling WITHOUT replacement from feature indices
        selected_features = np.random.choice(
            n_features, size=subspace_size, replace=False
        )

        # Step 2: Project training data onto the selected subspace
        X_subspace = X[:, selected_features]

        # Step 3: Train a base learner on the projected data
        # (using full training set, only feature restriction)
        model = DecisionTreeClassifier()
        model.fit(X_subspace, y)

        # Store model along with which features it uses
        ensemble.append((model, selected_features))

    return ensemble


def predict_rsm_ensemble(ensemble, X):
    """
    Make predictions using the RSM ensemble.

    Each model only sees its selected features during prediction.
    Final prediction uses majority voting (classification)
    or averaging (regression).
    """
    predictions = []

    for model, feature_indices in ensemble:
        # Project test data onto this model's subspace
        X_subspace = X[:, feature_indices]
        predictions.append(model.predict(X_subspace))

    # Majority voting for classification
    predictions = np.array(predictions)
    final_predictions = scipy.stats.mode(predictions, axis=0)[0]

    return final_predictions
```

Key Algorithmic Properties:
| Property | Description |
|---|---|
| Subspace Selection | Each learner receives a random subset of m features from p total |
| Selection Without Replacement | Within each learner, the m features are distinct |
| Independence Across Learners | Different learners can have overlapping or disjoint feature sets |
| Original Training Data | Unlike bagging, RSM uses the full training set (no bootstrap) |
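As a quick illustration, the sketch below runs the two functions defined above on a synthetic dataset; the dataset shape, subspace_size, and n_estimators are arbitrary choices for demonstration.

```python
# Minimal usage sketch of random_subspace_ensemble / predict_rsm_ensemble.
# Dataset parameters and ensemble sizes are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=100, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ensemble = random_subspace_ensemble(X_train, y_train,
                                    n_estimators=50, subspace_size=10)
y_pred = predict_rsm_ensemble(ensemble, X_test)
print("RSM ensemble accuracy:", np.mean(np.ravel(y_pred) == y_test))
```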
The classic random subspace method does not use bootstrap sampling—it randomizes features while using all training samples. Random Forests combine both bootstrap sampling (bagging) and feature subspace sampling, getting the benefits of both techniques.
The random subspace method can be understood through the lens of projection onto random subspaces of the original feature space. Let's formalize this mathematically.
Notation: let $\mathbf{X} \in \mathbb{R}^{n \times p}$ be the training data with $n$ samples and $p$ features, let $m < p$ be the subspace dimension, and let $S_t \subseteq \{1, \ldots, p\}$ with $|S_t| = m$ be the feature subset drawn for ensemble member $t$.
The Projection Operation:
For each ensemble member $t$, we define a projection matrix $\mathbf{P}_t \in \mathbb{R}^{m \times p}$ that selects the features in $S_t$. If $S_t = \{j_1, j_2, \ldots, j_m\}$, then:
$$\mathbf{P}_t = \begin{bmatrix} \mathbf{e}_{j_1}^T \\ \mathbf{e}_{j_2}^T \\ \vdots \\ \mathbf{e}_{j_m}^T \end{bmatrix}$$
where $\mathbf{e}_j$ is the standard basis vector with 1 in position $j$ and 0 elsewhere.
The projected training data for learner $t$ is: $$\mathbf{X}_t = \mathbf{X} \mathbf{P}_t^T \in \mathbb{R}^{n \times m}$$
Each projection P_t maps the original p-dimensional feature space onto an m-dimensional coordinate-aligned subspace. Different ensemble members 'see' the data from different m-dimensional perspectives. The ensemble aggregates predictions across these diverse viewpoints, capturing patterns that no single subspace fully represents.
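A small sketch of this equivalence, with arbitrary shapes and a random feature subset: building $\mathbf{P}_t$ from standard basis vectors and multiplying gives exactly the same result as plain column indexing.

```python
# The projection matrix P_t is just row selection: X @ P_t.T == X[:, S_t].
# n, p, m and the random seed are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 5, 8, 3
X = rng.normal(size=(n, p))

S_t = rng.choice(p, size=m, replace=False)   # random feature subset
P_t = np.eye(p)[S_t]                         # m x p matrix whose rows are e_j^T
X_t = X @ P_t.T                              # projected data, shape (n, m)

assert np.allclose(X_t, X[:, S_t])           # identical to column indexing
print("subset:", S_t, "projected shape:", X_t.shape)
```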
Diversity Through Random Projections:
The probability that two ensemble members share the exact same feature set is:
$$P(S_i = S_j) = \frac{1}{\binom{p}{m}}$$
For $p = 100$ and $m = 10$, this probability is approximately $5.8 \times 10^{-14}$, which is essentially zero. Even partial overlap has low probability:
$$P(|S_i \cap S_j| = k) = \frac{\binom{m}{k} \binom{p-m}{m-k}}{\binom{p}{m}}$$
This combinatorial analysis shows that random subspace selection naturally generates highly diverse feature sets, which translates to diverse decision boundaries and decorrelated predictions.
Expected Overlap Between Subspaces:
The expected number of shared features between two randomly selected subspaces is:
$$\mathbb{E}[|S_i \cap S_j|] = \frac{m^2}{p}$$
For example, with $p = 100$ features and $m = 10$ features per learner, the expected overlap is $10^2 / 100 = 1$ feature, even though each learner sees 10 features.
This low expected overlap is the source of prediction diversity. With $m = \sqrt{p}$ (a common choice), the expected overlap is:
$$\mathbb{E}[|S_i \cap S_j|] = \frac{(\sqrt{p})^2}{p} = 1$$
Again, just one feature despite each learner having $\sqrt{p}$ features.
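A quick Monte Carlo check of this expectation; the values of $p$, $m$, and the number of trials are arbitrary.

```python
# Simulated check of E[|S_i ∩ S_j|] = m^2 / p; p, m, and trials are illustrative.
import numpy as np

rng = np.random.default_rng(42)
p, m, trials = 100, 10, 20_000

overlaps = [
    len(np.intersect1d(rng.choice(p, m, replace=False),
                       rng.choice(p, m, replace=False)))
    for _ in range(trials)
]
print("simulated mean overlap:", np.mean(overlaps))  # close to m**2 / p = 1.0
```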
The subspace dimension $m$ is a critical hyperparameter that controls the bias-variance-diversity tradeoff:
| Small $m$ | Large $m$ |
|---|---|
| High diversity (low correlation) | Low diversity (high correlation) |
| Individual learners may be weak | Individual learners are stronger |
| May miss feature interactions | Captures feature interactions |
| Higher bias per learner | Lower bias per learner |
| Lower variance through averaging | Less variance reduction from averaging |
Common Rules of Thumb: use $m = \sqrt{p}$ for classification and $m = p/3$ for regression, then tune around these defaults with cross-validation.
The Bias-Variance-Diversity Tradeoff:
Let's denote by $\text{Bias}_m$ and $\text{Var}_m$ the bias and variance of an individual learner trained on $m$ features, and by $\rho_m$ the average pairwise correlation between learners.
As $m$ increases, individual learners become stronger ($\text{Bias}_m$ decreases) but also more similar to one another ($\rho_m$ increases).
The ensemble variance with $T$ learners is approximately:
$$\text{Var}_{\text{ensemble}} \approx \rho_m \cdot \text{Var}_m + \frac{1-\rho_m}{T} \cdot \text{Var}_m$$
The optimal $m$ balances these competing effects. Smaller $m$ increases individual bias but decreases $\rho_m$, allowing greater variance reduction through averaging. The $\sqrt{p}$ rule empirically balances these factors well for many classification tasks.
Start with m = sqrt(p) for classification and m = p/3 for regression. If your dataset has many noisy or irrelevant features, smaller m can help by reducing the chance that noise features dominate. If features are carefully engineered and all relevant, larger m may work better. Always validate via cross-validation.
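One practical way to run this validation is sketched below. It relies on scikit-learn's BaggingClassifier, which with bootstrap=False and bootstrap_features=False approximates the classic RSM (full training set, one fixed random feature subset per learner); the dataset and the grid of $m$ values are illustrative assumptions.

```python
# Cross-validated sweep over the subspace size m (illustrative dataset and grid).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=100, n_informative=15,
                           random_state=0)
p = X.shape[1]

for m in [int(np.sqrt(p)), p // 3, p // 2, p]:
    rsm = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            max_features=m, bootstrap=False,
                            bootstrap_features=False, random_state=0)
    print(f"m={m:3d}: CV accuracy = {cross_val_score(rsm, X, y, cv=5).mean():.3f}")
```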
While both RSM and Bagging are ensemble diversification strategies, they operate on different axes:
| Aspect | Bagging | Random Subspace Method |
|---|---|---|
| What varies | Training samples | Feature subsets |
| Training data | Bootstrap samples (~63.2% unique) | Full training set |
| Feature access | All features available | Restricted to $m$ features |
| Source of diversity | Different data points → different splits | Different features → different perspectives |
| OOB estimation | Yes (unused samples) | No (all samples used) |
| Best for | High-variance learners | Feature-dominant data |
| Typical use case | Variance reduction | Breaking feature correlation |
The Key Insight: Combining Both
Random Forests achieve superior performance by combining both strategies: each tree is trained on a bootstrap sample of the data (bagging), and at every split only a random subset of features is considered (feature subspace sampling).
This dual randomization multiplicatively increases diversity. Even if two trees receive similar bootstrap samples, they're forced to build different structures due to different feature access. Even if two trees have overlapping features, they see different data points.
Random Forests apply feature subspace sampling at EACH SPLIT, not just once per tree. This is more aggressive than pure RSM (which selects features once per learner) and leads to even greater diversity. This 'per-split' randomization is a key innovation of the Random Forest algorithm.
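In scikit-learn terms, the contrast can be sketched roughly as follows: BaggingClassifier's max_features fixes one feature subset per learner (RSM-style), while RandomForestClassifier draws a fresh candidate set of max_features at every split. The dataset and hyperparameters below are illustrative assumptions, not a benchmark.

```python
# Per-tree vs per-split feature randomization (illustrative data and settings).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=50, n_informative=10,
                           random_state=1)

per_tree = BaggingClassifier(DecisionTreeClassifier(), n_estimators=200,
                             max_features=7, bootstrap=True, random_state=1)
per_split = RandomForestClassifier(n_estimators=200, max_features=7,
                                   random_state=1)

for name, model in [("per-tree (bagging + RSM)", per_tree),
                    ("per-split (Random Forest)", per_split)]:
    print(name, round(cross_val_score(model, X, y, cv=5).mean(), 3))
```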
The random subspace method has a beautiful geometric interpretation. Consider data in $\mathbb{R}^3$ projected onto different 2D planes:
Each 2D classifier finds a decision boundary that works well in its projected view. Some class separations are clearly visible in one projection but hidden in another. By aggregating classifiers from multiple projections, the ensemble captures separations that no single projection fully reveals.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Generate 3D data with complex structure
np.random.seed(42)
X, y = make_classification(
    n_samples=300, n_features=3, n_informative=3,
    n_redundant=0, n_clusters_per_class=2, class_sep=0.8
)

# Define the three 2D projections
projections = [
    ([0, 1], "Features 1-2 (Front View)"),
    ([0, 2], "Features 1-3 (Top View)"),
    ([1, 2], "Features 2-3 (Side View)"),
]

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

for ax, (features, title) in zip(axes, projections):
    # Project data onto 2D subspace
    X_proj = X[:, features]

    # Train classifier on this projection
    clf = DecisionTreeClassifier(max_depth=5)
    clf.fit(X_proj, y)

    # Create decision boundary visualization
    xx, yy = np.meshgrid(
        np.linspace(X_proj[:, 0].min() - 1, X_proj[:, 0].max() + 1, 100),
        np.linspace(X_proj[:, 1].min() - 1, X_proj[:, 1].max() + 1, 100)
    )
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    ax.contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')
    ax.scatter(X_proj[y == 0, 0], X_proj[y == 0, 1], c='blue', label='Class 0')
    ax.scatter(X_proj[y == 1, 0], X_proj[y == 1, 1], c='red', label='Class 1')
    ax.set_title(title)
    ax.legend()

plt.tight_layout()
plt.savefig('subspace_projections.png')

print("Accuracy per projection:")
for features, title in projections:
    clf = DecisionTreeClassifier(max_depth=5)
    clf.fit(X[:, features], y)
    print(f"  {title}: {clf.score(X[:, features], y):.3f}")
```

The Blessing of Dimensionality (Sometimes):
While high dimensionality typically causes problems (the "curse of dimensionality"), the random subspace method can turn high dimensionality into an advantage: the more features there are, the more distinct subspaces $\binom{p}{m}$ exist to sample from, so ensemble members become increasingly diverse.
This is why Random Forests work remarkably well on high-dimensional data like genomics (thousands of genes) or text analysis (thousands of words)—there's an exponentially large space of diverse projections to sample from.
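To get a feel for the size of this projection space, the short sketch below counts the distinct $m$-feature subspaces $\binom{p}{m}$ for a few values of $p$; the values of $p$ are arbitrary illustrative choices.

```python
# Number of distinct m-dimensional coordinate subspaces grows combinatorially in p.
from math import comb, isqrt

for p in [20, 100, 1000, 10000]:
    m = isqrt(p)  # roughly sqrt(p) features per learner
    print(f"p={p:6d}, m={m:3d}: C(p, m) ≈ {float(comb(p, m)):.3e}")
```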
The Random Forest algorithm, introduced by Leo Breiman in 2001, extends the random subspace method in three key ways:
1. Feature Randomization at Every Split (Not Just Per Tree)
Classic RSM selects $m$ features once per tree, then builds the tree using only those features. Random Forests re-randomize feature selection at every internal node: at each split, a fresh random subset of $m_{try}$ features is drawn, and the best split is chosen only among those candidates.
This per-split randomization creates far more diversity than per-tree randomization.
```python
import numpy as np


def entropy(y):
    """Shannon entropy of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))


def compute_information_gain(y, feature_values, threshold):
    """Minimal entropy-based information gain for the split feature <= threshold.
    Added here only so the example runs; the original text assumes this helper exists."""
    left = feature_values <= threshold
    right = ~left
    if left.sum() == 0 or right.sum() == 0:
        return -np.inf
    n = len(y)
    return (entropy(y)
            - (left.sum() / n) * entropy(y[left])
            - (right.sum() / n) * entropy(y[right]))


def find_best_split_random_forest(X, y, m_try, all_features):
    """
    Find the best split at a node using Random Forest logic.

    Key difference from standard decision trees:
    - Only consider m_try randomly selected features
    - Re-sample features at EVERY split (not once per tree)
    """
    n_features = len(all_features)

    # Randomly select m_try features to consider for this split
    # This is the core of Random Forest's feature randomization
    candidate_features = np.random.choice(
        all_features, size=min(m_try, n_features), replace=False
    )

    best_gain = -np.inf
    best_feature = None
    best_threshold = None

    # Only search among the randomly selected candidates
    for feature_idx in candidate_features:
        feature_values = X[:, feature_idx]
        thresholds = np.unique(feature_values)

        for threshold in thresholds:
            gain = compute_information_gain(y, feature_values, threshold)
            if gain > best_gain:
                best_gain = gain
                best_feature = feature_idx
                best_threshold = threshold

    return best_feature, best_threshold, best_gain


# Contrast with standard decision tree:
def find_best_split_standard(X, y):
    """Standard decision tree considers ALL features at every split."""
    n_features = X.shape[1]
    best_gain = -np.inf
    best_feature = None
    best_threshold = None

    # Search through ALL features (no randomization)
    for feature_idx in range(n_features):
        feature_values = X[:, feature_idx]
        thresholds = np.unique(feature_values)

        for threshold in thresholds:
            gain = compute_information_gain(y, feature_values, threshold)
            if gain > best_gain:
                best_gain = gain
                best_feature = feature_idx
                best_threshold = threshold

    return best_feature, best_threshold, best_gain
```

2. Combined with Bootstrap Sampling
Random Forests apply bagging in addition to feature randomization:
| Component | What It Does | Diversity Source |
|---|---|---|
| Bootstrap Sampling | Different ~63.2% of data per tree | Sample variation |
| Per-Split Feature Sampling | Different $m_{try}$ features per split | Feature variation |
| Deep Unpruned Trees | Complex individual decision surfaces | Model capacity |
3. Typically Unpruned Trees
Unlike traditional decision trees (which risk overfitting), Random Forest trees are typically grown to maximum depth: each unpruned tree has low bias and high variance, and the ensemble averaging absorbs the extra variance.
Random Forests deliberately introduce two sources of 'controlled randomness' to create an ensemble where individual errors are uncorrelated. The bootstrap samples vary the data, while the per-split feature sampling varies the model structure. This dual randomization is more powerful than either technique alone—it's why Random Forests became one of the most successful machine learning algorithms of all time.
We've explored the random subspace method—a foundational technique that transforms ensemble learning by introducing feature-level diversity.
What's Next:
Now that we understand the random subspace method, the next page dives deeper into feature randomization in Random Forests—examining exactly how per-split sampling works, why $m_{try} = \sqrt{p}$ is recommended, and what happens when we vary this parameter.
You now understand the random subspace method—the theoretical foundation for feature randomization in Random Forests. This technique of training ensemble members on different feature subsets is what allows Random Forests to achieve both low bias (powerful individual trees) and low variance (uncorrelated predictions). Next, we'll explore the specifics of feature randomization at each split.