In the previous module, we explored Bagging—the technique of training multiple models on bootstrap samples and aggregating their predictions to reduce variance. Bagging is powerful, but it has a subtle limitation: when the training data contains strongly predictive features, every tree in the ensemble tends to use those features at the top splits, making the trees structurally similar and their errors correlated.
The Random Subspace Method (RSM), also known as attribute bagging or feature bagging, provides an elegant solution. Instead of giving each learner access to all features, we randomly restrict each learner to a subset of the available features. This simple modification has profound consequences for ensemble diversity and predictive power.
By the end of this page, you will understand the mathematical foundations of the random subspace method, why random feature selection creates more diverse ensembles, how subspace projection affects the bias-variance tradeoff, and the precise relationship between RSM and Random Forests.
To appreciate the random subspace method, we must first understand why pure bagging sometimes falls short. Consider a dataset with 100 features where 3 features are exceptionally predictive. When we build a bagged ensemble of decision trees, nearly every tree selects one of those 3 dominant features for its top splits, so the trees end up structurally similar and their prediction errors are correlated.
This phenomenon is sometimes called tree correlation or feature dominance, and it limits the variance reduction that bagging can achieve.
Recall from ensemble theory that the variance of an ensemble average is $\text{Var}_{\text{ensemble}} = \rho\sigma^2 + \frac{(1-\rho)\sigma^2}{n}$, where $\sigma^2$ is the variance of a single tree, $n$ is the number of trees, and $\rho$ is the average pairwise correlation between trees. When trees are highly correlated ($\rho \to 1$), the ensemble variance barely decreases with more trees. The random subspace method directly attacks this correlation.
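To make the effect of $\rho$ concrete, here is a small numerical sketch of the formula above; the values of $\sigma^2$ and $\rho$ are arbitrary illustrative choices, not results from any particular dataset.

```python
# Illustrative sketch: how tree correlation rho limits ensemble variance reduction.
# sigma_sq and the rho values below are assumed for demonstration only.
def ensemble_variance(rho, sigma_sq, n_trees):
    """Variance of an average of n_trees estimators with pairwise correlation rho."""
    return rho * sigma_sq + (1 - rho) * sigma_sq / n_trees

sigma_sq = 1.0  # variance of a single tree (assumed)
for rho in [0.1, 0.5, 0.9]:
    print(f"rho={rho}: 10 trees -> {ensemble_variance(rho, sigma_sq, 10):.3f}, "
          f"1000 trees -> {ensemble_variance(rho, sigma_sq, 1000):.3f}")
# With rho = 0.9, even 1000 trees leave the ensemble variance near 0.9 * sigma_sq.
```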
A Concrete Example:
Imagine predicting house prices with features including square_footage, location_score, num_bedrooms, and 97 other features. If square_footage and location_score alone explain 80% of variance:
Nearly every bagged tree places square_footage or location_score at the root, so the trees share the same top-level structure and make correlated errors. The random subspace method breaks this pattern by ensuring that some trees cannot use the dominant features, forcing them to find alternative predictive patterns.
The Random Subspace Method (RSM), introduced by Tin Kam Ho in 1998, constructs ensemble members by training each base learner on a randomly selected subset of the original feature space. The algorithm is remarkably simple:
```python
import numpy as np
import scipy.stats
from sklearn.tree import DecisionTreeClassifier


def random_subspace_ensemble(X, y, n_estimators, subspace_size):
    """
    Random Subspace Method for ensemble learning.

    Parameters:
    -----------
    X : array of shape (n_samples, n_features)
        Training data
    y : array of shape (n_samples,)
        Target values
    n_estimators : int
        Number of base learners in the ensemble
    subspace_size : int
        Number of features to select for each learner (m < n_features)

    Returns:
    --------
    ensemble : list of (model, feature_indices) tuples
    """
    n_samples, n_features = X.shape
    ensemble = []

    for i in range(n_estimators):
        # Step 1: Randomly select m features from p total features
        # This is sampling WITHOUT replacement from feature indices
        selected_features = np.random.choice(
            n_features, size=subspace_size, replace=False
        )

        # Step 2: Project training data onto the selected subspace
        X_subspace = X[:, selected_features]

        # Step 3: Train a base learner on the projected data
        # (using full training set, only feature restriction)
        model = DecisionTreeClassifier()
        model.fit(X_subspace, y)

        # Store model along with which features it uses
        ensemble.append((model, selected_features))

    return ensemble


def predict_rsm_ensemble(ensemble, X):
    """
    Make predictions using the RSM ensemble.

    Each model only sees its selected features during prediction.
    Final prediction uses majority voting (classification)
    or averaging (regression).
    """
    predictions = []

    for model, feature_indices in ensemble:
        # Project test data onto this model's subspace
        X_subspace = X[:, feature_indices]
        predictions.append(model.predict(X_subspace))

    # Majority voting for classification
    predictions = np.array(predictions)
    final_predictions = scipy.stats.mode(predictions, axis=0)[0]

    return final_predictions
```

Key Algorithmic Properties:
| Property | Description |
|---|---|
| Subspace Selection | Each learner receives a random subset of m features from p total |
| Selection Without Replacement | Within each learner, the m features are distinct |
| Independence Across Learners | Different learners can have overlapping or disjoint feature sets |
| Original Training Data | Unlike bagging, RSM uses the full training set (no bootstrap) |
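As a quick illustration, the sketch below runs the two functions defined above on a synthetic dataset; the dataset shape, subspace_size, and n_estimators are arbitrary choices for demonstration.

```python
# Minimal usage sketch of random_subspace_ensemble / predict_rsm_ensemble.
# Dataset parameters and ensemble sizes are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=100, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ensemble = random_subspace_ensemble(X_train, y_train,
                                    n_estimators=50, subspace_size=10)
y_pred = predict_rsm_ensemble(ensemble, X_test)
print("RSM ensemble accuracy:", np.mean(np.ravel(y_pred) == y_test))
```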
The classic random subspace method does not use bootstrap sampling—it randomizes features while using all training samples. Random Forests combine both bootstrap sampling (bagging) and feature subspace sampling, getting the benefits of both techniques.
The random subspace method can be understood through the lens of projection onto random subspaces of the original feature space. Let's formalize this mathematically.
Notation: let $\mathbf{X} \in \mathbb{R}^{n \times p}$ be the training data with $n$ samples and $p$ features, let $m < p$ be the subspace dimension, and let $S_t \subseteq \{1, \ldots, p\}$ with $|S_t| = m$ be the feature subset drawn for ensemble member $t$.
The Projection Operation:
For each ensemble member $t$, we define a projection matrix $\mathbf{P}_t \in \mathbb{R}^{m \times p}$ that selects the features in $S_t$. If $S_t = \{j_1, j_2, \ldots, j_m\}$, then:
$$\mathbf{P}_t = \begin{bmatrix} \mathbf{e}_{j_1}^T \\ \mathbf{e}_{j_2}^T \\ \vdots \\ \mathbf{e}_{j_m}^T \end{bmatrix}$$
where $\mathbf{e}_j$ is the standard basis vector with 1 in position $j$ and 0 elsewhere.
The projected training data for learner $t$ is: $$\mathbf{X}_t = \mathbf{X} \mathbf{P}_t^T \in \mathbb{R}^{n \times m}$$
Each projection P_t maps the original p-dimensional feature space onto an m-dimensional coordinate-aligned subspace. Different ensemble members 'see' the data from different m-dimensional perspectives. The ensemble aggregates predictions across these diverse viewpoints, capturing patterns that no single subspace fully represents.
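A small sketch of this equivalence, with arbitrary shapes and a random feature subset: building $\mathbf{P}_t$ from standard basis vectors and multiplying gives exactly the same result as plain column indexing.

```python
# The projection matrix P_t is just row selection: X @ P_t.T == X[:, S_t].
# n, p, m and the random seed are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 5, 8, 3
X = rng.normal(size=(n, p))

S_t = rng.choice(p, size=m, replace=False)   # random feature subset
P_t = np.eye(p)[S_t]                         # m x p matrix whose rows are e_j^T
X_t = X @ P_t.T                              # projected data, shape (n, m)

assert np.allclose(X_t, X[:, S_t])           # identical to column indexing
print("subset:", S_t, "projected shape:", X_t.shape)
```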
Diversity Through Random Projections:
The probability that two ensemble members share the exact same feature set is:
$$P(S_i = S_j) = \frac{1}{\binom{p}{m}}$$
For $p = 100$ and $m = 10$, this probability is approximately $5.8 \times 10^{-14}$, which is essentially zero. Even partial overlap has low probability:
$$P(|S_i \cap S_j| = k) = \frac{\binom{m}{k} \binom{p-m}{m-k}}{\binom{p}{m}}$$
This combinatorial analysis shows that random subspace selection naturally generates highly diverse feature sets, which translates to diverse decision boundaries and decorrelated predictions.
Expected Overlap Between Subspaces:
The expected number of shared features between two randomly selected subspaces is:
$$\mathbb{E}[|S_i \cap S_j|] = \frac{m^2}{p}$$
For example, with $p = 100$ features and $m = 10$ features per learner, the expected overlap is $10^2 / 100 = 1$ feature, even though each learner sees 10 features.
This low expected overlap is the source of prediction diversity. With $m = \sqrt{p}$ (a common choice), the expected overlap is:
$$\mathbb{E}[|S_i \cap S_j|] = \frac{(\sqrt{p})^2}{p} = 1$$
Again, just one feature despite each learner having $\sqrt{p}$ features.
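A quick Monte Carlo check of this expectation; the values of $p$, $m$, and the number of trials are arbitrary.

```python
# Simulated check of E[|S_i ∩ S_j|] = m^2 / p; p, m, and trials are illustrative.
import numpy as np

rng = np.random.default_rng(42)
p, m, trials = 100, 10, 20_000

overlaps = [
    len(np.intersect1d(rng.choice(p, m, replace=False),
                       rng.choice(p, m, replace=False)))
    for _ in range(trials)
]
print("simulated mean overlap:", np.mean(overlaps))  # close to m**2 / p = 1.0
```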
The subspace dimension $m$ is a critical hyperparameter that controls the bias-variance-diversity tradeoff:
| Small $m$ | Large $m$ |
|---|---|
| High diversity (low correlation) | Low diversity (high correlation) |
| Individual learners may be weak | Individual learners are stronger |
| May miss feature interactions | Captures feature interactions |
| Higher bias per learner | Lower bias per learner |
| Lower variance through averaging | Less variance reduction from averaging |
Common Rules of Thumb: use $m = \sqrt{p}$ for classification and $m = p/3$ for regression, then tune around these defaults with cross-validation.
The Bias-Variance-Diversity Tradeoff:
Let's denote by $\text{Bias}_m$ and $\text{Var}_m$ the bias and variance of an individual learner trained on $m$ features, and by $\rho_m$ the average pairwise correlation between learners.
As $m$ increases, individual learners become stronger ($\text{Bias}_m$ decreases) but also more similar to one another ($\rho_m$ increases).
The ensemble variance with $T$ learners is approximately:
$$\text{Var}_{\text{ensemble}} \approx \rho_m \cdot \text{Var}_m + \frac{1-\rho_m}{T} \cdot \text{Var}_m$$
The optimal $m$ balances these competing effects. Smaller $m$ increases individual bias but decreases $\rho_m$, allowing greater variance reduction through averaging. The $\sqrt{p}$ rule empirically balances these factors well for many classification tasks.
Start with m = sqrt(p) for classification and m = p/3 for regression. If your dataset has many noisy or irrelevant features, smaller m can help by reducing the chance that noise features dominate. If features are carefully engineered and all relevant, larger m may work better. Always validate via cross-validation.
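One practical way to run this validation is sketched below. It relies on scikit-learn's BaggingClassifier, which with bootstrap=False and bootstrap_features=False approximates the classic RSM (full training set, one fixed random feature subset per learner); the dataset and the grid of $m$ values are illustrative assumptions.

```python
# Cross-validated sweep over the subspace size m (illustrative dataset and grid).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=100, n_informative=15,
                           random_state=0)
p = X.shape[1]

for m in [int(np.sqrt(p)), p // 3, p // 2, p]:
    rsm = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            max_features=m, bootstrap=False,
                            bootstrap_features=False, random_state=0)
    print(f"m={m:3d}: CV accuracy = {cross_val_score(rsm, X, y, cv=5).mean():.3f}")
```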
While both RSM and Bagging are ensemble diversification strategies, they operate on different axes:
| Aspect | Bagging | Random Subspace Method |
|---|---|---|
| What varies | Training samples | Feature subsets |
| Training data | Bootstrap samples (~63.2% unique) | Full training set |
| Feature access | All features available | Restricted to $m$ features |
| Source of diversity | Different data points → different splits | Different features → different perspectives |
| OOB estimation | Yes (unused samples) | No (all samples used) |
| Best for | High-variance learners | Feature-dominant data |
| Typical use case | Variance reduction | Breaking feature correlation |
The Key Insight: Combining Both
Random Forests achieve superior performance by combining both strategies: each tree is trained on a bootstrap sample of the data (bagging), and at every split only a random subset of features is considered (feature subspace sampling).
This dual randomization multiplicatively increases diversity. Even if two trees receive similar bootstrap samples, they're forced to build different structures due to different feature access. Even if two trees have overlapping features, they see different data points.
Random Forests apply feature subspace sampling at EACH SPLIT, not just once per tree. This is more aggressive than pure RSM (which selects features once per learner) and leads to even greater diversity. This 'per-split' randomization is a key innovation of the Random Forest algorithm.
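In scikit-learn terms, the contrast can be sketched roughly as follows: BaggingClassifier's max_features fixes one feature subset per learner (RSM-style), while RandomForestClassifier draws a fresh candidate set of max_features at every split. The dataset and hyperparameters below are illustrative assumptions, not a benchmark.

```python
# Per-tree vs per-split feature randomization (illustrative data and settings).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=50, n_informative=10,
                           random_state=1)

per_tree = BaggingClassifier(DecisionTreeClassifier(), n_estimators=200,
                             max_features=7, bootstrap=True, random_state=1)
per_split = RandomForestClassifier(n_estimators=200, max_features=7,
                                   random_state=1)

for name, model in [("per-tree (bagging + RSM)", per_tree),
                    ("per-split (Random Forest)", per_split)]:
    print(name, round(cross_val_score(model, X, y, cv=5).mean(), 3))
```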
The random subspace method has a beautiful geometric interpretation. Consider data in $\mathbb{R}^3$ projected onto different 2D planes:
Each 2D classifier finds a decision boundary that works well in its projected view. Some class separations are clearly visible in one projection but hidden in another. By aggregating classifiers from multiple projections, the ensemble captures separations that no single projection fully reveals.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Generate 3D data with complex structure
np.random.seed(42)
X, y = make_classification(
    n_samples=300, n_features=3, n_informative=3,
    n_redundant=0, n_clusters_per_class=2, class_sep=0.8
)

# Define the three 2D projections
projections = [
    ([0, 1], "Features 1-2 (Front View)"),
    ([0, 2], "Features 1-3 (Top View)"),
    ([1, 2], "Features 2-3 (Side View)"),
]

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

for ax, (features, title) in zip(axes, projections):
    # Project data onto 2D subspace
    X_proj = X[:, features]

    # Train classifier on this projection
    clf = DecisionTreeClassifier(max_depth=5)
    clf.fit(X_proj, y)

    # Create decision boundary visualization
    xx, yy = np.meshgrid(
        np.linspace(X_proj[:, 0].min() - 1, X_proj[:, 0].max() + 1, 100),
        np.linspace(X_proj[:, 1].min() - 1, X_proj[:, 1].max() + 1, 100)
    )
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    ax.contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')
    ax.scatter(X_proj[y == 0, 0], X_proj[y == 0, 1], c='blue', label='Class 0')
    ax.scatter(X_proj[y == 1, 0], X_proj[y == 1, 1], c='red', label='Class 1')
    ax.set_title(title)
    ax.legend()

plt.tight_layout()
plt.savefig('subspace_projections.png')

print("Accuracy per projection:")
for features, title in projections:
    clf = DecisionTreeClassifier(max_depth=5)
    clf.fit(X[:, features], y)
    print(f"  {title}: {clf.score(X[:, features], y):.3f}")
```

The Blessing of Dimensionality (Sometimes):
While high dimensionality typically causes problems (the "curse of dimensionality"), the random subspace method can turn high dimensionality into an advantage: the more features there are, the more distinct subspaces $\binom{p}{m}$ exist to sample from, so ensemble members become increasingly diverse.
This is why Random Forests work remarkably well on high-dimensional data like genomics (thousands of genes) or text analysis (thousands of words)—there's an exponentially large space of diverse projections to sample from.
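To get a feel for the size of this projection space, the short sketch below counts the distinct $m$-feature subspaces $\binom{p}{m}$ for a few values of $p$; the values of $p$ are arbitrary illustrative choices.

```python
# Number of distinct m-dimensional coordinate subspaces grows combinatorially in p.
from math import comb, isqrt

for p in [20, 100, 1000, 10000]:
    m = isqrt(p)  # roughly sqrt(p) features per learner
    print(f"p={p:6d}, m={m:3d}: C(p, m) ≈ {float(comb(p, m)):.3e}")
```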
The Random Forest algorithm, introduced by Leo Breiman in 2001, extends the random subspace method in three key ways:
1. Feature Randomization at Every Split (Not Just Per Tree)
Classic RSM selects $m$ features once per tree, then builds the tree using only those features. Random Forests re-randomize feature selection at every internal node: at each split, a fresh random subset of $m_{try}$ features is drawn, and the best split is chosen only among those candidates.
This per-split randomization creates far more diversity than per-tree randomization.
```python
import numpy as np


def entropy(y):
    """Shannon entropy of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))


def compute_information_gain(y, feature_values, threshold):
    """Minimal entropy-based information gain for the split feature <= threshold.
    Added here only so the example runs; the original text assumes this helper exists."""
    left = feature_values <= threshold
    right = ~left
    if left.sum() == 0 or right.sum() == 0:
        return -np.inf
    n = len(y)
    return (entropy(y)
            - (left.sum() / n) * entropy(y[left])
            - (right.sum() / n) * entropy(y[right]))


def find_best_split_random_forest(X, y, m_try, all_features):
    """
    Find the best split at a node using Random Forest logic.

    Key difference from standard decision trees:
    - Only consider m_try randomly selected features
    - Re-sample features at EVERY split (not once per tree)
    """
    n_features = len(all_features)

    # Randomly select m_try features to consider for this split
    # This is the core of Random Forest's feature randomization
    candidate_features = np.random.choice(
        all_features, size=min(m_try, n_features), replace=False
    )

    best_gain = -np.inf
    best_feature = None
    best_threshold = None

    # Only search among the randomly selected candidates
    for feature_idx in candidate_features:
        feature_values = X[:, feature_idx]
        thresholds = np.unique(feature_values)

        for threshold in thresholds:
            gain = compute_information_gain(y, feature_values, threshold)
            if gain > best_gain:
                best_gain = gain
                best_feature = feature_idx
                best_threshold = threshold

    return best_feature, best_threshold, best_gain


# Contrast with standard decision tree:
def find_best_split_standard(X, y):
    """Standard decision tree considers ALL features at every split."""
    n_features = X.shape[1]
    best_gain = -np.inf
    best_feature = None
    best_threshold = None

    # Search through ALL features (no randomization)
    for feature_idx in range(n_features):
        feature_values = X[:, feature_idx]
        thresholds = np.unique(feature_values)

        for threshold in thresholds:
            gain = compute_information_gain(y, feature_values, threshold)
            if gain > best_gain:
                best_gain = gain
                best_feature = feature_idx
                best_threshold = threshold

    return best_feature, best_threshold, best_gain
```

2. Combined with Bootstrap Sampling
Random Forests apply bagging in addition to feature randomization:
| Component | What It Does | Diversity Source |
|---|---|---|
| Bootstrap Sampling | Different ~63.2% of data per tree | Sample variation |
| Per-Split Feature Sampling | Different $m_{try}$ features per split | Feature variation |
| Deep Unpruned Trees | Complex individual decision surfaces | Model capacity |
3. Typically Unpruned Trees
Unlike traditional decision trees (which risk overfitting), Random Forest trees are typically grown to maximum depth: each unpruned tree has low bias and high variance, and the ensemble averaging absorbs the extra variance.
Random Forests deliberately introduce two sources of 'controlled randomness' to create an ensemble where individual errors are uncorrelated. The bootstrap samples vary the data, while the per-split feature sampling varies the model structure. This dual randomization is more powerful than either technique alone—it's why Random Forests became one of the most successful machine learning algorithms of all time.
We've explored the random subspace method—a foundational technique that transforms ensemble learning by introducing feature-level diversity.
What's Next:
Now that we understand the random subspace method, the next page dives deeper into feature randomization in Random Forests—examining exactly how per-split sampling works, why $m_{try} = \sqrt{p}$ is recommended, and what happens when we vary this parameter.
You now understand the random subspace method—the theoretical foundation for feature randomization in Random Forests. This technique of training ensemble members on different feature subsets is what allows Random Forests to achieve both low bias (powerful individual trees) and low variance (uncorrelated predictions). Next, we'll explore the specifics of feature randomization at each split.