In high-dimensional data, a fascinating phenomenon emerges: different subsets of features can provide entirely different perspectives on the same classification problem. This observation is the foundation of the Random Subspace Method, introduced by Tin Kam Ho in 1998—predating Random Forests by three years.
The core insight is elegant: rather than having all ensemble members examine the same features (potentially discovering the same patterns and making correlated errors), we train each member on a randomly selected subspace of the original feature space. The result is an ensemble where each member has literally a different "view" of the data.
Subspace Forests—ensembles of decision trees trained using the Random Subspace Method—offer unique advantages for high-dimensional problems, including robustness to irrelevant features, reduced overfitting, and the ability to capture complementary patterns from different feature combinations.
By the end of this page, you will deeply understand: (1) The theoretical foundations of the Random Subspace Method, (2) Why feature subspacing creates effective diversity, (3) The relationship to Random Forests and other methods, (4) Optimal subspace size selection, and (5) When Subspace Forests outperform alternatives.
The Random Subspace Method (RSM) creates ensemble diversity through a fundamentally different mechanism than bootstrap sampling. Instead of varying which samples each model sees, RSM varies which features each model can access.
Formal Definition:
Given a dataset with $n$ samples and $d$ features, the Random Subspace Method: (1) samples, for each of the $T$ ensemble members, a random subset of $k < d$ features without replacement; (2) trains that member on all $n$ samples restricted to its $k$ features; and (3) aggregates predictions by majority vote (or by averaging class probabilities).
Key Difference from Random Forests:
In Random Forests, feature selection happens at each split—a tree can potentially use all features across its full depth. In RSM, the subspace is fixed once before training, so a tree never has access to the excluded features:
| Method | When Features Selected | Features per Split | Total Features per Tree |
|---|---|---|---|
| Full Tree | Never (all used) | All d | All d |
| Random Forest | At each split | sqrt(d) candidates | Potentially all d |
| Random Subspace | Once per tree | All k (from subset) | Exactly k < d |
| Extra-Trees | At each split | sqrt(d) candidates + random threshold | Potentially all d |
Think of each subspace tree as an expert that knows only certain "dimensions" of the problem. Tree A might excel at patterns visible in features {1,3,7}, while Tree B captures patterns in features {2,4,5}. Together, they cover the full feature space through their collective expertise.
The effectiveness of the Random Subspace Method rests on several theoretical pillars. Let's examine each in detail.
1. Dimension Reduction and the Curse of Dimensionality:
In high-dimensional spaces, decision trees face the curse of dimensionality: samples become sparse, so each split is supported by little data; the chance that some noise feature spuriously looks discriminative grows with $d$; and deep trees overfit these spurious patterns.
By training each tree on a $k$-dimensional subspace where $k \ll d$, individual trees operate in a more manageable space:
$$\text{Effective density per tree} \propto n^{1/k} \gg n^{1/d}$$
This higher effective density allows trees to make more reliable splits.
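To make the density argument concrete, here is a quick numeric check (the specific values of $n$, $d$, and $k$ are illustrative, not from the text):

```python
# Per-dimension sample density scales like n**(1/dims): with n samples
# spread over a unit hypercube, fewer dimensions means far more samples
# "per direction" for a tree to split on.
n = 1000       # number of training samples
d = 100        # full feature dimensionality
k = 10         # subspace dimensionality per tree

density_full = n ** (1 / d)   # barely above 1: essentially no coverage
density_sub = n ** (1 / k)    # roughly double: noticeably denser

print(f"density in full space (d={d}): {density_full:.3f}")
print(f"density in subspace  (k={k}): {density_sub:.3f}")
```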
2. Error Decorrelation:
The ensemble error bound depends on the correlation $\rho$ between tree predictions:
$$\text{Ensemble Variance} = \rho \cdot \sigma^2 + \frac{(1-\rho)\sigma^2}{T}$$
For RSM, trees trained on non-overlapping features have predictions that are approximately independent for regions of the feature space where the excluded features provide discriminative power.
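The variance formula above is easy to explore numerically; this sketch simply evaluates it for a few correlation levels:

```python
# Ensemble variance of an average of T equally correlated predictors:
# Var = rho * sigma^2 + (1 - rho) * sigma^2 / T
def ensemble_variance(rho: float, sigma2: float, T: int) -> float:
    """Variance of the average of T predictors with pairwise correlation rho."""
    return rho * sigma2 + (1 - rho) * sigma2 / T

sigma2 = 1.0
for T in (1, 10, 100, 1000):
    print(f"T={T:4d}: rho=0.6 -> {ensemble_variance(0.6, sigma2, T):.3f}, "
          f"rho=0.1 -> {ensemble_variance(0.1, sigma2, T):.3f}")
# More trees cannot push variance below rho * sigma^2; lowering the
# correlation rho (which subspacing does) lowers that floor directly.
```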
3. Feature Importance Distribution:
Let $I_j$ denote the importance of feature $j$. In RSM, each feature appears in roughly $kT/d$ of the $T$ trees, so its aggregate influence on the ensemble is proportional both to its intrinsic importance $I_j$ and to its inclusion frequency $k/d$.
This creates an implicit soft feature selection: important features influence predictions often, while noise features rarely get to dominate any tree.
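A small simulation (with hypothetical values of $d$, $k$, and $T$) confirms the inclusion frequency behind this soft selection—each feature lands in a given tree with probability $k/d$:

```python
# With k of d features sampled uniformly per tree, each feature lands
# in a given tree with probability k/d, so across T trees it is used
# in roughly T*k/d trees (Binomial(T, k/d) in expectation).
import numpy as np

rng = np.random.default_rng(0)
d, k, T = 100, 50, 200

counts = np.zeros(d)
for _ in range(T):
    counts[rng.choice(d, size=k, replace=False)] += 1

print(f"expected trees per feature: {T * k / d:.0f}")
print(f"empirical mean: {counts.mean():.1f}, "
      f"min: {counts.min():.0f}, max: {counts.max():.0f}")
```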
4. Bias-Variance Trade-off:
RSM affects bias and variance differently: individual-tree bias typically increases, because each tree is denied access to some potentially useful features, while ensemble variance decreases through error decorrelation. The net effect is often positive, especially when the information lost per tree is small relative to the variance reduction gained.
RSM works best when multiple subsets of features each contain enough information to approximate the target function. In domains with high feature redundancy (genomics, image features, text), this assumption typically holds. In domains with few, non-redundant features, RSM may hurt performance.
Let's implement a Subspace Forest from scratch, highlighting the key algorithmic decisions.
```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.base import BaseEstimator, ClassifierMixin, RegressorMixin
from typing import List, Tuple, Optional, Union
from dataclasses import dataclass


@dataclass
class SubspaceTree:
    """A tree trained on a feature subspace."""
    tree: Union[DecisionTreeClassifier, DecisionTreeRegressor]
    feature_indices: np.ndarray  # Which features this tree uses


class SubspaceForestClassifier(BaseEstimator, ClassifierMixin):
    """
    Subspace Forest (Random Subspace Method with Decision Trees).

    Each tree is trained on a random subset of features, with all
    training samples. This differs from Random Forest where feature
    selection happens at each split.

    Reference: Ho, T.K. (1998). "The Random Subspace Method for
    Constructing Decision Forests."
    """

    def __init__(
        self,
        n_estimators: int = 100,
        max_features: Union[int, float, str] = 0.5,
        max_depth: Optional[int] = None,
        min_samples_split: int = 2,
        min_samples_leaf: int = 1,
        random_state: Optional[int] = None,
        n_jobs: int = 1
    ):
        """
        Initialize Subspace Forest.

        Args:
            n_estimators: Number of trees
            max_features: Number of features per subspace
                - int: exact number
                - float (0,1): fraction of total features
                - 'sqrt': sqrt(n_features)
                - 'log2': log2(n_features)
            max_depth: Maximum tree depth
            min_samples_split: Minimum samples to split a node
            min_samples_leaf: Minimum samples per leaf
            random_state: Random seed
            n_jobs: Number of parallel jobs
        """
        self.n_estimators = n_estimators
        self.max_features = max_features
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.min_samples_leaf = min_samples_leaf
        self.random_state = random_state
        self.n_jobs = n_jobs
        self.trees_: List[SubspaceTree] = []
        self.classes_ = None
        self.n_features_in_ = None

    def _get_subspace_size(self, n_features: int) -> int:
        """Compute the subspace dimensionality."""
        if isinstance(self.max_features, int):
            return min(self.max_features, n_features)
        elif isinstance(self.max_features, float):
            return max(1, int(self.max_features * n_features))
        elif self.max_features == 'sqrt':
            return max(1, int(np.sqrt(n_features)))
        elif self.max_features == 'log2':
            return max(1, int(np.log2(n_features)))
        else:
            return n_features

    def _sample_subspace(
        self,
        n_features: int,
        rng: np.random.RandomState
    ) -> np.ndarray:
        """
        Sample a random feature subspace.

        Features are sampled WITHOUT replacement to ensure
        a proper k-dimensional subspace.
        """
        k = self._get_subspace_size(n_features)
        return rng.choice(n_features, size=k, replace=False)

    def fit(self, X: np.ndarray, y: np.ndarray) -> 'SubspaceForestClassifier':
        """
        Fit the Subspace Forest.

        Unlike Random Forest, each tree is trained on ALL samples
        but only a SUBSET of features.
        """
        rng = np.random.RandomState(self.random_state)
        n_samples, n_features = X.shape
        self.n_features_in_ = n_features
        self.classes_ = np.unique(y)
        self.trees_ = []

        for i in range(self.n_estimators):
            # Sample feature subspace
            feature_indices = self._sample_subspace(n_features, rng)

            # Extract subspace data (ALL samples, SUBSET of features)
            X_subspace = X[:, feature_indices]

            # Train tree on subspace
            tree = DecisionTreeClassifier(
                max_depth=self.max_depth,
                min_samples_split=self.min_samples_split,
                min_samples_leaf=self.min_samples_leaf,
                random_state=rng.randint(0, 2**31)
            )
            tree.fit(X_subspace, y)

            self.trees_.append(SubspaceTree(
                tree=tree,
                feature_indices=feature_indices
            ))

        return self

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """
        Predict class probabilities by averaging tree predictions.

        Each tree predicts using only its subspace features.
        """
        n_samples = X.shape[0]
        n_classes = len(self.classes_)
        probas = np.zeros((n_samples, n_classes))

        for subspace_tree in self.trees_:
            # Extract this tree's features
            X_subspace = X[:, subspace_tree.feature_indices]

            # Get tree's predictions
            tree_proba = subspace_tree.tree.predict_proba(X_subspace)

            # Handle potential class mismatch
            tree_classes = subspace_tree.tree.classes_
            for i, cls in enumerate(tree_classes):
                cls_idx = np.where(self.classes_ == cls)[0]
                if len(cls_idx) > 0:
                    probas[:, cls_idx[0]] += tree_proba[:, i]

        # Average
        probas /= len(self.trees_)
        return probas

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict class labels."""
        proba = self.predict_proba(X)
        return self.classes_[np.argmax(proba, axis=1)]

    def get_feature_coverage(self) -> dict:
        """
        Analyze feature coverage across the ensemble.

        Returns statistics about how features are distributed
        across trees.
        """
        n_features = self.n_features_in_
        feature_counts = np.zeros(n_features)

        for subspace_tree in self.trees_:
            feature_counts[subspace_tree.feature_indices] += 1

        return {
            'feature_counts': feature_counts,
            'coverage_ratio': (feature_counts > 0).sum() / n_features,
            'avg_trees_per_feature': feature_counts.mean(),
            'min_coverage': feature_counts.min(),
            'max_coverage': feature_counts.max(),
        }

    def get_subspace_feature_importance(self) -> np.ndarray:
        """
        Compute feature importance aggregated across subspaces.

        Note: Unlike RF, features not in a subspace get 0 importance
        from that tree. We aggregate by averaging only over trees
        that include each feature.
        """
        n_features = self.n_features_in_
        importance_sum = np.zeros(n_features)
        feature_counts = np.zeros(n_features)

        for subspace_tree in self.trees_:
            tree_importance = subspace_tree.tree.feature_importances_
            for i, feat_idx in enumerate(subspace_tree.feature_indices):
                importance_sum[feat_idx] += tree_importance[i]
                feature_counts[feat_idx] += 1

        # Average importance where feature was included
        with np.errstate(divide='ignore', invalid='ignore'):
            importance = np.where(
                feature_counts > 0,
                importance_sum / feature_counts,
                0
            )

        return importance


# Demonstration
if __name__ == "__main__":
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.ensemble import RandomForestClassifier

    # High-dimensional dataset
    X, y = make_classification(
        n_samples=1000,
        n_features=100,
        n_informative=20,
        n_redundant=30,
        n_clusters_per_class=2,
        random_state=42
    )

    # Compare methods
    sf = SubspaceForestClassifier(
        n_estimators=100,
        max_features=0.5,
        random_state=42
    )
    rf = RandomForestClassifier(
        n_estimators=100,
        random_state=42
    )

    sf_scores = cross_val_score(sf, X, y, cv=5)
    rf_scores = cross_val_score(rf, X, y, cv=5)

    print(f"Subspace Forest: {sf_scores.mean():.4f} (+/- {sf_scores.std():.4f})")
    print(f"Random Forest: {rf_scores.mean():.4f} (+/- {rf_scores.std():.4f})")

    # Feature coverage analysis
    sf.fit(X, y)
    coverage = sf.get_feature_coverage()
    print(f"Feature Coverage: {coverage['coverage_ratio']:.2%}")
    print(f"Avg trees per feature: {coverage['avg_trees_per_feature']:.1f}")
```

The subspace dimensionality $k$ is the critical hyperparameter in the Random Subspace Method. Choosing $k$ involves a fundamental trade-off.
The Trade-off: a small $k$ maximizes diversity but weakens individual trees (a subspace may contain few or no informative features), while a large $k$ strengthens individual trees but makes their errors more correlated.
Theoretical Guidance:
For problems where a subset of size $r$ features is sufficient for accurate classification:
$$k \geq r \cdot \left(1 + \log\frac{d}{r}\right)$$
ensures high probability that each subspace contains at least some informative features.
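As a sketch (assuming the natural logarithm, which the text does not specify), the bound can be evaluated for a few hypothetical $(d, r)$ pairs:

```python
# Evaluating the coverage bound k >= r * (1 + log(d/r)) from the text
# for a few (d, r) pairs; natural log is assumed here.
import math

def min_subspace_size(d: int, r: int) -> int:
    """Smallest integer k satisfying k >= r * (1 + ln(d / r))."""
    return math.ceil(r * (1 + math.log(d / r)))

for d, r in [(100, 5), (1000, 20), (10000, 50)]:
    k = min_subspace_size(d, r)
    print(f"d={d:5d}, r={r:2d} -> k >= {k} ({k / d:.1%} of features)")
```

Note how the required fraction $k/d$ shrinks as $d$ grows: in very high dimensions, comparatively small subspaces already cover the informative features with high probability.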
Practical Guidelines:
| Scenario | Recommended k | Rationale |
|---|---|---|
| High redundancy (genomics, images) | 0.3d - 0.5d | Each subspace likely captures patterns |
| Moderate redundancy | 0.5d - 0.7d | Balance diversity and information |
| Low redundancy | 0.7d - 0.9d | Need most features for accuracy |
| Unknown structure | 0.5d (default) | Robust starting point |
| Very high d (>1000) | sqrt(d) or log(d) | Computational efficiency |
```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt


def analyze_subspace_size(X, y, subspace_fractions, n_estimators=100, cv=5):
    """
    Analyze performance across different subspace sizes.

    Returns data for plotting the bias-variance trade-off
    as subspace size varies.
    """
    results = []
    n_features = X.shape[1]

    for frac in subspace_fractions:
        k = max(1, int(frac * n_features))

        # Use BaggingClassifier with bootstrap=False to simulate RSM
        model = BaggingClassifier(
            estimator=DecisionTreeClassifier(),
            n_estimators=n_estimators,
            max_samples=1.0,           # All samples
            max_features=k,            # k features per tree
            bootstrap=False,           # No sample bootstrapping
            bootstrap_features=False,
            random_state=42,
            n_jobs=-1
        )

        scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
        results.append({
            'fraction': frac,
            'k': k,
            'mean_accuracy': scores.mean(),
            'std_accuracy': scores.std(),
        })
        print(f"k={k:3d} ({frac:.0%}): {scores.mean():.4f} (+/- {scores.std():.4f})")

    return results


def plot_subspace_analysis(results):
    """Visualize the subspace size impact."""
    fractions = [r['fraction'] for r in results]
    means = [r['mean_accuracy'] for r in results]
    stds = [r['std_accuracy'] for r in results]

    fig, ax = plt.subplots(figsize=(10, 6))
    ax.errorbar(fractions, means, yerr=stds, marker='o',
                capsize=5, linewidth=2, markersize=8)
    ax.set_xlabel('Subspace Fraction (k/d)', fontsize=12)
    ax.set_ylabel('Cross-Validation Accuracy', fontsize=12)
    ax.set_title('Subspace Forest: Effect of Subspace Size', fontsize=14)
    ax.grid(True, alpha=0.3)

    # Annotate optimal
    best_idx = np.argmax(means)
    ax.annotate(f'Optimal: {fractions[best_idx]:.0%}',
                xy=(fractions[best_idx], means[best_idx]),
                xytext=(fractions[best_idx] + 0.1, means[best_idx] + 0.02),
                arrowprops=dict(arrowstyle='->', color='red'),
                fontsize=11, color='red')

    plt.tight_layout()
    return fig


def estimate_optimal_subspace_size(X, y, cv=5, n_estimators=50):
    """
    Estimate optimal subspace size using coarse-to-fine search.

    More efficient than full grid search for production use.
    """
    n_features = X.shape[1]

    # Coarse search
    coarse_fractions = [0.1, 0.3, 0.5, 0.7, 0.9]
    coarse_results = analyze_subspace_size(
        X, y, coarse_fractions, n_estimators=n_estimators, cv=cv
    )

    # Find best region
    best_idx = np.argmax([r['mean_accuracy'] for r in coarse_results])
    best_frac = coarse_fractions[best_idx]

    # Fine search around best
    low = max(0.05, best_frac - 0.15)
    high = min(0.95, best_frac + 0.15)
    fine_fractions = np.linspace(low, high, 5)
    fine_results = analyze_subspace_size(
        X, y, fine_fractions, n_estimators=n_estimators, cv=cv
    )

    final_best = max(fine_results, key=lambda x: x['mean_accuracy'])
    print(f"Optimal subspace size: {final_best['k']} features "
          f"({final_best['fraction']:.1%} of {n_features})")
    return final_best


# Example typical output:
# k= 10 (10%): 0.8234 (+/- 0.0312)
# k= 30 (30%): 0.8567 (+/- 0.0267)
# k= 50 (50%): 0.8712 (+/- 0.0234)  <- Often optimal
# k= 70 (70%): 0.8689 (+/- 0.0245)
# k= 90 (90%): 0.8543 (+/- 0.0289)  <- Declining (less diversity)
```

For many problems, starting with k = 0.5d (50% of features) provides a good balance. From there, tune based on performance: increase k if accuracy is too low (trees missing important features), decrease k if you observe high correlation between tree predictions (need more diversity).
Both Subspace Forests and Random Forests use feature subsampling, but in fundamentally different ways. Understanding these differences is crucial for method selection.
| Aspect | Subspace Forest | Random Forest |
|---|---|---|
| Feature selection timing | Once per tree (global) | At each split (local) |
| Total features per tree | Exactly k (fixed) | Potentially all d |
| Sample handling | All samples (no bootstrap) | Bootstrap (~63.2% unique) |
| Individual tree expressiveness | Limited to k features | Full expressiveness |
| Diversity mechanism | Feature subspace only | Bootstrap + feature sampling |
| OOB estimation | Not available (uses all samples) | Available |
| Interpretation | Clear subspace structure | Feature importance more complex |
| High-dimensional suitability | Excellent | Good |
Empirical Performance Patterns:
- Text/Document Classification: Subspace Forests often excel due to extremely high dimensionality and feature redundancy
- Gene Expression Data: Strong performance from both; Subspace Forests sometimes preferred for interpretability
- Standard Tabular Data: Random Forests typically win due to full feature access per tree
- Image Features (pre-CNN era): Subspace Forests competitive due to high-dimensional, redundant features
- Financial Data: Random Forests typically preferred; features often not redundant enough for RSM
You can combine Subspace Forests with bootstrap sampling to get both diversity mechanisms. This hybrid approach—sometimes called Random Patches when including sample subsampling—can outperform either pure approach. Experiment with your specific data to find the optimal combination.
Subspace Forests particularly shine in high-dimensional settings. Let's examine why and how to apply them effectively.
The High-Dimensional Advantage:
When $d \gg n$ (more features than samples), traditional methods face three challenges: individual models overfit easily, noise features can swamp the informative ones during split selection, and training cost grows with $d$.
Subspace Forests address all three:
$$\text{Effective dimensionality per tree} = k \ll d$$
This means each tree operates in a more tractable space where splits are supported by relatively more data, fewer noise features compete at each split, and training is faster.
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
import time


def compare_methods_high_dimensional():
    """
    Compare methods on a high-dimensional dataset (d >> n).

    Simulates scenarios common in genomics, text classification, etc.
    """
    # High-dimensional dataset: 500 samples, 5000 features
    X, y = make_classification(
        n_samples=500,
        n_features=5000,
        n_informative=50,
        n_redundant=200,
        n_clusters_per_class=3,
        random_state=42
    )
    print(f"Dataset: {X.shape[0]} samples × {X.shape[1]} features")
    print(f"n << d: {X.shape[0]} << {X.shape[1]}")

    methods = {
        'Subspace Forest (10%)': BaggingClassifier(
            estimator=DecisionTreeClassifier(),
            n_estimators=100,
            max_samples=1.0,
            max_features=0.1,   # 10% of features = 500 features per tree
            bootstrap=False,
            bootstrap_features=False,
            random_state=42,
            n_jobs=-1
        ),
        'Subspace Forest (5%)': BaggingClassifier(
            estimator=DecisionTreeClassifier(),
            n_estimators=100,
            max_samples=1.0,
            max_features=0.05,  # 5% = 250 features per tree
            bootstrap=False,
            bootstrap_features=False,
            random_state=42,
            n_jobs=-1
        ),
        'Random Forest': RandomForestClassifier(
            n_estimators=100,
            random_state=42,
            n_jobs=-1
        ),
        'Feature Selection + RF': Pipeline([
            ('select', SelectKBest(f_classif, k=500)),
            ('rf', RandomForestClassifier(n_estimators=100,
                                          random_state=42, n_jobs=-1))
        ]),
    }

    results = {}
    for name, model in methods.items():
        start = time.time()
        scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
        elapsed = time.time() - start
        results[name] = {
            'accuracy': scores.mean(),
            'std': scores.std(),
            'time': elapsed
        }
        print(f"{name:25s}: {scores.mean():.4f} (+/- {scores.std():.4f}) [{elapsed:.1f}s]")

    return results


def analyze_subspace_for_genomics():
    """
    Simulate genomics-like data where genes are grouped into pathways.

    Subspace Forests can capture pathway-level patterns when different
    subspaces correspond to different biological pathways.
    """
    np.random.seed(42)
    n_samples = 200
    n_pathways = 10
    genes_per_pathway = 100
    n_features = n_pathways * genes_per_pathway

    # Create pathway-structured data
    X = np.random.randn(n_samples, n_features)

    # Make some pathways predictive:
    # classes based on pathway 0 and pathway 3 expression
    pathway_0_mean = X[:, :genes_per_pathway].mean(axis=1)
    pathway_3_mean = X[:, 3*genes_per_pathway:4*genes_per_pathway].mean(axis=1)
    y = ((pathway_0_mean > 0) & (pathway_3_mean > 0)).astype(int)

    print(f"Genomics-like data: {n_samples} samples, {n_features} genes, "
          f"{n_pathways} pathways")

    # Compare methods
    subspace_model = BaggingClassifier(
        estimator=DecisionTreeClassifier(max_depth=5),
        n_estimators=100,
        max_samples=1.0,
        max_features=genes_per_pathway * 2,  # ~2 pathways worth
        bootstrap=False,
        random_state=42,
        n_jobs=-1
    )
    rf_model = RandomForestClassifier(
        n_estimators=100,
        max_depth=5,
        random_state=42,
        n_jobs=-1
    )

    subspace_scores = cross_val_score(subspace_model, X, y, cv=5)
    rf_scores = cross_val_score(rf_model, X, y, cv=5)

    print(f"Subspace Forest: {subspace_scores.mean():.4f}")
    print(f"Random Forest: {rf_scores.mean():.4f}")


if __name__ == "__main__":
    compare_methods_high_dimensional()
    print()
    analyze_subspace_for_genomics()
```

Subspace Forests have shown strong results in: gene expression classification (where d can exceed 20,000), document classification (vocabulary size often > 50,000), hyperspectral image analysis (hundreds of spectral bands), and EEG signal classification (many channels × time points).
Several extensions enhance the basic Random Subspace Method for specific applications.
```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import mutual_info_classif


class WeightedSubspaceForest:
    """
    Subspace Forest with weighted feature sampling.

    Features are sampled proportional to their univariate importance,
    biasing subspaces toward more informative features while still
    maintaining diversity through randomization.
    """

    def __init__(
        self,
        n_estimators: int = 100,
        subspace_fraction: float = 0.5,
        weighting_strength: float = 1.0,
        random_state: int = None
    ):
        """
        Args:
            n_estimators: Number of trees
            subspace_fraction: Fraction of features per subspace
            weighting_strength: How strongly to weight by importance
                - 0: uniform random (standard RSM)
                - 1: proportional to importance
                - >1: more aggressive toward important features
        """
        self.n_estimators = n_estimators
        self.subspace_fraction = subspace_fraction
        self.weighting_strength = weighting_strength
        self.random_state = random_state
        self.trees_ = []
        self.feature_weights_ = None

    def _compute_feature_weights(self, X, y):
        """Compute feature weights based on mutual information."""
        mi_scores = mutual_info_classif(X, y, random_state=self.random_state)
        # Apply weighting strength
        weights = np.power(mi_scores + 1e-10, self.weighting_strength)
        weights /= weights.sum()
        return weights

    def _sample_weighted_subspace(self, n_features, rng):
        """Sample features according to computed weights."""
        k = max(1, int(self.subspace_fraction * n_features))
        # Sample without replacement, weighted by feature importance
        indices = rng.choice(
            n_features,
            size=k,
            replace=False,
            p=self.feature_weights_
        )
        return indices

    def fit(self, X, y):
        """Fit the weighted subspace forest."""
        rng = np.random.RandomState(self.random_state)
        n_features = X.shape[1]

        # Compute feature weights
        self.feature_weights_ = self._compute_feature_weights(X, y)

        self.trees_ = []
        self.classes_ = np.unique(y)

        for _ in range(self.n_estimators):
            # Weighted subspace sampling
            feature_indices = self._sample_weighted_subspace(n_features, rng)

            # Train tree
            X_subspace = X[:, feature_indices]
            tree = DecisionTreeClassifier(random_state=rng.randint(2**31))
            tree.fit(X_subspace, y)

            self.trees_.append({
                'tree': tree,
                'features': feature_indices
            })

        return self

    def predict(self, X):
        """Predict using weighted voting."""
        n_samples = X.shape[0]
        vote_counts = np.zeros((n_samples, len(self.classes_)))

        for tree_info in self.trees_:
            X_sub = X[:, tree_info['features']]
            preds = tree_info['tree'].predict_proba(X_sub)
            vote_counts += preds

        return self.classes_[np.argmax(vote_counts, axis=1)]


class StructuredSubspaceForest:
    """
    Subspace Forest for data with known feature structure.

    Example: image data where features are pixels, and spatial
    coherence means nearby pixels should be sampled together.
    """

    def __init__(
        self,
        n_estimators: int = 100,
        block_size: int = 10,
        n_blocks: int = 5,
        random_state: int = None
    ):
        """
        Args:
            n_estimators: Number of trees
            block_size: Size of each contiguous feature block
            n_blocks: Number of blocks per subspace
        """
        self.n_estimators = n_estimators
        self.block_size = block_size
        self.n_blocks = n_blocks
        self.random_state = random_state
        self.trees_ = []

    def _sample_structured_subspace(self, n_features, rng):
        """Sample contiguous blocks of features."""
        max_block_start = n_features - self.block_size
        if max_block_start <= 0:
            return np.arange(n_features)

        # Randomly select block starting positions
        block_starts = rng.choice(
            max_block_start,
            size=min(self.n_blocks, max_block_start),
            replace=False
        )

        # Collect all indices in selected blocks
        indices = []
        for start in block_starts:
            indices.extend(range(start, start + self.block_size))
        return np.array(sorted(set(indices)))

    def fit(self, X, y):
        """Fit with structured subspace sampling."""
        rng = np.random.RandomState(self.random_state)
        n_features = X.shape[1]
        self.trees_ = []
        self.classes_ = np.unique(y)

        for _ in range(self.n_estimators):
            features = self._sample_structured_subspace(n_features, rng)
            tree = DecisionTreeClassifier(random_state=rng.randint(2**31))
            tree.fit(X[:, features], y)
            self.trees_.append({'tree': tree, 'features': features})

        return self

    def predict(self, X):
        """Predict with structured subspaces."""
        votes = np.zeros((X.shape[0], len(self.classes_)))
        for info in self.trees_:
            proba = info['tree'].predict_proba(X[:, info['features']])
            votes += proba
        return self.classes_[np.argmax(votes, axis=1)]
```

Let's consolidate the essential knowledge about Subspace Forests and the Random Subspace Method: (1) RSM trains each tree on all samples but a fixed random subset of k < d features, chosen once per tree; (2) diversity comes from decorrelating tree errors across feature subspaces rather than from bootstrap sampling; (3) k ≈ 0.5d is a robust default, tuned down for redundant high-dimensional data and up when features are non-redundant; (4) the method shines when d ≫ n and feature redundancy is high; and (5) it composes with bootstrap sampling (Random Patches) for additional diversity.
You now have a comprehensive understanding of Subspace Forests and the Random Subspace Method. Next, we'll explore Oblique Random Forests—a variant that breaks the axis-aligned constraint by finding optimal linear combinations of features at each split.