What if we could combine the variance reduction benefits of bootstrap sampling with the diversity advantages of feature subsampling—and do so more aggressively than either Random Forests or Bagging? This is precisely the idea behind Random Patches, introduced by Gilles Louppe and Pierre Geurts in 2012.
Random Patches builds on a simple but powerful insight: by simultaneously subsampling both rows (instances) and columns (features) for each base estimator, we can create ensembles with remarkable diversity while maintaining computational efficiency. The name "patches" refers to the rectangular subsets of the data matrix that each tree sees—literally patches of the full data.
This approach generalizes both Bagging (which subsamples rows) and Random Subspace (which subsamples columns), offering a unified framework with greater flexibility and often improved performance.
By the end of this page, you will deeply understand: (1) The relationship between Random Patches and related methods, (2) How double subsampling affects bias-variance, (3) Optimal sampling strategies for different scenarios, (4) Memory and computational advantages, and (5) When Random Patches outperforms standard Random Forests.
Random Patches can be understood as a generalization that encompasses several well-known ensemble methods as special cases. Understanding these relationships clarifies when and why Random Patches offers advantages.
The Subsampling Framework:
Consider a training dataset as a matrix with $n$ rows (samples) and $d$ columns (features). Any ensemble method that trains individual models on subsets of this matrix can be characterized by two parameters: $\theta_{\text{sample}}$, the fraction of rows each base estimator sees, and $\theta_{\text{feature}}$, the fraction of columns it sees.
Different methods correspond to different choices of these parameters:
| Method | θ_sample | θ_feature | Description |
|---|---|---|---|
| Full Decision Tree | 1.0 | 1.0 | No subsampling; single tree on all data |
| Bagging | ~0.632 (with replacement) | 1.0 | Bootstrap rows, use all features |
| Random Subspace | 1.0 | sqrt(d)/d or similar | All samples, subset of features |
| Random Forest | ~0.632 (bootstrap) | sqrt(d)/d per split | Bootstrap + feature sampling at splits |
| Random Patches | < 1.0 | < 1.0 | Subsample both rows AND columns |
| Extra-Trees | 1.0 (no bootstrap) | sqrt(d)/d + random threshold | Random thresholds for diversity |
The Key Insight:
Random Patches recognizes that subsampling happens before tree construction, not just at each split. This means each tree's feature set is fixed up front, so every tree can be trained on a physically smaller data matrix.
In Random Forests, feature subsampling happens at each split—a tree can potentially use all features across its full depth. In Random Patches, each tree is restricted to a fixed feature subset from the start. This seemingly subtle difference has significant implications for diversity and computational efficiency.
Let's formalize the Random Patches approach and analyze its theoretical properties.
Formal Definition:
Given a training set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ with feature dimensionality $d$, a Random Patches ensemble constructs $T$ base estimators, where each estimator $h_t$ is trained on a subset:
$$\mathcal{D}_t = \{(x_{i,S_t^{\text{feat}}}, y_i) : i \in S_t^{\text{sample}}\}$$
where $S_t^{\text{sample}} \subseteq \{1, \dots, n\}$ is the sample-index subset drawn for estimator $t$ (of size $\approx \theta_{\text{sample}} \cdot n$), $S_t^{\text{feat}} \subseteq \{1, \dots, d\}$ is its feature-index subset (of size $\approx \theta_{\text{feat}} \cdot d$), and $x_{i,S_t^{\text{feat}}}$ denotes $x_i$ restricted to those features.
The ensemble prediction is: $$\hat{f}(x) = \frac{1}{T} \sum_{t=1}^{T} h_t(x_{S_t^{\text{feat}}})$$
Bias-Variance Analysis:
The expected error of the ensemble can be decomposed as:
$$\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Noise}$$
where: $$\text{Variance} \approx \frac{1}{T}\sigma^2 + \left(1 - \frac{1}{T}\right)\rho\sigma^2 = \rho\sigma^2 + \frac{(1-\rho)\sigma^2}{T}$$
Here, $\rho$ is the average pairwise correlation between tree predictions and $\sigma^2$ is the variance of individual trees.
Effect of Subsampling on Each Component:
Sample Subsampling ($\theta_{\text{sample}} < 1$): each tree sees fewer training points, which slightly raises the variance of individual trees but decorrelates them, lowering $\rho$.

Feature Subsampling ($\theta_{\text{feat}} < 1$): each tree may miss some informative features, which can raise bias slightly, but forces trees to exploit different feature sets, again lowering $\rho$.
Combined Effect:
The interaction is multiplicative—correlation drops faster than with either subsampling alone:
$$\rho_{\text{patches}} \approx \rho_{\text{bagging}} \cdot \rho_{\text{subspace}}$$
This multiplicative reduction in correlation is the key to Random Patches' effectiveness.
With double subsampling, you can achieve the same diversity (low ρ) as aggressive single-dimension subsampling, but with smaller increases in individual error. For example, 80% samples × 80% features gives similar diversity to 64% samples × 100% features, but often with lower bias.
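To see what the variance decomposition above implies numerically, here is a minimal sketch with assumed values for $\rho$ and $\sigma^2$ (illustrative numbers, not measurements from the text):

```python
# Ensemble variance from the decomposition above:
# rho*sigma^2 + (1 - rho)*sigma^2 / T
def ensemble_variance(rho: float, T: int, sigma2: float = 1.0) -> float:
    return rho * sigma2 + (1 - rho) * sigma2 / T

# Assumed correlations each mechanism would produce on its own:
rho_bagging = 0.6
rho_subspace = 0.5
rho_patches = rho_bagging * rho_subspace  # multiplicative approximation -> 0.30

T = 100
for name, rho in [("bagging", rho_bagging),
                  ("subspace", rho_subspace),
                  ("patches", rho_patches)]:
    print(f"{name:8s} rho={rho:.2f} -> ensemble variance "
          f"{ensemble_variance(rho, T):.3f}")

# As T grows, the variance approaches rho*sigma^2, so lowering the
# correlation floor (0.60 or 0.50 alone vs 0.30 combined) is what
# double subsampling buys.
```

With these assumed inputs, bagging alone bottoms out near $0.6\sigma^2$ and subspace alone near $0.5\sigma^2$, while the combined patches correlation gives a floor near $0.3\sigma^2$.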
Let's implement the Random Patches algorithm from scratch, highlighting its elegant simplicity.
```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.base import BaseEstimator, ClassifierMixin
from typing import List, Tuple, Optional, Union
from dataclasses import dataclass


@dataclass
class PatchModel:
    """Stores a tree and its associated data patch indices."""
    estimator: Union[DecisionTreeClassifier, DecisionTreeRegressor]
    sample_indices: np.ndarray
    feature_indices: np.ndarray


class RandomPatchesClassifier(BaseEstimator, ClassifierMixin):
    """
    Random Patches Ensemble Classifier.

    Combines sample subsampling and feature subsampling to create
    diverse ensembles of decision trees.

    Reference: Louppe, G., & Geurts, P. (2012). "Ensembles on Random Patches."
    """

    def __init__(
        self,
        n_estimators: int = 100,
        max_samples: float = 0.8,
        max_features: float = 0.8,
        bootstrap_samples: bool = True,
        bootstrap_features: bool = False,
        base_estimator: str = 'decision_tree',
        max_depth: Optional[int] = None,
        min_samples_leaf: int = 1,
        random_state: Optional[int] = None,
        n_jobs: int = 1
    ):
        """
        Initialize Random Patches Classifier.

        Args:
            n_estimators: Number of trees in the ensemble
            max_samples: Fraction (0, 1] or count of samples per tree
            max_features: Fraction (0, 1] or count of features per tree
            bootstrap_samples: If True, sample rows with replacement; else without
            bootstrap_features: If True, sample features with replacement
            base_estimator: Type of base estimator
            max_depth: Maximum depth of trees (None = unlimited)
            min_samples_leaf: Minimum samples per leaf
            random_state: Random seed
            n_jobs: Parallel jobs (-1 for all cores)
        """
        self.n_estimators = n_estimators
        self.max_samples = max_samples
        self.max_features = max_features
        self.bootstrap_samples = bootstrap_samples
        self.bootstrap_features = bootstrap_features
        self.base_estimator = base_estimator
        self.max_depth = max_depth
        self.min_samples_leaf = min_samples_leaf
        self.random_state = random_state
        self.n_jobs = n_jobs

        self.estimators_: List[PatchModel] = []
        self.classes_ = None
        self.n_features_in_ = None
        self.n_samples_in_ = None
        self.rng_ = None

    def _get_sample_size(self, n_samples: int) -> int:
        """Compute number of samples per estimator."""
        if isinstance(self.max_samples, float):
            return max(1, int(self.max_samples * n_samples))
        return min(self.max_samples, n_samples)

    def _get_feature_size(self, n_features: int) -> int:
        """Compute number of features per estimator."""
        if isinstance(self.max_features, float):
            return max(1, int(self.max_features * n_features))
        return min(self.max_features, n_features)

    def _draw_patch(
        self,
        n_samples: int,
        n_features: int
    ) -> Tuple[np.ndarray, np.ndarray]:
        """
        Draw a random patch (sample indices, feature indices).

        This is the core of Random Patches: simultaneously sampling
        both dimensions of the data matrix.
        """
        n_samples_patch = self._get_sample_size(n_samples)
        sample_indices = self.rng_.choice(
            n_samples, size=n_samples_patch, replace=self.bootstrap_samples
        )

        n_features_patch = self._get_feature_size(n_features)
        feature_indices = self.rng_.choice(
            n_features, size=n_features_patch, replace=self.bootstrap_features
        )

        return sample_indices, feature_indices

    def _train_single_estimator(
        self,
        X: np.ndarray,
        y: np.ndarray,
        sample_indices: np.ndarray,
        feature_indices: np.ndarray
    ) -> PatchModel:
        """Train a single estimator on a data patch."""
        # Extract the patch
        X_patch = X[sample_indices][:, feature_indices]
        y_patch = y[sample_indices]

        # Create and train estimator
        estimator = DecisionTreeClassifier(
            max_depth=self.max_depth,
            min_samples_leaf=self.min_samples_leaf,
            random_state=self.rng_.randint(0, 2**31)
        )
        estimator.fit(X_patch, y_patch)

        return PatchModel(
            estimator=estimator,
            sample_indices=sample_indices,
            feature_indices=feature_indices
        )

    def fit(self, X: np.ndarray, y: np.ndarray) -> 'RandomPatchesClassifier':
        """
        Fit the Random Patches ensemble.

        For each estimator:
        1. Draw a random patch (sample + feature subset)
        2. Train a decision tree on the patch
        3. Store the tree with its patch indices
        """
        self.rng_ = np.random.RandomState(self.random_state)
        n_samples, n_features = X.shape
        self.n_samples_in_ = n_samples
        self.n_features_in_ = n_features
        self.classes_ = np.unique(y)
        self.estimators_ = []

        for _ in range(self.n_estimators):
            # Draw random patch
            sample_indices, feature_indices = self._draw_patch(n_samples, n_features)

            # Train estimator on patch
            patch_model = self._train_single_estimator(
                X, y, sample_indices, feature_indices
            )
            self.estimators_.append(patch_model)

        return self

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """
        Predict class probabilities.

        For each estimator, use only the features that were in its patch.
        """
        n_samples = X.shape[0]
        n_classes = len(self.classes_)
        probas = np.zeros((n_samples, n_classes))

        for patch_model in self.estimators_:
            # Extract relevant features for this estimator
            X_subset = X[:, patch_model.feature_indices]

            # Get predictions
            proba = patch_model.estimator.predict_proba(X_subset)

            # Handle case where some classes weren't seen during training
            tree_classes = patch_model.estimator.classes_
            for i, cls in enumerate(tree_classes):
                cls_idx = np.where(self.classes_ == cls)[0][0]
                probas[:, cls_idx] += proba[:, i]

        # Average across estimators
        probas /= len(self.estimators_)
        return probas

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict class labels."""
        proba = self.predict_proba(X)
        return self.classes_[np.argmax(proba, axis=1)]

    def get_patch_statistics(self) -> dict:
        """Get statistics about the patches used."""
        n_samples_list = [len(pm.sample_indices) for pm in self.estimators_]
        n_features_list = [len(pm.feature_indices) for pm in self.estimators_]

        return {
            'avg_samples_per_tree': np.mean(n_samples_list),
            'avg_features_per_tree': np.mean(n_features_list),
            # Coverage: fraction of rows/columns used by at least one tree
            'sample_coverage': len(set.union(
                *[set(pm.sample_indices) for pm in self.estimators_]
            )) / self.n_samples_in_,
            'feature_coverage': len(set.union(
                *[set(pm.feature_indices) for pm in self.estimators_]
            )) / self.n_features_in_,
        }


# Example usage and comparison
if __name__ == "__main__":
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.ensemble import RandomForestClassifier, BaggingClassifier

    # Generate dataset
    X, y = make_classification(
        n_samples=1000, n_features=50, n_informative=25,
        n_redundant=10, random_state=42
    )

    # Compare methods
    models = {
        'Random Patches (80%/80%)': RandomPatchesClassifier(
            n_estimators=100, max_samples=0.8, max_features=0.8, random_state=42
        ),
        'Random Patches (60%/60%)': RandomPatchesClassifier(
            n_estimators=100, max_samples=0.6, max_features=0.6, random_state=42
        ),
        'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
        'Bagging': BaggingClassifier(n_estimators=100, random_state=42),
    }

    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name:30s}: {scores.mean():.4f} (+/- {scores.std():.4f})")
```

Choosing the right sampling fractions ($\theta_{\text{sample}}$, $\theta_{\text{feat}}$) is crucial for Random Patches performance. Let's explore the design space and practical guidelines.
| θ_sample | θ_feat | Individual Error | Diversity | Best Use Case |
|---|---|---|---|---|
| High (>0.9) | High (>0.9) | Low | Low | When base trees are stable |
| High (>0.9) | Low (<0.5) | Moderate (bias) | High | Many irrelevant features |
| Low (<0.5) | High (>0.9) | Moderate (variance) | High | Large sample size, few features |
| Moderate (0.7-0.9) | Moderate (0.7-0.9) | Balanced | Balanced | General purpose (recommended) |
| Low (<0.5) | Low (<0.5) | High | Very High | Very large data, computational limits |
Design Principles:
1. Balance Individual Error and Diversity:
The optimal point is where the diversity gain from subsampling exactly balances the increased individual error. This typically occurs when:
$$0.5 \;\lesssim\; \theta_{\text{sample}} \cdot \theta_{\text{feat}} \;\lesssim\; 0.7$$
For example, (0.8, 0.8), (0.9, 0.7), or (0.7, 0.85) are all reasonable choices.
2. Consider Data Characteristics: with many irrelevant or redundant features, lower $\theta_{\text{feat}}$ so trees are forced onto different feature subsets; with small sample sizes, keep $\theta_{\text{sample}}$ high so individual trees remain accurate.
3. Computational Constraints:
Patch size directly affects training time. With $\theta_{\text{sample}} = \theta_{\text{feat}} = 0.5$, each tree sees only 25% of the data matrix, providing ~4x speedup.
Start with max_samples=0.8 and max_features=0.8. If accuracy is too low, increase both. If computational resources are limited or you need more diversity, decrease both. Generally, keep both above 0.5 unless you have specific reasons (extreme high dimensionality, very large data).
```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt


def search_optimal_sampling(X, y, sample_fracs, feature_fracs, n_estimators=50, cv=5):
    """
    Grid search over sampling fractions to find optimal configuration.

    Returns a 2D array of cross-validation scores.
    """
    results = np.zeros((len(sample_fracs), len(feature_fracs)))

    for i, sample_frac in enumerate(sample_fracs):
        for j, feature_frac in enumerate(feature_fracs):
            # Use BaggingClassifier with max_samples and max_features
            model = BaggingClassifier(
                estimator=DecisionTreeClassifier(),
                n_estimators=n_estimators,
                max_samples=sample_frac,
                max_features=feature_frac,
                bootstrap=True,
                bootstrap_features=False,
                random_state=42,
                n_jobs=-1
            )
            scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
            results[i, j] = scores.mean()
            print(f"θ_sample={sample_frac:.1f}, θ_feat={feature_frac:.1f}: "
                  f"{scores.mean():.4f}")

    return results


def plot_sampling_heatmap(results, sample_fracs, feature_fracs):
    """Visualize the performance landscape."""
    plt.figure(figsize=(10, 8))
    plt.imshow(results, cmap='RdYlGn', interpolation='nearest',
               origin='lower', aspect='auto')
    plt.colorbar(label='CV Accuracy')

    # Add labels
    plt.xticks(range(len(feature_fracs)), [f'{f:.1f}' for f in feature_fracs])
    plt.yticks(range(len(sample_fracs)), [f'{s:.1f}' for s in sample_fracs])
    plt.xlabel('Feature Sampling Fraction (θ_feat)')
    plt.ylabel('Sample Sampling Fraction (θ_sample)')
    plt.title('Random Patches Performance Landscape')

    # Mark optimal point
    best_idx = np.unravel_index(np.argmax(results), results.shape)
    plt.scatter(best_idx[1], best_idx[0], marker='*', s=300, c='black',
                label=f'Best: ({sample_fracs[best_idx[0]]:.1f}, '
                      f'{feature_fracs[best_idx[1]]:.1f})')
    plt.legend()
    plt.tight_layout()
    return plt.gcf()


# Example usage:
# sample_fracs = [0.3, 0.5, 0.7, 0.8, 0.9, 1.0]
# feature_fracs = [0.3, 0.5, 0.7, 0.8, 0.9, 1.0]
# results = search_optimal_sampling(X, y, sample_fracs, feature_fracs)
# plot_sampling_heatmap(results, sample_fracs, feature_fracs)
```

One of the most compelling practical advantages of Random Patches is its computational efficiency, especially for large-scale datasets.
| Method | Samples per Tree | Feature Sampling | Memory per Tree | Training Time per Tree |
|---|---|---|---|---|
| Full Tree | n | d | O(n·d) | O(n·d·log n) |
| Bagging | ~0.63n | d | O(n·d) | O(n·d·log n) |
| Random Subspace | n | ~√d | O(n·d) | O(n·√d·log n) |
| Random Forest | ~0.63n | ~√d per split | O(n·d) | O(n·√d·log n) |
| Random Patches (0.7, 0.7) | 0.7n | 0.7d entire tree | O(0.49·n·d) | O(0.49·n·d·log n) |
| Random Patches (0.5, 0.5) | 0.5n | 0.5d entire tree | O(0.25·n·d) | O(0.25·n·d·log n) |
Key Efficiency Insights:
1. Memory Reduction:
With $\theta_{\text{sample}} = \theta_{\text{feat}} = 0.5$, each tree requires only 25% of the memory needed for a full tree. This is transformative for memory-constrained environments, for datasets that barely fit in RAM, and for training many trees in parallel on a single machine.
2. Training Speed:
Training complexity scales with patch size: $$\text{Time per tree} \propto (\theta_{\text{sample}} \cdot n) \cdot (\theta_{\text{feat}} \cdot d) \cdot \log(\theta_{\text{sample}} \cdot n)$$
With 50% subsampling on each dimension, training is roughly 4-5x faster per tree.
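The proportionality above can be evaluated directly. This sketch (dataset dimensions are assumed for illustration) computes the cost of one patch tree relative to one full tree:

```python
import math


def relative_cost(theta_s: float, theta_f: float, n: int, d: int) -> float:
    """Training cost of a patch tree relative to a full tree, using the
    proportionality (theta_s*n) * (theta_f*d) * log(theta_s*n)."""
    full = n * d * math.log(n)
    patch = (theta_s * n) * (theta_f * d) * math.log(theta_s * n)
    return patch / full


n, d = 100_000, 200  # assumed dataset size for illustration
print(f"{relative_cost(0.5, 0.5, n, d):.3f}")  # 0.235 -> roughly 4-5x faster
print(f"{relative_cost(0.7, 0.7, n, d):.3f}")  # 0.475 -> roughly 2x faster
```

Note the cost ratio is slightly *below* the naive patch-area product (0.25 and 0.49) because the logarithmic depth factor also shrinks with the sample count.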
3. Parallelization:
Smaller patches mean lower memory per worker, better cache locality during tree construction, and cheaper data transfer when distributing training across processes or machines.
Random Patches offers a unique scalability advantage: you can train MORE trees in the same time/memory budget, often compensating for the slightly higher error per tree. An ensemble of 500 Random Patches trees with 50% subsampling can outperform 100 Random Forest trees while training in similar time.
```python
import time

from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def compare_memory_and_time(X, y, configs):
    """
    Compare memory usage and training time for different configurations.

    configs: list of (max_samples, max_features, n_estimators) tuples
    """
    results = []

    for max_samples, max_features, n_estimators in configs:
        model = BaggingClassifier(
            estimator=DecisionTreeClassifier(max_depth=20),
            n_estimators=n_estimators,
            max_samples=max_samples,
            max_features=max_features,
            bootstrap=True,
            n_jobs=1,  # Single threaded for fair comparison
            random_state=42
        )

        # Measure training time
        start = time.time()
        model.fit(X, y)
        train_time = time.time() - start

        # Estimate memory (simplified - actual measurement requires memory_profiler)
        samples_per_tree = int(max_samples * len(y)) if max_samples <= 1 else max_samples
        features_per_tree = int(max_features * X.shape[1]) if max_features <= 1 else max_features
        relative_memory = (samples_per_tree * features_per_tree) / (len(y) * X.shape[1])

        # Measure accuracy
        scores = cross_val_score(model, X, y, cv=3)

        results.append({
            'config': f'({max_samples}, {max_features}) x {n_estimators}',
            'train_time': train_time,
            'relative_memory': relative_memory,
            'accuracy': scores.mean(),
            'n_estimators': n_estimators
        })

        print(f"Config {max_samples:.1f}/{max_features:.1f} x {n_estimators}: "
              f"Time={train_time:.2f}s, RelMem={relative_memory:.2f}, "
              f"Acc={scores.mean():.4f}")

    return results


# Example: Compare equal-time configurations
# In the same training time as 100 full trees, we can train ~400 half-size patches
configs = [
    (1.0, 1.0, 100),  # Full bagging, 100 trees
    (0.5, 0.5, 400),  # Random Patches, 4x more trees
    (0.7, 0.7, 200),  # Moderate patches, 2x trees
    (0.8, 0.8, 150),  # Light patches, 1.5x trees
]

# Conceptual output:
# (1.0, 1.0) x 100: Time=12.34s, RelMem=1.00, Acc=0.8756
# (0.5, 0.5) x 400: Time=12.45s, RelMem=0.25, Acc=0.8892  <- Often better!
# (0.7, 0.7) x 200: Time=11.89s, RelMem=0.49, Acc=0.8834
# (0.8, 0.8) x 150: Time=12.12s, RelMem=0.64, Acc=0.8801
```

Random Patches excels in specific scenarios. Understanding when to choose this method is key to effective ensemble learning.
Decision Framework:
Is data very large (n > 100K or d > 1000)?
├── Yes → Consider Random Patches for efficiency
│ ├── Memory constrained? → Aggressive patches (0.5, 0.5)
│ └── Time constrained? → Moderate patches (0.7, 0.7)
└── No → Random Patches optional
├── Many redundant features? → Try feature subsampling
├── Need high diversity? → Try Random Patches
└── Otherwise → Standard Random Forest may suffice
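The decision framework above can be condensed into a small helper. This is a hypothetical sketch (the function name, signature, and thresholds are mine, taken from the framework), not a library API:

```python
def suggest_patch_config(n_samples: int, n_features: int,
                         memory_constrained: bool = False,
                         time_constrained: bool = False):
    """Suggest (max_samples, max_features) following the decision
    framework above. Thresholds are the rules of thumb from the text."""
    large = n_samples > 100_000 or n_features > 1000
    if large:
        if memory_constrained:
            return (0.5, 0.5)  # aggressive patches
        return (0.7, 0.7)      # moderate patches (also the time-constrained pick)
    # Smaller data: patches optional; general-purpose starting point
    return (0.8, 0.8)


print(suggest_patch_config(500_000, 300, memory_constrained=True))  # (0.5, 0.5)
print(suggest_patch_config(5_000, 50))                              # (0.8, 0.8)
```

Treat the returned pair as a starting point for cross-validation, not a final answer.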
While these guidelines are based on theoretical analysis and empirical studies, the optimal approach depends on your specific data. Always validate with cross-validation on your actual problem. Random Patches is particularly worth trying when standard approaches hit computational limits.
Scikit-Learn's BaggingClassifier and BaggingRegressor directly support Random Patches through their max_samples and max_features parameters. Here's how to use them effectively.
```python
from sklearn.ensemble import BaggingClassifier, BaggingRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.datasets import make_classification


# ============================================
# Classification with Random Patches
# ============================================

def random_patches_classifier(
    n_estimators: int = 100,
    max_samples: float = 0.8,
    max_features: float = 0.8,
    max_depth: int = None,
    bootstrap: bool = True,
    random_state: int = None
) -> BaggingClassifier:
    """
    Create a Random Patches classifier using sklearn's BaggingClassifier.

    Key parameters:
    - max_samples: Fraction of samples per tree (row dimension)
    - max_features: Fraction of features per tree (column dimension)
    - bootstrap: Whether to sample rows with replacement
    """
    base_tree = DecisionTreeClassifier(
        max_depth=max_depth,
        random_state=random_state
    )

    return BaggingClassifier(
        estimator=base_tree,
        n_estimators=n_estimators,
        max_samples=max_samples,
        max_features=max_features,
        bootstrap=bootstrap,
        bootstrap_features=False,  # Usually sample features without replacement
        random_state=random_state,
        n_jobs=-1  # Use all cores
    )


# ============================================
# Regression with Random Patches
# ============================================

def random_patches_regressor(
    n_estimators: int = 100,
    max_samples: float = 0.8,
    max_features: float = 0.8,
    max_depth: int = None,
    bootstrap: bool = True,
    random_state: int = None
) -> BaggingRegressor:
    """Create a Random Patches regressor."""
    base_tree = DecisionTreeRegressor(
        max_depth=max_depth,
        random_state=random_state
    )

    return BaggingRegressor(
        estimator=base_tree,
        n_estimators=n_estimators,
        max_samples=max_samples,
        max_features=max_features,
        bootstrap=bootstrap,
        bootstrap_features=False,
        random_state=random_state,
        n_jobs=-1
    )


# ============================================
# Hyperparameter Tuning
# ============================================

def tune_random_patches(X, y, task='classification'):
    """
    Hyperparameter tuning specifically for Random Patches.

    Key insight: the product max_samples * max_features determines
    the diversity vs accuracy tradeoff.
    """
    if task == 'classification':
        estimator = BaggingClassifier(
            estimator=DecisionTreeClassifier(),
            n_jobs=-1,
            random_state=42
        )
        scoring = 'accuracy'
    else:
        estimator = BaggingRegressor(
            estimator=DecisionTreeRegressor(),
            n_jobs=-1,
            random_state=42
        )
        scoring = 'neg_mean_squared_error'

    param_grid = {
        'n_estimators': [50, 100, 200],
        'max_samples': [0.5, 0.7, 0.8, 0.9],
        'max_features': [0.5, 0.7, 0.8, 0.9],
        'bootstrap': [True, False],
    }

    grid_search = GridSearchCV(
        estimator, param_grid, cv=5, scoring=scoring, n_jobs=-1, verbose=1
    )
    grid_search.fit(X, y)

    print(f"Best parameters: {grid_search.best_params_}")
    print(f"Best score: {grid_search.best_score_:.4f}")
    return grid_search.best_estimator_


# ============================================
# Example: Large-Scale Data
# ============================================

def large_scale_example():
    """
    Demonstrate Random Patches on a large dataset
    where computational efficiency matters.
    """
    # Simulate large dataset
    print("Generating large dataset...")
    X, y = make_classification(
        n_samples=100000, n_features=200, n_informative=50,
        n_redundant=100, random_state=42
    )

    # Standard approach: fewer trees, full data
    from sklearn.ensemble import RandomForestClassifier
    import time

    print("Training Standard Random Forest...")
    start = time.time()
    rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
    rf.fit(X, y)
    rf_time = time.time() - start
    rf_score = cross_val_score(rf, X, y, cv=3).mean()
    print(f"  Time: {rf_time:.2f}s, Accuracy: {rf_score:.4f}")

    # Random Patches: more trees, smaller patches
    print("Training Random Patches (0.5, 0.5, 400 trees)...")
    start = time.time()
    rp = random_patches_classifier(
        n_estimators=400, max_samples=0.5, max_features=0.5, random_state=42
    )
    rp.fit(X, y)
    rp_time = time.time() - start
    rp_score = cross_val_score(rp, X, y, cv=3).mean()
    print(f"  Time: {rp_time:.2f}s, Accuracy: {rp_score:.4f}")

    print(f"Random Patches speedup: {rf_time/rp_time:.2f}x")


if __name__ == "__main__":
    large_scale_example()
```

Let's consolidate the essential knowledge about Random Patches:
- Use scikit-learn's BaggingClassifier with the max_samples and max_features parameters for production-ready Random Patches

You now have a comprehensive understanding of Random Patches, from its theoretical foundations to practical implementation. Next, we'll explore Subspace Forests—a method that focuses specifically on feature subspace sampling without sample subsampling.