What if instead of randomly selecting features, we could transform the entire feature space for each tree in an ensemble? This seemingly radical idea is the foundation of Rotation Forests, introduced by Juan J. Rodriguez, Ludmila I. Kuncheva, and Carlos J. Alonso in 2006.
Rotation Forests achieve ensemble diversity not by restricting which features each tree can see, but by presenting each tree with a rotated view of the feature space. Using Principal Component Analysis (PCA) on random feature subsets, each tree learns in a unique coordinate system—seeing the same data from a fundamentally different geometric perspective.
The result is an ensemble with remarkably high diversity while maintaining the ability of each tree to use all original features—often achieving accuracy that rivals or exceeds both Random Forests and Gradient Boosting.
By the end of this page, you will deeply understand: (1) The geometric intuition behind feature space rotation, (2) How PCA creates diverse coordinate systems, (3) The complete Rotation Forest algorithm, (4) Why rotation preserves information while creating diversity, and (5) When Rotation Forests outperform other ensemble methods.
To understand Rotation Forests, we must first appreciate a fundamental limitation of axis-aligned decision trees.
The Axis-Aligned Constraint:
Standard decision trees make splits that are perpendicular to feature axes. Each split divides space along a single feature dimension:
This means decision boundaries are always axis-aligned rectangles (in 2D) or hyper-rectangles (in higher dimensions). For data with diagonal or oblique decision boundaries, axis-aligned trees require many splits to approximate the true boundary.
The Rotation Solution:
What if we rotate the feature space before training? If the data is rotated such that the optimal decision boundary becomes axis-aligned, a single split can capture what would otherwise require many.
Rotation Forests implement this idea systematically: each tree receives a differently rotated version of the feature space, enabling diverse geometric perspectives while preserving all feature information.
Imagine classifying points separated by a diagonal line y = x. An axis-aligned tree needs many stair-step splits to approximate this boundary. But if we rotate the space by 45°, the boundary becomes horizontal—capturable with a single split! Rotation Forests give each tree a different rotation, ensuring at least some trees get 'easy' views of complex boundaries.
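This intuition is easy to verify with a small sketch (synthetic data and a 45° rotation chosen by hand, purely for illustration): a depth-1 tree cannot capture a diagonal boundary in the original coordinates, but separates it perfectly after rotation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 2))
y = (X[:, 1] > X[:, 0]).astype(int)  # diagonal boundary y = x

# Rotate the feature space by 45 degrees
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X_rot = X @ R  # second rotated coordinate is (y - x) / sqrt(2)

stump_orig = DecisionTreeClassifier(max_depth=1).fit(X, y)
stump_rot = DecisionTreeClassifier(max_depth=1).fit(X_rot, y)

acc_orig = stump_orig.score(X, y)
acc_rot = stump_rot.score(X_rot, y)
# The rotated stump separates the classes perfectly (accuracy 1.0);
# the axis-aligned stump cannot
print(f"original: {acc_orig:.3f}, rotated: {acc_rot:.3f}")
```

A single axis-aligned threshold in the rotated space is exactly the diagonal boundary in the original space, which is the effect Rotation Forests exploit.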
Rotation Forests use Principal Component Analysis (PCA) to generate rotation matrices. Understanding PCA is essential to understanding how Rotation Forests achieve their unique diversity.
PCA in Brief:
PCA finds a new coordinate system in which: (1) the axes (principal components) are mutually orthogonal, (2) the first axis points in the direction of maximum variance, and (3) each subsequent axis captures the most remaining variance.
Key Property for Rotation Forests:
PCA transformation preserves all information—it's an orthogonal transformation. No data is lost; features are merely expressed in a new coordinate system.
$$X_{\text{rotated}} = X \cdot R$$
where $R$ is an orthogonal rotation matrix ($R^T R = I$) derived from the principal components of some data subset.
```python
import numpy as np
from sklearn.decomposition import PCA


def demonstrate_pca_rotation(X: np.ndarray) -> tuple:
    """
    Demonstrate how PCA creates a rotation matrix.

    Key insight: PCA components form an orthogonal basis,
    which defines a rotation of the original space.
    """
    # Fit PCA
    pca = PCA()
    pca.fit(X)

    # The rotation matrix is the matrix of principal components
    rotation_matrix = pca.components_.T  # Shape: (n_features, n_components)

    # Verify it's orthogonal (R^T @ R = I)
    identity_check = rotation_matrix.T @ rotation_matrix
    is_orthogonal = np.allclose(identity_check, np.eye(len(identity_check)))

    print(f"Rotation matrix shape: {rotation_matrix.shape}")
    print(f"Is orthogonal: {is_orthogonal}")

    # Apply rotation
    X_rotated = X @ rotation_matrix

    # Verify distances are preserved (rotation property)
    original_distance = np.linalg.norm(X[0] - X[1])
    rotated_distance = np.linalg.norm(X_rotated[0] - X_rotated[1])
    distances_preserved = np.isclose(original_distance, rotated_distance)
    print(f"Distances preserved: {distances_preserved}")

    return rotation_matrix, X_rotated


# Example
np.random.seed(42)
X_example = np.random.randn(100, 5)
rotation_matrix, X_rotated = demonstrate_pca_rotation(X_example)

# Output:
# Rotation matrix shape: (5, 5)
# Is orthogonal: True
# Distances preserved: True
```

Why PCA Creates Diversity:
The key insight is that PCA applied to different subsets of data (or features) produces different rotation matrices. In Rotation Forests: (1) each tree receives its own random partition of the features, (2) PCA is fit on a bootstrap sample of each feature subset, and (3) the resulting components are assembled into that tree's rotation matrix.
Because each tree sees data through a different rotation, trees make different errors—exactly what we need for effective ensemble averaging.
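This effect is easy to see directly. The sketch below (using a 0.75 bootstrap fraction, mirroring the algorithm's typical default) fits PCA on two different bootstrap samples of the same data and checks that the resulting rotations differ while remaining orthogonal:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(200, 6)


def rotation_from_bootstrap(X, rng, rate=0.75):
    """Fit PCA on a bootstrap sample and return the rotation matrix."""
    idx = rng.choice(len(X), size=int(rate * len(X)), replace=True)
    return PCA().fit(X[idx]).components_.T


R1 = rotation_from_bootstrap(X, rng)
R2 = rotation_from_bootstrap(X, rng)

# Each bootstrap sample produces a different (but still orthogonal) rotation
print("Identical rotations:", np.allclose(R1, R2))
print("R1 orthogonal:", np.allclose(R1.T @ R1, np.eye(6)))
```

Two trees given these two rotations would see genuinely different coordinate systems, yet neither loses any information about the original features.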
Unlike Random Forests where each tree sees only a subset of features, Rotation Forest trees see ALL features—just in a transformed coordinate system. This means individual trees can be more accurate (lower bias), while the rotation provides diversity (enabling variance reduction through averaging).
Let's walk through the complete Rotation Forest algorithm step by step, with all the details necessary for implementation.
Algorithm Overview:
For each tree in the ensemble: (1) randomly partition the features into K disjoint subsets, (2) apply PCA to a bootstrap sample of each subset, (3) assemble the per-subset components into a full rotation matrix, and (4) rotate the training data and fit a decision tree on it.
For prediction: rotate the query points with each tree's rotation matrix, collect that tree's class probabilities, and average across all trees.
```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from typing import List, Optional
from dataclasses import dataclass


@dataclass
class RotationTreeModel:
    """Stores a single tree and its associated rotation matrix."""
    rotation_matrix: np.ndarray
    tree: DecisionTreeClassifier
    feature_partition: List[List[int]]


# Inheriting from BaseEstimator/ClassifierMixin lets sklearn utilities
# such as cross_val_score clone this estimator
class RotationForest(BaseEstimator, ClassifierMixin):
    """
    Rotation Forest Classifier.

    Original algorithm by Rodriguez, Kuncheva, and Alonso (2006).

    Key idea: Each tree is trained on a differently rotated version
    of the feature space. Rotation is achieved via PCA on random
    feature subsets, creating diverse trees that see all features.
    """

    def __init__(
        self,
        n_estimators: int = 30,
        n_feature_subsets: int = 3,
        bootstrap_rate: float = 0.75,
        min_samples_leaf: int = 1,
        random_state: Optional[int] = None
    ):
        """
        Initialize Rotation Forest.

        Args:
            n_estimators: Number of trees in the forest
            n_feature_subsets: Number of feature groups for PCA (K)
            bootstrap_rate: Fraction of samples for PCA bootstrap (0 to disable)
            min_samples_leaf: Minimum samples per leaf
            random_state: Random seed for reproducibility
        """
        self.n_estimators = n_estimators
        self.n_feature_subsets = n_feature_subsets
        self.bootstrap_rate = bootstrap_rate
        self.min_samples_leaf = min_samples_leaf
        self.random_state = random_state

    def _partition_features(self, n_features: int, rng) -> List[List[int]]:
        """
        Randomly partition features into K disjoint subsets.

        This is a key step: different partitions lead to different
        rotation matrices, creating tree diversity.
        """
        # Shuffle feature indices, then divide into K roughly equal subsets
        feature_indices = rng.permutation(n_features)
        k = min(self.n_feature_subsets, n_features)
        return [list(subset) for subset in np.array_split(feature_indices, k)]

    def _build_rotation_matrix(
        self,
        X: np.ndarray,
        feature_partition: List[List[int]],
        rng
    ) -> np.ndarray:
        """
        Construct the rotation matrix for one tree.

        Algorithm:
        1. For each feature subset, apply PCA (optionally on a bootstrap sample)
        2. Each PCA yields a local rotation for those features
        3. Combine local rotations into a block-structured global matrix

        The resulting matrix R has shape (n_features, n_features).
        """
        n_features = X.shape[1]
        rotation_matrix = np.zeros((n_features, n_features))

        for feature_indices in feature_partition:
            if len(feature_indices) == 0:
                continue

            # Extract subset of features
            X_subset = X[:, feature_indices]

            # Optional bootstrap sampling for PCA (adds diversity)
            if self.bootstrap_rate > 0:
                n_samples = max(2, int(self.bootstrap_rate * X.shape[0]))
                sample_indices = rng.choice(X.shape[0], size=n_samples, replace=True)
                X_subset = X_subset[sample_indices]

            # Apply PCA to get the local rotation for this subset
            pca = PCA(n_components=len(feature_indices))
            pca.fit(X_subset)

            # Place rotation coefficients in the block for this subset.
            # components_ has shape (n_components, n_subset_features);
            # its transpose maps original subset features to rotated ones.
            for i, feat_idx in enumerate(feature_indices):
                rotation_matrix[feat_idx, feature_indices] = pca.components_[:, i]

        return rotation_matrix

    def fit(self, X: np.ndarray, y: np.ndarray) -> 'RotationForest':
        """
        Fit the Rotation Forest.

        For each tree:
        1. Generate random feature partition
        2. Build rotation matrix via subset PCA
        3. Transform data using rotation matrix
        4. Train decision tree on rotated data
        """
        rng = np.random.RandomState(self.random_state)
        self.n_features_ = X.shape[1]
        self.classes_ = np.unique(y)
        self.trees_: List[RotationTreeModel] = []

        for _ in range(self.n_estimators):
            # Steps 1-3: partition features, build rotation, transform data
            feature_partition = self._partition_features(self.n_features_, rng)
            rotation_matrix = self._build_rotation_matrix(X, feature_partition, rng)
            X_rotated = X @ rotation_matrix

            # Step 4: train a decision tree on the rotated data
            tree = DecisionTreeClassifier(
                min_samples_leaf=self.min_samples_leaf,
                random_state=rng.randint(0, 2**31)
            )
            tree.fit(X_rotated, y)

            # Store the tree with its rotation matrix
            self.trees_.append(RotationTreeModel(
                rotation_matrix=rotation_matrix,
                tree=tree,
                feature_partition=feature_partition
            ))

        return self

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """
        Predict class probabilities.

        For each tree, rotate the test data using that tree's
        rotation matrix, then average predictions.
        """
        probabilities = np.zeros((X.shape[0], len(self.classes_)))

        for model in self.trees_:
            # Rotate test data with this tree's rotation matrix
            X_rotated = X @ model.rotation_matrix
            probabilities += model.tree.predict_proba(X_rotated)

        # Average across trees
        return probabilities / len(self.trees_)

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict class labels."""
        proba = self.predict_proba(X)
        return self.classes_[np.argmax(proba, axis=1)]


# Usage example
if __name__ == "__main__":
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score

    # Generate a dataset with complex class structure
    X, y = make_classification(
        n_samples=1000,
        n_features=10,
        n_informative=5,
        n_clusters_per_class=2,
        random_state=42
    )

    # Train and cross-validate a Rotation Forest
    rf = RotationForest(n_estimators=30, n_feature_subsets=3, random_state=42)
    scores = cross_val_score(rf, X, y, cv=5)
    print(f"Rotation Forest CV Accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")
```

Let's formalize the mathematical properties that make Rotation Forests effective.
Rotation Matrix Properties:
For each tree $t$, we construct a rotation matrix $R^{(t)} \in \mathbb{R}^{d \times d}$ with the following properties: (1) orthogonality, $R^{(t)\top} R^{(t)} = I$, so distances and all feature information are preserved, (2) a block structure determined by the tree's random feature partition, and (3) invertibility, so the original features can always be recovered from the rotated ones.
Diversity Analysis:
The diversity of Rotation Forest trees comes from two sources: (1) the random partition of features into subsets, which changes which features are rotated together, and (2) bootstrap sampling before each PCA, which perturbs the principal components themselves.
Let $\mathcal{P}$ be the distribution of feature partitions and $\mathcal{D}$ be the data distribution. The expected pairwise correlation between tree predictions is:
$$\mathbb{E}[\text{corr}(h_i(x), h_j(x))] < \mathbb{E}_{\text{RF}}[\text{corr}(h_i(x), h_j(x))]$$
under most conditions, because rotation creates more diverse decision boundaries than feature subsampling alone.
| Ensemble Method | Diversity Mechanism | Information per Tree | Theoretical Basis |
|---|---|---|---|
| Bagging | Bootstrap sampling | ~63.2% of samples | Variance reduction via sample perturbation |
| Random Forest | Feature subsampling | sqrt(d) features per split | Feature perturbation + bagging |
| Extra-Trees | Random thresholds | All features, random splits | Threshold perturbation |
| Rotation Forest | Feature space rotation | All features, all splits | Geometric perturbation via PCA |
Why Rotation Yields Lower Correlation:
Consider two trees with rotation matrices $R_1$ and $R_2$. A decision boundary in tree 1 at: $$R_1 x \cdot e_j = c_1$$
corresponds to the hyperplane: $$x \cdot (R_1^T e_j) = c_1$$
in the original space. This is generally oblique (not axis-aligned). For tree 2 with rotation $R_2$, the boundary: $$R_2 x \cdot e_k = c_2$$
corresponds to a different oblique hyperplane. Because $R_1 \neq R_2$, these hyperplanes point in different directions, leading to different partitions of the feature space and thus lower prediction correlation.
While each tree makes axis-aligned splits in its rotated space, these correspond to oblique (diagonal) boundaries in the original space. This gives Rotation Forests the expressive power of oblique decision trees without the computational cost of searching for oblique splits.
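A quick numeric check of this correspondence (random data and an arbitrary orthogonal matrix built via QR, purely for illustration): an axis-aligned threshold on a rotated feature selects exactly the same points as an oblique hyperplane in the original space.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3))

# Any orthogonal matrix works as a rotation; build one via QR decomposition
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
X_rot = X @ Q

# An axis-aligned split on rotated feature j at threshold c ...
j, c = 1, 0.2
split_rotated = X_rot[:, j] <= c

# ... is the oblique split x . Q[:, j] <= c in the original space
split_original = X @ Q[:, j] <= c
print(np.array_equal(split_rotated, split_original))
```

Since the rotated feature j is just the dot product of each point with the column Q[:, j], the two boolean masks agree exactly.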
Several important choices affect Rotation Forest performance. Let's examine each in detail.
1. Number of Feature Subsets (K):
The parameter K controls how features are grouped for PCA:
Empirical studies suggest K = 3 is often optimal, balancing diversity and rotation quality.
| K Value | Group Size | Rotation Diversity | PCA Quality | Recommended Use |
|---|---|---|---|---|
| K = 1 | All features | Low (global PCA) | Best (most data per PCA) | High-dimensional, few samples |
| K = 3 | d/3 features | Moderate | Good | General purpose (default) |
| K = d/3 | 3 features | High | Moderate | Low-dimensional problems |
| K = d | 1 feature | None | Trivial (single-feature PCA is just scaling) | Never use |
2. Bootstrap Sampling for PCA:
The original algorithm applies bootstrap sampling before PCA on each feature subset. This serves two purposes: (1) it injects additional randomness, so trees that happen to draw the same feature partition still receive different rotations, and (2) it trades a small amount of rotation quality for ensemble diversity.
The bootstrap rate (typically 75%) controls this tradeoff. Setting it to 0 uses all samples for PCA.
3. Tree Type:
While the original paper uses unpruned decision trees, variations include depth-limited trees (via a `max_depth` cap) and versions that swap PCA for other transformations such as Sparse PCA or ICA, as in the variant implementation below.
```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.decomposition import PCA, SparsePCA, FastICA
from sklearn.tree import DecisionTreeClassifier


class RotationForestVariant(BaseEstimator, ClassifierMixin):
    """
    Rotation Forest with configurable transformation method.

    Variants:
    - 'pca': Standard PCA (original algorithm)
    - 'sparse_pca': Sparse PCA for high-dimensional data
    - 'ica': Independent Component Analysis for non-Gaussian data
    """

    def __init__(
        self,
        n_estimators: int = 30,
        n_feature_subsets: int = 3,
        transformation: str = 'pca',
        max_depth: int = None,
        random_state: int = None
    ):
        self.n_estimators = n_estimators
        self.n_feature_subsets = n_feature_subsets
        self.transformation = transformation
        self.max_depth = max_depth
        self.random_state = random_state

    def _get_transformer(self, n_components: int):
        """Get the appropriate transformation method."""
        if self.transformation == 'pca':
            return PCA(n_components=n_components)
        elif self.transformation == 'sparse_pca':
            return SparsePCA(n_components=n_components, random_state=self.random_state)
        elif self.transformation == 'ica':
            return FastICA(n_components=n_components, random_state=self.random_state)
        else:
            raise ValueError(f"Unknown transformation: {self.transformation}")

    def _build_rotation_matrix(self, X, feature_partition, rng):
        """Build rotation matrix using the specified transformation."""
        n_features = X.shape[1]
        rotation_matrix = np.zeros((n_features, n_features))

        for feature_indices in feature_partition:
            if len(feature_indices) == 0:
                continue

            X_subset = X[:, feature_indices]

            # Bootstrap sample
            n_samples = max(2, int(0.75 * X.shape[0]))
            sample_idx = rng.choice(X.shape[0], size=n_samples, replace=True)
            X_subset = X_subset[sample_idx]

            # Apply transformation
            n_components = max(1, min(len(feature_indices), X_subset.shape[0] - 1))
            transformer = self._get_transformer(n_components)
            transformer.fit(X_subset)

            # Get the local transformation matrix
            if self.transformation == 'ica':
                # For ICA, use the mixing matrix (sources -> observations)
                local_rotation = transformer.mixing_
            else:
                local_rotation = transformer.components_.T

            # Pad with zeros if fewer components than subset features
            if local_rotation.shape[1] < len(feature_indices):
                padding = np.zeros((local_rotation.shape[0],
                                    len(feature_indices) - local_rotation.shape[1]))
                local_rotation = np.hstack([local_rotation, padding])

            # Place in global rotation matrix
            for i, feat_idx in enumerate(feature_indices):
                for j, other_idx in enumerate(feature_indices):
                    if j < local_rotation.shape[1]:
                        rotation_matrix[feat_idx, other_idx] = local_rotation[i, j]

        return rotation_matrix

    # fit / predict follow the same pattern as RotationForest above


# Comparison of transformations
"""
PCA vs Sparse PCA vs ICA for Rotation Forests:

1. PCA (Default):
   - Pros: Fast, deterministic, optimal variance capture
   - Cons: Dense rotations, sensitive to scale
   - Best for: General use, moderate dimensionality

2. Sparse PCA:
   - Pros: Interpretable rotations, robust to noise
   - Cons: Slower, non-orthogonal transformations
   - Best for: High-dimensional data, feature selection

3. ICA (Independent Component Analysis):
   - Pros: Captures non-Gaussian structure
   - Cons: Non-orthogonal, order ambiguity
   - Best for: Signal separation tasks, non-Gaussian data
"""
```

Rotation Forests have been extensively compared against Random Forests, AdaBoost, and other ensemble methods. Here's what the empirical evidence tells us.
| Dataset Type | Rotation Forest | Random Forest | AdaBoost | Notes |
|---|---|---|---|---|
| Standard UCI benchmarks | Best on 19/33 | Best on 8/33 | Best on 6/33 | RotF statistically superior overall |
| High-dimensional (d > 100) | Strong | Strong | Moderate | Both RF variants effective |
| Small sample (n < 200) | Best | Good | Prone to overfit | RotF's full-sample trees help |
| Noisy features | Good | Very Good | Poor | RF's feature sampling filters noise |
| Oblique boundaries | Best | Poor | Moderate | RotF's geometric strength |
| Time series | Good | Good | Good | No clear winner |
Key Findings:
Overall Accuracy: Rotation Forests often achieve the highest accuracy, particularly on datasets where decision boundaries don't align with feature axes.
Small Datasets: Rotation Forests perform exceptionally well when samples are limited. Each tree sees all data (no bootstrap exclusion), and the rotation provides diversity without sacrificing information.
Complex Boundaries: On problems with intricate, oblique decision boundaries, Rotation Forests' geometric diversity translates directly to better classification.
Computational Cost: The main drawback is training time. PCA computation adds overhead, making Rotation Forests slower than Random Forests for large datasets.
Rotation Forests may underperform when: (1) Features have clear, axis-aligned optimal splits, (2) Very high dimensionality makes PCA unstable, (3) Training time is critical (RF is faster), (4) Online/incremental learning is needed (rotation matrices are fixed after training).
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import (
    AdaBoostClassifier,
    ExtraTreesClassifier,
    RandomForestClassifier,
)
# Assumes the RotationForest class from the earlier code block


def benchmark_ensembles(n_samples=500, n_features=20, n_informative=10):
    """
    Compare Rotation Forest against Random Forest and AdaBoost.

    Creates a dataset with oblique decision boundaries where
    Rotation Forest should excel.
    """
    # Create dataset with complex structure
    X, y = make_classification(
        n_samples=n_samples,
        n_features=n_features,
        n_informative=n_informative,
        n_redundant=5,
        n_clusters_per_class=3,  # Creates complex boundaries
        flip_y=0.05,             # 5% label noise
        random_state=42
    )

    # Define models
    models = {
        'Rotation Forest': RotationForest(
            n_estimators=30, n_feature_subsets=3, random_state=42
        ),
        'Random Forest': RandomForestClassifier(
            n_estimators=30, random_state=42
        ),
        'Extra-Trees': ExtraTreesClassifier(
            n_estimators=30, random_state=42
        ),
        'AdaBoost': AdaBoostClassifier(
            n_estimators=30, random_state=42
        ),
    }

    # Cross-validate each model
    results = {}
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=10, scoring='accuracy')
        results[name] = {
            'mean': scores.mean(),
            'std': scores.std(),
            'scores': scores
        }
        print(f"{name:18s}: {scores.mean():.4f} (+/- {scores.std():.4f})")

    return results


# Example output:
# Rotation Forest   : 0.8940 (+/- 0.0312)
# Random Forest     : 0.8760 (+/- 0.0345)
# Extra-Trees       : 0.8820 (+/- 0.0298)
# AdaBoost          : 0.8480 (+/- 0.0456)
```

Several extensions and enhancements to the basic Rotation Forest algorithm have been proposed in the literature.
1. Rotation Forest for Regression:
The same principles apply to regression tasks: replace the decision tree classifier with a regression tree, rotate the feature space in exactly the same way, and average the trees' numeric predictions instead of their class probabilities.
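A minimal, hypothetical sketch of the regression variant (simplified to one global PCA rotation per tree rather than the full per-subset construction, to keep it short; `SimpleRotationForestRegressor` is our name for this sketch, not from the original paper):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeRegressor


class SimpleRotationForestRegressor:
    """Sketch: one PCA rotation per tree, averaged regression predictions."""

    def __init__(self, n_estimators=10, random_state=None):
        self.n_estimators = n_estimators
        self.rng = np.random.RandomState(random_state)
        self.models_ = []

    def fit(self, X, y):
        self.models_ = []
        n = X.shape[0]
        for _ in range(self.n_estimators):
            # PCA on a bootstrap sample yields a per-tree rotation
            idx = self.rng.choice(n, size=int(0.75 * n), replace=True)
            R = PCA().fit(X[idx]).components_.T
            tree = DecisionTreeRegressor(random_state=self.rng.randint(0, 2**31))
            tree.fit(X @ R, y)  # every tree sees all samples, rotated
            self.models_.append((R, tree))
        return self

    def predict(self, X):
        # Average numeric predictions across rotated trees
        preds = [tree.predict(X @ R) for R, tree in self.models_]
        return np.mean(preds, axis=0)


# Usage sketch
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))
y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + 0.1 * rng.standard_normal(200)

model = SimpleRotationForestRegressor(n_estimators=10, random_state=0).fit(X, y)
print(model.predict(X[:3]).shape)
```

The full algorithm would also partition features into K subsets before PCA, exactly as in the classifier shown earlier.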
2. Feature Importance in Rotation Forests:
Unlike Random Forests, feature importance in Rotation Forests is less straightforward: each tree's impurity-based importances are expressed in its own rotated coordinate system, not in terms of the original features.
However, because each tree has a different rotation matrix, aggregating importances across trees is non-trivial.
For feature importance with Rotation Forests, consider permutation importance computed in the original feature space. This approach measures the impact of shuffling each feature on prediction accuracy, bypassing the rotation complexity.
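A sketch of that approach using scikit-learn's `permutation_importance`. A `RandomForestClassifier` stands in here so the snippet runs on its own, but the same call works for any fitted model with a `predict` method, including the `RotationForest` class above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)

# Stand-in model; substitute a fitted RotationForest in practice
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Shuffle each original feature and measure the drop in accuracy;
# the rotation matrices never need to be inspected
result = permutation_importance(model, X, y, n_repeats=10,
                                scoring="accuracy", random_state=0)

for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f}")
```

Because permutation happens in the original feature space, the importances are directly interpretable regardless of how each tree rotated its inputs.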
Let's consolidate the essential knowledge about Rotation Forests: (1) each tree trains on a differently rotated feature space, built from PCA on random feature subsets, (2) rotation is an orthogonal, information-preserving transformation, so every tree still sees all features, (3) axis-aligned splits in rotated space correspond to oblique boundaries in the original space, (4) K = 3 feature subsets and a 75% bootstrap rate are sensible defaults, and (5) the main cost is extra training time from the per-tree PCA.
You now have a comprehensive understanding of Rotation Forests, from their geometric intuition to their mathematical foundations and practical implementation. Next, we'll explore Random Patches—a method that randomizes both feature and sample selection for each tree.