Standard decision trees have an inherent limitation: they can only make axis-aligned splits. Each split divides the feature space along a single feature dimension, creating rectangular decision regions. But what if the optimal decision boundary is diagonal? What if separating classes requires considering combinations of features?
This is exactly the problem that Oblique Random Forests solve. By allowing splits of the form:
$$w_1 x_1 + w_2 x_2 + \cdots + w_d x_d \leq t$$
instead of just $x_j \leq t$, oblique trees can create hyperplane boundaries at any orientation. The result is dramatically more expressive decision surfaces that can capture complex patterns with fewer splits.
Oblique Random Forests combine this enhanced expressiveness with the variance-reducing power of ensemble averaging, creating a method that excels on problems with diagonal or curved decision boundaries that frustrate axis-aligned approaches.
By the end of this page, you will deeply understand: (1) The geometric limitations of axis-aligned splits, (2) How oblique splits enable arbitrary hyperplane boundaries, (3) Algorithms for finding optimal oblique splits, (4) The computational trade-offs involved, and (5) When oblique forests provide significant advantages.
Before appreciating oblique splits, we must understand why axis-aligned splits can be problematic.
Axis-Aligned Split Definition:
A split on feature $j$ at threshold $t$ divides samples into a left set $\{\mathbf{x} : x_j \leq t\}$ and a right set $\{\mathbf{x} : x_j > t\}$.
This creates a decision boundary perpendicular to the $x_j$ axis.
The Problem: Staircase Boundaries
Consider a 2D classification problem where the true decision boundary is the line $x_1 = x_2$ (a 45° diagonal). An axis-aligned tree must approximate this with a "staircase" of horizontal and vertical splits:
*(Figure: an ideal 45° diagonal boundary separating Class A from Class B, which an axis-aligned tree can only approximate with a staircase of horizontal and vertical splits.)*
To approximate a smooth diagonal, we need many staircase steps, each requiring additional tree depth and splits.
| Boundary Angle | Axis-Aligned Splits Needed | Oblique Splits Needed | Tree Depth Difference |
|---|---|---|---|
| 0° or 90° | 1 | 1 | None |
| 45° | ~log(n) | 1 | Substantial |
| 30° or 60° | ~log(n) | 1 | Substantial |
| Arbitrary angle | O(log n) | 1 | O(log n) |
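To make the table above concrete, here is a small sketch of our own (it uses only scikit-learn's `DecisionTreeClassifier`; the data and numbers are illustrative): an unconstrained axis-aligned tree must grow deep to reproduce a 45° boundary exactly, while projecting onto $\mathbf{w} = (1, 1)$ separates the same data with a single threshold.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # true boundary is the 45-degree line x1 + x2 = 0

# Axis-aligned tree: see how deep it must grow to fit the diagonal exactly
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print("axis-aligned depth:", tree.get_depth(), "| leaves:", tree.get_n_leaves())

# Oblique view: project onto w = (1, 1); a single threshold at 0 separates the classes
z = X @ np.array([1.0, 1.0])
print("single oblique split accuracy:", ((z > 0).astype(int) == y).mean())
```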
Consequences of the Axis-Aligned Limitation: trees must grow deeper to approximate diagonal boundaries, the learned decision surface is a jagged staircase rather than a smooth hyperplane, and extra training data is needed near the boundary just to place all of those additional splits.
The axis-aligned limitation is most severe when: (1) True boundaries are diagonal, (2) Features are correlated, (3) Domain knowledge suggests feature combinations are meaningful, (4) Compact models are required. If true boundaries are approximately axis-aligned, standard trees work well.
Oblique splits generalize axis-aligned splits by allowing linear combinations of features.
Oblique Split Definition:
An oblique split with weight vector $\mathbf{w} \in \mathbb{R}^d$ and threshold $t$ divides samples into a left set $\{\mathbf{x} : \mathbf{w}^T \mathbf{x} \leq t\}$ and a right set $\{\mathbf{x} : \mathbf{w}^T \mathbf{x} > t\}$.
The decision boundary is the hyperplane $\{\mathbf{x} : \mathbf{w}^T \mathbf{x} = t\}$.
Axis-Aligned as Special Case:
An axis-aligned split on feature $j$ is an oblique split where: $$\mathbf{w} = \mathbf{e}_j = (0, ..., 0, 1, 0, ..., 0)^T$$
with the 1 in position $j$.
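A quick numerical check of this special case (a tiny sketch of our own): with $\mathbf{w} = \mathbf{e}_j$, the projection $\mathbf{w}^T \mathbf{x}$ is exactly the $j$-th feature, so the oblique test reduces to the familiar axis-aligned test $x_j \leq t$.

```python
import numpy as np

X = np.array([[2.0, -1.0, 3.0],
              [0.5,  4.0, -2.0]])

e1 = np.array([0.0, 1.0, 0.0])        # w = e_j with j = 1
print(np.allclose(X @ e1, X[:, 1]))   # True: the projection is just feature x_1
```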
Geometric Interpretation:
```python
import numpy as np
import matplotlib.pyplot as plt


def visualize_split_types():
    """
    Visualize the difference between axis-aligned and oblique splits.
    """
    np.random.seed(42)

    # Generate data with diagonal boundary
    n = 200
    X = np.random.randn(n, 2) * 2
    y = (X[:, 0] + X[:, 1] > 0).astype(int)  # Diagonal boundary: x1 + x2 = 0

    fig, axes = plt.subplots(1, 3, figsize=(15, 5))

    # Original data
    ax = axes[0]
    ax.scatter(X[y == 0, 0], X[y == 0, 1], c='blue', alpha=0.6, label='Class 0')
    ax.scatter(X[y == 1, 0], X[y == 1, 1], c='red', alpha=0.6, label='Class 1')
    ax.plot([-4, 4], [4, -4], 'k--', linewidth=2, label='True boundary')
    ax.set_title('Data with Diagonal Boundary')
    ax.legend()
    ax.set_xlim(-4, 4)
    ax.set_ylim(-4, 4)

    # Axis-aligned approximation
    ax = axes[1]
    ax.scatter(X[y == 0, 0], X[y == 0, 1], c='blue', alpha=0.6)
    ax.scatter(X[y == 1, 0], X[y == 1, 1], c='red', alpha=0.6)
    # Staircase boundary
    stairs = [(-4, 4), (-3, 4), (-3, 3), (-2, 3), (-2, 2), (-1, 2), (-1, 1),
              (0, 1), (0, 0), (1, 0), (1, -1), (2, -1), (2, -2), (3, -2),
              (3, -3), (4, -3), (4, -4)]
    stairs_x, stairs_y = zip(*stairs)
    ax.plot(stairs_x, stairs_y, 'g-', linewidth=2, label='Axis-aligned (staircase)')
    ax.set_title('Axis-Aligned Splits (Many Needed)')
    ax.legend()
    ax.set_xlim(-4, 4)
    ax.set_ylim(-4, 4)

    # Single oblique split
    ax = axes[2]
    ax.scatter(X[y == 0, 0], X[y == 0, 1], c='blue', alpha=0.6)
    ax.scatter(X[y == 1, 0], X[y == 1, 1], c='red', alpha=0.6)
    ax.plot([-4, 4], [4, -4], 'purple', linewidth=2, label='Oblique split: w=(1,1)')
    ax.set_title('Single Oblique Split (Perfect)')
    ax.legend()
    ax.set_xlim(-4, 4)
    ax.set_ylim(-4, 4)

    plt.tight_layout()
    return fig


def compute_oblique_projection(X, w):
    """
    Project data onto the oblique split direction.

    For an oblique split with weights w, the split value is w·x.
    """
    # Normalize weights for interpretability
    w = w / np.linalg.norm(w)
    # Project each point onto the weight vector
    projections = X @ w
    return projections


# Example: Finding the best oblique split
def evaluate_oblique_split(X, y, w, t):
    """
    Evaluate the quality of an oblique split.

    Args:
        X: Feature matrix
        y: Labels
        w: Weight vector
        t: Threshold

    Returns:
        Information gain of the split
    """
    projections = X @ w
    left_mask = projections <= t
    right_mask = ~left_mask

    if left_mask.sum() == 0 or right_mask.sum() == 0:
        return -np.inf

    def gini(labels):
        if len(labels) == 0:
            return 0
        p = np.bincount(labels) / len(labels)
        return 1 - np.sum(p ** 2)

    n = len(y)
    n_left, n_right = left_mask.sum(), right_mask.sum()

    parent_gini = gini(y)
    child_gini = (n_left / n) * gini(y[left_mask]) + (n_right / n) * gini(y[right_mask])

    return parent_gini - child_gini


# Demonstration
if __name__ == "__main__":
    np.random.seed(42)
    X = np.random.randn(100, 2)
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    # Compare axis-aligned vs oblique
    print("Axis-aligned splits (single feature):")
    for j in range(2):
        w_axis = np.zeros(2)
        w_axis[j] = 1
        t = 0
        gain = evaluate_oblique_split(X, y, w_axis, t)
        print(f"  Feature {j}: gain = {gain:.4f}")

    print("\nOblique split (both features):")
    w_oblique = np.array([1, 1])
    t = 0
    gain = evaluate_oblique_split(X, y, w_oblique, t)
    print(f"  w=(1,1): gain = {gain:.4f}")
```

The challenge with oblique splits is computational: how do we find good weight vectors? The search space is continuous and high-dimensional.
Complexity Comparison: an axis-aligned search examines at most $O(d \cdot n)$ candidate splits per node, whereas the number of distinct hyperplane splits of $n$ points in $d$ dimensions grows as $O(n^d)$, and finding the globally optimal oblique split is NP-hard. Practical algorithms therefore rely on heuristics or randomization.
Several practical approaches have been developed to find good oblique splits efficiently.
1. Linear Discriminant Analysis (LDA) at Each Node:
The most principled approach uses LDA to find the optimal separating hyperplane:
$$\mathbf{w}_{\text{LDA}} = \mathbf{S}_W^{-1}(\mathbf{\mu}_1 - \mathbf{\mu}_0)$$
where $\mathbf{S}_W$ is the within-class scatter matrix.
Pros: optimal separating direction under Gaussian class-conditional assumptions; statistically well-founded. Cons: relies on that Gaussian assumption; requires inverting the within-class scatter matrix at every node.
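A minimal NumPy sketch of the LDA direction above, assuming a binary problem; the small ridge term added to $\mathbf{S}_W$ is our own numerical-stability tweak, not part of the formula:

```python
import numpy as np

def lda_direction(X, y, reg=1e-6):
    """Two-class LDA split direction w = S_W^{-1} (mu_1 - mu_0); sketch only."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: sum of centred outer products for each class,
    # plus a tiny ridge so the solve succeeds even with collinear features.
    S_w = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    S_w += reg * np.eye(X.shape[1])
    w = np.linalg.solve(S_w, mu1 - mu0)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
print(lda_direction(X, y))  # roughly proportional to (1, 1) for this diagonal boundary
```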
2. Random Coefficient Sampling:
Inspired by Extra-Trees, this approach samples several random weight vectors, projects the data onto each candidate direction, finds the best threshold along that projection, and keeps the direction with the highest impurity reduction.
This is the basis for Random Oblique Forests and is highly scalable.
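A compact sketch of the idea, with simplifying assumptions of our own (only a few quantile thresholds are tried per direction rather than a full scan); the full per-node version appears as `_find_random_split` in the implementation later on this page.

```python
import numpy as np

def gini(labels):
    p = np.bincount(labels) / len(labels)
    return 1.0 - np.sum(p ** 2)

def best_random_oblique_split(X, y, n_directions=20, k=2, rng=None):
    """Sample sparse random directions; return the best (gain, weights, threshold)."""
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    best = (-np.inf, None, None)
    for _ in range(n_directions):
        # Random direction touching only k features (sparse oblique split)
        w = np.zeros(d)
        idx = rng.choice(d, size=min(k, d), replace=False)
        w[idx] = rng.standard_normal(len(idx))
        z = X @ w
        for t in np.quantile(z, [0.25, 0.5, 0.75]):   # a few candidate thresholds
            left, right = y[z <= t], y[z > t]
            if len(left) == 0 or len(right) == 0:
                continue
            gain = gini(y) - (len(left) * gini(left) + len(right) * gini(right)) / n
            if gain > best[0]:
                best = (gain, w, t)
    return best
```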
3. Gradient-Based Optimization:
Optimize the impurity reduction with respect to $\mathbf{w}$:
$$\mathbf{w}^* = \arg\min_{\mathbf{w}} \text{Impurity}(\text{split with } \mathbf{w})$$
This is done with gradient descent or a similar optimizer; because a hard split is a step function of $\mathbf{w}$, the impurity is typically smoothed (for example with a sigmoid membership function) so that it becomes differentiable.
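As a sketch of this strategy (the sigmoid relaxation and the SciPy optimizer choice are our own assumptions, not a standard recipe), one can replace the hard split with a soft membership and minimize the resulting smooth impurity over $(\mathbf{w}, t)$:

```python
import numpy as np
from scipy.optimize import minimize

def soft_split_impurity(params, X, y, tau=0.1):
    """Sigmoid-relaxed weighted Gini of the split w·x <= t (smooth surrogate)."""
    w, t = params[:-1], params[-1]
    z = np.clip((X @ w - t) / tau, -50.0, 50.0)
    m = 1.0 / (1.0 + np.exp(-z))                      # soft membership of the right side

    def soft_gini(weights):
        total = weights.sum() + 1e-12
        p = np.array([weights[y == c].sum() for c in np.unique(y)]) / total
        return 1.0 - np.sum(p ** 2)

    n = len(y)
    return (m.sum() / n) * soft_gini(m) + ((1 - m).sum() / n) * soft_gini(1 - m)

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (0.6 * X[:, 0] + 0.8 * X[:, 1] > 0).astype(int)

x0 = np.append(rng.normal(size=2), 0.0)               # initial (w, t)
res = minimize(soft_split_impurity, x0, args=(X, y), method="BFGS")
w_opt = res.x[:-1] / np.linalg.norm(res.x[:-1])
# For this data the recovered direction should be roughly aligned with ±(0.6, 0.8).
print("recovered direction:", np.round(w_opt, 2))
```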
4. Sparse Coefficient Methods:
Restrict $\mathbf{w}$ to have only $k \ll d$ non-zero entries:
$$\mathbf{w} \in \{\mathbf{v} : \|\mathbf{v}\|_0 \leq k\}$$
This reduces the search space and improves interpretability.
| Method | Complexity | Quality | Interpretability | Best Use Case |
|---|---|---|---|---|
| LDA-based | O(nd² + d³) | Optimal (Gaussian) | Moderate | Small d, Gaussian-like data |
| Random sampling | O(k·nd) | Good (with many samples) | Low | Large d, high speed needed |
| Gradient descent | O(iterations·n·d) | High | Low | Complex boundaries, sufficient time |
| Sparse (CART-LC) | O(d²·n) | Good | High | Interpretability important |
| Householder (HHCART) | O(nd) | Good | Moderate | General purpose |
Let's implement an Oblique Random Forest from scratch, using multiple strategies for finding oblique splits.
```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from typing import Optional, Tuple, List, Union
from dataclasses import dataclass
from enum import Enum


class ObliqueSplitMethod(Enum):
    """Methods for finding oblique splits."""
    LDA = "lda"        # Linear Discriminant Analysis
    RANDOM = "random"  # Random weight sampling
    PCA = "pca"        # Principal Component direction
    RIDGE = "ridge"    # Ridge regression coefficients


@dataclass
class ObliqueNode:
    """Node in an oblique decision tree."""
    weights: Optional[np.ndarray] = None   # Split weights (oblique direction)
    threshold: Optional[float] = None      # Split threshold
    left: Optional['ObliqueNode'] = None   # Left child
    right: Optional['ObliqueNode'] = None  # Right child
    value: Optional[np.ndarray] = None     # Leaf class distribution
    is_leaf: bool = False


class ObliqueDecisionTree(BaseEstimator, ClassifierMixin):
    """
    Oblique Decision Tree using linear combination splits.

    Each split is of the form: w·x <= t
    where w is a weight vector learned at each node.
    """

    def __init__(
        self,
        split_method: ObliqueSplitMethod = ObliqueSplitMethod.RANDOM,
        n_random_samples: int = 10,
        max_depth: Optional[int] = None,
        min_samples_split: int = 2,
        min_samples_leaf: int = 1,
        max_features: Optional[int] = None,
        random_state: Optional[int] = None
    ):
        self.split_method = split_method
        self.n_random_samples = n_random_samples
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.min_samples_leaf = min_samples_leaf
        self.max_features = max_features
        self.random_state = random_state

        self.root = None
        self.classes_ = None
        self.n_features_ = None
        self.rng = None

    def _compute_gini(self, y: np.ndarray) -> float:
        """Compute Gini impurity."""
        if len(y) == 0:
            return 0.0
        counts = np.bincount(y, minlength=len(self.classes_))
        proportions = counts / len(y)
        return 1.0 - np.sum(proportions ** 2)

    def _compute_split_quality(
        self,
        y: np.ndarray,
        y_left: np.ndarray,
        y_right: np.ndarray
    ) -> float:
        """Compute information gain from a split."""
        n = len(y)
        n_left, n_right = len(y_left), len(y_right)

        if n_left < self.min_samples_leaf or n_right < self.min_samples_leaf:
            return -np.inf

        gini_parent = self._compute_gini(y)
        gini_weighted = (
            (n_left / n) * self._compute_gini(y_left) +
            (n_right / n) * self._compute_gini(y_right)
        )

        return gini_parent - gini_weighted

    def _find_best_threshold(
        self,
        projections: np.ndarray,
        y: np.ndarray
    ) -> Tuple[float, float]:
        """Find optimal threshold for given projections."""
        sorted_idx = np.argsort(projections)
        projections_sorted = projections[sorted_idx]
        y_sorted = y[sorted_idx]

        best_threshold = None
        best_quality = -np.inf

        # Try thresholds at midpoints
        for i in range(len(projections_sorted) - 1):
            if projections_sorted[i] == projections_sorted[i + 1]:
                continue

            threshold = (projections_sorted[i] + projections_sorted[i + 1]) / 2
            y_left = y_sorted[:i + 1]
            y_right = y_sorted[i + 1:]

            quality = self._compute_split_quality(y, y_left, y_right)

            if quality > best_quality:
                best_quality = quality
                best_threshold = threshold

        return best_threshold, best_quality

    def _find_lda_split(
        self,
        X: np.ndarray,
        y: np.ndarray
    ) -> Tuple[np.ndarray, float, float]:
        """Find oblique split using LDA direction."""
        if len(np.unique(y)) < 2:
            return None, None, -np.inf

        try:
            lda = LinearDiscriminantAnalysis()
            lda.fit(X, y)

            # LDA direction
            if hasattr(lda, 'coef_'):
                weights = lda.coef_[0]
            else:
                weights = lda.scalings_[:, 0]

            # Normalize
            weights = weights / (np.linalg.norm(weights) + 1e-10)

            # Find best threshold along this direction
            projections = X @ weights
            threshold, quality = self._find_best_threshold(projections, y)

            return weights, threshold, quality
        except Exception:
            return None, None, -np.inf

    def _find_random_split(
        self,
        X: np.ndarray,
        y: np.ndarray
    ) -> Tuple[np.ndarray, float, float]:
        """Find oblique split by sampling random directions."""
        n_features = X.shape[1]

        if self.max_features is not None:
            n_use = min(self.max_features, n_features)
        else:
            n_use = n_features

        best_weights = None
        best_threshold = None
        best_quality = -np.inf

        for _ in range(self.n_random_samples):
            # Random sparse weight vector
            feature_indices = self.rng.choice(n_features, size=n_use, replace=False)
            weights = np.zeros(n_features)
            # Random coefficients for selected features
            weights[feature_indices] = self.rng.randn(n_use)
            weights = weights / (np.linalg.norm(weights) + 1e-10)

            # Find best threshold
            projections = X @ weights
            threshold, quality = self._find_best_threshold(projections, y)

            if quality > best_quality:
                best_quality = quality
                best_weights = weights
                best_threshold = threshold

        return best_weights, best_threshold, best_quality

    def _find_oblique_split(
        self,
        X: np.ndarray,
        y: np.ndarray
    ) -> Tuple[np.ndarray, float, float]:
        """Find the best oblique split using the configured method."""
        if self.split_method == ObliqueSplitMethod.LDA:
            return self._find_lda_split(X, y)
        elif self.split_method == ObliqueSplitMethod.RANDOM:
            return self._find_random_split(X, y)
        else:
            # Default to random
            return self._find_random_split(X, y)

    def _build_tree(
        self,
        X: np.ndarray,
        y: np.ndarray,
        depth: int = 0
    ) -> ObliqueNode:
        """Recursively build the oblique tree."""
        n_samples = len(y)

        # Stopping conditions
        if (n_samples < self.min_samples_split or
                (self.max_depth is not None and depth >= self.max_depth) or
                len(np.unique(y)) == 1):
            leaf = ObliqueNode(is_leaf=True)
            leaf.value = np.bincount(y, minlength=len(self.classes_)) / n_samples
            return leaf

        # Find best oblique split
        weights, threshold, quality = self._find_oblique_split(X, y)

        if weights is None or quality <= 0:
            leaf = ObliqueNode(is_leaf=True)
            leaf.value = np.bincount(y, minlength=len(self.classes_)) / n_samples
            return leaf

        # Apply split
        projections = X @ weights
        left_mask = projections <= threshold
        right_mask = ~left_mask

        if left_mask.sum() == 0 or right_mask.sum() == 0:
            leaf = ObliqueNode(is_leaf=True)
            leaf.value = np.bincount(y, minlength=len(self.classes_)) / n_samples
            return leaf

        # Create node and recurse
        node = ObliqueNode(
            weights=weights,
            threshold=threshold,
            is_leaf=False
        )
        node.left = self._build_tree(X[left_mask], y[left_mask], depth + 1)
        node.right = self._build_tree(X[right_mask], y[right_mask], depth + 1)

        return node

    def fit(self, X: np.ndarray, y: np.ndarray) -> 'ObliqueDecisionTree':
        """Fit the oblique decision tree."""
        self.rng = np.random.RandomState(self.random_state)
        self.n_features_ = X.shape[1]
        self.classes_ = np.unique(y)

        # Convert y to integers
        class_to_int = {c: i for i, c in enumerate(self.classes_)}
        y_int = np.array([class_to_int[c] for c in y])

        self.root = self._build_tree(X, y_int)
        return self

    def _predict_sample(self, x: np.ndarray, node: ObliqueNode) -> np.ndarray:
        """Predict class probabilities for a single sample."""
        if node.is_leaf:
            return node.value

        projection = np.dot(x, node.weights)
        if projection <= node.threshold:
            return self._predict_sample(x, node.left)
        else:
            return self._predict_sample(x, node.right)

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """Predict class probabilities."""
        return np.array([self._predict_sample(x, self.root) for x in X])

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict class labels."""
        proba = self.predict_proba(X)
        indices = np.argmax(proba, axis=1)
        return self.classes_[indices]


class ObliqueRandomForest(BaseEstimator, ClassifierMixin):
    """
    Oblique Random Forest: ensemble of oblique decision trees.

    Combines the expressiveness of oblique splits with the
    variance reduction of ensemble averaging.
    """

    def __init__(
        self,
        n_estimators: int = 100,
        split_method: ObliqueSplitMethod = ObliqueSplitMethod.RANDOM,
        n_random_samples: int = 10,
        max_depth: Optional[int] = None,
        min_samples_leaf: int = 1,
        max_features: Optional[Union[int, str]] = 'sqrt',
        bootstrap: bool = True,
        random_state: Optional[int] = None,
        n_jobs: int = 1
    ):
        self.n_estimators = n_estimators
        self.split_method = split_method
        self.n_random_samples = n_random_samples
        self.max_depth = max_depth
        self.min_samples_leaf = min_samples_leaf
        self.max_features = max_features
        self.bootstrap = bootstrap
        self.random_state = random_state
        self.n_jobs = n_jobs

        self.trees_: List[ObliqueDecisionTree] = []
        self.classes_ = None

    def _get_max_features(self, n_features: int) -> int:
        """Compute max features for sparse oblique splits."""
        if self.max_features == 'sqrt':
            return max(1, int(np.sqrt(n_features)))
        elif self.max_features == 'log2':
            return max(1, int(np.log2(n_features)))
        elif isinstance(self.max_features, int):
            return min(self.max_features, n_features)
        elif isinstance(self.max_features, float):
            return max(1, int(self.max_features * n_features))
        else:
            return n_features

    def fit(self, X: np.ndarray, y: np.ndarray) -> 'ObliqueRandomForest':
        """Fit the oblique random forest."""
        rng = np.random.RandomState(self.random_state)
        n_samples, n_features = X.shape
        self.classes_ = np.unique(y)
        max_feat = self._get_max_features(n_features)

        self.trees_ = []

        for i in range(self.n_estimators):
            # Bootstrap sampling
            if self.bootstrap:
                indices = rng.choice(n_samples, size=n_samples, replace=True)
                X_boot = X[indices]
                y_boot = y[indices]
            else:
                X_boot = X
                y_boot = y

            # Create and train oblique tree
            tree = ObliqueDecisionTree(
                split_method=self.split_method,
                n_random_samples=self.n_random_samples,
                max_depth=self.max_depth,
                min_samples_leaf=self.min_samples_leaf,
                max_features=max_feat,
                random_state=rng.randint(2**31)
            )
            tree.fit(X_boot, y_boot)
            self.trees_.append(tree)

        return self

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """Predict class probabilities by averaging."""
        probas = np.zeros((X.shape[0], len(self.classes_)))
        for tree in self.trees_:
            probas += tree.predict_proba(X)
        probas /= len(self.trees_)
        return probas

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict class labels."""
        proba = self.predict_proba(X)
        return self.classes_[np.argmax(proba, axis=1)]
```

Oblique splits offer greater expressiveness but at a computational cost. Understanding these trade-offs is essential for practical deployment.
| Operation | Axis-Aligned Tree | Oblique Tree (Random) | Oblique Tree (LDA) |
|---|---|---|---|
| Find split at node | O(d·n log n) | O(k·n log n) where k = # random samples | O(n·d² + d³) |
| Apply split at node | O(1) | O(d) | O(d) |
| Full tree training | O(d·n·depth) | O(k·n·depth·d) | O(n·d²·depth) |
| Prediction per sample | O(depth) | O(d·depth) | O(d·depth) |
| Memory per node | O(1) (feature, threshold) | O(d) (weight vector, threshold) | O(d) |
Key Observations:
Training Cost: Oblique trees are more expensive to train, especially with LDA-based splits that require matrix operations at each node.
Prediction Cost: Each prediction requires a dot product with the weight vector ($O(d)$) instead of a simple comparison ($O(1)$).
Memory: Oblique trees require storing a full weight vector per node rather than just a feature index.
Shallower Trees Compensate: Oblique trees are typically much shallower (fewer splits needed), partially offsetting the per-node cost.
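For instance, if an axis-aligned tree needs depth 20 to trace a staircase (20 scalar comparisons per prediction) while an oblique tree with $d = 10$ needs only depth 6 (6 dot products, roughly 60 multiply-adds), the per-prediction costs end up within the same order of magnitude.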
Efficiency Strategies:
Random Oblique Forests with sparse weight vectors (2-5 features per split) often provide the best trade-off: much of the expressiveness benefit with only moderate computational overhead. With max_features=2, each split considers pairs of features—often enough to capture diagonal boundaries while remaining efficient.
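A short usage sketch of the `ObliqueRandomForest` implemented above with `max_features=2`, assuming the classes from the implementation section are in scope (the data and settings are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)        # diagonal boundary over two features

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

orf = ObliqueRandomForest(
    n_estimators=50,
    split_method=ObliqueSplitMethod.RANDOM,    # sparse random directions
    n_random_samples=10,
    max_features=2,                            # pair-wise oblique splits
    random_state=0,
)
orf.fit(X_tr, y_tr)
print("test accuracy:", (orf.predict(X_te) == y_te).mean())
```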
Oblique Random Forests are not always the right choice. Understanding when they provide significant advantages is crucial for effective model selection.
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier


def compare_on_different_boundaries():
    """
    Compare axis-aligned vs oblique forests on different boundary types.
    """
    np.random.seed(42)
    n_samples = 1000
    n_features = 10

    print("=== Comparison: Axis-Aligned vs Oblique ===\n")

    # 1. Axis-aligned boundary (RF should be best)
    print("1. Axis-aligned boundary:")
    X = np.random.randn(n_samples, n_features)
    y = (X[:, 0] > 0).astype(int)  # Boundary: x_0 = 0 (axis-aligned)

    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    orf = ObliqueRandomForest(n_estimators=100, random_state=42)

    rf_scores = cross_val_score(rf, X, y, cv=5)
    orf_scores = cross_val_score(orf, X, y, cv=5)
    print(f"   Random Forest:  {rf_scores.mean():.4f}")
    print(f"   Oblique Forest: {orf_scores.mean():.4f}")
    print()

    # 2. Diagonal boundary (Oblique should excel)
    print("2. Diagonal boundary:")
    X = np.random.randn(n_samples, n_features)
    y = (X[:, 0] + X[:, 1] + X[:, 2] > 0).astype(int)  # Diagonal

    rf_scores = cross_val_score(rf, X, y, cv=5)
    orf_scores = cross_val_score(orf, X, y, cv=5)
    print(f"   Random Forest:  {rf_scores.mean():.4f}")
    print(f"   Oblique Forest: {orf_scores.mean():.4f}")
    print()

    # 3. XOR-like boundary (both need depth)
    print("3. XOR-like boundary:")
    X = np.random.randn(n_samples, n_features)
    y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)  # XOR

    rf_scores = cross_val_score(rf, X, y, cv=5)
    orf_scores = cross_val_score(orf, X, y, cv=5)
    print(f"   Random Forest:  {rf_scores.mean():.4f}")
    print(f"   Oblique Forest: {orf_scores.mean():.4f}")
    print()

    # 4. Complex diagonal (Oblique significant advantage)
    print("4. Complex diagonal (multiple features):")
    X = np.random.randn(n_samples, n_features)
    # Weighted sum of first 5 features
    weights = np.array([0.4, 0.3, 0.2, 0.1, 0.05] + [0] * 5)
    y = (X @ weights > 0).astype(int)

    rf_scores = cross_val_score(rf, X, y, cv=5)
    orf_scores = cross_val_score(orf, X, y, cv=5)
    print(f"   Random Forest:  {rf_scores.mean():.4f}")
    print(f"   Oblique Forest: {orf_scores.mean():.4f}")

    return {
        'axis_aligned': {'rf': rf_scores.mean(), 'orf': orf_scores.mean()},
        'diagonal': {'rf': rf_scores.mean(), 'orf': orf_scores.mean()},
    }


# Example output:
# 1. Axis-aligned boundary:
#    Random Forest:  0.9960
#    Oblique Forest: 0.9940   <- Both excellent
#
# 2. Diagonal boundary:
#    Random Forest:  0.9120
#    Oblique Forest: 0.9850   <- Oblique wins
#
# 3. XOR-like boundary:
#    Random Forest:  0.9800
#    Oblique Forest: 0.9780   <- Both need depth, RF slightly better
#
# 4. Complex diagonal:
#    Random Forest:  0.8760
#    Oblique Forest: 0.9720   <- Oblique significant advantage
```

Several well-known implementations of oblique forests have been developed, each with unique characteristics.
| Implementation | Split Method | Sparsity | Scalability | Availability |
|---|---|---|---|---|
| CART-LC | Coordinate descent | Dense | Low | Limited |
| OC1 | Randomized search | Dense | Moderate | Academic |
| Rotation Forest | PCA rotations | Dense | Moderate | Available (Python) |
| SPORF | Random sparse projections | Sparse (2-3 features) | High | Open source (rerf) |
| sklearn-oblique-forest | Various | Configurable | Moderate | Open source |
For most practical applications, start with SPORF (Sparse Projection Oblique Randomer Forests) or implement random oblique splits with sparse weights (2-3 features per split). These provide good expressiveness with manageable computational cost and are available in modern packages.
Let's consolidate the essential knowledge about Oblique Random Forests: (1) oblique splits generalize axis-aligned splits to arbitrary hyperplanes $\mathbf{w}^T \mathbf{x} \leq t$, with axis-aligned splits as the special case $\mathbf{w} = \mathbf{e}_j$; (2) they excel when true boundaries are diagonal or depend on combinations of correlated features, producing much shallower trees; (3) finding good weight vectors is the central challenge, addressed by LDA-based, random-sampling, gradient-based, and sparse methods; (4) the extra expressiveness costs $O(d)$ time and memory per node, partially offset by shallower trees; and (5) sparse random projections over 2-3 features per split (as in SPORF) usually offer the best accuracy/cost trade-off.
Congratulations! You have now completed the comprehensive exploration of Random Forest Variants. You've mastered Extra-Trees, Rotation Forests, Random Patches, Subspace Forests, and Oblique Random Forests—each offering unique advantages for different problem types. This knowledge equips you to select and tune the optimal ensemble method for any machine learning challenge.