Oblique trees extend standard decision trees from axis-aligned to linear splits. But what if the decision boundary is inherently nonlinear? What if the optimal split follows a curve, a circle, or a complex manifold in feature space?
Multivariate splits generalize the split mechanism even further, allowing each tree node to use richer split functions (polynomial tests, kernel expansions, even small neural networks) or to fit local models in its leaves.
This creates a spectrum of tree-like models ranging from simple axis-aligned cuts to sophisticated nodes that are themselves powerful classifiers.
By the end of this page, you will understand the full spectrum of multivariate split types, how to embed nonlinear transformations within tree nodes, the tradeoffs between node complexity and tree structure, connections to kernel methods and neural networks, and specific algorithms including model trees, perceptron trees, and neural decision trees.
To understand multivariate splits, we need a systematic taxonomy of split complexity. This reveals the design space between simple trees and complex neural architectures.
Level 0: Axis-Aligned (Univariate)
$$\phi(x) = \mathbb{1}[x_j \leq \theta]$$
Tests a single feature against a threshold. Partitions space with axis-parallel hyperplanes.
Level 1: Linear (Oblique)
$$\phi(x) = \mathbb{1}[w^T x \leq \theta]$$
Tests a linear combination of all features. Partitions space with arbitrary hyperplanes.
Level 2: Polynomial
$$\phi(x) = \mathbb{1}[\sum_{|\alpha| \leq k} c_\alpha x^\alpha \leq \theta]$$
Tests polynomial combinations (e.g., $x_1^2 + 2x_1 x_2 + x_2^2$). Partitions space with algebraic surfaces.
Level 3: Kernel-Based
$$\phi(x) = \mathbb{1}\left[\sum_{i=1}^{n} \alpha_i K(x, x_i) + b \leq 0\right]$$
Uses kernel expansion (like SVM). Can represent complex, smooth boundaries.
Level 4: Neural
$$\phi(x) = \mathbb{1}[\sigma(W_2 \cdot \sigma(W_1 x + b_1) + b_2) > 0.5]$$
Embeds a neural network at each node. Can approximate any smooth function.
| Level | Split Type | Parameters | Boundary Shape | VC Dimension |
|---|---|---|---|---|
| 0 | Axis-aligned | 2 (j, θ) | Axis-parallel hyperplane | O(log d) |
| 1 | Linear/Oblique | d + 1 | Arbitrary hyperplane | O(d) |
| 2 | Degree-k polynomial | O(d^k) | Algebraic surface | O(d^k) |
| 3 | Kernel (n support vectors) | O(n) | Smooth manifold | O(n) |
| 4 | Neural (h hidden units) | O(d·h + h) | Arbitrary | O(d·h log h) |
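To make the hierarchy concrete, here is a minimal sketch of levels 0-2 as predicates on a single sample; the feature values and split parameters below are made-up examples, not taken from any dataset or library.

```python
import numpy as np

# Illustrative split predicates for levels 0-2 of the hierarchy.
# All parameters (j, theta, w, A, b, c) are invented for the example.

x = np.array([1.5, -0.5, 2.0])          # a single sample, d = 3

# Level 0: axis-aligned -- test one feature against a threshold
j, theta = 2, 1.0
level0 = x[j] <= theta                   # x[2] = 2.0 > 1.0 -> False

# Level 1: oblique -- test a linear combination of all features
w = np.array([0.5, -1.0, 0.25])
level1 = w @ x <= theta                  # 0.75 + 0.5 + 0.5 = 1.75 > 1.0 -> False

# Level 2: quadratic -- test x^T A x + b^T x + c <= 0
A = np.eye(3)                            # diagonal quadratic: sum of squares
b = np.zeros(3)
c = -5.0
level2 = x @ A @ x + b @ x + c <= 0      # 2.25 + 0.25 + 4.0 - 5.0 = 1.5 -> False

print(level0, level1, level2)
```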
The Expressiveness-Complexity Tradeoff:
As we move up the hierarchy, each node becomes more expressive, but it also requires more parameters, more data, and more computation to fit, and it becomes harder to interpret.
Key Design Question:
Should we use many simple nodes (a deep tree of cheap splits) or a few complex nodes (a shallow tree of powerful splits)?
Both can achieve the same expressiveness, but with different bias-variance profiles and computational characteristics.
A sufficiently deep axis-aligned tree is a universal approximator. A sufficiently complex single-node neural network is also a universal approximator. Multivariate trees interpolate between these extremes, balancing depth against node complexity.
A natural first step beyond linear combinations is to allow polynomial splits that can capture curved boundaries.
Quadratic Splits:
The most common extension uses second-degree polynomials:
$$\phi(x) = \mathbb{1}[x^T A x + b^T x + c \leq 0]$$
where $A \in \mathbb{R}^{d \times d}$ is a symmetric matrix of quadratic coefficients, $b \in \mathbb{R}^d$ is a vector of linear coefficients, and $c$ is a scalar offset.
This represents quadric decision boundaries: ellipses, parabolas, and hyperbolas in two dimensions, and their higher-dimensional analogues (ellipsoids, paraboloids, hyperboloids) in general.
Number of Parameters:
For a general quadratic split in $d$ dimensions, the parameter count is $\frac{d(d+1)}{2}$ (symmetric $A$) $+ \, d$ (linear $b$) $+ \, 1$ (offset $c$) $= \frac{(d+1)(d+2)}{2}$.
For $d = 10$: 66 parameters per split. For $d = 100$: 5,151 parameters per split.
The O(d²) parameter count makes full quadratic splits impractical for high-dimensional data. Even with generous training data, fitting thousands of parameters per node leads to severe overfitting. Regularization and sparsity are essential.
Restricted Quadratic Forms:
To manage complexity, we can restrict the quadratic form:
Diagonal Quadratic: Only $A_{ii}$ terms (sum of squares) $$\sum_{i=1}^d a_i x_i^2 + \sum_{i=1}^d b_i x_i + c \leq 0$$
Parameters: $2d + 1$
Axis-Pair Interactions: Include only selected $x_i x_j$ terms $$\text{Select } k \ll d^2 \text{ interaction pairs}$$
Parameters: $d + k + 1$
Low-Rank Quadratic: $A = V V^T$ where $V \in \mathbb{R}^{d \times r}$, $r \ll d$
Parameters: $dr + d + 1$
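One practical way to obtain a diagonal-quadratic split without writing a custom splitter is to append squared features and let an ordinary learner operate on the expanded representation. A minimal sketch under that assumption (toy data and feature map are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def diagonal_quadratic_features(X):
    """Map x -> [x, x^2]: a linear model on these 2d features
    realizes a diagonal-quadratic boundary in the original space."""
    return np.hstack([X, X ** 2])

# Toy data: class 1 lies inside a circle of radius 1 -- a boundary no
# single linear split can represent, but a diagonal quadratic can.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(500, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)

clf = LogisticRegression(max_iter=1000).fit(diagonal_quadratic_features(X), y)
print("training accuracy:", clf.score(diagonal_quadratic_features(X), y))
```

A standard decision tree fed the same expanded features would likewise gain access to thresholds on $x_i^2$.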
Optimization for Quadratic Splits:
Finding optimal quadratic splits is harder than finding linear ones. Two practical approaches are Quadratic Discriminant Analysis (QDA), whose decision boundary is already quadratic, and gradient-based optimization of a sigmoid-smoothed objective; both are sketched below.
```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from scipy.optimize import minimize


class QuadraticSplitFinder:
    """
    Find quadratic splits using QDA or gradient-based optimization.
    """
    def __init__(self, method='qda', regularization=0.1):
        self.method = method
        self.reg = regularization

    def find_split_qda(self, X, y):
        """
        Use Quadratic Discriminant Analysis to define split.
        QDA learns class-conditional Gaussians -> quadratic boundary.
        """
        qda = QuadraticDiscriminantAnalysis(reg_param=self.reg)
        qda.fit(X, y)
        # QDA decision boundary is quadratic
        # Predict probability and threshold at 0.5
        probs = qda.predict_proba(X)[:, 1]
        return qda, probs > 0.5

    def find_split_gradient(self, X, y, max_iter=100):
        """
        Gradient-based optimization for quadratic split parameters.
        Uses sigmoid smoothing for differentiability.
        """
        n, d = X.shape

        # Initialize: small random quadratic + no linear term
        A_init = 0.1 * np.random.randn(d, d)
        A_init = (A_init + A_init.T) / 2  # Symmetrize
        b_init = np.zeros(d)
        c_init = 0.0

        # Flatten parameters into a single vector
        params_init = np.concatenate([
            A_init[np.triu_indices(d)],   # Upper triangular of A
            b_init,
            [c_init]
        ])

        def sigmoid(z, temperature=1.0):
            return 1 / (1 + np.exp(-z / temperature))

        def unpack_params(params):
            n_A = d * (d + 1) // 2
            A_upper = params[:n_A]
            A = np.zeros((d, d))
            A[np.triu_indices(d)] = A_upper
            A = A + A.T - np.diag(np.diag(A))  # Full symmetric matrix
            b = params[n_A:n_A + d]
            c = params[-1]
            return A, b, c

        def loss(params):
            A, b, c = unpack_params(params)
            # Quadratic form for each sample
            quad_vals = np.array([x @ A @ x + b @ x + c for x in X])
            # Soft predictions
            preds = sigmoid(quad_vals, temperature=1.0)
            # Cross-entropy loss
            eps = 1e-10
            ce = -np.mean(y * np.log(preds + eps) +
                          (1 - y) * np.log(1 - preds + eps))
            # L2 regularization
            reg = self.reg * (np.sum(A**2) + np.sum(b**2))
            return ce + reg

        result = minimize(loss, params_init, method='L-BFGS-B',
                          options={'maxiter': max_iter})
        A, b, c = unpack_params(result.x)
        return {'A': A, 'b': b, 'c': c, 'loss': result.fun}

    def evaluate_quadratic(self, X, A, b, c):
        """Evaluate quadratic form for samples."""
        return np.array([x @ A @ x + b @ x + c for x in X])
```

When Quadratic Splits Help: the true boundary is curved (elliptical clusters, ring- or shell-shaped classes), the dimensionality is modest, and each node has enough samples to fit $O(d^2)$ parameters without overfitting.
When to Avoid: high-dimensional data, small sample sizes at nodes, or settings where the interpretability of a single simple tree is the main reason for using a tree at all.
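As a usage illustration, the QuadraticSplitFinder sketched above can be applied to ring-shaped toy data, where no single linear split separates the classes but one quadratic split can; the dataset and settings here are illustrative.

```python
import numpy as np
from sklearn.datasets import make_circles

# Assumes the QuadraticSplitFinder class defined in the sketch above.
X, y = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=0)

finder = QuadraticSplitFinder(method='gradient', regularization=0.01)
fit = finder.find_split_gradient(X, y, max_iter=200)

# Route samples by the sign of the fitted quadratic form.
vals = finder.evaluate_quadratic(X, fit['A'], fit['b'], fit['c'])
side = (vals > 0).astype(int)
acc = max((side == y).mean(), (side != y).mean())
print(f"single quadratic split accuracy: {acc:.2f}")
```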
Model trees take a different approach: rather than complicating the splits, they complicate the leaf predictions. Instead of predicting a constant at each leaf, they fit a local model.
Standard Regression Tree: $$\hat{f}(x) = \sum_{m=1}^{M} c_m \cdot \mathbb{1}(x \in R_m)$$
Model Tree (with linear leaves): $$\hat{f}(x) = \sum_{m=1}^{M} (w_m^T x + b_m) \cdot \mathbb{1}(x \in R_m)$$
Each leaf contains a linear regression model fitted to the samples in that region. This creates piecewise linear predictions instead of piecewise constant.
M5 Algorithm (Quinlan, 1992):
The most famous model tree algorithm: M5 grows a regression tree using standard-deviation reduction as the split criterion, fits a linear model at each node using the attributes tested in its subtree, prunes subtrees whenever a node's linear model estimates lower error than the subtree below it, and finally smooths predictions along the path from leaf to root. A simplified implementation is sketched below.
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import Ridge


class ModelTree:
    """
    Model tree with linear regression at leaves.
    Similar to M5 algorithm.
    """
    def __init__(self, max_depth=5, min_samples_leaf=10, alpha=1.0):
        self.max_depth = max_depth
        self.min_samples_leaf = min_samples_leaf
        self.alpha = alpha  # Ridge regularization
        # First grow a standard tree to get the partition
        self.tree = DecisionTreeRegressor(
            max_depth=max_depth,
            min_samples_leaf=min_samples_leaf
        )
        self.leaf_models = {}

    def fit(self, X, y):
        """Fit model tree."""
        # Step 1: Grow tree to partition feature space
        self.tree.fit(X, y)

        # Step 2: Identify which leaf each sample belongs to
        leaf_indices = self.tree.apply(X)
        unique_leaves = np.unique(leaf_indices)

        # Step 3: Fit linear model at each leaf
        for leaf in unique_leaves:
            mask = leaf_indices == leaf
            X_leaf = X[mask]
            y_leaf = y[mask]

            if len(y_leaf) < X.shape[1] + 1:
                # Not enough samples for a linear model, use the mean
                self.leaf_models[leaf] = {
                    'type': 'constant',
                    'value': y_leaf.mean()
                }
            else:
                # Fit regularized linear regression
                model = Ridge(alpha=self.alpha)
                model.fit(X_leaf, y_leaf)
                self.leaf_models[leaf] = {
                    'type': 'linear',
                    'model': model
                }
        return self

    def predict(self, X):
        """Predict using local linear models."""
        leaf_indices = self.tree.apply(X)
        predictions = np.zeros(len(X))

        for i, (x, leaf) in enumerate(zip(X, leaf_indices)):
            model_info = self.leaf_models[leaf]
            if model_info['type'] == 'constant':
                predictions[i] = model_info['value']
            else:
                predictions[i] = model_info['model'].predict(x.reshape(1, -1))[0]

        return predictions

    def predict_with_smoothing(self, X, smooth_factor=0.5):
        """
        Predict with smoothing toward parent predictions.
        This reduces discontinuities at leaf boundaries.
        """
        # Get leaf (linear-model) predictions
        leaf_preds = self.predict(X)
        # Get tree constant predictions (for smoothing)
        tree_preds = self.tree.predict(X)
        # Blend
        return smooth_factor * leaf_preds + (1 - smooth_factor) * tree_preds
```

Advantages of Model Trees: they capture locally linear structure with far fewer leaves than a piecewise-constant tree, they extrapolate more gracefully within each region, and each leaf remains interpretable as a small linear model.
Smoothing in Model Trees:
Raw model trees have discontinuities at region boundaries. M5 uses smoothing:
$$\hat{f}(x) = \frac{n_\text{leaf} \cdot \hat{f}_\text{leaf}(x) + k \cdot \hat{f}_\text{parent}(x)}{n_\text{leaf} + k}$$
where $k$ is a smoothing constant (typically 15). This blends leaf predictions toward parent predictions, creating smoother transitions.
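For example, with the default $k = 15$ and a leaf containing $n_\text{leaf} = 30$ samples, the smoothed prediction weights the leaf model twice as heavily as the parent:

$$\hat{f}(x) = \frac{30 \cdot \hat{f}_\text{leaf}(x) + 15 \cdot \hat{f}_\text{parent}(x)}{45} = \tfrac{2}{3}\,\hat{f}_\text{leaf}(x) + \tfrac{1}{3}\,\hat{f}_\text{parent}(x)$$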
Extensions include logistic model trees (LMT) for classification and rule-based ensembles of model trees such as Cubist.
Weka's M5P and scikit-lego's model_trees provide production implementations. Model trees often outperform standard regression trees with similar interpretability, making them excellent choices for regression problems with local linear structure.
Taking multivariate splits to the extreme, we can embed neural network components at tree nodes. This creates hybrid models that combine tree structure with neural flexibility.
Perceptron Trees:
The simplest neural tree uses a perceptron (single-layer neural network) at each internal node:
$$\phi(x) = \mathbb{1}[\sigma(w^T x + b) > 0.5]$$
where $\sigma$ is an activation function (sigmoid, tanh). Since $\sigma(z) > 0.5$ exactly when $z > 0$, the hard decision is identical to an oblique split; the squashing function only matters once the split is made soft or trained by gradient descent.
Multi-Layer Perceptron (MLP) Nodes:
More powerful: use a small MLP at each node:
$$h = \sigma(W_1 x + b_1)$$ $$\phi(x) = \mathbb{1}[\sigma(w_2^T h + b_2) > 0.5]$$
This can represent arbitrarily complex decision boundaries at each node.
The Soft Decision Tree:
Instead of hard splits, use probabilistic routing:
$$p(\text{left} | x) = \sigma(w^T x + b)$$ $$p(\text{right} | x) = 1 - p(\text{left} | x)$$
A sample "flows" through both branches with different probabilities. The prediction is a weighted sum:
$$\hat{f}(x) = \sum_{m=1}^{M} \pi_m(x) \cdot c_m$$
where $\pi_m(x)$ is the probability of reaching leaf $m$, computed as the product of routing probabilities along the path.
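For instance, in a depth-2 tree the probability of reaching the leftmost leaf is the product of going left at the root and left again at its left child:

$$\pi_\text{leftmost}(x) = \sigma(w_\text{root}^T x + b_\text{root}) \cdot \sigma(w_\text{left}^T x + b_\text{left})$$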
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftDecisionTree(nn.Module):
    """
    Differentiable soft decision tree.
    Uses sigmoid routing for end-to-end gradient training.
    """
    def __init__(self, input_dim, depth, n_classes):
        super().__init__()
        self.depth = depth
        self.n_classes = n_classes
        self.n_leaves = 2 ** depth
        self.n_internal = 2 ** depth - 1

        # Internal node parameters (oblique splits)
        self.node_weights = nn.Parameter(
            torch.randn(self.n_internal, input_dim) * 0.1
        )
        self.node_biases = nn.Parameter(
            torch.zeros(self.n_internal)
        )
        # Leaf predictions (class logits)
        self.leaf_logits = nn.Parameter(
            torch.randn(self.n_leaves, n_classes) * 0.1
        )
        # Temperature for soft splitting
        self.temperature = nn.Parameter(torch.ones(1))

    def forward(self, x):
        """
        Compute soft predictions: a weighted average of leaf
        predictions based on routing probabilities.
        """
        # Routing probability at each internal node:
        # node_probs[:, i] = sigmoid((w_i^T x + b_i) / temperature)
        node_logits = F.linear(x, self.node_weights, self.node_biases)
        node_probs = torch.sigmoid(node_logits / self.temperature)

        # Probability of reaching each leaf (tree routing)
        path_probs = self._compute_path_probs(node_probs)

        # Weighted sum of leaf predictions
        leaf_preds = F.softmax(self.leaf_logits, dim=-1)   # [n_leaves, n_classes]
        # path_probs: [batch_size, n_leaves]
        output = torch.matmul(path_probs, leaf_preds)      # [batch_size, n_classes]
        return output

    def _compute_path_probs(self, node_probs):
        """
        Compute probability of reaching each leaf.
        For a complete binary tree stored in array form:
        - Left child of node i is 2i + 1
        - Right child of node i is 2i + 2
        - Leaves occupy positions n_internal .. n_internal + n_leaves - 1
        """
        batch_size = node_probs.size(0)
        leaf_probs = []
        for leaf_idx in range(self.n_leaves):
            # Start at the leaf with probability 1 and walk up to the root,
            # multiplying in the routing probability at each ancestor.
            prob = torch.ones(batch_size, device=node_probs.device)
            current = leaf_idx + self.n_internal
            while current > 0:
                parent = (current - 1) // 2
                if current == 2 * parent + 1:            # Left child
                    prob = prob * node_probs[:, parent]
                else:                                    # Right child
                    prob = prob * (1 - node_probs[:, parent])
                current = parent
            leaf_probs.append(prob)
        # Out-of-place products keep autograd happy (no in-place writes
        # to tensors that participate in the graph).
        return torch.stack(leaf_probs, dim=1)            # [batch_size, n_leaves]


class NeuralDecisionForest(nn.Module):
    """
    Ensemble of soft decision trees.
    """
    def __init__(self, input_dim, depth, n_classes, n_trees=10):
        super().__init__()
        self.trees = nn.ModuleList([
            SoftDecisionTree(input_dim, depth, n_classes)
            for _ in range(n_trees)
        ])

    def forward(self, x):
        """Average predictions from all trees."""
        preds = torch.stack([tree(x) for tree in self.trees])
        return preds.mean(dim=0)
```

Advantages of Neural Splits: the entire tree becomes differentiable and can be trained end-to-end with gradient descent, each node can represent highly flexible boundaries, and the model integrates naturally with deep learning pipelines (GPU training, joint feature learning).
Disadvantages: training is slower and requires gradient-based optimization, the resulting trees lose much of their interpretability, and performance is sensitive to initialization, temperature, and other hyperparameters.
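To make the training cost concrete, here is a minimal sketch of how the SoftDecisionTree defined above might be trained end-to-end; the synthetic dataset, optimizer settings, and epoch count are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from sklearn.datasets import make_moons

# Assumes the SoftDecisionTree class from the sketch above.
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_t = torch.tensor(X, dtype=torch.float32)
y_t = torch.tensor(y, dtype=torch.long)

model = SoftDecisionTree(input_dim=2, depth=4, n_classes=2)
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)

for epoch in range(200):
    optimizer.zero_grad()
    probs = model(X_t)                              # class probabilities
    loss = F.nll_loss(torch.log(probs + 1e-9), y_t) # negative log-likelihood
    loss.backward()
    optimizer.step()

with torch.no_grad():
    accuracy = (model(X_t).argmax(dim=1) == y_t).float().mean().item()
print(f"training accuracy: {accuracy:.2f}")
```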
Key Research: Neural Oblivious Decision Ensembles (NODE)
Popov et al. (2019) introduced NODE for tabular data: it combines differentiable oblivious decision trees (all nodes at a given depth share the same split) with sparse, entmax-based feature and threshold selection, trains end-to-end by gradient descent, and can be stacked into multi-layer ensembles that are competitive with gradient boosting on several tabular benchmarks.
This represents the frontier of neural-tree hybrids.
Zhou and Feng's Deep Forest (2017) takes a different approach: instead of neural networks inside trees, it stacks Random Forests layer-by-layer like a deep network. Each layer's forest output becomes input to the next layer. This achieves deep learning-like representation learning with tree components.
Another approach to nonlinear splits uses kernel methods at each node. This leverages the kernel trick to create complex boundaries without explicit feature expansion.
The Kernel Split:
$$\phi(x) = \mathbb{1}\left[\sum_{i \in \text{SVs}} \alpha_i y_i K(x, x_i) + b \leq 0\right]$$
where the sum runs over the support vectors (SVs), $\alpha_i$ are the learned dual coefficients, $y_i$ are the support-vector labels, $K$ is the kernel function, and $b$ is the bias term.
This is equivalent to training an SVM at each node and using its decision function for routing.
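In scikit-learn terms, this routing rule is just the sign of a fitted SVC's decision_function; a brief sketch on toy data (dataset and parameters are illustrative):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=0)

# decision_function(x) computes sum_i alpha_i y_i K(x, x_i) + b,
# so thresholding it at 0 implements the kernel split above.
node_svm = SVC(kernel='rbf', gamma='scale').fit(X, y)
go_left = node_svm.decision_function(X) <= 0   # routing mask at this node

print("samples routed left:", int(go_left.sum()), "of", len(X))
```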
SVM Trees: train an SVM on the samples reaching each internal node and route samples left or right by the sign of its decision function; leaves predict as usual. An implementation sketch appears later in this section.
Kernel Choice:
| Kernel | Formula | Boundary Type | Best When |
|---|---|---|---|
| Linear | K(x,x') = x^T x' | Hyperplane | Linear separability |
| Polynomial | K(x,x') = (x^T x' + c)^d | Polynomial surface | Known polynomial interactions |
| RBF | K(x,x') = exp(-γ‖x-x'‖²) | Smooth, local | Complex, smooth boundaries |
| Sigmoid | K(x,x') = tanh(αx^T x' + c) | Similar to MLP | Neural-network-like |
RBF Kernel SVM at Nodes:
The RBF (Radial Basis Function) kernel is particularly powerful:
$$K(x, x') = \exp(-\gamma \|x - x'\|^2)$$
With appropriate $\gamma$, a single RBF-SVM can create closed boundaries around clusters, multiple disjoint decision regions, and highly curved surfaces.
This means an SVM tree with RBF kernels can handle extremely complex decision boundaries with just a few nodes.
Computational Considerations:
| Aspect | Standard SVM | SVM Tree Node |
|---|---|---|
| Training samples | All N | n_t (samples at node) |
| Complexity | O(N² to N³) | O(n_t² to n_t³) |
| Support vectors | O(N) | O(n_t) |
| Inference | O(N·d) | O(sum of SVs at nodes) |
Because each node has fewer samples, individual SVMs are faster. But more nodes means more total SVMs.
Hybrid Strategy:
A practical approach: at each node, try a cheap axis-aligned or linear split first, and fall back to a kernel SVM split only when the simple split's impurity reduction is poor (a sketch follows below).
This gives the efficiency of simple splits where possible, with kernel power where needed.
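A minimal sketch of this per-node decision rule; choose_node_splitter is a hypothetical helper, and the 90% score ratio is an illustrative choice, not a standard value.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score


def choose_node_splitter(X, y, score_ratio=0.9):
    """Hypothetical helper: keep the cheap axis-aligned stump unless an
    RBF-SVM split at this node is clearly better under cross-validation."""
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    svm = SVC(kernel='rbf', gamma='scale')

    stump_score = cross_val_score(stump, X, y, cv=3).mean()
    svm_score = cross_val_score(svm, X, y, cv=3).mean()

    if stump_score >= score_ratio * svm_score:
        return 'axis', stump.fit(X, y)    # simple split is good enough
    return 'kernel', svm.fit(X, y)        # pay for the kernel split
```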
```python
from sklearn.svm import SVC
import numpy as np


class SVMTreeNode:
    """
    Tree node that uses an SVM for splitting.
    """
    def __init__(self, kernel='rbf', C=1.0, gamma='scale'):
        self.kernel = kernel
        self.C = C
        self.gamma = gamma
        self.svm = None
        self.left = None
        self.right = None
        self.is_leaf = False
        self.prediction = None

    def fit(self, X, y, max_depth, current_depth=0, min_samples=10):
        """Recursively fit SVM tree."""
        n_samples = len(y)
        unique_classes = np.unique(y)

        # Check stopping conditions
        if (current_depth >= max_depth or
                n_samples < min_samples or
                len(unique_classes) == 1):
            self.is_leaf = True
            self.prediction = np.bincount(y).argmax()
            return

        # Train SVM for splitting
        self.svm = SVC(kernel=self.kernel, C=self.C, gamma=self.gamma)
        self.svm.fit(X, y)

        # Get predictions for splitting
        predictions = self.svm.predict(X)

        # Check if the SVM produced a useful split
        left_mask = predictions == unique_classes[0]
        right_mask = ~left_mask
        if left_mask.sum() < min_samples or right_mask.sum() < min_samples:
            # SVM didn't produce a useful split, make a leaf
            self.is_leaf = True
            self.prediction = np.bincount(y).argmax()
            return

        # Create children
        self.left = SVMTreeNode(self.kernel, self.C, self.gamma)
        self.right = SVMTreeNode(self.kernel, self.C, self.gamma)

        # Recursively fit children
        self.left.fit(X[left_mask], y[left_mask], max_depth,
                      current_depth + 1, min_samples)
        self.right.fit(X[right_mask], y[right_mask], max_depth,
                       current_depth + 1, min_samples)

    def predict(self, X):
        """Predict using SVM tree."""
        if self.is_leaf:
            return np.full(len(X), self.prediction)

        # Route using the node's SVM
        svm_pred = self.svm.predict(X)
        predictions = np.zeros(len(X), dtype=int)
        left_mask = svm_pred == self.svm.classes_[0]

        if left_mask.any():
            predictions[left_mask] = self.left.predict(X[left_mask])
        if (~left_mask).any():
            predictions[~left_mask] = self.right.predict(X[~left_mask])

        return predictions
```

With so many options for multivariate splits, how do we choose? The answer lies in understanding the fundamental tradeoff between node complexity and tree structure.
The Total Complexity Budget:
Think of model complexity as a budget that can be spent in two ways: on tree structure (more and deeper nodes with simple splits) or on node complexity (fewer, shallower nodes with expressive splits).
Both can achieve similar total expressiveness, but with different characteristics:
| Approach | Depth | Nodes | Params/Node | Total Params | Training |
|---|---|---|---|---|---|
| Deep axis-aligned | 20 | 1M leaves | 2 | ~2M | Fast |
| Medium oblique | 10 | ~1K leaves | d+1 | ~1K·d | Moderate |
| Shallow quadratic | 5 | ~32 leaves | d² | ~32·d² | Slow |
| Very shallow neural | 3 | ~8 leaves | 1000+ | ~10K | Very slow |
Empirical Observations:
Research suggests that oblique splits give the largest gains when features are strongly correlated, that quadratic and kernel nodes rarely justify their extra cost on high-dimensional data, and that ensembles of simple trees usually match or beat single trees with complex nodes.
The Model Selection Framework:
To choose the right multivariate split strategy, start with an axis-aligned ensemble as a baseline, check whether interaction or diagonal structure exists (for example, by adding interaction features and re-validating), and only then consider oblique, kernel, or neural nodes, keeping whichever choice wins under cross-validation.
Cross-Validation is King:
Ultimately, the choice must be validated empirically:
$$\text{Best model} = \arg\min_{m \in \{\text{axis}, \text{oblique}, \text{kernel}, \ldots\}} \text{CV-Error}(m)$$
Don't assume that more complex nodes are better—they often aren't!
Adding node complexity should be justified by improved cross-validation performance. In practice, ensembles of simple trees (Random Forest, XGBoost, LightGBM) almost always outperform single complex trees. The complexity budget is better spent on more trees than on more complex nodes, except in special circumstances.
Several libraries provide implementations of multivariate split trees. Here's a practical guide:
Standard Libraries:
| Library | Split Types | Notes |
|---|---|---|
| scikit-learn | Axis-aligned only | Foundation for extensions |
| Weka | CART-LC, M5P (model trees) | Java, comprehensive |
| R/rpart | Axis-aligned | Can extend with custom splits |
Specialized Libraries:
| Library | Split Types | Notes |
|---|---|---|
| OC1 | Oblique (original algorithm) | C implementation |
| LMT | Logistic Model Trees | Weka |
| Cubist | Model trees, ensembles | Industrial strength |
| pytorch-tabular | Neural oblique (NODE) | Deep learning integration |
DIY Approaches:
For custom multivariate splits, you can engineer oblique or interaction features and feed them to a standard tree, wrap per-node learners (SVMs, small networks) in a recursive splitter like the sketches above, or use a differentiable tree implementation and train end-to-end. The workflow below shows a quick way to test whether any of this is likely to help.
"""Practical workflow for choosing multivariate splits.""" import numpy as npfrom sklearn.tree import DecisionTreeClassifierfrom sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifierfrom sklearn.model_selection import cross_val_scorefrom sklearn.preprocessing import PolynomialFeatures, StandardScalerfrom sklearn.pipeline import Pipeline def multivariate_split_analysis(X, y, cv=5): """ Compare different split strategies on a dataset. Helps determine if multivariate splits would help. """ results = {} # 1. Baseline: Axis-aligned tree tree = DecisionTreeClassifier(max_depth=10, random_state=42) results['Axis-aligned tree'] = { 'cv_score': cross_val_score(tree, X, y, cv=cv).mean(), 'std': cross_val_score(tree, X, y, cv=cv).std() } # 2. Axis-aligned ensemble (sets the bar) rf = RandomForestClassifier(n_estimators=100, random_state=42) results['Random Forest'] = { 'cv_score': cross_val_score(rf, X, y, cv=cv).mean(), 'std': cross_val_score(rf, X, y, cv=cv).std() } # 3. Pseudo-oblique via interaction features poly = PolynomialFeatures(degree=2, interaction_only=True) tree_poly = DecisionTreeClassifier(max_depth=10, random_state=42) pipe_poly = Pipeline([ ('scaler', StandardScaler()), ('poly', poly), ('tree', tree_poly) ]) results['Tree + interactions'] = { 'cv_score': cross_val_score(pipe_poly, X, y, cv=cv).mean(), 'std': cross_val_score(pipe_poly, X, y, cv=cv).std() } # 4. Gradient Boosting (iterative oblique approximation) gb = GradientBoostingClassifier(n_estimators=100, random_state=42) results['Gradient Boosting'] = { 'cv_score': cross_val_score(gb, X, y, cv=cv).mean(), 'std': cross_val_score(gb, X, y, cv=cv).std() } # Report results print("Multivariate Split Analysis") print("=" * 50) for name, scores in sorted(results.items(), key=lambda x: -x[1]['cv_score']): print(f"{name:25s}: {scores['cv_score']:.4f} ± {scores['std']:.4f}") # Recommendations print("\nRecommendations:") base_score = results['Axis-aligned tree']['cv_score'] poly_score = results['Tree + interactions']['cv_score'] if poly_score - base_score > 0.02: print("• Data has diagonal/interaction structure.") print("• Consider: OC1, CART-LC, or add interaction features.") else: print("• Axis-aligned splits appear sufficient.") print("• Complex multivariate splits unlikely to help.") if results['Random Forest']['cv_score'] > base_score + 0.05: print("• Ensemble of simple trees beats complex single tree.") print("• Prefer Random Forest/XGBoost over multivariate trees.") return resultsIn most real-world applications, start with Random Forest or XGBoost. These ensembles of axis-aligned trees are fast, robust, and competitive. Only explore multivariate splits if: (1) you have a specific interpretability requirement for single trees, (2) domain knowledge suggests diagonal boundaries, or (3) ensemble performance is disappointing.
We've explored the full spectrum of multivariate splits, from simple axis-aligned to complex neural nodes. The essential takeaways: split complexity forms a hierarchy from axis-aligned to linear, polynomial, kernel, and neural nodes; model complexity is a budget that can be spent on tree depth or on node expressiveness; model trees move complexity into the leaves rather than the splits; soft and neural trees make the whole structure differentiable; and in practice, ensembles of simple trees remain the strongest default, so added node complexity must earn its place through cross-validation.
Looking Ahead:
The final page of this module examines Connections to Other Methods. We'll see how decision trees relate to kernel methods, neural networks, Bayesian models, and rule systems. Understanding these connections reveals why trees have remained relevant as machine learning has evolved, and how modern hybrid approaches draw on multiple traditions.
You now understand the full spectrum of multivariate splits in decision trees—from polynomial and kernel methods to neural architectures. This knowledge helps you recognize when complex nodes might help and when simpler approaches suffice. The next page explores how trees connect to the broader machine learning landscape.