Standard decision trees possess a fundamental geometric constraint that is often overlooked in introductory treatments: every split must be perpendicular to exactly one feature axis. This seemingly innocuous restriction has profound implications for the shapes of decision boundaries, the efficiency of representation, and the patterns that trees can—or cannot—easily capture.
When you train a CART, ID3, or C4.5 tree, you are implicitly constraining your model to partition the feature space using only horizontal and vertical lines (in 2D) or axis-aligned hyperplanes (in higher dimensions). This creates a world of rectangles, boxes, and axis-aligned hyperrectangles—never ellipses, diagonal lines, or curved boundaries.
By the end of this page, you will understand precisely why standard decision trees are limited to axis-aligned splits, how this affects their ability to model different data patterns, the theoretical and practical consequences of this limitation, and how it motivates more advanced tree variants covered in subsequent pages.
To understand the axis-aligned constraint formally, we need to examine the structure of split tests that decision trees can represent.
The Standard Split Test:
At each internal node $t$ of a standard decision tree, the split test takes the form:
$$\phi_t(x) = \mathbb{1}[x_j \leq \theta_t]$$
where $x_j$ is the value of the $j$-th feature of the input $x$, $j \in \{1, \ldots, d\}$ is the feature index chosen at node $t$, and $\theta_t \in \mathbb{R}$ is the learned threshold.
This test examines exactly one dimension and compares it to a constant. Geometrically, this defines a hyperplane perpendicular to axis $j$ at position $\theta_t$.
The Resulting Partition:
After $K$ splits, the tree partitions the feature space $\mathcal{X} \subseteq \mathbb{R}^d$ into regions $R_1, R_2, \ldots, R_M$ where each region is an axis-aligned hyperrectangle:
$$R_m = \{x \in \mathbb{R}^d : \ell^{(m)}_j \leq x_j < u^{(m)}_j, \; j = 1, \ldots, d\}$$
Here, $\ell^{(m)}_j$ and $u^{(m)}_j$ are the lower and upper bounds for feature $j$ in region $m$.
In the most general case, the bounds can extend to $-\infty$ or $+\infty$ if no split has been made on that feature in the path to leaf $m$. The key constraint is that the boundary of each region consists only of hyperplanes that are parallel to the coordinate axes.
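To make the hyperrectangle structure concrete, here is a minimal sketch (assuming scikit-learn; the helper `leaf_boxes` is ours, not a library function) that walks a fitted tree's internal arrays and recovers each leaf's bounds $\ell^{(m)}_j$ and $u^{(m)}_j$:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Fit a small tree on toy data, then recover each leaf's axis-aligned box.
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
t = tree.tree_

def leaf_boxes(node=0, lower=None, upper=None, d=2):
    """Recursively collect (lower, upper) bounds for every leaf region."""
    lower = np.full(d, -np.inf) if lower is None else lower
    upper = np.full(d, np.inf) if upper is None else upper
    if t.children_left[node] == -1:              # leaf: no further splits
        return [(lower, upper)]
    j, theta = t.feature[node], t.threshold[node]
    left_upper, right_lower = upper.copy(), lower.copy()
    left_upper[j] = min(upper[j], theta)         # left child: x_j <= theta
    right_lower[j] = max(lower[j], theta)        # right child: x_j > theta
    return (leaf_boxes(t.children_left[node], lower, left_upper, d)
            + leaf_boxes(t.children_right[node], right_lower, upper, d))

# Features never split on along a path keep their infinite bounds.
for lo, hi in leaf_boxes():
    print("region:", np.round(lo, 2), "to", np.round(hi, 2))
```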
Why This Constraint?
The axis-aligned restriction emerges from the optimization problem solved at each node. To find the best split, CART maximizes impurity reduction:
$$\max_{j, \theta} \left[ \mathcal{I}(t) - \frac{n_L}{n_t}\mathcal{I}(t_L) - \frac{n_R}{n_t}\mathcal{I}(t_R) \right]$$
The search is over the $d$ candidate feature indices $j$ and, for each feature, the finitely many thresholds $\theta$ that lie between consecutive sorted feature values (at most $n - 1$ per feature).
This is computationally tractable: we sort samples by each feature and evaluate the impurity reduction at each midpoint, giving $O(d \cdot n \log n)$ complexity per node.
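The sketch below is a simplified version of that search (binary labels, Gini impurity), not the CART implementation itself; it sorts each feature once and maintains running class counts so each candidate threshold is scored in constant time:

```python
import numpy as np

def gini_from_counts(pos, n):
    """Gini impurity of a node with `pos` positives out of `n` samples."""
    if n == 0:
        return 0.0
    p = pos / n
    return 2 * p * (1 - p)

def best_axis_aligned_split(X, y):
    """Scan each feature in sorted order with running class counts.

    Cost is O(d * n log n) for the sorts plus O(d * n) for the scans,
    matching the per-node complexity quoted above.
    """
    n, d = X.shape
    total_pos = int(y.sum())
    parent = gini_from_counts(total_pos, n)
    best_j, best_theta, best_gain = None, None, -np.inf
    for j in range(d):
        order = np.argsort(X[:, j])
        xs, ys = X[order, j], y[order]
        left_n, left_pos = 0, 0
        for i in range(1, n):
            left_n += 1
            left_pos += int(ys[i - 1])
            if xs[i] == xs[i - 1]:
                continue                          # no boundary between equal values
            gain = (parent
                    - (left_n / n) * gini_from_counts(left_pos, left_n)
                    - ((n - left_n) / n) * gini_from_counts(total_pos - left_pos, n - left_n))
            if gain > best_gain:
                best_j, best_theta, best_gain = j, 0.5 * (xs[i] + xs[i - 1]), gain
    return best_j, best_theta, best_gain

# Toy check: a vertical boundary at x1 = 5 is recovered almost exactly.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = (X[:, 0] >= 5).astype(int)
print(best_axis_aligned_split(X, y))              # feature 0, threshold near 5
```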
If we allowed arbitrary hyperplanes (oblique splits), the search would become:
$$\max_{w \in \mathbb{R}^d, \theta \in \mathbb{R}} \text{Impurity Reduction}(w^T x \leq \theta)$$
This is a continuous, non-convex optimization problem with $d + 1$ parameters per split—dramatically more expensive to solve.
| Split Type | Parameters per Split | Optimization | Complexity |
|---|---|---|---|
| Axis-Aligned | 2 (feature index j, threshold θ) | Discrete + 1D search | O(d·n log n) |
| Oblique (linear) | d + 1 (weight vector w, threshold θ) | Continuous, non-convex | NP-hard in general |
| Quadratic | O(d²) (quadratic form parameters) | Non-linear, non-convex | Highly intractable |
The best way to understand the axis-aligned limitation is through geometric visualization. Let's examine several scenarios in 2D.
Scenario 1: Linearly Separable, Axis-Aligned
Consider two classes separated by a vertical line at $x_1 = 5$:
Class A: x₁ < 5 Class B: x₁ ≥ 5
│ A A A │ B B B │
│ A A A │ B B B │
│ A A A │ B B B │
─────────┼───────→ x₁
5
A decision tree captures this perfectly with a single split: if x₁ < 5 then Class A else Class B.
Scenario 2: Linearly Separable, Diagonal
Now consider two classes separated by the diagonal line $x_1 + x_2 = 10$:
╱
B B B ╱ A A A
B B B ╱ A A A A
B B B ╱ A A A A A
─────╱────────────→ x₁
╱
optimal: x₁ + x₂ = 10
The true boundary is a single line, but a decision tree must approximate it with a staircase of axis-aligned splits:
├─────┬─────┤
│ B │ │
┌────┼─────┤ A │
│ B │ │ │
├────┤ A ├─────┤
│ │ │ A │
─────┴─────┴─────→ x₁
Tree staircase approximation
This requires many splits to approximate a single diagonal line, and the approximation is never exact.
A single oblique boundary that a linear classifier could represent with 3 parameters (a two-dimensional weight vector plus a threshold) may require O(n) axis-aligned splits to approximate accurately. This is a fundamental representational inefficiency of standard decision trees.
Scenario 3: The XOR Problem
The classic XOR pattern lies naturally along diagonals:
↑ x₂
│
B B │ A A
B B │ A A
───────┼───────→ x₁
A A │ B B
A A │ B B
│
Class A occupies the quadrants where $(x_1 - \bar{x}_1)(x_2 - \bar{x}_2) > 0$, and class B the quadrants where the product is negative.
No single axis-aligned split separates these classes (any one threshold leaves both children with roughly a 50/50 class mix), so the tree must build a grid:
┌─────┬─────┐
│ B │ A │
├─────┼─────┤
│ A │ B │
└─────┴─────┘
This requires at least 3 splits (one at the root, then one in each branch), and the grid must be discovered greedily even though no single split reduces impurity on its own; by contrast, a single test on the product $(x_1 - \bar{x}_1)(x_2 - \bar{x}_2)$ would separate the classes directly, as shown in the sketch below.
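A small experiment (a sketch assuming scikit-learn and centered XOR-style data) makes the cost concrete: on the raw coordinates the tree needs a grid of splits, while a single engineered product feature reduces the problem to one axis-aligned split.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Noise-free XOR-style data in the four quadrants of [-1, 1]^2.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = ((X[:, 0] * X[:, 1]) > 0).astype(int)     # class is the sign of the product

# Raw coordinates: a grid of boxes (depth >= 2, >= 4 leaves; often more,
# because the first greedy split gains almost nothing).
tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
print("raw features:    depth", tree_raw.get_depth(),
      "leaves", tree_raw.get_n_leaves())

# Engineered product feature: the boundary becomes a single threshold at 0.
X_prod = (X[:, 0] * X[:, 1]).reshape(-1, 1)
tree_prod = DecisionTreeClassifier(random_state=0).fit(X_prod, y)
print("product feature: depth", tree_prod.get_depth(),
      "leaves", tree_prod.get_n_leaves())     # depth 1, 2 leaves
```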
Scenario 4: Circular Boundary
Consider a class defined by $x_1^2 + x_2^2 \leq r^2$ (inside a circle):
╭───────╮
╱ ╲
│ Class A │
╲ ╱
╰───────╯
Class B everywhere else
An axis-aligned tree approximates this circle with a staircase polygon:
┌───────┐
┌─┤ ├─┐
│ │ Class ├─│
│ │ A │ │
├─┤ ├─┤
└───────┘
The approximation requires $O(1/\epsilon)$ splits to achieve error $\epsilon$ on the boundary.
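As a rough empirical check (a sketch assuming scikit-learn and uniform data on $[-2, 2]^2$; the leaf budgets and tolerances are arbitrary illustrative choices), we can grow trees with increasing leaf budgets and watch how many leaves are needed to push the boundary error below a target $\epsilon$:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Label points by whether they fall inside the unit circle.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(20000, 2))
y = ((X[:, 0] ** 2 + X[:, 1] ** 2) <= 1.0).astype(int)

for eps in [0.10, 0.05, 0.02, 0.01]:
    # Increase the leaf budget until held-out error drops below eps.
    for leaves in [4, 8, 16, 32, 64, 128, 256, 512]:
        tree = DecisionTreeClassifier(max_leaf_nodes=leaves, random_state=0)
        tree.fit(X[:10000], y[:10000])
        err = 1 - tree.score(X[10000:], y[10000:])
        if err <= eps:
            print(f"epsilon={eps:.2f}: ~{leaves} leaves (error {err:.3f})")
            break
    else:
        print(f"epsilon={eps:.2f}: more than 512 leaves needed")
```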
To rigorously understand when axis-aligned trees struggle, we need to analyze their representation efficiency compared to other hypothesis classes.
VC Dimension Perspective:
The VC dimension measures the expressive power of a hypothesis class. For axis-aligned rectangles in $\mathbb{R}^d$:
$$\text{VCdim}(\text{Axis-aligned rectangles}) = 2d$$
For arbitrary (oblique) halfspaces in $\mathbb{R}^d$:
$$\text{VCdim}(\text{Halfspaces}) = d + 1$$
Interestingly, a single axis-aligned rectangle has capacity comparable to (in fact larger than) a single halfspace. The limitation is therefore not raw VC dimension but representational efficiency: each individual split can only test one coordinate against one threshold, and trees compensate by composing many such splits.
Approximation Theory for Oblique Boundaries:
Consider approximating a linear boundary $w^T x = \theta$ where $w$ is not axis-aligned. Let $\alpha$ be the angle between $w$ and the nearest axis.
Theorem (Axis-Aligned Approximation Complexity): To achieve an $\epsilon$-approximation of an oblique hyperplane at angle $\alpha$ to the nearest axis, within a bounded region of diameter $D$, an axis-aligned tree requires:
$$\Omega\left(\frac{D \cdot |\sin \alpha|}{\epsilon}\right)$$
splits.
Implication: When $\alpha = 45°$ (maximum angle), the number of splits is $\Omega(D / \epsilon)$. This is why diagonal boundaries are particularly expensive to represent.
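For a rough sense of scale, ignore the constant hidden in the $\Omega(\cdot)$ and plug in a region of diameter $D = 10$, an angle $\alpha = 45°$, and a target error $\epsilon = 0.01$:

$$\frac{D \cdot |\sin \alpha|}{\epsilon} = \frac{10 \times 0.707}{0.01} \approx 707 \text{ splits}$$

whereas an axis-aligned boundary ($\alpha = 0$) needs only a single split regardless of $\epsilon$.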
| True Boundary | Tree Complexity | Approximation Quality | Example |
|---|---|---|---|
| Axis-aligned hyperplane | O(1) | Exact | x₁ < 5 |
| Oblique hyperplane (angle α) | O(1/ε · sin α) | ε-approximate | x₁ + x₂ < 10 |
| Quadratic (ellipse, hyperbola) | O(1/ε²) | ε-approximate | x₁² + x₂² < r² |
| General smooth curve | O(1/ε^(d-1)) | ε-approximate | sin(x₁) < x₂ |
The Representation Gap:
Define the representation gap as the ratio of hypothesis class complexity to decision boundary complexity:
$$\text{Gap}(f, H) = \frac{\min_{h \in H} \text{Complexity}(h) \text{ s.t. } h \approx f}{\text{Intrinsic Complexity}(f)}$$
For axis-aligned trees approximating oblique boundaries, this gap grows like $1/\epsilon$ for a single hyperplane and like $1/\epsilon^{d-1}$ for general smooth boundaries (see the table above), while the intrinsic complexity of the target stays fixed.
This gap, exponential in the dimension for smooth boundaries, is a fundamental limitation.
Universal Approximation:
Despite these limitations, decision trees are universal approximators: given enough splits, they can approximate any measurable function to arbitrary precision. This is because the set of axis-aligned rectangles is dense in the space of all measurable sets.
However, being a universal approximator says nothing about efficiency. Neural networks are also universal approximators, but some functions require exponentially many neurons.
Just because a model class CAN represent any function doesn't mean it will learn it efficiently from finite data. Decision trees can represent diagonal boundaries, but they may require far more training data and computational resources to do so than inherently diagonal methods.
The axis-alignment constraint has several practical consequences that practitioners should understand:
Consequence 1: Feature Engineering Sensitivity
Decision tree performance is highly sensitive to how features are represented. Consider the 45° rotation $z_1 = x_1 + x_2$, $z_2 = x_1 - x_2$:
A diagonal boundary such as $x_1 + x_2 = 10$ in $(x_1, x_2)$ space becomes the axis-aligned boundary $z_1 = 10$ in $(z_1, z_2)$ space! The same tree algorithm might need 1 split in the rotated space versus dozens in the original space.
This means: Feature engineering (including PCA, whitening, and domain-specific transformations) can dramatically affect decision tree performance.
```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.decomposition import PCA

# Generate diagonally-separated data
np.random.seed(42)
n = 200

# Class A: x1 + x2 < 0
# Class B: x1 + x2 >= 0
X_A = np.random.randn(n // 2, 2) - 1
X_B = np.random.randn(n // 2, 2) + 1
X = np.vstack([X_A, X_B])
y = np.array([0] * (n // 2) + [1] * (n // 2))

# Train tree on original features
tree_original = DecisionTreeClassifier(random_state=42)
tree_original.fit(X, y)
print(f"Original features: {tree_original.get_depth()} depth, "
      f"{tree_original.get_n_leaves()} leaves")

# Rotate features to align with boundary
theta = np.pi / 4  # 45 degree rotation
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta), np.cos(theta)]])
X_rotated = X @ R.T

# Train tree on rotated features
tree_rotated = DecisionTreeClassifier(random_state=42)
tree_rotated.fit(X_rotated, y)
print(f"Rotated features: {tree_rotated.get_depth()} depth, "
      f"{tree_rotated.get_n_leaves()} leaves")

# Alternative: Use PCA to find principal axes
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

tree_pca = DecisionTreeClassifier(random_state=42)
tree_pca.fit(X_pca, y)
print(f"PCA features: {tree_pca.get_depth()} depth, "
      f"{tree_pca.get_n_leaves()} leaves")
```

Consequence 2: Inability to Model Feature Interactions Succinctly
When the decision boundary depends on combinations of features (interactions), axis-aligned trees face difficulties: a rule such as $x_1 \cdot x_2 > c$ or $x_1 / x_2 > c$ cannot be written as any finite set of single-feature thresholds, so the tree can only approximate it with many box-shaped regions.
Consequence 3: Inefficiency in High Dimensions
In high-dimensional spaces, the axis-aligned constraint becomes increasingly limiting: a boundary that is oblique with respect to many features forces splits on each of them along every path, and the number of hyperrectangles needed to tile such a boundary grows rapidly with the dimension (compare the $O(1/\epsilon^{d-1})$ rate above).
This is one reason why single decision trees often struggle with high-dimensional data, even though they can theoretically represent any boundary.
If your domain suggests boundaries like 'debt-to-income ratio > 0.4' or 'power-to-weight ratio < threshold', standard decision trees cannot express these directly. You must either pre-compute these features or accept an inefficient axis-aligned approximation.
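A minimal sketch of that workaround, using a hypothetical debt-to-income rule (the feature names and the 0.4 cutoff are made up for illustration, and scikit-learn is assumed):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical rule: default risk when debt / income > 0.4 (a ratio, i.e. an
# oblique/curved boundary in the raw feature space).
rng = np.random.default_rng(0)
income = rng.uniform(20_000, 150_000, size=2000)
debt = rng.uniform(0, 80_000, size=2000)
y = (debt / income > 0.4).astype(int)

# Raw features: the tree approximates the ratio rule with many threshold splits.
X_raw = np.column_stack([income, debt])
tree_raw = DecisionTreeClassifier(random_state=0).fit(X_raw, y)
print("raw features:  leaves =", tree_raw.get_n_leaves())

# Pre-computed ratio feature: a single axis-aligned split suffices.
X_ratio = (debt / income).reshape(-1, 1)
tree_ratio = DecisionTreeClassifier(random_state=0).fit(X_ratio, y)
print("ratio feature: leaves =", tree_ratio.get_n_leaves())   # 2 leaves
```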
We can empirically measure how much axis-alignment costs in terms of model complexity and accuracy. This analysis helps understand when the limitation is significant.
Metric 1: Model Complexity Ratio
Define the complexity ratio as:
$$\rho = \frac{\text{Number of leaves in axis-aligned tree}}{\text{Number of leaves in oblique tree for same accuracy}}$$
For synthetic problems with diagonal boundaries, $\rho$ can be 10x or higher.
Metric 2: Accuracy Gap at Fixed Complexity
Compare accuracy of axis-aligned vs. oblique trees with the same number of leaves:
$$\Delta = \text{Accuracy}_{\text{oblique}} - \text{Accuracy}_{\text{axis-aligned}}$$
Metric 3: Sample Complexity
The number of training samples required to achieve target accuracy $1 - \epsilon$:
$$n_{\text{axis}} \text{ vs } n_{\text{oblique}}$$
Due to the larger hypothesis space needed, axis-aligned trees often require more samples.
```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt


def generate_diagonal_data(n, angle_degrees=45, noise=0.1):
    """Generate data with a diagonal decision boundary."""
    np.random.seed(42)
    angle = np.radians(angle_degrees)
    X = np.random.randn(n, 2)
    # Decision boundary: normal vector at given angle
    w = np.array([np.cos(angle), np.sin(angle)])
    y = (X @ w > 0).astype(int)
    # Add label noise
    flip_idx = np.random.choice(n, int(n * noise), replace=False)
    y[flip_idx] = 1 - y[flip_idx]
    return X, y


def compare_complexity_vs_accuracy(angles=[0, 15, 30, 45, 60, 75, 90]):
    """
    Compare decision tree performance across different boundary angles.
    Angle 0 and 90 are axis-aligned; 45 is maximally diagonal.
    """
    results = []
    for angle in angles:
        X, y = generate_diagonal_data(n=500, angle_degrees=angle, noise=0.05)

        # Train decision tree
        tree = DecisionTreeClassifier(random_state=42)
        tree.fit(X, y)

        # Get metrics
        tree_acc = cross_val_score(tree, X, y, cv=5).mean()

        # Compare to linear classifier (optimal for linear boundary)
        linear = LogisticRegression(random_state=42)
        linear_acc = cross_val_score(linear, X, y, cv=5).mean()

        results.append({
            'angle': angle,
            'tree_depth': tree.get_depth(),
            'tree_leaves': tree.get_n_leaves(),
            'tree_accuracy': tree_acc,
            'linear_accuracy': linear_acc,
            'accuracy_gap': linear_acc - tree_acc
        })
    return results


# Run comparison
results = compare_complexity_vs_accuracy()

# Display results
print("Angle | Tree Depth | Leaves | Tree Acc | Linear Acc | Gap")
print("-" * 60)
for r in results:
    print(f"{r['angle']:5} | {r['tree_depth']:10} | {r['tree_leaves']:6} | "
          f"{r['tree_accuracy']:.4f} | {r['linear_accuracy']:.4f} | "
          f"{r['accuracy_gap']:.4f}")
```

Interpreting the Results:
When running such comparisons, you'll typically observe that tree depth and leaf counts are smallest at 0° and 90°, grow as the boundary angle approaches 45°, and that the accuracy gap relative to the linear baseline peaks near 45° and essentially vanishes at the axis-aligned extremes.
This confirms the theoretical prediction that axis-aligned trees struggle most when boundaries are at 45° angles to the axes.
The Variance Component:
Beyond the fixed accuracy cost, axis-aligned approximations of oblique boundaries increase variance: the exact placement of the staircase steps depends on which training samples happen to fall near the boundary, so small perturbations of the training set move the split thresholds and flip predictions for nearby points. The sketch below illustrates one way to measure this instability.
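This sketch makes its own assumptions: bootstrap resamples stand in for training-set perturbations, and the test points are placed just off the true diagonal boundary.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Diagonally separated training data.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Test points lying close to the true boundary x1 + x2 = 0.
t = np.linspace(-2, 2, 200)
X_near = np.column_stack([t, -t + 0.05])

# Fit one tree per bootstrap resample and record its predictions.
preds = []
for b in range(50):
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(random_state=b).fit(X[idx], y[idx])
    preds.append(tree.predict(X_near))
preds = np.array(preds)

# Fraction of bootstrap trees disagreeing with the majority vote at each point.
disagreement = (preds != (preds.mean(axis=0) > 0.5)).mean(axis=0)
print(f"mean disagreement near the boundary: {disagreement.mean():.3f}")
```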
| Scenario | Axis-Aligned Leaves | Optimal Leaves | Complexity Ratio |
|---|---|---|---|
| Vertical boundary | 2 | 2 | 1.0x |
| 30° diagonal | 8-12 | 2 | 4-6x |
| 45° diagonal | 15-25 | 2 | 8-12x |
| Circular boundary (r=1) | 20-40 | 4-8 | 3-5x |
| XOR pattern | 4+ | 2 (with rotation) | 2x+ |
An interesting perspective on axis-alignment comes from examining how ensemble methods partially mitigate this limitation.
Random Forests and Pseudo-Oblique Boundaries:
While individual trees in a Random Forest are axis-aligned, the ensemble can approximate oblique boundaries through averaging:
$$\hat{f}_{\text{RF}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{(b)}(x)$$
Each tree creates a different staircase approximation, and averaging produces a smoother, more diagonal effective boundary.
Tree 1: Tree 2: Average:
┌──┬────┐ ┌────┬──┐ ╱
│B │ A │ │ B │A │ ╱ smooth
├──┼────┤ + ├────┼──┤ → ╱ diagonal
│B │ A │ │ B │A │ ╱ boundary
└──┴────┘ └────┴──┘ ╱
This is analogous to how anti-aliasing in computer graphics smooths jagged diagonal lines on a pixel grid.
Just as computer graphics uses multiple samples per pixel to create smooth diagonal lines on a grid of square pixels, Random Forests use multiple axis-aligned trees to create smooth oblique boundaries in feature space. Both are overcoming a fundamentally discrete approximation of continuous geometry.
Gradient Boosting and Additive Staircases:
Gradient boosting builds trees sequentially, with each tree correcting the residual errors of previous trees. For oblique boundaries:
$$\hat{f}_{\text{GB}}(x) = \sum_{m=1}^{M} \gamma_m h_m(x)$$
With enough trees and appropriate learning rate, gradient boosting can closely approximate oblique boundaries, though it remains fundamentally axis-aligned in each tree.
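A quick illustration (a sketch, not a benchmark, assuming scikit-learn): boosting depth-1 stumps on 45°-separated data, where each stump is axis-aligned but the additive model tracks the oblique boundary better as trees are added.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Data separated by the 45-degree boundary x1 + x2 = 0.
rng = np.random.default_rng(0)
X = rng.standard_normal((4000, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each added stump is a single axis-aligned step; the sum approaches the diagonal.
for n_trees in [1, 10, 50, 200, 500]:
    gb = GradientBoostingClassifier(n_estimators=n_trees, max_depth=1,
                                    learning_rate=0.1, random_state=0)
    gb.fit(X_tr, y_tr)
    print(f"{n_trees:4d} stumps: test accuracy = {gb.score(X_te, y_te):.3f}")
```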
Limit of Ensemble Approximation:
Even with ensembles, there are theoretical limits: every member tree still partitions the space into axis-aligned hyperrectangles, so the ensemble's boundary remains piecewise axis-aligned at a fine enough scale, and the number of trees required to approximate a smooth oblique boundary grows with the desired precision.
This motivates exploring methods that directly allow oblique splits, which we'll cover in the next page.
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt


def visualize_boundary_smoothing(X, y):
    """
    Compare decision boundaries of single tree vs ensemble.
    Shows how ensembles smooth axis-aligned approximations.
    """
    # Create mesh grid for decision boundary visualization
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                         np.linspace(y_min, y_max, 200))
    mesh_points = np.c_[xx.ravel(), yy.ravel()]

    models = {
        'Single Tree': DecisionTreeClassifier(random_state=42),
        'Random Forest (100 trees)': RandomForestClassifier(
            n_estimators=100, random_state=42
        ),
        'Gradient Boosting (100 trees)': GradientBoostingClassifier(
            n_estimators=100, random_state=42
        )
    }

    fig, axes = plt.subplots(1, 3, figsize=(15, 5))

    for ax, (name, model) in zip(axes, models.items()):
        model.fit(X, y)

        # Get probability predictions for contour
        if hasattr(model, 'predict_proba'):
            Z = model.predict_proba(mesh_points)[:, 1]
        else:
            Z = model.decision_function(mesh_points)
        Z = Z.reshape(xx.shape)

        # Plot contour
        ax.contourf(xx, yy, Z, levels=50, cmap='RdYlBu', alpha=0.8)
        ax.contour(xx, yy, Z, levels=[0.5], colors='black', linewidths=2)
        ax.scatter(X[:, 0], X[:, 1], c=y, cmap='RdYlBu',
                   edgecolors='black', s=20)
        ax.set_title(name)
        ax.set_xlabel('Feature 1')
        ax.set_ylabel('Feature 2')

    plt.tight_layout()
    plt.savefig('boundary_comparison.png', dpi=150)
    plt.show()
    return fig
```

Understanding the axis-alignment constraint should influence when you choose decision trees and how you prepare your data.
Decision Framework:
Ask these questions to determine if axis-aligned splits are appropriate:
1. Are there natural threshold-based rules in your domain?
2. Have you examined the true decision boundary shape?
3. Do feature ratios or products have domain meaning?
4. Is your data already rotated (e.g., by PCA)?
The axis-alignment limitation is often a feature engineering problem, not a model limitation. By adding the right derived features, you can convert oblique boundaries into axis-aligned ones. This is a core skill for successful tree-based modeling.
We've comprehensively analyzed the axis-aligned split constraint—its mathematical basis, geometric implications, and practical consequences. Here are the key insights: every standard split tests a single feature against a threshold, so leaf regions are axis-aligned hyperrectangles; oblique and curved boundaries must be approximated by staircases, at a cost that grows with the boundary's angle and the required precision; ensembles smooth this approximation but do not remove the constraint; and feature engineering (rotations, ratios, products) can often restore axis-aligned structure.
Looking Ahead:
The axis-alignment limitation naturally motivates more flexible tree variants. In the next page, we'll explore Oblique Trees—decision trees that allow splits of the form $w^T x \leq \theta$ where $w$ is an arbitrary weight vector. These trees can represent diagonal boundaries directly but introduce new challenges in optimization and interpretability.
You now understand a fundamental geometric constraint of decision trees: their restriction to axis-aligned splits. This knowledge helps you recognize when standard trees are appropriate, when feature engineering is needed, and why more advanced tree variants exist. The next page explores oblique trees as a direct solution to this limitation.