Linear regression and decision trees represent two fundamentally different philosophies of function approximation. Linear models assume a global parametric form—the same coefficients apply everywhere in feature space. Trees make no global assumptions—they partition the space and handle each region separately.
This philosophical divide leads to dramatically different behavior: where one excels, the other struggles; where one naturally captures patterns, the other requires careful engineering. Understanding this contrast is essential for choosing the right tool and for appreciating why ensemble methods often combine both paradigms.
By the end of this page, you will understand: the fundamental mathematical differences between linear models and trees, scenarios where each approach dominates, how assumptions affect performance, interpretability trade-offs, practical guidelines for model selection, and how modern methods combine both paradigms to leverage their complementary strengths.
Let's establish the precise mathematical forms and their implications.
Linear regression:
$$\hat{f}_{\text{linear}}(\mathbf{x}) = \beta_0 + \sum_{j=1}^{p} \beta_j x_j = \boldsymbol{\beta}^\top \tilde{\mathbf{x}}$$
where $\tilde{\mathbf{x}} = (1, x_1, \ldots, x_p)^\top$ and $\boldsymbol{\beta}$ is learned from data.
Regression tree:
$$\hat{f}_{\text{tree}}(\mathbf{x}) = \sum_{m=1}^{M} c_m \cdot \mathbb{1}_{R_m}(\mathbf{x})$$
where $\{R_m\}_{m=1}^{M}$ are axis-aligned rectangular regions and $c_m$ are constants.
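To make the indicator sum concrete, here is a minimal sketch; the single split at $x = 0$ and the constants $c_1, c_2$ are made up for illustration:

```python
import numpy as np

# Hypothetical 1-D tree with M = 2 regions split at x = 0:
#   f_tree(x) = c1 · 1[x ≤ 0] + c2 · 1[x > 0]
def tree_predict(x, split=0.0, c1=-1.5, c2=2.0):
    """Evaluate the sum of region indicators times region constants."""
    return c1 * (x <= split) + c2 * (x > split)

x = np.array([-2.0, -0.1, 0.1, 3.0])
print(tree_predict(x))  # every point left of the split maps to c1, right to c2
```

A fitted CART tree does exactly this, only with more regions and with each $c_m$ set to the mean response in region $R_m$.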
Key structural differences:
| Aspect | Linear Regression | Regression Tree |
|---|---|---|
| Functional form | Hyperplane (p+1 parameters) | Piecewise constant (M regions) |
| Decision boundary | Single global hyperplane | Multiple axis-aligned splits |
| Effect of feature xⱼ | Same everywhere (βⱼ) | Depends on which region |
| Interactions | Must be manually specified | Automatically captured by hierarchy |
| Continuity | Continuous and smooth | Discontinuous at split boundaries |
| Extrapolation | Linear extension (risky) | Constant extension (conservative) |
| Parameter count | Fixed at p+1 | Grows with tree size (2M-1) |
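The extrapolation row is worth seeing in action. Below is a small sketch on synthetic 1-D data (the training range and slope are arbitrary choices): the linear model extends its fitted slope far beyond the data, while the tree returns the constant of its edge region.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Train on x in [0, 1] with y ≈ 3x, then query far outside that range
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 1, (200, 1))
y_train = 3 * X_train.ravel() + rng.normal(0, 0.1, 200)

linear = LinearRegression().fit(X_train, y_train)
tree = DecisionTreeRegressor(max_depth=4).fit(X_train, y_train)

X_far = np.array([[10.0]])  # far outside the training range
# Linear extends the fitted slope (prediction near 30, risky if the trend breaks)
# Tree returns the mean of its rightmost leaf (prediction near 3, conservative)
print("Linear:", linear.predict(X_far)[0])
print("Tree:  ", tree.predict(X_far)[0])
```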
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

def compare_model_forms():
    """Illustrate fundamental differences in model forms."""
    np.random.seed(42)

    # Simple 2D example
    X = np.random.randn(100, 2)
    y = 2 * X[:, 0] - X[:, 1] + 0.5 * X[:, 0] * X[:, 1] + np.random.randn(100) * 0.5

    # Fit both models
    linear = LinearRegression()
    linear.fit(X, y)
    tree = DecisionTreeRegressor(max_depth=4)
    tree.fit(X, y)

    print("Model Form Comparison")
    print("=" * 60)

    # Linear model representation
    print("Linear Regression:")
    print(f"  f(x) = {linear.intercept_:.3f} + {linear.coef_[0]:.3f}*x₁ + {linear.coef_[1]:.3f}*x₂")
    print(f"  Parameters: {1 + len(linear.coef_)} (intercept + coefficients)")
    print("  Same coefficients apply to all x")

    # Tree model representation
    print("Decision Tree:")
    print(f"  Number of leaves (regions): {tree.get_n_leaves()}")
    print(f"  Total nodes: {tree.tree_.node_count}")
    print("  Different constants in different regions")

    # Show how predictions differ
    test_points = np.array([[-2, -2], [0, 0], [2, 2]])
    print("Predictions at test points:")
    print(f"{'Point':>15} {'Linear':>12} {'Tree':>12}")
    print("-" * 42)
    for point in test_points:
        y_linear = linear.predict(point.reshape(1, -1))[0]
        y_tree = tree.predict(point.reshape(1, -1))[0]
        print(f"{str(point):>15} {y_linear:>12.4f} {y_tree:>12.4f}")

    # Gradient comparison
    print("Gradient (∂f/∂x₁):")
    print(f"  Linear: {linear.coef_[0]:.3f} (constant everywhere)")
    print("  Tree: 0 almost everywhere (piecewise constant)")

def effect_locality():
    """Show how feature effects differ: global (linear) vs local (tree)."""
    np.random.seed(42)

    # Data where the effect of x₁ depends on x₂
    n = 500
    X = np.random.randn(n, 2)
    # Effect of x₁ is positive when x₂ > 0, negative when x₂ < 0
    y = np.where(X[:, 1] > 0, 2 * X[:, 0], -2 * X[:, 0]) + np.random.randn(n) * 0.3

    linear = LinearRegression()
    linear.fit(X, y)
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, y)

    print("Feature Effect Locality")
    print("=" * 50)
    print("True model: y = 2*x₁ if x₂ > 0 else -2*x₁")
    print("(Effect of x₁ flips depending on x₂)")
    print("Linear regression coefficients:")
    print(f"  β₁ (for x₁): {linear.coef_[0]:.3f}")
    print(f"  β₂ (for x₂): {linear.coef_[1]:.3f}")
    print("  → Linear model sees average effect ≈ 0 for x₁!")

    # Test in different regions
    regions = [
        ("x₂ > 0 region", np.array([[1, 1]])),
        ("x₂ < 0 region", np.array([[1, -1]])),
    ]
    print("Predictions for x₁=1 in different x₂ regions:")
    for name, point in regions:
        y_lin = linear.predict(point)[0]
        y_tree = tree.predict(point)[0]
        print(f"  {name}: Linear={y_lin:.2f}, Tree={y_tree:.2f}")
    print("→ Tree captures the interaction; linear model fails")

compare_model_forms()
effect_locality()
```

Linear regression has dominated statistical practice for over a century because in the right settings, it is remarkably effective and efficient.
Scenarios favoring linear models:
- True relationship is linear or approximately linear
- Features are transformations of raw inputs
- High-dimensional, sparse settings
- Inference is the primary goal
- Small sample sizes
```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

def linear_wins_scenarios():
    """Demonstrate scenarios where linear models outperform trees."""
    np.random.seed(42)
    scenarios = []

    # Scenario 1: Truly linear relationship
    print("Scenario 1: Truly Linear Relationship")
    print("-" * 50)
    n = 200
    X = np.random.randn(n, 5)
    y = 2*X[:, 0] - 3*X[:, 1] + X[:, 2] + np.random.randn(n) * 0.5

    linear = LinearRegression()
    tree = DecisionTreeRegressor(max_depth=5)
    linear_scores = cross_val_score(linear, X, y, cv=5, scoring='r2')
    tree_scores = cross_val_score(tree, X, y, cv=5, scoring='r2')
    print(f"  Linear R²: {linear_scores.mean():.4f} (±{linear_scores.std():.4f})")
    print(f"  Tree R²:   {tree_scores.mean():.4f} (±{tree_scores.std():.4f})")
    scenarios.append(('Linear truth', linear_scores.mean(), tree_scores.mean()))

    # Scenario 2: Small sample size
    print("Scenario 2: Small Sample Size (n=30)")
    print("-" * 50)
    n = 30
    X = np.random.randn(n, 3)
    y = X[:, 0] + 0.5*X[:, 1] + np.random.randn(n) * 0.3
    linear_scores = cross_val_score(linear, X, y, cv=5, scoring='r2')
    tree_scores = cross_val_score(tree, X, y, cv=5, scoring='r2')
    print(f"  Linear R²: {linear_scores.mean():.4f} (±{linear_scores.std():.4f})")
    print(f"  Tree R²:   {tree_scores.mean():.4f} (±{tree_scores.std():.4f})")
    scenarios.append(('Small n', linear_scores.mean(), tree_scores.mean()))

    # Scenario 3: High-dimensional sparse
    print("Scenario 3: High-Dimensional Sparse (p=50, n=100)")
    print("-" * 50)
    n, p = 100, 50
    X = np.random.randn(n, p)
    # Only 3 of the 50 features matter
    y = X[:, 0] + 2*X[:, 1] - X[:, 2] + np.random.randn(n) * 0.5
    ridge = Ridge(alpha=1.0)
    ridge_scores = cross_val_score(ridge, X, y, cv=5, scoring='r2')
    tree_scores = cross_val_score(tree, X, y, cv=5, scoring='r2')
    print(f"  Ridge R²: {ridge_scores.mean():.4f} (±{ridge_scores.std():.4f})")
    print(f"  Tree R²:  {tree_scores.mean():.4f} (±{tree_scores.std():.4f})")
    scenarios.append(('High-dim sparse', ridge_scores.mean(), tree_scores.mean()))

    print("Summary: Linear models excel with:")
    print("  - Linear relationships")
    print("  - Small samples (low variance)")
    print("  - High dimensions with sparsity (regularization)")
    return scenarios

linear_wins_scenarios()
```

Trees shine in situations where linear models fundamentally fail—where the relationship between features and target is complex, interactive, or regionally varying.
Scenarios favoring trees:
- Nonlinear relationships without known form
- Strong interactions between features
- Mixed feature types
- Heterogeneous data
- Interpretability through rules
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

def trees_win_scenarios():
    """Demonstrate scenarios where trees outperform linear models."""
    np.random.seed(42)

    # Scenario 1: Nonlinear relationship
    print("Scenario 1: Nonlinear Relationship (step function)")
    print("-" * 55)
    n = 300
    X = np.random.randn(n, 3)
    # Step function in x₁
    y = np.where(X[:, 0] > 0, 5, -5) + np.random.randn(n) * 0.5

    linear = LinearRegression()
    tree = DecisionTreeRegressor(max_depth=3)
    linear_scores = cross_val_score(linear, X, y, cv=5, scoring='r2')
    tree_scores = cross_val_score(tree, X, y, cv=5, scoring='r2')
    print(f"  Linear R²: {linear_scores.mean():.4f}")
    print(f"  Tree R²:   {tree_scores.mean():.4f}")

    # Scenario 2: Strong interaction
    print("Scenario 2: Strong Interaction (XOR-like)")
    print("-" * 55)
    X = np.random.randn(n, 2)
    # XOR: positive output when signs differ
    y = np.where((X[:, 0] > 0) != (X[:, 1] > 0), 5, -5) + np.random.randn(n) * 0.5
    linear_scores = cross_val_score(linear, X, y, cv=5, scoring='r2')
    tree_scores = cross_val_score(tree, X, y, cv=5, scoring='r2')
    print(f"  Linear R²: {linear_scores.mean():.4f}")
    print(f"  Tree R²:   {tree_scores.mean():.4f}")

    # Scenario 3: Categorical features (encoded)
    print("Scenario 3: Categorical Feature Importance")
    print("-" * 55)
    n = 400
    # Categorical with 4 levels, each with a different effect
    cat = np.random.choice([0, 1, 2, 3], n)
    X_cont = np.random.randn(n, 2)
    # One-hot encoding for the linear model
    X_linear = np.column_stack([
        X_cont,
        (cat == 0).astype(float),
        (cat == 1).astype(float),
        (cat == 2).astype(float),
        (cat == 3).astype(float),
    ])
    # Tree can use the raw categorical code
    X_tree = np.column_stack([X_cont, cat])
    # Effect depends on category (interaction)
    y = np.where(cat == 0, 2*X_cont[:, 0],
        np.where(cat == 1, -2*X_cont[:, 0],
        np.where(cat == 2, 3*X_cont[:, 1], -3*X_cont[:, 1])))
    y += np.random.randn(n) * 0.5

    linear_scores = cross_val_score(linear, X_linear, y, cv=5, scoring='r2')
    tree_scores = cross_val_score(tree, X_tree, y, cv=5, scoring='r2')
    print(f"  Linear R²: {linear_scores.mean():.4f}")
    print(f"  Tree R²:   {tree_scores.mean():.4f}")

    # Scenario 4: Multiple regimes
    print("Scenario 4: Multiple Regimes (different slopes)")
    print("-" * 55)
    X = np.random.uniform(-2, 2, (n, 1))
    # Different slopes in different regions
    y = np.where(X[:, 0] < -1, 5*X[:, 0],
        np.where(X[:, 0] < 1, -3*X[:, 0], 2*X[:, 0]))
    y += np.random.randn(n) * 0.3
    linear_scores = cross_val_score(linear, X, y, cv=5, scoring='r2')
    tree_scores = cross_val_score(DecisionTreeRegressor(max_depth=4), X, y,
                                  cv=5, scoring='r2')
    print(f"  Linear R²: {linear_scores.mean():.4f}")
    print(f"  Tree R²:   {tree_scores.mean():.4f}")

    print("Summary: Trees excel with:")
    print("  - Nonlinear relationships")
    print("  - Interactions (feature effects that depend on other features)")
    print("  - Categorical features with interaction effects")
    print("  - Multiple regimes / heterogeneous relationships")

trees_win_scenarios()
```

Linear models and trees occupy different positions on the bias-variance spectrum.
Linear regression: high bias when the true relationship is nonlinear (a global linear form cannot bend), but low variance; coefficient estimates are stable across samples.
Full-grown regression trees: low bias (with enough splits they can fit almost any function), but high variance; small changes in the data can reshape the entire tree.
Mathematical perspective:
For a true function $f(\mathbf{x})$:
$$\text{MSE}(\hat{f}) = \underbrace{(f(\mathbf{x}) - \mathbb{E}[\hat{f}(\mathbf{x})])^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}[(\hat{f}(\mathbf{x}) - \mathbb{E}[\hat{f}(\mathbf{x})])^2]}_{\text{Variance}} + \sigma^2$$
Linear models trade high bias for low variance; trees do the opposite.
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

def bias_variance_comparison():
    """Empirically compare bias and variance of linear vs tree models."""
    np.random.seed(42)

    # True function (nonlinear)
    def f(x):
        return np.sin(2 * np.pi * x) + 0.5 * x

    # Fixed test points
    X_test = np.linspace(0, 1, 100).reshape(-1, 1)
    y_true = f(X_test.ravel())

    n_bootstrap = 100
    n_train = 50
    models = {
        'Linear': LinearRegression(),
        'Tree (depth 2)': DecisionTreeRegressor(max_depth=2),
        'Tree (depth 5)': DecisionTreeRegressor(max_depth=5),
        'Tree (depth 10)': DecisionTreeRegressor(max_depth=10),
    }

    results = {}
    for name, model_template in models.items():
        predictions = []
        for _ in range(n_bootstrap):
            # Generate training data
            X_train = np.random.uniform(0, 1, n_train).reshape(-1, 1)
            y_train = f(X_train.ravel()) + np.random.randn(n_train) * 0.3

            # Fit a fresh copy of the model
            model = type(model_template)(**model_template.get_params())
            model.fit(X_train, y_train)
            predictions.append(model.predict(X_test))

        predictions = np.array(predictions)
        # Expected prediction at each test point
        mean_pred = np.mean(predictions, axis=0)
        # Bias² = (E[f̂] - f)², averaged over test points
        bias_sq = np.mean((mean_pred - y_true) ** 2)
        # Variance = E[(f̂ - E[f̂])²]
        variance = np.mean(np.var(predictions, axis=0))
        # Total error (using all predictions)
        total_mse = np.mean((predictions - y_true) ** 2)
        results[name] = {'bias_sq': bias_sq, 'variance': variance, 'mse': total_mse}

    print("Bias-Variance Comparison")
    print("=" * 65)
    print(f"{'Model':<20} {'Bias²':>12} {'Variance':>12} {'MSE':>12}")
    print("-" * 58)
    for name, res in results.items():
        print(f"{name:<20} {res['bias_sq']:>12.4f} {res['variance']:>12.4f} {res['mse']:>12.4f}")

    print("Observations:")
    print("- Linear: high bias (can't capture sin), low variance")
    print("- Shallow tree: moderate bias, moderate variance")
    print("- Deep tree: low bias, high variance (overfitting)")
    print("- Best MSE at intermediate complexity")
    return results

def optimal_complexity_curve():
    """Show that the optimal model complexity differs by sample size."""
    np.random.seed(42)

    def f(x):
        return np.sin(2 * np.pi * x)

    X_test = np.linspace(0, 1, 200).reshape(-1, 1)
    y_test = f(X_test.ravel())

    print("Optimal Tree Depth vs Sample Size")
    print("=" * 55)
    sample_sizes = [20, 50, 100, 200, 500, 1000]
    print(f"{'n':>6} {'Best depth':>12} {'Best MSE':>12}")
    print("-" * 35)

    for n in sample_sizes:
        best_mse = float('inf')
        best_depth = 1
        for depth in range(1, 15):
            mses = []
            for _ in range(20):
                X_train = np.random.uniform(0, 1, n).reshape(-1, 1)
                y_train = f(X_train.ravel()) + np.random.randn(n) * 0.3
                tree = DecisionTreeRegressor(max_depth=depth)
                tree.fit(X_train, y_train)
                mses.append(np.mean((tree.predict(X_test) - y_test) ** 2))
            avg_mse = np.mean(mses)
            if avg_mse < best_mse:
                best_mse = avg_mse
                best_depth = depth
        print(f"{n:>6} {best_depth:>12} {best_mse:>12.4f}")

    print("→ Larger samples support more complex trees")

bias_variance_comparison()
optimal_complexity_curve()
```

Random Forests address tree variance through averaging. Gradient Boosting builds trees incrementally to reduce bias while controlling variance via the learning rate. These ensemble methods combine the flexibility of trees with variance reduction, often achieving the best of both worlds.
Both linear models and trees are considered "interpretable," but they offer different kinds of interpretability.
Linear model interpretability: global. Each coefficient βⱼ summarizes the effect of feature xⱼ everywhere in feature space, and standard errors, confidence intervals, and p-values are readily available.
Tree interpretability: local. Each prediction is explained by the short if-then decision path that reaches its leaf, and the leaves define explicit subgroups of the data.
Which is more interpretable?
It depends on the audience and the question:
| Interpretation Need | Linear Model | Regression Tree |
|---|---|---|
| Feature importance ranking | \|βⱼ\| or standardized \|βⱼ\| | Impurity reduction, permutation |
| Effect of increasing xⱼ | βⱼ (constant everywhere) | Depends on current region |
| Why this prediction? | β₀ + Σβⱼxⱼ breakdown | Decision path traversal |
| Subgroup analysis | Requires manual interactions | Naturally defines subgroups |
| Statistical significance | t-tests, p-values available | Not standard (bootstrap works) |
| Stakeholder communication | "Each unit increases by..." | "If X then..." |
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, export_text

def compare_interpretations():
    """Show different interpretation styles for the same prediction."""
    np.random.seed(42)

    # Sample data with feature names
    n = 200
    age = np.random.uniform(20, 70, n)
    income = np.random.uniform(30000, 150000, n)
    education_years = np.random.uniform(12, 22, n)
    X = np.column_stack([age, income, education_years])
    feature_names = ['age', 'income', 'education_years']

    # Target: some nonlinear function
    y = (0.5 * age + 0.0001 * income + 2 * education_years
         - 0.01 * age * (income > 80000)  # interaction
         + np.random.randn(n) * 5)

    # Fit both models
    linear = LinearRegression()
    linear.fit(X, y)
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, y)

    # Sample point to explain
    sample = np.array([[45, 75000, 16]])  # 45 years, $75k, 16 years education

    print("Interpreting Prediction for:")
    print("  Age: 45, Income: $75,000, Education: 16 years")
    print("=" * 60)

    # Linear interpretation
    y_linear = linear.predict(sample)[0]
    print(f"Linear Model Prediction: {y_linear:.2f}")
    print("Breakdown (global interpretation):")
    print(f"  Intercept:    {linear.intercept_:>12.4f}")
    for i, (name, coef) in enumerate(zip(feature_names, linear.coef_)):
        contribution = coef * sample[0, i]
        print(f"  {name:12s}: {coef:>12.6f} × {sample[0, i]:>8.0f} = {contribution:>12.2f}")
    print(f"  {'Total':12s}: {' ':>12s} {' ':>8s} {y_linear:>12.2f}")

    # Tree interpretation
    y_tree = tree.predict(sample)[0]
    print(f"Tree Model Prediction: {y_tree:.2f}")
    print("Decision Path (local interpretation):")
    node_indicator = tree.decision_path(sample)
    node_indices = node_indicator.indices
    tree_struct = tree.tree_
    for node_id in node_indices:
        if tree_struct.children_left[node_id] == tree_struct.children_right[node_id]:
            # Leaf nodes have identical left/right children (-1)
            print(f"  → Leaf: predict {tree_struct.value[node_id].flatten()[0]:.2f}")
        else:
            feat = tree_struct.feature[node_id]
            thresh = tree_struct.threshold[node_id]
            value = sample[0, feat]
            direction = "≤" if value <= thresh else ">"
            print(f"  IF {feature_names[feat]} ({value:.0f}) {direction} {thresh:.1f}")

    # Tree as rules
    print("Full Tree Rules:")
    print(export_text(tree, feature_names=feature_names))

def feature_importance_comparison():
    """Compare feature importance measures."""
    np.random.seed(42)
    n = 500
    X = np.random.randn(n, 5)
    # y depends mainly on x₀, x₁, and x₂, with an interaction
    y = (3 * X[:, 0] + 2 * X[:, 1] + 1 * X[:, 2]
         + 0.5 * X[:, 0] * X[:, 1]  # interaction
         + np.random.randn(n) * 0.5)

    linear = LinearRegression()
    linear.fit(X, y)
    tree = DecisionTreeRegressor(max_depth=5)
    tree.fit(X, y)

    print("Feature Importance Comparison")
    print("=" * 55)
    print("True importance: x₀ (3), x₁ (2), x₂ (1), x₃ & x₄ (0)")
    print("Also: x₀*x₁ interaction exists")

    print("Linear Model (|coefficients|):")
    for i, imp in enumerate(np.abs(linear.coef_)):
        print(f"  x{i}: {imp:.4f}")

    print("Decision Tree (impurity importance):")
    for i, imp in enumerate(tree.feature_importances_):
        print(f"  x{i}: {imp:.4f}")

    print("Note: Linear model can't detect interaction importance")
    print("      Tree splits on x₀ and x₁ to capture the interaction")

compare_interpretations()
feature_importance_comparison()
```

When facing a new regression problem, here's a systematic approach to choosing between linear models and trees (or deciding to try both).
Decision factors:
- Prior knowledge about the relationship
- Sample size and feature dimension
- Interpretation requirements
- Computational constraints
- Deployment environment
```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures

def model_selection_framework(X, y, verbose=True):
    """
    Systematic comparison of linear vs tree models for a given dataset.
    Returns detailed results and a recommendation.
    """
    n, p = X.shape
    results = {}

    # Candidate models
    models = {
        'Linear': LinearRegression(),
        'Ridge (α=1)': Ridge(alpha=1.0),
        'Tree (depth 3)': DecisionTreeRegressor(max_depth=3),
        'Tree (depth 5)': DecisionTreeRegressor(max_depth=5),
        'Tree (depth 7)': DecisionTreeRegressor(max_depth=7),
        'Random Forest': RandomForestRegressor(n_estimators=100, max_depth=5,
                                               random_state=42),
    }

    # Add polynomial features if p is small
    if p <= 5:
        poly = PolynomialFeatures(degree=2, include_bias=False)
        poly.fit(X)
        models['Linear+Poly2'] = ('poly', LinearRegression(), poly)

    if verbose:
        print(f"Dataset: n={n}, p={p}")
        print("=" * 60)
        print(f"{'Model':<20} {'CV Score (R²)':>15} {'Std':>10}")
        print("-" * 50)

    for name, model in models.items():
        if isinstance(model, tuple) and model[0] == 'poly':
            # Handle polynomial features
            _, base_model, transformer = model
            X_transformed = transformer.transform(X)
            scores = cross_val_score(base_model, X_transformed, y, cv=5, scoring='r2')
        else:
            scores = cross_val_score(model, X, y, cv=5, scoring='r2')
        results[name] = {'mean': scores.mean(), 'std': scores.std(), 'scores': scores}
        if verbose:
            print(f"{name:<20} {scores.mean():>15.4f} {scores.std():>10.4f}")

    # Determine recommendation
    best_model = max(results, key=lambda k: results[k]['mean'])
    # Check if linear is competitive (within 0.02 of best)
    linear_score = results['Linear']['mean']
    best_score = results[best_model]['mean']
    linear_competitive = (best_score - linear_score) < 0.02

    if verbose:
        print("Recommendation:")
        print(f"  Best model: {best_model} (R² = {best_score:.4f})")
        if linear_competitive and 'Linear' in best_model:
            print("  → Use Linear: interpretable and performs best")
        elif linear_competitive:
            print("  → Consider Linear: nearly as good, more interpretable")
        else:
            print(f"  → Use {best_model}: significantly outperforms linear")

    return results, best_model

def comprehensive_comparison():
    """Compare on multiple synthetic datasets."""
    np.random.seed(42)

    # Scenario 1: Linear truth
    n = 200
    X = np.random.randn(n, 5)
    y = 2*X[:, 0] + X[:, 1] - 0.5*X[:, 2] + np.random.randn(n)*0.5
    print("Scenario: Linear relationship")
    model_selection_framework(X, y)

    # Scenario 2: Nonlinear
    y = np.sin(X[:, 0]) + X[:, 1]**2 + np.random.randn(n)*0.3
    print("Scenario: Nonlinear relationship")
    model_selection_framework(X, y)

    # Scenario 3: Interaction
    y = np.where(X[:, 0] > 0, X[:, 1], -X[:, 1]) + np.random.randn(n)*0.3
    print("Scenario: Strong interaction")
    model_selection_framework(X, y)

    # Scenario 4: Small n
    X_small = np.random.randn(30, 5)
    y_small = X_small[:, 0] + X_small[:, 1] + np.random.randn(30)*0.5
    print("Scenario: Small sample (n=30)")
    model_selection_framework(X_small, y_small)

comprehensive_comparison()
```

When in doubt: (1) start with linear regression as a baseline, (2) try a simple tree (depth 3-5) to check for nonlinearity, (3) if the tree is significantly better, consider a Random Forest for production. This progression reveals whether the added complexity of trees is warranted for your specific data.
Rather than choosing between linear models and trees, modern practice often combines them to leverage complementary strengths.
Approaches to combination:
- Model trees (M5): linear regression in tree leaves
- RuleFit: tree-derived features for a linear model
- Cubist: instance-based tree with linear smoothing
- Stacking: trees and linear models as separate base learners
- Gradient boosting with linear base learners
```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import StackingRegressor
from sklearn.model_selection import KFold, cross_val_score

class RuleFitSimple:
    """Simplified RuleFit: extract tree rules as features for a linear model."""

    def __init__(self, n_trees=10, max_depth=3, alpha=1.0):
        self.n_trees = n_trees
        self.max_depth = max_depth
        self.alpha = alpha
        self.trees = []
        self.linear = Ridge(alpha=alpha)

    def fit(self, X, y):
        X = np.atleast_2d(X)
        n_samples = X.shape[0]
        # Fit multiple trees on bootstrap samples
        self.trees = []
        for _ in range(self.n_trees):
            idx = np.random.choice(n_samples, n_samples, replace=True)
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X[idx], y[idx])
            self.trees.append(tree)
        # Combine original features with rule features and fit the linear model
        X_rules = self._transform_to_rules(X)
        X_combined = np.hstack([X, X_rules])
        self.linear.fit(X_combined, y)
        return self

    def _transform_to_rules(self, X):
        """Convert tree structure to binary rule features (one per leaf)."""
        all_rules = []
        for tree in self.trees:
            # Use the tree's own leaf node ids as fixed columns, so the
            # feature layout is identical at fit and predict time
            leaf_ids = np.where(tree.tree_.children_left == -1)[0]
            leaves = tree.apply(X)
            leaf_indicators = np.zeros((X.shape[0], len(leaf_ids)))
            for i, leaf_id in enumerate(leaf_ids):
                # Rule: "all split conditions on the path to this leaf hold"
                leaf_indicators[:, i] = (leaves == leaf_id).astype(float)
            all_rules.append(leaf_indicators)
        return np.hstack(all_rules)

    def predict(self, X):
        X = np.atleast_2d(X)
        X_rules = self._transform_to_rules(X)
        X_combined = np.hstack([X, X_rules])
        return self.linear.predict(X_combined)

def demonstrate_combined_models():
    """Show how combined models can outperform pure approaches."""
    np.random.seed(42)

    # Data with both a linear trend and local effects
    n = 500
    X = np.random.randn(n, 3)
    y = (2 * X[:, 0]                                  # linear trend
         + np.where(X[:, 1] > 0, X[:, 2], -X[:, 2])  # interaction
         + np.random.randn(n) * 0.5)

    print("Combined Model Demonstration")
    print("=" * 60)
    print("Data: linear trend + interaction (neither model ideal alone)")

    # Pure linear
    linear = Ridge(alpha=1.0)
    linear_scores = cross_val_score(linear, X, y, cv=5, scoring='r2')

    # Pure tree
    tree = DecisionTreeRegressor(max_depth=5)
    tree_scores = cross_val_score(tree, X, y, cv=5, scoring='r2')

    # Stacking
    stacking = StackingRegressor(
        estimators=[
            ('linear', Ridge(alpha=1.0)),
            ('tree', DecisionTreeRegressor(max_depth=5)),
        ],
        final_estimator=Ridge(alpha=0.1),
    )
    stacking_scores = cross_val_score(stacking, X, y, cv=5, scoring='r2')

    # RuleFit-style: manual cross-validation since it lacks the sklearn API
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    rulefit_scores = []
    for train_idx, test_idx in kf.split(X):
        rf = RuleFitSimple(n_trees=10, max_depth=3)
        rf.fit(X[train_idx], y[train_idx])
        pred = rf.predict(X[test_idx])
        r2 = 1 - np.sum((y[test_idx] - pred)**2) / np.sum((y[test_idx] - y[test_idx].mean())**2)
        rulefit_scores.append(r2)
    rulefit_scores = np.array(rulefit_scores)

    print(f"{'Model':<20} {'CV R²':>12} {'Std':>10}")
    print("-" * 45)
    print(f"{'Ridge (linear)':<20} {linear_scores.mean():>12.4f} {linear_scores.std():>10.4f}")
    print(f"{'Tree (depth 5)':<20} {tree_scores.mean():>12.4f} {tree_scores.std():>10.4f}")
    print(f"{'Stacking':<20} {stacking_scores.mean():>12.4f} {stacking_scores.std():>10.4f}")
    print(f"{'RuleFit-style':<20} {rulefit_scores.mean():>12.4f} {rulefit_scores.std():>10.4f}")
    print("→ Combined models capture both linear and nonlinear aspects")

demonstrate_combined_models()
```

We've conducted a comprehensive comparison of two fundamental regression paradigms—revealing when each excels and how they can be combined.
Module Complete:
This page concludes the Regression Trees module. You've learned the complete theory and practice of regression trees—from the CART algorithm and MSE splitting criterion, through leaf predictions and piecewise constant approximation theory, to comprehensive comparison with linear models. This foundation prepares you for understanding ensemble methods like Random Forests and Gradient Boosting, which build on individual trees to create powerful predictive models.
Congratulations! You have mastered regression trees. You understand CART's greedy recursive partitioning, MSE as variance reduction, optimal leaf predictions, the piecewise constant approximation framework, and when trees outperform or underperform linear models. This deep knowledge forms the foundation for understanding modern ensemble methods and making informed modeling decisions in practice.