Linear regression and decision trees represent two fundamentally different philosophies of function approximation. Linear models assume a global parametric form—the same coefficients apply everywhere in feature space. Trees make no global assumptions—they partition the space and handle each region separately.
This philosophical divide leads to dramatically different behavior: where one excels, the other struggles; where one naturally captures patterns, the other requires careful engineering. Understanding this contrast is essential for choosing the right tool and for appreciating why ensemble methods often combine both paradigms.
By the end of this page, you will understand: the fundamental mathematical differences between linear models and trees, scenarios where each approach dominates, how assumptions affect performance, interpretability trade-offs, practical guidelines for model selection, and how modern methods combine both paradigms to leverage their complementary strengths.
Let's establish the precise mathematical forms and their implications.
Linear regression:
$$\hat{f}_{\text{linear}}(\mathbf{x}) = \beta_0 + \sum_{j=1}^{p} \beta_j x_j = \boldsymbol{\beta}^\top \tilde{\mathbf{x}}$$
where $\tilde{\mathbf{x}} = (1, x_1, \ldots, x_p)^\top$ and $\boldsymbol{\beta}$ is learned from data.
Regression tree:
$$\hat{f}_{\text{tree}}(\mathbf{x}) = \sum_{m=1}^{M} c_m \cdot \mathbb{1}_{R_m}(\mathbf{x})$$
where $\{R_m\}_{m=1}^{M}$ are axis-aligned rectangular regions and $c_m$ are constants.
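To make the indicator sum concrete, here is a minimal sketch; the single split at $x = 0$ and the constants $c_1, c_2$ are made up for illustration:

```python
import numpy as np

# Hypothetical 1-D tree with M = 2 regions split at x = 0:
#   f_tree(x) = c1 · 1[x ≤ 0] + c2 · 1[x > 0]
def tree_predict(x, split=0.0, c1=-1.5, c2=2.0):
    """Evaluate the sum of region indicators times region constants."""
    return c1 * (x <= split) + c2 * (x > split)

x = np.array([-2.0, -0.1, 0.1, 3.0])
print(tree_predict(x))  # every point left of the split maps to c1, right to c2
```

A fitted CART tree does exactly this, only with more regions and with each $c_m$ set to the mean response in region $R_m$.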
Key structural differences:
| Aspect | Linear Regression | Regression Tree |
|---|---|---|
| Functional form | Hyperplane (p+1 parameters) | Piecewise constant (M regions) |
| Decision boundary | Single global hyperplane | Multiple axis-aligned splits |
| Effect of feature xⱼ | Same everywhere (βⱼ) | Depends on which region |
| Interactions | Must be manually specified | Automatically captured by hierarchy |
| Continuity | Continuous and smooth | Discontinuous at split boundaries |
| Extrapolation | Linear extension (risky) | Constant extension (conservative) |
| Parameter count | Fixed at p+1 | Grows with tree size (2M-1) |
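The extrapolation row is worth seeing in action. Below is a small sketch on synthetic 1-D data (the training range and slope are arbitrary choices): the linear model extends its fitted slope far beyond the data, while the tree returns the constant of its edge region.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Train on x in [0, 1] with y ≈ 3x, then query far outside that range
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 1, (200, 1))
y_train = 3 * X_train.ravel() + rng.normal(0, 0.1, 200)

linear = LinearRegression().fit(X_train, y_train)
tree = DecisionTreeRegressor(max_depth=4).fit(X_train, y_train)

X_far = np.array([[10.0]])  # far outside the training range
# Linear extends the fitted slope (prediction near 30, risky if the trend breaks)
# Tree returns the mean of its rightmost leaf (prediction near 3, conservative)
print("Linear:", linear.predict(X_far)[0])
print("Tree:  ", tree.predict(X_far)[0])
```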
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

def compare_model_forms():
    """Illustrate fundamental differences in model forms."""
    np.random.seed(42)

    # Simple 2D example
    X = np.random.randn(100, 2)
    y = 2 * X[:, 0] - X[:, 1] + 0.5 * X[:, 0] * X[:, 1] + np.random.randn(100) * 0.5

    # Fit both models
    linear = LinearRegression()
    linear.fit(X, y)
    tree = DecisionTreeRegressor(max_depth=4)
    tree.fit(X, y)

    print("Model Form Comparison")
    print("=" * 60)

    # Linear model representation
    print("Linear Regression:")
    print(f"  f(x) = {linear.intercept_:.3f} + {linear.coef_[0]:.3f}*x₁ + {linear.coef_[1]:.3f}*x₂")
    print(f"  Parameters: {1 + len(linear.coef_)} (intercept + coefficients)")
    print("  Same coefficients apply to all x")

    # Tree model representation
    print("Decision Tree:")
    print(f"  Number of leaves (regions): {tree.get_n_leaves()}")
    print(f"  Total nodes: {tree.tree_.node_count}")
    print("  Different constants in different regions")

    # Show how predictions differ
    test_points = np.array([[-2, -2], [0, 0], [2, 2]])
    print("Predictions at test points:")
    print(f"{'Point':>15} {'Linear':>12} {'Tree':>12}")
    print("-" * 42)
    for point in test_points:
        y_linear = linear.predict(point.reshape(1, -1))[0]
        y_tree = tree.predict(point.reshape(1, -1))[0]
        print(f"{str(point):>15} {y_linear:>12.4f} {y_tree:>12.4f}")

    # Gradient comparison
    print("Gradient (∂f/∂x₁):")
    print(f"  Linear: {linear.coef_[0]:.3f} (constant everywhere)")
    print("  Tree: 0 almost everywhere (piecewise constant)")

def effect_locality():
    """Show how feature effects differ: global (linear) vs local (tree)."""
    np.random.seed(42)

    # Data where the effect of x₁ depends on x₂
    n = 500
    X = np.random.randn(n, 2)
    # Effect of x₁ is positive when x₂ > 0, negative when x₂ < 0
    y = np.where(X[:, 1] > 0, 2 * X[:, 0], -2 * X[:, 0]) + np.random.randn(n) * 0.3

    linear = LinearRegression()
    linear.fit(X, y)
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, y)

    print("Feature Effect Locality")
    print("=" * 50)
    print("True model: y = 2*x₁ if x₂ > 0 else -2*x₁")
    print("(Effect of x₁ flips depending on x₂)")
    print("Linear regression coefficients:")
    print(f"  β₁ (for x₁): {linear.coef_[0]:.3f}")
    print(f"  β₂ (for x₂): {linear.coef_[1]:.3f}")
    print("  → Linear model sees average effect ≈ 0 for x₁!")

    # Test in different regions
    regions = [
        ("x₂ > 0 region", np.array([[1, 1]])),
        ("x₂ < 0 region", np.array([[1, -1]])),
    ]
    print("Predictions for x₁=1 in different x₂ regions:")
    for name, point in regions:
        y_lin = linear.predict(point)[0]
        y_tree = tree.predict(point)[0]
        print(f"  {name}: Linear={y_lin:.2f}, Tree={y_tree:.2f}")
    print("→ Tree captures the interaction; linear model fails")

compare_model_forms()
effect_locality()
```

Linear regression has dominated statistical practice for over a century because in the right settings, it is remarkably effective and efficient.
Scenarios favoring linear models:
- True relationship is linear or approximately linear
- Features are transformations of raw inputs
- High-dimensional, sparse settings
- Inference is the primary goal
- Small sample sizes
```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

def linear_wins_scenarios():
    """Demonstrate scenarios where linear models outperform trees."""
    np.random.seed(42)
    scenarios = []

    # Scenario 1: Truly linear relationship
    print("Scenario 1: Truly Linear Relationship")
    print("-" * 50)
    n = 200
    X = np.random.randn(n, 5)
    y = 2*X[:, 0] - 3*X[:, 1] + X[:, 2] + np.random.randn(n) * 0.5

    linear = LinearRegression()
    tree = DecisionTreeRegressor(max_depth=5)
    linear_scores = cross_val_score(linear, X, y, cv=5, scoring='r2')
    tree_scores = cross_val_score(tree, X, y, cv=5, scoring='r2')
    print(f"  Linear R²: {linear_scores.mean():.4f} (±{linear_scores.std():.4f})")
    print(f"  Tree R²:   {tree_scores.mean():.4f} (±{tree_scores.std():.4f})")
    scenarios.append(('Linear truth', linear_scores.mean(), tree_scores.mean()))

    # Scenario 2: Small sample size
    print("Scenario 2: Small Sample Size (n=30)")
    print("-" * 50)
    n = 30
    X = np.random.randn(n, 3)
    y = X[:, 0] + 0.5*X[:, 1] + np.random.randn(n) * 0.3
    linear_scores = cross_val_score(linear, X, y, cv=5, scoring='r2')
    tree_scores = cross_val_score(tree, X, y, cv=5, scoring='r2')
    print(f"  Linear R²: {linear_scores.mean():.4f} (±{linear_scores.std():.4f})")
    print(f"  Tree R²:   {tree_scores.mean():.4f} (±{tree_scores.std():.4f})")
    scenarios.append(('Small n', linear_scores.mean(), tree_scores.mean()))

    # Scenario 3: High-dimensional sparse
    print("Scenario 3: High-Dimensional Sparse (p=50, n=100)")
    print("-" * 50)
    n, p = 100, 50
    X = np.random.randn(n, p)
    # Only 3 of the 50 features matter
    y = X[:, 0] + 2*X[:, 1] - X[:, 2] + np.random.randn(n) * 0.5
    ridge = Ridge(alpha=1.0)
    ridge_scores = cross_val_score(ridge, X, y, cv=5, scoring='r2')
    tree_scores = cross_val_score(tree, X, y, cv=5, scoring='r2')
    print(f"  Ridge R²: {ridge_scores.mean():.4f} (±{ridge_scores.std():.4f})")
    print(f"  Tree R²:  {tree_scores.mean():.4f} (±{tree_scores.std():.4f})")
    scenarios.append(('High-dim sparse', ridge_scores.mean(), tree_scores.mean()))

    print("Summary: Linear models excel with:")
    print("  - Linear relationships")
    print("  - Small samples (low variance)")
    print("  - High dimensions with sparsity (regularization)")
    return scenarios

linear_wins_scenarios()
```

Trees shine in situations where linear models fundamentally fail—where the relationship between features and target is complex, interactive, or regionally varying.
Scenarios favoring trees:
- Nonlinear relationships without known form
- Strong interactions between features
- Mixed feature types
- Heterogeneous data
- Interpretability through rules
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

def trees_win_scenarios():
    """Demonstrate scenarios where trees outperform linear models."""
    np.random.seed(42)

    # Scenario 1: Nonlinear relationship
    print("Scenario 1: Nonlinear Relationship (step function)")
    print("-" * 55)
    n = 300
    X = np.random.randn(n, 3)
    # Step function in x₁
    y = np.where(X[:, 0] > 0, 5, -5) + np.random.randn(n) * 0.5

    linear = LinearRegression()
    tree = DecisionTreeRegressor(max_depth=3)
    linear_scores = cross_val_score(linear, X, y, cv=5, scoring='r2')
    tree_scores = cross_val_score(tree, X, y, cv=5, scoring='r2')
    print(f"  Linear R²: {linear_scores.mean():.4f}")
    print(f"  Tree R²:   {tree_scores.mean():.4f}")

    # Scenario 2: Strong interaction
    print("Scenario 2: Strong Interaction (XOR-like)")
    print("-" * 55)
    X = np.random.randn(n, 2)
    # XOR: positive output when signs differ
    y = np.where((X[:, 0] > 0) != (X[:, 1] > 0), 5, -5) + np.random.randn(n) * 0.5
    linear_scores = cross_val_score(linear, X, y, cv=5, scoring='r2')
    tree_scores = cross_val_score(tree, X, y, cv=5, scoring='r2')
    print(f"  Linear R²: {linear_scores.mean():.4f}")
    print(f"  Tree R²:   {tree_scores.mean():.4f}")

    # Scenario 3: Categorical features (encoded)
    print("Scenario 3: Categorical Feature Importance")
    print("-" * 55)
    n = 400
    # Categorical with 4 levels, each with a different effect
    cat = np.random.choice([0, 1, 2, 3], n)
    X_cont = np.random.randn(n, 2)
    # One-hot encoding for the linear model
    X_linear = np.column_stack([
        X_cont,
        (cat == 0).astype(float),
        (cat == 1).astype(float),
        (cat == 2).astype(float),
        (cat == 3).astype(float),
    ])
    # Tree can use the raw categorical code
    X_tree = np.column_stack([X_cont, cat])
    # Effect depends on category (interaction)
    y = np.where(cat == 0, 2*X_cont[:, 0],
        np.where(cat == 1, -2*X_cont[:, 0],
        np.where(cat == 2, 3*X_cont[:, 1], -3*X_cont[:, 1])))
    y += np.random.randn(n) * 0.5

    linear_scores = cross_val_score(linear, X_linear, y, cv=5, scoring='r2')
    tree_scores = cross_val_score(tree, X_tree, y, cv=5, scoring='r2')
    print(f"  Linear R²: {linear_scores.mean():.4f}")
    print(f"  Tree R²:   {tree_scores.mean():.4f}")

    # Scenario 4: Multiple regimes
    print("Scenario 4: Multiple Regimes (different slopes)")
    print("-" * 55)
    X = np.random.uniform(-2, 2, (n, 1))
    # Different slopes in different regions
    y = np.where(X[:, 0] < -1, 5*X[:, 0],
        np.where(X[:, 0] < 1, -3*X[:, 0], 2*X[:, 0]))
    y += np.random.randn(n) * 0.3
    linear_scores = cross_val_score(linear, X, y, cv=5, scoring='r2')
    tree_scores = cross_val_score(DecisionTreeRegressor(max_depth=4), X, y,
                                  cv=5, scoring='r2')
    print(f"  Linear R²: {linear_scores.mean():.4f}")
    print(f"  Tree R²:   {tree_scores.mean():.4f}")

    print("Summary: Trees excel with:")
    print("  - Nonlinear relationships")
    print("  - Interactions (feature effects that depend on other features)")
    print("  - Categorical features with interaction effects")
    print("  - Multiple regimes / heterogeneous relationships")

trees_win_scenarios()
```

Linear models and trees occupy different positions on the bias-variance spectrum.
Linear regression: high bias when the true relationship is nonlinear (a global linear form cannot bend), but low variance; coefficient estimates are stable across samples.
Full-grown regression trees: low bias (with enough splits they can fit almost any function), but high variance; small changes in the data can reshape the entire tree.
Mathematical perspective:
For a true function $f(\mathbf{x})$:
$$\text{MSE}(\hat{f}) = \underbrace{(f(\mathbf{x}) - \mathbb{E}[\hat{f}(\mathbf{x})])^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}[(\hat{f}(\mathbf{x}) - \mathbb{E}[\hat{f}(\mathbf{x})])^2]}_{\text{Variance}} + \sigma^2$$
Linear models trade high bias for low variance; trees do the opposite.
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

def bias_variance_comparison():
    """Empirically compare bias and variance of linear vs tree models."""
    np.random.seed(42)

    # True function (nonlinear)
    def f(x):
        return np.sin(2 * np.pi * x) + 0.5 * x

    # Fixed test points
    X_test = np.linspace(0, 1, 100).reshape(-1, 1)
    y_true = f(X_test.ravel())

    n_bootstrap = 100
    n_train = 50
    models = {
        'Linear': LinearRegression(),
        'Tree (depth 2)': DecisionTreeRegressor(max_depth=2),
        'Tree (depth 5)': DecisionTreeRegressor(max_depth=5),
        'Tree (depth 10)': DecisionTreeRegressor(max_depth=10),
    }

    results = {}
    for name, model_template in models.items():
        predictions = []
        for _ in range(n_bootstrap):
            # Generate training data
            X_train = np.random.uniform(0, 1, n_train).reshape(-1, 1)
            y_train = f(X_train.ravel()) + np.random.randn(n_train) * 0.3

            # Fit a fresh copy of the model
            model = type(model_template)(**model_template.get_params())
            model.fit(X_train, y_train)
            predictions.append(model.predict(X_test))

        predictions = np.array(predictions)
        # Expected prediction at each test point
        mean_pred = np.mean(predictions, axis=0)
        # Bias² = (E[f̂] - f)², averaged over test points
        bias_sq = np.mean((mean_pred - y_true) ** 2)
        # Variance = E[(f̂ - E[f̂])²]
        variance = np.mean(np.var(predictions, axis=0))
        # Total error (using all predictions)
        total_mse = np.mean((predictions - y_true) ** 2)
        results[name] = {'bias_sq': bias_sq, 'variance': variance, 'mse': total_mse}

    print("Bias-Variance Comparison")
    print("=" * 65)
    print(f"{'Model':<20} {'Bias²':>12} {'Variance':>12} {'MSE':>12}")
    print("-" * 58)
    for name, res in results.items():
        print(f"{name:<20} {res['bias_sq']:>12.4f} {res['variance']:>12.4f} {res['mse']:>12.4f}")

    print("Observations:")
    print("- Linear: high bias (can't capture sin), low variance")
    print("- Shallow tree: moderate bias, moderate variance")
    print("- Deep tree: low bias, high variance (overfitting)")
    print("- Best MSE at intermediate complexity")
    return results

def optimal_complexity_curve():
    """Show that the optimal model complexity differs by sample size."""
    np.random.seed(42)

    def f(x):
        return np.sin(2 * np.pi * x)

    X_test = np.linspace(0, 1, 200).reshape(-1, 1)
    y_test = f(X_test.ravel())

    print("Optimal Tree Depth vs Sample Size")
    print("=" * 55)
    sample_sizes = [20, 50, 100, 200, 500, 1000]
    print(f"{'n':>6} {'Best depth':>12} {'Best MSE':>12}")
    print("-" * 35)

    for n in sample_sizes:
        best_mse = float('inf')
        best_depth = 1
        for depth in range(1, 15):
            mses = []
            for _ in range(20):
                X_train = np.random.uniform(0, 1, n).reshape(-1, 1)
                y_train = f(X_train.ravel()) + np.random.randn(n) * 0.3
                tree = DecisionTreeRegressor(max_depth=depth)
                tree.fit(X_train, y_train)
                mses.append(np.mean((tree.predict(X_test) - y_test) ** 2))
            avg_mse = np.mean(mses)
            if avg_mse < best_mse:
                best_mse = avg_mse
                best_depth = depth
        print(f"{n:>6} {best_depth:>12} {best_mse:>12.4f}")

    print("→ Larger samples support more complex trees")

bias_variance_comparison()
optimal_complexity_curve()
```

Random Forests address tree variance through averaging. Gradient Boosting builds trees incrementally to reduce bias while controlling variance via the learning rate. These ensemble methods combine the flexibility of trees with variance reduction, often achieving the best of both worlds.
Both linear models and trees are considered "interpretable," but they offer different kinds of interpretability.
Linear model interpretability: global. Each coefficient βⱼ summarizes the effect of feature xⱼ everywhere in feature space, and standard errors, confidence intervals, and p-values are readily available.
Tree interpretability: local. Each prediction is explained by the short if-then decision path that reaches its leaf, and the leaves define explicit subgroups of the data.
Which is more interpretable?
It depends on the audience and the question:
| Interpretation Need | Linear Model | Regression Tree |
|---|---|---|
| Feature importance ranking | \|βⱼ\| or standardized \|βⱼ\| | Impurity reduction, permutation |
| Effect of increasing xⱼ | βⱼ (constant everywhere) | Depends on current region |
| Why this prediction? | β₀ + Σβⱼxⱼ breakdown | Decision path traversal |
| Subgroup analysis | Requires manual interactions | Naturally defines subgroups |
| Statistical significance | t-tests, p-values available | Not standard (bootstrap works) |
| Stakeholder communication | "Each unit increases by..." | "If X then..." |
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, export_text

def compare_interpretations():
    """Show different interpretation styles for the same prediction."""
    np.random.seed(42)

    # Sample data with feature names
    n = 200
    age = np.random.uniform(20, 70, n)
    income = np.random.uniform(30000, 150000, n)
    education_years = np.random.uniform(12, 22, n)
    X = np.column_stack([age, income, education_years])
    feature_names = ['age', 'income', 'education_years']

    # Target: some nonlinear function
    y = (0.5 * age + 0.0001 * income + 2 * education_years
         - 0.01 * age * (income > 80000)  # interaction
         + np.random.randn(n) * 5)

    # Fit both models
    linear = LinearRegression()
    linear.fit(X, y)
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, y)

    # Sample point to explain
    sample = np.array([[45, 75000, 16]])  # 45 years, $75k, 16 years education

    print("Interpreting Prediction for:")
    print("  Age: 45, Income: $75,000, Education: 16 years")
    print("=" * 60)

    # Linear interpretation
    y_linear = linear.predict(sample)[0]
    print(f"Linear Model Prediction: {y_linear:.2f}")
    print("Breakdown (global interpretation):")
    print(f"  Intercept:    {linear.intercept_:>12.4f}")
    for i, (name, coef) in enumerate(zip(feature_names, linear.coef_)):
        contribution = coef * sample[0, i]
        print(f"  {name:12s}: {coef:>12.6f} × {sample[0, i]:>8.0f} = {contribution:>12.2f}")
    print(f"  {'Total':12s}: {' ':>12s} {' ':>8s} {y_linear:>12.2f}")

    # Tree interpretation
    y_tree = tree.predict(sample)[0]
    print(f"Tree Model Prediction: {y_tree:.2f}")
    print("Decision Path (local interpretation):")
    node_indicator = tree.decision_path(sample)
    node_indices = node_indicator.indices
    tree_struct = tree.tree_
    for node_id in node_indices:
        if tree_struct.children_left[node_id] == tree_struct.children_right[node_id]:
            # Leaf nodes have identical left/right children (-1)
            print(f"  → Leaf: predict {tree_struct.value[node_id].flatten()[0]:.2f}")
        else:
            feat = tree_struct.feature[node_id]
            thresh = tree_struct.threshold[node_id]
            value = sample[0, feat]
            direction = "≤" if value <= thresh else ">"
            print(f"  IF {feature_names[feat]} ({value:.0f}) {direction} {thresh:.1f}")

    # Tree as rules
    print("Full Tree Rules:")
    print(export_text(tree, feature_names=feature_names))

def feature_importance_comparison():
    """Compare feature importance measures."""
    np.random.seed(42)
    n = 500
    X = np.random.randn(n, 5)
    # y depends mainly on x₀, x₁, and x₂, with an interaction
    y = (3 * X[:, 0] + 2 * X[:, 1] + 1 * X[:, 2]
         + 0.5 * X[:, 0] * X[:, 1]  # interaction
         + np.random.randn(n) * 0.5)

    linear = LinearRegression()
    linear.fit(X, y)
    tree = DecisionTreeRegressor(max_depth=5)
    tree.fit(X, y)

    print("Feature Importance Comparison")
    print("=" * 55)
    print("True importance: x₀ (3), x₁ (2), x₂ (1), x₃ & x₄ (0)")
    print("Also: x₀*x₁ interaction exists")

    print("Linear Model (|coefficients|):")
    for i, imp in enumerate(np.abs(linear.coef_)):
        print(f"  x{i}: {imp:.4f}")

    print("Decision Tree (impurity importance):")
    for i, imp in enumerate(tree.feature_importances_):
        print(f"  x{i}: {imp:.4f}")

    print("Note: Linear model can't detect interaction importance")
    print("      Tree splits on x₀ and x₁ to capture the interaction")

compare_interpretations()
feature_importance_comparison()
```

When facing a new regression problem, here's a systematic approach to choosing between linear models and trees (or deciding to try both).
Decision factors:
- Prior knowledge about the relationship
- Sample size and feature dimension
- Interpretation requirements
- Computational constraints
- Deployment environment
```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures

def model_selection_framework(X, y, verbose=True):
    """
    Systematic comparison of linear vs tree models for a given dataset.
    Returns detailed results and a recommendation.
    """
    n, p = X.shape
    results = {}

    # Candidate models
    models = {
        'Linear': LinearRegression(),
        'Ridge (α=1)': Ridge(alpha=1.0),
        'Tree (depth 3)': DecisionTreeRegressor(max_depth=3),
        'Tree (depth 5)': DecisionTreeRegressor(max_depth=5),
        'Tree (depth 7)': DecisionTreeRegressor(max_depth=7),
        'Random Forest': RandomForestRegressor(n_estimators=100, max_depth=5,
                                               random_state=42),
    }

    # Add polynomial features if p is small
    if p <= 5:
        poly = PolynomialFeatures(degree=2, include_bias=False)
        poly.fit(X)
        models['Linear+Poly2'] = ('poly', LinearRegression(), poly)

    if verbose:
        print(f"Dataset: n={n}, p={p}")
        print("=" * 60)
        print(f"{'Model':<20} {'CV Score (R²)':>15} {'Std':>10}")
        print("-" * 50)

    for name, model in models.items():
        if isinstance(model, tuple) and model[0] == 'poly':
            # Handle polynomial features
            _, base_model, transformer = model
            X_transformed = transformer.transform(X)
            scores = cross_val_score(base_model, X_transformed, y, cv=5, scoring='r2')
        else:
            scores = cross_val_score(model, X, y, cv=5, scoring='r2')
        results[name] = {'mean': scores.mean(), 'std': scores.std(), 'scores': scores}
        if verbose:
            print(f"{name:<20} {scores.mean():>15.4f} {scores.std():>10.4f}")

    # Determine recommendation
    best_model = max(results, key=lambda k: results[k]['mean'])
    # Check if linear is competitive (within 0.02 of best)
    linear_score = results['Linear']['mean']
    best_score = results[best_model]['mean']
    linear_competitive = (best_score - linear_score) < 0.02

    if verbose:
        print("Recommendation:")
        print(f"  Best model: {best_model} (R² = {best_score:.4f})")
        if linear_competitive and 'Linear' in best_model:
            print("  → Use Linear: interpretable and performs best")
        elif linear_competitive:
            print("  → Consider Linear: nearly as good, more interpretable")
        else:
            print(f"  → Use {best_model}: significantly outperforms linear")

    return results, best_model

def comprehensive_comparison():
    """Compare on multiple synthetic datasets."""
    np.random.seed(42)

    # Scenario 1: Linear truth
    n = 200
    X = np.random.randn(n, 5)
    y = 2*X[:, 0] + X[:, 1] - 0.5*X[:, 2] + np.random.randn(n)*0.5
    print("Scenario: Linear relationship")
    model_selection_framework(X, y)

    # Scenario 2: Nonlinear
    y = np.sin(X[:, 0]) + X[:, 1]**2 + np.random.randn(n)*0.3
    print("Scenario: Nonlinear relationship")
    model_selection_framework(X, y)

    # Scenario 3: Interaction
    y = np.where(X[:, 0] > 0, X[:, 1], -X[:, 1]) + np.random.randn(n)*0.3
    print("Scenario: Strong interaction")
    model_selection_framework(X, y)

    # Scenario 4: Small n
    X_small = np.random.randn(30, 5)
    y_small = X_small[:, 0] + X_small[:, 1] + np.random.randn(30)*0.5
    print("Scenario: Small sample (n=30)")
    model_selection_framework(X_small, y_small)

comprehensive_comparison()
```

When in doubt: (1) start with linear regression as a baseline, (2) try a simple tree (depth 3-5) to check for nonlinearity, (3) if the tree is significantly better, consider a Random Forest for production. This progression reveals whether the added complexity of trees is warranted for your specific data.
Rather than choosing between linear models and trees, modern practice often combines them to leverage complementary strengths.
Approaches to combination:
- Model trees (M5): linear regression in tree leaves
- RuleFit: tree-derived features for a linear model
- Cubist: instance-based tree with linear smoothing
- Stacking: trees and linear models as separate base learners
- Gradient boosting with linear base learners
```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import StackingRegressor
from sklearn.model_selection import KFold, cross_val_score

class RuleFitSimple:
    """Simplified RuleFit: extract tree rules as features for a linear model."""

    def __init__(self, n_trees=10, max_depth=3, alpha=1.0):
        self.n_trees = n_trees
        self.max_depth = max_depth
        self.alpha = alpha
        self.trees = []
        self.linear = Ridge(alpha=alpha)

    def fit(self, X, y):
        X = np.atleast_2d(X)
        n_samples = X.shape[0]
        # Fit multiple trees on bootstrap samples
        self.trees = []
        for _ in range(self.n_trees):
            idx = np.random.choice(n_samples, n_samples, replace=True)
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X[idx], y[idx])
            self.trees.append(tree)
        # Combine original features with rule features and fit the linear model
        X_rules = self._transform_to_rules(X)
        X_combined = np.hstack([X, X_rules])
        self.linear.fit(X_combined, y)
        return self

    def _transform_to_rules(self, X):
        """Convert tree structure to binary rule features (one per leaf)."""
        all_rules = []
        for tree in self.trees:
            # Use the tree's own leaf node ids as fixed columns, so the
            # feature layout is identical at fit and predict time
            leaf_ids = np.where(tree.tree_.children_left == -1)[0]
            leaves = tree.apply(X)
            leaf_indicators = np.zeros((X.shape[0], len(leaf_ids)))
            for i, leaf_id in enumerate(leaf_ids):
                # Rule: "all split conditions on the path to this leaf hold"
                leaf_indicators[:, i] = (leaves == leaf_id).astype(float)
            all_rules.append(leaf_indicators)
        return np.hstack(all_rules)

    def predict(self, X):
        X = np.atleast_2d(X)
        X_rules = self._transform_to_rules(X)
        X_combined = np.hstack([X, X_rules])
        return self.linear.predict(X_combined)

def demonstrate_combined_models():
    """Show how combined models can outperform pure approaches."""
    np.random.seed(42)

    # Data with both a linear trend and local effects
    n = 500
    X = np.random.randn(n, 3)
    y = (2 * X[:, 0]                                  # linear trend
         + np.where(X[:, 1] > 0, X[:, 2], -X[:, 2])  # interaction
         + np.random.randn(n) * 0.5)

    print("Combined Model Demonstration")
    print("=" * 60)
    print("Data: linear trend + interaction (neither model ideal alone)")

    # Pure linear
    linear = Ridge(alpha=1.0)
    linear_scores = cross_val_score(linear, X, y, cv=5, scoring='r2')

    # Pure tree
    tree = DecisionTreeRegressor(max_depth=5)
    tree_scores = cross_val_score(tree, X, y, cv=5, scoring='r2')

    # Stacking
    stacking = StackingRegressor(
        estimators=[
            ('linear', Ridge(alpha=1.0)),
            ('tree', DecisionTreeRegressor(max_depth=5)),
        ],
        final_estimator=Ridge(alpha=0.1),
    )
    stacking_scores = cross_val_score(stacking, X, y, cv=5, scoring='r2')

    # RuleFit-style: manual cross-validation since it lacks the sklearn API
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    rulefit_scores = []
    for train_idx, test_idx in kf.split(X):
        rf = RuleFitSimple(n_trees=10, max_depth=3)
        rf.fit(X[train_idx], y[train_idx])
        pred = rf.predict(X[test_idx])
        r2 = 1 - np.sum((y[test_idx] - pred)**2) / np.sum((y[test_idx] - y[test_idx].mean())**2)
        rulefit_scores.append(r2)
    rulefit_scores = np.array(rulefit_scores)

    print(f"{'Model':<20} {'CV R²':>12} {'Std':>10}")
    print("-" * 45)
    print(f"{'Ridge (linear)':<20} {linear_scores.mean():>12.4f} {linear_scores.std():>10.4f}")
    print(f"{'Tree (depth 5)':<20} {tree_scores.mean():>12.4f} {tree_scores.std():>10.4f}")
    print(f"{'Stacking':<20} {stacking_scores.mean():>12.4f} {stacking_scores.std():>10.4f}")
    print(f"{'RuleFit-style':<20} {rulefit_scores.mean():>12.4f} {rulefit_scores.std():>10.4f}")
    print("→ Combined models capture both linear and nonlinear aspects")

demonstrate_combined_models()
```

We've conducted a comprehensive comparison of two fundamental regression paradigms—revealing when each excels and how they can be combined.
Module Complete:
This page concludes the Regression Trees module. You've learned the complete theory and practice of regression trees—from the CART algorithm and MSE splitting criterion, through leaf predictions and piecewise constant approximation theory, to comprehensive comparison with linear models. This foundation prepares you for understanding ensemble methods like Random Forests and Gradient Boosting, which build on individual trees to create powerful predictive models.
Congratulations! You have mastered regression trees. You understand CART's greedy recursive partitioning, MSE as variance reduction, optimal leaf predictions, the piecewise constant approximation framework, and when trees outperform or underperform linear models. This deep knowledge forms the foundation for understanding modern ensemble methods and making informed modeling decisions in practice.