In regression splines (as opposed to smoothing splines), we use a relatively small number of knots $K \ll n$ to define a basis, then fit by ordinary least squares. This approach is computationally attractive but introduces a fundamental question: Where should we place the knots?
Knot selection affects the model in two ways: through how many knots we use and through where we place them.
Poor knot selection leads to either underfitting (too few/poorly placed knots) or overfitting (too many knots). The goal is to find a configuration that captures the true signal without fitting noise.
By the end of this page, you will understand various knot placement strategies (uniform, quantile-based, adaptive), how to select the number of knots via cross-validation and information criteria, automatic knot selection algorithms (MARS-style and free-knot splines), and the theoretical considerations underlying these choices.
Knot selection is fundamentally a model selection problem, governed by the bias-variance tradeoff.
Too Few Knots (High Bias): the spline lacks the flexibility to follow the true function's curvature, so the fit is systematically off no matter how much data we have.
Too Many Knots (High Variance): the spline is flexible enough to chase noise, so the fit changes substantially from sample to sample.
The Optimal Number: balances these two error sources to minimize expected prediction error.
| $K$ (knots) | Degrees of Freedom | Bias | Variance | Typical MSE |
|---|---|---|---|---|
| Very small ($K < 3$) | Low | High | Low | High (underfit) |
| Small ($K = 3$–$5$) | Moderate | Moderate-High | Low-Moderate | Depends on function |
| Medium ($K = 5$–$15$) | Moderate | Low-Moderate | Moderate | Often optimal |
| Large ($K = 15$–$30$) | High | Low | High | Risk of overfit |
| Very large ($K > 30$) | Very high | Very low | Very high | High (overfit) |
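The tradeoff in the table can be sketched numerically. The toy example below (illustrative data; a cubic truncated-power basis $1, x, x^2, x^3, (x-\xi_k)_+^3$ stands in for the B-spline fits used later on this page) compares test error against the true function for several knot counts:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = np.sort(rng.uniform(0, 1, n))
f_true = np.sin(4 * np.pi * x)              # two full periods
y = f_true + rng.normal(0, 0.3, n)

x_test = np.linspace(0.01, 0.99, 500)
f_test = np.sin(4 * np.pi * x_test)

def design(x, knots):
    """Cubic truncated-power basis: 1, x, x^2, x^3, (x - k)_+^3."""
    cols = [x ** p for p in range(4)]
    cols += [np.clip(x - k, 0, None) ** 3 for k in knots]
    return np.column_stack(cols)

mse_by_K = {}
for K in [1, 5, 15, 60]:
    knots = np.linspace(0, 1, K + 2)[1:-1]  # uniform interior knots
    beta = np.linalg.lstsq(design(x, knots), y, rcond=None)[0]
    mse_by_K[K] = np.mean((design(x_test, knots) @ beta - f_test) ** 2)
    print(f"K = {K:2d}: test MSE against true f = {mse_by_K[K]:.4f}")
```

With very few knots the basis cannot track two periods of the sine (high bias); with very many knots the fit starts absorbing noise (high variance).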
A common starting point is $K \approx \min(n/4, 35)$ knots for cubic splines. This is just a heuristic—formal selection via CV or information criteria should follow. The maximum of 35 reflects diminishing returns beyond a certain flexibility.
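The heuristic is a one-liner; the helper name below is made up for illustration:

```python
def default_num_knots(n, cap=35):
    """Rule-of-thumb starting knot count for cubic splines: min(n/4, cap)."""
    return min(n // 4, cap)

for n in [40, 100, 200, 1000]:
    # 200 and 1000 both hit the cap of 35
    print(f"n = {n:4d}  ->  starting K = {default_num_knots(n)}")
```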
Once we decide on the number of knots $K$, we must choose their locations. Two standard strategies dominate practice.
Strategy 1: Equally-Spaced Knots
Place knots at uniform intervals across the data range: $$\xi_k = x_{\min} + \frac{k}{K+1}(x_{\max} - x_{\min}), \quad k = 1, \ldots, K$$
Pros:
Cons:
Strategy 2: Quantile-Based Knots
Place knots at percentiles of the $x$ distribution: $$\xi_k = \text{Percentile}\left(x, \frac{100k}{K+1}\right), \quad k = 1, \ldots, K$$
Pros:
Cons:
```python
import numpy as np
import matplotlib.pyplot as plt

def uniform_knots(x, K):
    """
    Place K knots uniformly across the data range.

    Parameters
    ----------
    x : array
        Data points
    K : int
        Number of interior knots

    Returns
    -------
    knots : array of shape (K,)
        Knot locations
    """
    x_min, x_max = x.min(), x.max()
    # K interior knots divide [x_min, x_max] into K+1 intervals
    return np.linspace(x_min, x_max, K + 2)[1:-1]

def quantile_knots(x, K):
    """
    Place K knots at quantiles of x.

    Parameters
    ----------
    x : array
        Data points
    K : int
        Number of interior knots

    Returns
    -------
    knots : array of shape (K,)
        Knot locations
    """
    # Quantiles from 1/(K+1) to K/(K+1)
    percentiles = np.linspace(0, 100, K + 2)[1:-1]
    return np.percentile(x, percentiles)

# Example: Compare uniform vs quantile knots on skewed data
np.random.seed(42)

# Generate right-skewed x (e.g., log-normal)
n = 200
x = np.exp(np.random.normal(0, 1, n))  # Log-normal distribution
x = np.sort(x)

K = 8  # Number of knots

knots_uniform = uniform_knots(x, K)
knots_quantile = quantile_knots(x, K)

print("Uniform knots:")
print(knots_uniform)
print("Quantile knots:")
print(knots_quantile)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 4))

# Histogram with uniform knots
axes[0].hist(x, bins=30, density=True, alpha=0.7, color='steelblue')
for xi in knots_uniform:
    axes[0].axvline(xi, color='red', linestyle='--', lw=1.5)
axes[0].set_xlabel('x')
axes[0].set_ylabel('Density')
axes[0].set_title('Uniform Knots (red lines)')

# Histogram with quantile knots
axes[1].hist(x, bins=30, density=True, alpha=0.7, color='steelblue')
for xi in knots_quantile:
    axes[1].axvline(xi, color='green', linestyle='--', lw=1.5)
axes[1].set_xlabel('x')
axes[1].set_ylabel('Density')
axes[1].set_title('Quantile Knots (green lines)')

plt.tight_layout()
plt.savefig('knot_comparison.png', dpi=150)
plt.show()

print(f"Observations between quantile knots: ~{n // (K + 1)} each")
print("Uniform knots may have very unequal observation counts!")
```

Use quantile knots when the function may have similar complexity throughout and the data density varies. Use uniform knots when you believe the function's complexity is uniform across the domain, regardless of where the data happen to fall. In most cases, quantile knots are the safer choice because they guarantee adequate data near each knot for stable estimation.
Cross-validation provides a principled way to choose the number of knots by estimating out-of-sample prediction error.
K-Fold CV for Knot Selection: for each candidate $K$, split the data into folds; fit the spline on the training folds (placing knots using training data only), compute the validation MSE on the held-out fold, and average across folds. Choose the $K$ with the smallest average CV error.
Leave-One-Out CV: the limiting case with $n$ folds, where each observation is held out in turn.
For regression splines (unpenalized), LOOCV has a shortcut formula similar to smoothing splines: $$\text{CV} = \frac{1}{n} \sum_{i=1}^n \left( \frac{y_i - \hat{y}_i}{1 - H_{ii}} \right)^2$$ where $H_{ii}$ is the $i$-th diagonal of the hat matrix.
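The shortcut can be checked numerically for any linear smoother $\hat{\mathbf{y}} = H\mathbf{y}$. For brevity the sketch below uses OLS on a cubic polynomial basis with made-up data; the same identity applies to an unpenalized spline design matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)

X = np.vander(x, 4, increasing=True)    # columns 1, x, x^2, x^3
H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix
y_hat = H @ y

# Shortcut: CV = (1/n) * sum_i ((y_i - y_hat_i) / (1 - H_ii))^2
cv_shortcut = np.mean(((y - y_hat) / (1 - np.diag(H))) ** 2)

# Brute force: refit n times, each time leaving one point out
errs = []
for i in range(n):
    mask = np.arange(n) != i
    beta = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    errs.append((y[i] - X[i] @ beta) ** 2)
cv_brute = np.mean(errs)

print(f"shortcut = {cv_shortcut:.6f}, brute force = {cv_brute:.6f}")
```

The two quantities agree up to floating-point error, so LOOCV costs one fit rather than $n$.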
```python
import numpy as np
from scipy.interpolate import BSpline
from sklearn.model_selection import KFold

def bspline_design(x, interior_knots, degree=3, x_min=None, x_max=None):
    """Build B-spline design matrix with given boundary knots."""
    # Boundaries default to the range of x; pass the training range
    # explicitly when evaluating on new points so the bases match.
    x_min = x.min() if x_min is None else x_min
    x_max = x.max() if x_max is None else x_max
    t = np.concatenate([
        np.repeat(x_min, degree + 1),
        interior_knots,
        np.repeat(x_max, degree + 1)
    ])
    n_basis = len(t) - degree - 1
    B = np.zeros((len(x), n_basis))
    for j in range(n_basis):
        c = np.zeros(n_basis)
        c[j] = 1.0
        spl = BSpline(t, c, degree, extrapolate=False)
        B[:, j] = np.nan_to_num(spl(x), nan=0.0)
    return B

def cv_select_knots(x, y, K_candidates, n_folds=5):
    """
    Select optimal number of knots via K-fold cross-validation.

    Parameters
    ----------
    x, y : arrays
        Data
    K_candidates : list of int
        Candidate knot counts to try
    n_folds : int
        Number of CV folds

    Returns
    -------
    best_K : int
        Optimal number of knots
    cv_scores : dict
        CV scores for each K
    """
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    cv_scores = {}

    for K in K_candidates:
        fold_scores = []
        for train_idx, val_idx in kf.split(x):
            x_train, x_val = x[train_idx], x[val_idx]
            y_train, y_val = y[train_idx], y[val_idx]

            # Sort training data for spline fitting
            sort_idx = np.argsort(x_train)
            x_train_sorted = x_train[sort_idx]
            y_train_sorted = y_train[sort_idx]

            # Place knots at quantiles of training data
            knots = np.percentile(x_train_sorted, np.linspace(0, 100, K + 2)[1:-1])

            # Build design matrix and fit
            B_train = bspline_design(x_train_sorted, knots)
            try:
                beta = np.linalg.lstsq(B_train, y_train_sorted, rcond=None)[0]

                # Predict on validation: clamp validation x to the training
                # range for extrapolation safety, and reuse the training
                # boundary knots so train and validation bases match
                x_val_clamped = np.clip(x_val, x_train.min(), x_train.max())
                B_val = bspline_design(x_val_clamped, knots,
                                       x_min=x_train.min(), x_max=x_train.max())
                y_pred = B_val @ beta
                mse = np.mean((y_val - y_pred) ** 2)
                fold_scores.append(mse)
            except np.linalg.LinAlgError:
                fold_scores.append(np.inf)  # Singular matrix

        cv_scores[K] = np.mean(fold_scores)

    best_K = min(cv_scores.keys(), key=lambda k: cv_scores[k])
    return best_K, cv_scores

# Example: Select knots via CV
np.random.seed(42)
n = 150
x = np.sort(np.random.uniform(0, 2 * np.pi, n))
y_true = np.sin(x) + 0.3 * np.cos(2 * x) + 0.1 * np.sin(5 * x)
y = y_true + np.random.normal(0, 0.4, n)

K_candidates = list(range(3, 25))
best_K, cv_scores = cv_select_knots(x, y, K_candidates)

print(f"Best number of knots: {best_K}")
print("CV scores by K:")
for K in K_candidates:
    marker = " <-- best" if K == best_K else ""
    print(f"  K={K:2d}: CV MSE = {cv_scores[K]:.4f}{marker}")
```

Instead of selecting the $K$ with minimum CV error, consider the 'one standard error rule': choose the simplest model (smallest $K$) whose CV error is within one standard error of the minimum. This guards against overfitting and produces more stable models.
Information criteria provide alternatives to cross-validation for knot selection. They trade off goodness-of-fit against model complexity without explicit data splitting.
Akaike Information Criterion (AIC): $$\text{AIC} = n \log(\text{RSS}/n) + 2 \cdot \text{df}$$
where df = number of spline parameters = $K + d + 1$ for degree-$d$ B-splines with $K$ interior knots.
Bayesian Information Criterion (BIC): $$\text{BIC} = n \log(\text{RSS}/n) + \log(n) \cdot \text{df}$$
BIC penalizes complexity more heavily for large $n$, favoring simpler models.
Generalized Cross-Validation (GCV): $$\text{GCV} = \frac{\text{RSS}/n}{(1 - \text{df}/n)^2}$$
GCV approximates LOOCV and is computationally efficient.
| Criterion | Formula | Penalty Weight | Behavior |
|---|---|---|---|
| AIC | $n\log(\text{RSS}/n) + 2 \cdot \text{df}$ | Fixed (2) | May overfit slightly |
| AICc | AIC + $\frac{2\text{df}(\text{df}+1)}{n-\text{df}-1}$ | Corrected for small $n$ | Better for small samples |
| BIC | $n\log(\text{RSS}/n) + \log(n) \cdot \text{df}$ | Grows with $n$ | More parsimonious |
| GCV | $\text{RSS}/(n(1-\text{df}/n)^2)$ | Implicit | Approximates LOOCV |
| LOOCV | $\frac{1}{n}\sum (e_i/(1-H_{ii}))^2$ | Exact leave-one-out | Unbiased, higher variance |
```python
import numpy as np
from scipy.interpolate import BSpline

def compute_criteria(y, y_hat, df, n):
    """
    Compute model selection criteria.

    Parameters
    ----------
    y : array
        Observed values
    y_hat : array
        Fitted values
    df : int
        Degrees of freedom (number of parameters)
    n : int
        Sample size

    Returns
    -------
    dict : Dictionary of criteria values
    """
    rss = np.sum((y - y_hat) ** 2)

    # AIC
    aic = n * np.log(rss / n) + 2 * df

    # AICc (corrected AIC for small samples)
    if n - df - 1 > 0:
        aicc = aic + (2 * df * (df + 1)) / (n - df - 1)
    else:
        aicc = np.inf

    # BIC
    bic = n * np.log(rss / n) + np.log(n) * df

    # GCV
    if df < n:
        gcv = (rss / n) / ((1 - df / n) ** 2)
    else:
        gcv = np.inf

    return {'AIC': aic, 'AICc': aicc, 'BIC': bic, 'GCV': gcv,
            'RSS': rss, 'df': df}

def bspline_design(x, interior_knots, degree=3):
    """Build B-spline design matrix."""
    t = np.concatenate([
        np.repeat(x.min(), degree + 1),
        interior_knots,
        np.repeat(x.max(), degree + 1)
    ])
    n_basis = len(t) - degree - 1
    B = np.zeros((len(x), n_basis))
    for j in range(n_basis):
        c = np.zeros(n_basis)
        c[j] = 1.0
        spl = BSpline(t, c, degree, extrapolate=False)
        B[:, j] = np.nan_to_num(spl(x), nan=0.0)
    return B

def select_knots_by_criteria(x, y, K_candidates, criterion='BIC'):
    """Select number of knots by minimizing an information criterion."""
    results = {}
    for K in K_candidates:
        knots = np.percentile(x, np.linspace(0, 100, K + 2)[1:-1])
        B = bspline_design(x, knots)
        df = B.shape[1]  # K + 4 for cubic B-splines

        # Fit and compute criteria
        beta = np.linalg.lstsq(B, y, rcond=None)[0]
        y_hat = B @ beta
        results[K] = compute_criteria(y, y_hat, df, len(y))

    # Select best K by specified criterion
    best_K = min(K_candidates, key=lambda k: results[k][criterion])
    return best_K, results

# Example: Compare criteria for knot selection
np.random.seed(42)
n = 100
x = np.sort(np.random.uniform(0, 2 * np.pi, n))
y_true = np.sin(x) + 0.5 * np.cos(2 * x)
y = y_true + np.random.normal(0, 0.5, n)

K_candidates = list(range(2, 20))

print("Optimal K by different criteria:")
for criterion in ['AIC', 'AICc', 'BIC', 'GCV']:
    best_K, results = select_knots_by_criteria(x, y, K_candidates, criterion)
    print(f"  {criterion}: K = {best_K}")
```

AIC and GCV often select similar (more complex) models. BIC is more conservative, especially for large $n$. For prediction, AIC/GCV are preferred; for causal inference or interpretation (where avoiding false positives matters), BIC is safer. When in doubt, use multiple criteria and check whether they agree.
Fixed placement strategies (uniform or quantile) don't adapt to the local complexity of the function. Adaptive methods place more knots where the function changes rapidly.
Idea: The true function may be nearly linear in some regions and highly curved in others. Placing knots uniformly 'wastes' flexibility in the linear regions while providing insufficient flexibility in the curved regions.
Residual-Based Adaptive Placement: start from a small set of knots, fit the spline, locate the region where the (smoothed) residuals are largest, add a knot there, and repeat until the knot budget is exhausted or the gains become negligible.
This greedy approach is computationally simple but may get stuck in local optima.
```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_fit(x, y, interior_knots, degree=3):
    """Fit B-spline and return fitted values and degrees of freedom."""
    t = np.concatenate([
        np.repeat(x.min(), degree + 1),
        interior_knots,
        np.repeat(x.max(), degree + 1)
    ])
    n_basis = len(t) - degree - 1
    B = np.zeros((len(x), n_basis))
    for j in range(n_basis):
        c = np.zeros(n_basis)
        c[j] = 1.0
        spl = BSpline(t, c, degree, extrapolate=False)
        B[:, j] = np.nan_to_num(spl(x), nan=0.0)
    beta = np.linalg.lstsq(B, y, rcond=None)[0]
    return B @ beta, B.shape[1]

def adaptive_knot_placement(x, y, K_initial=3, K_max=20, threshold=0.8,
                            min_spacing_frac=0.05):
    """
    Adaptively place knots based on residual magnitude.

    Parameters
    ----------
    x, y : arrays
        Data (x should be sorted)
    K_initial : int
        Initial number of knots
    K_max : int
        Maximum number of knots
    threshold : float
        Quantile of residuals for identifying problem regions
    min_spacing_frac : float
        Minimum spacing between knots as fraction of data range

    Returns
    -------
    knots : array
        Final adaptive knot locations
    history : list
        Knots at each iteration
    """
    # Start with uniform knots
    knots = list(np.linspace(x.min(), x.max(), K_initial + 2)[1:-1])
    history = [knots.copy()]

    x_range = x.max() - x.min()
    min_spacing = min_spacing_frac * x_range

    for iteration in range(K_max - K_initial):
        # Fit current model
        y_hat, df = bspline_fit(x, y, np.array(knots))
        residuals = np.abs(y - y_hat)

        # Find region with largest residuals
        # (use a running mean of residuals to smooth)
        window = max(3, len(x) // 20)
        residual_smooth = np.convolve(residuals, np.ones(window) / window,
                                      mode='same')

        # Find candidate location for new knot
        candidate_idx = np.argmax(residual_smooth)
        candidate_x = x[candidate_idx]

        # Check if too close to existing knot
        min_dist = min(abs(candidate_x - k) for k in knots) if knots else np.inf
        if min_dist < min_spacing:
            # Can't add knot here, try to find next best location
            mask = residual_smooth.copy()
            for k in knots:
                mask[np.abs(x - k) < min_spacing] = 0
            if np.max(mask) > 0:
                candidate_idx = np.argmax(mask)
                candidate_x = x[candidate_idx]
            else:
                break  # No valid location

        # Check if adding knot improves significantly
        # (simple check: residual at candidate location is above threshold)
        if residuals[candidate_idx] > np.percentile(residuals, threshold * 100):
            knots.append(candidate_x)
            knots.sort()
            history.append(knots.copy())
        else:
            break  # Diminishing returns

    return np.array(knots), history

# Example: Adaptive knot placement
np.random.seed(42)
n = 200
x = np.linspace(0, 10, n)

# True function with varying complexity
y_true = np.where(x < 5,
                  np.sin(x),             # Smooth in [0, 5]
                  np.sin(3 * x) + 0.5)   # More complex in [5, 10]
y = y_true + np.random.normal(0, 0.3, n)

knots, history = adaptive_knot_placement(x, y, K_initial=3, K_max=15)

print(f"Initial knots: 3, Final knots: {len(knots)}")
print(f"Knot locations: {np.round(knots, 2)}")
# Note: More knots should appear in the [5, 10] region
```

Adaptive knot placement is powerful but requires care: (1) The algorithm can be greedy and miss global optima; (2) Knots placed based on noise may cause overfitting; (3) The stopping criterion matters—too lenient leads to too many knots. Combine with CV to validate the final knot configuration.
The most flexible approach treats knot locations as parameters to optimize alongside the spline coefficients. This is called free-knot spline regression.
The Optimization Problem:
Given $K$ knots and spline coefficients $\boldsymbol{\beta}$, minimize: $$\mathcal{L}(\xi_1, \ldots, \xi_K, \boldsymbol{\beta}) = \sum_{i=1}^n (y_i - s(x_i; \xi_1, \ldots, \xi_K, \boldsymbol{\beta}))^2$$
This is a nonlinear, non-convex optimization problem—much harder than the fixed-knot case.
Free-knot optimization is difficult: (1) The objective is non-convex with many local minima; (2) The number of local minima grows exponentially with K; (3) Gradient-based methods require careful handling of the discontinuous coefficient structure; (4) Good initialization is crucial. For these reasons, free-knot splines are less commonly used than fixed-knot or smoothing splines.
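One way to see the non-convexity is to scan the RSS of a single-interior-knot cubic spline as the knot slides across the domain; even this one-dimensional profile typically shows several dips. A sketch with illustrative data:

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(2)
n = 120
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(6 * np.pi * x) + rng.normal(0, 0.2, n)

def rss_one_knot(knot, degree=3):
    """RSS of the least-squares cubic spline with one interior knot."""
    t = np.concatenate([np.zeros(degree + 1), [knot], np.ones(degree + 1)])
    n_basis = len(t) - degree - 1
    B = np.zeros((n, n_basis))
    for j in range(n_basis):
        c = np.zeros(n_basis)
        c[j] = 1.0
        B[:, j] = np.nan_to_num(BSpline(t, c, degree, extrapolate=False)(x),
                                nan=0.0)
    beta = np.linalg.lstsq(B, y, rcond=None)[0]
    return np.sum((y - B @ beta) ** 2)

grid = np.linspace(0.05, 0.95, 91)
rss = np.array([rss_one_knot(k) for k in grid])

# Count strict interior local minima of the RSS profile
local_min = int(np.sum((rss[1:-1] < rss[:-2]) & (rss[1:-1] < rss[2:])))
print(f"Interior local minima along the knot grid: {local_min}")
```

With $K$ knots the search space is $K$-dimensional and the number of such basins multiplies, which is why multi-start optimization (used below) is the standard workaround.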
Optimization Approaches:
Coordinate Descent: Alternate between optimizing one knot (with others fixed) and optimizing coefficients. Simple but slow.
Gradient-Based Methods: Compute gradients w.r.t. knot locations using the chain rule. Requires differentiability (not straightforward for B-splines).
Genetic Algorithms / Simulated Annealing: Global optimization methods that don't get stuck in local minima but are computationally expensive.
MARS (Multivariate Adaptive Regression Splines): Not exactly free-knot splines but a related approach that greedily selects knot locations from a candidate set.
```python
import numpy as np
from scipy.optimize import minimize
from scipy.interpolate import BSpline

def free_knot_objective(knot_params, x, y, degree=3):
    """
    Objective function for free-knot spline optimization.

    Parameters
    ----------
    knot_params : array of shape (K,)
        Interior knot locations (will be sorted)
    x, y : arrays
        Data
    degree : int
        Spline degree

    Returns
    -------
    rss : float
        Residual sum of squares (to minimize)
    """
    # Sort knots to ensure valid order
    knots = np.sort(knot_params)

    # Check for valid knot configuration
    if np.any(np.diff(knots) < 1e-6):  # Knots too close
        return 1e10
    if knots[0] <= x.min() or knots[-1] >= x.max():  # Out of range
        return 1e10

    # Build B-spline design matrix
    t = np.concatenate([
        np.repeat(x.min(), degree + 1),
        knots,
        np.repeat(x.max(), degree + 1)
    ])
    n_basis = len(t) - degree - 1
    B = np.zeros((len(x), n_basis))
    for j in range(n_basis):
        c = np.zeros(n_basis)
        c[j] = 1.0
        spl = BSpline(t, c, degree, extrapolate=False)
        B[:, j] = np.nan_to_num(spl(x), nan=0.0)

    # Fit coefficients via least squares
    try:
        beta = np.linalg.lstsq(B, y, rcond=None)[0]
        rss = np.sum((y - B @ beta) ** 2)
    except np.linalg.LinAlgError:
        rss = 1e10

    return rss

def fit_free_knot_spline(x, y, K, n_restarts=10, degree=3):
    """
    Fit free-knot spline using multi-start optimization.

    Parameters
    ----------
    x, y : arrays
        Data
    K : int
        Number of interior knots
    n_restarts : int
        Number of random restarts
    degree : int
        Spline degree

    Returns
    -------
    best_knots : array
        Optimal knot locations
    best_rss : float
        Best RSS achieved
    """
    x_sorted = np.sort(x)
    x_min, x_max = x_sorted[0], x_sorted[-1]
    margin = 0.05 * (x_max - x_min)

    best_knots = None
    best_rss = np.inf

    for restart in range(n_restarts):
        if restart == 0:
            # Start with quantile placement
            init_knots = np.percentile(x, np.linspace(0, 100, K + 2)[1:-1])
        else:
            # Random uniform initialization
            init_knots = np.sort(np.random.uniform(x_min + margin,
                                                   x_max - margin, K))

        # Optimize
        result = minimize(
            free_knot_objective, init_knots, args=(x, y, degree),
            method='L-BFGS-B',
            bounds=[(x_min + margin, x_max - margin)] * K,
            options={'maxiter': 100, 'disp': False}
        )

        if result.fun < best_rss:
            best_rss = result.fun
            best_knots = np.sort(result.x)

    return best_knots, best_rss

# Example: Free-knot spline optimization
np.random.seed(42)
n = 100
x = np.sort(np.random.uniform(0, 10, n))
y_true = np.sin(x) + 0.5 * np.sign(x - 5) * (x - 5) ** 2 / 25
y = y_true + np.random.normal(0, 0.4, n)

K = 4
optimal_knots, rss = fit_free_knot_spline(x, y, K, n_restarts=20)

print(f"Optimal {K} knots: {np.round(optimal_knots, 3)}")
print(f"RSS: {rss:.4f}")

# Compare with quantile knots
fixed_knots = np.percentile(x, np.linspace(0, 100, K + 2)[1:-1])
fixed_rss = free_knot_objective(fixed_knots, x, y)
print(f"Quantile knots: {np.round(fixed_knots, 3)}")
print(f"Fixed RSS: {fixed_rss:.4f}")
print(f"Improvement: {100 * (fixed_rss - rss) / fixed_rss:.1f}%")
```

MARS (Multivariate Adaptive Regression Splines), introduced by Friedman (1991), is a practical algorithm for automatic knot selection that extends to multiple predictors.
Key Ideas:
Piecewise Linear Basis: Instead of cubic splines, MARS uses 'hinge' functions $(x - t)_+$ and $(t - x)_+$, where $t$ is a knot.
Forward Stepwise: Greedily add basis functions that most reduce RSS, from a candidate set at all unique data values.
Backward Pruning: After forward pass, prune basis functions to minimize GCV.
Interactions: Can include products of hinge functions for multivariate interactions.
```python
# MARS is implemented in the 'py-earth' package:
#   pip install sklearn-contrib-py-earth
# Note: py-earth may need to be installed separately.
# Here we demonstrate the concept with a simplified implementation.

import numpy as np

def hinge_positive(x, t):
    """(x - t)_+ = max(0, x - t)"""
    return np.maximum(0, x - t)

def hinge_negative(x, t):
    """(t - x)_+ = max(0, t - x)"""
    return np.maximum(0, t - x)

def simple_mars_forward(x, y, max_terms=10):
    """
    Simplified MARS forward pass for 1D data.
    Iteratively adds hinge functions to reduce RSS.
    """
    n = len(x)
    # Candidate knots: all unique x values (excluding boundaries)
    candidates = np.unique(x)[1:-1]

    # Start with intercept only
    basis = [np.ones(n)]
    basis_info = [{'type': 'intercept'}]

    for term in range(max_terms):
        best_rss = np.inf
        best_basis = None
        best_info = None

        current_design = np.column_stack(basis)
        beta = np.linalg.lstsq(current_design, y, rcond=None)[0]
        current_rss = np.sum((y - current_design @ beta) ** 2)

        # Try each candidate knot and direction
        for t in candidates:
            for direction in ['positive', 'negative']:
                if direction == 'positive':
                    new_col = hinge_positive(x, t)
                else:
                    new_col = hinge_negative(x, t)

                # Skip if nearly constant
                if np.std(new_col) < 1e-6:
                    continue

                # Fit with new basis
                trial_design = np.column_stack([current_design, new_col])
                trial_beta = np.linalg.lstsq(trial_design, y, rcond=None)[0]
                trial_rss = np.sum((y - trial_design @ trial_beta) ** 2)

                if trial_rss < best_rss:
                    best_rss = trial_rss
                    best_basis = new_col
                    best_info = {'type': direction, 'knot': t}

        # Add best basis if it improves RSS significantly
        if best_rss < current_rss * 0.99:  # 1% improvement threshold
            basis.append(best_basis)
            basis_info.append(best_info)
        else:
            break

    return np.column_stack(basis), basis_info

# Example
np.random.seed(42)
n = 150
x = np.sort(np.random.uniform(0, 10, n))
y_true = 2 + 0.5 * x - 0.1 * (x - 3)**2 * (x > 3) + 0.2 * (x - 7) * (x > 7)
y = y_true + np.random.normal(0, 0.5, n)

design, info = simple_mars_forward(x, y, max_terms=6)

print(f"MARS selected {len(info)} basis functions:")
for i, item in enumerate(info):
    if item['type'] == 'intercept':
        print(f"  {i+1}. Intercept")
    else:
        print(f"  {i+1}. ({item['type']} hinge at {item['knot']:.2f})")

# Note: Real MARS includes backward pruning via GCV
# and handles multiple dimensions with interactions
```

For real MARS models, use the 'py-earth' sklearn-contrib package in Python or the 'earth' package in R. These implement the full algorithm, including backward pruning and interaction terms. MARS is particularly useful for interpretable models with automatic feature selection in moderate dimensions.
Asymptotic Theory:
For regression splines, theoretical results govern optimal knot counts as sample size $n \to \infty$.
Theorem (Stone, 1982): If the true function $f$ has $r$ continuous derivatives, then regression splines with $K \sim n^{1/(2r+1)}$ knots achieve the optimal convergence rate:
$$\mathbb{E}\left[\|\hat{f} - f\|_2^2\right] = O(n^{-2r/(2r+1)})$$
For cubic splines, which can exploit smoothness up to $r = 4$ derivatives, this suggests $K \sim n^{1/9}$ knots.
| Sample Size $n$ | Optimal $K$ (theory) | Practical Range |
|---|---|---|
| 50 | $50^{1/9} \approx 1.5$ | 3–8 |
| 100 | $100^{1/9} \approx 1.7$ | 4–10 |
| 500 | $500^{1/9} \approx 2.0$ | 5–15 |
| 1000 | $1000^{1/9} \approx 2.2$ | 6–18 |
| 10000 | $10000^{1/9} \approx 2.8$ | 8–25 |
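The theoretical column is one line of arithmetic per row:

```python
# Reproduce the "Optimal K (theory)" column: K ~ n^(1/9) for cubic splines
for n in [50, 100, 500, 1000, 10000]:
    print(f"n = {n:6d}: n^(1/9) = {n ** (1 / 9):.1f}")
```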
Practical Implications:
Optimal $K$ grows slowly with $n$: Even with 10,000 observations, theory suggests only ~3 knots are 'optimal.' In practice, we use more because:
Smoothness matters: If the function is rougher ($r$ smaller), more knots are needed. If smoother, fewer suffice.
Constants hidden: The $O(\cdot)$ notation hides constants that may favor more or fewer knots in finite samples.
Asymptotic theory gives qualitative guidance (K should grow slowly with n) but specific numbers depend on many unknown factors. In practice, use cross-validation or information criteria, which implicitly account for the true function's complexity. Trust empirical model selection over asymptotic formulas.
What's Next:
We've seen two extremes: regression splines (few fixed knots, no regularization) and smoothing splines (knots at all data points, heavy regularization). Penalized splines (P-splines) combine the best of both: moderately many knots with a smoothness penalty. On the next page, we explore this powerful hybrid approach.
You now understand knot selection for regression splines: fixed placement strategies, cross-validation and information criteria for choosing the number of knots, adaptive and free-knot approaches, and the MARS algorithm. These tools let you build flexible, interpretable spline models tailored to your data.