In regression splines (as opposed to smoothing splines), we use a relatively small number of knots $K \ll n$ to define a basis, then fit by ordinary least squares. This approach is computationally attractive but introduces a fundamental question: Where should we place the knots?
Knot selection affects the model in two ways: through how many knots we use and through where we place them.
Poor knot selection leads to either underfitting (too few/poorly placed knots) or overfitting (too many knots). The goal is to find a configuration that captures the true signal without fitting noise.
By the end of this page, you will understand various knot placement strategies (uniform, quantile-based, adaptive), how to select the number of knots via cross-validation and information criteria, automatic knot selection algorithms (MARS-style and free-knot splines), and the theoretical considerations underlying these choices.
Knot selection is fundamentally a model selection problem, governed by the bias-variance tradeoff.
Too Few Knots (High Bias): the spline lacks the flexibility to follow the true function's curvature, so the fit is systematically off no matter how much data we have.
Too Many Knots (High Variance): the spline is flexible enough to chase noise, so the fit changes substantially from sample to sample.
The Optimal Number: balances these two error sources to minimize expected prediction error.
| $K$ (knots) | Degrees of Freedom | Bias | Variance | Typical MSE |
|---|---|---|---|---|
| Very small ($K < 3$) | Low | High | Low | High (underfit) |
| Small ($K = 3$–$5$) | Moderate | Moderate-High | Low-Moderate | Depends on function |
| Medium ($K = 5$–$15$) | Moderate | Low-Moderate | Moderate | Often optimal |
| Large ($K = 15$–$30$) | High | Low | High | Risk of overfit |
| Very large ($K > 30$) | Very high | Very low | Very high | High (overfit) |
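The tradeoff in the table can be sketched numerically. The toy example below (illustrative data; a cubic truncated-power basis $1, x, x^2, x^3, (x-\xi_k)_+^3$ stands in for the B-spline fits used later on this page) compares test error against the true function for several knot counts:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = np.sort(rng.uniform(0, 1, n))
f_true = np.sin(4 * np.pi * x)              # two full periods
y = f_true + rng.normal(0, 0.3, n)

x_test = np.linspace(0.01, 0.99, 500)
f_test = np.sin(4 * np.pi * x_test)

def design(x, knots):
    """Cubic truncated-power basis: 1, x, x^2, x^3, (x - k)_+^3."""
    cols = [x ** p for p in range(4)]
    cols += [np.clip(x - k, 0, None) ** 3 for k in knots]
    return np.column_stack(cols)

mse_by_K = {}
for K in [1, 5, 15, 60]:
    knots = np.linspace(0, 1, K + 2)[1:-1]  # uniform interior knots
    beta = np.linalg.lstsq(design(x, knots), y, rcond=None)[0]
    mse_by_K[K] = np.mean((design(x_test, knots) @ beta - f_test) ** 2)
    print(f"K = {K:2d}: test MSE against true f = {mse_by_K[K]:.4f}")
```

With very few knots the basis cannot track two periods of the sine (high bias); with very many knots the fit starts absorbing noise (high variance).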
A common starting point is $K \approx \min(n/4, 35)$ knots for cubic splines. This is just a heuristic—formal selection via CV or information criteria should follow. The maximum of 35 reflects diminishing returns beyond a certain flexibility.
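The heuristic is a one-liner; the helper name below is made up for illustration:

```python
def default_num_knots(n, cap=35):
    """Rule-of-thumb starting knot count for cubic splines: min(n/4, cap)."""
    return min(n // 4, cap)

for n in [40, 100, 200, 1000]:
    # 200 and 1000 both hit the cap of 35
    print(f"n = {n:4d}  ->  starting K = {default_num_knots(n)}")
```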
Once we decide on the number of knots $K$, we must choose their locations. Two standard strategies dominate practice.
Strategy 1: Equally-Spaced Knots
Place knots at uniform intervals across the data range: $$\xi_k = x_{\min} + \frac{k}{K+1}(x_{\max} - x_{\min}), \quad k = 1, \ldots, K$$
Pros:
Cons:
Strategy 2: Quantile-Based Knots
Place knots at percentiles of the $x$ distribution: $$\xi_k = \text{Percentile}\left(x, \frac{100k}{K+1}\right), \quad k = 1, \ldots, K$$
Pros:
Cons:
```python
import numpy as np
import matplotlib.pyplot as plt

def uniform_knots(x, K):
    """
    Place K knots uniformly across the data range.

    Parameters
    ----------
    x : array
        Data points
    K : int
        Number of interior knots

    Returns
    -------
    knots : array of shape (K,)
        Knot locations
    """
    x_min, x_max = x.min(), x.max()
    # K interior knots divide [x_min, x_max] into K+1 intervals
    return np.linspace(x_min, x_max, K + 2)[1:-1]

def quantile_knots(x, K):
    """
    Place K knots at quantiles of x.

    Parameters
    ----------
    x : array
        Data points
    K : int
        Number of interior knots

    Returns
    -------
    knots : array of shape (K,)
        Knot locations
    """
    # Quantiles from 1/(K+1) to K/(K+1)
    percentiles = np.linspace(0, 100, K + 2)[1:-1]
    return np.percentile(x, percentiles)

# Example: Compare uniform vs quantile knots on skewed data
np.random.seed(42)

# Generate right-skewed x (e.g., log-normal)
n = 200
x = np.exp(np.random.normal(0, 1, n))  # Log-normal distribution
x = np.sort(x)

K = 8  # Number of knots

knots_uniform = uniform_knots(x, K)
knots_quantile = quantile_knots(x, K)

print("Uniform knots:")
print(knots_uniform)
print("Quantile knots:")
print(knots_quantile)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 4))

# Histogram with uniform knots
axes[0].hist(x, bins=30, density=True, alpha=0.7, color='steelblue')
for xi in knots_uniform:
    axes[0].axvline(xi, color='red', linestyle='--', lw=1.5)
axes[0].set_xlabel('x')
axes[0].set_ylabel('Density')
axes[0].set_title('Uniform Knots (red lines)')

# Histogram with quantile knots
axes[1].hist(x, bins=30, density=True, alpha=0.7, color='steelblue')
for xi in knots_quantile:
    axes[1].axvline(xi, color='green', linestyle='--', lw=1.5)
axes[1].set_xlabel('x')
axes[1].set_ylabel('Density')
axes[1].set_title('Quantile Knots (green lines)')

plt.tight_layout()
plt.savefig('knot_comparison.png', dpi=150)
plt.show()

print(f"Observations between quantile knots: ~{n // (K + 1)} each")
print("Uniform knots may have very unequal observation counts!")
```

Use quantile knots when the function may have similar complexity throughout and the data density varies. Use uniform knots when you believe the function's complexity is uniform across the domain, regardless of where the data happen to fall. In most cases, quantile knots are the safer choice because they guarantee adequate data near each knot for stable estimation.
Cross-validation provides a principled way to choose the number of knots by estimating out-of-sample prediction error.
K-Fold CV for Knot Selection: for each candidate $K$, split the data into folds; fit the spline on the training folds (placing knots using training data only), compute the validation MSE on the held-out fold, and average across folds. Choose the $K$ with the smallest average CV error.
Leave-One-Out CV: the limiting case with $n$ folds, where each observation is held out in turn.
For regression splines (unpenalized), LOOCV has a shortcut formula similar to smoothing splines: $$\text{CV} = \frac{1}{n} \sum_{i=1}^n \left( \frac{y_i - \hat{y}_i}{1 - H_{ii}} \right)^2$$ where $H_{ii}$ is the $i$-th diagonal of the hat matrix.
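The shortcut can be checked numerically for any linear smoother $\hat{\mathbf{y}} = H\mathbf{y}$. For brevity the sketch below uses OLS on a cubic polynomial basis with made-up data; the same identity applies to an unpenalized spline design matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)

X = np.vander(x, 4, increasing=True)    # columns 1, x, x^2, x^3
H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix
y_hat = H @ y

# Shortcut: CV = (1/n) * sum_i ((y_i - y_hat_i) / (1 - H_ii))^2
cv_shortcut = np.mean(((y - y_hat) / (1 - np.diag(H))) ** 2)

# Brute force: refit n times, each time leaving one point out
errs = []
for i in range(n):
    mask = np.arange(n) != i
    beta = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    errs.append((y[i] - X[i] @ beta) ** 2)
cv_brute = np.mean(errs)

print(f"shortcut = {cv_shortcut:.6f}, brute force = {cv_brute:.6f}")
```

The two quantities agree up to floating-point error, so LOOCV costs one fit rather than $n$.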
```python
import numpy as np
from scipy.interpolate import BSpline
from sklearn.model_selection import KFold

def bspline_design(x, interior_knots, degree=3, x_min=None, x_max=None):
    """Build B-spline design matrix with given boundary knots."""
    # Boundaries default to the range of x; pass the training range
    # explicitly when evaluating on new points so the bases match.
    x_min = x.min() if x_min is None else x_min
    x_max = x.max() if x_max is None else x_max
    t = np.concatenate([
        np.repeat(x_min, degree + 1),
        interior_knots,
        np.repeat(x_max, degree + 1)
    ])
    n_basis = len(t) - degree - 1
    B = np.zeros((len(x), n_basis))
    for j in range(n_basis):
        c = np.zeros(n_basis)
        c[j] = 1.0
        spl = BSpline(t, c, degree, extrapolate=False)
        B[:, j] = np.nan_to_num(spl(x), nan=0.0)
    return B

def cv_select_knots(x, y, K_candidates, n_folds=5):
    """
    Select optimal number of knots via K-fold cross-validation.

    Parameters
    ----------
    x, y : arrays
        Data
    K_candidates : list of int
        Candidate knot counts to try
    n_folds : int
        Number of CV folds

    Returns
    -------
    best_K : int
        Optimal number of knots
    cv_scores : dict
        CV scores for each K
    """
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    cv_scores = {}

    for K in K_candidates:
        fold_scores = []
        for train_idx, val_idx in kf.split(x):
            x_train, x_val = x[train_idx], x[val_idx]
            y_train, y_val = y[train_idx], y[val_idx]

            # Sort training data for spline fitting
            sort_idx = np.argsort(x_train)
            x_train_sorted = x_train[sort_idx]
            y_train_sorted = y_train[sort_idx]

            # Place knots at quantiles of training data
            knots = np.percentile(x_train_sorted, np.linspace(0, 100, K + 2)[1:-1])

            # Build design matrix and fit
            B_train = bspline_design(x_train_sorted, knots)
            try:
                beta = np.linalg.lstsq(B_train, y_train_sorted, rcond=None)[0]

                # Predict on validation: clamp validation x to the training
                # range for extrapolation safety, and reuse the training
                # boundary knots so train and validation bases match
                x_val_clamped = np.clip(x_val, x_train.min(), x_train.max())
                B_val = bspline_design(x_val_clamped, knots,
                                       x_min=x_train.min(), x_max=x_train.max())
                y_pred = B_val @ beta
                mse = np.mean((y_val - y_pred) ** 2)
                fold_scores.append(mse)
            except np.linalg.LinAlgError:
                fold_scores.append(np.inf)  # Singular matrix

        cv_scores[K] = np.mean(fold_scores)

    best_K = min(cv_scores.keys(), key=lambda k: cv_scores[k])
    return best_K, cv_scores

# Example: Select knots via CV
np.random.seed(42)
n = 150
x = np.sort(np.random.uniform(0, 2 * np.pi, n))
y_true = np.sin(x) + 0.3 * np.cos(2 * x) + 0.1 * np.sin(5 * x)
y = y_true + np.random.normal(0, 0.4, n)

K_candidates = list(range(3, 25))
best_K, cv_scores = cv_select_knots(x, y, K_candidates)

print(f"Best number of knots: {best_K}")
print("CV scores by K:")
for K in K_candidates:
    marker = " <-- best" if K == best_K else ""
    print(f"  K={K:2d}: CV MSE = {cv_scores[K]:.4f}{marker}")
```

Instead of selecting the $K$ with minimum CV error, consider the 'one standard error rule': choose the simplest model (smallest $K$) whose CV error is within one standard error of the minimum. This guards against overfitting and produces more stable models.
Information criteria provide alternatives to cross-validation for knot selection. They trade off goodness-of-fit against model complexity without explicit data splitting.
Akaike Information Criterion (AIC): $$\text{AIC} = n \log(\text{RSS}/n) + 2 \cdot \text{df}$$
where df = number of spline parameters = $K + d + 1$ for degree-$d$ B-splines with $K$ interior knots.
Bayesian Information Criterion (BIC): $$\text{BIC} = n \log(\text{RSS}/n) + \log(n) \cdot \text{df}$$
BIC penalizes complexity more heavily for large $n$, favoring simpler models.
Generalized Cross-Validation (GCV): $$\text{GCV} = \frac{\text{RSS}/n}{(1 - \text{df}/n)^2}$$
GCV approximates LOOCV and is computationally efficient.
| Criterion | Formula | Penalty Weight | Behavior |
|---|---|---|---|
| AIC | $n\log(\text{RSS}/n) + 2 \cdot \text{df}$ | Fixed (2) | May overfit slightly |
| AICc | AIC + $\frac{2\text{df}(\text{df}+1)}{n-\text{df}-1}$ | Corrected for small $n$ | Better for small samples |
| BIC | $n\log(\text{RSS}/n) + \log(n) \cdot \text{df}$ | Grows with $n$ | More parsimonious |
| GCV | $\text{RSS}/(n(1-\text{df}/n)^2)$ | Implicit | Approximates LOOCV |
| LOOCV | $\frac{1}{n}\sum (e_i/(1-H_{ii}))^2$ | Exact leave-one-out | Unbiased, higher variance |
```python
import numpy as np
from scipy.interpolate import BSpline

def compute_criteria(y, y_hat, df, n):
    """
    Compute model selection criteria.

    Parameters
    ----------
    y : array
        Observed values
    y_hat : array
        Fitted values
    df : int
        Degrees of freedom (number of parameters)
    n : int
        Sample size

    Returns
    -------
    dict : Dictionary of criteria values
    """
    rss = np.sum((y - y_hat) ** 2)

    # AIC
    aic = n * np.log(rss / n) + 2 * df

    # AICc (corrected AIC for small samples)
    if n - df - 1 > 0:
        aicc = aic + (2 * df * (df + 1)) / (n - df - 1)
    else:
        aicc = np.inf

    # BIC
    bic = n * np.log(rss / n) + np.log(n) * df

    # GCV
    if df < n:
        gcv = (rss / n) / ((1 - df / n) ** 2)
    else:
        gcv = np.inf

    return {'AIC': aic, 'AICc': aicc, 'BIC': bic, 'GCV': gcv,
            'RSS': rss, 'df': df}

def bspline_design(x, interior_knots, degree=3):
    """Build B-spline design matrix."""
    t = np.concatenate([
        np.repeat(x.min(), degree + 1),
        interior_knots,
        np.repeat(x.max(), degree + 1)
    ])
    n_basis = len(t) - degree - 1
    B = np.zeros((len(x), n_basis))
    for j in range(n_basis):
        c = np.zeros(n_basis)
        c[j] = 1.0
        spl = BSpline(t, c, degree, extrapolate=False)
        B[:, j] = np.nan_to_num(spl(x), nan=0.0)
    return B

def select_knots_by_criteria(x, y, K_candidates, criterion='BIC'):
    """Select number of knots by minimizing an information criterion."""
    results = {}
    for K in K_candidates:
        knots = np.percentile(x, np.linspace(0, 100, K + 2)[1:-1])
        B = bspline_design(x, knots)
        df = B.shape[1]  # K + 4 for cubic B-splines

        # Fit and compute criteria
        beta = np.linalg.lstsq(B, y, rcond=None)[0]
        y_hat = B @ beta
        results[K] = compute_criteria(y, y_hat, df, len(y))

    # Select best K by specified criterion
    best_K = min(K_candidates, key=lambda k: results[k][criterion])
    return best_K, results

# Example: Compare criteria for knot selection
np.random.seed(42)
n = 100
x = np.sort(np.random.uniform(0, 2 * np.pi, n))
y_true = np.sin(x) + 0.5 * np.cos(2 * x)
y = y_true + np.random.normal(0, 0.5, n)

K_candidates = list(range(2, 20))

print("Optimal K by different criteria:")
for criterion in ['AIC', 'AICc', 'BIC', 'GCV']:
    best_K, results = select_knots_by_criteria(x, y, K_candidates, criterion)
    print(f"  {criterion}: K = {best_K}")
```

AIC and GCV often select similar (more complex) models. BIC is more conservative, especially for large $n$. For prediction, AIC/GCV are preferred; for causal inference or interpretation (where avoiding false positives matters), BIC is safer. When in doubt, use multiple criteria and check whether they agree.
Fixed placement strategies (uniform or quantile) don't adapt to the local complexity of the function. Adaptive methods place more knots where the function changes rapidly.
Idea: The true function may be nearly linear in some regions and highly curved in others. Placing knots uniformly 'wastes' flexibility in the linear regions while providing insufficient flexibility in the curved regions.
Residual-Based Adaptive Placement: start from a small set of knots, fit the spline, locate the region where the (smoothed) residuals are largest, add a knot there, and repeat until the knot budget is exhausted or the gains become negligible.
This greedy approach is computationally simple but may get stuck in local optima.
```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_fit(x, y, interior_knots, degree=3):
    """Fit B-spline and return fitted values and degrees of freedom."""
    t = np.concatenate([
        np.repeat(x.min(), degree + 1),
        interior_knots,
        np.repeat(x.max(), degree + 1)
    ])
    n_basis = len(t) - degree - 1
    B = np.zeros((len(x), n_basis))
    for j in range(n_basis):
        c = np.zeros(n_basis)
        c[j] = 1.0
        spl = BSpline(t, c, degree, extrapolate=False)
        B[:, j] = np.nan_to_num(spl(x), nan=0.0)
    beta = np.linalg.lstsq(B, y, rcond=None)[0]
    return B @ beta, B.shape[1]

def adaptive_knot_placement(x, y, K_initial=3, K_max=20, threshold=0.8,
                            min_spacing_frac=0.05):
    """
    Adaptively place knots based on residual magnitude.

    Parameters
    ----------
    x, y : arrays
        Data (x should be sorted)
    K_initial : int
        Initial number of knots
    K_max : int
        Maximum number of knots
    threshold : float
        Quantile of residuals for identifying problem regions
    min_spacing_frac : float
        Minimum spacing between knots as fraction of data range

    Returns
    -------
    knots : array
        Final adaptive knot locations
    history : list
        Knots at each iteration
    """
    # Start with uniform knots
    knots = list(np.linspace(x.min(), x.max(), K_initial + 2)[1:-1])
    history = [knots.copy()]

    x_range = x.max() - x.min()
    min_spacing = min_spacing_frac * x_range

    for iteration in range(K_max - K_initial):
        # Fit current model
        y_hat, df = bspline_fit(x, y, np.array(knots))
        residuals = np.abs(y - y_hat)

        # Find region with largest residuals
        # (use a running mean of residuals to smooth)
        window = max(3, len(x) // 20)
        residual_smooth = np.convolve(residuals, np.ones(window) / window,
                                      mode='same')

        # Find candidate location for new knot
        candidate_idx = np.argmax(residual_smooth)
        candidate_x = x[candidate_idx]

        # Check if too close to existing knot
        min_dist = min(abs(candidate_x - k) for k in knots) if knots else np.inf
        if min_dist < min_spacing:
            # Can't add knot here, try to find next best location
            mask = residual_smooth.copy()
            for k in knots:
                mask[np.abs(x - k) < min_spacing] = 0
            if np.max(mask) > 0:
                candidate_idx = np.argmax(mask)
                candidate_x = x[candidate_idx]
            else:
                break  # No valid location

        # Check if adding knot improves significantly
        # (simple check: residual at candidate location is above threshold)
        if residuals[candidate_idx] > np.percentile(residuals, threshold * 100):
            knots.append(candidate_x)
            knots.sort()
            history.append(knots.copy())
        else:
            break  # Diminishing returns

    return np.array(knots), history

# Example: Adaptive knot placement
np.random.seed(42)
n = 200
x = np.linspace(0, 10, n)

# True function with varying complexity
y_true = np.where(x < 5,
                  np.sin(x),             # Smooth in [0, 5]
                  np.sin(3 * x) + 0.5)   # More complex in [5, 10]
y = y_true + np.random.normal(0, 0.3, n)

knots, history = adaptive_knot_placement(x, y, K_initial=3, K_max=15)

print(f"Initial knots: 3, Final knots: {len(knots)}")
print(f"Knot locations: {np.round(knots, 2)}")
# Note: More knots should appear in the [5, 10] region
```

Adaptive knot placement is powerful but requires care: (1) The algorithm can be greedy and miss global optima; (2) Knots placed based on noise may cause overfitting; (3) The stopping criterion matters—too lenient leads to too many knots. Combine with CV to validate the final knot configuration.
The most flexible approach treats knot locations as parameters to optimize alongside the spline coefficients. This is called free-knot spline regression.
The Optimization Problem:
Given $K$ knots and spline coefficients $\boldsymbol{\beta}$, minimize: $$\mathcal{L}(\xi_1, \ldots, \xi_K, \boldsymbol{\beta}) = \sum_{i=1}^n (y_i - s(x_i; \xi_1, \ldots, \xi_K, \boldsymbol{\beta}))^2$$
This is a nonlinear, non-convex optimization problem—much harder than the fixed-knot case.
Free-knot optimization is difficult: (1) The objective is non-convex with many local minima; (2) The number of local minima grows exponentially with K; (3) Gradient-based methods require careful handling of the discontinuous coefficient structure; (4) Good initialization is crucial. For these reasons, free-knot splines are less commonly used than fixed-knot or smoothing splines.
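One way to see the non-convexity is to scan the RSS of a single-interior-knot cubic spline as the knot slides across the domain; even this one-dimensional profile typically shows several dips. A sketch with illustrative data:

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(2)
n = 120
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(6 * np.pi * x) + rng.normal(0, 0.2, n)

def rss_one_knot(knot, degree=3):
    """RSS of the least-squares cubic spline with one interior knot."""
    t = np.concatenate([np.zeros(degree + 1), [knot], np.ones(degree + 1)])
    n_basis = len(t) - degree - 1
    B = np.zeros((n, n_basis))
    for j in range(n_basis):
        c = np.zeros(n_basis)
        c[j] = 1.0
        B[:, j] = np.nan_to_num(BSpline(t, c, degree, extrapolate=False)(x),
                                nan=0.0)
    beta = np.linalg.lstsq(B, y, rcond=None)[0]
    return np.sum((y - B @ beta) ** 2)

grid = np.linspace(0.05, 0.95, 91)
rss = np.array([rss_one_knot(k) for k in grid])

# Count strict interior local minima of the RSS profile
local_min = int(np.sum((rss[1:-1] < rss[:-2]) & (rss[1:-1] < rss[2:])))
print(f"Interior local minima along the knot grid: {local_min}")
```

With $K$ knots the search space is $K$-dimensional and the number of such basins multiplies, which is why multi-start optimization (used below) is the standard workaround.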
Optimization Approaches:
Coordinate Descent: Alternate between optimizing one knot (with others fixed) and optimizing coefficients. Simple but slow.
Gradient-Based Methods: Compute gradients w.r.t. knot locations using the chain rule. Requires differentiability (not straightforward for B-splines).
Genetic Algorithms / Simulated Annealing: Global optimization methods that don't get stuck in local minima but are computationally expensive.
MARS (Multivariate Adaptive Regression Splines): Not exactly free-knot splines but a related approach that greedily selects knot locations from a candidate set.
```python
import numpy as np
from scipy.optimize import minimize
from scipy.interpolate import BSpline

def free_knot_objective(knot_params, x, y, degree=3):
    """
    Objective function for free-knot spline optimization.

    Parameters
    ----------
    knot_params : array of shape (K,)
        Interior knot locations (will be sorted)
    x, y : arrays
        Data
    degree : int
        Spline degree

    Returns
    -------
    rss : float
        Residual sum of squares (to minimize)
    """
    # Sort knots to ensure valid order
    knots = np.sort(knot_params)

    # Check for valid knot configuration
    if np.any(np.diff(knots) < 1e-6):  # Knots too close
        return 1e10
    if knots[0] <= x.min() or knots[-1] >= x.max():  # Out of range
        return 1e10

    # Build B-spline design matrix
    t = np.concatenate([
        np.repeat(x.min(), degree + 1),
        knots,
        np.repeat(x.max(), degree + 1)
    ])
    n_basis = len(t) - degree - 1
    B = np.zeros((len(x), n_basis))
    for j in range(n_basis):
        c = np.zeros(n_basis)
        c[j] = 1.0
        spl = BSpline(t, c, degree, extrapolate=False)
        B[:, j] = np.nan_to_num(spl(x), nan=0.0)

    # Fit coefficients via least squares
    try:
        beta = np.linalg.lstsq(B, y, rcond=None)[0]
        rss = np.sum((y - B @ beta) ** 2)
    except np.linalg.LinAlgError:
        rss = 1e10

    return rss

def fit_free_knot_spline(x, y, K, n_restarts=10, degree=3):
    """
    Fit free-knot spline using multi-start optimization.

    Parameters
    ----------
    x, y : arrays
        Data
    K : int
        Number of interior knots
    n_restarts : int
        Number of random restarts
    degree : int
        Spline degree

    Returns
    -------
    best_knots : array
        Optimal knot locations
    best_rss : float
        Best RSS achieved
    """
    x_sorted = np.sort(x)
    x_min, x_max = x_sorted[0], x_sorted[-1]
    margin = 0.05 * (x_max - x_min)

    best_knots = None
    best_rss = np.inf

    for restart in range(n_restarts):
        if restart == 0:
            # Start with quantile placement
            init_knots = np.percentile(x, np.linspace(0, 100, K + 2)[1:-1])
        else:
            # Random uniform initialization
            init_knots = np.sort(np.random.uniform(x_min + margin,
                                                   x_max - margin, K))

        # Optimize
        result = minimize(
            free_knot_objective, init_knots, args=(x, y, degree),
            method='L-BFGS-B',
            bounds=[(x_min + margin, x_max - margin)] * K,
            options={'maxiter': 100, 'disp': False}
        )

        if result.fun < best_rss:
            best_rss = result.fun
            best_knots = np.sort(result.x)

    return best_knots, best_rss

# Example: Free-knot spline optimization
np.random.seed(42)
n = 100
x = np.sort(np.random.uniform(0, 10, n))
y_true = np.sin(x) + 0.5 * np.sign(x - 5) * (x - 5) ** 2 / 25
y = y_true + np.random.normal(0, 0.4, n)

K = 4
optimal_knots, rss = fit_free_knot_spline(x, y, K, n_restarts=20)

print(f"Optimal {K} knots: {np.round(optimal_knots, 3)}")
print(f"RSS: {rss:.4f}")

# Compare with quantile knots
fixed_knots = np.percentile(x, np.linspace(0, 100, K + 2)[1:-1])
fixed_rss = free_knot_objective(fixed_knots, x, y)
print(f"Quantile knots: {np.round(fixed_knots, 3)}")
print(f"Fixed RSS: {fixed_rss:.4f}")
print(f"Improvement: {100 * (fixed_rss - rss) / fixed_rss:.1f}%")
```

MARS (Multivariate Adaptive Regression Splines), introduced by Friedman (1991), is a practical algorithm for automatic knot selection that extends to multiple predictors.
Key Ideas:
Piecewise Linear Basis: Instead of cubic splines, MARS uses 'hinge' functions $(x - t)_+$ and $(t - x)_+$, where $t$ is a knot.
Forward Stepwise: Greedily add basis functions that most reduce RSS, from a candidate set at all unique data values.
Backward Pruning: After forward pass, prune basis functions to minimize GCV.
Interactions: Can include products of hinge functions for multivariate interactions.
```python
# MARS is implemented in the 'py-earth' package:
#   pip install sklearn-contrib-py-earth
# Note: py-earth may need to be installed separately.
# Here we demonstrate the concept with a simplified implementation.

import numpy as np

def hinge_positive(x, t):
    """(x - t)_+ = max(0, x - t)"""
    return np.maximum(0, x - t)

def hinge_negative(x, t):
    """(t - x)_+ = max(0, t - x)"""
    return np.maximum(0, t - x)

def simple_mars_forward(x, y, max_terms=10):
    """
    Simplified MARS forward pass for 1D data.
    Iteratively adds hinge functions to reduce RSS.
    """
    n = len(x)
    # Candidate knots: all unique x values (excluding boundaries)
    candidates = np.unique(x)[1:-1]

    # Start with intercept only
    basis = [np.ones(n)]
    basis_info = [{'type': 'intercept'}]

    for term in range(max_terms):
        best_rss = np.inf
        best_basis = None
        best_info = None

        current_design = np.column_stack(basis)
        beta = np.linalg.lstsq(current_design, y, rcond=None)[0]
        current_rss = np.sum((y - current_design @ beta) ** 2)

        # Try each candidate knot and direction
        for t in candidates:
            for direction in ['positive', 'negative']:
                if direction == 'positive':
                    new_col = hinge_positive(x, t)
                else:
                    new_col = hinge_negative(x, t)

                # Skip if nearly constant
                if np.std(new_col) < 1e-6:
                    continue

                # Fit with new basis
                trial_design = np.column_stack([current_design, new_col])
                trial_beta = np.linalg.lstsq(trial_design, y, rcond=None)[0]
                trial_rss = np.sum((y - trial_design @ trial_beta) ** 2)

                if trial_rss < best_rss:
                    best_rss = trial_rss
                    best_basis = new_col
                    best_info = {'type': direction, 'knot': t}

        # Add best basis if it improves RSS significantly
        if best_rss < current_rss * 0.99:  # 1% improvement threshold
            basis.append(best_basis)
            basis_info.append(best_info)
        else:
            break

    return np.column_stack(basis), basis_info

# Example
np.random.seed(42)
n = 150
x = np.sort(np.random.uniform(0, 10, n))
y_true = 2 + 0.5 * x - 0.1 * (x - 3)**2 * (x > 3) + 0.2 * (x - 7) * (x > 7)
y = y_true + np.random.normal(0, 0.5, n)

design, info = simple_mars_forward(x, y, max_terms=6)

print(f"MARS selected {len(info)} basis functions:")
for i, item in enumerate(info):
    if item['type'] == 'intercept':
        print(f"  {i+1}. Intercept")
    else:
        print(f"  {i+1}. ({item['type']} hinge at {item['knot']:.2f})")

# Note: Real MARS includes backward pruning via GCV
# and handles multiple dimensions with interactions
```

For real MARS models, use the 'py-earth' sklearn-contrib package in Python or the 'earth' package in R. These implement the full algorithm, including backward pruning and interaction terms. MARS is particularly useful for interpretable models with automatic feature selection in moderate dimensions.
Asymptotic Theory:
For regression splines, theoretical results govern optimal knot counts as sample size $n \to \infty$.
Theorem (Stone, 1982): If the true function $f$ has $r$ continuous derivatives, then regression splines with $K \sim n^{1/(2r+1)}$ knots achieve the optimal convergence rate:
$$\mathbb{E}\left[\|\hat{f} - f\|_2^2\right] = O(n^{-2r/(2r+1)})$$
For cubic splines, which can exploit smoothness up to $r = 4$ derivatives, this suggests $K \sim n^{1/9}$ knots.
| Sample Size $n$ | Optimal $K$ (theory) | Practical Range |
|---|---|---|
| 50 | $50^{1/9} \approx 1.5$ | 3–8 |
| 100 | $100^{1/9} \approx 1.7$ | 4–10 |
| 500 | $500^{1/9} \approx 2.0$ | 5–15 |
| 1000 | $1000^{1/9} \approx 2.2$ | 6–18 |
| 10000 | $10000^{1/9} \approx 2.8$ | 8–25 |
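The theoretical column is one line of arithmetic per row:

```python
# Reproduce the "Optimal K (theory)" column: K ~ n^(1/9) for cubic splines
for n in [50, 100, 500, 1000, 10000]:
    print(f"n = {n:6d}: n^(1/9) = {n ** (1 / 9):.1f}")
```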
Practical Implications:
Optimal $K$ grows slowly with $n$: Even with 10,000 observations, theory suggests only ~3 knots are 'optimal.' In practice, we use more because:
Smoothness matters: If the function is rougher ($r$ smaller), more knots are needed. If smoother, fewer suffice.
Constants hidden: The $O(\cdot)$ notation hides constants that may favor more or fewer knots in finite samples.
Asymptotic theory gives qualitative guidance (K should grow slowly with n) but specific numbers depend on many unknown factors. In practice, use cross-validation or information criteria, which implicitly account for the true function's complexity. Trust empirical model selection over asymptotic formulas.
What's Next:
We've seen two extremes: regression splines (few fixed knots, no regularization) and smoothing splines (knots at all data points, heavy regularization). Penalized splines (P-splines) combine the best of both: moderately many knots with a smoothness penalty. On the next page, we explore this powerful hybrid approach.
You now understand knot selection for regression splines: fixed placement strategies, cross-validation and information criteria for choosing the number of knots, adaptive and free-knot approaches, and the MARS algorithm. These tools let you build flexible, interpretable spline models tailored to your data.