We've seen two approaches to spline regression:
Regression Splines: Use few knots ($K \ll n$), fit by OLS. Fast but requires careful knot selection.
Smoothing Splines: Use $n$ knots (one per data point), regularize via penalty. Automatic smoothing but involves $n \times n$ matrices.
Penalized Splines (P-splines) combine the advantages of both: a moderate number of equally spaced knots keeps the basis small (as in regression splines), while a roughness penalty controls smoothness (as in smoothing splines).
P-splines, introduced by Eilers and Marx (1996), have become the workhorse of flexible regression in practice.
By the end of this page, you will understand the P-spline formulation with difference penalties, why difference penalties approximate derivative penalties, how to select the number of knots and penalty strength, computational advantages of P-splines, and their connection to mixed models. You'll see P-splines as the practical method of choice for most spline regression problems.
The Objective Function:
P-splines minimize a penalized sum of squares:
$$\mathcal{L}(\boldsymbol{\beta}) = \|\mathbf{y} - \mathbf{B}\boldsymbol{\beta}\|^2 + \lambda \|\mathbf{D}_q \boldsymbol{\beta}\|^2$$
where:
- $\mathbf{y} \in \mathbb{R}^n$ is the response vector,
- $\mathbf{B} \in \mathbb{R}^{n \times p}$ is the B-spline design matrix built on equally spaced knots,
- $\boldsymbol{\beta} \in \mathbb{R}^p$ is the coefficient vector,
- $\lambda \geq 0$ is the smoothing parameter,
- $\mathbf{D}_q$ is the $q$-th order difference matrix.
The key innovation is the difference penalty $\|\mathbf{D}_q \boldsymbol{\beta}\|^2$, applied to the coefficients rather than an integral of derivatives.
Difference Matrices:
The first-order difference matrix $\mathbf{D}_1$ computes successive differences: $$\mathbf{D}_1 = \begin{pmatrix} -1 & 1 & 0 & \cdots & 0 \\ 0 & -1 & 1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & -1 & 1 \end{pmatrix} \in \mathbb{R}^{(p-1) \times p}$$
So $(\mathbf{D}_1 \boldsymbol{\beta})_j = \beta_{j+1} - \beta_j$.
Higher-order differences are computed by applying first-order differencing repeatedly: $$\mathbf{D}_2 = \mathbf{D}_1^{(p-1)} \mathbf{D}_1^{(p)} \in \mathbb{R}^{(p-2) \times p}$$ where $\mathbf{D}_1^{(m)}$ denotes the first-order difference matrix with $m$ columns.
Second differences are: $(\mathbf{D}_2 \boldsymbol{\beta})_j = \beta_{j+2} - 2\beta_{j+1} + \beta_j$
```python
import numpy as np

def difference_matrix(p, order=2):
    """
    Construct difference matrix of given order.

    Parameters
    ----------
    p : int
        Number of columns (B-spline coefficients)
    order : int
        Order of differencing (1 = first difference, 2 = second, etc.)

    Returns
    -------
    D : array of shape (p - order, p)
        Difference matrix
    """
    D = np.eye(p)
    for _ in range(order):
        # Apply first-order differencing
        n_rows = D.shape[0] - 1
        D_new = np.zeros((n_rows, p))
        for i in range(n_rows):
            D_new[i, :] = D[i + 1, :] - D[i, :]
        D = D_new
    return D

def build_penalty_matrix(p, order=2):
    """
    Construct the penalty matrix P = D^T @ D.

    The penalty is beta^T @ P @ beta = ||D @ beta||^2
    """
    D = difference_matrix(p, order)
    return D.T @ D

# Example: Build difference matrices
p = 8  # Number of B-spline coefficients

D1 = difference_matrix(p, order=1)
D2 = difference_matrix(p, order=2)
D3 = difference_matrix(p, order=3)

print("First-order difference matrix D1 (shape:", D1.shape, "):")
print(D1)
print()
print("Second-order difference matrix D2 (shape:", D2.shape, "):")
print(D2)

# Show what D2 does to a coefficient vector
beta = np.array([1, 2, 4, 7, 11, 16, 22, 29], dtype=float)  # Quadratic-ish
print("Beta:", beta)
print("Second differences (D2 @ beta):", D2 @ beta)
# For a perfect quadratic, second differences are constant!
```

Second-order differences penalize changes in slope (discrete curvature). This is analogous to the integral of the squared second derivative in smoothing splines. For B-splines with equally-spaced knots, the second difference of coefficients approximates the second derivative of the spline curve. Using $q=2$ (second-order penalty) is the standard choice for cubic P-splines.
Deriving the Solution:
The P-spline objective is: $$\mathcal{L}(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{B}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{B}\boldsymbol{\beta}) + \lambda \boldsymbol{\beta}^T \mathbf{P} \boldsymbol{\beta}$$
where $\mathbf{P} = \mathbf{D}_q^T \mathbf{D}_q$ is the penalty matrix.
Differentiating and setting to zero: $$\frac{\partial \mathcal{L}}{\partial \boldsymbol{\beta}} = -2\mathbf{B}^T(\mathbf{y} - \mathbf{B}\boldsymbol{\beta}) + 2\lambda \mathbf{P} \boldsymbol{\beta} = \mathbf{0}$$
Rearranging: $$\mathbf{B}^T \mathbf{B} \boldsymbol{\beta} + \lambda \mathbf{P} \boldsymbol{\beta} = \mathbf{B}^T \mathbf{y}$$ $$(\mathbf{B}^T \mathbf{B} + \lambda \mathbf{P}) \boldsymbol{\beta} = \mathbf{B}^T \mathbf{y}$$
Solution: $$\hat{\boldsymbol{\beta}} = (\mathbf{B}^T \mathbf{B} + \lambda \mathbf{P})^{-1} \mathbf{B}^T \mathbf{y}$$
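The closed form above can be checked directly. The sketch below (with illustrative names, not from the text) solves the penalized normal equations on random data and verifies a basic property of the solution: the penalized roughness $\|\mathbf{D}_q \hat{\boldsymbol{\beta}}\|$ shrinks as $\lambda$ grows.

```python
import numpy as np

# Minimal sketch: solve (B'B + lambda * P) beta = B'y for several lambda
# values and watch the roughness ||D_q beta_hat|| decrease.
rng = np.random.default_rng(0)
n, p, q = 60, 10, 2

B = rng.normal(size=(n, p))   # stand-in design matrix
y = rng.normal(size=n)

# q-th order difference matrix: q-th differences of the identity's rows
D = np.diff(np.eye(p), n=q, axis=0)   # shape (p - q, p)
P = D.T @ D

def beta_hat(lam):
    return np.linalg.solve(B.T @ B + lam * P, B.T @ y)

rough = [np.linalg.norm(D @ beta_hat(lam)) for lam in (0.0, 1.0, 100.0)]
print(rough)  # non-increasing: larger lambda forces smoother coefficients
```

The monotone decrease is a general property of this quadratic objective: comparing the objectives at two penalty levels shows the optimal penalty term can never increase with $\lambda$.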
Fitted Values and Smoother Matrix:
The fitted values are: $$\hat{\mathbf{y}} = \mathbf{B} \hat{\boldsymbol{\beta}} = \mathbf{B}(\mathbf{B}^T \mathbf{B} + \lambda \mathbf{P})^{-1} \mathbf{B}^T \mathbf{y} = \mathbf{H}_\lambda \mathbf{y}$$
The smoother matrix (or hat matrix) is: $$\mathbf{H}_\lambda = \mathbf{B}(\mathbf{B}^T \mathbf{B} + \lambda \mathbf{P})^{-1} \mathbf{B}^T$$
This is an $n \times n$ matrix but involves inverting only a $p \times p$ matrix ($p$ is the number of B-spline basis functions, typically 20-40).
Effective Degrees of Freedom: $$\text{df}(\lambda) = \text{tr}(\mathbf{H}_\lambda)$$
This ranges from $p$ (when $\lambda = 0$) to $q$ (when $\lambda \to \infty$), where $q$ is the order of the difference penalty.
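A quick way to see why $\text{df} \to q$: the $q$-th order difference penalty leaves polynomials of degree $< q$ in the coefficient index unpenalized, so only those $q$ dimensions survive as $\lambda \to \infty$. A minimal check, with illustrative names:

```python
import numpy as np

# The null space of the q-th order difference penalty contains exactly the
# polynomial coefficient sequences of degree < q.
p, q = 12, 2
D = np.diff(np.eye(p), n=q, axis=0)   # q-th order difference matrix

j = np.arange(p, dtype=float)
linear = 3.0 + 0.5 * j                # degree < q: costs nothing
quadratic = j ** 2                    # degree >= q: penalized

print(np.allclose(D @ linear, 0))     # True
print(np.allclose(D @ quadratic, 0))  # False: curvature is penalized
```

Since the penalty matrix $\mathbf{P} = \mathbf{D}_q^T \mathbf{D}_q$ has rank $p - q$, exactly $q$ directions (here, constant and linear trends) remain unshrunk no matter how large $\lambda$ gets.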
```python
import numpy as np
from scipy.interpolate import BSpline

def fit_pspline(x, y, n_basis=25, degree=3, diff_order=2, lambda_val=1.0):
    """
    Fit a P-spline to data.

    Parameters
    ----------
    x, y : arrays
        Data
    n_basis : int
        Number of B-spline basis functions
    degree : int
        B-spline degree
    diff_order : int
        Order of difference penalty (typically 2)
    lambda_val : float
        Smoothing parameter

    Returns
    -------
    dict with keys:
        'beta': fitted coefficients
        'y_hat': fitted values
        'smoother': smoother matrix
        'df': effective degrees of freedom
        'knots': extended knot sequence
    """
    n = len(x)
    x_min, x_max = x.min(), x.max()

    # Build equally-spaced interior knots
    # n_basis = n_interior_knots + degree + 1
    n_interior = n_basis - degree - 1
    interior_knots = np.linspace(x_min, x_max, n_interior + 2)[1:-1]

    # Extended knot sequence
    t = np.concatenate([
        np.repeat(x_min, degree + 1),
        interior_knots,
        np.repeat(x_max, degree + 1)
    ])

    # Build B-spline design matrix
    B = np.zeros((n, n_basis))
    for j in range(n_basis):
        c = np.zeros(n_basis)
        c[j] = 1.0
        spl = BSpline(t, c, degree, extrapolate=False)
        B[:, j] = np.nan_to_num(spl(x), nan=0.0)

    # Build difference penalty matrix
    D = difference_matrix(n_basis, diff_order)
    P = D.T @ D

    # Solve penalized normal equations: (B'B + lambda * P) beta = B'y
    A = B.T @ B + lambda_val * P
    beta = np.linalg.solve(A, B.T @ y)

    # Fitted values
    y_hat = B @ beta

    # Smoother matrix: H = B (B'B + lambda*P)^{-1} B'
    A_inv = np.linalg.inv(A)
    H = B @ A_inv @ B.T

    # Effective degrees of freedom
    df = np.trace(H)

    return {
        'beta': beta,
        'y_hat': y_hat,
        'smoother': H,
        'df': df,
        'knots': t,
        'B': B,
        'P': P
    }

def difference_matrix(p, order=2):
    D = np.eye(p)
    for _ in range(order):
        n_rows = D.shape[0] - 1
        D_new = np.zeros((n_rows, p))
        for i in range(n_rows):
            D_new[i, :] = D[i + 1, :] - D[i, :]
        D = D_new
    return D

# Example: Fit P-spline
np.random.seed(42)
n = 200
x = np.sort(np.random.uniform(0, 2 * np.pi, n))
y_true = np.sin(x) + 0.5 * np.cos(2 * x)
y = y_true + np.random.normal(0, 0.4, n)

# Fit with different lambda values
print("Effect of lambda on P-spline fit:")
for lam in [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]:
    result = fit_pspline(x, y, n_basis=25, lambda_val=lam)
    mse = np.mean((y_true - result['y_hat']) ** 2)
    print(f"lambda = {lam:7.3f}: df = {result['df']:5.2f}, MSE = {mse:.4f}")
```

For P-splines with $p$ basis functions, we invert a $p \times p$ matrix (cost $O(p^3)$), not $n \times n$. With $p = 25$ and $n = 10000$, that is $25^3 \approx 1.6 \times 10^4$ operations versus $10000^3 = 10^{12}$, roughly $10^{-8}$ times the work of naive smoothing splines! Even for large datasets, P-spline fitting is essentially instant.
Unlike regression splines where the number of knots directly controls flexibility, P-splines use many knots (to ensure sufficient flexibility) and let the penalty control smoothness.
Key Insight: As long as there are 'enough' basis functions to capture the true curve, adding more doesn't hurt—the penalty prevents overfitting.
Practical Guidance:
| $p$ (basis) | Flexibility | Effect of $\lambda$ | Recommendation |
|---|---|---|---|
| 10 | Limited | May underfit even at $\lambda=0$ | Too few for complex functions |
| 20 | Moderate | Good control, may miss details | Minimum for most cases |
| 30-40 | High | Excellent control | Default choice |
| 50+ | Very high | Strong penalty needed | Only for very complex patterns |
| 100+ | Excessive | High computational cost | Rarely necessary |
Unlike regression splines, P-splines don't overfit when you have 'too many' basis functions—the penalty takes care of it. This means you can safely err on the side of more basis functions. The only costs are (1) slightly higher computation and (2) needing a larger penalty value.
```python
import numpy as np

def cv_score(lambda_val, x, y, n_basis, degree, diff_order):
    """Compute GCV score for given lambda."""
    result = fit_pspline(x, y, n_basis, degree, diff_order, lambda_val)
    n = len(y)
    df = result['df']
    rss = np.sum((y - result['y_hat']) ** 2)
    # GCV
    gcv = (rss / n) / ((1 - df / n) ** 2)
    return gcv

def optimal_lambda_gcv(x, y, n_basis=25, degree=3, diff_order=2):
    """Find lambda that minimizes GCV."""
    # Grid search over a wide log-spaced range
    lambdas = np.logspace(-6, 6, 50)
    gcv_scores = [cv_score(lam, x, y, n_basis, degree, diff_order)
                  for lam in lambdas]
    best_idx = np.argmin(gcv_scores)
    return lambdas[best_idx], gcv_scores[best_idx]

# Experiment: vary number of basis functions
np.random.seed(42)
n = 150
x = np.sort(np.random.uniform(0, 2 * np.pi, n))
y_true = np.sin(x) + 0.3 * np.sin(3 * x)
y = y_true + np.random.normal(0, 0.35, n)

print("Effect of number of basis functions (with GCV-optimal lambda):")
print(f"{'n_basis':>8} | {'opt_lambda':>12} | {'df':>6} | {'MSE':>8}")
print("-" * 45)

for n_basis in [10, 15, 20, 25, 30, 40, 50]:
    opt_lambda, _ = optimal_lambda_gcv(x, y, n_basis)
    result = fit_pspline(x, y, n_basis, lambda_val=opt_lambda)
    mse = np.mean((y_true - result['y_hat']) ** 2)
    print(f"{n_basis:>8} | {opt_lambda:>12.4f} | {result['df']:>6.2f} | {mse:>8.4f}")

# Key observation: Once n_basis is large enough (~20+),
# the MSE stabilizes because lambda adapts
```

As with smoothing splines, the key tuning decision is selecting $\lambda$. The same methods apply:
1. Cross-Validation (LOOCV or K-Fold): Minimize estimated out-of-sample prediction error.
2. Generalized Cross-Validation (GCV): $$\text{GCV}(\lambda) = \frac{\text{RSS}(\lambda)/n}{(1 - \text{df}(\lambda)/n)^2}$$
3. Information Criteria (AIC, BIC): $$\text{AIC}(\lambda) = n \log(\text{RSS}/n) + 2 \cdot \text{df}(\lambda)$$
4. REML (Restricted Maximum Likelihood): From the mixed model perspective (see Section 7), $\lambda$ is related to variance components, estimated via REML.
```python
import numpy as np
import matplotlib.pyplot as plt

def compute_criteria(result, y, n):
    """Compute various selection criteria."""
    y_hat = result['y_hat']
    df = result['df']
    rss = np.sum((y - y_hat) ** 2)

    # GCV
    gcv = (rss / n) / ((1 - df / n) ** 2)
    # AIC
    aic = n * np.log(rss / n) + 2 * df
    # BIC
    bic = n * np.log(rss / n) + np.log(n) * df

    # LOOCV (using shortcut via hat matrix diagonal)
    H = result['smoother']
    h_diag = np.diag(H)
    loo_residuals = (y - y_hat) / (1 - h_diag + 1e-10)
    loocv = np.mean(loo_residuals ** 2)

    return {'GCV': gcv, 'AIC': aic, 'BIC': bic, 'LOOCV': loocv, 'df': df}

def find_optimal_lambda(x, y, n_basis=25, criterion='GCV'):
    """Find optimal lambda by minimizing specified criterion."""
    lambdas = np.logspace(-6, 6, 100)
    best_lambda = None
    best_score = np.inf
    all_scores = []

    for lam in lambdas:
        result = fit_pspline(x, y, n_basis, lambda_val=lam)
        scores = compute_criteria(result, y, len(y))
        all_scores.append(scores)
        if scores[criterion] < best_score:
            best_score = scores[criterion]
            best_lambda = lam

    return best_lambda, all_scores, lambdas

# Example: Compare different selection criteria
np.random.seed(42)
n = 120
x = np.sort(np.random.uniform(0, 2 * np.pi, n))
y_true = np.sin(x) + 0.4 * np.cos(2.5 * x)
y = y_true + np.random.normal(0, 0.4, n)

print("Optimal lambda by different criteria:")
for criterion in ['GCV', 'AIC', 'BIC', 'LOOCV']:
    opt_lambda, scores, lambdas = find_optimal_lambda(x, y, criterion=criterion)
    # Evaluate at optimal
    result = fit_pspline(x, y, n_basis=25, lambda_val=opt_lambda)
    mse = np.mean((y_true - result['y_hat']) ** 2)
    print(f"{criterion:>6}: lambda = {opt_lambda:>10.4f}, "
          f"df = {result['df']:>5.2f}, MSE = {mse:.4f}")

# Plot criteria curves
_, all_scores, lambdas = find_optimal_lambda(x, y, criterion='GCV')

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
criteria_names = ['GCV', 'AIC', 'BIC', 'LOOCV']

for ax, crit in zip(axes.flatten(), criteria_names):
    values = [s[crit] for s in all_scores]
    ax.semilogx(lambdas, values, 'b-', lw=1.5)
    ax.axvline(lambdas[np.argmin(values)], color='red', linestyle='--',
               label=f'Optimal: {lambdas[np.argmin(values)]:.3f}')
    ax.set_xlabel('lambda')
    ax.set_ylabel(crit)
    ax.set_title(f'{crit} vs lambda')
    ax.legend()

plt.tight_layout()
plt.savefig('lambda_selection.png', dpi=150)
```

GCV and LOOCV typically select similar, relatively small values of $\lambda$ (favoring flexibility), while BIC is more conservative. In practice, GCV is the default choice for P-splines because it's fast to compute and well-validated. If interpretability matters more than prediction, consider BIC.
A remarkable fact: P-splines with second-order difference penalty approximate smoothing splines with integrated second derivative penalty. This is why they work so well with far fewer basis functions.
The Connection:
For equally-spaced B-splines with knot spacing $h$, the second difference of coefficients approximates the second derivative of the spline:
$$\frac{\beta_{j+1} - 2\beta_j + \beta_{j-1}}{h^2} \approx s''(x_j)$$
where $x_j$ is near the center of the $j$-th B-spline.
Summing over $j$ (a Riemann sum with grid spacing $h$): $$\frac{1}{h^3} \|\mathbf{D}_2 \boldsymbol{\beta}\|^2 \approx \int [s''(x)]^2 \, dx$$ (in practice the power of $h$ is absorbed into $\lambda$).
So the discrete penalty is a Riemann sum approximation to the integral penalty!
With 20-40 equally-spaced B-splines, the P-spline approximation to a smoothing spline is typically accurate to within 1-2 effective degrees of freedom. For practical purposes, the fitted curves are visually indistinguishable. This is why P-splines have largely replaced smoothing splines in modern practice.
Why This Matters:
Computational savings: fitting a P-spline with 30 basis functions requires inverting a $30 \times 30$ matrix (about $30^3 = 27{,}000$ operations). Smoothing splines with $n = 10000$ cost $O(n^3)$ naively (though efficient banded algorithms reduce this).
Memory savings: Store a $30 \times 30$ matrix instead of $10000 \times 10000$.
Same effective result: The fitted curve is nearly identical.
Greater flexibility: The discrete penalty extends easily to other contexts (spatial, tensor products) where integral penalties are intractable.
| Aspect | Smoothing Splines | P-splines ($p$ basis) |
|---|---|---|
| Number of parameters | $n$ | $p \ll n$ |
| Matrix to invert | $n \times n$ | $p \times p$ |
| Time complexity (naive) | $O(n^3)$ | $O(np^2 + p^3)$ |
| Time complexity (efficient) | $O(n)$ (banded) | $O(np^2 + p^3)$ |
| Memory | $O(n^2)$ or $O(n)$ banded | $O(np + p^2)$ |
| Applicable dimensions | 1D only (efficient) | Generalizes to tensor products |
| Knot positions | At data points | Equally-spaced (typically) |
P-splines extend naturally to multiple dimensions using tensor product bases. This is a major advantage over smoothing splines, which don't have efficient higher-dimensional analogues.
Tensor Product P-splines for 2D:
For a function $f(x_1, x_2)$, use: $$f(x_1, x_2) = \sum_j \sum_k \beta_{jk} B_j(x_1) B_k(x_2)$$
The penalty extends as: $$\lambda_1 \|\mathbf{D}_1 \boldsymbol{\beta}\|^2 + \lambda_2 \|\boldsymbol{\beta} \mathbf{D}_2^T\|^2$$
where the coefficients $\beta_{jk}$ are arranged as a matrix, $\mathbf{D}_1$ and $\mathbf{D}_2$ are difference matrices acting along each dimension, and the norms are Frobenius norms, penalizing roughness in each direction separately.
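When the coefficient matrix is vectorized, these directional penalties become Kronecker-product operators, which is how they are usually assembled in code. A small sketch of the identity, with illustrative names and NumPy's row-major (C-order) vec convention:

```python
import numpy as np

# Row-major vec identity: kron(D1, I) @ vec(Beta) differences down the
# columns of Beta, and kron(I, D2) @ vec(Beta) differences along its rows.
rng = np.random.default_rng(1)
p1, p2 = 5, 4
Beta = rng.normal(size=(p1, p2))      # coefficient matrix

D1 = np.diff(np.eye(p1), n=2, axis=0)  # second differences, x1 direction
D2 = np.diff(np.eye(p2), n=2, axis=0)  # second differences, x2 direction

v = Beta.ravel()                       # row-major vec(Beta)
lhs1 = np.kron(D1, np.eye(p2)) @ v     # x1-direction penalty operator
lhs2 = np.kron(np.eye(p1), D2) @ v     # x2-direction penalty operator

print(np.allclose(lhs1, (D1 @ Beta).ravel()))    # True
print(np.allclose(lhs2, (Beta @ D2.T).ravel()))  # True
```

This is why the 2D penalty matrix can be written $\lambda_1 (\mathbf{D}_1^T\mathbf{D}_1 \otimes \mathbf{I}) + \lambda_2 (\mathbf{I} \otimes \mathbf{D}_2^T\mathbf{D}_2)$ acting on the vectorized coefficients.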
```python
import numpy as np
from scipy.interpolate import BSpline

def build_2d_pspline_basis(x1, x2, n_basis_1=15, n_basis_2=15, degree=3):
    """
    Build tensor product B-spline basis for 2D P-spline.

    Parameters
    ----------
    x1, x2 : arrays of shape (n,)
        Predictor values (paired observations)
    n_basis_1, n_basis_2 : int
        Number of basis functions in each dimension
    degree : int
        B-spline degree

    Returns
    -------
    B : array of shape (n, n_basis_1 * n_basis_2)
        Tensor product design matrix
    """
    n = len(x1)

    # Build 1D bases
    def build_1d_basis(x, n_basis):
        x_min, x_max = x.min(), x.max()
        n_interior = n_basis - degree - 1
        interior_knots = np.linspace(x_min, x_max, n_interior + 2)[1:-1]
        t = np.concatenate([
            np.repeat(x_min, degree + 1),
            interior_knots,
            np.repeat(x_max, degree + 1)
        ])
        B = np.zeros((len(x), n_basis))
        for j in range(n_basis):
            c = np.zeros(n_basis)
            c[j] = 1.0
            spl = BSpline(t, c, degree, extrapolate=False)
            B[:, j] = np.nan_to_num(spl(x), nan=0.0)
        return B

    B1 = build_1d_basis(x1, n_basis_1)  # n x n_basis_1
    B2 = build_1d_basis(x2, n_basis_2)  # n x n_basis_2

    # Tensor product: B[i, j*n_basis_2 + k] = B1[i,j] * B2[i,k]
    # This is the row-wise Kronecker product
    B = np.zeros((n, n_basis_1 * n_basis_2))
    for i in range(n):
        B[i, :] = np.outer(B1[i, :], B2[i, :]).flatten()

    return B, B1, B2

def build_2d_penalty(n_basis_1, n_basis_2, diff_order=2,
                     lambda_1=1.0, lambda_2=1.0):
    """
    Build 2D penalty matrix for tensor product P-spline.

    Penalty: lambda_1 * ||(D_1 ⊗ I_2) beta||^2 + lambda_2 * ||(I_1 ⊗ D_2) beta||^2
    """
    D1 = difference_matrix(n_basis_1, diff_order)
    D2 = difference_matrix(n_basis_2, diff_order)
    I1 = np.eye(n_basis_1)
    I2 = np.eye(n_basis_2)

    # Kronecker products for penalties in each direction
    P1 = np.kron(D1.T @ D1, I2)  # Penalty in x1 direction
    P2 = np.kron(I1, D2.T @ D2)  # Penalty in x2 direction

    P = lambda_1 * P1 + lambda_2 * P2
    return P

# Example: Fit 2D P-spline
np.random.seed(42)
n = 500

# Generate 2D data
x1 = np.random.uniform(0, 1, n)
x2 = np.random.uniform(0, 1, n)

# True surface: peaks and valleys
y_true = (np.sin(4 * np.pi * x1) * np.cos(4 * np.pi * x2)
          + 0.5 * np.exp(-((x1 - 0.5)**2 + (x2 - 0.5)**2) / 0.1))
y = y_true + np.random.normal(0, 0.2, n)

# Fit 2D P-spline
n_basis_1, n_basis_2 = 12, 12
B, B1, B2 = build_2d_pspline_basis(x1, x2, n_basis_1, n_basis_2)
P = build_2d_penalty(n_basis_1, n_basis_2, lambda_1=10, lambda_2=10)

# Solve penalized normal equations
BtB = B.T @ B
Bty = B.T @ y
beta = np.linalg.solve(BtB + P, Bty)

y_hat = B @ beta
mse = np.mean((y_true - y_hat) ** 2)
print(f"2D P-spline: {n_basis_1 * n_basis_2} basis functions, MSE = {mse:.4f}")
```

Tensor product P-splines have $p_1 \times p_2 \times \cdots$ parameters. With 20 basis functions per dimension: 1D = 20, 2D = 400, 3D = 8000, 4D = 160000! This exponential growth limits tensor products to ~3 dimensions. For higher dimensions, use additive models (GAMs) or other sparse approaches.
A powerful perspective: P-splines can be reformulated as linear mixed models. This connection enables:
- automatic selection of $\lambda$ via REML,
- fitting with standard mixed-model software,
- principled standard errors and Bayesian credible intervals.
The Reparameterization:
The P-spline coefficients $\boldsymbol{\beta}$ can be decomposed as: $$\boldsymbol{\beta} = \mathbf{X}\boldsymbol{\alpha} + \mathbf{Z}\mathbf{u}$$
where:
- $\mathbf{X}$ is a fixed-effects design whose columns span the null space of the penalty (polynomials of degree $< q$ in the coefficient index),
- $\mathbf{Z}$ is a random-effects design spanning the penalized complement, with $\mathbf{u}$ treated as random effects.
The Mixed Model:
$$y_i = \mathbf{x}_i^T \boldsymbol{\alpha} + \mathbf{z}_i^T \mathbf{u} + \epsilon_i$$
with:
- $\mathbf{u} \sim N(\mathbf{0}, \sigma_u^2 \mathbf{I})$ (random effects),
- $\epsilon_i \sim N(0, \sigma^2)$ (noise),
- the smoothing parameter recovered as $\lambda = \sigma^2 / \sigma_u^2$.
REML Estimation:
The smoothing parameter $\lambda$ emerges as a ratio of variance components, $\lambda = \sigma^2 / \sigma_u^2$, estimated by Restricted Maximum Likelihood (REML). This provides an automatic, likelihood-based choice of $\lambda$ that avoids a grid search and tends to be more stable than GCV.
The mixed model representation is implemented in R's mgcv package (e.g. `s(x, bs = "ps")` inside `gam()`) and in Python's statsmodels (via `GLMGam`). These provide automatic smoothness selection (REML in mgcv), giving a fully automated P-spline fitting procedure. This is often the easiest way to use P-splines in practice.
```python
# In practice, use libraries that implement P-splines with automatic smoothing

# Python: using pygam
# pip install pygam
"""
from pygam import LinearGAM, s

# Fit P-spline with automatic smoothing selection
model = LinearGAM(s(0, n_splines=25)).fit(x, y)

# Get fitted values and confidence intervals
y_hat = model.predict(x)
conf_int = model.confidence_intervals(x, width=0.95)

# Get effective degrees of freedom
print(f"Effective df: {model.statistics_['edof']}")
"""

# Python: using statsmodels
"""
import statsmodels.api as sm
from patsy import dmatrix

# Create B-spline basis with df=25
basis_formula = "bs(x, df=25, degree=3) - 1"
B = dmatrix(basis_formula, {"x": x})

# For penalized estimation, use BSplines directly
from statsmodels.gam.api import GLMGam, BSplines

# Define spline term
spline = BSplines(x, df=25, degree=3)

# Fit with automatic selection (via GCV or similar)
model = GLMGam.from_formula("y ~ 1", {"y": y}, smoother=spline)
result = model.fit()

print(result.summary())
"""

# Demonstration: Mixed model view conceptually
import numpy as np

def decompose_penalty(n_basis, diff_order=2):
    """
    Decompose B-spline basis into fixed and random components.
    Returns X (null space of penalty) and Z (penalized space).
    """
    D = difference_matrix(n_basis, diff_order)
    P = D.T @ D

    # Eigendecomposition of P
    eigenvalues, eigenvectors = np.linalg.eigh(P)

    # Null space: eigenvectors with eigenvalue ~0
    null_mask = eigenvalues < 1e-10
    X = eigenvectors[:, null_mask]   # Fixed effects design

    # Range space: eigenvectors with positive eigenvalues
    range_mask = ~null_mask
    Z = eigenvectors[:, range_mask]  # Random effects design

    # Eigenvalues for scaling
    d = eigenvalues[range_mask]

    return X, Z, d

# Example
n_basis = 25
X, Z, d = decompose_penalty(n_basis, diff_order=2)

print(f"Penalty matrix rank: {len(d)}")
print(f"Null space dimension: {X.shape[1]} (= diff_order)")
print(f"Random effects dimension: {Z.shape[1]}")
print("Null space corresponds to linear functions (degree < diff_order)")
```

| Pitfall | Symptom | Solution |
|---|---|---|
| Too few basis functions | Underfit, $\lambda = 0$ still wiggly | Increase $p$ to 30+ |
| Lambda too small | Overfitting, wiggly | Let GCV/REML select $\lambda$ |
| Lambda too large | Underfitting, too smooth | Let GCV/REML select $\lambda$ |
| Non-equally-spaced knots | Penalty interpretation breaks | Use equally-spaced knots |
| Data outside knot range | Poor fit at boundaries | Extend knots slightly beyond data range |
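The boundary fix from the last row of the table can be sketched as follows; the 5% margin and the helper name `extended_knots` are illustrative choices, not a standard:

```python
import numpy as np
from scipy.interpolate import BSpline

# Sketch: place the knot range slightly beyond the data so no observation
# sits exactly on (or outside) the boundary knots.
def extended_knots(x, n_basis, degree=3, margin=0.05):
    span = x.max() - x.min()
    lo, hi = x.min() - margin * span, x.max() + margin * span
    n_interior = n_basis - degree - 1
    interior = np.linspace(lo, hi, n_interior + 2)[1:-1]
    return np.concatenate([np.repeat(lo, degree + 1),
                           interior,
                           np.repeat(hi, degree + 1)])

x = np.linspace(0.0, 1.0, 50)
n_basis, degree = 12, 3
t = extended_knots(x, n_basis, degree)

# Build the design matrix on the extended knots
B = np.column_stack([
    BSpline(t, np.eye(n_basis)[j], degree, extrapolate=False)(x)
    for j in range(n_basis)
])
B = np.nan_to_num(B)

# All data now lie strictly inside the knot range, where the clamped
# B-spline basis still forms a partition of unity.
print(np.allclose(B.sum(axis=1), 1.0))  # True
```

The difference penalty is unchanged; only the knot construction moves, so this slots directly into a `fit_pspline`-style routine.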
For 1D P-splines: 25-30 equally-spaced cubic B-splines + second-order difference penalty + GCV or REML selection. This works remarkably well across a wide range of problems with minimal tuning.
Module Complete!
You have now completed the Spline Regression module, covering:
- B-spline basis construction,
- natural splines,
- smoothing splines,
- knot selection strategies,
- penalized splines (P-splines).
These methods form the foundation for flexible nonparametric regression in machine learning and statistics. P-splines, in particular, are the workhorse behind Generalized Additive Models (GAMs), which we'll encounter in a subsequent module.
You now have a comprehensive understanding of spline regression methods: from the foundational B-spline construction through natural splines, smoothing splines, knot selection strategies, and the practical P-spline framework. These tools enable flexible, interpretable modeling of nonlinear relationships in continuous data.