In the additive model $f(\mathbf{x}) = \alpha + \sum_{j=1}^p f_j(x_j)$, each component function $f_j$ captures the potentially nonlinear relationship between feature $j$ and the response. But what exactly is a component function, and how do we represent it?
This question lies at the heart of GAM methodology. The choice of component function representation determines how flexible the fitted curves can be, how costly they are to compute, and how easy the resulting model is to interpret.
By the end of this page, you will understand multiple representations for smooth functions—splines, local polynomials, and kernels—their mathematical foundations, practical trade-offs, and the central role of basis expansions in modern GAM implementations.
We want each $f_j: \mathbb{R} \to \mathbb{R}$ to be 'smooth' but otherwise flexible. But how do we work with arbitrary smooth functions computationally?
The fundamental challenge is that the space of smooth functions is infinite-dimensional. We cannot store or optimize over infinitely many parameters. The solution: represent $f_j$ using a finite basis expansion:
$$f_j(x) = \sum_{k=1}^{K_j} \beta_{jk} \phi_{jk}(x)$$
where $\phi_{j1}, \ldots, \phi_{jK_j}$ are fixed, known basis functions, $\beta_{j1}, \ldots, \beta_{jK_j}$ are coefficients to be estimated, and $K_j$ is the basis dimension chosen for feature $j$.
Key insight:
Once we fix the basis functions, estimating $f_j$ reduces to estimating $K_j$ real numbers—a finite-dimensional problem. The choice of basis determines what functions can be represented and how well.
Good basis functions should be flexible enough to approximate a wide range of smooth shapes, numerically stable to work with, and cheap to evaluate; local support, so that each coefficient influences only part of the curve, is a further advantage.
GAMs transform the problem of 'estimate arbitrary smooth functions' into 'estimate coefficients in a carefully chosen basis.' The art of GAM modeling lies in choosing bases that balance flexibility against computational tractability and interpretability.
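To make this concrete, here is a minimal NumPy sketch of the basis-expansion idea: fix a handful of basis functions, build the design matrix, and estimate the coefficients by ordinary least squares. The particular basis functions and simulated data below are illustrative choices, not a recommendation.

```python
import numpy as np

# Simulated data: a noisy nonlinear signal (illustrative only).
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 200))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

def design_matrix(x):
    """Evaluate a small, fixed set of basis functions phi_k at each x."""
    return np.column_stack([
        np.ones_like(x),         # phi_1: intercept
        x,                       # phi_2: linear term
        np.sin(2 * np.pi * x),   # phi_3, phi_4: two fixed smooth shapes
        np.cos(2 * np.pi * x),
    ])

Phi = design_matrix(x)                            # n x K design matrix
beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)    # K coefficients: a finite-dimensional problem
f_hat = Phi @ beta                                # fitted smooth component
```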
The most intuitive basis is polynomials. Any continuous function on a closed, bounded interval can be approximated arbitrarily well by polynomials (the Weierstrass approximation theorem). A polynomial basis of degree $d$ uses:
$$\phi_0(x) = 1, \quad \phi_1(x) = x, \quad \phi_2(x) = x^2, \quad \ldots, \quad \phi_d(x) = x^d$$
The component function becomes: $$f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_d x^d$$
High-degree polynomial interpolation can exhibit extreme oscillations between data points, especially near the boundaries (Runge's phenomenon). This makes global polynomials unsuitable for flexible curve fitting beyond degree 3–4 in most applications.
Orthogonal polynomials:
To address numerical instability, we can use orthogonal polynomial bases (Legendre, Chebyshev, Hermite) designed so that:
$$\int \phi_j(x) \phi_k(x) w(x) dx = 0 \quad \text{for } j \neq k$$
where $w(x)$ is a weight function. Orthogonality improves numerical conditioning and allows independent coefficient estimation.
However, orthogonal polynomials still suffer from the fundamental limitations of global polynomial fits—they remain poor for capturing local behavior and can oscillate wildly at high degrees.
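The conditioning difference is easy to see numerically. The following sketch (using NumPy's polynomial module; the degree and grid are arbitrary choices) compares the condition number of a raw power-basis design matrix with a Chebyshev one:

```python
import numpy as np
from numpy.polynomial import chebyshev

x = np.linspace(0.0, 1.0, 200)
deg = 12

# Raw power basis 1, x, ..., x^deg: columns are nearly collinear on [0, 1].
V_power = np.vander(x, deg + 1, increasing=True)

# Chebyshev basis evaluated on x rescaled to [-1, 1]: much better conditioned.
V_cheb = chebyshev.chebvander(2.0 * x - 1.0, deg)

print(f"power basis cond:     {np.linalg.cond(V_power):.2e}")
print(f"Chebyshev basis cond: {np.linalg.cond(V_cheb):.2e}")
```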
Splines solve the problems of polynomial bases by using piecewise polynomials joined smoothly at designated points called knots.
Definition: A spline of degree $d$ with knots $\xi_1 < \xi_2 < \cdots < \xi_K$ is a function that is a polynomial of degree $d$ on each interval between adjacent knots and has $d - 1$ continuous derivatives at every knot.
The most common choice is cubic splines ($d = 3$), which have continuous first and second derivatives—producing visually smooth curves.
Why splines work:
Splines have local support: changing the curve in one region doesn't affect distant regions. This makes them numerically stable and able to capture local features without distorting the fit elsewhere.
The space of cubic splines with $K$ interior knots has dimension $(K + 4)$. One can show this by counting: the $K + 1$ polynomial segments contribute $4(K + 1)$ coefficients, and the three continuity conditions (value, first, and second derivative) at each of the $K$ knots remove $3K$ of them, leaving $K + 4$.
For a spline of degree $d$ with $K$ interior knots: dimension = $(K + d + 1)$. For cubic splines ($d = 3$): dimension = $K + 4$. This gives us $(K + 4)$ basis functions to estimate $(K + 4)$ coefficients.
While many bases can represent splines, B-splines (Basis splines) are the industry standard. They possess remarkable computational properties that make them ideal for GAM implementation.
Definition: B-splines are defined recursively over a non-decreasing knot sequence $t_0 \leq t_1 \leq \cdots \leq t_m$ (in practice, the interior knots augmented with boundary knots repeated $d + 1$ times each for a degree-$d$ basis):
Degree 0 (step functions): $$B_{k,0}(x) = \begin{cases} 1 & \text{if } t_k \leq x < t_{k+1} \\ 0 & \text{otherwise} \end{cases}$$
Degree d (recursive): $$B_{k,d}(x) = \frac{x - t_k}{t_{k+d} - t_k} B_{k,d-1}(x) + \frac{t_{k+d+1} - x}{t_{k+d+1} - t_{k+1}} B_{k+1,d-1}(x)$$
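The recursion translates directly into code. The sketch below is an unoptimized transcription of the Cox-de Boor formulas above (in practice one would use a library routine such as SciPy's B-spline tools); the knot locations are arbitrary illustrative values.

```python
import numpy as np

def bspline_basis(x, k, d, t):
    """Evaluate B_{k,d}(x) via the recursion, with the convention 0/0 = 0."""
    if d == 0:
        return np.where((t[k] <= x) & (x < t[k + 1]), 1.0, 0.0)
    left_den = t[k + d] - t[k]
    right_den = t[k + d + 1] - t[k + 1]
    left = 0.0 if left_den == 0 else (x - t[k]) / left_den * bspline_basis(x, k, d - 1, t)
    right = 0.0 if right_den == 0 else (t[k + d + 1] - x) / right_den * bspline_basis(x, k + 1, d - 1, t)
    return left + right

# Cubic basis on [0, 1]: K = 4 interior knots, boundary knots repeated d + 1 times.
d = 3
interior = np.linspace(0.2, 0.8, 4)
t = np.concatenate([[0.0] * (d + 1), interior, [1.0] * (d + 1)])
n_basis = len(t) - d - 1                      # = K + d + 1 = 8 basis functions

x = np.linspace(0.0, 1.0, 500)
B = np.column_stack([bspline_basis(x, k, d, t) for k in range(n_basis)])
assert np.allclose(B[:-1].sum(axis=1), 1.0)   # B-splines sum to 1 inside the knot range
```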
Practical B-spline representation:
With $K$ interior knots and degree $d$ (typically $d = 3$ for cubic), we have $(K + d + 1)$ B-spline basis functions. Each spans $(d+1)$ knot intervals.
The component function is: $$f(x) = \sum_{k=1}^{K+d+1} \beta_k B_{k,d}(x)$$
Because each $B_{k,d}(x)$ is nonzero for at most $(d+1)$ knot intervals, evaluation at any point requires only $(d+1)$ terms—constant time regardless of $K$!
B-splines are the default in virtually all modern GAM implementations (mgcv in R, pyGAM in Python). Their local support creates banded design matrices, enabling efficient computation. Their numerical stability avoids the conditioning disasters of power bases.
Regular cubic splines can exhibit erratic behavior at the boundaries—extrapolating wildly beyond the data range. Natural cubic splines address this by imposing linearity beyond the boundary knots.
Definition: A natural cubic spline is linear below the leftmost (boundary) knot and above the rightmost (boundary) knot, while remaining a cubic spline in between.
Mathematically: linearity beyond the boundary knots is equivalent to requiring the second (and third) derivative to vanish there, so $f''(x) = 0$ at and beyond both boundary knots and the fit extrapolates as a straight line.
Relative to a regular cubic spline with the same interior knots, these boundary constraints remove two degrees of freedom: a natural cubic spline with $K$ interior knots (plus boundary knots at the ends of the data range) has dimension $(K + 2)$ instead of $(K + 4)$.
| Property | Regular Cubic | Natural Cubic |
|---|---|---|
| Dimension ($K$ interior knots) | $K + 4$ | $K + 2$ |
| Boundary behavior | Cubic (can oscillate) | Linear (stable) |
| Extrapolation | Unpredictable | Continues linearly |
| Degrees of freedom | Higher | Lower (2 fewer) |
| Use case | Interior fitting | When extrapolation matters |
Why natural splines are preferred:
In GAM applications, we rarely have strong information about behavior beyond the data range. Natural splines' linear extrapolation is conservative—it doesn't make extreme predictions in unobserved regions.
The natural spline basis:
Natural cubic splines with $K$ interior knots can be represented using $(K + 2)$ basis functions. Writing $\xi_1 < \cdots < \xi_{K+2}$ for the full knot sequence (interior plus boundary knots), one common construction is
$$N_1(x) = 1, \qquad N_2(x) = x, \qquad N_{k+2}(x) = d_k(x) - d_{K+1}(x), \quad k = 1, \ldots, K,$$
where $d_k(x) = \frac{(x - \xi_k)_+^3 - (x - \xi_{K+2})_+^3}{\xi_{K+2} - \xi_k}$ and $(z)_+ = \max(0, z)$.
This construction explicitly satisfies the boundary constraints while providing a numerically stable basis.
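In code, the construction looks like the following sketch. Here `knots` holds the full knot sequence (the interior knots plus the two boundary knots), so the basis matrix has one column per knot, i.e. $K + 2$ columns for $K$ interior knots; the knot values are arbitrary illustrative choices.

```python
import numpy as np

def natural_spline_basis(x, knots):
    """Natural cubic spline basis via the d_k construction above (knots sorted, boundaries included)."""
    knots = np.asarray(knots, dtype=float)
    m = len(knots)                              # total number of knots = K + 2

    def d(k):                                   # d_k(x), 0-based index into the knot sequence
        num = np.clip(x - knots[k], 0.0, None) ** 3 - np.clip(x - knots[-1], 0.0, None) ** 3
        return num / (knots[-1] - knots[k])

    cols = [np.ones_like(x), x]                 # N_1(x) = 1, N_2(x) = x
    cols += [d(k) - d(m - 2) for k in range(m - 2)]   # N_{k+2}(x) = d_k(x) - d_{K+1}(x)
    return np.column_stack(cols)

knots = [0.05, 0.3, 0.5, 0.7, 0.95]             # 3 interior knots + 2 boundary knots
x = np.linspace(0.0, 1.0, 200)
N = natural_spline_basis(x, knots)              # 200 x 5 matrix: K + 2 = 5 basis functions
```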
B-splines and natural splines require specifying knot locations—a design choice that can affect results. Thin plate splines (TPS) elegantly avoid this by being defined through an optimization problem rather than explicit knots.
The thin plate spline problem:
Minimize: $$\sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \int (f''(x))^2 dx$$
The first term is squared error loss. The second term penalizes the integrated squared second derivative—a measure of 'wiggliness.' The parameter $\lambda \geq 0$ controls the smoothness-fit trade-off.
The remarkable result:
Despite being defined over an infinite-dimensional function space, the minimizer is a natural cubic spline with knots at the data points, a classical result computed efficiently by Reinsch's algorithm:
$$f(x) = \sum_{i=1}^n c_i \phi(|x - x_i|) + d_0 + d_1 x$$
where $\phi(r) = r^3$ for univariate TPS (or $\phi(r) = r^2 \log(r)$ in 2D) and the coefficients $(c_i, d_0, d_1)$ are determined by the penalized least squares problem.
Implications:
Full thin plate splines with $n$ knots become computationally expensive for large $n$. Thin plate regression splines (TPRS) use a low-rank approximation, keeping only the $K$ most important basis functions. This gives the automatic knot-placement property while maintaining computational tractability.
Basis functions provide the capacity for flexible fitting; smoothing penalties control how much of this capacity we actually use.
The penalized least squares framework:
Given basis expansion $f(x) = \sum_k \beta_k \phi_k(x) = \boldsymbol{\phi}(x)^\top \boldsymbol{\beta}$, we minimize:
$$\sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \boldsymbol{\beta}^\top \mathbf{S} \boldsymbol{\beta}$$
where $\mathbf{S}$ is a known, positive semi-definite penalty matrix determined by the basis and the chosen penalty, and $\lambda \geq 0$ is the smoothing parameter that controls the strength of the penalty.
Common penalty structures:
Second derivative penalty (wiggliness): $$\int (f''(x))^2\, dx = \boldsymbol{\beta}^\top \left( \int \boldsymbol{\phi}''(x)\, \boldsymbol{\phi}''(x)^\top\, dx \right) \boldsymbol{\beta} = \boldsymbol{\beta}^\top \mathbf{S} \boldsymbol{\beta}$$
Here $\mathbf{S}_{jk} = \int \phi_j''(x) \phi_k''(x) dx$. This penalizes curvature.
First derivative penalty: $$\int (f'(x))^2 dx = \boldsymbol{\beta}^\top \mathbf{S}_1 \boldsymbol{\beta}$$
This penalizes deviation from a constant function.
P-spline penalty (difference penalty): $$\sum_k (\Delta^m \beta_k)^2$$
where $\Delta$ is the difference operator. For $m = 2$: $\Delta^2 \beta_k = \beta_{k+2} - 2\beta_{k+1} + \beta_k$.
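The difference penalty is particularly easy to build. A minimal NumPy sketch ($K$ chosen arbitrarily) constructs the second-order difference matrix and verifies that the quadratic form matches the summed squared coefficient differences:

```python
import numpy as np

K = 10                                      # number of basis coefficients (illustrative)
D2 = np.diff(np.eye(K), n=2, axis=0)        # second-order difference operator, shape (K-2, K)
S = D2.T @ D2                               # P-spline penalty matrix

beta = np.random.default_rng(0).normal(size=K)
quad_form = beta @ S @ beta                 # beta' S beta
direct = np.sum(np.diff(beta, n=2) ** 2)    # sum_k (Delta^2 beta_k)^2
assert np.isclose(quad_form, direct)
```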
λ = 0: No penalty → the fit uses all available basis flexibility (with knots at every data point, this means interpolation and overfitting). λ → ∞: Maximum penalty → the fit collapses to the penalty's null space (a linear fit for the second-derivative penalty, i.e., underfitting). Optimal λ: Balances bias and variance; selected in practice by generalized cross-validation (GCV) or restricted maximum likelihood (REML).
Effective degrees of freedom:
The smoothing parameter $\lambda$ controls model complexity, but how do we quantify it? The effective degrees of freedom (EDF) provides a scalar measure:
$$\text{EDF} = \text{trace}(\mathbf{H}_\lambda)$$
where $\mathbf{H}_\lambda = \mathbf{\Phi}(\mathbf{\Phi}^\top \mathbf{\Phi} + \lambda \mathbf{S})^{-1} \mathbf{\Phi}^\top$ is the smoothing matrix.
EDF ranges from the dimension of the penalty's null space (reached as $\lambda \to \infty$; for a second-derivative penalty this is the linear fit) up to $K$ (the unpenalized fit with $K$ basis functions). It provides an interpretable scalar measure of model flexibility.
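The following sketch ties the pieces together: a penalized least squares fit with a difference penalty, and the EDF computed as the trace of the smoothing matrix. The design matrix here is a simple polynomial stand-in for a spline basis, purely for illustration.

```python
import numpy as np

def penalized_fit(Phi, y, S, lam):
    """Return the penalized LS coefficients and the effective degrees of freedom."""
    A = Phi.T @ Phi + lam * S
    beta = np.linalg.solve(A, Phi.T @ y)
    H = Phi @ np.linalg.solve(A, Phi.T)     # smoothing matrix H_lambda
    return beta, np.trace(H)

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 1.0, 100))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

Phi = np.vander(x, 6, increasing=True)      # stand-in basis (a spline basis would be used in practice)
D2 = np.diff(np.eye(Phi.shape[1]), n=2, axis=0)
S = D2.T @ D2

for lam in (1e-6, 1e-2, 1e2, 1e6):
    _, edf = penalized_fit(Phi, y, S, lam)
    print(f"lambda = {lam:g}: EDF = {edf:.2f}")   # EDF shrinks toward the penalty null space
```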
An alternative approach to component function estimation is kernel smoothing (also called local averaging). Instead of fitting a global basis expansion, we estimate $f(x)$ using nearby observations weighted by a kernel function.
The Nadaraya-Watson estimator:
$$\hat{f}(x) = \frac{\sum_{i=1}^n K_h(x - x_i) y_i}{\sum_{i=1}^n K_h(x - x_i)}$$
where $K_h(u) = \frac{1}{h} K\!\left(\frac{u}{h}\right)$ is a kernel $K$ rescaled by the bandwidth $h > 0$; the bandwidth controls how quickly the weights decay with distance from $x$.
Common kernel functions:
| Kernel | Formula $K(u)$ | Properties |
|---|---|---|
| Gaussian | $\frac{1}{\sqrt{2\pi}} e^{-u^2/2}$ | Infinite support, very smooth |
| Epanechnikov | $\frac{3}{4}(1-u^2)_+$ | Optimal asymptotic efficiency |
| Uniform | $\tfrac{1}{2}\,\mathbf{1}_{\{\lvert u \rvert \leq 1\}}$ | Simple moving average; discontinuous weights |
| Tricube | $\tfrac{70}{81}\,\bigl(1 - \lvert u \rvert^3\bigr)^3_+$ | Compact support; used in LOESS |
The choice of kernel matters less than the choice of bandwidth—all reasonable kernels yield similar results when $h$ is chosen well.
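A minimal NumPy implementation of the Nadaraya-Watson estimator with a Gaussian kernel (the data and bandwidth below are illustrative):

```python
import numpy as np

def nadaraya_watson(x_query, x, y, h):
    """Kernel-weighted local average of y at each query point (Gaussian kernel, bandwidth h)."""
    u = (np.asarray(x_query)[:, None] - x[None, :]) / h   # scaled pairwise distances
    w = np.exp(-0.5 * u ** 2)                              # kernel weights (constants cancel in the ratio)
    return (w * y).sum(axis=1) / w.sum(axis=1)

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0.0, 1.0, 150))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

grid = np.linspace(0.0, 1.0, 100)
f_hat = nadaraya_watson(grid, x, y, h=0.05)
```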
Local polynomial regression extends kernel smoothing by fitting a low-degree polynomial (not just a constant) at each query point. Local linear (degree 1) regression automatically corrects the boundary bias of the Nadaraya-Watson estimator. Local quadratic (degree 2) better captures curvature near peaks and valleys. LOESS/LOWESS combines local polynomials with robust reweighting to downweight outliers.
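For comparison, a local linear fit solves a small weighted least squares problem at every query point; the intercept of the local line is the fitted value. This is a bare-bones sketch (Gaussian weights, illustrative bandwidth, no robustness iterations), not a full LOESS implementation:

```python
import numpy as np

def local_linear(x_query, x, y, h):
    """Local linear regression: weighted LS fit of a line centered at each query point."""
    est = np.empty(len(x_query))
    for i, x0 in enumerate(x_query):
        w = np.exp(-0.5 * ((x - x0) / h) ** 2)        # Gaussian kernel weights centered at x0
        X = np.column_stack([np.ones_like(x), x - x0])
        XtW = X.T * w                                 # apply the weights
        beta = np.linalg.solve(XtW @ X, XtW @ y)      # weighted normal equations
        est[i] = beta[0]                              # intercept = fitted value at x0
    return est

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0.0, 1.0, 150))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)
f_ll = local_linear(np.linspace(0.0, 1.0, 100), x, y, h=0.05)
```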
Each representation has strengths and weaknesses. The choice depends on the application context:
| Method | Computation | Smoothness Control | Knots Required | Best For |
|---|---|---|---|---|
| B-splines | $O(nK)$ | Penalty matrix + λ | Yes | Standard GAM fitting |
| Natural splines | $O(nK)$ | Penalty matrix + λ | Yes | When extrapolation matters |
| Thin plate splines | $O(n^3)$ or $O(nK^2)$ | Integrated penalty | No (automatic) | Avoiding knot selection |
| P-splines | $O(nK)$ | Difference penalty | Yes (many, evenly spaced) | Simple implementation |
| Kernel smoothers | $O(n^2)$ | Bandwidth h | No | Exploratory analysis |
| Local polynomials | $O(n^2)$ | Bandwidth h | No | Boundary-corrected smoothing |
Practical recommendations:
Default choice: B-splines or thin plate regression splines (TPRS) with smoothness penalties. This is what mgcv::gam() in R uses by default.
Low-dimensional problems: Any method works; kernel smoothers are intuitive for exploration.
Large n: Use basis expansions with moderate $K$ (10–40 basis functions). Kernel methods become slow.
Automatic smoothness selection: TPRS or P-splines with GCV/REML for $\lambda$.
Interpretability emphasis: Natural splines with explicit knots at quantiles.
The R package mgcv (by Simon Wood) represents the state-of-the-art for GAM fitting. It uses thin plate regression splines by default, estimates λ via REML, and handles everything automatically. Understanding the underlying methods helps you diagnose issues and customize settings.
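On the Python side, pyGAM offers a similar workflow. A minimal sketch is shown below; the call pattern (`LinearGAM`, `s()` terms, `gridsearch` for the smoothing parameters) follows pyGAM's documented interface, but the details should be checked against the version you have installed.

```python
import numpy as np
from pygam import LinearGAM, s   # pip install pygam

rng = np.random.default_rng(3)
X = rng.uniform(0.0, 1.0, size=(500, 2))
y = np.sin(2 * np.pi * X[:, 0]) + (X[:, 1] - 0.5) ** 2 + rng.normal(scale=0.2, size=500)

# One penalized spline smooth per feature; gridsearch selects the smoothing parameters.
gam = LinearGAM(s(0) + s(1)).gridsearch(X, y)
gam.summary()                    # per-term diagnostics, including effective degrees of freedom
y_new = gam.predict(X[:10])      # predictions on new inputs
```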
Penalized smoothing is a form of regularization. The mathematical framework connects directly to concepts from ridge regression and reproducing kernel Hilbert spaces (RKHS).
The penalized problem: $$\min_{\boldsymbol{\beta}} \lVert \mathbf{y} - \mathbf{\Phi} \boldsymbol{\beta} \rVert^2 + \lambda \boldsymbol{\beta}^\top \mathbf{S} \boldsymbol{\beta}$$
has the closed-form solution: $$\hat{\boldsymbol{\beta}} = (\mathbf{\Phi}^\top \mathbf{\Phi} + \lambda \mathbf{S})^{-1} \mathbf{\Phi}^\top \mathbf{y}$$
This is exactly generalized ridge regression with penalty matrix $\mathbf{S}$ instead of $\mathbf{I}$.
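A quick numerical check of this equivalence compares the closed form with $\mathbf{S} = \mathbf{I}$ against scikit-learn's ridge regression (a random design matrix, purely for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(4)
Phi = rng.normal(size=(50, 6))
y = rng.normal(size=50)
lam = 2.0

# Closed-form penalized solution with the identity penalty matrix.
beta_closed = np.linalg.solve(Phi.T @ Phi + lam * np.eye(6), Phi.T @ y)

# Ordinary ridge regression (no intercept) solves exactly the same problem.
beta_ridge = Ridge(alpha=lam, fit_intercept=False).fit(Phi, y).coef_
assert np.allclose(beta_closed, beta_ridge)
```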
The RKHS perspective:
Each penalty matrix $\mathbf{S}$ corresponds to a reproducing kernel Hilbert space $\mathcal{H}$ with squared norm $\lVert f \rVert_{\mathcal{H}}^2 = \boldsymbol{\beta}^\top \mathbf{S} \boldsymbol{\beta}$.
The representer theorem guarantees that the solution to: $$\min_{f \in \mathcal{H}} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \lVert f \rVert_{\mathcal{H}}^2$$
lies in the span of kernel evaluations at the data points. This is why spline solutions have knots at data points.
Bayesian interpretation:
The penalty $\lambda \lVert f \rVert_{\mathcal{H}}^2$ is equivalent to placing a Gaussian process prior on $f$ with covariance kernel corresponding to $\mathcal{H}$. Under Gaussian errors, the penalized estimator is the posterior mean (and mode).
Penalized splines, kernel smoothing, Gaussian processes, and regularized regression are mathematically unified. They're different views of the same underlying optimization problem. Understanding one deeply illuminates the others.
We have explored the various ways to represent and estimate the smooth component functions that form the building blocks of GAMs.
What's next:
With component functions understood, we turn to the computational challenge: how do we estimate all $p$ component functions simultaneously? The next page explores fitting algorithms—including the celebrated backfitting algorithm and penalized likelihood methods that make GAM estimation practical.
You now understand the various representations for smooth component functions in GAMs, from spline bases to kernel smoothers. These are the building blocks from which flexible, interpretable models are constructed. Next, we'll learn how to fit these models efficiently.