In the additive model $f(\mathbf{x}) = \alpha + \sum_{j=1}^p f_j(x_j)$, each component function $f_j$ captures the potentially nonlinear relationship between feature $j$ and the response. But what exactly is a component function, and how do we represent it?
This question lies at the heart of GAM methodology. The choice of component function representation determines how flexible the fitted curves can be, how costly they are to compute, and how easy the resulting model is to interpret.
By the end of this page, you will understand multiple representations for smooth functions—splines, local polynomials, and kernels—their mathematical foundations, practical trade-offs, and the central role of basis expansions in modern GAM implementations.
We want each $f_j: \mathbb{R} \to \mathbb{R}$ to be 'smooth' but otherwise flexible. But how do we work with arbitrary smooth functions computationally?
The fundamental challenge is that the space of smooth functions is infinite-dimensional. We cannot store or optimize over infinitely many parameters. The solution: represent $f_j$ using a finite basis expansion:
$$f_j(x) = \sum_{k=1}^{K_j} \beta_{jk} \phi_{jk}(x)$$
where $\phi_{j1}, \ldots, \phi_{jK_j}$ are fixed, known basis functions, $\beta_{j1}, \ldots, \beta_{jK_j}$ are coefficients to be estimated, and $K_j$ is the basis dimension chosen for feature $j$.
Key insight:
Once we fix the basis functions, estimating $f_j$ reduces to estimating $K_j$ real numbers—a finite-dimensional problem. The choice of basis determines what functions can be represented and how well.
Good basis functions should be flexible enough to approximate a wide range of smooth shapes, numerically stable to work with, and cheap to evaluate; local support, so that each coefficient influences only part of the curve, is a further advantage.
GAMs transform the problem of 'estimate arbitrary smooth functions' into 'estimate coefficients in a carefully chosen basis.' The art of GAM modeling lies in choosing bases that balance flexibility against computational tractability and interpretability.
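To make this concrete, here is a minimal NumPy sketch of the basis-expansion idea: fix a handful of basis functions, build the design matrix, and estimate the coefficients by ordinary least squares. The particular basis functions and simulated data below are illustrative choices, not a recommendation.

```python
import numpy as np

# Simulated data: a noisy nonlinear signal (illustrative only).
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 200))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

def design_matrix(x):
    """Evaluate a small, fixed set of basis functions phi_k at each x."""
    return np.column_stack([
        np.ones_like(x),         # phi_1: intercept
        x,                       # phi_2: linear term
        np.sin(2 * np.pi * x),   # phi_3, phi_4: two fixed smooth shapes
        np.cos(2 * np.pi * x),
    ])

Phi = design_matrix(x)                            # n x K design matrix
beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)    # K coefficients: a finite-dimensional problem
f_hat = Phi @ beta                                # fitted smooth component
```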
The most intuitive basis is polynomials. Any continuous function on a closed, bounded interval can be approximated arbitrarily well by polynomials (the Weierstrass approximation theorem). A polynomial basis of degree $d$ uses:
$$\phi_0(x) = 1, \quad \phi_1(x) = x, \quad \phi_2(x) = x^2, \quad \ldots, \quad \phi_d(x) = x^d$$
The component function becomes: $$f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_d x^d$$
High-degree polynomial interpolation can exhibit extreme oscillations between data points, especially near the boundaries (Runge's phenomenon). This makes global polynomials unsuitable for flexible curve fitting beyond degree 3–4 in most applications.
Orthogonal polynomials:
To address numerical instability, we can use orthogonal polynomial bases (Legendre, Chebyshev, Hermite) designed so that:
$$\int \phi_j(x) \phi_k(x) w(x) dx = 0 \quad \text{for } j \neq k$$
where $w(x)$ is a weight function. Orthogonality improves numerical conditioning and allows independent coefficient estimation.
However, orthogonal polynomials still suffer from the fundamental limitations of global polynomial fits—they remain poor for capturing local behavior and can oscillate wildly at high degrees.
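The conditioning difference is easy to see numerically. The following sketch (using NumPy's polynomial module; the degree and grid are arbitrary choices) compares the condition number of a raw power-basis design matrix with a Chebyshev one:

```python
import numpy as np
from numpy.polynomial import chebyshev

x = np.linspace(0.0, 1.0, 200)
deg = 12

# Raw power basis 1, x, ..., x^deg: columns are nearly collinear on [0, 1].
V_power = np.vander(x, deg + 1, increasing=True)

# Chebyshev basis evaluated on x rescaled to [-1, 1]: much better conditioned.
V_cheb = chebyshev.chebvander(2.0 * x - 1.0, deg)

print(f"power basis cond:     {np.linalg.cond(V_power):.2e}")
print(f"Chebyshev basis cond: {np.linalg.cond(V_cheb):.2e}")
```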
Splines solve the problems of polynomial bases by using piecewise polynomials joined smoothly at designated points called knots.
Definition: A spline of degree $d$ with knots $\xi_1 < \xi_2 < \cdots < \xi_K$ is a function that is a polynomial of degree $d$ on each interval between adjacent knots and has $d - 1$ continuous derivatives at every knot.
The most common choice is cubic splines ($d = 3$), which have continuous first and second derivatives—producing visually smooth curves.
Why splines work:
Splines have local support: changing the curve in one region doesn't affect distant regions. This makes them numerically stable and able to capture local features without distorting the fit elsewhere.
The space of cubic splines with $K$ interior knots has dimension $(K + 4)$. One can show this by counting: the $K + 1$ polynomial segments contribute $4(K + 1)$ coefficients, and the three continuity conditions (value, first, and second derivative) at each of the $K$ knots remove $3K$ of them, leaving $K + 4$.
For a spline of degree $d$ with $K$ interior knots: dimension = $(K + d + 1)$. For cubic splines ($d = 3$): dimension = $K + 4$. This gives us $(K + 4)$ basis functions to estimate $(K + 4)$ coefficients.
While many bases can represent splines, B-splines (Basis splines) are the industry standard. They possess remarkable computational properties that make them ideal for GAM implementation.
Definition: B-splines are defined recursively over a non-decreasing knot sequence $t_0 \leq t_1 \leq \cdots \leq t_m$ (in practice, the interior knots augmented with boundary knots repeated $d + 1$ times each for a degree-$d$ basis):
Degree 0 (step functions): $$B_{k,0}(x) = \begin{cases} 1 & \text{if } t_k \leq x < t_{k+1} \\ 0 & \text{otherwise} \end{cases}$$
Degree d (recursive): $$B_{k,d}(x) = \frac{x - t_k}{t_{k+d} - t_k} B_{k,d-1}(x) + \frac{t_{k+d+1} - x}{t_{k+d+1} - t_{k+1}} B_{k+1,d-1}(x)$$
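The recursion translates directly into code. The sketch below is an unoptimized transcription of the Cox-de Boor formulas above (in practice one would use a library routine such as SciPy's B-spline tools); the knot locations are arbitrary illustrative values.

```python
import numpy as np

def bspline_basis(x, k, d, t):
    """Evaluate B_{k,d}(x) via the recursion, with the convention 0/0 = 0."""
    if d == 0:
        return np.where((t[k] <= x) & (x < t[k + 1]), 1.0, 0.0)
    left_den = t[k + d] - t[k]
    right_den = t[k + d + 1] - t[k + 1]
    left = 0.0 if left_den == 0 else (x - t[k]) / left_den * bspline_basis(x, k, d - 1, t)
    right = 0.0 if right_den == 0 else (t[k + d + 1] - x) / right_den * bspline_basis(x, k + 1, d - 1, t)
    return left + right

# Cubic basis on [0, 1]: K = 4 interior knots, boundary knots repeated d + 1 times.
d = 3
interior = np.linspace(0.2, 0.8, 4)
t = np.concatenate([[0.0] * (d + 1), interior, [1.0] * (d + 1)])
n_basis = len(t) - d - 1                      # = K + d + 1 = 8 basis functions

x = np.linspace(0.0, 1.0, 500)
B = np.column_stack([bspline_basis(x, k, d, t) for k in range(n_basis)])
assert np.allclose(B[:-1].sum(axis=1), 1.0)   # B-splines sum to 1 inside the knot range
```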
Practical B-spline representation:
With $K$ interior knots and degree $d$ (typically $d = 3$ for cubic), we have $(K + d + 1)$ B-spline basis functions. Each spans $(d+1)$ knot intervals.
The component function is: $$f(x) = \sum_{k=1}^{K+d+1} \beta_k B_{k,d}(x)$$
Because each $B_{k,d}(x)$ is nonzero for at most $(d+1)$ knot intervals, evaluation at any point requires only $(d+1)$ terms—constant time regardless of $K$!
B-splines are the default in virtually all modern GAM implementations (mgcv in R, pyGAM in Python). Their local support creates banded design matrices, enabling efficient computation. Their numerical stability avoids the conditioning disasters of power bases.
Regular cubic splines can exhibit erratic behavior at the boundaries—extrapolating wildly beyond the data range. Natural cubic splines address this by imposing linearity beyond the boundary knots.
Definition: A natural cubic spline is linear below the leftmost (boundary) knot and above the rightmost (boundary) knot, while remaining a cubic spline in between.
Mathematically: linearity beyond the boundary knots is equivalent to requiring the second (and third) derivative to vanish there, so $f''(x) = 0$ at and beyond both boundary knots and the fit extrapolates as a straight line.
Relative to a regular cubic spline with the same interior knots, these boundary constraints remove two degrees of freedom: a natural cubic spline with $K$ interior knots (plus boundary knots at the ends of the data range) has dimension $(K + 2)$ instead of $(K + 4)$.
| Property | Regular Cubic | Natural Cubic |
|---|---|---|
| Dimension ($K$ interior knots) | $K + 4$ | $K + 2$ |
| Boundary behavior | Cubic (can oscillate) | Linear (stable) |
| Extrapolation | Unpredictable | Continues linearly |
| Degrees of freedom | Higher | Lower (2 fewer) |
| Use case | Interior fitting | When extrapolation matters |
Why natural splines are preferred:
In GAM applications, we rarely have strong information about behavior beyond the data range. Natural splines' linear extrapolation is conservative—it doesn't make extreme predictions in unobserved regions.
The natural spline basis:
Natural cubic splines with $K$ interior knots can be represented using $(K + 2)$ basis functions. Writing $\xi_1 < \cdots < \xi_{K+2}$ for the full knot sequence (interior plus boundary knots), one common construction is
$$N_1(x) = 1, \qquad N_2(x) = x, \qquad N_{k+2}(x) = d_k(x) - d_{K+1}(x), \quad k = 1, \ldots, K,$$
where $d_k(x) = \frac{(x - \xi_k)_+^3 - (x - \xi_{K+2})_+^3}{\xi_{K+2} - \xi_k}$ and $(z)_+ = \max(0, z)$.
This construction explicitly satisfies the boundary constraints while providing a numerically stable basis.
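In code, the construction looks like the following sketch. Here `knots` holds the full knot sequence (the interior knots plus the two boundary knots), so the basis matrix has one column per knot, i.e. $K + 2$ columns for $K$ interior knots; the knot values are arbitrary illustrative choices.

```python
import numpy as np

def natural_spline_basis(x, knots):
    """Natural cubic spline basis via the d_k construction above (knots sorted, boundaries included)."""
    knots = np.asarray(knots, dtype=float)
    m = len(knots)                              # total number of knots = K + 2

    def d(k):                                   # d_k(x), 0-based index into the knot sequence
        num = np.clip(x - knots[k], 0.0, None) ** 3 - np.clip(x - knots[-1], 0.0, None) ** 3
        return num / (knots[-1] - knots[k])

    cols = [np.ones_like(x), x]                 # N_1(x) = 1, N_2(x) = x
    cols += [d(k) - d(m - 2) for k in range(m - 2)]   # N_{k+2}(x) = d_k(x) - d_{K+1}(x)
    return np.column_stack(cols)

knots = [0.05, 0.3, 0.5, 0.7, 0.95]             # 3 interior knots + 2 boundary knots
x = np.linspace(0.0, 1.0, 200)
N = natural_spline_basis(x, knots)              # 200 x 5 matrix: K + 2 = 5 basis functions
```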
B-splines and natural splines require specifying knot locations—a design choice that can affect results. Thin plate splines (TPS) elegantly avoid this by being defined through an optimization problem rather than explicit knots.
The thin plate spline problem:
Minimize: $$\sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \int (f''(x))^2 dx$$
The first term is squared error loss. The second term penalizes the integrated squared second derivative—a measure of 'wiggliness.' The parameter $\lambda \geq 0$ controls the smoothness-fit trade-off.
The remarkable result:
Despite being defined over an infinite-dimensional function space, the minimizer is a natural cubic spline with knots at the data points, a classical result computed efficiently by Reinsch's algorithm:
$$f(x) = \sum_{i=1}^n c_i \phi(|x - x_i|) + d_0 + d_1 x$$
where $\phi(r) = r^3$ for univariate TPS (or $\phi(r) = r^2 \log(r)$ in 2D) and the coefficients $(c_i, d_0, d_1)$ are determined by the penalized least squares problem.
Implications:
Full thin plate splines with $n$ knots become computationally expensive for large $n$. Thin plate regression splines (TPRS) use a low-rank approximation, keeping only the $K$ most important basis functions. This gives the automatic knot-placement property while maintaining computational tractability.
Basis functions provide the capacity for flexible fitting; smoothing penalties control how much of this capacity we actually use.
The penalized least squares framework:
Given basis expansion $f(x) = \sum_k \beta_k \phi_k(x) = \boldsymbol{\phi}(x)^\top \boldsymbol{\beta}$, we minimize:
$$\sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \boldsymbol{\beta}^\top \mathbf{S} \boldsymbol{\beta}$$
where $\mathbf{S}$ is a known, positive semi-definite penalty matrix determined by the basis and the chosen penalty, and $\lambda \geq 0$ is the smoothing parameter that controls the strength of the penalty.
Common penalty structures:
Second derivative penalty (wiggliness): $$\int (f''(x))^2\, dx = \boldsymbol{\beta}^\top \left( \int \boldsymbol{\phi}''(x)\, \boldsymbol{\phi}''(x)^\top\, dx \right) \boldsymbol{\beta} = \boldsymbol{\beta}^\top \mathbf{S} \boldsymbol{\beta}$$
Here $\mathbf{S}_{jk} = \int \phi_j''(x) \phi_k''(x) dx$. This penalizes curvature.
First derivative penalty: $$\int (f'(x))^2 dx = \boldsymbol{\beta}^\top \mathbf{S}_1 \boldsymbol{\beta}$$
This penalizes deviation from a constant function.
P-spline penalty (difference penalty): $$\sum_k (\Delta^m \beta_k)^2$$
where $\Delta$ is the difference operator. For $m = 2$: $\Delta^2 \beta_k = \beta_{k+2} - 2\beta_{k+1} + \beta_k$.
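The difference penalty is particularly easy to build. A minimal NumPy sketch ($K$ chosen arbitrarily) constructs the second-order difference matrix and verifies that the quadratic form matches the summed squared coefficient differences:

```python
import numpy as np

K = 10                                      # number of basis coefficients (illustrative)
D2 = np.diff(np.eye(K), n=2, axis=0)        # second-order difference operator, shape (K-2, K)
S = D2.T @ D2                               # P-spline penalty matrix

beta = np.random.default_rng(0).normal(size=K)
quad_form = beta @ S @ beta                 # beta' S beta
direct = np.sum(np.diff(beta, n=2) ** 2)    # sum_k (Delta^2 beta_k)^2
assert np.isclose(quad_form, direct)
```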
λ = 0: No penalty → the fit uses all available basis flexibility (with knots at every data point, this means interpolation and overfitting). λ → ∞: Maximum penalty → the fit collapses to the penalty's null space (a linear fit for the second-derivative penalty, i.e., underfitting). Optimal λ: Balances bias and variance; selected in practice by generalized cross-validation (GCV) or restricted maximum likelihood (REML).
Effective degrees of freedom:
The smoothing parameter $\lambda$ controls model complexity, but how do we quantify it? The effective degrees of freedom (EDF) provides a scalar measure:
$$\text{EDF} = \text{trace}(\mathbf{H}_\lambda)$$
where $\mathbf{H}_\lambda = \mathbf{\Phi}(\mathbf{\Phi}^\top \mathbf{\Phi} + \lambda \mathbf{S})^{-1} \mathbf{\Phi}^\top$ is the smoothing matrix.
EDF ranges from the dimension of the penalty's null space (reached as $\lambda \to \infty$; for a second-derivative penalty this is the linear fit) up to $K$ (the unpenalized fit with $K$ basis functions). It provides an interpretable scalar measure of model flexibility.
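The following sketch ties the pieces together: a penalized least squares fit with a difference penalty, and the EDF computed as the trace of the smoothing matrix. The design matrix here is a simple polynomial stand-in for a spline basis, purely for illustration.

```python
import numpy as np

def penalized_fit(Phi, y, S, lam):
    """Return the penalized LS coefficients and the effective degrees of freedom."""
    A = Phi.T @ Phi + lam * S
    beta = np.linalg.solve(A, Phi.T @ y)
    H = Phi @ np.linalg.solve(A, Phi.T)     # smoothing matrix H_lambda
    return beta, np.trace(H)

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 1.0, 100))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

Phi = np.vander(x, 6, increasing=True)      # stand-in basis (a spline basis would be used in practice)
D2 = np.diff(np.eye(Phi.shape[1]), n=2, axis=0)
S = D2.T @ D2

for lam in (1e-6, 1e-2, 1e2, 1e6):
    _, edf = penalized_fit(Phi, y, S, lam)
    print(f"lambda = {lam:g}: EDF = {edf:.2f}")   # EDF shrinks toward the penalty null space
```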
An alternative approach to component function estimation is kernel smoothing (also called local averaging). Instead of fitting a global basis expansion, we estimate $f(x)$ using nearby observations weighted by a kernel function.
The Nadaraya-Watson estimator:
$$\hat{f}(x) = \frac{\sum_{i=1}^n K_h(x - x_i) y_i}{\sum_{i=1}^n K_h(x - x_i)}$$
where $K_h(u) = \frac{1}{h} K\!\left(\frac{u}{h}\right)$ is a kernel $K$ rescaled by the bandwidth $h > 0$; the bandwidth controls how quickly the weights decay with distance from $x$.
Common kernel functions:
| Kernel | Formula $K(u)$ | Properties |
|---|---|---|
| Gaussian | $\frac{1}{\sqrt{2\pi}} e^{-u^2/2}$ | Infinite support, very smooth |
| Epanechnikov | $\frac{3}{4}(1-u^2)_+$ | Optimal asymptotic efficiency |
| Uniform | $\tfrac{1}{2}\,\mathbf{1}_{\{\lvert u \rvert \leq 1\}}$ | Simple moving average; discontinuous weights |
| Tricube | $\tfrac{70}{81}\,\bigl(1 - \lvert u \rvert^3\bigr)^3_+$ | Compact support; used in LOESS |
The choice of kernel matters less than the choice of bandwidth—all reasonable kernels yield similar results when $h$ is chosen well.
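A minimal NumPy implementation of the Nadaraya-Watson estimator with a Gaussian kernel (the data and bandwidth below are illustrative):

```python
import numpy as np

def nadaraya_watson(x_query, x, y, h):
    """Kernel-weighted local average of y at each query point (Gaussian kernel, bandwidth h)."""
    u = (np.asarray(x_query)[:, None] - x[None, :]) / h   # scaled pairwise distances
    w = np.exp(-0.5 * u ** 2)                              # kernel weights (constants cancel in the ratio)
    return (w * y).sum(axis=1) / w.sum(axis=1)

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0.0, 1.0, 150))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

grid = np.linspace(0.0, 1.0, 100)
f_hat = nadaraya_watson(grid, x, y, h=0.05)
```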
Local polynomial regression extends kernel smoothing by fitting a low-degree polynomial (not just a constant) at each query point. Local linear (degree 1) regression automatically corrects the boundary bias of the Nadaraya-Watson estimator. Local quadratic (degree 2) better captures curvature near peaks and valleys. LOESS/LOWESS combines local polynomials with robust reweighting to downweight outliers.
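For comparison, a local linear fit solves a small weighted least squares problem at every query point; the intercept of the local line is the fitted value. This is a bare-bones sketch (Gaussian weights, illustrative bandwidth, no robustness iterations), not a full LOESS implementation:

```python
import numpy as np

def local_linear(x_query, x, y, h):
    """Local linear regression: weighted LS fit of a line centered at each query point."""
    est = np.empty(len(x_query))
    for i, x0 in enumerate(x_query):
        w = np.exp(-0.5 * ((x - x0) / h) ** 2)        # Gaussian kernel weights centered at x0
        X = np.column_stack([np.ones_like(x), x - x0])
        XtW = X.T * w                                 # apply the weights
        beta = np.linalg.solve(XtW @ X, XtW @ y)      # weighted normal equations
        est[i] = beta[0]                              # intercept = fitted value at x0
    return est

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0.0, 1.0, 150))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)
f_ll = local_linear(np.linspace(0.0, 1.0, 100), x, y, h=0.05)
```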
Each representation has strengths and weaknesses. The choice depends on the application context:
| Method | Computation | Smoothness Control | Knots Required | Best For |
|---|---|---|---|---|
| B-splines | $O(nK)$ | Penalty matrix + λ | Yes | Standard GAM fitting |
| Natural splines | $O(nK)$ | Penalty matrix + λ | Yes | When extrapolation matters |
| Thin plate splines | $O(n^3)$ or $O(nK^2)$ | Integrated penalty | No (automatic) | Avoiding knot selection |
| P-splines | $O(nK)$ | Difference penalty | Yes (many, evenly spaced) | Simple implementation |
| Kernel smoothers | $O(n^2)$ | Bandwidth h | No | Exploratory analysis |
| Local polynomials | $O(n^2)$ | Bandwidth h | No | Boundary-corrected smoothing |
Practical recommendations:
Default choice: B-splines or thin plate regression splines (TPRS) with smoothness penalties. This is what mgcv::gam() in R uses by default.
Low-dimensional problems: Any method works; kernel smoothers are intuitive for exploration.
Large n: Use basis expansions with moderate $K$ (10–40 basis functions). Kernel methods become slow.
Automatic smoothness selection: TPRS or P-splines with GCV/REML for $\lambda$.
Interpretability emphasis: Natural splines with explicit knots at quantiles.
The R package mgcv (by Simon Wood) represents the state-of-the-art for GAM fitting. It uses thin plate regression splines by default, estimates λ via REML, and handles everything automatically. Understanding the underlying methods helps you diagnose issues and customize settings.
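On the Python side, pyGAM offers a similar workflow. A minimal sketch is shown below; the call pattern (`LinearGAM`, `s()` terms, `gridsearch` for the smoothing parameters) follows pyGAM's documented interface, but the details should be checked against the version you have installed.

```python
import numpy as np
from pygam import LinearGAM, s   # pip install pygam

rng = np.random.default_rng(3)
X = rng.uniform(0.0, 1.0, size=(500, 2))
y = np.sin(2 * np.pi * X[:, 0]) + (X[:, 1] - 0.5) ** 2 + rng.normal(scale=0.2, size=500)

# One penalized spline smooth per feature; gridsearch selects the smoothing parameters.
gam = LinearGAM(s(0) + s(1)).gridsearch(X, y)
gam.summary()                    # per-term diagnostics, including effective degrees of freedom
y_new = gam.predict(X[:10])      # predictions on new inputs
```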
Penalized smoothing is a form of regularization. The mathematical framework connects directly to concepts from ridge regression and reproducing kernel Hilbert spaces (RKHS).
The penalized problem: $$\min_{\boldsymbol{\beta}} \lVert \mathbf{y} - \mathbf{\Phi} \boldsymbol{\beta} \rVert^2 + \lambda \boldsymbol{\beta}^\top \mathbf{S} \boldsymbol{\beta}$$
has the closed-form solution: $$\hat{\boldsymbol{\beta}} = (\mathbf{\Phi}^\top \mathbf{\Phi} + \lambda \mathbf{S})^{-1} \mathbf{\Phi}^\top \mathbf{y}$$
This is exactly generalized ridge regression with penalty matrix $\mathbf{S}$ instead of $\mathbf{I}$.
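A quick numerical check of this equivalence compares the closed form with $\mathbf{S} = \mathbf{I}$ against scikit-learn's ridge regression (a random design matrix, purely for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(4)
Phi = rng.normal(size=(50, 6))
y = rng.normal(size=50)
lam = 2.0

# Closed-form penalized solution with the identity penalty matrix.
beta_closed = np.linalg.solve(Phi.T @ Phi + lam * np.eye(6), Phi.T @ y)

# Ordinary ridge regression (no intercept) solves exactly the same problem.
beta_ridge = Ridge(alpha=lam, fit_intercept=False).fit(Phi, y).coef_
assert np.allclose(beta_closed, beta_ridge)
```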
The RKHS perspective:
Each penalty matrix $\mathbf{S}$ corresponds to a reproducing kernel Hilbert space $\mathcal{H}$ with squared norm $\lVert f \rVert_{\mathcal{H}}^2 = \boldsymbol{\beta}^\top \mathbf{S} \boldsymbol{\beta}$.
The representer theorem guarantees that the solution to: $$\min_{f \in \mathcal{H}} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \lVert f \rVert_{\mathcal{H}}^2$$
lies in the span of kernel evaluations at the data points. This is why spline solutions have knots at data points.
Bayesian interpretation:
The penalty $\lambda \lVert f \rVert_{\mathcal{H}}^2$ is equivalent to placing a Gaussian process prior on $f$ with covariance kernel corresponding to $\mathcal{H}$. Under Gaussian errors, the penalized estimator is the posterior mean (and mode).
Penalized splines, kernel smoothing, Gaussian processes, and regularized regression are mathematically unified. They're different views of the same underlying optimization problem. Understanding one deeply illuminates the others.
We have explored the various ways to represent and estimate the smooth component functions that form the building blocks of GAMs.
What's next:
With component functions understood, we turn to the computational challenge: how do we estimate all $p$ component functions simultaneously? The next page explores fitting algorithms—including the celebrated backfitting algorithm and penalized likelihood methods that make GAM estimation practical.
You now understand the various representations for smooth component functions in GAMs, from spline bases to kernel smoothers. These are the building blocks from which flexible, interpretable models are constructed. Next, we'll learn how to fit these models efficiently.