We've established that a GAM decomposes the regression function into smooth component functions: $f(\mathbf{x}) = \alpha + \sum_{j=1}^p f_j(x_j)$. We've explored how to represent each $f_j$ using splines and other basis expansions. But a fundamental question remains: how do we actually estimate all these functions from data?
This is a challenging computational problem. We have $p$ unknown functions that must be estimated simultaneously, each potentially depending on the others through shared residuals. The algorithms that solve this problem are elegant, efficient, and deeply connected to the mathematical structure of additive models.
By the end of this page, you will understand the backfitting algorithm and its theoretical foundations, penalized iteratively reweighted least squares (PIRLS), smoothing parameter selection via GCV and REML, and the computational complexity of GAM fitting.
Consider the additive model with observations $(x_{i1}, \ldots, x_{ip}, y_i)$ for $i = 1, \ldots, n$:
$$y_i = \alpha + \sum_{j=1}^p f_j(x_{ij}) + \epsilon_i$$
where $\epsilon_i$ are independent errors with $E[\epsilon_i] = 0$ and $\text{Var}(\epsilon_i) = \sigma^2$.
In matrix form:
Let $\mathbf{f}_j = (f_j(x_{1j}), \ldots, f_j(x_{nj}))^\top$ be the vector of evaluations of $f_j$ at the observed values. The model becomes:
$$\mathbf{y} = \alpha \mathbf{1} + \sum_{j=1}^p \mathbf{f}_j + \boldsymbol{\epsilon}$$
We want to estimate $\alpha$ and each function $f_j$ subject to the centering constraint $\mathbf{1}^\top \mathbf{f}_j = 0$.
The simultaneous estimation challenge:
If we knew all functions except $f_1$, we could estimate $f_1$ by regressing the partial residual $\mathbf{y} - \alpha \mathbf{1} - \sum_{j \neq 1} \mathbf{f}_j$ on $\mathbf{x}_1$ using any univariate smoother.
But we don't know the other functions! This creates a chicken-and-egg problem: each function's estimate depends on all others.
The solution: Iterate. Start with initial guesses for all functions, then cycle through, updating each function given current estimates of the others. This is the backfitting algorithm.
The backfitting algorithm transforms a $p$-function estimation problem into a sequence of single-function problems. Each update uses a univariate smoother—the workhorse of nonparametric regression—applied to partial residuals.
The backfitting algorithm (Hastie & Tibshirani, 1990) is the classical method for fitting additive models. It's an iterative procedure that cycles through features, updating each function estimate in turn.
Algorithm: Backfitting
Initialize: $\hat{\alpha} = \bar{y} = \frac{1}{n}\sum_i y_i$ and $\hat{f}_j \equiv 0$ for all $j$.
Iterate until convergence:
For $j = 1, 2, \ldots, p$:
Compute partial residuals: $r_{ij} = y_i - \hat{\alpha} - \sum_{k \neq j} \hat{f}_k(x_{ik})$
Update $\hat{f}_j$ by smoothing $\mathbf{r}_j = (r_{1j}, \ldots, r_{nj})$ against $\mathbf{x}_j = (x_{1j}, \ldots, x_{nj})$: $$\hat{f}_j \leftarrow S_j(\mathbf{r}_j)$$ where $S_j$ is a univariate smoothing operator
Center: $\hat{f}_j \leftarrow \hat{f}_j - \frac{1}{n} \sum_i \hat{f}_j(x_{ij})$
Repeat until the functions converge (change less than tolerance).
Think of backfitting as 'cleaning up' each function's estimate using current knowledge of the others. The partial residual $r_{ij}$ represents what's left of $y_i$ after removing our current estimate of all other effects. Smoothing this residual against $x_j$ isolates the effect of feature $j$.
The smoothing operator $S_j$:
Any univariate smoother can serve as $S_j$: cubic smoothing splines, regression splines, local regression (loess), or kernel smoothers.
The choice of smoother determines the flexibility of $\hat{f}_j$. Typically, all smoothers share a common type but may have feature-specific smoothing parameters $\lambda_j$.
Matrix notation:
Let $\mathbf{S}_j$ be the $n \times n$ smoothing matrix such that $S_j(\mathbf{y}) = \mathbf{S}_j \mathbf{y}$. Then the backfitting update is:
$$\hat{\mathbf{f}}_j \leftarrow \mathbf{S}_j \left( \mathbf{y} - \hat{\alpha} \mathbf{1} - \sum_{k \neq j} \hat{\mathbf{f}}_k \right)$$
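To make the cycle concrete, here is a minimal backfitting sketch in Python. It is illustrative rather than production code: a Nadaraya-Watson Gaussian kernel smoother stands in for $S_j$ (any univariate smoother could be substituted), and the function names, bandwidths, and synthetic data are assumptions of the sketch, not part of any particular library.

```python
import numpy as np

def kernel_smooth(x, r, bandwidth):
    """Nadaraya-Watson smoother: fitted values of r regressed on x."""
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / bandwidth) ** 2)
    return (w @ r) / w.sum(axis=1)

def backfit(X, y, bandwidths, tol=1e-6, max_cycles=200):
    n, p = X.shape
    alpha = y.mean()              # intercept estimate
    f = np.zeros((n, p))          # initial guesses: f_j = 0
    for _ in range(max_cycles):
        f_old = f.copy()
        for j in range(p):
            # Partial residuals: remove the intercept and all other current fits.
            r = y - alpha - f[:, np.arange(p) != j].sum(axis=1)
            fj = kernel_smooth(X[:, j], r, bandwidths[j])
            f[:, j] = fj - fj.mean()   # enforce the centering constraint
        if np.max(np.abs(f - f_old)) < tol:
            break
    return alpha, f

# Toy data: y = sin(x1) + 0.5 * x2^2 + noise.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.2, 300)
alpha_hat, f_hat = backfit(X, y, bandwidths=[0.3, 0.3])
```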
Does backfitting always converge? If so, to what? These questions have elegant answers rooted in functional analysis.
Theorem (Convergence of Backfitting):
If the smoothing operators $S_j$ are symmetric ($\mathbf{S}_j = \mathbf{S}_j^\top$) and have eigenvalues in $[0, 1]$, then backfitting converges to the unique solution of the normal equations for the additive model.
Condition for unique convergence:
The solution is unique if and only if the only additive function that is smoothed to zero by all smoothers is the zero function. This holds when features are not perfectly collinear.
Rate of convergence:
Backfitting is a Gauss-Seidel iteration on a block linear system. The rate of convergence depends on the correlation between smooth functions of different features.
The critical insight:
When features are independent, partial residuals computed for feature $j$ are orthogonal to the effects of other features. Each function can be estimated in isolation. But with correlated features, the smooth functions 'leak' into each other's residuals, requiring iterations to disentangle.
| Feature Correlation | Convergence Speed | Typical Iterations |
|---|---|---|
| Independent ($\rho = 0$) | Immediate | 1 cycle |
| Weak ($|\rho| < 0.3$) | Fast | 3–10 cycles |
| Moderate ($0.3 \leq |\rho| < 0.7$) | Moderate | 10–50 cycles |
| Strong ($|\rho| \geq 0.7$) | Slow | 50–200 cycles |
| Near-collinear ($|\rho| \to 1$) | May fail | Regularization needed |
When smooth functions of different features are nearly linearly dependent (concurvity), backfitting becomes unstable. This is the GAM analog of multicollinearity in linear regression. Regularization or feature selection may be needed.
Backfitting can be understood as a fixed point iteration on the normal equations of the additive model.
The additive model normal equations:
Stacking the smooth functions, define $\mathbf{f} = (\mathbf{f}_1^\top, \ldots, \mathbf{f}_p^\top)^\top$ (a vector of length $np$). The fitted values are:
$$\hat{\mathbf{y}} = \alpha \mathbf{1} + \sum_j \mathbf{f}_j = \alpha \mathbf{1} + \mathbf{F} \mathbf{1}_p$$
where $\mathbf{F} = [\mathbf{f}_1 | \cdots | \mathbf{f}_p]$ is $n \times p$.
The normal equations for the smoothed solution can be written:
$$\mathbf{f}_j = \mathbf{S}_j \left( \mathbf{y} - \alpha \mathbf{1} - \sum_{k \neq j} \mathbf{f}_k \right), \quad j = 1, \ldots, p$$
This is a system of $p$ coupled equations. Backfitting solves this system by Gauss-Seidel iteration.
Block matrix formulation:
Define the block matrix: $$\mathbf{P} = \begin{pmatrix} \mathbf{S}_1 & \mathbf{S}_1 & \cdots & \mathbf{S}_1 \\ \mathbf{S}_2 & \mathbf{S}_2 & \cdots & \mathbf{S}_2 \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{S}_p & \mathbf{S}_p & \cdots & \mathbf{S}_p \end{pmatrix} - \begin{pmatrix} \mathbf{S}_1 & 0 & \cdots & 0 \\ 0 & \mathbf{S}_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \mathbf{S}_p \end{pmatrix}$$
The fixed point equation is $(\mathbf{I} + \mathbf{P}) \mathbf{f} = \mathbf{S} (\mathbf{y} - \alpha \mathbf{1})$ where $\mathbf{S}$ is block-diagonal with blocks $\mathbf{S}_j$.
Backfitting converges if the spectral radius of the iteration matrix is less than 1. For symmetric smoothers with eigenvalues in $[0, 1]$, this is guaranteed.
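A quick numerical check makes this tangible for $p = 2$: each backfitting cycle multiplies the error in $\hat{\mathbf{f}}_1$ by $\mathbf{S}_1 \mathbf{S}_2$, so the spectral radius of that product governs the convergence rate. The sketch below is a toy under simplifying assumptions (ridge-penalized, centered polynomial hat matrices stand in for symmetric shrinking smoothers); it shows the radius creeping toward 1 as the feature correlation grows, matching the table above.

```python
import numpy as np

def hat_matrix(x, degree=5, lam=1.0):
    """Symmetric shrinking smoother: ridge hat matrix of a centered polynomial basis."""
    B = np.vander(x, degree + 1, increasing=True)[:, 1:]  # drop the constant column
    B = B - B.mean(axis=0)                                # centering constraint
    return B @ np.linalg.solve(B.T @ B + lam * np.eye(degree), B.T)

rng = np.random.default_rng(1)
n = 200
for rho in [0.0, 0.5, 0.9, 0.99]:
    x1 = rng.normal(size=n)
    x2 = rho * x1 + np.sqrt(1 - rho ** 2) * rng.normal(size=n)  # corr(x1, x2) ~ rho
    S1, S2 = hat_matrix(x1), hat_matrix(x2)
    # For p = 2 the backfitting error in f_1 is multiplied by S1 S2 each cycle,
    # so a spectral radius near 1 means very slow convergence (concurvity).
    radius = np.max(np.abs(np.linalg.eigvals(S1 @ S2)))
    print(f"rho = {rho:4.2f}   spectral radius of S1 S2 = {radius:.3f}")
```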
Backfitting is closely related to coordinate descent optimization. Each update minimizes a penalized least squares criterion over function $f_j$ while holding others fixed. The iterations converge to the joint optimum.
An alternative and more general approach is to formulate GAM fitting as penalized likelihood optimization.
The penalized least squares criterion:
$$Q(\alpha, f_1, \ldots, f_p) = \sum_{i=1}^n \left( y_i - \alpha - \sum_{j=1}^p f_j(x_{ij}) \right)^2 + \sum_{j=1}^p \lambda_j \int (f_j''(x))^2 dx$$
The first term measures fit; the penalty terms control smoothness of each component function.
With basis expansions:
Let $f_j(x) = \boldsymbol{\phi}_j(x)^\top \boldsymbol{\beta}_j$ where $\boldsymbol{\phi}_j$ are basis functions and $\boldsymbol{\beta}_j$ are coefficients. The penalties become quadratic in coefficients:
$$\int (f_j''(x))^2 dx = \boldsymbol{\beta}_j^\top \mathbf{S}_j \boldsymbol{\beta}_j$$
where $\mathbf{S}_j$ is the penalty matrix with $(\mathbf{S}_j)_{kl} = \int \phi_{jk}''(x)\, \phi_{jl}''(x)\, dx$.
Matrix formulation:
Let $\mathbf{X}_j$ be the $n \times K_j$ design matrix with $(\mathbf{X}_j)_{ik} = \phi_{jk}(x_{ij})$. Form the combined design matrix:
$$\mathbf{X} = [\mathbf{1} | \mathbf{X}_1 | \cdots | \mathbf{X}_p]$$
and parameter vector $\boldsymbol{\theta} = (\alpha, \boldsymbol{\beta}_1^\top, \ldots, \boldsymbol{\beta}_p^\top)^\top$.
The penalized criterion is:
$$Q(\boldsymbol{\theta}) = \|\mathbf{y} - \mathbf{X} \boldsymbol{\theta}\|^2 + \boldsymbol{\theta}^\top \mathbf{\Lambda} \boldsymbol{\theta}$$
where $\mathbf{\Lambda} = \text{blockdiag}(0, \lambda_1 \mathbf{S}_1, \ldots, \lambda_p \mathbf{S}_p)$.
Solution:
$$\hat{\boldsymbol{\theta}} = (\mathbf{X}^\top \mathbf{X} + \mathbf{\Lambda})^{-1} \mathbf{X}^\top \mathbf{y}$$
This is a direct solution (no iteration needed!) once the smoothing parameters $\lambda_j$ are fixed.
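Below is a sketch of this direct solve for a two-feature additive model. It assumes cubic B-spline bases, second-order difference penalties as a stand-in for the exact curvature penalty (the P-spline device), hand-picked $\lambda_j$, and a tiny ridge term added purely to keep the centered bases identifiable; none of these choices come from a specific GAM library.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(x, K=12, deg=3):
    """Evaluate K cubic B-spline basis functions at the points in x."""
    lo, hi = x.min(), x.max()
    t = np.concatenate([[lo] * (deg + 1),
                        np.linspace(lo, hi, K - deg + 1)[1:-1],
                        [hi] * (deg + 1)])
    return np.column_stack([BSpline(t, np.eye(K)[i], deg, extrapolate=True)(x)
                            for i in range(K)])

def fit_additive(X, y, lams, K=12):
    n, p = X.shape
    cols, pens = [np.ones((n, 1))], [np.zeros((1, 1))]        # unpenalized intercept
    for j in range(p):
        B = bspline_basis(X[:, j], K)
        cols.append(B - B.mean(axis=0))                        # centering constraint
        D = np.diff(np.eye(K), n=2, axis=0)                    # 2nd-order differences
        pens.append(lams[j] * (D.T @ D) + 1e-6 * np.eye(K))    # tiny ridge: identifiability
    Xmat = np.hstack(cols)                                     # X = [1 | X_1 | ... | X_p]
    Lam = np.zeros((Xmat.shape[1],) * 2)                       # blockdiag(0, lam_1 S_1, ...)
    i = 0
    for P in pens:
        Lam[i:i + len(P), i:i + len(P)] = P
        i += len(P)
    theta = np.linalg.solve(Xmat.T @ Xmat + Lam, Xmat.T @ y)   # direct solution
    return theta, Xmat

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(400, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.2, 400)
theta, Xmat = fit_additive(X, y, lams=[1.0, 1.0])
fitted = Xmat @ theta
```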
The penalized approach solves for all coefficients simultaneously via a single matrix equation. Backfitting iterates but works with smaller problems. For moderate $p$ and $K_j$, the penalized approach is often faster and numerically more stable.
The smoothing parameters $\lambda_1, \ldots, \lambda_p$ (one per component function) critically affect model performance. Too small → overfitting; too large → underfitting. How do we choose them?
Generalized Cross-Validation (GCV):
The GCV criterion estimates prediction error without holding out data:
$$\text{GCV}(\boldsymbol{\lambda}) = \frac{n \,\|\mathbf{y} - \hat{\mathbf{y}}\|^2}{(n - \text{tr}(\mathbf{H}))^2}$$
where $\hat{\mathbf{y}} = \mathbf{H} \mathbf{y}$ is the vector of fitted values and $\mathbf{H} = \mathbf{X}(\mathbf{X}^\top \mathbf{X} + \mathbf{\Lambda})^{-1} \mathbf{X}^\top$ is the hat (influence) matrix.
The trace $\text{tr}(\mathbf{H})$ is the effective degrees of freedom—how many 'free parameters' the model effectively uses.
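A small sketch of GCV in action, assuming a single smooth term, a cubic B-spline basis with a difference penalty, and a coarse grid of $\lambda$ values; real software optimizes the criterion rather than scanning a grid.

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 200)

K, deg = 20, 3
t = np.concatenate([[0.0] * (deg + 1), np.linspace(0, 1, K - deg + 1)[1:-1], [1.0] * (deg + 1)])
B = np.column_stack([BSpline(t, np.eye(K)[i], deg, extrapolate=True)(x) for i in range(K)])
D = np.diff(np.eye(K), n=2, axis=0)
P = D.T @ D                                              # difference penalty matrix

best = None
for lam in 10.0 ** np.arange(-4, 5):
    H = B @ np.linalg.solve(B.T @ B + lam * P, B.T)      # hat matrix for this lambda
    yhat = H @ y
    edf = np.trace(H)                                    # effective degrees of freedom
    gcv = len(y) * np.sum((y - yhat) ** 2) / (len(y) - edf) ** 2
    if best is None or gcv < best[0]:
        best = (gcv, lam, edf)
print(f"GCV-selected lambda = {best[1]:g}, EDF = {best[2]:.1f}")
```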
Restricted Maximum Likelihood (REML):
A Bayesian perspective views the penalties as priors. The smoothing parameters are hyperparameters of these priors. REML estimates them by maximizing a marginal likelihood:
$$\log L_{\text{REML}}(\boldsymbol{\lambda}) = -\frac{1}{2} \left[ n \log(\hat{\sigma}^2) + \log |\mathbf{X}^\top \mathbf{X} + \mathbf{\Lambda}| - \log |\mathbf{\Lambda}^+| + \frac{\|\mathbf{y} - \hat{\mathbf{y}}\|^2}{\hat{\sigma}^2} \right]$$
where $\mathbf{\Lambda}^+$ is the generalized inverse of $\mathbf{\Lambda}$ and $\hat{\sigma}^2$ is the estimated residual variance.
REML advantages: it is less prone to undersmoothing than GCV, its optimization criterion tends to have fewer local optima, and the resulting smoothness estimates are more stable across replications.
| Method | Criterion | Properties |
|---|---|---|
| GCV | Prediction error estimate | Fast, can undersmooth |
| REML | Marginal likelihood | Stable, principled, slightly slower |
| AIC | $n \log(\text{RSS}) + 2 \cdot \text{EDF}$ | Prediction-focused, tends to overfit |
| CV (explicit) | Hold-out error | Reliable, computationally expensive |
| BIC | $n \log(\text{RSS}) + \log(n) \cdot \text{EDF}$ | Model selection focus, may underfit |
REML has become the default in modern GAM software (e.g., mgcv). It produces reliable smoothness estimates across a wide range of problems. When in doubt, use REML.
For non-Gaussian responses (binomial, Poisson, etc.), we need Generalized Additive Models (true GAMs), which combine additive structure with the GLM framework. The fitting algorithm is Penalized Iteratively Reweighted Least Squares (PIRLS).
The GAM with non-Gaussian response:
$$g(\mu_i) = \alpha + \sum_{j=1}^p f_j(x_{ij})$$
where $g$ is a known link function, $\mu_i = E[y_i]$ is the conditional mean, and the response follows an exponential-family distribution.
Examples: a binomial response with logit link (logistic GAM), or Poisson counts with log link (log-linear GAM).
Algorithm: PIRLS
Initialize: $\hat{\mu}_i = y_i + 0.5$ (or similar)
Iterate until convergence:
Compute pseudo-data: $$z_i = \hat{\eta}_i + (y_i - \hat{\mu}_i)\, g'(\hat{\mu}_i)$$ where $\hat{\eta}_i = g(\hat{\mu}_i)$
Compute weights: $$w_i = \frac{1}{V(\hat{\mu}_i)\,(g'(\hat{\mu}_i))^2}$$ where $V(\mu)$ is the variance function
Solve the penalized weighted least squares problem: $$\min_{\alpha, f_1, \ldots, f_p} \sum_i w_i \left( z_i - \alpha - \sum_j f_j(x_{ij}) \right)^2 + \sum_j \lambda_j \int (f_j'')^2 dx$$
Update: $\hat{\eta}_i = \hat{\alpha} + \sum_j \hat{f}_j(x_{ij})$, $\hat{\mu}_i = g^{-1}(\hat{\eta}_i)$
Repeat until convergence of $\hat{\eta}$.
PIRLS is the natural extension of Fisher scoring (used for GLMs) to the additive model setting. Each iteration constructs a weighted least squares 'working model' that, when solved, improves the likelihood. Inner iterations may use backfitting or direct matrix solution.
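Here is a compact PIRLS sketch for a Poisson response with log link. It assumes a single smooth term (the intercept is absorbed by the basis, whose constant direction is unpenalized), a B-spline basis with a difference penalty, and a fixed $\lambda$; in practice the smoothing parameter would be selected alongside these iterations.

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 1, 300))
y = rng.poisson(np.exp(1.0 + np.sin(2 * np.pi * x)))     # Poisson counts

K, deg, lam = 15, 3, 1.0
t = np.concatenate([[0.0] * (deg + 1), np.linspace(0, 1, K - deg + 1)[1:-1], [1.0] * (deg + 1)])
B = np.column_stack([BSpline(t, np.eye(K)[i], deg, extrapolate=True)(x) for i in range(K)])
D = np.diff(np.eye(K), n=2, axis=0)
P = lam * (D.T @ D)

mu = y + 0.5                                             # initialization
eta = np.log(mu)
for it in range(100):
    # Working response and weights for the log link: g'(mu) = 1/mu, V(mu) = mu.
    z = eta + (y - mu) / mu                              # z_i = eta_i + (y_i - mu_i) g'(mu_i)
    w = mu                                               # w_i = 1 / (V(mu_i) g'(mu_i)^2)
    BtW = B.T * w                                        # B^T W
    beta = np.linalg.solve(BtW @ B + P, BtW @ z)         # penalized weighted LS
    eta_new = B @ beta
    if np.max(np.abs(eta_new - eta)) < 1e-8:
        break
    eta, mu = eta_new, np.exp(eta_new)
print(f"PIRLS converged after {it + 1} iterations")
```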
Understanding computational complexity is crucial for scaling GAMs to large datasets.
Direct penalized fitting:
With $n$ observations, $p$ features, and $K$ basis functions per feature, the dominant costs are forming $\mathbf{X}^\top \mathbf{X}$ (about $np^2K^2$ operations) and solving the $(1 + pK)$-dimensional penalized system (about $p^3K^3$ operations).
For typical values ($n = 10,000$, $p = 10$, $K = 20$), this is about $10^8$ operations—fast on modern hardware.
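A quick back-of-the-envelope check of that figure, using plain arithmetic on the leading terms from the complexity table that follows:

```python
n, p, K = 10_000, 10, 20
forming_normal_equations = n * (p * K) ** 2   # ~ n p^2 K^2 multiply-adds
solving_system = (p * K) ** 3                 # ~ p^3 K^3 for the (pK x pK) solve
print(f"{forming_normal_equations + solving_system:.1e} operations")   # ~4.1e+08
```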
| Operation | Complexity | Dominant Term |
|---|---|---|
| Design matrix formation | $O(npK)$ | $n$ or $pK$ |
| Penalized LS solve | $O(np^2K^2 + p^3K^3)$ | $n$ (large data) or $pK$ (many features) |
| GCV evaluation | $O(pK)$ per $\lambda$ value | $pK$ |
| REML optimization | $O(p \cdot \text{solve})$ | Multiple solves |
| Backfitting (per cycle) | $O(npK)$ | $n$ |
| Prediction (new point) | $O(pK)$ | $pK$ |
Strategies for large data:
Reduce basis dimension: Use fewer basis functions $K$ (10–40 usually suffices)
Subsampling: Fit on a random subsample, predict on full data
Sparse smoothers: Exploit banded structure of B-spline penalty matrices
Parallel backfitting: Update independent features concurrently
Approximate smoothing parameter selection: Use fewer iterations in REML optimization
For large datasets, mgcv provides bam() ('big additive models') which uses discretization and parallel processing to fit GAMs with millions of observations. The same model specification works, but the fitting is much faster.
GAM fitting involves nested optimization: an outer loop over smoothing parameters and an inner loop over regression coefficients.
The two-level structure:
Outer loop (smoothing parameter selection): choose $\boldsymbol{\lambda}$ to minimize GCV or maximize the REML criterion.
Inner loop (coefficient estimation): for the current $\boldsymbol{\lambda}$, solve the penalized (weighted) least squares problem for the coefficients, using PIRLS when the response is non-Gaussian.
Optimization algorithms for outer loop:
Newton-Raphson or Fisher scoring: uses first and second derivatives of the GCV or REML criterion with respect to $\log \lambda_j$; converges in few iterations, but each iteration requires derivative computations.
Gradient descent: uses first derivatives only; simpler per step, but typically needs more iterations.
Performance iteration (Wood, 2004): selects smoothing parameters for the working linear model inside each PIRLS step rather than nesting a full fit for every candidate $\boldsymbol{\lambda}$; fast in practice, though convergence is not guaranteed in difficult cases.
Practical note: Modern GAM software handles this automatically. The user specifies the model; the software determines smoothing parameters.
Smoothing parameters are positive and can span many orders of magnitude (e.g., $10^{-4}$ to $10^4$). Optimizing $\log \lambda_j$ instead of $\lambda_j$ improves numerical stability and convergence, as the log-space is more 'linear' for this problem.
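As a small illustration of optimizing on the log scale (same toy single-smooth setup as the GCV sketch earlier; Nelder-Mead over $\log \lambda$ stands in for the Newton-type optimizers real software uses):

```python
import numpy as np
from scipy.interpolate import BSpline
from scipy.optimize import minimize

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 200)

K, deg = 20, 3
t = np.concatenate([[0.0] * (deg + 1), np.linspace(0, 1, K - deg + 1)[1:-1], [1.0] * (deg + 1)])
B = np.column_stack([BSpline(t, np.eye(K)[i], deg, extrapolate=True)(x) for i in range(K)])
D = np.diff(np.eye(K), n=2, axis=0)
P = D.T @ D

def gcv(log_lam):
    lam = np.exp(log_lam[0])                             # optimize on the log scale
    H = B @ np.linalg.solve(B.T @ B + lam * P, B.T)
    resid = y - H @ y
    return len(y) * (resid @ resid) / (len(y) - np.trace(H)) ** 2

res = minimize(gcv, x0=[0.0], method="Nelder-Mead")      # x0 corresponds to lambda = 1
print(f"selected lambda = {np.exp(res.x[0]):.4g}")
```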
Several algorithmic innovations have improved GAM fitting beyond the classical methods:
Discretization (for large n):
Instead of using all $n$ unique covariate values, discretize each feature into $M \ll n$ bins. This reduces effective sample size for numerical operations while preserving approximation quality.
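A toy illustration of the binning idea, assuming equal-width bins and a simple weighted polynomial ridge smoother on the bin means; real implementations such as mgcv's bam() use far more careful discretization schemes.

```python
import numpy as np

rng = np.random.default_rng(6)
n, M = 100_000, 200
x = rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.5, n)

# Assign each observation to a bin and accumulate sufficient statistics.
bins = np.minimum((x * M).astype(int), M - 1)
counts = np.bincount(bins, minlength=M)                  # observations per bin (weights)
ybar = np.bincount(bins, weights=y, minlength=M) / np.maximum(counts, 1)
xmid = (np.arange(M) + 0.5) / M                          # bin midpoints

# Weighted polynomial ridge fit on the M bin means (stand-in for a weighted spline fit);
# count-weighting makes it approximate a fit to all n original points.
Bdesign = np.vander(xmid, 8, increasing=True)
BtW = Bdesign.T * counts                                 # B^T W with W = diag(counts)
coef = np.linalg.solve(BtW @ Bdesign + 1e-3 * np.eye(8), BtW @ ybar)
fitted_at_bins = Bdesign @ coef
```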
Rank reduction:
Full thin plate splines with $n$ basis functions are expensive. Low-rank approximations (thin plate regression splines) keep only the $K$ most important eigen-components, dramatically reducing cost with minimal accuracy loss.
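The sketch below conveys the flavour of rank reduction under simplifying assumptions: it eigendecomposes a full $n \times n$ radial basis and keeps only the leading $K$ components as a small regression basis. True thin plate regression splines also handle the polynomial null space and carry a penalty; this shows only the core truncation idea.

```python
import numpy as np

rng = np.random.default_rng(7)
n, K = 500, 20
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(4 * np.pi * x) + rng.normal(0, 0.3, n)

# Full n x n radial basis of the kind a full 1-D thin-plate-type spline would use.
E = np.abs(x[:, None] - x[None, :]) ** 3
vals, vecs = np.linalg.eigh(E)                 # symmetric eigendecomposition
keep = np.argsort(np.abs(vals))[::-1][:K]      # K largest-magnitude eigen-components
U = vecs[:, keep]                              # reduced n x K basis

# Ordinary least squares on the reduced basis (a real fit would also carry a penalty).
coef, *_ = np.linalg.lstsq(U, y, rcond=None)
fitted = U @ coef
```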
With modern techniques, GAMs can fit datasets with millions of observations and dozens of features in seconds to minutes. The combination of smart basis choices, efficient linear algebra, and automatic smoothness selection makes GAMs practical for real-world applications.
We have explored the computational machinery that makes GAM fitting practical and efficient.
What's next:
With the fitting machinery understood, we turn to the payoff: interpretation. GAMs provide uniquely interpretable nonlinear models through partial effect plots and other visualizations. The next page explores how to extract insights from fitted GAMs.
You now understand the algorithms that power GAM estimation, from classical backfitting to modern penalized likelihood methods. These computational foundations enable the flexibility and interpretability that make GAMs so valuable in practice.