We've established that a GAM decomposes the regression function into smooth component functions: $f(\mathbf{x}) = \alpha + \sum_{j=1}^p f_j(x_j)$. We've explored how to represent each $f_j$ using splines and other basis expansions. But a fundamental question remains: how do we actually estimate all these functions from data?
This is a challenging computational problem. We have $p$ unknown functions that must be estimated simultaneously, each potentially depending on the others through shared residuals. The algorithms that solve this problem are elegant, efficient, and deeply connected to the mathematical structure of additive models.
By the end of this page, you will understand the backfitting algorithm and its theoretical foundations, penalized iteratively reweighted least squares (PIRLS), smoothing parameter selection via GCV and REML, and the computational complexity of GAM fitting.
Consider the additive model with observations $(x_{i1}, \ldots, x_{ip}, y_i)$ for $i = 1, \ldots, n$:
$$y_i = \alpha + \sum_{j=1}^p f_j(x_{ij}) + \epsilon_i$$
where $\epsilon_i$ are independent errors with $E[\epsilon_i] = 0$ and $\text{Var}(\epsilon_i) = \sigma^2$.
In matrix form:
Let $\mathbf{f}_j = (f_j(x_{1j}), \ldots, f_j(x_{nj}))^\top$ be the vector of evaluations of $f_j$ at the observed values. The model becomes:
$$\mathbf{y} = \alpha \mathbf{1} + \sum_{j=1}^p \mathbf{f}_j + \boldsymbol{\epsilon}$$
We want to estimate $\alpha$ and each function $f_j$ subject to the centering constraint $\mathbf{1}^\top \mathbf{f}_j = 0$.
The simultaneous estimation challenge:
If we knew all functions except $f_1$, we could estimate $f_1$ by regressing the partial residual $\mathbf{y} - \alpha \mathbf{1} - \sum_{j \neq 1} \mathbf{f}_j$ on $\mathbf{x}_1$ using any univariate smoother.
But we don't know the other functions! This creates a chicken-and-egg problem: each function's estimate depends on all others.
The solution: Iterate. Start with initial guesses for all functions, then cycle through, updating each function given current estimates of the others. This is the backfitting algorithm.
The backfitting algorithm transforms a $p$-function estimation problem into a sequence of single-function problems. Each update uses a univariate smoother—the workhorse of nonparametric regression—applied to partial residuals.
The backfitting algorithm (Hastie & Tibshirani, 1990) is the classical method for fitting additive models. It's an iterative procedure that cycles through features, updating each function estimate in turn.
Algorithm: Backfitting
Initialize: $\hat{\alpha} = \bar{y} = \frac{1}{n}\sum_i y_i$ and $\hat{f}_j \equiv 0$ for all $j$.
Iterate until convergence:
For $j = 1, 2, \ldots, p$:
Compute partial residuals: $r_{ij} = y_i - \hat{\alpha} - \sum_{k \neq j} \hat{f}_k(x_{ik})$
Update $\hat{f}_j$ by smoothing $\mathbf{r}_j = (r_{1j}, \ldots, r_{nj})$ against $\mathbf{x}_j = (x_{1j}, \ldots, x_{nj})$: $$\hat{f}_j \leftarrow S_j(\mathbf{r}_j)$$ where $S_j$ is a univariate smoothing operator
Center: $\hat{f}_j \leftarrow \hat{f}_j - \frac{1}{n} \sum_i \hat{f}_j(x_{ij})$
Repeat until the functions converge (change less than tolerance).
Think of backfitting as 'cleaning up' each function's estimate using current knowledge of the others. The partial residual $r_{ij}$ represents what's left of $y_i$ after removing our current estimate of all other effects. Smoothing this residual against $x_j$ isolates the effect of feature $j$.
The smoothing operator $S_j$:
Any univariate smoother can serve as $S_j$: cubic smoothing splines, regression splines, local regression (loess), or kernel smoothers.
The choice of smoother determines the flexibility of $\hat{f}_j$. Typically, all smoothers share a common type but may have feature-specific smoothing parameters $\lambda_j$.
Matrix notation:
Let $\mathbf{S}_j$ be the $n \times n$ smoothing matrix such that $S_j(\mathbf{y}) = \mathbf{S}_j \mathbf{y}$. Then the backfitting update is:
$$\hat{\mathbf{f}}_j \leftarrow \mathbf{S}_j \left( \mathbf{y} - \hat{\alpha} \mathbf{1} - \sum_{k \neq j} \hat{\mathbf{f}}_k \right)$$
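To make the cycle concrete, here is a minimal backfitting sketch in Python. It is illustrative rather than production code: a Nadaraya-Watson Gaussian kernel smoother stands in for $S_j$ (any univariate smoother could be substituted), and the function names, bandwidths, and synthetic data are assumptions of the sketch, not part of any particular library.

```python
import numpy as np

def kernel_smooth(x, r, bandwidth):
    """Nadaraya-Watson smoother: fitted values of r regressed on x."""
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / bandwidth) ** 2)
    return (w @ r) / w.sum(axis=1)

def backfit(X, y, bandwidths, tol=1e-6, max_cycles=200):
    n, p = X.shape
    alpha = y.mean()              # intercept estimate
    f = np.zeros((n, p))          # initial guesses: f_j = 0
    for _ in range(max_cycles):
        f_old = f.copy()
        for j in range(p):
            # Partial residuals: remove the intercept and all other current fits.
            r = y - alpha - f[:, np.arange(p) != j].sum(axis=1)
            fj = kernel_smooth(X[:, j], r, bandwidths[j])
            f[:, j] = fj - fj.mean()   # enforce the centering constraint
        if np.max(np.abs(f - f_old)) < tol:
            break
    return alpha, f

# Toy data: y = sin(x1) + 0.5 * x2^2 + noise.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.2, 300)
alpha_hat, f_hat = backfit(X, y, bandwidths=[0.3, 0.3])
```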
Does backfitting always converge? If so, to what? These questions have elegant answers rooted in functional analysis.
Theorem (Convergence of Backfitting):
If the smoothing operators $S_j$ are symmetric ($\mathbf{S}_j = \mathbf{S}_j^\top$) and have eigenvalues in $[0, 1]$, then backfitting converges to the unique solution of the normal equations for the additive model.
Condition for unique convergence:
The solution is unique if and only if the only additive function that is smoothed to zero by all smoothers is the zero function. This holds when features are not perfectly collinear.
Rate of convergence:
Backfitting is a Gauss-Seidel iteration on a block linear system. The rate of convergence depends on the correlation between smooth functions of different features.
The critical insight:
When features are independent, partial residuals computed for feature $j$ are orthogonal to the effects of other features. Each function can be estimated in isolation. But with correlated features, the smooth functions 'leak' into each other's residuals, requiring iterations to disentangle.
| Feature Correlation | Convergence Speed | Typical Iterations |
|---|---|---|
| Independent ($\rho = 0$) | Immediate | 1 cycle |
| Weak ($|\rho| < 0.3$) | Fast | 3–10 cycles |
| Moderate ($0.3 \leq |\rho| < 0.7$) | Moderate | 10–50 cycles |
| Strong ($|\rho| \geq 0.7$) | Slow | 50–200 cycles |
| Near-collinear ($|\rho| \to 1$) | May fail | Regularization needed |
When smooth functions of different features are nearly linearly dependent (concurvity), backfitting becomes unstable. This is the GAM analog of multicollinearity in linear regression. Regularization or feature selection may be needed.
Backfitting can be understood as a fixed point iteration on the normal equations of the additive model.
The additive model normal equations:
Stacking the smooth functions, define $\mathbf{f} = (\mathbf{f}_1^\top, \ldots, \mathbf{f}_p^\top)^\top$ (a vector of length $np$). The fitted values are:
$$\hat{\mathbf{y}} = \alpha \mathbf{1} + \sum_j \mathbf{f}_j = \alpha \mathbf{1} + \mathbf{F} \mathbf{1}_p$$
where $\mathbf{F} = [\mathbf{f}_1 | \cdots | \mathbf{f}_p]$ is $n \times p$.
The normal equations for the smoothed solution can be written:
$$\mathbf{f}_j = \mathbf{S}_j \left( \mathbf{y} - \alpha \mathbf{1} - \sum_{k \neq j} \mathbf{f}_k \right), \quad j = 1, \ldots, p$$
This is a system of $p$ coupled equations. Backfitting solves this system by Gauss-Seidel iteration.
Block matrix formulation:
Define the block matrix: $$\mathbf{P} = \begin{pmatrix} \mathbf{S}_1 & \mathbf{S}_1 & \cdots & \mathbf{S}_1 \\ \mathbf{S}_2 & \mathbf{S}_2 & \cdots & \mathbf{S}_2 \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{S}_p & \mathbf{S}_p & \cdots & \mathbf{S}_p \end{pmatrix} - \begin{pmatrix} \mathbf{S}_1 & 0 & \cdots & 0 \\ 0 & \mathbf{S}_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \mathbf{S}_p \end{pmatrix}$$
The fixed point equation is $(\mathbf{I} + \mathbf{P}) \mathbf{f} = \mathbf{S} (\mathbf{y} - \alpha \mathbf{1})$ where $\mathbf{S}$ is block-diagonal with blocks $\mathbf{S}_j$.
Backfitting converges if the spectral radius of the iteration matrix is less than 1. For symmetric smoothers with eigenvalues in $[0, 1]$, this is guaranteed.
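A quick numerical check makes this tangible for $p = 2$: each backfitting cycle multiplies the error in $\hat{\mathbf{f}}_1$ by $\mathbf{S}_1 \mathbf{S}_2$, so the spectral radius of that product governs the convergence rate. The sketch below is a toy under simplifying assumptions (ridge-penalized, centered polynomial hat matrices stand in for symmetric shrinking smoothers); it shows the radius creeping toward 1 as the feature correlation grows, matching the table above.

```python
import numpy as np

def hat_matrix(x, degree=5, lam=1.0):
    """Symmetric shrinking smoother: ridge hat matrix of a centered polynomial basis."""
    B = np.vander(x, degree + 1, increasing=True)[:, 1:]  # drop the constant column
    B = B - B.mean(axis=0)                                # centering constraint
    return B @ np.linalg.solve(B.T @ B + lam * np.eye(degree), B.T)

rng = np.random.default_rng(1)
n = 200
for rho in [0.0, 0.5, 0.9, 0.99]:
    x1 = rng.normal(size=n)
    x2 = rho * x1 + np.sqrt(1 - rho ** 2) * rng.normal(size=n)  # corr(x1, x2) ~ rho
    S1, S2 = hat_matrix(x1), hat_matrix(x2)
    # For p = 2 the backfitting error in f_1 is multiplied by S1 S2 each cycle,
    # so a spectral radius near 1 means very slow convergence (concurvity).
    radius = np.max(np.abs(np.linalg.eigvals(S1 @ S2)))
    print(f"rho = {rho:4.2f}   spectral radius of S1 S2 = {radius:.3f}")
```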
Backfitting is closely related to coordinate descent optimization. Each update minimizes a penalized least squares criterion over function $f_j$ while holding others fixed. The iterations converge to the joint optimum.
An alternative and more general approach is to formulate GAM fitting as penalized likelihood optimization.
The penalized least squares criterion:
$$Q(\alpha, f_1, \ldots, f_p) = \sum_{i=1}^n \left( y_i - \alpha - \sum_{j=1}^p f_j(x_{ij}) \right)^2 + \sum_{j=1}^p \lambda_j \int (f_j''(x))^2 dx$$
The first term measures fit; the penalty terms control smoothness of each component function.
With basis expansions:
Let $f_j(x) = \boldsymbol{\phi}_j(x)^\top \boldsymbol{\beta}_j$ where $\boldsymbol{\phi}_j$ are basis functions and $\boldsymbol{\beta}_j$ are coefficients. The penalties become quadratic in coefficients:
$$\int (f_j''(x))^2 dx = \boldsymbol{\beta}_j^\top \mathbf{S}_j \boldsymbol{\beta}_j$$
where $\mathbf{S}_j$ is the penalty matrix with $(\mathbf{S}_j)_{kl} = \int \phi_{jk}''(x)\, \phi_{jl}''(x)\, dx$.
Matrix formulation:
Let $\mathbf{X}_j$ be the $n \times K_j$ design matrix with $(\mathbf{X}_j)_{ik} = \phi_{jk}(x_{ij})$. Form the combined design matrix:
$$\mathbf{X} = [\mathbf{1} | \mathbf{X}_1 | \cdots | \mathbf{X}_p]$$
and parameter vector $\boldsymbol{\theta} = (\alpha, \boldsymbol{\beta}_1^\top, \ldots, \boldsymbol{\beta}_p^\top)^\top$.
The penalized criterion is:
$$Q(\boldsymbol{\theta}) = \|\mathbf{y} - \mathbf{X} \boldsymbol{\theta}\|^2 + \boldsymbol{\theta}^\top \mathbf{\Lambda} \boldsymbol{\theta}$$
where $\mathbf{\Lambda} = \text{blockdiag}(0, \lambda_1 \mathbf{S}_1, \ldots, \lambda_p \mathbf{S}_p)$.
Solution:
$$\hat{\boldsymbol{\theta}} = (\mathbf{X}^\top \mathbf{X} + \mathbf{\Lambda})^{-1} \mathbf{X}^\top \mathbf{y}$$
This is a direct solution (no iteration needed!) once the smoothing parameters $\lambda_j$ are fixed.
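Below is a sketch of this direct solve for a two-feature additive model. It assumes cubic B-spline bases, second-order difference penalties as a stand-in for the exact curvature penalty (the P-spline device), hand-picked $\lambda_j$, and a tiny ridge term added purely to keep the centered bases identifiable; none of these choices come from a specific GAM library.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(x, K=12, deg=3):
    """Evaluate K cubic B-spline basis functions at the points in x."""
    lo, hi = x.min(), x.max()
    t = np.concatenate([[lo] * (deg + 1),
                        np.linspace(lo, hi, K - deg + 1)[1:-1],
                        [hi] * (deg + 1)])
    return np.column_stack([BSpline(t, np.eye(K)[i], deg, extrapolate=True)(x)
                            for i in range(K)])

def fit_additive(X, y, lams, K=12):
    n, p = X.shape
    cols, pens = [np.ones((n, 1))], [np.zeros((1, 1))]        # unpenalized intercept
    for j in range(p):
        B = bspline_basis(X[:, j], K)
        cols.append(B - B.mean(axis=0))                        # centering constraint
        D = np.diff(np.eye(K), n=2, axis=0)                    # 2nd-order differences
        pens.append(lams[j] * (D.T @ D) + 1e-6 * np.eye(K))    # tiny ridge: identifiability
    Xmat = np.hstack(cols)                                     # X = [1 | X_1 | ... | X_p]
    Lam = np.zeros((Xmat.shape[1],) * 2)                       # blockdiag(0, lam_1 S_1, ...)
    i = 0
    for P in pens:
        Lam[i:i + len(P), i:i + len(P)] = P
        i += len(P)
    theta = np.linalg.solve(Xmat.T @ Xmat + Lam, Xmat.T @ y)   # direct solution
    return theta, Xmat

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(400, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.2, 400)
theta, Xmat = fit_additive(X, y, lams=[1.0, 1.0])
fitted = Xmat @ theta
```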
The penalized approach solves for all coefficients simultaneously via a single matrix equation. Backfitting iterates but works with smaller problems. For moderate $p$ and $K_j$, the penalized approach is often faster and numerically more stable.
The smoothing parameters $\lambda_1, \ldots, \lambda_p$ (one per component function) critically affect model performance. Too small → overfitting; too large → underfitting. How do we choose them?
Generalized Cross-Validation (GCV):
The GCV criterion estimates prediction error without holding out data:
$$\text{GCV}(\boldsymbol{\lambda}) = \frac{n \,\|\mathbf{y} - \hat{\mathbf{y}}\|^2}{(n - \text{tr}(\mathbf{H}))^2}$$
where $\hat{\mathbf{y}} = \mathbf{H} \mathbf{y}$ is the vector of fitted values and $\mathbf{H} = \mathbf{X}(\mathbf{X}^\top \mathbf{X} + \mathbf{\Lambda})^{-1} \mathbf{X}^\top$ is the hat (influence) matrix.
The trace $\text{tr}(\mathbf{H})$ is the effective degrees of freedom—how many 'free parameters' the model effectively uses.
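A small sketch of GCV in action, assuming a single smooth term, a cubic B-spline basis with a difference penalty, and a coarse grid of $\lambda$ values; real software optimizes the criterion rather than scanning a grid.

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 200)

K, deg = 20, 3
t = np.concatenate([[0.0] * (deg + 1), np.linspace(0, 1, K - deg + 1)[1:-1], [1.0] * (deg + 1)])
B = np.column_stack([BSpline(t, np.eye(K)[i], deg, extrapolate=True)(x) for i in range(K)])
D = np.diff(np.eye(K), n=2, axis=0)
P = D.T @ D                                              # difference penalty matrix

best = None
for lam in 10.0 ** np.arange(-4, 5):
    H = B @ np.linalg.solve(B.T @ B + lam * P, B.T)      # hat matrix for this lambda
    yhat = H @ y
    edf = np.trace(H)                                    # effective degrees of freedom
    gcv = len(y) * np.sum((y - yhat) ** 2) / (len(y) - edf) ** 2
    if best is None or gcv < best[0]:
        best = (gcv, lam, edf)
print(f"GCV-selected lambda = {best[1]:g}, EDF = {best[2]:.1f}")
```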
Restricted Maximum Likelihood (REML):
A Bayesian perspective views the penalties as priors. The smoothing parameters are hyperparameters of these priors. REML estimates them by maximizing a marginal likelihood:
$$\log L_{\text{REML}}(\boldsymbol{\lambda}) = -\frac{1}{2} \left[ n \log(\hat{\sigma}^2) + \log |\mathbf{X}^\top \mathbf{X} + \mathbf{\Lambda}| - \log |\mathbf{\Lambda}^+| + \frac{\|\mathbf{y} - \hat{\mathbf{y}}\|^2}{\hat{\sigma}^2} \right]$$
where $\mathbf{\Lambda}^+$ is the generalized inverse of $\mathbf{\Lambda}$ and $\hat{\sigma}^2$ is the estimated residual variance.
REML advantages: it is less prone to undersmoothing than GCV, its optimization criterion tends to have fewer local optima, and the resulting smoothness estimates are more stable across replications.
| Method | Criterion | Properties |
|---|---|---|
| GCV | Prediction error estimate | Fast, can undersmooth |
| REML | Marginal likelihood | Stable, principled, slightly slower |
| AIC | $n \log(\text{RSS}) + 2 \cdot \text{EDF}$ | Prediction-focused, tends to overfit |
| CV (explicit) | Hold-out error | Reliable, computationally expensive |
| BIC | $n \log(\text{RSS}) + \log(n) \cdot \text{EDF}$ | Model selection focus, may underfit |
REML has become the default in modern GAM software (e.g., mgcv). It produces reliable smoothness estimates across a wide range of problems. When in doubt, use REML.
For non-Gaussian responses (binomial, Poisson, etc.), we need Generalized Additive Models (true GAMs), which combine additive structure with the GLM framework. The fitting algorithm is Penalized Iteratively Reweighted Least Squares (PIRLS).
The GAM with non-Gaussian response:
$$g(\mu_i) = \alpha + \sum_{j=1}^p f_j(x_{ij})$$
where $g$ is a known link function, $\mu_i = E[y_i]$ is the conditional mean, and the response follows an exponential-family distribution.
Examples: a binomial response with logit link (logistic GAM), or Poisson counts with log link (log-linear GAM).
Algorithm: PIRLS
Initialize: $\hat{\mu}_i = y_i + 0.5$ (or similar)
Iterate until convergence:
Compute pseudo-data: $$z_i = \hat{\eta}_i + (y_i - \hat{\mu}_i)\, g'(\hat{\mu}_i)$$ where $\hat{\eta}_i = g(\hat{\mu}_i)$
Compute weights: $$w_i = \frac{1}{V(\hat{\mu}_i)\,(g'(\hat{\mu}_i))^2}$$ where $V(\mu)$ is the variance function
Solve the penalized weighted least squares problem: $$\min_{\alpha, f_1, \ldots, f_p} \sum_i w_i \left( z_i - \alpha - \sum_j f_j(x_{ij}) \right)^2 + \sum_j \lambda_j \int (f_j'')^2 dx$$
Update: $\hat{\eta}_i = \hat{\alpha} + \sum_j \hat{f}_j(x_{ij})$, $\hat{\mu}_i = g^{-1}(\hat{\eta}_i)$
Repeat until convergence of $\hat{\eta}$.
PIRLS is the natural extension of Fisher scoring (used for GLMs) to the additive model setting. Each iteration constructs a weighted least squares 'working model' that, when solved, improves the likelihood. Inner iterations may use backfitting or direct matrix solution.
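Here is a compact PIRLS sketch for a Poisson response with log link. It assumes a single smooth term (the intercept is absorbed by the basis, whose constant direction is unpenalized), a B-spline basis with a difference penalty, and a fixed $\lambda$; in practice the smoothing parameter would be selected alongside these iterations.

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 1, 300))
y = rng.poisson(np.exp(1.0 + np.sin(2 * np.pi * x)))     # Poisson counts

K, deg, lam = 15, 3, 1.0
t = np.concatenate([[0.0] * (deg + 1), np.linspace(0, 1, K - deg + 1)[1:-1], [1.0] * (deg + 1)])
B = np.column_stack([BSpline(t, np.eye(K)[i], deg, extrapolate=True)(x) for i in range(K)])
D = np.diff(np.eye(K), n=2, axis=0)
P = lam * (D.T @ D)

mu = y + 0.5                                             # initialization
eta = np.log(mu)
for it in range(100):
    # Working response and weights for the log link: g'(mu) = 1/mu, V(mu) = mu.
    z = eta + (y - mu) / mu                              # z_i = eta_i + (y_i - mu_i) g'(mu_i)
    w = mu                                               # w_i = 1 / (V(mu_i) g'(mu_i)^2)
    BtW = B.T * w                                        # B^T W
    beta = np.linalg.solve(BtW @ B + P, BtW @ z)         # penalized weighted LS
    eta_new = B @ beta
    if np.max(np.abs(eta_new - eta)) < 1e-8:
        break
    eta, mu = eta_new, np.exp(eta_new)
print(f"PIRLS converged after {it + 1} iterations")
```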
Understanding computational complexity is crucial for scaling GAMs to large datasets.
Direct penalized fitting:
With $n$ observations, $p$ features, and $K$ basis functions per feature, the dominant costs are forming $\mathbf{X}^\top \mathbf{X}$ (about $np^2K^2$ operations) and solving the $(1 + pK)$-dimensional penalized system (about $p^3K^3$ operations).
For typical values ($n = 10,000$, $p = 10$, $K = 20$), this is about $10^8$ operations—fast on modern hardware.
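A quick back-of-the-envelope check of that figure, using plain arithmetic on the leading terms from the complexity table that follows:

```python
n, p, K = 10_000, 10, 20
forming_normal_equations = n * (p * K) ** 2   # ~ n p^2 K^2 multiply-adds
solving_system = (p * K) ** 3                 # ~ p^3 K^3 for the (pK x pK) solve
print(f"{forming_normal_equations + solving_system:.1e} operations")   # ~4.1e+08
```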
| Operation | Complexity | Dominant Term |
|---|---|---|
| Design matrix formation | $O(npK)$ | $n$ or $pK$ |
| Penalized LS solve | $O(np^2K^2 + p^3K^3)$ | $n$ (large data) or $pK$ (many features) |
| GCV evaluation | $O(pK)$ per $\lambda$ value | $pK$ |
| REML optimization | $O(p \cdot \text{solve})$ | Multiple solves |
| Backfitting (per cycle) | $O(npK)$ | $n$ |
| Prediction (new point) | $O(pK)$ | $pK$ |
Strategies for large data:
Reduce basis dimension: Use fewer basis functions $K$ (10–40 usually suffices)
Subsampling: Fit on a random subsample, predict on full data
Sparse smoothers: Exploit banded structure of B-spline penalty matrices
Parallel backfitting: Update independent features concurrently
Approximate smoothing parameter selection: Use fewer iterations in REML optimization
For large datasets, mgcv provides bam() ('big additive models') which uses discretization and parallel processing to fit GAMs with millions of observations. The same model specification works, but the fitting is much faster.
GAM fitting involves nested optimization: an outer loop over smoothing parameters and an inner loop over regression coefficients.
The two-level structure:
Outer loop (smoothing parameter selection): choose $\boldsymbol{\lambda}$ to minimize GCV or maximize the REML criterion.
Inner loop (coefficient estimation): for the current $\boldsymbol{\lambda}$, solve the penalized (weighted) least squares problem for the coefficients, using PIRLS when the response is non-Gaussian.
Optimization algorithms for outer loop:
Newton-Raphson or Fisher scoring: uses first and second derivatives of the GCV or REML criterion with respect to $\log \lambda_j$; converges in few iterations, but each iteration requires derivative computations.
Gradient descent: uses first derivatives only; simpler per step, but typically needs more iterations.
Performance iteration (Wood, 2004): selects smoothing parameters for the working linear model inside each PIRLS step rather than nesting a full fit for every candidate $\boldsymbol{\lambda}$; fast in practice, though convergence is not guaranteed in difficult cases.
Practical note: Modern GAM software handles this automatically. The user specifies the model; the software determines smoothing parameters.
Smoothing parameters are positive and can span many orders of magnitude (e.g., $10^{-4}$ to $10^4$). Optimizing $\log \lambda_j$ instead of $\lambda_j$ improves numerical stability and convergence, as the log-space is more 'linear' for this problem.
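As a small illustration of optimizing on the log scale (same toy single-smooth setup as the GCV sketch earlier; Nelder-Mead over $\log \lambda$ stands in for the Newton-type optimizers real software uses):

```python
import numpy as np
from scipy.interpolate import BSpline
from scipy.optimize import minimize

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 200)

K, deg = 20, 3
t = np.concatenate([[0.0] * (deg + 1), np.linspace(0, 1, K - deg + 1)[1:-1], [1.0] * (deg + 1)])
B = np.column_stack([BSpline(t, np.eye(K)[i], deg, extrapolate=True)(x) for i in range(K)])
D = np.diff(np.eye(K), n=2, axis=0)
P = D.T @ D

def gcv(log_lam):
    lam = np.exp(log_lam[0])                             # optimize on the log scale
    H = B @ np.linalg.solve(B.T @ B + lam * P, B.T)
    resid = y - H @ y
    return len(y) * (resid @ resid) / (len(y) - np.trace(H)) ** 2

res = minimize(gcv, x0=[0.0], method="Nelder-Mead")      # x0 corresponds to lambda = 1
print(f"selected lambda = {np.exp(res.x[0]):.4g}")
```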
Several algorithmic innovations have improved GAM fitting beyond the classical methods:
Discretization (for large n):
Instead of using all $n$ unique covariate values, discretize each feature into $M \ll n$ bins. This reduces effective sample size for numerical operations while preserving approximation quality.
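A toy illustration of the binning idea, assuming equal-width bins and a simple weighted polynomial ridge smoother on the bin means; real implementations such as mgcv's bam() use far more careful discretization schemes.

```python
import numpy as np

rng = np.random.default_rng(6)
n, M = 100_000, 200
x = rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.5, n)

# Assign each observation to a bin and accumulate sufficient statistics.
bins = np.minimum((x * M).astype(int), M - 1)
counts = np.bincount(bins, minlength=M)                  # observations per bin (weights)
ybar = np.bincount(bins, weights=y, minlength=M) / np.maximum(counts, 1)
xmid = (np.arange(M) + 0.5) / M                          # bin midpoints

# Weighted polynomial ridge fit on the M bin means (stand-in for a weighted spline fit);
# count-weighting makes it approximate a fit to all n original points.
Bdesign = np.vander(xmid, 8, increasing=True)
BtW = Bdesign.T * counts                                 # B^T W with W = diag(counts)
coef = np.linalg.solve(BtW @ Bdesign + 1e-3 * np.eye(8), BtW @ ybar)
fitted_at_bins = Bdesign @ coef
```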
Rank reduction:
Full thin plate splines with $n$ basis functions are expensive. Low-rank approximations (thin plate regression splines) keep only the $K$ most important eigen-components, dramatically reducing cost with minimal accuracy loss.
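The sketch below conveys the flavour of rank reduction under simplifying assumptions: it eigendecomposes a full $n \times n$ radial basis and keeps only the leading $K$ components as a small regression basis. True thin plate regression splines also handle the polynomial null space and carry a penalty; this shows only the core truncation idea.

```python
import numpy as np

rng = np.random.default_rng(7)
n, K = 500, 20
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(4 * np.pi * x) + rng.normal(0, 0.3, n)

# Full n x n radial basis of the kind a full 1-D thin-plate-type spline would use.
E = np.abs(x[:, None] - x[None, :]) ** 3
vals, vecs = np.linalg.eigh(E)                 # symmetric eigendecomposition
keep = np.argsort(np.abs(vals))[::-1][:K]      # K largest-magnitude eigen-components
U = vecs[:, keep]                              # reduced n x K basis

# Ordinary least squares on the reduced basis (a real fit would also carry a penalty).
coef, *_ = np.linalg.lstsq(U, y, rcond=None)
fitted = U @ coef
```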
With modern techniques, GAMs can fit datasets with millions of observations and dozens of features in seconds to minutes. The combination of smart basis choices, efficient linear algebra, and automatic smoothness selection makes GAMs practical for real-world applications.
We have explored the computational machinery that makes GAM fitting practical and efficient.
What's next:
With the fitting machinery understood, we turn to the payoff: interpretation. GAMs provide uniquely interpretable nonlinear models through partial effect plots and other visualizations. The next page explores how to extract insights from fitted GAMs.
You now understand the algorithms that power GAM estimation, from classical backfitting to modern penalized likelihood methods. These computational foundations enable the flexibility and interpretability that make GAMs so valuable in practice.